Transcript: Datapalooza 2012

How much data is currently being stored about YOUR children? Do you have any idea who is storing that data, what they know, or how that information can be used?

This is a work in progress – as a slightly improved version of the transcript of the Knewton data video below:

So the human race is about to enter a totally data mined existence and it’s going to be really fun to watch. It’s going to be one of those things where our grandkids are going to tell our kids, “I can’t believe you grew up in a world like that” just the way our kids complained that we went to record stores.

You know, when Tom Cruise walks through the mall in Minority Report and the ad beams right to his eyes and says “Hey Mr. Cruise, you should go on that Caribbean vacation you’ve been thinking about.” I know some entrepreneurs who work on that technology right now. And, um, I’m still waiting for the day when my refrigerator’s going to know when I’m running out of milk and it’s ordered for me automatically on Fast Track. I think that day’s coming in a few years; it’s not far off.

The world in 30 years is going to be unrecognizably data mined. So what does that mean for education?

Well, education happens to be, today, the world’s most data minable industry by far, and it’s not even close. So maybe, one day, healthcare will be up there – when they have little nanobots that are in your bloodstream that are doing real time analysis, but until then it’s not close. Education beats everything else, hands down.

So let’s look at other big data industries:

The really big data industries in the world right now are, not surprisingly, on the internet because that’s where it’s easy to grab the data and that’s also where there’s a congregation of talent that understands data.

So, um, well, let’s just look at it by the numbers – because the name of the game is “Data Per User.”

Okay, so, one of the things that fakes us out about data and education is: education, because it’s so big (it’s like the fourth biggest industry in the world), produces incredible quantities of data.

But data that just produces one or two points per user, per day, is not really all that valuable to an individual user. It might be valuable to, like, a school district administrator, but maybe not even then. So let’s just compare. Netflix and Amazon get in the ones of data points per user per day. Google and Facebook get in the tens of data points per user per day. So you do 10 minutes of messing around in Google, you produce about a dozen data points for Google. Okay, great.

So Knewton today gets five to ten million actionable data points, per student, per day. Now we do that because we get people (if you can believe it) to tag every single sentence of their content (we have a large publishing partnership with Pearson, and they tag all their content) and we’re an open standard so anyone can tag to us.

If you tag all your content and you do it down to the atomic concept level, down to the sentence, down to the clause, you unlock an incredible amount of trapped hidden data.

Why do you do that?

Well if you use programmatic taxonomy models and item response theory and that thing at the bottom (we haven’t given that a name yet), what you figure out is: everything in education is correlated to everything else down to the concept.



Now this is where education’s different from search or social networking. If someone tagged every single line, every single sentence of all the world’s web pages for Google, or every single line of dialogue from Netflix, which no one’s done, but even if they had, there aren’t really a whole lot of interesting correlations there.

Everything in education is correlated to everything else. Every single concept is correlated in a predictable way to everything else using psychometrics, right?

So if you do 10 minutes of work in Google you produce a dozen data points for Google. Because everything that we do is tagged at such a granular level, if you do 10 minutes of work for Knewton you cascade out lots and lots of other data, and here’s why. When you took the SAT there might be 40 different concepts about equilateral triangles that are tested on all the SATs ever given in any one year.

But you didn’t get all 40 questions; you got two questions on equilateral triangles, because, they figure, if you’re in the top 14th percentile on those two questions (13th percentile on this one and 15th percentile on that one), there’s a 98% chance that you’re in the top 14th percentile on every concept in equilateral triangles. And there’s a 96% chance that you’re in the top 15th percentile on all triangle concepts: 3-4-5, 30-60-90, isosceles, etc., etc.

You did a little bit of work for Knewton and we used just the established science of psychometrics to cascade out hundreds of other data points.
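[Editor’s sketch, not from the talk: one way to picture the “cascade” is the simplest item response theory model, the one-parameter (Rasch) model. The code below estimates a student’s ability from two answered questions and then predicts performance on concepts the student never saw. The difficulty values, the concept names and the grid-search estimator are all illustrative assumptions; nothing here is confirmed to be Knewton’s method.]

```python
import math

def p_correct(theta, difficulty):
    """Rasch (1PL) model: chance that a student with ability theta
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_ability(responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta from a list of
    (difficulty, answered_correctly) pairs."""
    best_theta, best_ll = lo, -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        ll = 0.0
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            ll += math.log(p if correct else 1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical data: two answered questions, one right and one wrong.
# (With only correct answers the plain MLE runs to the edge of the grid,
# so a real system would add a prior or use more responses.)
observed = [(0.0, True), (1.0, False)]
theta_hat = estimate_ability(observed)

# Hypothetical difficulties for concepts the student was never asked about.
unseen = {"isosceles": -0.5, "30-60-90": 1.5}
predicted = {name: p_correct(theta_hat, b) for name, b in unseen.items()}
print(round(theta_hat, 2), predicted)
```

[Two observed answers yield a prediction for every other tagged item, which is the sense in which a small amount of work produces many derived data points.]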

So we can produce incredible quantities of data per user, per day. It’s really, really hard to get that, okay? But, if you can get all that tagging done…

({refers to slide} …and that’s one of our tags. That’s a small part of our overall taxonomy. That’s just part of one course and we have dozens of taxonomies), then you can do this.

Granular understanding


What you can do with the data, if you actually do all that work, is you can figure out exactly what students know and how well they know it. You can figure it out down to the percentile versus the rest of the population.

So, Knewton students today: we have about 180,000 right now, by December it’ll be 650,000, early next year it’ll be in the millions and the next year it’ll be closer to 10 million, and that’s just through our Pearson partnership.

So for every one of the students, we can figure out, within a few hours, what they’re strong at and what they’re weak at, at the beginning of the course. So we can produce a unique syllabus for each student each day, literally unique.

There’s not enough time in the universe for any two students to have the same syllabus on any one day, that’s how many there are. So it’s optimized for each kid down to the atomic concept. And then we can figure out things like: well, here’s your homework tomorrow night, you’re going to struggle with that homework or you’re going to fail it, because there are concepts in that homework for which we know you haven’t mastered the previous concepts that build up to them. Or there’s concepts in that homework that [inaudible 04:53] very highly concepts always have trouble with.

So we know you’re going to fail, we know it in advance and we can prevent it in advance. We go grab some content from somewhere else in the portfolio and we’re going to seamlessly blend that into your homework tonight. So every kid gets a perfectly optimized textbook, except it’s also video and other rich media dynamically generated in real time. And it also uses the combined data power of the entire network. So here’s what I mean by that: like I said, next year we’ll have close to 10 million students, a few years from now we’ll have 100 million.

The 100 million and first shows up to learn something like rules of exponents or subject-verb agreement, whatever. We take the combined data power of all hundred million to figure out exactly how to teach every concept to each kid. So the 100 million and first shows up to learn the rules of exponents; great, let’s go find a group of people who are psychometrically equivalent to that kid. They learn the same ways, they have the same learning style, they know the same stuff, because Knewton can figure out things like: you learn math best in the morning between 8:40 and 9:13 am. You learn science best in 42-minute bite sizes; at the 44-minute mark you click right [inaudible 05:47], you start missing questions you would normally get right. You learn social studies best with video clips, or 22% video to 78% text, or whatever your optimal cocktail is. We can tell when we should return content to you for optimal retention.

We literally know everything about what you know and how you learn best, everything, because we have five orders of magnitude more data about you than Google has.
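[Editor’s sketch, not from the talk: a quick check of the “five orders of magnitude” figure using only the numbers quoted earlier (tens of data points per user per day for Google, five to ten million per student per day for Knewton). Reading “tens” as the range 10 to 99 is an assumption made for the arithmetic.]

```python
import math

# Figures as quoted in the talk (speaker's claims, not verified).
knewton_per_day = (5_000_000, 10_000_000)
# Assumption: "tens of data points per user per day" means 10 to 99.
google_per_day = (10, 99)

# Smallest and largest gap consistent with those ranges, in powers of ten.
low_orders = math.log10(knewton_per_day[0] / google_per_day[1])
high_orders = math.log10(knewton_per_day[1] / google_per_day[0])
print(f"{low_orders:.1f} to {high_orders:.1f} orders of magnitude")
```

[On those assumptions the gap falls between roughly 4.7 and 6 orders of magnitude, so “five” is consistent with the speaker’s own numbers.]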

We literally have more data about our students than any company has about anybody else about anything, and it’s not even close. That’s why we can do all that stuff, right?

So then what we can do is take that profile to the 100 million kids (next year it’ll be 10 million). We can go figure out, okay, who’s exactly like that kid? Whose learning styles up and down the line are just the same? Who knew the same stuff at the same level of mastery when they had [inaudible 06:24]? Great.

Statistically speaking it has to be the case that some 5% or 10%, through sheer bad luck, did the absolute wrong thing for themselves without knowing it. They did questions that were too hard, they got discouraged, they bounced. They accessed text when they should have gotten the video, whatever. It also has to be a fact of statistics that, through pure blind luck, some top 1% did the absolute perfect thing for themselves without realizing it.

And we go take the whole combined data power of that network of millions, soon to be tens of millions, eventually it’ll be hundreds of millions of people. And for every single concept that your child learns (2,000 concepts in a particular semester-long math course), for every single atomic concept, we take the combined data power of that vast network and use it to find the perfect plan forward for that kid for that concept. So that’s what we do right now.
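[Editor’s sketch, not from the talk: the idea of finding students who are “exactly like that kid” and copying what worked for them can be pictured as a nearest-neighbour lookup. The profile vectors, path names, scores and the choice of Euclidean distance are all hypothetical; the talk does not say how Knewton actually matches students.]

```python
import math

# Hypothetical past students: (profile vector, path taken, outcome score).
past_students = [
    ((0.90, 0.20, 0.50), "video-first", 0.91),
    ((0.80, 0.30, 0.50), "text-first", 0.62),
    ((0.10, 0.90, 0.40), "text-first", 0.88),
    ((0.85, 0.25, 0.45), "video-first", 0.87),
    ((0.20, 0.80, 0.30), "video-first", 0.55),
]

def recommend_path(profile, history, k=3):
    """Pick the path with the best average outcome among the k past
    students whose profiles are closest to the new student's profile."""
    nearest = sorted(history, key=lambda rec: math.dist(profile, rec[0]))[:k]
    scores = {}
    for _, path, score in nearest:
        scores.setdefault(path, []).append(score)
    return max(scores, key=lambda p: sum(scores[p]) / len(scores[p]))

print(recommend_path((0.88, 0.22, 0.50), past_students))
print(recommend_path((0.15, 0.85, 0.35), past_students))
```

[Two new students with different profiles get different recommendations from the same history, which is the per-student, per-concept behaviour the speaker describes.]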

Let me give you a couple of examples. This is one student. There’s a few hundred learning clusters there, there’s a few tens of thousands of atomic learning objects there. That’s one student’s path; this is a real student in a US college right now. And you’ll see that each student has a totally different path. Some students have short paths, some have long paths. In this particular course there were students who finished it in 14 days, there were students who finished it in two semesters.

This is a course at ASU. They had to change their semester structure to a modular semester structure because we were suddenly telling them things like: if you give this woman here the final right now she’ll get an A, and it’s only 14 days into the course. I promise you she’ll get an A. You can keep her in that seat if you want, and that’s what we’ve always done; now we don’t have to.

So let’s show you this. This is 150 students in one class, and they kind of all look like fleas, but each one is an individual learning path. Notice that some of them are going really fast, some of them are going really slow, and then they’ll all kind of speed up when the test comes. It’s kind of like organic, and so those different color coded things are like concept clusters. Like some test obviously just happened, that’s why they all started working.

And you can look at some of those students and think, boy, that poor schmuck is really in a lot of trouble because they’re going too slowly. So where we think we’re going with this: obviously it’s in market right now. We’re going to be in K-12 starting next year, and it’s an open platform; anyone can plug it in and use it by APIs. And where we think we’re going with the data side of it, which is the really fun stuff for today, is we think within a few years we’ll be able to start predicting grade performance.

So teachers grade persistently year in and year out; if that teacher grades consistently, we can match up the student profiles down to the atomic concept level versus grade performance. We can tell you you’re on track to get a B- in this course right now. Either that, or if your teacher is totally inconsistent we can’t tell you that, but that’s another problem.

If your teacher grades consistently we can tell you what your grade’s going to be based on what you know and how fast you’re learning it. But if you do another 30 minutes a day for three days a week you can get it up to an A-. We can tell you things like that.

We’re really excited to correlate with other people’s datasets by open API. Something we’ve talked about as kind of a joke, but it really should work, is like the food diary. You tell us what you had for breakfast every morning at the beginning of the semester; by the end of the semester we should be able to tell you what you had for breakfast, because you always do better on the days you have scrambled eggs or whatever. And more importantly, we should be able to tell you what you should have for breakfast.

So that’s the power of data: when you unlock millions of data points per user per day, you can accomplish things that people aren’t even conceiving of right now.

But that world is coming. We’re trying to bring it to you, and we’re going to be an open system to allow anyone to just plug that data in, take it out, and then plug it back in.

Thanks very much.