As programmers, we tend to treat data as generic stuff to feed into the algorithms and architectures we love. We don’t really pay attention to the data itself, especially when we have terabytes or petabytes of it.
Huge mistake. And we are trained to make it! It is why it takes a year for a new programmer to be productive at working on the Google ranking algorithm. It held back progress on genome sequencing algorithms. It has cost me more time than I’d care to imagine.
The good news is that you don’t have to look at all of your petabytes of data. Just eyeball a ten-record sample when you start, and repeat as you work with the data. Even then, eyeballing can be hard work, and done wrong, it can be worse than doing nothing. But done right, it can be fun, and the data will almost always surprise you. Better yet, you can use your favorite algorithms and architectures to build tools that make eyeballing your data easier and much more effective.
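As a rough illustration of pulling a small sample to eyeball, here is a minimal sketch that uses reservoir sampling to draw ten records uniformly at random from a stream too large to hold in memory. The abstract does not say how the sample should be drawn, so the technique, the function name `sample_records`, and its parameters are all assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_records(records: Iterable[T], k: int = 10,
                   seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    Hypothetical helper (not from the talk). It uses reservoir sampling,
    so the stream is read once and only k records are kept in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = record
    return reservoir
```

One possible use is to pass an open file (which yields lines) and print each sampled line with `repr`, so that stray whitespace, odd encodings, and empty fields stay visible. Drawing a fresh sample each time the pipeline changes fits the "early and often" advice better than rereading the first ten rows, which are often unrepresentative.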
The One Weird Trick for Analyzing Big Data … Eyeball it Early and Often!
John Lamping
Principal Scientist, Xerox PARC