Work is the refuge of people who have nothing better to do. : Oscar Wilde

Thursday, June 11, 2009

Upward Trending Job: Data Scientist

You can find riffs on this one in a few places on the net; “Rise of the Data Scientist” forms a pretty good introduction to the topic.

Basic ideas?

We are now being inundated with data. Much of it might be useful. Lots of it is not. Much of it is like the iron in iron ore. It needs extracting. Much of this data is buried in web pages or other places which were never intended to be "mined" for information.

To be a data scientist you need to be part statistician, able to make sense of data in the traditional ways based on probabilistic models. You need to be good at extracting data from a host of different kinds of sources, including databases, web pages, maps and other graphical entities, and natural language in recorded, visual or spoken form. You need to be able to present the meaning of what you have extracted visually, for best comprehension by other members of your species.
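To make the first two of those skills a little more concrete, here is a minimal sketch in Python that pulls numbers out of a fragment of HTML and then summarises them. Everything in it is an assumption made for illustration: the table, the city names and the rainfall figures are invented, and the names `CellCollector` and `extract_rainfall` are hypothetical. It uses only the standard library; a real page would need fetching and would almost certainly be messier.

```python
from html.parser import HTMLParser
from statistics import mean, median

# Invented sample data standing in for a page that was never meant to be mined.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Rainfall (mm)</th></tr>
  <tr><td>Toronto</td><td>71.5</td></tr>
  <tr><td>Ottawa</td><td>84.0</td></tr>
  <tr><td>Halifax</td><td>n/a</td></tr>
  <tr><td>Calgary</td><td>79.8</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")  # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rainfall(html):
    """Return {city: rainfall} for rows whose value parses as a number."""
    parser = CellCollector()
    parser.feed(html)
    readings = {}
    for row in parser.rows:
        if len(row) != 2:
            continue  # header row (no <td> cells) or a malformed row
        city, value = row
        try:
            readings[city] = float(value)
        except ValueError:
            pass  # drop unusable entries such as "n/a"
    return readings


readings = extract_rainfall(SAMPLE_HTML)
values = list(readings.values())
print(readings)
print("mean:", round(mean(values), 2), "median:", median(values))
```

The extraction step throws away the "n/a" entry before any statistics are computed, which is the ore-refining part of the job; the presentation step is left out here entirely.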

Of course, no one can master all of these techniques, even now. Data scientists will specialise in some of them and be able to communicate with other specialists, as well as with the people who need the visual information that they are creating.

For a sample of one visualisation technique, here is what Jeff Clark of Neoformix in Toronto is doing with StreamGraph. Modern browsers (which leaves out Internet Explorer) also make it possible to produce graphs directly in the browser window; see the InfoVis Toolkit, for example.


Anonymous said...

Great post on extracting data, with some well thought out points. For simple stuff I use Python to get or simplify data. Data extraction can be a time-consuming process, but for other projects that include documents, files, or the web I tried "extracting data", which worked great; they build quick custom screen scrapers, data extraction and data parsing programs.

Bill Bell said...

Thanks for your comment.

Are you using BeautifulSoup to extend your reach with Python?

The other development I've noticed since I wrote this article is the crowd-sourcing of data extraction. The idea is to get many people working on documents across the 'net, filling in forms to provide the requisite information. I'm sure we'll see more and more approaches to the problem of taming the tsunami of data.