Data Wrangling

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
Steve Lohr on NYTimes.com.

A few months ago, we spoke of the Sexiest Job of the 21st Century (i.e. Data Analysts) and the five high-level tasks it involves: discovery, wrangling, profiling, modeling, and reporting.

Anyone who has been involved with data migration or data analysis on OSS projects will confirm that as much time is spent cleansing and tweaking the data as is spent actually migrating or mining it. The advent of big data tools has made it easier for CSPs to aggregate a larger variety of data sources, many of them unstructured.

I think there is a huge opportunity for an off-shoot to OSS in the data preparation/wrangling space. These types of tools could be used for:

  • Identifying data integrity issues and helping to resolve them
  • Streamlining the cleansing stage of major data migration activities
  • Data set manipulation and alignment prior to mining
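
To make those three uses a little more concrete, here is a minimal Python sketch of the kind of check such a tool might run. Everything in it is an assumption for illustration only: the record layout (device_id, site, port_id), the function names, and the rules themselves are hypothetical and not taken from any real OSS product or data model.

```python
from collections import Counter


def normalise_id(raw):
    """Align identifiers before comparing them: trim whitespace, upper-case."""
    return raw.strip().upper()


def find_integrity_issues(devices, ports):
    """Return simple integrity findings for two hypothetical record sets.

    devices: list of dicts with 'device_id' and (optionally) 'site'
    ports:   list of dicts with 'port_id' and 'device_id'
    """
    # Duplicate keys: the same (normalised) device_id appearing more than once.
    counts = Counter(normalise_id(d["device_id"]) for d in devices)
    duplicates = sorted(key for key, n in counts.items() if n > 1)

    # Missing mandatory attribute: site is absent, None, or blank.
    missing_site = sorted(
        {
            normalise_id(d["device_id"])
            for d in devices
            if not (d.get("site") or "").strip()
        }
    )

    # Referential integrity: ports that point at a device we do not know about.
    known = set(counts)
    orphans = sorted(
        p["port_id"] for p in ports if normalise_id(p["device_id"]) not in known
    )

    return {
        "duplicate_device_ids": duplicates,
        "devices_missing_site": missing_site,
        "orphan_ports": orphans,
    }
```

Calling find_integrity_issues with a list of device records and a list of port records returns a dictionary of findings that an analyst could review before a migration or mining exercise. A real tool would need far richer rules than these three, but the shape of the problem is similar.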

CSPs won’t necessarily need these tools continuously, so a third-party fee-for-service offering could be quite attractive to them. Similarly, if such a service or tool could significantly reduce the amount of data wrangling done by a CSP’s data analysts, then CSPs would be happy to pay to free up their valuable analyst resources to focus on profiling, modeling, and reporting.

The NY Times article linked above mentions a number of startups already working on automated data wrangling, so I’ll be watching this space with interest. Will you?

PS. As fate would have it, HBR has just released a new post in relation to the Sexiest Job of the 21st Century.
