Lack Of High-Quality Training Data Impedes AI Advances

Artificial intelligence systems may one day perform a plethora of human tasks, but the third annual Data Scientist Survey conducted by CrowdFlower finds that the work to get there is tedious and time-consuming. The survey of approximately 200 data scientists found that the jobs they hate the most, the ones that consume the bulk of their time, are cleaning, labeling, and categorizing data. They spend 500% more time cleaning, labeling, and categorizing data than they spend mining the data. In fact, those surveyed said they spend twice as much time on these laborious tasks as on creating and building algorithms.


The reason for this is twofold. First, according to the survey results, the lack of high-quality training data is the single biggest reason AI systems fail. It is so critical that respondents said they'd rather break their leg than delete their training data. Second, data scientists have concerns about the integrity of the training data and worry that, if they aren't careful, the wrong training data could bias an AI system, because the data could be influenced by human prejudices around things such as religion, race, or gender.


As AI systems increasingly enter the mainstream, their usefulness is often defined by the quality of the training data used. A machine can process complex mathematical equations or structured data in milliseconds, but it needs training data to learn how to handle more abstract tasks, such as flagging inappropriate content or distinguishing between objects in images. Higher-quality initial training data will improve the accuracy of an algorithm's initial output, but ongoing training data is required to keep improving the algorithm's results.


To view the full report, visit

For more information, visit