Lack Of High-Quality Training Data Impedes AI Advances

Artificial intelligence systems may one day perform a plethora of human tasks, but the third annual Data Scientist Survey conducted by CrowdFlower finds that the work to get there is tedious and time-consuming. The survey of approximately 200 data scientists found that the jobs they hate the most, the ones that consume the bulk of their time, are cleaning, labeling, and categorizing data. They spend 500% more time cleaning, labeling, and categorizing data than they spend mining the data. In fact, those surveyed said they spend twice as much time on these laborious tasks as on creating and building algorithms.


The reason for this is twofold. First, according to the survey results, the lack of high-quality training data is the single biggest reason AI systems fail. It is so critical that respondents said they'd rather break a leg than delete their training data. Second, data scientists have concerns about the integrity of the training data and worry that, if they aren't careful, the wrong training data could bias an AI system by carrying over human prejudices around things such as religion, race, or gender.



As AI systems increasingly enter the mainstream, their usefulness is often defined by the quality of the training data used. While a machine can process complex mathematical equations or structured data in milliseconds, training data teaches it how to handle more abstract tasks, such as flagging inappropriate content or distinguishing between objects in images. Higher-quality initial training data improves the accuracy of an algorithm's initial output, but ongoing training data is still required to keep improving the algorithm's results.


To view the full report, visit

For more information, visit
