Have you heard of the Darwin Awards? Hop on YouTube and take a look; it's generally pretty funny stuff. It's a tongue-in-cheek honor that recognizes people for spectacularly reckless attempts to do something they think is cool. One takes a selfie with a wounded bear; another bolts a jet engine to a skateboard. These bold actions lead to fatal mistakes, dire consequences, and funny comments. Spoiler alert: sadly, they all die. You don't want your startup to "die" from machine learning mistakes.
Over the past 25 years, I have seen people make mistakes thousands of times, but I have never seen a machine make one. Today, a blunder in a machine learning project can cost a company millions of dollars and years of wasted work. That is why I have collected here the most common machine learning mistakes related to data, metrics, validation, and technology.
The odds of making a mistake while working with data are high; crossing a minefield is easier than getting through a dataset unscathed. The most common mistakes include:
Unprocessed data. Raw, unprocessed data is rubbish: you cannot trust the adequacy of a model built on it. Only pre-processed, cleaned data should form the basis of an AI project.
Anomalies. Check the data for deviations and anomalies and get rid of them; eliminating such errors is a priority of every machine learning project. Data may be incomplete or incorrect, or some information may be missing for certain periods.
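As a minimal sketch of what "getting rid of anomalies" can mean in practice (the readings and the 1.5 threshold here are illustrative, not from the article), one common approach is an interquartile-range filter:

```python
import statistics

def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical sensor readings with one obvious glitch.
readings = [21.0, 21.4, 20.9, 22.1, 21.7, 500.0]
print(iqr_filter(readings))  # the 500.0 outlier is dropped
```

Whether to drop, cap, or impute an anomaly depends on the task; the point is that the check must happen before training, not after.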
Lack of data. Running 10 experiments and taking the result is perhaps the easiest way, but hardly the most correct one. A small or unbalanced dataset will lead to conclusions far from the truth. If you need to train a network to distinguish spectacled penguins from spectacled bears, a couple of bear photos won't fly, even if you have thousands of penguin images.
Too much data. Sometimes limiting the amount of data is the only correct solution; it is how you get, for example, the most objective picture of someone's future actions. People are incredibly unpredictable, so predicting their response today from their behavior in 1998 is like reading tea leaves: the result may look consistent, but it will be far from reality.
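The penguins-and-bears problem above fits in a few lines. This is a toy sketch with invented counts: a model that always answers with the majority class looks accurate while knowing nothing about the minority class.

```python
from collections import Counter

# Hypothetical labelled dataset: thousands of penguins, only two bears.
labels = ["penguin"] * 5000 + ["bear"] * 2

counts = Counter(labels)
print(counts)  # Counter({'penguin': 5000, 'bear': 2})

# A lazy model that always answers "penguin" is almost always "right"...
majority = counts.most_common(1)[0][0]
accuracy = counts[majority] / len(labels)
print(f"{accuracy:.2%}")  # 99.96%
# ...yet it has learned nothing about bears: the data, not the model, is broken.
```

A class-balance check like this costs seconds and catches the problem before any training starts.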
Accuracy is an essential metric in machine learning, but mindlessly chasing absolute accuracy can become a problem for an AI project, particularly when the goal is a predictive recommendation system. An online grocery store can reach an incredible 99% accuracy simply by recommending milk: I bet the buyer will take it, and the recommendation system "works". But I'm afraid he would have bought it anyway, so there is little sense in such a recommendation. For a city resident who buys milk daily, what matters in such systems is a personalized approach and the promotion of goods that were not in the basket before.
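To make the milk example concrete, here is a toy calculation (the numbers are invented): a recommender that always suggests milk scores high accuracy against the purchase history, yet it changes nothing about what the customer buys.

```python
# Hypothetical purchase log for 100 store visits: 1 = milk bought, 0 = not.
purchases = [1] * 99 + [0]

# A "recommender" that suggests milk on every visit.
recommendations = [1] * len(purchases)

# Accuracy looks spectacular...
hits = sum(p == r for p, r in zip(purchases, recommendations))
accuracy = hits / len(purchases)
print(f"accuracy: {accuracy:.0%}")  # accuracy: 99%

# ...but every hit is a purchase that would have happened anyway,
# so the recommendation adds no value over doing nothing at all.
```

This is why recommendation systems are better judged by uplift over a do-nothing baseline than by raw accuracy.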
A child learning the alphabet gradually masters letters, simple words, and phrases, absorbing and processing information at a certain level. Yet a scientific paper remains incomprehensible to the toddler, even though its words consist of the very letters he has learned.
The model in an AI project likewise learns from a specific dataset, but it won't do to check the model's quality on that same dataset. To evaluate the model, use a separately held-out set of data that was never used in training; only then can you get the most accurate assessment of model quality.
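A minimal sketch of the idea, assuming a simple 80/20 holdout (the function name and ratio are illustrative):

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle and split samples into disjoint train and test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))        # 80 20
assert not set(train) & set(test)   # evaluation data is never seen in training
```

The model is fit on `train` only; `test` is touched once, at evaluation time. Cross-validation generalizes the same principle when data is scarce.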
Choosing the wrong technology is another common mistake in AI projects, one with consequences that, if not fatal, are serious enough to hurt the project's efficiency and push back its deadlines.
No wonder: you can hardly find a more hyped topic in machine learning than neural networks, often pitched as a universal algorithm suitable for any task. But they are not the most effective or the fastest tool for every task.
The clearest example is Kaggle competitions. Neural networks do not always take first place; on the contrary, tree-based models such as random forests often have a better chance of winning, primarily on tabular data.
Neural networks are more often applied to visual information, speech, and other complex data.
Reaching for a neural network may be the simplest default these days, but the project team should clearly understand which algorithms suit a particular task.
I truly believe the machine learning hype will prove neither false, exaggerated, nor groundless. Machine learning is just another engineering tool that makes our lives simpler and more comfortable, gradually changing them for the better.
For many mature projects, this article may read as a nostalgic retrospective of mistakes they have already made, survived, and overcome on the way to becoming a product company.
But for those just starting their AI venture, it is a chance to understand why taking a selfie with a wounded bear is a bad idea, and how to stay off the endless lists of "dead" startups.