Which Machine Learning algorithm to use?
Motivation
Machine Learning is a fast-growing technology, and it is everywhere. Everyday life is full of examples: autonomous driving, chatbots, recommender systems such as Amazon’s book recommendations and Netflix’s TV show recommendations, product predictions, and many more. Machine Learning also plays a major role in health care, where it helps identify diseases such as cancer.
Machine Learning is all about data. Most Machine Learning algorithms need a large amount of data to work well, and in general, the more good-quality data we give an algorithm, the better its predictions. So the most important thing is data.
People who are new to Data Science and Machine Learning often struggle to decide when to use which Machine Learning algorithm. In this article, I will clarify which algorithm to use for your problem.
Why do we use Machine Learning?
Machine Learning is used to solve problems involving very complex data. With traditional programming, it is very difficult and time-consuming to deal with large amounts of complex data. In traditional programming, we give Data and a Program as input and we get Results as output. In Machine Learning, we give Data and Results as input and we get a Program as output. See the figure below. Machine Learning has solved a lot of problems, often in much less time than a traditional programming approach would take.
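The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the temperature-conversion task and the names `c_to_f` and `fit_line` are hypothetical and were chosen because the "program" is a simple line. In the first half a human writes the rule; in the second half the rule is recovered from Data and Results with an ordinary least-squares fit.

```python
# Traditional programming: Data + Program -> Results.
# The rule is written by hand (hypothetical example: Celsius to Fahrenheit).
def c_to_f(celsius):
    return celsius * 9 / 5 + 32


# Machine Learning: Data + Results -> Program.
# A least-squares line fit recovers the rule from examples.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


data = [-10.0, 0.0, 15.0, 30.0, 100.0]    # inputs (Celsius)
results = [14.0, 32.0, 59.0, 86.0, 212.0]  # known outputs (Fahrenheit)

slope, intercept = fit_line(data, results)


def learned_c_to_f(celsius):
    # The "program" produced by learning: slope and intercept come from the data.
    return slope * celsius + intercept


print(c_to_f(37.0), learned_c_to_f(37.0))
```

With clean data like this, the learned function should agree closely with the hand-written one; with real, noisy data the fit is only an approximation.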
Defining your problem
First of all, it is necessary to define your problem to get a better solution. This can easily be done by answering these three questions:
1) What do you want to do?
2) What is available?
3) What are your constraints?
What do you want to do?
Do you want to predict a category? That’s classification. For instance, you want to know if an input image belongs to the cat category or the dog category.
Do you want to predict a quantity? That’s regression. For instance, knowing the floor area of a house, its location, and whether it has a garage, you want to predict its market value. In this case, go for a regression approach because you want to predict a price, i.e. a quantity, not a category.
Do you want to detect an anomaly? That’s anomaly detection. Say you want to detect anomalous money withdrawals. Imagine that you live in England, you have never been abroad, and money has been withdrawn 5 times in Las Vegas from your bank account. In this case, you might want the bank to detect that and prevent it from happening again.
Do you want to discover structure in unexplored data? That’s clustering. For instance, imagine having a large amount of website logs. You might want to explore them to see if there are groups of similar visitor behavior. These groups might help you improve your website.
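The four questions above can be turned into a small decision helper. The sketch below is a simplification, not a definitive rule: the function name `suggest_task` and its parameters are hypothetical, and it assumes that numeric targets mean quantities (class labels encoded as integers would be misread as regression).

```python
from typing import Optional, Sequence


def suggest_task(targets: Optional[Sequence[object]],
                 looking_for_outliers: bool = False) -> str:
    """Map the 'What do you want to do?' questions to a task family.

    targets: the known outputs for your data points, or None if you have none.
    looking_for_outliers: True if the goal is to flag unusual data points.
    """
    if looking_for_outliers:
        return "anomaly detection"
    if targets is None:
        # No known outputs: explore the structure of the data instead.
        return "clustering"
    # Assumption: numeric (non-boolean) targets are quantities, not categories.
    if all(isinstance(t, (int, float)) and not isinstance(t, bool)
           for t in targets):
        return "regression"
    return "classification"


print(suggest_task(["cat", "dog", "cat"]))
print(suggest_task([250000.0, 310000.0]))
print(suggest_task(None))
print(suggest_task(None, looking_for_outliers=True))
```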
What is available?
How much data do you have? Of course, this depends on the problem you want to solve and the kind of data you’re playing with. Knowing the amount of data you have is important. As a rough rule of thumb, with more than 100,000 data points you will be able to use almost every kind of algorithm.
Do your data points have labels? That is, do we know the category of each data point we have? If we know the category an image belongs to, we know its label. If we don’t, the data is unlabeled, and we are limited to methods that do not need labels, such as clustering.
Do you have a lot of features to work with? The number of features you have might influence your algorithm choice. In the case of house price forecasting, you might need to know the total floor area of the house, the number of floors, the proximity to the city center, and so on. More relevant features can make your analysis more accurate, but too many or too few features will restrict your choice of algorithm. Having too many features also increases the chance of redundant features: features that are correlated, such as the area of the house and its inner volume, hurt the performance of some algorithms.
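One way to spot redundant features such as area and inner volume is to compute the Pearson correlation between every pair of features. The sketch below uses made-up house data, and the helper names `pearson` and `redundant_pairs` as well as the 0.95 threshold are assumptions chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def redundant_pairs(features, threshold=0.95):
    """Return pairs of feature names whose |correlation| reaches the threshold."""
    names = list(features)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(features[first], features[second])) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical data: volume is area times a constant ceiling height of 2.5 m,
# so the two features carry the same information.
area = [50.0, 80.0, 120.0, 200.0]
features = {
    "area_m2": area,
    "volume_m3": [a * 2.5 for a in area],
    "floors": [1, 3, 1, 2],
}

print(redundant_pairs(features))
```

A pair flagged this way is a candidate for dropping one of the two features, not a guarantee that it is useless.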
How many classes do you have? Knowing how many classes (categories) you have is important for some ML algorithms, especially for some exploratory ML algorithms.
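The questions in this section can be answered with a quick inventory of the dataset. The following is a minimal sketch that assumes the data is a list of equal-length feature rows with an optional list of labels; the function name `summarize_dataset` and the sample values are hypothetical.

```python
def summarize_dataset(rows, labels=None):
    """Answer 'What is available?' for a dataset held in plain lists.

    rows: list of feature rows, all of the same length.
    labels: one label per row, or None if the data is unlabeled.
    """
    has_labels = labels is not None
    return {
        "n_samples": len(rows),
        "n_features": len(rows[0]) if rows else 0,
        "has_labels": has_labels,
        # The number of classes is only known when labels are available.
        "n_classes": len(set(labels)) if has_labels else None,
    }


# Hypothetical example: three images described by two features each.
rows = [[0.2, 0.7], [0.9, 0.1], [0.3, 0.8]]
labels = ["cat", "dog", "cat"]

print(summarize_dataset(rows, labels))
print(summarize_dataset(rows))
```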
What are your constraints?
What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster. This is the case, for instance, for embedded systems.
Does the prediction have to be fast? In real-time applications, it is very important to get a prediction as fast as possible. For instance, in autonomous driving, road signs must be classified as quickly as possible to avoid accidents.
Does the learning have to be fast? In some circumstances, training models quickly is necessary: sometimes you need to rapidly update your model on the fly with a different dataset.
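These three constraints (storage, prediction speed, training speed) can be measured rather than guessed. The sketch below is one possible way to do it with the standard library; `MeanRegressor` is a hypothetical toy model standing in for a real one, and `profile_model` assumes the model exposes `fit` and `predict` methods and keeps its learned parameters as instance attributes.

```python
import pickle
import time


class MeanRegressor:
    """Toy model (hypothetical): always predicts the mean of the training targets."""

    def fit(self, xs, ys):
        self.mean_ = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


def profile_model(model, xs, ys):
    """Measure training time, prediction time, and serialized parameter size."""
    start = time.perf_counter()
    model.fit(xs, ys)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(xs)
    predict_seconds = time.perf_counter() - start

    # Serialize the learned parameters to estimate the storage footprint.
    size_bytes = len(pickle.dumps(vars(model)))

    return {
        "train_seconds": train_seconds,
        "predict_seconds": predict_seconds,
        "size_bytes": size_bytes,
    }


xs = [[float(i)] for i in range(1000)]
ys = [2.0 * i for i in range(1000)]

report = profile_model(MeanRegressor(), xs, ys)
print(report)
```

Running the same profile on each candidate model, with data of realistic size, gives comparable numbers to check against your storage and latency budgets.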
Two other very important aspects that we, enthusiastic developers, tend to forget are the maintainability of the solution we choose and communication.
Maintainability: it is sometimes wiser to go for a simpler solution that gives correct results than for a very sophisticated solution that gives slightly better results but that you’re not 100% confident with. Otherwise, we might not be able to easily update the solution or fix a bug in the future.
Communication: we developers sometimes work with non-developers. For some projects, it is necessary to present your solution to people from other professions. In this case, it might be wise to go for an ML solution that is easier for a layperson to understand.