Table of content:

How to choose right machine learning models?

Understand assumptions and restrictions of each machine learning model will help you to get correct starting point.

Below table try to list down some famous models and their advantages / disadvantages.

Models

1. C = Classification.

2. R = Regression.

Learning Type (L):

1. S = Supervised

2. U = Unsupervised

3. C = Clustering

4. R = Reinforcement learning

Method (M):

1. N = Non-parametric

2. P = Parameteric

Approach (A):

1. G = Generative

2. D = Discriminative

Hyperparameters:

In statistics hyperparameters are parameters of a prior distribution. It relates to the parameters of our model.

T L M A Algorithm Assumption / Description Advantages Disadvantages Hyperparameters Loss (Cost) Function Activation Function (link function in linear model)
R S P D Linear Regression Baseline predictions 1. Simple to understand and explain. 2. It seldom overfits. 3. Using L1 & L2 regularization is effective in feature selection. 4. Fast to train. 1. sensitive to outliers. 2. You have to work hard to make it fit nonlinear functions. weights by its dimensions of input, bias, and learning rate Mean Square Error (MSE = L2 loss) $L(y, \hat{y}) = \frac{1}{M}\sum_{i=1}^{M}(\hat{y_i}-y_i)^2$, or $L(y, x, w) = \frac{1}{M}\sum_{i=1}^{M}(\hat{y_i}-(w^Tx_i+b))^2$ $y = f(x) = x$
C S P D Logistic regression Independent and irrelevant alternatives (IIA) or independent and identically distributed (i.i.d.) assumption. Output results are probabilities of categorical dependent variables. Types of logistic regression: 1. Binary (Pass/Fail), 2. Multi (Cats, Dogs, Sheep), 3. Ordinal (Low, Medium, High) 1. Simple to understand and explain. 2. It seldom over-fits. 3. Using L1 & L2 regularization is effective in feature selection. 4. The best algorithm for predicting probabilities of an event. 5. Fast to train 1. Can suffer from outliers. 2. You have to work hard to make it fit nonlinear functions   Cross-entropy loss = log loss = negative log-likelihood link function of binary classifier: Sigmoid = $\frac{1}{1+e^{-x}}$, link function of multinomial logistic regression for multi-classification = softmax
C S P G Naive Bayes random variables are independent and identically distributed (i.i.d. assumption). Assume that the value of a particular feature is independent of the value of any other feature, given the class variable. 1. Easy and fast to implement. 2. doesn’t require too much memory and can be used for online learning. 3. Easy to understand. 4. Takes into account prior knowledge 1. Strong and unrealistic feature independence assumptions. 2. Fails estimating rare occurrences. 3. Suffers from irrelevant features.
C S P G Hidden Markov Model
C S P D Linear-chain Conditional Random Field
C/R S P D Random Forest Apt at almost any machine learning problem 1. Can work in parallel. 2. May overfits (Depth of tree is a parameters to control overfits). 3. Automatically handles missing values. 4. No need to transform any variable. 5. Can be used by almost anyone with excellent results 1. Difficult to interpret due to complex multiple tree structures. 2. Weaker on regression when estimating values at the extremities of the distribution of response values. 3. Biased in multiclass problems toward more frequent classes.
C/R S P D Gradient Boosting 1. Apt at almost any machine learning problem. 2. Search engines (solving the problem of learning to rank) 1. It can approximate most nonlinear function. 2. Best in class predictor. 3. Automatically handles missing values. 4. No need to transform any variable 1. It can overfit if run for too many iterations. 2. Sensitive to noisy data and outliers. 3. Doesn’t work well without parameter tuning
C S P D Support Vector Machines 0. One of influential approaches to supervised learning model (before NN works). 1. Character recognition. 2. Image recognition. 3. Text classification. 1. Automatic nonlinear feature creation. 2. Can approximate complex nonlinear functions 1. Difficult to interpret when applying nonlinear kernels. 2. Suffers from too many examples, after 10,000 examples it starts taking too long to train
C/R C N D K-Means       K center points. Can be found by Elbow Method or hierarchical clustering Error = Sum of Squared Errors (SSE) for each data point to its center
C/R S P D Neural Networks / MLP (Multiple Layer Perceptron) data space re-project until converge 1. Can approximate any nonlinear function. 2. Robust to outliers. 3. Works only with a portion of the examples (the support vectors) 1. Very difficult to set up. 2. Difficult to tune because of too many parameters and you have also to decide the architecture of the network. 3. Difficult to interpret 4. Easy to overfit the number of layers, the number of neurons in each layer, the learning rate, regularization, dropout rate, batch size
C/R C N D K-nearest Neighbors the input consists of the k closest training examples in the feature space 1.Fast. 2. lazy training. 3.Can naturally handle extreme multiclass problems (like tagging text) 1. Slow and cumbersome in the predicting phase(Model has to carry with data) 2. Can fail to predict correctly due to the (the volume of the space increases so fast that the available data become sparse)
C/R S P D Neural Networks / MLP (Multiple Layer Perceptron) data space re-project until converge 1. Can approximate any nonlinear function. 2. Robust to outliers. 3. Works only with a portion of the examples (the support vectors) 1. Very difficult to set up. 2. Difficult to tune because of too many parameters and you have also to decide the architecture of the network. 3. Difficult to interpret 4. Easy to overfit the number of layers, the number of neurons in each layer, the learning rate, regularization, dropout rate, batch size
C S P D Perceptron Binary Classifier, Assume data is binary classifiable or If the training data is linearly separable, the algorithm stops in a finite number of steps.
C U N D PCA PCA is limited to re-expressing the data as a linear combination of its basis vectors to best express the data mean

Reference: 