Jekyll2018-09-25T12:32:21+00:00https://cheng-lin-li.github.io/Data Science and AI NotebookThis is my Blog for Data Science Notebook which includes Artificial Intelligence, Machine Learning, Natural Language Processing, Knowledge Graphs, Information Visualization, and Data Mining. Cheng-Lin LiWhat is XYZ?2018-07-13T00:00:00+00:002018-07-13T00:00:00+00:00https://cheng-lin-li.github.io/2018/07/13/What-is-XYZ<!-- more --> <hr /> <p>Table of content:</p> <ul class="table-of-content" id="markdown-toc"> <li><a href="#what-is-xyz" id="markdown-toc-what-is-xyz">What is XYZ?</a></li> <li><a href="#basic-terms" id="markdown-toc-basic-terms">Basic terms</a> <ul> <li><a href="#what-is-loss-cost-objective-function" id="markdown-toc-what-is-loss-cost-objective-function">What is loss, cost, objective function?</a></li> <li><a href="#loss-function" id="markdown-toc-loss-function">Loss function</a></li> <li><a href="#cost-function" id="markdown-toc-cost-function">Cost function</a></li> <li><a href="#objective-function" id="markdown-toc-objective-function">Objective function</a></li> <li><a href="#what-is-the-loss-function-of-linear-regression" id="markdown-toc-what-is-the-loss-function-of-linear-regression">What is the loss function of linear regression?</a></li> <li><a href="#what-is-the-loss-function-of-logistic-regression" id="markdown-toc-what-is-the-loss-function-of-logistic-regression">What is the loss function of logistic regression?</a></li> <li><a href="#optimization-of-model" id="markdown-toc-optimization-of-model">Optimization of model</a> <ul> <li><a href="#what-is-l1-regularizer" id="markdown-toc-what-is-l1-regularizer">What is L1 regularizer?</a></li> <li><a href="#what-is-l2-regularizer" id="markdown-toc-what-is-l2-regularizer">What is L2 regularizer?</a></li> </ul> </li> <li><a href="#what-is-imbalance-data" id="markdown-toc-what-is-imbalance-data">What is imbalance data?</a> <ul> <li><a href="#what-are-metrics-to-evaluate-a-model-for-imbalance-data" 
id="markdown-toc-what-are-metrics-to-evaluate-a-model-for-imbalance-data">What are metrics to evaluate a model for imbalance data</a></li> <li><a href="#cost-function-based-approaches" id="markdown-toc-cost-function-based-approaches">Cost function based approaches</a></li> <li><a href="#sampling-based-approaches" id="markdown-toc-sampling-based-approaches">Sampling based approaches</a></li> <li><a href="#oversampling" id="markdown-toc-oversampling">Oversampling</a> <ul> <li><a href="#random-oversampling" id="markdown-toc-random-oversampling">Random oversampling</a></li> <li><a href="#synthetic-minority-over-sampling-technique-smote" id="markdown-toc-synthetic-minority-over-sampling-technique-smote">Synthetic Minority Over-sampling Technique (SMOTE)</a></li> <li><a href="#adaptive-synthetic-adasyn" id="markdown-toc-adaptive-synthetic-adasyn">Adaptive Synthetic (ADASYN)</a></li> </ul> </li> <li><a href="#undersampling" id="markdown-toc-undersampling">Undersampling</a> <ul> <li><a href="#random-undersampling" id="markdown-toc-random-undersampling">Random undersampling</a></li> <li><a href="#near-miss" id="markdown-toc-near-miss">Near miss</a></li> <li><a href="#nearmiss-1" id="markdown-toc-nearmiss-1">NearMiss-1</a></li> <li><a href="#nearmiss-2" id="markdown-toc-nearmiss-2">NearMiss-2</a></li> <li><a href="#nearmiss-3" id="markdown-toc-nearmiss-3">NearMiss-3</a></li> <li><a href="#tomeks-links" id="markdown-toc-tomeks-links">Tomeks links</a></li> <li><a href="#edited-nearest-neighbors" id="markdown-toc-edited-nearest-neighbors">Edited nearest neighbors</a></li> </ul> </li> <li><a href="#hybrid-approach" id="markdown-toc-hybrid-approach">Hybrid approach</a></li> </ul> </li> <li><a href="#what-is-anomaly-detection" id="markdown-toc-what-is-anomaly-detection">What is Anomaly Detection?</a></li> <li><a href="#what-is-receiver-operating-characteristics-roc-curve" id="markdown-toc-what-is-receiver-operating-characteristics-roc-curve">What is Receiver Operating 
Characteristics (ROC) curve?</a></li> <li><a href="#what-is-the-area-under-the-curve-auc" id="markdown-toc-what-is-the-area-under-the-curve-auc">What is the area under the curve (AUC)?</a></li> <li><a href="#what-is-p-value" id="markdown-toc-what-is-p-value">What is p value?</a></li> <li><a href="#what-is-tf-idf" id="markdown-toc-what-is-tf-idf">What is TF-IDF?</a></li> <li><a href="#what-is-bias" id="markdown-toc-what-is-bias">What is Bias?</a> <ul> <li><a href="#mean-signed-difference-deviation-or-error-msd" id="markdown-toc-mean-signed-difference-deviation-or-error-msd">Mean Signed Difference, Deviation or Error (MSD)</a></li> </ul> </li> <li><a href="#what-is-variance" id="markdown-toc-what-is-variance">What is Variance?</a></li> <li><a href="#what-is-bias-variance-tradeoff" id="markdown-toc-what-is-bias-variance-tradeoff">What is Bias-Variance tradeoff?</a> <ul> <li><a href="#what-is-bias-variance-decomposition-of-error" id="markdown-toc-what-is-bias-variance-decomposition-of-error">What is Bias-Variance decomposition of error?</a></li> </ul> </li> <li><a href="#what-is-recurrent-neural-networkn-rnn" id="markdown-toc-what-is-recurrent-neural-networkn-rnn">What is Recurrent Neural Networkn (RNN)?</a></li> <li><a href="#what-are-main-gates-in-long-short-term-memory-lstm" id="markdown-toc-what-are-main-gates-in-long-short-term-memory-lstm">What are main gates in Long Short-Term Memory (LSTM)?</a></li> <li><a href="#what-is-support-vector-machine-svm" id="markdown-toc-what-is-support-vector-machine-svm">What is (Support Vector Machine) SVM?</a> <ul> <li><a href="#hard-margin" id="markdown-toc-hard-margin">Hard-margin</a></li> <li><a href="#soft-margin" id="markdown-toc-soft-margin">Soft-margin</a></li> </ul> </li> <li><a href="#what-is-entropy-in-discrete" id="markdown-toc-what-is-entropy-in-discrete">What is Entropy in discrete?</a></li> <li><a href="#what-is-cross-entropy-in-discrete" id="markdown-toc-what-is-cross-entropy-in-discrete">What is Cross Entropy in 
discrete?</a></li> <li><a href="#what-is-the-difference-between-cross-entropy-and-entropy" id="markdown-toc-what-is-the-difference-between-cross-entropy-and-entropy">What is the difference between Cross Entropy and Entropy?</a></li> <li><a href="#what-is-post-of-speech-pos-tagging-in-nlp" id="markdown-toc-what-is-post-of-speech-pos-tagging-in-nlp">What is Post-of-Speech (POS) Tagging in NLP?</a> <ul> <li><a href="#how-to-do-the-pos-tagging" id="markdown-toc-how-to-do-the-pos-tagging">How to do the POS tagging?</a> <ul> <li><a href="#use-of-hidden-markov-models" id="markdown-toc-use-of-hidden-markov-models">Use of hidden Markov models</a></li> <li><a href="#dynamic-programming-methods" id="markdown-toc-dynamic-programming-methods">Dynamic programming methods</a></li> <li><a href="#unsupervised-taggers" id="markdown-toc-unsupervised-taggers">Unsupervised taggers</a></li> <li><a href="#other-taggers-and-methods" id="markdown-toc-other-taggers-and-methods">Other taggers and methods</a></li> </ul> </li> <li><a href="#what-is-the-loss-function-in-mathematics" id="markdown-toc-what-is-the-loss-function-in-mathematics">What is the loss function in mathematics?</a></li> </ul> </li> <li><a href="#what-is-conditional-random-field" id="markdown-toc-what-is-conditional-random-field">What is Conditional Random Field?</a> <ul> <li><a href="#what-is-the-meaning-of-conditional-in-this-algorithm" id="markdown-toc-what-is-the-meaning-of-conditional-in-this-algorithm">What is the meaning of Conditional in this algorithm?</a></li> <li><a href="#how-crfs-differ-from-hidden-markov-models" id="markdown-toc-how-crfs-differ-from-hidden-markov-models">How CRFs differ from Hidden Markov Models</a></li> <li><a href="#what-is-the-relationship-between-hidden-markov-model-and-conditional-random-field" id="markdown-toc-what-is-the-relationship-between-hidden-markov-model-and-conditional-random-field">What is the relationship between Hidden Markov Model and Conditional Random Field?</a></li> </ul> 
</li> <li><a href="#what-is-viterbi-algorithm" id="markdown-toc-what-is-viterbi-algorithm">What is Viterbi algorithm</a></li> <li><a href="#what-is-jaccard-similarity" id="markdown-toc-what-is-jaccard-similarity">What is Jaccard Similarity</a></li> <li><a href="#what-is-minhash" id="markdown-toc-what-is-minhash">What is MinHash</a></li> <li><a href="#what-is-locality-sensitive-hashinglsh" id="markdown-toc-what-is-locality-sensitive-hashinglsh">What is Locality Sensitive Hashing(LSH)</a></li> <li><a href="#what-is-shingling" id="markdown-toc-what-is-shingling">What is Shingling</a></li> </ul> </li> </ul> <hr /> <h1 id="what-is-xyz">What is XYZ?</h1> <p>Understanding some foundational terms in machine learning will help you communicate more efficiently with experts. Below are some common terms you should know.</p> <h1 id="basic-terms">Basic terms</h1> <h2 id="what-is-loss-cost-objective-function">What is loss, cost, objective function?</h2> <p>These terms are not strictly defined and they are closely related. In short, a loss function is a part of a cost function, which is in turn a type of objective function.</p> <p>From section 4.3 of “Deep Learning” by Ian Goodfellow, Yoshua Bengio and Aaron Courville (http://www.deeplearningbook.org/): the function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.</p> <h2 id="loss-function">Loss function</h2> <p>A loss function maps the performance of our model to a real number. It measures how well the model is performing its task.</p> <p>It is usually a function defined on a data point, prediction and label, and measures the penalty. 
For example:</p> <ul> <li>square loss <script type="math/tex">l(f(x_i\vert \theta),y_i)=(f(x_i\vert \theta)-y_i)^2</script>, used in linear regression</li> <li>hinge loss <script type="math/tex">l(f(x_i\vert \theta),y_i)=\max(0,1-f(x_i\vert \theta)y_i)</script>, used in SVM</li> <li>0/1 loss <script type="math/tex">l(f(x_i\vert \theta),y_i)=1 \iff f(x_i\vert \theta)\neq y_i</script>, used in theoretical analysis and in the definition of accuracy</li> </ul> <h2 id="cost-function">Cost function</h2> <p>It is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example:</p> <ul> <li>Mean Squared Error <script type="math/tex">MSE(\theta)=\frac{1}{N} \sum_{i=1}^N(f(x_i\vert \theta)-y_i)^2</script></li> <li>SVM cost function <script type="math/tex">SVM(\theta)=\Vert\theta\Vert^2+C\sum_{i=1}^N\xi_i</script> (with additional constraints connecting <script type="math/tex">\xi_i</script> with C and with the training set)</li> </ul> <h2 id="objective-function">Objective function</h2> <p>It is the most general term for any function that you optimize during training. For example, the probability of generating the training set in the maximum likelihood approach is a well-defined objective function, but it is neither a loss function nor a cost function (although you could define an equivalent cost function). For example:</p> <ul> <li>MLE is a type of objective function (which you maximize).</li> <li>Divergence between classes can be an objective function, but it is hardly a cost function unless you define something artificial, like 1-Divergence, and name it a cost.</li> </ul> <h2 id="what-is-the-loss-function-of-linear-regression">What is the loss function of linear regression?</h2> <p>Mean Squared Error, or L2 loss. 
Given our simple linear equation y=mx+b, we can calculate MSE as:</p> <script type="math/tex; mode=display">MSE = \frac{1}{N}\sum_{i=1}^N(y_i - (mx_i + b))^2</script> <p>N is the total number of data points</p> <p><script type="math/tex">y_i</script> is the actual value</p> <p><script type="math/tex">mx_i + b</script> is our prediction</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">MSE</span><span class="p">(</span><span class="n">yHat</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="c1"># mean of the squared differences between predictions (yHat) and targets (y)
</span>    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">((</span><span class="n">yHat</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">y</span><span class="o">.</span><span class="n">size</span>
</code></pre></div></div> <h2 id="what-is-the-loss-function-of-logistic-regression">What is the loss function of logistic regression?</h2> <p>Cross-Entropy</p> <p>In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as:</p> <script type="math/tex; mode=display">\text{Binary Cross Entropy} = -(y\log(p)+(1-y)\log(1-p))</script> <p>With more than two classes, the loss is summed over the class labels of each observation:</p> <script type="math/tex; mode=display">\text{Cross Entropy} = -\sum_{c=1}^M y_{o,c}\log(p_{o,c})</script> <p>M - number of classes (dog, cat, fish)</p> <p>log - the natural log</p> <p>y - binary indicator (0 or 1) of whether class label c is the correct classification for observation o</p> <p>p - predicted probability that observation o is of class c</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">log</span>

<span class="k">def</span> <span class="nf">CrossEntropy</span><span class="p">(</span><span class="n">yHat</span><span class="p">,</span> <span class="n">y</span><span 
class="p">):</span>
    <span class="k">if</span> <span class="n">y</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">log</span><span class="p">(</span><span class="n">yHat</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">yHat</span><span class="p">)</span>
</code></pre></div></div> <h2 id="optimization-of-model">Optimization of model</h2> <h3 id="what-is-l1-regularizer">What is L1 regularizer?</h3> <p>Reference: <a href="http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/">Differences between L1 and L2 as Loss Function and Regularization</a></p> <p>A regression model that uses the L1 regularization technique is called Lasso Regression.</p> <p>Lasso Regression (Least Absolute Shrinkage and Selection Operator) <span style="color:red">adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function</span>. Lasso shrinks the coefficients of less important features to zero, thus removing some features altogether. 
So, this works well for feature selection in case we have a huge number of features.</p> <script type="math/tex; mode=display">w^* = \underset{w}{\arg\min}\sum_{j=1}^n \left(t(x_j)-\sum_{i=1}^k w_i h_i(x_j)\right)^2 + \lambda \sum_{i=1}^k \vert w_i\vert</script> <h3 id="what-is-l2-regularizer">What is L2 regularizer?</h3> <p>Reference: <a href="http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/">Differences between L1 and L2 as Loss Function and Regularization</a></p> <p>A regression model that uses the L2 regularization technique is called Ridge Regression.</p> <p>Ridge regression <span style="color:red">adds the “squared magnitude” of the coefficients as a penalty term to the loss function</span>.</p> <script type="math/tex; mode=display">w^* = \underset{w}{\arg\min}\sum_{j=1}^n \left(t(x_j)-\sum_{i=1}^k w_i h_i(x_j)\right)^2 + \lambda \sum_{i=1}^k (w_i)^2</script> <h2 id="what-is-imbalance-data">What is imbalance data?</h2> <p>Reference: <a href="https://www.jeremyjordan.me/imbalanced-data/">imbalanced data</a></p> <p>Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you’ll have a large amount of data/observations for one class (referred to as the majority class), and far fewer observations for one or more other classes (referred to as the minority classes).</p> <p>There are two common ways we’ll train a model: tree-based logical rules developed according to some splitting criterion, and parameterized models updated by gradient descent.</p> <p>It’s worth noting that not all datasets are affected equally by class imbalance. Generally, for easy classification problems in which there’s a clear separation in the data, class imbalance doesn’t impede the model’s ability to learn effectively. 
However, datasets that are inherently more difficult to learn from see an amplification in the learning challenge when a class imbalance is introduced.</p> <h3 id="what-are-metrics-to-evaluate-a-model-for-imbalance-data">What are metrics to evaluate a model for imbalance data</h3> <p>Accuracy is not a good metric for imbalanced data. A model that always predicts the negative (majority) class will still achieve high accuracy.</p> <p>Below are better metrics:</p> <ol> <li> <p>TP (True Positive), FP (False Positive), TN (True Negative), FN (False Negative)</p> </li> <li> <p>Precision and recall</p> </li> <li> <p>F1 or <script type="math/tex">F_{\beta}</script></p> </li> </ol> <p>We can roughly classify the approaches for handling imbalanced data into three major categories: cost function based approaches, sampling based approaches, and recasting the task as anomaly detection.</p> <h3 id="cost-function-based-approaches">Cost function based approaches</h3> <p>Reference: <a href="http://www.chioka.in/class-imbalance-problem/">class-imbalance-problem</a></p> <p>One of the simplest ways to address the class imbalance is to “provide a weight for each class”, which places more emphasis on the minority classes so that the end result is a classifier which can learn equally from all classes.</p> <p>The intuition behind cost function based approaches is that if we think one false negative is worse than one false positive, we will count that one false negative as, e.g., 100 false negatives instead. 
For example, if 1 false negative is as costly as 100 false positives, then the machine learning algorithm will try to make fewer false negatives compared to false positives (since false positives are cheaper).</p> <h3 id="sampling-based-approaches">Sampling based approaches</h3> <p>Another approach to dealing with a class imbalance is to simply alter the dataset to remove the imbalance.</p> <p>This can be roughly classified into three categories:</p> <ol> <li> <p>Oversampling, by adding more of the minority class so it has more effect on the machine learning algorithm</p> </li> <li> <p>Undersampling, by removing some of the majority class so it has less effect on the machine learning algorithm</p> </li> <li> <p>Hybrid, a mix of oversampling and undersampling</p> </li> </ol> <h3 id="oversampling">Oversampling</h3> <p>Oversampling increases the number of minority observations until we’ve reached a balanced dataset.</p> <h4 id="random-oversampling">Random oversampling</h4> <p>The most naive method of oversampling is to randomly sample the minority classes and simply duplicate the sampled observations. With this technique, it’s important to note that you’re artificially “reducing the variance” of the dataset.</p> <h4 id="synthetic-minority-over-sampling-technique-smote">Synthetic Minority Over-sampling Technique (SMOTE)</h4> <p>SMOTE is a technique that generates new observations by interpolating between observations in the original dataset.</p> <p>For a given observation <script type="math/tex">x_i</script>, a new (synthetic) observation is generated by interpolating between <script type="math/tex">x_i</script> and one of its k-nearest neighbors, <script type="math/tex">x_{zi}</script>.</p> <p><script type="math/tex">x_{new} = x_i + \lambda (x_{zi}-x_i)</script> where <script type="math/tex">\lambda</script> is a random number in the range [0,1]. 
This interpolation will create a sample on the line segment between <script type="math/tex">x_i</script> and <script type="math/tex">x_{zi}</script>.</p> <p>This algorithm has three options for selecting which observations, <script type="math/tex">x_i</script>, to use in generating new data points.</p> <ol> <li> <p>regular: No selection rules; randomly sample from all possible <script type="math/tex">x_i</script>.</p> </li> <li> <p>borderline: Separates all possible <script type="math/tex">x_i</script> into three classes using the k nearest neighbors of each point.</p> <p>a. noise: all nearest neighbors are from a different class than <script type="math/tex">x_i</script></p> <p>b. in danger: at least half of the nearest neighbors are of the same class as <script type="math/tex">x_i</script></p> <p>c. safe: all nearest neighbors are from the same class as <script type="math/tex">x_i</script></p> </li> <li> <p>svm: Uses an SVM classifier to identify the support vectors (samples close to the decision boundary) and samples <script type="math/tex">x_i</script> from these points.</p> </li> </ol> <h4 id="adaptive-synthetic-adasyn">Adaptive Synthetic (ADASYN)</h4> <p>Adaptive Synthetic (ADASYN) sampling works in a similar manner to SMOTE; however, the number of samples generated for a given <script type="math/tex">x_i</script> is proportional to the number of nearby samples which “do not” belong to the same class as <script type="math/tex">x_i</script>. 
Thus, ADASYN tends to focus solely on outliers when generating new synthetic training examples.</p> <h3 id="undersampling">Undersampling</h3> <p>Undersampling achieves class balance by removing observations from the majority class, essentially throwing away data to make it easier to learn characteristics about the minority classes.</p> <h4 id="random-undersampling">Random undersampling</h4> <p>A naive implementation would be to simply sample the majority class at random until reaching a similar number of observations as the minority classes.</p> <p>For example, if your majority class has 1,000 observations and you have a minority class with 20 observations, you would collect your training data for the majority class by randomly sampling 20 observations from the original 1,000. As you might expect, this could potentially result in removing key characteristics of the majority class.</p> <h4 id="near-miss">Near miss</h4> <p>Reference: <a href="http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/under-sampling/plot_illustration_nearmiss.html">Illustration of the sample selection for the different NearMiss algorithms</a></p> <p>The general idea behind near miss is to sample only the points from the majority class that are necessary to distinguish it from the other classes.</p> <h4 id="nearmiss-1">NearMiss-1</h4> <p>Select samples from the majority class for which the average distance to the N closest samples of a minority class is smallest.</p> <p><img src="http://contrib.scikit-learn.org/imbalanced-learn/stable/_images/sphx_glr_plot_illustration_nearmiss_001.png" alt="NearMiss1" /></p> <h4 id="nearmiss-2">NearMiss-2</h4> <p>Select samples from the majority class for which the average distance to the N farthest samples of a minority class is smallest.</p> <p><img src="http://contrib.scikit-learn.org/imbalanced-learn/stable/_images/sphx_glr_plot_illustration_nearmiss_002.png" alt="NearMiss2" /></p> <h4 id="nearmiss-3">NearMiss-3</h4> <p>NearMiss-3 is a two-step algorithm. 
First, for each negative (minority) sample, its M nearest neighbors are kept. Then, the positive (majority) samples selected are the ones for which the average distance to the N nearest neighbors is the largest. <img src="http://contrib.scikit-learn.org/imbalanced-learn/stable/_images/sphx_glr_plot_illustration_nearmiss_003.png" alt="NearMiss3" /></p> <h4 id="tomeks-links">Tomeks links</h4> <p>A Tomek’s link exists if two observations of different classes are the nearest neighbors of each other.</p> <p>We’ll remove any observations from the majority class for which a Tomek’s link is identified.</p> <p>Depending on the dataset, this technique won’t actually achieve a balance among the classes. It will simply “clean” the dataset by removing some noisy observations, which may result in an easier classification problem.</p> <p>Most classifiers will still perform adequately for imbalanced datasets as long as there’s a clear separation between the classes. Thus, by focusing on removing noisy examples of the majority class, we can improve the performance of our classifier even if we don’t necessarily balance the classes.</p> <h4 id="edited-nearest-neighbors">Edited nearest neighbors</h4> <p>Edited Nearest Neighbors applies a nearest-neighbors algorithm and “edits” the dataset by removing samples which do not agree “enough” with their neighborhood.</p> <p>For each sample in the class to be under-sampled, the nearest neighbors are computed and, if the selection criterion is not fulfilled, the sample is removed.</p> <p>This approach is similar to Tomek’s links in that we’re not necessarily focused on actually achieving a class balance; we’re simply looking to remove noisy observations in an attempt to make for an easier classification problem.</p> <h3 id="hybrid-approach">Hybrid approach</h3> <p>By combining undersampling and oversampling, we get the advantages but also the drawbacks of both approaches described above, so it is still a tradeoff.</p> <h2 
id="what-is-anomaly-detection">What is Anomaly Detection?</h2> <p>Reference:</p> <p><a href="https://www.youtube.com/watch?v=8DfXJUDjx64">Anomaly Detection</a></p> <p><a href="https://www.youtube.com/watch?v=g2YBWQnqOpw">Anomaly Detection Algorithm</a></p> <p><a href="https://towardsdatascience.com/dealing-with-imbalanced-classes-in-machine-learning-d43d6fa19d2">Dealing with Imbalanced Classes in Machine Learning</a></p> <p>When the number of positive examples is extremely small (2~20, or fewer than 50), throw away the minority examples and switch to an anomaly detection framework. Assume you have 10,000 examples, of which only 20 are positive. You can select 6,000 negative examples to form a training set. Fit a Gaussian distribution to each feature to form a probability model. Then choose a probability threshold e; any test point with P(<script type="math/tex">x_{test}</script>) &lt; e is treated as an anomaly. Assume the features <script type="math/tex">x_i</script> are independent and that their values follow a Gaussian distribution.</p> <script type="math/tex; mode=display">P(X) = P(x_1; \mu_1, \sigma_1^2) \times P(x_2; \mu_2, \sigma_2^2) \times \dots \times P(x_n; \mu_n, \sigma_n^2)</script> <p>We can plot a histogram for each feature to verify its distribution. 
If it does not follow a Gaussian distribution, we need to transform the feature into a new feature.</p> <p>Example:</p> <p><script type="math/tex">x_{new1} = \log(x_1)</script></p> <p><script type="math/tex">x_{new2} = x_2^{0.05}</script></p> <h2 id="what-is-receiver-operating-characteristics-roc-curve">What is Receiver Operating Characteristics (ROC) curve?</h2> <p>Reference: <a href="https://www.jeremyjordan.me/imbalanced-data/">imbalanced data</a></p> <p>An ROC curve visualizes an algorithm’s ability to discriminate the positive class from the rest of the data. We’ll do this by plotting the True Positive Rate against the False Positive Rate for varying prediction thresholds.</p> <script type="math/tex; mode=display">TPR = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}</script> <script type="math/tex; mode=display">FPR = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}</script> <h2 id="what-is-the-area-under-the-curve-auc">What is the area under the curve (AUC)?</h2> <p>Reference: <a href="https://www.jeremyjordan.me/imbalanced-data/">imbalanced data</a></p> <p>The area under the curve (AUC) is a single-value metric which attempts to summarize an ROC curve to evaluate the quality of a classifier. This metric approximates the area under the ROC curve for a given classifier. The ideal curve hugs the upper left hand corner as closely as possible, giving us the ability to identify all true positives while avoiding false positives; this ideal model would have an AUC of 1. 
On the flipside, if your model was no better than a random guess, your TPR and FPR would increase in parallel to one another, corresponding with an AUC of 0.5.</p> <h2 id="what-is-p-value">What is p value?</h2> <p>The P value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (<script type="math/tex">H_0</script>) of a study question is true. The definition of ‘extreme’ depends on how the hypothesis is being tested. P is also described in terms of rejecting <script type="math/tex">H_0</script> when it is actually true; however, it is not a direct probability of this state.</p> <h2 id="what-is-tf-idf">What is TF-IDF?</h2> <p>Term Frequency–Inverse Document Frequency</p> <p>Term Frequency (TF) measures how often a term (word) occurs in a document. Here the raw count <script type="math/tex">f_{t,d}</script> is normalized by the total number of terms in the document.</p> <script type="math/tex; mode=display">tf(t,d) = \frac{f_{t,d}}{\sum_{t'\in{d}}f_{t',d}}</script> <p>The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.</p> <script type="math/tex; mode=display">idf(t, D) = \log \frac{N}{\vert \{d\in{D}: t\in{d}\} \vert}</script> <ul> <li>N: total number of documents in the corpus, N = |D|</li> <li><script type="math/tex">\vert \{d\in{D}: t\in{d}\}\vert</script>: number of documents where the term t appears. If the term is not in the corpus, this will lead to a division-by-zero. 
It is therefore common to adjust the denominator to 1 + <script type="math/tex">\vert \{d\in{D}: t\in{d}\}\vert</script></li> </ul> <script type="math/tex; mode=display">tfidf(t,d, D) = tf(t, d) * idf(t, D)</script> <h2 id="what-is-bias">What is Bias?</h2> <p>The bias of an estimator <script type="math/tex">\hat{\theta}_m</script> of a statistical model parameter <script type="math/tex">\theta</script> is:</p> <script type="math/tex; mode=display">Bias(\hat{\theta}_m) = E(\hat{\theta}_m) - \theta</script> <p><script type="math/tex">\theta</script> is the true underlying value used to define the data generating distribution.</p> <p><script type="math/tex">E(\hat{\theta}_m)</script> is the expectation of the estimator <script type="math/tex">\hat{\theta}_m</script> over the data (seen as samples from a random variable).</p> <p>In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference (ignoring the noise for estimation purposes).</p> <h3 id="mean-signed-difference-deviation-or-error-msd">Mean Signed Difference, Deviation or Error (MSD)</h3> <script type="math/tex; mode=display">B = \text{Mean Signed Deviation/Difference} + \text{Irreducible Error} = MSD(\hat{\theta}) + \text{noise} = \frac{1}{n}\sum_{j=1}^n(\hat{\theta}_j - \theta_j) + \text{noise}</script> <p>The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).</p> <p>Bias occurs when an algorithm has limited flexibility or is not complex enough, producing an underfit model that can’t learn the true signal from a dataset.</p> <p>High bias, low variance algorithms train models that are consistent, but inaccurate on average. 
In high bias cases, there is a small gap between training and test error, but the training error is unacceptably high.</p> <p>Try a larger set of features, less regularization, unpruned trees, or small-k KNN to fix a high bias/small variance issue.</p> <h2 id="what-is-variance">What is Variance?</h2> <p>The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).</p> <p>The variance of an estimator is simply its variance:</p> <p><script type="math/tex">Var(\hat{\theta}) = \frac{\sum(x_i-\bar{x})^2}{n}</script> (machine learning usually uses the biased variance)</p> <p><script type="math/tex">Var(\hat{\theta}) = \frac{\sum(x_i-\bar{x})^2}{n-1}</script> (statistics usually uses the unbiased variance)</p> <p>Alternatively, the square root of the variance is called the standard error, denoted <script type="math/tex">SE(\hat{\theta})</script>.</p> <p>Variance refers to an algorithm’s sensitivity to specific sets of training data. Algorithms that are too complex produce overfit models that memorize the noise instead of the signal.</p> <p>High variance, low bias algorithms train models that are accurate on average, but inconsistent. In high variance cases, the test error is still decreasing as the training set size increases, but there is a large gap between training and test error. 
This suggests that a larger training set will help.</p> <p>Try to get more training examples, use stronger regularization, heavily pruned decision trees, or large-k KNN, or try a smaller set of features to fix a high variance/low bias issue.</p> <h2 id="what-is-bias-variance-tradeoff">What is Bias-Variance tradeoff?</h2> <p>To get good predictions, you’ll need to find a balance of bias and variance that minimizes “total error”.</p> <h3 id="what-is-bias-variance-decomposition-of-error">What is Bias-Variance decomposition of error?</h3> <p>Assume that <script type="math/tex">Y = f(X) + \epsilon</script> where <script type="math/tex">E(\epsilon) = 0</script> and <script type="math/tex">Var(\epsilon) = \sigma_{\epsilon}^2</script>. We can then derive an expression for the expected prediction error of a regression fit <script type="math/tex">\hat{f}(X)</script> at an input point <script type="math/tex">X = x_0</script>, using squared-error loss:</p> <script type="math/tex; mode=display">Err(x_0) = E[(Y-\hat{f}(x_0))^2 \vert X = x_0]\\ = [E\hat{f}(x_0)-f(x_0)]^2 + E[\hat{f}(x_0)-E\hat{f}(x_0)]^2 + \sigma_{\epsilon}^2\\ = Bias^2(\hat{f}(x_0)) + Var(\hat{f}(x_0)) + \text{Irreducible Error}\\ = Bias^2 + Variance + \text{Irreducible Error}</script> <p>Irreducible error is “noise” that can’t be reduced by the algorithm. 
It can sometimes be reduced by better data cleaning.</p> <p>Low variance algorithms tend to be less complex, with simple or rigid underlying structure.</p> <p>Examples:</p> <p>Regression, Naive Bayes, linear algorithms, parametric algorithms.</p> <p>Low bias algorithms tend to be more complex, with flexible underlying structure.</p> <p>Examples:</p> <p>Decision trees, nearest neighbors, non-linear algorithms, non-parametric algorithms.</p> <p>A proper machine learning workflow finds that optimal balance.</p> <ul> <li>Separating training and test sets</li> <li>Trying appropriate algorithms</li> <li>Fitting model parameters</li> <li>Tuning impactful hyperparameters</li> <li>Using proper performance metrics</li> <li>Applying systematic cross-validation</li> </ul> <p><img src="http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png" alt="Bias-VarianceTradeOff" /></p> <p><img src="https://www.dataquest.io/blog/content/images/2017/12/add_data.png" alt="training and validation error" /></p> <p><img src="https://www.dataquest.io/blog/content/images/2017/12/low_high_bias.png" alt="High bias and low bias case" /></p> <p><img src="https://www.dataquest.io/blog/content/images/2017/12/low_high_var.png" alt="high variance and low variance case" /></p> <p>Diagnostic:</p> <p>–Variance: Training error will be much lower than test error.</p> <p>–Bias: Training error will also be high.</p> <p>Reference:</p> <ol> <li> <p><a href="https://elitedatascience.com/bias-variance-tradeoff">https://elitedatascience.com/bias-variance-tradeoff</a></p> </li> <li> <p><a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Understanding the Bias-Variance Tradeoff</a></p> </li> <li> <p><a href="http://cs229.stanford.edu/materials/ML-advice.pdf">Machine Learning advice.pdf</a></p> </li> <li> <p><a href="https://www.dataquest.io/blog/learning-curves-machine-learning/">Learning Curves for Machine Learning</a></p> </li> </ol> <h2 id="what-is-recurrent-neural-networkn-rnn">What is Recurrent Neural 
Network (RNN)?</h2> <p>A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. RNNs can use their internal state (memory) to process sequences of inputs.</p> <p>In other neural networks, all the inputs are treated as independent of each other. In an RNN, however, the inputs in a sequence are related to each other. RNNs are called recurrent because they perform the prediction task for every element of a sequence, with the output depending on the previous computations.</p> <h2 id="what-are-main-gates-in-long-short-term-memory-lstm">What are main gates in Long Short-Term Memory (LSTM)?</h2> <p>Long short-term memory (LSTM) units are units of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.</p> <h2 id="what-is-support-vector-machine-svm">What is Support Vector Machine (SVM)?</h2> <p>SVMs are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.</p> <h3 id="hard-margin">Hard-margin</h3> <p>If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the “margin”, and the maximum-margin hyperplane is the hyperplane that lies halfway between them.</p> <p>An important consequence of this geometric description is that the max-margin hyperplane is completely determined by those <script type="math/tex">\overrightarrow{x_i}</script> which lie nearest to it. 
These <script type="math/tex">\overrightarrow{x_i}</script> are called support vectors.</p> <h3 id="soft-margin">Soft-margin</h3> <p>To extend SVM to cases in which the data are not linearly separable, we introduce the hinge loss function:</p> <p><script type="math/tex">max(0, 1-y_i(\overrightarrow{w}*\overrightarrow{x_i}-b))</script></p> <p>Note that <script type="math/tex">y_i</script> is the i-th target (i.e., in this case, 1 or -1), and <script type="math/tex">(\overrightarrow{w}*\overrightarrow{x_i}-b)</script> is the current output.</p> <p>If <script type="math/tex">\overrightarrow{x_i}</script> lies on the correct side of the margin, the loss function is zero.</p> <p>We then wish to minimize</p> <script type="math/tex; mode=display">[\frac{1}{n}\sum_{i=1}^n max(0, 1-y_i(\overrightarrow{w}*\overrightarrow{x_i}-b))] + \lambda \|\overrightarrow{w}\|^2</script> <p>where the parameter <script type="math/tex">\lambda</script> determines the tradeoff between increasing the margin-size and ensuring that the <script type="math/tex">\overrightarrow{x_i}</script> lie on the correct side of the margin.</p> <h2 id="what-is-entropy-in-discrete">What is Entropy in discrete?</h2> <script type="math/tex; mode=display">H(p) = -\sum_x p(x) log p(x)</script> <h2 id="what-is-cross-entropy-in-discrete">What is Cross Entropy in discrete?</h2> <p><script type="math/tex">H(p, q) = -\sum_x p(x) log q(x)</script> or equivalently: <script type="math/tex">H(p, q) = H(p) + D_{KL}(p \vert\vert q)</script></p> <h2 id="what-is-the-difference-between-cross-entropy-and-entropy">What is the difference between Cross Entropy and Entropy?</h2> <p>The KL divergence from <script type="math/tex">\hat{y}</script> (or Q, your observation) to y (or P, ground truth) is simply the difference between cross entropy and entropy:</p> <script type="math/tex; mode=display">KL(y \vert\vert \hat{y})=\sum_iy_ilog\frac{1}{\hat{y}_i}-\sum_iy_ilog\frac{1}{y_i}=\sum_iy_ilog\frac{y_i}{\hat{y}_i}</script> <h2 
id="what-is-post-of-speech-pos-tagging-in-nlp">What is Part-of-Speech (POS) Tagging in NLP?</h2> <h3 id="how-to-do-the-pos-tagging">How to do the POS tagging?</h3> <h4 id="use-of-hidden-markov-models">Use of hidden Markov models</h4> <h4 id="dynamic-programming-methods">Dynamic programming methods</h4> <h4 id="unsupervised-taggers">Unsupervised taggers</h4> <h4 id="other-taggers-and-methods">Other taggers and methods</h4> <p>Other taggers and methods include the Viterbi algorithm, Brill tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm).</p> <p>Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm.</p> <h3 id="what-is-the-loss-function-in-mathematics">What is the loss function in mathematics?</h3> <h2 id="what-is-conditional-random-field">What is Conditional Random Field?</h2> <p>Reference: <a href="https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541">Overview of Conditional Random Fields</a></p> <p><a href="http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/">Introduction to Conditional Random Fields</a></p> <h3 id="what-is-the-meaning-of-conditional-in-this-algorithm">What is the meaning of Conditional in this algorithm?</h3> <h3 id="how-crfs-differ-from-hidden-markov-models">How CRFs differ from Hidden Markov Models</h3> <h3 id="what-is-the-relationship-between-hidden-markov-model-and-conditional-random-field">What is the relationship between Hidden Markov Model and Conditional Random Field?</h3> <h2 id="what-is-viterbi-algorithm">What is Viterbi algorithm</h2> <h2 id="what-is-jaccard-similarity">What is Jaccard Similarity</h2> <h2 id="what-is-minhash">What is MinHash</h2> <h2 id="what-is-locality-sensitive-hashinglsh">What is Locality Sensitive Hashing(LSH)</h2> <h2 id="what-is-shingling">What is Shingling</h2>Cheng-Lin LiHow to Choose Right Machine Learning 
Model?2018-07-08T00:00:00+00:002018-07-08T00:00:00+00:00https://cheng-lin-li.github.io/2018/07/08/Models<!-- more --> <hr /> <p>Table of content:</p> <ul class="table-of-content" id="markdown-toc"> <li><a href="#how-to-choose-right-machine-learning-models" id="markdown-toc-how-to-choose-right-machine-learning-models">How to choose right machine learning models?</a> <ul> <li><a href="#models" id="markdown-toc-models">Models</a></li> <li><a href="#reference" id="markdown-toc-reference">Reference:</a></li> </ul> </li> </ul> <hr /> <h2 id="how-to-choose-right-machine-learning-models">How to choose right machine learning models?</h2> <p>Understanding the assumptions and restrictions of each machine learning model will help you pick the right starting point.</p> <p>The table below lists some well-known models with their advantages and disadvantages.</p> <h3 id="models">Models</h3> <p>Task (T):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. C = Classification.
2. R = Regression.
</code></pre></div></div> <p>Learning Type (L):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. S = Supervised
2. U = Unsupervised
3. C = Clustering
4. R = Reinforcement learning
</code></pre></div></div> <p>Method (M):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. N = Non-parametric
2. P = Parametric
</code></pre></div></div> <p>Approach (A):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. G = Generative
2. D = Discriminative
</code></pre></div></div> <p>Hyperparameters:</p> <p>In statistics, hyperparameters are parameters of a prior distribution. 
They relate to the parameters of our model.</p> <table> <thead> <tr> <th>T</th> <th>L</th> <th>M</th> <th>A</th> <th>Algorithm</th> <th>Assumption / Description</th> <th>Advantages</th> <th>Disadvantages</th> <th>Hyperparameters</th> <th>Loss (Cost) Function</th> <th>Activation Function (link function in linear model)</th> </tr> </thead> <tbody> <tr> <td>R</td> <td>S</td> <td>P</td> <td>D</td> <td>Linear Regression</td> <td>Baseline predictions</td> <td>1. Simple to understand and explain. 2. It seldom overfits. 3. Using L1 &amp; L2 regularization is effective in feature selection. 4. Fast to train.</td> <td>1. Sensitive to outliers. 2. You have to work hard to make it fit nonlinear functions.</td> <td>Weights (one per input dimension), bias, and learning rate</td> <td>Mean Square Error (MSE = L2 loss) <script type="math/tex">L(y, \hat{y}) = \frac{1}{M}\sum_{i=1}^{M}(\hat{y_i}-y_i)^2</script>, or <script type="math/tex">L(y, x, w) = \frac{1}{M}\sum_{i=1}^{M}(y_i-(w^Tx_i+b))^2</script></td> <td><script type="math/tex">y = f(x) = x</script></td> </tr> <tr> <td>C</td> <td>S</td> <td>P</td> <td>D</td> <td>Logistic regression</td> <td>Independence of irrelevant alternatives (IIA) or independent and identically distributed (i.i.d.) assumption. Output results are probabilities of categorical dependent variables. Types of logistic regression: 1. Binary (Pass/Fail), 2. Multi (Cats, Dogs, Sheep), 3. Ordinal (Low, Medium, High)</td> <td>1. Simple to understand and explain. 2. It seldom overfits. 3. Using L1 &amp; L2 regularization is effective in feature selection. 4. The best algorithm for predicting probabilities of an event. 5. Fast to train</td> <td>1. Can suffer from outliers. 2. 
You have to work hard to make it fit nonlinear functions</td> <td> </td> <td>Cross-entropy loss = log loss = negative log-likelihood</td> <td>Link function of binary classifier: Sigmoid = <script type="math/tex">\frac{1}{1+e^{-x}}</script>; link function of multinomial logistic regression for multi-class classification = softmax</td> </tr> <tr> <td>C</td> <td>S</td> <td>P</td> <td>G</td> <td>Naive Bayes</td> <td>Random variables are independent and identically distributed (i.i.d. assumption). Assumes that the value of a particular feature is independent of the value of any other feature, given the class variable.</td> <td>1. Easy and fast to implement. 2. Doesn’t require too much memory and can be used for online learning. 3. Easy to understand. 4. Takes into account prior knowledge</td> <td>1. Strong and unrealistic feature independence assumptions. 2. Fails at estimating rare occurrences. 3. Suffers from irrelevant features.</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C</td> <td>S</td> <td>P</td> <td>G</td> <td>Hidden Markov Model</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C</td> <td>S</td> <td>P</td> <td>D</td> <td>Linear-chain Conditional Random Field</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C/R</td> <td>S</td> <td>P</td> <td>D</td> <td>Random Forest</td> <td>Apt at almost any machine learning problem</td> <td>1. Can work in parallel. 2. May overfit (tree depth is a parameter that controls overfitting). 3. Automatically handles missing values. 4. No need to transform any variable. 5. Can be used by almost anyone with excellent results</td> <td>1. Difficult to interpret due to complex multiple tree structures. 2. Weaker on regression when estimating values at the extremities of the distribution of response values. 3. 
Biased in multiclass problems toward more frequent classes.</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C/R</td> <td>S</td> <td>P</td> <td>D</td> <td>Gradient Boosting</td> <td>1. Apt at almost any machine learning problem. 2. Search engines (solving the problem of learning to rank)</td> <td>1. It can approximate most nonlinear functions. 2. Best in class predictor. 3. Automatically handles missing values. 4. No need to transform any variable</td> <td>1. It can overfit if run for too many iterations. 2. Sensitive to noisy data and outliers. 3. Doesn’t work well without parameter tuning</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C</td> <td>S</td> <td>P</td> <td>D</td> <td>Support Vector Machines</td> <td>0. One of the most influential approaches to supervised learning (before neural networks worked well). 1. Character recognition. 2. Image recognition. 3. Text classification.</td> <td>1. Automatic nonlinear feature creation. 2. Can approximate complex nonlinear functions. 3. Works only with a portion of the examples (the support vectors)</td> <td>1. Difficult to interpret when applying nonlinear kernels. 2. Suffers from too many examples; after 10,000 examples it starts taking too long to train</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C/R</td> <td>C</td> <td>N</td> <td>D</td> <td>K-Means</td> <td> </td> <td> </td> <td> </td> <td>K center points. K can be found by the Elbow Method or hierarchical clustering</td> <td>Error = Sum of Squared Errors (SSE) for each data point to its center</td> <td> </td> </tr> <tr> <td>C/R</td> <td>S</td> <td>P</td> <td>D</td> <td>Neural Networks / MLP (Multilayer Perceptron)</td> <td>Re-projects the data space repeatedly until convergence</td> <td>1. Can approximate any nonlinear function. 2. Robust to outliers.</td> <td>1. Very difficult to set up. 2. Difficult to tune because of too many parameters, and you also have to decide the architecture of the network. 3. Difficult to interpret. 4. 
Easy to overfit</td> <td>The number of layers, the number of neurons in each layer, the learning rate, regularization, dropout rate, batch size</td> <td> </td> <td> </td> </tr> <tr> <td>C/R</td> <td>S</td> <td>N</td> <td>D</td> <td>K-nearest Neighbors</td> <td>The prediction is based on the k closest training examples in the feature space</td> <td>1. Fast. 2. Lazy training. 3. Can naturally handle extreme multiclass problems (like tagging text)</td> <td>1. Slow and cumbersome in the prediction phase (the model has to carry the data with it). 2. Can fail to predict correctly due to the curse of dimensionality (the volume of the space increases so fast that the available data become sparse)</td> <td> </td> <td> </td> <td> </td> </tr>
<tr> <td>C</td> <td>S</td> <td>P</td> <td>D</td> <td>Perceptron</td> <td>Binary classifier. Assumes the training data is linearly separable; if it is, the algorithm stops in a finite number of steps.</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>C</td> <td>U</td> <td>N</td> <td>D</td> <td>PCA</td> <td>PCA is limited to re-expressing the data as a linear combination of its basis vectors to best express the data mean</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <h3 id="reference">Reference:</h3> <ol> <li> <p><a href="https://www.dummies.com/programming/big-data/data-science/machine-learning-dummies-cheat-sheet/">Machine Learning For Dummies Cheat Sheet</a></p> </li> <li> <p><a href="https://blog.dataiku.com/machine-learning-explained-algorithms-are-your-friend">Machine Learning Explained: Algorithms Are Your Friend</a></p> </li> <li> <p><a href="https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6">Cheat Sheet of Machine Learning and Python (and Math) Cheat Sheets</a></p> </li> <li> <p><a href="https://www.cs.toronto.edu/~frossard/post/linear_regression/">Linear Regression</a></p> </li> <li> <p><a href="http://www.davidsbatista.net/blog/2017/11/11/HHM_and_Naive_Bayes/">Hidden Markov Model and Naive Bayes relationship</a></p> </li> <li> <p><a href="http://www.davidsbatista.net/blog/2017/11/12/Maximum_Entropy_Markov_Model/">Maximum Entropy Markov Models and Logistic Regression</a></p> </li> <li> <p><a href="http://www.davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/">Conditional Random Fields for Sequence Prediction</a></p> </li> <li> <p><a href="http://cnyah.com/2017/08/26/from-naive-bayes-to-linear-chain-CRF/">From Naive 
Bayes to Linear-chain CRF</a></p> </li> </ol> <p><img src="http://cnyah.com/2017/08/26/from-naive-bayes-to-linear-chain-CRF/transforms.png" alt="Comparison between Naive Bayes, Logistic Regression, HMM, and CRF" /></p>Cheng-Lin LiHow to setup Google Colaboratory to get free GPU and integrate with Google drive?2018-04-04T00:00:00+00:002018-04-04T00:00:00+00:00https://cheng-lin-li.github.io/2018/04/04/Google_Colaboratory<p><img src="/images/2018-04-04.svg" alt="Google Colaboratory" /></p> <p><a href="https://github.com/Cheng-Lin-Li/Cheng-Lin-Li.github.io/blob/master/resources/2018-04-04/GoogleColaboratoryNotebookTemplate.ipynb">You can view my Colaboratory Notebook template with steps from here</a> or <a href="https://cdn.rawgit.com/Cheng-Lin-Li/Cheng-Lin-Li.github.io/master/resources/2018-04-04/GoogleColaboratoryNotebookTemplate.ipynb">download this Google Colaboratory notebook template directly</a>.</p> <p>In one sentence, Google Colaboratory is an integrated environment with GPU computing power for people who are new to machine learning.</p> <p>This post shares some tips from my personal experience for those who are just getting started.</p> <!-- more --> <hr /> <p>Table of content:</p> <ul class="table-of-content" id="markdown-toc"> <li><a href="#why-do-we-need-gpu" id="markdown-toc-why-do-we-need-gpu">Why do we need GPU?</a> <ul> <li><a href="#high-dimensional-computions-are-major-operations-in-deep-learning-models" id="markdown-toc-high-dimensional-computions-are-major-operations-in-deep-learning-models">High dimensional computations are major operations in deep learning models.</a></li> </ul> </li> <li><a href="#what-is-google-colaboratory" id="markdown-toc-what-is-google-colaboratory">What is Google Colaboratory?</a></li> <li><a href="#how-does-it-works" id="markdown-toc-how-does-it-works">How does it work?</a> <ul> <li><a href="#pros" id="markdown-toc-pros">Pros</a></li> <li><a href="#cons-as-of-today" 
id="markdown-toc-cons-as-of-today">Cons (as of today)</a></li> </ul> </li> <li><a href="#how-can-we-create-a-google-colaboratory-notebook" id="markdown-toc-how-can-we-create-a-google-colaboratory-notebook">How can we create a Google Colaboratory notebook?</a></li> <li><a href="#how-does-it-work-with-google-drive" id="markdown-toc-how-does-it-work-with-google-drive">How does it work with Google drive?</a> <ul> <li><a href="#mount-google-drive-into-virtual-machine-of-colaboratory" id="markdown-toc-mount-google-drive-into-virtual-machine-of-colaboratory">Mount Google drive into virtual machine of Colaboratory.</a></li> <li><a href="#save-your-model-parameters-into-files" id="markdown-toc-save-your-model-parameters-into-files">Save your model parameters into files.</a></li> <li><a href="#load-your-model-parameters-from-local-files" id="markdown-toc-load-your-model-parameters-from-local-files">Load your model parameters from local files.</a></li> <li><a href="#copy-multiple-python-objects-from-colaboratory-to-google-drive" id="markdown-toc-copy-multiple-python-objects-from-colaboratory-to-google-drive">Copy multiple python objects from colaboratory to Google drive.</a></li> <li><a href="#load-multiple-python-objects-from-google-drive" id="markdown-toc-load-multiple-python-objects-from-google-drive">Load multiple python objects from Google drive.</a></li> </ul> </li> <li><a href="#reference" id="markdown-toc-reference">Reference:</a></li> </ul> <hr /> <h2 id="why-do-we-need-gpu">Why do we need GPU?</h2> <h3 id="high-dimensional-computions-are-major-operations-in-deep-learning-models">High dimensional computations are major operations in deep learning models.</h3> <p>Deep learning relies heavily on GPUs to speed up high dimensional computations. 
You may need only 20 minutes to train a model with a GPU, while the same task may take 2 hours on CPU alone.</p> <h2 id="what-is-google-colaboratory">What is Google Colaboratory?</h2> <p>Google Colaboratory is a free cloud service with GPU.</p> <h2 id="how-does-it-works">How does it work?</h2> <p>Please refer to the diagram above.</p> <p>When you (the client) run a Google Colaboratory notebook (which is a Python Jupyter notebook) from Google Drive, Google creates a new virtual machine to host and execute it. Each notebook runs in its own virtual machine.</p> <p>That’s why your notebook cannot directly read data files located on Google Drive. You have to either upload your data files to the virtual machine or mount a specific Google Drive folder on the virtual machine so that it can be accessed like a local drive.</p> <h3 id="pros">Pros</h3> <ol> <li>The CPU / GPU resource is free. Currently, the environment provides one Tesla K80 GPU.</li> <li>The environment is well integrated with popular machine learning libraries. <blockquote> <p>Tensorflow, Keras, xgboost, numpy, pandas, scikit-learn, beautifulsoup, opencv-python …etc.</p> </blockquote> </li> </ol> <h3 id="cons-as-of-today">Cons (as of today)</h3> <ol> <li>Limited resources. <blockquote> <p>a. Only around 12GB of free memory. In most cases you will run out of memory when training a deep learning model on a huge data set.</p> <p>b. 50GB of hard drive space.</p> <p>c. One core of an Intel(R) Xeon(R) CPU @ 2.30GHz.</p> </blockquote> </li> <li>Connection time is limited to 12 hours. <blockquote> <p>You can use the GPU as a back-end for 12 hours at a time. After that, the connection will be lost and Google will relaunch your notebook in a NEW virtual machine environment for the next 12 hours. 
So all data stored in the previous virtual machine is gone unless you dump your model parameters to local files and copy them to Google Drive.</p> </blockquote> </li> <li>Unstable on huge tasks <blockquote> <p>Sometimes, the notebook just dies during the training. There may be many underlying causes for this, but running out of memory is the major reason in my cases.</p> </blockquote> </li> </ol> <h2 id="how-can-we-create-a-google-colaboratory-notebook">How can we create a Google Colaboratory notebook?</h2> <ol> <li> <p>Let’s create a folder under Google drive, say ‘workspace’. <img src="https://cheng-lin-li.github.io/images/2018-04-04/create_folder.png" alt="Create folder" /></p> </li> <li> <p>Change your current folder to the ‘workspace’ folder you just created. Now it’s time to create your Google Colaboratory notebook by right-clicking in the folder and selecting ‘Colaboratory’. <img src="https://cheng-lin-li.github.io/images/2018-04-04/create_file.png" alt="Image of create folder" /></p> </li> <li> <p>Enable GPU. Follow Edit &gt; Notebook settings &gt; Change runtime type (or Runtime &gt; Change runtime type), then select GPU as the hardware accelerator. <img src="https://cheng-lin-li.github.io/images/2018-04-04/enable_gpu.png" alt="Enable GPU" /></p> </li> </ol> <h2 id="how-does-it-work-with-google-drive">How does it work with Google drive?</h2> <p>If you want to get the most out of the platform, you will need to create a local mount point folder on the notebook’s virtual machine and map it to a corresponding folder on Google Drive.</p> <p>Here is a sample procedure.</p> <ol> <li>Download the software needed for authentication.</li> <li>Create authentication tokens for Colaboratory and grant this session access to Google Drive. 
<blockquote> <p>You need to grant access to “Google Cloud SDK”, and the notebook will leverage the SDK to access your “Google Drive”.</p> </blockquote> </li> </ol> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download necessary software</span>
<span class="err">!</span><span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">install</span> <span class="o">-</span><span class="n">y</span> <span class="o">-</span><span class="n">qq</span> <span class="n">software</span><span class="o">-</span><span class="n">properties</span><span class="o">-</span><span class="n">common</span> <span class="n">python</span><span class="o">-</span><span class="n">software</span><span class="o">-</span><span class="n">properties</span> <span class="n">module</span><span class="o">-</span><span class="n">init</span><span class="o">-</span><span class="n">tools</span>
<span class="err">!</span><span class="n">add</span><span class="o">-</span><span class="n">apt</span><span class="o">-</span><span class="n">repository</span> <span class="o">-</span><span class="n">y</span> <span class="n">ppa</span><span class="p">:</span><span class="n">alessandro</span><span class="o">-</span><span class="n">strada</span><span class="o">/</span><span class="n">ppa</span> <span class="mi">2</span><span class="o">&gt;&amp;</span><span class="mi">1</span> <span class="o">&gt;</span> <span class="o">/</span><span class="n">dev</span><span class="o">/</span><span class="n">null</span>
<span class="err">!</span><span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">update</span> <span class="o">-</span><span class="n">qq</span> <span class="mi">2</span><span class="o">&gt;&amp;</span><span class="mi">1</span> <span class="o">&gt;</span> <span class="o">/</span><span class="n">dev</span><span class="o">/</span><span class="n">null</span>
<span class="err">!</span><span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="o">-</span><span class="n">y</span> <span class="n">install</span> <span class="o">-</span><span class="n">qq</span> <span class="n">google</span><span class="o">-</span><span class="n">drive</span><span class="o">-</span><span class="n">ocamlfuse</span> <span class="n">fuse</span>
<span class="c"># Generate auth tokens for Colab</span>
<span class="kn">from</span> <span class="nn">google.colab</span> <span class="kn">import</span> <span class="n">auth</span>
<span class="n">auth</span><span class="o">.</span><span class="n">authenticate_user</span><span class="p">()</span>
<span class="c"># Generate creds for the Drive FUSE library.</span>
<span class="kn">from</span> <span class="nn">oauth2client.client</span> <span class="kn">import</span> <span class="n">GoogleCredentials</span>
<span class="n">creds</span> <span class="o">=</span> <span class="n">GoogleCredentials</span><span class="o">.</span><span class="n">get_application_default</span><span class="p">()</span>
<span class="kn">import</span> <span class="nn">getpass</span>
<span class="err">!</span><span class="n">google</span><span class="o">-</span><span class="n">drive</span><span class="o">-</span><span class="n">ocamlfuse</span> <span class="o">-</span><span class="n">headless</span> <span class="o">-</span><span class="nb">id</span><span class="o">=</span><span class="p">{</span><span class="n">creds</span><span class="o">.</span><span class="n">client_id</span><span class="p">}</span> <span class="o">-</span><span class="n">secret</span><span class="o">=</span><span class="p">{</span><span class="n">creds</span><span class="o">.</span><span class="n">client_secret</span><span class="p">}</span> <span class="o">&lt;</span> <span class="o">/</span><span class="n">dev</span><span class="o">/</span><span class="n">null</span> <span class="mi">2</span><span class="o">&gt;&amp;</span><span class="mi">1</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">URL</span>
<span class="n">vcode</span> <span class="o">=</span> <span class="n">getpass</span><span class="o">.</span><span class="n">getpass</span><span class="p">()</span>
<span class="err">!</span><span class="n">echo</span> <span class="p">{</span><span class="n">vcode</span><span class="p">}</span> <span class="o">|</span> <span class="n">google</span><span class="o">-</span><span class="n">drive</span><span class="o">-</span><span class="n">ocamlfuse</span> <span class="o">-</span><span class="n">headless</span> <span class="o">-</span><span class="nb">id</span><span class="o">=</span><span class="p">{</span><span class="n">creds</span><span class="o">.</span><span class="n">client_id</span><span class="p">}</span> <span class="o">-</span><span class="n">secret</span><span class="o">=</span><span class="p">{</span><span class="n">creds</span><span class="o">.</span><span class="n">client_secret</span><span class="p">}</span>
</code></pre></div></div> <p>At the end of the output, click the link to open a new page and get a token string, copy the string, paste it into the blank field in the notebook, and press Enter. Then repeat the procedure once more to grant the second privilege.</p> <h3 id="mount-google-drive-into-virtual-machine-of-colaboratory">Mount Google drive into virtual machine of Colaboratory.</h3> <p>Assume you have created a folder “workspace” on your Google Drive. 
The script below mounts Google Drive at a drive folder on the notebook’s virtual machine, so your workspace folder appears as drive/workspace. It then copies all data files from that folder to the virtual machine’s starting directory for your next tasks.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">mkdir</span> <span class="o">-</span><span class="n">p</span> <span class="n">drive</span>
<span class="err">!</span><span class="n">google</span><span class="o">-</span><span class="n">drive</span><span class="o">-</span><span class="n">ocamlfuse</span> <span class="o">-</span><span class="n">o</span> <span class="n">nonempty</span> <span class="n">drive</span>
<span class="err">!</span><span class="n">pwd</span>
<span class="err">!</span><span class="n">ls</span>
<span class="c"># Each "!" command runs in its own subshell, so "!cd drive" would not persist; list the mount point directly.</span>
<span class="err">!</span><span class="n">ls</span> <span class="n">drive</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="o">.</span><span class="n">chdir</span><span class="p">(</span><span class="s">"drive/workspace"</span><span class="p">)</span>
<span class="err">!</span><span class="n">ls</span>
<span class="err">!</span><span class="n">cp</span> <span class="o">-</span><span class="n">R</span> <span class="o">*</span> <span class="o">../../</span>
<span class="n">os</span><span class="o">.</span><span class="n">chdir</span><span class="p">(</span><span class="s">"../../"</span><span class="p">)</span>
<span class="err">!</span><span class="n">ls</span> <span class="o">-</span><span class="n">rlt</span>
</code></pre></div></div> <blockquote> <p>Note that you can run a Unix command on the virtual machine by prefixing it with ‘!’. 
Use os.chdir(“target_folder”) to actually switch your working directory to target_folder, because a ‘!cd’ command only affects its own subshell.</p> </blockquote> <h3 id="save-your-model-parameters-into-files">Save your model parameters into files.</h3> <p>Assume you have built a model with Keras, a high-level API that can run on top of TensorFlow. You can save the model with the command below.</p> <p>I personally prefer to save the file on the virtual machine first and then copy it to Google Drive through the mount point folder. Of course, you can save the model directly to “./drive/workspace/lstm_model.h5”, but that may take longer.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"lstm_model.h5"</span><span class="p">)</span>
<span class="err">!</span><span class="n">cp</span> <span class="n">lstm_model</span><span class="o">.</span><span class="n">h5</span> <span class="o">./</span><span class="n">drive</span><span class="o">/</span><span class="n">workspace</span>
</code></pre></div></div> <h3 id="load-your-model-parameters-from-local-files">Load your model parameters from local files.</h3> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">load_model</span>
<span class="err">!</span><span class="n">cp</span> <span class="o">./</span><span class="n">drive</span><span class="o">/</span><span class="n">workspace</span><span class="o">/</span><span class="n">lstm_model</span><span class="o">.</span><span class="n">h5</span> <span class="o">.</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">load_model</span><span class="p">(</span><span class="s">"lstm_model.h5"</span><span class="p">)</span>
</code></pre></div></div> <h3 
id="copy-multiple-python-objects-from-colaboratory-to-google-drive">Copy multiple python objects from colaboratory to Google drive.</h3> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">gzip</span>
<span class="kn">import</span> <span class="nn">pickle</span>

<span class="c"># Create two objects</span>
<span class="n">userlist</span> <span class="o">=</span> <span class="p">[</span><span class="s">'userlist'</span><span class="p">]</span>
<span class="n">word_index</span> <span class="o">=</span> <span class="p">[</span><span class="s">'wordindex'</span><span class="p">]</span>
<span class="c"># Dump to virtual machine; the with block closes the file before it is copied</span>
<span class="k">with</span> <span class="n">gzip</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"email_words_test.pkl"</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">((</span><span class="n">userlist</span><span class="p">,</span> <span class="n">word_index</span><span class="p">),</span> <span class="n">f</span><span class="p">)</span>
<span class="c"># copy to google drive</span>
<span class="err">!</span><span class="n">cp</span> <span class="n">email_words_test</span><span class="o">.</span><span class="n">pkl</span> <span class="o">./</span><span class="n">drive</span><span class="o">/</span><span class="n">workspace</span>
</code></pre></div></div> <h3 id="load-multiple-python-objects-from-google-drive">Load multiple python objects from Google drive.</h3> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">cp</span> <span class="o">./</span><span class="n">drive</span><span class="o">/</span><span class="n">workspace</span><span class="o">/</span><span class="n">email_words_test</span><span class="o">.</span><span class="n">pkl</span> <span class="o">.</span>
<span class="p">(</span><span class="n">userlist</span><span class="p">,</span> <span 
class="n">word_index</span><span class="p">)</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">gzip</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"email_words_test.pkl"</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">))</span> </code></pre></div></div> <h2 id="reference">Reference:</h2> <ol> <li>fuat, <a href="https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d">Google Colab Free GPU Tutorial</a>.</li> </ol>Cheng-Lin LiWhat are differences between AI, Machine Learning, and Deep Learning?2018-03-29T00:00:00+00:002018-03-29T00:00:00+00:00https://cheng-lin-li.github.io/2018/03/29/AI_MachineLearning_DeepLearning<p><img src="/images/2018-03-29-AI.svg" alt="AI, machine learning, deep learning, and others" /></p> <p>Deep learning ⊂ Machine learning ⊂ Artificial Intelligence</p> <p>In one sentence, deep learning is a subset of machine learning in Artificial Intelligence (AI).</p> <!-- more --> <hr /> <p>Table of content:</p> <ul class="table-of-content" id="markdown-toc"> <li><a href="#artificial-intelligence-ai" id="markdown-toc-artificial-intelligence-ai">Artificial Intelligence (AI)</a> <ul> <li><a href="#what-is-artificial-intelligence-" id="markdown-toc-what-is-artificial-intelligence-">What is Artificial Intelligence ?</a></li> </ul> </li> <li><a href="#machine-learning" id="markdown-toc-machine-learning">Machine Learning</a> <ul> <li><a href="#machine-learning-is-a-subset-of-ai" id="markdown-toc-machine-learning-is-a-subset-of-ai">Machine learning is a subset of AI</a> <ul> <li><a href="#two-categories-to-aggregate-machine-learning-algorithms" id="markdown-toc-two-categories-to-aggregate-machine-learning-algorithms">Two categories to aggregate machine learning algorithms:</a></li> <li><a href="#three-types-of-machine-learning-algorithms" 
id="markdown-toc-three-types-of-machine-learning-algorithms">Three types of machine learning algorithms:</a></li> </ul> </li> </ul> </li> <li><a href="#deep-learning" id="markdown-toc-deep-learning">Deep Learning</a> <ul> <li><a href="#deep-learning-is-a-subset-of-machine-learning" id="markdown-toc-deep-learning-is-a-subset-of-machine-learning">Deep learning is a subset of machine learning.</a></li> </ul> </li> <li><a href="#reference" id="markdown-toc-reference">Reference:</a></li> </ul> <hr /> <h2 id="artificial-intelligence-ai">Artificial Intelligence (AI)</h2> <h3 id="what-is-artificial-intelligence-">What is Artificial Intelligence ?</h3> <p>According to the definition by the Association for the Advancement of Artificial Intelligence (AAAI):</p> <p>“The scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines.”</p> <p>In other words, A.I. attempts to build intelligent entities that perceive, understand, predict, and manipulate a world far larger and more complicated than themselves.</p> <p>Traditionally, there are four approaches to defining A.I.:</p> <ol> <li>Acting humanly: The Turing Test approach. See <a href="https://en.wikipedia.org/wiki/Turing_test">the Turing Test on Wikipedia</a>. The main goal is for a computer to possess the following capabilities: a. natural language processing b. knowledge representation c. automated reasoning d. machine learning.</li> <li>Thinking humanly: The cognitive modeling approach. It starts from determining how humans think. 
Cognitive science brings together computer models from AI and experimental techniques from psychology to construct precise and testable theories of the human mind.</li> <li> <p>Thinking rationally: The “laws of thought” approach. From the Greek philosopher Aristotle’s <a href="https://en.wikipedia.org/wiki/Syllogism">syllogisms</a> to the field of logic, this approach aims to develop irrefutable reasoning processes, in the hope that programs could, in principle, solve any solvable problem described in logical notation. In the “laws of thought” approach to AI, the emphasis was on correct inferences.</p> <p>There are two main obstacles to this approach. First, it is not easy to take informal knowledge and state it in the formal terms required by logical notation. Second, there is a big difference between solving a problem “in principle” and solving it in practice.</p> </li> <li> <p>Acting rationally: The rational agent approach, the mainstream approach as of today. An agent is something that acts autonomously, perceives its environment, persists over a prolonged time period, adapts to change, and creates and pursues goals to achieve the best outcome or, when there is uncertainty, the best expected outcome.</p> <p>The rational-agent approach has two advantages over the other approaches. First, it is more general than the “laws of thought” approach because correct inference is just one of several possible mechanisms for achieving rationality. Second, it is more amenable to scientific development than are approaches based on human behavior or human thought. 
The standard of rationality is mathematically well defined and completely general, and can be “unpacked” to generate agent designs that provably achieve it.</p> <p>One important caveat is that achieving perfect rationality (always doing the right thing) may not be feasible in complicated environments because the computations take too long. This approach should therefore also consider limited rationality: acting appropriately when there is not enough time to do all the computations.</p> </li> </ol> <p>The following areas are the foundations of A.I.:</p> <ol> <li>Philosophy</li> <li>Mathematics</li> <li>Economics</li> <li>Neuroscience</li> <li>Psychology</li> <li>Computer engineering</li> <li>Control theory and cybernetics</li> <li>Linguistics</li> </ol> <h2 id="machine-learning">Machine Learning</h2> <h3 id="machine-learning-is-a-subset-of-ai">Machine learning is a subset of AI</h3> <p>Machine learning gives an agent the ability to “learn” (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed. An agent is learning if it improves its performance on future tasks after making observations about the world.</p> <h4 id="two-categories-to-aggregate-machine-learning-algorithms">Two categories to aggregate machine learning algorithms:</h4> <ol> <li> <p>Deductive learning: goes from generalized rules to correct knowledge. Pros: 1. Logical inference generates entailed statements. 2. Probabilistic reasoning can lead to updated belief states. Cons: 1. We often have insufficient knowledge for inference.</p> </li> <li> <p>Inductive learning: goes from examples or activities to generalized rules. It may arrive at incorrect conclusions. Pros: 1. Learning can be better than not trying to learn at all. Cons: 1. It may reach a local optimum rather than the global optimum. 2. 
Overfitting issues.</p> </li> </ol> <h4 id="three-types-of-machine-learning-algorithms">Three types of machine learning algorithms:</h4> <p>If you want to split machine learning into three types of learning, they are supervised learning, unsupervised learning, and reinforcement learning.</p> <ol> <li> <p>Supervised Learning: Algorithms find a solution based on labeled sample data.</p> <p>Regression tasks and most classification tasks rely on labeled data that is fed into the algorithms. For example, a machine can classify cats, dogs, cars, and so on in an image after being trained on thousands of labeled images.</p> </li> <li> <p>Unsupervised Learning: Algorithms give an answer based on unlabeled data.</p> <p>How can an algorithm learn from “unlabeled” data? Basically, this kind of algorithm groups data into different sets (clusters) based on the “distance” between the features of the data points. Clustering and dimensionality reduction algorithms are typical examples of unsupervised learning.</p> <p>Is unsupervised learning useful? Of course it is. For instance, when a keyword search returns tons of websites, grouping similar pages together is a task that unsupervised learning algorithms can handle.</p> </li> <li> <p>Reinforcement Learning: Algorithms learn the rules/answers based on long-term rewards.</p> <p>You may not know the reinforcement learning approach, but you have probably heard that AlphaGo beat the human champion in 2017. Yes, it used reinforcement learning to explore new ways to play the game.</p> <p>Given reward functions and the environment’s states, the agent will choose the action that maximizes rewards or explore new possibilities.</p> </li> </ol> <h2 id="deep-learning">Deep Learning</h2> <h3 id="deep-learning-is-a-subset-of-machine-learning">Deep learning is a subset of machine learning.</h3> <p>Deep learning focuses on artificial neural networks (ANNs) with multiple (deep) layers in different combinations. 
This kind of algorithm dominates computer vision, sound recognition, machine translation, etc.</p> <p>Researchers construct different architectures that connect units with activation functions (which simulate the behavior of neurons). These architectures project data from the original problem space to a new solution space in which the answer can be found.</p> <p>These algorithms convert all inputs (images, voice, text) into high-dimensional arrays of numbers for computation. Those matrix operations take a lot of time and computing power on a CPU. Powerful graphics processing units (GPUs) developed in recent years have made deep learning practical and revealed the power of these algorithms.</p> <p>The research competition in this area is not only about algorithm design but also about computing power. You don’t want to wait one week to see the result of an experiment. That’s why a GPU card (or high-end graphics card) is very important for machine learning researchers today.</p> <p>I don’t have a GPU card, so how can I do deep learning research? The good news is that Google provides a free (so far) GPU development environment called Google Colaboratory, with some limitations. 
You may <a href="https://colab.research.google.com/notebooks/welcome.ipynb#recent=true">click here to try it</a> or <a href="https://research.google.com/colaboratory/faq.html#browsers">click here for more detail</a>.</p> <h2 id="reference">Reference:</h2> <ol> <li><a href="http://aima.cs.berkeley.edu/">Stuart Russell, Peter Norvig, Artificial Intelligence - A Modern Approach, Third Edition.</a></li> <li><a href="https://web.stanford.edu/~hastie/Papers/ESLII.pdf">Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning - Data Mining, Inference, and Prediction</a></li> <li><a href="https://www.springer.com/us/book/9780387310732">Christopher Bishop, Pattern Recognition and Machine Learning</a></li> <li><a href="http://www.deeplearningbook.org/">Ian Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning</a></li> </ol> <p>Revised on April 1, 2018 for machine learning and deep learning. Revised on April 30, 2018 to include table of content.</p>Cheng-Lin LiIt’s all about the Math !!2018-03-14T00:00:00+00:002018-03-14T00:00:00+00:00https://cheng-lin-li.github.io/2018/03/14/Math<!-- more --> <hr /> <p>Table of content:</p> <ul class="table-of-content" id="markdown-toc"> <li><a href="#maths-are-the-foundation-of-science" id="markdown-toc-maths-are-the-foundation-of-science">Maths are the foundation of science</a> <ul> <li><a href="#1-calculus" id="markdown-toc-1-calculus">1. Calculus</a></li> <li><a href="#2-probability-and-statistics" id="markdown-toc-2-probability-and-statistics">2. Probability and Statistics</a></li> <li><a href="#3-linear-algebra--discrete-mathematics" id="markdown-toc-3-linear-algebra--discrete-mathematics">3. Linear Algebra / Discrete Mathematics</a></li> </ul> </li> </ul> <hr /> <h2 id="maths-are-the-foundation-of-science">Maths are the foundation of science</h2> <p>Mathematics is the foundation of science. You may need to review the areas below if you really want to become an expert data scientist. 
There are many easy-to-use packages, libraries, and examples to help us with machine learning and natural language processing development. The real question is: which model or algorithm should I choose? How do you know whether an algorithm or model fits your problem?</p> <p>The answer is to know the assumptions behind each model and algorithm. Knowledge of the math in the areas below will definitely help you fully understand those assumptions. You will know not only how but also why you choose a model to deal with your own problems.</p> <p>Here are suggested topics you may need to know now (or maybe in the future :) ).</p> <h3 id="1-calculus">1. Calculus</h3> <blockquote> <p>1-0. Reference material: <a href="http://ml-cheatsheet.readthedocs.io/en/latest/calculus.html">Machine Learning Cheatsheet - Calculus</a></p> <p>1-1. Limits</p> <p>The foundation of differential and integral calculus.</p> <p>1-2. Taylor Series</p> <p>A Taylor series approximates a function by a polynomial. Some machine learning algorithms use an exponential function as the target function because of its nice properties (it is differentiable everywhere). Because working with the exponential function directly can be complicated, we can use a Taylor series to approximate it around a specific point. Once you understand the technique behind those algorithms, you will understand why the learning rate has to be a small number (a small step): the approximation is only accurate near that point.</p> <p>1-3. Differential calculus</p> <p>How do you find the optimal (minimum or maximum) value of a function? What is the gradient descent optimizer? It’s all about differential calculus.</p> <p>1-4. Integral calculus.</p> <p>Probabilistic models are among the most important models in machine learning. Integral calculus helps us compute the expectations of our models.</p> </blockquote> <h3 id="2-probability-and-statistics">2. Probability and Statistics</h3> <blockquote> <p>2-1. 
Probabilities and Expectations</p> <p>Gaussian models, Bayes’ theorem, Naive Bayes, Markov chains, hidden Markov models, the Viterbi algorithm, etc. All of these are related to probability and statistics.</p> <blockquote> <p>2-1-1. <a href="http://cs229.stanford.edu/section/cs229-prob.pdf">Review of Probability Theory at Stanford CS229 machine learning</a></p> <p>2-1-2. Distributions and Tests</p> <p>You will need these tools to check whether your data distribution matches your assumptions.</p> </blockquote> </blockquote> <h3 id="3-linear-algebra--discrete-mathematics">3. Linear Algebra / Discrete Mathematics</h3> <blockquote> <p>3-1 <a href="http://cs229.stanford.edu/section/cs229-linalg.pdf">Linear Algebra Review and Reference at Stanford CS229 machine learning</a></p> <p>3-2 <a href="https://hadrienj.github.io/posts/Deep-Learning-Book-Series-Introduction/">Deep Learning Book Series · Introduction</a></p> <p>Most machine learning algorithms involve high-dimensional computations. A one-dimensional array is a vector, a two-dimensional array is called a matrix, and an array with three or more dimensions is called a tensor. Linear algebra helps us express and compute high-dimensional operations efficiently in a compact form. 
A GPU is designed for 3D computer graphics, and its hardware also helps with the high-dimensional computations in deep learning.</p> </blockquote>Cheng-Lin LiHello Data !!2018-03-13T00:00:00+00:002018-03-13T00:00:00+00:00https://cheng-lin-li.github.io/2018/03/13/Hello-Data<!-- more --> <hr /> <p>Table of content:</p> <ul class="table-of-content" id="markdown-toc"> <li><a href="#what-is-the-major-focus-of-this-blog" id="markdown-toc-what-is-the-major-focus-of-this-blog">What is the major focus of this blog?</a></li> <li><a href="#reference-material" id="markdown-toc-reference-material">Reference material:</a> <ul> <li><a href="#artificial-intelligence-ai" id="markdown-toc-artificial-intelligence-ai">Artificial intelligence (AI)</a></li> <li><a href="#machine-learning-ml" id="markdown-toc-machine-learning-ml">Machine Learning (ML)</a></li> <li><a href="#natural-language-processingnlp" id="markdown-toc-natural-language-processingnlp">Natural Language Processing(NLP)</a></li> <li><a href="#statistics" id="markdown-toc-statistics">Statistics:</a></li> </ul> </li> <li><a href="#general-topics" id="markdown-toc-general-topics">General topics</a> <ul> <li><a href="#parametric-vs-nonparametric-methods" id="markdown-toc-parametric-vs-nonparametric-methods">Parametric vs. 
Nonparametric Methods.</a> <ul> <li><a href="#reference" id="markdown-toc-reference">reference:</a></li> </ul> </li> <li><a href="#generative--discriminative-models" id="markdown-toc-generative--discriminative-models">Generative &amp; Discriminative models:</a> <ul> <li><a href="#reference-1" id="markdown-toc-reference-1">Reference:</a></li> </ul> </li> <li><a href="#look-ahead-bias" id="markdown-toc-look-ahead-bias">Look-Ahead Bias</a> <ul> <li><a href="#reference-2" id="markdown-toc-reference-2">Reference:</a></li> </ul> </li> <li><a href="#ensemble-learning-to-improve-machine-learning-results" id="markdown-toc-ensemble-learning-to-improve-machine-learning-results">Ensemble Learning to Improve Machine Learning Results</a></li> <li><a href="#glorot-initialization-xavier-initialization" id="markdown-toc-glorot-initialization-xavier-initialization">Glorot initialization/ Xavier initialization</a> <ul> <li><a href="#references" id="markdown-toc-references">References:</a></li> </ul> </li> <li><a href="#he-initialization-for-the-more-recent-rectifying-nonlinearities-relu" id="markdown-toc-he-initialization-for-the-more-recent-rectifying-nonlinearities-relu">He initialization: For the more recent rectifying nonlinearities (ReLu)</a> <ul> <li><a href="#references-1" id="markdown-toc-references-1">References:</a></li> </ul> </li> <li><a href="#glove-global-vectors-for-word-representation" id="markdown-toc-glove-global-vectors-for-word-representation">GloVe: Global Vectors for Word Representation</a> <ul> <li><a href="#references-2" id="markdown-toc-references-2">References:</a></li> </ul> </li> <li><a href="#f_beta-score-an-easy-to-combine-precision-and-recall-measures" id="markdown-toc-f_beta-score-an-easy-to-combine-precision-and-recall-measures"><script type="math/tex">F_{\beta}</script> score: An easy to combine precision and recall measures</a></li> <li><a href="#symmetric-mean-absolute-percent-error-smape" 
id="markdown-toc-symmetric-mean-absolute-percent-error-smape">Symmetric Mean Absolute Percent Error (SMAPE)</a> <ul> <li><a href="#references-3" id="markdown-toc-references-3">References:</a></li> </ul> </li> <li><a href="#mean-absolute-percent-error-mape" id="markdown-toc-mean-absolute-percent-error-mape">Mean Absolute Percent Error (MAPE)</a> <ul> <li><a href="#references-4" id="markdown-toc-references-4">References:</a></li> </ul> </li> <li><a href="#mle-vs-map-the-connection-between-maximum-likelihood-and-maximum-a-posteriori-estimation" id="markdown-toc-mle-vs-map-the-connection-between-maximum-likelihood-and-maximum-a-posteriori-estimation">MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation</a> <ul> <li><a href="#references-5" id="markdown-toc-references-5">References:</a></li> </ul> </li> <li><a href="#the-exponential-family" id="markdown-toc-the-exponential-family">The exponential family:</a> <ul> <li><a href="#references-6" id="markdown-toc-references-6">References:</a></li> </ul> </li> <li><a href="#generalized-linear-model-glm" id="markdown-toc-generalized-linear-model-glm">Generalized Linear Model, GLM</a> <ul> <li><a href="#references-7" id="markdown-toc-references-7">References:</a></li> </ul> </li> <li><a href="#kullback-leibler-divergence-kl-divergence--information-gain--relative-entropy" id="markdown-toc-kullback-leibler-divergence-kl-divergence--information-gain--relative-entropy">Kullback-Leibler divergence (KL Divergence) / Information Gain / relative entropy</a></li> <li><a href="#learning-theory--vc-dimensionfor-vapnikchervonenkis-dimension" id="markdown-toc-learning-theory--vc-dimensionfor-vapnikchervonenkis-dimension">Learning Theory &amp; VC dimension(for Vapnik–Chervonenkis dimension)</a> <ul> <li><a href="#references-8" id="markdown-toc-references-8">References:</a></li> </ul> </li> <li><a href="#statistical-forecasting" id="markdown-toc-statistical-forecasting">Statistical forecasting</a> <ul> 
<li><a href="#arima-auto-regressive-integrated-moving-average" id="markdown-toc-arima-auto-regressive-integrated-moving-average">ARIMA (Auto-Regressive Integrated Moving Average)</a></li> </ul> </li> </ul> </li> <li><a href="#disclaimer" id="markdown-toc-disclaimer">Disclaimer</a> <ul> <li><a href="#external-links-disclaimer" id="markdown-toc-external-links-disclaimer">External links disclaimer</a></li> </ul> </li> <li><a href="#contact-information" id="markdown-toc-contact-information">Contact Information</a></li> </ul> <hr /> <h2 id="what-is-the-major-focus-of-this-blog">What is the major focus of this blog?</h2> <p>This is my learning notebook, which covers AI, Machine Learning, Big Data techniques, Knowledge Graph, Information Visualization and Natural Language Processing.</p> <p>I am a lifelong learner, passionate about contributing my knowledge to make an impact on the world. After twenty years working in IT application development departments on intranet, B2C and B2B eCommerce portals, and trading websites, I saw that the era of A.I. and data is coming. It is time to let data tell its story through A.I. and machine learning algorithms. That is why I resigned from J.P. Morgan Asset Management (Taiwan) in 2016 and went back to school as a graduate student in the Viterbi School of Engineering at the University of Southern California. My research area is data informatics, and I would like to share what I learn with everyone.</p> <p>I will use my spare time to enrich this notebook-style blog from time to time. 
Your comments are appreciated.</p> <hr /> <h2 id="reference-material">Reference material:</h2> <h3 id="artificial-intelligence-ai">Artificial intelligence (AI)</h3> <p>Textbook:</p> <ol> <li><a href="http://aima.cs.berkeley.edu/">Stuart Russell, Peter Norvig, Artificial Intelligence: A Modern Approach</a></li> </ol> <hr /> <h3 id="machine-learning-ml">Machine Learning (ML)</h3> <p>Textbook:</p> <ol> <li><a href="https://mitpress.mit.edu/books/introduction-machine-learning-0">Introduction to Machine Learning, 3rd Edition, Ethem Alpaydin</a></li> <li><a href="https://web.stanford.edu/~hastie/pub.htm">The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition), by Trevor Hastie, Robert Tibshirani and Jerome Friedman</a></li> <li><a href="https://www.microsoft.com/en-us/research/people/cmbishop/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fcmbishop%2Fprml%2F">Pattern Recognition And Machine Learning, Bishop</a></li> <li><a href="http://www.deeplearningbook.org/">Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville</a></li> </ol> <p>Articles &amp; Papers:</p> <ol> <li><a href="https://arxiv.org/abs/1801.05894v1">Deep Learning: An Introduction for Applied Mathematicians</a></li> <li><a href="https://arxiv.org/abs/1803.01164">The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches</a></li> </ol> <p>Training Material and Courses:</p> <ol> <li><a href="http://cs229.stanford.edu/syllabus.html">CS229: Machine Learning</a></li> <li><a href="https://ift6135h18.wordpress.com/">Representation learning in Montreal Institute for Learning Algorithms @Université de Montréal</a> (Deep Learning)</li> <li><a href="https://drive.google.com/drive/folders/0B3w765rOKuKANmxNbXdwaE1YU1k">Reinforcement Learning: An Introduction by Prof. Richard S. Sutton &amp; Andrew G. 
Barto @University of Alberta</a>, <a href="http://www.incompleteideas.net/book/the-book-2nd.html">OR try this alternative link.</a></li> <li><a href="https://sites.google.com/site/10715advancedmlintro2017f/lectures">Carnegie Mellon University - 10715 Advanced Introduction to Machine Learning: lectures</a></li> <li><a href="https://www.deeplearning.ai/">Deeplearning.ai, Andrew Ng, Introductory deep learning course.</a></li> </ol> <hr /> <h3 id="natural-language-processingnlp">Natural Language Processing(NLP)</h3> <p>Articles &amp; Papers:</p> <ol> <li><a href="https://www.deeplearningweekly.com/blog/demystifying-word2vec">Demystifying, word2vec</a></li> <li><a href="http://www.aclweb.org/anthology/A/A92/A92-1021.pdf">Brill (1992): A Simple Rule-Based Part of Speech Tagger</a></li> <li><a href="http://www.aclweb.org/anthology/W/W96/W96-0213.pdf">Ratnaparkhi (1996): A Maximum Entropy Model for Part-Of-Speech Tagging</a></li> <li><a href="https://repository.upenn.edu/cgi/viewcontent.cgi?referer=http://ron.artstein.org/csci544-2018/index.html&amp;httpsredir=1&amp;article=1162&amp;context=cis_papers">Lafferty, McCallum and Pereira (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</a></li> <li><a href="http://ieeexplore.ieee.org/document/536824/">Young (1996): A review of large-vocabulary continuous-speech recognition. 
IEEE Signal Processing Magazine 13(5): 45–57.</a></li> <li><a href="https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf">Sutskever, Vinyals and Le (2014): Sequence to Sequence Learning with Neural Networks</a></li> <li><a href="https://arxiv.org/pdf/1703.01619.pdf">Neubig (2017): Neural Machine Translation and Sequence-to-sequence Models: A Tutorial</a></li> <li><a href="https://aclweb.org/anthology/N/N13/N13-1090.pdf">Mikolov, Yih and Zweig (2013): Linguistic Regularities in Continuous Space Word Representations</a></li> <li><a href="https://aclweb.org/anthology/Q/Q15/Q15-1016.pdf">Levy, Goldberg and Dagan (2015): Improving Distributional Similarity with Lessons Learned from Word Embeddings.</a></li> </ol> <p>Training Material and Courses:</p> <ol> <li><a href="http://www.cs.jhu.edu/~jason/465/">Natural Language Processing (Fall 2017) by Prof. Jason Eisner @Johns Hopkins University</a></li> <li><a href="http://cs224d.stanford.edu/">Natural Language Processing with Deep Learning (Winter 2017) by Chris Manning &amp; Richard Socher @Stanford University: Material website</a>, and <a href="https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6">video link</a></li> </ol> <hr /> <h3 id="statistics">Statistics:</h3> <p><a href="http://users.stat.umn.edu/~helwig/teaching.html">Statistics and R by Nathaniel E. Helwig @University of Minnesota</a></p> <hr /> <h2 id="general-topics">General topics</h2> <blockquote> <p>I keep some technical notes in this section. I may write an article on each of them in the future.</p> </blockquote> <h4 id="parametric-vs-nonparametric-methods">Parametric vs. 
Nonparametric Methods.</h4> <h5 id="reference">Reference:</h5> <ol> <li><a href="http://aima.cs.berkeley.edu/">Stuart Russell, Peter Norvig, Artificial Intelligence: A Modern Approach</a> <ul> <li>Parametric Methods:</li> </ul> </li> </ol> <p>A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.</p> <p>The model does not grow with the data.</p> <p>Model examples:</p> <blockquote> <p>1-1. Linear regression</p> <p>1-2. Logistic regression</p> <p>1-3. Perceptron</p> <p>1-4. Naive Bayes</p> <p>1-5. …etc.</p> </blockquote> <ul> <li>Nonparametric Methods: Don’t summarize data into parameters.</li> </ul> <p>Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.</p> <p>The model grows with the data.</p> <p>Model examples:</p> <blockquote> <p>1-1. k-nearest neighbors</p> <p>1-2. Support Vector Machine</p> <p>1-3. 
Decision Tree: (CART and C4.5)</p> </blockquote> <hr /> <h4 id="generative--discriminative-models">Generative &amp; Discriminative models:</h4> <h5 id="reference-1">Reference:</h5> <ol> <li><a href="https://en.wikipedia.org/wiki/Generative_model">https://en.wikipedia.org/wiki/Generative_model</a></li> </ol> <ul> <li> <p>Generative models, also called joint distribution models.</p> <p>Generative learning algorithms assume there is a model that GENERATES the observable variable from the hidden (or target) variable, and that the hidden variable follows a distribution rather than taking a fixed value.</p> <p>Given an observable variable X and a target variable Y, a generative model is a statistical model of the joint probability distribution on X × Y, P(X, Y).</p> <ol> <li>Gaussian mixture model and other types of mixture model</li> <li>Hidden Markov model</li> <li>Probabilistic context-free grammar</li> <li>Naive Bayes</li> <li>Averaged one-dependence estimators</li> <li>Latent Dirichlet allocation</li> <li>Restricted Boltzmann machine</li> <li>Generative adversarial networks</li> </ol> </li> <li> <p>Discriminative models, also called conditional models.</p> <p>A discriminative model is a model of the conditional probability of the target Y given an observation x; symbolically, P(Y | X = x).</p> <p>Classifiers computed without using a probability model are also loosely referred to as “discriminative”.</p> <p>Algorithms that try to learn P(Y | X) directly from the given X (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0,1} (such as the perceptron algorithm), are called discriminative learning algorithms.</p> <ol> <li>Logistic regression, a type of generalized linear regression used for predicting binary or categorical outputs (also known as maximum entropy classifiers)</li> <li>Support vector machines</li> <li>Boosting (meta-algorithm)</li> <li>Conditional random fields</li> <li>Linear regression</li> <li>Neural 
networks</li> <li>Random forests</li> </ol> </li> </ul> <hr /> <h4 id="look-ahead-bias">Look-Ahead Bias</h4> <h5 id="reference-2">Reference:</h5> <ol> <li><a href="https://www.investopedia.com/terms/l/lookaheadbias.asp">https://www.investopedia.com/terms/l/lookaheadbias.asp</a></li> </ol> <p>Look-ahead bias occurs when a study or simulation uses information or data that would not have been known or available during the period being analyzed. This usually leads to inaccurate results. Look-ahead bias can sway simulation results closer into line with the desired outcome of the test.</p> <p>To avoid look-ahead bias when backtesting the performance of a trading strategy, it is vital that an investor only uses information that would have been available at the time of the trade. For example, if a trade is simulated based on information that was not available at the time of the trade, such as a quarterly earnings number that was released three months later, the backtest will misstate the trading strategy’s true performance and potentially bias the results in favor of the desired outcome. Look-ahead bias is one of many biases that must be accounted for when running simulations. Other common biases are:</p> <p>a. sample selection bias: Non-random sample of a population.</p> <p>b. time period bias: Early termination of a trial at a time when its results support a desired conclusion.</p> <p>c. 
survivorship/survival bias: The logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility.</p> <p>All of these biases have the potential to sway simulation results closer into line with the desired outcome of the simulation, as the input parameters of the simulation can be selected in such a way as to favor the desired outcome.</p> <hr /> <h4 id="ensemble-learning-to-improve-machine-learning-results">Ensemble Learning to Improve Machine Learning Results</h4> <p>Reference:</p> <ol> <li> <p>Vadim Smolyakov, <a href="https://blog.statsbot.co/ensemble-learning-d1dcd548e936">Ensemble Learning to Improve Machine Learning Results.</a></p> </li> <li> <p><a href="https://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning">Bagging, boosting and stacking in machine learning</a></p> </li> </ol> <p>Ensemble methods are meta-algorithms that combine several machine learning techniques into one model to improve performance:</p> <ol> <li> <p>bagging (decrease variance): bootstrap aggregation. Parallel ensemble: each model is built independently. a. One way to reduce the variance of an estimate is to average together multiple estimates. b. Bagging uses bootstrap sampling (sampling with replacement) to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.</p> </li> <li> <p>boosting (decrease bias): Sequential ensemble: try to add new models that do well where previous models fall short. a. Boosting refers to a family of algorithms that are able to convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data. 
More weight is given to examples that were misclassified in earlier rounds. b. A two-step approach: first use subsets of the original data to produce a series of averagely performing models, then “boost” their performance by combining them using a particular cost function (majority vote for classification or a weighted sum for regression). Unlike bagging, in classical boosting the subset creation is not random and depends on the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models.</p> </li> <li> <p>stacking (improve predictions): Sequential ensemble: stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. a. The base-level models are trained on the complete training set, then the meta-model is trained using the outputs of the base-level models as features.</p> </li> </ol> <hr /> <h4 id="glorot-initialization-xavier-initialization">Glorot initialization / Xavier initialization</h4> <h5 id="references">References:</h5> <ol> <li><a href="http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization">http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization</a></li> <li><a href="https://jamesmccaffrey.wordpress.com/2017/06/21/neural-network-glorot-initialization/">https://jamesmccaffrey.wordpress.com/2017/06/21/neural-network-glorot-initialization/</a></li> </ol> <p>Glorot initialization: it helps signals reach deep into the network.</p> <p>a. If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.</p> <p>b. 
If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.</p> <p>Formula: <script type="math/tex">Var(W) = \frac{1}{n_{in}}</script></p> <p>where W is the initialization distribution for the neuron in question, and <script type="math/tex">n_{in}</script> is the number of neurons feeding into it. The distribution used is typically Gaussian or uniform.</p> <p>It’s worth mentioning that Glorot &amp; Bengio’s paper originally recommended using: <script type="math/tex">Var(W) = \frac{2}{(n_{in}+n_{out})}</script> where <script type="math/tex">n_{out}</script> is the number of neurons the result is fed to.</p> <hr /> <h4 id="he-initialization-for-the-more-recent-rectifying-nonlinearities-relu">He initialization: For the more recent rectifying nonlinearities (ReLU)</h4> <h5 id="references-1">References:</h5> <ol> <li><a href="https://arxiv.org/abs/1502.01852">https://arxiv.org/abs/1502.01852</a></li> </ol> <p>Formula: <script type="math/tex">Var(W) = \frac{2}{n_{in}}</script></p> <p>This makes sense: a rectified linear unit outputs zero for half of its input range, so you need to double the weight variance to keep the signal’s variance constant.</p> <hr /> <h4 id="glove-global-vectors-for-word-representation">GloVe: Global Vectors for Word Representation</h4> <h5 id="references-2">References:</h5> <ol> <li> <p><a href="https://nlp.stanford.edu/projects/glove/">Jeffrey Pennington, Richard Socher, Christopher D. Manning, GloVe: Global Vectors for Word Representation</a></p> <p>GloVe is an unsupervised learning algorithm for obtaining vector representations for words. 
Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.</p> </li> </ol> <hr /> <h4 id="f_beta-score-an-easy-to-combine-precision-and-recall-measures"><script type="math/tex">F_{\beta}</script> score: An easy way to combine precision and recall measures</h4> <blockquote> <script type="math/tex; mode=display">F_{\beta} = \frac{(1+\beta^2)\,(\mathrm{precision}\cdot \mathrm{recall})}{\beta^2\cdot \mathrm{precision}+\mathrm{recall}} = \frac{1}{\frac{1}{\beta^2+1}\cdot\frac{1}{\mathrm{precision}}+\frac{\beta^2}{\beta^2+1}\cdot\frac{1}{\mathrm{recall}}}</script> <p><script type="math/tex">% <![CDATA[ \beta < 1 %]]></script> lends more weight to precision, while <script type="math/tex">\beta > 1</script> favors recall (<script type="math/tex">\beta \to 0</script> considers only precision, <script type="math/tex">\beta \to \infty</script> only recall).</p> <script type="math/tex; mode=display">F_1 = \frac{2\,(\mathrm{precision}\cdot \mathrm{recall})}{\mathrm{precision}+\mathrm{recall}} = \frac{1}{\frac{1}{2}\cdot\frac{1}{\mathrm{precision}}+\frac{1}{2}\cdot\frac{1}{\mathrm{recall}}}</script> </blockquote> <hr /> <h4 id="symmetric-mean-absolute-percent-error-smape">Symmetric Mean Absolute Percent Error (SMAPE)</h4> <h5 id="references-3">References:</h5> <ol> <li> <p><a href="http://www.vanguardsw.com/business-forecasting-101/symmetric-mean-absolute-percent-error-smape/">http://www.vanguardsw.com/business-forecasting-101/symmetric-mean-absolute-percent-error-smape/</a></p> <p>SMAPE is an alternative to Mean Absolute Percent Error (MAPE) when there is zero or near-zero demand for items. SMAPE self-limits to an error rate of 200%, reducing the influence of these low volume items. Low volume items are problematic because they could otherwise have infinitely high error rates that skew the overall error rate. 
SMAPE is the absolute value of forecast minus actual divided by the sum of forecast and actual, averaged over all time periods and scaled by 2, as expressed in the formula:</p> </li> </ol> <blockquote> <script type="math/tex; mode=display">SMAPE = \frac{2}{N} \sum_{k=1}^N\frac{\vert F_k-A_k\vert}{F_k + A_k}</script> <p>k = each time period, <script type="math/tex">F_k</script> = forecast, <script type="math/tex">A_k</script> = actual, N = number of time periods.</p> </blockquote> <h4 id="mean-absolute-percent-error-mape">Mean Absolute Percent Error (MAPE)</h4> <h5 id="references-4">References:</h5> <ol> <li><a href="http://www.vanguardsw.com/business-forecasting-101/mean-absolute-percent-error/">http://www.vanguardsw.com/business-forecasting-101/mean-absolute-percent-error/</a></li> </ol> <p>Mean Absolute Percent Error (MAPE) is the most common measure of forecast error. MAPE functions best when there are no extremes in the data (including zeros).</p> <p>With zeros or near-zeros, MAPE can give a distorted picture of error. The error on a near-zero item can be infinitely high, distorting the overall error rate when it is averaged in. For forecasts of items that are near or at zero volume, Symmetric Mean Absolute Percent Error (SMAPE) is a better measure. 
MAPE is the average, over all time periods, of the absolute percent error, that is, the absolute value of forecast minus actual divided by actual:</p> <blockquote> <script type="math/tex; mode=display">MAPE = \frac{1}{N} \sum_{k=1}^N\frac{\vert F_k-A_k\vert}{A_k}</script> <p>k = each time period, <script type="math/tex">F_k</script> = forecast, <script type="math/tex">A_k</script> = actual, N = number of time periods.</p> </blockquote> <hr /> <h4 id="mle-vs-map-the-connection-between-maximum-likelihood-and-maximum-a-posteriori-estimation">MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation</h4> <h5 id="references-5">References:</h5> <ol> <li><a href="http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf">http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf</a></li> <li><a href="https://wiseodd.github.io/techblog/2017/01/01/mle-vs-map/">https://wiseodd.github.io/techblog/2017/01/01/mle-vs-map/</a></li> </ol> <p>Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation are both methods for estimating some variable in the setting of probability distributions or graphical models. They are similar in that each computes a single estimate instead of a full distribution. Maximum likelihood estimation (MLE): choose the value that maximizes the probability of the observed data. <script type="math/tex">\hat{\theta}_{MLE}=\underset{\theta}{\arg\max}\,P(D\vert \theta)</script> Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and prior belief. 
<script type="math/tex">\hat{\theta}_{MAP}=\underset{\theta}{\arg\max}\,P(\theta\vert D)=\underset{\theta}{\arg\max}\,P(D\vert \theta)P(\theta)</script> The second equality follows from Bayes’ rule, because P(D) does not depend on θ. We can conclude that MLE is a special case of MAP in which the prior probability is uniform (the same everywhere).</p> <hr /> <h4 id="the-exponential-family">The exponential family:</h4> <h5 id="references-6">References:</h5> <ol> <li><a href="https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf">https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf</a></li> <li><a href="www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf">www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf</a></li> <li><a href="https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/MIT18_655S16_LecNote7.pdf">https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/MIT18_655S16_LecNote7.pdf</a></li> </ol> <p>(Chinese reference)</p> <ol> <li><a href="http://blog.csdn.net/dream_angel_z/article/details/46288167">http://blog.csdn.net/dream_angel_z/article/details/46288167</a></li> <li><a href="http://www.cnblogs.com/huangshiyu13/p/6820729.html">http://www.cnblogs.com/huangshiyu13/p/6820729.html</a></li> </ol> <p>Given a measure ν, we define an exponential family of probability distributions as those distributions whose density (relative to ν) has the following general form:</p> <blockquote> <script type="math/tex; mode=display">p(x\vert \eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\}</script> </blockquote> <blockquote> <p>Key point: x and η only “mix” in <script type="math/tex">\exp\{\eta^T T(x)\}</script></p> </blockquote> <blockquote> <p>η : vector of “natural parameters”</p> <p>T(x): vector of “natural sufficient statistics”</p> <p>A(η): log-partition function / cumulant generating function</p> </blockquote> <blockquote> <p>h : X → R</p> <p>η : Θ → R</p> <p>B : Θ → R.</p> </blockquote> <h4 id="generalized-linear-model-glm">Generalized Linear Model, GLM</h4> <h5 id="references-7">References:</h5> <ol> <li><a href="http://www.cs.princeton.edu/courses/archive/spr09/cos513/scribe/lecture11.pdf">http://www.cs.princeton.edu/courses/archive/spr09/cos513/scribe/lecture11.pdf</a></li> </ol> <p>The generalized linear model (GLM) is a powerful generalization of linear regression to the more general exponential family. The model is based on the following assumptions:</p> <ol> <li>The observed input enters the model through a linear function <script type="math/tex">(\beta^T X)</script>.</li> <li>The conditional mean of the response is represented as a function of the linear combination: <script type="math/tex">E[Y\vert X]</script> is defined as <script type="math/tex">\mu = f(\beta^T X)</script>.</li> <li>The observed response is drawn from an exponential family distribution with conditional mean µ.</li> </ol> <p>η = Ψ(µ)</p> <p>where Ψ is a function which maps the mean parameter to the natural (canonical) parameter. 
µ defined as E[t(X)] can be computed from dA(η)/dη, which is solely a function of η.</p> <p>[ (xn)–&gt;(yn)&lt;–]–(β) (Representation of a generalized linear model)</p> <p>(β^T.X)–f(β^T.X)–&gt; µ– Ψ(µ)–&gt;η (Relationship between the variables in a generalized linear model)</p> <hr /> <h4 id="kullback-leibler-divergence-kl-divergence--information-gain--relative-entropy">Kullback-Leibler divergence (KL Divergence) / Information Gain / relative entropy</h4> <p>The KL divergence from <script type="math/tex">\hat{y}</script> (or Q, your observation) to y (or P, ground truth) is simply the difference between cross entropy and entropy:</p> <script type="math/tex; mode=display">KL(y \vert\vert \hat{y})=\sum_i y_i\log\frac{1}{\hat{y}_i}-\sum_i y_i\log\frac{1}{y_i}=\sum_i y_i\log\frac{y_i}{\hat{y}_i}</script> <p>In the context of machine learning, <script type="math/tex">KL(P\vert\vert Q)</script> is often called the information gain achieved if P is used instead of Q. By analogy with information theory, it is also called the relative entropy of P with respect to Q.</p> <hr /> <h4 id="learning-theory--vc-dimensionfor-vapnikchervonenkis-dimension">Learning Theory &amp; VC dimension (Vapnik–Chervonenkis dimension)</h4> <h5 id="references-8">References:</h5> <ol> <li><a href="https://drive.google.com/file/d/0B6pX3VvUVMAIeVk4OXlxRk0tcXM/view">https://drive.google.com/file/d/0B6pX3VvUVMAIeVk4OXlxRk0tcXM/view</a></li> <li><a href="https://www.cs.cmu.edu/~epxing/Class/10701/slides/lecture16-VC.pdf">https://www.cs.cmu.edu/~epxing/Class/10701/slides/lecture16-VC.pdf</a></li> <li><a href="http://cs229.stanford.edu/notes/cs229-notes4.pdf">http://cs229.stanford.edu/notes/cs229-notes4.pdf</a></li> </ol> <p>Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. 
If arbitrarily large finite sets of X can be shattered by H, then VC(H) is defined as infinite.</p> <p>VC dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter.</p> <hr /> <h4 id="statistical-forecasting">Statistical forecasting</h4> <h5 id="arima-auto-regressive-integrated-moving-average">ARIMA (Auto-Regressive Integrated Moving Average)</h5> <ol> <li>A series which needs to be differenced to be made stationary is an “integrated” (I) series.</li> <li>Lags of the stationarized series are called “autoregressive” (AR) terms.</li> <li>Lags of the forecast errors are called “moving average” (MA) terms.</li> <li>Non-seasonal “ARIMA(p,d,q)” model: p = the number of autoregressive terms; d = the number of nonseasonal differences; q = the number of moving-average terms.</li> <li>Seasonal “ARIMA(p,d,q)X(P,D,Q)” model: P = the number of seasonal autoregressive terms; D = the number of seasonal differences; Q = the number of seasonal moving-average terms.</li> <li>Augmented Dickey-Fuller (ADF) test of data stationarity: if the test statistic &lt; the 1% critical value, the data is stationary.</li> <li>Data stationarity: 1. The mean of the series should not be a function of time. 2. The variance of the series should not be a function of time. 3. The covariance of the i-th term and the (i + m)-th term should not be a function of time.</li> <li>Transformations to stationarize the data: 1. Deflation by CPI 2. Logarithmic 3. First Difference 4. Seasonal Difference 5. 
Seasonal Adjustment</li> </ol> <p>Reference: <a href="http://people.duke.edu/~rnau/411home.htm">http://people.duke.edu/~rnau/411home.htm</a></p> <h2 id="disclaimer">Disclaimer</h2> <p>Last updated: March 13, 2018</p> <p>The information contained on the https://github.com/Cheng-Lin-Li/ website (the “Service”) is for general information purposes only. Cheng-Lin-Li’s github assumes no responsibility for errors or omissions in the contents on the Service and Programs.</p> <p>In no event shall Cheng-Lin-Li’s github be liable for any special, direct, indirect, consequential, or incidental damages or any damages whatsoever, whether in an action of contract, negligence or other tort, arising out of or in connection with the use of the Service or the contents of the Service. Cheng-Lin-Li’s github reserves the right to make additions, deletions, or modifications to the contents on the Service at any time without prior notice.</p> <h3 id="external-links-disclaimer">External links disclaimer</h3> <p>The https://github.com/Cheng-Lin-Li/ website may contain links to external websites that are not provided or maintained by or in any way affiliated with Cheng-Lin-Li’s github.</p> <p>Please note that Cheng-Lin-Li’s github does not guarantee the accuracy, relevance, timeliness, or completeness of any information on these external websites.</p> <h2 id="contact-information">Contact Information</h2> <p><a href="mailto:clark.cl.li@gmail.com">clark.cl.li@gmail.com</a></p>Cheng-Lin Li