Hello Data!!
Table of contents:
 What is the major focus of this blog?
 Reference material:
 General topics
 Parametric vs. Nonparametric Methods.
 Generative & Discriminative models:
 Look-Ahead Bias
 Ensemble Learning to Improve Machine Learning Results
 Glorot initialization / Xavier initialization
 He initialization: For the more recent rectifying nonlinearities (ReLU)
 GloVe: Global Vectors for Word Representation
 F-score: An easy way to combine precision and recall measures
 Symmetric Mean Absolute Percent Error (SMAPE)
 Mean Absolute Percent Error (MAPE)
 MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation
 The exponential family:
 Generalized Linear Model, GLM
 Kullback-Leibler divergence (KL Divergence) / Information Gain / relative entropy
 Learning Theory & VC Dimension (Vapnik–Chervonenkis dimension)
 Statistical forecasting
 Disclaimer
 Contact Information
What is the major focus of this blog?
This is my learning notebook which includes AI, Machine Learning, Big Data techniques, Knowledge Graph, Information Visualization and Natural Language Processing.
I am a lifelong learner, passionate about contributing my knowledge to impact the world. After twenty years of dedicated work in IT application development departments, building Intranet, B2C and B2B eCommerce portals and trading websites, I observed that the A.I. and DATA era is coming. It is time to let DATA tell its story through A.I. / machine learning algorithms, which is why I resigned from J.P. Morgan Asset Management (Taiwan) in 2016 and went back to school as a graduate student in the Viterbi School of Engineering at the University of Southern California. My research area is data informatics, and I would like to share what I learn with everyone.
I will leverage my spare time to enrich this notebook style blog from time to time. Your comments are appreciated.
Reference material:
Artificial intelligence (AI)
Textbook:
Machine Learning (ML)
Textbook:
 Introduction to Machine Learning (3rd Edition), Ethem Alpaydin
 The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition), by Trevor Hastie, Robert Tibshirani and Jerome Friedman
 Pattern Recognition and Machine Learning, Christopher Bishop
 Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Articles & Papers:
 Deep Learning: An Introduction for Applied Mathematicians
 The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches
Training Material and Courses:
 CS229: Machine Learning
 Representation Learning at the Montreal Institute for Learning Algorithms @Université de Montréal (Deep Learning)
 Reinforcement Learning: An Introduction by Prof. Richard S. Sutton & Andrew G. Barto @University of Alberta, OR try this alternative link.
 Carnegie Mellon University 10-715 Advanced Introduction to Machine Learning: lectures
 Deeplearning.ai, Andrew Ng, Introductory deep learning course.
Natural Language Processing(NLP)
Articles & Papers:
 Demystifying word2vec
 Brill (1992): A Simple Rule-Based Part of Speech Tagger
 Ratnaparkhi (1996): A Maximum Entropy Model for Part-of-Speech Tagging
 Lafferty, McCallum and Pereira (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
 Young (1996): A review of large-vocabulary continuous-speech recognition. IEEE Signal Processing Magazine 13(5): 45–57.
 Sutskever, Vinyals and Le (2014): Sequence to Sequence Learning with Neural Networks
 Neubig (2017): Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
 Mikolov, Yih and Zweig (2013): Linguistic Regularities in Continuous Space Word Representations
 Levy, Goldberg and Dagan (2015): Improving Distributional Similarity with Lessons Learned from Word Embeddings
Training Material and Courses:
 Natural Language Processing (Fall 2017) by Prof. Jason Eisner @Johns Hopkins University
 Natural Language Processing with Deep Learning (Winter 2017) by Chris Manning & Richard Socher @Stanford University: material website and video link
Statistics:
Statistics and R by Nathaniel E. Helwig @University of Minnesota
General topics
I leave some technology notes in this section and may write full articles for each of them in the future.
Parametric vs. Nonparametric Methods.
reference:
 Stuart Russell, Peter Norvig, Artificial Intelligence: A Modern Approach
 Parametric Methods:
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.
Models do not grow with the data.
Model examples:
1. Linear regression
2. Logistic regression
3. Perceptron
4. Naive Bayes
5. …etc.
 Nonparametric Methods: Don’t summarize data into parameters.
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.
Models grow with the data.
Model examples:
1. k-nearest neighbors
2. Support Vector Machine
3. Decision Tree (CART and C4.5)
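To make the contrast concrete, here is a minimal numpy sketch (function names are my own): the parametric linear model keeps exactly two parameters no matter how much data arrives, while the nonparametric k-nearest-neighbors predictor must keep the entire training set around.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=50)  # noisy line y = 2x + 1

# Parametric: linear regression summarizes the data with a FIXED set of
# parameters (slope, intercept), independent of the number of examples.
slope, intercept = np.polyfit(X, y, deg=1)

# Nonparametric: k-nearest neighbors has no fixed parameter vector;
# its "model" is the training data itself and grows with it.
def knn_predict(x_query, X_train, y_train, k=3):
    idx = np.argsort(np.abs(X_train - x_query))[:k]
    return y_train[idx].mean()

x_new = 5.0
linear_pred = slope * x_new + intercept
knn_pred = knn_predict(x_new, X, y)
print(linear_pred, knn_pred)  # both close to the true value 2*5 + 1 = 11
```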
Generative & Discriminative models:
Reference:

Generative models, also called joint distribution models.
Generative learning algorithms assume there is a model that GENERATEs the observable variable from the hidden (or target) variable, and that the hidden variable follows a distribution rather than taking a fixed value.
Given an observable variable X and a target variable Y, a generative model is a statistical model of the joint probability distribution on X × Y, P(X, Y).
 Gaussian mixture model and other types of mixture model
 Hidden Markov model
 Probabilistic context-free grammar
 Naive Bayes
 Averaged one-dependence estimators
 Latent Dirichlet allocation
 Restricted Boltzmann machine
 Generative adversarial networks

Discriminative models, also called conditional models.
A discriminative model is a model of the conditional probability of the target Y given an observation x, symbolically P(Y | X = x), and
classifiers computed without using a probability model are also referred to loosely as “discriminative”.
Algorithms that try to learn P(Y | X) directly from a given X (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0,1} (such as the perceptron algorithm), are called discriminative learning algorithms.
 Logistic regression, a type of generalized linear regression used for predicting binary or categorical outputs (also known as maximum entropy classifiers)
 Support vector machines
 Boosting (meta-algorithm)
 Conditional random fields
 Linear regression
 Neural networks
 Random forests
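A toy counting sketch of the difference (pure Python, toy data my own): the generative route estimates the joint P(X, Y) and recovers P(Y | X) via Bayes' rule, while the discriminative route estimates P(Y | X) directly.

```python
from collections import Counter

# Toy labeled data: (feature, label) pairs.
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
        ("rainy", "play"), ("sunny", "stay"), ("rainy", "stay")]

n = len(data)
pair_counts = Counter(data)
x_counts = Counter(x for x, _ in data)

# Generative view: model the JOINT distribution P(X, Y).
joint = {pair: c / n for pair, c in pair_counts.items()}

# Discriminative view: model the CONDITIONAL P(Y | X) directly.
cond = {pair: c / x_counts[pair[0]] for pair, c in pair_counts.items()}

# The joint model can recover the conditional via P(Y | X) = P(X, Y) / P(X):
p_sunny = sum(p for (x, _), p in joint.items() if x == "sunny")
print(joint[("sunny", "play")] / p_sunny)  # P(play | sunny) = 2/3
print(cond[("sunny", "play")])             # same value, modeled directly
```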
Look-Ahead Bias
Reference:
Look-ahead bias occurs when a study or simulation uses information or data that would not have been known or available during the period being analyzed. This usually leads to inaccurate results, and look-ahead bias can be used to sway simulation results closer into line with the desired outcome of the test.
To avoid look-ahead bias, an investor backtesting the performance of a trading strategy must only [use information that would have been available at the time of the trade]. For example, if a trade is simulated based on [information that was not available] at the time of the trade, such as a quarterly earnings number that was released three months later, it will diminish the accuracy of the trading strategy’s true performance and potentially bias the results in favor of the desired outcome. Look-ahead bias is one of many biases that must be accounted for when running simulations. Other common biases are:
a. [sample selection bias]: a non-random sample of a population,
b. [time period bias]: early termination of a trial at a time when its results support the desired conclusion,
c. [survivorship/survival bias]: the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility.
All of these biases have the potential to sway simulation results closer into line with the desired outcome of the simulation, as the input parameters of the simulation can be selected in such a way as to favor the desired outcome.
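A minimal numpy sketch of look-ahead bias in a backtest (simulated data; names are my own): the biased version trades on the same day's return, information that would not have been available at trade time, while the correct version lags the signal by one day.

```python
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(0, 0.01, size=1000)  # simulated daily returns, zero mean

# Signal: "go long when the return is positive".
# Biased backtest: decides today's trade using TODAY's return --
# information that only becomes known after the trade would be placed.
biased_pnl = np.where(returns > 0, returns, 0).sum()

# Correct backtest: decides today's trade from YESTERDAY's return only.
signal = returns[:-1] > 0          # known at the close of day t
unbiased_pnl = returns[1:][signal].sum()

# The biased strategy looks spectacular; the honest one hovers near zero.
print(biased_pnl, unbiased_pnl)
```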
Ensemble Learning to Improve Machine Learning Results
Reference:

Vadim Smolyakov, Ensemble Learning to Improve Machine Learning Results.
Ensemble methods are meta-algorithms which combine several machine learning techniques into one model to improve performance:

bagging (decreases variance): bootstrap aggregation. Parallel ensemble: each model is built independently. a. One way to reduce the variance of an estimate is to average together multiple estimates. b. Bagging uses bootstrap sampling (combinations with repetitions) to obtain the data subsets for training the base learners. To aggregate the outputs of the base learners, bagging uses voting for classification and averaging for regression.

boosting (decreases bias): Sequential ensemble: try to add new models that do well where previous models fall short. a. Boosting refers to a family of algorithms that are able to convert weak learners into strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data, where more weight is given to examples that were misclassified by earlier rounds. b. A two-step approach: first use subsets of the original data to produce a series of averagely performing models, then “boost” their performance by combining them together using a particular cost function (majority vote for classification or a weighted sum for regression). Unlike bagging, in classical boosting the subset creation is not random and depends on the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models.

stacking (improves predictions): Sequential ensemble: stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. a. The base-level models are trained on the complete training set, then the meta-model is trained on the outputs of the base-level models as features.
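A minimal bagging sketch in numpy (the base learner is deliberately trivial and my own choice): each learner is fit on an independent bootstrap resample, and the regression outputs are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_mean_model(X, y):
    """A deliberately simple base learner: predict the training-set mean."""
    return y.mean()

X = rng.normal(size=200)
y = 3.0 + rng.normal(0, 2.0, size=200)  # noisy observations around 3.0

# Bagging: train each base learner on a bootstrap sample (sampling with
# replacement), then AVERAGE their outputs (vote, for classification).
n_models = 50
preds = []
for _ in range(n_models):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
    preds.append(fit_mean_model(X[idx], y[idx]))

bagged_prediction = np.mean(preds)
print(bagged_prediction)  # close to the true mean of 3.0
```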
Glorot initialization / Xavier initialization
References:
 http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
 https://jamesmccaffrey.wordpress.com/2017/06/21/neural-network-glorot-initialization/
Glorot initialization: it helps signals reach deep into the network.
a. If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
b. If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
Formula: Var(W) = 1 / n_in,
where W is the initialization distribution for the neuron in question and n_in is the number of neurons feeding into it. The distribution used is typically Gaussian or uniform.
It’s worth mentioning that Glorot & Bengio’s paper originally recommended using Var(W) = 2 / (n_in + n_out), where n_out is the number of neurons the result is fed to.
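A minimal numpy sketch of the paper's variant (the function name is my own):

```python
import numpy as np

def glorot_init(n_in, n_out, seed=0):
    """Glorot/Xavier initialization (the paper's version): keep the signal's
    variance roughly constant across layers by drawing weights with
    Var(W) = 2 / (n_in + n_out)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = glorot_init(400, 400)
print(W.var())  # close to 2 / (400 + 400) = 0.0025
```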
He initialization: For the more recent rectifying nonlinearities (ReLu)
References:
Formula: Var(W) = 2 / n_in
Which makes sense: a rectifying linear unit is zero for half of its input, so you need to double the size of weight variance to keep the signal’s variance constant.
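A minimal numpy sketch (function name my own):

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """He initialization for ReLU layers: since ReLU zeroes half its input,
    double the weight variance relative to the 1/n_in rule: Var(W) = 2 / n_in."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W = he_init(500, 500)
print(W.var())  # close to 2 / 500 = 0.004
```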
GloVe: Global Vectors for Word Representation
References:

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global wordword cooccurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
F-score: An easy way to combine precision and recall measures
F_β = (1 + β²) · (precision · recall) / (β² · precision + recall). β < 1 lends more weight to precision, while β > 1 favors recall (F_0 considers only precision, F_∞ only recall).
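A direct translation of the F-beta formula (function name my own):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta < 1 weights precision more, beta > 1 weights recall more.
    beta = 1 gives F1, the harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 1.0, beta=1.0))   # F1 = 2/3: harmonic mean of 0.5 and 1.0
print(f_beta(0.5, 1.0, beta=0.01))  # tiny beta -> essentially the precision, 0.5
```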
Symmetric Mean Absolute Percent Error (SMAPE)
References:

http://www.vanguardsw.com/business-forecasting-101/symmetric-mean-absolute-percent-error-smape/
An alternative to Mean Absolute Percent Error (MAPE) when there is zero or near-zero demand for items. SMAPE self-limits to an error rate of 200%, reducing the influence of these low-volume items. Low-volume items are problematic because they could otherwise have infinitely high error rates that skew the overall error rate. SMAPE is the absolute difference between forecast and actual, divided by the average of forecast and actual, as expressed in the formula:
SMAPE = Σₖ |Fₖ − Aₖ| / Σₖ ((Aₖ + Fₖ) / 2), where k = each time period.
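A sketch of a common per-period variant (function name my own; note that several slightly different SMAPE definitions exist in practice):

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE, per-period variant: mean of |F - A| / ((|A| + |F|) / 2).
    Bounded above by 200%, so zero/near-zero actuals cannot blow up the average."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(forecast - actual)
                   / ((np.abs(actual) + np.abs(forecast)) / 2))

# A zero-demand period contributes at most 200% (2.0), instead of infinity:
print(smape([0.0, 100.0], [50.0, 110.0]))
```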
Mean Absolute Percent Error (MAPE)
References:
Mean Absolute Percent Error (MAPE) is the most common measure of forecast error. MAPE works best when there are no extremes in the data (including zeros).
With zeros or near-zeros, MAPE can give a distorted picture of error. The error on a near-zero item can be infinitely high, causing a distortion to the overall error rate when it is averaged in. For forecasts of items that are near or at zero volume, Symmetric Mean Absolute Percent Error (SMAPE) is a better measure. MAPE is the average absolute percent error over the time periods: forecast minus actual, divided by actual:
MAPE = (1/n) Σₖ |Fₖ − Aₖ| / |Aₖ|, where k = each time period.
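A direct numpy translation (function name my own):

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percent Error: mean of |F - A| / |A|.
    Explodes when actuals are zero or near zero -- prefer SMAPE there."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(forecast - actual) / np.abs(actual))

print(mape([100.0, 200.0], [110.0, 180.0]))  # 0.1, i.e. a 10% average error
```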
MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation
References:
 http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf
 https://wiseodd.github.io/techblog/2017/01/01/mle-vs-map/
Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation are both methods for estimating some variable in the setting of probability distributions or graphical models. They are similar in that they compute a single estimate instead of a full distribution. Maximum Likelihood Estimation (MLE): choose the value that maximizes the probability of the observed data. Maximum A Posteriori (MAP) estimation: choose the value that is most probable given the observed data and a prior belief. What we can conclude, then, is that MLE is a special case of MAP where the prior probability is uniform (the same everywhere)!
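A worked coin-flip sketch (names my own): with a Beta(a, b) prior on the heads probability, the MAP estimate has a closed form, and a uniform Beta(1, 1) prior makes it collapse to the MLE.

```python
# Coin-flip example: estimate the heads probability theta from the data.
heads, n = 7, 10

# MLE: maximize P(data | theta)  ->  theta = heads / n
theta_mle = heads / n

# MAP with a Beta(a, b) prior: theta = (heads + a - 1) / (n + a + b - 2)
def theta_map(heads, n, a, b):
    return (heads + a - 1) / (n + a + b - 2)

print(theta_mle)                  # 0.7
# A Beta(5, 5) prior encodes a belief that the coin is fair,
# pulling the estimate from 0.7 toward 0.5:
print(theta_map(heads, n, 5, 5))  # 11/18, about 0.611
# With a uniform prior Beta(1, 1), MAP reduces exactly to MLE:
print(theta_map(heads, n, 1, 1))  # 0.7
```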
The exponential family:
References:
 https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf
 www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf
 https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/MIT18_655S16_LecNote7.pdf
(Chinese reference)
 http://blog.csdn.net/dream_angel_z/article/details/46288167
 http://www.cnblogs.com/huangshiyu13/p/6820729.html
Given a measure ν, we define an exponential family of probability distributions as those distributions whose density (relative to ν) has the following general form: p(x | η) = h(x) exp{ηᵀT(x) − A(η)}
Key point: x and η only “mix” in the term exp{ηᵀT(x)}.
η: vector of “natural parameters”
T(x): vector of “natural sufficient statistics”
A(η): log partition function / cumulant generating function
h : X → R
η : Θ → R
B : Θ → R.
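As a worked example (not from the original notes), the Bernoulli distribution can be rewritten in this general form:

```latex
% Bernoulli(x; \pi) rewritten in exponential-family form:
p(x \mid \pi) = \pi^{x}(1-\pi)^{1-x}
             = \exp\!\Big( x \log\tfrac{\pi}{1-\pi} + \log(1-\pi) \Big)
% Matching p(x \mid \eta) = h(x)\exp\{\eta^{\top}T(x) - A(\eta)\}:
%   h(x)    = 1
%   T(x)    = x
%   \eta    = \log\tfrac{\pi}{1-\pi}              (the log-odds, i.e. the natural parameter)
%   A(\eta) = -\log(1-\pi) = \log(1 + e^{\eta})   (the log partition function)
```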
Generalized Linear Model, GLM
References:
The generalized linear model (GLM) is a powerful generalization of linear regression to responses drawn from a more general exponential family. The model is based on the following assumptions:
 The observed input x enters the model through a linear function, βᵀx.
 The conditional mean of the response, µ, is represented as a function f of the linear combination: µ = f(βᵀx).
 The observed response is drawn from an exponential family distribution with conditional mean µ.
η = Ψ(µ)
where Ψ is a function that maps the mean parameter µ to the natural (canonical) parameter η. µ, defined as E[T(X)], can be computed as dA(η)/dη, which is solely a function of η.
(x_n) → (y_n) ← (β)   (representation of a generalized linear model)
βᵀX → f(βᵀX) = µ → Ψ(µ) = η   (relationship between the variables in a generalized linear model)
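A small numeric sketch of that chain of variables, using logistic regression as the GLM (Bernoulli response, sigmoid as the inverse link f, logit as the canonical link Ψ):

```python
import numpy as np

# Logistic regression as a GLM: linear predictor eta = beta^T x,
# conditional mean mu = f(eta) via the inverse link (sigmoid),
# and the canonical link Psi (logit) maps mu back to eta.
beta = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])

eta = beta @ x                      # linear predictor: 0.5*2 - 1.0*1 = 0.0
mu = 1.0 / (1.0 + np.exp(-eta))     # inverse link f (sigmoid): mu = 0.5
eta_back = np.log(mu / (1.0 - mu))  # canonical link Psi (logit) recovers eta

print(eta, mu, eta_back)
```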
Kullback-Leibler divergence (KL Divergence) / Information Gain / relative entropy
The KL divergence from Q (your observation) to P (the ground truth, y) is simply the difference between cross entropy and entropy: D_KL(P ‖ Q) = H(P, Q) − H(P).
In the context of machine learning, D_KL(P ‖ Q) is often called the information gain achieved if Q is used instead of P. By analogy with information theory, it is also called the relative entropy of P with respect to Q.
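A direct numpy translation of the identity (function names my own):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL(P || Q) = H(P, Q) - H(P): the extra nats paid for using Q instead of P."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.5])  # ground truth
q = np.array([0.9, 0.1])  # observation / model
print(kl_divergence(p, q))  # positive: Q mismatches P
print(kl_divergence(p, p))  # 0: no divergence from itself
```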
Learning Theory & VC Dimension (Vapnik–Chervonenkis dimension)
References:
 https://drive.google.com/file/d/0B6pX3VvUVMAIeVk4OXlxRk0tcXM/view
 https://www.cs.cmu.edu/~epxing/Class/10701/slides/lecture16VC.pdf
 http://cs229.stanford.edu/notes/cs229-notes4.pdf
Definition: The Vapnik–Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) is defined as infinite.
VC dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter.
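A small brute-force sketch (names my own): 1-D threshold classifiers h_t(x) = [x ≥ t] can shatter any single point but never a pair of points, so their VC dimension is 1. The finite threshold list below covers every distinct behavior of the class on the test points, which is why checking it is enough.

```python
def shatters(points, hypotheses):
    """True if the hypothesis set realizes every possible labeling of `points`."""
    achievable = {tuple(h(x) for x in points) for h in hypotheses}
    return len(achievable) == 2 ** len(points)

# Hypothesis class: 1-D threshold classifiers h_t(x) = [x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in [-10, 0.5, 1.5, 2.5, 10]]

print(shatters([1.0], thresholds))       # True: one point can be shattered
print(shatters([1.0, 2.0], thresholds))  # False: the labeling (1, 0) is
                                         # unreachable, so VC dimension = 1
```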
Statistical forecasting
ARIMA (AutoRegressive Integrated Moving Average)
 A series which needs to be differenced to be made stationary is an “integrated” (I) series
 Lags of the stationarized series are called “autoregressive” (AR) terms
 Lags of the forecast errors are called “moving average” (MA) terms
 Non-seasonal ARIMA model, “ARIMA(p,d,q)”:
  p = the number of autoregressive terms
  d = the number of non-seasonal differences
  q = the number of moving-average terms
 Seasonal ARIMA model, “ARIMA(p,d,q)x(P,D,Q)”:
  P = the number of seasonal autoregressive terms
  D = the number of seasonal differences
  Q = the number of seasonal moving-average terms
 Augmented Dickey-Fuller (ADF) test of data stationarity: if the test statistic < the 1% critical value, the data is stationary.
 Data stationarity:
  1. The mean of the series should not be a function of time.
  2. The variance of the series should not be a function of time.
  3. The covariance of the i-th term and the (i+m)-th term should not be a function of time.
 Transformations to stationarize the data:
  1. Deflation by CPI
  2. Logarithmic transform
  3. First difference
  4. Seasonal difference
  5. Seasonal adjustment
Reference: http://people.duke.edu/~rnau/411home.htm
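A minimal numpy illustration of the “integrated” idea (simulated data): differencing a random walk once recovers a stationary series.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random walk is a classic non-stationary ("integrated") series:
# its variance grows with time.
walk = np.cumsum(rng.normal(0, 1, size=2000))

# One round of differencing (the "d" in ARIMA(p,d,q), with d = 1)
# recovers the stationary white-noise steps.
diffed = np.diff(walk)

print(np.var(walk))    # large: the walk wanders far from its mean
print(np.var(diffed))  # close to 1: the steps are plain N(0, 1) noise
```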
Disclaimer
Last updated: March 13, 2018
The information contained on the https://github.com/Cheng-Lin-Li/ website (the “Service”) is for general information purposes only. Cheng-Lin-Li’s GitHub assumes no responsibility for errors or omissions in the contents of the Service and Programs.
In no event shall Cheng-Lin-Li’s GitHub be liable for any special, direct, indirect, consequential, or incidental damages or any damages whatsoever, whether in an action of contract, negligence or other tort, arising out of or in connection with the use of the Service or the contents of the Service. Cheng-Lin-Li’s GitHub reserves the right to make additions, deletions, or modifications to the contents of the Service at any time without prior notice.
External links disclaimer
The https://github.com/Cheng-Lin-Li/ website may contain links to external websites that are not provided or maintained by, or in any way affiliated with, Cheng-Lin-Li’s GitHub.
Please note that Cheng-Lin-Li’s GitHub does not guarantee the accuracy, relevance, timeliness, or completeness of any information on these external websites.
Contact Information
mailto:clark.cl.li@gmail.com