# Math for Machine Learning

*Stochastic Processes (Random Planar Maps)*

Once you decide to get started with data science in a serious way, the first few months (if not the first year) can seem pretty difficult. At times it can even feel hopeless, especially if you do not have the necessary academic background or it has been a while since you were in an academic setting. In my judgment, you should not let that stop you from your pursuit. There are plenty of resources, online and offline, to help you start filling in your gaps.

Below are some of the key areas you should master to go further with data science and machine learning:

Calculus

• Functions
• Continuity
• Differentiability
• Integration (single and multiple variables)
• Optimization
• Convexity/Concavity

Linear Algebra

• Vectors
• Matrices
• Eigenvalues and eigenvectors
• Singular Value Decomposition
• Least Squares Estimation and Matrix Algebra

Statistics/Probability

• Basic probability
• Sample spaces
• Conditional probabilities and independence
• Random variables
• Moments
• Distributions
• Chi-Squared
• F-Test
• T-Test
• Bayes’ Theorem
• Marginalization
• Bayesian Inference
• Likelihood
• Estimation
• Regression
• Analysis of Variance

Stochastic Processes and Dynamical Systems

• Dirichlet Processes
• Gaussian Processes for Machine Learning

What else do you think is necessary?

# Getting Started with Machine Learning

In truth, I am an advocate for jumping in head first and using what you learn in real time. Practically speaking, this means learning less of the theory and heavy math behind what you are using up front, with the attitude that you will move toward understanding over time.

Do you know how to program in a specific language? If so, then determine if that language has a library which can be leveraged to aid you in your machine learning journey.

If you do not know how to program, that is okay too. Survey a few languages (R and Python are popular among data scientists), see which one is more understandable to you, and then go down the same path… seeking a machine learning library.

# Shhh, it’s a Library

##### No Programming Necessary
• WEKA – you can do virtually everything with this workbench: pre-processing data, visualizing data, building classifiers, and making predictions.
• BigML – Like WEKA, you will not have to program with BigML. You can explore model building in a browser. If you are not certain about machine learning (or data science, for that matter), this would be a great place to start.
##### R (Statistical Computing)
• If you really enjoy math and have not picked a language yet, then R may be for you. There are a lot of packages here developed by pioneers in the field which you can leverage without having to refactor any code. Packages come with documentation giving you some of the theory and example cases to see in action. In my judgment, learning this language lets you explore and prototype quickly, which will most certainly prove valuable.
##### Python
• scikit-learn – If you enjoy Python, then this library is for you. It is known for its documentation, which allows you to rapidly deploy virtually any machine learning algorithm.
##### Octave
• Octave is the open-source counterpart of MATLAB (some functions are not present). Like MATLAB, Octave is known for solving linear and non-linear problems. If you have an engineering background, this might be the place for you, although, practically speaking, many organizations do not use Octave/MATLAB as it is seen as primarily academic software.

No matter what you pick, decide to use it and stick with it for a while. In fact, I would commit to it for the next 12 months. Actually use the language/library you choose; do not just read about it.
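To give a sense of how little code a first model takes, here is a minimal scikit-learn sketch (the dataset, model choice, and split are illustrative, not a recommendation):

```python
# Fit a first classifier on scikit-learn's bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Four lines of modeling code: load, split, fit, score. That is the level of friction you can expect once you commit to one library.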

# Learning Online

If you are really a beginner, you may want to steer clear of some of what you see online. Many people I talk to like the idea of data science and machine learning and decide to sign up for an online course. The problem they encounter is that in many cases they are already expected to know how to program (to some degree) as well as some linear algebra and probability theory.

Linear Algebra Example
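As a taste of the kind of linear algebra these courses assume, here is a least-squares fit in NumPy (the data points are made up for illustration):

```python
# Fit a line to three points by least squares: minimize ||Ax - b||.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # columns: intercept term, x-values
b = np.array([1.0, 2.0, 2.0])       # observed y-values

x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print("intercept, slope:", x)       # best-fit line y = x[0] + x[1]*t
```

Courses lean on this machinery constantly: linear regression is exactly this computation.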

Probability Theory Example
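And here is the flavor of probability theory assumed, a classic Bayes' theorem calculation (the rates are hypothetical numbers chosen for illustration):

```python
# Bayes' theorem: probability of disease given a positive test result.
p_disease = 0.01               # prior: 1% of the population has the disease
p_pos_given_disease = 0.95     # test sensitivity
p_pos_given_healthy = 0.05     # false positive rate

# Total probability of a positive test (marginalization over both cases).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))
```

Even with a 95%-sensitive test, the posterior is only about 16% because the disease is rare; this kind of base-rate reasoning comes up constantly in applied courses.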

If you do decide to watch classes online, then you should absolutely take notes (even if you toss them later). The key is to participate, which may sound obvious, but when you are at home in your pajamas learning about data science it is not quite so obvious.

That being said, there are some really good (and free) online lectures out there (do not be overwhelmed).

# Research Papers

This may not be your thing either; not everybody likes to pick up a research paper to read. Many individuals complain that the reading is a bit too academic and does not really convey insight to the reader (which is the opposite of the paper's intent). To be candid, some are written better than others; in many cases that has to do with the topic or the time period in which the paper was written. However, there are a few seminal papers you should be acquainted with that will give you context for machine learning and data science, which should prove invaluable in your journey. My encouragement to you is to find these papers, and if you are not ready to read them because you are still building skills in other areas, simply hold on to them and test-read them every three months. See how far you get without getting lost, see if you understand what you are doing when coding a solution at a deeper level for having read the paper, and best of all, read the reference page and find out who influenced the paper you read.

# Machine Learning Books for those Just Starting

Let’s face it, there are not a lot of books out there that aim to aid those just starting out in machine learning. As before, the expectation is that you will have some linear algebra or probability theory down pat. Unless you come from the hard sciences (mathematics, engineering, biostatistics, etc.), you will probably have to do some skill building even before reading most of the books on the market. However, there are a few that approach the true beginning most people are at and encourage those of you willing to try on your own.

Curious to know your thoughts on the above. Have you used any of these resources? Do you have any that you would recommend?

# Seizure Detection in EEG Time Series

I had a wonderful opportunity to work with Eben Olson of Yale University on a problem data set provided by the Mayo Clinic. He and I did a write-up on what we learned during the process and hope that it helps others in their knowledge discovery surrounding seizure detection.

1 Introduction
We describe here the methods used in preparing our submission to the UPenn and Mayo Clinic Seizure Detection Challenge, which obtained second place on the private leaderboard. We present in detail our final and most successful approach, as well as an overview of less successful experiments which also contributed to our final ensemble or provided some insight. It is assumed that the reader is familiar with the structure of the challenge and the data, described at http://www.kaggle.com/c/seizure-detection.

2 Early approaches
2.1 Spectrograms
Our initial feature extraction method calculated spectrograms of each EEG trace, in an attempt to capture both frequency content and temporal dynamics. Each clip was first resampled to 500 Hz, and the short-time Fourier transform was applied, discarding phase information. Spectrograms were flattened into vectors, and mean subtraction and normalization were applied on a per-subject and per-feature basis. Features from each channel were concatenated, and logistic regression or random forests were used for classification. Our best (ensembled) submission with these features scored 0.94081 on the public leaderboard.
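The pipeline above can be sketched in a few lines of SciPy. This is an illustrative reconstruction, not the competition code, and for brevity it normalizes globally rather than per subject and per feature:

```python
# Spectrogram features: resample to 500 Hz, take STFT magnitude per channel,
# flatten, concatenate channels, then mean-subtract and normalize.
import numpy as np
from scipy import signal

def spectrogram_features(clip, fs, target_fs=500):
    """clip: array of shape (n_channels, n_samples) at sampling rate fs."""
    n_out = int(clip.shape[1] * target_fs / fs)
    feats = []
    for channel in clip:
        channel = signal.resample(channel, n_out)            # resample to 500 Hz
        _, _, Zxx = signal.stft(channel, fs=target_fs, nperseg=256)
        feats.append(np.abs(Zxx).ravel())                    # discard phase, flatten
    v = np.concatenate(feats)
    return (v - v.mean()) / (v.std() + 1e-8)                 # normalize (globally here)

fake_clip = np.random.randn(4, 1000)                         # 4 channels, 1 s at 1 kHz
print(spectrogram_features(fake_clip, fs=1000).shape)
```

The resulting vector would then be fed to logistic regression or a random forest, as described.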

2.2 Scattering coeﬃcients
As an alternative to spectrograms, we attempted to use scattering coefficients [2], a framework for time-frequency analysis which has been shown to give good results in audio classification tasks [1]. We used the ScatNet MATLAB toolbox to compute scattering coefficients for each clip, after resampling to 500 Hz. Coefficients for each channel were concatenated and logistic regression was used for classification. Only a marginal improvement (0.94212 public leaderboard) was seen over spectrogram features.

2.3 Maximal cross-correlation
We next considered maximal cross-correlation, which has been reported to produce useful features for detection of epileptic EEG activity [4]. This method attempts to compensate for propagation delays of brain activity by computing the cross-correlation between channels at various lag times and taking only the maximum value, normalized by the channel autocorrelation. We obtained a substantially worse score (0.86761 public leaderboard) with this method. However, review of the code indicated that this may have been due to a bug in the feature calculation, and further investigation of this method may be valuable.
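One way to compute the maximal cross-correlation feature described above is sketched below; this is an illustrative implementation under our reading of [4], not the (buggy) competition code:

```python
# Maximal cross-correlation between two channels over a window of lags,
# normalized by the zero-lag autocorrelations of the channels.
import numpy as np

def max_cross_correlation(x, y, max_lag=50):
    x = x - x.mean()
    y = y - y.mean()
    norm = np.sqrt(np.dot(x, x) * np.dot(y, y))   # autocorrelation normalization
    n = len(x)
    best = max(
        np.dot(x[max(0, -l):n - max(0, l)],       # overlap of x and y shifted by l
               y[max(0, l):n - max(0, -l)])
        for l in range(-max_lag, max_lag + 1))
    return best / norm

t = np.linspace(0, 1, 500)
a = np.sin(2 * np.pi * 10 * t)
b = np.roll(a, 5)                                 # same signal, delayed 5 samples
print(round(max_cross_correlation(a, b), 3))
```

For the delayed copy above the maximum is found near lag 5 and is close to 1, which is exactly the propagation-delay compensation the method is after.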

3 Final approach
3.1 Feature extraction
Our final approach to feature extraction calculated the covariance matrix of the EEG data, in order to capture correlations between channels. Since seizure activity is characterized by increased long-range synchronization of neuronal activity, this was expected to produce informative features. Matrices were individually normalized to zero mean and unit variance. As frequency analysis had been shown to be valuable, rather than compute a single covariance matrix we first filtered each trace with several bandpass filters. We initially applied four filters covering the range 1–200 Hz. Filter choice presents a complicated trade-off between frequency selectivity, signal-to-noise ratio, and output dimensionality. Performance was evaluated by cross-validation of logistic regression predictors. While attempting to manually optimize the filter parameters, we found that filters chosen for one subject could perform extremely poorly on others. We therefore performed an automated filter selection step, in which combinations of up to four filters were evaluated on each subject. These filters were chosen from a bank of 10 partially overlapping, approximately log-spaced bandpass filters covering the range 5–200 Hz. The three combinations which gave the highest CV values were retained.
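The core of this feature extraction can be sketched as follows. This is a simplified illustration with hypothetical band edges and a fourth-order Butterworth filter; the write-up's actual filter shapes and automated selection are not reproduced here:

```python
# Bandpass-filter each channel, then take the normalized covariance matrix
# across channels as the feature for that band.
import numpy as np
from scipy import signal

def filtered_covariance(clip, fs, band):
    """clip: (n_channels, n_samples); band: (low, high) in Hz."""
    nyq = fs / 2
    b, a = signal.butter(4, [band[0] / nyq, band[1] / nyq], btype="band")
    filtered = signal.filtfilt(b, a, clip, axis=1)       # zero-phase filtering
    cov = np.cov(filtered)                               # (n_channels, n_channels)
    return (cov - cov.mean()) / (cov.std() + 1e-8)       # zero mean, unit variance

rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 5000))                   # 16 channels, 10 s at 500 Hz
features = [filtered_covariance(clip, fs=500, band=band)
            for band in [(5, 20), (20, 60), (60, 200)]]  # hypothetical bands
print(features[0].shape)
```

Each band yields one channel-by-channel matrix, so synchronization between channels shows up directly in the feature values.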

3.2 Neural network classiﬁcation
As an alternative classification strategy, we experimented with the use of multilayered neural networks. Our initial motivation was the possibility of learning a cross-subject mapping which would allow our model to use the full training set to improve its predictions. While this goal was not realized, we did find that the NN models provided a boost over logistic regression. Our software was based on dnn.py, a recently released demonstration of a deep neural network written in Python. This provided an excellent framework which was simple to adapt to our problem. We tested a number of network architectures, but found that a network with two hidden layers of 200 and 100 units respectively gave good results while being reasonably quick to train. Rectified linear units were used in the hidden layers and logistic regression in the output layer. Dropout of 0.5 was used in the hidden layers for regularization. All networks were trained with the adadelta method for 100 epochs. Multiple networks were trained for each subject and filter combination. In an attempt both to increase diversity and to reduce the impact of dissimilar electrode patterns across subjects, each network was trained on a 12-channel subset of the full covariance matrix. We found that, depending on network architecture, predictions would become extremely compressed into the neighborhoods of zero and one. To avoid potential issues with numerical precision, we applied a logarithmic rescaling to predictions in the (0, 0.1] and [0.9, 1) ranges.
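One possible form of that logarithmic rescaling is sketched below. The exact mapping used in the competition is not specified in the write-up, so this particular formula is our assumption; the essential properties are that it is rank-preserving (so AUC is unaffected) and spreads out values crowded near 0 and 1:

```python
# Rank-preserving logarithmic rescaling of predictions in (0, 0.1] and [0.9, 1).
import numpy as np

def rescale(p, eps=0.1):
    """Spread out predictions compressed near 0 and 1 (hypothetical mapping)."""
    p = np.asarray(p, dtype=float)
    out = p.copy()
    low = (p > 0) & (p <= eps)
    high = (p >= 1 - eps) & (p < 1)
    # Map (0, eps] onto (0, eps] logarithmically; continuous at p = eps.
    out[low] = eps / (1 - np.log(p[low] / eps))
    out[high] = 1 - eps / (1 - np.log((1 - p[high]) / eps))
    return out

preds = np.array([1e-8, 0.05, 0.5, 0.95, 1 - 1e-8])
print(rescale(preds))
```

Values like 1e-8 and 1 − 1e-8 are pulled away from the limits of floating-point resolution while the ordering of all predictions is preserved.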

3.3 Early seizure prediction
Our best scores were obtained by submitting the same values for p_early and p_seizure, rather than trying to train separate classifiers for early ictal events. This phenomenon was reported early in the competition by the user Alexandre. We observed a similar trend in our cross-validation testing, and believe it is explained by the combination of the AUC metric and the imbalanced classes of the data set, which leads to a much larger penalty for false negatives than false positives. At least for the classification strategies we employed, the error due to training on the “wrong” labels was outweighed by the benefits of a larger training set. However, post-deadline testing showed that our error on the early detection task was several times higher than on the seizure detection task. Improvements could potentially be obtained by fine-tuning the trained networks with early ictal data alone, or by applying other techniques for learning from noisy labels.

4 Ensembling
We operated according to this learning bias: select the simplest model consistent with the data. We attempted to produce strong learners that could be arbitrarily accurate and weak learners that were more accurate than random guessing. In blending our models, our process was to start with uniform weighting; at each step of learning we then decreased the weighting of the models that were not correctly learned by the weak learner, and increased the weighting of the models that were correctly learned by strong learners [3]. In each ensemble we attempted to create more diversity of opinion, independence, decentralization, and aggregation. Our objective was to discover the best model from many classifier models with similar training/test errors. Selecting a model at random, we would have risked choosing a weak learner. However, we observed that combining them (averaging) offered the possibility of avoiding a poor decision. The idea was that every time we ran a model iteration, we would encounter different local optima; combining the model outputs would allow us to find a solution that is closer to the global minimum. Since it is possible that the classifier space may not contain the solution to a given problem, we pursued an ensemble of such classifiers that could contain the solution. For example, in the case of a linear classifier that cannot solve a non-linearly separable problem, combining linear classifiers may solve a non-linear problem.
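The averaging step at the heart of this ensembling can be sketched in a few lines (the per-model outputs below are made up for illustration):

```python
# Combine per-model prediction vectors by (optionally weighted) averaging.
import numpy as np

def ensemble_average(predictions, weights=None):
    """predictions: (n_models, n_samples) array of probabilities."""
    predictions = np.asarray(predictions, dtype=float)
    if weights is None:
        weights = np.ones(len(predictions))        # uniform weighting to start
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize weights
    return weights @ predictions                   # weighted average per sample

model_outputs = np.array([[0.9, 0.2, 0.6],
                          [0.8, 0.1, 0.7],
                          [0.7, 0.3, 0.8]])
print(ensemble_average(model_outputs))             # uniform average of 3 models
```

Re-weighting then amounts to passing a non-uniform `weights` vector, raising the influence of models that performed well and lowering the influence of those that did not.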

5 Technical details
5.1 Software

All of our experiments, with the exception of scattering coefficient extraction, were carried out using Python and IPython notebooks. The numpy, scipy, and scikit-learn packages were used extensively. Theano (https://github.com/Theano/Theano) is required by the neural network module. Our repository, available at https://github.com/ebenolson/seizure-detection, contains a full list of required packages and instructions for running our example code. We will also give a brief description here:
• Data Preparation.ipynb This notebook loads all clips for each subject, applies bandpass filters and calculates covariance matrices, then performs the filter selection step. The filtered covariance matrices are saved, along with the clip labels and filenames, to a pickle file in the data subdirectory.
• Train Classifiers and Predict.ipynb This notebook loads the preprocessed data and trains multiple networks on each input file. The predictions of each network are saved to the output subdirectory.
• Postprocessing.ipynb This notebook loads the predictions of the trained networks and combines them to produce submission.csv, a submission in the required format.
• Miscellaneous.ipynb This notebook contains code snippets implementing some of our earlier approaches.
• simplednn/ This subdirectory contains the neural network module.

5.2 Hardware and runtime
Most computations were done using Amazon EC2 m3.2xlarge instances with 8 virtual cores and 30 GB RAM. Using spot requests, EC2 provides an affordable platform which can be easily scaled and parallelized depending on the memory and processing power required. Some experiments were also done using a quad-core Core i5 desktop with 16 GB of RAM. On the desktop, preprocessing the full data set requires approximately one hour, and training one iteration of networks (3 per subject) requires 13 minutes. The time required to generate predictions is negligible.

References
[1] Joakim Andén and Stéphane Mallat. Multiscale scattering for audio classification. 2011.
[2] J. Bruna and S. Mallat. Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566, June 2011.
[3] Giovanni Seni and John Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Pages 4–10, 2010.
[4] P.W. Mirowski, Yann LeCun, D. Madhavan, and R. Kuzniecky. Comparing SVM and convolutional networks for epileptic seizure prediction from intracranial EEG. In IEEE Workshop on Machine Learning for Signal Processing (MLSP 2008), pages 244–249, October 2008.

# Do I Have to be Great at Math?

In many cases, individuals try to start out knowing all the math behind data science before being practical with the science. The majority stop here before they even get started. In fact, they believe in data science (they are attracted to it), but they simply do not feel adequate because the math gets in the way.

Bottom line: you can go far in machine learning without understanding all the math.

# Just Start

Think back to when you started something that you did not understand completely. You made use of it anyway. The result probably was terrible, but you stumbled your way through, and on the other side you had much to be proud of. It was a slow process, but your passion for whatever you were involved in carried you through the sticking points. As the complexity of whatever it was you were interested in grew, so did your desire for knowledge, and so you were able to overcome the limitations that could have been perceived to be in your way.

The truth is, if everyone had to start machine learning by learning about Nonnegative Matrix Factorization or Perturbation Bounds for Eigendecompositions, then very few individuals would have their passion ignited.

# Copy-and-Paste Heaven

You may not believe me, but one of the best things that can happen to you early on in your data science journey is for things to fail. Let’s face it, when something breaks you can do one of two things: 1. walk away, or 2. fix it. Many individuals focus on copy-and-pasting code from GitHub, a data challenge, or a code cookbook simply to see if the code works on their machine. In fact, they simply want to recreate the results, but there is a point where you will want to extend the code. Maybe it doesn’t do everything you need, or the data type is not the same as what you are working with, or maybe you want to tune the model. In all these cases you will have to move beyond copy-and-paste code.

When something breaks or you decide to enhance what you have, you will almost certainly have to break down the code line by line and understand what it is doing. In doing this, you may not realize it, but you are building confidence in what you are working on (as an aside, you should apply this same technique to the math formulas you encounter).

# Where’s My Geiger Counter

1. Be a Navigator – work well with tools like scikit-learn, R, WEKA
2. Be a Librarian – use programming libraries that give you small algorithms
3. Be a Doer – put what you are learning directly into what you are doing

You only need to know what you can use today. This approach is highly practical and efficient.

Find a dataset or problem that you are interested in working with and begin to work systematically through the results. Document it for yourself; categorizing your thoughts will help you crystallize what you are learning. In time you will begin gaining an energy within your learning which will prompt you to seek new algorithms, understand the parameters behind them, and solve more complex problems.

This type of process, like a Geiger counter, points you to what you need (and nothing more). It is highly customized and meaningful to the individual who adopts it. The Geiger counter should be juxtaposed with an individual completely grounded in theory. Theory alone is generally not sufficient to move beyond a prototype solution.

The secret to math is that you need to know what you are good at and what you are not. Know where you need to improve. When the time is right, pursue that knowledge. Having a problem that you are working on and needing more knowledge around a math concept can go a long way in cementing the idea in your mind for future problem sets.

# You Really Do Need Math, Later

Since you will have to know linear algebra eventually, why not use machine learning to teach it to you through a problem that interests you? You will gain more intuition at a deeper level than from any textbook introduction. In short, there is a path available to all individuals who are interested in machine learning but intimidated by math concepts. Keep in mind that you can overcome the overconfidence that comes with a poorly understood algorithm when you decide to understand the math behind any algorithm you implement.

I am sure some data science purists may disagree with what I have stated here; leave a comment and let me know what you think.