3 Effortless Tactics to Be a Data Science Success in Business

Damian Mingle - Business Decision

“Move out of the way – I am ready to model.” That is the typical sentiment of a Data Science team when given a business problem. However, in the context of a dynamic business, things are not that simple; instead, business needs require that the Data Science team be detailed in the communication of their process. The last thing a Data Science team wants to do is produce a project plan they feel is a pedestrian artifact aimed at pacifying their business counterparts. They tend to prefer a more fluid and creative style as opposed to one that is stiff and inflexible. Data Scientists may be tempted to promote the idea that they cannot let anything get in the way of creativity and brilliance or it will be to the detriment of the business. However, in many cases, Data Scientists may be allowing their human fear of transparency and accountability to dictate how they approach what the business actually needs – maximum visibility. Don’t fall into the trap of believing that these templated documents merely exist to check the proverbial box in order to placate the MBAs and Project Managers in the room. Data Science teams designed for success will most certainly deliver a Data Science project plan and use it throughout their analytics project.

Producing a Data Science Project Plan 

You might ask what the intended purpose behind such a fancy business document really is at its core. The Data Science project plan is incredibly straightforward: its sole purpose is to be the battle plan for achieving the Data Science goals which in turn achieve the business goals. Successful Data Science teams will know that there is immense value in not only being able to achieve the Data Science goals, but in being able to relate them back to the business on a constant basis. It’s the burden of the Data Scientist to be sure that clear communication exists between the two groups. The challenge for a Data Scientist is translating Data Science into business terms. This is the kind of thing that is built through experience and through learning what the business expects in a traditional project plan. If a business had a choice between a model with higher predictive accuracy by a Data Scientist without a project plan and a model with lower predictive accuracy by a Data Scientist with a project plan, they most certainly would choose to work with a Data Scientist who could communicate in terms of business, translate Data Science ideas, and understand the power of leveraging other individuals in the organization to contribute to the overall outcome.

Project Plan in Action

The nuts and bolts of a Data Science project plan will be different for each team and each organization, but there are core elements you will see in almost all effective Data Science project plans – sort of a Tao of Data Science Project Plans.

Three Effortless Tactics:

1. List the stages in the project

The business should not have to make assumptions about the stages you may take them through as a Data Scientist. Display your expectations to everyone and let them know how much time each stage may take. Also, do the obvious things like listing the resources required as well as the types of inputs and outputs your team expects. Lastly, list dependencies. After all, you will want your counterparts to be aware that you cannot move forward until “x” event happens; for example, the Data Scientist may be waiting to receive a data feed from IT. This is precisely the kind of thing to call out in the Data Science project plan.

2. Define the large iterations in the project 

Most business users will not be intimately involved in how a Data Science team works or why it may change when you encounter a classification problem versus a regression problem. So in an effort to be clear and meaningful, share stages that are more iterative as well as their corresponding durations – such as modeling or the evaluation stages. The best Data Scientists know how to appropriately manage expectations from the business through communication with the broader organization.

3. Point out scheduling and risks

Virtually all working individuals know that it’s unrealistic to think everything happens only in ideal scenarios. Data Scientists should take the necessary time to consider scheduling resources and the inherent risk they could encounter in the project. Give the business the comfort that only a trusted advisor can provide them. Think through what could happen and what you would recommend to them if they encounter turbulence – because turbulence is inevitable. Taking this extra step is the hallmark of a Data Science professional.

Summary

Do not view the Data Science project plan as training wheels for a junior Data Scientist who is new to working with business, but rather as something a skilled Data Scientist will review each time his or her team begins a new task within the Data Science project. Crafting a Data Science project plan to pacify the business – and never utilizing it for team guidance – is a grave mistake that one day could end in ruin for the Data Science team, the business, or both. An effective Data Scientist will work from the perspective that a goal without a plan is simply a wish and nothing more. Or, said differently, an effective Data Science team works a plan at all times.

[Originally posted on LinkedIn]

A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column

ABSTRACT:

Each year it has become more and more difficult for healthcare providers to determine if a patient has a pathology related to the vertebral column. There is great potential to become more efficient and effective in terms of quality of care provided to patients through the use of automated systems. However, in many cases automated systems can allow for misclassification and force providers to have to review more cases than necessary. In this study, we analyzed methods to increase the True Positives and lower the False Positives while comparing them against state-of-the-art techniques in the biomedical community. We found that by applying the studied techniques of a data-driven model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized in the current research community.

Research Article:

Mingle D (2015) A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column. Biomedical Data Mining 4: 114. doi: 10.4172/2090-4924.100114

Creating Value for Business: 2 Data Science Questions You Must Ask from the Start

Decisions in Data Science

Business goals are no doubt important, but in an analytic project it makes sense to balance the organization’s goals with those of the Data Science department. Most individuals will recognize balance as a principle of art, but the notion of creating a sense of equilibrium between the business and the Data Scientist is just as foundational in today’s insight economy. To not cultivate this balance is to invite ruin into the organization.

Question 1: What are the Data Science Goals?

As a Data Scientist working in an organization, it is important to understand how the intended outputs of the Data Science project enable the achievement of the business objectives. Imagine a situation where a business has a set of defined goals, but the analytics team has a different target in mind, or vice versa. The result is extra cost, time delay, and missed business opportunities. Unfortunately, these sorts of happenings are more common than you would imagine in everyday business – in organizations big and small. As a Data Scientist serving a business, it is prudent to define your goals in tandem with the business objectives and obtain buy-in on your interpretation. This can be done by explicitly documenting what you expect the output to be like and confirming its usefulness to the business unit you are supporting.

Question 2: What are the Data Science success criteria?

Businesses should work with Data Scientists who know how to precisely define a correct outcome in technical terms. In truth, it could prove important to describe these outcomes in subjective terms; however, if this ends up being the case, the person in charge of making these subjective judgments needs to be identified. Neither the business nor the Data Science department will succeed with a moving target. Transparency and visibility are always good things in business. This allows individuals to manage towards a known expectation.

Organizations working with Data Scientists who simply have technical know-how are missing out on significant value within their analytic projects. Organizations should seek out professionals who know how to translate business concepts into analytic outcomes. This skill should be considered primary over knowing the most advanced techniques and methods for analyzing data. Unfortunately, most organizations are still on a discovery mission with regard to what they need from Data Science. Organizations remain beholden to the idea that if they hire a Ph.D. in some highly analytical field, then success is just around the corner. This is rarely the case. In fact, most Ph.D.s need significant time to warm up to the corporate culture and learn the language of business before they can be fully effective.

It may seem obvious to the organization, but having your analytic superhero be able to quickly judge the type of Data Science problem you are asking them to contribute to is paramount to pulling it off. Typically, being able to specify whether the target is a classification, description, prediction, or clustering problem works well for all involved and starts to build context across disciplines in the organization. This becomes especially important as a Data Science department begins to grow and less experienced Data Scientists learn to see more like senior Data Scientists; this can only happen with intentionality and purpose.

Organizations should come to expect that one way a good Data Scientist will often demonstrate his or her ability is by reframing or redefining the problem put before them by the company. The first few times this may seem off-putting, but organizations that learn to embrace this sort of transformation of the business problem will be able to compete for the future. Practically speaking, this may look like shifting to “medical device retention” rather than “patient retention” when targeting patient retention delivers results too late to affect the outcome.

As a business concerned with the ROI from your Data Science investment, you will undoubtedly want to see activities of the Data Scientist which specify criteria for model assessment. These typically present themselves as model accuracy or performance and complexity. In many cases, it is indispensable to see that a Data Scientist has defined benchmarks for evaluation criteria. Even in the case of subjective assessment, criteria definition becomes important. At times it can be difficult to meet a company’s Data Science goal of model explainability – or data insights provided by the model – if the Data Scientist has not done a good job of uncovering this as a business need. So, the adage “begin with the end in mind” should prompt the Data Scientist to ask an appropriate series of questions of the business to ensure value creation.

Summary

Remember that the Data Science project success criteria are without a doubt different from the business success criteria. Any Data Scientist with experience will say that it is always best to plan for deployment from the beginning of a project. If the organization sees a Data Scientist not following this best practice, expect spotty results and a bit of frustration from business counterparts. As an organization, it is vital to push your Data Scientist to work hard and be assertive within the project – as well as to use their mind and imagination. This should give him or her permission to shape the future your company desires.

How Is Knowing the Business Important to Data Science?

Businesses around the world are involved in a multitude of projects at any given time. As Data Scientists come into the business fold, it becomes more important with each passing day to have both parties – “the business” and “the Data Scientist” – begin to define successful strategies of working together. Businesses are having to become aware of the techniques and methods of a Data Scientist in order to maximize their analytic investments; and, simultaneously, Data Scientists are having to learn how to be relevant to an organization that is in a constant state of change. From a business perspective, knowing what to expect of a Data Scientist and having that Data Scientist develop a reasonable Data Science workflow can create huge competitive advantage over other companies who are lost at the “Data Science Sea.”

Our Business Conditions, Today

Performing a bit of journalistic investigation into the organization’s business situation will help provide a Data Scientist with the necessary context for their Data Science project right off the top. Getting background facts on the business will help the Data Scientist know what he or she is getting involved in – in the truest sense. This may not be obvious to the Data Scientist at first, but learning background facts about the business helps to uncover details that will round out one’s understanding of what the business has determined it needs as it relates to the Data Science project. Through this process, information on identifying resources most certainly bubbles to the surface. The takeaway: even if a Data Scientist has worked at the organization for years, this critical step should not be skipped. The business background is a dynamic concept that speaks to the circumstances or situation prevailing at a particular time – it should not be looked at as part of a one-and-done process. Data Scientists should be careful not to fall into the trap of believing that nothing has changed since the last Data Science project.

It Doesn’t Matter What the Business Wants – I Can Model Anyway!

Many Data Scientists forget the essential step of learning about the business from the business’s perspective. Since the business is the customer of the Data Scientist, this can be easily boiled down to “What does the customer truly want to accomplish?” This simple but straightforward question may seem frivolous to an inexperienced Data Scientist, but getting at what the business objectives are for any Data Science project will create a necessary roadmap for moving forward. The fact of the matter is that most businesses have many competing objectives and constraints that have to be properly balanced in order to be successful on a day-to-day basis. As the Data Scientist, one of your primary aims in ensuring a successful Data Science project is uncovering important, possibly derailing factors that can impact outcomes. Data Scientists should not advance the project workflow on the basis of their analytic talent alone, but rather take the time and necessary steps to learn the business objectives; otherwise, a Data Scientist runs the risk of being seen as a rogue employee with irrelevant results. At the end of a Data Science project, everybody can see clearly when a Data Scientist has come up with the right answer to the wrong problem. A Data Scientist with half the analytic skill can be more effective to an organization than a Data Scientist who squeezes every last bit of information gain from a dataset, but does not know how to frame the business problem.

What Do You Mean I Missed The Target?

As a Data Scientist who operates in business, you should want to know what it takes for your Data Science project to be successful. However, this cannot be only about the evaluation of predictive models or how a Data Scientist designs experiments; it must also include how the business will judge success. Learning how to frame the business success criteria in the form of a question – and whether the criteria will be judged subjectively or objectively – will help a Data Scientist pinpoint the true target. An example of a business success criterion that is specific and objectively measurable would be “reduction of patient readmissions to below 19%.” An example of a business success criterion that is more subjective would be something like “gives actionable insights into the relationships in the data we have.” However, in this latter case, it only makes sense for the Data Scientist to ask who is making the call on what is useful and how “useful” is defined. Bottom line: if Data Scientists do not know what the business success criteria are for a Data Science project, they have already failed before the project has begun.

Summary

Having a solid business understanding of a Data Science project will prove valuable for both the Data Scientist and the business. A real-world Data Scientist should not operate as an island. In reality, Data Scientists need to learn to speak many languages beyond Python, R, and Julia; they should also learn to speak “business.” The better a Data Scientist can understand the business milieu, the business objectives, and how to measure the success of a Data Science project in the eyes of the business, the more effective that Data Scientist will be for an organization.

7 Questions Every Data Scientist Should Be Answering for Businesses


Business professionals of all levels have asked me over the years what it is that they should know that their Data Science departments may not be telling them. To be candid, many Data Scientists operate in fear, wondering what they should be doing as it relates to the business. In my judgment, the questions below address both parties with the common goal of a win-win for the organization: Data Scientists support their organization as they should, while business professionals become more informed with each analysis.

What problem are we trying to solve?


It is important to be able to answer this question in the form of a sentence. Remember that the business end-user most likely does not use common terms like CV, logistic regression, or error-based learning in their everyday business routine. It does not help anyone when a Data Scientist hides behind fancy terms instead of providing actionable insight that moves the organization along. I can assure you that translating the Data Science jargon into something digestible for the business professional will create many allies. After all, a Data Scientist should have the primary skill of being able to transform complex ideas and make them readily understood.

Does the approach make sense?


In truth, this may be the single best question that benefits the Data Scientist even though it is asked primarily of the business professional. Learning to write out an effective analytic plan can have profound meaning. Writing is a discipline that should be embraced by the Data Scientist. It allows the Data Scientist to synthesize his or her thoughts. Although we live in a day and time where technology is at the center of everything we do, we should remember that technology, Data Science, and statistical computing are not replacements for critical thinking.

Does the answer make sense?


Can you make sense out of what you have found? Do you know how to explain the answer you have received? Your organization is counting on you to be the translation piece between the computer output and its business needs. Remember: computers simply do what they are told. As Data Scientists, we need to be sure we directed them to do the right thing. Validate that the instructions you gave were the ones you intended. Be scientific in your approach, document your assumptions, and be sure you have not introduced bias into your work.

Is it a finding or a mistake?


Not everything is a Eureka! moment. So, make skepticism a discipline as a Data Scientist. One should always be skeptical of surprise findings. Experience should tell you that if it seems wrong, then it probably is wrong. Do not blindly accept the conclusion your data presents to you. Again, there is no substitute for critical thinking. Make absolutely sure you understand, and can clearly explain, why things are the way they are – whether a finding or a mistake.

Does the analysis address the original intent?


Unless you are surrounded by other Data Scientists in your organization, this question requires accountability to one’s self. You should be honest with yourself, always ensuring that you are not aligning the outcome with the expectations of the organization. It may be obvious to note, but it is critical to speak the truth of the data, realizing that sometimes the outcome does not align with the question the business is seeking to answer. However, if your analysis amounts to something unflattering to the organization, be sure you are 100% confident in your findings. In this situation, more analysis is better than less. Giving an analysis that does not reflect well on the business – and that is not well substantiated – may very well be your last.

Is the story complete?


We would agree that the best speakers, writers, and leaders are all good storytellers; it is no different for the Data Scientist. While storytelling is not the only way to engage people with your ideas, it is certainly a critical part of the Data Science recipe. Do your best to tell an actionable story. Resist the urge to rely on your business audience to stitch the pieces of your data story together. After all, your analysis is too important to leave up to wild interpretations. Take time to identify potential holes in your story and fill them appropriately to avoid surprises. Grammar, spelling, and graphics matter; your audience will lose confidence in your analysis if your results look sloppy.

Where would we head next?


As Data Scientists we should realize that no analysis is truly ever finished – we simply run out of resources. It is worth the effort for a Data Scientist to understand and be able to explain what additional measures could be taken if the business was able to provide additional resources. In simple terms, the business professionals you work with, at the very least, will need to have that information so they can decide if it makes sense to move forward with the supplemental analysis.

Summary

It is key to remember that Data Science techniques are tools that we can use to help make better decisions for an organization and that the predictive models are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address – and then ensure that it does just that. These seven questions begin to form the bond of a stronger partnership between the data science department and the organization.

Introduction to Inference and Learning

Many of my subscribers have asked for resources to help get them on a path toward a better understanding of inference and learning. Since individuals have various learning styles, there are both reading and video options (I would recommend both).

  • Book: Murphy — Chapter 1 — Introduction
  • Book: Bishop — Chapter 1 — Introduction

Books mentioned above:

Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, MIT Press, 2012.

Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006. An excellent and affordable book on machine learning, with a Bayesian focus. It covers fewer topics than the Murphy book, but goes into more depth on the topics it covers.

If you have resources that you think that I missed, please let me know. If there is a resource that you particularly enjoyed I would like to hear from you as well.

Getting Started with Machine Learning

In truth, I am an advocate for jumping in head first and using what you learn in real time. Practically speaking, this means learning less up front about all the theory and heavy math behind what you are using, with the attitude that you will move toward understanding as you go.

Do you know how to program in a specific language? If so, then determine if that language has a library which can be leveraged to aid you in your machine learning journey.

If you do not know how to program, that is okay too. Survey a few languages (R and Python are popular among data scientists), see if one is more understandable to you, and then go down the same path… seeking a machine learning library.

Shhh, it’s a Library

No Programming Necessary
  • WEKA – you can do virtually everything with this workbench: pre-processing the data, visualizing the data, building classifiers, and making predictions.
  • BigML – Like WEKA, you will not have to program with BigML. You can explore model building in a browser. If you are not certain about machine learning (or data science, for that matter), this would be a great place to start.
R (Statistical Computing)
  • If you really enjoy math and have not picked a language yet, then this may be for you. There are a lot of packages here, developed by pioneers in the field, which you can leverage without having to refactor any code. All packages come with instructions – giving you some of the theory and example cases to see in action. In my judgment, learning this language allows you to explore and prototype quickly, which most certainly will prove valuable.
Python
  • Scikit Learn – If you enjoy Python then this library is for you. This library is known for its documentation, which allows you to rapidly deploy virtually any machine learning algorithm (see the sketch after this list).
Octave
  • Octave is the open-source version of MatLab (some functions are not present). Like MatLab, Octave is known for solving linear and non-linear problems. If you have an engineering background, then this might be the place for you. Practically speaking, though, many organizations do not use Octave/MatLab, as they are seen as primarily academic software.
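
To make this concrete, here is a minimal sketch of what a first scikit-learn experiment can look like – a bundled dataset and a stock classifier, nothing more. The dataset and model choices are purely illustrative.

    # A minimal scikit-learn sketch: load a bundled dataset, hold out a
    # test split, train a stock classifier, and score it.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("Held-out accuracy:", model.score(X_test, y_test))

Swapping in a different estimator is a one-line change, which is exactly the kind of rapid exploration that makes committing to a library worthwhile.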

No matter what you pick, decide to use it and stick with it for a while. In fact, I would commit to it for the next 12 months. Actually use the language/library you choose; do not just read about it.

Learning Online

If you are really a beginner, you may want to steer clear of some of what you see online. Many people I talk to like the idea of data science and machine learning and decide to sign up for an online course. The problem they encounter is that in many cases they already have to know how to program (to some degree), and they should know linear algebra and probability theory.

If you do decide to watch classes online, then you should absolutely take notes (even if you toss them later). The key is to participate – which may sound obvious, but when you are at home in your pajamas learning about data science, it is not quite so obvious.

That being said, there are some really good (and free) online lectures out there – do not be overwhelmed.

Research Papers

This may not be your thing either; not everybody likes to pick up a research paper to read. Many individuals complain that the reading is a bit too academic and does not lend itself to really conveying insight to the reader (which is the opposite of the intent of the paper). To be candid, some are written better than others; in many cases that has to do with the topic or the time period in which the paper was written. However, there are a few seminal papers you should be acquainted with that will allow you to gain context for machine learning and data science, which should prove invaluable in your journey. My encouragement to you is to find these papers, and if you are not ready to read them because you are still building skills in other areas, simply hold on to them and test-read them every 3 months. See how far you get without getting lost, see if you understand what you are doing at a deeper level when you are coding a solution for having read the paper, and best of all, read the reference page – find out who influenced the paper you read.

Machine Learning Books for those Just Starting

Let’s face it: there are not a lot of books out there that aim to aid those just starting out in machine learning. As before, the expectation is that you will have some linear algebra or probability theory down pat. Unless you come from the hard sciences (mathematics, engineering, biostatistics, etc.), you probably will have to do some skill building here even before reading most of the books in the marketplace. However, there are a few that approach the true beginning most people are at and encourage those of you willing to try on your own.

Curious to know your thoughts on the above. Have you used any of these resources? Do you have any that you would recommend?

How to Become a Data Scientist

How does one become a data scientist?

Well, in truth, the path is most certainly clear. However, the work it takes to travel down the road is not for everyone. Before reading this, you may want to have an understanding of where you are with your current analytic skills (e.g., MS Excel only, maybe a little bit of SQL, Crystal Reports, etc.). Use the rest of this article as a measuring stick for where you are and where you would like to go. In fact, it is best to begin with the end in mind and work backwards to the most basic skill you will need and start building from there…

Recently, DataCamp posted an infographic describing 8 easy steps to becoming a data scientist.

A portion of the infographic posted on the DataCamp blog

What is a Data Scientist?

It’s important to understand what this infographic is based on:

  1. Drew Conway’s data science Venn diagram, which combines hacking skills, math and statistics knowledge, and substantive expertise.
  2. A graph showing survey results on the question of education level, not unlike the graph in O’Reilly’s Analyzing the Analyzers.
  3. Josh Wills’ quote on what a data scientist is.

Become a Data Scientist

Using the infographic, the 8 steps to becoming a data scientist are:

  1. You need to know (there is a spectrum here) stats and machine learning. The fix – take online courses for free.
  2. Learn to code (not everything, but very specific things). Get a book or take a class (online or offline). Popular languages are Python and R in the data science space.
  3. You should understand databases. This is important because for the most part this is where the data lives.
  4. Critical skills are data munging (data clean-up and transformations), visualization, and reporting.
  5. You will need to Biggie-Size your skills. Learn to use tools like Hadoop, MapReduce, and Spark.
  6. This part is extremely important – get experience. You should be meeting with other data scientists at meetups or talking with people in your office about what you are learning and accomplishing with your enhanced skills. Do yourself a favor: obtain a data set online and start exploring it with your newfound techniques. I recommend Kaggle and CrowdAnalytx for interesting data sets.
  7. Get yourself one of these: an internship, a bootcamp, or a job. You can’t beat real experience.
  8. Know who the players are in this space and why. Follow them, engage with them, and be a part of the data science community.

My thoughts…

In my judgment, look at the data and the algorithms first, then get busy with the math and programming. However, I do agree with the idea of moving through steps 1-5 for familiarity’s sake with the discipline. Steps 6-7 I would categorize as working the problem, and the final step as plugging into a community.

It may be important to go another step forward. 

It is more intuitive to minimize steps 1-5 into one (this could be a crash course of terms and themes relevant to data science). My preference (it’s what has worked for me) is to jump in with the data and the tools of the trade as soon as possible. More people need to develop just-in-time learning mechanisms, rather than learning the entire universe of a topic. Approaching data science in this way allows an individual to build on a combination of theory and practical experience. This is done by encountering problem sets over and over again.

Learn the art of relevance…what makes sense for my situation right now. Obtain a solid data set and get learning. This sort of action works to build context for the tools you are using.

The fastest way to become a data scientist is to recognize where you are with your current skills, grab a data set, pick a language (R, Python, Julia, C++, MATLAB, etc.), and start working through a problem end-to-end.

What do you think it takes to be a data scientist?


Seizure Detection in EEG Time Series

I had a wonderful opportunity to work with Eben Olson of Yale University on a problem data set provided by The Mayo Clinic. He and I did a write-up on what we learned during the process and hope that it helps others in their knowledge discovery surrounding seizure detection.

1 Introduction
We describe here the methods used in preparing our submission to the UPenn and Mayo Clinic Seizure Detection Challenge, which obtained second place on the private leaderboard. We present in detail our final and most successful approach, as well as an overview of less successful experiments which also contributed to our final ensemble or provided some insight. It is assumed that the reader is familiar with the structure of the challenge and the data, described at http://www.kaggle.com/c/seizure-detection.

2 Early approaches
2.1 Spectrograms
Our initial feature extraction method calculated spectrograms of each EEG trace, in an attempt to capture both frequency content and temporal dynamics. Each clip was first resampled to 500Hz, and the short-time Fourier transform was applied, discarding phase information. Spectrograms were flattened into vectors, and mean subtraction and normalization were applied on a per subject and per feature basis. Features from each channel were concatenated, and logistic regression or random forests were used for classification. Our best (ensembled) submission with these features scored 0.94081 on the public leaderboard.
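
As a rough sketch of this pipeline (not our competition code; the clips array, original sampling rate, and labels below are stand-ins for data loaded elsewhere):

    import numpy as np
    from scipy.signal import resample, spectrogram
    from sklearn.linear_model import LogisticRegression

    def spectrogram_features(clip, fs, target_fs=500):
        # clip: (n_channels, n_samples) EEG traces for one clip.
        clip = resample(clip, int(clip.shape[1] * target_fs / fs), axis=1)
        feats = []
        for channel in clip:
            # Short-time Fourier transform; keep magnitude, discard phase.
            _, _, Sxx = spectrogram(channel, fs=target_fs, nperseg=256)
            feats.append(Sxx.ravel())
        # Concatenate the flattened spectrograms across channels.
        return np.concatenate(feats)

    # clips and y (ictal = 1, interictal = 0) are assumed loaded elsewhere.
    X = np.vstack([spectrogram_features(c, fs=500) for c in clips])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)  # per-feature normalization
    clf = LogisticRegression().fit(X, y)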

2.2 Scattering coefficients
As an alternative to spectrograms, we attempted to use scattering coefficients[2], a framework for time-frequency analysis which has been shown to give good results in audio classification tasks[1]. We used the ScatNet MATLAB toolbox to compute scattering coefficients for each clip, after resampling to 500Hz. Coefficients for each channel were concatenated and logistic regression was used for classification. Only a marginal improvement (0.94212 public leaderboard) was seen over spectrogram features.

2.3 Maximal cross-correlation
We next considered maximal cross-correlation, which has been reported to produce useful features for detection of epileptic EEG activity[4]. This method attempts to compensate for propagation delays of brain activity by computing cross-correlation between channels at various lag times and taking only the maximum value, normalized by the channel autocorrelation. We obtained a substantially worse score (0.86761 public leaderboard) with this method. However, review of the code indicated that this may have been due to a bug in the feature calculation, and further investigation of this method may be valuable.
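
In sketch form (a clean reimplementation of the published idea as we understand it, not our buggy competition code):

    import numpy as np

    def max_cross_correlation(x, y, max_lag):
        # Maximal cross-correlation of two traces over lags in
        # [-max_lag, max_lag], normalized by the lag-zero autocorrelations.
        x = x - x.mean()
        y = y - y.mean()
        norm = np.sqrt(np.dot(x, x) * np.dot(y, y))
        best = 0.0
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                c = np.dot(x[lag:], y[:len(y) - lag])
            else:
                c = np.dot(x[:lag], y[-lag:])
            best = max(best, abs(c) / norm)
        return best

    # One feature per channel pair for a clip of shape (n_ch, n_samples):
    # feats = [max_cross_correlation(clip[i], clip[j], max_lag=250)
    #          for i in range(n_ch) for j in range(i + 1, n_ch)]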

3 Final approach
3.1 Feature extraction
Our final approach to feature extraction calculated the covariance matrix of the EEG data, in order to capture correlations between channels. Since seizure activity is characterized by increased long-range synchronization of neuronal activity, this was expected to produce informative features. Matrices were individually normalized to zero mean and unit variance. As frequency analysis had been shown to be valuable, rather than compute a single covariance matrix we first filtered each trace with several bandpass filters. We initially applied four filters covering the range 1-200Hz. Filter choice presents a complicated trade-off between frequency selectivity, signal to noise ratio, and output dimensionality. Performance was evaluated by cross validation of logistic regression predictors. While attempting to manually optimize the filter parameters, we found that filters chosen for one subject could perform extremely poorly on others. We therefore performed an automated filter selection step, in which combinations of up to four filters were evaluated on each subject. These filters were chosen from a bank of 10 partially overlapping, approximately log-spaced bandpass filters covering the range 5-200Hz. The three combinations which gave the highest CV values were retained.
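
A sketch of this feature extraction follows; the band edges shown are illustrative only, standing in for the bank of 10 partially overlapping, approximately log-spaced filters described above:

    import numpy as np
    from scipy.signal import butter, filtfilt

    def bandpass(trace, low, high, fs=500.0, order=4):
        # Zero-phase Butterworth bandpass filter.
        b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
        return filtfilt(b, a, trace)

    def covariance_features(clip, bands, fs=500.0):
        # clip: (n_channels, n_samples). One covariance matrix per band,
        # each normalized to zero mean and unit variance.
        feats = []
        for low, high in bands:
            filtered = np.vstack([bandpass(ch, low, high, fs) for ch in clip])
            cov = np.cov(filtered)
            feats.append(((cov - cov.mean()) / cov.std()).ravel())
        return np.concatenate(feats)

    bands = [(5, 15), (15, 40), (40, 100), (100, 200)]  # illustrative edges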

3.2 Neural network classification
As an alternative classification strategy, we experimented with the use of multilayered neural networks. Our initial motivation was the possibility of learning a cross-subject mapping which would allow our model to use the full training set to improve its predictions. While this goal was not realized, we did find that the NN models provided a boost over logistic regression. Our software was based on dnn.py, a recently released demonstration of a deep neural network written in Python. This provided an excellent framework which was simple to adapt to our problem. We tested a number of network architectures, but found that a network with two hidden layers of 200 and 100 units respectively gave good results while being reasonably quick to train. Rectified linear units were used in the hidden layers and logistic regression in the output layer. Dropout of 0.5 was used in the hidden layers for regularization. All networks were trained with the adadelta method for 100 epochs. Multiple networks were trained for each subject and filter combination. In an attempt both to increase diversity and to reduce the impact of dissimilar electrode patterns across subjects, each network was trained on a 12-channel subset of the full covariance matrix. We found that depending on network architecture, predictions would become extremely compressed into the neighborhoods of zero and one. To avoid potential issues with numerical precision, we applied a logarithmic rescaling to predictions in the (0,0.1] and [0.9,1) ranges.
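
The rescaling step can be made concrete. The mapping below is a sketch of the intent rather than our exact transform; any strictly monotone remapping works, since AUC depends only on the ordering of predictions:

    import numpy as np

    def rescale_predictions(p, eps=1e-12):
        # Spread out predictions compressed near 0 and 1. Values in
        # (0, 0.1] and [0.9, 1) are remapped logarithmically; the mapping
        # is monotone and continuous at the 0.1 and 0.9 boundaries.
        p = np.asarray(p, dtype=float)
        out = p.copy()
        lo = (p > 0) & (p <= 0.1)
        hi = (p >= 0.9) & (p < 1)
        out[lo] = 0.1 / -np.log10(p[lo] + eps)
        out[hi] = 1.0 - 0.1 / -np.log10(1.0 - p[hi] + eps)
        return out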

3.3 Early seizure prediction
Our best scores were obtained by submitting the same values for p_early and p_seizure, rather than trying to train separate classifiers for early ictal events. This phenomenon was reported early in the competition by the user Alexandre. We observed a similar trend in our cross-validation testing, and believe it is explained by the combination of the AUC metric and the imbalanced classes of the data set, which leads to a much larger penalty for false negatives than false positives. At least for the classification strategies we employed, the error due to training on the “wrong” labels was outweighed by the benefits of a larger training set. However, post-deadline testing showed that our error on the early detection task was several times higher than on the seizure detection task. Improvements could potentially be obtained by fine-tuning the trained networks with early ictal data alone, or by applying other techniques for learning from noisy labels.

4 Ensembling
We operated according to this learning bias: select the simplest model given that the theories are consistent with the data. We attempted to produce strong learners that could be arbitrarily accurate and weak learners that were more accurate than random guessing. In blending our models, our process was to start with uniform weighting; however, through each step of learning we decreased the weighting of the models that were not correctly learned by the weak learner, and increased the weighting of the models that were correctly learned by strong learners.[3] In each ensemble we attempted to create more diversity of opinion, independence, decentralization, and aggregation. Our objective was to discover the best model from many classifier models with similar training/test errors. Selecting a model at random, we would have risked the possibility of choosing a weak learner. However, we observed that combining them (averaging) presented us with the possibility of avoiding a poor decision. The idea was that every time we ran a model iteration, we would encounter different local optima. However, combining the model outputs would allow us to find a solution that is closer to the global minimum. Since it is possible that the classifier space may not contain the solution to a given problem, we pursued an ensemble of such classifiers that could contain the solution to the given problem. For example, in the case of a linear classifier that cannot solve a non-linearly separable problem, combining linear classifiers may solve a non-linear problem.
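
In code, the final blending step reduces to a weighted average of per-clip predictions across models. A sketch, with the reweighting schedule described above summarized as a plain weight vector that starts uniform:

    import numpy as np

    def blend(prediction_sets, weights=None):
        # prediction_sets: (n_models, n_clips) array of per-clip probabilities,
        # one row per trained model. Averaging across models smooths over the
        # different local optima that individual training runs settle into.
        preds = np.asarray(prediction_sets, dtype=float)
        if weights is None:
            weights = np.ones(preds.shape[0])  # start from uniform weighting
        return np.average(preds, axis=0, weights=weights)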

5 Technical details
5.1 Software

All of our experiments, with the exception of scattering coefficient extraction, were carried out using Python and IPython notebooks. The numpy, scipy, and scikit-learn packages were used extensively. Theano (https://github.com/Theano/Theano) is required by the neural network module. Our repository, available at https://github.com/ebenolson/seizure-detection, contains a full list of required packages and instructions for running our example code. We will also give a brief description here:
• Data Preparation.ipynb This notebook loads all clips for each subject, applies bandpass filters and calculates covariance matrices, then performs the filter selection step. The filtered covariance matrices are saved, along with the clip labels and filenames, to a pickle file in the data subdirectory.
• Train Classifiers and Predict.ipynb This notebook loads the preprocessed data and trains multiple networks on each input file. The predictions of each network are saved to the output subdirectory.
• Postprocessing.ipynb This notebook loads predictions of the trained networks and combines them to produce submission.csv, a submission in the required format.
• Miscellaneous.ipynb This notebook contains code snippets implementing some of our earlier approaches.
• simplednn/ This subdirectory contains the neural network module.

5.2 Hardware and runtime
Most computations were done using Amazon EC2 m3.2xlarge instances with 8 virtual cores and 30GB RAM. Using spot requests, EC2 provides an affordable platform which can be easily scaled and parallelized depending on the memory and processing power required. Some experiments were also done using a Core i5 quad core desktop with 16GB of RAM. On the desktop, preprocessing the full data set requires approximately one hour and training one iteration of networks (3 per subject) requires 13 minutes. The time required to generate predictions is negligible.

References
[1] Joakim Andén and Stéphane Mallat. Multiscale scattering for audio classification. 2011.
[2] J. Bruna and S. Mallat. Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566, June 2011.
[3] Giovanni Seni and John Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool, 2010.
[4] P.W. Mirowski, Yann LeCun, D. Madhavan, and R. Kuzniecky. Comparing SVM and convolutional networks for epileptic seizure prediction from intracranial EEG. In IEEE Workshop on Machine Learning for Signal Processing, 2008. MLSP 2008, pages 244–249, October 2008.

Do I Have to be Great at Math?

In many cases, individuals try to start out knowing all the math behind data science before being practical with the science. The majority stop here before they even get started – in fact, they believe in data science (they are attracted to it), but they simply do not feel adequate because the math can get in the way.

Bottom line: you can go far in machine learning without understanding all the math.

Just Start

Think back to when you started something that you did not understand completely. You made use of it anyway. It probably was terrible, but you stumbled your way through, and on the other side you had much to be proud of. It was a slow process, but your passion for whatever you were involved in carried you through the sticking points. As the complexity of whatever you were interested in grew, so did your desire for knowledge, and so you were able to overcome the limitations that could have been perceived to be in your way.

The truth is, if everyone had to start machine learning by learning about Nonnegative Matrix Factorization or Perturbation Bounds for Eigendecompositions, then very few individuals would have their passion ignited.

Copy-and-Paste Heaven

You may not believe me, but one of the best things that can happen to you early on in your data science journey is for things to fail. Let’s face it: when something breaks, you can do one of two things: 1. walk away, or 2. fix it. Many individuals focus on copy-and-pasting code from GitHub, a data challenge, or a code cookbook simply to see if the code can work on their machine. In fact, they simply want to recreate the results, but there is a point where you will want to extend the code. Maybe it doesn’t do everything you need, or the data type is not the same as what you are working on, or maybe you want to tune the model – in all these cases you will have to move beyond copy-and-paste code.

When something breaks, or you decide to enhance what you have, you will almost certainly have to break down the code line by line and understand what it is doing. In doing this, you may not realize it, but you are building confidence in what you are working on (as an aside, you should apply this same technique to the math formulas you encounter).

Where’s My Geiger Counter?

  1. Be a Navigator – work well with tools like scikit-learn, R, WEKA
  2. Be a Librarian – use programming libraries that give you small algorithms
  3. Be a Doer – put what you are learning directly into what you are doing

You only need to know what you can use today. This approach is highly practical and efficient.

Find some dataset or problem that you are interested in working with and begin to work systematically through the results. Document it for yourself – categorizing your thoughts will help you crystallize what you are learning. In time you will begin gaining an energy within your learning which will prompt you to seek new algorithms, understand the parameters behind them, and solve more complex problems.

This type of process, like a Geiger counter, points you to what you need (and nothing more). It is highly customized and meaningful to the individual who adopts it. The Geiger counter approach should be juxtaposed with an approach completely grounded in theory. Theory alone is generally not sufficient to move beyond a prototype solution.

The secret to math is that you need to know what you are good at and what you are not. Know where you need to improve. When the time is right, pursue that knowledge. Having a problem that you are working on and needing more knowledge around a math concept can go a long way in cementing the idea in your mind for future problem sets.

You Really Do Need Math, Later

Since you will have to know linear algebra eventually, why not use machine learning to teach it to you through a problem that interests you? You will gain more intuition at a deeper level than any textbook introduction you may receive. In short, there is a path available to all individuals interested in machine learning but intimidated by math concepts. Keep in mind that you can overcome the overconfidence that comes with a poorly understood algorithm by deciding to understand the math behind any algorithm you implement.

I am sure some data science purists may disagree with what I have stated here; leave a comment and let me know what you think.