Google Is Rethinking Its Business. What About You?

Originally published on DataRobot.com.

Google recently started promoting automated machine learning (what they call AutoML) – something that DataRobot customers have been doing for years. So what is driving Google to now embrace this transformative approach? Guest blogger Damian Mingle explains. 

You might be asking yourself why Google, a company with over $600 billion in market cap, is rethinking its business, and if it has any relevance to your own business. In truth, Google is a business like any other, but beyond its P&L it represents an attitude towards disruption with technology. While there is considerable evidence of this, here are a few recent examples:

  • Google search now ranks differently using machine learning
  • Google Maps Street View automatically recognizes signs
  • Video calling uses machine learning for low-bandwidth situations

As a business person, it may be difficult to connect with a few esoteric statements about Google, so let me create a scenario for you to consider. Which organization do you think is more likely to succeed in today’s business environment – Business A or Business B?

  • Business A: an organization that uses a 30-day lag report on the number of items sold to indicate both how they are doing and what they should do next.
  • Business B: an organization that uses an automated machine learning technology, like DataRobot, in a way that they can predict the likelihood of what they will sell next month, predict price fluctuations, use their website data to identify client questions which have the same intent to provide the optimal response, predict which products a customer will purchase again, predict which recommended content each user will click, pair products with people every time they visit their site, maximize sales and minimize returns based on transaction data, predict the likelihood of which clients may leave them and what month that will happen over the next year, and combine various data sources with their own organic data to allow for the business to have increased context for decision making.

My hope is that you see the value in Business B.

Like business, technology has entered a new era. This new era allows you to reframe old problems in new ways at a rapid pace and without requiring the pedigree of a Ph.D. from an Ivy League school.

Let me explain:

Most companies understand and make use of programming, which often exists to relieve individuals of performing rote tasks. Business has benefited greatly from this and will continue to into the future. However, nowadays we are hearing more and more about machine learning as a complement to programming. The difference between programming and machine learning is that machine learning allows computers to learn how to best perform these rote tasks, and that optimization is significant. Additionally, because of the early success of machine learning, companies like Google have put machine learning and artificial intelligence front and center in their organizations and have created momentum for the technology they call AutoML. Automated machine learning allows computers to go one level further by automating the optimization of how to best perform these rote actions.

Remember our vignette above about which organization will likely succeed in today’s business environment, Business A or Business B? Let me share with you what so many organizations (both big and small) tell themselves that keeps them stuck in the position of Business A, and my responses to those excuses:

Excuse: “Our data is a mess. It’s not in the right format and is not well documented.”

Response: Maybe true — figure it out or chances are you will be out of business in 3 years.

Excuse: “We don’t have the right data.” 

Response: I have worked for years with domain experts who often are trying to run their business straight from the gut. Don’t get caught in the trap of thinking you know what the right data is. If you do this, you effectively constrain all the technology we are talking about so that it fits how you see and reason about the world.

Excuse: “Our culture is not right for machine learning or automated machine learning.”

Response: Change the culture with machine learning. Start with the things you measure in a business intelligence report. For whatever reason, these reports have proven to be important to the business and you most likely have data that can be used to fuel the growth of your business.

Excuse: “The timing is not right; I have other initiatives.”

Response: Stop and think for a moment. Do you think your competitor is worried about initiatives like paving the parking lot or figuring out a new coffee vendor? Be open to the idea that your current initiative is the last thing that will actually help your business survive the current business environment.

Excuse: “We are in the wrong industry.”

Response: It’s clear some industries are slower to adopt change than others. If you are in a fast-adopting industry, you have no time to lose – get busy using this sort of technology. If you are in an industry that’s slow to adopt change, disrupt away!

Excuse: “We don’t have access to that sort of talent.”

Response: What used to come out of the ivory towers after years of research from world experts is now more commonplace than you think. In fact, technology like automated machine learning makes it possible for you to invert your current business hierarchy and allows you to tap into more of the organization’s human capital. With automated machine learning technology, you can move many of your individuals from “doer” roles to more “thinker” roles. Your business is most likely rife with these thinker individuals already; you just aren’t making use of machine learning.

The Question You Have To Answer

The question that your organization is answering each day is, “How will I compete in the new world?” Just so we are clear, the answer is binary: “I will have a future” or “I will not have a future.” The takeaway is that your competitor is most likely unknown to you: she is 17 years old, working with messy data, not dealing with culture in the least, and, instead of having a Ph.D., is using automated machine learning to disrupt an industry out of her parents’ basement. What you decide to do is up to you, but the question cannot go unanswered.

Augmented Intelligence: The Better AI

This article was originally posted on Health Management Technology.

Artificial intelligence (AI) is already everywhere in today’s business landscape, and it doesn’t always look like what science fiction portrays: Think more about Google search algorithms, customer experience by Amazon, or chatbots creating their own language at Facebook versus robots who have distinctive human features. AI and machine learning are creating a sizable niche in today’s healthcare lexicon—so much so that having even a high-level understanding of what machine-learning algorithms do is becoming an increasingly valuable skill for anyone working in the healthcare industry.

While many may assume that AI can quickly or automatically identify inefficiencies in business and solve problems independently, there is actually a significant human element that is needed for this technology to function most effectively. Many may also be surprised to know that non-data scientists or business stakeholders play an important and significant role in recognizing opportunities to apply machine learning. This article covers the approach that the data science team at Intermedix uses to identify machine-learning opportunities across divisions for the improvement of end-to-end user experiences.

Getting started

Business leaders and individuals without ‘data’ or ‘analyst’ in their titles typically have the best perspective on the bottlenecks, pain points, and strengths within a given organization. For this reason, division heads, team leads, and vice presidents are often well-positioned to identify business opportunities to apply machine learning. Additionally, data scientists are largely ineffective when operating in a vacuum, which is why they often rely on business stakeholders to alert them to relevant problems that may present worthwhile ventures. Once an opportunity has been identified, it’s the responsibility of the business leader to deploy adequate human capital to support the data science team with the internal subject matter expertise needed to select and segment data sets, provide input, and establish targeted objectives.

Figure 1 The seesaw that businesses must balance when seeking a data science solution.

At Intermedix, we encourage our business leaders to consider what results are worth guessing or predicting. From there, these leaders can work with data scientists and internal subject matter experts to determine if we have the necessary data to consider a valuable machine-learning strategy.

Conducting due diligence

Once it’s determined that the necessary data is available, and a machine-learning approach may be appropriate, teams need to evaluate how AI capabilities should be applied, and decide what the desired end goal should be. To start, teams should ask themselves a few important questions.

For example, if 10 individuals at the organization perform the task that’s meant to be improved by machine learning, would all 10 agree on what a positive outcome looks like? If all 10 disagree, then the business may need to consider what the expectation is for the AI implementation and how it can have the most meaningful impact on the business problem.

Another thing to take into consideration is how long individuals at the organization have been attempting the targeted task or something similar. Furthermore, it would be beneficial to look at whether thorough records have been kept that track strong outcomes related to this task. If records of successful task outcomes have been maintained over a suitable period of time, then those datasets can be used to help train a machine-learning algorithm to predict additional, unknown outcomes. If an accurate historical dataset doesn’t exist, then the organization will need to start tracking and collecting this information. Even early data can be incorporated into a machine-learning algorithm, along with additional input and guidance from human insight, to train the algorithm over time.

Taking time to consider overall objectives and potential outcomes is important. Teams need to understand upfront that AI and machine-learning algorithms are only as good as the input data and feedback provided by the humans who are living and working in these various scenarios. In many cases, the best way to ensure that teams get the most out of AI implementations is by adding unique and invaluable human insight.

If the human logic within your organization is unable to reach a consensus on desired outcomes, then you need to go back to the drawing board and reassess. There needs to be an accepted majority vote outcome that can serve as the basis and foundation for building a machine learning model. Understanding that a deliberation phase needs to take place helps the organization move past the unfruitful experience of believing that AI and machines can make any decision without any context.
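To make the idea of an accepted majority-vote outcome concrete, here is a minimal sketch in Python. It assumes ten reviewers each label a case as "good" or "bad"; the case names, the labels, and the rule that more than half must agree are all illustrative assumptions, not something prescribed by the process above.

```python
from collections import Counter

def majority_label(votes, min_agreement=0.5):
    """Return (label, agreement) for one case.

    The label is None when no single answer is chosen by more than
    `min_agreement` of the reviewers (an assumed threshold).
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    agreement = top / len(votes)
    if agreement > min_agreement:
        return label, agreement
    return None, agreement

# Hypothetical data: ten reviewers judge whether each case had a good outcome.
reviews = {
    "case_1": ["good"] * 9 + ["bad"],
    "case_2": ["good"] * 5 + ["bad"] * 5,
    "case_3": ["bad"] * 7 + ["good"] * 3,
}

labels = {case: majority_label(votes) for case, votes in reviews.items()}

for case, (label, agreement) in labels.items():
    status = label if label is not None else "no consensus, revisit"
    print(f"{case}: {status} (agreement {agreement:.0%})")
```

Cases that come back without a label are the ones to take back to the drawing board before any model is trained on them.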

Looping in the experts

After assessments from project contributors, organizational consensus, and historical datasets have been deliberated, it’s time to loop in the data scientists. Data scientists may be either internal AI experts or third-party vendors, and can help determine if the project you’re considering is feasible. It’s crucial to schedule time to discuss the project at length with these data scientists and walk them through the needs, the available datasets, the considerations made, and the conclusions reached as a result. This input helps data scientists assess whether machine-learning automation is viable for your task and helps them flesh out which pieces of data involved in your target task are the strongest predictors of a rewarding outcome.

Evaluating business impact

After working with data scientists on the data logistics of the project, it’s time to step back and reevaluate the project. For instance, an organization may need to consider any long-term changes and implications that will take place if the project and applied AI is successful. On a similar note, what will happen if the project is not a success? What kind of impact would this have on the organization? What if the algorithm only yields a positive outcome 65% of the time? An accuracy threshold is needed for the organization to execute the project securely and to the best of its ability, and ensure that this endeavor is the best use of time and resources before implementation begins.

Combining skill sets for effective outcomes

One of the most challenging aspects of finding opportunities for machine learning and carrying out an AI-driven project is the need to shift preconceived notions of how traditional tasks are currently accomplished. It helps to remember that an algorithm can only reach its intended objective when the proper data and input are supplied by internal experts and historical datasets.

Figure 2 The framework used by the Intermedix Data Science team to develop a data science solution.

While this process can, at times, be arduous and involve a significant, progressive learning curve, once it’s broken down, it all essentially starts with a stakeholder who is finding ways to address relevant problems within the organization. A culture of curiosity—combined with the right machine learning approach—can go a very long way toward building a competitive advantage for any organization.

Neither the subject matter experts nor the machines can be successful on their own. The overall goal of a machine-learning algorithm is to incorporate the many years of experience and insight into an algorithm that’s able to build upon that established expertise and data and apply this knowledge to future tasks to increase productivity or create new value—thus supporting one another to create a better solution in the long run.

Data Devils Snapping At Your Heels

This article was originally published on LinkedIn.

Ever wonder how you can create meaningful insights with your data? Tired of being asked what your big data strategy is, whether you are doing predictive analytics, or how you are making use of the latest technologies? There is a way for you to get what you need with Data Science. The key to new knowledge is to make sure you use Data Science in your organization.

WHY SHOULD I READ THIS? …

By the end of this article, you will have a straightforward strategy to deal with the data to get you the sort of meaningful insights you deserve. This article will help you develop your thinking beyond just databases, reports, and rules engines.

WHO ARE YOU?…

If you’re a business professional, you may be angry because at your core you know you cannot simply run a business on 30-day lag reports, but your business has only ever provided you more and more spreadsheets.

If you’re a data analyst, you probably want to move your organization past the trend-lines and frequency charts your reports deliver. After all, there is only so much one person can do with MS Excel and you no doubt have questions that your current methods simply don’t make use of.

If you’re a data scientist, you are most likely making mistakes you are unaware of. Maybe not in the science component, or even the data component, but most likely the organization component. In my judgment, this is the toughest career on the planet right now, primarily because you know what’s possible and you have to do better communicating with other non-tech types who have little idea what’s possible. There is a day in the not too distant future where upper management will be sharing stories with their associates, “Yeah – I had to let our Data Scientist go. He never really persuaded me what Data Science could do.”

SPREAD THE WORD…

I am going to show you a simplified three-step workflow that any organization should adopt immediately if they want meaningful insights.

THE HOW-TO…

While I like to geek out like the rest on things like machine learning, nested cross-validation, and linear algebra… in an organizational setting this sort of stuff simply isn’t appropriate.

However, what I do know is the steps below are the right first steps to maximize organizational value from a Data Science point-of-view even though they don’t appear to have the whiz-bang that the fancy math does:

  1. Generating the Right Questions
  2. Understanding the Data
  3. Producing Answers

PRACTICAL, EASY TO IMPLEMENT ADVICE…

Step 1: Generating the Right Questions

Intuitively you probably would agree that not all questions are created equal, even within the same topic. Most organizations suffer from a case of the “rote-s”: rote questions that have been passed down year after year, predecessor after predecessor, all in the name of “that’s how we do it”. The problem is that margins erode, technology is hard to keep up with, and competition is global.

For fastest traction in your organization, you should start with the reports you pull today. What are the top 3 reports that create the most important impact on your business?

For example, you may have a report that measures how many widgets you sold in the last 30 days.

  • Great report…but is this the best question to ask with this data?
  • Let me clarify what I mean – would it be easier to drive for 30-days looking through the rear-view mirror or through the windshield of the car?
  • Take your 30-day widget report and modify the question from “What did I sell?” to “What will I sell?”

Don’t miss what I am communicating here. The historical report obviously has value, but augment your question set to include what might happen now and in the future. Just to be clear, by “predict what might happen” I am not talking about forecasting (or the idea of simply extending the trend by a few periods).

Step 2: Understanding the Data

To make sure your newfound question will generate the sort of meaningful insight you are looking for, you have to take the next step: validation. You need to answer this question: do you have access, or could you get access, to the data that will help you generate the answer?

Don’t believe it is that easy – let’s return to our widget example.

From a Data Science point of view, the question of “what will I likely sell in the next 30 days” is simple. Think about this: you have all the historical transactions, which take into account all the typical questions a business thinks to ask. Things like seasonality, pricing changes, and broad market factors are all baked in. However, in truth, there are things baked in that even the most sophisticated business wouldn’t know to ask. You should confirm for yourself that you get the transaction detail, not the summary detail, to take you to step 3. If you can’t get the data in this format, you can’t get to step 3.

Recall that your organization may, in fact, be used to supplying summary data to business units, not transaction detail. When you ask for the transaction detail of what sold over the last 30 days, get everything you can from a data perspective. Watch out for traps like, “just tell me what data you need, specifically.” Tell the supplier of your data you want all the data related to the transactions, everything. If you don’t know what you have to work with, you can’t get to work on step 3.
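Here is a small Python sketch of why the transaction detail matters. The rows and field names (`date`, `sku`, `units`, `unit_price`, `channel`) are made up for illustration. The point is that the summary can always be rebuilt from the detail, while the detail cannot be rebuilt from the summary.

```python
from collections import defaultdict

# Hypothetical transaction detail: one row per sale.
transactions = [
    {"date": "2024-03-01", "sku": "widget-a", "units": 2, "unit_price": 9.99, "channel": "web"},
    {"date": "2024-03-02", "sku": "widget-a", "units": 3, "unit_price": 8.49, "channel": "store"},
    {"date": "2024-03-02", "sku": "widget-b", "units": 1, "unit_price": 19.00, "channel": "web"},
]

def summarize(rows):
    """Collapse transaction detail into a units-sold-per-SKU summary."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["sku"]] += row["units"]
    return dict(totals)

summary = summarize(transactions)

# Everything the summary report throws away, and a model could have used.
detail_fields = set(transactions[0])
summary_fields = {"sku", "units"}
lost_fields = sorted(detail_fields - summary_fields)

print("summary report:", summary)
print("fields lost in the summary:", lost_fields)
```

Pricing changes, timing, and sales channel all disappear in the summary, which is why asking for "everything related to the transactions" is the safer request.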

Step 3: Producing Answers

If you have made it to this step you will quickly understand that finding answers that will drive your business in meaningful directions is the easy part (I know what I just wrote, but it is true). Data Scientists use a method of providing answers called “the learning model”. So you might be thinking I am stating the obvious at this point, but here is a little-known fact. In Data Science, we use the prior outcomes to predict likely future outcomes with learning models. While there is a bit more to it than “point-and-shoot and big insights will appear”, the idea is to ingest the data into a code pipeline (fancy talk for a series of computational steps) and generate insights. From here not only will you be able to predict the probability of sales by whatever period duration your organization needs, but you can quickly determine which variables in the data drive those outcomes.
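As a rough illustration of what such a learning model does, below is a minimal sketch in plain Python. It is not the pipeline described above: it assumes a tiny, made-up sales history with two yes/no inputs (`on_promotion`, `is_weekend`) and fits a simple logistic regression by gradient descent. The data, feature names, and training settings are all assumptions chosen to keep the example readable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of a sale for one row of inputs."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(rows, labels, epochs=5000, lr=0.5):
    """Fit weights and bias with full-batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

# Hypothetical history: ([on_promotion, is_weekend], sold?)
feature_names = ["on_promotion", "is_weekend"]
history = (
    [([1, 0], 1)] * 3 + [([1, 0], 0)] * 1 +
    [([1, 1], 1)] * 3 + [([1, 1], 0)] * 1 +
    [([0, 0], 1)] * 1 + [([0, 0], 0)] * 3 +
    [([0, 1], 1)] * 1 + [([0, 1], 0)] * 3
)
rows = [x for x, _ in history]
labels = [y for _, y in history]

weights, bias = train_logistic(rows, labels)

# Prior outcomes used to predict likely future outcomes.
p_promo = predict_proba([1, 0], weights, bias)
p_no_promo = predict_proba([0, 1], weights, bias)

# Because both inputs are on the same 0/1 scale, weight size is a rough
# indicator of which variable drives the outcome.
ranked = sorted(zip(feature_names, weights), key=lambda p: abs(p[1]), reverse=True)

print(f"P(sale | promotion, weekday) = {p_promo:.2f}")
print(f"P(sale | no promotion, weekend) = {p_no_promo:.2f}")
print("strongest driver:", ranked[0][0])
```

In this toy history three of four promoted items sold and one of four unpromoted items sold, regardless of the day, so the fitted probabilities should land near those rates and the promotion flag should rank as the stronger driver. A real project would use far more data and a tested library rather than a hand-rolled loop.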

Machine Learning: A Bold Strategic Initiative

This article was originally posted on LinkedIn.

It can be difficult leading an organization in today’s climate, especially if you are trying to move your organization forward by embedding a modern AI solution. It may seem easy at first, as virtually everywhere online you can read about the development of Data Science, the miracles of machine learning, or the advances of AI. It makes sense that you want to improve your competitive advantage by bringing these new capabilities into the organization and reaping benefits immediately. Unfortunately, you will not be met with celebratory confetti, powder cannons, or Tag Team’s song “Whoomp! (There It Is)”. Simply put, “It ain’t easy”. In this write-up we will explore the three most common obstacles leaders of any size organization will likely encounter and be forced to overcome in order to move the organization forward with a modern AI solution.

Lesson #1: Leaders should help their organizations see they are trying to build planes instead of birds.

Most organizations have an expert or group of experts they have been leaning on for some time and believe to be critically important to the company. Many leaders are interested in top-line growth, market share, and scaling their enterprise, but they’re stuck. At times leaders feel they are forced to choose between “experts” and “AI solutions”. This moment hits hard, no matter if you work in a startup or run a multi-billion-dollar organization; the problem is ubiquitous.

If an executive is brave enough to take on machine learning, he or she will encounter many individuals in the organization who may begin to squirm. Call it fear, distrust, or an uneasiness with too much technology too fast. Here is how you can tell you have struck a nerve:

  • I have gained my knowledge over 30 years; how can this solution be trained in a single day?
  • Why are we using more and different data inputs to make these predictions?
  • If this solution is so good, why did it miss predictions in several cases?

So how should organizations respond when an AI solution produces the same or superior results to the status-quo solution, but arrives at that outcome differently?

Do you know when humans finally took off with flight? It was when the Wright brothers stopped imitating birds and started down a different path, using wind tunnels and pursuing aerodynamics. Flying was about getting into the air, maintaining flight, and landing safely; it was never about fooling birds into thinking a plane was a bird. The fact that an AI solution doesn’t do the same things as your company’s expert could be the absolute best thing that happens to your organization.

Lesson #2: Leaders should help organizations determine what an apples-to-apples comparison looks like.

At the start of the company’s transformation from “old world” to “new world” thinking, there is typically an organizational process or business area that the leadership team will want to focus on because they see it as low-hanging fruit for the organization.

It is important to realize that once leadership decides to start a Data Science initiative, they will likely experience a series of undoing efforts that act as an attack on the success of the project. These efforts happen unintentionally in most cases, but intentionally in some; they happen on virtually every Data Science project, no matter who is at the helm.

Here is how a leader can identify if his or her initiative might be at risk of not succeeding, based on comments and questions they might hear:

  • Is your machine learning effort reproducible?
  • Your solution missed obvious cases – how do you explain that?
  • Can you explain your model to me?

These questions are really ones of comparison between the “experts” and the “AI solutions”. Many of the organization’s brightest people will ask these questions, and in many cases, they seem like good questions to ask, rooted in helping the organization. Let me show you why those who are asking these questions may actually be hurting their organization instead of helping it.

Concerning reproducibility, an enterprise-grade machine learning solution is always reproducible. I would make the claim that this is just good science, but many scientists seem to struggle with reproducibility themselves. In many cases, when the expert is asked to meet the same criteria, he or she describes what he or she does as that “je ne sais quoi”, that indefinable thing, or simply “art”.

Concerning missing cases when predicting, machine learning solutions are not perfect, although they strive by design to be. Unfortunately, this question acts like a David Copperfield magic trick: having the leadership team focus on my left hand while the coin is really in my right. A well-designed solution for the real world should be measured not only on how close to 100% it is, but on whether it bests the status quo of the organization. Let me explain:

  • If an organization has no current baseline for a Data Science initiative, they should use random guessing as the starting point to try and best (that is, 50% for an evenly split yes/no outcome).

So, if the AI solution comes in at 75% accuracy, it does not help the organization for experts to say the AI solution missed 25%, when in fact it beat random guessing (which is what the organization is doing in this example) by 25 percentage points.

  • If an organization has experts who use a series of business rules (if this, then that) which help them achieve 75% accuracy, we should use this as the baseline to improve upon.

So, if the AI solution comes in at 90% accuracy, besting the current solution by 15 percentage points, it does not help the organization for experts to say, “Let’s just keep what we have in place, since the AI solution misses 10%.”
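The arithmetic behind both comparisons can be sketched in a few lines of Python. The function name and the second measure (the share of the baseline’s misses that the model removes) are my own additions for illustration; the accuracy figures are the ones used in the two examples above.

```python
def compare_to_baseline(model_accuracy, baseline_accuracy):
    """Return (gain in percentage points, share of baseline misses removed)."""
    gain_points = (model_accuracy - baseline_accuracy) * 100
    baseline_misses = 1 - baseline_accuracy
    misses_removed = (model_accuracy - baseline_accuracy) / baseline_misses
    return round(gain_points, 1), round(misses_removed, 2)

# No existing baseline: compare against random guessing on a yes/no outcome.
no_baseline = compare_to_baseline(0.75, 0.50)

# Existing expert business rules at 75% accuracy.
rules_based = compare_to_baseline(0.90, 0.75)

print("vs. random guessing:", no_baseline)
print("vs. business rules: ", rules_based)
```

Framed this way, the conversation shifts from "the model misses 10%" to "the model removes more than half of the misses we live with today".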

Concerning “explain your model”, there is a range of solutions when it comes to machine learning. Some models are simple in their approach and others more complex. The question can be answered on a variety of levels, covering everything from what data was used to how a convolutional neural network is designed. While all these things can be explained, the explanation may not be immediately understood by an organization’s expert.

When these questions are turned back to the organization’s expert to answer about the current process, the following are responses that the leadership team might hear:

  • I did what I did out of my own experience.
  • I will not be able to reproduce my results in every case, because it is an “in the moment” thing.
  • Nobody is perfect; of course I am going to miss cases. I am only human.

So how should organizations carry out an apples-to-apples comparison between a current and a future solution? Easy – either agree they are not the same or be consistent in your approach of evaluation between the two. Also, if you want to determine the success of a model, compare it to what exists today in your organization, not what it should be.

Lesson #3: Leaders should stretch their organizations to go beyond their own intellect.  

If the leadership in the organization has made it this far with their initiative, they may feel a bit like Alice in Alice’s Adventures in Wonderland next. You should expect to see many bizarre physical changes to the AI solution that you intended to be so easy to implement within your organization. Just like in Wonderland, there are mysterious potions and cakes all along the way that an organization may have to consume to grow and shrink, in hopes of not missing out on the size it needs to be in the proverbial hall of doors.

Imagine this scenario: let’s say you are a CEO who oversees 48 convenience stores distributed across the county, and you want to build a machine learning model to help you maximize the price of gasoline based on your inventory level and what the market will bear in price. To contrast the future state with the current state, you already have 18 experts with a combined 180 years of industry experience doing your pricing for you. Let’s further say you were able to build a model that improves the group of experts’ unrealized profits by 66%. Ready to deploy it and start making money? Not so fast. Here is the next blow that the executives will receive: one, if not all, of your experts will want you to explain to them what you built and why they should use it.

Even the simplest AI solutions can’t fit in the head of a single individual, but an organization’s expert will ask for it anyway. Let’s face it, 15 years of daily gasoline price data for the United States of America, micro- and macro-economic data, weather data, and socio-economic data, plus all the linear algebra, probability & statistics, multivariable calculus, and optimization strategies that were deployed in the AI solution, may be hard to comprehend. While that seems unlikely to be consumed by your organization’s expert, it is harder still for experts who have been working with two dozen assumptions for the last 10 years or have been operating “straight from the gut”.

So how should organizations handle the request from an industry expert who needs to know all the ins and outs of an AI solution? Well, leadership needs to know what is at stake when they answer this question. If you answer this technically, you may not obtain adoption for the solution. If you work towards really having your experts understand, or change your machine learning approach to fit what they want to comprehend (such as modifiable and intuitive data inputs and no fancy math, methods, and techniques), then the organization has allowed the experts to govern how much success the organization can participate in and what they will actually be able to deploy.

Practically, it looks like this:

  • Current State: experts, misses $0.06/per gallon
  • Future State: Model_1, misses $0.04/per gallon (a 33% improvement over the current state)
  • Future State: Model_2, misses $0.02/per gallon (a 66% improvement over the current state)

In most cases, an organization’s leadership will end up taking Model_1 because it is a simpler model that can be somewhat explained to the organization’s group of experts in a palatable way. This highlights a phenomenon in the applied aspects of an AI solution: the fact that real-world problems tend to be multi-objective. In this case, the multi-part objective that should be optimized is 1) the most performant model and 2) what the organization’s experts will accept as truth.
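The numbers and the two-part objective can be sketched in Python as follows. The per-gallon misses come from the list above; the `accepted_by_experts` flag is a hypothetical stand-in for whatever your experts are actually willing to sign off on.

```python
def improvement(current_miss, model_miss):
    """Relative reduction in the per-gallon miss versus the current state."""
    return (current_miss - model_miss) / current_miss

current_miss = 0.06  # experts today, dollars per gallon

# Hypothetical acceptance flags; only the misses come from the example above.
candidates = [
    {"name": "Model_1", "miss": 0.04, "accepted_by_experts": True},
    {"name": "Model_2", "miss": 0.02, "accepted_by_experts": False},
]

gains = {c["name"]: improvement(current_miss, c["miss"]) for c in candidates}

def pick(models, require_acceptance):
    """Choose the lowest-miss model, optionally only among accepted ones."""
    pool = [m for m in models if m["accepted_by_experts"] or not require_acceptance]
    return min(pool, key=lambda m: m["miss"])["name"]

for name, gain in gains.items():
    print(f"{name}: {gain:.1%} improvement over the current state")
print("best on performance alone:", pick(candidates, require_acceptance=False))
print("best that experts accept: ", pick(candidates, require_acceptance=True))
```

The gap between the two picks is the price the organization pays for letting acceptance act as a hard constraint rather than something leadership works to change.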

Go Forth and Conquer

A well-designed AI solution tries to ease the burden that the organization will likely experience when embarking on this sort of transformational journey. Having reliable strategies that both move the organization forward and bring others along proves to be the difference between machine learning success and failure in the enterprise. Business leaders should see that the use of this technology demands more agile and change-friendly organizations. It most certainly requires more leadership from more people, and not just top management. It requires more strategic sophistication. At the most basic level, an organization must have a much greater capacity to execute bold strategic initiatives rapidly while minimizing the size and number of bumps in the road that slow an organization down.

A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column

ABSTRACT:

Each year it has become more and more difficult for healthcare providers to determine if a patient has a pathology related to the vertebral column. There is great potential to become more efficient and effective in terms of quality of care provided to patients through the use of automated systems. However, in many cases automated systems can allow for misclassification and force providers to have to review more cases than necessary. In this study, we analyzed methods to increase the True Positives and lower the False Positives while comparing them against state-of-the-art techniques in the biomedical community. We found that by applying the studied techniques of a data-driven model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized in the current research community.

Research Article:

Mingle D (2015) A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column. Biomedical Data Mining 4: 114. doi: 10.4172/2090-4924.100114

Introduction to Inference and Learning

Many of my subscribers have asked for resources to help them better understand inference and learning. Because individuals have different learning styles, there are both reading and video resources (I would recommend both).

  • Book: Murphy — Chapter 1 — Introduction
  • Book: Bishop — Chapter 1 — Introduction

Books mentioned above:

Machine Learning: A Probabilistic Perspective Kevin P. Murphy, MIT Press, 2012.

Pattern Recognition and Machine Learning Christopher M. Bishop, Springer, 2006. An excellent and affordable book on machine learning, with a Bayesian focus. It covers fewer topics than the Murphy book, but goes into more depth on the topics it covers.

If you have resources that you think that I missed, please let me know. If there is a resource that you particularly enjoyed I would like to hear from you as well.

Getting Started with Machine Learning

In truth, I am an advocate for jumping in head first and using what you learn in real time. Practically speaking, this means initially learning less of the theory and heavy math behind the tools you are using, with the attitude that you will move toward understanding over time.

Do you know how to program in a specific language? If so, then determine if that language has a library which can be leveraged to aid you in your machine learning journey.

If you do not know how to program, that is okay also. Survey a few languages (R and Python are popular among data scientists), see if one is more understandable to you, and then go down the same path…seeking a machine learning library.

Shhh, it’s a Library

No Programming Necessary
  • WEKA – you can do virtually everything with this workbench: pre-process the data, visualize the data, build classifiers, and make predictions.
  • BigML – Like WEKA, you will not have to program with BigML. You can explore model building in a browser. If you are not certain about machine learning (or data science for that matter), this would be a great place to start.
R (Statistical Computing)
  • If you really enjoy math and have not picked a language yet, then this may be for you. There are a lot of packages here developed by pioneers in the field which you can leverage without having to refactor any code. All packages come with instructions – giving you some of the theory and example cases for you to see in action. In my judgment, learning this language allows you to explore and prototype quickly, which most certainly will prove valuable.
Python
  • scikit-learn – If you enjoy Python then this library is for you. It is known for its documentation, which allows you to rapidly apply virtually any machine learning algorithm.
Octave
  • Octave is an open-source alternative to MATLAB (some functions are not present). Like MATLAB, Octave is known for solving linear and non-linear problems. If you have an engineering background then this might be the place for you, although, practically speaking, many organizations do not use Octave/MATLAB as it is seen as primarily academic software.
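To give a feel for how little code a library asks of you, here is a small sketch using scikit-learn. The iris data set and the logistic regression model are my choices for illustration only; any of the library's bundled data sets and estimators would follow the same fit-then-score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled data set: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier and measure accuracy on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm usually means changing only the import and the line that creates the model, which is a good way to start experimenting.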

No matter what you pick, decide to use it and stick with it for a while. In fact, I would commit to it for the next 12 months. Actually use the language/library you choose; do not just read about it.

Learning Online

If you are really a beginner, you may want to stay clear of some of what you see online. Many people I talk to like the idea of data science and machine learning and decide to sign-up for an online course. The problem they encounter is that in many cases they already have to know how to program (to some degree) and they should know linear algebra and probability theory.

Linear Algebra Example

Probability Theory Example

If you do decide to watch classes online, then you should absolutely take notes (even if you toss them later). The key is to participate – which may sound obvious, but when you are at home in your pajamas learning about data science it is not quite so obvious.

That being said there are some really good (and free) online lectures (do not be overwhelmed):

Research Papers

This may not be your thing either; not everybody likes to pick up a research paper to read. Many individuals complain that the reading is a bit too academic and does not lend itself to conveying insight to the reader (which is the opposite of the intent of the paper). To be candid, some are written better than others; in many cases that has to do with the topic or the time period in which the paper was written. However, there are a few seminal papers which you should be acquainted with that will allow you to gain context for machine learning and data science, which should prove invaluable in your journey. My encouragement to you is to find these papers and, if you are not ready to read them because you are building skills in other areas, simply hold on to them and test-read them every 3 months. See how far you get without getting lost, see if you understand what you are doing at a deeper level when coding a solution for having read the paper, and best of all read the reference page – find out who influenced the paper you read.

Machine Learning Books for those Just Starting

Let’s face it: there are not a lot of books out there that aim to aid those just starting out in machine learning. As before, the expectation is that you will have some linear algebra or probability theory down pat. Unless you come from the hard sciences (mathematics, engineering, biostatistics, etc.), you will probably have to do some skill building here even before reading most of the books on the market. However, there are a few that approach the true beginning most people are at, and I encourage those of you willing to try on your own.

Curious to know your thoughts on the above. Have you used any of these resources? Do you have any that you would recommend?

How to Become a Data Scientist

How does one become a data scientist?

Well, in truth, the path is most certainly clear. However, the work it takes to travel down the road is not for everyone. Before reading this you may want to have an understanding of where you are with your current analytic skills (e.g. MS Excel only, maybe a little bit of SQL, Crystal Reports, etc.). Use the rest of this article as a measuring stick for where you are and where you would like to go. In fact, it is best to begin with the end in mind and work backwards to the most basic skill you will need and start building from there…

Recently DataCamp posted an infographic which described 8 easy steps to become a data scientist.

Figure: How to become a data scientist. A portion of the infographic posted on the DataCamp blog.

What is a Data Scientist

It’s important to understand what this infographic is based on:

  1. Drew Conway’s data science Venn diagram that combines hacking skills, math and statistics knowledge, and substantive expertise.
  2. A graph showing the survey results on the question of education level, not unlike the graph in O’Reilly’s Analyzing the Analyzers.
  3. Josh Wills’ quote on what is a data scientist.

Become a Data Scientist

Using the infographic, the 8 steps to becoming a data scientist are:

  1. You need to know (there is a spectrum here) stats and machine learning. The fix – take online courses for free.
  2. Learn to code (not everything, but very specific things). Get a book or take a class (online or offline). Popular languages are Python and R in the data science space.
  3. You should understand databases. This is important because for the most part this is where the data lives.
  4. Critical skills are data munging (data clean-up and transformations), visualization, and reporting.
  5. You will need to Biggie-Size your skills. Learn to use tools like Hadoop, MapReduce, and Spark.
  6. This part is extremely important – get experience. You should be meeting with other data scientists in meetups or talking with people in your office about what you are learning and accomplishing with your enhanced skills. Do yourself a favor: obtain a data set online and start exploring it with your newfound techniques. I recommend Kaggle and CrowdANALYTIX for interesting data sets.
  7. Get yourself one of these: internship, bootcamp or a job. You can’t beat real experience.
  8. Know who the players are in this space and why. Follow them and engage with them, and be a part of and engage with the data science community.

My thoughts…

In my judgment, look at the data and the algorithms first, then get busy with the math and programming. However, I do agree with the idea of moving through steps 1-5 for the sake of familiarity with the discipline. Steps 6-7 I would categorize as working the problem, and the final step would be plugging into a community.

It may be important to go another step forward. 

It is more intuitive to condense steps 1-5 into one (this could be a crash course of terms and themes relevant to data science). My preference (it’s what has worked for me) is to jump in with the data and the tools of the trade as soon as possible. More people need to develop just-in-time learning mechanisms, rather than learning the entire universe of a topic. Approaching data science in this way allows an individual to build on a combination of theory and practical experience. This is done by encountering problem sets over and over again.

Learn the art of relevance…what makes sense for my situation right now. Obtain a solid data set and get learning. This sort of action works to build context for the tools you are using.

The fastest way to become a data scientist is to recognize where you are with your current skills, grab a data set, pick a language (R, Python, Julia, C++, MATLAB, etc.) and start working through a problem end-to-end.

What do you think it takes to be a data scientist?

 

Seizure Detection in EEG Time Series

I had a wonderful opportunity to work with Eben Olson of Yale University on a problem data set provided by The Mayo Clinic. He and I did a write-up on what we learned during the process and hope that it helps others in their knowledge discovery surrounding seizure detection.

1 Introduction
We describe here the methods used in preparing our submission to the UPenn and Mayo Clinic Seizure Detection Challenge, which obtained second place on the private leaderboard. We present in detail our final and most successful approach, as well as an overview of less successful experiments which also contributed to our final ensemble or provided some insight. It is assumed that the reader is familiar with the structure of the challenge and the data, described at http://www.kaggle.com/c/seizure-detection.

2 Early approaches
2.1 Spectrograms
Our initial feature extraction method calculated spectrograms of each EEG trace, in an attempt to capture both frequency content and temporal dynamics. Each clip was first resampled to 500Hz, and the short-time Fourier transform was applied, discarding phase information. Spectrograms were flattened into vectors, and mean subtraction and normalization were applied on a per subject and per feature basis. Features from each channel were concatenated, and logistic regression or random forests were used for classification. Our best (ensembled) submission with these features scored 0.94081 on the public leaderboard.
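The spectrogram features can be sketched roughly as follows. This is not the code from our repository: the frame length, hop size, Hann window, and synthetic clips are assumptions chosen for illustration, and the resampling step is omitted by generating data at 500Hz directly.

```python
import numpy as np


def magnitude_spectrogram(trace, frame_len=256, hop=128):
    """Short-time Fourier transform magnitudes of one trace (phase discarded)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(trace) - frame_len) // hop
    frames = np.stack(
        [trace[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)


def clip_features(clip):
    """Flatten each channel's spectrogram and concatenate across channels."""
    return np.concatenate([magnitude_spectrogram(channel).ravel() for channel in clip])


def standardize(features):
    """Per-feature mean subtraction and scaling across one subject's clips."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (features - mean) / std


rng = np.random.default_rng(0)
clips = rng.standard_normal((6, 4, 500))  # 6 synthetic clips, 4 channels, 1 s at 500Hz
X = standardize(np.stack([clip_features(clip) for clip in clips]))
print(X.shape)
```

The resulting matrix has one row per clip and can be passed to any classifier that accepts a feature matrix, such as logistic regression or a random forest.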

2.2 Scattering coefficients
As an alternative to spectrograms, we attempted to use scattering coefficients[2], a framework for time-frequency analysis which has been shown to give good results in audio classification tasks[1]. We used the ScatNet MATLAB toolbox to compute scattering coefficients for each clip, after resampling to 500Hz. Coefficients for each channel were concatenated and logistic regression was used for classification. Only a marginal improvement (0.94212 public leaderboard) was seen over spectrogram features.

2.3 Maximal cross-correlation
We next considered maximal cross-correlation, which has been reported to produce useful features for detection of epileptic EEG activity[4]. This method attempts to compensate for propagation delays of brain activity by computing cross-correlation between channels at various lag times and taking only the maximum value, normalized by the channel autocorrelation. We obtained a substantially worse score (0.86761 public leaderboard) with this method. However, review of the code indicated that this may have been due to a bug in the feature calculation, and further investigation of this method may be valuable.
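For readers unfamiliar with the feature, a minimal sketch of maximal cross-correlation between two channels is shown below. It is not our competition code. The use of the absolute value, the normalization by the zero-lag autocorrelations of both channels, and the synthetic sine-wave channels are assumptions made for illustration.

```python
import math


def cross_correlation(x, y, lag):
    """Sum of x[t] * y[t + lag] over the overlapping samples."""
    if lag < 0:
        return cross_correlation(y, x, -lag)
    return sum(a * b for a, b in zip(x, y[lag:]))


def max_cross_correlation(x, y, max_lag):
    """Largest absolute cross-correlation over lags in [-max_lag, max_lag],
    normalized by the zero-lag autocorrelations of both channels."""
    norm = math.sqrt(cross_correlation(x, x, 0) * cross_correlation(y, y, 0))
    if norm == 0:
        return 0.0
    lags = range(-max_lag, max_lag + 1)
    return max(abs(cross_correlation(x, y, lag)) for lag in lags) / norm


# Two synthetic channels: y is x delayed by 3 samples.
x = [math.sin(2 * math.pi * t / 20) for t in range(200)]
y = [0.0, 0.0, 0.0] + x[:-3]

zero_lag = abs(cross_correlation(x, y, 0)) / math.sqrt(
    cross_correlation(x, x, 0) * cross_correlation(y, y, 0)
)
best = max_cross_correlation(x, y, max_lag=5)
print(f"zero-lag correlation: {zero_lag:.2f}, maximal cross-correlation: {best:.2f}")
```

Searching over lags recovers the strong relationship between the two channels that a zero-lag correlation understates, which is the propagation-delay compensation described above.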

3 Final approach
3.1 Feature extraction
Our final approach to feature extraction calculated the covariance matrix of the EEG data, in order to capture correlations between channels. Since seizure activity is characterized by increased long-range synchronization of neuronal activity, this was expected to produce informative features. Matrices were individually normalized to zero mean and unit variance. As frequency analysis had been shown to be valuable, rather than compute a single covariance matrix we first filtered each trace with several bandpass filters. We initially applied four filters covering the range 1-200Hz. Filter choice presents a complicated trade-off between frequency selectivity, signal to noise ratio, and output dimensionality. Performance was evaluated by cross validation of logistic regression predictors. While attempting to manually optimize the filter parameters, we found that filters chosen for one subject could perform extremely poorly on others. We therefore performed an automated filter selection step, in which combinations of up to four filters were evaluated on each subject. These filters were chosen from a bank of 10 partially overlapping, approximately log-spaced bandpass filters covering the range 5-200Hz. The three combinations which gave the highest CV values were retained.
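The sketch below illustrates the idea of band-filtered covariance features. It is not the implementation in our repository: the FFT-masking filter is a crude stand-in for proper bandpass filters, the band edges are hypothetical, the whole matrix is flattened rather than some subset of it, and the per-subject filter selection step is left out.

```python
import numpy as np


def fft_bandpass(clip, low_hz, high_hz, fs):
    """Crude band-pass: zero every FFT bin outside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(clip, axis=1)
    freqs = np.fft.rfftfreq(clip.shape[1], d=1.0 / fs)
    spectrum[:, (freqs < low_hz) | (freqs > high_hz)] = 0
    return np.fft.irfft(spectrum, n=clip.shape[1], axis=1)


def covariance_features(clip, bands, fs):
    """One normalized channel-by-channel covariance matrix per band, flattened."""
    features = []
    for low_hz, high_hz in bands:
        cov = np.cov(fft_bandpass(clip, low_hz, high_hz, fs))  # rows are channels
        cov = (cov - cov.mean()) / (cov.std() + 1e-12)  # normalize each matrix on its own
        features.append(cov.ravel())
    return np.concatenate(features)


rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 500))  # synthetic clip: 4 channels, 1 s at 500Hz
bands = [(5, 15), (15, 50), (50, 100), (100, 200)]  # hypothetical band edges in Hz
features = covariance_features(clip, bands, fs=500)
print(features.shape)
```

In the approach described above, candidate band combinations would each be scored by cross-validation per subject, with the best few retained.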

3.2 Neural network classification
As an alternative classification strategy, we experimented with the use of multilayered neural networks. Our initial motivation was the possibility of learning a cross-subject mapping which would allow our model to use the full training set to improve its predictions. While this goal was not realized, we did find that the NN models provided a boost over logistic regression. Our software was based on dnn.py, a recently released demonstration of a deep neural network written in Python. This provided an excellent framework which was simple to adapt to our problem. We tested a number of network architectures, but found that a network with two hidden layers of 200 and 100 units respectively gave good results while being reasonably quick to train. Rectified linear units were used in the hidden layers and logistic regression in the output layer. Dropout of 0.5 was used in the hidden layers for regularization. All networks were trained with the adadelta method for 100 epochs. Multiple networks were trained for each subject and filter combination. In an attempt both to increase diversity and to reduce the impact of dissimilar electrode patterns across subjects, each network was trained on a 12-channel subset of the full covariance matrix. We found that depending on network architecture, predictions would become extremely compressed into the neighborhoods of zero and one. To avoid potential issues with numerical precision, we applied a logarithmic rescaling to predictions in the (0,0.1] and [0.9,1) ranges.
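As a rough picture of the architecture, the following sketch runs a forward pass through a network of the shape described. It is written in plain NumPy rather than with the module we actually used. The weight initialization, the inverted-dropout formulation, and the input size are assumptions, and training with adadelta and the logarithmic rescaling are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out)


def forward(x, layers, drop_rate=0.5, train=False):
    """Two ReLU hidden layers with dropout, then a logistic output unit."""
    h = x
    for weights, bias in layers[:-1]:
        h = np.maximum(0.0, h @ weights + bias)  # rectified linear units
        if train:
            mask = rng.random(h.shape) >= drop_rate
            h = h * mask / (1.0 - drop_rate)  # inverted dropout keeps the expected scale
    weights, bias = layers[-1]
    return 1.0 / (1.0 + np.exp(-(h @ weights + bias)))  # logistic output


n_inputs = 12 * 12  # hypothetical: one flattened 12-channel covariance matrix
layers = [init_layer(n_inputs, 200), init_layer(200, 100), init_layer(100, 1)]
batch = rng.standard_normal((5, n_inputs))
probs = forward(batch, layers)
print(probs.ravel())
```

Dropout is applied only when `train=True`; at prediction time the full network is used, so the outputs are deterministic for a given set of weights.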

3.3 Early seizure prediction
Our best scores were obtained by submitting the same values for p_early and p_seizure, rather than trying to train separate classifiers for early ictal events. This phenomenon was reported early in the competition by the user Alexandre. We observed a similar trend in our cross-validation testing, and believe it is explained by the combination of the AUC metric and the imbalanced classes of the data set, which leads to a much larger penalty for false negatives than false positives. At least for the classification strategies we employed, the error due to training on the “wrong” labels was outweighed by the benefits of a larger training set. However, post-deadline testing showed that our error on the early detection task was several times higher than on the seizure detection task. Improvements could potentially be obtained by fine-tuning of the trained networks with early ictal data alone, or by applying other techniques for learning from noisy labels.
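The asymmetry between the two kinds of error can be seen in a toy example. The AUC function below is a plain pairwise implementation, and the labels and scores are made up for illustration; they are not competition data.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Imbalanced toy set: 2 positive clips, 8 negative clips.
labels = [1, 1] + [0] * 8

perfect = [0.9, 0.8] + [0.1] * 8
one_false_positive = [0.9, 0.8, 0.95] + [0.1] * 7  # one negative ranked above every positive
one_false_negative = [0.9, 0.0] + [0.1] * 8  # one positive ranked below every negative

print(auc(labels, perfect))
print(auc(labels, one_false_positive))
print(auc(labels, one_false_negative))
```

With few positives, a single badly ranked positive loses a comparison against every negative, while a single badly ranked negative loses only against the handful of positives. That is why missing a seizure clip costs far more AUC than raising a false alarm in this toy setting.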

4 Ensembling
We operated according to this learning bias: select the simplest model given that the theories are consistent with the data. We attempted to produce strong learners that could be arbitrarily accurate and weak learners that were more accurate than random guessing. In blending our models, our process was to start with uniform weighting; however, through each step of learning we decreased the weighting of the models that were not correctly learned by the weak learner, and increased the weighting of the models that were correctly learned by strong learners.[3] In each ensemble we attempted to create more diversity of opinion, independence, decentralization, and aggregation. Our objective was to discover the best model from many classifier models with similar training/test errors. Selecting a model at random, we would have risked the possibility of choosing a weak learner. However, we observed that combining them (averaging) presented us with the possibility of avoiding a poor decision. The idea was that every time we ran a model iteration, we would encounter different local optima. However, combining the model outputs would allow us to find a solution that is closer to the global minimum. Since it is possible that the classifier space may not contain the solution to a given problem, we pursued an ensemble of such classifiers that could contain the solution to the given problem. For example, in the case of a linear classifier that cannot solve a non-linearly separable problem, combining linear classifiers may solve a non-linear problem.
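The last point can be shown with the classic XOR problem. In this sketch the two linear classifiers have hand-picked weights (they are not learned, and they are not our competition models); each gets one of the four points wrong on its own, while averaging their votes classifies every point correctly.

```python
def linear_classifier(weights, bias):
    """Return a predictor that thresholds a weighted sum of the inputs."""
    def predict(x):
        return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
    return predict


# Hand-picked members: the first behaves like OR, the second like NAND.
members = [linear_classifier((1, 1), -0.5), linear_classifier((-1, -1), 1.5)]


def ensemble_predict(x):
    """Average the member votes and predict 1 only if the average exceeds 0.5."""
    votes = [member(x) for member in members]
    return 1 if sum(votes) / len(votes) > 0.5 else 0


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def accuracy(predict):
    """Fraction of XOR points a predictor labels correctly."""
    return sum(predict(x) == y for x, y in xor_data) / len(xor_data)


for i, member in enumerate(members, start=1):
    print(f"linear classifier {i}: {accuracy(member):.2f}")
print(f"ensemble: {accuracy(ensemble_predict):.2f}")
```

No single line can separate the XOR points, so each member is stuck at three out of four. The combination succeeds because the members make their mistakes on different points.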

5 Technical details
5.1 Software

All of our experiments, with the exception of scattering coefficient extraction, were carried out using Python and IPython notebooks. The numpy, scipy, and scikit-learn packages were used extensively. Theano (footnote 4) is required by the neural network module. Our repository, available at https://github.com/ebenolson/seizure-detection, contains a full list of required packages and instructions for running our example code. We will also give a brief description here:
• Data Preparation.ipynb This notebook loads all clips for each subject, applies bandpass filters and calculates covariance matrices, then performs the filter selection step. The filtered covariance matrices are saved, along with the clip labels and filenames, to a pickle file in the data subdirectory.
• Train Classifiers and Predict.ipynb This notebook loads the preprocessed data and trains multiple networks on each input file. The predictions of each network are saved to the output subdirectory.
• Postprocessing.ipynb This notebook loads predictions of the trained networks and combines them to produce submission.csv, a submission in the required format.
• Miscellaneous.ipynb This notebook contains code snippets implementing some of our earlier approaches.
• simplednn/ This subdirectory contains the neural network module.

5.2 Hardware and runtime
Most computations were done using Amazon EC2 m3.2xlarge instances with 8 virtual cores and 30GB RAM. Using spot requests, EC2 provides an affordable platform which can be easily scaled and parallelized depending on the memory and processing power required. Some experiments were also done using a Core i5 quad core desktop with 16GB of RAM. On the desktop, preprocessing the full data set requires approximately one hour and training one iteration of networks (3 per subject) requires 13 minutes. The time required to generate predictions is negligible.
Footnote 4 (Theano): https://github.com/Theano/Theano

References
[1] Joakim Andén and Stéphane Mallat. Multiscale scattering for audio classification. 2011.
[2] J. Bruna and S. Mallat. Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566, June 2011.
[3] Giovanni Seni, John Elder, and Robert Grossman. Ensemble methods in data mining: Improving accuracy through combining predictions. In Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. MLSP 2010, pages 4–10, February 2010.
[4] P.W. Mirowski, Yann LeCun, D. Madhavan, and R. Kuzniecky. Comparing SVM and convolutional networks for epileptic seizure prediction from intracranial EEG. In IEEE Workshop on Machine Learning for Signal Processing, 2008. MLSP 2008, pages 244–249, October 2008.

Do I Have to be Great at Math?

In many cases, individuals try to start out knowing all the math behind the data science before being practical with the science. The majority stop here before they even get started – in fact, they have a belief in data science (they are attracted to it) but they simply do not feel adequate because the math can get in the way.

Bottom line: you can go far in machine learning without understanding all the math.

Just Start

Think back to when you started something that you did not understand completely. You made use of it. It probably was terrible, but you stumbled your way through and on the other side you had much to be proud of. It was a slow process, but your passion for whatever you were involved in carried you through the sticking points. As the complexity of whatever you were interested in grew, so did your desire for knowledge, and so you were able to overcome the limitations that could have been perceived to be in your way.

The truth is that if everyone had to start machine learning by learning about Nonnegative Matrix Factorization or Perturbation Bounds for Eigendecompositions, then very few individuals would have their passion ignited.

Copy-and-Paste Heaven

You may not believe me, but one of the best things that can happen to you early on in your data science journey is for things to fail. Let’s face it: when something breaks you can do one of two things: 1. walk away or 2. fix it. Many individuals focus on copying and pasting code from Git, a data challenge, or a code cookbook simply to see if the code can work on their machine. In fact, they simply want to recreate the results, but there is a point where you will want to extend the code. Maybe it doesn’t do everything you need, or the data type is not the same as what you are working on, or maybe you want to tune the model – in all these cases you will have to move beyond copy-and-paste code.

When something breaks or you decide to enhance what you have, you will almost certainly have to break down the code line by line and understand what it is doing. In doing this, you may not realize it, but you are building confidence in what you are working on (as an aside, you should apply this same technique to the math formulas you encounter).

Where’s My Geiger Counter

  1. Be a Navigator – work well with tools like scikit-learn, R, WEKA
  2. Be a Librarian – use programming libraries that give you small algorithms
  3. Be a Doer – put what you are learning directly into what you are doing

You only need to know what you can use today. This approach is highly practical and efficient.

Find some dataset or problem that you are interested in working with and begin to work systematically through the results. Document it for yourself – categorizing your thoughts will help you crystallize what you are learning. In time you will begin gaining an energy within your learning which will prompt you to seek new algorithms, understand the parameters behind them, and solve more complex problems.

This type of process, like a Geiger counter, points you to what you need (and nothing more). It is highly customized and meaningful to the individual who adopts this method. The Geiger counter approach should be contrasted with that of an individual completely grounded in theory. Theory alone is generally not sufficient to move beyond a prototype solution.

The secret to math is you need to know what you are good at and what you are not. Know where you need to improve. When the time is right, pursue that knowledge. Having a problem that you are working with and needing more knowledge around a math concept can go a long way in cementing the idea in your mind for future problem sets.

You Really Do Need Math, Later

Since you will have to know linear algebra, why not use machine learning to teach you through a problem that interests you? You will gain more intuition at a deeper level than any textbook introduction you may receive. In short, there is a path available to all individuals who are interested in machine learning but intimidated by math concepts. Keep in mind that you can overcome overconfidence in a poorly understood algorithm by deciding to understand the math behind any algorithm you implement.

I am sure some of the data science purists may disagree with what I have stated here. Leave a comment and let me know what you think.