Creating Value for Business: 2 Data Science Questions You Must Ask from the Start

Decisions in Data Science

Business goals are no doubt important, but in an analytic project it makes sense to balance the organization’s goals with those of the Data Science department. Most individuals will recognize balance as a principle of art, but the notion of creating a sense of equilibrium between the business and the Data Scientist is just as foundational in today’s insight economy. Failing to cultivate this balance invites trouble into the organization.

Question 1: What are the Data Science Goals?

As a Data Scientist working in an organization, it is important to understand how the intended outputs of the Data Science project enable the achievement of the business objectives. Imagine a situation where a business has a set of defined goals, but the analytics team has a different target in mind, or vice versa. The result is extra cost, time delay, and missed business opportunities. Unfortunately, these sorts of mismatches are more common than you would imagine in everyday business – in organizations big and small. As a Data Scientist serving a business, it is prudent to define your goals in tandem with the business objectives and obtain buy-in on your interpretation. This can be done by explicitly documenting what you expect the output to be like and confirming its usefulness to the business unit you are supporting.

Question 2: What Are the Data Science Success Criteria?

Businesses should work with Data Scientists who know how to precisely define a correct outcome in technical terms. In truth, it could prove important to describe these outcomes in subjective terms; however, if this ends up being the case, the person in charge of making these subjective judgments needs to be identified. Neither the business nor the Data Science department will succeed with a moving target. Transparency and visibility are always good things in business. This allows individuals to manage towards a known expectation.

Organizations working with Data Scientists who have only technical know-how are missing out on significant value within their analytic projects. Organizations should seek professionals who know how to translate business concepts into analytic outcomes. This skill should be considered primary over knowing the most advanced techniques and methods for analyzing data. Unfortunately, most organizations are still on a discovery mission with regard to what they need from Data Science. Organizations remain beholden to the idea that if they hire a Ph.D. in some highly analytical field, then success is just around the corner. This is rarely the case. In fact, most Ph.D.s need significant time to warm up to the corporate culture and learn the language of business before they can be fully effective.

It may seem obvious to the organization, but having your analytic superhero quickly judge the type of Data Science problem you are asking them to contribute to is paramount to pulling it off. Typically, being able to specify whether the target is a classification, description, prediction, or clustering problem works well for all involved and starts to build context across disciplines in the organization. This becomes especially important when a Data Science department begins to grow and less experienced Data Scientists can learn to see more like senior Data Scientists; this can only happen with intentionality and purpose.

Organizations should come to expect that one way a good Data Scientist will often demonstrate his or her ability is by reframing or redefining the problem put before them by the company. The first few times this may seem off-putting, but organizations that learn to embrace this sort of transformation of the business problem will be able to compete for the future. Practically speaking, this may look like shifting to “medical device retention” rather than “patient retention” when targeting patient retention delivers results too late to affect the outcome.

As a business concerned with the ROI from your Data Science investment, you will undoubtedly want to see activities of the Data Scientist which specify criteria for model assessment. These typically present themselves as measures of model accuracy, performance, and complexity. In many cases, it is indispensable to see that a Data Scientist has defined benchmarks for evaluation criteria. Even in the case of subjective assessment, criteria definition becomes important. At times it can be difficult to meet a company’s Data Science goal of model explainability – or data insights provided by the model – if the Data Scientist has not done a good job of uncovering this as a business need. So, the adage “begin with the end in mind” should prompt the Data Scientist to ask an appropriate series of questions of the business to ensure value creation.

Summary

Remember that the Data Science project success criteria are without a doubt different from the business success criteria. Any Data Scientist with experience will say that it is always best to plan for deployment from the beginning of a project. If the organization finds a Data Scientist not following this best practice, expect spotty results and a bit of frustration from business counterparts. As an organization, it is vital to push your Data Scientist to work hard and be assertive within the project – as well as to use their mind and imagination. This should give him or her the permission to shape the future your company desires.

5 Unbelievable Ways You Can Be a Better Data Scientist in Business

 

Most Data Scientists like to get their hands dirty with data as quickly as possible, but it is important to practice some delayed gratification and first dig into the details of the Data Science project before you start modeling. A Data Scientist who has the business in mind will attempt to determine what factors might get in the way of the business experiencing success with the project. Different phases have differing needs for information, but once you have moved past the initial stage of understanding the business, a successful Data Scientist’s objective becomes diving into the details quickly and deeply.

1: Conduct a Resource Inventory

 

As a Data Scientist, it is important to know the ins and outs of the available resources of a Data Science project. This is not just about how much computing power you have to run your analysis. A professional Data Scientist needs to consider many things, like the business experts, data experts, technical support, and other Data Scientists. In addition, there are important variables such as fixed extracts, access to live data, warehoused data, and operational data. However, no one should forget the computing resources such as hardware and software. Any Data Scientist who takes on a project without seriously considering these areas is walking into a minefield, never knowing when something might explode.

2: Understand the Requirements, Assumptions, and Constraints

Most Data Scientists know they have to be better than average at predicting outcomes for whatever the business has selected as a target, but highly successful Data Scientists know that there is more to it than simply gaining a few more points in predictive accuracy. Take, for example, a Data Scientist who considers all the assumptions that are known about the project, both from a business perspective and an analytical perspective. These assumptions can take many forms – however, the ones that rear their ugly heads most often are about the data. Sometimes assumptions are not verifiable as they relate to the business – these can be the riskiest. If at all possible, these risky assumptions should be prioritized at the top of the list, because they could affect the validity of the results you aim to discover.

Data Scientists need to watch for traps. Consider making explicit any and all resource limitations, including technology constraints. Think outside the box when it comes to limitations. For example, is the size of the data practical for modeling? This may seem obvious, but many Data Scientists overlook this important consideration.

3: Determine Risk and Contingencies

Have you ever started a data analysis project that ended up falling apart only because there were external delays to the project? It is a wise move to consider contingency plans up front. Many Data Scientists take a shortcut here and do not take seriously the insurance that this sort of preparation can provide when needed. It can be extremely helpful to have a backup plan or two in place in the event unknown risks try to derail your project’s success. Experience would say that something is always trying to cause you to fail, so plan for alternatives from the beginning.

4: Document Meaning

The question “What do you mean?” is a particularly important question to answer when working with interdisciplinary teams in a business environment. It should be obvious that we do not all speak the same language when it comes to our domains. Taking the time up front to develop a working glossary of relevant business terminology can keep you and others on track. Another good practice is to have Data Science terminology defined and illustrated with examples, but only work with the terms that directly relate to the business problem at hand. This does not need to be a 700-page document; rather, keep things cogent and useful to all parties involved. Keep in mind others want you to be the Data Scientist; only at the highest level do others want to know the underbelly of statistics and coding.

5: Calculate Cost and Benefits

It is good practice to demonstrate value in your Data Science projects. Remember that as a professional who supports the business it is important to ask and answer the question, “Is the Data Science project of value?” A simple comparison of the associated costs of the project against the potential benefits if successful will go a long way for both you and the business. Knowing this at the beginning of the project is clearly more beneficial to you and the organization than at the close. In my judgment, not asking and answering this question is a career-limiting move that your most successful Data Scientist will seek to get right straight out of the gate. Have the common sense to take on this activity yourself and not wait for your business counterparts or leaders to ask you to do it.

Summary

As Data Science matures in a business context, a Data Scientist needs to be more aware of assessing the situation, taking an inventory, learning about the risk and developing contingencies, and understanding the cost benefits of having a successful Data Science project. Not every Data Scientist will take these steps, but then again not every Data Scientist is highly successful. Like water in the desert is a solid Data Science methodology to a business. Do not leave your organization thirsty when it needs you most.

R | Data Selection and Manipulation

The functions below aim to give a bit of background on data selection and manipulation in R.

  • which.max(x) returns the index of the greatest element of x
  • which.min(x) returns the index of the smallest element of x
  • rev(x) reverses the elements of x
  • sort(x) sorts the elements of x in increasing order; to sort in decreasing order: rev(sort(x))
  • cut(x,breaks) divides x into intervals (factors); breaks is the number of cut intervals or a vector of cut points
  • match(x, y) returns a vector of the same length as x giving the positions of the first matches of the elements of x in y (NA if there is no match)
  • which(x == a) returns a vector of the indices of x for which the comparison is TRUE; in this example, the values of i for which x[i] == a (the argument of this function must be a variable of mode logical)
  • choose(n, k) computes the number of combinations of k events among n repetitions = n!/[(n−k)!k!]
  • na.omit(x) suppresses the observations with missing data (NA) (suppresses the corresponding row if x is a matrix or a data frame)
  • na.fail(x) returns an error message if x contains at least one NA
  • unique(x) if x is a vector or a data frame, returns a similar object but with the duplicate elements suppressed
  • table(x) returns a table with the counts of the different values of x (typically for integers or factors)
  • subset(x, …) returns a selection of x with respect to criteria (…, typically comparisons such as x$V1 < 10); if x is a data frame, the option select gives the variables to be kept or dropped using a minus sign
  • sample(x, size) randomly samples size elements from the vector x without replacement; the option replace = TRUE allows sampling with replacement
  • prop.table(x, margin=) expresses table entries as fractions of the marginal table
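As a quick illustration, here is a small sketch applying several of these selection functions to a toy vector:

```r
# Toy vector for demonstrating the selection functions above
x <- c(5, 2, 9, 2, 7)

which.max(x)        # index of the largest element: 3
rev(sort(x))        # sort in decreasing order: 9 7 5 2 2
unique(x)           # duplicates removed: 5 2 9 7
table(x)            # counts of each distinct value
which(x == 2)       # indices where the comparison is TRUE: 2 4
cut(x, breaks = 2)  # divide x into two intervals (returns a factor)
```

Run these one line at a time at the R console to see each result; combining which() with bracket indexing, e.g. x[which(x > 4)], is the typical selection idiom.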

 

Functions for Manipulating Character Variables
nchar(x) returns a vector of the lengths (in characters) of each value in x
paste(a,b,sep="_") concatenates character values, using sep between them
substr(x,start,stop) extracts characters from positions start to stop of x
strsplit(x,split) splits each value of x into a list of strings using split as the delimiter
grep(pattern,x) returns a vector of the indices of the elements of x that contain pattern
grepl(pattern,x) returns a logical vector indicating whether each element of x contains pattern
regexpr(pattern,x) returns the integer position of the first occurrence of pattern in each element of x
gsub(pattern,replacement,x) replaces each occurrence of pattern in x with replacement
tolower(x) converts x to all lower case
toupper(x) converts x to all upper case
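A brief sketch of these character functions applied to a toy vector:

```r
# Toy character vector for the string functions above
s <- c("alpha", "beta", "gamma")

nchar(s)                   # 5 4 5
paste("id", s, sep = "_")  # "id_alpha" "id_beta" "id_gamma"
substr(s, 1, 2)            # "al" "be" "ga"
strsplit("a_b_c", "_")     # list containing "a" "b" "c"
grepl("mm", s)             # FALSE FALSE TRUE
gsub("a", "o", "banana")   # "bonono"
toupper(s)                 # "ALPHA" "BETA" "GAMMA"
```

Note that grep() and grepl() treat pattern as a regular expression, so characters like "." and "$" have special meaning unless escaped.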

 

Logical Operators
== is equal to
!= is not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
%in% is in the list
! not (reverses TRUE and FALSE)
& and
| or
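These operators combine naturally when filtering data; a short sketch:

```r
# Toy vector for combining the logical operators above
x <- c(1, 5, 8, 12)

x > 4 & x < 10      # FALSE TRUE TRUE FALSE
x <= 1 | x == 12    # TRUE FALSE FALSE TRUE
5 %in% x            # TRUE
!(x > 4)            # TRUE FALSE FALSE FALSE
x[x != 1 & x < 10]  # subset with a logical condition: 5 8
```

The last line shows the common idiom of using a logical expression inside brackets to select elements.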

 

Do I Have to be Great at Math?

In many cases, individuals try to start out by learning all the math behind data science before doing anything practical with the science. The majority stop here before they even get started – they believe in data science (they are attracted to it), but they simply do not feel adequate because the math gets in the way.

Bottom line: you can go far in machine learning without understanding all the math.

Just Start

Think back to when you started something that you did not completely understand. Your first attempt was probably terrible, but you stumbled your way through, and on the other side you had much to be proud of. It was a slow process, but your passion for whatever you were involved in carried you through the sticking points. As the complexity of whatever it was you were interested in grew, so did your desire for knowledge, and so you were able to overcome the limitations that might otherwise have stood in your way.

The truth is, if everyone had to start machine learning by learning about Nonnegative Matrix Factorization or Perturbation Bounds for Eigendecompositions, then very few individuals would have their passion ignited.

Copy-and-Paste Heaven

You may not believe me, but one of the best things that can happen to you early on in your data science journey is for things to fail. Let’s face it: when something breaks, you can do one of two things – 1. walk away, or 2. fix it. Many individuals focus on copy-and-pasting code from Git, a data challenge, or a code cookbook simply to see if the code will work on their machine. In fact, they simply want to recreate the results, but there comes a point where you will want to extend the code. Maybe it does not do everything you need, or the data type is not the same as what you are working with, or maybe you want to tune the model – in all these cases you will have to move beyond copy-and-paste code.

When something breaks, or you decide to enhance what you have, you will almost certainly have to break down the code line by line and understand what the code is doing. In doing this, you may not realize it, but you are building confidence in what you are working on (as an aside, you should apply this same technique to the math formulas you encounter).

Where’s My Geiger Counter?

  1. Be a Navigator – work well with tools like scikit-learn, R, WEKA
  2. Be a Librarian – use programming libraries that give you small algorithms
  3. Be a Doer – put what you are learning directly into what you are doing

You only need to know what you can use today. This approach is highly practical and efficient.

Find some dataset or problem that you are interested in working with and begin to work systematically through the results. Document it for yourself – categorizing your thoughts will help you crystallize what you are learning. In time you will begin gaining an energy within your learning which will prompt you to seek new algorithms, understand the parameters behind them, and solve more complex problems.

This type of process, like a Geiger counter, points you to what you need (and nothing more). It is highly customized and meaningful to the individual who adopts this method. The Geiger counter should be juxtaposed with an individual completely grounded in theory. Theory alone is generally not sufficient to move beyond a prototype solution.

The secret to math is that you need to know what you are good at and what you are not. Know where you need to improve. When the time is right, pursue that knowledge. Having a problem that you are working with and needing more knowledge around a math concept can go a long way in cementing the idea in your mind for future problem sets.

You Really Do Need Math, Later

Since you will have to know linear algebra eventually, why not use machine learning to teach it to you through a problem that interests you? You will gain more intuition at a deeper level than any textbook introduction can provide. In short, there is a path available to all individuals who are interested in machine learning but are intimidated by math concepts. Keep in mind that you can overcome the overconfidence that comes from a poorly understood algorithm when you decide to understand the math behind any algorithm you implement.

I am sure some data science purists may disagree with what I have stated here; leave a comment and let me know what you think.

10 Trends You Will Continue to See In 2014

Many businesses ask me what I see happening in the next 12 months. They ask me questions like:

What should we expect?

Where should we be investing?

What should we be thinking about to keep ahead of the curve?

The list below is not particular to any one industry; rather, it is a general overview of the state of the analytics ecosystem at a particular moment in time. For many industries, focusing on a single item below would work wonders for their business, while others would need to adopt more.

  1. Data science moves to the everyman.
  2. Analytics will drive cloud-based business solutions.
  3. Cloud data warehouses transform the process from months to days.
  4. Business individuals begin to expect flexibility and usability in their dashboards.
  5. Retrospective views of the data are no longer enough, so the addition of prospective views becomes important.
  6. Embedded analytics begins to come into mainstream business.
  7. Dashboards with context become important, hence narrative around the data becomes key.
  8. Business users begin to seek information wherever they are, not just at their desktops.
  9. Social media becomes a measure of competitive advantage for organizations.
  10. NoSQL will become increasingly more important as organizations attempt to work with unstructured data.

It is a fabulous time to be involved in analytics, for organizations of all types. We are at a new frontier of business that we should all be excited by rather than intimidated by.

The Death of the Data Scientist?

There has been a lot of chatter recently around the notion that data scientists are soon to be replaced by a $30/hr specialist from places like oDesk, Freelancer, and Elance. Before we go down the path of whether we can replace a data scientist, let us take some time to home in on exactly what a data scientist does. Being candid, there is a plethora of answers to this question. If we mean a person who pulls together a data summary or modeling task that has been well defined before they even encounter the problem, then I think it is absolutely possible to come in at a $30/hr price. In truth, I see that type of data scientist being replaced by automated software without having to deal with a freelancer at all. Look to how similar scenarios have played out elsewhere, such as online marketing or site development.

But we need to focus on the concept “the data problem was previously well-defined”.

Data scientists who achieve higher salaries tend to fall into one of two distinct camps:

1) The Engineer:

This individual knows how to choose the proper tools and infrastructure to solve a specific, technology-laden data problem. These individuals usually work on the leading edge of a problem, where at times there may be very few examples of the problem being worked anywhere in the global community. This is markedly different from the well-defined problem of the freelancer situation we described earlier.

2) The Communicator:

This individual knows the technical side of what data science is and how to get at solutions, but their strength is in the storytelling. Many times business leadership does not know what is possible with data science, and for that they need a translator of sorts. These types of individuals encounter organizations that know they have a problem to solve but do not necessarily know how to frame the question so that it can be answered by the data. These businesses look for someone who is personable and not thousands of miles away to guide them through what they feel is incredibly difficult and important.

While it is certainly true that there may be segments of data science which are automated, there will certainly always be a place for problem solvers – think physicians, attorneys, developers, consultants, etc. Like the roles just mentioned, data scientist is not simply one role.

Not all data scientists are performing rote tasks.

There will always be a place for individuals skilled at leveraging technology to solve complex business problems, and we will have to invest more than $30/hr to garner their expertise.