10 Trends You Will Continue to See In 2014

Many businesses ask me what I see happening in the next 12 months. They ask me questions like:

What should we expect?

Where should we be investing?

What should we be thinking about to keep ahead of the curve?

The list below is not particular to any one industry; rather, it is a general overview of the state of the analytics ecosystem at a particular moment in time. Many industries could work wonders for their business by focusing on a single item below, while others will need to adapt to several.

  1. Data science moves to the everyman.
  2. Analytics will drive cloud-based business solutions.
  3. Cloud data warehouses transform the process from months to days.
  4. Business individuals begin to expect flexibility and usability in their dashboards.
  5. Retrospective views of the data are no longer enough, so prospective views become important.
  6. Embedded analytics begins to come into mainstream business.
  7. Dashboards with context become important, hence narrative around the data becomes key.
  8. Business users begin to seek information wherever they are and not just at their desktop.
  9. Social media becomes a measure of competitive advantage for organizations.
  10. NoSQL will become increasingly important as organizations attempt to work with unstructured data.

It is a fabulous time to be involved in analytics, in organizations of all types. We are at a new frontier of business that we should all be excited by rather than intimidated by.

The Base Plotting System in R

When conducting data analysis, plotting is critically important. In R, plots are crafted by calling successive functions to build up a plot. Just as with a house, you start with the foundation and progress one step at a time until the home is complete. A best practice when dealing with charts in R is to think in two phases: (1) creating a plot and (2) annotating the plot (adding lines, points, text, etc.). R's plotting system is robust and offers a high degree of flexibility and control over charts, which you will come to enjoy.

 

Plotting System

If you are trying to get to the core of the graphics engine in R, remember the following two packages:

  1. graphics: this includes functions such as plot, hist, and boxplot
  2. grDevices: this includes the graphics devices such as PDF, PostScript, and PNG

There is also a very important system known as the lattice plotting system, which is implemented by these packages:

  1. lattice: this includes the code for creating Trellis graphics using functions like xyplot, bwplot, and levelplot
  2. grid: lattice builds on top of grid, so you will seldom call functions from this package directly

The Process of Making a Plot

During this phase, it is important to consider what you would like to accomplish by making a plot.

A few questions that you may want to think about before proceeding are:

  1. Where should I make the plot (on the screen, in a file, etc.)?
  2. How is the plot going to be used?
  3. Is it just for me to conduct exploratory data analysis (temporary)?
  4. Will this be going to a browser online?
  5. Will this end up in a publication of sorts?
  6. Is this going to be in a presentation?
  7. Is it going to be just a few points of data or a large amount of data?
  8. Will I need to have a dynamic graphic?
  9. What graphics package should I aim to use (base, lattice, or ggplot2)?
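
As a minimal sketch of the first question, the same plot can be sent to a file instead of the screen by opening a file device from grDevices, drawing, and then closing the device. The file name and the use of a temporary directory are assumptions made for illustration only.

```r
library(datasets)

## Hypothetical output file in a temporary directory (an assumption for this sketch)
out_file <- file.path(tempdir(), "breaks-hist.pdf")

pdf(out_file)              ## open a PDF file device
hist(warpbreaks$breaks)    ## draw into the file instead of the screen
dev.off()                  ## close the device so the file is written
```

Swapping pdf() for png() follows the same open, draw, close pattern.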

It is important to note that base graphics are generally constructed in a modular fashion. This means that each piece is built one at a time using a series of function calls. Many data scientists like this approach as it mirrors the way we think.

Alternatively, the lattice package requires that you define all parameters up front in a single function call, which allows lattice to calculate the appropriate spacing and font sizes.
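
To illustrate the difference, here is a small sketch of a lattice call, assuming the lattice package is installed (it usually ships with R). The choice of variables, layout, and labels is mine and not part of the original examples.

```r
library(lattice)
library(datasets)

## One call describes the whole figure: the formula, the conditioning
## variable (Diet), the panel layout, and the labels.
p <- xyplot(weight ~ Time | Diet, data = ChickWeight,
            layout = c(4, 1),
            xlab = "Time (days)", ylab = "Weight (g)",
            main = "Chick weight by diet")
print(p)   ## lattice objects are drawn when printed
```

Because everything is known up front, lattice can size the four panels and their shared axes in one pass.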

ggplot2 is a fine package that combines ideas from both base and lattice graphics; however, it uses an independent implementation, so we will not cover it in this post.

Base Graphics

If you are interested in creating 2-D graphics, then you should use the base graphics system.

This is a two-step process:

  1. Initialize a new plot
  2. Add to an existing plot

You can initialize a plot by calling plot(x, y) or hist(x). This will launch a graphics device (if one is not already open) and render a new plot. Unless you specify otherwise, the plot is drawn with the default settings. Keep in mind, though, that it is possible to change things like the title, x-axis label, y-axis label, etc. If you want to investigate further what can be changed, key in ?par. This will bring up the help page for you.
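
For instance, the title and axis labels can be supplied as arguments when the plot is created. This is only a sketch; the airquality variables and the label text are my own choices for illustration.

```r
library(datasets)

## Labels are passed as arguments when the plot is created
with(airquality, plot(Wind, Ozone,
                      main = "Ozone and Wind",
                      xlab = "Wind (mph)",
                      ylab = "Ozone (ppb)"))
```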

Simple Base Graphics: Histogram

 

library(datasets)
hist(warpbreaks$breaks) ## Draw a new plot

Base Graphics: Histogram

 

Simple Base Graphics: Scatterplot

 

library(datasets)
with(ChickWeight, plot(weight, Chick)) ## R is case-sensitive: the column is Chick

Scatterplot

Simple Base Graphics: Boxplot

 

library(datasets)
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone (ppb)")

Boxplot

Some Important Base Graphics Parameters

 

Parameter Name Definition
col: the plotting color (review the colors() function)
lty: the line type (solid line by default)
lwd: the line width
pch: the plotting symbol (open circle by default)
xlab: characters for x-axis label
ylab: characters for y-axis label

It is worthwhile to investigate the par() function. This function sets the global graphics parameters, which affect all the plots in a single R session. These global values can be overridden by passing arguments to individual plotting functions. Some parameters that can be set with par() include:

 

Parameter Name Definition
las: how axis labels are oriented on the plot
bg: background color
mar: size of margin
oma: outer margin size
mfrow: number of plots per row and column (plots are filled row-wise)
mfcol: number of plots per row and column (plots are filled column-wise)
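
A short sketch of setting these parameters with par() follows. Saving the value returned by par() and restoring it afterwards is a common habit, not something the text above requires; the specific margin values are arbitrary.

```r
library(datasets)

## par() returns the previous values of whatever it changes
old_par <- par(las = 1, mar = c(4, 4, 2, 1), mfrow = c(1, 2))

hist(airquality$Ozone, main = "Ozone", xlab = "Ozone (ppb)")
hist(airquality$Wind, main = "Wind", xlab = "Wind (mph)")

par(old_par)   ## restore the earlier settings
```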

Default values for global graphic parameters:

par("lty")
[1] "solid"
par("col")
[1] "black"
par("pch")
[1] 1
par("bg")
[1] "white"
par("mar")
[1] 5.1 4.1 4.1 2.1
par("mfrow")
[1] 1 1

 

Base Plotting Functions

 

Function Name Definition
plot: makes a scatterplot
lines: adds a line to a plot
points: adds points to a plot
text: adds text labels to a plot
title: adds title and subtitle labels
mtext: adds text to margins of the plot
axis: adds axis ticks/labels
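
The functions in this table can be combined as in the sketch below, which starts from an empty frame and adds each piece in turn. The dataset, the smoothed line from lowess(), and the label text are illustrative assumptions.

```r
library(datasets)

## Keep only rows where both variables are present
aq <- na.omit(airquality[, c("Wind", "Ozone")])

plot(aq$Wind, aq$Ozone, type = "n", axes = FALSE, xlab = "", ylab = "")  ## empty frame
points(aq$Wind, aq$Ozone, pch = 20)           ## add the points
lines(lowess(aq$Wind, aq$Ozone), lwd = 2)     ## add a smoothed line
axis(1); axis(2); box()                       ## draw the axes and frame
title(main = "Ozone and Wind", xlab = "Wind (mph)", ylab = "Ozone (ppb)")

i <- which.max(aq$Ozone)                      ## row with the highest ozone reading
text(aq$Wind[i], aq$Ozone[i], "highest reading", pos = 4)
mtext("Source: airquality", side = 3, line = 0.5, cex = 0.8)
```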
Base Plot with Annotation

 

library(datasets)
with(ChickWeight, plot(weight, Chick))
title(main="Chicks and Weight in Nashville") ## Add a title

Annotated Scatterplot

with(ChickWeight, plot(weight, Chick, main = "Chicks and Weight in Nashville"))
with(subset(ChickWeight, Diet == 4), points(weight, Chick, col = "blue")) ## highlight Diet 4

Annotated with Color

with(ChickWeight, plot(weight, Chick, main = "Chicks and Weight in Nashville", type = "n")) ## empty plot
with(subset(ChickWeight, Diet == 4), points(weight, Chick, col = "blue"))
with(subset(ChickWeight, Diet != 4), points(weight, Chick, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("Not Normal","Normal")) ## blue = Diet 4, red = other diets

Multi-color Annotation

Base Plot with Regression Line

with(ChickWeight, plot(weight, Chick, main = "Chicks and Weight in Nashville", pch = 20))
model <- lm(as.numeric(Chick) ~ weight, ChickWeight) ## Chick is a factor, so fit on its numeric codes
abline(model, lwd = 2)

Linear Model

Multiple Base Plots

 

par(mfrow = c(1, 2)) ## one row, two columns
with(ChickWeight, {
  plot(weight, Chick, main = "Chicks and Weight")
  plot(Diet, weight, main = "Weight and Diet") })

Multiplot

The Death of the Data Scientist?

There has been a lot of chatter recently around the notion that data scientists are soon to be replaced by a $30/hr specialist from places like oDesk, Freelancer, and Elance. Before we go down the path of whether we can replace a data scientist, let us take some time to home in on exactly what a data scientist does. Being candid, there is a plethora of answers to this question. If we mean a person who pulls together a data summary or completes a modeling task that has been well-defined before they even encounter the problem, then I think it is absolutely possible to come in at a $30/hr price. In truth, I see that type of data scientist being replaced by automated software without having to deal with a freelancer at all. Look to how similar scenarios have played out, such as online marketing or site development.

But we need to focus on the concept “the data problem was previously well-defined”.

Data scientists who achieve higher salaries tend to be in one of two distinct camps:

1) The Engineer:

This individual knows how to choose the proper tools and infrastructure to solve a specific, technology-laden data problem. These individuals usually work on the leading edge of a problem, and at times there may be very few examples of the problem being worked on in the global community. This is markedly different from the well-defined problem of the freelancer situation described earlier.

2) The Communicator:

This individual knows the technical side of what data science is and how to get at solutions, but their strength is in the storytelling. Many times business leadership is unaware of what is possible with data science, and for that they need a translator of sorts. These types of individuals encounter organizations that know they have a problem to solve, but do not necessarily know how to frame the question so that it can be answered by the data. These businesses look for someone who is personable and not thousands of miles away to guide them through what they feel is incredibly difficult and important.

While it is certainly true that there may be segments of data science which are automated, there will always be a place for problem solvers (think physicians, attorneys, developers, consultants, etc.). Like the roles just mentioned, data scientist is not a simple role.

Not all data scientists are performing rote tasks.

There will always be a place for individuals skilled at leveraging technology to solve complex business problems, and we will have to invest more than $30/hr to garner their expertise.

R Objects: Data Types

  • OBJECTS
  • R has five basic classes of objects:
  1. Character
  2. Numeric
  3. Integer
  4. Complex
  5. Logical

However, the most basic object is a vector.

  • There are two things which you should remember when dealing with vectors.
    1. A vector can only contain objects of the same class.
    2. The exception to this is a list. A list looks like a vector, but can contain objects of different classes.
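
A quick sketch of both points: mixing classes inside c() coerces every element to a single class, while list() keeps each element as it is. The variable names v and l are placeholders of my own.

```r
v <- c(1, "a", TRUE)      ## mixed values are coerced to one class
class(v)                  ## "character"

l <- list(1, "a", TRUE)   ## a list keeps each element's class
sapply(l, class)          ## "numeric" "character" "logical"
```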

To create an empty vector use the following function:

vector()

Vectors can be created or rearranged by using the following:

c() # used to concatenate individual values together
: # to create a sequence, such as 1:10
seq() # to create more complex sequences
rep() # replicates values
sort() # returns the elements of a vector in sorted order
order() # returns the indices that would put a vector in sorted order
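
Since sort() and order() are easy to confuse, here is a small sketch of the difference on a made-up vector x.

```r
x <- c(30, 10, 20)

sort(x)        ## the values in increasing order: 10 20 30
order(x)       ## the positions of those values in x: 2 3 1
x[order(x)]    ## indexing by order() gives the same result as sort(): 10 20 30
```

order() is the one to reach for when a second vector or a data frame needs to be rearranged to match.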

 

An example of using rep()

rep(5,2) #a vector of two fives
[1] 5 5

 

An example of using c()

c(3,2,1) # vector of three numeric elements in that order
[1] 3 2 1

 

An example of using seq()

seq(4,20, by = 2)
[1]  4  6  8 10 12 14 16 18 20
seq(1,length = 20, by =4)
 [1]  1  5  9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77

R: Numbers

In general, numbers in R are treated as numeric objects.

For example,

 3 # numeric object
[1] 3
 3L # explicitly gives an integer
[1] 3
 Inf # a special number which represents infinity
[1] Inf
 1/0
[1] Inf
 1/Inf # can be used in calculations
[1] 0
 0/0 # NaN ("not a number"); also treated as a missing value
[1] NaN
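
These special values can be tested for directly. The sketch below uses only base R predicates and stores the results in illustrative variable names.

```r
nan_is_nan    <- is.nan(0/0)      ## TRUE
nan_is_na     <- is.na(0/0)       ## TRUE: NaN is also treated as missing
na_is_nan     <- is.nan(NA)       ## FALSE: NA is missing, but it is not NaN
inf_is_finite <- is.finite(1/0)   ## FALSE
```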

 

Decimal values are also numeric in R. This happens by default, so if you create a decimal value for x, it will be of the numeric type.

 x = 8.3 # create x with a decimal value
 x # print the value of x
[1] 8.3
 class(x) # what is the class of x?
[1] "numeric"

 

Even when assigning a whole number to a variable such as N, it is still stored as a numeric value.

 N = 43
 N #print the value of N
[1] 43
 class(N) # what is the class of N?
[1] "numeric"

 

You can further confirm that N is not an integer by using the is.integer function.

 is.integer(N) # is N an integer?
[1] FALSE
 is.numeric(N) # is N numeric?
[1] TRUE
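
If an integer is what you actually want, it has to be requested explicitly, either with the L suffix or with as.integer(). A brief sketch, where M is a name introduced only for this example:

```r
N <- 43
M <- as.integer(N)   ## convert explicitly

class(M)             ## "integer"
is.integer(43L)      ## TRUE
N == M               ## TRUE: the values are equal
identical(N, M)      ## FALSE: the types differ
```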