R | Variable Conversion

The as.*() family of functions explicitly converts data to the type you want.

  • as.array(x)
  • as.data.frame(x)
  • as.numeric(x)
  • as.logical(x)
  • as.vector(x)
  • as.matrix(x)
  • as.complex(x)
  • as.character(x)
  • … convert type; for a complete list, use methods(as)

                   to one long vector     to matrix                 to data frame
from vector        c(x,y)                 cbind(x,y), rbind(x,y)    data.frame(x,y)
from matrix        as.vector(mymatrix)    -                         as.data.frame(mymatrix)
from data frame    -                      as.matrix(myframe)        -

If you are interested in testing a data type rather than converting it, use “is.…” instead of “as.…”; for example, is.numeric(x) returns TRUE or FALSE.
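
As a minimal sketch of the conversions and tests described above (the object names x, n, m, df, v and m2 are illustrative and not part of the original list):

```r
# Hypothetical example data; all object names here are illustrative.
x <- c("1", "2", "3")            # character vector

is.character(x)                  # test the type
n <- as.numeric(x)               # explicit conversion to numeric
is.numeric(n)

lg <- as.logical(c(0, 1, 2))     # zero becomes FALSE, non-zero becomes TRUE

m  <- cbind(a = 1:3, b = 4:6)    # two vectors -> 3 x 2 matrix
df <- as.data.frame(m)           # matrix -> data frame
v  <- as.vector(m)               # matrix -> one long vector, column by column
m2 <- as.matrix(df)              # data frame -> matrix

is.data.frame(df)
is.matrix(m2)
```

Note that as.vector() on a matrix walks down the columns, so v holds the first column followed by the second.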

R | Slicing and Extracting Data

Indexing Vectors

  • x[n] nth element
  • x[-n] all but the nth element
  • x[1:n] first n elements
  • x[-(1:n)] elements from n+1 to the end
  • x[c(1,4,2)] specific elements
  • x["name"] element named "name"
  • x[x > 3] all elements greater than 3
  • x[x > 3 & x < 5] all elements between 3 and 5
  • x[x %in% c("a","and","the")] elements in the given set

Indexing Lists

  • x[n] list with elements n
  • x[[n]] nth element of the list
  • x[["name"]] element of the list named "name"
  • x$name id.

Indexing Matrices

  • x[i,j] element at row i, column j
  • x[i,] row i
  • x[,j] column j
  • x[,c(1,3)] columns 1 and 3
  • x["name",] row named "name"

Indexing Data Frames

  • x[["name"]] column named "name"
  • x$name id.
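
The indexing forms for vectors, lists, matrices, and data frames can be tried on small objects. This is only a sketch; the objects x, lst, m and df and their contents are made up for illustration.

```r
# Vectors (names and values are hypothetical)
x <- c(a = 5, b = 2, c = 9, d = 4)
x[2]                 # 2nd element
x[-2]                # all but the 2nd element
x[c(1, 4, 2)]        # specific elements, in that order
x["c"]               # element named "c"
big <- x[x > 3]      # all elements greater than 3

# Lists: [ ] returns a list, [[ ]] returns the element itself
lst <- list(name = "Ada", scores = c(90, 85))
sub   <- lst[1]          # list of length 1
inner <- lst[[2]]        # the numeric vector
lst[["name"]]            # same as lst$name

# Matrices
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m[2, 3]              # element at row 2, column 3
m[1, ]               # row 1
m[, c(1, 3)]         # columns 1 and 3
row2 <- m["r2", ]    # row named "r2"

# Data frames
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
df[["score"]]        # column named "score"
df$score             # same column
```

The difference between x[n] and x[[n]] on a list is the one that most often surprises newcomers: the first keeps the list wrapper, the second removes it.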

R | Data Creation

Ever wonder how to generate data within R?

  • c(…) generic function to combine arguments with the default forming a vector; with recursive=TRUE descends through lists combining all elements into one vector
  • from:to generates a sequence; “:” has operator priority; 1:4 + 1 is “2,3,4,5”
  • seq(from,to) generates a sequence; by= specifies the increment; length= specifies the desired length
  • seq(along=x) generates 1, 2, …, length(x); useful for for loops
  • rep(x,times) replicate x times; use each= to repeat “each” element of x each times; rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
  • data.frame(…) create a data frame of the named or unnamed arguments; data.frame(v=1:4,ch=c("a","B","c","d"),n=10); shorter vectors are recycled to the length of the longest
  • list(…) create a list of the named or unnamed arguments; list(a=c(1,2),b="hi",c=3i)
  • array(x,dim=) array with data x; specify dimensions like dim=c(3,4,2); elements of x recycle if x is not long enough
  • matrix(x,nrow=,ncol=) matrix; elements of x recycle
  • factor(x,levels=) encodes a vector x as a factor
  • gl(n,k,length=n*k,labels=1:n) generate levels (factors) by specifying the pattern of their levels; n is the number of levels, and k is the number of replications
  • expand.grid() a data frame from all combinations of the supplied vectors or factors
  • rbind(…) combine arguments by rows for matrices, data frames, and others
  • cbind(…) id. by columns
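
A short sketch of several of the constructors above. The object names and the labels "low" and "high" are illustrative; seq_along() and length.out= are the spelled-out forms of seq(along=) and length=.

```r
s1 <- 1:4 + 1                         # ":" binds tighter than "+": 2 3 4 5
s2 <- seq(0, 1, by = 0.25)            # fixed increment
s3 <- seq(0, 1, length.out = 3)       # fixed length
idx <- seq_along(c("a", "b", "c"))    # 1 2 3, handy in for loops

r1 <- rep(c(1, 2, 3), times = 2)      # 1 2 3 1 2 3
r2 <- rep(c(1, 2, 3), each = 2)       # 1 1 2 2 3 3

# n = 10 is recycled to the length of the longest column
df <- data.frame(v = 1:4, ch = c("a", "B", "c", "d"), n = 10)

m <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column

# gl(n, k): n levels, each replicated k times
f <- gl(2, 3, labels = c("low", "high"))

# every combination of the supplied vectors
grid <- expand.grid(x = 1:2, y = c("a", "b", "c"))
```

Here gl(2, 3) yields three copies of the first level followed by three copies of the second, and expand.grid() returns a data frame with 2 x 3 = 6 rows.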

R | Input and Output

Many of the functions for input and output when using R are online.

  • load() load the datasets written with save
  • data(x) loads specified data sets
  • library(x) load add-on packages
  • read.table(file) reads a file in table format and creates a data frame from it; the default separator sep="" is any whitespace; use header=TRUE to read the first line as a header of column names; use as.is=TRUE to prevent character vectors from being converted to factors; use comment.char="" to prevent "#" from being interpreted as a comment; use skip=n to skip n lines before reading data; see the help for options on row naming, NA treatment, and others
  • read.csv("filename",header=TRUE) id. but with defaults set for reading comma-delimited files
  • read.delim("filename",header=TRUE) id. but with defaults set for reading tab-delimited files
  • read.fwf(file,widths,header=FALSE,sep="",as.is=FALSE) read a table of fixed width formatted data into a data frame; widths is an integer vector, giving the widths of the fixed-width fields
  • save(file,…) saves the specified objects (…) in the XDR platform independent binary format
  • save.image(file) saves all objects
  • cat(…, file="", sep=" ") prints the arguments after coercing to character; sep is the character separator between arguments
  • print(a, …) prints its arguments; generic, meaning it can have different methods for different objects
  • format(x,…) format an R object for pretty printing
  • write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ") prints x after converting to a data frame; if quote is TRUE, character or factor columns are surrounded by quotes ("); sep is the field separator; eol is the end-of-line separator; na is the string for missing values; use col.names=NA to add a blank column header to get the column headers aligned correctly for spreadsheet input
  • sink(file) output to file, until sink()
  • Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection. file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables. On Windows, the file connection can also be used with description = "clipboard".
  • To read a table copied from Excel, use x <- read.delim("clipboard"). To write a table to the clipboard for Excel, use write.table(x,"clipboard",sep="\t",col.names=NA).
  • For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle.
  • See packages XML, hdf5, netCDF for reading other file formats.
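
The functions above can be exercised with a small round trip through temporary files. This is a sketch only: the data frame df and the temporary paths are invented for the example, and nothing is written outside R's temporary directory.

```r
# Hypothetical data; temporary files keep the example self-contained.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

csv_path <- tempfile(fileext = ".csv")
write.table(df, file = csv_path, sep = ",",
            row.names = FALSE, col.names = TRUE)

# read.csv() is read.table() with comma-friendly defaults
back <- read.csv(csv_path, header = TRUE)
tab  <- read.table(csv_path, header = TRUE, sep = ",", as.is = TRUE)

# save() writes named objects to a binary file; load() restores them
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)
rm(df)
load(rdata_path)                 # df exists again

cat("rows read back:", nrow(back), "\n")

unlink(c(csv_path, rdata_path))  # clean up the temporary files
```

Passing as.is=TRUE keeps the name column as character regardless of R version, which is why it is used in the read.table() call.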

 

R | Getting Help

Most R functions have online documentation.

 

  • help(topic) documentation on topic
  • ?topic id.
  • help.search("topic") search the help system
  • apropos("topic") the names of all objects in the search list matching the regular expression "topic"
  • help.start() start the HTML version of help
  • str(a) display the internal *str*ucture of an R object
  • summary(a) gives a “summary” of a, usually a statistical summary but it is generic meaning it has different operations for different classes of a
  • ls() show objects in the search path; specify pat="pat" to search on a pattern
  • ls.str() str() for each variable in the search path
  • dir() show files in the current directory
  • methods(a) shows S3 methods of a
  • methods(class=class(a)) lists all the methods to handle objects of class a
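
A brief, non-interactive sketch of a few of these helpers. The vector x is illustrative, and the help() call is left commented out because it opens a documentation viewer.

```r
x <- c(1.5, 2, 10)               # hypothetical object to inspect

str(x)                           # compact view of the internal structure
summary(x)                       # generic: output depends on the class of x

hits <- apropos("^read\\.csv")   # object names matching a regular expression
m <- methods(summary)            # S3 methods available for summary()

# help(mean)                     # same as ?mean; opens the documentation
```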

Simple Discrete Models

For those of you interested in learning a bit more about discrete models, below are some fantastic resources to help you on your journey:

  • Book: Murphy — Chapter 2 — Probability
  • Book: Murphy — Chapter 3 — Generative Models for Discrete Data
  • Book: Bishop — Chapter 2, Sections 2.1-2.2 — Probability Distributions
  • Book: MacKay — Chapter 2 — Probability, Entropy, and Inference
  • Book: MacKay — Chapter 3 — More About Inference
  • Book: MacKay — Chapter 23 — Useful Probability Distributions

 

Book references for the above:

Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. An excellent and affordable book on machine learning, with a Bayesian focus. It covers fewer topics than the Murphy book, but goes into more depth on the topics it covers.

David J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.

 

Getting Started with Machine Learning

In truth, I am an advocate for jumping in head first and using what you learn in real time. Practically speaking, this means spending less time up front on the theory and heavy math behind the tools you are using, with the attitude that understanding will come as you go.

Do you know how to program in a specific language? If so, then determine if that language has a library which can be leveraged to aid you in your machine learning journey.

If you do not know how to program, that is okay too. Survey a few languages (R and Python are popular among data scientists), see if one is more understandable to you, and then go down the same path: seeking a machine learning library.

Shhh, it’s a Library

No Programming Necessary
  • WEKA – you can do virtually everything with this workbench: pre-processing the data, visualizing the data, building classifiers, and making predictions.
  • BigML – Like WEKA, you will not have to program with BigML. You can explore model building in a browser. If you are not certain about machine learning (or data science for that matter), this would be a great place to start.
R (Statistical Computing)
  • If you really enjoy math and have not picked a language yet, then this may be for you. There are a lot of packages here, developed by pioneers in the field, which you can leverage without having to refactor any code. All packages come with instructions – giving you some of the theory and example cases for you to see in action. In my judgment, learning this language allows you to explore and prototype quickly, which most certainly will prove valuable.
Python
  • scikit-learn – If you enjoy Python then this library is for you. It is known for its documentation, which allows you to rapidly apply virtually any machine learning algorithm.
Octave
  • Octave is an open-source alternative to MATLAB (some functions are not present). Like MATLAB, Octave is known for solving linear and non-linear problems. If you have an engineering background, then this might be the place for you. Practically speaking, though, many organizations do not use Octave/MATLAB, as it is seen as primarily academic software.

No matter what you pick, decide to use it and stick with it for a while. In fact, I would commit to it for the next 12 months. Actually use the language/library you choose; do not just read about it.

Learning Online

If you are really a beginner, you may want to steer clear of some of what you see online. Many people I talk to like the idea of data science and machine learning and decide to sign up for an online course. The problem they encounter is that in many cases they already have to know how to program (to some degree), and they are expected to know linear algebra and probability theory.

Linear Algebra Example

Probability Theory Example

If you do decide to watch classes online, then you should absolutely take notes (even if you toss them later). The key is to participate – which may sound obvious, but when you are at home in your pajamas learning about data science it is not quite so obvious.

That being said there are some really good (and free) online lectures (do not be overwhelmed):

Research Papers

This may not be your thing either; not everybody likes to pick up a research paper to read. Many individuals complain that the reading is a bit too academic and does not really convey insight to the reader (which is the opposite of the intent of the paper). To be candid, some are written better than others; in many cases that has to do with the topic or the period in which the paper was written. However, there are a few seminal papers you should be acquainted with that will give you context for machine learning and data science, which should prove invaluable in your journey. My encouragement to you is to find these papers, and if you are not ready to read them because you are busy building skills in other areas, simply hold on to them and try reading them every 3 months. See how far you get without getting lost, see whether having read the paper gives you a deeper understanding of what you are doing when you code a solution, and best of all read the reference page to find out who influenced the paper you read.

Machine Learning Books for those Just Starting

Let’s face it, there are not a lot of books out there that aim to help those just starting out in machine learning. As before, the expectation is that you will have some linear algebra and probability theory down pat. Unless you come from the hard sciences (mathematics, engineering, biostatistics, etc.), you will probably have to do some skill building here before reading most of the books on the market. However, there are a few that start close to where most beginners actually are and encourage those of you willing to try on your own.

Curious to know your thoughts on the above. Have you used any of these resources? Do you have any that you would recommend?

TITLE – The New CEO Super Power – Big Data Analytics

DESCRIPTION – Business leaders are often lost when it comes to the heralds of big data. Do leaders need to understand all the technical details and programming of everything from the cloud that could create insight for their business? We will walk through what is reasonable to change the DNA of your organization from its current state to a future analytic state – leading to new business models and sparking disruptive innovation.

ABSTRACT – Many executives misunderstand the key to transforming their organizations into analytics-driven ones. Too often, analytics is seen as a problem in data science and technology. We’ll dive into the courage it takes to use the insight gleaned from data analytics to drive an organization, leaders who change their organization to deliver on their companies’ analytical potential, and why you don’t want to hand your organization over to a team of statisticians and programmers.

  • What leaders need to understand about big data
  • Building new business models
  • Sparking disruptive innovation

 

Date: January 28, 2015
Event: Rainmakers Companies Conference January 2015
Topic: The New CEO Super Power – Big Data Analytics
Sponsor: The Rainmakers Companies
800.231.2524
Location: Las Vegas, NV
Public: Private

Contact me to speak at your next conference.

How to Become a Data Scientist

How does one become a data scientist?

Well, in truth, the path is most certainly clear. However, the work it takes to travel down the road is not for everyone. Before reading this you may want to have an understanding of where you are with your current analytic skills (e.g. MS Excel only, maybe a little bit of SQL, Crystal Reports, etc.). Use the rest of this article as a measuring stick for where you are and where you would like to go. In fact, it is best to begin with the end in mind and work backwards to the most basic skill you will need and start building from there…

Recently DataCamp posted an infographic which described 8 easy steps to become a data scientist.

[Figure: How to become a data scientist. A portion of the infographic posted on the DataCamp blog.]

What is a Data Scientist

It’s important to understand what this infographic is based on:

  1. Drew Conway’s data science Venn diagram, which combines hacking skills, math and statistics knowledge, and substantive expertise.
  2. A graph showing the survey results on the question of education level, not unlike the graph in O’Reilly’s Analyzing the Analyzers.
  3. Josh Wills’ quote on what is a data scientist.

Become a Data Scientist

Using the infographic, the 8 steps to becoming a data scientist are:

  1. You need to know (there is a spectrum here) stats and machine learning. The fix – take online courses for free.
  2. Learn to code (not everything, but very specific things). Get a book or take a class (online or offline). Popular languages are Python and R in the data science space.
  3. You should understand databases. This is important because for the most part this is where the data lives.
  4. Critical skills are data munging (data clean-up and transformations), visualization, and reporting.
  5. You will need to Biggie-Size your skills. Learn to use tools like Hadoop, MapReduce, and Spark.
  6. This part is extremely important – get experience. You should be meeting with other data scientists in meetups or talking with people in your office about what you are learning and accomplishing with your enhanced skills. Do yourself a favor: obtain a data set online and start exploring it with your newfound techniques. I recommend Kaggle and CrowdANALYTIX for interesting data sets.
  7. Get yourself one of these: internship, bootcamp or a job. You can’t beat real experience.
  8. Know who the players are in this space and why. Follow them and engage with them, and be a part of and engage with the data science community.

My thoughts…

In my judgment, you should look at the data and the algorithms first, then get busy with the math and programming. However, I do agree with the idea of moving through steps 1-5 to gain familiarity with the discipline. Steps 6-7 I would categorize as working the problem, and the final step as plugging into a community.

It may be important to go another step forward. 

It is more intuitive to collapse steps 1-5 into one (this could be a crash course in the terms and themes relevant to data science). My preference (it’s what has worked for me) is to jump in with the data and the tools of the trade as soon as possible. More people need to develop just-in-time learning mechanisms, rather than learning the entire universe of a topic. Approaching data science in this way allows an individual to build on a combination of theory and practical experience. This is done by encountering problem sets over and over again.

Learn the art of relevance…what makes sense for my situation right now. Obtain a solid data set and get learning. This sort of action works to build context for the tools you are using.

The fastest way to become a data scientist is to recognize where you are with your current skills, grab a data set, pick a language (R, Python, Julia, C++, MATLAB, etc.), and start working through a problem end-to-end.

What do you think it takes to be a data scientist?

 

Functions

The notion of a function is that of something which provides a distinct output for a given input.

Definition

Think about two sets, D and R, along with a rule which assigns a unique element of R to each and every element of D. This rule is termed a function and it is represented by a letter such as f. Given x ∈ D, f(x) is the name of the element of R which comes from applying f to x. D is called the domain of f. To make explicit that D is the domain of f, the notation D(f) may be used. The set R is sometimes described as the range of f; nowadays it is known as the codomain. The set of all elements of R which are of the form f(x) for some x ∈ D is, consequently, a subset of R. This is sometimes referred to as the image of f. When this set equals R, the function f is said to be onto, also surjective. If x ≠ y implies f(x) ≠ f(y), the function is called one-to-one, also injective.

It is typical to write f : D → R to denote the situation just described in this definition, where f is a function defined on a domain D which has values in a codomain R.
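
For finite sets, these definitions can be checked directly. The sketch below is in R and uses a made-up example: the sets dom and codom and the squaring function f are assumptions chosen for illustration, not taken from the text.

```r
# Hypothetical finite example: f(x) = x^2 from dom into codom.
dom   <- c(-2, -1, 0, 1, 2)      # the domain D
codom <- c(0, 1, 4)              # the codomain R
f <- function(x) x^2

img <- unique(f(dom))            # image of f: the values actually reached

# onto (surjective): the image equals the codomain
is_onto <- setequal(img, codom)

# one-to-one (injective): distinct inputs never share an output
is_one_to_one <- !any(duplicated(f(dom)))
```

In this example f is onto, because every element of codom is reached, but it is not one-to-one, because f(-2) and f(2) are both 4.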