Introduction to Inference and Learning

Many of my subscribers have asked for resources to help them build a better understanding of inference and learning. Because people have different learning styles, there are both reading and video resources (I recommend using both).

  • Book: Murphy — Chapter 1 — Introduction
  • Book: Bishop — Chapter 1 — Introduction

Books mentioned above:

Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, MIT Press, 2012.

Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006. An excellent and affordable book on machine learning, with a Bayesian focus. It covers fewer topics than the Murphy book, but goes into more depth on the topics it covers.

If you have resources that you think that I missed, please let me know. If there is a resource that you particularly enjoyed I would like to hear from you as well.

R | Data Selection and Manipulation

The functions below give a bit of background on data selection and manipulation in R.

  • which.max(x) returns the index of the greatest element of x
  • which.min(x) returns the index of the smallest element of x
  • rev(x) reverses the elements of x
  • sort(x) sorts the elements of x in increasing order; to sort in decreasing order: rev(sort(x))
  • cut(x,breaks) divides x into intervals (factors); breaks is the number of cut intervals or a vector of cut points
  • match(x, y) returns a vector of the same length as x giving, for each element of x, the position of its first match in y (NA if there is none)
  • which(x == a) returns the indices at which the comparison is TRUE, in this example the values of i for which x[i] == a (the argument of this function must be a logical vector)
  • choose(n, k) computes the number of combinations of k items chosen from n = n!/[(n−k)!k!]
  • na.omit(x) removes the observations with missing data (NA); if x is a matrix or a data frame, the corresponding row is removed
  • na.fail(x) returns an error message if x contains at least one NA
  • unique(x) if x is a vector or a data frame, returns a similar object with the duplicate elements removed
  • table(x) returns a table with the counts of the different values of x (typically for integers or factors)
  • subset(x, …) returns a selection of x with respect to criteria (…, typically comparisons: x$V1 < 10); if x is a data frame, the option select gives the variables to be kept, or dropped using a minus sign
  • sample(x, size) randomly draws size elements from the vector x without replacement; the option replace = TRUE draws with replacement
  • prop.table(x, margin=) expresses table entries as fractions of the marginal table
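
As a rough illustration of the functions listed above, here is a minimal sketch. The vector x and the data frames df and with_na are made-up sample data, not part of any real dataset.

```r
# Hypothetical sample data, used only for illustration.
x <- c(3, 8, 1, 8, 5)

which.max(x)                 # index of the first maximum: 2
which.min(x)                 # index of the minimum: 3
sort(x, decreasing = TRUE)   # same result as rev(sort(x))
unique(x)                    # 3 8 1 5
which(x == 8)                # 2 4
match(c(8, 7), x)            # 2 NA: position of first match, NA when absent

counts <- table(x)             # how often each distinct value occurs
props  <- prop.table(counts)   # the same counts expressed as proportions

# cut() bins a numeric vector into a factor of intervals.
bins <- cut(x, breaks = c(0, 4, 10))

# subset() filters rows and, with select, chooses columns.
df <- data.frame(V1 = c(5, 12, 7), V2 = c("a", "b", "c"))
small <- subset(df, V1 < 10, select = V1)

# na.omit() drops every row that contains a missing value.
with_na <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
clean <- na.omit(with_na)
```

Note that sort(x, decreasing = TRUE) is an alternative to the rev(sort(x)) idiom mentioned in the list; both give the same ordering.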

 

Functions for Manipulating Character Variables

  • nchar(x) returns a vector of the lengths of each value in x
  • paste(a,b,sep="_") concatenates character values, using sep between them
  • substr(x,start,stop) extracts the characters from positions start to stop from x
  • strsplit(x,split) splits each value of x into a list of strings using split as the delimiter
  • grep(pattern,x) returns the indices of the elements of x that contain pattern (use value=TRUE to return the elements themselves)
  • grepl(pattern,x) returns a logical vector indicating whether each element of x contains pattern
  • regexpr(pattern,x) returns the integer position of the first occurrence of pattern in each element of x (-1 if there is no match)
  • gsub(pattern,replacement,x) replaces each occurrence of pattern with replacement
  • tolower(x) converts x to all lower case
  • toupper(x) converts x to all upper case
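
The character functions above can be tried out on a small example. This is only a sketch; the vector x is hypothetical sample text.

```r
# Hypothetical character vector for illustration.
x <- c("apple pie", "Banana", "cherry tart")

nchar(x)                      # 9 6 11
paste("a", "b", sep = "_")    # "a_b"
substr(x, 1, 3)               # "app" "Ban" "che"
strsplit(x, " ")              # a list with one character vector per element
grep("an", x)                 # 2: index of the matching element
grep("an", x, value = TRUE)   # "Banana": the matching element itself
grepl("e", x)                 # TRUE FALSE TRUE
regexpr("a", x)               # position of the first "a" in each element
gsub("a", "A", x)             # replaces every "a" with "A"
tolower(x)
toupper(x)
```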

 

Logical Operators

  • == is equal to
  • != is not equal to
  • > greater than
  • >= greater than or equal to
  • < less than
  • <= less than or equal to
  • %in% is in the list
  • ! not (reverses TRUE and FALSE)
  • & and
  • | or
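
A short sketch of how these operators behave on a vector. The values in x are made up for illustration.

```r
x <- c(2, 4, 6, 8)   # hypothetical values

x == 4               # FALSE TRUE FALSE FALSE
x != 4               # TRUE FALSE TRUE TRUE
x >= 4 & x < 8       # FALSE TRUE TRUE FALSE
x < 3 | x > 7        # TRUE FALSE FALSE TRUE
x %in% c(4, 8, 10)   # FALSE TRUE FALSE TRUE
!(x > 4)             # TRUE TRUE FALSE FALSE

# A logical vector can index the vector it was computed from.
x[x >= 4 & x < 8]    # 4 6
```

The operators work element by element, so each comparison returns a logical vector the same length as x.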

 

R | Variable Information

If you want to know a little more about the variables you are working with, try out these R commands.

  • is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x), … test for type; for a complete list, use methods(is)
  • length(x) number of elements in x
  • dim(x) Retrieve or set the dimension of an object; dim(x) <- c(3,2)
  • dimnames(x) Retrieve or set the dimension names of an object
  • nrow(x) number of rows; NROW(x) is the same but treats a vector as a one-column matrix
  • ncol(x) and NCOL(x) the same for columns
  • class(x) get or set the class of x; class(x) <- "myclass"
  • unclass(x) remove the class attribute of x
  • attr(x,which) get or set the attribute which of x
  • attributes(obj) get or set the list of attributes of obj
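
The commands above can be sketched on two small objects. The objects m, v and y, the class name "myclass" and the attribute name "note" are all hypothetical.

```r
m <- matrix(1:6, nrow = 3)   # hypothetical 3 x 2 matrix
v <- 1:5                     # hypothetical plain vector

is.numeric(m)                # TRUE
is.character(m)              # FALSE
is.na(c(1, NA, 3))           # FALSE TRUE FALSE
length(m)                    # 6: total number of elements
dim(m)                       # 3 2
nrow(m); ncol(m)             # 3 and 2

nrow(v)                      # NULL: a plain vector has no dim attribute
NROW(v)                      # 5: the vector is treated as a one-column matrix
NCOL(v)                      # 1

y <- 1:6
dim(y) <- c(3, 2)            # setting dim turns the vector into a matrix
is.matrix(y)                 # TRUE

class(y) <- "myclass"        # set a (made-up) class
inherits(y, "myclass")       # TRUE
y <- unclass(y)              # remove the class attribute again

attr(y, "note") <- "demo"    # set a single attribute
attributes(y)                # list containing dim and note
```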

R | Variable Conversion

The as.… functions explicitly convert data to the type you want.

  • as.array(x)
  • as.data.frame(x)
  • as.numeric(x)
  • as.logical(x)
  • as.vector(x)
  • as.matrix(x)
  • as.complex(x)
  • as.character(x)
  • … convert type; for a complete list, use methods(as)
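
Common conversions between data structures:

  • from vector: to one long vector c(x,y); to matrix cbind(x,y) or rbind(x,y); to data frame data.frame(x,y)
  • from matrix: to one long vector as.vector(mymatrix); to data frame as.data.frame(mymatrix)
  • from data frame: to matrix as.matrix(myframe)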

If you want to test a data type rather than convert it, use the is.… functions instead of the as.… functions; R then returns TRUE or FALSE.
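
A minimal sketch of the conversions described in this section. The vectors x and y are hypothetical sample data.

```r
x <- c(1, 2, 3)   # hypothetical vectors
y <- c(4, 5, 6)

long <- c(x, y)              # one long vector: 1 2 3 4 5 6
m    <- cbind(x, y)          # 3 x 2 matrix with columns x and y
df   <- as.data.frame(m)     # data frame with columns x and y

as.vector(m)                 # back to one long vector (column by column)
as.matrix(df)                # data frame back to a matrix

as.numeric("3.14")           # 3.14
as.character(42)             # "42"
as.logical(c("TRUE", "no"))  # TRUE NA: strings R cannot interpret become NA

# The is.… functions test a type instead of converting it.
is.numeric("3.14")           # FALSE
is.numeric(as.numeric("3.14"))  # TRUE
```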

R | Slicing and Extracting Data

Indexing Vectors

  • x[n] nth element
  • x[-n] all but the nth element
  • x[1:n] first n elements
  • x[-(1:n)] elements from n+1 to the end
  • x[c(1,4,2)] specific elements
  • x["name"] element named "name"
  • x[x > 3] all elements greater than 3
  • x[x > 3 & x < 5] all elements between 3 and 5
  • x[x %in% c("a","and","the")] elements in the given set
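
The vector indexing forms above, sketched on a made-up named vector x and a made-up character vector words:

```r
x <- c(a = 1, b = 4, c = 2, d = 7, e = 5)   # hypothetical named vector

x[2]               # second element (b = 4)
x[-2]              # everything except the second element
x[1:3]             # first three elements
x[-(1:3)]          # elements from the fourth to the end (d, e)
x[c(1, 4, 2)]      # specific elements, in the order requested
x["d"]             # element named "d"
x[x > 3]           # b, d, e
x[x > 3 & x < 5]   # b

words <- c("a", "cat", "and", "the", "dog")
words[words %in% c("a", "and", "the")]   # "a" "and" "the"
```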

Indexing Lists

  • x[n] list with elements n
  • x[[n]] nth element of the list
  • x[["name"]] element of the list named "name"
  • x$name the same as x[["name"]]

Indexing Matrices

  • x[i,j] element at row i, column j
  • x[i,] row i
  • x[,j] column j
  • x[,c(1,3)] columns 1 and 3
  • x["name",] row named "name"

Indexing Data Frames

  • x[["name"]] column named "name"
  • x$name the same as x[["name"]]
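
To make the difference between list, matrix and data frame indexing concrete, here is a small sketch. The objects lst, m and df and their contents are hypothetical.

```r
# Lists: single brackets return a list, double brackets return the element.
lst <- list(name = "Ann", scores = c(90, 85))
lst[1]          # a list of length 1
lst[[2]]        # the numeric vector 90 85
lst[["name"]]   # "Ann"
lst$name        # "Ann"

# Matrices: index by row and column, by position or by name.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))
m[2, 3]         # 6
m[1, ]          # row 1: 1 3 5
m[, 2]          # column 2: 3 4
m[, c(1, 3)]    # columns 1 and 3
m["r2", ]       # row named "r2": 2 4 6

# Data frames: columns can be pulled out like list elements.
df <- data.frame(id = 1:3, score = c(10, 20, 30))
df[["score"]]   # 10 20 30
df$score        # 10 20 30
```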

R | Data Creation

Ever wonder how to generate data within R?

  • c(…) generic function to combine arguments with the default forming a vector; with recursive=TRUE descends through lists combining all elements into one vector
  • from:to generates a sequence; “:” has operator priority; 1:4 + 1 is “2,3,4,5”
  • seq(from,to) generates a sequence; by= specifies the increment; length= specifies the desired length
  • seq(along=x) generates 1, 2, …, length(x); useful for for loops
  • rep(x,times) replicates x times; use each= to repeat each element of x each times; rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
  • data.frame(…) creates a data frame of the named or unnamed arguments; data.frame(v=1:4,ch=c("a","B","c","d"),n=10); shorter vectors are recycled to the length of the longest
  • list(…) creates a list of the named or unnamed arguments; list(a=c(1,2),b="hi",c=3i)
  • array(x,dim=) array with data x; specify dimensions like dim=c(3,4,2); elements of x recycle if x is not long enough
  • matrix(x,nrow=,ncol=) matrix; elements of x recycle
  • factor(x,levels=) encodes a vector x as a factor
  • gl(n,k,length=n*k,labels=1:n) generates levels (factors) by specifying the pattern of their levels; n is the number of levels, and k is the number of replications
  • expand.grid() creates a data frame from all combinations of the supplied vectors or factors
  • rbind(…) combines arguments by rows for matrices, data frames, and others
  • cbind(…) the same, by columns
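
A quick sketch of the data creation functions above. All values and the labels "ctl" and "trt" are made up for illustration; seq_along(x) is used as the equivalent of seq(along=x).

```r
1:4 + 1                          # 2 3 4 5, because ":" binds tighter than "+"
seq(0, 1, by = 0.25)             # 0.00 0.25 0.50 0.75 1.00
seq(1, 10, length.out = 4)       # 1 4 7 10
seq_along(c("a", "b", "c"))      # 1 2 3

rep(c(1, 2, 3), 2)               # 1 2 3 1 2 3
rep(c(1, 2, 3), each = 2)        # 1 1 2 2 3 3

# n = 10 is recycled to fill all four rows.
df  <- data.frame(v = 1:4, ch = c("a", "B", "c", "d"), n = 10)
lst <- list(a = c(1, 2), b = "hi", c = 3i)
arr <- array(1:24, dim = c(3, 4, 2))
m   <- matrix(1:6, nrow = 2, ncol = 3)

f <- factor(c("lo", "hi", "lo"), levels = c("lo", "hi"))

# gl(n, k): n levels, each repeated k times.
g <- gl(2, 3, labels = c("ctl", "trt"))   # ctl ctl ctl trt trt trt

# Every combination of the supplied values: 2 x 2 = 4 rows.
grid <- expand.grid(x = 1:2, y = c("a", "b"))

rbind(1:3, 4:6)                  # 2 x 3 matrix
cbind(1:3, 4:6)                  # 3 x 2 matrix
```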

R | Input and Output

R's input and output functions are documented in its online help; the most common ones are listed below.

  • load() load the datasets written with save
  • data(x) loads specified data sets
  • library(x) load add-on packages
  • read.table(file) reads a file in table format and creates a data frame from it; the default separator sep="" is any whitespace; use header=TRUE to read the first line as a header of column names; use as.is=TRUE to prevent character vectors from being converted to factors; use comment.char="" to prevent "#" from being interpreted as a comment; use skip=n to skip n lines before reading data; see the help for options on row naming, NA treatment, and others
  • read.csv("filename",header=TRUE) the same, but with defaults set for reading comma-delimited files
  • read.delim("filename",header=TRUE) the same, but with defaults set for reading tab-delimited files
  • read.fwf(file,widths,header=FALSE,sep="\t",as.is=FALSE) reads a table of fixed-width formatted data into a data frame; widths is an integer vector giving the widths of the fixed-width fields
  • save(…, file=) saves the specified objects (…) in the XDR platform-independent binary format
  • save.image(file) saves all objects
  • cat(…, file="", sep=" ") prints the arguments after coercing to character; sep is the character separator between arguments
  • print(a, …) prints its arguments; generic, meaning it can have different methods for different objects
  • format(x,…) formats an R object for pretty printing
  • write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ") prints x after converting to a data frame; if quote is TRUE, character or factor columns are surrounded by quotes ("); sep is the field separator; eol is the end-of-line separator; na is the string for missing values; use col.names=NA to add a blank column header to get the column headers aligned correctly for spreadsheet input
  • sink(file) sends output to file, until sink() is called
  • Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection. file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables. On Windows, the file connection can also be used with description = "clipboard".
  • To read a table copied from Excel, use x <- read.delim("clipboard"). To write a table to the clipboard for Excel, use write.table(x,"clipboard",sep="\t",col.names=NA)
  • For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle.
  • See packages XML, hdf5, netCDF for reading other file formats.
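
A minimal round trip through some of the functions above, as a sketch. The data frame df is made up, and the file paths come from tempfile() so nothing is written to your working directory.

```r
# Hypothetical data frame for illustration.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Write a comma-separated file, then read it back.
csv_path <- tempfile(fileext = ".csv")
write.table(df, file = csv_path, sep = ",",
            row.names = FALSE, col.names = TRUE)
back <- read.csv(csv_path, header = TRUE)

# save() stores objects in R's binary format; load() restores them by name.
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)
rm(df)
load(rdata_path)          # df exists again

# cat() prints its arguments separated by sep (a space by default).
cat("rows read back:", nrow(back), "\n")

# Remove the temporary files.
unlink(c(csv_path, rdata_path))
```

Using col.names = TRUE with row.names = FALSE keeps the header aligned with the data, so read.csv can pick up the column names directly.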

 

R | Getting Help

Most R functions have online documentation.

 

  • help(topic) documentation on topic
  • ?topic the same as help(topic)
  • help.search("topic") search the help system
  • apropos("topic") the names of all objects in the search list matching the regular expression "topic"
  • help.start() start the HTML version of help
  • str(a) display the internal structure of an R object
  • summary(a) gives a "summary" of a, usually a statistical summary, but it is generic, meaning it has different operations for different classes of a
  • ls() show objects in the search path; specify pat="pat" to search on a pattern
  • ls.str() str() for each variable in the search path
  • dir() show files in the current directory
  • methods(a) shows S3 methods of a
  • methods(class=class(a)) lists all the methods to handle objects of class a
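
A short sketch of the inspection commands above. The data frame a is hypothetical; the help() and ? calls are left commented out because they open a documentation viewer and are best run in an interactive session.

```r
# help(mean)     # documentation for mean(); run interactively
# ?mean          # the same

a <- data.frame(x = c(1, 2, 3), g = c("u", "v", "u"))   # hypothetical object

str(a)                        # compact view of the internal structure
summary(a)                    # per-column summary; output depends on the class

reads <- apropos("^read\\.")  # names of objects matching a regular expression
head(reads)

ls()                          # objects in the current workspace
dir(tempdir())                # files in a directory (here the session temp dir)

methods(summary)              # S3 methods defined for summary()
methods(class = class(a))     # methods that handle data frames
```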

Simple Discrete Models

For those of you interested in learning a bit more about discrete models, below are some fantastic resources to help you on your journey:

  • Book: Murphy — Chapter 2 — Probability
  • Book: Murphy — Chapter 3 — Generative Models for Discrete Data
  • Book: Bishop — Chapter 2, Sections 2.1-2.2 — Probability Distributions
  • Book: MacKay — Chapter 2 — Probability, Entropy, and Inference
  • Book: MacKay — Chapter 3 — More About Inference
  • Book: MacKay — Chapter 23 — Useful Probability Distributions

 

Book reference for above:

Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, MIT Press, 2012.

Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006. An excellent and affordable book on machine learning, with a Bayesian focus. It covers fewer topics than the Murphy book, but goes into more depth on the topics it covers.

David J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.