R | Data Selection and Manipulation

The functions below give some background on selecting and manipulating data in R.

  • which.max(x) returns the index of the greatest element of x
  • which.min(x) returns the index of the smallest element of x
  • rev(x) reverses the elements of x
  • sort(x) sorts the elements of x in increasing order; to sort in decreasing order: rev(sort(x))
  • cut(x,breaks) divides x into intervals (factors); breaks is the number of cut intervals or a vector of cut points
  • match(x, y) returns a vector of the same length as x giving, for each element of x, the position of its first match in y (NA if there is none)
  • which(x == a) returns a vector of the indices of x if the comparison operation is true (TRUE), in this example the values of i for which x[i] == a (the argument of this function must be a variable of mode logical)
  • choose(n, k) computes the combinations of k events among n repetitions = n!/[(n−k)!k!]
  • na.omit(x) suppresses the observations with missing data (NA) (suppresses the corresponding line if x is a matrix or a data frame)
  • na.fail(x) returns an error message if x contains at least one NA
  • unique(x) if x is a vector or a data frame, returns a similar object but with the duplicate elements suppressed
  • table(x) returns a table with the counts of the different values of x (typically for integers or factors)
  • subset(x, …) returns a selection of x with respect to criteria (…, typically comparisons: x$V1 < 10); if x is a data frame, the option select gives the variables to be kept or dropped using a minus sign
  • sample(x, size) randomly draws size elements from the vector x without replacement; the option replace = TRUE samples with replacement
  • prop.table(x,margin=) table entries as fraction of marginal table
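
To make a few of the functions above concrete, here is a minimal sketch on made-up data. The vector x and the data frame df (with columns V1 and V2) are invented for illustration only.

```r
x <- c(4, 9, 2, 9, 7)

which.max(x)                # 2, the index of the first maximum
sort(x, decreasing = TRUE)  # same result as rev(sort(x))
match(c(9, 5), x)           # 2 NA: position of the first match in x, NA if absent
which(x == 9)               # 2 4: every index where the comparison is TRUE
unique(x)                   # duplicates removed
table(x)                    # counts per distinct value

# cut() with explicit break points; intervals are right-closed by default
groups <- cut(x, breaks = c(0, 3, 6, 9))

# subset() on a small made-up data frame: keep rows with V1 < 10, drop column V2
df <- data.frame(V1 = c(5, 12, 8), V2 = c("a", "b", "c"))
small <- subset(df, V1 < 10, select = -V2)

# prop.table() turns counts into proportions
props <- prop.table(table(x))
```

Note that sample() is left out on purpose: its result is random, so call set.seed() first if you need a reproducible draw.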

 

Functions for Manipulating Character Variables
nchar(x) a vector of the lengths of each value in x
paste(a,b,sep="_") concatenates character values, using sep between them
substr(x,start,stop) extracts characters from positions start to stop from x
strsplit(x,split) splits each value of x into a vector of strings using split as the delimiter; returns a list
grep(pattern,x) returns the indices of the elements of x that contain pattern; use value=TRUE to return the elements themselves
grepl(pattern,x) returns a logical vector indicating whether each element of x contains pattern
regexpr(pattern,x) returns the integer position of the first occurrence of pattern in each element of x (-1 if there is no match)
gsub(pattern,replacement,x) replaces each occurrence of pattern with replacement
tolower(x) converts x to all lower case
toupper(x) converts x to all upper case
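
A short sketch of the character functions in action. The vector s is made up for the example.

```r
s <- c("apple pie", "banana", "cherry tart")

nchar(s)                     # 9 6 11
paste("a", "b", sep = "_")   # "a_b"
substr(s, 1, 3)              # "app" "ban" "che"
strsplit(s, " ")             # a list with one character vector per element of s
grep("an", s)                # 2, the index of the matching element
grep("an", s, value = TRUE)  # "banana", the matching element itself
grepl("r", s)                # FALSE FALSE TRUE
regexpr("a", s)              # positions 1 2 9 (plus match-length attributes)
gsub("a", "A", s)            # "Apple pie" "bAnAnA" "cherry tArt"
toupper(s)
```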

 

Logical Operators
== is equal to
!= is not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
%in% is in the list
! not (reverses T & F)
& and
| or
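
The operators above work element by element on vectors. A small sketch, with x as a made-up vector:

```r
x <- c(1, 5, 10)

x == 5            # FALSE  TRUE FALSE
x != 5            #  TRUE FALSE  TRUE
x >= 5 & x < 10   # FALSE  TRUE FALSE
x < 2 | x > 8     #  TRUE FALSE  TRUE
!(x > 3)          #  TRUE FALSE FALSE
x %in% c(5, 10)   # FALSE  TRUE  TRUE
```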

 

R | Variable Information

If you want to know a little bit more about the variables you are working with try out these R commands.

  • is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x), … test for type; for a complete list, use methods(is)
  • length(x) number of elements in x
  • dim(x) Retrieve or set the dimension of an object; dim(x) <- c(3,2)
  • dimnames(x) Retrieve or set the dimension names of an object
  • nrow(x) number of rows; NROW(x) is the same but treats a vector as a one-column matrix
  • ncol(x) and NCOL(x) id. for columns
  • class(x) get or set the class of x; class(x) <- "myclass"
  • unclass(x) remove the class attribute of x
  • attr(x,which) get or set the attribute which of x
  • attributes(obj) get or set the list of attributes of obj
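
Here is a quick sketch of these inspection functions on two made-up objects, a matrix m and a named vector v.

```r
m <- matrix(1:6, nrow = 3)   # 3 x 2 integer matrix
v <- c(a = 1.5, b = 2.5)     # named numeric vector

dim(m)             # 3 2
nrow(m); ncol(m)   # 3 and 2
nrow(v)            # NULL, because a plain vector has no dim attribute
NROW(v)            # 2, the vector is treated as a one-column matrix
NCOL(v)            # 1
is.numeric(v)      # TRUE
attributes(v)      # a list holding the names attribute
dim(m) <- c(2, 3)  # setting dim reshapes m in place
```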

R | Variable Conversion

The “as.…” functions explicitly convert data to whatever type you want.

  • as.array(x)
  • as.data.frame(x)
  • as.numeric(x)
  • as.logical(x)
  • as.vector(x)
  • as.matrix(x)
  • as.complex(x)
  • as.character(x)
  • … convert type; for a complete list, use methods(as)

Common conversions between structures:

  • from vector: to one long vector c(x,y); to matrix cbind(x,y) or rbind(x,y); to data frame data.frame(x,y)
  • from matrix: to one long vector as.vector(mymatrix); to data frame as.data.frame(mymatrix)
  • from data frame: to matrix as.matrix(myframe)

If you are interested in testing data types, use “is.…” instead of “as.…”; these functions return TRUE or FALSE.
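
The difference between converting and testing can be sketched as follows. The objects x, y, m and df are made up for the example.

```r
as.numeric("3.14")                     # 3.14
as.integer(3.9)                        # 3, the fractional part is dropped
as.logical(c("TRUE", "false", "yes"))  # TRUE FALSE NA
as.character(1:3)                      # "1" "2" "3"

# Converting between structures
x <- 1:3; y <- 4:6
m  <- cbind(x, y)        # vectors -> matrix
df <- as.data.frame(m)   # matrix -> data frame
as.vector(m)             # matrix -> one long vector, column by column
as.matrix(df)            # data frame -> matrix

# as.* converts, is.* only tests
is.numeric("3.14")              # FALSE
is.numeric(as.numeric("3.14"))  # TRUE
```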

R | Slicing and Extracting Data

Indexing Vectors

  • x[n] nth element
  • x[-n] all but the nth element
  • x[1:n] first n elements
  • x[-(1:n)] elements from n+1 to the end
  • x[c(1,4,2)] specific elements
  • x["name"] element named "name"
  • x[x > 3] all elements greater than 3
  • x[x > 3 & x < 5] all elements between 3 and 5
  • x[x %in% c("a","and","the")] elements in the given set
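
A small sketch of vector indexing. The named vector x and the character vector words are made up for illustration.

```r
x <- c(a = 2, b = 4, c = 6, d = 8)

x[2]              # second element (b = 4)
x[-2]             # everything except the second element
x[1:2]            # first two elements
x[-(1:2)]         # elements from the third to the end
x[c(1, 4, 2)]     # specific elements, in the order requested
x["c"]            # element named "c"
x[x > 3]          # all elements greater than 3
x[x > 3 & x < 7]  # elements between 3 and 7

words <- c("a", "cat", "and", "the", "dog")
words[words %in% c("a", "and", "the")]  # "a" "and" "the"
```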

Indexing Lists

  • x[n] list with elements n
  • x[[n]] nth element of the list
  • x[["name"]] element of the list named "name"
  • x$name id.

Indexing Matrices

  • x[i,j] element at row i, column j
  • x[i,] row i
  • x[,j] column j
  • x[,c(1,3)] columns 1 and 3
  • x["name",] row named "name"

Indexing Data Frames

  • x[["name"]] column named "name"
  • x$name id.
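
The same ideas carry over to lists, matrices and data frames. A minimal sketch, where lst, m and df are made-up objects:

```r
lst <- list(name = "Ann", scores = c(90, 85))
lst[1]        # a list of length 1
lst[[2]]      # the numeric vector itself
lst$name      # same as lst[["name"]]

m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("A", "B", "C")))
m[2, 3]       # 6
m[1, ]        # row 1 as a named vector
m[, c(1, 3)]  # columns A and C
m["r2", ]     # row named "r2"

df <- data.frame(id = 1:3, label = c("x", "y", "z"))
df[["label"]] # column named "label"
df$id         # same idea with the $ shortcut
```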

R | Data Creation

Ever wonder how to generate data within R?

  • c(…) generic function to combine arguments with the default forming a vector; with recursive=TRUE descends through lists combining all elements into one vector
  • from:to generates a sequence; ":" has operator priority; 1:4 + 1 is "2,3,4,5"
  • seq(from,to) generates a sequence; by= specifies increment; length= specifies desired length
  • seq(along=x) generates 1, 2, …, length(x); useful for for loops
  • rep(x,times) replicate x times; use each= to repeat each element of x each times; rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
  • data.frame(…) create a data frame of the named or unnamed arguments; data.frame(v=1:4,ch=c("a","B","c","d"),n=10); shorter vectors are recycled to the length of the longest
  • list(…) create a list of the named or unnamed arguments; list(a=c(1,2),b="hi",c=3i)
  • array(x,dim=) array with data x; specify dimensions like dim=c(3,4,2); elements of x recycle if x is not long enough
  • matrix(x,nrow=,ncol=) matrix; elements of x recycle
  • factor(x,levels=) encodes a vector x as a factor
  • gl(n,k,length=n*k,labels=1:n) generate levels (factors) by specifying the pattern of their levels; n is the number of levels, and k is the number of replications
  • expand.grid() a data frame from all combinations of the supplied vectors or factors
  • rbind(…) combine arguments by rows for matrices, data frames, and others
  • cbind(…) id. by columns
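
Here is a short sketch that exercises most of the creation functions above. All values and names (df, arr, mat, f, g, grid) are made up for the example.

```r
1:4 + 1                      # 2 3 4 5, because ":" binds tighter than "+"
seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
seq(0, 1, length.out = 3)    # 0.0 0.5 1.0
seq_along(c("a", "b", "c"))  # 1 2 3
rep(c(1, 2, 3), times = 2)   # 1 2 3 1 2 3
rep(c(1, 2, 3), each = 2)    # 1 1 2 2 3 3

df   <- data.frame(v = 1:4, ch = c("a", "B", "c", "d"), n = 10)  # n is recycled to 4 rows
arr  <- array(1:24, dim = c(3, 4, 2))
mat  <- matrix(1:6, nrow = 2, ncol = 3)
f    <- factor(c("lo", "hi", "lo"), levels = c("lo", "hi"))
g    <- gl(2, 3, labels = c("ctl", "trt"))      # 2 levels, each repeated 3 times
grid <- expand.grid(x = 1:2, y = c("a", "b"))   # 4 rows, one per combination
```

seq_along(x) is the more common spelling of seq(along=x) in current code.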

R | Input and Output

Most of R's functions for input and output are documented in the online help.

  • load() load the datasets written with save
  • data(x) loads specified data sets
  • library(x) load add-on packages
  • read.table(file) reads a file in table format and creates a data frame from it; the default separator sep="" is any whitespace; use header=TRUE to read the first line as a header of column names; use as.is=TRUE to prevent character vectors from being converted to factors; use comment.char="" to prevent "#" from being interpreted as a comment; use skip=n to skip n lines before reading data; see the help for options on row naming, NA treatment, and others
  • read.csv("filename",header=TRUE) id. but with defaults set for reading comma-delimited files
  • read.delim("filename",header=TRUE) id. but with defaults set for reading tab-delimited files
  • read.fwf(file,widths,header=FALSE,sep="",as.is=FALSE) read a table of fixed width formatted data into a data frame; widths is an integer vector, giving the widths of the fixed-width fields
  • save(…,file=) saves the specified objects (…) in the XDR platform independent binary format
  • save.image(file) saves all objects
  • cat(…, file="", sep=" ") prints the arguments after coercing to character; sep is the character separator between arguments
  • print(a, …) prints its arguments; generic, meaning it can have different methods for different objects
  • format(x,…) format an R object for pretty printing
  • write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ") prints x after converting to a data frame; if quote is TRUE, character or factor columns are surrounded by quotes ("); sep is the field separator; eol is the end-of-line separator; na is the string for missing values; use col.names=NA to add a blank column header to get the column headers aligned correctly for spreadsheet input
  • sink(file) output to file, until sink()
  • Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection. file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables.
  • On Windows, the file connection can also be used with description = "clipboard". To read a table copied from Excel, use x <- read.delim("clipboard"). To write a table to the clipboard for Excel, use write.table(x,"clipboard",sep="\t",col.names=NA)
  • For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle.
  • See packages XML, hdf5, netCDF for reading other file formats.
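
A round trip is a good way to get a feel for these functions. The sketch below writes a made-up data frame to a temporary file and reads it back; the file paths come from tempfile(), so nothing here refers to a real file on your system.

```r
df <- data.frame(id = 1:3, score = c(9.5, 7, NA))

# Text round trip: write a comma-separated file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.table(df, file = csv_path, sep = ",", row.names = FALSE)
back <- read.csv(csv_path, header = TRUE)   # the NA comes back as a missing value

# Binary round trip: save() stores the object together with its name
rda_path <- tempfile(fileext = ".RData")
save(df, file = rda_path)
rm(df)
load(rda_path)                              # restores the object named df

cat("rows read:", nrow(back), "\n")
```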

 

R | Getting Help

Most R functions have online documentation.

 

  • help(topic) documentation on topic
  • ?topic id.
  • help.search("topic") search the help system
  • apropos("topic") the names of all objects in the search list matching the regular expression "topic"
  • help.start() start the HTML version of help
  • str(a) display the internal structure of an R object
  • summary(a) gives a “summary” of a, usually a statistical summary but it is generic meaning it has different operations for different classes of a
  • ls() show objects in the search path; specify pat="pat" to search on a pattern
  • ls.str() str() for each variable in the search path
  • dir() show files in the current directory
  • methods(a) shows S3 methods of a
  • methods(class=class(a)) lists all the methods to handle objects of class a
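
A few of these are easy to try without leaving the console. This sketch uses the built-in airquality data set; the search pattern is just an example.

```r
apropos("^read\\.")        # objects on the search list whose names start with "read."
str(airquality)            # compact structure of a built-in data frame
summary(airquality$Ozone)  # numeric summary, including the number of NAs
head(methods(summary))     # a few S3 methods of the generic summary()
```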

The Base Plotting System in R

When conducting data analysis, plotting is critically important. In R, plots are crafted by calling successive functions to build up a plot. Just like a house, you start with the foundation and progress one step at a time until the home is complete. A best practice when dealing with charts in R is to think in two phases: (1) creating a plot and (2) annotating the plot (adding lines, points, text, etc.). R's plotting system offers a high degree of flexibility and control over charts, which you will come to enjoy.

 

Plotting System

If you are trying to get to the core of the graphics engine in R, remember the following two packages:

  1. graphics: this includes plotting functions such as plot, hist, and boxplot
  2. grDevices: this includes the graphics devices such as PDF, PostScript, and PNG

The lattice plotting system is another important option, and it is implemented with these two packages:

  1. lattice: this includes the code for creating Trellis graphics using functions like xyplot, bwplot, and levelplot
  2. grid: lattice builds on top of grid, so you will seldom call functions from this package directly

The Process of Making a Plot

During this phase it is important to consider what you would like to accomplish by making a plot.

A few questions that you may want to think about before proceeding are:

  1. Where should I make the plot? (on the screen?, in a file?, etc)
  2. How is the plot going to be used?
  3. Is it just for me to conduct exploratory data analysis (temporary)?
  4. Will this be going to a browser online?
  5. Will this end up in a publication of sorts?
  6. Is this going to be in a presentation?
  7. Is it going to be just a few points of data or a large amount of data?
  8. Will I need to have a dynamic graphic?
  9. What graphic package should I aim to use (base, lattice, or ggplot2)?
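
The first question, where the plot should go, comes down to which graphics device is open. As a minimal sketch, here is how a plot can be sent to a PDF file instead of the screen. The output path is hypothetical and built from tempdir() so the example does not touch your working directory.

```r
library(datasets)

out_file <- file.path(tempdir(), "breaks-hist.pdf")  # hypothetical output path

pdf(out_file, width = 6, height = 4)  # open a file device (size in inches)
hist(warpbreaks$breaks, main = "Warp breaks", xlab = "Breaks per loom")
dev.off()                             # close the device to finish writing the file

file.exists(out_file)
```

Other devices from grDevices, such as png(), follow the same open, draw, dev.off() pattern.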

It is important to note that base graphics are generally constructed in a modular fashion. This means that each part is built one at a time using a series of function calls. Many data scientists like this approach as it mirrors the way we think.

Alternatively the lattice package requires that you define all parameters upfront which allows for lattice to calculate the appropriate spacing and font sizes.

ggplot2 is a fine package that borrows ideas from both base and lattice; however, it uses an independent implementation, so we will not cover it in this post.

Base Graphics

If you are interested in creating 2-D graphics, then you should use the base graphics system.

This is a two-step process:

  1. Initialize a new plot
  2. Add to an existing plot

Calling plot(x, y) or hist(x) launches a graphics device (if one is not already open) and draws a new plot. If the arguments to plot() are not of some special class, the default plot method is used. It has many arguments that let you change things like the title, x-axis label, y-axis label, etc. To see which graphics parameters can be changed, type ?par to open the help page.

Simple Base Graphics: Histogram

 

library(datasets)
hist(warpbreaks$breaks) ## Draw a new plot

Base Graphics: Histogram

 

Simple Base Graphics: Scatterplot

 

library(datasets)
with(ChickWeight, plot(weight, Chick))

Scatterplot

Simple Base Graphics: Boxplot

 

library(datasets)
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone (ppb)")

Boxplot

Some Important Base Graphics Parameters

 

Parameter Name Definition
col: the plotting color (review the colors() function)
lty: the line type (solid line by default)
lwd: the line width
pch: the plotting symbol (open circle by default)
xlab: characters for x-axis label
ylab: characters for y-axis label

It is worthwhile to investigate the par() function. It sets the global graphics parameters, which affect all the plots in an R session. Many of them can be overridden for a single plot by passing them as arguments to the plotting function. Commonly used global parameters include:

 

Parameter Name Definition
las: how axis labels are oriented on the plot
bg: background color
mar: size of margin
oma: outer margin size
mfrow: number of plots per row and column, given as c(rows, columns) (plots are filled row-wise)
mfcol: the same, but plots are filled column-wise

Default values for global graphic parameters:

par("lty")
[1] "solid"
par("col")
[1] "black"
par("pch")
[1] 1
par("bg")
[1] "white"
par("mar")
[1] 5.1 4.1 4.1 2.1
par("mfrow")
[1] 1 1
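
Because par() changes settings for the whole session, a common habit is to save the old values and restore them afterwards. A minimal sketch using the built-in airquality data; the particular parameter values are arbitrary.

```r
library(datasets)

# par() returns the previous values of whatever it changes
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1)

hist(airquality$Ozone, main = "Ozone", xlab = "ppb")
plot(airquality$Wind, airquality$Ozone, pch = 20, col = "blue",
     xlab = "Wind (mph)", ylab = "Ozone (ppb)")

par(old)  # restore the saved settings
```

Note that col and pch are passed straight to plot() here, which overrides the global defaults for that one call only.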

 

Base Plotting Functions

 

Function Name Definition
plot: makes a scatterplot
lines: adds a line to a plot
points: adds points to a plot
text: add text labels to a plot
title: title and subtitle labels
mtext: adds text to margins of the plot
axis: add axis ticks/labels
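
To see how these functions stack up on one another, here is a small sketch that starts from an empty canvas and adds one layer per call. The data (x and y) and the label positions are made up.

```r
x <- 1:10
y <- x^2

# type = "n" sets up the plot region without drawing; axes = FALSE leaves the axes to us
plot(x, y, type = "n", axes = FALSE, xlab = "x", ylab = "x squared")
points(x, y, pch = 19)                       # add the points
lines(x, y, lty = 2)                         # connect them with a dashed line
text(3, 60, "y = x^2")                       # label inside the plot region
axis(1, at = c(1, 5, 10))                    # custom x-axis ticks
axis(2)                                      # default y-axis
title(main = "Building a plot step by step")
mtext("made-up data", side = 3, line = 0.3)  # text in the top margin
```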

Base Plot with Annotation

 

library(datasets)
with(ChickWeight, plot(weight, Chick))
title(main="Chicks and Weight in Nashville") ## Add a title

Annotated Scatterplot

with(ChickWeight, plot(weight, Chick, main = "Chicks and Weight in Nashville"))
with(subset(ChickWeight, Diet == 4), points(weight, Chick, col = "blue"))
Annotated with Color
with(ChickWeight, plot(weight, Chick, main = "Chicks and Weight in Nashville", type = "n"))
with(subset(ChickWeight, Diet == 4), points(weight, Chick, col = "blue"))
with(subset(ChickWeight, Diet != 4), points(weight, Chick, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("Not Normal","Normal"))
Multi-color Annotation

Base Plot with Regression Line

with(ChickWeight, plot(weight, Chick, main = "Chicks and Weight in Nashville", pch = 20))
## Chick is an ordered factor; regress on its numeric level codes
model <- lm(as.numeric(Chick) ~ weight, ChickWeight)
abline(model, lwd = 2)

Linear Model

Multiple Base Plots

 

par(mfrow = c(1, 2)) ## Draw two plots side by side
with(ChickWeight, {plot(weight, Chick, main = "Chicks and Weight")
plot(Diet, weight, main = "Weight and Diet")})

Multiplot

R Objects: Data Types

R has five basic (atomic) classes of objects:

  1. Character
  2. Numeric
  3. Integer
  4. Complex
  5. Logical

The most basic object is a vector. There are two things to remember when dealing with vectors:

  1. A vector can only contain objects of the same class.
  2. The exception is a list. A list looks like a vector, but its elements can have different classes.

To create an empty vector use the following function:

vector()

Vectors can be created or reordered by using the following:

c() # concatenates individual values together
: # creates a sequence, such as 1:10
seq() # creates more complex sequences
rep() # replicates values
sort() # returns the elements of a vector in sorted order
order() # returns the indices that would sort a vector

 

An example of using rep()

rep(5,2) #a vector of two fives
[1] 5 5

 

An example of using c()

c(3,2,1) # vector of three numeric elements in that order
[1] 3 2 1

 

An example of using seq()

seq(4,20, by = 2)
[1]  4  6  8 10 12 14 16 18 20
seq(1,length = 20, by =4)
 [1]  1  5  9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77
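
An example of using sort() and order()

The two are easy to confuse, so here is a small sketch on a made-up vector v: sort() returns the values, while order() returns the positions you would index with.

```r
v <- c(30, 10, 20)

sort(v)      # 10 20 30
order(v)     # 2 3 1, the indices that would sort v
v[order(v)]  # 10 20 30, same as sort(v)
```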

R: Numbers

In general, numbers in R are treated as numeric objects.

For example,

 3 # numeric object
[1] 3
 3L # explicitly gives an integer
[1] 3
 Inf # a special number which represents infinity
[1] Inf
 1/0
[1] Inf
 1/Inf # can be used in calculations
[1] 0
 0/0 # NaN ("not a number"); also treated as a missing value
[1] NaN

 

Decimal values are also numeric in R. This happens by default, so if you assign a decimal value to x, it will be of the numeric type.

 x = 8.3 # create x with a decimal value
 x # print the value of x
[1] 8.3
 class(x) # what is the class of x?
[1] "numeric"

 

Even when a whole number is assigned to a variable such as N, it is still stored as a numeric value.

 N = 43
 N #print the value of N
[1] 43
 class(N) # what is the class of N?
[1] "numeric"

 

You can further confirm that N is not an integer by using the is.integer function.

 is.integer(N) # is N an integer?
[1] FALSE
 is.numeric(N) # is N numeric?
[1] TRUE
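
If you do want a true integer, there are two usual routes: the L suffix or an explicit conversion. A minimal sketch, where M and K are names made up for the example:

```r
N <- 43             # numeric (double) by default
M <- 43L            # the L suffix creates an integer
K <- as.integer(N)  # explicit conversion

class(M)            # "integer"
is.integer(K)       # TRUE
is.numeric(M)       # TRUE, integers also count as numeric
class(1:3)          # "integer", sequences made with ":" are integers
class(M + 0.5)      # "numeric", mixing with a decimal promotes the result
```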