Chapter 5 Getting Started with R
If you are completely new to all things R, this chapter is meant for you.
If you have a background in computer programming languages or software such as Python, Stata®, SAS®, or Matlab®, you may notice many familiar concepts and terminology such as functions, variables, and operators in the example R code recipes referenced in this book.
R is a free, open source statistical programming language that is powerful, flexible, and evolving.
R, which has grown significantly in popularity, is an interactive and object-oriented programming language that offers a variety of data structures, graphical capabilities, functions, packages, documentation, and community support. In addition, it is an evolving ecosystem that can effectively handle different data types and perform complex analysis on individual and distributed computer systems, which are important capabilities to consider when developing data analytics solutions of any size or scale.
5.2 Download R (Required)
R is compatible with Windows™, macOS, and a variety of Unix systems.
The latest version of R is available for download via the Comprehensive R Archive Network (CRAN).
All of the R code in this book has been tested to work with R version 3.4.3 (2017-11-30). Please check your existing R installation and upgrade to the latest version if needed.
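One way to check your installed version from the R console is sketched below. The 3.4.3 value comes from the text above; the variable name tested_version is only illustrative.

```r
# Print the version string of the running R installation
R.version.string

# Compare the running version against the version the book's code was tested with
tested_version <- "3.4.3"  # illustrative name; value taken from the text above
getRversion() >= tested_version
```

If the comparison returns FALSE, your installation is older than the tested version and is worth upgrading.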
You can download the R code files and data by visiting http://www.nandeshwar.info/ds4fundraisingcode.
5.3 Install RStudio (Optional)
RStudio is an integrated development environment (IDE), which includes a code editor, debugger, and visualization tools that make R more user friendly.
RStudio Desktop (Open Source Edition) is available for free download via the following links:
5.4 Install Packages
R is a popular programming language that benefits from community-driven support and ongoing enhancements.
R packages, which you may have already heard about, are bundles of reusable R functions, support documentation, and sample data (if included). As of writing this book, there are 12,106 R packages available to download, install, and use. The fact that there are over 12,000 freely available add-on packages speaks to the flexibility of the R language and the robust commitment of the R user community. The potential data analytics solutions you can develop using these packages are perhaps limited only by your curiosity, creativity, and willingness to learn.
We assume you’ve already installed R on your computer, so now it’s time to get your feet wet and download two popular R packages, dplyr and ggplot2, using the install.packages function to familiarize yourself with the R package installation process.
To run the following code, copy and paste each line into your R console window and press the Enter key. Alternatively, you can copy and paste these commands into a new R script by selecting File > New File > R Script within RStudio.
# Install dplyr package
install.packages("dplyr", repos='http://cran.us.r-project.org')
# Install ggplot2 package
install.packages("ggplot2", repos='http://cran.us.r-project.org')
Voila! You successfully ran your first R code, which downloaded and installed two popular R packages for data manipulation and visualization. We’ll cover these tools later in greater detail.
These lines of R code contain two install.packages commands, each of which is preceded by a comment line indicated by the # symbol. The # symbol marks a comment: R ignores everything from the # to the end of the line. As a good programming practice, comment your code liberally using the # symbol so that you can later reference, check, test, and update your code as needed.
If you create a new R script, you can also highlight all four lines of code in your script with your mouse cursor and then manually select Code > Run Selected Line(s) from the RStudio menu. Alternatively, you can use the keyboard shortcut Command + Enter on a Mac or Control + Enter on Windows or Linux to run these lines of code.
For a full list of RStudio keyboard shortcuts, please refer to RStudio’s knowledge base.
Now that you’ve installed both R packages, let’s load these packages and make them available for use on your system using the library("package name") function.
# Load dplyr package
library("dplyr")
# Load ggplot2 package
library("ggplot2")
To see all of the R packages installed on your system, call the library function without any arguments (that is, inputs) or package names.
# List all packages installed
library()
In the library function output, you should see both the dplyr and ggplot2 packages listed alphabetically, along with the following brief package descriptions.
dplyr: A Grammar of Data Manipulation
ggplot2: Create Elegant Data Visualizations Using the Grammar of Graphics
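If you would rather check for specific packages than scan the full listing, a minimal sketch using the base installed.packages function looks like this. The variable name installed is only illustrative.

```r
# Names of every package installed in the current library paths
installed <- rownames(installed.packages())

# TRUE or FALSE for each package, depending on whether it is installed
c("dplyr", "ggplot2") %in% installed
```

Each TRUE in the result means the corresponding package is already installed and can be loaded with library.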
You just completed an R package installation process using repeatable and reusable R code, which downloaded, installed, and loaded R packages on your computer.
5.5 Learning R
Because R is a powerful statistical modeling and programming environment, it can take some time to get comfortable using R, especially if you don’t have any background in statistics or computer programming. For users with minimal experience in writing code, we encourage you to be patient while you get the hang of working with R. The benefits (flexibility, extensibility, and speed, just to name a few) are well worth the time and effort to overcome the initial learning curve associated with R.
Here are some tips for learning R:
- Do: Many people learn R best through hands-on learning and directly entering R commands within the R console.
- Review: Check out code samples and retype the commands you find in this book and beyond.
- Experiment: Try modifying R commands and running the code to see what happens, so that you develop a better sense of how it works.
- Research: You will encounter errors in R. R has excellent error messages that (usually) offer useful diagnostic information to help you figure out the root cause of the issue.
5.6 R Console
Assuming you’ve already installed R on your computer, the first thing you will encounter when you launch R is the R console window and the command prompt >, which indicates R is ready for your instructions.
As previously mentioned, R is an interactive programming environment, so let’s use R as a calculator and enter some basic arithmetic operators to explore what it can do.
# Addition
1+8
#> [1] 9
# Subtraction
1-7
#> [1] -6
# Division
1/7
#> [1] 0.143
# Multiplication
1*7
#> [1] 7
# Exponentiation
2^3
#> [1] 8
# Order of Operations
1+2*3
#> [1] 7
After you enter each command into the R command prompt, each result will be interactively displayed in the R console as shown in Figure 5.2.
If you’ve installed RStudio, the R Console command prompt and interactive output will be displayed at the bottom of your RStudio session window.
5.7 Built-in Functions
R has many built-in functions, which are reusable expressions that involve zero or more variables.
# Logarithm
log(x = 100)
#> [1] 4.61
# Square Root
sqrt(x = 16)
#> [1] 4
# Round
round(x = 8.3)
#> [1] 8
These variables are arguments (inputs or parameters) that are passed to functions in order to perform various types of calculations. For example, the sqrt function takes a single argument, x. We used 16 as our x and the function returned its square root, 4.
Functions can also take more than one parameter, separated by commas.
In the previous example, the round function took the number 8.3 and rounded it to the closest integer, which is 8. However, if we pass the round function a number such as pi (3.141592…), we can instruct R to round pi to the nearest hundredth by passing the additional parameter digits with a value of 2.
# Round
round(x = 3.141592, digits = 2)
#> [1] 3.14
The base installation of R includes several built-in constants, one of which is pi:
LETTERS: The 26 upper-case letters of the Roman alphabet
letters: The 26 lower-case letters of the Roman alphabet
month.abb: The three-letter abbreviations for the English month names
month.name: The English names for the months of the year
pi: The ratio of the circumference of a circle to its diameter
Rather than manually typing the value of pi in the previous example, you could have also used the built-in constant pi.
# Round
round(x = pi, digits = 2)
#> [1] 3.14
If you want additional information about a function and its parameters, the base R installation comes with useful help pages with function descriptions, usage, arguments, details, and examples.
One way to open a help page is the help function. Another way to learn about a function is to run the examples from its help page by entering example(round) in your console.
To learn more about the round function and its usage details, try entering either of the following commands in your R console.
# ? Operator Help
?round
# Help Function
help(round)
To learn more about built-in constants in the base R namespace, try entering either of the following commands.
# ? Operator Help
?Constants
# Help Function
help(Constants)
R also allows you to write your own functions. If you are curious or are already comfortable using built-in functions, we encourage you to explore and try creating your own custom functions. For additional details, you can check out this article.
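As a minimal sketch of what a custom function can look like, the following defines a hypothetical helper named average_gift. The function name, its parameters, and the input values are illustrative assumptions, not part of base R or this book’s data.

```r
# Hypothetical helper: average gift size from total giving and number of gifts
average_gift <- function(total_giving, gift_count) {
  total_giving / gift_count
}

# Call the function with named arguments, just like a built-in function
average_gift(total_giving = 1500, gift_count = 6)
#> [1] 250
```

The value of the last expression in the function body is what the function returns, so no explicit return call is needed here.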
5.8 Variables
Variables allow you to store data in a named object, whose values can later be retrieved and changed as needed. To create a variable in R, use the assignment operator “<-” to assign data to a variable name.
For example, suppose we wanted to store the value of the square root calculation for later use. Here’s a code snippet that stores the calculation in a variable.
# Calculate square root and assign to "sqroot" variable
sqroot <- sqrt(16)
# Print "sqroot" value
sqroot
#> [1] 4
In this example, you will note that we selected sqroot as the variable name to avoid a naming conflict with the sqrt function. To further extend this example, suppose we needed to regularly update the sqrt function input value instead of hard-coding the value “16”. We can modify the code to use another variable for the input parameter.
# Square Root Function Input (Parameter)
input <- 16
# Calculate square root and assign to "sqroot" variable
sqroot <- sqrt(input)
# Print "sqroot" value
sqroot
#> [1] 4
5.9 Conditional Logic
R provides a variety of logical operators that return a value of TRUE or FALSE.
# Less Than
1 < 2
#> [1] TRUE
# Less Than or Equal To
2 <= 2
#> [1] TRUE
# Greater Than
1 > 2
#> [1] FALSE
# Greater Than or Equal To
2 >= 2
#> [1] TRUE
# Exactly Equal To
2 == 2
#> [1] TRUE
# Not Equal To
1 != 1
#> [1] FALSE
# Not X
X <- TRUE
!X
#> [1] FALSE
# X OR Y
X <- FALSE
Y <- TRUE
X | Y
#> [1] TRUE
# X AND Y
X <- FALSE
Y <- TRUE
X & Y
#> [1] FALSE
# Test whether value of X is TRUE
X <- FALSE
isTRUE(X)
#> [1] FALSE
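Logical values like these are what drive conditional logic in R. The following is a minimal sketch of an if/else statement that branches on a comparison. The variable names and the 10000 cutoff are assumptions made up for illustration.

```r
# Illustrative values (not taken from the book's data)
lifetime_giving <- 14225
major_gift_threshold <- 10000  # assumed cutoff for this sketch

# The comparison returns TRUE or FALSE, which decides which branch runs
if (lifetime_giving >= major_gift_threshold) {
  donor_tier <- "major"
} else {
  donor_tier <- "annual"
}
donor_tier
#> [1] "major"
```

Because 14225 >= 10000 evaluates to TRUE, the first branch runs and donor_tier is set to "major".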
5.10 Data Types
Everything in R is an object. R offers a variety of data types such as scalars, vectors, matrices, data frames, and lists.
A vector is an ordered collection of atomic (integer, numeric, character, or logical) values. Vectors are one of the most common and basic data structures in R, so it is useful to familiarize yourself with them.
Vectors can be one of two different types: (1) atomic vectors and (2) lists.
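The practical difference between the two shows up when you mix data types. The sketch below uses made-up values and the illustrative names mixed and kept to show that an atomic vector coerces its elements to a single type, while a list keeps each element as it is.

```r
# An atomic vector holds one type; mixed inputs are coerced to a common type
mixed <- c(1, "a", TRUE)
class(mixed)
#> [1] "character"

# A list keeps each element's original type
kept <- list(1, "a", TRUE)
sapply(kept, class)
#> [1] "numeric"   "character" "logical"
```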
You can manually create a vector by using the c (combine) function to combine a collection of data values. For example, suppose we needed to create a vector of donor ages and store it in a variable called donor_age.
# Create donor_age vector
donor_age <- c(28, 32, 77, 57, 52, 41, 42, 49)
We can use the c function again to add additional elements to donor_age if needed.
# Update donor_age with additional donor age values
donor_age <- c(donor_age, 72, 68)
You can also create vectors as a sequence of numbers using the seq function or using the “:” operator.
seq(from = 1, to = 10)
#>  [1]  1  2  3  4  5  6  7  8  9 10
series <- 1:10
series
#>  [1]  1  2  3  4  5  6  7  8  9 10
# check whether they give same results
identical(x = seq(1, 10), y = series)
#> [1] TRUE
Matrices are a special type of atomic (integer, numeric, character, or logical) vector with dimensional attributes (rows and columns). By default, matrices are filled column wise.
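To make the column-wise fill concrete, here is a small sketch using the base matrix function with illustrative values. The byrow argument switches to filling row by row.

```r
# Fill a 2 x 3 matrix with the values 1 to 6 (column-wise by default)
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

# Setting byrow = TRUE fills the matrix row by row instead
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
```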
A list is a special vector type where elements are not restricted to a single data type. Because the contents of a list can include a mixture of data types, lists are flexible data structures and sometimes referred to as generic vectors.
To create a list, use the list function.
# Create a donor profile list from values of different types
donor_name <- "John Smith"
donor_age <- 58
donor_city <- "San Francisco"
donor_lifetimegiving <- 14225
donor_profile <- list(donor_name, donor_age, donor_city, donor_lifetimegiving)
donor_profile
#> [[1]]
#> [1] "John Smith"
#>
#> [[2]]
#> [1] 58
#>
#> [[3]]
#> [1] "San Francisco"
#>
#> [[4]]
#> [1] 14225
Factors are vectors used to represent categorical data labels.
Factors can be ordered or unordered and are especially useful when organizing and working with categorical data due to their speed and efficiency. Although factors look like character vectors, they are actually stored internally within
R as integers, so you need to be careful when treating them like characters to avoid running into errors. It is also important to note that factors can only contain pre-defined label values, also known as levels.
donor_ind <- factor(c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes"))
donor_ind
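To see the integer storage described above, you can compare a factor’s levels with its underlying codes. This sketch reuses the same donor_ind values and relies only on the base levels and as.integer functions.

```r
donor_ind <- factor(c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes"))

# The allowed labels (levels), sorted alphabetically by default
levels(donor_ind)
#> [1] "no"  "yes"

# The integer codes stored internally: 1 = "no", 2 = "yes"
as.integer(donor_ind)
#>  [1] 1 1 2 2 2 1 1 2 2 2
```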
Let’s use the table function to create a one-way frequency table that shows the count of donors versus non-donors using the donor indicator variable donor_ind we just created.
donor_ind <- factor(c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes"))
table(donor_ind)
#> donor_ind
#>  no yes
#>   4   6
5.16 Data Frame
A data frame is a special kind of list where each element has the same length. Data frames are important in
R because they are used frequently for storing tabular data for analysis.
In addition to length, data frames have additional attributes, such as rownames, which can be used to organize and annotate data labels.
Let’s create a data frame using the donor_age and donor_ind vectors we just created.
donor_age <- c(28, 32, 77, 57, 52, 41, 42, 49, 72, 68)
donor_ind <- factor(c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes"))
dd <- data.frame(donor_age, donor_ind)
dd
#>    donor_age donor_ind
#> 1         28        no
#> 2         32        no
#> 3         77       yes
#> 4         57       yes
#> 5         52       yes
#> 6         41        no
#> 7         42        no
#> 8         49       yes
#> 9         72       yes
#> 10        68       yes
Let’s use the table function to display a frequency table of the dd data frame, which cross-tabulates donor age against the donor indicator.
table(dd)
#>          donor_ind
#> donor_age no yes
#>        28  1   0
#>        32  1   0
#>        41  1   0
#>        42  1   0
#>        49  0   1
#>        52  0   1
#>        57  0   1
#>        68  0   1
#>        72  0   1
#>        77  0   1
5.17 Data Types
R provides several functions to examine the features of various data types such as:
class: What kind of data object?
typeof: What kind of data storage type?
length: What is the length of the data object?
attributes: What kind of metadata?
str: What kind of data object and internal structure?
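As a short sketch, these functions can be applied to the dd data frame and donor_ind factor built earlier in this chapter. The objects are recreated here so the example runs on its own.

```r
donor_age <- c(28, 32, 77, 57, 52, 41, 42, 49, 72, 68)
donor_ind <- factor(c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes"))
dd <- data.frame(donor_age, donor_ind)

class(dd)          # kind of data object
#> [1] "data.frame"
typeof(dd)         # storage type: a data frame is stored as a list
#> [1] "list"
length(dd)         # for a data frame, the number of columns
#> [1] 2
class(donor_ind)   # a factor...
#> [1] "factor"
typeof(donor_ind)  # ...is stored internally as integers
#> [1] "integer"

# Metadata (names, class, row names) and a compact structural summary
attributes(dd)
str(dd)
```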
5.18 Additional Support
We encourage you to start where you are and embrace the learning curve you inevitably encounter when learning any type of new language, whether computer or human.
For reference, the following is a link to R manuals provided by the R Development Core Team as a learning resource.
The following is a list of R community support sites with knowledgeable and helpful R user forums, which can be a useful resource when you encounter questions or run into a technical hurdle.
# Install dplyr package
#install.packages("dplyr")
# Install ggplot2 package
#install.packages("ggplot2")
# Install tidyverse
#install.packages("tidyverse")
# Load dplyr package
library("dplyr")
# Load ggplot2 package
library("ggplot2")