Chapter 6 Loading Data

Before we create visualizations, build models, and extract actionable insights, we need to load data into R.

Data management (which includes finding, loading, cleaning, and manipulating data) plays an integral role in data analytics.

Similar to cooking a meal, you can consider data management the preparation required to set the stage for your analysis. Just like peeling potatoes can be repetitive and tedious, loading and preparing data for analysis isn’t usually the most exciting part of the job. Nevertheless, it is important and must be done to lay the groundwork for buttery mashed potatoes or, in your case, delicious data discovery.

First, let’s load a sample data set using the built-in R function read.csv.

6.1 Read CSV Files

read.csv is part of the read.table utilities built into R. read.csv reads a comma-separated values (CSV) file, which stores tabular data (numbers and text) in plain text, into a data frame. Each row (or line) of the file is a data record. Each record consists of one or more fields (columns or features) separated by commas.

# Load CSV file
data <- read.csv(file = "data/DonorSampleData.csv", 
                 header = TRUE, sep = ',')

Here we are loading the CSV file using the read.csv function and three arguments (parameters):

  • file: The file name along with its path
  • header: A logical value indicating whether the file contains the names of the variables as its first line. In this case, we are telling R that the file does contain variable names in its first row.
  • sep: A character value that R uses to separate fields.

read.csv loads the DonorSampleData.csv file into a data frame variable called data using the <- assignment operator.

For more details about the read.csv function, enter the following command into your R console:

# Display read.csv() help
?read.csv

6.2 Read Delimited CSV and TSV Files

You will soon discover there’s usually more than one way to accomplish the same task in R. Another popular package for reading flat, delimited files into R is called readr, which includes the read_csv and read_tsv functions, which read comma-separated and tab-separated value files, respectively. These functions do a better job at guessing input file data types and, as an added benefit, they trim excessive whitespace from character data.

# Load readr package
library(readr)

# Load CSV file
data <- read_csv("data/DonorSampleData.csv")

6.3 List Session Variables

# Show variables currently loaded in R session
ls()
#> [1] "data"

The output of [1] "data" confirms that your CSV file has been loaded into the data frame variable called “data.”

6.4 List Data Frame Variable Names

Let’s take a look at the field (columns or features) inside the data object we created by loading our CSV file into R.

# List variables stored in "data" object
names(data)
#>  [1] "ID"                  "ZIPCODE"            
#>  [3] "AGE"                 "MARITAL_STATUS"     
#>  [5] "GENDER"              "MEMBERSHIP_IND"     
#>  [7] "ALUMNUS_IND"         "PARENT_IND"         
#>  [9] "HAS_INVOLVEMENT_IND" "WEALTH_RATING"      
#> [11] "DEGREE_LEVEL"        "PREF_ADDRESS_TYPE"  
#> [13] "EMAIL_PRESENT_IND"   "CON_YEARS"          
#> [15] "PrevFYGiving"        "PrevFY1Giving"      
#> [17] "PrevFY2Giving"       "PrevFY3Giving"      
#> [19] "PrevFY4Giving"       "CurrFYGiving"       
#> [21] "TotalGiving"         "DONOR_IND"

From the names function output, we can quickly identify there are 22 different columns or features to this data set.

6.5 Inspect Data Object

Let’s inspect our data frame object in greater detail.

# Inspect the data object
str(data)

str is a very useful function that reports a wealth of information. The str output tells us we have 34,508 observations of 22 variables and provides a description of each variable, data type, and so on.

6.6 Read Excel files

In addition to CSV files, you will often find data in Excel spreadsheet formats (“.xls” and “.xlsx”).

To read Excel files directly into R, you can use the readxl function that comes bundled with the popular tidyverse package.

The following is code to download and load the tidverse package so you can use the readxl function to read an Excel file directly into R.

# Install tidyverse package
install.packages("tidyverse", 
                 repos = "http://cran.us.r-project.org")
# Load tidyverse package
library(tidyverse)

# Load readxl package
library(readxl)

# Load Excel file into "excel" data frame 
excel <- read_xlsx("data/DonorSampleData.xlsx")
For practice, try using the str function on the excel data frame to examine its structure. Does the excel object have the same number of observations and variables as the CSV file?

6.7 Read from a Database

Sometimes you need to pull data from a database rather than a CSV file or Excel spreadsheet. There are many different kinds of databases such as SQL, SQLite, and Oracle™. It is best to contact your database administrator to confirm your organization’s database details.

In the following example, we will cover how to connect to a SQLite file and introduce you to the general process of how to connect to a database to retrieve data that meets your specific criteria or needs.


# Install DBI
install.packages("DBI", 
                 repos = "http://cran.us.r-project.org")

# Install RSQLite
install.packages("RSQlite",
                 repos = "http://cran.us.r-project.org")
# Load DBI package
library(DBI)

# Load RSQLite package
library(RSQLite)

# Connect to SQLite Database File 
con <- dbConnect(SQLite(), 
                 dbname = "Data/DonorSampleDataSQL.db")

# Build Database Query to Pull Alumni Who Are Also Parents
query <- dbSendQuery(con, 
                     "SELECT * FROM data 
                     WHERE ALUMNUS_IND = 'Y' 
                     AND PARENT_IND = 'Y'")

# Fetch Query
db_data <- dbFetch(query)

# Display 10 prospects who are both parent and alumni
head(db_data, n = 10)

# Clear Query from Memory
dbClearResult(query)

The dbConnect function will let you connect you to the databases for which you have drivers installed on your system. A typical database management system is Microsoft’s SQL Server® (MSSQL). To connect to MSSQL or other databases, follow the syntax of the dbConnect function. You will need to install the odbc package for this code to work. You will also need the ODBC drivers for SQL Server if they are not already installed. On a Windows™ computer, the drivers are already installed, but on a MacOS, you will need to install them manually as well as set up some additional files (not fun or straightforward). RStudio, the maker of the odbc package, provided guidelines on how to install these drivers on its site. The connection to MSSQL may look like:

library(odbc)
con <- dbConnect(odbc::odbc(),
                      driver = "SQL Server",
                      server = <your_server_ip_or_address>,
                      database = <your_database_name>,
                      uid = <your_username>,
                      pwd = <your_password>)

Once connected, you can list all the tables in your database using this command:

dbListTables(con)

In this recipe, we connected to a SQLite database file (modeled on DonorSampleData.csv), built a database query to select prospects who are both alumni and parent constituent type, and displayed 10 sample prospects. We also showed a way to connect to Microsoft SQL Server. To get into the habit of keeping a tidy R workspace, we explicitly cleared the database query from computer memory as best practice. Cleaning up your R programming environment will become increasingly important as you work on larger projects with varied information sources and complex queries across multiple database systems, tables, and so on.