Chapter 6 Loading Data
Before we create visualizations, build models, and extract actionable insights, we need to load data into
Data management (which includes finding, loading, cleaning, and manipulating data) plays an integral role in data analytics.
Similar to cooking a meal, you can consider data management the preparation required to set the stage for your analysis. Just like peeling potatoes can be repetitive and tedious, loading and preparing data for analysis isn’t usually the most exciting part of the job. Nevertheless, it is important and must be done to lay the groundwork for buttery mashed potatoes or, in your case, delicious data discovery.
First, let’s load a sample data set using the built-in
6.1 Read CSV Files
read.csv is part of the
read.table utilities built into
read.csv reads a comma-separated values (CSV) file, which stores tabular data (numbers and text) in plain text, into a data frame. Each row (or line) of the file is a data record. Each record consists of one or more fields (columns or features) separated by commas.
# Load CSV file data <- read.csv(file = "data/DonorSampleData.csv", header = TRUE, sep = ',')
Here we are loading the CSV file using the
read.csv function and three arguments (parameters):
file: The file name along with its path
header: A logical value indicating whether the file contains the names of the variables as its first line. In this case, we are telling
Rthat the file does contain variable names in its first row.
sep: A character value that
Ruses to separate fields.
read.csv loads the
DonorSampleData.csv file into a data frame variable called
data using the
<- assignment operator.
For more details about the
read.csv function, enter the following command into your
# Display read.csv() help ?read.csv
6.2 Read Delimited CSV and TSV Files
You will soon discover there’s usually more than one way to accomplish the same task in
R. Another popular package for reading flat, delimited files into
R is called
readr, which includes the
read_tsv functions, which read comma-separated and tab-separated value files, respectively. These functions do a better job at guessing input file data types and, as an added benefit, they trim excessive whitespace from character data.
# Load readr package library(readr) # Load CSV file data <- read_csv("data/DonorSampleData.csv")
6.3 List Session Variables
# Show variables currently loaded in R session ls() #>  "data"
The output of
 "data" confirms that your CSV file has been loaded into the data frame variable called “data.”
6.4 List Data Frame Variable Names
Let’s take a look at the field (columns or features) inside the
data object we created by loading our CSV file into
# List variables stored in "data" object names(data) #>  "ID" "ZIPCODE" #>  "AGE" "MARITAL_STATUS" #>  "GENDER" "MEMBERSHIP_IND" #>  "ALUMNUS_IND" "PARENT_IND" #>  "HAS_INVOLVEMENT_IND" "WEALTH_RATING" #>  "DEGREE_LEVEL" "PREF_ADDRESS_TYPE" #>  "EMAIL_PRESENT_IND" "CON_YEARS" #>  "PrevFYGiving" "PrevFY1Giving" #>  "PrevFY2Giving" "PrevFY3Giving" #>  "PrevFY4Giving" "CurrFYGiving" #>  "TotalGiving" "DONOR_IND"
names function output, we can quickly identify there are 22 different columns or features to this data set.
6.5 Inspect Data Object
Let’s inspect our data frame object in greater detail.
# Inspect the data object str(data)
str is a very useful function that reports a wealth of information. The
str output tells us we have 34,508 observations of 22 variables and provides a description of each variable, data type, and so on.
6.6 Read Excel files
In addition to CSV files, you will often find data in Excel spreadsheet formats (“.xls” and “.xlsx”).
To read Excel files directly into
R, you can use the
readxl function that comes bundled with the popular
The following is code to download and load the
tidverse package so you can use the
readxl function to read an Excel file directly into
# Install tidyverse package install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# Load tidyverse package library(tidyverse) # Load readxl package library(readxl) # Load Excel file into "excel" data frame excel <- read_xlsx("data/DonorSampleData.xlsx")
strfunction on the
exceldata frame to examine its structure. Does the
excelobject have the same number of observations and variables as the CSV file?
6.7 Read from a Database
Sometimes you need to pull data from a database rather than a CSV file or Excel spreadsheet. There are many different kinds of databases such as SQL, SQLite, and Oracle™. It is best to contact your database administrator to confirm your organization’s database details.
In the following example, we will cover how to connect to a
SQLite file and introduce you to the general process of how to connect to a database to retrieve data that meets your specific criteria or needs.
# Install DBI install.packages("DBI", repos = "http://cran.us.r-project.org") # Install RSQLite install.packages("RSQlite", repos = "http://cran.us.r-project.org")
# Load DBI package library(DBI) # Load RSQLite package library(RSQLite) # Connect to SQLite Database File con <- dbConnect(SQLite(), dbname = "Data/DonorSampleDataSQL.db") # Build Database Query to Pull Alumni Who Are Also Parents query <- dbSendQuery(con, "SELECT * FROM data WHERE ALUMNUS_IND = 'Y' AND PARENT_IND = 'Y'") # Fetch Query db_data <- dbFetch(query) # Display 10 prospects who are both parent and alumni head(db_data, n = 10) # Clear Query from Memory dbClearResult(query)
dbConnect function will let you connect you to the databases for which you have drivers installed on your system. A typical database management system is Microsoft’s SQL Server® (MSSQL). To connect to MSSQL or other databases, follow the syntax of the
dbConnect function. You will need to install the
odbc package for this code to work. You will also need the
ODBC drivers for SQL Server if they are not already installed. On a Windows™ computer, the drivers are already installed, but on a MacOS, you will need to install them manually as well as set up some additional files (not fun or straightforward). RStudio, the maker of the
odbc package, provided guidelines on how to install these drivers on its site. The connection to MSSQL may look like:
library(odbc) con <- dbConnect(odbc::odbc(), driver = "SQL Server", server = <your_server_ip_or_address>, database = <your_database_name>, uid = <your_username>, pwd = <your_password>)
Once connected, you can list all the tables in your database using this command:
In this recipe, we connected to a
SQLite database file (modeled on
DonorSampleData.csv), built a database query to select prospects who are both alumni and parent constituent type, and displayed 10 sample prospects. We also showed a way to connect to Microsoft SQL Server. To get into the habit of keeping a tidy
R workspace, we explicitly cleared the database query from computer memory as best practice. Cleaning up your
R programming environment will become increasingly important as you work on larger projects with varied information sources and complex queries across multiple database systems, tables, and so on.