Chapter 7 Cleaning Data

Unlike any textbook data, real-world data is messy and hardly ready for analysis. Some common problems with real-world data are incorrect data types and space-padded text. On top of that, R prefers data in a certain way. For example, unknown values are noted as NAs, and for character or factor data types, the plotting order of axis labels depends on the order of underlying data. We’ll take a look at some of these cleaning operations, but first let’s load the data using the library readr:

library(readr)
donor_data <- read_csv("data/DonorSampleData.csv")

We will use the dplyr library’s glimpse function to peek into the data. You can also see the summary of the data by using the summary(donor_data) command.

library(dplyr)
glimpse(donor_data)
#> Observations: 34,508
#> Variables: 22
#> $ ID                  <int> 1, 2, 3, 4, 5, 6,...
#> $ ZIPCODE             <chr> "23187", "77643",...
#> $ AGE                 <int> NA, 33, NA, 31, 6...
#> $ MARITAL_STATUS      <chr> "Married", "Unkno...
#> $ GENDER              <chr> "Female", "Female...
#> $ MEMBERSHIP_IND      <chr> "N", "N", "N", "N...
#> $ ALUMNUS_IND         <chr> "N", "Y", "N", "Y...
#> $ PARENT_IND          <chr> "N", "N", "N", "N...
#> $ HAS_INVOLVEMENT_IND <chr> "N", "Y", "N", "Y...
#> $ WEALTH_RATING       <chr> NA, NA, NA, NA, N...
#> $ DEGREE_LEVEL        <chr> NA, "UB", NA, NA,...
#> $ PREF_ADDRESS_TYPE   <chr> "HOME", NA, "HOME...
#> $ EMAIL_PRESENT_IND   <chr> "N", "Y", "N", "Y...
#> $ CON_YEARS           <int> 1, 0, 1, 0, 0, 0,...
#> $ PrevFYGiving        <chr> "$0", "$0", "$0",...
#> $ PrevFY1Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY2Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY3Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY4Giving       <chr> "$0", "$0", "$0",...
#> $ CurrFYGiving        <chr> "$0", "$0", "$200...
#> $ TotalGiving         <chr> "$10", "$2,100", ...
#> $ DONOR_IND           <chr> "Y", "Y", "Y", "N...

As you see from the glimpse of the data, all columns, except for numeric columns, are character or string type. This is the default setting for the readr package, compared to the default setting of the factor using read.csv from base R. Reading data as character type helps to clean it easily because factors are stored by their indices compared to the literal string values of the characters.

7.1 Remove Extra Spaces

Often, data-entry errors cause extra spaces around text. For example, “John Smith” might be entered as “John Smith”. The library stringr offers a convenient way to trim all the spaces around text.

library(stringr)
str_trim('John Smith ')
#> [1] "John Smith"
str_trim('  John Smith ')
#> [1] "John Smith"

Let’s remove spaces around the PREF_ADDRESS_TYPE column in our data:

donor_data$PREF_ADDRESS_TYPE <- str_trim(
  donor_data$PREF_ADDRESS_TYPE)

What if we wanted to trim extra spaces from all character columns? We can use the mutate_if function from the dplyr package. This function will apply any arbitrary function to the selected columns, matched by a condition. For our purposes, we can select all the columns that are of character type using the is.character function.

is.character(donor_data$PREF_ADDRESS_TYPE)
#> [1] TRUE
is.character(donor_data$MARITAL_STATUS)
#> [1] TRUE
is.character(donor_data$GENDER)
#> [1] TRUE

donor_data <- mutate_if(donor_data,
                        .predicate = is.character, 
                        .funs = str_trim)

7.2 Change Data Types

As we saw from the glimpse above, there are a few columns that are indicator variables (with yes or no values). For various types of analyses, it is useful to convert them to factor type variables. We will use the ends_with function from the dplyr library to select all columns whose names end with “_IND“. Then, we will convert these selected columns to factors.

donor_data <- mutate_at(donor_data,
                        .vars = vars(ends_with("_IND")),
                        .funs = as.factor)

7.3 Replace Values with NA

If your data has text values that denotes missing values, we need to explicitly convert those to NA for R to treat them as missing values. The MARITAL_STATUS column has the value of Unknown for a majority of the rows. Let’s convert those to NA using the ifelse function. The ifelse function will test for a condition and return some other values based on the result of the test. In this case, we will test whether the MARITAL_STATUS column has the value of Unknown; if the result is true, then we return NA, else we return the original value.

donor_data$MARITAL_STATUS <- with(
  donor_data,
  ifelse(MARITAL_STATUS == 'Unknown', 
         NA, 
         MARITAL_STATUS))

7.4 Change Order of Values

For a factor column, the order of the elements in that column matters for various labeling operations. If the order is important to you, you should let R know that order. The following is how to do so.

donor_data$WEALTH_RATING <- with(
  donor_data,
  factor(WEALTH_RATING, 
         levels = c('$1-$24,999', '$25,000-$49,999', 
                    '$50,000-$99,999', '$100,000-$249,999', 
                    '$250,000-$499,999', '$500,000-$999,999', 
                    '$1,000,000-$2,499,999',
                    '$2,500,000-$4,999,999', 
                    '$5,000,000-$9,999,999',
                    '$10,000,000-$24,999,999'), 
         ordered = TRUE))

7.5 Clean ZIP Codes

ZIP Codes are often problematic in addresses because of data-entry errors or the presence of international ZIP Codes. We will use the zipcode library and the clean.zipcodes function to clean and limit the ZIP Codes to only US ZIP Codes.

The following are some erroneous ZIP Codes.

#> # A tibble: 6 x 1
#>   ZIPCODE
#>     <chr>
#> 1 NA-1175
#> 2 NA-2179
#> 3 NA-2245
#> 4 NA-4919
#> 5 NA-5419
#> 6 NA-6653

And some ZIP Codes in the zip + 4 format.

#> # A tibble: 6 x 1
#>      ZIPCODE
#>        <chr>
#> 1 92555-6454
#> 2 90265-5223
#> 3 90265-5785
#> 4 14845-5217
#> 5 33433-3803
#> 6 90265-4485

Let’s clean them.

library(zipcode)
head(donor_data$ZIPCODE)
#> [1] "23187"      "77643"      NA          
#> [4] "47141"      "92555-6454" "95191"
donor_data$ZIPCODE <- clean.zipcodes(donor_data$ZIPCODE)
head(donor_data$ZIPCODE)
#> [1] "23187" "77643" NA      "47141" "92555"
#> [6] "95191"

7.6 Manipulate Dates

Another common problem with the messy data is incorrect or missing date formats. The lubridate library makes it easy to perform various date-based operations.

Let’s say we want to add a column for birth dates. We don’t have the actual birth date, but only the age in our sample data. Let’s look at the wrong way of doing this first.

library(lubridate)
# Wrong Way. Only Subtracted Days
donor_data %>% mutate(BIRTH_DATE = as_date(today() - AGE)) %>% 
  select(BIRTH_DATE) 
#> # A tibble: 34,508 x 1
#>   BIRTH_DATE
#>       <date>
#> 1         NA
#> 2 2018-11-24
#> 3         NA
#> 4 2018-11-26
#> 5 2018-10-20
#> 6 2018-10-31
#> # ... with 3.45e+04 more rows

The correct way is to subtract the age from the current year and then create a date field. We will use random months and days.

# Correct Way
donor_data$BIRTH_DATE <- make_date(
  year = year(today()) - donor_data$AGE, 
  month = sample(1:12, replace = TRUE), 
  day = sample(1:26, replace = TRUE))
head(donor_data$BIRTH_DATE)
#> [1] NA           "1985-06-16" NA          
#> [4] "1987-12-03" "1950-09-11" "1961-01-23"

7.7 Remove Non-Numeric Characters

Often, numeric data columns have non-numeric characters, which create problems in further analysis. A common example is seeing the dollar sign ($) and/or commas in currency columns. The stringr library offers the str_replace_all function to replace such offending characters with something else.

Let’s remove the dollar sign from the TotalGiving column.

head(donor_data$TotalGiving)
#> [1] "$10"    "$2,100" "$200"   "$0"     "$505"  
#> [6] "$0"

donor_data$TotalGiving <- str_replace_all(
  string = donor_data$TotalGiving,
  pattern = "\\$",
  replacement = "")
head(donor_data$TotalGiving)
#> [1] "10"    "2,100" "200"   "0"     "505"  
#> [6] "0"

Note that we used two backward slashes in front of the dollar sign. We did so because the dollar symbol is a special character in the pattern-matching mechanism known as regular expressions or regex. We had to tell R that we’re looking for a literal symbol by using a backward slash. Unfortunately, a backward slash is a reserved symbol for R but we escaped that by using another backward slash.

Let’s remove any commas in the TotalGiving column.

donor_data$TotalGiving <- str_replace_all(
  string = donor_data$TotalGiving,
  pattern = ",",
  replacement = "")

Pattern matching and searching using regex is very powerful—think super-charged wildcard matching. Get familiar with regexes to super-charge your searching, finding, and replacing. Here’s a good cheat sheet: http://bit.ly/1NgWBbL. You can also test your regular expressions online: http://regexr.com/. Try to combine the two operations above into one single operation. Hint: See Section 13.1 to clean multiple columns at the same time.

7.8 Convert from Character to Numeric Data Type

Now that the TotalGiving column is clean, let’s save it as a numeric column using the as.numeric function.

donor_data$TotalGiving <- as.numeric(donor_data$TotalGiving)

7.9 Export Cleaned Data File

In the following chapters, we will use this cleaned version. Let’s save this data frame using the write_csv function from the readr library:

write_csv(donor_data, "data/DonorSampleDataCleaned.csv")