CSSS508, Lecture 5
Importing, Exporting, and Cleaning Data
Michael Pearce
(based on slides from Chuck Lanfear)
April 26, 2023

Topics

Last time, we learned about:

Types of Data
Vectors
Matrices
Lists

Today, we will cover:

Importing and exporting data
Reshaping data
Dates and times

1. Importing and Exporting Data

Data packages
Importing data with code
Importing data by "point-and-click"

Data Packages

R has a big user base. If you are working with a popular data source, it will often have a dedicated R package on CRAN or GitHub.

WDI: World Development Indicators (World Bank)
WHO: World Health Organization API
tidycensus: Census and American Community Survey
quantmod: financial data from Yahoo, FRED, Google

If you have an actual data file, you'll have to import it yourself...

Delimited Text Files

Aside from using a data package, importing is easiest when the data is stored in a delimited text file.

An example of a comma-separated values (.csv) file is below:

"Subject","Depression","Sex","Week","HamD","Imipramine"
101,"Non-endogenous","Second",0,26,NA
101,"Non-endogenous","Second",1,22,NA
101,"Non-endogenous","Second",2,18,4.04305
101,"Non-endogenous","Second",3,7,3.93183
101,"Non-endogenous","Second",4,4,4.33073
101,"Non-endogenous","Second",5,3,4.36945
103,"Non-endogenous","First",0,33,NA
103,"Non-endogenous","First",1,24,NA
103,"Non-endogenous","First",2,15,2.77259
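
To see how a file like this maps onto a data frame, here is a minimal sketch that parses a few of the lines above from a string, using base R's read.csv() with its text argument. The object names csv_text and depression are made up for illustration.

```r
# A few rows of the example file, held in a string (illustrative only)
csv_text <- '"Subject","Depression","Sex","Week","HamD","Imipramine"
101,"Non-endogenous","Second",0,26,NA
101,"Non-endogenous","Second",1,22,NA
101,"Non-endogenous","Second",2,18,4.04305
103,"Non-endogenous","First",0,33,NA'

# The first line supplies the column names; commas separate the fields
depression <- read.csv(text = csv_text)

dim(depression)            # number of rows and columns
sapply(depression, class)  # type R assigned to each column
```

The unquoted NA entries are read as missing values rather than as the text "NA".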

`readr`

R has some built-in functions for importing data, such as read.table() and read.csv().

The readr package provides similar functions, like read_csv(), that have slightly better features:

Faster!
Better defaults (e.g. never converts characters to factors, which read.csv() did by default before R 4.0)
A little smarter about dates and times
Loading bars for large files

library(readr)

`readr` Importing Example

Let's import some data about song ranks on the Billboard Hot 100 in 2000:

billboard_2000_raw <- read_csv(file = 
"https://clanfear.github.io/CSSS508/Lectures/Week5/data/billboard.csv")

## Rows: 317 Columns: 81
## ── Column specification ──────────────────────────────────────────────
## Delimiter: ","
## chr   (2): artist, track
## dbl  (66): year, wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10...
## lgl  (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, w...
## date  (1): date.entered
## time  (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
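
The message above suggests specifying the column types yourself. Here is a minimal sketch of that, assuming readr 2.0 or later (needed to pass literal text with I()). It uses a tiny made-up stand-in for the Billboard file; the object names and values are illustrative, not taken from the real data.

```r
library(readr)

# Tiny made-up stand-in for the Billboard file (values are illustrative only)
csv_text <- "year,artist,track,wk1,wk2
2000,Artist A,Song A,87,82
2000,Artist B,Song B,91,NA"

# Declaring types up front silences the message and guards against bad guesses
small <- read_csv(
  I(csv_text),
  col_types = cols(
    year   = col_integer(),
    artist = col_character(),
    track  = col_character(),
    wk1    = col_integer(),
    wk2    = col_integer()
  )
)

spec(small)  # shows the column specification that was used
```

Explicit types matter most for columns readr cannot guess well. In the output above, wk66 onward came in as lgl, presumably because those columns contain only missing values.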

Did It Load?

library(dplyr)
dim(billboard_2000_raw)

## [1] 317  81

names(billboard_2000_raw) %>% head(20)

##  [1] "year"         "artist"       "track"        "time"        
##  [5] "date.entered" "wk1"          "wk2"          "wk3"         
##  [9] "wk4"          "wk5"          "wk6"          "wk7"         
## [13] "wk8"          "wk9"          "wk10"         "wk11"        
## [17] "wk12"         "wk13"         "wk14"         "wk15"

Alternate Solution

Import the data manually!

In the upper right-hand pane of RStudio (the Environment tab), select:

Import Dataset > From Text (readr)

Once you've imported the data, you can copy/paste the import code from the console into your script!

This makes the process reproducible!

Importing Other Data Types

For Excel files (.xls or .xlsx), use package readxl
For Google Sheets, use package googlesheets4
For Stata, SPSS, and SAS files, use package haven (tidyverse)
For Stata, SPSS, and Minitab files, use package foreign

You won't keep text formatting, color, comments, or merged cells!
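
As a minimal sketch of importing an Excel file, the code below reads one of the example workbooks that ships with readxl, so it does not depend on any file of your own. The object names xlsx_path and first_sheet are made up for illustration.

```r
library(readxl)

# readxl bundles a few example workbooks; this returns the path to one of them
xlsx_path <- readxl_example("datasets.xlsx")

excel_sheets(xlsx_path)                          # list the sheet names
first_sheet <- read_excel(xlsx_path, sheet = 1)  # import the first sheet as a tibble
head(first_sheet)
```

For your own data, replace xlsx_path with the path to your workbook and choose the sheet by name or position.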

Writing Delimited Files

Getting data out of R into a delimited file is very similar to getting it into R:

write_csv(billboard_2000_raw, file = "billboard_data.csv")

This saves the data we pulled off the web to a file called billboard_data.csv in the working directory.
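
One way to check that an export worked is to read the file straight back in. Below is a minimal sketch of such a round trip, using a small made-up data frame (songs) in place of the Billboard data and a temporary file so nothing is left behind in the working directory.

```r
library(readr)

# Small made-up data frame standing in for billboard_2000_raw
songs <- data.frame(artist = c("Artist A", "Artist B"), wk1 = c(87, 91))

# Write to a temporary file rather than the working directory
out_file <- tempfile(fileext = ".csv")
write_csv(songs, file = out_file)

# Read it back and compare each column with the original
songs_back <- read_csv(out_file, show_col_types = FALSE)
identical(songs_back$artist, songs$artist)
all.equal(songs_back$wk1, songs$wk1)
```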

2. Reshaping Data

Initial Spot Checks

First things to check after loading new data:

Did all the rows/columns from the original file make it in?
- Check using dim() or str()

Are the column names in good shape?
- Use names() to check; fix with rename()

Are there "decorative" blank rows or columns to remove?
- filter() or select() out those rows/columns
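
The checks above can be sketched on a small made-up data frame. Everything here (messy, clean, the column names, the values) is hypothetical and chosen only to show each step.

```r
library(dplyr)

# Made-up data with an awkward column name, a blank row, and a blank column
messy <- data.frame(
  `Song Title` = c("Song A", NA, "Song B"),
  rank         = c(3, NA, 7),
  spacer       = c(NA, NA, NA),
  check.names  = FALSE
)

dim(messy)    # did all rows and columns arrive?
names(messy)  # are the column names usable?

clean <- messy %>%
  rename(track = `Song Title`) %>%            # fix the column name
  filter(!(is.na(track) & is.na(rank))) %>%   # drop the decorative blank row
  select(-spacer)                             # drop the decorative blank column

dim(clean)
```

The filter() condition drops only rows where every substantive column is missing, so rows with a single missing value are kept.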
