+ - 0:00:00
Notes for current slide
Notes for next slide

CSSS508, Lecture 5

Importing, Exporting, and Cleaning Data

Michael Pearce
(based on slides from Chuck Lanfear)

April 26, 2023

1 / 35

Topics

Last time, we learned about,

  1. Types of Data
  2. Vectors
  3. Matrices
  4. Lists
2 / 35

Topics

Last time, we learned about,

  1. Types of Data
  2. Vectors
  3. Matrices
  4. Lists

Today, we will cover,

  1. Importing and exporting data
  2. Reshaping data
  3. Dates and times
2 / 35

1. Importing and Exporting Data

  • Data packages
  • Imporing data with code
  • Importing data by "point-and-click"
3 / 35

Data Packages

R has a big user base. If you are working with a popular data source, it will often have a devoted R package on CRAN or Github.

4 / 35

Data Packages

R has a big user base. If you are working with a popular data source, it will often have a devoted R package on CRAN or Github.

  • WDI: World Development Indicators (World Bank)
  • WHO: World Health Organization API
  • tidycensus: Census and American Community Survey
  • quantmod: financial data from Yahoo, FRED, Google
4 / 35

Data Packages

R has a big user base. If you are working with a popular data source, it will often have a devoted R package on CRAN or Github.

  • WDI: World Development Indicators (World Bank)
  • WHO: World Health Organization API
  • tidycensus: Census and American Community Survey
  • quantmod: financial data from Yahoo, FRED, Google

If you have an actual data file, you'll have to import it yourself...

4 / 35

Delimited Text Files

Besides a package, it's easiest when data is stored in a text file.

5 / 35

Delimited Text Files

Besides a package, it's easiest when data is stored in a text file.

An example of a comma-separated values (.csv) file is below:

"Subject","Depression","Sex","Week","HamD","Imipramine"
101,"Non-endogenous","Second",0,26,NA
101,"Non-endogenous","Second",1,22,NA
101,"Non-endogenous","Second",2,18,4.04305
101,"Non-endogenous","Second",3,7,3.93183
101,"Non-endogenous","Second",4,4,4.33073
101,"Non-endogenous","Second",5,3,4.36945
103,"Non-endogenous","First",0,33,NA
103,"Non-endogenous","First",1,24,NA
103,"Non-endogenous","First",2,15,2.77259
5 / 35

readr

R has some built-in functions for importing data, such as read.table() and read.csv().

6 / 35

readr

R has some built-in functions for importing data, such as read.table() and read.csv().

The readr package provides similar functions, like read_csv(), that have slightly better features:

  • Faster!
  • Better defaults (e.g. doesn't convert characters to factors)
  • A little smarter about dates and times
  • Loading bars for large files
library(readr)
6 / 35

readr Importing Example

Let's import some data about song ranks on the Billboard Hot 100 in 2000:

billboard_2000_raw <- read_csv(file =
"https://clanfear.github.io/CSSS508/Lectures/Week5/data/billboard.csv")
## Rows: 317 Columns: 81
## ── Column specification ──────────────────────────────────────────────
## Delimiter: ","
## chr (2): artist, track
## dbl (66): year, wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10...
## lgl (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, w...
## date (1): date.entered
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
7 / 35

Did It Load?

library(dplyr)
dim(billboard_2000_raw)
## [1] 317 81
names(billboard_2000_raw) %>% head(20)
## [1] "year" "artist" "track" "time"
## [5] "date.entered" "wk1" "wk2" "wk3"
## [9] "wk4" "wk5" "wk6" "wk7"
## [13] "wk8" "wk9" "wk10" "wk11"
## [17] "wk12" "wk13" "wk14" "wk15"
8 / 35

Alternate Solution

Import the data manually!

In the upper right-hand console, select:

Import Dataset > From Text (readr)

9 / 35

Alternate Solution

Import the data manually!

In the upper right-hand console, select:

Import Dataset > From Text (readr)

Once you've imported the data, you can copy/paste the import code from the console into your file!!

This makes the process reproducible!

9 / 35

Importing Other Data Types

  • For Excel files (.xls or .xlsx), use package readxl
  • For Google Docs Spreadsheets, use package googlesheets4
  • For Stata, SPSS, and SAS files, use package haven (tidyverse)
  • For Stata, SPSS, and Minitab, use package foreign

You won't keep text formatting, color, comments, or merged cells!!

10 / 35

Writing Delimited Files

Getting data out of R into a delimited file is very similar to getting it into R:

write_csv(billboard_2000_raw, path = "billboard_data.csv")

This saved the data we pulled off the web in a file called billboard_data.csv in my working directory.

11 / 35

2. Reshaping Data

12 / 35

Initial Spot Checks

First things to check after loading new data:

13 / 35

Initial Spot Checks

First things to check after loading new data:

  • Did all the rows/columns from the original file make it in?

    • Check using dim() or str()
13 / 35

Initial Spot Checks

First things to check after loading new data:

  • Did all the rows/columns from the original file make it in?

    • Check using dim() or str()
  • Are the column names in good shape?

    • Use names() to check; fix with rename()
13 / 35

Initial Spot Checks

First things to check after loading new data:

  • Did all the rows/columns from the original file make it in?

    • Check using dim() or str()
  • Are the column names in good shape?

    • Use names() to check; fix with rename()
  • Are there "decorative" blank rows or columns to remove?

    • filter() or select() out those rows/columns
13 / 35

Tidy Data

Tidy data (aka "long data") are such that:

14 / 35

Tidy Data

Tidy data (aka "long data") are such that:

  1. The values for a single observation are in their own row.
14 / 35

Tidy Data

Tidy data (aka "long data") are such that:

  1. The values for a single observation are in their own row.
  2. The values for a single variable are in their own column.
14 / 35

Tidy Data

Tidy data (aka "long data") are such that:

  1. The values for a single observation are in their own row.
  2. The values for a single variable are in their own column.
  3. There is only one value per cell.
14 / 35

Tidy Data

Tidy data (aka "long data") are such that:

  1. The values for a single observation are in their own row.
  2. The values for a single variable are in their own column.
  3. There is only one value per cell.

Why do we want tidy data?

  • Easier to understand many rows than many columns
  • Required for plotting in ggplot2
  • Required for many types of statistical procedures (e.g. hierarchical or mixed effects models)
  • Fewer issues with missing values and "imbalanced" repeated measures data
14 / 35

Slightly "Messy" Data

Program First Year Second Year
Evans School 10 6
Arts & Sciences 5 6
Public Health 2 3
Other 5 1
15 / 35

Slightly "Messy" Data

Program First Year Second Year
Evans School 10 6
Arts & Sciences 5 6
Public Health 2 3
Other 5 1
  • What is an observation?
    • A group of students from a program of a given year
  • What are the variables?
    • Program, Year
  • What are the values?
    • Program: Evans School, Arts & Sciences, Public Health, Other
    • Year: First, Second -- in column headings. Bad!
    • Count: spread over two columns!
15 / 35

Tidy Version

Program Year Count
Evans School First 10
Evans School Second 6
Arts & Sciences First 5
Arts & Sciences Second 6
Public Health First 2
Public Health Second 3
Other First 5
Other Second 1
16 / 35

Tidy Version

Program Year Count
Evans School First 10
Evans School Second 6
Arts & Sciences First 5
Arts & Sciences Second 6
Public Health First 2
Public Health Second 3
Other First 5
Other Second 1
  • Each variable is a column.

  • Each observation is a row.

  • Each cell has a single value.

16 / 35

Billboard is Just Ugly-Messy

## # A tibble: 10 × 81
## year artist track time date.ent…¹ wk1 wk2 wk3 wk4 wk5
## <dbl> <chr> <chr> <tim> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 2 Pac Baby… 04:22 2000-02-26 87 82 72 77 87
## 2 2000 2Ge+her The … 03:15 2000-09-02 91 87 92 NA NA
## 3 2000 3 Doors… Kryp… 03:53 2000-04-08 81 70 68 67 66
## 4 2000 3 Doors… Loser 04:24 2000-10-21 76 76 72 69 67
## 5 2000 504 Boyz Wobb… 03:35 2000-04-15 57 34 25 17 17
## 6 2000 98^0 Give… 03:24 2000-08-19 51 39 34 26 26
## 7 2000 A*Teens Danc… 03:44 2000-07-08 97 97 96 95 100
## 8 2000 Aaliyah I Do… 04:15 2000-01-29 84 62 51 41 38
## 9 2000 Aaliyah Try … 04:03 2000-03-18 59 53 38 28 21
## 10 2000 Adams, … Open… 05:30 2000-08-26 76 76 74 69 68
## # … with 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>,
## # wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, wk13 <dbl>,
## # wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>,
## # wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## # wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, …

Week columns continue up to wk76!

17 / 35

Billboard

  • What are the observations in the data?
18 / 35

Billboard

  • What are the observations in the data?

    • Song on the Billboard chart each week
18 / 35

Billboard

  • What are the observations in the data?

    • Song on the Billboard chart each week
  • What are the variables in the data?

18 / 35

Billboard

  • What are the observations in the data?

    • Song on the Billboard chart each week
  • What are the variables in the data?

    • Year, artist, track, song length, date entered Hot 100, week since first entered Hot 100 (spread over many columns), rank during week (spread over many columns)
18 / 35

Billboard

  • What are the observations in the data?

    • Song on the Billboard chart each week
  • What are the variables in the data?

    • Year, artist, track, song length, date entered Hot 100, week since first entered Hot 100 (spread over many columns), rank during week (spread over many columns)
  • What are the values in the data?

18 / 35

Billboard

  • What are the observations in the data?

    • Song on the Billboard chart each week
  • What are the variables in the data?

    • Year, artist, track, song length, date entered Hot 100, week since first entered Hot 100 (spread over many columns), rank during week (spread over many columns)
  • What are the values in the data?

    • e.g. 2000; 3 Doors Down; Kryptonite; 3 minutes 53 seconds; April 8, 2000; Week 3 (stuck in column headings); rank 68 (spread over many columns)
18 / 35

tidyr

The tidyr package provides functions to tidy up data.

19 / 35

tidyr

The tidyr package provides functions to tidy up data.

Key functions:

  • pivot_longer(): takes a set of columns and pivots them down to make two new columns (which you can name yourself):
    • A name column that stores the original column names
    • A value with the values in those original columns
19 / 35

tidyr

The tidyr package provides functions to tidy up data.

Key functions:

  • pivot_longer(): takes a set of columns and pivots them down to make two new columns (which you can name yourself):
    • A name column that stores the original column names
    • A value with the values in those original columns
  • pivot_wider(): inverts pivot_longer() by taking two columns and pivoting them up into multiple columns
19 / 35

tidyr

The tidyr package provides functions to tidy up data.

Key functions:

  • pivot_longer(): takes a set of columns and pivots them down to make two new columns (which you can name yourself):
    • A name column that stores the original column names
    • A value with the values in those original columns
  • pivot_wider(): inverts pivot_longer() by taking two columns and pivoting them up into multiple columns

We're going to focus only on pivot_longer here, but know that it can be reversed!

19 / 35

pivot_longer()

This function usually takes three arguments:

  1. cols: The columns we want to modify
  2. names_to: New variable name to store original columns
  3. values_to: New variable name to store original values
20 / 35

Example of pivot_longer()

library(tidyr)
billboard_2000 <- billboard_2000_raw %>%
pivot_longer(cols=wk1:wk76,
names_to ="week",
values_to = "rank")
billboard_2000 %>% head(5)
## # A tibble: 5 × 7
## year artist track time date.entered week rank
## <dbl> <chr> <chr> <time> <date> <chr> <dbl>
## 1 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk1 87
## 2 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk2 82
## 3 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk3 72
## 4 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk4 77
## 5 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk5 87

Now we have a single week column!

21 / 35

Lots of missing values?!

summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 26.00 51.00 51.05 76.00 100.00 18785

We don't want to keep the 18785 rows with missing ranks.

22 / 35

Pivoting Better: values_drop_na

Adding the argument values_drop_na = TRUE to pivot_longer() will remove rows with missing ranks.

billboard_2000 <- billboard_2000_raw %>%
pivot_longer(cols=wk1:wk76,
names_to ="week",
values_to = "rank",
values_drop_na = TRUE)
summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 26.00 51.00 51.05 76.00 100.00
23 / 35

Pivoting Better: values_drop_na

Adding the argument values_drop_na = TRUE to pivot_longer() will remove rows with missing ranks.

billboard_2000 <- billboard_2000_raw %>%
pivot_longer(cols=wk1:wk76,
names_to ="week",
values_to = "rank",
values_drop_na = TRUE)
summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 26.00 51.00 51.05 76.00 100.00

No more NA values!

dim(billboard_2000)
## [1] 5307 7

And way fewer rows!

23 / 35

parse_number()

The week column is character, but should be numeric.

head(billboard_2000$week)
## [1] "wk1" "wk2" "wk3" "wk4" "wk5" "wk6"
24 / 35

parse_number()

The week column is character, but should be numeric.

head(billboard_2000$week)
## [1] "wk1" "wk2" "wk3" "wk4" "wk5" "wk6"

parse_number() grabs just the numeric information from a character string:

billboard_2000 <- billboard_2000 %>%
mutate(week = parse_number(week))
summary(billboard_2000$week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 10.00 11.47 16.00 65.00

More sophisticated tools for character strings will be covered later in this course!

24 / 35

3. Dates and Times

25 / 35

lubridate

The package lubridate (part of the tidyverse!) has a very large number of functions you can use!

26 / 35

lubridate

The package lubridate (part of the tidyverse!) has a very large number of functions you can use!

  • Converting dates/times between formats (DD-MM-YY to YY-MM-DD)
  • Extracting dates/times (day of week, month, leap years, etc.)
  • Math with dates/times (time zone conversions, etc.)
26 / 35

lubridate

The package lubridate (part of the tidyverse!) has a very large number of functions you can use!

  • Converting dates/times between formats (DD-MM-YY to YY-MM-DD)
  • Extracting dates/times (day of week, month, leap years, etc.)
  • Math with dates/times (time zone conversions, etc.)

There's too much to cover all of it, but I'll run through a few examples.

26 / 35

Dates in billboard_2000

billboard_2000 %>% select(date.entered) %>% head(10)
## # A tibble: 10 × 1
## date.entered
## <date>
## 1 2000-02-26
## 2 2000-02-26
## 3 2000-02-26
## 4 2000-02-26
## 5 2000-02-26
## 6 2000-02-26
## 7 2000-02-26
## 8 2000-09-02
## 9 2000-09-02
## 10 2000-09-02
27 / 35

Extracting Year, Month, or Day

library(lubridate)
head(billboard_2000$date.entered,5)
## [1] "2000-02-26" "2000-02-26" "2000-02-26" "2000-02-26" "2000-02-26"
year(billboard_2000$date.entered) %>% head(5)
## [1] 2000 2000 2000 2000 2000
month(billboard_2000$date.entered) %>% head(5)
## [1] 2 2 2 2 2
day(billboard_2000$date.entered) %>% head(5)
## [1] 26 26 26 26 26
28 / 35

Extracting Weekday, Quarter, and Leap Year Boolean

wday(billboard_2000$date.entered) %>% head(5)
## [1] 7 7 7 7 7
quarter(billboard_2000$date.entered) %>% head(5)
## [1] 1 1 1 1 1
leap_year(billboard_2000$date.entered) %>% head(5)
## [1] TRUE TRUE TRUE TRUE TRUE
29 / 35

Summary

  1. Importing/Exporting Data: readr
  2. Reshaping data: tidyr
  3. Dates and times lubridate

Let's take a 10 minute break, then reconvene for an activity!

30 / 35

Activity!

In groups of 2-3, you will use the Billboard data to investigate a question:

  1. Write down a question of interest that could be studied with this data

    • Which/how many artists had #1 hits?
    • How does rank for each song change over time?
    • Is there a relationship between highest rank and length of song?
  2. Make the Billboard data tidy, perhaps using the code from this lecture.

  3. Perform additional steps (if necessary) to help answer your question:

    • Perhaps using filter, select, group_by, mutate, summarize, etc.
  4. Make a plot or table that answers your question and write down your answer in a sentence.

  5. Email me your question, plot/table, and written answer (mpp790@uw.edu)

31 / 35

My Example: Question

Question: Do songs that hit #1 have a different trajectory than those that don't?

billboard_2000_question <- billboard_2000 %>%
group_by(artist, track) %>%
mutate(`Weeks at #1` = sum(rank == 1),
`Peak Rank` = ifelse(any(rank == 1),
"Hit #1",
"Didn't #1"))

Note: any(min_rank==1) checks to see if any value of rank is equal to one for the given artist and track

32 / 35

My Example: Figure

library(ggplot2)
billboard_trajectories <-
ggplot(data = billboard_2000_question,
aes(x = week, y = rank, group = track,
color = `Peak Rank`)) +
geom_line(aes(size = `Peak Rank`), alpha = 0.4) +
theme_classic() +
xlab("Week") + ylab("Rank") +
scale_color_manual(values = c("black", "red")) +
scale_size_manual(values = c(0.25, 1)) +
theme(legend.position = c(0.90, 0.75),
legend.background = element_rect(fill="transparent"))
33 / 35

Charts of 2000: Beauty!

Songs that reach #1 on the Billboard charts appear to last >20 weeks on the charts, while other songs very rarely make it past that point.

34 / 35

Homework 5

On Course Website!

35 / 35

Topics

Last time, we learned about,

  1. Types of Data
  2. Vectors
  3. Matrices
  4. Lists
2 / 35
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow