Last time, we learned about,
Last time, we learned about,
Today, we will cover,
R has a big user base. If you are working with a popular data source, it will often have a devoted R package on CRAN or Github.
R has a big user base. If you are working with a popular data source, it will often have a devoted R package on CRAN or Github.
WDI
: World Development Indicators (World Bank)WHO
: World Health Organization APItidycensus
: Census and American Community Surveyquantmod
: financial data from Yahoo, FRED, GoogleR has a big user base. If you are working with a popular data source, it will often have a devoted R package on CRAN or Github.
WDI
: World Development Indicators (World Bank)WHO
: World Health Organization APItidycensus
: Census and American Community Surveyquantmod
: financial data from Yahoo, FRED, GoogleIf you have an actual data file, you'll have to import it yourself...
Besides a package, it's easiest when data is stored in a text file.
Besides a package, it's easiest when data is stored in a text file.
An example of a comma-separated values (.csv) file is below:
"Subject","Depression","Sex","Week","HamD","Imipramine"101,"Non-endogenous","Second",0,26,NA101,"Non-endogenous","Second",1,22,NA101,"Non-endogenous","Second",2,18,4.04305101,"Non-endogenous","Second",3,7,3.93183101,"Non-endogenous","Second",4,4,4.33073101,"Non-endogenous","Second",5,3,4.36945103,"Non-endogenous","First",0,33,NA103,"Non-endogenous","First",1,24,NA103,"Non-endogenous","First",2,15,2.77259
readr
R has some built-in functions for importing data, such as read.table()
and read.csv()
.
readr
R has some built-in functions for importing data, such as read.table()
and read.csv()
.
The readr
package provides similar functions, like read_csv()
, that have slightly better features:
library(readr)
readr
Importing ExampleLet's import some data about song ranks on the Billboard Hot 100 in 2000:
billboard_2000_raw <- read_csv(file = "https://clanfear.github.io/CSSS508/Lectures/Week5/data/billboard.csv")
## Rows: 317 Columns: 81## ── Column specification ──────────────────────────────────────────────## Delimiter: ","## chr (2): artist, track## dbl (66): year, wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10...## lgl (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, w...## date (1): date.entered## time (1): time## ## ℹ Use `spec()` to retrieve the full column specification for this data.## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(dplyr)dim(billboard_2000_raw)
## [1] 317 81
names(billboard_2000_raw) %>% head(20)
## [1] "year" "artist" "track" "time" ## [5] "date.entered" "wk1" "wk2" "wk3" ## [9] "wk4" "wk5" "wk6" "wk7" ## [13] "wk8" "wk9" "wk10" "wk11" ## [17] "wk12" "wk13" "wk14" "wk15"
Import the data manually!
In the upper right-hand console, select:
Import Dataset > From Text (readr)
Import the data manually!
In the upper right-hand console, select:
Import Dataset > From Text (readr)
Once you've imported the data, you can copy/paste
the import code from the console into your file!!
This makes the process reproducible!
.xls
or .xlsx
), use package readxl
googlesheets4
haven
(tidyverse
)foreign
You won't keep text formatting, color, comments, or merged cells!!
Getting data out of R into a delimited file is very similar to getting it into R:
write_csv(billboard_2000_raw, path = "billboard_data.csv")
This saved the data we pulled off the web in a file called billboard_data.csv
in my working directory.
First things to check after loading new data:
First things to check after loading new data:
Did all the rows/columns from the original file make it in?
dim()
or str()
First things to check after loading new data:
Did all the rows/columns from the original file make it in?
dim()
or str()
Are the column names in good shape?
names()
to check; fix with rename()
First things to check after loading new data:
Did all the rows/columns from the original file make it in?
dim()
or str()
Are the column names in good shape?
names()
to check; fix with rename()
Are there "decorative" blank rows or columns to remove?
filter()
or select()
out those rows/columnsTidy data (aka "long data") are such that:
Tidy data (aka "long data") are such that:
Tidy data (aka "long data") are such that:
Tidy data (aka "long data") are such that:
Tidy data (aka "long data") are such that:
Why do we want tidy data?
ggplot2
Program | First Year | Second Year |
---|---|---|
Evans School | 10 | 6 |
Arts & Sciences | 5 | 6 |
Public Health | 2 | 3 |
Other | 5 | 1 |
Program | First Year | Second Year |
---|---|---|
Evans School | 10 | 6 |
Arts & Sciences | 5 | 6 |
Public Health | 2 | 3 |
Other | 5 | 1 |
Program | Year | Count |
---|---|---|
Evans School | First | 10 |
Evans School | Second | 6 |
Arts & Sciences | First | 5 |
Arts & Sciences | Second | 6 |
Public Health | First | 2 |
Public Health | Second | 3 |
Other | First | 5 |
Other | Second | 1 |
Program | Year | Count |
---|---|---|
Evans School | First | 10 |
Evans School | Second | 6 |
Arts & Sciences | First | 5 |
Arts & Sciences | Second | 6 |
Public Health | First | 2 |
Public Health | Second | 3 |
Other | First | 5 |
Other | Second | 1 |
Each variable is a column.
Each observation is a row.
Each cell has a single value.
## # A tibble: 10 × 81## year artist track time date.ent…¹ wk1 wk2 wk3 wk4 wk5## <dbl> <chr> <chr> <tim> <date> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 2000 2 Pac Baby… 04:22 2000-02-26 87 82 72 77 87## 2 2000 2Ge+her The … 03:15 2000-09-02 91 87 92 NA NA## 3 2000 3 Doors… Kryp… 03:53 2000-04-08 81 70 68 67 66## 4 2000 3 Doors… Loser 04:24 2000-10-21 76 76 72 69 67## 5 2000 504 Boyz Wobb… 03:35 2000-04-15 57 34 25 17 17## 6 2000 98^0 Give… 03:24 2000-08-19 51 39 34 26 26## 7 2000 A*Teens Danc… 03:44 2000-07-08 97 97 96 95 100## 8 2000 Aaliyah I Do… 04:15 2000-01-29 84 62 51 41 38## 9 2000 Aaliyah Try … 04:03 2000-03-18 59 53 38 28 21## 10 2000 Adams, … Open… 05:30 2000-08-26 76 76 74 69 68## # … with 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>,## # wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, wk13 <dbl>,## # wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>,## # wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,## # wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, …
Week columns continue up to wk76
!
What are the observations in the data?
What are the observations in the data?
What are the variables in the data?
What are the observations in the data?
What are the variables in the data?
What are the observations in the data?
What are the variables in the data?
What are the values in the data?
What are the observations in the data?
What are the variables in the data?
What are the values in the data?
tidyr
The tidyr
package provides functions to tidy up data.
tidyr
The tidyr
package provides functions to tidy up data.
Key functions:
pivot_longer()
: takes a set of columns and pivots them down to make two new columns (which you can name yourself): name
column that stores the original column namesvalue
with the values in those original columnstidyr
The tidyr
package provides functions to tidy up data.
Key functions:
pivot_longer()
: takes a set of columns and pivots them down to make two new columns (which you can name yourself): name
column that stores the original column namesvalue
with the values in those original columnspivot_wider()
: inverts pivot_longer()
by taking two columns and pivoting them up into multiple columnstidyr
The tidyr
package provides functions to tidy up data.
Key functions:
pivot_longer()
: takes a set of columns and pivots them down to make two new columns (which you can name yourself): name
column that stores the original column namesvalue
with the values in those original columnspivot_wider()
: inverts pivot_longer()
by taking two columns and pivoting them up into multiple columnsWe're going to focus only on pivot_longer
here, but know that it can be reversed!
pivot_longer()
This function usually takes three arguments:
pivot_longer()
library(tidyr)billboard_2000 <- billboard_2000_raw %>% pivot_longer(cols=wk1:wk76, names_to ="week", values_to = "rank")billboard_2000 %>% head(5)
## # A tibble: 5 × 7## year artist track time date.entered week rank## <dbl> <chr> <chr> <time> <date> <chr> <dbl>## 1 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk1 87## 2 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk2 82## 3 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk3 72## 4 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk4 77## 5 2000 2 Pac Baby Don't Cry (Keep... 04:22 2000-02-26 wk5 87
Now we have a single week column!
summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1.00 26.00 51.00 51.05 76.00 100.00 18785
We don't want to keep the 18785 rows with missing ranks.
values_drop_na
Adding the argument values_drop_na = TRUE
to pivot_longer()
will remove rows with missing ranks.
billboard_2000 <- billboard_2000_raw %>% pivot_longer(cols=wk1:wk76, names_to ="week", values_to = "rank", values_drop_na = TRUE)summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.00 26.00 51.00 51.05 76.00 100.00
values_drop_na
Adding the argument values_drop_na = TRUE
to pivot_longer()
will remove rows with missing ranks.
billboard_2000 <- billboard_2000_raw %>% pivot_longer(cols=wk1:wk76, names_to ="week", values_to = "rank", values_drop_na = TRUE)summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.00 26.00 51.00 51.05 76.00 100.00
No more NA
values!
dim(billboard_2000)
## [1] 5307 7
And way fewer rows!
parse_number()
The week column is character, but should be numeric.
head(billboard_2000$week)
## [1] "wk1" "wk2" "wk3" "wk4" "wk5" "wk6"
parse_number()
The week column is character, but should be numeric.
head(billboard_2000$week)
## [1] "wk1" "wk2" "wk3" "wk4" "wk5" "wk6"
parse_number()
grabs just the numeric information from a character string:
billboard_2000 <- billboard_2000 %>% mutate(week = parse_number(week))summary(billboard_2000$week)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.00 5.00 10.00 11.47 16.00 65.00
More sophisticated tools for character strings will be covered later in this course!
lubridate
The package lubridate
(part of the tidyverse!) has a very large number of functions you can use!
lubridate
The package lubridate
(part of the tidyverse!) has a very large number of functions you can use!
lubridate
The package lubridate
(part of the tidyverse!) has a very large number of functions you can use!
There's too much to cover all of it, but I'll run through a few examples.
billboard_2000
billboard_2000 %>% select(date.entered) %>% head(10)
## # A tibble: 10 × 1## date.entered## <date> ## 1 2000-02-26 ## 2 2000-02-26 ## 3 2000-02-26 ## 4 2000-02-26 ## 5 2000-02-26 ## 6 2000-02-26 ## 7 2000-02-26 ## 8 2000-09-02 ## 9 2000-09-02 ## 10 2000-09-02
library(lubridate)head(billboard_2000$date.entered,5)
## [1] "2000-02-26" "2000-02-26" "2000-02-26" "2000-02-26" "2000-02-26"
year(billboard_2000$date.entered) %>% head(5)
## [1] 2000 2000 2000 2000 2000
month(billboard_2000$date.entered) %>% head(5)
## [1] 2 2 2 2 2
day(billboard_2000$date.entered) %>% head(5)
## [1] 26 26 26 26 26
wday(billboard_2000$date.entered) %>% head(5)
## [1] 7 7 7 7 7
quarter(billboard_2000$date.entered) %>% head(5)
## [1] 1 1 1 1 1
leap_year(billboard_2000$date.entered) %>% head(5)
## [1] TRUE TRUE TRUE TRUE TRUE
readr
tidyr
lubridate
Let's take a 10 minute break, then reconvene for an activity!
In groups of 2-3, you will use the Billboard data to investigate a question:
Write down a question of interest that could be studied with this data
Make the Billboard data tidy, perhaps using the code from this lecture.
Perform additional steps (if necessary) to help answer your question:
filter
, select
, group_by
, mutate
, summarize
, etc.Make a plot or table that answers your question and write down your answer in a sentence.
Email me your question, plot/table, and written answer (mpp790@uw.edu
)
Question: Do songs that hit #1 have a different trajectory than those that don't?
billboard_2000_question <- billboard_2000 %>% group_by(artist, track) %>% mutate(`Weeks at #1` = sum(rank == 1), `Peak Rank` = ifelse(any(rank == 1), "Hit #1", "Didn't #1"))
Note: any(min_rank==1)
checks to see if any value of rank
is equal to one for the given artist
and track
library(ggplot2)billboard_trajectories <- ggplot(data = billboard_2000_question, aes(x = week, y = rank, group = track, color = `Peak Rank`)) + geom_line(aes(size = `Peak Rank`), alpha = 0.4) + theme_classic() + xlab("Week") + ylab("Rank") + scale_color_manual(values = c("black", "red")) + scale_size_manual(values = c(0.25, 1)) + theme(legend.position = c(0.90, 0.75), legend.background = element_rect(fill="transparent"))
Songs that reach #1 on the Billboard charts appear to last >20 weeks on the charts, while other songs very rarely make it past that point.
On Course Website!
Last time, we learned about,
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |