Last time, we learned about,
apply()
Today, we will cover,
stringr
(tidyverse)
A general programming term for a unit of character data is a string
Strings are a sequence of characters
In R, "strings" and "character data" are mostly interchangeable.
Some languages have more precise distinctions, but we won't worry about that here!
We can create strings by surrounding text, numbers, spaces, or symbols with quotes!
"Hello! My name is Michael"
or "%*$#01234"
R can treat strings in funny ways!
"01" == "1"
## [1] FALSE
"01" == 1
## [1] FALSE
"1" == 1
## [1] TRUE
Reminder: We can check data types using the class() function!
c(class("1"),class(1))
## [1] "character" "numeric"
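The comparison results shown earlier come from type coercion: when a character value is compared with a number, R converts the number to a character string before comparing. A minimal sketch of that behavior (the variable names are only for illustration):

```r
# Comparing a character with a number: the number is converted to text first.
num_as_text <- as.character(1)        # "1"
same_text   <- "01" == num_as_text    # FALSE: "01" and "1" are different strings
same_value  <- as.numeric("01") == 1  # TRUE: convert to numbers to compare values

print(num_as_text)
print(same_text)
print(same_value)
```

If you mean to compare numeric values, converting with as.numeric() first avoids the surprise.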
nchar()
substr()
paste()
Today we'll study real data on food safety inspections in King County, collected from data.kingcounty.gov.
Note that these data are fairly large. The following code can be used to download the data directly from my GitHub page:
load(url("https://pearce790.github.io/CSSS508/Lectures/Lecture8/restaurants.Rdata"))
names(restaurants)
##  [1] "Name"                       "Program_Identifier"
##  [3] "Inspection_Date"            "Description"
##  [5] "Address"                    "City"
##  [7] "Zip_Code"                   "Phone"
##  [9] "Longitude"                  "Latitude"
## [11] "Inspection_Business_Name"   "Inspection_Type"
## [13] "Inspection_Score"           "Inspection_Result"
## [15] "Inspection_Closed_Business" "Violation_Type"
## [17] "Violation_Description"      "Violation_Points"
## [19] "Business_ID"                "Inspection_Serial_Num"
## [21] "Violation_Record_ID"        "Grade"
## [23] "Date"
dim(restaurants)
## [1] 258630 23
Good Questions to Ask:
nchar()
The nchar() function calculates the number of characters in a given string.
length() does not count characters! On a single string it returns 1, the number of elements in the vector.
nchar("Mike Pearce")
## [1] 11
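To make the difference between the two functions concrete, here is a small sketch. The vector `words` is made up for illustration and is not part of the restaurants data.

```r
name <- "Mike Pearce"
length(name)  # 1: the vector holds one string
nchar(name)   # 11: that string has eleven characters

# Hypothetical vector of three strings
words <- c("food", "safety", "inspections")
n_elements <- length(words)  # number of strings in the vector
n_chars    <- nchar(words)   # number of characters in each string

print(n_elements)
print(n_chars)
```

nchar() is vectorized, so it returns one count per element, which is what makes it useful on a whole column.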
In our restaurants data, let's see how many characters are in each zip code:
length_zip <- nchar(restaurants$Zip_Code)
table(length_zip)
## length_zip
##      5     10
## 258629      1
substr()
The substr() function allows us to extract characters from a string.
For example, we can extract the third through fifth characters of a string as follows:
substr("98126",3,5)
## [1] "126"
substr()
Let's extract the first five characters from each zip code in the restaurants data, and add it to our dataset.
library(dplyr)
restaurants$ZIP_5 <- substr(restaurants$Zip_Code,1,5)
restaurants %>% distinct(ZIP_5) %>% head()
## # A tibble: 6 × 1
##   ZIP_5
##   <chr>
## 1 98126
## 2 98109
## 3 98101
## 4 98032
## 5 98102
## 6 98004
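Because substr() is vectorized, the same call handles both 5-character and 10-character zip codes. A minimal sketch on made-up values (the second one mimics a ZIP+4 code; neither is taken from the data):

```r
# Hypothetical zip codes: one plain, one in ZIP+4 form
zips <- c("98126", "98109-1234")

zip_lengths <- nchar(zips)        # 5 and 10 characters
zip5        <- substr(zips, 1, 5) # keep only the first five characters

print(zip_lengths)
print(zip5)
```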
paste()
We combine strings together using paste(). By default, it puts a space between different strings.
For example, we can combine "Michael" and "Pearce" as follows:
paste("Michael","Pearce")
## [1] "Michael Pearce"
paste() commands
There are two additional common arguments to use with paste():
sep= controls what separates vectors, entry-wise
collapse= controls if/how multiple outputs are collapsed into a single string.
paste("CSSS","508",sep= "_")
## [1] "CSSS_508"
paste(c("CSSS","STAT"),"508",sep= "_")
## [1] "CSSS_508" "STAT_508"
paste(c("CSSS","STAT"),"508",sep= "_",collapse=" , ")
## [1] "CSSS_508 , STAT_508"
When do we get one string as output vs. two?
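One way to answer this question is to check the length of each result. Without collapse=, paste() returns one string per element of its longest input; with collapse=, those strings are joined into a single string. A short sketch reusing the course codes from the examples above:

```r
dept <- c("CSSS", "STAT")

# sep= works entry-wise, so a length-2 input gives a length-2 output
entrywise <- paste(dept, "508", sep = "_")

# collapse= then joins the entry-wise results into one string
collapsed <- paste(dept, "508", sep = "_", collapse = " , ")

print(length(entrywise))
print(length(collapsed))
```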
paste()
Let's use paste() to create complete mailing addresses for each restaurant:
restaurants$mailing_address <- paste(restaurants$Address,", ", restaurants$City,", WA ",restaurants$ZIP_5, sep = "")
restaurants %>% distinct(mailing_address) %>% head()
## # A tibble: 6 × 1
##   mailing_address
##   <chr>
## 1 2920 SW AVALON WAY, Seattle, WA 98126
## 2 10 MERCER ST, Seattle, WA 98109
## 3 1001 FAIRVIEW AVE N Unit 1700A, SEATTLE, WA 98109
## 4 1225 1ST AVE, SEATTLE, WA 98101
## 5 18114 E VALLEY HWY, KENT, WA 98032
## 6 121 11TH AVE E, SEATTLE, WA 98102
stringr
str_length()
str_sub()
str_c()
str_to_upper(), str_to_lower(), and str_to_title()
str_trim()
str_detect()
str_replace()
stringr
stringr is yet another R package from the Tidyverse (like ggplot2, dplyr, tidyr, lubridate, readr).
It provides TONS of functions for working with strings.
Most stringr functions begin with "str_" to make RStudio auto-complete more useful.
We'll cover the basics today, but know there's much more out there!
library(stringr)
str_length()
str_length() is equivalent to nchar():
nchar("weasels")
## [1] 7
str_length("weasels")
## [1] 7
str_sub()
str_sub() is like substr():
str_sub("Washington", 2,4)
## [1] "ash"
str_sub() also lets you put in negative values to count backwards from the end (-1 is the end, -3 is third from end):
str_sub("Washington", 4, -3)
## [1] "hingt"
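Negative positions are handy when the piece you want sits at a fixed distance from the end of the string. As a small sketch, here is how the year and month could be pulled from a date written as "MM/DD/YYYY" (the value below is just an example string):

```r
library(stringr)

# Example date string in "MM/DD/YYYY" form
date <- "01/24/2017"

year  <- str_sub(date, -4, -1)  # last four characters
month <- str_sub(date, 1, 2)    # first two characters

print(year)
print(month)
```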
str_c()
str_c() ("string combine") is just like paste() but where the default is sep = "" (no space!)
str_c(c("CSSS","STAT"),508)
## [1] "CSSS508" "STAT508"
str_c(c("CSSS","STAT"),508,sep=" ")
## [1] "CSSS 508" "STAT 508"
str_c(c("CSSS","STAT"),508,sep = " ",collapse = ", ")
## [1] "CSSS 508, STAT 508"
str_to_upper(), str_to_lower(), str_to_title() convert cases, which is often a good idea to do before searching for values:
unique_cities <- unique(restaurants$City)
unique_cities %>% head()
## [1] "Seattle" "SEATTLE" "KENT" "BELLEVUE" "KENMORE" "Issaquah"
str_to_upper(unique_cities) %>% head()
## [1] "SEATTLE" "SEATTLE" "KENT" "BELLEVUE" "KENMORE" "ISSAQUAH"
str_to_lower(unique_cities) %>% head()
## [1] "seattle" "seattle" "kent" "bellevue" "kenmore" "issaquah"
str_to_title(unique_cities) %>% head()
## [1] "Seattle" "Seattle" "Kent" "Bellevue" "Kenmore" "Issaquah"
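To see why converting case first matters, compare the number of distinct values before and after. This is a minimal sketch on a made-up vector of city names, not the full City column:

```r
library(stringr)

# Hypothetical city names with inconsistent capitalization
cities <- c("Seattle", "SEATTLE", "KENT", "Kent")

n_raw   <- length(unique(cities))                # case differences count as distinct
n_clean <- length(unique(str_to_upper(cities)))  # distinct cities after standardizing

print(n_raw)
print(n_clean)
```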
str_trim()
Extra leading or trailing whitespace is common in text data:
unique_names <- unique(restaurants$Name)
unique_names %>% head(3)
## [1] "@ THE SHACK, LLC "    "10 MERCER RESTAURANT"
## [3] "100 LB CLAM"
We can remove the whitespace using str_trim():
str_trim(unique_names) %>% head(3)
## [1] "@ THE SHACK, LLC"     "10 MERCER RESTAURANT"
## [3] "100 LB CLAM"
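str_trim() also takes a side argument, and the related stringr function str_squish() additionally collapses repeated spaces inside the string. A short sketch on a made-up name with extra spaces:

```r
library(stringr)

# Hypothetical name with leading, trailing, and repeated internal spaces
messy <- "  @ THE   SHACK, LLC  "

left_only <- str_trim(messy, side = "left")  # removes leading whitespace only
both      <- str_trim(messy)                 # removes leading and trailing whitespace
squished  <- str_squish(messy)               # also collapses internal runs of spaces

print(left_only)
print(both)
print(squished)
```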
It's common to want to see if a string satisfies a certain pattern.
We did this with numeric values earlier in this course!
cars %>% filter(speed < 5 | speed > 24)
##   speed dist
## 1     4    2
## 2     4   10
## 3    25   85
cars %>% filter(dist > 2 & dist <= 10)
##   speed dist
## 1     4   10
## 2     7    4
## 3     9   10
str_detect()
We can do similar pattern-checking using str_detect():
str_detect(string,pattern)
string is the character string (or vector of strings) we want to examine
pattern is the pattern that we're checking for inside string
str_detect(string = c("Hello","my name","is Michael"), pattern = "m")
## [1] FALSE TRUE FALSE
str_detect(string = c("Hello","my name","is Michael"), pattern = "M")
## [1] FALSE FALSE TRUE
Results are case-sensitive!!
str_detect()
Let's see which phone numbers are in the 206 area code:
unique_phones <- unique(restaurants$Phone)
unique_phones %>% tail(4)
## [1] "(360) 698-0417" "(206) 525-7747" "(206) 390-9205"
## [4] "(425) 557-4474"
str_detect(unique_phones,"206") %>% tail(4)
## [1] FALSE TRUE TRUE FALSE
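One caveat: the pattern "206" matches those digits anywhere in the string, not only in the area code. A possible refinement, sketched here on made-up phone numbers, is to include the parentheses and wrap the pattern in fixed() so they are matched literally:

```r
library(stringr)

# Hypothetical numbers: the second has "206" outside the area code
phones <- c("(206) 525-7747", "(425) 206-4474")

loose  <- str_detect(phones, "206")           # matches "206" anywhere
strict <- str_detect(phones, fixed("(206)"))  # matches the literal text "(206)"

print(loose)
print(strict)
```

fixed() is needed because parentheses have a special meaning in the regular expressions that stringr uses by default.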
str_replace()
What if you want to replace part of a string with something else? Use str_replace()!
This function works very similarly to str_detect(), but with one extra argument:
str_replace(string, pattern, replacement)
replacement is the text that is substituted for the matched pattern.
str_replace(string="Hi, I'm Michael", pattern="Hi",replacement="Hello")
## [1] "Hello, I'm Michael"
str_replace()
In the Date variable, let's replace each dash ("-") with an underscore ("_")
dates <- restaurants$Date
dates %>% tail(3)
## [1] "2017-03-21" "2017-03-21" "2016-10-10"
str_replace(dates,"-","_") %>% tail(3)
## [1] "2017_03-21" "2017_03-21" "2016_10-10"
Wait, what?
str_replace_all()
str_replace() only changes the first instance of a pattern in each string!
If we want to replace all instances of a pattern, use str_replace_all():
dates <- restaurants$Date
dates %>% tail(3)
## [1] "2017-03-21" "2017-03-21" "2016-10-10"
str_replace_all(dates,"-","_") %>% tail(3)
## [1] "2017_03_21" "2017_03_21" "2016_10_10"
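Base R has the same first-versus-all distinction: sub() replaces only the first match and gsub() replaces every match. A small sketch on one of the dates shown above:

```r
d <- "2017-03-21"

first_only <- sub("-", "_", d)   # replaces the first dash only
all_dashes <- gsub("-", "_", d)  # replaces every dash

print(first_only)
print(all_dashes)
```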
We've seen lots of functions today!
Don't try to memorize them! Instead, use this page as a reference.
Character Length: nchar() and str_length()
Subsetting: substr() and str_sub()
Combining: paste() and str_c()
Case Changes: str_to_upper(), str_to_lower(), and str_to_title()
Removing Whitespace: str_trim()
Pattern Detection/Replacement: str_detect() and str_replace()
The variable Inspection_Date is in the format "MM/DD/YYYY". In this question, we'll change the format using functions for strings.
1. How long is each character string in this variable?
table(nchar(restaurants$Inspection_Date))
##
##     10
## 258000
2. Use substr() to extract the month of each entry and save it to an object called "months"
3. Use substr() to extract the year of each entry and save it to an object called "years"
months <- substr(restaurants$Inspection_Date,1,2)
years <- substr(restaurants$Inspection_Date,7,10)
4. Use paste() to combine each month and year, separated by an underscore (_). Save this as a new variable in the data called "Inspection_Date_Formatted"
restaurants$Inspection_Date_Formatted <- paste(months,years,sep="_")
restaurants %>% select(Name,Inspection_Date,Inspection_Date_Formatted) %>% head(5)
## # A tibble: 5 × 3
##   Name                   Inspection_Date Inspection_Date_Formatted
##   <chr>                  <chr>           <chr>
## 1 "@ THE SHACK, LLC "    <NA>            NA_NA
## 2 "10 MERCER RESTAURANT" 01/24/2017      01_2017
## 3 "10 MERCER RESTAURANT" 01/24/2017      01_2017
## 4 "10 MERCER RESTAURANT" 01/24/2017      01_2017
## 5 "10 MERCER RESTAURANT" 10/10/2016      10_2016
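Notice the "NA_NA" in the first row: paste() converts missing values to the text "NA" before combining. If you would rather keep those entries missing, one option is str_c(), which returns NA whenever any of its inputs is NA. A minimal sketch on made-up month and year vectors:

```r
library(stringr)

# Hypothetical inputs: the second entry is missing
months <- c("01", NA)
years  <- c("2017", NA)

with_paste <- paste(months, years, sep = "_")  # missing values become the text "NA"
with_str_c <- str_c(months, years, sep = "_")  # missing values stay missing

print(with_paste)
print(with_str_c)
```

Which behavior is preferable depends on what you plan to do with the new variable; keeping NA makes the missing dates easier to filter out later.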
Let's examine the coffee shops of King County!
1. Filter your data to only include rows in which the Name includes the word "coffee" (in any case!)
2. Create a new variable in your data which includes the length of the business name, after removing beginning/trailing whitespace.
3. Create a new variable in your data for the inspection year, using a stringr function!
4. Create side-by-side boxplots for the length of business name vs. year.
5. Calculate the maximum Inspection_Score by business and year.
6. Create a line plot of maximum score ("MaxScore") over time ("Year"), by business ("Name"). That is, you should have a single line for each business. (Don't try to label them, as there are far too many!)
1. Filter your data to only include rows in which the Name includes the word "coffee" (in any case!)
coffee <- restaurants
coffee$Name <- str_to_lower(coffee$Name)
coffee <- coffee %>% filter(str_detect(Name,"coffee"))
2. Create a new variable in your data which includes the length of the business name, after removing beginning/trailing whitespace.
3. Create a new variable in your data for the inspection year.
coffee$NameLength <- str_length(str_trim(coffee$Name))
coffee$Year <- str_sub(coffee$Inspection_Date,-4,-1)
4. Create side-by-side boxplots for the length of business name vs. year.
library(ggplot2)
ggplot(coffee,aes(Year,NameLength))+geom_boxplot()
5. Calculate the maximum Inspection_Score by business and year.
coffee_summary <- coffee %>% group_by(Name,Year) %>% summarize(MaxScore=max(Inspection_Score))
## `summarise()` has grouped output by 'Name'. You can override using
## the `.groups` argument.
6. Create a line plot of maximum score ("MaxScore") over time ("Year"), by business ("Name"). That is, you should have a single line for each business. (Don't try to label them, as there are far too many!)
ggplot(coffee_summary,aes(Year,MaxScore,group=Name))+ geom_line(alpha=.2)
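For step 1, an alternative to lower-casing the names first is to ask str_detect() to ignore case, using stringr's regex() helper. This keeps the original capitalization of Name intact. A sketch on made-up business names (not taken from the data):

```r
library(stringr)

# Hypothetical business names in mixed case
shop_names <- c("CAFFE LADRO", "Best Coffee Co", "COFFEE HOUSE")

# ignore_case = TRUE makes the match case-insensitive
is_coffee <- str_detect(shop_names, regex("coffee", ignore_case = TRUE))

print(is_coffee)
print(shop_names[is_coffee])
```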