CSSS508, Lecture 8

class: center, top, title-slide

.title[
# CSSS508, Lecture 8
]
.subtitle[
## Working with Text Data
]
.author[
### Michael Pearce<br>(based on slides from Chuck Lanfear)
]
.date[
### May 17, 2022
]

---

class:inverse

# Topics

Last time, we learned about:

1. Aside: Visualizing the Goal
1. Building blocks of functions
1. Simple functions
1. Using functions with `apply()`

Today, we will cover:

1. Basics of Strings
1. Strings in Base R
1. Strings in `stringr` (tidyverse)

---
class:inverse
# 1. Basics of Strings

---
## Basics of Strings

+ A general programming term for a unit of character data is a **string**

+ A string is a *sequence of characters*
  
  + In R, "strings" and "character data" are mostly interchangeable.
  
  + Some languages have more precise distinctions, but we won't worry about that here!
  
--

+ We can create strings by surrounding text, numbers, spaces, or symbols with quotes!

+ Examples: `"Hello! My name is Michael"` or `"%*$#01234"`

---
## Basics of Strings

R can treat strings in funny ways!

```r
"01" == "1"
```

```
## [1] FALSE
```

```r
"01" == 1
```

```
## [1] FALSE
```

```r
"1" == 1
```

```
## [1] TRUE
```

*Reminder:* We can check **data types** using the `class()` function!

```r
c(class("1"),class(1))
```

```
## [1] "character" "numeric"
```
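
One way to read the comparisons above, sketched in base R: when a string is compared with a number, R converts the number to a string first, so `"01" == 1` really compares `"01"` with `"1"`. The object names below are only for illustration.

```r
# Comparing a string with a number: R converts the number to a string first
number_as_string <- as.character(1)      # "1"
same_text <- "01" == number_as_string    # FALSE: "01" and "1" are different text

# Converting the string to a number compares the values instead
same_value <- as.numeric("01") == 1      # TRUE
```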

---
class: inverse
# 2. Strings in Base R

+ `nchar()`
+ `substr()`
+ `paste()`

---
## Data: King County Restaurant Inspections!

Today we'll study real data on **food safety inspections in King County**, collected from [data.kingcounty.gov](https://data.kingcounty.gov/Health/Food-Establishment-Inspection-Data/f29f-zza5).

Note that these data are *fairly large*. The following code can be used to download the data directly from my GitHub page:

```r
load(url("https://pearce790.github.io/CSSS508/Lectures/Lecture8/restaurants.Rdata"))
```

---
## Quick Examination of the Data

```r
names(restaurants)
```

```
##  [1] "Name"                       "Program_Identifier"        
##  [3] "Inspection_Date"            "Description"               
##  [5] "Address"                    "City"                      
##  [7] "Zip_Code"                   "Phone"                     
##  [9] "Longitude"                  "Latitude"                  
## [11] "Inspection_Business_Name"   "Inspection_Type"           
## [13] "Inspection_Score"           "Inspection_Result"         
## [15] "Inspection_Closed_Business" "Violation_Type"            
## [17] "Violation_Description"      "Violation_Points"          
## [19] "Business_ID"                "Inspection_Serial_Num"     
## [21] "Violation_Record_ID"        "Grade"                     
## [23] "Date"
```

```r
dim(restaurants)
```

```
## [1] 258630     23
```

---
## Quick Examination of the Data

**Good Questions to Ask:**

+ What does each row represent?
  + Is the data in long or wide format?
  + What are the key variables?
  + How are the data stored? (*data type*)

---
## `nchar()`

The `nchar()` function calculates the *number of characters* in a given string.

+ `length()` doesn't count the characters in a string!!
+ Why not?

```r
nchar("Mike Pearce")
```

```
## [1] 11
```

In our `restaurants` data, let's see how many characters are in each zip code:

```r
length_zip <- nchar(restaurants$Zip_Code)
table(length_zip)
```

```
## length_zip
##      5     10 
## 258629      1
```
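
As a hint for the `length()` question above, here is a small sketch on a made-up vector. `length()` counts the elements of a vector, while `nchar()` counts the characters inside each element.

```r
# A character vector with two strings (made-up examples)
x <- c("Mike Pearce", "CSSS508")

n_elements <- length(x)   # 2: how many strings are in the vector
n_chars <- nchar(x)       # 11 and 7: how many characters are in each string
```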

---
## `substr()`

The `substr()` function allows us to extract characters from a string.

For example, we can extract the third through fifth characters of a string as follows:

```r
substr("98126",3,5)
```

```
## [1] "126"
```

---
## `substr()`

Let's extract the first five characters from each zip code in the `restaurants` data, and add them to our dataset.

```r
library(dplyr)
restaurants$ZIP_5 <- substr(restaurants$Zip_Code,1,5)
restaurants %>% distinct(ZIP_5) %>% head()
```

```
## # A tibble: 6 × 1
##   ZIP_5
##   <chr>
## 1 98126
## 2 98109
## 3 98101
## 4 98032
## 5 98102
## 6 98004
```
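
To see why this helps, here is a minimal sketch on two made-up zip codes, one of them in the longer ZIP+4 style. `substr()` works on every element of a vector at once, so the long entry is cut down to its first five characters.

```r
zips <- c("98126", "98109-1234")   # the second value is a made-up ZIP+4 code

zip_lengths <- nchar(zips)         # 5 and 10
zip5 <- substr(zips, 1, 5)         # "98126" "98109"
```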

---
## `paste()`

We combine strings together using `paste()`. By default, it puts a space between different strings.

For example, we can combine `"Michael"` and `"Pearce"` as follows:

```r
paste("Michael","Pearce")
```

```
## [1] "Michael Pearce"
```

---
## More complex `paste()` commands

There are two additional common arguments to use with `paste()`:

1. `sep=` controls what separates vectors, entry-wise
1. `collapse=` controls if/how multiple outputs are collapsed into a single string.

Examples:

```r
paste("CSSS","508",sep= "_")
```

```
## [1] "CSSS_508"
```

```r
paste(c("CSSS","STAT"),"508",sep= "_")
```

```
## [1] "CSSS_508" "STAT_508"
```

```r
paste(c("CSSS","STAT"),"508",sep= "_",collapse=" , ")
```

```
## [1] "CSSS_508 , STAT_508"
```

*When do we get one string as output vs. two?*
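
One way to check your answer is to count the strings that come back. This sketch adds a third, made-up course prefix (`"SOC"`): without `collapse=` we get one string per element, and with `collapse=` everything is joined into a single string.

```r
prefixes <- c("CSSS", "STAT", "SOC")   # "SOC" is a made-up addition

# sep= works entry-wise: one output string per element of prefixes
no_collapse <- paste(prefixes, "508", sep = "_")

# collapse= then joins those outputs into one string
with_collapse <- paste(prefixes, "508", sep = "_", collapse = "; ")

n_no_collapse <- length(no_collapse)       # 3
n_with_collapse <- length(with_collapse)   # 1
```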

---
## `paste()`

Let's use `paste()` to create complete mailing addresses for each restaurant:

```r
restaurants$mailing_address <- 
  paste(restaurants$Address,", ",
        restaurants$City,", WA ",restaurants$ZIP_5,
        sep = "")
restaurants %>% distinct(mailing_address) %>% head()
```

```
## # A tibble: 6 × 1
##   mailing_address                                  
##   <chr>                                            
## 1 2920 SW AVALON WAY, Seattle, WA 98126            
## 2 10 MERCER ST, Seattle, WA 98109                  
## 3 1001 FAIRVIEW AVE N Unit 1700A, SEATTLE, WA 98109
## 4 1225 1ST AVE, SEATTLE, WA 98101                  
## 5 18114 E VALLEY HWY, KENT, WA 98032               
## 6 121 11TH AVE E, SEATTLE, WA 98102
```

---
class: inverse
# 3. Strings in `stringr`

+ `str_length()`
+ `str_sub()`
+ `str_c()`
+ `str_to_upper()`, `str_to_lower()`, and `str_to_title()`
+ `str_trim()`
+ `str_detect()`
+ `str_replace()`

---
## `stringr`

`stringr` is yet another R package from the Tidyverse (like `ggplot2`, `dplyr`, `tidyr`, `lubridate`, `readr`).

It provides TONS of functions for working with strings:

+ Some are equivalent/better versions of Base R functions
+ Some can do *fancier* tricks with strings
  
--

*Most* `stringr` functions begin with "`str_`" to make RStudio auto-complete more useful.

We'll cover the basics today, but know there's much more out there!

```r
library(stringr)
```

---
## Equivalencies: `str_length()`

`str_length()` is equivalent to `nchar()`:

```r
nchar("weasels")
```

```
## [1] 7
```

```r
str_length("weasels")
```

```
## [1] 7
```

---
## Equivalencies: `str_sub()`

`str_sub()` is like `substr()`:

```r
str_sub("Washington", 2,4)
```

```
## [1] "ash"
```

`str_sub()` also lets you put in negative values to count backwards from the end (-1 is the end, -3 is third from end):

```r
str_sub("Washington", 4, -3)
```

```
## [1] "hingt"
```
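
Negative positions are handy when the part we want sits at the end of a string. As a small sketch, using a date written in the same "MM/DD/YYYY" format as `Inspection_Date`:

```r
library(stringr)

date_string <- "01/24/2017"

year <- str_sub(date_string, -4, -1)   # last four characters: "2017"
month <- str_sub(date_string, 1, 2)    # first two characters: "01"
```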

---
## Equivalencies: `str_c()`

`str_c()` ("string combine") works like `paste()`, but the default is `sep = ""` (no space!)

```r
str_c(c("CSSS","STAT"),508)
```

```
## [1] "CSSS508" "STAT508"
```

```r
str_c(c("CSSS","STAT"),508,sep=" ")
```

```
## [1] "CSSS 508" "STAT 508"
```

```r
str_c(c("CSSS","STAT"),508,sep = " ",collapse = ", ")
```

```
## [1] "CSSS 508, STAT 508"
```
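
One difference worth knowing about is how the two functions treat missing values. In this small sketch with a made-up vector, `paste()` turns `NA` into the text `"NA"`, while `str_c()` keeps the result missing.

```r
library(stringr)

first <- c("CSSS", NA)

pasted <- paste(first, "508", sep = "")   # "CSSS508" "NA508"
combined <- str_c(first, "508")           # "CSSS508" NA
```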

---
## Changing Cases

`str_to_upper()`, `str_to_lower()`, and `str_to_title()` convert case, which is often a good idea before searching for values:

```r
unique_cities <- unique(restaurants$City)
unique_cities %>% head()
```

```
## [1] "Seattle"  "SEATTLE"  "KENT"     "BELLEVUE" "KENMORE"  "Issaquah"
```

```r
str_to_upper(unique_cities) %>% head()
```

```
## [1] "SEATTLE"  "SEATTLE"  "KENT"     "BELLEVUE" "KENMORE"  "ISSAQUAH"
```

```r
str_to_lower(unique_cities) %>% head()
```

```
## [1] "seattle"  "seattle"  "kent"     "bellevue" "kenmore"  "issaquah"
```

```r
str_to_title(unique_cities) %>% head()
```

```
## [1] "Seattle"  "Seattle"  "Kent"     "Bellevue" "Kenmore"  "Issaquah"
```
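
To see why this matters for searching, here is a sketch on a few city values like those above. Before converting case, `"Seattle"` and `"SEATTLE"` count as two different cities.

```r
library(stringr)

cities <- c("Seattle", "SEATTLE", "KENT", "Issaquah")

n_before <- length(unique(cities))                 # 4: both spellings of Seattle are counted
n_after <- length(unique(str_to_upper(cities)))    # 3: both become "SEATTLE"
```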

---
## Whitespace: `str_trim()`

Extra leading or trailing whitespace is common in text data:

```r
unique_names <- unique(restaurants$Name)
unique_names %>% head(3)
```

```
## [1] "@ THE SHACK, LLC "    "10 MERCER RESTAURANT"
## [3] "100 LB CLAM"
```

We can remove the whitespace using `str_trim()`:

```r
str_trim(unique_names) %>% head(3)
```

```
## [1] "@ THE SHACK, LLC"     "10 MERCER RESTAURANT"
## [3] "100 LB CLAM"
```
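
Two related options, sketched on a made-up name with extra spaces: the `side=` argument of `str_trim()` trims only one end, and `str_squish()` also collapses repeated spaces inside the string.

```r
library(stringr)

messy <- "  10   MERCER RESTAURANT "          # made-up example with extra spaces

left_only <- str_trim(messy, side = "left")   # removes leading whitespace only
squished <- str_squish(messy)                 # trims both ends, collapses inner spaces
```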

---
## Patterns!

It's common to want to see if a string satisfies a certain *pattern*.

We did this with numeric values earlier in this course!

```r
cars %>% filter(speed < 5 | speed > 24)
```

```
##   speed dist
## 1     4    2
## 2     4   10
## 3    25   85
```

```r
cars %>% filter(dist > 2 & dist <= 10)
```

```
##   speed dist
## 1     4   10
## 2     7    4
## 3     9   10
```

---
## Patterns: `str_detect()`

We can do similar pattern-checking using `str_detect()`:

```r
str_detect(string,pattern)
```

+ `string` is the character string (or vector of strings) we want to examine
+ `pattern` is the pattern that we're checking for inside `string`
+ Output: TRUE/FALSE vector indicating if pattern was found

```r
str_detect(string = c("Hello","my name","is Michael"),
           pattern = "m")
```

```
## [1] FALSE  TRUE FALSE
```

```r
str_detect(string = c("Hello","my name","is Michael"),
           pattern = "M")
```

```
## [1] FALSE FALSE  TRUE
```

Results are case-sensitive!!

---
## Patterns: `str_detect()`

Let's see which phone numbers are in the 206 area code:

```r
unique_phones <- unique(restaurants$Phone)
unique_phones %>% tail(4)
```

```
## [1] "(360) 698-0417" "(206) 525-7747" "(206) 390-9205"
## [4] "(425) 557-4474"
```

```r
str_detect(unique_phones,"206") %>% tail(4)
```

```
## [1] FALSE  TRUE  TRUE FALSE
```
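
A caution: the pattern `"206"` matches anywhere in the string, not only in the area code. One possible fix, sketched below with one made-up phone number, is to search for the literal text `"(206)"` by wrapping the pattern in `fixed()`, which treats the parentheses as plain characters.

```r
library(stringr)

phones <- c("(206) 525-7747", "(425) 206-1234")   # the second number is made up

anywhere <- str_detect(phones, "206")             # TRUE TRUE
area_code <- str_detect(phones, fixed("(206)"))   # TRUE FALSE
```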

---
## Replacement: `str_replace()`

What if you want to replace part of a string with something else? Use `str_replace()`!

This function works very similarly to `str_detect()`, but with one extra argument:

```r
str_replace(string, pattern, replacement)
```

+ `replacement` is the text that is substituted for `pattern`.

```r
str_replace(string="Hi, I'm Michael",
            pattern="Hi",replacement="Hello")
```

```
## [1] "Hello, I'm Michael"
```

---
## Replacement: `str_replace()`

In the `Date` variable, let's replace each dash ("-") with an underscore ("_")

```r
dates <- restaurants$Date
dates %>% tail(3)
```

```
## [1] "2017-03-21" "2017-03-21" "2016-10-10"
```

```r
str_replace(dates,"-","_") %>% tail(3)
```

```
## [1] "2017_03-21" "2017_03-21" "2016_10-10"
```

Wait, what?

---
## Replacement: `str_replace_all()`

`str_replace()` only changes the **first** instance of a pattern in each string!

If we want to replace **all** instances of a pattern, use `str_replace_all()`

```r
dates <- restaurants$Date
dates %>% tail(3)
```

```
## [1] "2017-03-21" "2017-03-21" "2016-10-10"
```

```r
str_replace_all(dates,"-","_") %>% tail(3)
```

```
## [1] "2017_03_21" "2017_03_21" "2016_10_10"
```
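
For comparison, Base R has a similar pair of functions, `sub()` and `gsub()`. This is only a sketch of the parallel; note that the argument order differs (pattern first, string last), and `fixed = TRUE` asks for a literal match.

```r
dates <- c("2017-03-21", "2016-10-10")

first_only <- sub("-", "_", dates, fixed = TRUE)    # like str_replace(): first dash only
every_dash <- gsub("-", "_", dates, fixed = TRUE)   # like str_replace_all(): every dash
```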

---
class:inverse
# Quick Summary

We've seen lots of functions today!

*Don't try to memorize them!* Instead, use this page as a reference.

+ Character Length: `nchar()` and `str_length()`

+ Subsetting: `substr()` and `str_sub()`

+ Combining: `paste()` and `str_c()`

+ Case Changes: `str_to_upper()`, `str_to_lower()`, and `str_to_title()`

+ Removing Whitespace: `str_trim()`

+ Pattern Detection/Replacement: `str_detect()`, `str_replace()`, and `str_replace_all()`

---
class:inverse
# Activity 1: Base R Functions

The variable `Inspection_Date` is in the format "MM/DD/YYYY". In this question, we'll change the format using functions for strings.

1. How long is each character string in this variable?
1. Use `substr()` to extract the month of each entry and save it to an object called "months"
1. Use `substr()` to extract the year of each entry and save it to an object called "years"
1. Use `paste()` to combine each month and year, separated by an underscore (`_`). Save this as a new variable in the data called "Inspection_Date_Formatted"

---
## Activity: My Answers

The variable `Inspection_Date` is in the format "MM/DD/YYYY". In this question, we'll change the format using functions for strings.

1\. How long is each character string in this variable?

```r
table(nchar(restaurants$Inspection_Date))
```

```
## 
##     10 
## 258000
```

2\. Use `substr()` to extract the month of each entry and save it to an object called "months"

3\. Use `substr()` to extract the year of each entry and save it to an object called "years"

```r
months <- substr(restaurants$Inspection_Date,1,2)
years <- substr(restaurants$Inspection_Date,7,10)
```

---
## Activity: My Answers

4\. Use `paste()` to combine each month and year, separated by an underscore (`_`). Save this as a new variable in the data called "Inspection_Date_Formatted"

```r
restaurants$Inspection_Date_Formatted <- 
  paste(months,years,sep="_")
restaurants %>% 
  select(Name,Inspection_Date,Inspection_Date_Formatted) %>%
  head(5)
```

```
## # A tibble: 5 × 3
##   Name                   Inspection_Date Inspection_Date_Formatted
##   <chr>                  <chr>           <chr>                    
## 1 "@ THE SHACK, LLC "    <NA>            NA_NA                    
## 2 "10 MERCER RESTAURANT" 01/24/2017      01_2017                  
## 3 "10 MERCER RESTAURANT" 01/24/2017      01_2017                  
## 4 "10 MERCER RESTAURANT" 01/24/2017      01_2017                  
## 5 "10 MERCER RESTAURANT" 10/10/2016      10_2016
```
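
Notice the `NA_NA` in the first row: `paste()` turns missing values into the text `"NA"`. If you would rather keep those entries missing, one possible approach (a sketch on a made-up two-element vector, not part of the activity) is to wrap the result in `ifelse()`:

```r
dates <- c(NA, "01/24/2017")          # one missing date, one real-looking date

months <- substr(dates, 1, 2)
years <- substr(dates, 7, 10)

pasted <- paste(months, years, sep = "_")             # "NA_NA" "01_2017"
kept_missing <- ifelse(is.na(dates), NA_character_,
                       paste(months, years, sep = "_"))   # NA "01_2017"
```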

---
class:inverse
# Activity 2: HW 8

Let's examine the coffee shops of King County!

1. Filter your data to only include rows in which the `Name` includes the word "coffee" (in any case!)

1. Create a new variable in your data which includes the length of the business name, after removing beginning/trailing whitespace.

1. Create a new variable in your data for the inspection year, *using a `stringr` function!*

1. Create side-by-side boxplots for the length of business name vs. year.

1. Calculate the maximum `Inspection_Score` by business and year.

1. Create a line plot of maximum score ("MaxScore") over time ("Year"), by business ("Name"). That is, you should have a single line for each business. (Don't try to label them, as there are far too many!)

---
## Activity: My Solutions

1\. Filter your data to only include rows in which the `Name` includes the word "coffee" (in any case!)

```r
coffee <- restaurants
coffee$Name <- str_to_lower(coffee$Name)
coffee <- coffee %>% filter(str_detect(Name,"coffee"))
```
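
An alternative that leaves the original `Name` values untouched, sketched here on made-up business names: `stringr` lets you wrap a pattern in `regex()` with `ignore_case = TRUE`, so the match ignores case without converting the strings first.

```r
library(stringr)

shop_names <- c("TEA HOUSE", "COFFEE CORNER", "Joe's Coffee Bar")   # made-up names

has_coffee <- str_detect(shop_names, regex("coffee", ignore_case = TRUE))
# FALSE TRUE TRUE
```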

---
## Activity: My Solutions

2\. Create a new variable in your data which includes the length of the business name, after removing beginning/trailing whitespace.

3\. Create a new variable in your data for the inspection year.

```r
coffee$NameLength <- str_length(str_trim(coffee$Name))
coffee$Year <- str_sub(coffee$Inspection_Date,-4,-1)
```

---
## Activity: My Solutions

4\. Create side-by-side boxplots for the length of business name vs. year.

```r
library(ggplot2)
ggplot(coffee,aes(Year,NameLength))+geom_boxplot()
```

![](CSSS508_Lecture8_strings_files/figure-html/unnamed-chunk-24-1.png)

---
## Activity: My Solutions

5\. Calculate the maximum `Inspection_Score` by business and year.

```r
coffee_summary <- coffee %>% group_by(Name,Year) %>% 
  summarize(MaxScore=max(Inspection_Score))
```

```
## `summarise()` has grouped output by 'Name'. You can override using
## the `.groups` argument.
```
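
The message above is only a note about grouping, not an error. As a sketch on a small made-up table (the `toy` data below is hypothetical), setting `.groups = "drop"` returns an ungrouped result and silences the message, and `na.rm = TRUE` tells `max()` to skip missing scores.

```r
library(dplyr)

# Made-up data with the same column names as the coffee data
toy <- tibble(Name = c("a", "a", "b"),
              Year = c("2016", "2016", "2017"),
              Inspection_Score = c(5, NA, 10))

toy_summary <- toy %>%
  group_by(Name, Year) %>%
  summarize(MaxScore = max(Inspection_Score, na.rm = TRUE),
            .groups = "drop")
```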

---
## Activity: My Solutions

6\. Create a line plot of maximum score ("MaxScore") over time ("Year"), by business ("Name"). That is, you should have a single line for each business. (Don't try to label them, as there are far too many!)

```r
ggplot(coffee_summary,aes(Year,MaxScore,group=Name))+
  geom_line(alpha=.2)
```

![](CSSS508_Lecture8_strings_files/figure-html/unnamed-chunk-26-1.png)