  • Instructions
  • Homework 5
    • Importing the Data
    • Exploratory Data Analysis (EDA)
      • Precinct
      • Race
      • CounterGroup
      • Party
      • CounterType
      • SumOfCount
    • The quantities of interest
    • Filtering down the data
    • Seattle precincts
    • Registered voters and turnout rates
    • Quick Plot!
  • Homework 6
    • Calculating number of votes for Democrats
    • Combining with previous vote data by precinct and race
    • Calculating Democratic support
    • Is Democratic support different by race or Seattle precincts?
    • What if we only consider “large” precincts?

Instructions

For Homework 6, you will fill out this RMarkdown template. The first half is both a key for HW 5, and code to get you set up for HW 6! As before, create code (in chunks) and write text answers (outside chunks) to answer the provided questions in an analysis of King County election data from 2016. IMPORTANT: Do NOT add any additional code chunks, and do NOT modify any chunk options!

Homework 5

Importing the Data

Download the data from https://raw.githubusercontent.com/breonh/breonh.github.io/main/csss_508/homework/homework_5/king_county_elections_2016.txt. It is a plain text file of data, about 60 MB in size. Values are separated with commas. Read the file into R. Note the cache=TRUE chunk option, which lets R store the chunk’s results between “knits” of the RMarkdown document, so the file does not have to be downloaded and parsed again each time.

king_county_elections_2016 <- read_csv("https://raw.githubusercontent.com/breonh/breonh.github.io/main/csss_508/homework/homework_5/king_county_elections_2016.txt")

Exploratory Data Analysis (EDA)

Use functions str and/or summary to look at the data. Describe the data in their current state. How many rows are there? What variables are there? What kinds of values do they take? Are the column types sensible?

dim(king_county_elections_2016)
## [1] 643163      9
summary(king_county_elections_2016)
##    Precinct             Race                LEG              CC      
##  Length:643163      Length:643163      Min.   : 1.00   Min.   :1.00  
##  Class :character   Class :character   1st Qu.:32.00   1st Qu.:3.00  
##  Mode  :character   Mode  :character   Median :37.00   Median :5.00  
##                                        Mean   :34.91   Mean   :4.86  
##                                        3rd Qu.:45.00   3rd Qu.:7.00  
##                                        Max.   :48.00   Max.   :9.00  
##                                        NA's   :224     NA's   :224   
##        CG       CounterGroup          Party           CounterType       
##  Min.   :1.00   Length:643163      Length:643163      Length:643163     
##  1st Qu.:7.00   Class :character   Class :character   Class :character  
##  Median :7.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :7.03                                                           
##  3rd Qu.:9.00                                                           
##  Max.   :9.00                                                           
##                                                                         
##    SumOfCount    
##  Min.   :   0.0  
##  1st Qu.:   4.0  
##  Median : 121.0  
##  Mean   : 198.5  
##  3rd Qu.: 330.0  
##  Max.   :1126.0  
## 

The data include 643,163 rows and 9 columns. The variables are Precinct, Race, LEG, CC, CG, CounterGroup, Party, CounterType, and SumOfCount. LEG, CC, CG, and SumOfCount are numeric, and the rest are character. The current column types seem sensible.

This real-world election data is provided to you in “tidy” format! That is, each row is an observation: The number of votes given to a candidate/ballot measure/answer type in a given political race across voters in a precinct. We will ignore the variables LEG, CC, and CG as they are not of practical use. For the remaining variables, we will explore them graphically and attempt to figure out what they mean. Remember that in real world data work, you often have to get by with intuition or poking around online to figure out the nature of the data.

In each code chunk below, present a summary of each variable on its own (such as a histogram, frequency table, barplot, etc.). If there are many categories, it’s fine to print just the first few. If you create a figure, use ggplot. Afterwards, write a one-sentence interpretation of each summary.

Precinct

table(king_county_elections_2016$Precinct) %>% head(20)
## 
##         ADAIR       ALDARRA ALDER SPRINGS     ALDERWOOD   ALG 30-0013 
##           237           257           237           237           250 
##   ALG 30-0014   ALG 30-3141        ALPINE     AMES LAKE    ANGEL CITY 
##           250           250           251           251           244 
##        ANGELO          ARIA        ARTHUR    ASPEN GLEN   AUB 30-0046 
##           257           250           251           237           250 
##   AUB 30-0053   AUB 30-0067   AUB 30-2702   AUB 30-2703   AUB 30-3476 
##           250           250           250           250           250

This variable appears to contain the names of election precincts in King County. Each of the precincts shown appears in a similar number of rows (between 237 and 257).

Race

table(king_county_elections_2016$Race) %>% head(20)
## 
##                             Advisory Vote 14 
##                                        15102 
##                             Advisory Vote 15 
##                                        15102 
##                             Attorney General 
##                                        17619 
## Auburn School District No. 408 Proposition 1 
##                                          522 
##               City of Bellevue Proposition 1 
##                                         1014 
##               City of Bellevue Proposition 2 
##                                         1014 
##       City of Bothell Advisory Proposition 1 
##                                          156 
##                City of Bothell Proposition 1 
##                                          156 
##               City of Duvall Advisory Vote 1 
##                                           42 
##                 City of Duvall Proposition 1 
##                                           42 
##               City of Issaquah Proposition 1 
##                                          222 
##                City of Kenmore Proposition 1 
##                                          156 
##       City of Seattle Initiative Measure 124 
##                                         5772 
##              City of Shoreline Proposition 1 
##                                          522 
##             City of Snoqualmie Proposition 1 
##                                           72 
##                City of Tukwila Proposition 1 
##                                          108 
##                 Commissioner of Public Lands 
##                                        17619 
##                     Congressional District 1 
##                                         2268 
##                     Congressional District 7 
##                                         7007 
##                     Congressional District 8 
##                                         3059

This variable seems to contain election races from King County in 2016.

CounterGroup

table(king_county_elections_2016$CounterGroup)
## 
##  Total 
## 643163

Every value in this variable is “Total”, which makes the variable useless!

Party

ggplot(king_county_elections_2016,aes(Party))+
  geom_bar()+theme_bw()+ylab("Count")+ggtitle("Number of Observations by Party")

This variable seems to contain political parties, although “NP” (which probably stands for “No Party” or “Non-Partisan”) is very prevalent. These rows probably come from non-partisan races.

CounterType

table(king_county_elections_2016$CounterType) %>% head(10)
## 
##                    Adam Smith                Alvin Rutledge 
##                           755                           106 
## Alyson Kennedy & Osborne Hart                Andrew Pilloud 
##                          2517                           209 
##                  Anthony Gipe                      Approved 
##                          2517                          5533 
##                Barbara Madsen                 Barry Knowles 
##                          2517                           136 
##         Benjamin Judah Phelps                   Bill Bryant 
##                           174                          2517

This variable seems to contain mostly candidate names, along with ballot-measure responses such as “Approved”.

Notice something odd about CounterType in particular? It tells you what a given row of votes was for… but it also has Registered Voters and Times Counted. What are these values?
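One way to look at these special values is to filter for them directly. The sketch below uses a small made-up table: the precinct name, candidate labels, and counts are invented for illustration, and only the column names and the two CounterType labels come from the data above.

```r
library(dplyr)

# Toy stand-in for the real data; every value here is invented.
toy <- data.frame(
  Precinct    = "P1",
  Race        = "Governor",
  CounterType = c("Candidate A", "Candidate B",
                  "Registered Voters", "Times Counted"),
  SumOfCount  = c(60, 35, 120, 100)
)

# Keep only the bookkeeping rows, which are not votes for any candidate
special <- toy %>%
  filter(CounterType %in% c("Registered Voters", "Times Counted"))
special
```

In this toy table, the “Times Counted” value is at least as large as the sum of the candidate rows, which is the pattern you would expect if it counts every submitted ballot for that race.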

SumOfCount

ggplot(king_county_elections_2016,aes(SumOfCount))+geom_histogram(bins=30)+
  xlab("Number of Votes by Race and Precinct")+ylab("Count")+theme_bw()

This variable seems to include the number of votes by precinct and race. Many precincts have no votes for particular races, it appears.

The quantities of interest

In this assignment (and the next), we will focus on three major races in Washington in 2016:

  • “US President & Vice President”
  • “Governor”
  • “Lieutenant Governor”

With these races, we are interested in:

  1. Turnout rates for each of these races in each precinct. We will measure turnout as total number of submitted votes (including for a candidate, blank, write-in, or “over vote”) divided by the number of registered voters.
  2. Differences between precincts in Seattle and precincts elsewhere in King County.
  3. Precinct-level support for the Democratic candidates in King County in 2016 for each contest. We will measure support as the percentage of votes in a precinct for the Democratic candidate out of all votes for candidates or write-ins.

You will answer Questions #1 and #2 in this homework (Question #3 will be completed in homework 6). The sections below describe steps you may want to take to answer Questions 1 and 2. I suggest loading dplyr and tidyr (in the very first code chunk of this Rmd) to start!

Filtering down the data

For what we want to do, there are a lot of rows that are not useful. Start by filtering the dataset to include only rows in which the race is one of: “US President & Vice President”, “Governor”, or “Lieutenant Governor”. Save this subsetted dataset as a new object, called king_county_elections_2016_Exec.

king_county_elections_2016_Exec <- king_county_elections_2016 %>% 
  filter(Race %in% c("US President & Vice President", "Governor", "Lieutenant Governor"))
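The filter above uses %in% rather than ==. The short sketch below, on a hand-written vector of race names (the vector itself is just an illustration), shows why: == compares element by element and silently recycles the shorter vector, while %in% tests membership.

```r
# Illustrative vector of race labels (six entries, invented ordering)
races <- c("Governor", "Lieutenant Governor",
           "US President & Vice President", "Governor",
           "Attorney General", "Governor")
keep <- c("US President & Vice President", "Governor", "Lieutenant Governor")

# Membership test: TRUE wherever the race is one of the three we want
in_keep <- races %in% keep

# Element-wise comparison: `keep` is recycled position by position,
# so a match only counts if it happens to line up
eq_keep <- races == keep

sum(in_keep)
sum(eq_keep)
```

Here %in% flags five of the six rows, while == flags none of them, even though five of the races are on the list.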

Seattle precincts

In our subsetted data, we want to add a variable indicating whether a precinct belongs to Seattle. The following code creates a character vector with the value “Seattle” or “Not Seattle” for each row. Using this code, add it to your dataset king_county_elections_2016_Exec as a new variable called “Seattle”.

ifelse(substr(king_county_elections_2016_Exec$Precinct, start = 1, stop = 4) == "SEA ","Seattle","Not Seattle")

king_county_elections_2016_Exec$Seattle <- ifelse(substr(king_county_elections_2016_Exec$Precinct, start = 1, stop = 4) == "SEA ","Seattle","Not Seattle")
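To see what the substr() comparison does, here is a minimal sketch on a few precinct names. Apart from ADAIR, the names are hypothetical and were made up to show why the code compares four characters including the trailing space.

```r
# Hypothetical precinct names (only "ADAIR" appears in the output above)
precincts <- c("SEA 37-1234", "SEATAC 33-0001", "ADAIR", "SEA 43-2001")

# Same logic as above: the first four characters must be exactly "SEA "
seattle <- ifelse(substr(precincts, start = 1, stop = 4) == "SEA ",
                  "Seattle", "Not Seattle")
seattle

# Base R's startsWith() expresses the same prefix test
is_seattle <- startsWith(precincts, "SEA ")
```

Because the comparison includes the space, a name that merely begins with the letters “SEA” (like the made-up “SEATAC 33-0001”) is not flagged as Seattle.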

Registered voters and turnout rates

We want to calculate turnout rates for each race. We define $\text{Turnout} = \frac{\text{TotalVotes}}{\text{RegisteredVoters}}$, where total votes are listed in rows where the variable CounterType equals ‘Times Counted’, and registered voters are listed in rows where CounterType equals ‘Registered Voters’. We can calculate turnout rates in three steps: total votes by race/precinct, registered voters by race/precinct, and finally turnout by race/precinct. Let’s do it!
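As a quick check of the definition, the sketch below plugs in the ADAIR counts that appear in the Votes output later in this section (485 total votes and 519 registered voters).

```r
# Counts for the ADAIR precinct, taken from the Votes output in this document
total_votes       <- 485
registered_voters <- 519

# Turnout is total votes divided by registered voters
turnout <- total_votes / registered_voters
round(turnout, 3)
```

The rounded result matches the Turnout value printed for ADAIR (0.934).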

First, create a dataset called Votes in which you filter king_county_elections_2016_Exec to contain rows only where CounterType == ‘Times Counted’. You should now have one row per Precinct/Race. Add a variable called “TotalVotes” to the Votes dataset, which contains the number of total votes by race/precinct (currently in the “SumOfCount” variable in Votes).

Votes <- king_county_elections_2016_Exec %>% filter(CounterType == 'Times Counted')
Votes$TotalVotes <- Votes$SumOfCount

Second, create a dataset called Registered which contains rows only where CounterType == ‘Registered Voters’. The “SumOfCount” variable in this dataset includes the number of registered voters by precinct/race. Add a variable called “Registered” to the Votes dataset, which contains the number of registered voters by race/precinct (currently in the “SumOfCount” variable in Registered).

Registered <- king_county_elections_2016_Exec %>% filter(CounterType == 'Registered Voters')
Votes$Registered <- Registered$SumOfCount
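The column assignment above works when Votes and Registered list the same Precinct/Race pairs in the same row order. If you are not sure that holds, a join on the two keys is a more defensive option. The sketch below is an alternative, not the approach used in this key, and the toy data frames and their values are invented.

```r
library(dplyr)

# Toy data; note that the registered rows are in a different order
votes <- data.frame(
  Precinct   = c("A", "A", "B"),
  Race       = c("Governor", "Lieutenant Governor", "Governor"),
  TotalVotes = c(90, 90, 40)
)
registered <- data.frame(
  Precinct   = c("B", "A", "A"),
  Race       = c("Governor", "Governor", "Lieutenant Governor"),
  Registered = c(50, 100, 100)
)

# Match rows by key instead of by position
joined <- left_join(votes, registered, by = c("Precinct", "Race"))
joined$Turnout <- joined$TotalVotes / joined$Registered
joined
```

With a positional assignment, precinct A in this toy example would have been paired with precinct B's registration count. The join pairs each row with its own precinct and race.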

Third, create a new variable in Votes called “Turnout”, which includes turnout calculated by dividing total votes by registered voters. Then, subset Votes to contain only the following variables: Precinct, Seattle, Race, TotalVotes, Registered, and Turnout. Display the first 10 rows of Votes, and print the number of rows/columns (there should be 7551 rows and 6 columns!).

Votes$Turnout <- Votes$TotalVotes/Votes$Registered
Votes <- Votes %>% select(Precinct,Seattle,Race,TotalVotes,Registered,Turnout)
Votes %>% head(10)
## # A tibble: 10 × 6
##    Precinct      Seattle     Race                        Total…¹ Regis…² Turnout
##    <chr>         <chr>       <chr>                         <dbl>   <dbl>   <dbl>
##  1 ADAIR         Not Seattle Governor                        485     519   0.934
##  2 ADAIR         Not Seattle Lieutenant Governor             485     519   0.934
##  3 ADAIR         Not Seattle US President & Vice Presid…     485     519   0.934
##  4 ALDARRA       Not Seattle Governor                        625     763   0.819
##  5 ALDARRA       Not Seattle Lieutenant Governor             625     763   0.819
##  6 ALDARRA       Not Seattle US President & Vice Presid…     625     763   0.819
##  7 ALDER SPRINGS Not Seattle Governor                        476     557   0.855
##  8 ALDER SPRINGS Not Seattle Lieutenant Governor             476     557   0.855
##  9 ALDER SPRINGS Not Seattle US President & Vice Presid…     476     557   0.855
## 10 ALDERWOOD     Not Seattle Governor                        404     472   0.856
## # … with abbreviated variable names ¹​TotalVotes, ²​Registered
dim(Votes)
## [1] 7551    6

Quick Plot!

Create ggplot histograms of turnout rates, first by “Race” and second by “Seattle”. Do you notice any changes in turnout based on race or whether or not a precinct was located in Seattle?

ggplot(Votes,aes(Turnout))+theme_bw()+
  geom_histogram(bins=30)+facet_wrap(~Race)+
  xlab("Turnout Rate")+ylab("Count")+
  ggtitle("Turnout Rate by Executive Race",
          "2016 King County Election Results")+
  xlim(c(0,1))
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 6 rows containing missing values (`geom_bar()`).

ggplot(Votes,aes(Turnout))+theme_bw()+
  geom_histogram(bins=30)+facet_wrap(~Seattle)+
  xlab("Turnout Rate")+ylab("Count")+
  ggtitle("Turnout Rate by Seattle vs. Non-Seattle Precinct",
          "2016 King County Election Results")+
  xlim(c(0,1))
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).

Turnout seems quite consistent across different races; however, turnout in Seattle seems slightly higher on average than turnout in other parts of King County.
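The warnings above report a few rows dropped from each histogram. One possible cause (an assumption, not checked against the data here) is precincts with zero registered voters, where the division produces NaN or Inf. Values falling outside the xlim range would be dropped as well. The sketch below uses invented counts to show how such rows could be spotted.

```r
# Invented counts: one ordinary precinct and two with no registered voters
total_votes <- c(485, 0, 3)
registered  <- c(519, 0, 0)

turnout <- total_votes / registered   # 0/0 gives NaN, 3/0 gives Inf

# Rows that a histogram could not place on a finite axis
not_finite <- !is.finite(turnout)
sum(not_finite)
```

Running the same is.finite() check on Votes$Turnout would show whether this explanation applies to the real data.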

That’s it for Homework 5!

Homework 6

In today’s homework, we’ll analyze the support for Democratic candidates by precinct and political race using 2016 King County Election data. Prior to editing the code below, be sure you’ve run all the code above (you can run all chunks above by clicking the downward-facing arrow in the upper-right corner of the first code chunk beneath this text).

Calculating number of votes for Democrats

Let’s first create a data.frame called DemVotes which includes three variables: “Precinct”, “Race”, and “DemVotes”, which contains the total number of votes received by candidates from the Democratic party in each precinct and political race. To create this dataset, I recommend following these steps:

  1. Start with king_county_elections_2016_Exec, and filter to rows in which “Party” is either “Dem” or “DPN” (both correspond to Democrats).
  2. group_by Precinct and Race
  3. summarize the total number of votes (i.e., sum together the values in the “SumOfCount” variable) into a new variable called “DemVotes”.
# [CODE HERE]
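The filter, group_by, and summarize pattern described in the steps above can be sketched on a toy table before applying it to the real data. Everything in this example is invented for illustration, including the object names toy_exec and toy_dem and the “Rep” and “RPN” party labels. Only the column names and the “Dem”/“DPN” labels come from the assignment text.

```r
library(dplyr)

# Toy rows for a single precinct; counts and non-Democratic labels are invented
toy_exec <- data.frame(
  Precinct   = "A",
  Race       = c("Governor", "Governor",
                 "US President & Vice President",
                 "US President & Vice President"),
  Party      = c("Dem", "Rep", "DPN", "RPN"),
  SumOfCount = c(50, 30, 55, 25)
)

toy_dem <- toy_exec %>%
  filter(Party %in% c("Dem", "DPN")) %>%      # keep Democratic rows only
  group_by(Precinct, Race) %>%                # one group per precinct and race
  summarize(DemVotes = sum(SumOfCount),       # add up votes within each group
            .groups = "drop")
toy_dem
```

The result has one row per Precinct/Race pair, which is the shape needed for joining with the Votes data in the next step.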

Combining with previous vote data by precinct and race

Now, use the left_join function to merge the “Votes” data from last week with the “DemVotes” data you just created. Be sure you have merged the two data.frames by the “Precinct” and “Race” variables. Save the joined data.frame as “Votes_HW6”.

# [CODE HERE]

Calculating Democratic support

Now, create a new variable in the “Votes_HW6” data.frame called “DemPercent”, in which DemPercent is calculated by dividing the number of votes received by Democrats by the total number of votes (in each precinct and race). Create a histogram in ggplot of the “DemPercent” variable. (NOTE: All values should be between 0 and 1!!!). Write a one-sentence interpretation of the distribution you observe.

# [CODE HERE]

ANSWER: [ANSWER HERE!]

Is Democratic support different by race or Seattle precincts?

Using the facet_grid() or facet_wrap() layer functions, create side-by-side histograms in ggplot to explore whether or not Democratic support varies based on (1) precincts located in Seattle or not, and (2) political race. Write a one- or two-sentence explanation of your findings.

# [CODE HERE]

ANSWER: [ANSWER HERE!]

What if we only consider “large” precincts?

Repeat the previous question, but this time using only data from precincts with at least 500 registered voters. Are the patterns you observed any different now?

# [CODE HERE]

ANSWER: [ANSWER HERE!]