For Homework 6, you will fill out this RMarkdown template. The first half is both a key for HW 5, and code to get you set up for HW 6! As before, create code (in chunks) and write text answers (outside chunks) to answer the provided questions in an analysis of King County election data from 2016. IMPORTANT: Do NOT add any additional code chunks, and do NOT modify any chunk options!
Download the data from https://raw.githubusercontent.com/breonh/breonh.github.io/main/csss_508/homework/homework_5/king_county_elections_2016.txt. It is a plain text file of data, about 60 MB in size. Values are separated with commas. Read the file into R. Note the cache=TRUE chunk option, which lets R store the result of this chunk between “knits” of the RMarkdown document, so the file does not have to be downloaded and parsed again each time.
king_county_elections_2016 <- read_csv("https://raw.githubusercontent.com/breonh/breonh.github.io/main/csss_508/homework/homework_5/king_county_elections_2016.txt")
Use functions str and/or summary to look at the data. Describe the data in their current state. How many rows are there? What variables are there? What kinds of values do they take? Are the column types sensible?
dim(king_county_elections_2016)
## [1] 643163 9
summary(king_county_elections_2016)
## Precinct Race LEG CC
## Length:643163 Length:643163 Min. : 1.00 Min. :1.00
## Class :character Class :character 1st Qu.:32.00 1st Qu.:3.00
## Mode :character Mode :character Median :37.00 Median :5.00
## Mean :34.91 Mean :4.86
## 3rd Qu.:45.00 3rd Qu.:7.00
## Max. :48.00 Max. :9.00
## NA's :224 NA's :224
## CG CounterGroup Party CounterType
## Min. :1.00 Length:643163 Length:643163 Length:643163
## 1st Qu.:7.00 Class :character Class :character Class :character
## Median :7.00 Mode :character Mode :character Mode :character
## Mean :7.03
## 3rd Qu.:9.00
## Max. :9.00
##
## SumOfCount
## Min. : 0.0
## 1st Qu.: 4.0
## Median : 121.0
## Mean : 198.5
## 3rd Qu.: 330.0
## Max. :1126.0
##
The data includes 643,163 rows and 9 columns. The variables are Precinct, Race, LEG, CC, CG, CounterGroup, Party, CounterType, and SumOfCount, some of which are numeric and some of which are character. The current column types (numeric vs. character) seem sensible.
This real-world election data is provided to you in “tidy” format! That is, each row is an observation: the number of votes given to a candidate/ballot measure/answer type in a given political race across voters in a precinct. We will ignore the variables LEG, CC, and CG as they are not of practical use. For the remaining variables, we will explore them graphically and attempt to figure out what they mean. Remember that in real-world data work, you often have to get by with intuition or poking around online to figure out the nature of the data.
In each code chunk below, present a summary of each variable on its own (such as a histogram, frequency table, barplot, etc.). If there are many categories, it’s fine to print just the first few. If you create a figure, use ggplot. Afterward, write a one-sentence interpretation for each summary.
table(king_county_elections_2016$Precinct) %>% head(20)
##
## ADAIR ALDARRA ALDER SPRINGS ALDERWOOD ALG 30-0013
## 237 257 237 237 250
## ALG 30-0014 ALG 30-3141 ALPINE AMES LAKE ANGEL CITY
## 250 250 251 251 244
## ANGELO ARIA ARTHUR ASPEN GLEN AUB 30-0046
## 257 250 251 237 250
## AUB 30-0053 AUB 30-0067 AUB 30-2702 AUB 30-2703 AUB 30-3476
## 250 250 250 250 250
This variable appears to contain election precincts from King County, in slightly varying numbers.
table(king_county_elections_2016$Race) %>% head(20)
##
## Advisory Vote 14
## 15102
## Advisory Vote 15
## 15102
## Attorney General
## 17619
## Auburn School District No. 408 Proposition 1
## 522
## City of Bellevue Proposition 1
## 1014
## City of Bellevue Proposition 2
## 1014
## City of Bothell Advisory Proposition 1
## 156
## City of Bothell Proposition 1
## 156
## City of Duvall Advisory Vote 1
## 42
## City of Duvall Proposition 1
## 42
## City of Issaquah Proposition 1
## 222
## City of Kenmore Proposition 1
## 156
## City of Seattle Initiative Measure 124
## 5772
## City of Shoreline Proposition 1
## 522
## City of Snoqualmie Proposition 1
## 72
## City of Tukwila Proposition 1
## 108
## Commissioner of Public Lands
## 17619
## Congressional District 1
## 2268
## Congressional District 7
## 7007
## Congressional District 8
## 3059
This variable seems to contain election races from King County in 2016.
table(king_county_elections_2016$CounterGroup)
##
## Total
## 643163
Every value in this variable is “Total”, which makes the variable useless!
ggplot(king_county_elections_2016,aes(Party))+
geom_bar()+theme_bw()+ylab("Count")+ggtitle("Number of Observations by Party")
This variable seems to contain political parties, although “NP” (which probably stands for “No Party” or “Non-Partisan”) is very prevalent. Perhaps it marks races that are non-partisan.
table(king_county_elections_2016$CounterType) %>% head(10)
##
## Adam Smith Alvin Rutledge
## 755 106
## Alyson Kennedy & Osborne Hart Andrew Pilloud
## 2517 209
## Anthony Gipe Approved
## 2517 5533
## Barbara Madsen Barry Knowles
## 2517 136
## Benjamin Judah Phelps Bill Bryant
## 174 2517
This variable seems to contain candidate names.
Notice something odd about CounterType in particular? It tells you what a given row of votes was for… but it also has Registered Voters and Times Counted. What are these values?
ggplot(king_county_elections_2016,aes(SumOfCount))+geom_histogram(bins=30)+
xlab("Number of Votes by Race and Precinct")+ylab("Count")+theme_bw()
This variable seems to include the number of votes by precinct and race. Many precincts have no votes for particular races, it appears.
In this assignment (and the next), we will focus on three major races in Washington in 2016:
- “US President & Vice President”
- “Governor”
- “Lieutenant Governor”
With these races, we are interested in:
- Turnout rates for each of these races in each precinct. We will measure turnout as total number of submitted votes (including for a candidate, blank, write-in, or “over vote”) divided by the number of registered voters.
- Differences between precincts in Seattle and precincts elsewhere in King County.
- Precinct-level support for the Democratic candidates in King County in 2016 for each contest. We will measure support as the percentage of votes in a precinct for the Democratic candidate out of all votes for candidates or write-ins.
You will answer Questions #1 and #2 in this homework (Question #3 will be completed in homework 6). The sections below describe steps you may want to take to answer Questions 1 and 2. I suggest loading dplyr and tidyr (in the very first code chunk of this Rmd) to start!
For what we want to do, there are a lot of rows that are not useful. Start by filtering the dataset to include only rows in which the race is one of: “US President & Vice President”, “Governor”, or “Lieutenant Governor”. Save this subsetted dataset as a new object, called king_county_elections_2016_Exec.
king_county_elections_2016_Exec <- king_county_elections_2016 %>%
filter(Race %in% c("US President & Vice President", "Governor", "Lieutenant Governor"))
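The filter above tests membership with %in% rather than ==. A minimal sketch on a made-up vector (the values are illustrative and are not taken from the election file) suggests why this matters: == recycles the comparison vector element by element, while %in% checks each value against the whole set.

```r
# Toy vector standing in for the Race column (illustrative values only)
race <- c("Governor", "Attorney General", "Lieutenant Governor",
          "Governor", "US President & Vice President", "Lieutenant Governor")
keep <- c("US President & Vice President", "Governor", "Lieutenant Governor")

# %in% asks, for each element of race, whether it appears anywhere in keep
in_result <- race %in% keep

# == recycles keep along race, so each element is compared to only one target
eq_result <- race == keep

sum(in_result)  # rows kept by %in%
sum(eq_result)  # rows kept by ==, which silently misses valid matches
```

Because the toy vector's length is a multiple of the length of keep, R recycles without a warning, which is what makes this mistake easy to overlook.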
In our subsetted data, we want to add an indicator variable for whether a precinct belongs to Seattle. The following code creates a character vector with the values “Seattle” or “Not Seattle” (not a true TRUE/FALSE boolean). Using this code, add it to your dataset king_county_elections_2016_Exec as a new variable called “Seattle”.
ifelse(substr(king_county_elections_2016_Exec$Precinct, start = 1, stop = 4) == "SEA ","Seattle","Not Seattle")
king_county_elections_2016_Exec$Seattle <- ifelse(substr(king_county_elections_2016_Exec$Precinct, start = 1, stop = 4) == "SEA ","Seattle","Not Seattle")
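To see what the prefix test does, here is a small sketch on hypothetical precinct names (the names are invented for illustration; only the "SEA " prefix rule comes from the code above). It also shows one way a logical TRUE/FALSE version could be built with base R's startsWith(), should that be preferred.

```r
# Hypothetical precinct names; the "SEA " prefix rule is taken from the code above
precinct <- c("SEA 11-1234", "SEATAC 33-0001", "ADAIR", "SEA 36-2002")

# substr() keeps characters 1 to 4, so "SEAT" (from SEATAC) does not match "SEA "
label <- ifelse(substr(precinct, start = 1, stop = 4) == "SEA ",
                "Seattle", "Not Seattle")

# A logical version, if a TRUE/FALSE column is preferred
is_seattle <- startsWith(precinct, "SEA ")

table(label)
```

The trailing space in "SEA " is what keeps other names that merely begin with the letters SEA from being counted as Seattle.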
We want to calculate turnout rates for each race. We define $\text{Turnout} = \frac{\text{TotalVotes}}{\text{RegisteredVoters}}$, where total votes are listed in rows where the variable CounterType equals ‘Times Counted’, and registered voters are listed in rows where CounterType equals ‘Registered Voters’. We can calculate turnout rates in three steps: total votes by race/precinct, registered voters by race/precinct, and finally turnout by race/precinct. Let’s do it!
First, create a dataset called Votes in which you filter king_county_elections_2016_Exec to contain rows only where CounterType == ‘Times Counted’. You should now have one row per Precinct/Race. Add a variable called “TotalVotes” to the Votes dataset, which contains the number of total votes by race/precinct (currently in the “SumOfCount” variable in Votes).
Votes <- king_county_elections_2016_Exec %>% filter(CounterType == 'Times Counted')
Votes$TotalVotes <- Votes$SumOfCount
Second, create a dataset called Registered which contains rows only where CounterType == ‘Registered Voters’. The “SumOfCount” variable in this dataset includes the number of registered voters by precinct/race. Add a variable called “Registered” to the Votes dataset, which contains the number of registered voters by race/precinct (currently in the “SumOfCount” variable in Registered).
Registered <- king_county_elections_2016_Exec %>% filter(CounterType == 'Registered Voters')
Votes$Registered <- Registered$SumOfCount
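The assignment above copies Registered$SumOfCount into Votes by position, which is only correct if both filtered datasets list the same precinct/race pairs in the same order. As a hedged alternative, the sketch below matches rows by key using tidyr's pivot_wider() on made-up data. The toy tibble and the object name turnout_wide are illustrative and are not part of the assignment.

```r
library(dplyr)
library(tidyr)

# Toy data in the same long layout as king_county_elections_2016_Exec
toy <- tibble(
  Precinct    = c("A", "A", "B", "B"),
  Race        = "Governor",
  CounterType = c("Registered Voters", "Times Counted",
                  "Times Counted", "Registered Voters"),
  SumOfCount  = c(500, 400, 150, 200)
)

# Spread the two counter types into columns, matched on Precinct and Race
turnout_wide <- toy %>%
  filter(CounterType %in% c("Times Counted", "Registered Voters")) %>%
  pivot_wider(names_from = CounterType, values_from = SumOfCount) %>%
  rename(TotalVotes = `Times Counted`, Registered = `Registered Voters`) %>%
  mutate(Turnout = TotalVotes / Registered)

turnout_wide
```

Note that the toy rows for precinct B are deliberately in a different order from those for precinct A; matching by key still pairs the right counts together.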
Third, create a new variable in Votes called “Turnout”, which includes turnout calculated by dividing total votes by registered voters. Then, subset Votes to contain only the following variables: Precinct, Seattle, Race, TotalVotes, Registered, and Turnout. Display the first 10 rows of Votes, and print the number of rows/columns (there should be 7551 rows and 6 columns!).
Votes$Turnout <- Votes$TotalVotes/Votes$Registered
Votes <- Votes %>% select(Precinct,Seattle,Race,TotalVotes,Registered,Turnout)
Votes %>% head(10)
## # A tibble: 10 × 6
## Precinct Seattle Race Total…¹ Regis…² Turnout
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ADAIR Not Seattle Governor 485 519 0.934
## 2 ADAIR Not Seattle Lieutenant Governor 485 519 0.934
## 3 ADAIR Not Seattle US President & Vice Presid… 485 519 0.934
## 4 ALDARRA Not Seattle Governor 625 763 0.819
## 5 ALDARRA Not Seattle Lieutenant Governor 625 763 0.819
## 6 ALDARRA Not Seattle US President & Vice Presid… 625 763 0.819
## 7 ALDER SPRINGS Not Seattle Governor 476 557 0.855
## 8 ALDER SPRINGS Not Seattle Lieutenant Governor 476 557 0.855
## 9 ALDER SPRINGS Not Seattle US President & Vice Presid… 476 557 0.855
## 10 ALDERWOOD Not Seattle Governor 404 472 0.856
## # … with abbreviated variable names ¹TotalVotes, ²Registered
dim(Votes)
## [1] 7551 6
Create ggplot histograms of turnout rates, first by “Race” and second by “Seattle”. Do you notice any changes in turnout based on race or whether or not a precinct was located in Seattle?
ggplot(Votes,aes(Turnout))+theme_bw()+
geom_histogram(bins=30)+facet_wrap(~Race)+
xlab("Turnout Rate")+ylab("Count")+
ggtitle("Turnout Rate by Executive Race",
"2016 King County Election Results")+
xlim(c(0,1))
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 6 rows containing missing values (`geom_bar()`).
ggplot(Votes,aes(Turnout))+theme_bw()+
geom_histogram(bins=30)+facet_wrap(~Seattle)+
xlab("Turnout Rate")+ylab("Count")+
ggtitle("Turnout Rate by Seattle vs. Non-Seattle Precinct",
"2016 King County Election Results")+
xlim(c(0,1))
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).
Turnout seems quite consistent across different races; however, turnout in Seattle seems slightly higher on average than turnout in other parts of King County.
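The warnings above report rows dropped from the histograms as non-finite. One plausible cause, offered here as an assumption rather than something verified against the file, is precincts with zero registered voters, where the division produces NaN or Inf. A small sketch on made-up numbers shows how such rows could be located:

```r
# Made-up vote counts; the third and fourth precincts have no registered voters
total_votes <- c(485, 625, 0, 3)
registered  <- c(519, 763, 0, 0)

turnout <- total_votes / registered   # 0/0 gives NaN, 3/0 gives Inf

# is.finite() is FALSE for NaN, Inf, and NA
which(!is.finite(turnout))
```

Applying the same is.finite() check to Votes$Turnout would show which precincts, if any, fall into this case.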
That’s it for Homework 5!
In today’s homework, we’ll analyze the support for Democratic candidates by precinct and political race using 2016 King County election data. Before editing the code below, be sure you’ve run all the code above (you can run all chunks above by clicking the downward-facing arrow in the upper-right corner of the first code chunk beneath this text).
Let’s first create a data.frame called DemVotes which includes three variables: “Precinct”, “Race”, and “DemVotes”, which contains the total number of votes received by candidates from the Democratic party in each precinct and political race. To create this dataset, I recommend following these steps:
- Start with king_county_elections_2016_Exec, and filter to rows in which “Party” is either “Dem” or “DPN” (both correspond to Democrats).
- group_by Precinct and Race.
- summarize the total number of votes (i.e., sum together the values in the “SumOfCount” variable) into a new variable called “DemVotes”.
# [CODE HERE]
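As a rough illustration of the filter / group_by / summarize pattern described in the steps above, here is a sketch on made-up data. The toy tibble, the "Rep" party code, and the object name toy_dem are assumptions for illustration only; the real data and object names are the ones given in the instructions.

```r
library(dplyr)

# Toy long-format data; party codes other than "Dem"/"DPN" are illustrative
toy <- tibble(
  Precinct   = c("A", "A", "A", "B", "B"),
  Race       = c("Governor", "Governor", "Governor", "Governor", "Governor"),
  Party      = c("Dem", "Rep", "DPN", "Dem", "Rep"),
  SumOfCount = c(100, 80, 5, 60, 90)
)

toy_dem <- toy %>%
  filter(Party %in% c("Dem", "DPN")) %>%      # keep Democratic rows only
  group_by(Precinct, Race) %>%                # one group per precinct/race
  summarize(DemVotes = sum(SumOfCount), .groups = "drop")

toy_dem
```

The .groups = "drop" argument returns an ungrouped result, which avoids surprises in later steps such as joins.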
Now, use the left_join function to merge the “Votes” data from last week with the “DemVotes” data you just created. Be sure you have merged the two data.frames by the “Precinct” and “Race” variables. Save the joined data.frame as “Votes_HW6”.
# [CODE HERE]
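For readers unfamiliar with joining on two keys, the following sketch shows how dplyr's left_join() behaves on small made-up tables (toy_votes, toy_dem, and toy_joined are hypothetical names and values, not the assignment's objects).

```r
library(dplyr)

# Toy stand-ins for the two tables; values are made up
toy_votes <- tibble(
  Precinct   = c("A", "A", "B"),
  Race       = c("Governor", "Lieutenant Governor", "Governor"),
  TotalVotes = c(200, 200, 150)
)
toy_dem <- tibble(
  Precinct = c("A", "B"),
  Race     = c("Governor", "Governor"),
  DemVotes = c(105, 60)
)

# left_join keeps every row of toy_votes; unmatched rows get NA in DemVotes
toy_joined <- left_join(toy_votes, toy_dem, by = c("Precinct", "Race"))

toy_joined
```

Checking that the joined table has the same number of rows as the left-hand table is a quick way to confirm that the keys identify rows uniquely.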
Now, create a new variable in the “Votes_HW6” data.frame called “DemPercent”, in which DemPercent is calculated by dividing the number of votes received by Democrats by the total number of votes (in each precinct and race). Create a histogram in ggplot of the “DemPercent” variable. (NOTE: All values should be between 0 and 1!!!). Write a one-sentence interpretation of the distribution you observe.
# [CODE HERE]
ANSWER: [ANSWER HERE!]
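Since all DemPercent values are expected to lie between 0 and 1, a quick sanity check can catch problems before plotting. The sketch below uses made-up vectors (dem_votes and total_votes are hypothetical stand-ins, with one deliberately invalid value and one missing value).

```r
# Made-up joined values: Democratic votes and total votes per precinct/race
dem_votes   <- c(105, 60, NA, 210)
total_votes <- c(200, 150, 180, 200)

dem_percent <- dem_votes / total_votes

# Flag shares outside [0, 1]; NA results are counted separately
out_of_range <- which(dem_percent < 0 | dem_percent > 1)
n_missing    <- sum(is.na(dem_percent))

out_of_range
n_missing
```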
Using the facet_grid() or facet_wrap() layer functions, create side-by-side histograms in ggplot to explore whether or not Democratic support varies based on (1) precincts located in Seattle or not, and (2) political race. Write a one- or two-sentence explanation of your findings.
# [CODE HERE]
ANSWER: [ANSWER HERE!]
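As one possible layout, facet_grid() can cross two variables so that each panel shows one combination. The sketch below uses randomly generated shares purely for illustration; the data frame toy and the plot object p are hypothetical and say nothing about the real results.

```r
library(ggplot2)

# Made-up shares for illustration only
set.seed(1)
toy <- data.frame(
  DemPercent = runif(120),
  Seattle    = rep(c("Seattle", "Not Seattle"), each = 60),
  Race       = rep(c("Governor", "Lieutenant Governor",
                     "US President & Vice President"), times = 40)
)

# Rows of panels by Seattle, columns by Race
p <- ggplot(toy, aes(DemPercent)) +
  geom_histogram(bins = 20) +
  facet_grid(Seattle ~ Race) +
  theme_bw()

p
```

Using facet_wrap(~Seattle) or facet_wrap(~Race) instead would produce one row of panels for a single variable, matching the style of the turnout plots earlier in the document.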
Repeat the previous question, but this time using only data from precincts with at least 500 registered voters. Are the patterns you observed any different now?
# [CODE HERE]
ANSWER: [ANSWER HERE!]