class: center, top, title-slide .title[ # CSSS 508, Lecture 2 ] .subtitle[ ## Visualizing Data ] .author[ ### Michael Pearce
(based on slides from Chuck Lanfear) ] .date[ ### April 5, 2023 ] --- class: inverse # Topics Last time, we learned about, 1. R and RStudio 1. RMarkdown headers, syntax, and chunks 1. Basics of functions, objects, and vectors 1. Dataframes and basic plots -- Today, we will cover, 1. Useful coding tips: packages, directories, and saving data 1. Basics of ggplot: layers and aesthetics 1. Advanced ggplot tools --- class: inverse # 1. Useful Coding Tips --- # Packages Packages are **collections of functions and tools** that make your life easier! The best part of R is the huge number of user-created packages. The `Packages` tab in the bottom-right pane of RStudio lists your installed packages. -- To install a new package in R, run the line of code: ```r install.packages("gapminder") ``` We always install packages in the **console**, because we only want to do it **once** --- # Loading Packages Installing a packages *does not* mean it's loaded in our R session. To do so, we call the package: ```r library(gapminder) ``` **NOTE:** Use quotes when installing packages, but not when loading packages! -- We need to run this code every time we open a new R session: **Where should we put this code?** -- **Answer:** In R/Rmd files, and not the console! --- # Working Directories R saves files and looks for files to open in your current **working directory**. You can ask R what this is: ```r getwd() ``` ``` ## [1] "/Users/pearce790/CSSS508/Lectures/Lecture2" ``` -- Similarly, we can set a working directory like so: ```r setwd("C:/Users/pearce790/CSSS508/HW2") ``` -- *Don't set a working directory in R Markdown documents!* They automatically set the directory they are in as the working directory. --- # Managing Files When managing R projects, it is normally best to give each project (such as a homework assignment) its own folder. I use the following system: * Every class or project has its own folder * Each assignment or task has a folder inside that, which is the working directory for that item. * `.Rmd` and `.R` files are named clearly and completely -- For example, this presentation is located and named this: `GitHub/CSSS508/Lectures/Lecture2/CSSS508_Lecture2_ggplot2.Rmd` **Be consistent** so your projects are organized! You don't want to lose files!! --- class: inverse # Aside: Gapminder Data --- # Gapminder Data In today's lecture and Homework 2, we'll use data from Hans Rosling's [Gapminder](http://www.gapminder.org) project. The data can be accessed through the `gapminder` R package. -- If you didn't already, run in the console: `install.packages("gapminder")`, and then load the package: ```r library(gapminder) ``` --- # Exploring Gapminder The data is in a dataframe called `gapminder`, which is available after loading the package. Let's explore it using functions from last week: .small[ ```r str(gapminder) ``` ``` ## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame) ## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... ## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ... ## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ... ## $ gdpPercap: num [1:1704] 779 821 853 836 740 ... ``` ] --- # What's Interesting Here? * **Factor** variables `country` and `continent` + Factors are categorical data + We'll spend a lot of time on factors later! -- * Many observations: `\(n=1704\)` rows -- * For each observation, a few variables: `\(p=6\)` columns -- * A nested/hierarchical structure: `year` in `country` in `continent` + These are panel data! --- class: inverse # 2. Basics of `ggplot2` --- ## Basic Plots .pull-left[ .small[ ```r China <- subset(gapminder, gapminder$country == "China") plot(lifeExp ~ year, data = China, xlab = "Year", ylab = "Life expectancy", main = "Life expectancy in China", col = "red", pch = 16) ``` ] The plot is made with *one function* and *many arguments* ] .pull-right[ <!-- --> ] Note: Don't worry about the code used to create the object `China`. We'll explore data manipulation next week! --- ## Fancier: `ggplot` .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy in China") + theme_bw(base_size=18) ``` ] This `ggplot` is made with *many functions* and *fewer arguments* in each. ] .pull-right[ <!-- --> ] --- # `ggplot2` The `ggplot2` package provides an alternative toolbox for plotting. ```r # install.packages("ggplot2") library(ggplot2) ``` The core idea underlying this package is the [**layered grammar of graphics**](https://doi.org/10.1198/jcgs.2009.07098): we can break up elements of a plot into pieces and combine them. -- `ggplot`s are a bit harder to create, but are usually: * prettier, * more professional, and * more customizable! --- # Structure of a ggplot `ggplot` graphics objects consist of two primary components: -- 1. **Layers**, the components of a graph. * We *add* layers to a `ggplot` object using `+`. * This includes adding lines, shapes, and text to a plot. -- 2. **Aesthetics**, which determine how the layers appear. * We *set* aesthetics using *arguments* (e.g. `color="red"`) inside layer functions. * This includes modifying locations, colors, and sizes of the layers. --- # Layers **Layers** are the components of the graph, such as: * `ggplot()`: initializes basic plotting object, specifies input data * `geom_point()`: layer of scatterplot points * `geom_line()`: layer of lines * `geom_histogram()`: layer of a histogram * `ggtitle()`, `xlab()`, `ylab()`: layers of labels * `facet_wrap()`: layer creating multiple plot panels * `theme_bw()`: layer replacing default gray background with black-and-white Layers are separated by a `+` sign. For clarity, I usually put each layer on a new line. --- # Aesthetics **Aesthetics** control the appearance of the layers: * `x`, `y`: `\(x\)` and `\(y\)` coordinate values to use * `color`: set color of elements based on some data value * `group`: describe which points are conceptually grouped together for the plot (often used with lines) * `size`: set size of points/lines based on some data value (greater than 0) * `alpha`: set transparency based on some data value (between 0 and 1) --- class: inverse # Examples: Basic Jargon in Action! We'll now build up two `ggplot`s together that demonstrate common layers and aesthetics. --- ## Axis Labels, Points, No Background ### 1: Base Plot .pull-left[ .small[ ```r *ggplot(data = China, * aes(x = year, y = lifeExp)) ``` ] ] .pull-right[ <!-- --> ] .footnote[Initialize the plot with `ggplot()` and `x` and `y` aesthetics **mapped** to variables. These aesthetics will be accessible to any future layers since they're in the primary layer.] --- ## Axis Labels, Points, No Background ### 2: Scatterplot .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + * geom_point() ``` ] ] .pull-right[ <!-- --> ] .footnote[Add a scatterplot **layer**.] --- ## Axis Labels, Points, No Background ### 3: Point Color and Size .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + * geom_point(color = "red", size = 3) ``` ] ] .pull-right[ <!-- --> ] .footnote[**Set** aesthetics to make the points large and red.] --- ## Axis Labels, Points, No Background ### 4: X-Axis Label .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + * xlab("Year") ``` ] ] .pull-right[ <!-- --> ] .footnote[Add a layer to capitalize the x-axis label.] --- ## Axis Labels, Points, No Background ### 5: Y-Axis Label .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + xlab("Year") + * ylab("Life expectancy") ``` ] ] .pull-right[ <!-- --> ] .footnote[Add a layer to clean up the y-axis label.] --- ## Axis Labels, Points, No Background ### 6: Title .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + * ggtitle("Life expectancy in China") ``` ] ] .pull-right[ <!-- --> ] .footnote[Add a title layer.] --- ## Axis Labels, Points, No Background ### 7: Theme .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy in China") + * theme_bw() ``` ] ] .pull-right[ <!-- --> ] .footnote[Pick a nicer theme with a new layer.] --- ## Axis Labels, Points, No Background ### 8: Text Size .pull-left[ .small[ ```r ggplot(data = China, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy in China") + * theme_bw(base_size=18) ``` ] ] .pull-right[ <!-- --> ] .footnote[Increase the base text size.] --- # Plotting All Countries We have a plot we like for China... ... but what if we want *all the countries*? --- # Plotting All Countries ### 1: A Mess! .pull-left[ .small[ ```r *ggplot(data = gapminder, aes(x = year, y = lifeExp)) + geom_point(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw(base_size=18) ``` ] ] .pull-right[ <!-- --> ] .footnote[We can't tell countries apart! Maybe we could follow *lines*?] --- # Plotting All Countries ### 2: Lines .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp)) + * geom_line(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw(base_size=18) ``` ] ] .pull-right[ ``` ## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0. ## ℹ Please use `linewidth` instead. ``` <!-- --> ] .footnote[`ggplot2` doesn't know how to connect the lines!] --- # Plotting All Countries ### 3: Grouping .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, * group = country)) + geom_line(color = "red", size = 3) + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw(base_size=18) ``` ] ] .pull-right[ <!-- --> ] .footnote[That looks more reasonable... but the lines are too thick!] --- # Plotting All Countries ### 4: Size .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + * geom_line(color = "red") + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw(base_size=18) ``` ] ] .pull-right[ <!-- --> ] .footnote[Much better... but maybe we can do highlight regional differences?] --- # Plotting All Countries ### 5: Color .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, * color = continent)) + * geom_line() + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw(base_size=18) ``` ] ] .pull-right[ <!-- --> ] .footnote[Patterns are obvious... but why not separate continents completely?] --- # Plotting All Countries ### 6: Facets .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line() + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw(base_size=18) + * facet_wrap(~ continent) ``` ] ] .pull-right[ <!-- --> ] .footnote[Now the text is too big!] --- # Plotting All Countries ### 7: Text Size .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line() + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + * theme_bw() + facet_wrap(~ continent) ``` ] ] .pull-right[ <!-- --> ] .footnote[Better. Do we even need the legend anymore?] --- # Plotting All Countries ### 8: No Legend .pull-left[ .small[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line() + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw() + facet_wrap(~ continent) + * theme(legend.position = "none") ``` ] ] .pull-right[ <!-- --> ] .footnote[Looking good!] --- class: inverse # 4. Advanced ggplot tools *(10 minute break!)* Next, we'll discuss: * Storing, modifying, and saving ggplots * Advanced axis changes (scales, text, ticks) * Legend changes (scales, colors, locations) --- # Storing Plots We can assign a `ggplot` object to a name: ```r lifeExp_by_year <- ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line() + xlab("Year") + ylab("Life expectancy") + ggtitle("Life expectancy over time") + theme_bw() + facet_wrap(~ continent) + theme(legend.position = "none") ``` Afterwards, you can display or modify `ggplot`s... --- # Showing a Stored Graph ```r lifeExp_by_year ``` <!-- --> --- # Adding More Layers ```r lifeExp_by_year + theme(legend.position = "bottom") ``` <!-- --> --- # Saving `ggplot` Plots If you want to save a ggplot, use `ggsave()`: ```r ggsave("I_saved_a_file.pdf", plot = lifeExp_by_year, height = 3, width = 5, units = "in") ``` If you didn't manually set font sizes, these will usually come out at a reasonable size given the dimensions of your output file. --- # Changing the Axes We can modify the axes in a variety of ways, such as: * Change the `\(x\)` or `\(y\)` range using `xlim()` or `ylim()` layers * Change to a logarithmic or square-root scale on either axis: `scale_x_log10()`, `scale_y_sqrt()` * Change where the major/minor breaks are: `scale_x_continuous(breaks =, minor_breaks = )` --- # Axis Changes ```r ggplot(data = China, aes(x = year, y = gdpPercap)) + geom_line() + * scale_y_log10(breaks = c(1000, 2000, 3000, 4000, 5000)) + xlim(1940, 2010) + ggtitle("Chinese GDP per capita") ``` <!-- --> --- ## Precise Legend Position .small[ ```r lifeExp_by_year + theme(legend.position = c(0.8, 0.2)) ``` <!-- --> ] Instead of coordinates, you could also use "top", "bottom", "left", or "right". --- # Scales for Color, Shape, etc. **Scales** are layers that control how the mapped aesthetics appear. You can modify these with a `scale_[aesthetic]_[option]()` layer: -- * `[aesthetic]` is `color`, `shape`, `linetype`, `alpha`, `size`, `fill`, etc. -- * `[option]` is something like `manual`, `continuous` or `discrete` (depending on nature of the variable). -- **Examples:** * `scale_linetype_manual()`: manually specify the linetype for each different value * `scale_color_manual()`: manually specify colors --- ## Legend Name and Manual Colors .small[ ```r lifeExp_by_year + theme(legend.position = c(0.8, 0.2)) + scale_color_manual( * name = "Which continent are\nwe looking at?", # \n adds a line break values = c("Africa" = "seagreen", "Americas" = "turquoise1", "Asia" = "royalblue", "Europe" = "violetred1", "Oceania" = "yellow")) ``` <!-- --> ] --- class: inverse # Bonus: Advanced Example! --- ## End Result We're going to *slowly* build up a *really detailed plot* now! <!-- --> --- ## 1. Base Plot .smaller[ ```r *ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) # # # # # # # # # ``` <!-- --> ] --- ## 2. Lines .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + * geom_line() # # # # # # # # ``` <!-- --> ] --- ## 3. Continent Average .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + geom_line() + * geom_line(stat = "smooth", method = "loess", * aes(group = continent)) # # # # # # ``` <!-- --> ] .footnote[Note: A [*loess* curve](https://en.wikipedia.org/wiki/Local_regression) is something like a moving average.] --- ## 4. Facets .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + geom_line() + geom_line(stat = "smooth", method = "loess", aes(group = continent)) + * facet_wrap(~ continent, nrow = 2) # # # # # ``` <!-- --> ] --- ## 5. Color Scale .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + * geom_line(aes(color = "Country")) + geom_line(stat = "smooth", method = "loess", * aes(group = continent, color = "Continent")) + facet_wrap(~ continent, nrow = 2) + * scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) # # # # ``` <!-- --> ] --- ## 6. Size Scale .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + * geom_line(aes(color = "Country", size = "Country")) + geom_line(stat = "smooth", method = "loess", * aes(group = continent, color = "Continent", size = "Continent")) + facet_wrap(~ continent, nrow = 2) + scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) + * scale_size_manual(name = "Life Exp. for:", values = c("Country" = 0.25, "Continent" = 3)) # # # ``` <!-- --> ] --- ## 7. Alpha (Transparency) .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + * geom_line(alpha = 0.5, aes(color = "Country", size = "Country")) + geom_line(stat = "smooth", method = "loess", * aes(group = continent, color = "Continent", size = "Continent"), alpha = 0.5) + facet_wrap(~ continent, nrow = 2) + scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) + scale_size_manual(name = "Life Exp. for:", values = c("Country" = 0.25, "Continent" = 3)) # # # ``` <!-- --> ] --- ## 8. Theme and Labels .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + geom_line(alpha = 0.5, aes(color = "Country", size = "Country")) + geom_line(stat = "smooth", method = "loess", aes(group = continent, color = "Continent", size = "Continent"), alpha = 0.5) + facet_wrap(~ continent, nrow = 2) + scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) + scale_size_manual(name = "Life Exp. for:", values = c("Country" = 0.25, "Continent" = 3)) + * theme_minimal(base_size = 14) + ylab("Years") + xlab("") # # ``` <!-- --> ] --- ## 9. Title and Subtitle .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + geom_line(alpha = 0.5, aes(color = "Country", size = "Country")) + geom_line(stat = "smooth", method = "loess", aes(group = continent, color = "Continent", size = "Continent"), alpha = 0.5) + facet_wrap(~ continent, nrow = 2) + scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) + scale_size_manual(name = "Life Exp. for:", values = c("Country" = 0.25, "Continent" = 3)) + theme_minimal(base_size = 14) + ylab("Years") + xlab("") + * ggtitle("Life Expectancy, 1952-2007", subtitle = "By continent and country") # ``` <!-- --> ] --- ## 10. Angled Tick Values .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + geom_line(alpha = 0.5, aes(color = "Country", size = "Country")) + geom_line(stat = "smooth", method = "loess", aes(group = continent, color = "Continent", size = "Continent"), alpha = 0.5) + facet_wrap(~ continent, nrow = 2) + scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) + scale_size_manual(name = "Life Exp. for:", values = c("Country" = 0.25, "Continent" = 3)) + theme_minimal(base_size = 14) + ylab("Years") + xlab("") + ggtitle("Life Expectancy, 1952-2007", subtitle = "By continent and country") + * theme(axis.text.x = element_text(angle = 45)) ``` <!-- --> ] .footnote[Note: Fewer values might be better than angled labels!] --- ## 11. Legend Position .smaller[ ```r ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) + geom_line(alpha = 0.5, aes(color = "Country", size = "Country")) + geom_line(stat = "smooth", method = "loess", aes(group = continent, color = "Continent", size = "Continent"), alpha = 0.5) + facet_wrap(~ continent, nrow = 2) + scale_color_manual(name = "Life Exp. for:", values = c("Country" = "black", "Continent" = "blue")) + scale_size_manual(name = "Life Exp. for:", values = c("Country" = 0.25, "Continent" = 3)) + theme_minimal(base_size = 14) + ylab("Years") + xlab("") + ggtitle("Life Expectancy, 1952-2007", subtitle = "By continent and country") + * theme(legend.position=c(0.82, 0.15), axis.text.x = element_text(angle = 45)) ``` <!-- --> ] --- class:inverse # Summary --- ## Summary `ggplot2` can do a LOT! I won't expect you to remember all these tools. .pull-left[ .small[ * With time and practice, you'll start to remember the key tools * When in doubt, Google it! ("*R ggplot rename title*") * There are lots of great resources out there: + The [Cookbook for R website](http://www.cookbook-r.com/Graphs/Legends_%28ggplot2%29) + The [RStudio ggplot Cheatsheets](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf). + Kieran Healy's book [Data Visualization: A Practical Introduction](https://socviz.co/) (right) which is targeted at social scientists without technical backgrounds. ] ] .pull-right[ <img src="img/dv-cover-pupress.jpg" width="90%"/> ] --- class: inverse # Exercise: Histograms In pairs, you will create a histogram of life expectancy observations in the complete Gapminder dataset. 1. Set the base layer by specifying the data as `gapminder` and the x variable as `lifeExp` 2. Add a second layer to create a histogram using the function `geom_histogram()` 3. Customize your plot with nice axis labels and a title. --- # My Solution ### 1: Set Base Layer .pull-left[ .small[ ```r ggplot(gapminder,aes(x=lifeExp)) ``` ] ] .pull-right[ <!-- --> ] --- # My Solution ### 2: Add Histogram Layer .pull-left[ .small[ ```r ggplot(gapminder,aes(x=lifeExp))+ geom_histogram(bins=30) ``` ] ] .pull-right[ <!-- --> ] .footnote[Setting the `bins` aesthetic removes a pesky message!] --- # My Solution ### 3: Add Label Layers .pull-left[ .small[ ```r ggplot(gapminder,aes(x=lifeExp))+ geom_histogram(bins=30)+ xlab("Life Expectancy")+ ylab("Count")+ ggtitle("Histogram of Life Expectancy in Gapminder Data") ``` ] ] .pull-right[ <!-- --> ] --- class: inverse # Homework 2 In this homework, you'll pose a question regarding the Gapminder dataset and investigate it graphically. * At the beginning of your document, write down a research question that is based on the Gapminder dataset (e.g., "How does population change over time in different countries?") * Create 3-6 plots to answer/investigate your research question. Consider histograms (`geom_histogram()`), scatterplots (`geom_point()`), or lineplots (`geom_line()`). * Be sure all titles, axes, and legends are clearly labelled. * Include at least one plot with `facet_wrap()` or `facet_grid()`. Your document should be pleasant for a peer to look at, with some organization. You must write up your observations in words as well as showing the graphs. Upload both the `.Rmd` file and the `.html` file to Canvas.