CSSS508, Lecture 10
Model Results and Reproducibility
Michael Pearce
(based on slides from Chuck Lanfear)
May 31, 2023

Topics

Last time, we learned about:

Basic mapping: ggplot, ggmap, and ggrepel
Advanced mapping: GIS with sf and tidycensus

Today, we will cover:

Reproducible research
Best practices
Wrapping up the course!

Reproducible Research

Why Reproducibility?

Reproducibility is not replication.

Replication is running a new study to show if and how results of a prior study hold.
Reproducibility is about rerunning the same study and getting the same results.

Reproducible studies can still be wrong... and in fact reproducibility makes proving a study wrong much easier.

Reproducibility means:

Transparent research practices.
Minimal barriers to verifying your results.

Any study that isn't reproducible can be trusted only on faith.

Reproducibility Definitions

Reproducibility comes in three forms (Stodden 2014):

Empirical: Repeatability in data collection.
Statistical: Verification with alternate methods of inference.
Computational: Reproducibility in cleaning, organizing, and presenting data and results.
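
As a small illustration of the computational form, the sketch below fixes the random seed so that a simulated analysis returns identical numbers each time it is rerun. The seed value, the simulated data, and the function name `run_analysis` are assumptions made for this example and are not part of the course materials.

```r
# Hypothetical analysis: simulate data and estimate a regression slope.
run_analysis <- function(seed) {
  set.seed(seed)                    # fix the random number stream
  x <- rnorm(100)
  y <- 2 * x + rnorm(100)
  unname(coef(lm(y ~ x))["x"])      # estimated slope as a plain number
}

first_run  <- run_analysis(seed = 508)
second_run <- run_analysis(seed = 508)

# Rerunning the same code with the same seed gives the same estimate.
identical(first_run, second_run)
```

Without the `set.seed()` call, the two runs would draw different random numbers and the estimates would generally differ.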

R is particularly well suited to enabling computational reproducibility.¹

[1] Python is equally well suited.

These tools will not fix flawed research design, nor will they remedy improper application of statistical methods.

Those are the difficult, non-automatable skills you want to develop.

Computational Reproducibility

Elements of computational reproducibility:

Shared data
- Researchers need your original data to verify and replicate your work.
Shared code
- Your code must be shared to make decisions transparent.
Documentation
- The operation of code should be either self-documenting or have written descriptions to make its use clear.
Version Control
- Documents the research process.
- Prevents losing work and facilitates sharing.

Levels of Reproducibility

For academic papers, degrees of reproducibility vary:

"Read the article"
Shared data with documentation
Shared data and all code
Interactive document
Research compendium

Interactive Documents

Interactive documents—like R Markdown docs—combine code and text together into a self-contained document.

Load and process data
Run models
Generate tables and plots in-line with text
In-text values automatically filled in

Interactive documents allow a reader to examine your computational methods within the document itself; in effect, they are self-documenting.

By re-running the code, they reproduce your results on demand.
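
To make this concrete, here is a minimal sketch that writes a tiny R Markdown file from R and knits it when the tooling is available. The file name, report title, and model are hypothetical; the point is only that the sentence at the end takes its number from the fitted model rather than from a hand-typed value.

```r
# Build a tiny R Markdown file line by line (contents are illustrative).
fence <- strrep("`", 3)   # three backticks open and close a code chunk
rmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r model}"),
  "fit <- lm(mpg ~ wt, data = mtcars)",
  fence,
  "",
  "Each additional 1,000 lbs of weight changes fuel economy by",
  "`r round(coef(fit)[\"wt\"], 2)` miles per gallon."
)
rmd_file <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit only if the rmarkdown package and pandoc are both available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

If the data or model changes, re-knitting updates the in-text value automatically.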

Common Platforms:

R: R Markdown
Python: Jupyter Notebooks

Research Compendia

A research compendium is a portable, reproducible distribution of an article or other project.

Research compendia feature:

An interactive document as the foundation
Files organized in a recognizable structure (e.g. an R package)
Clear separation of data, method, and output. Data are read only.
A well-documented or even preserved computational environment (e.g. Docker)
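
Short of preserving the whole environment with a tool like Docker, one lightweight way to document it is to save the output of `sessionInfo()` next to your results. The sketch below assumes a hypothetical file name and writes to a temporary folder.

```r
# Record the R version, platform, and loaded packages in a plain-text file.
# The file name is a hypothetical choice for this example.
env_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), env_file)

# Inspect the first few lines of what was recorded.
head(readLines(env_file))
```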

rrtools by UW's Ben Marwick provides a simplified workflow to accomplish this in R.

Bookdown

bookdown—which is integrated into rrtools—can generate documents in the proper format for articles, theses, books, or dissertations.

bookdown provides an accessible alternative to writing $\LaTeX$ for typesetting and reference management.

You can integrate citations and automate reference page generation using bibtex files (such as produced by Zotero).

bookdown supports .html output for ease and speed and also renders .pdf files through $\LaTeX$ for publication-ready documents.

For University of Washington theses and dissertations, consider Ben Marwick's huskydown package which uses Markdown but renders via a UW approved $\LaTeX$ template.

Best Practices

Organization Systems

Organizing research projects is something you either do accidentally—and badly—or purposefully with some upfront labor.

Uniform organization makes switching between or revisiting projects easier.

I suggest something like the following:

project/
   readme.md
   data/
     derived/
       processed_data.RData
     raw/
       core_data.csv
   docs/
     paper.Rmd
   syntax/
     functions.R
     models.R

There is a clear hierarchy
- Written content is in docs
- Code is in syntax
- Data is in data
Naming is uniform
- All lower case
- Words separated by underscores
Names are self-descriptive
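
The layout above could be set up from R with a few base functions. This is only a sketch: it builds the skeleton inside a temporary folder, and the empty placeholder files stand in for real data and scripts.

```r
# Create the skeleton in a temporary folder; in practice, pick your own location.
project_root <- file.path(tempdir(), "project")

# Folders first (recursive = TRUE also creates the parent "data" folder).
dirs <- file.path(project_root, c("data/derived", "data/raw", "docs", "syntax"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Then empty placeholder files. Note that file.create() truncates existing files.
files <- file.path(
  project_root,
  c("readme.md", "docs/paper.Rmd", "syntax/functions.R", "syntax/models.R")
)
file.create(files)

# List what was created, relative to the project root.
list.files(project_root, recursive = TRUE)
```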

Workflow versus Project

To summarize Jenny Bryan, one should separate workflow from projects.

Workflow

The software you use to write your code (e.g. RStudio)
The location you store a project
The specific computer you use
The code you ran earlier or typed into your console

Project

The raw data
The code that operates on your raw data
The packages you use
The output files or documents

Projects should not modify anything outside of the project nor need to be modified by someone else (or future you) to run.

Projects should be independent of your workflow.

Portability

For research to be reproducible, it must also be portable. Portable software operates independently of workflow details such as fixed file locations.

Do Not:

Use setwd() in scripts or .Rmd files.
Use absolute paths except for fixed, immovable sources (secure data).
- read_csv("C:/my_project/data/my_data.csv")
Use install.packages() in script or .Rmd files.
Use rm(list=ls()) anywhere but your console.

Do:

Use RStudio projects (or the here package) to set directories.
Use relative paths to load and save files:
- read_csv("./data/my_data.csv")
Load all required packages using library().
Clear your workspace when closing RStudio.
- Set Tools > Global Options... > Save workspace... to Never
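
A portable script header might look like the sketch below. The package list and the file name `my_data.csv` are placeholders; the script checks for its packages instead of installing them, and builds a path relative to the project root instead of hard-coding a drive or user folder.

```r
# Packages this (hypothetical) analysis needs. Check for them and stop with a
# clear message rather than calling install.packages() inside the script.
required_pkgs <- c("stats", "utils")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_pkgs) > 0) {
  stop("Please install: ", paste(missing_pkgs, collapse = ", "))
}

# A relative path: resolved from the working directory, which an RStudio
# project sets to the project root when you open it.
data_path <- file.path("data", "my_data.csv")

if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
} else {
  message("No file at ", data_path, "; is the working directory the project root?")
}
```

If the here package is installed, `here::here("data", "my_data.csv")` builds the same path starting from the project root, regardless of the current working directory.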

Divide and Conquer

Often you do not want to include all code for a project in one .Rmd file:

The code takes too long to knit.
The file is so long it is difficult to read.

There are two ways to deal with this:

Use separate .R scripts or .Rmd files which save results from complicated parts of a project, then load these results in the main .Rmd file.
- This is good for loading and cleaning large data.
- Also for running slow models.
Use source() to run external .R scripts when the .Rmd knits.
- This can be used to run large files that aren't impractically slow.
- Also good for loading project-specific functions.
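
Both approaches can be sketched in a few lines. The folder, file names, model, and the helper `standardize()` are assumptions for illustration; a quick `lm()` fit stands in for a model that would really be slow.

```r
work_dir <- file.path(tempdir(), "divide_and_conquer")
dir.create(work_dir, showWarnings = FALSE)

# Approach 1: a separate script fits the slow model once and saves the result;
# the main .Rmd only has to load it.
model_file <- file.path(work_dir, "slow_model.rds")
if (!file.exists(model_file)) {
  fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in for a slow model
  saveRDS(fit, model_file)
}
fit <- readRDS(model_file)

# Approach 2: keep project-specific functions in their own .R file and
# source() it when the document knits.
functions_file <- file.path(work_dir, "functions.R")
writeLines("standardize <- function(x) (x - mean(x)) / sd(x)", functions_file)
source(functions_file)

wt_z <- standardize(mtcars$wt)
summary(wt_z)
```

`saveRDS()` and `readRDS()` store a single object per file, which keeps it clear what each saved file contains.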

Tools

Some opinionated advice

On Formats

Avoid "closed" or commercial software and file formats except where absolutely necessary.

Use open source software and file formats.

It is always better for science:
- People should be able to explore your research without buying commercial software.
- You do not want your research to be inaccessible when software is updated.
It is often just better.
- It is usually updated more quickly
- It tends to be more secure
- It is rarely abandoned

The ideal: Use software that reads and writes raw text.

On Text

Writing and formatting documents are two completely separate jobs.

Write first
Format later
Markdown was made for this

Word processors—like Microsoft Word—try to do both at the same time, usually badly.

They waste time by leading you to format instead of writing.

Find a good modular text editor and learn to use it:

Overleaf (https://www.overleaf.com)
Atom
Sublime (Commercial)

On Version Control

Version control originates in collaborative software development.

The Idea: All changes ever made to a piece of software are documented, saved automatically, and revertible.

Version control allows all decisions ever made in a research project to be documented automatically.

Version control can:

Protect your work from destructive changes
Simplify collaboration by merging changes
Document design decisions
Make your research process transparent

Git and GitHub

Git is the dominant version control system, and GitHub is a free (and now Microsoft-owned) platform for hosting repositories.

Repositories are folders on your computer where all changes are tracked by Git.

Once satisfied with changes, you "commit" them then "push" them to a remote repository that stores your project.
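
The commit step can be sketched from R by calling the git command-line tool through `system2()`. Everything named here is hypothetical: the repository folder, the file `models.R`, the commit message, and the placeholder author identity. The push step is left out because this throwaway repository has no remote configured.

```r
# Sketch only: runs if the git command-line tool is installed.
if (nzchar(Sys.which("git"))) {
  repo <- file.path(tempdir(), "example_repo")   # hypothetical project folder
  unlink(repo, recursive = TRUE)                 # start from a clean folder
  dir.create(repo)

  # Small helper: run a git command inside the repository and capture output.
  git <- function(...) {
    system2("git", c("-C", shQuote(repo), ...), stdout = TRUE)
  }

  git("init", "--quiet")
  writeLines("fit <- lm(mpg ~ wt, data = mtcars)", file.path(repo, "models.R"))
  git("add", "models.R")                         # stage the change
  git("-c", "user.name=Example", "-c", "user.email=example@example.com",
      "commit", "--quiet", "-m", shQuote("Add first model"))

  commit_log <- git("log", "--oneline")          # one line per commit
  print(commit_log)
}
```

In day-to-day use, RStudio's Git pane or the GitHub desktop application performs the same stage and commit steps through buttons.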

Others can copy your project ("clone") and fetch your later changes ("pull"), and, if you permit, suggest changes of their own.

Regularly committing and pushing changes generates a running "history" that documents the evolution of a project.

git is integrated into RStudio under the Tools menu. It requires some setup.¹

[1] You can also use the GitHub desktop application.

GitHub as a CV

Beyond archiving projects and allowing sharing, GitHub also serves as a sort of curriculum vitae for the programmer.

By allowing others to view your projects, you can display competence in programming and research.

If you are planning on working in the private sector, an active GitHub profile will give you a leg up on the competition.

If you are aiming for academia, a GitHub account signals technical competence and an interest in research transparency.

Wrapping up the Course

What You've Learned

A lot!

How to get data into R from a variety of formats
How to do "data custodian" work to manipulate and clean data
How to make pretty visualizations
How to automate with loops and functions
How to combine text, calculations, plots, and tables into dynamic R Markdown reports
How to acquire and work with spatial data

You all are now Rockstars!!

What Comes Next?

Learn more statistics!! (e.g. take more CSSS courses)
- Learn the foundations of statistical inference, create and evaluate models, consider survey design, make fancy visualizations, etc.
- All of this is much easier to do if you already know R!
Practice, practice, practice!
- Replicate analyses you've done for practice (maybe in another language)
- Think about data using dplyr verbs, tidy data principles
- R Markdown for reproducibility

Do more advanced projects
- Use version control (git) in RStudio
- Create interactive Shiny web apps
- Write your own functions and put them in a package

Course Plugs

If you...

would like to review math - CSSS 505: Review of Math for Social Scientists
have no stats background yet - SOC 504: Applied Social Statistics
want to learn some stat theory - CSSS 510: Maximum Likelihood
want to master visualization - CSSS 569: Visualizing Data
study events or durations - CSSS 544: Event History Analysis
want to use network data - CSSS 567: Social Network Analysis
want to work with spatial data - CSSS 554: Spatial Statistics
want to work with time series - CSSS 512: Time Series and Panel Data

Thank you!

Please submit your course evals! I greatly appreciate any feedback you may have.
Remember to submit your final assignment (HW 8; due now!) and provide peer review feedback by Monday at 11:59pm!
Hand in (optional) HW 9 if you are short of the 20 points necessary to pass.
Feel free to reach out at any point in the future with questions or comments!
