Last time, we learned about:
ggplot, ggmap, and ggrepel
sf and tidycensus
Today, we will cover,
Reproducibility is not replication.
Reproducible studies can still be wrong... and in fact reproducibility makes proving a study wrong much easier.
Reproducibility means:
Any study that isn't reproducible can be trusted only on faith.
Reproducibility comes in three forms (Stodden 2014):
Empirical: Repeatability in data collection.
Statistical: Verification with alternate methods of inference.
Computational: Reproducibility in cleaning, organizing, and presenting data and results.
R is particularly well suited to enabling computational reproducibility. [1]
[1] Python is equally well suited.
Neither language will fix flawed research design, nor offer a remedy for improper application of statistical methods.
Those are the difficult, non-automatable things you want skills in.
Elements of computational reproducibility:
Shared data
Shared code
Documentation
Version Control
For academic papers, degrees of reproducibility vary:
"Read the article"
Shared data with documentation
Shared data and all code
Interactive document
Research compendium
Interactive documents—like R Markdown docs—combine code and text together into a self-contained document.
Interactive documents allow a reader to examine your computational methods within the document itself; in effect, they are self-documenting.
By re-running the code, readers can reproduce your results on demand.
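As a rough illustration of code and text living in one file, the sketch below writes a minimal R Markdown document from R. The file name, chunk label, and use of the built-in cars data are assumptions made for the example, not part of the original slides.

```r
# Build the three-backtick chunk delimiter programmatically so this example
# can itself sit inside a fenced code block.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"A minimal interactive document\"",
  "output: html_document",
  "---",
  "",
  "The mean stopping distance in the built-in cars data is",
  "`r round(mean(cars$dist), 1)` feet.",
  "",
  paste0(fence, "{r stopping-plot}"),
  "plot(dist ~ speed, data = cars)",
  fence
)

# Hypothetical file name; a temporary directory keeps the sketch self-contained.
rmd_path <- file.path(tempdir(), "paper.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering re-runs every chunk, so the number in the text and the figure are
# regenerated from the data each time. This step assumes the rmarkdown package
# and pandoc are installed, so it is left commented out here.
# rmarkdown::render(rmd_path)
```

Because the reported number is computed inline rather than typed by hand, it cannot drift out of sync with the data.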
Common Platforms:
A research compendium is a portable, reproducible distribution of an article or other project.
Research compendia feature:
An interactive document as the foundation
Files organized in a recognizable structure (e.g. an R package)
Clear separation of data, method, and output. Data are read only.
A well-documented or even preserved computational environment (e.g. Docker)
rrtools by UW's Ben Marwick provides a simplified workflow to accomplish this in R.
bookdown, which is integrated into rrtools, can generate documents in the proper format for articles, theses, books, or dissertations.
bookdown provides an accessible alternative to writing LaTeX for typesetting and reference management.
You can integrate citations and automate reference page generation using BibTeX files (such as those produced by Zotero).
bookdown supports .html output for ease and speed and also renders .pdf files through LaTeX for publication-ready documents.
For University of Washington theses and dissertations, consider Ben Marwick's huskydown package, which uses Markdown but renders via a UW-approved LaTeX template.
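A minimal sketch of how a BibTeX file might be wired into a document, assuming the bookdown::html_document2 output format. The BibTeX entry is a made-up placeholder, and the file and directory names are hypothetical.

```r
# A placeholder BibTeX entry: the key, author, title, and journal are made up
# purely for illustration and do not refer to a real publication.
bib_lines <- c(
  "@article{placeholder2020,",
  "  author  = {Author, Example},",
  "  title   = {A Placeholder Title},",
  "  journal = {Journal of Placeholders},",
  "  year    = {2020}",
  "}"
)

# The YAML header points at the .bib file. In the text, [@placeholder2020]
# becomes a formatted citation and the reference list is generated at the
# end of the document.
rmd_lines <- c(
  "---",
  "title: \"Citation sketch\"",
  "output: bookdown::html_document2",
  "bibliography: references.bib",
  "---",
  "",
  "Earlier work reports a similar pattern [@placeholder2020].",
  "",
  "# References"
)

doc_dir <- file.path(tempdir(), "citation_sketch")
dir.create(doc_dir, showWarnings = FALSE)
writeLines(bib_lines, file.path(doc_dir, "references.bib"))
writeLines(rmd_lines, file.path(doc_dir, "paper.Rmd"))

# Rendering assumes the bookdown and rmarkdown packages plus pandoc are
# installed, so it is left commented out here.
# rmarkdown::render(file.path(doc_dir, "paper.Rmd"))
```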
Organizing research projects is something you either do accidentally—and badly—or purposefully with some upfront labor.
Uniform organization makes switching between or revisiting projects easier.
I suggest something like the following:
project/
  readme.md
  data/
    derived/
      processed_data.RData
    raw/
      core_data.csv
  docs/
    paper.Rmd
  syntax/
    functions.R
    models.R
docs
syntax
data
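The suggested layout could be set up with a few lines of base R. This is only a sketch: it builds the skeleton in a temporary directory, creates empty placeholder files, and makes the raw data file read only as one possible way to protect it.

```r
# Create the skeleton in a temporary location so nothing outside it is touched.
project <- file.path(tempdir(), "project")

dirs <- file.path(project, c("data/raw", "data/derived", "docs", "syntax"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Empty placeholder files; processed_data.RData is omitted because it would
# be produced later by the code in syntax/.
files <- file.path(project, c(
  "readme.md",
  "data/raw/core_data.csv",
  "docs/paper.Rmd",
  "syntax/functions.R",
  "syntax/models.R"
))
file.create(files)

# Drop write permission on the raw data file.
# (On Windows, Sys.chmod only toggles the read-only flag.)
Sys.chmod(file.path(project, "data/raw/core_data.csv"), mode = "0444")

# List what was created, relative to the project root.
list.files(project, recursive = TRUE)
```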
To summarize Jenny Bryan, one should separate workflow from projects.
Workflow:
The software you use to write your code (e.g. RStudio)
The location you store a project
The specific computer you use
The code you ran earlier or typed into your console
Project:
The raw data
The code that operates on your raw data
The packages you use
The output files or documents
Projects should not modify anything outside of the project nor need to be modified by someone else (or future you) to run.
Projects should be independent of your workflow.
For research to be reproducible, it must also be portable. Portable software operates independently of workflow such as fixed file locations.
Do Not:
Use setwd() in scripts or .Rmd files.
Use absolute file paths, e.g. read_csv("C:/my_project/data/my_data.csv").
Use install.packages() in scripts or .Rmd files.
Use rm(list=ls()) anywhere but your console.
Do:
Use project-relative tools (e.g. the here package) to set directories.
Use relative file paths, e.g. read_csv("./data/my_data.csv").
Load required packages with library().
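The portability advice (no fixed file locations, library() rather than install.packages()) might look like the following in a script. This is a sketch under stated assumptions: project_root stands in for the root that an RStudio project or here::here() would supply, and the data file is generated on the spot so the example runs anywhere.

```r
# Load packages with library(); never call install.packages() from a script.
# Base packages are used here so the sketch runs on any R installation.
library(stats)
library(utils)

# Stand-in for the project root. In a real project this would be the working
# directory set by opening the RStudio project, or the value of here::here().
project_root <- file.path(tempdir(), "my_project")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(head(mtcars), file.path(project_root, "data", "my_data.csv"))

# Every path is assembled from the root plus a relative part, so the script
# keeps working when the project folder is moved or copied to another machine.
data_file <- file.path(project_root, "data", "my_data.csv")
my_data <- read.csv(data_file, row.names = 1)
nrow(my_data)
```

Only the definition of the root changes between machines; nothing below it refers to a drive letter or a user directory.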
Often you do not want to include all code for a project in one .Rmd file:
There are two ways to deal with this:
Use separate .R scripts or .Rmd files which save results from complicated parts of a project, then load these results in the main .Rmd file.
Use source() to run external .R scripts when the .Rmd knits.
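Both approaches can be sketched with base R. The script name, the saved file name, and the toy model on the built-in cars data are assumptions for illustration only.

```r
work_dir <- file.path(tempdir(), "split_project")
dir.create(work_dir, showWarnings = FALSE)

# A stand-in for a long-running script kept outside the main document.
# It fits a model and saves the result to disk.
script_path <- file.path(work_dir, "models.R")
model_path  <- file.path(work_dir, "model_fit.rds")
writeLines(c(
  "fit <- lm(dist ~ speed, data = cars)",
  sprintf("saveRDS(fit, %s)", deparse(model_path))
), script_path)

# Second approach: run the external script with source(). Evaluating it in
# its own environment keeps it from overwriting objects in the caller.
script_env <- new.env()
source(script_path, local = script_env)

# First approach: the main document only loads the saved result, so knitting
# does not repeat the expensive computation.
fit <- readRDS(model_path)
coef(fit)
```

Saving results suits slow steps that rarely change, while source() guarantees the external code is re-run on every knit.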
Avoid "closed" or commercial software and file formats except where absolutely necessary.
Use open source software and file formats.
It is always better for science:
It is often just better.
The ideal: Use software that reads and writes raw text.
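A small sketch of what reading and writing raw text means in practice, using CSV as one example of a plain-text format. The file name is hypothetical and the data are the first rows of the built-in cars data.

```r
# Write a small data frame to CSV, a plain-text format any program can open.
csv_path <- file.path(tempdir(), "core_data.csv")
write.csv(head(cars, 3), csv_path, row.names = FALSE)

# The file is human-readable text: a header line followed by one line per row.
raw_text <- readLines(csv_path)
print(raw_text)

# Reading it back recovers the same values without any proprietary software.
round_trip <- read.csv(csv_path)
all(round_trip$speed == head(cars$speed, 3))
```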
Writing and formatting documents are two completely separate jobs.
Word processors—like Microsoft Word—try to do both at the same time, usually badly.
They waste time by leading you to format instead of writing.
Find a good modular text editor and learn to use it:
Version control originates in collaborative software development.
The Idea: All changes ever made to a piece of software are documented, saved automatically, and revertible.
Version control allows all decisions ever made in a research project to be documented automatically.
Version control can:
git is the dominant tool for version control, and GitHub is a free (and now Microsoft-owned) platform for hosting repositories.
Repositories are folders on your computer where all changes are tracked by Git.
Once satisfied with changes, you "commit" them, then "push" them to a remote repository that stores your project.
Others can copy your project ("clone") and fetch your later changes ("pull"), and if you permit, make suggestions for changes.
Committing and pushing changes regularly generates a running "history" that documents the evolution of a project.
git is integrated into RStudio under the Tools menu. It requires some setup. [1]
[1] You can also use the GitHub desktop application.
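For a sense of what commit and history mean, the local half of the cycle can be driven from R by calling the git command-line tool. This is a sketch, not the RStudio workflow itself: it assumes git is installed (and does nothing otherwise), run_git is a hypothetical helper, and the file name and commit identity are throwaway values.

```r
repo <- file.path(tempdir(), "git_sketch")
dir.create(repo, showWarnings = FALSE)
writeLines("x <- 1", file.path(repo, "analysis.R"))

# Hypothetical helper that runs a git command inside `repo`.
# The -c options supply a throwaway identity so the commit works on a
# machine where git has not been configured yet.
run_git <- function(...) {
  system2(
    "git",
    c("-C", shQuote(repo),
      "-c", "user.name=Example", "-c", "user.email=example@example.com",
      ...),
    stdout = TRUE, stderr = TRUE
  )
}

history <- character(0)
if (nzchar(Sys.which("git"))) {
  run_git("init")                                          # start tracking this folder
  run_git("add", "analysis.R")                             # stage the change
  run_git("commit", "-m", shQuote("Add analysis script"))  # record it
  history <- run_git("log", "--oneline")                   # one line per commit
}
history

# Pushing and pulling need a remote repository such as one hosted on GitHub,
# so those steps are not run in this sketch.
```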
Beyond archiving projects and allowing sharing, GitHub also serves as a sort of curriculum vitae for the programmer.
By allowing others to view your projects, you can display competence in programming and research.
If you are planning on working in the private sector, an active GitHub profile will give you a leg up on the competition.
If you are aiming for academia, a GitHub account signals technical competence and an interest in research transparency.
A lot!
You all are now Rockstars!!
Learn more statistics!! (e.g. take more CSSS courses)
Practice, practice, practice!
dplyr verbs, tidy data principles
Do more advanced projects
If you...