UW Statistics Directed Reading Program


In addition to my more formal experiences as a teaching assistant, I've led numerous projects with undergraduate students as part of the UW Statistics Directed Reading Program (DRP). These one-on-one experiences are intended to help undergraduates from a wide variety of backgrounds gain exposure to areas of statistics not often covered in the curriculum and to learn about life as a graduate student and researcher. For more details about the program, click here. Below are the titles, prerequisites, descriptions, and syllabi of my DRP offerings.

Social Choice Analysis of Peer Review Data (Spring 2022)

Students: Mingzhe (Mia) Zhang and Terry Yuan

Prerequisites: Computational skills (R required; other knowledge and experience, e.g., with Python, is desirable). Preference given to Statistics and CSE majors and to candidates interested in and able to continue with the project in Summer and Fall 2022.

Description: In peer review settings, groups or panels of experts are tasked with evaluating submissions such as grant proposals or job candidate materials. For each submission, individual input is often given as a numeric score or a letter grade, and the average or median of those scores is commonly used to summarize the collective opinion of the panel. In this project, we will consider other ways to aggregate expert opinions by drawing a parallel between panel decisions and elections or voting. Every voting procedure has two key features: the type of input it accepts and how those inputs are aggregated. Examples of voting procedures include majority rule, the Borda rule, single transferable vote, and majority judgement. Voting procedures matter in that the choice of procedure can change panel outcomes, i.e., which candidate(s) or proposal(s) are preferred. Social choice theory demonstrates that (a) no voting procedure for selecting one of three or more choices can simultaneously satisfy a small number of natural desiderata (a result known as Arrow's Impossibility Theorem), (b) every voting procedure satisfies some desiderata but not others, and (c) election outcomes can differ depending on which voting system is used. Points (a)-(c) are compelling reasons to better understand the influence of aggregation methods on panel-level outcomes: we will critically assess properties of voting procedures and ask whether these properties should be required or desired of the opinion-aggregation methods used in peer review. The project will involve applying social choice algorithms (e.g., the Borda rule and majority judgement) to de-identified data on panel grant peer review.
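To make the aggregation step concrete, here is a minimal sketch of the Borda rule (the projects themselves use R; this Python version and its toy data are purely illustrative). Each reviewer's ranking awards n - 1 points to their top choice, n - 2 to the next, and so on, and points are summed across reviewers.

```python
from collections import defaultdict

def borda_scores(rankings):
    """Borda rule: in each ranking of n items (best first), the item in
    position i (0-indexed) earns n - 1 - i points; sum across rankings."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    return dict(scores)

# Three hypothetical reviewers rank three proposals, best first.
panel = [["A", "B", "C"],
         ["A", "C", "B"],
         ["B", "A", "C"]]
print(borda_scores(panel))  # {'A': 5, 'B': 3, 'C': 1}
```

Note that A wins here under the Borda rule and also under majority rule (two of three first-place votes); for other profiles of rankings, the two procedures can disagree.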

Voting, Ranking, and Preference Modeling (Autumn 2021)

Student: Carolina Sawyer

Prerequisites: Stat 311 or equivalent.

Description: Preference data appears in many forms: voters deciding between candidates in an election, movie critics rating new releases, and search engines ranking web pages, to name a few! However, modeling preferences statistically can be challenging for a variety of reasons, such as the computational difficulties of working with discrete, high-dimensional data. In this project, we will study a variety of models used for preference data, including both ranking and scoring models. Understanding the challenges and uncertainty in aggregating preferences will be a key focus. Together, we will also carry out an applied project on preference data based on the student's interests.

Syllabus:

  • Week 1: Introduction to Preferences, Rankings, Ratings, and Voting. As you read these, don't try to understand every technical detail or memorize the information (there's no quiz at the end). Instead, focus on the big picture: how does each of these articles, at its level of technicality, relate to understanding the preferences of people or systems? How do different methods of collecting, aggregating, or displaying preferences change behaviors or outcomes? What is confusing? What did you already know?
  • Week 2: Voting Systems
  • Week 3: Ranking Models I. Bradley-Terry and Plackett-Luce models; Luce's Choice Axiom. In each of these papers, focus mostly on the model formulation/applications, and less on the theory/estimation.
    • Bradley-Terry model: Sections 11.6-11.6.3 in "Categorical Data Analysis" by Agresti (2013), accessible online at UW Libraries.
    • Plackett-Luce model: "The Analysis of Permutations", Plackett (1975)
    • Luce's Choice Axiom: An introductory paper by Yellott (2001)
    • Question and Exercise: How are the BT and PL models similar to or different from one another? Come up with a few examples for situations in which each could be used.
  • Week 4: Ranking Models II. Rank distances, Mallows' models. This week, you'll do some independent research, read an important ranking paper, and try out some code in R for ranking models.
    • Rank distances: Learn about the following six distance metrics between rankings: Spearman's Footrule, Spearman's Rank Correlation, Hamming distance, Kendall's tau, Cayley distance, and Ulam's distance. A good place to start learning about them is here (pages 112-119), but feel free to look for other resources as well. Once you understand them, write down two rankings of at least 6 objects and calculate the distance between them using each of the aforementioned metrics. How similar or different are they? Can you write down two rankings that are close on some metrics but far apart on others?
    • Mallows' model: Read this seminal paper on the Mallows' model: "Distance Based Ranking Models" by Fligner and Verducci (1986).
    • Exercise: Install and load the "PerMallows" package in R. Then, generate samples from a Mallows' distribution with a central ranking and scale parameter of your choosing (use the function "rmm"). After, fit a Mallows' model to the data you generated (use the function "lmm"). Repeat this many times, and see how often the estimated central ranking matches the one you provided initially. How accurate was the estimation of the scale parameter, theta? Is there a connection between the accuracy of the two?
  • Week 5: Coding Preference Learning Models. This week, you'll write code to calculate distances between rankings and to find consensus rankings. For each exercise, be sure to demonstrate your functions with examples or illustrations!
    • Exercise 1: Write a function in R to calculate the distance between two rankings. The function should include an argument to specify which distance metric to use, with options being Spearman's Footrule, Spearman's Rank Correlation, Hamming, Kendall's tau, and Cayley.
    • Exercise 2: Write a function in R to calculate the total distance between a "central ranking" and a collection of rankings (i.e., the sum of the distances from the central ranking to each ranking in the collection). Again, your function should allow the user to specify a distance metric.
    • Exercise 3: Write a function that brute-force calculates the "optimal" consensus ranking given a collection of rankings and a distance metric. Optimal means the ranking that has the smallest total distance to the rankings in the collection. If multiple rankings have the smallest total distance, output them all. This function should be used only when the number of objects is 6 or fewer.
    • Exercise 4: Write a function that estimates the "optimal" consensus ranking based on the Kendall's tau distance metric (use the "average rank" procedure discussed last week). This function should be used when the number of objects is greater than 6.
    • Exercise 5: Run a simulation in which you compare the accuracy of the brute-force and estimation procedures to the true consensus ranking using rankings generated from a Mallows distribution for 5 objects and varying scale parameters. Visualize your results!
  • Week 6: Project Discussion. We'll discuss the final project this week. Consider the following options:
    • Summary: Describe the material from our DRP project. What topics, methods, and results did you learn? What was easy, and what was challenging? Where could what you learned be applied in the real world?
    • Tools: Create a web app for individuals to analyze ranking, scoring, or voting data using the models we've learned this quarter. In addition to built-in datasets, allow users to input their own data. How can you visualize or describe the results?
    • Analysis: Find a real-life dataset and analyze it using the models we've learned this quarter. Possible datasets include real-world voting data, movie rankings, music charts, etc.
  • Week 7: Project
  • Week 8: Project
  • Week 9: Project
  • Week 10: Final Presentation
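The distance and consensus exercises from Weeks 4-5 above can be sketched roughly as follows. The exercises themselves call for R; this is a minimal Python illustration of Kendall's tau distance and a brute-force consensus search, with purely illustrative toy data.

```python
from itertools import combinations, permutations

def kendall_tau(r1, r2):
    """Kendall's tau distance: the number of item pairs on which two
    rankings (same items, listed best first) disagree in relative order."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

def total_distance(center, rankings, dist=kendall_tau):
    """Sum of distances from a candidate central ranking to each ranking."""
    return sum(dist(center, r) for r in rankings)

def brute_force_consensus(rankings, dist=kendall_tau):
    """All rankings minimizing total distance (feasible for <= 6 items)."""
    best, argmins = None, []
    for perm in permutations(rankings[0]):
        d = total_distance(perm, rankings, dist)
        if best is None or d < best:
            best, argmins = d, [perm]
        elif d == best:
            argmins.append(perm)
    return argmins, best

data = [("x", "y", "z"), ("x", "z", "y"), ("y", "x", "z")]
consensus, best_total = brute_force_consensus(data)
print(consensus, best_total)  # [('x', 'y', 'z')] 2
```

The same skeleton extends to the other metrics: swap in a different distance function and the consensus search is unchanged, which is exactly why Exercises 2-3 ask for a user-specified metric argument.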

Nonlinear Regression (Winter 2020, Winter 2021, Spring 2021)

Students: Oliver Bejar Tjalve, Alejandro Gonzalez, Muhammad Anas

Prerequisites: A basic knowledge of linear regression and some experience in R.

Description: Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" (ISLR) Chapter 7.
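As a quick illustration of why relaxing linearity helps, here is a minimal sketch (the project itself works in R, following ISLR's labs; this Python version with simulated data is only illustrative) comparing polynomial fits of increasing degree on curved data, scored on a held-out set in the spirit of cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a clearly nonlinear trend.
x = np.linspace(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# Hold out every 4th point for validation.
train = np.ones(x.size, dtype=bool)
train[::4] = False

def holdout_mse(degree):
    """Fit a degree-`degree` polynomial on the training points,
    then return mean squared error on the held-out points."""
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[~train])
    return np.mean((pred - y[~train]) ** 2)

for d in (1, 3, 5):
    print(f"degree {d}: held-out MSE = {holdout_mse(d):.3f}")
```

The linear fit (degree 1) misses the curvature badly, while the cubic and quintic fits track the sine shape; the same compare-on-held-out-data logic underlies the cross-validation we'll use to choose among splines and GAMs.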

Syllabus:

  • Week 1: Introductions
  • Week 2: Review of Linear Regression
  • Week 3: Polynomial Regression, Step Functions, and Basis Functions
    • Readings: ISLR Chapters 7.1-7.3, Lab 7.8.1
  • Week 4: Regression Splines and Multivariate Adaptive Regression Splines (MARS)
  • Week 5: Smoothing Splines, Local Regression, and Kernel Regression
    • Readings: ISLR Chapter 7.5-7.6 and Lab 7.8.2; ESL Chapters 6.1-6.2
  • Week 6: Generalized Additive Models (GAMs) and Project Preliminaries
    • Readings: ISLR Chapter 7.7, Lab 7.8.3
  • Week 7: Project
  • Week 8: Project
  • Week 9: Project
  • Week 10: Final Presentation

History and Practice of Data Communication (Autumn 2020)

Student: Ziyi Li

Prerequisites: None; some experience with R or Python may be helpful but is not required.

Description: In this course, we'll learn about the development of data communication techniques and their modern use. We'll begin by studying how people have visualized patterns in data over time, and consider how those methods reflected the computational resources available in each era. Then, we'll shift our attention to modern issues in data communication, drawing examples from the COVID-19 pandemic and 2020 US presidential election: How do practitioners effectively show complex relationships or model uncertainty? How do people mislead readers through text and figures (intentionally or otherwise)? What common pitfalls exist, and how can we avoid them? We'll finish with a data communication project based on the student's interests.

Syllabus: