Introduction

Officially, the only prerequisite is any course covering linear regression (at any level of detail).

This course does not assume prior knowledge of R or a background in statistical computing. However, the learning curve is steep. For example, by week two you will need to be comfortable with loops in R.

You will also be expected to submit your problem sets using R Markdown (the rmarkdown package).

The primary purpose of these sections is therefore to assist you in gaining the computational skills necessary to understand the material, complete your problem sets, and (most importantly) implement experiments in the real world.


Terminology

Estimand, estimator, and estimate


Loading libraries in R

Below are some useful R preamble functions and libraries that we will use regularly in class. Note that before beginning a new work session or starting work on a new script, you are strongly encouraged to restart your R session.

The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

knitr is a package for producing dynamic reports (e.g., PDFs) directly from R.

kableExtra is a package for building flexible and beautiful tables in R.

randomizr is designed to make conducting field, lab, survey, or online experiments easier by automating the random assignment process.
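A preamble along these lines loads all four (this assumes the packages are already installed, e.g. with install.packages()):

```r
# Restart your R session first, then load the packages used in class.
library(tidyverse)   # data manipulation and plotting (includes dplyr and ggplot2)
library(knitr)       # dynamic report generation
library(kableExtra)  # flexible, attractive tables
library(randomizr)   # random assignment functions
```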



Importing data


First, we are going to import data from a CSV file.
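A minimal import sketch using readr's read_csv() (the actual filename is not shown in these notes, so the path below is a placeholder):

```r
library(readr)  # loaded automatically with the tidyverse

# Placeholder path -- substitute the course's actual CSV file.
dat <- read_csv("potential_outcomes.csv")
dat
```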

## # A tibble: 20 x 4
##       X1    X2   Yi0   Yi1
##    <dbl> <dbl> <dbl> <dbl>
##  1     0     0     6    17
##  2     0     0     9    12
##  3     1     1     8    17
##  4     1     0    14    13
##  5     1     0    15    14
##  6     1     1     5    15
##  7     0     1    10    13
##  8     1     0     8    16
##  9     1     1     8    16
## 10     0     0    11    15
## 11     0     0     7    15
## 12     0     1     5    16
## 13     1     0     8    13
## 14     1     1    10     9
## 15     1     0    13    19
## 16     0     1    11    11
## 17     0     0    10    16
## 18     1     0     8    12
## 19     1     0     6    18
## 20     1     1     4    19



Data cleaning and manipulation in dplyr


Next we will perform some basic data cleaning and manipulation functions using dplyr.

There are many, many more useful functions in dplyr that are not shown here. Consult “R for Data Science” for a comprehensive overview. If you think a function might exist, or would like to manipulate your data in a particular way but don’t know how, try Googling “how to X in dplyr”!
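The printed tibbles below could be produced by select() and mutate() calls along these lines. The small `dat` here is a stand-in with the same column names as the imported data (values copied from its first rows):

```r
library(dplyr)

# Stand-in for the imported data: same columns, first few rows only.
dat <- tibble(X1  = c(0, 0, 1),
              X2  = c(0, 0, 1),
              Yi0 = c(6, 9, 8),
              Yi1 = c(17, 12, 17))

dat %>% select(Yi1, Yi0)            # keep only two columns (and reorder them)
dat %>% select(Yi0, Yi1)            # the same columns in the original order
dat %>% mutate(X3 = 1)              # create a new constant column, appended last
dat %>% mutate(X3 = 1) %>%
  select(X1, X2, X3, Yi0, Yi1)      # reorder so X3 sits with the other covariates
```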


## # A tibble: 100 x 2
##      Yi1   Yi0
##    <dbl> <dbl>
##  1    17     6
##  2    12     9
##  3    17     8
##  4    13    14
##  5    14    15
##  6    15     5
##  7    13    10
##  8    16     8
##  9    16     8
## 10    15    11
## # … with 90 more rows
## # A tibble: 100 x 2
##      Yi0   Yi1
##    <dbl> <dbl>
##  1     6    17
##  2     9    12
##  3     8    17
##  4    14    13
##  5    15    14
##  6     5    15
##  7    10    13
##  8     8    16
##  9     8    16
## 10    11    15
## # … with 90 more rows
## # A tibble: 100 x 5
##       X1    X2   Yi0   Yi1    X3
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     0     0     6    17     1
##  2     0     0     9    12     1
##  3     1     1     8    17     1
##  4     1     0    14    13     1
##  5     1     0    15    14     1
##  6     1     1     5    15     1
##  7     0     1    10    13     1
##  8     1     0     8    16     1
##  9     1     1     8    16     1
## 10     0     0    11    15     1
## # … with 90 more rows
## # A tibble: 100 x 5
##       X1    X2    X3   Yi0   Yi1
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     0     0     1     6    17
##  2     0     0     1     9    12
##  3     1     1     1     8    17
##  4     1     0     1    14    13
##  5     1     0     1    15    14
##  6     1     1     1     5    15
##  7     0     1     1    10    13
##  8     1     0     1     8    16
##  9     1     1     1     8    16
## 10     0     0     1    11    15
## # … with 90 more rows


Random assignment using randomizr and the switching equation

The code below uses the randomizr package to perform simple and complete random assignment.

Simple random assignment assigns each subject to treatment independently with the same probability (e.g., by flipping a (weighted) coin for each subject). Why might simple random assignment not be the best idea?

Complete random assignment allows the researcher to specify exactly how many units are assigned to each condition (e.g., 50 to treatment and 50 to control with N = 100).
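The two assignment schemes can be sketched with randomizr's simple_ra() and complete_ra() functions:

```r
library(randomizr)

N <- 100

# Simple random assignment: an independent coin flip for each subject.
Z_simple <- simple_ra(N = N, prob = 0.5)

# Complete random assignment: exactly m = 50 subjects treated.
Z_complete <- complete_ra(N = N, m = 50)

table(Z_simple)    # treated count varies from draw to draw
table(Z_complete)  # always 50 treated and 50 control
```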



Now let’s do something a bit more complex and combine pipes in dplyr with randomizr. We will do the following:

  1. Create a vector of randomly assigned treatment statuses.
  2. Create a treatment indicator variable \(D\) based on this vector of treatment assignments.
  3. Create a variable of revealed potential outcomes \(Y\) using the switching equation.
  4. Acknowledge the fundamental problem of causal inference by removing the original \(Y_i(0)\) and \(Y_i(1)\) variables from our dataset.
  5. Estimate the average treatment effect as the difference in mean outcomes across treatment groups: \(E[Y_i(1) \mid D = 1] - E[Y_i(0) \mid D = 0]\).
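The five steps above can be sketched as one dplyr pipeline. The `dat` tibble here is a made-up stand-in for the imported data (100 rows of hypothetical potential outcomes):

```r
library(dplyr)
library(randomizr)

# Stand-in for the imported data: 100 rows of made-up potential outcomes.
dat <- tibble(Yi0 = rnorm(100, mean = 10),
              Yi1 = rnorm(100, mean = 15))

dat_assigned <- dat %>%
  mutate(D = complete_ra(N = n(), m = n() / 2)) %>%  # 1-2. assignment vector as indicator D
  mutate(Y = D * Yi1 + (1 - D) * Yi0) %>%            # 3. the switching equation
  select(-Yi0, -Yi1) %>%                             # 4. the fundamental problem of causal inference
  mutate(ATE = mean(Y[D == 1]) - mean(Y[D == 0]))    # 5. difference in means

dat_assigned
```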


## # A tibble: 100 x 3
##        D     Y   ATE
##    <int> <dbl> <dbl>
##  1     0     6   5.2
##  2     0     9   5.2
##  3     1    17   5.2
##  4     1    13   5.2
##  5     1    14   5.2
##  6     0     5   5.2
##  7     1    13   5.2
##  8     1    16   5.2
##  9     0     8   5.2
## 10     0    11   5.2
## # … with 90 more rows


We could also calculate treatment effects for individual units here, but this is a completely hypothetical exercise. Why?
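The unit-level differences below can be computed with a single mutate() call, but only because this simulated dataset (unrealistically) records both potential outcomes. The `dat` stand-in copies the first rows of the imported data:

```r
library(dplyr)

# Stand-in: both potential outcomes are (unrealistically) observed.
dat <- tibble(Yi0 = c(6, 9, 8, 14),
              Yi1 = c(17, 12, 17, 13))

dat %>%
  select(Yi1, Yi0) %>%
  mutate(diff = Yi1 - Yi0)   # unit-level treatment effect
```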


## # A tibble: 100 x 3
##      Yi1   Yi0  diff
##    <dbl> <dbl> <dbl>
##  1    17     6    11
##  2    12     9     3
##  3    17     8     9
##  4    13    14    -1
##  5    14    15    -1
##  6    15     5    10
##  7    13    10     3
##  8    16     8     8
##  9    16     8     8
## 10    15    11     4
## # … with 90 more rows


Group by and summarise


Now we will use another particularly useful combination of dplyr functions often referred to as “group by summarise.” This allows us to calculate a number of summary statistics for a given dataframe/tibble.
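A sketch of such a pipeline, computing a group mean, group size, variance, and standard error by treatment status. The `dat_assigned` tibble is a hypothetical stand-in with a treatment indicator D and observed outcome Y:

```r
library(dplyr)

# Stand-in for the assigned data: treatment indicator D and observed outcome Y.
dat_assigned <- tibble(D = rep(0:1, each = 50),
                       Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

sm <- dat_assigned %>%
  group_by(D) %>%
  summarise(groupmean = mean(Y),
            size      = n(),
            var       = var(Y),
            se        = sqrt(var(Y) / n())) %>%
  mutate(ATE = diff(groupmean))  # difference in group means, repeated on both rows

sm
```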



Let’s display the output of our function in a pretty manner using kable. Note that kable and kableExtra have many more functions for customizing tables for output to both PDF and HTML that you can explore.
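One way to pipe a summary table into kable() and style it with kableExtra (the summary pipeline and data here are hypothetical stand-ins matching the table below):

```r
library(dplyr)
library(knitr)
library(kableExtra)

# Stand-in for the assigned data.
dat_assigned <- tibble(D = rep(0:1, each = 50),
                       Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

tab <- dat_assigned %>%
  group_by(D) %>%
  summarise(groupmean = mean(Y),
            size      = n(),
            var       = var(Y),
            se        = sqrt(var(Y) / n())) %>%
  mutate(ATE = diff(groupmean)) %>%
  kable() %>%           # convert the tibble to a table
  kable_styling()       # apply kableExtra's default styling

tab
```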

D    groupmean    size    var          se           ATE
0    9.2          50      10.285714    0.4535574    5.2
1    14.4         50      9.795918     0.4426267    5.2


Loops


Loops allow you to reduce duplication in your code by repeating a function iteratively. This helps you when you need to do the same thing to multiple inputs, for example repeating the same operation on different rows, columns, or on different datasets.

As a general rule of thumb, you should not copy and paste the same line of code more than two or three times; there will also be times when you need to repeat an operation far more times than is practical by copying lines of code. These kinds of operations are often essential for random sampling and simulation, and therefore for experiments.

First, we will practice a basic loop where we create an empty vector and fill it with random draws from a normal distribution. You can create “for loops,” “while loops,” and “repeat loops” in R. For this class you will need a firm understanding of “for loops.”
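A basic version of that loop (the seed and number of draws are arbitrary choices):

```r
set.seed(94305)  # arbitrary seed, for reproducibility

n_draws <- 1000
draws <- rep(NA, n_draws)  # pre-allocate an empty vector

# Fill the vector one standard-normal draw at a time.
for (i in 1:n_draws) {
  draws[i] <- rnorm(n = 1, mean = 0, sd = 1)
}

head(draws)
```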



Next, we will plot this vector of draws using base R plotting. Note that I am only showing you base R plots so that you can see how much we can improve on this with ggplot in a moment!
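A base R histogram along these lines (the `draws` vector stands in for the one filled by the loop above; titles are placeholders):

```r
draws <- rnorm(1000)  # stands in for the vector filled by the loop above

hist(draws,
     main = "Draws from a standard normal distribution",
     xlab = "Value of draw")
```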



Next week we will practice a procedure for hypothesis testing called randomization inference. This will require you to have a firm understanding of loops, so please ensure that you do, or come to office hours if anything is confusing!


ggplot2


Now we will create a plot using ggplot2—arguably the best and most flexible data visualization platform around. Other languages and packages have sophisticated plotting functions (e.g., Python’s Seaborn, Stata’s plotting functions, Tableau, etc.), but none are as comprehensive, flexible, and developed as ggplot2.


First, let’s recreate the histogram above using ggplot2.
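A ggplot2 version of the same histogram might look like this (bin count is an arbitrary choice):

```r
library(ggplot2)

draws <- rnorm(1000)  # stands in for the vector filled by the loop above

p <- ggplot(data.frame(draws = draws), aes(x = draws)) +
  geom_histogram(bins = 30)
p
```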



Now let’s go step by step through how ggplot works, and construct a visualization of the average treatment effect we calculated above.


ggplot elements can be added iteratively in layers. In other words, you can continually add new elements to or on top of a plot. Let’s start by creating a basic scatterplot of our outcome variable \(Y\) versus our treatment indicator \(D\).
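A basic scatterplot of Y against D could look like this (`dat_assigned` is a hypothetical stand-in for the assigned data from earlier):

```r
library(ggplot2)

# Stand-in for the assigned data: treatment indicator D and observed outcome Y.
dat_assigned <- data.frame(D = rep(0:1, each = 50),
                           Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

p <- ggplot(dat_assigned, aes(x = D, y = Y)) +
  geom_point()
p
```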


That looks a bit odd: we have 50 observations in each treatment group, but only 15 observations appear in control and 13 in treatment on the plot. This is because our points are plotted on top of each other; we can give them some wiggle room using the geom_jitter() function.
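Swapping geom_point() for geom_jitter() spreads overlapping points apart (same hypothetical stand-in data as before):

```r
library(ggplot2)

# Stand-in for the assigned data.
dat_assigned <- data.frame(D = rep(0:1, each = 50),
                           Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

p <- ggplot(dat_assigned, aes(x = D, y = Y)) +
  geom_jitter()  # adds small random noise so overlapping points are visible
p
```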



Maybe we don’t want that much horizontal jitter. Let’s reduce the width, add some alpha (ggplot speak for transparency), and use a more interesting color than black.
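For instance (the specific width, alpha, and color values are arbitrary choices):

```r
library(ggplot2)

# Stand-in for the assigned data.
dat_assigned <- data.frame(D = rep(0:1, each = 50),
                           Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

p <- ggplot(dat_assigned, aes(x = D, y = Y)) +
  geom_jitter(width = 0.1,           # less horizontal spread
              alpha = 0.5,           # semi-transparent points
              color = "dodgerblue")  # any color beats default black
p
```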



Let’s rename our y axis. Never leave default axis labels; always try to make things clearer for your reader.


Now let’s make the x axis clearer.
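Both axis fixes can be layered on in one go. Here labs() renames the y axis and scale_x_continuous() relabels the 0/1 ticks (label text is a placeholder choice):

```r
library(ggplot2)

# Stand-in for the assigned data.
dat_assigned <- data.frame(D = rep(0:1, each = 50),
                           Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

p <- ggplot(dat_assigned, aes(x = D, y = Y)) +
  geom_jitter(width = 0.1, alpha = 0.5, color = "dodgerblue") +
  labs(y = "Observed outcome") +
  scale_x_continuous(breaks = c(0, 1),
                     labels = c("Control", "Treatment"))
p
```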


Now let’s add some functions that clean up the visualization.
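One common clean-up choice is a minimal theme (the specific theme here is an assumption, not necessarily the one used in class):

```r
library(ggplot2)

# Stand-in for the assigned data.
dat_assigned <- data.frame(D = rep(0:1, each = 50),
                           Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

p <- ggplot(dat_assigned, aes(x = D, y = Y)) +
  geom_jitter(width = 0.1, alpha = 0.5, color = "dodgerblue") +
  labs(y = "Observed outcome") +
  scale_x_continuous(breaks = c(0, 1), labels = c("Control", "Treatment")) +
  theme_minimal()  # strips the grey background and chartjunk
p
```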


Finally, let’s add a regression line to characterize the average treatment effect. We will get into this later, but the slope of this line is equal to the average treatment effect.
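A regression line can be added with geom_smooth(); with a binary D, the fitted line's slope equals the difference in group means, i.e. the ATE estimate (stand-in data as before):

```r
library(ggplot2)

# Stand-in for the assigned data.
dat_assigned <- data.frame(D = rep(0:1, each = 50),
                           Y = c(rnorm(50, mean = 9), rnorm(50, mean = 14)))

p <- ggplot(dat_assigned, aes(x = D, y = Y)) +
  geom_jitter(width = 0.1, alpha = 0.5, color = "dodgerblue") +
  geom_smooth(method = "lm", se = FALSE)  # OLS line; slope = difference in means
p
```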



For now we’re keeping things fairly simple, but in the future we will need to create more complex plots. For example, we will want to add confidence intervals to estimates, break charts out by subgroup (called “faceting”) when looking at heterogeneous effects, etc. ggplot can handle all of these tasks with ease.