Officially, the only pre-requisite is any course covering (at any level of detail) linear regression.

This course does not assume prior knowledge of R or a background in statistical computing. However, the learning curve is steep. For example, by week two you will need to be comfortable with loops in R.

You will also be expected to submit your problem sets using rmarkdown.

The primary purpose of these sections is therefore to assist you in gaining the computational skills necessary to understand the material, complete your problem sets, and (most importantly) implement experiments in the real world.

Estimand, estimator, and estimate

*Estimand:*our quantity of interest.*Estimator:*Statistical procedure (i.e. method) used to make a guess about a parameter. This is a sample statistic, and therefore a random variable. We use the “hat” (\(\hat{\theta}\) vs. \(\theta\)) symbol to denote an estimator.*Estimate:*the value of the estimand that results from use of a particular estimator (i.e. the value the estimator takes on).

Below are some useful R preamble functions and libraries that we will use in class regularly. Note that before beginning a new work session or starting or working on a new script, you are stronly encouraged to restart your R session.

The **tidyverse** is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

**Knitr** is a package for producing dynamic reports (e.g. PDFs) directly with R.

**KableExtra** is a package that builds flxibile and beautiful tables in R.

**randomizr** is designed to make conducting field, lab, survey, or online experiments easier by automating the random assignment process.

```
# ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# Preamble, libraries, and import
# ______________________________________________________________________________
# Libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(randomizr)
# Set a seed
set.seed(30)
```

First, we are going to import data from a CSV file.

```
## # A tibble: 20 x 4
## X1 X2 Yi0 Yi1
## <dbl> <dbl> <dbl> <dbl>
## 1 0 0 6 17
## 2 0 0 9 12
## 3 1 1 8 17
## 4 1 0 14 13
## 5 1 0 15 14
## 6 1 1 5 15
## 7 0 1 10 13
## 8 1 0 8 16
## 9 1 1 8 16
## 10 0 0 11 15
## 11 0 0 7 15
## 12 0 1 5 16
## 13 1 0 8 13
## 14 1 1 10 9
## 15 1 0 13 19
## 16 0 1 11 11
## 17 0 0 10 16
## 18 1 0 8 12
## 19 1 0 6 18
## 20 1 1 4 19
```

Next we will perform some basic data cleaning and manipulation functions using dplyr.

There are many, many more useful functions in dplyr that are not shown here. Consult “R for Data Science” for a comprehensive overview. If you think a function might exist, or would like to manipulate your data in a particular way but don’t know how, try Googling “how to X in dplyr”!

```
# ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# Basic data cleaning and manipulation using dplyr
# ______________________________________________________________________________
# Let's select only the variables we need using the "select" function
week1 %>% select(Yi1, Yi0)
```

```
## # A tibble: 100 x 2
## Yi1 Yi0
## <dbl> <dbl>
## 1 17 6
## 2 12 9
## 3 17 8
## 4 13 14
## 5 14 15
## 6 15 5
## 7 13 10
## 8 16 8
## 9 16 8
## 10 15 11
## # … with 90 more rows
```

```
## # A tibble: 100 x 2
## Yi0 Yi1
## <dbl> <dbl>
## 1 6 17
## 2 9 12
## 3 8 17
## 4 14 13
## 5 15 14
## 6 5 15
## 7 10 13
## 8 8 16
## 9 8 16
## 10 11 15
## # … with 90 more rows
```

```
## # A tibble: 100 x 5
## X1 X2 Yi0 Yi1 X3
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 6 17 1
## 2 0 0 9 12 1
## 3 1 1 8 17 1
## 4 1 0 14 13 1
## 5 1 0 15 14 1
## 6 1 1 5 15 1
## 7 0 1 10 13 1
## 8 1 0 8 16 1
## 9 1 1 8 16 1
## 10 0 0 11 15 1
## # … with 90 more rows
```

```
# We can use "pipes" to complete multiple dplyr procesese at once
week1 %>%
mutate(X3 = 1) %>%
select(X1, X2, X3, Yi0, Yi1)
```

```
## # A tibble: 100 x 5
## X1 X2 X3 Yi0 Yi1
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 1 6 17
## 2 0 0 1 9 12
## 3 1 1 1 8 17
## 4 1 0 1 14 13
## 5 1 0 1 15 14
## 6 1 1 1 5 15
## 7 0 1 1 10 13
## 8 1 0 1 8 16
## 9 1 1 1 8 16
## 10 0 0 1 11 15
## # … with 90 more rows
```

```
# Now let's select only the variable we need and save a new dataframe
week1_clean = week1 %>% select(Yi1, Yi0)
```

The code below uses the **randomizer** package to perform simple and complete random assignment.

**Simple random assignment** assigns all subjects to treatment with an equal probability (e.g., by flipping a (weighted) coin for each subject). Why might simple random assignment not be the best idea?

**Answer:**A different number of subjects might be assigned to each group (e.g., 40 to treatment and 60 to control with N = 100). The number of subjects assigned to treatment is itself a random number.

**Complete random assignment** allows the researcher to specify exactly how many units are assigned to each condition (e.g., 50 to treatment and 50 to control with N = 100).

```
# ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# Random assignment using randomizr and the switching equation
# ______________________________________________________________________________
# We can perform simple random assignment, but this is rare in practice
Z = simple_ra(N = 100) # N = number of observations
# Complete random assignment - all of these functions are equivalent
Z = complete_ra(N = 100) # N = number of observations
Z = complete_ra(N = 100, m = 50) # m = number of units in treatment
Z = complete_ra(N = 100, prob = 0.5) # prob = prob of assignment to treatment
```

Now let’s do something a bit more complex and combine pipes in dplyr with randomizr. We will do the following:

- Create a vector of randomly assigned treatment statuses.
- Create a treatment indicator variable \(D\) based on this vector of treatment assignments.
- Create a variable of revealed potential outcomes \(Y\) using the switching equation.
- Acknowledge the fundamental problem of causal inference by removing the original \(Y_i(0)\) and \(Y_i(1)\) variables from our dataset.
- Calculate the average treatment effect across all units in the sample:

(\(E[Y_i(1) | D = 1] - E[Y_i(0) | D = 0]\)).

```
# Perform functions described above
analysis = week1_clean %>%
mutate(
D = complete_ra(N = 100), # Randomly assign treatment statuses and create D variable
Y = ifelse(D == 1, Yi1, Yi0), # Reveal treatment or control outcomes and create Y variable
Y = (D * Yi1) + (1 - D) * Yi0, # Or using switching equation
) %>%
select(D, Y) %>% # Remove full schedule of potential outcomes. Can never see!
mutate(ATE = mean(Y[D == 1]) - mean(Y[D == 0])) # Calculate ATE
# Show final dataset
analysis
```

```
## # A tibble: 100 x 3
## D Y ATE
## <int> <dbl> <dbl>
## 1 0 6 5.2
## 2 0 9 5.2
## 3 1 17 5.2
## 4 1 13 5.2
## 5 1 14 5.2
## 6 0 5 5.2
## 7 1 13 5.2
## 8 1 16 5.2
## 9 0 8 5.2
## 10 0 11 5.2
## # … with 90 more rows
```

We could also calculate treatment effects for indvidual units here, but this is a completely hypothetical exercise. Why?

**Answer:**In reality, we can never see an individual unit’s treatment*and*control potential outcomes. If a unit is treated, it reveals its treatment outcome. If it is in control, it reveals its control outcome. This fact—that it is impossible to observe the causal effect on a single unit—is known as the*Fundamental Problem of Causal Inference*.

```
## # A tibble: 100 x 3
## Yi1 Yi0 diff
## <dbl> <dbl> <dbl>
## 1 17 6 11
## 2 12 9 3
## 3 17 8 9
## 4 13 14 -1
## 5 14 15 -1
## 6 15 5 10
## 7 13 10 3
## 8 16 8 8
## 9 16 8 8
## 10 15 11 4
## # … with 90 more rows
```

Now we will use another particularly useful combination of dplyr functions often refered to as “group by summarise.” This allows us to calculate a number of summary statistics for a given dataframe/tibble.

```
# Create a dataframe showing us groupmeans and standard errors
groupmeans = analysis %>%
group_by(D) %>%
summarise(
groupmean = mean(Y),
size = n(),
var = var(Y),
se = sqrt(var / size),
ATE = mean(ATE)
)
```

Let’s display the output of our function in a pretty manner using Kable. Note that Kable has many more functions for customizing tables for output to both PDF and HTML that you can explore.

D | groupmean | size | var | se | ATE |
---|---|---|---|---|---|

0 | 9.2 | 50 | 10.285714 | 0.4535574 | 5.2 |

1 | 14.4 | 50 | 9.795918 | 0.4426267 | 5.2 |

Loops allow you to reduce duplication in your code by repeating a function iteratively. This helps you when you need to do the same thing to multiple inputs, for example repeating the same operation on different rows, columns, or on different datasets.

As a general rule of thumb, you should not copy and past the same line of code to repeat a function more than two or three times, but there will also be times when you need to repeat a function more times than is practically possible by copying lines of code. These kinds of operations are often essential for random sampling and simulation, and therefore experiments.

First, we will practice a basic loop where we create a empty vector and fill it with random draws from a normal distribution. You can create “for loops,” “while loops,” and “repeat loops” in R. For this class you will need a firm understanding of “for loops.”

```
# Set the number of times we will draw from normal distribution
draws = 10000 # Try to get in the habit of not hardcoding these values directly into loops
# Create an empty vector of length draws to store our results
vector = rep(NA, draws)
normal_dist = rnorm(n = 10000, mean = 100, sd = 10)
# Basic loop drawing from normal distribution
for (i in 1:draws) {
vector[i] = sample(normal_dist, size = 1, replace = TRUE)
}
```

Next, we will create a plot of this thing using base R plotting. Note that I am only showing you base R plots so that you can see how much we can improve on this with ggplot in a moment!

Next week we will practice a procedure for hypothesis testing called randomization inference. This will require you to have a firm understanding of loops, so please ensure that you do, or come to office hours if anything is confusing!

Now we will create a plot using ggplot2—arguably the best and most flexible data visualization platform around. Other languages and packages have sophisticated plotting functions (e.g., Python’s Seaborn, Stata’s plotting functions, Tableau, etc.), but none are as comprehensive, flexible, and developed as ggplot2.

First, let’s recreate the histogram above using ggplot2.

```
# ggplot requires a dataframe
vector = data.frame(vector)
# Tell R that D is binary - will matter for plotting
analysis = analysis %>% mutate(D = as.factor(D))
# Create histogram
ggplot(vector, aes(vector)) +
geom_histogram(color = "white", fill = "steelblue2") +
xlab("Vector") +
ylab("Frequency") +
theme_classic()
```

Now let’s go step by step through how ggplot works, and construct a visualization of the average treatment effect we calculated above.

ggplot elements can be added iteratively in layers. In other words, you can continually add new elements to or on top of a plot. Let’s start by creating a basic scatterplot of our outcome variable \(Y\) versus our treatment indicator \(D\).

```
# Create and save plot
ate_plot =
ggplot(analysis, aes(x = D, y = Y)) +
geom_point()
# Display plot
ate_plot
```

That’s a bit weird? We have 50 observations in each treatment group, but only 15 observations appear in control and 13 in treatment on the plot. This is because our points are on top of each other, we can give them some wiggle room using the “geom_jitter” function.