Note that throughout this section we will be using the data from Problem Set 6.



# Libraries
library(tidyverse)
library(knitr)
library(stargazer)
library(sandwich)
library(foreign)
library(plotly)
library(reshape2)
library(glmnet)
library(caTools)

# Import data
ps6 = read.dta('/Users/trevorincerti/Box Sync/Graduate/Teaching/500/section_notes/data/qog_std_cs_15may13.dta')

# Drop all observations such that wdi_gnipc has missing values.
ps6_clean = ps6 %>% 
  filter(!is.na(wdi_gnipc))

# Create logged wdi_gnipc variable
ps6_clean$ln_wdi_gnipc = log(ps6_clean$wdi_gnipc)

# Keep variables needed for analysis only
ps6_clean = ps6_clean %>% 
  select(ccode, cname, fh_ipolity2, ln_wdi_gnipc, ciri_empinx_new)

Bivariate regression vs. Multivariate regression

Interpreting a coefficient in a bivariate linear regression (e.g. \(\beta_1\) in \(Y = \alpha + \beta_1 x_1\)).

ps6_lm = lm(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)


                          Dependent variable:
                          Polity score
---------------------------------------------
Logged GNI per capita      0.846***
                          (0.129)
Constant                  -0.390
                          (1.101)
---------------------------------------------
Observations               187
Adjusted R2                0.185
---------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01



[Figure: scatterplot with a fitted `geom_smooth()` regression line, presumably Polity score against logged GNI per capita.]
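A sketch of how such a plot would typically be produced with ggplot2 (the variable pairing and labels are assumptions based on the regression above, not code from the original notes):

# Hypothetical reconstruction of the plot: Polity score vs. logged GNI per capita
ggplot(ps6_clean, aes(x = ln_wdi_gnipc, y = fh_ipolity2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Logged GNI per capita", y = "Polity score")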


Interpreting a coefficient in a multivariate linear regression (\(Y = \alpha + \beta_1 x_1 + \beta_2 x_2\)): \(\beta_1\) is the expected change in \(Y\) associated with a one-unit change in \(x_1\), holding \(x_2\) constant.

\[\beta_1 = \frac{\partial Y}{\partial x_1} \]


ps6_lm2 = lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean)


                          Dependent variable:
                          Polity score
---------------------------------------------
Logged GNI per capita      0.142*
                          (0.072)
CIRI Empowerment score     0.631***
                          (0.027)
Constant                   0.475
                          (0.558)
---------------------------------------------
Observations               187
Adjusted R2                0.792
---------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01



Significance testing

The easy trick

  • If the coefficient is at least twice (technically 1.96 times) the size of the standard error, we may call our result “statistically significant at the .05 level.”

  • If our null hypothesis is that \(\beta_1 = 0\), the coefficient divided by the standard error is the t-statistic (i.e., \(t = \frac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}\)); a quick check in R follows this list.

  • This is very useful for quickly reading regression tables, especially when authors do not report “significance stars.”

  • Why might an author not report stars (hint: not p-hacking)? If it is hard to tell if the coefficient is twice the size of the standard error, what should we conclude?
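A quick check of the trick using the bivariate model fit above:

# Coefficient divided by standard error recovers the t-statistic
coefs <- summary(ps6_lm)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]  # matches the "t value" column of summary(ps6_lm)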


Using confidence intervals

  • A range of values constructed so that, across repeated samples, 95% of such intervals will contain the true unknown value of the parameter.

  • To test the null hypothesis, we need to determine whether \(\hat{\beta_1}\), our estimate for \(\beta_1\), is sufficiently far from zero that we can be confident that \(\beta_1\) is non-zero. How do we use a confidence interval to tell this?
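In R, confidence intervals for the coefficients can be pulled directly from the fitted model:

# 95% confidence intervals for the coefficients of the bivariate model above
confint(ps6_lm, level = 0.95)
# If the interval for ln_wdi_gnipc excludes zero, we reject the null that beta_1 = 0
# at the 5% level.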

Randomization inference

  • For experimental data

  • Considers what would have occurred not only under the random assignment actually realized in the experiment, but under all possible random assignments.

  • How does it work?
    1. Observe the magnitude of the actual treatment effect we saw in practice.
    2. Generate a new, fake treatment assignment for all units.
    3. Re-estimate the treatment effect using the fake treatment assignment.
    4. Repeat steps 2-3 10,000 times.
    5. Calculate the proportion of times that the fake treatment assignment resulted in an effect larger than the actual treatment effect. This is a randomization-inference-based p-value. (A minimal sketch in R follows.)
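A minimal sketch of the procedure with simulated data (the treatment, outcome, and effect size below are made up purely for illustration):

# Simulated experiment (hypothetical data, for illustration only)
set.seed(123)
n <- 200
treat <- sample(rep(c(0, 1), each = n / 2))   # the realized random assignment
y <- 1 + 0.5 * treat + rnorm(n)               # outcome with a true effect of 0.5

# Step 1: the actual estimated treatment effect
obs_effect <- mean(y[treat == 1]) - mean(y[treat == 0])

# Steps 2-4: re-randomize assignment and re-estimate the effect 10,000 times
fake_effects <- replicate(10000, {
  fake_treat <- sample(treat)
  mean(y[fake_treat == 1]) - mean(y[fake_treat == 0])
})

# Step 5: proportion of fake effects at least as large as the observed effect
mean(fake_effects >= obs_effect)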


Cross validation

Sample splitting for test and training

Basic example in OLS. To reduce variability, most methods perform multiple rounds of cross-validation using different partitions and combine (e.g., average) the validation results over the rounds to estimate the model's predictive performance. For example, you could split your dataset into 10 parts, hold out one part as the test set while training on the remaining 90%, and rotate which part serves as the test set 10 times until every observation has been used for testing exactly once. This is called k-fold cross-validation, where k is the number of folds.


Basic one-split example


# Split dataset into training (90%) and test (10%) sets
sample = sample.split(ps6_clean$fh_ipolity2, SplitRatio = .9)
train = subset(ps6_clean, sample == TRUE)
test  = subset(ps6_clean, sample == FALSE)

# Run OLS on training dataset
ols_train <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train)

# Add predictions to the test dataset
test$ols_pred = predict(ols_train, test)

# Calculate out-of-sample MSE
mse_ols = mean((test$fh_ipolity2 - test$ols_pred)^2)
mse_ols
## [1] 1.471299


Example using Monte Carlo cross-validation. The advantage of this method (over k-fold cross-validation) is that the proportion of the training/validation split does not depend on the number of iterations (i.e., the number of partitions). The disadvantage of this method is that some observations may never be selected into the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap.


sims = 100
mse_mc = rep(NA, sims)

for (i in 1:sims) {
  # Draw a fresh random 90/10 split on each iteration
  sample = sample.split(ps6_clean$fh_ipolity2, SplitRatio = .9)
  train = subset(ps6_clean, sample == TRUE)
  test  = subset(ps6_clean, sample == FALSE)

  # Run OLS on training dataset
  ols_train <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train)
  
  # Add predictions to the test dataset
  test$ols_pred = predict(ols_train, test)
  
  # Calculate out-of-sample MSE for this split
  mse_mc[i] <- mean((test$fh_ipolity2 - test$ols_pred)^2)
}

# Average MSE across the random splits
mean_mse <- mean(mse_mc)
print(mean_mse)
## [1] 2.032445
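For comparison, a minimal k-fold version of the same exercise (k = 10 and the fold assignment below are illustrative choices, not code from the original notes):

# Minimal 10-fold cross-validation sketch
k <- 10
folds <- sample(rep(1:k, length.out = nrow(ps6_clean)))  # randomly assign each row to a fold
mse_folds <- rep(NA, k)

for (f in 1:k) {
  train_f <- ps6_clean[folds != f, ]   # train on the other 9 folds
  test_f  <- ps6_clean[folds == f, ]   # hold out fold f for validation
  fit_f   <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train_f)
  mse_folds[f] <- mean((test_f$fh_ipolity2 - predict(fit_f, test_f))^2)
}

mean(mse_folds)  # cross-validated estimate of out-of-sample MSE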


Parametric vs. non-parametric models

Examples

Consider the following statements:

  1. \(f(x) = \alpha + \beta x\) and \(\epsilon_i \sim N(0,1)\)

This is a parametric statistical model with parameters \(\alpha\) and \(\beta\). [If we knew \(\alpha\) and \(\beta\), we would know the exact probability model.]

  2. The regression function is linear and the distribution of the error term is standard normal.

This is identical to the previous statement except described verbally.

  3. \(f(x) = \alpha + \beta x\) and \(|\epsilon_i| < k\) with probability one for some finite \(k\).

Semi-parametric statistical model with parameters of interest \(\alpha\) and \(\beta\). [If we knew \(\alpha\) and \(\beta\), we would not know the exact probability model, but we would know everything we want to know: the regression function.]

  4. The regression function is continuous and \(|\epsilon_i| < k\) with probability one for some finite \(k\).

Non-parametric statistical model. [No way to parameterize the model with a finite number of parameters.]
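As a concrete illustration (not from the original notes), the same regression function can be fit with a parametric and a non-parametric estimator:

# Parametric: two parameters (intercept and slope) pin down the regression function
par_fit <- lm(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)

# Non-parametric: a loess smoother, which cannot be summarized by a finite set of parameters
nonpar_fit <- loess(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)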


OLS vs. Logit vs. Probit

# Create binary outcome variable
ps6_clean$dem = with(ps6_clean, ifelse(fh_ipolity2 > 5, 1, 0))

# OLS
ols = lm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean)

# Logit
logit = glm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean, 
            family = binomial(link = "logit"))

# Probit
probit = glm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean, 
             family = binomial(link = "probit"))

OLS vs. logit vs. probit
                          Dependent variable: Democracy
                          OLS          Logit        Probit
------------------------------------------------------------
Constant                   0.11        -5.41***     -3.08***
                          (0.12)       (1.68)       (0.93)
Logged GNI per capita     -0.02         0.19         0.10
                          (0.02)       (0.19)       (0.11)
CIRI Empowerment score     0.09***      0.74***      0.42***
                          (0.01)       (0.11)       (0.06)
------------------------------------------------------------
Observations               187          187          187
------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
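Because the three models are on different scales, their coefficients are not directly comparable, but their predicted probabilities are. A quick way to compare them using the models fit above:

# Compare fitted probabilities of democracy across the three models
preds <- data.frame(
  ols    = predict(ols),                        # fitted values from the linear probability model
  logit  = predict(logit,  type = "response"),  # predicted probabilities from the logit
  probit = predict(probit, type = "response")   # predicted probabilities from the probit
)
head(round(preds, 3))
# Note that the linear probability model's fitted values can fall outside [0, 1].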


Ridge and LASSO

Note that this uses the cv.glmnet function, where "cv" stands for cross-validation. In this case, 10-fold cross-validation is used to pick the value of the tuning parameter lambda.

10-fold CV randomly divides the observations into 10 non-overlapping folds of approximately equal size. Each fold is used in turn as the validation set while the model is fit on the remaining 9 folds. In the case of LASSO and ridge models, CV chooses the value of the tuning parameter lambda that minimizes cross-validated mean squared error.

# Set up matrices
x = model.matrix(fh_ipolity2 ~ ., ps6_clean)[, -1]
y = ps6_clean$fh_ipolity2
  
# Run ridge regression model
cv_ridge_fit <- cv.glmnet(x, y, alpha = 0)

# Run lasso model
cv_lasso_fit <- cv.glmnet(x, y, alpha = 1)
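
To see what the cross-validation selected, we can inspect the chosen value of lambda and the coefficients at that value (lambda.min minimizes CV error; lambda.1se is a more conservative alternative):

# Inspect the cross-validation results for the LASSO fit
cv_lasso_fit$lambda.min               # lambda that minimizes cross-validated MSE
coef(cv_lasso_fit, s = "lambda.min")  # coefficients at that lambda (many shrunk to exactly zero)
plot(cv_lasso_fit)                    # CV error curve as a function of log(lambda)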