Note that throughout this section we will be using the data from Problem Set 6.



# Libraries
library(tidyverse)
library(knitr)
library(stargazer)
library(sandwich)
library(foreign)
library(plotly)
library(reshape2)
library(glmnet)
library(caTools)

# Import data
ps6 = read.dta('/Users/trevorincerti/Box Sync/Graduate/Teaching/500/section_notes/data/qog_std_cs_15may13.dta')

# Drop all observations such that wdi_gnipc has missing values.
ps6_clean = ps6 %>% 
  filter(!is.na(wdi_gnipc))

# Create logged wdi_gnipc variable
ps6_clean$ln_wdi_gnipc = log(ps6_clean$wdi_gnipc)

# Keep variables needed for analysis only
ps6_clean = ps6_clean %>% 
  select(ccode, cname, fh_ipolity2, ln_wdi_gnipc, ciri_empinx_new)

Bivariate regression vs. Multivariate regression

Interpreting a coefficient in a bivariate linear regression (e.g. \(\beta_1\) in \(Y = \alpha + \beta_1 x_1\)).

ps6_lm = lm(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)


                          Dependent variable:
                          Polity score
---------------------------------------------
Logged GNI per capita      0.846***
                          (0.129)
Constant                  -0.390
                          (1.101)
---------------------------------------------
Observations               187
Adjusted R2                0.185
---------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01



[Figure: scatterplot with a fitted `geom_smooth()` regression line, presumably Polity score against logged GNI per capita.]
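A sketch of how such a plot would typically be produced with ggplot2 (the variable pairing and labels are assumptions based on the regression above, not code from the original notes):

# Hypothetical reconstruction of the plot: Polity score vs. logged GNI per capita
ggplot(ps6_clean, aes(x = ln_wdi_gnipc, y = fh_ipolity2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Logged GNI per capita", y = "Polity score")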


Interpreting a coefficient in a multivariate linear regression (\(Y = \alpha + \beta_1 x_1 + \beta_2 x_2\)): \(\beta_1\) is the expected change in \(Y\) associated with a one-unit change in \(x_1\), holding \(x_2\) constant.

\[\beta_1 = \frac{\partial Y}{\partial x_1} \]


ps6_lm2 = lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean)


                          Dependent variable:
                          Polity score
---------------------------------------------
Logged GNI per capita      0.142*
                          (0.072)
CIRI Empowerment score     0.631***
                          (0.027)
Constant                   0.475
                          (0.558)
---------------------------------------------
Observations               187
Adjusted R2                0.792
---------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01



Significance testing

The easy trick

  • If the coefficient is at least twice (technically 1.96 times) the size of the standard error, we may call our result “statistically significant at the .05 level.”

  • If our null hypothesis is that \(\beta_1 = 0\), the coefficient divided by the standard error is the t-statistic (i.e., \(t = \frac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}\)); a quick check in R follows this list.

  • This is very useful for quickly reading regression tables, especially when authors do not report “significance stars.”

  • Why might an author not report stars (hint: not p-hacking)? If it is hard to tell if the coefficient is twice the size of the standard error, what should we conclude?
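A quick check of the trick using the bivariate model fit above:

# Coefficient divided by standard error recovers the t-statistic
coefs <- summary(ps6_lm)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]  # matches the "t value" column of summary(ps6_lm)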


Using confidence intervals

  • A range of values constructed so that, across repeated samples, 95% of such intervals will contain the true unknown value of the parameter.

  • To test the null hypothesis, we need to determine whether \(\hat{\beta_1}\), our estimate for \(\beta_1\), is sufficiently far from zero that we can be confident that \(\beta_1\) is non-zero. How do we use a confidence interval to tell this?
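In R, confidence intervals for the coefficients can be pulled directly from the fitted model:

# 95% confidence intervals for the coefficients of the bivariate model above
confint(ps6_lm, level = 0.95)
# If the interval for ln_wdi_gnipc excludes zero, we reject the null that beta_1 = 0
# at the 5% level.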

Randomization inference

  • For experimental data

  • Considers what would have occurred not only under the random assignment actually realized in the experiment, but under all possible random assignments.

  • How does it work?
    1. Observe the magnitude of the actual treatment effect we saw in practice.
    2. Generate a new, fake treatment assignment for all units.
    3. Re-estimate the treatment effect using the fake treatment assignment.
    4. Repeat steps 2-3 10,000 times.
    5. Calculate the proportion of times that the fake treatment assignment resulted in an effect larger than the actual treatment effect. This is a randomization-inference-based p-value. (A minimal sketch in R follows.)
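A minimal sketch of the procedure with simulated data (the treatment, outcome, and effect size below are made up purely for illustration):

# Simulated experiment (hypothetical data, for illustration only)
set.seed(123)
n <- 200
treat <- sample(rep(c(0, 1), each = n / 2))   # the realized random assignment
y <- 1 + 0.5 * treat + rnorm(n)               # outcome with a true effect of 0.5

# Step 1: the actual estimated treatment effect
obs_effect <- mean(y[treat == 1]) - mean(y[treat == 0])

# Steps 2-4: re-randomize assignment and re-estimate the effect 10,000 times
fake_effects <- replicate(10000, {
  fake_treat <- sample(treat)
  mean(y[fake_treat == 1]) - mean(y[fake_treat == 0])
})

# Step 5: proportion of fake effects at least as large as the observed effect
mean(fake_effects >= obs_effect)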


Cross validation

Sample splitting for test and training

Basic example in OLS. To reduce variability, most methods perform multiple rounds of cross-validation using different partitions and combine (e.g., average) the validation results over the rounds to estimate the model's predictive performance. For example, you could split your dataset into 10 parts, hold out one part as the test set while training on the remaining 90%, and rotate which part serves as the test set 10 times until every observation has been used for testing exactly once. This is called k-fold cross-validation, where k is the number of folds.


Basic one-split example


# Split dataset into training (90%) and test (10%) sets
sample = sample.split(ps6_clean$fh_ipolity2, SplitRatio = .9)
train = subset(ps6_clean, sample == TRUE)
test  = subset(ps6_clean, sample == FALSE)

# Run OLS on training dataset
ols_train <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train)

# Add predictions to the test dataset
test$ols_pred = predict(ols_train, test)

# Calculate out-of-sample MSE
mse_ols = mean((test$fh_ipolity2 - test$ols_pred)^2)
mse_ols
## [1] 1.471299


Example using Monte Carlo cross-validation. The advantage of this method (over k-fold cross-validation) is that the proportion of the training/validation split does not depend on the number of iterations (i.e., the number of partitions). The disadvantage of this method is that some observations may never be selected into the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap.


sims = 100
mse_mc = rep(NA, sims)

for (i in 1:sims) {
  # Draw a fresh random 90/10 split on each iteration
  sample = sample.split(ps6_clean$fh_ipolity2, SplitRatio = .9)
  train = subset(ps6_clean, sample == TRUE)
  test  = subset(ps6_clean, sample == FALSE)

  # Run OLS on training dataset
  ols_train <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train)
  
  # Add predictions to the test dataset
  test$ols_pred = predict(ols_train, test)
  
  # Calculate out-of-sample MSE for this split
  mse_mc[i] <- mean((test$fh_ipolity2 - test$ols_pred)^2)
}

# Average MSE across the random splits
mean_mse <- mean(mse_mc)
print(mean_mse)
## [1] 2.032445
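For comparison, a minimal k-fold version of the same exercise (k = 10 and the fold assignment below are illustrative choices, not code from the original notes):

# Minimal 10-fold cross-validation sketch
k <- 10
folds <- sample(rep(1:k, length.out = nrow(ps6_clean)))  # randomly assign each row to a fold
mse_folds <- rep(NA, k)

for (f in 1:k) {
  train_f <- ps6_clean[folds != f, ]   # train on the other 9 folds
  test_f  <- ps6_clean[folds == f, ]   # hold out fold f for validation
  fit_f   <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train_f)
  mse_folds[f] <- mean((test_f$fh_ipolity2 - predict(fit_f, test_f))^2)
}

mean(mse_folds)  # cross-validated estimate of out-of-sample MSE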


Parametric vs. non-parametric models

Examples

Consider the following statements:

  1. \(f(x) = \alpha + \beta x\) and \(\epsilon_i \sim N(0,1)\)

This is a parametric statistical model with parameters \(\alpha\) and \(\beta\). [If we knew \(\alpha\) and \(\beta\), we would know the exact probability model.]

  2. The regression function is linear and the distribution of the error term is standard normal.

This is identical to the previous statement except described verbally.

  3. \(f(x) = \alpha + \beta x\) and \(|\epsilon_i| < k\) with probability one for some finite \(k\).

Semi-parametric statistical model with parameters of interest \(\alpha\) and \(\beta\). [If we knew \(\alpha\) and \(\beta\), we would not know the exact probability model, but we would know everything we want to know: the regression function.]

  4. The regression function is continuous and \(|\epsilon_i| < k\) with probability one for some finite \(k\).

Non-parametric statistical model. [No way to parameterize the model with a finite number of parameters.]
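As a concrete illustration (not from the original notes), the same regression function can be fit with a parametric and a non-parametric estimator:

# Parametric: two parameters (intercept and slope) pin down the regression function
par_fit <- lm(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)

# Non-parametric: a loess smoother, which cannot be summarized by a finite set of parameters
nonpar_fit <- loess(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)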


OLS vs. Logit vs. Probit

# Create binary outcome variable
ps6_clean$dem = with(ps6_clean, ifelse(fh_ipolity2 > 5, 1, 0))

# OLS
ols = lm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean)

# Logit
logit = glm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean, 
            family = binomial(link = "logit"))

# Probit
probit = glm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean, 
             family = binomial(link = "probit"))

OLS vs. logit vs. probit
                          Dependent variable: Democracy
                          OLS          Logit        Probit
------------------------------------------------------------
Constant                   0.11        -5.41***     -3.08***
                          (0.12)       (1.68)       (0.93)
Logged GNI per capita     -0.02         0.19         0.10
                          (0.02)       (0.19)       (0.11)
CIRI Empowerment score     0.09***      0.74***      0.42***
                          (0.01)       (0.11)       (0.06)
------------------------------------------------------------
Observations               187          187          187
------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
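Because the three models are on different scales, their coefficients are not directly comparable, but their predicted probabilities are. A quick way to compare them using the models fit above:

# Compare fitted probabilities of democracy across the three models
preds <- data.frame(
  ols    = predict(ols),                        # fitted values from the linear probability model
  logit  = predict(logit,  type = "response"),  # predicted probabilities from the logit
  probit = predict(probit, type = "response")   # predicted probabilities from the probit
)
head(round(preds, 3))
# Note that the linear probability model's fitted values can fall outside [0, 1].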


Ridge and LASSO

Note that this uses the cv.glmnet function, where "cv" stands for cross-validation. In this case, 10-fold cross-validation is used to pick the value of the tuning parameter lambda.

10-fold CV randomly divides the observations into 10 non-overlapping folds of approximately equal size. Each fold is used in turn as the validation set while the model is fit on the remaining 9 folds. In the case of LASSO and ridge models, CV chooses the value of the tuning parameter lambda that minimizes cross-validated mean squared error.

# Set up matrices
x = model.matrix(fh_ipolity2 ~ ., ps6_clean)[, -1]
y = ps6_clean$fh_ipolity2
  
# Run ridge regression model
cv_ridge_fit <- cv.glmnet(x, y, alpha = 0)

# Run lasso model
cv_lasso_fit <- cv.glmnet(x, y, alpha = 1)
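
To see what the cross-validation selected, we can inspect the chosen value of lambda and the coefficients at that value (lambda.min minimizes CV error; lambda.1se is a more conservative alternative):

# Inspect the cross-validation results for the LASSO fit
cv_lasso_fit$lambda.min               # lambda that minimizes cross-validated MSE
coef(cv_lasso_fit, s = "lambda.min")  # coefficients at that lambda (many shrunk to exactly zero)
plot(cv_lasso_fit)                    # CV error curve as a function of log(lambda)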