I.I.D. Random Variables

Independent and identically distributed (i.i.d.). Two random variables \(X\) and \(Y\) are independent if their joint CDF factors into the product of their marginal CDFs:

\[F_{X,Y}(x,y) = F_X(x) \times F_Y(y) \quad \forall x, y\]

They are identically distributed if they share the same marginal CDF:

\[F_X(x) = F_Y(x) \quad \forall x\]


Random sampling

 

Estimation

\[\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\]

\[V[\bar{X}] = \frac{V[X]}{n}\]

Example of the weak law of large numbers in action:
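
One way to see this (a minimal sketch; the Bernoulli(0.5) generative process below is an illustrative assumption) is to track the running sample mean as observations accumulate:

set.seed(1)
x = rbinom(10000, 1, 0.5)                 # assumed Bernoulli(0.5) generative process
running_mean = cumsum(x) / seq_along(x)   # sample mean after each additional draw
plot(running_mean, type = "l", xlab = "Number of observations", ylab = "Sample mean")
abline(h = 0.5, lty = 2)                  # population mean E[X] = 0.5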


Estimation theory

Some terminology

  • Estimand: our quantity of interest.
  • Estimator: Statistical procedure (i.e. method) used to make a guess about a parameter. This is a random variable. We use the “hat” symbol to denote an estimator.
  • Estimate: the value produced by applying a particular estimator to observed data (i.e. the value the estimator takes on).
  • Asymptotics: describing the limiting behavior of a function: what happens as \(n\) becomes very large, i.e. the properties of the function as \(n \rightarrow \infty\).

Unbiasedness

  • Does an estimator give us the right answer on average?
  • Formally, the expected value of an estimator is equal to the true value of our population feature of interest (i.e. \(E[\hat{\theta}] = \theta\))
  • To measure the bias of an estimator: \(\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta\)

Example:

Imagine a population consisting of three units. Each unit has an associated measurement: \(Y_1\), \(Y_2\) and \(Y_3\). You are interested in the average \(Y_{avg} = (Y_1 + Y_2 + Y_3)/3\). You draw a sample of two units without replacement with equal probability and observe their measurements \(\{Y_a, Y_b\}\). You plan to estimate \(Y_{avg}\) with the estimator \(\hat{Y} = \frac{Y_a + Y_b}{2}\).

  1. Derive the bias: \(E[\hat{Y} - Y_{avg}]\).
  2. Derive the variance: \(Var(\hat{Y})\).
  3. Derive the mean squared error: \(E[(\hat{Y} - Y_{avg})^2]\). Hint: MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator.


Answers:

a.) Due to linearity of expectations and the fact that \(Y_{avg}\) is given to be the population mean, \(E[\hat{Y} - Y_{avg}] = E[\hat{Y}] - E[Y_{avg}] = E[\hat{Y}] - Y_{avg}\).

\[E[\hat{Y} - Y_{avg}] = E[\hat{Y}] - Y_{avg} = \frac{\frac{Y_1 + Y_2}{2} + \frac{Y_1 + Y_3}{2} + \frac{Y_2 + Y_3}{2}}{3} - \frac{Y_1 + Y_2 + Y_3}{3} = 0\]

b.) Note that we now know \(E[\hat{Y}]\), so its square is simple to derive. However, \(E[\hat{Y}^2] \neq E[\hat{Y}]^2\) in general: while \(E[\hat{Y}] = Y_{avg}\), \(E[\hat{Y}^2] \neq Y_{avg}^2\).

\[V[\hat{Y}] = E[\hat{Y}^2] - E[\hat{Y}]^2 = \]

\[\frac{\left(\frac{Y_1 + Y_2}{2}\right)^2 + \left(\frac{Y_1 + Y_3}{2}\right)^2 + \left(\frac{Y_2 + Y_3}{2}\right)^2}{3} - \left(\frac{Y_1 + Y_2 + Y_3}{3}\right)^2\] Starting with the first term:

\[\frac{\left(\frac{Y_1 + Y_2}{2}\right)^2 + \left(\frac{Y_1 + Y_3}{2}\right)^2 + \left(\frac{Y_2 + Y_3}{2}\right)^2}{3} = \] \[\frac{Y_1^2 + Y_2^2 + 2Y_1Y_2 + Y_1^2 + Y_3^2 + 2Y_1Y_3 + Y_2^2 + Y_3^2 + 2Y_2Y_3}{12} = \] \[\frac{2(Y_1^2 + Y_2^2 + Y_3^2 + Y_1Y_2 + Y_1Y_3 + Y_2Y_3)}{12} = \] \[\frac{Y_1^2 + Y_2^2 + Y_3^2 + Y_1Y_2 + Y_1Y_3 + Y_2Y_3}{6}\]

For the second term:

\[\big(\frac{Y_1 + Y_2 + Y_3}{3}\big)^2 = \]

\[\frac{Y_1^2 + Y_2^2 + Y_3^2 + 2Y_1Y_2 + 2Y_2Y_3 + 2Y_1Y_3}{9}\] Recombining and subtracting:

\[\frac{3(Y_1^2 + Y_2^2 + Y_3^2 + Y_1Y_2 + Y_1Y_3 + Y_2Y_3)}{18} - \frac{2(Y_1^2 + Y_2^2 + Y_3^2 + 2Y_1Y_2 + 2Y_2Y_3 + 2Y_1Y_3)}{18} = \] \[\frac{Y_1^2 + Y_2^2 + Y_3^2 - Y_1Y_2 - Y_2Y_3 - Y_1Y_3}{18}\]

c.)

\[E[(\hat{Y} - Y_{avg})^2] = \]

\[V[\hat{Y}] + (E[\hat{Y}] - Y_{avg})^2 = \]

\[V[\hat{Y}] + E[\hat{Y}]^2 + Y_{avg}^2 - 2E[\hat{Y}]Y_{avg} = \]

\[V[\hat{Y}] + \] \[\frac{Y_1^2 + Y_2^2 + Y_3^2 + 2Y_1Y_2 + 2Y_2Y_3 + 2Y_1Y_3}{9} + \frac{Y_1^2 + Y_2^2 + Y_3^2 + 2Y_1Y_2 + 2Y_2Y_3 + 2Y_1Y_3}{9} - \] \[2\left(\frac{Y_1 + Y_2 + Y_3}{3}\right)\left(\frac{Y_1 + Y_2 + Y_3}{3}\right)\]

\[ = V[\hat{Y}] + 0 =\]

\[V[\hat{Y}] = \left(\frac{Y_1^2 + Y_2^2 + Y_3^2 - Y_1Y_2 - Y_2Y_3 - Y_1Y_3}{18}\right)\]
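
As a sanity check, a short simulation with hypothetical values for \(Y_1\), \(Y_2\) and \(Y_3\) (the numbers below are arbitrary) approximates all three quantities:

set.seed(1)
Y = c(2, 5, 11)                                # hypothetical measurements Y_1, Y_2, Y_3
Y_avg = mean(Y)
Y_hat = replicate(100000, mean(sample(Y, 2)))  # draw two units without replacement
mean(Y_hat) - Y_avg                            # bias: approximately 0
var(Y_hat)                                     # variance: analytic answer is 63/18 = 3.5
mean((Y_hat - Y_avg)^2)                        # MSE: equals the variance since the bias is 0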


Consistency

  • With enough data, the probability that our estimate \(\hat{\theta}\) is far from the truth \(\theta\) will be close to zero.
  • This connects back to the weak law of large numbers, which states that as the sample size grows, the sample mean under random sampling is increasingly likely to approximate the population mean. In other words, the WLLN states that the sample mean is a consistent estimator for the population mean.
  • This is different from unbiasedness. Unbiasedness is not affected by increasing sample size. An estimator is unbiased if its expected value equals the true parameter value. An estimator can be unbiased but not consistent, or biased but consistent.
  • Unbiasedness is a statement about the expected value of the sampling distribution of the estimator. Consistency is a statement about “where the sampling distribution of the estimator is going” as the sample size increases.
  • E.g. \(\frac{1}{n}\sum x_{i}+\frac{1}{n}\) is a biased estimator of the mean, but as \(n\rightarrow \infty\), it approaches the correct value, and so it is consistent (see the sketch below).
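
A minimal sketch of that biased-but-consistent estimator (the standard normal data below is an illustrative assumption, so the true mean is 0):

set.seed(1)
for (n in c(10, 100, 1000, 10000)) {
  x = rnorm(n)                    # assumed data with true mean 0
  est = mean(x) + 1/n             # biased by exactly 1/n, which vanishes as n grows
  cat("n =", n, " estimate =", round(est, 4), "\n")
}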


Example:

There is an infinite population of units from which you draw a simple random sample of \(n\) units. Each unit in the population has a measurement \(Y_i\). You are interested in the population mean of \(Y_i\), and plan to use the sample mean as your estimator. \(Y_i = 100\) for 5% of the population, and \(Y_i = 0\) for the remaining 95%. Is this estimator consistent?

# Define the sample size
n = 10

# Simulate drawing a sample of n units from the infinite population 10,000 times,
# recording the sample mean each time
estimates = vector(mode = "numeric", length = 10000)

for (i in seq_along(estimates)) {
  rs = rbinom(n, 1, .05) * 100   # each unit is 100 with probability 0.05, else 0
  estimates[i] = mean(rs)
}

# Calculate the estimation error of each estimate
pop_mean = mean(rep(c(0, 100), c(95, 5)))   # population mean = 5
error = pop_mean - estimates
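
The snippet above fixes the sample size at \(n = 10\). A sketch that reruns it for several sample sizes (reusing pop_mean from above) shows the typical estimation error shrinking as \(n\) grows, which is the consistency property at work:

set.seed(1)
for (n in c(10, 100, 1000, 10000)) {
  estimates = replicate(10000, mean(rbinom(n, 1, .05) * 100))
  cat("n =", n, " mean absolute error =", round(mean(abs(estimates - pop_mean)), 3), "\n")
}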


Central limit theorem

  • Most simply, the sampling distribution of the sample mean approximates a normal distribution.
  • If n is large, the sampling distribution of the sample mean will tend to be approximately normal even when the population under consideration is not distributed normally.
  • In addition, given a sufficiently large sample size, the mean of the sample means from a population will be approximately equal to the mean of the original population.
  • Not to be confused with the WLLN. WLLN: as the size of a sample increases, the sample mean will become a more accurate estimate of the population mean. The WLLN therefore refers to a single sample.
  • The CLT pertains to the distribution of sample means.

See below for an example of the central limit theorem in action. A random generative process has PMF:

\[\begin{equation} f(x) = \begin{cases} \frac{1}{6} & : x = 1 \\ \frac{1}{6} & : x = 2 \\ \frac{1}{6} & : x = 3 \\ \frac{1}{6} & : x = 4 \\ \frac{1}{6} & : x = 5 \\ \frac{1}{6} & : x = 6 \\ 0 & : \text{otherwise} \\ \end{cases} \end{equation}\]

What is this process? Now let’s observe 5, 10, 100, 1000, and 10,000 outcomes of this random generative process, and record the sample mean. We will repeat this process 5 times for 5 draws, 10 times for 10 draws, 100 times for 100 draws, etc., and examine the distribution of our sample means. A visual depiction is provided below:
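
A rough sketch of that simulation (the plotting choices here are assumptions) might look like:

set.seed(1)
par(mfrow = c(1, 5))
for (n in c(5, 10, 100, 1000, 10000)) {
  # draw n rolls of a fair die, record the sample mean, and repeat n times
  sample_means = replicate(n, mean(sample(1:6, n, replace = TRUE)))
  hist(sample_means, main = paste("n =", n), xlab = "Sample mean")
}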


Getting ahead of ourselves a bit, this is the basis of the plug-in principle, which we will discuss next week. The plug-in principle allows us to use the sample analogue to estimate population features we are interested in. For example, we can use the sample mean to estimate the expected value, the sample variance to estimate the population variance, etc. Note also the implications of the central limit theorem for statistical inference and hypothesis testing, which we will get to shortly.

Intuition that may help when we begin to discuss hypothesis testing: If we roll a die 5 times and the average value of its roll is 3.7, how certain are we that it is a fair die? How about if we roll a die 10,000 times and the average value of its roll is 3.7? We will return to this later in hypothesis testing.
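
As a rough preview (a sketch, not a formal test), the standard error of the sample mean quantifies this intuition: a deviation of 0.2 from the fair-die mean of 3.5 is small relative to the standard error with 5 rolls, but enormous relative to the standard error with 10,000 rolls.

die_sd = sqrt(mean((1:6 - 3.5)^2))   # population SD of a fair die, about 1.71
die_sd / sqrt(5)                     # SE of the mean with 5 rolls, about 0.76
die_sd / sqrt(10000)                 # SE of the mean with 10,000 rolls, about 0.017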


Another example: part of the beauty of the central limit theorem is that the original distribution that we are sampling from does not have to be normal for the distribution of its sample means to be approximately normal.

Let’s look at an exponential distribution with mean 50.
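
A minimal sketch of how such a figure could be generated (recall that an exponential distribution with mean 50 has rate \(1/50\)):

set.seed(1)
draws = rexp(10000, rate = 1/50)   # exponential draws with mean 50
hist(draws, breaks = 50, main = "Exponential distribution, mean 50", xlab = "x")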


Now let’s draw 1,000 values from this exponential distribution, take their mean, and repeat this 50 times.


Let’s try that again, but this time repeat it 100 times.


Now let’s try it 10,000 times.
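
A sketch of these three simulations (each sample mean computed from 1,000 exponential draws, repeated 50, 100, and 10,000 times; plotting choices are assumptions):

set.seed(1)
par(mfrow = c(1, 3))
for (reps in c(50, 100, 10000)) {
  # each sample mean is computed from 1,000 exponential draws with mean 50
  sample_means = replicate(reps, mean(rexp(1000, rate = 1/50)))
  hist(sample_means, main = paste(reps, "sample means"), xlab = "Sample mean")
}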