- Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
- (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.)
- Asymptotics are incredibly useful for simple statistical inference and approximations
- (Not covered in this class) Asymptotics often lead to nice understanding of procedures
- Asymptotics generally give no assurances about finite sample performance
- Asymptotics form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs)
- $X \sim \mbox{Poisson}(\lambda t)$
- Estimate $\hat \lambda = X / t$
- $\mbox{Var}(\hat \lambda) = \lambda / t$
- $\hat \lambda / t$ is our variance estimate
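As a quick sketch of that calculation (the values of x and t here are only for illustration; the same numbers come up again in the Poisson interval example later on):

x <- 5; t <- 94.32      # illustrative: 5 events over 94.32 units of monitoring time
lambda_hat <- x / t     # rate estimate, lambda-hat = X / t
lambda_hat / t          # its estimated variance, lambda-hat / t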
Limits of random variables
- Fortunately, for the sample mean there's a set of powerful results.
- These results allow us to talk about the large sample distribution of sample means of a collection of iid observations.
- The first of these results we intuitively know: it says that the average limits to what it's estimating, the population mean. It's called the Law of Large Numbers.
- Example: $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads). As we flip a fair coin over and over, the sample proportion eventually converges to the true probability of a head.
- The LLN forms the basis of frequency-style thinking.
Law of large numbers in action
library(ggplot2)
n <- 10000; means <- cumsum(rnorm(n)) / (1 : n)
g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g
Discussion
- An estimator is consistent if it converges to what you want to estimate.
- The LLN says that the sample mean of an iid sample is consistent for the population mean.
- Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data we get the right answer.
- The sample variance and the sample standard deviation of iid random variables are consistent as well.
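A minimal sketch of consistency in action (my own illustration, assuming standard normal data, so the true variance is 1): the running sample variance settles toward 1 as more data accumulate.

set.seed(1)                                # for reproducibility
n <- 1000; x <- rnorm(n)                   # iid standard normals, true variance 1
runvar <- sapply(2 : n, function(k) var(x[1 : k]))  # sample variance of first k obs
tail(runvar, 1)                            # close to 1 for large n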
The Central Limit Theorem
- The Central Limit Theorem (CLT) is one of the most important theorems in statistics.
- For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases.
- The CLT applies in an endless variety of settings.
- The result is that $\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma}$ has a distribution like that of a standard normal for large $n$.
Example
- Simulate a standard normal random variable by rolling $n$ (six sided) dice
- Let $X_i$ be the outcome for die $i$
- Then note that $\mu = E[X_i] = 3.5$
- $\mbox{Var}(X_i) = 2.92$
- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$
- Let's roll $n$ dice, take their mean, subtract off 3.5, and divide by $1.71 / \sqrt{n}$, and repeat this over and over
Result of our die rolling experiment
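The figure itself isn't reproduced here, but a sketch of the simulation behind it might look like the following (the settings nosim and n, and the histogram, are my assumptions; the lecture's exact code isn't shown):

nosim <- 1000; n <- 10                        # assumed simulation settings
dat <- matrix(sample(1 : 6, nosim * n, replace = TRUE), nosim)  # nosim rows of n die rolls
means <- (apply(dat, 1, mean) - 3.5) / (1.71 / sqrt(n))         # standardized sample means
hist(means, breaks = 30, probability = TRUE)  # should look approximately standard normal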
Coin CLT
- Let $X_i$ be the 0 or 1 result of the $i$th flip of a possibly unfair coin
- The sample proportion, say $\hat p$, is the average of the coin flips
- $E[X_i] = p$ and $\mbox{Var}(X_i) = p(1 - p)$
- Standard error of the mean is $\sqrt{p(1 - p) / n}$
- Then $\frac{\hat p - p}{\sqrt{p(1 - p) / n}}$ will be approximately standard normal for large $n$
Simulation results
Simulation results, p=0.9
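Again the figures aren't reproduced, but a hedged sketch of the coin-flip simulation (the parameter choices are mine):

nosim <- 1000; n <- 10; p <- 0.9              # assumed settings; try p = 0.5 for a fair coin
flips <- matrix(rbinom(nosim * n, size = 1, prob = p), nosim)
z <- sqrt(n) * (apply(flips, 1, mean) - p) / sqrt(p * (1 - p))  # standardized proportions
hist(z, breaks = 30, probability = TRUE)      # noticeably skewed for p = 0.9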
Confidence intervals
- According to the CLT, the sample mean, $\bar X$, is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$
- $\mu + 2\sigma / \sqrt{n}$ is pretty far out in the right tail (only a 2.5% chance of a normal being larger than 2 sds in the tail)
- Similarly, $\mu - 2\sigma / \sqrt{n}$ is pretty far in the left tail (only a 2.5% chance of a normal being smaller than 2 sds in the tail)
- So the probability that $\bar X$ is bigger than $\mu + 2\sigma / \sqrt{n}$ or smaller than $\mu - 2\sigma / \sqrt{n}$ is 5%
- Or, equivalently, the probability of being between these limits is 95%
- The quantity $\bar X \pm 2\sigma / \sqrt{n}$ is called a 95% interval for $\mu$
- The 95% refers to the fact that if one were to repeatedly get samples of size $n$, about 95% of the intervals obtained would contain $\mu$
- The 97.5th quantile is 1.96 (so I rounded to 2 above)
- For a 90% interval you want (100 - 90) / 2 = 5% in each tail, so you want the 95th percentile (1.645)
Give a confidence interval for the average height of sons in Galton’s data
library(UsingR); data(father.son); x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12  # divide by 12 to report in feet rather than inches
## [1] 5.709670 5.737674
Sample proportions
- In the event that each $X_i$ is 0 or 1 with common success probability $p$, then $\sigma^2 = p(1 - p)$
- The interval takes the form $\hat p \pm z_{1 - \alpha/2} \sqrt{\hat p (1 - \hat p) / n}$ (replacing $p$ by $\hat p$ in the standard error gives what is called a Wald confidence interval for $p$)
- For 95% intervals, $\hat p \pm \frac{1}{\sqrt{n}}$ is a quick estimate, since $p(1 - p) \le 1/4$ for any $p$, so $2\sqrt{p(1 - p)/n} \le 1/\sqrt{n}$
Example
Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for you.

- Can you relax? Do you have this race in the bag?
- Without access to a computer or calculator, how precise is this estimate?
- $1/\sqrt{100} = 0.1$, so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66)
- Not enough for you to relax; better go do more campaigning!
- Rough guidelines: 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.

round(1 / sqrt(10 ^ (1 : 6)), 3)
## [1] 0.316 0.100 0.032 0.010 0.003 0.001
Binomial interval
.56 + c(-1, 1) * qnorm(.975) * sqrt(.56 * .44 / 100)  # Wald interval
## [1] 0.4627099 0.6572901
binom.test(56, 100)$conf.int  # exact binomial interval
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95
Simulation
n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000
coverage <- sapply(pvals, function(p){
  phats <- rbinom(nosim, prob = p, size = n) / n
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  mean(ll < p & ul > p)
})
Plot of the results (not so good)
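The plot itself isn't shown, but one way to draw it from the coverage vector above (my own sketch, reusing the document's ggplot2 style):

g <- ggplot(data.frame(p = pvals, coverage = coverage), aes(x = p, y = coverage))
g + geom_line(size = 2) + geom_hline(yintercept = 0.95)  # nominal 95% level for reference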
What’s happening?
Simulation
First let's show that coverage gets better with $n$
n <- 100; pvals <- seq(.1, .9, by = .05); nosim <- 1000
coverage2 <- sapply(pvals, function(p){
  phats <- rbinom(nosim, prob = p, size = n) / n
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  mean(ll < p & ul > p)
})
Plot of coverage for n=100
Simulation
Now let's look at $n=20$ but adding 2 successes and failures
n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000
coverage <- sapply(pvals, function(p){
  phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  mean(ll < p & ul > p)
})
Adding 2 successes and 2 failures
(It's a little conservative: its coverage tends to sit above the nominal 95% level)
Poisson interval
- A nuclear pump failed 5 times out of 94.32 days; give a 95% confidence interval for the failure rate per day.
R code
x <- 5; t <- 94.32; lambda <- x / t
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
## [1] 0.007 0.099
poisson.test(x, T = 94.32)$conf  # exact Poisson interval
## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95
Simulating the Poisson coverage rate
Let's see how this interval performs for lambda values near what we're estimating.
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000
t <- 100
coverage <- sapply(lambdavals, function(lambda){
  lhats <- rpois(nosim, lambda = lambda * t) / t
  ll <- lhats - qnorm(.975) * sqrt(lhats / t)
  ul <- lhats + qnorm(.975) * sqrt(lhats / t)
  mean(ll < lambda & ul > lambda)
})
Coverage
(Gets really bad for small values of lambda, where the expected count $\lambda t$ is small and the normal approximation to the Poisson breaks down)
What if we increase t to 1000?
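A sketch of that rerun, changing only t and keeping lambdavals and nosim as in the simulation above:

t <- 1000   # ten times more monitoring time than before
coverage <- sapply(lambdavals, function(lambda){
  lhats <- rpois(nosim, lambda = lambda * t) / t
  ll <- lhats - qnorm(.975) * sqrt(lhats / t)
  ul <- lhats + qnorm(.975) * sqrt(lhats / t)
  mean(ll < lambda & ul > lambda)
})
# coverage improves markedly, though the smallest lambda values can still fall a bit short of 0.95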
Summary
- The LLN states that averages of iid samples converge to the population means that they are estimating.
- The CLT states that averages are approximately normal, with distributions centered at the population mean with standard deviation equal to the standard error of the mean.
- The CLT gives no guarantee that $n$ is large enough.
- Taking the mean and adding and subtracting the relevant normal quantile times the SE yields a confidence interval for the mean.
  - Adding and subtracting 2 SEs works for 95% intervals.
- Confidence intervals get wider as the coverage increases (why?).
- Confidence intervals get narrower with less variability or larger sample sizes.
- The Poisson and binomial cases have exact intervals that don't require the CLT.
  - But a quick fix for small sample size binomial calculations is to add 2 successes and 2 failures.