- Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
- (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.)
- Asymptotics are incredibly useful for simple statistical inference and approximations
- (Not covered in this class) Asymptotics often lead to nice understanding of procedures
- Asymptotics generally give no assurances about finite sample performance
- Asymptotics form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs)
- $X \sim \mbox{Poisson}(\lambda t)$
- Estimate $\hat \lambda = X / t$
- $\mbox{Var}(\hat \lambda) = \lambda / t$
- $\hat \lambda / t$ is our variance estimate
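As a quick sketch of that calculation (the values of x and t here are only for illustration; the same numbers come up again in the Poisson interval example later on):

x <- 5; t <- 94.32      # illustrative: 5 events over 94.32 units of monitoring time
lambda_hat <- x / t     # rate estimate, lambda-hat = X / t
lambda_hat / t          # its estimated variance, lambda-hat / t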
Limits of random variables
- Fortunately, for the sample mean there's a set of powerful results.
- These results allow us to talk about the large sample distribution of sample means of a collection of iid observations.
- The first of these results we intuitively know: it says that the average limits to what it's estimating, the population mean. It's called the Law of Large Numbers.
- Example: $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads). As we flip a fair coin over and over, the sample proportion eventually converges to the true probability of a head.
- The LLN forms the basis of frequency-style thinking.
Law of large numbers in action
library(ggplot2)
n <- 10000; means <- cumsum(rnorm(n)) / (1 : n)
g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g
Discussion
- An estimator is consistent if it converges to what you want to estimate.
- The LLN says that the sample mean of an iid sample is consistent for the population mean.
- Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data we get the right answer.
- The sample variance and the sample standard deviation of iid random variables are consistent as well.
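A minimal sketch of consistency in action (my own illustration, assuming standard normal data, so the true variance is 1): the running sample variance settles toward 1 as more data accumulate.

set.seed(1)                                # for reproducibility
n <- 1000; x <- rnorm(n)                   # iid standard normals, true variance 1
runvar <- sapply(2 : n, function(k) var(x[1 : k]))  # sample variance of first k obs
tail(runvar, 1)                            # close to 1 for large n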
The Central Limit Theorem
- The Central Limit Theorem (CLT) is one of the most important theorems in statistics.
- For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases.
- The CLT applies in an endless variety of settings.
- The result is that $\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma}$ has a distribution like that of a standard normal for large $n$.
Example
- Simulate a standard normal random variable by rolling $n$ (six sided) dice
- Let $X_i$ be the outcome for die $i$
- Then note that $\mu = E[X_i] = 3.5$
- $\mbox{Var}(X_i) = 2.92$
- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$
- Let's roll $n$ dice, take their mean, subtract off 3.5, and divide by $1.71 / \sqrt{n}$, and repeat this over and over
Result of our die rolling experiment
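The figure itself isn't reproduced here, but a sketch of the simulation behind it might look like the following (the settings nosim and n, and the histogram, are my assumptions; the lecture's exact code isn't shown):

nosim <- 1000; n <- 10                        # assumed simulation settings
dat <- matrix(sample(1 : 6, nosim * n, replace = TRUE), nosim)  # nosim rows of n die rolls
means <- (apply(dat, 1, mean) - 3.5) / (1.71 / sqrt(n))         # standardized sample means
hist(means, breaks = 30, probability = TRUE)  # should look approximately standard normal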
Coin CLT
- Let $X_i$ be the 0 or 1 result of the $i$th flip of a possibly unfair coin
- The sample proportion, say $\hat p$, is the average of the coin flips
- $E[X_i] = p$ and $\mbox{Var}(X_i) = p(1 - p)$
- Standard error of the mean is $\sqrt{p(1 - p) / n}$
- Then $\frac{\hat p - p}{\sqrt{p(1 - p) / n}}$ will be approximately standard normal for large $n$
Simulation results
Simulation results, p=0.9
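Again the figures aren't reproduced, but a hedged sketch of the coin-flip simulation (the parameter choices are mine):

nosim <- 1000; n <- 10; p <- 0.9              # assumed settings; try p = 0.5 for a fair coin
flips <- matrix(rbinom(nosim * n, size = 1, prob = p), nosim)
z <- sqrt(n) * (apply(flips, 1, mean) - p) / sqrt(p * (1 - p))  # standardized proportions
hist(z, breaks = 30, probability = TRUE)      # noticeably skewed for p = 0.9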
Confidence intervals
- According to the CLT, the sample mean, $\bar X$, is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$
- $\mu + 2\sigma / \sqrt{n}$ is pretty far out in the right tail (only a 2.5% chance of a normal being larger than 2 sds in the tail)
- Similarly, $\mu - 2\sigma / \sqrt{n}$ is pretty far in the left tail (only a 2.5% chance of a normal being smaller than 2 sds in the tail)
- So the probability that $\bar X$ is bigger than $\mu + 2\sigma / \sqrt{n}$ or smaller than $\mu - 2\sigma / \sqrt{n}$ is 5%
- Or, equivalently, the probability of being between these limits is 95%
- The quantity $\bar X \pm 2\sigma / \sqrt{n}$ is called a 95% interval for $\mu$
- The 95% refers to the fact that if one were to repeatedly get samples of size $n$, about 95% of the intervals obtained would contain $\mu$
- The 97.5th quantile is 1.96 (so I rounded to 2 above)
- For a 90% interval you want (100 - 90) / 2 = 5% in each tail, so you want the 95th percentile (1.645)
Give a confidence interval for the average height of sons in Galton’s data
library(UsingR); data(father.son); x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12  # divide by 12 to report in feet rather than inches
## [1] 5.709670 5.737674
Sample proportions
- In the event that each $X_i$ is 0 or 1 with common success probability $p$, then $\sigma^2 = p(1 - p)$
- The interval takes the form $\hat p \pm z_{1 - \alpha/2} \sqrt{\hat p (1 - \hat p) / n}$ (replacing $p$ by $\hat p$ in the standard error gives what is called a Wald confidence interval for $p$)
- For 95% intervals, $\hat p \pm \frac{1}{\sqrt{n}}$ is a quick estimate, since $p(1 - p) \le 1/4$ for any $p$, so $2\sqrt{p(1 - p)/n} \le 1/\sqrt{n}$
Example
Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for you.

- Can you relax? Do you have this race in the bag?
- Without access to a computer or calculator, how precise is this estimate?
- $1/\sqrt{100} = 0.1$, so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66)
- Not enough for you to relax; better go do more campaigning!
- Rough guidelines: 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.

round(1 / sqrt(10 ^ (1 : 6)), 3)
## [1] 0.316 0.100 0.032 0.010 0.003 0.001
Binomial interval
.56 + c(-1, 1) * qnorm(.975) * sqrt(.56 * .44 / 100)  # Wald interval
## [1] 0.4627099 0.6572901
binom.test(56, 100)$conf.int  # exact binomial interval
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95
Simulation
n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000
coverage <- sapply(pvals, function(p){
  phats <- rbinom(nosim, prob = p, size = n) / n
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  mean(ll < p & ul > p)
})
Plot of the results (not so good)
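The plot itself isn't shown, but one way to draw it from the coverage vector above (my own sketch, reusing the document's ggplot2 style):

g <- ggplot(data.frame(p = pvals, coverage = coverage), aes(x = p, y = coverage))
g + geom_line(size = 2) + geom_hline(yintercept = 0.95)  # nominal 95% level for reference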
What’s happening?
Simulation
First let's show that coverage gets better with $n$
n <- 100; pvals <- seq(.1, .9, by = .05); nosim <- 1000
coverage2 <- sapply(pvals, function(p){
  phats <- rbinom(nosim, prob = p, size = n) / n
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  mean(ll < p & ul > p)
})
Plot of coverage for n=100
Simulation
Now let's look at $n=20$ but adding 2 successes and failures
n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000
coverage <- sapply(pvals, function(p){
  phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  mean(ll < p & ul > p)
})
Adding 2 successes and 2 failures
(It's a little conservative: its coverage tends to sit above the nominal 95% level)
Poisson interval
- A nuclear pump failed 5 times out of 94.32 days; give a 95% confidence interval for the failure rate per day.
R code
x <- 5; t <- 94.32; lambda <- x / t
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
## [1] 0.007 0.099
poisson.test(x, T = 94.32)$conf  # exact Poisson interval
## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95
Simulating the Poisson coverage rate
Let's see how this interval performs for lambda values near what we're estimating.
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000
t <- 100
coverage <- sapply(lambdavals, function(lambda){
  lhats <- rpois(nosim, lambda = lambda * t) / t
  ll <- lhats - qnorm(.975) * sqrt(lhats / t)
  ul <- lhats + qnorm(.975) * sqrt(lhats / t)
  mean(ll < lambda & ul > lambda)
})
Coverage
(Gets really bad for small values of lambda, where the expected count $\lambda t$ is small and the normal approximation to the Poisson breaks down)
What if we increase t to 1000?
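A sketch of that rerun, changing only t and keeping lambdavals and nosim as in the simulation above:

t <- 1000   # ten times more monitoring time than before
coverage <- sapply(lambdavals, function(lambda){
  lhats <- rpois(nosim, lambda = lambda * t) / t
  ll <- lhats - qnorm(.975) * sqrt(lhats / t)
  ul <- lhats + qnorm(.975) * sqrt(lhats / t)
  mean(ll < lambda & ul > lambda)
})
# coverage improves markedly, though the smallest lambda values can still fall a bit short of 0.95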
Summary
- The LLN states that averages of iid samples converge to the population means that they are estimating.
- The CLT states that averages are approximately normal, with distributions centered at the population mean with standard deviation equal to the standard error of the mean.
- The CLT gives no guarantee that $n$ is large enough.
- Taking the mean and adding and subtracting the relevant normal quantile times the SE yields a confidence interval for the mean.
  - Adding and subtracting 2 SEs works for 95% intervals.
- Confidence intervals get wider as the coverage increases (why?).
- Confidence intervals get narrower with less variability or larger sample sizes.
- The Poisson and binomial cases have exact intervals that don't require the CLT.
  - But a quick fix for small sample size binomial calculations is to add 2 successes and 2 failures.