Asymptotics

  • Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
  • (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.)
  • Asymptotics are incredibly useful for simple statistical inference and approximations
  • (Not covered in this class) Asymptotics often lead to nice understanding of procedures
  • Asymptotics generally give no assurances about finite sample performance
  • Asymptotics form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs)
  • Limits of random variables

    Fortunately, for the sample mean there is a set of powerful results. These results allow us to talk about the large sample distribution of sample means of a collection of iid observations. The first of these results we intuitively know: it says that the average limits to what it's estimating, the population mean. It's called the Law of Large Numbers. Example: $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads). As we flip a fair coin over and over, the sample proportion eventually converges to the true probability of a head. The LLN forms the basis of frequency style thinking.

    Law of large numbers in action

       library(ggplot2)
       n <- 10000; means <- cumsum(rnorm(n)) / (1:n)
       g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
       g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
       g <- g + labs(x = "Number of obs", y = "Cumulative mean")
       g

    Discussion

    An estimator is consistent if it converges to what you want to estimate. The LLN says that the sample mean of an iid sample is consistent for the population mean. Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data we get the right answer. The sample variance and the sample standard deviation of iid random variables are consistent as well.
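
    As a quick illustration (a sketch of my own, not from the original notes), the cumulative sample variance of standard normals settles toward the true variance of 1:

       library(ggplot2)
       n <- 1000; x <- rnorm(n)
       # sample variance of the first k observations, k = 2, ..., n
       vars <- sapply(2:n, function(k) var(x[1:k]))
       g <- ggplot(data.frame(k = 2:n, v = vars), aes(x = k, y = v))
       g + geom_hline(yintercept = 1) + geom_line() +
         labs(x = "Number of obs", y = "Cumulative sample variance")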

    The Central Limit Theorem

    The Central Limit Theorem (CLT) is one of the most important theorems in statistics. For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases. The CLT applies in an endless variety of settings. The result is that
    $$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} = \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}$$
    has a distribution like that of a standard normal for large $n$. (Replacing the standard error by its estimated value doesn't change the CLT.) The useful way to think about the CLT is that $\bar X_n$ is approximately $N(\mu, \sigma^2/n)$.

    Example

    Simulate a standard normal random variable by rolling $n$ (six sided) dice:

      - Let $X_i$ be the outcome for die $i$
      - Then note that $\mu = E[X_i] = 3.5$
      - $Var(X_i) = 2.92$
      - SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$
      - Let's roll $n$ dice, take their mean, subtract off 3.5, and divide by $1.71 / \sqrt{n}$ and repeat this over and over (sketched below)
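
    A minimal sketch of the experiment (the simulation size and the plotting choices are mine, not from the original notes):

       nosim <- 1000; n <- 10
       # each row is one experiment: n die rolls
       rolls <- matrix(sample(1:6, nosim * n, replace = TRUE), nosim)
       z <- (rowMeans(rolls) - 3.5) / (1.71 / sqrt(n))
       hist(z, probability = TRUE, main = "Standardized die means, n = 10")
       curve(dnorm(x), add = TRUE)  # standard normal density for comparison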

    Result of our die rolling experiment

    Coin CLT

    Let $X_i$ be the 0 or 1 result of the $i$th flip of a possibly unfair coin. The sample proportion, say $\hat p$, is the average of the coin flips. $E[X_i] = p$ and $Var(X_i) = p(1 - p)$. The standard error of the mean is $\sqrt{p(1 - p)/n}$. Then
    $$\frac{\hat p - p}{\sqrt{p(1 - p)/n}}$$
    will be approximately normally distributed. Let's flip a coin $n$ times, take the sample proportion of heads, subtract off .5 and multiply the result by $2 \sqrt{n}$ (divide by $1/(2\sqrt{n})$); a sketch follows.
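
    A minimal sketch of this experiment for a fair coin (again, the simulation setup is my own choice):

       nosim <- 1000; n <- 10
       # each row is one experiment: n fair coin flips (0 or 1)
       flips <- matrix(sample(0:1, nosim * n, replace = TRUE), nosim)
       z <- (rowMeans(flips) - 0.5) * 2 * sqrt(n)
       hist(z, probability = TRUE, main = "Standardized sample proportions, n = 10")
       curve(dnorm(x), add = TRUE)  # standard normal density for comparison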

    Simulation results

    Simulation results, p=0.9

    Confidence intervals

    According to the CLT, the sample mean, $\bar X$, is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$.

      - $\mu + 2\sigma/\sqrt{n}$ is pretty far out in the tail (only a 2.5% chance of a normal being larger than 2 sds in the tail)
      - Similarly, $\mu - 2\sigma/\sqrt{n}$ is pretty far in the left tail (only a 2.5% chance of a normal being smaller than 2 sds in the tail)
      - So the probability that $\bar X$ is bigger than $\mu + 2\sigma/\sqrt{n}$ or smaller than $\mu - 2\sigma/\sqrt{n}$ is 5%
      - Or equivalently, the probability of being between these limits is 95%
      - The quantity $\bar X \pm 2\sigma/\sqrt{n}$ is called a 95% interval for $\mu$
      - The 95% refers to the fact that if one were to repeatedly get samples of size $n$, about 95% of the intervals obtained would contain $\mu$
      - The 97.5th quantile is 1.96 (so I rounded to 2 above)
      - For a 90% interval you want $(100 - 90) / 2 = 5\%$ in each tail, so you want the 95th percentile (1.645); both quantiles are checked below
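
    Both quantiles can be checked directly in R:

       qnorm(c(.95, .975))  # quantiles for 90% and 95% intervals
    ## [1] 1.644854 1.959964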

    Give a confidence interval for the average height of sons in Galton’s data

          library(UsingR); data(father.son); x <- father.son$sheight
          (mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12  # divide by 12: inches to feet
    ## [1] 5.709670 5.737674

    Sample proportions

    In the event that each $X_i$ is 0 or 1 with common success probability $p$, then $\sigma^2 = p(1 - p)$.

      - The interval takes the form
    $$\hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}}$$
      - Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$
      - For 95% intervals,
    $$\hat p \pm \frac{1}{\sqrt{n}}$$
    is a quick CI estimate for $p$; a small helper sketch follows
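
    The Wald interval is easy to wrap in a small helper; `wald_ci` is a hypothetical name of my own, not a function from the notes:

       # hypothetical helper (not from the original notes): Wald interval
       wald_ci <- function(x, n, conf = .95) {
         phat <- x / n
         z <- qnorm(1 - (1 - conf) / 2)
         phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)
       }
       wald_ci(56, 100)  # reproduces the election example below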

    Example

    Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for you.

      * Can you relax? Do you have this race in the bag?
      * Without access to a computer or calculator, how precise is this estimate?
      * $1/\sqrt{100} = 0.1$, so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66)
      * Not enough for you to relax; better go do more campaigning!
      * Rough guidelines: 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.
          round(1 / sqrt(10 ^ (1 : 6)), 3)
    ## [1] 0.316 0.100 0.032 0.010 0.003 0.001

    Binomial interval

          .56 + c(-1, 1) * qnorm(.975) * sqrt(.56 * .44 / 100)
    ## [1] 0.4627099 0.6572901
          binom.test(56, 100)$conf.int
    ## [1] 0.4571875 0.6591640
    ## attr(,"conf.level")
    ## [1] 0.95

    Simulation

          n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000
          coverage <- sapply(pvals, function(p){
            phats <- rbinom(nosim, prob = p, size = n) / n
            ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
            ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
            mean(ll < p & ul > p)
          })

    Plot of the results (not so good)

    What’s happening?

    n isn’t large enough for the CLT to be applicable for many of the values of p - Quick fix, form the interval with
    X+2n+4
    - (Add two successes and failures, Agresti/Coull interval)
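
    For the election example above, a quick check of the adjusted interval (using the same standard error form as the simulation code below):

       # election example: add 2 successes and 2 failures to 56 of 100
       phat <- (56 + 2) / (100 + 4)
       phat + c(-1, 1) * qnorm(.975) * sqrt(phat * (1 - phat) / 100)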

    Simulation

      First let's show that coverage gets better with $n$
        
          n <- 100; pvals <- seq(.1, .9, by = .05); nosim <- 1000
          coverage2 <- sapply(pvals, function(p){
            phats <- rbinom(nosim, prob = p, size = n) / n
            ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
            ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
            mean(ll < p & ul > p)
          })

    Plot of coverage for n=100

    Simulation

        Now let's look at $n=20$ but adding 2 successes and failures

          n <- 20; pvals <- seq(.1, .9, by = .05); nosim <- 1000
          coverage <- sapply(pvals, function(p){
            phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
            ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
            ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
            mean(ll < p & ul > p)
          })

    Adding 2 successes and 2 failures

      (It's a little conservative)
      

    Poisson interval

    • A nuclear pump failed 5 times out of 94.32 days; give a 95% confidence interval for the failure rate per day.
    • $X \sim \mathrm{Poisson}(\lambda t)$
    • Estimate $\hat \lambda = X / t$
    • $Var(\hat \lambda) = \lambda / t$
    • $\hat \lambda / t$ is our variance estimate

    R code

       x <- 5; t <- 94.32; lambda <- x / t
       round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
    ## [1] 0.007 0.099
       poisson.test(x, T = 94.32)$conf
    ## [1] 0.01721254 0.12371005
    ## attr(,"conf.level")
    ## [1] 0.95

    Simulating the Poisson coverage rate

     Let's see how this interval performs for lambda values near what we're estimating:

       lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000
       t <- 100
       coverage <- sapply(lambdavals, function(lambda){
         lhats <- rpois(nosim, lambda = lambda * t) / t
         ll <- lhats - qnorm(.975) * sqrt(lhats / t)
         ul <- lhats + qnorm(.975) * sqrt(lhats / t)
         mean(ll < lambda & ul > lambda)
       })

    Coverage

     (Gets really bad for small values of lambda)
     

    What if we increase t to 1000?
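
    A sketch of the same simulation with only `t` changed from the code above:

       lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000
       t <- 1000  # ten times the monitoring time used above
       coverage <- sapply(lambdavals, function(lambda){
         lhats <- rpois(nosim, lambda = lambda * t) / t
         ll <- lhats - qnorm(.975) * sqrt(lhats / t)
         ul <- lhats + qnorm(.975) * sqrt(lhats / t)
         mean(ll < lambda & ul > lambda)
       })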

    Summary

    The LLN states that averages of iid samples converge to the population means that they are estimating. The CLT states that averages are approximately normal, with distributions centered at the population mean and with standard deviation equal to the standard error of the mean.

      - The CLT gives no guarantee that $n$ is large enough
      - Taking the mean and adding and subtracting the relevant normal quantile times the SE yields a confidence interval for the mean
      - Adding and subtracting 2 SEs works for 95% intervals
      - Confidence intervals get wider as the coverage increases (why?)
      - Confidence intervals get narrower with less variability or larger sample sizes
      - The Poisson and binomial cases have exact intervals that don't require the CLT
      - But a quick fix for small sample size binomial calculations is to add 2 successes and 2 failures
