Simulating and Visualising the Central Limit Theorem (blog.foletta.net)

130 points by gjf 11 hours ago | 47 comments

antognini 7 minutes ago [-]

There's an interesting extension of the Central Limit Theorem called the Edgeworth Series. If you have a large but finite sample, the resulting distribution will be approximately Gaussian, but will deviate from a Gaussian distribution in a predictable way described by Hermite polynomials.

https://en.wikipedia.org/wiki/Edgeworth_series

jethkl 5 hours ago [-]

There is an analogue of the CLT for extreme values, the Fisher–Tippett–Gnedenko theorem: if the properly normalized maximum of an i.i.d. sample converges to a non-degenerate limit, that limit must be Gumbel, Fréchet, or Weibull, which are unified as the Generalized Extreme Value distribution. Unlike the CLT, whose assumptions (in my experience) rarely hold in practice, this result is extremely general and underpins methods like wavelet thresholding and signal denoising. It is easy to demonstrate with a quick simulation.
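A minimal sketch of such a simulation, under assumptions that are not in the comment: the draws are Exponential(1), the maximum of n draws is normalized by subtracting log(n), and the function name `simulate_maxima` is made up for illustration. For this case the limit is the standard Gumbel distribution, whose mean is the Euler–Mascheroni constant (about 0.577) and whose variance is π²/6 (about 1.645), so the simulated moments should land close to those values.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649  # mean of the standard Gumbel distribution


def simulate_maxima(n, reps, seed=0):
    """Return `reps` maxima of n i.i.d. Exponential(1) draws, each shifted by log(n)."""
    rng = random.Random(seed)
    return [
        max(rng.expovariate(1.0) for _ in range(n)) - math.log(n)
        for _ in range(reps)
    ]


maxima = simulate_maxima(n=200, reps=5000)

# Compare simulated moments with the Gumbel limit.
print("mean    :", statistics.fmean(maxima), "vs Gumbel", EULER_GAMMA)
print("variance:", statistics.pvariance(maxima), "vs Gumbel", math.pi ** 2 / 6)
```

Other parent distributions need a different normalization and can land in the Fréchet or Weibull family instead; the exponential is just the easiest case to normalize by hand.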

kqr 4 hours ago [-]

There's also a more conservative rule similar to the CLT that works off of the definition of variance, and thus rests on no assumptions other than the existence of variance. Chebyshev's inequality tells us that the probability that any sample is more than k standard deviations away from the mean is bounded by 1/k².

In other words, it is possible (given sufficiently weird distributions) that not a single sample lands inside one standard deviation, but 75% of them must be inside two standard deviations, 88% inside three standard deviations, and so on.

There's also a one-sided version of it (Cantelli's inequality) which bounds the probability of a sample landing more than k standard deviations above the mean by 1/(1+k²), meaning at least 50% of samples must be less than one standard deviation above the mean, 80% less than two standard deviations, etc.

Think of this during the next financial crisis when bank people no doubt will say they encountered "six sigma daily movements which should happen only once every hundred million years!!" or whatever. According to the CLT, sure, but for sufficiently odd distributions the Cantelli bound might be a more useful guide, and it says six sigma daily movements could happen as often as every 37 days.
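To make the arithmetic concrete, here is a small sketch that tabulates both bounds next to the one-sided normal tail. It takes Cantelli's inequality in its standard form, 1/(1+k²); the function names are made up for illustration.

```python
import math


def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k*sigma) for any distribution with finite variance."""
    return min(1.0, 1.0 / k ** 2)


def cantelli_bound(k):
    """Upper bound on the one-sided P(X - mu >= k*sigma), standard form 1/(1 + k^2)."""
    return 1.0 / (1.0 + k ** 2)


def normal_upper_tail(k):
    """Exact one-sided tail P(Z >= k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2))


for k in (1, 2, 3, 6):
    print(
        f"k={k}: Chebyshev <= {chebyshev_bound(k):.4f}, "
        f"Cantelli <= {cantelli_bound(k):.4f}, "
        f"normal tail = {normal_upper_tail(k):.3e}"
    )

# Worst case allowed by Cantelli: one six-sigma move every 1/bound days.
print("six sigma, worst case: once every", round(1 / cantelli_bound(6)), "days")
```

These are worst-case bounds, so real data will usually sit well inside them; the point is only how far the worst case is from the normal tail.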

ivan_ah 3 hours ago [-]

I love the simulations. They are such a good way to learn STATS... you can still look at the theorem using math notation after, but if you've seen it work first using simulated random samples, then the math will make a lot more sense.

Here is a notebook with some more graphs and visualizations of the CLT: https://nobsstats.com/site/notebooks/28_random_samples/#samp...

runnable link: https://mybinder.org/v2/gh/minireference/noBSstats/main?labp...

gus_massa 2 hours ago [-]

> It’s very subjective, but I think the uniform starts looking reasonably good at a sample size of 8. The exponential however takes much longer to converge to a normal.

That's a good observation. The main idea behind the proof of the Central Limit Theorem is to take the Fourier transform, operate on it, and then transform back. After normalization, the result is that the distribution of the sum of N variables is something like

  Normal(X) + 1/sqrt(N) * "Skewness" * Something(X) + 1/N * IDont * Remember(X) + ...

Where "Skewness" is a number defined in https://en.wikipedia.org/wiki/Skewness

The uniform distribution is symmetric, so skewness=0 and the correction decreases like 1/N.

The exponential distribution is very asymmetrical, so skewness!=0, the main correction goes like 1/sqrt(N), and it takes longer to disappear.
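One way to see the role of skewness is to measure it on simulated sample means. The skewness of the mean of N i.i.d. draws equals the population skewness divided by sqrt(N), so for the exponential (skewness 2) it should be roughly 1, 0.5 and 0.25 at N = 4, 16 and 64, while for the uniform it should hover around 0 at every N. The sketch below assumes the article's populations, uniform on [-20, 20] and exponential with rate 0.4; the helper names are made up for illustration.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


def mean_skewness(draw, n, reps, seed=0):
    """Skewness of `reps` sample means, each computed from n draws of `draw(rng)`."""
    rng = random.Random(seed)
    means = [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return skewness(means)


def draw_exponential(rng):
    return rng.expovariate(0.4)  # population skewness is 2


def draw_uniform(rng):
    return rng.uniform(-20, 20)  # symmetric, population skewness is 0


for n in (4, 16, 64):
    print(
        f"N={n:3d}: exponential {mean_skewness(draw_exponential, n, 5000):.3f} "
        f"(theory {2 / math.sqrt(n):.3f}), "
        f"uniform {mean_skewness(draw_uniform, n, 5000):.3f} (theory 0)"
    )
```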

niemandhier 10 hours ago [-]

Highly entertaining, here's a little fun fact: there exists a generalisation of the central limit theorem for distributions without find out variance.

For some reason this is much less well known, although the implications are vast. Via the detour of stable distributions and limiting distributions, this generalised central limit theorem plays an important role in the rise of power laws in physics.

Tachyooon 9 hours ago [-]

3blue1brown has a great series of videos on the central limit theorem, and it makes me wish there were something covering the generalised form in a similar format. I have a textbook on my reading list that covers it, unfortunately I can't seem to find it or the title right now. (edit: it's "The Fundamentals of Heavy Tails" by Nair, Wierman, and Zwart from 2022)

Do you have any good sources for the physics angle?

hodgehog11 9 hours ago [-]

I thought the rise of power laws in physics is predominantly attributed to Kesten's law concerning multiplicative processes, e.g. https://arxiv.org/pdf/cond-mat/9708231

usgroup 9 hours ago [-]

https://en.wikipedia.org/wiki/Central_limit_theorem#The_gene...

nextos 5 hours ago [-]

Yes, came here to say the same thing. Telling people that the CLT makes strong assumptions is important.

Otherwise, they might end up underestimating rare events, with potentially catastrophic consequences. There are also CLTs for product and max operators, aside from the sum.

The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation discusses these topics in a rigorous way, but without excessive mathematics. See: https://adamwierman.com/book

selimthegrim 3 hours ago [-]

I was at the NeurIPS workshop and saw one of the talks towards the end and it was pretty good

kgwgk 9 hours ago [-]

> find out

Finite?

ngriffiths 1 hours ago [-]

I was definitely expecting you'd need a higher sample size for the Q-Q plots to start looking good. All the points in other comments about drawbacks or poorly behaved distributions are well taken, and this is nothing new, but wow it really does work well.

kqr 4 hours ago [-]

This is a very neat illustration, but I want to leave a reminder that when we cherry-pick well-behaved distributions for illustrating the CLT, people get unrealistic expectations of what it means: https://entropicthoughts.com/it-takes-long-to-become-gaussia...

firesteelrain 5 hours ago [-]

“You’re also likely not going to have the resources to take twenty-thousand different samples.”

There are methods to estimate how many samples you need. It’s not in the 20k unless your population is extremely high

jdhwosnhw 4 hours ago [-]

I’m not sure what you mean by “higher population”, but FYI the required number of samples depends on the full shape of the underlying distribution. For instance, the Berry–Esseen inequality bounds the convergence rate in terms of the variance and the third absolute central moment of the underlying distribution. But the point is that the convergence rate to Gaussian can be arbitrarily slow!

https://en.m.wikipedia.org/wiki/Berry%E2%80%93Esseen_theorem
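As a rough illustration, the Berry–Esseen bound on the largest gap between the CDF of the standardized sample mean and the normal CDF is C·ρ/(σ³·sqrt(n)), where σ is the standard deviation and ρ = E|X − μ|³. The sketch below evaluates it for a Bernoulli(p) variable. The constant 0.4748 is an assumption here (a published upper bound for the i.i.d. case as far as I recall), and the function names are made up for illustration.

```python
import math

# Assumed value of the universal constant C for the i.i.d. case.
BERRY_ESSEEN_C = 0.4748


def bernoulli_shape_ratio(p):
    """rho / sigma^3 for Bernoulli(p), where rho = E|X - p|^3."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)
    return rho / sigma ** 3


def berry_esseen_bound(p, n):
    """Upper bound on the maximum CDF gap to the normal for the mean of n draws."""
    return BERRY_ESSEEN_C * bernoulli_shape_ratio(p) / math.sqrt(n)


def samples_needed(p, eps):
    """Smallest n for which the bound guarantees a CDF gap of at most eps."""
    return math.ceil((BERRY_ESSEEN_C * bernoulli_shape_ratio(p) / eps) ** 2)


for p in (0.5, 0.1, 0.01):
    print(
        f"p={p}: bound at n=60 is {berry_esseen_bound(p, 60):.3f}, "
        f"n needed for a gap <= 0.05: {samples_needed(p, 0.05)}"
    )
```

The ratio ρ/σ³ grows without limit as p approaches 0 or 1, which is one concrete way the convergence can be made arbitrarily slow.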

kqr 4 hours ago [-]

> It’s not in the 20k unless your population is extremely high

Common misconception. Population size has almost nothing to do with the necessary sample size. (It does enter into the finite population correction factor, but that's only really relevant if you have a small population, not a large one.)

...actually, come to think of it, you meant to write "unless your population variance is extremely high", right?
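For the finite population correction mentioned above, a quick sketch of the usual factor sqrt((N − n)/(N − 1)), which multiplies the standard error when sampling n items without replacement from a population of N. The function name is made up for illustration.

```python
import math


def finite_population_correction(population_size, sample_size):
    """Multiplier applied to the standard error when sampling without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


# The same sample of 1,000 drawn from ever larger populations.
for population_size in (2_000, 10_000, 1_000_000, 100_000_000):
    fpc = finite_population_correction(population_size, 1_000)
    print(f"N={population_size:>11,}: correction factor {fpc:.4f}")
```

The factor only moves noticeably away from 1 when the sample is a sizeable fraction of the population, which is why a larger population does not call for a larger sample.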

firesteelrain 3 hours ago [-]

Right and the methods I had in mind were the usual ones: proportion-based, mean-based, and power analysis, depending on what’s being measured. Thanks for catching that

ForceBru 5 hours ago [-]

Speaking of CLTs, is there a good book or reference paper that discusses various CLTs (not just the basic IID one) in a somewhat introductory manner?

vinnyorvinny 5 hours ago [-]

For a very light discussion, Tao does a good job here: https://terrytao.wordpress.com/2015/11/19/275a-notes-5-varia...

lottin 9 hours ago [-]

Looking at the R code in this article, I'm having a hard time understanding the appeal of tidyverse.

ngriffiths 1 hours ago [-]

For me the appeal is less that tidyverse is great and more that the R standard library is horrible. It's full of esoteric names, inconsistent use and order of parameters, unreasonable default behavior, behavior that surprises you coming from other programming experience. It's all in a couple massive packages instead of broken up into manageable pieces.

Tidyverse is imperfect and it feels heavy-handed and awkward to replace all the major standard library functions, but Tidyverse stuff is way more ergonomic.

gjf 8 hours ago [-]

Author here; I think I understand where you might be coming from. I find the functional nature of R combined with pipes incredibly powerful and elegant to work with.

OTOH in a pipeline, you're mutating/summarising/joining a data frame, and it's really difficult to look at it and keep track of what state the data is in. I try my best to write in a way that you understand the state of the data (hence the tables I spread throughout the post), but I do acknowledge it can be inscrutable.

lottin 8 hours ago [-]

A "pipe" is simply a composition of functions. Tidyverse adds a different syntax for doing function composition, using the pipe operator, which I don't particularly like. My general objection to Tidyverse is that it tries to reinvent everything but the end result is a language that is less practical and less transparent than standard R.

mi_lk 8 hours ago [-]

Can you rewrite some of those snippets in standard R w/o Tidyverse? Curious what it would look like

lottin 5 hours ago [-]

I didn't rewrite the whole thing. But here's the first part. It uses the `histogram` function from the lattice package.

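    library(lattice)  # provides histogram()
    
    # Six populations of 10,000 draws each
    population_data <- data.frame(
        uniform = runif(10000, min = -20, max = 20),
        normal = rnorm(10000, mean = 0, sd = 4),
        binomial = rbinom(10000, size = 1, prob = .5),
        beta = rbeta(10000, shape1 = .9, shape2 = .5),
        exponential = rexp(10000, .4),
        chisquare = rchisq(10000, df = 2)
    )
    
    # One histogram panel per population, each with its own x-axis scale
    histogram(~ values|ind, stack(population_data),
              layout = c(6, 1),
              scales = list(x = list(relation="free")),
              breaks = NULL)
    
    # Draw one sample from a population vector; return its mean and standard deviation
    take_random_sample_mean <- function(data, sample_size) {
        x <- sample(data, sample_size)
        c(mean = mean(x), sd = sd(x))
    }
    
    # 20,000 replications of a size-60 sample from every population.
    # Result: a 2 x 6 x 20000 array (statistic x population x replication).
    sample_statistics <- replicate(20000, sapply(population_data, take_random_sample_mean, 60))
    
    # Reshape to one row per replication and one column per population
    sample_mean <- as.data.frame(t(sample_statistics["mean", , ]))
    sample_sd <- as.data.frame(t(sample_statistics["sd", , ]))
    
    histogram(sample_mean[["uniform"]])
    histogram(sample_mean[["binomial"]])
    
    # Sampling distributions of the mean, one panel per population
    histogram(~values|ind, stack(sample_mean), layout = c(6, 1),
              scales = list(x = list(relation="free")),
              breaks = NULL)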

apwheele 6 hours ago [-]

I mean, for the main simulation I would do it like this:

    set.seed(10)
    n <- 10000; samp_size <- 60
    # Six populations of n draws each
    df <- data.frame(
        uniform = runif(n, min = -20, max = 20),
        normal = rnorm(n, mean = 0, sd = 4),
        binomial = rbinom(n, size = 1, prob = .5),
        beta = rbeta(n, shape1 = .9, shape2 = .5),
        exponential = rexp(n, .4),
        chisquare = rchisq(n, df = 2)
    )
    
    # Sample samp_size rows (the same rows for every column) and return the column means
    sf <- function(df,samp_size){
        sdf <- df[sample.int(nrow(df),samp_size),]
        colMeans(sdf)
    }
    
    # 20,000 replications: one row per replication, one column per population
    sim <- t(replicate(20000,sf(df,samp_size)))

I am old, so I do not like tidyverse either -- I can concede it is of personal preference though. (Personally do not agree with the lattice vs ggplot comment for example.)

RA_Fisher 9 hours ago [-]

Why? The tidyverse is so readable, elegant, compositional, functional and declarative. It allows me to produce a lot more and higher quality than I could without it. ggplot2 is the best visualization software hands down, and dplyr leverages Unix’s famous point free programming style (that reduces the surface area for errors).

lottin 9 hours ago [-]

I disagree. In this example tidyverse looks convoluted compared to just using an array and apply. ggplot2 is okay but we already had lattice. Lattice does everything ggplot2 does and produces much better-looking plots IMO.

RA_Fisher 6 hours ago [-]

I like simplicity and I love a good base R idiom, but there's a lot less consistency in base R compared to the tidyverse (and that comes with a productivity penalty).

Lattice is really low-level. It's like doing vis with matplotlib (requires a lot of time and hair-pulling). Higher level interfaces boost productivity.

ekianjo 9 hours ago [-]

the equivalent in any other language would be an ugly, unreadable, inconsistent mess.

globalnode 6 hours ago [-]

The definition under "A Brief Recap" seems incorrect. The sample size doesn't approach infinity, the number of samples does. I'm in a similar situation to the author, I skipped stats, so I could be wrong. Overall great article though.

k2enemy 5 hours ago [-]

It is correct in the article. As the sample size approaches infinity, the distribution of the sample means approaches normal.

https://en.wikipedia.org/wiki/Central_limit_theorem
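A quick way to check which quantity matters is to hold the number of samples fixed and grow the sample size n. The raw sample means do collapse onto the population mean, with a spread of σ/sqrt(n), but the means rescaled by sqrt(n) keep a spread of σ, and it is that rescaled distribution that approaches the normal. The sketch below assumes the article's uniform population on [-20, 20], for which σ = 40/sqrt(12), about 11.55; the function name is made up for illustration.

```python
import math
import random
import statistics

SIGMA = 40 / math.sqrt(12)  # standard deviation of Uniform(-20, 20)


def sd_of_sample_means(n, reps, seed=0):
    """Standard deviation of `reps` sample means, each computed from n uniform draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.uniform(-20, 20) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.pstdev(means)


for n in (5, 50, 500):
    sd = sd_of_sample_means(n, reps=2000)
    print(
        f"sample size {n:3d}: sd of means {sd:.3f}, "
        f"rescaled by sqrt(n) {sd * math.sqrt(n):.3f} (sigma = {SIGMA:.3f})"
    )
```

Increasing `reps` instead only gives a smoother histogram of the same distribution; it does not change its shape.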

jaccola 6 hours ago [-]

Yes indeed, if the sample size approached infinity (and not the number of samples), you would essentially just be calculating the mean of the original distribution.

tucnak 9 hours ago [-]

Obligatory 3Blue1Brown reference

https://www.youtube.com/watch?v=zeJD6dqJ5lo

oriettaxx 9 hours ago [-]

and the Galton Board https://en.m.wikipedia.org/wiki/Galton_board

(yes, that Galton who invented eugenics)

gjf 8 hours ago [-]

Very much an inspiration and resource when composing the post.

jpcompartir 6 hours ago [-]

Edit: OP confirms there's no AI-generated code, so do ignore me.

The code style - and in particular the *comments - indicate most of the code was written by AI. My apologies if you are not trying to hide this fact, but it seems like common decency to label that you're heavily using AI?

*Comments like this: "# Anonymous function"

gtsnexp 6 hours ago [-]

https://gptzero.me/ says that large portions of it are 100% human

robluxus 6 hours ago [-]

Interesting comment. Why is it common decency to call out how much ai was used for generating an artifact?

Is there a threshold? I assume spell checkers, linters and formatters are fair game. The other extreme is full-on ai slop. Where should we as a society start to feel the need to police this (better)?

Sharlin 6 hours ago [-]

The threshold should be exactly the same as when using another human's original text (or code) in your article. AI cannot have copyright, but for full disclosure one should act as if they did. Anything that's merely something that a human editor (or code reviewer) would do is fair game IMO.

robluxus 5 hours ago [-]

Maybe OP just used an ai editor to add their silly comments, so that would be fair game I guess? Or some humans just add silly comments. The article didn't stand out to me as embarrassingly ai-written. Not an em dash in sight :)

Edit: just found this disclaimer in the article:

> I’ll show the generating R code, with a liberal sprinkling of comments so it’s hopefully not too inscrutable.

Doesn't come out the gate and say who wrote the comments but ostensibly OP is a new grad / junior, the commenting style is on-brand.

gjf 5 hours ago [-]

Op here, no AI generated code, I'm wondering what gives the impression that it is?

I use Rmarkdown, so the code that's presented is also the same code that 'generates' the data/tables/graphs (source: https://github.com/gregfoletta/articles.foletta.org/blob/pro...).

jpcompartir 5 hours ago [-]

If you say there's no AI-generated code then I retract the original comment, nice work.

jpcompartir 5 hours ago [-]

That is not a disclaimer for generated code, it's referring to the code that generated the simulations/plots.

I had read that line before I commented, it was partly what sparked me to comment as it was a clear place for a disclaimer.

jpcompartir 6 hours ago [-]

Agree here - in a nutshell it strikes me as intellectually dishonest to intentionally pass off some other entity's work as one's own.

coderatlarge 5 hours ago [-]

i personally have no problem with people including AI gen’d code without attribution so long as they stand by it and own the consequences of what they submit. after all, we all know by now how much cajoling and insisting it takes to get any AI gen’d code to do what it’s actually requested and intended to do.

the only exception being contexts that explicitly prohibit it.

