Central Limit Theorem and Latent Variables

I recently came across the Central Limit Theorem (CLT) and wanted to visualise it with an interactive Shiny app.

Basically, the CLT says that when you take the means (or sums) of many independent random variables with finite variance, the resulting distribution is approximately normal, regardless of how each individual variable is distributed.
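
For reference, the statement above can be written out formally. This is the classical (Lindeberg-Lévy) form, which assumes the variables X_1, ..., X_n are independent and identically distributed with mean \mu and finite variance \sigma^2; the symbols are the usual textbook ones rather than anything taken from the app, and \xrightarrow needs the amsmath package.

```latex
\[
  \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i ,
  \qquad
  \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}
  \xrightarrow{d} \mathcal{N}(0, 1)
  \quad \text{as } n \to \infty .
\]
```

In words: once the mean is centred and rescaled, its distribution approaches the standard normal as the number of variables grows.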

In the context of psychometrics, the CLT means that multi-item measures may produce approximately normally distributed data when aggregated, regardless of the distribution of responses to each individual item.
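
A rough sketch of this idea, separate from the app: simulate a scale made of skewed five-point items, average the items for each respondent, and compare the skewness of a single item with the skewness of the aggregated score. The response probabilities, the number of items and respondents, and the `skewness` helper are all assumptions chosen for illustration, and the items are drawn independently, which real questionnaire items usually are not.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n.respondents <- 2000
n.items <- 20
# Hypothetical response probabilities for options 1 to 5 (strongly right-skewed)
probs <- c(.50, .25, .15, .07, .03)

# Rows are respondents, columns are items; every response is drawn independently
responses <- matrix(sample(1:5, n.respondents * n.items, replace = TRUE, prob = probs),
                    nrow = n.respondents, ncol = n.items)

# Scale score: each respondent's mean across the items
scale.scores <- rowMeans(responses)

# Simple moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

item.skew  <- skewness(responses[, 1])  # a single item
score.skew <- skewness(scale.scores)    # the aggregated score

c(item = item.skew, score = score.skew)
```

Under these assumptions the score skewness should sit much closer to zero than the item skewness, which is the CLT at work on the aggregated measure.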

With the app below (also found at https://daikman.shinyapps.io/CLTLV/), you can simulate data collected from a measure of a latent variable. In the simulation, all items share the same underlying response distribution, but the CLT can also apply when items are distributed differently.

Notice how sample size makes very little difference to the shape of the distribution of observed variable means. Of course, sample size is important for other reasons.
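
One of those other reasons is precision. The sketch below, which is not part of the app, estimates the standard deviation of simulated sample means at a few sample sizes and compares it with the theoretical value sigma / sqrt(n). The exponential distribution, the sample sizes, and the helper name `sd.of.means` are assumptions picked for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Standard deviation of many simulated sample means for a given sample size.
# rexp() with its default rate of 1 has a population SD of 1.
sd.of.means <- function(sample.size, n.means = 5000) {
  sd(replicate(n.means, mean(rexp(sample.size))))
}

sample.sizes <- c(10, 50, 250)
observed <- sapply(sample.sizes, sd.of.means)
expected <- 1 / sqrt(sample.sizes)  # sigma / sqrt(n) with sigma = 1

data.frame(sample.size = sample.sizes, observed = observed, expected = expected)
```

The observed and expected columns should track each other closely: larger samples leave the means roughly bell-shaped but pack them more tightly around the true mean.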

The app is based on a custom R function, as defined below:

library(ggplot2)   # plotting
library(cowplot)   # plot_grid() for stacking the two panels
library(shiny)     # only needed for the app that wraps this function

CLT <- function(n.items = 10, sample.size = 50,
                distribution = c("normal", "uniform", "poisson", "exponential", "binomial")) {

  # Resolve the default vector to a single choice ("normal" if none is given);
  # switch() needs a length-one string
  distribution <- match.arg(distribution)

  # Function that draws one sample of responses for a single item
  draw <- switch(distribution,
                 "exponential" = function(n) rexp(n),
                 "normal"      = function(n) rnorm(n),
                 "uniform"     = function(n) runif(n, -1, 1),
                 "poisson"     = function(n) rpois(n, 1),
                 "binomial"    = function(n) rbinom(n, 1, .7))

  # One mean per item, each computed from sample.size simulated responses
  sample.means <- data.frame(item.mean = replicate(n.items, mean(draw(sample.size))))

  # Normal Q-Q plot of the item means
  gQQ <- ggplot(sample.means, aes(sample = item.mean)) +
    stat_qq() +
    stat_qq_line(color = "blue") +
    xlab("Theoretical") +
    ylab("Sample")

  # Density of the item means (red) with a fitted normal curve (blue);
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  gH <- ggplot(sample.means, aes(x = item.mean)) +
    geom_density(color = "black", fill = "red", alpha = 0.5) +
    stat_function(color = "blue", linewidth = 1, fun = dnorm,
                  args = list(mean = mean(sample.means$item.mean),
                              sd = sd(sample.means$item.mean))) +
    ylab("Density")

  # Stack the Q-Q plot above the density plot
  cowplot::plot_grid(gQQ, gH, ncol = 1)
}