This is a short blurb I was asked to write for Elder Research’s company newsletter. I previously shared this on LinkedIn.
Many of the highest-impact problems we work on involve causation. You’ve likely heard that “correlation is not causation,” but how are they different, really?
Correlation describes two (or more) quantities that “move together”: the sales of flip-flops and ice cream cones or even the number of Nicolas Cage movies and the number of airline screeners in North Dakota. Correlations can be meaningful, but they aren’t required to be so. (Maybe folks really are replacing their sandals en masse after dropping their ice cream?) Causation takes the notion of quantities moving together and lifts it to the stronger statement, “A causes B.”
Isolating these stronger connections requires data or constraints beyond what machine learning models already need. These extra requirements make causal models harder to build and test: we need counterfactuals (examples of what happens to B when A does not occur) to build causal models, and we need controlled experiments to properly validate them. If we don’t already have these data, we must collect them from scratch.
Taking these extra steps toward causal analysis is worth it, though, particularly when we want to act on the results. Correlations can mislead and even point in the wrong direction; when causal inference is the right tool for the job, it pays to gather the right data and do the analysis properly.
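To make the “wrong direction” point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not data from a real project: a hidden factor Z makes A more likely and also pushes B down, while A itself truly raises B by one unit (`TRUE_EFFECT`). The function names `simulate` and `difference_in_means` are hypothetical helpers. The sketch compares a naive comparison on observational data with the same comparison on a randomized experiment.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0  # assumed for illustration: A raises B by one unit


def simulate(n, randomized, rng):
    """Return n (a, b) pairs; z is a hidden factor that is never recorded."""
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5  # hidden confounder, present half the time
        if randomized:
            # Controlled experiment: a coin flip decides A, regardless of z.
            a = rng.random() < 0.5
        else:
            # Observational data: z makes A much more likely.
            a = rng.random() < (0.9 if z else 0.1)
        # B rises with A, but z pushes it down by more than A pushes it up.
        b = TRUE_EFFECT * a - 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((a, b))
    return rows


def difference_in_means(rows):
    """Average B where A happened minus average B where it did not."""
    treated = [b for a, b in rows if a]
    control = [b for a, b in rows if not a]
    return mean(treated) - mean(control)


rng = random.Random(0)  # fixed seed so the run is repeatable
observational = difference_in_means(simulate(20_000, randomized=False, rng=rng))
experimental = difference_in_means(simulate(20_000, randomized=True, rng=rng))

print(f"observational estimate: {observational:+.2f}")
print(f"randomized estimate:    {experimental:+.2f}")
```

Under these assumptions the observational estimate comes out negative, suggesting that A hurts B, while the randomized estimate lands close to the true effect of +1. The rows where A did not happen play the role of the counterfactual examples, and the coin flip is what keeps the hidden factor from distorting the comparison.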