Limitations of Summary Statistics


Summary statistics give valuable information about a distribution. The mean gives the expected value, the mode gives the value with the highest probability mass or probability density, and the variance measures the "spread" of the distribution. For some probability distributions, such as the Gaussian, the mean and variance are enough to describe the distribution completely. The question is to what extent we should rely on summary statistics to describe distributions in general. The answer is that we should always take an exploratory approach and visualize the data as part of the analysis, because summary statistics can be misleading when it comes to describing probability distributions. To find out why, keep reading, and see you at the finish line!

Summary Statistics

Before asking to what extent summary statistics can describe a distribution, let's define some of them.

Mean

The most familiar property of a distribution is its mean or expected value, often denoted by $\mu$. For a continuous random variable, the mean is defined as $$E[X]=\int_{X} xp(x)dx$$ where $p(x)$ is the probability density function. For a discrete random variable, the integral becomes a sum: $$E[X]=\sum_{X} xp(x)$$ where $p(x)$ is the probability mass function.
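As a small illustration, the discrete formula can be evaluated directly. The fair six-sided die and the helper name `expected_value` are assumptions made for this sketch.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expected_value(pmf):
    """E[X] = sum of x * p(x) over all outcomes of a discrete random variable."""
    return sum(x * p for x, p in pmf.items())

die_mean = expected_value(pmf)
print(float(die_mean))
```

Exact fractions are used here only so that the result is not blurred by floating-point rounding.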

The mean has many useful properties, such as linearity, but we will not go into detail in this post.

Variance

Variance is a measure of the "spread" of a distribution, often denoted by $\sigma^2$. It is defined as follows: $$V[X]=E[(X-\mu)^2]=\int(x-\mu)^2p(x)dx$$ $$=\int x^2p(x)dx-2\mu\int xp(x)dx+\mu^2\int p(x)dx$$ $$=E[X^2] - \mu^2$$ The last step uses $\int xp(x)dx=\mu$ and $\int p(x)dx=1$.
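The identity $V[X]=E[X^2]-\mu^2$ can be checked numerically. The sketch below uses the discrete analogue of the integrals, with a fair six-sided die as an assumed example. The helper name `expectation` is hypothetical.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p(x) over all outcomes."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expectation(pmf)

# Definition: V[X] = E[(X - mu)^2]
var_definition = expectation(pmf, lambda x: (x - mu) ** 2)

# Shortcut: V[X] = E[X^2] - mu^2
var_shortcut = expectation(pmf, lambda x: x ** 2) - mu ** 2

print(var_definition, var_shortcut)
```

Both expressions give the same exact fraction, which is what the derivation predicts.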

Limitations

A probability distribution can be summarized using summary statistics. However, we can lose a lot of information by using only summary statistics to describe a distribution. A well-known example is Anscombe’s quartet: four different datasets that share the same summary statistics (up to rounding), $E[X]=9, V[X]=11, E[Y]=7.50, V[Y]=4.125$. Although the summary statistics match, the joint distributions $p(x,y)$ from which these points are sampled are quite different.
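These numbers can be reproduced with the standard library alone. The sketch below hard-codes the quartet values as published by Anscombe (1973). The variable names are assumptions, and the variances are sample variances (dividing by $n-1$), which is the convention that yields 11 and roughly 4.125.

```python
from statistics import mean, variance

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    # statistics.variance computes the sample variance (divides by n - 1).
    summary[name] = (mean(xs), variance(xs), mean(ys), variance(ys))
    print(name, [round(v, 3) for v in summary[name]])
```

The four rows agree to about two decimal places, even though the underlying point clouds look nothing alike.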

[Figure: Anscombe’s quartet, one scatter plot per dataset]

In his paper, Anscombe points out the importance of examining a scatter plot of the data to make sure that no outliers have a dominant effect on the estimated regression coefficients. In the fourth dataset especially, a single data point determines the slope of the line: if we remove it, every remaining point has the same $x$ value and the fitted line changes completely. The takeaway is to always plot the data when we can, to check for odd data points that may affect our analysis. If we only took the summary statistics of these datasets into account, we would conclude that they were drawn from the same distribution.
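The influence of that single point in the fourth dataset can be checked with a few lines. This is a minimal sketch: `ols_slope` is a hypothetical helper that computes the ordinary least-squares slope $S_{xy}/S_{xx}$ and returns `None` when all $x$ values are identical, in which case no slope can be estimated.

```python
def ols_slope(xs, ys):
    """Least-squares slope Sxy / Sxx, or None when every x value is the same."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # a vertical stack of points has no defined slope
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Fourth dataset of Anscombe's quartet.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

slope_all = ols_slope(x4, y4)

# Drop the single point at x = 19 and fit again.
kept = [(x, y) for x, y in zip(x4, y4) if x != 19]
slope_without = ols_slope([x for x, _ in kept], [y for _, y in kept])

print(slope_all, slope_without)
```

With the outlier the slope is close to 0.5. Without it the slope is undefined, so the whole regression line rests on one observation.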

In Same Stats, Different Graphs, we see another demonstration of why data visualization matters when analyzing a dataset. The authors describe a method to systematically make a dataset take on a given shape when plotted while preserving its summary statistics. They do this with simulated annealing: in each iteration, points are moved by small random amounts, the movements are biased towards the target shape, and a move is kept only if the summary statistics stay the same. You can see how a dataset that looks like a drawing of a dinosaur when plotted has the same summary statistics as datasets with very different shapes.
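The idea can be sketched in a few dozen lines. This is a simplified illustration and not the authors' implementation: the circle target, the linear cooling schedule, the step size, and the names `summary_stats`, `distance_to_circle`, and `morph` are all assumptions, and only the means and standard deviations are tracked here.

```python
import math
import random

def summary_stats(points):
    """Mean and sample standard deviation of x and y, rounded to two decimals."""
    n = len(points)
    out = []
    for axis in (0, 1):
        values = [p[axis] for p in points]
        m = sum(values) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
        out += [round(m, 2), round(sd, 2)]
    return tuple(out)

def distance_to_circle(point, center=(50.0, 50.0), radius=21.0):
    """Distance from a point to the assumed target shape, a circle."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)

def mean_distance(points):
    return sum(distance_to_circle(p) for p in points) / len(points)

def morph(points, iterations=5000, step=0.1, seed=0):
    """Nudge points towards the circle while keeping the rounded statistics fixed."""
    rng = random.Random(seed)
    points = list(points)
    target = summary_stats(points)
    for i in range(iterations):
        temperature = 0.4 * (1 - i / iterations)  # assumed linear cooling
        k = rng.randrange(len(points))
        old = points[k]
        new = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        closer = distance_to_circle(new) < distance_to_circle(old)
        # Accept moves towards the shape, or occasionally a worse move.
        if not (closer or rng.random() < temperature):
            continue
        points[k] = new
        if summary_stats(points) != target:
            points[k] = old  # reject moves that change the rounded statistics
    return points

rng = random.Random(1)
start = [(rng.gauss(50, 15), rng.gauss(50, 15)) for _ in range(100)]
result = morph(start)

print(summary_stats(start), summary_stats(result))
print(mean_distance(start), mean_distance(result))
```

Because every accepted move is checked against the rounded statistics of the starting dataset, the output has the same two-decimal summary as the input by construction, whatever shape the points end up tracing.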

Datasaurus

Datasaurus2

Wrap up

In this post, we saw examples of how important it is to visualize datasets. An effective and often used example is Anscombe's Quartet. Developed by F.J. Anscombe in 1973, it is a set of four datasets that have quite different structures yet produce the same summary statistics (mean, standard deviation, and correlation). Another example is the Datasaurus Dozen, which urges us once again to "never trust summary statistics alone; always visualize your data".

With that, we have reached the end of this post. I really enjoyed going over these concepts with you, and I hope you enjoyed the post as well :) If you have any questions about the post or data science in general, you can find me on LinkedIn. I would really appreciate any comment or question, or just a chat about data science and the topics around it. See you at the next one...