More on how to compare box plots
We showed a quick and easy way to compare box plots in a previous post. Let’s dig deeper into what information you can use to compare two box plots.
Overlapping boxes and medians
It gets tricky when the boxes overlap and their median lines are inside the overlap range. As always, math comes to the rescue. Follow this simple formula:
Percentage = Distance Between Medians / Overall Visible Spread × 100
There is likely to be a difference between two groups if this percentage is:
- Over 33% for a sample size of 30.
- Over 20% for a sample size of 100.
- Over 10% for a sample size of 1000.
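Here is a minimal Python sketch of that calculation. It rests on a few assumptions that are not spelled out above: the Overall Visible Spread is taken to run from the lowest box edge (the smaller first quartile) to the highest box edge (the larger third quartile) across both groups, quartiles are computed with the standard library's "inclusive" method, and the names median_gap_percent and THRESHOLDS are made up for illustration.

```python
from statistics import quantiles

# Sample size -> minimum percentage, copied from the list above.
THRESHOLDS = {30: 33, 100: 20, 1000: 10}

def median_gap_percent(group_a, group_b):
    """Distance Between Medians as a percentage of Overall Visible Spread."""
    q1_a, median_a, q3_a = quantiles(group_a, n=4, method="inclusive")
    q1_b, median_b, q3_b = quantiles(group_b, n=4, method="inclusive")
    distance_between_medians = abs(median_a - median_b)
    # Assumption: the spread runs from the lowest box edge to the highest one.
    overall_visible_spread = max(q3_a, q3_b) - min(q1_a, q1_b)
    return distance_between_medians / overall_visible_spread * 100

# Two made-up groups of 30 values whose boxes overlap,
# with both medians inside the overlap.
group_a = list(range(1, 31))   # 1, 2, ..., 30
group_b = list(range(4, 34))   # 4, 5, ..., 33

percent = median_gap_percent(group_a, group_b)
print(round(percent, 1))              # 17.1
print(percent > THRESHOLDS[30])       # False: below the 33% mark for n = 30
```

In this made-up case the medians are 3 units apart and the spread is 17.5 units, so the percentage falls short of the 33% mark and the plot alone does not suggest a difference.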
Since we are on sample size, let’s not forget that:
Box plots are about ranges, not actual counts.
At first glance, it is easy to think a longer section on a box plot represents a higher count. That is not the case. Take a look at this box plot:
Each section contains exactly the same number of data points: a quarter of the whole group. The different sizes come from how variable the values are in each section. If they are far apart from one another, the section grows longer. Which leads us into talking about skewness.
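To see the "same count, different length" idea in numbers, here is a small sketch using a made-up, right-skewed sample of 20 values. It assumes the whiskers run all the way to the minimum and maximum (no separate outlier rule) and uses the standard library's "inclusive" quartile method; other quartile conventions can shift the cut points slightly.

```python
from statistics import quantiles

# Made-up sample: values bunch up at the low end and thin out at the high end.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
          12, 14, 16, 18, 20, 25, 30, 40, 60, 100]

q1, median, q3 = quantiles(values, n=4, method="inclusive")
edges = [min(values), q1, median, q3, max(values)]

counts = []
widths = []
for low, high in zip(edges, edges[1:]):
    # No value in this sample sits exactly on an inner cut point,
    # so each point is counted once.
    counts.append(sum(1 for v in values if low <= v <= high))
    widths.append(high - low)

print(counts)  # [5, 5, 5, 5]
print(widths)  # [4.75, 5.25, 10.25, 78.75]
```

Every section holds five points, yet the last one is more than sixteen times longer than the first, simply because its values are farther apart.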
Box plots skewed to the right? To the left?
When the right side of the box-and-whisker plot is longer, it is skewed to the right. The values on this side — the upper end of the scale — are more variable. Most observations concentrate at the low end of the scale.
When a box plot is left-skewed, values gather at the upper end, making a short and tight section there. To the left of that crowd, data points spread out, creating a longer tail.
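A box plot is like a distribution curve viewed along its base: it shows where the data fall on the axis, but not how high the curve rises. Skewness in a box plot suggests that the data may not be normally distributed.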
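The "which side is longer" check can be sketched in a few lines of Python. This is a rough heuristic, not a formal skewness test: it assumes the whiskers reach the minimum and maximum, and both the function name skew_direction and the 10% tolerance are arbitrary choices made for illustration.

```python
from statistics import median

def skew_direction(values, tolerance=0.1):
    """Compare the plot's length on each side of the median."""
    mid = median(values)
    left_side = mid - min(values)
    right_side = max(values) - mid
    total = left_side + right_side
    # Assumption: sides within 10% of the total length count as balanced.
    if total == 0 or abs(right_side - left_side) <= tolerance * total:
        return "roughly symmetric"
    return "right" if right_side > left_side else "left"

# Made-up samples for illustration.
long_upper_tail = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                   12, 14, 16, 18, 20, 25, 30, 40, 60, 100]
long_lower_tail = [101 - v for v in long_upper_tail]  # mirror image
balanced = list(range(1, 22))

print(skew_direction(long_upper_tail))  # right
print(skew_direction(long_lower_tail))  # left
print(skew_direction(balanced))         # roughly symmetric
```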
Limitations of box plots
- No indication of sample size: A box plot does not show how many data points it summarizes. Though you can use box plots on non-parametric data, it is best to have a sample size of at least 20 (some might even say 30). For a smaller sample size, consider using individual value plots.
- The illusion of bar graphs: Box plots resemble bar graphs in their appearance, yet they present completely different information. Bar graphs compare groups by their absolute counts, while box plots show their distributional ranges. Remember: the size of each section in a box plot shows how widely spread a data range is; it says nothing about how many data points the group contains.
- The troubles are in the whiskers: Box plots’ whiskers are mistaken for error bars more often than you’d think, especially when there are asterisks representing outliers beyond them. Whiskers are not error bars. They show the lowest and highest quarters of the values, apart from any points drawn separately as outliers. Together they cover about half of the data points; the other half are in the box.
- The secret box: Box plots sometimes hide important information. When data “morph” but manage to maintain their ranges, quartiles, and medians, their box plots stay the same. A sample with one peak and a sample with two peaks can produce identical boxes.
- Violin plots are a better alternative: Violin plots present the same information as box plots, and more. They have a built-in density plot, and therefore show “the shape of data” more clearly. All data points are contained inside the violin. And they look nothing like bar graphs.
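The "secret box" problem is easy to reproduce. The sketch below builds two made-up samples that look nothing alike yet share the same minimum, quartiles, median, and maximum, so a box plot drawn from those five numbers would be identical for both. It assumes whiskers that reach the minimum and maximum and the standard library's "inclusive" quartile method; five_number_summary is a hypothetical helper name.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Made-up samples: one spread evenly, one bunched into two clumps.
evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]
two_clumps = [0, 2, 2, 2, 4, 6, 6, 6, 8]

summary_even = five_number_summary(evenly_spread)
summary_clumps = five_number_summary(two_clumps)

print(summary_even == summary_clumps)  # True: same box, different data
```

A violin plot or an individual value plot of the same two samples would show the clumps at 2 and 6 that the box plot hides.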
To sum up:
Box-and-whisker plots are an excellent way to visualize differences among groups. To compare two box plots with overlapping boxes and medians, calculate the Distance Between Medians as a percentage of the Overall Visible Spread.
Keep in mind that box plots are about ranges, not the absolute counts of data. Their skewness suggests that the data might not assume a normal distribution.
They have limitations, such as being misinterpreted as bar graphs, and concealing information. Violin plots are a better alternative.
BioVinci is drag-and-drop software that helps you make box plots, violin plots, and many other charts. It’s super easy to use and will only take a few minutes to get the job done.
Originally published at blog.bioturing.com on May 22, 2018.