Error Bars

Error bars #


Consider whether the standard error bar with bar ends is really the best choice for your data.


Error bars are a plot feature used to show a range of values around a data point. Though the common name for them is “error bars,” they are not always used to show error, at least in the colloquial sense. They can also be used simply to indicate the distribution of the data, and can be a great way to simplify a plot. Consider, for instance, a plot that shows the average tuition at 4-year universities for each state in the USA. This would look like:

This is a real mess, what data scientists pithily call a “unicorn mane”. It is essentially useless, except to give us a general vibe. Apart from a few lines that stand out, we cannot really tell one line from another. There is simply no way to attain good contrast between this many lines.

There are a few options for attempting to show the trends in all of these lines, such as using small multiples. But there are times when we do not really care to show all the data. One approach is to get rid of all the colors and just show the general block, with no attempt made to tell the lines apart.

A plot like this sends a clear message: you are not supposed to follow the individual lines. But people might still try, and this takes focus away from what you are trying to show: the general trend. Additionally, the plot remains a bit messy. To be sure, there are times you might wish to use a plot like the one above. However, in general, there is a better option: represent the mean value with a single point, and represent the spread in values using error bars. This yields the following:

This plot is much cleaner and keeps the focus squarely on the central trend. This sort of plot will serve you well. But there are a few small details you can pay attention to that improve the plot even more.
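As a rough sketch of how a mean-plus-error-bars plot like this could be built, the Python example below collapses many series into one mean and one spread per x value and draws them with Matplotlib's `Axes.errorbar`. The data are synthetic stand-ins (not the tuition data), and the choice of standard deviation as the spread, the variable names, and the output filename are all assumptions made for illustration.

```python
import random
import statistics

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

random.seed(0)

years = list(range(2004, 2014))  # hypothetical x values
# Synthetic stand-in data: 50 made-up series, one value per year each.
series = [
    [100 + 5 * i + random.gauss(0, 20) for i in range(len(years))]
    for _ in range(50)
]

# Collapse the 50 series into one mean and one spread per year.
by_year = list(zip(*series))
means = [statistics.mean(vals) for vals in by_year]
spreads = [statistics.stdev(vals) for vals in by_year]  # assumption: std dev as the spread

fig, ax = plt.subplots()
ax.errorbar(years, means, yerr=spreads, fmt="o", capsize=3)
ax.set_xlabel("year")
ax.set_ylabel("value (synthetic)")
fig.savefig("mean_with_error_bars.png")
```

Here `fmt="o"` draws markers only, with no connecting line, and `capsize` sets the length of the bar ends in points.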

Consider the width of your bar ends. #

The default in many software packages is to draw bar ends on the error bars and to set their width to match the width of the data point. This often works well, but as the bar ends get close to one another, they can clutter the plot area. In the extreme case, they can produce a “flattening” of the trend. For instance, imagine the bars above were just 50% wider…

Now, where the trend reaches its maximum, the error bar ends overlap, and this gives a sense of straight lines near the top. Additionally, this places ink associated with one point to the left and right of ink associated with another point, leading to some visual confusion.

In such cases, it might be better to just remove bar ends completely:

To my eye, this is much cleaner.

As a rule of thumb, I use bar ends when points are widely spaced, say at least 2 times farther apart than the bar end width. If the points get closer than that, I typically remove the bar ends, which yields a cleaner and less confusing plot.
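The rule of thumb above can be written down as a small check. This is only a sketch: the function name `use_bar_ends` and its parameters are hypothetical, and it assumes the point positions and the bar end width are expressed in the same units.

```python
def use_bar_ends(x_positions, bar_end_width, min_ratio=2.0):
    """Return True if neighbouring points are at least
    min_ratio bar-end widths apart (hypothetical helper)."""
    xs = sorted(x_positions)
    if len(xs) < 2:
        return True  # a single point cannot collide with a neighbour
    smallest_gap = min(b - a for a, b in zip(xs, xs[1:]))
    return smallest_gap >= min_ratio * bar_end_width


# Points one unit apart: a 0.4-wide bar end passes the check (1.0 >= 0.8),
# a 0.6-wide bar end does not (1.0 < 1.2).
print(use_bar_ends([0, 1, 2, 3], bar_end_width=0.4))
print(use_bar_ends([0, 1, 2, 3], bar_end_width=0.6))
```

In Matplotlib, passing `capsize=0` to `Axes.errorbar` draws the bars without ends. Note that `capsize` is measured in points rather than data units, so applying this check to a real figure would need a conversion between the two.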

Make sure to note what the error bars represent #

Above, we replaced a large amount of data with a set of bars. But what do these bars represent? There are two standards, and many other possibilities. The two standards are:

  • standard deviation
  • standard error

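These two values are related by standard error = standard deviation / sqrt(N), where N is the number of points measured. This is not a statistics website, but in brief: the standard deviation describes how spread out the individual values are, while the standard error describes how precisely the mean is known. Which one you prefer depends on which of those you want to show.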
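As a quick numerical illustration of the relation between the two, here is a minimal Python sketch using only the standard library. The measurements are made up, and it assumes the sample standard deviation (the n - 1 form computed by `statistics.stdev`), which the text above does not specify.

```python
import math
import statistics

measurements = [9.8, 10.1, 10.4, 9.9, 10.3, 10.0, 9.7, 10.2]  # made-up values

n = len(measurements)
std_dev = statistics.stdev(measurements)  # sample standard deviation (n - 1 denominator)
std_err = std_dev / math.sqrt(n)          # standard error = standard deviation / sqrt(N)

print(f"N = {n}, std dev = {std_dev:.3f}, std err = {std_err:.3f}")
```

Because of the division by sqrt(N), the standard error is always the smaller of the two for N > 1, so the same data can produce visibly different bars depending on which is plotted.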

Other values that might be represented by error bars include the full range of values, the interquartile range, or multiples of any of the above values. These are less common, but you will see them. Even setting those aside, the fact that there are two standards means that a viewer can never be sure which one is being used unless told. This means that you should always clearly state what the error bars represent. Something as simple as an annotation can do this.

One could also add short text like “± 1 std err” directly next to the bar. In scientific journals, you could add this to the caption, but you always need some indication of what you are plotting.
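One possible way to place such a note next to a bar in Matplotlib is `Axes.annotate` with an offset in points, as sketched below. The data, the claim that the bars are standard errors, and the output filename are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]             # made-up data
y = [2.0, 2.6, 3.1, 3.0]
err = [0.3, 0.2, 0.4, 0.3]   # pretend these are standard errors

fig, ax = plt.subplots()
ax.errorbar(x, y, yerr=err, fmt="o", capsize=0)

# Put the note just to the right of the top of the last error bar.
note = ax.annotate(
    "± 1 std err",
    xy=(x[-1], y[-1] + err[-1]),
    xytext=(8, 0),
    textcoords="offset points",
    va="center",
)
fig.savefig("annotated_error_bars.png")
```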

Other ways to represent error #

The error bar is the standard way to represent error, but there are others. Probably the next most common is to use a filled region on the plot:

This is seen especially often when representing the confidence intervals and prediction intervals associated with fitting equations to data.
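A filled region of this kind can be drawn in Matplotlib with `Axes.fill_between`, as in the sketch below. The central curve and the half-widths are made-up numbers, not the output of a real fit, so the band shows only the drawing technique and not how to compute a confidence or prediction interval.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5]
center = [1.0, 1.8, 2.9, 4.1, 5.0, 6.2]       # made-up central curve
half_width = [0.5, 0.4, 0.4, 0.5, 0.6, 0.8]   # made-up interval half-widths

lower = [c - h for c, h in zip(center, half_width)]
upper = [c + h for c, h in zip(center, half_width)]

fig, ax = plt.subplots()
ax.plot(x, center, label="central curve (synthetic)")
# Semi-transparent band so the central curve stays visible on top of it.
ax.fill_between(x, lower, upper, alpha=0.3, label="interval (made-up half-widths)")
ax.legend()
fig.savefig("filled_region.png")
```

As with error bars, the legend entry or a caption should say what the band represents.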

If there is error on both the x axis and the y axis, people will sometimes represent this as a box or ellipse around each point, though this is much less common. I am sure that, with a little thought, you can invent your own cool ways to represent error. But the ones we have covered above are the most common.

Conclusions #

Having a spread in values is a fact of working with data, and error bars are a way to represent that spread. Even though there are some strong conventions for representing error, you can still think through how you want to do this, and you should always, always, always tell your viewer what you are representing.