As we become more efficient at gathering information about the world, from shopping habits and most-searched-for vacation spots to satellites that can adjust for cloud cover, it follows that how we communicate and organize this information for general understanding should become increasingly important. And it has. The demand for data analysts has grown over the past decade, and analyzing this information comes with several challenges. Yet despite all the hard work that goes into processing data sets, very little is shared with the public through data visualization.
Data visualization is the presentation of data in a graphical or visual format to communicate information clearly and make complex data more accessible. Its goal is to give readers the information they need to make an informed decision, grasp a difficult concept, identify patterns, or otherwise use the information presented.
Through my classes and research internship, I’ve created several visualizations of my own and used others’ data representations to further my understanding of a given topic. In doing so, I’ve come to realize how difficult it is to avoid bias. In my research here at Bigelow, a large part of my work has been mapping and creating other visuals (e.g., box plots and anomaly maps) to look for patterns or correlations in natural phenomena. More specifically, I’ve been focusing on chlorophyll blooms in the Nordic Seas that occur earlier than in the majority of the basin. When I plotted the time series of these early chlorophyll blooms for each year, I was excited to find that this phenomenon occurred every year with varying strength.
Upon taking a closer look at my data, though, I discovered that this was not the case. In 1999, the red cluster, which represents the early bloom, doesn’t exist. That means the time series could be based on just one or two pixels, which may even be clustered incorrectly.
Which plot is telling the truth, then? To test this, I ran another type of clustering analysis that assigned a membership value to each pixel. This value indicates how confident we can be that the pixel was clustered correctly. After running this analysis and plotting only the pixels with a membership value above a certain threshold, it became clear that the time series values for 1999 and even 2001 were not accurate representations. Once the membership value filter was applied to the data, there were no points from 1999 or 2001 that we could be confident were clustered correctly.
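The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the membership values, the years shown, and the 0.7 threshold are all invented, and the function names are hypothetical. It is not the analysis or the data used in the research.

```python
# Hypothetical membership values (0 to 1) for pixels assigned to the
# early-bloom cluster, grouped by year. These numbers are invented for
# illustration and are NOT the Nordic Seas data.
memberships_by_year = {
    1998: [0.91, 0.84, 0.77, 0.42],
    1999: [0.38, 0.51],
    2000: [0.88, 0.79, 0.35],
}


def confident_pixels(memberships, threshold):
    """Keep only the membership values that exceed the threshold."""
    return [m for m in memberships if m > threshold]


def confident_counts(by_year, threshold=0.7):
    """Count, per year, the pixels that pass the membership filter.

    The default threshold of 0.7 is an assumed value for this sketch.
    """
    return {
        year: len(confident_pixels(values, threshold))
        for year, values in by_year.items()
    }


counts = confident_counts(memberships_by_year)
for year, n in sorted(counts.items()):
    note = "" if n else "  (no confidently clustered pixels)"
    print(f"{year}: {n} pixel(s) above threshold{note}")
```

A year whose count drops to zero after filtering is one where a point on the time series would rest entirely on pixels we cannot trust.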
This is how I discovered how difficult it is to avoid accidentally biasing your figures, and also how important it is to be critical when reading results from visual representations. A skewed representation of data, whether intentional or not, can have subtle and dangerous consequences.
Aside from their expected appearances in scientific literature and statistics, information graphics have become increasingly popular in the media. A good representation of data will communicate accurate findings at a glance, but this isn’t always the case for visualizations presented in advertisements and the media. Competing browsers compare running speeds in their advertisements, and judging by this Microsoft Edge ad you might agree: Edge is faster. But looking more critically at this representation, is Edge faster simply because its speed score is higher? The ad doesn’t explain what the speed score means, so without units we can’t interpret how important the difference between the scores is. At a glance, Edge appears almost twice as fast as Firefox, when in reality it is only 9% faster. While this is a fairly innocent manipulation of consumers, it highlights how commonplace bad data visualizations are becoming.
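One common way a bar chart can make a small difference look large is an axis that does not start at zero. Whether the ad did exactly this is an assumption, and the scores and baseline below are hypothetical numbers chosen only to reproduce a 9% gap. The sketch compares the true ratio of two values with the ratio of the bar heights a reader actually sees.

```python
def true_ratio(a, b):
    """Ratio of the two underlying values."""
    return a / b


def apparent_ratio(a, b, baseline):
    """Ratio of the drawn bar heights when the axis starts at `baseline`.

    Each bar is only as tall as the part of its value above the baseline.
    """
    return (a - baseline) / (b - baseline)


# Hypothetical speed scores: one browser scores 9% higher than the other.
faster, slower = 109, 100
# Hypothetical axis start; a value of 0 would give an honest chart.
axis_start = 90

print(f"true ratio:     {true_ratio(faster, slower):.2f}")
print(f"apparent ratio: {apparent_ratio(faster, slower, axis_start):.2f}")
```

With these assumed numbers, the taller bar is drawn nearly twice as high as the shorter one even though the underlying values differ by 9%. Starting the axis at zero makes the two ratios agree.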
Another example, this one with political implications, is a figure shown on Fox News of how many people had enrolled in US government-sponsored healthcare. At first glance, it would seem that as of March 27th they were less than a third of the way to their goal. But again, looking at the numbers reveals a difference of 15% between the two, not 66%.
In both of these examples, the visuals have been skewed to produce a desired first impression, and it takes a deliberate extra effort from the audience to reevaluate it. While both of these media examples were likely deliberate, biased data representation can be accidental as well (as with my Nordic Seas data). For this reason, it’s important to be critical of the visualizations presented to you: are there units, labeled axes, and sensibly scaled plots? Finally, it’s easy for advertisements to throw large numbers at people, leading them to assume a statistic has more significance than it should. Always look for context and compare statistics in a meaningful way; instead of comparing how much each country spends on its military, compare how much each spends relative to its population.
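The point about context can be shown with a small calculation. The two countries and all of their figures below are invented for illustration, and `per_capita` is a hypothetical helper. The sketch shows how a ranking by total spending can reverse once the totals are divided by population.

```python
# Hypothetical countries with invented figures (not real statistics).
spending = {"A": 50_000_000_000, "B": 5_000_000_000}  # total, in dollars
population = {"A": 300_000_000, "B": 10_000_000}      # number of people


def per_capita(total, people):
    """Spending per person."""
    return total / people


per_person = {
    name: per_capita(spending[name], population[name]) for name in spending
}

biggest_total = max(spending, key=spending.get)
biggest_per_person = max(per_person, key=per_person.get)

print(f"largest total spending:      country {biggest_total}")
print(f"largest spending per person: country {biggest_per_person}")
```

In this made-up case, country A spends ten times as much in total, but country B spends more per person, so the headline number alone would point to the wrong conclusion.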
Adelaida Arjona is a Harvard University student in Bigelow Laboratory for Ocean Sciences’ Research Experience for Undergraduates program. This intensive experience provides an immersion in ocean research with an emphasis on hands-on, state-of-the-art methods and technologies.