Perspective
Principles of Effective Data Visualization
Stephen R. Midway1,*
1Department of Oceanography and Coastal Sciences, Louisiana State University, Baton Rouge, LA 70803, USA *Correspondence:
https://doi.org/10.1016/j.patter.2020.100141
We live in a society surrounded by visuals, which, along with software options and electronic distribution, has increased the importance of effective scientific visuals. Unfortunately, across scientific disciplines, many figures incorrectly present information or, when not incorrect, still use suboptimal data visualization practices. Presented here are ten principles that serve as guidance for authors who seek to improve their visual message. Some principles are less technical, such as determining the message before starting the visual, while other principles are more technical, such as how different color combinations imply different information. Because figure making is often not formally taught and figure standards are not readily enforced in science, it is incumbent upon scientists to be aware of best practices in order to most effectively tell the story of their data.
INTRODUCTION
Visual learning is one of the primary forms of interpreting information, which has historically combined images such as charts and graphs (see Box 1) with reading text.1 However, developments on learning styles have suggested splitting up the visual learning modality in order to recognize the distinction between text and images.2 Technology has also enhanced visual presentation, in terms of the ability to quickly create complex visual information while also cheaply distributing it via digital means (compared with paper, ink, and physical distribution). Visual information has also increased in scientific literature. In addition to the fact that figures are commonplace in scientific publications, many journals now require graphical abstracts3 or might tweet figures to advertise an article. Dating back to the 1970s when computer-generated graphics began,4 papers represented by an image on the journal cover have been cited more frequently than papers without a cover image.5
There are numerous advantages to quickly and effectively conveying scientific information; however, scientists often lack the design principles or technical skills to generate effective visuals. Going back several decades, Cleveland6 found that 30% of graphs in the journal Science had at least one type of error. Several other studies have documented widespread errors or inefficiencies in scientific figures.7–9 In fact, the increasing menu of visualization options can sometimes lead to poor fits between information and its presentation. These poor fits can even have the unintended consequence of confusing the readers and setting them back in their understanding of the material. While objective errors in graphs are hopefully in the minority of scientific works, what might be more common is suboptimal figure design, which takes place when a design element may not be objectively wrong but is ineffective to the point of limiting information transfer.
Effective figures suggest an understanding and interpretation of data; ineffective figures suggest the opposite. Although the field of data visualization has grown in recent years, the process of displaying information cannot—and perhaps should not—be fully mechanized. Much like statistical analyses often require expert opinions on top of best practices, figures also require choice despite well-documented recommendations. In other words, there may not be a singular best version of a given figure. Rather, there may be multiple effective versions of displaying a single piece of information, and it is the figure maker’s job to weigh the advantages and disadvantages of each. Fortunately, there are numerous principles from which decisions can be made, and ultimately design is choice.7
The data visualization literature includes many great resources. While several resources are targeted at developing design proficiency, such as the series of columns run by Nature
OPEN ACCESS
THE BIGGER PICTURE Visuals are an increasingly important form of science communication, yet many scientists are not well trained in design principles for effective messaging. Despite challenges, many visuals can be improved by taking some simple steps before, during, and after their creation. This article presents some sequential principles that are designed to improve visual messages created by scientists.
Mainstream: Data science output is well understood and (nearly) universally adopted
PATTER 1, December 11, 2020 © 2020 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Regarding terminology, the terms graph, plot, chart, image, figure, and data visual(ization) are often used interchangeably, although they may have different meanings in different instances. Graph, plot, and chart often refer to the display of data, data summaries, and models, while image suggests a picture. Figure is a general term but is commonly used to refer to visual elements, such as plots, in a scientific work. A visual, or data visualization, is a newer and ostensibly more inclusive term to describe everything from figures to infographics. Here, I adopt common terminology, such as bar plot, while also attempting to use the terms figure and data visualization for general reference.
signed to make complex, technical, and effective figures. Recognize that you might need to learn new software, or expand your knowledge of software you already know. While highly effective and aesthetically pleasing figures can be made quickly and simply, this may still represent a challenge to some. However, figure making is a method like anything else, and in order to do it, new methodologies may need to be learned. You would not expect to improve a field or lab method without changing something or learning something new. Data visualization is the same, with the added benefit that most software is readily available, inexpensive, or free, and many packages come with large online help resources. This article does not promote any specific software, and readers are encouraged to reference other work14 for an overview of software resources.
Principle #3 Use an Effective Geometry and Show Data
Geometries are the shapes and features that are often synonymous with a type of figure; for example, the bar geometry creates a bar plot. While geometries might be the defining visual element of a figure, it can be tempting to jump directly from a dataset to pairing it with one of a small number of well-known geometries. Some of this thinking is likely to naturally happen. However, geometries are representations of the data in different forms, and often there may be more than one geometry to consider. Underlying all your decisions about geometries should be the data-ink ratio,7 which is the ratio of ink used on data compared with overall ink used in a figure. High data-ink ratios are the best, and you might be surprised to find how much non-data-ink you use and how much of that can be removed.
Most geometries fall into categories: amounts (or comparisons), compositions (or proportions), distributions, or relationships. Although seemingly straightforward, one geometry may work in more than one category, in addition to the fact that one dataset may be visualized with more than one geometry (sometimes even in the same figure). Excellent resources exist on detailed approaches to selecting your geometry,15 and this article only highlights some of the more common geometries and their applications.
Amounts or comparisons are often displayed with a bar plot (Figure 1A), although numerous other options exist, including Cleveland dot plots and even heatmaps (Figure 1F). Bar plots are among the most common geometry, along with lines,9 although bar plots are noted for their very low data density16 (i.e., low data-ink ratio). Geometries for amounts should only be used when the data do not have distributional information or uncertainty associated with them. A good use of a bar plot might be to show counts of something, while poor use of a bar plot might be to show group means. Numerous studies have discussed inappropriate uses of bar plots,9,17 noting that “because the bars always start at zero, they can be misleading: for example, part of the range covered by the bar might have never been observed in the sample.”17 Despite the numerous reports on incorrect usage, bar plots remain one of the most common problems in data visualization.
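To make the concern about bars of group means concrete, the following is a minimal sketch using only the Python standard library. The two groups and their values are hypothetical (they are not data from this article), and the function names are chosen purely for illustration. Both groups would produce bars of identical height even though their spreads differ greatly.

```python
from statistics import mean, stdev

# Two hypothetical groups (illustrative values only, not data from the article).
group_a = [9, 10, 10, 10, 11]
group_b = [1, 4, 10, 16, 19]

def bar_height(values):
    """Height that a bar plot of group means would draw for this group."""
    return mean(values)

def spread(values):
    """Sample standard deviation, which a bare bar of means does not show."""
    return stdev(values)

for name, values in (("A", group_a), ("B", group_b)):
    print(f"group {name}: bar height = {bar_height(values)}, "
          f"sd = {spread(values):.2f}, range = {min(values)} to {max(values)}")
```

A reader shown only the two bars could not tell these groups apart, which is one reason a distributional geometry or the raw data may be the safer choice here.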
Compositions or proportions may take a wide range of geometries. Although the traditional pie chart is one option, the pie geometry has fallen out of favor among some18 due to the inherent difficulties in making visual comparisons. Although there may be some applications for a pie chart, stacked or clustered bar plots
Communications,10 Wilkinson’s The Grammar of Graphics11 presents a unique technical interpretation of the structure of graphics. Wilkinson breaks down the notion of a graphic into its constituent parts (e.g., the data, scales, coordinates, geometries, aesthetics), much like conventional grammar breaks down a sentence into nouns, verbs, punctuation, and other elements of writing. This approach has proven popular and useful, and it has been implemented in a number of software packages, including the popular ggplot2 package12 currently available in R.13 (Although the grammar of graphics approach is not explicitly adopted here, the term geometry is used consistently with Wilkinson to refer to different geometrical representations, whereas the term aesthetics is not used consistently with the grammar of graphics and is used simply to describe something that is visually appealing and effective.) By understanding basic visual design principles and their implementation, many figure authors may find new ways to emphasize and convey their information.
THE TEN PRINCIPLES
Principle #1 Diagram First
The first principle is perhaps the least technical but very important: before you make a visual, prioritize the information you want to share, envision it, and design it. Although this seems obvious, the larger point here is to focus on the information and message first, before you engage with software that in some way starts to limit or bias your visual tools. In other words, don’t necessarily think of the geometries (dots, lines) you will eventually use, but think about the core information that needs to be conveyed and what about that information is going to make your point(s). Is your visual objective to show a comparison? A ranking? A composition? This step can be done mentally, or with a pen and paper for maximum freedom of thought. In parallel to this approach, it can be a good idea to save figures you come across in scientific literature that you identify as particularly effective. These are not just inspiration and evidence of what is possible, but will help you develop an eye for detail and technical skills that can be applied to your own figures.
Principle #2 Use the Right Software
Effective visuals typically require good command of one or more software programs. In other words, it might be unrealistic to expect complex, technical, and effective figures if you are using a simple spreadsheet program or some other software that is not de-
Figure 1. Examples of Visual Designs
(A) Clustered bar plots are effective at showing units within a group (A–C) when the data are amounts.
(B) Histograms are effective at showing the distribution of data, which in this case is a random draw of values from a Poisson distribution and which uses a sequential color scheme that emphasizes the mean as red and values farther from the mean as yellow.
(C) Scatterplot where the black circles represent the data.
(D) Logistic regression where the blue line represents the fitted model, the gray shaded region represents the confidence interval for the fitted model, and the dark-gray dots represent the jittered data.
(E) Box plot showing (simulated) ages of respondents grouped by their answer to a question, with gray dots representing the raw data used in the box plot. The divergent colors emphasize the differences in values. For each box plot, the box represents the interquartile range (IQR), the thick black line represents the median value, and the whiskers extend to 1.5 times the IQR. Outliers are represented by the data.
(F) Heatmap of simulated visibility readings in four lakes over 5 months. The green colors represent lower visibility and the blue colors represent greater visibility. The white numbers in the cells are the average visibility measures (in meters).
(G) Density plot of simulated temperatures by season, where each season is presented as a small multiple within the larger figure.
For all figures the data were simulated, and any examples are fictitious.
geometries (Figure 1D), and while this can be a good thing, presenting raw data and inferential statistical models are two different messages that need to be distinguished (see Data and Models Are Different Things).
Finally, it is almost always recommended to show the data.7 Even if a geometry might be the focus of the figure, data can usually be added and displayed in a way
[Figure 1 image, panels A–G. Recoverable axis labels: Value, Count, Density, Age, y, p(y), Temperature C. Recoverable category labels: Yes, Maybe, Unsure, Doubtful, No; June, July, August; Lake2, Lake3.]
(Figure 1A), stacked density plots, mosaic plots, and treemaps offer alternatives.
Geometries for distributions are an often underused class of visuals that demonstrate high data density. The most common geometry for distributional information is the box plot19 (Figure 1E), which shows five types of information in one object. Although more common in exploratory analyses than in final reports, the histogram (Figure 1B) is another robust geometry that can reveal information about data. Violin plots and density plots (Figure 1G) are other common distributional geometries, although many less-common options exist.
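As an illustration of how much a single box plot encodes, the sketch below computes the quantities a box plot typically draws, following the convention described in the Figure 1E caption (box as the IQR, a median line, whiskers extending to 1.5 times the IQR, and outliers shown as points). It is a minimal example under stated assumptions: the ages are made up, the function name is hypothetical, and the quartile method ("inclusive" in Python's statistics module) is only one of several conventions, so plotting software may report slightly different quartiles.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the pieces a box plot draws, using the 1.5 x IQR whisker rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme observation inside the fences
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical respondent ages (illustrative only).
ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 90]
print(box_plot_summary(ages))
```

In this made-up sample, the single very large value falls outside the upper fence and would be drawn as an individual point rather than stretching the whisker.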
Relationships are the final category of visuals covered here, and they are often the workhorse of geometries because they include the popular scatterplot (Figures 1C and 1D) and other presentations of x- and y-coordinate data. The basic scatterplot remains very effective, and layering information by modifying point symbols, size, and color is a good way to highlight additional messages without taking away from the scatterplot. It is worth mentioning here that scatterplots often develop into line
that does not detract from the geometry but instead provides the context for the geometry (e.g., Figures 1D and 1E). The data are often at the core of the message, yet in figures the data are often ignored on account of their simplicity.
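One common way to show raw data alongside a geometry without the points hiding one another is jittering, as used for the data in Figures 1D and 1E. The following is a minimal sketch of the idea using only the Python standard library; the function name, the jitter width, and the group positions are assumptions made for illustration, and plotting libraries generally offer their own jitter options.

```python
import random

def jitter(positions, width=0.2, seed=1):
    """Shift each position by a uniform random offset in [-width, +width]."""
    rng = random.Random(seed)  # fixed seed so the figure is reproducible
    return [p + rng.uniform(-width, width) for p in positions]

# Hypothetical example: five raw observations in group 1 and five in group 2.
group_index = [1] * 5 + [2] * 5
x_for_points = jitter(group_index)
print([round(x, 2) for x in x_for_points])
```

Only the plotting position is perturbed; the data values themselves are left untouched, so the jittered points still report the observations accurately.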
Principle #4 Colors Always Mean Something
The use of color in visualization can be incredibly powerful, and there is rarely a reason not to use color. Even if authors do not wish to pay for color figures in print, most journals still permit free color figures in digital formats. In a large study20 of what makes visualizations memorable, colorful visualizations were reported as having higher memorability scores, with seven or more colors performing best. Although some of the visuals in this study were photographs, other studies21 also document the effectiveness of colors.
In today’s digital environment, color is cheap. This is overwhelmingly a good thing, but also comes with the risk of colors being applied without intention. Black-and-white visuals were more accepted decades ago when hard copies of papers were more common and color printing represented a large cost. Now, however, the vast majority of readers view scientific papers on an electronic screen where color is free. For those who still print documents, color printing can be done relatively cheaply in comparison with some years ago.
Color represents information, whether in a direct and obvious way, or in an indirect and subtle way. A direct example of using color may be in maps where water is blue and land is green or brown. However, the vast majority of (non-mapping) visualizations use color in one of three schemes: sequential, diverging, or qualitative. Sequential color schemes are those that range from light to dark typically in one or two (related) hues and are often applied to convey increasing values for increasing darkness (Figures 1B and 1F). Diverging color schemes are those that have two sequential schemes that represent two extremes, often with a white or neutral color in the middle (Figure 1E). A classic example of a diverging color scheme is the red to blue hues applied to jurisdictions in order to show voting preference in a two-party political system. Finally, qualitative color schemes are found when the intensity of the color is not of primary importance, but rather the objective is to use different and otherwise unrelated colors to convey qualitative group differences (Figures 1A and 1G).
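The structure of a diverging scheme (two sequential ramps joined at a neutral midpoint) can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names are hypothetical, the blue and red endpoint colors are arbitrary choices, simple linear interpolation in RGB is used for brevity, and vmin is assumed to be smaller than vmax. Established palettes are usually a better choice for real figures.

```python
def _lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t

def diverging_color(value, vmin, vmax, low=(33, 102, 172), high=(178, 24, 43)):
    """Map value to a hex color: low hue -> white at the midpoint -> high hue."""
    mid = (vmin + vmax) / 2
    white = (255, 255, 255)
    value = min(max(value, vmin), vmax)  # clamp to the data range
    if value <= mid:
        t = (value - vmin) / (mid - vmin)
        rgb = [_lerp(c, w, t) for c, w in zip(low, white)]
    else:
        t = (value - mid) / (vmax - mid)
        rgb = [_lerp(w, c, t) for w, c in zip(white, high)]
    return "#" + "".join(f"{round(c):02x}" for c in rgb)

# Hypothetical values on a scale from -1 to 1 (e.g., a difference from a baseline).
print([diverging_color(v, -1, 1) for v in (-1, -0.5, 0, 0.5, 1)])
```

Values at the midpoint map to white, and values move toward one of the two hues as they approach either extreme, which is the behavior that distinguishes a diverging scheme from a sequential one.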
While it is recommended to use color and capture the power that colors convey, there exist some technical recommendations. First, it is always recommended to design color figures that work effectively in both color and black-and-white formats (Figures 1B and 1F). In other words, whenever possible, use color that can be converted to an effective grayscale such that no information is lost in the conversion. Along with this approach, colors can be combined with symbols, line types, and other design elements to share the same information that the color was sharing. It is also good practice to use color schemes that are effective for colorblind readers (Figures 1A and 1E). Excellent resources, such as ColorBrewer,22 exist to help in selecting color schemes based on colorblind criteria. Finally, color transparency is another powerful tool, much like a volume knob for color (Figures 1D and 1E). Not all colors have to be used at full value, and when not part of a sequential or diverging color scheme (and especially when a figure has more than one colored geometry) it can be very effective to increase the transparency such that the information of the color is retained but it is not visually overwhelming or outcompeting other design elements. Color will often be the first visual information a reader gets, and with this knowledge color should be strategically used to amplify your visual message.
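One rough way to check whether a sequential palette will survive conversion to grayscale is to compare the relative luminance of its colors. The sketch below uses the standard sRGB relative-luminance formula; the function names and the example hex values are illustrative assumptions, and requiring strictly monotonic luminance is a simple proxy suited to sequential schemes rather than a complete test of grayscale or colorblind safety.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB hex color."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding before weighting the channels.
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def survives_grayscale(palette):
    """True if luminance changes strictly monotonically along the palette."""
    lum = [relative_luminance(c) for c in palette]
    steps = [b - a for a, b in zip(lum, lum[1:])]
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)

# Illustrative palettes: a light-to-dark blue ramp and the three pure primaries.
sequential = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]
primaries = ["#ff0000", "#00ff00", "#0000ff"]
print(survives_grayscale(sequential), survives_grayscale(primaries))
```

The blue ramp darkens steadily and so keeps its ordering in grayscale, whereas the ordering of the pure primaries is lost because their luminance does not change in one direction.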
Principle #5 Include Uncertainty
Not only is uncertainty an inherent part of understanding most systems, failure to include uncertainty in a visual can be misleading. There exist two primary challenges with including uncertainty in visuals: failure to include uncertainty and misrepresentation (or misinterpretation) of uncertainty.
Uncertainty is often not included in figures and, therefore, part of the statistical message is left out, possibly calling into question other parts of the statistical message, such as inference on the mean. Including uncertainty is typically easy in most software programs, and can take the form of common geometries such as error bars and shaded intervals (polygons), among other features.15 Another way to think about visualizing uncertainty is whether it is included implicitly in the existing geometries, such as in a box plot (Figure 1E) or distribution (Figures 1B and 1G), or whether it is included explicitly as an additional geometry, such as an error bar or shaded region (Figure 1D).
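As a small example of explicit uncertainty, the sketch below computes the endpoints that an error bar around a mean might display. It is a minimal illustration under stated assumptions: the measurements are hypothetical, the function name is invented for this example, and the interval uses a normal approximation (a t-based interval is generally more appropriate for small samples). Whatever interval is chosen, the figure caption should state what the bars represent.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_with_interval(values, level=0.95):
    """Return (mean, lower, upper) for a normal-approximation interval."""
    m = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    return m, m - z * standard_error, m + z * standard_error

# Hypothetical measurements (illustrative only).
sample = [4.1, 5.0, 5.3, 4.7, 5.9, 5.2, 4.8, 5.5]
m, lower, upper = mean_with_interval(sample)
print(f"mean = {m:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

The lower and upper values can be passed to the error-bar or shaded-interval geometry of whichever plotting software is in use.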
Representing unc