Data Visualization

Data Visualization#

Many types of visualizations have been developed to examine different kinds of relationships in datasets. We will review methods for plotting distributions, summarizing central tendency in different conditions, and plotting relationships.

Preparing data for plotting#

To plot data in a DataFrame, we will first need to load the data and do any necessary data cleaning and reorganization. We will prepare the Osth & Fox (2019) using methods that we have discussed before.

Let’s load the Osth & Fox (2019) dataset, then clean it as we’ve done previously. This time, we’ll use the datascipsych.datasets.clean_osth function to prepare the DataFrame. We’ll also use filter to get just the trials from the test phase.

import polars as pl
from datascipsych import datasets
raw = pl.read_csv(datasets.get_dataset_file("Osth2019"))
data = datasets.clean_osth(raw).filter(pl.col("phase") == "test")
data.head()

shape: (5, 12)

subj	cycle	phase	trial	type	word1	word2	response	RT	correct	lag	probe_type
i64	i64	str	i64	str	str	str	i64	f64	i64	i64	str
101	0	"test"	-1	"rearranged"	"waste"	"degree"	0	2.312	1	2	"lure"
101	0	"test"	0	"rearranged"	"needed"	"able"	0	3.542	1	1	"lure"
101	0	"test"	1	"rearranged"	"single"	"clean"	0	2.084	1	3	"lure"
101	0	"test"	2	"rearranged"	"train"	"useful"	0	1.669	1	2	"lure"
101	0	"test"	3	"rearranged"	"knees"	"various"	0	2.326	1	5	"lure"

Next, we’ll need to calculate the descriptive statistics that we want to plot. First, we’ll calculate the hit rate and false-alarm rate for each subject.

perf = (
    data.group_by("subj", "probe_type")
    .agg(
        pl.col("response").mean(), 
        pl.col("RT").mean(),
    )
    .sort("subj", "probe_type")
)
perf.head()

shape: (5, 4)

subj	probe_type	response	RT
i64	str	f64	f64
101	"lure"	0.108333	1.924008
101	"target"	0.76569	1.646904
102	"lure"	0.241667	1.088396
102	"target"	0.2875	1.061537
103	"lure"	0.091667	1.415992

Now that we have calculated our measures of interest, we are ready to create a plot to visualize responses in the lure and target conditions.

Using Seaborn#

In Seaborn, plots are usually created by passing a DataFrame to a function and specifying how the variables in the DataFrame should be used in the plot. There are three main types of plot:

Distribution : plot the distribution of one or more variables using a histogram or density plot.

Categorical : plot a distribution or central tendency for multiple variables side-by-side.

Relational : plot the relationship between two or more variables.

Seaborn has tools for easily making all of these plot types, and also includes helpful features to change the appearance of figures.

To work with Seaborn, we must first import it. The Seaborn package is named after Samuel Norman Seaborn, the fictional White House Communications Director on the TV show The West Wing. The conventional way to abbreviate Seaborn when importing it is using sns, after the initials of the package’s namesake.

import seaborn as sns

Seaborn comes with different built-in styles. To illustrate them, let’s use the example plot from Seaborn’s tutorial. It plots a series of different sine waves.

import numpy as np
import matplotlib.pyplot as plt

def sinplot(n=10, flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, n + 1):
        plt.plot(x, np.sin(x + i * .5) * (n + 2 - i) * flip)

Seaborn comes with four styles, which make plots that include a grid, ticks, or neither. To visualize these styles, we use the axes_style function to set a different style for each of a series of subplots.

f = plt.figure(figsize=(8, 2))
styles = ["darkgrid", "white", "ticks", "whitegrid"]
gs = f.add_gridspec(1, len(styles))
for i, style in enumerate(styles):
    with sns.axes_style(style):
        ax = f.add_subplot(gs[i])
        sinplot(6)
        ax.set_title(style)

../../_images/59abcbd86dfe4432d6abaaf28f289b15bfd488608ccbd9c4417ec97059e66835.png

The ticks style is probably the most commonly used. The easiest way to set the style, however, is to just run set_style once at the top of a notebook. That will set the style for all plots from that point on.

sns.set_style("ticks")

Distribution plots#

In Seaborn, there are multiple ways to visualize a distribution of observations. Before, we used plt.hist to make histograms. Seaborn lets us quickly make histograms from DataFrames more flexibly.

To make a distribution plot, we can use the displot function. The kind variable determines how the distribution is visualized.

histplot : summarize data in bins using a histogram.

kdeplot : summarize data with a kernel density estimation plot.

We can create a histogram by calling displot and setting the kind to "hist". We’ll first make a histogram of mean response time across subjects. We set data to the DataFrame with the data we want to plot, and x to the name of the column of data we want to plot. Note that we end the command with a semicolon (;); this suppresses printing of the Seaborn object produced by sns.displot.

sns.displot(data=perf, x="RT", kind="hist", height=3);

../../_images/5d391e8cabf6f90d8a1988eaf77f71990bf3440eedf4e8edfba3cc62e9ec9b14.png

Note that Seaborn does some work for us, getting the data from the DataFrame and making a histogram, and labeling the axes based on our data. The optional height input sets the height of the figure in inches.

Sometimes, it’s helpful to compare multiple distributions on the same plot. We can do this by setting the hue parameter, which splits up the data and plots each group with a different color.

sns.displot(data=perf, x="RT", hue="probe_type", kind="hist", height=3);

../../_images/061ebc47cc4f0c44c7cf31d8fa06794bb37981678ec952851272aaf624359b50.png

Seaborn automatically determines good visualization properties for us, selecting two highly contrasting colors and plotting the overlap between the histograms using a different color. This makes it easy to see that response times for lure and target trials are similar, but target trial responses tend to be faster.

Histograms split the data into bins and count up the number of observations in each bin. Seaborn also supports kernel density estimation plots, which use a kernel instead of bins. This method smooths the underlying data to give an estimate of how dense observations are at each point.

sns.displot(data=perf, x="RT", hue="probe_type", kind="kde", height=3);

../../_images/f31f155590c0b6bf28f886f9466839814fec32957c3572e2d0d9732f252ef00b.png

This type of plot can sometimes make it easier to compare conditions. However, it is important to keep in mind that a KDE plot does not show the underlying data directly; instead, it estimates a smoothed version of the distribution and plots that.

We can save the figure to a file using the Seaborn object that is returned by its plotting functions. First, we assign the Seaborn FacetGrid object to a variable called g. Then, we use the savefig method to save out the file. The type of file is automatically determined by the file extension we provde. Here, we’ll save to a PDF file. PDF files are often a good choice for plots, because they store a vector format that does not have a resolution. This means that we can zoom in as much as we want without seeing any pixels.

g = sns.displot(data=perf, x="RT", hue="probe_type", kind="hist", height=3)
# g.savefig("histogram.pdf")  # uncomment to save figure

See the Seaborn documentation for more information about visualizing distributions.

Exercise: distributional plots#

Given the data DataFrame, which has all the responses from the test phase of Osth & Fox (2019), make a histogram showing the distributions of response time for lure and target trials. The lure and target trials should be shown in different hues in the plot.

Make a kernel density estimate (KDE) plot of the same data.

data.head()

shape: (5, 12)

subj	cycle	phase	trial	type	word1	word2	response	RT	correct	lag	probe_type
i64	i64	str	i64	str	str	str	i64	f64	i64	i64	str
101	0	"test"	-1	"rearranged"	"waste"	"degree"	0	2.312	1	2	"lure"
101	0	"test"	0	"rearranged"	"needed"	"able"	0	3.542	1	1	"lure"
101	0	"test"	1	"rearranged"	"single"	"clean"	0	2.084	1	3	"lure"
101	0	"test"	2	"rearranged"	"train"	"useful"	0	1.669	1	2	"lure"
101	0	"test"	3	"rearranged"	"knees"	"various"	0	2.326	1	5	"lure"

# your code here

Categorical plots#

In Seaborn, categorical plots are used to visualize one or more distributions (usually at least two). For example, “categories” might correspond to different experimental conditions.

A common type of plot for visualizing different conditions is called a bar plot. Let’s make a bar plot showing the probability of responding “old” in the target and lure conditions.

sns.catplot(data=perf, x="probe_type", y="response", kind="bar", height=3);

../../_images/beb3c54dcee2c55db752ab314af3151f11bc346227e4f896900f5f5c36fa9d54.png

To make this plot, we indicated that the x-axis should be based on the probe_type column and the y-axis should be based on the response column. The barplot function automatically split our data into “lure” and “target” bins and calculated the mean response.

It also calculated 95% confidence intervals using a bootstrap method, which estimate how much the mean would vary if we repeated the same experiment and collected the same measures over and over. To create a bootstrap confidence interval, Seaborn takes the distribution of observations and randomly samples from it with replacement to make a new set of observations. Then it calculates the mean of this bootstrap distribution. It does this 1000 times and then plots a line showing the middle 95% of means from the bootstrap.

We can make exactly the same plot using a horizontal bar plot, if we exchange the x-axis and y-axis inputs. This is sometimes useful if the labels of the different categories are long, as long labels are harder to fit on the x-axis.

sns.catplot(data=perf, x="response", y="probe_type", kind="bar", height=3);

../../_images/dd9286d777391bcb51761687dbeaafb7362227edbed32ba129af7236960bf310.png

Bar plots just show the means and uncertainty in the means. We can also make swarm plots that show the individual points in the different distributions. This is more complicated to look at, but it gives a more realistic sense of the data.

sns.catplot(data=perf, x="probe_type", y="response", kind="swarm", height=3);

../../_images/9d2b0c48a041ee6d2d66ae1fb4d14fe0aa5feafba6c53dacdfeb51bb40450190.png

A similar type of plot is called a strip plot. Instead of plotting points that don’t overlap, points are plotted within a strip with some jitter.

sns.catplot(data=perf, x="probe_type", y="response", kind="strip", height=3);

../../_images/95f8d516e12f571e853ceecc9c847800984877b1eb6b2be0c6d404e3c20b74b2.png

Strip plots sometimes work better than swarm plots when there are a lot of observations that bunch up and make the swarm plot too wide for the plot.

Another common type of plot is called a box plot. It shows more detail than a bar plot, but does not show individual points like a swarm plot. The box shows the middle quartiles of the distribution. The quartiles include the 25th, 50th, and 75th percentiles. Box plots separate out points that are near the central tendency of the distribution from outliers.

Outliers are determined based on the interquartile range (IQR), which is the difference between the 75th and 25th percentiles.

Points are considered outliers if they are more than 1.5 IQRs away from the nearest quartile. The whiskers at the top and bottom show the range of points that are not outliers. Any outliers are plotted as individual points.

g = sns.catplot(data=perf, x="probe_type", y="response", kind="box", height=3)

../../_images/a0850b084a2746842cd059555ce0ca15444583bb6ca6cdcd3635f246309c2a1c.png

Swarm and strip plots give a better sense of the underlying data, but box plots make the range and median of the data easy to see.

Sometimes we care about multiple dimensions of the data. For example, we can split the response times up by the probe type and type of response (“old” or “new”). First, we’ll get the data in shape. This example uses a conditional expression to relabel the responses of 0 and 1 as “new” and “old”, respectively.

conditions = (
    data.with_columns(
        response=pl.when(pl.col("response") == 0).then(pl.lit("new")).otherwise(pl.lit("old"))
    )
    .group_by("subj", "probe_type", "response")
    .agg(pl.col("RT").mean())
    .sort("subj", "probe_type", "response")
)
conditions.head(4)

shape: (4, 4)

subj	probe_type	response	RT
i64	str	str	f64
101	"lure"	"new"	1.913257
101	"lure"	"old"	2.0125
101	"target"	"new"	1.850839
101	"target"	"old"	1.584497

Now that we have our bins defined, we can use a bar plot to visualize how mean response time varies depending on probe type and response. We’ll represent response using the hue parameter.

sns.catplot(data=conditions, x="probe_type", y="RT", hue="response", kind="bar", height=3);

../../_images/295c4e41f4018bf751e41ffcc99254c1f9ab3fc1b23dec34bc3a011a2cf433af.png

From this plot, we can see that correct responses (“new” for lures, and “old” for targets) tend to be faster. Responses to targets also tend to be faster overall.

See the Seaborn documentation for more information about visualizing categorical data.

Exercise: categorical plots#

First, let’s set up DataFrames for plotting. We’ll use data from the Morton et al. (2013) free-recall study. Participants studied lists of celebrities, locations, and objects, and then were asked to recall the items in any order. Lists were either pure (all the same category) or mixed (items from each of the three categories).

fr = (
    pl.read_csv(datasets.get_dataset_file("Morton2013"), null_values="n/a")
    .filter(pl.col("study"))
)

First, we’ll get the mean fraction of items recalled for each subject and list type.

recall_lt = (
    fr.group_by("subject", "list_type")
    .agg(pl.col("recall").mean())
    .sort("subject", "list_type")
)
recall_lt.head(2)

shape: (2, 3)

subject	list_type	recall
i64	str	f64
1	"mixed"	0.476389
1	"pure"	0.564815

Next, we’ll split things up a little more, to get the mean fraction of items recalled for each subject, list type, and category.

recall_lt_cat = (
    fr.group_by("subject", "list_type", "category")
    .agg(pl.col("recall").mean())
    .sort("subject", "list_type", "category")
)
recall_lt_cat.head(6)

shape: (6, 4)

subject	list_type	category	recall
i64	str	str	f64
1	"mixed"	"cel"	0.5875
1	"mixed"	"loc"	0.433333
1	"mixed"	"obj"	0.408333
1	"pure"	"cel"	0.625
1	"pure"	"loc"	0.5625
1	"pure"	"obj"	0.506944

recall_lt.head(2)

shape: (2, 3)

subject	list_type	recall
i64	str	f64
1	"mixed"	0.476389
1	"pure"	0.564815

recall_lt_cat.head(2)

shape: (2, 4)

subject	list_type	category	recall
i64	str	str	f64
1	"mixed"	"cel"	0.5875
1	"mixed"	"loc"	0.433333

Use the recall_lt DataFrame to make a bar plot of showing mean recall for each list type.

Use the recall_lt DataFrame to make a swarm plot showing mean recall for each list type.

Use the recall_lt_cat DataFrame to make a bar plot showing mean recall for each list type, split by category.

# your code here

Relational plots#

Relational plots are used to show a relationship between two variables. For example, you might be interested in whether two measures are correlated with one another, or how some variable changes over time.

Say we are interested in the relationship between response time on target and lure trials, across participants. Is speed on these different trials related? First, let’s display the DataFrame to remind ourselves how the data are arranged.

perf.head()

shape: (5, 4)

subj	probe_type	response	RT
i64	str	f64	f64
101	"lure"	0.108333	1.924008
101	"target"	0.76569	1.646904
102	"lure"	0.241667	1.088396
102	"target"	0.2875	1.061537
103	"lure"	0.091667	1.415992

We can use a pivot table to set up our variables of interest in different columns. Here, we’ll make columns from the probe_type column and get the values from the RT column.

response_time = perf.pivot("probe_type", index="subj", values="RT")
response_time.head()

shape: (5, 3)

subj	lure	target
i64	f64	f64
101	1.924008	1.646904
102	1.088396	1.061537
103	1.415992	1.152863
104	1.102379	1.096696
105	1.331477	1.186833

We can use relplot to visualize this relationship. By default, relplot will make a scatterplot, with one point for each pair of observations.

sns.relplot(data=response_time, x="target", y="lure", height=3);

../../_images/6d38fae023a20d074f5ef019ebeae1cb2dbe3f853291df036ea6a4b309391c77.png

Sometimes, we have an independent variable that varies across multiple numeric values, and a dependent variable that may be related to that independent variable. We can visualize this type of relationship using a line plot.

In the Osth & Fox (2019) dataset, the main question of interest was whether lag between rearranged pairs would affect the false-alarm rate. Let’s get the mean probability of responding “old” for each subject and lag.

lures = (
    data.filter(pl.col("probe_type") == "lure")
    .group_by("subj", "lag")
    .agg(
        pl.col("response").mean(),
        pl.col("RT").mean()
    )
    .sort("subj", "lag")
)
lures.head(10)

shape: (10, 4)

subj	lag	response	RT
i64	i64	f64	f64
101	1	0.0625	1.723833
101	2	0.1875	1.795167
101	3	0.0625	1.745646
101	4	0.125	2.085083
101	5	0.104167	2.270312
102	1	0.333333	1.06575
102	2	0.208333	1.057562
102	3	0.166667	1.131146
102	4	0.3125	1.052354
102	5	0.1875	1.135167

Now we can plot false-alarm rate as it varies by lag, to see if there seems to be a relationship there.

g = sns.relplot(data=lures, x="lag", y="response", kind="line", height=3)
g.set(ylim=(0, 0.5));

../../_images/7e7d4f593bfeb0579405d1694eb753c2a0cfb4a94f6621a1cf785506a02d0fd7.png

Here, we’ve used the FacetGrid object that Seaborn returns to set a property of the figure after making it. We used the ylim parameter to indicate that the y-axis should range from 0 (the lowest possible false-alarm rate) to 0.5 (a pretty high false-alarm rate). The plot shows bootstrap confidence intervals, similar to what we had before in our bar plots. Here, there is a confidence band that connects the different confidence intervals at each lag.

The plot makes it easy to see that there isn’t an obvious relationship between false alarms and lag. There may be a slight trend toward a decrease in false-alarm rate with lag, but it probably isn’t statistically reliable.

See the Seaborn documentation for more information about visualizing statistical relationships.

Exercise: scatter plot#

Say that we want to examine whether recall in the mixed and pure lists is correlated across subjects. First, we’ll make a pivot table with one column for recall in each of the two list types.

pivot = recall_lt.pivot("list_type", index="subject", values="recall")
pivot.head()

shape: (5, 3)

subject	mixed	pure
i64	f64	f64
1	0.476389	0.564815
2	0.55	0.555556
3	0.484722	0.474537
4	0.5375	0.564815
5	0.676389	0.601852

Use data in the pivot DataFrame to make a scatterplot showing how recall in the mixed condition relates to recall in the pure condition.

# your code here

Exercise: line plot#

Say that we want to know whether recall depends on the input position of each item in the list. First, we get the mean recall by position for each subject.

spc = (
    fr.group_by("subject", "input")
    .agg(pl.col("recall").mean())
    .sort("subject", "input")
)
spc.head()

shape: (5, 3)

subject	input	recall
i64	i64	f64
1	1	0.541667
1	2	0.458333
1	3	0.625
1	4	0.333333
1	5	0.4375

Use the spc DataFrame to make a line plot showing how recall varies with input position.

Advanced#

Seaborn plotting functions like relplot return a FacetGrid object, which has methods for changing a plot. The set method, for example, can be used to change the ylim property that determines the range of the y-axis. Add .set(ylim=(0, 1)) immeediately after your call to relplot to set the limits of the y-axis.

# your code here

Using subplots#

The main Seaborn plotting functions (displot, catplot, and relplot) can be used to make multiple plots in one figure, to help visualize different facets of the data.

Let’s look again at visualizing response time distributions and how they vary by trial type. Before, we used hue to differentiate response time distributions for different trial types.

sns.displot(data=perf, x="RT", hue="probe_type", kind="hist", height=3);

This can be a little hard to read, given the overlap between the distributions.

Another option is to plot lures and targets in different subplots. We can do this using the col input instead of hue. Each group will be assigned to a different column of the subplot grid.

sns.displot(data=perf, x="RT", col="probe_type", kind="hist", height=3);

../../_images/72c12b9a392677d342398f63874630156d5c42b5775efd50e84f74bded153482.png

We can also use subplots to split up other kinds of plots. For example, we saw before that we can use relplot to visualize how false alarms vary with lure lag.

sns.relplot(data=lures, x="lag", y="response", kind="line", height=3).set(ylim=(0, 0.5));

../../_images/67da49c69a6eb7500d59feff20af21a8da3f03a804f008aa8a3e2a4af705ac52.png

We can use the col input to visualize this same relationship for individual subjects. We can use the col_wrap input to make a grid; it will “wrap” after the fourth column and move to the next row. We’ll plot the first 8 subjects, to see how the relationship looks for different people.

subj8 = lures["subj"].unique()[:8].to_list()
lure8 = lures.filter(pl.col("subj").is_in(subj8))
sns.relplot(data=lure8, x="lag", y="response", col="subj", col_wrap=4, kind="line", height=2);

../../_images/2bebe04e13f1cc1240506be0357aa1dd70a8aa42910e0449d3e439b96824c5cf.png

We can use the col input to split up a plot by any group that we have a column for in the DataFrame.

Exercise: using subplots#

Say we want to examine how recall in the Morton et al. (2013) dataset varies with input position and category. First, we’ll calculate mean recall by subject, input position, and category.

spc_cat = (
    fr.group_by("subject", "input", "category")
    .agg(pl.col("recall").mean())
    .sort("subject", "input", "category")
)
spc_cat.head()

shape: (5, 4)

subject	input	category	recall
i64	i64	str	f64
1	1	"cel"	0.526316
1	1	"loc"	0.307692
1	1	"obj"	0.75
1	2	"cel"	0.526316
1	2	"loc"	0.384615

Use the spc_cat DataFrame to make three line plots showing how recall varies with input position, split up by category (one plot in each of three columns).

Compare this plot to a single line plot that labels each category using a different color.

# your code here

Choosing color palettes#

When color is used as a dimension of a plot, the choice of color is very important. Color differences should be constant over the whole scale, easy to understand, and viewable by people with color blindness.

Seaborn comes with a collection of color palettes that are useful for different purposes. We can visualize these palettes separately in Jupyter.

display(sns.color_palette())  # the default palette
display(sns.color_palette("viridis"))  # a sequential scale with 6 bins
display(sns.color_palette("viridis", as_cmap=True))  # a continuous sequential scale

viridis

under

bad

over

A qualitative palette maps a small number of distinct values, such as categories, to distinct colors.

A sequential palette maps an ordered variable to a color scale. Sequential maps may either have a small set of bins, or may be continuous.

We will use the Morton et al. (2013) study of free recall for these examples.

Qualitative color palettes#

Qualitative color palettes are best to use when there is no inherent ordering of categories. Seaborn tries to identify when this is the case (for example, when conditions are divided up based on a column with strings) and select a palette that maximizes contrast between categories.

For example, say we want to compare recall based on category and list type. We first calculate the mean recall for each combination of conditions.

cat_lt = (
    fr.group_by("subject", "category", "list_type")
    .agg(pl.col("recall").mean())
)

We can then plot the mean recall using a bar plot, with different hues for the different categories. The optional palette input determines the set of hues that will be used.

sns.catplot(data=cat_lt, x="list_type", y="recall", hue="category", kind="bar", palette="Set2", height=3);

../../_images/acc9321c1b57cf6b875227da6350a7448dac1cbb04500acf4bcc883b7f53ea96.png

Seaborn comes with a number of good qualitative palettes, which vary in how bright or muted they are and how colorblind-friendly they are. It’s a good idea to avoid having red and green in the same set of colors, to make your plots easier for colorblind people to read. See the Seaborn qualitative palette tutorial for more information.

display(sns.color_palette())
display(sns.color_palette("Set1"))
display(sns.color_palette("Set2"))
display(sns.color_palette("colorblind"))

Sequential color palettes#

Sequential color palettes are designed for visualization of variables that vary on a range. For example, recall accuracy varies from 0 to 1. Ideally, we want the color scale to vary smoothly, to match the underlying data. Seaborn includes sequential palettes that are carefully designed based on what we know about color perception.

Sequential color palettes are often used to present data in images. For example, say we want to see how recall varies by the serial position of items in a list, and we want to see this for each subject in a study. We can visualize this using a heatmap.

First, we get a two-dimensional array with the data we want to plot.

sp = (
    fr.filter(pl.col("study"))
    .group_by("subject", "input")
    .agg(pl.col("recall").mean())
)
n = sp.shape[0]
M = 24
N = int(n / M)
mat = sp.sort("subject", "input")["recall"].to_numpy().reshape((N, M))

Next, we’ll convert the matrix into a Pandas DataFrame. Unlike Polars DataFrames, Pandas DataFrames include both column labels and row labels, which are called an index. This lets us set labels for both the x and y dimensions automatically when using Seaborn to visualize it.

import pandas as pd
x_tick_labels = [str(i) for i in range (1, M + 1)]
y_tick_labels = fr["subject"].unique()
df = pd.DataFrame(mat, columns=x_tick_labels, index=y_tick_labels)

Now that the data matrix is prepared, we can use Seaborn’s heatmap function to plot it. We’ll add labels for the x-axis and y-axis.

fig, ax = plt.subplots(figsize=(3.5, 3))
sns.heatmap(df, ax=ax)
ax.set_ylabel("Subject")
ax.set_xlabel("Serial position");

../../_images/a263b5f8a7dfe3fc145cb8551ddd0cd21539e6f43e84677c9abe4a16e72f1ae6.png

The heatmap includes an x-axis (serial position), a y-axis (subject), and a color scale (recall probability). This lets us see how three different variables relate to one another. The color bar on the right indicates the relationship between color in the plot and the dimension we are plotting.

Seaborn includes a number of excellent sequential color palettes, which are designed to vary in a perceptually uniform way. If we use one of these scales, then changes in the variable being plotted will correspond to a similar visual change in the color. See the Seaborn sequential palettes tutorial for more information.

display(sns.color_palette("viridis", as_cmap=True))
display(sns.color_palette("inferno", as_cmap=True))
display(sns.color_palette("magma", as_cmap=True))
display(sns.color_palette("rocket", as_cmap=True))
display(sns.color_palette("mako", as_cmap=True))

viridis

under

bad

over

inferno

under

bad

over

magma

under

bad

over

rocket

under

bad

over

mako

under

bad

over

The choice of palette, also sometimes called a color map, is very important, because it affects the appearance of our data.

Say that we have some matrix that we want to visualize. We might consider three common palettes:

gray : A simple scale for mapping values to colors. Brighter areas correspond to greater values.

viridis : A recently developed color map that is designed to have a wide range of colors that is perceptually uniform and easy to interpret. Brighter, more yellow areas correspond to greater values.

jet : A “rainbow” color map that uses most colors that humans can perceive. Warmer colors correspond to greater values.

display(sns.color_palette("gray", as_cmap=True))
display(sns.color_palette("viridis", as_cmap=True))
# display(sns.color_palette("jet", as_cmap=True))  # throws an error because the author of Seaborn hates jet

gray

under

bad

over

viridis

under

bad

over

When creating a heatmap, we can indicate the colormap to use using the cmap input. In this example, we will use matplotlib to set up a figure with three subplots so we can visualize different heatmaps side-by-side.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 3, figsize=(9, 3))
cmaps = ["gray", "viridis", "jet"]
for i, cmap in enumerate(cmaps):
    sns.heatmap(df, ax=ax[i], cmap=cmap)
    ax[i].axis("off")
    ax[i].set_title(cmap)

../../_images/f504d7ae68118854f40967f6510c4616aca064a795204c2a888655cf3f042e01.png

Researchers often choose a rainbow color map like jet because it has been used historically in scientific publications, and shows a lot of detail due to its wide variation in color. However, it has serious problems that cause issues with using it in scientific research. It is perceptually not uniform, with the appearance of bands in the color scale. Values mapped near to yellow and cyan appear to vary more than values that are mapped near red and blue. The order of colors is also not obvious; the reader needs to carefully consult the color bar to know which colors correspond to greater values.

It can be hard to appreciate differences between color maps when looking at an unfamiliar data matrix. In contrast, using them to visualize a photo makes the advantages and disadvantages of different color maps obvious. Let’s look at a black-and-white photograph of the psychology researcher Hermann Ebbinghaus, who ran himself through rigorous tests of his own memory.

from PIL import Image
image = Image.open("Ebbinghaus.jpg")

fig, ax = plt.subplots(1, 3, figsize=(9, 3))
cmaps = ["gray", "viridis", "jet"]
for i, cmap in enumerate(cmaps):
    sns.heatmap(image, ax=ax[i], cmap=cmap)
    ax[i].axis("equal")
    ax[i].axis("off")
    ax[i].set_title(cmap)

../../_images/c45c9029d321037e0475fc433c0108db76c2b9efe863e6af701c5d926a6c0a02.png

The viridis color map looks similar to the original photo, but uses variation in color to make it easier to see highlights and shadows. The jet color map distorts the data, making the image appear to have color “bands” that are not in the actual photo. It also makes it hard to tell which parts of the image are supposed to be light, and which are supposed to be dark.

To avoid these problems, use one of Seaborn’s perceptually uniform palettes.

Summary#

Plot types#

Seaborn makes it relatively easy to make common types of plots.

Distributional plots can be used to visualize variation in some measurement.

Categorical plots can be used to visualize how a measurement differs between experimental conditions.

Relational plots can be used to visualize relationships between variables.

Color Palettes#

Color palettes built into Seaborn help to visualize variables.

Qualitative palettes are designed for representing unordered categories.

Sequential palettes are designed for representing values that vary over some range.

Data Visualization

Contents

Data Visualization#

Preparing data for plotting#

Using Seaborn#

Distribution plots#

Exercise: distributional plots#

Categorical plots#

Exercise: categorical plots#

Relational plots#

Exercise: scatter plot#

Exercise: line plot#

Advanced#

Using subplots#

Exercise: using subplots#

Choosing color palettes#

Qualitative color palettes#

Sequential color palettes#

Summary#

Plot types#

Color Palettes#