Python Pandas Data Cleaning Cheat Sheet#

Note: the course focuses on using Polars, a package developed more recently to take advantage of advances in technology and interface design, and which is more similar to other data science approaches such as SQL and R’s Tidyverse. This guide demonstrates how to perform the same operations discussed in the chapter using the popular Pandas package instead of Polars. Note that it is generally simple to convert between Pandas and Polars data frames, so both packages can be used within the same workflow.
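For example, assuming the Polars package is also installed, a Pandas data frame can be converted to Polars and back using pl.from_pandas and to_pandas (a minimal sketch; the pd_frame variable here is just an illustration):

import pandas as pd
import polars as pl

pd_frame = pd.DataFrame({"participant_id": ["01", "02"], "test1": [6, 9]})
pl_frame = pl.from_pandas(pd_frame)  # convert Pandas -> Polars
back_to_pandas = pl_frame.to_pandas()  # convert Polars -> Pandas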

import pandas as pd

Handling missing data sentinels#

Sometimes, special values (like -999), called sentinels, are used to indicate missing data. The replace method can be used to convert these values to pd.NA so that Pandas treats them as missing.

By default, Pandas does not support storing integers along with missing data. To use nullable integers to represent a column, we can set the data type to "Int64" using the dtype argument when reading in the CSV file. We will also store columns with text data using the "string" data type, which is recommended in the Pandas documentation, instead of the default object data type.

data = pd.read_csv(
    "scores.csv", 
    dtype={
        "participant_id": "string", 
        "group": "string",
        "test1": "Int64", 
        "test2": "Int64", 
        "test3": "Int64",
    },
)
data
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 -999
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7

To replace missing data sentinels in the test3 column, we can use assign with the replace method. The assign method can be used to create or modify columns. In Pandas, instead of setting missing values to None, we use pd.NA.

data.assign(test3=data["test3"].replace(-999, pd.NA))
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 <NA>
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7

To replace missing data sentinels in multiple columns, we can use assignment together with the apply method. To operate on multiple columns using apply, we must supply a function that performs the operation we want to apply to each column. Here, we use an anonymous function created with the lambda syntax, which lets us quickly define a function without using the usual def function definition syntax. Our function takes one argument (a Pandas Series) and runs replace on it.

columns = ["test1", "test2", "test3"]
data[columns] = data[columns].apply(lambda x: x.replace(-999, pd.NA))
data
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 <NA>
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7
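For comparison, the lambda above is equivalent to defining a named function with def and passing it to apply (a sketch of the same operation):

def replace_sentinel(column):
    # column is a Pandas Series; replace the -999 sentinel with NA
    return column.replace(-999, pd.NA)

data[columns] = data[columns].apply(replace_sentinel)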

Alternatively, we can pass the na_values argument when calling read_csv to convert those special values to NA immediately, in all columns.

data = pd.read_csv(
    "scores.csv", 
    dtype={
        "participant_id": "string", 
        "group": "string",
        "test1": "Int64", 
        "test2": "Int64", 
        "test3": "Int64",
    },
    na_values=[-999],
)
data
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 <NA>
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7

Recoding variables#

Use replace to change the values of variables and astype to change the data type of variables.

The replace method can also be used to change labels used in columns. To change multiple values, pass a dictionary with a key for each value you want to replace.

data.assign(
    group=data["group"].replace({"exp": "Experimental", "con": "Control"})
)  # replace exp with Experimental and con with Control
participant_id group condition test1 test2 test3
0 01 Experimental 1 6 9 <NA>
1 01 Experimental 2 4 8 9
2 02 Control 1 9 10 8
3 02 Control 2 7 9 7

To recode a variable into a different data type, use astype.

data.assign(
    condition=data["condition"].astype("string")
)
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 <NA>
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7

Working with missing data#

Missing values should be marked as NA in DataFrames. They are excluded from most statistical calculations by default.

Use isna and sum to count the number of NA values in each column.

data.isna().sum()
participant_id    0
group             0
condition         0
test1             0
test2             0
test3             1
dtype: int64

Statistical calculations will ignore NA values by default.

data[["test1", "test2", "test3"]].mean()
test1    6.5
test2    9.0
test3    8.0
dtype: Float64

Statistical calculations can instead include all values, including NA, if skipna is set to False. In that case, the result is NA for any column that contains a missing value.

data[["test1", "test2", "test3"]].mean(skipna=False)
test1     6.5
test2     9.0
test3    <NA>
dtype: Float64

Alternatively, we can replace NA values with some other value using fillna.

data.fillna(0)
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 0
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7
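If only some columns should be filled, fillna also accepts a dictionary mapping column names to fill values (a small sketch):

data.fillna({"test3": 0})  # fill NA values only in the test3 column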

Use dropna to exclude rows that contain any NA values.

data.dropna()
participant_id group condition test1 test2 test3
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7
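To drop rows based on NA values in particular columns only, dropna accepts a subset argument (a small sketch that checks only test3):

data.dropna(subset=["test3"])  # drop rows where test3 is NA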

Grouping and aggregation#

Use groupby and agg to split data into groups and calculate statistics for each group.

This groups rows by the group label, then calculates statistics for the test 1 scores within each group. When a data frame is grouped and aggregated, Pandas sets the index of the result to the grouping variables. Because Pandas indexes are often complicated to work with, and because they have no analogue in Polars or R, here we use reset_index to make the grouping column a normal column again instead of an index.

data.groupby("group")["test1"].agg("mean").reset_index()
group test1
0 con 8.0
1 exp 5.0
data.groupby("group")["test1"].agg("std").reset_index()
group test1
0 con 1.414214
1 exp 1.414214

Pass a list of statistics names to agg to calculate multiple statistics at once. The names of columns in the result can be changed using rename.

(
    data.groupby("group")["test1"]
    .agg(["mean", "std"])
    .rename(columns={"std": "sd"})
    .reset_index()
)
group mean sd
0 con 8.0 1.414214
1 exp 5.0 1.414214
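As an alternative to rename, named aggregation lets us set the output column names directly inside agg (a small sketch, assuming a reasonably recent version of Pandas):

(
    data.groupby("group")["test1"]
    .agg(mean="mean", sd="std")
    .reset_index()
)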

Grouping by multiple variables#

To calculate statistics for each combination of multiple variables, use groupby with multiple column names in a list.

(
    data.groupby(["group", "condition"])["test1"]
    .mean()
    .reset_index()
)
group condition test1
0 con 1 9.0
1 con 2 7.0
2 exp 1 6.0
3 exp 2 4.0

Use size to get the number of samples in each combination of group and condition. Here, we use to_frame to convert the result, which is a Series, into a DataFrame with a column named "len".

(
    data.groupby(["group", "condition"])
    .size()
    .to_frame(name="len")
    .reset_index()
)
group condition len
0 con 1 1
1 con 2 1
2 exp 1 1
3 exp 2 1

Reshaping data to long format#

In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.

This example data frame is wide format, because it has multiple test scores in each row.

data
participant_id group condition test1 test2 test3
0 01 exp 1 6 9 <NA>
1 01 exp 2 4 8 9
2 02 con 1 9 10 8
3 02 con 2 7 9 7

Convert from wide format to long format using melt. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.

data.melt(value_vars=["test1", "test2", "test3"])
variable value
0 test1 6
1 test1 4
2 test1 9
3 test1 7
4 test2 9
5 test2 8
6 test2 10
7 test2 9
8 test3 <NA>
9 test3 9
10 test3 8
11 test3 7

To keep the participant_id, group, and condition columns, pass those to the id_vars argument. We’ll also sort by those columns using sort_values.

(
    data.melt(
        id_vars=["participant_id", "group", "condition"], 
        value_vars=["test1", "test2", "test3"],
    )
    .sort_values(["participant_id", "group", "condition"])
)
participant_id group condition variable value
0 01 exp 1 test1 6
4 01 exp 1 test2 9
8 01 exp 1 test3 <NA>
1 01 exp 2 test1 4
5 01 exp 2 test2 8
9 01 exp 2 test3 9
2 02 con 1 test1 9
6 02 con 1 test2 10
10 02 con 1 test3 8
3 02 con 2 test1 7
7 02 con 2 test2 9
11 02 con 2 test3 7

To change the names of the new columns, use var_name to specify the name of the column that keeps the test label, and value_name to specify the name of the column with the score value.

long = (
    data.melt(
        id_vars=["participant_id", "group", "condition"],
        value_vars=["test1", "test2", "test3"],
        var_name="test_type",
        value_name="score",
    )
    .sort_values(["participant_id", "group", "condition"])
)
long
participant_id group condition test_type score
0 01 exp 1 test1 6
4 01 exp 1 test2 9
8 01 exp 1 test3 <NA>
1 01 exp 2 test1 4
5 01 exp 2 test2 8
9 01 exp 2 test3 9
2 02 con 1 test1 9
6 02 con 1 test2 10
10 02 con 1 test3 8
3 02 con 2 test1 7
7 02 con 2 test2 9
11 02 con 2 test3 7

Reshaping data to wide format#

Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.

In this example, we first get the mean score for each test type, for each participant.

means = (
    long.groupby(["participant_id", "test_type"])["score"]
    .agg("mean")
    .reset_index()
)
means
participant_id test_type score
0 01 test1 5.0
1 01 test2 8.5
2 01 test3 9.0
3 02 test1 8.0
4 02 test2 9.5
5 02 test3 7.5

Now we can use pivot to make a table that shows the results for each participant, with a column for each test. To avoid complications with column names, here we use rename_axis to set the name of the column index to None before resetting the index.

wide = (
    means.pivot(
        columns="test_type", index="participant_id", values="score"
    )
    .rename_axis(columns=None)
    .reset_index()
)

With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.

wide.assign(
    total=wide["test1"] + wide["test2"] + wide["test3"]
)
participant_id test1 test2 test3 total
0 01 5.0 8.5 9.0 22.5
1 02 8.0 9.5 7.5 25.0
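An equivalent way to compute the total is to select the test columns and sum across each row with sum(axis=1) (a small sketch; note that sum skips NA values by default, unlike adding columns with +):

wide.assign(total=wide[["test1", "test2", "test3"]].sum(axis=1))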