Python Pandas Data Cleaning Cheat Sheet#
Note: the course focuses on using Polars, a recently developed package that takes advantage of advances in technology and interface design, and which is more similar to other data science tools such as SQL and R’s Tidyverse. This guide demonstrates how to perform the same operations discussed in the chapter using the popular Pandas package instead of Polars. Note that it is generally simple to convert between Pandas and Polars data frames (see the sketch below), so both packages can be used in a given workflow.
import pandas as pd
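For example, assuming the polars package (and its pyarrow dependency) is installed, a single call converts a data frame in either direction; a minimal sketch:
import polars as pl  # assumes polars (and pyarrow) are installed

pandas_df = pd.DataFrame({"score": [1, 2, 3]})
polars_df = pl.from_pandas(pandas_df)  # convert Pandas -> Polars
round_trip = polars_df.to_pandas()  # convert Polars -> Pandas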
Handling missing data sentinels#
Sometimes, special values (like -999), called sentinels, are used to indicate missing data. The replace method can be used to convert these values to pd.NA to mark them as missing.
By default, Pandas does not support storing integers alongside missing data. To represent a column with nullable integers, we can set its data type to "Int64" using the dtype argument when reading in the CSV file. We will also store columns with text data using the "string" data type, which the Pandas documentation recommends over the default object data type.
data = pd.read_csv(
"scores.csv",
dtype={
"participant_id": "string",
"group": "string",
"test1": "Int64",
"test2": "Int64",
"test3": "Int64",
},
)
data
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | -999 |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
To replace missing data sentinels in the test3 column, we can use assign with the replace method. The assign method can be used to create or modify columns. In Pandas, instead of setting missing values to None, we use pd.NA.
data.assign(test3=data["test3"].replace(-999, pd.NA))
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | <NA> |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
To replace missing data sentinels in multiple columns, we can use assignment with the apply method. To operate on multiple columns using apply, we must supply a function that performs the operation we want to apply to each column. Here, we use an anonymous function created using the lambda syntax, which quickly defines a function without the usual def function definition syntax (an equivalent def version is sketched after the output below). Our function takes one argument (a Pandas Series) and runs replace on it.
columns = ["test1", "test2", "test3"]
data[columns] = data[columns].apply(lambda x: x.replace(-999, pd.NA))
data
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | <NA> |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
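The lambda above is equivalent to defining a named function with def and passing it to apply; a sketch of the more verbose version:
def replace_sentinel(column):
    # column is a Pandas Series; replace the -999 sentinel with NA
    return column.replace(-999, pd.NA)

data[columns] = data[columns].apply(replace_sentinel)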
Alternatively, we can set the na_values argument when calling read_csv to convert those sentinel values to NA immediately, in all columns.
data = pd.read_csv(
"scores.csv",
dtype={
"participant_id": "string",
"group": "string",
"test1": "Int64",
"test2": "Int64",
"test3": "Int64",
},
na_values=[-999],
)
data
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | <NA> |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
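If the sentinel should be treated as missing only in certain columns, na_values also accepts a dictionary mapping column names to lists of sentinel values; a sketch that restricts the replacement to test3 (stored here under the hypothetical name data_subset):
data_subset = pd.read_csv(
    "scores.csv",
    dtype={
        "participant_id": "string",
        "group": "string",
        "test1": "Int64",
        "test2": "Int64",
        "test3": "Int64",
    },
    na_values={"test3": [-999]},  # treat -999 as NA in test3 only
)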
Recoding variables#
Use replace to change the values of variables and astype to change the data type of variables.
The replace method can also be used to change labels used in columns. To change multiple values, pass a dictionary with a key for each value you want to replace.
data.assign(
group=data["group"].replace({"exp": "Experimental", "con": "Control"})
) # replace exp with Experimental and con with Control
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | Experimental | 1 | 6 | 9 | <NA> |
| 1 | 01 | Experimental | 2 | 4 | 8 | 9 |
| 2 | 02 | Control | 1 | 9 | 10 | 8 |
| 3 | 02 | Control | 2 | 7 | 9 | 7 |
To recode a variable into a different data type, use astype.
data.assign(
condition=data["condition"].astype("string")
)
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | <NA> |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
Working with missing data#
Missing data should be marked as NA in DataFrames. NA values will be excluded from most statistical calculations by default.
Use isna and sum to count the number of NA values in each column.
data.isna().sum()
participant_id 0
group 0
condition 0
test1 0
test2 0
test3 1
dtype: int64
Statistical calculations will ignore NA values by default.
data[["test1", "test2", "test3"]].mean()
test1 6.5
test2 9.0
test3 8.0
dtype: Float64
Statistical calculations can instead include all values, including NA, if skipna is set to False; any statistic computed over missing data is then itself NA.
data[["test1", "test2", "test3"]].mean(skipna=False)
test1 6.5
test2 9.0
test3 <NA>
dtype: Float64
Alternatively, we can replace NA values with some other value using fillna.
data.fillna(0)
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | 0 |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
Use dropna to completely exclude rows with NA.
data.dropna()
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
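To drop rows only when particular columns are missing, dropna also takes a subset argument; a sketch:
data.dropna(subset=["test3"])  # drop rows where test3, specifically, is NA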
Grouping and aggregation#
Use groupby and agg to split data into groups and calculate statistics for each group.
This groups rows by the group label, then calculates statistics for the test 1 scores within each group. When a data frame is grouped and aggregated, Pandas sets the index of the data frame to the grouping variables. Because Pandas indexes are frequently complicated to work with, and because they have no analogue in Polars or R, here we use reset_index to make the grouping column a normal column again instead of an index.
data.groupby("group")["test1"].agg("mean").reset_index()
| | group | test1 |
|---|---|---|
| 0 | con | 8.0 |
| 1 | exp | 5.0 |
data.groupby("group")["test1"].agg("std").reset_index()
| | group | test1 |
|---|---|---|
| 0 | con | 1.414214 |
| 1 | exp | 1.414214 |
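As an aside, passing as_index=False to groupby keeps the grouping column as a regular column, avoiding the reset_index call; a sketch:
data.groupby("group", as_index=False)["test1"].agg("mean")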
Pass a list of statistics names to agg to calculate multiple statistics at once. The names of columns in the result can be changed using rename.
(
data.groupby("group")["test1"]
.agg(["mean", "std"])
.rename(columns={"std": "sd"})
.reset_index()
)
| | group | mean | sd |
|---|---|---|---|
| 0 | con | 8.0 | 1.414214 |
| 1 | exp | 5.0 | 1.414214 |
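An alternative that avoids the separate rename step is named aggregation, in which keyword arguments to agg set the output column names directly; a sketch:
(
    data.groupby("group")["test1"]
    .agg(mean="mean", sd="std")  # output columns named mean and sd
    .reset_index()
)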
Grouping by multiple variables#
To calculate statistics for each combination of multiple variables, use groupby with multiple column names in a list.
(
data.groupby(["group", "condition"])["test1"]
.mean()
.reset_index()
)
| | group | condition | test1 |
|---|---|---|---|
| 0 | con | 1 | 9.0 |
| 1 | con | 2 | 7.0 |
| 2 | exp | 1 | 6.0 |
| 3 | exp | 2 | 4.0 |
Use size to get the number of samples in each combination of group and condition. Here, we use to_frame to convert the result, which is a Series, into a DataFrame with a column named "len".
(
data.groupby(["group", "condition"])
.size()
.to_frame(name="len")
.reset_index()
)
| | group | condition | len |
|---|---|---|---|
| 0 | con | 1 | 1 |
| 1 | con | 2 | 1 |
| 2 | exp | 1 | 1 |
| 3 | exp | 2 | 1 |
Reshaping data to long format#
In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.
This example data frame is wide format, because it has multiple test scores in each row.
data
| | participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | 6 | 9 | <NA> |
| 1 | 01 | exp | 2 | 4 | 8 | 9 |
| 2 | 02 | con | 1 | 9 | 10 | 8 |
| 3 | 02 | con | 2 | 7 | 9 | 7 |
Convert from wide format to long format using melt. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.
data.melt(value_vars=["test1", "test2", "test3"])
| | variable | value |
|---|---|---|
| 0 | test1 | 6 |
| 1 | test1 | 4 |
| 2 | test1 | 9 |
| 3 | test1 | 7 |
| 4 | test2 | 9 |
| 5 | test2 | 8 |
| 6 | test2 | 10 |
| 7 | test2 | 9 |
| 8 | test3 | <NA> |
| 9 | test3 | 9 |
| 10 | test3 | 8 |
| 11 | test3 | 7 |
To keep the participant_id, group, and condition columns, pass those to the id_vars argument. We’ll also sort by those columns using sort_values.
(
data.melt(
id_vars=["participant_id", "group", "condition"],
value_vars=["test1", "test2", "test3"],
)
.sort_values(["participant_id", "group", "condition"])
)
| | participant_id | group | condition | variable | value |
|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | test1 | 6 |
| 4 | 01 | exp | 1 | test2 | 9 |
| 8 | 01 | exp | 1 | test3 | <NA> |
| 1 | 01 | exp | 2 | test1 | 4 |
| 5 | 01 | exp | 2 | test2 | 8 |
| 9 | 01 | exp | 2 | test3 | 9 |
| 2 | 02 | con | 1 | test1 | 9 |
| 6 | 02 | con | 1 | test2 | 10 |
| 10 | 02 | con | 1 | test3 | 8 |
| 3 | 02 | con | 2 | test1 | 7 |
| 7 | 02 | con | 2 | test2 | 9 |
| 11 | 02 | con | 2 | test3 | 7 |
To change the names of the new columns, use var_name to specify the name of the column that keeps the test label, and value_name to specify the name of the column with the score value.
long = (
data.melt(
id_vars=["participant_id", "group", "condition"],
value_vars=["test1", "test2", "test3"],
var_name="test_type",
value_name="score",
)
.sort_values(["participant_id", "group", "condition"])
)
long
| | participant_id | group | condition | test_type | score |
|---|---|---|---|---|---|
| 0 | 01 | exp | 1 | test1 | 6 |
| 4 | 01 | exp | 1 | test2 | 9 |
| 8 | 01 | exp | 1 | test3 | <NA> |
| 1 | 01 | exp | 2 | test1 | 4 |
| 5 | 01 | exp | 2 | test2 | 8 |
| 9 | 01 | exp | 2 | test3 | 9 |
| 2 | 02 | con | 1 | test1 | 9 |
| 6 | 02 | con | 1 | test2 | 10 |
| 10 | 02 | con | 1 | test3 | 8 |
| 3 | 02 | con | 2 | test1 | 7 |
| 7 | 02 | con | 2 | test2 | 9 |
| 11 | 02 | con | 2 | test3 | 7 |
Reshaping data to wide format#
Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.
In this example, we first get the mean score for each test type, for each participant.
means = (
long.groupby(["participant_id", "test_type"])["score"]
.agg("mean")
.reset_index()
)
means
| | participant_id | test_type | score |
|---|---|---|---|
| 0 | 01 | test1 | 5.0 |
| 1 | 01 | test2 | 8.5 |
| 2 | 01 | test3 | 9.0 |
| 3 | 02 | test1 | 8.0 |
| 4 | 02 | test2 | 9.5 |
| 5 | 02 | test3 | 7.5 |
Now we can use pivot to make a table that shows the results for each participant, with a column for each test. To avoid complications with column names, here we use rename_axis to set the name of the column index to None before resetting the index.
wide = (
means.pivot(
columns="test_type", index="participant_id", values="score"
)
.rename_axis(columns=None)
.reset_index()
)
With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.
wide.assign(
total=wide["test1"] + wide["test2"] + wide["test3"]
)
| | participant_id | test1 | test2 | test3 | total |
|---|---|---|---|---|---|
| 0 | 01 | 5.0 | 8.5 | 9.0 | 22.5 |
| 1 | 02 | 8.0 | 9.5 | 7.5 | 25.0 |
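An equivalent way to compute the total, which scales better to many columns, is a row-wise sum over the test columns; a sketch (note that sum skips NA by default):
wide.assign(total=wide[["test1", "test2", "test3"]].sum(axis=1))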