Python Data Cleaning Cheat Sheet#
import polars as pl
pl.Config.set_tbl_rows(12)
data = pl.read_csv("scores.csv", schema_overrides={"participant_id": str})
Handling missing data sentinels#
Sometimes, special values (like -999), called sentinels, are used to indicate missing data. The replace method can convert these values to None to mark them as missing.
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | -999 |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
To replace missing data sentinels in the test3 column, we can use the replace method.
data.with_columns(pl.col("test3").replace(-999, None))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
To replace missing data sentinels in multiple columns, pass multiple column names to pl.col.
data.with_columns(
pl.col("test1", "test2", "test3").replace(-999, None)
) # replace -999 in any test score with None (null)
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Alternatively, we can set the null_values parameter when calling read_csv to convert those special values to null immediately, in all columns.
data = pl.read_csv(
"scores.csv",
null_values=["-999"],
schema_overrides={"participant_id": str},
)
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Recoding variables#
Use replace to change the values of a variable and cast to change its data type.
The replace method can also be used to change labels used in columns. To change multiple values, pass a dictionary with a key for each value you want to replace.
data.with_columns(
pl.col("group").replace({"exp": "Experimental", "con": "Control"})
) # replace exp with Experimental and con with Control
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "Experimental" | 1 | 6 | 9 | null |
| "01" | "Experimental" | 2 | 4 | 8 | 9 |
| "02" | "Control" | 1 | 9 | 10 | 8 |
| "02" | "Control" | 2 | 7 | 9 | 7 |
To recode a variable into a different data type, use cast.
data.with_columns(
pl.col("condition").cast(pl.String) # convert variable to a string
)
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | str | i64 | i64 | i64 |
| "01" | "exp" | "1" | 6 | 9 | null |
| "01" | "exp" | "2" | 4 | 8 | 9 |
| "02" | "con" | "1" | 9 | 10 | 8 |
| "02" | "con" | "2" | 7 | 9 | 7 |
Casting can be useful when we want to recode a numeric label as a string instead.
data.with_columns(
pl.col("condition").cast(pl.String).replace({"1": "target", "2": "lure"})
) # convert variable to string and then replace values
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | str | i64 | i64 | i64 |
| "01" | "exp" | "target" | 6 | 9 | null |
| "01" | "exp" | "lure" | 4 | 8 | 9 |
| "02" | "con" | "target" | 9 | 10 | 8 |
| "02" | "con" | "lure" | 7 | 9 | 7 |
Working with missing data#
Missing data should be marked as null in DataFrames; null values are then excluded from calculations.
Use null_count to count the number of null values in each column.
data.null_count()
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 0 | 0 | 1 |
Statistical calculations will ignore null values.
data.select("test1", "test2", "test3").mean()
| test1 | test2 | test3 |
|---|---|---|
| f64 | f64 | f64 |
| 6.5 | 9.0 | 8.0 |
Alternatively, we can replace null values with some other value using fill_null.
data.with_columns(pl.col("test1", "test2", "test3").fill_null(0))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | 0 |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Use drop_nulls to exclude any rows that contain null values.
data.drop_nulls()
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Grouping and aggregation#
Use group_by and agg to split data into groups and calculate statistics for each group.
This groups rows by the group label, then calculates statistics for the test1 scores within each group.
data.group_by("group").agg(pl.col("test1").mean()) # mean for each group
| group | test1 |
|---|---|
| str | f64 |
| "exp" | 5.0 |
| "con" | 8.0 |
data.group_by("group").agg(pl.col("test1").std()) # standard deviation
| group | test1 |
|---|---|
| str | f64 |
| "con" | 1.414214 |
| "exp" | 1.414214 |
Pass multiple expressions to agg to calculate multiple statistics at once.
(
data.group_by("group")
.agg(mean=pl.col("test1").mean(), sd=pl.col("test1").std())
)
| group | mean | sd |
|---|---|---|
| str | f64 | f64 |
| "con" | 8.0 | 1.414214 |
| "exp" | 5.0 | 1.414214 |
Grouping by multiple variables#
To calculate statistics for each combination of multiple variables, use group_by with multiple columns.
(
data.group_by("group", "condition") # group by two variables
.agg(pl.col("test1").mean()) # calculate mean for each combination
.sort("group", "condition") # sort the result
)
| group | condition | test1 |
|---|---|---|
| str | i64 | f64 |
| "con" | 1 | 9.0 |
| "con" | 2 | 7.0 |
| "exp" | 1 | 6.0 |
| "exp" | 2 | 4.0 |
Use len to get the number of samples in each combination of group and condition.
(
data.group_by("group", "condition") # group by two variables
.len() # get the number of samples
.sort("group", "condition") # sort the result
)
| group | condition | len |
|---|---|---|
| str | i64 | u32 |
| "con" | 1 | 1 |
| "con" | 2 | 1 |
| "exp" | 1 | 1 |
| "exp" | 2 | 1 |
Reshaping data to long format#
In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.
This example data frame is in wide format, because it has multiple test scores in each row.
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Convert from wide format to long format using unpivot. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.
data.unpivot(["test1", "test2", "test3"])
| variable | value |
|---|---|
| str | i64 |
| "test1" | 6 |
| "test1" | 4 |
| "test1" | 9 |
| "test1" | 7 |
| "test2" | 9 |
| "test2" | 8 |
| "test2" | 10 |
| "test2" | 9 |
| "test3" | null |
| "test3" | 9 |
| "test3" | 8 |
| "test3" | 7 |
To keep the participant_id, group, and condition columns, pass those to the index argument. We’ll also sort by those columns.
(
data.unpivot(
["test1", "test2", "test3"],
index=["participant_id", "group", "condition"],
)
.sort("participant_id", "group", "condition")
)
| participant_id | group | condition | variable | value |
|---|---|---|---|---|
| str | str | i64 | str | i64 |
| "01" | "exp" | 1 | "test1" | 6 |
| "01" | "exp" | 1 | "test2" | 9 |
| "01" | "exp" | 1 | "test3" | null |
| "01" | "exp" | 2 | "test1" | 4 |
| "01" | "exp" | 2 | "test2" | 8 |
| "01" | "exp" | 2 | "test3" | 9 |
| "02" | "con" | 1 | "test1" | 9 |
| "02" | "con" | 1 | "test2" | 10 |
| "02" | "con" | 1 | "test3" | 8 |
| "02" | "con" | 2 | "test1" | 7 |
| "02" | "con" | 2 | "test2" | 9 |
| "02" | "con" | 2 | "test3" | 7 |
To change the names of the new columns, use variable_name to specify the name of the column that keeps the test label, and value_name to specify the name of the column with the score value.
long = (
data.unpivot(
["test1", "test2", "test3"],
index=["participant_id", "group", "condition"],
variable_name="test_type",
value_name="score",
)
.sort("participant_id", "group", "condition")
)
long
| participant_id | group | condition | test_type | score |
|---|---|---|---|---|
| str | str | i64 | str | i64 |
| "01" | "exp" | 1 | "test1" | 6 |
| "01" | "exp" | 1 | "test2" | 9 |
| "01" | "exp" | 1 | "test3" | null |
| "01" | "exp" | 2 | "test1" | 4 |
| "01" | "exp" | 2 | "test2" | 8 |
| "01" | "exp" | 2 | "test3" | 9 |
| "02" | "con" | 1 | "test1" | 9 |
| "02" | "con" | 1 | "test2" | 10 |
| "02" | "con" | 1 | "test3" | 8 |
| "02" | "con" | 2 | "test1" | 7 |
| "02" | "con" | 2 | "test2" | 9 |
| "02" | "con" | 2 | "test3" | 7 |
Reshaping data to wide format#
Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.
In this example, we first get the mean score for each test type, for each participant.
means = (
long.group_by("participant_id", "test_type")
.agg(pl.col("score").mean())
.sort("participant_id", "test_type")
)
means
| participant_id | test_type | score |
|---|---|---|
| str | str | f64 |
| "01" | "test1" | 5.0 |
| "01" | "test2" | 8.5 |
| "01" | "test3" | 9.0 |
| "02" | "test1" | 8.0 |
| "02" | "test2" | 9.5 |
| "02" | "test3" | 7.5 |
Now we can use pivot to make a table that shows the results for each participant, with a column for each test.
wide = means.pivot("test_type", index="participant_id", values="score")
wide
| participant_id | test1 | test2 | test3 |
|---|---|---|---|
| str | f64 | f64 | f64 |
| "01" | 5.0 | 8.5 | 9.0 |
| "02" | 8.0 | 9.5 | 7.5 |
With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.
wide.with_columns(
total=pl.col("test1") + pl.col("test2") + pl.col("test3")
)
| participant_id | test1 | test2 | test3 | total |
|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 |
| "01" | 5.0 | 8.5 | 9.0 | 22.5 |
| "02" | 8.0 | 9.5 | 7.5 | 25.0 |