Python Data Cleaning Cheat Sheet#

import polars as pl
pl.Config.set_tbl_rows(12)
data = pl.read_csv("scores.csv", schema_overrides={"participant_id": str})

Handling missing data sentinels#

Sometimes, special values (like -999) called sentinels are used to indicate missing data. The replace method can be used to replace these values with None to indicate that they are missing.

data
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"169-999
"01""exp"2489
"02""con"19108
"02""con"2797

To replace missing data sentinels in the test3 column, we can use the replace method.

data.with_columns(pl.col("test3").replace(-999, None))
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"169null
"01""exp"2489
"02""con"19108
"02""con"2797

To replace missing data sentinels in multiple columns, just pass them to pl.col.

data.with_columns(
    pl.col("test1", "test2", "test3").replace(-999, None)  
)  # replace -999 in any test score with None (null)
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"169null
"01""exp"2489
"02""con"19108
"02""con"2797

Alternatively, we can set the null_values input when calling read_csv to set those special values to null immediately, in all columns.

data = pl.read_csv(
    "scores.csv", 
    null_values=["-999"], 
    schema_overrides={"participant_id": str},
)
data
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"169null
"01""exp"2489
"02""con"19108
"02""con"2797

Recoding variables#

Use replace to change the values of variables and cast to change the data type of variables.

The replace method can also be used to change labels used in columns. To change multiple values, pass a dictionary with a key for each value you want to replace.

data.with_columns(
    pl.col("group").replace({"exp": "Experimental", "con": "Control"})
)  # replace exp with Experimental and con with Control
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""Experimental"169null
"01""Experimental"2489
"02""Control"19108
"02""Control"2797

To recode a variable into a different data type, use cast.

data.with_columns(
    pl.col("condition").cast(pl.String)  # convert variable to a string
)
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstrstri64i64i64
"01""exp""1"69null
"01""exp""2"489
"02""con""1"9108
"02""con""2"797

Casting can be useful when we want to recode a numeric label as a string instead.

data.with_columns(
    pl.col("condition").cast(pl.String).replace({"1": "target", "2": "lure"})
)  # convert variable to string and then replace values
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstrstri64i64i64
"01""exp""target"69null
"01""exp""lure"489
"02""con""target"9108
"02""con""lure"797

Working with missing data#

Missing data should be marked as null in DataFrames. They will be excluded from calculations.

Use null_count to count the number of null values in each column.

data.null_count()
shape: (1, 6)
participant_idgroupconditiontest1test2test3
u32u32u32u32u32u32
000001

Statistical calculations will ignore null values.

data.select("test1", "test2", "test3").mean()
shape: (1, 3)
test1test2test3
f64f64f64
6.59.08.0

Altneratively, can replace null values with some other value using fill_null.

data.with_columns(pl.col("test1", "test2", "test3").fill_null(0))
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"1690
"01""exp"2489
"02""con"19108
"02""con"2797

Use drop_nulls to completely exclude rows with null.

data.drop_nulls()
shape: (3, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"2489
"02""con"19108
"02""con"2797

Grouping and aggregation#

Use group_by and agg to split data into groups and calculate statistics for each group.

This groups rows by the condition label, then calculates statistics for the test 1 scores within each group.

data.group_by("group").agg(pl.col("test1").mean())  # mean for each condition
shape: (2, 2)
grouptest1
strf64
"exp"5.0
"con"8.0
data.group_by("group").agg(pl.col("test1").std())  # standard deviation
shape: (2, 2)
grouptest1
strf64
"con"1.414214
"exp"1.414214

Pass multiple values to agg to calculate multiple statistics at once.

(
    data.group_by("group")
    .agg(mean=pl.col("test1").mean(), sd=pl.col("test1").std())
)
shape: (2, 3)
groupmeansd
strf64f64
"con"8.01.414214
"exp"5.01.414214

Grouping by multiple variables#

To calculate statistics for each combination of multiple variables, use group_by with multiple columns.

(
    data.group_by("group", "condition")  # group by two variables
    .agg(pl.col("test1").mean())         # calculate mean for each combination
    .sort("group", "condition")          # sort the result
)
shape: (4, 3)
groupconditiontest1
stri64f64
"con"19.0
"con"27.0
"exp"16.0
"exp"24.0

Use len to get the number of samples in each condition.

(
    data.group_by("group", "condition")  # group by two variables
    .len()                               # get the number of samples
    .sort("group", "condition")          # sort the result
)
shape: (4, 3)
groupconditionlen
stri64u32
"con"11
"con"21
"exp"11
"exp"21

Reshaping data to long format#

In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.

This example data frame is wide format, because it has multiple test scores in each row.

data
shape: (4, 6)
participant_idgroupconditiontest1test2test3
strstri64i64i64i64
"01""exp"169null
"01""exp"2489
"02""con"19108
"02""con"2797

Convert from wide format to long format using unpivot. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.

data.unpivot(["test1", "test2", "test3"])
shape: (12, 2)
variablevalue
stri64
"test1"6
"test1"4
"test1"9
"test1"7
"test2"9
"test2"8
"test2"10
"test2"9
"test3"null
"test3"9
"test3"8
"test3"7

To keep the participant_id, group, and condition columns, pass those to the index argument. We’ll also sort by those columns.

(
    data.unpivot(
        ["test1", "test2", "test3"],
        index=["participant_id", "group", "condition"],
    )
    .sort("participant_id", "group", "condition")
)
shape: (12, 5)
participant_idgroupconditionvariablevalue
strstri64stri64
"01""exp"1"test1"6
"01""exp"1"test2"9
"01""exp"1"test3"null
"01""exp"2"test1"4
"01""exp"2"test2"8
"01""exp"2"test3"9
"02""con"1"test1"9
"02""con"1"test2"10
"02""con"1"test3"8
"02""con"2"test1"7
"02""con"2"test2"9
"02""con"2"test3"7

To change the names of the new columns, use variable_name to specify the name of the column that keeps the test label, and value_name to specify the name of the column with the score value.

long = (
    data.unpivot(
        ["test1", "test2", "test3"],
        index=["participant_id", "group", "condition"],
        variable_name="test_type",
        value_name="score",
    )
    .sort("participant_id", "group", "condition")
)
long
shape: (12, 5)
participant_idgroupconditiontest_typescore
strstri64stri64
"01""exp"1"test1"6
"01""exp"1"test2"9
"01""exp"1"test3"null
"01""exp"2"test1"4
"01""exp"2"test2"8
"01""exp"2"test3"9
"02""con"1"test1"9
"02""con"1"test2"10
"02""con"1"test3"8
"02""con"2"test1"7
"02""con"2"test2"9
"02""con"2"test3"7

Reshaping data to wide format#

Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.

In this example, we first get the mean score for each test type, for each participant.

means = (
    long.group_by("participant_id", "test_type")
    .agg(pl.col("score").mean())
    .sort("participant_id", "test_type")
)
means
shape: (6, 3)
participant_idtest_typescore
strstrf64
"01""test1"5.0
"01""test2"8.5
"01""test3"9.0
"02""test1"8.0
"02""test2"9.5
"02""test3"7.5

Now we can use pivot to make a table that shows the results for each participant, with a column for each test.

wide = means.pivot("test_type", index="participant_id", values="score")
wide
shape: (2, 4)
participant_idtest1test2test3
strf64f64f64
"01"5.08.59.0
"02"8.09.57.5

With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.

wide.with_columns(
    total=pl.col("test1") + pl.col("test2") + pl.col("test3")
)
shape: (2, 5)
participant_idtest1test2test3total
strf64f64f64f64
"01"5.08.59.022.5
"02"8.09.57.525.0