Python Data Cleaning Cheat Sheet#
import polars as pl
pl.Config.set_tbl_rows(12)
data = pl.read_csv("scores.csv", schema_overrides={"participant_id": str})
Handling missing data sentinels#
Sometimes, special values (like -999), called sentinels, are used to indicate missing data. The replace method can convert these values to None to mark them as missing.
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | -999 |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
To replace missing data sentinels in the test3 column, we can use the replace method.
data.with_columns(pl.col("test3").replace(-999, None))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
To replace missing data sentinels in multiple columns, pass multiple column names to pl.col.
data.with_columns(
pl.col("test1", "test2", "test3").replace(-999, None)
) # replace -999 in any test score with None (null)
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Alternatively, we can set the null_values parameter when calling read_csv to convert those special values to null immediately, in all columns.
data = pl.read_csv(
"scores.csv",
null_values=["-999"],
schema_overrides={"participant_id": str},
)
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Recoding variables#
Use replace to change the values of a variable and cast to change its data type.
The replace method can also be used to change labels used in columns. To change multiple values, pass a dictionary with a key for each value you want to replace.
data.with_columns(
pl.col("group").replace({"exp": "Experimental", "con": "Control"})
) # replace exp with Experimental and con with Control
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "Experimental" | 1 | 6 | 9 | null |
| "01" | "Experimental" | 2 | 4 | 8 | 9 |
| "02" | "Control" | 1 | 9 | 10 | 8 |
| "02" | "Control" | 2 | 7 | 9 | 7 |
To recode a variable into a different data type, use cast.
data.with_columns(
pl.col("condition").cast(pl.String) # convert variable to a string
)
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | str | i64 | i64 | i64 |
| "01" | "exp" | "1" | 6 | 9 | null |
| "01" | "exp" | "2" | 4 | 8 | 9 |
| "02" | "con" | "1" | 9 | 10 | 8 |
| "02" | "con" | "2" | 7 | 9 | 7 |
Casting can be useful when we want to recode a numeric label as a string instead.
data.with_columns(
pl.col("condition").cast(pl.String).replace({"1": "target", "2": "lure"})
) # convert variable to string and then replace values
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | str | i64 | i64 | i64 |
| "01" | "exp" | "target" | 6 | 9 | null |
| "01" | "exp" | "lure" | 4 | 8 | 9 |
| "02" | "con" | "target" | 9 | 10 | 8 |
| "02" | "con" | "lure" | 7 | 9 | 7 |
Working with missing data#
Missing data should be marked as null in DataFrames; null values are then excluded from calculations.
Use null_count to count the number of null values in each column.
data.null_count()
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 0 | 0 | 1 |
Statistical calculations will ignore null values.
data.select("test1", "test2", "test3").mean()
| test1 | test2 | test3 |
|---|---|---|
| f64 | f64 | f64 |
| 6.5 | 9.0 | 8.0 |
Alternatively, we can replace null values with some other value using fill_null.
data.with_columns(pl.col("test1", "test2", "test3").fill_null(0))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | 0 |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Use drop_nulls to exclude any rows that contain null values.
data.drop_nulls()
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Grouping and aggregation#
Use group_by and agg to split data into groups and calculate statistics for each group.
This groups rows by the group label, then calculates statistics for the test1 scores within each group.
data.group_by("group").agg(pl.col("test1").mean()) # mean for each group
| group | test1 |
|---|---|
| str | f64 |
| "exp" | 5.0 |
| "con" | 8.0 |
data.group_by("group").agg(pl.col("test1").std()) # standard deviation
| group | test1 |
|---|---|
| str | f64 |
| "con" | 1.414214 |
| "exp" | 1.414214 |
Pass multiple expressions to agg to calculate multiple statistics at once.
(
data.group_by("group")
.agg(mean=pl.col("test1").mean(), sd=pl.col("test1").std())
)
| group | mean | sd |
|---|---|---|
| str | f64 | f64 |
| "con" | 8.0 | 1.414214 |
| "exp" | 5.0 | 1.414214 |
Grouping by multiple variables#
To calculate statistics for each combination of multiple variables, use group_by with multiple columns.
(
data.group_by("group", "condition") # group by two variables
.agg(pl.col("test1").mean()) # calculate mean for each combination
.sort("group", "condition") # sort the result
)
| group | condition | test1 |
|---|---|---|
| str | i64 | f64 |
| "con" | 1 | 9.0 |
| "con" | 2 | 7.0 |
| "exp" | 1 | 6.0 |
| "exp" | 2 | 4.0 |
Use len to get the number of samples in each combination of group and condition.
(
data.group_by("group", "condition") # group by two variables
.len() # get the number of samples
.sort("group", "condition") # sort the result
)
| group | condition | len |
|---|---|---|
| str | i64 | u32 |
| "con" | 1 | 1 |
| "con" | 2 | 1 |
| "exp" | 1 | 1 |
| "exp" | 2 | 1 |
Reshaping data to long format#
In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.
This example data frame is in wide format, because it has multiple test scores in each row.
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "01" | "exp" | 1 | 6 | 9 | null |
| "01" | "exp" | 2 | 4 | 8 | 9 |
| "02" | "con" | 1 | 9 | 10 | 8 |
| "02" | "con" | 2 | 7 | 9 | 7 |
Convert from wide format to long format using unpivot. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.
data.unpivot(["test1", "test2", "test3"])
| variable | value |
|---|---|
| str | i64 |
| "test1" | 6 |
| "test1" | 4 |
| "test1" | 9 |
| "test1" | 7 |
| "test2" | 9 |
| "test2" | 8 |
| "test2" | 10 |
| "test2" | 9 |
| "test3" | null |
| "test3" | 9 |
| "test3" | 8 |
| "test3" | 7 |
To keep the participant_id, group, and condition columns, pass those to the index argument. We’ll also sort by those columns.
(
data.unpivot(
["test1", "test2", "test3"],
index=["participant_id", "group", "condition"],
)
.sort("participant_id", "group", "condition")
)
| participant_id | group | condition | variable | value |
|---|---|---|---|---|
| str | str | i64 | str | i64 |
| "01" | "exp" | 1 | "test1" | 6 |
| "01" | "exp" | 1 | "test2" | 9 |
| "01" | "exp" | 1 | "test3" | null |
| "01" | "exp" | 2 | "test1" | 4 |
| "01" | "exp" | 2 | "test2" | 8 |
| "01" | "exp" | 2 | "test3" | 9 |
| "02" | "con" | 1 | "test1" | 9 |
| "02" | "con" | 1 | "test2" | 10 |
| "02" | "con" | 1 | "test3" | 8 |
| "02" | "con" | 2 | "test1" | 7 |
| "02" | "con" | 2 | "test2" | 9 |
| "02" | "con" | 2 | "test3" | 7 |
To change the names of the new columns, use variable_name to specify the name of the column that keeps the test label, and value_name to specify the name of the column with the score value.
long = (
data.unpivot(
["test1", "test2", "test3"],
index=["participant_id", "group", "condition"],
variable_name="test_type",
value_name="score",
)
.sort("participant_id", "group", "condition")
)
long
| participant_id | group | condition | test_type | score |
|---|---|---|---|---|
| str | str | i64 | str | i64 |
| "01" | "exp" | 1 | "test1" | 6 |
| "01" | "exp" | 1 | "test2" | 9 |
| "01" | "exp" | 1 | "test3" | null |
| "01" | "exp" | 2 | "test1" | 4 |
| "01" | "exp" | 2 | "test2" | 8 |
| "01" | "exp" | 2 | "test3" | 9 |
| "02" | "con" | 1 | "test1" | 9 |
| "02" | "con" | 1 | "test2" | 10 |
| "02" | "con" | 1 | "test3" | 8 |
| "02" | "con" | 2 | "test1" | 7 |
| "02" | "con" | 2 | "test2" | 9 |
| "02" | "con" | 2 | "test3" | 7 |
Reshaping data to wide format#
Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.
In this example, we first get the mean score for each test type, for each participant.
means = (
long.group_by("participant_id", "test_type")
.agg(pl.col("score").mean())
.sort("participant_id", "test_type")
)
means
| participant_id | test_type | score |
|---|---|---|
| str | str | f64 |
| "01" | "test1" | 5.0 |
| "01" | "test2" | 8.5 |
| "01" | "test3" | 9.0 |
| "02" | "test1" | 8.0 |
| "02" | "test2" | 9.5 |
| "02" | "test3" | 7.5 |
Now we can use pivot to make a table that shows the results for each participant, with a column for each test.
wide = means.pivot("test_type", index="participant_id", values="score")
wide
| participant_id | test1 | test2 | test3 |
|---|---|---|---|
| str | f64 | f64 | f64 |
| "01" | 5.0 | 8.5 | 9.0 |
| "02" | 8.0 | 9.5 | 7.5 |
With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.
wide.with_columns(
total=pl.col("test1") + pl.col("test2") + pl.col("test3")
)
| participant_id | test1 | test2 | test3 | total |
|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 |
| "01" | 5.0 | 8.5 | 9.0 | 22.5 |
| "02" | 8.0 | 9.5 | 7.5 | 25.0 |