Data Cleaning#
Often, we will have a DataFrame with the information that we need, but the data are not organized in the right way. We may need to substitute variables, handle missing data, calculate summary statistics, and rearrange data to get it in the right shape. Polars has many tools for doing these things.
Reading data from the datascipsych package#
The datascipsych package includes a datasets module that can be used to access datasets that are included with the package.
To use the datascipsych package, it must first be installed. To install the package: (1) make sure you have a virtual environment and that it is activated; (2) open a terminal (with the virtual environment activated) and run pip install -e . (the final period is part of the command and refers to the current directory).
This will make the functions and datasets in the datascipsych package available in your environment.
First, we will use the datascipsych.datasets module to access the data for the Osth & Fox (2019) study. Note that the dataset is not stored in the current directory. Instead, we access the dataset from the installed datascipsych package. This allows us to access the same dataset, no matter what our current directory is.
from datascipsych import datasets
dataset_file = datasets.get_dataset_file("Osth2019")
The dataset_file should give the path to the dataset on your system.
If you get a ModuleNotFoundError, you do not have the datascipsych package installed. Follow the instructions given above to install the package.
Now, we can read the dataset file from the installed package. We’ll make things simpler by selecting a subset of columns.
import polars as pl
raw = (
    pl.read_csv(dataset_file)
    .select("subj", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag")
)
raw.head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
Recoding variables#
One relatively simple form of data cleaning is to take the data in a column and recode it. For example, a spreadsheet may label conditions with numeric codes that are hard to read and remember. Replacing these codes with strings makes the dataset easier to understand.
The Osth & Fox (2019) dataset has a number of -1 values in it. These -1 values indicate missing or undefined data. For example, during study trials, participants do not make any response. On these trials, the response column is set to -1 to indicate that no response was recorded.
raw.head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
We can replace the -1 values using the replace method. Instead of -1, we want missing values to be set to the special value null. Polars will automatically exclude those values from calculations. To set a value to null in the DataFrame, we can set it to None when calling replace.
raw.with_columns(
pl.col("response").replace(-1, None)
).head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "skin" | "careful" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "upon" | "miss" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "single" | "tradition" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "prove" | "airport" | null | -1.0 | -1 | -1 |
We used with_columns to keep all the existing columns. We used pl.col("response") to represent the response column, and then called the replace method. The first input indicates the value to replace, and the second input indicates the value to replace it with.
Note that there are multiple columns that follow the same convention, with -1 indicating missing values. We’ll want to replace all of them.
In Polars, we can carry out the same expression on multiple columns using expression expansion. Instead of writing multiple expressions with replace (one for each column), we can write just one expression that is applied to multiple columns.
raw.with_columns(
pl.col("response", "RT", "correct", "lag").replace(-1, None)
).head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | null | null | null | null |
101 | "study" | "intact" | "skin" | "careful" | null | null | null | null |
101 | "study" | "intact" | "upon" | "miss" | null | null | null | null |
101 | "study" | "intact" | "single" | "tradition" | null | null | null | null |
101 | "study" | "intact" | "prove" | "airport" | null | null | null | null |
Sometimes, it's helpful to recode a column to get different labels. The type column has two values: "intact" and "rearranged". Say we want to translate this into more standard labels typically used in recognition memory experiments such as this, namely "target" and "lure". We can do this by passing a dictionary to the replace method. This time, we'll create a new column called probe_type that has this alternate coding.
raw.with_columns(
probe_type=pl.col("type").replace({"intact": "target", "rearranged": "lure"})
).head()
subj | phase | type | word1 | word2 | response | RT | correct | lag | probe_type |
---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 | str |
101 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 | "target" |
Now that we have tested out the commands we want to run, let's put them all together to get a set of commands that will run all our recoding and assign the output to a new variable called data.
data = raw.with_columns(
pl.col("response", "RT", "correct", "lag").replace(-1, None),
probe_type=pl.col("type").replace({"intact": "target", "rearranged": "lure"})
)
data.head()
subj | phase | type | word1 | word2 | response | RT | correct | lag | probe_type |
---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 | str |
101 | "study" | "intact" | "formal" | "positive" | null | null | null | null | "target" |
101 | "study" | "intact" | "skin" | "careful" | null | null | null | null | "target" |
101 | "study" | "intact" | "upon" | "miss" | null | null | null | null | "target" |
101 | "study" | "intact" | "single" | "tradition" | null | null | null | null | "target" |
101 | "study" | "intact" | "prove" | "airport" | null | null | null | null | "target" |
Sometimes, a raw data file may code conditions as integers, but we want to convert the numeric condition codes into string labels. For example, say that a dataset has target trials coded as 1, and lure trials coded as 2.
simple = pl.DataFrame(
    {
        "trial": [1, 2, 3, 4],
        "condition": [1, 1, 2, 2],
    }
)
simple
trial | condition |
---|---|
i64 | i64 |
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
We cannot use replace to change the type of a column. To recode integers as strings, we can first cast the integers to strings, using .cast(pl.String). The pl.String attribute represents the string data type that Polars uses.
simple.with_columns(
condition=pl.col("condition").cast(pl.String)
)
trial | condition |
---|---|
i64 | str |
1 | "1" |
2 | "1" |
3 | "2" |
4 | "2" |
Finally, we can chain a call to replace after casting the column to string, to substitute "target" when the condition is "1" and "lure" when the condition is "2".
simple.with_columns(
condition=pl.col("condition").cast(pl.String).replace({"1": "target", "2": "lure"})
)
trial | condition |
---|---|
i64 | str |
1 | "target" |
2 | "target" |
3 | "lure" |
4 | "lure" |
Now, instead of numeric codes that are hard to read, we have strings that clearly show the condition corresponding to each trial.
Exercise: recoding variables#
Given the DataFrame defined below, recode the correct column to set the -999 values to null. Recode the condition column so that 1 is now "target" and 2 is now "lure".
df = pl.DataFrame({"trial": [1, 2, 3, 4], "condition": [1, 2, 1, 2], "correct": [0, 1, -999, 1]})
# your code here
Handling missing data#
Sometimes, observations will be missing; for example, the response on a trial may be undefined if the participant did not make a response.
Let’s set up an example DataFrame with some missing responses.
responses = pl.DataFrame(
    {
        "correct": [1, 0, 1, None, 0, 0, 1, None, 1],
        "response_time": [1.2, 2.1, 1.4, None, 1.8, 1.6, 1.4, None, 1.6],
    }
)
responses
correct | response_time |
---|---|
i64 | f64 |
1 | 1.2 |
0 | 2.1 |
1 | 1.4 |
null | null |
0 | 1.8 |
0 | 1.6 |
1 | 1.4 |
null | null |
1 | 1.6 |
We can tally the number of missing observations in each column using null_count.
responses.null_count()
correct | response_time |
---|---|
u32 | u32 |
2 | 2 |
We can calculate the mean accuracy and response time using the mean function.
responses.mean()
correct | response_time |
---|---|
f64 | f64 |
0.571429 | 1.585714 |
Polars functions generally ignore missing responses when performing calculations. Note that this works differently than in NumPy; even if there are missing observations, we don't get NaN when calculating statistics like the mean. Make sure to check for missing observations in your data using null_count or describe, which includes a count of null values in each column. Missing responses can sometimes indicate that there is something wrong with your data.
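To illustrate the difference (a minimal sketch with made-up values, not part of the original analysis), compare how NumPy and Polars summarize data with one missing value:
import numpy as np
# NumPy: a NaN propagates, so the mean is NaN unless we use np.nanmean
np.mean(np.array([1.0, 2.0, np.nan]))  # nan
np.nanmean(np.array([1.0, 2.0, np.nan]))  # 1.5
# Polars: null values are ignored, so the mean uses only the non-null values
pl.Series([1.0, 2.0, None]).mean()  # 1.5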
While the mean function will just ignore trials with missing responses, we might instead want to count missing responses as incorrect by coding them as 0. We can recode null values as something else using fill_null.
missing = responses.with_columns(
    correct_missing=pl.col("correct").fill_null(0)
)
missing.mean()
correct | response_time | correct_missing |
---|---|---|
f64 | f64 | f64 |
0.571429 | 1.585714 | 0.444444 |
Note that the accuracy is lower when we count missing responses as incorrect: 4 correct responses out of 7 non-missing trials (about 0.57) versus 4 out of all 9 trials (about 0.44).
We can also use filter together with is_not_null to remove rows where a column is null.
included = responses.filter(pl.col("correct").is_not_null())
included
correct | response_time |
---|---|
i64 | f64 |
1 | 1.2 |
0 | 2.1 |
1 | 1.4 |
0 | 1.8 |
0 | 1.6 |
1 | 1.4 |
1 | 1.6 |
Exercise: Handling missing data#
Given the DataFrame defined below, get the number of missing observations in the correct column. Calculate the mean accuracy across all trials without missing responses. Then calculate mean accuracy if missing responses are considered incorrect.
df = pl.DataFrame({"trial": [1, 2, 3, 4, 5, 6], "correct": [0, 1, 1, 0, None, 1]})
# answer here
Grouping and aggregation#
When we are analyzing a dataset, we usually don't just want to calculate summary statistics for a whole column of data. Often we will instead want to break the dataset into groups and calculate the summary statistics for each group. We can do this using the group_by and agg methods.
Grouping and aggregation allow us to calculate summary statistics for groups within our data using very little code. Here, we'll calculate the mean response time for correct trials (where correct == 1) and incorrect trials (where correct == 0).
included.group_by("correct").agg(pl.col("response_time").mean())
correct | response_time |
---|---|
i64 | f64 |
1 | 1.4 |
0 | 1.833333 |
We first grouped the data to make a group for each value of the correct column. We then aggregated the data in the response_time column by calculating the mean within each group.
Let’s break down this code to see how it works.
To perform aggregation, we start by calling group_by. Here, we give the name of the correct column to indicate that we want to make one group for each value of that column. There will be one group for correct == 0 and one group for correct == 1.
g = included.group_by("correct")
Normally, we would next call the agg method to calculate some aggregation within each group. Instead, we'll convert the GroupBy object into a list so we can look at it.
g = list(g)
Each group includes a name (a tuple containing the value for that group, which here is 0 or 1) and a DataFrame that just has the rows that match that value.
Let’s take a look at the first group.
group_name, group_df = g[0]
print(group_name)
group_df
(0,)
correct | response_time |
---|---|
i64 | f64 |
0 | 2.1 |
0 | 1.8 |
0 | 1.6 |
Next, let’s look at the second group.
group_name, group_df = g[1]
print(group_name)
group_df
(1,)
correct | response_time |
---|---|
i64 | f64 |
1 | 1.2 |
1 | 1.4 |
1 | 1.4 |
1 | 1.6 |
We can aggregate the data within each group in several different ways. One simple method is to just pass the name of a column to the agg function. This will give us a column that has a list of all the observations that were in each group.
included.group_by("correct").agg(pl.col("response_time"))
correct | response_time |
---|---|
i64 | list[f64] |
1 | [1.2, 1.4, … 1.6] |
0 | [2.1, 1.8, 1.6] |
Often, we'll want to calculate a summary statistic, like the mean, for each group. We can do that by specifying the column to aggregate and calling the mean function.
included.group_by("correct").agg(pl.col("response_time").mean())
correct | response_time |
---|---|
i64 | f64 |
1 | 1.4 |
0 | 1.833333 |
We can also calculate other summary statistics, like the standard deviation.
included.group_by("correct").agg(pl.col("response_time").std())
correct | response_time |
---|---|
i64 | f64 |
1 | 0.163299 |
0 | 0.251661 |
We can even calculate multiple summary statistics at the same time. In this case, we must give the columns different names to avoid getting an error. Like in select and with_columns, we can assign names using keyword arguments to agg.
(
included.group_by("correct")
.agg(
mean=pl.col("response_time").mean(),
std=pl.col("response_time").std(),
)
)
correct | mean | std |
---|---|---|
i64 | f64 | f64 |
0 | 1.833333 | 0.251661 |
1 | 1.4 | 0.163299 |
See the Polars documentation for lists of methods that can be used to aggregate whole groups or individual columns.
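For illustration (a sketch that is not part of the original walkthrough), here are a few other aggregations applied to the same included DataFrame: pl.len() counts the rows in each group, while median and max summarize the response_time column within each group.
(
    included.group_by("correct")
    .agg(
        n=pl.len(),  # number of trials in each group
        median_rt=pl.col("response_time").median(),
        max_rt=pl.col("response_time").max(),
    )
)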
Exercise: grouping and aggregation#
Given the DataFrame defined below, calculate the mean response time for each condition.
df = pl.DataFrame(
    {
        "trial": [1, 2, 3, 4, 5, 6],
        "condition": [1, 1, 1, 2, 2, 2],
        "response_time": [1.2, 1.4, 1.3, 1.7, 1.6, 1.8],
    }
)
# answer here
Reshaping data to long format#
Sometimes, we need to transform our dataset to a different shape to make it easier to work with.
Say that we have data in a study where trials are split into blocks. Each block is represented by one row in the DataFrame. The 1 column indicates if trial 1 was correct, the 2 column indicates if trial 2 was correct, and the 3 column indicates if trial 3 was correct.
block | 1 | 2 | 3 |
---|---|---|---|
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
This is called a wide format table. If we want to analyze the accuracy of trials within each block, we will have to access the right column for each trial.
Often it is easier to work with long format data instead. In a long-format DataFrame, there is one row for each set of observations. Here, that means there is one row for each trial.
block | trial | correct |
---|---|---|
1 | 1 | 1 |
1 | 2 | 0 |
1 | 3 | 0 |
2 | 1 | 0 |
2 | 2 | 1 |
2 | 3 | 0 |
First, we’ll make a DataFrame to match our example wide data.
df = pl.DataFrame(
    {
        "block": [1, 2],
        "1": [1, 0],
        "2": [0, 1],
        "3": [0, 0],
    }
)
df
block | 1 | 2 | 3 |
---|---|---|---|
i64 | i64 | i64 | i64 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
Now, let’s unpivot the data to transform it to long format.
The first input indicates the columns that should be unpivoted by placing them in separate rows.
The index indicates one or more columns to use as identifier variables. These variables provide information about each of the observations contained in the rows.
The variable_name gives the name of the new column that will hold the labels of the unpivoted columns.
The value_name gives the name of the new column that will hold the unpivoted values.
unpivoted = df.unpivot(
["1", "2", "3"],
index="block",
variable_name="trial",
value_name="correct",
)
unpivoted
block | trial | correct |
---|---|---|
i64 | str | i64 |
1 | "1" | 1 |
2 | "1" | 0 |
1 | "2" | 0 |
2 | "2" | 1 |
1 | "3" | 0 |
2 | "3" | 0 |
Trial is better represented as an integer than a string, so let's cast it to an integer. We'll also sort by block using the sort function, to make the ordering more intuitive.
cleaned = unpivoted.with_columns(pl.col("trial").cast(pl.Int64)).sort("block")
cleaned
block | trial | correct |
---|---|---|
i64 | i64 | i64 |
1 | 1 | 1 |
1 | 2 | 0 |
1 | 3 | 0 |
2 | 1 | 0 |
2 | 2 | 1 |
2 | 3 | 0 |
Now we have a format that is very flexible when using group_by or making plots using Seaborn (more on that later). For example, we can easily calculate mean accuracy using group_by.
cleaned.group_by("block").agg(accuracy=pl.col("correct").mean())
block | accuracy |
---|---|
i64 | f64 |
1 | 0.333333 |
2 | 0.333333 |
Exercise: reshaping data to long format#
Given the DataFrame defined below, reshape it to long format, so that there is one observation per row. There should be 9 rows in the long table. The first column should be subject. Make an assessment column that indicates one of the tests, and an accuracy column that indicates performance on that test.
df = pl.DataFrame(
    {
        "subject": ["001", "002", "003"],
        "test1": [0.5, 0.7, 0.4],
        "test2": [0.6, 0.8, 0.5],
        "test3": [0.9, 1.0, 0.7],
    }
)
# answer here
Reshaping data to wide format#
Sometimes it is useful to pivot some data to represent different variables in different columns.
For example, say that we have a measure of some response in two conditions for each of a number of subjects. For subject 001, in the target condition, they said “old” on 0.8 of trials (80%). In the lure condition, they said “old” on 0.2 of trials (20%).
subject | condition | response |
---|---|---|
001 | target | 0.8 |
001 | lure | 0.2 |
002 | target | 0.75 |
002 | lure | 0.15 |
003 | target | 0.85 |
003 | lure | 0.25 |
Say that we want to compare average responding in the target and lure conditions, separately for each subject. This is difficult to do when the data are in long format. We can pivot the data into a wide format that organizes the data in a convenient way for comparing conditions.
subject | target | lure |
---|---|---|
001 | 0.8 | 0.2 |
002 | 0.75 | 0.15 |
003 | 0.85 | 0.25 |
Note that this has the same information as in the long table, but now there is a column for each of the two conditions. Now it will be easy to compare target and lure responses for each subject; for example, we can calculate corrected hit rate by subtracting the lure column from the target column.
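As a quick sketch of that calculation (using a hypothetical wide variable holding the values from the table above, not part of the original walkthrough):
wide = pl.DataFrame(
    {
        "subject": ["001", "002", "003"],
        "target": [0.8, 0.75, 0.85],
        "lure": [0.2, 0.15, 0.25],
    }
)
# corrected hit rate: hit rate (target) minus false-alarm rate (lure)
wide.with_columns(corrected=pl.col("target") - pl.col("lure"))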
Let’s see how we can use pivoting with a real dataset. We’ll focus on the test phase of the Osth & Fox (2019) study. For each trial during the test phase, we have the response (0 for “new”, 1 for “old”) and the probe type (target or lure).
data.filter(pl.col("phase") == "test").head(6)
subj | phase | type | word1 | word2 | response | RT | correct | lag | probe_type |
---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 | str |
101 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 | "lure" |
101 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 | "lure" |
101 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 | "lure" |
101 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 | "lure" |
101 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 | "lure" |
101 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | null | "target" |
We want to summarize performance for each subject. We'll use group_by to calculate the mean response for each subject and probe type. By passing multiple columns, we can create a group for each combination of the unique values in each column. Here, for each subject, we will get separate "target" and "lure" groups.
p_old = (
data.filter(pl.col("phase") == "test")
.group_by("subj", "probe_type")
.agg(pl.col("response").mean())
)
p_old.head()
subj | probe_type | response |
---|---|---|
i64 | str | f64 |
169 | "target" | 0.525 |
186 | "target" | 0.404167 |
117 | "target" | 0.679167 |
101 | "lure" | 0.108333 |
146 | "target" | 0.575 |
We now have the probability of responding “old” for each participant, separately for target and lure trials. Note that there is no particular sorting of the data; to speed up running code, Polars doesn’t sort the groups unless we tell it to.
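As a side note (not part of the original walkthrough), group_by also accepts a maintain_order argument that keeps groups in the order they first appear, at some cost in speed; a sketch of the same aggregation with ordered groups:
(
    data.filter(pl.col("phase") == "test")
    .group_by("subj", "probe_type", maintain_order=True)  # deterministic group order
    .agg(pl.col("response").mean())
    .head()
)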
To sort the DataFrame, we can use the sort function. This doesn't usually matter for analysis, but it's easier to look at and understand. We'll sort by subject, then probe type, and show the first three subjects using head.
p_old.sort("subj", "probe_type").head(6)
subj | probe_type | response |
---|---|---|
i64 | str | f64 |
101 | "lure" | 0.108333 |
101 | "target" | 0.76569 |
102 | "lure" | 0.241667 |
102 | "target" | 0.2875 |
103 | "lure" | 0.091667 |
103 | "target" | 0.866667 |
Next, we can pivot the data so that there is a separate column for targets and lures, and just one row per participant.
The first input indicates a column to use for defining the new columns in the pivoted data.
The index is a column that will be used to organize the rows of the pivoted data.
The values indicate the column used to populate the table. Here, we use the response column. For each subject, in each condition, we will get the corresponding response value.
pivoted = p_old.pivot("probe_type", index="subj", values="response")
pivoted.head()
subj | target | lure |
---|---|---|
i64 | f64 | f64 |
169 | 0.525 | 0.104167 |
186 | 0.404167 | 0.054167 |
117 | 0.679167 | 0.05 |
101 | 0.76569 | 0.108333 |
146 | 0.575 | 0.3875 |
Again, sorting can make the data easier to understand; here, we’ll sort the rows by subject.
pivoted = pivoted.sort("subj")
pivoted.head()
subj | target | lure |
---|---|---|
i64 | f64 | f64 |
101 | 0.76569 | 0.108333 |
102 | 0.2875 | 0.241667 |
103 | 0.866667 | 0.091667 |
104 | 0.508333 | 0.3875 |
105 | 0.575 | 0.372385 |
Finally, we’re in a good position to calculate the corrected hit rate by subtracting P(old) for lures (false-alarm rate) from P(old) for targets (hit rate).
perf = pivoted.with_columns(
perf=pl.col("target") - pl.col("lure")
)
perf.head()
subj | target | lure | perf |
---|---|---|---|
i64 | f64 | f64 | f64 |
101 | 0.76569 | 0.108333 | 0.657357 |
102 | 0.2875 | 0.241667 | 0.045833 |
103 | 0.866667 | 0.091667 | 0.775 |
104 | 0.508333 | 0.3875 | 0.120833 |
105 | 0.575 | 0.372385 | 0.202615 |
With the perf column, we now have a measure of overall test performance for each subject.
Let’s summarize our results by calculating the mean for each column.
perf.select(pl.col("lure", "target", "perf").mean())
lure | target | perf |
---|---|---|
f64 | f64 | f64 |
0.21641 | 0.531649 | 0.315238 |
Exercise: reshaping data to wide format#
Given the DataFrame defined below, create a pivot table with target and lure columns. There should be three rows, one for each subject. The first column should be subject. The target and lure columns should have values taken from the response column in the original data.
df = pl.DataFrame(
    {
        "subject": ["001", "001", "002", "002", "003", "003"],
        "condition": ["target", "lure", "target", "lure", "target", "lure"],
        "response": [0.8, 0.2, 0.75, 0.15, 0.85, 0.25],
    }
)
# answer here
Summary#
When working with data, we can recode variables to handle special values or change how information is represented, using the cast and replace methods.
There are special functions for working with null data: we can count missing samples (null_count), replace them (fill_null), or filter trials with null values in a column using is_not_null.
We can quickly calculate summary statistics for different conditions using grouping with group_by and aggregation with agg.
There are two main data formats that are useful for different applications. Long data are useful for working with grouping and aggregation. Wide data (also known as pivot tables) are useful for matching up related observations. We can use unpivot to convert to long format and pivot to convert to wide format.
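As a closing sketch (toy data, not part of the original), the two operations undo each other:
# wide -> long -> wide again
wide = pl.DataFrame({"block": [1, 2], "1": [1, 0], "2": [0, 1]})
long = wide.unpivot(["1", "2"], index="block", variable_name="trial", value_name="correct")
wide_again = long.pivot("trial", index="block", values="correct")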