Data Cleaning#
Often, we will have a DataFrame with the information that we need, but the data are not organized in the right way. We may need to substitute variables, handle missing data, calculate summary statistics, and rearrange data to get it in the right shape. Polars has many tools for doing these things.
Reading data from the datascipsych package#
The datascipsych package includes a datasets module that can be used to access datasets that are included with the package.
To use the datascipsych package, it must first be installed. To install the package: (1) make sure you have a virtual environment and that it is activated; (2) open a terminal (with the virtual environment activated) and run pip install -e . (the final period is part of the command and refers to the current directory).
This will make the functions and datasets in the datascipsych package available in your environment.
First, we will use the datascipsych.datasets module to access the data for the Osth & Fox (2019) study. Note that the dataset is not stored in the current directory. Instead, we access the dataset from the installed datascipsych package. This allows us to access the same dataset, no matter what our current directory is.
from datascipsych import datasets
dataset_file = datasets.get_dataset_file("Osth2019")
The dataset_file should give the path to the dataset on your system.
If you get a ModuleNotFoundError, you do not have the datascipsych package installed. Follow the instructions given above to install the package.
Now, we can read the dataset file from the installed package. We’ll make things simpler by selecting a subset of columns.
import polars as pl
raw = (
    pl.read_csv(dataset_file)
    .select("subj", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag")
)
raw.head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
Recoding variables#
One relatively simple form of data cleaning is to take the data in a column and recode it. For example, a spreadsheet may label conditions with numeric codes that are hard to read and remember. Replacing these codes with strings makes the dataset easier to understand.
The Osth & Fox (2019) dataset has a number of -1 values in it. These -1 values indicate missing or undefined data. For example, during study trials, participants do not make any response. On these trials, the response column is set to -1 to indicate that no response was recorded.
raw.head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
101 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
We can replace the -1 values using the replace method. Instead of -1, we want missing values to be set to the special value null. Polars will automatically exclude those values from calculations. To set a value to null in the DataFrame, we can set it to None when calling replace.
raw.with_columns(
pl.col("response").replace(-1, None)
).head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "skin" | "careful" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "upon" | "miss" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "single" | "tradition" | null | -1.0 | -1 | -1 |
101 | "study" | "intact" | "prove" | "airport" | null | -1.0 | -1 | -1 |
We used with_columns to keep all the existing columns. We used pl.col("response") to represent the response column, and then called the replace method. The first input indicates the value to replace, and the second input indicates the value to replace it with.
Note that there are multiple columns that follow the same convention, with -1 indicating missing values. We’ll want to replace all of them.
In Polars, we can carry out the same expression on multiple columns using expression expansion. Instead of writing multiple expressions with replace (one for each column), we can write just one expression that is applied to multiple columns.
raw.with_columns(
pl.col("response", "RT", "correct", "lag").replace(-1, None)
).head()
subj | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | "study" | "intact" | "formal" | "positive" | null | null | null | null |
101 | "study" | "intact" | "skin" | "careful" | null | null | null | null |
101 | "study" | "intact" | "upon" | "miss" | null | null | null | null |
101 | "study" | "intact" | "single" | "tradition" | null | null | null | null |
101 | "study" | "intact" | "prove" | "airport" | null | null | null | null |
Sometimes, it's helpful to recode a column to get different labels. The type column has two values: "intact" and "rearranged". Say we want to translate this into more standard labels typically used in recognition memory experiments such as this, namely "target" and "lure". We can do this by passing a dictionary to the replace method. This time, we'll create a new column called probe_type that has this alternate coding.
raw.with_columns(
probe_type=pl.col("type").replace({"intact": "target", "rearranged": "lure"})
).head()
subj | phase | type | word1 | word2 | response | RT | correct | lag | probe_type |
---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 | str |
101 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 | "target" |
101 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 | "target" |
Now that we have tested out the commands we want to run, let's put them all together to get a set of commands that will run all our recoding and assign the output to a new variable called data.
data = raw.with_columns(
pl.col("response", "RT", "correct", "lag").replace(-1, None),
probe_type=pl.col("type").replace({"intact": "target", "rearranged": "lure"})
)
data.head()
subj | phase | type | word1 | word2 | response | RT | correct | lag | probe_type |
---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 | str |
101 | "study" | "intact" | "formal" | "positive" | null | null | null | null | "target" |
101 | "study" | "intact" | "skin" | "careful" | null | null | null | null | "target" |
101 | "study" | "intact" | "upon" | "miss" | null | null | null | null | "target" |
101 | "study" | "intact" | "single" | "tradition" | null | null | null | null | "target" |
101 | "study" | "intact" | "prove" | "airport" | null | null | null | null | "target" |
Sometimes, a raw data file may code conditions as integers, but we want to convert the numeric condition codes into string labels. For example, say that a dataset has target trials coded as 1, and lure trials coded as 2.
simple = pl.DataFrame(
    {
        "trial": [1, 2, 3, 4],
        "condition": [1, 1, 2, 2],
    }
)
simple
trial | condition |
---|---|
i64 | i64 |
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
We cannot use replace to change the type of a column. To recode integers as strings, we can first cast the integers to strings, using .cast(pl.String). The pl.String attribute represents the string data type that Polars uses.
simple.with_columns(
condition=pl.col("condition").cast(pl.String)
)
trial | condition |
---|---|
i64 | str |
1 | "1" |
2 | "1" |
3 | "2" |
4 | "2" |
Finally, we can chain a call to replace after casting the column to string, to substitute "target" when the condition is "1" and "lure" when the condition is "2".
simple.with_columns(
condition=pl.col("condition").cast(pl.String).replace({"1": "target", "2": "lure"})
)
trial | condition |
---|---|
i64 | str |
1 | "target" |
2 | "target" |
3 | "lure" |
4 | "lure" |
Now, instead of numeric codes that are hard to read, we have strings that clearly show the condition corresponding to each trial.
Exercise: recoding variables#
Given the DataFrame defined below, recode the correct column to set the -999 values to null. Recode the condition column so that 1 is now "target" and 2 is now "lure".
df = pl.DataFrame({"trial": [1, 2, 3, 4], "condition": [1, 2, 1, 2], "correct": [0, 1, -999, 1]})
# your code here
Handling missing data#
Sometimes, observations will be missing; for example, the response on a trial may be undefined if the participant did not make a response.
Let’s set up an example DataFrame with some missing responses.
responses = pl.DataFrame(
    {
        "correct": [1, 0, 1, None, 0, 0, 1, None, 1],
        "response_time": [1.2, 2.1, 1.4, None, 1.8, 1.6, 1.4, None, 1.6],
    }
)
responses
correct | response_time |
---|---|
i64 | f64 |
1 | 1.2 |
0 | 2.1 |
1 | 1.4 |
null | null |
0 | 1.8 |
0 | 1.6 |
1 | 1.4 |
null | null |
1 | 1.6 |
We can tally the number of missing observations in each column using null_count.
responses.null_count()
correct | response_time |
---|---|
u32 | u32 |
2 | 2 |
We can calculate the mean accuracy and response time using the mean function.
responses.mean()
correct | response_time |
---|---|
f64 | f64 |
0.571429 | 1.585714 |
Polars functions generally ignore missing responses when performing calculations. Note that this works differently than in NumPy; even if there are missing observations, we don't get NaN when calculating statistics like the mean. Make sure to check for missing observations in your data using null_count or describe, which includes a count of null values in each column. Missing responses can sometimes indicate that there is something wrong with your data.
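To illustrate the difference (a minimal sketch with made-up values, not part of the original analysis), compare how NumPy and Polars summarize data with one missing value:
import numpy as np
# NumPy: a NaN propagates, so the mean is NaN unless we use np.nanmean
np.mean(np.array([1.0, 2.0, np.nan]))  # nan
np.nanmean(np.array([1.0, 2.0, np.nan]))  # 1.5
# Polars: null values are ignored, so the mean uses only the non-null values
pl.Series([1.0, 2.0, None]).mean()  # 1.5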
While the mean function will just ignore trials with missing responses, we might instead want to count missing responses as incorrect by coding them as 0. We can recode null values as something else using fill_null.
missing = responses.with_columns(
    correct_missing=pl.col("correct").fill_null(0)
)
missing.mean()
correct | response_time | correct_missing |
---|---|---|
f64 | f64 | f64 |
0.571429 | 1.585714 | 0.444444 |
Note that the accuracy is lower when we count missing responses as incorrect: 4 correct responses out of 7 non-missing trials (about 0.57) versus 4 out of all 9 trials (about 0.44).
We can also use filter together with is_not_null to remove rows where a column is null.
included = responses.filter(pl.col("correct").is_not_null())
included
correct | response_time |
---|---|
i64 | f64 |
1 | 1.2 |
0 | 2.1 |
1 | 1.4 |
0 | 1.8 |
0 | 1.6 |
1 | 1.4 |
1 | 1.6 |
Exercise: Handling missing data#
Given the DataFrame defined below, get the number of missing observations in the correct column. Calculate the mean accuracy across all trials without missing responses. Then calculate mean accuracy if missing responses are considered incorrect.
df = pl.DataFrame({"trial": [1, 2, 3, 4, 5, 6], "correct": [0, 1, 1, 0, None, 1]})
# answer here
Grouping and aggregation#
When we are analyzing a dataset, we usually don't just want to calculate summary statistics for a whole column of data. Often we will instead want to break the dataset into groups and calculate the summary statistics for each group. We can do this using the group_by and agg methods.
Grouping and aggregation allow us to calculate summary statistics for groups within our data using very little code. Here, we'll calculate the mean response time for correct trials (where correct == 1) and incorrect trials (where correct == 0).
included.group_by("correct").agg(pl.col("response_time").mean())
correct | response_time |
---|---|
i64 | f64 |
1 | 1.4 |
0 | 1.833333 |
We first grouped the data to make a group for each value of the correct column. We then aggregated the data in the response_time column by calculating the mean within each group.
Let’s break down this code to see how it works.
To perform aggregation, we start by calling group_by. Here, we give the name of the correct column to indicate that we want to make one group for each value of that column. There will be one group for correct == 0 and one group for correct == 1.
g = included.group_by("correct")
Normally, we would next call the agg method to calculate some aggregation within each group. Instead, we'll convert the GroupBy object into a list so we can look at it.
g = list(g)
Each group includes a name (a tuple containing the value for that group, which here is 0 or 1) and a DataFrame that just has the rows that match that value.
Let’s take a look at the first group.
group_name, group_df = g[0]
print(group_name)
group_df
(0,)
correct | response_time |
---|---|
i64 | f64 |
0 | 2.1 |
0 | 1.8 |
0 | 1.6 |
Next, let’s look at the second group.
group_name, group_df = g[1]
print(group_name)
group_df
(1,)
correct | response_time |
---|---|
i64 | f64 |
1 | 1.2 |
1 | 1.4 |
1 | 1.4 |
1 | 1.6 |
We can aggregate the data within each group in several different ways. One simple method is to just pass the name of a column to the agg function. This will give us a column that has a list of all the observations that were in each group.
included.group_by("correct").agg(pl.col("response_time"))
correct | response_time |
---|---|
i64 | list[f64] |
1 | [1.2, 1.4, … 1.6] |
0 | [2.1, 1.8, 1.6] |
Often, we'll want to calculate a summary statistic, like the mean, for each group. We can do that by specifying the column to aggregate and calling the mean function.
included.group_by("correct").agg(pl.col("response_time").mean())
correct | response_time |
---|---|
i64 | f64 |
1 | 1.4 |
0 | 1.833333 |
We can also calculate other summary statistics, like the standard deviation.
included.group_by("correct").agg(pl.col("response_time").std())
correct | response_time |
---|---|
i64 | f64 |
1 | 0.163299 |
0 | 0.251661 |
We can even calculate multiple summary statistics at the same time. In this case, we must give the columns different names to avoid getting an error. Like in select and with_columns, we can assign names using keyword arguments to agg.
(
included.group_by("correct")
.agg(
mean=pl.col("response_time").mean(),
std=pl.col("response_time").std(),
)
)
correct | mean | std |
---|---|---|
i64 | f64 | f64 |
0 | 1.833333 | 0.251661 |
1 | 1.4 | 0.163299 |
See the Polars documentation for lists of methods that can be used to aggregate whole groups or individual columns.
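For illustration (a sketch that is not part of the original walkthrough), here are a few other aggregations applied to the same included DataFrame: pl.len() counts the rows in each group, while median and max summarize the response_time column within each group.
(
    included.group_by("correct")
    .agg(
        n=pl.len(),  # number of trials in each group
        median_rt=pl.col("response_time").median(),
        max_rt=pl.col("response_time").max(),
    )
)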
Exercise: grouping and aggregation#
Given the DataFrame defined below, calculate the mean response time for each condition.
df = pl.DataFrame(
    {
        "trial": [1, 2, 3, 4, 5, 6],
        "condition": [1, 1, 1, 2, 2, 2],
        "response_time": [1.2, 1.4, 1.3, 1.7, 1.6, 1.8],
    }
)
# answer here
Reshaping data to long format#
Sometimes, we need to transform our dataset to a different shape to make it easier to work with.
Say that we have data in a study where trials are split into blocks. Each block is represented by one row in the DataFrame. The 1 column indicates if trial 1 was correct, the 2 column indicates if trial 2 was correct, and the 3 column indicates if trial 3 was correct.
block | 1 | 2 | 3 |
---|---|---|---|
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
This is called a wide format table. If we want to analyze the accuracy of trials within each block, we will have to access the right column for each trial.
Often it is easier to work with long format data instead. In a long-format DataFrame, there is one row for each set of observations. Here, that means there is one row for each trial.
block | trial | correct |
---|---|---|
1 | 1 | 1 |
1 | 2 | 0 |
1 | 3 | 0 |
2 | 1 | 0 |
2 | 2 | 1 |
2 | 3 | 0 |
First, we’ll make a DataFrame to match our example wide data.
df = pl.DataFrame(
    {
        "block": [1, 2],
        "1": [1, 0],
        "2": [0, 1],
        "3": [0, 0],
    }
)
df
block | 1 | 2 | 3 |
---|---|---|---|
i64 | i64 | i64 | i64 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
Now, let’s unpivot the data to transform it to long format.
The first input indicates the columns that should be unpivoted by placing them in separate rows.
The index indicates one or more columns to use as identifier variables. These variables provide information about each of the observations contained in the rows.
The variable_name gives the name of the new column that will hold the labels of the unpivoted columns.
The value_name gives the name of the new column that will hold the unpivoted values.
unpivoted = df.unpivot(
["1", "2", "3"],
index="block",
variable_name="trial",
value_name="correct",
)
unpivoted
block | trial | correct |
---|---|---|
i64 | str | i64 |
1 | "1" | 1 |
2 | "1" | 0 |
1 | "2" | 0 |
2 | "2" | 1 |
1 | "3" | 0 |
2 | "3" | 0 |
Trial is better represented as an integer than a string, so let's cast it to an integer. We'll also sort by block using the sort function, to make the ordering more intuitive.
cleaned = unpivoted.with_columns(pl.col("trial").cast(pl.Int64)).sort("block")
cleaned
block | trial | correct |
---|---|---|
i64 | i64 | i64 |
1 | 1 | 1 |
1 | 2 | 0 |
1 | 3 | 0 |
2 | 1 | 0 |
2 | 2 | 1 |
2 | 3 | 0 |
Now we have a format that is very flexible when using group_by or making plots using Seaborn (more on that later). For example, we can easily calculate mean accuracy using group_by.
cleaned.group_by("block").agg(accuracy=pl.col("correct").mean())
block | accuracy |
---|---|
i64 | f64 |
1 | 0.333333 |
2 | 0.333333 |
Exercise: reshaping data to long format#
Given the DataFrame defined below, reshape it to long format, so that there is one observation per row. There should be 9 rows in the long table. The first column should be subject. Make an assessment column that indicates one of the tests, and an accuracy column that indicates performance on that test.
df = pl.DataFrame(
    {
        "subject": ["001", "002", "003"],
        "test1": [0.5, 0.7, 0.4],
        "test2": [0.6, 0.8, 0.5],
        "test3": [0.9, 1.0, 0.7],
    }
)
# answer here
Reshaping data to wide format#
Sometimes it is useful to pivot some data to represent different variables in different columns.
For example, say that we have a measure of some response in two conditions for each of a number of subjects. For subject 001, in the target condition, they said “old” on 0.8 of trials (80%). In the lure condition, they said “old” on 0.2 of trials (20%).
subject | condition | response |
---|---|---|
001 | target | 0.8 |
001 | lure | 0.2 |
002 | target | 0.75 |
002 | lure | 0.15 |
003 | target | 0.85 |
003 | lure | 0.25 |
Say that we want to compare average responding in the target and lure conditions, separately for each subject. This is difficult to do when the data are in long format. We can pivot the data into a wide format that organizes the data in a convenient way for comparing conditions.
subject | target | lure |
---|---|---|
001 | 0.8 | 0.2 |
002 | 0.75 | 0.15 |
003 | 0.85 | 0.25 |
Note that this has the same information as in the long table, but now there is a column for each of the two conditions. Now it will be easy to compare target and lure responses for each subject; for example, we can calculate corrected hit rate by subtracting the lure column from the target column.
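As a quick sketch of that calculation (using a hypothetical wide variable holding the values from the table above, not part of the original walkthrough):
wide = pl.DataFrame(
    {
        "subject": ["001", "002", "003"],
        "target": [0.8, 0.75, 0.85],
        "lure": [0.2, 0.15, 0.25],
    }
)
# corrected hit rate: hit rate (target) minus false-alarm rate (lure)
wide.with_columns(corrected=pl.col("target") - pl.col("lure"))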
Let’s see how we can use pivoting with a real dataset. We’ll focus on the test phase of the Osth & Fox (2019) study. For each trial during the test phase, we have the response (0 for “new”, 1 for “old”) and the probe type (target or lure).
data.filter(pl.col("phase") == "test").head(6)
subj | phase | type | word1 | word2 | response | RT | correct | lag | probe_type |
---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i64 | f64 | i64 | i64 | str |
101 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 | "lure" |
101 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 | "lure" |
101 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 | "lure" |
101 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 | "lure" |
101 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 | "lure" |
101 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | null | "target" |
We want to summarize performance for each subject. We'll use group_by to calculate the mean response for each subject and probe type. By passing multiple columns, we can create a group for each combination of the unique values in each column. Here, for each subject, we will get separate "target" and "lure" groups.
p_old = (
data.filter(pl.col("phase") == "test")
.group_by("subj", "probe_type")
.agg(pl.col("response").mean())
)
p_old.head()
subj | probe_type | response |
---|---|---|
i64 | str | f64 |
169 | "target" | 0.525 |
186 | "target" | 0.404167 |
117 | "target" | 0.679167 |
101 | "lure" | 0.108333 |
146 | "target" | 0.575 |
We now have the probability of responding “old” for each participant, separately for target and lure trials. Note that there is no particular sorting of the data; to speed up running code, Polars doesn’t sort the groups unless we tell it to.
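As a side note (not part of the original walkthrough), group_by also accepts a maintain_order argument that keeps groups in the order they first appear, at some cost in speed; a sketch of the same aggregation with ordered groups:
(
    data.filter(pl.col("phase") == "test")
    .group_by("subj", "probe_type", maintain_order=True)  # deterministic group order
    .agg(pl.col("response").mean())
    .head()
)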
To sort the DataFrame, we can use the sort function. This doesn't usually matter for analysis, but it's easier to look at and understand. We'll sort by subject, then probe type, and show the first three subjects using head.
p_old.sort("subj", "probe_type").head(6)
subj | probe_type | response |
---|---|---|
i64 | str | f64 |
101 | "lure" | 0.108333 |
101 | "target" | 0.76569 |
102 | "lure" | 0.241667 |
102 | "target" | 0.2875 |
103 | "lure" | 0.091667 |
103 | "target" | 0.866667 |
Next, we can pivot the data so that there is a separate column for targets and lures, and just one row per participant.
The first input indicates a column to use for defining the new columns in the pivoted data.
The index is a column that will be used to organize the rows of the pivoted data.
The values indicate the column used to populate the table. Here, we use the response column. For each subject, in each condition, we will get the corresponding response value.
pivoted = p_old.pivot("probe_type", index="subj", values="response")
pivoted.head()
subj | target | lure |
---|---|---|
i64 | f64 | f64 |
169 | 0.525 | 0.104167 |
186 | 0.404167 | 0.054167 |
117 | 0.679167 | 0.05 |
101 | 0.76569 | 0.108333 |
146 | 0.575 | 0.3875 |
Again, sorting can make the data easier to understand; here, we’ll sort the rows by subject.
pivoted = pivoted.sort("subj")
pivoted.head()
subj | target | lure |
---|---|---|
i64 | f64 | f64 |
101 | 0.76569 | 0.108333 |
102 | 0.2875 | 0.241667 |
103 | 0.866667 | 0.091667 |
104 | 0.508333 | 0.3875 |
105 | 0.575 | 0.372385 |
Finally, we’re in a good position to calculate the corrected hit rate by subtracting P(old) for lures (false-alarm rate) from P(old) for targets (hit rate).
perf = pivoted.with_columns(
perf=pl.col("target") - pl.col("lure")
)
perf.head()
subj | target | lure | perf |
---|---|---|---|
i64 | f64 | f64 | f64 |
101 | 0.76569 | 0.108333 | 0.657357 |
102 | 0.2875 | 0.241667 | 0.045833 |
103 | 0.866667 | 0.091667 | 0.775 |
104 | 0.508333 | 0.3875 | 0.120833 |
105 | 0.575 | 0.372385 | 0.202615 |
With the perf column, we now have a measure of overall test performance for each subject.
Let’s summarize our results by calculating the mean for each column.
perf.select(pl.col("lure", "target", "perf").mean())
lure | target | perf |
---|---|---|
f64 | f64 | f64 |
0.21641 | 0.531649 | 0.315238 |
Exercise: reshaping data to wide format#
Given the DataFrame defined below, create a pivot table with target and lure columns. There should be three rows, one for each subject. The first column should be subject. The target and lure columns should have values taken from the response column in the original data.
df = pl.DataFrame(
    {
        "subject": ["001", "001", "002", "002", "003", "003"],
        "condition": ["target", "lure", "target", "lure", "target", "lure"],
        "response": [0.8, 0.2, 0.75, 0.15, 0.85, 0.25],
    }
)
# answer here
Summary#
When working with data, we can recode variables to handle special values or change how information is represented, using the cast and replace methods.
There are special functions for working with null data: we can count missing samples (null_count), replace them (fill_null), or filter trials with null values in a column using is_not_null.
We can quickly calculate summary statistics for different conditions using grouping with group_by and aggregation with agg.
There are two main data formats that are useful for different applications. Long data are useful for working with grouping and aggregation. Wide data (also known as pivot tables) are useful for matching up related observations. We can use unpivot to convert to long format and pivot to convert to wide format.
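As a closing sketch (toy data, not part of the original), the two operations undo each other:
# wide -> long -> wide again
wide = pl.DataFrame({"block": [1, 2], "1": [1, 0], "2": [0, 1]})
long = wide.unpivot(["1", "2"], index="block", variable_name="trial", value_name="correct")
wide_again = long.pivot("trial", index="block", values="correct")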