7. Data Frames#
DataFrames are used to organize data from spreadsheets. There are powerful methods in Python for organizing, filtering, and summarizing data in spreadsheets.
7.1. Importing data#
We can easily use the Polars package to import data from spreadsheets into a variable called a DataFrame. First, we will import the Polars module.
import polars as pl
We can use pl.read_csv to import a CSV file into a DataFrame (in this case, data from Osth & Fox 2019). The head method shows the first several lines of a DataFrame (you can choose the number of lines; the default is 5). In Jupyter, the last command in a cell will be displayed below that cell. In this case, we have raw.head() at the end. This produces a DataFrame, which Jupyter then displays in a nice format.
raw = pl.read_csv("exp1.csv")
raw.head()
| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
| 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
| 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
| 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
We can also use print, but the formatting is less nice than what Jupyter uses.
print(raw.head())
shape: (5, 16)
┌───────┬───────┬───────┬────────┬───┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase ┆ type ┆ … ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪════════╪═══╪══════╪═══════════╪══════════════╪════════╡
│ 0 ┆ -1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 0 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 2 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 3 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
└───────┴───────┴───────┴────────┴───┴──────┴───────────┴──────────────┴────────┘
There are a lot of columns, including some that we won’t use. Let’s use the select method to make the data a little simpler to look at. The select method lets us create a new DataFrame with a specified set of columns. The simplest way to use it is to get a subset of the columns in the DataFrame by passing a list of column names.
columns = ["subj", "cycle", "trial", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag"]
df = raw.select(columns)
df.head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
A DataFrame has multiple columns, each of which can have a different type of data. For example, i64 means a 64-bit integer, and str indicates a string. Polars tries to guess the right data type for each column based on the data in that column during import.
The list of columns and their data types is known as a schema. We can access it for an existing DataFrame using the schema attribute.
df.schema
Schema([('subj', Int64),
('cycle', Int64),
('trial', Int64),
('phase', String),
('type', String),
('word1', String),
('word2', String),
('response', Int64),
('RT', Float64),
('correct', Int64),
('lag', Int64)])
When importing data, we can make changes to the schema to read things in differently. We can either pass in a full schema that we want to use, as a dictionary, or use the schema_overrides input to indicate the formats of just some specific columns. If we only specify some columns, the formats of the others will be inferred automatically.
raw_override = pl.read_csv("exp1.csv", schema_overrides={"response": pl.Float64})
raw_override.head()
| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "study" | "intact" | "formal" | "positive" | -1.0 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
| 0 | 0 | "study" | "intact" | "skin" | "careful" | -1.0 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 1 | "study" | "intact" | "upon" | "miss" | -1.0 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
| 0 | 2 | "study" | "intact" | "single" | "tradition" | -1.0 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
| 0 | 3 | "study" | "intact" | "prove" | "airport" | -1.0 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
Note that the response column is now a 64-bit float (f64) instead of an integer.
We can also easily write out a DataFrame to a spreadsheet file, using the write_csv method. We must specify the path to where we want the file to be saved.
We’ll first use the head method to get the first five lines. Note that we can chain together calls to DataFrame methods. When we call head, it returns a new DataFrame. Then we immediately call write_csv, which works on the trimmed DataFrame.
df.head(5).write_csv("exp1_trimmed.csv")
Take a look at the exp1_trimmed.csv spreadsheet in VSCode.
Exercise: importing data using Polars#
Read the file in this directory called participants.csv into a DataFrame called participants. Use the columns input to select just these columns: participant_id, session_date, score_task1. Use Jupyter to display the DataFrame.
Advanced: handle dates#
Set try_parse_dates to True to automatically identify dates and read them in as dates instead of strings. Dates can then be used in Python to calculate things like the difference between the dates on which different sessions were run.
# answer here
7.2. Creating a DataFrame directly in Python#
It’s also possible to create a DataFrame directly using a dictionary of lists. Each key in the dictionary becomes a column in the DataFrame. This method is mostly used for demonstrations; real data are usually in a file that has to be imported.
We can create a DataFrame directly by calling pl.DataFrame. The input gives all the data and the column names, as a Python dictionary. We use the special date datatype to represent dates. The dates are given in the international standard format, which follows this ordering: year, month, day.
from datetime import date
df_manual = pl.DataFrame(
{
"participant_id": ["sub-001", "sub-002", "sub-003", "sub-004"],
"session_date": [
date(2025, 1, 8),
date(2025, 1, 18),
date(2025, 1, 25),
date(2025, 2, 2),
],
"score_task1": [8, 4, 6, 5],
"score_task2": [6, 5, 4, 5],
}
)
Let’s take a look at the DataFrame we created. Note that each key in the dictionary is now a column. The lists we passed in become the data for each column.
df_manual
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-001" | 2025-01-08 | 8 | 6 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-004" | 2025-02-02 | 5 | 5 |
Exercise: creating a DataFrame#
Create a DataFrame called data with the same data and column names as this table. The participant_id column should have strings; the age column should have integers; and the accuracy column should have floats.
participant_id |
age |
accuracy |
|---|---|---|
001 |
24 |
0.85 |
002 |
21 |
0.72 |
003 |
33 |
0.92 |
# answer here
7.3. Accessing data in a DataFrame#
DataFrames organize data within a more complex data structure than we’ve seen so far. We can access parts of a DataFrame to get some of the same structures that we have worked with before.
We can access columns of a DataFrame using df["column"], like we’ve seen before with dictionaries. This gives us a special data type called a Series. It’s used in Polars to represent single columns of data separately from a DataFrame. They aren’t used very often by themselves.
df["type"]
| type |
|---|
| str |
| "intact" |
| "intact" |
| "intact" |
| "intact" |
| "intact" |
| … |
| "intact" |
| "rearranged" |
| "rearranged" |
| "rearranged" |
| "intact" |
We can convert Series into NumPy arrays using the to_numpy method.
x = df["response"].to_numpy()
x
array([-1, -1, -1, ..., 1, 0, 1], shape=(107443,))
We can then run any NumPy methods we want, like calculating the mean. We’ll use nanmean to exclude missing samples.
import numpy as np
np.nanmean(x)
np.float64(-0.31397112887763745)
Usually converting data to NumPy arrays isn’t the best way to go, however. Polars has lots of tools for quickly doing things like calculating means and handling missing data by operating on a DataFrame.
df["response"].mean()
-0.31397112887763745
We can also use mean on the DataFrame to calculate it for all columns. Note that this doesn't really make sense for all of the columns, though. To get a meaningful result, we first need to do some data cleaning, such as removing the -1 values that are used to indicate missing data.
df.mean()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | str | str | str | str | f64 | f64 | f64 | f64 |
| 165.644798 | 3.497492 | 28.496598 | null | null | null | null | -0.313971 | 0.16512 | -0.172007 | -0.0004 |
Exercise: accessing data#
Read the participants.csv file into a DataFrame. Convert the score_task1 variable into a NumPy array and use NumPy to calculate the mean. Then use Polars to calculate the mean of all columns.
# answer here
7.4. Polars contexts#
Polars has different contexts that can be used to accomplish different kinds of operations.
select is used to get a subset of columns or to create new columns from data in the DataFrame.
with_columns works like select, but keeps all the existing columns and adds the new ones, instead of returning only the columns that were specified.
filter is used to select a subset of the rows in the DataFrame.
group_by is used to analyze groups of observations and calculate summary statistics.
7.5. Using select and with_columns#
We can use select to create a new DataFrame using expressions that access data from an existing DataFrame.
We can select multiple columns, and optionally change their order, using the select method.
df.select(["subj", "phase", "response", "RT"]).head()
| subj | phase | response | RT |
|---|---|---|---|
| i64 | str | i64 | f64 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
Select is very powerful, and can be used for much more than just getting a subset of columns. We can also use it to transform columns in many different ways. Take the example of the participants data, which includes scores on two tasks.
participants = pl.read_csv("participants.csv", try_parse_dates=True)
participants
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-001" | 2025-01-08 | 8 | 6 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-004" | 2025-02-02 | 5 | 5 |
Given the columns with scores on two tasks, we can use these data to calculate a new column with an average score across the two tasks. First, we’ll define an expression for the value we want to calculate. Expressions specify something we want to do without actually doing it yet. This allows Polars to optimize how it runs different operations. The pl.col function can be used to specify one or more columns in a DataFrame, as a placeholder. Notice that we define the expression without referring to any specific DataFrame.
avg_expr = (pl.col("score_task1") + pl.col("score_task2")) / 2
print(avg_expr)
[([(col("score_task1")) + (col("score_task2"))]) / (dyn int: 2)]
We can use this expression with our DataFrame to calculate the average score. We can specify what to name the new column using keyword arguments. Here, we set the keyword argument avg_score to the expression that we defined before.
participants.select(avg_score=avg_expr)
| avg_score |
|---|
| f64 |
| 7.0 |
| 4.5 |
| 5.0 |
| 5.0 |
Before, we used a variable to store the average score expression, but usually we’ll write everything all together like this.
participants.select(avg_score=(pl.col("score_task1") + pl.col("score_task2")) / 2)
| avg_score |
|---|
| f64 |
| 7.0 |
| 4.5 |
| 5.0 |
| 5.0 |
The select method only returns the columns that were defined in the call. If we want to keep the existing columns, we can instead use with_columns.
participants.with_columns(avg_score=(pl.col("score_task1") + pl.col("score_task2")) / 2)
| participant_id | session_date | score_task1 | score_task2 | avg_score |
|---|---|---|---|---|
| str | date | i64 | i64 | f64 |
| "sub-001" | 2025-01-08 | 8 | 6 | 7.0 |
| "sub-002" | 2025-01-18 | 4 | 5 | 4.5 |
| "sub-003" | 2025-01-25 | 6 | 4 | 5.0 |
| "sub-004" | 2025-02-02 | 5 | 5 | 5.0 |
We can pass in multiple inputs to specify multiple columns. In this example, we add additional columns with reference information, including the mean across subjects and the maximum possible score.
participants.with_columns(
avg_score=avg_expr, # expression to calculate average
mean_avg_score=avg_expr.mean(), # mean across participants
max_score=10, # max possible score
)
| participant_id | session_date | score_task1 | score_task2 | avg_score | mean_avg_score | max_score |
|---|---|---|---|---|---|---|
| str | date | i64 | i64 | f64 | f64 | i32 |
| "sub-001" | 2025-01-08 | 8 | 6 | 7.0 | 5.375 | 10 |
| "sub-002" | 2025-01-18 | 4 | 5 | 4.5 | 5.375 | 10 |
| "sub-003" | 2025-01-25 | 6 | 4 | 5.0 | 5.375 | 10 |
| "sub-004" | 2025-02-02 | 5 | 5 | 5.0 | 5.375 | 10 |
Here, we create two columns that are the same for all rows. The first one is the mean score across subjects. Note that we can chain together parts of an expression; this example calculates the average score for each participant and then calculates the mean across participants. The second is a reference value showing the maximum score, which is 10.
Exercise: using select and with_columns#
Create the DataFrame scores as defined below.
participant_id |
score1 |
score2 |
|---|---|---|
001 |
9 |
5 |
002 |
7 |
3 |
003 |
10 |
6 |
Create a new column called total that is the sum of the score1 and score2 columns, and a new column average that is the mean of the score1 and score2 columns. Compare what happens when you use select and with_columns.
# answer here
7.6. Using filter#
We can use filter to create a new DataFrame with a subset of rows.
We can filter the rows included in the table using the filter method. This allows for selecting data very flexibly, in a similar way to how we selected data using NumPy expressions when working with NumPy arrays. To create an expression that refers to the value of a column, we must use the pl.col function.
df.filter(pl.col("phase") == "test").head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | -1 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | 0 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 |
| 101 | 0 | 1 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 |
| 101 | 0 | 2 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 |
| 101 | 0 | 3 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 |
Here, we indicate that we only want rows where the phase is test, to get only the test trials and exclude all the study trials.
We can make more complicated expressions using operators for and (&), or (|), and not (~). If you are combining multiple comparisons, you will need to add parentheses around the individual comparisons.
targets = df.filter((pl.col("phase") == "test") & (pl.col("type") == "intact"))
targets.head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | 4 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | -1 |
| 101 | 0 | 5 | "test" | "intact" | "doctor" | "contrast" | 0 | 4.056 | 0 | -1 |
| 101 | 0 | 7 | "test" | "intact" | "homes" | "fuel" | 1 | 2.499 | 1 | -1 |
| 101 | 0 | 8 | "test" | "intact" | "liked" | "tone" | 1 | 1.609 | 1 | -1 |
| 101 | 0 | 9 | "test" | "intact" | "notice" | "explain" | 1 | 1.352 | 1 | -1 |
We can put together multiple operations by chaining methods. Here, we’ll select some columns and filter the rows in one go.
df_test = (
df.filter(pl.col("phase") == "test")
.select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_test.head(3)
| subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
Here, we put parentheses around the whole chain of method calls. This lets us split the different calls so there is one for each line. It tends to make things easier to read.
Exercise: using filter#
Use the participants DataFrame. If you don’t have the variable defined currently, you can read it in from participants.csv. Use filter to get participants who scored at least 5 on both tasks.
# answer here
7.7. Using sort#
DataFrames can be sorted based on multiple columns.
To change the sorting of the rows in a DataFrame, use sort.
participants.sort("score_task1")
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-004" | 2025-02-02 | 5 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-001" | 2025-01-08 | 8 | 6 |
To sort in descending order instead of ascending, set descending=True.
participants.sort("session_date", descending=True)
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-004" | 2025-02-02 | 5 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-001" | 2025-01-08 | 8 | 6 |
We can also sort by multiple columns simultaneously. Columns are sorted in the order in which they are passed.
study = pl.read_csv("study.csv")
study
| participant_id | trial_type | item_type | response | response_time |
|---|---|---|---|---|
| i64 | str | i64 | i64 | f64 |
| 1 | "target" | 1 | 1 | 1.5 |
| 1 | "lure" | 2 | 0 | 2.3 |
| 1 | "target" | 2 | 0 | 2.2 |
| 1 | "lure" | 1 | 1 | 1.8 |
| 2 | "lure" | 2 | 0 | 2.2 |
| 2 | "target" | 1 | 1 | 1.2 |
| 2 | "lure" | 1 | 1 | 2.1 |
| 2 | "target" | 2 | 1 | 1.6 |
study.sort("participant_id", "item_type", "trial_type")
| participant_id | trial_type | item_type | response | response_time |
|---|---|---|---|---|
| i64 | str | i64 | i64 | f64 |
| 1 | "lure" | 1 | 1 | 1.8 |
| 1 | "target" | 1 | 1 | 1.5 |
| 1 | "lure" | 2 | 0 | 2.3 |
| 1 | "target" | 2 | 0 | 2.2 |
| 2 | "lure" | 1 | 1 | 2.1 |
| 2 | "target" | 1 | 1 | 1.2 |
| 2 | "lure" | 2 | 0 | 2.2 |
| 2 | "target" | 2 | 1 | 1.6 |
When there are multiple sorting columns, we can change whether each one is ascending or descending by passing a list to the descending argument.
study.sort(
"participant_id", "item_type", "trial_type",
descending=[False, False, True],
)
| participant_id | trial_type | item_type | response | response_time |
|---|---|---|---|---|
| i64 | str | i64 | i64 | f64 |
| 1 | "target" | 1 | 1 | 1.5 |
| 1 | "lure" | 1 | 1 | 1.8 |
| 1 | "target" | 2 | 0 | 2.2 |
| 1 | "lure" | 2 | 0 | 2.3 |
| 2 | "target" | 1 | 1 | 1.2 |
| 2 | "lure" | 1 | 1 | 2.1 |
| 2 | "target" | 2 | 1 | 1.6 |
| 2 | "lure" | 2 | 0 | 2.2 |
7.8. Calculating summary statistics#
When getting a feel for a dataset, it can be very helpful to calculate some summary statistics to measure central tendency and spread.
First, let’s use method chaining to set up a DataFrame with data from the test phase. This example reads the data from the file, filters it to get the test phase, and selects some columns to focus on.
df_summary = (
pl.read_csv("exp1.csv")
.filter(pl.col("phase") == "test")
.select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_summary.head()
| subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
| 101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 |
| 101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 |
We can get some common summary statistics for all the columns using describe.
df_summary.describe()
| statistic | subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | str | str | f64 | f64 | f64 | f64 |
| "count" | 53700.0 | 53700.0 | "53700" | "53700" | 53700.0 | 53700.0 | 53700.0 | 53700.0 |
| "null_count" | 0.0 | 0.0 | "0" | "0" | 0.0 | 0.0 | 0.0 | 0.0 |
| "mean" | 165.655866 | 28.5 | null | null | 0.372607 | 1.331173 | 0.656648 | 1.0 |
| "std" | 99.008229 | 17.318264 | null | null | 0.48554 | 0.76009 | 0.474832 | 2.236089 |
| "min" | 101.0 | -1.0 | "test" | "intact" | -1.0 | -1.0 | 0.0 | -1.0 |
| "25%" | 128.0 | 14.0 | null | null | 0.0 | 0.882 | 0.0 | -1.0 |
| "50%" | 158.0 | 29.0 | null | null | 0.0 | 1.125 | 1.0 | 1.0 |
| "75%" | 186.0 | 43.0 | null | null | 1.0 | 1.526 | 1.0 | 3.0 |
| "max" | 1150.0 | 58.0 | "test" | "rearranged" | 1.0 | 7.925 | 1.0 | 5.0 |
The count row gives the number of non-null samples in each column, while the null_count row indicates how many samples are null.
The mean and std rows give the mean and standard deviation.
The min and max rows give the minimum and maximum values in each column. The percentage rows give percentiles of the data; the 50% row is the 50th percentile, also known as the median.
We can calculate more targeted statistics using expressions.
df_summary.select(pl.col("response", "RT").mean())
| response | RT |
|---|---|
| f64 | f64 |
| 0.372607 | 1.331173 |
Note that here we use pl.col to select multiple columns. The mean function is then applied to both columns.
We can use .name.suffix to distinguish these columns from existing ones, by adding a suffix.
result = df_summary.with_columns(pl.col("response", "RT").mean().name.suffix("_mean"))
result.head()
| subj | trial | phase | type | response | RT | correct | lag | response_mean | RT_mean |
|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 | f64 | f64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 | 0.372607 | 1.331173 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 | 0.372607 | 1.331173 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 | 0.372607 | 1.331173 |
| 101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 | 0.372607 | 1.331173 |
| 101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 | 0.372607 | 1.331173 |
See the Polars documentation for information about other descriptive statistics.
Exercise: summary statistics#
Use the participants DataFrame with describe to calculate summary statistics for the score_task1 and score_task2 columns. Do you understand what each of the statistics means?
# answer here
7.9. Summary#
Data in table format, such as data stored in spreadsheet files, can be represented in Python using DataFrames. A DataFrame is similar to a dictionary of NumPy arrays: each column must contain data of a single datatype.
The Polars package has many tools for working with DataFrames. The select and with_columns methods can be used to select a subset of existing columns or to create new columns using expressions. The filter method makes it possible to flexibly choose a subset of rows.
The data in a DataFrame can be summarized using the describe method or various descriptive statistics functions such as mean and std.