Data Frames#
DataFrames are used to organize spreadsheets of data. Python has powerful methods for organizing, filtering, and summarizing the data in spreadsheets.
Importing data#
We can easily use the Polars package to import data from spreadsheets into a data structure called a DataFrame. First, we will import the Polars module.
import polars as pl
We can use pl.read_csv to import a CSV file into a DataFrame. The head method shows the first several lines of a DataFrame (you can choose the number of lines; the default is 5). In Jupyter, the last command in a cell will be displayed below that cell. In this case, we have raw.head() at the end. This produces a DataFrame, which Jupyter then displays in a nice format.
raw = pl.read_csv("exp1.csv")
raw.head()
| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
| 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
| 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
| 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
We can also use print, but the formatting is less nice than what Jupyter uses.
print(raw.head())
shape: (5, 16)
┌───────┬───────┬───────┬────────┬───┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase ┆ type ┆ … ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪════════╪═══╪══════╪═══════════╪══════════════╪════════╡
│ 0 ┆ -1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 0 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 2 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 3 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
└───────┴───────┴───────┴────────┴───┴──────┴───────────┴──────────────┴────────┘
There are a lot of columns, including some that we won’t use. Let’s use the select method to make the data a little simpler to look at. The select method lets us create a new DataFrame with a specified set of columns. The simplest way to use it is to get a subset of the columns in the DataFrame by passing a list of column names.
columns = ["subj", "cycle", "trial", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag"]
df = raw.select(columns)
df.head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
A DataFrame has multiple columns, each of which can have a different type of data. For example, i64 means a 64-bit integer, and str indicates a string. Polars tries to guess the right data type for each column based on the data in that column during import.
The list of columns and their data types is known as a schema. We can access it for an existing DataFrame using the schema attribute.
df.schema
Schema([('subj', Int64),
('cycle', Int64),
('trial', Int64),
('phase', String),
('type', String),
('word1', String),
('word2', String),
('response', Int64),
('RT', Float64),
('correct', Int64),
('lag', Int64)])
When importing data, we can make changes to the schema to read things in differently. We can either pass in the full schema we want to use, as a dictionary, or use the schema_overrides input to specify formats for just some columns. If we only specify some columns, the formats of the others will be inferred automatically.
raw_override = pl.read_csv("exp1.csv", schema_overrides={"response": pl.Float64})
raw_override.head()
| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "study" | "intact" | "formal" | "positive" | -1.0 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
| 0 | 0 | "study" | "intact" | "skin" | "careful" | -1.0 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 1 | "study" | "intact" | "upon" | "miss" | -1.0 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
| 0 | 2 | "study" | "intact" | "single" | "tradition" | -1.0 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
| 0 | 3 | "study" | "intact" | "prove" | "airport" | -1.0 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
Note that the response column is now a 64-bit float (f64) instead of an integer.
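If we want full control, we can instead pass a complete schema. Here is a minimal sketch of what that might look like; the file scores.csv and its columns are hypothetical, made up just for illustration. When passing the schema input, every column in the file must be listed, in order.
# A sketch of passing a full schema (hypothetical file and columns).
schema = {"name": pl.String, "score": pl.Float64, "n_trials": pl.Int64}
scores = pl.read_csv("scores.csv", schema=schema)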
We can also easily write out a DataFrame to a spreadsheet file, using the write_csv method. We must specify the path to where we want the file to be saved.
We’ll first use the head method to get the first five lines. Note that we can chain together calls to DataFrame methods. When we call head, it returns a new DataFrame. Then we immediately call write_csv, which works on the trimmed DataFrame.
df.head(5).write_csv("exp1_trimmed.csv")
Take a look at the exp1_trimmed.csv spreadsheet in VSCode.
Exercise: importing data using Polars#
Read the file in this directory called people.csv into a DataFrame called people. Use the columns input to select just these columns: name, weight, height. Use Jupyter to display the DataFrame.
Advanced: handle dates#
Set try_parse_dates to True to automatically identify dates and read them in as dates instead of strings. Dates can be used in Python to calculate things like differences in age.
# answer here
Creating a DataFrame directly in Python#
It’s also possible to create a DataFrame directly using a dictionary of lists. Each key in the dictionary becomes a column in the DataFrame. This method is mostly used for demonstrations; real data are usually in a file that has to be imported.
We can create a DataFrame directly by calling pl.DataFrame. The input gives all the data and the column names, as a Python dictionary. We use the special date datatype to represent dates. The dates are given in the international standard format, which follows this ordering: year, month, day.
from datetime import date
df_manual = pl.DataFrame(
{
"name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate": [
date(1997, 1, 10),
date(1985, 2, 15),
date(1983, 3, 22),
date(1981, 4, 30),
],
"weight": [57.9, 72.5, 53.6, 83.1], # (kg)
"height": [1.56, 1.77, 1.65, 1.75], # (m)
}
)
Let’s take a look at the DataFrame we created. Note that each key in the dictionary is now a column. The lists we passed in become the data for each column.
df_manual
| name | birthdate | weight | height |
|---|---|---|---|
| str | date | f64 | f64 |
| "Alice Archer" | 1997-01-10 | 57.9 | 1.56 |
| "Ben Brown" | 1985-02-15 | 72.5 | 1.77 |
| "Chloe Cooper" | 1983-03-22 | 53.6 | 1.65 |
| "Daniel Donovan" | 1981-04-30 | 83.1 | 1.75 |
Exercise: creating a DataFrame#
Create a DataFrame called data with the same data and column names as this table. The participant_id column should have strings; the age column should have integers; and the accuracy column should have floats.
| participant_id | age | accuracy |
|---|---|---|
| 001 | 24 | 0.85 |
| 002 | 21 | 0.72 |
| 003 | 33 | 0.92 |
# answer here
Accessing data in a DataFrame#
DataFrames organize data within a more complex data structure than we’ve seen so far. We can access parts of a DataFrame to get some of the same structures that we have worked with before.
We can access columns of a DataFrame using df["column"], like we’ve seen before with dictionaries. This gives us a special data type called a Series, which Polars uses to represent a single column of data separate from a DataFrame. Series aren’t used very often by themselves.
df["type"]
| type |
|---|
| str |
| "intact" |
| "intact" |
| "intact" |
| "intact" |
| "intact" |
| … |
| "intact" |
| "rearranged" |
| "rearranged" |
| "rearranged" |
| "intact" |
We can convert Series into NumPy arrays using the to_numpy method.
x = df["response"].to_numpy()
x
array([-1, -1, -1, ..., 1, 0, 1], shape=(107443,))
We can then run any NumPy functions we want, like calculating the mean. We’ll use nanmean, which ignores any missing (NaN) samples.
import numpy as np
np.nanmean(x)
np.float64(-0.31397112887763745)
Usually converting data to NumPy arrays isn’t the best way to go, however. Polars has lots of tools for quickly doing things like calculating means and handling missing data by operating on a DataFrame.
df["response"].mean()
-0.31397112887763745
We can also use mean on the DataFrame to calculate it for all columns. Note that this doesn’t really make sense for all the columns, though. To get a meaningful result, we first need to do some data cleaning, such as removing the -1 values that are used to indicate missing data.
df.mean()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | str | str | str | str | f64 | f64 | f64 | f64 |
| 165.644798 | 3.497492 | 28.496598 | null | null | null | null | -0.313971 | 0.16512 | -0.172007 | -0.0004 |
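As a quick sketch of that kind of cleaning, we could drop the placeholder values before averaging, using the filter method introduced below. This assumes that -1 is only ever used as a missing-data code in the response column.
# Drop -1 placeholder values before taking the mean (a sketch).
df.filter(pl.col("response") != -1)["response"].mean()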
Exercise: accessing data#
Read the people.csv file into a DataFrame. Convert the height variable into a NumPy array and use NumPy to calculate the mean. Then use Polars to calculate the mean of all columns.
# answer here
Polars contexts#
Polars has different contexts that can be used to accomplish different kinds of operations:

- select is used to get a subset of columns or to create new columns from data in the DataFrame.
- with_columns works like select, but it keeps all existing columns and adds the new ones alongside them (it still returns a new DataFrame).
- filter is used to select a subset of the rows in the DataFrame.
- group_by is used to analyze groups of observations and calculate summary statistics (see the sketch just after this list).
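The first three contexts are demonstrated in the sections below. As a preview, here is a minimal sketch of group_by, using the df DataFrame from above to get the mean response time for each trial type:
# Mean RT for each trial type (a minimal group_by sketch). Note that
# this includes study trials, where RT is -1; a real analysis would
# filter those out first.
df.group_by("type").agg(pl.col("RT").mean())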
Using select and with_columns#
We can use select to create a new DataFrame using expressions that access data from an existing DataFrame.
We can select multiple columns, and optionally change their order, using the select method.
df.select(["subj", "phase", "response", "RT"]).head()
| subj | phase | response | RT |
|---|---|---|---|
| i64 | str | i64 | f64 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
Select is very powerful, and can be used for much more than just getting a subset of columns. We can also use it to transform columns in many different ways. Take the example of the people data, which includes weight and height.
people = pl.read_csv("people.csv")
people
| name | birthdate | weight | height |
|---|---|---|---|
| str | str | f64 | f64 |
| "Alice Archer" | "1997-01-10" | 57.9 | 1.56 |
| "Ben Brown" | "1985-02-15" | 72.5 | 1.77 |
| "Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 |
| "Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 |
We can use these data to calculate a new column with body mass index (BMI). First, we’ll define an expression for the value we want to calculate. Expressions specify something we want to do without actually doing it yet. This allows Polars to optimize how it runs different operations. The pl.col function can be used to specify one or more columns in a DataFrame, as a placeholder. Notice that we define the expression without referring to any specific DataFrame.
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]
We can use this expression with our DataFrame to calculate BMI. We can specify what to name the column using keyword arguments. Here, we set the input variable bmi to the expression that we defined before.
people.select(bmi=bmi_expr)
| bmi |
|---|
| f64 |
| 23.791913 |
| 23.141498 |
| 19.687787 |
| 27.134694 |
Before, we used a variable to store the BMI expression, but usually we’ll write everything all together like this.
people.select(bmi=pl.col("weight") / (pl.col("height") ** 2))
| bmi |
|---|
| f64 |
| 23.791913 |
| 23.141498 |
| 19.687787 |
| 27.134694 |
The select method only returns the columns that were defined in the call. If we want to keep the existing columns, we can instead use with_columns.
people.with_columns(bmi=pl.col("weight") / (pl.col("height") ** 2))
| name | birthdate | weight | height | bmi |
|---|---|---|---|---|
| str | str | f64 | f64 | f64 |
| "Alice Archer" | "1997-01-10" | 57.9 | 1.56 | 23.791913 |
| "Ben Brown" | "1985-02-15" | 72.5 | 1.77 | 23.141498 |
| "Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 | 19.687787 |
| "Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 | 27.134694 |
We can pass in multiple inputs to specify multiple columns.
people.with_columns(
bmi=bmi_expr,
avg_bmi=bmi_expr.mean(),
reference_bmi=25,
)
| name | birthdate | weight | height | bmi | avg_bmi | reference_bmi |
|---|---|---|---|---|---|---|
| str | str | f64 | f64 | f64 | f64 | i32 |
| "Alice Archer" | "1997-01-10" | 57.9 | 1.56 | 23.791913 | 23.438973 | 25 |
| "Ben Brown" | "1985-02-15" | 72.5 | 1.77 | 23.141498 | 23.438973 | 25 |
| "Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 | 19.687787 | 23.438973 | 25 |
| "Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 | 27.134694 | 23.438973 | 25 |
Here, we create two columns that are the same for all rows. The first one is the mean BMI across people. Note that we can chain together parts of an expression; this example calculates BMI for each person and then calculates the mean across people. The second is a reference value for BMI, which is set to 25 for every row.
Exercise: using select and with_columns#
Create the DataFrame scores as defined below.
| participant_id | score1 | score2 |
|---|---|---|
| 001 | 9 | 5 |
| 002 | 7 | 3 |
| 003 | 10 | 6 |
Create a new column called total that is the sum of the score1 and score2 columns, and a new column average that is the mean of the score1 and score2 columns. Compare what happens when you use select and with_columns.
# answer here
Using filter#
We can use filter to create a new DataFrame with a subset of rows.
We can filter the rows included in the table using the filter method. This allows for selecting data very flexibly, similar to how we selected data using logical expressions with NumPy arrays. To create an expression that refers to the value of a column, we must use the pl.col function.
df.filter(pl.col("phase") == "test").head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | -1 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | 0 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 |
| 101 | 0 | 1 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 |
| 101 | 0 | 2 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 |
| 101 | 0 | 3 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 |
Here, we indicate that we only want rows where the phase is test, to get only the test trials and exclude all the study trials.
We can make more complicated expressions using operators for and (&), or (|), and not (~). If you are combining multiple comparisons, you will need to add parentheses around the individual comparisons.
targets = df.filter((pl.col("phase") == "test") & (pl.col("type") == "intact"))
targets.head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | 4 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | -1 |
| 101 | 0 | 5 | "test" | "intact" | "doctor" | "contrast" | 0 | 4.056 | 0 | -1 |
| 101 | 0 | 7 | "test" | "intact" | "homes" | "fuel" | 1 | 2.499 | 1 | -1 |
| 101 | 0 | 8 | "test" | "intact" | "liked" | "tone" | 1 | 1.609 | 1 | -1 |
| 101 | 0 | 9 | "test" | "intact" | "notice" | "explain" | 1 | 1.352 | 1 | -1 |
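The or and not operators work the same way. As an illustrative sketch, this selects rows that are either from the study phase or not marked correct:
# Combine comparisons with | (or) and ~ (not); each comparison needs
# its own parentheses.
others = df.filter((pl.col("phase") == "study") | ~(pl.col("correct") == 1))
others.head()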
We can put together multiple operations by chaining methods. Here, we’ll select some columns and filter the rows in one go.
df_test = (
df.filter(pl.col("phase") == "test")
.select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_test.head(3)
| subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
Here, we put parentheses around the whole chain of method calls. This lets us split the different calls so there is one for each line. It tends to make things easier to read.
Exercise: using filter#
Use the people DataFrame. If you don’t have the variable defined currently, you can read it in from people.csv. Use filter to get people whose height is less than 1.7 meters and whose weight is greater than 55 kilograms.
# answer here
Calculating summary statistics#
When getting a feel for a dataset, it can be very helpful to calculate some summary statistics to measure central tendency and spread.
First, let’s use method chaining to set up a DataFrame with data from the test phase. This example reads the data from the file, filters it to get the test phase, and selects some columns to focus on.
df_summary = (
pl.read_csv("exp1.csv")
.filter(pl.col("phase") == "test")
.select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_summary.head()
| subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
| 101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 |
| 101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 |
We can get some common summary statistics for all the columns using describe.
df_summary.describe()
| statistic | subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | str | str | f64 | f64 | f64 | f64 |
| "count" | 53700.0 | 53700.0 | "53700" | "53700" | 53700.0 | 53700.0 | 53700.0 | 53700.0 |
| "null_count" | 0.0 | 0.0 | "0" | "0" | 0.0 | 0.0 | 0.0 | 0.0 |
| "mean" | 165.655866 | 28.5 | null | null | 0.372607 | 1.331173 | 0.656648 | 1.0 |
| "std" | 99.008229 | 17.318264 | null | null | 0.48554 | 0.76009 | 0.474832 | 2.236089 |
| "min" | 101.0 | -1.0 | "test" | "intact" | -1.0 | -1.0 | 0.0 | -1.0 |
| "25%" | 128.0 | 14.0 | null | null | 0.0 | 0.882 | 0.0 | -1.0 |
| "50%" | 158.0 | 29.0 | null | null | 0.0 | 1.125 | 1.0 | 1.0 |
| "75%" | 186.0 | 43.0 | null | null | 1.0 | 1.526 | 1.0 | 3.0 |
| "max" | 1150.0 | 58.0 | "test" | "rearranged" | 1.0 | 7.925 | 1.0 | 5.0 |
The count row gives the number of non-null samples in each column, while the null_count row indicates how many samples are null.
The mean and std rows give the mean and standard deviation.
The min and max rows give the minimum and maximum values in each column. The percentage rows give percentiles of the data; the 50% percentile is also known as the median.
We can calculate more targeted statistics using expressions.
df_summary.select(pl.col("response", "RT").mean())
| response | RT |
|---|---|
| f64 | f64 |
| 0.372607 | 1.331173 |
Note that here we use pl.col to select multiple columns. The mean function is then applied to both columns.
We can use .name.suffix to add a suffix that distinguishes these new columns from the existing ones.
result = df_summary.with_columns(pl.col("response", "RT").mean().name.suffix("_mean"))
result.head()
| subj | trial | phase | type | response | RT | correct | lag | response_mean | RT_mean |
|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 | f64 | f64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 | 0.372607 | 1.331173 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 | 0.372607 | 1.331173 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 | 0.372607 | 1.331173 |
| 101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 | 0.372607 | 1.331173 |
| 101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 | 0.372607 | 1.331173 |
See the Polars documentation for information about other descriptive statistics.
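For example, here is a quick sketch applying a few of those statistics to the RT column (the standard deviation, median, and 90th percentile); alias renames each result column.
# A few other descriptive statistics on RT.
df_summary.select(
    pl.col("RT").std().alias("RT_std"),
    pl.col("RT").median().alias("RT_median"),
    pl.col("RT").quantile(0.9).alias("RT_p90"),
)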
Exercise: summary statistics#
Use the people DataFrame with describe to calculate summary statistics for the weight and height columns. Do you understand what each of the statistics means?
# answer here
Summary#
Data in table format, such as data stored in spreadsheet files, can be represented in Python using DataFrames. A DataFrame is similar to a dictionary that holds a set of NumPy arrays, one per column. Each column must contain data of a single datatype.
The Polars package has many tools for working with DataFrames. The select and with_columns methods can be used to select a subset of existing columns or to create new columns using expressions. The filter method makes it possible to flexibly choose a subset of rows.
The data in a DataFrame can be summarized using the describe method or various descriptive statistics functions such as mean and std.