Data Frames#
DataFrames are used to organize tabular data, like the contents of a spreadsheet. There are powerful methods in Python for organizing, filtering, and summarizing data in DataFrames.
Importing data#
We can use the Polars package to easily import data from spreadsheets into a data structure called a DataFrame. First, we will import the Polars module.
import polars as pl
We can use `pl.read_csv` to import a CSV file into a DataFrame. The `head` method shows the first several lines of a DataFrame (you can choose the number of lines; the default is 5). In Jupyter, the last command in a cell will be displayed below that cell. In this case, we have `raw.head()` at the end. This produces a DataFrame, which Jupyter then displays in a nice format.
raw = pl.read_csv("exp1.csv")
raw.head()
cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
We can also use `print`, but the formatting is less nice than what Jupyter uses.
print(raw.head())
shape: (5, 16)
┌───────┬───────┬───────┬────────┬───┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase ┆ type ┆ … ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪════════╪═══╪══════╪═══════════╪══════════════╪════════╡
│ 0 ┆ -1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 0 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 2 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 3 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
└───────┴───────┴───────┴────────┴───┴──────┴───────────┴──────────────┴────────┘
There are a lot of columns, including some that we won’t use. Let’s use the `select` method to make the data a little simpler to look at. The `select` method lets us create a new DataFrame with a specified set of columns. The simplest way to use it is to get a subset of the columns in the DataFrame by passing a list of column names.
columns = ["subj", "cycle", "trial", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag"]
df = raw.select(columns)
df.head()
subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
101 | 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
101 | 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
101 | 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
101 | 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
A DataFrame has multiple columns, each of which can have a different type of data. For example, `i64` means a 64-bit integer, and `str` indicates a string. During import, Polars tries to infer the right data type for each column based on the data in that column.
The list of columns and their data types is known as a schema. We can access it for an existing DataFrame using the `schema` attribute.
df.schema
Schema([('subj', Int64),
('cycle', Int64),
('trial', Int64),
('phase', String),
('type', String),
('word1', String),
('word2', String),
('response', Int64),
('RT', Float64),
('correct', Int64),
('lag', Int64)])
When importing data, we can make changes to the schema to read things in differently. We can either pass in a full `schema` that we want to use, as a dictionary, or a `schema_overrides` input to indicate formats for just some of the columns. If we only specify some columns, the format of the other ones will be inferred automatically.
raw_override = pl.read_csv("exp1.csv", schema_overrides={"response": pl.Float64})
raw_override.head()
cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
0 | -1 | "study" | "intact" | "formal" | "positive" | -1.0 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
0 | 0 | "study" | "intact" | "skin" | "careful" | -1.0 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
0 | 1 | "study" | "intact" | "upon" | "miss" | -1.0 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
0 | 2 | "study" | "intact" | "single" | "tradition" | -1.0 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
0 | 3 | "study" | "intact" | "prove" | "airport" | -1.0 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
Note that the `response` column is now a 64-bit float (`f64`) instead of an integer.
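For comparison, here is a sketch of passing a complete `schema` as a dictionary. The column names and types below simply mirror the table shown above; when a full schema is given, every column in the file must be listed, in order.
# Sketch: supplying the complete schema instead of overrides
schema = {
    "cycle": pl.Int64, "trial": pl.Int64, "phase": pl.String,
    "type": pl.String, "word1": pl.String, "word2": pl.String,
    "response": pl.Float64, "RT": pl.Float64, "correct": pl.Int64,
    "lag": pl.Int64, "serPos1": pl.Int64, "serPos2": pl.Int64,
    "subj": pl.Int64, "intactLag": pl.Int64, "prevResponse": pl.Int64,
    "prevRT": pl.Int64,
}
raw_full = pl.read_csv("exp1.csv", schema=schema)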
We can also easily write out a DataFrame to a spreadsheet file, using the `write_csv` method. We must specify the path where we want the file to be saved.
We’ll first use the `head` method to get the first five lines. Note that we can chain together calls to DataFrame methods. When we call `head`, it returns a new DataFrame. Then we immediately call `write_csv`, which operates on the trimmed DataFrame.
df.head(5).write_csv("exp1_trimmed.csv")
Take a look at the `exp1_trimmed.csv` spreadsheet in VS Code.
Exercise: importing data using Polars#
Read the file in this directory called `people.csv` into a DataFrame called `people`. Use the `columns` input to select just these columns: name, weight, height. Use Jupyter to display the DataFrame.
Advanced: handle dates#
Set `try_parse_dates` to `True` to automatically identify dates and read them in as dates instead of strings. Dates can be used in Python to calculate things like differences in age.
# answer here
Creating a DataFrame directly in Python#
It’s also possible to create a DataFrame directly using a dictionary of lists. Each key in the dictionary becomes a column in the DataFrame. This method is mostly used for demonstrations; real data are usually in a file that has to be imported.
We can create a DataFrame directly by calling `pl.DataFrame`. The input gives all the data and the column names, as a Python dictionary. We use the special `date` datatype to represent dates. The dates are given in the international standard format, which follows this ordering: year, month, day.
from datetime import date
df_manual = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
Let’s take a look at the DataFrame we created. Note that each key in the dictionary is now a column. The lists we passed in become the data for each column.
df_manual
name | birthdate | weight | height |
---|---|---|---|
str | date | f64 | f64 |
"Alice Archer" | 1997-01-10 | 57.9 | 1.56 |
"Ben Brown" | 1985-02-15 | 72.5 | 1.77 |
"Chloe Cooper" | 1983-03-22 | 53.6 | 1.65 |
"Daniel Donovan" | 1981-04-30 | 83.1 | 1.75 |
Exercise: creating a DataFrame#
Create a DataFrame called `data` with the same data and column names as this table. The `participant_id` column should have strings; the `age` column should have integers; and the `accuracy` column should have floats.
participant_id | age | accuracy |
---|---|---|
001 | 24 | 0.85 |
002 | 21 | 0.72 |
003 | 33 | 0.92 |
# answer here
Accessing data in a DataFrame#
DataFrames organize data within a more complex data structure than we’ve seen so far. We can access parts of a DataFrame to get some of the same structures that we have worked with before.
We can access columns of a DataFrame using `df["column"]`, like we’ve seen before with dictionaries. This gives us a special data type called a `Series`, which Polars uses to represent a single column of data separately from a DataFrame. Series aren’t used very often by themselves.
df["type"]
type |
---|
str |
"intact" |
"intact" |
"intact" |
"intact" |
"intact" |
… |
"intact" |
"rearranged" |
"rearranged" |
"rearranged" |
"intact" |
We can convert a Series into a NumPy array using the `to_numpy` method.
x = df["response"].to_numpy()
x
array([-1, -1, -1, ..., 1, 0, 1], shape=(107443,))
We can then apply any NumPy functions we want, like calculating the mean. We’ll use `nanmean` to exclude missing samples.
import numpy as np
np.nanmean(x)
np.float64(-0.31397112887763745)
Usually converting data to NumPy arrays isn’t the best way to go, however. Polars has lots of tools for quickly doing things like calculating means and handling missing data by operating on a DataFrame.
df["response"].mean()
-0.31397112887763745
We can also use `mean` on the DataFrame to calculate it for all columns. Note that this doesn’t really make sense for all the columns, though. To get a meaningful result, we first need to do some data cleaning, such as removing the `-1` values that are used to indicate missing data.
df.mean()
subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|---|---|
f64 | f64 | f64 | str | str | str | str | f64 | f64 | f64 | f64 |
165.644798 | 3.497492 | 28.496598 | null | null | null | null | -0.313971 | 0.16512 | -0.172007 | -0.0004 |
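As a sketch of the kind of cleaning meant here, assuming `-1` always marks a missing response, we could recode those values as null before averaging; Polars skips nulls when computing a mean.
# Sketch: recode -1 as null before averaging (assumes -1 always
# indicates a missing response); mean() skips nulls automatically
cleaned = df.with_columns(
    pl.when(pl.col("response") == -1)
    .then(None)
    .otherwise(pl.col("response"))
    .alias("response")
)
cleaned["response"].mean()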
Exercise: accessing data#
Read the people.csv
file into a DataFrame. Convert the height
variable into a NumPy array and use NumPy to calculate the mean. Then use Polars to calculate the mean of all columns.
# answer here
Polars contexts#
Polars has different contexts that can be used to accomplish different kinds of operations.
- `select` is used to get a subset of columns or to create new columns from data in the DataFrame.
- `with_columns` works like `select`, but keeps the existing columns and adds the new ones, rather than returning only the columns you asked for.
- `filter` is used to select a subset of the rows in the DataFrame.
- `group_by` is used to analyze groups of observations and calculate summary statistics (see the sketch after this list).
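The next sections cover the first three contexts in detail. As a quick sketch of `group_by`, using the exp1 columns from above, we could calculate the mean response and response time for each trial type like this.
# Sketch of the group_by context: summary statistics per trial type
(
    df.filter(pl.col("phase") == "test")
    .group_by("type")
    .agg(pl.col("response", "RT").mean())
)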
Using select and with_columns#
We can use select to create a new DataFrame using expressions that access data from an existing DataFrame.
We can select multiple columns, and optionally change their order, using the `select` method.
df.select(["subj", "phase", "response", "RT"]).head()
subj | phase | response | RT |
---|---|---|---|
i64 | str | i64 | f64 |
101 | "study" | -1 | -1.0 |
101 | "study" | -1 | -1.0 |
101 | "study" | -1 | -1.0 |
101 | "study" | -1 | -1.0 |
101 | "study" | -1 | -1.0 |
Select is very powerful, and can be used for much more than just getting a subset of columns. We can also use it to transform columns in many different ways. Take the example of the people data, which includes weight and height.
people = pl.read_csv("people.csv")
people
name | birthdate | weight | height |
---|---|---|---|
str | str | f64 | f64 |
"Alice Archer" | "1997-01-10" | 57.9 | 1.56 |
"Ben Brown" | "1985-02-15" | 72.5 | 1.77 |
"Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 |
"Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 |
We can use these data to calculate a new column with body mass index (BMI). First, we’ll define an expression for the value we want to calculate. Expressions specify something we want to do without actually doing it yet. This allows Polars to optimize how it runs different operations. The `pl.col` function can be used to specify one or more columns in a DataFrame, as a placeholder. Notice that we define the expression without referring to any specific DataFrame.
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]
We can use this expression with our DataFrame to calculate BMI. We can specify what to name the new column using keyword arguments. Here, we set the keyword argument `bmi` to the expression that we defined before.
people.select(bmi=bmi_expr)
bmi |
---|
f64 |
23.791913 |
23.141498 |
19.687787 |
27.134694 |
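Another common way to name the output, which you may see in Polars code, is the `alias` expression method; a minimal sketch:
# Sketch: naming the output column with alias instead of a keyword
people.select(bmi_expr.alias("bmi"))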
Before, we used a variable to store the BMI expression, but usually we’ll write everything all together like this.
people.select(bmi=pl.col("weight") / (pl.col("height") ** 2))
bmi |
---|
f64 |
23.791913 |
23.141498 |
19.687787 |
27.134694 |
The `select` method only returns the columns that were defined in the call. If we want to keep existing columns, we can instead use `with_columns`.
people.with_columns(bmi=pl.col("weight") / (pl.col("height") ** 2))
name | birthdate | weight | height | bmi |
---|---|---|---|---|
str | str | f64 | f64 | f64 |
"Alice Archer" | "1997-01-10" | 57.9 | 1.56 | 23.791913 |
"Ben Brown" | "1985-02-15" | 72.5 | 1.77 | 23.141498 |
"Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 | 19.687787 |
"Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 | 27.134694 |
We can pass in multiple inputs to specify multiple columns.
people.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    reference_bmi=25,
)
name | birthdate | weight | height | bmi | avg_bmi | reference_bmi |
---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | i32 |
"Alice Archer" | "1997-01-10" | 57.9 | 1.56 | 23.791913 | 23.438973 | 25 |
"Ben Brown" | "1985-02-15" | 72.5 | 1.77 | 23.141498 | 23.438973 | 25 |
"Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 | 19.687787 | 23.438973 | 25 |
"Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 | 27.134694 | 23.438973 | 25 |
Here, we create two columns that are the same for all rows. The first one is the mean BMI across people. Note that we can chain together parts of an expression; this example calculates BMI for each person and then takes the mean across people. The second is a reference value for BMI, which is set to 25 for every row.
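If you want to be explicit that a plain value like 25 is broadcast to every row, Polars provides `pl.lit` for literal values; a minimal sketch:
# Sketch: pl.lit makes the broadcast literal explicit and also lets
# you control its type (here a 64-bit float instead of an integer)
people.with_columns(reference_bmi=pl.lit(25.0))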
Exercise: using select and with_columns#
Create the DataFrame `scores` as defined below.
participant_id | score1 | score2 |
---|---|---|
001 | 9 | 5 |
002 | 7 | 3 |
003 | 10 | 6 |
Create a new column called `total` that is the sum of the `score1` and `score2` columns, and a new column called `average` that is the mean of the `score1` and `score2` columns. Compare what happens when you use `select` and `with_columns`.
# answer here
Using filter#
We can use filter to create a new DataFrame with a subset of rows.
We can filter the rows included in the table using the `filter` method. This allows for selecting data very flexibly, similar to how we selected subsets of NumPy arrays using logical expressions. To create an expression that refers to the value of a column, we use the `pl.col` function.
df.filter(pl.col("phase") == "test").head()
subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | 0 | -1 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 |
101 | 0 | 0 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 |
101 | 0 | 1 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 |
101 | 0 | 2 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 |
101 | 0 | 3 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 |
Here, we indicate that we only want rows where the phase is `test`, to get only the test trials and exclude all the study trials.
We can make more complicated expressions using operators for and (`&`), or (`|`), and not (`~`). If you are combining multiple comparisons, you will need to add parentheses around the individual comparisons.
targets = df.filter((pl.col("phase") == "test") & (pl.col("type") == "intact"))
targets.head()
subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
101 | 0 | 4 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | -1 |
101 | 0 | 5 | "test" | "intact" | "doctor" | "contrast" | 0 | 4.056 | 0 | -1 |
101 | 0 | 7 | "test" | "intact" | "homes" | "fuel" | 1 | 2.499 | 1 | -1 |
101 | 0 | 8 | "test" | "intact" | "liked" | "tone" | 1 | 1.609 | 1 | -1 |
101 | 0 | 9 | "test" | "intact" | "notice" | "explain" | 1 | 1.352 | 1 | -1 |
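As a sketch of the other operators, here are hypothetical filters using or (`|`) and not (`~`); the 3-second cutoff is made up for illustration.
# Sketch: test trials that were answered incorrectly OR slowly
# (the 3.0-second threshold is arbitrary, for illustration only)
slow_or_wrong = df.filter(
    (pl.col("phase") == "test")
    & ((pl.col("correct") == 0) | (pl.col("RT") > 3.0))
)
# Sketch: test trials that are NOT intact
not_intact = df.filter(
    (pl.col("phase") == "test") & ~(pl.col("type") == "intact")
)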
We can put together multiple operations by chaining methods. Here, we’ll select some columns and filter the rows in one go.
df_test = (
    df.filter(pl.col("phase") == "test")
    .select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_test.head(3)
subj | trial | phase | type | response | RT | correct | lag |
---|---|---|---|---|---|---|---|
i64 | i64 | str | str | i64 | f64 | i64 | i64 |
101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
Here, we put parentheses around the whole chain of method calls. This lets us split the different calls so there is one for each line. It tends to make things easier to read.
Exercise: using filter#
Use the `people` DataFrame. If you don’t currently have the variable defined, you can read it in from `people.csv`. Use `filter` to get people whose height is less than 1.7 meters and whose weight is greater than 55 kilograms.
# answer here
Calculating summary statistics#
When getting a feel for a dataset, it can be very helpful to calculate some summary statistics to measure central tendency and spread.
First, let’s use method chaining to set up a DataFrame with data from the test phase. This example reads the data from the file, filters it to get the test phase, and selects some columns to focus on.
df_summary = (
    pl.read_csv("exp1.csv")
    .filter(pl.col("phase") == "test")
    .select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_summary.head()
subj | trial | phase | type | response | RT | correct | lag |
---|---|---|---|---|---|---|---|
i64 | i64 | str | str | i64 | f64 | i64 | i64 |
101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 |
101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 |
We can get some common summary statistics for all the columns using `describe`.
df_summary.describe()
statistic | subj | trial | phase | type | response | RT | correct | lag |
---|---|---|---|---|---|---|---|---|
str | f64 | f64 | str | str | f64 | f64 | f64 | f64 |
"count" | 53700.0 | 53700.0 | "53700" | "53700" | 53700.0 | 53700.0 | 53700.0 | 53700.0 |
"null_count" | 0.0 | 0.0 | "0" | "0" | 0.0 | 0.0 | 0.0 | 0.0 |
"mean" | 165.655866 | 28.5 | null | null | 0.372607 | 1.331173 | 0.656648 | 1.0 |
"std" | 99.008229 | 17.318264 | null | null | 0.48554 | 0.76009 | 0.474832 | 2.236089 |
"min" | 101.0 | -1.0 | "test" | "intact" | -1.0 | -1.0 | 0.0 | -1.0 |
"25%" | 128.0 | 14.0 | null | null | 0.0 | 0.882 | 0.0 | -1.0 |
"50%" | 158.0 | 29.0 | null | null | 0.0 | 1.125 | 1.0 | 1.0 |
"75%" | 186.0 | 43.0 | null | null | 1.0 | 1.526 | 1.0 | 3.0 |
"max" | 1150.0 | 58.0 | "test" | "rearranged" | 1.0 | 7.925 | 1.0 | 5.0 |
The `count` row gives the number of non-null samples in each column, while the `null_count` row indicates how many samples are null.
The `mean` and `std` rows give the mean and standard deviation.
The `min` and `max` rows give the minimum and maximum values in each column. The percentage rows give percentiles of the data. The 50% percentile is also known as the median.
We can calculate more targeted statistics using expressions.
df_summary.select(pl.col("response", "RT").mean())
response | RT |
---|---|
f64 | f64 |
0.372607 | 1.331173 |
Note that here we use `pl.col` to select multiple columns. The `mean` method is then applied to both columns.
We can use `.name.suffix` to distinguish these new columns from existing ones by adding a suffix.
result = df_summary.with_columns(pl.col("response", "RT").mean().name.suffix("_mean"))
result.head()
subj | trial | phase | type | response | RT | correct | lag | response_mean | RT_mean |
---|---|---|---|---|---|---|---|---|---|
i64 | i64 | str | str | i64 | f64 | i64 | i64 | f64 | f64 |
101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 | 0.372607 | 1.331173 |
101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 | 0.372607 | 1.331173 |
101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 | 0.372607 | 1.331173 |
101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 | 0.372607 | 1.331173 |
101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 | 0.372607 | 1.331173 |
See the Polars documentation for information about other descriptive statistics.
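For instance, here is a sketch computing a few of them on the `RT` column (these are standard expression methods):
# Sketch: a few more descriptive statistics for response time
df_summary.select(
    pl.col("RT").median().alias("RT_median"),
    pl.col("RT").std().alias("RT_std"),
    pl.col("RT").min().alias("RT_min"),
    pl.col("RT").max().alias("RT_max"),
)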
Exercise: summary statistics#
Use the `people` DataFrame with `describe` to calculate summary statistics for the weight and height columns. Do you understand what each of the statistics means?
# answer here
Summary#
Data in table format, such as data stored in spreadsheet files, can be represented in Python using DataFrames. A DataFrame is similar to a dictionary that holds a set of NumPy arrays, one per column. Each column must contain data of a single datatype.
The Polars package has many tools for working with DataFrames. The `select` and `with_columns` methods can be used to select a subset of existing columns or to create new columns using expressions. The `filter` method makes it possible to flexibly choose a subset of rows.
The data in a DataFrame can be summarized using the `describe` method or various descriptive statistics methods such as `mean` and `std`.