Data Frames#

DataFrames are used to organize spreadsheet-style tables of data. Python has powerful methods for organizing, filtering, and summarizing the data in these tables.

Importing data#

We can easily use the Polars package to import data from spreadsheets into a data structure called a DataFrame. First, we will import the Polars module.

import polars as pl

We can use pl.read_csv to import a CSV file into a DataFrame. The head method shows the first several lines of a DataFrame (you can choose the number of lines; the default is 5). In Jupyter, the last command in a cell will be displayed below that cell. In this case, we have raw.head() at the end. This produces a DataFrame, which Jupyter then displays in a nice format.

raw = pl.read_csv("exp1.csv")
raw.head()
shape: (5, 16)
┌───────┬───────┬─────────┬──────────┬──────────┬─────────────┬──────────┬──────┬─────────┬─────┬─────────┬─────────┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase   ┆ type     ┆ word1    ┆ word2       ┆ response ┆ RT   ┆ correct ┆ lag ┆ serPos1 ┆ serPos2 ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ ---   ┆ ---   ┆ ---     ┆ ---      ┆ ---      ┆ ---         ┆ ---      ┆ ---  ┆ ---     ┆ --- ┆ ---     ┆ ---     ┆ ---  ┆ ---       ┆ ---          ┆ ---    │
│ i64   ┆ i64   ┆ str     ┆ str      ┆ str      ┆ str         ┆ i64      ┆ f64  ┆ i64     ┆ i64 ┆ i64     ┆ i64     ┆ i64  ┆ i64       ┆ i64          ┆ i64    │
╞═══════╪═══════╪═════════╪══════════╪══════════╪═════════════╪══════════╪══════╪═════════╪═════╪═════════╪═════════╪══════╪═══════════╪══════════════╪════════╡
│ 0     ┆ -1    ┆ "study" ┆ "intact" ┆ "formal" ┆ "positive"  ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  ┆ 0       ┆ 0       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 0     ┆ "study" ┆ "intact" ┆ "skin"   ┆ "careful"   ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  ┆ 1       ┆ 1       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 1     ┆ "study" ┆ "intact" ┆ "upon"   ┆ "miss"      ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  ┆ 2       ┆ 2       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 2     ┆ "study" ┆ "intact" ┆ "single" ┆ "tradition" ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  ┆ 3       ┆ 3       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 3     ┆ "study" ┆ "intact" ┆ "prove"  ┆ "airport"   ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  ┆ 4       ┆ 4       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
└───────┴───────┴─────────┴──────────┴──────────┴─────────────┴──────────┴──────┴─────────┴─────┴─────────┴─────────┴──────┴───────────┴──────────────┴────────┘

We can also use print, but the formatting is less nice than what Jupyter uses.

print(raw.head())
shape: (5, 16)
┌───────┬───────┬───────┬────────┬───┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase ┆ type   ┆ … ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---  ┆ ---       ┆ ---          ┆ ---    │
│ i64   ┆ i64   ┆ str   ┆ str    ┆   ┆ i64  ┆ i64       ┆ i64          ┆ i64    │
╞═══════╪═══════╪═══════╪════════╪═══╪══════╪═══════════╪══════════════╪════════╡
│ 0     ┆ -1    ┆ study ┆ intact ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 0     ┆ study ┆ intact ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 1     ┆ study ┆ intact ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 2     ┆ study ┆ intact ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 3     ┆ study ┆ intact ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
└───────┴───────┴───────┴────────┴───┴──────┴───────────┴──────────────┴────────┘
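
For a wide DataFrame like this one, another handy view is the glimpse method, which prints one line per column showing its name, data type, and first few values. A quick sketch (output not shown):

raw.glimpse()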

There are a lot of columns, including some that we won’t use. Let’s use the select method to make the data a little simpler to look at. The select method lets us create a new DataFrame with a specified set of columns. The simplest way to use it is to get a subset of the columns in the DataFrame by passing a list of column names.

columns = ["subj", "cycle", "trial", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag"]
df = raw.select(columns)
df.head()
shape: (5, 11)
┌──────┬───────┬───────┬─────────┬──────────┬──────────┬─────────────┬──────────┬──────┬─────────┬─────┐
│ subj ┆ cycle ┆ trial ┆ phase   ┆ type     ┆ word1    ┆ word2       ┆ response ┆ RT   ┆ correct ┆ lag │
│ ---  ┆ ---   ┆ ---   ┆ ---     ┆ ---      ┆ ---      ┆ ---         ┆ ---      ┆ ---  ┆ ---     ┆ --- │
│ i64  ┆ i64   ┆ i64   ┆ str     ┆ str      ┆ str      ┆ str         ┆ i64      ┆ f64  ┆ i64     ┆ i64 │
╞══════╪═══════╪═══════╪═════════╪══════════╪══════════╪═════════════╪══════════╪══════╪═════════╪═════╡
│ 101  ┆ 0     ┆ -1    ┆ "study" ┆ "intact" ┆ "formal" ┆ "positive"  ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  │
│ 101  ┆ 0     ┆ 0     ┆ "study" ┆ "intact" ┆ "skin"   ┆ "careful"   ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  │
│ 101  ┆ 0     ┆ 1     ┆ "study" ┆ "intact" ┆ "upon"   ┆ "miss"      ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  │
│ 101  ┆ 0     ┆ 2     ┆ "study" ┆ "intact" ┆ "single" ┆ "tradition" ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  │
│ 101  ┆ 0     ┆ 3     ┆ "study" ┆ "intact" ┆ "prove"  ┆ "airport"   ┆ -1       ┆ -1.0 ┆ -1      ┆ -1  │
└──────┴───────┴───────┴─────────┴──────────┴──────────┴─────────────┴──────────┴──────┴─────────┴─────┘

A DataFrame has multiple columns, each of which can have a different type of data. For example, i64 means a 64-bit integer, and str indicates a string. Polars tries to guess the right data type for each column based on the data in that column during import.

The list of columns and their data types is known as a schema. We can access it for an existing DataFrame using the schema attribute.

df.schema
Schema([('subj', Int64),
        ('cycle', Int64),
        ('trial', Int64),
        ('phase', String),
        ('type', String),
        ('word1', String),
        ('word2', String),
        ('response', Int64),
        ('RT', Float64),
        ('correct', Int64),
        ('lag', Int64)])

When importing data, we can make changes to the schema to read things in differently. We can either pass in a full schema as a dictionary, or use the schema_overrides input to change just some specific columns. If we only specify some columns, the format of the other ones will be inferred automatically.

raw_override = pl.read_csv("exp1.csv", schema_overrides={"response": pl.Float64})
raw_override.head()
shape: (5, 16)
┌───────┬───────┬─────────┬──────────┬──────────┬─────────────┬──────────┬──────┬─────────┬─────┬─────────┬─────────┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase   ┆ type     ┆ word1    ┆ word2       ┆ response ┆ RT   ┆ correct ┆ lag ┆ serPos1 ┆ serPos2 ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ ---   ┆ ---   ┆ ---     ┆ ---      ┆ ---      ┆ ---         ┆ ---      ┆ ---  ┆ ---     ┆ --- ┆ ---     ┆ ---     ┆ ---  ┆ ---       ┆ ---          ┆ ---    │
│ i64   ┆ i64   ┆ str     ┆ str      ┆ str      ┆ str         ┆ f64      ┆ f64  ┆ i64     ┆ i64 ┆ i64     ┆ i64     ┆ i64  ┆ i64       ┆ i64          ┆ i64    │
╞═══════╪═══════╪═════════╪══════════╪══════════╪═════════════╪══════════╪══════╪═════════╪═════╪═════════╪═════════╪══════╪═══════════╪══════════════╪════════╡
│ 0     ┆ -1    ┆ "study" ┆ "intact" ┆ "formal" ┆ "positive"  ┆ -1.0     ┆ -1.0 ┆ -1      ┆ -1  ┆ 0       ┆ 0       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 0     ┆ "study" ┆ "intact" ┆ "skin"   ┆ "careful"   ┆ -1.0     ┆ -1.0 ┆ -1      ┆ -1  ┆ 1       ┆ 1       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 1     ┆ "study" ┆ "intact" ┆ "upon"   ┆ "miss"      ┆ -1.0     ┆ -1.0 ┆ -1      ┆ -1  ┆ 2       ┆ 2       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 2     ┆ "study" ┆ "intact" ┆ "single" ┆ "tradition" ┆ -1.0     ┆ -1.0 ┆ -1      ┆ -1  ┆ 3       ┆ 3       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 3     ┆ "study" ┆ "intact" ┆ "prove"  ┆ "airport"   ┆ -1.0     ┆ -1.0 ┆ -1      ┆ -1  ┆ 4       ┆ 4       ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
└───────┴───────┴─────────┴──────────┴──────────┴─────────────┴──────────┴──────┴─────────┴─────┴─────────┴─────────┴──────┴───────────┴──────────────┴────────┘

Note that the response column is now a 64-bit float (f64) instead of an integer.
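
For completeness, here is a sketch of the full-schema approach mentioned above. With the schema input, every column in the file must be listed, in order; the types below simply mirror what Polars inferred, except for response.

full_schema = {
    "cycle": pl.Int64,
    "trial": pl.Int64,
    "phase": pl.String,
    "type": pl.String,
    "word1": pl.String,
    "word2": pl.String,
    "response": pl.Float64,  # read as float instead of the inferred integer
    "RT": pl.Float64,
    "correct": pl.Int64,
    "lag": pl.Int64,
    "serPos1": pl.Int64,
    "serPos2": pl.Int64,
    "subj": pl.Int64,
    "intactLag": pl.Int64,
    "prevResponse": pl.Int64,
    "prevRT": pl.Int64,
}
raw_full = pl.read_csv("exp1.csv", schema=full_schema)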

We can also easily write out a DataFrame to a spreadsheet file, using the write_csv method. We must specify the path to where we want the file to be saved.

We’ll first use the head method to get the first five lines. Note that we can chain together calls to DataFrame methods. When we call head, it returns a new DataFrame. Then we immediately call write_csv, which works on the trimmed DataFrame.

df.head(5).write_csv("exp1_trimmed.csv")

Take a look at the exp1_trimmed.csv spreadsheet in VSCode.

Exercise: importing data using Polars#

Read the file in this directory called people.csv into a DataFrame called people. Use the columns input to select just these columns: name, weight, height. Use Jupyter to display the DataFrame.

Advanced: handle dates#

Set try_parse_dates to True to automatically identify dates and read them in as dates instead of strings. Dates can be used in Python to calculate things like differences in age.

# answer here

Creating a DataFrame directly in Python#

It’s also possible to create a DataFrame directly using a dictionary of lists. Each key in the dictionary becomes a column in the DataFrame. This method is mostly used for demonstrations; real data are usually in a file that has to be imported.

We can create a DataFrame directly by calling pl.DataFrame. The input gives all the data and the column names, as a Python dictionary. We use the special date datatype to represent dates. The dates are given in the international standard format, which follows this ordering: year, month, day.

from datetime import date
df_manual = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

Let’s take a look at the DataFrame we created. Note that each key in the dictionary is now a column. The lists we passed in become the data for each column.

df_manual
shape: (4, 4)
┌──────────────────┬────────────┬────────┬────────┐
│ name             ┆ birthdate  ┆ weight ┆ height │
│ ---              ┆ ---        ┆ ---    ┆ ---    │
│ str              ┆ date       ┆ f64    ┆ f64    │
╞══════════════════╪════════════╪════════╪════════╡
│ "Alice Archer"   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ "Ben Brown"      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ "Chloe Cooper"   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ "Daniel Donovan" ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└──────────────────┴────────────┴────────┴────────┘

Exercise: creating a DataFrame#

Create a DataFrame called data with the same data and column names as this table. The participant_id column should have strings; the age column should have integers; and the accuracy column should have floats.

participant_id  age  accuracy
001             24   0.85
002             21   0.72
003             33   0.92

# answer here

Accessing data in a DataFrame#

DataFrames organize data within a more complex data structure than we’ve seen so far. We can access parts of a DataFrame to get some of the same structures that we have worked with before.

We can access columns of a DataFrame using df["column"], like we’ve seen before with dictionaries. This gives us a special data type called a Series, which Polars uses to represent a single column of data separately from a DataFrame. Series aren’t used very often by themselves.

df["type"]
shape: (107_443,)
┌──────────────┐
│ type         │
│ ---          │
│ str          │
╞══════════════╡
│ "intact"     │
│ "intact"     │
│ "intact"     │
│ "intact"     │
│ "intact"     │
│ …            │
│ "intact"     │
│ "rearranged" │
│ "rearranged" │
│ "rearranged" │
│ "intact"     │
└──────────────┘

We can convert Series into NumPy arrays using the to_numpy method.

x = df["response"].to_numpy()
x
array([-1, -1, -1, ...,  1,  0,  1], shape=(107443,))

We can then run any NumPy methods we want, like calculating the mean. We’ll use nanmean to exclude missing samples.

import numpy as np
np.nanmean(x)
np.float64(-0.31397112887763745)

Usually converting data to NumPy arrays isn’t the best way to go, however. Polars has lots of tools for quickly doing things like calculating means and handling missing data by operating on a DataFrame.

df["response"].mean()
-0.31397112887763745

We can also use mean on the DataFrame to calculate it for all columns. Note that this doesn’t really make sense for all the columns, though. To get a meaningful result, we first need to do some data cleaning, such as removing the -1 values that are used to indicate missing data.

df.mean()
shape: (1, 11)
┌────────────┬──────────┬───────────┬───────┬──────┬───────┬───────┬───────────┬─────────┬───────────┬─────────┐
│ subj       ┆ cycle    ┆ trial     ┆ phase ┆ type ┆ word1 ┆ word2 ┆ response  ┆ RT      ┆ correct   ┆ lag     │
│ ---        ┆ ---      ┆ ---       ┆ ---   ┆ ---  ┆ ---   ┆ ---   ┆ ---       ┆ ---     ┆ ---       ┆ ---     │
│ f64        ┆ f64      ┆ f64       ┆ str   ┆ str  ┆ str   ┆ str   ┆ f64       ┆ f64     ┆ f64       ┆ f64     │
╞════════════╪══════════╪═══════════╪═══════╪══════╪═══════╪═══════╪═══════════╪═════════╪═══════════╪═════════╡
│ 165.644798 ┆ 3.497492 ┆ 28.496598 ┆ null  ┆ null ┆ null  ┆ null  ┆ -0.313971 ┆ 0.16512 ┆ -0.172007 ┆ -0.0004 │
└────────────┴──────────┴───────────┴───────┴──────┴───────┴───────┴───────────┴─────────┴───────────┴─────────┘
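
As a sketch of that kind of cleaning, one option is to convert the -1 placeholder values to null before averaging, since Polars skips nulls when calculating the mean. This uses with_columns (introduced in the next section) together with a when/then/otherwise expression for conditional logic.

cleaned = df.with_columns(
    pl.when(pl.col("response") >= 0)  # keep valid responses
    .then(pl.col("response"))
    .otherwise(None)                  # replace -1 codes with null
    .alias("response")
)
cleaned["response"].mean()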

Exercise: accessing data#

Read the people.csv file into a DataFrame. Convert the height variable into a NumPy array and use NumPy to calculate the mean. Then use Polars to calculate the mean of all columns.

# answer here

Polars contexts#

Polars has different contexts that can be used to accomplish different kinds of operations.

select is used to get a subset of columns or to create new columns from data in the DataFrame.

with_columns works like select, but it adds the new columns to the existing ones instead of returning only the columns that were specified.

filter is used to select a subset of the rows in the DataFrame.

group_by is used to analyze groups of observations and calculate summary statistics.
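
As a minimal sketch of what each context looks like (using the df DataFrame from above; the first three are covered in detail below, and the column names like rt_ms are just illustrative):

df.select(["subj", "phase", "RT"])            # new DataFrame with a subset of columns
df.with_columns(rt_ms=pl.col("RT") * 1000)    # add a derived column, keeping the rest
df.filter(pl.col("phase") == "test")          # keep a subset of rows
df.group_by("subj").agg(pl.col("RT").mean())  # mean RT within each subject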

Using select and with_columns#

We can use select to create a new DataFrame using expressions that access data from an existing DataFrame.

We can select multiple columns, and optionally change their order, using the select method.

df.select(["subj", "phase", "response", "RT"]).head()
shape: (5, 4)
┌──────┬─────────┬──────────┬──────┐
│ subj ┆ phase   ┆ response ┆ RT   │
│ ---  ┆ ---     ┆ ---      ┆ ---  │
│ i64  ┆ str     ┆ i64      ┆ f64  │
╞══════╪═════════╪══════════╪══════╡
│ 101  ┆ "study" ┆ -1       ┆ -1.0 │
│ 101  ┆ "study" ┆ -1       ┆ -1.0 │
│ 101  ┆ "study" ┆ -1       ┆ -1.0 │
│ 101  ┆ "study" ┆ -1       ┆ -1.0 │
│ 101  ┆ "study" ┆ -1       ┆ -1.0 │
└──────┴─────────┴──────────┴──────┘

The select method is very powerful and can be used for much more than just getting a subset of columns. We can also use it to transform columns in many different ways. Take the example of the people data, which includes weight and height.

people = pl.read_csv("people.csv")
people
shape: (4, 4)
┌──────────────────┬──────────────┬────────┬────────┐
│ name             ┆ birthdate    ┆ weight ┆ height │
│ ---              ┆ ---          ┆ ---    ┆ ---    │
│ str              ┆ str          ┆ f64    ┆ f64    │
╞══════════════════╪══════════════╪════════╪════════╡
│ "Alice Archer"   ┆ "1997-01-10" ┆ 57.9   ┆ 1.56   │
│ "Ben Brown"      ┆ "1985-02-15" ┆ 72.5   ┆ 1.77   │
│ "Chloe Cooper"   ┆ "1983-03-22" ┆ 53.6   ┆ 1.65   │
│ "Daniel Donovan" ┆ "1981-04-30" ┆ 83.1   ┆ 1.75   │
└──────────────────┴──────────────┴────────┴────────┘

We can use these data to calculate a new column with body mass index (BMI). First, we’ll define an expression for the value we want to calculate. Expressions specify something we want to do without actually doing it yet. This allows Polars to optimize how it runs different operations. The pl.col function can be used to specify one or more columns in a DataFrame, as a placeholder. Notice that we define the expression without referring to any specific DataFrame.

bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]

We can use this expression with our DataFrame to calculate BMI. We can specify what to name the column using keyword arguments. Here, we set the keyword argument bmi to the expression that we defined before.

people.select(bmi=bmi_expr)
shape: (4, 1)
┌───────────┐
│ bmi       │
│ ---       │
│ f64       │
╞═══════════╡
│ 23.791913 │
│ 23.141498 │
│ 19.687787 │
│ 27.134694 │
└───────────┘

Before, we used a variable to store the BMI expression, but usually we’ll write everything all together like this.

people.select(bmi=pl.col("weight") / (pl.col("height") ** 2))
shape: (4, 1)
┌───────────┐
│ bmi       │
│ ---       │
│ f64       │
╞═══════════╡
│ 23.791913 │
│ 23.141498 │
│ 19.687787 │
│ 27.134694 │
└───────────┘

The select method only returns the columns that were defined in the call. If we want to keep existing columns, we can instead use with_columns.

people.with_columns(bmi=pl.col("weight") / (pl.col("height") ** 2))
shape: (4, 5)
┌──────────────────┬──────────────┬────────┬────────┬───────────┐
│ name             ┆ birthdate    ┆ weight ┆ height ┆ bmi       │
│ ---              ┆ ---          ┆ ---    ┆ ---    ┆ ---       │
│ str              ┆ str          ┆ f64    ┆ f64    ┆ f64       │
╞══════════════════╪══════════════╪════════╪════════╪═══════════╡
│ "Alice Archer"   ┆ "1997-01-10" ┆ 57.9   ┆ 1.56   ┆ 23.791913 │
│ "Ben Brown"      ┆ "1985-02-15" ┆ 72.5   ┆ 1.77   ┆ 23.141498 │
│ "Chloe Cooper"   ┆ "1983-03-22" ┆ 53.6   ┆ 1.65   ┆ 19.687787 │
│ "Daniel Donovan" ┆ "1981-04-30" ┆ 83.1   ┆ 1.75   ┆ 27.134694 │
└──────────────────┴──────────────┴────────┴────────┴───────────┘

We can pass in multiple inputs to specify multiple columns.

people.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    reference_bmi=25,
)
shape: (4, 7)
┌──────────────────┬──────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name             ┆ birthdate    ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ reference_bmi │
│ ---              ┆ ---          ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           │
│ str              ┆ str          ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           │
╞══════════════════╪══════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ "Alice Archer"   ┆ "1997-01-10" ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            │
│ "Ben Brown"      ┆ "1985-02-15" ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            │
│ "Chloe Cooper"   ┆ "1983-03-22" ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            │
│ "Daniel Donovan" ┆ "1981-04-30" ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            │
└──────────────────┴──────────────┴────────┴────────┴───────────┴───────────┴───────────────┘

Here, we create two columns that are the same for all rows. The first one is the mean BMI across people. Note that we can chain together parts of an expression; this example calculates BMI for each person and then calculates the mean across people. The second is a reference value for BMI, which is set to 25 for every row.
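
Keyword arguments are one way to name output columns; chaining the alias method onto an expression is another. Here is a sketch that should give the same result; pl.lit wraps a plain value, like 25, so it can be used as an expression.

people.with_columns(
    bmi_expr.alias("bmi"),
    bmi_expr.mean().alias("avg_bmi"),
    pl.lit(25).alias("reference_bmi"),
)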

Exercise: using select and with_columns#

Create the DataFrame scores as defined below.

participant_id  score1  score2
001             9       5
002             7       3
003             10      6

Create a new column called total that is the sum of the score1 and score2 columns, and a new column average that is the mean of the score1 and score2 columns. Compare what happens when you use select and with_columns.

# answer here

Using filter#

We can use filter to create a new DataFrame with a subset of rows.

We can filter the rows included in the table using the filter method. This allows for selecting data very flexibly, similar to how we used logical expressions to select subsets of NumPy arrays. To create an expression that refers to the value of a column, we must use the pl.col function.

df.filter(pl.col("phase") == "test").head()
shape: (5, 11)
┌──────┬───────┬───────┬────────┬──────────────┬──────────┬───────────┬──────────┬───────┬─────────┬─────┐
│ subj ┆ cycle ┆ trial ┆ phase  ┆ type         ┆ word1    ┆ word2     ┆ response ┆ RT    ┆ correct ┆ lag │
│ ---  ┆ ---   ┆ ---   ┆ ---    ┆ ---          ┆ ---      ┆ ---       ┆ ---      ┆ ---   ┆ ---     ┆ --- │
│ i64  ┆ i64   ┆ i64   ┆ str    ┆ str          ┆ str      ┆ str       ┆ i64      ┆ f64   ┆ i64     ┆ i64 │
╞══════╪═══════╪═══════╪════════╪══════════════╪══════════╪═══════════╪══════════╪═══════╪═════════╪═════╡
│ 101  ┆ 0     ┆ -1    ┆ "test" ┆ "rearranged" ┆ "waste"  ┆ "degree"  ┆ 0        ┆ 2.312 ┆ 1       ┆ 2   │
│ 101  ┆ 0     ┆ 0     ┆ "test" ┆ "rearranged" ┆ "needed" ┆ "able"    ┆ 0        ┆ 3.542 ┆ 1       ┆ 1   │
│ 101  ┆ 0     ┆ 1     ┆ "test" ┆ "rearranged" ┆ "single" ┆ "clean"   ┆ 0        ┆ 2.084 ┆ 1       ┆ 3   │
│ 101  ┆ 0     ┆ 2     ┆ "test" ┆ "rearranged" ┆ "train"  ┆ "useful"  ┆ 0        ┆ 1.669 ┆ 1       ┆ 2   │
│ 101  ┆ 0     ┆ 3     ┆ "test" ┆ "rearranged" ┆ "knees"  ┆ "various" ┆ 0        ┆ 2.326 ┆ 1       ┆ 5   │
└──────┴───────┴───────┴────────┴──────────────┴──────────┴───────────┴──────────┴───────┴─────────┴─────┘

Here, we indicate that we only want rows where the phase is test, to get only the test trials and exclude all the study trials.

We can make more complicated expressions using operators for and (&), or (|), and not (~). If you are combining multiple comparisons, you will need to add parentheses around the individual comparisons.

targets = df.filter((pl.col("phase") == "test") & (pl.col("type") == "intact"))
targets.head()
shape: (5, 11)
┌──────┬───────┬───────┬────────┬──────────┬──────────┬────────────┬──────────┬───────┬─────────┬─────┐
│ subj ┆ cycle ┆ trial ┆ phase  ┆ type     ┆ word1    ┆ word2      ┆ response ┆ RT    ┆ correct ┆ lag │
│ ---  ┆ ---   ┆ ---   ┆ ---    ┆ ---      ┆ ---      ┆ ---        ┆ ---      ┆ ---   ┆ ---     ┆ --- │
│ i64  ┆ i64   ┆ i64   ┆ str    ┆ str      ┆ str      ┆ str        ┆ i64      ┆ f64   ┆ i64     ┆ i64 │
╞══════╪═══════╪═══════╪════════╪══════════╪══════════╪════════════╪══════════╪═══════╪═════════╪═════╡
│ 101  ┆ 0     ┆ 4     ┆ "test" ┆ "intact" ┆ "skin"   ┆ "careful"  ┆ 1        ┆ 1.407 ┆ 1       ┆ -1  │
│ 101  ┆ 0     ┆ 5     ┆ "test" ┆ "intact" ┆ "doctor" ┆ "contrast" ┆ 0        ┆ 4.056 ┆ 0       ┆ -1  │
│ 101  ┆ 0     ┆ 7     ┆ "test" ┆ "intact" ┆ "homes"  ┆ "fuel"     ┆ 1        ┆ 2.499 ┆ 1       ┆ -1  │
│ 101  ┆ 0     ┆ 8     ┆ "test" ┆ "intact" ┆ "liked"  ┆ "tone"     ┆ 1        ┆ 1.609 ┆ 1       ┆ -1  │
│ 101  ┆ 0     ┆ 9     ┆ "test" ┆ "intact" ┆ "notice" ┆ "explain"  ┆ 1        ┆ 1.352 ┆ 1       ┆ -1  │
└──────┴───────┴───────┴────────┴──────────┴──────────┴────────────┴──────────┴───────┴─────────┴─────┘

We can put together multiple operations by chaining methods. Here, we’ll select some columns and filter the rows in one go.

df_test = (
    df.filter(pl.col("phase") == "test")
    .select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_test.head(3)
shape: (3, 8)
┌──────┬───────┬────────┬──────────────┬──────────┬───────┬─────────┬─────┐
│ subj ┆ trial ┆ phase  ┆ type         ┆ response ┆ RT    ┆ correct ┆ lag │
│ ---  ┆ ---   ┆ ---    ┆ ---          ┆ ---      ┆ ---   ┆ ---     ┆ --- │
│ i64  ┆ i64   ┆ str    ┆ str          ┆ i64      ┆ f64   ┆ i64     ┆ i64 │
╞══════╪═══════╪════════╪══════════════╪══════════╪═══════╪═════════╪═════╡
│ 101  ┆ -1    ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.312 ┆ 1       ┆ 2   │
│ 101  ┆ 0     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 3.542 ┆ 1       ┆ 1   │
│ 101  ┆ 1     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.084 ┆ 1       ┆ 3   │
└──────┴───────┴────────┴──────────────┴──────────┴───────┴─────────┴─────┘

Here, we put parentheses around the whole chain of method calls. This lets us split the different calls so there is one for each line. It tends to make things easier to read.

Exercise: using filter#

Use the people DataFrame. If you don’t have the variable defined currently, you can read it in from people.csv. Use filter to get people whose height is less than 1.7 meters and whose weight is greater than 55 kilograms.

# answer here

Calculating summary statistics#

When getting a feel for a dataset, it can be very helpful to calculate some summary statistics to measure central tendency and spread.

First, let’s use method chaining to set up a DataFrame with data from the test phase. This example reads the data from the file, filters it to get the test phase, and selects some columns to focus on.

df_summary = (
    pl.read_csv("exp1.csv")
    .filter(pl.col("phase") == "test")
    .select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_summary.head()
shape: (5, 8)
┌──────┬───────┬────────┬──────────────┬──────────┬───────┬─────────┬─────┐
│ subj ┆ trial ┆ phase  ┆ type         ┆ response ┆ RT    ┆ correct ┆ lag │
│ ---  ┆ ---   ┆ ---    ┆ ---          ┆ ---      ┆ ---   ┆ ---     ┆ --- │
│ i64  ┆ i64   ┆ str    ┆ str          ┆ i64      ┆ f64   ┆ i64     ┆ i64 │
╞══════╪═══════╪════════╪══════════════╪══════════╪═══════╪═════════╪═════╡
│ 101  ┆ -1    ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.312 ┆ 1       ┆ 2   │
│ 101  ┆ 0     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 3.542 ┆ 1       ┆ 1   │
│ 101  ┆ 1     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.084 ┆ 1       ┆ 3   │
│ 101  ┆ 2     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 1.669 ┆ 1       ┆ 2   │
│ 101  ┆ 3     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.326 ┆ 1       ┆ 5   │
└──────┴───────┴────────┴──────────────┴──────────┴───────┴─────────┴─────┘

We can get some common summary statistics for all the columns using describe.

df_summary.describe()
shape: (9, 9)
┌──────────────┬────────────┬───────────┬─────────┬──────────────┬──────────┬──────────┬──────────┬──────────┐
│ statistic    ┆ subj       ┆ trial     ┆ phase   ┆ type         ┆ response ┆ RT       ┆ correct  ┆ lag      │
│ ---          ┆ ---        ┆ ---       ┆ ---     ┆ ---          ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ str          ┆ f64        ┆ f64       ┆ str     ┆ str          ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞══════════════╪════════════╪═══════════╪═════════╪══════════════╪══════════╪══════════╪══════════╪══════════╡
│ "count"      ┆ 53700.0    ┆ 53700.0   ┆ "53700" ┆ "53700"      ┆ 53700.0  ┆ 53700.0  ┆ 53700.0  ┆ 53700.0  │
│ "null_count" ┆ 0.0        ┆ 0.0       ┆ "0"     ┆ "0"          ┆ 0.0      ┆ 0.0      ┆ 0.0      ┆ 0.0      │
│ "mean"       ┆ 165.655866 ┆ 28.5      ┆ null    ┆ null         ┆ 0.372607 ┆ 1.331173 ┆ 0.656648 ┆ 1.0      │
│ "std"        ┆ 99.008229  ┆ 17.318264 ┆ null    ┆ null         ┆ 0.48554  ┆ 0.76009  ┆ 0.474832 ┆ 2.236089 │
│ "min"        ┆ 101.0      ┆ -1.0      ┆ "test"  ┆ "intact"     ┆ -1.0     ┆ -1.0     ┆ 0.0      ┆ -1.0     │
│ "25%"        ┆ 128.0      ┆ 14.0      ┆ null    ┆ null         ┆ 0.0      ┆ 0.882    ┆ 0.0      ┆ -1.0     │
│ "50%"        ┆ 158.0      ┆ 29.0      ┆ null    ┆ null         ┆ 0.0      ┆ 1.125    ┆ 1.0      ┆ 1.0      │
│ "75%"        ┆ 186.0      ┆ 43.0      ┆ null    ┆ null         ┆ 1.0      ┆ 1.526    ┆ 1.0      ┆ 3.0      │
│ "max"        ┆ 1150.0     ┆ 58.0      ┆ "test"  ┆ "rearranged" ┆ 1.0      ┆ 7.925    ┆ 1.0      ┆ 5.0      │
└──────────────┴────────────┴───────────┴─────────┴──────────────┴──────────┴──────────┴──────────┴──────────┘

The count row gives the number of non-null samples in each column, while the null_count row indicates how many samples are null.

The mean and std rows give the mean and standard deviation.

The min and max rows give the minimum and maximum values in each column. The percentage rows give percentiles of the data; the 50% percentile is also known as the median.

We can calculate more targeted statistics using expressions.

df_summary.select(pl.col("response", "RT").mean())
shape: (1, 2)
┌──────────┬──────────┐
│ response ┆ RT       │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.372607 ┆ 1.331173 │
└──────────┴──────────┘

Note that here we use pl.col to select multiple columns. The mean function is then applied to both columns.
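
Other statistics follow the same pattern. For example, this sketch calculates the median and the 90th percentile of response time; alias renames the output of an expression, and the names RT_median and RT_p90 are just illustrative.

df_summary.select(
    pl.col("RT").median().alias("RT_median"),
    pl.col("RT").quantile(0.9).alias("RT_p90"),
)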

We can use .name.suffix to distinguish these columns from existing ones, by adding a suffix.

result = df_summary.with_columns(pl.col("response", "RT").mean().name.suffix("_mean"))
result.head()
shape: (5, 10)
┌──────┬───────┬────────┬──────────────┬──────────┬───────┬─────────┬─────┬───────────────┬──────────┐
│ subj ┆ trial ┆ phase  ┆ type         ┆ response ┆ RT    ┆ correct ┆ lag ┆ response_mean ┆ RT_mean  │
│ ---  ┆ ---   ┆ ---    ┆ ---          ┆ ---      ┆ ---   ┆ ---     ┆ --- ┆ ---           ┆ ---      │
│ i64  ┆ i64   ┆ str    ┆ str          ┆ i64      ┆ f64   ┆ i64     ┆ i64 ┆ f64           ┆ f64      │
╞══════╪═══════╪════════╪══════════════╪══════════╪═══════╪═════════╪═════╪═══════════════╪══════════╡
│ 101  ┆ -1    ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.312 ┆ 1       ┆ 2   ┆ 0.372607      ┆ 1.331173 │
│ 101  ┆ 0     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 3.542 ┆ 1       ┆ 1   ┆ 0.372607      ┆ 1.331173 │
│ 101  ┆ 1     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.084 ┆ 1       ┆ 3   ┆ 0.372607      ┆ 1.331173 │
│ 101  ┆ 2     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 1.669 ┆ 1       ┆ 2   ┆ 0.372607      ┆ 1.331173 │
│ 101  ┆ 3     ┆ "test" ┆ "rearranged" ┆ 0        ┆ 2.326 ┆ 1       ┆ 5   ┆ 0.372607      ┆ 1.331173 │
└──────┴───────┴────────┴──────────────┴──────────┴───────┴─────────┴─────┴───────────────┴──────────┘

See the Polars documentation for information about other descriptive statistics.
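
Summary statistics are often most useful when calculated within groups. As a sketch, the group_by context mentioned earlier can calculate the same means separately for intact and rearranged pairs:

df_summary.group_by("type").agg(pl.col("response", "RT").mean().name.suffix("_mean"))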

Exercise: summary statistics#

Use the people DataFrame with describe to calculate summary statistics for the weight and height columns. Do you understand what each of the statistics means?

# answer here

Summary#

Data in table format, such as data stored in spreadsheet files, can be represented in Python using DataFrames. A DataFrame is similar to a dictionary holding a set of NumPy arrays, one per column. Each column must contain data of a single datatype.

The Polars package has many tools for working with DataFrames. The select and with_columns methods can be used to select a subset of existing columns or to create new columns using expressions. The filter method makes it possible to flexibly choose a subset of rows.

The data in a DataFrame can be summarized using the describe method or various descriptive statistics functions such as mean and std.