Python Pandas Data Frames Cheat Sheet#

Note: the course focuses on using Polars, a package that was recently developed to take advantage of progress in technology and interface design, and which is more similar to other data science approaches such as SQL and R’s Tidyverse. This guide demonstrates how to do the same operations discussed in the chapter using the popular Pandas package instead of Polars. Note that it is generally simple to convert between Pandas and Polars data frames, so both packages can be used in a given workflow.

Importing data#

Use Pandas functions to read data into a Pandas DataFrame. Pandas DataFrames use NumPy arrays to store data within each column. Unlike Polars, Pandas assigns an index to each row. Here, the default index works like the indices of a Python list, numbering the four rows from 0 to 3.

import pandas as pd               # import the pandas package
data = pd.read_csv("people.csv")  # use read_csv to read CSV files
data                              # display the data frame
             name   birthdate  weight  height
0    Alice Archer  1997-01-10    57.9    1.56
1       Ben Brown  1985-02-15    72.5    1.77
2    Chloe Cooper  1983-03-22    53.6    1.65
3  Daniel Donovan  1981-04-30    83.1    1.75
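Because Pandas keeps index labels with each row, a filtered frame retains the original labels rather than renumbering. A minimal sketch, built in code so it runs without people.csv:

```python
import pandas as pd

data = pd.DataFrame({"name": ["Alice", "Ben"], "weight": [57.9, 72.5]})
print(data.index)                # default RangeIndex starting at 0
filtered = data[data["weight"] > 60]
print(filtered.index.tolist())   # [1] -- the original label is kept
print(filtered.reset_index(drop=True).index.tolist())  # [0] -- renumbered
```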

Data frame attributes#

Use DataFrame attributes to get information about how the data are organized. This is similar to Polars, but with different ways to get the numbers of rows and columns.

print(data.columns)       # column names
print(data.dtypes)        # data type of each column
print(data.shape)         # tuple with the number of rows and columns
print(len(data))          # number of rows
print(len(data.columns))  # number of columns
Index(['name', 'birthdate', 'weight', 'height'], dtype='object')
name          object
birthdate     object
weight       float64
height       float64
dtype: object
(4, 4)
4
4

Creating a data frame directly#

To make a DataFrame in your code, rather than inputting it from a file, use pd.DataFrame. The call works the same as in Polars, but note that the resulting Pandas DataFrame is a different type from a Polars DataFrame, although many third-party functions accept either.

To use pd.DataFrame, make a dictionary (use curly braces, {}) with a key for each column in the DataFrame. Each key maps to a list of values, which become the rows of that column.

data = pd.DataFrame(
    {
        "participant_id": ["001", "002", "003", "004"],
        "age": [25, 32, 65, 42],
        "condition": ["restudy", "test", "restudy", "test"],
        "score1": [3, 6, 2, 6],
        "score2": [8, 2, 4, 5],
    }
)
data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5
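pd.DataFrame also accepts a list of per-row dictionaries, which can be convenient when records arrive one at a time. A sketch of this alternative, not from the chapter:

```python
import pandas as pd

rows = [
    {"participant_id": "001", "age": 25, "condition": "restudy"},
    {"participant_id": "002", "age": 32, "condition": "test"},
]
data = pd.DataFrame(rows)  # dict keys become columns, one dict per row
print(data.shape)          # (2, 3)
```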

Accessing data in a data frame#

As in Polars, you can use indexing ([]) to access individual columns.

print(type(data["score1"]))  # each column is a Series
data["score1"]               # indexing a column results in a Series
<class 'pandas.core.series.Series'>
0    3
1    6
2    2
3    6
Name: score1, dtype: int64
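Rows can be accessed as well, with .loc (by index label) and .iloc (by integer position). A minimal sketch using a cut-down version of the participant data:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "participant_id": ["001", "002", "003", "004"],
        "score1": [3, 6, 2, 6],
    }
)
print(data.loc[2, "score1"])  # single value by index label and column: 2
print(data.iloc[0])           # first row, returned as a Series
print(data.iloc[1:3])         # rows at positions 1 and 2
```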

As in Polars, columns may be exported to NumPy arrays, allowing data to be analyzed using NumPy functions. But usually it’s more efficient to use DataFrame functions for analysis.

score1 = data["score1"].to_numpy()  # access a column and convert to an array
score2 = data["score2"].to_numpy()
diff = score1 - score2              # now can use NumPy operations and functions
print(score1)
print(score2)
print(diff)
[3 6 2 6]
[8 2 4 5]
[-5  4 -2  1]
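The same arithmetic also works directly on the Series columns, without converting to NumPy first. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"score1": [3, 6, 2, 6], "score2": [8, 2, 4, 5]})
diff = data["score1"] - data["score2"]  # element-wise; the result is a Series
print(diff.tolist())                    # [-5, 4, -2, 1]
print(diff.mean())                      # -0.5
```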

Expressions#

Pandas currently does not support Polars-like expressions. However, starting with version 3.0, Pandas is adding similar expression syntax inspired by newer packages such as Polars.

Using select and with_columns#

Pandas does not have an equivalent of the select function used in Polars. Instead, a subset of columns can be selected and reordered using indexing.

data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5

Use square brackets ([]) with a list of columns to reorder them and/or get a subset of columns.

data[["score1", "score2", "participant_id"]]
   score1  score2 participant_id
0       3       8            001
1       6       2            002
2       2       4            003
3       6       5            004

New columns can be added either using assignment or using the assign function.

subset = data[["score1", "score2"]].copy()  # copy so the new column does not modify a view of data
subset["score_total"] = subset["score1"] + subset["score2"]
subset
   score1  score2  score_total
0       3       8           11
1       6       2            8
2       2       4            6
3       6       5           11
subset.assign(score_total=data["score1"] + data["score2"])
   score1  score2  score_total
0       3       8           11
1       6       2            8
2       2       4            6
3       6       5           11
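assign also accepts a callable, which lets a new column refer to the frame produced by the previous step in a chain. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"score1": [3, 6, 2, 6], "score2": [8, 2, 4, 5]})
result = (
    data.assign(score_total=lambda d: d["score1"] + d["score2"])
        .assign(score_mean=lambda d: d["score_total"] / 2)  # uses the column just added
)
print(result["score_mean"].tolist())  # [5.5, 4.0, 3.0, 5.5]
```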

Using filter#

Pandas does not have a function that works the same as the Polars filter function. Rows may be filtered using logical indexing (similar to filtering NumPy arrays) or by passing a query string to query.

data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5
data[data["score1"] > 2]  # rows where score1 is greater than 2
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data.query("score1 > 2")  # same thing, using query instead
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data[(data["score1"] > 2) & (data["score2"] > 2)]  # both scores > 2
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
3            004   42      test       6       5
data.query("score1 > 2 and score2 > 2")  # same thing, using query instead
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
3            004   42      test       6       5
data[data["participant_id"] == "001"]  # filter based on strings
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
data.query("participant_id == '001'")  # using query (note the nested quotation marks)
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
data[data["participant_id"] != "003"]  # use != for not equal
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data.query("participant_id != '003'")  # get rows with non-matching strings
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
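To match any of several values, isin (or the in keyword inside a query string) avoids chaining many == comparisons together. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"participant_id": ["001", "002", "003", "004"], "score1": [3, 6, 2, 6]}
)
matched = data[data["participant_id"].isin(["001", "003"])]  # rows matching either ID
print(matched)
print(data.query("participant_id in ['001', '003']"))  # same result with a query string
```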

Comparison expressions#

Pandas currently does not support Polars-like expressions, but query can be used to filter rows in a similar manner using query strings such as the ones shown below.

"age == 25"   # rows where age equals 25
"age != 25"   # rows where age does not equal 25
"age < 25"    # rows where age is less than 25
"age <= 25"   # rows where age is less than or equal to 25
"age > 25"    # rows where age is greater than 25
"age >= 25";  # rows where age is greater than or equal to 25
"participant_id == '001'"   # check string equality
"participant_id != '001'";  # check string inequality
"score1 > 2 and score2 > 2"           # both greater than 2
"score1 == 1 or score2 == 1"          # either is 1
"not (score1 == 1 and score2 == 1)";  # not both 1

Sorting data#

Use sort_values to rearrange the order of the rows in a data frame. It works similarly to Polars' sort, but to reverse the order you set the ascending argument rather than Polars' descending argument.

data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5
data.sort_values("score1")  # sort based on score1, in ascending order
  participant_id  age condition  score1  score2
2            003   65   restudy       2       4
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data.sort_values("score1", ascending=False)  # sort in descending order
  participant_id  age condition  score1  score2
1            002   32      test       6       2
3            004   42      test       6       5
0            001   25   restudy       3       8
2            003   65   restudy       2       4
data.sort_values(["condition", "participant_id"])  # sort by multiple columns
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
2            003   65   restudy       2       4
1            002   32      test       6       2
3            004   42      test       6       5
data.sort_values(
    ["condition", "score2"], ascending=[True, False]
)  # condition ascending, score2 descending
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
2            003   65   restudy       2       4
3            004   42      test       6       5
1            002   32      test       6       2
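Because sort_values keeps the original index labels, chain reset_index(drop=True) when a fresh 0-to-n index is wanted after sorting. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"participant_id": ["001", "002", "003"], "score1": [3, 6, 2]}
)
sorted_data = data.sort_values("score1").reset_index(drop=True)  # renumber rows
print(sorted_data.index.tolist())              # [0, 1, 2]
print(sorted_data["participant_id"].tolist())  # ['003', '001', '002']
```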