Python Pandas Data Frames Cheat Sheet#

Note: the course focuses on using Polars, a package that was recently developed to take advantage of progress in technology and interface design, and which is more similar to other data science approaches such as SQL and R’s Tidyverse. This guide demonstrates how to do the same operations discussed in the chapter using the popular Pandas package instead of Polars. Note that it is generally simple to convert between Pandas and Polars data frames, so both packages can be used in a given workflow.

Importing data#

Use Pandas functions to read data into a Pandas DataFrame. Pandas DataFrames use NumPy arrays to store data within each column. Unlike Polars, Pandas assigns an index to each row. Here, the default index works like the indices of a Python list, numbering the four rows from 0 to 3.

import pandas as pd               # import the pandas package
data = pd.read_csv("people.csv")  # use read_csv to read CSV files
data                              # display the data frame
             name   birthdate  weight  height
0    Alice Archer  1997-01-10    57.9    1.56
1       Ben Brown  1985-02-15    72.5    1.77
2    Chloe Cooper  1983-03-22    53.6    1.65
3  Daniel Donovan  1981-04-30    83.1    1.75
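Because Pandas keeps index labels with each row, a filtered frame retains the original labels rather than renumbering. A minimal sketch, built in code so it runs without people.csv:

```python
import pandas as pd

data = pd.DataFrame({"name": ["Alice", "Ben"], "weight": [57.9, 72.5]})
print(data.index)                # default RangeIndex starting at 0
filtered = data[data["weight"] > 60]
print(filtered.index.tolist())   # [1] -- the original label is kept
print(filtered.reset_index(drop=True).index.tolist())  # [0] -- renumbered
```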

Data frame attributes#

Use DataFrame attributes to get information about how the data are organized. This is similar to Polars, but with different ways to get the numbers of rows and columns.

print(data.columns)       # column names
print(data.dtypes)        # data type of each column
print(data.shape)         # tuple with the number of rows and columns
print(len(data))          # number of rows
print(len(data.columns))  # number of columns
Index(['name', 'birthdate', 'weight', 'height'], dtype='object')
name          object
birthdate     object
weight       float64
height       float64
dtype: object
(4, 4)
4
4

Creating a data frame directly#

To make a DataFrame in your code, rather than inputting it from a file, use pd.DataFrame. The call works the same as in Polars, but note that the resulting Pandas DataFrame is a different type from a Polars DataFrame, although many third-party functions accept either.

To use pd.DataFrame, make a dictionary (use curly braces, {}) with a key for each column in the DataFrame. Each key maps to a list of values, which become the rows of that column.

data = pd.DataFrame(
    {
        "participant_id": ["001", "002", "003", "004"],
        "age": [25, 32, 65, 42],
        "condition": ["restudy", "test", "restudy", "test"],
        "score1": [3, 6, 2, 6],
        "score2": [8, 2, 4, 5],
    }
)
data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5
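pd.DataFrame also accepts a list of per-row dictionaries, which can be convenient when records arrive one at a time. A sketch of this alternative, not from the chapter:

```python
import pandas as pd

rows = [
    {"participant_id": "001", "age": 25, "condition": "restudy"},
    {"participant_id": "002", "age": 32, "condition": "test"},
]
data = pd.DataFrame(rows)  # dict keys become columns, one dict per row
print(data.shape)          # (2, 3)
```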

Accessing data in a data frame#

As in Polars, you can use indexing ([]) to access individual columns.

print(type(data["score1"]))  # each column is a Series
data["score1"]               # indexing a column results in a Series
<class 'pandas.core.series.Series'>
0    3
1    6
2    2
3    6
Name: score1, dtype: int64
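Rows can be accessed as well, with .loc (by index label) and .iloc (by integer position). A minimal sketch using a cut-down version of the participant data:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "participant_id": ["001", "002", "003", "004"],
        "score1": [3, 6, 2, 6],
    }
)
print(data.loc[2, "score1"])  # single value by index label and column: 2
print(data.iloc[0])           # first row, returned as a Series
print(data.iloc[1:3])         # rows at positions 1 and 2
```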

As in Polars, columns may be exported to NumPy arrays, allowing data to be analyzed using NumPy functions. But usually it’s more efficient to use DataFrame functions for analysis.

score1 = data["score1"].to_numpy()  # access a column and convert to an array
score2 = data["score2"].to_numpy()
diff = score1 - score2              # now can use NumPy operations and functions
print(score1)
print(score2)
print(diff)
[3 6 2 6]
[8 2 4 5]
[-5  4 -2  1]
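The same arithmetic also works directly on the Series columns, without converting to NumPy first. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"score1": [3, 6, 2, 6], "score2": [8, 2, 4, 5]})
diff = data["score1"] - data["score2"]  # element-wise; the result is a Series
print(diff.tolist())                    # [-5, 4, -2, 1]
print(diff.mean())                      # -0.5
```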

Expressions#

Pandas currently does not support Polars-like expressions. However, starting with version 3.0, Pandas is adding similar expression syntax inspired by newer packages such as Polars.

Using select and with_columns#

Pandas does not have an equivalent of the select function used in Polars. Instead, a subset of columns can be selected and reordered using indexing.

data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5

Use square brackets ([]) with a list of columns to reorder them and/or get a subset of columns.

data[["score1", "score2", "participant_id"]]
   score1  score2 participant_id
0       3       8            001
1       6       2            002
2       2       4            003
3       6       5            004

New columns can be added either using assignment or using the assign function.

subset = data[["score1", "score2"]].copy()  # copy so the new column does not modify a view of data
subset["score_total"] = subset["score1"] + subset["score2"]
subset
   score1  score2  score_total
0       3       8           11
1       6       2            8
2       2       4            6
3       6       5           11
subset.assign(score_total=data["score1"] + data["score2"])
   score1  score2  score_total
0       3       8           11
1       6       2            8
2       2       4            6
3       6       5           11
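assign also accepts a callable, which lets a new column refer to the frame produced by the previous step in a chain. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"score1": [3, 6, 2, 6], "score2": [8, 2, 4, 5]})
result = (
    data.assign(score_total=lambda d: d["score1"] + d["score2"])
        .assign(score_mean=lambda d: d["score_total"] / 2)  # uses the column just added
)
print(result["score_mean"].tolist())  # [5.5, 4.0, 3.0, 5.5]
```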

Using filter#

Pandas does not have a function that works the same as the Polars filter function. Rows may be filtered using logical indexing (similar to filtering NumPy arrays) or by passing a query string to query.

data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5
data[data["score1"] > 2]  # rows where score1 is greater than 2
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data.query("score1 > 2")  # same thing, using query instead
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data[(data["score1"] > 2) & (data["score2"] > 2)]  # both scores > 2
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
3            004   42      test       6       5
data.query("score1 > 2 and score2 > 2")  # same thing, using query instead
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
3            004   42      test       6       5
data[data["participant_id"] == "001"]  # filter based on strings
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
data.query("participant_id == '001'")  # using query (note the nested quotation marks)
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
data[data["participant_id"] != "003"]  # use != for not equal
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data.query("participant_id != '003'")  # get rows with non-matching strings
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
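To match any of several values, isin (or the in keyword inside a query string) avoids chaining many == comparisons together. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"participant_id": ["001", "002", "003", "004"], "score1": [3, 6, 2, 6]}
)
matched = data[data["participant_id"].isin(["001", "003"])]  # rows matching either ID
print(matched)
print(data.query("participant_id in ['001', '003']"))  # same result with a query string
```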

Comparison expressions#

Pandas currently does not support Polars-like expressions, but query can be used to filter rows in a similar manner using query strings such as the ones shown below.

"age == 25"   # rows where age equals 25
"age != 25"   # rows where age does not equal 25
"age < 25"    # rows where age is less than 25
"age <= 25"   # rows where age is less than or equal to 25
"age > 25"    # rows where age is greater than 25
"age >= 25";  # rows where age is greater than or equal to 25
"participant_id == '001'"   # check string equality
"participant_id != '001'";  # check string inequality
"score1 > 2 and score2 > 2"           # both greater than 2
"score1 == 1 or score2 == 1"          # either is 1
"not (score1 == 1 and score2 == 1)";  # not both 1

Sorting data#

Use sort_values to rearrange the order of the rows in a data frame. It works similarly to Polars' sort, but to reverse the order you set the ascending argument rather than Polars' descending argument.

data
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
1            002   32      test       6       2
2            003   65   restudy       2       4
3            004   42      test       6       5
data.sort_values("score1")  # sort based on score1, in ascending order
  participant_id  age condition  score1  score2
2            003   65   restudy       2       4
0            001   25   restudy       3       8
1            002   32      test       6       2
3            004   42      test       6       5
data.sort_values("score1", ascending=False)  # sort in descending order
  participant_id  age condition  score1  score2
1            002   32      test       6       2
3            004   42      test       6       5
0            001   25   restudy       3       8
2            003   65   restudy       2       4
data.sort_values(["condition", "participant_id"])  # sort by multiple columns
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
2            003   65   restudy       2       4
1            002   32      test       6       2
3            004   42      test       6       5
data.sort_values(
    ["condition", "score2"], ascending=[True, False]
)  # condition ascending, score2 descending
  participant_id  age condition  score1  score2
0            001   25   restudy       3       8
2            003   65   restudy       2       4
3            004   42      test       6       5
1            002   32      test       6       2
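Because sort_values keeps the original index labels, chain reset_index(drop=True) when a fresh 0-to-n index is wanted after sorting. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"participant_id": ["001", "002", "003"], "score1": [3, 6, 2]}
)
sorted_data = data.sort_values("score1").reset_index(drop=True)  # renumber rows
print(sorted_data.index.tolist())              # [0, 1, 2]
print(sorted_data["participant_id"].tolist())  # ['003', '001', '002']
```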