Python Data Frames Cheat Sheet

Python Data Frames Cheat Sheet#

Importing data#

Use Polars functions to read data into a DataFrame or write a DataFrame to a file.

import polars as pl               # import the polars package
data = pl.read_csv("people.csv")  # use read_csv to read CSV files
data                              # display the data frame

shape: (4, 4)

name	birthdate	weight	height
str	str	f64	f64
"Alice Archer"	"1997-01-10"	57.9	1.56
"Ben Brown"	"1985-02-15"	72.5	1.77
"Chloe Cooper"	"1983-03-22"	53.6	1.65
"Daniel Donovan"	"1981-04-30"	83.1	1.75

Data frame attributes#

Use DataFrame attributes to get information about how the data are organized.

print(data.columns)  # column names in a list of strings
print(data.dtypes)   # data type of each column
print(data.shape)    # tuple with the number of rows and columns
print(data.height)   # number of rows
print(data.width)    # number of columns

['name', 'birthdate', 'weight', 'height']
[String, String, Float64, Float64]
(4, 4)
4
4

print(data.schema)  # dictionary of column names and their data types

Schema({'name': String, 'birthdate': String, 'weight': Float64, 'height': Float64})

Creating a data frame directly#

To make a DataFrame in your code, rather than inputting it from a file, use pl.DataFrame.

To use pl.DataFrame, make a dictionary (use curly braces, {}) with a key for each column in the DataFrame. Each column will have a list of values, which will correspond to rows in the DataFrame.

data = pl.DataFrame(
    {
        "participant_id": ["001", "002", "003", "004"],
        "age": [25, 32, 65, 42],
        "condition": ["restudy", "test", "restudy", "test"],
        "score1": [3, 6, 2, 6],
        "score2": [8, 2, 4, 5],
    }
)
data

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"003"	65	"restudy"	2	4
"004"	42	"test"	6	5

Accessing data in a data frame#

Use indexing ([]) to access individual columns.

print(type(data["score1"]))  # each column is a Series
data["score1"]               # indexing a column results in a series

<class 'polars.series.series.Series'>

shape: (4,)

score1
i64
3
6
2
6

Columns may be exported to NumPy arrays, allowing data to be analyzed using NumPy functions. But usually it’s more efficient to use DataFrame functions for analysis.

score1 = data["score1"].to_numpy()  # access a column and convert to an array
score2 = data["score2"].to_numpy()
diff = score1 - score2              # now can use NumPy operations and functions
print(score1)
print(score2)
print(diff)

[3 6 2 6]
[8 2 4 5]
[-5  4 -2  1]

Expressions#

In Polars, we can use expressions to represent operations on columns in a DataFrame. These expressions are used with the select, with_columns, filter, and group_by methods to clean, reorganize, and analyze data.

Expressions let us describe mathmatical operations on data columns, using standard math operators. Note that we can define an expression without actually evaluating it on any data.

pl.col("score1")                      # refer to a "score1" column
pl.col("score1") / 10                 # divide column by 10
pl.col("score1") ** 2                 # square column
pl.col("score1").sqrt()               # square root of column
pl.col("score1") + pl.col("score2")   # add two columns
pl.col("score1") * pl.col("score2");  # multiply two columns

Standard statistics are also available, similar to NumPy. By default, missing data will be ignored, like with NumPy’s nanmean, nanstd, etc.

pl.col("score1").sum()            # sum over all rows
pl.col("score1").mean()           # mean
pl.col("score1").std()            # standard deviation
pl.col("score1").min()            # minimum
pl.col("score1").max()            # maximum
pl.col("score1").median()         # median
pl.col("score1").quantile(0.25);  # 25th percentile

Using select and with_columns#

Use select to get a subset of columns from a DataFrame, change their order, and transform them. Use with_columns to add columns without removing any.

data

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"003"	65	"restudy"	2	4
"004"	42	"test"	6	5

Pass a list of columns to reorder them and/or get a subset of columns.

data.select(["score1", "score2", "participant_id"])

shape: (4, 3)

score1	score2	participant_id
i64	i64	str
3	8	"001"
6	2	"002"
2	4	"003"
6	5	"004"

Use an expression to make a new column based on existing columns.

data.select(
    "score1",
    "score2",
    score_total=pl.col("score1") + pl.col("score2"),  # add score 1 and score 2
)

shape: (4, 3)

score1	score2	score_total
i64	i64	i64
3	8	11
6	2	8
2	4	6
6	5	11

Use with_columns to add a column to the existing ones. Otherwise, it works the same as select.

data.with_columns(
    score_total=pl.col("score1") + pl.col("score2")
)

shape: (4, 6)

participant_id	age	condition	score1	score2	score_total
str	i64	str	i64	i64	i64
"001"	25	"restudy"	3	8	11
"002"	32	"test"	6	2	8
"003"	65	"restudy"	2	4	6
"004"	42	"test"	6	5	11

Using filter#

Use filter to get subsets of rows.

data

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"003"	65	"restudy"	2	4
"004"	42	"test"	6	5

data.filter(pl.col("score1") > 2)  # rows where score1 is greater than 2

shape: (3, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"004"	42	"test"	6	5

data.filter((pl.col("score1") > 2) & (pl.col("score2") > 2))  # both scores > 2

shape: (2, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"004"	42	"test"	6	5

data.filter(pl.col("participant_id") == "001")  # can test strings also

shape: (1, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8

data.filter(pl.col("participant_id") != "003")  # use != for not equal

shape: (3, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"004"	42	"test"	6	5

Comparison expressions#

Similar comparison expressions as in NumPy are available to use with filter.

pl.col("age") == 25   # rows where age equals 25
pl.col("age") != 25   # rows where age does not equal 25
pl.col("age") < 25    # rows where age is less than 25
pl.col("age") <= 25   # rows where age is less than or equal to 25
pl.col("age") > 25    # rows where age is greater than 25
pl.col("age") >= 25;  # rows where age is greater than or equal to 25

pl.col("participant_id") == "001"                # check string equality
pl.col("participant_id") != "001"                # check string inequality
pl.col("participant_id").is_in(["001", "002"]);  # check if in list

(pl.col("score1") > 2) & (pl.col("score2") > 2)        # both greater than 2
(pl.col("score1") == 1) | (pl.col("score2") == 1)      # either is 1
~((pl.col("score1") == 1) & (pl.col("score2") == 1));  # not both 1

Sorting data#

Use sort to rearrange the order of the rows in a data frame.

data

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"003"	65	"restudy"	2	4
"004"	42	"test"	6	5

data.sort("score1")  # sort based on score1, in ascending order

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"003"	65	"restudy"	2	4
"001"	25	"restudy"	3	8
"002"	32	"test"	6	2
"004"	42	"test"	6	5

data.sort("score1", descending=True)  # sort in descending order

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"002"	32	"test"	6	2
"004"	42	"test"	6	5
"001"	25	"restudy"	3	8
"003"	65	"restudy"	2	4

data.sort("condition", "participant_id")  # sort by multiple columns

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"003"	65	"restudy"	2	4
"002"	32	"test"	6	2
"004"	42	"test"	6	5

data.sort("condition", "score2", descending=[False, True])  # descending order

shape: (4, 5)

participant_id	age	condition	score1	score2
str	i64	str	i64	i64
"001"	25	"restudy"	3	8
"003"	65	"restudy"	2	4
"004"	42	"test"	6	5
"002"	32	"test"	6	2

Python Data Frames Cheat Sheet

Contents

Python Data Frames Cheat Sheet#

Importing data#

Data frame attributes#

Creating a data frame directly#

Accessing data in a data frame#

Expressions#

Using select and with_columns#

Using filter#

Comparison expressions#

Sorting data#