Python Data Frames Cheat Sheet#

Importing data#

Use Polars functions to read data into a DataFrame or write a DataFrame to a file.

import polars as pl               # import the polars package
data = pl.read_csv("people.csv")  # use read_csv to read CSV files
data                              # display the data frame
shape: (4, 4)
namebirthdateweightheight
strstrf64f64
"Alice Archer""1997-01-10"57.91.56
"Ben Brown""1985-02-15"72.51.77
"Chloe Cooper""1983-03-22"53.61.65
"Daniel Donovan""1981-04-30"83.11.75

Data frame attributes#

Use DataFrame attributes to get information about how the data are organized.

print(data.columns)  # column names in a list of strings
print(data.dtypes)   # data type of each column
print(data.shape)    # tuple with the number of rows and columns
print(data.height)   # number of rows
print(data.width)    # number of columns
['name', 'birthdate', 'weight', 'height']
[String, String, Float64, Float64]
(4, 4)
4
4
print(data.schema)  # dictionary of column names and their data types
Schema({'name': String, 'birthdate': String, 'weight': Float64, 'height': Float64})

Creating a data frame directly#

To make a DataFrame in your code, rather than inputting it from a file, use pl.DataFrame.

To use pl.DataFrame, make a dictionary (use curly braces, {}) with a key for each column in the DataFrame. Each column will have a list of values, which will correspond to rows in the DataFrame.

data = pl.DataFrame(
    {
        "participant_id": ["001", "002", "003", "004"],
        "age": [25, 32, 65, 42],
        "condition": ["restudy", "test", "restudy", "test"],
        "score1": [3, 6, 2, 6],
        "score2": [8, 2, 4, 5],
    }
)
data
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"002"32"test"62
"003"65"restudy"24
"004"42"test"65

Accessing data in a data frame#

Use indexing ([]) to access individual columns.

print(type(data["score1"]))  # each column is a Series
data["score1"]               # indexing a column results in a series
<class 'polars.series.series.Series'>
shape: (4,)
score1
i64
3
6
2
6

Columns may be exported to NumPy arrays, allowing data to be analyzed using NumPy functions. But usually it’s more efficient to use DataFrame functions for analysis.

score1 = data["score1"].to_numpy()  # access a column and convert to an array
score2 = data["score2"].to_numpy()
diff = score1 - score2              # now can use NumPy operations and functions
print(score1)
print(score2)
print(diff)
[3 6 2 6]
[8 2 4 5]
[-5  4 -2  1]

Expressions#

In Polars, we can use expressions to represent operations on columns in a DataFrame. These expressions are used with the select, with_columns, filter, and group_by methods to clean, reorganize, and analyze data.

Expressions let us describe mathmatical operations on data columns, using standard math operators. Note that we can define an expression without actually evaluating it on any data.

pl.col("score1")                      # refer to a "score1" column
pl.col("score1") / 10                 # divide column by 10
pl.col("score1") ** 2                 # square column
pl.col("score1").sqrt()               # square root of column
pl.col("score1") + pl.col("score2")   # add two columns
pl.col("score1") * pl.col("score2");  # multiply two columns

Standard statistics are also available, similar to NumPy. By default, missing data will be ignored, like with NumPy’s nanmean, nanstd, etc.

pl.col("score1").sum()            # sum over all rows
pl.col("score1").mean()           # mean
pl.col("score1").std()            # standard deviation
pl.col("score1").min()            # minimum
pl.col("score1").max()            # maximum
pl.col("score1").median()         # median
pl.col("score1").quantile(0.25);  # 25th percentile

Using select and with_columns#

Use select to get a subset of columns from a DataFrame, change their order, and transform them. Use with_columns to add columns without removing any.

data
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"002"32"test"62
"003"65"restudy"24
"004"42"test"65

Pass a list of columns to reorder them and/or get a subset of columns.

data.select(["score1", "score2", "participant_id"])
shape: (4, 3)
score1score2participant_id
i64i64str
38"001"
62"002"
24"003"
65"004"

Use an expression to make a new column based on existing columns.

data.select(
    "score1",
    "score2",
    score_total=pl.col("score1") + pl.col("score2"),  # add score 1 and score 2
)
shape: (4, 3)
score1score2score_total
i64i64i64
3811
628
246
6511

Use with_columns to add a column to the existing ones. Otherwise, it works the same as select.

data.with_columns(
    score_total=pl.col("score1") + pl.col("score2")
)
shape: (4, 6)
participant_idageconditionscore1score2score_total
stri64stri64i64i64
"001"25"restudy"3811
"002"32"test"628
"003"65"restudy"246
"004"42"test"6511

Using filter#

Use filter to get subsets of rows.

data
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"002"32"test"62
"003"65"restudy"24
"004"42"test"65
data.filter(pl.col("score1") > 2)  # rows where score1 is greater than 2
shape: (3, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"002"32"test"62
"004"42"test"65
data.filter((pl.col("score1") > 2) & (pl.col("score2") > 2))  # both scores > 2
shape: (2, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"004"42"test"65
data.filter(pl.col("participant_id") == "001")  # can test strings also
shape: (1, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
data.filter(pl.col("participant_id") != "003")  # use != for not equal
shape: (3, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"002"32"test"62
"004"42"test"65

Comparison expressions#

Similar comparison expressions as in NumPy are available to use with filter.

pl.col("age") == 25   # rows where age equals 25
pl.col("age") != 25   # rows where age does not equal 25
pl.col("age") < 25    # rows where age is less than 25
pl.col("age") <= 25   # rows where age is less than or equal to 25
pl.col("age") > 25    # rows where age is greater than 25
pl.col("age") >= 25;  # rows where age is greater than or equal to 25
pl.col("participant_id") == "001"                # check string equality
pl.col("participant_id") != "001"                # check string inequality
pl.col("participant_id").is_in(["001", "002"]);  # check if in list
(pl.col("score1") > 2) & (pl.col("score2") > 2)        # both greater than 2
(pl.col("score1") == 1) | (pl.col("score2") == 1)      # either is 1
~((pl.col("score1") == 1) & (pl.col("score2") == 1));  # not both 1

Sorting data#

Use sort to rearrange the order of the rows in a data frame.

data
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"002"32"test"62
"003"65"restudy"24
"004"42"test"65
data.sort("score1")  # sort based on score1, in ascending order
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"003"65"restudy"24
"001"25"restudy"38
"002"32"test"62
"004"42"test"65
data.sort("score1", descending=True)  # sort in descending order
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"002"32"test"62
"004"42"test"65
"001"25"restudy"38
"003"65"restudy"24
data.sort("condition", "participant_id")  # sort by multiple columns
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"003"65"restudy"24
"002"32"test"62
"004"42"test"65
data.sort("condition", "score2", descending=[False, True])  # descending order
shape: (4, 5)
participant_idageconditionscore1score2
stri64stri64i64
"001"25"restudy"38
"003"65"restudy"24
"004"42"test"65
"002"32"test"62