Python Data Frames Cheat Sheet
Importing data
Use Polars functions such as pl.read_csv to read data into a DataFrame, and DataFrame methods such as write_csv to write a DataFrame to a file.
shape: (4, 4)

| name | birthdate | weight | height |
|---|---|---|---|
| str | str | f64 | f64 |
| "Alice Archer" | "1997-01-10" | 57.9 | 1.56 |
| "Ben Brown" | "1985-02-15" | 72.5 | 1.77 |
| "Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 |
| "Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 |
Data frame attributes
Use DataFrame attributes to get information about how the data are organized.
['name', 'birthdate', 'weight', 'height']
[String, String, Float64, Float64]
(4, 4)
4
4
Schema({'name': String, 'birthdate': String, 'weight': Float64, 'height': Float64})
Creating a data frame directly
To make a DataFrame in your code, rather than reading it from a file, use pl.DataFrame.
To use pl.DataFrame, make a dictionary (use curly braces, {}) with a key for each column in the DataFrame. Each key maps to a list of values, one for each row of the DataFrame.
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "003" | 65 | "restudy" | 2 | 4 |
| "004" | 42 | "test" | 6 | 5 |
Accessing data in a data frame
Use indexing ([]) to access individual columns.
<class 'polars.series.series.Series'>
Columns may be exported to NumPy arrays, allowing data to be analyzed using NumPy functions. But usually it’s more efficient to use DataFrame functions for analysis.
[3 6 2 6]
[8 2 4 5]
[-5 4 -2 1]
Expressions
In Polars, we can use expressions to represent operations on columns in a DataFrame. These expressions are used with the select, with_columns, filter, and group_by methods to clean, reorganize, and analyze data.
Expressions let us describe mathematical operations on data columns, using standard math operators. Note that we can define an expression without actually evaluating it on any data.
Standard statistics are also available, similar to NumPy. By default, missing values (null) are ignored, like with NumPy’s nanmean, nanstd, etc.
Using select and with_columns
Use select to get a subset of columns from a DataFrame, change their order, and transform them. Use with_columns to add columns without removing any.
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "003" | 65 | "restudy" | 2 | 4 |
| "004" | 42 | "test" | 6 | 5 |
Pass a list of columns to reorder them and/or get a subset of columns.
shape: (4, 3)

| score1 | score2 | participant_id |
|---|---|---|
| i64 | i64 | str |
| 3 | 8 | "001" |
| 6 | 2 | "002" |
| 2 | 4 | "003" |
| 6 | 5 | "004" |
Use an expression to make a new column based on existing columns.
shape: (4, 3)

| score1 | score2 | score_total |
|---|---|---|
| i64 | i64 | i64 |
| 3 | 8 | 11 |
| 6 | 2 | 8 |
| 2 | 4 | 6 |
| 6 | 5 | 11 |
Use with_columns to add a column to the existing ones. Otherwise, it works the same as select.
shape: (4, 6)

| participant_id | age | condition | score1 | score2 | score_total |
|---|---|---|---|---|---|
| str | i64 | str | i64 | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 | 11 |
| "002" | 32 | "test" | 6 | 2 | 8 |
| "003" | 65 | "restudy" | 2 | 4 | 6 |
| "004" | 42 | "test" | 6 | 5 | 11 |
Using filter
Use filter to get subsets of rows.
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "003" | 65 | "restudy" | 2 | 4 |
| "004" | 42 | "test" | 6 | 5 |
shape: (3, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "004" | 42 | "test" | 6 | 5 |
shape: (2, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "004" | 42 | "test" | 6 | 5 |
shape: (1, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
shape: (3, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "004" | 42 | "test" | 6 | 5 |
Comparison expressions
Comparison expressions similar to those in NumPy are available for use with filter.
Sorting data
Use sort to rearrange the order of the rows in a data frame.
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "003" | 65 | "restudy" | 2 | 4 |
| "004" | 42 | "test" | 6 | 5 |
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "003" | 65 | "restudy" | 2 | 4 |
| "001" | 25 | "restudy" | 3 | 8 |
| "002" | 32 | "test" | 6 | 2 |
| "004" | 42 | "test" | 6 | 5 |
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "002" | 32 | "test" | 6 | 2 |
| "004" | 42 | "test" | 6 | 5 |
| "001" | 25 | "restudy" | 3 | 8 |
| "003" | 65 | "restudy" | 2 | 4 |
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "003" | 65 | "restudy" | 2 | 4 |
| "002" | 32 | "test" | 6 | 2 |
| "004" | 42 | "test" | 6 | 5 |
shape: (4, 5)

| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| str | i64 | str | i64 | i64 |
| "001" | 25 | "restudy" | 3 | 8 |
| "003" | 65 | "restudy" | 2 | 4 |
| "004" | 42 | "test" | 6 | 5 |
| "002" | 32 | "test" | 6 | 2 |