7. Data Frames#
DataFrames are used to organize data from spreadsheets. There are powerful methods in Python for organizing, filtering, and summarizing data in spreadsheets.
7.1. Importing data#
We can easily use the Polars package to import data from spreadsheets into a variable called a DataFrame. First, we will import the Polars module.
import polars as pl
We can use pl.read_csv to import a CSV file into a DataFrame (in this case, data from Osth & Fox 2019). The head method shows the first several lines of a DataFrame (you can choose the number of lines; the default is 5). In Jupyter, the last command in a cell will be displayed below that cell. In this case, we have raw.head() at the end. This produces a DataFrame, which Jupyter then displays in a nice format.
raw = pl.read_csv("exp1.csv")
raw.head()
| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
| 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
| 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
| 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
We can also use print, but the formatting is less nice than what Jupyter uses.
print(raw.head())
shape: (5, 16)
┌───────┬───────┬───────┬────────┬───┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase ┆ type ┆ … ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪════════╪═══╪══════╪═══════════╪══════════════╪════════╡
│ 0 ┆ -1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 0 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 1 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 2 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
│ 0 ┆ 3 ┆ study ┆ intact ┆ … ┆ 101 ┆ 0 ┆ 0 ┆ 0 │
└───────┴───────┴───────┴────────┴───┴──────┴───────────┴──────────────┴────────┘
There are a lot of columns, including some that we won’t use. Let’s use the select method to make the data a little simpler to look at. The select method lets us create a new DataFrame with a specified set of columns. The simplest way to use it is to get a subset of the columns in the DataFrame by passing a list of column names.
columns = ["subj", "cycle", "trial", "phase", "type", "word1", "word2", "response", "RT", "correct", "lag"]
df = raw.select(columns)
df.head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | -1 | "study" | "intact" | "formal" | "positive" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 0 | "study" | "intact" | "skin" | "careful" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 1 | "study" | "intact" | "upon" | "miss" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 2 | "study" | "intact" | "single" | "tradition" | -1 | -1.0 | -1 | -1 |
| 101 | 0 | 3 | "study" | "intact" | "prove" | "airport" | -1 | -1.0 | -1 | -1 |
A DataFrame has multiple columns, each of which can have a different type of data. For example, i64 means a 64-bit integer, and str indicates a string. Polars tries to guess the right data type for each column based on the data in that column during import.
The list of columns and their data types is known as a schema. We can access it for an existing DataFrame using the schema attribute.
df.schema
Schema([('subj', Int64),
('cycle', Int64),
('trial', Int64),
('phase', String),
('type', String),
('word1', String),
('word2', String),
('response', Int64),
('RT', Float64),
('correct', Int64),
('lag', Int64)])
When importing data, we can make changes to the schema to read things in differently. We can either pass in a full schema that we want to use, as a dictionary, or use the schema_overrides input to indicate the formats of just some specific columns. If we only specify some columns, the formats of the others will be inferred automatically.
raw_override = pl.read_csv("exp1.csv", schema_overrides={"response": pl.Float64})
raw_override.head()
| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "study" | "intact" | "formal" | "positive" | -1.0 | -1.0 | -1 | -1 | 0 | 0 | 101 | 0 | 0 | 0 |
| 0 | 0 | "study" | "intact" | "skin" | "careful" | -1.0 | -1.0 | -1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 1 | "study" | "intact" | "upon" | "miss" | -1.0 | -1.0 | -1 | -1 | 2 | 2 | 101 | 0 | 0 | 0 |
| 0 | 2 | "study" | "intact" | "single" | "tradition" | -1.0 | -1.0 | -1 | -1 | 3 | 3 | 101 | 0 | 0 | 0 |
| 0 | 3 | "study" | "intact" | "prove" | "airport" | -1.0 | -1.0 | -1 | -1 | 4 | 4 | 101 | 0 | 0 | 0 |
Note that the response column is now a 64-bit float (f64) instead of an integer.
We can also easily write out a DataFrame to a spreadsheet file, using the write_csv method. We must specify the path to where we want the file to be saved.
We’ll first use the head method to get the first five lines. Note that we can chain together calls to DataFrame methods. When we call head, it returns a new DataFrame. Then we immediately call write_csv, which works on the trimmed DataFrame.
df.head(5).write_csv("exp1_trimmed.csv")
Take a look at the exp1_trimmed.csv spreadsheet in VSCode.
Exercise: importing data using Polars#
Read the file in this directory called participants.csv into a DataFrame called participants. Use the columns input to select just these columns: participant_id, session_date, score_task1. Use Jupyter to display the DataFrame.
Advanced: handle dates#
Set try_parse_dates to True to automatically identify dates and read them in as dates instead of strings. Dates can then be used in Python to calculate things like the difference between the dates on which different sessions were run.
# answer here
7.2. Creating a DataFrame directly in Python#
It’s also possible to create a DataFrame directly using a dictionary of lists. Each key in the dictionary becomes a column in the DataFrame. This method is mostly used for demonstrations; real data are usually in a file that has to be imported.
We can create a DataFrame directly by calling pl.DataFrame. The input gives all the data and the column names, as a Python dictionary. We use the special date datatype to represent dates. The dates are given in the international standard format, which follows this ordering: year, month, day.
from datetime import date
df_manual = pl.DataFrame(
{
"participant_id": ["sub-001", "sub-002", "sub-003", "sub-004"],
"session_date": [
date(2025, 1, 8),
date(2025, 1, 18),
date(2025, 1, 25),
date(2025, 2, 2),
],
"score_task1": [8, 4, 6, 5],
"score_task2": [6, 5, 4, 5],
}
)
Let’s take a look at the DataFrame we created. Note that each key in the dictionary is now a column. The lists we passed in become the data for each column.
df_manual
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-001" | 2025-01-08 | 8 | 6 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-004" | 2025-02-02 | 5 | 5 |
Exercise: creating a DataFrame#
Create a DataFrame called data with the same data and column names as this table. The participant_id column should have strings; the age column should have integers; and the accuracy column should have floats.
participant_id |
age |
accuracy |
|---|---|---|
001 |
24 |
0.85 |
002 |
21 |
0.72 |
003 |
33 |
0.92 |
# answer here
7.3. Accessing data in a DataFrame#
DataFrames organize data within a more complex data structure than we’ve seen so far. We can access parts of a DataFrame to get some of the same structures that we have worked with before.
We can access columns of a DataFrame using df["column"], like we’ve seen before with dictionaries. This gives us a special data type called a Series. It’s used in Polars to represent single columns of data separately from a DataFrame. They aren’t used very often by themselves.
df["type"]
| type |
|---|
| str |
| "intact" |
| "intact" |
| "intact" |
| "intact" |
| "intact" |
| … |
| "intact" |
| "rearranged" |
| "rearranged" |
| "rearranged" |
| "intact" |
We can convert Series into NumPy arrays using the to_numpy method.
x = df["response"].to_numpy()
x
array([-1, -1, -1, ..., 1, 0, 1], shape=(107443,))
We can then run any NumPy methods we want, like calculating the mean. We’ll use nanmean to exclude missing samples.
import numpy as np
np.nanmean(x)
np.float64(-0.31397112887763745)
Usually converting data to NumPy arrays isn’t the best way to go, however. Polars has lots of tools for quickly doing things like calculating means and handling missing data by operating on a DataFrame.
df["response"].mean()
-0.31397112887763745
We can also use mean on the DataFrame to calculate it for all columns. Note that this doesn't really make sense for all of the columns, though. To get a meaningful result, we first need to do some data cleaning, such as removing the -1 values that are used to indicate missing data.
df.mean()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | str | str | str | str | f64 | f64 | f64 | f64 |
| 165.644798 | 3.497492 | 28.496598 | null | null | null | null | -0.313971 | 0.16512 | -0.172007 | -0.0004 |
Exercise: accessing data#
Read the participants.csv file into a DataFrame. Convert the score_task1 variable into a NumPy array and use NumPy to calculate the mean. Then use Polars to calculate the mean of all columns.
# answer here
7.4. Polars contexts#
Polars has different contexts that can be used to accomplish different kinds of operations.
select is used to get a subset of columns or to create new columns from data in the DataFrame.
with_columns works like select, but keeps all the existing columns and adds the new ones, instead of returning only the columns that were specified.
filter is used to select a subset of the rows in the DataFrame.
group_by is used to analyze groups of observations and calculate summary statistics.
7.5. Using select and with_columns#
We can use select to create a new DataFrame using expressions that access data from an existing DataFrame.
We can select multiple columns, and optionally change their order, using the select method.
df.select(["subj", "phase", "response", "RT"]).head()
| subj | phase | response | RT |
|---|---|---|---|
| i64 | str | i64 | f64 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
| 101 | "study" | -1 | -1.0 |
Select is very powerful, and can be used for much more than just getting a subset of columns. We can also use it to transform columns in many different ways. Take the example of the participants data, which includes scores on two tasks.
participants = pl.read_csv("participants.csv", try_parse_dates=True)
participants
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-001" | 2025-01-08 | 8 | 6 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-004" | 2025-02-02 | 5 | 5 |
Given the columns with scores on two tasks, we can use these data to calculate a new column with an average score across the two tasks. First, we’ll define an expression for the value we want to calculate. Expressions specify something we want to do without actually doing it yet. This allows Polars to optimize how it runs different operations. The pl.col function can be used to specify one or more columns in a DataFrame, as a placeholder. Notice that we define the expression without referring to any specific DataFrame.
avg_expr = (pl.col("score_task1") + pl.col("score_task2")) / 2
print(avg_expr)
[([(col("score_task1")) + (col("score_task2"))]) / (dyn int: 2)]
We can use this expression with our DataFrame to calculate the average score. We can specify what to name the new column using keyword arguments. Here, we set the keyword argument avg_score to the expression that we defined before.
participants.select(avg_score=avg_expr)
| avg_score |
|---|
| f64 |
| 7.0 |
| 4.5 |
| 5.0 |
| 5.0 |
Before, we used a variable to store the average score expression, but usually we’ll write everything all together like this.
participants.select(avg_score=(pl.col("score_task1") + pl.col("score_task2")) / 2)
| avg_score |
|---|
| f64 |
| 7.0 |
| 4.5 |
| 5.0 |
| 5.0 |
The select method only returns the columns that were defined in the call. If we want to keep the existing columns, we can instead use with_columns.
participants.with_columns(avg_score=(pl.col("score_task1") + pl.col("score_task2")) / 2)
| participant_id | session_date | score_task1 | score_task2 | avg_score |
|---|---|---|---|---|
| str | date | i64 | i64 | f64 |
| "sub-001" | 2025-01-08 | 8 | 6 | 7.0 |
| "sub-002" | 2025-01-18 | 4 | 5 | 4.5 |
| "sub-003" | 2025-01-25 | 6 | 4 | 5.0 |
| "sub-004" | 2025-02-02 | 5 | 5 | 5.0 |
We can pass in multiple inputs to specify multiple columns. In this example, we add additional columns with reference information, including the mean across subjects and the maximum possible score.
participants.with_columns(
avg_score=avg_expr, # expression to calculate average
mean_avg_score=avg_expr.mean(), # mean across participants
max_score=10, # max possible score
)
| participant_id | session_date | score_task1 | score_task2 | avg_score | mean_avg_score | max_score |
|---|---|---|---|---|---|---|
| str | date | i64 | i64 | f64 | f64 | i32 |
| "sub-001" | 2025-01-08 | 8 | 6 | 7.0 | 5.375 | 10 |
| "sub-002" | 2025-01-18 | 4 | 5 | 4.5 | 5.375 | 10 |
| "sub-003" | 2025-01-25 | 6 | 4 | 5.0 | 5.375 | 10 |
| "sub-004" | 2025-02-02 | 5 | 5 | 5.0 | 5.375 | 10 |
Here, we create two columns that are the same for all rows. The first one is the mean score across subjects. Note that we can chain together parts of an expression; this example calculates the average score for each participant and then calculates the mean across participants. The second is a reference value showing the maximum score, which is 10.
Exercise: using select and with_columns#
Create the DataFrame scores as defined below.
participant_id |
score1 |
score2 |
|---|---|---|
001 |
9 |
5 |
002 |
7 |
3 |
003 |
10 |
6 |
Create a new column called total that is the sum of the score1 and score2 columns, and a new column average that is the mean of the score1 and score2 columns. Compare what happens when you use select and with_columns.
# answer here
7.6. Using filter#
We can use filter to create a new DataFrame with a subset of rows.
We can filter the rows included in the table using the filter method. This allows for selecting data very flexibly, in a similar way to how we selected data using NumPy expressions when working with NumPy arrays. To create an expression that refers to the value of a column, we must use the pl.col function.
df.filter(pl.col("phase") == "test").head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | -1 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | 0 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 |
| 101 | 0 | 1 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 |
| 101 | 0 | 2 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 |
| 101 | 0 | 3 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 |
Here, we indicate that we only want rows where the phase is test, to get only the test trials and exclude all the study trials.
We can make more complicated expressions using operators for and (&), or (|), and not (~). If you are combining multiple comparisons, you will need to add parentheses around the individual comparisons.
targets = df.filter((pl.col("phase") == "test") & (pl.col("type") == "intact"))
targets.head()
| subj | cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 |
| 101 | 0 | 4 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | -1 |
| 101 | 0 | 5 | "test" | "intact" | "doctor" | "contrast" | 0 | 4.056 | 0 | -1 |
| 101 | 0 | 7 | "test" | "intact" | "homes" | "fuel" | 1 | 2.499 | 1 | -1 |
| 101 | 0 | 8 | "test" | "intact" | "liked" | "tone" | 1 | 1.609 | 1 | -1 |
| 101 | 0 | 9 | "test" | "intact" | "notice" | "explain" | 1 | 1.352 | 1 | -1 |
We can put together multiple operations by chaining methods. Here, we’ll select some columns and filter the rows in one go.
df_test = (
df.filter(pl.col("phase") == "test")
.select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_test.head(3)
| subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
Here, we put parentheses around the whole chain of method calls. This lets us split the different calls so there is one for each line. It tends to make things easier to read.
Exercise: using filter#
Use the participants DataFrame. If you don’t have the variable defined currently, you can read it in from participants.csv. Use filter to get participants who scored at least 5 on both tasks.
# answer here
7.7. Using sort#
DataFrames can be sorted based on multiple columns.
To change the sorting of the rows in a DataFrame, use sort.
participants.sort("score_task1")
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-004" | 2025-02-02 | 5 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-001" | 2025-01-08 | 8 | 6 |
To sort in descending order instead of ascending, set descending=True.
participants.sort("session_date", descending=True)
| participant_id | session_date | score_task1 | score_task2 |
|---|---|---|---|
| str | date | i64 | i64 |
| "sub-004" | 2025-02-02 | 5 | 5 |
| "sub-003" | 2025-01-25 | 6 | 4 |
| "sub-002" | 2025-01-18 | 4 | 5 |
| "sub-001" | 2025-01-08 | 8 | 6 |
We can also sort by multiple columns simultaneously. Columns are sorted in the order in which they are passed.
study = pl.read_csv("study.csv")
study
| participant_id | trial_type | item_type | response | response_time |
|---|---|---|---|---|
| i64 | str | i64 | i64 | f64 |
| 1 | "target" | 1 | 1 | 1.5 |
| 1 | "lure" | 2 | 0 | 2.3 |
| 1 | "target" | 2 | 0 | 2.2 |
| 1 | "lure" | 1 | 1 | 1.8 |
| 2 | "lure" | 2 | 0 | 2.2 |
| 2 | "target" | 1 | 1 | 1.2 |
| 2 | "lure" | 1 | 1 | 2.1 |
| 2 | "target" | 2 | 1 | 1.6 |
study.sort("participant_id", "item_type", "trial_type")
| participant_id | trial_type | item_type | response | response_time |
|---|---|---|---|---|
| i64 | str | i64 | i64 | f64 |
| 1 | "lure" | 1 | 1 | 1.8 |
| 1 | "target" | 1 | 1 | 1.5 |
| 1 | "lure" | 2 | 0 | 2.3 |
| 1 | "target" | 2 | 0 | 2.2 |
| 2 | "lure" | 1 | 1 | 2.1 |
| 2 | "target" | 1 | 1 | 1.2 |
| 2 | "lure" | 2 | 0 | 2.2 |
| 2 | "target" | 2 | 1 | 1.6 |
When there are multiple sorting columns, we can change whether each one is ascending or descending by passing a list to the descending argument.
study.sort(
"participant_id", "item_type", "trial_type",
descending=[False, False, True],
)
| participant_id | trial_type | item_type | response | response_time |
|---|---|---|---|---|
| i64 | str | i64 | i64 | f64 |
| 1 | "target" | 1 | 1 | 1.5 |
| 1 | "lure" | 1 | 1 | 1.8 |
| 1 | "target" | 2 | 0 | 2.2 |
| 1 | "lure" | 2 | 0 | 2.3 |
| 2 | "target" | 1 | 1 | 1.2 |
| 2 | "lure" | 1 | 1 | 2.1 |
| 2 | "target" | 2 | 1 | 1.6 |
| 2 | "lure" | 2 | 0 | 2.2 |
7.8. Calculating summary statistics#
When getting a feel for a dataset, it can be very helpful to calculate some summary statistics to measure central tendency and spread.
First, let’s use method chaining to set up a DataFrame with data from the test phase. This example reads the data from the file, filters it to get the test phase, and selects some columns to focus on.
df_summary = (
pl.read_csv("exp1.csv")
.filter(pl.col("phase") == "test")
.select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_summary.head()
| subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 |
| 101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 |
| 101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 |
We can get some common summary statistics for all the columns using describe.
df_summary.describe()
| statistic | subj | trial | phase | type | response | RT | correct | lag |
|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | str | str | f64 | f64 | f64 | f64 |
| "count" | 53700.0 | 53700.0 | "53700" | "53700" | 53700.0 | 53700.0 | 53700.0 | 53700.0 |
| "null_count" | 0.0 | 0.0 | "0" | "0" | 0.0 | 0.0 | 0.0 | 0.0 |
| "mean" | 165.655866 | 28.5 | null | null | 0.372607 | 1.331173 | 0.656648 | 1.0 |
| "std" | 99.008229 | 17.318264 | null | null | 0.48554 | 0.76009 | 0.474832 | 2.236089 |
| "min" | 101.0 | -1.0 | "test" | "intact" | -1.0 | -1.0 | 0.0 | -1.0 |
| "25%" | 128.0 | 14.0 | null | null | 0.0 | 0.882 | 0.0 | -1.0 |
| "50%" | 158.0 | 29.0 | null | null | 0.0 | 1.125 | 1.0 | 1.0 |
| "75%" | 186.0 | 43.0 | null | null | 1.0 | 1.526 | 1.0 | 3.0 |
| "max" | 1150.0 | 58.0 | "test" | "rearranged" | 1.0 | 7.925 | 1.0 | 5.0 |
The count row gives the number of non-null samples in each column, while the null_count row indicates how many samples are null.
The mean and std rows give the mean and standard deviation.
The min and max rows give the minimum and maximum values in each column. The percentage rows give percentiles of the data; the 50% row is the 50th percentile, also known as the median.
We can calculate more targeted statistics using expressions.
df_summary.select(pl.col("response", "RT").mean())
| response | RT |
|---|---|
| f64 | f64 |
| 0.372607 | 1.331173 |
Note that here we use pl.col to select multiple columns. The mean function is then applied to both columns.
We can use .name.suffix to distinguish these columns from existing ones, by adding a suffix.
result = df_summary.with_columns(pl.col("response", "RT").mean().name.suffix("_mean"))
result.head()
| subj | trial | phase | type | response | RT | correct | lag | response_mean | RT_mean |
|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | i64 | f64 | i64 | i64 | f64 | f64 |
| 101 | -1 | "test" | "rearranged" | 0 | 2.312 | 1 | 2 | 0.372607 | 1.331173 |
| 101 | 0 | "test" | "rearranged" | 0 | 3.542 | 1 | 1 | 0.372607 | 1.331173 |
| 101 | 1 | "test" | "rearranged" | 0 | 2.084 | 1 | 3 | 0.372607 | 1.331173 |
| 101 | 2 | "test" | "rearranged" | 0 | 1.669 | 1 | 2 | 0.372607 | 1.331173 |
| 101 | 3 | "test" | "rearranged" | 0 | 2.326 | 1 | 5 | 0.372607 | 1.331173 |
See the Polars documentation for information about other descriptive statistics.
Exercise: summary statistics#
Use the participants DataFrame with describe to calculate summary statistics for the score_task1 and score_task2 columns. Do you understand what each of the statistics means?
# answer here
7.9. Summary#
Data in table format, such as data stored in spreadsheet files, can be represented in Python using DataFrames. A DataFrame is similar to a dictionary of NumPy arrays: each column must contain data of a single datatype.
The Polars package has many tools for working with DataFrames. The select and with_columns methods can be used to select a subset of existing columns or to create new columns using expressions. The filter method makes it possible to flexibly choose a subset of rows.
The data in a DataFrame can be summarized using the describe method or various descriptive statistics functions such as mean and std.