7.12. R Data Frames Cheat Sheet

Note: the course focuses on using Python. This sheet on R is included as additional reading, to show how similar principles of data analysis apply in the R language.

library(tidyverse)  # the tidyverse packages provide data science tools

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing data

Use read_csv to read a CSV file into a tibble. A tibble is a modified version of R’s standard data.frame type that is designed to be more consistent and reliable.

data <- read_csv("participants.csv")
data

Rows: 4 Columns: 4

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): participant_id
dbl  (2): score_task1, score_task2
date (1): session_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A spec_tbl_df: 4 × 4
participant_id	session_date	score_task1	score_task2
<chr>	<date>	<dbl>	<dbl>
sub-001	2025-01-08	8	6
sub-002	2025-01-18	4	5
sub-003	2025-01-25	6	4
sub-004	2025-02-02	5	5
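The column specification message above can be avoided by stating the column types explicitly. The following is a minimal sketch, assuming a small temporary file that stands in for participants.csv; the file contents and the names csv_path and participants are illustrative only.

```r
library(tidyverse)

# Write a small example file to a temporary location (stand-in for participants.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "participant_id,session_date,score_task1,score_task2",
    "sub-001,2025-01-08,8,6",
    "sub-002,2025-01-18,4,5"
  ),
  csv_path
)

# Declare the type of each column instead of letting read_csv guess
participants <- read_csv(
  csv_path,
  col_types = cols(
    participant_id = col_character(),
    session_date = col_date(),
    score_task1 = col_double(),
    score_task2 = col_double()
  )
)
participants
```

Declaring the types also makes the import fail loudly, with parsing warnings, if a file does not match what the analysis expects.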

Data frame attributes

Use these functions to get information about a data frame, such as its column names, column types, and dimensions.

colnames(data)        # column names
sapply(data, typeof)  # storage type of each column (dates are stored as double)
dim(data)             # number of rows and columns
nrow(data)            # number of rows
ncol(data)            # number of columns

'participant_id'
'session_date'
'score_task1'
'score_task2'

participant_id: 'character'
session_date: 'double'
score_task1: 'double'
score_task2: 'double'

4
4

4
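Because the example data has four rows and four columns, the size functions all return 4, which hides how they differ. The sketch below uses a hypothetical 3 × 2 tibble (the name scores is an assumption) to show which function counts rows and which counts columns.

```r
library(tidyverse)

# Hypothetical data with 3 rows and 2 columns, so the counts differ
scores <- tribble(
  ~participant_id, ~score,
  "001", 3,
  "002", 6,
  "003", 2
)

n_rows <- nrow(scores)    # number of rows
n_cols <- ncol(scores)    # number of columns
n_len  <- length(scores)  # a data frame is a list of columns, so this counts columns
shape  <- dim(scores)     # rows first, then columns

n_rows
n_cols
n_len
shape
```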

Creating a data frame directly

To create a data frame directly in your code, rather than reading it from a file, use tribble, which lets you enter the data row by row.

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)
data

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
002	32	test	6	2
003	65	restudy	2	4
004	42	test	6	5
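When the values are already available as vectors, the tibble function builds the same kind of data frame column by column. This is a sketch using two of the columns from the example above; the names by_row and by_column are illustrative.

```r
library(tidyverse)

# Row-by-row layout with tribble
by_row <- tribble(
  ~participant_id, ~age,
  "001", 25,
  "002", 32
)

# Column-by-column layout with tibble; each argument is one column
by_column <- tibble(
  participant_id = c("001", "002"),
  age = c(25, 32)
)

by_column
```

Both calls should produce equivalent tibbles; tribble tends to be easier to read for small hand-typed tables, while tibble suits data that already exists as vectors.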

Accessing data in a data frame

Use [] to get a subset of columns as a tibble, or [[]] to extract a single column as a vector.

data[c("score1", "score2")]
data[["score1"]]

A tibble: 4 × 2
score1	score2
<dbl>	<dbl>
3	8
6	2
2	4
6	5

3
6
2
6

Columns extracted as vectors can be used in calculations.

score1 <- data[["score1"]]  # extract each score column as a vector
score2 <- data[["score2"]]
diff <- score1 - score2     # element-wise difference
score1
score2
diff

3
6
2
6

8
2
4
5

-5
4
-2
1
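Two other common ways to extract a single column are the $ operator and dplyr's pull function. The sketch below recreates the example data and checks that both give the same vector as [[]]; the variable names are illustrative.

```r
library(tidyverse)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

by_dollar <- data$score1            # $ extracts a column as a vector
by_pull <- data |> pull(score1)     # pull does the same inside a pipeline
as_table <- data["score1"]          # single brackets keep the result as a tibble

identical(by_dollar, data[["score1"]])
identical(by_pull, data[["score1"]])
as_table
```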

Selecting and mutating columns

Use select to get a subset of columns and change their order. Use mutate to create new columns from existing ones.

data

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
002	32	test	6	2
003	65	restudy	2	4
004	42	test	6	5

data |>
  select(score1, score2, participant_id)  # column selection and ordering

A tibble: 4 × 3
score1	score2	participant_id
<dbl>	<dbl>	<chr>
3	8	001
6	2	002
2	4	003
6	5	004

data |>
  select(score1, score2) |>              # select score columns
  mutate(score_total = score1 + score2)  # add a total column

A tibble: 4 × 3
score1	score2	score_total
<dbl>	<dbl>	<dbl>
3	8	11
6	2	8
2	4	6
6	5	11

data |>
  mutate(score_total = score1 + score2)  # just add a total column

A tibble: 4 × 6
participant_id	age	condition	score1	score2	score_total
<chr>	<dbl>	<chr>	<dbl>	<dbl>	<dbl>
001	25	restudy	3	8	11
002	32	test	6	2	8
003	65	restudy	2	4	6
004	42	test	6	5	11
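Columns can also be selected by a name pattern, and one mutate call can create several columns. The following sketch assumes the same example data; starts_with is a tidyverse selection helper, and score_mean is a column name chosen for illustration.

```r
library(tidyverse)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

result <- data |>
  select(participant_id, starts_with("score")) |>  # ID plus every column named score*
  mutate(
    score_total = score1 + score2,
    score_mean = score_total / 2  # can use a column created earlier in the same call
  )
result
```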

Using filter

Use filter to get subsets of rows.

data

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
002	32	test	6	2
003	65	restudy	2	4
004	42	test	6	5

data |>
  filter(score1 > 2)  # rows where score 1 is greater than 2

A tibble: 3 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
002	32	test	6	2
004	42	test	6	5

data |>
  filter(score1 > 2 & score2 > 2)  # both scores > 2

A tibble: 2 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
004	42	test	6	5

data |>
  filter(participant_id == "001")  # string comparisons work too

A tibble: 1 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8

data |>
  filter(participant_id != "003")  # use != for not equal

A tibble: 3 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
002	32	test	6	2
004	42	test	6	5
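Conditions can be combined in other ways as well. This sketch, which assumes the same example data, shows comma-separated conditions (all must hold), | for "or", and %in% for matching against a set of values; the result names are illustrative.

```r
library(tidyverse)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

# Comma-separated conditions must all be true (same as &)
older_test <- data |>
  filter(condition == "test", age > 40)

# | keeps rows where at least one condition is true
either_high <- data |>
  filter(score1 > 5 | score2 > 5)

# %in% keeps rows whose value matches any element of a vector
chosen <- data |>
  filter(participant_id %in% c("001", "003"))

older_test
either_high
chosen
```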

Sorting data

Use arrange to sort the order of the rows in a data frame.

data

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
002	32	test	6	2
003	65	restudy	2	4
004	42	test	6	5

data |>
  arrange(score1)  # sort based on score1, in ascending order

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
003	65	restudy	2	4
001	25	restudy	3	8
002	32	test	6	2
004	42	test	6	5

data |>
  arrange(desc(score1))  # sort in descending order

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
002	32	test	6	2
004	42	test	6	5
001	25	restudy	3	8
003	65	restudy	2	4

data |>
  arrange(condition, participant_id)  # sort by multiple columns

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
003	65	restudy	2	4
002	32	test	6	2
004	42	test	6	5

data |>
  arrange(condition, desc(score2))  # sort score2 in descending order

A tibble: 4 × 5
participant_id	age	condition	score1	score2
<chr>	<dbl>	<chr>	<dbl>	<dbl>
001	25	restudy	3	8
003	65	restudy	2	4
004	42	test	6	5
002	32	test	6	2
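Sorting is often combined with taking the first few rows, for example to find the highest scores. A minimal sketch, assuming the same example data and using dplyr's slice_head to keep the top two rows after sorting:

```r
library(tidyverse)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

top2 <- data |>
  arrange(desc(score2)) |>  # highest score2 first
  slice_head(n = 2)         # keep the first two rows
top2
```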

Summary statistics

Use summary to display commonly used statistics for each column. Use summarise to calculate specific statistics, such as the mean or standard deviation.

summary(data)

 participant_id          age         condition             score1    
 Length:4           Min.   :25.00   Length:4           Min.   :2.00  
 Class :character   1st Qu.:30.25   Class :character   1st Qu.:2.75  
 Mode  :character   Median :37.00   Mode  :character   Median :4.50  
                    Mean   :41.00                      Mean   :4.25  
                    3rd Qu.:47.75                      3rd Qu.:6.00  
                    Max.   :65.00                      Max.   :6.00  
     score2    
 Min.   :2.00  
 1st Qu.:3.50  
 Median :4.50  
 Mean   :4.75  
 3rd Qu.:5.75  
 Max.   :8.00  

data |>
  summarise(mean = mean(score1), sd = sd(score1))  # two statistics

A tibble: 1 × 2
mean	sd
<dbl>	<dbl>
4.25	2.061553

data |>
  summarise(across(c(score1, score2), mean))  # mean for two columns

A tibble: 1 × 2
score1	score2
<dbl>	<dbl>
4.25	4.75

data |>
  summarise(across(c(score1, score2), .fns = list(mean = mean)))  # list names become suffixes

A tibble: 1 × 2
score1_mean	score2_mean
<dbl>	<dbl>
4.25	4.75
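The named list passed to across can hold more than one function, giving one output column per combination of input column and function. This sketch assumes the same example data; the name stats is illustrative.

```r
library(tidyverse)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

# Each name in the list becomes a suffix on the output column names
stats <- data |>
  summarise(across(c(score1, score2), list(mean = mean, sd = sd)))
stats
```

The result should have the columns score1_mean, score1_sd, score2_mean, and score2_sd, in that order.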