7.12. R Data Frames Cheat Sheet

Note: the course focuses on using Python. This sheet on R is included as additional reading, to show how similar principles of data analysis apply in the R language.

library(tidyverse)  # the tidyverse packages provide data science tools
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.2.0      readr     2.2.0
 forcats   1.0.1      stringr   1.6.0
 ggplot2   4.0.2      tibble    3.3.1
 lubridate 1.9.5      tidyr     1.3.2
 purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing data

Use read_csv to read data into a tibble. A tibble is a modified version of R’s standard data.frame type, designed to be more consistent and reliable. By default, text in YYYY-MM-DD format is automatically parsed as dates.

data <- read_csv("participants.csv")
data
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): participant_id
dbl  (2): score_task1, score_task2
date (1): session_date
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
A spec_tbl_df: 4 × 4
participant_id  session_date  score_task1  score_task2
<chr>           <date>        <dbl>        <dbl>
sub-001         2025-01-08    8            6
sub-002         2025-01-18    4            5
sub-003         2025-01-25    6            4
sub-004         2025-02-02    5            5
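
Column types can also be declared explicitly rather than guessed. The following is a minimal sketch, assuming the same four columns as above; the inline CSV text is illustrative and stands in for participants.csv, passed through I() so the example runs without a file.

```r
library(readr)

# Inline CSV text standing in for participants.csv (illustrative data)
csv_text <- I("participant_id,session_date,score_task1,score_task2
sub-001,2025-01-08,8,6
sub-002,2025-01-18,4,5
")

# Declare each column's type instead of letting read_csv guess
data_typed <- read_csv(
  csv_text,
  col_types = cols(
    participant_id = col_character(),
    session_date = col_date(format = "%Y-%m-%d"),
    score_task1 = col_double(),
    score_task2 = col_double()
  )
)

class(data_typed$session_date)  # "Date"
```

Supplying col_types also silences the column specification message, which can be useful in scripts.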

Date variables support arithmetic, such as calculating the length of time between two dates. Use as.Date to create an individual date from text.

data |>
  mutate(delay = session_date - as.Date("2025-01-01"))
A tibble: 4 × 5
participant_id  session_date  score_task1  score_task2  delay
<chr>           <date>        <dbl>        <dbl>        <drtn>
sub-001         2025-01-08    8            6            7 days
sub-002         2025-01-18    4            5            17 days
sub-003         2025-01-25    6            4            24 days
sub-004         2025-02-02    5            5            32 days
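
Subtracting dates gives a duration (difftime) rather than a plain number. As a small sketch using base R only, with two of the session dates from above, a duration can be converted to a number in a chosen unit:

```r
# Dates created from text with as.Date
start <- as.Date("2025-01-01")
session <- as.Date(c("2025-01-08", "2025-02-02"))

delay <- session - start            # durations of 7 and 32 days
as.numeric(delay, units = "days")   # plain numbers: 7 32
as.numeric(delay, units = "weeks")  # the same durations expressed in weeks
```

Converting to a plain number is helpful when the delay will be used in later calculations, such as a mean or a model predictor.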

Data frame attributes

Various functions are used to access information about data frames.

colnames(data)        # column names
sapply(data, typeof)  # data type of each column
dim(data)             # number of rows and columns
nrow(data)            # number of rows
ncol(data)            # number of columns
  1. 'participant_id'
  2. 'session_date'
  3. 'score_task1'
  4. 'score_task2'
participant_id
'character'
session_date
'double'
score_task1
'double'
score_task2
'double'
  1. 4
  2. 4
4
4
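
Because the example data has four rows and four columns, it is hard to tell from the output which count each function reports. The sketch below uses a hypothetical 3 × 2 tibble (not part of the course data) to check what each function returns, and to compare typeof with class for a date column.

```r
library(tibble)

# Hypothetical data with 3 rows and 2 columns, so the counts differ
small <- tibble(
  session_date = as.Date(c("2025-01-08", "2025-01-18", "2025-01-25")),
  score = c(8, 4, 6)
)

nrow(small)    # 3 rows
ncol(small)    # 2 columns
length(small)  # 2, because a data frame is a list of columns
dim(small)     # 3 2

sapply(small, typeof)  # storage type: "double" for both columns
sapply(small, class)   # class: "Date" and "numeric"
```

Dates are stored as numbers internally, which is why typeof reports 'double' for a date column; class is usually the more informative check.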

Creating a data frame directly

To create a data frame directly in your code, rather than reading it from a file, use tribble. Each column name is written with a leading ~, and the data are then entered row by row.

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)
data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
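
A data frame can also be built column by column with tibble. This is a brief sketch using a shortened version of the data above; the names by_row and by_col are only for illustration.

```r
library(tibble)

# Row-by-row layout with tribble
by_row <- tribble(
  ~participant_id, ~age, ~condition,
  "001", 25, "restudy",
  "002", 32, "test"
)

# Column-by-column layout with tibble
by_col <- tibble(
  participant_id = c("001", "002"),
  age = c(25, 32),
  condition = c("restudy", "test")
)

all.equal(by_row, by_col)  # check that both layouts give the same data
```

The row-wise layout tends to be easier to read for small hand-typed tables, while the column-wise layout is convenient when each column already exists as a vector.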

Accessing data in a data frame

Use [] to get a subset of columns as a tibble, or [[]] to extract a single column as a vector.

data[c("score1", "score2")]
data[["score1"]]
A tibble: 4 × 2
score1  score2
<dbl>   <dbl>
3       8
6       2
2       4
6       5
  1. 3
  2. 6
  3. 2
  4. 6
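
Two other common ways to extract a single column are $ and dplyr's pull. A short sketch, assuming a cut-down version of the data above:

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~participant_id, ~score1, ~score2,
  "001", 3, 8,
  "002", 6, 2
)

data["score1"]      # single brackets: a one-column tibble
data[["score1"]]    # double brackets: a numeric vector
data$score1         # $ returns the same vector
pull(data, score1)  # dplyr equivalent, convenient at the end of a pipeline
```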

Vectors may be accessed from a data frame and used for calculations.

score1 <- data[["score1"]]
score2 <- data[["score2"]]
diff <- score1 - score2
score1
score2
diff
  1. 3
  2. 6
  3. 2
  4. 6
  1. 8
  2. 2
  3. 4
  4. 5
  1. -5
  2. 4
  3. -2
  4. 1
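
Calculations on vectors work element by element, and the result can be passed to functions that reduce it to a single value. A small sketch using the same score values; the name score_diff is chosen here for illustration.

```r
score1 <- c(3, 6, 2, 6)
score2 <- c(8, 2, 4, 5)

score_diff <- score1 - score2  # element-wise: -5 4 -2 1
mean(score_diff)               # -0.5
sum(score_diff > 0)            # 2 participants scored higher on score1
score1 * 2                     # a single number is applied to every element
```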

Selecting and mutating columns

Use select to get a subset of columns and change their order. Use mutate to create new columns from existing ones.

data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
data |>
  select(score1, score2, participant_id)  # column selection and ordering
A tibble: 4 × 3
score1  score2  participant_id
<dbl>   <dbl>   <chr>
3       8       001
6       2       002
2       4       003
6       5       004
data |>
  select(score1, score2) |>              # select score columns
  mutate(score_total = score1 + score2)  # add a total column
A tibble: 4 × 3
score1  score2  score_total
<dbl>   <dbl>   <dbl>
3       8       11
6       2       8
2       4       6
6       5       11
data |>
  mutate(score_total = score1 + score2)  # just add a total column
A tibble: 4 × 6
participant_id  age    condition  score1  score2  score_total
<chr>           <dbl>  <chr>      <dbl>   <dbl>   <dbl>
001             25     restudy    3       8       11
002             32     test       6       2       8
003             65     restudy    2       4       6
004             42     test       6       5       11
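
When several columns share a naming pattern, a selection helper such as starts_with can pick them all at once, and a single mutate call can create more than one column. A minimal sketch, assuming a shortened version of the data above; score_mean is a column name introduced here for illustration.

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2
)

result <- data |>
  select(participant_id, starts_with("score")) |>  # id plus all score columns
  mutate(
    score_total = score1 + score2,  # 11 and 8
    score_mean = score_total / 2    # can use a column created just above
  )
result
```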

Using filter

Use filter to get subsets of rows.

data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
data |>
  filter(score1 > 2)  # rows where score 1 is greater than 2
A tibble: 3 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
004             42     test       6       5
data |>
  filter(score1 > 2 & score2 > 2)  # both scores > 2
A tibble: 2 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
004             42     test       6       5
data |>
  filter(participant_id == "001")  # can test strings also
A tibble: 1 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
data |>
  filter(participant_id != "003")  # use != for not equal
A tibble: 3 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
004             42     test       6       5
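
Conditions can be combined in a few other ways. The sketch below, which rebuilds the example data so that it runs on its own, shows comma-separated conditions (all must be true), | for "or", and %in% for matching any of several values. The result names are only for illustration.

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5
)

# Comma-separated conditions must all be true (same as &)
older_test <- data |> filter(condition == "test", age > 35)

# | keeps rows where either condition is true
any_low <- data |> filter(score1 < 3 | score2 < 3)

# %in% matches any value in a vector
chosen <- data |> filter(participant_id %in% c("001", "003"))

older_test$participant_id  # "004"
any_low$participant_id     # "002" "003"
chosen$participant_id      # "001" "003"
```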

Sorting data

Use arrange to sort the order of the rows in a data frame.

data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
data |>
  arrange(score1)  # sort based on score1, in ascending order
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
003             65     restudy    2       4
001             25     restudy    3       8
002             32     test       6       2
004             42     test       6       5
data |>
  arrange(desc(score1))  # sort in descending order
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
002             32     test       6       2
004             42     test       6       5
001             25     restudy    3       8
003             65     restudy    2       4
data |>
  arrange(condition, participant_id)  # sort by multiple columns
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
003             65     restudy    2       4
002             32     test       6       2
004             42     test       6       5
data |>
  arrange(condition, desc(score2))  # sort by condition, then by score2 in descending order
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
003             65     restudy    2       4
004             42     test       6       5
002             32     test       6       2
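
When the first sorting column contains ties, later columns decide the order among the tied rows. As a sketch that rebuilds the example data so it runs on its own, participants 002 and 004 both have a score1 of 6, so a second column is used to break the tie:

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5
)

# Highest score1 first; among equal scores, oldest participant first
sorted <- data |>
  arrange(desc(score1), desc(age))

sorted$participant_id  # "004" "002" "001" "003"
```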

Summary statistics

Use summary to display commonly used statistics for each column, and summarise to calculate summary statistics such as the mean or standard deviation.

summary(data)
 participant_id          age         condition             score1    
 Length:4           Min.   :25.00   Length:4           Min.   :2.00  
 Class :character   1st Qu.:30.25   Class :character   1st Qu.:2.75  
 Mode  :character   Median :37.00   Mode  :character   Median :4.50  
                    Mean   :41.00                      Mean   :4.25  
                    3rd Qu.:47.75                      3rd Qu.:6.00  
                    Max.   :65.00                      Max.   :6.00  
     score2    
 Min.   :2.00  
 1st Qu.:3.50  
 Median :4.50  
 Mean   :4.75  
 3rd Qu.:5.75  
 Max.   :8.00  
data |>
  summarise(mean = mean(score1), sd = sd(score1))  # two statistics
A tibble: 1 × 2
mean   sd
<dbl>  <dbl>
4.25   2.061553
data |>
  summarise(across(c(score1, score2), mean))  # mean for two columns
A tibble: 1 × 2
score1  score2
<dbl>   <dbl>
4.25    4.75
data |>
  summarise(across(c(score1, score2), .fns = list(mean = mean)))  # a named list adds the name as a suffix
A tibble: 1 × 2
score1_mean  score2_mean
<dbl>        <dbl>
4.25         4.75
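
A named list passed to across can hold more than one function, giving one output column per combination of input column and function. The sketch below rebuilds the relevant columns of the example data so it runs on its own; it also shows the .names argument of across, which controls how output columns are named. The result names stats and renamed are only for illustration.

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~participant_id, ~score1, ~score2,
  "001", 3, 8,
  "002", 6, 2,
  "003", 2, 4,
  "004", 6, 5
)

# Two statistics for each of two columns
stats <- data |>
  summarise(across(c(score1, score2), list(mean = mean, sd = sd)))
colnames(stats)  # "score1_mean" "score1_sd" "score2_mean" "score2_sd"

# Use .names to put the function name first instead
renamed <- data |>
  summarise(across(c(score1, score2), list(mean = mean), .names = "{.fn}_{.col}"))
colnames(renamed)  # "mean_score1" "mean_score2"
```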