R Data Frames Cheat Sheet

Note: the course focuses on using Python. This sheet on R is included as additional reading, to show how similar principles of data analysis apply in the R language.

library(tidyverse)  # the tidyverse packages provide data science tools
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr     2.1.6
 forcats   1.0.1      stringr   1.6.0
 ggplot2   4.0.1      tibble    3.3.0
 lubridate 1.9.4      tidyr     1.3.2
 purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing data

Use read_csv to read data into a tibble. A tibble is a modified version of R’s standard data.frame type, designed to be more consistent and reliable.

data <- read_csv("people.csv")
data
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): name
dbl  (2): weight, height
date (1): birthdate
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
A spec_tbl_df: 4 × 4
name            birthdate   weight  height
<chr>           <date>      <dbl>   <dbl>
Alice Archer    1997-01-10  57.9    1.56
Ben Brown       1985-02-15  72.5    1.77
Chloe Cooper    1983-03-22  53.6    1.65
Daniel Donovan  1981-04-30  83.1    1.75
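
The column specification message shown above can be avoided by declaring the column types explicitly. The following is a minimal sketch, assuming a file with the same four columns; the temporary file and the variable names path and people are illustrative and not part of the original sheet.

```r
library(tidyverse)

# Write a small example file so the sketch is self-contained
# (two rows copied from the output above; the temporary path is illustrative)
path <- tempfile(fileext = ".csv")
write_lines(
  c(
    "name,birthdate,weight,height",
    "Alice Archer,1997-01-10,57.9,1.56",
    "Ben Brown,1985-02-15,72.5,1.77"
  ),
  path
)

# Declare each column's type up front instead of letting read_csv guess
people <- read_csv(
  path,
  col_types = cols(
    name = col_character(),
    birthdate = col_date(),
    weight = col_double(),
    height = col_double()
  )
)
people
```

Declaring types this way also makes the import fail loudly with parsing warnings if a column does not contain what the analysis expects.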

Data frame attributes

Base R functions report a data frame's column names, column types, and dimensions.

colnames(data)        # column names
sapply(data, typeof)  # data type of each column
dim(data)             # number of rows and columns
nrow(data)            # number of rows
ncol(data)            # number of columns
'name' 'birthdate' 'weight' 'height'
name: 'character', birthdate: 'double', weight: 'double', height: 'double'
4 4
4
4
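
Because the example data happens to have four rows and four columns, the row and column counts are easy to confuse. The sketch below uses a hypothetical 3 × 2 tibble (the name small and its contents are made up for illustration) so that each function returns a distinct value. It also shows glimpse, which prints a compact overview.

```r
library(tidyverse)

# Hypothetical tibble with 3 rows and 2 columns, so the counts differ
small <- tibble(
  id = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5)
)

nrow(small)     # number of rows
ncol(small)     # number of columns
length(small)   # also the number of columns: a data frame is a list of columns
glimpse(small)  # compact overview: dimensions, column names, types, first values
```

Note that length counts columns, not rows, so nrow is the function to use when the number of observations is needed.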

Creating a data frame directly

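To make a data frame in your code, rather than reading it from a file, use tribble. Column names are written as formulas (such as ~age) in the first row, followed by one row of values per observation.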

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)
data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
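
As an alternative to the row-by-row layout of tribble, the same data can be built column by column with tibble. This is a sketch for comparison; the name data_by_column is illustrative.

```r
library(tidyverse)

# Same data as the tribble example, specified one column at a time
data_by_column <- tibble(
  participant_id = c("001", "002", "003", "004"),
  age = c(25, 32, 65, 42),
  condition = c("restudy", "test", "restudy", "test"),
  score1 = c(3, 6, 2, 6),
  score2 = c(8, 2, 4, 5)
)
data_by_column
```

The row-wise form tends to be easier to read for small hand-typed tables, while the column-wise form is convenient when the columns already exist as vectors.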

Accessing data in a data frame

Use [] to get a subset of columns as a tibble, or [[]] to extract a single column as a vector.

data[c("score1", "score2")]
data[["score1"]]
A tibble: 4 × 2
score1  score2
<dbl>   <dbl>
3       8
6       2
2       4
6       5
3 6 2 6

Vectors may be accessed from a data frame and used for calculations.

score1 <- data[["score1"]]
score2 <- data[["score2"]]
diff <- score1 - score2
score1
score2
diff
3 6 2 6
8 2 4 5
-5 4 -2 1
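
Columns can also be extracted with $ or with the dplyr function pull, which fits into a pipeline. The sketch below repeats the example data so that it runs on its own; the name score_diff is illustrative.

```r
library(tidyverse)

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)

score1 <- data |> pull(score1)           # same vector as data[["score1"]]
score_diff <- data$score1 - data$score2  # $ is another way to get a column
mean(score_diff)                         # summarize the differences
```

All three access methods return a plain vector, so the result can be passed to any function that works on vectors, such as mean.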

Selecting and mutating columns

Use select to get a subset of columns and change their order. Use mutate to create new columns from existing ones.

data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
data |>
  select(score1, score2, participant_id)  # column selection and ordering
A tibble: 4 × 3
score1  score2  participant_id
<dbl>   <dbl>   <chr>
3       8       001
6       2       002
2       4       003
6       5       004
data |>
  select(score1, score2) |>              # select score columns
  mutate(score_total = score1 + score2)  # add a total column
A tibble: 4 × 3
score1  score2  score_total
<dbl>   <dbl>   <dbl>
3       8       11
6       2       8
2       4       6
6       5       11
data |>
  mutate(score_total = score1 + score2)  # just add a total column
A tibble: 4 × 6
participant_id  age    condition  score1  score2  score_total
<chr>           <dbl>  <chr>      <dbl>   <dbl>   <dbl>
001             25     restudy    3       8       11
002             32     test       6       2       8
003             65     restudy    2       4       6
004             42     test       6       5       11
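
Two further features of select and mutate are often useful: selection helpers such as starts_with, and creating several columns in one mutate call. The sketch below repeats the example data so that it runs on its own; the names scores and score_mean are illustrative.

```r
library(tidyverse)

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)

scores <- data |>
  select(participant_id, starts_with("score")) |>  # id plus every score column
  mutate(
    score_total = score1 + score2,  # new column from existing ones
    score_mean = score_total / 2    # can use a column created just above
  )
scores
```

Within a single mutate call, columns are created in order, which is why score_mean can refer to score_total.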

Using filter

Use filter to get subsets of rows.

data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
data |>
  filter(score1 > 2)  # rows where score1 is greater than 2
A tibble: 3 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
004             42     test       6       5
data |>
  filter(score1 > 2 & score2 > 2)  # both scores > 2
A tibble: 2 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
004             42     test       6       5
data |>
  filter(participant_id == "001")  # string comparisons also work
A tibble: 1 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
data |>
  filter(participant_id != "003")  # use != for not equal
A tibble: 3 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
004             42     test       6       5
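
Conditions in filter can also be combined with | (or), tested against a set of values with %in%, or separated by commas, which works like &. The sketch below repeats the example data so that it runs on its own; the result names are illustrative.

```r
library(tidyverse)

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)

# rows where either score is greater than 5
either <- data |>
  filter(score1 > 5 | score2 > 5)

# rows whose participant_id is one of the listed values
selected <- data |>
  filter(participant_id %in% c("001", "003"))

# comma-separated conditions must all be true
older_test <- data |>
  filter(condition == "test", age > 40)
```

Here either keeps participants 001, 002, and 004; selected keeps 001 and 003; and older_test keeps only 004.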

Sorting data

Use arrange to sort the order of the rows in a data frame.

data
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
002             32     test       6       2
003             65     restudy    2       4
004             42     test       6       5
data |>
  arrange(score1)  # sort based on score1, in ascending order
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
003             65     restudy    2       4
001             25     restudy    3       8
002             32     test       6       2
004             42     test       6       5
data |>
  arrange(desc(score1))  # sort in descending order
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
002             32     test       6       2
004             42     test       6       5
001             25     restudy    3       8
003             65     restudy    2       4
data |>
  arrange(condition, participant_id)  # sort by multiple columns
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
003             65     restudy    2       4
002             32     test       6       2
004             42     test       6       5
data |>
  arrange(condition, desc(score2))  # sort by condition, then by score2 in descending order
A tibble: 4 × 5
participant_id  age    condition  score1  score2
<chr>           <dbl>  <chr>      <dbl>   <dbl>
001             25     restudy    3       8
003             65     restudy    2       4
004             42     test       6       5
002             32     test       6       2
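
The verbs covered in this sheet can be chained into a single pipeline. The following sketch, which repeats the example data so that it runs on its own, keeps the test condition, adds a total, sorts by it, and keeps two columns; the name result is illustrative.

```r
library(tidyverse)

data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)

result <- data |>
  filter(condition == "test") |>            # keep rows in the test condition
  mutate(score_total = score1 + score2) |>  # add a total column
  arrange(desc(score_total)) |>             # highest total first
  select(participant_id, score_total)       # keep only the columns of interest
result
```

Each step receives the output of the previous one, so the original data is left unchanged and the pipeline reads from top to bottom in the order the operations are applied.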