# R Data Frames Cheat Sheet

Note: the course focuses on using Python. This sheet on R is included as additional reading, to show how similar principles of data analysis apply in the R language.

In [1]:
library(tidyverse)  # the tidyverse packages provide data science tools

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Importing data

Use `read_csv` to read a CSV file into a tibble. A tibble is a modified version of R's standard `data.frame` type, designed to be more consistent and reliable.

In [2]:
data <- read_csv("people.csv")
data

Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): name
dbl  (2): weight, height
date (1): birthdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


name,birthdate,weight,height
<chr>,<date>,<dbl>,<dbl>
Alice Archer,1997-01-10,57.9,1.56
Ben Brown,1985-02-15,72.5,1.77
Chloe Cooper,1983-03-22,53.6,1.65
Daniel Donovan,1981-04-30,83.1,1.75
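
The column specification message can be avoided by declaring the column types up front. The following is a minimal sketch: it writes a two-row copy of the table above to a temporary file (the `csv_path` and `people` names are only for illustration) so that it runs without needing `people.csv`.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained;
# the rows are copied from the people.csv table shown above.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,birthdate,weight,height",
  "Alice Archer,1997-01-10,57.9,1.56",
  "Ben Brown,1985-02-15,72.5,1.77"
), csv_path)

# Declare each column's type explicitly instead of letting readr guess.
people <- read_csv(
  csv_path,
  col_types = cols(
    name = col_character(),
    birthdate = col_date(),
    weight = col_double(),
    height = col_double()
  )
)
people
```

If guessing the types is fine and only the message is unwanted, passing `show_col_types = FALSE` to `read_csv` is the shorter option.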


## Data frame attributes

Several functions report basic information about a data frame, such as its column names, column types, and dimensions.

In [3]:
colnames(data)        # column names
sapply(data, typeof)  # data type of each column
dim(data)             # number of rows and columns
nrow(data)            # number of rows
ncol(data)            # number of columns
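
One detail worth checking when counting rows: a data frame is stored as a list of columns, so `length()` reports the number of columns, while `nrow()` reports the number of rows. A small sketch with a hypothetical three-row tibble named `example`:

```r
library(dplyr)

# A hypothetical tibble with 3 rows and 2 columns.
example <- tibble(x = c(1, 2, 3), y = c("a", "b", "c"))

nrow(example)     # number of rows
ncol(example)     # number of columns
length(example)   # same as ncol(): a data frame is a list of columns
glimpse(example)  # compact overview of column names, types, and values
```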

## Creating a data frame directly

To make a data frame in your code, rather than reading it from a file, use `tribble`, which lets you lay out the data row by row.

In [4]:
data <- tribble(
    ~participant_id, ~age, ~condition, ~score1, ~score2,
    "001", 25, "restudy", 3, 8,
    "002", 32, "test",    6, 2,
    "003", 65, "restudy", 2, 4,
    "004", 42, "test",    6, 5,
)
data

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
002,32,test,6,2
003,65,restudy,2,4
004,42,test,6,5
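
When the values are already available as vectors, the same table can instead be built column by column with `tibble()`. This is a sketch of that alternative; the name `by_column` is only for illustration.

```r
library(tibble)

# Same contents as the tribble above, specified one column at a time.
by_column <- tibble(
  participant_id = c("001", "002", "003", "004"),
  age = c(25, 32, 65, 42),
  condition = c("restudy", "test", "restudy", "test"),
  score1 = c(3, 6, 2, 6),
  score2 = c(8, 2, 4, 5)
)
by_column
```

The row-wise `tribble` layout tends to be easier to read for small hand-typed tables, while `tibble()` fits better when each column comes from an existing vector.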


## Accessing data in a data frame

Use `[]` to get a subset of columns as a tibble, or `[[]]` to extract a single column as a vector.

In [5]:
data[c("score1", "score2")]
data[["score1"]]

score1,score2
<dbl>,<dbl>
3,8
6,2
2,4
6,5


Vectors may be accessed from a data frame and used for calculations.

In [6]:
score1 <- data[["score1"]]
score2 <- data[["score2"]]
score_diff <- score1 - score2  # element-wise difference; avoids masking base::diff()
score1
score2
score_diff
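
Two other common ways to extract a column as a vector are the `$` operator and dplyr's `pull()`. The sketch below uses a small hypothetical tibble named `scores` and checks that all three approaches return the same vector.

```r
library(dplyr)

# Hypothetical scores table, matching the score columns used above.
scores <- tibble(score1 = c(3, 6, 2, 6), score2 = c(8, 2, 4, 5))

by_brackets <- scores[["score1"]]      # double-bracket extraction
by_dollar <- scores$score1             # $ extraction
by_pull <- scores |> pull(score1)      # pull() works at the end of a pipeline

identical(by_brackets, by_dollar)
identical(by_brackets, by_pull)

# Extracted vectors can go straight into calculations.
mean(scores$score1 - scores$score2)
```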

## Selecting and mutating columns

Use `select` to get a subset of columns and change their order. Use `mutate` to create new columns from existing ones.

In [7]:
data

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
002,32,test,6,2
003,65,restudy,2,4
004,42,test,6,5


In [8]:
data |>
  select(score1, score2, participant_id)  # column selection and ordering

score1,score2,participant_id
<dbl>,<dbl>,<chr>
3,8,001
6,2,002
2,4,003
6,5,004


In [9]:
data |>
  select(score1, score2) |>              # select score columns
  mutate(score_total = score1 + score2)  # add a total column

score1,score2,score_total
<dbl>,<dbl>,<dbl>
3,8,11
6,2,8
2,4,6
6,5,11


In [10]:
data |>
  mutate(score_total = score1 + score2)  # just add a total column

participant_id,age,condition,score1,score2,score_total
<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>
001,25,restudy,3,8,11
002,32,test,6,2,8
003,65,restudy,2,4,6
004,42,test,6,5,11
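
When several columns share a naming pattern, `select` can pick them with a helper such as `starts_with()` instead of listing each one. A brief sketch, rebuilding the same table so it runs on its own; `score_mean` and `result` are illustrative names.

```r
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

result <- data |>
  select(participant_id, starts_with("score")) |>  # id plus every score column
  mutate(score_mean = (score1 + score2) / 2)       # average of the two scores
result
```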


## Using filter

Use `filter` to get subsets of rows.

In [11]:
data

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
002,32,test,6,2
003,65,restudy,2,4
004,42,test,6,5


In [12]:
data |>
  filter(score1 > 2)  # rows where score 1 is greater than 2

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
002,32,test,6,2
004,42,test,6,5


In [13]:
data |>
  filter(score1 > 2 & score2 > 2)  # both scores > 2

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
004,42,test,6,5


In [14]:
data |>
  filter(participant_id == "001")  # strings can be compared too

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8


In [15]:
data |>
  filter(participant_id != "003")  # use != for not equal

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
002,32,test,6,2
004,42,test,6,5
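
Conditions can also be combined with `|` (or) and `%in%` (membership in a set), and separating conditions with a comma inside `filter` behaves like `&`. A short sketch on the same table; `either_high` and `young_selected` are illustrative names.

```r
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

# Rows where at least one of the scores is above 5.
either_high <- data |>
  filter(score1 > 5 | score2 > 5)

# Rows for the listed participants who are also under 40;
# the comma between conditions means both must hold.
young_selected <- data |>
  filter(participant_id %in% c("001", "004"), age < 40)

either_high
young_selected
```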


## Sorting data

Use `arrange` to sort the order of the rows in a data frame.

In [16]:
data

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
002,32,test,6,2
003,65,restudy,2,4
004,42,test,6,5


In [17]:
data |>
  arrange(score1)  # sort based on score1, in ascending order

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
003,65,restudy,2,4
001,25,restudy,3,8
002,32,test,6,2
004,42,test,6,5


In [18]:
data |>
  arrange(desc(score1))  # sort in descending order

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
002,32,test,6,2
004,42,test,6,5
001,25,restudy,3,8
003,65,restudy,2,4


In [19]:
data |>
  arrange(condition, participant_id)  # sort by multiple columns

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
003,65,restudy,2,4
002,32,test,6,2
004,42,test,6,5


In [20]:
data |>
  arrange(condition, desc(score2))  # sort by condition, then by score2 in descending order

participant_id,age,condition,score1,score2
<chr>,<dbl>,<chr>,<dbl>,<dbl>
001,25,restudy,3,8
003,65,restudy,2,4
004,42,test,6,5
002,32,test,6,2
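
The verbs covered in this sheet can be chained in one pipeline. As a rough sketch, the following keeps only the `test` condition, adds a total score, sorts by it in descending order, and keeps two columns; `ranked` is an illustrative name.

```r
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test",    6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test",    6, 5,
)

ranked <- data |>
  filter(condition == "test") |>            # keep rows in the test condition
  mutate(score_total = score1 + score2) |>  # add a total column
  arrange(desc(score_total)) |>             # highest total first
  select(participant_id, score_total)       # keep only the columns of interest
ranked
```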
