R Data Frames Cheat Sheet
Note: the course focuses on using Python. This sheet on R is included as additional reading, to show how similar principles of data analysis apply in the R language.
library(tidyverse) # the tidyverse packages provide data science tools
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Importing data
Use read_csv to read data from a CSV file into a tibble. A tibble is a modified version of R's standard data.frame type, designed to be more consistent and reliable.
data <- read_csv("people.csv")
data
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (2): weight, height
date (1): birthdate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| name | birthdate | weight | height |
|---|---|---|---|
| <chr> | <date> | <dbl> | <dbl> |
| Alice Archer | 1997-01-10 | 57.9 | 1.56 |
| Ben Brown | 1985-02-15 | 72.5 | 1.77 |
| Chloe Cooper | 1983-03-22 | 53.6 | 1.65 |
| Daniel Donovan | 1981-04-30 | 83.1 | 1.75 |
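The message above suggests specifying the column types. A minimal sketch of how that might look, assuming the file contains the four columns shown in the table; the CSV text is inlined here (wrapped in I() so readr treats it as literal data rather than a file path) so that the example runs without people.csv, and the variable names csv_text and people are only illustrative.

```r
library(readr)

# Inline CSV text standing in for people.csv (values copied from the table above)
csv_text <- "name,birthdate,weight,height
Alice Archer,1997-01-10,57.9,1.56
Ben Brown,1985-02-15,72.5,1.77
Chloe Cooper,1983-03-22,53.6,1.65
Daniel Donovan,1981-04-30,83.1,1.75
"

# Declaring col_types makes the expected types explicit and
# suppresses the column specification message
people <- read_csv(
  I(csv_text),
  col_types = cols(
    name = col_character(),
    birthdate = col_date(),
    weight = col_double(),
    height = col_double()
  )
)
people
```

With a real file, the first argument would be the path, for example "people.csv", and the col_types argument would stay the same.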
Data frame attributes
Various functions are used to access information about data frames.
colnames(data) # column names
sapply(data, typeof) # storage type of each column (dates are stored as doubles)
dim(data) # number of rows and columns
nrow(data) # number of rows
ncol(data) # number of columns
- 'name'
- 'birthdate'
- 'weight'
- 'height'
- name: 'character'
- birthdate: 'double'
- weight: 'double'
- height: 'double'
- 4
- 4
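The birthdate column is reported as 'double' because typeof shows how values are stored, not how R treats them. The sketch below, which uses a small hypothetical tibble rather than people.csv, compares typeof with class and also shows that length counts the columns of a data frame, not its rows.

```r
library(tibble)

# A small stand-in data frame with a date column
people <- tibble(
  name = c("Alice Archer", "Ben Brown"),
  birthdate = as.Date(c("1997-01-10", "1985-02-15")),
  weight = c(57.9, 72.5)
)

storage_types <- sapply(people, typeof) # how each column is stored internally
classes <- sapply(people, class)        # how R treats each column

storage_types[["birthdate"]] # "double": dates are stored as day counts
classes[["birthdate"]]       # "Date"

nrow(people)   # 2 rows
length(people) # 3, because the length of a data frame is its number of columns
```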
Creating a data frame directly
To make a data frame in your code, rather than reading it from a file, use tribble. List the column names first, each prefixed with ~, then give the values row by row.
data <- tribble(
~participant_id, ~age, ~condition, ~score1, ~score2,
"001", 25, "restudy", 3, 8,
"002", 32, "test", 6, 2,
"003", 65, "restudy", 2, 4,
"004", 42, "test", 6, 5,
)
data
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 003 | 65 | restudy | 2 | 4 |
| 004 | 42 | test | 6 | 5 |
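As an alternative, the same data could be written column by column using tibble. This is only a sketch of that option; the name by_column is illustrative.

```r
library(tibble)

# Each argument is one column, given as a vector of equal length
by_column <- tibble(
  participant_id = c("001", "002", "003", "004"),
  age = c(25, 32, 65, 42),
  condition = c("restudy", "test", "restudy", "test"),
  score1 = c(3, 6, 2, 6),
  score2 = c(8, 2, 4, 5)
)
by_column
```

The row-wise layout of tribble tends to be easier to read for small tables, while the column-wise form is convenient when the columns already exist as vectors.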
Accessing data in a data frame
Use [] to get a subset of columns as a tibble, or [[]] to extract a single column as a vector.
data[c("score1", "score2")]
data[["score1"]]
| score1 | score2 |
|---|---|
| <dbl> | <dbl> |
| 3 | 8 |
| 6 | 2 |
| 2 | 4 |
| 6 | 5 |
- 3
- 6
- 2
- 6
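The difference between the two bracket forms is easiest to see by checking what each one returns. The sketch below rebuilds the example data so it runs on its own, and also shows two other common ways to extract a column: the $ operator and dplyr's pull.

```r
library(tibble)
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test", 6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test", 6, 5,
)

subset_tbl <- data["score1"]      # single brackets: a one-column tibble
score_vec <- data[["score1"]]     # double brackets: a plain numeric vector
same_vec <- data$score1           # $ extracts a column by its literal name
piped_vec <- data |> pull(score1) # pull() extracts a column inside a pipeline

is_tibble(subset_tbl) # TRUE
is.numeric(score_vec) # TRUE
```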
Vectors may be accessed from a data frame and used for calculations.
score1 <- data[["score1"]]
score2 <- data[["score2"]]
diff <- score1 - score2
score1
score2
diff
- 3
- 6
- 2
- 6
- 8
- 2
- 4
- 5
- -5
- 4
- -2
- 1
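Once a column is a vector, it can be passed to any function that works on vectors. A short sketch using the same scores; the name score_diff is an illustrative choice that avoids masking R's built-in diff function.

```r
library(tibble)

data <- tribble(
  ~participant_id, ~score1, ~score2,
  "001", 3, 8,
  "002", 6, 2,
  "003", 2, 4,
  "004", 6, 5,
)

# Arithmetic on vectors is element-wise: one difference per participant
score_diff <- data[["score1"]] - data[["score2"]]

mean(score_diff)    # average difference across participants
sum(score_diff > 0) # number of participants with score1 above score2
```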
Selecting and mutating columns
Use select to get a subset of columns and change their order. Use mutate to create new columns from existing ones.
data
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 003 | 65 | restudy | 2 | 4 |
| 004 | 42 | test | 6 | 5 |
data |>
select(score1, score2, participant_id) # column selection and ordering
| score1 | score2 | participant_id |
|---|---|---|
| <dbl> | <dbl> | <chr> |
| 3 | 8 | 001 |
| 6 | 2 | 002 |
| 2 | 4 | 003 |
| 6 | 5 | 004 |
data |>
select(score1, score2) |> # select score columns
mutate(score_total = score1 + score2) # add a total column
| score1 | score2 | score_total |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 3 | 8 | 11 |
| 6 | 2 | 8 |
| 2 | 4 | 6 |
| 6 | 5 | 11 |
data |>
mutate(score_total = score1 + score2) # add a total column, keeping all existing columns
| participant_id | age | condition | score1 | score2 | score_total |
|---|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 | 11 |
| 002 | 32 | test | 6 | 2 | 8 |
| 003 | 65 | restudy | 2 | 4 | 6 |
| 004 | 42 | test | 6 | 5 | 11 |
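select also accepts helper functions such as starts_with, and a single mutate call can create several columns. The following is a sketch of both, rebuilding the example data so it runs on its own; score_mean is an illustrative column name.

```r
library(tibble)
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test", 6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test", 6, 5,
)

result <- data |>
  # keep the id plus every column whose name starts with "score"
  select(participant_id, starts_with("score")) |>
  mutate(
    score_total = score1 + score2,
    score_mean = score_total / 2 # later expressions can use columns created earlier
  )
result
```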
Using filter
Use filter to keep only the rows that satisfy a condition.
data
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 003 | 65 | restudy | 2 | 4 |
| 004 | 42 | test | 6 | 5 |
data |>
filter(score1 > 2) # rows where score1 is greater than 2
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 004 | 42 | test | 6 | 5 |
data |>
filter(score1 > 2 & score2 > 2) # both scores > 2
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 004 | 42 | test | 6 | 5 |
data |>
filter(participant_id == "001") # strings can be compared too
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
data |>
filter(participant_id != "003") # use != for not equal
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 004 | 42 | test | 6 | 5 |
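Conditions can be combined in other ways as well. The sketch below shows | for "or", %in% for matching any of several values, and separating conditions with commas, which filter treats the same as &. The data is rebuilt so the example runs on its own, and the result names are illustrative.

```r
library(tibble)
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test", 6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test", 6, 5,
)

# rows where at least one score is greater than 5
either_high <- data |>
  filter(score1 > 5 | score2 > 5)

# rows whose id matches any value in the vector
selected <- data |>
  filter(participant_id %in% c("001", "003"))

# comma-separated conditions must all be true
restudy_older <- data |>
  filter(condition == "restudy", age > 30)
```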
Sorting data
Use arrange to sort the order of the rows in a data frame.
data
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 003 | 65 | restudy | 2 | 4 |
| 004 | 42 | test | 6 | 5 |
data |>
arrange(score1) # sort based on score1, in ascending order
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 003 | 65 | restudy | 2 | 4 |
| 001 | 25 | restudy | 3 | 8 |
| 002 | 32 | test | 6 | 2 |
| 004 | 42 | test | 6 | 5 |
data |>
arrange(desc(score1)) # sort in descending order
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 002 | 32 | test | 6 | 2 |
| 004 | 42 | test | 6 | 5 |
| 001 | 25 | restudy | 3 | 8 |
| 003 | 65 | restudy | 2 | 4 |
data |>
arrange(condition, participant_id) # sort by multiple columns
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 003 | 65 | restudy | 2 | 4 |
| 002 | 32 | test | 6 | 2 |
| 004 | 42 | test | 6 | 5 |
data |>
arrange(condition, desc(score2)) # sort by condition, then by score2 in descending order
| participant_id | age | condition | score1 | score2 |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> | <dbl> |
| 001 | 25 | restudy | 3 | 8 |
| 003 | 65 | restudy | 2 | 4 |
| 004 | 42 | test | 6 | 5 |
| 002 | 32 | test | 6 | 2 |
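The functions covered in this sheet can be chained into a single pipeline. As a sketch, the following keeps the participants in the test condition, adds a total score, sorts by it from highest to lowest, and returns only the id and total. The data is rebuilt so the example runs on its own.

```r
library(tibble)
library(dplyr)

data <- tribble(
  ~participant_id, ~age, ~condition, ~score1, ~score2,
  "001", 25, "restudy", 3, 8,
  "002", 32, "test", 6, 2,
  "003", 65, "restudy", 2, 4,
  "004", 42, "test", 6, 5,
)

result <- data |>
  filter(condition == "test") |>           # keep rows in the test condition
  mutate(score_total = score1 + score2) |> # add a total column
  arrange(desc(score_total)) |>            # highest total first
  select(participant_id, score_total)      # keep only the columns of interest
result
```

Each step receives the output of the previous one, so the original data is left unchanged.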