R Data Cleaning Cheat Sheet#
require(tidyverse)
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("scores.csv")
Rows: 4 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): participant_id, group
dbl (4): condition, test1, test2, test3
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Handling missing data sentinels#
Sometimes, special values (like -999) called sentinels are used to indicate missing data. The replace method can be used to replace these values with None to indicate that they are missing.
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | -999 |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
To replace missing data sentinels in the test3 column, using na_if.
data |>
mutate(test3 = na_if(test3, -999))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | NA |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
To replace missing data sentinels in multiple columns, use across to apply the na_if function to each column. The ~ indicates that the code after (here, na_if(.x, -999))) should be treated as a function, which will be applied to each of the indicated columns. The . is a placeholder for the column being evaluated.
data |>
mutate(across(c(test1, test2, test3), ~ na_if(., -999)))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | NA |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
Alternatively, we can set the na input when calling read_csv to set those special values to null immediately, in all columns. Here, setting show_col_types suppresses the messages that are usually shown when using read_csv.
data <- read_csv("scores.csv", na = "-999", show_col_types = FALSE)
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | NA |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
Recoding variables#
Use recode or case_match to change the values of variables and as.numeric or as.character to change the data type of variables.
data |>
mutate(group = recode(group, "exp" = "Experimental", "con" = "Control"))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | Experimental | 1 | 6 | 9 | NA |
| 01 | Experimental | 2 | 4 | 8 | 9 |
| 02 | Control | 1 | 9 | 10 | 8 |
| 02 | Control | 2 | 7 | 9 | 7 |
data |>
mutate(group = case_match(group, "exp" ~ "Experimental", "con" ~ "Control"))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | Experimental | 1 | 6 | 9 | NA |
| 01 | Experimental | 2 | 4 | 8 | 9 |
| 02 | Control | 1 | 9 | 10 | 8 |
| 02 | Control | 2 | 7 | 9 | 7 |
To change a variable into a different data type, use a type conversion function such as as.numeric or as.character.
data |>
mutate(condition = as.character(condition))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | NA |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
To change a numeric value to a string (known as a character vector in R), use case_match.
data |>
mutate(condition = case_match(condition, 1 ~ "target", 2 ~ "lure"))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| 01 | exp | target | 6 | 9 | NA |
| 01 | exp | lure | 4 | 8 | 9 |
| 02 | con | target | 9 | 10 | 8 |
| 02 | con | lure | 7 | 9 | 7 |
Working with missing data#
Missing data should be marked as NA in DataFrames. They may then easily be excluded from calculations.
To count the number of NA values in each column, use summarise_all (which applies a summarizing function to all columns) with ~ sum(is.na(.)), which sums the count of NA values for each column.
summarise_all(data, ~ sum(is.na(.)))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | <int> |
| 0 | 0 | 0 | 0 | 0 | 1 |
Calculations will not ignore NA values by default.
data |>
summarise(across(c(test1, test2, test3), mean))
| test1 | test2 | test3 |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 6.5 | 9 | NA |
Calculations can be set to ignore NA values by setting na.rm to TRUE.
data |>
summarise(across(c(test1, test2, test3), ~ mean(., na.rm = TRUE)))
| test1 | test2 | test3 |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 6.5 | 9 | 8 |
Alternatively, can replace NA values with some other value using replace_na.
data |>
mutate(across(c(test1, test2, test3), ~ replace_na(., 0)))
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | 0 |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
Use drop_na to completely exclude rows with NA.
data |>
drop_na()
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
Grouping and aggregation#
Use group_by and summarize to split data into groups and calculate statistics for each group.
This groups rows by the condition label, then calculates statistics for the test 1 scores within each group.
data |>
group_by(group) |>
summarize(test1 = mean(test1)) # mean for each condition
| group | test1 |
|---|---|
| <chr> | <dbl> |
| con | 8 |
| exp | 5 |
data |>
group_by(group) |>
summarise(test1 = sd(test1)) # standard deviation for each condition
| group | test1 |
|---|---|
| <chr> | <dbl> |
| con | 1.414214 |
| exp | 1.414214 |
We can also calculate multiple statistics at once.
data |>
group_by(group) |>
summarise(mean = mean(test1), sd = sd(test1))
| group | mean | sd |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| con | 8 | 1.414214 |
| exp | 5 | 1.414214 |
Grouping by multiple variables#
To calculate statistics for each combination of multiple variables, use group_by with multiple columns. Here, we set .groups to "drop", so that the data frame will no longer be grouped after the summarization.
data |>
group_by(group, condition) |>
summarise(test1 = mean(test1), .groups = "drop")
| group | condition | test1 |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| con | 1 | 9 |
| con | 2 | 7 |
| exp | 1 | 6 |
| exp | 2 | 4 |
Use n() to get the number of samples in each condition.
data |>
group_by(group, condition) |>
summarize(test1 = mean(test1), n = n(), .groups = "drop")
| group | condition | test1 | n |
|---|---|---|---|
| <chr> | <dbl> | <dbl> | <int> |
| con | 1 | 9 | 1 |
| con | 2 | 7 | 1 |
| exp | 1 | 6 | 1 |
| exp | 2 | 4 | 1 |
Reshaping data to long format#
In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.
This example data frame is wide format, because it has multiple test scores in each row.
data
| participant_id | group | condition | test1 | test2 | test3 |
|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | exp | 1 | 6 | 9 | NA |
| 01 | exp | 2 | 4 | 8 | 9 |
| 02 | con | 1 | 9 | 10 | 8 |
| 02 | con | 2 | 7 | 9 | 7 |
Convert from wide format to long format using pivot_longer. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.
data |>
pivot_longer(cols = starts_with("test"))
| participant_id | group | condition | name | value |
|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <chr> | <dbl> |
| 01 | exp | 1 | test1 | 6 |
| 01 | exp | 1 | test2 | 9 |
| 01 | exp | 1 | test3 | NA |
| 01 | exp | 2 | test1 | 4 |
| 01 | exp | 2 | test2 | 8 |
| 01 | exp | 2 | test3 | 9 |
| 02 | con | 1 | test1 | 9 |
| 02 | con | 1 | test2 | 10 |
| 02 | con | 1 | test3 | 8 |
| 02 | con | 2 | test1 | 7 |
| 02 | con | 2 | test2 | 9 |
| 02 | con | 2 | test3 | 7 |
To change the names of the new columns, use names_to to specify the name of the column that keeps the test label, and values_to to specify the name of the column with the score value.
long <- data |>
pivot_longer(
cols = starts_with("test"),
names_to = "test_type",
values_to = "score",
)
long
| participant_id | group | condition | test_type | score |
|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <chr> | <dbl> |
| 01 | exp | 1 | test1 | 6 |
| 01 | exp | 1 | test2 | 9 |
| 01 | exp | 1 | test3 | NA |
| 01 | exp | 2 | test1 | 4 |
| 01 | exp | 2 | test2 | 8 |
| 01 | exp | 2 | test3 | 9 |
| 02 | con | 1 | test1 | 9 |
| 02 | con | 1 | test2 | 10 |
| 02 | con | 1 | test3 | 8 |
| 02 | con | 2 | test1 | 7 |
| 02 | con | 2 | test2 | 9 |
| 02 | con | 2 | test3 | 7 |
Reshaping data to wide format#
Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.
In this example, we first get the mean score for each test type, for each participant.
means <- long |>
group_by(participant_id, test_type) |>
summarise(score = mean(score, na.rm = TRUE), .groups = "drop")
means
| participant_id | test_type | score |
|---|---|---|
| <chr> | <chr> | <dbl> |
| 01 | test1 | 5.0 |
| 01 | test2 | 8.5 |
| 01 | test3 | 9.0 |
| 02 | test1 | 8.0 |
| 02 | test2 | 9.5 |
| 02 | test3 | 7.5 |
Now we can use pivot_wider to make a table that shows the results for each participant, with a column for each test.
wide <- means |>
pivot_wider(names_from = test_type, values_from = score)
wide
| participant_id | test1 | test2 | test3 |
|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> |
| 01 | 5 | 8.5 | 9.0 |
| 02 | 8 | 9.5 | 7.5 |
With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.
wide |>
mutate(total = test1 + test2 + test3)
| participant_id | test1 | test2 | test3 | total |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 01 | 5 | 8.5 | 9.0 | 22.5 |
| 02 | 8 | 9.5 | 7.5 | 25.0 |