R Data Cleaning Cheat Sheet

8.11. R Data Cleaning Cheat Sheet#

require(tidyverse)

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read_csv("scores.csv")

Rows: 4 Columns: 6

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): participant_id, group
dbl (4): condition, test1, test2, test3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Handling missing data sentinels#

Sometimes, special values (like -999) called sentinels are used to indicate missing data. The replace method can be used to replace these values with None to indicate that they are missing.

data

A spec_tbl_df: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	-999
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

To replace missing data sentinels in the test3 column, using na_if.

data |>
  mutate(test3 = na_if(test3, -999))

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	NA
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

To replace missing data sentinels in multiple columns, use across to apply the na_if function to each column. The ~ indicates that the code after (here, na_if(.x, -999))) should be treated as a function, which will be applied to each of the indicated columns. The . is a placeholder for the column being evaluated.

data |>
  mutate(across(c(test1, test2, test3), ~ na_if(., -999)))

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	NA
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

Alternatively, we can set the na input when calling read_csv to set those special values to null immediately, in all columns. Here, setting show_col_types suppresses the messages that are usually shown when using read_csv.

data <- read_csv("scores.csv", na = "-999", show_col_types = FALSE)
data

A spec_tbl_df: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	NA
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

Recoding variables#

Use recode or case_match to change the values of variables and as.numeric or as.character to change the data type of variables.

data |>
  mutate(group = recode(group, "exp" = "Experimental", "con" = "Control"))

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	Experimental	1	6	9	NA
01	Experimental	2	4	8	9
02	Control	1	9	10	8
02	Control	2	7	9	7

data |>
  mutate(group = case_match(group, "exp" ~ "Experimental", "con" ~ "Control"))

Warning message:
“There was 1 warning in `mutate()`.
ℹ In argument: `group = case_match(group, "exp" ~ "Experimental", "con" ~
  "Control")`.
Caused by warning:
! `case_match()` was deprecated in dplyr 1.2.0.
ℹ Please use `recode_values()` instead.”

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	Experimental	1	6	9	NA
01	Experimental	2	4	8	9
02	Control	1	9	10	8
02	Control	2	7	9	7

To change a variable into a different data type, use a type conversion function such as as.numeric or as.character.

data |>
  mutate(condition = as.character(condition))

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	NA
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

To change a numeric value to a string (known as a character vector in R), use case_match.

data |>
  mutate(condition = case_match(condition, 1 ~ "target", 2 ~ "lure"))

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
01	exp	target	6	9	NA
01	exp	lure	4	8	9
02	con	target	9	10	8
02	con	lure	7	9	7

Working with missing data#

Missing data should be marked as NA in DataFrames. They may then easily be excluded from calculations.

To count the number of NA values in each column, use summarise_all (which applies a summarizing function to all columns) with ~ sum(is.na(.)), which sums the count of NA values for each column.

summarise_all(data, ~ sum(is.na(.)))

A tibble: 1 × 6
participant_id	group	condition	test1	test2	test3
<int>	<int>	<int>	<int>	<int>	<int>
0	0	0	0	0	1

Calculations will not ignore NA values by default.

data |> 
  summarise(across(c(test1, test2, test3), mean))

A tibble: 1 × 3
test1	test2	test3
<dbl>	<dbl>	<dbl>
6.5	9	NA

Calculations can be set to ignore NA values by setting na.rm to TRUE.

data |>
  summarise(across(c(test1, test2, test3), ~ mean(., na.rm = TRUE)))

A tibble: 1 × 3
test1	test2	test3
<dbl>	<dbl>	<dbl>
6.5	9	8

Alternatively, can replace NA values with some other value using replace_na.

data |>
  mutate(across(c(test1, test2, test3), ~ replace_na(., 0)))

A tibble: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	0
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

Use drop_na to completely exclude rows with NA.

data |>
  drop_na()

A tibble: 3 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

Grouping and aggregation#

Use group_by and summarize to split data into groups and calculate statistics for each group.

This groups rows by the condition label, then calculates statistics for the test 1 scores within each group.

data |>
  group_by(group) |>
  summarize(test1 = mean(test1))  # mean for each condition

A tibble: 2 × 2
group	test1
<chr>	<dbl>
con	8
exp	5

data |>
  group_by(group) |>
  summarise(test1 = sd(test1))  # standard deviation for each condition

A tibble: 2 × 2
group	test1
<chr>	<dbl>
con	1.414214
exp	1.414214

We can also calculate multiple statistics at once.

data |>
  group_by(group) |>
  summarise(mean = mean(test1), sd = sd(test1))

A tibble: 2 × 3
group	mean	sd
<chr>	<dbl>	<dbl>
con	8	1.414214
exp	5	1.414214

Grouping by multiple variables#

To calculate statistics for each combination of multiple variables, use group_by with multiple columns. Here, we set .groups to "drop", so that the data frame will no longer be grouped after the summarization.

data |>
  group_by(group, condition) |>
  summarise(test1 = mean(test1), .groups = "drop")

A tibble: 4 × 3
group	condition	test1
<chr>	<dbl>	<dbl>
con	1	9
con	2	7
exp	1	6
exp	2	4

Use n() to get the number of samples in each condition.

data |>
  group_by(group, condition) |>
  summarize(test1 = mean(test1), n = n(), .groups = "drop")

A tibble: 4 × 4
group	condition	test1	n
<chr>	<dbl>	<dbl>	<int>
con	1	9	1
con	2	7	1
exp	1	6	1
exp	2	4	1

Reshaping data to long format#

In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.

This example data frame is wide format, because it has multiple test scores in each row.

data

A spec_tbl_df: 4 × 6
participant_id	group	condition	test1	test2	test3
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	exp	1	6	9	NA
01	exp	2	4	8	9
02	con	1	9	10	8
02	con	2	7	9	7

Convert from wide format to long format using pivot_longer. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.

data |>
  pivot_longer(cols = starts_with("test"))

A tibble: 12 × 5
participant_id	group	condition	name	value
<chr>	<chr>	<dbl>	<chr>	<dbl>
01	exp	1	test1	6
01	exp	1	test2	9
01	exp	1	test3	NA
01	exp	2	test1	4
01	exp	2	test2	8
01	exp	2	test3	9
02	con	1	test1	9
02	con	1	test2	10
02	con	1	test3	8
02	con	2	test1	7
02	con	2	test2	9
02	con	2	test3	7

To change the names of the new columns, use names_to to specify the name of the column that keeps the test label, and values_to to specify the name of the column with the score value.

long <- data |>
  pivot_longer(
    cols = starts_with("test"),
    names_to = "test_type",
    values_to = "score",
  )
long

A tibble: 12 × 5
participant_id	group	condition	test_type	score
<chr>	<chr>	<dbl>	<chr>	<dbl>
01	exp	1	test1	6
01	exp	1	test2	9
01	exp	1	test3	NA
01	exp	2	test1	4
01	exp	2	test2	8
01	exp	2	test3	9
02	con	1	test1	9
02	con	1	test2	10
02	con	1	test3	8
02	con	2	test1	7
02	con	2	test2	9
02	con	2	test3	7

Reshaping data to wide format#

Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.

In this example, we first get the mean score for each test type, for each participant.

means <- long |>
  group_by(participant_id, test_type) |>
  summarise(score = mean(score,  na.rm = TRUE), .groups = "drop")
means

A tibble: 6 × 3
participant_id	test_type	score
<chr>	<chr>	<dbl>
01	test1	5.0
01	test2	8.5
01	test3	9.0
02	test1	8.0
02	test2	9.5
02	test3	7.5

Now we can use pivot_wider to make a table that shows the results for each participant, with a column for each test.

wide <- means |>
  pivot_wider(names_from = test_type, values_from = score)
wide

A tibble: 2 × 4
participant_id	test1	test2	test3
<chr>	<dbl>	<dbl>	<dbl>
01	5	8.5	9.0
02	8	9.5	7.5

With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.

wide |>
  mutate(total = test1 + test2 + test3)

A tibble: 2 × 5
participant_id	test1	test2	test3	total
<chr>	<dbl>	<dbl>	<dbl>	<dbl>
01	5	8.5	9.0	22.5
02	8	9.5	7.5	25.0