R Data Cleaning Cheat Sheet#

require(tidyverse)
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr     2.1.6
 forcats   1.0.1      stringr   1.6.0
 ggplot2   4.0.1      tibble    3.3.0
 lubridate 1.9.4      tidyr     1.3.2
 purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("scores.csv")
Rows: 4 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): participant_id, group
dbl (4): condition, test1, test2, test3
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.

Handling missing data sentinels#

Sometimes, special values (like -999) called sentinels are used to indicate missing data. The replace method can be used to replace these values with None to indicate that they are missing.

data
A spec_tbl_df: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp16 9-999
01exp24 8 9
02con1910 8
02con27 9 7

To replace missing data sentinels in the test3 column, using na_if.

data |>
  mutate(test3 = na_if(test3, -999))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp16 9NA
01exp24 8 9
02con1910 8
02con27 9 7

To replace missing data sentinels in multiple columns, use across to apply the na_if function to each column. The ~ indicates that the code after (here, na_if(.x, -999))) should be treated as a function, which will be applied to each of the indicated columns. The . is a placeholder for the column being evaluated.

data |>
  mutate(across(c(test1, test2, test3), ~ na_if(., -999)))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp16 9NA
01exp24 8 9
02con1910 8
02con27 9 7

Alternatively, we can set the na input when calling read_csv to set those special values to null immediately, in all columns. Here, setting show_col_types suppresses the messages that are usually shown when using read_csv.

data <- read_csv("scores.csv", na = "-999", show_col_types = FALSE)
data
A spec_tbl_df: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp16 9NA
01exp24 8 9
02con1910 8
02con27 9 7

Recoding variables#

Use recode or case_match to change the values of variables and as.numeric or as.character to change the data type of variables.

data |>
  mutate(group = recode(group, "exp" = "Experimental", "con" = "Control"))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01Experimental16 9NA
01Experimental24 8 9
02Control 1910 8
02Control 27 9 7
data |>
  mutate(group = case_match(group, "exp" ~ "Experimental", "con" ~ "Control"))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01Experimental16 9NA
01Experimental24 8 9
02Control 1910 8
02Control 27 9 7

To change a variable into a different data type, use a type conversion function such as as.numeric or as.character.

data |>
  mutate(condition = as.character(condition))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><chr><dbl><dbl><dbl>
01exp16 9NA
01exp24 8 9
02con1910 8
02con27 9 7

To change a numeric value to a string (known as a character vector in R), use case_match.

data |>
  mutate(condition = case_match(condition, 1 ~ "target", 2 ~ "lure"))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><chr><dbl><dbl><dbl>
01exptarget6 9NA
01explure 4 8 9
02contarget910 8
02conlure 7 9 7

Working with missing data#

Missing data should be marked as NA in DataFrames. They may then easily be excluded from calculations.

To count the number of NA values in each column, use summarise_all (which applies a summarizing function to all columns) with ~ sum(is.na(.)), which sums the count of NA values for each column.

summarise_all(data, ~ sum(is.na(.)))
A tibble: 1 × 6
participant_idgroupconditiontest1test2test3
<int><int><int><int><int><int>
000001

Calculations will not ignore NA values by default.

data |> 
  summarise(across(c(test1, test2, test3), mean))
A tibble: 1 × 3
test1test2test3
<dbl><dbl><dbl>
6.59NA

Calculations can be set to ignore NA values by setting na.rm to TRUE.

data |>
  summarise(across(c(test1, test2, test3), ~ mean(., na.rm = TRUE)))
A tibble: 1 × 3
test1test2test3
<dbl><dbl><dbl>
6.598

Alternatively, can replace NA values with some other value using replace_na.

data |>
  mutate(across(c(test1, test2, test3), ~ replace_na(., 0)))
A tibble: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp16 90
01exp24 89
02con19108
02con27 97

Use drop_na to completely exclude rows with NA.

data |>
  drop_na()
A tibble: 3 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp24 89
02con19108
02con27 97

Grouping and aggregation#

Use group_by and summarize to split data into groups and calculate statistics for each group.

This groups rows by the condition label, then calculates statistics for the test 1 scores within each group.

data |>
  group_by(group) |>
  summarize(test1 = mean(test1))  # mean for each condition
A tibble: 2 × 2
grouptest1
<chr><dbl>
con8
exp5
data |>
  group_by(group) |>
  summarise(test1 = sd(test1))  # standard deviation for each condition
A tibble: 2 × 2
grouptest1
<chr><dbl>
con1.414214
exp1.414214

We can also calculate multiple statistics at once.

data |>
  group_by(group) |>
  summarise(mean = mean(test1), sd = sd(test1))
A tibble: 2 × 3
groupmeansd
<chr><dbl><dbl>
con81.414214
exp51.414214

Grouping by multiple variables#

To calculate statistics for each combination of multiple variables, use group_by with multiple columns. Here, we set .groups to "drop", so that the data frame will no longer be grouped after the summarization.

data |>
  group_by(group, condition) |>
  summarise(test1 = mean(test1), .groups = "drop")
A tibble: 4 × 3
groupconditiontest1
<chr><dbl><dbl>
con19
con27
exp16
exp24

Use n() to get the number of samples in each condition.

data |>
  group_by(group, condition) |>
  summarize(test1 = mean(test1), n = n(), .groups = "drop")
A tibble: 4 × 4
groupconditiontest1n
<chr><dbl><dbl><int>
con191
con271
exp161
exp241

Reshaping data to long format#

In wide format data, there are multiple observations in each row. In long format data, there is one row for each observation.

This example data frame is wide format, because it has multiple test scores in each row.

data
A spec_tbl_df: 4 × 6
participant_idgroupconditiontest1test2test3
<chr><chr><dbl><dbl><dbl><dbl>
01exp16 9NA
01exp24 8 9
02con1910 8
02con27 9 7

Convert from wide format to long format using pivot_longer. Here, we want to unpivot the test score columns (test1, test2, and test3), so that there is one observation per row.

data |>
  pivot_longer(cols = starts_with("test"))
A tibble: 12 × 5
participant_idgroupconditionnamevalue
<chr><chr><dbl><chr><dbl>
01exp1test1 6
01exp1test2 9
01exp1test3NA
01exp2test1 4
01exp2test2 8
01exp2test3 9
02con1test1 9
02con1test210
02con1test3 8
02con2test1 7
02con2test2 9
02con2test3 7

To change the names of the new columns, use names_to to specify the name of the column that keeps the test label, and values_to to specify the name of the column with the score value.

long <- data |>
  pivot_longer(
    cols = starts_with("test"),
    names_to = "test_type",
    values_to = "score",
  )
long
A tibble: 12 × 5
participant_idgroupconditiontest_typescore
<chr><chr><dbl><chr><dbl>
01exp1test1 6
01exp1test2 9
01exp1test3NA
01exp2test1 4
01exp2test2 8
01exp2test3 9
02con1test1 9
02con1test210
02con1test3 8
02con2test1 7
02con2test2 9
02con2test3 7

Reshaping data to wide format#

Wide format data frames make it easier to compare different variables measured within the same condition. Use pivot to take long data and reorganize it to wide format.

In this example, we first get the mean score for each test type, for each participant.

means <- long |>
  group_by(participant_id, test_type) |>
  summarise(score = mean(score,  na.rm = TRUE), .groups = "drop")
means
A tibble: 6 × 3
participant_idtest_typescore
<chr><chr><dbl>
01test15.0
01test28.5
01test39.0
02test18.0
02test29.5
02test37.5

Now we can use pivot_wider to make a table that shows the results for each participant, with a column for each test.

wide <- means |>
  pivot_wider(names_from = test_type, values_from = score)
wide
A tibble: 2 × 4
participant_idtest1test2test3
<chr><dbl><dbl><dbl>
0158.59.0
0289.57.5

With the wide-format table, we can perform calculations that involve multiple variables. For example, we can calculate a total score for each participant.

wide |>
  mutate(total = test1 + test2 + test3)
A tibble: 2 × 5
participant_idtest1test2test3total
<chr><dbl><dbl><dbl><dbl>
0158.59.022.5
0289.57.525.0