
Joining and Reshaping Data
Grayson White
Math 241
Week 4 | Spring 2026
Mon Lecture
Different types of data in R
Talk more about logical statements
More practice with dplyr verbs and data wrangling
Wed Lecture
“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” – The Tidyverse Style Guide
Part of writing easily reproducible code.
Lots of reasonable coding styles.
Important piece is consistency across all coders on a project.
Some suggestions:
R help you with the proper indenting.Check out the Tidyverse Style Guide.
tibble()s vs. data.frame()stidyversetibble()s vs. data.frame()stibble()s vs. data.frame()stibble()s have subsetting behavior that is consistent with list()sdata.frame()s are lists, they do not have this consistent behavior
Typically use a SQL server for storing and managing that data.
We are going to learn to how to join relational data in R using dplyr.
dplyr.RToday, we’ll look at a few ways to join the following tables x and y:

The dplyr package, which is part of the tidyverse, includes functions for two general types of joins:
x and y, andx and y.Think of how mutate() adds columns to a data frame, while filter() removes rows.
For the following examples of data joins, we will use the data frames from the first slide. We can load this data into R:
dplyr contains four mutating joins:
left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.right_join(x, y) is equivalent to left_join(y, x), except for column order.inner_join(x, y) keeps only the rows matched between x and y.full_join(x, y) keeps all rows of both x and y.Recall our example data
left_join()left_join()Joining with `by = join_by(spp_code)`
spp_code common_name latin_name
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
But what’s that message about?
"Joining with `by = join_by(spp_code)`"
A key is can just be thought of the name(s) of the column(s) you’re joining by. In the left_join() from the last slide, R assumed we were joining by the column spp_code since both x and y have a column with that name.
left_join(), with a 🔑 spp_code common_name latin_name
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
by argument. This is the same for all joins in dplyr.right_join()right_join() spp_code latin_name common_name
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 PIST <NA> white pine
Notice that this is the same as our previous left_join().
What happens if we try switching the order of x and y?
right_join() spp_code latin_name common_name
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 PIST <NA> white pine
Notice that this is the same as our previous left_join().
What happens if we try switching the order of x and y?
inner_join()How many rows will the output have?
inner_join()How many rows will the output have?
spp_code common_name latin_name
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
Why is this the result?
full_join()How many rows will the output have?
full_join()How many rows will the output have?
spp_code common_name latin_name
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
4 LALA <NA> Larix laricina
Why is this the result?
dplyr contains two filtering joins:
semi_join(x, y) keeps all the rows in x that have a match in y.anti_join(x, y) removes all the rows in x that have a match in y.Note: Unlike mutating joins, filtering joins do not add any columns to the data.
semi_join()How many rows will this semi_join return? How many columns?
semi_join()How many rows will this semi_join return? How many columns?
semi_join()What about this semi_join? Will it be the same as semi_join(x, y)?
semi_join()What about this semi_join? Will it be the same as semi_join(x, y)?
anti_join()Let’s see what anti_join does:
anti_join()Let’s see what anti_join does:
Why do we get this output?
anti_join()What happens if we switch the order of x and y?
anti_join()What happens if we switch the order of x and y?
So far, we have joined x and y by the spp_code column.
But what if y had the same column named differently:
x and y?Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Looks like we need to specify by (our 🔑)
x and y?Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Looks like we need to specify by (our 🔑)
Error in `left_join()`:
! Join columns in `y` must be present in the data.
✖ Problem with `spp_code`.
Still not working!
x and y?Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Looks like we need to specify by (our 🔑)
Error in `left_join()`:
! Join columns in `y` must be present in the data.
✖ Problem with `spp_code`.
Still not working!
tidyverseQuestion: Why do we need TWO different data wrangling packages in the tidyverse?
Question: What does dplyr do?
Question: So what is left for tidyr to do?
“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

Each column is a single variable.
Each row is a unique observation.
Each value must have its own cell.
See this blog post for a cute introduction to “tidy data”.
The face dataset from IFDAR:
Rows: 1,991
Columns: 9
$ Rep <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Treat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Clone <dbl> 8, 216, 8, 216, 216, 271, 271, 8, 259, 271, 271, 271, 27…
$ ID <dbl> 45, 44, 43, 42, 54, 55, 56, 57, 58, 59, 60, 73, 72, 71, …
$ `2001_Height` <dbl> NA, 547, 273, 526, 328, 543, 450, 217, 158, 230, NA, 516…
$ `2002_Height` <dbl> NA, 622, 275, 619, 341, 590, 502, 227, 155, 241, NA, 619…
$ `2003_Height` <dbl> NA, 715, 305, 720, 364, 634, 587, 256, NA, 260, NA, 742,…
$ `2004_Height` <dbl> NA, 716, 324, 706, 361, 691, 639, 266, NA, 263, NA, 703,…
$ `2005_Height` <dbl> NA, 817, 323, 738, 366, 669, 647, 269, NA, 270, NA, 862,…
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
What does a row represent?
What does a column represent?
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
Each row should represent a measurement for a given rep, treatment, clone, and ID. But currently, we have multiple measurements on the same row.
Further, we have multiple columns for the “height” variable.
Not tidy!
Enter, pivot_longer().
pivot_longer(), a function from tidyr takes four key arguments:
.data: the data you’d like to pivot,cols: the columns you’d like to pivot,names_to: the new column that will be created which takes the column names from cols as values, andvalues_to: the new column that will be created which takes the column values from cols as values.face?face?face?dplyr’s contains() function.face?# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
face_long dataset back to its original form.tidyr has an aptly named function, pivot_wider().pivot_wider():
.data: the data frame to widen,names_from: the column that contains values which will be assigned as the new column names,values_from: the column that contains[1] TRUE
tidyr functionstidyr has some functions for dealing with that, too:unite() for pasteing column values together with specified separators,separate_wider_*() family: for splitting columns into multiple new columns:
separate_wider_delim(): separate by delimiterseparate_wider_position(): separate by positionseparate_wider_regex(): separate by regular expressionunite(): examples# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
unite(): examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1_1_8 45 2001_Height NA
2 1_1_8 45 2002_Height NA
3 1_1_8 45 2003_Height NA
4 1_1_8 45 2004_Height NA
5 1_1_8 45 2005_Height NA
6 1_1_216 44 2001_Height 547
7 1_1_216 44 2002_Height 622
8 1_1_216 44 2003_Height 715
9 1_1_216 44 2004_Height 716
10 1_1_216 44 2005_Height 817
# ℹ 9,945 more rows
unite(): examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1.1.8 45 2001_Height NA
2 1.1.8 45 2002_Height NA
3 1.1.8 45 2003_Height NA
4 1.1.8 45 2004_Height NA
5 1.1.8 45 2005_Height NA
6 1.1.216 44 2001_Height 547
7 1.1.216 44 2002_Height 622
8 1.1.216 44 2003_Height 715
9 1.1.216 44 2004_Height 716
10 1.1.216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim() example# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
# A tibble: 6 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
# A tibble: 9,955 × 6
Rep Treat Clone ID Year Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001 NA
2 1 1 8 45 2002 NA
3 1 1 8 45 2003 NA
4 1 1 8 45 2004 NA
5 1 1 8 45 2005 NA
6 1 1 216 44 2001 547
7 1 1 216 44 2002 622
8 1 1 216 44 2003 715
9 1 1 216 44 2004 716
10 1 1 216 44 2005 817
# ℹ 9,945 more rows
R