9  Joining data

9.2 Binding rows and columns to a table

Occasionally, a simpler problem presents itself: there is a single dataset, but its rows are contained across separate tables. For example, a table containing student names and subject areas might be spread across two tables, like this:

studies1 <- tibble(
  name    = c("Sacha", "Gabe", "Alex"),
  subject = c("Physics", "Chemistry", "Biology")
)

print(studies1)
# A tibble: 3 × 2
  name  subject  
  <chr> <chr>    
1 Sacha Physics  
2 Gabe  Chemistry
3 Alex  Biology  
studies2 <- tibble(
  name    = c("Jamie", "Ashley", "Dallas", "Jordan"),
  subject = c("Geology", "Mathematics", "Philosophy", "Physics")
)

print(studies2)
# A tibble: 4 × 2
  name   subject    
  <chr>  <chr>      
1 Jamie  Geology    
2 Ashley Mathematics
3 Dallas Philosophy 
4 Jordan Physics    

The tables have the exact same structure, in that the column names and types are identical. It’s just that the rows are, for some reason, disparate. To combine them together, we could recourse to full-joining the tables by both their columns:

full_join(studies1, studies2, by = join_by(name, subject))
# A tibble: 7 × 2
  name   subject    
  <chr>  <chr>      
1 Sacha  Physics    
2 Gabe   Chemistry  
3 Alex   Biology    
4 Jamie  Geology    
5 Ashley Mathematics
6 Dallas Philosophy 
7 Jordan Physics    

This, however, is not necessary. Whenever all we need to do is take two tables and stick their rows together, there is the simpler bind_rows:

bind_rows(studies1, studies2)
# A tibble: 7 × 2
  name   subject    
  <chr>  <chr>      
1 Sacha  Physics    
2 Gabe   Chemistry  
3 Alex   Biology    
4 Jamie  Geology    
5 Ashley Mathematics
6 Dallas Philosophy 
7 Jordan Physics    

Similarly, in case two tables have the same number of rows but different columns, one can stick their columns together using bind_cols. For example, suppose we have

studies <- tibble(
  name    = c("Sacha", "Gabe", "Alex"),
  subject = c("Physics", "Chemistry", "Biology")
)

print(studies)
# A tibble: 3 × 2
  name  subject  
  <chr> <chr>    
1 Sacha Physics  
2 Gabe  Chemistry
3 Alex  Biology  

as well as a table with year of study and result of last exam only:

yearExam <- tibble(
  year     = c(3, 1, 2),
  examPass = c(FALSE, TRUE, TRUE)
)

print(yearExam)
# A tibble: 3 × 2
   year examPass
  <dbl> <lgl>   
1     3 FALSE   
2     1 TRUE    
3     2 TRUE    

We can now join these using bind_cols:

bind_cols(studies, yearExam)
# A tibble: 3 × 4
  name  subject    year examPass
  <chr> <chr>     <dbl> <lgl>   
1 Sacha Physics       3 FALSE   
2 Gabe  Chemistry     1 TRUE    
3 Alex  Biology       2 TRUE    

9.3 Exercises

We have used the data of Fauchald et al. (2017) before in other exercises, in Section 7.4 and Section 8.6. As a reminder, they tracked the population size of various herds of caribou in North America over time, and correlated population cycling with the amount of vegetation and sea-ice cover. Two files from their data are pop_size.tsv (herd population sizes) and sea_ice.tsv (sea ice cover per year and month).

  1. Load these two datasets into two variables. They could be called pop and ice, for instance. Look at the data to familiarize yourself with them. How many rows and columns are in each?

  2. Before doing anything else: how many rows will there be in the table that is the left join of pop and ice, based on the two columns Herd and Year? Perform the left join to see if you were correct. Where do you see NAs in the table, and why?

  3. Now do the same with right-joining, inner-joining, and full-joining pop and ice.


  1. It can also do a lot more. You can check its help page (?join_by) after you finish reading this chapter—by that time, it will make much more sense.↩︎