5 Basic data manipulation

5.1 Important functions for transforming data

Let us start by loading the tidyverse:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

As seen from the message output above, the dplyr package is part of the tidyverse and gets loaded by default. It allows one to arrange and manipulate data efficiently. The basic functions one should know are select, filter, slice, rename, arrange, and mutate. Additionally, a useful (though not as commonly used) function is distinct, also explained below. We will use the island-FL.csv data file we worked with in Chapter 4:

snailDat <- read_csv("island-FL.csv")
print(snailDat)

# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

Now we will give examples of each of the functions select, filter, slice, rename, arrange, mutate, and distinct. They all work similarly in that the first argument they take is the data, in the form of a tibble. Their other arguments, and what they each do, are explained below.

5.1.1 `select`

This function chooses columns of the data. The second and subsequent arguments of the function are the columns which should be retained. For example, select(snailDat, species) will keep only the species column of snailDat:

select(snailDat, species)

# A tibble: 223 × 1
   species    
   <chr>      
 1 ustulatus  
 2 ustulatus  
 3 ustulatus  
 4 calvus     
 5 nux        
 6 ustulatus  
 7 ustulatus  
 8 ustulatus  
 9 galapaganus
10 nux        
# ℹ 213 more rows

It is also possible to deselect columns, by prepending a minus sign (-) in front of the column names. To drop the species column, we can type:

select(snailDat, -species)

# A tibble: 223 × 3
   habitat  size  shape
   <chr>   <dbl>  <dbl>
 1 humid    17.1 -0.029
 2 humid    20.1 -0.001
 3 humid    16.3  0.014
 4 arid     13.7 -0.043
 5 humid    21.9 -0.042
 6 humid    16.8 -0.023
 7 humid    19.2  0.014
 8 humid    16.0  0.042
 9 arid     18.9  0.011
10 humid    26.6  0    
# ℹ 213 more rows

Now we are left with only the columns habitat, size, and shape. One can also select and deselect multiple columns in a similar vein. For example, select(snailDat, -species, -habitat) removes both columns and leaves only size and shape (try it).
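As a small sketch of dropping several columns at once, the block below uses a made-up five-row table called toyDat (a hypothetical stand-in with the same column names as snailDat, so that the example runs without the data file). It also shows what should be an equivalent form, in which a whole group of columns is negated with !:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid", "arid", "humid"),
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6),
  shape   = c(-0.029, -0.042, -0.043, -0.019, 0)
)

# Drop two columns by prepending a minus sign to each
droppedMinus <- select(toyDat, -species, -habitat)

# Alternative: negate a group of columns in one go
droppedNot <- select(toyDat, !c(species, habitat))

print(droppedMinus)
print(droppedNot)
```

Both calls should leave only the size and shape columns.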

There are several other options within select, which mostly help with selecting several columns at a time fulfilling certain criteria. The collection of these options and methods is called tidy selection. First of all, tidy selection allows one to specify, as character strings, what the column names should start or end with, using starts_with and ends_with:

select(snailDat, starts_with("s"))

# A tibble: 223 × 3
   species      size  shape
   <chr>       <dbl>  <dbl>
 1 ustulatus    17.1 -0.029
 2 ustulatus    20.1 -0.001
 3 ustulatus    16.3  0.014
 4 calvus       13.7 -0.043
 5 nux          21.9 -0.042
 6 ustulatus    16.8 -0.023
 7 ustulatus    19.2  0.014
 8 ustulatus    16.0  0.042
 9 galapaganus  18.9  0.011
10 nux          26.6  0    
# ℹ 213 more rows

select(snailDat, starts_with("sh"))

# A tibble: 223 × 1
    shape
    <dbl>
 1 -0.029
 2 -0.001
 3  0.014
 4 -0.043
 5 -0.042
 6 -0.023
 7  0.014
 8  0.042
 9  0.011
10  0    
# ℹ 213 more rows

select(snailDat, ends_with("e"))

# A tibble: 223 × 2
    size  shape
   <dbl>  <dbl>
 1  17.1 -0.029
 2  20.1 -0.001
 3  16.3  0.014
 4  13.7 -0.043
 5  21.9 -0.042
 6  16.8 -0.023
 7  19.2  0.014
 8  16.0  0.042
 9  18.9  0.011
10  26.6  0    
# ℹ 213 more rows

select(snailDat, ends_with("pe"))

# A tibble: 223 × 1
    shape
    <dbl>
 1 -0.029
 2 -0.001
 3  0.014
 4 -0.043
 5 -0.042
 6 -0.023
 7  0.014
 8  0.042
 9  0.011
10  0    
# ℹ 213 more rows

Similarly, the function contains can select columns which contain some string anywhere in their names. For example, select(snailDat, ends_with("pe")) above only selected the shape column, but select(snailDat, contains("pe")) additionally selects species:

select(snailDat, contains("pe"))

# A tibble: 223 × 2
   species      shape
   <chr>        <dbl>
 1 ustulatus   -0.029
 2 ustulatus   -0.001
 3 ustulatus    0.014
 4 calvus      -0.043
 5 nux         -0.042
 6 ustulatus   -0.023
 7 ustulatus    0.014
 8 ustulatus    0.042
 9 galapaganus  0.011
10 nux          0    
# ℹ 213 more rows

One can combine these selection methods using the & (“and”), | (“or”), and ! (“not”) logical operators. To select all columns which start with "s" but do not contain the letter "z" in their names:

select(snailDat, starts_with("s") & !contains("z"))

# A tibble: 223 × 2
   species      shape
   <chr>        <dbl>
 1 ustulatus   -0.029
 2 ustulatus   -0.001
 3 ustulatus    0.014
 4 calvus      -0.043
 5 nux         -0.042
 6 ustulatus   -0.023
 7 ustulatus    0.014
 8 ustulatus    0.042
 9 galapaganus  0.011
10 nux          0    
# ℹ 213 more rows

The following selects columns that either contain "ha" in their names, or end with the letter "s":

select(snailDat, contains("ha") | ends_with("s"))

# A tibble: 223 × 3
   habitat  shape species    
   <chr>    <dbl> <chr>      
 1 humid   -0.029 ustulatus  
 2 humid   -0.001 ustulatus  
 3 humid    0.014 ustulatus  
 4 arid    -0.043 calvus     
 5 humid   -0.042 nux        
 6 humid   -0.023 ustulatus  
 7 humid    0.014 ustulatus  
 8 humid    0.042 ustulatus  
 9 arid     0.011 galapaganus
10 humid    0     nux        
# ℹ 213 more rows

Finally, it is possible to select a range of columns, using the colon (:) operator. To select the columns species and shape, along with any columns in between them, we write:

select(snailDat, species:shape)

# A tibble: 223 × 3
   species      size  shape
   <chr>       <dbl>  <dbl>
 1 ustulatus    17.1 -0.029
 2 ustulatus    20.1 -0.001
 3 ustulatus    16.3  0.014
 4 calvus       13.7 -0.043
 5 nux          21.9 -0.042
 6 ustulatus    16.8 -0.023
 7 ustulatus    19.2  0.014
 8 ustulatus    16.0  0.042
 9 galapaganus  18.9  0.011
10 nux          26.6  0    
# ℹ 213 more rows

Such range selection can also be combined with the logical operations above. For instance, to select the range from habitat to species, as well as any columns whose name contains the letter "z":

select(snailDat, habitat:species | contains("z"))

# A tibble: 223 × 3
   habitat species      size
   <chr>   <chr>       <dbl>
 1 humid   ustulatus    17.1
 2 humid   ustulatus    20.1
 3 humid   ustulatus    16.3
 4 arid    calvus       13.7
 5 humid   nux          21.9
 6 humid   ustulatus    16.8
 7 humid   ustulatus    19.2
 8 humid   ustulatus    16.0
 9 arid    galapaganus  18.9
10 humid   nux          26.6
# ℹ 213 more rows
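Tidy selection can also pick columns by their type rather than their name, using the where helper (this helper is not covered in the text above; it comes from the tidyselect machinery that select relies on). A minimal sketch, again on a hypothetical toy table with the same column names as snailDat:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid", "arid", "humid"),
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6),
  shape   = c(-0.029, -0.042, -0.043, -0.019, 0)
)

# Keep only the columns holding numbers
numericCols <- select(toyDat, where(is.numeric))

# Type-based and name-based selection combined with "or"
charOrSh <- select(toyDat, where(is.character) | starts_with("sh"))

print(numericCols)
print(charOrSh)
```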

5.1.2 `filter`

While select chooses columns, filter chooses rows from the data. As with all these functions, the first argument of filter is the tibble to be filtered. The second argument is a logical condition on the columns. Rows which satisfy the condition are retained; the rest are dropped.

For example, to retain only those individuals from snailDat whose shell size is at least 29:

filter(snailDat, size >= 29)

# A tibble: 6 × 4
  habitat species       size  shape
  <chr>   <chr>        <dbl>  <dbl>
1 humid   unifasciatus  33.8 -0.07 
2 humid   unifasciatus  31.7 -0.115
3 humid   unifasciatus  30.9 -0.074
4 humid   unifasciatus  31.9 -0.071
5 humid   nux           29.2 -0.01 
6 humid   unifasciatus  32.0 -0.087

The filtered data have only 6 rows instead of the original 223: this is the number of snail individuals with a shell size of at least 29. As seen, five of these belong to the species Naesiotus unifasciatus, and only one to the species Naesiotus nux.
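Conditions in filter can be combined. The sketch below, run on a hypothetical toy table rather than the real snailDat, shows three common patterns: several conditions separated by commas (all must hold), membership in a set of values with %in%, and alternatives joined by | ("or"):

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid", "arid", "humid"),
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6),
  shape   = c(-0.029, -0.042, -0.043, -0.019, 0)
)

# Comma-separated conditions must all be true
humidLarge <- filter(toyDat, habitat == "humid", size > 20)

# Keep rows whose species is one of the listed names
twoSpecies <- filter(toyDat, species %in% c("calvus", "nux"))

# Keep rows that are either very small or very large
extremes <- filter(toyDat, size < 13 | size > 25)

print(humidLarge)
print(twoSpecies)
print(extremes)
```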

5.1.3 `slice`

With slice, one can choose rows of the data, just like with filter. Unlike with filter however, slice receives a vector of row indices to retain instead of a condition to be tested on each row. So, for example, if one wanted to keep only the first, second, and fifth rows, then one can do so with slice:

slice(snailDat, c(1, 2, 5))

# A tibble: 3 × 4
  habitat species    size  shape
  <chr>   <chr>     <dbl>  <dbl>
1 humid   ustulatus  17.1 -0.029
2 humid   ustulatus  20.1 -0.001
3 humid   nux        21.9 -0.042

(Note: the numbers in front of the rows in the output generated by tibbles always pertain to the row numbers of the current table, not the one from which they were created. So the row labels 1, 2, 3 above simply enumerate the rows of the sliced data. The actual rows still correspond to rows 1, 2, and 5 in the original snailDat.)
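A few further ways of picking rows by position are sketched below on a hypothetical toy table. Ranges and negative indices work inside slice itself; slice_head and slice_max are related dplyr helpers that the text above does not discuss, shown here only as an illustration:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid", "arid", "humid"),
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6),
  shape   = c(-0.029, -0.042, -0.043, -0.019, 0)
)

# A contiguous range of rows: second through fourth
midRows <- slice(toyDat, 2:4)

# Negative indices drop rows: everything except the first row
withoutFirst <- slice(toyDat, -1)

# The first two rows
firstTwo <- slice_head(toyDat, n = 2)

# The single row with the largest value of size
largest <- slice_max(toyDat, size, n = 1)

print(midRows)
print(withoutFirst)
print(firstTwo)
print(largest)
```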

5.1.4 `rename`

The rename function simply gives new names to existing columns. The first argument, as always, is the tibble in which the column(s) should be renamed. The subsequent arguments follow the pattern new_name = old_name in replacing column names. For example, in the land snail data, the arid and humid habitats are often referred to as arid or humid zones. To rename habitat to zone, we simply write:

rename(snailDat, zone = habitat)

# A tibble: 223 × 4
   zone  species      size  shape
   <chr> <chr>       <dbl>  <dbl>
 1 humid ustulatus    17.1 -0.029
 2 humid ustulatus    20.1 -0.001
 3 humid ustulatus    16.3  0.014
 4 arid  calvus       13.7 -0.043
 5 humid nux          21.9 -0.042
 6 humid ustulatus    16.8 -0.023
 7 humid ustulatus    19.2  0.014
 8 humid ustulatus    16.0  0.042
 9 arid  galapaganus  18.9  0.011
10 humid nux          26.6  0    
# ℹ 213 more rows

Multiple columns can also be renamed. To change all column names to start with capital letters:

rename(snailDat,
       Habitat = habitat, Species = species, Size = size, Shape = shape)

# A tibble: 223 × 4
   Habitat Species      Size  Shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows
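When every column should be renamed by the same rule, writing out each new_name = old_name pair becomes tedious. One possible alternative is dplyr's rename_with, which applies a function to the column names. In the sketch below, capitalize is a hypothetical helper written for this example, and toyDat is a made-up stand-in for snailDat:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid"),
  species = c("ustulatus", "nux", "calvus"),
  size    = c(17.1, 21.9, 13.7),
  shape   = c(-0.029, -0.042, -0.043)
)

# Hypothetical helper: make the first letter of each string uppercase
capitalize <- function(x) {
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

# Apply the helper to all column names at once
capitalDat <- rename_with(toyDat, capitalize)
print(capitalDat)
```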

5.1.5 `arrange`

This function rearranges the rows of the data, in increasing order of the column given as the second argument. For example, to arrange in increasing order of size, we write:

arrange(snailDat, size)

# A tibble: 223 × 4
   habitat species  size  shape
   <chr>   <chr>   <dbl>  <dbl>
 1 arid    calvus   12.3 -0.019
 2 arid    calvus   12.9 -0.039
 3 arid    calvus   13.5 -0.012
 4 arid    calvus   13.7 -0.018
 5 arid    calvus   13.7 -0.043
 6 arid    calvus   13.9  0.01 
 7 arid    calvus   14.1 -0.027
 8 arid    calvus   14.1 -0.016
 9 arid    calvus   14.3 -0.011
10 arid    calvus   14.4  0.002
# ℹ 213 more rows

To arrange in decreasing order, there is a small helper function called desc. Arranging by desc(size) instead of size will arrange the rows in decreasing order of size:

arrange(snailDat, desc(size))

# A tibble: 223 × 4
   habitat species       size  shape
   <chr>   <chr>        <dbl>  <dbl>
 1 humid   unifasciatus  33.8 -0.07 
 2 humid   unifasciatus  32.0 -0.087
 3 humid   unifasciatus  31.9 -0.071
 4 humid   unifasciatus  31.7 -0.115
 5 humid   unifasciatus  30.9 -0.074
 6 humid   nux           29.2 -0.01 
 7 humid   unifasciatus  28.8 -0.075
 8 humid   unifasciatus  28.5 -0.088
 9 humid   nux           27.7 -0.05 
10 humid   unifasciatus  27.7 -0.047
# ℹ 213 more rows

It is also perfectly possible to arrange by a column whose type is character string. In that case, the system will sort the rows in alphabetical order—or reverse alphabetical order in case desc is applied. For example, to sort in alphabetical order of species names:

arrange(snailDat, species)

# A tibble: 223 × 4
   habitat species  size  shape
   <chr>   <chr>   <dbl>  <dbl>
 1 arid    calvus   13.7 -0.043
 2 arid    calvus   17.9 -0.024
 3 arid    calvus   13.9  0.01 
 4 arid    calvus   16.5 -0.004
 5 arid    calvus   16.6 -0.006
 6 arid    calvus   16.1  0.01 
 7 arid    calvus   18.2 -0.003
 8 arid    calvus   12.9 -0.039
 9 arid    calvus   17.3  0.002
10 arid    calvus   14.1 -0.027
# ℹ 213 more rows

And to sort in reverse alphabetical order:

arrange(snailDat, desc(species))

# A tibble: 223 × 4
   habitat species    size  shape
   <chr>   <chr>     <dbl>  <dbl>
 1 humid   ustulatus  17.1 -0.029
 2 humid   ustulatus  20.1 -0.001
 3 humid   ustulatus  16.3  0.014
 4 humid   ustulatus  16.8 -0.023
 5 humid   ustulatus  19.2  0.014
 6 humid   ustulatus  16.0  0.042
 7 humid   ustulatus  15.4 -0.016
 8 humid   ustulatus  16.3 -0.017
 9 humid   ustulatus  16.7 -0.034
10 humid   ustulatus  17.0 -0.018
# ℹ 213 more rows

Notice that when we sort the rows by species, there are many ties—rows with the same value of species. In those cases, arrange will not be able to decide which rows should come earlier, and so any ordering that was present before invoking arrange will be retained. In case we would like to break the ties, we can give further sorting variables, as the third, fourth, etc. arguments to arrange. To sort the data by species, and to resolve ties in order of increasing size, we write:

arrange(snailDat, species, size)

# A tibble: 223 × 4
   habitat species  size  shape
   <chr>   <chr>   <dbl>  <dbl>
 1 arid    calvus   12.3 -0.019
 2 arid    calvus   12.9 -0.039
 3 arid    calvus   13.5 -0.012
 4 arid    calvus   13.7 -0.018
 5 arid    calvus   13.7 -0.043
 6 arid    calvus   13.9  0.01 
 7 arid    calvus   14.1 -0.027
 8 arid    calvus   14.1 -0.016
 9 arid    calvus   14.3 -0.011
10 arid    calvus   14.4  0.002
# ℹ 213 more rows

This causes the table to be sorted primarily by species, but in case there are ties (equal species between multiple rows), they will be resolved in priority of size—first the smallest and then increasingly larger individuals.
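Sorting directions can be mixed across the sorting variables. As a sketch on a hypothetical toy table, the following sorts alphabetically by species but breaks ties by decreasing size, so the largest individual of each species comes first:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid", "arid", "humid"),
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6),
  shape   = c(-0.029, -0.042, -0.043, -0.019, 0)
)

# Primary key: species (increasing); tie-breaker: size (decreasing)
sortedDat <- arrange(toyDat, species, desc(size))
print(sortedDat)
```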

5.1.6 `mutate`

The mutate function allows us to create new columns from existing ones. We may apply any function or operator we learned about to existing columns, and the result of the computation will go into the new column. We do this in the second argument of mutate (the first, as always, is the data tibble) by first giving a name to the column, then writing =, and then the desired computation. For example, we could create a new column indicating whether a snail is “large” (has a shell size above some threshold—say, 25) or “small”. We can do this using the ifelse function within mutate:

mutate(snailDat, shellSize = ifelse(size > 25, "large", "small"))

# A tibble: 223 × 5
   habitat species      size  shape shellSize
   <chr>   <chr>       <dbl>  <dbl> <chr>    
 1 humid   ustulatus    17.1 -0.029 small    
 2 humid   ustulatus    20.1 -0.001 small    
 3 humid   ustulatus    16.3  0.014 small    
 4 arid    calvus       13.7 -0.043 small    
 5 humid   nux          21.9 -0.042 small    
 6 humid   ustulatus    16.8 -0.023 small    
 7 humid   ustulatus    19.2  0.014 small    
 8 humid   ustulatus    16.0  0.042 small    
 9 arid    galapaganus  18.9  0.011 small    
10 humid   nux          26.6  0     large    
# ℹ 213 more rows

The original columns of the data are retained, but we now also have the additional shellSize column.

One very common transformation on quantities such as size and shape is to standardize them: subtract the overall mean from each entry and then divide the result by the standard deviation. This makes the quantities unitless with mean 0 and standard deviation 1. This is how one can perform this standardization with mutate:

mutate(snailDat,
       stdSize = (size - mean(size)) / sd(size),
       stdShape = (shape - mean(shape)) / sd(shape))

# A tibble: 223 × 6
   habitat species      size  shape stdSize stdShape
   <chr>   <chr>       <dbl>  <dbl>   <dbl>    <dbl>
 1 humid   ustulatus    17.1 -0.029  -0.660 -0.575  
 2 humid   ustulatus    20.1 -0.001   0.124  0.00343
 3 humid   ustulatus    16.3  0.014  -0.858  0.313  
 4 arid    calvus       13.7 -0.043  -1.55  -0.864  
 5 humid   nux          21.9 -0.042   0.609 -0.844  
 6 humid   ustulatus    16.8 -0.023  -0.723 -0.451  
 7 humid   ustulatus    19.2  0.014  -0.113  0.313  
 8 humid   ustulatus    16.0  0.042  -0.943  0.892  
 9 arid    galapaganus  18.9  0.011  -0.183  0.251  
10 humid   nux          26.6  0       1.85   0.0241 
# ℹ 213 more rows
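One way to convince oneself that the standardization does what is claimed is to compute the mean and standard deviation of the new column. The sketch below does so on a hypothetical toy table; up to floating-point rounding, the mean should come out as 0 and the standard deviation as 1:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6)
)

# Standardize size: subtract the mean, divide by the standard deviation
stdDat <- mutate(toyDat, stdSize = (size - mean(size)) / sd(size))

# Extract the new column and check its mean and standard deviation
stdMean <- mean(stdDat$stdSize)
stdSd <- sd(stdDat$stdSize)
print(stdMean)
print(stdSd)
```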

5.1.7 `distinct`

While not as important as the previous six functions, distinct can also be useful. It takes as its input a tibble, and removes duplicated rows, keeping only the first occurrence of each. For example, we might wonder how many different species there are in snailDat. One way to answer this is to select the species column only, and then apply distinct to remove duplicated entries:

distinct(select(snailDat, species))

# A tibble: 7 × 1
  species     
  <chr>       
1 ustulatus   
2 calvus      
3 nux         
4 galapaganus 
5 unifasciatus
6 invalidus   
7 rugulosus

So each individual in the data comes from one of the above seven species.
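As an aside, distinct also accepts column names after the data argument, in which case it should return the unique combinations of just those columns. The sketch below, on a hypothetical toy table, compares this shortcut with the select-then-distinct approach used above:

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  habitat = c("humid", "humid", "arid", "arid", "humid"),
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6)
)

# Two-step version, as in the text
viaSelect <- distinct(select(toyDat, species))

# Shortcut: name the column directly inside distinct
viaDistinct <- distinct(toyDat, species)

# Unique combinations of two columns
combos <- distinct(toyDat, habitat, species)

print(viaSelect)
print(viaDistinct)
print(combos)
```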

5.2 Using pipes to our advantage

Let us take a slightly more complicated (and quite typical) data analysis task. We want to answer the question: how many species are there with at least some individuals whose standardized shell size is larger than a threshold (say, 2)? This corresponds to individuals whose size is more than two standard deviations above the community average.

To obtain an answer, we could first mutate a new column that contains the standardized shell size for each individual. We can then filter for those rows only for which this is greater than 2. Afterwards, we can use select to choose just the species column. This will likely have repeated entries (because multiple individuals from the same species could have a standardized size greater than 2), so as a final step, we should remove duplicated rows with distinct.

Since mutate, filter, select, distinct, etc. are just ordinary functions, they do not “modify” data. They merely take a tibble as input (plus other arguments) and return another tibble. They do not do anything to the original input data. In order for R not to forget their result immediately after they are computed, they have to be stored in variables. So one way of implementing the solution might rely on repeated assignments, as below:

mutatedDat <- mutate(snailDat, stdSize = (size - mean(size)) / sd(size))
filteredDat <- filter(mutatedDat, stdSize > 2)
onlySpeciesDat <- select(filteredDat, species)
speciesListDat <- distinct(onlySpeciesDat)
print(speciesListDat)

# A tibble: 2 × 1
  species     
  <chr>       
1 unifasciatus
2 nux

It turns out that only two out of the seven species have individuals with shells that large.

While this solution works, it requires inventing arbitrary variable names at every step, or else overwriting variables. For such a short example, this is not problematic, but doing the same for a long pipeline of dozens of steps could get confusing, as well as dangerous due to the repeatedly modified variables.

Another possible solution is to rely on function composition (Section 3.2). Applying repeated composition is straightforward—in principle. In practice, when composing many functions together, things can get unwieldy quite quickly. Let us see what such a solution looks like:

distinct(
  select(
    filter(
      mutate(snailDat, stdSize = (size - mean(size)) / sd(size)),
      stdSize > 2
    ),
    species
  )
)

# A tibble: 2 × 1
  species     
  <chr>       
1 unifasciatus
2 nux

The expression is highly unpleasant: to a human reader, it is not at all obvious what is happening above. It would be nice to clarify this workflow if possible.

It turns out that one can do this by making use of the pipe operator |> from Section 3.3. As a reminder: for any function f and function argument x, f(x, y, ...) is the same as x |> f(y, ...), where the ... denote potential further arguments to f. That is, the first argument of the function can be moved from the argument list to in front of the function, before the pipe symbol. The tidyverse functions take the data as their first argument, which means that the use of pipes allows us to very conveniently chain together multiple steps of data analysis. In our case, we can rewrite the above (quite confusing) code block in a much more transparent way:

snailDat |>
  mutate(stdSize = (size - mean(size)) / sd(size)) |>
  filter(stdSize > 2) |>
  select(species) |>
  distinct()

Again, the pipe |> should be pronounced "then". We take the data, then we mutate it, then we filter for large-shelled individuals, then we select one of the columns, and then we remove all duplicated entries in that column. In performing these steps, each function both receives and returns data. Thus, by starting out with the original snailDat, we no longer need to write out the data argument of the functions explicitly. Instead, the pipe takes care of that automatically for us, making the functions receive as their first input the piped-in data, and in turn producing transformed data as their output, which becomes the input for the next function in line.

In fact, there is no need to even assign snailDat. The pipe can just as well start with read_csv to load the dataset:

read_csv("island-FL.csv") |>
  mutate(stdSize = (size - mean(size)) / sd(size)) |>
  filter(stdSize > 2) |>
  select(species) |>
  distinct()

# A tibble: 2 × 1
  species     
  <chr>       
1 unifasciatus
2 nux
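To check that nested function calls and a pipeline really are two ways of writing the same computation, one can run both and compare the results with identical. A minimal sketch on a hypothetical toy table (the threshold of 15 is arbitrary and chosen only for this example):

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of snailDat
toyDat <- tibble(
  species = c("ustulatus", "nux", "calvus", "calvus", "nux"),
  size    = c(17.1, 21.9, 13.7, 12.3, 26.6)
)

# Nested version: must be read from the inside out
nestedResult <- distinct(select(filter(toyDat, size > 15), species))

# Piped version: reads top to bottom, one step per line
pipedResult <- toyDat |>
  filter(size > 15) |>
  select(species) |>
  distinct()

# Compare the two results
sameResult <- identical(nestedResult, pipedResult)
print(pipedResult)
print(sameResult)
```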

5.3 Exercises

1. The Smith2003_data.txt dataset we worked with in Section 4.8 occasionally has the entry -999 in its last three columns. This stands for unavailable data. As discussed in Section 3.3, in R there is a built-in way of referring to such information: by setting a variable to NA. Modify these columns using mutate so that the entries which are equal to -999 are replaced with NA.
2. After replacing -999 values with NA, remove all rows from the data which contain one or more NA values (hint: look up the function drop_na). How many rows are retained? And what was the original number of rows?

The iris dataset is a built-in table in R. It contains measurements of petal and sepal characteristics from three flower species belonging to the genus Iris (I. setosa, I. versicolor, and I. virginica). If you type iris in the console, you will see the dataset displayed. In solving the problems below, feel free to use the all-important dplyr cheat sheet.

3. The format of the data is not a tibble, but a data.frame. As mentioned in Chapter 4, the two are basically the same for practical purposes, though internally tibbles do offer some advantages. Convert the iris data frame into a tibble. (Hint: look up the as_tibble function.)
4. Verify that there are indeed three distinct species in the data (hint: combine select and distinct in an appropriate way).
5. Select the columns containing petal and sepal length, and species identity.
6. Get those rows of the data with petal length less than 4 cm, but sepal length greater than 4 cm.
7. Sort the data by increasing petal length, breaking ties by decreasing order of sepal length.
8. Create a new column called MeanLength. It should contain the average of the petal and sepal length (i.e., petal length plus sepal length, divided by 2) of each individual flower.
9. Perform the operations from exercises 5-8 sequentially, in a single long function call, using function composition via pipes.
