── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
As seen from the message output above, the dplyr package is part of the tidyverse and gets loaded by default. It allows one to arrange and manipulate data efficiently. The basic functions one should know are select, filter, slice, rename, arrange, and mutate. Additionally, a useful (though not as commonly used) function is distinct, also explained below. We will use the island-FL.csv data file we worked with in Chapter 4:
Now we will give examples of each of the functions select, filter, slice, rename, arrange, mutate, and distinct. They all work similarly in that the first argument they take is the data, in the form of a tibble. Their other arguments, and what they each do, are explained below.
5.1.1 select
This function chooses columns of the data. The second and subsequent arguments of the function are the columns which should be retained. For example, select(snailDat, species) will keep only the species column of snailDat:
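Columns can also be deselected by putting a minus sign in front of their names. For example, select(snailDat, -species) drops the species column, so that we are left with only the columns habitat, size, and shape. One can also select and deselect multiple columns in a similar vein. For example, select(snailDat, -species, -habitat) removes both columns and leaves only size and shape (try it).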
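As a minimal sketch of keeping and dropping columns, the block below uses snailToy, a hypothetical six-row stand-in with the same four columns as snailDat (its values are taken from rows printed later in this chapter). The variable names are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

onlySpecies <- select(snailToy, species)            # keep species only
noSpecies   <- select(snailToy, -species)           # drop species
sizeShape   <- select(snailToy, -species, -habitat) # drop two columns

names(noSpecies)  # habitat, size, shape
```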
There are several other options within select, which mostly help with selecting several columns at a time fulfilling certain criteria. The collection of these options and methods is called tidy selection. First of all, tidy selection allows one to specify, as character strings, what the column names should start or end with, using starts_with and ends_with:
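A short sketch of starts_with and ends_with follows, run on snailToy, a hypothetical small stand-in for snailDat with the same column names.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

startsS <- select(snailToy, starts_with("s"))  # species, size, shape
endsPe  <- select(snailToy, ends_with("pe"))   # shape only
```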
Similarly, the function contains can select columns which contain some string anywhere in their names. For example, select(snailDat, ends_with("pe")) above only selected the shape column, but select(snailDat, contains("pe")) additionally selects species:
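The difference between ends_with and contains can be checked on a small table. The sketch below again assumes snailToy, a hypothetical stand-in for snailDat.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# "species" and "shape" both have "pe" somewhere in their names
containsPe <- select(snailToy, contains("pe"))
names(containsPe)  # species, shape
```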
One can combine these selection methods using the & (“and”), | (“or”), and ! (“not”) logical operators. To select all columns which start with "s" but do not contain the letter "z" in their names:
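One way to write such a combined selection is sketched here on snailToy, a hypothetical stand-in for snailDat with the same column names.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Starts with "s" AND does NOT contain "z": size is excluded
sNoZ <- select(snailToy, starts_with("s") & !contains("z"))
names(sNoZ)  # species, shape
```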
Finally, it is possible to select a range of columns, using the colon (:) operator. To select the columns species and shape, along with any columns in between them, we write:
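A minimal sketch of range selection, assuming snailToy, a hypothetical stand-in whose columns are in the same order as those of snailDat (habitat, species, size, shape):

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# species, shape, and everything between them (here: size)
speciesToShape <- select(snailToy, species:shape)
names(speciesToShape)  # species, size, shape
```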
Such range selection can also be combined with the logical operations above. For instance, to select the range from habitat to species, as well as any columns whose name contains the letter "z":
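Combining a range with another tidy selection helper might look as follows. The table snailToy is a hypothetical stand-in for snailDat.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# The range habitat:species OR any column with "z" in its name
rangeOrZ <- select(snailToy, habitat:species | contains("z"))
names(rangeOrZ)  # habitat, species, size
```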
5.1.2 filter
While select chooses columns, filter chooses rows from the data. As with all these functions, the first argument of filter is the tibble to be filtered. The second argument is a logical condition on the columns. Those rows which satisfy the condition are retained; the rest are dropped.
For example, to retain only those individuals from snailDat whose shell size is at least 29:
The filtered data have only 6 rows instead of the original 223: this is the number of snail individuals with a very large shell size. As seen, five of these belong to the species Naesiotus unifasciatus, and only one to Naesiotus nux.
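The same kind of filtering can be sketched on snailToy, a hypothetical six-row stand-in for snailDat. Since none of its rows reach size 29, the sketch uses a lower cutoff of 20, chosen only for illustration; on the full data the condition would be size >= 29.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Keep only rows whose size is at least 20 (illustrative cutoff)
bigSnails <- filter(snailToy, size >= 20)
nrow(bigSnails)  # 3
```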
5.1.3 slice
With slice, one can choose rows of the data, just like with filter. Unlike filter, however, slice receives a vector of row indices to retain instead of a condition to be tested on each row. So, for example, to keep only the first, second, and fifth rows, one can use slice:
(Note: the numbers in front of the rows in the output generated by tibbles always pertain to the row numbers of the current table, not the one from which they were created. So the row labels 1, 2, 3 above simply enumerate the rows of the sliced data. The actual rows still correspond to rows 1, 2, and 5 in the original snailDat.)
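A minimal sketch of slicing by row index, using snailToy, a hypothetical stand-in for snailDat:

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Keep rows 1, 2, and 5 of the table
someRows <- slice(snailToy, 1, 2, 5)
someRows$species  # ustulatus, ustulatus, galapaganus
```

When printed, someRows is labeled with row numbers 1 to 3, even though its third row was the fifth row of snailToy.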
5.1.4 rename
The rename function simply gives new names to existing columns. The first argument, as always, is the tibble in which the column(s) should be renamed. The subsequent arguments follow the pattern new_name = old_name in replacing column names. For example, in the land snail data, the arid and humid habitats are often referred to as arid or humid zones. To rename habitat to zone, we simply write:
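The new_name = old_name pattern can be sketched as follows, on snailToy, a hypothetical stand-in for snailDat:

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# new name on the left, old name on the right
renamed <- rename(snailToy, zone = habitat)
names(renamed)  # zone, species, size, shape
```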
5.1.5 arrange
The arrange function rearranges the rows of the data, in increasing order of the column given as the second argument. For example, to arrange in increasing order of size, we write:
To arrange in decreasing order, there is a small helper function called desc. Arranging by desc(size) instead of size will arrange the rows in decreasing order of size:
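Both sorting directions are sketched below on snailToy, a hypothetical stand-in for snailDat.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

increasing <- arrange(snailToy, size)        # smallest shell first
decreasing <- arrange(snailToy, desc(size))  # largest shell first

increasing$size  # 13.7 17.1 18.9 20.1 21.9 26.6
```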
It is also perfectly possible to arrange by a column whose type is character string. In that case, the system will sort the rows in alphabetical order—or reverse alphabetical order in case desc is applied. For example, to sort in alphabetical order of species names:
Notice that when we sort the rows by species, there are many ties—rows with the same value of species. In those cases, arrange will not be able to decide which rows should come earlier, and so any ordering that was present before invoking arrange will be retained. In case we would like to break the ties, we can give further sorting variables, as the third, fourth, etc. arguments to arrange. To sort the data by species, and to resolve ties in order of increasing size, we write:
This causes the table to be sorted primarily by species, but in case there are ties (equal species between multiple rows), they will be resolved in priority of size—first the smallest and then increasingly larger individuals.
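Tie-breaking with a second sorting variable can be sketched on snailToy, a hypothetical stand-in for snailDat in which two species occur twice:

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Sort by species first; within a species, by increasing size
sorted <- arrange(snailToy, species, size)
sorted$species  # calvus, galapaganus, nux, nux, ustulatus, ustulatus
sorted$size     # 13.7 18.9 21.9 26.6 17.1 20.1
```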
5.1.6 mutate
The mutate function allows us to create new columns from existing ones. We may apply any function or operator we learned about to existing columns, and the result of the computation will go into the new column. We do this in the second argument of mutate (the first, as always, is the data tibble) by first giving a name to the column, then writing =, and then the desired computation. For example, we could create a new column indicating whether a snail is “large” (has a shell size above some threshold—say, 25) or “small”. We can do this using the ifelse function within mutate:
# A tibble: 223 × 5
   habitat species      size  shape shellSize
   <chr>   <chr>       <dbl>  <dbl> <chr>
 1 humid   ustulatus    17.1 -0.029 small
 2 humid   ustulatus    20.1 -0.001 small
 3 humid   ustulatus    16.3  0.014 small
 4 arid    calvus       13.7 -0.043 small
 5 humid   nux          21.9 -0.042 small
 6 humid   ustulatus    16.8 -0.023 small
 7 humid   ustulatus    19.2  0.014 small
 8 humid   ustulatus    16.0  0.042 small
 9 arid    galapaganus  18.9  0.011 small
10 humid   nux          26.6  0     large
# ℹ 213 more rows
The original columns of the data are retained, but we now also have the additional shellSize column.
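A sketch of how such a column could be created is given below. It is run on snailToy, a hypothetical stand-in for snailDat, and it assumes that "above the threshold" means size > 25.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# New column: "large" if size exceeds 25, otherwise "small"
labeled <- mutate(snailToy, shellSize = ifelse(size > 25, "large", "small"))
labeled$shellSize  # five times "small", then "large"
```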
One very common transformation on quantities such as size and shape is to standardize them: subtract the overall mean from each entry and then divide the result by the standard deviation. This makes the quantities unitless with mean 0 and standard deviation 1. This is how one can perform this standardization with mutate:
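The standardization can be sketched as follows on snailToy, a hypothetical stand-in for snailDat. The new column names stdSize and stdShape are illustrative choices, not names taken from the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Subtract the mean, then divide by the standard deviation
standardized <- mutate(
  snailToy,
  stdSize  = (size - mean(size)) / sd(size),
  stdShape = (shape - mean(shape)) / sd(shape)
)
```

Several new columns can be created in one mutate call by separating them with commas, as done here.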
5.1.7 distinct
While not as important as the previous six functions, distinct can also be useful. It takes a tibble as its input and removes duplicated rows, so that every distinct row appears exactly once. For example, we might wonder how many different species there are in snailDat. One way to answer this is to select the species column only, and then apply distinct to remove duplicated entries:
So each individual in the data comes from one of the above seven species.
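The select-then-distinct combination can be sketched on snailToy, a hypothetical stand-in for snailDat. It contains only four of the seven species, so the result has four rows.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Keep only the species column, then drop repeated rows
speciesList <- distinct(select(snailToy, species))
nrow(speciesList)  # 4
```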
5.2 Using pipes to our advantage
Let us take a slightly more complicated (and quite typical) data analysis task. We want to answer the question: how many species are there with at least some individuals whose standardized shell size is larger than a threshold—say, 2? This corresponds to individuals whose size is two standard deviations above the community average.
To obtain an answer, we could first mutate a new column that contains the standardized shell size for each individual. We can then filter for those rows only for which this is greater than 2. Afterwards, we can use select to choose just the species column. This will likely have repeated entries (because multiple individuals from the same species could have a standardized size greater than 2), so as a final step, we should remove duplicated rows with distinct.
Since mutate, filter, select, distinct, etc. are just ordinary functions, they do not “modify” data. They merely take a tibble as input (plus other arguments) and return another tibble, leaving the original input data untouched. In order for R not to forget their results immediately after they are computed, those results have to be stored in variables. So one way of implementing the solution might rely on repeated assignments, as below:
# A tibble: 2 × 1
species
<chr>
1 unifasciatus
2 nux
It turns out that only two out of the seven species have individuals with shells that large.
While this solution works, it requires inventing arbitrary variable names at every step, or else overwriting variables. For such a short example, this is not problematic, but doing the same for a long pipeline of dozens of steps could get confusing, as well as dangerous due to the repeatedly modified variables.
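The repeated-assignment approach might look like the sketch below. It is run on snailToy, a hypothetical stand-in for snailDat; because no value in such a small table lies two standard deviations above the mean, the sketch uses a threshold of 1 instead of 2. The names stdSize, threshold, and step1 to step4 are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

threshold <- 1  # illustrative; the text uses 2 on the full data

# Every intermediate result needs its own variable
step1 <- mutate(snailToy, stdSize = (size - mean(size)) / sd(size))
step2 <- filter(step1, stdSize > threshold)
step3 <- select(step2, species)
step4 <- distinct(step3)
step4$species  # nux
```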
Another possible solution is to rely on function composition (Section 3.2). Applying repeated composition is straightforward—in principle. In practice, when composing many functions together, things can get unwieldy quite quickly. Let us see what such a solution looks like:
# A tibble: 2 × 1
species
<chr>
1 unifasciatus
2 nux
The expression is highly unpleasant: to a human reader, it is not at all obvious what is happening above. It would be nice to clarify this workflow if possible.
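For comparison, a nested version of the same computation is sketched here on snailToy, a hypothetical stand-in for snailDat, with an illustrative threshold of 1 (the small table has no value two standard deviations above its mean). The expression must be read from the inside out.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

# Innermost call runs first: mutate, then filter, select, distinct
nested <- distinct(select(filter(
  mutate(snailToy, stdSize = (size - mean(size)) / sd(size)),
  stdSize > 1), species))
nested$species  # nux
```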
It turns out that one can do this by making use of the pipe operator |> from Section 3.3. As a reminder: for any function f and function argument x, f(x, y, ...) is the same as x |> f(y, ...), where the ... denote potential further arguments to f. That is, the first argument of the function can be moved from the argument list to in front of the function, before the pipe symbol. The tidyverse functions take the data as their first argument, which means that the use of pipes allows us to very conveniently chain together multiple steps of data analysis. In our case, we can rewrite the above (quite confusing) code block in a much more transparent way:
Again, the pipe |> should be pronounced “then”. We take the data, then we mutate it, then we filter for large-shelled individuals, then we select one of the columns, and then we remove all duplicated entries in that column. In performing these steps, each function both receives and returns data. Thus, by starting out with the original snailDat, we no longer need to write out the data argument of the functions explicitly. Instead, the pipe takes care of that automatically for us, making the functions receive as their first input the piped-in data, and in turn producing transformed data as their output, which becomes the input for the next function in line.
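A piped version of the workflow is sketched below on snailToy, a hypothetical stand-in for snailDat, with an illustrative threshold of 1 in place of 2 (the small table has no value two standard deviations above its mean). The native pipe |> requires R 4.1 or later.

```r
library(dplyr)

# Hypothetical stand-in for snailDat (same columns, only six rows)
snailToy <- tibble(
  habitat = c("humid", "humid", "arid", "humid", "arid", "humid"),
  species = c("ustulatus", "ustulatus", "calvus", "nux", "galapaganus", "nux"),
  size    = c(17.1, 20.1, 13.7, 21.9, 18.9, 26.6),
  shape   = c(-0.029, -0.001, -0.043, -0.042, 0.011, 0)
)

piped <- snailToy |>
  mutate(stdSize = (size - mean(size)) / sd(size)) |>  # standardize size
  filter(stdSize > 1) |>                               # keep large shells
  select(species) |>                                   # keep species only
  distinct()                                           # drop repeats
piped$species  # nux
```

Each line corresponds to one "then" in the spoken description, and no intermediate variables are needed.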
In fact, there is no need to even assign snailDat. The pipe can just as well start with read_csv to load the dataset:
# A tibble: 2 × 1
species
<chr>
1 unifasciatus
2 nux
5.3 Exercises
1. The Smith2003_data.txt dataset we worked with in Section 4.8 occasionally has the entry -999 in its last three columns. This stands for unavailable data. As discussed in Section 3.3, in R there is a built-in way of referring to such information: by setting a variable to NA. Modify these columns using mutate so that the entries which are equal to -999 are replaced with NA.
2. After replacing -999 values with NA, remove all rows from the data which contain one or more NA values (hint: look up the function drop_na). How many rows are retained? And what was the original number of rows?
The iris dataset is a built-in table in R. It contains measurements of petal and sepal characteristics from three flower species belonging to the genus Iris (I. setosa, I. versicolor, and I. virginica). If you type iris in the console, you will see the dataset displayed. In solving the problems below, feel free to use the all-important dplyr cheat sheet.
3. The format of the data is not a tibble, but a data.frame. As mentioned in Chapter 4, the two are basically the same for practical purposes, though internally tibbles do offer some advantages. Convert the iris data frame into a tibble. (Hint: look up the as_tibble function.)
4. Verify that there are indeed three distinct species in the data (hint: combine select and distinct in an appropriate way).
5. Select the columns containing petal and sepal length, and species identity.
6. Get those rows of the data with petal length less than 4 cm, but sepal length greater than 4 cm.
7. Sort the data by increasing petal length, breaking ties by decreasing order of sepal length.
8. Create a new column called MeanLength. It should contain the average of the petal and sepal length (i.e., petal length plus sepal length, divided by 2) of each individual flower.
9. Perform the operations from exercises 5-8 sequentially, in a single long function call, using function composition via pipes.