6 Summary statistics and tidy data
Chapter 5 introduced the basics of manipulating data. Here we continue by learning how one can efficiently compute summary statistics over a dataset. Additionally, we will take a look at the concept of tidy data. The two are not independent: as we will see, tidy data lend themselves to computing summaries much more easily than non-tidy data.
We will be relying on the file pop_data.csv
. Download the file and set your working directory to the folder where you saved it. This is a comma-separated file (CSV), so you can load it using read_csv
. As usual, we first load the tidyverse
package:
library(tidyverse)
And now we may use the tidyverse
functionalities, such as read_csv
:
pop <- read_csv("pop_data.csv")
The data we just loaded contain population densities of three species at two spatial patches (A and B) at various points in time, ranging from 1 to 50 in steps of 1:
pop
# A tibble: 100 × 5
time patch species1 species2 species3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1
2 1 B 10.1 3.28 6.27
3 2 A 7.76 6.93 10.3
4 2 B 10.1 3.04 6.07
5 3 A 7.09 7.24 10.5
6 3 B 10.1 2.8 5.82
7 4 A 6.49 7.54 10.6
8 4 B 10.1 2.56 5.57
9 5 A 5.99 7.83 10.7
10 5 B 10.1 2.33 5.32
# ℹ 90 more rows
One can now perform various manipulations on these data, by using the functions we have learned about in Chapter 5. For instance, we could create a new column called total
which contains the total community density (sum of the three species’ population densities) at each point in time and each location:
mutate(pop, total = species1 + species2 + species3)
# A tibble: 100 × 6
time patch species1 species2 species3 total
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1 25.1
2 1 B 10.1 3.28 6.27 19.7
3 2 A 7.76 6.93 10.3 25
4 2 B 10.1 3.04 6.07 19.2
5 3 A 7.09 7.24 10.5 24.8
6 3 B 10.1 2.8 5.82 18.7
7 4 A 6.49 7.54 10.6 24.7
8 4 B 10.1 2.56 5.57 18.2
9 5 A 5.99 7.83 10.7 24.5
10 5 B 10.1 2.33 5.32 17.8
# ℹ 90 more rows
As a reminder, this can be written equivalently using pipes as
pop |> mutate(total = species1 + species2 + species3)
# A tibble: 100 × 6
time patch species1 species2 species3 total
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1 25.1
2 1 B 10.1 3.28 6.27 19.7
3 2 A 7.76 6.93 10.3 25
4 2 B 10.1 3.04 6.07 19.2
5 3 A 7.09 7.24 10.5 24.8
6 3 B 10.1 2.8 5.82 18.7
7 4 A 6.49 7.54 10.6 24.7
8 4 B 10.1 2.56 5.57 18.2
9 5 A 5.99 7.83 10.7 24.5
10 5 B 10.1 2.33 5.32 17.8
# ℹ 90 more rows
6.1 Creating summary data
One can create summaries of data using the summarise
function. This will simply apply some function to a column. For example, to calculate the average population density of species 1 in pop
, across both time and patches, one can write
pop |> summarise(meanDensity1 = mean(species1))
# A tibble: 1 × 1
meanDensity1
<dbl>
1 5.30
Here meanDensity1
is the name of the new column to be created, and the mean
function is our summary function, collapsing the data into a single number.
So far, this is not particularly interesting; in fact, the exact same effect would have been achieved by typing the shorter mean(pop$species1)
instead. The real power of summarise
comes through when combined with group_by
. This groups the data based on the given grouping variables. Let us see how this works in practice:
pop |> group_by(patch)
# A tibble: 100 × 5
# Groups: patch [2]
time patch species1 species2 species3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1
2 1 B 10.1 3.28 6.27
3 2 A 7.76 6.93 10.3
4 2 B 10.1 3.04 6.07
5 3 A 7.09 7.24 10.5
6 3 B 10.1 2.8 5.82
7 4 A 6.49 7.54 10.6
8 4 B 10.1 2.56 5.57
9 5 A 5.99 7.83 10.7
10 5 B 10.1 2.33 5.32
# ℹ 90 more rows
Seemingly nothing has happened; the only difference is the extra line of comment above, before the printed table, saying Groups: patch [2]
. What this means is that the rows of the data were internally split into two groups. The first have "A"
as their patch, and the second have "B"
. Whenever one groups data using group_by
, rows which share the same unique combination of the grouping variables now belong together, and subsequent operations will act separately on each group instead of acting on the table as a whole (which is what we have been doing so far). That is, group_by
does not actually alter the data; it only alters the behavior of the functions applied to the grouped data.
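One way to convince ourselves that the data themselves are untouched (an optional check, not part of the original example) is to ask dplyr for the grouping information directly; the helpers group_vars and n_groups report the grouping variables and the number of groups:
pop |> group_by(patch) |> group_vars()   # the grouping variable: "patch"
pop |> group_by(patch) |> n_groups()     # the number of groups: 2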
If we group not just by patch
but also by time
, the comment above the table will read Groups: patch, time [100]
:
pop |> group_by(patch, time)
# A tibble: 100 × 5
# Groups: patch, time [100]
time patch species1 species2 species3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1
2 1 B 10.1 3.28 6.27
3 2 A 7.76 6.93 10.3
4 2 B 10.1 3.04 6.07
5 3 A 7.09 7.24 10.5
6 3 B 10.1 2.8 5.82
7 4 A 6.49 7.54 10.6
8 4 B 10.1 2.56 5.57
9 5 A 5.99 7.83 10.7
10 5 B 10.1 2.33 5.32
# ℹ 90 more rows
This is because there are 100 unique combinations of patch and time: two different patch
values ("A"
and "B"
), and fifty points in time (1, 2, …, 50). So we have “patch A, time 1” as group 1, “patch B, time 1” as group 2, “patch A, time 2” as group 3, and so on until “patch B, time 50” as our group 100.
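If we want to double-check this count (an optional aside), we can ask dplyr how many groups the grouped table contains using n_groups:
pop |> group_by(patch, time) |> n_groups()   # 100 unique patch-time combinations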
As mentioned, functions that are applied to grouped data will act on the groups separately. To return to the example of calculating the mean population density of species 1 in the two patches, we can write:
pop |>
  group_by(patch) |>
  summarise(meanDensity1 = mean(species1))
# A tibble: 2 × 2
patch meanDensity1
<chr> <dbl>
1 A 5.29
2 B 5.32
One may obtain multiple summary statistics within the same summarise
function. Below we compute both the mean and the standard deviation of the densities per patch:
pop |>
  group_by(patch) |>
  summarise(meanDensity1 = mean(species1), sdDensity1 = sd(species1))
# A tibble: 2 × 3
patch meanDensity1 sdDensity1
<chr> <dbl> <dbl>
1 A 5.29 0.833
2 B 5.32 3.81
Let us see what happens if we calculate the mean density of species 1—but grouping by time
instead of patch
:
pop |>
  group_by(time) |>
  summarise(meanDensity1 = mean(species1))
# A tibble: 50 × 2
time meanDensity1
<dbl> <dbl>
1 1 9.28
2 2 8.94
3 3 8.60
4 4 8.3
5 5 8.04
6 6 7.84
7 7 7.68
8 8 7.54
9 9 7.44
10 10 7.35
# ℹ 40 more rows
The resulting table has 50 rows—half the number of rows in the original data, but many more than the two rows we get after grouping by patch
. The reason is that there are 50 unique time points, and so the average is now computed over those rows which share time
. But there are only two rows per moment of time: the rows corresponding to patch A and patch B. When we call summarise
after having grouped by time
, the averages are computed over the densities in these two rows only, per group. That is why here we end up with a table which has a single row per point in time.
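To verify that each time point indeed spans only two rows (a quick, optional check), one can count the rows per value of time with dplyr's count function:
pop |> count(time)   # two rows (patch A and patch B) for every time point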
One common mistake when first encountering grouping and summaries is to assume that if we call group_by(patch)
, then the subsequent summaries will be taken over patches. This is not the case, and it is important to take a moment to understand why. When we apply group_by(patch)
, we are telling R to treat different patch values as group indicators. Therefore, when creating a summary, only the patch identities are retained from the original data, to which the newly calculated summary statistics are added. This means that the subsequent summaries are taken over everything except the patches. This should be clear after comparing the outputs of
pop |> group_by(patch) |> summarise(meanDensity1 = mean(species1))
and
pop |> group_by(time) |> summarise(meanDensity1 = mean(species1))
The first distinguishes the rows of the data only by patch
, and therefore the average is taken over time. The second distinguishes the rows by time
, so the average is taken over the patches. Run the two expressions again to see the difference between them!
We can use functions such as mutate
or filter
on grouped data. For example, we might want to know the difference of species 1’s density from its average in each patch. Doing the following does not quite do what we want:
pop |> mutate(species1Diff = species1 - mean(species1))
# A tibble: 100 × 6
time patch species1 species2 species3 species1Diff
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1 3.13
2 1 B 10.1 3.28 6.27 4.83
3 2 A 7.76 6.93 10.3 2.46
4 2 B 10.1 3.04 6.07 4.82
5 3 A 7.09 7.24 10.5 1.79
6 3 B 10.1 2.8 5.82 4.82
7 4 A 6.49 7.54 10.6 1.19
8 4 B 10.1 2.56 5.57 4.81
9 5 A 5.99 7.83 10.7 0.685
10 5 B 10.1 2.33 5.32 4.80
# ℹ 90 more rows
This will put the difference of species 1’s density from its mean density across both time and patches into the new column species1Diff
. That is not the same as calculating the difference from the mean in a given patch—patch A for rows corresponding to patch A, and patch B for the others. To achieve this, all one needs to do is to group the data by patch
before invoking mutate
:
pop |>
  group_by(patch) |>
  mutate(species1Diff = species1 - mean(species1))
# A tibble: 100 × 6
# Groups: patch [2]
time patch species1 species2 species3 species1Diff
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1 3.14
2 1 B 10.1 3.28 6.27 4.81
3 2 A 7.76 6.93 10.3 2.47
4 2 B 10.1 3.04 6.07 4.80
5 3 A 7.09 7.24 10.5 1.80
6 3 B 10.1 2.8 5.82 4.80
7 4 A 6.49 7.54 10.6 1.20
8 4 B 10.1 2.56 5.57 4.79
9 5 A 5.99 7.83 10.7 0.702
10 5 B 10.1 2.33 5.32 4.78
# ℹ 90 more rows
Comparing this with the previous table, we see that the values in the species1Diff
column are now different, because this time the differences are taken with respect to the average densities in each patch.
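The same logic applies to filter. As a small illustration (this particular example is not from the original text), we could keep, within each patch, only those rows where species 1 is above its patch-level average:
pop |>
  group_by(patch) |>
  filter(species1 > mean(species1)) |>   # the comparison is made within each patch
  ungroup()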
Finally, since group_by
changes subsequent behavior, we eventually want to get rid of the grouping in our data. This can be done with the function ungroup
. For example:
pop |>
  group_by(patch) |>
  mutate(species1Diff = species1 - mean(species1)) |>
  ungroup()
# A tibble: 100 × 6
time patch species1 species2 species3 species1Diff
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1 3.14
2 1 B 10.1 3.28 6.27 4.81
3 2 A 7.76 6.93 10.3 2.47
4 2 B 10.1 3.04 6.07 4.80
5 3 A 7.09 7.24 10.5 1.80
6 3 B 10.1 2.8 5.82 4.80
7 4 A 6.49 7.54 10.6 1.20
8 4 B 10.1 2.56 5.57 4.79
9 5 A 5.99 7.83 10.7 0.702
10 5 B 10.1 2.33 5.32 4.78
# ℹ 90 more rows
It is good practice to always ungroup
the data after we have calculated what we wanted using the group structure. Otherwise, subsequent calculations could be influenced by the grouping in unexpected ways.
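To see why (a minimal sketch; the column name meanDensity1 is arbitrary), compare applying the same mutate to a still-grouped versus an ungrouped table: in the first case the mean is computed per patch, in the second over the whole dataset.
# Still grouped: each row receives the mean of its own patch
pop |>
  group_by(patch) |>
  mutate(meanDensity1 = mean(species1))
# Ungrouped: each row receives the overall mean instead
pop |>
  group_by(patch) |>
  ungroup() |>
  mutate(meanDensity1 = mean(species1))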
6.2 Tidy data
In science, we often strive to work with tidy data (also known as long-format data or normal-form data). A dataset is tidy if:
- Each variable is in its own column;
- Each observation is in its own row.
Tidy data are well suited to manipulation, statistical analysis, and plotting. Furthermore, tidy data have a certain well-groomed feel to them, in the sense that their organization always follows the same general pattern regardless of the type of dataset one studies. Paraphrasing Tolstoy: tidy data are all alike; by contrast, every non-tidy dataset tends to be messy in its own unique way.
The tidyverse
offers a simple and convenient way of putting data in tidy format. The pop
table from the previous section is not tidy, because although each variable is in its own column, it is not true that each observation is in its own row. Instead, each row contains three observations: the densities of species 1, 2, and 3 at a given time and place. To tidy up these data, we create key-value pairs. We merge the columns for species densities into just two new ones. The first of these (the key) indicates whether it is species 1, or 2, or 3 which the given row refers to. The second column (the value) contains the population density of the given species. Such key-value pairs are created by the function pivot_longer
:
pop |>
  pivot_longer(cols = c(species1, species2, species3),
               names_to = "species",
               values_to = "density")
# A tibble: 300 × 4
time patch species density
<dbl> <chr> <chr> <dbl>
1 1 A species1 8.43
2 1 A species2 6.62
3 1 A species3 10.1
4 1 B species1 10.1
5 1 B species2 3.28
6 1 B species3 6.27
7 2 A species1 7.76
8 2 A species2 6.93
9 2 A species3 10.3
10 2 B species1 10.1
# ℹ 290 more rows
The function pivot_longer
takes three arguments in addition to the first (data) argument that we may pipe in, like above. First, cols
is the list of columns to be converted into key-value pairs. It uses the same tidy selection mechanisms as the function select
; see Section 5.1.1. (This means that cols = starts_with("species")
could also have been used.) Second, the argument names_to
is the name of the new key column, specified as a character string. And third, values_to
is the name of the new value column, also as a character string.
The above table is now in tidy form: each column records a single variable, and each row contains a single observation. Notice that, unlike the original pop
which had 100 rows and 5 columns, the tidy version has 300 rows and 4 columns. This is natural: since the number of columns was reduced, there must be some extra rows to prevent the loss of information. And one should notice another benefit to casting the data in tidy format: it forces one to explicitly specify what was measured. By having named the value column density
, we now know that the numbers 8.43, 6.62, etc. are density measurements. By contrast, it is not immediately obvious what these same numbers mean under the columns species1
, species2
, … in the original data.
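As a quick sanity check (the name pop_tidy below is just for illustration), one can store the tidy table and confirm its dimensions:
pop_tidy <- pop |>
  pivot_longer(cols = starts_with("species"),
               names_to = "species",
               values_to = "density")
dim(pop_tidy)   # 300 rows and 4 columns, versus 100 rows and 5 columns in pop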
It is possible to undo the effect of pivot_longer
. To do so, use pivot_wider
:
pop |>
  pivot_longer(cols = starts_with("species"),
               names_to = "species",
               values_to = "density") |>
  pivot_wider(names_from = "species", values_from = "density")
# A tibble: 100 × 5
time patch species1 species2 species3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 8.43 6.62 10.1
2 1 B 10.1 3.28 6.27
3 2 A 7.76 6.93 10.3
4 2 B 10.1 3.04 6.07
5 3 A 7.09 7.24 10.5
6 3 B 10.1 2.8 5.82
7 4 A 6.49 7.54 10.6
8 4 B 10.1 2.56 5.57
9 5 A 5.99 7.83 10.7
10 5 B 10.1 2.33 5.32
# ℹ 90 more rows
The two named arguments of pivot_wider
above are names_from
(which specifies the column from which the names for the new columns will be taken), and values_from
(the column whose values will be used to fill in the rows under those new columns).
As a remark, one could make the data even “wider”, by creating a separate column not just for each species, but for each combination of species and patch. Doing so is simple: one just needs to specify both the species
and patch
columns from which the new column names will be compiled.
pop |>
  pivot_longer(cols = starts_with("species"),
               names_to = "species",
               values_to = "density") |>
  pivot_wider(names_from = c("species", "patch"), values_from = "density")
# A tibble: 50 × 7
time species1_A species2_A species3_A species1_B species2_B species3_B
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 8.43 6.62 10.1 10.1 3.28 6.27
2 2 7.76 6.93 10.3 10.1 3.04 6.07
3 3 7.09 7.24 10.5 10.1 2.8 5.82
4 4 6.49 7.54 10.6 10.1 2.56 5.57
5 5 5.99 7.83 10.7 10.1 2.33 5.32
6 6 5.58 8.1 10.7 10.1 2.12 5.08
7 7 5.27 8.34 10.6 10.1 1.92 4.86
8 8 5.02 8.54 10.4 10.1 1.74 4.64
9 9 4.82 8.7 10.0 10.0 1.58 4.43
10 10 4.66 8.82 9.66 10.0 1.43 4.23
# ℹ 40 more rows
If tidy data are what we strive for, what is the practical use of pivot_wider
? There are two answers to this question. First, while non-tidy data are indeed less efficient from a computational and data analysis standpoint, they are often more human-readable. For example, the pop
table is easy to read despite the lack of tidiness, because each row corresponds to a given time and place. By tidying the data, information referring to any given time and place is spread out over multiple (in our case, three) rows, one for each species. While this is preferable from a data analysis point of view, it can be more difficult to digest visually. Second, wide data lend themselves very well to a class of statistical techniques called multivariate analysis, for which wide-format data are often more convenient than tidy data. We will see an example application of this in Section 14.1.
Finally, it is worth noting the power of tidy data in, e.g., generating summary statistics. To obtain the mean and the standard deviation of the population densities for each species in each patch, all one has to do is this:
pop |>
  # Tidy up the data:
  pivot_longer(cols = starts_with("species"),
               names_to = "species",
               values_to = "density") |>
  # Group data by both species and patch:
  group_by(patch, species) |>
  # Obtain statistics:
  summarise(meanDensity = mean(density), sdDensity = sd(density)) |>
  # Don't forget to ungroup the data at the end:
  ungroup()
# A tibble: 6 × 4
patch species meanDensity sdDensity
<chr> <chr> <dbl> <dbl>
1 A species1 5.29 0.833
2 A species2 8.05 0.559
3 A species3 7.51 1.56
4 B species1 5.32 3.81
5 B species2 1.07 0.737
6 B species3 6.57 2.48
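For comparison (a sketch only, not taken from the original text), obtaining the same statistics from the wide-format pop would mean spelling out every species column by hand inside summarise, which quickly becomes tedious as the number of species grows:
pop |>
  group_by(patch) |>
  summarise(meanDensity1 = mean(species1), sdDensity1 = sd(species1),
            meanDensity2 = mean(species2), sdDensity2 = sd(species2),
            meanDensity3 = mean(species3), sdDensity3 = sd(species3)) |>
  ungroup()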
6.3 Exercises
The first set of problems relies on the iris
dataset—the same that we used in the previous chapter’s exercises (Section 5.3). Convert the iris
data to a tibble with the as_tibble
function, and assign it to a variable.
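One possible way to set this up (the variable name iris_tbl is only a suggestion):
iris_tbl <- as_tibble(iris)   # convert the built-in iris data frame to a tibble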
1. Create a new column in the iris dataset which contains the difference of petal lengths from the average of the whole dataset.
2. Create a new column in the iris dataset which contains the difference of petal lengths from the average of each species. (Hint: group_by the species and then mutate!)
3. Create a table where the rows are the three species, and the columns are: average petal length, total spread of petal length (difference between the largest and smallest values), average sepal length, and total spread of sepal length. Each of these should be calculated across flowers of the corresponding species.
4. Create key-value pairs in the iris dataset for the petal characteristics. In other words, have a column called Petal.Trait (whose values are either Petal.Length or Petal.Width), and another column called Petal.Value (with the length/width values).
5. Repeat the same exercise, but now for sepal traits.
6. Finally, do it for both petal and sepal traits simultaneously, to obtain a fully tidy form of the iris data. That is, the key column (call it Flower.Trait) will have the values Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width. And the value column (which you can call Trait.Value) will have the corresponding measurements.
The subsequent exercises use the Galápagos land snail data from the previous two chapters (see Section 4.2.2 to review the description of the data).
7. What is the average shell size of the whole community? What is its total spread (the difference between the largest and smallest values)? How about the average and total spread of shell shape?
8. What is the average shell size / shell shape of the community in each habitat type (humid/arid)?
9. What is the average shell size / shell shape in each unique combination of species and habitat type?
10. Based on your answer to the previous question: how many species are there which live in both arid and humid environments?
11. Organize the size and shape columns in key-value pairs: instead of the original size and shape, have a column called trait (which will either be "size" or "shape") and another column called value which holds the corresponding measurement.