8 Some further plotting options; introducing factors

8.1 Smoothing and regression lines

In Chapter 7 we learned about aesthetic mappings and various geom_ options such as geom_point, geom_histogram, and geom_boxplot. Let us explore another type of geom_, which approximates the trend of a set of data points with a line and an error bar that shows the confidence interval of the estimate at each point:

library(tidyverse)

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  geom_smooth()

While such fits are occasionally useful, we often want a linear least-squares regression on our data. To get such a linear fit, add the argument method = lm to geom_smooth() (lm stands for “linear model”):

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  geom_smooth(method = lm)

Linear regression lines are usually shown without the confidence intervals (the gray band around the regression line). To drop this, set se = FALSE:

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

What happens if we color the data points by species? Let us add colour = Species to the list of aesthetic mappings:

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

Now the regression line is automatically fitted to the data within each of the groups separately—a highly useful behavior.

Notice also that a color legend was automatically created and appended to the right of the graph. This legend positioning is the default in ggplot2. You can move the legend to another position by specifying the legend.position option within the theme function that can be added onto the plot:

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position = "left") # Or: "top", "bottom", "right", "none"

Specifying legend.position = "none" omits the legend altogether.

A word of caution: in case the legend positioning is matched with a generic theme such as theme_bw(), one should put the legend position after the main theme definition. The reason is that pre-defined themes like theme_bw() override any specific theme options you might specify. The rule of thumb is: any theme() component to your plot should be added only after the generic theme definition. Otherwise the theme() component will be overridden and will not take effect. For example, this does not work as intended:

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position = "left") + # Position legend at the left
  theme_bw() # Define general theme - and thus override the line above...

But this one does:

iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme_bw() + # This defines the general theme
  theme(legend.position = "left") # Now override default legend positioning

8.2 Scales

The aesthetic mappings of a graph (x-axis, y-axis, color, fill, size, shape, alpha, …) are automatically rendered into the displayed plot, based on certain default settings within ggplot2. These defaults can be altered, however. Consider the following bare-bones plot:

iris |>
  ggplot(aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

We can now change, for example, how the y-axis is displayed. The component to be added to the plot is scale_y_continuous(). Here scale means we are going to change the scaling of some aesthetic mapping, y refers to the y-axis (as expected, it can be replaced with x, colour, fill, etc.), and continuous means that the scaling of the axis is not via discrete values (e.g., either 1 or 2 or 3 but nothing in between), but continuous (every real number is permissible along the y-axis). The plot component scale_y_continuous() takes several arguments; take a look at its help page to see all possible options. Here we mention a few of them. First, there is the name option, which is used to relabel the axis. The limits argument accepts a vector of two values, containing the lower and upper limits of the plot. If any of them is set to NA, the corresponding limit will be determined automatically. Next, the breaks argument controls where the tick marks along the axis go. It is given as a vector, with its entries corresponding to the y-coordinates of the tick marks. Finally, labels determines what actually gets written on the axis at the tick mark points—it is therefore also a vector, its length matching that of breaks.

As an example, let us scale the y-axis of the previous graph in the following way. The axis label should read “Petal length [cm]”, instead of the current “Petal.Length”. It should go from 0 to 7, with a break at those two values and also halfway in between at 3.5. Here is how to do this:

iris |>
  ggplot(aes(x = Species, y = Petal.Length)) +
  geom_boxplot() +
  scale_y_continuous(name = "Petal length [cm]",
                     limits = c(0, 7),
                     breaks = c(0, 3.5, 7))

What should we do if, for some reason, we would additionally like the “3.5” in the middle to be displayed as “7/2” instead (an exact value)? In that case, we can add an appropriate labels option as an argument to scale_y_continuous:

iris |>
  ggplot(aes(x = Species, y = Petal.Length)) +
  geom_boxplot() +
  scale_y_continuous(name = "Petal length [cm]",
                     limits = c(0, 7),
                     breaks = c(0, 3.5, 7),
                     labels = c("0", "7/2", "7"))

The x-axis can be scaled similarly. One important difference though is that here the x-axis has a discrete scale. Since we are displaying the species along it, any value must be either setosa or versicolor or virginica; it makes no sense to talk about what is “halfway in between setosa and versicolor”. Therefore, one should use scale_x_discrete(). Its options are similar to those of scale_x_continuous(). For instance, let us override the axis label, spelling out that the three species belong to the genus Iris:

iris |>
  ggplot(aes(x = Species, y = Petal.Length)) +
  geom_boxplot() +
  scale_y_continuous(name = "Petal length [cm]",
                     limits = c(0, 7),
                     breaks = c(0, 3.5, 7),
                     labels = c("0", "7/2", "7")) +
  scale_x_discrete(name = "Species (genus: Iris)")

Alternatively, one could also redefine the labels and get an equally good graph:

iris |>
  ggplot(aes(x = Species, y = Petal.Length)) +
  geom_boxplot() +
  scale_y_continuous(name = "Petal length [cm]",
                     limits = c(0, 7),
                     breaks = c(0, 3.5, 7),
                     labels = c("0", "7/2", "7")) +
  scale_x_discrete(labels = c("Iris setosa", "Iris versicolor",
                              "Iris virginica"))

Note

In case you would like to display the species names in italics, as is standard requirement when writing binomial nomenclature, feel free to add theme(axis.text.x = element_text(face = "italic")) to the end of the plot. We will not be going into more detail on tweaking themes, but feel free to explore the possibilities by looking at the help pages or Googling them.

Other aesthetic mappings can also be adjusted, such as colour, fill, size, or alpha. One useful way to do it is through scale_colour_manual(), scale_fill_manual(), and so on. These are like scale_colour_discrete(), scale_fill_discrete() etc., except that they allow one to specify a discrete set of values by hand. Let us do this for color and fill:

iris |>
  ggplot(aes(x = Species, y = Petal.Length,
             colour = Species, fill = Species)) +
  geom_boxplot(alpha = 0.2) +
  scale_y_continuous(name = "Petal length [cm]",
                     limits = c(0, 7),
                     breaks = c(0, 3.5, 7),
                     labels = c("0", "7/2", "7")) +
  scale_x_discrete(labels = c("Iris setosa", "Iris versicolor",
                              "Iris virginica")) +
  scale_colour_manual(values = c("steelblue", "goldenrod", "forestgreen")) +
  scale_fill_manual(values = c("steelblue", "goldenrod", "forestgreen"))

We used the built-in color names "steelblue", "goldenrod", and "forestgreen" above. A useful R color cheat sheet can be found here, for more options and built-in color names.

8.3 Reordering labels using factors

In the previous examples, the order of text labels was always automatically determined. The basic rule in R is that the ordering follows the alphabet: the default is for setosa to precede versicolor, which will be followed by virginica. This default ordering can be inconvenient, however. Consider the data file temp-Lin.csv of average temperatures per month, measured in the town of Linköping, Sweden:

read_csv("temp-Lin.csv")

# A tibble: 12 × 2
   month temp_C
   <chr>  <dbl>
 1 Jan     -1.9
 2 Feb     -1.7
 3 Mar      0.9
 4 Apr      6.3
 5 May     11.5
 6 Jun     15.4
 7 Jul     17.9
 8 Aug     16.6
 9 Sep     12.8
10 Oct      7.5
11 Nov      3.4
12 Dec      0.1

The table has two columns: month and temp_C, giving the mean temperature in each month across years 1991-2021.¹ So far so good. However, if we plot this with months along the x-axis and temperature along the y-axis, we run into trouble because R displays items by alphabetical instead of chronological order:

read_csv("temp-Lin.csv") |>
  ggplot(aes(x = month, y = temp_C)) +
  geom_point(colour = "steelblue") +
  scale_y_continuous(name = "average temperature (Celsius)") +
  theme_bw()

To fix this, one must convert the type of month from a simple vector of character strings to a vector of factors. Factors are categorical variables (i.e., take on well-defined distinct values instead of varying on a continuous scale like double-precision numbers), but with an extra attribute which determines the order of those values. This ordering is often referred to as the levels of the factor. The first of the values has level 1, the next one level 2, and so on.

One very convenient way of assigning factor levels is through the tidyverse function as_factor.² This function takes a vector of values and, if the values are numeric, assigns them levels based on those numerical values. However, if the values are character strings, then the levels are assigned in order of appearance within the vector. This is perfect for us, because the months are in proper order already within the tibble:

read_csv("temp-Lin.csv") |>
  mutate(month = as_factor(month)) |>
  ggplot(aes(x = month, y = temp_C)) +
  geom_point(colour = "steelblue") +
  scale_y_continuous(name = "average temperature (Celsius)") +
  theme_bw()

It is also possible to take a factor and reassign its levels manually. This can be done with the fct_relevel function:

read_csv("temp-Lin.csv") |>
  mutate(month = fct_relevel(month, "Jan", "Feb", "Mar", "Apr", "May", "Jun",
                             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) |>
  ggplot(aes(x = month, y = temp_C)) +
  geom_point(colour = "steelblue") +
  scale_y_continuous(name = "average temperature (Celsius)") +
  theme_bw()

Also, the use of fct_relevel need not be this laborious. If all we want to do is place a few factor levels to be first ones without changing any of the others, it is possible to enter just their names. Often, for example, a factor column holds various experimental treatments, one of which is called "control". In that case, all we might want to do is to make the control be the first factor level, without altering any of the others. If treatment is the name of the vector (or column in a data frame) that holds the different experimental treatment names, then this can be done with fct_relevel(treatment, "control").

8.4 Facets

Plots can be faceted (subplots created and arranged in a grid layout) based on some variable or variables. For instance, let us create histograms of petal lengths in the iris dataset, like we did last time:

iris |>
  ggplot(aes(x = Petal.Length)) +
  geom_histogram()

This way, one cannot see which part of the histogram belongs to which species. One fix to this is to color the histogram by species—this is what we have done before. Another is to separate the plot into three facets, each displaying data for one of the species only:

iris |>
  ggplot(aes(x = Petal.Length)) +
  geom_histogram() +
  facet_grid(. ~ Species)

The component facet_grid(x ~ y) means that the data will be grouped based on columns x and y, with the distinct values of column x making up the rows and those of column y the columns of the grid of plots. If one of them is replaced with a dot (as above), then that variable is ignored, and only the other variable is used in creating a row (or column) of subplots. So, to display the same data but with the facets arranged in one column instead of one row, we simply replace facet_grid(. ~ Species) with facet_grid(Species ~ .):

iris |>
  ggplot(aes(x = Petal.Length)) +
  geom_histogram() +
  facet_grid(Species ~ .)

In this particular case, the above graph is preferable to the previous one, because the three subplots now share the same x-axis. This makes it easier to compare the distribution of petal lengths across the species.

To illustrate how to make a two-dimensional grid of facets, let us put the iris dataset in tidy form using pivot_longer():

as_tibble(iris) |>
  pivot_longer(cols = !Species,
               names_to = "Trait",
               values_to = "Measurement")

# A tibble: 600 × 3
   Species Trait        Measurement
   <fct>   <chr>              <dbl>
 1 setosa  Sepal.Length         5.1
 2 setosa  Sepal.Width          3.5
 3 setosa  Petal.Length         1.4
 4 setosa  Petal.Width          0.2
 5 setosa  Sepal.Length         4.9
 6 setosa  Sepal.Width          3  
 7 setosa  Petal.Length         1.4
 8 setosa  Petal.Width          0.2
 9 setosa  Sepal.Length         4.7
10 setosa  Sepal.Width          3.2
# ℹ 590 more rows

(We specified the columns for tidying via !Species, meaning all columns except Species are selected.) As seen, now the Measurement in every row is characterized by two other variables: Species and Trait (i.e., whether the given value refers to the sepal length, petal width etc. of the given species). We can create a histogram of each measured trait for each species now, in a remarkably simple way:

as_tibble(iris) |>
  pivot_longer(cols = c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width),
               names_to = "Trait",
               values_to = "Measurement") |>
  ggplot(aes(x = Measurement)) +
  geom_histogram() +
  facet_grid(Species ~ Trait)

8.5 Saving plots

To save the most recently created ggplot figure, simply type

ggsave(filename = "graph.pdf", width = 4, height = 3)

Here filename is the name (with path and extension) of the file you want to save the figure into. The extension is important: by having specified .pdf, the system automatically saves the figure in PDF format. To use, say, PNG instead:

ggsave(filename = "graph.png", width = 4, height = 3)

PDF is a vectorized file format: the file contains the instructions for generating the plot elements instead of a pixel representation of the image. Consequently, PDF figures are arbitrarily scalable, and are therefore the preferred way of saving and handling scientific graphs.

The width and height parameters specify, in inches, the dimensions of the saved plot. Note that this also scales some other plot elements, such as the size of the axis labels and plot legends. This means you can play with the width and height parameters to save the figure at a size where the labels are clearly visible without being too large.

In case you would like to save a figure that is not the last one that was generated, you can specify the plot argument to ggsave(). to do so, first you should assign a plot to a variable. For example:

p <- iris |> # Assign the ggplot object to the variable p
  ggplot(aes(x = Petal.Length)) +
  geom_histogram()

and then

ggsave(filename = "graph.pdf", plot = p, width = 4, height = 3)

8.6 Exercises

Let us revisit the data of Fauchald et al. (2017) which we used in Section 7.4, tracking the population size of various herds of caribou in North America over time and correlating population cycling with the amount of vegetation and sea-ice cover. Using the file sea_ice.tsv (sea ice cover per year and month for each caribou herd), do the following:

One exercise from Section 7.4 was to plot Year along the x-axis and Month along the y-axis, with color tile shading indicating the level of ice cover for the herd labeled WAH in each month-year pair (using geom_tile). The resulting graph had an obvious weakness: the months along the y-axis were not in proper chronological order. Fix this problem by converting the Month column from character strings to factors whose levels go from January to December.
Create a similar plot, but do not filter for one single herd. Instead, have each herd occupy a different facet (sub-plot). Do this with the different herds making up a single row of facets, and then a single column of them.
Since there are 11 different herds altogether, neither the row-wise nor the column-wise arrangement of the facets above is very satisfying: there are too many of them next to one another. Try re-plotting the results, but instead of facet_grid, use facet_wrap(~ Herd) which allows one to lay out the facets in several rows. (As usual, look up the help pages of facet_wrap if needed.)

The remaining exercises use the Galápagos land snail data (Section 4.2.2).

Create a plot with standardized size along the x-axis, standardized shape along the y-axis, each individual represented by a point colored by species, and with two facets corresponding to humid and arid habitat types. The facets should be side by side. How does this figure influence the interpretation you had in Section 7.4, exercise 15? That is: does the splitting of communities based on habitat type increase or decrease the overlap between different species?
Re-create the previous plot with the two side-by-side facets, but in reverse order: the humid facet should be on the left and arid on the right. (Hint: convert habitat to factors!)

Data from https://climate-data.org.↩︎
There also exists a similarly-named function called as.factor, in addition to as_factor. It is the base R version of the same functionality. As usual, the tidyverse version offers improvements over the original, so it is recommended not to use as.factor at all, relying on just as_factor instead.↩︎