library(tidyverse)
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
geom_point() +
geom_smooth()
8 Some further plotting options; introducing factors
8.1 Smoothing and regression lines
In Chapter 7 we learned about aesthetic mappings and various geom_
options such as geom_point
, geom_histogram
, and geom_boxplot
. Let us explore another type of geom_
, which approximates the trend of a set of data points with a line and an error bar that shows the confidence interval of the estimate at each point:
While such fits are occasionally useful, we often want a linear least-squares regression on our data. To get such a linear fit, add the argument method = lm
to geom_smooth()
(lm
stands for “linear model”):
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
geom_point() +
geom_smooth(method = lm)
Linear regression lines are usually shown without the confidence intervals (the gray band around the regression line). To drop this, set se = FALSE
:
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
What happens if we color the data points by species? Let us add colour = Species
to the list of aesthetic mappings:
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
Now the regression line is automatically fitted to the data within each of the groups separately—a highly useful behavior.
Notice also that a color legend was automatically created and appended to the right of the graph. This legend positioning is the default in ggplot2
. You can move the legend to another position by specifying the legend.position
option within the theme
function that can be added onto the plot:
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position = "left") # Or: "top", "bottom", "right", "none"
Specifying legend.position = "none"
omits the legend altogether.
A word of caution: in case the legend positioning is matched with a generic theme such as theme_bw()
, one should put the legend position after the main theme definition. The reason is that pre-defined themes like theme_bw()
override any specific theme options you might specify. The rule of thumb is: any theme()
component to your plot should be added only after the generic theme definition. Otherwise the theme()
component will be overridden and will not take effect. For example, this does not work as intended:
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position = "left") + # Position legend at the left
theme_bw() # Define general theme - and thus override the line above...
But this one does:
|>
iris ggplot(aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme_bw() + # This defines the general theme
theme(legend.position = "left") # Now override default legend positioning
8.2 Scales
The aesthetic mappings of a graph (x-axis, y-axis, color, fill, size, shape, alpha, …) are automatically rendered into the displayed plot, based on certain default settings within ggplot2
. These defaults can be altered, however. Consider the following bare-bones plot:
|>
iris ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot()
We can now change, for example, how the y-axis is displayed. The component to be added to the plot is scale_y_continuous()
. Here scale
means we are going to change the scaling of some aesthetic mapping, y
refers to the y-axis (as expected, it can be replaced with x
, colour
, fill
, etc.), and continuous
means that the scaling of the axis is not via discrete values (e.g., either 1 or 2 or 3 but nothing in between), but continuous (every real number is permissible along the y-axis). The plot component scale_y_continuous()
takes several arguments; take a look at its help page to see all possible options. Here we mention a few of them. First, there is the name
option, which is used to relabel the axis. The limits
argument accepts a vector of two values, containing the lower and upper limits of the plot. If any of them is set to NA
, the corresponding limit will be determined automatically. Next, the breaks
argument controls where the tick marks along the axis go. It is given as a vector, with its entries corresponding to the y-coordinates of the tick marks. Finally, labels
determines what actually gets written on the axis at the tick mark points—it is therefore also a vector, its length matching that of breaks
.
As an example, let us scale the y-axis of the previous graph in the following way. The axis label should read “Petal length [cm]”, instead of the current “Petal.Length”. It should go from 0 to 7, with a break at those two values and also halfway in between at 3.5. Here is how to do this:
|>
iris ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot() +
scale_y_continuous(name = "Petal length [cm]",
limits = c(0, 7),
breaks = c(0, 3.5, 7))
What should we do if, for some reason, we would additionally like the “3.5” in the middle to be displayed as “7/2” instead (an exact value)? In that case, we can add an appropriate labels
option as an argument to scale_y_continuous
:
|>
iris ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot() +
scale_y_continuous(name = "Petal length [cm]",
limits = c(0, 7),
breaks = c(0, 3.5, 7),
labels = c("0", "7/2", "7"))
The x-axis can be scaled similarly. One important difference though is that here the x-axis has a discrete scale. Since we are displaying the species along it, any value must be either setosa or versicolor or virginica; it makes no sense to talk about what is “halfway in between setosa and versicolor”. Therefore, one should use scale_x_discrete()
. Its options are similar to those of scale_x_continuous()
. For instance, let us override the axis label, spelling out that the three species belong to the genus Iris:
|>
iris ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot() +
scale_y_continuous(name = "Petal length [cm]",
limits = c(0, 7),
breaks = c(0, 3.5, 7),
labels = c("0", "7/2", "7")) +
scale_x_discrete(name = "Species (genus: Iris)")
Alternatively, one could also redefine the labels and get an equally good graph:
|>
iris ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot() +
scale_y_continuous(name = "Petal length [cm]",
limits = c(0, 7),
breaks = c(0, 3.5, 7),
labels = c("0", "7/2", "7")) +
scale_x_discrete(labels = c("Iris setosa", "Iris versicolor",
"Iris virginica"))
In case you would like to display the species names in italics, as is standard requirement when writing binomial nomenclature, feel free to add theme(axis.text.x = element_text(face = "italic"))
to the end of the plot. We will not be going into more detail on tweaking themes, but feel free to explore the possibilities by looking at the help
pages or Googling them.
Other aesthetic mappings can also be adjusted, such as colour
, fill
, size
, or alpha
. One useful way to do it is through scale_colour_manual()
, scale_fill_manual()
, and so on. These are like scale_colour_discrete()
, scale_fill_discrete()
etc., except that they allow one to specify a discrete set of values by hand. Let us do this for color and fill:
|>
iris ggplot(aes(x = Species, y = Petal.Length,
colour = Species, fill = Species)) +
geom_boxplot(alpha = 0.2) +
scale_y_continuous(name = "Petal length [cm]",
limits = c(0, 7),
breaks = c(0, 3.5, 7),
labels = c("0", "7/2", "7")) +
scale_x_discrete(labels = c("Iris setosa", "Iris versicolor",
"Iris virginica")) +
scale_colour_manual(values = c("steelblue", "goldenrod", "forestgreen")) +
scale_fill_manual(values = c("steelblue", "goldenrod", "forestgreen"))
We used the built-in color names "steelblue"
, "goldenrod"
, and "forestgreen"
above. A useful R color cheat sheet can be found here, for more options and built-in color names.
8.3 Reordering labels using factors
In the previous examples, the order of text labels was always automatically determined. The basic rule in R is that the ordering follows the alphabet: the default is for setosa
to precede versicolor
, which will be followed by virginica
. This default ordering can be inconvenient, however. Consider the data file temp-Lin.csv
of average temperatures per month, measured in the town of Linköping, Sweden:
read_csv("temp-Lin.csv")
# A tibble: 12 × 2
month temp_C
<chr> <dbl>
1 Jan -1.9
2 Feb -1.7
3 Mar 0.9
4 Apr 6.3
5 May 11.5
6 Jun 15.4
7 Jul 17.9
8 Aug 16.6
9 Sep 12.8
10 Oct 7.5
11 Nov 3.4
12 Dec 0.1
The table has two columns: month
and temp_C
, giving the mean temperature in each month across years 1991-2021.1 So far so good. However, if we plot this with months along the x-axis and temperature along the y-axis, we run into trouble because R displays items by alphabetical instead of chronological order:
read_csv("temp-Lin.csv") |>
ggplot(aes(x = month, y = temp_C)) +
geom_point(colour = "steelblue") +
scale_y_continuous(name = "average temperature (Celsius)") +
theme_bw()
To fix this, one must convert the type of month
from a simple vector of character strings to a vector of factors. Factors are categorical variables (i.e., take on well-defined distinct values instead of varying on a continuous scale like double-precision numbers), but with an extra attribute which determines the order of those values. This ordering is often referred to as the levels of the factor. The first of the values has level 1, the next one level 2, and so on.
One very convenient way of assigning factor levels is through the tidyverse
function as_factor
.2 This function takes a vector of values and, if the values are numeric, assigns them levels based on those numerical values. However, if the values are character strings, then the levels are assigned in order of appearance within the vector. This is perfect for us, because the months are in proper order already within the tibble:
read_csv("temp-Lin.csv") |>
mutate(month = as_factor(month)) |>
ggplot(aes(x = month, y = temp_C)) +
geom_point(colour = "steelblue") +
scale_y_continuous(name = "average temperature (Celsius)") +
theme_bw()
It is also possible to take a factor and reassign its levels manually. This can be done with the fct_relevel
function:
read_csv("temp-Lin.csv") |>
mutate(month = fct_relevel(month, "Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) |>
ggplot(aes(x = month, y = temp_C)) +
geom_point(colour = "steelblue") +
scale_y_continuous(name = "average temperature (Celsius)") +
theme_bw()
Also, the use of fct_relevel
need not be this laborious. If all we want to do is place a few factor levels to be first ones without changing any of the others, it is possible to enter just their names. Often, for example, a factor column holds various experimental treatments, one of which is called "control"
. In that case, all we might want to do is to make the control be the first factor level, without altering any of the others. If treatment
is the name of the vector (or column in a data frame) that holds the different experimental treatment names, then this can be done with fct_relevel(treatment, "control")
.
8.4 Facets
Plots can be faceted (subplots created and arranged in a grid layout) based on some variable or variables. For instance, let us create histograms of petal lengths in the iris
dataset, like we did last time:
|>
iris ggplot(aes(x = Petal.Length)) +
geom_histogram()
This way, one cannot see which part of the histogram belongs to which species. One fix to this is to color the histogram by species—this is what we have done before. Another is to separate the plot into three facets, each displaying data for one of the species only:
|>
iris ggplot(aes(x = Petal.Length)) +
geom_histogram() +
facet_grid(. ~ Species)
The component facet_grid(x ~ y)
means that the data will be grouped based on columns x
and y
, with the distinct values of column x
making up the rows and those of column y
the columns of the grid of plots. If one of them is replaced with a dot (as above), then that variable is ignored, and only the other variable is used in creating a row (or column) of subplots. So, to display the same data but with the facets arranged in one column instead of one row, we simply replace facet_grid(. ~ Species)
with facet_grid(Species ~ .)
:
|>
iris ggplot(aes(x = Petal.Length)) +
geom_histogram() +
facet_grid(Species ~ .)
In this particular case, the above graph is preferable to the previous one, because the three subplots now share the same x-axis. This makes it easier to compare the distribution of petal lengths across the species.
To illustrate how to make a two-dimensional grid of facets, let us put the iris
dataset in tidy form using pivot_longer()
:
as_tibble(iris) |>
pivot_longer(cols = !Species,
names_to = "Trait",
values_to = "Measurement")
# A tibble: 600 × 3
Species Trait Measurement
<fct> <chr> <dbl>
1 setosa Sepal.Length 5.1
2 setosa Sepal.Width 3.5
3 setosa Petal.Length 1.4
4 setosa Petal.Width 0.2
5 setosa Sepal.Length 4.9
6 setosa Sepal.Width 3
7 setosa Petal.Length 1.4
8 setosa Petal.Width 0.2
9 setosa Sepal.Length 4.7
10 setosa Sepal.Width 3.2
# ℹ 590 more rows
(We specified the columns for tidying via !Species
, meaning all columns except Species
are selected.) As seen, now the Measurement
in every row is characterized by two other variables: Species
and Trait
(i.e., whether the given value refers to the sepal length, petal width etc. of the given species). We can create a histogram of each measured trait for each species now, in a remarkably simple way:
as_tibble(iris) |>
pivot_longer(cols = c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width),
names_to = "Trait",
values_to = "Measurement") |>
ggplot(aes(x = Measurement)) +
geom_histogram() +
facet_grid(Species ~ Trait)
8.5 Saving plots
To save the most recently created ggplot figure, simply type
ggsave(filename = "graph.pdf", width = 4, height = 3)
Here filename
is the name (with path and extension) of the file you want to save the figure into. The extension is important: by having specified .pdf
, the system automatically saves the figure in PDF format. To use, say, PNG instead:
ggsave(filename = "graph.png", width = 4, height = 3)
PDF is a vectorized file format: the file contains the instructions for generating the plot elements instead of a pixel representation of the image. Consequently, PDF figures are arbitrarily scalable, and are therefore the preferred way of saving and handling scientific graphs.
The width
and height
parameters specify, in inches, the dimensions of the saved plot. Note that this also scales some other plot elements, such as the size of the axis labels and plot legends. This means you can play with the width
and height
parameters to save the figure at a size where the labels are clearly visible without being too large.
In case you would like to save a figure that is not the last one that was generated, you can specify the plot
argument to ggsave()
. to do so, first you should assign a plot to a variable. For example:
<- iris |> # Assign the ggplot object to the variable p
p ggplot(aes(x = Petal.Length)) +
geom_histogram()
and then
ggsave(filename = "graph.pdf", plot = p, width = 4, height = 3)
8.6 Exercises
Let us revisit the data of Fauchald et al. (2017) which we used in Section 7.4, tracking the population size of various herds of caribou in North America over time and correlating population cycling with the amount of vegetation and sea-ice cover. Using the file sea_ice.tsv
(sea ice cover per year and month for each caribou herd), do the following:
One exercise from Section 7.4 was to plot
Year
along the x-axis andMonth
along the y-axis, with color tile shading indicating the level of ice cover for the herd labeledWAH
in each month-year pair (usinggeom_tile
). The resulting graph had an obvious weakness: the months along the y-axis were not in proper chronological order. Fix this problem by converting theMonth
column from character strings to factors whose levels go from January to December.Create a similar plot, but do not filter for one single herd. Instead, have each herd occupy a different facet (sub-plot). Do this with the different herds making up a single row of facets, and then a single column of them.
Since there are 11 different herds altogether, neither the row-wise nor the column-wise arrangement of the facets above is very satisfying: there are too many of them next to one another. Try re-plotting the results, but instead of
facet_grid
, usefacet_wrap(~ Herd)
which allows one to lay out the facets in several rows. (As usual, look up the help pages offacet_wrap
if needed.)
The remaining exercises use the Galápagos land snail data (Section 4.2.2).
Create a plot with standardized
size
along the x-axis, standardizedshape
along the y-axis, each individual represented by a point colored byspecies
, and with two facets corresponding to humid and aridhabitat
types. The facets should be side by side. How does this figure influence the interpretation you had in Section 7.4, exercise 15? That is: does the splitting of communities based on habitat type increase or decrease the overlap between different species?Re-create the previous plot with the two side-by-side facets, but in reverse order: the
humid
facet should be on the left andarid
on the right. (Hint: converthabitat
to factors!)
Data from https://climate-data.org.↩︎
There also exists a similarly-named function called
as.factor
, in addition toas_factor
. It is the base R version of the same functionality. As usual, thetidyverse
version offers improvements over the original, so it is recommended not to useas.factor
at all, relying on justas_factor
instead.↩︎