library(tidyverse) # Loading the tidyverse, before doing anything else
15 Higher-order functions and mapping
15.1 Introduction
In Chapter 3 we learned how to create user-defined functions. An example was provided in Section 12.2, where we made our life easier by eliminating the need to always call aov before performing a Tukey test with TukeyHSD. Without the function, we must write:
lm(weight ~ group, data = PlantGrowth) |> aov() |> TukeyHSD()
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lm(weight ~ group, data = PlantGrowth))
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
Instead, we can write a simple function that takes a linear model fit object as its input and produces the Tukey test as its output:
tukeyTest <- function(modelFit) modelFit |> aov() |> TukeyHSD()
Make sure to review Chapter 3, especially Section 3.1.1, if you need a refresher on how to define functions in R.
Using this function, we can now simply write:
lm(weight ~ group, data = PlantGrowth) |> tukeyTest()
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = modelFit)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
Another useful function we can write helps us use statistical procedures within a pipeline. As we have seen, most statistical functions take a formula as their first argument and the data as their second (and then they may take further, method-specific arguments as well, like conf.int in the function wilcox.test). Since the pipe operator |> is often used in conjunction with functions like select, mutate, pivot_longer, summarise, etc., which all return a data frame, it would be convenient to reverse the order of arguments in all statistical functions, with the data coming first and the formula coming second. In fact, such a function is easy to write. We could call it tidystat:
tidystat <- function(data, formula, method) method(formula, data)
Here method is the statistical function we wish to use. For example:
PlantGrowth |>
  filter(group != "ctrl") |>
  tidystat(weight ~ group, wilcox.test)
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
This is now fully equivalent to:
PlantGrowth |>
  filter(group != "ctrl") |>
  wilcox.test(weight ~ group, data = _)
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
We can add a further improvement to the tidystat function. As it is, it can only take the formula and the data as inputs, but not any other, function-specific arguments. In R, there is a way of passing arbitrary extra arguments, using the ellipsis (...). We can redefine tidystat this way:
tidystat <- function(data, formula, method, ...) {
  method(formula, data, ...) # The ... means "possibly more arguments"
}
And now, we can pass extra arguments that we could not have passed before. For instance, we can request confidence intervals from wilcox.test:
PlantGrowth |>
  filter(group != "ctrl") |>
  tidystat(weight ~ group, wilcox.test, conf.int = TRUE, conf.level = 0.99)
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
99 percent confidence interval:
-1.70 -0.03
sample estimates:
difference in location
-0.945
From now on, we can use tidystat for performing various statistical procedures:
library(FSA) # For the Dunn test
PlantGrowth |> tidystat(weight ~ group, lm) |> anova()
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.7663 1.8832 4.8461 0.01591 *
Residuals 27 10.4921 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
PlantGrowth |> tidystat(weight ~ group, lm) |> tukeyTest()
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = modelFit)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
PlantGrowth |> tidystat(weight ~ group, kruskal.test)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
PlantGrowth |> tidystat(weight ~ group, dunnTest)
Comparison Z P.unadj P.adj
1 ctrl - trt1 1.117725 0.26368427 0.26368427
2 ctrl - trt2 -1.689290 0.09116394 0.18232789
3 trt1 - trt2 -2.807015 0.00500029 0.01500087
15.2 Higher-order functions
If we think about the tidystat function above, something strange has happened. We are used to the idea of functions taking numbers, character strings, logical values, or even data frames as their arguments. What we have not paid much attention to before is what happens if the input to a function is itself another function. Yet this is precisely what tidystat does: its method argument is a function object.
In R, this is perfectly legal, and its use and interpretation is every bit as natural as it was in tidystat. Functions which take other functions as arguments are often called higher-order functions.1 To emphasize again: there really is nothing special about such functions, and they can be used in much the same way as “ordinary” functions.
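To make the idea concrete before moving on, here is a toy sketch of a higher-order function (the name applyTwice is made up for illustration; it is not part of base R or the tidyverse):
# A higher-order function: takes a function f and a value x, and applies f twice
applyTwice <- function(f, x) f(f(x))

applyTwice(sqrt, 16) # sqrt(sqrt(16)) = sqrt(4) = 2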
One very natural example for a higher-order function is integration. An integral (at least in simple cases) takes three inputs: a function to integrate, a lower limit of integration, and an upper limit of integration. The output is the (sign-weighted) area under the function’s curve, evaluated between the lower and upper limits. This is stressed even in the usual mathematical notation for integrals: when we write \[ \int_0^1 x^2 \,\text{d} x = \frac{1}{3}\] (a true statement), we show the lower and upper limits of 0 and 1 at the bottom and top of the integral sign, and the function to be integrated (in this case, \(f(x) = x^2\)) in between the integral sign and \(\text{d} x\).
If you do not know how integrals do their magic, there is no need to worry, because R has a built-in function called integrate to do the calculations for you. integrate takes the three arguments described above: the function to integrate, and the lower and upper limits of integration. To perform the above integral, we can write:
# The squaring function: sqr(2) returns 4, sqr(4) returns 16, etc.
sqr <- function(x) x^2
# Perform the integral between 0 and 1:
integrate(sqr, 0, 1)
0.3333333 with absolute error < 3.7e-15
The answer is indeed one-third.2 But there was no obligation to use the square function above. We could have used any other one. For instance, to compute the integral of the cosine function \(\cos(x)\) between \(0\) and \(2\pi\), we can type:
integrate(cos, 0, 2*pi)
4.359836e-16 with absolute error < 4.5e-14
We get the expected result of zero, within numerical error.
One thing to know about function objects like sqr is that they do not need a name to be used. In the definition sqr <- function(x) x^2, we assigned the function object function(x) x^2 to the symbol sqr, so we wouldn't have to write it out all the time. But since sqr is just a name that stands for function(x) x^2, calling (function(x) x^2)(4) is the same as calling sqr(4), both returning 16. If a function is used only once within another (higher-order) function, then we might not wish to bother with naming it separately. Thus, the following is exactly equivalent to integrating the sqr function:
integrate(function(x) x^2, 0, 1)
0.3333333 with absolute error < 3.7e-15
Functions without names are often called anonymous functions. They are commonly used within other, higher-order functions. Their use is not mandatory: it is always possible to first define the function with a name, and then use that name instead (e.g., using sqr instead of function(x) x^2, after defining sqr <- function(x) x^2). However, they can be convenient, and it is also important to recognize them in R code written by others.
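As a quick sketch of the equivalence described above:
sqr(4)               # returns 16, using the named function from before
(function(x) x^2)(4) # the same computation in anonymous form; also returns 16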
15.3 The map family of functions
The purrr package is a standard, automatically-loaded part of the tidyverse. It contains a large family of mapping functions. These allow one to perform repetitive tasks by applying the same function to all elements of a vector, list, or column in a data frame.
To illustrate their use, how would we obtain the squares of all integers from 1 to 10? Using our earlier sqr function, we could painstakingly type out sqr(1), then sqr(2), and so on, up until sqr(10) (we ought to be grateful that the task was to obtain the squares of the first ten integers, instead of the first ten thousand). But there is no need to do this, as this is exactly what map can do. map takes two arguments: some data (e.g., a vector of values), and a function. It then applies that function to all data entries. So a much quicker way of obtaining the squares of all integers from 1 to 10 is this:
map(1:10, sqr)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
[[6]]
[1] 36
[[7]]
[1] 49
[[8]]
[1] 64
[[9]]
[1] 81
[[10]]
[1] 100
Or, in case we prefer anonymous functions and do not want to bother with defining our own sqr routine:
map(1:10, function(x) x^2)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
[[6]]
[1] 36
[[7]]
[1] 49
[[8]]
[1] 64
[[9]]
[1] 81
[[10]]
[1] 100
The results are all there, although they are prefaced by double-bracketed indices [[1]], [[2]], and so on. You may recall from Section 14.2.1 that this is the notation used to reference the entries of lists. That is correct: map returns a list of values, not a vector. We will see momentarily that this can be very useful behavior, but here it can feel like overkill. Fortunately, it is easy to get back a vector instead of a list. Since the entries of vectors must have a well-defined, uniform type (numeric, character string, logical, etc.), we have to tell R what kind of result we want. In our case, we want numeric results. The function to do this is called map_dbl (“map into double-precision numerical values”). It can be used just like map; the only difference between the two is that the output type changes from list to numeric vector:
map_dbl(1:10, function(x) x^2)
[1] 1 4 9 16 25 36 49 64 81 100
Similarly, there are functions map_chr, map_lgl, and some others, which create vectors of the appropriate type. For example, to append the suffix “-ing” to various verbs, we can do:
c("attend", "visit", "support", "help", "savour") |>
map_chr(function(text) paste0(text, "ing"))
[1] "attending" "visiting" "supporting" "helping" "savouring"
Interestingly, we could also try
map_chr(1:10, function(x) x^2)
Warning: Automatic coercion from double to character was deprecated in purrr 1.0.0.
ℹ Please use an explicit call to `as.character()` within `map_chr()` instead.
[1] "1.000000" "4.000000" "9.000000" "16.000000" "25.000000"
[6] "36.000000" "49.000000" "64.000000" "81.000000" "100.000000"
and see that, although the computations were performed correctly, the output was converted from numbers to character strings encoding those numbers.
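Following the advice printed in the warning, the explicit conversion would look like this (a sketch):
# Convert to character explicitly inside the mapped function:
map_chr(1:10, function(x) as.character(x^2))
Note that as.character formats whole numbers without decimal places (e.g., "4" rather than "4.000000"), so the output differs slightly from the automatic coercion shown above.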
15.4 Making use of map when vectorization fails
One perhaps obvious criticism of map as we have used it is that it was not really needed. Early on, we learned that when simple functions are applied to a vector of values, they get applied element-wise. This is called vectorization, and it is a very useful property of R. So (1:10)^2, in our case, achieves the same thing as map_dbl(1:10, function(x) x^2). Similarly, if we simply write cos(1:100), we get the cosine of all integers between 1 and 100 without having to type out map_dbl(1:100, cos).
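Indeed, a quick check confirms that the two routes give the same numbers:
(1:10)^2                        # vectorized: 1 4 9 16 25 36 49 64 81 100
map_dbl(1:10, function(x) x^2)  # the same ten squares, via map_dbl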
So why bother with map then? The answer is that map can handle cases where vectorization is not available. Most simple functions in R are vectorized, but there are plenty of non-vectorizable operations. To give an example, let us start from a simple dataset: the PlantGrowth table we looked at before, but without the control ctrl group. This leaves just the two treatment groups trt1 and trt2:
plantTrt <- filter(PlantGrowth, group != "ctrl")
print(plantTrt)
weight group
1 4.81 trt1
2 4.17 trt1
3 4.41 trt1
4 3.59 trt1
5 5.87 trt1
6 3.83 trt1
7 6.03 trt1
8 4.89 trt1
9 4.32 trt1
10 4.69 trt1
11 6.31 trt2
12 5.12 trt2
13 5.54 trt2
14 5.50 trt2
15 5.37 trt2
16 5.29 trt2
17 4.92 trt2
18 6.15 trt2
19 5.80 trt2
20 5.26 trt2
We might want to perform a Wilcoxon rank sum test on these data, but with a number of different confidence levels. A naive approach would be to supply the required confidence levels as a vector:
wilcox.test(weight ~ group, data = plantTrt,
conf.int = TRUE, conf.level = c(0.8, 0.9, 0.95, 0.99))
Error in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): 'conf.level' must be a single number between 0 and 1
This generates an error: because of the way wilcox.test is internally implemented in R, it does not allow or recognize a vector input in place of conf.level. It must be a single number instead. This, however, can be overcome if we just use map:
c(0.8, 0.9, 0.95, 0.99) |> # The vector of confidence levels
map(function(confLevel) wilcox.test(weight ~ group, data = plantTrt,
conf.int = TRUE, conf.level = confLevel))
[[1]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
80 percent confidence interval:
-1.33 -0.56
sample estimates:
difference in location
-0.945
[[2]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
-1.43 -0.44
sample estimates:
difference in location
-0.945
[[3]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-1.50 -0.31
sample estimates:
difference in location
-0.945
[[4]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
99 percent confidence interval:
-1.70 -0.03
sample estimates:
difference in location
-0.945
Notice that we used map and not map_dbl or map_chr, because wilcox.test returns a complex model object which cannot be coerced into a vector. This is precisely when map, which generates a list whose entries can be arbitrary, is especially useful. (Try the above with map_dbl; it will throw an error.) As a final comment, it would of course have been possible to define a function separately, instead of using the anonymous function above:
wilcoxConf <- function(confLevel) {
  wilcox.test(weight ~ group, data = plantTrt,
              conf.int = TRUE, conf.level = confLevel)
}
c(0.8, 0.9, 0.95, 0.99) |> map(wilcoxConf)
[[1]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
80 percent confidence interval:
-1.33 -0.56
sample estimates:
difference in location
-0.945
[[2]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
-1.43 -0.44
sample estimates:
difference in location
-0.945
[[3]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-1.50 -0.31
sample estimates:
difference in location
-0.945
[[4]]
Wilcoxon rank sum exact test
data: weight by group
W = 16, p-value = 0.008931
alternative hypothesis: true location shift is not equal to 0
99 percent confidence interval:
-1.70 -0.03
sample estimates:
difference in location
-0.945
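Once the results are collected in a list, the typed map variants become useful again for pulling a single number out of each element. A small sketch, reusing wilcoxConf from above (the conf.int component of the returned test object holds the two confidence bounds):
results <- c(0.8, 0.9, 0.95, 0.99) |> map(wilcoxConf)
# Extract the lower confidence bound from each test result:
map_dbl(results, function(res) res$conf.int[1])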
15.5 Exercises
- Start from the following sequence of values:
  values <- seq(-5*pi, 5*pi, by = 0.01)
  (a vector with values going from \(-5\pi\) to \(5\pi\), in steps of 0.01). Now do the following:
  - Define a new vector, x, which contains the cosine (cos) of each value in values. Use map_dbl.
  - Now define another vector, y, and again using map_dbl, compute the function sin(t) + cos(14 * t / 5) / 5 for every value t in values. You can either define this function separately to use inside map_dbl, or create it anonymously.
  - Finally, create a tibble whose two columns are x and y, and plot them against each other using geom_path. See what you get!
- How does the integral of \(\cos(x)\) change if the lower limit of integration is fixed at zero, but the upper limit gradually increases from \(0\) to \(2\pi\)? Define a sequence of upper limits upper <- seq(0, 2*pi, by = 0.1). Then, using map_dbl, create a vector integrals whose entries are the integral of \(\cos(x)\) from zero to each upper limit. Finally, plot integrals against upper, using geom_point or geom_line. What is the function you see? (Note: the integrate function returns a complicated list object instead of just a single number. To access just the value of the integral, you can use integrate(...)$value, where ... means all the arguments to the function you are supposed to write when solving the problem; see the short snippet after these exercises.)
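As a quick illustration of that note (the upper limit of 1 here is chosen arbitrarily, purely as an example):
integrate(cos, 0, 1)$value # just the numeric value, without the error estimate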
1. Functions that return a function as their output are also called higher-order functions. For an example which both takes functions as arguments and produces a function as its output, check out the compose function from the purrr package (part of the tidyverse).
2. The error is included in the output because R’s integration routine is purely numeric, so it algorithmically approximates the integral, and such procedures always have finite precision.