4  Reading tabular data from disk

4.1 The tidyverse package suite

A suite of R packages, sharing the same design philosophy, are collected under the name tidyverse. The tidyverse describes itself as “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” The packages in this ecosystem work well together, making it easy to combine their elements to perform a wide range of data analysis tasks. In case it is not yet installed on your computer, type

install.packages("tidyverse")

at the R console and press Enter. After the package is installed, we can load it via the function call

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Looking at the message generated by executing the above line, we see that nine packages are now loaded.1 They are called ggplot2, tibble, and so on. We will get to know these in more detail throughout the book.

There are even more packages that are part of the tidyverse. Typing and executing tidyverse_packages() will show all such packages. Of all these, only the nine above are loaded by default when invoking library(tidyverse). The others must be loaded separately. For example, readxl is a tidyverse package for loading Excel files in R. To use it, run library(readxl).

Tip

In general, it is a good idea to load all necessary packages at the top of your R script, instead of loading them wherever the need to use them first arises. There are two reasons for this. First, if you close and then later reopen RStudio, the packages do not get automatically reloaded—one must execute the calls to library all over again. Second, often other users will run the scripts you write on their own computers, and they will not be able to do so unless the proper packages are loaded first. It is then helpful to others if the necessary packages are all listed right at the top, showing what is needed to run your program.
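As a sketch of what this looks like in practice, a script might begin as follows (the particular packages listed are just an example of what one analysis might require):

```r
# Load all required packages at the top of the script:
library(tidyverse)  # core data manipulation and plotting tools
library(readxl)     # reading Excel files (not attached by library(tidyverse))

# ...the rest of the analysis follows below...
```

Anyone receiving this script can see at a glance which packages must be installed before running it.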

4.2 Data file formats

One of the packages loaded by default with the tidyverse is called readr. This package contains tools for loading data files and writing them to disk. We will explore how it works using an example dataset. Before turning to the data however, it is worth mentioning a few things about data file formats.

Tabular data (i.e., data which can be organized into a table with rows and named columns) may be stored in many different forms. By far the most popular choice is to use Excel spreadsheets. While these may have many good properties, they are decidedly not recommended for scientific use. Instead, in science we strive to rely on plain text files to store our data. Plain text files have several advantages. First, they can be read by any machine without the need for special or proprietary software (such as Excel). Second, they will never go obsolete: one will always be able to open plain text files on any machine. By contrast, there is no guarantee that an Excel file saved today will still open ten years from now in future versions of Excel. And third, file formats such as Excel’s .xls and .xlsx contain a lot of metadata about formatting, font types, locale, etc. that are obscured from the user, but which have the potential to get in the way when performing scientific analyses. By contrast, plain text files have nothing up their sleeve: what you see is exactly what they contain; no more, no less. For these reasons, this book will emphasize workflows that rely on data stored in plain text files. However, due to the popularity of Excel, we will also learn how to read data from Excel files into R (Section 4.6), even though their use is otherwise not recommended.

4.2.1 The CSV file format

How can one store data in plain text files? There are various solutions, but the one we will be relying on is called a delimited file. These files contain information as in a spreadsheet, with each row of the table on its own line. The column entries are separated by some delimiter—a character that marks the end of the information in one column and the start of the next. As an example, let us consider some actual data on land snail species from the Galápagos Islands.2 The data file is available by clicking on this link: island-FL.csv.3 Set the working directory in RStudio to the folder where you have saved it. As a reminder, one can do this by executing setwd("/path/to/files"), where one should substitute in one’s own path in place of /path/to/files. Then one can open island-FL.csv in RStudio by clicking on it in the Files panel in the lower right part of the RStudio window, and then choosing the option “View file” (ignore the other option called “Import dataset…”). Having done this, a new tab opens in the editor panel (upper left region) where something like the following should appear:

habitat,species,size,shape
humid,ustulatus,17.088,-0.029
humid,ustulatus,20.06,-0.001
humid,ustulatus,16.34,0.014
arid,calvus,13.734,-0.043
humid,nux,21.898,-0.042
humid,ustulatus,16.848,-0.023
humid,ustulatus,19.162,0.014
humid,ustulatus,16.017,0.042
arid,galapaganus,18.894,0.011
humid,nux,26.59,0

And so on. Here the first row does not contain data, but instead contains the names of the different columns. Both the names in the first row and the data in subsequent ones are separated by a comma, which is our delimiter separating information belonging in different columns. This also explains the strange-looking extension .csv of the file: it stands for comma-separated values. Comma-separated value files are among the most widely used data files in science, and we will be relying on them frequently throughout the book.

4.2.2 The Galápagos land snail data

It is worth explaining what the dataset in the file island-FL.csv represents, as we will be using it repeatedly in subsequent examples. Each row contains information on one land snail individual sampled from the Galápagos Islands. To simplify the data, the individuals have been restricted to just those that are from Floreana Island in the Galápagos. The columns in the data are:

  • habitat: This can be either “arid” or “humid”.
  • species: Which species the individual belongs to. The genus is always Naesiotus, from the family Bulimulidae. The seven species in the data are Naesiotus calvus, N. galapaganus, N. invalidus, N. nux, N. rugulosus, N. unifasciatus, and N. ustulatus.
  • size: A measurement of the absolute volume of the snail’s shell. It is given in units we are not concerned with here, but larger values correspond to larger shells. See Parent and Crespi (2009), Kraemer et al. (2018), and Kraemer et al. (2022) for more information.
  • shape: A measurement for the shell’s shape. Again, the units are not important for us here, but small values represent bulky and round shells, while large values represent long and slender shells.

Below are some pictures of these snails (courtesy of Dr. Christine E. Parent):

Naesiotus ustulatus – a land snail from Floreana Island, Galápagos. Photo credits: Dr. Christine E. Parent.

N. ochsneri – a land snail from Santa Cruz Island, Galápagos. Photo credits: Dr. Christine E. Parent.

N. chemnitzioides – a land snail from San Cristóbal Island, Galápagos. Photo credits: Dr. Christine E. Parent.

Shell morphology is an important indicator of microhabitat specialization in these snails: long, slender shells are an adaptation to arid environments due to their better surface-to-volume ratio (see the picture of N. chemnitzioides above for an example), whereas round, bulky shells are better suited to humid environments.

4.2.3 Tabulator-separated and other delimited files

Above we have discussed comma-separated files. Another type of file uses tabulators instead of commas as the column separator. These are called tab-separated value (TSV) files. Just like CSV files, TSV files are also very widely used. An example is provided by the file island-FL.tsv. The data in this file are identical to those in island-FL.csv. The only difference is that the commas are replaced by tabulators. Opening the file in RStudio, the first few lines look like this:

habitat species size    shape
humid   ustulatus   17.088  -0.029
humid   ustulatus   20.06   -0.001
humid   ustulatus   16.34   0.014
arid    calvus  13.734  -0.043
humid   nux 21.898  -0.042
humid   ustulatus   16.848  -0.023
humid   ustulatus   19.162  0.014
humid   ustulatus   16.017  0.042
arid    galapaganus 18.894  0.011
humid   nux 26.59   0

Commas and tabulators are not the only possible delimiters however, and in principle any character could play that role. The only thing to be wary of is that choosing some “regular” character as the delimiter might create unintended side effects. For instance, if the character 4 is the delimiter, then a number such as 0.143 will no longer be interpreted as a single number, but as two numbers in separate columns (0.1 and 3, respectively). For this reason, only those characters should be used as delimiters which we can be certain will never be confused with the actual data.
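To see this pitfall in action, one can hand read_delim a literal string instead of a file name by wrapping it in I() (a readr feature). The column names x and y below are made up for this illustration:

```r
library(readr)

# With "4" as the (ill-chosen) delimiter, the single value 0.143
# is split into two columns, holding 0.1 and 3 respectively:
read_delim(I("x4y\n0.143\n"), delim = "4")
```

This is a toy example, of course; no sane file format would use a digit as a delimiter, which is precisely the point.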

To give an example: it is perfectly possible and legal (even if not common) to use a “percent-separated value” representation where the symbol % is the delimiter. Here is what that would look like with the same land snail data (one can download it from island-FL.psv):

habitat%species%size%shape
humid%ustulatus%17.088%-0.029
humid%ustulatus%20.06%-0.001
humid%ustulatus%16.34%0.014
arid%calvus%13.734%-0.043
humid%nux%21.898%-0.042
humid%ustulatus%16.848%-0.023
humid%ustulatus%19.162%0.014
humid%ustulatus%16.017%0.042
arid%galapaganus%18.894%0.011
humid%nux%26.59%0
Warning

The common file extensions .csv and .tsv are useful and suggestive: they can indicate that the data are comma- or tab-separated. However, one must bear in mind that these are mere naming conventions which are not enforced in any way. That is, while not necessarily a good idea, it is perfectly possible to save comma-separated data in a file called mydata.tsv, or tab-separated data in mydata.csv. Furthermore, plain text files often have extensions such as .txt (“text”) or .dat (“data”), neither of which reveals what form the data were stored in. To be absolutely sure, one must open the file in a text editor such as RStudio’s top left editor panel, and see what was used as a delimiter. If you open a file like this, make absolutely sure not to modify and save it accidentally. Doing so compromises your hard-won data, which then no longer accurately capture your true observations.
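A safer way to inspect a file's delimiter, with no risk of accidental edits, is to print its first few lines at the console using base R's readLines function:

```r
# Print the first three lines of the file as raw character strings;
# the delimiter is plainly visible inside the quotes:
readLines("island-FL.csv", n = 3)
```

Since this only reads the file, the data on disk cannot be altered by mistake.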

4.3 The tibble data structure

The above raw formats are not yet amenable to processing within R. To make it so, we first need to import them. For delimited files there is a convenient function, read_delim, which makes this especially simple:4

read_delim("island-FL.csv", delim = ",")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

The function read_delim takes two inputs: the name of the file (with a path if needed), and delim, which is the delimiter used to separate the columns in the file. We will discuss how it works in greater detail in Section 4.4.

Look at the output that was produced by read_delim("island-FL.csv", delim = ",") above, starting with # A tibble: 223 × 4. A tibble (or data frame5) is the R-equivalent of an Excel-style spreadsheet. In this case, it has 223 rows and 4 columns (hence the 223 × 4). The way to conceive of a tibble is as a collection of vectors, each arranged in a column, glued together side-by-side to form a table of data. Importantly, although each vector must consist of entries of the same type as usual (e.g., they can be vectors of numbers, vectors of strings, or vectors of logical values), the different columns need not share types. For example, in the above table, the first and second columns consist of character strings but the third and fourth ones consist of numerical values. This can be seen right below the header information. Below habitat and species you can see <chr>, which stands for “character string”. Below size and shape we have <dbl> which, confusing as it may look at first, refers simply to ordinary numbers.6 In turn, columns consisting of logical values would have the tag <lgl> underneath them (in this case though, we don’t have such a column). The point is that by looking at the type information below the header, you can see at a glance how R has interpreted each of the columns.
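The “collection of vectors” idea can be made concrete by building a small tibble by hand, using the tibble function (the values below are made up purely for illustration):

```r
# Each named argument becomes one column; the columns may differ in type,
# but within a column all entries share the same type:
tibble(habitat = c("humid", "arid", "humid"),
       size    = c(17.1, 13.7, 21.9),
       sampled = c(TRUE, TRUE, FALSE))
```

Printing this tibble would show <chr>, <dbl>, and <lgl> under the three columns, respectively.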

The fact that the individual columns are simply vectors can be made explicit, by relying on the $-notation. To access a given column of the table as a vector, we write the name of the table, followed by the $ symbol, followed by the name of the column in question. To make the illustration easier, let us first assign the tibble to a variable called snailDat:

snailDat <- read_delim("island-FL.csv", delim = ",")

And now one can access e.g. the size column from the snailDat table as a vector of numbers like this:

snailDat$size
  [1] 17.088 20.060 16.340 13.734 21.898 16.848 19.162 16.017 18.894 26.590
 [11] 17.865 13.857 21.706 15.354 21.369 22.714 16.530 16.604 16.292 16.075
 [21] 18.154 21.025 12.895 25.046 23.227 17.284 16.701 23.515 21.903 17.314
 [31] 23.891 21.349 24.946 14.122 28.467 17.010 16.831 16.641 20.244 22.042
 [41] 18.665 18.680 18.949 16.933 20.399 22.907 18.265 24.269 20.704 16.553
 [51] 19.246 24.894 23.610 14.131 15.617 18.605 18.343 19.340 17.017 16.868
 [61] 17.814 16.311 17.066 15.386 23.167 21.272 16.799 17.383 15.178 17.195
 [71] 19.027 15.011 23.723 17.569 20.091 18.773 22.885 13.661 33.792 19.390
 [81] 19.513 18.246 17.637 19.865 18.812 17.903 16.614 28.810 27.675 31.672
 [91] 17.209 21.618 26.420 19.135 20.493 15.142 17.195 17.382 15.502 14.404
[101] 22.678 16.210 30.946 14.347 14.541 20.546 20.994 20.103 31.923 24.465
[111] 24.451 19.188 20.610 16.193 15.989 25.314 16.610 13.507 16.744 21.330
[121] 20.480 19.246 19.125 18.642 18.250 17.152 18.814 21.081 15.852 16.743
[131] 16.602 14.879 12.299 21.646 20.539 21.798 19.211 18.168 24.985 19.938
[141] 20.176 14.846 23.488 24.120 17.320 16.118 18.924 15.928 15.478 21.282
[151] 15.838 19.506 24.212 18.066 17.427 15.213 20.748 20.504 20.311 17.444
[161] 20.846 19.351 18.679 24.111 17.906 16.322 21.967 15.992 22.788 17.920
[171] 18.804 24.760 25.766 23.452 16.145 16.040 18.868 17.528 16.556 22.199
[181] 24.424 18.800 21.335 20.191 19.225 23.424 15.601 27.679 29.221 20.373
[191] 16.037 17.577 21.852 18.969 16.933 19.290 24.102 21.398 16.369 19.649
[201] 25.137 18.394 23.102 32.001 16.946 19.054 18.527 18.646 21.361 14.362
[211] 15.858 17.201 19.886 19.264 21.503 21.728 23.312 20.423 21.912 18.158
[221] 16.859 20.488 22.369

Here snailDat$size is really just a vector, and can be treated as such. For example, to get the 9th entry of this vector, we can use the usual bracket notation:

snailDat$size[9]
[1] 18.894

The result is an ordinary numerical value.
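And since snailDat$size is an ordinary numeric vector, all the usual vector functions apply to it. For example:

```r
# Summary statistics of shell size across all 223 snails:
mean(snailDat$size)   # average shell size
range(snailDat$size)  # smallest and largest size in the data
```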

Finally, let us take one more look at the output again:

print(snailDat)
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

When displaying large tibbles, R will not dump all the data at you. Instead, it will display the first 10 rows, with a message indicating how many more rows remain (in our case, # ℹ 213 more rows is written at the end of the printout). The system is still aware of the other rows; it just does not show them. To get a full view of a tibble in a more digestible, spreadsheet-like style, one can use the view function. Try running view(snailDat) and see what happens!
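If we want more (or fewer) than the default 10 rows at the console, print accepts an n argument when applied to tibbles:

```r
# Show the first 15 rows instead of the default 10;
# n = Inf would print every row:
print(snailDat, n = 15)
```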

4.4 Reading delimited files

We have seen that the function read_delim works by taking two inputs: the name of the file to read, and the delimiter character. So to read a comma-separated file, we can write

read_delim("island-FL.csv", delim = ",")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

Similarly, in case we want to read a percent-separated file such as island-FL.psv, all we need to do is change the delimiter:

read_delim("island-FL.psv", delim = "%")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

The output is exactly as before, since the file contained the same information (just represented differently due to the different delimiter).

To read a tab-separated file, remember not to simply press the tabulator key between the quotes of the delim argument. Instead, R uses the character string "\t" to represent a single tab character:

read_delim("island-FL.tsv", delim = "\t")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

What happens if the wrong delimiter is specified—for instance, if we type read_delim("island-FL.csv", delim = "%")? In that case R will look for the % character to separate the columns, but since that character does not occur anywhere in the file, each row is interpreted in its entirety as a single column entry:

read_delim("island-FL.csv", delim = "%")
# A tibble: 223 × 1
   `habitat,species,size,shape` 
   <chr>                        
 1 humid,ustulatus,17.088,-0.029
 2 humid,ustulatus,20.06,-0.001 
 3 humid,ustulatus,16.34,0.014  
 4 arid,calvus,13.734,-0.043    
 5 humid,nux,21.898,-0.042      
 6 humid,ustulatus,16.848,-0.023
 7 humid,ustulatus,19.162,0.014 
 8 humid,ustulatus,16.017,0.042 
 9 arid,galapaganus,18.894,0.011
10 humid,nux,26.59,0            
# ℹ 213 more rows

As seen, we have a tibble with a single column called `habitat,species,size,shape`, which has the type <chr> (character string). One must beware of such mistakes, because the computer will not signal any errors: for all it cares, we did intend % to be the delimiter. However, one will not be able to properly work with the data subsequently if it gets loaded in this mistaken form.
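One simple safeguard is to check the dimensions of a freshly loaded tibble against expectations. Here the ncol function exposes the mistake immediately:

```r
# Load the comma-separated file with the wrong (%) delimiter:
dat <- read_delim("island-FL.csv", delim = "%")
ncol(dat)  # yields 1 instead of the expected 4: the delimiter was wrong
```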

The CSV and TSV file formats are so common that there are shorthands available for loading them. Instead of writing read_delim("island-FL.csv", delim = ","), one can equivalently type read_csv("island-FL.csv") for the same effect:

read_csv("island-FL.csv")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

Similarly, read_tsv("island-FL.tsv") is an equivalent shorthand to read_delim("island-FL.tsv", delim = "\t"):

read_tsv("island-FL.tsv")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

4.5 Naming (and renaming) columns

All of the files we have looked at so far contained a header as their first row: instead of containing data, it contained the names of the data columns. Having such a header is good practice. Sometimes however, the header may be missing. For example, the file island-FL-nohead.csv contains the same land snail data as the other files above, but it lacks a header. The first few lines of the file look like this:

humid,ustulatus,17.088,-0.029
humid,ustulatus,20.06,-0.001
humid,ustulatus,16.34,0.014
arid,calvus,13.734,-0.043
humid,nux,21.898,-0.042
humid,ustulatus,16.848,-0.023
humid,ustulatus,19.162,0.014
humid,ustulatus,16.017,0.042
arid,galapaganus,18.894,0.011
humid,nux,26.59,0

If we try to read this file, either with read_delim("island-FL-nohead.csv", delim = ",") or with the shorthand read_csv("island-FL-nohead.csv"), we get the following:

read_delim("island-FL-nohead.csv", delim = ",")
# A tibble: 222 × 4
   humid ustulatus   `17.088` `-0.029`
   <chr> <chr>          <dbl>    <dbl>
 1 humid ustulatus       20.1   -0.001
 2 humid ustulatus       16.3    0.014
 3 arid  calvus          13.7   -0.043
 4 humid nux             21.9   -0.042
 5 humid ustulatus       16.8   -0.023
 6 humid ustulatus       19.2    0.014
 7 humid ustulatus       16.0    0.042
 8 arid  galapaganus     18.9    0.011
 9 humid nux             26.6    0    
10 arid  calvus          17.9   -0.024
# ℹ 212 more rows

The fact that a header is missing is obvious to a human reading the file, but not to the computer, which simply took the first row to be the header anyway and interpreted the data in it as column names. (Notice that the resulting tibble has only 222 rows: the first row of data was consumed as the header.) To prevent this, one can pass col_names = FALSE as an argument to read_delim, read_csv, or read_tsv. The col_names argument is set to TRUE by default; to override it, we must explicitly change it like this:

read_delim("island-FL-nohead.csv", delim = ",", col_names = FALSE)
# A tibble: 223 × 4
   X1    X2             X3     X4
   <chr> <chr>       <dbl>  <dbl>
 1 humid ustulatus    17.1 -0.029
 2 humid ustulatus    20.1 -0.001
 3 humid ustulatus    16.3  0.014
 4 arid  calvus       13.7 -0.043
 5 humid nux          21.9 -0.042
 6 humid ustulatus    16.8 -0.023
 7 humid ustulatus    19.2  0.014
 8 humid ustulatus    16.0  0.042
 9 arid  galapaganus  18.9  0.011
10 humid nux          26.6  0    
# ℹ 213 more rows

While this works, the column names now default to the uninformative labels X1, X2, and so on. Instead, one can use the col_names argument to explicitly specify a vector of character strings, which are then used as the column names in case the file itself contains no header. For example:

read_delim("island-FL-nohead.csv", delim = ",",
           col_names = c("habitat", "species", "size", "shape"))
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

And now these data are identical to our earlier snailDat tibble.
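Columns can also be renamed after the data have been loaded, using the rename function from dplyr (one of the core tidyverse packages). The new name goes on the left of the = sign; the name zone below is just an example chosen for illustration:

```r
snailDat <- read_csv("island-FL.csv")
rename(snailDat, zone = habitat)  # pattern: new_name = old_name
```

This returns a tibble identical to snailDat, except that its first column is now called zone instead of habitat.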

4.6 Excel tables

Although their use is discouraged in science, one should know how to read data from an Excel spreadsheet. To do this, one needs to load the readxl package. This package is part of the tidyverse, but does not get automatically attached when running library(tidyverse). Therefore, we first load the package:

library(readxl)

We can now load Excel files with the function read_excel(). At the start, we downloaded an Excel version of the land snail data, called island-FL.xlsx. It holds the exact same data as the original CSV file, just saved in Excel format for instructive purposes. Let us load this file:

read_excel("island-FL.xlsx")
# A tibble: 223 × 4
   habitat species      size  shape
   <chr>   <chr>       <dbl>  <dbl>
 1 humid   ustulatus    17.1 -0.029
 2 humid   ustulatus    20.1 -0.001
 3 humid   ustulatus    16.3  0.014
 4 arid    calvus       13.7 -0.043
 5 humid   nux          21.9 -0.042
 6 humid   ustulatus    16.8 -0.023
 7 humid   ustulatus    19.2  0.014
 8 humid   ustulatus    16.0  0.042
 9 arid    galapaganus  18.9  0.011
10 humid   nux          26.6  0    
# ℹ 213 more rows

The function read_excel has several further options. For example, given an Excel table with multiple sheets, one can specify which one to import, using the sheet argument. Check the help page of read_excel, and experiment with its options.
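For instance, a sheet can be selected either by position or by name; since the sheet names depend on the particular file, one can first list them with readxl's excel_sheets function:

```r
excel_sheets("island-FL.xlsx")           # list the sheet names in the file
read_excel("island-FL.xlsx", sheet = 1)  # read the first sheet
```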

4.7 Writing data to files

Finally, data can not only be read from files, but also written to them. To do so, instead of read_delim, read_csv, read_tsv, and the like, one uses write_delim, write_csv, write_tsv, and so on. For instance, to save some tibble called dat in CSV form, one can do either

write_delim(dat, file = "/path/to/file.csv", delim = ",")

or the equivalent but shorter

write_csv(dat, file = "/path/to/file.csv")

where /path/to/file.csv should be replaced by the path and file name with which the data ought to be saved.
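As a concrete example, we could save our snailDat tibble (loaded earlier) in tab-separated form; the file name below is an arbitrary choice:

```r
# Writes a TSV copy of the data into the current working directory:
write_tsv(snailDat, file = "island-FL-copy.tsv")
```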

4.8 Exercises

  1. Goldberg et al. (2010) collected data on self-incompatibility in the family Solanaceae (nightshades). The dataset lists 356 species, along with a column indicating self-incompatibility status (0: self-incompatible; 1: self-compatible; 2-5: more complicated selfing scenarios). The data are in the file Goldberg2010_data.csv. An equivalent version is available in Goldberg2010_data.xlsx, with the only difference that it is saved in Excel format.
    • Read the file Goldberg2010_data.csv using read_delim.
    • Read the file Goldberg2010_data.csv using read_csv.
    • Read the Excel file Goldberg2010_data.xlsx using read_excel from the readxl package.
    • Read the file Goldberg2010_data.csv using read_csv, and assign it to a variable called goldbergDat.
    • Extract the species names (first column, called Species) from goldbergDat as a vector of character strings, using the $ notation. What are the 42nd and 137th entries of this vector?
  2. The data file ladybugs.csv contains information on whether ladybugs sampled from rural versus industrial areas tend to be red or black morphs. Black morphs are thought to be an adaptation to more polluted industrial environments where the dark coloration provides better camouflage. The file is comma-separated and lacks headers. The meanings of the columns are, in order: habitat type, site code where the sampling was done, morph color, and the number of individuals sampled.
    • Load this file into a tibble in R, and give the following names to its columns: habitat, site, morph, and number.
    • Extract the last column (number) as a vector and calculate its mean and variance.
  3. Smith et al. (2003) compiled a database of the body masses of mammals of the late Quaternary period. The data are in the file Smith2003_data.txt. The column names are not specified in the file, but they are, in order: Continent (AF=Africa, etc.), Status (extinct, historical, introduction, or extant), Order, Family, Genus, Species, Base-10 Log Mass, Combined Mass (grams), and Reference (numbers, referring to a numerically ordered list of published works—no need to worry about the details).
    • What is the delimiter in this data file?
    • Load the data and give appropriate names to its columns, based on the information above.

  1. If you are using a tidyverse version earlier than 2.0.0, only eight packages are loaded—lubridate, a package for working with dates, is missing from the list.↩︎

  2. The data have been adapted from Barabás et al. (2022).↩︎

  3. This and other data files used throughout the book will be accessible in a similar way. They are compressed in a zip file; extract them to obtain the data.↩︎

  4. Warning: there exists a similarly-named function called read.delim which is part of base R. It does much the same thing as read_delim; however, its use is clunkier and less flexible. You can think of read_delim as the tidyverse upgrade to the original read.delim. My recommendation is to stick with using just read_delim—it is simpler and at the same time more powerful than its predecessor.↩︎

  5. Data frames (a feature of base R) and tibbles (a tidyverse construct) are equivalent for most practical purposes. Tibbles offer some features that are absent from data frames and omit quirks of data frames which tend to get in the way. Like with read_delim and read.delim, tibbles can be thought of as a slightly upgraded and more user-friendly version of data frames. You do not need to be overly concerned with the precise differences between the two. In this book we will be using tibbles almost exclusively.↩︎

  6. The abbreviation <dbl> happens to stand for double-precision numerical value, a standard way of representing numbers on computers.↩︎