install.packages("tidyverse")
4 Reading tabular data from disk
4.1 The tidyverse
package suite
A suite of R packages, sharing the same design philosophy, are collected under the name tidyverse
. The tidyverse
describes itself as “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” The packages in this ecosystem work well together, making it easy to combine their elements to perform a wide range of data analysis tasks. In case it is not yet installed on your computer, type
at the R console and press Enter. After the package is installed, we can load it via the function call
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Looking at the message generated by executing the above line, we see that nine packages are now loaded.1 They are called ggplot2, tibble, and so on. We will get to know these in more detail throughout the book.
There are even more packages that are part of the tidyverse. Typing and executing tidyverse_packages() will show all such packages. Of all these, only nine are loaded by default when invoking library(tidyverse). The others must be loaded separately. For example, readxl is a tidyverse package for loading Excel files in R. To use it, run library(readxl).
In general, it is a good idea to load all necessary packages at the top of your R script, instead of loading them wherever the need to use them first arises. There are two reasons for this. First, if you close and then later reopen RStudio, the packages do not get automatically reloaded—one must execute the calls to library all over again. Second, often other users will run the scripts you write on their own computers, and they will not be able to do so unless the proper packages are loaded first. It is then helpful to others if the necessary packages are all listed right at the top, showing what is needed to run your program.
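For instance, a minimal sketch of how the top of an analysis script might look (the particular packages listed are just examples):
# Load every package the script needs, right at the top:
library(tidyverse) # core data manipulation and plotting tools
library(readxl)    # reading Excel files; not attached by library(tidyverse)

# ...the rest of the analysis follows below...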
4.2 Data file formats
One of the packages loaded by default with the tidyverse is called readr. This package contains tools for loading data files and writing them to disk. We will explore how it works using an example dataset. Before turning to the data, however, it is worth mentioning a few things about data file formats.
Tabular data (i.e., data which can be organized into a table with rows and named columns) may be stored in many different forms. By far the most popular choice is to use Excel spreadsheets. While these may have many good properties, they are decidedly not recommended for scientific use. Instead, in science we strive to rely on plain text files to store our data. Plain text files have several advantages. First, they can be read by any machine without the need for special or proprietary software (such as Excel). Second, they will never go obsolete: one will always be able to open plain text files on any machine. By contrast, there is no guarantee that an Excel file saved today will still open ten years from now in future versions of Excel. And third, file formats such as Excel’s .xls and .xlsx contain a lot of metadata about formatting, font types, locale, etc. that are obscured from the user, but which have the potential to get in the way when performing scientific analyses. By contrast, plain text files have nothing up their sleeve: what you see is exactly what you get; no more, no less. For these reasons, this book will emphasize workflows that rely on data stored in plain text files. However, due to the popularity of Excel, we will also learn how to read data from Excel files into R (Section 4.6), even though their use is otherwise not recommended.
4.2.1 The CSV file format
How can one store data in plain text files? There are various solutions, but the one we will be relying on is called a delimited file. These files contain information as in a spreadsheet, with the contents of every row on a different line. The column entries are separated by some delimiter—a character that marks the end of the information in one column and the start of the next. As an example, let us consider some actual data on land snail species from the Galápagos Islands.2 The data file is available by clicking on this link: island-FL.csv.3 Set the working directory in RStudio to the folder where you have saved it. As a reminder, one can do this by executing setwd("/path/to/files"), where one should substitute one’s own path in place of /path/to/files. Then one can open island-FL.csv in RStudio by clicking on it in the Files panel in the lower right part of the RStudio window, and then choosing the option “View file” (ignore the other option called “Import dataset…”). Having done this, a new tab opens in the editor panel (upper left region) where something like the following should appear:
habitat,species,size,shape
humid,ustulatus,17.088,-0.029
humid,ustulatus,20.06,-0.001
humid,ustulatus,16.34,0.014
arid,calvus,13.734,-0.043
humid,nux,21.898,-0.042
humid,ustulatus,16.848,-0.023
humid,ustulatus,19.162,0.014
humid,ustulatus,16.017,0.042
arid,galapaganus,18.894,0.011
humid,nux,26.59,0
And so on. Here the first row does not contain data; instead, it contains the names of the different columns. Both the names in the first row and the data in subsequent ones are separated by a comma, which is our delimiter separating information belonging in different columns. This also explains the strange-looking .csv extension of the file: it stands for comma-separated values. Comma-separated value files are among the most commonly used in science, and we will be relying on them frequently throughout the book.
4.2.2 The Galápagos land snail data
It is worth explaining what the dataset in the file island-FL.csv represents, as we will be using it repeatedly in subsequent examples. Each row contains information on one land snail individual sampled from the Galápagos Islands. To simplify the data, the individuals have been restricted to just those from Floreana Island. The columns in the data are:
- habitat: This can be either “arid” or “humid”.
- species: Which species the individual belongs to. The genus is always Naesiotus, from the family Bulimulidae. The seven species in the data are Naesiotus calvus, N. galapaganus, N. invalidus, N. nux, N. rugulosus, N. unifasciatus, and N. ustulatus.
- size: A measurement of the absolute volume of the snail’s shell. It is given in units we are not concerned with here, but larger values correspond to larger shells. See Parent and Crespi (2009), Kraemer et al. (2018), and Kraemer et al. (2022) for more information.
- shape: A measurement of the shell’s shape. Again, the units are not important for us here, but small values represent bulky and round shells, while large values represent long and slender shells.
Below are some pictures of these snails (courtesy of Dr. Christine E. Parent):
Shell morphology is an important indicator of microhabitat specialization in these snails: long, slender shells are an adaptation to arid environments due to their more favorable surface-to-volume ratio (see the picture of N. chemnitzioides above for an example), whereas round, bulky shells are better suited to humid environments.
4.2.3 Tabulator-separated and other delimited files
Above we have discussed comma-separated files. Another type of file uses tabulators instead of commas as the column separator. These are called tab-separated value (TSV) files. Just like CSV files, TSV files are also very widely used. An example is provided by the file island-FL.tsv. The data in this file are exactly identical to those in island-FL.csv; the only difference is that the commas are replaced by tabulators. Opening the file in RStudio, the first few lines look like this:
habitat species size shape
humid ustulatus 17.088 -0.029
humid ustulatus 20.06 -0.001
humid ustulatus 16.34 0.014
arid calvus 13.734 -0.043
humid nux 21.898 -0.042
humid ustulatus 16.848 -0.023
humid ustulatus 19.162 0.014
humid ustulatus 16.017 0.042
arid galapaganus 18.894 0.011
humid nux 26.59 0
Commas and tabulators are not the only possible delimiters, however; in principle, any character could play that role. The only thing to be wary of is that choosing some “regular” character as the delimiter might create unintended side effects. For instance, if the character 4 is the delimiter, then a number such as 0.143 will no longer be interpreted as a single number, but as two numbers in separate columns (0.1 and 3, respectively). For this reason, only those characters should be used as delimiters which we can be certain will never be confused with the actual data.
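As a quick illustration (not part of the snail data workflow), we can ask R to split the text "0.143" at every occurrence of the character 4:
# strsplit() breaks a string into pieces at each occurrence of the split character:
strsplit("0.143", split = "4")
The result consists of the two pieces "0.1" and "3", precisely the kind of unintended splitting a poorly chosen delimiter would cause.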
As a further example: it is perfectly possible and legal (even if not common) to use a “percent-separated value” representation where the symbol % is the delimiter. Here is what that would look like with the same land snail data (one can download it from island-FL.psv):
habitat%species%size%shape
humid%ustulatus%17.088%-0.029
humid%ustulatus%20.06%-0.001
humid%ustulatus%16.34%0.014
arid%calvus%13.734%-0.043
humid%nux%21.898%-0.042
humid%ustulatus%16.848%-0.023
humid%ustulatus%19.162%0.014
humid%ustulatus%16.017%0.042
arid%galapaganus%18.894%0.011
humid%nux%26.59%0
The common file extensions .csv and .tsv are useful and suggestive: they can indicate that the data are comma- or tab-separated. However, one must bear in mind that these are mere naming conventions which are not enforced in any way. That is, while not necessarily a good idea, it is perfectly possible to save comma-separated data in a file called mydata.tsv, or tab-separated data in mydata.csv. Furthermore, plain text files often have extensions such as .txt (“text”) or .dat (“data”), neither of which reveals what form the data were stored in. To be absolutely sure, one must open the file in a text editor such as RStudio’s top left editor panel, and see what was used as a delimiter. If you open a file like this, make absolutely sure not to modify and save it accidentally. Doing so compromises your hard-won data, which then no longer accurately capture your true observations.
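An alternative that carries no risk of accidental modification is to peek at the file from within R. A minimal sketch, assuming island-FL.csv is in the current working directory:
# Print the first five lines of the file exactly as they are stored on disk:
read_lines("island-FL.csv", n_max = 5)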
4.3 The tibble data structure
The above raw formats are not yet amenable to processing within R. To work with the data, we first need to import them. For delimited files there is a convenient function, read_delim, which makes this especially simple:4
read_delim("island-FL.csv", delim = ",")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
The function read_delim takes two inputs: the name of the file (with a path if needed), and delim, which is the delimiter used to separate the columns in the file. We will discuss how it works in greater detail in Section 4.4.
Look at the output that was produced by read_delim("island-FL.csv", delim = ",") above, starting with # A tibble: 223 × 4. A tibble (or data frame5) is the R equivalent of an Excel-style spreadsheet. In this case, it has 223 rows and 4 columns (hence the 223 × 4). The way to conceive of a tibble is as a collection of vectors, each arranged in a column, glued together side by side to form a table of data. Importantly, although each vector must consist of entries of the same type as usual (e.g., they can be vectors of numbers, vectors of strings, or vectors of logical values), the different columns need not share types. For example, in the above table, the first and second columns consist of character strings, while the third and fourth ones consist of numerical values. This can be seen right below the header information. Below habitat and species you can see <chr>, which stands for “character string”. Below size and shape we have <dbl> which, confusing as it may look at first, refers simply to ordinary numbers.6 In turn, columns consisting of logical values would have the tag <lgl> underneath them (in this case, though, we don’t have such a column). The point is that by looking at the type information below the header, you can see at a glance how R has interpreted each of the columns.
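To make the idea concrete, here is a tiny hand-made tibble (the values are invented purely for illustration) whose columns deliberately have three different types:
tibble(
  habitat = c("humid", "arid"),  # a character column, displayed as <chr>
  size    = c(17.1, 13.7),       # a numeric column, displayed as <dbl>
  sampled = c(TRUE, FALSE)       # a logical column, displayed as <lgl>
)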
The fact that the individual columns are simply vectors can be made explicit by relying on the $ notation. To access a given column of the table as a vector, we write the name of the table, followed by the $ symbol, followed by the name of the column in question. To make the illustration easier, let us first assign the tibble to a variable called snailDat:
snailDat <- read_delim("island-FL.csv", delim = ",")
And now one can access, e.g., the size column from the snailDat table as a vector of numbers like this:
snailDat$size
[1] 17.088 20.060 16.340 13.734 21.898 16.848 19.162 16.017 18.894 26.590
[11] 17.865 13.857 21.706 15.354 21.369 22.714 16.530 16.604 16.292 16.075
[21] 18.154 21.025 12.895 25.046 23.227 17.284 16.701 23.515 21.903 17.314
[31] 23.891 21.349 24.946 14.122 28.467 17.010 16.831 16.641 20.244 22.042
[41] 18.665 18.680 18.949 16.933 20.399 22.907 18.265 24.269 20.704 16.553
[51] 19.246 24.894 23.610 14.131 15.617 18.605 18.343 19.340 17.017 16.868
[61] 17.814 16.311 17.066 15.386 23.167 21.272 16.799 17.383 15.178 17.195
[71] 19.027 15.011 23.723 17.569 20.091 18.773 22.885 13.661 33.792 19.390
[81] 19.513 18.246 17.637 19.865 18.812 17.903 16.614 28.810 27.675 31.672
[91] 17.209 21.618 26.420 19.135 20.493 15.142 17.195 17.382 15.502 14.404
[101] 22.678 16.210 30.946 14.347 14.541 20.546 20.994 20.103 31.923 24.465
[111] 24.451 19.188 20.610 16.193 15.989 25.314 16.610 13.507 16.744 21.330
[121] 20.480 19.246 19.125 18.642 18.250 17.152 18.814 21.081 15.852 16.743
[131] 16.602 14.879 12.299 21.646 20.539 21.798 19.211 18.168 24.985 19.938
[141] 20.176 14.846 23.488 24.120 17.320 16.118 18.924 15.928 15.478 21.282
[151] 15.838 19.506 24.212 18.066 17.427 15.213 20.748 20.504 20.311 17.444
[161] 20.846 19.351 18.679 24.111 17.906 16.322 21.967 15.992 22.788 17.920
[171] 18.804 24.760 25.766 23.452 16.145 16.040 18.868 17.528 16.556 22.199
[181] 24.424 18.800 21.335 20.191 19.225 23.424 15.601 27.679 29.221 20.373
[191] 16.037 17.577 21.852 18.969 16.933 19.290 24.102 21.398 16.369 19.649
[201] 25.137 18.394 23.102 32.001 16.946 19.054 18.527 18.646 21.361 14.362
[211] 15.858 17.201 19.886 19.264 21.503 21.728 23.312 20.423 21.912 18.158
[221] 16.859 20.488 22.369
Here snailDat$size is really just a vector, and can be treated as such. For example, to get the 9th entry of this vector, we can use the usual bracket notation:
snailDat$size[9]
[1] 18.894
The result is an ordinary numerical value.
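And since snailDat$size is an ordinary numeric vector, the usual vector functions apply to it as well. A small sketch, assuming snailDat has been loaded as above:
mean(snailDat$size)   # the average shell size across all 223 individuals
max(snailDat$size)    # the largest shell in the data
length(snailDat$size) # the number of measurements, i.e. 223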
Finally, let us take one more look at the output:
print(snailDat)
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
When displaying large tibbles, R will not dump all the data at you. Instead, it will display the first 10 rows, with a message indicating how many more rows remain (in our case, ℹ 213 more rows is written at the end of the printout). The system is still aware of the other rows; it just does not show them. To get a full view of a tibble in a more digestible, spreadsheet-like style, one can use the view function. Try running view(snailDat) and see what happens!
4.4 Reading delimited files
We have seen that the function read_delim works by taking two inputs: the name of the file to read, and the delimiter character. So to read a comma-separated file, we can write
read_delim("island-FL.csv", delim = ",")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
Similarly, in case we want to read a percent-separated file such as island-FL.psv, all we need to do is change the delimiter:
read_delim("island-FL.psv", delim = "%")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
The output is exactly as before, since the file contained the same information (just represented differently due to the different delimiter).
To read a tab-separated file, remember not to simply press the tabulator key between the quotes of the delim argument. Instead, R uses the character string "\t" to represent a single press of the tabulator:
read_delim("island-FL.tsv", delim = "\t")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
What happens if the wrong delimiter is specified—for instance, if we type read_delim("island-FL.csv", delim = "%")? In that case R will look for the % character to separate the columns, but since that character does not occur anywhere in the file, each row will be interpreted in its entirety as a single column entry:
read_delim("island-FL.csv", delim = "%")
# A tibble: 223 × 1
`habitat,species,size,shape`
<chr>
1 humid,ustulatus,17.088,-0.029
2 humid,ustulatus,20.06,-0.001
3 humid,ustulatus,16.34,0.014
4 arid,calvus,13.734,-0.043
5 humid,nux,21.898,-0.042
6 humid,ustulatus,16.848,-0.023
7 humid,ustulatus,19.162,0.014
8 humid,ustulatus,16.017,0.042
9 arid,galapaganus,18.894,0.011
10 humid,nux,26.59,0
# ℹ 213 more rows
As seen, we have a tibble with a single column called `habitat,species,size,shape`, which has the type <chr> (character string). One must beware of such mistakes, because the computer will not signal any errors: for all it cares, we did intend % to be the delimiter. However, one will not be able to work with the data properly if it gets loaded in this mistaken form.
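One simple habit that catches this kind of mistake early is to check the dimensions of the freshly loaded tibble. A quick sketch:
# If the number of columns is not what we expect, the delimiter was likely wrong:
wrongDat <- read_delim("island-FL.csv", delim = "%")
ncol(wrongDat) # returns 1 instead of the expected 4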
The CSV and TSV file formats are so common that there are shorthands available for loading them. Instead of writing read_delim("island-FL.csv", delim = ","), one can equivalently type read_csv("island-FL.csv") for the same effect:
read_csv("island-FL.csv")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
Similarly, read_tsv("island-FL.tsv") is an equivalent shorthand for read_delim("island-FL.tsv", delim = "\t"):
read_tsv("island-FL.tsv")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
4.5 Naming (and renaming) columns
All of the files we have looked at so far contained a header as their first row: instead of containing data, it contained the names of the data columns. Having such a header is good practice. Sometimes, however, the header may be missing. For example, the file island-FL-nohead.csv contains the same land snail data as the other files above, but it lacks a header. The first few lines of the file look like this:
humid,ustulatus,17.088,-0.029
humid,ustulatus,20.06,-0.001
humid,ustulatus,16.34,0.014
arid,calvus,13.734,-0.043
humid,nux,21.898,-0.042
humid,ustulatus,16.848,-0.023
humid,ustulatus,19.162,0.014
humid,ustulatus,16.017,0.042
arid,galapaganus,18.894,0.011
humid,nux,26.59,0
If we try to read this file either with read_delim("island-FL-nohead.csv", delim = ",") or with the shorthand read_csv("island-FL-nohead.csv"), we get the following:
read_delim("island-FL-nohead.csv", delim = ",")
# A tibble: 222 × 4
humid ustulatus `17.088` `-0.029`
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 20.1 -0.001
2 humid ustulatus 16.3 0.014
3 arid calvus 13.7 -0.043
4 humid nux 21.9 -0.042
5 humid ustulatus 16.8 -0.023
6 humid ustulatus 19.2 0.014
7 humid ustulatus 16.0 0.042
8 arid galapaganus 18.9 0.011
9 humid nux 26.6 0
10 arid calvus 17.9 -0.024
# ℹ 212 more rows
The fact that a header is missing is obvious to a human reading the file—but not to the computer, which simply took the first row to be the header anyway and interpreted the data in there as if they were column names. To avoid this, one can pass col_names = FALSE as an argument to read_delim, read_csv, or read_tsv. The col_names argument is set by default to TRUE; to override it, we must explicitly change it like this:
read_delim("island-FL-nohead.csv", delim = ",", col_names = FALSE)
# A tibble: 223 × 4
X1 X2 X3 X4
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
While this works, the column names now default to the moderately informative labels X1, X2, and so on. In fact, one can use the col_names argument to explicitly specify a vector of character strings, which are then interpreted as the names to be given to the columns in case no header information exists within the file itself. For example:
read_delim("island-FL-nohead.csv", delim = ",",
col_names = c("habitat", "species", "size", "shape"))
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
And now these data are identical to our earlier snailDat tibble.
4.6 Excel tables
Although their use is discouraged in science, one should know how to read data from an Excel spreadsheet. To do this, one needs to load the readxl package. This package is part of the tidyverse, but does not get automatically attached when running library(tidyverse). Therefore, we first load it explicitly:
library(readxl)
We can now load Excel files with the function read_excel(). At the start, we downloaded an Excel version of the land snail data, called island-FL.xlsx. It holds the exact same data as the original CSV file, just saved in Excel format for instructive purposes. Let us load this file:
read_excel("island-FL.xlsx")
# A tibble: 223 × 4
habitat species size shape
<chr> <chr> <dbl> <dbl>
1 humid ustulatus 17.1 -0.029
2 humid ustulatus 20.1 -0.001
3 humid ustulatus 16.3 0.014
4 arid calvus 13.7 -0.043
5 humid nux 21.9 -0.042
6 humid ustulatus 16.8 -0.023
7 humid ustulatus 19.2 0.014
8 humid ustulatus 16.0 0.042
9 arid galapaganus 18.9 0.011
10 humid nux 26.6 0
# ℹ 213 more rows
The function read_excel has several further options. For example, given an Excel table with multiple sheets, one can specify which one to import using the sheet argument. Check the help page of read_excel, and experiment with its options.
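As a hedged sketch of the syntax (the sheet name "measurements" used here is purely hypothetical):
# Select a sheet by name (hypothetical sheet name, for illustration only):
read_excel("island-FL.xlsx", sheet = "measurements")
# Or select a sheet by its position in the workbook:
read_excel("island-FL.xlsx", sheet = 1)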
4.7 Writing data to files
Finally, data can not only be read from a file, but also written out to one. Then, instead of read_delim, read_csv, read_tsv, and the like, one uses write_delim, write_csv, write_tsv, and so on. For instance, to save some tibble called dat in CSV form, one can do either
write_delim(dat, file = "/path/to/file.csv", delim = ",")
or the equivalent but shorter
write_csv(dat, file = "/path/to/file.csv")
where /path/to/file.csv should be replaced by the path and file name under which the data ought to be saved.
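As a concrete sketch, assuming the snailDat tibble from Section 4.3 is still in memory (the output file name is just an example), we could save it as a tab-separated file in the current working directory like this:
write_tsv(snailDat, file = "island-FL-copy.tsv")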
4.8 Exercises
- Goldberg et al. (2010) collected data on self-incompatibility in the family Solanaceae (nightshades). It contains a list of 356 species, along with a column determining self-incompatibility status (0: self-incompatible; 1: self-compatible; 2-5: more complicated selfing scenarios). The data are in the file Goldberg2010_data.csv. An equivalent version is available in Goldberg2010_data.xlsx, with the only difference that it is saved in Excel format.
  - Read the file Goldberg2010_data.csv using read_delim.
  - Read the file Goldberg2010_data.csv using read_csv.
  - Read the Excel file Goldberg2010_data.xlsx using read_excel from the readxl package.
  - Read the file Goldberg2010_data.csv using read_csv, and assign it to a variable called goldbergDat.
  - Extract the species names (first column, called Species) from goldbergDat as a vector of character strings, using the $ notation. What are the 42nd and 137th entries of this vector?
- The data file ladybugs.csv contains information on whether ladybugs sampled from rural versus industrial areas tend to be red or black morphs. Black morphs are thought to be an adaptation to more polluted industrial environments where the dark coloration provides better camouflage. The file is comma-separated and lacks headers. The meanings of the columns are, in order: habitat type, site code where the sampling was done, morph color, and the number of individuals sampled.
  - Load this file into a tibble in R, and give the following names to its columns: habitat, site, morph, and number.
  - Extract the last column (number) as a vector and calculate its mean and variance.
- Smith et al. (2003) compiled a database of the body masses of mammals of the late Quaternary period. The data are in the file Smith2003_data.txt. The column names are not specified in the file, but they are, in order: Continent (AF=Africa, etc.), Status (extinct, historical, introduction, or extant), Order, Family, Genus, Species, Base-10 Log Mass, Combined Mass (grams), and Reference (numbers, referring to a numerically ordered list of published works—no need to worry about the details).
  - What is the delimiter in this data file?
  - Load the data and give appropriate names to its columns, based on the information above.
If you are using a tidyverse version earlier than 2.0.0, only eight packages are loaded—lubridate, a package for working with dates, is missing from the list.
This and other data files used throughout the book will be accessible in a similar way. They are compressed in a zip file; extract them to obtain the data.
Warning: there exists a similarly-named function called read.delim which is part of base R. It does much the same thing as read_delim; however, its use is clunkier and less flexible. You can think of read_delim as the tidyverse upgrade to the original read.delim. My recommendation is to stick with using just read_delim—it is simpler and at the same time more powerful than its predecessor.
Data frames (a feature of base R) and tibbles (a tidyverse construct) are equivalent for most practical purposes. Tibbles offer some features that are absent from data frames and omit quirks of data frames which tend to get in the way. As with read_delim and read.delim, tibbles can be thought of as a slightly upgraded and more user-friendly version of data frames. You do not need to be overly concerned with the precise differences between the two. In this book we will be using tibbles almost exclusively.
The abbreviation <dbl> happens to stand for double-precision numerical value, a standard way of representing numbers on computers.