Unit 1 Module 2

GEOG246-346

1 Introduction

In this module, we will begin to learn about the R language, starting with the different types of R objects, and how and where (in environments) they interact with one another. As I was trained as an ecologist, I find it helpful to think of how the language functions in ecological terms. First, we can think of R objects (data and functions that are held in memory) as being like species (plant or animal) with their own classification system, or taxonomy. These objects are found in and interact within different environments. Note that this conceptual framework is not the same as the “R ecosystem” terminology you might see online (e.g. this), which refers to the array of user-contributed packages and R-related tools, e.g. RStudio.

So let’s first look at R’s species.

2 A taxonomy of R

The Linnaean system of biological classification groups species hierarchically, from Kingdom all the way down to species (and even sub-species), according to the figure below (source).

We can adapt the lower end of the hierarchy to classify R’s objects (and probably those of most other programming languages), borrowing the organization from family, genus, and species. In fact, we could use the higher level organization if we wanted to classify R itself within the context of other analytical tools and programming methods:

Domain	Analog; digital
Kingdom	Mainframe; Desktop; Laptop; Phone; Cloud/HPC
Phylum	Windows; Mac; Linux
Class	Interpreted; Compiled
Order	Python; Ruby; Perl; R
Family	S3; S4; RC
Genus	vector; matrix; data.frame; array; list; function
Species	logical; integer; double; character; closure

Admittedly the classifications I give above might not be that sound, but our focus here is on Family, Genus, and Species, which are internal to R. Here we liken taxonomic family to the set of classes that define the different types of objects in R. “But wait”, you say, “that’s confusing! Why don’t you map taxonomic class to R classes?” I know, I know, but I wanted to use the whole hierarchy, and it felt better to use Class to distinguish programming languages (into interpreted versus compiled). Plus I didn’t want to have to jump over Order, which I would have struggled to fill with this analogy.

Moving on, Genus maps onto R structures (and functions), and Species onto types, primarily data types. My organization of these topics is pieced together from several sources of information that are online (classes 1, 2; structures 3, 4; types: 5), and it is based on the level of complexity inherent in the object.

2.1 Species (data types)

Let’s start with the simplest level first, the species in our taxonomic analogy. Here we refer to the types of data that we work with (in R or any language). Types are actually assigned to any R object, even ones that are more complex, as the typeof function (run ?typeof to see) will show you, but here we are thinking only of the types of data, which are logical, integer, double, character, NULL, and the less used (at least for this class) complex and raw.

typeof(FALSE)
#> [1] "logical"
typeof(1L)
#> [1] "integer"
typeof(1)
#> [1] "double"
typeof("a")
#> [1] "character"
typeof(NULL)
#> [1] "NULL"
as.raw(1)
#> [1] 01
typeof(as.raw(1))
#> [1] "raw"

What is raw? According to ?raw:

The raw type is intended to hold raw bytes
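
To round out the list of types, here is a minimal sketch showing complex and raw values, along with the type that typeof reports for a couple of more complex objects. The object names (z, bytes) are only illustrative; the functions are all from base R.

```r
# complex numbers carry a real and an imaginary part
z <- 2 + 3i
typeof(z)
#> [1] "complex"

# charToRaw() converts a string to its underlying bytes (printed in hexadecimal)
bytes <- charToRaw("R")
bytes
#> [1] 52
as.integer(bytes)
#> [1] 82
rawToChar(bytes)
#> [1] "R"

# typeof() also reports a type for more complex objects
typeof(list(1, "a"))
#> [1] "list"
typeof(matrix(1:4, nrow = 2))
#> [1] "integer"
```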

2.2 Genus (data structures and functions)

One level up from data types are structures and functions. I liken these to the genus level because both are designed to do something with data, either to hold the data or to do something to, with, or on the data, and each of these can have many forms. For example, a vector is a type of data structure, which can be either an atomic vector or a list: an atomic vector holds a single data type, whereas a list can hold several. So let’s look at structures first.

2.2.1 Data structures

2.2.1.1 One dimensional

I have already mentioned the most basic structure, which is a vector. An atomic vector is a one-dimensional object that contains a single data type:

a <- c("a", "b", "c", "d")
a
#> [1] "a" "b" "c" "d"
b <- 1:10
b
#>  [1]  1  2  3  4  5  6  7  8  9 10
d <- TRUE
d
#> [1] TRUE

The object a is a character vector with four elements, b is an integer vector with 10 elements, and d is a logical vector with one element. To strain the taxonomic example here, you can think of each of these vectors as a genus that contains just one species. A list, on the other hand, can be thought of as a genus containing multiple species, as it can contain many different data types within a single object.

l <- list("a", 1, 0.5, TRUE)
l
#> [[1]]
#> [1] "a"
#> 
#> [[2]]
#> [1] 1
#> 
#> [[3]]
#> [1] 0.5
#> 
#> [[4]]
#> [1] TRUE
str(l)
#> List of 4
#>  $ : chr "a"
#>  $ : num 1
#>  $ : num 0.5
#>  $ : logi TRUE

Notice that each of the data types is maintained in the list (which we put together using the list function), and we can verify the type of data in the list using the str function. If we try to put together this same mix of types into an atomic vector using the c (concatenate) function, we don’t get the same results.

l <- c("a", 1, 0.5, TRUE)
l
#> [1] "a"    "1"    "0.5"  "TRUE"
str(l)
#>  chr [1:4] "a" "1" "0.5" "TRUE"

It coerces everything to a character data type.

2.2.1.2 Two or more dimensions

There are several structures that have two or more dimensions. There are the matrix, the data.frame, and the array. The first two are two-dimensional, in that they consist of rows and columns, and the third can have an arbitrary number of dimensions.

m <- cbind(v1 = 1:4, v2 = 1:4)
m
#>      v1 v2
#> [1,]  1  1
#> [2,]  2  2
#> [3,]  3  3
#> [4,]  4  4
str(m)
#>  int [1:4, 1:2] 1 2 3 4 1 2 3 4
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "v1" "v2"
m2 <- cbind(v1 = c("a", "b"), c("c", "d"))
m2
#>      v1     
#> [1,] "a" "c"
#> [2,] "b" "d"
str(m2)
#>  chr [1:2, 1:2] "a" "b" "c" "d"
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "v1" ""
DF <- data.frame(v1 = 1:4, v2 = as.numeric(1:4), v3 = c("a", "b", "c", "d"))
DF
#>   v1 v2 v3
#> 1  1  1  a
#> 2  2  2  b
#> 3  3  3  c
#> 4  4  4  d
str(DF)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ v1: int  1 2 3 4
#>  $ v2: num  1 2 3 4
#>  $ v3: chr  "a" "b" "c" "d"
arr <- array(c(1:4, 1:4), dim = c(2, 2, 2))
arr
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
str(arr)
#>  int [1:2, 1:2, 1:2] 1 2 3 4 1 2 3 4

A matrix can only hold a single data type (like an atomic vector, if you try to mix types it will coerce them all to one kind, so a matrix is a genus that can only hold one species). A data.frame, which is actually a special kind of list that binds vectors containing the same number of elements into columns (so that every column has the same number of rows), can mix data types (a genus with multiple species). An array, like a matrix, can only have one data type, even though it can have more than two dimensions.
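
The coercion behaviour described above can be checked directly. This is a minimal sketch using made-up objects (v_mixed, m_mixed, df_mixed are hypothetical names); stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# mixing logical, integer, and double in an atomic vector gives a double vector
v_mixed <- c(TRUE, 2L, 3.5)
typeof(v_mixed)
#> [1] "double"
v_mixed
#> [1] 1.0 2.0 3.5

# a matrix behaves the same way: integer + character becomes character
m_mixed <- cbind(1:2, c("a", "b"))
typeof(m_mixed)
#> [1] "character"

# a data.frame keeps each column's type, because it is a list of columns
df_mixed <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)
sapply(df_mixed, typeof)
#>           n           s 
#>   "integer" "character"
is.list(df_mixed)
#> [1] TRUE
```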

Let’s turn back to the list now, since we just mentioned it in the context of the data.frame. A list is very versatile, and can contain any kind of R object.

l2 <- list(m, m2, DF, arr, c, list)
l2
#> [[1]]
#>      v1 v2
#> [1,]  1  1
#> [2,]  2  2
#> [3,]  3  3
#> [4,]  4  4
#> 
#> [[2]]
#>      v1     
#> [1,] "a" "c"
#> [2,] "b" "d"
#> 
#> [[3]]
#>   v1 v2 v3
#> 1  1  1  a
#> 2  2  2  b
#> 3  3  3  c
#> 4  4  4  d
#> 
#> [[4]]
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
#> 
#> 
#> [[5]]
#> function (...)  .Primitive("c")
#> 
#> [[6]]
#> function (...)  .Primitive("list")
str(l2)
#> List of 6
#>  $ : int [1:4, 1:2] 1 2 3 4 1 2 3 4
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2] "v1" "v2"
#>  $ : chr [1:2, 1:2] "a" "b" "c" "d"
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2] "v1" ""
#>  $ :'data.frame':    4 obs. of  3 variables:
#>   ..$ v1: int [1:4] 1 2 3 4
#>   ..$ v2: num [1:4] 1 2 3 4
#>   ..$ v3: chr [1:4] "a" "b" "c" "d"
#>  $ : int [1:2, 1:2, 1:2] 1 2 3 4 1 2 3 4
#>  $ :function (...)  
#>  $ :function (...)

As this shows, we can put the matrices, the data.frame, and the array we just made into a list, as well as some of the functions we were using to make those objects (c, list).

2.2.2 Functions

That brings us now to functions. I know it is perhaps strained to think of a function as a genus, but functions are a kind of structure and functions can be organized into different groups, so it is not entirely crazy to think of functions as being analogous(ish) to Genus. So what are the functional genera?

2.2.2.1 Primitives

The first genus consists of primitive functions, of which c and list are two examples, but also ones like sum. Primitive functions are actually C functions that are called directly by R that contain no R code:

c
#> function (...)  .Primitive("c")
sum
#> function (..., na.rm = FALSE)  .Primitive("sum")
list
#> function (...)  .Primitive("list")

By running a function’s name without parentheses, you can see what type of function it is. You can also get a complete list of R’s primitive functions by running names(methods:::.BasicFunsList).
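
You can also test for primitives programmatically rather than reading the printed definition. A small sketch using base R’s is.primitive, typeof, and body (sd is used here simply as an example of a non-primitive function):

```r
# primitives are implemented in C and have no R-level body
is.primitive(sum)
#> [1] TRUE
typeof(sum)
#> [1] "builtin"
body(sum)
#> NULL

# control-flow primitives such as `if` have their own type
typeof(`if`)
#> [1] "special"

# ordinary R functions are closures
is.primitive(sd)
#> [1] FALSE
typeof(sd)
#> [1] "closure"
```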

2.2.2.2 Operators

Operators are another kind of functional genus, such as the usual mathematical symbols +, -, /, *, and logical ones such as >, <, and |, plus a number of others, some of which are listed here. This list overlaps heavily with the list of primitives, so operators might even be considered more properly a sub-genus of it, although there are non-primitive operators in existence, such as ?.

5 * 5
#> [1] 25
(10 + 2) / 5
#> [1] 2.4
(10 > 5) & (5 < 6)
#> [1] TRUE
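
Because operators are functions, they can be called like any other function by wrapping the symbol in backticks, and you can define your own. The sketch below illustrates this; the %+% operator is a hypothetical example defined here, not something that ships with R.

```r
# calling an operator in ordinary function form
`+`(5, 5)
#> [1] 10

# passing an operator to another function
sapply(1:3, `*`, 2)
#> [1] 2 4 6

# most operators are primitives, but ? is an ordinary R function
is.primitive(`+`)
#> [1] TRUE
is.primitive(utils::`?`)
#> [1] FALSE

# a user-defined infix operator (hypothetical): paste two strings together
`%+%` <- function(lhs, rhs) paste(lhs, rhs)
"a" %+% "b"
#> [1] "a b"
```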

2.2.2.3 Control structures

There are a number of functions that R shares with other languages, which are (to paraphrase from here) used to control the sequence in which statements (e.g. a <- 1 + 10) are evaluated. These include for, while, if, else, break, etc.

a <- c(1, 11)
for(i in a) {
  if(i < 10) {
    print(paste(i, "is less than 10"))
  } else {
    print(paste(i, "is bigger than 10"))
  }
}
#> [1] "1 is less than 10"
#> [1] "11 is bigger than 10"

i <- 0
while(i < 5) {
  print(i^10)
  i <- i + 1
}
#> [1] 0
#> [1] 1
#> [1] 1024
#> [1] 59049
#> [1] 1048576

The code above uses several common control structures. if and else are conditional statements, determining whether a statement gets evaluated or not depending on a defined condition. for and while are different kinds of loops. Of particular interest is another set of looping functions that are native to R, which are known as the *apply functions. We will get into all these in later sections, but for now here is a taste of one of them (lapply).

lapply(c(1, 11), function(x) {
  if(x < 10) {
    paste(x, "is less than 10")
  } else {
    paste(x, "is bigger than 10")
  }
})
#> [[1]]
#> [1] "1 is less than 10"
#> 
#> [[2]]
#> [1] "11 is bigger than 10"
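
lapply always returns a list. As a hedged aside, the same loop can be written with vapply, another member of the *apply family, which returns an atomic vector and checks that each result has the declared type and length (the name res is just illustrative):

```r
# same logic as the lapply call, but the result is a character vector
res <- vapply(c(1, 11), function(x) {
  if (x < 10) {
    paste(x, "is less than 10")
  } else {
    paste(x, "is bigger than 10")
  }
}, FUN.VALUE = character(1))
res
#> [1] "1 is less than 10"    "11 is bigger than 10"
```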

2.2.2.4 Base, package, and user-defined functions

Beyond the primitives, R ships with a number of already built functions, including various commonly used statistical functions.

mean
#> standardGeneric for "mean" defined from package "base"
#> 
#> function (x, ...) 
#> standardGeneric("mean")
#> <environment: 0x7feb16ee0208>
#> Methods may be defined for arguments: x
#> Use  showMethods(mean)  for currently available ones.
sample
#> function (x, size, replace = FALSE, prob = NULL) 
#> {
#>     if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 
#>         1) {
#>         if (missing(size)) 
#>             size <- x
#>         sample.int(x, size, replace, prob)
#>     }
#>     else {
#>         if (missing(size)) 
#>             size <- length(x)
#>         x[sample.int(length(x), size, replace, prob)]
#>     }
#> }
#> <bytecode: 0x7feb216d5418>
#> <environment: namespace:base>
sd
#> function (x, na.rm = FALSE) 
#> sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
#>     na.rm = na.rm))
#> <bytecode: 0x7feaf5abc908>
#> <environment: namespace:stats>

Three are provided above, two of which (mean and sample) are part of base R, i.e. they are built into the language itself, and one of which comes from the stats package, which is one of R’s core packages (basically it loads when you open R). You will note above that the packages are referred to next to the term “namespace”. We will hear more about that in the next sections.

In addition to these core packages, there are many, many (>10,000) user contributed packages, most of which can be installed from CRAN using the install.packages command, or RStudio’s Packages interface. One example we have already used a fair bit (because you are reading this) is the install_github function from the devtools package.

And then, of course, there are user-defined functions, a much, much larger universe, like the grains of sand on a beach (or the largest genus of them all). These are all the functions users make for themselves in their various scripts and never put into packages. For example:

my_random_function <- function(x) (x * 10) - 2 + 10^2
my_random_function(c(2, 4, 100))
#> [1]  118  138 1098

2.2.2.5 Generic functions

This is the last genus of functions we will describe, as it sets us up to think next about classes (the Family). Generics are functions that have a common name and generally do the same thing, but produce different outputs depending on what class (Family) of object they are applied to. Three widely used generics are print, plot, and summary. Let’s look at two examples of summary:

a <- 1:10
b <- sample(1:100, 10)
summary(a)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.25    5.50    5.50    7.75   10.00
summary(lm(a ~ b))
#> 
#> Call:
#> lm(formula = a ~ b)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.0802 -1.5640  0.2557  2.1819  4.2148 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  4.01509    2.50527   1.603    0.148
#> b            0.03278    0.05080   0.645    0.537
#> 
#> Residual standard error: 3.131 on 8 degrees of freedom
#> Multiple R-squared:  0.04947,    Adjusted R-squared:  -0.06935 
#> F-statistic: 0.4163 on 1 and 8 DF,  p-value: 0.5368

Here we see that summary applied to a vector of integers provides mean and quantile values, while it provides a summary of regression fit when applied to the output of a linear model (lm) fit to vector a and 10 randomly selected numbers between 1 and 100.

We can see which classes use the summary generic by running the methods function:

methods(summary)
#>   [1] summary,ANY-method                  summary,DBIObject-method           
#>   [3] summary,diagonalMatrix-method       summary,GridTopology-method        
#>   [5] summary,mle-method                  summary,RasterLayer-method         
#>   [7] summary,RasterStackBrick-method     summary,sparseMatrix-method        
#>   [9] summary,Spatial-method              summary,SpatRaster-method          
#>  [11] summary,SpatVector-method           summary.aareg*                     
#>  [13] summary.aov                         summary.aovlist*                   
#>  [15] summary.aspell*                     summary.bag*                       
#>  [17] summary.bagEarth*                   summary.bagFDA*                    
#>  [19] summary.bit*                        summary.bitwhich*                  
#>  [21] summary.booltype*                   summary.cch*                       
#>  [23] summary.check_packages_in_dir*      summary.classbagg*                 
#>  [25] summary.col_spec*                   summary.connection                 
#>  [27] summary.corAR1*                     summary.corARMA*                   
#>  [29] summary.corCAR1*                    summary.corCompSymm*               
#>  [31] summary.corExp*                     summary.corGaus*                   
#>  [33] summary.corIdent*                   summary.corLin*                    
#>  [35] summary.corNatural*                 summary.corRatio*                  
#>  [37] summary.corSpher*                   summary.corStruct*                 
#>  [39] summary.corSymm*                    summary.coxph*                     
#>  [41] summary.coxph.penal*                summary.data.frame                 
#>  [43] summary.Date                        summary.default                    
#>  [45] summary.diff.resamples*             summary.Duration*                  
#>  [47] summary.ecdf*                       summary.effects*                   
#>  [49] summary.estimate*                   summary.factor                     
#>  [51] summary.ggplot*                     summary.glm                        
#>  [53] summary.gls*                        summary.haven_labelled*            
#>  [55] summary.hcl_palettes*               summary.Hist*                      
#>  [57] summary.ImageMetaData*              summary.inbagg*                    
#>  [59] summary.inclass*                    summary.infl*                      
#>  [61] summary.integer64*                  summary.Interval*                  
#>  [63] summary.lca*                        summary.lm                         
#>  [65] summary.lme*                        summary.lmList*                    
#>  [67] summary.loess*                      summary.loglm*                     
#>  [69] summary.lvm*                        summary.lvm.mixture*               
#>  [71] summary.lvmfit*                     summary.manova                     
#>  [73] summary.matrix                      summary.mlm*                       
#>  [75] summary.modelStruct*                summary.multigroup*                
#>  [77] summary.multigroupfit*              summary.multinom*                  
#>  [79] summary.negbin*                     summary.nls*                       
#>  [81] summary.nlsList*                    summary.nnet*                      
#>  [83] summary.ordreg*                     summary.packageStatus*             
#>  [85] summary.pdBlocked*                  summary.pdCompSymm*                
#>  [87] summary.pdDiag*                     summary.pdIdent*                   
#>  [89] summary.pdLogChol*                  summary.pdMat*                     
#>  [91] summary.pdNatural*                  summary.pdSymm*                    
#>  [93] summary.Period*                     summary.polr*                      
#>  [95] summary.POSIXct                     summary.POSIXlt                    
#>  [97] summary.ppr*                        summary.pr_DB*                     
#>  [99] summary.prcomp*                     summary.princomp*                  
#> [101] summary.proc_time                   summary.prodlim*                   
#> [103] summary.proxy_registry*             summary.pyears*                    
#> [105] summary.ratetable*                  summary.recipe*                    
#> [107] summary.resamples*                  summary.reStruct*                  
#> [109] summary.ri*                         summary.RichSOCKcluster*           
#> [111] summary.RichSOCKnode*               summary.rlang_error*               
#> [113] summary.rlang_message*              summary.rlang_trace*               
#> [115] summary.rlang_warning*              summary.rlang:::list_of_conditions*
#> [117] summary.rlm*                        summary.rpart*                     
#> [119] summary.sfc*                        summary.shingle*                   
#> [121] summary.sim*                        summary.srcfile                    
#> [123] summary.srcref                      summary.stepfun                    
#> [125] summary.stl*                        summary.survbagg*                  
#> [127] summary.survexp*                    summary.survfit*                   
#> [129] summary.survfitms*                  summary.survreg*                   
#> [131] summary.svm*                        summary.table                      
#> [133] summary.timeDate*                   summary.tmerge*                    
#> [135] summary.train*                      summary.trellis*                   
#> [137] summary.tukeysmooth*                summary.tune*                      
#> [139] summary.twostageCV*                 summary.units*                     
#> [141] summary.varComb*                    summary.varConstPower*             
#> [143] summary.varConstProp*               summary.varExp*                    
#> [145] summary.varFixed*                   summary.varFunc*                   
#> [147] summary.varIdent*                   summary.varPower*                  
#> [149] summary.vctrs_sclr*                 summary.vctrs_vctr*                
#> [151] summary.warnings                    summary.which*                     
#> [153] summary.XMLInternalDocument*        summary.zibreg*                    
#> see '?methods' for accessing help and source code

Quite a list (and much longer for print)! The notation for most entries above is <generic_function>.<class>, which marks an S3 method; entries written as <generic_function>,<class>-method are S4 methods. Generics can also be understood within the context of object-oriented programming, which is an important aspect of R and Python. We get into this more below.

2.3 Family (classes)

Finally we arrive at classes, which in our (hopefully still useful) analogy is akin to a taxonomic family. To better understand classes (and why they are likened to a taxonomic family, a higher level of organization than genus and species), we need to learn about object-oriented programming (OOP), in which classes are a central concept.

2.3.1 OOP

The best short explanation I have seen for what OOP is comes from a Python guide:

In all the programs we wrote till now, we have designed our program around functions i.e. blocks of statements which manipulate data. This is called the procedure-oriented way of programming. There is another way of organizing your program which is to combine data and functionality and wrap it inside something called an object. This is called the object oriented programming paradigm. Most of the time you can use procedural programming, but when writing large programs or have a problem that is better suited to this method, you can use object oriented programming techniques.

Classes and objects are the two main aspects of object oriented programming. A class creates a new type where objects are instances of the class. An analogy is that you can have variables of type int which translates to saying that variables that store integers are variables which are instances (objects) of the int class.

This passage nicely explains how OOP differs from the alternative programming paradigm (procedural programming). Another useful bit of explanation on OOP is from Advanced R:

Central to any object-oriented system are the concepts of class and method. A class defines the behaviour of objects by describing their attributes and their relationship to other classes. The class is also used when selecting methods, functions that behave differently depending on the class of their input. Classes are usually organised in a hierarchy: if a method does not exist for a child, then the parent’s method is used instead; the child inherits behaviour from the parent.

The main takeaway here is that a class defines different types of objects and what methods are associated with them (so to me this feels like a higher level of organization, which makes it like a taxonomic family). Here is where R gets confusing, because it actually has several types of OO system: S3, S4, and RC, and (not quite OO) the “base types”, which are the primitives from C we described above. I’ll let you read the description of each of those and how they differ in the Advanced R link I just gave you (here it is again), and it is good to understand them. Here I will highlight a few things about them that I think are important to know, particularly with respect to understanding R’s geospatial capabilities.

R methods are idiosyncratic relative to other OO languages. If you have ever worked with Python, you likely have done something like this:

>>> import numpy as np
>>> v = np.array([0, 1, 2, 3])
>>> v.mean()
1.5

Here the method (the function mean) appears after the object (v, a 1-dimensional numpy array, roughly equivalent to an R integer vector), because it belongs to the class. In R, the generic function is called on the object, and the appropriate version of the generic is then applied for the particular class of object:

v <- 0:3
mean(v)
#> [1] 1.5

If a class-specific variant of the generic hasn’t been defined, R applies the default version of the function. That’s the case here, where mean.default is used because this is just a simple integer vector.

class(v)
#> [1] "integer"
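
To make the dispatch mechanism concrete, here is a minimal sketch of a home-made S3 generic. The generic describe and its methods are hypothetical names invented for this illustration (they are not part of R), and the model is fit to the built-in cars dataset.

```r
# a hypothetical generic: UseMethod() picks a method based on class(x)
describe <- function(x, ...) UseMethod("describe")

# fallback used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("an object of class", class(x)[1])
}

# class-specific method for objects of class "lm"
describe.lm <- function(x, ...) {
  paste("a linear model with", length(coef(x)), "coefficients")
}

describe(0:3)                       # no describe.integer, so the default is used
#> [1] "an object of class integer"
fit <- lm(dist ~ speed, data = cars)
describe(fit)                       # dispatches to describe.lm
#> [1] "a linear model with 2 coefficients"
```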

This is important to know because sometimes you might find that the generic function you need and expect isn’t there for you:

rst <- raster::raster(nrow = 10, ncol = 10, vals = 1:100)
plot(rst)

The example above jumps a bit ahead of where we are currently, but it shows what happens when a generic function is not available for a particular class. Here we created a raster object (of class RasterLayer; we will be seeing much more of rasters in Unit 2, but specifically working with those generated by the terra package), and tried to plot it (i.e. map the raster). The raster package provides a plot method for this class, so when that method is available, calling the generic function plot on rst (plot(rst)) maps the raster. However, in this example the plot method for rasters is not available because the raster package was not loaded. R was instead trying to apply plot.default to object rst, which has a very different structure than a class that plot.default is able to handle, e.g.:

mat <- cbind(x = 1:10, y = 11:20)
plot(mat)

So let’s look at the structures of the two objects. Here’s the class and structure of the object mat:

class(mat)
#> [1] "matrix" "array"
str(mat)
#>  int [1:10, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "x" "y"

Pretty simple. Here is the same for object rst:

class(rst)
#> [1] "RasterLayer"
#> attr(,"package")
#> [1] "raster"
str(rst)
#> Formal class 'RasterLayer' [package "raster"] with 12 slots
#>   ..@ file    :Formal class '.RasterFile' [package "raster"] with 13 slots
#>   .. .. ..@ name        : chr ""
#>   .. .. ..@ datanotation: chr "FLT4S"
#>   .. .. ..@ byteorder   : chr "little"
#>   .. .. ..@ nodatavalue : num -Inf
#>   .. .. ..@ NAchanged   : logi FALSE
#>   .. .. ..@ nbands      : int 1
#>   .. .. ..@ bandorder   : chr "BIL"
#>   .. .. ..@ offset      : int 0
#>   .. .. ..@ toptobottom : logi TRUE
#>   .. .. ..@ blockrows   : int 0
#>   .. .. ..@ blockcols   : int 0
#>   .. .. ..@ driver      : chr ""
#>   .. .. ..@ open        : logi FALSE
#>   ..@ data    :Formal class '.SingleLayerData' [package "raster"] with 13 slots
#>   .. .. ..@ values    : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..@ offset    : num 0
#>   .. .. ..@ gain      : num 1
#>   .. .. ..@ inmemory  : logi TRUE
#>   .. .. ..@ fromdisk  : logi FALSE
#>   .. .. ..@ isfactor  : logi FALSE
#>   .. .. ..@ attributes: list()
#>   .. .. ..@ haveminmax: logi TRUE
#>   .. .. ..@ min       : int 1
#>   .. .. ..@ max       : int 100
#>   .. .. ..@ band      : int 1
#>   .. .. ..@ unit      : chr ""
#>   .. .. ..@ names     : chr ""
#>   ..@ legend  :Formal class '.RasterLegend' [package "raster"] with 5 slots
#>   .. .. ..@ type      : chr(0) 
#>   .. .. ..@ values    : logi(0) 
#>   .. .. ..@ color     : logi(0) 
#>   .. .. ..@ names     : logi(0) 
#>   .. .. ..@ colortable: logi(0) 
#>   ..@ title   : chr(0) 
#>   ..@ extent  :Formal class 'Extent' [package "raster"] with 4 slots
#>   .. .. ..@ xmin: num -180
#>   .. .. ..@ xmax: num 180
#>   .. .. ..@ ymin: num -90
#>   .. .. ..@ ymax: num 90
#>   ..@ rotated : logi FALSE
#>   ..@ rotation:Formal class '.Rotation' [package "raster"] with 2 slots
#>   .. .. ..@ geotrans: num(0) 
#>   .. .. ..@ transfun:function ()  
#>   ..@ ncols   : int 10
#>   ..@ nrows   : int 10
#>   ..@ crs     :Formal class 'CRS' [package "sp"] with 1 slot
#>   .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +no_defs"
#>   .. .. ..$ comment: chr "GEOGCRS[\"unknown\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.25722"| __truncated__
#>   ..@ history : list()
#>   ..@ z       : list()

Much more complicated! This is an object of class RasterLayer, which uses the S4 OO system. It has a number of “slots”, which hold information about the raster object, in this case 12 upper-level slots, many of which contain several sub-slots. You can access the information in an S4 object’s slots in two ways, using either the @ operator or the slot function:

rst@extent
#> class      : Extent 
#> xmin       : -180 
#> xmax       : 180 
#> ymin       : -90 
#> ymax       : 90
slot(rst, "extent")
#> class      : Extent 
#> xmin       : -180 
#> xmax       : 180 
#> ymin       : -90 
#> ymax       : 90

Here we are pulling out the information on rst’s extent, which is itself an object with a class definition.
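
If you do not have the raster package installed, the same slot mechanics can be tried on a toy S4 class. This is only a sketch: the class Box, the generic box_width, and the object bx are hypothetical names made up for this example, built with the methods package that ships with R.

```r
library(methods)

# define a toy S4 class with two numeric slots
setClass("Box", slots = c(xmin = "numeric", xmax = "numeric"))
bx <- new("Box", xmin = -180, xmax = 180)

isS4(bx)
#> [1] TRUE
slotNames(bx)
#> [1] "xmin" "xmax"

# two equivalent ways of reading a slot
bx@xmin
#> [1] -180
slot(bx, "xmax")
#> [1] 180

# an S4 generic with a method for class Box
setGeneric("box_width", function(obj) standardGeneric("box_width"))
setMethod("box_width", "Box", function(obj) obj@xmax - obj@xmin)
box_width(bx)
#> [1] 360
```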

S3 and S4 classes are accessed in different ways. Although both make use of generic functions in a similar way, the information they hold is accessed differently. In the previous example using lm in the Generic functions section, lm(a ~ b) is an S3 object:

lm_ab <- lm(a ~ b)  
str(lm_ab)
#> List of 12
#>  $ coefficients : Named num [1:2] 4.0151 0.0328
#>   ..- attr(*, "names")= chr [1:2] "(Intercept)" "b"
#>  $ residuals    : Named num [1:10] -5.08 -3.457 -1.671 -0.933 -1.244 ...
#>   ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
#>  $ effects      : Named num [1:10] -17.393 2.02 0.534 1.005 -0.638 ...
#>   ..- attr(*, "names")= chr [1:10] "(Intercept)" "b" "" "" ...
#>  $ rank         : int 2
#>  $ fitted.values: Named num [1:10] 6.08 5.46 4.67 4.93 6.24 ...
#>   ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
#>  $ assign       : int [1:2] 0 1
#>  $ qr           :List of 5
#>   ..$ qr   : num [1:10, 1:2] -3.162 0.316 0.316 0.316 0.316 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:10] "1" "2" "3" "4" ...
#>   .. .. ..$ : chr [1:2] "(Intercept)" "b"
#>   .. ..- attr(*, "assign")= int [1:2] 0 1
#>   ..$ qraux: num [1:2] 1.32 1.09
#>   ..$ pivot: int [1:2] 1 2
#>   ..$ tol  : num 1e-07
#>   ..$ rank : int 2
#>   ..- attr(*, "class")= chr "qr"
#>  $ df.residual  : int 8
#>  $ xlevels      : Named list()
#>  $ call         : language lm(formula = a ~ b)
#>  $ terms        :Classes 'terms', 'formula'  language a ~ b
#>   .. ..- attr(*, "variables")= language list(a, b)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "a" "b"
#>   .. .. .. ..$ : chr "b"
#>   .. ..- attr(*, "term.labels")= chr "b"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(a, b)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. ..- attr(*, "names")= chr [1:2] "a" "b"
#>  $ model        :'data.frame':   10 obs. of  2 variables:
#>   ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
#>   ..$ b: int [1:10] 63 44 20 28 68 9 47 50 70 54
#>   ..- attr(*, "terms")=Classes 'terms', 'formula'  language a ~ b
#>   .. .. ..- attr(*, "variables")= language list(a, b)
#>   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. ..$ : chr [1:2] "a" "b"
#>   .. .. .. .. ..$ : chr "b"
#>   .. .. ..- attr(*, "term.labels")= chr "b"
#>   .. .. ..- attr(*, "order")= int 1
#>   .. .. ..- attr(*, "intercept")= int 1
#>   .. .. ..- attr(*, "response")= int 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. .. ..- attr(*, "predvars")= language list(a, b)
#>   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. .. ..- attr(*, "names")= chr [1:2] "a" "b"
#>  - attr(*, "class")= chr "lm"

An S3 object like this one is really just a list with a class attribute, and you access information in the list using the $ operator.

lm_ab$coefficients
#> (Intercept)           b 
#>  4.01508649  0.03277955

That gives us the linear model’s coefficients.
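As a minimal sketch of what "just a list" means in practice, the block below fits a model like lm_ab on made-up data (the seed and the values of b are assumptions, so the coefficients will not match those printed above) and extracts the coefficients in three equivalent ways.

```r
# Hypothetical data standing in for the a and b used earlier
set.seed(1)
a <- 1:10
b <- sample(1:100, 10)
lm_ab <- lm(a ~ b)

is.list(lm_ab)        # TRUE: the lm object is built on a list
class(lm_ab)          # "lm": the class attribute is what makes it S3
names(lm_ab)[1:3]     # the first few list elements

# Three ways to reach the same list element
cf1 <- lm_ab$coefficients
cf2 <- lm_ab[["coefficients"]]
cf3 <- coef(lm_ab)    # extractor function provided for model objects
identical(cf1, cf2)
identical(cf1, cf3)
```

Extractor functions such as coef are usually the safer choice, because they keep working if a class stores its information differently.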

S3 and S4 are both used in defining R spatial objects. As we have seen in the example above, rasters are based on the S4 system, while the two main packages providing vector operations, sp and sf, use the S4 and S3 systems, respectively. Even though sf is newer and intended to replace sp, it makes use of R’s older S3 system. So, from our perspective, we want to have some understanding of both of these systems because we will eventually want to extract information from (and put it into) spatial objects, and the way in which we do that will differ according to the OO system that defines the class.

Okay, that’s enough on OO for now.

3 Environments

This brings us to the final chapter in our extended ecological metaphor. We have just heard about R’s species and their taxonomy, so now we will talk about how they interact. In ecology, we talk about species’ environments. In R, objects have different environments. To learn about R environments in detail, there is a whole chapter on them in Advanced R. However, for now, to avoid too much confusion, and since (according to the chapter) “Understanding environments is not necessary for day-to-day use of R”, we will focus on just a few aspects of what environments are and how they affect objects in ways that you will almost certainly encounter.

First, to provide a very simplistic definition of what an environment is, it is the means by which R maps the name you assign to an object to where the object’s values live in memory:

The job of an environment is to associate, or bind, a set of names to a set of values (source)

Environments in R behave a lot like named lists, but unlike lists each environment also has a parent environment, and they are nested in ways such that objects found in one environment are isolated from other environments. There are three major environments you should know about:

The global environment
The package environment
The execution environment
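To make the idea of binding names to values concrete, here is a small sketch that creates a fresh environment by hand. The names e, e2, a, and b are purely illustrative.

```r
# Create an empty environment and bind two names in it
e <- new.env()
assign("a", 1:4, envir = e)
e$b <- "hello"

ls(envir = e)          # the names bound in e: "a" "b"
get("a", envir = e)    # the value bound to the name a

# One behaviour that differs from an ordinary list: an environment
# is not copied on assignment, so e2 is a second name for the same
# environment
e2 <- e
e2$a <- 99
e$a                    # 99: the change is visible through e as well
```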

3.1 The global environment

This is the top level environment in R, and is the place where any object that you create in the console or a script that you are using for interactive analysis lives.

a <- 1:4
f <- function(x) {
  x * 10
}
ls()
#>  [1] "a"                  "arr"                "b"                 
#>  [4] "biophysical_path"   "birthdays"          "clarity"           
#>  [7] "colMax"             "d"                  "date2"             
#> [10] "date2_char"         "date3"              "DF"                
#> [13] "DF_filt"            "DF_head"            "DF_tail"           
#> [16] "distinct_birthdays" "districts"          "e"                 
#> [19] "f"                  "farmers"            "gr"                
#> [22] "i"                  "ifile"              "items"             
#> [25] "k"                  "l"                  "l2"                
#> [28] "landsat_path"       "lm_ab"              "lt"                
#> [31] "m"                  "m2"                 "maize"             
#> [34] "maize2"             "mat"                "my_number_checker" 
#> [37] "my_random_function" "ohtml"              "progress"          
#> [40] "pts1"               "pts2"               "pts3"              
#> [43] "pts4"               "quality"            "r"                 
#> [46] "reproducibility"    "rmds"               "roads"             
#> [49] "rowMax"             "rst"                "sample_size"       
#> [52] "samples"            "site_summary"       "struct"            
#> [55] "taxonomic_path"     "tib"                "tib_sorted"        
#> [58] "ug"                 "v"                  "v1"                
#> [61] "v2"                 "w"                  "x"                 
#> [64] "y"
environment()
#> <environment: R_GlobalEnv>

Here we define two objects, the vector a and the function f, and we use the function ls to list the objects in the global environment (you can use the name or envir arguments of ls to view the objects in other environments; see ?ls, and more on that in a bit), and environment() tells us what environment we are in. We can also use environment to tell us what environment any function belongs to:

environment(f)
#> <environment: R_GlobalEnv>
environment(mean)
#> <environment: 0x7feb16ee0208>
environment(lm)
#> <environment: namespace:stats>

We see that f is a function defined in the global environment, whereas mean and lm belong to “namespaces” called base and stats, respectively.
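As a rough sketch of how ls can look into environments other than the global one, the block below uses a hypothetical environment my_env, and asks for stats::lm explicitly so that the result does not depend on what is attached in your session.

```r
# A hypothetical environment holding a single object
my_env <- new.env()
assign("z", 100, envir = my_env)

ls(envir = my_env)                        # "z"
head(ls("package:stats"), 3)              # ls also accepts a search-path name

# environmentName gives a readable label for an environment
environmentName(environment(stats::lm))   # "stats"
environmentName(globalenv())              # "R_GlobalEnv"
```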

3.2 The package environment and namespaces

This last point on namespaces brings us to packages. Packages have their own environments, as well as a namespace environment. Let’s let Hadley Wickham explain this:

Every function in a package is associated with a pair of environments: the package environment, which you learned about earlier, and the namespace environment.

The package environment is the external interface to the package. It’s how you, the R user, find a function in an attached package or with ::. Its parent is determined by search path, i.e. the order in which packages have been attached.

The namespace environment is the internal interface to the package. The package environment controls how we find the function; the namespace controls how the function finds its variables.

There are a few things to dive into in that explanation. First, you will note the mention of ::. That has already appeared several times in examples in these first two modules. When you create an R package, it makes a package environment that contains its functions. You access the functions in the package environment in one of two ways:

rst <- raster::raster(nrows = 10, ncols = 10, vals = 1:100)
raster::plot(rst)

This is the first way we tried to do it, with the exception that we now use the :: to get access to raster’s plot method (we didn’t do that before), so that rst can actually be plotted.

The second way uses the function library to load and attach an installed package, here raster, which makes the raster function available in the search path, so it can be called without using the packagename::function_name format.

library(raster)
rst <- raster(nrows = 10, ncols = 10, vals = 1:100)
plot(rst)

So what is the search path? That is answered by search, which tells you all the package environments that are attached.

search()
#>  [1] ".GlobalEnv"        "package:lubridate" "package:knitr"    
#>  [4] "package:RStoolbox" "package:raster"    "package:sp"       
#>  [7] "package:devtools"  "package:usethis"   "package:geospaar" 
#> [10] "package:forcats"   "package:stringr"   "package:dplyr"    
#> [13] "package:purrr"     "package:readr"     "package:tidyr"    
#> [16] "package:tibble"    "package:ggplot2"   "package:tidyverse"
#> [19] "package:sf"        "package:terra"     "tools:rstudio"    
#> [22] "package:stats"     "package:graphics"  "package:grDevices"
#> [25] "package:utils"     "package:datasets"  "package:methods"  
#> [28] "Autoloads"         "package:base"

These are ordered hierarchically, such that the immediate environment is the “.GlobalEnv”, which contains any globally defined functions, followed immediately by the last package you attached (using either library or require), the second-to-last you attached, etc, all the way to the base package.

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following objects are masked from 'package:raster':
#> 
#>     area, select
#> The following object is masked from 'package:dplyr':
#> 
#>     select
#> The following object is masked from 'package:terra':
#> 
#>     area
search()
#>  [1] ".GlobalEnv"        "package:MASS"      "package:lubridate"
#>  [4] "package:knitr"     "package:RStoolbox" "package:raster"   
#>  [7] "package:sp"        "package:devtools"  "package:usethis"  
#> [10] "package:geospaar"  "package:forcats"   "package:stringr"  
#> [13] "package:dplyr"     "package:purrr"     "package:readr"    
#> [16] "package:tidyr"     "package:tibble"    "package:ggplot2"  
#> [19] "package:tidyverse" "package:sf"        "package:terra"    
#> [22] "tools:rstudio"     "package:stats"     "package:graphics" 
#> [25] "package:grDevices" "package:utils"     "package:datasets" 
#> [28] "package:methods"   "Autoloads"         "package:base"

Notice how the previous call to search showed “package:lubridate” being right after “.GlobalEnv”. In this last call, we attach the MASS package, which is then interposed between “.GlobalEnv” and “package:lubridate”. Another way of expressing this is in terms of parentage: the parent of “.GlobalEnv” is the last package you attached, the parent of that package is the one attached before it, and so on, so that all attached packages are ancestors of the “.GlobalEnv”. This is laid out nicely in section 7.4.1 of Advanced R.
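The parentage can be checked directly. The following sketch walks up the chain of parents from the global environment until it reaches the empty environment, which sits above the base package. The variable names env and chain are just for illustration.

```r
# Collect the name of each environment from .GlobalEnv upwards
env <- globalenv()
chain <- character(0)
while (!identical(env, emptyenv())) {
  chain <- c(chain, environmentName(env))
  env <- parent.env(env)
}

chain                              # compare this with the output of search()
length(chain) == length(search())  # one entry per environment on the search path
```

The labels differ slightly at the two ends (environmentName reports "R_GlobalEnv" and "base" where search reports ".GlobalEnv" and "package:base"), but the order is the same.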

This ordering or parentage matters to us for at least one important reason, and this relates to whether functions are exported from packages or not. If a function is exported (recall how we exported our first function in Module 1) it then becomes publicly usable in the package environment. The thing is, however, that the same function name might be used by more than one package (and not as generic functions). If you try to attach both packages, the function in the most recently attached of the two packages will mask the function from the other one. This is demonstrated by the following examples in which we attach dplyr (a package we will use more in the next modules) and raster in alternating sequence.

library(dplyr)
library(raster)
#> 
#> Attaching package: 'raster'
#> The following object is masked from 'package:MASS':
#> 
#>     select
#> The following object is masked from 'package:dplyr':
#> 
#>     select
detach("package:raster", unload = TRUE)
#> Warning: 'raster' namespace cannot be unloaded:
#>   namespace 'raster' is imported by 'RStoolbox', 'exactextractr' so cannot be unloaded
detach("package:dplyr", unload = TRUE)
#> Warning: 'dplyr' namespace cannot be unloaded:
#>   namespace 'dplyr' is imported by 'tidyr', 'RStoolbox', 'broom', 'dbplyr', 'recipes' so cannot be unloaded
library(raster)
#> 
#> Attaching package: 'raster'
#> The following object is masked from 'package:MASS':
#> 
#>     select
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:raster':
#> 
#>     intersect, select, union
#> The following object is masked from 'package:MASS':
#> 
#>     select
#> The following objects are masked from 'package:terra':
#> 
#>     intersect, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dplyr masks functions from a bunch of packages, including base and stats, but when it is attached before raster, dplyr’s select function is masked by raster’s function of the same name. Something different happens when we detach both packages, and then attach raster followed by dplyr, which masks three functions from raster: intersect, select, and union.

This matters because when such package conflicts arise, you have to call the function you want using the packagename::function_name format, e.g. raster::intersect in the last example. Otherwise, a call to just intersect will give you dplyr::intersect, which won’t be able to operate on a raster.
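The same masking logic can be sketched without attaching any extra packages, because anything defined in the global environment comes first on the search path. The replacement union function below is a deliberately silly, hypothetical one.

```r
# A global function named union masks base::union
union <- function(x, y) "my own union"

masked <- union(1:2, 2:3)          # finds the global version first
explicit <- base::union(1:2, 2:3)  # :: asks for the base version by name
masked
explicit

rm(union)                          # remove the global version...
restored <- union(1:2, 2:3)        # ...and base::union is found again
restored
```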

That is why package developers are encouraged to export functions sparingly:

Generally, you want to export a minimal set of functions; the fewer you export, the smaller the chance of a conflict. While conflicts aren’t the end of the world because you can always use :: to disambiguate, they’re best avoided where possible because it makes the lives of your users easier.

Package functions can always be left as functions internal to the package, that is, they exist in the package’s namespace environment. Such internal functions might be used by one of the exported functions. You can actually access such functions from outside the package by using the ::: operator, although this is not recommended.
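As a rough illustration of the difference between what a package exports and what lives in its namespace, the sketch below compares the two for the stats package. The choice of stats and of plot.lm as an example of a non-exported function are assumptions made for the sake of the demonstration.

```r
exported <- getNamespaceExports("stats")
everything <- ls(asNamespace("stats"), all.names = TRUE)
internal <- setdiff(everything, exported)

length(exported)         # number of names stats makes public
length(internal)         # number of names that stay internal
"lm" %in% exported       # TRUE: lm is part of the public interface

# ::: reaches into the namespace for a non-exported object
is.function(stats:::plot.lm)
```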

3.3 The function environment

The last environment we will discuss is the function environment, which actually has three more environmental terms associated with it: the enclosing environment, the binding environment, and the execution environment, according to this section in one version of Advanced R. We won’t worry about the first two, except to note that the enclosing environment is the one in which the function was created–so if you define a function in a script but don’t add it to a package, it will be enclosed by the global environment.

The execution environment is important to know for everyday programming purposes (in my opinion, at any rate). It is a temporary environment that is created within functions when they are executed. Here’s a short demonstration of how the function environment differs from the global environment.

x <- 10
f <- function() {
  x <- 20
  return(x)
}

x
#> [1] 10
f()
#> [1] 20

Here x is a numeric vector (value 10) and f is a function that creates a numeric vector x (value 20) inside the function body. On execution, the function returns the value of this internal variable x, not the value of the globally defined x (10). This is because the execution environment is separate from the enclosing environment, and a new, clean environment is created each time the function is executed (called) and then discarded on completion.
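One way to convince yourself that the execution environment is temporary is to list its contents from inside the function and then look for the same object afterwards. This is only a sketch, and tmp_inside is a hypothetical name.

```r
g <- function() {
  tmp_inside <- 20   # exists only while g() is running
  ls()               # lists the contents of the execution environment
}

inside <- g()
inside                                   # "tmp_inside"
exists("tmp_inside", inherits = FALSE)   # FALSE: it was discarded with g()'s environment
```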

You can modify what’s going on inside the execution environment using a global variable, although this is probably not a great idea.

x <- 10
f <- function() {
  x <- 20 * x
  return(x)
}
f()
#> [1] 200

x <- 10
f <- function(x) {
  x <- 20 * x
  return(x)
}
f()
#> Error in f(): argument "x" is missing, with no default
f(x)
#> [1] 200
f(10)
#> [1] 200
f(x = x)
#> [1] 200

The first example shows that if you define x in the global environment and then assign the value 20 * x to x in the function body, the answer returned is 200 (20 * 10). The function does not find x in its own execution environment when it evaluates 20 * x, so R looks in the enclosing (here global) environment. In the second example, we define the function f as having an argument x, and then try to execute f() as we did before. That fails, because the argument x has no default, so a value has to be supplied in the call. How about f(x)? That works, because now you are telling f that you want to input the value stored in the global variable x, which is the same as running f(10). f(x) is shorthand for the more explicit f(x = x) (passing the value of global variable x to argument x).

The upshot of this all is that f returns the same value on each execution, even though each result comes from a separate execution for which a new environment is created.

You can see that it is a different environment on each execution with the following modification:

x <- 10
f <- function(x) {
  x <- 20 * x
  environment()
}
f(x)
#> <environment: 0x7feaf263fcf0>
f(10)
#> <environment: 0x7feaf269d858>
f(x = x)
#> <environment: 0x7feaf2547900>

The function is modified to return the value from environment, which is the execution environment itself. Because this environment has no name, it is printed as a memory address, which is simply a complicated hex-string. Note, however, that the string changes on each execution, indicating that the environment is not the same.

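This is important to know because an ordinary assignment (<-) inside a function’s execution environment does not modify the value of a global object. (R does provide the <<- operator and the assign function for that purpose, but using them is generally discouraged.)

Control structures such as a for loop behave differently, because they do not create a new environment.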

x <- 10
for(i in 1:3) {
  x <- 20 * x
  print(x)
  print(environment())
}
#> [1] 200
#> <environment: R_GlobalEnv>
#> [1] 4000
#> <environment: R_GlobalEnv>
#> [1] 80000
#> <environment: R_GlobalEnv>
x
#> [1] 80000

Note that the value of x gets updated in each iteration, and the environment inside the {} is still part of the global environment (and thus not the execution environment of a function).
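To contrast the loop with a function call, here is a minimal sketch of the two kinds of assignment inside a function. The function names f_local and f_super are hypothetical. Note that R’s <<- operator does reach outside the execution environment, although relying on it is usually discouraged because it makes functions harder to reason about.

```r
x <- 10

# Ordinary assignment creates a local x and leaves the outer x alone
f_local <- function() {
  x <- 20 * x
  x
}
res_local <- f_local()
res_local    # 200
x            # still 10

# Superassignment looks for x in the enclosing environments
# and modifies it there
f_super <- function() {
  x <<- 20 * x
  invisible(x)
}
x_before <- x
f_super()
x            # now 200
```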

Okay, so that’s it for environments, and actually for this whole module. There is no formal assignment for this module, just some questions to answer.

4 Questions to answer

What are the data types in R?
What is the difference between a matrix and data.frame?
What do a data.frame and a list have in common?
What is a generic function, and how does it relate to object-oriented programming?
What is one difference between S3 and S4 object-oriented systems?
If you create the object a <- 1:10 in the R console, what environment would you find the object in?
How many times does the execution environment in a function get reused?
