Helpful functions

GEOG246-346

1 strings

1.1 paste, paste0

Concatenate strings. paste includes the option to add a specific separator, like a space or hyphen.

paste("a", "b", "c", sep = " ")
#> [1] "a b c"
paste("a", "b", "c", sep = "-")
#> [1] "a-b-c"

paste0 assumes there is no separator.

paste0("a", "b", "c")
#> [1] "abc"
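
Both functions are vectorized, which is handy for building labels. The sketch below is a minimal illustration; the `ids` vector and the "site" prefix are made up for the example.

```r
ids <- 1:3

# sep is placed between the arguments, element by element
labels <- paste("site", ids, sep = "_")
print(labels)
#> [1] "site_1" "site_2" "site_3"

# collapse joins the resulting vector into a single string
one_string <- paste0("site", ids, collapse = ", ")
print(one_string)
#> [1] "site1, site2, site3"
```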

1.2 str_replace

From the stringr package. Replaces the first match of a pattern in a string.

library(stringr)
v <- "king phillip came over for good soup"
print(v)
#> [1] "king phillip came over for good soup"
w <- stringr::str_replace(v, "soup", "spaghetti" )
print(w)
#> [1] "king phillip came over for good spaghetti"

1.3 str_replace_all

Similar to str_replace, but str_replace_all replaces every match instead of only the first one.

library(stringr)
v <- "it was the best of times it was the worst of times"
print(v)
#> [1] "it was the best of times it was the worst of times"
w <- stringr::str_replace(v, "times", "spaghetti" )
print(w) ## only first "times" replaced
#> [1] "it was the best of spaghetti it was the worst of times"
x <- stringr::str_replace_all(v, "times", "spaghetti" ) 
print(x) ## all "times" replaced
#> [1] "it was the best of spaghetti it was the worst of spaghetti"

1.4 str_sub

Creates substrings based on character position. The start and end positions are both included.

library(stringr)
v <- "it was the best of times it was the worst of times"
print(v)
#> [1] "it was the best of times it was the worst of times"
w <- stringr::str_sub(v, 1, 10)
print(w)
#> [1] "it was the"
x <- stringr::str_sub(v, 11, 20)
print(x)
#> [1] " best of t"
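
str_sub also accepts negative positions, which count from the end of the string. A minimal sketch, reusing the sentence from the example above:

```r
library(stringr)

v <- "it was the best of times it was the worst of times"

# -1 is the last character, so -5 to -1 is the last five characters
last_five <- stringr::str_sub(v, -5, -1)
print(last_five)
#> [1] "times"
```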

2 dates

2.1 as_date

From the lubridate package. Converts a character string to a date.

Dates in “YYYY-MM-DD” format don’t need additional information.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:raster':
#> 
#>     intersect, union
#> The following objects are masked from 'package:terra':
#> 
#>     intersect, union
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
a <- as_date("2020-11-01")
print(a)
#> [1] "2020-11-01"

Dates in other formats may need the format parameter. Run ?strptime to see the available format codes.

date2 <- as_date("3/1/22", format = "%m/%d/%y" )
date2
#> [1] "2022-03-01"

You can also convert from date to character, using format().

date2_char <- format(date2, format = "%A %B %d, %Y")
date2_char
#> [1] "Tuesday March 01, 2022"
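
For comparison, the same conversions can be sketched in base R with as.Date() and format(). This is only an illustration of the format codes, not a replacement for the lubridate workflow above. It sticks to numeric format codes so the output does not depend on the locale.

```r
# Parse a month/day/two-digit-year string
d <- as.Date("3/1/22", format = "%m/%d/%y")
print(d)
#> [1] "2022-03-01"

# Convert the date back to a character string in another layout
d_char <- format(d, "%Y/%m/%d")
print(d_char)
#> [1] "2022/03/01"

# Dates support arithmetic in units of days
d_later <- d + 30
print(d_later)
#> [1] "2022-03-31"
```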

2.2 as_datetime

Similar to as_date, except you can include a time.

date3 <- as_datetime("2000-05-09 10:00:00", tz = "EST")
date3
#> [1] "2000-05-09 10:00:00 EST"

3 dplyr

library(dplyr)

3.1 pipe operator ( %>% )

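Use the pipe operator to chain commands. The value on the left becomes the first argument of the function on the right, so a %>% sqrt() is the same as sqrt(a).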

a <- c(9, 16, 25)
b <- a %>% sqrt()
b
#> [1] 3 4 5

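Commonly used with tibbles. Note that in dplyr, you don’t need to use quotes for column names.

The example below keeps the rows where Element is “Production” and groups them by Area. Because summarise(Value) is called without an aggregating function such as mean(), it returns every Value in each group (114 rows) rather than one summary row per Area.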

library(geospaar)
f <- system.file("extdata/FAOSTAT_maize.csv", package = "geospaar")
maize <- read.csv(f)
site_summary <- maize %>% 
  filter(Element == "Production") %>% 
  group_by(Area) %>% 
  summarise(Value)
#> `summarise()` has grouped output by 'Area'. You can override using the
#> `.groups` argument.
site_summary
#> # A tibble: 114 × 2
#> # Groups:   Area [2]
#>    Area           Value
#>    <chr>          <int>
#>  1 South Africa 5293000
#>  2 South Africa 6024000
#>  3 South Africa 6127000
#>  4 South Africa 4310000
#>  5 South Africa 4608000
#>  6 South Africa 5161000
#>  7 South Africa 9802000
#>  8 South Africa 5358000
#>  9 South Africa 5378000
#> 10 South Africa 6179000
#> # … with 104 more rows
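
To collapse each group to a single row, summarise() needs an aggregating function. The sketch below uses a small made-up tibble (`toy`, with hypothetical columns `site` and `value`) rather than the maize data, so the numbers are easy to check by hand.

```r
library(dplyr)

# Hypothetical data: two sites with two observations each
toy <- tibble(site = c("a", "a", "b", "b"),
              value = c(1, 3, 10, 20))

# One row per site, holding the mean of value within that site
toy_summary <- toy %>%
  group_by(site) %>%
  summarise(mean_value = mean(value))
toy_summary
```

Here site a has a mean of 2 and site b a mean of 15.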

By default, %>% passes the left-hand value as the first argument. Use . to place it in a different argument instead. For example, to take a sample of size 50 from the numbers 1:100:

sample_size <- 50
samples <- sample_size %>% sample(1:100, ., replace = FALSE)

3.2 mutate

Creates a new column based on calculations you define.

set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b) ## new column is product of columns a, b
tib
#> # A tibble: 10 × 3
#>        a     b product
#>    <int> <int>   <int>
#>  1     1    68      68
#>  2     2    39      78
#>  3     3     1       3
#>  4     4    34     136
#>  5     5    87     435
#>  6     6    43     258
#>  7     7    14      98
#>  8     8    82     656
#>  9     9    59     531
#> 10    10    51     510

3.3 dplyr::select

set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b) ## new column is product of columns a, b
tib
#> # A tibble: 10 × 3
#>        a     b product
#>    <int> <int>   <int>
#>  1     1    68      68
#>  2     2    39      78
#>  3     3     1       3
#>  4     4    34     136
#>  5     5    87     435
#>  6     6    43     258
#>  7     7    14      98
#>  8     8    82     656
#>  9     9    59     531
#> 10    10    51     510
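
select keeps or drops columns by name. The output above shows only the input tibble, so the following is a minimal sketch of what a select call on it might look like. It is written as dplyr::select because other attached packages (raster, for example) also define a select function.

```r
library(dplyr)

set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10)) %>%
  mutate(product = a * b)

# Keep only columns a and product
tib_keep <- tib %>% dplyr::select(a, product)
names(tib_keep)
#> [1] "a"       "product"

# A minus sign drops a column instead
tib_drop <- tib %>% dplyr::select(-product)
names(tib_drop)
#> [1] "a" "b"
```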

3.4 arrange

Sorts rows by a column. The default is ascending order. You can also sort by multiple columns.

set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b) ## new column is product of columns a, b
## sort tib by product column
tib_sorted <- tib %>% arrange(product)
tib_sorted
#> # A tibble: 10 × 3
#>        a     b product
#>    <int> <int>   <int>
#>  1     3     1       3
#>  2     1    68      68
#>  3     2    39      78
#>  4     7    14      98
#>  5     4    34     136
#>  6     6    43     258
#>  7     5    87     435
#>  8    10    51     510
#>  9     9    59     531
#> 10     8    82     656

Use - (or desc()) for descending order.

tib_sorted <- tib %>% arrange(-product)
tib_sorted
#> # A tibble: 10 × 3
#>        a     b product
#>    <int> <int>   <int>
#>  1     8    82     656
#>  2     9    59     531
#>  3    10    51     510
#>  4     5    87     435
#>  5     6    43     258
#>  6     4    34     136
#>  7     7    14      98
#>  8     2    39      78
#>  9     1    68      68
#> 10     3     1       3
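
Sorting by several columns works by listing them in order of priority, and desc() reverses the order of one column. A minimal sketch with a made-up tibble (`toy`, with hypothetical columns `g` and `v`):

```r
library(dplyr)

toy <- tibble(g = c("x", "y", "x", "y"),
              v = c(2, 1, 4, 3))

# Sort by g ascending, then by v descending within each value of g
toy_sorted <- toy %>% arrange(g, desc(v))
toy_sorted$g
#> [1] "x" "x" "y" "y"
toy_sorted$v
#> [1] 4 2 3 1
```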

4 control structures

4.1 for loops

In a for loop, the body runs once for each item in the object you iterate over. So if the loop starts for(k in items), then items is that object (the iterator) and k takes each of its values in turn.

set.seed(2)
items <- sample(1:100, 5)
for(k in items){
  print(k)
  print(paste0("This value is ", k))
}
#> [1] 85
#> [1] "This value is 85"
#> [1] 79
#> [1] "This value is 79"
#> [1] 70
#> [1] "This value is 70"
#> [1] 6
#> [1] "This value is 6"
#> [1] 32
#> [1] "This value is 32"
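
A loop often needs to store a result for each item rather than print it. One common pattern, sketched here with made-up values, is to create the output vector first and loop over positions with seq_along().

```r
items <- c(85, 79, 70)

# Create the output vector first, then fill it position by position
squares <- numeric(length(items))
for(i in seq_along(items)){
  squares[i] <- items[i]^2
}
print(squares)
#> [1] 7225 6241 4900
```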

4.2 if-else

items <- sample(LETTERS, 10)

for(k in items){
  print(k)
  if(k %in% c("A", "E", "I", "O", "U")){
    print("vowel")
  } else {
    print("consonant")
  }
}
#> [1] "H"
#> [1] "consonant"
#> [1] "Q"
#> [1] "consonant"
#> [1] "Y"
#> [1] "consonant"
#> [1] "L"
#> [1] "consonant"
#> [1] "I"
#> [1] "vowel"
#> [1] "R"
#> [1] "consonant"
#> [1] "K"
#> [1] "consonant"
#> [1] "A"
#> [1] "vowel"
#> [1] "C"
#> [1] "consonant"
#> [1] "P"
#> [1] "consonant"
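
When the same test is applied to every element of a vector, the vectorized ifelse() function can stand in for a loop with if-else inside it. A minimal sketch, using a fixed set of letters instead of a random sample:

```r
items <- c("H", "I", "A", "C")

# ifelse(test, value_if_true, value_if_false) works element by element
kind <- ifelse(items %in% c("A", "E", "I", "O", "U"), "vowel", "consonant")
print(kind)
#> [1] "consonant" "vowel"     "vowel"     "consonant"
```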

4.3 lapply

lapply applies a function to each element and returns the results in a list.

v1 <- 1:5
v2 <- lapply(v1, function(x){
  x^2  ## the last expression evaluated is returned
})
print(v2)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 4
#> 
#> [[3]]
#> [1] 9
#> 
#> [[4]]
#> [1] 16
#> 
#> [[5]]
#> [1] 25

4.4 sapply

sapply works like lapply but returns the results in a vector (when possible).

v1 <- 1:5
v2 <- sapply(v1, function(x){
  x^2  ## the last expression evaluated is returned
})
print(v2)
#> [1]  1  4  9 16 25
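
The "when possible" matters: if the results differ in length, sapply returns a list instead of a vector. The sketch below shows that case, along with vapply(), a base R alternative that states the expected result type up front and raises an error if a result does not match.

```r
v1 <- 1:5

# vapply checks that every result is a single numeric value
squares <- vapply(v1, function(x) x^2, numeric(1))
print(squares)
#> [1]  1  4  9 16 25

# Results of different lengths cannot be simplified, so sapply returns a list
ragged <- sapply(1:3, function(x) seq_len(x))
is.list(ragged)
#> [1] TRUE
```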

4.5 apply

apply works well for 2D objects such as matrices and data frames.

set.seed(3)
v1 <- sample(1:100, 5)
v2 <- sample(1:100, 5)
DF <- data.frame(v1, v2) ## data frame columns will take names of vectors
DF
#>   v1 v2
#> 1  5 95
#> 2 58  8
#> 3 12 20
#> 4 36 74
#> 5 99 55

Use 1 as the second argument (MARGIN) to apply the function to each row.

## index 1 for rows
rowMax <- apply(DF, 1, FUN = max)
rowMax
#> [1] 95 58 20 74 99

Use 2 as the second argument (MARGIN) to apply the function to each column.

## index 2 for columns
colMax <- apply(DF, 2, FUN = max)
colMax
#> v1 v2 
#> 99 95
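
The function passed to apply can also be one you write yourself. A minimal sketch on a small matrix (the matrix `m` is made up for the example):

```r
m <- matrix(1:6, nrow = 2)
m
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

# MARGIN = 1: one result per row
row_totals <- apply(m, 1, sum)
print(row_totals)
#> [1]  9 12

# MARGIN = 2 with an anonymous function: range (max minus min) of each column
col_ranges <- apply(m, 2, function(x) max(x) - min(x))
print(col_ranges)
#> [1] 1 1 1
```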

5 sampling

5.1 sample

sample is used for picking samples from a discrete object, like a vector.

v1 <- sample(1:100, 5)
v2 <- sample(letters, 5)
print(v1)
#> [1] 40 48  8 37 66
print(v2)
#> [1] "l" "m" "e" "h" "x"
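
Two details are worth a short sketch: set.seed() makes a random draw repeatable, and replace = TRUE allows the same value to be picked more than once. The seed value 10 and the dice example are arbitrary choices for illustration.

```r
# The same seed gives the same sample
set.seed(10)
first <- sample(1:100, 5)
set.seed(10)
second <- sample(1:100, 5)
identical(first, second)
#> [1] TRUE

# Without replacement you cannot draw more items than exist;
# with replace = TRUE you can, e.g. 20 rolls of a six-sided die
rolls <- sample(1:6, 20, replace = TRUE)
length(rolls)
#> [1] 20
```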

5.2 runif

runif samples from a uniform distribution (equal probability for all values in the defined interval).

The example below picks 5 values from a uniform distribution between 0 and 2.

set.seed(4)
v <- runif(5, min = 0, max = 2)
v
#> [1] 1.17160061 0.01789159 0.58747922 0.55474992 1.62714843

5.3 rnorm

rnorm samples from a normal distribution. You can define the mean and standard deviation.

set.seed(4)
v <- rnorm(5, mean = 10, sd = 3)
v
#> [1] 10.650265  8.372522 12.673434 11.787942 14.906854
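
With only five values the sample mean and standard deviation can be far from the parameters you set. As a rough check, the sketch below draws a much larger sample (10000 is an arbitrary choice) and compares its mean and standard deviation to the requested 10 and 3.

```r
set.seed(4)
big <- rnorm(10000, mean = 10, sd = 3)

# Both should be close to, but not exactly, the requested values
big_mean <- mean(big)
big_sd <- sd(big)
print(round(c(big_mean, big_sd), 1))
```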

6 read/write

6.1 read/write csv’s

You can use base R read.csv() or readr’s read_csv(). read.csv() returns a data frame, while read_csv() returns a tibble.

f <- system.file("extdata/FAOSTAT_maize.csv", package = "geospaar")
maize <- read.csv(f)
print(class(maize))
#> [1] "data.frame"
maize2 <- readr::read_csv(f)
#> Rows: 228 Columns: 14
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Descr...
#> dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(class(maize2))
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
maize2
#> # A tibble: 228 × 14
#>    Domain …¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
#>    <chr>     <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
#>  1 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1961  1961
#>  2 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1962  1962
#>  3 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1963  1963
#>  4 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1964  1964
#>  5 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1965  1965
#>  6 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1966  1966
#>  7 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1967  1967
#>  8 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1968  1968
#>  9 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1969  1969
#> 10 QC        Crops      202 Sout…    5312 Area h…      56 Maize    1970  1970
#> # … with 218 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#> #   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#> #   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`,
#> #   ⁵​`Year Code`
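
The examples above cover reading only. Writing can be sketched with base R write.csv(). The data frame and the temporary file path below are made up for the example; row.names = FALSE keeps the row numbers from being written as an extra column.

```r
df <- data.frame(site = c("a", "b"), value = c(1.5, 2.5))

# Write to a temporary file, then read it back
f_out <- tempfile(fileext = ".csv")
write.csv(df, f_out, row.names = FALSE)
df_back <- read.csv(f_out)

names(df_back)
#> [1] "site"  "value"
df_back$value
#> [1] 1.5 2.5
```

readr offers write_csv() as the counterpart to read_csv().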

6.2 save/load

save and load write and read R objects in the RData format. Use the extension .rda. You can save any R object this way (data frames, tibbles, lists, rasters, etc.).

f <- system.file("extdata/FAOSTAT_maize.csv", package = "geospaar")
maize <- read.csv(f)
save(maize, file = "~/maize.rda") ## save to your user home

When you load data, it will retain the variable name it had.

maize <- NULL
load(file = "~/maize.rda") ## data will be loaded into "maize" variable
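
To see that the variable name comes back with the data, here is a minimal sketch that uses a temporary file instead of the home directory. The `scores` data frame is made up for the example. load() also returns the names of the objects it restored.

```r
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

f_rda <- tempfile(fileext = ".rda")
save(scores, file = f_rda)

# Remove the object, then restore it from disk
rm(scores)
loaded_names <- load(f_rda)
print(loaded_names)
#> [1] "scores"
nrow(scores)
#> [1] 3
```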

7 table indexes

7.1 Base R

Use [ , ] notation. Row conditions (filtering) go to the left of the comma. Column conditions (selecting columns) go to the right.

DF <- data.frame(v1 = 1:5, v2 = 6:10)
rownames(DF) <- LETTERS[1:5]
DF
#>   v1 v2
#> A  1  6
#> B  2  7
#> C  3  8
#> D  4  9
#> E  5 10
DF[,'v2'] ## column indexing
#> [1]  6  7  8  9 10
DF[c("A", "B", "D"), ] ## row indexing
#>   v1 v2
#> A  1  6
#> B  2  7
#> D  4  9

Subsetting data.

DF[DF$v1 > 3, ] ## get observations (rows) where first column is larger than 3
#>   v1 v2
#> D  4  9
#> E  5 10
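
Row and column conditions can be combined in one call. The sketch below also shows drop = FALSE, which keeps the result as a data frame when only one column is selected.

```r
DF <- data.frame(v1 = 1:5, v2 = 6:10)

# Rows where v1 > 3, column v2 only: returns a plain vector
v2_vec <- DF[DF$v1 > 3, "v2"]
print(v2_vec)
#> [1]  9 10

# Same selection, but kept as a one-column data frame
v2_df <- DF[DF$v1 > 3, "v2", drop = FALSE]
class(v2_df)
#> [1] "data.frame"
```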

7.2 dplyr

Use filter for row conditions and dplyr::select to select columns.

DF <- tibble(v1 = 1:5, v2 = 6:10)
rownames(DF) <- LETTERS[1:5]
#> Warning: Setting row names on a tibble is deprecated.
DF
#> # A tibble: 5 × 2
#>      v1    v2
#> * <int> <int>
#> 1     1     6
#> 2     2     7
#> 3     3     8
#> 4     4     9
#> 5     5    10

Filter to rows where v1 is greater than 3.

DF_filt <- DF %>% filter(v1 > 3)
DF_filt
#> # A tibble: 2 × 2
#>      v1    v2
#> * <int> <int>
#> 1     4     9
#> 2     5    10

Same as above but only show column v2.

DF_filt <- DF %>% filter(v1 > 3) %>% dplyr::select(v2)
DF_filt
#> # A tibble: 2 × 1
#>      v2
#>   <int>
#> 1     9
#> 2    10

7.3 slice

slice is a dplyr function to select rows by number.

Select the second and third rows.

DF_filt <- DF %>% slice(2:3)
DF_filt
#> # A tibble: 2 × 2
#>      v1    v2
#> * <int> <int>
#> 1     2     7
#> 2     3     8

7.4 head, tail

head selects the first n rows in a data frame or tibble. tail selects the last n rows.

DF_head <- DF %>% head(2)
DF_head
#> # A tibble: 2 × 2
#>      v1    v2
#>   <int> <int>
#> 1     1     6
#> 2     2     7
DF_tail <- DF %>% tail(2)
DF_tail
#> # A tibble: 2 × 2
#>      v1    v2
#>   <int> <int>
#> 1     4     9
#> 2     5    10

8 table functions

8.1 cbind, rbind
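
This section has no example in the original, so the following is only a minimal sketch with made-up data frames. cbind puts objects side by side as columns and needs the same number of rows. rbind stacks them as rows and needs matching column names.

```r
df1 <- data.frame(id = 1:3, v1 = c(10, 20, 30))
df2 <- data.frame(v2 = c("a", "b", "c"))
df3 <- data.frame(id = 4:5, v1 = c(40, 50))

# Add the column(s) of df2 to df1
combined_cols <- cbind(df1, df2)
dim(combined_cols)
#> [1] 3 3

# Add the rows of df3 to df1
combined_rows <- rbind(df1, df3)
dim(combined_rows)
#> [1] 5 2
```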

8.2 joins
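
This section has no example in the original, so here is a minimal sketch using two dplyr join functions. The tibbles `sites` and `yields` and their columns are hypothetical. left_join keeps every row of the first table. inner_join keeps only rows with a match in both.

```r
library(dplyr)

sites <- tibble(site = c("a", "b", "c"),
                region = c("north", "south", "south"))
yields <- tibble(site = c("a", "b", "d"),
                 yield = c(1.2, 3.4, 5.6))

# All three sites are kept; site c has no yield, so it gets NA
left <- left_join(sites, yields, by = "site")
left$yield
#> [1] 1.2 3.4  NA

# Only sites a and b appear in both tables
inner <- inner_join(sites, yields, by = "site")
inner$site
#> [1] "a" "b"
```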

8.3 pivot_longer
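
This section has no example in the original. As a minimal sketch, pivot_longer from the tidyr package turns several columns into name/value pairs, making the table longer. The data frame and the column names `year` and `value` are made up for the example.

```r
library(tidyr)

wide <- data.frame(site = c("a", "b"),
                   y2020 = c(1, 2),
                   y2021 = c(3, 4))

# Column names go into "year", cell values into "value"
long <- pivot_longer(wide, cols = c(y2020, y2021),
                     names_to = "year", values_to = "value")
dim(long)
#> [1] 4 3
long$value
#> [1] 1 3 2 4
```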

8.4 pivot_wider
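
This section has no example in the original. As a minimal sketch, pivot_wider from the tidyr package does the reverse of pivot_longer: it spreads name/value pairs into separate columns. The data frame is made up for the example.

```r
library(tidyr)

long <- data.frame(site = c("a", "a", "b", "b"),
                   year = c("y2020", "y2021", "y2020", "y2021"),
                   value = c(1, 3, 2, 4))

# Each distinct year becomes its own column, filled from value
wide <- pivot_wider(long, names_from = year, values_from = value)
names(wide)
#> [1] "site"  "y2020" "y2021"
wide$y2021
#> [1] 3 4
```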

9 Other

9.1 which

Returns indices (position in vector) where a condition is true.

set.seed(1)
a <- sample(1:100, 20)
print(a)
#>  [1] 68 39  1 34 87 43 14 82 59 51 85 21 54 74  7 73 79 37 83 97
print(which(a > 80)) ## shows indices of elements greater than 80. 
#> [1]  5  8 11 19 20
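
The indices from which can be used to pull out the matching values. A minimal sketch with a short fixed vector:

```r
a <- c(68, 39, 1, 34, 87, 43, 14, 82)

idx <- which(a > 80)
print(idx)
#> [1] 5 8

# Use the indices to subset; a[a > 80] gives the same values
print(a[idx])
#> [1] 87 82
```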

9.2 which.min

Finds the index of the minimum value. Only the first location is returned, even if the minimum occurs more than once.

v <- c(5, 1, 10, 3, 10, 8, 1)
which.min(v) ## only returns index 2, even though 1 also appears at index 7
#> [1] 2

9.3 which.max

Finds the index of the maximum value. Only the first location is returned, even if the maximum occurs more than once.

v <- c(5, 1, 10, 3, 10, 8, 1)
which.max(v) ## only returns index 3, even though 10 also appears at index 5
#> [1] 3
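
If you need every position of the minimum or maximum, one option is to combine which with min or max, as in this short sketch:

```r
v <- c(5, 1, 10, 3, 10, 8, 1)

# All positions where the maximum occurs
all_max <- which(v == max(v))
print(all_max)
#> [1] 3 5

# All positions where the minimum occurs
all_min <- which(v == min(v))
print(all_min)
#> [1] 2 7
```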

9.4 unique

unique filters an object to its unique values.

set.seed(2)
birthdays <- sample(1:365, 50, replace = TRUE) ## sample 50 birthdays
print(birthdays)
#>  [1] 341 198 262 273 349 204 297 178  75 131 306 311  63 136 231 289  54 361
#> [19] 112 171  38 361 110 144  45 238 208 134 339   9 350 130 244   3 129 304
#> [37] 297 301 289 274   8 164 350  37 226 149 205 327 242 358
distinct_birthdays <- unique(birthdays)
print(distinct_birthdays)
#>  [1] 341 198 262 273 349 204 297 178  75 131 306 311  63 136 231 289  54 361
#> [19] 112 171  38 110 144  45 238 208 134 339   9 350 130 244   3 129 304 301
#> [37] 274   8 164  37 226 149 205 327 242 358
print(paste0(length(distinct_birthdays), " distinct birthdays"))
#> [1] "46 distinct birthdays"
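
A related base R function, duplicated, flags the repeats instead of removing them. A minimal sketch on a short fixed vector:

```r
x <- c(3, 1, 3, 2, 1)

# unique keeps the first occurrence of each value, in order of appearance
ux <- unique(x)
print(ux)
#> [1] 3 1 2

# duplicated is TRUE for every value that has appeared earlier
dup <- duplicated(x)
print(dup)
#> [1] FALSE FALSE  TRUE FALSE  TRUE
```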

10 getting help

10.1 ?

Use ? to open the help page for a function.

#?dplyr::mutate
#?strptime
