Helpful functions
GEOG246-346
1 strings
1.1 paste, paste0
Concatenate strings. paste lets you set a separator between the pieces, such as a space (the default) or a hyphen. paste0 uses no separator.
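A minimal sketch of the difference, using made-up strings:

```r
# paste joins with a single space unless sep says otherwise
with_space <- paste("GEOG", "246")
print(with_space)
#> [1] "GEOG 246"

# sep sets the separator explicitly
with_hyphen <- paste("GEOG", "246", sep = "-")
print(with_hyphen)
#> [1] "GEOG-246"

# paste0 joins with no separator at all
no_sep <- paste0("GEOG", "246")
print(no_sep)
#> [1] "GEOG246"
```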
1.2 str_replace
From the stringr package. Replaces the first match of a pattern in a string.
1.3 str_replace_all
Similar to str_replace, but str_replace_all replaces every match, not just the first.
library(stringr)
v <- "it was the best of times it was the worst of times"
print(v)
#> [1] "it was the best of times it was the worst of times"
w <- stringr::str_replace(v, "times", "spaghetti" )
print(w) ## only first "times" replaced
#> [1] "it was the best of spaghetti it was the worst of times"
x <- stringr::str_replace_all(v, "times", "spaghetti" )
print(x) ## all "times" replaced
#> [1] "it was the best of spaghetti it was the worst of spaghetti"
1.4 str_sub
Create substrings based on character index
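A short sketch of str_sub (from stringr), assuming a sample sentence like the one used above. The start and end positions are inclusive, and negative positions count back from the end of the string.

```r
library(stringr)

v <- "it was the best of times"

# characters 1 through 2
first_two <- str_sub(v, 1, 2)
print(first_two)
#> [1] "it"

# the last five characters, using negative positions
last_five <- str_sub(v, -5, -1)
print(last_five)
#> [1] "times"
```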
2 dates
2.1 as_date
From the lubridate package. Converts a character string to a date.
Dates in “YYYY-MM-DD” format don’t need additional information.
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:raster':
#>
#> intersect, union
#> The following objects are masked from 'package:terra':
#>
#> intersect, union
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
a <- as_date("2020-11-01")
print(a)
#> [1] "2020-11-01"
Dates in other formats may need the format parameter. Run ?strptime to see the available format codes.
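As an illustration, here is a sketch that parses a month/day/year string. The input string is made up; the format codes (%m, %d, %Y) are the ones documented in ?strptime.

```r
library(lubridate)

# "11/01/2020" is month/day/year, so spell that out in format
b <- as_date("11/01/2020", format = "%m/%d/%Y")
print(b)
#> [1] "2020-11-01"
```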
Can also convert from date to character.
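One way to go from date back to character is with base R's as.character() or format(); this is a sketch, not necessarily the approach the course intends.

```r
library(lubridate)

a <- as_date("2020-11-01")

# default ISO representation
a_chr <- as.character(a)
print(a_chr)
#> [1] "2020-11-01"

# custom layout using strptime-style codes
a_fmt <- format(a, "%m/%d/%Y")
print(a_fmt)
#> [1] "11/01/2020"
```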
3 dplyr
3.1 pipe operator ( %>% )
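Use the pipe operator to chain commands. It is commonly used with tibbles. Note that in dplyr you don’t need quotes around column names.
The example below filters the maize data to “Production” rows and groups them by Area. Because summarise(Value) names no summary function, it returns every row rather than one value per group; use something like summarise(mean_value = mean(Value)) for a per-group mean.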
library(geospaar)
f <- system.file("extdata/FAOSTAT_maize.csv", package = "geospaar")
maize <- read.csv(f)
site_summary <- maize %>%
  filter(Element == "Production") %>%
  group_by(Area) %>%
  summarise(Value)
#> `summarise()` has grouped output by 'Area'. You can override using the
#> `.groups` argument.
site_summary
#> # A tibble: 114 × 2
#> # Groups: Area [2]
#> Area Value
#> <chr> <int>
#> 1 South Africa 5293000
#> 2 South Africa 6024000
#> 3 South Africa 6127000
#> 4 South Africa 4310000
#> 5 South Africa 4608000
#> 6 South Africa 5161000
#> 7 South Africa 9802000
#> 8 South Africa 5358000
#> 9 South Africa 5378000
#> 10 South Africa 6179000
#> # … with 104 more rows
You can use . to specify which argument the left-hand side of %>% should be passed to. Say you want to take a sample of size 50 from the numbers 1:100.
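A sketch of that case: the 50 on the left is sent to the size argument rather than to the first argument of sample. The seed is only there to make the draw repeatable.

```r
library(dplyr)  # dplyr re-exports the %>% pipe

set.seed(1)
# equivalent to sample(1:100, size = 50)
s <- 50 %>% sample(1:100, size = .)
length(s)
#> [1] 50
```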
3.2 mutate
Creates a new column based on calculations you define.
set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b) ## new column is product of columns a, b
tib
#> # A tibble: 10 × 3
#> a b product
#> <int> <int> <int>
#> 1 1 68 68
#> 2 2 39 78
#> 3 3 1 3
#> 4 4 34 136
#> 5 5 87 435
#> 6 6 43 258
#> 7 7 14 98
#> 8 8 82 656
#> 9 9 59 531
#> 10 10 51 510
3.3 dplyr::select
set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b) ## new column is product of columns a, b
tib
#> # A tibble: 10 × 3
#> a b product
#> <int> <int> <int>
#> 1 1 68 68
#> 2 2 39 78
#> 3 3 1 3
#> 4 4 34 136
#> 5 5 87 435
#> 6 6 43 258
#> 7 7 14 98
#> 8 8 82 656
#> 9 9 59 531
#> 10 10 51 510
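A sketch of select on the same tib: it keeps only the columns you name. Writing it as dplyr::select avoids clashes with functions named select in other loaded packages (whether such a clash exists depends on your session).

```r
library(dplyr)

set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b)

# keep columns a and product, drop b
tib_sel <- tib %>% dplyr::select(a, product)
names(tib_sel)
#> [1] "a"       "product"
```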
3.4 arrange
Sorts rows by a column, in ascending order by default. You can also sort by multiple columns.
set.seed(1)
tib <- tibble(a = 1:10, b = sample(1:100, 10))
tib <- tib %>% mutate(product = a * b) ## new column is product of columns a, b
## sort tib by product column
tib_sorted <- tib %>% arrange(product)
tib_sorted
#> # A tibble: 10 × 3
#> a b product
#> <int> <int> <int>
#> 1 3 1 3
#> 2 1 68 68
#> 3 2 39 78
#> 4 7 14 98
#> 5 4 34 136
#> 6 6 43 258
#> 7 5 87 435
#> 8 10 51 510
#> 9 9 59 531
#> 10 8 82 656
Use - (or desc()) in front of the column name for descending order.
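A small sketch of descending order, using a made-up tibble. Both forms give the same result for a numeric column.

```r
library(dplyr)

tib_demo <- tibble(a = 1:5, product = c(68L, 78L, 3L, 136L, 435L))

# minus sign in front of the column
desc_minus <- tib_demo %>% arrange(-product)
desc_minus$product
#> [1] 435 136  78  68   3

# desc() helper, which also works for non-numeric columns
desc_helper <- tib_demo %>% arrange(desc(product))
```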
4 control structures
4.1 for loops
In a for loop, you perform the operations once for each item in the iterator. So if the loop starts with for(k in items), then items is the iterator.
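A minimal sketch with a made-up iterator: the body runs once per element, and k takes each value in turn.

```r
items <- c("maize", "wheat", "rice")
n_chars <- integer(0)

for (k in items) {
  # append the number of characters in the current item
  n_chars <- c(n_chars, nchar(k))
}
print(n_chars)
#> [1] 5 5 4
```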
4.2 if-else
items <- sample(LETTERS, 10)
for(k in items){
  print(k)
  if(k %in% c("A", "E", "I", "O", "U")){
    print("vowel")
  } else {
    print("consonant")
  }
}
#> [1] "H"
#> [1] "consonant"
#> [1] "Q"
#> [1] "consonant"
#> [1] "Y"
#> [1] "consonant"
#> [1] "L"
#> [1] "consonant"
#> [1] "I"
#> [1] "vowel"
#> [1] "R"
#> [1] "consonant"
#> [1] "K"
#> [1] "consonant"
#> [1] "A"
#> [1] "vowel"
#> [1] "C"
#> [1] "consonant"
#> [1] "P"
#> [1] "consonant"
4.3 lapply
lapply applies a function to each element and returns the results in a list.
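For example, summing each element of a small made-up list:

```r
nums <- list(a = 1:3, b = 4:6)

# one result per element, returned as a list with the same names
sums_list <- lapply(nums, sum)
class(sums_list)
#> [1] "list"
sums_list$b
#> [1] 15
```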
4.4 sapply
sapply applies a function to each element and returns the results in a vector (when possible).
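The same kind of call with sapply, sketched on a made-up list. Because every result is a single number, the output simplifies to a named vector.

```r
nums <- list(a = 1:3, b = 4:6)

sums_vec <- sapply(nums, sum)
print(sums_vec)
#>  a  b 
#>  6 15
is.vector(sums_vec)
#> [1] TRUE
```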
5 sampling
5.1 sample
sample is used for picking samples from a discrete object, like a vector.
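A sketch with made-up inputs. By default sample draws without replacement; set replace = TRUE to allow repeats.

```r
set.seed(1)  # only to make the draws repeatable

sites <- c("north", "south", "east", "west")

# two distinct sites
picked <- sample(sites, 2)

# ten die rolls, repeats allowed
rolls <- sample(1:6, 10, replace = TRUE)
```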
5.2 runif
runif samples from a uniform distribution (equal probability for all values in the defined interval).
The example below picks 5 values from a uniform distribution between 0 and 2.
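A sketch of that call; the seed is an addition so the draw can be repeated.

```r
set.seed(1)

# 5 values, each equally likely anywhere between 0 and 2
u <- runif(5, min = 0, max = 2)
print(u)
```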
6 read/write
6.1 read/write csv’s
You can use base R read.csv() or readr’s read_csv().
f <- system.file("extdata/FAOSTAT_maize.csv", package = "geospaar")
maize <- read.csv(f)
print(class(maize))
#> [1] "data.frame"
maize2 <- readr::read_csv(f)
#> Rows: 228 Columns: 14
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Descr...
#> dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(class(maize2))
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
maize2
#> # A tibble: 228 × 14
#> Domain …¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
#> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 QC Crops 202 Sout… 5312 Area h… 56 Maize 1961 1961
#> 2 QC Crops 202 Sout… 5312 Area h… 56 Maize 1962 1962
#> 3 QC Crops 202 Sout… 5312 Area h… 56 Maize 1963 1963
#> 4 QC Crops 202 Sout… 5312 Area h… 56 Maize 1964 1964
#> 5 QC Crops 202 Sout… 5312 Area h… 56 Maize 1965 1965
#> 6 QC Crops 202 Sout… 5312 Area h… 56 Maize 1966 1966
#> 7 QC Crops 202 Sout… 5312 Area h… 56 Maize 1967 1967
#> 8 QC Crops 202 Sout… 5312 Area h… 56 Maize 1968 1968
#> 9 QC Crops 202 Sout… 5312 Area h… 56 Maize 1969 1969
#> 10 QC Crops 202 Sout… 5312 Area h… 56 Maize 1970 1970
#> # … with 218 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#> # Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#> # ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`,
#> # ⁵`Year Code`
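Writing works the same way in reverse. Below is a sketch using base R write.csv() on a small made-up data frame and a temporary file; readr::write_csv() is the readr counterpart.

```r
crops <- data.frame(id = 1:3, crop = c("maize", "wheat", "rice"))

# write to a temporary file; row.names = FALSE avoids an extra index column
tmp_csv <- tempfile(fileext = ".csv")
write.csv(crops, tmp_csv, row.names = FALSE)

# read it back to check the round trip
crops2 <- read.csv(tmp_csv)
names(crops2)
#> [1] "id"   "crop"
```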
6.2 save/load
Saving and loading is used for RData objects. Use the extension .rda. You can save any R object this way (data frames, tibbles, lists, rasters, etc.).
f <- system.file("extdata/FAOSTAT_maize.csv", package = "geospaar")
maize <- read.csv(f)
save(maize, file = "~/maize.rda") ## save to your user home
When you load data, it will retain the variable name it had.
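A sketch of the round trip with a small made-up object and a temporary file (the object name maize_demo is hypothetical). load() restores the object under its saved name and invisibly returns that name.

```r
maize_demo <- data.frame(year = 1961:1963, value = c(10, 20, 30))

tmp_rda <- tempfile(fileext = ".rda")
save(maize_demo, file = tmp_rda)

# remove the object, then restore it from disk
rm(maize_demo)
restored <- load(tmp_rda)
print(restored)
#> [1] "maize_demo"
nrow(maize_demo)
#> [1] 3
```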
7 table indexes
7.1 Base R
Use [ , ] notation. Row conditions (filtering) go to the left of the comma. Column conditions (selecting columns) go to the right.
DF <- data.frame(v1 = 1:5, v2 = 6:10)
rownames(DF) <- LETTERS[1:5]
DF
#> v1 v2
#> A 1 6
#> B 2 7
#> C 3 8
#> D 4 9
#> E 5 10
Subsetting data.
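A few sketches of subsetting the DF defined above with [ , ]:

```r
DF <- data.frame(v1 = 1:5, v2 = 6:10)
rownames(DF) <- LETTERS[1:5]

# rows where v1 is greater than 3, all columns
rows_gt3 <- DF[DF$v1 > 3, ]
rownames(rows_gt3)
#> [1] "D" "E"

# all rows, column v2 only (returns a vector)
col_v2 <- DF[, "v2"]
print(col_v2)
#> [1]  6  7  8  9 10

# a single cell by row name and column name
cell <- DF["B", "v2"]
print(cell)
#> [1] 7
```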
7.2 dplyr
Use filter for row conditions and dplyr::select to select columns.
DF <- tibble(v1 = 1:5, v2 = 6:10)
rownames(DF) <- LETTERS[1:5]
#> Warning: Setting row names on a tibble is deprecated.
DF
#> # A tibble: 5 × 2
#> v1 v2
#> * <int> <int>
#> 1 1 6
#> 2 2 7
#> 3 3 8
#> 4 4 9
#> 5 5 10
Filter to rows where v1 is greater than 3.
DF_filt <- DF %>% filter(v1 > 3)
DF_filt
#> # A tibble: 2 × 2
#> v1 v2
#> * <int> <int>
#> 1 4 9
#> 2 5 10
Same as above but only show column v2.
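A sketch of that step, chaining filter and dplyr::select on the same data:

```r
library(dplyr)

DF <- tibble(v1 = 1:5, v2 = 6:10)

# rows where v1 > 3, then keep only column v2
DF_v2 <- DF %>%
  filter(v1 > 3) %>%
  dplyr::select(v2)
DF_v2$v2
#> [1]  9 10
```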
7.3 slice
slice is a dplyr function to select rows by number.
Select the second and third rows.
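A sketch using the same five-row tibble:

```r
library(dplyr)

DF <- tibble(v1 = 1:5, v2 = 6:10)

# rows 2 and 3 by position
DF_rows <- DF %>% slice(2:3)
DF_rows$v1
#> [1] 2 3
```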
7.4 head, tail
head selects the first n rows in a data frame or tibble. tail selects the last n rows.
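A quick sketch on a five-row data frame, with n = 2:

```r
DF <- data.frame(v1 = 1:5, v2 = 6:10)

first_two_rows <- head(DF, n = 2)
first_two_rows$v1
#> [1] 1 2

last_two_rows <- tail(DF, n = 2)
last_two_rows$v1
#> [1] 4 5
```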
8 table functions
8.1 cbind, rbind
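A minimal sketch, assuming the usual base R meaning: rbind stacks tables by row (column names must match) and cbind binds them side by side (row counts must match). The data frames are made up.

```r
top <- data.frame(id = 1:2, x = c(10, 20))
bottom <- data.frame(id = 3:4, x = c(30, 40))

# stack rows: 2 + 2 rows, same 2 columns
stacked <- rbind(top, bottom)
dim(stacked)
#> [1] 4 2

# add a column block: same 4 rows, 2 + 1 columns
extra <- data.frame(y = c("a", "b", "c", "d"))
wide <- cbind(stacked, extra)
dim(wide)
#> [1] 4 3
```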
8.2 joins
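A sketch of two common dplyr joins on a shared key. The tables, the site_id key, and the ndvi values are made up for illustration.

```r
library(dplyr)

sites <- tibble(site_id = 1:3, name = c("A", "B", "C"))
obs <- tibble(site_id = c(1L, 3L, 4L), ndvi = c(0.2, 0.6, 0.8))

# left_join keeps every row of sites; unmatched rows get NA
left <- left_join(sites, obs, by = "site_id")
left$ndvi
#> [1] 0.2  NA 0.6

# inner_join keeps only keys present in both tables
inner <- inner_join(sites, obs, by = "site_id")
inner$site_id
#> [1] 1 3
```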
8.3 pivot_longer
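A sketch of tidyr's pivot_longer, which gathers several columns into name/value pairs. The table and column names are made up.

```r
library(tidyr)

wide_tbl <- data.frame(site = c("A", "B"), y2019 = c(1, 2), y2020 = c(3, 4))

# one row per site-year instead of one column per year
long_tbl <- pivot_longer(wide_tbl, cols = c(y2019, y2020),
                         names_to = "year", values_to = "value")
dim(long_tbl)
#> [1] 4 3
```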
8.4 pivot_wider
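A sketch of tidyr's pivot_wider, which spreads name/value pairs back out into columns. The table is made up.

```r
library(tidyr)

long_demo <- data.frame(
  site = c("A", "A", "B", "B"),
  year = c("y2019", "y2020", "y2019", "y2020"),
  value = c(1, 3, 2, 4)
)

# one column per year, one row per site
wide_demo <- pivot_wider(long_demo, names_from = year, values_from = value)
names(wide_demo)
#> [1] "site"  "y2019" "y2020"
```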
9 Other
9.1 which
Returns indices (position in vector) where a condition is true.
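For example, with a made-up vector:

```r
x <- c(5, 12, 3, 12, 8)

# positions where the value exceeds 6
idx <- which(x > 6)
print(idx)
#> [1] 2 4 5
```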
9.2 which.min
Finds the index of the minimum value. Returns only the first location, even if the minimum occurs more than once.
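A sketch with a tied minimum (values made up):

```r
y <- c(4, 1, 7, 1)

# the minimum 1 appears at positions 2 and 4; only the first is returned
i_min <- which.min(y)
print(i_min)
#> [1] 2
```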
9.3 which.max
Finds the index of the maximum value. Returns only the first location, even if the maximum occurs more than once.
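A sketch with a tied maximum (values made up). To get every location, combine which with max.

```r
z <- c(4, 9, 7, 9)

# the maximum 9 appears at positions 2 and 4; only the first is returned
i_max <- which.max(z)
print(i_max)
#> [1] 2

# all positions of the maximum
all_max <- which(z == max(z))
print(all_max)
#> [1] 2 4
```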
9.4 unique
unique reduces an object to its unique values.
set.seed(2)
birthdays <- sample(1:365, 50, replace = TRUE) ## sample 50 birthdays, repeats allowed
print(birthdays)
#> [1] 341 198 262 273 349 204 297 178 75 131 306 311 63 136 231 289 54 361
#> [19] 112 171 38 361 110 144 45 238 208 134 339 9 350 130 244 3 129 304
#> [37] 297 301 289 274 8 164 350 37 226 149 205 327 242 358
distinct_birthdays <- (unique(birthdays))
print(distinct_birthdays)
#> [1] 341 198 262 273 349 204 297 178 75 131 306 311 63 136 231 289 54 361
#> [19] 112 171 38 110 144 45 238 208 134 339 9 350 130 244 3 129 304 301
#> [37] 274 8 164 37 226 149 205 327 242 358
print(paste0(length(distinct_birthdays), " distinct birthdays"))
#> [1] "46 distinct birthdays"