Today

  • Project intro
  • A few more advanced R applications
    • Parallelization
    • Random Forests example

Projects

  • 24 students = ~9-12 projects
  • Ideally mixed UG + Grad students in a team
  • Why team?
  • Potential projects:
    • Projects page
    • Something HERO-ic
    • Long-term vegetation trends
    • Head-start on a PhD project
    • Lots of country-scale cropland/field boundary data
      • Algorithm to break up narrowly connected fields
      • Compare field characteristics (e.g. area, shape metrics) between countries or between years
    • Sampling problems:
      • Come up hypothetical crop type distributions and sample scheme to collect from road network - evaluate sample-based estimates relative to true distribution

Modelling demo

Random Forests

# Train model
library(caTools)
library(randomForest)
library(geospaar)

# read in train set
train_ref <- system.file("extdata/train_reference.csv", 
                         package = "geospaar") %>% 
  read_csv()  %>% mutate(crop = factor(case_when(
  class == "Maize" ~ 1,
  class == "Rice" ~ 2,
  class == "Other" ~ 3,
  class == "noncrop" ~ 4
))) %>% dplyr::select(id, class, crop, !!names(.))

Model

# prepare dataset
dset <- train_ref %>% filter(type == "mean") %>%
  select(-id, -class, -type, -seed)
set.seed(10)
samp <- sample.split(dset$crop, SplitRatio = 0.8)
train <- subset(dset, samp == TRUE)
test <- subset(dset, samp == FALSE)
# train %>% tidyr::drop_na()

# fit model
set.seed(123)
mod <- randomForest(crop ~ ., data = train, ntree = 1000,
                    importance = TRUE)
sum(diag(mod$confusion[1:4, 1:4])) / sum(mod$confusion[1:4, 1:4])
summary(mod) 

# Evaluation
pred <- predict(mod, newdata = test %>% dplyr::select(-c(crop)))
cm <- table(test$crop, pred)
# sum(diag(cm)) / sum(cm)

rf_results <- list("train" = train, "test" = test, "mod" = mod, 
                   "pred" = pred, "error_mat" = cm)
imptab <- mod$importance
nms <- rownames(imptab)
imp_tbl <- as_tibble(imptab) %>% 
  mutate(Variable = nms) %>% 
  dplyr::select(Variable, MeanDecreaseAccuracy) %>% 
  arrange(MeanDecreaseAccuracy) %>% 
  mutate(order = 1:n()) %>% 
  mutate(Variable = gsub("\\.", "_", Variable))

p <- ggplot(imp_tbl) + 
  geom_point(aes(MeanDecreaseAccuracy, factor(order))) + 
  scale_y_discrete(labels = toupper(imp_tbl$Variable)) + 
  ylab("Variable") + xlab("Mean Decrease in Accuracy") + 
  theme_linedraw() + 
  theme(axis.text = element_text(size = 7), 
        axis.title = element_text(size = 7))
p

Map

Look at Rmd file to see code use to make prediction map for crop types.

Parallelization

Note the use of wrap and unwrap to serialize the spatRaster object so it can be passed between cores. See here for discussion.

library(parallel)

# terra
r <- rast(ext(30, 31, -14, -13), res = 0.01, 
          crs = "+proj=longlat +datum=WGS84")
values(r) <- 1:ncell(r)

rp <- terra::wrap(r)
system.time(b <- mclapply(1:2000, function(x, r = rp) {
  r <- unwrap(r)
  values(r) <- sample(1:10, size = ncell(r), replace = TRUE)
  set.seed(x)
  r3 <- r2 <- r
  values(r2) <- values(r) * runif(n = ncell(r), 0.9, 1.2)
  values(r3) <- values(r) * runif(n = ncell(r), 0.8, 1.3)
  wrap(c(r3, r2, r))
}) %>% lapply(., unwrap))

New formats