Ha raster tiling #56

Open
hfadams wants to merge 45 commits into main from ha-raster-tiling

Conversation


@hfadams hfadams commented Apr 2, 2026

  • Added raster tiling transformer for spatial multiprocessing
  • Added RGB band assignment and band renaming
  • Updated scenario set-up workflow to load only the packages needed for the specified model
  • Fixed postprocessing with multiclass data
  • Removed benign installation messages from runlog error
  • Fixed DB update files

Summary by CodeRabbit

  • New Features

    • Docker image and build context support; new spatial-tiling preparation and revamped prediction workflow.
  • Improvements

    • Improved memory and file handling, band-name override and RGB selection, ground-truth validation, and per-tile processing controls.
  • Bug Fixes

    • Safer raster read/write lifecycles, tiling-aware aggregation, and more resilient summary generation.
  • Updates

    • Conda environment version bump, package metadata and database schema updates.
  • Chores

    • Added/expanded .gitignore and .dockerignore rules.

hfadams added 30 commits March 20, 2026 11:04
@hfadams hfadams requested a review from afilazzola April 2, 2026 14:34

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Note

.coderabbit.yml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'code_guidelines', 'ai'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Adds spatial tiling support and a tile-aware prediction pipeline, refactors training and post-processing for tiling workflows, introduces conditional model-package loading, updates datasheets/package schema for tiling and summary charts, adds Docker/Conda/installer artifacts, and reorganizes prediction scripts.

Changes

Cohort / File(s) Summary
Docker & VCS ignores
.dockerignore, Dockerfile, .gitignore
Add .dockerignore; new Dockerfile building geospatial R + JVM + PyTorch + SyncroSim image; extend .gitignore patterns.
Conda & installer
src/ecoClassify-conda.yml, src/installDependencies.r
Bump conda env name and add r-gower; sanitize conda/lib paths and capture install logs to temp sink with improved error echoing.
New tiling transformer & schema
src/2a-prepare-spatial-tiling.R, src/updates/update-2.4.xml, src/package.xml
Add PrepareSpatialTiling transformer; generate tile_manifest and tile mask raster; add TilingOptions & SummaryOutputChart schemas and package.xml wiring/layout updates.
Prediction scripts
src/2-predict.R, src/2-predict.r
Add new tile-aware src/2-predict.R (tile manifest handling, per-tile preprocessing, tile-aware RGB and summary aggregation); remove legacy src/2-predict.r.
Training script
src/1-train-classifier.R
Load terminology band overrides, validate target-class presence, write covariate outputs to transferDir, adjust memory management and summary output saving.
Post-processing
src/3-postprocess.R
Refine raster path selection, change predictionType labels, recompute confusion metrics from post-processed rasters, consolidate summaries/charts, and improve cleanup.
Core setup & model package loading
src/functions/0.0-setup.r, src/functions/0.4-maxent.r
Remove eager model preloads; add loadModelPackages(modelType) to load model-specific deps and shift MAXENT availability gating to setup.
Preprocessing & covariates
src/functions/0.1-pre-processing.r
Add defaults in assignVariables(), force file-backed covariates, change addCovariates(..., outDir = NULL), and update overrideBandNames(..., bandLabelFile).
Training/prediction helpers
src/functions/0.2-training-and-prediction.r, src/functions/0.4-random-forest.r, src/functions/0.5-histogram.r
Add rgbBands param to saveFiles(), improve raster read/write lifecycle in predictRanger(), make histogram extraction robust to shapes, and warn on low sample counts for thresholding.
Post-processing helpers & labels
src/functions/0.6-post-processing.r
Rename two statistic labels to full names (Positive/Negative Predictive Value); keep computed metrics unchanged.
Updates & metadata
src/updates/update-2.1.xml, src/package.xml
Schema and package.xml edits: transformer env versions, datasheet/column/layout updates, and added/removed datasheets per 2.4 changes.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SyncroSim
    participant Prepare as "PrepareSpatialTiling\n(2a-prepare-spatial-tiling.R)"
    participant Predict as "Predict\n(2-predict.R)"
    participant Train as "TrainClassifier\n(1-train-classifier.R)"
    participant Post as "PostProcess\n(3-postprocess.R)"
    participant DB as "SyncroSim DB / Datasheets"

    User->>SyncroSim: Run scenario (tiling enabled)
    SyncroSim->>Prepare: Execute PrepareSpatialTiling
    Prepare->>Prepare: Compute tile grid, pre-crop rasters, write tile_manifest.json
    Prepare->>DB: Save tile manifest & grid metadata
    SyncroSim->>Train: Execute TrainClassifier
    Train->>DB: Read training inputs, load model packages
    Train->>DB: Save trained model & summary
    SyncroSim->>Predict: Launch tile jobs (1..N)
    loop per tile job
      Predict->>DB: Load tile-cropped rasters & model
      Predict->>Predict: Preprocess, predict per timestep
      Predict->>DB: Write per-tile prediction rasters and tile summary
    end
    Predict->>Predict: Job 1 aggregates tile summaries -> unified SummaryOutput
    Predict->>DB: Save aggregated SummaryOutput
    SyncroSim->>Post: Execute PostProcessing
    Post->>DB: Load predictions, reclassify/filter, compute metrics
    Post->>DB: Save ModelStatistics & SummaryOutputChart

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • afilazzola

"I nibbled code and hatched a plan,
Tiles in rows like carrot span,
Models hum and rasters sing,
Summaries bloom—oh what a spring!
Hooray for tidy, tiled machine! 🥕"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check — ❓ Inconclusive. The title "Ha raster tiling" is vague and does not clearly convey the scope or primary objective of the changes. While raster tiling is mentioned, the title lacks specificity about what was implemented, why, or what else changed. Resolution: use a more descriptive title that clearly summarizes the primary changes, such as "Add spatial raster tiling support for multiprocessing" or "Implement spatial tiling transformer and related improvements".
✅ Passed checks (2 passed)
Description check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
Docstring coverage — ✅ Passed. No functions were found in the changed files, so the docstring coverage check was skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/functions/0.2-training-and-prediction.r (1)

1281-1297: ⚠️ Potential issue | 🟡 Minor

Missing dev.off() call after PNG device opening.

The grDevices::png() device is opened but dev.off() is not called after plotRGB(). If plotRGB() succeeds, the device remains open. While the error handler catches failures, the device should be closed in all cases to prevent resource leaks and ensure the PNG file is properly finalized.

🛠️ Proposed fix
   grDevices::png(filename = png_file)
   tryCatch(
     {
       terra::plotRGB(rgb_rast, r = 1, g = 2, b = 3, stretch = "lin")
     },
     error = function(e) {
       updateRunLog(
         sprintf(
           "t=%s: Failed to write RGB PNG: %s",
           timestep,
           conditionMessage(e)
         ),
         type = "warning"
       )
-    }
+    },
+    finally = {
+      grDevices::dev.off()
+    }
   )
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/functions/0.2-training-and-prediction.r` around lines 1281 - 1297, The
PNG device opened via grDevices::png(filename = png_file) is never closed; wrap
the plotting call in tryCatch (or use tryCatch(..., finally =
{grDevices::dev.off()})) so that grDevices::dev.off() is always called
regardless of success or error. Specifically, modify the block containing
terra::plotRGB(rgb_rast, r = 1, g = 2, b = 3, stretch = "lin") and the error
handler that calls updateRunLog(...) to ensure a final grDevices::dev.off() runs
(or use on.exit(grDevices::dev.off()) immediately after opening the device) so
the PNG device is properly finalized.
src/3-postprocess.R (1)

145-153: ⚠️ Potential issue | 🟠 Major

Guard the training raster load with file.exists().

Line 151 only filters out NA. A stale path still makes terra::rast() abort the whole post-processing pass; the predicting branch already handles this case.

🛠️ Proposed fix
-  if (!is.na(predictedPresenceFilepath)) {
+  if (!is.na(predictedPresenceFilepath) && file.exists(predictedPresenceFilepath)) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/3-postprocess.R` around lines 145 - 153, The code currently only checks
for NA on predictedPresenceFilepath before calling terra::rast, which still
fails if the path is stale; update the loop handling trainTimestepList so that
after computing predictedPresenceFilepath from trainingOutputDataframe (Timestep
== t) you also check file.exists(predictedPresenceFilepath) and skip (or log a
warning) when the file does not exist instead of calling rast; ensure this guard
is applied where predictedPresence is created so functions/variables like
predictedPresenceFilepath, predictedPresence, and the call to rast() are
protected.
🧹 Nitpick comments (6)
src/2a-prepare-spatial-tiling.R (2)

100-107: Auto-tiling may produce fewer tiles than available cores.

The calculation numTilesX = ceiling(sqrt(nTiles)) and numTilesY = floor(nTiles / numTilesX) can produce a grid with fewer tiles than the target. For example:

  • 7 cores: numTilesX = ceiling(sqrt(7)) = 3, numTilesY = floor(7/3) = 2 → 6 tiles
  • 11 cores: numTilesX = ceiling(sqrt(11)) = 4, numTilesY = floor(11/4) = 2 → 8 tiles

This leaves some cores unused. Consider adjusting numTilesY to ceiling(nTiles / numTilesX) to ensure at least nTiles tiles are created:

💡 Alternative calculation to maximize core utilization
   numTilesX <- ceiling(sqrt(nTiles))
-  numTilesY <- floor(nTiles / numTilesX)
+  numTilesY <- ceiling(nTiles / numTilesX)

This would create slightly more tiles than cores in some cases (e.g., 7 cores → 9 tiles), which is generally preferable to underutilization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/2a-prepare-spatial-tiling.R` around lines 100 - 107, The auto-tiling can
produce fewer tiles than requested because numTilesY is computed with floor;
change the calculation so numTilesY uses ceiling(nTiles / numTilesX) instead of
floor(...) to ensure at least nTiles tiles (reference symbols: nTiles,
numTilesX, numTilesY) and update the updateRunLog message to report the actual
number of tiles (numTilesX * numTilesY) so logs reflect the adjusted tiling.
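
The floor-vs-ceiling difference can be illustrated with a small standalone R snippet (not project code; `nTiles` here stands for the target core count from the comment above):

```r
# Sketch: compare the two grid formulas for a target tile count.
tileGrid <- function(nTiles, useCeiling = TRUE) {
  numTilesX <- ceiling(sqrt(nTiles))
  numTilesY <- if (useCeiling) {
    ceiling(nTiles / numTilesX)  # proposed: at least nTiles tiles
  } else {
    floor(nTiles / numTilesX)    # current: can undershoot the target
  }
  c(x = numTilesX, y = numTilesY, total = numTilesX * numTilesY)
}

tileGrid(7,  useCeiling = FALSE)  # 3 x 2 = 6 tiles, one core idle
tileGrid(7,  useCeiling = TRUE)   # 3 x 3 = 9 tiles, all cores busy
tileGrid(11, useCeiling = FALSE)  # 4 x 2 = 8 tiles
tileGrid(11, useCeiling = TRUE)   # 4 x 3 = 12 tiles
```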

442-449: Overwriting MaximumJobs may reduce user-configured parallelism.

Line 444 sets MaximumJobs = nValidTiles, which overwrites any user-configured value. If a user configured 16 jobs but only 4 valid tiles exist, this is correct behavior. However, if a user configured 4 jobs and 16 tiles exist, this would increase parallelism beyond what the user intended.

The earlier check at lines 124-130 logs an info message when tiles exceed configured jobs, but then line 444 overwrites that configuration anyway. Consider preserving the user's configured maximum:

🔧 Proposed fix to respect user-configured maximum
+# Respect user-configured MaximumJobs as an upper bound
+userMaxJobs <- if (
+  nrow(multiprocessingSheet) > 0 &&
+  !is.null(multiprocessingSheet$MaximumJobs) &&
+  !is.na(multiprocessingSheet$MaximumJobs[1])
+) as.integer(multiprocessingSheet$MaximumJobs[1]) else nValidTiles
+
+finalMaxJobs <- min(nValidTiles, userMaxJobs)
+
 saveDatasheet(
   myScenario,
-  data = data.frame(EnableMultiprocessing = TRUE, MaximumJobs = nValidTiles,
+  data = data.frame(EnableMultiprocessing = TRUE, MaximumJobs = finalMaxJobs,
                     stringsAsFactors = FALSE),
   name = "core_Multiprocessing"
 )
 
-updateRunLog(paste0("core_Multiprocessing set to ", nValidTiles, " job(s) to match tile count."))
+updateRunLog(paste0("core_Multiprocessing set to ", finalMaxJobs, " job(s) (", nValidTiles, " tiles)."))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/2a-prepare-spatial-tiling.R` around lines 442 - 449, The current
saveDatasheet call unconditionally sets MaximumJobs = nValidTiles and can
override a user-configured value; instead, read the existing configured maximum
(from the saved "core_Multiprocessing" datasheet or the variable that holds the
user's MaximumJobs), compute chosenJobs <- ifelse(is.na(userMax) || userMax <=
0, nValidTiles, min(userMax, nValidTiles)), and pass MaximumJobs = chosenJobs to
saveDatasheet (keep EnableMultiprocessing = TRUE); also update the updateRunLog
message to report the final chosenJobs value so it accurately reflects whether
the user's setting or the tile count limited parallelism.
src/functions/0.1-pre-processing.r (1)

723-756: Consider documenting the expected CSV format.

The overrideBandNames() function uses read.csv(..., header = TRUE), which assumes the CSV has a header row. The docstring mentions "one band name per row (in band order), with an optional header" but the code always reads with header = TRUE, which will consume the first data row as a header if no header exists.

Consider either:

  1. Using header = FALSE to treat all rows as data, or
  2. Clarifying in documentation that a header row is required
💡 Option 1: Read without header assumption
   bandLabels <- tryCatch(
-    read.csv(bandLabelFile, header = TRUE, stringsAsFactors = FALSE),
+    read.csv(bandLabelFile, header = FALSE, stringsAsFactors = FALSE),
     error = function(e) stop(paste0(
       "Could not read band label file '", bandLabelFile, "': ", conditionMessage(e)
     ))
   )
-  newNames <- bandLabels[[1]]
+  newNames <- as.character(bandLabels[[1]])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/functions/0.1-pre-processing.r` around lines 723 - 756, The function
overrideBandNames currently calls read.csv(..., header = TRUE) which will treat
a missing header row as column names and drop the first band; change the
read.csv call to use header = FALSE so every CSV row is treated as a band name
(keep bandLabels[[1]] as the source of names), and update the function
docs/docstring to state the CSV must be one name per row with no header (or
explicitly allow an optional header and describe the behavior). Ensure this
change is applied in overrideBandNames and any associated documentation strings
referencing bandLabelFile.
Dockerfile (2)

84-91: Verify SyncroSim download URL stability.

The SyncroSim zip is downloaded from a versioned URL (3-1-24). Consider whether this URL will remain available long-term, and whether the version should be parameterized as a build argument for easier updates.

💡 Optional: Parameterize SyncroSim version
+ARG SYNCROSIM_VERSION=3-1-24
+
 # --- SyncroSim ----------------------------------------------------------------
 RUN mkdir -p /tmp/syncrosim && \
     cd /tmp/syncrosim && \
-    wget -O syncrosim.zip https://downloads.syncrosim.com/3-1-24/syncrosim-linux-3-1-24.zip && \
+    wget -O syncrosim.zip https://downloads.syncrosim.com/${SYNCROSIM_VERSION}/syncrosim-linux-${SYNCROSIM_VERSION}.zip && \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 84 - 91, The Dockerfile currently hardcodes the
SyncroSim download URL with version "3-1-24" in the RUN that creates
/tmp/syncrosim and wget's syncrosim.zip; make the version configurable by adding
a build argument (e.g., SYNCROSIM_VERSION) and update the wget call to use that
variable so future updates only require changing the arg; optionally add a
checksum or S3/mirror fallback for stability and keep the rest of the steps
(unzip, copy to /opt/syncrosim, chmod, cleanup) unchanged but referenced via the
same RUN sequence.

16-127: Consider adding a non-root user for improved security posture.

Per static analysis (Trivy DS-0002), the container runs as root by default. While this may be necessary for SyncroSim operations or specific file permissions, running containers as a non-root user is a security best practice that reduces the impact of container escape vulnerabilities.

If SyncroSim doesn't require root privileges at runtime, consider adding a dedicated user:

🔒 Proposed addition at the end of the Dockerfile
 # Optional: pre-create a directory for maxent.jar
 RUN mkdir -p /opt/ecoClassify/java
+
+# Create non-root user for runtime
+RUN useradd -m -s /bin/bash ecoclassify && \
+    chown -R ecoclassify:ecoclassify /opt/ecoClassify
+USER ecoclassify

Please verify whether SyncroSim requires root privileges at runtime before applying this change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 16 - 127, The image runs processes as root (Trivy
DS-0002); confirm whether SyncroSim requires root at runtime, and if not: create
a dedicated non-root user and group (e.g., "syncro"), chown runtime directories
(/opt/syncrosim, /opt/ecoClassify, /opt/ecoClassify/java and any other runtime
writeable paths created by RUN commands such as the SyncroSim install and
PATH/LD_LIBRARY_PATH locations), and switch to that user with USER before the
container entrypoint; ensure any build-time operations that need root remain
earlier in the Dockerfile (the existing RUN sequences that install packages,
Rscript::install_torch, conda/Miniconda steps, and mkdir/cp steps should keep
root), and verify that SyncroSim works under the non-root user before merging.
src/updates/update-2.4.xml (1)

35-41: This pattern is established across all update files and follows the codebase principle of preferring explicit failures for troubleshooting.

The ALTER TABLE ADD COLUMN statements in lines 36-40 follow the same approach used throughout all update files (updates 1.0–2.4). If columns already exist, the first failed ALTER will prevent subsequent columns from being added. However, this appears to be intentional based on the consistent pattern and aligns with the learned preference for explicit failures.

Consider documenting that package updates are intended to run only once, or add a note in the Terminology datasheet definition clarifying when these columns are added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/updates/update-2.4.xml` around lines 35 - 41, The ALTER TABLE statements
in the update (action Exec with condition TableExists and criteria
ecoClassify_Terminology in update-2.4.xml) intentionally fail fast if any column
already exists; to make this explicit, add documentation or a datasheet note
stating that package updates (updates 1.0–2.4) are intended to run once and that
the ecoClassify_Terminology columns (redBand, greenBand, blueBand, nirBand,
swirBand) are added during that one-time update, or alternatively add a
clarifying comment in the Terminology datasheet definition referencing the
update-2.4.xml change and the run-once expectation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/settings.local.json:
- Around line 6-7: Remove the machine-specific entries in
.claude/settings.local.json that expose the personal local path (the Bash(grep
... /mnt/c/Users/HannahAdams/Documents/GitHub/ecoClassify ...) line) and the
non-portable "Bash(grep:*)" entry; instead sanitize the file by replacing those
lines with a generic example or placeholders, add .claude/settings.local.json to
.gitignore so local configs stay untracked, and commit a sanitized template
(e.g., .claude/settings.local.template.json) for others to copy and fill in.
Ensure you update any references to the Bash(...) entries so no absolute user
paths remain in the committed config.

In @.gitignore:
- Around line 7-8: Remove Dockerfile and .dockerignore from .gitignore so
tracked container build files are not ignored; open .gitignore, delete the lines
containing "Dockerfile" and ".dockerignore" (or comment them out) to ensure
future Dockerfile/build-context changes are visible in commits and diffs.

In `@src/2-predict.R`:
- Around line 511-524: The aggregation is incorrectly averaging per-tile
probability summaries; instead compute MeanProbability as a weighted mean using
PredictedPixels (use sum(.PredictedPixels * .MeanProbability) /
sum(.PredictedPixels) with .sum_na helpers), compute MinProbability as the
global minimum of tile MinProbability (min(..., na.rm=TRUE)), MaxProbability as
the global maximum of tile MaxProbability (max(..., na.rm=TRUE)), and do not
attempt to recover MedianProbability from tile medians — set MedianProbability
to NA_real_ (or compute it from raw pixel probabilities if raw values are
available). Update the block that builds .aggregated (referencing .allStats,
buildSummaryRow, .sum_na, .mean_na) to implement the weighted mean and replace
the simple .mean_na calls for min/max/median as described.
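A minimal R sketch of the weighted aggregation described above (column names taken from the comment; not the actual project helpers):

```r
# Sketch: aggregate per-tile probability summaries into one global row.
# MeanProbability is weighted by PredictedPixels; min/max are global extremes;
# a true median cannot be recovered from per-tile medians, so it is set to NA.
aggregateTileStats <- function(tileStats) {
  w <- tileStats$PredictedPixels
  data.frame(
    PredictedPixels   = sum(w, na.rm = TRUE),
    MeanProbability   = sum(w * tileStats$MeanProbability, na.rm = TRUE) /
                        sum(w, na.rm = TRUE),
    MinProbability    = min(tileStats$MinProbability, na.rm = TRUE),
    MaxProbability    = max(tileStats$MaxProbability, na.rm = TRUE),
    MedianProbability = NA_real_  # not recoverable from tile medians
  )
}
```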

In `@src/3-postprocess.R`:
- Around line 554-560: The lookup for
predictingOutputDataframe$ClassifiedProbability for a given timestep t returns
multiple paths and then file.exists() only checks the first element while
terra::rast() will silently open multilayer rasters; change the code that sets
predProbPath to collapse to a single filepath (e.g., pick the
first/unique/existing path for
predictingOutputDataframe$ClassifiedProbability[predictingOutputDataframe$Timestep
== t]) before calling file.exists() and terra::rast(), then pass that single
filepath into buildSummaryRow (same fix also for the similar block that uses
ClassifiedProbability at lines ~595-600); ensure you validate the chosen path
exists and is a single string prior to creating the raster.
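A hedged sketch of the single-path selection in R (names `predictingOutputDataframe`, `t` as in the comment; `predProbRaster` is illustrative):

```r
# Sketch: collapse the timestep lookup to one existing file path before
# handing it to terra::rast(), which would otherwise open a multilayer raster.
paths <- predictingOutputDataframe$ClassifiedProbability[
  predictingOutputDataframe$Timestep == t
]
paths <- unique(paths[!is.na(paths) & file.exists(paths)])
if (length(paths) >= 1) {
  predProbRaster <- terra::rast(paths[1])  # a single validated path
}
```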

In `@src/functions/0.5-histogram.r`:
- Around line 253-255: The code uses preds[, 2] assuming preds is a matrix or
data frame with at least two columns, which can cause errors if it has fewer.
Modify the selection logic in the relevant lines of the function to first check
if a column named "presence" exists in preds and use it if present, otherwise
use the second column if available, and finally fallback to the first column if
neither criteria are met. This defensive pattern ensures safe extraction of
prediction values regardless of preds' structure.
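The defensive fallback chain could look like the following R sketch (the `preds` layout is assumed, not confirmed against the project code):

```r
# Sketch: safely extract prediction values regardless of the shape of preds.
getPresenceColumn <- function(preds) {
  if (!is.null(colnames(preds)) && "presence" %in% colnames(preds)) {
    preds[, "presence"]      # prefer an explicitly named column
  } else if (ncol(preds) >= 2) {
    preds[, 2]               # fall back to the conventional second column
  } else {
    preds[, 1]               # last resort: the only column available
  }
}
```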

In `@src/installDependencies.r`:
- Line 46: The code hardcodes the Windows Conda library path ("Lib/R/library")
when computing env_lib; update the logic to choose "Lib" for Windows and "lib"
for Unix-like systems using an OS check (e.g. .Platform$OS.type or Sys.info())
so file.path(cp, <OS-specific folder>, "R", "library") points to the correct
Conda library folder; modify the env_lib assignment that references cp and
.libPaths() accordingly so both Windows and Unix installs resolve correctly.
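A minimal sketch of the OS-aware path construction (`cp` and `env_lib` are the names used in the comment above):

```r
# Sketch: pick the conda env library folder by platform instead of
# hardcoding the Windows-style "Lib" segment.
libFolder <- if (.Platform$OS.type == "windows") "Lib" else "lib"
env_lib <- file.path(cp, libFolder, "R", "library")
.libPaths(c(env_lib, .libPaths()))
```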
- Around line 356-357: The cleanup code for installDependencies() currently
calls sink(type="message") and close(.msg_con) in the error handler and then
again later, which can attempt to close an already-closed connection; refactor
by extracting the sink/close logic into a single reusable cleanup function
(e.g., cleanup_message_conn()) that holds a local guard flag (e.g.,
.cleanup_done) or checks connection validity before running, and replace all
direct sink(type="message") / close(.msg_con) calls with a single call to
cleanup_message_conn(); ensure the function references .msg_con, calls
sink(type="message") and close(.msg_con) exactly once, and is invoked from both
the error handler and normal exit paths so cleanup runs only once.
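A sketch of such a run-once cleanup guard in R (`.msg_con` as named in the comment; the guard flag is illustrative):

```r
# Sketch: idempotent cleanup for the message sink, safe to call from both
# the error handler and the normal exit path.
.cleanup_done <- FALSE
cleanup_message_conn <- function() {
  if (!.cleanup_done && exists(".msg_con") && isOpen(.msg_con)) {
    sink(type = "message")   # detach the message sink
    close(.msg_con)          # close the capture connection exactly once
    .cleanup_done <<- TRUE   # guard against a second invocation
  }
}
```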



ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: ea26417c-c11d-4d23-9e7c-7110280575c9

📥 Commits

Reviewing files that changed from the base of the PR and between 9df98d0 and 9667900.

📒 Files selected for processing (20)
  • .claude/settings.local.json
  • .dockerignore
  • .gitignore
  • Dockerfile
  • src/1-train-classifier.R
  • src/2-predict.R
  • src/2-predict.r
  • src/2a-prepare-spatial-tiling.R
  • src/3-postprocess.R
  • src/ecoClassify-conda.yml
  • src/functions/0.0-setup.r
  • src/functions/0.1-pre-processing.r
  • src/functions/0.2-training-and-prediction.r
  • src/functions/0.4-maxent.r
  • src/functions/0.4-random-forest.r
  • src/functions/0.5-histogram.r
  • src/functions/0.6-post-processing.r
  • src/installDependencies.r
  • src/package.xml
  • src/updates/update-2.4.xml
💤 Files with no reviewable changes (1)
  • src/2-predict.r


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/updates/update-2.1.xml (2)

31-37: ⚠️ Potential issue | 🔴 Critical

This resets advanced classifier settings on upgrade.

The columns dropped above are recreated here in a new table, but there is no backfill step, and the UPDATE ... WHERE AdvancedClassifierOptionsId=1 is a no-op on an empty table. Existing normalizeRasters, manualThreshold, applyContextualization, etc. are lost as soon as the old columns are removed. Copy the existing values into ecoClassify_AdvancedClassifierOptions before dropping the source columns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/updates/update-2.1.xml` around lines 31 - 37, The migration creates a new
ecoClassify_AdvancedClassifierOptions table and then updates
AdvancedClassifierOptionsId=1, which is a no-op because no rows are migrated; as
a result settings like normalizeRasters, manualThreshold, applyContextualization
are lost. Before dropping the old columns/table, copy existing values into the
new ecoClassify_AdvancedClassifierOptions (e.g., INSERT INTO
ecoClassify_AdvancedClassifierOptions(...) SELECT ... FROM
<old_table_or_renamed_table> WHERE AdvancedClassifierOptionsId=1 or INSERT ...
SELECT DISTINCT to preserve rows), ensuring columns normalizeRasters,
rasterDecimalPlaces, modelTuning, setManualThreshold, manualThreshold,
applyContextualization, contextualizationWindowSize, setSeed and
AdvancedClassifierOptionsId are populated; only then drop or rename the source
columns so the UPDATE on ecoClassify_AdvancedClassifierOptions targets real
migrated rows.
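The backfill the prompt describes could look like the following SQL sketch. This is a hedged illustration only: `ecoClassify_OldAdvancedOptions` is a hypothetical stand-in for the pre-2.1 source table, which the prompt leaves unspecified.

```sql
-- Hedged sketch: copy existing settings into the new table *before*
-- the old columns are dropped, so the later UPDATE targets real rows.
-- "ecoClassify_OldAdvancedOptions" is a hypothetical source table name.
INSERT INTO ecoClassify_AdvancedClassifierOptions
  (AdvancedClassifierOptionsId, normalizeRasters, rasterDecimalPlaces,
   modelTuning, setManualThreshold, manualThreshold,
   applyContextualization, contextualizationWindowSize, setSeed)
SELECT AdvancedClassifierOptionsId, normalizeRasters, rasterDecimalPlaces,
       modelTuning, setManualThreshold, manualThreshold,
       applyContextualization, contextualizationWindowSize, setSeed
FROM ecoClassify_OldAdvancedOptions
WHERE AdvancedClassifierOptionsId = 1;
```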

31-42: ⚠️ Potential issue | 🟠 Major

ScenarioId is schema-only right now.

src/functions/0.1-pre-processing.r (Lines 116-119) still reads ecoClassify_AdvancedClassifierOptions without ScenarioId scoping, and src/3-postprocess.R (Lines 78-81) does the same for ecoClassify_PostProcessingRule. In multi-scenario libraries, that can mix another scenario's options or rules into the current run. Ship the reader-side filtering in the same PR, or defer the new column until it is enforced end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/updates/update-2.1.xml` around lines 31 - 42, The R readers currently
pull ecoClassify_AdvancedClassifierOptions and ecoClassify_PostProcessingRule
without scoping by the new ScenarioId column, which can mix settings across
scenarios; update the read queries in src/functions/0.1-pre-processing.r (the
code that reads ecoClassify_AdvancedClassifierOptions) and src/3-postprocess.R
(the code that reads ecoClassify_PostProcessingRule) to include a WHERE
ScenarioId = <currentScenarioId> filter (use the existing scenario id variable
passed into those functions or add one if missing), and fall back safely (e.g.,
return only rows for that ScenarioId or an empty/default set) so options/rules
are enforced per-scenario. Ensure you reference the table names
ecoClassify_AdvancedClassifierOptions and ecoClassify_PostProcessingRule and the
scenario id variable used by the workflow when implementing the change.
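As a sketch, the reader-side scoping could be a small helper applied wherever the options or rules data frame is loaded. The names below follow the prompt; the `ScenarioId` presence check is an assumption added so pre-2.1 libraries (without the column) keep their old behavior:

```r
# Hedged sketch: filter options/rules to the current scenario after reading.
# optionsDf is the data frame returned by the existing reader;
# currentScenarioId is the scenario id the workflow already carries.
scopeToScenario <- function(optionsDf, currentScenarioId) {
  if (!"ScenarioId" %in% names(optionsDf)) {
    return(optionsDf)  # column absent in pre-2.1 libraries; keep old behavior
  }
  optionsDf[optionsDf$ScenarioId == currentScenarioId, , drop = FALSE]
}
```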
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/2-predict.R`:
- Around line 496-503: The code currently warns and persists partial results
when length(.allFiles) < tileCount; instead, do not save or publish partial
aggregates—either wait for an explicit completion signal or fail the run. In the
branch that checks length(.allFiles) < tileCount (using .allFiles, tileCount,
.maxWait), replace the update+save behavior with a hard failure (e.g., call
stop() or return an error) or implement a re-check loop that waits for a
completion flag from workers before calling ecoClassify_SummaryOutput; ensure
the final path that writes/publishes summary only executes when
length(.allFiles) == tileCount.
- Around line 529-530: The aggregation for MinProbability/MaxProbability
currently uses min(grp$MinProbability, na.rm=TRUE)/max(...) which returns
Inf/-Inf when all values are NA; update the logic around the MinProbability and
MaxProbability computations to first test whether all(is.na(grp$MinProbability))
/ all(is.na(grp$MaxProbability)) and return NA_real_ in that case, otherwise
compute min(..., na.rm=TRUE) or max(..., na.rm=TRUE); locate and change the two
lines that set MinProbability and MaxProbability in the grouping/summary block
that references grp.
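Both fixes above are small. An R sketch, using the variable names from the prompts (`grp`, `.allFiles`, `tileCount`, and `.maxWait` are assumed to exist in the surrounding code):

```r
# NA-safe aggregation: min()/max() with na.rm = TRUE return Inf/-Inf
# (with a warning) when every value is NA, so guard for that case first.
safeMin <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safeMax <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

# Fail fast instead of persisting a partial aggregate:
if (length(.allFiles) < tileCount) {
  stop(sprintf("Only %d of %d tile outputs found after %s seconds; aborting summary",
               length(.allFiles), tileCount, .maxWait))
}
```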


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: af2b010a-eab0-42f8-89e5-f54d02c42ef3

📥 Commits

Reviewing files that changed from the base of the PR and between 1e8e81f and 5317187.

📒 Files selected for processing (2)
  • src/2-predict.R
  • src/updates/update-2.1.xml


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/3-postprocess.R (2)

460-471: Comment mentions tile cropping but no implementation present.

Line 460 states "crop to tile extent when running as a spatial tile job" but the subsequent code loads the rule raster without any cropping logic. This comment may be outdated or indicate incomplete implementation.

-          # Load rule raster; crop to tile extent when running as a spatial tile job
+          # Load rule raster
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/3-postprocess.R` around lines 460 - 471, The comment promises cropping
the loaded rule raster to the current tile extent but the code just calls
rast(rulePath); either remove the misleading comment or implement cropping:
after loading with rast(rulePath) apply terra::crop (or terra::ext + crop) using
the job’s tile extent variable (the same extent used elsewhere for spatial
tiles), conditioned on the spatial-tile flag used in this module (e.g., the
job/spatialTile boolean you use elsewhere); keep the updateRunLog and
terraOptions(todisk = old_todisk) behavior and ensure the raster is replaced by
the cropped result before continuing.
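If the cropping is implemented rather than the comment removed, it could look like this sketch. `spatialTileJob` and `tileExtent` are assumed to be defined by the surrounding module (the same flag and extent used elsewhere for spatial tiles):

```r
# Hedged sketch: make the comment true by actually cropping when tiling.
library(terra)

ruleRaster <- rast(rulePath)
if (isTRUE(spatialTileJob)) {
  ruleRaster <- crop(ruleRaster, tileExtent)  # tileExtent: a SpatExtent
}
```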

554-556: Inconsistent de-duplication approach between training and predicting loops.

The predicting loop uses a more defensive pattern (na.omit(unique(...))[1] with validation) compared to the simpler [1] indexing in the training loop (e.g., line 371). While both approaches work, consider aligning them for consistency and maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/3-postprocess.R` around lines 554 - 556, The predicting loop's extraction
of predProbPath uses a defensive de-duplication/validation pattern
(na.omit(unique(...))[1] plus checks) while the training loop uses plain [1];
make them consistent by applying the same defensive approach to the
training-side extraction (or conversely simplify predicting to the training
pattern) so both use the same de-duplication and NA/character validation; update
the code that computes predProbPath in the training loop to mirror the logic
used with predictingOutputDataframe$ClassifiedProbability / $Timestep (or change
the predicting predProbPath assignment to use the simpler [1] indexing),
ensuring you reference and adjust the same variable names (predProbPath,
predictingOutputDataframe, ClassifiedProbability, Timestep) across both loops.
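One way to align the two loops is to factor the defensive pattern into a shared helper, sketched below (the helper name is illustrative):

```r
# Hedged sketch of the shared defensive pattern, usable from both loops.
pickSinglePath <- function(paths) {
  candidates <- as.character(na.omit(unique(paths)))
  if (length(candidates) == 0) {
    stop("No valid path found")
  }
  if (length(candidates) > 1) {
    warning("Multiple distinct paths found; using the first")
  }
  candidates[1]
}

# e.g. predProbPath <- pickSinglePath(
#   predictingOutputDataframe$ClassifiedProbability)
```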

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2208ebd1-ada4-46ca-984c-d20563fc5e39

📥 Commits

Reviewing files that changed from the base of the PR and between 5317187 and 119b8db.

📒 Files selected for processing (1)
  • src/3-postprocess.R
