Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
| Cohort / File(s) | Summary |
|---|---|
| **Docker & VCS ignores**<br>`.dockerignore`, `Dockerfile`, `.gitignore` | Add `.dockerignore`; new Dockerfile building a geospatial R + JVM + PyTorch + SyncroSim image; extend `.gitignore` patterns. |
| **Conda & installer**<br>`src/ecoClassify-conda.yml`, `src/installDependencies.r` | Bump conda env name and add `r-gower`; sanitize conda/lib paths and capture install logs to a temp sink with improved error echoing. |
| **New tiling transformer & schema**<br>`src/2a-prepare-spatial-tiling.R`, `src/updates/update-2.4.xml`, `src/package.xml` | Add PrepareSpatialTiling transformer; generate tile_manifest and tile mask raster; add TilingOptions & SummaryOutputChart schemas and package.xml wiring/layout updates. |
| **Prediction scripts**<br>`src/2-predict.R`, `src/2-predict.r` | Add new tile-aware `src/2-predict.R` (tile manifest handling, per-tile preprocessing, tile-aware RGB and summary aggregation); remove legacy `src/2-predict.r`. |
| **Training script**<br>`src/1-train-classifier.R` | Load terminology band overrides, validate target-class presence, write covariate outputs to transferDir, adjust memory management and summary output saving. |
| **Post-processing**<br>`src/3-postprocess.R` | Refine raster path selection, change predictionType labels, recompute confusion metrics from post-processed rasters, consolidate summaries/charts, and improve cleanup. |
| **Core setup & model package loading**<br>`src/functions/0.0-setup.r`, `src/functions/0.4-maxent.r` | Remove eager model preloads; add `loadModelPackages(modelType)` to load model-specific deps and shift MAXENT availability gating to setup. |
| **Preprocessing & covariates**<br>`src/functions/0.1-pre-processing.r` | Add defaults in `assignVariables()`, force file-backed covariates, change `addCovariates(..., outDir = NULL)`, and update `overrideBandNames(..., bandLabelFile)`. |
| **Training/prediction helpers**<br>`src/functions/0.2-training-and-prediction.r`, `src/functions/0.4-random-forest.r`, `src/functions/0.5-histogram.r` | Add `rgbBands` param to `saveFiles()`, improve raster read/write lifecycle in `predictRanger()`, make histogram extraction robust to shapes, and warn on low sample counts for thresholding. |
| **Post-processing helpers & labels**<br>`src/functions/0.6-post-processing.r` | Rename two statistic labels to full names (Positive/Negative Predictive Value); keep computed metrics unchanged. |
| **Updates & metadata**<br>`src/updates/update-2.1.xml`, `src/package.xml` | Schema and package.xml edits: transformer env versions, datasheet/column/layout updates, and added/removed datasheets per 2.4 changes. |
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant SyncroSim
    participant Prepare as "PrepareSpatialTiling\n(2a-prepare-spatial-tiling.R)"
    participant Predict as "Predict\n(2-predict.R)"
    participant Train as "TrainClassifier\n(1-train-classifier.R)"
    participant Post as "PostProcess\n(3-postprocess.R)"
    participant DB as "SyncroSim DB / Datasheets"
    User->>SyncroSim: Run scenario (tiling enabled)
    SyncroSim->>Prepare: Execute PrepareSpatialTiling
    Prepare->>Prepare: Compute tile grid, pre-crop rasters, write tile_manifest.json
    Prepare->>DB: Save tile manifest & grid metadata
    SyncroSim->>Train: Execute TrainClassifier
    Train->>DB: Read training inputs, load model packages
    Train->>DB: Save trained model & summary
    SyncroSim->>Predict: Launch tile jobs (1..N)
    loop per tile job
        Predict->>DB: Load tile-cropped rasters & model
        Predict->>Predict: Preprocess, predict per timestep
        Predict->>DB: Write per-tile prediction rasters and tile summary
    end
    Predict->>Predict: Job 1 aggregates tile summaries -> unified SummaryOutput
    Predict->>DB: Save aggregated SummaryOutput
    SyncroSim->>Post: Execute PostProcessing
    Post->>DB: Load predictions, reclassify/filter, compute metrics
    Post->>DB: Save ModelStatistics & SummaryOutputChart
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
- Af multi class #55: Overlaps prediction/post-processing and datasheet/schema changes (SummaryOutput / TargetClassOptions and predict/postprocess scripts).
- Update conda environment #49: Related to conda/environment and installer modifications.
- Af-improved-splitdata-function #51: Related to disk-backed probability/presence outputs and saveFiles/getPredictionRasters signature changes.
Suggested reviewers
- afilazzola
"I nibbled code and hatched a plan,
Tiles in rows like carrot span,
Models hum and rasters sing,
Summaries bloom—oh what a spring!
Hooray for tidy, tiled machine! 🥕"
🚥 Pre-merge checks (✅ 2 passed, ❓ 1 inconclusive)
❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title "Ha raster tiling" is vague and does not clearly convey the scope or primary objective of the changes. While raster tiling is mentioned, the title lacks specificity about what was implemented, why, or what else changed. | Use a more descriptive title that clearly summarizes the primary changes, such as "Add spatial raster tiling support for multiprocessing" or "Implement spatial tiling transformer and related improvements". |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Commit unit tests in branch `ha-raster-tiling`
Comment @coderabbitai help to get the list of available commands and usage tips.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/functions/0.2-training-and-prediction.r (1)

**1281-1297: ⚠️ Potential issue | 🟡 Minor — Missing `dev.off()` call after PNG device opening.**

The `grDevices::png()` device is opened but `dev.off()` is not called after `plotRGB()`. If `plotRGB()` succeeds, the device remains open. While the error handler catches failures, the device should be closed in all cases to prevent resource leaks and ensure the PNG file is properly finalized.

🛠️ Proposed fix

```diff
 grDevices::png(filename = png_file)
 tryCatch(
   {
     terra::plotRGB(rgb_rast, r = 1, g = 2, b = 3, stretch = "lin")
   },
   error = function(e) {
     updateRunLog(
       sprintf(
         "t=%s: Failed to write RGB PNG: %s",
         timestep,
         conditionMessage(e)
       ),
       type = "warning"
     )
-  }
+  },
+  finally = {
+    grDevices::dev.off()
+  }
 )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/functions/0.2-training-and-prediction.r` around lines 1281-1297: the PNG device opened via `grDevices::png(filename = png_file)` is never closed; wrap the plotting call in `tryCatch(..., finally = { grDevices::dev.off() })` so that `grDevices::dev.off()` is always called regardless of success or error. Specifically, modify the block containing `terra::plotRGB(rgb_rast, r = 1, g = 2, b = 3, stretch = "lin")` and the error handler that calls `updateRunLog(...)` to ensure a final `grDevices::dev.off()` runs (or use `on.exit(grDevices::dev.off())` immediately after opening the device) so the PNG device is properly finalized.

src/3-postprocess.R (1)
**145-153: ⚠️ Potential issue | 🟠 Major — Guard the training raster load with `file.exists()`.**

Line 151 only filters out `NA`. A stale path still makes `terra::rast()` abort the whole post-processing pass; the predicting branch already handles this case.

🛠️ Proposed fix

```diff
-  if (!is.na(predictedPresenceFilepath)) {
+  if (!is.na(predictedPresenceFilepath) && file.exists(predictedPresenceFilepath)) {
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/3-postprocess.R` around lines 145-153: the code currently only checks for NA on `predictedPresenceFilepath` before calling `terra::rast()`, which still fails if the path is stale; update the loop handling `trainTimestepList` so that after computing `predictedPresenceFilepath` from `trainingOutputDataframe` (`Timestep == t`) you also check `file.exists(predictedPresenceFilepath)` and skip (or log a warning) when the file does not exist instead of calling `rast()`; ensure this guard is applied where `predictedPresence` is created so `predictedPresenceFilepath`, `predictedPresence`, and the call to `rast()` are protected.
🧹 Nitpick comments (6)
src/2a-prepare-spatial-tiling.R (2)
**100-107: Auto-tiling may produce fewer tiles than available cores.**

The calculation `numTilesX = ceiling(sqrt(nTiles))` and `numTilesY = floor(nTiles / numTilesX)` can produce a grid with fewer tiles than the target. For example:

- 7 cores: `numTilesX = ceiling(sqrt(7)) = 3`, `numTilesY = floor(7/3) = 2` → 6 tiles
- 11 cores: `numTilesX = ceiling(sqrt(11)) = 4`, `numTilesY = floor(11/4) = 2` → 8 tiles

This leaves some cores unused. Consider adjusting `numTilesY` to `ceiling(nTiles / numTilesX)` to ensure at least `nTiles` tiles are created:

💡 Alternative calculation to maximize core utilization

```diff
 numTilesX <- ceiling(sqrt(nTiles))
-numTilesY <- floor(nTiles / numTilesX)
+numTilesY <- ceiling(nTiles / numTilesX)
```

This would create slightly more tiles than cores in some cases (e.g., 7 cores → 9 tiles), which is generally preferable to underutilization.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/2a-prepare-spatial-tiling.R` around lines 100-107: the auto-tiling can produce fewer tiles than requested because `numTilesY` is computed with `floor`; change the calculation so `numTilesY` uses `ceiling(nTiles / numTilesX)` instead of `floor(...)` to ensure at least `nTiles` tiles (reference symbols: `nTiles`, `numTilesX`, `numTilesY`), and update the `updateRunLog` message to report the actual number of tiles (`numTilesX * numTilesY`) so logs reflect the adjusted tiling.
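The floor-vs-ceiling effect described in this comment can be checked with a short sketch (Python used here purely to illustrate the arithmetic; `n_tiles` mirrors the R `nTiles`):

```python
import math

def tile_grid(n_tiles, use_ceiling):
    """Mirror the R auto-tiling grid calculation for a target tile count."""
    num_tiles_x = math.ceil(math.sqrt(n_tiles))
    if use_ceiling:
        num_tiles_y = math.ceil(n_tiles / num_tiles_x)  # proposed fix
    else:
        num_tiles_y = n_tiles // num_tiles_x  # current floor() behavior
    return num_tiles_x, num_tiles_y

for cores in (7, 11):
    x, y = tile_grid(cores, use_ceiling=False)
    cx, cy = tile_grid(cores, use_ceiling=True)
    print(f"{cores} cores: floor -> {x * y} tiles, ceiling -> {cx * cy} tiles")
```

With `floor()`, 7 and 11 cores yield 6 and 8 tiles respectively; with `ceiling()`, 9 and 12, so no core sits idle.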
**442-449: Overwriting `MaximumJobs` may reduce user-configured parallelism.**

Line 444 sets `MaximumJobs = nValidTiles`, which overwrites any user-configured value. If a user configured 16 jobs but only 4 valid tiles exist, this is correct behavior. However, if a user configured 4 jobs and 16 tiles exist, this would increase parallelism beyond what the user intended.

The earlier check at lines 124-130 logs an info message when tiles exceed configured jobs, but then line 444 overwrites that configuration anyway. Consider preserving the user's configured maximum:

🔧 Proposed fix to respect user-configured maximum

```diff
+# Respect user-configured MaximumJobs as an upper bound
+userMaxJobs <- if (
+  nrow(multiprocessingSheet) > 0 &&
+    !is.null(multiprocessingSheet$MaximumJobs) &&
+    !is.na(multiprocessingSheet$MaximumJobs[1])
+) as.integer(multiprocessingSheet$MaximumJobs[1]) else nValidTiles
+
+finalMaxJobs <- min(nValidTiles, userMaxJobs)
+
 saveDatasheet(
   myScenario,
-  data = data.frame(EnableMultiprocessing = TRUE, MaximumJobs = nValidTiles,
+  data = data.frame(EnableMultiprocessing = TRUE, MaximumJobs = finalMaxJobs,
                     stringsAsFactors = FALSE),
   name = "core_Multiprocessing"
 )
-updateRunLog(paste0("core_Multiprocessing set to ", nValidTiles, " job(s) to match tile count."))
+updateRunLog(paste0("core_Multiprocessing set to ", finalMaxJobs, " job(s) (", nValidTiles, " tiles)."))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/2a-prepare-spatial-tiling.R` around lines 442-449: the current `saveDatasheet` call unconditionally sets `MaximumJobs = nValidTiles` and can override a user-configured value; instead, read the existing configured maximum (from the saved "core_Multiprocessing" datasheet or the variable that holds the user's `MaximumJobs`), compute `chosenJobs <- ifelse(is.na(userMax) || userMax <= 0, nValidTiles, min(userMax, nValidTiles))`, and pass `MaximumJobs = chosenJobs` to `saveDatasheet` (keep `EnableMultiprocessing = TRUE`); also update the `updateRunLog` message to report the final `chosenJobs` value so it accurately reflects whether the user's setting or the tile count limited parallelism.

src/functions/0.1-pre-processing.r (1)
**723-756: Consider documenting the expected CSV format.**

The `overrideBandNames()` function uses `read.csv(..., header = TRUE)`, which assumes the CSV has a header row. The docstring mentions "one band name per row (in band order), with an optional header", but the code always reads with `header = TRUE`, which will consume the first data row as a header if no header exists.

Consider either:

- Using `header = FALSE` to treat all rows as data, or
- Clarifying in the documentation that a header row is required

💡 Option 1: Read without header assumption

```diff
 bandLabels <- tryCatch(
-  read.csv(bandLabelFile, header = TRUE, stringsAsFactors = FALSE),
+  read.csv(bandLabelFile, header = FALSE, stringsAsFactors = FALSE),
   error = function(e) stop(paste0(
     "Could not read band label file '",
     bandLabelFile,
     "': ",
     conditionMessage(e)
   ))
 )
-newNames <- bandLabels[[1]]
+newNames <- as.character(bandLabels[[1]])
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/functions/0.1-pre-processing.r` around lines 723-756: the function `overrideBandNames` currently calls `read.csv(..., header = TRUE)`, which will treat a missing header row as column names and drop the first band; change the `read.csv` call to use `header = FALSE` so every CSV row is treated as a band name (keep `bandLabels[[1]]` as the source of names), and update the function docs/docstring to state the CSV must be one name per row with no header (or explicitly allow an optional header and describe the behavior). Ensure this change is applied in `overrideBandNames` and any associated documentation strings referencing `bandLabelFile`.

Dockerfile (2)
**84-91: Verify SyncroSim download URL stability.**

The SyncroSim zip is downloaded from a versioned URL (`3-1-24`). Consider whether this URL will remain available long-term, and whether the version should be parameterized as a build argument for easier updates.

💡 Optional: Parameterize SyncroSim version

```diff
+ARG SYNCROSIM_VERSION=3-1-24
+
 # --- SyncroSim ----------------------------------------------------------------
 RUN mkdir -p /tmp/syncrosim && \
     cd /tmp/syncrosim && \
-    wget -O syncrosim.zip https://downloads.syncrosim.com/3-1-24/syncrosim-linux-3-1-24.zip && \
+    wget -O syncrosim.zip https://downloads.syncrosim.com/${SYNCROSIM_VERSION}/syncrosim-linux-${SYNCROSIM_VERSION}.zip && \
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 84-91: the Dockerfile currently hardcodes the SyncroSim download URL with version "3-1-24" in the RUN that creates /tmp/syncrosim and wgets syncrosim.zip; make the version configurable by adding a build argument (e.g., SYNCROSIM_VERSION) and update the wget call to use that variable so future updates only require changing the arg; optionally add a checksum or S3/mirror fallback for stability, and keep the rest of the steps (unzip, copy to /opt/syncrosim, chmod, cleanup) unchanged but referenced via the same RUN sequence.
**16-127: Consider adding a non-root user for improved security posture.**

Per static analysis (Trivy DS-0002), the container runs as root by default. While this may be necessary for SyncroSim operations or specific file permissions, running containers as a non-root user is a security best practice that reduces the impact of container escape vulnerabilities.

If SyncroSim doesn't require root privileges at runtime, consider adding a dedicated user:

🔒 Proposed addition at the end of the Dockerfile

```diff
 # Optional: pre-create a directory for maxent.jar
 RUN mkdir -p /opt/ecoClassify/java
+
+# Create non-root user for runtime
+RUN useradd -m -s /bin/bash ecoclassify && \
+    chown -R ecoclassify:ecoclassify /opt/ecoClassify
+USER ecoclassify
```

Please verify whether SyncroSim requires root privileges at runtime before applying this change.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 16-127: the image runs processes as root (Trivy DS-0002); confirm whether SyncroSim requires root at runtime, and if not: create a dedicated non-root user and group (e.g., "syncro"), chown runtime directories (/opt/syncrosim, /opt/ecoClassify, /opt/ecoClassify/java and any other runtime-writeable paths created by RUN commands such as the SyncroSim install and PATH/LD_LIBRARY_PATH locations), and switch to that user with USER before the container entrypoint; ensure any build-time operations that need root remain earlier in the Dockerfile (the existing RUN sequences that install packages, install_torch, conda/Miniconda steps, and mkdir/cp steps should keep root), and verify that SyncroSim works under the non-root user before merging.

src/updates/update-2.4.xml (1)
35-41: This pattern is established across all update files and follows the codebase principle of preferring explicit failures for troubleshooting.The ALTER TABLE ADD COLUMN statements in lines 36-40 follow the same approach used throughout all update files (updates 1.0–2.4). If columns already exist, the first failed ALTER will prevent subsequent columns from being added. However, this appears to be intentional based on the consistent pattern and aligns with the learned preference for explicit failures.
Consider documenting that package updates are intended to run only once, or add a note in the Terminology datasheet definition clarifying when these columns are added.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/updates/update-2.4.xml` around lines 35 - 41, The ALTER TABLE statements in the update (action Exec with condition TableExists and criteria ecoClassify_Terminology in update-2.4.xml) intentionally fail fast if any column already exists; to make this explicit, add documentation or a datasheet note stating that package updates (updates 1.0–2.4) are intended to run once and that the ecoClassify_Terminology columns (redBand, greenBand, blueBand, nirBand, swirBand) are added during that one-time update, or alternatively add a clarifying comment in the Terminology datasheet definition referencing the update-2.4.xml change and the run-once expectation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/settings.local.json:
- Around line 6-7: Remove the machine-specific entries in
.claude/settings.local.json that expose the personal local path (the Bash(grep
... /mnt/c/Users/HannahAdams/Documents/GitHub/ecoClassify ...) line) and the
non-portable "Bash(grep:*)" entry; instead sanitize the file by replacing those
lines with a generic example or placeholders, add .claude/settings.local.json to
.gitignore so local configs stay untracked, and commit a sanitized template
(e.g., .claude/settings.local.template.json) for others to copy and fill in.
Ensure you update any references to the Bash(...) entries so no absolute user
paths remain in the committed config.
In @.gitignore:
- Around line 7-8: Remove Dockerfile and .dockerignore from .gitignore so
tracked container build files are not ignored; open .gitignore, delete the lines
containing "Dockerfile" and ".dockerignore" (or comment them out) to ensure
future Dockerfile/build-context changes are visible in commits and diffs.
In `@src/2-predict.R`:
- Around line 511-524: The aggregation is incorrectly averaging per-tile
probability summaries; instead compute MeanProbability as a weighted mean using
PredictedPixels (use sum(.PredictedPixels * .MeanProbability) /
sum(.PredictedPixels) with .sum_na helpers), compute MinProbability as the
global minimum of tile MinProbability (min(..., na.rm=TRUE)), MaxProbability as
the global maximum of tile MaxProbability (max(..., na.rm=TRUE)), and do not
attempt to recover MedianProbability from tile medians — set MedianProbability
to NA_real_ (or compute it from raw pixel probabilities if raw values are
available). Update the block that builds .aggregated (referencing .allStats,
buildSummaryRow, .sum_na, .mean_na) to implement the weighted mean and replace
the simple .mean_na calls for min/max/median as described.
In `@src/3-postprocess.R`:
- Around line 554-560: The lookup for
predictingOutputDataframe$ClassifiedProbability for a given timestep t returns
multiple paths and then file.exists() only checks the first element while
terra::rast() will silently open multilayer rasters; change the code that sets
predProbPath to collapse to a single filepath (e.g., pick the
first/unique/existing path for
predictingOutputDataframe$ClassifiedProbability[predictingOutputDataframe$Timestep
== t]) before calling file.exists() and terra::rast(), then pass that single
filepath into buildSummaryRow (same fix also for the similar block that uses
ClassifiedProbability at lines ~595-600); ensure you validate the chosen path
exists and is a single string prior to creating the raster.
In `@src/functions/0.5-histogram.r`:
- Around line 253-255: The code uses preds[, 2] assuming preds is a matrix or
data frame with at least two columns, which can cause errors if it has fewer.
Modify the selection logic in the relevant lines of the function to first check
if a column named "presence" exists in preds and use it if present, otherwise
use the second column if available, and finally fallback to the first column if
neither criteria are met. This defensive pattern ensures safe extraction of
prediction values regardless of preds' structure.
In `@src/installDependencies.r`:
- Line 46: The code hardcodes the Windows Conda library path ("Lib/R/library")
when computing env_lib; update the logic to choose "Lib" for Windows and "lib"
for Unix-like systems using an OS check (e.g. .Platform$OS.type or Sys.info())
so file.path(cp, <OS-specific folder>, "R", "library") points to the correct
Conda library folder; modify the env_lib assignment that references cp and
.libPaths() accordingly so both Windows and Unix installs resolve correctly.
- Around line 356-357: The cleanup code for installDependencies() currently
calls sink(type="message") and close(.msg_con) in the error handler and then
again later, which can attempt to close an already-closed connection; refactor
by extracting the sink/close logic into a single reusable cleanup function
(e.g., cleanup_message_conn()) that holds a local guard flag (e.g.,
.cleanup_done) or checks connection validity before running, and replace all
direct sink(type="message") / close(.msg_con) calls with a single call to
cleanup_message_conn(); ensure the function references .msg_con, calls
sink(type="message") and close(.msg_con) exactly once, and is invoked from both
the error handler and normal exit paths so cleanup runs only once.
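The "cleanup exactly once" guard described for `installDependencies()` can be illustrated with a small sketch (Python for illustration; the R version would put `sink(type = "message")` and `close(.msg_con)` behind the same kind of flag so both the error handler and the normal exit path can call it safely):

```python
class MessageSink:
    """Redirects messages to a log target and guarantees one-shot cleanup."""

    def __init__(self, target):
        self.target = target
        self._closed = False  # guard flag: cleanup must run exactly once

    def write(self, text):
        if not self._closed:
            self.target.append(text)

    def cleanup(self):
        # Safe to call from both the error handler and the normal exit path.
        if self._closed:
            return
        self._closed = True
        self.target.append("<sink closed>")

log = []
sink = MessageSink(log)
sink.write("installing...")
sink.cleanup()
sink.cleanup()  # second call is a no-op, not an error
print(log)
```

Without the guard, the second `cleanup()` would attempt to close an already-closed connection, which is exactly the failure mode the prompt describes.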
---
Outside diff comments:
In `@src/3-postprocess.R`:
- Around line 145-153: The code currently only checks for NA on
predictedPresenceFilepath before calling terra::rast, which still fails if the
path is stale; update the loop handling trainTimestepList so that after
computing predictedPresenceFilepath from trainingOutputDataframe (Timestep == t)
you also check file.exists(predictedPresenceFilepath) and skip (or log a
warning) when the file does not exist instead of calling rast; ensure this guard
is applied where predictedPresence is created so functions/variables like
predictedPresenceFilepath, predictedPresence, and the call to rast() are
protected.
In `@src/functions/0.2-training-and-prediction.r`:
- Around line 1281-1297: The PNG device opened via grDevices::png(filename =
png_file) is never closed; wrap the plotting call in tryCatch (or use
tryCatch(..., finally = {grDevices::dev.off()})) so that grDevices::dev.off() is
always called regardless of success or error. Specifically, modify the block
containing terra::plotRGB(rgb_rast, r = 1, g = 2, b = 3, stretch = "lin") and
the error handler that calls updateRunLog(...) to ensure a final
grDevices::dev.off() runs (or use on.exit(grDevices::dev.off()) immediately
after opening the device) so the PNG device is properly finalized.
---
Nitpick comments:
In `@Dockerfile`:
- Around line 84-91: The Dockerfile currently hardcodes the SyncroSim download
URL with version "3-1-24" in the RUN that creates /tmp/syncrosim and wget's
syncrosim.zip; make the version configurable by adding a build argument (e.g.,
SYNCROSIM_VERSION) and update the wget call to use that variable so future
updates only require changing the arg; optionally add a checksum or S3/mirror
fallback for stability and keep the rest of the steps (unzip, copy to
/opt/syncrosim, chmod, cleanup) unchanged but referenced via the same RUN
sequence.
- Around line 16-127: The image runs processes as root (Trivy DS-0002); confirm
whether SyncroSim requires root at runtime, and if not: create a dedicated
non-root user and group (e.g., "syncro"), chown runtime directories
(/opt/syncrosim, /opt/ecoClassify, /opt/ecoClassify/java and any other runtime
writeable paths created by RUN commands such as the SyncroSim install and
PATH/LD_LIBRARY_PATH locations), and switch to that user with USER before the
container entrypoint; ensure any build-time operations that need root remain
earlier in the Dockerfile (the existing RUN sequences that install packages,
Rscript::install_torch, conda/Miniconda steps, and mkdir/cp steps should keep
root), and verify that SyncroSim works under the non-root user before merging.
In `@src/2a-prepare-spatial-tiling.R`:
- Around line 100-107: The auto-tiling can produce fewer tiles than requested
because numTilesY is computed with floor; change the calculation so numTilesY
uses ceiling(nTiles / numTilesX) instead of floor(...) to ensure at least nTiles
tiles (reference symbols: nTiles, numTilesX, numTilesY) and update the
updateRunLog message to report the actual number of tiles (numTilesX *
numTilesY) so logs reflect the adjusted tiling.
- Around line 442-449: The current saveDatasheet call unconditionally sets
MaximumJobs = nValidTiles and can override a user-configured value; instead,
read the existing configured maximum (from the saved "core_Multiprocessing"
datasheet or the variable that holds the user's MaximumJobs), compute chosenJobs
<- ifelse(is.na(userMax) || userMax <= 0, nValidTiles, min(userMax,
nValidTiles)), and pass MaximumJobs = chosenJobs to saveDatasheet (keep
EnableMultiprocessing = TRUE); also update the updateRunLog message to report
the final chosenJobs value so it accurately reflects whether the user's setting
or the tile count limited parallelism.
In `@src/functions/0.1-pre-processing.r`:
- Around line 723-756: The function overrideBandNames currently calls
read.csv(..., header = TRUE) which will treat a missing header row as column
names and drop the first band; change the read.csv call to use header = FALSE so
every CSV row is treated as a band name (keep bandLabels[[1]] as the source of
names), and update the function docs/docstring to state the CSV must be one name
per row with no header (or explicitly allow an optional header and describe the
behavior). Ensure this change is applied in overrideBandNames and any associated
documentation strings referencing bandLabelFile.
In `@src/updates/update-2.4.xml`:
- Around line 35-41: The ALTER TABLE statements in the update (action Exec with
condition TableExists and criteria ecoClassify_Terminology in update-2.4.xml)
intentionally fail fast if any column already exists; to make this explicit, add
documentation or a datasheet note stating that package updates (updates 1.0–2.4)
are intended to run once and that the ecoClassify_Terminology columns (redBand,
greenBand, blueBand, nirBand, swirBand) are added during that one-time update,
or alternatively add a clarifying comment in the Terminology datasheet
definition referencing the update-2.4.xml change and the run-once expectation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: ea26417c-c11d-4d23-9e7c-7110280575c9
📒 Files selected for processing (20)
- .claude/settings.local.json
- .dockerignore
- .gitignore
- Dockerfile
- src/1-train-classifier.R
- src/2-predict.R
- src/2-predict.r
- src/2a-prepare-spatial-tiling.R
- src/3-postprocess.R
- src/ecoClassify-conda.yml
- src/functions/0.0-setup.r
- src/functions/0.1-pre-processing.r
- src/functions/0.2-training-and-prediction.r
- src/functions/0.4-maxent.r
- src/functions/0.4-random-forest.r
- src/functions/0.5-histogram.r
- src/functions/0.6-post-processing.r
- src/installDependencies.r
- src/package.xml
- src/updates/update-2.4.xml
💤 Files with no reviewable changes (1)
- src/2-predict.r
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/updates/update-2.1.xml (2)
**31-37: ⚠️ Potential issue | 🔴 Critical — This resets advanced classifier settings on upgrade.**

The columns dropped above are recreated here in a new table, but there is no backfill step, and the `UPDATE ... WHERE AdvancedClassifierOptionsId=1` is a no-op on an empty table. Existing `normalizeRasters`, `manualThreshold`, `applyContextualization`, etc. are lost as soon as the old columns are removed. Copy the existing values into `ecoClassify_AdvancedClassifierOptions` before dropping the source columns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/updates/update-2.1.xml` around lines 31 - 37, The migration creates a new ecoClassify_AdvancedClassifierOptions table and then updates AdvancedClassifierOptionsId=1, which is a no-op because no rows are migrated; as a result settings like normalizeRasters, manualThreshold, applyContextualization are lost. Before dropping the old columns/table, copy existing values into the new ecoClassify_AdvancedClassifierOptions (e.g., INSERT INTO ecoClassify_AdvancedClassifierOptions(...) SELECT ... FROM <old_table_or_renamed_table> WHERE AdvancedClassifierOptionsId=1 or INSERT ... SELECT DISTINCT to preserve rows), ensuring columns normalizeRasters, rasterDecimalPlaces, modelTuning, setManualThreshold, manualThreshold, applyContextualization, contextualizationWindowSize, setSeed and AdvancedClassifierOptionsId are populated; only then drop or rename the source columns so the UPDATE on ecoClassify_AdvancedClassifierOptions targets real migrated rows.
**31-42: ⚠️ Potential issue | 🟠 Major — `ScenarioId` is schema-only right now.**

`src/functions/0.1-pre-processing.r` (lines 116-119) still reads `ecoClassify_AdvancedClassifierOptions` without `ScenarioId` scoping, and `src/3-postprocess.R` (lines 78-81) does the same for `ecoClassify_PostProcessingRule`. In multi-scenario libraries, that can mix another scenario's options or rules into the current run. Ship the reader-side filtering in the same PR, or defer the new column until it is enforced end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/updates/update-2.1.xml` around lines 31 - 42, The R readers currently pull ecoClassify_AdvancedClassifierOptions and ecoClassify_PostProcessingRule without scoping by the new ScenarioId column, which can mix settings across scenarios; update the read queries in src/functions/0.1-pre-processing.r (the code that reads ecoClassify_AdvancedClassifierOptions) and src/3-postprocess.R (the code that reads ecoClassify_PostProcessingRule) to include a WHERE ScenarioId = <currentScenarioId> filter (use the existing scenario id variable passed into those functions or add one if missing), and fall back safely (e.g., return only rows for that ScenarioId or an empty/default set) so options/rules are enforced per-scenario. Ensure you reference the table names ecoClassify_AdvancedClassifierOptions and ecoClassify_PostProcessingRule and the scenario id variable used by the workflow when implementing the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/2-predict.R`:
- Around line 496-503: The code currently warns and persists partial results
when length(.allFiles) < tileCount; instead, do not save or publish partial
aggregates—either wait for an explicit completion signal or fail the run. In the
branch that checks length(.allFiles) < tileCount (using .allFiles, tileCount,
.maxWait), replace the update+save behavior with a hard failure (e.g., call
stop() or return an error) or implement a re-check loop that waits for a
completion flag from workers before calling ecoClassify_SummaryOutput; ensure
the final path that writes/publishes summary only executes when
length(.allFiles) == tileCount.
- Around line 529-530: The aggregation for MinProbability/MaxProbability
currently uses min(grp$MinProbability, na.rm=TRUE)/max(...) which returns
Inf/-Inf when all values are NA; update the logic around the MinProbability and
MaxProbability computations to first test whether all(is.na(grp$MinProbability))
/ all(is.na(grp$MaxProbability)) and return NA_real_ in that case, otherwise
compute min(..., na.rm=TRUE) or max(..., na.rm=TRUE); locate and change the two
lines that set MinProbability and MaxProbability in the grouping/summary block
that references grp.
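The first comment above asks for a hard failure instead of persisting partial aggregates. A hedged sketch, keeping the `.allFiles`, `tileCount`, and `.maxWait` names from the review; the polling loop and `tileOutputDir` are assumptions about how tile outputs are collected:

```r
# Re-check for tile outputs until all arrive or the wait budget is spent.
waited <- 0
while (length(.allFiles) < tileCount && waited < .maxWait) {
  Sys.sleep(5)
  waited <- waited + 5
  .allFiles <- list.files(tileOutputDir, pattern = "\\.csv$",
                          full.names = TRUE)
}

# Hard failure: never write/publish a partial summary.
if (length(.allFiles) < tileCount) {
  stop(sprintf(
    "Only %d of %d tile outputs found after %d seconds; refusing to save a partial summary.",
    length(.allFiles), tileCount, waited))
}

# Only reached when length(.allFiles) == tileCount, e.g.:
# saveDatasheet(myScenario, summaryDf, "ecoClassify_SummaryOutput")
```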
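For the second comment, the all-NA guard can be factored into small helpers; this is a sketch rather than the package's code, with `grp` taken from the review:

```r
# NA-safe aggregation: return NA_real_ when every value is NA, avoiding the
# Inf/-Inf that min()/max() with na.rm = TRUE produce on all-NA input.
safeMin <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safeMax <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

# safeMin(c(NA, NA))   -> NA
# safeMin(c(0.2, NA))  -> 0.2

# In the grouping/summary block:
# MinProbability = safeMin(grp$MinProbability)
# MaxProbability = safeMax(grp$MaxProbability)
```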
---
Outside diff comments:
In `@src/updates/update-2.1.xml`:
- Around line 31-37: The migration creates a new
ecoClassify_AdvancedClassifierOptions table and then updates
AdvancedClassifierOptionsId=1, which is a no-op because no rows are migrated; as
a result settings like normalizeRasters, manualThreshold, applyContextualization
are lost. Before dropping the old columns/table, copy existing values into the
new ecoClassify_AdvancedClassifierOptions (e.g., INSERT INTO
ecoClassify_AdvancedClassifierOptions(...) SELECT ... FROM
<old_table_or_renamed_table> WHERE AdvancedClassifierOptionsId=1 or INSERT ...
SELECT DISTINCT to preserve rows), ensuring columns normalizeRasters,
rasterDecimalPlaces, modelTuning, setManualThreshold, manualThreshold,
applyContextualization, contextualizationWindowSize, setSeed and
AdvancedClassifierOptionsId are populated; only then drop or rename the source
columns so the UPDATE on ecoClassify_AdvancedClassifierOptions targets real
migrated rows.
- Around line 31-42: The R readers currently pull
ecoClassify_AdvancedClassifierOptions and ecoClassify_PostProcessingRule without
scoping by the new ScenarioId column, which can mix settings across scenarios;
update the read queries in src/functions/0.1-pre-processing.r (the code that
reads ecoClassify_AdvancedClassifierOptions) and src/3-postprocess.R (the code
that reads ecoClassify_PostProcessingRule) to include a WHERE ScenarioId =
<currentScenarioId> filter (use the existing scenario id variable passed into
those functions or add one if missing), and fall back safely (e.g., return only
rows for that ScenarioId or an empty/default set) so options/rules are enforced
per-scenario. Ensure you reference the table names
ecoClassify_AdvancedClassifierOptions and ecoClassify_PostProcessingRule and the
scenario id variable used by the workflow when implementing the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: af2b010a-eab0-42f8-89e5-f54d02c42ef3
📒 Files selected for processing (2)
src/2-predict.R, src/updates/update-2.1.xml
🧹 Nitpick comments (2)
src/3-postprocess.R (2)
460-471: Comment mentions tile cropping but no implementation present. Line 460 states "crop to tile extent when running as a spatial tile job" but the subsequent code loads the rule raster without any cropping logic. This comment may be outdated or indicate incomplete implementation.

```diff
- # Load rule raster; crop to tile extent when running as a spatial tile job
+ # Load rule raster
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/3-postprocess.R` around lines 460 - 471, The comment promises cropping the loaded rule raster to the current tile extent but the code just calls rast(rulePath); either remove the misleading comment or implement cropping: after loading with rast(rulePath) apply terra::crop (or terra::ext + crop) using the job’s tile extent variable (the same extent used elsewhere for spatial tiles), conditioned on the spatial-tile flag used in this module (e.g., the job/spatialTile boolean you use elsewhere); keep the updateRunLog and terraOptions(todisk = old_todisk) behavior and ensure the raster is replaced by the cropped result before continuing.
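If the cropping is implemented rather than the comment removed, a hedged sketch with `terra` follows; `isSpatialTile` and `tileExtent` are hypothetical stand-ins for the flag and extent the module uses elsewhere for tile jobs, and only `rast(rulePath)` comes from the original code:

```r
library(terra)

# Load rule raster; crop to tile extent when running as a spatial tile job.
ruleRaster <- rast(rulePath)
if (isSpatialTile) {
  # tileExtent is assumed to be a SpatExtent shared by the tile workflow.
  ruleRaster <- crop(ruleRaster, tileExtent)
}
```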
554-556: Inconsistent de-duplication approach between training and predicting loops. The predicting loop uses a more defensive pattern (`na.omit(unique(...))[1]` with validation) compared to the simpler `[1]` indexing in the training loop (e.g., line 371). While both approaches work, consider aligning them for consistency and maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/3-postprocess.R` around lines 554 - 556, The predicting loop's extraction of predProbPath uses a defensive de-duplication/validation pattern (na.omit(unique(...))[1] plus checks) while the training loop uses plain [1]; make them consistent by applying the same defensive approach to the training-side extraction (or conversely simplify predicting to the training pattern) so both use the same de-duplication and NA/character validation; update the code that computes predProbPath in the training loop to mirror the logic used with predictingOutputDataframe$ClassifiedProbability / $Timestep (or change the predicting predProbPath assignment to use the simpler [1] indexing), ensuring you reference and adjust the same variable names (predProbPath, predictingOutputDataframe, ClassifiedProbability, Timestep) across both loops.
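A sketch of the defensive pattern applied to the training-side lookup, mirroring the predicting loop. The data frame, column, and loop-variable names (`trainingOutputDataframe`, `t`) are assumptions modeled on the `predictingOutputDataframe` names in the review:

```r
# Training-loop extraction, aligned with the predicting loop's defensive
# na.omit(unique(...))[1] pattern plus validation.
rowsForTimestep <- trainingOutputDataframe[
  trainingOutputDataframe$Timestep == t, , drop = FALSE]

predProbPath <- na.omit(unique(rowsForTimestep$ClassifiedProbability))[1]

if (is.na(predProbPath) || !nzchar(predProbPath)) {
  stop(sprintf("No classified probability raster found for timestep %s", t))
}
```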
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/3-postprocess.R`:
- Around line 460-471: The comment promises cropping the loaded rule raster to
the current tile extent but the code just calls rast(rulePath); either remove
the misleading comment or implement cropping: after loading with rast(rulePath)
apply terra::crop (or terra::ext + crop) using the job’s tile extent variable
(the same extent used elsewhere for spatial tiles), conditioned on the
spatial-tile flag used in this module (e.g., the job/spatialTile boolean you use
elsewhere); keep the updateRunLog and terraOptions(todisk = old_todisk) behavior
and ensure the raster is replaced by the cropped result before continuing.
- Around line 554-556: The predicting loop's extraction of predProbPath uses a
defensive de-duplication/validation pattern (na.omit(unique(...))[1] plus
checks) while the training loop uses plain [1]; make them consistent by applying
the same defensive approach to the training-side extraction (or conversely
simplify predicting to the training pattern) so both use the same de-duplication
and NA/character validation; update the code that computes predProbPath in the
training loop to mirror the logic used with
predictingOutputDataframe$ClassifiedProbability / $Timestep (or change the
predicting predProbPath assignment to use the simpler [1] indexing), ensuring
you reference and adjust the same variable names (predProbPath,
predictingOutputDataframe, ClassifiedProbability, Timestep) across both loops.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 2208ebd1-ada4-46ca-984c-d20563fc5e39
📒 Files selected for processing (1)
src/3-postprocess.R
Summary by CodeRabbit
- New Features
- Improvements
- Bug Fixes
- Updates
- Chores