
Add Extended Isolation Forest (EIF) support to the Scala/Spark isolation-forest library. #79

Open
jverbus wants to merge 39 commits into master from eif_scala

Conversation

jverbus (Contributor) commented Mar 12, 2026

Summary

This PR adds Extended Isolation Forest (EIF) support to the Scala/Spark isolation-forest library.

The new implementation introduces a Spark ML Estimator/Model pair for EIF, adds save/load support for the extended tree format, expands test coverage substantially, and updates the README with usage examples, parameter documentation, benchmarks, and clear notes about ONNX support.

In addition to the new EIF functionality, this PR also refactors shared training/persistence utilities and improves robustness and backward compatibility for the existing standard IsolationForest implementation.

Background

The existing library only supported the standard Isolation Forest algorithm, which uses axis-aligned splits. That works well in many cases, but it can introduce directional bias and may underperform when anomalies lie along correlated or non-axis-aligned directions.

Extended Isolation Forest addresses this by replacing axis-aligned splits with random hyperplane splits, making the detector more rotationally invariant and better suited for data with correlated features.

This PR implements EIF based on the original paper (Hariri et al., 2018) and validates behavior against the reference Python implementation (sahandha/eif).

What this PR adds

1. New Extended Isolation Forest implementation

Added a new package:

  • com.linkedin.relevance.isolationforest.extended

with the following new public Spark ML APIs:

  • ExtendedIsolationForest
  • ExtendedIsolationForestModel

and the supporting implementation types:

  • ExtendedIsolationForestParams
  • ExtendedIsolationTree
  • ExtendedNodes
  • ExtendedUtils
  • ExtendedIsolationForestModelReadWrite

2. Random hyperplane splits for EIF

Each EIF tree now isolates points using random hyperplanes instead of single-feature thresholds.

Implementation details:

  • hyperplane normals are sampled randomly and L2-normalized
  • hyperplanes are stored sparsely
  • only the non-zero coordinates are persisted and used at score time
  • coordinates are canonicalized into sorted order for stable persistence/debugging
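The sampling scheme above can be sketched in isolation. This is illustrative code, not the library's internals; `sampleSparseNormal` and the object name are hypothetical:

```scala
import scala.util.Random

// Illustrative sketch of drawing one EIF split normal: pick up to
// (extensionLevel + 1) distinct coordinates, draw Gaussian weights for them,
// L2-normalize, and keep the result in sorted sparse (indices, weights) form.
object HyperplaneSamplingSketch {
  def sampleSparseNormal(
      numFeatures: Int,
      extensionLevel: Int,
      rng: Random): (Array[Int], Array[Double]) = {
    // Canonicalize the active coordinates into sorted order for stable storage.
    val indices =
      rng.shuffle((0 until numFeatures).toList).take(extensionLevel + 1).sorted.toArray
    var weights = indices.map(_ => rng.nextGaussian())
    val norm = math.sqrt(weights.map(w => w * w).sum)
    if (norm > 0.0) weights = weights.map(_ / norm) // guard the zero-norm draw
    (indices, weights)
  }
}
```

Gaussian components are one standard way to get a rotation-invariant direction; the PR's actual sampling details may differ.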

3. extensionLevel parameter

EIF introduces a new parameter:

  • extensionLevel

Behavior:

  • 0 -> axis-aligned EIF splits
  • numFeatures - 1 -> fully extended hyperplanes
  • if unset, it defaults at fit time to the fully extended setting for the resolved feature subspace

Important semantics captured in the implementation and docs:

  • extensionLevel is interpreted relative to the resolved per-tree feature subspace
  • when maxFeatures < 1.0, the valid extensionLevel range is based on that subspace, not the original dataset dimension
  • invalid values fail fast during fit
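These semantics can be sketched as a small resolution function (hypothetical helper; the real resolution lives inside fit()):

```scala
// Illustrative resolution of extensionLevel against the resolved per-tree
// feature subspace: unset defaults to fully extended, out-of-range fails fast.
object ExtensionLevelSketch {
  def resolve(userSetting: Option[Int], numSubspaceFeatures: Int): Int = {
    val maxLevel = numSubspaceFeatures - 1
    userSetting match {
      case None => maxLevel // default at fit time: fully extended
      case Some(level) =>
        require(
          level >= 0 && level <= maxLevel,
          s"extensionLevel $level must be in [0, $maxLevel] for a " +
            s"$numSubspaceFeatures-feature subspace")
        level
    }
  }
}
```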

4. EIF model save/load support

Added full persistence support for ExtendedIsolationForestModel.

This includes:

  • metadata read/write
  • Avro serialization of extended trees
  • reconstruction of tree structure on load
  • persistence of sparse hyperplane data:
    • indices
    • weights
    • offset

A saved EIF model can now be loaded with:

val model = ExtendedIsolationForestModel.load(path)

Standard Isolation Forest improvements included in this PR

While adding EIF, this PR also cleans up and strengthens the existing standard IF implementation.

1. Shared training/schema utilities

Refactored common logic into reusable helpers so both standard IF and EIF use the same code paths where appropriate.

New shared helpers include:

  • schema validation/output column appending
  • parameter resolution for maxFeatures / maxSamples
  • sampled dataset creation
  • tree training orchestration
  • threshold computation
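The threshold step is worth a concrete sketch: the contamination parameter only picks a cutoff over the training scores, it never changes the scores themselves. The names below are illustrative, not the shared helper's actual signature:

```scala
// Illustrative contamination thresholding: choose a score threshold so that
// approximately `contamination` fraction of the training scores exceed it;
// contamination = 0 disables positive predictions entirely.
object ThresholdSketch {
  def computeThreshold(scores: Array[Double], contamination: Double): Double = {
    require(contamination >= 0.0 && contamination <= 1.0)
    if (contamination == 0.0) Double.PositiveInfinity // label nothing
    else {
      val sorted = scores.sorted(Ordering[Double].reverse)
      val k = math.max(1, math.round(contamination * scores.length).toInt)
      sorted(k - 1) // the k-th highest score becomes the threshold
    }
  }
}
```

With a `score >= threshold` prediction rule, roughly the requested fraction of points is labeled anomalous.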

2. Shared read/write metadata utilities

Extracted common metadata handling into:

  • core/IsolationForestModelReadWriteUtils.scala

This reduces duplication between standard IF and EIF persistence code.

3. Track totalNumFeatures in models

Standard IsolationForestModel now records:

  • numFeatures = resolved per-tree feature count
  • totalNumFeatures = full training input dimension

This enables feature-dimension validation during scoring and makes persisted model metadata more explicit.

4. Backward-compatible standard IF constructor behavior

To preserve compatibility, standard IF now has:

  • an internal 5-argument constructor that carries totalNumFeatures
  • a restored public 4-argument constructor that preserves the legacy unknown-dimension path

This keeps older construction patterns working while allowing newly trained or newly loaded models to retain full dimensionality metadata.

Related to that, ExtendedIsolationForestModel keeps its 5-argument constructor internal-only so EIF model construction remains package-scoped.

5. Legacy model loading support

Older saved standard IF models that do not contain totalNumFeatures metadata still load successfully.

In that case the model follows an unknown-dimension path and skips feature-size validation.

6. More robust scoring behavior

Standard and extended models now fail fast for invalid scoring scenarios such as:

  • empty ensembles
  • numSamples < 2
  • feature vectors whose dimension does not match the model’s known training dimension
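These checks amount to a few fail-fast guards; a schematic version (names are assumptions, not the actual internals):

```scala
// Schematic fail-fast guards for scoring, mirroring the rules listed above.
object ScoringGuardsSketch {
  def validate(
      numTrees: Int,
      numSamples: Long,
      featureVectorSize: Int,
      knownTrainingDim: Option[Int]): Unit = {
    require(numTrees > 0, "cannot score with an empty ensemble")
    require(numSamples >= 2, s"numSamples must be >= 2, got $numSamples")
    // Only validate dimension when the training dimension is known (legacy
    // models loaded without totalNumFeatures skip this check).
    knownTrainingDim.foreach { dim =>
      require(
        featureVectorSize == dim,
        s"feature vector size $featureVectorSize != training dimension $dim")
    }
  }
}
```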

7. Small validation/robustness fixes

Additional guardrails added in shared training logic:

  • resolved maxSamples must be at least 2
  • explicit failure if a training partition ends up with zero samples
  • schema validation logic is centralized and reused

EIF implementation details worth calling out

A few implementation choices are deliberate and important.

Degenerate splits and zero-size leaves

EIF can legitimately produce degenerate hyperplane splits where all points land on one side.
To match the EIF algorithm/reference behavior, this implementation:

  • does not retry the split
  • allows zero-size leaves via ExtendedExternalNode(0)

This is different from the standard IF behavior, where axis-aligned feature selection retries until a splittable feature is found.
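Zero-size leaves interact cleanly with scoring because the standard path-length normalizer c(n) from Liu et al. (2008) contributes nothing for n <= 1. A sketch of that convention (illustrative object name):

```scala
// Expected path length c(n) of an unsuccessful BST search (Liu et al. 2008),
// with the zero-size-leaf convention: leaves holding 0 or 1 points add 0.
object PathLengthSketch {
  private val EulerGamma = 0.5772156649
  def avgPathLength(n: Long): Double =
    if (n <= 1) 0.0
    else if (n == 2) 1.0
    else 2.0 * (math.log(n - 1.0) + EulerGamma) - 2.0 * (n - 1.0) / n
}
```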

Sparse hyperplane representation

Hyperplanes are stored in the original feature space using only the non-zero coordinates.
This keeps:

  • node-local storage proportional to the number of active coordinates
  • dot-product work proportional to the active coordinates, not the full input dimension
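A minimal sketch of the branch test under this representation (illustrative names; `offset` stores the n·p intercept term):

```scala
// Sparse hyperplane branch test: cost is proportional to the number of active
// coordinates, not the full input dimension. The strict < matches the
// reference implementation's (x - p)·n < 0 predicate.
final case class SparseHyperplaneSketch(
    indices: Array[Int],    // sorted, distinct active feature indices
    weights: Array[Double], // matching non-zero normal components
    offset: Double) {       // n · p for the sampled intercept point p
  def goesLeft(features: Array[Double]): Boolean = {
    var dot = 0.0
    var i = 0
    while (i < indices.length) {
      dot += weights(i) * features(indices(i))
      i += 1
    }
    dot < offset
  }
}
```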

ExtendedIF_0 is not identical to standard IF

Although extensionLevel = 0 yields axis-aligned hyperplanes, it is still EIF behavior, not standard IF behavior.

Key differences include:

  • standard IF retries when a constant feature is selected
  • EIF commits to the random draw
  • RNG consumption patterns differ

This distinction is now documented in the README and discussed explicitly in the benchmark section.


Documentation updates

The README was expanded significantly to reflect the new functionality.

Added documentation for EIF

New sections include:

  • Extended Isolation Forest overview
  • when to use EIF vs standard IF
  • extensionLevel parameter semantics
  • Scala usage example
  • save/load example

Clarified ONNX support

The README now explicitly states:

  • ONNX export is supported for standard IsolationForestModel only
  • EIF is not currently supported by the ONNX converter because the converter assumes axis-aligned tree ensembles

Benchmarks section updated

The README benchmark section now compares:

  • StandardIF
  • ExtendedIF_0
  • ExtendedIF_max
  • Liu et al. 2008 results
  • the reference Python EIF implementation (sahandha/eif)

It also summarizes where EIF helps most and where it may underperform.

Additional README cleanup

Also included:

  • dependency examples updated to current artifact naming/version style
  • version references updated from Spark 3.5.1 to 3.5.5
  • citation section added
  • Scala/Python example cleanup and typo fixes

Tests added / updated

This PR adds substantial new test coverage for both EIF and supporting standard IF changes.

New EIF test suites

Added:

  • ExtendedIsolationForestTest
  • ExtendedIsolationForestModelWriteReadTest
  • ExtendedIsolationTreeTest

Coverage includes:

  • training/scoring on benchmark datasets
  • zero contamination behavior
  • default and explicit extensionLevel
  • invalid extensionLevel handling
  • persistence round-trip
  • saved model structure snapshot tests
  • identical-feature edge cases
  • zero-size leaf handling
  • sparse hyperplane validation
  • L2 normalization of hyperplane normals
  • feature-dimension validation
  • empty model / invalid model transform failures

Updated standard IF tests

Expanded existing standard IF coverage to validate:

  • totalNumFeatures persistence
  • feature-dimension validation at score time
  • empty model transform failure
  • invalid numSamples handling
  • legacy constructor compatibility
  • legacy saved model loading without totalNumFeatures

Added / updated test resources

Added snapshot resources for EIF model persistence, including:

  • saved extended model metadata
  • saved extended tree Avro data
  • expected extended tree structure text fixture

Updated standard saved model metadata fixture to include totalNumFeatures.


Backward compatibility

API compatibility

  • Existing standard IsolationForest API remains available.
  • The legacy public 4-arg IsolationForestModel constructor is preserved.
  • Existing standard model load paths are preserved.

Persistence compatibility

  • Newly saved standard IF models now include totalNumFeatures.
  • Older saved standard IF models without that metadata still load successfully.

ONNX compatibility

  • No new ONNX behavior is introduced for EIF.
  • Existing standard IF ONNX flow remains unchanged.
  • EIF is intentionally documented as unsupported for ONNX export.

Behavior changes to be aware of

This PR introduces a few stricter validation behaviors:

  • very small maxSamples values that resolve to fewer than 2 samples now fail fast
  • transform now rejects empty ensembles
  • scoring validates feature vector size when training dimension is known

These are intentional robustness improvements.


Validation

Automated test validation

Verified with:

./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.IsolationForestModelWriteReadTest --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestModelWriteReadTest
./gradlew -g /tmp/codex-gradle-home :isolation-forest:test

Benchmark validation in README

The README now includes a detailed benchmark comparison across 13 datasets for:

  • standard IF
  • EIF with extensionLevel = 0
  • fully extended EIF
  • Liu et al. 2008 paper results
  • the reference Python EIF implementation

The headline result is that:

  • StandardIF remains in strong agreement with the original Liu et al. paper
  • ExtendedIF_max closely tracks the reference Python EIF implementation across the benchmark suite, with differences generally within a couple of standard errors

Additional edge-case study (not included in this PR)

In addition to the checked-in unit/integration tests and README benchmark comparison, we also ran a separate detailed hyperparameter / edge-case study outside the PR. This is not part of the checked-in test suite, but it gives extra confidence in the implementation.

Summary: 61 / 61 checks passed in ~20s

Hyperparameter behavior

  • Extension level (ionosphere, 33D): AUROC increases monotonically from 0.860 (ext=0) to 0.908 (ext=32), confirming that higher extension levels capture more complex anomaly structure.
  • Number of trees (breastw): AUROC is stable at about 0.985+ across 10–200 trees. More trees help slightly, but returns diminish beyond ~50 trees.
  • Sample size (pima): AUROC ranges from 0.63–0.67 across 32–512 samples. Larger samples are slightly better. Fractional maxSamples=0.5 behaves correctly.
  • Max features: Halving the feature subspace (maxFeatures=0.5) has minimal effect on standard IF but slightly improves EIF on ionosphere (0.908 -> 0.913).
  • Bootstrap: on/off produces nearly identical results for both IF and EIF.

Correctness checks

  • Contamination only affects the prediction threshold, not the scores themselves. contamination=0 labels nothing; contamination=0.35 labels about 35.6%.
  • Seed reproducibility: same seed gives bit-identical scores; different seeds produce different scores.
  • Save/load: Avro round-trip preserves scores exactly (max diff = 0).
  • IF vs EIF ext=0: similar but not identical AUROC (diff 0.002–0.017), confirming that they are related but algorithmically distinct.
  • Score distribution: all scores remain in [0,1], and anomalies consistently score higher than normals.
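The bounded score range is expected: both variants use the standard anomaly score s(x, n) = 2^(-E[h(x)] / c(n)) from Liu et al., which maps any non-negative expected path length into (0, 1]. A self-contained sketch:

```scala
// Anomaly score s(x, n) = 2^(-E[h(x)] / c(n)); shorter average paths (easier
// isolation) push the score toward 1, longer paths toward 0.
object AnomalyScoreSketch {
  private val EulerGamma = 0.5772156649
  private def c(n: Long): Double =
    if (n <= 1) 0.0
    else if (n == 2) 1.0
    else 2.0 * (math.log(n - 1.0) + EulerGamma) - 2.0 * (n - 1.0) / n
  def score(expectedPathLength: Double, numSamples: Long): Double =
    math.pow(2.0, -expectedPathLength / c(numSamples))
}
```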

Validation and edge cases

  • extensionLevel > dim - 1 is correctly rejected.
  • maxSamples > n is correctly rejected; maxSamples = n and fractional values behave correctly.
  • 1D/2D data: both IF and EIF handle low-dimensional data well (AUROC 0.97–0.99).
  • Constant features: no degradation; IF and EIF both isolate using the non-constant dimensions.
  • All features constant: does not crash. IF assigns 0.5 everywhere; EIF assigns 0.291 everywhere.
  • Tiny dataset (n=3, ss=2): works and produces identical scores for all points. A single-row dataset is correctly rejected.
Raw output from the external edge-case study
======================================================================
  Isolation Forest / Extended IF — Hyperparameter & Edge Case Tests
======================================================================

=== Test: Extension Level Sweep (ionosphere, 33D) ===
  [PASS] ext_sweep ext=0 — AUROC=0.8600
  [PASS] ext_sweep ext=1 — AUROC=0.8733
  [PASS] ext_sweep ext=4 — AUROC=0.8884
  [PASS] ext_sweep ext=8 — AUROC=0.9027
  [PASS] ext_sweep ext=16 — AUROC=0.9063
  [PASS] ext_sweep ext=32 — AUROC=0.9079
  [PASS] ext_sweep ext=0 != ext=max — ext=0 AUROC=0.8600, ext=max AUROC=0.9079

=== Test: Number of Trees Sweep (breastw) ===
  [PASS] num_trees n=10 — AUROC=0.9851
  [PASS] num_trees n=50 — AUROC=0.9862
  [PASS] num_trees n=100 — AUROC=0.9865
  [PASS] num_trees n=200 — AUROC=0.9880
  [PASS] num_trees monotonic trend — n=10: 0.9851, n=100: 0.9865

=== Test: Sample Size Sweep (pima) ===
  [PASS] sample_size ss=32 — AUROC=0.6411
  [PASS] sample_size ss=64 — AUROC=0.6329
  [PASS] sample_size ss=128 — AUROC=0.6380
  [PASS] sample_size ss=256 — AUROC=0.6569
  [PASS] sample_size ss=512 — AUROC=0.6698
  [PASS] sample_size fractional (0.5) — AUROC=0.6616

=== Test: Max Features / Feature Subspace (ionosphere) ===
  [PASS] max_features IF mf=0.5 — AUROC=0.8438
  [PASS] max_features EIF mf=0.5 — AUROC=0.9134
  [PASS] max_features IF mf=1.0 — AUROC=0.8431
  [PASS] max_features EIF mf=1.0 — AUROC=0.9079

=== Test: Bootstrap (breastw) ===
  [PASS] bootstrap IF bs=false — AUROC=0.9865
  [PASS] bootstrap EIF bs=false — AUROC=0.9832
  [PASS] bootstrap IF bs=true — AUROC=0.9878
  [PASS] bootstrap EIF bs=true — AUROC=0.9825

=== Test: Contamination & Threshold (breastw) ===
  [PASS] contamination=0 no predictions — predicted 0 anomalies (expected 0)
  [PASS] contamination=0.35 fraction — predicted fraction=0.356 (expected ~0.35)
  [PASS] contamination does not affect scores — max score diff=0.00e+00
  [PASS] EIF contamination=0 no predictions — predicted 0 anomalies (expected 0)

=== Test: Seed Reproducibility ===
  [PASS] IF seed reproducibility — max diff=0.00e+00
  [PASS] IF different seeds differ — max diff=0.0412
  [PASS] EIF seed reproducibility — max diff=0.00e+00

=== Test: Save/Load Round-Trip ===
  [PASS] IF save/load scores match — max diff=0.00e+00
  [PASS] EIF save/load scores match — max diff=0.00e+00

=== Test: Standard IF vs EIF ext=0 ===
  [PASS] IF_vs_EIF0 ionosphere — IF=0.8431, EIF_0=0.8600, diff=0.0169
  [PASS] IF_vs_EIF0 breastw — IF=0.9865, EIF_0=0.9881, diff=0.0016
  [PASS] IF_vs_EIF0 pima — IF=0.6569, EIF_0=0.6665, diff=0.0096

=== Test: Score Distribution Properties ===
  [PASS] score_range IF in [0,1] — min=0.3382, max=0.6915
  [PASS] score_separation IF — mean_normal=0.3842, mean_anomaly=0.5741
  [PASS] score_range EIF in [0,1] — min=0.3376, max=0.6724
  [PASS] score_separation EIF — mean_normal=0.3699, mean_anomaly=0.5409

=== Test: Extension Level Validation ===
  [PASS] ext_level > dim-1 rejected — raised: requirement failed: parameter extensionLevel given invalid value 8, but must be
  [PASS] ext_level = dim-1 accepted — ext=7 on 8D data

=== Test: Edge Case — Small Dataset (< default sample_size) ===
  [PASS] small_dataset maxSamples>n rejected — raised: requirement failed: parameter maxSamples given invalid value 256.0 specifying th
  [PASS] small_dataset IF maxSamples=n — n=50, scores range=[0.4007, 0.5988]
  [PASS] small_dataset EIF maxSamples=n — n=50, scores range=[0.3952, 0.5522]
  [PASS] small_dataset IF maxSamples=30 — scores range=[0.4250, 0.5825]
  [PASS] small_dataset IF fractional maxSamples=0.5 — scores range=[0.4288, 0.5900]

=== Test: Edge Case — Low Dimensional (1D, 2D) ===
  [PASS] low_dim IF 1D — AUROC=0.9756
  [PASS] low_dim EIF 1D ext=0 — AUROC=0.9786
  [PASS] low_dim IF 2D — AUROC=0.9934
  [PASS] low_dim EIF 2D ext=1 — AUROC=0.9829

=== Test: Edge Case — Constant Features ===
  [PASS] constant_features IF — AUROC=0.9988
  [PASS] constant_features EIF ext=max — AUROC=0.9980
  [PASS] constant_features EIF ext=0 — AUROC=1.0000

=== Test: Edge Case — All Features Constant ===
  [PASS] all_constant IF no crash — score range=[0.5000, 0.5000], std=0.000000
  [PASS] all_constant EIF no crash — score range=[0.2910, 0.2910], std=0.000000

=== Test: Edge Case — Minimal Dataset ===
  [PASS] single_row IF rejected — raised: requirement failed: parameter maxSamples given invalid value 1.0 specifying the
  [PASS] tiny_dataset IF (n=3, ss=2) — scores=[0.011238790132930549, 0.011238790132930549, 0.011238790132930549]
  [PASS] tiny_dataset EIF (n=3, ss=2) — scores=[0.011238790132930549, 0.011238790132930549, 0.011238790132930549]

======================================================================
  SUMMARY: 61 passed, 0 failed (20.2s)
======================================================================

Why this design

This implementation tries to keep EIF aligned with the rest of the library:

  • same Spark ML Estimator / Model pattern as standard IF
  • same thresholding flow for contamination
  • same save/load ergonomics
  • maximum reuse of shared training and metadata logic

At the same time, it preserves the important algorithmic differences required for EIF correctness:

  • hyperplane-based tree structure
  • sparse hyperplane persistence
  • no retry on degenerate EIF splits
  • support for zero-size leaves

Limitations / follow-ups

Not included in this PR:

  • ONNX export for ExtendedIsolationForestModel

The current ONNX converter assumes axis-aligned tree ensembles, so EIF would require a different export representation or converter strategy.


References

jverbus and others added 30 commits March 9, 2026 16:31
…ng. Results look reasonable, but detailed correctness not yet verified.
…h Isolation Forest, improved docs, and parameter validation

- Renamed local variables in `ExtendedIsolationForest.scala` for clarity (`dataset` → `data`).
- Moved and refined parameter validation in `validateAndResolveParams`, logging chosen samples/features.
- Updated Javadoc-style comments in `ExtendedIsolationForest`, `ExtendedIsolationForestModel`, and related classes.
- Changed schema checks to use `VectorType` instead of `SQLDataTypes.VectorType`.
- Renamed and documented internal methods (e.g., `pathLengthInternal`) in `ExtendedIsolationTree`.
- Ensured consistent naming across `ExtendedIsolationForestModel` fields (e.g., `extendedIsolationTrees`).
- Cleaned up imports, minor style fixes, and removed commented-out debug prints.

There are still likely opportunities to factor out more shared logic into `core`.
…rcept sampling, ≤ semantics, and degeneracy handling

- Sample normal in the selected subspace with up to (extensionLevel+1) non‑zero coords; normalize and guard zero‑norm.
- Sample intercept as point p by drawing each active coordinate uniformly from that node’s data range; set offset = n·p.
- Use inclusive left-branch test x·n ≤ n·p in both training and scoring so the split predicate matches the paper.
- Treat minDot == maxDot or an empty partition as a leaf (stores numInstances); keeps trees well‑formed.
- Compute dot against a full‑length normal (zeros for unused coords) to match the (x − p)·n test.
- Minor: log message tweaks; one‑pass min/max scan instead of materializing arrays; consistent ≤ in train/score.
- No change to model IO or public params.
…afing

  Previously, a single failed split attempt (constant feature, all-same
  dot products, or empty partition) immediately produced a leaf node.
  This meant extensionLevel=0 was not equivalent to standard IF when the
  first randomly chosen feature happened to be constant. Now retries up
  to 50 times before falling back to a leaf.
  Remove Int.MaxValue-1 sentinel default. If the user sets extensionLevel
  above numFeatures-1, throw immediately. If unset, default to
  numFeatures-1 (fully extended). The resolved value is persisted in the
  model rather than the sentinel.
Guard against dataForTree.head crash when a partition receives zero
sampled points. Throws a clear IllegalStateException instead of a
confusing NoSuchElementException.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Divide path length sum by the actual number of trees in the model
  rather than the $(numEstimators) parameter, preventing model/param
  drift from producing incorrect anomaly scores.
…check

  Use recursive node-by-node comparison with epsilon tolerance for
  doubles instead of fragile toString matching.
  Add EIF section covering when to use it, the extensionLevel parameter
  and its interaction with maxFeatures, and a usage example. Call out
  that ONNX export is not supported for EIF. Add Hariri et al. 2018
  to references.
…ing estimator

  Change the split criterion in ExtendedIsolationTree from <= to strict <,
  matching both the reference implementation (sahandha/eif) and our own
  standard IsolationTree. Affects tree building (partition) and scoring
  (path traversal).

  Remove the set(extensionLevel, resolvedExtensionLevel) call in
  ExtendedIsolationForest.fit() that mutated the estimator. When
  extensionLevel was unset (defaulting to fully extended), the first
  fit() permanently set it, causing reuse on a dataset with fewer
  features to either fail validation or silently use the wrong level.
…retry loop

  Remove bounded retry loop for degenerate splits. Instead, follow the
  EIF paper and reference implementation: allow empty partitions to
  become ExtendedExternalNode(0) leaves. Change split predicate from
  <= to strict < to match reference implementation's (x-p)·n < 0.
  Relax ExtendedExternalNode to accept numInstances >= 0.
…_max results

  Replace the old IF-only benchmark table with comprehensive results
  across 13 datasets comparing all three model variants against Liu
  et al. and the reference Python EIF implementation.
Set the resolved extensionLevel on the estimator before copyValues so
it flows into the model's param map. Without this, models trained
without explicitly calling setExtensionLevel() would lose the effective
value on save/load. Add test covering default resolution and
round-trip persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…igned splits

Exercise the numInstances >= 0 semantics that became first-class EIF
behavior when degenerate hyperplane splits were allowed to produce empty
children. New tests cover:
- ExtendedExternalNode(0) construction and subtreeDepth
- Path length through a zero-size leaf contributes avgPathLength(0) = 0
- Save/load round-trip preserves a tree containing a zero-size leaf
- extensionLevel=0 produces strictly axis-aligned normals (1 non-zero coordinate)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use "closely matches" for ExtendedIF_max reference comparison
- Note mulcross as an open outlier in ExtendedIF_0 parity (12 of 13)
- Describe extensionLevel=0 as "uses axis-aligned splits" instead of
  "recovers standard axis-aligned splits"
- Frame low-dimensional underperformance as our benchmark observation,
  not a broad established finding from the paper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Uncomment savedExtendedIsolationForestModelTreeStructureTest and add the
  required resource files: a saved ExtendedIsolationForestModel and its
  expected first-tree toString golden file. This provides a regression
  guard against accidental changes to tree serialization or structure.
…od into SharedTrainLogic (where the other shared training helpers already live). Both IsolationForest.scala and ExtendedIsolationForest.scala now call the single shared implementation, passing $(maxFeatures) and $(maxSamples) as arguments.

  Files changed:
  - core/SharedTrainLogic.scala — added validateAndResolveParams(dataset, maxFeatures, maxSamples) method and its ResolvedParams import
  - IsolationForest.scala — removed private method, updated import and call site
  - extended/ExtendedIsolationForest.scala — removed private method, updated import and call site
…ansformSchema

  All four Estimator/Model classes had identical 15-line transformSchema
  overrides. Extract the shared logic into Utils and delegate with a
  one-liner in each class.
…comparison style

  - Remove unused IsolationForestModel import from ExtendedIsolationForestModelReadWrite
  - Fix reader docstring that incorrectly said "standard" instead of "Extended"
  - Change `outlierScoreThreshold > 0.0` to `> 0` to match standard IF style
…l, and intermediate levels

  - Verify all hyperplane normals are L2-normalized across extension levels and seeds
  - Verify extensionLevel > numFeatures - 1 throws IllegalArgumentException at fit time
  - Verify intermediate extensionLevel values (1-4) train valid models with reasonable AUROC
…ationForestModel

  - Remove unnecessary self-import of ExtendedIsolationForestModel in ReadWrite file
  - Fix companion object and threshold comments that said "IsolationForestModel"
    instead of "ExtendedIsolationForestModel"
  Resolve the review issues uncovered while comparing the extended isolation
  forest branch against master and the EIF reference implementation.

  ExtendedIsolationForest
  - stop mutating the estimator with a resolved default extensionLevel during
    fit()
  - keep dataset-dependent extensionLevel resolution local to each fit and
    apply the resolved value only to the trained model
  - add a regression test that reuses the same estimator across datasets with
    different feature dimensions to ensure default extensionLevel does not
    leak across fits

  IsolationForestModel / ExtendedIsolationForestModel
  - fail fast when transform() is called on an empty ensemble instead of
    dividing by zero and producing invalid scores
  - keep scoring normalized by the actual loaded tree count, but guard the
    zero-tree case explicitly
  - add transform-throws coverage for manually constructed empty standard and
    extended models
  - preserve existing empty model write/read tests so persistence still
    round-trips this edge case correctly

  Tests and style cleanup
  - move ExtendedIsolationForestModelWriteReadTest into the
    com.linkedin.relevance.isolationforest.extended package so package names
    match file paths and the surrounding test suites
  - restore the Spark-derived attribution header on the moved/copied
    read-write helpers
  - align ExtendedIsolationForestModelReadWrite visibility with the rest of
    the package-private isolation forest internals

  Verification
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestTest
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.IsolationForestModelWriteReadTest --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestModelWriteReadTest
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:compileScala :isolation-forest:compileTestScala
jverbus added 7 commits March 10, 2026 19:17
  Update the README so the documented examples and version references match
  the current repo state and are copy-paste runnable.

  README updates
  - change the documented default Spark version from 3.5.1 to 3.5.5
  - update the example build command to use the current default Spark/Scala
    combination
  - replace stale hardcoded library and ONNX package versions with
    <latest-version> / <matching-version> placeholders
  - switch the Gradle dependency example from deprecated `compile` to
    `implementation`
  - add the missing `org.apache.spark.sql.functions.col` import to the Scala
    training example
  - fix the training example text to refer to the `label` column instead of
    `labels`
  - clarify the EIF `extensionLevel(5)` example comment so the dimensional
    assumption is explicit
  - define `dataset_name` and `num_examples_to_print` in the ONNX Python
    inference example so the snippet is runnable as written
  - remove the benchmark prose reference to a `LI IF` comparison column that
    is not present in the table

  This is a documentation-only change.
  Add Extended Isolation Forest (EIF) support alongside the existing standard
  Isolation Forest implementation, and harden the standard/extended model
  persistence and scoring paths.

  Extended Isolation Forest
  - add ExtendedIsolationForest estimator, ExtendedIsolationForestModel,
    ExtendedIsolationForestParams, ExtendedIsolationTree, ExtendedNodes, and
    ExtendedUtils
  - implement EIF training with extensionLevel-controlled random hyperplane
    splits based on the Hariri et al. algorithm
  - resolve extensionLevel per fit without mutating estimator state
  - support axis-aligned EIF (extensionLevel = 0) through fully extended EIF
    (extensionLevel = numFeatures - 1)

  Sparse EIF model representation
  - store hyperplanes sparsely as (indices, weights, offset) instead of dense
    per-node normal vectors
  - canonicalize stored sparse coordinates by sorting feature indices before
    constructing SplitHyperplane
  - use sparse dot products for tree traversal and add a direct Spark Vector
    scoring path so EIF scoring benefits from sparsity end to end
  - enforce sparse hyperplane invariants: non-empty, length-matched,
    non-negative, distinct, sorted indices

  Persistence and read/write refactor
  - move standard model read/write into a top-level
    IsolationForestModelReadWrite implementation
  - add shared metadata helpers in IsolationForestModelReadWriteUtils
  - add sparse EIF model read/write support and checked-in EIF persistence
    fixtures
  - preserve standard-model backward compatibility when loading older saved
    models that do not contain totalNumFeatures metadata, logging that
    dimension validation is unavailable for those legacy models

  Model/scoring hardening
  - reject numSamples values that resolve to fewer than 2 samples during
    training
  - fail fast when transform() is called on empty standard or extended models
  - store totalNumFeatures in newly saved models and validate scoring input
    dimension when that training dimension is known
  - keep standard IF backward compatibility by restoring the legacy public
    4-arg IsolationForestModel constructor while making the richer internal
    constructor package-private
  - restrict the extended model constructor to package-private use so
    totalNumFeatures remains internal to fit/load/copy flows
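
  The optional dimension check can be sketched as follows (hypothetical helper
  name, not the library's actual code): when totalNumFeatures is absent, as in
  legacy saved models, validation is skipped rather than failing.

  ```scala
  // Hypothetical sketch of scoring-time feature-dimension validation.
  // None means a legacy model without totalNumFeatures metadata: no check.
  def validateFeatureDim(totalNumFeatures: Option[Int], incoming: Int): Unit =
    totalNumFeatures.foreach { trained =>
      require(incoming == trained,
        s"Input has $incoming features but the model was trained on $trained")
    }
  ```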

  Tests
  - add comprehensive EIF estimator, tree, sparse-hyperplane, and write/read
    tests
  - add regression coverage for repeated EIF fits, empty-model scoring guards,
    numSamples >= 2 enforcement, scoring-time feature dimension validation,
    standard legacy metadata loading, and standard legacy constructor behavior
  - update saved model metadata/tree-structure fixtures for the new extended
    persistence format and formatting changes

  Documentation
  - refresh README dependency/version examples and fix copy-paste issues in the
    Scala and ONNX examples
  - add EIF usage and persistence examples
  - document benchmark results for standard IF vs EIF variants
  - fix benchmark/doc typos and soften the benchmark agreement statement to
    avoid overstating row-by-row verification

  Verification
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
  - Apply rounding to all value ± error pairs (1 significant figure on the
    error, 2 if its leading digit is 1)
  - Move Ref Python results from StandardIF to ExtendedIF_0 rows since
    the reference Python EIF at ext=0 is not a true standard IF
  - Add DOI to EIF paper reference and add reference Python eif repo
  - Clarify column headers (Liu et al., Ref. Python with IF/EIF labels)
  - Simplify key observations and fix overstated dimensionality claim
  - Minor wording improvements throughout
  1. Non-breaking spaces around ± — replaced ± with &nbsp;±&nbsp; in all value cells so values like 0.813 ± 0.004 won't wrap mid-value.
  2. Dashes in empty cells — all empty reference cells now show - instead of blank:
    - StandardIF rows: - in both Ref. Python columns
    - ExtendedIF rows: - in the Liu et al. column

Copilot AI left a comment


Pull request overview

This PR adds Extended Isolation Forest (EIF) as a first-class Spark ML Estimator/Model (random hyperplane splits), alongside refactors to share training and persistence utilities between standard IF and EIF, plus expanded tests and documentation updates.

Changes:

  • Introduces com.linkedin.relevance.isolationforest.extended (EIF) with training, scoring, and Avro-based save/load.
  • Refactors/centralizes shared training + schema validation + metadata utilities; improves standard IF persistence with totalNumFeatures and legacy-load handling.
  • Adds extensive new EIF and robustness tests; updates README with EIF usage, parameter semantics, benchmarks, and ONNX support notes.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.

Summary per file:

  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationTreeTest.scala: Adds unit tests for EIF tree behavior (splits, path length, sparse hyperplanes).
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestTest.scala: Adds integration-style EIF training/scoring tests and extensionLevel validation tests.
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModelWriteReadTest.scala: Adds EIF model save/load round-trip tests and structure snapshot tests.
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationTreeTest.scala: Fixes variable naming typo (featureIndices).
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationForestTest.scala: Adds validation test for invalid maxSamples resolving to < 2.
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationForestModelWriteReadTest.scala: Expands standard IF persistence/robustness tests (dimension validation, legacy metadata).
  • isolation-forest/src/test/resources/savedIsolationForestModel/metadata/part-00000: Updates saved-model fixture to include totalNumFeatures.
  • isolation-forest/src/test/resources/savedExtendedIsolationForestModel/metadata/part-00000: Adds EIF saved-model metadata fixture.
  • isolation-forest/src/test/resources/expectedExtendedTreeStructure.txt: Adds expected EIF tree structure snapshot text.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedUtils.scala: Adds sparse hyperplane representation and dot-product helpers.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedNodes.scala: Adds EIF node types (internal hyperplane split / external leaf).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationTree.scala: Implements EIF tree training and scoring (random hyperplane splits).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestParams.scala: Adds EIF-specific Spark Param (extensionLevel).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModelReadWrite.scala: Adds Avro persistence for EIF tree ensembles and model metadata.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModel.scala: Adds EIF Spark ML Model scoring + schema validation + writer wiring.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForest.scala: Adds EIF Spark ML Estimator training flow and param resolution.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/Utils.scala: Centralizes schema validation/output-column appending + feature-size validation.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/SharedTrainLogic.scala: Centralizes param resolution and adds training guardrails for small sample edge cases.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/NodesBase.scala: Adjusts shared node base traits to support EIF needs (e.g., zero-size leaves).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/IsolationForestModelReadWriteUtils.scala: Extracts shared model metadata read/write utilities.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/IsolationForestModelReadWrite.scala: Removes old core read/write implementation (replaced by refactored code).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/Nodes.scala: Ensures node toString behavior aligns with updated node base traits.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForestModelReadWrite.scala: Refactors standard IF model persistence to use shared metadata utils + totalNumFeatures.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForestModel.scala: Adds totalNumFeatures tracking and optional feature-dimension validation at scoring time.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForest.scala: Refactors training to use shared param resolution + shared schema validation.
  • README.md: Documents EIF usage/params/benchmarks and clarifies ONNX support limitations.


jverbus and others added 2 commits March 11, 2026 20:50
…ompatibility

  Spark 4.x's Avro encoder silently demotes Array[Double] elements to
  float (32-bit) precision during serialization, while scalar Double
  fields survive intact. This caused all five EIF model write/read tests
  to fail on Spark 4.0.1 and 4.1.1, with weight mismatches at ~1e-8
  (the exact double→float→double precision boundary).

  The fix changes SplitHyperplane weights from Array[Double] to
  Array[Float]. This is the correct design from first principles:
  - Features are already Array[Float] (DataPoint.features)
  - Weights define the hyperplane *direction* (analogous to the feature
    index in standard IF, which is just an Int)
  - The offset defines *where* to split and remains Double (analogous to
    splitValue in standard IF, which is Double for the same reason)
  - The dot product is accumulated in Double regardless of operand type
  - The split comparison (dot < offset) is always Double vs Double

  Weights are converted to float after normalization but before computing
  the offset, so training and scoring are consistent. Benchmarks confirm
  the change is invisible after rounding: only one value across all 13
  datasets changed (breastw ExtendedIF_max AUPRC: 0.9568 → 0.9569,
  well within the ±0.0015 error bar).
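
  The ~1e-8 scale of the observed mismatches is exactly what a
  double → float → double round-trip produces for values of this magnitude, as
  a quick standalone check illustrates (not the library's code, just the
  precision arithmetic):

  ```scala
  // A Double that survives Array[Double] persistence exactly would compare
  // equal after save/load; one demoted to float loses precision at roughly
  // the float ulp scale (for values near 0.1, ulp/2 is a few 1e-9).
  val w: Double = 0.123456789
  val roundTripped: Double = w.toFloat.toDouble
  val err = math.abs(w - roundTripped)
  println(f"original=$w%.12f roundTripped=$roundTripped%.12f err=$err%.2e")
  ```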

  Production code:
  - ExtendedUtils.scala: SplitHyperplane.weights Array[Double] → Array[Float]
  - ExtendedIsolationTree.scala: normalize to float before offset computation
  - ExtendedIsolationForestModelReadWrite.scala: ExtendedNodeData.weights
    and NullWeights updated to float

  Test code:
  - ExtendedIsolationTreeTest.scala: float literals, L2 norm tolerance
    widened from 1e-10 to 1e-6 (appropriate for float precision)
  - ExtendedIsolationForestModelWriteReadTest.scala: float literals,
    added disabled regenerateGoldenExtendedModel helper
  - Regenerated golden model and expected tree structure

  README:
  - Updated breastw ExtendedIF_max AUPRC from 0.9568 to 0.9569

  Verified all 67 tests pass on Spark 3.5.5, 4.0.1, and 4.1.1.
Copilot review responses:

1. Grammar fix (accepted): "some dataset" → "some datasets" in README
   benchmark observations.

2. .toFloat cast comment (accepted): Added clarifying comment explaining
   why features(indices(i)).toFloat is intentional — it matches the
   DataPoint (Array[Float]) precision used during training, ensuring
   scoring consistency between the DataPoint and Vector code paths.

3. Shuffle optimization (declined): Copilot suggested replacing
   Random.shuffle + take with reservoir sampling. The shuffle operates
   on a tiny array (≤ dim features, typically < 100 elements) once per
   tree node during training — not a hot path. Readability outweighs
   the micro-optimization.
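
   The retained approach from point 3 looks roughly like this (an illustrative
   sketch, not the library's exact code): shuffle the small index list once
   per node and take the first k, trading a micro-optimization for clarity.

   ```scala
   import scala.util.Random

   // Hypothetical helper: pick k distinct feature indices from a small
   // range. For typical dims (< 100) this runs once per tree node during
   // training and is not a hot path.
   def sampleIndices(numFeatures: Int, k: Int, rng: Random): Array[Int] =
     rng.shuffle((0 until numFeatures).toList).take(k).sorted.toArray
   ```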

4. outlierScoreThreshold > 0 sentinel (declined): Copilot noted that
   threshold=0.0 would be treated as "unset". Technically correct, but
   this mirrors the existing standard IF pattern identically. A
   threshold of 0.0 (label everything as outlier) is not a practical
   use case. Fixing it properly requires changing both IF and EIF
   together in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jverbus
Contributor Author

jverbus commented Mar 12, 2026


@CrustyAIBot CrustyAIBot left a comment


Deep review complete; looks good to merge. Remaining points are follow-up hardening items, not blockers.
