
Add Extended Isolation Forest (EIF) support to the Scala/Spark isolation-forest library. #79

Open
jverbus wants to merge 39 commits into master from eif_scala

Conversation

jverbus (Contributor) commented Mar 12, 2026

Summary

This PR adds Extended Isolation Forest (EIF) support to the Scala/Spark isolation-forest library.

The new implementation introduces a Spark ML Estimator/Model pair for EIF, adds save/load support for the extended tree format, expands test coverage substantially, and updates the README with usage examples, parameter documentation, benchmarks, and clear notes about ONNX support.

In addition to the new EIF functionality, this PR also refactors shared training/persistence utilities and improves robustness and backward compatibility for the existing standard IsolationForest implementation.

Background

The existing library only supported the standard Isolation Forest algorithm, which uses axis-aligned splits. That works well in many cases, but it can introduce directional bias and may underperform when anomalies lie along correlated or non-axis-aligned directions.

Extended Isolation Forest addresses this by replacing axis-aligned splits with random hyperplane splits, making the detector more rotationally invariant and better suited for data with correlated features.

This PR implements EIF based on the original paper (Hariri et al., 2018) and validates behavior against the reference Python implementation (sahandha/eif).

What this PR adds

1. New Extended Isolation Forest implementation

Added a new package:

  • com.linkedin.relevance.isolationforest.extended

with the following new public Spark ML APIs:

  • ExtendedIsolationForest
  • ExtendedIsolationForestModel

and the supporting implementation types:

  • ExtendedIsolationForestParams
  • ExtendedIsolationTree
  • ExtendedNodes
  • ExtendedUtils
  • ExtendedIsolationForestModelReadWrite

2. Random hyperplane splits for EIF

Each EIF tree now isolates points using random hyperplanes instead of single-feature thresholds.

Implementation details:

  • hyperplane normals are sampled randomly and L2-normalized
  • hyperplanes are stored sparsely
  • only the non-zero coordinates are persisted and used at score time
  • coordinates are canonicalized into sorted order for stable persistence/debugging
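The sampling scheme above can be sketched in isolation. This is illustrative code, not the library's internals; `sampleSparseNormal` and the object name are hypothetical:

```scala
import scala.util.Random

// Illustrative sketch of drawing one EIF split normal: pick up to
// (extensionLevel + 1) distinct coordinates, draw Gaussian weights for them,
// L2-normalize, and keep the result in sorted sparse (indices, weights) form.
object HyperplaneSamplingSketch {
  def sampleSparseNormal(
      numFeatures: Int,
      extensionLevel: Int,
      rng: Random): (Array[Int], Array[Double]) = {
    // Canonicalize the active coordinates into sorted order for stable storage.
    val indices =
      rng.shuffle((0 until numFeatures).toList).take(extensionLevel + 1).sorted.toArray
    var weights = indices.map(_ => rng.nextGaussian())
    val norm = math.sqrt(weights.map(w => w * w).sum)
    if (norm > 0.0) weights = weights.map(_ / norm) // guard the zero-norm draw
    (indices, weights)
  }
}
```

Gaussian components are one standard way to get a rotation-invariant direction; the PR's actual sampling details may differ.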

3. extensionLevel parameter

EIF introduces a new parameter:

  • extensionLevel

Behavior:

  • 0 -> axis-aligned EIF splits
  • numFeatures - 1 -> fully extended hyperplanes
  • if unset, it defaults at fit time to the fully extended setting for the resolved feature subspace

Important semantics captured in the implementation and docs:

  • extensionLevel is interpreted relative to the resolved per-tree feature subspace
  • when maxFeatures < 1.0, the valid extensionLevel range is based on that subspace, not the original dataset dimension
  • invalid values fail fast during fit
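These semantics can be sketched as a small resolution function (hypothetical helper; the real resolution lives inside fit()):

```scala
// Illustrative resolution of extensionLevel against the resolved per-tree
// feature subspace: unset defaults to fully extended, out-of-range fails fast.
object ExtensionLevelSketch {
  def resolve(userSetting: Option[Int], numSubspaceFeatures: Int): Int = {
    val maxLevel = numSubspaceFeatures - 1
    userSetting match {
      case None => maxLevel // default at fit time: fully extended
      case Some(level) =>
        require(
          level >= 0 && level <= maxLevel,
          s"extensionLevel $level must be in [0, $maxLevel] for a " +
            s"$numSubspaceFeatures-feature subspace")
        level
    }
  }
}
```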

4. EIF model save/load support

Added full persistence support for ExtendedIsolationForestModel.

This includes:

  • metadata read/write
  • Avro serialization of extended trees
  • reconstruction of tree structure on load
  • persistence of sparse hyperplane data:
    • indices
    • weights
    • offset

A saved EIF model can now be loaded with:

val model = ExtendedIsolationForestModel.load(path)

Standard Isolation Forest improvements included in this PR

While adding EIF, this PR also cleans up and strengthens the existing standard IF implementation.

1. Shared training/schema utilities

Refactored common logic into reusable helpers so both standard IF and EIF use the same code paths where appropriate.

New shared helpers include:

  • schema validation/output column appending
  • parameter resolution for maxFeatures / maxSamples
  • sampled dataset creation
  • tree training orchestration
  • threshold computation
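The threshold step is worth a concrete sketch: the contamination parameter only picks a cutoff over the training scores, it never changes the scores themselves. The names below are illustrative, not the shared helper's actual signature:

```scala
// Illustrative contamination thresholding: choose a score threshold so that
// approximately `contamination` fraction of the training scores exceed it;
// contamination = 0 disables positive predictions entirely.
object ThresholdSketch {
  def computeThreshold(scores: Array[Double], contamination: Double): Double = {
    require(contamination >= 0.0 && contamination <= 1.0)
    if (contamination == 0.0) Double.PositiveInfinity // label nothing
    else {
      val sorted = scores.sorted(Ordering[Double].reverse)
      val k = math.max(1, math.round(contamination * scores.length).toInt)
      sorted(k - 1) // the k-th highest score becomes the threshold
    }
  }
}
```

With a `score >= threshold` prediction rule, roughly the requested fraction of points is labeled anomalous.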

2. Shared read/write metadata utilities

Extracted common metadata handling into:

  • core/IsolationForestModelReadWriteUtils.scala

This reduces duplication between standard IF and EIF persistence code.

3. Track totalNumFeatures in models

Standard IsolationForestModel now records:

  • numFeatures = resolved per-tree feature count
  • totalNumFeatures = full training input dimension

This enables feature-dimension validation during scoring and makes persisted model metadata more explicit.

4. Backward-compatible standard IF constructor behavior

To preserve compatibility, standard IF now has:

  • an internal 5-argument constructor that carries totalNumFeatures
  • a restored public 4-argument constructor that preserves the legacy unknown-dimension path

This keeps older construction patterns working while allowing newly trained or newly loaded models to retain full dimensionality metadata.

Related to that, ExtendedIsolationForestModel keeps its 5-argument constructor internal-only so EIF model construction remains package-scoped.

5. Legacy model loading support

Older saved standard IF models that do not contain totalNumFeatures metadata still load successfully.

In that case the model follows an unknown-dimension path and skips feature-size validation.

6. More robust scoring behavior

Standard and extended models now fail fast for invalid scoring scenarios such as:

  • empty ensembles
  • numSamples < 2
  • feature vectors whose dimension does not match the model’s known training dimension
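These checks amount to a few fail-fast guards; a schematic version (names are assumptions, not the actual internals):

```scala
// Schematic fail-fast guards for scoring, mirroring the rules listed above.
object ScoringGuardsSketch {
  def validate(
      numTrees: Int,
      numSamples: Long,
      featureVectorSize: Int,
      knownTrainingDim: Option[Int]): Unit = {
    require(numTrees > 0, "cannot score with an empty ensemble")
    require(numSamples >= 2, s"numSamples must be >= 2, got $numSamples")
    // Only validate dimension when the training dimension is known (legacy
    // models loaded without totalNumFeatures skip this check).
    knownTrainingDim.foreach { dim =>
      require(
        featureVectorSize == dim,
        s"feature vector size $featureVectorSize != training dimension $dim")
    }
  }
}
```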

7. Small validation/robustness fixes

Additional guardrails added in shared training logic:

  • resolved maxSamples must be at least 2
  • explicit failure if a training partition ends up with zero samples
  • schema validation logic is centralized and reused

EIF implementation details worth calling out

A few implementation choices are deliberate and important.

Degenerate splits and zero-size leaves

EIF can legitimately produce degenerate hyperplane splits where all points land on one side.
To match the EIF algorithm/reference behavior, this implementation:

  • does not retry the split
  • allows zero-size leaves via ExtendedExternalNode(0)

This is different from the standard IF behavior, where axis-aligned feature selection retries until a splittable feature is found.
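Zero-size leaves interact cleanly with scoring because the standard path-length normalizer c(n) from Liu et al. (2008) contributes nothing for n <= 1. A sketch of that convention (illustrative object name):

```scala
// Expected path length c(n) of an unsuccessful BST search (Liu et al. 2008),
// with the zero-size-leaf convention: leaves holding 0 or 1 points add 0.
object PathLengthSketch {
  private val EulerGamma = 0.5772156649
  def avgPathLength(n: Long): Double =
    if (n <= 1) 0.0
    else if (n == 2) 1.0
    else 2.0 * (math.log(n - 1.0) + EulerGamma) - 2.0 * (n - 1.0) / n
}
```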

Sparse hyperplane representation

Hyperplanes are stored in the original feature space using only the non-zero coordinates.
This keeps:

  • node-local storage proportional to the number of active coordinates
  • dot-product work proportional to the active coordinates, not the full input dimension
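A minimal sketch of the branch test under this representation (illustrative names; `offset` stores the n·p intercept term):

```scala
// Sparse hyperplane branch test: cost is proportional to the number of active
// coordinates, not the full input dimension. The strict < matches the
// reference implementation's (x - p)·n < 0 predicate.
final case class SparseHyperplaneSketch(
    indices: Array[Int],    // sorted, distinct active feature indices
    weights: Array[Double], // matching non-zero normal components
    offset: Double) {       // n · p for the sampled intercept point p
  def goesLeft(features: Array[Double]): Boolean = {
    var dot = 0.0
    var i = 0
    while (i < indices.length) {
      dot += weights(i) * features(indices(i))
      i += 1
    }
    dot < offset
  }
}
```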

ExtendedIF_0 is not identical to standard IF

Although extensionLevel = 0 yields axis-aligned hyperplanes, it is still EIF behavior, not standard IF behavior.

Key differences include:

  • standard IF retries when a constant feature is selected
  • EIF commits to the random draw
  • RNG consumption patterns differ

This distinction is now documented in the README and discussed explicitly in the benchmark section.


Documentation updates

The README was expanded significantly to reflect the new functionality.

Added documentation for EIF

New sections include:

  • Extended Isolation Forest overview
  • when to use EIF vs standard IF
  • extensionLevel parameter semantics
  • Scala usage example
  • save/load example

Clarified ONNX support

The README now explicitly states:

  • ONNX export is supported for standard IsolationForestModel only
  • EIF is not currently supported by the ONNX converter because the converter assumes axis-aligned tree ensembles

Benchmarks section updated

The README benchmark section now compares:

  • StandardIF
  • ExtendedIF_0
  • ExtendedIF_max
  • Liu et al. 2008 results
  • the reference Python EIF implementation (sahandha/eif)

It also summarizes where EIF helps most and where it may underperform.

Additional README cleanup

Also included:

  • dependency examples updated to current artifact naming/version style
  • version references updated from Spark 3.5.1 to 3.5.5
  • citation section added
  • Scala/Python example cleanup and typo fixes

Tests added / updated

This PR adds substantial new test coverage for both EIF and supporting standard IF changes.

New EIF test suites

Added:

  • ExtendedIsolationForestTest
  • ExtendedIsolationForestModelWriteReadTest
  • ExtendedIsolationTreeTest

Coverage includes:

  • training/scoring on benchmark datasets
  • zero contamination behavior
  • default and explicit extensionLevel
  • invalid extensionLevel handling
  • persistence round-trip
  • saved model structure snapshot tests
  • identical-feature edge cases
  • zero-size leaf handling
  • sparse hyperplane validation
  • L2 normalization of hyperplane normals
  • feature-dimension validation
  • empty model / invalid model transform failures

Updated standard IF tests

Expanded existing standard IF coverage to validate:

  • totalNumFeatures persistence
  • feature-dimension validation at score time
  • empty model transform failure
  • invalid numSamples handling
  • legacy constructor compatibility
  • legacy saved model loading without totalNumFeatures

Added / updated test resources

Added snapshot resources for EIF model persistence, including:

  • saved extended model metadata
  • saved extended tree Avro data
  • expected extended tree structure text fixture

Updated standard saved model metadata fixture to include totalNumFeatures.


Backward compatibility

API compatibility

  • Existing standard IsolationForest API remains available.
  • The legacy public 4-arg IsolationForestModel constructor is preserved.
  • Existing standard model load paths are preserved.

Persistence compatibility

  • Newly saved standard IF models now include totalNumFeatures.
  • Older saved standard IF models without that metadata still load successfully.

ONNX compatibility

  • No new ONNX behavior is introduced for EIF.
  • Existing standard IF ONNX flow remains unchanged.
  • EIF is intentionally documented as unsupported for ONNX export.

Behavior changes to be aware of

This PR introduces a few stricter validation behaviors:

  • very small maxSamples values that resolve to fewer than 2 samples now fail fast
  • transform now rejects empty ensembles
  • scoring validates feature vector size when training dimension is known

These are intentional robustness improvements.


Validation

Automated test validation

Verified with:

./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.IsolationForestModelWriteReadTest --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestModelWriteReadTest
./gradlew -g /tmp/codex-gradle-home :isolation-forest:test

Benchmark validation in README

The README now includes a detailed benchmark comparison across 13 datasets for:

  • standard IF
  • EIF with extensionLevel = 0
  • fully extended EIF
  • Liu et al. 2008 paper results
  • the reference Python EIF implementation

The headline result is that:

  • StandardIF remains in strong agreement with the original Liu et al. paper
  • ExtendedIF_max closely tracks the reference Python EIF implementation across the benchmark suite, with differences generally within a couple of standard errors

Additional edge-case study (not included in this PR)

In addition to the checked-in unit/integration tests and README benchmark comparison, we also ran a separate detailed hyperparameter / edge-case study outside the PR. This is not part of the checked-in test suite, but it gives extra confidence in the implementation.

Summary: 61 / 61 checks passed in ~20s

Hyperparameter behavior

  • Extension level (ionosphere, 33D): AUROC increases monotonically from 0.860 (ext=0) to 0.908 (ext=32), confirming that higher extension levels capture more complex anomaly structure.
  • Number of trees (breastw): AUROC is stable at about 0.985+ across 10–200 trees. More trees help slightly, but returns diminish beyond ~50 trees.
  • Sample size (pima): AUROC ranges from 0.63–0.67 across 32–512 samples. Larger samples are slightly better. Fractional maxSamples=0.5 behaves correctly.
  • Max features: Halving the feature subspace (maxFeatures=0.5) has minimal effect on standard IF but slightly improves EIF on ionosphere (0.908 -> 0.913).
  • Bootstrap: on/off produces nearly identical results for both IF and EIF.

Correctness checks

  • Contamination only affects the prediction threshold, not the scores themselves. contamination=0 labels nothing; contamination=0.35 labels about 35.6%.
  • Seed reproducibility: same seed gives bit-identical scores; different seeds produce different scores.
  • Save/load: Avro round-trip preserves scores exactly (max diff = 0).
  • IF vs EIF ext=0: similar but not identical AUROC (diff 0.002–0.017), confirming that they are related but algorithmically distinct.
  • Score distribution: all scores remain in [0,1], and anomalies consistently score higher than normals.
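The bounded score range is expected: both variants use the standard anomaly score s(x, n) = 2^(-E[h(x)] / c(n)) from Liu et al., which maps any non-negative expected path length into (0, 1]. A self-contained sketch:

```scala
// Anomaly score s(x, n) = 2^(-E[h(x)] / c(n)); shorter average paths (easier
// isolation) push the score toward 1, longer paths toward 0.
object AnomalyScoreSketch {
  private val EulerGamma = 0.5772156649
  private def c(n: Long): Double =
    if (n <= 1) 0.0
    else if (n == 2) 1.0
    else 2.0 * (math.log(n - 1.0) + EulerGamma) - 2.0 * (n - 1.0) / n
  def score(expectedPathLength: Double, numSamples: Long): Double =
    math.pow(2.0, -expectedPathLength / c(numSamples))
}
```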

Validation and edge cases

  • extensionLevel > dim - 1 is correctly rejected.
  • maxSamples > n is correctly rejected; maxSamples = n and fractional values behave correctly.
  • 1D/2D data: both IF and EIF handle low-dimensional data well (AUROC 0.97–0.99).
  • Constant features: no degradation; IF and EIF both isolate using the non-constant dimensions.
  • All features constant: does not crash. IF assigns 0.5 everywhere; EIF assigns 0.291 everywhere.
  • Tiny dataset (n=3, ss=2): works and produces identical scores for all points. A single-row dataset is correctly rejected.
Raw output from the external edge-case study
======================================================================
  Isolation Forest / Extended IF — Hyperparameter & Edge Case Tests
======================================================================

=== Test: Extension Level Sweep (ionosphere, 33D) ===
  [PASS] ext_sweep ext=0 — AUROC=0.8600
  [PASS] ext_sweep ext=1 — AUROC=0.8733
  [PASS] ext_sweep ext=4 — AUROC=0.8884
  [PASS] ext_sweep ext=8 — AUROC=0.9027
  [PASS] ext_sweep ext=16 — AUROC=0.9063
  [PASS] ext_sweep ext=32 — AUROC=0.9079
  [PASS] ext_sweep ext=0 != ext=max — ext=0 AUROC=0.8600, ext=max AUROC=0.9079

=== Test: Number of Trees Sweep (breastw) ===
  [PASS] num_trees n=10 — AUROC=0.9851
  [PASS] num_trees n=50 — AUROC=0.9862
  [PASS] num_trees n=100 — AUROC=0.9865
  [PASS] num_trees n=200 — AUROC=0.9880
  [PASS] num_trees monotonic trend — n=10: 0.9851, n=100: 0.9865

=== Test: Sample Size Sweep (pima) ===
  [PASS] sample_size ss=32 — AUROC=0.6411
  [PASS] sample_size ss=64 — AUROC=0.6329
  [PASS] sample_size ss=128 — AUROC=0.6380
  [PASS] sample_size ss=256 — AUROC=0.6569
  [PASS] sample_size ss=512 — AUROC=0.6698
  [PASS] sample_size fractional (0.5) — AUROC=0.6616

=== Test: Max Features / Feature Subspace (ionosphere) ===
  [PASS] max_features IF mf=0.5 — AUROC=0.8438
  [PASS] max_features EIF mf=0.5 — AUROC=0.9134
  [PASS] max_features IF mf=1.0 — AUROC=0.8431
  [PASS] max_features EIF mf=1.0 — AUROC=0.9079

=== Test: Bootstrap (breastw) ===
  [PASS] bootstrap IF bs=false — AUROC=0.9865
  [PASS] bootstrap EIF bs=false — AUROC=0.9832
  [PASS] bootstrap IF bs=true — AUROC=0.9878
  [PASS] bootstrap EIF bs=true — AUROC=0.9825

=== Test: Contamination & Threshold (breastw) ===
  [PASS] contamination=0 no predictions — predicted 0 anomalies (expected 0)
  [PASS] contamination=0.35 fraction — predicted fraction=0.356 (expected ~0.35)
  [PASS] contamination does not affect scores — max score diff=0.00e+00
  [PASS] EIF contamination=0 no predictions — predicted 0 anomalies (expected 0)

=== Test: Seed Reproducibility ===
  [PASS] IF seed reproducibility — max diff=0.00e+00
  [PASS] IF different seeds differ — max diff=0.0412
  [PASS] EIF seed reproducibility — max diff=0.00e+00

=== Test: Save/Load Round-Trip ===
  [PASS] IF save/load scores match — max diff=0.00e+00
  [PASS] EIF save/load scores match — max diff=0.00e+00

=== Test: Standard IF vs EIF ext=0 ===
  [PASS] IF_vs_EIF0 ionosphere — IF=0.8431, EIF_0=0.8600, diff=0.0169
  [PASS] IF_vs_EIF0 breastw — IF=0.9865, EIF_0=0.9881, diff=0.0016
  [PASS] IF_vs_EIF0 pima — IF=0.6569, EIF_0=0.6665, diff=0.0096

=== Test: Score Distribution Properties ===
  [PASS] score_range IF in [0,1] — min=0.3382, max=0.6915
  [PASS] score_separation IF — mean_normal=0.3842, mean_anomaly=0.5741
  [PASS] score_range EIF in [0,1] — min=0.3376, max=0.6724
  [PASS] score_separation EIF — mean_normal=0.3699, mean_anomaly=0.5409

=== Test: Extension Level Validation ===
  [PASS] ext_level > dim-1 rejected — raised: requirement failed: parameter extensionLevel given invalid value 8, but must be
  [PASS] ext_level = dim-1 accepted — ext=7 on 8D data

=== Test: Edge Case — Small Dataset (< default sample_size) ===
  [PASS] small_dataset maxSamples>n rejected — raised: requirement failed: parameter maxSamples given invalid value 256.0 specifying th
  [PASS] small_dataset IF maxSamples=n — n=50, scores range=[0.4007, 0.5988]
  [PASS] small_dataset EIF maxSamples=n — n=50, scores range=[0.3952, 0.5522]
  [PASS] small_dataset IF maxSamples=30 — scores range=[0.4250, 0.5825]
  [PASS] small_dataset IF fractional maxSamples=0.5 — scores range=[0.4288, 0.5900]

=== Test: Edge Case — Low Dimensional (1D, 2D) ===
  [PASS] low_dim IF 1D — AUROC=0.9756
  [PASS] low_dim EIF 1D ext=0 — AUROC=0.9786
  [PASS] low_dim IF 2D — AUROC=0.9934
  [PASS] low_dim EIF 2D ext=1 — AUROC=0.9829

=== Test: Edge Case — Constant Features ===
  [PASS] constant_features IF — AUROC=0.9988
  [PASS] constant_features EIF ext=max — AUROC=0.9980
  [PASS] constant_features EIF ext=0 — AUROC=1.0000

=== Test: Edge Case — All Features Constant ===
  [PASS] all_constant IF no crash — score range=[0.5000, 0.5000], std=0.000000
  [PASS] all_constant EIF no crash — score range=[0.2910, 0.2910], std=0.000000

=== Test: Edge Case — Minimal Dataset ===
  [PASS] single_row IF rejected — raised: requirement failed: parameter maxSamples given invalid value 1.0 specifying the
  [PASS] tiny_dataset IF (n=3, ss=2) — scores=[0.011238790132930549, 0.011238790132930549, 0.011238790132930549]
  [PASS] tiny_dataset EIF (n=3, ss=2) — scores=[0.011238790132930549, 0.011238790132930549, 0.011238790132930549]

======================================================================
  SUMMARY: 61 passed, 0 failed (20.2s)
======================================================================

Why this design

This implementation tries to keep EIF aligned with the rest of the library:

  • same Spark ML Estimator / Model pattern as standard IF
  • same thresholding flow for contamination
  • same save/load ergonomics
  • maximum reuse of shared training and metadata logic

At the same time, it preserves the important algorithmic differences required for EIF correctness:

  • hyperplane-based tree structure
  • sparse hyperplane persistence
  • no retry on degenerate EIF splits
  • support for zero-size leaves

Limitations / follow-ups

Not included in this PR:

  • ONNX export for ExtendedIsolationForestModel

The current ONNX converter assumes axis-aligned tree ensembles, so EIF would require a different export representation or converter strategy.


References

jverbus and others added 30 commits March 9, 2026 16:31
…ng. Results look reasonable, but detailed correctness not yet verified.
…h Isolation Forest, improved docs, and parameter validation

- Renamed local variables in `ExtendedIsolationForest.scala` for clarity (`dataset` → `data`).
- Moved and refined parameter validation in `validateAndResolveParams`, logging chosen samples/features.
- Updated Javadoc-style comments in `ExtendedIsolationForest`, `ExtendedIsolationForestModel`, and related classes.
- Changed schema checks to use `VectorType` instead of `SQLDataTypes.VectorType`.
- Renamed and documented internal methods (e.g., `pathLengthInternal`) in `ExtendedIsolationTree`.
- Ensured consistent naming across `ExtendedIsolationForestModel` fields (e.g., `extendedIsolationTrees`).
- Cleaned up imports, minor style fixes, and removed commented-out debug prints.

There are still likely opportunities to factor out more shared logic into `core`.
…rcept sampling, ≤ semantics, and degeneracy handling

- Sample normal in the selected subspace with up to (extensionLevel+1) non‑zero coords; normalize and guard zero‑norm.
- Sample intercept as point p by drawing each active coordinate uniformly from that node’s data range; set offset = n·p.
- Use inclusive left-branch test x·n ≤ n·p in both training and scoring so the split predicate matches the paper.
- Treat minDot == maxDot or an empty partition as a leaf (stores numInstances); keeps trees well‑formed.
- Compute dot against a full‑length normal (zeros for unused coords) to match the (x − p)·n test.
- Minor: log message tweaks; one‑pass min/max scan instead of materializing arrays; consistent ≤ in train/score.
- No change to model IO or public params.
…afing

  Previously, a single failed split attempt (constant feature, all-same
  dot products, or empty partition) immediately produced a leaf node.
  This meant extensionLevel=0 was not equivalent to standard IF when the
  first randomly chosen feature happened to be constant. Now retries up
  to 50 times before falling back to a leaf.
  Remove Int.MaxValue-1 sentinel default. If the user sets extensionLevel
  above numFeatures-1, throw immediately. If unset, default to
  numFeatures-1 (fully extended). The resolved value is persisted in the
  model rather than the sentinel.
Guard against dataForTree.head crash when a partition receives zero
sampled points. Throws a clear IllegalStateException instead of a
confusing NoSuchElementException.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Divide path length sum by the actual number of trees in the model
  rather than the $(numEstimators) parameter, preventing model/param
  drift from producing incorrect anomaly scores.
…check

  Use recursive node-by-node comparison with epsilon tolerance for
  doubles instead of fragile toString matching.
  Add EIF section covering when to use it, the extensionLevel parameter
  and its interaction with maxFeatures, and a usage example. Call out
  that ONNX export is not supported for EIF. Add Hariri et al. 2018
  to references.
…ing estimator

  Change the split criterion in ExtendedIsolationTree from <= to strict <,
  matching both the reference implementation (sahandha/eif) and our own
  standard IsolationTree. Affects tree building (partition) and scoring
  (path traversal).

  Remove the set(extensionLevel, resolvedExtensionLevel) call in
  ExtendedIsolationForest.fit() that mutated the estimator. When
  extensionLevel was unset (defaulting to fully extended), the first
  fit() permanently set it, causing reuse on a dataset with fewer
  features to either fail validation or silently use the wrong level.
…retry loop

  Remove bounded retry loop for degenerate splits. Instead, follow the
  EIF paper and reference implementation: allow empty partitions to
  become ExtendedExternalNode(0) leaves. Change split predicate from
  <= to strict < to match reference implementation's (x-p)·n < 0.
  Relax ExtendedExternalNode to accept numInstances >= 0.
…_max results

  Replace the old IF-only benchmark table with comprehensive results
  across 13 datasets comparing all three model variants against Liu
  et al. and the reference Python EIF implementation.
Set the resolved extensionLevel on the estimator before copyValues so
it flows into the model's param map. Without this, models trained
without explicitly calling setExtensionLevel() would lose the effective
value on save/load. Add test covering default resolution and
round-trip persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…igned splits

Exercise the numInstances >= 0 semantics that became first-class EIF
behavior when degenerate hyperplane splits were allowed to produce empty
children. New tests cover:
- ExtendedExternalNode(0) construction and subtreeDepth
- Path length through a zero-size leaf contributes avgPathLength(0) = 0
- Save/load round-trip preserves a tree containing a zero-size leaf
- extensionLevel=0 produces strictly axis-aligned normals (1 non-zero coordinate)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use "closely matches" for ExtendedIF_max reference comparison
- Note mulcross as an open outlier in ExtendedIF_0 parity (12 of 13)
- Describe extensionLevel=0 as "uses axis-aligned splits" instead of
  "recovers standard axis-aligned splits"
- Frame low-dimensional underperformance as our benchmark observation,
  not a broad established finding from the paper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Uncomment savedExtendedIsolationForestModelTreeStructureTest and add the
  required resource files: a saved ExtendedIsolationForestModel and its
  expected first-tree toString golden file. This provides a regression
  guard against accidental changes to tree serialization or structure.
…od into SharedTrainLogic (where the other shared training helpers already live). Both IsolationForest.scala and ExtendedIsolationForest.scala now call the single shared implementation, passing $(maxFeatures) and $(maxSamples) as arguments.

  Files changed:
  - core/SharedTrainLogic.scala — added validateAndResolveParams(dataset, maxFeatures, maxSamples) method and its ResolvedParams import
  - IsolationForest.scala — removed private method, updated import and call site
  - extended/ExtendedIsolationForest.scala — removed private method, updated import and call site
…ansformSchema

  All four Estimator/Model classes had identical 15-line transformSchema
  overrides. Extract the shared logic into Utils and delegate with a
  one-liner in each class.
…comparison style

  - Remove unused IsolationForestModel import from ExtendedIsolationForestModelReadWrite
  - Fix reader docstring that incorrectly said "standard" instead of "Extended"
  - Change `outlierScoreThreshold > 0.0` to `> 0` to match standard IF style
…l, and intermediate levels

  - Verify all hyperplane normals are L2-normalized across extension levels and seeds
  - Verify extensionLevel > numFeatures - 1 throws IllegalArgumentException at fit time
  - Verify intermediate extensionLevel values (1-4) train valid models with reasonable AUROC
…ationForestModel

  - Remove unnecessary self-import of ExtendedIsolationForestModel in ReadWrite file
  - Fix companion object and threshold comments that said "IsolationForestModel"
    instead of "ExtendedIsolationForestModel"
  Resolve the review issues uncovered while comparing the extended isolation
  forest branch against master and the EIF reference implementation.

  ExtendedIsolationForest
  - stop mutating the estimator with a resolved default extensionLevel during
    fit()
  - keep dataset-dependent extensionLevel resolution local to each fit and
    apply the resolved value only to the trained model
  - add a regression test that reuses the same estimator across datasets with
    different feature dimensions to ensure default extensionLevel does not
    leak across fits

  IsolationForestModel / ExtendedIsolationForestModel
  - fail fast when transform() is called on an empty ensemble instead of
    dividing by zero and producing invalid scores
  - keep scoring normalized by the actual loaded tree count, but guard the
    zero-tree case explicitly
  - add transform-throws coverage for manually constructed empty standard and
    extended models
  - preserve existing empty model write/read tests so persistence still
    round-trips this edge case correctly

  Tests and style cleanup
  - move ExtendedIsolationForestModelWriteReadTest into the
    com.linkedin.relevance.isolationforest.extended package so package names
    match file paths and the surrounding test suites
  - restore the Spark-derived attribution header on the moved/copied
    read-write helpers
  - align ExtendedIsolationForestModelReadWrite visibility with the rest of
    the package-private isolation forest internals

  Verification
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestTest
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.IsolationForestModelWriteReadTest --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestModelWriteReadTest
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:compileScala :isolation-forest:compileTestScala
jverbus added 7 commits March 10, 2026 19:17
  Update the README so the documented examples and version references match
  the current repo state and are copy-paste runnable.

  README updates
  - change the documented default Spark version from 3.5.1 to 3.5.5
  - update the example build command to use the current default Spark/Scala
    combination
  - replace stale hardcoded library and ONNX package versions with
    <latest-version> / <matching-version> placeholders
  - switch the Gradle dependency example from deprecated `compile` to
    `implementation`
  - add the missing `org.apache.spark.sql.functions.col` import to the Scala
    training example
  - fix the training example text to refer to the `label` column instead of
    `labels`
  - clarify the EIF `extensionLevel(5)` example comment so the dimensional
    assumption is explicit
  - define `dataset_name` and `num_examples_to_print` in the ONNX Python
    inference example so the snippet is runnable as written
  - remove the benchmark prose reference to a `LI IF` comparison column that
    is not present in the table

  This is a documentation-only change.
  Add Extended Isolation Forest (EIF) support alongside the existing standard
  Isolation Forest implementation, and harden the standard/extended model
  persistence and scoring paths.

  Extended Isolation Forest
  - add ExtendedIsolationForest estimator, ExtendedIsolationForestModel,
    ExtendedIsolationForestParams, ExtendedIsolationTree, ExtendedNodes, and
    ExtendedUtils
  - implement EIF training with extensionLevel-controlled random hyperplane
    splits based on the Hariri et al. algorithm
  - resolve extensionLevel per fit without mutating estimator state
  - support axis-aligned EIF (extensionLevel = 0) through fully extended EIF
    (extensionLevel = numFeatures - 1)

  Sparse EIF model representation
  - store hyperplanes sparsely as (indices, weights, offset) instead of dense
    per-node normal vectors
  - canonicalize stored sparse coordinates by sorting feature indices before
    constructing SplitHyperplane
  - use sparse dot products for tree traversal and add a direct Spark Vector
    scoring path so EIF scoring benefits from sparsity end to end
  - enforce sparse hyperplane invariants: non-empty, length-matched,
    non-negative, distinct, sorted indices

  Persistence and read/write refactor
  - move standard model read/write into a top-level
    IsolationForestModelReadWrite implementation
  - add shared metadata helpers in IsolationForestModelReadWriteUtils
  - add sparse EIF model read/write support and checked-in EIF persistence
    fixtures
  - preserve standard-model backward compatibility when loading older saved
    models that do not contain totalNumFeatures metadata, logging that
    dimension validation is unavailable for those legacy models

  Model/scoring hardening
  - reject numSamples values that resolve to fewer than 2 samples during
    training
  - fail fast when transform() is called on empty standard or extended models
  - store totalNumFeatures in newly saved models and validate scoring input
    dimension when that training dimension is known
  - keep standard IF backward compatibility by restoring the legacy public
    4-arg IsolationForestModel constructor while making the richer internal
    constructor package-private
  - restrict the extended model constructor to package-private use so
    totalNumFeatures remains internal to fit/load/copy flows
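
  The optional dimension check can be sketched as follows (hypothetical helper
  name, not the library's actual code): when totalNumFeatures is absent, as in
  legacy saved models, validation is skipped rather than failing.

  ```scala
  // Hypothetical sketch of scoring-time feature-dimension validation.
  // None means a legacy model without totalNumFeatures metadata: no check.
  def validateFeatureDim(totalNumFeatures: Option[Int], incoming: Int): Unit =
    totalNumFeatures.foreach { trained =>
      require(incoming == trained,
        s"Input has $incoming features but the model was trained on $trained")
    }
  ```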

  Tests
  - add comprehensive EIF estimator, tree, sparse-hyperplane, and write/read
    tests
  - add regression coverage for repeated EIF fits, empty-model scoring guards,
    numSamples >= 2 enforcement, scoring-time feature dimension validation,
    standard legacy metadata loading, and standard legacy constructor behavior
  - update saved model metadata/tree-structure fixtures for the new extended
    persistence format and formatting changes

  Documentation
  - refresh README dependency/version examples and fix copy-paste issues in the
    Scala and ONNX examples
  - add EIF usage and persistence examples
  - document benchmark results for standard IF vs EIF variants
  - fix benchmark/doc typos and soften the benchmark agreement statement to
    avoid overstating row-by-row verification

  Verification
  - ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
  - Apply rounding to all value ± error pairs (1 significant figure on the
    error, 2 if its leading digit is 1)
  - Move Ref Python results from StandardIF to ExtendedIF_0 rows since
    the reference Python EIF at ext=0 is not a true standard IF
  - Add DOI to EIF paper reference and add reference Python eif repo
  - Clarify column headers (Liu et al., Ref. Python with IF/EIF labels)
  - Simplify key observations and fix overstated dimensionality claim
  - Minor wording improvements throughout
  1. Non-breaking spaces around ± — replaced ± with &nbsp;±&nbsp; in all value cells so values like 0.813 ± 0.004 won't wrap mid-value.
  2. Dashes in empty cells — all empty reference cells now show - instead of blank:
    - StandardIF rows: - in both Ref. Python columns
    - ExtendedIF rows: - in the Liu et al. column

Copilot AI left a comment


Pull request overview

This PR adds Extended Isolation Forest (EIF) as a first-class Spark ML Estimator/Model (random hyperplane splits), alongside refactors to share training and persistence utilities between standard IF and EIF, plus expanded tests and documentation updates.

Changes:

  • Introduces com.linkedin.relevance.isolationforest.extended (EIF) with training, scoring, and Avro-based save/load.
  • Refactors/centralizes shared training + schema validation + metadata utilities; improves standard IF persistence with totalNumFeatures and legacy-load handling.
  • Adds extensive new EIF and robustness tests; updates README with EIF usage, parameter semantics, benchmarks, and ONNX support notes.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.

Summary per file:

  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationTreeTest.scala: Adds unit tests for EIF tree behavior (splits, path length, sparse hyperplanes).
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestTest.scala: Adds integration-style EIF training/scoring tests and extensionLevel validation tests.
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModelWriteReadTest.scala: Adds EIF model save/load round-trip tests and structure snapshot tests.
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationTreeTest.scala: Fixes variable naming typo (featureIndices).
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationForestTest.scala: Adds validation test for invalid maxSamples resolving to < 2.
  • isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationForestModelWriteReadTest.scala: Expands standard IF persistence/robustness tests (dimension validation, legacy metadata).
  • isolation-forest/src/test/resources/savedIsolationForestModel/metadata/part-00000: Updates saved-model fixture to include totalNumFeatures.
  • isolation-forest/src/test/resources/savedExtendedIsolationForestModel/metadata/part-00000: Adds EIF saved-model metadata fixture.
  • isolation-forest/src/test/resources/expectedExtendedTreeStructure.txt: Adds expected EIF tree structure snapshot text.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedUtils.scala: Adds sparse hyperplane representation and dot-product helpers.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedNodes.scala: Adds EIF node types (internal hyperplane split / external leaf).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationTree.scala: Implements EIF tree training and scoring (random hyperplane splits).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestParams.scala: Adds EIF-specific Spark Param (extensionLevel).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModelReadWrite.scala: Adds Avro persistence for EIF tree ensembles and model metadata.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModel.scala: Adds EIF Spark ML Model scoring + schema validation + writer wiring.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForest.scala: Adds EIF Spark ML Estimator training flow and param resolution.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/Utils.scala: Centralizes schema validation/output-column appending + feature-size validation.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/SharedTrainLogic.scala: Centralizes param resolution and adds training guardrails for small sample edge cases.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/NodesBase.scala: Adjusts shared node base traits to support EIF needs (e.g., zero-size leaves).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/IsolationForestModelReadWriteUtils.scala: Extracts shared model metadata read/write utilities.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/IsolationForestModelReadWrite.scala: Removes old core read/write implementation (replaced by refactored code).
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/Nodes.scala: Ensures node toString behavior aligns with updated node base traits.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForestModelReadWrite.scala: Refactors standard IF model persistence to use shared metadata utils + totalNumFeatures.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForestModel.scala: Adds totalNumFeatures tracking and optional feature-dimension validation at scoring time.
  • isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForest.scala: Refactors training to use shared param resolution + shared schema validation.
  • README.md: Documents EIF usage/params/benchmarks and clarifies ONNX support limitations.


jverbus and others added 2 commits March 11, 2026 20:50
…ompatibility

  Spark 4.x's Avro encoder silently demotes Array[Double] elements to
  float (32-bit) precision during serialization, while scalar Double
  fields survive intact. This caused all five EIF model write/read tests
  to fail on Spark 4.0.1 and 4.1.1, with weight mismatches at ~1e-8
  (the exact double→float→double precision boundary).

  The fix changes SplitHyperplane weights from Array[Double] to
  Array[Float]. This is the correct design from first principles:
  - Features are already Array[Float] (DataPoint.features)
  - Weights define the hyperplane *direction* (analogous to the feature
    index in standard IF, which is just an Int)
  - The offset defines *where* to split and remains Double (analogous to
    splitValue in standard IF, which is Double for the same reason)
  - The dot product is accumulated in Double regardless of operand type
  - The split comparison (dot < offset) is always Double vs Double

  Weights are converted to float after normalization but before computing
  the offset, so training and scoring are consistent. Benchmarks confirm
  the change is invisible after rounding: only one value across all 13
  datasets changed (breastw ExtendedIF_max AUPRC: 0.9568 → 0.9569,
  well within the ±0.0015 error bar).
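
  The ~1e-8 scale of the observed mismatches is exactly what a
  double → float → double round-trip produces for values of this magnitude, as
  a quick standalone check illustrates (not the library's code, just the
  precision arithmetic):

  ```scala
  // A Double that survives Array[Double] persistence exactly would compare
  // equal after save/load; one demoted to float loses precision at roughly
  // the float ulp scale (for values near 0.1, ulp/2 is a few 1e-9).
  val w: Double = 0.123456789
  val roundTripped: Double = w.toFloat.toDouble
  val err = math.abs(w - roundTripped)
  println(f"original=$w%.12f roundTripped=$roundTripped%.12f err=$err%.2e")
  ```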

  Production code:
  - ExtendedUtils.scala: SplitHyperplane.weights Array[Double] → Array[Float]
  - ExtendedIsolationTree.scala: normalize to float before offset computation
  - ExtendedIsolationForestModelReadWrite.scala: ExtendedNodeData.weights
    and NullWeights updated to float

  Test code:
  - ExtendedIsolationTreeTest.scala: float literals, L2 norm tolerance
    widened from 1e-10 to 1e-6 (appropriate for float precision)
  - ExtendedIsolationForestModelWriteReadTest.scala: float literals,
    added disabled regenerateGoldenExtendedModel helper
  - Regenerated golden model and expected tree structure

  README:
  - Updated breastw ExtendedIF_max AUPRC from 0.9568 to 0.9569

  Verified all 67 tests pass on Spark 3.5.5, 4.0.1, and 4.1.1.
Copilot review responses:

1. Grammar fix (accepted): "some dataset" → "some datasets" in README
   benchmark observations.

2. .toFloat cast comment (accepted): Added clarifying comment explaining
   why features(indices(i)).toFloat is intentional — it matches the
   DataPoint (Array[Float]) precision used during training, ensuring
   scoring consistency between the DataPoint and Vector code paths.

3. Shuffle optimization (declined): Copilot suggested replacing
   Random.shuffle + take with reservoir sampling. The shuffle operates
   on a tiny array (≤ dim features, typically < 100 elements) once per
   tree node during training — not a hot path. Readability outweighs
   the micro-optimization.
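
   The retained approach from point 3 looks roughly like this (an illustrative
   sketch, not the library's exact code): shuffle the small index list once
   per node and take the first k, trading a micro-optimization for clarity.

   ```scala
   import scala.util.Random

   // Hypothetical helper: pick k distinct feature indices from a small
   // range. For typical dims (< 100) this runs once per tree node during
   // training and is not a hot path.
   def sampleIndices(numFeatures: Int, k: Int, rng: Random): Array[Int] =
     rng.shuffle((0 until numFeatures).toList).take(k).sorted.toArray
   ```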

4. outlierScoreThreshold > 0 sentinel (declined): Copilot noted that
   threshold=0.0 would be treated as "unset". Technically correct, but
   this mirrors the existing standard IF pattern identically. A
   threshold of 0.0 (label everything as outlier) is not a practical
   use case. Fixing it properly requires changing both IF and EIF
   together in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jverbus
Contributor Author

jverbus commented Mar 12, 2026


@CrustyAIBot CrustyAIBot left a comment


Deep review complete; looks good to merge. Remaining points are follow-up hardening items, not blockers.
