Add Extended Isolation Forest (EIF) support to the Scala/Spark isolation-forest library (#79)
Conversation
…ng. Results look reasonable, but detailed correctness not yet verified.
…h Isolation Forest, improved docs, and parameter validation
- Renamed local variables in `ExtendedIsolationForest.scala` for clarity (`dataset` → `data`).
- Moved and refined parameter validation in `validateAndResolveParams`, logging the chosen samples/features.
- Updated Javadoc-style comments in `ExtendedIsolationForest`, `ExtendedIsolationForestModel`, and related classes.
- Changed schema checks to use `VectorType` instead of `SQLDataTypes.VectorType`.
- Renamed and documented internal methods (e.g., `pathLengthInternal`) in `ExtendedIsolationTree`.
- Ensured consistent naming across `ExtendedIsolationForestModel` fields (e.g., `extendedIsolationTrees`).
- Cleaned up imports, minor style fixes, and removed commented-out debug prints.

There are still likely opportunities to factor out more shared logic into `core`.
… a work in progress.
…ite working with tests.
…rcept sampling, ≤ semantics, and degeneracy handling
- Sample the normal in the selected subspace with up to (extensionLevel+1) non-zero coords; normalize and guard zero-norm.
- Sample the intercept as a point p by drawing each active coordinate uniformly from that node's data range; set offset = n·p.
- Use the inclusive left-branch test x·n ≤ n·p in both training and scoring so the split predicate matches the paper.
- Treat minDot == maxDot or an empty partition as a leaf (stores numInstances); keeps trees well-formed.
- Compute the dot product against a full-length normal (zeros for unused coords) to match the (x − p)·n test.
- Minor: log message tweaks; one-pass min/max scan instead of materializing arrays; consistent ≤ in train/score.
- No change to model IO or public params.
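The hyperplane sampling described in this commit can be sketched in Python (a hedged illustration using numpy, not the library's Scala code; note that a later commit in this PR switches the split comparison from inclusive ≤ to strict <):

```python
import numpy as np

def sample_hyperplane(data, extension_level, rng):
    """Sample a random hyperplane split as described in the commit above.

    data: (n, d) array of points at this node.
    Returns (normal, offset), or None for the (measure-zero) zero-norm draw.
    """
    n, d = data.shape
    # Choose up to extension_level + 1 active coordinates.
    k = min(extension_level + 1, d)
    active = rng.choice(d, size=k, replace=False)

    # Normal: standard-normal entries in the active subspace, zeros elsewhere,
    # so the dot product is against a full-length normal as described above.
    normal = np.zeros(d)
    normal[active] = rng.standard_normal(k)
    norm = np.linalg.norm(normal)
    if norm == 0.0:  # guard zero-norm
        return None
    normal /= norm

    # Intercept point p: each active coordinate drawn uniformly from this
    # node's data range; offset = n . p.
    p = np.zeros(d)
    for j in active:
        p[j] = rng.uniform(data[:, j].min(), data[:, j].max())
    return normal, float(normal @ p)

rng = np.random.default_rng(0)
data = rng.normal(size=(256, 5))
normal, offset = sample_hyperplane(data, extension_level=1, rng=rng)
# Inclusive left-branch test x . n <= n . p, as in this commit.
left = data[data @ normal <= offset]
right = data[data @ normal > offset]
```

With `extension_level=1` the normal has at most 2 non-zero coordinates, which is the axis-aligned-to-fully-extended dial the `extensionLevel` parameter exposes.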
…afing
Previously, a single failed split attempt (a constant feature, all-same dot products, or an empty partition) immediately produced a leaf node. This meant extensionLevel=0 was not equivalent to standard IF when the first randomly chosen feature happened to be constant. Now the split retries up to 50 times before falling back to a leaf.
Remove Int.MaxValue-1 sentinel default. If the user sets extensionLevel above numFeatures-1, throw immediately. If unset, default to numFeatures-1 (fully extended). The resolved value is persisted in the model rather than the sentinel.
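The resolution rule above can be sketched as follows (a hypothetical Python helper for illustration; the library's actual logic lives in Scala and uses Spark Param semantics):

```python
def resolve_extension_level(user_set, num_features):
    """Resolve extensionLevel per the rule above.

    user_set: the user-provided value, or None if unset.
    If set above num_features - 1, fail immediately; if unset,
    default to num_features - 1 (fully extended).
    """
    if user_set is None:
        return num_features - 1  # fully extended default
    if user_set > num_features - 1:
        raise ValueError(
            f"extensionLevel {user_set} exceeds numFeatures - 1 = {num_features - 1}"
        )
    return user_set

# The resolved value (not a sentinel) is what gets persisted in the model.
resolved = resolve_extension_level(None, 5)   # defaults to 4
explicit = resolve_extension_level(2, 5)      # passes through
```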
Guard against dataForTree.head crash when a partition receives zero sampled points. Throws a clear IllegalStateException instead of a confusing NoSuchElementException. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Divide path length sum by the actual number of trees in the model rather than the $(numEstimators) parameter, preventing model/param drift from producing incorrect anomaly scores.
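The fix amounts to averaging path lengths over the trees actually present, then applying the standard anomaly-score formula s(x) = 2^(−E[h(x)]/c(ψ)). A hedged Python sketch (illustrative, not the library's Scala code):

```python
import math

EULER_MASCHERONI = 0.5772156649015329

def avg_path_length(n):
    """Expected path length c(n) of an unsuccessful BST search (Liu et al.).

    Returns 0 for n <= 1, which is also why a zero-size leaf
    contributes nothing to the path length.
    """
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + EULER_MASCHERONI  # harmonic-number approximation
    return 2.0 * h - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, num_samples):
    """Score = 2^(-mean(path_lengths) / c(num_samples)).

    Divide by the actual number of trees present (len(path_lengths)),
    not by a configured numEstimators that may have drifted from the model.
    """
    if not path_lengths:
        raise ValueError("empty ensemble: cannot score")
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / avg_path_length(num_samples))

# A point whose mean path length equals c(psi) scores exactly 0.5.
s = anomaly_score([avg_path_length(256)] * 10, 256)
```

Shorter-than-average paths push the score toward 1 (anomalous); deeper paths push it toward 0.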
…check
Use recursive node-by-node comparison with epsilon tolerance for doubles instead of fragile toString matching.
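The comparison described above might look like the following (a Python sketch with dict-based stand-ins for the node classes; field names like `indices`/`weights`/`offset` mirror the sparse hyperplane fields but are illustrative):

```python
import copy

def trees_equal(a, b, eps=1e-9):
    """Recursively compare two tree nodes, tolerating tiny double
    differences instead of comparing toString output."""
    if a is None or b is None:
        return a is b
    if a["type"] != b["type"]:
        return False
    if a["type"] == "external":
        return a["numInstances"] == b["numInstances"]
    # Internal node: exact match on structure, epsilon tolerance on doubles.
    if len(a["weights"]) != len(b["weights"]):
        return False
    if a["indices"] != b["indices"]:
        return False
    if abs(a["offset"] - b["offset"]) > eps:
        return False
    if any(abs(x - y) > eps for x, y in zip(a["weights"], b["weights"])):
        return False
    return (trees_equal(a["left"], b["left"], eps)
            and trees_equal(a["right"], b["right"], eps))

a = {"type": "internal", "indices": [0, 2], "weights": [0.6, -0.8],
     "offset": 0.1,
     "left": {"type": "external", "numInstances": 3},
     "right": {"type": "external", "numInstances": 0}}
b = copy.deepcopy(a)
b["weights"][0] += 1e-12  # sub-epsilon drift, e.g. from serialization
```

A save/load round-trip that perturbs a weight in the last bits now still compares equal, while any structural change is caught.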
Add EIF section covering when to use it, the extensionLevel parameter and its interaction with maxFeatures, and a usage example. Call out that ONNX export is not supported for EIF. Add Hariri et al. 2018 to references.
…ing estimator
Change the split criterion in ExtendedIsolationTree from <= to strict <, matching both the reference implementation (sahandha/eif) and our own standard IsolationTree. This affects tree building (partition) and scoring (path traversal).

Remove the set(extensionLevel, resolvedExtensionLevel) call in ExtendedIsolationForest.fit() that mutated the estimator. When extensionLevel was unset (defaulting to fully extended), the first fit() permanently set it, causing reuse on a dataset with fewer features to either fail validation or silently use the wrong level.
…retry loop
Remove the bounded retry loop for degenerate splits. Instead, follow the EIF paper and reference implementation: allow empty partitions to become ExtendedExternalNode(0) leaves. Change the split predicate from <= to strict < to match the reference implementation's (x − p)·n < 0. Relax ExtendedExternalNode to accept numInstances >= 0.
…_max results Replace the old IF-only benchmark table with comprehensive results across 13 datasets comparing all three model variants against Liu et al. and the reference Python EIF implementation.
Set the resolved extensionLevel on the estimator before copyValues so it flows into the model's param map. Without this, models trained without explicitly calling setExtensionLevel() would lose the effective value on save/load. Add test covering default resolution and round-trip persistence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…igned splits
Exercise the numInstances >= 0 semantics that became first-class EIF behavior when degenerate hyperplane splits were allowed to produce empty children. New tests cover:
- ExtendedExternalNode(0) construction and subtreeDepth
- Path length through a zero-size leaf contributes avgPathLength(0) = 0
- Save/load round-trip preserves a tree containing a zero-size leaf
- extensionLevel=0 produces strictly axis-aligned normals (1 non-zero coordinate)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use "closely matches" for ExtendedIF_max reference comparison - Note mulcross as an open outlier in ExtendedIF_0 parity (12 of 13) - Describe extensionLevel=0 as "uses axis-aligned splits" instead of "recovers standard axis-aligned splits" - Frame low-dimensional underperformance as our benchmark observation, not a broad established finding from the paper Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Uncomment savedExtendedIsolationForestModelTreeStructureTest and add the required resource files: a saved ExtendedIsolationForestModel and its expected first-tree toString golden file. This provides a regression guard against accidental changes to tree serialization or structure.
…od into SharedTrainLogic (where the other shared training helpers already live). Both IsolationForest.scala and ExtendedIsolationForest.scala now call the single shared implementation, passing $(maxFeatures) and $(maxSamples) as arguments.

Files changed:
- core/SharedTrainLogic.scala — added validateAndResolveParams(dataset, maxFeatures, maxSamples) and its ResolvedParams import
- IsolationForest.scala — removed the private method; updated import and call site
- extended/ExtendedIsolationForest.scala — removed the private method; updated import and call site
…ansformSchema All four Estimator/Model classes had identical 15-line transformSchema overrides. Extract the shared logic into Utils and delegate with a one-liner in each class.
…comparison style
- Remove the unused IsolationForestModel import from ExtendedIsolationForestModelReadWrite
- Fix the reader docstring that incorrectly said "standard" instead of "Extended"
- Change `outlierScoreThreshold > 0.0` to `> 0` to match standard IF style
…l, and intermediate levels
- Verify all hyperplane normals are L2-normalized across extension levels and seeds
- Verify extensionLevel > numFeatures - 1 throws IllegalArgumentException at fit time
- Verify intermediate extensionLevel values (1–4) train valid models with reasonable AUROC
…ationForestModel
- Remove unnecessary self-import of ExtendedIsolationForestModel in ReadWrite file
- Fix companion object and threshold comments that said "IsolationForestModel"
instead of "ExtendedIsolationForestModel"
Resolve the review issues uncovered while comparing the extended isolation
forest branch against master and the EIF reference implementation.
ExtendedIsolationForest
- stop mutating the estimator with a resolved default extensionLevel during
fit()
- keep dataset-dependent extensionLevel resolution local to each fit and
apply the resolved value only to the trained model
- add a regression test that reuses the same estimator across datasets with
different feature dimensions to ensure default extensionLevel does not
leak across fits
IsolationForestModel / ExtendedIsolationForestModel
- fail fast when transform() is called on an empty ensemble instead of
dividing by zero and producing invalid scores
- keep scoring normalized by the actual loaded tree count, but guard the
zero-tree case explicitly
- add transform-throws coverage for manually constructed empty standard and
extended models
- preserve existing empty model write/read tests so persistence still
round-trips this edge case correctly
Tests and style cleanup
- move ExtendedIsolationForestModelWriteReadTest into the
com.linkedin.relevance.isolationforest.extended package so package names
match file paths and the surrounding test suites
- restore the Spark-derived attribution header on the moved/copied
read-write helpers
- align ExtendedIsolationForestModelReadWrite visibility with the rest of
the package-private isolation forest internals
Verification
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestTest
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.IsolationForestModelWriteReadTest --tests
com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestModelWriteReadTest
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:compileScala :isolation-forest:compileTestScala
Update the README so the documented examples and version references match
the current repo state and are copy-paste runnable.
README updates
- change the documented default Spark version from 3.5.1 to 3.5.5
- update the example build command to use the current default Spark/Scala
combination
- replace stale hardcoded library and ONNX package versions with
<latest-version> / <matching-version> placeholders
- switch the Gradle dependency example from deprecated `compile` to
`implementation`
- add the missing `org.apache.spark.sql.functions.col` import to the Scala
training example
- fix the training example text to refer to the `label` column instead of
`labels`
- clarify the EIF `extensionLevel(5)` example comment so the dimensional
assumption is explicit
- define `dataset_name` and `num_examples_to_print` in the ONNX Python
inference example so the snippet is runnable as written
- remove the benchmark prose reference to a `LI IF` comparison column that
is not present in the table
This is a documentation-only change.
Add Extended Isolation Forest (EIF) support alongside the existing standard
Isolation Forest implementation, and harden the standard/extended model
persistence and scoring paths.
Extended Isolation Forest
- add ExtendedIsolationForest estimator, ExtendedIsolationForestModel,
ExtendedIsolationForestParams, ExtendedIsolationTree, ExtendedNodes, and
ExtendedUtils
- implement EIF training with extensionLevel-controlled random hyperplane
splits based on the Hariri et al. algorithm
- resolve extensionLevel per fit without mutating estimator state
- support axis-aligned EIF (extensionLevel = 0) through fully extended EIF
(extensionLevel = numFeatures - 1)
Sparse EIF model representation
- store hyperplanes sparsely as (indices, weights, offset) instead of dense
per-node normal vectors
- canonicalize stored sparse coordinates by sorting feature indices before
constructing SplitHyperplane
- use sparse dot products for tree traversal and add a direct Spark Vector
scoring path so EIF scoring benefits from sparsity end to end
- enforce sparse hyperplane invariants: non-empty, length-matched,
non-negative, distinct, sorted indices
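The sparse representation and its invariants can be sketched in Python (illustrative only; the library stores these in Scala case classes, and the helper names here are hypothetical):

```python
def check_invariants(indices, weights):
    """Enforce the sparse-hyperplane invariants listed above."""
    assert len(indices) > 0, "indices must be non-empty"
    assert len(indices) == len(weights), "indices/weights must be length-matched"
    assert all(i >= 0 for i in indices), "indices must be non-negative"
    assert len(set(indices)) == len(indices), "indices must be distinct"
    assert list(indices) == sorted(indices), "indices must be sorted (canonical form)"

def sparse_dot(indices, weights, x):
    """Dot product touching only the hyperplane's non-zero coordinates,
    so traversal cost scales with the number of active features."""
    return sum(w * x[i] for i, w in zip(indices, weights))

# A hyperplane in a 5-dimensional space where only features 1 and 3 participate.
indices, weights, offset = [1, 3], [0.6, -0.8], 0.1
check_invariants(indices, weights)

x = [9.0, 1.0, 9.0, 2.0, 9.0]
# 0.6*1.0 + (-0.8)*2.0, roughly -1.0, which is < 0.1, so x goes left.
goes_left = sparse_dot(indices, weights, x) < offset
```

Canonicalizing by sorting the indices before construction makes two hyperplanes with the same coordinates structurally comparable, which the epsilon-tolerant tree comparison relies on.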
Persistence and read/write refactor
- move standard model read/write into a top-level
IsolationForestModelReadWrite implementation
- add shared metadata helpers in IsolationForestModelReadWriteUtils
- add sparse EIF model read/write support and checked-in EIF persistence
fixtures
- preserve standard-model backward compatibility when loading older saved
models that do not contain totalNumFeatures metadata, logging that
dimension validation is unavailable for those legacy models
Model/scoring hardening
- reject numSamples values that resolve to fewer than 2 samples during
training
- fail fast when transform() is called on empty standard or extended models
- store totalNumFeatures in newly saved models and validate scoring input
dimension when that training dimension is known
- keep standard IF backward compatibility by restoring the legacy public
4-arg IsolationForestModel constructor while making the richer internal
constructor package-private
- restrict the extended model constructor to package-private use so
totalNumFeatures remains internal to fit/load/copy flows
Tests
- add comprehensive EIF estimator, tree, sparse-hyperplane, and write/read
tests
- add regression coverage for repeated EIF fits, empty-model scoring guards,
numSamples >= 2 enforcement, scoring-time feature dimension validation,
standard legacy metadata loading, and standard legacy constructor behavior
- update saved model metadata/tree-structure fixtures for the new extended
persistence format and formatting changes
Documentation
- refresh README dependency/version examples and fix copy-paste issues in the
Scala and ONNX examples
- add EIF usage and persistence examples
- document benchmark results for standard IF vs EIF variants
- fix benchmark/doc typos and soften the benchmark agreement statement to
avoid overstating row-by-row verification
Verification
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
- Apply rounding to all value ± error pairs (1 sig fig on
error, 2 if leading digit is 1)
- Move Ref Python results from StandardIF to ExtendedIF_0 rows since
the reference Python EIF at ext=0 is not a true standard IF
- Add DOI to EIF paper reference and add reference Python eif repo
- Clarify column headers (Liu et al., Ref. Python with IF/EIF labels)
- Simplify key observations and fix overstated dimensionality claim
- Minor wording improvements throughout
1. Non-breaking spaces around ± — replaced the plain spaces around ± with non-breaking spaces in all value cells so values like 0.813 ± 0.004 won't wrap mid-value.
2. Dashes in empty cells — all empty reference cells now show - instead of blank:
- StandardIF rows: - in both Ref. Python columns
- ExtendedIF rows: - in the Liu et al. column
Pull request overview
This PR adds Extended Isolation Forest (EIF) as a first-class Spark ML Estimator/Model (random hyperplane splits), alongside refactors to share training and persistence utilities between standard IF and EIF, plus expanded tests and documentation updates.
Changes:
- Introduces `com.linkedin.relevance.isolationforest.extended` (EIF) with training, scoring, and Avro-based save/load.
- Refactors/centralizes shared training + schema validation + metadata utilities; improves standard IF persistence with `totalNumFeatures` and legacy-load handling.
- Adds extensive new EIF and robustness tests; updates the README with EIF usage, parameter semantics, benchmarks, and ONNX support notes.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationTreeTest.scala | Adds unit tests for EIF tree behavior (splits, path length, sparse hyperplanes). |
| isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestTest.scala | Adds integration-style EIF training/scoring tests and extensionLevel validation tests. |
| isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModelWriteReadTest.scala | Adds EIF model save/load round-trip tests and structure snapshot tests. |
| isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationTreeTest.scala | Fixes variable naming typo (featureIndices). |
| isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationForestTest.scala | Adds validation test for invalid maxSamples resolving to < 2. |
| isolation-forest/src/test/scala/com/linkedin/relevance/isolationforest/IsolationForestModelWriteReadTest.scala | Expands standard IF persistence/robustness tests (dimension validation, legacy metadata). |
| isolation-forest/src/test/resources/savedIsolationForestModel/metadata/part-00000 | Updates saved-model fixture to include totalNumFeatures. |
| isolation-forest/src/test/resources/savedExtendedIsolationForestModel/metadata/part-00000 | Adds EIF saved-model metadata fixture. |
| isolation-forest/src/test/resources/expectedExtendedTreeStructure.txt | Adds expected EIF tree structure snapshot text. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedUtils.scala | Adds sparse hyperplane representation and dot-product helpers. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedNodes.scala | Adds EIF node types (internal hyperplane split / external leaf). |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationTree.scala | Implements EIF tree training and scoring (random hyperplane splits). |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestParams.scala | Adds EIF-specific Spark Param (extensionLevel). |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModelReadWrite.scala | Adds Avro persistence for EIF tree ensembles and model metadata. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForestModel.scala | Adds EIF Spark ML Model scoring + schema validation + writer wiring. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/extended/ExtendedIsolationForest.scala | Adds EIF Spark ML Estimator training flow and param resolution. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/Utils.scala | Centralizes schema validation/output-column appending + feature-size validation. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/SharedTrainLogic.scala | Centralizes param resolution and adds training guardrails for small sample edge cases. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/NodesBase.scala | Adjusts shared node base traits to support EIF needs (e.g., zero-size leaves). |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/IsolationForestModelReadWriteUtils.scala | Extracts shared model metadata read/write utilities. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/core/IsolationForestModelReadWrite.scala | Removes old core read/write implementation (replaced by refactored code). |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/Nodes.scala | Ensures node toString behavior aligns with updated node base traits. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForestModelReadWrite.scala | Refactors standard IF model persistence to use shared metadata utils + totalNumFeatures. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForestModel.scala | Adds totalNumFeatures tracking and optional feature-dimension validation at scoring time. |
| isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForest.scala | Refactors training to use shared param resolution + shared schema validation. |
| README.md | Documents EIF usage/params/benchmarks and clarifies ONNX support limitations. |
…ompatibility
Spark 4.x's Avro encoder silently demotes Array[Double] elements to
float (32-bit) precision during serialization, while scalar Double
fields survive intact. This caused all five EIF model write/read tests
to fail on Spark 4.0.1 and 4.1.1, with weight mismatches at ~1e-8
(the exact double→float→double precision boundary).
The fix changes SplitHyperplane weights from Array[Double] to
Array[Float]. This is the correct design from first principles:
- Features are already Array[Float] (DataPoint.features)
- Weights define the hyperplane *direction* (analogous to the feature
index in standard IF, which is just an Int)
- The offset defines *where* to split and remains Double (analogous to
splitValue in standard IF, which is Double for the same reason)
- The dot product is accumulated in Double regardless of operand type
- The split comparison (dot < offset) is always Double vs Double
Weights are converted to float after normalization but before computing
the offset, so training and scoring are consistent. Benchmarks confirm
the change is invisible after rounding: only one value across all 13
datasets changed (breastw ExtendedIF_max AUPRC: 0.9568 → 0.9569,
well within the ±0.0015 error bar).
Production code:
- ExtendedUtils.scala: SplitHyperplane.weights Array[Double] → Array[Float]
- ExtendedIsolationTree.scala: normalize to float before offset computation
- ExtendedIsolationForestModelReadWrite.scala: ExtendedNodeData.weights
and NullWeights updated to float
Test code:
- ExtendedIsolationTreeTest.scala: float literals, L2 norm tolerance
widened from 1e-10 to 1e-6 (appropriate for float precision)
- ExtendedIsolationForestModelWriteReadTest.scala: float literals,
added disabled regenerateGoldenExtendedModel helper
- Regenerated golden model and expected tree structure
README:
- Updated breastw ExtendedIF_max AUPRC from 0.9568 to 0.9569
Verified all 67 tests pass on Spark 3.5.5, 4.0.1, and 4.1.1.
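The double→float→double boundary described above is easy to demonstrate directly. A minimal Python sketch (`struct` is stdlib; the constant is just a typical normalized-weight-like value, 1/√3, chosen for illustration):

```python
import struct

def through_float(x):
    """Round-trip a Python double through 32-bit float storage,
    as an Avro 'float' field (or a demoted array element) would."""
    return struct.unpack('f', struct.pack('f', x))[0]

w = 0.5773502691896258  # 1/sqrt(3): a plausible normalized hyperplane weight
err = abs(w - through_float(w))
# float32 carries ~7 decimal digits, so for values near 1 the loss lands
# around 1e-8 -- exactly the mismatch magnitude the tests observed.
```

This is why storing the weights as `Array[Float]` end to end (and accumulating the dot product in `Double`) makes training and scoring consistent instead of precision-dependent.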
Copilot review responses:
1. Grammar fix (accepted): "some dataset" → "some datasets" in README benchmark observations.
2. .toFloat cast comment (accepted): Added a clarifying comment explaining why features(indices(i)).toFloat is intentional — it matches the DataPoint (Array[Float]) precision used during training, ensuring scoring consistency between the DataPoint and Vector code paths.
3. Shuffle optimization (declined): Copilot suggested replacing Random.shuffle + take with reservoir sampling. The shuffle operates on a tiny array (≤ dim features, typically < 100 elements) once per tree node during training — not a hot path. Readability outweighs the micro-optimization.
4. outlierScoreThreshold > 0 sentinel (declined): Copilot noted that threshold=0.0 would be treated as "unset". Technically correct, but this mirrors the existing standard IF pattern identically. A threshold of 0.0 (label everything as an outlier) is not a practical use case. Fixing it properly requires changing both IF and EIF together in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Performance and benchmark results here: https://github.com/linkedin/isolation-forest/tree/eif_scala?tab=readme-ov-file#performance-and-benchmarks
CrustyAIBot left a comment:
Deep review complete; looks good to merge. Remaining points are follow-up hardening items, not blockers.
Summary
This PR adds Extended Isolation Forest (EIF) support to the Scala/Spark `isolation-forest` library. The new implementation introduces a Spark ML `Estimator`/`Model` pair for EIF, adds save/load support for the extended tree format, expands test coverage substantially, and updates the README with usage examples, parameter documentation, benchmarks, and clear notes about ONNX support.

In addition to the new EIF functionality, this PR also refactors shared training/persistence utilities and improves robustness and backward compatibility for the existing standard `IsolationForest` implementation.

Background
The existing library only supported the standard Isolation Forest algorithm, which uses axis-aligned splits. That works well in many cases, but it can introduce directional bias and may underperform when anomalies lie along correlated or non-axis-aligned directions.
Extended Isolation Forest addresses this by replacing axis-aligned splits with random hyperplane splits, making the detector more rotationally invariant and better suited for data with correlated features.
This PR implements EIF based on the original paper and validates behavior against the reference implementation.
What this PR adds
1. New Extended Isolation Forest implementation
Added a new package, `com.linkedin.relevance.isolationforest.extended`, with the following new public Spark ML APIs:
- `ExtendedIsolationForest`
- `ExtendedIsolationForestModel`

and the supporting implementation types:
- `ExtendedIsolationForestParams`
- `ExtendedIsolationTree`
- `ExtendedNodes`
- `ExtendedUtils`
- `ExtendedIsolationForestModelReadWrite`

2. Random hyperplane splits for EIF
Each EIF tree now isolates points using random hyperplanes instead of single-feature thresholds.
Implementation details:
3. `extensionLevel` parameter

EIF introduces a new parameter: `extensionLevel`.

Behavior:
- `0` -> axis-aligned EIF splits
- `numFeatures - 1` -> fully extended hyperplanes

Important semantics captured in the implementation and docs:
- `extensionLevel` is interpreted relative to the resolved per-tree feature subspace
- with `maxFeatures < 1.0`, the valid `extensionLevel` range is based on that subspace, not the original dataset dimension

4. EIF model save/load support
Added full persistence support for `ExtendedIsolationForestModel`. This includes:

A saved EIF model can now be loaded with:
Standard Isolation Forest improvements included in this PR
While adding EIF, this PR also cleans up and strengthens the existing standard IF implementation.
1. Shared training/schema utilities
Refactored common logic into reusable helpers so both standard IF and EIF use the same code paths where appropriate.
New shared helpers include:
- `maxFeatures` / `maxSamples` resolution and validation

2. Shared read/write metadata utilities

Extracted common metadata handling into `core/IsolationForestModelReadWriteUtils.scala`. This reduces duplication between standard IF and EIF persistence code.
3. Track `totalNumFeatures` in models

Standard `IsolationForestModel` now records:
- `numFeatures` = resolved per-tree feature count
- `totalNumFeatures` = full training input dimension

This enables feature-dimension validation during scoring and makes persisted model metadata more explicit.
4. Backward-compatible standard IF constructor behavior
To preserve compatibility, standard IF now has:
- the legacy public 4-arg constructor, plus a richer package-private constructor carrying `totalNumFeatures`

This keeps older construction patterns working while allowing newly trained or newly loaded models to retain full dimensionality metadata.

Related to that, `ExtendedIsolationForestModel` keeps its 5-argument constructor internal-only so EIF model construction remains package-scoped.

5. Legacy model loading support
Older saved standard IF models that do not contain `totalNumFeatures` metadata still load successfully. In that case the model follows an unknown-dimension path and skips feature-size validation.
6. More robust scoring behavior
Standard and extended models now fail fast for invalid scenarios such as:
- `transform()` on an empty ensemble
- scoring input whose feature dimension mismatches the known training dimension
- `numSamples` values that resolve to `< 2`

7. Small validation/robustness fixes
Additional guardrails added in shared training logic:
- `maxSamples` must resolve to at least `2`

EIF implementation details worth calling out
A few implementation choices are deliberate and important.
Degenerate splits and zero-size leaves
EIF can legitimately produce degenerate hyperplane splits where all points land on one side.
To match the EIF algorithm/reference behavior, this implementation:
- allows empty partitions to become `ExtendedExternalNode(0)` leaves

This is different from the standard IF behavior, where axis-aligned feature selection retries until a splittable feature is found.
Sparse hyperplane representation
Hyperplanes are stored in the original feature space using only the non-zero coordinates.
This keeps:
- stored models compact (only non-zero coordinates are persisted)
- tree traversal cheap via sparse dot products

ExtendedIF_0 is not identical to standard IF

Although `extensionLevel = 0` yields axis-aligned hyperplanes, it is still EIF behavior, not standard IF behavior. Key differences include:
This distinction is now documented in the README and discussed explicitly in the benchmark section.
Documentation updates
The README was expanded significantly to reflect the new functionality.
Added documentation for EIF
New sections include:
- `extensionLevel` parameter semantics

Clarified ONNX support

The README now explicitly states:
- ONNX export is supported for the standard `IsolationForestModel` only

Benchmarks section updated

The README benchmark section now compares:
- `StandardIF`
- `ExtendedIF_0`
- `ExtendedIF_max`
- the reference Python EIF implementation (`sahandha/eif`)

It also summarizes where EIF helps most and where it may underperform.
Additional README cleanup
Also included:
- documented default Spark version bumped from `3.5.1` to `3.5.5`

Tests added / updated
This PR adds substantial new test coverage for both EIF and supporting standard IF changes.
New EIF test suites
Added:
- `ExtendedIsolationForestTest`
- `ExtendedIsolationForestModelWriteReadTest`
- `ExtendedIsolationTreeTest`

Coverage includes:
- `extensionLevel` validation and default `extensionLevel` handling

Updated standard IF tests

Expanded existing standard IF coverage to validate:
- `totalNumFeatures` persistence
- `numSamples` handling
- legacy loading of models without `totalNumFeatures`

Added / updated test resources
Added snapshot resources for EIF model persistence, including:
Updated the standard saved-model metadata fixture to include `totalNumFeatures`.

Backward compatibility
API compatibility
- The existing `IsolationForest` API remains available.
- The legacy public `IsolationForestModel` constructor is preserved.

Persistence compatibility
- Existing saved models continue to load, including legacy models without `totalNumFeatures`.

ONNX compatibility
Behavior changes to be aware of
This PR introduces a few stricter validation behaviors:
- `maxSamples` values that resolve to fewer than 2 samples now fail fast

These are intentional robustness improvements.
Validation
Automated test validation
Verified with:
Benchmark validation in README
The README now includes a detailed benchmark comparison across 13 datasets for:
- `extensionLevel = 0`

The headline result is that:
Additional edge-case study (not included in this PR)
In addition to the checked-in unit/integration tests and README benchmark comparison, we also ran a separate detailed hyperparameter / edge-case study outside the PR. This is not part of the checked-in test suite, but it gives extra confidence in the implementation.
Summary: 61 / 61 checks passed in ~20s
Hyperparameter behavior
- `0.860` (ext=0) to `0.908` (ext=32), confirming that higher extension levels capture more complex anomaly structure.
- `0.985+` across `10–200` trees. More trees help slightly, but returns diminish beyond ~50 trees.
- `0.63–0.67` across `32–512` samples. Larger samples are slightly better. Fractional `maxSamples=0.5` behaves correctly.
- Feature subsampling (`maxFeatures=0.5`) has minimal effect on standard IF but slightly improves EIF on ionosphere (0.908 -> 0.913).

Correctness checks
- `contamination=0` labels nothing; `contamination=0.35` labels about `35.6%`.
- Reproducible under a fixed seed (max diff = 0).
- `ExtendedIF_0` and standard IF scores differ slightly (diff 0.002–0.017), confirming that they are related but algorithmically distinct.
- Scores stay within `[0,1]`, and anomalies consistently score higher than normals.

Validation and edge cases
- `extensionLevel > dim - 1` is correctly rejected.
- `maxSamples > n` is correctly rejected; `maxSamples = n` and fractional values behave correctly.
- (AUROC 0.97–0.99).
- Standard IF assigns `0.5` everywhere; EIF assigns `0.291` everywhere.
- Tiny dataset (n=3, ss=2): works and produces identical scores for all points. A single-row dataset is correctly rejected.

Raw output from the external edge-case study
Why this design
This implementation tries to keep EIF aligned with the rest of the library:
- same Spark ML `Estimator`/`Model` pattern as standard IF

At the same time, it preserves the important algorithmic differences required for EIF correctness:
Limitations / follow-ups
Not included in this PR:
- ONNX export for `ExtendedIsolationForestModel`

The current ONNX converter assumes axis-aligned tree ensembles, so EIF would require a different export representation or converter strategy.
References
Hariri, S., Carrasco Kind, M., and Brunner, R. J. Extended Isolation Forest.
https://arxiv.org/abs/1811.02141
Reference EIF implementation used for comparison:
https://github.com/sahandha/eif