⚡️ Speed up function _get_all_json_refs by 38% #282

Open
codeflash-ai[bot] wants to merge 1 commit into python-only from codeflash/optimize-_get_all_json_refs-mlujbi71

@codeflash-ai codeflash-ai bot commented Feb 20, 2026

📄 38% (0.38x) speedup for _get_all_json_refs in src/algorithms/search.py

⏱️ Runtime : 20.3 milliseconds → 14.7 milliseconds (best of 103 runs)

📝 Explanation and details

This optimization achieves a 38% speedup (runtime drops from 20.3ms to 14.7ms, about 28% less wall-clock time) by eliminating the overhead of creating and merging temporary sets during recursive traversal of JSON schemas.

Key Optimization: Shared Set Pattern

The optimized code replaces the original's refs.update(_get_all_json_refs(value)) pattern with a helper function _collect_json_refs(item, refs) that mutates a shared set in-place. This eliminates:

  1. Set creation overhead: The original creates a new set() on every recursive call (72,401 allocations according to line profiler)
  2. Set merge overhead: Each refs.update() call copies elements from the returned set into the parent set

The line profiler data shows the impact clearly:

  • Original: Lines with refs.update() consume 33-34% of total time (79.46M + 2.677M + 2.615M ns)
  • Optimized: A single call to _collect_json_refs() handles all work (224.6M ns total)
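
The two patterns described above can be sketched as follows. This is a minimal reconstruction, not the code from src/algorithms/search.py: the names get_refs_merging, get_refs_shared and _collect, the local JsonRef definition, and the exact branching (for example, that a non-string "$ref" value is skipped rather than recursed into) are assumptions made for illustration.

```python
from typing import Any, NewType, Set

# Assumption: mirrors the JsonRef NewType over str that the tests import.
JsonRef = NewType("JsonRef", str)


def get_refs_merging(item: Any) -> Set[JsonRef]:
    """Merge pattern: every call allocates a set that its caller copies from."""
    refs: Set[JsonRef] = set()
    if isinstance(item, dict):
        for key, value in item.items():
            if key == "$ref" and isinstance(value, str):
                refs.add(JsonRef(value))
            elif isinstance(value, (dict, list)):
                # Child builds a temporary set; update() copies it into ours.
                refs.update(get_refs_merging(value))
    elif isinstance(item, list):
        for element in item:
            refs.update(get_refs_merging(element))
    return refs


def _collect(item: Any, refs: Set[JsonRef]) -> None:
    """Shared-set pattern: mutate the caller's set in place, return nothing."""
    if isinstance(item, dict):
        for key, value in item.items():
            if key == "$ref" and isinstance(value, str):
                refs.add(JsonRef(value))
            elif isinstance(value, (dict, list)):
                _collect(value, refs)
    elif isinstance(item, list):
        for element in item:
            _collect(element, refs)


def get_refs_shared(item: Any) -> Set[JsonRef]:
    """Public wrapper: allocate one set, let the helper fill it."""
    refs: Set[JsonRef] = set()
    _collect(item, refs)
    return refs


schema = {"$ref": "#/root", "items": [{"$ref": "#/A"}, {"b": {"$ref": "#/A"}}]}
assert get_refs_merging(schema) == get_refs_shared(schema)
```

Both functions return the same set; the second allocates one set per top-level call instead of one per visited node.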

Why This Works

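In Python, every set() allocation and update() call carries fixed overhead, even when the sets involved are small. The merge pattern also re-copies each collected ref once at every ancestor level on its way back up, so for deeply nested JSON schemas with many $ref entries (like the test cases with 1000+ refs) the copying work grows with nesting depth as well as with the number of refs. By passing a single shared set through the recursion stack and using direct add() operations, each ref is inserted once and this multiplicative cost disappears.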
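
To make the multiplicative cost concrete, the sketch below counts how many elements pass through update() when the merge pattern walks a nested chain shaped like the one in the deep-nesting test. The functions refs_and_copy_count and build_chain are hypothetical helpers written for this illustration, and the traversal rules are assumed rather than taken from the module under test.

```python
from typing import Any, Set, Tuple


def refs_and_copy_count(item: Any) -> Tuple[Set[str], int]:
    """Return (refs, copies); copies counts elements handed to update()."""
    refs: Set[str] = set()
    copies = 0
    if isinstance(item, dict):
        for key, value in item.items():
            if key == "$ref" and isinstance(value, str):
                refs.add(value)
            elif isinstance(value, (dict, list)):
                child_refs, child_copies = refs_and_copy_count(value)
                # update() iterates over every element of the child's set.
                copies += child_copies + len(child_refs)
                refs.update(child_refs)
    elif isinstance(item, list):
        for element in item:
            child_refs, child_copies = refs_and_copy_count(element)
            copies += child_copies + len(child_refs)
            refs.update(child_refs)
    return refs, copies


def build_chain(depth: int) -> dict:
    """Nested dict chain with one distinct $ref per level."""
    schema: dict = {"level": "root"}
    current = schema
    for i in range(depth):
        current["nested"] = {"level": i, "$ref": f"#/definitions/Level{i}"}
        current = current["nested"]
    return schema


refs, copies = refs_and_copy_count(build_chain(100))
print(len(refs), copies)
```

For the 100-level chain this reports 100 refs and 5050 copied elements (1 + 2 + ... + 100), whereas a shared-set traversal would perform 100 add() calls.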

Performance Characteristics

The optimization excels for nested and large-scale JSON schemas:

  • Deeply nested structures (100 levels): 170% faster (125μs → 46.2μs)
  • Wide structures (500 refs): 30-52% faster
  • Mixed nested lists/dicts (1000+ items): 36-42% faster

For trivial inputs (empty dicts/lists, primitives), there's a 10-28% slowdown because the wrapper now makes one extra function call into the helper. These cases complete in roughly 200-700ns, so the absolute cost is negligible, while the optimization targets real-world JSON schema parsing where nested structures are common.
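
The extra-call overhead on trivial inputs can be observed with a small timeit harness. This is a sketch under stated assumptions: direct, _helper and wrapped are hypothetical stand-ins that model only the non-container path, not the functions in the PR, and the absolute numbers depend on the machine and Python version.

```python
import timeit
from typing import Any, Set


def direct(item: Any) -> Set[str]:
    """Single-function shape: a non-container input falls straight through."""
    refs: Set[str] = set()
    if isinstance(item, (dict, list)):
        pass  # traversal omitted; only the trivial path is timed here
    return refs


def _helper(item: Any, refs: Set[str]) -> None:
    """Stand-in for the recursive helper; does nothing for non-containers."""
    if isinstance(item, (dict, list)):
        pass  # traversal omitted


def wrapped(item: Any) -> Set[str]:
    """Wrapper-plus-helper shape: one extra call even for trivial input."""
    refs: Set[str] = set()
    _helper(item, refs)
    return refs


# Best-of-5 timings for 100,000 calls each, converted to ns per call.
t_direct = min(timeit.repeat(lambda: direct(42), number=100_000, repeat=5))
t_wrapped = min(timeit.repeat(lambda: wrapped(42), number=100_000, repeat=5))
print(f"direct:  {t_direct * 1e4:.0f} ns per call")
print(f"wrapped: {t_wrapped * 1e4:.0f} ns per call")
```

The wrapped variant is typically the slower of the two by roughly the cost of one Python function call, which is the kind of fixed overhead the trivial-input regressions reflect.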

Impact on Workloads

Since _get_all_json_refs is used for JSON schema reference extraction, this optimization particularly benefits:

  • Large API schemas with interconnected type definitions
  • Schema validation pipelines processing many documents
  • Tools that analyze or transform complex JSON schemas

The test results confirm that real-world usage patterns (complex schemas with multiple nesting levels and many refs) see significant speedups, making this optimization valuable despite minor regressions on trivial cases.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 62 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
from typing import Set

# imports
import pytest  # used for our unit tests
# import the function and JsonRef NewType from the actual module under test
from src.algorithms.search import JsonRef, _get_all_json_refs

def test_empty_and_non_container_inputs_return_empty_set():
    # An empty dict should produce no refs
    codeflash_output = _get_all_json_refs({}) # 625ns -> 708ns (11.7% slower)
    # An empty list should produce no refs
    codeflash_output = _get_all_json_refs([]) # 334ns -> 375ns (10.9% slower)
    # A plain string (non-container) should produce no refs
    codeflash_output = _get_all_json_refs("just a string") # 250ns -> 333ns (24.9% slower)
    # An integer (non-container) should also produce no refs
    codeflash_output = _get_all_json_refs(12345) # 209ns -> 291ns (28.2% slower)

def test_single_top_level_ref_is_found_and_typed_as_jsonref():
    # A simple dict with a top-level '$ref' string should return a set with that ref
    data = {"$ref": "#/definitions/MyType"}
    expected = {JsonRef("#/definitions/MyType")}
    codeflash_output = _get_all_json_refs(data); result = codeflash_output # 916ns -> 1.00μs (8.40% slower)
    assert result == expected
    # Each returned element should be a string at runtime (JsonRef is a NewType over str)
    for elt in result:
        assert isinstance(elt, str)

def test_nested_refs_in_dicts_and_lists_are_collected_and_deduplicated():
    # Prepare a nested structure mixing dicts and lists with repeated refs
    data = {
        "$ref": "#/root",
        "level1": {
            "a": {"$ref": "#/A"},
            "b": [{"$ref": "#/B"}, {"c": {"$ref": "#/C"}}, {"c_dup": {"$ref": "#/C"}}],
        },
        "level1_list": [
            {"deep": {"$ref": "#/D"}},
            [{"nested_list_item": {"$ref": "#/E"}}],  # list inside list
        ],
        "other": [{"no_ref_here": True}, {"$ref": "#/B"}],  # B is repeated to test dedup
    }
    expected = {JsonRef("#/root"), JsonRef("#/A"), JsonRef("#/B"),
                JsonRef("#/C"), JsonRef("#/D"), JsonRef("#/E")}
    # call the function and compare sets
    codeflash_output = _get_all_json_refs(data) # 7.04μs -> 5.42μs (30.0% faster)

def test_ref_key_with_non_string_value_is_ignored():
    # If "$ref" exists but its value is not a string, it should not be included
    data = {
        "$ref": {"not": "a string"},
        "child": {"$ref": 123},  # integer -> ignored
        "valid": {"$ref": "#/valid"}  # only this one should be included
    }
    expected = {JsonRef("#/valid")}
    codeflash_output = _get_all_json_refs(data) # 2.42μs -> 2.29μs (5.50% faster)

def test_empty_string_and_special_char_refs_are_included():
    # An empty string value is still a string and should be collected
    # Also include a ref with various ASCII-special characters
    data = {
        "empty": {"$ref": ""},
        "special": {"$ref": "#/defs/some-name_with+chars.123"},
    }
    expected = {JsonRef(""), JsonRef("#/defs/some-name_with+chars.123")}
    codeflash_output = _get_all_json_refs(data) # 1.96μs -> 1.75μs (11.9% faster)

def test_none_input_returns_empty_set():
    # None is not a dict or list, so should result in an empty set
    codeflash_output = _get_all_json_refs(None) # 458ns -> 583ns (21.4% slower)

def test_lists_with_mixed_types_do_not_raise_and_collect_valid_refs():
    # Lists containing non-dict/list primitives should be tolerated and ignored
    data = [
        1,
        "string",
        {"$ref": "#/one"},
        [ {"$ref": "#/two"}, 42, None, ["ignored", {"$ref": "#/three"}] ],
    ]
    expected = {JsonRef("#/one"), JsonRef("#/two"), JsonRef("#/three")}
    codeflash_output = _get_all_json_refs(data) # 3.62μs -> 2.54μs (42.6% faster)

def test_keys_named_similar_to_ref_are_not_matched():
    # Keys that merely contain '$ref' as a substring should not count
    data = {
        "not_$ref": "#/should_not_be_seen",
        "dollarref": {"$ref_extra": "#/also_not"},
        "actual": {"$ref": "#/yes_this_one"},
    }
    expected = {JsonRef("#/yes_this_one")}
    codeflash_output = _get_all_json_refs(data) # 1.96μs -> 1.83μs (6.82% faster)

def test_large_scale_breadth_many_items_has_expected_unique_count():
    # Build a large flat list (breadth) with many dicts each containing "$ref"
    total_items = 1000
    # Use only 500 unique refs repeated twice to test deduplication on scale
    unique_count = 500
    big_list = []
    for i in range(total_items):
        # create repeated refs (i % unique_count) to ensure duplicates
        ref_value = f"#/defs/{i % unique_count}"
        big_list.append({"$ref": ref_value})
    # Wrap into a top-level dict to exercise dict-handling path
    data = {"items": big_list}
    codeflash_output = _get_all_json_refs(data); result = codeflash_output # 320μs -> 229μs (40.1% faster)

def test_large_scale_mixed_nested_structure_performance_and_correctness():
    # Construct a mixed nested structure with many small nested lists and dicts
    # We avoid extreme recursion depth; we focus on breadth and many iterations.
    outer = {"payload": []}
    unique_refs: Set[JsonRef] = set()
    for i in range(1000):  # 1000 iterations to test scale
        ref = f"#/bulk/{i % 250}"  # 250 unique refs repeated
        unique_refs.add(JsonRef(ref))
        # create a small nested chunk combining dicts and lists
        chunk = {
            "meta": {"index": i},
            "refs": [ {"$ref": ref} , {"other": [ {"$ref": ref} , {"noop": True} ] } ],
        }
        outer["payload"].append(chunk)
    codeflash_output = _get_all_json_refs(outer); result = codeflash_output # 1.79ms -> 1.26ms (42.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# --- second generated test module ---
from __future__ import annotations

from typing import Any, NewType

# imports
import pytest
from src.algorithms.search import _get_all_json_refs

def test_empty_dict():
    """Test that an empty dictionary returns an empty set of refs."""
    codeflash_output = _get_all_json_refs({}); result = codeflash_output # 625ns -> 708ns (11.7% slower)

def test_empty_list():
    """Test that an empty list returns an empty set of refs."""
    codeflash_output = _get_all_json_refs([]); result = codeflash_output # 459ns -> 583ns (21.3% slower)

def test_single_ref_in_dict():
    """Test extraction of a single $ref from a dict."""
    schema = {"$ref": "#/definitions/MyType"}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.00μs -> 1.04μs (4.03% slower)

def test_single_ref_in_nested_dict():
    """Test extraction of a $ref from a nested dict."""
    schema = {
        "properties": {
            "field": {"$ref": "#/definitions/Person"}
        }
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.75μs -> 1.54μs (13.5% faster)

def test_multiple_refs_in_dict():
    """Test extraction of multiple $refs from a single dict."""
    schema = {
        "$ref": "#/definitions/First",
        "properties": {
            "second": {"$ref": "#/definitions/Second"}
        }
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.92μs -> 1.67μs (15.0% faster)

def test_ref_in_list():
    """Test extraction of a $ref from within a list."""
    schema = {
        "oneOf": [
            {"$ref": "#/definitions/TypeA"},
            {"$ref": "#/definitions/TypeB"}
        ]
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 2.00μs -> 1.75μs (14.3% faster)

def test_deeply_nested_refs():
    """Test extraction of $refs from deeply nested structures."""
    schema = {
        "level1": {
            "level2": {
                "level3": {
                    "$ref": "#/definitions/Deep"
                }
            }
        }
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.96μs -> 1.75μs (11.9% faster)

def test_ref_with_empty_string():
    """Test that a $ref with an empty string value is captured."""
    schema = {"$ref": ""}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 958ns -> 1.04μs (7.97% slower)

def test_ref_with_url():
    """Test that $ref values can be URLs."""
    schema = {"$ref": "https://example.com/schema#/definitions/Type"}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 917ns -> 1.04μs (12.0% slower)

def test_ref_with_special_characters():
    """Test that $ref values can contain special characters."""
    schema = {"$ref": "#/definitions/Type-With_Special.Chars123"}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 959ns -> 1.04μs (7.97% slower)

def test_direct_list_input():
    """Test that a list can be passed directly as input."""
    schema = [
        {"$ref": "#/definitions/A"},
        {"$ref": "#/definitions/B"}
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.83μs -> 1.58μs (15.8% faster)

def test_list_with_mixed_items():
    """Test a list containing both dicts with refs and other items."""
    schema = [
        {"$ref": "#/definitions/TypeA"},
        "string_value",
        42,
        {"nested": {"$ref": "#/definitions/TypeB"}}
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 2.62μs -> 2.08μs (26.0% faster)

def test_ref_key_with_non_string_value():
    """Test that $ref with non-string value is ignored."""
    schema = {
        "$ref": 123,  # non-string value
        "other": "value"
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.12μs -> 1.17μs (3.60% slower)

def test_ref_key_with_none_value():
    """Test that $ref with None value is ignored."""
    schema = {"$ref": None}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 875ns -> 1.00μs (12.5% slower)

def test_ref_key_with_list_value():
    """Test that $ref with list value is ignored."""
    schema = {"$ref": ["#/definitions/A", "#/definitions/B"]}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.54μs -> 1.33μs (15.7% faster)

def test_ref_key_with_dict_value():
    """Test that $ref with dict value is ignored."""
    schema = {"$ref": {"nested": "value"}}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.42μs -> 1.33μs (6.23% faster)

def test_property_named_ref():
    """Test that a property named 'ref' (without $) is not confused with $ref."""
    schema = {"ref": "#/definitions/NotARef"}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 875ns -> 958ns (8.66% slower)

def test_property_named_dollar_ref():
    """Test that only exact key match '$ref' is recognized."""
    schema = {
        "$ref_extra": "#/definitions/NotARef",
        "$REF": "#/definitions/AlsoNotARef"
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 1.04μs -> 1.17μs (10.7% slower)

def test_duplicate_refs():
    """Test that duplicate refs are returned as a set (no duplicates)."""
    schema = {
        "first": {"$ref": "#/definitions/Same"},
        "second": {"$ref": "#/definitions/Same"},
        "third": {"$ref": "#/definitions/Same"}
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 2.38μs -> 2.04μs (16.4% faster)

def test_ref_in_list_at_multiple_depths():
    """Test refs found in lists at different nesting depths."""
    schema = [
        {"$ref": "#/definitions/Top"},
        {
            "nested": [
                {"$ref": "#/definitions/Middle"}
            ]
        }
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 2.25μs -> 1.92μs (17.4% faster)

def test_integer_input():
    """Test that non-dict/non-list inputs are handled gracefully."""
    codeflash_output = _get_all_json_refs(42); result = codeflash_output # 417ns -> 542ns (23.1% slower)

def test_string_input():
    """Test that string input is handled gracefully."""
    codeflash_output = _get_all_json_refs("not a schema"); result = codeflash_output # 417ns -> 542ns (23.1% slower)

def test_none_input():
    """Test that None input is handled gracefully."""
    codeflash_output = _get_all_json_refs(None); result = codeflash_output # 458ns -> 542ns (15.5% slower)

def test_boolean_input():
    """Test that boolean input is handled gracefully."""
    codeflash_output = _get_all_json_refs(True); result = codeflash_output # 542ns -> 666ns (18.6% slower)
    
    codeflash_output = _get_all_json_refs(False); result = codeflash_output # 292ns -> 292ns (0.000% faster)

def test_float_input():
    """Test that float input is handled gracefully."""
    codeflash_output = _get_all_json_refs(3.14); result = codeflash_output # 458ns -> 583ns (21.4% slower)

def test_ref_value_with_unicode():
    """Test that $ref values can contain unicode characters."""
    schema = {"$ref": "#/definitions/Tëst"}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 958ns -> 1.00μs (4.20% slower)

def test_complex_json_schema_example():
    """Test a realistic complex JSON schema structure."""
    schema = {
        "type": "object",
        "properties": {
            "user": {"$ref": "#/definitions/User"},
            "address": {
                "type": "object",
                "properties": {
                    "location": {"$ref": "#/definitions/Location"}
                }
            },
            "contacts": {
                "type": "array",
                "items": [
                    {"$ref": "#/definitions/Phone"},
                    {"$ref": "#/definitions/Email"}
                ]
            }
        },
        "allOf": [
            {"$ref": "#/definitions/Timestamped"}
        ]
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 5.29μs -> 4.21μs (25.7% faster)

def test_nested_empty_lists():
    """Test handling of nested empty lists."""
    schema = {
        "items": [
            [],
            [[]],
            {"$ref": "#/definitions/Type"}
        ]
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 2.46μs -> 1.92μs (28.2% faster)

def test_mixed_nesting_dict_and_list():
    """Test complex mixed nesting of dicts and lists."""
    schema = [
        {
            "items": [
                {"$ref": "#/definitions/A"},
                [{"$ref": "#/definitions/B"}]
            ]
        },
        [
            [
                {"$ref": "#/definitions/C"}
            ]
        ]
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 3.25μs -> 2.54μs (27.9% faster)

def test_many_refs_in_single_dict():
    """Test extraction of many $refs from a single dictionary level."""
    schema = {f"field_{i}": {"$ref": f"#/definitions/Type{i}"} for i in range(100)}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 37.8μs -> 28.8μs (30.9% faster)
    for i in range(100):
        assert f"#/definitions/Type{i}" in result

def test_many_refs_in_list():
    """Test extraction of many $refs from a list."""
    schema = [{"$ref": f"#/definitions/Type{i}"} for i in range(100)]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 33.5μs -> 25.4μs (31.6% faster)
    for i in range(100):
        assert f"#/definitions/Type{i}" in result

def test_deeply_nested_structure():
    """Test extraction from a very deeply nested structure."""
    # Create a deeply nested dict chain (100 levels deep)
    schema = {"level": "root"}
    current = schema
    for i in range(100):
        current["nested"] = {"level": i, "$ref": f"#/definitions/Level{i}"}
        current = current["nested"]
    
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 125μs -> 46.2μs (170% faster)
    for i in range(100):
        assert f"#/definitions/Level{i}" in result

def test_wide_and_deep_structure():
    """Test extraction from a structure that is both wide and deep."""
    schema = {}
    current = schema
    
    # Create 20 levels of nesting
    for level in range(20):
        current["children"] = [
            {f"field_{j}": {"$ref": f"#/definitions/Level{level}_Item{j}"} 
             for j in range(10)}
        ]
        current = current["children"][0]
    
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 120μs -> 59.7μs (102% faster)
    for level in range(20):
        for j in range(10):
            assert f"#/definitions/Level{level}_Item{j}" in result

def test_large_list_with_nested_dicts():
    """Test a large list containing many nested dicts with refs."""
    schema = [
        {
            "item": i,
            "refs": [
                {"$ref": f"#/definitions/Ref{i}_{j}"}
                for j in range(10)
            ]
        }
        for i in range(50)
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 207μs -> 131μs (57.7% faster)
    for i in range(50):
        for j in range(10):
            assert f"#/definitions/Ref{i}_{j}" in result

def test_many_duplicate_refs():
    """Test that many duplicate refs are correctly deduplicated."""
    # Create a schema with 1000 references to the same definition
    schema = [
        {"$ref": "#/definitions/AlwaysTheSame"}
        for _ in range(1000)
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 277μs -> 199μs (39.0% faster)

def test_large_dict_with_many_keys():
    """Test extraction from a dict with many keys (only some are $ref)."""
    schema = {}
    expected_refs = []
    
    for i in range(500):
        if i % 2 == 0:
            # Even indices get $ref keys
            schema[f"ref_{i}"] = {"$ref": f"#/definitions/Type{i}"}
            expected_refs.append(f"#/definitions/Type{i}")
        else:
            # Odd indices get regular keys
            schema[f"key_{i}"] = f"value_{i}"
    
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 112μs -> 86.2μs (30.7% faster)
    for ref in expected_refs:
        assert ref in result

def test_complex_real_world_schema():
    """Test a large realistic JSON schema with multiple interconnected types."""
    # Build a schema similar to a large API specification
    definitions = {}
    
    for type_num in range(50):
        type_schema = {
            "type": "object",
            "properties": {}
        }
        
        # Add properties that reference other types
        for prop_num in range(10):
            target_type = (type_num + prop_num) % 50
            type_schema["properties"][f"prop_{prop_num}"] = {
                "$ref": f"#/definitions/Type{target_type}"
            }
        
        definitions[f"Type{type_num}"] = type_schema
    
    schema = {
        "definitions": definitions,
        "root": {"$ref": "#/definitions/Type0"}
    }
    
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 235μs -> 155μs (52.0% faster)

def test_alternating_dict_list_nesting():
    """Test a structure that alternates between dict and list nesting."""
    schema = {
        "level_1_dict": [
            {
                "level_2_dict": [
                    {
                        "level_3_dict": [
                            {"$ref": "#/definitions/Deep"}
                            for _ in range(50)
                        ]
                    }
                    for _ in range(40)
                ]
            }
            for _ in range(30)
        ]
    }
    
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 16.6ms -> 12.1ms (36.9% faster)

def test_return_type_is_set():
    """Test that the return type is always a set."""
    test_cases = [
        {},
        [],
        {"$ref": "#/test"},
        [{"$ref": "#/test"}],
        "string",
        42,
        None
    ]
    
    for test_case in test_cases:
        codeflash_output = _get_all_json_refs(test_case); result = codeflash_output # 3.08μs -> 3.29μs (6.29% slower)

def test_result_contains_json_ref_type():
    """Test that results contain JsonRef type (which is a string NewType)."""
    schema = {"$ref": "#/definitions/Test"}
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 958ns -> 1.08μs (11.6% slower)
    
    # JsonRef is a NewType over str, so it should be a string
    for item in result:
        assert isinstance(item, str)

def test_empty_dict_with_many_non_ref_properties():
    """Test a dict with many properties but no $refs."""
    schema = {
        f"property_{i}": f"value_{i}"
        for i in range(500)
    }
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 33.7μs -> 34.1μs (1.10% slower)

def test_large_nested_list_no_refs():
    """Test a large nested list structure with no $refs."""
    schema = [
        [
            [
                {"data": i, "value": f"item_{i}"}
                for i in range(10)
            ]
            for _ in range(10)
        ]
        for _ in range(10)
    ]
    codeflash_output = _get_all_json_refs(schema); result = codeflash_output # 334μs -> 275μs (21.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-_get_all_json_refs-mlujbi71 and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 February 20, 2026 06:54
@codeflash-ai codeflash-ai bot added labels Feb 20, 2026: "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: Medium" (Optimization Quality according to Codeflash)