⚡️ Speed up function find_last_node by 5,211% #270

Closed
codeflash-ai[bot] wants to merge 1 commit into optimize from codeflash/optimize-find_last_node-mldfuqcq

codeflash-ai bot commented Feb 8, 2026

📄 5,211% (52.11x) speedup for find_last_node in src/algorithms/graph.py

⏱️ Runtime: 14.3 milliseconds → 269 microseconds (best of 250 runs)
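
The headline percentage can be sanity-checked from the two runtimes. The sketch below assumes the gain is defined as (original - optimized) / optimized; that formula is inferred from the figures in this report rather than stated by it, and `percent_faster` is a hypothetical helper name.

```python
def percent_faster(original_s: float, optimized_s: float) -> float:
    """Relative gain in percent, ASSUMING gain = (original - optimized) / optimized."""
    return (original_s - optimized_s) / optimized_s * 100.0


# Rounded runtimes quoted in this report: 14.3 ms and 269 microseconds.
gain = percent_faster(14.3e-3, 269e-6)
print(f"{gain:.0f}% faster, about {gain / 100:.1f}x")
```

With the rounded inputs the result lands within rounding distance of the reported 5,211%; the small difference is expected because the displayed runtimes are themselves rounded.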

📝 Explanation and details

The optimized code achieves a 52x speedup (5,211% faster) by reducing the algorithmic complexity from O(N*M) to O(N+M), where N is the number of nodes and M is the number of edges.

Key Optimization: Pre-computing Source Set

The original code uses a nested loop structure via generator expressions:

next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)

For each node, it scans the edges to check whether that node appears as a source. In the worst case this costs O(N*M) comparisons.

The optimized code instead:

  1. Builds a set of all source IDs once: sources = {e["source"] for e in edges} - O(M) operation
  2. Performs O(1) lookups: if node_id not in sources - uses set membership testing
  3. Total complexity: O(N+M) instead of O(N*M)
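
The three steps above can be sketched as follows. This is a minimal reconstruction for illustration, not the PR's actual diff: `find_last_node_sketch` is a hypothetical name, and the sketch assumes `edges` is a re-iterable collection of dicts with hashable `"source"` values, so it leaves out the edge-case fallbacks listed under Correctness Preservation.

```python
from typing import Optional


def find_last_node_original(nodes, edges):
    # The one-liner quoted in this description: up to O(N*M) comparisons.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def find_last_node_sketch(nodes, edges) -> Optional[dict]:
    # Hypothetical reconstruction of the set-based approach (not the PR's code).
    sources = {e["source"] for e in edges}  # built once: O(M)
    for node in nodes:  # one O(1) membership test per node: O(N)
        if node["id"] not in sources:
            return node  # first node that never appears as a source
    return None


nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
print(find_last_node_sketch(nodes, edges) == find_last_node_original(nodes, edges))
```

Both functions return the first node in `nodes` order that is not a source, so they agree on ordinary list inputs such as the chain above.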

Performance Impact by Test Case:

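  • Small graphs (2-10 nodes): 60-174% faster - modest gains, because building the set is a sizable share of the work when there are only a few nodes and edges
  • Medium graphs (about 30-100 nodes): 361-1827% faster - the optimization starts to shine
  • Large graphs (200-500 nodes): 3632-18094% faster - dramatic speedup as the O(N*M) behavior of the original becomes prohibitive
  • Early-exit cases: gains shrink when the first matching node sits near the front of nodes, e.g. test_large_sparse_graph (300 nodes) at 17.2% faster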

The test test_large_scale_chain_of_500_nodes_is_handled_correctly shows the most dramatic improvement: 5.47ms → 30.1μs (18094% faster) - because the original may compare each of 500 nodes against each of 499 edges (an upper bound of 249,500 comparisons), versus roughly 999 operations (499 set insertions plus 500 lookups) in the optimized version.
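
The comparison count for the 500-node chain can be checked directly. The counter below is a hypothetical helper that mirrors the control flow of the original one-liner, including the short-circuiting of `all()` and `next()`; it is an illustration, not code from the PR.

```python
def count_comparisons(nodes, edges):
    """Count the e["source"] != n["id"] checks the original one-liner would make."""
    count = 0
    for n in nodes:
        is_last = True
        for e in edges:
            count += 1
            if e["source"] == n["id"]:  # all() stops at the first False
                is_last = False
                break
        if is_last:  # next() stops at the first node that qualifies
            break
    return count


size = 500
nodes = [{"id": i} for i in range(size)]
edges = [{"source": i, "target": i + 1} for i in range(size - 1)]

actual = count_comparisons(nodes, edges)
upper_bound = size * (size - 1)
print(actual, upper_bound)
```

Because the edges are listed in chain order, `all()` stops at edge k for node k, so the original performs 125,249 comparisons on this input, about half of the 249,500 upper bound. Either figure is two orders of magnitude above the roughly 999 set operations of the optimized path.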

Correctness Preservation:

The optimized code carefully preserves edge cases:

  • Falls back to original behavior for non-reiterable iterators (preserves consumption semantics)
  • Falls back for unhashable source values (TypeError handling)
  • Mimics lazy evaluation for n["id"] access when sources is empty (returns node without checking id)
  • Raises appropriate exceptions (TypeError, KeyError) at the same logical points as the original
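
One way such fallbacks could look is sketched below. The PR's real implementation is not shown in this description, so the function names, the `Iterator` check, and the exact exception handling are all assumptions made for illustration, and the sketch covers only the first three cases listed above.

```python
from collections.abc import Iterator


def _scan(nodes, edges):
    # Original nested scan, used here as the fallback path.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def find_last_node_guarded(nodes, edges):
    # Hypothetical sketch; not the PR's actual code.
    if isinstance(edges, Iterator):
        # One-shot iterator: keep the original consumption semantics.
        return _scan(nodes, edges)
    try:
        sources = {e["source"] for e in edges}
    except TypeError:
        # Unhashable source value or non-subscriptable edge:
        # let the original scan produce the result or raise.
        return _scan(nodes, edges)
    for node in nodes:
        # With no sources, return the node without reading node["id"].
        if not sources or node["id"] not in sources:
            return node
    return None


# Unhashable source values fall back to the scan instead of failing.
print(find_last_node_guarded([{"id": 1}, {"id": 2}], [{"source": [1]}, {"source": 1}]))
```

The `not sources` check runs before `node["id"]` is read, which is what lets a node without an `"id"` key be returned when there are no edges, as the original does.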

This optimization is particularly valuable for graph traversal workflows where finding terminal nodes is a frequent operation on graphs with hundreds of nodes and edges.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 47 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node


def test_basic_chain_returns_true_last_node():
    # Basic: simple linear chain A -> B -> C should return node C (no outgoing edges)
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]  # define nodes in order
    edges = [
        {"source": "A", "target": "B"},
        {"source": "B", "target": "C"},
    ]  # chain edges
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.71μs -> 1.12μs (141% faster)


def test_no_edges_returns_first_node_due_to_all_true_on_empty_iterable():
    # Edge/behavioral: when edges list is empty, every node satisfies "no outgoing edges"
    # Implementation uses 'all' over an empty iterable, which returns True, so the first node is returned.
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = []  # no edges
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.38μs -> 833ns (65.1% faster)


def test_multiple_candidates_returns_first_node_in_nodes_order():
    # Basic: when multiple nodes are not sources, the function should return the first such node
    nodes = [{"id": "x"}, {"id": "y"}, {"id": "z"}]
    # edges reference an unrelated source so none of the nodes are sources
    edges = [{"source": "other", "target": "y"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.54μs -> 958ns (61.0% faster)


def test_duplicate_node_ids_returns_first_matching_instance():
    # Edge: duplicate node ids in nodes list should still return the first instance
    nodes = [{"id": 1}, {"id": 1}, {"id": 2}]
    # edges make node with id 2 a source, so nodes with id==1 are candidates
    edges = [{"source": 2, "target": 99}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.58μs -> 917ns (72.6% faster)


def test_empty_nodes_returns_none():
    # Edge: no nodes provided -> there is nothing to return; function should return None
    nodes = []
    edges = [{"source": "anything"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.00μs -> 792ns (26.3% faster)


def test_all_nodes_are_sources_returns_none():
    # Edge: if every node appears as a source in edges, then there is no 'last' node -> None
    nodes = [{"id": "n1"}, {"id": "n2"}]
    edges = [{"source": "n1"}, {"source": "n2"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.21μs -> 1.04μs (112% faster)


def test_edge_without_source_raises_keyerror():
    # Edge: edges elements must contain 'source'. Missing 'source' will raise KeyError when accessed.
    nodes = [{"id": 1}]
    edges = [{}]  # missing 'source'
    with pytest.raises(KeyError):
        find_last_node(nodes, edges)  # 3.08μs -> 1.38μs (124% faster)


def test_non_mapping_edge_raises_typeerror():
    # Edge: if an edge is not subscriptable (e.g., None), accessing e["source"] raises TypeError
    nodes = [{"id": 1}]
    edges = [None]
    with pytest.raises(TypeError):
        find_last_node(nodes, edges)  # 3.08μs -> 3.17μs (2.62% slower)


def test_large_scale_chain_of_500_nodes_is_handled_correctly():
    # Large scale: create a chain of 500 nodes and ensure the last node is found.
    size = 500
    nodes = [{"id": i} for i in range(size)]  # nodes with ids 0..499
    # create edges 0->1, 1->2, ..., 498->499 so node 499 has no outgoing edges
    edges = [{"source": i, "target": i + 1} for i in range(size - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 5.47ms -> 30.1μs (18094% faster)


def test_large_scale_many_candidates_returns_first_node_quickly():
    # Large Scale: many nodes (700) but edges do not reference them, so all are candidates.
    # The function should return the first node quickly (and according to implementation).
    size = 700  # chosen < 1000
    nodes = [{"id": f"node-{i}"} for i in range(size)]
    # edges reference entirely different ids so none of the nodes are sources
    edges = [{"source": f"external-{i}"} for i in range(10)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.54μs -> 1.83μs (38.7% faster)


def test_order_of_edges_does_not_change_result_for_chain():
    # Basic: the order of edges should not affect which node is last; node with no outgoing edges wins.
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    # edges are provided out-of-order but represent the same chain A->B, B->C
    edges = [{"source": "B", "target": "C"}, {"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.96μs -> 1.17μs (154% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from src.algorithms.graph import find_last_node


def test_single_node_no_edges():
    """Test with a single node and no edges - should return that node."""
    nodes = [{"id": "node1", "label": "Node 1"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.58μs -> 833ns (90.2% faster)


def test_linear_chain_of_nodes():
    """Test with a linear chain: node1 -> node2 -> node3. Should return node3 (last node)."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
        {"id": "node3", "label": "Node 3"},
    ]
    edges = [
        {"source": "node1", "target": "node2"},
        {"source": "node2", "target": "node3"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.88μs -> 1.21μs (138% faster)


def test_diamond_graph_structure():
    """Test with diamond structure where node1 branches to node2 and node3, both converge to node4."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
        {"id": "node3", "label": "Node 3"},
        {"id": "node4", "label": "Node 4"},
    ]
    edges = [
        {"source": "node1", "target": "node2"},
        {"source": "node1", "target": "node3"},
        {"source": "node2", "target": "node4"},
        {"source": "node3", "target": "node4"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 3.54μs -> 1.29μs (174% faster)


def test_two_nodes_one_edge():
    """Test with two nodes and one edge between them."""
    nodes = [
        {"id": "start", "label": "Start"},
        {"id": "end", "label": "End"},
    ]
    edges = [{"source": "start", "target": "end"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.21μs -> 1.04μs (112% faster)


def test_multiple_disconnected_components():
    """Test with disconnected graph: node1->node2 and node3->node4. Should return one of the final nodes."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
        {"id": "node3", "label": "Node 3"},
        {"id": "node4", "label": "Node 4"},
    ]
    edges = [
        {"source": "node1", "target": "node2"},
        {"source": "node3", "target": "node4"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.38μs -> 1.08μs (119% faster)


def test_node_with_self_loop():
    """Test with a node that has a self-loop (points to itself)."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
    ]
    edges = [
        {"source": "node1", "target": "node2"},
        {"source": "node2", "target": "node2"},  # self-loop
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.25μs -> 1.08μs (108% faster)


def test_empty_nodes_list():
    """Test with empty nodes list."""
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 958ns -> 667ns (43.6% faster)


def test_empty_edges_list():
    """Test with nodes but no edges - should return the first node."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
    ]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.50μs -> 833ns (80.1% faster)


def test_node_with_no_outgoing_edges():
    """Test identifying a node with no outgoing edges in a larger graph."""
    nodes = [
        {"id": "a", "label": "A"},
        {"id": "b", "label": "B"},
        {"id": "c", "label": "C"},
    ]
    edges = [
        {"source": "a", "target": "b"},
        {"source": "b", "target": "c"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.75μs -> 1.12μs (144% faster)


def test_multiple_last_nodes_same_level():
    """Test with multiple nodes at the same level with no outgoing edges."""
    nodes = [
        {"id": "root", "label": "Root"},
        {"id": "leaf1", "label": "Leaf 1"},
        {"id": "leaf2", "label": "Leaf 2"},
    ]
    edges = [
        {"source": "root", "target": "leaf1"},
        {"source": "root", "target": "leaf2"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.38μs -> 1.04μs (128% faster)


def test_node_with_extra_attributes():
    """Test nodes with various additional attributes beyond id and label."""
    nodes = [
        {"id": "node1", "label": "Node 1", "type": "start", "color": "green"},
        {
            "id": "node2",
            "label": "Node 2",
            "type": "end",
            "color": "red",
            "description": "Final node",
        },
    ]
    edges = [{"source": "node1", "target": "node2"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.17μs -> 1.00μs (117% faster)


def test_edge_with_extra_attributes():
    """Test edges that have attributes beyond source and target."""
    nodes = [
        {"id": "start", "label": "Start"},
        {"id": "end", "label": "End"},
    ]
    edges = [
        {
            "source": "start",
            "target": "end",
            "weight": 5,
            "label": "transition",
            "color": "blue",
        }
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.08μs -> 1.00μs (108% faster)


def test_node_id_with_special_characters():
    """Test node IDs containing special characters."""
    nodes = [
        {"id": "node-1", "label": "Node 1"},
        {"id": "node_2", "label": "Node 2"},
        {"id": "node.3", "label": "Node 3"},
    ]
    edges = [
        {"source": "node-1", "target": "node_2"},
        {"source": "node_2", "target": "node.3"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 3.21μs -> 1.29μs (148% faster)


def test_node_id_with_numbers():
    """Test node IDs that are purely numeric strings."""
    nodes = [
        {"id": "1", "label": "First"},
        {"id": "2", "label": "Second"},
        {"id": "3", "label": "Third"},
    ]
    edges = [
        {"source": "1", "target": "2"},
        {"source": "2", "target": "3"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.33μs -> 1.08μs (115% faster)


def test_very_long_node_id():
    """Test with very long node ID strings."""
    long_id = "a" * 1000
    nodes = [
        {"id": long_id, "label": "Long ID Node"},
        {"id": "normal_id", "label": "Normal"},
    ]
    edges = [{"source": long_id, "target": "normal_id"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.04μs -> 959ns (113% faster)


def test_unicode_node_ids():
    """Test with unicode characters in node IDs."""
    nodes = [
        {"id": "🔴", "label": "Red Circle"},
        {"id": "🟢", "label": "Green Circle"},
    ]
    edges = [{"source": "🔴", "target": "🟢"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.96μs -> 1.08μs (80.9% faster)


def test_none_value_in_node_extra_attributes():
    """Test that nodes with None values in attributes are handled correctly."""
    nodes = [
        {"id": "node1", "label": None, "value": 0},
        {"id": "node2", "label": "Node 2", "value": None},
    ]
    edges = [{"source": "node1", "target": "node2"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.00μs (87.5% faster)


def test_nodes_with_same_label_different_ids():
    """Test that nodes are correctly identified by ID, not label."""
    nodes = [
        {"id": "node1", "label": "Same Label"},
        {"id": "node2", "label": "Same Label"},
    ]
    edges = [{"source": "node1", "target": "node2"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 959ns (95.5% faster)


def test_cycle_in_graph():
    """Test with a cycle: node1 -> node2 -> node3 -> node1."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
        {"id": "node3", "label": "Node 3"},
    ]
    edges = [
        {"source": "node1", "target": "node2"},
        {"source": "node2", "target": "node3"},
        {"source": "node3", "target": "node1"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.46μs -> 1.17μs (111% faster)


def test_complete_cycle_all_nodes():
    """Test with each node pointing to every other node (complete cycle)."""
    nodes = [
        {"id": "a", "label": "A"},
        {"id": "b", "label": "B"},
    ]
    edges = [
        {"source": "a", "target": "b"},
        {"source": "b", "target": "a"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.96μs -> 1.00μs (95.8% faster)


def test_edge_referencing_nonexistent_node():
    """Test with an edge referencing a node that doesn't exist in nodes list."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
    ]
    edges = [
        {"source": "node1", "target": "node2"},
        {
            "source": "node2",
            "target": "node_nonexistent",
        },  # References non-existent node
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.00μs (83.3% faster)


def test_isolated_nodes_among_connected():
    """Test with some isolated nodes and some connected nodes."""
    nodes = [
        {"id": "isolated1", "label": "Isolated 1"},
        {"id": "connected1", "label": "Connected 1"},
        {"id": "connected2", "label": "Connected 2"},
        {"id": "isolated2", "label": "Isolated 2"},
    ]
    edges = [
        {"source": "connected1", "target": "connected2"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.46μs -> 916ns (59.2% faster)


def test_node_pointing_to_itself_is_last():
    """Test that a node with only a self-loop is not considered a last node."""
    nodes = [
        {"id": "node1", "label": "Node 1"},
        {"id": "node2", "label": "Node 2"},
    ]
    edges = [
        {"source": "node1", "target": "node1"},  # self-loop
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 958ns (100% faster)


def test_wide_graph_many_children_one_parent():
    """Test with one parent node and many children."""
    nodes = [{"id": "parent", "label": "Parent"}]
    nodes.extend([{"id": f"child{i}", "label": f"Child {i}"} for i in range(10)])
    edges = [{"source": "parent", "target": f"child{i}"} for i in range(10)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.62μs -> 1.38μs (90.9% faster)


def test_deep_graph_long_chain():
    """Test with a very deep chain of nodes."""
    depth = 50
    nodes = [{"id": f"node{i}", "label": f"Node {i}"} for i in range(depth)]
    edges = [{"source": f"node{i}", "target": f"node{i+1}"} for i in range(depth - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 72.3μs -> 7.08μs (920% faster)


def test_large_linear_chain():
    """Test with a large linear chain of 500 nodes."""
    num_nodes = 500
    nodes = [{"id": f"node{i}", "label": f"Node {i}"} for i in range(num_nodes)]
    edges = [
        {"source": f"node{i}", "target": f"node{i+1}"} for i in range(num_nodes - 1)
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 5.81ms -> 57.3μs (10027% faster)


def test_large_branching_tree():
    """Test with a 31-node tree-like structure (5 levels deep)."""
    # Build the tree. With 0-based ids, node_id // 2 gives node0 a single child,
    # so this is not a strict binary tree.
    nodes = []
    edges = []
    node_id = 0

    for level in range(5):
        level_size = 2**level
        for i in range(level_size):
            nodes.append({"id": f"node{node_id}", "label": f"Node {node_id}"})
            parent_id = node_id // 2 if node_id > 0 else None
            if parent_id is not None and level > 0:
                edges.append({"source": f"node{parent_id}", "target": f"node{node_id}"})
            node_id += 1

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 19.0μs -> 4.12μs (361% faster)


def test_large_diamond_multiple_paths():
    """Test with large graph having multiple convergence points."""
    # Create nodes
    num_sources = 100
    num_middle = 50

    nodes = [{"id": "start", "label": "Start"}]
    for i in range(num_sources):
        nodes.append({"id": f"source{i}", "label": f"Source {i}"})
    for i in range(num_middle):
        nodes.append({"id": f"middle{i}", "label": f"Middle {i}"})
    nodes.append({"id": "end", "label": "End"})

    edges = []
    # Connect start to all sources
    for i in range(num_sources):
        edges.append({"source": "start", "target": f"source{i}"})
    # Connect sources to middle nodes
    for i in range(num_sources):
        edges.append({"source": f"source{i}", "target": f"middle{i % num_middle}"})
    # Connect all middle nodes to end
    for i in range(num_middle):
        edges.append({"source": f"middle{i}", "target": "end"})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.23ms -> 23.5μs (5130% faster)


def test_large_sparse_graph():
    """Test with a large sparse graph."""
    num_nodes = 300
    nodes = [{"id": f"node{i}", "label": f"Node {i}"} for i in range(num_nodes)]

    # Create a sparse set of edges
    edges = []
    for i in range(0, num_nodes - 1, 5):  # Only connect every 5th node
        edges.append({"source": f"node{i}", "target": f"node{i+1}"})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 5.67μs -> 4.83μs (17.2% faster)


def test_large_nodes_with_complex_attributes():
    """Test with large number of nodes each having multiple attributes."""
    num_nodes = 200
    nodes = []
    for i in range(num_nodes):
        nodes.append(
            {
                "id": f"node{i}",
                "label": f"Node {i}",
                "type": "standard" if i % 2 == 0 else "special",
                "level": i // 10,
                "weight": i * 0.5,
                "data": {"nested": f"value{i}"},
            }
        )

    # Create a simple chain
    edges = [
        {"source": f"node{i}", "target": f"node{i+1}"} for i in range(num_nodes - 1)
    ]

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 922μs -> 24.7μs (3632% faster)


def test_large_graph_with_multiple_disjoint_chains():
    """Test with multiple independent chains."""
    num_chains = 20
    chain_length = 25
    nodes = []
    edges = []

    for chain in range(num_chains):
        for pos in range(chain_length):
            node_id = f"chain{chain}_node{pos}"
            nodes.append({"id": node_id, "label": f"Chain {chain} Node {pos}"})
            if pos > 0:
                prev_id = f"chain{chain}_node{pos-1}"
                edges.append({"source": prev_id, "target": node_id})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 46.6μs -> 35.8μs (30.3% faster)


def test_performance_with_many_edges():
    """Test performance with a node that has many outgoing edges."""
    num_targets = 200
    nodes = [{"id": "hub", "label": "Hub"}]
    nodes.extend(
        [{"id": f"target{i}", "label": f"Target {i}"} for i in range(num_targets)]
    )

    edges = [{"source": "hub", "target": f"target{i}"} for i in range(num_targets)]

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 11.3μs -> 4.92μs (130% faster)


def test_large_graph_all_nodes_have_incoming_edges():
    """Test with a large graph where all nodes have incoming edges (cycle)."""
    num_nodes = 100
    nodes = [{"id": f"node{i}", "label": f"Node {i}"} for i in range(num_nodes)]

    # Create edges forming a cycle where every node is both source and target
    edges = []
    for i in range(num_nodes):
        edges.append({"source": f"node{i}", "target": f"node{(i+1) % num_nodes}"})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 259μs -> 13.5μs (1827% faster)


def test_large_complete_graph_subset():
    """Test with a large mostly-connected subgraph."""
    num_nodes = 50
    nodes = [{"id": f"node{i}", "label": f"Node {i}"} for i in range(num_nodes)]

    edges = []
    # Connect first 49 nodes to node 49 (creating a DAG ending at node 49)
    for i in range(num_nodes - 1):
        edges.append({"source": f"node{i}", "target": f"node{num_nodes - 1}"})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 71.2μs -> 6.71μs (962% faster)


def test_large_graph_mixed_structure():
    """Test with a large graph having mixed structures (chains, branches, convergences)."""
    nodes = []
    edges = []

    # Create initial chain
    for i in range(20):
        nodes.append({"id": f"chain{i}", "label": f"Chain {i}"})
        if i > 0:
            edges.append({"source": f"chain{i-1}", "target": f"chain{i}"})

    # Branch from the last chain node
    for i in range(30):
        nodes.append({"id": f"branch{i}", "label": f"Branch {i}"})
        edges.append({"source": "chain19", "target": f"branch{i}"})

    # Converge all branches to final node
    nodes.append({"id": "final", "label": "Final"})
    for i in range(30):
        edges.append({"source": f"branch{i}", "target": "final"})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 114μs -> 8.46μs (1248% faster)


def test_large_wide_then_narrow_graph():
    """Test with a graph that widens then narrows."""
    nodes = [{"id": "root", "label": "Root"}]
    edges = []

    # Widen: root connects to 50 nodes
    for i in range(50):
        nodes.append({"id": f"level1_{i}", "label": f"Level 1 Node {i}"})
        edges.append({"source": "root", "target": f"level1_{i}"})

    # Narrow: all level 1 nodes connect to single final node
    nodes.append({"id": "final", "label": "Final"})
    for i in range(50):
        edges.append({"source": f"level1_{i}", "target": "final"})

    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 179μs -> 9.79μs (1735% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-find_last_node-mldfuqcq` and push.


codeflash-ai bot requested a review from KRRT7 on February 8, 2026 07:45
codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) on Feb 8, 2026
KRRT7 closed this on Feb 9, 2026