⚡️ Speed up function find_last_node by 17,641%#278

Open
codeflash-ai[bot] wants to merge 1 commit into optimize from codeflash/optimize-find_last_node-mlfrkaie

@codeflash-ai codeflash-ai bot commented Feb 9, 2026

📄 17,641% (176.41x) speedup for find_last_node in src/algorithms/graph.py

⏱️ Runtime: 78.8 milliseconds → 444 microseconds (best of 187 runs)

📝 Explanation and details

The optimized code achieves a 176x speedup (17,641% faster) by eliminating redundant work in the core algorithm. The original implementation had O(N×M) complexity where it checked every edge against every node, resulting in excessive dictionary lookups and comparisons. The optimization reduces this to O(N+M) by preprocessing edges into a set of sources.

Key optimization:
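The original code uses a nested generator expression that evaluates all(e["source"] != n["id"] for e in edges) for each node, so every node rescans the edges list from the start. With 1000 nodes and 999 edges, the upper bound is N×M ≈ 1 million edge checks; because all() stops at the first matching edge, the actual count depends on edge order and is lower for nodes whose match appears early.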
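For reference, the per-node scan described above can be sketched as follows. This is a reconstruction from the quoted expression only; the actual body of find_last_node in src/algorithms/graph.py is not shown in this PR description, and the name find_last_node_original is hypothetical.

```python
def find_last_node_original(nodes, edges):
    """Sketch (assumed shape): first node whose id never appears as an edge source."""
    return next(
        # For every node, all() walks the edges until a source matches the node id.
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,  # no sink found
    )


nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
print(find_last_node_original(nodes, edges))  # {'id': 3}
```

Note that when edges is empty, all() returns True without ever evaluating n["id"], which is why the first node is returned even if it lacks an 'id' key.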

The optimized code instead:

  1. Builds a set of all source IDs from edges once: sources = {e["source"] for e in edges}
  2. For each node, performs a single O(1) set membership check: if n["id"] not in sources

This transforms the algorithm from checking each (node, edge) pair to a simple linear scan with constant-time lookups.
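The two steps above can be sketched as a minimal function. The name find_last_node_optimized is hypothetical, and this sketch leaves out the edge-case handling mentioned later in this description.

```python
def find_last_node_optimized(nodes, edges):
    """Sketch: build the set of sources once, then do one lookup per node."""
    # Step 1: a single pass over the edges collects every source id.
    sources = {e["source"] for e in edges}
    # Step 2: return the first node whose id is not a source, or None.
    return next((n for n in nodes if n["id"] not in sources), None)


nodes = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
edges = [{"source": "a"}, {"source": "c"}]
print(find_last_node_optimized(nodes, edges))  # {'id': 'b'}
```

One trade-off of this approach: building a set requires every source value to be hashable, whereas a comparison with != does not.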

Why it's faster:

  • Set membership is O(1) vs. O(M) linear edge scanning per node
  • Single edge traversal instead of N traversals (once per node)
  • Reduced dictionary access overhead: Each edge's "source" key is accessed once, not N times
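To observe the difference locally, a rough timing comparison could look like the sketch below. The function names scan_per_node, lookup_in_set, and compare_timings are hypothetical, both implementations are reconstructions from this description, and absolute timings will vary by machine.

```python
import timeit


def scan_per_node(nodes, edges):
    # Assumed shape of the original: rescan the edges for every node.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def lookup_in_set(nodes, edges):
    # Assumed shape of the optimization: one pass over edges, O(1) lookups.
    sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in sources), None)


def compare_timings(size=1000, repeats=5):
    """Return (best per-node-scan time, best set-lookup time) in seconds."""
    nodes = [{"id": i} for i in range(size)]
    edges = [{"source": i} for i in range(size - 1)]  # chain: only the last node is a sink
    assert scan_per_node(nodes, edges) == lookup_in_set(nodes, edges)
    slow = min(timeit.repeat(lambda: scan_per_node(nodes, edges), number=1, repeat=repeats))
    fast = min(timeit.repeat(lambda: lookup_in_set(nodes, edges), number=1, repeat=repeats))
    return slow, fast


if __name__ == "__main__":
    slow, fast = compare_timings()
    print(f"per-node scan: {slow * 1e3:.2f} ms, set lookup: {fast * 1e6:.1f} µs")
```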

Test results demonstrate:

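  • Small graphs (1-4 nodes): roughly 45-120% faster in the generated tests due to reduced overhead; the two empty-nodes cases run slower (625ns before, 667-792ns after)
  • Large graphs (1000 nodes): 120x-360x faster, consistent with the algorithmic improvement scaling with input size
  • test_large_scale_single_chain_1000_nodes shows the most dramatic speedup (36,045% faster): only the final node is a sink, so the original code has to rescan the edges list for all 1000 nodes before it finds a result (each scan stops at the first edge whose source matches)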
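As a rough illustration of why the 1000-node chain is expensive for the per-node scan, the sketch below counts how many edge comparisons that scan would perform. The helper count_edge_checks is hypothetical and assumes the original short-circuits exactly as all() and next() do.

```python
def count_edge_checks(nodes, edges):
    """Count edge comparisons made by a per-node scan that stops at the first match."""
    checks = 0
    for n in nodes:
        for e in edges:
            checks += 1
            if e["source"] == n["id"]:
                break  # all() would stop at the first matching source
        else:
            return checks  # no edge matched: this node is the sink, the scan ends here
    return checks


size = 1000
nodes = [{"id": i} for i in range(size)]
edges = [{"source": i} for i in range(size - 1)]
print(count_edge_checks(nodes, edges))  # 500499
```

For comparison, the set-based approach touches each of the 999 edges once and performs at most 1000 membership lookups.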

Edge case handling:
The optimization includes special handling for single-pass iterators to preserve original semantics, and checks for empty edge lists to avoid accessing n["id"] unnecessarily, maintaining backward compatibility with nodes that may lack an 'id' key when no edges exist.
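The PR's actual diff is not reproduced in this description, so the following is only a guess at how such guards might look. The name find_last_node_guarded and the choice to fall back to the per-node scan for iterators are assumptions.

```python
from collections.abc import Iterator


def find_last_node_guarded(nodes, edges):
    """Hypothetical sketch of the edge-case handling described above."""
    if isinstance(edges, Iterator):
        # A single-pass iterator is consumed as it is scanned; fall back to the
        # per-node scan so the consumption pattern matches the original expression.
        return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)
    sources = {e["source"] for e in edges}
    if not sources:
        # No edges: the first node is returned without reading n["id"].
        return next(iter(nodes), None)
    return next((n for n in nodes if n["id"] not in sources), None)


print(find_last_node_guarded([{"name": "missing_id"}], []))  # {'name': 'missing_id'}
```

With a non-empty edge list, a node without an 'id' key still raises KeyError when it is reached, matching the behavior exercised by test_missing_id_key_raises_keyerror.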

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 45 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node


def test_empty_nodes_returns_none():
    # An empty nodes list cannot contain a sink node; should return None
    nodes = []  # no nodes at all
    edges = [
        {"source": "a"},
        {"source": "b"},
    ]  # edges irrelevant when there are no nodes
    codeflash_output = find_last_node(nodes, edges)  # 625ns -> 792ns (21.1% slower)


def test_no_edges_returns_first_node():
    # When there are no edges, every node has no outgoing edge; function returns the first node
    nodes = [{"id": "n1"}, {"id": "n2"}, {"id": "n3"}]  # three nodes
    edges = []  # no edges at all
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.04μs -> 666ns (56.5% faster)


def test_multiple_sinks_returns_first_encountered():
    # If multiple sink nodes exist, the first one encountered in 'nodes' should be returned
    nodes = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
    # 'a' and 'c' appear as sources, so 'b' and 'd' are the sinks; 'b' comes first in 'nodes'
    edges = [{"source": "a"}, {"source": "c"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.75μs -> 1.04μs (68.1% faster)


def test_fully_cyclic_returns_none():
    # When every node has at least one outgoing edge (every id appears as a source),
    # there is no sink node and the function should return None.
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    # Create edges such that all node ids appear as sources (cyclic or fully-connected)
    edges = [{"source": 1}, {"source": 2}, {"source": 3}]
    codeflash_output = find_last_node(nodes, edges)  # 1.88μs -> 1.00μs (87.5% faster)


def test_edges_with_unknown_sources_ignored():
    # Edges that reference sources not present in nodes should not prevent existing nodes
    # from being identified as sinks. Only sources that match node ids matter.
    nodes = [{"id": "alpha"}, {"id": "beta"}]
    # 'ghost' is not an id of any node; only 'alpha' is referenced as a source
    edges = [{"source": "ghost"}, {"source": "alpha"}]
    # 'alpha' is a source, so 'beta' is the first non-source node and should be returned
    codeflash_output = find_last_node(nodes, edges)  # 1.79μs -> 1.00μs (79.2% faster)


def test_nodes_with_various_id_types():
    # Ensure the function works correctly with different id types (int, str, None)
    nodes = [{"id": 1}, {"id": "1"}, {"id": None}]
    # Only the numeric id 1 is used as a source, the string "1" and None should be treated distinctly
    edges = [{"source": 1}]
    # The first node which is not source is the second node (id = "1")
    codeflash_output = find_last_node(nodes, edges)  # 1.62μs -> 917ns (77.2% faster)
    # If we add an edge with source "1", then the sink becomes the node with id None
    edges.append({"source": "1"})
    codeflash_output = find_last_node(nodes, edges)  # 1.38μs -> 625ns (120% faster)


def test_missing_id_key_raises_keyerror():
    # If a node dictionary lacks the 'id' key, accessing n["id"] should raise KeyError
    nodes = [{"id": "ok"}, {"name": "missing_id"}]  # second node missing 'id'
    edges = [{"source": "ok"}]
    with pytest.raises(KeyError):
        # The function will iterate nodes and attempt to access the missing 'id' key -> KeyError
        find_last_node(nodes, edges)  # 2.00μs -> 1.12μs (77.8% faster)


def test_edge_missing_source_key_raises_keyerror():
    # If an edge dictionary lacks the 'source' key, evaluating e["source"] will raise KeyError
    nodes = [{"id": "x"}, {"id": "y"}]
    edges = [{"source": "x"}, {"target": "y"}]  # second edge missing 'source'
    with pytest.raises(KeyError):
        # As soon as the generator examines the problematic edge, KeyError should propagate
        find_last_node(nodes, edges)  # 2.04μs -> 1.00μs (104% faster)


def test_large_scale_single_chain_1000_nodes():
    # Build 1000 nodes in a chain where every node except the last appears as a source
    size = 1000  # number of nodes (as required up to 1000)
    nodes = [{"id": i} for i in range(size)]  # node ids 0..999
    # Make edges so that each node 0..998 appears as a source (so only node 999 is a sink)
    edges = [{"source": i} for i in range(size - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.6ms -> 51.3μs (36045% faster)


def test_large_scale_multiple_sinks_prefers_first():
    # Build 1000 nodes but make every node except id 500 and id 999 a source.
    # The first sink encountered in node order should be id 500.
    size = 1000
    nodes = [{"id": i} for i in range(size)]
    # Make almost all nodes sources, but deliberately omit 500 and 999
    edges = [{"source": i} for i in range(size) if i not in (500, 999)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 4.45ms -> 36.0μs (12264% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from src.algorithms.graph import find_last_node


class TestFindLastNodeBasic:
    """Basic tests - verify fundamental functionality under normal conditions."""

    def test_single_node_no_edges(self):
        """Test a single node with no edges returns that node."""
        nodes = [{"id": 1, "name": "A"}]
        edges = []
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.00μs -> 667ns (49.9% faster)

    def test_simple_linear_chain(self):
        """Test a simple chain: A -> B -> C returns C (the last node)."""
        nodes = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]
        edges = [
            {"source": 1, "target": 2},
            {"source": 2, "target": 3},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.96μs -> 1.00μs (95.8% faster)

    def test_two_nodes_one_edge(self):
        """Test two nodes with one edge from A to B returns B."""
        nodes = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
        edges = [{"source": 1, "target": 2}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.54μs -> 875ns (76.1% faster)

    def test_node_with_multiple_outgoing_edges(self):
        """Test a node with multiple outgoing edges is not returned as the last node."""
        nodes = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]
        edges = [
            {"source": 1, "target": 2},
            {"source": 1, "target": 3},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.58μs -> 958ns (65.2% faster)

    def test_node_with_multiple_incoming_edges(self):
        """Test that a node with multiple incoming edges is still returned if it has no outgoing edges."""
        nodes = [
            {"id": 1, "name": "A"},
            {"id": 2, "name": "B"},
            {"id": 3, "name": "C"},
        ]
        edges = [
            {"source": 1, "target": 3},
            {"source": 2, "target": 3},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.79μs -> 1.00μs (79.2% faster)

    def test_returns_first_sink_node_in_order(self):
        """Test that the first sink node in iteration order is returned."""
        nodes = [
            {"id": 1, "name": "A"},
            {"id": 2, "name": "B"},
            {"id": 3, "name": "C"},
        ]
        edges = [
            {"source": 1, "target": 2},
        ]
        # Both B and C have no outgoing edges, but B appears first in the nodes list
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.46μs -> 917ns (59.0% faster)


class TestFindLastNodeEdgeCases:
    """Edge tests - evaluate behavior under extreme or unusual conditions."""

    def test_empty_nodes_list(self):
        """Test with an empty nodes list returns None."""
        nodes = []
        edges = []
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 625ns -> 667ns (6.30% slower)

    def test_empty_edges_all_nodes_are_sinks(self):
        """Test with empty edges means all nodes are sinks, returns first."""
        nodes = [
            {"id": 1, "name": "A"},
            {"id": 2, "name": "B"},
            {"id": 3, "name": "C"},
        ]
        edges = []
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.00μs -> 666ns (50.2% faster)

    def test_fully_cyclic_graph_no_sinks(self):
        """Test a fully cyclic graph where every node has outgoing edges returns None."""
        nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
        edges = [
            {"source": 1, "target": 2},
            {"source": 2, "target": 3},
            {"source": 3, "target": 1},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.92μs -> 1.04μs (84.1% faster)

    def test_self_loop_counts_as_outgoing(self):
        """Test that a self-loop still counts as an outgoing edge, so no sink exists."""
        nodes = [{"id": 1, "name": "A"}]
        edges = [{"source": 1, "target": 1}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.21μs -> 833ns (45.0% faster)

    def test_node_id_string_type(self):
        """Test with string node IDs instead of integers."""
        nodes = [
            {"id": "node_a", "name": "A"},
            {"id": "node_b", "name": "B"},
        ]
        edges = [{"source": "node_a", "target": "node_b"}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.67μs -> 917ns (81.8% faster)

    def test_node_with_extra_properties(self):
        """Test nodes with additional properties beyond id and name."""
        nodes = [
            {"id": 1, "name": "A", "value": 100, "type": "start"},
            {"id": 2, "name": "B", "value": 200, "type": "end"},
        ]
        edges = [{"source": 1, "target": 2}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.50μs -> 917ns (63.6% faster)

    def test_edge_with_extra_properties(self):
        """Test edges with additional properties beyond source and target."""
        nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
        edges = [
            {"source": 1, "target": 2, "weight": 5, "label": "edge1"},
            {"source": 2, "target": 3, "weight": 10, "label": "edge2"},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.88μs -> 1.00μs (87.5% faster)

    def test_multiple_edges_same_source_and_target(self):
        """Test multiple edges between the same pair of nodes."""
        nodes = [{"id": 1}, {"id": 2}]
        edges = [
            {"source": 1, "target": 2},
            {"source": 1, "target": 2},
            {"source": 1, "target": 2},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.67μs -> 1.04μs (59.9% faster)

    def test_node_id_zero(self):
        """Test that node ID of 0 is handled correctly."""
        nodes = [{"id": 0, "name": "Zero"}, {"id": 1, "name": "One"}]
        edges = [{"source": 0, "target": 1}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.42μs -> 875ns (61.9% faster)

    def test_node_id_negative(self):
        """Test that negative node IDs are handled correctly."""
        nodes = [{"id": -1, "name": "Neg"}, {"id": 1, "name": "Pos"}]
        edges = [{"source": -1, "target": 1}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.50μs -> 917ns (63.6% faster)

    def test_large_node_id_numbers(self):
        """Test with very large node ID numbers."""
        nodes = [{"id": 999999999}, {"id": 1000000000}]
        edges = [{"source": 999999999, "target": 1000000000}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.54μs -> 958ns (60.9% faster)

    def test_node_id_none_type(self):
        """Test with None as a node ID value."""
        nodes = [{"id": None}, {"id": 1}]
        edges = [{"source": None, "target": 1}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.71μs -> 916ns (86.5% faster)

    def test_duplicate_nodes_same_id(self):
        """Test with duplicate node entries having the same ID."""
        nodes = [
            {"id": 1, "name": "A"},
            {"id": 1, "name": "A"},
            {"id": 2, "name": "B"},
        ]
        edges = [{"source": 1, "target": 2}]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 1.71μs -> 917ns (86.3% faster)

    def test_diamond_graph_pattern(self):
        """Test a diamond-shaped graph: A splits to B and C, both merge to D."""
        nodes = [
            {"id": 1, "name": "A"},
            {"id": 2, "name": "B"},
            {"id": 3, "name": "C"},
            {"id": 4, "name": "D"},
        ]
        edges = [
            {"source": 1, "target": 2},
            {"source": 1, "target": 3},
            {"source": 2, "target": 4},
            {"source": 3, "target": 4},
        ]
        codeflash_output = find_last_node(nodes, edges)
        result = codeflash_output  # 2.25μs -> 1.17μs (93.0% faster)

To edit these changes, run `git checkout codeflash/optimize-find_last_node-mlfrkaie` and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 February 9, 2026 22:48
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Feb 9, 2026