KeyError: 'id' in generate_concept() due to orphan nodes created by add_edge() #45

@FredDsR

Hello, I'm using your KG construction pipeline (atlas-rag 0.0.5.post1) in my experiments and I'm very grateful for your work. While running concept generation on a ~13k-node graph, I hit a crash in the concept generation module, described below.

Description

generate_concept() in concept_generation.py crashes with KeyError: 'id' when accessing neighbor node attributes. The root cause is in csvs_to_temp_graphml() (csv_to_graphml.py): when edges reference nodes not present in the triple_nodes CSV, nx.DiGraph.add_edge() auto-creates those nodes without any attributes (id, type). Later, generate_concept() accesses temp_kg.nodes[neighbor]['id'] unconditionally and crashes.

Reproducing

This occurs on larger, real-world KGs where the edges CSV contains :START_ID or :END_ID values not present in the nodes CSV name:ID column. In our case, ~1,700 out of 13,376 nodes were orphans.
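
A quick way to check whether a dataset is affected is to compare the edge endpoints against the node IDs before building the graph. The sketch below is an illustration only: find_missing_endpoints is a hypothetical helper, the inline CSV data is made up, and the column names (name:ID, :START_ID, :END_ID) are the ones mentioned in this report.

```python
import csv
import io


def find_missing_endpoints(nodes_file, edges_file):
    """Return edge endpoint IDs that never appear in the nodes CSV."""
    known = {row["name:ID"] for row in csv.DictReader(nodes_file)}
    missing = set()
    for row in csv.DictReader(edges_file):
        for column in (":START_ID", ":END_ID"):
            if row[column] not in known:
                missing.add(row[column])
    return missing


# Made-up sample data; "Carol" appears only as an edge endpoint.
nodes_csv = io.StringIO("name:ID,type\nAlice,Entity\nBob,Entity\n")
edges_csv = io.StringIO(
    ":START_ID,:END_ID,relation,:TYPE\n"
    "Alice,Bob,knows,Relation\n"
    "Alice,Carol,knows,Relation\n"
)

missing = find_missing_endpoints(nodes_csv, edges_csv)
print(sorted(missing))  # ['Carol']
```

Run against the real CSV files, a non-empty result should correspond to the nodes that add_edge() would create without attributes.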

Root Cause

csv_to_graphml.py lines 51 vs 62:

# Line 51 — adds node WITH attributes
g.add_node(mapped_id, id=node_id, type=row["type"])

# Line 62 — add_edge implicitly creates nodes WITHOUT attributes
g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])

concept_generation.py lines 212, 216 — no guard for missing attributes:

context += ", ".join([
    f"{temp_kg.nodes[neighbor]['id']} {temp_kg[neighbor][node_id]['relation']}"
    for neighbor in random_two_neighbors
])
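
The interaction between the two snippets above can be reproduced in isolation with plain NetworkX, independent of atlas-rag. This is a minimal sketch with made-up node keys and attribute values:

```python
import networkx as nx

g = nx.DiGraph()

# Node added explicitly, with attributes (mirrors the add_node call).
g.add_node("a", id="Alice", type="Entity")

# "b" was never added, so add_edge creates it with an empty attribute dict.
g.add_edge("a", "b", relation="knows")

print(g.nodes["a"])  # {'id': 'Alice', 'type': 'Entity'}
print(g.nodes["b"])  # {}

try:
    g.nodes["b"]["id"]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'id'
```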

Suggested Fix

In csvs_to_temp_graphml(), ensure edge endpoints exist as fully-attributed nodes before adding the edge:

for row in reader:
    start_id = get_node_id(row[":START_ID"], entity_to_id)
    end_id = get_node_id(row[":END_ID"], entity_to_id)
    # Ensure both endpoints exist with attributes
    if start_id not in g.nodes:
        g.add_node(start_id, id=row[":START_ID"], type="Entity")
    if end_id not in g.nodes:
        g.add_node(end_id, id=row[":END_ID"], type="Entity")
    if not g.has_edge(start_id, end_id):
        g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])

And/or add a defensive guard in generate_concept():

if node_id not in temp_kg or 'id' not in temp_kg.nodes.get(node_id, {}):
    continue
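
Note that the traceback points at the neighbor lookup, so a guard on node_id alone may not be enough: the neighbors probably need filtering as well. The following is a sketch of that idea, not the library's code. neighbors_with_id is a hypothetical helper, the sample graph is made up, and treating neighbors as predecessors is an assumption based on the temp_kg[neighbor][node_id] access shown above.

```python
import networkx as nx


def neighbors_with_id(kg, node_id):
    """Return predecessors of node_id that carry an 'id' attribute."""
    return [n for n in kg.predecessors(node_id) if "id" in kg.nodes[n]]


kg = nx.DiGraph()
kg.add_node("n1", id="Paris", type="Entity")
kg.add_node("n2", id="France", type="Entity")
kg.add_edge("n1", "n2", relation="is capital of")
kg.add_edge("orphan", "n2", relation="borders")  # "orphan" has no attributes

safe_neighbors = neighbors_with_id(kg, "n2")
context = ", ".join(
    f"{kg.nodes[n]['id']} {kg[n]['n2']['relation']}" for n in safe_neighbors
)
print(context)  # Paris is capital of
```

Filtering silently drops the orphan's relation from the context, so it works better as a safety net than as a replacement for the fix in csvs_to_temp_graphml().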

Additional: Cache bug in get_node_id()

There's also a cache inefficiency in get_node_id() (csv_to_graphml.py:29-38). The function appends '_entity' to entity_name before storing in entity_to_id, but the cache lookup on line 31 checks the original (unmodified) name, so the cache is never hit:

def get_node_id(entity_name, entity_to_id={}):
    if entity_name not in entity_to_id:       # checks "foo"
        entity_name = entity_name + '_entity'  # mutates to "foo_entity"
        ...
        entity_to_id[entity_name] = hash_hex   # stores under "foo_entity"
    return entity_to_id[entity_name]           # looks up "foo_entity" — works only because
                                               # entity_name was already mutated above

This doesn't cause incorrect hashes (the mutation happens before hashing), but it means every call recomputes the hash since "foo" is never found as a key — only "foo_entity" is stored.
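
One possible shape for a fix is to derive the suffixed key first and use it for both the lookup and the store. The sketch below is an assumption-laden illustration: get_node_id_cached is a hypothetical name, the SHA-256 hash stands in for whatever the elided part of the original function computes, and the cache is passed explicitly rather than through a mutable default argument.

```python
import hashlib


def get_node_id_cached(entity_name, entity_to_id):
    """Return a stable ID for entity_name, hashing only on a cache miss."""
    key = entity_name + "_entity"  # same key for lookup and store
    if key not in entity_to_id:
        # Placeholder hash; the real function's hashing step is not shown here.
        entity_to_id[key] = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return entity_to_id[key]


cache = {}
first = get_node_id_cached("foo", cache)
second = get_node_id_cached("foo", cache)  # served from the cache

print(first == second)  # True
print(list(cache))      # ['foo_entity']
```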

Environment

  • atlas-rag 0.0.5.post1
  • Python 3.13
  • NetworkX (latest)
  • Model: qwen3:14b
  • ~13,000 nodes, ~9,500 edges

Traceback

concept_generation.py:212 in generate_concept
    context += ", ".join([f"{temp_kg.nodes[neighbor]['id']} ..."
KeyError: 'id'
