KeyError crash when running ~1000 queries on a single connection after any DDL/ROLLBACK (connection becomes unusable) #273

@colin-rogers-dbt

Description

Summary

When a connection executes a DDL statement (CREATE, DROP, ALTER) or triggers a ROLLBACK, the prepared-statement LRU cache enters a split-brain state. Once the cache subsequently fills back up to max_prepared_statements (default: 1000), the next query crashes with a KeyError inside the eviction logic. The crash also corrupts the connection so that the following rollback() call fails too, leaving the connection permanently broken.

This manifests in some dbt projects because, after dbt executes at least one DDL statement, users can run an arbitrary number of subsequent queries, all on the same connection.

Environment

  • redshift-connector version: 2.1.12
  • Python: 3.x
  • Cluster: Redshift (provisioned)

Steps to reproduce

Minimal inline repro:

```python
import redshift_connector

conn = redshift_connector.connect(host=..., database=..., user=..., password=...)
conn.autocommit = True
cur = conn.cursor()

# Step 1: DDL poisons the LRU tracker
cur.execute("CREATE TABLE IF NOT EXISTS repro_test (id INT)")

# Step 2: fill the prepared-statement cache to the limit (default 1000)
for i in range(1000):
    cur.execute(f"SELECT {i} AS n")
    cur.fetchall()

# Step 3: one more new query triggers eviction → KeyError
cur.execute("SELECT 'boom' AS x")  # ← crashes here
```

Expected: query succeeds.
Actual:
```
KeyError: ('CREATE TABLE IF NOT EXISTS repro_test (id INT)', ())
  File "redshift_connector/core.py", line 1849, in execute
    statements_to_close.append(cache["ps"][oldest_key]["statement_name_bin"])
```

Root cause

There are two parallel cache structures kept in sync for LRU eviction:

```
cache["ps"] – dict : key → prepared-statement metadata
cache["statement_dict"] – OrderedDict : key → None (LRU order)
```
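The divergence is easy to reproduce with a stand-alone model of these two structures (plain Python, no connection required; the keys and statement names below are illustrative, not the library's real values):

```python
from collections import OrderedDict

cache = {"ps": {}, "statement_dict": OrderedDict()}

# Three unique queries: both structures grow in lockstep.
for key in [("SELECT 1", ()), ("SELECT 2", ()), ("SELECT 3", ())]:
    cache["ps"][key] = {"statement_name_bin": b"stmt"}
    cache["statement_dict"][key] = None

# Simulate handle_COMMAND_COMPLETE on DDL/ROLLBACK: only ps is cleared.
cache["ps"].clear()

print(len(cache["ps"]), len(cache["statement_dict"]))  # → 0 3
```

After the simulated DDL, `statement_dict` still tracks three keys that no longer exist in `ps` — exactly the split-brain state described above.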

Bug 1 — handle_COMMAND_COMPLETE (core.py:2029–2034) clears ps on DDL/ROLLBACK but leaves statement_dict intact:

```python
if command in (b"ALTER", b"CREATE", b"DROP", b"ROLLBACK"):
    for ps in pcache["ps"].values():
        self.close_prepared_statement(ps["statement_name_bin"])
    pcache["ps"].clear()  # ← ps emptied
    # statement_dict is NOT cleared ← stale keys remain
```

Bug 2 — The consistency-repair check (core.py:1842–1844) only adds entries to statement_dict; it never removes stale ones:

```python
if len(cache["ps"]) != len(cache["statement_dict"]):
    for existing_key in cache["ps"]:
        cache["statement_dict"][existing_key] = None  # adds, never removes
```

Bug 3 (crash site) — Eviction blindly assumes the oldest statement_dict key exists in ps (core.py:1847–1850):

```python
if len(cache["ps"]) >= self.max_prepared_statements:
    oldest_key, _ = cache["statement_dict"].popitem(last=False)
    statements_to_close.append(cache["ps"][oldest_key]["statement_name_bin"])
    # ↑ KeyError if oldest_key is stale
```

Failure sequence

  1. Connection runs N unique queries → ps and statement_dict each have N entries
  2. A DDL or ROLLBACK executes → ps cleared to 0, statement_dict still has N stale keys
  3. N more unique queries run → ps refills to N; statement_dict now has 2N entries
  4. One more new query: eviction fires (len(ps) >= 1000), pops the oldest statement_dict entry (a stale pre-DDL key), crashes with KeyError
  5. The crashed query left a PARSE message partially sent; the subsequent rollback() fails with ProgrammingError: prepared statement "..." already exists, making the connection permanently unusable
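The whole sequence can be condensed into a self-contained simulation of the cache bookkeeping (with the limit lowered to 3 for brevity; the dict shapes mirror core.py but the helper and values are illustrative):

```python
from collections import OrderedDict

MAX_PS = 3  # stand-in for max_prepared_statements (real default: 1000)
ps = {}
statement_dict = OrderedDict()

def run_query(key):
    """Mimic the prepared-statement cache bookkeeping in execute()."""
    if key in ps:
        return
    # Bug 2: the repair check only ever adds entries
    if len(ps) != len(statement_dict):
        for existing_key in ps:
            statement_dict[existing_key] = None
    # Bug 3: eviction assumes the oldest tracked key is live
    if len(ps) >= MAX_PS:
        oldest_key, _ = statement_dict.popitem(last=False)
        ps[oldest_key]["statement_name_bin"]  # ← KeyError if stale
        del ps[oldest_key]
    ps[key] = {"statement_name_bin": b"stmt"}
    statement_dict[key] = None

run_query(("CREATE TABLE t (id INT)", ()))  # 1. DDL runs as a normal query
ps.clear()                                  # 2. Bug 1: only ps is cleared
for i in range(MAX_PS):                     # 3. refill ps to the limit
    run_query((f"SELECT {i}", ()))

stale = None
try:
    run_query(("SELECT 'boom'", ()))        # 4. eviction pops a stale key
except KeyError as e:
    stale = e.args[0]
print(stale)  # the pre-DDL key that was never removed from statement_dict
```

The key that surfaces in the KeyError is the long-gone DDL statement, matching the traceback shown earlier.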

Observed diagnostic output

```
ps size after DDL: 0 statement_dict size: 502 ← 502 stale entries remain
query 1000: ps=1000, sd=1502 ← divergence grows with each query
Step 4: KeyError: ('SELECT 1 AS query_num', ()) ← crash at eviction
```

Note: in a realistic dbt workflow the failure occurs at query ~999 rather than 1001, because DDL setup queries (DROP TABLE, CREATE TABLE) run before the main loop and pre-poison statement_dict with stale keys, reducing the effective headroom.

Proposed fix

The minimal fix is to also clear statement_dict wherever ps is cleared:

```python
# core.py ~line 2034
pcache["ps"].clear()
pcache["statement_dict"].clear()  # ← add this line
```
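Applied to the stand-alone model of the two structures, clearing both keeps the counts in lockstep (illustrative keys, not the library's real cache object):

```python
from collections import OrderedDict

pcache = {
    "ps": {("SELECT 1", ()): {"statement_name_bin": b"stmt_1"}},
    "statement_dict": OrderedDict([(("SELECT 1", ()), None)]),
}

# Fixed DDL/ROLLBACK handling: clear both structures together.
pcache["ps"].clear()
pcache["statement_dict"].clear()

assert len(pcache["ps"]) == len(pcache["statement_dict"]) == 0
```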

A more defensive fix also guards the eviction against stale keys:

```python
# core.py ~line 1847
if len(cache["ps"]) >= self.max_prepared_statements:
    while cache["statement_dict"]:
        oldest_key, _ = cache["statement_dict"].popitem(last=False)
        if oldest_key in cache["ps"]:
            statements_to_close.append(cache["ps"][oldest_key]["statement_name_bin"])
            del cache["ps"][oldest_key]
            break
        # else: stale key, discard and keep looking
```
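A quick exercise of that guarded loop against a cache pre-seeded with a stale key shows it skipping the dead entry and evicting the oldest live statement instead (stand-alone sketch; the key and statement-name values are illustrative):

```python
from collections import OrderedDict

max_prepared_statements = 2
cache = {
    "ps": {
        ("SELECT 1", ()): {"statement_name_bin": b"stmt_1"},
        ("SELECT 2", ()): {"statement_name_bin": b"stmt_2"},
    },
    "statement_dict": OrderedDict([
        (("CREATE TABLE t (id INT)", ()), None),  # stale pre-DDL key
        (("SELECT 1", ()), None),
        (("SELECT 2", ()), None),
    ]),
}
statements_to_close = []

if len(cache["ps"]) >= max_prepared_statements:
    while cache["statement_dict"]:
        oldest_key, _ = cache["statement_dict"].popitem(last=False)
        if oldest_key in cache["ps"]:
            statements_to_close.append(cache["ps"][oldest_key]["statement_name_bin"])
            del cache["ps"][oldest_key]
            break
        # stale key: discard and keep looking

print(statements_to_close)  # [b'stmt_1'] — the oldest *live* statement
```

The stale pre-DDL key is silently discarded rather than dereferenced, so no KeyError is possible even if a clear is ever missed again.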

Both fixes together fully resolve the issue.
