Avoid double-counting Automaton in CompiledAutomaton.ramBytesUsed #16046

Open
reugn wants to merge 2 commits into apache:main from reugn:fix-compiled-automaton-rambytes

reugn commented May 10, 2026

Description

CompiledAutomaton.ramBytesUsed() counts the underlying Automaton twice on the DFA path, over-reporting retained heap by 18–35% on non-trivial wildcard and regexp queries.

The automaton field is aliased to runAutomaton.automaton — a single Automaton instance referenced from two places:

// CompiledAutomaton.java:261-263
runAutomaton = new ByteRunAutomaton(binary, true);
this.automaton = runAutomaton.automaton;   // same reference

ramBytesUsed() accounts for it twice: once directly via sizeOfObject(automaton), and again through sizeOfObject(runAutomaton) which delegates to RunAutomaton.ramBytesUsed() and adds sizeOfObject(automaton) itself.

The fix is to drop the redundant sizeOfObject(automaton) from CompiledAutomaton.ramBytesUsed(). In the NFA branch both fields are null, so this is a no-op there.
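
To make the aliasing concrete, here is a minimal toy model of the accounting. The classes `Inner`, `Runner`, and `Compiled` and the shallow sizes are hypothetical stand-ins invented for illustration; they are not Lucene's classes and do not reproduce its actual `ramBytesUsed()` code. The sketch only shows why summing an aliased field and its owner counts the shared object twice.

```java
// Toy model only: hypothetical stand-ins, not Lucene's real classes or sizes.
class DoubleCountSketch {

    /** Stand-in for Automaton: reports a fixed size in bytes. */
    static final class Inner {
        final long bytes;
        Inner(long bytes) { this.bytes = bytes; }
        long ramBytesUsed() { return bytes; }
    }

    /** Stand-in for RunAutomaton: owns an Inner and includes it in its own size. */
    static final class Runner {
        static final long SHALLOW = 32; // made-up shallow size
        final Inner automaton;
        Runner(Inner automaton) { this.automaton = automaton; }
        long ramBytesUsed() { return SHALLOW + automaton.ramBytesUsed(); }
    }

    /** Stand-in for CompiledAutomaton: 'automaton' aliases 'runner.automaton'. */
    static final class Compiled {
        static final long SHALLOW = 48; // made-up shallow size
        final Runner runner;
        final Inner automaton;

        Compiled(Inner inner) {
            this.runner = new Runner(inner);
            this.automaton = runner.automaton; // same reference, as in the PR
        }

        /** Counts the shared Inner directly and again through the runner. */
        long ramBytesUsedDoubleCounted() {
            return SHALLOW + automaton.ramBytesUsed() + runner.ramBytesUsed();
        }

        /** Counts the shared Inner once, through the runner only. */
        long ramBytesUsedFixed() {
            return SHALLOW + runner.ramBytesUsed();
        }
    }

    static long doubleCounted(long innerBytes) {
        return new Compiled(new Inner(innerBytes)).ramBytesUsedDoubleCounted();
    }

    static long fixed(long innerBytes) {
        return new Compiled(new Inner(innerBytes)).ramBytesUsedFixed();
    }

    public static void main(String[] args) {
        Compiled c = new Compiled(new Inner(1_000));
        System.out.println(c.automaton == c.runner.automaton); // true: one instance, two references
        System.out.println(c.ramBytesUsedDoubleCounted());     // 2080 = 48 + 1000 + (32 + 1000)
        System.out.println(c.ramBytesUsedFixed());             // 1080 = 48 + (32 + 1000)
    }
}
```

In this model the over-count equals exactly the size of the shared inner object, which is why the relative error grows as the automaton comes to dominate the total.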

import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.automaton.*;

Automaton dfa = Operations.determinize(
    WildcardQuery.toAutomaton(new Term("f", "*" + "x".repeat(3000) + "*")),
    Integer.MAX_VALUE);
CompiledAutomaton ca = new CompiledAutomaton(dfa, false, true, false);

System.out.println(ca.ramBytesUsed());
// Before: 8_361_506   (over-reports by 2.15 MB, ratio ~1.35)
// After:  6_214_953   (matches retained heap within 159 bytes)

Cross-checked against org.openjdk.jol.info.GraphLayout.parseInstance(ca).totalSize(). In the table below, "reported" is CompiledAutomaton.ramBytesUsed() and "retained" is JOL's GraphLayout.totalSize(); all values are in bytes.

| Pattern | Reported before (bytes) | Reported after (bytes) | Retained (bytes) |
|---|---:|---:|---:|
| *foo* | 13,634 | 10,593 | 10,752 |
| *foo*bar*baz* | 53,586 | 42,921 | 43,080 |
| *a*b*c*d*e*f*g* | 79,282 | 61,577 | 61,736 |
| * + x×1000 + * | 2,789,994 | 2,073,945 | 2,074,104 |
| * + x×3000 + * | 8,361,506 | 6,214,953 | 6,215,112 |
| *a*b*…*j* (depth 10) | 154,522 | 121,457 | 121,616 |
| *a*b*…*t* (depth 20) | 677,562 | 548,497 | 548,656 |
| .*foo.*bar.* | 34,258 | 27,161 | 27,320 |

After the fix, ramBytesUsed() comes in a constant 159 bytes below the retained heap measured by JOL in every case.
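
The arithmetic behind that summary can be rechecked with a short sketch. The rows are copied from the table above; the class and method names (`RamBytesTableCheck`, `offsetAfterFix`, `overReportRatio`) are hypothetical helpers written for this check and are not part of Lucene or JOL.

```java
// Rechecks the table arithmetic; helper names are illustrative, not from Lucene or JOL.
class RamBytesTableCheck {

    // Each row: {reported before, reported after, retained (JOL)}, in bytes.
    static final long[][] ROWS = {
        {13_634, 10_593, 10_752},
        {53_586, 42_921, 43_080},
        {79_282, 61_577, 61_736},
        {2_789_994, 2_073_945, 2_074_104},
        {8_361_506, 6_214_953, 6_215_112},
        {154_522, 121_457, 121_616},
        {677_562, 548_497, 548_656},
        {34_258, 27_161, 27_320},
    };

    /** Retained heap minus the value reported after the fix. */
    static long offsetAfterFix(long[] row) {
        return row[2] - row[1];
    }

    /** Value reported before the fix divided by retained heap. */
    static double overReportRatio(long[] row) {
        return (double) row[0] / row[2];
    }

    public static void main(String[] args) {
        for (long[] row : ROWS) {
            System.out.printf("before/retained = %.3f, retained - after = %d bytes%n",
                overReportRatio(row), offsetAfterFix(row));
        }
    }
}
```

Every row yields the same `retained - after` difference of 159 bytes, so the remaining gap is a fixed offset and does not scale with automaton size.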
