Migrate the Python file parser from recursive AST traversal to pre-compiled tree-sitter queries, add object pooling, and batch-parse all files into a lookup table to eliminate redundant work.

Changes:
- file_parser.go: Replace recursive parse() with a query-based approach using pre-compiled tree-sitter queries (init-time), sync.Pool for parser and cursor reuse, and TYPE_CHECKING block detection.
- parser.go: Add parseAllToLUT() for concurrent batch parsing with errgroup limited to NumCPU, and parseFromLUT() to process pre-parsed results without re-parsing.
- generate.go: Parse all .py files once upfront into a LUT before target generation. Replace per-target parser.parse() calls with parseFromLUT() lookups. Replace per-call regexp.MustCompile in isDjangoTestFile with strings.Contains.

Profiled on production codebase (cpu pprof):

| Metric      | Before | After  | Improvement |
|-------------|--------|--------|-------------|
| Wall time   | 46.72s | 13.25s | 3.5x faster |
| CPU samples | 53.04s | 31.69s | 40% less    |
| cgocall     | 28.13s | 23.68s | 16% less    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR modernizes and accelerates Python dependency parsing in the Gazelle extension by switching from recursive AST walking to pre-compiled tree-sitter queries, pooling parser/cursor objects, and parsing all Python files once into a shared lookup table (LUT) that is reused during rule generation.
Changes:
- Add concurrent batch parsing (`parseAllToLUT`) and LUT-based consumption (`parseFromLUT`) to avoid repeated parsing across multiple targets.
- Rewrite `FileParser.Parse` to use pre-compiled tree-sitter queries plus pooling for parsers/cursors, including TYPE_CHECKING-block handling.
- Update rule generation to pre-parse all `.py` files into a LUT and replace regex-based Django test detection with substring checks.
Reviewed changes
Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/parser.go | Introduces LUT-based concurrent parsing and LUT consumption APIs. |
| python/generate.go | Parses all Python files once into a LUT; switches per-target parsing to LUT lookups; simplifies Django marker detection. |
| python/file_parser.go | Replaces recursive traversal with query-based extraction and adds pooling + TYPE_CHECKING handling. |
| patches/go-tree-sitter.diff | Removes the previously maintained patch file for go-tree-sitter vendoring/patching. |
| go.mod | Bumps Go version and updates dependency versions (Gazelle, rules_go, x/sync, etc.). |
| go.sum | Updates module checksums to match the new dependency set. |
| MODULE.bazel | Updates Bazel module deps and removes custom http_archive patching for go-tree-sitter, relying on go_deps instead. |
| MODULE.bazel.lock | Regenerates lockfile reflecting updated Bazel deps and transitive graph. |
Comments suppressed due to low confidence (2)
python/generate.go:479
- After the scan loop, `scanner.Err()` isn't checked. If scanning fails (including `bufio.ErrTooLong` for long lines), this will silently return false and can misclassify Django tests. Consider checking `scanner.Err()` and handling/reporting the error consistently with the rest of this function.
```go
scanner := bufio.NewScanner(file)
for scanner.Scan() {
	line := scanner.Text()
	if strings.Contains(line, "pytest.mark.django_db") {
		return true
	}
	if strings.Contains(line, "gazelle: django_test") {
		return true
	}
	if strings.Contains(line, "django.test") && strings.Contains(line, "TestCase") {
		return true
	}
}
return false
```
python/generate.go:464
`log.Fatalf` already terminates the process, so the subsequent `panic(err)` is unreachable and can be removed for clarity.
```go
file, err := os.Open(path)
if err != nil {
	log.Fatalf("ERROR: %v\n", err)
	panic(err)
}
```
```go
if child.Type() == "if_statement" &&
	child.ChildCount() > 1 &&
	child.Child(1).Type() == sitterNodeTypeComparisonOperator &&
	child.Child(1).Child(1).Type() == "==" {
	statement := child.Child(1)
	a, b := statement.Child(0), statement.Child(2)
	// convert "'__main__' == __name__" to "__name__ == '__main__'"
	if b.Type() == sitterNodeTypeIdentifier {
		a, b = b, a
	}
	if a.Type() == sitterNodeTypeIdentifier && a.Content(p.code) == "__name__" &&
		// at github.com/smacker/go-tree-sitter@latest (after v0.0.0-20240422154435-0628b34cbf9c we used)
		// "__main__" is the second child of b. But now, it isn't.
		// we cannot use the latest go-tree-sitter because of the top level reference in scanner.c.
		// https://github.com/smacker/go-tree-sitter/blob/04d6b33fe138a98075210f5b770482ded024dc0f/python/scanner.c#L1
		b.Type() == sitterNodeTypeString && string(p.code[b.StartByte()+1:b.EndByte()-1]) == "__main__" {
		return true
```
`parseMain` indexes into the comparison node (`child.Child(1).Child(1)`, `statement.Child(2)`) without verifying the child counts, which can panic on unexpected/invalid AST shapes (e.g., syntax errors or different comparison forms). Add defensive `ChildCount()` checks (and nil checks where applicable) before accessing those children.
```diff
 } else if node.Type() == sitterNodeTypeImportFromStatement {
 	from := node.Child(1).Content(p.code)
 	if strings.HasPrefix(from, ".") {
-		return true
+		return
 	}
 	for j := 3; j < int(node.ChildCount()); j++ {
 		m, ok := parseImportStatement(node.Child(j), p.code)
```
`parseImportStatements` assumes `import_from_statement` always has at least 2 children and starts imported names at index 3. On malformed/incomplete `from ... import ...` statements this can result in `node.Child(1)` being nil/out-of-bounds and panicking. Consider guarding with `ChildCount()` checks and returning early when the expected children aren't present.
```diff
@@ -112,7 +160,7 @@ func parseImportStatement(node *sitter.Node, code []byte) (module, bool) {
 	return module{}, false
 }
 
-func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
+func (p *FileParser) parseImportStatements(node *sitter.Node) {
 	if node.Type() == sitterNodeTypeImportStatement {
 		for j := 1; j < int(node.ChildCount()); j++ {
 			m, ok := parseImportStatement(node.Child(j), p.code)
@@ -128,7 +176,7 @@ func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
 	} else if node.Type() == sitterNodeTypeImportFromStatement {
 		from := node.Child(1).Content(p.code)
 		if strings.HasPrefix(from, ".") {
-			return true
+			return
 		}
 		for j := 3; j < int(node.ChildCount()); j++ {
 			m, ok := parseImportStatement(node.Child(j), p.code)
@@ -140,18 +188,7 @@ func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
 			m.Name = fmt.Sprintf("%s.%s", from, m.Name)
 			p.output.Modules = append(p.output.Modules, m)
 		}
-	} else {
-		return false
-	}
-	return true
-}
-
-func (p *FileParser) parseComments(node *sitter.Node) bool {
-	if node.Type() == sitterNodeTypeComment {
-		p.output.Comments = append(p.output.Comments, comment(node.Content(p.code)))
-		return true
-	}
-	return false
-}
 
 func (p *FileParser) SetCodeAndFile(code []byte, relPackagePath, filename string) {
@@ -160,34 +197,68 @@ func (p *FileParser) SetCodeAndFile(code []byte, relPackagePath, filename string
 	p.output.FileName = filename
 }
 
-func (p *FileParser) parse(ctx context.Context, node *sitter.Node) {
-	if node == nil {
-		return
+// Parse uses pre-compiled tree-sitter queries to extract imports and comments
+// from the parsed AST, replacing the previous recursive traversal approach.
+func (p *FileParser) Parse(ctx context.Context) (*ParserOutput, error) {
+	rootNode, err := parseCode(p.code)
+	if err != nil {
+		return nil, err
 	}
-	for i := 0; i < int(node.ChildCount()); i++ {
+
+	p.output.HasMain = p.parseMain(rootNode)
+
+	cursor := cursorPool.Get().(*sitter.QueryCursor)
+	defer cursorPool.Put(cursor)
+
+	cursor.Exec(importQuery, rootNode)
+
+	seenImports := make(map[uint32]bool)
+
+	for {
 		if err := ctx.Err(); err != nil {
-			return
-		}
-		child := node.Child(i)
-		if p.parseImportStatements(child) {
-			continue
+			return nil, err
 		}
-		if p.parseComments(child) {
-			continue
+		match, ok := cursor.NextMatch()
+		if !ok {
+			break
 		}
-		p.parse(ctx, child)
-	}
-}
-
-func (p *FileParser) Parse(ctx context.Context) (*ParserOutput, error) {
-	rootNode, err := ParseCode(p.code)
-	if err != nil {
-		return nil, err
-	}
+		for _, capture := range match.Captures {
+			captureName := importQuery.CaptureNameForId(capture.Index)
+			node := capture.Node
+
+			switch captureName {
+			case "import", "import_from":
+				if seenImports[node.StartByte()] {
+					continue
+				}
+				seenImports[node.StartByte()] = true
+				p.parseImportStatements(node)
+
+			case "comment":
+				p.output.Comments = append(p.output.Comments, comment(node.Content(p.code)))
-	p.output.HasMain = p.parseMain(ctx, rootNode)
+
+			case "if_stmt":
+				if p.isTypeCheckingBlock(node) {
+					for j := 0; j < int(node.ChildCount()); j++ {
+						subChild := node.Child(j)
+						if subChild.Type() == "block" {
+							for k := 0; k < int(subChild.ChildCount()); k++ {
+								stmt := subChild.Child(k)
+								if stmt.Type() == sitterNodeTypeImportStatement || stmt.Type() == sitterNodeTypeImportFromStatement {
+									if !seenImports[stmt.StartByte()] {
+										seenImports[stmt.StartByte()] = true
+										p.parseImportStatements(stmt)
+									}
+								}
+							}
+						}
+					}
+				}
```
New TYPE_CHECKING handling (`isTypeCheckingBlock` + the `if_stmt` capture path) isn't covered by tests. Adding unit tests for `if TYPE_CHECKING:` and `if typing.TYPE_CHECKING:` blocks (including nested imports) would help ensure the intended dependency behavior and prevent regressions.
Optimize Python parser: 3.5x speedup (46.7s → 13.3s)
Migrate the Python file parser from recursive AST traversal to
pre-compiled tree-sitter queries, add object pooling, and batch-parse
all files into a lookup table to eliminate redundant work.
Changes:
- file_parser.go: Replace recursive parse() with a query-based approach using pre-compiled tree-sitter queries (init-time), sync.Pool for parser and cursor reuse, and TYPE_CHECKING block detection.
- parser.go: Add parseAllToLUT() for concurrent batch parsing with errgroup limited to NumCPU, and parseFromLUT() to process pre-parsed results without re-parsing.
- generate.go: Parse all .py files once upfront into a LUT before target generation. Replace per-target parser.parse() calls with parseFromLUT() lookups. Replace per-call regexp.MustCompile in isDjangoTestFile with strings.Contains.

Profiled on production codebase (cpu pprof):

| Metric      | Before | After  | Improvement |
|-------------|--------|--------|-------------|
| Wall time   | 46.72s | 13.25s | 3.5x faster |
| CPU samples | 53.04s | 31.69s | 40% less    |
| cgocall     | 28.13s | 23.68s | 16% less    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>