Migrate the Python file parser from recursive AST traversal to pre-compiled tree-sitter queries, add object pooling, and batch-parse all files into a lookup table to eliminate redundant work.

Changes:
- file_parser.go: Replace recursive parse() with a query-based approach using pre-compiled tree-sitter queries (init-time), sync.Pool for parser and cursor reuse, and TYPE_CHECKING block detection.
- parser.go: Add parseAllToLUT() for concurrent batch parsing with errgroup limited to NumCPU, and parseFromLUT() to process pre-parsed results without re-parsing.
- generate.go: Parse all .py files once upfront into a LUT before target generation. Replace per-target parser.parse() calls with parseFromLUT() lookups. Replace per-call regexp.MustCompile in isDjangoTestFile with strings.Contains.

Profiled on production codebase (cpu pprof):

| Metric      | Before | After  | Improvement |
|-------------|--------|--------|-------------|
| Wall time   | 46.72s | 13.25s | 3.5x faster |
| CPU samples | 53.04s | 31.69s | 40% less    |
| cgocall     | 28.13s | 23.68s | 16% less    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR modernizes and accelerates Python dependency parsing in the Gazelle extension by switching from recursive AST walking to pre-compiled tree-sitter queries, pooling parser/cursor objects, and parsing all Python files once into a shared lookup table (LUT) that is reused during rule generation.
Changes:
- Add concurrent batch parsing (`parseAllToLUT`) and LUT-based consumption (`parseFromLUT`) to avoid repeated parsing across multiple targets.
- Rewrite `FileParser.Parse` to use pre-compiled tree-sitter queries plus pooling for parsers/cursors, including TYPE_CHECKING-block handling.
- Update rule generation to pre-parse all `.py` files into a LUT and replace regex-based Django test detection with substring checks.
Reviewed changes
Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/parser.go | Introduces LUT-based concurrent parsing and LUT consumption APIs. |
| python/generate.go | Parses all Python files once into a LUT; switches per-target parsing to LUT lookups; simplifies Django marker detection. |
| python/file_parser.go | Replaces recursive traversal with query-based extraction and adds pooling + TYPE_CHECKING handling. |
| patches/go-tree-sitter.diff | Removes the previously maintained patch file for go-tree-sitter vendoring/patching. |
| go.mod | Bumps Go version and updates dependency versions (Gazelle, rules_go, x/sync, etc.). |
| go.sum | Updates module checksums to match the new dependency set. |
| MODULE.bazel | Updates Bazel module deps and removes custom http_archive patching for go-tree-sitter, relying on go_deps instead. |
| MODULE.bazel.lock | Regenerates lockfile reflecting updated Bazel deps and transitive graph. |
Comments suppressed due to low confidence (2)
python/generate.go:479
- After the scan loop, `scanner.Err()` isn't checked. If scanning fails (including `bufio.ErrTooLong` for long lines), this will silently return false and can misclassify Django tests. Consider checking `scanner.Err()` and handling/reporting the error consistently with the rest of this function.
```go
scanner := bufio.NewScanner(file)
for scanner.Scan() {
	line := scanner.Text()
	if strings.Contains(line, "pytest.mark.django_db") {
		return true
	}
	if strings.Contains(line, "gazelle: django_test") {
		return true
	}
	if strings.Contains(line, "django.test") && strings.Contains(line, "TestCase") {
		return true
	}
}
return false
```
python/generate.go:464
`log.Fatalf` already terminates the process, so the subsequent `panic(err)` is unreachable and can be removed for clarity.
```go
file, err := os.Open(path)
if err != nil {
	log.Fatalf("ERROR: %v\n", err)
	panic(err)
}
```
```go
if child.Type() == "if_statement" &&
	child.ChildCount() > 1 &&
	child.Child(1).Type() == sitterNodeTypeComparisonOperator &&
	child.Child(1).Child(1).Type() == "==" {
	statement := child.Child(1)
	a, b := statement.Child(0), statement.Child(2)
	// convert "'__main__' == __name__" to "__name__ == '__main__'"
	if b.Type() == sitterNodeTypeIdentifier {
		a, b = b, a
	}
	if a.Type() == sitterNodeTypeIdentifier && a.Content(p.code) == "__name__" &&
		// at github.com/smacker/go-tree-sitter@latest (after v0.0.0-20240422154435-0628b34cbf9c we used)
		// "__main__" is the second child of b. But now, it isn't.
		// we cannot use the latest go-tree-sitter because of the top level reference in scanner.c.
		// https://github.com/smacker/go-tree-sitter/blob/04d6b33fe138a98075210f5b770482ded024dc0f/python/scanner.c#L1
		b.Type() == sitterNodeTypeString && string(p.code[b.StartByte()+1:b.EndByte()-1]) == "__main__" {
		return true
```
`parseMain` indexes into the comparison node (`child.Child(1).Child(1)`, `statement.Child(2)`) without verifying the child counts, which can panic on unexpected/invalid AST shapes (e.g., syntax errors or different comparison forms). Add defensive `ChildCount()` checks (and nil checks where applicable) before accessing those children.
```diff
 } else if node.Type() == sitterNodeTypeImportFromStatement {
 	from := node.Child(1).Content(p.code)
 	if strings.HasPrefix(from, ".") {
-		return true
+		return
 	}
 	for j := 3; j < int(node.ChildCount()); j++ {
 		m, ok := parseImportStatement(node.Child(j), p.code)
```
`parseImportStatements` assumes `import_from_statement` always has at least 2 children and starts imported names at index 3. On malformed/incomplete `from ... import ...` statements this can result in `node.Child(1)` being nil/out-of-bounds and panicking. Consider guarding with `ChildCount()` checks and returning early when the expected children aren't present.
```diff
@@ -112,7 +160,7 @@ func parseImportStatement(node *sitter.Node, code []byte) (module, bool) {
 	return module{}, false
 }
 
-func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
+func (p *FileParser) parseImportStatements(node *sitter.Node) {
 	if node.Type() == sitterNodeTypeImportStatement {
 		for j := 1; j < int(node.ChildCount()); j++ {
 			m, ok := parseImportStatement(node.Child(j), p.code)
@@ -128,7 +176,7 @@ func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
 	} else if node.Type() == sitterNodeTypeImportFromStatement {
 		from := node.Child(1).Content(p.code)
 		if strings.HasPrefix(from, ".") {
-			return true
+			return
 		}
 		for j := 3; j < int(node.ChildCount()); j++ {
 			m, ok := parseImportStatement(node.Child(j), p.code)
@@ -140,18 +188,7 @@ func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
 			m.Name = fmt.Sprintf("%s.%s", from, m.Name)
 			p.output.Modules = append(p.output.Modules, m)
 		}
-	} else {
-		return false
-	}
-	return true
-}
-
-func (p *FileParser) parseComments(node *sitter.Node) bool {
-	if node.Type() == sitterNodeTypeComment {
-		p.output.Comments = append(p.output.Comments, comment(node.Content(p.code)))
-		return true
-	}
-	return false
-}
 
 func (p *FileParser) SetCodeAndFile(code []byte, relPackagePath, filename string) {
@@ -160,34 +197,68 @@ func (p *FileParser) SetCodeAndFile(code []byte, relPackagePath, filename string
 	p.output.FileName = filename
 }
 
-func (p *FileParser) parse(ctx context.Context, node *sitter.Node) {
-	if node == nil {
-		return
+// Parse uses pre-compiled tree-sitter queries to extract imports and comments
+// from the parsed AST, replacing the previous recursive traversal approach.
+func (p *FileParser) Parse(ctx context.Context) (*ParserOutput, error) {
+	rootNode, err := parseCode(p.code)
+	if err != nil {
+		return nil, err
 	}
-	for i := 0; i < int(node.ChildCount()); i++ {
+
+	p.output.HasMain = p.parseMain(rootNode)
+
+	cursor := cursorPool.Get().(*sitter.QueryCursor)
+	defer cursorPool.Put(cursor)
+
+	cursor.Exec(importQuery, rootNode)
+
+	seenImports := make(map[uint32]bool)
+
+	for {
 		if err := ctx.Err(); err != nil {
-			return
-		}
-		child := node.Child(i)
-		if p.parseImportStatements(child) {
-			continue
+			return nil, err
 		}
-		if p.parseComments(child) {
-			continue
+		match, ok := cursor.NextMatch()
+		if !ok {
+			break
 		}
-		p.parse(ctx, child)
-	}
-}
-
-func (p *FileParser) Parse(ctx context.Context) (*ParserOutput, error) {
-	rootNode, err := ParseCode(p.code)
-	if err != nil {
-		return nil, err
-	}
+		for _, capture := range match.Captures {
+			captureName := importQuery.CaptureNameForId(capture.Index)
+			node := capture.Node
+
+			switch captureName {
+			case "import", "import_from":
+				if seenImports[node.StartByte()] {
+					continue
+				}
+				seenImports[node.StartByte()] = true
+				p.parseImportStatements(node)
+
+			case "comment":
+				p.output.Comments = append(p.output.Comments, comment(node.Content(p.code)))
-	p.output.HasMain = p.parseMain(ctx, rootNode)
+
+			case "if_stmt":
+				if p.isTypeCheckingBlock(node) {
+					for j := 0; j < int(node.ChildCount()); j++ {
+						subChild := node.Child(j)
+						if subChild.Type() == "block" {
+							for k := 0; k < int(subChild.ChildCount()); k++ {
+								stmt := subChild.Child(k)
+								if stmt.Type() == sitterNodeTypeImportStatement || stmt.Type() == sitterNodeTypeImportFromStatement {
+									if !seenImports[stmt.StartByte()] {
+										seenImports[stmt.StartByte()] = true
+										p.parseImportStatements(stmt)
+									}
+								}
+							}
+						}
+					}
+				}
```
New TYPE_CHECKING handling (`isTypeCheckingBlock` + the `if_stmt` capture path) isn't covered by tests. Adding unit tests for `if TYPE_CHECKING:` and `if typing.TYPE_CHECKING:` blocks (including nested imports) would help ensure the intended dependency behavior and prevent regressions.
Optimize Python parser: 3.5x speedup (46.7s → 13.3s)
Migrate the Python file parser from recursive AST traversal to
pre-compiled tree-sitter queries, add object pooling, and batch-parse
all files into a lookup table to eliminate redundant work.
Changes:
- file_parser.go: Replace recursive parse() with a query-based approach using pre-compiled tree-sitter queries (init-time), sync.Pool for parser and cursor reuse, and TYPE_CHECKING block detection.
- parser.go: Add parseAllToLUT() for concurrent batch parsing with errgroup limited to NumCPU, and parseFromLUT() to process pre-parsed results without re-parsing.
- generate.go: Parse all .py files once upfront into a LUT before target generation. Replace per-target parser.parse() calls with parseFromLUT() lookups. Replace per-call regexp.MustCompile in isDjangoTestFile with strings.Contains.

Profiled on production codebase (cpu pprof):

| Metric      | Before | After  | Improvement |
|-------------|--------|--------|-------------|
| Wall time   | 46.72s | 13.25s | 3.5x faster |
| CPU samples | 53.04s | 31.69s | 40% less    |
| cgocall     | 28.13s | 23.68s | 16% less    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>