Skip to content

Update packages#19

Open
ewianda wants to merge 2 commits intomainfrom
Update-packages
Open

Update packages#19
ewianda wants to merge 2 commits intomainfrom
Update-packages

Conversation

@ewianda
Copy link
Collaborator

@ewianda ewianda commented Feb 20, 2026

Optimize Python parser: 3.5x speedup (46.7s → 13.3s)
Migrate the Python file parser from recursive AST traversal to
pre-compiled tree-sitter queries, add object pooling, and batch-parse
all files into a lookup table to eliminate redundant work.

Changes:

  • file_parser.go: Replace recursive parse() with query-based approach
    using pre-compiled tree-sitter queries (init-time), sync.Pool for
    parser and cursor reuse, and TYPE_CHECKING block detection.
  • parser.go: Add parseAllToLUT() for concurrent batch parsing with
    errgroup limited to NumCPU, and parseFromLUT() to process pre-parsed
    results without re-parsing.
  • generate.go: Parse all .py files once upfront into a LUT before
    target generation. Replace per-target parser.parse() calls with
    parseFromLUT() lookups. Replace per-call regexp.MustCompile in
    isDjangoTestFile with strings.Contains.

Profiled on production codebase (cpu pprof):

Metric Before After Improvement
Wall time 46.72s 13.25s 3.5x faster
CPU samples 53.04s 31.69s 40% less
cgocall 28.13s 23.68s 16% less

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

Migrate the Python file parser from recursive AST traversal to
pre-compiled tree-sitter queries, add object pooling, and batch-parse
all files into a lookup table to eliminate redundant work.

Changes:
- file_parser.go: Replace recursive parse() with query-based approach
  using pre-compiled tree-sitter queries (init-time), sync.Pool for
  parser and cursor reuse, and TYPE_CHECKING block detection.
- parser.go: Add parseAllToLUT() for concurrent batch parsing with
  errgroup limited to NumCPU, and parseFromLUT() to process pre-parsed
  results without re-parsing.
- generate.go: Parse all .py files once upfront into a LUT before
  target generation. Replace per-target parser.parse() calls with
  parseFromLUT() lookups. Replace per-call regexp.MustCompile in
  isDjangoTestFile with strings.Contains.

Profiled on production codebase (cpu pprof):

  | Metric      | Before  | After   | Improvement |
  |-------------|---------|---------|-------------|
  | Wall time   | 46.72s  | 13.25s  | 3.5x faster |
  | CPU samples | 53.04s  | 31.69s  | 40% less    |
  | cgocall     | 28.13s  | 23.68s  | 16% less    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR modernizes and accelerates Python dependency parsing in the Gazelle extension by switching from recursive AST walking to pre-compiled tree-sitter queries, pooling parser/cursor objects, and parsing all Python files once into a shared lookup table (LUT) that is reused during rule generation.

Changes:

  • Add concurrent batch parsing (parseAllToLUT) and LUT-based consumption (parseFromLUT) to avoid repeated parsing across multiple targets.
  • Rewrite FileParser.Parse to use pre-compiled tree-sitter queries plus pooling for parsers/cursors, including TYPE_CHECKING-block handling.
  • Update rule generation to pre-parse all .py files into a LUT and replace regex-based Django test detection with substring checks.

Reviewed changes

Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
python/parser.go Introduces LUT-based concurrent parsing and LUT consumption APIs.
python/generate.go Parses all Python files once into a LUT; switches per-target parsing to LUT lookups; simplifies Django marker detection.
python/file_parser.go Replaces recursive traversal with query-based extraction and adds pooling + TYPE_CHECKING handling.
patches/go-tree-sitter.diff Removes the previously maintained patch file for go-tree-sitter vendoring/patching.
go.mod Bumps Go version and updates dependency versions (Gazelle, rules_go, x/sync, etc.).
go.sum Updates module checksums to match the new dependency set.
MODULE.bazel Updates Bazel module deps and removes custom http_archive patching for go-tree-sitter, relying on go_deps instead.
MODULE.bazel.lock Regenerates lockfile reflecting updated Bazel deps and transitive graph.
Comments suppressed due to low confidence (2)

python/generate.go:479

  • After the scan loop, scanner.Err() isn’t checked. If scanning fails (including bufio.ErrTooLong for long lines), this will silently return false and can misclassify Django tests. Consider checking scanner.Err() and handling/reporting the error consistently with the rest of this function.
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "pytest.mark.django_db") {
			return true
		}
		if strings.Contains(line, "gazelle: django_test") {
			return true
		}
		if strings.Contains(line, "django.test") && strings.Contains(line, "TestCase") {
			return true
		}
	}
	return false

python/generate.go:464

  • log.Fatalf already terminates the process, so the subsequent panic(err) is unreachable and can be removed for clarity.
	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("ERROR: %v\n", err)
		panic(err)
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +108 to 120
if child.Type() == "if_statement" &&
child.ChildCount() > 1 &&
child.Child(1).Type() == sitterNodeTypeComparisonOperator &&
child.Child(1).Child(1).Type() == "==" {
statement := child.Child(1)
a, b := statement.Child(0), statement.Child(2)
// convert "'__main__' == __name__" to "__name__ == '__main__'"
if b.Type() == sitterNodeTypeIdentifier {
a, b = b, a
}
if a.Type() == sitterNodeTypeIdentifier && a.Content(p.code) == "__name__" &&
// at github.com/smacker/go-tree-sitter@latest (after v0.0.0-20240422154435-0628b34cbf9c we used)
// "__main__" is the second child of b. But now, it isn't.
// we cannot use the latest go-tree-sitter because of the top level reference in scanner.c.
// https://github.com/smacker/go-tree-sitter/blob/04d6b33fe138a98075210f5b770482ded024dc0f/python/scanner.c#L1
b.Type() == sitterNodeTypeString && string(p.code[b.StartByte()+1:b.EndByte()-1]) == "__main__" {
return true
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseMain indexes into the comparison node (child.Child(1).Child(1), statement.Child(2)) without verifying the child counts, which can panic on unexpected/invalid AST shapes (e.g., syntax errors or different comparison forms). Add defensive ChildCount() checks (and nil checks where applicable) before accessing those children.

Copilot uses AI. Check for mistakes.
Comment on lines 176 to 182
} else if node.Type() == sitterNodeTypeImportFromStatement {
from := node.Child(1).Content(p.code)
if strings.HasPrefix(from, ".") {
return true
return
}
for j := 3; j < int(node.ChildCount()); j++ {
m, ok := parseImportStatement(node.Child(j), p.code)
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseImportStatements assumes import_from_statement always has at least 2 children and starts imported names at index 3. On malformed/incomplete from ... import ... statements this can result in node.Child(1) being nil/out-of-bounds and panicking. Consider guarding with ChildCount() checks and returning early when the expected children aren’t present.

Copilot uses AI. Check for mistakes.
Comment on lines 127 to 257
@@ -112,7 +160,7 @@ func parseImportStatement(node *sitter.Node, code []byte) (module, bool) {
return module{}, false
}

func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
func (p *FileParser) parseImportStatements(node *sitter.Node) {
if node.Type() == sitterNodeTypeImportStatement {
for j := 1; j < int(node.ChildCount()); j++ {
m, ok := parseImportStatement(node.Child(j), p.code)
@@ -128,7 +176,7 @@ func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
} else if node.Type() == sitterNodeTypeImportFromStatement {
from := node.Child(1).Content(p.code)
if strings.HasPrefix(from, ".") {
return true
return
}
for j := 3; j < int(node.ChildCount()); j++ {
m, ok := parseImportStatement(node.Child(j), p.code)
@@ -140,18 +188,7 @@ func (p *FileParser) parseImportStatements(node *sitter.Node) bool {
m.Name = fmt.Sprintf("%s.%s", from, m.Name)
p.output.Modules = append(p.output.Modules, m)
}
} else {
return false
}
return true
}

func (p *FileParser) parseComments(node *sitter.Node) bool {
if node.Type() == sitterNodeTypeComment {
p.output.Comments = append(p.output.Comments, comment(node.Content(p.code)))
return true
}
return false
}

func (p *FileParser) SetCodeAndFile(code []byte, relPackagePath, filename string) {
@@ -160,34 +197,68 @@ func (p *FileParser) SetCodeAndFile(code []byte, relPackagePath, filename string
p.output.FileName = filename
}

func (p *FileParser) parse(ctx context.Context, node *sitter.Node) {
if node == nil {
return
// Parse uses pre-compiled tree-sitter queries to extract imports and comments
// from the parsed AST, replacing the previous recursive traversal approach.
func (p *FileParser) Parse(ctx context.Context) (*ParserOutput, error) {
rootNode, err := parseCode(p.code)
if err != nil {
return nil, err
}
for i := 0; i < int(node.ChildCount()); i++ {

p.output.HasMain = p.parseMain(rootNode)

cursor := cursorPool.Get().(*sitter.QueryCursor)
defer cursorPool.Put(cursor)

cursor.Exec(importQuery, rootNode)

seenImports := make(map[uint32]bool)

for {
if err := ctx.Err(); err != nil {
return
}
child := node.Child(i)
if p.parseImportStatements(child) {
continue
return nil, err
}
if p.parseComments(child) {
continue
match, ok := cursor.NextMatch()
if !ok {
break
}
p.parse(ctx, child)
}
}

func (p *FileParser) Parse(ctx context.Context) (*ParserOutput, error) {
rootNode, err := ParseCode(p.code)
if err != nil {
return nil, err
}
for _, capture := range match.Captures {
captureName := importQuery.CaptureNameForId(capture.Index)
node := capture.Node

switch captureName {
case "import", "import_from":
if seenImports[node.StartByte()] {
continue
}
seenImports[node.StartByte()] = true
p.parseImportStatements(node)

case "comment":
p.output.Comments = append(p.output.Comments, comment(node.Content(p.code)))

p.output.HasMain = p.parseMain(ctx, rootNode)
case "if_stmt":
if p.isTypeCheckingBlock(node) {
for j := 0; j < int(node.ChildCount()); j++ {
subChild := node.Child(j)
if subChild.Type() == "block" {
for k := 0; k < int(subChild.ChildCount()); k++ {
stmt := subChild.Child(k)
if stmt.Type() == sitterNodeTypeImportStatement || stmt.Type() == sitterNodeTypeImportFromStatement {
if !seenImports[stmt.StartByte()] {
seenImports[stmt.StartByte()] = true
p.parseImportStatements(stmt)
}
}
}
}
}
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New TYPE_CHECKING handling (isTypeCheckingBlock + the if_stmt capture path) isn’t covered by tests. Adding unit tests for if TYPE_CHECKING: and if typing.TYPE_CHECKING: blocks (including nested imports) would help ensure the intended dependency behavior and prevent regressions.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants