Fix: Refactor is_comment_line to use Pygments for improved accuracy #68
Conversation
The `count_non_scoreable_lines` function had a bug where typo pairs (a `-` line followed by a `+` line) were being counted incorrectly. When a typo was detected, the function would:

1. Mark both lines as non-scoreable (+2)
2. Continue to skip the `-` line
3. BUT the next iteration would process the `+` line again as scoreable

This allowed miners to gain score credit for typo corrections that should be filtered out. Fixed by adding a `skip_next` flag to ensure the `+` line of a detected typo pair is not processed in the next iteration.

Impact: Prevents gaming via intentional typo corrections that were previously being scored.

Contribution by Gittensor, learn more at https://gittensor.io/
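The fix described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual subnet code: `is_typo_correction` is a hypothetical helper (here approximated with a similarity ratio), and the real function handles more cases.

```python
import difflib

def is_typo_correction(removed: str, added: str) -> bool:
    # Hypothetical helper: treat near-identical -/+ lines as a typo pair.
    ratio = difflib.SequenceMatcher(None, removed[1:], added[1:]).ratio()
    return ratio > 0.8

def count_non_scoreable_lines(lines: list[str]) -> int:
    non_scoreable = 0
    skip_next = False
    for i, line in enumerate(lines):
        if skip_next:
            # The '+' half of a typo pair was already counted; do not
            # process it again as a scoreable line.
            skip_next = False
            continue
        if (
            line.startswith("-")
            and i + 1 < len(lines)
            and lines[i + 1].startswith("+")
            and is_typo_correction(line, lines[i + 1])
        ):
            non_scoreable += 2  # mark both halves as non-scoreable
            skip_next = True    # the fix: skip the '+' line next iteration
    return non_scoreable
```

Without `skip_next`, the loop would fall through to the `+` line on the next iteration and treat it as ordinary scoreable content, which is exactly the double-counting the fix removes.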
The `calculate_issue_multiplier` function filtered invalid issues (self-created, wrong timing, etc.) out of `pr.issues` to produce `valid_issues`, but then incorrectly used the unfiltered `pr.issues` list for scoring. This allowed miners to game the system by creating invalid issues that would still count toward their multiplier.

Changed lines 190 and 195 to use `valid_issues` instead of `pr.issues`, ensuring only validated issues contribute to the score multiplier.

Contribution by Gittensor, learn more at https://gittensor.io/
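A minimal sketch of the shape of this fix (the function signature, field names, and constants here are assumptions for illustration, not the repository's actual code):

```python
def calculate_issue_multiplier(issues, pr_author, base=1.0, per_issue=0.1):
    # Filter out issues that must not count toward the multiplier
    # (only the self-created check is modeled in this sketch).
    valid_issues = [i for i in issues if i["author"] != pr_author]
    # The bug: the unfiltered `issues` list was used below instead of
    # `valid_issues`, so invalid issues still inflated the multiplier.
    return base + per_issue * len(valid_issues)
```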
* fix: prevent gaming issue multiplier
* fix comment
* small fix
* update subnet repositories
* add SDKv10 branch to bittensor repo
* add sn118 repo
* Fix confusing recycle allocation logic in dynamic emissions

  The original code used a backwards ternary expression that would add 1.0 to recycled emissions when `total_recycled <= 0`, which doesn't make logical sense for an emissions recycling system.

  Original buggy code: `max(total_recycled, 1 if total_recycled <= 0 else 0)`

  This evaluates to:
  - If `total_recycled > 0`: `max(total_recycled, 0)` = `total_recycled` ✅
  - If `total_recycled <= 0`: `max(total_recycled, 1)` = 1 ❌ (adds 1.0 when nothing should be recycled)

  Fixed to simply ensure a non-negative recycled amount: `max(total_recycled, 0.0)`

  Impact: Prevents incorrect allocation of 1.0 emissions to RECYCLE_UID when no emissions should be recycled.

  Contribution by Gittensor, learn more at https://gittensor.io/

* Fix edge case: handle total_original=0 with dynamic recycle bound

  - Add dynamic_recycle_bound: 1 if no earned scores, 0 otherwise
  - Update recycle_percentage to 100% when total_original=0
  - Addresses owner feedback on PR entrius#54

  Contribution by Gittensor, learn more at https://gittensor.io/
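The two expressions compared above can be reproduced directly (wrapped in illustrative helper functions; the surrounding emissions code is not shown here):

```python
def buggy_recycle(total_recycled: float) -> float:
    # Original expression: yields 1.0 when nothing should be recycled.
    return max(total_recycled, 1 if total_recycled <= 0 else 0)

def fixed_recycle(total_recycled: float) -> float:
    # Fixed expression: simply clamp to a non-negative amount.
    return max(total_recycled, 0.0)
```

For any positive input the two agree; the divergence only appears at `total_recycled <= 0`, where the buggy version invents 1.0 of recycled emissions.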
* Removed # in some languages
* Removed unused languages and comment
* Update constants.py
* Update constants.py
* Implemented PR success ratio multipliers to apply to score. (% merged / total attempted PRs)
* Updated test file contributions to only get 10% of earned score. Down from 25%
* Updated storage functionality with new multipliers / fields
* Storage functionality touchups for MinerEvaluation datamodels
* Dec 4 1PM EST time cushion before closed PRs impact success ratio score
* Added edge case safeguard for when no closed PRs exist
* Ensured merged count only begins after 12/4 1pm EST
* date fix update on typo - better logging
* added protection threshold before ratio takes place
* kimbo like 10 - we do 10
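The success-ratio multiplier with its edge-case safeguard and protection threshold might look like the following sketch (the function name, formula, and default threshold of 10 are assumptions inferred from the commit messages above, not the exact subnet values):

```python
def success_ratio_multiplier(merged: int, closed: int,
                             protection_threshold: int = 10) -> float:
    total = merged + closed
    if total == 0:
        return 1.0  # edge case safeguard: no closed or merged PRs yet
    if total < protection_threshold:
        return 1.0  # protection threshold before the ratio applies
    return merged / total  # % merged out of total attempted PRs
```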
* Require issues to be closed within 2 days of PR merge date + 2 issues scored per PR max
* 1D, a more appropriate issue resolution time
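A rough sketch of the resulting validity check (the data shapes and the helper name `scoreable_issues` are illustrative; the constants follow the commit messages above, with the window tightened to 1 day):

```python
from datetime import datetime, timedelta

MAX_SCORED_ISSUES_PER_PR = 2      # "2 issues scored per PR max"
CLOSE_WINDOW = timedelta(days=1)  # "1D" resolution window

def scoreable_issues(issues, pr_merged_at):
    # Keep only issues closed within the window around the PR merge
    # date, capped at the per-PR maximum.
    in_window = [
        i for i in issues
        if i["closed_at"] is not None
        and abs(i["closed_at"] - pr_merged_at) <= CLOSE_WINDOW
    ]
    return in_window[:MAX_SCORED_ISSUES_PER_PR]
```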
@langverse2023 Close this PR, otherwise will report. You're stealing my effort.
```python
# Test cases
if __name__ == "__main__":
    # Test 1: Comments only
    """
    /*
     * Hello world
     */
    """
    test_string = """
    * Hello world
    """
    print(f"Rust comments only: {is_comment_line(test_string, '.rs')}")

    # Test 2: Code
    test_string = """
    #include <stdio.h>
    """
    print(f"C program code: {is_comment_line(test_string, '.c')}")

    # Test 3: multi-line comment
    """
    '''
    This is a comment
    '''
    """
    test_string = '''
    This is a comment
    '''
    print(f"Multi-line comment: {is_comment_line(test_string, '.py')}")
```

I don't think the Pygments library works better than the current solution, since we check line by line.
```python
from pygments import lex
from pygments.lexers import get_lexer_for_filename, TextLexer
from pygments.token import Comment, String
from pygments.util import ClassNotFound
```
You need to include pygments as a dependency
Thanks for the feedback! You are absolutely right that checking line-by-line is problematic for multi-line comments (e.g., Pygments might interpret an isolated `* Hello world` as a multiplication or pointer instead of a comment).

@LandynDev could u check my pr

I like the idea of this PR, it will take a couple days to fully vet it and test, expect a full review then

resolve conflicts, I'm looking into this PR now finally
| # 2. Determine the appropriate lexer | ||
| try: | ||
| filename = f"dummy{file_extension}" if file_extension else "dummy.txt" |
double check the dummy filename construction, I don't believe `file_extension` includes the leading `.`
also, why not just use the actual filename? seems strange to just make up a dummy filename
Agreed. I updated the function signature to pass the full filename directly. It's much cleaner now.
@anderdc could u check my pr

@LandynDev, @anderdc would you review my PR?
This PR replaces the previous regex-based, line-by-line comment detection with a more robust solution using the `Pygments` library.

The Problem:
The previous approach analyzed each line in isolation. This caused issues with multi-line comments (e.g., C-style `/* ... */`), where middle lines starting with `*` were often invalidly interpreted as multiplication operators or code syntax.

The Solution:
I have introduced a new helper function `get_comment_line_indices` that preserves the context of the code block.

Implementation Details:
- Strip the diff markers (`+`, `-`) from the patch to reconstruct the continuous source text.
- Run the reconstructed text through the `Pygments` lexer. This allows the lexer to correctly identify multi-line comment blocks (like `""" docstrings """` or `/* block comments */`).
- Collect the comment tokens from `Pygments` and map them back to the original line numbers.

Key Changes:
- Removed: `is_comment_line` (regex-based).
- Added: `get_comment_line_indices` (Pygments-based).
- Updated `count_non_scoreable_lines` to pre-calculate comment indices before iterating.

Fixes #64
Improves #56

Contribution by Gittensor, learn more at https://gittensor.io/
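The approach described in this PR can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual implementation: the real `get_comment_line_indices` may differ in signature and in how it maps tokens to patch lines.

```python
from pygments.lexers import get_lexer_for_filename, TextLexer
from pygments.token import Comment
from pygments.util import ClassNotFound

def get_comment_line_indices(source: str, filename: str) -> set[int]:
    """Return the 0-based line numbers of `source` that are comments."""
    try:
        lexer = get_lexer_for_filename(filename)
    except ClassNotFound:
        lexer = TextLexer()  # fall back to plain text for unknown types
    comment_lines = set()
    line_no = 0
    for token_type, value in lexer.get_tokens(source):
        if token_type in Comment:
            # A comment token can span several lines (e.g. /* ... */);
            # record every line the token has content on.
            for offset, chunk in enumerate(value.split("\n")):
                if chunk.strip():
                    comment_lines.add(line_no + offset)
        line_no += value.count("\n")
    return comment_lines
```

Because the lexer sees the whole reconstructed source at once, the middle lines of a block comment are tokenized as `Comment.Multiline` rather than being misread as `*` operators, which is the failure mode of the line-by-line regex approach.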