fix: sort CVE records correctly#2020
Conversation
Reviewer's GuideImplement numeric-aware sorting for CVE identifiers by introducing a normalized SQL sort key (id_sort_key) and updating the sort translator to route 'id' sorts through it, and add tests to verify correct ascending and descending ordering. Entity relationship diagram for CVE ID sorting keyerDiagram
VULNERABILITY {
id TEXT
id_sort_key TEXT
}
VULNERABILITY ||--o{ PAGINATED_RESULTS : contains
VULNERABILITY ||--o{ COLUMNS : uses
COLUMNS {
id_sort_key TEXT
}
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
a5150d6 to
07293c7
Compare
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- Extract the CASE WHEN regex padding logic into a named constant or helper to improve readability and avoid inline SQL clutter.
- Consider persisting the computed id_sort_key as a computed (or materialized) column and indexing it to avoid expensive regex/substring processing on each query.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the CASE WHEN regex padding logic into a named constant or helper to improve readability and avoid inline SQL clutter.
- Consider persisting the computed id_sort_key as a computed (or materialized) column and indexing it to avoid expensive regex/substring processing on each query.
## Individual Comments
### Comment 1
<location> `modules/fundamental/src/vulnerability/service/test.rs:545-540` </location>
<code_context>
+async fn vulnerability_numeric_sorting(ctx: &TrustifyContext) -> Result<(), anyhow::Error> {
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for edge cases such as malformed or non-standard CVE IDs.
Including malformed CVE IDs in tests will help verify that the sort key logic handles unexpected formats correctly.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
The change looks good. I'm just not sure it is the right approach. Yes, it makes it more convenient for CVE IDs. However, there are a lot of OSV sources which use a similar format:
Now the user would see CVE IDs sorted differently than those. And that would be hard to explain and understand. If we can change this to a way that we split this into components and then sort each part as ASCII or numeric (if it's numeric only), I think this could work. |
Thank you. I wasn't aware of those. I think we could certainly generalize those patterns. Out of those three examples, MAL and PSF do seem to follow the same pattern as CVE. RUSTSEC, if I'm reading the spec correctly, always requires 4 digits in the sequence sections, thus For my own notes, the different sources are listed here. Interestingly, some sources follow a slightly different pattern: https://github.com/AlmaLinux/osv-database/tree/master/advisories/almalinux10 Let me explore a way to generalize this. Do you have any performance concerns with this approach? We could introduce a new column that stores the computed sort ID but maybe that's a premature performance improvement right now. |
I always have concerns. 😬 And especially for performance. But we do have scale tests, which can be triggered using |
|
Pushed a change to make the sorting handle different vulnerability IDs. It looks like I broke some tests, I'll have a look at those next, but wanted to share the hmm... creative solution. |
1328ffb to
5ae3648
Compare
The tests were broken. I just didn't have the expected locale set on my local system. If we want the approach of using expressions at query time, I believe the changes here achieve that. It would be great for someone with access to approve running the workflows and maybe run |
I think the concern mentioned in #2020 (comment) is valid and should be addressed. We also do need some performance tests. |
|
This is a change that I think will be very beneficial to RHTPA. But I think we need to revisit the mechanics of the solution to ensure it is performant and develop some scale tests for it. So I've created this ticket TC-3602 to bring this work to a successful conclusion. |
|
@lcarva - Luiz do you think you'll be able to find some time to address the PR feedback? |
5ae3648 to
0b3c66f
Compare
|
Changes have been pushed to address the feedback and are now ready for review 🙏 |
0b3c66f to
80f6499
Compare
ctron
left a comment
There was a problem hiding this comment.
Thanks for following up! Graphql was removed, maybe the PR needs to be rebased. However, I suspect a rebase conflict with the migration once the group PR is (finally) merged. So maybe wait until that.
I like the idea of that materialized column. That might increase storage, but should sort the performance issue.
I guess the test could benefit a bit from DRY-ing it up.
80f6499 to
28d8be6
Compare
|
@ctron, I believe I addressed all the comments. |
|
/scale-test |
|
🛠️ Scale test has started! Follow the progress here: Workflow Run |
ctron
left a comment
There was a problem hiding this comment.
LGTM. Let's not forget to check scale tests before merging.
|
Looks like it needs to be re-based and have |
Goose ReportGoose Attack ReportPlan Overview
Request Metrics
Response Time Metrics
Status Code Metrics
Transaction Metrics
Scenario Metrics
Error Metrics
📄 Full Report (Go to "Artifacts" and download report) |
Vulnerability IDs (e.g., CVE-2024-12345) contain numeric segments that should be sorted numerically rather than lexicographically. This change adds a generated database column that stores a normalized sort key where numeric segments are zero-padded to 19 digits (the max defined in the CVE ID spec), enabling proper numeric ordering. Implementation uses a PostgreSQL STORED generated column with an index, which provides: - Automatic computation and maintenance of sort keys for all rows - Efficient indexed sorting without runtime overhead - Single source of truth for the normalization logic fixes guacsec#1811 Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
|
I'm scheduling to merge this anyway. The scale tests are still not helpful to asses the impact. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2020 +/- ##
==========================================
+ Coverage 68.36% 68.41% +0.04%
==========================================
Files 422 423 +1
Lines 24681 24698 +17
Branches 24681 24698 +17
==========================================
+ Hits 16874 16896 +22
+ Misses 6887 6877 -10
- Partials 920 925 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
CVE records follow a specific format where the last segment represents a numerical sequence. To properly sort CVE records, we must treat this sequence segment differently than the rest of the record ID.
fixes #1811
Summary by Sourcery
Implement proper numeric sorting for CVE identifiers by introducing a normalized sort key and updating the sorting translator to use it, ensuring correct ascending and descending order across different ID prefixes.
Enhancements:
id_sort_keySQL expression to pad the numeric segment of CVE IDs for accurate numeric sorting.idsort operations to use the newid_sort_keywhen sorting vulnerabilities.Tests:
vulnerability_numeric_sortingintegration test to verify correct ascending and descending ordering for CVE, GHSA, and custom IDs.