feat: support appstream data keywords and categories for FTS #204

Open
m1rm wants to merge 10 commits into main from feat/integrate-appstream-data


m1rm commented Apr 12, 2026

TL;DR

Integrates Arch Linux AppStream metadata into the Go app: adds keywords and categories columns on package, filled from the upstream Components-x86_64.xml.gz on sources.archlinux.org and exposed to FTS5 search with tuned BM25 weights. Adds an update-appstream CLI command and a just update-appstream recipe, which is also run as part of just update-data.

Motivation

Improve package search (including German-oriented terms) using AppStream keywords and categories data.
Keep the implementation streaming and aligned with existing update jobs (pacmandb-style callback parsing).

Data source & versioning

Behaviour

What is indexed

  • Keywords: text from <keyword> elements only (not name/summary/description).
  • Categories: text from <category> elements only (not pacman groups).
  • Language:
    • blocks without xml:lang (neutral),
    • en / de (including BCP47 prefixes like de-DE),
    • and the same rules on individual <keyword> / <category> elements when present.
  • Stopwords: English and German closed-class words are stripped in dedupeWords before storage (dedupe is case-insensitive).

Database & search

  • Migrations add keywords and categories on package, extend package_fts with matching columns, rebuild after changes.
  • update-packages upserts do not overwrite keywords / categories (same pattern as popularity).
  • Search uses BM25 with higher weight on name/description than on keywords / categories to limit dilution from AppStream text.

Operations

  • go run . update-appstream (requires DATABASE; optional APPSTREAM_SOURCES_BASE).
  • just update-appstream; just update-data runs update-appstream after update-packages.

Testing

  • Unit tests for XML parsing (keywords, categories, xml:lang on blocks and elements), keywordLangAccepted, stopwords, dedupeWords.
  • Tests updated for new FTS column list; just test / just lint pass.

Follow-ups (optional)

  • If we choose to keep both keywords and categories, we could merge the migrations into a single migration.
  • Surface categories in the package detail UI if desired.
  • Revisit BM25 weights after production metrics.

m1rm self-assigned this Apr 12, 2026
m1rm marked this pull request as draft April 12, 2026 13:25
m1rm changed the title from "feat: initial implementation for appstream data fetcher in Golang" to "feat: support appstream data keywords and categories for FTS" Apr 12, 2026
m1rm marked this pull request as ready for review April 12, 2026 14:26
m1rm added 2 commits April 12, 2026 16:35
…est into main exists and pushes are added to branch with open PR

piggyback: improve ci setup; account for duplicate runs

tryout to fix duplicate ci runs the other way
m1rm force-pushed the feat/integrate-appstream-data branch from 01c46f3 to 9157312 April 12, 2026 14:36