@xmo-odoo
Using curl (instead of wget) together with streaming decompressors and processors lets us create the final files directly, without intermediate artefacts that then need cleaning up, so do that wherever possible.
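For illustration only, here is the same "no intermediate artefact" idea sketched in Python rather than as a shell pipeline. It assumes a gzip-compressed source with a single gzip member; `stream_gunzip` is a hypothetical helper name, not something from this PR.

```python
import zlib
from typing import BinaryIO, Iterator


def stream_gunzip(src: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield decompressed chunks from a gzip stream without writing to disk.

    Assumption: the input holds a single gzip member.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    while chunk := src.read(chunk_size):
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail
```

Any binary file-like object works as `src`, including the response object returned by `urllib.request.urlopen`, so the compressed download never has to be saved first.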

Update `kym_scrape` so it no longer needs `requests`, write the phrases directly to stdout, and move the update tracking to stderr, so that it can also be used in a "streaming" manner. The log is no longer really useful, as stdout essentially provides the same tracking (just deduplicated).
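A minimal sketch of the stdout/stderr split described above, not the actual `kym_scrape` code: `emit_phrases` is a hypothetical name, the pages are passed in as plain iterables of strings, and deduplicating on stdout is an assumption made for the example.

```python
import sys
from typing import Iterable, Optional, TextIO


def emit_phrases(
    pages: Iterable[Iterable[str]],
    out: Optional[TextIO] = None,
    err: Optional[TextIO] = None,
) -> int:
    """Write each new phrase to `out` and per-page progress to `err`."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    seen = set()
    for number, phrases in enumerate(pages, start=1):
        # Progress goes to stderr so it never pollutes the data stream.
        print(f"page {number}", file=err)
        for phrase in phrases:
            if phrase not in seen:
                seen.add(phrase)
                print(phrase, file=out)
    return len(seen)
```

With this split the data can be piped or redirected to a file while progress remains visible in the terminal.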

POI can't be made streaming in a portable manner: *some* variants of unzip apparently accept an archive on stdin but not all, KSH-style process substitution opens a named pipe which generic unzip also dislikes, and ZSH's `=` variant works but is ZSH-specific (the lack of feedback is also a bit disturbing: it has to download the entire content into a temporary file before it can hand it out, and the POI dataset is quite large). However, we can use the `-p` feature to avoid the intermediate `allCountries.txt`.
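As a rough Python analogue of extracting a member straight to a pipe, the sketch below reads one archive member line by line without ever writing it to disk. `stream_member` is a hypothetical helper, and UTF-8 is assumed for the member's encoding. It also shows why the archive itself is hard to stream: `zipfile` needs a seekable file object, because a zip's central directory sits at the end of the archive.

```python
import io
import zipfile
from typing import BinaryIO, Iterator


def stream_member(archive: BinaryIO, member: str) -> Iterator[str]:
    """Yield the lines of one zip member without extracting it to disk."""
    # `archive` must be seekable: the zip central directory is read from
    # the end of the file before any member can be located.
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            # Assumption: the member is UTF-8 encoded text.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                yield line.rstrip("\n")
```

The archive still has to be fully available (on disk or in memory), but the extracted text is consumed as it is decompressed.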

xmo-odoo commented Jan 16, 2026

Note: if you're using a tool which supports PEP 723 (pdm, uv, ...), it might make sense to put `requests` back in (probably with a `Session`, to make use of persistent connections) or to switch to `httpx` for HTTP/2 (which KYM appears to support), and maybe add `tqdm` or something like it for progress reporting.

Also, I'm not sure whether this changed since the script was created, but KYM does not really seem to have a page 0: it serves the same content as page 1. I think KYM just interprets it as no page being specified, which is why it's not a 404, so it might make sense to start at 1.
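To make both suggestions concrete, a hedged sketch of what a PEP 723 script header plus a session-based page loop starting at 1 could look like. `fetch_pages` and the `url_template` parameter are hypothetical, the Python version bound is arbitrary, and the real KYM URL scheme is deliberately left out.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "requests",
# ]
# ///
from typing import Iterator


def fetch_pages(session, url_template: str, first: int, last: int) -> Iterator[str]:
    """Yield the body of each page, reusing one session for all requests.

    `session` is expected to behave like `requests.Session`.
    """
    for page in range(first, last + 1):
        response = session.get(url_template.format(page=page), timeout=30)
        response.raise_for_status()
        yield response.text


# Possible usage (the URL template is a placeholder, not KYM's actual one):
#
#     import requests
#     with requests.Session() as session:
#         for html in fetch_pages(session, "https://example.invalid/page/{page}", 1, 3):
#             ...
```

Passing `first=1` skips the apparent page 0 alias, and a PEP 723-aware runner installs `requests` from the header before executing the script.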
