Skip to content

dags: Refactor fetch_images into a helper module#56

Merged
awilfox merged 1 commit intomainfrom
awilfox/AP-670-image-fetcher-refactor
May 6, 2026
Merged

dags: Refactor fetch_images into a helper module#56
awilfox merged 1 commit intomainfrom
awilfox/AP-670-image-fetcher-refactor

Conversation

@awilfox
Copy link
Copy Markdown
Member

@awilfox awilfox commented Apr 27, 2026

We can reuse this code in many different ways by making it a fully parameterised class instead of embedded in the DAG.

This also gives us a lot of flexibility in how we use it in the future.

Ref: AP-670


I still need to write tests for this, but I wanted to make sure the overall API I wrote was actually good and not just a diphenhydramine-induced fever dream.

Copy link
Copy Markdown
Member Author

@awilfox awilfox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reviewed to point out questions I had during implementation.

Comment thread mokelumne/dags/fetch_images.py Outdated
Comment thread mokelumne/util/image_fetcher.py Outdated
Comment thread mokelumne/util/image_fetcher.py
@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from 224c5da to b8c9da8 Compare April 28, 2026 16:28
@awilfox
Copy link
Copy Markdown
Member Author

awilfox commented Apr 28, 2026

v2:

  • correct calculation with max width/height (ensure it is floating point, and use the present value instead of the original value, since it may already be scaled for byte size)

Copy link
Copy Markdown
Member

@anarchivist anarchivist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. all my questions/comments are minor and nonblocking. r+

Comment thread mokelumne/dags/fetch_images.py
@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from b8c9da8 to 7e58d79 Compare April 28, 2026 16:57
@awilfox
Copy link
Copy Markdown
Member Author

awilfox commented Apr 28, 2026

v3:

  • make md_cache a private attribute of the class instead of a public attribute.

Need to write tests, then this should be good to go.

@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from 7e58d79 to 530c5da Compare April 28, 2026 19:13
@awilfox
Copy link
Copy Markdown
Member Author

awilfox commented Apr 28, 2026

v4:

  • Figured out that I could use the metadata fetch method on the image fetcher class in the DAG as well. This de-couples it from the inner TindClient, and also means we're able to re-use the cache so we only make one call instead of two!

Comment thread mokelumne/dags/fetch_images.py Outdated
Comment thread mokelumne/dags/fetch_images.py Outdated
Comment thread mokelumne/util/image_fetcher.py
Comment thread mokelumne/util/image_fetcher.py
Comment thread mokelumne/util/image_fetcher.py Outdated
Comment thread mokelumne/util/image_fetcher.py Outdated
Comment thread mokelumne/util/image_fetcher.py
if self.size_transform:
curr_size = self.size_transform(curr_size)

if self.max_size and curr_size > self.max_size:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image size should scale roughly with the square of the change in side length, i.e. if you double the width you should naively expect the image size to quadruple. Ditto for shrinking: cutting the height of the image in half (while maintaining the aspect ratio) should reduce the image size to ~1/4 of the original (naively, ignoring resampling / compression). Obviously the reality is different but that would put a lower bound on the change.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should, but the IIIF endpoint sends a quality=2 (i.e. 100%) JPEG, so it balloons it quite heavily even with smaller dimensions than the original. A shame we cannot tweak that further.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way to do this without naive calculations such as this would be to download a full-size version from IIIF so that we actually know the maximal byte size. But we would be doing that just to throw it away, using a lot of bandwidth and time in the process.

Happy to consider other ways to do this and discuss it further, but not sure it's worth it at this stage. Up to you.

@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from 530c5da to da53224 Compare April 28, 2026 19:57
@awilfox
Copy link
Copy Markdown
Member Author

awilfox commented Apr 28, 2026

v5:

  • Don't pass the 0 index from the fetch_image_to_record_directory task because it is already the default.
  • Add get_file_metadata method to the FetchTind object so we can use it. Make get_first_file_metadata call it.
  • Use functools @cache instead of manually implementing it.
  • Only scale once, using the smallest of the potential factors.
  • Simplify calculations in the second size check.
  • Give another 5% cushion when re-downloading for over-sized result from IIIF API.

Copy link
Copy Markdown
Contributor

@jason-raitz jason-raitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ Good work!

@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from da53224 to a2a0dc4 Compare May 1, 2026 00:49
@awilfox
Copy link
Copy Markdown
Member Author

awilfox commented May 1, 2026

v6:

  • Add unit tests.

@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from a2a0dc4 to 95cdb20 Compare May 1, 2026 16:28
@awilfox
Copy link
Copy Markdown
Member Author

awilfox commented May 1, 2026

v7:

  • Fix small test gaffe.
  • Rebase on main.

We can reuse this code in many different ways by making it a fully
parameterised class instead of embedded in the DAG.

This also gives us a lot of flexibility in how we use it in the future.

Ref: AP-670
@awilfox awilfox force-pushed the awilfox/AP-670-image-fetcher-refactor branch from 95cdb20 to 0df33df Compare May 1, 2026 16:34
Copy link
Copy Markdown
Contributor

@jason-raitz jason-raitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ I love how much simpler this makes the fetch_images DAG!
I'm going to have to rework some of #677 after this gets merged.

@awilfox awilfox merged commit 0df33df into main May 6, 2026
5 checks passed
@awilfox awilfox deleted the awilfox/AP-670-image-fetcher-refactor branch May 6, 2026 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants