Add documentation, especially warc2zim doc about rewriting

benoit74 · benoit74 · commit c0f4e19a0771 · 2024-10-21T09:44:05.000Z
diff --git a/README.md b/README.md
@@ -22,6 +22,8 @@ Example usage:
 zimscraperlib>=1.1,<1.2
 ```
 
+See [functional architecture](docs/functional_architecture.md), [software architecture](docs/software_architecture.md) and [technical architecture](docs/technical_architecture.md) for more details on scraperlib (not all aspects are covered yet, this is a WIP).
+
 # Dependencies
 
 * libmagic
diff --git a/docs/functional_architecture.md b/docs/functional_architecture.md
@@ -0,0 +1,92 @@
+# Functional Architecture
+
+## Enrich libzim functions
+
+zimscraperlib has primitives to enrich libzim functions with some operations which are known to be shared across scrapers. See `zim` module.
+
+## Handle videos
+
+zimscraperlib has primitives to manipulate videos with some operations which are known to be shared across scrapers. See `video` module.
+
+## Handle pictures
+
+zimscraperlib has primitives to manipulate pictures with some operations which are known to be shared across scrapers. See `image` module.
+
+## Store and rewrite mostly unmodified HTML, CSS and JS from online website
+
+zimscraperlib also contains primitives to rewrite HTML, CSS and JS fetched online so that they operate properly within a ZIM without heavy modifications. While originally developed for warc2zim, some of these primitives are now also used by the mindtouch scraper and others might follow, so they are shared in zimscraperlib. See `rewriting` module.
+
+## ZIM storage
+
+While storing web resources in a ZIM is mostly straightforward (we just transfer the raw bytes, after some modification for URL rewriting if needed), the decision of the path where the resource will be stored is very important.
+
+This is purely conventional, even if ZIM specification has to be respected for proper operation in readers.
+
+This function is responsible for computing the ZIM path where a given web resource is going to be stored.
+
+While the URL is the only driver of this computation for now, zimscraperlib might have to consider other contextual data in the future. E.g. the resource to serve might be dynamic, depending not only on URL query parameters but also on header values.
+
+## Fuzzy rules
+
+Unfortunately, it is not always possible / desirable to store the resource with a simple transformation.
+
+A typical situation is that some query parameters are dynamically computed by JavaScript code to include a user tracking identifier, current datetime information, ...
+
+When the same JavaScript code runs again inside the ZIM, the URL will hence be slightly different because the context has changed, but the same content needs to be retrieved.
+
+zimscraperlib hence relies on fuzzy rules to transform/simplify some URLs when computing the ZIM path.
+
+## URL Rewriting
+
+zimscraperlib transforms (rewrites) URLs found in documents (HTML, CSS, JS, ...) so that they are usable inside the ZIM.
+
+### General case
+
+One simple example is that we might have the following code in an HTML document to load an image with an absolute URL:
+
+```
+  <img src="https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"></img>
+```
+
+The URL `https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` has to be transformed to a URL that is usable inside the ZIM.
+
+For proper reader operation, openZIM prohibits using absolute URLs, so this has to be a relative URL. This relative URL is hence dependent on the location of the resource currently being rewritten.
+
+The table below gives some examples of what the rewritten URL is going to be, depending on the URL of the rewritten document.
+
+| HTML document URL | image URL rewritten for usage inside the ZIM |
+|--|--|
+| `https://en.wikipedia.org/wiki/Kiwix` | `./File:Kiwix_logo_v3.svg` |
+| `https://en.wikipedia.org/wiki` | `./wiki/File:Kiwix_logo_v3.svg` |
+| `https://en.wikipedia.org/waka/Kiwix` | `../wiki/File:Kiwix_logo_v3.svg` |
+| `https://fr.wikipedia.org/wiki/Kiwix` | `../../en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` |
+
+As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored.
+
+### Dynamic case
+
+The explanation above more or less assumed that the transformations can be done statically, i.e. zimscraperlib can open every known document, find existing URLs and replace them with their counterpart inside the ZIM.
+
+While this is possible for HTML and CSS documents typically, it is not possible when the URL is dynamically computed. This is typically the case for JS documents, where in the general case the URL is not statically stored inside the JS code but computed on-the-fly by aggregating various strings and values.
+
+Rewriting these computations is not deemed feasible due to the huge variety of situations which might be encountered.
+
+A specific function is hence needed to rewrite URLs **live in the client browser**: it intercepts any function triggering a web request and transforms the URL according to conventions (where we expect the resource to be located in the general case) and fuzzy rules.
+
+_Spoiler: this is where we will rely on wombat.js from webrecorder team, since this dynamic interception is quite complex and already done quite neatly by them_
+
+### Fuzzy rules
+
+The same fuzzy rules that have been used to compute the ZIM path from a resource URL have to be applied again when rewriting URLs.
+
+While this is expected to serve mostly for the dynamic case, we still apply them on both sides (statically and dynamically) for consistency.
+
+## Documents rewritten statically
+
+For now zimscraperlib rewrites HTML, CSS and JS documents. For CSS and JS, this mainly consists in replacing URLs. For HTML, some more specific rewriting is also necessary (e.g. to handle base href or redirects with meta).
+
+No domain specific (DS) rules are applied like it is done in wabac.js because these rules are already applied in Browsertrix Crawler. For the same reason, JSON is not rewritten anymore (URLs do not need to be rewritten in JSON because these URLs will be used by JS, intercepted by wombat and dynamically rewritten).
+
+JSONP callbacks are supposed to be rewritten but this has not been heavily tested.
+
+Other types of documents are supposed to be either not feasible / not worth it (e.g. URLs inside PDF documents), meaningless (e.g. images, fonts) or planned for later due to limited usage in the wild (e.g. XML).
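
The table in the "General case" section of `docs/functional_architecture.md` can be reproduced with a few lines of Python. This is only a minimal sketch, not zimscraperlib's implementation: `zim_path` and `relative_url` are hypothetical helpers, the ZIM path is assumed to be simply hostname plus path, and fuzzy rules and encoding are ignored.

```python
import posixpath
from urllib.parse import urlsplit


def zim_path(url: str) -> str:
    """Hypothetical helper: drop the scheme and keep hostname + path."""
    parts = urlsplit(url)
    return f"{parts.hostname}{parts.path}"


def relative_url(document_url: str, target_url: str) -> str:
    """Hypothetical helper: path to the target, relative to the document."""
    document_dir = posixpath.dirname(zim_path(document_url))
    relative = posixpath.relpath(zim_path(target_url), start=document_dir)
    # Prefix same-directory and child paths with "./", as in the table.
    return relative if relative.startswith("../") else f"./{relative}"


image = "https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"
for document in (
    "https://en.wikipedia.org/wiki/Kiwix",
    "https://en.wikipedia.org/wiki",
    "https://en.wikipedia.org/waka/Kiwix",
    "https://fr.wikipedia.org/wiki/Kiwix",
):
    print(document, "->", relative_url(document, image))
```

A side benefit of the `./` prefix is that a first path segment containing a colon, such as `File:Kiwix_logo_v3.svg`, cannot be mistaken for a URL scheme.
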
diff --git a/docs/software_architecture.md b/docs/software_architecture.md
@@ -0,0 +1,25 @@
+# Software architecture
+
+## HTML rewriting
+
+HTML rewriting is purely static (i.e. before resources are written to the ZIM). HTML code is parsed with the [HTML parser from Python standard library](https://docs.python.org/3/library/html.parser.html).
+
+A small header script is inserted in HTML code to initialize wombat.js which will wrap all JS APIs to dynamically rewrite URLs coming from JS.
+
+This header script is generated using a [Jinja2](https://pypi.org/project/Jinja2/) template since it needs to populate some JS context variables needed by wombat.js operations (original scheme, original url, ...).
+
+## CSS rewriting
+
+CSS rewriting is purely static (i.e. before resources are written to the ZIM). CSS code is parsed with the [tinycss2 Python library](https://pypi.org/project/tinycss2/).
+
+## JS rewriting
+
+### Static
+
+Static JS rewriting is simply a matter of pure textual manipulation with regular expressions. No parsing is done at all.
+
+### Dynamic
+
+Dynamic JS rewriting is done with [wombat JS library](https://github.com/webrecorder/wombat). The same fuzzy rules that are used for static rewriting are injected into wombat configuration. Code to rewrite URLs is an adapted version of the code used to compute ZIM paths.
+
+For wombat setup, including the URL rewriting part, we need to pass wombat configuration info. This code is developed in the `javascript` folder. For URL parsing, it relies on the [uri-js library](https://www.npmjs.com/package/uri-js). This javascript code is bundled into a single `wombatSetup.js` file with [rollup bundler](https://rollupjs.org), the same bundler used by webrecorder team to bundle wombat.
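
The fuzzy rules shared between static and dynamic rewriting can be pictured as ordered regular-expression substitutions applied to a ZIM path. The sketch below is an illustration under assumptions, not the project's code: the `(pattern, replacement)` rule shape, the `example.com` rule itself and the `apply_fuzzy_rules` helper are all hypothetical and do not reflect the actual contents of `rules/rules.yaml`.

```python
import re

# Hypothetical rule set: each rule is a (pattern, replacement) pair.
# Illustrative only; these are NOT the rules shipped in rules/rules.yaml.
FUZZY_RULES = [
    # Drop a trailing cache-busting parameter such as "?_=1729503845000".
    (re.compile(r"^(?P<base>example\.com/api/.*?)[?&]_=\d+$"), r"\g<base>"),
]


def apply_fuzzy_rules(zim_path: str) -> str:
    """Return the path simplified by the first matching rule, else unchanged."""
    for pattern, replacement in FUZZY_RULES:
        if pattern.match(zim_path):
            return pattern.sub(replacement, zim_path)
    return zim_path


print(apply_fuzzy_rules("example.com/api/items?_=1729503845000"))
print(apply_fuzzy_rules("kiwix.org/index.html"))
```

Because the same function is applied when storing a resource and when rewriting a URL that points to it, two URLs differing only by the dynamic parameter resolve to the same entry.
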
diff --git a/docs/technical_architecture.md b/docs/technical_architecture.md
@@ -0,0 +1,52 @@
+# Technical architecture
+
+## Fuzzy rules
+
+Fuzzy rules are stored in `rules/rules.yaml`. This configuration file is then used by `rules/generateRules.py` to generate Python and JS code.
+
+Should you update these fuzzy rules, you hence have to:
+- regenerate Python and JS files by running `python rules/generateRules.py`
+- bundle the JavaScript `wombatSetup.js` again (see below).
+
+## Wombat configuration
+
+Wombat configuration contains some static configuration and the dynamic URL rewriting, including fuzzy rules.
+
+It is bundled by rollup with `cd javascript && yarn build-prod` and the result is pushed to the proper scraper location for inclusion at build time.
+
+Tests are available and run with `cd javascript && yarn test`.
+
+## Transformation of URL into ZIM path
+
+Transforming a URL into a ZIM path has to respect the ZIM specification: path must not be url-encoded (i.e. it must be decoded) and it must be stored as UTF-8.
+
+A WARC record stores the item's URL inside a header named "WARC-Target-URI". The value inside this header is encoded, or more precisely it is "exactly what the browser sent at the HTTP level" (see https://github.com/webrecorder/browsertrix-crawler/issues/492 for more details).
+
+It has been decided (by convention) that we will drop the scheme, the port, the username and password from the URL. Headers are also not considered in this computation.
+
+Computation of the ZIM path is hence mostly straightforward:
+- decode the hostname which is puny-encoded
+- decode the path and query parameters which might be url-encoded
+
+## URL rewriting
+
+In addition to the computation of the relative path from the current document URL to the URL to rewrite, URL rewriting also consists in computing the proper ZIM path (with same operation as above) and properly encoding it so that the resulting URL respects [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986). Two points are worth noting about this encoding.
+
+- since the original hostname is now part of the path, it will now be url-encoded
+- since the `?` and following query parameters are also part of the path (we do not want readers to drop them like kiwix-serve would do), they are also url-encoded
+
+Below is an example case of the rewrite operation on an image URL found in an HTML document.
+
+- Document original URL: `https://kiwix.org/a/article/document.html`
+- Document ZIM path: `kiwix.org/a/article/document.html`
+- Image original URL: `//xn--exmple-cva.com/a/resource/image.png?foo=bar`
+- Image rewritten URL: `../../../ex%C3%A9mple.com/a/resource/image.png%3Ffoo%3Dbar`
+- Image ZIM Path: `exémple.com/a/resource/image.png?foo=bar`
+
+## JS Rewriting
+
+JS Rewriting is a bit special because the rules to apply are different whether we are using "classic" JavaScript or "module" JavaScript.
+
+Detection of JavaScript modules starts at the HTML level where we have a `<script type="module" src="...">` tag. This tells us that the file at the `src` location is a JavaScript module. From there we know that its subresources are also JavaScript modules.
+
+Currently this detection is done on-the-fly, based on the fact that WARC items are processed in the same order that they have been fetched by the browser, and we hence do not need a multi-pass approach. Meaning that HTML will be processed first, then parent JS, then its dependencies, ... **This is a strong assumption**.
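
The example given in the "URL rewriting" section of `docs/technical_architecture.md` can be worked through in Python. This is a simplified sketch under stated assumptions, not the library's code: `zim_path` and `rewrite_url` are hypothetical helpers, fuzzy rules are skipped, the hostname is assumed to be ASCII (puny-encoded), and the standard library `idna` codec stands in for whatever decoding zimscraperlib actually uses.

```python
import posixpath
from urllib.parse import quote, unquote, urljoin, urlsplit


def zim_path(url: str) -> str:
    """Hypothetical helper: scheme, port and credentials are dropped."""
    parts = urlsplit(url)
    # The hostname is puny-encoded in the URL; the ZIM path stores it decoded.
    hostname = (parts.hostname or "").encode("ascii").decode("idna")
    # Path and query might be url-encoded; the ZIM path stores them decoded.
    path = unquote(parts.path)
    if parts.query:
        path += "?" + unquote(parts.query)
    return hostname + path


def rewrite_url(document_url: str, url: str) -> str:
    """Hypothetical helper: relative, percent-encoded URL for use in the ZIM."""
    absolute = urljoin(document_url, url)  # resolves "//host/..." references
    document_dir = posixpath.dirname(zim_path(document_url))
    relative = posixpath.relpath(zim_path(absolute), start=document_dir)
    # "?" and "=" now belong to the path, so they are percent-encoded as well.
    return quote(relative, safe="/")


document = "https://kiwix.org/a/article/document.html"
image = "//xn--exmple-cva.com/a/resource/image.png?foo=bar"
print(zim_path(urljoin(document, image)))
print(rewrite_url(document, image))
```

The two printed values correspond to the "Image ZIM Path" and "Image rewritten URL" entries of the example.
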
diff --git a/openzim.toml b/openzim.toml
@@ -8,15 +8,3 @@ execute_after=[
 action="get_file"
 source="https://cdn.jsdelivr.net/npm/@webrecorder/wombat@3.8.2/dist/wombat.js"
 target_file="wombat.js"
-
-# wombatSetup.js is supposed to be built locally from files in javascript folder.
-# Should someone not have proper skills / tooling / knowledge, or simply install from
-# sdist / Github repo directly, without any advanced knowledge of this specificity, the
-# configuration below ensures that wombatSetup.js is downloaded from dev.kiwix.org,
-# where we have the latest version from `main` branch. wheel contains the wombatSetup.js
-# which was built at the same time than the wheel. (reminder: get_file action does not
-# overwrite a file which already exists)
-[files.assets.actions."wombatSetup.js"]
-action="get_file"
-source="https://dev.kiwix.org/zimscraperlib/wombatSetup.js"
-target_file="wombatSetup.js"
diff --git a/pyproject.toml b/pyproject.toml
@@ -91,6 +91,13 @@ artifacts = [
   "tests/rewriting/test_fuzzy_rules.py",
 ]
 
+[tool.hatch.build.targets.sdist]
+include = [
+  "src/zimscraperlib/rewriting/statics/**",
+  "src/zimscraperlib/rewriting/rules.py",
+  "tests/rewriting/test_fuzzy_rules.py",
+]
+
 [tool.hatch.envs.default]
 features = ["dev"]
 
diff --git a/src/zimscraperlib/rewriting/statics/README.md b/src/zimscraperlib/rewriting/statics/README.md
@@ -0,0 +1,8 @@
+This folder must contain two files which are not under Git version control:
+- wombat.js, a webrecorder software
+- wombatSetup.js, a custom configuration script for wombat.js, which is built in this
+project from files in the javascript folder
+
+If you install zimscraperlib from sdist or wheel, we've pre-packaged these files for 
+convenience and also so that your version of wombatSetup.js is "aligned" (i.e. if you
+install zimscraperlib x.y.z, we are sure which version of wombatSetup.js you have).