|
| 1 | +# Functional Architecture |
| 2 | + |
| 3 | +## Enrich libzim functions |
| 4 | + |
| 5 | +zimscraperlib has primitives to enrich libzim functions with some operations which are known to be shared across scrapers. See `zim` module. |
| 6 | + |
| 7 | +## Handle videos |
| 8 | + |
| 9 | +zimscraperlib has primitives to manipulate videos with some operations which are known to be shared across scrapers. See `video` module. |
| 10 | + |
| 11 | +## Handle pictures |
| 12 | + |
| 13 | +zimscraperlib has primitives to manipulate pictures with some operations which are known to be shared across scrapers. See `image` module. |
| 14 | + |
| 15 | +## Store and rewrite mostly unmodified HTML, CSS and JS from online website |
| 16 | + |
| 17 | +zimscraperlib also contains primitives to rewrite HTML, CSS and JS fetched online so that they operate properly within a ZIM without heavy modifications. While originally developed for warc2zim, some of these primitives are now also used by the mindtouch scraper, and others might follow, so they are shared in zimscraperlib. See `rewriting` module. |
| 18 | + |
| 19 | +## ZIM storage |
| 20 | + |
| 21 | +While storing web resources in a ZIM is mostly straightforward (we just transfer the raw bytes, after some modification for URL rewriting if needed), the decision of the path where the resource will be stored is very important. |
| 22 | + |
| 23 | +The choice of path is purely conventional, even if the ZIM specification has to be respected for proper operation in readers. |
| 24 | + |
| 25 | +This function is responsible for computing the ZIM path where a given web resource is going to be stored. |
| 26 | + |
| 27 | +While the URL is the only driver of this computation for now, zimscraperlib might have to consider other contextual data in the future. E.g. the resource to serve might be dynamic, depending not only on URL query parameters but also on header values. |
| 28 | + |
| 29 | +## Fuzzy rules |
| 30 | + |
| 31 | +Unfortunately, it is not always possible / desirable to store the resource with a simple transformation. |
| 32 | + |
| 33 | +A typical situation is that some query parameters are dynamically computed by JavaScript code to include a user tracking identifier, current datetime information, ... |
| 34 | + |
| 35 | +When the same JavaScript code runs again inside the ZIM, the URL will hence be slightly different because the context has changed, but the same content needs to be retrieved. |
| 36 | + |
| 37 | +zimscraperlib hence relies on fuzzy rules to transform/simplify some URLs when computing the ZIM path. |
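
To make the idea concrete, here is a minimal sketch of how a fuzzy rule could simplify a path. The rule format (a regex paired with a replacement), the `FUZZY_RULES` list, the `apply_fuzzy_rules` helper and the `example.com` URL are all hypothetical and chosen for illustration only; they are not the actual zimscraperlib API or rule set.

```python
import re

# Hypothetical rule set (illustration only): keep the stable `id` parameter
# and drop everything after it (timestamps, tracking identifiers, ...).
FUZZY_RULES = [
    (re.compile(r"^(example\.com/api/video\?id=[^&]+).*$"), r"\1"),
]


def apply_fuzzy_rules(path: str) -> str:
    """Return the simplified path for the first matching rule, else the path unchanged."""
    for pattern, replacement in FUZZY_RULES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return path


# Two requests that differ only by volatile parameters map to the same path.
first = apply_fuzzy_rules("example.com/api/video?id=42&ts=1700000000&uid=abc")
second = apply_fuzzy_rules("example.com/api/video?id=42&ts=1800000000&uid=xyz")
```

With such a rule, both requests resolve to the same stored entry, which is the property fuzzy rules are meant to guarantee.
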
| 38 | + |
| 39 | +## URL Rewriting |
| 40 | + |
| 41 | +zimscraperlib transforms (rewrites) URLs found in documents (HTML, CSS, JS, ...) so that they are usable inside the ZIM. |
| 42 | + |
| 43 | +### General case |
| 44 | + |
| 45 | +As a simple example, an HTML document might contain the following code to load an image with an absolute URL: |
| 46 | + |
| 47 | +``` |
| 48 | + <img src="https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"></img> |
| 49 | +``` |
| 50 | + |
| 51 | +The URL `https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` has to be transformed to a URL that is usable inside the ZIM. |
| 52 | + |
| 53 | +For proper reader operation, openZIM prohibits using absolute URLs, so this has to be a relative URL. This relative URL hence depends on the location of the resource currently being rewritten. |
| 54 | + |
| 55 | +The table below gives some examples of what the rewritten URL is going to be, depending on the URL of the rewritten document. |
| 56 | + |
| 57 | +| HTML document URL | image URL rewritten for usage inside the ZIM | |
| 58 | +|--|--| |
| 59 | +| `https://en.wikipedia.org/wiki/Kiwix` | `./File:Kiwix_logo_v3.svg` | |
| 60 | +| `https://en.wikipedia.org/wiki` | `./wiki/File:Kiwix_logo_v3.svg` | |
| 61 | +| `https://en.wikipedia.org/waka/Kiwix` | `../wiki/File:Kiwix_logo_v3.svg` | |
| 62 | +| `https://fr.wikipedia.org/wiki/Kiwix` | `../../en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` | |
| 63 | + |
| 64 | +As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored. |
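
The table above can be reproduced with a short sketch. It assumes a simplified storage convention (ZIM path = hostname followed by the URL path, ignoring query strings and fuzzy rules) and absolute input URLs; the `zim_path` and `relative_zim_url` helpers are hypothetical names, not the zimscraperlib implementation.

```python
import posixpath
from urllib.parse import urlsplit


def zim_path(url: str) -> str:
    """Assumed convention (illustration only): hostname + URL path."""
    parts = urlsplit(url)
    return f"{parts.hostname}{parts.path}"


def relative_zim_url(document_url: str, target_url: str) -> str:
    """Relative URL from the document's location in the ZIM to the target entry."""
    # Prefix with "/" so the computation does not depend on the working directory.
    document_dir = posixpath.dirname("/" + zim_path(document_url))
    relative = posixpath.relpath("/" + zim_path(target_url), start=document_dir)
    # A leading "./" keeps names such as "File:..." from being read as a URL scheme.
    return relative if relative.startswith("../") else f"./{relative}"


IMAGE_URL = "https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"
for document_url in (
    "https://en.wikipedia.org/wiki/Kiwix",
    "https://en.wikipedia.org/wiki",
    "https://en.wikipedia.org/waka/Kiwix",
    "https://fr.wikipedia.org/wiki/Kiwix",
):
    print(document_url, "->", relative_zim_url(document_url, IMAGE_URL))
```

Because the result is derived from both ZIM paths, any change to the storage convention automatically changes the rewritten URLs as well.
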
| 65 | + |
| 66 | +### Dynamic case |
| 67 | + |
| 68 | +The explanation above more or less assumed that the transformations can be done statically, i.e. zimscraperlib can open every known document, find existing URLs and replace them with their counterpart inside the ZIM. |
| 69 | + |
| 70 | +While this is possible for HTML and CSS documents typically, it is not possible when the URL is dynamically computed. This is typically the case for JS documents, where in the general case the URL is not statically stored inside the JS code but computed on-the-fly by aggregating various strings and values. |
| 71 | + |
| 72 | +Rewriting these computations is not deemed feasible due to the huge variety of situations which might be encountered. |
| 73 | + |
| 74 | +A specific function is hence needed to rewrite URLs **live in the client browser**: it intercepts any function triggering a web request and transforms the URL according to the conventions (where we expect the resource to be located in the general case) and the fuzzy rules. |
| 75 | + |
| 76 | +_Spoiler: this is where we will rely on wombat.js from the Webrecorder team, since this dynamic interception is quite complex and already done quite neatly by them._ |
| 77 | + |
| 78 | +### Fuzzy rules |
| 79 | + |
| 80 | +The same fuzzy rules that have been used to compute the ZIM path from a resource URL have to be applied again when rewriting URLs. |
| 81 | + |
| 82 | +While this is expected to serve mostly for the dynamic case, we still apply them on both sides (statically and dynamically) for consistency. |
| 83 | + |
| 84 | +## Documents rewritten statically |
| 85 | + |
| 86 | +For now zimscraperlib rewrites HTML, CSS and JS documents. For CSS and JS, this mainly consists of replacing URLs. For HTML, more specific rewriting is also necessary (e.g. to handle base href or redirects with meta). |
| 87 | + |
| 88 | +No domain specific (DS) rules are applied as is done in wabac.js, because these rules are already applied in Browsertrix Crawler. For the same reason, JSON is not rewritten anymore (URLs do not need to be rewritten in JSON because these URLs will be used by JS, intercepted by wombat and dynamically rewritten). |
| 89 | + |
| 90 | +JSONP callbacks are supposed to be rewritten but this has not been heavily tested. |
| 91 | + |
| 92 | +Other types of documents are supposed to be either not feasible / not worth it (e.g. URLs inside PDF documents), meaningless (e.g. images, fonts) or planned for later due to limited usage in the wild (e.g. XML). |