
Cached Data

Motivation

Certain operations will take a lot of time to complete. When the user makes a change to a parameter midway through a reduction chain, we do not want to recalculate the whole thing unless necessary.

Therefore, every result from a module may be cached, in the following way:

fingerprinting results

Each module output can be "fingerprinted" with an SHA1 key that uniquely identifies a particular configuration of that module. The inputs to the hash function will be:

  • an ordered list of the fingerprints (hash values) of the inputs to the module
  • an ordered list of the parameters to the module (always in the same order)
  • the source code of the module
  • the id of the output terminal (1, 2, etc.)

If all of the above items remain the same, the output remains the same, and so does the generated fingerprint.
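
As a rough sketch (in Python, matching the hashlib/inspect timings shown below), the fingerprint could be built by feeding those four items into SHA1 in a fixed order. All names here are illustrative, not the project's actual API:

    import hashlib
    import inspect

    def fingerprint_output(input_fingerprints, parameters, module_func, terminal_id):
        h = hashlib.sha1()
        # ordered list of the fingerprints of the inputs to the module
        for fp in input_fingerprints:
            h.update(fp.encode())
        # parameters, serialized in a fixed order
        h.update(repr(parameters).encode())
        # source code of the module
        h.update(inspect.getsource(module_func).encode())
        # id of the output terminal (1, 2, etc.)
        h.update(str(terminal_id).encode())
        return h.hexdigest()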

needed functions

Refresh fingerprints

Any time you want to use a particular output, you want to calculate the fingerprint for that output given the current state of the reduction chain. Therefore, starting at the beginning of the chain, the Refresh function will calculate the fingerprints of the first-level modules' outputs (these modules have no inputs, only parameters), and will then plug these into the ordered list of input hashes of all the "child" modules. Note: this function can closely follow the form of the existing Calculate function, so that it traverses the tree in exactly the same way.
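
A minimal sketch of that traversal, assuming the modules are already in topological order (sources first) and reusing the hypothetical fingerprint_output helper above; the .inputs, .parameters, .func and .n_outputs attributes are assumptions, not the real module interface:

    def refresh_fingerprints(modules):
        fingerprints = {}
        for module in modules:
            # fingerprints of this module's inputs, in order
            input_fps = [fingerprints[(src, term)] for src, term in module.inputs]
            for terminal_id in range(1, module.n_outputs + 1):
                fingerprints[(module, terminal_id)] = fingerprint_output(
                    input_fps, module.parameters, module.func, terminal_id)
        return fingerprints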

Check cache for fingerprint (key)

If the fingerprint of a particular output has changed since the last time a calculation was run (due to some change upstream of that output), the cache will not contain any data corresponding to that key. A recalculation of the chain up to that point is necessary.
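
The check itself is just a key lookup; with a plain dict standing in for the cache store:

    def check_cache(cache, fingerprint):
        # Returns None on a miss, i.e. when some upstream change produced a new key.
        return cache.get(fingerprint)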

Recalculate (same as Calculate)

For every module (starting at the beginning of the chain and moving down), we need to calculate a value or grab the cached value. We run "Check cache", and if the fingerprint for the module output needed at the moment exists in the cache, that output is not recalculated; the cached value is used instead. If a valid cached value does not exist, run the calculation and then store the result under the fingerprint key. Then move on to the next module.
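
Putting the pieces together, a sketch of this pass, using the same illustrative module attributes as the Refresh sketch and assuming module.func returns one value per output terminal:

    def recalculate(modules, cache):
        fingerprints = refresh_fingerprints(modules)
        results = {}
        for module in modules:
            inputs = [results[(src, term)] for src, term in module.inputs]
            keys = [fingerprints[(module, t)] for t in range(1, module.n_outputs + 1)]
            cached = [check_cache(cache, key) for key in keys]
            if all(value is not None for value in cached):
                outputs = cached                      # every output already cached
            else:
                outputs = module.func(inputs, module.parameters)
                for key, value in zip(keys, outputs):
                    cache[key] = value                # store under the fingerprint key
            for t, value in enumerate(outputs, start=1):
                results[(module, t)] = value
        return results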

Cleanup

There are some immediately obvious options for this:

  1. at Refresh - if a key changes when the Refresh operation is run, delete the old key and associated value. The consequences are not disastrous if the identical reduction chain is being used simultaneously in multiple projects or by multiple users, but the conveniently cached value is lost to all of them and must be recalculated.
  2. automated global (scheduled) cleanup - on a regular basis, in the middle of the night or something, the server could check all the reductions in existence and remove all key-value pairs that don't match up with the current reduction states.
  3. reference counting - this is hard. Keep track of all references to the same key, and delete the entry as soon as its count drops to zero. Option 2 is like reference counting that is only enforced periodically.
  4. auto-expiry - if a tool like Redis is used, use its built-in data expiration feature to have the value evaporate a specified time after creation (one week? one month?); see the sketch after this list.
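
A hypothetical sketch of option 4 with the redis-py client; the one-week TTL and the use of the raw fingerprint string as the key are assumptions, not decisions:

    import redis

    r = redis.Redis()
    ONE_WEEK = 7 * 24 * 3600          # expiry period is a guess

    def cache_store(fingerprint, value_bytes):
        # ex= sets a time-to-live in seconds; Redis drops the key afterwards
        r.set(fingerprint, value_bytes, ex=ONE_WEEK)

    def cache_fetch(fingerprint):
        return r.get(fingerprint)     # None if expired or never computed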

Advantages of this scheme:

  • Every calculation, at every step of the reduction, can check to see if a cached result already exists
  • Changes to the module code base automatically trigger a recalculation of data
  • No enforcement of data validity is needed beyond running the "Refresh" command to make sure upstream changes to input parameters or topology cause a recalculation.
  • "IsDirty" command can be easily implemented, if desired, by recalculating fingerprints and comparing to existing ones.

Disadvantages:

  • The Refresh command may be slow if we inspect the full source code every time; this needs testing. Possibly the module source can be fingerprinted at import time, rather than at Refresh time. Note: the timings below confirm that re-reading the source dominates the cost:

    In [17]: %timeit hashlib.sha1(inspect.getsourcelines(f1).__str__())
    100 loops, best of 3: 4.61 ms per loop

    In [18]: source = inspect.getsourcelines(f1)

    In [19]: sourcestr = str(source)

    In [20]: %timeit hashlib.sha1(sourcestr)
    100000 loops, best of 3: 9.35 us per loop
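
A sketch of the import-time alternative: hash the source once when the module is loaded and attach the digest, so Refresh only reuses the precomputed value (microseconds) instead of re-reading the source on every pass (milliseconds, per the timings above). The load function is a made-up stand-in for a first-level module:

    import hashlib
    import inspect

    def load(filename):
        # stand-in for a first-level reduction module
        return open(filename, 'rb').read()

    # computed once, at import time, and reused by every Refresh pass
    load.source_fingerprint = hashlib.sha1(
        inspect.getsource(load).encode()).hexdigest()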
