
Conversation

@robinmessage (Collaborator) commented Mar 15, 2024

This has none of the memory-sharing optimisations to make this easier to run in parallel (or possible at all for larger projects). It also has none of the optimisations we talked about to split M into 100 or so subsets and pick from them randomly.

However, it does seem to pick reasonable pairs, and it has a better SMD number than the current version.

It's also worth noting that this code moves to using 100% of K instead of 10% - is that still what we want? It was part of the original motivation for this change.

I'm happy to talk it through with anyone, or to do any further testing you want to suggest to make sure it is robust, before I add the memory-sharing optimisation (which requires a fair bit of restructuring to thread everything through, but shouldn't change the output).
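For concreteness, the subsetting optimisation mentioned above might look something like this (a hypothetical sketch only - m_set and the subset count of 100 are illustrative, not the final design):

import numpy as np

# Hypothetical sketch: partition M into ~100 subsets up front, then serve each
# draw from one randomly chosen subset, so no single draw has to scan all of M.
rng = np.random.default_rng(42)
m_subsets = np.array_split(rng.permutation(len(m_set)), 100)

def pick_candidate_index() -> int:
    subset = m_subsets[rng.integers(len(m_subsets))]
    return int(subset[rng.integers(len(subset))])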

@mdales (Contributor) left a comment

LGTM overall at a first pass.

  k_set = pd.read_parquet(k_parquet_filename)
  k_subset = k_set.sample(
-     frac=0.1,
+     frac=1,

Contributor:

Isn't this just the same as k_subset = k_set?

Collaborator Author:

Yes; I didn't clean this up yet as I wasn't sure if we definitely wanted to change to 100% of K instead of 10%.
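For the record, the two are functionally equivalent for this purpose, though not byte-for-byte identical: sample(frac=1) returns every row but in shuffled order, while plain assignment keeps the original order. A minimal pandas illustration with made-up data:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
shuffled = df.sample(frac=1, random_state=42)  # every row, random order
aliased = df                                   # same frame, original order
print(shuffled["x"].tolist())  # some permutation, e.g. [2, 1, 3]
print(aliased["x"].tolist())   # [1, 2, 3]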


# Find categories in K
hard_match_category_columns = [k[hard_match_columns].to_numpy() for _, k in k_set.iterrows()]
hard_match_categories = {k.tobytes(): k for k in hard_match_category_columns}

Contributor:

I think this needs a comment about what's happening - I had to work through it with real data to figure out the trick being used here to get unique columns. Given the keys are never used again, I'd rather you called values here, rather than in make_s_set_mask, as that would make it a bit more obvious you're using this to find unique sets of columns (assuming I understand what's happening here).

Collaborator Author:

Fair point, I'll tidy this.
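For readers following along, the trick under discussion is roughly this (a tidied sketch, not the exact PR code):

# Each row's hard-match columns become a numpy array, and the raw bytes of
# that array are used as a dict key. Equal arrays have equal bytes, so the
# dict retains exactly one array per unique combination of category values.
hard_match_category_columns = [
    k[hard_match_columns].to_numpy() for _, k in k_set.iterrows()
]
unique_category_combinations = list(
    {arr.tobytes(): arr for arr in hard_match_category_columns}.values()
)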

    return s_include, k_miss

@jit(nopython=True, fastmath=True, error_model="numpy")
def make_s_set_mask_numba(

Contributor:

Very happy to delete this version.

k_subset_dist_hard = np.ascontiguousarray(k_subset[hard_match_columns].to_numpy()).astype(np.int32)

# Methodology 6.5.5: S should be 10 times the size of K, in order to achieve this for every
# pixel in the subsample (which is 10% the size of K) we select 100 pixels.

Contributor:

Comment needs updating
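For what it's worth, once K is used in full the arithmetic changes too: S must still be 10 times the size of K, but the loop now covers 100% of K rather than a 10% subsample, so each pixel needs 10 samples rather than 100. The updated comment might read something like this (hypothetical wording):

# Methodology 6.5.5: S should be 10 times the size of K. We now iterate over
# all of K (no 10% subsample), so we select 10 pixels for each pixel in K.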

            if value >= low[d]:
                queue.append(self.lefts[pos])
        return count

    def members_sample(self, point: np.ndarray, count: int, rng: np.random.Generator):

Contributor:

I have to confess that, due to the lack of comments, I only skim-reviewed this to try to work out what count was achieving, and then gave up. That's fine at the prototype stage, but before we merge this, some comments on the API/algorithm would be useful, as I think this is quite nuanced.

Collaborator Author:

I've added some docstrings and comments; hopefully that covers what is needed, but please do come back to me on anything else.
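As an illustration of the kind of docstring being requested, something along these lines might work (hypothetical wording only - the exact semantics of count are whatever the implementation defines):

def members_sample(self, point: np.ndarray, count: int, rng: np.random.Generator):
    """Randomly draw up to count members of this tree that match point.

    point holds the query pixel's match values, count caps how many
    matches are returned, and rng supplies the randomness so that runs
    are reproducible.
    """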

@robinmessage (Collaborator Author) commented

Thanks for reviewing this @mdales - you're right, it could do with some more comments in the gnarly bits and a general clean-up. I'll do that as soon as I can and bounce it back to you (probably after Easter, unfortunately).

@robinmessage (Collaborator Author) commented

@mdales I think I've fixed the stuff you've reviewed and improved the comments on the other parts.

@mdales (Contributor) left a comment

LGTM, just a couple of things it'd be nice to tidy up.

    random_state=rng
).reset_index()
# TODO: This assumes the methodology is being updated to 100% of K
k_subset = k_set

Contributor:

Can we just collapse this change throughout? Then, when this is merged, we bump the versions of both the code and the methodology.

Collaborator Author:

Will do when I merge.
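Collapsed, the call site would presumably reduce to something like this (a sketch based on the snippet above):

k_set = pd.read_parquet(k_parquet_filename)
# Methodology now uses 100% of K, so no subsampling step is needed.
k_subset = k_set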

Comment on lines +363 to +370
rand = rand_state[0] + rand_state[3]
t = rand_state[1] << 17
rand_state[2] ^= rand_state[0]
rand_state[3] ^= rand_state[1]
rand_state[1] ^= rand_state[2]
rand_state[0] ^= rand_state[3]
rand_state[2] ^= t
rand_state[3] = (rand_state[3] >> 45) | (rand_state[3] << 19)

Contributor:

Sad that we have to do this, but I see it's for performance reasons. Can we at least pull this code out, so that we're not baking this particular random-number generator into the algorithm? The methodology does not require this specific generator; we've just chosen it for performance reasons.

Collaborator Author:

I'm not sure I quite understand: do you mean pull it out into a function (and hope Numba inlines it), with a comment saying this produces a random number but that no specific algorithm is required, this one just being chosen for speed? Or something else? (I suspect we can only do this if Numba does inline it correctly, and since we'd be passing the state around I'm not sure it would be much clearer.)
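For what it's worth, one shape this could take is a small jitted helper marked inline="always", which Numba splices into the caller so the state array is mutated in place with no call overhead. A sketch under those assumptions (the helper name is hypothetical; the body mirrors the PR code above):

from numba import njit

@njit(inline="always")
def _next_rand(rand_state):
    # Advance the xoshiro-style state in place and return the next value.
    # The methodology only needs *a* fast 64-bit generator; this particular
    # one was chosen purely for speed, not for any statistical requirement.
    rand = rand_state[0] + rand_state[3]
    t = rand_state[1] << 17
    rand_state[2] ^= rand_state[0]
    rand_state[3] ^= rand_state[1]
    rand_state[1] ^= rand_state[2]
    rand_state[0] ^= rand_state[3]
    rand_state[2] ^= t
    rand_state[3] = (rand_state[3] >> 45) | (rand_state[3] << 19)
    return rand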
