Add URL filtering #72

Voamorim · 2026-01-12T12:16:00Z

No description provided.

cunha

Main points:

The tests don't match the expected behavior; we need to define the number of domain and path levels beyond which "anonymization" starts
Need tests to cover URL query parameters
Need tests to cover special domains and CC TLDs

cunha · 2026-01-12T16:30:17Z

urls_filter/get-tld-json.py

+        "manager": tld_manager
+    })
+
+with open("data/tlds.json", "w") as jsonfile:


I think we can add this file to the repo to avoid downloading it every time. The CC TLDs at least should not change frequently (if they do change at all).

cunha · 2026-01-12T16:30:39Z

urls_filter/requirements.txt

@@ -0,0 +1,3 @@
+bs4
+requests
+yacryptopan


Please check whether we actually need this; I think we might not.

cunha · 2026-01-12T16:32:11Z

urls_filter/test_filter.py

+        urls = ["http://example.com/path"]
+        result = filter_instance.filter_urls(urls)
+        assert result == urls  


OK, however, we should also check the identity key generated for each URL. In the case of http://example.com/path, the identity key should be something like http://example.com/path.

cunha · 2026-01-12T16:32:39Z

urls_filter/test_filter.py

+        result = filter_instance.filter_urls(urls)
+        expected = [
+            "http://example.com/path1",
+            "http://test.com/home"


We should get path2 here, no? It's a different path.

cunha · 2026-01-12T16:35:37Z

urls_filter/test_filter.py

+        expected = [
+            "http://a.b.example.com/page",
+            "http://b.example.com/other",
+            "http://c.b.example.com/page"


I think we should "anonymize" everything in the identity key after the second level. So the identity keys might be something like:

http://a.b.example.com/page --> http://1w.b.example.com/page
http://b.example.com/other --> http://b.example.com/other
http://a.b.example.com/page --> http://1w.b.example.com/page
http://c.b.example.com/page --> http://1w.b.example.com/page

We should also add other cases:

http://c.b.example.com/other --> http://1w.b.example.com/other # should be included for crawling by the filter
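A minimal sketch of the mapping above, under stated assumptions: the TLD is a single label, we keep the TLD, the domain, and one subdomain label, and any deeper labels collapse into a single `1w` placeholder. The names `identity_key`, `dedupe_by_identity`, and `KEEP_LABELS` are hypothetical and are not the filter's actual API. Note that `hostname` lowercases the host and drops any port.

```python
from urllib.parse import urlsplit, urlunsplit

KEEP_LABELS = 3      # assumption: TLD + domain + one subdomain label are kept
PLACEHOLDER = "1w"   # placeholder taken from the examples in this comment


def identity_key(url: str) -> str:
    """Collapse host labels beyond KEEP_LABELS into a single placeholder."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if len(labels) > KEEP_LABELS:
        labels = [PLACEHOLDER] + labels[-KEEP_LABELS:]
    return urlunsplit((parts.scheme, ".".join(labels), parts.path, parts.query, ""))


def dedupe_by_identity(urls):
    """Keep the first URL seen for each identity key (hypothetical filter rule)."""
    seen, kept = set(), []
    for url in urls:
        key = identity_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

With this rule, `http://c.b.example.com/other` survives the filter because its identity key differs from the `/page` ones, while `http://c.b.example.com/page` collapses into the same key as `http://a.b.example.com/page`.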

cunha · 2026-01-12T16:37:01Z

urls_filter/test_filter.py

+        ]
+        result = filter_instance.filter_urls(urls)
+        expected = [
+            "http://Example.com/Page"


URL paths are case-sensitive, so Page != page != PAGE and we should fetch all pages.
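For illustration, a small sketch of case handling that respects this: host names are case-insensitive, so only the host is lowercased, and the path is left untouched. `normalize_case` is a hypothetical helper, not something already in the PR, and it assumes the URL carries no user info in the network location (lowercasing would alter it).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(url: str) -> str:
    """Lowercase the scheme and host only; path, query, and fragment keep their case."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


urls = ["http://Example.com/Page", "http://example.com/page", "http://EXAMPLE.com/PAGE"]
# All three stay distinct because the paths differ in case.
distinct = {normalize_case(u) for u in urls}
```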

cunha · 2026-01-12T16:38:08Z

urls_filter/test_filter.py

+        ]
+        result = filter_instance.filter_urls(urls)
+        expected = [
+            "http://example.com/page1",


Should have all three entries.

cunha · 2026-01-12T16:39:36Z

urls_filter/test_filter.py

+        expected = [
+            "http://example.com/page1",
+            "http://example.com/page2/abc",
+            "http://example.com/page3/def/ghi"


We should also have other URLs with deep paths. Like:

http://example.com/page3/def/ghi
http://example.com/page3/def/hjk
http://example.com/page3/def/jkl
http://example.com/page3/def/lkj

And we need to decide at which path depth we start the "anonymization". For example, we might want to keep 2 levels of domain names and two path levels, and anonymize the rest.
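As a rough sketch of the path side, assuming we settle on two path levels and reuse the `1w` placeholder for everything deeper (both are open decisions, and `identity_path` and `KEEP_PATH_LEVELS` are hypothetical names):

```python
from urllib.parse import urlsplit

KEEP_PATH_LEVELS = 2  # assumption: number of leading path segments kept verbatim
PLACEHOLDER = "1w"    # assumption: same placeholder as for host labels


def identity_path(url: str) -> str:
    """Keep the first KEEP_PATH_LEVELS path segments and collapse the rest."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > KEEP_PATH_LEVELS:
        segments = segments[:KEEP_PATH_LEVELS] + [PLACEHOLDER]
    return "/" + "/".join(segments)
```

Under these assumptions the four `/page3/def/...` URLs share one identity path, while `/page1` and `/page2/abc` are unchanged.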

cunha · 2026-01-12T16:40:52Z

urls_filter/test_filter.py

We need to add tests to cover URL queries (e.g., ?p=X&q=Y) after the path. These should be anonymized by keeping only the parameter names and ignoring the values. In other words, the identity key should be something like ?p&q.
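A minimal sketch of the query part of the identity key, keeping parameter names and dropping values. `identity_query` is a hypothetical helper; preserving first-seen order and dropping repeated names are assumptions, and sorting the names would be an alternative if parameter order should not matter.

```python
from urllib.parse import parse_qsl, urlsplit


def identity_query(url: str) -> str:
    """Return the query reduced to its parameter names, e.g. '?p&q'."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    names = []
    for name, _value in pairs:
        if name not in names:  # drop repeated parameter names
            names.append(name)
    return "?" + "&".join(names) if names else ""
```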

cunha

Forgot to mention CC TLDs in the previous review.

cunha · 2026-01-12T16:43:14Z

urls_filter/data/dns-keywords.txt

We will not need this file; we can drop it from the PR.

cunha · 2026-01-12T16:43:30Z

urls_filter/test_filter.py

We also need tests to cover CC TLDs and special TLDs.
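To make the CC TLD case concrete, here is a sketch of why these need their own tests: under a suffix such as `com.br`, the registered domain sits one label deeper, so counting "levels" from the right must account for the suffix length. The suffix set below is a tiny hypothetical sample for illustration, not the contents of `data/tlds.json`, and `suffix_label_count` is an assumed helper name.

```python
# Hypothetical sample of two-label suffixes; a real list would come from the TLD data.
MULTI_LABEL_SUFFIXES = {"com.br", "co.uk"}


def suffix_label_count(host: str) -> int:
    """Return how many trailing labels of the host belong to the public suffix."""
    labels = host.lower().split(".")
    if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return 2
    return 1


def registered_domain(host: str) -> str:
    """Return the suffix plus one label, e.g. 'example.com.br'."""
    labels = host.lower().split(".")
    return ".".join(labels[-(suffix_label_count(host) + 1):])
```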

Add URL filtering

6db4e3c

cunha requested changes Jan 12, 2026

3 participants