-
Notifications
You must be signed in to change notification settings - Fork 4
Add URL filtering #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cunha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main points:
- Tests don't match expected behavior; we need to define a number of domain and path levels beyond which we start "anonymization"
- Need tests to cover URL query parameters
- Need tests to cover special domains and CC TLDs
| "manager": tld_manager | ||
| }) | ||
|
|
||
| with open("data/tlds.json", "w") as jsonfile: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add this file to the repo to avoid downloading it every time. The CC TLDs at least should not change frequently (if they do change at all).
| @@ -0,0 +1,3 @@ | |||
| bs4 | |||
| requests | |||
| yacryptopan | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if we actually need this, I think we might not need it.
| urls = ["http://example.com/path"] | ||
| result = filter_instance.filter_urls(urls) | ||
| assert result == urls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, however, we should also check the identity key generated for each URL. In the case of http://example.com/path, the identity key should be something like http://example.com/path.
| result = filter_instance.filter_urls(urls) | ||
| expected = [ | ||
| "http://example.com/path1", | ||
| "http://test.com/home" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should get path2 here, no? It's a different path.
| expected = [ | ||
| "http://a.b.example.com/page", | ||
| "http://b.example.com/other", | ||
| "http://c.b.example.com/page" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should stop "anonymize" everything in the identity key after the second level. So the identity keys might be something like:
http://a.b.example.com/page --> http://1w.b.example.com/page
http://b.example.com/other --> http://b.example.com/other
http://a.b.example.com/page --> http://1w.b.example.com/page
http://c.b.example.com/page --> http://1w.b.example.com/page
We should also add other cases:
http://c.b.example.com/other --> http://1w.b.example.com/other # should be included for crawling by the filter
| ] | ||
| result = filter_instance.filter_urls(urls) | ||
| expected = [ | ||
| "http://Example.com/Page" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URL paths are case-sensitive, so Page != page != PAGE and we should fetch all pages.
| ] | ||
| result = filter_instance.filter_urls(urls) | ||
| expected = [ | ||
| "http://example.com/page1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have all three entries.
| expected = [ | ||
| "http://example.com/page1", | ||
| "http://example.com/page2/abc", | ||
| "http://example.com/page3/def/ghi" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also have other URLs with deep paths. Like:
http://example.com/page3/def/ghi
http://example.com/page3/def/hjk
http://example.com/page3/def/jkl
http://example.com/page3/def/lkj
And we need to decide at which path depth we start the "anonymization". For example, we might want to keep 2 levels of domain names and two path levels, and anonymize the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add tests to cover URL queries (e.g., ?p=X&q=Y) after the path. These should be anonymized by keeping only the parameter names and ignoring the values. In other words, the identity key should be something like ?p&q.
cunha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to mention CC TLDs in the previous review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will not need this file, we can drop it from the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need tests to cover CC TLDs and special TLDs.
No description provided.