Skip to content

Conversation

@Voamorim
Copy link

No description provided.

Copy link
Member

@cunha cunha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main points:

  • Tests don't match expected behavior; we need to define a number of domain and path levels beyond which we start "anonymization"
  • Need tests to cover URL query parameters
  • Need tests to cover special domains and CC TLDs

"manager": tld_manager
})

with open("data/tlds.json", "w") as jsonfile:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add this file to the repo to avoid downloading it every time. The CC TLDs at least should not change frequently (if they do change at all).

@@ -0,0 +1,3 @@
bs4
requests
yacryptopan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if we actually need this, I think we might not need it.

Comment on lines +14 to +16
urls = ["http://example.com/path"]
result = filter_instance.filter_urls(urls)
assert result == urls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, however, we should also check the identity key generated for each URL. In the case of http://example.com/path, the identity key should be something like http://example.com/path.

result = filter_instance.filter_urls(urls)
expected = [
"http://example.com/path1",
"http://test.com/home"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get path2 here, no? It's a different path.

expected = [
"http://a.b.example.com/page",
"http://b.example.com/other",
"http://c.b.example.com/page"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should stop "anonymize" everything in the identity key after the second level. So the identity keys might be something like:

http://a.b.example.com/page --> http://1w.b.example.com/page
http://b.example.com/other --> http://b.example.com/other
http://a.b.example.com/page --> http://1w.b.example.com/page
http://c.b.example.com/page --> http://1w.b.example.com/page

We should also add other cases:

http://c.b.example.com/other --> http://1w.b.example.com/other  # should be included for crawling by the filter

]
result = filter_instance.filter_urls(urls)
expected = [
"http://Example.com/Page"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URL paths are case-sensitive, so Page != page != PAGE and we should fetch all pages.

]
result = filter_instance.filter_urls(urls)
expected = [
"http://example.com/page1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have all three entries.

expected = [
"http://example.com/page1",
"http://example.com/page2/abc",
"http://example.com/page3/def/ghi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also have other URLs with deep paths. Like:

http://example.com/page3/def/ghi
http://example.com/page3/def/hjk
http://example.com/page3/def/jkl
http://example.com/page3/def/lkj

And we need to decide at which path depth we start the "anonymization". For example, we might want to keep 2 levels of domain names and two path levels, and anonymize the rest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add tests to cover URL queries (e.g., ?p=X&q=Y) after the path. These should be anonymized by keeping only the parameter names and ignoring the values. In other words, the identity key should be something like ?p&q.

Copy link
Member

@cunha cunha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention CC TLDs in the previous review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will not need this file, we can drop it from the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need tests to cover CC TLDs and special TLDs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants