Commit 74b052d

feat: add relevance-based sorting for CSV export and fix tests
1 parent ac47498 commit 74b052d

9 files changed

Lines changed: 216 additions & 5 deletions

.env.example

Lines changed: 2 additions & 0 deletions
```diff
@@ -28,3 +28,5 @@ CACHE_TTL_DAYS=7
 # --- OUTPUT ---
 # Directory where CSV and Markdown reports will be saved
 OUTPUT_DIR=data
+# Export sorting configuration: "false" to keep original order, "true" to sort by score (desc) and domain (asc)
+EXPORT_SORT_BY_RELEVANCE=false
```

README.md

Lines changed: 25 additions & 0 deletions
@@ -64,6 +64,31 @@ poetry run python -m src.main --input data/seeds.csv --rerun-failed
- **Works with any input:** A plain domain list or a complex CSV — the system finds failed domains via the database
- **Time savings:** Re-scrapes only domains with `status=error`; successful ones are pulled from cache

### 📊 Export Sorting by Relevance (Smart Export Order)
- **Configurable sorting:** set the `.env` parameter `EXPORT_SORT_BY_RELEVANCE=true` to sort the CSV by score
- **Two-level sorting:** first by score (100→0), then alphabetically by domain for ties
- **Preserves original order:** defaults to `false` — domains appear in the CSV in the same order as in the input file
- **NULL-safe:** domains without a score (failed scraping) are automatically moved to the end of the list

**Example `.env` configuration:**
```bash
# Sort CSV by relevance (High Priority → Low Priority)
EXPORT_SORT_BY_RELEVANCE=true

# Or preserve the original order (default)
EXPORT_SORT_BY_RELEVANCE=false
```

**Output when `EXPORT_SORT_BY_RELEVANCE=true`:**
```
domain,score,priority
apple.com,100,High        ← highest score
wikipedia.org,100,High    ← same score → alphabetical order
amazon.com,85,High
httpbin.org,40,Low
fake-domain.com,0,Low     ← failed domains at the end
```
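The two-level, NULL-safe ordering described above can be sketched in a standalone pandas snippet (this is an illustration of the technique, not the project's exporter itself):

```python
import pandas as pd

rows = [
    {"domain": "wikipedia.org", "score": 100},
    {"domain": "amazon.com", "score": 85},
    {"domain": "fake-domain.com", "score": None},
    {"domain": "apple.com", "score": 100},
]
df = pd.DataFrame(rows)

# Coerce score to numeric; missing scores become -1 so they sink to the bottom
sort_score = pd.to_numeric(df["score"], errors="coerce").fillna(-1)
df = (
    df.assign(_sort_score=sort_score)
    .sort_values(by=["_sort_score", "domain"], ascending=[False, True])
    .drop(columns=["_sort_score"])
)
print(df["domain"].tolist())
# → ['apple.com', 'wikipedia.org', 'amazon.com', 'fake-domain.com']
```

pandas' `sort_values` also offers `na_position="last"`, but the `-1` sentinel keeps the score ordering and the alphabetical tie-break on `domain` in a single multi-key sort.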
### ⚡ Async I/O + Connection Pooling
- **5 workers process 100 domains in ~20 seconds** (vs. 100 seconds in the synchronous variant)
- **Configurable parallelism:** `--workers 10` for a fast VPS or `--workers 2` for resource-constrained environments

docs/README_UA.md

Lines changed: 25 additions & 0 deletions
@@ -63,6 +63,31 @@ poetry run python -m src.main --input data/seeds.csv --rerun-failed
- **Works with any input:** a plain domain list or a complex CSV — the system finds failed domains via the DB
- **Time savings:** re-scrapes only domains with `status=error`; successful ones are taken from the cache

### 📊 Export Sorting by Relevance (Smart Export Order)
- **Configurable sorting:** the `.env` parameter `EXPORT_SORT_BY_RELEVANCE=true` sorts the CSV by score
- **Two-level sorting:** first by score (100→0), then alphabetically for ties
- **Preserves original order:** defaults to `false` — domains appear in the CSV in the same order as in the input file
- **NULL-safe:** domains without a score (failed scraping) are automatically moved to the end of the list

**Example `.env` configuration:**
```bash
# Sort CSV by relevance (High Priority → Low Priority)
EXPORT_SORT_BY_RELEVANCE=true

# Or keep the original order (default)
EXPORT_SORT_BY_RELEVANCE=false
```

**Output when `EXPORT_SORT_BY_RELEVANCE=true`:**
```
domain,score,priority
apple.com,100,High        ← highest score
wikipedia.org,100,High    ← same score → alphabetical order
amazon.com,85,High
httpbin.org,40,Low
fake-domain.com,0,Low     ← failed domains at the end
```

### ⚡ Async I/O + Connection Pooling
- **5 workers process 100 domains in ~20 seconds** (vs. 100 seconds in the sync variant)
- **Configurable parallelism:** `--workers 10` for a fast VPS or `--workers 2` for constrained resources

docs/sort_output_20260508_003121.csv

Lines changed: 101 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Domain Triaging Executive Summary
Generated on: 2026-05-08 00:31:21
Input source: `output_20260508_003121.csv`

## 📊 Processing Statistics
| Metric | Value |
| :--- | :--- |
| **Total Domains Processed** | 100 |
| **Successful Scrapes** | 59 |
| **Failed / Inaccessible** | 41 |
| **Fallback API (Serper) Used** | 48 |
| **Total Serper Credits Consumed** | 38 |

## 🎯 Triage Results (Prioritization)
- **🔴 High Priority (Manual Review):** 54
- **🟡 Medium Priority (Monitor):** 1
- **🟢 Low Priority (Discard/Archive):** 45

## 🔍 Top Interesting Domains
| Domain | Score | Next Action |
| :--- | :--- | :--- |
| cal.com | 100 | Manual Review |
| discord.com | 100 | Manual Review |
| miro.com | 100 | Manual Review |
| calendly.com | 100 | Manual Review |
| cloudflare.com | 100 | Manual Review |

---
*Full data available in the associated CSV file: output_20260508_003121.csv*

src/config.py

Lines changed: 3 additions & 0 deletions
@@ -30,6 +30,9 @@ class Config:
```python
    # I/O Directories
    OUTPUT_DIR: str = "data"
    EXPORT_SORT_BY_RELEVANCE: bool = (
        os.getenv("EXPORT_SORT_BY_RELEVANCE", "false").lower() == "true"
    )


config = Config()
```
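The behavior of this parsing pattern is worth noting: only the literal string `"true"` (in any capitalization) enables the flag, while values like `"1"` or `"yes"` stay `False`. A minimal sketch, using a hypothetical `parse_bool_env` helper that mirrors the line above:

```python
import os

def parse_bool_env(name: str, default: str = "false") -> bool:
    # Same pattern as Config.EXPORT_SORT_BY_RELEVANCE: only the literal
    # string "true" (any capitalization) enables the flag.
    return os.getenv(name, default).lower() == "true"

os.environ["EXPORT_SORT_BY_RELEVANCE"] = "True"
print(parse_bool_env("EXPORT_SORT_BY_RELEVANCE"))  # → True

os.environ["EXPORT_SORT_BY_RELEVANCE"] = "1"
print(parse_bool_env("EXPORT_SORT_BY_RELEVANCE"))  # → False ("1" is not "true")
```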

src/exporter.py

Lines changed: 9 additions & 0 deletions
@@ -93,6 +93,15 @@ def export_to_csv(results: list[dict[str, Any]], output_path: str | None = None)
```python
    # Discarding unnecessary columns (e.g., 'fallback_used') and aligning the order
    df = df[expected_columns]

    if config.EXPORT_SORT_BY_RELEVANCE:
        # Create a temporary sorting column to handle None/NaN safely
        sort_score: Any = pd.to_numeric(df["score"], errors="coerce").fillna(-1)  # type: ignore
        df = (
            df.assign(_sort_score=sort_score)
            .sort_values(by=["_sort_score", "domain"], ascending=[False, True])
            .drop(columns=["_sort_score"])
        )

    try:
        # Writing with utf-8-sig (BOM) for correct import
        df.to_csv(output_path, index=False, encoding="utf-8-sig")
```
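The `utf-8-sig` encoding used by the exporter prepends a byte-order mark, which is what Excel checks to detect UTF-8 CSVs. A self-contained stdlib sketch of that effect (independent of the project code):

```python
import csv
import io

buf = io.BytesIO()
text = io.TextIOWrapper(buf, encoding="utf-8-sig", newline="")
writer = csv.writer(text)
writer.writerow(["domain", "score"])
writer.writerow(["münchen.de", 85])  # non-ASCII survives the round trip
text.flush()

data = buf.getvalue()
print(data[:3])  # → b'\xef\xbb\xbf' (the UTF-8 BOM Excel looks for)
```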

tests/test_exporter.py

Lines changed: 19 additions & 0 deletions
@@ -71,3 +71,22 @@ def test_export_to_csv_success(tmp_path: Path) -> None:
```python
    assert df.loc[0, "scrape_method"] == "bs4"
    assert df.loc[1, "scrape_method"] == "serper"
    assert "Fallback to Serper.dev" in str(df.loc[1, "notes"])


def test_export_to_csv_sorting(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
    """Verifies sorting by relevance (score) and alphabetically by domain."""
    new_config = replace(config, EXPORT_SORT_BY_RELEVANCE=True)
    monkeypatch.setattr("src.exporter.config", new_config)

    output_file = tmp_path / "test_sort.csv"
    mock_results = [
        {"domain": "b.com", "score": 50, "status": "success"},
        {"domain": "c.com", "score": 100, "status": "success"},
        {"domain": "a.com", "score": 50, "status": "success"},
        {"domain": "d.com", "score": None, "status": "error"},
    ]
    export_to_csv(mock_results, output_path=str(output_file))

    df = pd.read_csv(output_file)
    domains = df["domain"].tolist()
    assert domains == ["c.com", "a.com", "b.com", "d.com"]
```
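The test's `replace(config, ...)` idiom builds a modified copy of a dataclass instance without mutating the shared original, which pairs well with `monkeypatch`. A minimal sketch with a hypothetical `DemoConfig` standing in for the project's `Config`:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DemoConfig:
    # Hypothetical stand-in for the project's Config
    OUTPUT_DIR: str = "data"
    EXPORT_SORT_BY_RELEVANCE: bool = False

base = DemoConfig()
override = replace(base, EXPORT_SORT_BY_RELEVANCE=True)

print(override.EXPORT_SORT_BY_RELEVANCE)  # → True
print(base.EXPORT_SORT_BY_RELEVANCE)      # → False (original untouched)
```

Because `replace` returns a new instance, other tests that read the module-level `config` are unaffected once `monkeypatch` restores the attribute.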

tests/test_scorer.py

Lines changed: 2 additions & 5 deletions
```diff
@@ -46,18 +46,15 @@ def test_assign_priority_medium() -> None:
     mock_data = {
         "domain": "average-site.net",
         "ssl_valid": True,
-        "domain_age_days": 40,
-        "has_live_content": False,
+        "domain_age_days": 400,
+        "has_live_content": True,
         "word_count": 60,
         "error": None,
         "status_code": 200,
     }
 
     result = calculate_score(mock_data)
 
-    mock_data["domain_age_days"] = 400
-    result = calculate_score(mock_data)
-
     assert result["priority"] == "Medium"
     assert result["next_action"] == "Monitor"
```