Conversation
Mhafsia commented on Jan 6, 2026
- Added a SerpAPI sync script to fetch publications from Google Scholar
- Migrated 135 publications to a single YAML file (`_data/publications.yml`)
- 285 publications now synced from Google Scholar
- New sidebar UI with Type and Year filters (Category filter temporarily disabled)
- Pill-shaped filter buttons with color coding (blue, teal)
- Clickable tags on publications for filtering
- Normalized author names to "Initial. Lastname" format
- Added `clean_publications.py` and `normalize_publications.py` scripts
- Removed the old individual `.md` publication files
✅ Deploy Preview for ppsp ready!
Pull request overview
This PR implements a comprehensive publications management system that migrates from individual markdown files to a centralized YAML-based approach with automated syncing from Google Scholar via SerpAPI. The changes include four Python utility scripts for synchronization, normalization, and data cleaning, plus a redesigned UI with interactive filtering capabilities.
Key Changes:
- Automated publication syncing from Google Scholar using SerpAPI (285 total publications)
- Migration of 135 existing publications from individual .md files to single _data/publications.yml
- New sidebar-based filtering UI with Type, Year filters, and clickable tags
- Author name normalization to "Initial. Lastname" format
Reviewed changes
Copilot reviewed 141 out of 143 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| scripts/sync_publications.py | Main sync script that fetches publications from Google Scholar API and adds new entries to YAML |
| scripts/normalize_publications.py | Normalizes author names, journal titles, and publication types for consistency |
| scripts/clean_publications.py | Cleans publication data by removing duplicates, fixing doubled years, and extracting correct years |
| scripts/migrate_to_yaml.py | One-time migration script to convert .md frontmatter to YAML format |
| scripts/requirements.txt | Python dependencies (requests, pyyaml) |
| _includes/publications.html | Redesigned publications page with sidebar filters, pill-style buttons, and JavaScript filtering logic |
| .gitignore | Added .env files to prevent API key exposure |
| _papers/papers/*.md | Removed 135+ individual publication markdown files (migrated to YAML) |
scripts/normalize_publications.py (Outdated)
```python
if len(parts) >= 2:
    # Last part is lastname, first parts are first names
    lastname = parts[-1]
    firstnames = parts[:-1]
    # Convert first names to initials
    initials = ' '.join([f"{n[0]}." for n in firstnames if n])
    normalized.append(f"{initials} {lastname}")
```
The author name normalization logic assumes "last part is lastname, first parts are first names", but this doesn't handle compound last names (e.g., "Van Der Berg", "De La Cruz") or names with suffixes (e.g., "John Smith Jr."). This could result in incorrect name formatting for authors with these name patterns.
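A minimal sketch of one way to absorb compound prefixes and common suffixes into the last name, scanning backwards from the end of the name (the prefix and suffix sets here are illustrative, not exhaustive):

```python
# Illustrative sets; extend to cover the prefixes/suffixes in the actual corpus
COMPOUND_PREFIXES = {"van", "von", "der", "den", "de", "la", "le", "del", "di", "da"}
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}

def format_author(parts: list) -> str:
    """Format ['Jean', 'de', 'la', 'Fontaine'] as 'J. de la Fontaine'."""
    suffix = ""
    if parts and parts[-1].lower() in SUFFIXES:
        suffix = " " + parts[-1]
        parts = parts[:-1]
    # Scan backwards, absorbing consecutive compound prefixes into the last name
    start = len(parts) - 1
    while start > 0 and parts[start - 1].lower() in COMPOUND_PREFIXES:
        start -= 1
    lastname = " ".join(parts[start:]) + suffix
    initials = " ".join(f"{n[0]}." for n in parts[:start] if n)
    return f"{initials} {lastname}".strip()
```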
```python
# If year is 2025 (our fallback), try to find real year in publication field
if current_year == 2025 and publication:
    # Look for year pattern in publication
    year_match = re.search(r'\b(19\d{2}|20[0-2]\d)\b', publication)
```
The regex pattern for extracting years in clean_publications.py is `\b(19\d{2}|20[0-2]\d)\b`, which matches years from 1900-2029. However, the pattern in sync_publications.py uses `\b(19|20)\d{2}\b`, which matches 1900-2099. This inconsistency could lead to different behavior when extracting years. Consider using the same pattern in both scripts for consistency.
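One hedged way to keep the two scripts aligned is a single shared helper that both import; a sketch (the module layout and names are assumptions):

```python
import re
from datetime import date
from typing import Optional

# Match any 19xx/20xx token, then validate the upper bound numerically
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")

def extract_year(text: str) -> Optional[int]:
    """Return the first plausible publication year found in text, else None."""
    for match in YEAR_RE.finditer(text):
        year = int(match.group(0))
        if 1900 <= year <= date.today().year + 1:  # reject far-future matches
            return year
    return None
```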
_includes/publications.html (Outdated)
| <button class="filter-pill filter-pill-year" data-filter="2019-" | ||
| onclick="toggleFilter(this, 'year')">Before 2020</button> |
There was a problem hiding this comment.
The year filter "Before 2020" uses the filter value "2019-" which is checked with a special case in the applyFilters function. However, this naming is misleading - "2019-" suggests a range starting from 2019, when it actually means "before 2020". Consider using a more explicit value like "pre-2020" or "before-2020" to make the intent clearer.
```python
def normalize_title(title: str) -> str:
    """Normalize title for comparison."""
    return re.sub(r'[^a-z0-9]', '', title.lower())
```
The normalize_title function removes all non-alphanumeric characters for comparison, which could cause false positives. For example, "COVID-19 Study" and "COVID 19 Study" would be considered the same (correctly), but "Re-examination" and "Reexamination" would also match (potentially incorrectly if they're different papers). Consider a more nuanced normalization that preserves some punctuation context or use a fuzzy matching algorithm for better duplicate detection.
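A sketch of fuzzy matching with the standard library's difflib; the 0.92 threshold is an assumption to tune against real duplicates:

```python
from difflib import SequenceMatcher

def is_probable_duplicate(title_a: str, title_b: str, threshold: float = 0.92) -> bool:
    """Flag two titles as duplicates when their similarity ratio clears a threshold."""
    a = title_a.lower().strip()
    b = title_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Pairwise comparison is quadratic in the number of publications, which should be acceptable at roughly 285 entries.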
```python
except requests.exceptions.RequestException as e:
    print(f"❌ API Error: {e}")
    return
```
The error handling for RequestException catches all request-related errors generically and simply returns, but doesn't distinguish between different error types (network errors, timeout, 401 unauthorized, 429 rate limit, etc.). Consider adding more specific error handling to provide better diagnostics, especially for common issues like invalid API keys (401/403) or rate limiting (429).
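A sketch of more granular handling, assuming the request call looks roughly like the script's (the `url` and `params` names are placeholders):

```python
try:
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("❌ Request timed out; try again later.")
    return
except requests.exceptions.HTTPError as e:
    status = e.response.status_code if e.response is not None else None
    if status in (401, 403):
        print("❌ Authentication failed: check SERPAPI_KEY.")
    elif status == 429:
        print("❌ Rate limit exceeded: wait before retrying.")
    else:
        print(f"❌ HTTP error {status}: {e}")
    return
except requests.exceptions.RequestException as e:
    print(f"❌ Network error: {e}")
    return
```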
_includes/publications.html (Outdated)
```js
// Check type filter (case insensitive)
const typeMatch = currentFilters.type === 'all' ||
    type.toLowerCase() === currentFilters.type.toLowerCase();

// Check category filter
const catMatch = currentFilters.category === 'all' ||
    categories.toLowerCase().includes(currentFilters.category.toLowerCase());
```
The case-insensitive type matching converts both sides to lowercase for comparison, but the category filter uses case-sensitive includes(). This inconsistency could lead to unexpected behavior. Either make both case-insensitive or both case-sensitive for consistency. Since publication types and categories are controlled values, case-sensitive comparison would be more predictable.
Suggested change:

```diff
- // Check type filter (case insensitive)
- const typeMatch = currentFilters.type === 'all' ||
-     type.toLowerCase() === currentFilters.type.toLowerCase();
- // Check category filter
- const catMatch = currentFilters.category === 'all' ||
-     categories.toLowerCase().includes(currentFilters.category.toLowerCase());
+ // Check type filter (case sensitive, controlled values)
+ const typeMatch = currentFilters.type === 'all' ||
+     type === currentFilters.type;
+ // Check category filter (case sensitive, controlled values)
+ const catMatch = currentFilters.category === 'all' ||
+     categories.includes(currentFilters.category);
```
scripts/clean_publications.py (Outdated)
```python
seen_titles = {}
unique_pubs = []
duplicates_removed = 0

for pub in pubs:
    title = pub.get('title', '').lower().strip()
    # Normalize title for comparison
    normalized = re.sub(r'[^a-z0-9]', '', title)

    if normalized and normalized not in seen_titles:
        seen_titles[normalized] = True
```
The duplicate removal uses a dictionary where keys are normalized titles but the value is just 'True'. Using a set would be more appropriate and memory-efficient for this use case. Consider changing 'seen_titles = {}' to 'seen_titles = set()' and updating the logic accordingly.
_includes/publications.html (Outdated)
```css
    top: 100px;
}
```
The sticky sidebar has 'top: 100px' which may not work well on mobile devices or if the site header height changes. Consider using a more flexible approach with proper responsive breakpoints and calculating the offset based on the actual header height, or ensure this is tested across different screen sizes and header configurations.
Suggested change:

```diff
-     top: 100px;
- }
+     top: 0;
+ }
+ @media (min-width: 992px) {
+     .pub-sidebar {
+         top: 100px;
+     }
+ }
```
scripts/migrate_to_yaml.py (Outdated)
```python
Run once to migrate, then delete this script.
"""

import os
```
Import of `os` is not used.
scripts/sync_publications.py (Outdated)
```python
import re
import yaml
import requests
from datetime import datetime
```
Import of `datetime` is not used.
- Use dynamic year fallback instead of hardcoded value
- Better error handling for API requests
- Handle compound last names and multiple initials
- Add ARIA labels for accessibility
- Use data-filter-group for robust filter handling
- Add responsive sticky sidebar
- Remove unused imports
Pull request overview
Copilot reviewed 141 out of 143 changed files in this pull request and generated 10 comments.
scripts/normalize_publications.py (Outdated)
```python
if re.match(r'^([A-Z]\.\s*)+[A-Za-z-]+$', author):
    normalized.append(author)
    continue
```
The regex pattern for detecting already-normalized author names on line 34 has a flaw. The pattern `^([A-Z]\.\s*)+[A-Za-z-]+$` will match names like "J. Smith-Jones" correctly, but it won't match compound last names with multiple parts separated by spaces (e.g., "J. van der Berg"). The pattern assumes the last name is a single continuous token, so it doesn't cover compound prefixes like "van", "von", or "de" that are already handled in the normalization logic starting at line 44.
scripts/clean_publications.py (Outdated)
```python
def remove_duplicates(pubs: list) -> list:
    """Remove duplicate publications based on title similarity."""
    seen_titles = set()  # Use set instead of dict for efficiency
    unique_pubs = []
    duplicates_removed = 0

    for pub in pubs:
        title = pub.get('title', '').lower().strip()
        # Normalize title for comparison
        normalized = re.sub(r'[^a-z0-9]', '', title)

        if normalized and normalized not in seen_titles:
            seen_titles.add(normalized)
            unique_pubs.append(pub)
        else:
            duplicates_removed += 1
            print(f"  🔄 Removed duplicate: {pub.get('title', '')[:60]}...")

    return unique_pubs, duplicates_removed
```
The duplicate removal logic uses a set to track normalized titles but returns a tuple on line 66 when the function signature on line 48 indicates it should return a list. While Python can handle this in most cases, the inconsistency between the declared return type (list) and actual return type (tuple) could cause issues. The function should consistently return a list as indicated by its signature.
| <button class="filter-pill active" data-filter="all" onclick="toggleFilter(this, 'type')" | ||
| aria-label="Show all publication types">All</button> | ||
| <button class="filter-pill" data-filter="Journal Paper" onclick="toggleFilter(this, 'type')" | ||
| aria-label="Filter by type: Journal Paper">Journal | ||
| Paper</button> | ||
| <button class="filter-pill" data-filter="Preprint" onclick="toggleFilter(this, 'type')" | ||
| aria-label="Filter by type: Preprint">Preprint</button> | ||
| <button class="filter-pill" data-filter="Book_Chapter" onclick="toggleFilter(this, 'type')" | ||
| aria-label="Filter by type: Book Chapter">Book | ||
| Chapter</button> | ||
| <button class="filter-pill" data-filter="Comment" onclick="toggleFilter(this, 'type')" | ||
| aria-label="Filter by type: Comment">Comment</button> | ||
| <button class="filter-pill" data-filter="Poster_Conference" onclick="toggleFilter(this, 'type')" | ||
| aria-label="Filter by type: Poster/Conference">Poster/Conference</button> |
There was a problem hiding this comment.
In the HTML template, the filter pills use inline onclick handlers which mix JavaScript with HTML. This is generally considered a maintainability anti-pattern. Consider moving to event listeners in the JavaScript section instead of inline event handlers. This would make the code more maintainable and follow modern web development best practices.
```python
def load_env():
    env_path = Path(__file__).parent / '.env'
    env_vars = {}
    if env_path.exists():
        with open(env_path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#') and '=' in line:
                    key, value = line.split('=', 1)
                    # Strip quotes from values (handles "value" or 'value')
                    value = value.strip().strip('"').strip("'")
                    env_vars[key.strip()] = value
    return env_vars
```
The function load_env() on lines 20-32 manually parses the .env file. While this works, it doesn't handle all edge cases that standard .env parsers handle (e.g., escaped characters, multiline values, comments after values). Consider using a standard library like python-dotenv which is more robust and handles these edge cases correctly. Add it to requirements.txt if adopted.
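If the dependency is acceptable, a sketch of the python-dotenv route (after adding python-dotenv to requirements.txt):

```python
from pathlib import Path
from dotenv import dotenv_values

def load_env() -> dict:
    """Parse .env with python-dotenv, which handles quoting, escapes, and comments."""
    env_path = Path(__file__).parent / '.env'
    # dotenv_values returns None for keys declared without a value; drop those
    return {k: v for k, v in dotenv_values(env_path).items() if v is not None}
```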
_includes/publications.html (Outdated)
```js
// Fallback: legacy behavior based on specific filter values
document.querySelectorAll('.sidebar-section').forEach(section => {
    const buttons = section.querySelectorAll('.filter-pill[data-filter]');
    if (buttons.length > 0) {
        const hasType = section.querySelector('[data-filter="Journal Paper"]');
        const hasCategory = section.querySelector('[data-filter="Computational"]');

        if ((filterGroup === 'type' && hasType) || (filterGroup === 'category' && hasCategory)) {
            buttons.forEach(btn => {
                btn.classList.toggle('active', btn.dataset.filter === value);
            });
        }
    }
});
}
```
The JavaScript function filterByTag has a fallback mechanism (lines 317-331) in case the data-filter-group attributes are not found. However, this fallback uses brittle selectors that look for specific filter values like 'Journal Paper' and 'Computational'. If these values change or new filter types are added, this fallback code will break silently. Consider removing the fallback entirely and ensuring all filter sections have the required data-filter-group attribute, or document why this fallback is necessary.
| <div class="pub-tags"> | ||
| {% assign type_display = project.type | replace: "Poster_Conference", "Poster/Conference" | | ||
| replace: "_", " " %} | ||
| <button class="pub-tag" onclick="filterByTag('{{ project.type }}', 'type')">{{ type_display |
There was a problem hiding this comment.
The aria-label attributes on filter buttons (lines 35-48, 76-92) provide good accessibility support. However, the pub-tag buttons created in lines 119-120 that are clickable don't have aria-labels or roles, which could make them less accessible to screen reader users. Consider adding appropriate ARIA attributes to these interactive elements.
| <button class="pub-tag" onclick="filterByTag('{{ project.type }}', 'type')">{{ type_display | |
| <button class="pub-tag" onclick="filterByTag('{{ project.type }}', 'type')" aria-label="Filter by type: {{ type_display }}">{{ type_display |
```css
.pub-sidebar {
    background: #f8f9fa;
    border-radius: 8px;
    padding: 20px 0px;
    position: sticky;
    top: 0;
}

@media (min-width: 992px) {
    .pub-sidebar {
        top: 100px;
```
The sticky sidebar CSS uses top: 0 for small screens and top: 100px for large screens (lines 152-160). The 100px offset assumes a fixed header height, but if the header size changes or is responsive, this could cause layout issues. Consider using CSS variables or calculating the header height dynamically to make this more maintainable.
Suggested change:

```diff
- .pub-sidebar {
-     background: #f8f9fa;
-     border-radius: 8px;
-     padding: 20px 0px;
-     position: sticky;
-     top: 0;
- }
- @media (min-width: 992px) {
-     .pub-sidebar {
-         top: 100px;
+ :root {
+     /* Default sticky offset for publication sidebar (can be overridden globally) */
+     --pub-sidebar-offset: 0;
+ }
+ .pub-sidebar {
+     background: #f8f9fa;
+     border-radius: 8px;
+     padding: 20px 0px;
+     position: sticky;
+     top: var(--pub-sidebar-offset, 0);
+ }
+ @media (min-width: 992px) {
+     :root {
+         --pub-sidebar-offset: 100px;
```
```python
# Try to extract from citation info
citation = article.get('citation', '')
# Use consistent year regex pattern (1900-2029)
year_match = re.search(r'\b(19\d{2}|20[0-2]\d)\b', citation)
```
The regex pattern for year extraction is inconsistent across files. In sync_publications.py line 68, the pattern is `\b(19\d{2}|20[0-2]\d)\b` (1900-2029), while in clean_publications.py line 37, it's the same. However, this pattern will match years up to 2029, which may be too permissive. Consider using a more conservative pattern that only matches up to the current year plus one, to avoid false positives from text that happens to contain numbers in this range.
```python
SERPAPI_KEY = env.get('SERPAPI_KEY', os.getenv('SERPAPI_KEY', ''))
AUTHOR_ID = env.get('AUTHOR_ID', 'TakXk9MAAAAJ')  # Guillaume Dumas's Google Scholar ID
```
The SERPAPI_KEY is loaded from environment variables but there's no validation of its format or length. While there's a basic check for empty string on line 143, consider adding validation to ensure the API key meets minimum requirements (e.g., minimum length check) before making API requests, which could help catch configuration errors earlier.
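A sketch of an early sanity check; the 32-character minimum is an illustrative guess, not SerpAPI's documented key format:

```python
import sys

def validate_api_key(key: str, min_length: int = 32) -> None:
    """Fail fast on obviously malformed keys before spending an API call."""
    if not key:
        sys.exit("❌ SERPAPI_KEY is not set; add it to scripts/.env")
    if len(key) < min_length:
        sys.exit(f"❌ SERPAPI_KEY looks too short ({len(key)} chars); check the value")
```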
```python
# Filter out empty titles and publications with unreasonable years (0 or far future)
current_year = get_current_year()
fetched_pubs = [p for p in fetched_pubs if p['title'] and 0 < p.get('year', 0) <= current_year + 1]
```
The year filtering logic on line 193 uses current_year + 1 as the upper bound. This means publications dated one year in the future will be accepted. While there's a comment on lines 157-158 mentioning this is expected behavior, accepting future-dated publications could lead to incorrect data if the API returns publications with incorrect years. Consider whether this tolerance is necessary or if it should be limited to the current year only.
Pull request overview
Copilot reviewed 141 out of 143 changed files in this pull request and generated 9 comments.
scripts/sync_publications.py (Outdated)
```python
# Use current year as fallback instead of hardcoded value
fallback_year = get_current_year()
parsed_year = int(year) if year and str(year).isdigit() else fallback_year
```
The fallback year logic uses get_current_year() when year data is missing. This means publications with missing year information will always appear as if they were published in the current year, which could be misleading. Consider using a special sentinel value (like 0 or None) for unknown years, or at minimum add a flag to indicate the year is estimated.
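A minimal sketch of the sentinel approach, assuming downstream sorting treats None as oldest:

```python
from typing import Optional

def parse_year(year) -> Optional[int]:
    """Return the year as an int, or None when the source has no usable year."""
    return int(year) if year and str(year).isdigit() else None

# Newest-first sort that pushes unknown years to the end:
# pubs.sort(key=lambda p: p.get('year') or 0, reverse=True)
```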
scripts/migrate_to_yaml.py (Outdated)
```python
# Extract frontmatter between --- markers
match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
```
The frontmatter parsing uses a regex that may not handle edge cases well. For example, if the content contains --- within a code block in the frontmatter, the regex could match incorrectly. Consider using a YAML parser with proper frontmatter support or adding more robust boundary detection.
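A sketch of line-based boundary detection that only treats a line consisting of `---` as a delimiter (the python-frontmatter package is another option if a dependency is acceptable):

```python
import yaml

def parse_frontmatter(content: str) -> dict:
    """Parse YAML frontmatter delimited by lines containing exactly '---'."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != '---':
        return {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == '---':  # closing delimiter must be a whole line
            return yaml.safe_load('\n'.join(lines[1:i])) or {}
    return {}  # unterminated frontmatter: treat as no metadata
```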
_includes/publications.html (Outdated)
```js
} else {
    // Fallback: legacy behavior based on specific filter values
    document.querySelectorAll('.sidebar-section').forEach(section => {
        const buttons = section.querySelectorAll('.filter-pill[data-filter]');
        if (buttons.length > 0) {
            const hasType = section.querySelector('[data-filter="Journal Paper"]');
            const hasCategory = section.querySelector('[data-filter="Computational"]');

            if ((filterGroup === 'type' && hasType) || (filterGroup === 'category' && hasCategory)) {
                buttons.forEach(btn => {
                    btn.classList.toggle('active', btn.dataset.filter === value);
                });
            }
        }
    });
```
The filter toggle logic has a fallback mechanism (lines 316-332) for when data-filter-group attributes are not present, but the HTML always includes these attributes. This fallback code is likely dead code and should be removed to improve maintainability.
Suggested change (remove the fallback):

```diff
- } else {
-     // Fallback: legacy behavior based on specific filter values
-     document.querySelectorAll('.sidebar-section').forEach(section => {
-         const buttons = section.querySelectorAll('.filter-pill[data-filter]');
-         if (buttons.length > 0) {
-             const hasType = section.querySelector('[data-filter="Journal Paper"]');
-             const hasCategory = section.querySelector('[data-filter="Computational"]');
-             if ((filterGroup === 'type' && hasType) || (filterGroup === 'category' && hasCategory)) {
-                 buttons.forEach(btn => {
-                     btn.classList.toggle('active', btn.dataset.filter === value);
-                 });
-             }
-         }
-     });
```
```js
const typeMatch = currentFilters.type === 'all' ||
    type === currentFilters.type;

// Check category filter (case sensitive, controlled values)
const catMatch = currentFilters.category === 'all' ||
    categories.includes(currentFilters.category);
```
The filtering logic performs case-sensitive exact matching for type and category filters (lines 374-379), which is correct for controlled values. However, the code should validate that the data attributes contain only expected values to prevent filter mismatches. Consider adding data validation or normalization when the page loads.
scripts/normalize_publications.py (Outdated)
```python
if re.match(r'^([A-Z]\.\s*)+[A-Za-z-]+$', author):
    normalized.append(author)
    continue
```
The regex pattern for author names `^([A-Z]\.\s*)+[A-Za-z-]+$` on line 34 doesn't handle compound last names with prefixes (like "van der Berg"). While compound prefixes are handled later in the code, names already in the format "J. van der Berg" won't be recognized by this regex and will be unnecessarily reprocessed. Consider updating the pattern to `^([A-Z]\.\s*)+([a-z]+\s+)*[A-Za-z-]+$`.
```python
for i in range(len(parts) - 1, 0, -1):
    if parts[i - 1].lower() in COMPOUND_PREFIXES:
        lastname_start = i - 1
    else:
        break
```
The compound last name detection logic (lines 44-48) iterates backwards but breaks on the first non-prefix word. This means it won't correctly handle names like "Jean de la Fontaine" where multiple compound prefixes appear consecutively. The current logic would only capture "de la" but miss earlier prefixes if they're separated by a non-prefix word.
scripts/clean_publications.py (Outdated)
```python
def remove_duplicates(pubs: list) -> list:
    """Remove duplicate publications based on title similarity."""
    seen_titles = set()  # Use set instead of dict for efficiency
    unique_pubs = []
    duplicates_removed = 0

    for pub in pubs:
        title = pub.get('title', '').lower().strip()
        # Normalize title for comparison
        normalized = re.sub(r'[^a-z0-9]', '', title)

        if normalized and normalized not in seen_titles:
            seen_titles.add(normalized)
            unique_pubs.append(pub)
        else:
            duplicates_removed += 1
            print(f"  🔄 Removed duplicate: {pub.get('title', '')[:60]}...")

    return unique_pubs, duplicates_removed
```
The duplicate removal function returns a tuple (unique_pubs, dup_count) but the function signature on line 48 doesn't indicate this. Consider adding type hints to make the return type explicit: `def remove_duplicates(pubs: list) -> tuple[list, int]:`
_includes/publications.html (Outdated)
| <div class="sidebar-section"> | ||
| <input type="search" class="form-control rounded" id="search-input" onkeyup="filterPublications()" | ||
| placeholder="Search..." autocapitalize="off" autocomplete="off" autocorrect="off" role="textbox" | ||
| spellcheck="false"> |
There was a problem hiding this comment.
Missing ARIA labels for the search input field. Add appropriate labels for screen reader accessibility. Example: aria-label="Search publications" on the input element.
Suggested change:

```diff
- spellcheck="false">
+ spellcheck="false" aria-label="Search publications">
```
```css
}

@media (min-width: 992px) {
    .pub-sidebar {
```
The sticky sidebar positioning uses top: 100px on large screens (line 158) but this value appears to be a magic number. This should either be documented or derived from the actual header height to ensure it works correctly with different header sizes.
Suggested change:

```diff
  .pub-sidebar {
+     /* Offset matches desktop header height to prevent overlap; update if header height changes */
```
…nd code quality

- Replace fallback year with None to avoid misleading publication dates
- Improve YAML frontmatter parsing with robust boundary detection
- Remove dead legacy code in publications filter
- Add aria-label for search input accessibility
- Add documentation for sticky sidebar top offset
- Fix author name regex to handle compound names (van der Berg, de la)
- Improve compound last name detection logic
- Add proper type hints for remove_duplicates function
- Add data validation note for publication filters
Pull request overview
Copilot reviewed 141 out of 143 changed files in this pull request and generated 10 comments.
```python
def save_publications(publications: list):
    """Save publications to YAML file."""
    # Sort by year (newest first)
    publications.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)
```
The sorting key uses `x.get('year', 0) or 0`, which treats both missing years and years set to 0 the same way. However, line 193 filters out publications with year <= 0, so any publications with year=0 or None should have already been removed before this function is called. The `or 0` part is redundant defensive coding. Consider simplifying to just `x.get('year', 0)`, or `x['year']` if you're confident the key exists.
```python
    changes += 1

# Sort by year (newest first)
pubs.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)
```
The sorting key uses `x.get('year', 0) or 0`, which is redundant: if the key doesn't exist, `get` returns 0, and `0 or 0` is still 0. Additionally, this pattern treats publications with missing years the same as publications from year 0. Consider a more explicit approach like `x.get('year') or 0` if None should be treated as 0, or raise an error if the year is expected to always be present.
```python
def remove_duplicates(pubs: list) -> tuple[list, int]:
    """Remove duplicate publications based on title similarity.

    Returns:
        tuple[list, int]: Tuple of (unique_publications, duplicates_removed_count)
```
The function return type annotation is `tuple[list, int]`, which uses the newer Python 3.9+ syntax for type hints. However, the scripts don't specify a minimum Python version in requirements.txt or in the shebang. For better compatibility with Python 3.7-3.8, consider using `Tuple[list, int]` from the typing module, or add a minimum Python version requirement (e.g., in a setup.py or README).
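For older interpreters, the annotation could come from the typing module instead; a sketch:

```python
from typing import List, Tuple

def remove_duplicates(pubs: List[dict]) -> Tuple[List[dict], int]:
    """Return (unique_publications, duplicates_removed_count)."""
    ...
```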
```python
    errors.append(md_file.name)

# Sort by year (newest first)
publications.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)
```
The same redundant sorting pattern x.get('year', 0) or 0 appears here as well. This is the fourth occurrence across the scripts.
```js
    document.getElementById('noResult').style.display = visibleCount === 0 ? 'block' : 'none';
}

// Initialize search function for backwards compatibility
```
The backward-compatibility function `searchItem()` on lines 394-396 is defined, but it's unclear whether it's still called anywhere. If it exists for legacy code that's being removed, consider adding a deprecation comment, or remove it if it's no longer needed.
Suggested change:

```diff
  // Initialize search function for backwards compatibility
+ /**
+  * @deprecated Use filterPublications instead. Kept for backwards compatibility with legacy callers.
+  */
```
```python
# Try to extract from citation info
citation = article.get('citation', '')
# Use consistent year regex pattern (1900-2029)
year_match = re.search(r'\b(19\d{2}|20[0-2]\d)\b', citation)
```
The year range validation regex pattern `(19\d{2}|20[0-2]\d)` allows years up to 2029, but the validation on line 193 uses current_year + 1 as the upper bound. This creates an inconsistency where the regex could extract a year like 2029 from text, but it would then be filtered out by the validation check. Consider aligning these two approaches or documenting why they differ.
```python
lastname_start = len(parts) - 1
for i in range(len(parts) - 1, 0, -1):
    if parts[i - 1].lower() in COMPOUND_PREFIXES:
        lastname_start = i - 1
    elif i == len(parts) - 1:
        # If current part is not a prefix, stop looking
        break
    else:
        # If we found a prefix but then hit a non-prefix, keep the first prefix found
        continue
```
The compound last name detection logic has a flaw. In the loop on lines 45-53, when a prefix is found at position i-1, the code sets lastname_start = i - 1 but then continues the loop. This means if you have multiple consecutive prefixes like "de la Fontaine", the logic will correctly set lastname_start to the first prefix. However, the break condition on line 49-50 only breaks if the current part is not a prefix and we're at the last position. This could cause issues with names like "Jean de la Fontaine" where "de la" should both be part of the last name. Consider simplifying the logic to scan backwards until you find a non-prefix word.
```python
pubs, dup_count = remove_duplicates(pubs)

# Step 4: Sort by year (newest first)
pubs.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)
```
The sorting key uses the same redundant pattern x.get('year', 0) or 0. This is the third occurrence of this pattern across the scripts. Consider creating a shared utility function to extract the year value consistently, which would also make it easier to handle edge cases uniformly.
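A sketch of the shared helper the comment suggests (the name and placement are assumptions):

```python
def year_sort_key(pub: dict) -> int:
    """Year key for newest-first sorting; missing or invalid years sort last."""
    year = pub.get('year')
    return year if isinstance(year, int) and year > 0 else 0

# Usage in each script:
# pubs.sort(key=year_sort_key, reverse=True)
```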
```css
@media (min-width: 992px) {
    .pub-sidebar {
        /* Offset matches desktop header height to prevent overlap; update if header height changes */
        top: 100px;
    }
```
The inline comment mentions that the top offset "matches desktop header height to prevent overlap" and suggests updating if header height changes. However, this creates a maintenance burden as the value is hardcoded. Consider using CSS custom properties or calculating the offset dynamically with JavaScript based on the actual header height, which would make the code more maintainable.
```js
// Check type filter (case sensitive, controlled values)
const typeMatch = currentFilters.type === 'all' ||
    type === currentFilters.type;

// Check category filter (case sensitive, controlled values)
// Note: Data should be validated on page load to ensure only expected values are present
const catMatch = currentFilters.category === 'all' ||
    categories.includes(currentFilters.category);
```
The filter comparison logic on lines 360-361 and 365-366 uses strict equality for type and category checks, which is correct. However, the comment on line 364 mentions "Data should be validated on page load to ensure only expected values are present" but there's no actual validation code implemented. Consider adding a validation function that runs on page load to check for unexpected values and log warnings or errors, which would help catch data quality issues early.