feat: Publications sync system with SerpAPI integration#64

Open
Mhafsia wants to merge 3 commits into master from updates

Mhafsia (Collaborator) commented Jan 6, 2026

  • Added SerpAPI sync script to fetch publications from Google Scholar
  • Migrated 135 publications to single YAML file (_data/publications.yml)
  • Now 285 publications synced from Google Scholar
  • New sidebar UI with filters: Type, Year (Category temporarily deactivated)
  • Pill-shaped filter buttons with color coding (blue, teal)
  • Clickable tags on publications to filter
  • Normalized author names (Initial. Lastname format)
  • Added clean_publications.py and normalize_publications.py scripts
  • Removed old individual .md publication files

- Added SerpAPI sync script to fetch publications from Google Scholar
- Migrated 135 publications to single YAML file (_data/publications.yml)
- Now 285 publications synced from Google Scholar
- New sidebar UI with filters: Type, Year (Category temporarily disabled)
- Pill-shaped filter buttons with color coding (blue, teal)
- Clickable tags on publications to filter
- Normalized author names (Initial. Lastname format)
- Added clean_publications.py and normalize_publications.py scripts
- Removed old individual .md publication files
Copilot AI review requested due to automatic review settings January 6, 2026 22:33
netlify bot commented Jan 6, 2026

Deploy Preview for ppsp ready!

Name | Link
🔨 Latest commit | 910eac0
🔍 Latest deploy log | https://app.netlify.com/projects/ppsp/deploys/695effef673911000832f8ad
😎 Deploy Preview | https://deploy-preview-64--ppsp.netlify.app

Copilot AI left a comment

Pull request overview

This PR implements a comprehensive publications management system that migrates from individual markdown files to a centralized YAML-based approach with automated syncing from Google Scholar via SerpAPI. The changes include four Python utility scripts for synchronization, normalization, and data cleaning, plus a redesigned UI with interactive filtering capabilities.

Key Changes:

  • Automated publication syncing from Google Scholar using SerpAPI (285 total publications)
  • Migration of 135 existing publications from individual .md files to single _data/publications.yml
  • New sidebar-based filtering UI with Type, Year filters, and clickable tags
  • Author name normalization to "Initial. Lastname" format

Reviewed changes

Copilot reviewed 141 out of 143 changed files in this pull request and generated 21 comments.

File | Description
scripts/sync_publications.py | Main sync script that fetches publications from Google Scholar API and adds new entries to YAML
scripts/normalize_publications.py | Normalizes author names, journal titles, and publication types for consistency
scripts/clean_publications.py | Cleans publication data by removing duplicates, fixing doubled years, and extracting correct years
scripts/migrate_to_yaml.py | One-time migration script to convert .md frontmatter to YAML format
scripts/requirements.txt | Python dependencies (requests, pyyaml)
_includes/publications.html | Redesigned publications page with sidebar filters, pill-style buttons, and JavaScript filtering logic
.gitignore | Added .env files to prevent API key exposure
_papers/papers/*.md | Removed 135+ individual publication markdown files (migrated to YAML)

Comment on lines 36 to 42
if len(parts) >= 2:
    # Last part is lastname, first parts are first names
    lastname = parts[-1]
    firstnames = parts[:-1]
    # Convert first names to initials
    initials = ' '.join([f"{n[0]}." for n in firstnames if n])
    normalized.append(f"{initials} {lastname}")

Copilot AI Jan 6, 2026

The author name normalization logic assumes "last part is lastname, first parts are first names", but this doesn't handle compound last names (e.g., "Van Der Berg", "De La Cruz") or names with suffixes (e.g., "John Smith Jr."). This could result in incorrect name formatting for authors with these name patterns.
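
One way to address the compound-surname and suffix cases is sketched below. This is a minimal illustration, not the code in the PR: `normalize_author`, `PREFIXES`, and `SUFFIXES` are hypothetical names, and both sets are deliberately short and would need extending for real data.

```python
# Hypothetical helper; the prefix and suffix sets are illustrative, not exhaustive.
PREFIXES = {"van", "von", "der", "den", "de", "la", "le", "du", "di", "del"}
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}


def normalize_author(name: str) -> str:
    """Return 'I. Lastname', keeping surname prefixes and suffixes intact."""
    parts = name.split()
    suffix = ""
    # Detach a trailing suffix such as "Jr." so it is not mistaken for the surname.
    if len(parts) > 2 and parts[-1].lower() in SUFFIXES:
        suffix = " " + parts.pop()
    if len(parts) < 2:
        return name.strip()
    # Walk left from the final token while the preceding token is a prefix,
    # but always leave at least one token to serve as the first name.
    start = len(parts) - 1
    while start > 1 and parts[start - 1].lower() in PREFIXES:
        start -= 1
    initials = " ".join(f"{p[0]}." for p in parts[:start])
    return f"{initials} {' '.join(parts[start:])}{suffix}"
```

With these assumptions, "Jan Van Der Berg" keeps its full surname and "John Smith Jr." keeps its suffix. Names whose middle name happens to equal a prefix would still be misread, so a manual override list may be needed.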

Copilot uses AI. Check for mistakes.
# If year is 2025 (our fallback), try to find real year in publication field
if current_year == 2025 and publication:
    # Look for year pattern in publication
    year_match = re.search(r'\b(19\d{2}|20[0-2]\d)\b', publication)

Copilot AI Jan 6, 2026

The regex pattern for extracting years in clean_publications.py is '\b(19\d{2}|20[0-2]\d)\b' which will match years from 1900-2029. However, the pattern in sync_publications.py uses '\b(19|20)\d{2}\b' which matches 1900-2099. This inconsistency could lead to different behavior when extracting years. Consider using the same pattern in both scripts for consistency.
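
A shared constant is one way to keep the two scripts in step. The sketch below assumes a common module that both scripts could import; `YEAR_RE` and `find_year` are hypothetical names that do not appear in the PR.

```python
import re

# Assumed shared constant; in the PR each script defines its own pattern.
YEAR_RE = re.compile(r"\b(?:19\d{2}|20[0-2]\d)\b")  # matches 1900-2029


def find_year(text: str):
    """Return the first plausible year in text as an int, or None."""
    match = YEAR_RE.search(text)
    return int(match.group(0)) if match else None
```

The narrower pattern rejects page ranges such as "2050-2060" that the broader `\b(19|20)\d{2}\b` form would accept as years.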

Copilot uses AI. Check for mistakes.
Comment on lines 86 to 87
<button class="filter-pill filter-pill-year" data-filter="2019-"
onclick="toggleFilter(this, 'year')">Before 2020</button>

Copilot AI Jan 6, 2026

The year filter "Before 2020" uses the filter value "2019-" which is checked with a special case in the applyFilters function. However, this naming is misleading - "2019-" suggests a range starting from 2019, when it actually means "before 2020". Consider using a more explicit value like "pre-2020" or "before-2020" to make the intent clearer.

Comment on lines +111 to +113
def normalize_title(title: str) -> str:
    """Normalize title for comparison."""
    return re.sub(r'[^a-z0-9]', '', title.lower())

Copilot AI Jan 6, 2026

The normalize_title function removes all non-alphanumeric characters for comparison, which could cause false positives. For example, "COVID-19 Study" and "COVID 19 Study" would be considered the same (correctly), but "Re-examination" and "Reexamination" would also match (potentially incorrectly if they're different papers). Consider a more nuanced normalization that preserves some punctuation context or use a fuzzy matching algorithm for better duplicate detection.
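
The fuzzy-matching idea could look roughly like the sketch below, which uses the standard library's `difflib.SequenceMatcher`. The helper names and the 0.95 threshold are assumptions for illustration and would need tuning against the real publication list.

```python
import re
from difflib import SequenceMatcher


def title_key(title: str) -> str:
    """Lowercase and collapse punctuation runs to single spaces, keeping word boundaries."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_probable_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """Compare normalized titles by similarity ratio instead of exact equality."""
    return SequenceMatcher(None, title_key(a), title_key(b)).ratio() >= threshold
```

Because `title_key` keeps word boundaries, an exact comparison of keys treats "Re-examination" and "Reexamination" as different strings, while the ratio check lets the caller decide how much difference to tolerate.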

Comment on lines 154 to 156
except requests.exceptions.RequestException as e:
    print(f"❌ API Error: {e}")
    return

Copilot AI Jan 6, 2026

The error handling for RequestException catches all request-related errors generically and simply returns, but doesn't distinguish between different error types (network errors, timeout, 401 unauthorized, 429 rate limit, etc.). Consider adding more specific error handling to provide better diagnostics, especially for common issues like invalid API keys (401/403) or rate limiting (429).
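
A small status-to-message mapping is one way to get the more specific diagnostics described here. The function below is a hypothetical helper kept free of third-party imports. In the sync script it might be called from an `except requests.exceptions.HTTPError as e` branch with `e.response.status_code`, with `requests.exceptions.Timeout` and `requests.exceptions.ConnectionError` handled in their own branches.

```python
def describe_failure(status_code: int) -> str:
    """Map an HTTP status code to a short diagnostic message."""
    if status_code in (401, 403):
        return "authentication failed: check SERPAPI_KEY"
    if status_code == 429:
        return "rate limited: wait before retrying"
    if 500 <= status_code < 600:
        return "server error: retry later"
    return f"unexpected HTTP status {status_code}"
```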

Comment on lines 347 to 353
// Check type filter (case insensitive)
const typeMatch = currentFilters.type === 'all' ||
type.toLowerCase() === currentFilters.type.toLowerCase();

// Check category filter
const catMatch = currentFilters.category === 'all' ||
categories.toLowerCase().includes(currentFilters.category.toLowerCase());

Copilot AI Jan 6, 2026

Both the type comparison and the category includes() check lowercase each side, so matching is case-insensitive throughout. Since publication types and categories are controlled values, case-sensitive comparison would be more predictable.

Suggested change
- // Check type filter (case insensitive)
- const typeMatch = currentFilters.type === 'all' ||
- type.toLowerCase() === currentFilters.type.toLowerCase();
- // Check category filter
- const catMatch = currentFilters.category === 'all' ||
- categories.toLowerCase().includes(currentFilters.category.toLowerCase());
+ // Check type filter (case sensitive, controlled values)
+ const typeMatch = currentFilters.type === 'all' ||
+ type === currentFilters.type;
+ // Check category filter (case sensitive, controlled values)
+ const catMatch = currentFilters.category === 'all' ||
+ categories.includes(currentFilters.category);

Comment on lines 46 to 56
seen_titles = {}
unique_pubs = []
duplicates_removed = 0

for pub in pubs:
    title = pub.get('title', '').lower().strip()
    # Normalize title for comparison
    normalized = re.sub(r'[^a-z0-9]', '', title)

    if normalized and normalized not in seen_titles:
        seen_titles[normalized] = True

Copilot AI Jan 6, 2026

The duplicate removal uses a dictionary where keys are normalized titles but the value is just 'True'. Using a set would be more appropriate and memory-efficient for this use case. Consider changing 'seen_titles = {}' to 'seen_titles = set()' and updating the logic accordingly.

Comment on lines 149 to 151
top: 100px;
}


Copilot AI Jan 6, 2026

The sticky sidebar has 'top: 100px' which may not work well on mobile devices or if the site header height changes. Consider using a more flexible approach with proper responsive breakpoints and calculating the offset based on the actual header height, or ensure this is tested across different screen sizes and header configurations.

Suggested change
- top: 100px;
- }
+ top: 0;
+ }
+ @media (min-width: 992px) {
+ .pub-sidebar {
+ top: 100px;
+ }
+ }

Run once to migrate, then delete this script.
"""

import os

Copilot AI Jan 6, 2026

Import of 'os' is not used.

import re
import yaml
import requests
from datetime import datetime

Copilot AI Jan 6, 2026

Import of 'datetime' is not used.

- Use dynamic year fallback instead of hardcoded value
- Better error handling for API requests
- Handle compound last names and multiple initials
- Add ARIA labels for accessibility
- Use data-filter-group for robust filter handling
- Add responsive sticky sidebar
- Remove unused imports

Copilot AI left a comment

Pull request overview

Copilot reviewed 141 out of 143 changed files in this pull request and generated 10 comments.


Comment on lines 34 to 36
if re.match(r'^([A-Z]\.\s*)+[A-Za-z-]+$', author):
    normalized.append(author)
    continue

Copilot AI Jan 8, 2026

The regex pattern for detecting already-normalized author names on line 34 has a flaw. The pattern ^([A-Z]\.\s*)+[A-Za-z-]+$ will match names like "J. Smith-Jones" correctly, but it won't match compound last names with multiple parts separated by spaces (e.g., "J. van der Berg"). The pattern assumes the last name is a single continuous token, which doesn't handle compound last names with prefixes like "van", "von", "de", etc. that are already handled in the normalization logic starting at line 44.

Comment on lines 48 to 66
def remove_duplicates(pubs: list) -> list:
    """Remove duplicate publications based on title similarity."""
    seen_titles = set()  # Use set instead of dict for efficiency
    unique_pubs = []
    duplicates_removed = 0

    for pub in pubs:
        title = pub.get('title', '').lower().strip()
        # Normalize title for comparison
        normalized = re.sub(r'[^a-z0-9]', '', title)

        if normalized and normalized not in seen_titles:
            seen_titles.add(normalized)
            unique_pubs.append(pub)
        else:
            duplicates_removed += 1
            print(f" 🔄 Removed duplicate: {pub.get('title', '')[:60]}...")

    return unique_pubs, duplicates_removed

Copilot AI Jan 8, 2026

The duplicate removal logic uses a set to track normalized titles but returns a tuple on line 66 when the function signature on line 48 indicates it should return a list. While Python can handle this in most cases, the inconsistency between the declared return type (list) and actual return type (tuple) could cause issues. The function should consistently return a list as indicated by its signature.

Comment on lines +35 to +48
<button class="filter-pill active" data-filter="all" onclick="toggleFilter(this, 'type')"
aria-label="Show all publication types">All</button>
<button class="filter-pill" data-filter="Journal Paper" onclick="toggleFilter(this, 'type')"
aria-label="Filter by type: Journal Paper">Journal
Paper</button>
<button class="filter-pill" data-filter="Preprint" onclick="toggleFilter(this, 'type')"
aria-label="Filter by type: Preprint">Preprint</button>
<button class="filter-pill" data-filter="Book_Chapter" onclick="toggleFilter(this, 'type')"
aria-label="Filter by type: Book Chapter">Book
Chapter</button>
<button class="filter-pill" data-filter="Comment" onclick="toggleFilter(this, 'type')"
aria-label="Filter by type: Comment">Comment</button>
<button class="filter-pill" data-filter="Poster_Conference" onclick="toggleFilter(this, 'type')"
aria-label="Filter by type: Poster/Conference">Poster/Conference</button>

Copilot AI Jan 8, 2026

In the HTML template, the filter pills use inline onclick handlers which mix JavaScript with HTML. This is generally considered a maintainability anti-pattern. Consider moving to event listeners in the JavaScript section instead of inline event handlers. This would make the code more maintainable and follow modern web development best practices.

Comment on lines +20 to +32
def load_env():
    env_path = Path(__file__).parent / '.env'
    env_vars = {}
    if env_path.exists():
        with open(env_path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#') and '=' in line:
                    key, value = line.split('=', 1)
                    # Strip quotes from values (handles "value" or 'value')
                    value = value.strip().strip('"').strip("'")
                    env_vars[key.strip()] = value
    return env_vars

Copilot AI Jan 8, 2026

The function load_env() on lines 20-32 manually parses the .env file. While this works, it doesn't handle all edge cases that standard .env parsers handle (e.g., escaped characters, multiline values, comments after values). Consider using a standard library like python-dotenv which is more robust and handles these edge cases correctly. Add it to requirements.txt if adopted.
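
If adding a dependency is not wanted, the hand-written parser could at least cover one of the edge cases named here, comments after values. The sketch below is a hypothetical per-line helper (`parse_env_line` is not in the PR). It still does not handle escaped characters or multiline values, which is where a dedicated library would help.

```python
def parse_env_line(line: str):
    """Parse one KEY=VALUE line; return (key, value) or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, value = line.split("=", 1)
    value = value.strip()
    if value[:1] in ('"', "'"):
        # Quoted value: keep only the text between the matching quotes,
        # so a '#' inside the quotes is not treated as a comment.
        quote = value[0]
        end = value.find(quote, 1)
        if end != -1:
            value = value[1:end]
    else:
        # Unquoted value: drop a trailing " # comment" if present.
        value = value.split(" #", 1)[0].rstrip()
    return key.strip(), value
```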

Comment on lines 317 to 331
// Fallback: legacy behavior based on specific filter values
document.querySelectorAll('.sidebar-section').forEach(section => {
const buttons = section.querySelectorAll('.filter-pill[data-filter]');
if (buttons.length > 0) {
const hasType = section.querySelector('[data-filter="Journal Paper"]');
const hasCategory = section.querySelector('[data-filter="Computational"]');

if ((filterGroup === 'type' && hasType) || (filterGroup === 'category' && hasCategory)) {
buttons.forEach(btn => {
btn.classList.toggle('active', btn.dataset.filter === value);
});
}
}
});
}

Copilot AI Jan 8, 2026

The JavaScript function filterByTag has a fallback mechanism (lines 317-331) in case the data-filter-group attributes are not found. However, this fallback uses brittle selectors that look for specific filter values like 'Journal Paper' and 'Computational'. If these values change or new filter types are added, this fallback code will break silently. Consider removing the fallback entirely and ensuring all filter sections have the required data-filter-group attribute, or document why this fallback is necessary.

<div class="pub-tags">
{% assign type_display = project.type | replace: "Poster_Conference", "Poster/Conference" |
replace: "_", " " %}
<button class="pub-tag" onclick="filterByTag('{{ project.type }}', 'type')">{{ type_display

Copilot AI Jan 8, 2026

The aria-label attributes on filter buttons (lines 35-48, 76-92) provide good accessibility support. However, the pub-tag buttons created in lines 119-120 that are clickable don't have aria-labels or roles, which could make them less accessible to screen reader users. Consider adding appropriate ARIA attributes to these interactive elements.

Suggested change
- <button class="pub-tag" onclick="filterByTag('{{ project.type }}', 'type')">{{ type_display
+ <button class="pub-tag" onclick="filterByTag('{{ project.type }}', 'type')" aria-label="Filter by type: {{ type_display }}">{{ type_display

Comment on lines 148 to 158
.pub-sidebar {
background: #f8f9fa;
border-radius: 8px;
padding: 20px 0px;
position: sticky;
top: 0;
}

@media (min-width: 992px) {
.pub-sidebar {
top: 100px;

Copilot AI Jan 8, 2026

The sticky sidebar CSS uses top: 0 for small screens and top: 100px for large screens (lines 152-160). The 100px offset assumes a fixed header height, but if the header size changes or is responsive, this could cause layout issues. Consider using CSS variables or calculating the header height dynamically to make this more maintainable.

Suggested change
- .pub-sidebar {
- background: #f8f9fa;
- border-radius: 8px;
- padding: 20px 0px;
- position: sticky;
- top: 0;
- }
- @media (min-width: 992px) {
- .pub-sidebar {
- top: 100px;
+ :root {
+ /* Default sticky offset for publication sidebar (can be overridden globally) */
+ --pub-sidebar-offset: 0;
+ }
+ .pub-sidebar {
+ background: #f8f9fa;
+ border-radius: 8px;
+ padding: 20px 0px;
+ position: sticky;
+ top: var(--pub-sidebar-offset, 0);
+ }
+ @media (min-width: 992px) {
+ :root {
+ --pub-sidebar-offset: 100px;

# Try to extract from citation info
citation = article.get('citation', '')
# Use consistent year regex pattern (1900-2029)
year_match = re.search(r'\b(19\d{2}|20[0-2]\d)\b', citation)

Copilot AI Jan 8, 2026

The year-extraction regex is now the same in both files: sync_publications.py line 68 and clean_publications.py line 37 both use \b(19\d{2}|20[0-2]\d)\b (1900-2029). However, matching years up to 2029 may be too permissive. Consider accepting only years up to the current year plus one, to avoid false positives from text that happens to contain numbers in this range.
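
A bound that follows the calendar could be sketched as below. `extract_year` and its `max_year` parameter are hypothetical, and the lower bound of 1900 is carried over from the existing pattern.

```python
import re
from datetime import date


def extract_year(text: str, max_year=None):
    """Return the first 4-digit year in [1900, max_year], or None."""
    if max_year is None:
        # Assumed tolerance: allow next year's date for in-press items.
        max_year = date.today().year + 1
    for token in re.findall(r"\b(?:19|20)\d{2}\b", text):
        year = int(token)
        if 1900 <= year <= max_year:
            return year
    return None
```

Passing `max_year` explicitly keeps the function deterministic in tests, while the default keeps the scripts from needing a regex update each decade.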

Comment on lines +35 to +36
SERPAPI_KEY = env.get('SERPAPI_KEY', os.getenv('SERPAPI_KEY', ''))
AUTHOR_ID = env.get('AUTHOR_ID', 'TakXk9MAAAAJ') # Guillaume Dumas's Google Scholar ID

Copilot AI Jan 8, 2026

The SERPAPI_KEY is loaded from environment variables but there's no validation of its format or length. While there's a basic check for empty string on line 143, consider adding validation to ensure the API key meets minimum requirements (e.g., minimum length check) before making API requests, which could help catch configuration errors earlier.
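
An early sanity check along these lines might look like the sketch below. The function name and the minimum length of 20 are assumptions chosen for illustration. They are not documented SerpAPI requirements.

```python
def validate_api_key(key: str, min_length: int = 20) -> str:
    """Return the stripped key, or raise ValueError if it is missing or looks malformed."""
    key = (key or "").strip()
    if not key:
        raise ValueError("SERPAPI_KEY is not set")
    # min_length is an assumed threshold, not a documented API rule.
    if len(key) < min_length or any(ch.isspace() for ch in key):
        raise ValueError("SERPAPI_KEY looks malformed")
    return key
```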

Comment on lines +191 to +193
# Filter out empty titles and publications with unreasonable years (0 or far future)
current_year = get_current_year()
fetched_pubs = [p for p in fetched_pubs if p['title'] and 0 < p.get('year', 0) <= current_year + 1]

Copilot AI Jan 8, 2026

The year filtering logic on line 193 uses current_year + 1 as the upper bound. This means publications dated one year in the future will be accepted. While there's a comment on lines 157-158 mentioning this is expected behavior, accepting future-dated publications could lead to incorrect data if the API returns publications with incorrect years. Consider whether this tolerance is necessary or if it should be limited to the current year only.


Copilot AI left a comment

Pull request overview

Copilot reviewed 141 out of 143 changed files in this pull request and generated 9 comments.


Comment on lines 90 to 92
# Use current year as fallback instead of hardcoded value
fallback_year = get_current_year()
parsed_year = int(year) if year and str(year).isdigit() else fallback_year

Copilot AI Jan 8, 2026

The fallback year logic uses get_current_year() when year data is missing. This means publications with missing year information will always appear as if they were published in the current year, which could be misleading. Consider using a special sentinel value (like 0 or None) for unknown years, or at minimum add a flag to indicate the year is estimated.
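
The sentinel approach can be sketched with `None` standing for "year unknown". `parse_year` is a hypothetical helper. Callers such as sorting or filtering code would then need to handle `None` explicitly.

```python
from typing import Optional


def parse_year(raw) -> Optional[int]:
    """Return the year as an int, or None when it is missing or not numeric."""
    text = str(raw).strip() if raw is not None else ""
    # isdecimal() only accepts characters that int() can convert.
    return int(text) if text.isdecimal() else None
```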

Comment on lines 20 to 21
# Extract frontmatter between --- markers
match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)

Copilot AI Jan 8, 2026

The frontmatter parsing uses a regex that may not handle edge cases well. For example, if the content contains --- within a code block in the frontmatter, the regex could match incorrectly. Consider using a YAML parser with proper frontmatter support or adding more robust boundary detection.
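
A line-based split is one way to make the boundary detection stricter than the regex. In the sketch below, `split_front_matter` is a hypothetical helper that only treats a line consisting solely of `---` as a delimiter. The returned front matter string could then be handed to `yaml.safe_load`.

```python
def split_front_matter(content: str):
    """Return (front_matter, body); front_matter is None when no block is found."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, content
    for i in range(1, len(lines)):
        # The closing marker must sit alone on its line.
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return None, content
```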

Comment on lines 316 to 330
} else {
// Fallback: legacy behavior based on specific filter values
document.querySelectorAll('.sidebar-section').forEach(section => {
const buttons = section.querySelectorAll('.filter-pill[data-filter]');
if (buttons.length > 0) {
const hasType = section.querySelector('[data-filter="Journal Paper"]');
const hasCategory = section.querySelector('[data-filter="Computational"]');

if ((filterGroup === 'type' && hasType) || (filterGroup === 'category' && hasCategory)) {
buttons.forEach(btn => {
btn.classList.toggle('active', btn.dataset.filter === value);
});
}
}
});

Copilot AI Jan 8, 2026

The filter toggle logic has a fallback mechanism (lines 316-332) for when data-filter-group attributes are not present, but the HTML always includes these attributes. This fallback code is likely dead code and should be removed to improve maintainability.

Suggested change
- } else {
- // Fallback: legacy behavior based on specific filter values
- document.querySelectorAll('.sidebar-section').forEach(section => {
- const buttons = section.querySelectorAll('.filter-pill[data-filter]');
- if (buttons.length > 0) {
- const hasType = section.querySelector('[data-filter="Journal Paper"]');
- const hasCategory = section.querySelector('[data-filter="Computational"]');
- if ((filterGroup === 'type' && hasType) || (filterGroup === 'category' && hasCategory)) {
- buttons.forEach(btn => {
- btn.classList.toggle('active', btn.dataset.filter === value);
- });
- }
- }
- });

Comment on lines 374 to 379
const typeMatch = currentFilters.type === 'all' ||
type === currentFilters.type;

// Check category filter (case sensitive, controlled values)
const catMatch = currentFilters.category === 'all' ||
categories.includes(currentFilters.category);

Copilot AI Jan 8, 2026

The filtering logic performs case-sensitive exact matching for type and category filters (lines 374-379), which is correct for controlled values. However, the code should validate that the data attributes contain only expected values to prevent filter mismatches. Consider adding data validation or normalization when the page loads.

Comment on lines 34 to 36
if re.match(r'^([A-Z]\.\s*)+[A-Za-z-]+$', author):
    normalized.append(author)
    continue

Copilot AI Jan 8, 2026

The regex pattern for author names ^([A-Z]\.\s*)+[A-Za-z-]+$ on line 34 doesn't handle compound last names with prefixes (like "van der Berg"). While compound prefixes are handled later in the code, names already in the format "J. van der Berg" won't be recognized by this regex and will be unnecessarily reprocessed. Consider updating the pattern to: ^([A-Z]\.\s*)+([a-z]+\s+)*[A-Za-z-]+$
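
The proposed pattern can be checked in isolation. The wrapper below is a hypothetical test harness around the regex quoted in the comment. Note that it only recognizes lowercase prefixes, so "J. De La Cruz" would still be reprocessed.

```python
import re

# Pattern proposed in the review comment above.
NORMALIZED_AUTHOR = re.compile(r"^([A-Z]\.\s*)+([a-z]+\s+)*[A-Za-z-]+$")


def is_normalized(author: str) -> bool:
    """True when the name already looks like 'I. lastname' or 'I. van der Lastname'."""
    return NORMALIZED_AUTHOR.match(author) is not None
```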

Comment on lines 44 to 48
for i in range(len(parts) - 1, 0, -1):
    if parts[i - 1].lower() in COMPOUND_PREFIXES:
        lastname_start = i - 1
    else:
        break

Copilot AI Jan 8, 2026

The compound last name detection logic (lines 44-48) iterates backwards and breaks on the first non-prefix word. Consecutive prefixes, such as "de la" in "Jean de la Fontaine", are captured, but any earlier prefix that is separated from the surname by a non-prefix word is missed.

Comment on lines 48 to 66
def remove_duplicates(pubs: list) -> list:
    """Remove duplicate publications based on title similarity."""
    seen_titles = set()  # Use set instead of dict for efficiency
    unique_pubs = []
    duplicates_removed = 0

    for pub in pubs:
        title = pub.get('title', '').lower().strip()
        # Normalize title for comparison
        normalized = re.sub(r'[^a-z0-9]', '', title)

        if normalized and normalized not in seen_titles:
            seen_titles.add(normalized)
            unique_pubs.append(pub)
        else:
            duplicates_removed += 1
            print(f" 🔄 Removed duplicate: {pub.get('title', '')[:60]}...")

    return unique_pubs, duplicates_removed

Copilot AI Jan 8, 2026

The duplicate removal function returns a tuple (unique_pubs, dup_count) but the function signature on line 48 doesn't indicate this. Consider adding type hints to make the return type explicit: def remove_duplicates(pubs: list) -> tuple[list, int]:

<div class="sidebar-section">
<input type="search" class="form-control rounded" id="search-input" onkeyup="filterPublications()"
placeholder="Search..." autocapitalize="off" autocomplete="off" autocorrect="off" role="textbox"
spellcheck="false">

Copilot AI Jan 8, 2026

Missing ARIA labels for the search input field. Add appropriate labels for screen reader accessibility. Example: aria-label="Search publications" on the input element.

Suggested change
- spellcheck="false">
+ spellcheck="false" aria-label="Search publications">

}

@media (min-width: 992px) {
.pub-sidebar {

Copilot AI Jan 8, 2026

The sticky sidebar positioning uses top: 100px on large screens (line 158) but this value appears to be a magic number. This should either be documented or derived from the actual header height to ensure it works correctly with different header sizes.

Suggested change
- .pub-sidebar {
+ .pub-sidebar {
+ /* Offset matches desktop header height to prevent overlap; update if header height changes */

…nd code quality

- Replace fallback year with None to avoid misleading publication dates
- Improve YAML frontmatter parsing with robust boundary detection
- Remove dead legacy code in publications filter
- Add aria-label for search input accessibility
- Add documentation for sticky sidebar top offset
- Fix author name regex to handle compound names (van der Berg, de la)
- Improve compound last name detection logic
- Add proper type hints for remove_duplicates function
- Add data validation note for publication filters

Copilot AI left a comment

Pull request overview

Copilot reviewed 141 out of 143 changed files in this pull request and generated 10 comments.


def save_publications(publications: list):
    """Save publications to YAML file."""
    # Sort by year (newest first)
    publications.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)

Copilot AI Jan 8, 2026

The sorting key uses x.get('year', 0) or 0 which treats both missing years and years set to 0 the same way. However, line 193 filters out publications with year <= 0, so any publications with year=0 or None should have already been removed before this function is called. The or 0 part is redundant defensive coding. Consider simplifying to just x.get('year', 0) or x['year'] if you're confident the key exists.

changes += 1

# Sort by year (newest first)
pubs.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)

Copilot AI Jan 8, 2026

The sorting key uses x.get('year', 0) or 0 which is redundant. If the key doesn't exist, get returns 0, and 0 or 0 is still 0. Additionally, this pattern treats publications with missing years the same as publications from year 0. Consider using a more explicit approach like x.get('year') or 0 if None should be treated as 0, or raise an error if year is expected to always be present.

Comment on lines +48 to +52
def remove_duplicates(pubs: list) -> tuple[list, int]:
    """Remove duplicate publications based on title similarity.

    Returns:
        tuple[list, int]: Tuple of (unique_publications, duplicates_removed_count)

Copilot AI Jan 8, 2026

The function return type annotation is tuple[list, int] which uses the newer Python 3.9+ syntax for type hints. However, the scripts don't specify a minimum Python version in requirements.txt or in the shebang. For better compatibility with Python 3.7-3.8, consider using Tuple[list, int] from the typing module, or add a minimum Python version requirement (e.g., in a setup.py or README).
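
A minimal sketch of the first option, assuming the scripts need to run on Python 3.7-3.8. The exact-match comparison on a normalized title is a stand-in for the PR's title-similarity logic, which is not shown in this thread; only the annotation style is the point here.

```python
from typing import Any, Dict, List, Tuple


def remove_duplicates(pubs: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], int]:
    """Return (unique_publications, duplicates_removed_count).

    Assumption: two entries are duplicates when their titles match after
    lowercasing and collapsing whitespace. The real script may use a
    fuzzier similarity measure.
    """
    seen = set()
    unique = []
    for pub in pubs:
        # Normalize the title so case and spacing differences do not matter
        key = " ".join(str(pub.get("title", "")).lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(pub)
    return unique, len(pubs) - len(unique)
```

The typing.Tuple and typing.List forms are accepted on Python 3.7 and later, so this signature works unchanged if the project later sets 3.9 as its minimum.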

errors.append(md_file.name)

# Sort by year (newest first)
publications.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)

Copilot AI Jan 8, 2026

The same redundant sorting pattern x.get('year', 0) or 0 appears here as well. This is the fourth occurrence across the scripts.

document.getElementById('noResult').style.display = visibleCount === 0 ? 'block' : 'none';
}

// Initialize search function for backwards compatibility

Copilot AI Jan 8, 2026

The backward compatibility function searchItem() on lines 394-396 is defined but it's unclear if it's still being called anywhere. If this is for legacy code that's being removed, consider adding a deprecation comment or removing it if no longer needed.

Suggested change
- // Initialize search function for backwards compatibility
+ // Initialize search function for backwards compatibility
+ /**
+  * @deprecated Use filterPublications instead. Kept for backwards compatibility with legacy callers.
+  */

# Try to extract from citation info
citation = article.get('citation', '')
# Use consistent year regex pattern (1900-2029)
year_match = re.search(r'\b(19\d{2}|20[0-2]\d)\b', citation)

Copilot AI Jan 8, 2026

The year range validation regex pattern (19\d{2}|20[0-2]\d) allows years up to 2029, but the validation on line 193 uses current_year + 1 as the upper bound. This creates an inconsistency where the regex could extract a year like 2029 from text, but then it would be filtered out by the validation check. Consider aligning these two approaches or documenting why they differ.
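
One way to align the two checks, sketched under assumptions: the regex matches any four-digit year from 1900 to 2099, and a single upper bound (current year + 1, as in the validation described above) decides what is accepted. The names extract_year, YEAR_PATTERN, and max_year are hypothetical and do not come from the PR.

```python
import re
from datetime import date
from typing import Optional

# Broad pattern: the numeric bound below does the real filtering
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")


def extract_year(citation: str, max_year: Optional[int] = None) -> Optional[int]:
    """Return the first plausible year in the citation text, or None."""
    if max_year is None:
        # Same upper bound as the validation step: next calendar year
        max_year = date.today().year + 1
    for match in YEAR_PATTERN.finditer(citation):
        year = int(match.group(1))
        if 1900 <= year <= max_year:
            return year
    return None
```

Because the bound lives in one place, a four-digit number that is out of range (a page number, for instance) is skipped and the search continues, instead of being extracted and then discarded later.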

Comment on lines +44 to +53
lastname_start = len(parts) - 1
for i in range(len(parts) - 1, 0, -1):
    if parts[i - 1].lower() in COMPOUND_PREFIXES:
        lastname_start = i - 1
    elif i == len(parts) - 1:
        # If current part is not a prefix, stop looking
        break
    else:
        # If we found a prefix but then hit a non-prefix, keep the first prefix found
        continue

Copilot AI Jan 8, 2026

The compound last name detection logic has a flaw. In the loop on lines 45-53, when a prefix is found at position i-1, the code sets lastname_start = i - 1 and keeps scanning, so consecutive prefixes such as "de la" in "Jean de la Fontaine" are handled correctly. However, the break on lines 49-50 fires only on the first iteration (i == len(parts) - 1). On later iterations a non-prefix word falls through to continue, and the scan keeps going toward the start of the name. A prefix-like word that appears earlier, separated from the last name by a non-prefix word, would then be pulled into the last name together with everything after it. Consider simplifying the logic to scan backwards and stop at the first non-prefix word.
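
A minimal sketch of the simplified backward scan. The COMPOUND_PREFIXES set below is illustrative only (the PR's actual list is not shown here), and split_name is a hypothetical helper name.

```python
from typing import Tuple

# Illustrative prefix list; the real script defines its own COMPOUND_PREFIXES
COMPOUND_PREFIXES = {"van", "von", "de", "der", "den", "la", "le", "du", "di", "da"}


def split_name(full_name: str) -> Tuple[str, str]:
    """Split a full name into (given_names, last_name)."""
    parts = full_name.split()
    if not parts:
        return "", ""
    # Start with the final word as the last name
    lastname_start = len(parts) - 1
    # Walk backwards while the preceding word is a prefix; stop at the
    # first non-prefix word so earlier words are never absorbed
    while lastname_start > 0 and parts[lastname_start - 1].lower() in COMPOUND_PREFIXES:
        lastname_start -= 1
    return " ".join(parts[:lastname_start]), " ".join(parts[lastname_start:])
```

With this version, "Jean de la Fontaine" yields the last name "de la Fontaine", while a name such as "De Marco van Berg" keeps "De Marco" as given names because the scan stops at "Marco".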

pubs, dup_count = remove_duplicates(pubs)

# Step 4: Sort by year (newest first)
pubs.sort(key=lambda x: x.get('year', 0) or 0, reverse=True)

Copilot AI Jan 8, 2026

The sorting key uses the same redundant pattern x.get('year', 0) or 0. This is the third occurrence of this pattern across the scripts. Consider creating a shared utility function to extract the year value consistently, which would also make it easier to handle edge cases uniformly.
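
Such a shared helper might look like the sketch below. The names year_sort_key and sort_newest_first are hypothetical, and the sketch assumes years are stored as integers (or None) in the YAML data.

```python
from typing import Any, Dict, List


def year_sort_key(pub: Dict[str, Any]) -> int:
    """Return the publication year, or 0 when it is missing, None, or invalid."""
    year = pub.get("year")
    return year if isinstance(year, int) and year > 0 else 0


def sort_newest_first(pubs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a new list sorted by year, newest first; undated entries go last."""
    return sorted(pubs, key=year_sort_key, reverse=True)
```

Each script could then call sort_newest_first(pubs) instead of repeating the lambda, and any change to how undated publications are ordered would happen in one place.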

Comment on lines +156 to +160
@media (min-width: 992px) {
  .pub-sidebar {
    /* Offset matches desktop header height to prevent overlap; update if header height changes */
    top: 100px;
  }

Copilot AI Jan 8, 2026

The inline comment mentions that the top offset "matches desktop header height to prevent overlap" and suggests updating if header height changes. However, this creates a maintenance burden as the value is hardcoded. Consider using CSS custom properties or calculating the offset dynamically with JavaScript based on the actual header height, which would make the code more maintainable.

Comment on lines +359 to +366
// Check type filter (case sensitive, controlled values)
const typeMatch = currentFilters.type === 'all' ||
type === currentFilters.type;

// Check category filter (case sensitive, controlled values)
// Note: Data should be validated on page load to ensure only expected values are present
const catMatch = currentFilters.category === 'all' ||
categories.includes(currentFilters.category);

Copilot AI Jan 8, 2026

The filter comparison logic on lines 360-361 and 365-366 uses strict equality for type and category checks, which is correct. However, the comment on line 364 mentions "Data should be validated on page load to ensure only expected values are present" but there's no actual validation code implemented. Consider adding a validation function that runs on page load to check for unexpected values and log warnings or errors, which would help catch data quality issues early.
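
The comment asks for a check in the page's JavaScript. As a complementary option, the same kind of check could run offline in the Python scripts before the YAML file is written. This is a different technique from the page-load validation suggested above. The function name and the allowed values in the usage note are hypothetical, since the PR's actual type and category lists are not shown here.

```python
from typing import Any, Dict, Iterable, List


def find_unexpected_values(
    pubs: List[Dict[str, Any]], field: str, allowed: Iterable[str]
) -> List[str]:
    """Return one warning per value of `field` that is not in `allowed`."""
    allowed_set = set(allowed)
    problems = []
    for index, pub in enumerate(pubs):
        value = pub.get(field)
        # A field may hold a single value or a list (e.g. several categories)
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item not in allowed_set:
                problems.append(f"publication {index}: unexpected {field}={item!r}")
    return problems
```

For example, find_unexpected_values(pubs, "type", {"journal", "preprint"}) would flag an entry whose type is spelled "Journal", which the case-sensitive filter on the page would otherwise silently never match.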
