Alex Kalyvas kalyvask

AI Product Manager. Stanford MBA (focusing on AI and Stanford CS classes). 7 years shipping technical products at Snowflake, Amazon, and IBM, translating AI capabilities into measurable business outcomes.

I write about AI agents, evals, and applied product strategy at alexkalyvas.substack.com. I'm currently an RDI Research Fellow at Stanford, and I share what I learn there in the "Tools and learning resources" section.

🚀 What I'm building

inside-the-agent: AI Research. An interpretability tool for LLM agents: see which internal features (from a sparse autoencoder trained on the model's activations) fire at each decision step, and edit them in real time to fix failure modes (promotional banners, invented buttons, wandering off-task) inside the model rather than just logging them after the fact. Why it matters: a small, cheap model becomes competitive on the specific failure classes where it normally falls short, and a mechanism read can replace some pass/fail evals because it tells you which feature fired at the wrong moment, not just that the trial failed. On a 60-trial held-out browser-shopping benchmark, one system-prompt line plus two feature edits at the first step lift Llama-3.1-8B from 10% to 75%, closing 72% of the gap to Llama-3.3-70B at roughly one-eighth the cost (87.5% on promotional-trap tasks). Sign-flipped, random, and noise controls stay near baseline, so the +47-point targeted-vs-random gap is the causal claim. A Next.js HUD shows live feature activations and a counterfactual ("without your intervention, the model would have..."). Built for Stanford CS153. Python · Next.js · Playwright · Modal · Llama 3.1 / 3.3 · Goodfire SAE
world-model-eval: AI Research. A look at what today's open, small world models can and can't do. A world model learns to imagine a game one frame at a time, so in theory you could train and test agents inside its "dream" instead of the real game. I tested two of them (DIAMOND and IRIS) on Atari Breakout with one question: how long does the dream stay accurate? They read the present moment very well (you can recover the ball's position from inside the model at R² 0.78, and the very next frame is near-perfect), but the dream drifts away from reality fast once it runs on its own: on the moves it normally plays it keeps the ball roughly right for about 60 steps, but on unusual moves it loses the ball within 16 to 20 steps, and both models behave about the same. Why it's interesting: it shows where these models are genuinely useful (reading the current state, looking a few steps ahead) and where they quietly fail (they can't tell which way of playing is better, and you can't reliably nudge the dream toward an outcome) for the same underlying reason: the accuracy runs out. Built for Stanford CS153. Python · Modal · PyTorch · DIAMOND · IRIS · Atari / ALE
ai-oncall: Applied AI. An LLM agent that diagnoses production incidents in under 30 seconds. It builds the service graph from live telemetry (not a static config), so it rules out services that can't have caused this incident before spending any of its 8 tool calls, and every claim cites the tool call or deploy diff it came from, so a diagnosis is checkable rather than taken on faith. Fixes are staged into recommend / propose / auto tiers behind an allowlist with dry-run preview and one-click Slack rollback, so autonomy is granted per action and a bad call reverses in seconds. Deterministic rules make it abstain when the evidence is thin (cold start, low confidence, budget spent) rather than fabricate, and a CI eval harness scored against real public-postmortem incidents catches regressions across prompts and models. Built for Stanford CS224G. Python · FastAPI · Next.js · Anthropic API
chief-of-staff: A personal chief-of-staff agent that drafts my emails, preps and debriefs my meetings, shapes next week's calendar, and routes each decision into the right project log. Slack is the main surface: @-mention or DM it for a reply grounded in the live work queue and the people involved, because it reads my context, memory, and per-project state before responding rather than asking for it. A tier-based permission engine gates every action that touches the outside world (a read-only command can't quietly promote into sending an email or moving a meeting) and logs every decision, so autonomy is granted on purpose and stays auditable. It coaches each meeting afterward from the Granola transcript, with feedback grounded in the specific relationship history rather than generic advice, and finds new attendees from past email before the first prep. Borrows from Karpathy's LLM-OS idea (persistent memory, read the world before acting) and Garry Tan's gbrain (typed-link entity graph with an audit trail on every action). Node.js · Claude Agent SDK · MCP · Gmail · Google Calendar · Slack · Granola
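The tier-based permission idea used in chief-of-staff (and the recommend / propose / auto tiers in ai-oncall) can be illustrated with a minimal sketch. This is not the code from either repo: the tier names, the `REQUIRED_TIER` table, and the `PermissionGate` class are hypothetical, chosen only to show how a read-only grant fails closed and how every decision lands in an audit log.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical autonomy tiers, ordered from least to most privileged."""
    READ = 0   # observe only
    DRAFT = 1  # prepare something for human review
    SEND = 2   # touches the outside world


# Hypothetical mapping from action name to the tier it requires.
REQUIRED_TIER = {
    "read_calendar": Tier.READ,
    "draft_email": Tier.DRAFT,
    "send_email": Tier.SEND,
    "move_meeting": Tier.SEND,
}


@dataclass
class PermissionGate:
    """Checks each action against the granted tier and records the decision."""
    audit_log: list = field(default_factory=list)

    def check(self, action: str, granted: Tier) -> bool:
        # Unknown actions require the highest tier, so the gate fails closed.
        required = REQUIRED_TIER.get(action, Tier.SEND)
        allowed = granted >= required
        # Log every decision, allowed or not, so the trail stays auditable.
        self.audit_log.append(
            {
                "action": action,
                "required": required.name,
                "granted": granted.name,
                "allowed": allowed,
            }
        )
        return allowed


gate = PermissionGate()
print(gate.check("read_calendar", Tier.READ))  # a read under a read grant
print(gate.check("send_email", Tier.READ))     # a send under a read grant
print(gate.audit_log[-1])
```

The design choice worth noting is the default in `REQUIRED_TIER.get`: an action the table has never seen is treated as the most privileged one, so a new capability cannot slip through under a read-only grant.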

🧰 Tools and learning resources

enterpreneurship-lessons: A library and Claude Code agent partner for the journey from a raw idea to product-market fit. Built around the PMF framework (find the real value hypothesis, the leap of faith, the early signs you're getting warm) and the classic startup playbooks (customer development, Lean Startup, the Mom Test, Crossing the Chasm, Jobs-to-be-Done). Eight stage guides walk a founder from prepared mind through discovery, problem-solution fit, MVP, and PMF measurement, backed by operational playbooks (customer interviews, cold email, MVP scoping, pivot decisions), fillable templates, and 14 Claude skills that route you to the right framework and push back when you're kidding yourself about having PMF. Synthesized from the published books it cites and Unusual Ventures' public Field Guide, and it closes with a curated reading list so you can go to the source. Built so a co-founder or early hire can read one self-contained file and ramp up. Markdown · Claude Skills · Anthropic SDK
winning-writing: Agentic writing coach that catches the AI tells, em-dashes, jargon, flattery, and warmth-and-competence misses that make outreach sound off, before it goes out. Rules grounded in Stanford GSB's Winning Writing (Glenn Kramon) and Rachel Konrad's cold-outreach lectures: 31 Claude skills plus a rule library covering recipient research, surgical edits (em-dash, jargon, adverbs, humanize, warmth-and-competence), pitch artifacts, and voice maintenance learned from sent mail. Browser Coach offers three pipeline modes (single-shot, planner-routed polish, full per-step pipeline), trading depth for speed by stakes; span-level inline critic (hover for the rule, Accept/Reject inline) keeps edits in-draft. Chrome MV3 extension runs the same critic in the Gmail side panel: one-click compose import, opt-in Send-button interception against a cross-model gate (a second Opus call independently approves every send), live rule sync from raw.githubusercontent.com so rules update without reinstalling. Every annotation cites the points/ or skills/ file the rule came from, so a flag can be argued with on the rule, not the assistant. Node eval harness with a golden corpus catches regressions when rules change. JavaScript; Chrome MV3; Anthropic SDK (prompt caching, web_search, multi-model gating); Claude Skills
pm-evaluation-framework: PM library that pushes back on the defaults that get a PM eaten in an exec review: vague problem statements, feature-laundry scope, vanity metrics, self-graded launch gates, hope-as-strategy, one-way-door blindness, undefended moats, and AI critiques accepted on autopilot. 9 Claude skills span the lifecycle (frame, discover, build, launch, measure, review, adversarial second-pass), including a Mom-Test customer-interview coach so problem statements survive contact with users, a value-hypothesis stress-tester so hypotheses can be falsified before code, and a pm-red-team that re-reviews any prior critique under a different lens so the first answer isn't taken as final. Plus 6 lifecycle frameworks, 11 decision and cross-functional reference docs, 3 evaluation rubrics, and 5 artifact templates. Built for the hard calls a working PM has to make under exec-review pressure (when to kill a feature, what MVP really means, what readiness actually looks like). Markdown · Claude Skills · Anthropic SDK
fde-simulation: A hands-on way to practice the Forward Deployed Engineer job: read the customer email, scope the work, ship a working agent. Two fictional 4-week engagements (an insurance claims-automation case and an equity-research case) come with synthetic data and a worked reference solution for each phase, so the difficulty matches real work without exposing any client. The reference agents run end-to-end at a production-grade reliability bar (pass^k, k=5–7) and can be forked, and every step leaves an audit trace so a reviewer can see why it acted. Ships with portable FDE frameworks, roleplay stakeholder prompts (a skeptical CCO, a compliance officer) for discovery practice, and per-phase Claude-graded scoring. Doubles as interview prep for FDE-style roles. Markdown · Python · single-file HTML · Anthropic / OpenAI SDK · MIT
deployment-monitor: Tracks AI deployment trends across Reddit, Hacker News, and 40+ RSS sources. Claude summarizes and categorizes; a Streamlit dashboard surfaces what's actually shipping. An LLM consolidates the most relevant news and emails it to you weekly. Plus an agent layer (python main.py agent) that gives the reports memory across runs: a ledger of evolving narratives and tracked-lead validation for opportunity signals. Python · Streamlit · Anthropic API · SQLite
role-radar: PM job matcher for AI companies (you can adjust this based on your target career/roles). Pulls live roles from ~120 AI companies and VC-backed startups via Greenhouse / Lever / Ashby / SmartRecruiters / Workday, scores each 0–100 against your CV, serves a Flask UI that learns from like/dislike feedback. Two one-click LLM workflows per job, both Claude Opus 4.7: an interview prep generator that maps your stories to the role and runs a second-pass critic scoring 1–10 with severity-tagged findings; and a company review using the web_search tool (15–25 queries) that emits an investor-grade analysis with valuation timeline, ARR growth, competitor quadrant, funding-rounds table, press / Glassdoor / Reddit sentiment, and a verdict (Strong apply / Apply with caveats / Pass / Inconclusive), every numeric claim cited inline. Plus a stateful agent layer (role-radar agent) that tracks pipeline state across runs and drafts cold outreach per job. Python · Flask · Typer · Anthropic SDK · Mermaid · python-docx · SQLite
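The pass^k reliability bar mentioned under fde-simulation asks that a task succeed on all k of k independent attempts, which is stricter than passing at least once. The entry does not say how the metric is computed, so the sketch below is an assumption: it uses one common combinatorial estimator, C(c, k) / C(n, k), for a task run n times with c successes. The function names are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k randomly drawn trials of one task all pass.

    Assumes the estimator C(c, k) / C(n, k), where n is the number of trials
    run and c is the number that succeeded.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # comb(c, k) is 0 when c < k, so a task with too few passes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks; each item is (num_trials, num_successes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# One task run 8 times with 7 passes, scored at k = 5.
print(pass_hat_k(8, 7, 5))  # 21 / 56 = 0.375
```

A single failure in eight runs already drops the k = 5 score to 0.375, which is why pass^k is a much harder bar than an average success rate.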

🔬 Research & writing

Enterprise AI Observability (Stanford GSB · IR 390 · Dec 2025). Primary research from 100+ AI engineers and founders. Headline: 60% of AI-natives already run agents in production; the next bottlenecks are privacy/security, observability, and continual learning, not hallucinations. Most teams ship without trace coverage and learn their failure modes from customer complaints. Eval-driven development is becoming the new TDD: the teams shipping fastest are the ones who wrote their evals before their agents. Published as a three-part Substack series; the work also generated introductions to AI SRE companies and AI experts.
Enterprise AI Time-to-Value (Stanford GSB · IR 390 · Mar 2026). Only 14% of practitioners reach measurable impact in under a month, yet 50% expect that to be standard by 2027 (a roughly 4× expectation gap). The teams hitting fast TTV share three habits: forward-deployed engineering (PMs and engineers in the customer's loop, not behind a CSM), trust transfer (the implementer's reputation matters more than the model's), and measurement discipline (a baseline metric agreed on before kickoff, not after). Argues implementation quality, not model quality, is the next moat.
Rosetta Sycophant (Stanford GSB · AI and Power · Winter 2026). AI Interpretability. With Barry Thrasher and Humzah Khan. Live tool detecting identity-based bias in AI translation. Same source text, two user profiles → "invasion" vs "intrusion." Demoed on January 6th reporting from Russian-language sources. Finding: frontier models change word choice based on inferred user politics even when the user metadata is irrelevant to the translation task, which makes "neutral translation" a harder claim than vendors imply.
Haptica: Tactile Intelligence for Manufacturing (Stanford GSB · MKTG 321 · Mar 2026). AI Robotics. With Devanshi Mehta, Grace Stayner, Paola Peraza Calderon, and Facundo Tosi. Wearable tactile-sensor ML for assembly-line connector seating. Trained per-connector classifiers (Random Forest / Gradient Boosting / SVM) on 16 hand-crafted pressure features; F1 0.67–0.83. Hand-crafted features beat a CNN baseline on the small dataset, reinforcing that sparse-data physical-sensor problems still favor classical ML over deep learning. Non-electromechanical assembly failures drive $14B/yr in U.S. recall costs, and catching 10% of escapes at a single OEM avoids $50M+ annually.
Strike GTM Repositioning (Stanford GSB · GTM · Dec 2025). AI Cybersecurity. With Sachin Khurana and Christian Gallo. Series-A AI pentesting platform (~$3M ARR), 8 proprietary interviews including XBow and Horizon3. Core finding: buyers won't accept fully-autonomous pentest output as compliance evidence, which makes hybrid AI + human structurally more defensible than autonomy claims. Recommends per-asset PTaaS pricing (unlocks 5–10× larger contracts than per-hour), LATAM financial services focus (underserved, regulatory tailwind), and hybrid positioning against fully-autonomous entrants.
Knowledge Gardener: A Product Pitch for Glean (Stanford GSB · PM · Spring 2026). Enterprise AI Applications. With Yedu Pushpendran, Viraj Singh, and Shivani Bajaj. Continuous agent that finds stale and contradicted docs in Glean's Enterprise Graph and routes fixes to the right steward. Thesis: as MCP commoditizes connectors, the retrieval signal (what decays, who owns it, what's asked) is the one asset MCP can't standardize, so Glean's defensibility shifts from integration breadth to retrieval-signal depth. Packaged as a free Health Score plus a paid "Cultivate" tier with tiered autonomy.

🛠️ Stack

AI/ML: Multi-agent orchestration · LLM evals · Agentic workflows · AI observability · Supervised ML
Engineering: Python · SQL · JavaScript · React · Node.js · TypeScript · Next.js · FastAPI · Streamlit · SQLite
Tools: Anthropic SDK · OpenRouter · Tableau · Power BI

🎓 Background

Stanford GSB: MBA
LSE: MSc Management & Strategy · Karelia Merit Scholar · Dissertation: predicting corporate performance from employee-satisfaction data using supervised ML
Athens University of Economics and Business: BSc Business Administration & Computer Science (Top 5%)
GMAT 750 (Top 2%), IR 8/8

📫 Reach me

LinkedIn · Email · Substack
