Kevin-Li-2025/README.md

Yin Li

LLM systems engineer building reproducible evaluation, post-training, retrieval, and traceable agent infrastructure.

Portfolio | Selected repositories

I work on the engineering layer around model behavior: data pipelines, benchmark harnesses, retrieval diagnostics, verifier-guided inference, trace capture, and regression tests that make LLM systems measurable instead of anecdotal.

Upstream Open Source

  • Triton PR #10411: merged runtime cache-group integrity fix that treats incomplete cache groups as misses; passed Triton's full integration CI across NVIDIA, AMD, and macOS runners.
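The idea behind that fix, treating an incomplete cache group as a miss, can be sketched in a few lines. This is not Triton's code: the helper name `load_cache_group`, the directory layout, and the file names are hypothetical and only illustrate an all-or-nothing lookup.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Optional


def load_cache_group(group_dir: Path, expected: Iterable[str]) -> Optional[Dict[str, bytes]]:
    """Return every file in the group, or None (a miss) if any is absent.

    Hypothetical helper: a partially written group is never returned,
    so the caller rebuilds instead of reading a broken cache entry.
    """
    paths = [group_dir / name for name in expected]
    if not all(p.is_file() for p in paths):
        return None  # incomplete group counts as a cache miss
    return {p.name: p.read_bytes() for p in paths}


with tempfile.TemporaryDirectory() as tmp:
    group = Path(tmp)
    (group / "kernel.bin").write_bytes(b"\x00\x01")
    # Only one of the two expected files exists, so this lookup is a miss.
    partial = load_cache_group(group, ["kernel.bin", "kernel.json"])
    (group / "kernel.json").write_text("{}")
    # Both files are present now, so the whole group is returned.
    complete = load_cache_group(group, ["kernel.bin", "kernel.json"])
```

Returning `None` for a partial group keeps the caller's logic simple: anything short of a complete group takes the same path as an empty cache.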

Operating Thesis

  • Treat every model claim as an artifact-backed systems claim: data version, command, hardware, metric, and failure boundary.
  • Build evaluation loops that survive refactors: golden sets, deterministic runners, CI checks, and report provenance.
  • Keep agent behavior inspectable: tool calls, retrieved sources, validators, retries, and escalation paths should be first-class data.
  • Optimize for reproducible learning velocity: small models, single-GPU runs, tight ablations, and clear error analysis before scale.
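One way to make the first point concrete is to store each reported number next to the context needed to rerun it. The sketch below is a minimal illustration, not the schema used in any of the repositories: the `ClaimRecord` class, its field names, and all values are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """One reported result plus the context needed to rerun it (hypothetical schema)."""
    data_version: str
    command: str
    hardware: str
    metric: str
    value: float
    failure_boundary: str

    def fingerprint(self) -> str:
        # Sorted keys make the serialization, and therefore the hash, deterministic.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = ClaimRecord(
    data_version="placeholder-v1",
    command="make eval",
    hardware="1x NVIDIA L20",
    metric="exact_match",
    value=0.0,  # placeholder, not a real result
    failure_boundary="placeholder: inputs outside the golden set are untested",
)
# Rebuilding the record from its own fields yields the same fingerprint.
same = ClaimRecord(**asdict(record))
```

Because the fingerprint covers every field, a report can cite it and a CI check can detect when the data version, command, or hardware behind a number has changed.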

System Map

Surface | What I build | Representative repos
Post-training and verifier-guided inference | SFT/DPO-style pipelines, executable checks, reward-labeled traces, benchmark exports | L20-CodeForge, repro-llm-stack
Retrieval and ranking evaluation | Query planning, citation checks, recall/MRR regression tests, reranker snapshots | signal-rag, retrieval-eval, finmteb-zh-reranker-sota, coreb-retrieval-sota
Structured generation benchmarks | NL2SQL, ordering reliability, multi-path inference, cost and robustness reporting | nl2sql-benchmark, order-delta-bench
Agent trace infrastructure | Scientific workflows, semantic judges, deterministic validators, graph memory | scitrace-rl, multi-agent-memory-graphs
Efficient model systems | Quantization experiments, serving benchmarks, GPU instrumentation, compiler/runtime fixes | Triton PR #10411, bitnet-1p58b-experiments, llm-quant-bench, l20-edu-135m-pretrain
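A recall/MRR regression test of the kind listed under retrieval evaluation could look roughly like the sketch below. It uses the standard definitions of recall@k and mean reciprocal rank; the document IDs, the toy golden set, and the `baseline_mrr` threshold are made up for illustration and do not come from any of the repositories.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Average reciprocal rank over every query in the golden set."""
    return sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()) / len(gold)


# Toy golden set and retrieval output (made-up IDs).
gold = {"q1": {"d1", "d4"}, "q2": {"d7"}}
runs = {"q1": ["d3", "d1", "d4"], "q2": ["d7", "d2", "d9"]}

mrr = mean_reciprocal_rank(runs, gold)  # (1/2 + 1/1) / 2 = 0.75
baseline_mrr = 0.70  # hypothetical stored baseline
assert mrr >= baseline_mrr, "retrieval quality regressed"
```

Keeping the golden set and the baseline in version control lets the same assertion run in CI after every change to the retriever.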

Selected Systems

Project | Signal | Evidence surface
L20-CodeForge | Single-L20 post-training and verifier-guided inference for executable code benchmarks | Reproduction scripts, artifact hashes, result boundaries
nl2sql-benchmark | Text-to-SQL fine-tuning and multi-path inference with Qwen2.5-Coder-7B | Spider/BIRD-style evaluation, cost curves, export paths
finmteb-zh-reranker-sota | FinanceMTEB Chinese reranking snapshot with Qwen3-Reranker-8B | Public report, CI checks, leaderboard snapshot context
signal-rag | Retrieval workbench with query planning, citation checks, and extractive fallback | Recall evaluation, source-trust tiers, benchmark examples
scitrace-rl | Trace, validation, and reward infrastructure for scientific agents | Adversarial cases, semantic judge, deterministic validators
coreb-retrieval-sota | Reproducible CoREB retrieval benchmark snapshot | CI-backed artifacts and result provenance
Triton PR #10411 | Merged upstream runtime cache-group integrity fix | Full upstream CI across NVIDIA, AMD, and macOS runners
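Artifact hashes, listed above as an evidence surface, are typically checked against a stored manifest. The following is a minimal sketch under assumed conventions: the manifest format (relative path mapped to a SHA-256 hex digest), the helper names, and the `results.json` file are hypothetical and not taken from the repositories.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose hash differs."""
    mismatched = []
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            mismatched.append(rel_path)
    return mismatched


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "results.json").write_text('{"metric": "placeholder"}')
    manifest = {"results.json": sha256_file(root / "results.json")}
    clean = verify_manifest(root, manifest)    # nothing changed yet
    (root / "results.json").write_text('{"metric": "edited"}')
    drifted = verify_manifest(root, manifest)  # edited file no longer matches
```

An empty list means every artifact still matches the manifest; any other result names the files whose contents no longer back the reported numbers.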

Engineering Standard

I try to make serious repositories answer five questions quickly:

Question | Expected answer
What is the exact task? | Dataset, benchmark, workflow, or user problem is named up front.
How do I run it? | Setup and reproduction commands are visible from the README.
What should happen? | Expected outputs, metrics, report paths, or screenshots are documented.
What is proven? | Claims are tied to artifacts rather than vague demos.
Where does it fail? | Known limitations and next experiments are explicit.

Technical Vector

Core languages: Python, TypeScript, SQL, C++, CUDA, Swift, C#
Model systems: PyTorch, Transformers, LoRA/QLoRA, vLLM, lm-eval, Triton
LLM applications: RAG, retrieval evaluation, tool use, citation verification, structured generation
Infrastructure: FastAPI, SQLite, Docker, GitHub Actions, Make, CLI tooling, PostGIS, Redis, Kafka
Research direction: post-training, process supervision, agent evaluation, scientific reproducibility, AI4S infrastructure

Contact

Portfolio | GitHub

Pinned

  1. bitnet-1p58b-experiments (Public)

    1.58-bit LLM pretraining experiments with quantization-aware layers, Triton kernels, stability tracking, and GPU instrumentation.

    Python

  2. nl2sql-benchmark (Public)

    Reproducible NL2SQL fine-tuning and multi-path inference benchmark on Spider/BIRD using Qwen2.5-Coder-7B and one NVIDIA L20 GPU.

    Python

  3. signal-rag (Public)

    Search and retrieval workbench with query planning, multi-source retrieval, citation checking, source-trust tiers, and extractive fallback.

    Python