An agent skill for GPU kernel/operator profiling with Nsight Compute and portable profiler backends
Cursor · Claude Code · Codex · Gemini CLI · CUDA · Triton · Nsight Compute
Kernel Profiler Skill is an agent skill for GPU kernel/operator profiling in Cursor, Claude Code, Codex, and Gemini CLI. It provides a structured workflow for investigating a single GPU kernel or a tightly related kernel family, including launch configuration, occupancy, memory behavior, compute pipeline use, warp stalls, instruction mix, source-line attribution, and before/after regression.
The default backend is NVIDIA Nsight Compute (ncu). The backend layer is intentionally isolated so the same profiling plan can be adapted to other GPU vendors by replacing the profiler adapter, metric aliases, and architecture database.
Supported target styles include:
- native CUDA executables,
- benchmark commands built by CMake/Make/custom build systems,
- Python processes that launch CUDA kernels,
- Triton JIT kernels launched from Python.
Out of scope by default: CPU timeline analysis, dataloaders, distributed communication, launch-gap analysis, MPI/NCCL overlap, end-to-end application tracing, and system-level scheduling. Those belong in a system profiler workflow, not this kernel-only flow.
The files in scripts/ are the executable implementation of this skill. Agents should call them directly instead of regenerating equivalent shell or Python logic:
scripts/generate_profile_target.pyfor./profile/<profile_id>/profile-target.yamlscripts/ncu_collect_kernel_profile.shfor staged profilingscripts/discover_kernels.shonly as a kernel-selection fallbackscripts/extract_ncu_metrics.py,scripts/generate_source_hotspots.py,scripts/compare_profiles.py, andscripts/visualize_profile_report.pyfor post-processing
If a script cannot express a required capability, patch the existing script first, then call it. Direct ncu commands are backend reference details, not the normal agent execution path.
Install it as a local Codex skill by cloning this repository, or run the copy command from this directory:
# Linux/macOS
mkdir -p ~/.codex/skills/kernel-profiler-skill
cp -R . ~/.codex/skills/kernel-profiler-skill/
# Windows PowerShell
New-Item -ItemType Directory -Force "$env:USERPROFILE\.codex\skills\kernel-profiler-skill"
Copy-Item -Recurse -Force . "$env:USERPROFILE\.codex\skills\kernel-profiler-skill"Restart or reload Codex. To use the skill, open the CUDA/Triton project you want to profile, then invoke /skill kernel-profiler-skill; the skill itself is loaded from ~/.codex/skills/kernel-profiler-skill.
The main workflow starts from a short request in the coding agent, for example:
/skill kernel-profiler-skill, profile kernel hgemm_byzj_v0 and generate a visual report
The agent should convert that request into a reproducible profiling run:
- Parse the request: kernel hint
hgemm_byzj_v0, visualization enabled, default profile levelbasic, source mapping enabled unless disabled, and default backendnvidia-ncu. - Resolve the target command from the repository when it is not provided. Typical search locations are
README,CMakeLists.txt,Makefile,build/,bin/,examples/, benchmark scripts, and project run scripts. - Generate
./profile/<profile_id>/profile-target.yamlby callingscripts/generate_profile_target.py. - Use the kernel name directly as the initial filter. The default generated filter for
hgemm_byzj_v0is.*hgemm_byzj_v0.*. - Run
scripts/ncu_collect_kernel_profile.sh --stages auto: the collector profilesbasic, inspectsdetails/metrics_summary.json, then runs one evidence-justified follow-up stage such asmemory,compute,occupancy, orspeed-of-light. Use--stages allonly when explicitly requested. - Generate source hotspots, comparison, and visual reports with the existing scripts when requested.
- Write the final artifacts under
./profile/<kernel_name>_<profile_id>/.
Manual usage is still supported: profile-target.yaml can be generated by hand from assets/templates/profile-target.yaml or edited after generation. Place it under the matching profile directory, for example ./profile/hgemm_byzj_v0_20260515_120000/profile-target.yaml.
Example generator call produced from the prompt above, once the benchmark command has been resolved:
python3 scripts/generate_profile_target.py \
--target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
--kernel hgemm_byzj_v0 \
--requirement "visual report"Expected kernel section:
kernel:
name: hgemm_byzj_v0
filter_mode: regex
filter: .*hgemm_byzj_v0.*
allow_discovery_fallback: trueDiscovery is a fallback for missing, failed, or ambiguous filters. It is not a required extra step for a named kernel.
A profile run is described by ./profile/<profile_id>/profile-target.yaml. The target can be written by hand from assets/templates/profile-target.yaml or generated from a short request.
The generator supports:
- native executable targets,
- Python/Triton targets,
- kernel-name hints,
- automatic regex filter generation from the kernel name,
- visual report requests,
- source/SASS/PTX requests,
- Roofline requests,
- regression comparison requests,
- unsupported requirement recording.
For a named kernel such as hgemm_byzj_v0, the normal path is to use the name directly:
kernel:
name: hgemm_byzj_v0
filter_mode: regex
filter: .*hgemm_byzj_v0.*
allow_discovery_fallback: trueKernel discovery is reserved for missing, failed, or ambiguous filters. It is not used as an extra step when a valid kernel name is already available.
The standard collector can run profiling in evidence-oriented stages with --stages. The normal autonomous-agent path is a small first pass, followed by targeted follow-up stages:
basicpass for launch configuration, high-level utilization, and obvious bottlenecks.SpeedOfLightpass for compute-vs-memory direction.MemoryWorkloadAnalysispass for memory hierarchy, memory throughput, transactions, sectors, cache behavior, shared-memory conflicts, and local memory.ComputeWorkloadAnalysisplusInstructionStatsfor pipeline utilization, IPC, instruction mix, Tensor Core use, FP/INT/SFU/LDST behavior, and instruction categories.LaunchStats,Occupancy, scheduler, and warp-state analysis for active warps, theoretical/achieved occupancy, resident blocks, register/shared-memory limits, and stall reasons.- Roofline analysis when supported by the backend.
- Source/SASS/PTX attribution when source mapping is available.
Raw artifacts are kept under details/; the final report is normalized and concise.
Triton kernels are profiled through the Python process that launches them. The selected object remains the generated GPU kernel, not the Python function runtime.
The Triton flow handles:
- Python command targets,
- Triton JIT warmup,
- autotune contamination control,
- launch-skip defaults for JIT/autotune/warmup,
- best-effort Python/Triton source mapping,
- fallback discovery when the Python function name differs from the generated GPU kernel name.
Recommended final-evidence runs should use a fixed Triton config where possible, then skip warmup launches before collecting the profiled launch.
Some profiler counters require elevated privileges. Two execution modes are supported:
| Mode | Purpose |
|---|---|
none |
Run without sudo. |
full_sudo |
Run sudo -n "$(cat ./profile/ncu_path)"; requires root or narrow NOPASSWD access for the path stored in ./profile/ncu_path. |
Plaintext password storage is not implemented. Passwords must not be written into YAML, scripts, logs, environment variables, shell history, generated commands, reports, or files such as profile/sudokey. At skill start, ./profile/ncu_path is created if missing with /usr/local/cuda/bin/ncu. If ncu requires sudo, the agent should stop the profile, show the NOPASSWD guide, ask the user to run command -v ncu and readlink -f "$(command -v ncu)", and require the user to configure NOPASSWD for one selected path and write that same path into ./profile/ncu_path.
On multi-CUDA servers, keep sudo profiling with the same configured ./profile/ncu_path:
mkdir -p ./profile
printf '%s\n' '/usr/local/cuda-12.9/bin/ncu' > ./profile/ncu_path
scripts/ncu_collect_kernel_profile.sh \
--sudo \
...Large raw CSV files are reduced into compact machine-readable summaries:
details/metrics_summary.json
details/metrics_extracted.jsonl
These summaries are the preferred inputs for report generation, regression comparison, and optional rule-based triage.
The source hotspot stage writes:
details/source_hotspots.csv
Standard columns:
source_file,line,function,sass_opcode,ptx_opcode,metric,value,stall_reason,bottleneck_class,confidence,evidence
Each major conclusion should point to source lines, SASS/PTX opcode families, or a documented reason why source correlation is unavailable.
The rule engine is disabled by default. It can classify common bottleneck patterns from extracted metrics using rules in:
references/portable/rules/bottleneck_rules.yaml
The output is advisory. The final report still needs explicit metric evidence.
Regression mode is disabled by default. When enabled, it compares the current profile to a selected or automatically discovered previous profile for the same kernel prefix.
The comparison report includes:
- baseline value,
- current value,
- absolute delta,
- percentage delta,
- metric-level judgment,
- overall verdict,
- missing-data and inconclusive states,
- lower-is-better and higher-is-better semantics,
- a default
±2%noise band for normal random variation.
A visual summary can be generated from the final report and extracted details. The script checks for required Python packages and should return installation guidance instead of producing partial images when dependencies are missing.
The NVIDIA implementation is described by:
references/nvidia/vendor-adapter-nvidia-ncu.yaml
Portable mappings and taxonomy are placed under:
references/portable/
A backend port should provide equivalent support for discovery, filtering, staged collection, raw metric export, source/ISA correlation, metric aliases, and conformance checks.
kernel-profiler-skill/
├── SKILL.md
├── README.md
├── README.zh-CN.md
├── docs/
├── references/
├── scripts/
└── assets/
Important paths:
assets/templates/profile-target.yaml target schema template
assets/templates/final_report_template.md normalized report template
assets/templates/run_manifest_template.yaml run metadata template
assets/examples/profile-target.example.yaml native CUDA example
assets/examples/profile-target.triton.example.yaml Python/Triton example
docs/workload-stabilization-guide.md warmup, repeatability, launch selection
docs/triton-kernel-profiling.md Triton-specific profiling notes
docs/vendor-conformance.md backend capability checks
docs/ncu-nopasswd-guide.zh-CN.md exact-path NOPASSWD setup guide
docs/legal/NOTICE.md attribution and redistribution notes
references/nvidia/architectures/gpu_specs.yaml machine-readable NVIDIA GPU limits
references/nvidia/ncu-guide.md link to upstream Nsight Compute workflow notes
references/nvidia/ProfilingGuide.md link to upstream profiling guide reference
references/nvidia/ncu-metric-map.md metric-to-bottleneck map
references/nvidia/vendor-adapter-nvidia-ncu.yaml default backend adapter
references/portable/metric-aliases.yaml vendor-neutral metric aliases
references/portable/portable-bottleneck-taxonomy.md bottleneck classes
references/portable/vendor-porting-guide.md backend migration guide
references/portable/rules/bottleneck_rules.yaml optional rule engine rules
scripts/generate_profile_target.py target generator
scripts/ncu_collect_kernel_profile.sh staged Nsight Compute collector
scripts/discover_kernels.sh fallback kernel discovery
scripts/extract_ncu_metrics.py raw metric extractor
scripts/generate_source_hotspots.py source hotspot table generator
scripts/bottleneck_decision_engine.py optional rule-based triage
scripts/compare_profiles.py optional before/after comparison
scripts/visualize_profile_report.py optional visual report generator
scripts/vendor_conformance_check.py backend adapter check
./profile/<profile_id>/profile-target.yaml is the normalized contract between the request parser, the collector scripts, and the final report generator. It is generated automatically in the normal prompt-driven workflow, but the fields are intentionally readable and safe to edit.
Schema identifier for compatibility checks. Current value: 3.0.
Describes the command that launches the kernel.
| Field | Meaning | Typical value |
|---|---|---|
runtime |
Runtime family. | native, python, or python-triton |
executable |
Program to execute. | ./build/bench_gemm, python3 |
working_directory |
Directory used when running the command. | . |
args |
Argument vector passed to executable. |
['--m', '4096'] |
env |
Environment variables for the run. | {CUDA_VISIBLE_DEVICES: '0'} |
stdin |
Optional stdin payload or file reference. | null |
timeout_seconds |
Optional command timeout. | null or integer |
python.interpreter |
Python executable when runtime is Python-based. | python3 |
python.entry_kind |
Python entry type. | script, module, command, or null |
python.script |
Script path for Python script targets. | bench_triton_hgemm.py |
python.module |
Module name for python -m. |
package.bench |
python.jit_framework |
JIT framework marker. | triton for Triton |
Runtime-specific controls, mainly for Python/Triton.
| Field | Meaning |
|---|---|
triton_enabled |
Enables Triton-specific stabilization and reporting notes. |
triton_kernel_hint |
Original Triton function/kernel hint from the request. |
triton_autotune_policy |
fixed_config, warmup_or_fixed_config, or allow_autotune. |
jit_warmup_required |
Marks JIT warmup as required before evidence collection. |
recommended_warmup_skip |
Suggested skip count for JIT/autotune/warmup launches. |
require_cuda_synchronize_around_profile_window |
Requests explicit synchronization around benchmark windows when practical. |
source_mapping_expectation |
Source attribution expectation; Triton is usually best_effort. |
notes |
Runtime-specific caveats to carry into the report. |
Selects the profiled kernel.
| Field | Meaning |
|---|---|
name |
Human-readable kernel label from the request. |
filter_mode |
exact, regex, kernel-id, nvtx-range, vendor-specific, or auto. |
filter |
Backend filter. Named kernels default to .*<kernel_name>.*. |
profile_id |
Stable output identifier; auto generates one from kernel name and timestamp. |
nvtx_range |
Optional NVTX range filter. |
allow_discovery_fallback |
Allows discovery only when the initial filter is missing, fails, or is ambiguous. |
Controls profile depth and output location.
| Field | Meaning |
|---|---|
backend |
Profiler backend adapter, default nvidia-ncu. |
initial_level |
basic or full. Default is basic. |
warmup_policy |
auto, fixed, or none. |
warmup_skip |
Number of matching launches to skip before collection. |
launch_count |
Number of matching launches to collect. |
enable_source_mapping |
Enables source/SASS/PTX attribution when available. |
enable_visual_report |
Enables visual report generation. |
extra_profiler_options |
Backend-specific extra options. |
ncu_bin |
Nsight Compute CLI command for non-sudo mode, default ncu; full_sudo mode reads ./profile/ncu_path. |
output_root |
Root directory for generated profile folders. |
Defines how privileged profiler access is handled.
| Field | Meaning |
|---|---|
mode |
none or full_sudo. |
full_sudo_policy |
Restricts sudo to root or narrow NOPASSWD for the exact path in ./profile/ncu_path. |
password_storage |
Must remain forbidden. |
forbidden |
Explicit safety rules: never store, read, print, pipe, or auto-type sudo passwords. |
Fallback kernel discovery settings. Discovery is not the default path for a named kernel.
| Field | Meaning |
|---|---|
enabled_when_filter_missing |
Allows discovery if no filter is available. |
fallback_when_kernel_filter_misses |
Allows discovery if the generated filter matches nothing. |
max_candidates |
Maximum candidates to retain. |
prefer_user_named_kernel |
Ranks name matches before generic cost ranking. |
rank_by |
Candidate ranking keys. |
command_extra_options |
Extra options used only during discovery. |
Controls optional and targeted analysis stages.
| Field | Meaning |
|---|---|
collect_memory |
auto, true, or false for memory analysis. |
collect_compute |
auto, true, or false for compute analysis. |
collect_occupancy |
auto, true, or false for occupancy/scheduler analysis. |
collect_roofline |
auto, true, or false for Roofline. |
collect_source |
auto, true, or false for source/SASS/PTX collection. |
enable_bottleneck_decision_engine |
Optional rule engine; default false. |
enable_regression_compare |
Optional before/after comparison; default false. |
compare_baseline |
auto, explicit path, or null. |
random_variation_tolerance_pct |
Default noise band for regression decisions; default 2.0. |
generate_source_hotspots_table |
Generates details/source_hotspots.csv. |
Controls the optional visual report.
| Field | Meaning |
|---|---|
enabled |
Enables visual report generation. |
format |
Output image format, usually png. |
require_python_packages |
Required Python packages checked before rendering. |
Carries request text and unsupported requirements into the report.
| Field | Meaning |
|---|---|
user_requirements |
Original extra requirements parsed from the request. |
unsupported_or_deferred_requirements |
Requirements outside the kernel-only or supported profiler scope. |
Generate a target file:
python3 scripts/generate_profile_target.py \
--target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
--kernel hgemm_byzj_v0 \
--requirement "visual report, source, roofline"Run the first evidence pass:
scripts/ncu_collect_kernel_profile.sh \
--target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
--kernel-name hgemm_byzj_v0 \
--kernel-regex ".*hgemm_byzj_v0.*" \
--launch-skip 10 \
--launch-count 1 \
--output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
--stages autoThen run targeted stages only when justified by the first pass, for example:
scripts/ncu_collect_kernel_profile.sh \
--target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
--kernel-name hgemm_byzj_v0 \
--kernel-regex ".*hgemm_byzj_v0.*" \
--launch-skip 10 \
--launch-count 1 \
--output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
--stages memory,sourceThe collector writes raw CSV files directly, does not generate .ncu-rep, and runs compact metric extraction automatically when raw metrics are available. Manual retry:
python3 scripts/extract_ncu_metrics.py \
--input ./profile/hgemm_byzj_v0_20260515_120000/details/metrics_raw.csv \
--output-dir ./profile/hgemm_byzj_v0_20260515_120000/detailsGenerate a visual summary when enabled:
python3 scripts/visualize_profile_report.py \
./profile/hgemm_byzj_v0_20260515_120000/final_report.md \
./profile/hgemm_byzj_v0_20260515_120000/details \
./profile/hgemm_byzj_v0_20260515_120000/visual/profile_summary.pngGenerate a Triton target:
python3 scripts/generate_profile_target.py \
--runtime python-triton \
--target-cmd "python3 bench_triton_hgemm.py --m 4096 --n 4096 --k 4096 --iters 100" \
--kernel hgemm_byzj_v0 \
--requirement "triton, visual report, source, roofline"Collect the profile:
scripts/ncu_collect_kernel_profile.sh \
--runtime python-triton \
--target-cmd "python3 bench_triton_hgemm.py --m 4096 --n 4096 --k 4096 --iters 100" \
--kernel-name hgemm_byzj_v0 \
--kernel-regex ".*hgemm_byzj_v0.*" \
--launch-skip 20 \
--launch-count 1 \
--output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
--stages autoFor final Triton evidence, keep the profiled launch window stable: warm up JIT, avoid collecting autotune candidates, synchronize around measured loops, and fix kernel configs when possible.
A completed profile should use this layout:
profile/<kernel_name>_<profile_id>/
├── profile-target.yaml
├── final_report.md
├── run_manifest.yaml
├── details/
│ ├── 01_basic_raw.csv
│ ├── 02_speed_of_light_raw.csv
│ ├── 03_memory_raw.csv
│ ├── 04_compute_raw.csv
│ ├── 05_occupancy_launch_raw.csv
│ ├── 06_roofline_raw.csv
│ ├── 07_source_raw.csv
│ ├── metrics_summary.json
│ ├── metrics_extracted.jsonl
│ └── source_hotspots.csv
├── visual/
│ └── profile_summary.png
└── comparison/
└── regression_report.md
Not every optional directory is required. visual/ appears only when visualization is enabled. comparison/ appears only when regression mode is enabled.
The normalized report should include:
- target command and runtime,
- GPU model, driver, CUDA/profiler version when available,
- kernel filter and resolved kernel name,
- launch skip/count and stabilization notes,
- basic profile summary,
- SpeedOfLight direction,
- memory analysis,
- compute analysis,
- occupancy and scheduler analysis,
- Roofline result when available,
- source/SASS/PTX hotspot table or limitations,
- bottleneck classification with metric evidence,
- optimization recommendations ranked by confidence,
- optional visual report path,
- optional before/after comparison,
- unresolved questions and data-quality notes.
Run the adapter check:
python3 scripts/vendor_conformance_check.py \
--adapter references/nvidia/vendor-adapter-nvidia-ncu.yamlA conforming backend should declare support or explicit non-support for:
- kernel discovery,
- exact/regex/kernel-id/range filtering,
- basic profile collection,
- memory analysis,
- compute analysis,
- occupancy analysis,
- Roofline analysis,
- raw metric export,
- source/ISA correlation,
- permission handling.
This skill is built around a portable kernel-profiling workflow. The default backend is NVIDIA Nsight Compute, while the profiling model is intentionally separated from the vendor-specific command layer so that another GPU vendor can replace the backend adapter and metric mapping.
-
NVIDIA Nsight Compute documentation
Used as the primary reference for Nsight Compute CLI behavior, report concepts, metric collection, sections, sets, source/SASS correlation, replay modes, and profiling workflow. -
NVIDIA CUDA C++ Programming Guide and CUDA Compute Capability documentation
Used for architecture-level limits such as warp size, maximum resident blocks, maximum resident warps, registers per SM, shared memory limits, and occupancy-related interpretation. -
NVIDIA CUDA C++ Best Practices Guide
Used for CUDA kernel optimization concepts such as memory coalescing, occupancy, arithmetic intensity, memory hierarchy behavior, and performance measurement practices.
The following files are local pointer files to cuda-skill references in the open-source repository slowlyC/agent-gpu-skills:
references/nvidia/ncu-guide.mdreferences/nvidia/ProfilingGuide.md
Original source repository:
https://github.com/slowlyC/agent-gpu-skills
Attribution:
- Original repository:
slowlyC/agent-gpu-skills - Original author / maintainer:
slowlyC - Source skill:
cuda-skill - Relevant upstream reference paths include:
references/ncu-guide.mdreferences/ncu-docs/ProfilingGuide.mdreferences/best-practices-guide/
These files intentionally do not embed the upstream documentation text. If redistributing this skill, keep this attribution section and follow any upstream license and notice requirements from slowlyC/agent-gpu-skills.
-
references/nvidia/ncu-guide.md
Pointer to the upstream Nsight Compute command and workflow reference inslowlyC/agent-gpu-skills. -
references/nvidia/ProfilingGuide.md
Pointer to the upstream Nsight Compute profiling guide reference inslowlyC/agent-gpu-skills. -
references/nvidia/vendor-adapter-nvidia-ncu.yaml
Backend adapter that maps the portable profiling workflow to Nsight Compute commands, sections, sets, and report artifacts. -
references/nvidia/architectures/gpu_specs.yaml
Machine-readable NVIDIA GPU architecture and device-spec database for architecture-aware profiling interpretation. -
references/nvidia/architectures/architecture-notes.md
Human-readable architecture notes for interpreting kernel bottlenecks across NVIDIA GPU generations. -
references/portable/metric-aliases.yaml
Portable metric alias table used to decouple analysis logic from vendor-specific metric names. -
references/portable/portable-bottleneck-taxonomy.md
Vendor-neutral bottleneck taxonomy used by reports and optional rule-based analysis. -
references/portable/vendor-porting-guide.md
Guide for replacing the Nsight Compute backend with another vendor profiler. -
references/portable/rules/bottleneck_rules.yaml
Optional rule set for the bottleneck decision engine. Disabled by default.
Stable kernel-only profiling package with native CUDA and Python/Triton support. The package includes target generation, staged Nsight Compute collection, optional visual reports, optional before/after regression comparison, optional bottleneck decision rules, source hotspot extraction, exact-path ncu NOPASSWD guidance, NVIDIA hardware metadata, vendor-portable metric aliases, and backend conformance checks.
Consolidated the package layout into SKILL.md, README.md, README.zh-CN.md, docs/, references/, scripts/, and assets/. Moved templates and examples under assets/, moved conformance documentation under docs/, and kept executable utilities under scripts/.
Defined the privilege model: no-sudo execution and exact-path full_sudo execution through sudo -n. Kernel-name requests now generate a direct regex filter by default, with discovery only as a fallback.
Added privileged collection guidance for environments where Nsight Compute requires performance-counter permissions. Refined README tone and clarified how privileged collection should be run.
Replaced prose-only NVIDIA architecture notes with references/nvidia/architectures/gpu_specs.yaml and architecture notes. Added target generation, metric extraction, optional bottleneck rules, source hotspot schema, regression comparison, workload stabilization guidance, and backend conformance checks.
Added Chinese README, expanded NVIDIA architecture and profiling references, made kernel filtering optional in handwritten targets, and documented future industrialization improvements.
Initial kernel-only profiling skill with Nsight Compute as the default backend, staged profile collection, target schema, normalized final report template, visual report script, and vendor-portability guidance.