Kernel Profiler Skill

An agent skill for GPU kernel/operator profiling with Nsight Compute and portable profiler backends

Cursor · Claude Code · Codex · Gemini CLI · CUDA · Triton · Nsight Compute

Overview

Kernel Profiler Skill is an agent skill for GPU kernel/operator profiling in Cursor, Claude Code, Codex, and Gemini CLI. It provides a structured workflow for investigating a single GPU kernel or a tightly related kernel family, including launch configuration, occupancy, memory behavior, compute pipeline use, warp stalls, instruction mix, source-line attribution, and before/after regression.

The default backend is NVIDIA Nsight Compute (ncu). The backend layer is intentionally isolated so the same profiling plan can be adapted to other GPU vendors by replacing the profiler adapter, metric aliases, and architecture database.

Supported target styles include:

native CUDA executables,
benchmark commands built by CMake/Make/custom build systems,
Python processes that launch CUDA kernels,
Triton JIT kernels launched from Python.

Out of scope by default: CPU timeline analysis, dataloaders, distributed communication, launch-gap analysis, MPI/NCCL overlap, end-to-end application tracing, and system-level scheduling. Those belong in a system profiler workflow, not this kernel-only flow.

Script-first agent behavior

The files in scripts/ are the executable implementation of this skill. Agents should call them directly instead of regenerating equivalent shell or Python logic:

scripts/generate_profile_target.py for ./profile/<profile_id>/profile-target.yaml
scripts/ncu_collect_kernel_profile.sh for staged profiling
scripts/discover_kernels.sh only as a kernel-selection fallback
scripts/extract_ncu_metrics.py, scripts/generate_source_hotspots.py, scripts/compare_profiles.py, and scripts/visualize_profile_report.py for post-processing

If a script cannot express a required capability, patch the existing script first, then call it. Direct ncu commands are backend reference details, not the normal agent execution path.

Installation

Install it as a local Codex skill by cloning this repository, or run the copy command from this directory:

# Linux/macOS
mkdir -p ~/.codex/skills/kernel-profiler-skill
cp -R . ~/.codex/skills/kernel-profiler-skill/

# Windows PowerShell
New-Item -ItemType Directory -Force "$env:USERPROFILE\.codex\skills\kernel-profiler-skill"
Copy-Item -Recurse -Force . "$env:USERPROFILE\.codex\skills\kernel-profiler-skill"

Restart or reload Codex. To use the skill, open the CUDA/Triton project you want to profile, then invoke /skill kernel-profiler-skill; the skill itself is loaded from ~/.codex/skills/kernel-profiler-skill.

Primary usage: prompt-driven profiling

The main workflow starts from a short request in the coding agent, for example:

/skill kernel-profiler-skill, profile kernel hgemm_byzj_v0 and generate a visual report

The agent should convert that request into a reproducible profiling run:

Parse the request: kernel hint hgemm_byzj_v0, visualization enabled, default profile level basic, source mapping enabled unless disabled, and default backend nvidia-ncu.
Resolve the target command from the repository when it is not provided. Typical search locations are README, CMakeLists.txt, Makefile, build/, bin/, examples/, benchmark scripts, and project run scripts.
Generate ./profile/<profile_id>/profile-target.yaml by calling scripts/generate_profile_target.py.
Use the kernel name directly as the initial filter. The default generated filter for hgemm_byzj_v0 is .*hgemm_byzj_v0.*.
Run scripts/ncu_collect_kernel_profile.sh --stages auto: the collector profiles basic, inspects details/metrics_summary.json, then runs one evidence-justified follow-up stage such as memory, compute, occupancy, or speed-of-light. Use --stages all only when explicitly requested.
Generate source hotspots, comparison, and visual reports with the existing scripts when requested.
Write the final artifacts under ./profile/<kernel_name>_<profile_id>/.

Manual usage is still supported: profile-target.yaml can be generated by hand from assets/templates/profile-target.yaml or edited after generation. Place it under the matching profile directory, for example ./profile/hgemm_byzj_v0_20260515_120000/profile-target.yaml.

Example generator call produced from the prompt above, once the benchmark command has been resolved:

python3 scripts/generate_profile_target.py \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel hgemm_byzj_v0 \
  --requirement "visual report"

Expected kernel section:

kernel:
  name: hgemm_byzj_v0
  filter_mode: regex
  filter: .*hgemm_byzj_v0.*
  allow_discovery_fallback: true

Discovery is a fallback for missing, failed, or ambiguous filters. It is not a required extra step for a named kernel.

What the skill can do

Kernel target generation

A profile run is described by ./profile/<profile_id>/profile-target.yaml. The target can be written by hand from assets/templates/profile-target.yaml or generated from a short request.

The generator supports:

native executable targets,
Python/Triton targets,
kernel-name hints,
automatic regex filter generation from the kernel name,
visual report requests,
source/SASS/PTX requests,
Roofline requests,
regression comparison requests,
unsupported requirement recording.

For a named kernel such as hgemm_byzj_v0, the normal path is to use the name directly:

kernel:
  name: hgemm_byzj_v0
  filter_mode: regex
  filter: .*hgemm_byzj_v0.*
  allow_discovery_fallback: true

Kernel discovery is reserved for missing, failed, or ambiguous filters. It is not used as an extra step when a valid kernel name is already available.

Staged Nsight Compute collection

The standard collector can run profiling in evidence-oriented stages with --stages. The normal autonomous-agent path is a small first pass, followed by targeted follow-up stages:

basic pass for launch configuration, high-level utilization, and obvious bottlenecks.
SpeedOfLight pass for compute-vs-memory direction.
MemoryWorkloadAnalysis pass for memory hierarchy, memory throughput, transactions, sectors, cache behavior, shared-memory conflicts, and local memory.
ComputeWorkloadAnalysis plus InstructionStats for pipeline utilization, IPC, instruction mix, Tensor Core use, FP/INT/SFU/LDST behavior, and instruction categories.
LaunchStats, Occupancy, scheduler, and warp-state analysis for active warps, theoretical/achieved occupancy, resident blocks, register/shared-memory limits, and stall reasons.
Roofline analysis when supported by the backend.
Source/SASS/PTX attribution when source mapping is available.

Raw artifacts are kept under details/; the final report is normalized and concise.

Triton kernel profiling

Triton kernels are profiled through the Python process that launches them. The selected object remains the generated GPU kernel, not the Python function runtime.

The Triton flow handles:

Python command targets,
Triton JIT warmup,
autotune contamination control,
launch-skip defaults for JIT/autotune/warmup,
best-effort Python/Triton source mapping,
fallback discovery when the Python function name differs from the generated GPU kernel name.

Recommended final-evidence runs should use a fixed Triton config where possible, then skip warmup launches before collecting the profiled launch.

Permission handling

Some profiler counters require elevated privileges. Two execution modes are supported:

Mode	Purpose
`none`	Run without sudo.
`full_sudo`	Run `sudo -n "$(cat ./profile/ncu_path)"`; requires root or narrow `NOPASSWD` access for the path stored in `./profile/ncu_path`.

Plaintext password storage is not implemented. Passwords must not be written into YAML, scripts, logs, environment variables, shell history, generated commands, reports, or files such as profile/sudokey. At skill start, ./profile/ncu_path is created if missing with /usr/local/cuda/bin/ncu. If ncu requires sudo, the agent should stop the profile, show the NOPASSWD guide, ask the user to run command -v ncu and readlink -f "$(command -v ncu)", and require the user to configure NOPASSWD for one selected path and write that same path into ./profile/ncu_path.

On multi-CUDA servers, keep sudo profiling with the same configured ./profile/ncu_path:

mkdir -p ./profile
printf '%s\n' '/usr/local/cuda-12.9/bin/ncu' > ./profile/ncu_path

scripts/ncu_collect_kernel_profile.sh \
  --sudo \
  ...

Metric extraction

Large raw CSV files are reduced into compact machine-readable summaries:

details/metrics_summary.json
details/metrics_extracted.jsonl

These summaries are the preferred inputs for report generation, regression comparison, and optional rule-based triage.

Source hotspot table

The source hotspot stage writes:

details/source_hotspots.csv

Standard columns:

source_file,line,function,sass_opcode,ptx_opcode,metric,value,stall_reason,bottleneck_class,confidence,evidence

Each major conclusion should point to source lines, SASS/PTX opcode families, or a documented reason why source correlation is unavailable.

Optional bottleneck decision engine

The rule engine is disabled by default. It can classify common bottleneck patterns from extracted metrics using rules in:

references/portable/rules/bottleneck_rules.yaml

The output is advisory. The final report still needs explicit metric evidence.

Optional before/after regression analysis

Regression mode is disabled by default. When enabled, it compares the current profile to a selected or automatically discovered previous profile for the same kernel prefix.

The comparison report includes:

baseline value,
current value,
absolute delta,
percentage delta,
metric-level judgment,
overall verdict,
missing-data and inconclusive states,
lower-is-better and higher-is-better semantics,
a default ±2% noise band for normal random variation.

Optional visual report

A visual summary can be generated from the final report and extracted details. The script checks for required Python packages and should return installation guidance instead of producing partial images when dependencies are missing.

Vendor portability

The NVIDIA implementation is described by:

references/nvidia/vendor-adapter-nvidia-ncu.yaml

Portable mappings and taxonomy are placed under:

references/portable/

A backend port should provide equivalent support for discovery, filtering, staged collection, raw metric export, source/ISA correlation, metric aliases, and conformance checks.

Repository layout

kernel-profiler-skill/
├── SKILL.md
├── README.md
├── README.zh-CN.md
├── docs/
├── references/
├── scripts/
└── assets/

Important paths:

assets/templates/profile-target.yaml          target schema template
assets/templates/final_report_template.md     normalized report template
assets/templates/run_manifest_template.yaml   run metadata template
assets/examples/profile-target.example.yaml   native CUDA example
assets/examples/profile-target.triton.example.yaml  Python/Triton example

docs/workload-stabilization-guide.md          warmup, repeatability, launch selection
docs/triton-kernel-profiling.md               Triton-specific profiling notes
docs/vendor-conformance.md                    backend capability checks
docs/ncu-nopasswd-guide.zh-CN.md              exact-path NOPASSWD setup guide
docs/legal/NOTICE.md                          attribution and redistribution notes

references/nvidia/architectures/gpu_specs.yaml      machine-readable NVIDIA GPU limits
references/nvidia/ncu-guide.md                      link to upstream Nsight Compute workflow notes
references/nvidia/ProfilingGuide.md                 link to upstream profiling guide reference
references/nvidia/ncu-metric-map.md                 metric-to-bottleneck map
references/nvidia/vendor-adapter-nvidia-ncu.yaml    default backend adapter
references/portable/metric-aliases.yaml             vendor-neutral metric aliases
references/portable/portable-bottleneck-taxonomy.md bottleneck classes
references/portable/vendor-porting-guide.md         backend migration guide
references/portable/rules/bottleneck_rules.yaml     optional rule engine rules

scripts/generate_profile_target.py            target generator
scripts/ncu_collect_kernel_profile.sh         staged Nsight Compute collector
scripts/discover_kernels.sh                   fallback kernel discovery
scripts/extract_ncu_metrics.py                raw metric extractor
scripts/generate_source_hotspots.py           source hotspot table generator
scripts/bottleneck_decision_engine.py         optional rule-based triage
scripts/compare_profiles.py                   optional before/after comparison
scripts/visualize_profile_report.py           optional visual report generator
scripts/vendor_conformance_check.py           backend adapter check

`profile-target.yaml` parameter reference

./profile/<profile_id>/profile-target.yaml is the normalized contract between the request parser, the collector scripts, and the final report generator. It is generated automatically in the normal prompt-driven workflow, but the fields are intentionally readable and safe to edit.

`schema_version`

Schema identifier for compatibility checks. Current value: 3.0.

`target`

Describes the command that launches the kernel.

Field	Meaning	Typical value
`runtime`	Runtime family.	`native`, `python`, or `python-triton`
`executable`	Program to execute.	`./build/bench_gemm`, `python3`
`working_directory`	Directory used when running the command.	`.`
`args`	Argument vector passed to `executable`.	`['--m', '4096']`
`env`	Environment variables for the run.	`{CUDA_VISIBLE_DEVICES: '0'}`
`stdin`	Optional stdin payload or file reference.	`null`
`timeout_seconds`	Optional command timeout.	`null` or integer
`python.interpreter`	Python executable when runtime is Python-based.	`python3`
`python.entry_kind`	Python entry type.	`script`, `module`, `command`, or `null`
`python.script`	Script path for Python script targets.	`bench_triton_hgemm.py`
`python.module`	Module name for `python -m`.	`package.bench`
`python.jit_framework`	JIT framework marker.	`triton` for Triton

`runtime_options`

Runtime-specific controls, mainly for Python/Triton.

Field	Meaning
`triton_enabled`	Enables Triton-specific stabilization and reporting notes.
`triton_kernel_hint`	Original Triton function/kernel hint from the request.
`triton_autotune_policy`	`fixed_config`, `warmup_or_fixed_config`, or `allow_autotune`.
`jit_warmup_required`	Marks JIT warmup as required before evidence collection.
`recommended_warmup_skip`	Suggested skip count for JIT/autotune/warmup launches.
`require_cuda_synchronize_around_profile_window`	Requests explicit synchronization around benchmark windows when practical.
`source_mapping_expectation`	Source attribution expectation; Triton is usually `best_effort`.
`notes`	Runtime-specific caveats to carry into the report.

`kernel`

Selects the profiled kernel.

Field	Meaning
`name`	Human-readable kernel label from the request.
`filter_mode`	`exact`, `regex`, `kernel-id`, `nvtx-range`, `vendor-specific`, or `auto`.
`filter`	Backend filter. Named kernels default to `.<kernel_name>.`.
`profile_id`	Stable output identifier; `auto` generates one from kernel name and timestamp.
`nvtx_range`	Optional NVTX range filter.
`allow_discovery_fallback`	Allows discovery only when the initial filter is missing, fails, or is ambiguous.

`profiling`

Controls profile depth and output location.

Field	Meaning
`backend`	Profiler backend adapter, default `nvidia-ncu`.
`initial_level`	`basic` or `full`. Default is `basic`.
`warmup_policy`	`auto`, `fixed`, or `none`.
`warmup_skip`	Number of matching launches to skip before collection.
`launch_count`	Number of matching launches to collect.
`enable_source_mapping`	Enables source/SASS/PTX attribution when available.
`enable_visual_report`	Enables visual report generation.
`extra_profiler_options`	Backend-specific extra options.
`ncu_bin`	Nsight Compute CLI command for non-sudo mode, default `ncu`; `full_sudo` mode reads `./profile/ncu_path`.
`output_root`	Root directory for generated profile folders.

`privilege`

Defines how privileged profiler access is handled.

Field	Meaning
`mode`	`none` or `full_sudo`.
`full_sudo_policy`	Restricts sudo to root or narrow `NOPASSWD` for the exact path in `./profile/ncu_path`.
`password_storage`	Must remain `forbidden`.
`forbidden`	Explicit safety rules: never store, read, print, pipe, or auto-type sudo passwords.

`discovery`

Fallback kernel discovery settings. Discovery is not the default path for a named kernel.

Field	Meaning
`enabled_when_filter_missing`	Allows discovery if no filter is available.
`fallback_when_kernel_filter_misses`	Allows discovery if the generated filter matches nothing.
`max_candidates`	Maximum candidates to retain.
`prefer_user_named_kernel`	Ranks name matches before generic cost ranking.
`rank_by`	Candidate ranking keys.
`command_extra_options`	Extra options used only during discovery.

`analysis`

Controls optional and targeted analysis stages.

Field	Meaning
`collect_memory`	`auto`, `true`, or `false` for memory analysis.
`collect_compute`	`auto`, `true`, or `false` for compute analysis.
`collect_occupancy`	`auto`, `true`, or `false` for occupancy/scheduler analysis.
`collect_roofline`	`auto`, `true`, or `false` for Roofline.
`collect_source`	`auto`, `true`, or `false` for source/SASS/PTX collection.
`enable_bottleneck_decision_engine`	Optional rule engine; default `false`.
`enable_regression_compare`	Optional before/after comparison; default `false`.
`compare_baseline`	`auto`, explicit path, or `null`.
`random_variation_tolerance_pct`	Default noise band for regression decisions; default `2.0`.
`generate_source_hotspots_table`	Generates `details/source_hotspots.csv`.

`visualization`

Controls the optional visual report.

Field	Meaning
`enabled`	Enables visual report generation.
`format`	Output image format, usually `png`.
`require_python_packages`	Required Python packages checked before rendering.

`notes`

Carries request text and unsupported requirements into the report.

Field	Meaning
`user_requirements`	Original extra requirements parsed from the request.
`unsupported_or_deferred_requirements`	Requirements outside the kernel-only or supported profiler scope.

Quick start: native CUDA executable

Generate a target file:

python3 scripts/generate_profile_target.py \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel hgemm_byzj_v0 \
  --requirement "visual report, source, roofline"

Run the first evidence pass:

scripts/ncu_collect_kernel_profile.sh \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel-name hgemm_byzj_v0 \
  --kernel-regex ".*hgemm_byzj_v0.*" \
  --launch-skip 10 \
  --launch-count 1 \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
  --stages auto

Then run targeted stages only when justified by the first pass, for example:

scripts/ncu_collect_kernel_profile.sh \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel-name hgemm_byzj_v0 \
  --kernel-regex ".*hgemm_byzj_v0.*" \
  --launch-skip 10 \
  --launch-count 1 \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
  --stages memory,source

The collector writes raw CSV files directly, does not generate .ncu-rep, and runs compact metric extraction automatically when raw metrics are available. Manual retry:

python3 scripts/extract_ncu_metrics.py \
  --input ./profile/hgemm_byzj_v0_20260515_120000/details/metrics_raw.csv \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000/details

Generate a visual summary when enabled:

python3 scripts/visualize_profile_report.py \
  ./profile/hgemm_byzj_v0_20260515_120000/final_report.md \
  ./profile/hgemm_byzj_v0_20260515_120000/details \
  ./profile/hgemm_byzj_v0_20260515_120000/visual/profile_summary.png

Quick start: Triton kernel

Generate a Triton target:

python3 scripts/generate_profile_target.py \
  --runtime python-triton \
  --target-cmd "python3 bench_triton_hgemm.py --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel hgemm_byzj_v0 \
  --requirement "triton, visual report, source, roofline"

Collect the profile:

scripts/ncu_collect_kernel_profile.sh \
  --runtime python-triton \
  --target-cmd "python3 bench_triton_hgemm.py --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel-name hgemm_byzj_v0 \
  --kernel-regex ".*hgemm_byzj_v0.*" \
  --launch-skip 20 \
  --launch-count 1 \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
  --stages auto

For final Triton evidence, keep the profiled launch window stable: warm up JIT, avoid collecting autotune candidates, synchronize around measured loops, and fix kernel configs when possible.

Output structure

A completed profile should use this layout:

profile/<kernel_name>_<profile_id>/
├── profile-target.yaml
├── final_report.md
├── run_manifest.yaml
├── details/
│   ├── 01_basic_raw.csv
│   ├── 02_speed_of_light_raw.csv
│   ├── 03_memory_raw.csv
│   ├── 04_compute_raw.csv
│   ├── 05_occupancy_launch_raw.csv
│   ├── 06_roofline_raw.csv
│   ├── 07_source_raw.csv
│   ├── metrics_summary.json
│   ├── metrics_extracted.jsonl
│   └── source_hotspots.csv
├── visual/
│   └── profile_summary.png
└── comparison/
    └── regression_report.md

Not every optional directory is required. visual/ appears only when visualization is enabled. comparison/ appears only when regression mode is enabled.

Final report contents

The normalized report should include:

target command and runtime,
GPU model, driver, CUDA/profiler version when available,
kernel filter and resolved kernel name,
launch skip/count and stabilization notes,
basic profile summary,
SpeedOfLight direction,
memory analysis,
compute analysis,
occupancy and scheduler analysis,
Roofline result when available,
source/SASS/PTX hotspot table or limitations,
bottleneck classification with metric evidence,
optimization recommendations ranked by confidence,
optional visual report path,
optional before/after comparison,
unresolved questions and data-quality notes.

Backend conformance

Run the adapter check:

python3 scripts/vendor_conformance_check.py \
  --adapter references/nvidia/vendor-adapter-nvidia-ncu.yaml

A conforming backend should declare support or explicit non-support for:

kernel discovery,
exact/regex/kernel-id/range filtering,
basic profile collection,
memory analysis,
compute analysis,
occupancy analysis,
Roofline analysis,
raw metric export,
source/ISA correlation,
permission handling.

References and Attribution

This skill is built around a portable kernel-profiling workflow. The default backend is NVIDIA Nsight Compute, while the profiling model is intentionally separated from the vendor-specific command layer so that another GPU vendor can replace the backend adapter and metric mapping.

Upstream documentation

NVIDIA Nsight Compute documentation
Used as the primary reference for Nsight Compute CLI behavior, report concepts, metric collection, sections, sets, source/SASS correlation, replay modes, and profiling workflow.
NVIDIA CUDA C++ Programming Guide and CUDA Compute Capability documentation
Used for architecture-level limits such as warp size, maximum resident blocks, maximum resident warps, registers per SM, shared memory limits, and occupancy-related interpretation.
NVIDIA CUDA C++ Best Practices Guide
Used for CUDA kernel optimization concepts such as memory coalescing, occupancy, arithmetic intensity, memory hierarchy behavior, and performance measurement practices.

Third-party reference material

The following files are local pointer files to cuda-skill references in the open-source repository slowlyC/agent-gpu-skills:

references/nvidia/ncu-guide.md
references/nvidia/ProfilingGuide.md

Original source repository:

https://github.com/slowlyC/agent-gpu-skills

Attribution:

Original repository: slowlyC/agent-gpu-skills
Original author / maintainer: slowlyC
Source skill: cuda-skill
Relevant upstream reference paths include:
- references/ncu-guide.md
- references/ncu-docs/ProfilingGuide.md
- references/best-practices-guide/

These files intentionally do not embed the upstream documentation text. If redistributing this skill, keep this attribution section and follow any upstream license and notice requirements from slowlyC/agent-gpu-skills.

Local reference layout

references/nvidia/ncu-guide.md
Pointer to the upstream Nsight Compute command and workflow reference in slowlyC/agent-gpu-skills.
references/nvidia/ProfilingGuide.md
Pointer to the upstream Nsight Compute profiling guide reference in slowlyC/agent-gpu-skills.
references/nvidia/vendor-adapter-nvidia-ncu.yaml
Backend adapter that maps the portable profiling workflow to Nsight Compute commands, sections, sets, and report artifacts.
references/nvidia/architectures/gpu_specs.yaml
Machine-readable NVIDIA GPU architecture and device-spec database for architecture-aware profiling interpretation.
references/nvidia/architectures/architecture-notes.md
Human-readable architecture notes for interpreting kernel bottlenecks across NVIDIA GPU generations.
references/portable/metric-aliases.yaml
Portable metric alias table used to decouple analysis logic from vendor-specific metric names.
references/portable/portable-bottleneck-taxonomy.md
Vendor-neutral bottleneck taxonomy used by reports and optional rule-based analysis.
references/portable/vendor-porting-guide.md
Guide for replacing the Nsight Compute backend with another vendor profiler.
references/portable/rules/bottleneck_rules.yaml
Optional rule set for the bottleneck decision engine. Disabled by default.

Changelog

v1.0

Stable kernel-only profiling package with native CUDA and Python/Triton support. The package includes target generation, staged Nsight Compute collection, optional visual reports, optional before/after regression comparison, optional bottleneck decision rules, source hotspot extraction, exact-path ncu NOPASSWD guidance, NVIDIA hardware metadata, vendor-portable metric aliases, and backend conformance checks.

v0.6

Consolidated the package layout into SKILL.md, README.md, README.zh-CN.md, docs/, references/, scripts/, and assets/. Moved templates and examples under assets/, moved conformance documentation under docs/, and kept executable utilities under scripts/.

v0.5

Defined the privilege model: no-sudo execution and exact-path full_sudo execution through sudo -n. Kernel-name requests now generate a direct regex filter by default, with discovery only as a fallback.

v0.4

Added privileged collection guidance for environments where Nsight Compute requires performance-counter permissions. Refined README tone and clarified how privileged collection should be run.

v0.3

Replaced prose-only NVIDIA architecture notes with references/nvidia/architectures/gpu_specs.yaml and architecture notes. Added target generation, metric extraction, optional bottleneck rules, source hotspot schema, regression comparison, workload stabilization guidance, and backend conformance checks.

v0.2

Added Chinese README, expanded NVIDIA architecture and profiling references, made kernel filtering optional in handwritten targets, and documented future industrialization improvements.

v0.1

Initial kernel-only profiling skill with Nsight Compute as the default backend, staged profile collection, target schema, normalized final report template, visual report script, and vendor-portability guidance.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
assets		assets
docs		docs
references		references
scripts		scripts
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SKILL.md		SKILL.md

Folders and files

Latest commit

History

Repository files navigation

Kernel Profiler Skill

Overview

Script-first agent behavior

Installation

Primary usage: prompt-driven profiling

What the skill can do

Kernel target generation

Staged Nsight Compute collection

Triton kernel profiling

Permission handling

Metric extraction

Source hotspot table

Optional bottleneck decision engine

Optional before/after regression analysis

Optional visual report

Vendor portability

Repository layout

profile-target.yaml parameter reference

schema_version

target

runtime_options

kernel

profiling

privilege

discovery

analysis

visualization

notes

Quick start: native CUDA executable

Quick start: Triton kernel

Output structure

Final report contents

Backend conformance

References and Attribution

Upstream documentation

Third-party reference material

Local reference layout

Changelog

v1.0

v0.6

v0.5

v0.4

v0.3

v0.2

v0.1

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`profile-target.yaml` parameter reference

`schema_version`

`target`

`runtime_options`

`kernel`

`profiling`

`privilege`

`discovery`

`analysis`

`visualization`

`notes`

Packages