Skip to content

ZJtoast/kernel-profiler-skill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kernel Profiler Skill

An agent skill for GPU kernel/operator profiling with Nsight Compute and portable profiler backends

Cursor · Claude Code · Codex · Gemini CLI · CUDA · Triton · Nsight Compute


Overview

Kernel Profiler Skill is an agent skill for GPU kernel/operator profiling in Cursor, Claude Code, Codex, and Gemini CLI. It provides a structured workflow for investigating a single GPU kernel or a tightly related kernel family, including launch configuration, occupancy, memory behavior, compute pipeline use, warp stalls, instruction mix, source-line attribution, and before/after regression.

The default backend is NVIDIA Nsight Compute (ncu). The backend layer is intentionally isolated so the same profiling plan can be adapted to other GPU vendors by replacing the profiler adapter, metric aliases, and architecture database.

Supported target styles include:

  • native CUDA executables,
  • benchmark commands built by CMake/Make/custom build systems,
  • Python processes that launch CUDA kernels,
  • Triton JIT kernels launched from Python.

Out of scope by default: CPU timeline analysis, dataloaders, distributed communication, launch-gap analysis, MPI/NCCL overlap, end-to-end application tracing, and system-level scheduling. Those belong in a system profiler workflow, not this kernel-only flow.

Script-first agent behavior

The files in scripts/ are the executable implementation of this skill. Agents should call them directly instead of regenerating equivalent shell or Python logic:

  • scripts/generate_profile_target.py for ./profile/<profile_id>/profile-target.yaml
  • scripts/ncu_collect_kernel_profile.sh for staged profiling
  • scripts/discover_kernels.sh only as a kernel-selection fallback
  • scripts/extract_ncu_metrics.py, scripts/generate_source_hotspots.py, scripts/compare_profiles.py, and scripts/visualize_profile_report.py for post-processing

If a script cannot express a required capability, patch the existing script first, then call it. Direct ncu commands are backend reference details, not the normal agent execution path.

Installation

Install it as a local Codex skill by cloning this repository, or run the copy command from this directory:

# Linux/macOS
mkdir -p ~/.codex/skills/kernel-profiler-skill
cp -R . ~/.codex/skills/kernel-profiler-skill/

# Windows PowerShell
New-Item -ItemType Directory -Force "$env:USERPROFILE\.codex\skills\kernel-profiler-skill"
Copy-Item -Recurse -Force . "$env:USERPROFILE\.codex\skills\kernel-profiler-skill"

Restart or reload Codex. To use the skill, open the CUDA/Triton project you want to profile, then invoke /skill kernel-profiler-skill; the skill itself is loaded from ~/.codex/skills/kernel-profiler-skill.

Primary usage: prompt-driven profiling

The main workflow starts from a short request in the coding agent, for example:

/skill kernel-profiler-skill, profile kernel hgemm_byzj_v0 and generate a visual report

The agent should convert that request into a reproducible profiling run:

  1. Parse the request: kernel hint hgemm_byzj_v0, visualization enabled, default profile level basic, source mapping enabled unless disabled, and default backend nvidia-ncu.
  2. Resolve the target command from the repository when it is not provided. Typical search locations are README, CMakeLists.txt, Makefile, build/, bin/, examples/, benchmark scripts, and project run scripts.
  3. Generate ./profile/<profile_id>/profile-target.yaml by calling scripts/generate_profile_target.py.
  4. Use the kernel name directly as the initial filter. The default generated filter for hgemm_byzj_v0 is .*hgemm_byzj_v0.*.
  5. Run scripts/ncu_collect_kernel_profile.sh --stages auto: the collector profiles basic, inspects details/metrics_summary.json, then runs one evidence-justified follow-up stage such as memory, compute, occupancy, or speed-of-light. Use --stages all only when explicitly requested.
  6. Generate source hotspots, comparison, and visual reports with the existing scripts when requested.
  7. Write the final artifacts under ./profile/<kernel_name>_<profile_id>/.

Manual usage is still supported: profile-target.yaml can be generated by hand from assets/templates/profile-target.yaml or edited after generation. Place it under the matching profile directory, for example ./profile/hgemm_byzj_v0_20260515_120000/profile-target.yaml.

Example generator call produced from the prompt above, once the benchmark command has been resolved:

python3 scripts/generate_profile_target.py \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel hgemm_byzj_v0 \
  --requirement "visual report"

Expected kernel section:

kernel:
  name: hgemm_byzj_v0
  filter_mode: regex
  filter: .*hgemm_byzj_v0.*
  allow_discovery_fallback: true

Discovery is a fallback for missing, failed, or ambiguous filters. It is not a required extra step for a named kernel.

What the skill can do

Kernel target generation

A profile run is described by ./profile/<profile_id>/profile-target.yaml. The target can be written by hand from assets/templates/profile-target.yaml or generated from a short request.

The generator supports:

  • native executable targets,
  • Python/Triton targets,
  • kernel-name hints,
  • automatic regex filter generation from the kernel name,
  • visual report requests,
  • source/SASS/PTX requests,
  • Roofline requests,
  • regression comparison requests,
  • unsupported requirement recording.

For a named kernel such as hgemm_byzj_v0, the normal path is to use the name directly:

kernel:
  name: hgemm_byzj_v0
  filter_mode: regex
  filter: .*hgemm_byzj_v0.*
  allow_discovery_fallback: true

Kernel discovery is reserved for missing, failed, or ambiguous filters. It is not used as an extra step when a valid kernel name is already available.

Staged Nsight Compute collection

The standard collector can run profiling in evidence-oriented stages with --stages. The normal autonomous-agent path is a small first pass, followed by targeted follow-up stages:

  1. basic pass for launch configuration, high-level utilization, and obvious bottlenecks.
  2. SpeedOfLight pass for compute-vs-memory direction.
  3. MemoryWorkloadAnalysis pass for memory hierarchy, memory throughput, transactions, sectors, cache behavior, shared-memory conflicts, and local memory.
  4. ComputeWorkloadAnalysis plus InstructionStats for pipeline utilization, IPC, instruction mix, Tensor Core use, FP/INT/SFU/LDST behavior, and instruction categories.
  5. LaunchStats, Occupancy, scheduler, and warp-state analysis for active warps, theoretical/achieved occupancy, resident blocks, register/shared-memory limits, and stall reasons.
  6. Roofline analysis when supported by the backend.
  7. Source/SASS/PTX attribution when source mapping is available.

Raw artifacts are kept under details/; the final report is normalized and concise.

Triton kernel profiling

Triton kernels are profiled through the Python process that launches them. The selected object remains the generated GPU kernel, not the Python function runtime.

The Triton flow handles:

  • Python command targets,
  • Triton JIT warmup,
  • autotune contamination control,
  • launch-skip defaults for JIT/autotune/warmup,
  • best-effort Python/Triton source mapping,
  • fallback discovery when the Python function name differs from the generated GPU kernel name.

Recommended final-evidence runs should use a fixed Triton config where possible, then skip warmup launches before collecting the profiled launch.

Permission handling

Some profiler counters require elevated privileges. Two execution modes are supported:

Mode Purpose
none Run without sudo.
full_sudo Run sudo -n "$(cat ./profile/ncu_path)"; requires root or narrow NOPASSWD access for the path stored in ./profile/ncu_path.

Plaintext password storage is not implemented. Passwords must not be written into YAML, scripts, logs, environment variables, shell history, generated commands, reports, or files such as profile/sudokey. At skill start, ./profile/ncu_path is created if missing with /usr/local/cuda/bin/ncu. If ncu requires sudo, the agent should stop the profile, show the NOPASSWD guide, ask the user to run command -v ncu and readlink -f "$(command -v ncu)", and require the user to configure NOPASSWD for one selected path and write that same path into ./profile/ncu_path.

On multi-CUDA servers, keep sudo profiling with the same configured ./profile/ncu_path:

mkdir -p ./profile
printf '%s\n' '/usr/local/cuda-12.9/bin/ncu' > ./profile/ncu_path

scripts/ncu_collect_kernel_profile.sh \
  --sudo \
  ...

Metric extraction

Large raw CSV files are reduced into compact machine-readable summaries:

details/metrics_summary.json
details/metrics_extracted.jsonl

These summaries are the preferred inputs for report generation, regression comparison, and optional rule-based triage.

Source hotspot table

The source hotspot stage writes:

details/source_hotspots.csv

Standard columns:

source_file,line,function,sass_opcode,ptx_opcode,metric,value,stall_reason,bottleneck_class,confidence,evidence

Each major conclusion should point to source lines, SASS/PTX opcode families, or a documented reason why source correlation is unavailable.

Optional bottleneck decision engine

The rule engine is disabled by default. It can classify common bottleneck patterns from extracted metrics using rules in:

references/portable/rules/bottleneck_rules.yaml

The output is advisory. The final report still needs explicit metric evidence.

Optional before/after regression analysis

Regression mode is disabled by default. When enabled, it compares the current profile to a selected or automatically discovered previous profile for the same kernel prefix.

The comparison report includes:

  • baseline value,
  • current value,
  • absolute delta,
  • percentage delta,
  • metric-level judgment,
  • overall verdict,
  • missing-data and inconclusive states,
  • lower-is-better and higher-is-better semantics,
  • a default ±2% noise band for normal random variation.

Optional visual report

A visual summary can be generated from the final report and extracted details. The script checks for required Python packages and should return installation guidance instead of producing partial images when dependencies are missing.

Vendor portability

The NVIDIA implementation is described by:

references/nvidia/vendor-adapter-nvidia-ncu.yaml

Portable mappings and taxonomy are placed under:

references/portable/

A backend port should provide equivalent support for discovery, filtering, staged collection, raw metric export, source/ISA correlation, metric aliases, and conformance checks.

Repository layout

kernel-profiler-skill/
├── SKILL.md
├── README.md
├── README.zh-CN.md
├── docs/
├── references/
├── scripts/
└── assets/

Important paths:

assets/templates/profile-target.yaml          target schema template
assets/templates/final_report_template.md     normalized report template
assets/templates/run_manifest_template.yaml   run metadata template
assets/examples/profile-target.example.yaml   native CUDA example
assets/examples/profile-target.triton.example.yaml  Python/Triton example

docs/workload-stabilization-guide.md          warmup, repeatability, launch selection
docs/triton-kernel-profiling.md               Triton-specific profiling notes
docs/vendor-conformance.md                    backend capability checks
docs/ncu-nopasswd-guide.zh-CN.md              exact-path NOPASSWD setup guide
docs/legal/NOTICE.md                          attribution and redistribution notes

references/nvidia/architectures/gpu_specs.yaml      machine-readable NVIDIA GPU limits
references/nvidia/ncu-guide.md                      link to upstream Nsight Compute workflow notes
references/nvidia/ProfilingGuide.md                 link to upstream profiling guide reference
references/nvidia/ncu-metric-map.md                 metric-to-bottleneck map
references/nvidia/vendor-adapter-nvidia-ncu.yaml    default backend adapter
references/portable/metric-aliases.yaml             vendor-neutral metric aliases
references/portable/portable-bottleneck-taxonomy.md bottleneck classes
references/portable/vendor-porting-guide.md         backend migration guide
references/portable/rules/bottleneck_rules.yaml     optional rule engine rules

scripts/generate_profile_target.py            target generator
scripts/ncu_collect_kernel_profile.sh         staged Nsight Compute collector
scripts/discover_kernels.sh                   fallback kernel discovery
scripts/extract_ncu_metrics.py                raw metric extractor
scripts/generate_source_hotspots.py           source hotspot table generator
scripts/bottleneck_decision_engine.py         optional rule-based triage
scripts/compare_profiles.py                   optional before/after comparison
scripts/visualize_profile_report.py           optional visual report generator
scripts/vendor_conformance_check.py           backend adapter check

profile-target.yaml parameter reference

./profile/<profile_id>/profile-target.yaml is the normalized contract between the request parser, the collector scripts, and the final report generator. It is generated automatically in the normal prompt-driven workflow, but the fields are intentionally readable and safe to edit.

schema_version

Schema identifier for compatibility checks. Current value: 3.0.

target

Describes the command that launches the kernel.

Field Meaning Typical value
runtime Runtime family. native, python, or python-triton
executable Program to execute. ./build/bench_gemm, python3
working_directory Directory used when running the command. .
args Argument vector passed to executable. ['--m', '4096']
env Environment variables for the run. {CUDA_VISIBLE_DEVICES: '0'}
stdin Optional stdin payload or file reference. null
timeout_seconds Optional command timeout. null or integer
python.interpreter Python executable when runtime is Python-based. python3
python.entry_kind Python entry type. script, module, command, or null
python.script Script path for Python script targets. bench_triton_hgemm.py
python.module Module name for python -m. package.bench
python.jit_framework JIT framework marker. triton for Triton

runtime_options

Runtime-specific controls, mainly for Python/Triton.

Field Meaning
triton_enabled Enables Triton-specific stabilization and reporting notes.
triton_kernel_hint Original Triton function/kernel hint from the request.
triton_autotune_policy fixed_config, warmup_or_fixed_config, or allow_autotune.
jit_warmup_required Marks JIT warmup as required before evidence collection.
recommended_warmup_skip Suggested skip count for JIT/autotune/warmup launches.
require_cuda_synchronize_around_profile_window Requests explicit synchronization around benchmark windows when practical.
source_mapping_expectation Source attribution expectation; Triton is usually best_effort.
notes Runtime-specific caveats to carry into the report.

kernel

Selects the profiled kernel.

Field Meaning
name Human-readable kernel label from the request.
filter_mode exact, regex, kernel-id, nvtx-range, vendor-specific, or auto.
filter Backend filter. Named kernels default to .*<kernel_name>.*.
profile_id Stable output identifier; auto generates one from kernel name and timestamp.
nvtx_range Optional NVTX range filter.
allow_discovery_fallback Allows discovery only when the initial filter is missing, fails, or is ambiguous.

profiling

Controls profile depth and output location.

Field Meaning
backend Profiler backend adapter, default nvidia-ncu.
initial_level basic or full. Default is basic.
warmup_policy auto, fixed, or none.
warmup_skip Number of matching launches to skip before collection.
launch_count Number of matching launches to collect.
enable_source_mapping Enables source/SASS/PTX attribution when available.
enable_visual_report Enables visual report generation.
extra_profiler_options Backend-specific extra options.
ncu_bin Nsight Compute CLI command for non-sudo mode, default ncu; full_sudo mode reads ./profile/ncu_path.
output_root Root directory for generated profile folders.

privilege

Defines how privileged profiler access is handled.

Field Meaning
mode none or full_sudo.
full_sudo_policy Restricts sudo to root or narrow NOPASSWD for the exact path in ./profile/ncu_path.
password_storage Must remain forbidden.
forbidden Explicit safety rules: never store, read, print, pipe, or auto-type sudo passwords.

discovery

Fallback kernel discovery settings. Discovery is not the default path for a named kernel.

Field Meaning
enabled_when_filter_missing Allows discovery if no filter is available.
fallback_when_kernel_filter_misses Allows discovery if the generated filter matches nothing.
max_candidates Maximum candidates to retain.
prefer_user_named_kernel Ranks name matches before generic cost ranking.
rank_by Candidate ranking keys.
command_extra_options Extra options used only during discovery.

analysis

Controls optional and targeted analysis stages.

Field Meaning
collect_memory auto, true, or false for memory analysis.
collect_compute auto, true, or false for compute analysis.
collect_occupancy auto, true, or false for occupancy/scheduler analysis.
collect_roofline auto, true, or false for Roofline.
collect_source auto, true, or false for source/SASS/PTX collection.
enable_bottleneck_decision_engine Optional rule engine; default false.
enable_regression_compare Optional before/after comparison; default false.
compare_baseline auto, explicit path, or null.
random_variation_tolerance_pct Default noise band for regression decisions; default 2.0.
generate_source_hotspots_table Generates details/source_hotspots.csv.

visualization

Controls the optional visual report.

Field Meaning
enabled Enables visual report generation.
format Output image format, usually png.
require_python_packages Required Python packages checked before rendering.

notes

Carries request text and unsupported requirements into the report.

Field Meaning
user_requirements Original extra requirements parsed from the request.
unsupported_or_deferred_requirements Requirements outside the kernel-only or supported profiler scope.

Quick start: native CUDA executable

Generate a target file:

python3 scripts/generate_profile_target.py \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel hgemm_byzj_v0 \
  --requirement "visual report, source, roofline"

Run the first evidence pass:

scripts/ncu_collect_kernel_profile.sh \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel-name hgemm_byzj_v0 \
  --kernel-regex ".*hgemm_byzj_v0.*" \
  --launch-skip 10 \
  --launch-count 1 \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
  --stages auto

Then run targeted stages only when justified by the first pass, for example:

scripts/ncu_collect_kernel_profile.sh \
  --target-cmd "./build/bench_gemm --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel-name hgemm_byzj_v0 \
  --kernel-regex ".*hgemm_byzj_v0.*" \
  --launch-skip 10 \
  --launch-count 1 \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
  --stages memory,source

The collector writes raw CSV files directly, does not generate .ncu-rep, and runs compact metric extraction automatically when raw metrics are available. Manual retry:

python3 scripts/extract_ncu_metrics.py \
  --input ./profile/hgemm_byzj_v0_20260515_120000/details/metrics_raw.csv \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000/details

Generate a visual summary when enabled:

python3 scripts/visualize_profile_report.py \
  ./profile/hgemm_byzj_v0_20260515_120000/final_report.md \
  ./profile/hgemm_byzj_v0_20260515_120000/details \
  ./profile/hgemm_byzj_v0_20260515_120000/visual/profile_summary.png

Quick start: Triton kernel

Generate a Triton target:

python3 scripts/generate_profile_target.py \
  --runtime python-triton \
  --target-cmd "python3 bench_triton_hgemm.py --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel hgemm_byzj_v0 \
  --requirement "triton, visual report, source, roofline"

Collect the profile:

scripts/ncu_collect_kernel_profile.sh \
  --runtime python-triton \
  --target-cmd "python3 bench_triton_hgemm.py --m 4096 --n 4096 --k 4096 --iters 100" \
  --kernel-name hgemm_byzj_v0 \
  --kernel-regex ".*hgemm_byzj_v0.*" \
  --launch-skip 20 \
  --launch-count 1 \
  --output-dir ./profile/hgemm_byzj_v0_20260515_120000 \
  --stages auto

For final Triton evidence, keep the profiled launch window stable: warm up JIT, avoid collecting autotune candidates, synchronize around measured loops, and fix kernel configs when possible.

Output structure

A completed profile should use this layout:

profile/<kernel_name>_<profile_id>/
├── profile-target.yaml
├── final_report.md
├── run_manifest.yaml
├── details/
│   ├── 01_basic_raw.csv
│   ├── 02_speed_of_light_raw.csv
│   ├── 03_memory_raw.csv
│   ├── 04_compute_raw.csv
│   ├── 05_occupancy_launch_raw.csv
│   ├── 06_roofline_raw.csv
│   ├── 07_source_raw.csv
│   ├── metrics_summary.json
│   ├── metrics_extracted.jsonl
│   └── source_hotspots.csv
├── visual/
│   └── profile_summary.png
└── comparison/
    └── regression_report.md

Not every optional directory is required. visual/ appears only when visualization is enabled. comparison/ appears only when regression mode is enabled.

Final report contents

The normalized report should include:

  • target command and runtime,
  • GPU model, driver, CUDA/profiler version when available,
  • kernel filter and resolved kernel name,
  • launch skip/count and stabilization notes,
  • basic profile summary,
  • SpeedOfLight direction,
  • memory analysis,
  • compute analysis,
  • occupancy and scheduler analysis,
  • Roofline result when available,
  • source/SASS/PTX hotspot table or limitations,
  • bottleneck classification with metric evidence,
  • optimization recommendations ranked by confidence,
  • optional visual report path,
  • optional before/after comparison,
  • unresolved questions and data-quality notes.

Backend conformance

Run the adapter check:

python3 scripts/vendor_conformance_check.py \
  --adapter references/nvidia/vendor-adapter-nvidia-ncu.yaml

A conforming backend should declare support or explicit non-support for:

  • kernel discovery,
  • exact/regex/kernel-id/range filtering,
  • basic profile collection,
  • memory analysis,
  • compute analysis,
  • occupancy analysis,
  • Roofline analysis,
  • raw metric export,
  • source/ISA correlation,
  • permission handling.

References and Attribution

This skill is built around a portable kernel-profiling workflow. The default backend is NVIDIA Nsight Compute, while the profiling model is intentionally separated from the vendor-specific command layer so that another GPU vendor can replace the backend adapter and metric mapping.

Upstream documentation

  • NVIDIA Nsight Compute documentation
    Used as the primary reference for Nsight Compute CLI behavior, report concepts, metric collection, sections, sets, source/SASS correlation, replay modes, and profiling workflow.

  • NVIDIA CUDA C++ Programming Guide and CUDA Compute Capability documentation
    Used for architecture-level limits such as warp size, maximum resident blocks, maximum resident warps, registers per SM, shared memory limits, and occupancy-related interpretation.

  • NVIDIA CUDA C++ Best Practices Guide
    Used for CUDA kernel optimization concepts such as memory coalescing, occupancy, arithmetic intensity, memory hierarchy behavior, and performance measurement practices.

Third-party reference material

The following files are local pointer files to cuda-skill references in the open-source repository slowlyC/agent-gpu-skills:

  • references/nvidia/ncu-guide.md
  • references/nvidia/ProfilingGuide.md

Original source repository:

  • https://github.com/slowlyC/agent-gpu-skills

Attribution:

  • Original repository: slowlyC/agent-gpu-skills
  • Original author / maintainer: slowlyC
  • Source skill: cuda-skill
  • Relevant upstream reference paths include:
    • references/ncu-guide.md
    • references/ncu-docs/ProfilingGuide.md
    • references/best-practices-guide/

These files intentionally do not embed the upstream documentation text. If redistributing this skill, keep this attribution section and follow any upstream license and notice requirements from slowlyC/agent-gpu-skills.

Local reference layout

  • references/nvidia/ncu-guide.md
    Pointer to the upstream Nsight Compute command and workflow reference in slowlyC/agent-gpu-skills.

  • references/nvidia/ProfilingGuide.md
    Pointer to the upstream Nsight Compute profiling guide reference in slowlyC/agent-gpu-skills.

  • references/nvidia/vendor-adapter-nvidia-ncu.yaml
    Backend adapter that maps the portable profiling workflow to Nsight Compute commands, sections, sets, and report artifacts.

  • references/nvidia/architectures/gpu_specs.yaml
    Machine-readable NVIDIA GPU architecture and device-spec database for architecture-aware profiling interpretation.

  • references/nvidia/architectures/architecture-notes.md
    Human-readable architecture notes for interpreting kernel bottlenecks across NVIDIA GPU generations.

  • references/portable/metric-aliases.yaml
    Portable metric alias table used to decouple analysis logic from vendor-specific metric names.

  • references/portable/portable-bottleneck-taxonomy.md
    Vendor-neutral bottleneck taxonomy used by reports and optional rule-based analysis.

  • references/portable/vendor-porting-guide.md
    Guide for replacing the Nsight Compute backend with another vendor profiler.

  • references/portable/rules/bottleneck_rules.yaml
    Optional rule set for the bottleneck decision engine. Disabled by default.

Changelog

v1.0

Stable kernel-only profiling package with native CUDA and Python/Triton support. The package includes target generation, staged Nsight Compute collection, optional visual reports, optional before/after regression comparison, optional bottleneck decision rules, source hotspot extraction, exact-path ncu NOPASSWD guidance, NVIDIA hardware metadata, vendor-portable metric aliases, and backend conformance checks.

v0.6

Consolidated the package layout into SKILL.md, README.md, README.zh-CN.md, docs/, references/, scripts/, and assets/. Moved templates and examples under assets/, moved conformance documentation under docs/, and kept executable utilities under scripts/.

v0.5

Defined the privilege model: no-sudo execution and exact-path full_sudo execution through sudo -n. Kernel-name requests now generate a direct regex filter by default, with discovery only as a fallback.

v0.4

Added privileged collection guidance for environments where Nsight Compute requires performance-counter permissions. Refined README tone and clarified how privileged collection should be run.

v0.3

Replaced prose-only NVIDIA architecture notes with references/nvidia/architectures/gpu_specs.yaml and architecture notes. Added target generation, metric extraction, optional bottleneck rules, source hotspot schema, regression comparison, workload stabilization guidance, and backend conformance checks.

v0.2

Added Chinese README, expanded NVIDIA architecture and profiling references, made kernel filtering optional in handwritten targets, and documented future industrialization improvements.

v0.1

Initial kernel-only profiling skill with Nsight Compute as the default backend, staged profile collection, target schema, normalized final report template, visual report script, and vendor-portability guidance.

About

Kernel-only profiling workflow for CUDA and Triton kernels with Nsight Compute, standardized reports, visual analysis, and vendor-portable adapters.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors