2 changes: 1 addition & 1 deletion src/s_tir/meta_schedule/schedule_rule/schedule_rule.cc
@@ -166,7 +166,7 @@ ffi::Array<ScheduleRule> ScheduleRule::DefaultCUDA() {
       ScheduleRule::ParallelizeVectorizeUnroll(
           /*max_jobs_per_core=*/-1,
           /*max_vectorize_extent=*/-1,
-          /*unroll_max_steps=*/ffi::Array<Integer>{0, 16, 64, 512, 1024},
+          /*unroll_max_steps=*/ffi::Array<Integer>{0, 16, 32, 64, 128, 256, 512, 1024},
Contributor (severity: medium):
This change expands the unroll search space for all CUDA targets, which can increase auto-tuning time for architectures where these new unroll steps are not beneficial.

To make this optimization target SM70+ GPUs as intended, I suggest making the choice of unroll steps conditional on the target's compute capability. This avoids slowing down tuning on older architectures and can be done cleanly with an immediately-invoked lambda expression.

The condition can be sm_version >= 70 if the optimization benefits SM70 and newer, or sm_version == 70 if it is specific to V100.

          /*unroll_max_steps=*/[]() -> ffi::Array<Integer> {
            // Use the wider unroll search space only for SM70+ CUDA targets.
            auto target = tvm::Target::Current(/*allow_not_defined=*/true);
            if (target.defined() && target->kind->name == "cuda") {
              // GetAttr returns an Optional value, not a pointer.
              if (auto sm = target->GetAttr<Integer>("sm")) {
                if (sm.value()->value >= 70) {
                  return {0, 16, 32, 64, 128, 256, 512, 1024};
                }
              }
            }
            return {0, 16, 64, 512, 1024};
          }(),

/*unroll_explicit=*/true),
ScheduleRule::AutoBind(
/*max_threadblocks=*/256,
Expand Down