[BUG] Cascaded algorithm correctness depends on compiler flag

**Describe the bug**
On branch-2.2, compiling the Cascaded algorithm with the `-G` flag (device debug, which disables device code optimizations) changes how statically allocated shared memory is aligned, and the kernels then fail with misalignment errors.

**Steps/Code to reproduce bug**

1. Modify CMakeLists.txt:

- Comment out line [CMakeLists.txt:33](https://github.com/NVIDIA/nvcomp/blob/a6e4e64a177e07cd2e5c8c5e07bb66ffefceae84/CMakeLists.txt#L33)
```cmake
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG};-g")
```
- Uncomment line [CMakeLists.txt:32](https://github.com/NVIDIA/nvcomp/blob/a6e4e64a177e07cd2e5c8c5e07bb66ffefceae84/CMakeLists.txt#L32)
```cmake
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG};-G")
```

2. Run [tests/test_cascaded.cpp](https://github.com/NVIDIA/nvcomp/blob/a6e4e64a177e07cd2e5c8c5e07bb66ffefceae84/tests/test_cascaded.cpp) with the following test case:
```cpp
TEST_CASE("comp/decomp cascaded-small-uint64", "[nvcomp][small]")
```

To compile successfully, I reduced [default_chunk_size](https://github.com/NVIDIA/nvcomp/blob/a6e4e64a177e07cd2e5c8c5e07bb66ffefceae84/src/CascadedKernels.cuh#L67) from 4096 to 2048.

**Expected behavior**
The test should pass without errors.

**Environment details (please complete the following information):**
 - Environment location:
   - Ubuntu 22.04
   - Driver Version: 555.99
   - CUDA Version: 12.5
   - NVIDIA GeForce RTX 3080
 - Method of nvCOMP install: branch-2.2 source code

**Additional context**

After debugging, I explicitly declared the alignment of the shared memory arrays at the following two locations:

[CascadedHlifKernels.cu:122](https://github.com/NVIDIA/nvcomp/blob/a6e4e64a177e07cd2e5c8c5e07bb66ffefceae84/src/highlevel/CascadedHlifKernels.cu#L122)
```cpp
__shared__ __align__(sizeof(data_type)) uint8_t shmem[shmem_size];
```
[CascadedKernels.cuh:801](https://github.com/NVIDIA/nvcomp/blob/a6e4e64a177e07cd2e5c8c5e07bb66ffefceae84/src/CascadedKernels.cuh#L801)
```cpp
__shared__ __align__(sizeof(data_type)) uint32_t chunk_metadata[max_chunk_metadata_size / sizeof(uint32_t)];
```
After making these changes, all tests in test_cascaded.cpp passed. Correctness should not depend on the compiler's optimization level, so I believe this is a bug.
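
To illustrate why the explicit alignment matters, here is a minimal host-side C++ sketch. It is not nvCOMP code: it assumes the kernels reinterpret the byte buffer as `data_type` values, it uses standard `alignas` as a stand-in for CUDA's `__align__`, and the struct names and the `is_aligned` helper are hypothetical. A byte array on its own only requires 1-byte alignment, so whether it happens to start on an 8-byte boundary depends on where the compiler places it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: stands in for the kernel's data_type template parameter.
using data_type = std::uint64_t;
constexpr std::size_t shmem_size = 64;

// Hypothetical helper: true if p sits on a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// The leading byte models other data the compiler may place first.
struct DefaultLayout {
  std::uint8_t other;
  std::uint8_t shmem[shmem_size];  // only 1-byte alignment is required
};

struct ExplicitLayout {
  std::uint8_t other;
  // Same array, but forced onto a sizeof(data_type) boundary.
  alignas(sizeof(data_type)) std::uint8_t shmem[shmem_size];
};

// Without the attribute the buffer lands at offset 1.
static_assert(offsetof(DefaultLayout, shmem) == 1,
              "byte array follows the leading byte with no padding");
// With the attribute the buffer always starts on an 8-byte boundary.
static_assert(offsetof(ExplicitLayout, shmem) % sizeof(data_type) == 0,
              "explicitly aligned buffer is safe to read as data_type");
```

In the default layout the buffer starts at offset 1, so reading it as `uint64_t` would be misaligned. The explicitly aligned version starts on an 8-byte boundary regardless of what precedes it, which is the guarantee the two `__align__` declarations above add.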

