
HCD-241: DSE 6.8/9 upgradeability #2240

Open
szymon-miezal wants to merge 5 commits into main from hcd-241

@szymon-miezal szymon-miezal commented Feb 19, 2026

This PR addresses critical compatibility issues discovered during DSE to HCD (Hyper-Converged Database) upgrade scenarios. The changes focus on ensuring smooth interoperability between DSE 6.8+ nodes and HCD nodes during mixed-version cluster operations.

What Problems Do These Changes Fix?

1. Enhanced Debugging Capabilities for Upgrade Scenarios

Problem: When upgrading from DSE to HCD, troubleshooting connection and gossip issues was difficult due to insufficient logging.

Solution: Added comprehensive logging throughout the handshake and gossip processes, including:

  • Port selection logic for DSE → HCD upgrades
  • Messaging version negotiation during handshake
  • Protocol proposals from DSE peers
  • Gossip state values (both loaded and saved)
  • Connection initialization events

Why it matters: This gives operators visibility into what's happening during the upgrade process, making it much easier to diagnose and resolve issues in production environments.


2. Gossip Deserialization Crash Prevention

Problem: Nodes would crash with ArrayIndexOutOfBoundsException when processing gossip messages from pre-4.0 nodes. This happened because the code assumed certain delimiters would always be present in address and status values, but older nodes sent data in different formats. Both INTERNAL_ADDRESS_AND_PORT and NATIVE_ADDRESS_AND_PORT could arrive without port delimiters (e.g., "10.0.0.1" instead of "10.0.0.1:7000"), and STATUS_WITH_PORT could arrive without additional port information.

Solution: Added defensive length checks after splitting gossip values to gracefully handle cases where expected delimiters are missing:

  • Internal and native IP addresses without ports (e.g., "10.0.0.1" vs "10.0.0.1:7000")
  • Status values without additional port information

Why it matters: Prevents cluster instability and crashes during mixed-version operations.
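
A minimal sketch of the defensive split described above, assuming a colon-delimited `host:port` value. The class and method names (`GossipAddressParts`, `isPortPresent`, `parse`) are hypothetical and are not the names used in this PR. Splitting on `:` is a simplification that does not cover IPv6 literals.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not a class from this PR. */
final class GossipAddressParts {
    final String host;
    final Optional<String> port;

    private GossipAddressParts(String host, Optional<String> port) {
        this.host = host;
        this.port = port;
    }

    /** True when the value carries a ":port" suffix, e.g. "10.0.0.1:7000". */
    static boolean isPortPresent(String value) {
        return value.split(":").length > 1;
    }

    /** Splits "host[:port]" without assuming the delimiter is present. */
    static GossipAddressParts parse(String value) {
        String[] parts = value.split(":");
        // String.split drops trailing empty strings, so ":" yields an empty array.
        String host = parts.length > 0 ? parts[0] : "";
        // Pre-4.0 peers may send "10.0.0.1" only, so parts[1] may not exist.
        Optional<String> port = parts.length > 1 ? Optional.of(parts[1]) : Optional.empty();
        return new GossipAddressParts(host, port);
    }
}
```

With this shape, a caller can check `port.isPresent()` before building the port-aware state instead of indexing into the split result directly.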


3. DSE 6.8+ PING Request Compatibility

Problem: During startup connectivity checks, HCD nodes were attempting to send PING requests to DSE 6.8+ peers, but these versions don't support PING_REQ messages (similar to DSE 6.x and Cassandra 3.x). This resulted in noisy error logs during cluster startup.

Solution: Extended the existing version detection logic to recognize and skip PING requests for DSE 6.8+ peers during the startup connectivity check phase.

Why it matters: Eliminates unnecessary error logs when HCD nodes join or restart in a mixed DSE/HCD environment.
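
The decision can be sketched roughly as follows, assuming the peer's raw messaging version is already known from the handshake. `StartupPingPolicy` and `shouldSendPing` are hypothetical names, and the numeric value of `VERSION_DSE_68` below is a placeholder rather than the real constant. The sketch covers only the DSE 6.8+ check, not the existing pre-4.0 handling.

```java
/** Hypothetical sketch for illustration; not the code in StartupClusterConnectivityChecker. */
final class StartupPingPolicy {
    // Placeholder value, chosen only so that it sorts above 4.0-era versions.
    static final int VERSION_DSE_68 = 256;

    /**
     * Returns false when the peer is known and advertises a DSE 6.8+ messaging
     * version, because such peers do not understand PING_REQ.
     */
    static boolean shouldSendPing(boolean versionKnown, int rawVersion) {
        boolean detectedDse = versionKnown && rawVersion >= VERSION_DSE_68;
        return !detectedDse;
    }
}
```

A peer whose version is not yet known is still pinged in this sketch, since nothing identifies it as DSE.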


4. User-Defined Type Preservation During Schema Migration

Problem: When keyspace schemas were updated during migration, User-Defined Types (UDTs) from the previous schema could be lost if they weren't explicitly present in the new schema. This caused failures when inherited tables depended on these types.

Solution: Modified the schema transformation logic to preserve UDTs from the previous schema when they don't exist in the new schema, ensuring dependent tables continue to function correctly.

Why it matters: Prevents data model corruption and application failures during schema migrations, particularly important when dealing with complex schemas that use inheritance and UDTs.
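
As a rough illustration of the merge rule, assume each keyspace's types are modelled as a map from type name to definition. The real schema classes are richer than this, and `UdtMerge` and `preserveMissingTypes` are hypothetical names, not the ones used in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch for illustration; the real schema types are not plain maps. */
final class UdtMerge {
    /**
     * Starts from the types in the updated schema and adds back any type that
     * existed only in the previous schema. Types present in both keep the
     * updated definition.
     */
    static Map<String, String> preserveMissingTypes(Map<String, String> previous,
                                                    Map<String, String> updated) {
        Map<String, String> merged = new LinkedHashMap<>(updated);
        for (Map.Entry<String, String> entry : previous.entrySet()) {
            merged.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return merged;
    }
}
```

The updated schema wins on conflicts, so the rule only restores types that would otherwise disappear.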


Overall Impact

These changes collectively improve the robustness and reliability of DSE to HCD upgrades by:

  • Making issues easier to diagnose through better logging
  • Preventing crashes from malformed gossip data
  • Ensuring protocol compatibility across versions
  • Preserving schema integrity during migrations

Latest test run: http://10.169.74.112:8081/job/ds-cassandra-build/2117/
Baseline: http://10.169.74.112:8081/job/ds-cassandra-build/10/

Review corpus

github-actions Bot commented Feb 19, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR and ticket in the CNDB project updating the Converged Cassandra version 1704 17405
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

Comment thread src/java/org/apache/cassandra/net/StartupClusterConnectivityChecker.java Outdated
@szymon-miezal
Author

TODO before merging: alter the commit message of cdc2d1a

@szymon-miezal szymon-miezal changed the title from "HCD-241: DSE 6.8/9 upgradeability (WIP)" to "HCD-241: DSE 6.8/9 upgradeability" Feb 20, 2026
@szymon-miezal szymon-miezal self-assigned this Feb 20, 2026
Comment thread src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java Outdated
@szymon-miezal szymon-miezal force-pushed the hcd-241 branch 2 times, most recently from be40074 to 4cab315 Compare February 20, 2026 18:00
return Map.of(
ApplicationState.values()[7], VersionedValue.unsafeMakeVersionedValue(vv.value.split(":")[0], vv.version),
ApplicationState.values()[17], VersionedValue.unsafeMakeVersionedValue(vv.value.split(":")[1], vv.version));
values = vv.value.split(":");
Collaborator

This deserves a comment imo. Explaining the relationship between DSE, underlying protocol versions, states, the change from IP to IP:Port, etc. I would also extract it to a isPortPresent or sthg like that function.

Author

I do agree it could be documented better.

Collaborator

bereng commented Apr 8, 2026

Testing upgrades here: http://10.169.74.112:8081/job/ds-cassandra-build/2125/

boolean known = MessagingService.instance().versions.knows(peer);
logger.debug("Peer {} is known with version {}", peer, known ? MessagingService.instance().versions.getRaw(peer) : "null");
// DSE 6.8/6.9 advertises itself with value higher than VERSION_40, thus we need to compare it with VERSION_DSE_68
boolean detectedDse = known && MessagingService.instance().versions.getRaw(peer) >= MessagingService.VERSION_DSE_68;
Member

@michaelsembwever michaelsembwever Apr 10, 2026

(question)
why do we need this (added check on detectedDse) ?

when was any not DSE 6.x entering the condition ?

@michaelsembwever
Member

this type of commit is painful for rebases. i would normally ask how could the patch be improved to make rebases less likely to conflict, and to what git commit in main history does this commit have lineage to ?

BUT, IIUC this will not be committed to main-5.0, and there will not be more rebases of main, so the rebase concern doesn't apply. (just sharing my thoughts around this concern in preparation for next time :) )

- Add logging to debug the port selection for DSE -> HCD upgrade
- Add logging to debug the messaging version selection during handshake
- Logging in outbound connection at handshake success
- Add the node address to logs
- Log the protocol proposed by DSE peer
- Log the value gossip fails on
- Log the loaded/saved gossip values
- Logs for Initiate, Accept and ConfirmOutboundPre40
- Logs for InboundConnectionInitiator

Adds tests validating the fix for ArrayIndexOutOfBoundsException that
occurred when deserializing gossip state from pre-4.0 nodes containing
address/port values without expected delimiters.

The bug manifested when filterOutgoingState() attempted to split values
like "10.0.0.1" or "NORMAL" and blindly access array indices [1] or [0]
that didn't exist, causing crashes during gossip message processing in
mixed-version clusters.

The fix adds length checks after splitting to gracefully handle:
- IP addresses without ports (e.g., "10.0.0.1" vs "10.0.0.1:7000")
- Status values without tokens (e.g., "NORMAL" vs "NORMAL,10.0.0.1:7000")

…ivity check

DSE 6.8 and later versions don't support PING_REQ messages, similar to
DSE 6.x and Cassandra 3.x. This change extends the existing logic to
detect and skip PING requests for DSE 6.8+ peers during the startup
cluster connectivity check phase.

When updating keyspace schemas, UDTs from the previous schema are now
preserved if they don't exist in the new schema.
This prevents issues where inherited tables depend on types that would
otherwise be lost during schema transformations.
Collaborator

bereng commented Apr 21, 2026

Rebase preparing for the merge

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2240 rejected by Butler


3 regressions found
See build details here


Found 3 new test failures

| Test | Explanation | Runs | Upstream |
|---|---|---|---|
| o.a.c.index.sai.cql.VectorCompaction100dTest.testOneToOneCompaction[version=ec enableNVQ=false] | REGRESSION | 🔴 | 0 / 30 |
| o.a.c.index.sai.cql.VectorSiftSmallTest.testMultiSegmentBuild[version=fa enableNVQ=false enableFused=false] | REGRESSION | 🔴 | 0 / 30 |
| o.a.c.index.sai.cql.VectorTypeTest.testIndexingMultipleVectorColumns[version=ca enableNVQ=false enableFused=false] | REGRESSION | 🔴 | 0 / 30 |

Found 7 known test failures

Collaborator

bereng commented Apr 22, 2026

  • CI 8 LGTM in terms of failures on Butler
  • Waiting on the run with the new ccm
  • CI CNDB LGTM


5 participants