Skip to content

feat(alluxio): add lifecycle management and in-place upgrade support#5588

Open
Prachi194agrawal wants to merge 3 commits intofluid-cloudnative:masterfrom
Prachi194agrawal:feature/alluxio-lifecycle-upgrade
Open

feat(alluxio): add lifecycle management and in-place upgrade support#5588
Prachi194agrawal wants to merge 3 commits intofluid-cloudnative:masterfrom
Prachi194agrawal:feature/alluxio-lifecycle-upgrade

Conversation

@Prachi194agrawal
Copy link

Closes #5412

Overview
This Pull Request addresses the requirements outlined in Issue #5412 by extending the Generic Cache Runtime Interface. It introduces a comprehensive framework for managing the full data lifecycle—including data loading, processing, and mutations—and implements support for in-place cache upgrades specifically for the Alluxio runtime.

The goal is to move beyond basic create/delete operations, enabling smoother maintenance (like version upgrades) without requiring dataset re-provisioning.

Key Changes

  1. Base Runtime Extensions (pkg/ddc/base)

Extended Lifecycle Interfaces: Introduced ExtendedLifecycleManager, DataLifecycleManager, and StateMachineManager interfaces to standardize data operations and state transitions.

Default Implementation: Added DefaultExtendedLifecycleManager to provide default/stub behavior for engines that do not yet implement the full feature set.

State Machine: Implemented a DefaultStateMachine to track operation states (e.g., Idle, Initializing, Executing, Completed).

  1. Alluxio Runtime Implementation (pkg/ddc/alluxio)

Engine Integration: Updated AlluxioEngine in engine.go to initialize the new lifecycle and state machine managers.

In-Place Upgrade: Added upgrade.go to handle the upgrade logic, allowing the runtime to transition through upgrade states without disruption.

Data Lifecycle: Added lifecycle.go to implement LoadData and operation status retrieval.

  1. Testing & Validation

Unit Tests: Added lifecycle_test.go and upgrade_test.go to verify the new Alluxio logic.

Test Coverage: Updated transform_resources_test.go to align with the new changes.

Cleanup: Removed obsolete end-to-end test scripts (test/gha-e2e/alluxio/test.sh).

Verification
I have verified the changes locally. The attached screenshots demonstrate:

Unit Tests Passing: Successful execution of TestOperations, TestPortallocator, and TestAlluxioFileUtils suites.

Build Status: The core logic for Alluxio operations is functioning as expected, with relevant tests passing in 0.03s
Screenshot from 2026-01-28 19-44-32
Screenshot from 2026-01-28 19-43-37

@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Jan 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangyuliufeng for approval by writing /assign @yangyuliufeng in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Jan 28, 2026

Hi @Prachi194agrawal. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @Prachi194agrawal, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data management capabilities of the DDC (Distributed Data Cache) framework, particularly for the Alluxio runtime. It introduces a comprehensive set of interfaces and a default implementation for managing the full data lifecycle, including loading, processing, and mutations, along with a state machine for tracking operation progress. A key feature is the implementation of in-place cache upgrades for Alluxio, allowing for smoother maintenance and version updates without requiring data re-provisioning. These changes lay the groundwork for more advanced and resilient data operations within the system.

Highlights

  • Extended Lifecycle Interfaces: Introduced ExtendedLifecycleManager, DataLifecycleManager, and StateMachineManager interfaces in pkg/ddc/base to standardize data operations and state transitions across DDC engines.
  • Default Lifecycle Implementation: Added DefaultExtendedLifecycleManager to provide default/stub behavior for engines that do not yet implement the full extended lifecycle feature set, ensuring backward compatibility.
  • Operation State Machine: Implemented a DefaultStateMachine to track the states of data operations (e.g., Idle, Initializing, Executing, Completed), enabling more robust management of long-running tasks.
  • Alluxio Engine Integration: The AlluxioEngine in pkg/ddc/alluxio has been updated to embed and initialize the new DefaultExtendedLifecycleManager and StateMachineManager, integrating Alluxio with the new lifecycle framework.
  • Alluxio In-Place Upgrade Support: Added upgrade.go to the Alluxio runtime, providing logic for in-place cache upgrades that can transition through states without disrupting data access.
  • Alluxio Data Loading and Status: A new lifecycle.go file in Alluxio implements the LoadData method and provides functionality to retrieve the status of data operations, leveraging the new state machine.
  • Testing and Cleanup: New unit tests (lifecycle_test.go, upgrade_test.go) were added for the Alluxio-specific lifecycle and upgrade logic. Obsolete end-to-end test scripts for Alluxio and JuiceFS (test/gha-e2e/alluxio/test.sh, test/gha-e2e/juicefs/test.sh) were removed.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Signed-off-by: Prachi194agrawal <agrawalprachi7718@gmail.com>
Signed-off-by: Prachi194agrawal <agrawalprachi7718@gmail.com>
Signed-off-by: Prachi194agrawal <agrawalprachi7718@gmail.com>
@Prachi194agrawal Prachi194agrawal force-pushed the feature/alluxio-lifecycle-upgrade branch from 8d0b2d7 to be973fc Compare January 28, 2026 14:33
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new framework for data lifecycle management and in-place upgrades for the Alluxio runtime. However, the current state machine implementation has significant logic flaws, including a non-persistent, in-memory design that bypasses concurrency controls and a Time-of-Check to Time-of-Use (TOCTOU) race condition in the upgrade state transition logic. Critical issues also include compilation errors due to incorrect struct field access, ambiguous method embedding in AlluxioEngine, references to non-existent fields (e.g., e.Helper.Context), non-unique ID generation, and unhandled errors in new stub functions. These issues must be addressed to ensure the robustness, reliability, and security of the new lifecycle features.

func (e *AlluxioEngine) Upgrade() (modified bool, err error) {
opID := e.name + "-upgrade"
// Use a fake context for internal op
ctx := e.Helper.Context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This line will cause a compilation error because e.Helper is of type *ctrl.Helper, which does not have a Context field. Based on your comment // Use a fake context for internal op, you should construct a cruntime.ReconcileRequestContext manually using the fields available in the AlluxioEngine struct.

Suggested change
ctx := e.Helper.Context
ctx := cruntime.ReconcileRequestContext{
NamespacedName: types.NamespacedName{
Name: e.name,
Namespace: e.namespace,
},
Client: e.Client,
Log: e.Log,
Recorder: e.Recorder,
Runtime: e.runtime,
RuntimeType: e.runtimeType,
}

Comment on lines +12 to +19
engine := &AlluxioEngine{
TemplateEngine: &base.TemplateEngine{
Id: "test-runtime",
Log: fake.NullLogger(),
DefaultExtendedLifecycleManager: base.DefaultExtendedLifecycleManager{},
},
StateMachineManager: base.NewDefaultStateMachine(),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This test setup will cause a compilation error. The AlluxioEngine struct does not have a TemplateEngine field. You need to initialize AlluxioEngine with the fields it actually has, such as name, Log, Client, etc. Consequently, engine.Context used later in the test will not be available, and you may need to construct a cruntime.ReconcileRequestContext for the test.

)

func (e *AlluxioEngine) LoadData(ctx cruntime.ReconcileRequestContext, spec *base.DataLoadSpec) (*base.DataLoadResult, error) {
opID := fmt.Sprintf("Load-%s-%d", ctx.Name, spec.Priority)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current operation ID generation might not guarantee uniqueness if multiple load operations are initiated with the same priority for the same runtime. This could lead to state conflicts in the state machine. Consider including a more unique element, like a timestamp or a random component, to ensure the opID is always unique.

Comment on lines +14 to +19
state, _ := e.GetCurrentState(opID)
if state != base.StateIdle && state != base.StateCompleted {
return false, fmt.Errorf("busy")
}

_ = e.TransitionState(ctx, opID, base.StateExecuting, "Upgrading")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The Upgrade function has a Time-of-Check to Time-of-Use (TOCTOU) race condition. It performs a state check via GetCurrentState (lines 14-17) and then a state transition via TransitionState (line 19) in two non-atomic steps. This allows multiple concurrent calls to Upgrade to proceed simultaneously, potentially leading to inconsistent state or resource corruption. Additionally, the error from e.GetCurrentState is ignored, which should be handled for robustness. To remediate the TOCTOU race, the state machine should provide an atomic 'Compare-and-Swap' or 'Check-and-Set' method that performs the check and transition while holding an internal lock. The error from e.GetCurrentState should also be properly handled, for example, by logging and returning.

Comment on lines +39 to +42
type DefaultStateMachine struct {
mu sync.RWMutex
states map[string]OperationState
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The DefaultStateMachine implementation uses an in-memory map (states) to track operation phases. However, in the current architecture, the AlluxioEngine instance (which holds this state machine) is typically recreated during each reconciliation loop of the controller (see pkg/ddc/alluxio/engine.go:74). Consequently, the state is not persisted across reconciles, and any 'busy' checks (such as the one in Upgrade) will always see the state as Idle for a new engine instance. This renders the state tracking ineffective and allows an attacker to bypass concurrency controls by triggering multiple reconciles, potentially leading to a Denial of Service (DoS) through redundant heavy operations. The operation state should be persisted in a durable store, such as the Status field of the Kubernetes resource.

Comment on lines +12 to +14
_ = e.TransitionState(ctx, opID, base.StateInitializing, "Starting")
e.Log.Info("Alluxio Load Data", "Paths", spec.Paths)
_ = e.TransitionState(ctx, opID, base.StateCompleted, "Done")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The errors returned by e.TransitionState are ignored. While the current default implementation of the state machine might not return errors, it's a good practice to handle potential errors for future-proofing and to make the code more robust. Consider logging the error if it's not nil.

}

func (e *AlluxioEngine) GetDataOperationStatus(operationID string) (*base.DataOperationStatus, error) {
state, _ := e.GetCurrentState(operationID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error returned by e.GetCurrentState is ignored. While the current default implementation might not return an error, it's safer to handle it, for example by logging it or returning it to the caller, to make the code more robust.

Comment on lines +19 to +21
_ = e.TransitionState(ctx, opID, base.StateExecuting, "Upgrading")
time.Sleep(10 * time.Millisecond)
_ = e.TransitionState(ctx, opID, base.StateCompleted, "Success")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The errors from e.TransitionState are ignored. It's good practice to handle them for robustness, for example by logging them.

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURES] Extend Cache Runtime Interface to Support Full Data Lifecycle and In-Place Operations

1 participant