aldbr commented Jun 27, 2024

Replace the Dirac-specific SSH class with fabric.

BEGINRELEASENOTES
*Resources
CHANGE: Replace SSH by fabric in SSHComputingElement
ENDRELEASENOTES

aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch 2 times, most recently from 206c55e to 6fbeaed (June 27, 2024 08:07)

fstagni commented Jun 28, 2024

Just a note: we do not (yet) have a way to do proper integration tests for the Computing Elements, but one may think about adding them to our integration test setup. Something to think about; it would be nice if it were in this PR.
It would involve creating the "site" with the "CE" (this would be yet another container), so that the SiteDirector could send pilots to it.

aldbr linked an issue Jul 3, 2024 that may be closed by this pull request
aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 6fbeaed to 1c60e47 (July 25, 2024 15:24)

aldbr commented Jul 25, 2024

> Just a note: we do not (yet) have a way to do proper integration tests for the Computing Elements, but one may think about adding them to our integration test setup. Something to think about; it would be nice if it were in this PR. It would involve creating the "site" with the "CE" (this would be yet another container), so that the SiteDirector could send pilots to it.

I agree it would be great to add integration tests for CEs, at least to test basic features. But it will likely become complex because:

  • if we want to test things properly, we need to set up a CE and a Batch System.
  • we will have to choose one configuration, but it might not reflect the configuration of the sites in production.

I will give it a try with the SSHCE, let's see.

aldbr commented Nov 29, 2024

I wonder if it really makes sense to add CEs (and Batch Systems) to the integration tests: while it would be great to have a "grid in a box" in a controlled environment, it would be cumbersome to maintain in the long term and would not be representative of all the instances we can find out there (e.g. Arc v6, v6 with a hack, v7, transferring jobs to Slurm, HTCondor, SSH, SSH tunnel, HTCondor with local scheduler, with remote scheduler...).

It would probably make more sense to add some scripts to run during the hackathons. For each type of CE supported it would:

  • get all the instances related to the given type of CE and for each of them:
    • submit a "hello world" job
    • get the CE status
    • get the job status until it reaches a final state
    • get the job output and logging info (if available)

Basically, it would be very similar to (i) submitting pilots with the Site Director and (ii) checking their results manually. But it would be more focused on the CE interfaces and would be more automated (though a human would need to check whether errors come from the CE instance itself or the Dirac CE interface).
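
For illustration, the per-CE loop described above could be sketched roughly as follows. This is only a minimal sketch: the `ComputingElement` protocol, its method names, the `FINAL_STATES` values and the `check_ce` helper are all hypothetical and do not correspond to the actual DIRAC CE interface.

```python
import time
from typing import Protocol


class ComputingElement(Protocol):
    """Hypothetical minimal CE interface (assumption, not the DIRAC API)."""

    name: str

    def submit_job(self, executable: str) -> str: ...
    def get_ce_status(self) -> dict: ...
    def get_job_status(self, job_id: str) -> str: ...
    def get_job_output(self, job_id: str) -> str: ...


# Assumed names for the final job states; real CEs may report others.
FINAL_STATES = {"Done", "Failed", "Aborted"}


def check_ce(ce, executable="hello_world.sh", poll_interval=30.0,
             timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Run the basic checks against one CE instance and return a report dict."""
    report = {"ce": ce.name}
    try:
        # 1. Submit a "hello world" job.
        job_id = ce.submit_job(executable)
        # 2. Get the CE status.
        report["ce_status"] = ce.get_ce_status()
        # 3. Poll the job status until it is final or the timeout expires.
        deadline = clock() + timeout
        status = ce.get_job_status(job_id)
        while status not in FINAL_STATES and clock() < deadline:
            sleep(poll_interval)
            status = ce.get_job_status(job_id)
        report["job_status"] = status
        # 4. Fetch the output only if the job reached a final state.
        if status in FINAL_STATES:
            report["output"] = ce.get_job_output(job_id)
    except Exception as exc:
        # Keep the error: a human still has to decide whether it comes
        # from the CE instance itself or from the CE interface.
        report["error"] = repr(exc)
    return report
```

Running `check_ce` over every instance of a given CE type would give one report per instance to review by hand.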

Any opinion @fstagni ?

fstagni commented Nov 29, 2024

I think the only one that would make sense to set up here is the SSHCE. The others, the "proper Grid ones", cannot be tested here.

aldbr commented Nov 29, 2024

I don't even know if testing SSHCE in an integration test makes sense. The only easy test we can set up would be SSHCE + Host, which is not representative of what we can have in production.

fstagni commented Nov 29, 2024

OK OK, give up on the idea...

aldbr commented Nov 29, 2024

I will add a certification test focused on the CE interfaces as I explained (+ a card in the kanban board to explain how to execute it). I will execute it in the lhcb environment to make sure the changes in this PR are correct.

And I can also try to add a container that would act as a "Site" and use SSH + Host so that we can at least test the Site Director "in a box". Would it be okay?

fstagni commented Nov 29, 2024

Sure, thanks

aldbr commented Jan 12, 2026

Tested with #8420 in LHCb production + now used by the LHCb Site Directors for a few hours without any issue so far.

This PR does not modify the interfaces so the transition is expected to be transparent.
The only risky point I identified is the SSHTunnel, which can take any command as value: I need to extract the hostname as well as the port from it.
My function is supposed to be flexible and support various cases, but I can't guarantee that this is going to work fine if the value is tricky.
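
To illustrate the kind of parsing involved, here is a minimal sketch, not the function from this PR. It assumes the value looks like an OpenSSH client command line (`ssh [options] [user@]host` or `ssh://[user@]host[:port]`); the name `parse_tunnel_command` and the default port are hypothetical, and IPv6 literals or jump-host chains are not handled.

```python
import shlex

# OpenSSH client options that take an argument (from the ssh(1) synopsis).
OPTIONS_WITH_ARG = set("BbcDEeFIiJLlmOoPpQRSWw")


def parse_tunnel_command(command, default_port=22):
    """Return (hostname, port) from an ssh-like command string.

    Raises ValueError if no destination can be found.
    """
    tokens = shlex.split(command)
    port = default_port
    destination = None
    i = 0
    # Skip the program name ("ssh", "/usr/bin/ssh") if present.
    if tokens and tokens[0].rsplit("/", 1)[-1] == "ssh":
        i = 1
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-") and len(token) > 1:
            chars = token[1:]
            for pos, flag in enumerate(chars):
                if flag not in OPTIONS_WITH_ARG:
                    continue  # plain switch such as -t or -A
                # The value is either attached (-p2222) or the next token.
                value = chars[pos + 1:]
                if not value:
                    i += 1
                    if i >= len(tokens):
                        raise ValueError(f"missing value for -{flag}")
                    value = tokens[i]
                if flag == "p":
                    port = int(value)
                elif flag == "o":
                    key, _, val = value.partition("=")
                    if key.strip().lower() == "port":
                        port = int(val)
                break
            i += 1
            continue
        destination = token  # first non-option token
        break
    if destination is None:
        raise ValueError(f"no destination found in {command!r}")
    is_uri = destination.startswith("ssh://")
    if is_uri:
        destination = destination[len("ssh://"):]
    host = destination.rpartition("@")[2]  # drop "user@" if present
    if is_uri and ":" in host:
        host, _, uri_port = host.partition(":")
        port = int(uri_port)
    if not host:
        raise ValueError(f"empty hostname in {command!r}")
    return host, port
```

With these assumptions, `parse_tunnel_command("ssh -p 2222 user@gateway.example.org")` gives `("gateway.example.org", 2222)`; values that do not resemble an ssh command line would still need dedicated handling.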

aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 6b383ed to 4a20778 (January 12, 2026 14:07)

aldbr commented Jan 12, 2026

Another risky point is the SSHBatchCE: I don't know how to test it properly, since we don't have any instance in LHCb as far as I know.

fstagni commented Jan 12, 2026

Does this need a new DIRACOS release?

aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 4a20778 to 2bc6401 (January 13, 2026 09:47)

Development

Successfully merging this pull request may close these issues.

Replace the SSH class by a Python library?