docs(guides): Crypt4GH_proTES and SPE deployment Tutorial #31
Conversation
Reviewer's Guide

Adds two new administrator-facing documentation guides describing how to set up and use secure processing environments for sensitive data: one focused on Crypt4GH-based encryption workflows orchestrated via proTES/Funnel, and another outlining an SPE architecture in de.NBI Cloud using WESkit, Slurm, MinIO, and LS Login.

Sequence diagram for Crypt4GH encryption workflow via proTES and Funnel

```mermaid
sequenceDiagram
actor Admin
participant proTES as proTES_gateway
participant TES as TES_Funnel_server
participant Worker as Funnel_worker_node
participant Storage as Local_storage
participant DB as BoltDB_database
Admin->>proTES: POST task1_keygen
proTES->>TES: create_task(task1_keygen.json)
TES->>DB: store_task_definition
TES->>Worker: schedule_task(task1_keygen)
Worker->>Worker: start_container(crypt4gh_tutorial)
Worker->>Storage: write_keys(sender_sk, sender_pk, recipient_sk, recipient_pk, recipient_pk_copy)
Worker-->>TES: report_task_status(COMPLETED)
TES-->>proTES: task_status(COMPLETED)
proTES-->>Admin: task1 result locations
Admin->>proTES: POST task2_encrypt_file
proTES->>TES: create_task(task2_encrypt_file.json)
TES->>Worker: schedule_task(task2_encrypt_file)
Worker->>Storage: read_keys(sender_sk, recipient_pk)
Worker->>Worker: download_file_and_encrypt
Worker->>Storage: write_encrypted_file_and_size
Worker-->>TES: report_task_status(COMPLETED)
Admin->>proTES: POST task3_decrypt_and_write_size
proTES->>TES: create_task(task3_decrypt_and_write_size.json)
TES->>Worker: schedule_task(task3_decrypt_and_write_size)
Worker->>Storage: read_encrypted_file_and_recipient_sk
Worker->>Worker: decrypt_and_compute_md5sum
Worker->>Storage: write_decrypted_md5sum
Worker-->>TES: report_task_status(COMPLETED)
TES-->>proTES: task_status(COMPLETED)
proTES-->>Admin: final_result_locations
```

Flow diagram for Crypt4GH key generation, encryption, and decryption pipeline

```mermaid
flowchart LR
A[Start_tutorial] --> B[task1_keygen_generate_crypt4gh_keypairs]
B --> B1[Sender_sk_pk_and_recipient_sk_pk_written_to_storage]
B1 --> C[task2_encrypt_file_download_logo_and_record_size]
C --> C1[Write_plain_size_file_to_storage]
C1 --> C2[Encrypt_size_file_with_sender_sk_and_recipient_pk]
C2 --> C3[Store_encrypted_c4gh_file_in_encrypted_directory]
C3 --> D[Transfer_encrypted_file_to_secure_environment]
D --> E[task3_decrypt_and_write_size_read_encrypted_file_and_recipient_sk]
E --> F[Decrypt_and_compute_md5sum]
F --> G[Write_md5sum_file_to_decrypted_directory]
G --> H[End_pipeline]
```
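For orientation, a minimal sketch of how a task document such as `task1_keygen.json` might be submitted to the proTES gateway. The endpoint path follows the GA4GH TES API; the host URL and the file name are assumptions, not taken from the tutorial:

```bash
# Illustrative only: submit a TES task document to a proTES gateway.
# The host URL and file name are assumptions; adjust to your deployment.
curl -X POST "http://localhost:8080/ga4gh/tes/v1/tasks" \
  -H "Content-Type: application/json" \
  -d @task1_keygen.json

# Poll the task state using the id returned by the POST above.
curl "http://localhost:8080/ga4gh/tes/v1/tasks/<task-id>?view=MINIMAL"
```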
Hey - I've found 13 issues, and left some high level feedback:
- In the `crypt4gh_to_protes` tutorial, the third task (`task3_decrypt_and_write_size.json`) never actually decrypts the `.c4gh` file and instead runs `md5sum` on a non-existent `united_kingdom_logo_size.txt`; please revise the executor command and output paths so they match the described decryption workflow and previous task outputs.
- The final `curl` example in `crypt4gh_to_protes.md` posts `task4_decrypt_and_write_size.json`, but the document defines `task3_decrypt_and_write_size.json`; align the file name and task number to avoid confusion when users follow the tutorial.
- Both new guides contain several typos and placeholders (e.g. `enrypted`, `recomend`, `impemented`, `Authentification`, `???` for redirect URLs, and `accessable`); a focused pass to correct spelling and replace placeholders with concrete values or explicit instructions will make the tutorials significantly clearer and easier to follow.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the `crypt4gh_to_protes` tutorial, the third task (`task3_decrypt_and_write_size.json`) never actually decrypts the `.c4gh` file and instead runs `md5sum` on a non-existent `united_kingdom_logo_size.txt`; please revise the executor command and output paths so they match the described decryption workflow and previous task outputs (see the sketch after this list).
- The final `curl` example in `crypt4gh_to_protes.md` posts `task4_decrypt_and_write_size.json`, but the document defines `task3_decrypt_and_write_size.json`; align the file name and task number to avoid confusion when users follow the tutorial.
- Both new guides contain several typos and placeholders (e.g. `enrypted`, `recomend`, `impemented`, `Authentification`, `???` for redirect URLs, and `accessable`); a focused pass to correct spelling and replace placeholders with concrete values or explicit instructions will make the tutorials significantly clearer and easier to follow.
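As a starting point for the first comment, a hedged sketch of what a corrected `task3` could look like, assuming the `crypt4gh` CLI is available inside the container; the image name, storage paths, and intermediate file names are illustrative, not taken from the tutorial:

```bash
# Illustrative sketch of a fixed task3: actually decrypt the .c4gh file
# before hashing it. Image name and all paths are assumptions.
cat > task3_decrypt_and_write_size.json <<'EOF'
{
  "name": "task3_decrypt_and_write_size",
  "inputs": [
    {"url": "file:///storage/encrypted/united_kingdom_logo_size.txt.c4gh",
     "path": "/data/united_kingdom_logo_size.txt.c4gh"},
    {"url": "file:///storage/keys/recipient.sec", "path": "/data/recipient.sec"}
  ],
  "outputs": [
    {"url": "file:///storage/decrypted/united_kingdom_logo_size.md5",
     "path": "/data/united_kingdom_logo_size.md5", "type": "FILE"}
  ],
  "executors": [{
    "image": "crypt4gh_tutorial",
    "command": ["sh", "-c",
      "crypt4gh decrypt --sk /data/recipient.sec < /data/united_kingdom_logo_size.txt.c4gh > /tmp/size.txt && md5sum /tmp/size.txt > /data/united_kingdom_logo_size.md5"]
  }]
}
EOF
```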
## Individual Comments
### Comment 1
<location> `docs/guides/guide-admin/crypt4gh_to_protes.md:4` </location>
<code_context>
+# Sensitive data encryption and processing using Crypt4GH and proTES
+## Description
+This tutorial presents processing of encrypted sensitive data using TES/[Funnel](https://github.com/ohsu-comp-bio/funnel). Each job is triggered via [proTES](https://github.com/elixir-cloud-aai/proTES), a stable and scalable GA4GH [TES](https://github.com/ga4gh/task-execution-schemas) gateway.
+A [Crypt4GH](https://crypt4gh.readthedocs.io/) enrypted dataset is decrypted using the private key of the recipient. We recomend to store the key in a secure location and use it only for decryption. Transfer of the private key to the TES instance could be done using signed URL.
+
+
</code_context>
<issue_to_address>
**issue (typo):** Fix spelling and article usage in the Crypt4GH description sentence.
Please fix the typos (`enrypted` → `encrypted`, `recomend` → `recommend`) and update the last clause to read: “using a signed URL.”
```suggestion
A [Crypt4GH](https://crypt4gh.readthedocs.io/) encrypted dataset is decrypted using the private key of the recipient. We recommend to store the key in a secure location and use it only for decryption. Transfer of the private key to the TES instance could be done using a signed URL.
```
</issue_to_address>
### Comment 2
<location> `docs/guides/guide-admin/crypt4gh_to_protes.md:8` </location>
<code_context>
+
+
+We assume two parties that need to exchange information: a sender and a recipient. For each party, we generate a `Crypt4GH` key pair and send the recipient’s public key to the sender. All keys are generated inside a container (Task 1) and exported to the configured storage via TES outputs.
+For simplicity, we download a file that was presented [elsewhere](https://github.com/vschnei/elixir-cloud-demos) and assume it contains sensitive information. We encrypt this data (Task 2) using the sender’s private key and the recipient’s public key with `Crypt4GH`, producing a `*.c4gh` encrypted file. Once the encrypted file is transfered to to a secure processing environment, it can be then decrypted (Task 3) using the recipient’s secret key.
+
+
</code_context>
<issue_to_address>
**issue (typo):** Correct spelling and duplicated word in the sentence about transferring the encrypted file.
Change "transfered" to "transferred" and remove the duplicate "to" in that sentence.
```suggestion
For simplicity, we download a file that was presented [elsewhere](https://github.com/vschnei/elixir-cloud-demos) and assume it contains sensitive information. We encrypt this data (Task 2) using the sender’s private key and the recipient’s public key with `Crypt4GH`, producing a `*.c4gh` encrypted file. Once the encrypted file is transferred to a secure processing environment, it can be then decrypted (Task 3) using the recipient’s secret key.
```
</issue_to_address>
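For readers following the three tasks, a minimal sketch of the underlying CLI calls they correspond to, assuming the Python `crypt4gh` package; the key and file names are illustrative:

```bash
# Task 1: generate key pairs for sender and recipient (illustrative names).
crypt4gh-keygen --sk sender.sec --pk sender.pub
crypt4gh-keygen --sk recipient.sec --pk recipient.pub

# Task 2: encrypt with the sender's secret key for the recipient's public key.
crypt4gh encrypt --sk sender.sec --recipient_pk recipient.pub \
  < united_kingdom_logo_size.txt > united_kingdom_logo_size.txt.c4gh

# Task 3: decrypt with the recipient's secret key.
crypt4gh decrypt --sk recipient.sec \
  < united_kingdom_logo_size.txt.c4gh > united_kingdom_logo_size.txt
```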
### Comment 3
<location> `docs/guides/guide-admin/crypt4gh_to_protes.md:266` </location>
<code_context>
+
+Finally, all task and scheduler data is stored in the embedded database. For further configuration options, see the [Funnel documentation](https://ohsu-comp-bio.github.io/funnel/).
+
+Our pipeleine allows to use proTES, a robust and scalable [Global Alliance for Genomics and Health(GA4GH)](https://www.ga4gh.org/) [Task Execution Service (TES) API](https://github.com/ga4gh/task-execution-schemas) gateway. It enables the federation of tasks across on-premise, hybrid, and multi-cloud environments composed of TES nodes. See official [proTES](https://github.com/elixir-cloud-aai/proTES) documentation for deployment and configuration.
+
+
</code_context>
<issue_to_address>
**suggestion (typo):** Fix typo and improve grammar in the proTES pipeline description.
Change "pipeleine" to "pipeline" and rephrase "allows to use proTES" to "allows the use of proTES" or "allows you to use proTES". Also add a space in "Health(GA4GH)" → "Health (GA4GH)" for readability.
```suggestion
Our pipeline allows the use of proTES, a robust and scalable [Global Alliance for Genomics and Health (GA4GH)](https://www.ga4gh.org/) [Task Execution Service (TES) API](https://github.com/ga4gh/task-execution-schemas) gateway. It enables the federation of tasks across on-premise, hybrid, and multi-cloud environments composed of TES nodes. See official [proTES](https://github.com/elixir-cloud-aai/proTES) documentation for deployment and configuration.
```
</issue_to_address>
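As a rough reference for the Funnel side of this setup (embedded BoltDB database, local compute and storage), a minimal config sketch; the key names follow the Funnel documentation, but the paths are assumptions:

```bash
# Illustrative Funnel configuration (YAML via heredoc); paths are assumptions.
cat > funnel-config.yaml <<'EOF'
Database: boltdb
BoltDB:
  Path: /opt/funnel/funnel.db   # embedded task/scheduler database
Compute: local
LocalStorage:
  AllowedDirs:
    - /storage                  # directory exposed to tasks via file:// URLs
EOF

funnel server run --config funnel-config.yaml
```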
### Comment 4
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:5` </location>
<code_context>
+
+This tutorial presents the implementation of a SPE in the de.NBI Cloud (ELIXIR-DE) using ELIXIR and open-source services.
+
+The aim of this tutorial is the need for a Secure Processing Environment (SPE) to analyse large volumes of sensitive data generated by biomedical and clinical research. Easy and secure access to such environments accelerates research and enables participation by researchers with limited resources.
+
+Users of an SPE can run workflows on sensitive data, without ever gaining access to the actual data. The data is processed securely and the user can only access the results of the workflows.
</code_context>
<issue_to_address>
**suggestion (typo):** Improve grammar in the sentence describing the aim of the tutorial.
The sentence “The aim of this tutorial is the need for a Secure Processing Environment (SPE)” reads awkwardly. Please rephrase, e.g. “The aim of this tutorial is to address the need for a Secure Processing Environment (SPE)…” or “…is to describe a Secure Processing Environment (SPE)…”.
```suggestion
The aim of this tutorial is to address the need for a Secure Processing Environment (SPE) to analyse large volumes of sensitive data generated by biomedical and clinical research. Easy and secure access to such environments accelerates research and enables participation by researchers with limited resources.
```
</issue_to_address>
### Comment 5
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:32` </location>
<code_context>
+### Execution
+WESkit allows execution of [Snakemake](https://snakemake.readthedocs.io/en/v7.32.3/) and [Nextflow](https://www.nextflow.io/docs/latest/) workflows by sending a request to the compute infrastructure (Cloud/Cluster). Find details in the [WESkit docs](https://gitlab.com/one-touch-pipeline/weskit/documentation).
+
+A Slurm cluster can be deployed with little effort using [BiBiGrid](https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/), a framework for creating and managing cloud clusters. BiBiGrid uses Ansible to configure cloud images and setup an on-demand SLURM cluster. Alternatively use any other Slurm deployment.
+
+Access to the SPE must be restricted due to national restrictions and laws. Collaborators and foreign researchers need to obtain permission from the Identity Provider to use the SPE. A permission allows them to authenticate at the Identity Provider site and request workflow execution via WESkit on the SLURM cluster.
</code_context>
<issue_to_address>
**nitpick (typo):** Use the correct verb form for "set up".
Here "setup" is used as a verb; please change it to "set up" ("...and set up an on-demand SLURM cluster").
```suggestion
A Slurm cluster can be deployed with little effort using [BiBiGrid](https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/), a framework for creating and managing cloud clusters. BiBiGrid uses Ansible to configure cloud images and set up an on-demand SLURM cluster. Alternatively use any other Slurm deployment.
```
</issue_to_address>
### Comment 6
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:37` </location>
<code_context>
+Access to the SPE must be restricted due to national restrictions and laws. Collaborators and foreign researchers need to obtain permission from the Identity Provider to use the SPE. A permission allows them to authenticate at the Identity Provider site and request workflow execution via WESkit on the SLURM cluster.
+
+### Results
+Finally, results are stored in a storage that is mounted into the cluster and an interface that is only accessable via LS-Login. Sensitive data is not managed by WESkit or accessible in the result storage.
+
+## Step 1: WESkit
</code_context>
<issue_to_address>
**issue (typo):** Correct the spelling of "accessible".
```suggestion
Finally, results are stored in a storage that is mounted into the cluster and an interface that is only accessible via LS-Login. Sensitive data is not managed by WESkit or accessible in the result storage.
```
</issue_to_address>
### Comment 7
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:56` </location>
<code_context>
+
+The SPE uses MinIO/S3 to provide researchers access to non-sensitive results data. Depending on the environment, there are several options available on how to [deploy MinIO](https://github.com/minio/minio?tab=readme-ov-file). To configure OpenID please refer to the [MinIO OIDC Documentation](https://min.io/docs/minio/linux/operations/external-iam/configure-openid-external-identity-management.html).
+
+In this scenario we create a bucket "results" in MinIO and allow all authorized user to access MinIO with read-access on the results data.
+
+Note: Minio as storage provider removes it's open source license, therefore it might be advisable to switch to a different storage solution. Refer to [legacy binary releases](https://github.com/minio/minio?tab=readme-ov-file#legacy-binary-releases) for the last open source release.
</code_context>
<issue_to_address>
**nitpick (typo):** Use plural for "users" when referring to all authorized users.
Here this should read "all authorized users" to match the plural subject.
```suggestion
In this scenario we create a bucket "results" in MinIO and allow all authorized users to access MinIO with read-access on the results data.
```
</issue_to_address>
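For illustration, one way the "results" bucket and read-only access could be wired up with the `mc` client; the alias, endpoint, credentials, and policy name are all assumptions:

```bash
# Illustrative: create the results bucket and a read-only policy with mc.
mc alias set spe https://minio.example.org ACCESS_KEY SECRET_KEY
mc mb spe/results

cat > results-read.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::results", "arn:aws:s3:::results/*"]
  }]
}
EOF

# "create" in recent mc releases; older releases use "mc admin policy add".
mc admin policy create spe results-read results-read.json
```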
### Comment 8
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:60-62` </location>
<code_context>
+
+### Results crawler
+
+To make the non-sensitive results available in, a crawler continuously checks for new results and copies them to MinIO. This can be impemented as a shell script running as a cron job.
+
+A simple example script is given below:
</code_context>
<issue_to_address>
**issue (typo):** Remove an extra word and fix a typo in the results crawler description.
Please remove the extra "in" after "available" and correct "impemented" to "implemented".
```suggestion
### Results crawler
To make the non-sensitive results available, a crawler continuously checks for new results and copies them to MinIO. This can be implemented as a shell script running as a cron job.
```
</issue_to_address>
### Comment 9
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:94` </location>
<code_context>
+done
+```
+
+This scripts regulary checks the WESkit results folder. WESkit logs information about a workflow execution in the file `log.json`, once the workflow execution finished. The scripts checks if the `log.json` file exists and in case uploads then the result files `results.csv` into the S3 bucket. Uploaded run-directories are tagged with a `upload_token` file to prevent redundant uploads.
+
+## Step 3: User Interface
</code_context>
<issue_to_address>
**suggestion (typo):** Fix subject-verb agreement, spelling, and word order in the description of the crawler behavior.
Suggested wording: "This script regularly checks the WESkit results folder... The script checks if the `log.json` file exists and, if so, uploads the result file `results.csv` to the S3 bucket." This fixes “scripts” → “script”, “regulary” → “regularly”, and “scripts checks” → “script checks”, and smooths the phrasing around uploading the result files.
```suggestion
This script regularly checks the WESkit results folder. WESkit logs information about a workflow execution in the file `log.json` once the workflow execution has finished. The script checks if the `log.json` file exists and, if so, uploads the result file `results.csv` to the S3 bucket. Uploaded run-directories are tagged with an `upload_token` file to prevent redundant uploads.
```
</issue_to_address>
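To make the described behavior concrete, a hedged sketch of such a crawler; the directory layout, bucket alias, and file names are assumptions, and the tutorial's own script may differ:

```bash
#!/bin/bash
# Illustrative results crawler; run via cron. Paths and alias are assumptions.
RESULTS_DIR=/weskit/data
BUCKET=spe/results   # mc alias "spe" configured beforehand

for run in "$RESULTS_DIR"/*/; do
  # Only handle finished runs (WESkit writes log.json on completion).
  [ -f "$run/log.json" ] || continue
  # Skip runs that were already uploaded.
  [ -f "$run/upload_token" ] && continue
  if [ -f "$run/results.csv" ]; then
    mc cp "$run/results.csv" "$BUCKET/$(basename "$run")/results.csv" \
      && touch "$run/upload_token"
  fi
done
```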
### Comment 10
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:58` </location>
<code_context>
+
+In this scenario we create a bucket "results" in MinIO and allow all authorized user to access MinIO with read-access on the results data.
+
+Note: Minio as storage provider removes it's open source license, therefore it might be advisable to switch to a different storage solution. Refer to [legacy binary releases](https://github.com/minio/minio?tab=readme-ov-file#legacy-binary-releases) for the last open source release.
+
+### Results crawler
</code_context>
<issue_to_address>
**issue (typo):** Correct the possessive "its" and consider adjusting phrasing about MinIO.
Change "it's" to the possessive "its" ("its open source license"). You could also rephrase to something like: "MinIO as a storage provider has removed its open source license" for clarity.
```suggestion
Note: MinIO as a storage provider has removed its open source license, therefore it might be advisable to switch to a different storage solution. Refer to [legacy binary releases](https://github.com/minio/minio?tab=readme-ov-file#legacy-binary-releases) for the last open source release.
```
</issue_to_address>
### Comment 11
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:98` </location>
<code_context>
+
+## Step 3: User Interface
+
+To offer a user interface for the SPE, the simplest way is to use a [customized version](https://gitlab.com/one-touch-pipeline/weskit/gui/-/tree/spe4hd_demo) of the WESkit GUI. It offers a light weight web application to allow researchers to run and monitor workflows. The WESkit GUI repository can be used as a blueprint to create a customized website.
+
+## Step 4: Authentification and Authorization
</code_context>
<issue_to_address>
**nitpick (typo):** Use "lightweight" as a single word.
In the sentence "It offers a light weight web application...", update "light weight" to "lightweight".
```suggestion
To offer a user interface for the SPE, the simplest way is to use a [customized version](https://gitlab.com/one-touch-pipeline/weskit/gui/-/tree/spe4hd_demo) of the WESkit GUI. It offers a lightweight web application to allow researchers to run and monitor workflows. The WESkit GUI repository can be used as a blueprint to create a customized website.
```
</issue_to_address>
### Comment 12
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:100` </location>
<code_context>
+
+To offer a user interface for the SPE, the simplest way is to use a [customized version](https://gitlab.com/one-touch-pipeline/weskit/gui/-/tree/spe4hd_demo) of the WESkit GUI. It offers a light weight web application to allow researchers to run and monitor workflows. The WESkit GUI repository can be used as a blueprint to create a customized website.
+
+## Step 4: Authentification and Authorization
+
+Authentication and authorization is implemented using OIDC. This setup uses the [LS-Login infrastructure](https://lifescience-ri.eu/ls-login/) to for OIDC integration. The LS-Login documentation contains a [guide](https://lifescience-ri.eu/ls-login/documentation/service-provider-documentation/service-provider-documentation.html) on how to register a new service.
</code_context>
<issue_to_address>
**issue (typo):** Correct the spelling of "Authentication" in the section title.
Use “Authentication” rather than “Authentification” in the title.
```suggestion
## Step 4: Authentication and Authorization
```
</issue_to_address>
### Comment 13
<location> `docs/guides/guide-admin/sensitive_data_analysis.md:102` </location>
<code_context>
+
+## Step 4: Authentification and Authorization
+
+Authentication and authorization is implemented using OIDC. This setup uses the [LS-Login infrastructure](https://lifescience-ri.eu/ls-login/) to for OIDC integration. The LS-Login documentation contains a [guide](https://lifescience-ri.eu/ls-login/documentation/service-provider-documentation/service-provider-documentation.html) on how to register a new service.
+
+In this tutorial, we assume a single LS-Login service for all the deployed tools (WESkit, MinIO, WebApp). This requires that the following three addresses are valid as OIDC redirect URLs:
</code_context>
<issue_to_address>
**issue (typo):** Remove the extra "to" and fix pluralization in the LS-Login description.
In the phrase "to for OIDC integration," one "to" is redundant; "for OIDC integration" is sufficient. Also, change "detailed instruction" to "detailed instructions" to match the intended plural meaning.
Suggested implementation:
```
Authentication and authorization is implemented using OIDC. This setup uses the [LS-Login infrastructure](https://lifescience-ri.eu/ls-login/) for OIDC integration. The LS-Login documentation contains a [guide](https://lifescience-ri.eu/ls-login/documentation/service-provider-documentation/service-provider-documentation.html) on how to register a new service.
```
```
LS-Login can be activated in MinIO either by using the MinIO console using the OIDC configuration or by setting environmental variables, as described in the MinIO [OIDC Documentation](https://min.io/docs/minio/linux/operations/external-iam/configure-openid-external-identity-management.html). There are detailed instructions in the [ELIXIR-Cloud-AAI documentation](https://elixir-cloud-aai.github.io/guides/guide-admin/services_to_ls_aai/) for using MinIO with LS-Login.
```
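For reference, the environment-variable route could look roughly like this, assuming LS-Login as the OIDC provider; the discovery URL, client credentials, and redirect address are placeholders to be replaced with registered values:

```bash
# Illustrative MinIO OIDC settings for LS-Login; values are placeholders.
export MINIO_IDENTITY_OPENID_CONFIG_URL="https://login.aai.lifescience-ri.eu/oidc/.well-known/openid-configuration"
export MINIO_IDENTITY_OPENID_CLIENT_ID="<client-id>"
export MINIO_IDENTITY_OPENID_CLIENT_SECRET="<client-secret>"
export MINIO_IDENTITY_OPENID_SCOPES="openid,profile,email"
export MINIO_IDENTITY_OPENID_REDIRECT_URI="https://minio.example.org/oauth_callback"
```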
</issue_to_address>
Signed-off-by: schneiva <valentin.schneider-lunitz@charite.de>
Add Administrator Guides for Sensitive Data Processing
Summary
This PR adds two administrator guides for setting up secure processing environments (SPE) to handle sensitive data within the ELIXIR Cloud.
Changes:
crypt4gh_to_protes.md - Tutorial on encryption and processing of sensitive data using Crypt4GH and proTES/Funnel
sensitive_data_analysis.md - Guide for implementing a complete Secure Processing Environment (SPE) in the de.NBI Cloud
Summary by Sourcery
Add administrator tutorials for encrypting and processing sensitive data using Crypt4GH with proTES/Funnel and for implementing a Secure Processing Environment (SPE) for sensitive data analysis in de.NBI Cloud.
Documentation: