From 0bdf56e4a65631f8ca4a06b4e55cbecc870f13dd Mon Sep 17 00:00:00 2001 From: Martin Cech Date: Fri, 13 Mar 2026 13:45:57 +0100 Subject: [PATCH 1/7] update readme, reflect needs of build in the gitignore --- .gitignore | 4 ++++ README.md | 60 ++++++------------------------------------------------ 2 files changed, 10 insertions(+), 54 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..8776d852 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +components/ai/ +components/button.tsx +components/card3.tsx +components/toc.tsx diff --git a/README.md b/README.md index 10688ca3..cc11975c 100644 --- a/README.md +++ b/README.md @@ -14,59 +14,11 @@ The easiest way to start contributing is to find the `.mdx` file that is used to In case you want to generate the static pages locally (could be useful for large changes) see below. -1. Clone CERIT repo with some objects common to all eInfra docs: `git clone https://github.com/CERIT-SC/fumadocs` (you can do this only once, you just need to have the repo content somewhere.) There are several files that are needed to compile the docs, however they should not be copied to the repo and should be only used temporarily (ask Lukas Hejtmanek if in doubt). - -2. Clone the Metacentrum docs repo, checkout to the main branch: - -``` -git clone https://github.com/CESNET/metacentrum-user-docs - -git checkout remotes/origin/main - -# or you can make the remote branch local as: -git checkout -b main origin/main - -``` - -3. Make a small script similar to the following: - +1. Clone the CESNET/metacentrum-user-docs repo `git clone https://github.com/CESNET/metacentrum-user-docs` +2. Clone CERIT-SC/fumadocs repo with some objects common to all eInfra docs: `git clone https://github.com/CERIT-SC/fumadocs` +3. Copy the required files `cp -r fumadocs/components/* metacentrum-user-docs/components/` +4. 
Run the build
 ```bash
-#!/bin/bash
-
-# path to where the Metacentrum docs repo
-repodir="/home/melounova/meta/metacentrum-user-docs"
-
-# path to the CERIT fumadocs repo
-fumadir="/home/melounova/meta/fumadocs"
-
-# Copy some stuff from CERIT repo to Metacentrum repo
-cd ${repodir}/components
-cp -r ${fumadir}/components/* .
-cd ${repodir}
-
-# run the build
-docker run -it --rm -p 3000:3000 -e STARTPAGE=/en/docs -v ${repodir}/public:/opt/fumadocs/public -v ${repodir}/components:/opt/fumadocs/components -v ${repodir}/content/docs:/opt/fumadocs/content/docs cerit.io/docs/fuma:v15.0.12 pnpm dev
-
-# remove again the stuff borrowed from CERIT repo
-cd ${repodir}/components ; rm -r ai ; rm button.tsx card3.tsx sidebar.tsx toc.tsx
+docker run -it --rm -p 3000:3000 -e STARTPAGE=/en/docs -v metacentrum-user-docs/public:/opt/fumadocs/public -v metacentrum-user-docs/components:/opt/fumadocs/components -v metacentrum-user-docs/content/docs:/opt/fumadocs/content/docs cerit.io/docs/fuma:v16.4.6 pnpm dev
 ```
-
-4. run the script (as sudo if needed); in a browser, see the docs at `http://localhost:3000/en/docs/welcome`
-
-
-**Notes**
-
-- 8 GB of mem is just barely enough to run the build on an older ntb
-
-
-
-
-
-
-
-
-
-
-
-
-
+5. Documentation will be available at `http://localhost:3000/en/docs/welcome` and automatically rebuilt on source change

From 3d7d5b3fb97d8a96011cf4277858cbde3d916b31 Mon Sep 17 00:00:00 2001
From: Martin Cech
Date: Fri, 13 Mar 2026 13:47:47 +0100
Subject: [PATCH 2/7] refactor the complete basics of computing@metacentrum into two guides:

- the first very simple with the bare minimum and low assumptions
- second as a list of advanced topics (this one could use more love, e.g.
reprioritization) --- content/docs/access/account.mdx | 2 +- content/docs/computing/advanced.mdx | 687 ++++++++++++++++++++++ content/docs/computing/basic-tutorial.mdx | 38 -- content/docs/computing/concepts.mdx | 192 ------ content/docs/computing/meta.json | 3 +- content/docs/computing/run-basic-job.mdx | 369 ++---------- content/docs/sandbox/index.mdx | 2 +- content/docs/welcome.mdx | 2 +- 8 files changed, 756 insertions(+), 539 deletions(-) create mode 100644 content/docs/computing/advanced.mdx delete mode 100644 content/docs/computing/basic-tutorial.mdx delete mode 100644 content/docs/computing/concepts.mdx diff --git a/content/docs/access/account.mdx b/content/docs/access/account.mdx index 3c17f1f5..d424db91 100644 --- a/content/docs/access/account.mdx +++ b/content/docs/access/account.mdx @@ -27,7 +27,7 @@ Expired accounts can be renewed at any time during the year [here](https://metav ## How to start with MetaCentrum -A comprehensive tutorial for new users is [here](https://docs.metacentrum.cz/en/docs/computing/basic-tutorial). +A comprehensive tutorial for new users is [here](https://docs.metacentrum.cz/en/docs/computing/run-basic-job). ## Group data access diff --git a/content/docs/computing/advanced.mdx b/content/docs/computing/advanced.mdx new file mode 100644 index 00000000..caa4975f --- /dev/null +++ b/content/docs/computing/advanced.mdx @@ -0,0 +1,687 @@ +--- +title: Running jobs (advanced) +--- + +This guide covers advanced topics for running jobs on MetaCentrum. If you're new to MetaCentrum, start with the [Getting started guide](./run-basic-job). + +## Kerberos authentication + +MetaCentrum uses the **Kerberos protocol** for internal authentication. When you log in, you automatically receive a Kerberos ticket that allows you to move between machines, run jobs, and copy files without repeatedly entering your password. + +### Kerberos ticket expiration + + + Kerberos tickets are valid for 10 hours. 
If you stay logged in longer than this, you must regenerate your ticket. + + +You'll know your ticket has expired when you see errors like: +``` +Key has expired @ dir_s_mkdir - /storage/brno2/home/user_name +``` + +### Basic Kerberos commands + +```bash +klist # List all current tickets and their expiration times +kdestroy # Delete all tickets +kinit # Create a new Kerberos ticket (you'll be prompted for password) +``` + +### OnDemand and Kerberos + +If you're using the [OnDemand web interface](../graphical/ondemand) and your Kerberos ticket expires during a long session, you cannot simply reload the page. You must click the **Help** button in the menu (upper right corner) and select **Restart Web Server** to renew your ticket. + +For more detailed Kerberos information, see the [Kerberos security page](../access/security/kerberos). + +## Detailed resource configuration + +### Resource specification methods + +Resources can be specified in two ways: +1. On the command line with `qsub` +2. Inside the batch script on lines beginning with `#PBS` + +```bash +# On command line +qsub -l select=1:ncpus=4:mem=4gb:scratch_local=10gb -l walltime=1:00:00 myJob.sh +``` + + + If both resource specifications are present (CLI and script), the values on CLI have priority. + + +### Chunk-wide vs job-wide resources + +According to PBS terminology, a **chunk** is a subset of computational nodes on which the job runs. Resources can be: + +- **Chunk-wide**: Applied to each chunk separately (e.g., `ncpus`, `mem`, `scratch_local`) +- **Job-wide**: Applied to the job as a whole (e.g., `walltime`, software licenses) + + + For most "normal" jobs, the number of chunks is 1 (default value). See [PBS resources guide](./resources/resources) for complex parallel computing scenarios. + + +### Scratch directory types + +We offer four types of scratch storage for temporary files. For detailed information about each type, see the [Scratch storage guide](./infrastructure/scratch-storages). 
+ +| Type | Available everywhere? | Location | Environment variable | Key characteristic | +|------|----------------------|----------|---------------------|-------------------| +| `local` | Yes | `/scratch/USERNAME/job_JOBID` | `scratch_local` | Universal, large capacity | +| `ssd` | No | `/scratch.ssd/USERNAME/job_JOBID` | `scratch_ssd` | Fast I/O operations | +| `shared` | No | `/scratch.shared/USERNAME/job_JOBID` | `scratch_shared` | Can be shared by multiple jobs | +| `shm` | No | `/dev/shm/scratch.shm/USERNAME/job_JOBID` | `scratch_shm` | In RAM, ultra-fast | + + + There is no default scratch directory. You must always specify its type and volume. + + +As a default choice, we recommend **local scratch**: + +```bash +qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=1gb -l walltime=2:00:00 +``` + +### Accessing the scratch directory + +Use the `$SCRATCHDIR` environment variable to access your scratch directory: + +```bash +user123@glados12:~$ echo $SCRATCHDIR +/scratch.ssd/user123/job_14429322.pbs-m1.metacentrum.cz +user123@glados12:~$ cd $SCRATCHDIR +``` + + + If your job crashes or fails to copy data back, your files remain in scratch. Use `go_to_scratch ` to access them. 
+ + +## Interactive jobs in depth + +### Starting an interactive job + +Interactive jobs are requested via `qsub -I` (uppercase "i"): + +```bash +qsub -I -l select=1:ncpus=4 -l walltime=2:00:00 +qsub: waiting for job 13010171.pbs-m1.metacentrum.cz to start +qsub: job 13010171.pbs-m1.metacentrum.cz ready +(BULLSEYE)user123@elmo3-1:~$ # Now on computational node +``` + +### When interactive jobs are necessary + +- Testing what works (software versions, input data format, bash constructions) +- Getting initial resource estimates +- Compiling your own software +- Processing, moving, or archiving large data volumes +- Running [GUI applications](../software/graphical-access) + +### Time quota enforcement + +If you don't log out within the time quota, the job is automatically terminated: + +```bash +user123@elmo3-1:~$ =>> PBS: job killed: walltime 7230 exceeded limit 7200 +logout +qsub: job 13010171.pbs-m1.metacentrum.cz completed +``` + +### Interactive job example (Python environment setup) + +```bash +qsub -I -l select=1:ncpus=4 -l walltime=2:00:00 +qsub: waiting for job 13010171.pbs-m1.metacentrum.cz to start +qsub: job 13010171.pbs-m1.metacentrum.cz ready + +# Load mamba and create environment +(BULLSEYE)user123@elmo3-1:~$ module add mambaforge +(BULLSEYE)user123@elmo3-1:~$ mamba list | grep scipy +(BULLSEYE)user123@elmo3-1:~$ mamba create -n my_scipy +... +(BULLSEYE)user123@elmo3-1:~$ mamba activate my_scipy +(my_scipy) (BULLSEYE)user123@elmo3-1:~$ mamba install scipy +... +(my_scipy) (BULLSEYE)user123@elmo3-1:~$ python +>>> import scipy as sp +>>> +``` + +## Job ID details + +The job ID is a unique identifier crucial for tracking, manipulating, or deleting jobs, and for reporting issues to support. 
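When scripting around the scheduler, it helps to keep the job ID in a shell variable; plain parameter expansion then converts between the two forms described below (the ID value here is illustrative):

```bash
JOBID="13010171.pbs-m1.metacentrum.cz"   # illustrative full job ID, as printed by qsub
SHORT="${JOBID%%.*}"                     # strip the server suffix
echo "$SHORT"                            # -> 13010171
```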
+ +### Job ID formats + +- Short form (sometimes sufficient): `13010171.` +- Full form (always required): `13010171.pbs-m1.metacentrum.cz` + +### Getting your job ID + +- After running `qsub` command +- Inside interactive jobs or batch scripts: `echo $PBS_JOBID` +- From qstat: `qstat -u your_username @pbs-m1.metacentrum.cz` + +```bash +# Within an interactive job +(BULLSEYE)user123@elmo3-1:~$ echo $PBS_JOBID +13010171.pbs-m1.metacentrum.cz +``` + +## Job monitoring and management + +### Job states + +PBS Pro uses different codes to mark job state within the PBS ecosystem: + +| State | Description | +|-------|-------------| +| Q | Queued | +| H | Held. Job is suspended by the server, user, or administrator. Job stays in held state until released by user or administrator. | +| R | Running | +| S | Suspended (substate of R) | +| E | Exiting after having run | +| F | Finished | +| X | Finished (subjobs only) | +| W | Waiting. Job is waiting for its requested execution time or delayed due to stagein failure. | + +### Advanced qstat commands + +```bash +qstat -u user123 # list all jobs (running or queued) +qstat -xu user123 # list finished jobs +qstat -f # full details of running/queued job +qstat -xf # full details of finished job +``` + +For more detailed job monitoring and history, see [Job tracking](./jobs/job-tracking). 
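For a quick per-state tally of many jobs, the `qstat` listing can be piped through `awk`. Here the listing is stubbed with made-up data in the usual `qstat` column layout (state is the 10th column):

```bash
# Illustrative qstat -u listing; on the cluster you would pipe qstat directly.
qstat_output='Job ID                Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------------  -------- -------- ---------- ------ --- --- ------ ----- - -----
11733550.pbs-m1       user123  q_2h     myJob.sh      --    1   1    1gb 00:05 Q   --
11733551.pbs-m1       user123  q_1d     other.sh   12345    1   4    4gb 01:00 R 00:12'

# Skip the two header lines, count jobs per state code
echo "$qstat_output" | awk 'NR > 2 { count[$10]++ } END { for (s in count) print s, count[s] }' | sort
# prints:
# Q 1
# R 1
```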
+ +### qstat output interpretation + +``` +Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time +-------------------- -------- -------- ---------- ------ --- --- ------ ----- - ----- +11733550.pbs-m1 user123 q_2h myJob.sh -- 1 1 1gb 00:05 Q -- +``` + +| Header | Meaning | +|--------|---------| +| S | Status: Q (queued), R (running), F (finished) | +| NDS | Number of distinct compute nodes | +| TSK | Number of tasks (usually equals ncpus) | +| Memory | Requested memory | +| Time | Elapsed time | + +### Job deletion + +**Delete a submitted/running job:** + +```bash +qdel 21732596.pbs-m1.metacentrum.cz +``` + +**Force deletion (if plain qdel doesn't work):** + +```bash +qdel -W force 21732596.pbs-m1.metacentrum.cz +``` + +## PBS server + +The scheduling system that plans job execution is the **PBS server** (currently running OpenPBS). + +### Essential PBS commands + +- `qsub` – submit a computational job +- `qstat` – query status of a job +- `qdel` – delete a job + +The server on which the scheduler runs is `pbs-m1.metacentrum.cz`. + +### Queues + +Jobs are submitted to queues managed by the scheduler. Queues are typically defined by walltime duration or other criteria (GPUs, memory, user groups). + +Unless you have a specific reason, don't specify a queue. Jobs are submitted to a default **routing** queue and then automatically routed to appropriate **execution** queues (e.g., `q_1h`, `q_1d`). + + + View all queues at [PBSmon queues list](https://metavo.metacentrum.cz/pbsmon2/queues/list). The routing queue is marked with the routing icon (not to submit directly to execution queues). + + +For more on queues, see [Queues guide](./resources/queues). + +## Output files and error handling + +### Standard output and error + +When a job completes, two files are created in the submission directory: + +1. `.o` – standard output (STDOUT) +2. `.e` – standard error (STDERR) + +The STDERR file contains all error messages and is the first place to look if a job fails. 
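The `.o`/`.e` split is simply the separation of the two output streams, which you can reproduce locally; the script contents and the job number `12345` are made up:

```bash
# Stand-in for a real job script; "12345" plays the role of the job number.
cat > myJob.sh <<'EOF'
#!/bin/bash
echo "result: 42"                # goes to STDOUT -> myJob.sh.o12345
echo "warning: low memory" >&2   # goes to STDERR -> myJob.sh.e12345
EOF

bash myJob.sh > myJob.sh.o12345 2> myJob.sh.e12345
cat myJob.sh.e12345   # -> warning: low memory
```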
+ +### Examining failed jobs + +1. Check the `.e` file for error messages +2. Verify the exit status (see below) +3. Check if input files exist and are accessible +4. Verify module loading and software availability + +## Exit status interpretation + +The exit status (a number) indicates how a job finished. This is meaningful **only for batch jobs** - interactive jobs always have exit status 0. + +### Getting exit status + +```bash +qstat -xf job_ID | grep Exit_status +Exit_status = 271 +``` + + + For jobs older than 24 hours, qstat may not show exit status. Use `pbs-get-job-history` utility or check [PBSmon](https://metavo.metacentrum.cz/pbsmon2/jobs/detail). + + +### Exit status ranges + +| Range | Meaning | +|-------|---------| +| X < 0 | Job killed by PBS (resource exceeded or other problem) | +| 0 <= X < 256 | Exit value of shell or top process | +| X >= 256 | Job killed with OS signal | + +### Translating exit status >= 256 to OS signals + +If exit status >= 256, subtract 256 to get the OS signal code: + +``` +PBS exit status - 256 = OS signal code +``` + +For example, exit status 271 means OS signal number 15 (`SIGTERM`). + +List all OS signals: + +```bash +kill -l +``` + +### Common exit statuses + +| Job ending type | Exit status | +|-----------------|-------------| +| Missing Kerberos credentials | -23 | +| Job exceeded number of CPUs | -25 | +| Job exceeded memory | -27 | +| Job exceeded walltime | -29 | +| Normal termination | **0** | +| Job killed by `SIGTERM` (qdel) | 271 | + +### Job termination by PBS server + +The PBS server monitors resource usage. If a job exceeds its reserved resources, PBS sends a **SIGKILL** signal to terminate it: + +```bash +qstat -x -f 13030457.pbs-m1.metacentrum.cz | grep Exit_status + Exit_status = -29 # walltime exceeded +``` + +## Manual scratch cleanup + +When a job ends with an error, data may remain in the scratch directory. You should clean it up after retrieving any useful data. 
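Before cleaning up, it is usually worth checking why the job died. The exit-status arithmetic from the previous section is easy to script; the status value `271` is illustrative:

```bash
# Decode a PBS exit status, as described in the previous section.
exit_status=271   # illustrative value, as read from qstat -xf

if [ "$exit_status" -ge 256 ]; then
    # killed with an OS signal: subtract 256 to get the signal number
    echo "killed by OS signal $((exit_status - 256))"   # -> killed by OS signal 15
elif [ "$exit_status" -ge 0 ]; then
    echo "exit value of the shell or top process: $exit_status"
else
    echo "job killed by PBS itself (status $exit_status)"
fi
```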
+ +### Cleanup procedure + +You need: +- The hostname where the job ran +- The path to the scratch directory + +```bash +# Log in to the compute node +user123@skirit:~$ ssh user123@luna13.fzu.cz + +# Navigate to scratch directory +user123@luna13:~$ cd /scratch/user123/job_14053410.pbs-m1.metacentrum.cz + +# Remove all contents (not the directory itself) +user123@luna13:/scratch/user123/job_14053410.pbs-m1.metacentrum.cz$ rm -r * +``` + + + Users have permissions for `rm -rf $SCRATCHDIR/*` but not `rm -rf $SCRATCHDIR`. The scratch directory itself is deleted automatically after some time. + + +### Helpful scratch management + +**Record scratch info in your job:** + +```bash +DATADIR=/storage/brno12-cerit/home/user123/test_directory +echo "$PBS_JOBID is running on node `hostname -f` in $SCRATCHDIR" >> $DATADIR/jobs_info.txt +``` + +This creates a record that helps you find scratch directories when jobs fail. + +**Conditional cleanup:** + +```bash +# Copy output, don't fail if cleanup has issues +cp h2o.out $DATADIR/ || export CLEAN_SCRATCH=false + +# SCRATCH is auto-cleaned only if previous command succeeded +``` + +## Advanced scratch management + +### Trap command for automatic cleanup + +Use the `trap` command to ensure scratch is cleaned up even when jobs fail unexpectedly. + +#### Clean on normal exit + +```bash +trap 'clean_scratch' EXIT +``` + +This runs when the script finishes normally (or via `exit` command). + +#### Clean on job termination + +```bash +trap 'clean_scratch' TERM +``` + +This runs when the job receives a SIGTERM signal (either from PBS killing it due to resource limits, or from you using `qdel`). + + + When SIGTERM is received, you have approximately 10 seconds before SIGKILL terminates the job unconditionally. Cleanup operations must complete within this time. + + +#### Combined approach + +```bash +trap 'clean_scratch' EXIT TERM +``` + +This cleans scratch for both normal exits and terminations. 
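Put together, a minimal batch-script skeleton using this pattern might look as follows; the paths and file names are illustrative, while `clean_scratch` and `CLEAN_SCRATCH` are the utilities described above:

```bash
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1gb:scratch_local=1gb
#PBS -l walltime=1:00:00

# Clean the scratch directory on normal exit as well as on SIGTERM
trap 'clean_scratch' EXIT TERM

DATADIR=/storage/brno12-cerit/home/user123/test_directory   # illustrative path
cp "$DATADIR/input.txt" "$SCRATCHDIR/" || { echo >&2 "input copy failed"; exit 1; }
cd "$SCRATCHDIR" || exit 2

# ... the actual computation producing output.txt ...

# If copying results back fails, keep the scratch for manual retrieval
cp output.txt "$DATADIR/" || export CLEAN_SCRATCH=false
```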
+ +#### Recording failure location for manual cleanup + +If you need to retrieve data from scratch after failure: + +```bash +trap 'echo "$PBS_JOBID job failed. Retrieve from $SCRATCHDIR on `hostname -f`" >> /storage/.../jobs_info.txt' TERM +``` + +This logs the scratch location instead of attempting large file operations that might not complete in time. + +For more on trap commands, see [Trap command usage](./jobs/trap-command). + +## Custom output paths + +By default, job output files go to the submission directory (`$PBS_O_WORKDIR`). You can change this: + +```bash +qsub -o /custom-path/myOutputFile -e /custom-path/myErrorFile script.sh +``` + +Or in the batch script: + +```bash +#PBS -o /custom-path/myOutputFile +#PBS -e /custom-path/myErrorFile +``` + +For more on output file customization, see [PBS resources guide](./resources/resources). + +## Job arrays + +Job arrays allow you to run many similar jobs with a single submission instead of submitting each one individually. + +### Submitting a job array + +```bash +qsub -J X-Y[:Z] script.sh +``` + +- `X` – first index of the job +- `Y` – last index of the job +- `Z` – optional index step + +**Example:** `qsub -J 2-7:2 script.sh` creates subjobs with indexes 2, 4, 6. + +### Array job format + +The main job is displayed with `[]` (e.g., `969390[]`). Each subjob has an ID like `969390[1].pbs-m1.metacentrum.cz`. + +### Array job variables + +Inside your script, use: + +```bash +$PBS_ARRAY_INDEX # Index of the current subjob +$PBS_ARRAY_ID # Job ID of the main job +``` + +### Monitoring array jobs + +```bash +qstat -t # List all subjobs +qstat -f 969390'[]' -x | grep array_state_count # See overall status +``` + +For more on job arrays, see [Job arrays guide](./jobs/job-arrays). + +## Job dependencies + +Make a job wait until another job completes successfully. 
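A common use case is a multi-stage pipeline where each stage consumes the previous stage's results. Capturing the job IDs at submission time lets you chain the stages; the script names are made up:

```bash
# Each stage starts only after the previous one finishes with exit status 0.
jid1=$(qsub stage1.sh)
jid2=$(qsub -W depend=afterok:"$jid1" stage2.sh)
qsub -W depend=afterok:"$jid2" stage3.sh
```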
+ +### Submit with dependencies + +```bash +qsub -W depend=afterok:job1_ID.pbs-m1.metacentrum.cz job2_script.sh +``` + +This submits `job2_script.sh` to run only after `job1_ID` completes with exit code 0. + +### Modify existing job dependencies + +```bash +qalter -W depend=afterok:job1_ID.pbs-m1.metacentrum.cz job2_ID.pbs-m1.metacentrum.cz +``` + +## Modifying job attributes + +You can modify attributes of **queued** jobs (status Q) using the `qalter` command. + + + Only queuing jobs can be modified with `qalter`. For running jobs that need changes, see the "Extend walltime" section below. + + +### Common modifications + +**Change resource requirements:** + +```bash +# Original submission +qsub -l select=1:ncpus=150:mem=10gb -l walltime=1:00:00 job.sh + +# This will never start (150 CPUs on one machine is impossible) +qalter -l select=1:ncpus=32:mem=10gb job_ID.pbs-m1.metacentrum.cz +``` + +**Add or modify walltime:** + +```bash +qalter -l walltime=02:00:00 job_ID.pbs-m1.metacentrum.cz +``` + + + Walltime can only be modified within the limits of the job's assigned queue. Increasing beyond the queue's maximum will fail. + + +**Remove obsolete parameters:** + +```bash +# Remove obsolete GPU capability parameter +qalter -l select=1:ncpus=1:ngpus=1:mem=10gb job_ID.pbs-m1.metacentrum.cz +``` + + + When using `qalter`, you must specify the entire `-l` attribute including unchanged parts. + + +For more on modifying job attributes, see [Modify job attributes guide](./jobs/modify-job-attributes). + +## Extend walltime for running jobs + +The `qextend` command allows you to extend the walltime of **running** jobs. + +### Extending a job + +```bash +qextend job_ID.pbs-m1.metacentrum.cz 01:00:00 +``` + +Time can be specified as: +- A single number (seconds) +- `hh:mm:ss` format + +```bash +(BUSTER)user123@skirit:~$ qextend 8152779.pbs-m1.metacentrum.cz 01:00:00 +The walltime of the job 8152779.pbs-m1.metacentrum.cz has been extended. 
+Additional walltime: 01:00:00 +New walltime: 02:00:00 +``` + +### Usage limits + +To prevent abuse, `qextend` is limited by: +- Maximum 20 times within the last 30 days, **AND** +- Maximum 1440 CPU-hours of extension within the last 30 days + + + CPU-hours are not walltime hours. For a job running on 8 CPUs, extending by 1 hour uses 8 hours from your fund. + + +### Checking your quota + +```bash +qextend info +``` + +This shows your: +- Counter limit (usage count) +- CPU-time fund (usage in CPU-hours) +- Available remaining quota +- When the oldest extension "expires" from the 30-day window + + + If you hit the monthly limit and still need to extend a job, contact user support at meta@cesnet.cz. + + + + `qextend` works only on simple jobs. To extend an array job, contact user support at meta@cesnet.cz. + + +For more on extending walltime, see [Extend walltime guide](./jobs/extend-walltime). + +## Module span management + +### Conflicting modules + +Loading multiple modules with conflicting dependencies can cause errors. Use subshells to isolate module usage: + +```bash +# First computation with its own module environment +(module add python/3.8.0-gcc; python my_script1.py output1) + +# Second computation with different module environment +(module add python/3.8.0-gcc-rab6t; python my_script2.py output2) +``` + +Each subshell loads its own modules and automatically unloads them when the subshell exits. + +### Displaying module information + +```bash +module display module_name # Show module details including environment variables +``` + +Important variables set by modules: +- `PATH` – path to executables +- `LD_LIBRARY_PATH` – path to libraries for the linker +- `LIBRARY_PATH` – path to libraries + +For more on modules, see [Software modules guide](../software/modules). + +## Research group annual report + +Research groups are asked to submit annual reports by the end of January. The report should include: + +1. Group name +2. Contact address +3. 
List of group members
+4. Summary of research interests
+5. Hardware contributed to MetaCentrum (if applicable)
+6. Most frequently used MetaCentrum software
+7. New software developed (if applicable)
+8. List of research projects using MetaCentrum resources, with brief descriptions
+9. List of publications with MetaCentrum/CERIT-SC acknowledgements
+
+Reports can be in English or Czech, in any file format. Send to [annual-report@metacentrum.cz](mailto:annual-report@metacentrum.cz).
+
+## Additional resources
+
+- [Parallel computing](./parallel-comput) – for running MPI/OpenMP jobs
+- [GPU computing](./gpu-comput) – for GPU-accelerated workloads
+- [PBS resources](./resources/resources) – detailed resource specification guide
+- [Job tracking](./jobs/job-tracking) – detailed job monitoring and history
+- [Email notifications](./jobs/email-notif) – configure job status emails
+- [Software modules](../software/modules) – advanced module management
+- [Frontend and storage details](./infrastructure/frontend-storage) – understanding the architecture
+- [Finished jobs](./jobs/finished-jobs) – retrieving information about completed jobs
+- [Containers](../software/containers) – using Apptainer/Singularity images
+
+## Web-based job running with usegalaxy.cz
+
+As an alternative to command-line job submission, you can run computational jobs through **usegalaxy.cz**, a web-based platform provided by e-INFRA CZ / MetaCentrum together with ELIXIR CZ.
+ +### Features + +- **Thousands of tools** from various scientific domains (bioinformatics, ecology, chemistry, NLP, climate science, social sciences, and more) +- **Web interface** – no need to write batch scripts or use command line +- **Large data quotas** – 250 GB for e‑INFRA CZ login, 50 GB for Life Science login +- **Workflow support** – create, share, and publish computational pipelines +- **Intuitive interface** – tools panel, submission forms, and result history +- **Training resources** – access to Galaxy Training Network tutorials + +### Access and quotas + +**Website:** https://usegalaxy.cz – log in with e-INFRA CZ or Life Science credentials + +| Resource | Limit (standard) | +|---------------------------|-------------------------------------------| +| **Storage** | 250 GB (e‑INFRA CZ Login)
50 GB (Life Science Login) | +| **Concurrent jobs** | 10 jobs | +| **Maximum single dataset**| 50 GB | + +### When to use usegalaxy.cz + +- You prefer a web interface over command-line tools +- Your field has Galaxy tools available +- You need to build and share computational workflows +- You're working with common bioinformatics or data science tools +- You want to avoid writing and debugging batch scripts + +### Getting more resources + +If you need additional storage, compute, or concurrency for your research, contact galaxy@cesnet.cz. The team can also: +- Install additional tools +- Help wrap new tools for Galaxy +- Collaborate on designing workflows + +For detailed information about usegalaxy.cz, see the [usegalaxy.cz guide](../graphical/usegalaxy). \ No newline at end of file diff --git a/content/docs/computing/basic-tutorial.mdx b/content/docs/computing/basic-tutorial.mdx deleted file mode 100644 index 582aabab..00000000 --- a/content/docs/computing/basic-tutorial.mdx +++ /dev/null @@ -1,38 +0,0 @@ ---- -title: A comprehensive manual for beginners ---- - -*This page is currently under construction and will be completed soon.* - - -Please [let us know](https://docs.metacentrum.cz/en/docs/support) if you think any part of the manual could be expanded, or if you feel that any information is missing. We would appreciate your opinion for further improvement. - - -This manual is intended for new users who are unfamiliar with MetaCentrum NGI (National Grid Infrastructure) or any similar infrastructure. However, it isn't easy to summarise the detailed use of the entire infrastructure in a reasonably short guide. The following tutorial aims to provide the minimum necessary knowledge to enable new users to start calculating without any crucial problems as soon as possible. - -## If you are confused, please let us know - -Sometimes, problems can be too complicated, and error messages can be too cryptic. 
Please do not hesitate to contact [user support](https://docs.metacentrum.cz/en/docs/support) as soon as possible and ask for help. We are here to help you. But this is also important for us. It is sometimes necessary to catch inconsistent system behaviour in the act, because it is not possible to detect the cause of an error from the system logs afterwards. - -## Available software tools - -We provide access to several hundred application tools in thousands of individual modules. For this reason, we are unable to periodically monitor and update all of them, and also, our [list of available tools](https://docs.metacentrum.cz/en/docs/software/alphabet) only includes those that require some further description. We rely on our users to [inform us](https://docs.metacentrum.cz/en/docs/support) when a tool needs updating, and we recommend simpler [installations in the home directory](https://docs.metacentrum.cz/en/docs/software/install-software) (with our assistance if required). - -## Where is my data located, and why are there so many frontend servers? - -The main principle of the grid infrastructure is that it connects various compute clusters that are distributed across the Czech Republic. The same scheme applies to [storage servers](https://docs.metacentrum.cz/en/docs/computing/infrastructure/mount-storages), where data is located, and [frontend servers](https://docs.metacentrum.cz/en/docs/computing/infrastructure/frontends), which are the main access points. This is to distribute the user load across multiple localities and prevent all users from losing access to all data in the event of a technical problem. - -Users can, however, use any frontend server to access data on any storage server. Each frontend server is mounted on an individual storage server, and all storage servers are accessible across MetaCentrum. You can use the standard Linux command `cd` to move yourself (or the current working directory) between storage servers. 
For example, the frontend server `nympha` (`nympha.metacentrum.cz`) is mounted on the storage server `storage-plzen1.metacentrum.cz`, which is accessible from compute nodes and other frontend servers via the path `/storage/plzen1/home/user_name`. However, switching to a different storage server is simple if needed. - -```bash -local_user_name@local_pc_name:~$ ssh user_name@nympha.metacentrum.cz -... -(BOOKWORM)user_name@nympha:~$ pwd -/storage/plzen1/home/user_name -(BOOKWORM)user_name@nympha:~$ cd /storage/brno2/home/user_name -(BOOKWORM)user_name@nympha:/storage/brno2/home/user_name$ pwd -/storage/brno2/home/user_name -(BOOKWORM)user_name@nympha:/storage/brno2/home/user_name$ cd /storage/brno12-cerit/home/user_name -(BOOKWORM)user_name@nympha:/storage/brno12-cerit/home/user_name$ pwd -/storage/brno12-cerit/home/user_name -``` diff --git a/content/docs/computing/concepts.mdx b/content/docs/computing/concepts.mdx deleted file mode 100644 index a32a0faf..00000000 --- a/content/docs/computing/concepts.mdx +++ /dev/null @@ -1,192 +0,0 @@ ---- -title: Basic terms ---- - -import FrontendTable from '@/components/frontends'; - -## Frontends, storages, homes - -There are several **frontends** (login nodes) to access the grid. Each frontend has a native **home directory** on one of the **storages**. - -There are several storages (large-capacity harddisc arrays). They are named according to their physical location (a city). - -```bash -user123@user123-XPS-13-9370:~$ ssh skirit.metacentrum.cz -user123@skirit.ics.muni.cz's password: -... -(BUSTER)user123@skirit:~$ pwd # print current directory -/storage/brno2/home/user123 # "brno2" is native storage for "skirit" frontend -``` - -**List of frontends together with their native /home directories** - - - -**Frontend do's and dont's** - -Frontend usage policy is different from the one on computational nodes. 
The frontend nodes are shared by all users, the command typed by any user is performed immediately and there is no resource planning. Frontend node are not intended for heavy computing. - -Frontends should be used only for: - -- preparing inputs, data pre- and postprocessing -- managing batch jobs -- light compiling and testing - - - The resource load on frontend is monitored continuously. Processes not adhering to usage rules will be terminated without warning. For large compilations, running benchmark calculations or moving massive data volumes (> 10 GB, > 10 000 files), use interative job. - - -## PBS server - -A set of instructions performed on computational nodes is **computational job**. Jobs require a set of **resources** such as CPUs, memory or time. A **scheduling system** plans execution of the jobs so as optimize the load and usage of computational nodes. - -The server on which the scheduling system is called **PBS server** or **PBS scheduler**. - -On the current scheduler `pbs-m1.metacentrum.cz` the **[OpenPBS](https://www.openpbs.org/)** is used. - -The most important PBS Pro commands are: - -- `qsub` - submit a computational job -- `qstat` - query status of a job -- `qdel` - delete a job - -## Resources - -Every jobs need to have defined set of computational resources at the point of submission. The resources can be specified - -- on CLI as `qsub` command options, or -- inside the batch script on lines beginning with `#PBS` header. - -In the PBS terminology, a **chunk** is a subset of computational nodes on which the job runs. In most cases the concept of chunks is useful for parallelized computing only and "normal" jobs run on one chunk. We cannot avoid the concept of chunks, though, as the specification of resources differ according to whether they can be applied on a job as a whole or on a chunk. - -According to PBS internal logic, the resources are either **chunk-wide** or **job-wide**. 
-
-**Job-wide** resources are defined for the job as a whole, e.g. the maximal duration of the job or a license for commercial software. These cannot be divided into parts and distributed among the computational nodes on which the job runs. Every job-wide resource is defined in the form `-l resource=value`, e.g. `-l walltime=1:00:00`.
-
-**Chunk-wide** resources can be assigned to every chunk separately and differently.
-
-
- For the purpose of this intro, we assume that the number of chunks is always 1, which is also the default value. To see more complicated examples of per-chunk resource distribution, see the [advanced chapter on PBS resources](../computing/resources/resources).
-
-
-Chunk-wide resources are defined as options of the `select` statement, in `resource=value` pairs divided by `:`.
-
-The essential resources are:
-
-| Resource name | Keyword | Chunk-wide or job-wide? |
-|---------------|---------|-------------------------|
-| no. of CPUs | ncpus | chunk |
-| Memory | mem | chunk |
-| Maximal duration of the job | walltime | job |
-| Type and volume of space for temporary data | scratch\_local | chunk |
-
-There are a good deal more resources than the ones shown here; for example, it is possible to specify the type of the computational nodes' OS or their physical placement, software licences, CPU speed, the number of GPU cards and more. For detailed information see [PBS options detailed page]().
-
-Examples:
-
-```bash
- qsub -l select=1:ncpus=2:mem=4gb:scratch_local=1gb -l walltime=2:00:00 myJob.sh
-```
-
-where
-
- ncpus is the number of processors (2 in this example),
- mem is the size of memory that will be reserved for the job (4 GB in this example, default 400 MB),
- scratch_local specifies the size and type of scratch directory (1 GB in this example, no default),
- walltime is the maximum time the job will run, set in the format hh:mm:ss (2 hours in this example, default 24 hours).
-
-## Queues
-
-When a job is submitted, it is added to one of the **queues** managed by the scheduler.
Queues can be defined arbitrarily by the admins based on various criteria - usually on walltime, but also on the number of GPU cards, the size of memory, etc. Some queues are reserved for defined groups of users ("private" queues).
-
-Unless you [have a reason to send a job to a specific queue](../computing/resources/queues), do not specify any. The job will be submitted into a default queue and from there routed to one of the execution queues.
-
-The default queue is only a **routing** one: it serves to sort jobs into other queues according to the job's walltime - e.g. `q_1h` (1-hour jobs), `q_1d` (1-day jobs), etc.
-
-The latter queues are **execution** ones, i.e. they serve to actually run the jobs.
-
-The [list of queues for all schedulers](https://metavo.metacentrum.cz/pbsmon2/queues/list) can be found in PBSmon.
-
-![Queues list (top)](/img/meta/computing/queues_top.png)
-
-. . .
-
-![Queues list (bottom)](/img/meta/computing/queues_bottom.png)
-
-with the respective meaning of the icons:
-
-| Icon | meaning |
-|----|----|
-| ![Routing queue icon](/img/meta/computing/routing-logo.png) | routing queue
(to send jobs into) |
-| ![Execution queue icon](/img/meta/computing/exec-logo.png) | execution queue
(not to send jobs into) |
-| ![Private queue icon](/img/meta/computing/private-logo.png) | private queue
(limited for a group of users) |
-
-## Modules
-
-The software installed in Metacentrum is packed (together with dependencies, libraries and environment variables) in so-called **modules**.
-
-To be able to use a particular software, you must **load a module**.
-
-The key command for working with software is `module`; see `module --help` on any frontend.
-
-**Basic commands**
-
-```bash
-module avail orca/ # list versions of installed Orca
-
-module add orca # load Orca module (default version)
-module load orca # dtto
-
-module list # list currently loaded modules
-
-module unload orca # unload module orca
-module purge # unload all currently loaded modules
-```
-
-For more complicated examples of module usage, see [advanced chapter on modules](../software/modules).
-
-## Scratch directory
-
-Most applications produce some large temporary files during the calculation.
-
-To store these files, as well as all the input data, on the computational node, a disc space must be reserved for them.
-
-
- If your HPC job crashes or fails to copy data back from scratch to your home directory, don't worry! Your output files remain stored in the scratch. To access these files, simply use the command `go_to_scratch <JOBID>`, replacing `<JOBID>` with your actual job ID. Please retrieve your data promptly since scratch storage is temporary and may be purged after a certain period to free up space for other users.
-
-
-This is the purpose of the **scratch directory** on the computational node.
-
-
- There is no default scratch directory and the user must always specify its type and volume.
-
-
-Currently we offer four types of scratch storage:
-
-| Type | Available on every node?
| Location on machine | Resource keyword | Key characteristic |
-|------| -------------------------|---------------------|-------------------|----------------------|
-| local | yes | `/scratch/USERNAME/job_JOBID` | `scratch_local`| universal, large capacity, available everywhere |
-| ssd | no | `/scratch.ssd/USERNAME/job_JOBID` | `scratch_ssd`| fast I/O operations |
-| shared | no | `/scratch.shared/USERNAME/job_JOBID` | `scratch_shared`| can be shared by more jobs |
-| shm | no | `/dev/shm/scratch.shm/USERNAME/job_JOBID` | `scratch_shm`| exists in RAM, ultra fast |
-
-As a default choice, we recommend using **local scratch**:
-
-```bash
-qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=1gb -l walltime=2:00:00
-```
-
-To access the scratch directory, use the system variable `SCRATCHDIR`:
-
-```bash
-(BULLSEYE)user123@skirit:~$ qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=1gb -l walltime=2:00:00
-qsub: waiting for job 14429322.pbs-m1.metacentrum.cz to start
-qsub: job 14429322.pbs-m1.metacentrum.cz ready
-
-user123@glados12:~$ echo $SCRATCHDIR
-/scratch.ssd/user123/job_14429322.pbs-m1.metacentrum.cz
-user123@glados12:~$ cd $SCRATCHDIR
-user123@glados12:/scratch.ssd/user123/job_14429322.pbs-m1.metacentrum.cz$
-```
-
diff --git a/content/docs/computing/meta.json b/content/docs/computing/meta.json
index 1037d570..d71ca36f 100644
--- a/content/docs/computing/meta.json
+++ b/content/docs/computing/meta.json
@@ -1,9 +1,8 @@
 {
   "title": "Grid computing",
   "pages": [
-    "concepts",
-    "basic-tutorial",
     "run-basic-job",
+    "advanced",
     "jobs",
     "infrastructure",
     "resources",
diff --git a/content/docs/computing/run-basic-job.mdx b/content/docs/computing/run-basic-job.mdx
index cf6ff2aa..7ff587d2 100644
--- a/content/docs/computing/run-basic-job.mdx
+++ b/content/docs/computing/run-basic-job.mdx
@@ -1,358 +1,119 @@
 ---
-title: Run simple job
+title: Getting started
 ---
 
-Welcome to the basic guide on how to run calculations in the Metacentrum grid service.
You will learn how to -- navigate between **frontends**, **home directories** and **storages**, -- make use of **batch** and **interactive** job, -- **submit a job** to a **PBS server**, -- set up **resources** for a job, -- retrieve job **output**. +## Welcome to MetaCentrum - - - 1. have a Metacentrum account - 2. be able to login to a frontend node - 3. have elementary knowledge of the Linux command line - -*If anything is missing, see [Access](../access/terms) section.* - - - -## Lifecycle of a job - -### Batch job - -A typical use case for grid computing is a non-interactive batch job, when the user only prepares input and set of instructions at the beginning. The calculation itself then runs independently on the user. - -Batch jobs consist of the following steps: - -1. **User prepares data** to be used in the calculation **and instructions** what is to be done with them (input files + batch script). -2. The batch script is submitted to the job planner (**PBS server**), which stages the job until the required resources are available. -3. After the PBS server has released the job to be run, **the job runs** on one of the computational nodes. -4. At this time, the applications (software) are loaded. -5. When the job is finished, results are copied back to the user's directory according to instructions in the batch script. - -![pic](/img/meta/computing/batch-job-scheme_border.jpg) - -### Interactive job - -Interactive job works in different way. The user does not need to specify in advance what will be done, neither does not need to prepare any input data. Instead, they first reserve computational resources and, after the job starts to run, work interactively on the CLI. - -The interactive job consists of the following steps: - -1. User **submits request for specified resources** to the PBS server -2. **PBS server stages** this request until the resources are available. -3. When the job starts running, **user is redirected** to a computational node's CLI. -4. 
**User does whatever they need** on the node.
-5. When the user logs out of the computational node or when the time reserved for the job runs out, the job is done.
-
-![pic](/img/meta/computing/interact-job-scheme_border.jpg)
-
-### Batch vs interactive
-
-A primary choice for grid computing is the batch job. Batch jobs allow users to run massive sets of calculations without the need to oversee them, manipulate data, etc. They also optimize the usage of computational resources better, as there is no need to wait for the user's input.
-
-Interactive jobs are good for:
-
-- testing what works and what does not (software versions, input data format, bash constructions to be used in a batch script later, etc.)
-- getting a first guess about resources
-- compiling your own software
-- processing, moving or archiving large amounts of data
-
-Interactive jobs are **necessary** for [running GUI applications](../software/graphical-access).
-
-## Batch job example
-
-The batch script in the following example is called `myJob.sh`.
-
-```bash
- (BUSTER)user123@skirit:~$ cat myJob.sh
- #!/bin/bash
- #PBS -N batch_job_example
- #PBS -l select=1:ncpus=4:mem=4gb:scratch_local=10gb
- #PBS -l walltime=1:00:00
- # The 3 lines above are options for the scheduling system: the job will run 1 hour at maximum; 1 machine with 4 processors + 4gb RAM memory + 10gb scratch space is requested
-
- # define a DATADIR variable: directory where the input files are taken from and where the output will be copied to
- DATADIR=/storage/brno12-cerit/home/user123/test_directory # substitute username and path to your real username and path
-
- # append a line to a file "jobs_info.txt" containing the ID of the job, the hostname of the node it is run on, and the path to a scratch directory
- # this information helps to find a scratch directory in case the job fails, and you need to remove the scratch directory manually
- echo "$PBS_JOBID is running on node `hostname -f` in a scratch directory $SCRATCHDIR" >> $DATADIR/jobs_info.txt
-
- # load the Gaussian application module, version 03
- module add g03
-
- # test if the scratch directory is set
- # if scratch directory is not set, issue an error message and exit
- test -n "$SCRATCHDIR" || { echo >&2 "Variable SCRATCHDIR is not set!"; exit 1; }
-
- # copy input file "h2o.com" to scratch directory
- # if the copy operation fails, issue an error message and exit
- cp $DATADIR/h2o.com $SCRATCHDIR || { echo >&2 "Error while copying input file(s)!"; exit 2; }
-
- # move into scratch directory
- cd $SCRATCHDIR
-
- # run Gaussian 03 with h2o.com as input and save the results into h2o.out file
- # if the calculation ends with an error, issue an error message and exit
- g03 <h2o.com >h2o.out || { echo >&2 "Calculation ended up erroneously (with a code $?) !!"; exit 3; }
-
- # move the output to user's DATADIR or exit in case of failure
- cp h2o.out $DATADIR/ || { echo >&2 "Result file(s) copying failed (with a code $?)
!!"; exit 4; } - - # clean the SCRATCH directory - clean_scratch -``` - -The last two lines can be piped together. - -```bash - cp h2o.out $DATADIR/ || export CLEAN_SCRATCH=false -``` - -SCRATCH will be automatically cleaned (by the `clean_scratch` utility) only if the copy command finishes without error. - - -The job is then submitted as - -```bash - (BUSTER)user123@skirit:~$ qsub myJob.sh - 11733571.pbs-m1.metacentrum.cz # job ID is 11733571.pbs-m1.metacentrum.cz -``` - -Alternatively, you can specify resources on the command line. In this case, the lines starting by `#PBS` need not to be in the batch script. - -```bash - (BUSTER)user123@skirit:~$ qsub -l select=1:ncpus=4:mem=4gb:scratch_local=10gb -l walltime=1:00:00 myJob.sh -``` +MetaCentrum provides free computing resources to Czech academic institutions through distributed compute clusters. - If both resource specifications are present (on CLI as well as inside the script), the values on CLI have priority. + New users unfamiliar with grid computing environments often have questions. Please [contact user support](https://docs.metacentrum.cz/en/docs/support) if you need help - we're here for you, and your feedback helps us improve the documentation. -## Interactive job example - -An interactive job is requested via `qsub -I` command (uppercase "i"). - -```bash - (BUSTER)user123@skirit:~$ qsub -I -l select=1:ncpus=4 -l walltime=2:00:00 # submit interactive job - qsub: waiting for job 13010171.pbs-m1.metacentrum.cz to start - qsub: job 13010171.pbs-m1.metacentrum.cz ready # 13010171.pbs-m1.metacentrum.cz is the job ID - (BULLSEYE)user123@elmo3-1:~$ # elmo3-1 is computational node - (BULLSEYE)user123@elmo3-1:~$ module add mambaforge # make available mamba - (BULLSEYE)user123@elmo3-1:~$ mamba list | grep scipy # make sure there is no scipy package already installed - (BULLSEYE)user123@elmo3-1:~$ mamba search scipy - ... 
# mamba returns list of scipy packages available in repositories - (BULLSEYE)user123@elmo3-1:~$ mamba create -n my_scipy # create my environment to install scipy into - ... - environment location: /storage/praha1/home/user123/.conda/envs/my_scipy - ... - Proceed ([y]/n)? y - ... - (BULLSEYE)user123@elmo3-1:~$ mamba activate my_scipy # enter the environment - (my_scipy) (BULLSEYE)user123@elmo3-1:~$ - (my_scipy) (BULLSEYE)user123@elmo3-1:~$ mamba install scipy - ... - Proceed ([y]/n)? y - ... - Downloading and Extracting Packages - ... - (my_scipy) (BULLSEYE)user123@elmo3-1:~$ python - ... - >>> import scipy as sp - >>> -``` - -Unless you log out within the time quota (in this example 2 hours), you will get the following message: - -```bash - user123@elmo3-1:~$ =>> PBS: job killed: walltime 7230 exceeded limit 7200 - logout - qsub: job 13010171.pbs-m1.metacentrum.cz completed -``` - -## job ID - -Job ID is a unique identifier in a job. Job ID is crucial to track, manipulate or delete job, as well as to identify your problem to user support. - -Under some circumstances, the job can be identified by the number only (e.g. `13010171.`). In general, however, the PBS server suffix is needed, too, to fully identify the job (e.g. `13010171.meta-pbs.metacentrum.cz`). - -You can get the job ID: - -- after running the `qsub` command -- by `echo $PBS_JOBID` in the interactive job or in the batch script -- by `qstat -u your_username @pbs-m1.metacentrum.cz - -Within interactive job: - -```bash - (BULLSEYE)user123@elmo3-1:~$ echo $PBS_JOBID - 13010171.pbs-m1.metacentrum.cz -``` - -By `qstat` command: +Before getting started you need an active MetaCentrum account (see [Account guide](../access/account)). 
-```bash - (BULLSEYE)user123@perian :~$ qstat -u user123 @pbs-m1.metacentrum.cz - - Job id Name User Time Use S Queue - --------------------- ---------------- ---------------- -------- - ----- - 1578105.pbs-m1 Boom-fr-bulk_12* fiserp 0 Q q_1w -``` +## Getting started: Logging in -## Job status - -The basic command for getting the status of your jobs is the `qstat` command. +Once you have an activated account connect to MetaCentrum using SSH with your username. Here we are using the `tarkil.metacentrum.cz` login server (frontend). You can use any frontend, but we recommend choosing one closest to your physical location to minimize network latency. ```bash - qstat -u user123 # list all jobs of user "user123" running or queuing on the PBS server - qstat -xu user123 # list finished jobs for user "user123" - qstat -f # list details of the running or queueing job with a given jobID - qstat -xf # list details of the finished job with a given jobID +ssh your_username@tarkil.metacentrum.cz ``` -You will see something like the following table: +For the full list of frontends, their locations, and detailed login instructions (including Windows/PuTTY), see the [Log in guide](../access/log-in). -```bash - Req'd Req'd Elap - Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time - -------------------- -------- -------- ---------- ------ --- --- ------ ----- - ----- - 11733550.meta-pbs.* user123 q_2h myJob.sh -- 1 1 1gb 00:05 Q -- -``` +## Understanding the architecture -The letter under the header 'S' (status) gives the status of the job. The most common states are: +MetaCentrum connects distributed compute clusters. Each frontend has a native home directory on a storage server. You can access any storage from any frontend. -- Q – queued -- R – running -- F – finished +Frontends are shared by all users and are **not** intended for heavy computing. Use them only for data preparation, job management, or light compiling. For computing, use batch or interactive jobs. 
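Before starting even light work there, it can help to see how busy the shared login node already is. A quick check (a sketch using standard Linux tools, nothing MetaCentrum-specific):

```bash
# How many CPUs does this frontend have, and how loaded is it right now?
nproc     # number of available processors
uptime    # load averages; values near or above the CPU count mean the node is busy
```

If the load is high, consider a different frontend, or move the work into a job.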
-To learn more about how to track running job and how to retrieve job history, see [Job tracking page](../computing/jobs/job-tracking). +For detailed infrastructure information see [Frontend & Storage guide](../infrastructure/frontend-storage). -## Output files +## Basic job concepts -When a job is completed (no matter how), two files are created in the directory from which you have submitted the job: +### Batch job vs interactive job -1. `.o` - job's standard output (STDOUT) -2. `.e` - job's standard error output (STDERR) +**Batch job**: Non-interactive, submit script and it runs independently (primary choice for grid computing) -STDERR file contains all the error messages which occurred during the calculation. It is a first place where to look if the job has failed. +**Interactive job**: Reserve resources, then work interactively (useful for testing, compiling, running [GUI apps](../software/graphical-access)) -## Job termination - -### Done by user - -Sometimes, you need to delete the submitted/running job. This can be done by `qdel` command: +## Your first batch job ```bash - (BULLSEYE)user123@skirit~: qdel 21732596.pbs-m1.metacentrum.cz +#!/bin/bash +#PBS -N my_job +#PBS -l select=1:ncpus=4:mem=4gb:scratch_local=10gb -l walltime=1:00:00 + +DATADIR=/storage/cityXY/home/user/data +cp $DATADIR/input.txt $SCRATCHDIR +cd $SCRATCHDIR +module add software_name +run_calculation +cp results.txt $DATADIR/ +clean_scratch ``` -If plain `qdel` does not work, add `-W` (force del) option: +Essential resources: `select` (number of chunks), `ncpus`, `mem`, `scratch_local`, `walltime` ```bash - (BULLSEYE)user123@skirit~: qdel -W force 21732596.pbs-m1.metacentrum.cz +qsub myJob.sh # submit job, returns job ID ``` -### Done by PBS server - -The PBS server keeps track of resources used by the job. In case the job uses more resources than it has reserved, PBS server sends a **SIGKILL** signal to the execution host. 
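This kill-on-overuse behaviour can be illustrated locally, without PBS. Note one difference in bookkeeping: an ordinary shell encodes a signal death as 128 + signal, whereas PBS reports it as 256 + signal (a sketch only):

```bash
# A background process is killed with SIGKILL (signal 9), as PBS does
# when a job exceeds its reserved resources; the shell then reports
# exit status 128 + 9 = 137.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo $?        # prints 137
```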
- -You can see the signal as `Exit_status` on CLI: +## Software modules ```bash - (BULLSEYE)user123@tarkil:~$ qstat -x -f 13030457.pbs-m1.metacentrum.cz | grep Exit_status - Exit_status = -29 +module avail tool/ # list versions +module add tool # load default version +module list # show loaded modules ``` -## Exit status - -When the job is finished (no matter how), it exits with a certain **exit status** (a number). - - - Interactive jobs have always exit status equal to 0. + + Use wildcards to search: `module avail *python*`. Add `/` to see versions within a module directory. -A normal termination is denominated by 0. - -Any non-zero exit status means the job failed for some reason. - -You can get the exit status by typing +## Monitoring jobs ```bash - (BULLSEYE)user123@skirit:~$ qstat -xf job_ID | grep Exit_status - Exit_status = 271 +qstat -u username # list your jobs ``` - - The `qstat -x -f` works only for recently finished jobs (last 24 hours). For For older jobs, use the `pbs-get-job-history` utility - see [advanced chapter on getting info about older jobs](../computing/jobs/finished-jobs#older). - - -Alternatively, you can navigate to [your list of jobs in PBSmon](https://metavo.metacentrum.cz/pbsmon2/jobs/detail), go to tab "Jobs" and choose a particular finished job from the list. - -A gray table at the bottom of the page contains many variables connected to the job. Search for "Exit status" like shown in the picture below: +Status codes: `Q`=queued, `R`=running, `F`=finished -![pic](/img/meta/computing/exit_status.png) +Output files (in submission directory): +- `jobname.o` – standard output +- `jobname.e` – errors (check here first if job fails) -### Exit status ranges +## Account maintenance -Exit status can fall into one of three categories, or ranges of numbers. +**Renewal**: Accounts expire February 2nd. You'll be notified by email. 
-| Exit status range | Meaning | -|--------------------|---------------------| -| X < 0 | job killed by PBS; either some resource was exceeded
or another problem occurred |
-| 0 <= X < 256 | exit value of shell or top process of the job |
-| X >= 256 | job was killed with an OS signal |

**Security**: Use a strong password and never share credentials. For password changes and complete security rules, see [Account page](../access/account) and [Terms and conditions](../access/terms).

-### Exit status to `SIG*` type

## Next steps

-If the exit status exceeds 256, it means a signal from the operating system has terminated the job.

Now that you understand the basics, you can:
- Learn [advanced job configuration and troubleshooting](./advanced)
- Explore [available software](../software/alphabet)
- Read about [parallel computing](./parallel-comput)
- Check [GPU resources](./gpu-comput)
- Try [usegalaxy.cz web interface](../graphical/usegalaxy) – an alternative way to run jobs with a web-based platform providing thousands of tools, large data quotas (250 GB for e‑INFRA CZ login), and workflow support, accessible at https://usegalaxy.cz

-Usually this means the user has deleted the job by `qdel`, upon which a `SIGKILL` and/or `SIGTERM` signal is sent.

## Troubleshooting basics

-OS signals have numeric codes of their own.

If your job fails:
1. Check the error file (`*.e`)
2. Verify your input files exist and have correct permissions
3. Check if software modules are loaded correctly
4. Ensure you requested adequate resources (memory, walltime, scratch)

-Type `kill -l` on any frontend to get a list of OS signals together with their values.

For more advanced troubleshooting, see the [Advanced guide](./advanced).

-To translate a PBS exit code >= 256 to an OS signal type, just subtract 256 from the exit code.

## Acknowledgements

-For example, an exit status of 271 means OS signal no. 15 (a `SIGTERM`).

Publications created with MetaCentrum support must include the e-INFRA CZ acknowledgement (ID:90254) and be submitted to the [publications system](https://publications.e-infra.cz/all-publications).
For ELIXIR CZ resources, use ID:90255. -![pic](/img/meta/computing/sigterm.png) - - - `PBS exit status` - `256` = `OS signal code`. + + Computational resources were provided by the e-INFRA CZ project (ID:90254), + supported by the Ministry of Education, Youth and Sports of the Czech Republic. - -### Common exit statuses - -Most often you will meet some of the following signals: - -| Type of job ending | Exit status | -|--------------------|---------------| -| missing Kerberos credenials | -23 | -| job exceeded number of CPUs | -25 | -| job exceeded memory | -27 | -| job exceeded walltime | -29 | -| **normal termination** | **0** | -| Job killed by `SIGTERM`
(result of `qdel`) | 271 | - -## Manual scratch clean - -In case of erroneous job ending, the data are left in the scratch directory. You should always clean the scratch after all potentially useful data has been retrieved. To do so, you need to know the hostname of machine where the job was run, and path to the scratch directory. - - - Users' rights allow only `rm -rf $SCRATCHDIR/*`, not `rm -rf $SCRATCHDIR`. - - -For example: - -```bash - user123@skirit:~$ ssh user123@luna13.fzu.cz # login to luna13.fzu.cz - user123@luna13:~$ cd /scratch/user123/job_14053410.pbs-m1.metacentrum.cz # enter scratch directory - user123@luna13:/scratch/user123/job_14053410.pbs-m1.metacentrum.cz$ rm -r * # remove all content -``` - -The scratch directory itself will be **deleted automatically** after some time. - diff --git a/content/docs/sandbox/index.mdx b/content/docs/sandbox/index.mdx index 175ccaf8..733269d0 100644 --- a/content/docs/sandbox/index.mdx +++ b/content/docs/sandbox/index.mdx @@ -31,7 +31,7 @@ import IconGrid from '@/public/img/meta/welcome/icon-pbs.svg'; }> Distributed HPC computing built on OpenPBS scheduler and NFS filesystem.
MetaCentrum Grid
- MetaCentrum Grid docs + MetaCentrum Grid docs
diff --git a/content/docs/welcome.mdx b/content/docs/welcome.mdx index 1f00c8b0..462af0e1 100644 --- a/content/docs/welcome.mdx +++ b/content/docs/welcome.mdx @@ -39,7 +39,7 @@ Welcome to the MetaCentrum documentation, the home of all MetaCentrum services. }> Distributed HPC computing built on OpenPBS scheduler and NFS filesystem.
MetaCentrum Grid
- MetaCentrum Grid docs + MetaCentrum Grid docs
From 282161adc5766fbb4506c48afac0bbc0d7725214 Mon Sep 17 00:00:00 2001 From: Martin Cech Date: Fri, 13 Mar 2026 13:51:18 +0100 Subject: [PATCH 3/7] change callout to quote --- content/docs/computing/run-basic-job.mdx | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/content/docs/computing/run-basic-job.mdx b/content/docs/computing/run-basic-job.mdx index 7ff587d2..62465338 100644 --- a/content/docs/computing/run-basic-job.mdx +++ b/content/docs/computing/run-basic-job.mdx @@ -111,9 +111,7 @@ For more advanced troubleshooting, see the [Advanced guide](./advanced). ## Acknowledgements -Publications created with MetaCentrum support must include the e-INFRA CZ acknowledgement (ID:90254) and be submitted to the [publications system](https://publications.e-infra.cz/all-publications). For ELIXIR CZ resources, use ID:90255. +Publications created with MetaCentrum support must include the e-INFRA CZ acknowledgement (ID:90254) and be submitted to the [publications system](https://publications.e-infra.cz/all-publications). For ELIXIR CZ resources, please use ID:90255. - - Computational resources were provided by the e-INFRA CZ project (ID:90254), - supported by the Ministry of Education, Youth and Sports of the Czech Republic. - +>Computational resources were provided by the e-INFRA CZ project (ID:90254), +>supported by the Ministry of Education, Youth and Sports of the Czech Republic. 
From 347b3d647ea54d3164472dbf2d16d65c402c09b8 Mon Sep 17 00:00:00 2001 From: Martin Cech Date: Fri, 13 Mar 2026 14:08:38 +0100 Subject: [PATCH 4/7] remove duplication, provide more links to other content --- content/docs/computing/advanced.mdx | 483 +++++----------------------- 1 file changed, 78 insertions(+), 405 deletions(-) diff --git a/content/docs/computing/advanced.mdx b/content/docs/computing/advanced.mdx index caa4975f..29b1cee7 100644 --- a/content/docs/computing/advanced.mdx +++ b/content/docs/computing/advanced.mdx @@ -6,32 +6,17 @@ This guide covers advanced topics for running jobs on MetaCentrum. If you're new ## Kerberos authentication -MetaCentrum uses the **Kerberos protocol** for internal authentication. When you log in, you automatically receive a Kerberos ticket that allows you to move between machines, run jobs, and copy files without repeatedly entering your password. - -### Kerberos ticket expiration - - - Kerberos tickets are valid for 10 hours. If you stay logged in longer than this, you must regenerate your ticket. - - -You'll know your ticket has expired when you see errors like: -``` -Key has expired @ dir_s_mkdir - /storage/brno2/home/user_name -``` - -### Basic Kerberos commands +MetaCentrum uses Kerberos for internal authentication. Tickets expire after 10 hours. ```bash -klist # List all current tickets and their expiration times -kdestroy # Delete all tickets -kinit # Create a new Kerberos ticket (you'll be prompted for password) +klist # List tickets +kdestroy # Delete tickets +kinit # Create new ticket ``` -### OnDemand and Kerberos +On ticket expiration, use `kinit` to regenerate. For OnDemand users, restart the web server via **Help → Restart Web Server**. -If you're using the [OnDemand web interface](../graphical/ondemand) and your Kerberos ticket expires during a long session, you cannot simply reload the page. 
You must click the **Help** button in the menu (upper right corner) and select **Restart Web Server** to renew your ticket. - -For more detailed Kerberos information, see the [Kerberos security page](../access/security/kerberos). +For detailed Kerberos information, see [Kerberos security page](../access/security/kerberos). ## Detailed resource configuration @@ -61,112 +46,54 @@ According to PBS terminology, a **chunk** is a subset of computational nodes on For most "normal" jobs, the number of chunks is 1 (default value). See [PBS resources guide](./resources/resources) for complex parallel computing scenarios.
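For illustration, the chunk syntax chains per-chunk specifications with `+`. This is a sketch only; the resource values are arbitrary and the command is not meant to be run as-is:

```bash
# One chunk with 8 CPUs and 16 GB plus two chunks with 4 CPUs and 8 GB each
# (illustrative values; walltime remains a job-wide resource):
qsub -l select=1:ncpus=8:mem=16gb+2:ncpus=4:mem=8gb -l walltime=4:00:00 myJob.sh
```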
-### Scratch directory types - -We offer four types of scratch storage for temporary files. For detailed information about each type, see the [Scratch storage guide](./infrastructure/scratch-storages). +### Scratch directories -| Type | Available everywhere? | Location | Environment variable | Key characteristic | -|------|----------------------|----------|---------------------|-------------------| -| `local` | Yes | `/scratch/USERNAME/job_JOBID` | `scratch_local` | Universal, large capacity | -| `ssd` | No | `/scratch.ssd/USERNAME/job_JOBID` | `scratch_ssd` | Fast I/O operations | -| `shared` | No | `/scratch.shared/USERNAME/job_JOBID` | `scratch_shared` | Can be shared by multiple jobs | -| `shm` | No | `/dev/shm/scratch.shm/USERNAME/job_JOBID` | `scratch_shm` | In RAM, ultra-fast | - - - There is no default scratch directory. You must always specify its type and volume. - - -As a default choice, we recommend **local scratch**: +Four scratch types are available. Default: `scratch_local`. +**Recommended:** ```bash qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=1gb -l walltime=2:00:00 ``` -### Accessing the scratch directory +Access scratch via `$SCRATCHDIR`. Use `go_to_scratch ` to access scratch after job failure. -Use the `$SCRATCHDIR` environment variable to access your scratch directory: +For detailed scratch type information, see [Scratch storage guide](./infrastructure/scratch-storages). -```bash -user123@glados12:~$ echo $SCRATCHDIR -/scratch.ssd/user123/job_14429322.pbs-m1.metacentrum.cz -user123@glados12:~$ cd $SCRATCHDIR -``` +## Interactive jobs - - If your job crashes or fails to copy data back, your files remain in scratch. Use `go_to_scratch ` to access them. - +### Starting interactive jobs -## Interactive jobs in depth +Request interactive session: `qsub -I -l select=1:ncpus=4 -l walltime=2:00:00` -### Starting an interactive job +Jobs are auto-terminated when walltime expires. 
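The walltime cut-off behaves much like the standard `timeout` utility, which you can try locally (a sketch; no PBS required):

```bash
# A command that outlives its time limit is killed; timeout signals this
# with exit status 124, much as PBS kills a job at the end of its walltime.
timeout 1 sleep 5
echo $?        # prints 124
```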
-Interactive jobs are requested via `qsub -I` (uppercase "i"): +### When useful -```bash -qsub -I -l select=1:ncpus=4 -l walltime=2:00:00 -qsub: waiting for job 13010171.pbs-m1.metacentrum.cz to start -qsub: job 13010171.pbs-m1.metacentrum.cz ready -(BULLSEYE)user123@elmo3-1:~$ # Now on computational node -``` - -### When interactive jobs are necessary - -- Testing what works (software versions, input data format, bash constructions) -- Getting initial resource estimates -- Compiling your own software -- Processing, moving, or archiving large data volumes +- Testing software, input formats, resource estimates +- Compiling, processing/moving large data - Running [GUI applications](../software/graphical-access) -### Time quota enforcement +### Example -If you don't log out within the time quota, the job is automatically terminated: - -```bash -user123@elmo3-1:~$ =>> PBS: job killed: walltime 7230 exceeded limit 7200 -logout -qsub: job 13010171.pbs-m1.metacentrum.cz completed -``` - -### Interactive job example (Python environment setup) +Interactive jobs are useful for software testing, compiling, and data processing: ```bash qsub -I -l select=1:ncpus=4 -l walltime=2:00:00 -qsub: waiting for job 13010171.pbs-m1.metacentrum.cz to start -qsub: job 13010171.pbs-m1.metacentrum.cz ready - -# Load mamba and create environment -(BULLSEYE)user123@elmo3-1:~$ module add mambaforge -(BULLSEYE)user123@elmo3-1:~$ mamba list | grep scipy -(BULLSEYE)user123@elmo3-1:~$ mamba create -n my_scipy -... -(BULLSEYE)user123@elmo3-1:~$ mamba activate my_scipy -(my_scipy) (BULLSEYE)user123@elmo3-1:~$ mamba install scipy -... -(my_scipy) (BULLSEYE)user123@elmo3-1:~$ python ->>> import scipy as sp ->>> +# Once on compute node: +module add mambaforge +mamba create -n my_env +mamba activate my_env +python my_script.py ``` ## Job ID details -The job ID is a unique identifier crucial for tracking, manipulating, or deleting jobs, and for reporting issues to support. 
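In scripts, the numeric part can be split off the full job ID with plain shell parameter expansion (a sketch; the ID value is illustrative):

```bash
# Strip the PBS server suffix from a full job ID (illustrative value;
# inside a job you would use "$PBS_JOBID" instead of a literal).
full_id="13010171.pbs-m1.metacentrum.cz"
echo "${full_id%%.*}"    # prints 13010171
```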
- -### Job ID formats - -- Short form (sometimes sufficient): `13010171.` -- Full form (always required): `13010171.pbs-m1.metacentrum.cz` - -### Getting your job ID +Job IDs identify jobs for tracking and management: `13010171.pbs-m1.metacentrum.cz` (full form required). -- After running `qsub` command -- Inside interactive jobs or batch scripts: `echo $PBS_JOBID` -- From qstat: `qstat -u your_username @pbs-m1.metacentrum.cz` - -```bash -# Within an interactive job -(BULLSEYE)user123@elmo3-1:~$ echo $PBS_JOBID -13010171.pbs-m1.metacentrum.cz -``` +Get your job ID: +- After `qsub` command +- Inside jobs: `echo $PBS_JOBID` +- From qstat: `qstat -u username` ## Job monitoring and management @@ -204,13 +131,7 @@ Job ID Username Queue Jobname SessID NDS TSK Memory Time S 11733550.pbs-m1 user123 q_2h myJob.sh -- 1 1 1gb 00:05 Q -- ``` -| Header | Meaning | -|--------|---------| -| S | Status: Q (queued), R (running), F (finished) | -| NDS | Number of distinct compute nodes | -| TSK | Number of tasks (usually equals ncpus) | -| Memory | Requested memory | -| Time | Elapsed time | +Key headers: `S`=status, `NDS`=nodes, `TSK`=tasks, `Memory`=requested memory, `Time`=elapsed. ### Job deletion @@ -226,197 +147,67 @@ qdel 21732596.pbs-m1.metacentrum.cz qdel -W force 21732596.pbs-m1.metacentrum.cz ``` -## PBS server - -The scheduling system that plans job execution is the **PBS server** (currently running OpenPBS). - -### Essential PBS commands - -- `qsub` – submit a computational job -- `qstat` – query status of a job -- `qdel` – delete a job - -The server on which the scheduler runs is `pbs-m1.metacentrum.cz`. - -### Queues +## PBS server and queues -Jobs are submitted to queues managed by the scheduler. Queues are typically defined by walltime duration or other criteria (GPUs, memory, user groups). +**Essential commands**: `qsub` (submit), `qstat` (query), `qdel` (delete) -Unless you have a specific reason, don't specify a queue. 
Jobs are submitted to a default **routing** queue and then automatically routed to appropriate **execution** queues (e.g., `q_1h`, `q_1d`).
+**Queues**: Jobs route automatically from the routing queue to execution queues (`q_1h`, `q_1d`, etc.). Don't specify a queue unless necessary.

- View all queues at [PBSmon queues list](https://metavo.metacentrum.cz/pbsmon2/queues/list). The routing queue is marked with the routing icon (not to submit directly to execution queues).
+ View all queues at [PBSmon](https://metavo.metacentrum.cz/pbsmon2/queues/list). For more on queues, see [Queues guide](./resources/queues).

-For more on queues, see [Queues guide](./resources/queues).
-
## Output files and error handling

-### Standard output and error
-
-When a job completes, two files are created in the submission directory:
-
-1. `.o` – standard output (STDOUT)
-2. `.e` – standard error (STDERR)
-
-The STDERR file contains all error messages and is the first place to look if a job fails.
-
-### Examining failed jobs
+When a job completes, two files are created in the submission directory: `jobname.o<job_number>` (STDOUT) and `jobname.e<job_number>` (STDERR). The `.e` file is the first place to look if a job fails.

-1. Check the `.e` file for error messages
-2. Verify the exit status (see below)
-3. Check if input files exist and are accessible
-4. Verify module loading and software availability
+For detailed output file handling, see [Job tracking guide](./jobs/job-tracking).

## Exit status interpretation

-The exit status (a number) indicates how a job finished. This is meaningful **only for batch jobs** - interactive jobs always have exit status 0.
-
-### Getting exit status
+Exit status indicates how a batch job finished (interactive jobs always return 0).

```bash
-qstat -xf job_ID | grep Exit_status
-Exit_status = 271
+qstat -xf job_ID | grep Exit_status  # Get exit status
```

- For jobs older than 24 hours, qstat may not show exit status. 
Use `pbs-get-job-history` utility or check [PBSmon](https://metavo.metacentrum.cz/pbsmon2/jobs/detail). + For jobs >24h old, use `pbs-get-job-history` or [PBSmon](https://metavo.metacentrum.cz/pbsmon2/jobs/detail). -### Exit status ranges - -| Range | Meaning | -|-------|---------| -| X < 0 | Job killed by PBS (resource exceeded or other problem) | -| 0 <= X < 256 | Exit value of shell or top process | -| X >= 256 | Job killed with OS signal | - -### Translating exit status >= 256 to OS signals - -If exit status >= 256, subtract 256 to get the OS signal code: - -``` -PBS exit status - 256 = OS signal code -``` - -For example, exit status 271 means OS signal number 15 (`SIGTERM`). - -List all OS signals: - -```bash -kill -l -``` - -### Common exit statuses - -| Job ending type | Exit status | -|-----------------|-------------| -| Missing Kerberos credentials | -23 | -| Job exceeded number of CPUs | -25 | -| Job exceeded memory | -27 | -| Job exceeded walltime | -29 | -| Normal termination | **0** | -| Job killed by `SIGTERM` (qdel) | 271 | - -### Job termination by PBS server +**Ranges**: +- `X < 0`: PBS killed job (resource exceeded) +- `0 <= X < 256`: Shell/top process exit +- `X >= 256`: OS signal (subtract 256 for signal code; use `kill -l` to list signals) -The PBS server monitors resource usage. If a job exceeds its reserved resources, PBS sends a **SIGKILL** signal to terminate it: +**Common statuses**: `-23`=missing Kerberos, `-25`=exceeded CPUs, `-27`=exceeded memory, `-29`=exceeded walltime, `0`=normal, `271`=SIGTERM (qdel) -```bash -qstat -x -f 13030457.pbs-m1.metacentrum.cz | grep Exit_status - Exit_status = -29 # walltime exceeded -``` +## Scratch cleanup -## Manual scratch cleanup +When a job ends with an error, data may remain in scratch. Clean up after retrieving useful data. -When a job ends with an error, data may remain in the scratch directory. You should clean it up after retrieving any useful data. 
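As a quick illustration of the exit-status ranges above (271 is a hypothetical status value, as reported by `qstat`):

```shell
status=271                          # hypothetical Exit_status from qstat
if [ "$status" -ge 256 ]; then
    signal=$((status - 256))        # PBS exit status - 256 = OS signal number
    echo "killed by signal $signal" # 271 gives signal 15, i.e. SIGTERM (qdel)
fi
```

Running `kill -l` lists the signal names corresponding to these numbers.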
+### Manual cleanup

-### Cleanup procedure
-
-You need:
-- The hostname where the job ran
-- The path to the scratch directory
+Log in to the compute node and remove scratch contents:

```bash
-# Log in to the compute node
-user123@skirit:~$ ssh user123@luna13.fzu.cz
-
-# Navigate to scratch directory
-user123@luna13:~$ cd /scratch/user123/job_14053410.pbs-m1.metacentrum.cz
-
-# Remove all contents (not the directory itself)
-user123@luna13:/scratch/user123/job_14053410.pbs-m1.metacentrum.cz$ rm -r *
+ssh user123@node.fzu.cz
+cd /scratch/user123/job_JOBID
+rm -r *
```

- Users have permissions for `rm -rf $SCRATCHDIR/*` but not `rm -rf $SCRATCHDIR`. The scratch directory itself is deleted automatically after some time.
+ Use `go_to_scratch job_ID` to locate scratch after job failure. You may remove `$SCRATCHDIR/*` but not `$SCRATCHDIR` itself; the scratch directory is deleted automatically.

-### Helpful scratch management
-
-**Record scratch info in your job:**
+### Automatic cleanup with trap

```bash
-DATADIR=/storage/brno12-cerit/home/user123/test_directory
-echo "$PBS_JOBID is running on node `hostname -f` in $SCRATCHDIR" >> $DATADIR/jobs_info.txt
+trap 'clean_scratch' EXIT TERM  # Clean on normal exit or termination
+# Alternative: log the location for manual retrieval instead (note that this
+# TERM trap would replace the clean_scratch TERM handler above):
+trap 'echo "$PBS_JOBID failed at $SCRATCHDIR" >> log.txt' TERM
```

-This creates a record that helps you find scratch directories when jobs fail.
-
-**Conditional cleanup:**
-
-```bash
-# Copy output, don't fail if cleanup has issues
-cp h2o.out $DATADIR/ || export CLEAN_SCRATCH=false
-
-# SCRATCH is auto-cleaned only if previous command succeeded
-```
-
-## Advanced scratch management
-
-### Trap command for automatic cleanup
-
-Use the `trap` command to ensure scratch is cleaned up even when jobs fail unexpectedly.
-
-#### Clean on normal exit
-
-```bash
-trap 'clean_scratch' EXIT
-```
-
-This runs when the script finishes normally (or via `exit` command). 
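Outside PBS, the trap mechanism itself can be observed with plain shell; the `cleanup` function here is a hypothetical stand-in for MetaCentrum's `clean_scratch` helper:

```shell
#!/usr/bin/env bash
# cleanup() is a stand-in for the clean_scratch helper available on nodes.
cleanup() { echo "scratch cleaned"; }

trap cleanup EXIT TERM   # fires on normal exit and on SIGTERM (e.g. qdel)
echo "job body runs"
# when the script ends, the EXIT trap runs and the cleanup message is printed
```

Because TERM is also trapped, the same cleanup runs when PBS terminates the job, as long as it finishes within the grace period before SIGKILL.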
- -#### Clean on job termination - -```bash -trap 'clean_scratch' TERM -``` - -This runs when the job receives a SIGTERM signal (either from PBS killing it due to resource limits, or from you using `qdel`). - - - When SIGTERM is received, you have approximately 10 seconds before SIGKILL terminates the job unconditionally. Cleanup operations must complete within this time. - - -#### Combined approach - -```bash -trap 'clean_scratch' EXIT TERM -``` - -This cleans scratch for both normal exits and terminations. - -#### Recording failure location for manual cleanup - -If you need to retrieve data from scratch after failure: - -```bash -trap 'echo "$PBS_JOBID job failed. Retrieve from $SCRATCHDIR on `hostname -f`" >> /storage/.../jobs_info.txt' TERM -``` - -This logs the scratch location instead of attempting large file operations that might not complete in time. - -For more on trap commands, see [Trap command usage](./jobs/trap-command). +The `trap` command ensures scratch cleanup even when jobs fail. See [Trap command guide](./jobs/trap-command) for details. ## Custom output paths @@ -493,144 +284,58 @@ qalter -W depend=afterok:job1_ID.pbs-m1.metacentrum.cz job2_ID.pbs-m1.metacentru ## Modifying job attributes -You can modify attributes of **queued** jobs (status Q) using the `qalter` command. - - - Only queuing jobs can be modified with `qalter`. For running jobs that need changes, see the "Extend walltime" section below. - - -### Common modifications - -**Change resource requirements:** +Modify **queued** jobs (status Q) with `qalter`: ```bash -# Original submission -qsub -l select=1:ncpus=150:mem=10gb -l walltime=1:00:00 job.sh - -# This will never start (150 CPUs on one machine is impossible) qalter -l select=1:ncpus=32:mem=10gb job_ID.pbs-m1.metacentrum.cz -``` - -**Add or modify walltime:** - -```bash qalter -l walltime=02:00:00 job_ID.pbs-m1.metacentrum.cz ``` - - Walltime can only be modified within the limits of the job's assigned queue. 
Increasing beyond the queue's maximum will fail. + + Walltime can only be modified within the queue's maximum. You must specify the entire `-l` attribute with `qalter`. -**Remove obsolete parameters:** - -```bash -# Remove obsolete GPU capability parameter -qalter -l select=1:ncpus=1:ngpus=1:mem=10gb job_ID.pbs-m1.metacentrum.cz -``` - - - When using `qalter`, you must specify the entire `-l` attribute including unchanged parts. - - -For more on modifying job attributes, see [Modify job attributes guide](./jobs/modify-job-attributes). +For running jobs, see "Extend walltime" below. For more, see [Modify job attributes guide](./jobs/modify-job-attributes). ## Extend walltime for running jobs -The `qextend` command allows you to extend the walltime of **running** jobs. - -### Extending a job +Extend walltime of **running** jobs with `qextend`: ```bash -qextend job_ID.pbs-m1.metacentrum.cz 01:00:00 +qextend job_ID.pbs-m1.metacentrum.cz 01:00:00 # hh:mm:ss or seconds ``` -Time can be specified as: -- A single number (seconds) -- `hh:mm:ss` format - -```bash -(BUSTER)user123@skirit:~$ qextend 8152779.pbs-m1.metacentrum.cz 01:00:00 -The walltime of the job 8152779.pbs-m1.metacentrum.cz has been extended. -Additional walltime: 01:00:00 -New walltime: 02:00:00 -``` - -### Usage limits - -To prevent abuse, `qextend` is limited by: -- Maximum 20 times within the last 30 days, **AND** -- Maximum 1440 CPU-hours of extension within the last 30 days - - - CPU-hours are not walltime hours. For a job running on 8 CPUs, extending by 1 hour uses 8 hours from your fund. 
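The CPU-hour accounting multiplies the extension by the job's CPU count; a quick sketch:

```shell
ncpus=8                                 # CPUs held by the running job
extension_hours=1                       # extra walltime requested via qextend
cpu_hours=$((ncpus * extension_hours))  # drawn from the monthly CPU-hour fund
echo "$cpu_hours CPU-hours used"        # 8 CPU-hours for a 1-hour extension
```

So a single 1-hour extension of an 8-CPU job consumes 8 of the 1440 monthly CPU-hours.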
- - -### Checking your quota +**Limits**: Max 20 times/month AND 1440 CPU-hours/month (CPU-hours = walltime × ncpus) ```bash -qextend info +qextend info # Check your quota ``` -This shows your: -- Counter limit (usage count) -- CPU-time fund (usage in CPU-hours) -- Available remaining quota -- When the oldest extension "expires" from the 30-day window +Array jobs require support contact: meta@cesnet.cz - - If you hit the monthly limit and still need to extend a job, contact user support at meta@cesnet.cz. - - - - `qextend` works only on simple jobs. To extend an array job, contact user support at meta@cesnet.cz. - - -For more on extending walltime, see [Extend walltime guide](./jobs/extend-walltime). +For more, see [Extend walltime guide](./jobs/extend-walltime). ## Module span management -### Conflicting modules - -Loading multiple modules with conflicting dependencies can cause errors. Use subshells to isolate module usage: +For conflicting modules, use subshells to isolate environments: ```bash -# First computation with its own module environment -(module add python/3.8.0-gcc; python my_script1.py output1) - -# Second computation with different module environment -(module add python/3.8.0-gcc-rab6t; python my_script2.py output2) +(module add python/3.8.0-gcc; python script.py) # Independent module environment ``` -Each subshell loads its own modules and automatically unloads them when the subshell exits. - -### Displaying module information - ```bash -module display module_name # Show module details including environment variables +module display module_name # Show module details ``` -Important variables set by modules: -- `PATH` – path to executables -- `LD_LIBRARY_PATH` – path to libraries for the linker -- `LIBRARY_PATH` – path to libraries +`module display` shows key variables: `PATH`, `LD_LIBRARY_PATH`, `LIBRARY_PATH`. -For more on modules, see [Software modules guide](../software/modules). +For more, see [Software modules guide](../software/modules). 
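The subshell trick above relies on ordinary shell scoping: environment changes made inside `( ... )` vanish when the subshell exits, which is what keeps each module environment independent. A module-free demonstration with a plain variable:

```shell
# Changes inside ( ... ) die with the subshell - the same mechanism that
# isolates the modules loaded in each parenthesized command group.
MODE=outer
( MODE=inner; echo "in subshell: $MODE" )   # prints: in subshell: inner
echo "after subshell: $MODE"                # prints: after subshell: outer
```

The same holds for `PATH` and `LD_LIBRARY_PATH` set by `module add` inside a subshell.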
## Research group annual report -Research groups are asked to submit annual reports by the end of January. The report should include: - -1. Group name -2. Contact address -3. List of group members -4. Summary of research interests -5. Hardware contributed to MetaCentrum (if applicable) -6. Most frequently used MetaCentrum software -7. New software developed (if applicable) -8. List of research projects using MetaCentrum resources, with brief descriptions -9. List of publications with MetaCentrum/CERIT-SC acknowledgements +Submit annual reports by end of January: group name/members/contact, research interests, contributions (hardware, software), projects, publications. -Reports can be in English or Czech, in any file format. Send to [annual-report@metacentrum.cz](mailto:annual-report@metacentrum.cz). +Send to [annual-report@metacentrum.cz](mailto:annual-report@metacentrum.cz). ## Additional resources @@ -646,42 +351,10 @@ Reports can be in English or Czech, in any file format. Send to [annual-report@m ## Web-based job running with usegalaxy.cz -As an alternative to command-line job submission, you can run computational jobs through **usegalaxy.cz**, a web-based platform provided by e-INFRA CZ / MetaCentrum together with ELIXIR CZ. - -As an alternative to command-line job submission, you can run computational jobs through **usegalaxy.cz**, a web-based platform provided by e-INFRA CZ / MetaCentrum together with ELIXIR CZ. 
- -### Features - -- **Thousands of tools** from various scientific domains (bioinformatics, ecology, chemistry, NLP, climate science, social sciences, and more) -- **Web interface** – no need to write batch scripts or use command line -- **Large data quotas** – 250 GB for e‑INFRA CZ login, 50 GB for Life Science login -- **Workflow support** – create, share, and publish computational pipelines -- **Intuitive interface** – tools panel, submission forms, and result history -- **Training resources** – access to Galaxy Training Network tutorials - -### Access and quotas - -**Website:** https://usegalaxy.cz – log in with e-INFRA CZ or Life Science credentials - -| Resource | Limit (standard) | -|---------------------------|-------------------------------------------| -| **Storage** | 250 GB (e‑INFRA CZ Login)
50 GB (Life Science Login) | -| **Concurrent jobs** | 10 jobs | -| **Maximum single dataset**| 50 GB | - -### When to use usegalaxy.cz - -- You prefer a web interface over command-line tools -- Your field has Galaxy tools available -- You need to build and share computational workflows -- You're working with common bioinformatics or data science tools -- You want to avoid writing and debugging batch scripts +As an alternative to command-line job submission, use **usegalaxy.cz** – a web-based platform providing thousands of tools, large data quotas (250 GB for e‑INFRA CZ login), and workflow support. -### Getting more resources +**Access**: https://usegalaxy.cz – log in with e-INFRA CZ or Life Science credentials -If you need additional storage, compute, or concurrency for your research, contact galaxy@cesnet.cz. The team can also: -- Install additional tools -- Help wrap new tools for Galaxy -- Collaborate on designing workflows +**When useful**: Web interface preference, available Galaxy tools, workflow building, avoiding script writing -For detailed information about usegalaxy.cz, see the [usegalaxy.cz guide](../graphical/usegalaxy). \ No newline at end of file +**More resources**: For detailed features and quotas, see [usegalaxy.cz guide](../graphical/usegalaxy). 
From cce3c78a81929f13d07fa39930e53053368a175e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ji=C5=99=C3=AD=20Vorel?= Date: Mon, 16 Mar 2026 15:39:58 +0100 Subject: [PATCH 5/7] resolve conflict, no change needed From 26ef27c1f92c97dd02f05ecd909f30488b261343 Mon Sep 17 00:00:00 2001 From: Martin Cech Date: Wed, 18 Mar 2026 10:39:19 +0100 Subject: [PATCH 6/7] shorten Galaxy mention --- content/docs/computing/run-basic-job.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/computing/run-basic-job.mdx b/content/docs/computing/run-basic-job.mdx index 62465338..7542a580 100644 --- a/content/docs/computing/run-basic-job.mdx +++ b/content/docs/computing/run-basic-job.mdx @@ -97,7 +97,7 @@ Now that you understand the basics, you can: - Explore [available software](../software/alphabet) - Read about [parallel computing](./parallel-comput) - Check [GPU resources](./gpu-comput) -- Try [usegalaxy.cz web interface](../graphical/usegalaxy) – an alternative way to run jobs with a web-based platform providing thousands of tools, large data quotas (250 GB for e‑INFRA CZ login), and workflow support, accessible at https://usegalaxy.cz +- Try usegalaxy.cz [web interface](../graphical/usegalaxy) – an alternative way to run jobs in a web-based platform with workflow support. ## Troubleshooting basics From 05f7eff58a495e68e341a51f1462b98052ff51de Mon Sep 17 00:00:00 2001 From: Martin Cech Date: Fri, 20 Mar 2026 11:26:18 +0100 Subject: [PATCH 7/7] update readme to clarify build --- README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index cc11975c..d326151f 100644 --- a/README.md +++ b/README.md @@ -17,8 +17,9 @@ In case you want to generate the static pages locally (could be useful for large 1. Clone the CESNET/metacentrum-user-docs repo `git clone https://github.com/CESNET/metacentrum-user-docs` 2. 
Clone CERIT-SC/fumadocs repo with some objects common to all eInfra docs: `git clone https://github.com/CERIT-SC/fumadocs` 3. Copy the required files `cp -r fumadocs/components/* metacentrum-user-docs/components/` -4. Run the build +4. Enter the directory `cd metacentrum-user-docs` +5. Run the build ```bash -docker run -it --rm -p 3000:3000 -e STARTPAGE=/en/docs -v metacentrum-user-docs/public:/opt/fumadocs/public -v metacentrum-user-docs/components:/opt/fumadocs/components -v metacentrum-user-docs/content/docs:/opt/fumadocs/content/docs cerit.io/docs/fuma:v16.4.6 pnpm dev +docker run -it --rm -p 3000:3000 -e STARTPAGE=/en/docs -v ./public:/opt/fumadocs/public -v ./components:/opt/fumadocs/components -v ./content/docs:/opt/fumadocs/content/docs cerit.io/docs/fuma:v16.4.6 pnpm dev ``` -5. Documentation will be available at `http://localhost:3000/en/docs/welcome` and automatically rebuilt on source change +6. Documentation will be available at `http://localhost:3000/en/docs/welcome` and automatically rebuilt on source change.