Add ESM2 PEFT recipe #1446
Conversation
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@bionemo-recipes/models/esm2/pyproject.toml`:
- Line 17: The git-based peft dependency string in pyproject.toml currently pins
to a branch ("peft @
git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora"); replace the
branch ref with an immutable identifier (a specific commit SHA or an official
release tag) so the requirement becomes pinned to that commit/tag, update the
dependency entry accordingly, and verify the chosen SHA/tag exists in the peft
repo and builds correctly (look for the dependency line containing "peft @
git+https://github.com/balvisio/peft.git@...").
In `@bionemo-recipes/recipes/esm2_peft_te/checkpoint.py`:
- Line 1: The current checkpoint.py is a cross-recipe symlink to another
recipe's checkpoint logic; replace it by copying the checkpoint implementation
into this recipe (or extract the shared logic into a new common module and
import from there), add the proper per-file license header to the copied file,
remove any imports that reference other recipes, and update all references to
use the local checkpoint implementation (i.e., the functions/classes originally
provided by the external checkpoint module) so the recipe is fully
self-contained.
In `@bionemo-recipes/recipes/esm2_peft_te/dataset.py`:
- Around line 20-28: The file imports DataCollatorWithFlattening from
transformers but this recipe expects the local, recipe-specific implementation
in collator.py; change the import to pull DataCollatorWithFlattening from the
local collator module (the same place TokenPackingDataset is imported from) so
that the DataCollatorWithFlattening used by Dataset code matches the
recipe-specific signature and Flash Attention / THD-format parameters; update
the import list to reference collator.DataCollatorWithFlattening instead of
transformers.DataCollatorWithFlattening.
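A minimal sketch of the suggested import change, assuming collator.py in this recipe exposes both symbols as described above:

```python
# dataset.py (sketch)
# Before (generic transformers collator, lacking the THD-format / Flash
# Attention parameters this recipe relies on):
# from transformers import DataCollatorWithFlattening

# After (recipe-local implementation, same module as TokenPackingDataset):
from collator import DataCollatorWithFlattening, TokenPackingDataset
```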
In `@bionemo-recipes/recipes/esm2_peft_te/distributed_config.py`:
- Line 1: The file distributed_config.py currently points to
../esm2_native_te/distributed_config.py (cross-recipe symlink); replace this
pointer by inlining the distributed_config implementation into this recipe (copy
the code from esm2_native_te/distributed_config.py into this recipe's
distributed_config.py) and add the required per-file license header at the top,
or alternatively extract any truly shared utilities into a new common module
outside recipes (e.g., a shared package) and import that instead; ensure the new
distributed_config.py in this recipe contains no imports from other recipes and
includes the proper license header.
In `@bionemo-recipes/recipes/esm2_peft_te/Dockerfile`:
- Around line 1-12: The Dockerfile is copying files from esm2_native_te
(checkpoint.py, collator.py, distributed_config.py, scheduler.py) which violates
the self-contained recipe rule; remove those COPY lines and either vendor those
helper modules into esm2_peft_te (add them under esm2_peft_te/ and update any
imports) or move them into a shared non-recipe package, then update the
Dockerfile to only COPY esm2_peft_te/ and its requirements; ensure any import
paths in the code reference the vendored modules (e.g., esm2_peft_te.checkpoint,
esm2_peft_te.collator, etc.) so the image no longer depends on esm2_native_te.
In `@bionemo-recipes/recipes/esm2_peft_te/infer.py`:
- Around line 27-36: Add a Google-style docstring to the _batched_inference
function: document the function purpose in one line, then an Args section
listing model, tokenizer, records, batch_size (int), max_seq_length (int),
stride (int), infer_overflowing_aas (bool), and device (str) with short
descriptions and types, and a Returns section describing the tuple of list[str]
(predicted sequences) and list[int] (corresponding lengths/ids) following
pydocstyle conventions; place the docstring immediately below the def
_batched_inference(...) signature and ensure proper triple-quote formatting and
punctuation.
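A sketch of what the requested docstring could look like; the parameter descriptions are assumptions inferred from the names listed above, and the existing signature in infer.py should be left unchanged:

```python
def _batched_inference(
    model, tokenizer, records, batch_size, max_seq_length, stride, infer_overflowing_aas, device
):
    """Run batched inference over input records and collect per-sequence predictions.

    Args:
        model: Fine-tuned (PEFT) model used for prediction.
        tokenizer: Tokenizer matching the base model checkpoint.
        records (list[dict]): Parsed input records to predict on.
        batch_size (int): Number of sequences per forward pass.
        max_seq_length (int): Maximum tokenized sequence length.
        stride (int): Tokenizer stride used when sequences overflow.
        infer_overflowing_aas (bool): Whether to also run inference on overflowing chunks.
        device (str): Device to run inference on, e.g. "cuda" or "cpu".

    Returns:
        tuple[list[str], list[int]]: Predicted sequences and the index of the
        input sample each prediction maps to.
    """
    ...
```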
- Around line 46-76: The code assumes inputs contains "overflow_to_sample_mapping", but when the tokenizer is called without return_overflowing_tokens (i.e., infer_overflowing_aas=False) that key is missing. Before using overflow_map in the inner loop (the block that constructs sub_inputs and iterates over preds), guard the access with overflow_map = inputs.pop("overflow_to_sample_mapping", None). When overflow_map is None, compute original_idx from the outer sample index i directly (or j + k) without indexing overflow_map, and keep appending to sequences_to_sample_mapping and predictions as before so no KeyError is raised. Update the assignment original_idx = i + overflow_map[j + k].item() to handle both cases (a sketch follows below).
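One possible shape for that guard, as a sketch only; the loop variables i, j, k and the surrounding bookkeeping follow the description above rather than verified source:

```python
# Present only when the tokenizer was called with return_overflowing_tokens=True.
overflow_map = inputs.pop("overflow_to_sample_mapping", None)

for k, pred in enumerate(preds):
    if overflow_map is not None:
        # Map each overflowing chunk back to the sample it came from.
        original_idx = i + overflow_map[j + k].item()
    else:
        # No overflow: chunks correspond one-to-one with the batched samples.
        original_idx = i + j + k
    sequences_to_sample_mapping.append(original_idx)
    predictions.append(pred)
```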
- Line 108: Replace the hardcoded tokenizer checkpoint with the model-derived
tokenizer: modify the AutoTokenizer.from_pretrained call (where tokenizer =
AutoTokenizer.from_pretrained("nvidia/esm2_t48_15B_UR50D")) to load from the
runtime/config value (e.g., args.model_tag or a config field tokenizer_name) so
the tokenizer matches the model loaded elsewhere in infer.py; also add
tokenizer_name: ${model_tag} to hydra_config/defaults_infer.yaml so the
tokenizer checkpoint can be overridden via config. Ensure you reference the same
symbol used to load the model (args.model_tag or the config object) when calling
AutoTokenizer.from_pretrained.
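A sketch of the config-driven tokenizer load; tokenizer_name is the proposed new Hydra field, and args stands in for the resolved config object used elsewhere in infer.py:

```python
from transformers import AutoTokenizer

# tokenizer_name is assumed to default to ${model_tag} in defaults_infer.yaml,
# so loading it keeps the tokenizer in sync with the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)
```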
In `@bionemo-recipes/recipes/esm2_peft_te/requirements.txt`:
- Around line 2-6: The git-based peft dependency is pinned to a mutable branch;
update the requirement line "peft @
git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora" to reference an
immutable identifier by replacing the branch name with a specific commit SHA or
released tag (e.g., @<commit-sha> or `@vX.Y.Z`) so builds are reproducible and
supply-chain safe; ensure the updated string remains in requirements.txt as
"peft @ git+https://github.com/balvisio/peft.git@<commit-or-tag>".
In `@bionemo-recipes/recipes/esm2_peft_te/scheduler.py`:
- Line 1: This file currently redirects to or imports from
esm2_native_te.scheduler which violates recipe isolation; replace that
cross-recipe dependency by copying the required scheduler logic into this recipe
or implementing a local equivalent. Identify the exported symbols you depend on
(e.g., Scheduler class, create_scheduler or get_scheduler factory function, and
any helper functions like schedule_task or init_scheduler) from the
esm2_native_te implementation, reproduce their behavior locally inside
bionemo-recipes/recipes/esm2_peft_te/scheduler.py, update any local imports to
use the new local implementations, and remove any import lines that reference
esm2_native_te.scheduler so the recipe is fully self-contained.
- Line 1: The file currently contains a bare path string, which raises a SyntaxError; replace it with a valid Python implementation or a re-export import. Remove the literal path and either add an import that re-exports the native scheduler (e.g., a from esm2_native_te.scheduler import of the needed symbols) or implement a minimal wrapper function/class matching this package's expected API (i.e., the functions/classes used elsewhere that reference scheduler.py) so the module can be imported without error; if you implement a wrapper, update its symbols to match the callers.
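If the recipe instead goes the vendoring route from the previous comment, a minimal local scheduler.py could look like the following; the function name and warmup behavior are illustrative assumptions, not the actual esm2_native_te API:

```python
"""Learning-rate scheduler utilities for the esm2_peft_te recipe."""

from torch.optim.lr_scheduler import LambdaLR


def get_linear_warmup_scheduler(optimizer, num_warmup_steps: int) -> LambdaLR:
    """Linearly ramp the learning rate over num_warmup_steps, then hold it constant.

    Args:
        optimizer: Optimizer whose learning rate is scheduled.
        num_warmup_steps: Number of steps over which to ramp up.

    Returns:
        A LambdaLR implementing the warmup schedule.
    """

    def lr_lambda(step: int) -> float:
        if step < num_warmup_steps:
            return float(step + 1) / float(max(1, num_warmup_steps))
        return 1.0

    return LambdaLR(optimizer, lr_lambda)
```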
In `@bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py`:
- Line 25: The recipe imports NVEsmForConvTokenClassification from
modeling_esm_te in train_lora_convnet.py which pulls code from another recipe
and breaks self-containment; either vendor the required module into this recipe
(add modeling_esm_te.py with NVEsmForConvTokenClassification implementation
alongside train_lora_convnet.py and update the package/module path) or declare
and install the external esm package from models/esm2 (add it to
requirements.txt and update the Dockerfile to copy/install that package) so that
the import in train_lora_convnet.py resolves without referencing code outside
this recipe.
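Two ways the import could resolve without reaching outside the recipe, sketched under the assumption that the esm package ships modeling_esm_te (its path under models/esm2/src/esm suggests so):

```python
# Option A: vendor modeling_esm_te.py into this recipe directory so the
# existing import resolves against a local copy.
from modeling_esm_te import NVEsmForConvTokenClassification

# Option B: declare the esm package from models/esm2 in requirements.txt and
# the Dockerfile, then import it as an installed dependency instead:
# from esm.modeling_esm_te import NVEsmForConvTokenClassification
```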
🟡 Minor comments (15)
bionemo-recipes/recipes/esm2_peft_te/scheduler.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Add license header and Google-style module docstring.
The file is missing the required license header and module docstring. Pre-commit hooks will likely fail.
As per coding guidelines: “Ensure license headers are present in all files…” and “Use Google-style docstrings (pydocstyle).”
.github/workflows/unit-tests-recipes.yml-158-160 (1)
158-160: ⚠️ Potential issue | 🟡 Minor
Avoid hard-coded safe.directory path to prevent CI breakage in forks.
Using a fixed /__w/bionemo-framework/bionemo-framework path can fail when the repo name changes (forks or mirrors), triggering "dubious ownership" errors. Prefer $GITHUB_WORKSPACE so the command is resilient.
🛠️ Proposed fix
```diff
- run: git -c safe.directory=/__w/bionemo-framework/bionemo-framework sparse-checkout add bionemo-recipes/recipes/esm2_native_te
+ run: git -c safe.directory="$GITHUB_WORKSPACE" sparse-checkout add bionemo-recipes/recipes/esm2_native_te
```
bionemo-recipes/recipes/esm2_peft_te/example_nv_esm2_t6_8M_UR50D_peft_checkpoint/config.json-31-42 (1)
31-42: ⚠️ Potential issue | 🟡 Minor
Add a comment explaining the "L" label in label2id.
The label2id mapping includes "L": 2, which is not a standard DSSP secondary structure code. While this label is used consistently throughout the recipe (mapped to the coil class), it's undocumented what "L" represents. Add an inline comment clarifying whether "L" represents "Loop" or another designation, and explain why it's included in the label scheme alongside standard DSSP codes.
bionemo-recipes/recipes/esm2_peft_te/example_nv_esm2_t6_8M_UR50D_peft_checkpoint/README.md-1-200 (1)
1-200: ⚠️ Potential issue | 🟡 Minor
Replace placeholder fields before shipping the model card.
The model card is entirely "[More Information Needed]" placeholders. If this accompanies a published checkpoint, please fill in at least core metadata (license, training data, intended use, evaluation) or clearly mark it as a template to avoid shipping incomplete documentation.
bionemo-recipes/recipes/esm2_peft_te/README.md-101-107 (1)
101-107: ⚠️ Potential issue | 🟡 Minor
Use descriptive link text for the esm2_native_te README link.
Line 107 uses "here", which is not descriptive and triggers MD059. Suggest updating the anchor text.
Proposed fix
```diff
-For more information see [here](../esm2_native_te/README.md).
+For more information see the [esm2_native_te README](../esm2_native_te/README.md).
```
bionemo-recipes/models/esm2/src/esm/modeling_esm_te.py-684-707 (1)
684-707: ⚠️ Potential issue | 🟡 Minor
Add Google-style Args/Returns to the new conv head docstrings.
NVConvNetHead and NVEsmForConvTokenClassification docstrings are minimal and missing Args/Returns sections, and the __init__ docstring name is off. Please update them to Google-style to satisfy pydocstyle.
Example docstring updates
```diff
 class NVConvNetHead(nn.Module):
-    """Convolution based head for token classification."""
+    """Convolution-based head for token classification.
+
+    Args:
+        config (NVEsmConfig): Model configuration.
+    """
@@
-    def forward(self, features, **kwargs):
-        """Forward pass for the convolutional token classification head."""
+    def forward(self, features, **kwargs):
+        """Forward pass for the convolutional token classification head.
+
+        Args:
+            features (torch.Tensor): Input features of shape (batch, hidden, seq_len).
+            **kwargs: Unused keyword arguments.
+
+        Returns:
+            torch.Tensor: Logits of shape (batch, seq_len, num_labels).
+        """
@@
 class NVEsmForConvTokenClassification(NVEsmPreTrainedModel):
@@
-    def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+    def __init__(self, config):
+        """Initialize NVEsmForConvTokenClassification.
+
+        Args:
+            config (NVEsmConfig): Model configuration.
+        """
```
As per coding guidelines: Use Google-style docstrings following pydocstyle conventions.
bionemo-recipes/recipes/esm2_peft_te/example_8m_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor
Fix docstring mismatch and redundant init_weights() call.
Two issues in the constructor:
- Line 706: Docstring says "Initialize NVEsmForTokenClassification" but this is NVEsmForConvTokenClassification.
- Lines 713-714: Calling both init_weights() and post_init() is redundant; post_init() already invokes init_weights() internally (see HuggingFace PreTrainedModel.post_init()). Other classes in this file (e.g., NVEsmForTokenClassification at line 640) only call post_init().
Proposed fix
```diff
     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()
```
bionemo-recipes/recipes/esm2_accelerate_te/example_8m_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor
Fix docstring mismatch and redundant init_weights() call.
Same issues as in other esm_nv.py files:
- Docstring says "Initialize NVEsmForTokenClassification" instead of "Initialize NVEsmForConvTokenClassification".
- Redundant init_weights() call before post_init().
Proposed fix
```diff
     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()
```
bionemo-recipes/recipes/esm2_peft_te/example_nv_esm2_t6_8M_UR50D_peft_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor
Fix docstring mismatch and redundant init_weights() call.
Same issues as in esm2_peft_te/example_8m_checkpoint/esm_nv.py:
- Docstring says "Initialize NVEsmForTokenClassification" instead of "Initialize NVEsmForConvTokenClassification".
- Redundant init_weights() call before post_init().
Proposed fix
```diff
     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()
```
bionemo-recipes/recipes/esm2_native_te/example_8m_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor
Fix docstring mismatch and redundant init_weights() call.
Same issues as in other esm_nv.py files:
- Docstring says "Initialize NVEsmForTokenClassification" instead of "Initialize NVEsmForConvTokenClassification".
- Redundant init_weights() call before post_init().
Proposed fix
```diff
     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()
```
bionemo-recipes/recipes/esm2_peft_te/infer.py-104-106 (1)
104-106: ⚠️ Potential issue | 🟡 Minor
Handle CPU-only environments or allow a device override.
The script hardcodes CUDA at line 106, which crashes on CPU-only systems. Additionally, the _batched_inference() call (lines 112-117) doesn't pass the device parameter despite the function supporting it.
🛠️ Suggested device handling
```diff
     # Load PEFT adapters on top
     peft_model = PeftModel.from_pretrained(base_model, args.peft_model_config_dir)
-    peft_model = peft_model.to("cuda").eval()
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    peft_model = peft_model.to(device).eval()
@@
-    predictions, sequences_to_sample_mapping = _batched_inference(
-        peft_model,
-        tokenizer,
-        records,
-        **args.inference,
-    )
+    inference_kwargs = dict(args.inference)
+    inference_kwargs.setdefault("device", device)
+    predictions, sequences_to_sample_mapping = _batched_inference(
+        peft_model,
+        tokenizer,
+        records,
+        **inference_kwargs,
+    )
```
bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py-170-196 (1)
170-196: ⚠️ Potential issue | 🟡 Minor
Guard against empty validation dataloader.
If validation yields zero batches, val_steps stays 0 and the averaging will raise. Add a guard to handle empty validation splits.
💡 Suggested fix
```diff
-    avg_val_loss = val_loss_total / val_steps
-    avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0
+    if val_steps == 0:
+        avg_val_loss = 0.0
+        avg_val_acc = 0.0
+    else:
+        avg_val_loss = val_loss_total / val_steps
+        avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0
```
bionemo-recipes/recipes/esm2_peft_te/train_lora_ddp.py-171-197 (1)
171-197: ⚠️ Potential issue | 🟡 Minor
Guard against empty validation dataloader.
If the validation dataloader yields zero batches, val_steps remains 0 and this division will raise. Add a zero-batch guard.
💡 Suggested fix
```diff
-    avg_val_loss = val_loss_total / val_steps
-    avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0
+    if val_steps == 0:
+        avg_val_loss = 0.0
+        avg_val_acc = 0.0
+    else:
+        avg_val_loss = val_loss_total / val_steps
+        avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0
```
bionemo-recipes/recipes/esm2_peft_te/utils.py-106-123 (1)
106-123: ⚠️ Potential issue | 🟡 Minor
Validate CSV headers and handle empty files.
The current code fails on empty CSV files (line 115: "pdb_id" in reader.fieldnames raises TypeError when fieldnames is None), and a missing sequence column causes an unhandled KeyError on line 120. Add upfront validation to provide clear error messages.
Suggested fix
```diff
 def load_csv(path: Path) -> list[dict]:
     """Read input CSV file for inference.
@@
-    with open(path) as f:
-        reader = csv.DictReader(f)
-        has_pdb_id = "pdb_id" in reader.fieldnames
+    with open(path, newline="") as f:
+        reader = csv.DictReader(f)
+        if reader.fieldnames is None:
+            raise ValueError("CSV must include a header with a 'sequence' column.")
+        if "sequence" not in reader.fieldnames:
+            raise ValueError("CSV header must include a 'sequence' column.")
+        has_pdb_id = "pdb_id" in reader.fieldnames
```
bionemo-recipes/recipes/esm2_peft_te/tests/test_train_lora.py-16-18 (1)
16-18: ⚠️ Potential issue | 🟡 Minor
Guard test_sanity_ddp_thd with a CUDA availability check to prevent failures in CPU-only environments.
torch.cuda.get_device_capability() raises if CUDA is unavailable. Add the skip guard at the start of the function to avoid test failures in CPU-only runs.
💡 Suggested fix
```diff
 import torch
+import pytest
 from hydra import compose, initialize_config_dir
 from train_lora_ddp import main as main_ddp


 def test_sanity_ddp_thd(tmp_path, monkeypatch, recipe_path):
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for DDP THD sanity test")
     if torch.cuda.get_device_capability() == (12, 0):
         # TODO(BIONEMO-2840): On sm120, we need to set NVTE_FUSED_ATTN to 0 since TE will choose fused attn by default,
```
🧹 Nitpick comments (7)
bionemo-recipes/recipes/esm2_peft_te/perf_logger.py (2)
139-142: Avoid mutating the caller's outputs object.
Directly modifying outputs.logits with unsqueeze(0) mutates the object passed by the caller, which could cause unexpected side effects in the training loop if the outputs are used elsewhere after this call.
Proposed fix: use a local variable
```diff
     # Handle sequence packing for torchmetrics calculation.
+    logits_for_perplexity = outputs.logits
     if outputs.logits.dim() < 3:
-        outputs.logits = outputs.logits.unsqueeze(0)
+        logits_for_perplexity = outputs.logits.unsqueeze(0)

-    self.metrics["train/perplexity"].update(outputs.logits, batch["labels"])
+    self.metrics["train/perplexity"].update(logits_for_perplexity, batch["labels"])
```
153-166: Inconsistent rank checks for logging.
Line 153 uses is_main_process() for wandb logging, but line 165 uses local_rank == 0 for logger output. In multi-node distributed setups, these may not be equivalent (e.g., local_rank == 0 is true on every node, while is_main_process() is typically true only on global rank 0). Consider using consistent rank checks throughout.
Proposed fix
```diff
-    if self._dist_config.local_rank == 0:
+    if self._dist_config.is_main_process():
         logger.info(", ".join([f"{k.split('/')[1]}: {v:.3g}" for k, v in metrics.items()]))
```
bionemo-recipes/recipes/esm2_peft_te/tests/test_train_lora_two_gpus.py (1)
49-52: Add a Google-style docstring to the test.
Keeps docstring linting consistent even under relaxed test rules.
♻️ Suggested update
```diff
 @requires_multi_gpu
 def test_multi_gpu_train_te_ddp(tmp_path, recipe_path):
+    """Smoke-test multi-GPU DDP training for the recipe."""
     # Run 'accelerate launch train.py' as a subprocess
```
As per coding guidelines: Use Google-style docstrings following pydocstyle conventions.
bionemo-recipes/recipes/esm2_peft_te/train_lora_ddp.py (1)
111-112: Gate verbose prints to the main process.
These prints will fire on every rank and can spam logs; consider restricting them to the main process.
💡 Suggested fix
```diff
-    print("----- PEFT Model --------")
-    peft_model.print_trainable_parameters()
+    if dist_config.is_main_process():
+        print("----- PEFT Model --------")
+        peft_model.print_trainable_parameters()
@@
-    print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")
+    if dist_config.is_main_process():
+        print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")
```
Also applies to: 198-198
bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py (1)
110-111: Gate verbose prints to the main process.
These prints execute on every rank; restricting to the main process avoids noisy multi-rank output.
💡 Suggested fix
```diff
-    print("----- PEFT Model --------")
-    peft_model.print_trainable_parameters()
+    if dist_config.is_main_process():
+        print("----- PEFT Model --------")
+        peft_model.print_trainable_parameters()
@@
-    print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")
+    if dist_config.is_main_process():
+        print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")
```
Also applies to: 197-197
bionemo-recipes/recipes/esm2_peft_te/dataset.py (1)
31-44: Use Google-style docstring for create_dataloader.
The current one-line docstring doesn't meet the Google-style requirement. Add Args:/Returns: sections.
As per coding guidelines, ensure all Python files follow Google-style docstrings (pydocstyle convention).
💡 Suggested fix
```diff
-    """Create a dataloader for the secondary structure dataset."""
+    """Create dataloaders for the secondary structure dataset.
+
+    Args:
+        distributed_config: Distributed training configuration.
+        use_sequence_packing: Whether to enable sequence packing.
+        tokenizer_name: Tokenizer identifier.
+        micro_batch_size: Training micro-batch size.
+        val_micro_batch_size: Validation micro-batch size.
+        num_workers: DataLoader worker count.
+        max_seq_length: Maximum sequence length.
+        stride: Tokenizer stride for overflow.
+        seed: RNG seed for shuffling.
+        ss3_classification: Whether to use SS3 labels (else SS8).
+        load_dataset_kwargs: Keyword arguments for datasets.load_dataset.
+
+    Returns:
+        Tuple of (train_dataloader, val_dataloader, train_dataset_or_sampler).
+    """
```
55-61: Use Google-style docstrings for public helpers.
These public functions should include Args:/Returns: sections to meet the docstring standard. Consider updating all helper docstrings similarly.
As per coding guidelines, ensure all Python files follow Google-style docstrings (pydocstyle convention).
💡 Example for one function
```diff
 def compute_accuracy(preds, labels, ignore_index=-100) -> tuple[int, int]:
-    """Calculate the accuracy."""
+    """Calculate the accuracy.
+
+    Args:
+        preds: Model logits or scores per token.
+        labels: Ground-truth label tensor.
+        ignore_index: Label value to exclude from accuracy.
+
+    Returns:
+        Tuple of (correct_count, total_count).
+    """
```
Also applies to: 64-82, 84-104, 106-135, 138-161, 163-170
Description
This PR adds a recipe to perform LoRA fine-tuning of the ESM2 model. It provides support for DDP and sequence packing. It also contains a file, infer.py, that shows how to run inference from a fine-tuned checkpoint. The PR contains the datasets that were used to train the models; eventually we can convert them into HF datasets. It does not add support for FSDP or FP8 yet.
Usage
```
cd bionemo-recipes/recipes/esm2_peft_te/
python train_lora_ddp.py
```
For more information on usage see the README.
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
The PR contents will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g., pull-request/123).
An /ok to test comment on the pull request is required to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from CodeRabbit, comment on the pull request with one of the supported review commands.
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores