Merged
1 change: 1 addition & 0 deletions README.md
@@ -117,3 +117,4 @@ Scikit-Mol has been developed as a community effort with contributions from peop
- [@enricogandini](https://github.com/enricogandini)
- [@mikemhenry](https://github.com/mikemhenry)
- [@c-feldmann](https://github.com/c-feldmann)
- Mieczyslaw Torchala [@mieczyslaw](https://github.com/mieczyslaw)
1 change: 1 addition & 0 deletions docs/index.md
@@ -117,3 +117,4 @@ Scikit-Mol has been developed as a community effort with contributions from peop
- [@enricogandini](https://github.com/enricogandini)
- [@mikemhenry](https://github.com/mikemhenry)
- [@c-feldmann](https://github.com/c-feldmann)
- Mieczyslaw Torchala [@mieczyslaw](https://github.com/mieczyslaw)
315 changes: 315 additions & 0 deletions docs/notebooks/05_smiles_sanitization.ipynb
@@ -0,0 +1,315 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9b787560",
"metadata": {},
"source": [
"# SMILES sanitization\n",
"Sometimes we are faced with datasets containing SMILES that RDKit refuses to sanitize. This can be due to human entry errors, or to differences between RDKit's stricter sanitization and other toolkits' parser implementations. For example, RDKit will not accept an uncharged tetravalent nitrogen, whereas other toolkits may simply build the graph anyway, either disregarding the valence rules or guessing that the nitrogen should carry a charge, even though the real error could just as well be one methyl group too many."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "612aa974",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:27.545695Z",
"iopub.status.busy": "2024-11-24T09:27:27.545293Z",
"iopub.status.idle": "2024-11-24T09:27:28.079174Z",
"shell.execute_reply": "2024-11-24T09:27:28.078490Z"
},
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from rdkit.Chem import PandasTools\n",
"\n",
"csv_file = \"../tests/data/SLC6A4_active_excapedb_subset.csv\" # Hmm, maybe better to download directly\n",
"data = pd.read_csv(csv_file)"
]
},
{
"cell_type": "markdown",
"id": "0f957a69",
"metadata": {},
"source": [
"This example dataset contains only sanitizable SMILES, so for demonstration purposes we will corrupt one of them."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b09cfd6b",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:28.082222Z",
"iopub.status.busy": "2024-11-24T09:27:28.081921Z",
"iopub.status.idle": "2024-11-24T09:27:28.086003Z",
"shell.execute_reply": "2024-11-24T09:27:28.085450Z"
}
},
"outputs": [],
"source": [
"data.loc[1, \"SMILES\"] = \"CN(C)(C)(C)\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e20fb5cc",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:28.088449Z",
"iopub.status.busy": "2024-11-24T09:27:28.088211Z",
"iopub.status.idle": "2024-11-24T09:27:28.130818Z",
"shell.execute_reply": "2024-11-24T09:27:28.130102Z"
},
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset contains 1 unparsable mols\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[10:27:28] Explicit valence for atom # 1 N, 4, is greater than permitted\n"
]
}
],
"source": [
"\n",
"PandasTools.AddMoleculeColumnToFrame(data, smilesCol=\"SMILES\")\n",
"print(f\"Dataset contains {data.ROMol.isna().sum()} unparsable mols\")"
]
},
{
"cell_type": "markdown",
"id": "f8dccd93",
"metadata": {},
"source": [
"If we used these SMILES in a scikit-learn pipeline, we would get an error, so we need to check and clean the dataset first. The `CheckSmilesSanitazion` class can help us with that."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3dbd50b3",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:28.133745Z",
"iopub.status.busy": "2024-11-24T09:27:28.133507Z",
"iopub.status.idle": "2024-11-24T09:27:28.508377Z",
"shell.execute_reply": "2024-11-24T09:27:28.507130Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error in parsing 1 SMILES. Unparsable SMILES can be found in self.errors\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[10:27:28] Explicit valence for atom # 1 N, 4, is greater than permitted\n"
]
}
],
"source": [
"from scikit_mol.utilities import CheckSmilesSanitazion\n",
"\n",
"smileschecker = CheckSmilesSanitazion()\n",
"\n",
"smiles_list_valid, y_valid, smiles_errors, y_errors = smileschecker.sanitize(\n",
" list(data.SMILES), list(data.pXC50)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c888d7da",
"metadata": {},
"source": [
"Now `smiles_list_valid` should contain only valid SMILES, and `y_valid` the matching filtered target values. The errors are returned, but they are also accessible through the `.errors` property after the call to `.sanitize()`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5af5ea3d",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:28.511261Z",
"iopub.status.busy": "2024-11-24T09:27:28.510945Z",
"iopub.status.idle": "2024-11-24T09:27:28.522024Z",
"shell.execute_reply": "2024-11-24T09:27:28.521232Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SMILES</th>\n",
" <th>y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CN(C)(C)(C)</td>\n",
" <td>7.18046</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SMILES y\n",
"0 CN(C)(C)(C) 7.18046"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smileschecker.errors"
]
},
{
"cell_type": "markdown",
"id": "c2ce2677",
"metadata": {},
"source": [
"The checker can also be used on X alone, without y values."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "84db07cc",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:28.524982Z",
"iopub.status.busy": "2024-11-24T09:27:28.524717Z",
"iopub.status.idle": "2024-11-24T09:27:28.569119Z",
"shell.execute_reply": "2024-11-24T09:27:28.568473Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error in parsing 1 SMILES. Unparsable SMILES can be found in self.errors\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[10:27:28] Explicit valence for atom # 1 N, 4, is greater than permitted\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SMILES</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CN(C)(C)(C)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SMILES\n",
"0 CN(C)(C)(C)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smiles_list_valid, X_errors = smileschecker.sanitize(list(data.SMILES))\n",
"smileschecker.errors"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.4 ('rdkit')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
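
The notebook above filters a SMILES list and its target values together, returning the valid pairs and the rejected pairs separately. As a rough illustration of that pattern only (this is not scikit-mol's implementation), the sketch below partitions paired values according to whether a parser returns `None`. The names `split_by_parse` and `toy_parse` are hypothetical, and `toy_parse` is a stand-in that rejects only the corrupted entry from the notebook; in practice one would pass a real parser such as RDKit's `Chem.MolFromSmiles`, which returns `None` for molecules it cannot sanitize.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def split_by_parse(
    smiles: Sequence[str],
    y: Sequence[float],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[str], List[float], List[str], List[float]]:
    """Partition paired (SMILES, y) values by whether `parse` returns None.

    Hypothetical helper for illustration; not part of scikit-mol.
    """
    if len(smiles) != len(y):
        raise ValueError("smiles and y must have the same length")

    valid_smiles, valid_y, error_smiles, error_y = [], [], [], []
    for smi, target in zip(smiles, y):
        if parse(smi) is None:
            # Keep the rejected pair so it can be inspected later.
            error_smiles.append(smi)
            error_y.append(target)
        else:
            valid_smiles.append(smi)
            valid_y.append(target)
    return valid_smiles, valid_y, error_smiles, error_y


def toy_parse(smi: str) -> Optional[str]:
    # Stand-in for a real SMILES parser (an assumption for this sketch):
    # it rejects only the deliberately corrupted entry from the notebook.
    return None if smi == "CN(C)(C)(C)" else smi


valid_smiles, valid_y, error_smiles, error_y = split_by_parse(
    ["CCO", "CN(C)(C)(C)", "c1ccccc1"], [6.5, 7.18046, 5.9], toy_parse
)
print(valid_smiles, valid_y)
print(error_smiles, error_y)
```

Keeping the rejected SMILES paired with their target values makes it possible to review or repair the entries later instead of silently dropping them.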