This repository contains the Jupyter notebooks for the book “Data and Text Processing for Health and Life Sciences”.
Note: The notebooks include a fix for the new ChEBI 2.0 web interface, which currently lacks detailed cross-references on entry pages.
- notebooks/ – Jupyter notebooks.
- data/ – Files with data created and used in the notebooks.
- scripts/ – Scripts created in the notebooks.
| # | Notebook | Overview |
|---|---|---|
| 01 | unix shell | Unix basics: ls, pwd, head, cat, piping. Setup for ChEBI retrieval. |
| 02 | data retrieval | curl EBI APIs. Download UniProt xrefs (CSV/XML). getdata.sh. |
| 03 | data extraction | grep filter (HUMAN/RAT/MOUSE), cut columns. getproteins.sh. |
| 04 | task repetition | Loops, xargs, parallel for batch processing. |
| 05 | XML processing | xmllint XPath on UniProt XML. Extract PubMed IDs. |
| 06 | text retrieval | RDF publications (UniProt/NCBI). Extract titles/abstracts. |
| 07 | text processing | Tokenization, sentence splitting, normalization. |
| 08 | semantic processing | Ontology lexicons (DOID), NER + linking with MER tool. |
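
As a rough illustration of the filtering and repetition steps covered in notebooks 03 and 04, the sketch below runs `grep`, `cut`, and `xargs` on a small inline sample. The file name `sample_xrefs.csv`, its column layout, and the entry identifiers are hypothetical placeholders, not data from the notebooks, and the actual `getproteins.sh` script may work differently.

```shell
# Hypothetical sample data; the notebooks retrieve real data from UniProt instead.
cat > sample_xrefs.csv <<'EOF'
Entry,Entry name,Protein names
X00001,ABCD_HUMAN,Example protein A
X00002,EFGH_BOVIN,Example protein B
X00003,IJKL_MOUSE,Example protein C
X00004,MNOP_RAT,Example protein D
EOF

# Keep only human, rat, and mouse entries (matched on the entry-name suffix),
# then extract the first comma-separated column (the entry identifier).
proteins=$(grep -E '_(HUMAN|RAT|MOUSE),' sample_xrefs.csv | cut -d ',' -f 1)

# Repeat a task once per identifier; echo stands in for a real command.
printf '%s\n' "$proteins" | xargs -I {} echo "would process {}"
```

In a real run, the `echo` placeholder would be replaced by whatever command should be applied to each entry.
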
01: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_01_unix_shell.ipynb
02: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_02_data_retrieval.ipynb
03: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_03_data_extraction.ipynb
04: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_04_task_repetition.ipynb
05: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_05_xml_processing.ipynb
06: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_06_text_retrieval.ipynb
07: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_07_text_processing.ipynb
08: https://githubtocolab.com/lasigeBioTM/data-text-processing-notebooks/blob/main/notebooks/data_text_processing_notebooks_08_semantic_processing.ipynb
- Go to Google Colab
- File -> Open notebook -> GitHub tab
- Paste repo: https://github.com/lasigeBioTM/data-text-processing-notebooks
- Select notebook from notebooks/ folder -> Open
git clone https://github.com/lasigeBioTM/data-text-processing-notebooks
cd data-text-processing-notebooks
jupyter notebook notebooks/

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

