A Python pipeline to calculate sequence statistics for nucleotide or amino acid FASTA files.
Designed for bioinformaticians and students who need quick insights into sequences.
For each sequence in a FASTA file, this pipeline calculates:
- Letter statistics
  - Counts of each distinct letter (nucleotide or amino acid)
  - Longest consecutive run of each letter
- Triplet statistics
  - Counts of all overlapping consecutive triplets (groups of three letters)
Output is written in a human-readable text format for downstream analysis.
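The letter statistics can be sketched in a few lines. This is a minimal illustration, not the code in fasta_stats.py: the helper name `letter_stats` is hypothetical, and letters are counted exactly as given (no case normalisation is assumed).

```python
from collections import Counter
from itertools import groupby


def letter_stats(sequence):
    """Return {letter: (count, longest_run)} for one sequence.

    Hypothetical helper for illustration; not taken from fasta_stats.py.
    """
    counts = Counter(sequence)
    longest = {}
    # groupby yields one group per run of identical consecutive letters.
    for letter, run in groupby(sequence):
        run_length = sum(1 for _ in run)
        longest[letter] = max(longest.get(letter, 0), run_length)
    return {letter: (counts[letter], longest[letter]) for letter in counts}


# seq2 from the example input
for letter, (count, run) in letter_stats("GGCATGCCATTAG").items():
    print(letter, count, run)
```

For seq2 this prints `G 4 2`, `C 3 2`, `A 3 1`, and `T 3 2`, one letter per line, in order of first appearance.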
- One or more sequences in FASTA format.
- Each sequence must start with a > followed by the sequence name.
- Sequences can be nucleotide or amino acid letters.
Example input:
>seq1
ATGCGATTTGCGC
>seq2
GGCATGCCATTAG
- Text file containing, for each sequence:
  - Sequence name
  - Each letter: Letter Count Longest_Run
  - Each triplet: Triplet Count
Example output:
seq1
A 2 1
T 4 3
G 4 1
C 3 1
ATG 1
TGC 2
GCG 2
...
This script processes FASTA files and outputs sequence statistics in a text file.
Run the script from your terminal:
python fasta_stats.py -in input.fasta -out output.txt
- -in → Path to the input FASTA file containing one or more sequences.
- -out → Path for the output file where statistics will be saved.
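For readers adapting the script, here is one way the two flags could be declared with the standard-library `argparse` module. This is a sketch under assumptions: the function name `parse_args` and the attribute names `in_path` and `out_path` are hypothetical, and the actual script may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the -in and -out flags (hypothetical sketch, not the script's own code)."""
    parser = argparse.ArgumentParser(
        description="Calculate sequence statistics for a FASTA file."
    )
    # "in" is a Python keyword, so the values are stored under explicit names.
    parser.add_argument("-in", dest="in_path", required=True,
                        help="Path to the input FASTA file")
    parser.add_argument("-out", dest="out_path", required=True,
                        help="Path for the output statistics file")
    return parser.parse_args(argv)


args = parse_args(["-in", "input.fasta", "-out", "output.txt"])
print(args.in_path, args.out_path)
```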
- Make sure each sequence starts with a > followed by the sequence name.
- Sequences can span multiple lines; empty lines are ignored.
- Only letters A–Z (uppercase or lowercase) are considered.
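The input rules above can be sketched as a small parser. The function name `read_fasta` is hypothetical and the script's own parser may differ; in particular, this sketch keeps the original letter case, which is an assumption.

```python
import io
import string


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle.

    Hypothetical helper for illustration; not taken from fasta_stats.py.
    """
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # empty lines are ignored
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].strip(), []
        elif name is not None:
            # Keep only A-Z / a-z; digits, gaps and '*' are dropped.
            parts.append("".join(ch for ch in line if ch in string.ascii_letters))
    if name is not None:
        yield name, "".join(parts)


example = io.StringIO(">seq1\nATGCGA\n\nTTTGCGC\n>seq2\nGGCATG-CCATTAG*\n")
print(list(read_fasta(example)))
```

Here seq1 is split across two lines with a blank line in between, and seq2 contains a gap and a stop symbol; both come out as the clean sequences from the example input.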
- The script reads the FASTA file and analyses each sequence individually.
- For each sequence, it calculates the counts of each letter, the longest consecutive run of each letter, and the counts of all overlapping consecutive triplets.
The output text file contains a clear report per sequence.
Example snippet:
seq1
A 2 1
T 4 3
G 4 1
C 3 1
ATG 1
TGC 2
GCG 2
...
- Can be used for plotting base composition, identifying motifs, or other downstream analysis.
- Works for nucleotide (DNA/RNA) or amino acid sequences.
- Can handle multiple sequences in a single FASTA file.
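As a starting point for downstream analysis, the report can be read back into dictionaries. This sketch assumes the layout shown in the example output: whitespace-separated fields, a one-field line for the sequence name, three fields for a letter row, and two fields for a triplet row. The name `parse_report` is hypothetical, and sequence names containing spaces would need different handling.

```python
def parse_report(lines):
    """Parse report lines into {name: {"letters": {...}, "triplets": {...}}}.

    Hypothetical helper; assumes whitespace-separated fields as in the example output.
    """
    report = {}
    current = None
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and current is not None:
            letter, count, run = fields
            current["letters"][letter] = (int(count), int(run))
        elif len(fields) == 2 and current is not None:
            triplet, count = fields
            current["triplets"][triplet] = int(count)
        elif len(fields) == 1:
            current = report.setdefault(fields[0], {"letters": {}, "triplets": {}})
    return report


text = "seq2\nG 4 2\nC 3 2\nA 3 1\nT 3 2\nGGC 1\nGCA 1\n"
parsed = parse_report(text.splitlines())
letters = parsed["seq2"]["letters"]
total = sum(count for count, _ in letters.values())
for letter, (count, _) in letters.items():
    print(letter, round(count / total, 3))  # fraction of the sequence per letter
```

The per-letter fractions printed at the end are the kind of values one might feed into a base-composition plot.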
- Python 3.6+ (tested on Python 3.13.3)
- No external libraries required
- Triplets are overlapping (sliding window of 3 letters)
- Only letters A–Z are considered; other characters are ignored
- Works for both nucleotide and amino acid sequences
- Output format is compatible with downstream parsing for plotting or further analysis
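To make the sliding window concrete: a sequence of n letters yields n - 2 overlapping triplets, one starting at each position. A minimal sketch, using the hypothetical helper name `triplet_counts` (not taken from fasta_stats.py):

```python
from collections import Counter


def triplet_counts(sequence):
    """Count every overlapping window of three letters in `sequence`."""
    # Window i covers letters i, i+1 and i+2; sequences shorter than 3 yield nothing.
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))


# seq2 from the example input has 13 letters, so 11 triplets
counts = triplet_counts("GGCATGCCATTAG")
print(counts["CAT"], sum(counts.values()))  # prints: 2 11
```

Because the windows overlap, CAT is counted twice in seq2 even though the two occurrences sit in different reading frames.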
This project is licensed under the MIT License – see LICENSE for details.
Created by Martina Debnath (GitHub profile: marti-dotcom).
Feel free to use, adapt, and contribute! <3