FASTA Sequence Statistics Pipeline

A Python pipeline to calculate sequence statistics for nucleotide or amino acid FASTA files.
Designed for bioinformaticians and students who need quick insights into sequences.

Features

For each sequence in a FASTA file, this pipeline calculates:

Letter statistics
- Counts of each distinct letter (nucleotide or amino acid)
- Longest consecutive run of each letter
Triplet statistics
- Counts of all overlapping consecutive triplets (groups of three letters)

Output is written in a human-readable text format for downstream analysis.
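The letter statistics listed above can be sketched in a few lines. This is a minimal illustration, not the code in fasta_stats.py: the function name letter_stats and the returned dictionary layout are assumptions made for this example.

```python
from collections import Counter
from itertools import groupby


def letter_stats(sequence):
    """Return {letter: (count, longest_run)} for one sequence.

    Hypothetical helper for illustration; the real script may
    organise this differently.
    """
    counts = Counter(sequence)
    longest = {}
    # groupby yields one group per maximal run of identical letters
    for letter, run in groupby(sequence):
        run_length = sum(1 for _ in run)
        longest[letter] = max(longest.get(letter, 0), run_length)
    return {letter: (counts[letter], longest[letter]) for letter in counts}


# seq2 from the example input
print(letter_stats("GGCATGCCATTAG"))
# {'G': (4, 2), 'C': (3, 2), 'A': (3, 1), 'T': (3, 2)}
```

Using groupby avoids manual index bookkeeping: each group is exactly one consecutive run, so the longest run per letter is the largest group seen for that letter.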

Input

One or more sequences in FASTA format.
Each sequence must start with a header line: a > followed by the sequence name.
Sequences can be nucleotide or amino acid letters.

Example input:

>seq1
ATGCGATTTGCGC
>seq2
GGCATGCCATTAG

Output

Text file containing, for each sequence:
- Sequence name
- One line per letter, in the form: Letter Count Longest_Run
- One line per triplet, in the form: Triplet Count

Example output:

seq1
A 2 1
T 4 3
G 4 1
C 3 1
ATG 1
TGC 2
GCG 2
...
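For downstream work, a report in this layout can be read back into Python. The sketch below is an assumption-heavy illustration: it assumes sequence names contain no whitespace, that letter lines have exactly three fields and triplet lines exactly two, and the function name parse_report is hypothetical.

```python
def parse_report(lines):
    """Parse report lines into {name: {"letters": {...}, "triplets": {...}}}.

    Assumes names contain no spaces, so field count alone tells
    a name line (1 field) from letter (3) and triplet (2) lines.
    """
    report, current = {}, None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 3 and current is not None:
            letter, count, run = fields
            current["letters"][letter] = (int(count), int(run))
        elif len(fields) == 2 and current is not None:
            triplet, count = fields
            current["triplets"][triplet] = int(count)
        else:
            # anything else starts a new sequence record
            current = report.setdefault(line.strip(), {"letters": {}, "triplets": {}})
    return report


# Hypothetical report for a sequence "AAC" named "demo"
demo = ["demo", "A 2 2", "C 1 1", "AAC 1"]
print(parse_report(demo))
# {'demo': {'letters': {'A': (2, 2), 'C': (1, 1)}, 'triplets': {'AAC': 1}}}
```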

Usage

This script processes FASTA files and outputs sequence statistics in a text file.

1. Command line execution

Run the script from your terminal:

python fasta_stats.py -in input.fasta -out output.txt

2. Arguments:

-in → Path to the input FASTA file containing one or more sequences.
-out → Path for the output file where statistics will be saved.
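One way to accept these two arguments is with argparse from the standard library. This is a sketch only: the destination names in_path and out_path, the help texts, and marking both options as required are assumptions, not details taken from fasta_stats.py.

```python
import argparse


def parse_args(argv=None):
    """Parse -in and -out (hypothetical re-creation of the CLI)."""
    parser = argparse.ArgumentParser(description="FASTA sequence statistics")
    # "in" is a Python keyword, so store the values under explicit names
    parser.add_argument("-in", dest="in_path", required=True,
                        help="path to the input FASTA file")
    parser.add_argument("-out", dest="out_path", required=True,
                        help="path for the output statistics file")
    return parser.parse_args(argv)


args = parse_args(["-in", "input.fasta", "-out", "output.txt"])
print(args.in_path, args.out_path)
# input.fasta output.txt
```

Passing an explicit argument list, as in the last lines, makes the parser easy to try out without touching the real command line.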

3. Step-by-step explanation

1. Prepare your FASTA file

Make sure each sequence starts with a > followed by the sequence name.
Sequences can span multiple lines; empty lines are ignored.
Only letters A–Z (uppercase or lowercase) are considered.
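The rules above can be expressed as a small parser. This is a minimal sketch under stated assumptions: the function name read_fasta is hypothetical, the whole header after > is taken as the name, and letters are normalised to uppercase (the README does not say whether case is preserved).

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # empty lines are ignored
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].strip(), []
        elif name is not None:
            # keep only A-Z letters; uppercasing is an assumption
            parts.append("".join(c for c in line.upper() if "A" <= c <= "Z"))
    if name is not None:
        yield name, "".join(parts)


example = [">seq1", "ATGCGAT", "", "ttgcgc", ">seq2", "GGCATGCC-ATTAG"]
for name, seq in read_fasta(example):
    print(name, seq)
# seq1 ATGCGATTTGCGC
# seq2 GGCATGCCATTAG
```

Because the function accepts any iterable of lines, an open file handle can be passed directly in place of the list.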

2. Run the script

The script reads the FASTA file and analyses each sequence individually.
For each sequence, it calculates the count of each letter, the longest consecutive run of each letter, and the counts of all overlapping consecutive triplets.

3. Check the output

The output text file contains a clear report per sequence.

Example snippet:

seq1
A 2 1
T 4 3
G 4 1
C 3 1
ATG 1
TGC 2
GCG 2
...

4. Tips

The statistics can be used for plotting base composition, identifying motifs, or other downstream analysis.
The script works for nucleotide (DNA/RNA) or amino acid sequences.
It can handle multiple sequences in a single FASTA file.

Requirements

Python 3.6+ (tested on Python 3.13.3)
No external libraries required

Notes for Bioinformaticians

Triplets are overlapping (sliding window of 3 letters)
Only letters A–Z are considered; other characters are ignored
Works for both nucleotide and amino acid sequences
Output format is compatible with downstream parsing for plotting or further analysis
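To make the sliding window concrete, here is a short sketch of overlapping triplet counting. The function name triplet_counts is an assumption for illustration and is not taken from fasta_stats.py.

```python
from collections import Counter


def triplet_counts(sequence):
    """Count every overlapping window of three letters.

    A sequence of length n has n - 2 windows; sequences shorter
    than three letters have none.
    """
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))


# seq2 from the example input: 13 letters, so 11 windows
counts = triplet_counts("GGCATGCCATTAG")
print(counts["CAT"], len(counts), sum(counts.values()))
# 2 10 11
```

The window advances one letter at a time, so the counts are not codon counts in a fixed reading frame: each letter (except those at the ends) contributes to three triplets.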

License

This project is licensed under the MIT License – see LICENSE for details.

Contact

Created by Martina Debnath
GitHub profile: marti-dotcom

Feel free to use, adapt, and contribute! <3
