
marti-dotcom/FASTA-Statistics


FASTA Sequence Statistics Pipeline

A Python pipeline to calculate sequence statistics for nucleotide or amino acid FASTA files.
Designed for bioinformaticians and students who need quick insights into sequences.


Features

For each sequence in a FASTA file, this pipeline calculates:

  1. Letter statistics

    • Counts of each distinct letter (nucleotide or amino acid)
    • Longest consecutive run of each letter
  2. Triplet statistics

    • Counts of all overlapping consecutive triplets (groups of three letters)

Output is written in a human-readable text format for downstream analysis.
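The letter statistics listed above can be sketched in a few lines. This is a minimal illustration, not the code in fasta_stats.py: the function name `letter_stats` and the returned dictionary layout are assumptions made for the example. It uses the seq2 sequence from the example input further down.

```python
from collections import Counter
from itertools import groupby


def letter_stats(sequence):
    """Return {letter: (count, longest_run)} in order of first appearance.

    Hypothetical helper for illustration; the real script may differ.
    """
    counts = Counter(sequence)
    longest = {}
    # groupby yields one group per stretch of identical consecutive letters
    for letter, run in groupby(sequence):
        run_length = sum(1 for _ in run)
        longest[letter] = max(longest.get(letter, 0), run_length)
    return {letter: (counts[letter], longest[letter]) for letter in counts}


print(letter_stats("GGCATGCCATTAG"))
# {'G': (4, 2), 'C': (3, 2), 'A': (3, 1), 'T': (3, 2)}
```

Whether the script folds lowercase letters into uppercase before counting is not stated here, so the sketch counts letters exactly as given.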


Input

  • One or more sequences in FASTA format.
  • Each sequence must start with a header line: a > followed by the sequence name.
  • Sequences can be nucleotide or amino acid letters.

Example input:

>seq1
ATGCGATTTGCGC
>seq2
GGCATGCCATTAG

Output

  • Text file containing, for each sequence:
    • Sequence name
    • Each letter: Letter Count Longest_Run
    • Each triplet: Triplet Count

Example output:

seq1
A 2 1
T 4 3
G 4 1
C 3 1
ATG 1
TGC 2
GCG 2
...
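For readers who want to produce the same layout from their own code, here is a rough sketch of a formatter for the report described above (name line, then `Letter Count Longest_Run` lines, then `Triplet Count` lines). The function `format_report` and its arguments are hypothetical, and the values are a toy example for the sequence AAC rather than output from the script.

```python
def format_report(name, letters, triplets):
    """Build the text report for one sequence.

    letters:  {letter: (count, longest_run)}
    triplets: {triplet: count}
    Both argument shapes are assumptions made for this sketch.
    """
    lines = [name]
    lines += [f"{letter} {count} {run}" for letter, (count, run) in letters.items()]
    lines += [f"{triplet} {count}" for triplet, count in triplets.items()]
    return "\n".join(lines) + "\n"


# Toy values for the sequence "AAC"
print(format_report("toy", {"A": (2, 2), "C": (1, 1)}, {"AAC": 1}), end="")
# toy
# A 2 2
# C 1 1
# AAC 1
```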

Usage

This script processes a FASTA file and writes the sequence statistics to a text file.

1. Command line execution

Run the script from your terminal:

python fasta_stats.py -in input.fasta -out output.txt

2. Arguments:

  • -in → Path to the input FASTA file containing one or more sequences.

  • -out → Path for the output file where statistics will be saved.
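A command line with these two options could be declared with the standard `argparse` module roughly as follows. This is a sketch under assumptions: the attribute names `in_path` and `out_path` and the function `parse_args` are invented for the example, and the script itself may parse its arguments differently. An explicit `dest` is used because `in` is a Python keyword and cannot be read as `args.in`.

```python
import argparse


def parse_args(argv=None):
    """Parse -in / -out; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Letter and triplet statistics for FASTA sequences."
    )
    parser.add_argument("-in", dest="in_path", required=True,
                        help="path to the input FASTA file")
    parser.add_argument("-out", dest="out_path", required=True,
                        help="path for the output text file")
    return parser.parse_args(argv)


args = parse_args(["-in", "input.fasta", "-out", "output.txt"])
print(args.in_path, args.out_path)
# input.fasta output.txt
```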

3. Step-by-step explanation

1. Prepare your FASTA file

  • Make sure each sequence starts with a > followed by the sequence name.
  • Sequences can span multiple lines; empty lines are ignored.
  • Only letters A–Z (uppercase or lowercase) are considered.
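The input rules above (a header line starting with >, sequences spanning several lines, empty lines skipped, only A–Z letters kept) can be illustrated with a small reader. It is a sketch, not the parser used by fasta_stats.py: the name `read_fasta` is hypothetical, and taking the whole header text after > as the sequence name is an assumption.

```python
import io
import string


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # empty lines are ignored
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the full header text after '>' is the name
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Keep only ASCII letters; digits, gaps and '*' are dropped
            chunks.append("".join(ch for ch in line if ch in string.ascii_letters))
    if name is not None:
        yield name, "".join(chunks)


example = io.StringIO(">seq1\nATGCG\n\nATTTGCGC\n>seq2\nGGCATGCCATTAG\n")
print(list(read_fasta(example)))
# [('seq1', 'ATGCGATTTGCGC'), ('seq2', 'GGCATGCCATTAG')]
```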

2. Run the script

  • Python reads the FASTA file and analyses each sequence individually.

  • For each sequence, it calculates the counts of each letter, the longest consecutive run of each letter, and the counts of all overlapping consecutive triplets.

3. Check the output

The output text file contains a clear report per sequence.

Example snippet:

seq1
A 2 1
T 4 3
G 4 1
C 3 1
ATG 1
TGC 2
GCG 2
...

4. Tips

  • Can be used for plotting base composition, identifying motifs, or other downstream analysis.

  • Works for nucleotide (DNA/RNA) or amino acid sequences.

  • Can handle multiple sequences in a single FASTA file.
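As one possible starting point for the downstream uses mentioned in the tips, the report can be read back into dictionaries and turned into letter fractions ready for plotting. The sketch assumes that sequence names contain no whitespace, so that a line with one field is a name, three fields a letter line, and two fields a triplet line. `parse_report` and `composition` are hypothetical helpers, and the report text is a toy example for the sequence AACC.

```python
def parse_report(text):
    """Parse a report into {name: {"letters": {...}, "triplets": {...}}}."""
    report, current = {}, None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1:  # sequence name (assumed to have no spaces)
            current = report.setdefault(fields[0], {"letters": {}, "triplets": {}})
        elif len(fields) == 3 and current is not None:  # Letter Count Longest_Run
            current["letters"][fields[0]] = (int(fields[1]), int(fields[2]))
        elif len(fields) == 2 and current is not None:  # Triplet Count
            current["triplets"][fields[0]] = int(fields[1])
    return report


def composition(letters):
    """Fraction of the sequence made up by each letter."""
    total = sum(count for count, _ in letters.values())
    return {letter: count / total for letter, (count, _) in letters.items()}


toy = "toy\nA 2 2\nC 2 2\nAAC 1\nACC 1\n"
parsed = parse_report(toy)
print(composition(parsed["toy"]["letters"]))
# {'A': 0.5, 'C': 0.5}
```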

Requirements

  • Python 3.6+ (tested on Python 3.13.3)

  • No external libraries required

Notes for Bioinformaticians

  • Triplets are overlapping (sliding window of 3 letters)

  • Only letters A–Z are considered; other characters are ignored

  • Works for both nucleotide and amino acid sequences

  • Output format is compatible with downstream parsing for plotting or further analysis
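To make the sliding-window behaviour of the triplet counts concrete, the following sketch counts every window of three consecutive letters, moving one position at a time. The function name `triplet_counts` is an assumption for the example; the sequence is seq2 from the example input.

```python
from collections import Counter


def triplet_counts(sequence):
    """Count overlapping triplets with a sliding window of width 3."""
    # A sequence of length n has n - 2 windows; shorter inputs give none
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))


counts = triplet_counts("GGCATGCCATTAG")
print(counts["CAT"], sum(counts.values()))
# 2 11
```

Because the windows overlap, a 13-letter sequence yields 11 triplets rather than the 4 complete codons a non-overlapping reading frame would give.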

License

This project is licensed under the MIT License – see LICENSE for details.

Contact

Created by Martina Debnath. GitHub profile: marti-dotcom

Feel free to use, adapt, and contribute! <3
