Skip to content
Matt Pinder edited this page Nov 9, 2020 · 8 revisions

In this exercise you'll learn how to make a dynamic bash script that takes command line options. You will also learn how to add comments to your code so that you and others will understand what it does.

Create a file called fastq_stats.sh and make it executable. Open it in your favorite text editor and add a shebang. The next thing to add is a description of your program, what it does and how it works. It could look like this:

#!/bin/bash

# This script extracts some basic information 
# from fastq sequences and prints it to screen.
#
# Usage: ./fastq_stats.sh </path/to/fastq-file>

Every line that starts with the character "#" will be interpreted as a comment by your shell. Next, add some functionality to the script by including these lines.

# Take the first five characters from row one in 
# the inputfile and store it in the variable ID.
ID=`head -1 $1 | cut -c 1-5`

Remember from the comment # Usage: fastq_stats.sh <fastq file> that you use the script by typing its name followed by the path to a fastq file. Your shell will remember this syntax and stor the information in two variables named "0" and "1". Consequently, the values of these variables will be "$0" and "$1" as we discussed in the exercise on your environment. To do some computation on our input file we can hence refer to it as "$1". Sweet!

We are using another funky technique here, namely the pipe ("|") character. On the last row we use the command head -1 to extract the first line of the fastq file (we know that this line will contain a sequence header), but instead of printing this line to screen, which is the default behaviour of head, we "pipe" it to the next program cut -c 1-5, that here extracts the first five characters from every line it receives (only one in this case). The whole thing is enclosed in ` characters, which means we can store the output of this computation to the variable "ID".

Next we'll put this variable to use and count how many times it occurs in the input file:

# Count the number of sequences in the input file.
COUNT=`grep -c $ID $1` 

In this exercise we make the assumption that all sequences in the file are of the same length (but don't make this assumption in real life) and therefore get away with only measuring the length of one of them.

# Calculate sequence length.
LENGTH=`head -4 $1 | tail -1 | wc -m`

Can you see what this command is doing? You can find out how the two commands head and tail work by typing man head and man tail. Finally we print our findings to screen.

# Print some information to screen.
echo "Filename: " $1
echo "Number of sequences: " $COUNT
echo "Sequence length: " $LENGTH

You can use FR32_ATCACG_before.fastq or FR82_GGCTAC_after.fastq from More bash exercises with fastq files as an input file.

[mtop@albiorix ~]$ ./fastq_stats.sh FR32_ATCACG_before.fastq 
Filename:  FR32_ATCACG_before.fastq
Number of sequences:  200000
Sequence length:  52
[mtop@albiorix ~]$ ./fastq_stats.sh FR82_GGCTAC_after.fastq
Filename:  FR82_GGCTAC_after.fastq
Number of sequences:  200000
Sequence length:  52
[mtop@albiorix ~]$ 

Try to add some more functionality to the script by adding more commands and using more environmental variables.

Clone this wiki locally