Genome assembly 2 & a first evaluation

I continued with assembling the genome of F566K, this time using ABySS. The k-mer size was arbitrarily set to 30 just to test the sbatch script. Cha and McKBird (2006) describes strategies for optimizing the k-mer size prior to assemblies of paired-end Illumina reads, which would increase the robustness of the assembly [1]. The strategy included using the tools Velvetadvisor and KmerGenie. But more research is needed before I can try to optimize a k-mer value using those tools. Until then I will just assemble using a range of k-mer values and then evaluate the assemblies with the quality strategy setup by Hanna. The following script was used to run the assembly (k = 30) with ABySS on Uppmax:

#!/bin/bash

# Run the script like this:
# sbatch <OPTIONS> Assembly_sbatch.sh
# Where the OPTIONS are all of:
# -A g2017020 -t 2:00:00 -p node -n 8 -o run_evaluation.out

module add bioinfo-tools
module add abyss

cd /home/heoscar/Project

# k = size of the kmer
# name = output file prefix
# in = paths to the forward/reverse trimmed reads

abyss-pe k=30 name=abyss_trial_1 in='/home/heoscar/Project/F566K_GAGATTCC-ATAGAGGC_L001_R1_001.fastq F566K_GAGATTCC-ATAGAGGC_L001_R2_001.fastq'

The first assembly trial with ABySS yielded the following output files (using the same input files as for the previous A5-MiSeq run ). The output files were transferred to a new directory at Uppmax, /home/heoscar/Project/Abyss_k30.

Output files: abyss_trial_1-1.dot, abyss_trial_1-1.fa, abyss_trial_1-1.path, abyss_trial_1-2.dot, abyss_trial_1-2.dot1, abyss_trial_1-2.fa, abyss_trial_1-2.path, abyss_trial_1-3.dist, abyss_trial_1-3.dot, abyss_trial_1-3.fa, abyss_trial_1-3.fa.fai, abyss_trial_1-3.hist, abyss_trial_1-4.dot, abyss_trial_1-4.fa, abyss_trial_1-4.fa.fai, abyss_trial_1-4.path1, abyss_trial_1-4.path2, abyss_trial_1-4.path3, abyss_trial_1-5.dot, abyss_trial_1-5.fa, abyss_trial_1-5.path, abyss_trial_1-6.dist.dot, abyss_trial_1-6.dot, abyss_trial_1-6.fa, abyss_trial_1-6.hist, abyss_trial_1-6.path, abyss_trial_1-6.path.dot, abyss_trial_1-7.dot, abyss_trial_1-7.fa, abyss_trial_1-7.path, abyss_trial_1-8.dot, abyss_trial_1-8.fa, abyss_trial_1-bubbles.fa, abyss_trial_1-contigs.dot, abyss_trial_1-contigs.fa, abyss_trial_1-indel.fa, abyss_trial_1-scaffolds.dot, abyss_trial_1-scaffolds.fa, abyss_trial_1-stats, abyss_trial_1-stats.csv, abyss_trial_1-stats.html, abyss_trial_1-stats.md, abyss_trial_1-stats.tab, abyss_trial_1-unitigs.fa, coverage.hist.

Looking at the abyss_trial_1-stats file using the command cat gives some information regarding the quality of the assembly including N50-value and E-size. Hanna has done a more thorough analysis of the quality.

Skärmklipp 2017-12-01 11.35.24
Table 1. Content of the abyss_trial_1-stats file with quality values for the three files specified in the “name” column. These files correspond to the unitigs, contigs, and scaffolds of the assembly.

Note that this assembly using ABySS did not include a quality control of the raw sequencing data (input files received from Anders). However, since I now see that the assembly script works as intended, I will implement a quality control step prior to the assembly using Sickle [2]. The subsequent ABySS trials will thus most probably have a higher quality.  Initially, I will try with a minimum quality of 30 and a minimum read length of 45 bp. The Q30 will exclude 24.12 % of the reads (75.88 %Q30 according to the Illumina sequencing report), and the minimum read length should be reasonable considering the sequencing read length of 300 bp. The following script will be used for further assemblies with ABySS (with Sickle quality control implemented). Note that the directory ‘Abyss_k50’ will be changed according to the k-mer value chosen for each run.

#!/bin/bash

# Run the script like this:
# sbatch <OPTIONS> Assembly_sbatch.sh
# Where the OPTIONS are all of:
# -A g2017020 -t 2:00:00 -p node -n 8 -o run_evaluation.out

module add bioinfo-tools
module add abyss
module add sickle

cd /home/heoscar/Project/Abyss_k50

# Note: Input files (raw data) and the script should be in the new Abyss_kx directory when running. This to avoid having to move all the output files to the right directory.

# SICKLE QUALITY CONTROL
# pe = paired-end
# -f = input file containing the forward reads
# -r = input file containing the reverse reads
# -o = trimmed forward reads
# -p = trimmed reverse reads
# -s = trimmed singles
# -q = minimum quality
# -l = minimum read length
# -t = type of read

sickle pe -f /home/heoscar/Project/Abyss_k50/F566K_GAGATTCC-ATAGAGGC_L001_R1_001.fastq -r /home/heoscar/Project/Abyss_k50/F566K_GAGATTCC-ATAGAGGC_L001_R2_001.fastq -t sanger -o F566K_1.fastq -p F566K_2.fastq -s F566K_s.fastq -q 30 -l 45

# ABYSS ASSEMBLY
# k = size of the kmer
# name = output file prefix
# in = paths to the forward/reverse trimmed reads
# se = path to the singles file

abyss-pe k=30 name=abyss_trial_1 in='/home/heoscar/Project/Abyss_k50/F566K_1.fastq /home/heoscar/Project/Abyss_k50/F566K_2.fastq' se='/home/heoscar/Project/Abyss_k50/F566K_s.fastq'

During the quality assessment of the two current assemblies (A5-MiSeq trial 1 and ABySS trial 1) that Hanna made, she discovered that the assembly with A5-MiSeq was not complete when reading the run_evaluation.out file. The error message ‘slurmstepd: error: *** JOB 11597115 ON m143 CANCELLED AT 2017-11-30T10:50:41 DUE TO TIME LIMIT ***’ indicates that the run was cancelled prematurely. The output files received from that run also does not correspond to the expected output files [3].

I made a new assembly using A5-MiSeq (trial 2) with increased requested running time (from 2h to 10h) using the same script as the one described on the 29th of November with only the output file-name changed.

Liv and Hanna continued with the results from the ABySS assembly to find suitable tools for visualizing assemblies.

References

[1] Cha, S. and McKBird, D. (2016). Optimizing k-mer size using a variant grid search to enhance de novo genome assembly. Bioinformation, 12(2), pp.36-40.

[2] Joshi NA, Fass JN. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33) [Software]. Available at https://github.com/najoshi/sickle.

[3] Sourceforge.net. (2017). ngopt / Wiki / A5PipelineREADME. [online] Available at: https://sourceforge.net/p/ngopt/wiki/A5PipelineREADME/ [Accessed 1 Dec. 2017].

Genome assembly 2 & a first evaluation

Leave a comment