Using Additional Paired End Reads to Improve an Assembly

Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser. It can also assemble long-read-only sets (PacBio or Nanopore) where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a short-read-first hybrid assembly.

2022 update

Unicycler was initially made in 2016, back when long reads could be sparse and very noisy. For example, our early Oxford Nanopore sequencing runs might generate only 15× read depth for a single bacterial isolate, and most of the reads had a lot of errors. So Unicycler was designed to use low-depth and low-accuracy long reads to scaffold a short-read assembly graph to completion, an approach I call short-read-first hybrid assembly. Assuming the short-read assembly graph is in good shape, Unicycler does this quite well!

However, things have changed in the last six years. Nanopore sequencing yield is now much higher, making >100× depth easy to obtain, even on multiplexed runs. Read accuracy has also improved and continues to get better each year. High-depth and high-accuracy long reads make long-read-first hybrid assembly (long-read assembly followed by short-read polishing) a viable approach that's often preferable to Unicycler. I have developed Trycycler and Polypolish in the pursuit of ideal long-read-first assemblies.

Unicycler is not completely out-of-date, as it is still (in my opinion) the best tool for short-read-first hybrid assembly of bacterial genomes. But I think it should only be used for hybrid assembly when long-read-first is not an option – i.e. when long-read depth is low. I also think that Unicycler is good for short-read-only bacterial genomes, as it produces cleaner assembly graphs than SPAdes alone. So while Unicycler doesn't get a lot of my time and attention these days, I don't yet consider it to be abandonware.

For some up-to-date bacterial genome assembly tips, check out these parts of Trycycler's wiki:

Should I use Unicycler or Trycycler to assemble my bacterial genome?
Guide to bacterial genome assembly

Introduction

As input, Unicycler takes one of the following:

Illumina reads from a bacterial isolate (ideally paired-end, but unpaired works too)
A set of long reads (either PacBio or Nanopore) from a bacterial isolate
Illumina reads and long reads from the same isolate (best case)

Reasons to use Unicycler:

It circularises replicons without the need for a separate tool like Circlator.
It handles plasmid-rich genomes.
It can use long reads of any depth and quality in hybrid assembly. 20× or more may be required to complete a genome, but Unicycler can make nearly-complete genomes with far fewer long reads.
It produces an assembly graph in addition to a contigs FASTA file, viewable in Bandage.
It filters out low-depth contigs, giving clean assemblies even when the read set has low-level contamination.
It has low misassembly rates.
It can cope with highly repetitive genomes, such as Shigella.
It's easy to use: runs with just one command and usually doesn't require tinkering with parameters.

Reasons to not use Unicycler:

You're assembling a eukaryotic genome or a metagenome (Unicycler is designed exclusively for bacterial isolates).
Your Illumina reads and long reads are from different isolates (Unicycler struggles with sample heterogeneity).
You're impatient (Unicycler is thorough but not especially fast).

Requirements

Linux or macOS
Python 3.4 or later
C++ compiler with C++14 support:
- GCC 4.9.1 or later
- Clang 3.5 or later
- ICC also works (though I don't know the minimum required version number)
setuptools (only required for installation of Unicycler)
For short-read or hybrid assembly:
- SPAdes v3.14.0 or later (spades.py)
For long-read or hybrid assembly:
- Racon (racon)
For rotating circular contigs:
- BLAST+ (makeblastdb and tblastn)

Unicycler expects external tools to be available in $PATH. If they aren't, you can specify their location using Unicycler options (e.g. --spades_path).
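Before starting a long run, it can be handy to check that the external tools are actually discoverable. The following is a minimal sketch using Python's standard `shutil.which`; the helper name `find_missing_tools` is made up for this example and is not part of Unicycler, and the executable names are simply those from the requirements list above.

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED_TOOLS = ["spades.py", "racon", "makeblastdb", "tblastn"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

Any tool reported as missing can either be added to $PATH or passed to Unicycler with the matching path option.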

Bandage isn't required to run Unicycler, but it is very helpful for manually investigating assemblies (the graph images in this README were made with Bandage).

Installation

Install from source

These instructions install the most up-to-date version of Unicycler:

git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install

Notes:

If the last command complains about permissions, you may need to run it with sudo.
If you want a particular version of Unicycler, download the source from the releases page instead of cloning from GitHub.
Install just for your user: python3 setup.py install --user
- If you get a strange 'can't combine user with prefix' error, read this.
Install to a specific location: python3 setup.py install --prefix=$HOME/.local
Install with pip (local copy): pip3 install path/to/Unicycler
Install with pip (from GitHub): pip3 install git+https://github.com/rrwick/Unicycler.git
Install with specific Makefile options: python3 setup.py install --makeargs "CXX=icpc"

Build and run without installation

This approach compiles Unicycler code, but doesn't copy executables anywhere:

git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
make

Now, instead of running unicycler, you use path/to/unicycler-runner.py.

Quick usage

Illumina-only assembly:
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir

Long-read-only assembly:
unicycler -l long_reads.fastq.gz -o output_dir

Hybrid assembly:
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads.fastq.gz -o output_dir

If you don't have any reads of your own, take a look in the sample_data directory for links to some small read sets.
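If you launch Unicycler from a script, the three invocations above differ only in which read files are supplied. Here is a small sketch that builds the argument list; `build_unicycler_command` is a hypothetical helper written for this example, and it only uses the `-1`, `-2`, `-l` and `-o` options shown above.

```python
def build_unicycler_command(out_dir, short1=None, short2=None, long_reads=None):
    """Return an argument list for an Illumina-only, long-read-only or hybrid run."""
    if (short1 is None) != (short2 is None):
        raise ValueError("paired short reads need both -1 and -2")
    if short1 is None and long_reads is None:
        raise ValueError("give short reads, long reads, or both")
    command = ["unicycler"]
    if short1 is not None:
        command += ["-1", short1, "-2", short2]
    if long_reads is not None:
        command += ["-l", long_reads]
    command += ["-o", out_dir]
    return command


if __name__ == "__main__":
    hybrid = build_unicycler_command(
        "output_dir",
        short1="short_reads_1.fastq.gz",
        short2="short_reads_2.fastq.gz",
        long_reads="long_reads.fastq.gz",
    )
    print(" ".join(hybrid))
```

The returned list can be handed to `subprocess.run(command, check=True)` when you actually want to start the assembly.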

Background

Assembly graphs

To understand what Unicycler is doing, you need to know about assembly graphs. For a thorough introduction, I'd suggest this tutorial or the Velvet paper. But in short, an assembly graph is a data structure where contigs aren't disconnected sequences but can have connections to each other:

Just contigs:               Assembly graph:

TCGAAACTTGACGCGAGTCGC                             CTTGTTTA
TGCTACTGCTTGATGATGCGG                            /        \
TGTCCATT                    TCGAAACTTGACGCGAGTCGC          TGCTACTGCTTGATGATGCGG
CTTGTTTA                                         \        /
                                                  TGTCCATT

Most assemblers use graphs internally to produce their assemblies, but users often ignore the graph in favour of the conceptually simpler FASTA file of contigs. When a genome assembly is 100% complete, we have one contig per chromosome/plasmid and there's no real need for the graph. But most short-read assemblies are not complete, and a graph can describe an incomplete assembly much better than contigs alone.
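To make the idea concrete, here is a minimal sketch of the toy graph above as plain Python data. The segment labels A to D and the left-to-right direction of the links are assumptions made for this example; the sequences are those in the diagram. The contigs alone are four unrelated strings, whereas the links record that there are exactly two ways through the middle.

```python
# Segment sequences from the toy diagram; the labels are hypothetical.
SEGMENTS = {
    "A": "TCGAAACTTGACGCGAGTCGC",
    "B": "CTTGTTTA",
    "C": "TGTCCATT",
    "D": "TGCTACTGCTTGATGATGCGG",
}

# Directed links (assumed to run left to right in the diagram).
LINKS = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]


def paths_between(links, start, end):
    """List every simple path from `start` to `end` by depth-first search."""
    successors = {}
    for source, target in links:
        successors.setdefault(source, []).append(target)

    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for following in successors.get(node, []):
            if following not in path:  # avoid walking in circles
                walk(following, path + [following])

    walk(start, [start])
    return paths


if __name__ == "__main__":
    for path in paths_between(LINKS, "A", "D"):
        print(" -> ".join(path))
```

A FASTA file of the same four contigs would contain `SEGMENTS` but nothing equivalent to `LINKS`, which is the information the graph adds.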

Limitations of short reads

The main reason we can't get a complete assembly from short reads is that DNA usually contains repeats – the same sequence occurring two or more times in the genome. When a repeat is longer than the reads (or for paired-end sequencing, longer than the insert size), it forms a single contig in the assembly graph with multiple connections in and multiple connections out.

Here is what happens to a simple bacterial assembly graph as you add repeats to the genome:

Repeats in graph

As repeats are added, the graph becomes increasingly tangled (and real assembly graphs get a lot more tangled than that).

To complete a bacterial genome assembly (i.e. find the one correct sequence for each chromosome/plasmid), we need to resolve the repeats. This means finding which way into a repeat matches up with which way out. Short reads don't have enough information for this but long reads do.
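As a rough illustration of repeat resolution, the sketch below flags segments with more than one way in and more than one way out, then uses reads that span the repeat to pair each entrance with an exit. The graph, the segment names and both function names are invented for this example; real long reads would first need to be aligned to the graph to obtain such paths.

```python
from collections import Counter


def repeat_like_segments(links):
    """Return segments that have several incoming and several outgoing links."""
    incoming = Counter(target for _, target in links)
    outgoing = Counter(source for source, _ in links)
    return sorted(
        segment for segment in incoming
        if incoming[segment] > 1 and outgoing[segment] > 1
    )


def resolve_with_spanning_reads(repeat, read_paths):
    """Pair each way into `repeat` with the way out seen in a spanning read."""
    pairing = {}
    for path in read_paths:
        for before, middle, after in zip(path, path[1:], path[2:]):
            if middle == repeat:
                pairing[before] = after
    return pairing


if __name__ == "__main__":
    # Hypothetical graph: repeat R sits between A/C on one side and B/D on the other.
    links = [("A", "R"), ("R", "B"), ("C", "R"), ("R", "D")]
    print(repeat_like_segments(links))
    # Hypothetical long reads, each long enough to cross the whole repeat.
    reads = [["A", "R", "B"], ["C", "R", "D"]]
    print(resolve_with_spanning_reads("R", reads))
```

A short read that lands entirely inside R would contribute nothing to `pairing`, which is the limitation described above.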

SPAdes graphs

Assembly graphs come in many different varieties, but we are particularly interested in the kind produced by SPAdes, because that is what Unicycler uses.

SPAdes graphs are made by performing a de Bruijn graph assembly with a range of different k-mer sizes, from small to large (see the SPAdes paper). Each assembly builds on the previous one, which allows SPAdes to get the advantages of both small k-mer assemblies (a more connected graph) and large k-mer assemblies (better ability to resolve repeats). Two contigs in a SPAdes graph that connect will overlap by their k-mer size (more info on the Bandage wiki page).

After producing the graph, SPAdes can perform further repeat resolution by using paired-end information. Since two reads in a pair are close to each other in the original DNA, SPAdes can use this to trace paths in the graph to form larger contigs (see their paper on ExSPAnder). However, the SPAdes contigs with repeat resolution do not come in graph form – they are only available in a FASTA file.

Method: Illumina-only assembly

When assembling just Illumina reads, Unicycler functions mainly as a SPAdes optimiser. It offers a few benefits over using SPAdes alone:

Tries a wide range of k-mer sizes and automatically selects the best.
Filters out low-depth parts of the assembly to remove contamination.
Applies SPAdes repeat resolution to the graph (as opposed to disconnected contigs in a FASTA file).
Rejects low-confidence repeat resolution to reduce the rate of misassembly.
Trims off graph overlaps so sequences aren't repeated where contigs join.

More information on the Illumina-only assembly process is described in the steps below.

SPAdes assembly

Unicycler uses SPAdes to assemble the Illumina reads into an assembly graph. It tries assemblies at a wide range of k-mer sizes, evaluating the graph at each one. It chooses the graph which best minimises both contig count and dead end count. If the Illumina reads are good, it produces an assembly graph with long contigs but few to no dead ends (more info here). Since a typical bacterial genome has no dead ends (the sequences are circular) an ideal assembly graph won't either.

A raw SPAdes graph can also contain some 'junk' sequences due to sequencer artefacts or contamination, so Unicycler performs some graph cleaning to remove these. Therefore, small amounts of contamination in the Illumina reads should not be a problem.
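The k-mer selection can be pictured as scoring each candidate graph and keeping the best. The sketch below is only an illustration: the cost function (contig count multiplied by the square of dead ends plus one) is an assumption chosen so that both quantities are penalised, and the candidate numbers are made up. It should not be read as Unicycler's exact scoring.

```python
def graph_cost(candidate):
    """Illustrative cost: fewer contigs and fewer dead ends are both better.

    The exact formula is an assumption made for this sketch.
    """
    return candidate["contigs"] * (candidate["dead_ends"] + 1) ** 2


def choose_best_graph(candidates):
    """Return the candidate graph with the lowest cost."""
    return min(candidates, key=graph_cost)


if __name__ == "__main__":
    # Hypothetical summary statistics for graphs built at four k-mer sizes.
    candidates = [
        {"k": 27, "contigs": 300, "dead_ends": 4},
        {"k": 47, "contigs": 150, "dead_ends": 0},
        {"k": 77, "contigs": 120, "dead_ends": 2},
        {"k": 127, "contigs": 200, "dead_ends": 0},
    ]
    best = choose_best_graph(candidates)
    print("best k-mer size:", best["k"])
```

In this made-up data the k=77 graph has the fewest contigs, but its dead ends make it a worse choice than the k=47 graph.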

Multiplicity

To scaffold the graph, Unicycler must distinguish between single copy contigs and repeats. It does this with a greedy algorithm that uses both read depth and graph connectivity:

Multiplicity assignment

This process does not assume that all single copy contigs have the same read depth, which allows it to identify single copy contigs from plasmids as well as the chromosome. After it has determined multiplicity, Unicycler chooses a set of 'anchor' contigs. These are sufficiently-long single-copy contigs suitable for bridging in later steps.
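To give a feel for how depth and connectivity can be combined, here is a heavily simplified sketch of a single propagation rule: if every contig leading into an unassigned contig already has a multiplicity, and the unassigned contig's depth is close to the sum of their depths, it inherits the sum of their multiplicities. The rule, the tolerance value and all names are assumptions made for illustration; Unicycler's own algorithm is more involved than this.

```python
def propagate_multiplicity(depths, inputs, seeds, tolerance=0.3):
    """Greedily assign multiplicities by merging depths of incoming contigs.

    depths: contig name -> read depth
    inputs: contig name -> list of contigs that lead into it
    seeds:  contig name -> multiplicity already trusted (e.g. single copy)
    """
    assigned = dict(seeds)
    changed = True
    while changed:
        changed = False
        for contig, sources in inputs.items():
            if contig in assigned or not sources:
                continue
            if not all(source in assigned for source in sources):
                continue
            expected_depth = sum(depths[source] for source in sources)
            if abs(depths[contig] - expected_depth) <= tolerance * expected_depth:
                assigned[contig] = sum(assigned[source] for source in sources)
                changed = True
    return assigned


if __name__ == "__main__":
    # Hypothetical depths: A and B are single copy and both lead into R.
    depths = {"A": 10.0, "B": 11.0, "R": 21.5, "X": 50.0}
    inputs = {"R": ["A", "B"], "X": ["R"]}
    print(propagate_multiplicity(depths, inputs, {"A": 1, "B": 1}))
```

Because the comparison is made against neighbouring contigs rather than one global depth, the same rule works for a high-copy plasmid and for the chromosome. Contig X stays unassigned here because its depth does not fit its only input.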

Overlap removal

To reduce redundancy and allow for neatly circularised contigs, Unicycler removes all overlap in the graphs:

Before:                                       After:

           GACGCGTTGACAAGGAAAT                           TGACAAGGAAAT
                  /                                             /
TTGACTACCCAGACGCGT            TTGACTACCCAGACGCGT
                  \                                             \
           GACGCGTCCTCTCATTCTA                           CCTCTCATTCTA
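The trimming in the diagram can be sketched in a few lines. This example assumes, as the diagram shows, that the shared bases are removed from the start of the downstream contig; `trim_overlap` is a hypothetical helper and the overlap length of 7 is read off the example sequences.

```python
def trim_overlap(upstream, downstream, overlap):
    """Remove `overlap` shared bases from the start of the downstream contig.

    Raises ValueError if the two contigs do not actually share that overlap.
    """
    if overlap <= 0 or upstream[-overlap:] != downstream[:overlap]:
        raise ValueError("contigs do not overlap by %d bases" % overlap)
    return upstream, downstream[overlap:]


if __name__ == "__main__":
    upstream = "TTGACTACCCAGACGCGT"
    for downstream in ("GACGCGTTGACAAGGAAAT", "GACGCGTCCTCTCATTCTA"):
        _, trimmed = trim_overlap(upstream, downstream, 7)
        print(downstream, "->", trimmed)
```

After trimming, joining the upstream contig to either downstream contig gives a sequence in which `GACGCGT` appears only once at the junction.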

Bridging

At this point, the assembly graph does not contain the SPAdes repeat resolution. To apply this to the graph, Unicycler builds bridges between single copy contigs using the path information in the SPAdes assembly.

Short read bridging

Bridges are given a quality score, most importantly based on the length of the bridge compared to the length of the paired-end insert size, so bridges which span a long repeat are given a low score. Since paired-end sequencing cannot resolve repeats longer than the insert size, bridges which attempt to span long repeats cannot be trusted. This selectivity helps to reduce the number of misassemblies.

Method: long-read-only assembly

When assembling just long reads, Unicycler uses a miniasm+Racon pipeline. It offers a couple of advantages over using other long-read-only assemblers:

Multiple rounds of Racon polishing give a good final sequence accuracy.
Circular replicons (like most bacterial chromosomes and plasmids) assemble into circular replicons with no start-end overlap.

More information on the long-read-only assembly process is described in the steps below.

miniasm assembly

Unicycler uses minimap and miniasm to assemble the long reads in essentially the same manner as described in the miniasm README. This produces an uncorrected assembly which is made directly of pieces of reads – the assembly error rate will be similar to the read error rate.

The version of miniasm that comes with Unicycler is slightly modified in a couple of ways. The first modification is to help circular replicons assemble into circular string graphs. The other modification only applies to hybrid assembly, so I'll come back to that!

Racon polishing

After miniasm assembly, Unicycler carries out multiple rounds of polishing with Racon to improve the sequence accuracy. It will polish until the assembly stops improving, as measured by the agreement between the reads and the assembly. Circular replicons are 'rotated' (have their starting position shifted) between rounds of polishing to ensure that no part of the sequence is left unpolished.

Method: hybrid assembly

Hybrid assembly (using both Illumina reads and long reads) is where Unicycler really shines. Like with the Illumina-only pipeline described above, Unicycler will produce an Illumina assembly graph. It then uses long reads to build bridges, which often allows it to resolve all repeats in the genome, resulting in a complete genome assembly.

In hybrid assembly, Unicycler carries out all the steps in the Illumina-only pipeline, plus the additional steps below:

Long-read plus contig assembly

This step uses miniasm and Racon, and is very much like the long-read-only assembly method described above. Here however, the assembly is not just on long reads but a mixture of long reads and anchor contigs from the Illumina-only assembly. Since these anchor contigs can often be much longer than long reads (sometimes hundreds of kbp), they can significantly help the assembly. This takes advantage of the other modification to miniasm which was teased above. In Unicycler's miniasm, contigs and long reads are treated slightly differently in the string graph manipulations to better perform this step.

After the assembly is finished, Unicycler finds anchor contigs in the assembled sequence and uses the intervening sequences to create bridges:

assembled sequence:                 TATGGTCTCGCATGTTAATTCTACTCCCGAACTTGGCCCATCCCCGGCTAGGCTGGGCACTAGACGGTGGAT
anchor contigs:                         GTCTCGCATGTTAA    ACTCCCGAACTTGGCCCATCCCCGGC       GGCACTAGACGGTGG
intervening sequences for bridges:                    TTCT                          TAGGCTG
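The example above can be reproduced with a short sketch that locates each anchor contig in the assembled sequence and collects whatever lies between neighbouring anchors. Exact string matching is an assumption made to keep the example small (a real pipeline would need alignment that tolerates errors), and the function and anchor names are hypothetical.

```python
def intervening_sequences(assembled, anchors):
    """Return (left anchor, sequence in between, right anchor) for adjacent anchors.

    anchors: anchor name -> anchor sequence. Anchors not found are skipped.
    """
    placed = []
    for name, sequence in anchors.items():
        start = assembled.find(sequence)  # exact match only, for simplicity
        if start != -1:
            placed.append((start, start + len(sequence), name))
    placed.sort()

    bridges = []
    for (_, left_end, left), (right_start, _, right) in zip(placed, placed[1:]):
        if right_start >= left_end:  # ignore anchors that overlap each other
            bridges.append((left, assembled[left_end:right_start], right))
    return bridges


if __name__ == "__main__":
    assembled = ("TATGGTCTCGCATGTTAATTCTACTCCCGAACTTGGCCCATCCCCGGC"
                 "TAGGCTGGGCACTAGACGGTGGAT")
    anchors = {
        "anchor_1": "GTCTCGCATGTTAA",
        "anchor_2": "ACTCCCGAACTTGGCCCATCCCCGGC",
        "anchor_3": "GGCACTAGACGGTGG",
    }
    for left, between, right in intervening_sequences(assembled, anchors):
        print(left, between, right)
```

Each returned tuple names the two anchors a bridge would connect and the sequence that the bridge would carry.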

Direct long-read bridging

Unicycler also attempts to make long-read bridges directly by semi-globally aligning the long reads to the assembly graph. For each pair of single copy contigs which are linked by read alignments, Unicycler uses the read consensus sequence to find a connecting path and creates a bridge.

Long-read bridging

This step and the previous step are somewhat redundant, as both use long reads to build bridges between short-read contigs. They are both included because they have different strengths. The previous approach can tolerate low long-read depth but requires a good short-read assembly graph (i.e. few dead ends). This step requires decent long-read depth but can tolerate poor short-read assembly graphs. By using the two strategies together, Unicycler can successfully handle many types of input.

Bridge application

At this point of the pipeline there can be many bridges, some of which may conflict. Unicycler therefore assigns a quality score to each based on all available evidence (e.g. read alignment quality, graph path match, read depth consistency). Bridges are then applied in order of decreasing quality so whenever there is a conflict, the most supported bridge is used. A minimum quality threshold prevents the application of low evidence bridges (see Conservative, normal and bold for more information).

Application of bridges
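The ordering described in this section amounts to a greedy selection, sketched below. A bridge is modelled here as a dictionary joining the end of one anchor contig to the start of another, and two bridges are taken to conflict when they want the same contig end; both the data model and the function name are assumptions made for the example. The default threshold of 10 is the normal-mode value.

```python
def apply_bridges(bridges, min_quality=10.0):
    """Greedily apply bridges from best to worst, skipping conflicts.

    Each bridge is a dict with 'from', 'to' and 'quality' keys.
    """
    used_ends = set()    # contigs whose end is already bridged
    used_starts = set()  # contigs whose start is already bridged
    applied = []
    for bridge in sorted(bridges, key=lambda b: b["quality"], reverse=True):
        if bridge["quality"] < min_quality:
            break  # everything after this point is below the threshold
        if bridge["from"] in used_ends or bridge["to"] in used_starts:
            continue  # a better-supported bridge already claimed this end
        used_ends.add(bridge["from"])
        used_starts.add(bridge["to"])
        applied.append(bridge)
    return applied


if __name__ == "__main__":
    # Hypothetical bridges; A->B and A->C compete for the end of contig A.
    bridges = [
        {"from": "A", "to": "C", "quality": 20.0},
        {"from": "A", "to": "B", "quality": 60.0},
        {"from": "C", "to": "D", "quality": 5.0},
        {"from": "B", "to": "C", "quality": 30.0},
    ]
    for bridge in apply_bridges(bridges):
        print(bridge["from"], "->", bridge["to"], bridge["quality"])
```

Lowering `min_quality` lets weaker bridges through, which is the trade-off the three modes expose.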

Finalisation

If the above steps have resulted in any simple, circular sequences, then Unicycler will try to rotate/flip them to begin at a consistent starting gene. By default this is dnaA or repA, but users can specify their own with the --start_genes option.
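Rotating and flipping a circular sequence is easy to show on a toy example. The sketch below uses an exact nucleotide match as a stand-in for the BLAST search (the requirements list tblastn for this step), so it is a simplification; the function names and sequences are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def rotate_to_start(circular, gene):
    """Rotate (and flip if needed) a circular sequence so it begins with `gene`.

    Returns the sequence unchanged if the gene is found on neither strand.
    """
    for strand in (circular, reverse_complement(circular)):
        # Searching the doubled sequence also finds a gene that spans the
        # arbitrary start/end point of the circular sequence.
        position = (strand + strand).find(gene)
        if position != -1:
            return strand[position:] + strand[:position]
    return circular


if __name__ == "__main__":
    print(rotate_to_start("AAAGGGCCCATG", "ATGAAA"))  # gene spans the junction
    print(rotate_to_start("GGGCCCTTTCAT", "ATGAAA"))  # gene on the other strand
```

Both calls describe the same circular molecule, and both return the same rotated sequence, which is the point of choosing a consistent start.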

Conservative, normal and bold

Unicycler can be run in three modes: conservative, normal (the default) and bold, set with the --mode option. Conservative mode is least likely to produce a complete assembly but has a very low risk of misassembly. Bold mode is most likely to produce a complete assembly but carries greater risk of misassembly. Normal mode is intermediate regarding both completeness and misassembly risk.

If the structural accuracy of your assembly is paramount to your research, conservative mode is recommended. If you want a completed genome, even if it contains a mistake or two, then use bold mode.

The specific differences between the three modes are as follows:

Mode	Invocation	Short read bridges	Bridge quality threshold	Contig merging
conservative	`--mode conservative`	not used	high (25)	contigs are only merged with bridges
normal	`--mode normal` (or nothing)	used	medium (10)	contigs are merged with bridges and when their multiplicity is 1
bold	`--mode bold`	used	low (1)	contigs are merged wherever possible

Conservative, normal and bold

In the above example, the conservative assembly is incomplete because some bridges fell below the quality threshold and were not applied. Its contigs, however, are very reliable. Normal mode nearly gave a complete assembly, but a couple of unmerged contigs remain. Bold mode completed the assembly, but since lower confidence regions were bridged and merged, there is a larger risk of error.
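The per-mode settings can be written down as a small lookup. The thresholds and the short-read-bridge column below come from the table in this section; the bridge records, their `kind` field and the function name are hypothetical, and contig merging is left out of the sketch.

```python
# Values taken from the mode table; everything else here is illustrative.
MODE_SETTINGS = {
    "conservative": {"short_read_bridges": False, "min_bridge_qual": 25.0},
    "normal": {"short_read_bridges": True, "min_bridge_qual": 10.0},
    "bold": {"short_read_bridges": True, "min_bridge_qual": 1.0},
}


def bridges_allowed(bridges, mode="normal", min_bridge_qual=None):
    """Filter bridges by mode; an explicit threshold overrides the mode default."""
    settings = MODE_SETTINGS[mode]
    threshold = settings["min_bridge_qual"]
    if min_bridge_qual is not None:
        threshold = min_bridge_qual
    kept = []
    for bridge in bridges:
        if bridge["kind"] == "short_read" and not settings["short_read_bridges"]:
            continue
        if bridge["quality"] >= threshold:
            kept.append(bridge)
    return kept


if __name__ == "__main__":
    # Hypothetical bridges with made-up quality scores.
    bridges = [
        {"kind": "long_read", "quality": 30.0},
        {"kind": "short_read", "quality": 40.0},
        {"kind": "long_read", "quality": 12.0},
        {"kind": "long_read", "quality": 3.0},
    ]
    for mode in ("conservative", "normal", "bold"):
        print(mode, len(bridges_allowed(bridges, mode)))
```

With these made-up scores, each step from conservative to bold lets more bridges through, mirroring the behaviour described above.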

Options and usage

Standard options

Run unicycler --help to view the program's most commonly used options:

usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAIRED] [-l LONG] -o OUT
                 [--verbosity VERBOSITY] [--min_fasta_length MIN_FASTA_LENGTH] [--keep KEEP]
                 [-t THREADS] [--mode {conservative,normal,bold}] [--linear_seqs LINEAR_SEQS]

       __
       \ \___
        \ ___\
        //
   ____//      _    _         _                     _
 //_  //\\    | |  | |       |_|                   | |
//  \//  \\   | |  | | _ __   _   ___  _   _   ___ | |  ___  _ __
||  (O)  ||   | |  | || '_ \ | | / __|| | | | / __|| | / _ \| '__|
\\    \_ //   | |__| || | | || || (__ | |_| || (__ | ||  __/| |
 \\_____//     \____/ |_| |_||_| \___| \__, | \___||_| \___||_|
                                        __/ |
                                       |___/

Unicycler: an assembly pipeline for bacterial genomes

Help:
  -h, --help                      Show this help message and exit
  --help_all                      Show a help message with all program options
  --version                       Show Unicycler's version number

Input:
  -1 SHORT1, --short1 SHORT1      FASTQ file of first short reads in each pair
  -2 SHORT2, --short2 SHORT2      FASTQ file of second short reads in each pair
  -s UNPAIRED, --unpaired UNPAIRED
                                  FASTQ file of unpaired short reads
  -l LONG, --long LONG            FASTQ or FASTA file of long reads

Output:
  -o OUT, --out OUT               Output directory (required)
  --verbosity VERBOSITY           Level of stdout and log file information (default: 1)
                                    0 = no stdout, 1 = basic progress indicators, 2 = extra info,
                                    3 = debugging info
  --min_fasta_length MIN_FASTA_LENGTH
                                  Exclude contigs from the FASTA file which are shorter than this
                                  length (default: 100)
  --keep KEEP                     Level of file retention (default: 1)
                                    0 = only keep final files: assembly (FASTA, GFA and log),
                                    1 = also save graphs at main checkpoints,
                                    2 = also keep SAM (enables fast rerun in different mode),
                                    3 = keep all temp files and save all graphs (for debugging)

Other:
  -t THREADS, --threads THREADS   Number of threads used (default: 8)
  --mode {conservative,normal,bold}
                                  Bridging mode (default: normal)
                                    conservative = smaller contigs, lowest misassembly rate
                                    normal = moderate contig size and misassembly rate
                                    bold = longest contigs, higher misassembly rate
  --linear_seqs LINEAR_SEQS       The expected number of linear (i.e. non-circular) sequences in the
                                  underlying sequence (default: 0)

Advanced options

Run unicycler --help_all to see a complete list of the program's options. These allow you to turn off parts of the pipeline, specify the location of tools (only necessary if they are not in PATH) and adjust various settings:

usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAIRED] [-l LONG] -o OUT
                 [--verbosity VERBOSITY] [--min_fasta_length MIN_FASTA_LENGTH] [--keep KEEP]
                 [-t THREADS] [--mode {conservative,normal,bold}] [--min_bridge_qual MIN_BRIDGE_QUAL]
                 [--linear_seqs LINEAR_SEQS] [--min_anchor_seg_len MIN_ANCHOR_SEG_LEN]
                 [--spades_path SPADES_PATH] [--min_kmer_frac MIN_KMER_FRAC]
                 [--max_kmer_frac MAX_KMER_FRAC] [--kmers KMERS] [--kmer_count KMER_COUNT]
                 [--depth_filter DEPTH_FILTER] [--largest_component] [--spades_options SPADES_OPTIONS]
                 [--no_miniasm] [--racon_path RACON_PATH]
                 [--existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY] [--no_simple_bridges]
                 [--no_long_read_alignment] [--contamination CONTAMINATION] [--scores SCORES]
                 [--low_score LOW_SCORE] [--min_component_size MIN_COMPONENT_SIZE]
                 [--min_dead_end_size MIN_DEAD_END_SIZE] [--no_rotate] [--start_genes START_GENES]
                 [--start_gene_id START_GENE_ID] [--start_gene_cov START_GENE_COV]
                 [--makeblastdb_path MAKEBLASTDB_PATH] [--tblastn_path TBLASTN_PATH]

       __
       \ \___
        \ ___\
        //
   ____//      _    _         _                     _
 //_  //\\    | |  | |       |_|                   | |
//  \//  \\   | |  | | _ __   _   ___  _   _   ___ | |  ___  _ __
||  (O)  ||   | |  | || '_ \ | | / __|| | | | / __|| | / _ \| '__|
\\    \_ //   | |__| || | | || || (__ | |_| || (__ | ||  __/| |
 \\_____//     \____/ |_| |_||_| \___| \__, | \___||_| \___||_|
                                        __/ |
                                       |___/

Unicycler: an assembly pipeline for bacterial genomes

Help:
  -h, --help                      Show this help message and exit
  --help_all                      Show a help message with all program options
  --version                       Show Unicycler's version number

Input:
  -1 SHORT1, --short1 SHORT1      FASTQ file of first short reads in each pair
  -2 SHORT2, --short2 SHORT2      FASTQ file of second short reads in each pair
  -s UNPAIRED, --unpaired UNPAIRED
                                  FASTQ file of unpaired short reads
  -l LONG, --long LONG            FASTQ or FASTA file of long reads

Output:
  -o OUT, --out OUT               Output directory (required)
  --verbosity VERBOSITY           Level of stdout and log file information (default: 1)
                                    0 = no stdout, 1 = basic progress indicators, 2 = extra info,
                                    3 = debugging info
  --min_fasta_length MIN_FASTA_LENGTH
                                  Exclude contigs from the FASTA file which are shorter than this
                                  length (default: 100)
  --keep KEEP                     Level of file retention (default: 1)
                                    0 = only keep final files: assembly (FASTA, GFA and log),
                                    1 = also save graphs at main checkpoints,
                                    2 = also keep SAM (enables fast rerun in different mode),
                                    3 = keep all temp files and save all graphs (for debugging)

Other:
  -t THREADS, --threads THREADS   Number of threads used (default: 8)
  --mode {conservative,normal,bold}
                                  Bridging mode (default: normal)
                                    conservative = smaller contigs, lowest misassembly rate
                                    normal = moderate contig size and misassembly rate
                                    bold = longest contigs, higher misassembly rate
  --min_bridge_qual MIN_BRIDGE_QUAL
                                  Do not use bridges with a quality below this value
                                    conservative mode default: 25.0
                                    normal mode default: 10.0
                                    bold mode default: 1.0
  --linear_seqs LINEAR_SEQS       The expected number of linear (i.e. non-circular) sequences in the
                                  underlying sequence (default: 0)
  --min_anchor_seg_len MIN_ANCHOR_SEG_LEN
                                  If set, Unicycler will not use segments shorter than this as
                                  scaffolding anchors (default: automatic threshold)

SPAdes assembly:
  These options control the short-read SPAdes assembly at the beginning of the Unicycler pipeline.

  --spades_path SPADES_PATH       Path to the SPAdes executable (default: spades.py)
  --min_kmer_frac MIN_KMER_FRAC   Lowest k-mer size for SPAdes assembly, expressed as a fraction of
                                  the read length (default: 0.2)
  --max_kmer_frac MAX_KMER_FRAC   Highest k-mer size for SPAdes assembly, expressed as a fraction of
                                  the read length (default: 0.95)
  --kmers KMERS                   Exact k-mers to use for SPAdes assembly, comma-separated (example:
                                  21,51,71, default: automatic)
  --kmer_count KMER_COUNT         Number of k-mer steps to use in SPAdes assembly (default: 8)
  --depth_filter DEPTH_FILTER     Filter out contigs lower than this fraction of the chromosomal
                                  depth, if doing so does not result in graph dead ends (default:
                                  0.25)
  --largest_component             Only keep the largest connected component of the assembly graph
                                  (default: keep all connected components)
  --spades_options SPADES_OPTIONS
                                  Additional options to be given to SPAdes (example: "--phred-offset
                                  33", default: no additional options)

miniasm+Racon assembly:
  These options control the use of miniasm and Racon to produce long-read bridges.

  --no_miniasm                    Skip miniasm+Racon bridging (default: use miniasm and Racon to
                                  produce long-read bridges)
  --racon_path RACON_PATH         Path to the Racon executable (default: racon)
  --existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY
                                  A pre-prepared long-read assembly for the sample in GFA or FASTA
                                  format. If this option is used, Unicycler will skip the
                                  miniasm/Racon steps and instead use the given assembly (default:
                                  perform long-read assembly using miniasm/Racon)

Long-read alignment and bridging:
  These options control the use of long-read alignment to produce long-read bridges.

  --no_simple_bridges             Skip simple long-read bridging (default: use simple long-read
                                  bridging)
  --no_long_read_alignment        Skip long-read-alignment-based bridging (default: use long-read
                                  alignments to produce bridges)
  --contamination CONTAMINATION   FASTA file of known contamination in long reads
  --scores SCORES                 Comma-delimited string of alignment scores: match, mismatch, gap
                                  open, gap extend (default: 3,-6,-5,-2)
  --low_score LOW_SCORE           Score threshold - alignments below this are considered poor
                                  (default: set threshold automatically)

Graph cleaning:
  These options control the removal of small leftover sequences after bridging is complete.

  --min_component_size MIN_COMPONENT_SIZE
                                  Graph components smaller than this size (bp) will be removed from
                                  the final graph (default: 1000)
  --min_dead_end_size MIN_DEAD_END_SIZE
                                  Graph dead ends smaller than this size (bp) will be removed from the
                                  final graph (default: 1000)

Assembly rotation:
  These options control the rotation of completed circular sequence near the end of the Unicycler
  pipeline.

  --no_rotate                     Do not rotate completed replicons to start at a standard gene
                                  (default: completed replicons are rotated)
  --start_genes START_GENES       FASTA file of genes for start point of rotated replicons (default:
                                  start_genes.fasta)
  --start_gene_id START_GENE_ID   The minimum required BLAST percent identity for a start gene search
                                  (default: 90.0)
  --start_gene_cov START_GENE_COV
                                  The minimum required BLAST percent coverage for a start gene search
                                  (default: 95.0)
  --makeblastdb_path MAKEBLASTDB_PATH
                                  Path to the makeblastdb executable (default: makeblastdb)
  --tblastn_path TBLASTN_PATH     Path to the tblastn executable (default: tblastn)

Output files

Unicycler's most important output files are assembly.gfa, assembly.fasta and unicycler.log. These are produced by every Unicycler run. Which other files are saved to its output directory depends on the value of --keep:

--keep 0 retains only the important files. Use this setting to save drive space.
--keep 1 (the default) also saves some intermediate graphs which can be useful for investigating an assembly more deeply.
--keep 2 also retains the SAM file of long-read alignments to the graph. This ensures that if you rerun Unicycler with the same output directory (for instance changing the mode to conservative or bold) it will run faster because it does not have to repeat the alignment step.
--keep 3 retains all files and saves many intermediate graphs. This is for debugging purposes and uses a lot of space, so most users should probably avoid this setting.

All files and directories are described in the table below. Intermediate output files (everything except for assembly.gfa, assembly.fasta and unicycler.log) will be prefixed with a number so they are in chronological order. Whether or not a file is in the output depends on the --keep level and type of input reads (e.g. short-read-only or hybrid).

File/directory	Description	`--keep` level
`spades_assembly/`	directory containing SPAdes files and log (can be useful for debugging if SPAdes crashes)	3
`*_spades_graph_k*.gfa`	unaltered SPAdes assembly graphs at each k-mer size	1
`*_depth_filter.gfa`	best SPAdes short-read assembly graph after low-depth contigs have been removed and multiplicity determination	1
`*_overlaps_removed.gfa`	overlap-free version of the best SPAdes graph, with some more graph clean-up	1
`miniasm_assembly/`	directory containing miniasm string graphs and unitig graphs	3
`simple_bridging/`	directory containing files for the simple long-read bridging step	3
`*_long_read_assembly.gfa`	the long-read+contig miniasm+Racon assembly	1
`read_alignment/`	directory containing `long_read_alignments.sam`	2
`*_bridges_applied.gfa`	bridges applied, before any cleaning or merging	1
`*_cleaned.gfa`	redundant contigs removed from the graph	3
`*_merged.gfa`	contigs merged together where possible	3
`*_final_clean.gfa`	more redundant contigs removed	1
`blast/`	directory containing files for the assembly-rotation BLAST search	3
`*_rotated.gfa`	circular replicons rotated and/or flipped to a start position	1
`assembly.gfa`	final assembly in GFA v1 graph format	0
`assembly.fasta`	final assembly in FASTA format (same sequences as in assembly.gfa except for very short contigs)	0
`unicycler.log`	Unicycler log file (same info as was printed to stdout)	0
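
To make the nesting of the levels concrete, here is a minimal Python sketch that lists which outputs are retained at a given `--keep` level. It is only an illustration: `KEEP_LEVELS` and `files_kept` are hypothetical names invented for this example (not part of Unicycler), and the dictionary covers a selection of rows copied from the table above. Whether a file actually appears also depends on the type of input reads.

```python
# Hypothetical lookup (not part of Unicycler): the minimum --keep level at which
# each output is retained, copied from selected rows of the table above.
KEEP_LEVELS = {
    "spades_assembly/": 3,
    "*_depth_filter.gfa": 1,
    "*_overlaps_removed.gfa": 1,
    "miniasm_assembly/": 3,
    "simple_bridging/": 3,
    "*_long_read_assembly.gfa": 1,
    "read_alignment/": 2,
    "*_bridges_applied.gfa": 1,
    "*_cleaned.gfa": 3,
    "*_merged.gfa": 3,
    "*_final_clean.gfa": 1,
    "*_rotated.gfa": 1,
    "assembly.gfa": 0,
    "assembly.fasta": 0,
    "unicycler.log": 0,
}


def files_kept(keep_level):
    """Return the outputs whose minimum --keep level is at or below keep_level."""
    return sorted(name for name, level in KEEP_LEVELS.items() if level <= keep_level)


if __name__ == "__main__":
    for level in range(4):
        print(f"--keep {level}: {len(files_kept(level))} of the listed outputs")
```

Each level is a superset of the one below it, which is why a single "minimum level" per file is enough to describe the table.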

Tips

Running time

Unicycler is thorough and accurate, but not particularly fast. For hybrid assemblies, the direct long-read bridging step of the pipeline can take a while to complete. Two main factors influence the running time: the number of long reads (more reads take longer to align) and the genome size/complexity (finding bridge paths is more difficult in complex graphs).

Unicycler may only take an hour or so to assemble a small, simple genome with low depth long reads. On the other hand, a complex genome with many long reads may take 12 hours to finish or more. If you have a very high depth of long reads (e.g. >100×), you can make Unicycler run faster by subsampling for only the best/longest reads (check out Filtlong).
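
The idea behind subsampling can be sketched in a few lines of Python. This is a deliberately simplified, length-only illustration and not how Filtlong works internally (Filtlong also takes read quality into account). The function name `subsample_longest` and its parameters are hypothetical.

```python
def subsample_longest(read_lengths, genome_size, target_depth):
    """Keep the longest reads until their combined bases reach target_depth x genome_size.

    Simplification: reads are ranked by length only (an assumption of this sketch).
    """
    target_bases = genome_size * target_depth
    kept, total_bases = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total_bases >= target_bases:
            break
        kept.append(length)
        total_bases += length
    return kept


if __name__ == "__main__":
    # Toy numbers only: a 1 kb "genome" and five reads.
    lengths = [1000, 5000, 20000, 15000, 3000]
    print(subsample_longest(lengths, genome_size=1000, target_depth=30))
```

In practice you would apply the same idea to a FASTQ file with a dedicated tool rather than to a list of lengths.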

Using a lot of threads (with the --threads option) can make Unicycler run faster too. It will only use up to 8 threads by default, but if you're running it on a large machine with lots of CPU and RAM, feel free to use more!

Unicycler also works with PyPy which can speed up parts of its pipeline. However, some of Unicycler's slowest steps are when it calls other tools (like SPAdes) or uses C++ code, so PyPy may not help much. I haven't tested this thoroughly – if you try it, let me know how you go!

Necessary read length

The length of a long read is very important, typically more so than its accuracy, because longer reads are more likely to align to multiple single-copy contigs, allowing Unicycler to build bridges.

Consider a sequence with a 2 kb repeat:

Long read length

In order to resolve the repeat, a read must span it by aligning to some sequence on either side. In this example, the 1 kb reads are shorter than the repeat and are useless. The 2.5 kb reads can resolve the repeat, but they have to be in just the right place to do so. Only one out of the six in this example is useful. The 5 kb reads, however, have a much easier time spanning the repeat and all three are useful.

So how long must your reads be for Unicycler to complete an assembly? Longer than the longest repeat in the genome. Depending on the genome, that might be a 1 kb insertion sequence, a 6 kb rRNA operon or a 50 kb prophage. If your reads are just a bit longer than the longest repeat, you'll probably need a lot of them. If they are much longer, then fewer reads should suffice. But in any scenario, longer is better!
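
A back-of-envelope calculation shows why reads only slightly longer than a repeat are needed in large numbers. The Python sketch below assumes reads of a single fixed length that start uniformly at random along the genome, and that a read needs at least `min_flank` bases of unique sequence on each side of the repeat to be useful. Both the model and the `min_flank` value are assumptions of this example, not Unicycler parameters.

```python
def expected_spanning_reads(read_length, repeat_length, depth, min_flank=100):
    """Expected number of reads that span a repeat with min_flank bases on both sides.

    Model (assumed): reads of one fixed length, uniformly distributed start positions.
    Roughly depth / read_length reads start at each base, and a read spans the repeat
    only if it starts within a window of read_length - repeat_length - 2 * min_flank bases.
    """
    usable_window = read_length - repeat_length - 2 * min_flank
    if usable_window <= 0:
        return 0.0  # the read is too short to anchor on both sides
    return depth / read_length * usable_window


if __name__ == "__main__":
    # The 2 kb repeat from the example above, at a hypothetical 20x long-read depth.
    for read_length in (1000, 2500, 5000):
        print(read_length, round(expected_spanning_reads(read_length, 2000, 20), 1))
```

Under these assumptions the 1 kb reads never span the repeat, and at the same depth the 5 kb reads are expected to span it several times more often than the 2.5 kb reads.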

Bad Illumina reads

Unicycler prefers decent Illumina reads as input – ideally with uniform read depth and 100% genome coverage. Bad Illumina read sets can still work in Unicycler, but greater long-read depth will be required to compensate.

You can look at Unicycler graphs in Bandage to get a quick impression of the Illumina read quality:

Graphs of varying quality

A is a very good Illumina read graph – the contigs are long and there are no dead ends. This read set is ideally suited for use in Unicycler and shouldn't require too many long reads to complete (10–20× would probably be enough).

B is also a good graph. The genome is more complex, resulting in a more tangled structure, but there are still very few dead ends (you can see one in the lower left). This read set would also work well in Unicycler, though more long reads may be required to get a complete genome (possibly 30× or so).

C is a disaster! It is broken into many pieces, probably because parts of the genome got no read depth at all. This genome may take lots of long reads to complete in Unicycler, perhaps 50× or more. The final assembly will probably have more small errors (SNPs and indels), as parts of the genome cannot be polished well with Illumina reads. If your graph looks like this, I'd recommend trying a long-read-first assembly approach (see 2022 update).

Very short contigs

Confused by very small (e.g. 2 bp) contigs in Unicycler assemblies? Unlike a SPAdes graph where neighbouring sequences overlap by their k-mer size, Unicycler's final graph has no overlaps and the sequences abut directly. This means that contigs in complex regions can be quite short. They may be useless as stand-alone contigs but are still important in the graph structure.

Short contigs in assembly graph

If short contigs are a problem for your downstream analysis, you can use the --min_fasta_length option to exclude them from Unicycler's FASTA file (they will still be included in the GFA file).
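
If you already have an assembly and only want to drop short contigs afterwards, the same kind of filter is easy to apply yourself. The Python sketch below is a stand-alone illustration, not Unicycler's code. `parse_fasta` and `drop_short` are hypothetical helper names, and treating the threshold as inclusive (keep contigs with length >= the minimum) is an assumption of this example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_short(records, min_length):
    """Keep records whose sequence is at least min_length bases (inclusive: assumed)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    fasta = ">1\nACGTACGT\nACGT\n>2\nAC\n"
    for name, seq in drop_short(parse_fasta(fasta), min_length=3):
        print(name, len(seq))
```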

Chromosomes and plasmid depth

Unicycler normalises the depth of contigs in the graph to the median value. This typically means that the chromosome has a depth near 1× and plasmids may have different (typically higher) depths.

Plasmid depths

In the above graph, the chromosome is at the top (you can only see part of it) and there are two plasmids. The plasmid on the left occurs in approximately four or five copies per cell. For the larger plasmid on the right, most cells probably had one copy but some had more. Since sequencing biases can affect read depth, these per-cell counts should be interpreted loosely.
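
The arithmetic behind these relative depths can be sketched as follows. This is an illustration, not Unicycler's implementation: the example weights each contig by its length when taking the median (an assumption, made so that the chromosome dominates), and the contig names, lengths and raw depths are invented for the example.

```python
def weighted_median(pairs):
    """Median of values weighted by the given weights; pairs is a list of (value, weight)."""
    ordered = sorted(pairs)
    half = sum(weight for _, weight in ordered) / 2
    running = 0
    for value, weight in ordered:
        running += weight
        if running >= half:
            return value
    raise ValueError("no contigs given")


def normalise_depths(contigs):
    """contigs maps name -> (length_bp, raw_depth); returns name -> depth relative to the median."""
    median = weighted_median([(depth, length) for length, depth in contigs.values()])
    return {name: depth / median for name, (length, depth) in contigs.items()}


if __name__ == "__main__":
    # Invented numbers: one chromosome and two plasmids.
    contigs = {
        "chromosome": (5_000_000, 80.0),
        "small_plasmid": (5_000, 360.0),
        "large_plasmid": (100_000, 104.0),
    }
    for name, depth in normalise_depths(contigs).items():
        print(f"{name}: {depth:.2f}x")
```

With these invented numbers the chromosome comes out at 1×, the small plasmid at 4.5× (four or five copies per cell) and the large plasmid at 1.3× (mostly one copy, sometimes more).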

Known contamination

If your long reads have known contamination, you can use the --contamination option to give Unicycler a FASTA file of the contaminant sequences. Unicycler will then discard any reads for which the best alignment is to the contaminant.

For example, if you've sequenced two isolates in succession on the same Nanopore flow cell, there may be residual reads from the first sample in the second run. In this case, you can supply a reference/assembly of the first sample to Unicycler when assembling the second sample.

Some Oxford Nanopore protocols include a lambda phage spike-in as a control. Since this is a common contaminant, you can simply use --contamination lambda to filter these out (no need to supply a FASTA file).
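
The filtering rule itself ("discard a read if its best alignment is to the contaminant") can be expressed in a few lines. The Python sketch below only illustrates that rule and is not Unicycler's code. The input structure (a dictionary of read names to `(reference_name, score)` hits), the function name `filter_contaminated` and the decision to keep unaligned reads are all assumptions of this example.

```python
def filter_contaminated(alignments, contaminant_names):
    """Return the read names whose best-scoring alignment is not to a contaminant.

    alignments: dict of read name -> list of (reference_name, score) hits (assumed format).
    Reads with no hits at all are kept (an assumption of this sketch).
    """
    kept = []
    for read_name, hits in alignments.items():
        if hits:
            best_reference, _ = max(hits, key=lambda hit: hit[1])
            if best_reference in contaminant_names:
                continue
        kept.append(read_name)
    return kept


if __name__ == "__main__":
    hits = {
        "read_1": [("contig_5", 950), ("lambda", 120)],
        "read_2": [("lambda", 4800)],
        "read_3": [],
    }
    print(filter_contaminated(hits, {"lambda"}))
```

Note that a read with a weak hit to the contaminant is still kept as long as it aligns better to the assembly graph.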

Manual multiplicity

If Unicycler makes a serious error during its multiplicity determination, this can have detrimental effects on the rest of the assembly. I've seen this happen when:

the Illumina graph is badly fragmented (multiplicity determination has few graph connections to work with).
there are multiple very similar plasmids in the genome (shared sequences between plasmids can be huge, 10s of kbp).
there is genomic heterogeneity.

If you believe this has happened in your assembly, you can manually assign multiplicities and try the assembly again. Here's the procedure:

View the short-read assembly (002_depth_filter.gfa) in Bandage and view the region in question. Note that Unicycler's graph colour scheme uses green for single-copy segments and yellow/orange/red for multi-copy segments.
For any segments where you disagree with Unicycler's multiplicity, add an ML tag to the GFA segment line in 002_depth_filter.gfa. Examples:
- If Unicycler called segment 50 single-copy but you think it's really a 2-copy repeat, add ML:i:2 to the end of the GFA line starting with S 50.
- If Unicycler called segment 107 multi-copy but you think it's actually single-copy, add ML:i:1 to the end of the GFA line starting with S 107.
Run Unicycler again, pointing to the same output directory (with your modified 002_depth_filter.gfa file). It will take your manually assigned multiplicities into account and hopefully do better!
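
Editing the GFA by hand works fine for one or two segments. For more, a small script is less error-prone. The following Python sketch appends an ML tag to chosen segment lines, relying on GFA v1 segment lines being tab-delimited and starting with `S` followed by the segment name. The function name `set_multiplicity` is hypothetical, and the tags in the toy example are invented placeholders rather than real Unicycler output.

```python
def set_multiplicity(gfa_text, multiplicities):
    """Append ML:i:<n> to the segment (S) lines named in multiplicities.

    multiplicities: dict of segment name (str) -> copy number (int).
    Any ML:i: tag already on such a line is replaced rather than duplicated.
    """
    edited = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3 and fields[1] in multiplicities:
            tags = [tag for tag in fields[3:] if not tag.startswith("ML:i:")]
            tags.append(f"ML:i:{multiplicities[fields[1]]}")
            line = "\t".join(fields[:3] + tags)
        edited.append(line)
    return "\n".join(edited) + "\n"


if __name__ == "__main__":
    # Toy graph with made-up sequences and tags.
    gfa = "S\t50\tACGT\tLN:i:4\nL\t50\t+\t107\t+\t0M\nS\t107\tGG\tLN:i:2\n"
    print(set_multiplicity(gfa, {"50": 2, "107": 1}), end="")
```

To use it on a real run, read 002_depth_filter.gfa into a string, pass the segments you want to override, and write the result back to the same file (keeping a backup of the original is a good idea).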

Manual completion

If Unicycler doesn't complete your bacterial genome assembly on its own, you may be able to complete it manually with a bit of bioinformatics detective work. There's no single, straightforward procedure for doing so, but I've put together a few examples on the Unicycler wiki which may be helpful.

Using an external long-read assembly

If you have a long-read assembly that you've prepared outside Unicycler and trust (e.g. with Canu), you can give it to Unicycler with --existing_long_read_assembly. Unicycler will then skip its miniasm/Racon step and use this assembly instead.

Assemblies with contig overlaps

Unicycler removes overlaps between contigs, resulting in cleaner assembly graphs. However, in some contexts, you might want these overlaps. In particular, if you are analysing your assemblies with a k-mer-based algorithm, overlaps might be a good thing so k-mers at contig boundaries aren't lost.

If this applies to you, I'd recommend using Unicycler's 002_depth_filter.gfa file (the last of the intermediate files before overlaps are removed) instead of the final assembly.fasta file. If you need this in FASTA format, Torsten's any2fasta tool can do the conversion.
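
If you would rather not install another tool, the GFA-to-FASTA conversion is simple enough to sketch in Python. This is a minimal illustration and not how any2fasta is implemented. It assumes tab-delimited GFA v1 segment lines, and the function name `gfa_to_fasta` is hypothetical.

```python
def gfa_to_fasta(gfa_text):
    """Convert the segment (S) lines of a GFA v1 graph into FASTA text.

    Segments whose sequence is '*' (sequence not stored in the file) are skipped.
    """
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3 and fields[2] != "*":
            records.append(f">{fields[1]}\n{fields[2]}\n")
    return "".join(records)


if __name__ == "__main__":
    gfa = "S\t1\tACGTACGT\tLN:i:8\nL\t1\t+\t2\t+\t3M\nS\t2\tCGTTT\tLN:i:5\n"
    print(gfa_to_fasta(gfa), end="")
```

Link lines are ignored, so the overlaps are kept only implicitly, as shared sequence at the ends of neighbouring contigs.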

Acknowledgements

Unicycler would not have been possible without Kat Holt, my fellow researchers in her lab and the many other people I work with at the University of Melbourne's Bio21 Molecular Science & Biotechnology Institute. In particular, Margaret Lam, Kelly Wyres, David Edwards and Claire Gorrie worked with me on many challenging genomes during Unicycler's development. Louise Judd is great with the MinION and produced many of the long reads I have used when developing Unicycler.

Unicycler uses SeqAn to perform alignments and other sequence manipulations. The authors of this library have been very helpful during Unicycler's development and I owe them a great deal of thanks! It also uses minimap for alignment and miniasm for long-read assembly, so I'd like to thank Heng Li for these tools. Finally, Unicycler uses nanoflann, a delightfully fast and lightweight nearest neighbour library, to perform its line-finding in semi-global alignment.

License

GNU General Public License, version 3
