Sequencing Technologies: A High-Level Overview

How do researchers read and store the genetic code of organisms? As you know, DNA holds information as a sequence of nucleotides: A, C, G, or T. Over the last half-century, researchers have developed numerous methods to sequence genomes:

Sanger Sequencing

This technique involves copying the DNA and incorporating special molecules called dideoxynucleotides, which cause the copying process to stop at specific points. By doing this, scientists can generate fragments of different lengths that end at each of the four DNA bases (A, T, C, G). These fragments are then separated by size using a process called gel electrophoresis, and the sequence is determined by reading the order of the bases.
As the first reliable method to sequence DNA, it led to the sequencing of individual genes and eventually entire genomes. Its accuracy and reliability have made it a gold standard in DNA sequencing, even though it is slower and more expensive than newer technologies.

Next-Generation Sequencing

Next-generation sequencing (NGS) refers to a group of advanced technologies that allow for the rapid sequencing of DNA. Unlike Sanger sequencing, which reads one DNA fragment at a time, NGS can sequence millions of fragments simultaneously. NGS works by fragmenting the DNA, attaching adapters, and then using various methods to read the sequence of each fragment. The data is then assembled using computers to reconstruct the original DNA sequence.
Next-generation sequencing revolutionized genetic research by drastically reducing the time and cost needed to sequence DNA. It enabled large-scale projects like the 1000 Genomes Project, which studies human genetic variation on a global scale.

Whole Genome Sequencing

Whole genome sequencing (WGS) uses NGS technology to read an organism's complete DNA sequence, including all of its genes and non-coding regions. This method provides a detailed view of the genetic blueprint, allowing scientists to identify genetic variations, mutations, and other features that may contribute to traits, diseases, and evolutionary processes. While WGS generates a vast amount of data, advancements in computational tools have made it increasingly practical for various applications.
Whole genome sequencing has revolutionized medicine by enabling personalized treatment plans, improving the diagnosis of rare genetic disorders, and advancing cancer treatment through targeted therapies. It has improved our understanding of gene function and variation.

How do we construct genomes from DNA fragments?

Central to how next-generation sequencing technology works is the concept of “shotgun sequencing”. Shotgun sequencing involves randomly breaking up DNA sequences into many small fragments, sequencing these fragments, and then using computational methods to reassemble the original sequence.

As the image above shows, next-generation sequencing generates data for many overlapping fragments of DNA; it is up to computer programs to analyze where the fragments overlap in order to assemble the full, continuous DNA sequence.
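The overlap-based reassembly described here can be illustrated with a toy sketch. It assumes error-free fragments from a single strand with no repeated regions, and it greedily merges whichever pair of fragments overlaps most. The function names and the minimum overlap of 3 bases are hypothetical choices for illustration; production assemblers rely on more robust approaches (overlap graphs or de Bruijn graphs) and must cope with sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the two fragments with the largest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more, so stop merging
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)  # the longest contig that could be built


# Three overlapping fragments of a made-up 12-base sequence, given out of order.
fragments = ["CCATG", "AGCTTGA", "TGACCA"]
contig = greedy_assemble(fragments)
print(contig)
```

With these three fragments the sketch first joins AGCTTGA and TGACCA through their shared TGA, then attaches CCATG through the shared CCA.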

The DNA assembly process is a perfect example of how computational biology is foundational to modern approaches to understanding life. Without advanced computational tools, next-generation sequencing would only provide an expansive DNA jigsaw puzzle with no way to make sense of it.

Data Formats Overview

Once sequencing reads have been acquired using one of the technologies described above, they must be stored and manipulated. Three main data formats are used for sequencing data: FASTQ, BAM, and VCF. The table below gives a simplified summary of the data contained in each.

  
Data Formats and Types of Data Stored

Format | Type of Data Stored
FASTQ | Raw sequence
BAM | Sequence and the position in the reference genome where the read aligns
VCF | Genetic variants relative to a reference genome (mutations, alternative alleles, etc.)

Let’s learn about each data format in more detail.

FASTQ

FASTQ is a text-based format commonly used to store nucleotide sequencing data (e.g., “ATTGCAG”) and its corresponding quality scores. A quality score represents the probability of an error in the base call (the predicted base). FASTQ uses the Phred quality score and encodes the score for each base as a single ASCII character.
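As a rough sketch of how this encoding can be read back, the snippet below converts a quality character to its Phred score and then to an error probability. It assumes the common Phred+33 offset (the character '!' with ASCII code 33 stands for a score of 0) and the standard Phred definition Q = -10 * log10(P), where P is the probability that the base call is wrong. The function names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding, so '!' (ASCII 33) is score 0


def phred_score(char: str) -> int:
    """Convert one FASTQ quality character to its Phred score."""
    return ord(char) - PHRED_OFFSET


def error_probability(score: int) -> float:
    """Invert Q = -10 * log10(P) to get the probability of a wrong base call."""
    return 10 ** (-score / 10)


for char in "!5I":
    q = phred_score(char)
    print(char, q, error_probability(q))
```

Under these assumptions a score of 20 corresponds to a 1 in 100 chance of error, and a score of 40 to a 1 in 10,000 chance.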

  
    




Key for Interpreting ASCII codes for Phred Quality Scores

ASCII Code | Phred Score
! | 0
" | 1
# | 2
$ | 3
% | 4
& | 5
' | 6
( | 7
) | 8
* | 9
+ | 10
, | 11
- | 12
. | 13
/ | 14
0 | 15
1 | 16
2 | 17
3 | 18
4 | 19
5 | 20
6 | 21
7 | 22
8 | 23
9 | 24
: | 25
; | 26
< | 27
= | 28
> | 29
? | 30
@ | 31
A | 32
B | 33
C | 34
D | 35
E | 36
F | 37
G | 38
H | 39
I | 40

A higher Phred score indicates a higher confidence (or lower probability of error) in the base call. FASTQ files format the sequence and quality score information as follows:

Line 1: A sequence identifier starting with '@'.
Line 2: The raw sequence of nucleotides (A, T, C, G).
Line 3: A plus sign (+), optionally followed by the same identifier as in Line 1.
Line 4: A quality score string that encodes the confidence of each nucleotide call in Line 2. Higher scores indicate more confidence.

Here’s an example of a FASTQ entry:

@MY_SEQUENCE
TCTGAATTGGGTAACCCGCGAGTCCGGATTCGCTGAGAATACCGTAGGAT
+MY_SEQUENCE
42;D>3</F:58,C1.@603>?26<9!7:;/=.#@H"$,8&'B!=+08@"

The first and third lines include the identifier of the sequence. In this case, the sequence has been named “MY_SEQUENCE”. The string in line two represents the base calls and the last line has the associated Phred quality score (our confidence in the base calls).
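The four-line layout lends itself to a short parser. The sketch below is a minimal illustration, not a replacement for an established FASTQ library: it assumes each record occupies exactly four lines (no wrapped sequences) and uses Phred+33 quality encoding. The record, the `FastqRecord` type, and the threshold of 20 are made up for the example.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: list[int]  # one Phred score per base


def parse_fastq_record(lines: list[str], offset: int = 33) -> FastqRecord:
    """Parse one four-line FASTQ record (assumes Phred+33 by default)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(char) - offset for char in quality]
    return FastqRecord(header[1:], sequence, scores)


# Hypothetical record: seven bases, two of them with very low quality.
record = parse_fastq_record(["@READ_1", "GATTACA", "+", "II5I!#I"])
low_confidence = [i for i, q in enumerate(record.qualities) if q < 20]
print(record.identifier, record.qualities, low_confidence)
```

Checking that the sequence and quality strings have the same length is a cheap way to catch truncated or corrupted records.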

BAM

BAM stands for Binary Alignment/Map. As the name suggests, it stores sequence data in binary. A BAM file contains sequencing reads that have been aligned to a reference genome. (Alignment is the process of matching sequencing reads to a known reference genome, which helps determine where in the genome each read comes from.)

A BAM file stores the following types of data for each read:

Read Identifier: A unique ID for each read, usually inherited from the FASTQ file.
Aligned Sequence: The sequence of nucleotides (A, T, C, G) from the read.
Mapping Position: The exact location in the reference genome where the read aligns.
Mapping Quality: A score indicating the confidence of the alignment. Higher scores mean the alignment is more reliable. These scores can be based on factors such as the number of mismatches, gaps, and overall alignment length.
CIGAR String: A notation that describes how the read aligns with the reference genome, including matches, insertions, deletions, etc.
Additional Metadata: Other information like the read's orientation, the sequence it pairs with (if paired-end), and any observed differences (mutations) from the reference genome.

For example, in a SAM file (which is a text-based equivalent of BAM), an entry might look like this:

r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * NM:i:1 MD:Z:8 PG:Z:hello

In this example:

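r001 is the read identifier.
99 is the bitwise FLAG; here it marks the read as paired, properly paired, first in its pair, and with its mate on the reverse strand.
ref is the name of the reference sequence (for example, a chromosome) to which the read is aligned.
7 is the position on the reference where the read starts aligning.
30 is the mapping quality score.
8M2I4M1D3M is the CIGAR string, indicating 8 aligned bases, a 2-base insertion, 4 aligned bases, a 1-base deletion, and 3 aligned bases. (M covers both matches and mismatches.)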
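Two of these fields reward a closer look: the CIGAR string and the second column (99 in the entry above), which is a bitwise FLAG. The sketch below decodes both using only the standard library. The operation letters and bit values follow the SAM specification; the function names and the flag labels are hypothetical. A maintained library such as pysam would normally be used for real files.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")      # operations that consume bases of the read
REFERENCE_OPS = set("MDN=X")  # operations that consume bases of the reference

# Lower eight FLAG bits; the labels are informal names chosen for this sketch.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
}


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(ops: list[tuple[int, str]]) -> int:
    """Number of read bases the alignment accounts for."""
    return sum(n for n, op in ops if op in QUERY_OPS)


def reference_length(ops: list[tuple[int, str]]) -> int:
    """Number of reference bases the alignment spans."""
    return sum(n for n, op in ops if op in REFERENCE_OPS)


def decode_flag(flag: int) -> list[str]:
    """List the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


ops = parse_cigar("8M2I4M1D3M")
print(ops, query_length(ops), reference_length(ops))
print(decode_flag(99))
```

For this CIGAR string the read contributes 17 bases while spanning 16 reference bases, so an alignment starting at position 7 ends at position 22.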

A BAM file is essentially a binary, compressed version of this type of data, which makes it more efficient to store and process large amounts of sequencing data.

VCF

VCF stands for Variant Call Format. This format is used to store information about genetic variations relative to a reference genome. Here are some examples of genetic variations:

Single Nucleotide Polymorphisms (SNPs): A single nucleotide differs from the reference genome.
Insertions and Deletions (Indels): Nucleotides are either inserted or deleted relative to the reference genome. Indels can vary in size from a single base pair to several base pairs.
Copy Number Variations (CNVs): Segments of the genome are duplicated or deleted, leading to an alteration in the number of copies of a particular sequence.
Structural Variations (SVs): Larger-scale variations in the genome structure, including inversions, translocations, and large insertions or deletions that can range from hundreds to millions of base pairs.

These variations are crucial for understanding genetic diversity, disease susceptibility, and evolutionary processes across populations. Let’s go through an example to understand how to interpret a VCF file.

First, a VCF file has a header that provides metadata and describes the structure of the data contained within. It includes information such as the file format version, reference genome used, and definitions of the data fields.

##fileformat=VCFv4.3
##reference=GRCh38
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2

In this example:

Line 1: Specifies the VCF format version.
Line 2: Indicates the reference genome used (GRCh38 in this case).
Line 3: Defines additional information fields (INFO) such as allele frequency.
Line 4: Describes the format of genotype data (FORMAT field).
Line 5: Specifies column headers, including sample names.
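A header like this can be read with a few lines of code. The sketch below is an illustration under simplifying assumptions: meta-information lines start with "##", structured values are wrapped in angle brackets, and any column that is not one of the standard fixed names is treated as a sample. The function names are hypothetical, and a dedicated VCF library would handle many more cases.

```python
import re
from typing import Union

STRUCTURED_FIELD = re.compile(r'(\w+)=("[^"]*"|[^,>]*)')
STANDARD_COLUMNS = {"CHROM", "POS", "ID", "REF", "ALT",
                    "QUAL", "FILTER", "INFO", "FORMAT"}


def parse_meta_line(line: str) -> tuple[str, Union[str, dict[str, str]]]:
    """Parse a '##key=value' line; '<...>' values become dictionaries."""
    key, _, value = line[2:].strip().partition("=")
    if value.startswith("<") and value.endswith(">"):
        fields = {k: v.strip('"') for k, v in STRUCTURED_FIELD.findall(value[1:-1])}
        return key, fields
    return key, value


def sample_names(column_line: str) -> list[str]:
    """Return the sample columns from the '#CHROM ...' line."""
    columns = column_line.lstrip("#").split()
    return [name for name in columns if name not in STANDARD_COLUMNS]


version = parse_meta_line("##fileformat=VCFv4.3")
info = parse_meta_line(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
samples = sample_names("#CHROM POS ID REF ALT QUAL FILTER INFO Sample1 Sample2")
print(version, info, samples)
```

Matching the quoted description as one unit keeps commas inside the description text from being mistaken for field separators.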

After the header section, the VCF file contains one entry for each genetic variation relative to the reference genome. Each entry contains:

CHROM: Chromosome number.
POS: Position in the chromosome.
ID: Identifier for the variant.
REF: Reference allele.
ALT: Alternative allele(s).
QUAL: Quality score of the variant call.
FILTER: Filter status (e.g., passed or failed certain criteria).
INFO: Additional information about the variant.

Here’s a typical entry (the header is included for easy interpretation):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
12 12237 ab123 T C 30 PASS AF=0.5;AN=2;DP=10 GT 0/1 1/1

The variant records follow the header and describe each genetic variant detected in the samples.

12: Chromosome number.
12237: Position on the chromosome.
ab123: Variant ID (usually a unique identifier, if available).
T: Reference allele.
C: Alternate allele observed in the sample(s).
30: Phred-scaled quality score for the variant call (higher values mean greater confidence that the variant is real).
PASS: Filter status indicating that the variant passed all quality filters.
AF=0.5;AN=2;DP=10: INFO field containing additional information such as allele frequency (AF), total number of alleles (AN), and read depth (DP).
GT: FORMAT field naming the per-sample data that follows (here, only the genotype).
0/1, 1/1: Genotype calls for each sample (in this example, two samples: Sample1 and Sample2). 0 denotes the reference allele and 1 the alternate allele, so Sample1 is heterozygous and Sample2 is homozygous for the alternate allele.
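To tie these fields together, here is a minimal sketch that parses one variant record. It assumes the standard tab-separated layout in which a FORMAT column (here GT) sits between INFO and the sample columns, and it classifies variants only by comparing allele lengths. All function names and category labels are hypothetical; real pipelines typically use a dedicated VCF library.

```python
FIXED_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]


def parse_info(info: str) -> dict[str, str]:
    """Split 'AF=0.5;AN=2' into a dictionary; bare flags map to ''."""
    pairs = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def classify(ref: str, alt: str) -> str:
    """Rough variant type from allele lengths (a simplification)."""
    if alt.startswith("<"):
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"


def describe_genotype(gt: str) -> str:
    """Interpret a genotype string such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def parse_record(line: str, sample_names: list[str]) -> dict:
    """Parse one tab-separated VCF record that includes a FORMAT column."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["pos"] = int(record["pos"])
    record["qual"] = None if record["qual"] == "." else float(record["qual"])
    record["info"] = parse_info(record["info"])
    record["types"] = [classify(record["ref"], a) for a in record["alt"].split(",")]
    format_keys = fields[8].split(":")
    record["samples"] = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


line = "12\t12237\tab123\tT\tC\t30\tPASS\tAF=0.5;AN=2;DP=10\tGT\t0/1\t1/1"
record = parse_record(line, ["Sample1", "Sample2"])
calls = {name: describe_genotype(s["GT"]) for name, s in record["samples"].items()}
error_chance = 10 ** (-record["qual"] / 10)  # QUAL is Phred-scaled
print(record["types"], calls, error_chance)
```

Because QUAL is Phred-scaled, a value of 30 corresponds to roughly a 1 in 1,000 chance that the variant call is wrong.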
