Sequencing Technologies: A High-Level Overview

How do researchers read and store the genetic code of organisms? As you know, DNA holds information as a sequence of nucleotides: A, C, G, or T. Over the last half-century, researchers have developed numerous methods to sequence genomes:

Sanger Sequencing

Next-Generation Sequencing

Whole Genome Sequencing

How do we construct genomes from DNA fragments?

Central to next-generation sequencing is the concept of “shotgun sequencing”. Shotgun sequencing involves randomly breaking DNA into many small fragments, sequencing these fragments, and then using computational methods to reassemble the original sequence.

As the image above shows, next-generation sequencing generates data for many overlapping fragments of DNA; computer programs must then work out where the fragments overlap in order to assemble the full, continuous DNA sequence.
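To make the overlap idea concrete, here is a minimal sketch of a greedy assembler in Python. It is a toy under strong assumptions: the reads are error-free, all come from the same strand, and the function names (`overlap`, `greedy_assemble`) and the three reads are made up for illustration. Real assemblers use more robust approaches, such as overlap graphs or de Bruijn graphs, and must cope with sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> str:
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more; stop with separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return max(contigs, key=len)  # longest assembled contig


# Three made-up overlapping reads
reads = ["ATTGCAG", "GCAGGCT", "GGCTAAC"]
print(greedy_assemble(reads))  # ATTGCAGGCTAAC
```

Each pass compares every pair of reads, so this sketch scales poorly; it is meant only to show how overlaps let fragments be stitched back together.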

The DNA assembly process is a perfect example of how computational biology is foundational to modern approaches to understanding life. Without advanced computational tools, next-generation sequencing would only provide an expansive DNA jigsaw puzzle with no way to make sense of it.

Data Formats Overview

Once sequencing reads have been acquired using one of the technologies described above, they must be stored and manipulated. Three main data formats are used for sequencing data: FASTQ, BAM, and VCF. The table below gives a simplified summary of the data contained in each format.

Data Formats and Types of Data Stored

Format | Type of Data Stored
FASTQ | Raw sequence reads and their base quality scores
BAM | Sequence and the position in the reference genome where the read aligns
VCF | Genetic variants relative to a reference genome (mutations, alternative alleles, etc.)

Let’s learn about each data format in more detail.

FASTQ

FASTQ is a text-based format commonly used to store nucleotide sequencing data (e.g., “ATTGCAG”) and its corresponding quality scores. A quality score encodes the estimated probability that a base call (the predicted base) is wrong. In the FASTQ data format, each base's Phred quality score is stored as a single ASCII character.

42x2 Table with ASCII Characters for Phred Quality Scores
Key for Interpreting ASCII codes for Phred Quality Scores
ASCII Code ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I
Phred Score 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
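The key above follows a simple rule: the Phred score is the character's ASCII code minus 33 (for example, '!' has ASCII code 33, so it encodes a score of 0). A Phred score Q relates to the error probability P by Q = -10 × log10(P), so a score of 30 corresponds to a 1-in-1,000 chance that the base call is wrong. The Python sketch below illustrates both conversions; the function names are made up for this example, and it assumes the offset-33 encoding shown in the key.

```python
def phred_from_char(char: str) -> int:
    """Decode one FASTQ quality character (offset-33 encoding)."""
    return ord(char) - 33


def error_probability(score: int) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Print the score and error probability for a few characters from the key
for char in "!5?I":
    q = phred_from_char(char)
    print(char, q, f"{error_probability(q):.4f}")
```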

A higher Phred score indicates a higher confidence (or lower probability of error) in the base call. FASTQ files format the sequence and quality score information as follows:

  • Line 1: A sequence identifier starting with '@'.

  • Line 2: The raw sequence of nucleotides (A, T, C, G).

  • Line 3: A plus sign (+), optionally followed by the same identifier as in Line 1.

  • Line 4: A quality score string that encodes the confidence of each nucleotide call in Line 2. Higher scores indicate more confidence.

Here’s an example of a FASTQ entry:

@MY_SEQUENCE
TCTGAATTGGGTAACCCGCGAGTCCGGATTCGCTGAGAATACCGTAGGAT
+MY_SEQUENCE
42;D>3</F:58,C1.@603>?26<9!7:;/=.#@H"$,8&'B!=+08@"

The first and third lines include the identifier of the sequence. In this case, the sequence has been named “MY_SEQUENCE”. The string in line two represents the base calls, and the last line holds the associated Phred quality scores (our confidence in each base call), one character per base.
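The four-line layout described above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: it assumes each record occupies exactly four lines (sequences are not wrapped), that quality characters use the offset-33 encoding, and the `FastqRecord` type, the `read_fastq` function, and the short `read1` record are all made up for illustration.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: list[int]  # one Phred score per base


def read_fastq(handle: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text, skipping blank lines."""
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A short, made-up record
example = io.StringIO("@read1\nGATTACA\n+\nII5!#I@\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.qualities)
```

Checking that the sequence and quality strings have the same length is a cheap way to catch truncated or corrupted records.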

BAM

BAM stands for Binary Alignment/Map. As the name suggests, it stores alignment data in a binary format. A BAM file contains sequencing reads that have been aligned to a reference genome. (Alignment is the process of matching sequencing reads to a known reference genome, which helps determine where in the genome each read comes from.)

A BAM file stores the following types of data for each read:

  • Read Identifier: A unique ID for each read, usually inherited from the FASTQ file.

  • Aligned Sequence: The sequence of nucleotides (A, T, C, G) from the read.

  • Mapping Position: The reference sequence (e.g., a chromosome) and the leftmost position on it where the read aligns.

  • Mapping Quality: A score indicating the confidence of the alignment. Higher scores mean the alignment is more reliable. These scores can be based on factors such as the number of mismatches, gaps, and overall alignment length.

  • CIGAR String: A notation that describes how the read aligns with the reference genome, including matches, insertions, deletions, etc.

  • Additional Metadata: Other information like the read's orientation, the sequence it pairs with (if paired-end), and any observed differences (mutations) from the reference genome.

For example, in a SAM file (which is a text-based equivalent of BAM), an entry might look like this:

r001  99  ref  7  30  8M2I4M1D3M  =  37  39  TTAGATAAAGGATACTG  *  NM:i:1  MD:Z:8  PG:Z:hello

In this example:

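  • r001 is the read identifier.

  • 99 is the bitwise FLAG. Here 99 = 1 + 2 + 32 + 64, meaning the read is paired, the pair is properly aligned, the mate maps to the reverse strand, and this read is the first of the pair.

  • ref is the name of the reference sequence (for example, a chromosome) that the read is aligned to.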

  • 7 is the position where the read starts aligning.

  • 30 is the mapping quality score.

  • 8M2I4M1D3M is the CIGAR string: 8 aligned bases (matches or mismatches), a 2-base insertion, 4 aligned bases, a 1-base deletion, and 3 aligned bases.
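The CIGAR string from this example can be unpacked with a short Python sketch. It relies on the SAM convention that M, I, S, = and X operations consume bases of the read, while M, D, N, = and X consume positions on the reference. The helper names (`parse_cigar`, `cigar_lengths`) are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference positions


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern did not match
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read bases consumed, reference positions consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return read_len, ref_len


read_len, ref_len = cigar_lengths("8M2I4M1D3M")
start = 7  # mapping position from the example
print(read_len, ref_len, start + ref_len - 1)  # 17 16 22
```

For this read, the alignment uses 17 read bases and spans 16 reference positions, so a read starting at position 7 ends at position 22.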

A BAM file is essentially a binary, compressed version of this type of data, which makes it more efficient to store and process large amounts of sequencing data.

VCF

VCF stands for Variant Call Format. This format is used to store information about genetic variations relative to a reference genome. Here are some examples of genetic variations:

  • Single Nucleotide Polymorphisms (SNPs): A single nucleotide differs from the reference genome.

  • Insertions and Deletions (Indels): Nucleotides are either inserted or deleted relative to the reference genome. Indels can vary in size from a single base pair to several base pairs.

  • Copy Number Variations (CNVs): Segments of the genome are duplicated or deleted, leading to an alteration in the number of copies of a particular sequence.

  • Structural Variations (SVs): Larger-scale variations in the genome structure, including inversions, translocations, and large insertions or deletions that can range from hundreds to millions of base pairs.

These variations are crucial for understanding genetic diversity, disease susceptibility, and evolutionary processes across populations. Let’s go through an example to understand how to interpret a VCF file.

First, a VCF file has a header that provides metadata and describes the structure of the data contained within. It includes information such as the file format version, reference genome used, and definitions of the data fields.

##fileformat=VCFv4.3
##reference=GRCh38
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2

In this example:

  • Line 1: Specifies the VCF format version.

  • Line 2: Indicates the reference genome used (GRCh38 in this case).

  • Line 3: Defines additional information fields (INFO) such as allele frequency.

  • Line 4: Describes the format of genotype data (FORMAT field).

  • Line 5: Specifies column headers, including sample names.
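The "##" metadata lines follow a regular pattern, so they are easy to pick apart in code. The Python sketch below is a minimal illustration, not a complete VCF header parser: the function name `parse_meta_line` is made up, and it handles only the two shapes seen in this example (plain `##key=value` lines and structured `##KEY=<...>` lines).

```python
import re

# Matches key=value pairs; quoted values may contain commas
ATTRIBUTE = re.compile(r'(\w+)=("[^"]*"|[^,>]*)')


def parse_meta_line(line: str):
    """Parse one '##' header line into (key, value).

    Structured lines such as ##INFO=<...> return a dict of attributes;
    simple lines such as ##reference=GRCh38 return a plain string.
    """
    key, _, value = line[2:].strip().partition("=")
    if value.startswith("<") and value.endswith(">"):
        attributes = {k: v.strip('"') for k, v in ATTRIBUTE.findall(value[1:-1])}
        return key, attributes
    return key, value


header = [
    "##fileformat=VCFv4.3",
    "##reference=GRCh38",
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
]
for line in header:
    print(parse_meta_line(line))
```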

After the header section, a VCF file contains one entry for each genetic variant relative to the reference genome. Each VCF entry contains:

  • CHROM: Chromosome (or other reference sequence) name.

  • POS: Position in the chromosome.

  • ID: Identifier for the variant.

  • REF: Reference allele.

  • ALT: Alternative allele(s).

  • QUAL: Quality score of the variant call.

  • FILTER: Filter status (e.g., passed or failed certain criteria).

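  • INFO: Additional information about the variant.

  • FORMAT and sample columns: When genotype data is present, a FORMAT column names the per-sample fields (such as GT), followed by one column per sample.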

Here’s a typical entry (the column header line is included for easy interpretation):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
12 12237 ab123 T C 30 PASS AF=0.5;AN=2;DP=10 GT 0/1 1/1

In this entry:

  • 12: Chromosome number.

  • 12237: Position on the chromosome.

  • ab123: Variant ID (usually a unique identifier, if available).

  • T: Reference allele.

  • C: Alternate allele observed in the sample(s).

  • 30: Phred-scaled quality score for the variant call. A value of 30 corresponds to an estimated 1-in-1,000 chance that the call is wrong.

  • PASS: Filter status indicating that the variant passed all quality filters.

  • AF=0.5;AN=2;DP=10: INFO field containing additional information such as allele frequency (AF), total number of alleles in called genotypes (AN), and read depth (DP).

  • GT: FORMAT field naming the per-sample data that follows, here the genotype (GT).

  • 0/1, 1/1: Genotype calls for each sample (in this example, two samples: Sample1 and Sample2). 0 denotes the reference allele and 1 the first alternate allele, so Sample1 is heterozygous (T/C) and Sample2 is homozygous for the alternate allele (C/C).
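To tie the fields together, here is a minimal Python sketch that parses one VCF entry and spells out each sample's genotype. It assumes the layout from the VCF specification, in which a FORMAT column (here GT) sits between INFO and the per-sample columns. The function names are made up for illustration, and the sketch splits on any whitespace, whereas real VCF files are tab-delimited.

```python
import re


def parse_info(info: str) -> dict:
    """Split 'AF=0.5;AN=2' into a dict; keys without '=' become True flags."""
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_record(line: str, sample_names: list[str]) -> dict:
    """Parse one VCF data line that includes FORMAT and sample columns."""
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    format_keys = fields[8].split(":")
    samples = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref,
        "alts": alt.split(","), "qual": float(qual), "filter": flt,
        "info": parse_info(info), "samples": samples,
    }


def describe_genotype(gt: str, ref: str, alts: list[str]) -> str:
    """Translate allele indices such as '0/1' into the actual alleles."""
    alleles = [ref] + alts
    indices = re.split(r"[/|]", gt)
    if "." in indices:
        return "missing"
    called = [alleles[int(i)] for i in indices]
    if len(set(called)) == 1:
        return "homozygous " + called[0]
    return "heterozygous " + "/".join(called)


line = "12 12237 ab123 T C 30 PASS AF=0.5;AN=2;DP=10 GT 0/1 1/1"
record = parse_record(line, ["Sample1", "Sample2"])
for name, data in record["samples"].items():
    print(name, describe_genotype(data["GT"], record["ref"], record["alts"]))
```

Run on the entry above, this reports Sample1 as heterozygous T/C and Sample2 as homozygous C.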