
File Formats for Storing Genomic Data

· 16 min read


Modern genomics relies on high-throughput next-generation sequencing (NGS) technologies, which generate massive volumes of biological data. Efficient storage, processing, and exchange of this data require specialized file formats, each optimized for a specific stage of analysis.

At first glance, the genomic ecosystem may appear fragmented: dozens of formats, some text-based, some binary, and some domain-specific. However, these formats are not arbitrary. Instead, they form a well-defined hierarchy that closely mirrors the standard data-processing pipelines used in genomics.

In most genomic workflows, data passes through several conceptual stages:

Stage 0 — Raw sequence data
Immediately after sequencing, data is represented as raw nucleotide sequences, with or without base-level quality scores. Regardless of whether the experiment involves bulk DNA sequencing, RNA-seq, single-cell, or spatial transcriptomics, the initial output is always conceptually similar. These data are stored in FASTA or FASTQ formats.

Stage 1 — Read alignment
Raw reads are then aligned to a reference genome. This process produces alignment-based formats such as SAM, BAM, or CRAM, which describe how each read maps to genomic coordinates and how reliable that alignment is.

Stage 2 — Derived biological representations
Aligned reads enable higher-level biological interpretations. Variant calling produces VCF or BCF files, genome annotation relies on formats such as GFF, GTF, or GVF, and single-cell or spatial transcriptomics workflows often transform alignment data into matrix-based formats such as h5ad.

Stage 3 — Specialized and auxiliary formats
Some formats do not fit neatly into this linear pipeline but serve important complementary roles. These include consumer genotyping outputs (e.g., 23andMe), cloud-native array storage formats (Zarr), visualization and browser-specific formats (TRACK), and platform-specific internal formats such as Illumina LOC files.

Understanding this hierarchy reveals the underlying structure of genomic data processing and clarifies why so many formats exist. In the following sections, we examine each format individually while keeping its role in the broader analytical workflow in mind.

1. FASTA

The FASTA format is one of the most fundamental and widely used formats in bioinformatics. It is designed to store nucleotide or amino acid sequences without any associated quality information.

A FASTA file consists of one or more records. Each record begins with a single-line header marked by the greater-than symbol (>) in the first column, followed by one or more lines containing the sequence itself. The word following the > symbol, up to the first space, is the sequence identifier; the rest of the header line is an optional description.

>sequence_1 Homo_sapiens chromosome_1
ATGCTAGCTAGCTACGATCGATCGATCG
GCTAGCTAGCTAGCATCGATCGATCGA

This is followed by lines containing the biological sequences themselves. Typically, lines in FASTA format are limited to 80–120 characters in length (for historical reasons), but modern programs can recognize sequences written entirely on a single line. A single file may contain multiple sequences.
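The record structure described above can be read with a few lines of Python. This is a minimal sketch using only the standard library; the function name `parse_fasta` is made up for illustration, and the sketch assumes well-formed input in which every header has an identifier.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence}, joining wrapped sequence lines."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is description.
            current = line[1:].split(maxsplit=1)[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>sequence_1 Homo_sapiens chromosome_1
ATGCTAGCTAGCTACGATCGATCGATCG
GCTAGCTAGCTAGCATCGATCGATCGA
"""
sequences = parse_fasta(example.splitlines())
print(sequences["sequence_1"])
```

Because wrapped lines are simply concatenated, the same function handles both wrapped and single-line sequences.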


2. FASTQ

FASTQ is a standard format for storing raw sequencing data generated using NGS platforms. Unlike FASTA, it includes not only the nucleotide sequence but also quality information for each base.

Each FASTQ record consists of four lines: a read identifier, the sequence, a separator, and a quality string encoded using ASCII characters according to the Phred scale. This allows the reliability of each nucleotide to be quantitatively assessed.

@sequence_1
ATGCTAGCTAGCTACGATCG
+
IIIIIIIIIIIIIIIIIIII

FASTQ is the primary input format for most bioinformatics pipelines, including quality control, filtering, alignment, and assembly. A significant drawback of the format is its large size: uncompressed FASTQ files require substantial disk space and I/O, so they are usually stored gzip-compressed (for example, as .fastq.gz).
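To make the quality encoding concrete, here is a small sketch that decodes a quality string into Phred scores and converts a score into an error probability. It assumes the Phred+33 ASCII offset used by Sanger-style and current Illumina output; older files may use a different offset. The function names are illustrative.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("IIIIIIIIIIIIIIIIIIII")
print(scores[0], error_probability(scores[0]))
```

With this offset, the character `I` in the example record corresponds to a score of 40, that is, an estimated error rate of 1 in 10,000 for that base.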


3. SAM

The SAM (Sequence Alignment/Map) format is a text-based format used to store the results of aligning sequenced reads to a reference sequence. It was developed as a universal and human-readable standard for representing alignment data and serves as the foundation for the binary BAM format.

The SAM format is used to represent information on how sequences generated by next-generation sequencing (NGS) technologies correspond to a reference genome. It is primarily applied at the stages of debugging, analysis, and interpretation of alignment data, as well as for data exchange between various bioinformatics tools.

A SAM file is a plain text file and consists of two main parts: a header and an alignment section.

The header of a SAM file contains lines starting with the @ symbol and includes metadata about the data format, reference sequences, alignment parameters, read groups, and the software tools used in the analysis. Although the header is optional, its presence significantly improves the correctness and reproducibility of the analysis.

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
@PG	ID:bwa	PN:bwa	VN:0.7.17

read_001	0	chr1	100	255	10M	*	0	0	ATGCTAGCTA	IIIIIIIIII

The main part of a SAM file consists of lines, each corresponding to a single sequenced read. Each record contains eleven mandatory fields separated by tab characters: the read identifier, a bitwise flag, the reference sequence name, the alignment position, the mapping quality score, the CIGAR string, three fields describing the mate of a paired read (its reference name, its position, and the template length), the nucleotide sequence, and base quality scores. In addition, optional tags may be present, providing extended annotation information.

The key drawback of the SAM format is its large file size compared to binary formats.
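The following sketch splits one alignment line into its eleven mandatory fields, decodes a few bits of the flag, and breaks the CIGAR string into operations. It is an illustration using only the standard library, not a replacement for a real SAM parser; the helper names and the subset of flag bits shown are choices made for this example.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# A subset of the flag bits defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x100: "secondary",
    0x400: "duplicate",
}


def parse_sam_line(line: str) -> dict:
    """Split a tab-delimited alignment line into named fields."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = parts[11:]  # optional TAG:TYPE:VALUE fields, if any
    return record


def decode_flag(flag: int) -> list:
    """Return the names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> list:
    """Turn '10M2I5M' into [(10, 'M'), (2, 'I'), (5, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


record = parse_sam_line("read_001\t0\tchr1\t100\t255\t10M\t*\t0\t0\tATGCTAGCTA\tIIIIIIIIII")
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]))
```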


Although BAM is often described as a “compressed SAM file,” this description is somewhat misleading.

BAM is not created by simply applying a generic compression algorithm (such as ZIP or gzip) to a SAM file. Instead, BAM uses a specialized format called BGZF (Blocked GNU Zip Format). BGZF is based on the standard DEFLATE compression algorithm (the same algorithm used by gzip), but it introduces a block-based structure that enables efficient random access.

This design allows BAM files to be indexed and queried by genomic coordinates, which would not be possible with a naïvely compressed text file. As a result, BAM achieves both compact storage and high-performance data access — properties that are essential for large-scale genomic analyses.
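The idea behind block-based compression can be sketched with Python's standard `gzip` module. This is a simplified illustration, not the real BGZF layout: actual BGZF blocks carry an extra header field recording the block size and are capped at 64 KiB, whereas the sketch below keeps a separate list of offsets. The helper names are hypothetical.

```python
import gzip


def write_blocks(chunks):
    """Compress each chunk as its own gzip member; return (blob, offsets)."""
    blob = b""
    offsets = []
    for chunk in chunks:
        offsets.append(len(blob))  # where this block starts in the file
        blob += gzip.compress(chunk)
    offsets.append(len(blob))  # sentinel marking the end of the last block
    return blob, offsets


def read_block(blob, offsets, index):
    """Decompress a single block without touching the others."""
    start, end = offsets[index], offsets[index + 1]
    return gzip.decompress(blob[start:end])


chunks = [b"read_001\n", b"read_002\n", b"read_003\n"]
blob, offsets = write_blocks(chunks)

# Random access: only the second block is decompressed.
print(read_block(blob, offsets, 1))

# The concatenation is still a valid multi-member gzip stream.
print(gzip.decompress(blob) == b"".join(chunks))
```

An index only has to map genomic coordinates to block offsets; the reader can then jump to the right block and decompress just that part of the file.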

4. BAM

The BAM (Binary Alignment/Map) format is a binary representation of the SAM format and is widely used in bioinformatics for storing the results of aligning next-generation sequencing (NGS) data to a reference genome.

The BAM format is used to store information about the positioning of sequenced reads relative to a reference sequence. It is applied after the alignment stage of data obtained in FASTQ format, using specialized algorithms and software tools such as BWA, Bowtie2, or STAR.

A BAM file has a binary structure and logically consists of two main components: a header and alignment records. The header contains metadata about the file and alignment parameters, including information on reference sequences, the format version, the software tools used, and read groups. The presence of a correct header is essential for proper data interpretation.

For efficient processing, BAM files are typically used together with index files in .bai or .csi formats. Indexing enables fast access to data corresponding to specific genomic regions without the need to read the entire file sequentially.

One of the main advantages of the BAM format is its compactness.


5. CRAM

The CRAM (Compressed Reference-oriented Alignment Map) format is a compressed binary format designed for storing alignment results of next-generation sequencing (NGS) data. It was developed as a more storage-efficient alternative to the BAM format and is based on the use of a reference sequence for data compression.

CRAM is especially valuable in projects where minimizing storage requirements is critical, such as biobanks, clinical studies, and large genomic consortia.

A CRAM file has a binary structure and logically consists of a header and data blocks. The header contains metadata about the format, reference sequences, alignment parameters, and the software tools used.

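The key difference between CRAM and BAM is that read bases are stored as differences relative to the reference genome rather than in full: a read that matches the reference needs only its position and length, plus any mismatching bases. Base quality scores cannot be derived from the reference, so they are compressed separately with dedicated codecs and can optionally be binned (lossy) to save further space. Together, these techniques enable highly efficient data compression.

A defining feature of the CRAM format is reference-based compression. To correctly read and decode a CRAM file, the same version of the reference genome used during file creation is required. The header records an MD5 checksum for each reference sequence so that tools can detect a mismatched reference instead of silently decoding incorrect bases.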
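A toy example shows why the reference is needed for decoding. The sketch below stores a read as its position, its length, and a list of mismatching bases. It handles substitutions only and is far simpler than the real CRAM codecs; the function names and the sequences are made up for illustration.

```python
def encode_read(reference: str, pos: int, read: str):
    """Store a read as (pos, length, [(offset, base), ...]) against the reference.

    pos is 0-based; only substitutions are handled in this sketch.
    """
    diffs = [(i, base) for i, base in enumerate(read) if reference[pos + i] != base]
    return pos, len(read), diffs


def decode_read(reference: str, encoded) -> str:
    """Rebuild the read by copying the reference and applying the differences."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ATGCTAGCTAGCTACGATCG"
read = "TAGCTTGC"  # differs from reference[4:12] at one position

encoded = encode_read(reference, 4, read)
print(encoded)
print(decode_read(reference, encoded) == read)

# Decoding against a different reference silently produces the wrong bases.
other_reference = "ATGCAAGCTAGCTACGATCG"
print(decode_read(other_reference, encoded) == read)
```

An eight-base read is reduced to a position, a length, and a single substitution, but the result is only meaningful together with the exact reference it was encoded against.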

The main advantage of CRAM is its substantially reduced file size compared to BAM, which lowers storage requirements and facilitates faster data transfer. The format also supports indexing, enabling rapid access to specific genomic regions.


6. VCF

The VCF (Variant Call Format) is used to store information about genetic variants identified during sequencing data analysis. It is designed to represent single nucleotide polymorphisms (SNPs), insertions and deletions (indels), as well as more complex structural variants.

A VCF file is text-based and consists of a header and a data section. The header contains metadata describing the format, analysis parameters, and annotations. The main section includes variant coordinates, reference and alternative alleles, quality scores, filter status, and genotype information for one or more samples.

##fileformat=VCFv4.2
##source=FreeBayes
##reference=hg38
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	105	rs123456	A	G	99	PASS	DP=42

VCF is the standard output format for variant calling pipelines and is widely used in population genetics, clinical genomics, and genome-wide association studies.
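A data line can be split into the eight fixed columns with the standard library alone. The sketch below is illustrative: it ignores the header and any per-sample genotype columns, and the function names are hypothetical.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> dict:
    """Parse 'DP=42;DB' into {'DP': '42', 'DB': True}."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        result[key] = value if separator else True  # flags have no value
    return result


def parse_vcf_line(line: str) -> dict:
    """Split one tab-delimited VCF data line into its fixed columns."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, parts[:8]))
    record["POS"] = int(record["POS"])       # 1-based position
    record["ALT"] = record["ALT"].split(",")  # several alternative alleles are allowed
    record["INFO"] = parse_info(record["INFO"])
    return record


variant = parse_vcf_line("chr1\t105\trs123456\tA\tG\t99\tPASS\tDP=42")
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], variant["INFO"])
```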


BCF follows the same conceptual logic as BAM in relation to SAM. While VCF is a human-readable, text-based format, BCF encodes the same information using a binary representation.

This binary encoding significantly improves parsing speed and reduces file size, making BCF particularly useful for large cohort studies and high-throughput variant analyses. Importantly, BCF is not a compressed VCF file in the traditional sense; it is a structurally different representation optimized for computational efficiency.

7. BCF

The BCF (Binary Call Format) is the binary counterpart of the VCF format. It was developed to improve performance and reduce file size when handling large-scale variant datasets.

BCF preserves the logical structure and fields of VCF but encodes them in a binary form, enabling faster reading, writing, and processing. The format is read and written by bcftools and other HTSlib-based tools.

8. GFF

GFF (General Feature Format) is a file format used for storing annotations of genes and other features of DNA, RNA, and protein sequences.

A GFF file is a text file in which each functional genomic element is represented by a single line. Each line contains nine fields separated by tab characters. This structure allows for efficient and straightforward extraction of the required data.

IV	curated	mRNA	5506800	5508917	.	+	.	Transcript B0273.1; Note "Zn-Finger"
IV	curated	5'UTR	5506800	5508999	.	+	.	Transcript B0273.1
IV	curated	exon	5506900	5506996	.	+	.	Transcript B0273.1
IV	curated	exon	5506026	5506382	.	+	.	Transcript B0273.1
IV	curated	exon	5506558	5506660	.	+	.	Transcript B0273.1
IV	curated	exon	5506738	5506852	.	+	.	Transcript B0273.1
IV	curated	3'UTR	5506852	5508917	.	+	.	Transcript B0273.1

It is used to store the results of gene prediction or experimental determination of genes, as well as more complex functional elements of the genome.


9. GTF

The GTF (Gene Transfer Format) is a specialized variant of GFF and is primarily used to represent gene and transcript structures. It is widely applied in RNA-seq analyses.

GTF provides detailed hierarchical information about genes, transcripts, and exons, making it suitable for expression quantification.

chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2";
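The gene, transcript, and exon hierarchy lives in the ninth column, so most of the work in reading GTF is parsing that attribute string. Here is a minimal sketch that assumes tab-separated columns and quoted attribute values, as in the example above; unquoted values, which some files use, are not handled. The names are illustrative.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line: str) -> dict:
    """Split one GTF line into its nine columns and parse the attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = columns
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": dict(ATTRIBUTE_PATTERN.findall(attributes)),
    }


line = (
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)
exon = parse_gtf_line(line)
exon_length = exon["end"] - exon["start"] + 1
print(exon["attributes"]["transcript_id"], exon_length)
```

Grouping parsed exon records by `transcript_id` and transcripts by `gene_id` reconstructs the hierarchy that quantification tools rely on.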



10. GVF

The GVF (Genome Variation Format) is an extension of GFF designed to store genome variation data with annotations.

GVF enables the description of variant effects, mutation types, and functional impact within the context of genomic annotations.

11. BED

BED (Browser Extensible Data) format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

# chrom  chromStart  chromEnd  name  score  strand  thickStart  thickEnd  itemRgb  blockCount  blockSizes  blockStarts
chr1	1000	2000	GeneA	100	+	1050	1950	0,0,255	5	10,20,30,20,10	0,100,300,600,990
chr1	3000	4000	GeneB	50	-	3100	3900	255,0,0	3	15,20,15	0,100,985
chr2	5000	5500	PeakC	200	.	5000	5500	0	1	500	0

A BED file consists of lines, each describing a single genomic interval.

The BED structure allows for compact and clear representation of genomic coordinate data and enables visualization in genome browsers such as the UCSC Genome Browser.
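One detail that often causes off-by-one errors is the coordinate system: BED uses 0-based, half-open intervals (chromStart is included, chromEnd is not), while formats such as GFF, GTF, VCF, and SAM use 1-based positions. The sketch below reads the required fields and converts between the two conventions; the helper names are illustrative.

```python
def parse_bed_line(line: str) -> dict:
    """Read the three required BED fields, plus the name if present."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    interval = {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2])}
    if len(fields) > 3:
        interval["name"] = fields[3]
    return interval


def interval_length(interval: dict) -> int:
    """Half-open intervals make the length a simple subtraction."""
    return interval["end"] - interval["start"]


def to_one_based(interval: dict) -> tuple:
    """Convert to a 1-based, inclusive (chrom, start, end) triple."""
    return interval["chrom"], interval["start"] + 1, interval["end"]


gene_a = parse_bed_line("chr1\t1000\t2000\tGeneA")
print(interval_length(gene_a), to_one_based(gene_a))
```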


12. TRACK

The TRACK format is used in the UCSC Genome Browser to define and customize user visualization tracks for genomic data. This format allows researchers to specify display parameters such as color, name, display type, and other properties, providing a clear visual representation of genomic datasets.

A TRACK file is a text file that begins with a track directive, which sets metadata for the track, followed by the actual data associated with a chosen format (e.g., BED, WIG, GFF).

track type=bedGraph name="Coverage" description="Read Depth" visibility=2 color=0,0,255 altColor=255,0,0
chr1 10000 10100 10.5
chr1 10100 10200 12.0
chr1 10200 10300 8.1
chr1 10300 10400 15.3
chrX 500000 500100 25.0
chrX 500100 500200 28.2

Following the track directive are lines containing coordinate data and additional parameters defined by the data format (BED, WIG, GFF). The TRACK format itself does not store the data but serves as a visualization controller.
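Since the track line is a sequence of key=value pairs with optional quoting, it can be parsed with the standard `shlex` module. This is a small sketch with a hypothetical helper name; it returns every setting as a string and does not validate the keys.

```python
import shlex


def parse_track_line(line: str) -> dict:
    """Parse a 'track key=value ...' directive into a dictionary of settings."""
    tokens = shlex.split(line)  # honours quotes, so "Read Depth" stays one value
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    return dict(token.split("=", 1) for token in tokens[1:])


settings = parse_track_line(
    'track name="Coverage" description="Read Depth" visibility=2 '
    'color=0,0,255 altColor=255,0,0'
)
print(settings["name"], settings["description"], settings["color"])
```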

13. 23andMe

The 23andMe format is a tab-delimited text format used to store consumer-level genotyping results. This format is employed by commercial services such as 23andMe to provide users with information about their single nucleotide polymorphisms (SNPs) and other genetic variants.

# rsid	chromosome	position	genotype
rs4477212	1	82154	AA
rs3094315	1	752566	AG

Files may also include additional metadata about the platform version or reference genome build used for genotyping.
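Because the format is a simple four-column table with comment lines starting with #, it can be loaded without any special library. A minimal sketch, assuming tab-delimited input as described above (the function name is made up for illustration):

```python
def parse_23andme(lines) -> dict:
    """Return {rsid: (chromosome, position, genotype)}, skipping comment lines."""
    genotypes = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # metadata and the column header are comments
        rsid, chromosome, position, genotype = line.split("\t")
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes


example = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs4477212\t1\t82154\tAA",
    "rs3094315\t1\t752566\tAG",
]
calls = parse_23andme(example)
print(calls["rs3094315"])
```

The chromosome is kept as a string because values such as X, Y, and MT also occur in this column.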


14. h5ad

The h5ad format is a binary format based on HDF5 and is used for storing single-cell transcriptomics data. It is the on-disk representation of the AnnData object, was developed within the Python ecosystem, and is tightly integrated with the Scanpy library, which is designed for analysis and visualization of large single-cell RNA sequencing datasets.

h5ad enables storage not only of the gene expression matrix but also of cell and gene metadata, clustering results, low-dimensional embeddings (e.g., UMAP or t-SNE), and annotations, making it a comprehensive format for all stages of single-cell transcriptomics analysis.

An h5ad file is binary and based on the hierarchical HDF5 structure, which supports storage of multidimensional arrays and attribute dictionaries.

This approach allows all necessary data for analysis to be stored in a single file, improving reproducibility and simplifying data sharing between researchers.

15. Zarr

The Zarr data format is an open, community-maintained format designed for efficient, scalable storage of large N-dimensional arrays. It stores data as compressed and chunked arrays in a format well-suited to parallel processing and cloud-native workflows.

In Python, the Zarr library is the usual way to create, read, and analyze these multi-dimensional arrays.


16. Illumina LOC files

The LOC (Location) file format is used in Illumina sequencing platforms to store spatial information about cluster positions on the flowcell surface. These files are part of Illumina’s internal data formats and are involved in the early stages of sequencing data processing.

LOC files contain X and Y coordinates for each cluster detected during image analysis. These coordinates are essential for mapping fluorescence signals to individual clusters during downstream image processing and base calling steps.

Typically, LOC files are generated after image analysis and before base calling. Earlier versions of Illumina software stored cluster coordinates in text-based _pos.txt files. Later versions moved to the binary .locs format and its compressed variant .clocs, which significantly reduce file size and improve processing performance.

LOC files are not intended for direct end-user analysis and are mainly consumed by internal Illumina tools such as RTA (Real-Time Analysis) and bcl2fastq. However, understanding their role and structure is important for bioinformaticians working with low-level sequencing data, developing custom pipelines, or investigating the internal workflows of Illumina sequencing systems.


Conclusion

Despite the apparent diversity of genomic file formats, they form a coherent and logically structured system aligned with the main stages of genomic data analysis. From raw sequencing reads to aligned data, annotated genomes, and high-level biological interpretations, each format reflects a specific computational and conceptual need.

This hierarchy not only simplifies data processing but also enables scalability, reproducibility, and interoperability across tools and research groups. At the same time, the continued development of formats such as CRAM and Zarr highlights an ongoing effort to improve storage efficiency, data transfer, and cloud-native analysis.

As sequencing technologies evolve and datasets continue to grow, file formats will remain a critical component of genomic infrastructure — balancing human readability, computational performance, and long-term data sustainability.

Genetic Database Website

· 5 min read


This article will be useful for researchers performing comparative genomics, scientists identifying clinically relevant variants, molecular biologists, students and educators seeking reference sequences.

The three main global repositories, GenBank (USA), DDBJ (Japan), and ENA (Europe), exchange data daily as part of the International Nucleotide Sequence Database Collaboration (INSDC). In addition to these, other major repositories such as China's GSA further contribute to global data sharing.

INSDC (USA (GenBank), Japan (DDBJ) and Europe (ENA))

Data submitted to any of these databases are automatically exchanged on a daily basis, ensuring the global dissemination of sequence records across all three platforms. This systematic synchronization guarantees consistent and equivalent access to identical datasets, irrespective of the INSDC partner used for data submission or retrieval.

USA (NCBI)

GenBank represents one of the largest and most comprehensive publicly accessible repositories of nucleotide sequence data worldwide. Submitted records undergo both automated and manual curation procedures aimed at ensuring data integrity, accuracy, and compliance with established formatting and annotation standards. The database is closely integrated within the broader NCBI ecosystem, encompassing resources such as PubMed, BLAST, RefSeq, dbSNP, and ClinVar. This integration facilitates a wide range of research activities, including similarity searches using BLAST, access to high-quality curated reference sequences, linkage of genomic data with the scientific literature, identification of clinically relevant variants, and seamless navigation across interconnected biological databases.

How to download data from this repository?

GenBank sequence records may be downloaded from the FTP site or accessed using NCBI's E-utilities API. SRA sequence records are available through the SRA Toolkit or on the Amazon Web Services (AWS) and Google Cloud Platform (GCP) clouds. SRA availability on cloud platforms enables rapid access to large datasets.
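As an illustration of the E-utilities route, the sketch below builds a request URL for the `efetch` endpoint, which returns a record in a chosen format. The helper name and the accession are illustrative, and the sketch only constructs the URL; the actual download is left as a comment so that the example runs without network access.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, db: str = "nuccore",
               rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an efetch URL that requests one record in the given format."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": retmode})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Illustrative accession; replace it with the record you actually need.
url = efetch_url("NM_000546")
print(url)

# To download the record (requires network access):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     fasta_text = response.read().decode()
```

For anything beyond occasional requests, NCBI asks clients to respect its rate limits and to register an API key.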

The SRA Toolkit is a collection of command-line utilities provided by NCBI for accessing, downloading, and converting sequencing data stored in the Sequence Read Archive (SRA). It is the primary toolset used by researchers to work with raw high-throughput sequencing data locally or within automated pipelines.

Submission

Data are submitted to NCBI through the Submission Portal.

Japan (DDBJ)

DDBJ provides efficient support for next-generation sequencing (NGS) data, including large-scale datasets such as metagenomic and transcriptomic data. Its infrastructure is optimized for high-volume data deposition, making DDBJ an efficient option for projects involving extensive NGS output or requiring rapid, large-scale data submission workflows.

Services: GEA, MetaboBank, BioProject, BioSample, MSS, NSSS, AGD, JGA (Japanese Genotype-phenotype Archive), NBDC Human Database, ARSA, getentry, DDBJ Search, TXSearch, GGGenome, GGRNA, DFAST, CRISPRdirect, Gendoo, RefEx, DDBJ Core Database, DDBJ-LD, TogoVar, TogoVar-repository, VecScreen.

DRA

DDBJ Sequence Read Archive (DRA) stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis. DRA is part of the International Nucleotide Sequence Database Collaboration (INSDC) and archives data in collaboration with the NCBI Sequence Read Archive and the EBI European Nucleotide Archive.

How to download data from this repository?

You can download data files in formats such as FASTQ and SRA.

DRA Submission

Create a DDBJ account and register a public key to your account. Upload data files to the submission directory on the file server.

Europe (ENA)

ENA supports a broad spectrum of data types, ranging from raw next-generation sequencing (NGS) reads to assembled genomes and annotated sequences. It is closely integrated with other EMBL-EBI databases (e.g., UniProt, Ensembl), enabling richer biological insights, and is convenient for uploading data in accordance with European legislation requirements.

Access to ENA data is provided through the browser, through search tools, through large scale file download and through the API.

How to download data from this repository?

Providing users with the ability to download submitted data for further analysis purposes is a key part of ENA’s mission. Files are therefore made available through a public FTP server.

ENA provides several ways to access the data it hosts, suiting a range of use cases and levels of computational ability.

Submitting and updating data

See the ENA Getting Started guide.

China (GSA)

A key strength of GSA (Genome Sequence Archive) is its strong support for next-generation sequencing (NGS) data, including whole-genome, transcriptomic, epigenomic, and metagenomic datasets.

These centers offer access to specialized datasets, analytical tools, training resources, and computational infrastructure. They are particularly useful when working with population-specific datasets or region-specific research initiatives.

How to download data from this repository?

Public datasets can be downloaded directly from the web interface or via FTP. On each dataset page, a Download option is available.

Submitting and updating data

  • Account registration
  • Create a project and samples
  • Prepare metadata
  • Upload files
  • Validation and release

There are also additional databases such as:

By leveraging these global resources, the scientific community can efficiently share data, maintain data integrity, and accelerate genomic research, contributing to a more open, collaborative, and data-driven future in molecular biology and genomics.


Top 5 Major DNA Sequencing Equipment Producers

· 2 min read

Here is a comprehensive list of companies that produce DNA sequencing equipment, spanning established multinational corporations and innovative startups.

Major DNA Sequencing Equipment Producers

  • Illumina: Globally dominant, known for NovaSeq, MiSeq, HiSeq platforms. They have driven innovation and cost reductions in high-throughput sequencing.
  • Thermo Fisher Scientific: Offers a diverse range of sequencers for various research and clinical needs, including capillary and next-generation sequencing platforms[1][2][3][6][7].
  • MGI/BGI: Chinese manufacturer specializing in high-throughput sequencing (DNBSEQ technology), widely used in research and commercial applications.
  • Oxford Nanopore Technologies: Known for portable devices like MinION and GridION enabling real-time, direct DNA/RNA sequencing.
  • Pacific Biosciences (PacBio): Focuses on long-read sequencing platforms, instrumental for highly accurate and complex genomics tasks.

Other Notable Companies

  • Element Biosciences: AVITI sequencing platform for scalable, cost-effective applications.
  • Ultima Genomics: Developing ultra-high-throughput, affordable sequencing solutions for large projects.
  • Genia (acquired by Roche): Involved in developing nanopore sequencing technology.
  • GeneMind: Specializes in DNA sequencer R&D for molecular diagnostic platforms.
  • Vela Diagnostics, Verogen: Known for specialized clinical and forensic sequencing devices.

Table: Representative Producers and Platforms

Company | Flagship Technology/Platform
Illumina | NovaSeq, MiSeq, HiSeq
Thermo Fisher | Ion Torrent, Applied Biosystems
MGI/BGI | DNBSEQ-G50/T7/2000
Oxford Nanopore | MinION, GridION
PacBio | Sequel, RS II
Element Biosciences | AVITI
Ultima Genomics | UG 100