Generic expression and annotation data file formats

If your expression or annotation data is in Excel and can be exported as a .txt file, or if you can format your data files by scripting or other means, you can import them into the CLC Genomics Workbench as a 'generic' expression or annotation data file. The few requirements such files must meet are described below.

Generic expression data table format

The CLC Genomics Workbench will import a tab, semicolon or comma-separated .txt or .csv file as expression array samples if the following requirements are met:

  1. The first non-empty line of the file contains text. All entries except the first are used as sample names.
  2. Each following non-empty line contains the same number of entries as the first non-empty line. The first entry must be a string (used as the feature ID) and the remaining entries must be numbers (used as expression values, one per sample). Empty entries are not allowed, but NaN values are.
  3. The file contains at least two samples.

An example of this format is shown below:
This will be imported as three samples with eight genes in each sample.
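The requirements above can be checked with a short script before import. The following Python sketch is not part of the CLC Genomics Workbench: the function name `check_expression_table` and the sample data are hypothetical, and the sketch assumes a single known delimiter and treats anything Python's `float()` accepts (including NaN) as a number.

```python
import csv
import io


def check_expression_table(text, delimiter="\t"):
    """Check a generic expression table; return (sample_names, feature_ids).

    Raises ValueError when one of the documented requirements is not met.
    """
    # Keep only non-empty lines, as the requirements refer to those.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)
            if any(cell.strip() for cell in row)]
    if not rows:
        raise ValueError("file contains no non-empty lines")

    header = rows[0]
    sample_names = header[1:]  # every entry except the first is a sample name
    if len(sample_names) < 2:
        raise ValueError("at least two samples are required")

    feature_ids = []
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            raise ValueError(f"data row {number}: expected {len(header)} "
                             f"entries, found {len(row)}")
        feature_id, values = row[0], row[1:]
        if not feature_id.strip():
            raise ValueError(f"data row {number}: empty feature ID")
        for value in values:
            if not value.strip():
                raise ValueError(f"data row {number}: empty entry")
            float(value)  # raises ValueError for non-numeric text; accepts "NaN"
        feature_ids.append(feature_id)
    return sample_names, feature_ids


# Hypothetical tab-separated data: three samples, two features.
demo = "ID\tS1\tS2\tS3\ngeneA\t1.5\t2.0\tNaN\ngeneB\t0.1\t0.2\t0.3\n"
samples, features = check_expression_table(demo)
print(samples, features)
```

For semicolon- or comma-separated files, pass `delimiter=";"` or `delimiter=","`.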

Generic annotation file for expression data format

The CLC Genomics Workbench will import a tab, semicolon or comma-separated .txt or .csv file as an annotation file if the following requirements are met:

  1. It has a line that can serve as a valid header line. For this, at least two of the headers on that line must be among the valid column headers listed in the Column header column below.
  2. It contains one of the PROBE_ID headers (that is, 'Probe Set ID', 'Feature ID', 'ProbeID' or 'Probe_Id').

The importer creates an annotation table with a column for each of the valid column headers (those in the Column header column below). Columns with invalid headers are ignored.

Note that some column headers are alternatives, so only one of the alternative column headers should be used.
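The two header requirements can be sketched as a small check. This Python example is illustrative only and does not reproduce the importer's actual logic: the function name `find_header_line` is hypothetical, `VALID_HEADERS` holds only a subset of the headers listed in this section, and matching is assumed to be exact and case-sensitive, with the PROBE_ID header expected on the header line itself.

```python
import csv
import io

# The PROBE_ID headers named in requirement 2.
PROBE_ID_HEADERS = {"Probe Set ID", "Feature ID", "ProbeID", "Probe_Id"}

# A subset of the valid column headers from this section; extend as needed.
VALID_HEADERS = PROBE_ID_HEADERS | {
    "Representative Public ID", "GenbankAccession",
    "Gene Symbol", "GeneSymbol",
    "Gene Ontology Biological Process",
    "Gene Ontology Cellular Component",
    "Gene Ontology Molecular Function",
    "Pathway",
}


def find_header_line(text, delimiter=","):
    """Return (line_index, recognized_headers) for the first usable header line.

    A line qualifies if at least two of its entries are valid column headers
    and one of them is a PROBE_ID header. Returns None if no line qualifies.
    """
    for index, row in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter)):
        cells = [cell.strip() for cell in row]
        recognized = [cell for cell in cells if cell in VALID_HEADERS]
        has_probe_id = any(cell in PROBE_ID_HEADERS for cell in cells)
        if len(recognized) >= 2 and has_probe_id:
            return index, recognized
    return None


annotation_text = (
    '"Probe Set ID","Gene Symbol","Gene Ontology Biological Process"\n'
    '"1367452_at","Sumo2","0006464 // protein modification process //  not recorded"\n'
)
print(find_header_line(annotation_text))
```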

When adding annotations to an experiment, you can specify which column in your annotation file contains the relevant identifiers. These identifiers are matched to the feature IDs already present in your experiment. When a match is found, the annotation is added to that entry in the experiment. In other words, at least one column in your annotation file must contain identifiers matching the feature identifiers in the experiment for those annotations to be applied.
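The matching step can be pictured as a lookup keyed on the chosen identifier column. The Python sketch below is a simplified illustration, not the Workbench's implementation: the function name `match_annotations` and the feature ID `unknown_at` are hypothetical, and identifiers are assumed to match only when the strings are identical.

```python
import csv
import io


def match_annotations(feature_ids, annotation_text, id_column, delimiter=","):
    """Pair experiment feature IDs with annotation rows via id_column.

    Returns (matched, unmatched): a dict of feature ID -> annotation row,
    and a list of feature IDs with no annotation row.
    """
    reader = csv.DictReader(io.StringIO(annotation_text), delimiter=delimiter)
    by_id = {row[id_column]: row for row in reader}
    matched = {fid: by_id[fid] for fid in feature_ids if fid in by_id}
    unmatched = [fid for fid in feature_ids if fid not in by_id]
    return matched, unmatched


annotation_text = (
    '"Probe Set ID","Gene Symbol"\n'
    '"1367452_at","Sumo2"\n'
    '"1367453_at","Cdc37"\n'
)
# "unknown_at" is a made-up feature ID with no annotation row.
matched, unmatched = match_annotations(
    ["1367452_at", "unknown_at"], annotation_text, "Probe Set ID")
print(matched["1367452_at"]["Gene Symbol"])  # Sumo2
print(unmatched)                             # ['unknown_at']
```

Features listed in `unmatched` would receive no annotation, which is why the identifier column chosen must correspond to the feature IDs in the experiment.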

A simple example of an annotation file is shown here:

"Probe Set ID","Gene Symbol","Gene Ontology Biological Process"
"1367452_at","Sumo2","0006464 // protein modification process //  not recorded"
"1367453_at","Cdc37","0051726 // regulation of cell cycle //  not recorded"
"1367454_at","Copb2","0006810 // transport //  ///  0016044 // membrane organization // "

To meet requirements imposed by special functionalities in the CLC Genomics Workbench, the entries in some columns are subject to further restrictions:

Download sequence functionality: In the experiment table, you can click a button to download sequence. This uses the contents of the PUBLIC_ID column, so this column must be present for the action to work, and it should contain the NCBI accession number.

Annotation tests: The annotation tests can use several entries in a single column value, provided a certain format is used. The tests assume that entries are separated by /// and interpret everything before // within an entry as the actual entry and everything after // as comments. Example:

/// 0000001 //  comment1  /// 0000008 // comment2 /// 0003746 //  comment3

The annotation tests will interpret this as three entries (0000001, 0000008, and 0003746) with the corresponding comments.
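The entry format described above can be parsed in a few lines. This Python sketch is an illustration of the described convention rather than the Workbench's own parser; the function name `parse_annotation_entries` is hypothetical, and the sketch assumes that everything after the first // in an entry belongs to the comment.

```python
def parse_annotation_entries(cell):
    """Split one annotation cell into (entry, comment) pairs.

    Entries are separated by '///'; within an entry, the text before the
    first '//' is the entry itself and the rest is treated as a comment.
    """
    entries = []
    for chunk in cell.split("///"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. from a leading separator
        value, _, comment = chunk.partition("//")
        entries.append((value.strip(), comment.strip()))
    return entries


cell = "/// 0000001 //  comment1  /// 0000008 // comment2 /// 0003746 //  comment3"
print(parse_annotation_entries(cell))
# [('0000001', 'comment1'), ('0000008', 'comment2'), ('0003746', 'comment3')]
```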

The most common column headers are summarized below:

Column header in imported file (alternatives separated by commas) | Label in experiment table | Description (tool tip)
Probe Set ID, Feature ID, ProbeID, Probe_Id, transcript_cluster_id | Feature ID | Probe identifier tag
Representative Public ID, Public identifier tag, GenbankAccession | Public identifier tag | Representative public ID
Gene Symbol, GeneSymbol | Gene symbol | Gene symbol
Gene Ontology Biological Process, Ontology_Process, GO_biological_process | GO biological process | Gene Ontology biological process
Gene Ontology Cellular Component, Ontology_Component, GO_cellular_component | GO cellular component | Gene Ontology cellular component
Gene Ontology Molecular Function, Ontology_Function, GO_molecular_function | GO molecular function | Gene Ontology molecular function
Pathway | Pathway | Pathway

The full list of possible column headers:

Column header in imported file (alternatives separated by commas) | Label in experiment table | Description (tool tip)
Species Scientific Name, Species Name, Species | Species name | Scientific species name
GeneChip Array | Gene chip array | Gene Chip Array name
Annotation Date | Annotation date | Date of annotation
Sequence Type | Sequence type | Type of sequence
Sequence Source | Sequence source | Source from which sequence was obtained
Transcript ID(Array Design), Transcript | Transcript ID | Transcript identifier tag
Target Description | Target description | Target description
Archival UniGene Cluster | Archival UniGene cluster | Archival UniGene cluster
UniGene ID, UniGeneID, Unigene_ID, unigene | UniGene ID | UniGene identifier tag
Genome Version | Genome version | Version of genome on which annotation is based
Alignments | Alignments | Alignments
Gene Title | Gene title | Gene title
geng_assignments | Gene assignments | Gene assignments
Chromosomal Location | Chromosomal location | Chromosomal location
Unigene Cluster Type | UniGene cluster type | UniGene cluster type
Ensemble | Ensembl | Ensembl
Entrez Gene, EntrezGeneID, Entrez_Gene_ID | Entrez gene | Entrez gene
SwissProt | SwissProt | SwissProt
OMIM | OMIM | Online Mendelian Inheritance in Man
RefSeq Protein ID | RefSeq protein ID | RefSeq protein identifier tag
RefSeq Transcript ID | RefSeq transcript ID | RefSeq transcript identifier tag
FlyBase | FlyBase | FlyBase
WormBase | WormBase | WormBase
MGI Name | MGI name | MGI name
RGD Name | RGD name | RGD name
SGD accession number | SGD accession number | SGD accession number
InterPro | InterPro | InterPro
Trans Membrane | Trans membrane | Trans membrane
Annotation Description | Annotation description | Annotation description
Annotation Transcript Cluster | Annotation transcript cluster | Annotation transcript cluster
Transcript Assignments | Transcript assignments | Transcript assignments
mrna_assignments | mRNA assignments | mRNA assignments
Annotation Notes | Annotation notes | Annotation notes
GO, Ontology | Go annotations | Go annotations
Cytoband | Cytoband | Cytoband
PrimaryAccession | Primary accession | Primary accession
RefSeqAccession | RefSeq accession | RefSeq accession
GeneName | Gene name | Gene name
Description | Description | Description
GenomicCoordinates | Genomic coordinates | Genomic coordinates
Search_key | Search key | Search key
Target | Target | Target
Gid, GI | Genbank identifier | Genbank identifier
Accession | GenBank accession | GenBank accession
Symbol | Gene symbol | Gene symbol
Probe_Type | Probe type | Probe type
crosshyb_type | Crosshyb type | Crosshyb type
category | category | category
Start, Probe_Start | Start | Start
Stop | Stop | Stop
Definition | Definition | Definition
Synonym, Synonyms | Synonym | Synonym
Source | Source | Source
Source_Reference_ID | Source reference id | Source reference id
RefSeq_ID | Reference sequence id | Reference sequence id
ILMN_Gene | Illumina Gene | Illumina Gene
Protein_Product | Protein product | Protein product
protein_domains | Protein domains | Protein domains
Array_Address_Id | Array adress id | Array adress id
Probe_Sequence | Sequence | Sequence
seqname | Seqname | Seqname
Chromosome | Chromosome | Chromosome
strand | Strand | Strand
Probe_Chr_Orientation | Probe chr orientation | Probe chr orientation
Probe_Coordinates | Probe coordinates | Probe coordinates
Obsolete_Probe_Id | Obsolete probe id | Obsolete probe id