Generic expression and annotation data file formats
If you have your expression or annotation data in Excel and can export the data as a .txt file, or if you are able to do some scripting or other manipulations to format your data files, you will be able to import them into the CLC Genomics Workbench as a 'generic' expression or annotation data file. A few simple requirements, described below, must be fulfilled to do this.
Generic expression data table format
The CLC Genomics Workbench will import a tab, semicolon or comma-separated .txt or .csv file as expression array samples if the following requirements are met:
- the first non-empty line of the file contains text. All entries, except the first, will be used as sample names
- each of the following non-empty lines contains the same number of entries as the first non-empty line. The first entry must be a string (this will be used as the feature ID) and the remaining entries must be numbers (these will be used as expression values, one per sample). Empty entries are not allowed, but NaN values are allowed.
- the file contains at least two samples.
FeatureID;sample1;sample2;sample3
gene1;200;300;23
gene2;210;30;238
gene3;230;50;23
gene4;50;100;235
gene5;200;300;23
gene6;210;30;238
gene7;230;50;23
gene8;50;100;235

This will be imported as three samples with eight genes in each sample.
Download this example as a file here:
https://resources.qiagenbioinformatics.com/madata/CustomExpressionData.txt
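If you generate such files by script, it can be useful to check them against the requirements above before importing. The following is a minimal sketch in Python, not the check performed by the CLC Genomics Workbench itself. The function name `parse_expression_table` is hypothetical, and the separator is assumed to be known in advance (semicolon by default).

```python
import csv
import io


def parse_expression_table(text, delimiter=";"):
    """Check text against the listed requirements; return (sample_names, values_by_feature).

    Hypothetical helper for pre-checking a file; the separator is assumed known.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Keep only non-empty lines (lines with at least one non-blank entry).
    lines = [row for row in reader if any(cell.strip() for cell in row)]
    if not lines:
        raise ValueError("file contains no non-empty lines")

    header, data = lines[0], lines[1:]
    # All header entries except the first are sample names.
    sample_names = [name.strip() for name in header[1:]]
    if len(sample_names) < 2:
        raise ValueError("the file must contain at least two samples")

    values_by_feature = {}
    for row in data:
        if len(row) != len(header):
            raise ValueError(f"wrong number of entries in line: {row}")
        if any(cell.strip() == "" for cell in row):
            raise ValueError(f"empty entry in line: {row}")
        feature_id = row[0].strip()
        # float() accepts "NaN", which is allowed as an expression value;
        # it raises ValueError for entries that are not numbers.
        values_by_feature[feature_id] = [float(cell) for cell in row[1:]]
    return sample_names, values_by_feature
```

For the example above, this returns the three sample names and a dictionary with one list of three values per feature ID.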
Generic annotation file for expression data format
The CLC Genomics Workbench will import a tab, semicolon or comma-separated .txt or .csv file as an annotation file if the following requirements are met:
- It has a line that can serve as a valid header line. For this, at least two of the headers in the line must be among the valid column headers listed in the first column of the table below.
- It contains one of the PROBE_ID headers (that is: 'Probe Set ID', 'Feature ID', 'ProbeID' or 'Probe_Id').
Note that some column headers are alternatives, so only one of the alternative column headers should be used.
When adding annotations to an experiment, you can specify the column in your annotation file containing the relevant identifiers. These identifiers are matched to the feature IDs already present in your experiment. When a match is found, the annotation is added to that entry in the experiment. In other words, at least one column in your annotation file must contain identifiers matching the feature identifiers in the experiment for those annotations to be applied.
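The matching described above can be pictured as a lookup keyed on the chosen identifier column. The sketch below is only an illustration in Python: the function name `attach_annotations` is hypothetical, and exact string matching is an assumption, not a statement about how the Workbench compares identifiers.

```python
def attach_annotations(feature_ids, annotation_rows, id_column):
    """Return {feature_id: annotation row, or None when no match is found}.

    annotation_rows is a list of dicts (one per annotation line, keyed by
    column header); id_column names the column holding the identifiers.
    Matching is by exact string equality (an assumption).
    """
    by_id = {row[id_column]: row for row in annotation_rows}
    return {feature_id: by_id.get(feature_id) for feature_id in feature_ids}
```

Features without a matching identifier simply receive no annotation, which is why the identifier column must use the same IDs as the experiment.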
A simple example of an annotation file is shown here:
"Probe Set ID","Gene Symbol","Gene Ontology Biological Process"
"1367452_at","Sumo2","0006464 // protein modification process // not recorded"
"1367453_at","Cdc37","0051726 // regulation of cell cycle // not recorded"
"1367454_at","Copb2","0006810 // transport // /// 0016044 // membrane organization // "

Download this example plus a more elaborate one here:
https://resources.qiagenbioinformatics.com/madata/SimpleCustomAnnotation.csv
https://resources.qiagenbioinformatics.com/madata/FullCustomAnnotation.csv
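To check whether a file has a usable header line before importing it, the two header requirements can be sketched in Python as follows. This is an illustration under assumptions: `find_header_line` is a hypothetical name, `KNOWN_HEADERS` holds only a small subset of the accepted column headers from the table below, and headers are compared case-sensitively, which may differ from the Workbench's behavior.

```python
import csv
import io

# The PROBE_ID headers named in the requirements above.
PROBE_ID_HEADERS = {"Probe Set ID", "Feature ID", "ProbeID", "Probe_Id"}

# Illustrative subset of the accepted column headers; extend as needed.
KNOWN_HEADERS = PROBE_ID_HEADERS | {
    "Gene Symbol",
    "GeneSymbol",
    "Gene Ontology Biological Process",
    "Representative Public ID",
    "Pathway",
}


def find_header_line(text, delimiter=","):
    """Return (line_index, headers) for the first valid header line, else None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for index, row in enumerate(reader):
        cells = {cell.strip() for cell in row}
        has_two_known = len(cells & KNOWN_HEADERS) >= 2
        has_probe_id = bool(cells & PROBE_ID_HEADERS)
        if has_two_known and has_probe_id:
            return index, [cell.strip() for cell in row]
    return None
```

Applied to the simple example above, the first line qualifies because it contains 'Probe Set ID' plus two further known headers.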
To meet requirements imposed by special functionalities in the CLC Genomics Workbench, there are a number of further restrictions on the contents in the entries of the columns:
- Download sequence functionality
  - In the experiment table, you can click a button to download sequence. This uses the contents of the PUBLIC_ID column, so this column must be present for the action to work and should contain the NCBI accession number.
- Annotation tests
  - The annotation tests can make use of several entries in a column as long as a certain format is used. The tests assume that entries are separated by /// and interpret all that appears before // as the actual entry and all that appears after // within an entry as comments. Example:
    /// 0000001 // comment1 /// 0000008 // comment2 /// 0003746 // comment3
    The annotation tests will interpret this as three entries (0000001, 0000008, and 0003746) with the corresponding comments.
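The entry format used by the annotation tests can be illustrated with a short Python sketch. The function name `split_annotation_entries` is hypothetical, and the code only mirrors the rules stated above (entries separated by ///, entry before the first //, comments after it); it is not the Workbench's own parser.

```python
def split_annotation_entries(field):
    """Split one annotation cell into (entry, comment) pairs."""
    entries = []
    for chunk in field.split("///"):
        chunk = chunk.strip()
        if not chunk:
            # Skip the empty piece produced by a leading or trailing ///.
            continue
        # Text before the first // is the entry; the rest is comment.
        entry, _, comment = chunk.partition("//")
        entries.append((entry.strip(), comment.strip()))
    return entries


example = "/// 0000001 // comment1 /// 0000008 // comment2 /// 0003746 // comment3"
print(split_annotation_entries(example))
# [('0000001', 'comment1'), ('0000008', 'comment2'), ('0003746', 'comment3')]
```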
The full list of possible column headers:
Column header in imported file (alternatives separated by commas) | Label in experiment table | Description (tool tip) |
Probe Set ID, Feature ID, ProbeID, Probe_Id, transcript_cluster_id | Feature ID | Probe identifier tag |
Representative Public ID, Public identifier tag, GenbankAccession | Public identifier tag | Representative public ID |
Gene Symbol, GeneSymbol | Gene symbol | Gene symbol |
Gene Ontology Biological Process, Ontology_Process, GO_biological_process | GO biological process | Gene Ontology biological process |
Gene Ontology Cellular Component, Ontology_Component, GO_cellular_component | GO cellular component | Gene Ontology cellular component |
Gene Ontology Molecular Function, Ontology_Function, GO_molecular_function | GO molecular function | Gene Ontology molecular function |
Pathway | Pathway | Pathway |
Species Scientific Name, Species Name, Species | Species name | Scientific species name |
GeneChip Array | Gene chip array | Gene Chip Array name |
Annotation Date | Annotation date | Date of annotation |
Sequence Type | Sequence type | Type of sequence |
Sequence Source | Sequence source | Source from which sequence was obtained |
Transcript ID(Array Design), Transcript | Transcript ID | Transcript identifier tag |
Target Description | Target description | Target description |
Archival UniGene Cluster | Archival UniGene cluster | Archival UniGene cluster |
UniGene ID, UniGeneID, Unigene_ID, unigene | UniGene ID | UniGene identifier tag |
Genome Version | Genome version | Version of genome on which annotation is based |
Alignments | Alignments | Alignments |
Gene Title | Gene title | Gene title |
geng_assignments | Gene assignments | Gene assignments |
Chromosomal Location | Chromosomal location | Chromosomal location |
Unigene Cluster Type | UniGene cluster type | UniGene cluster type |
Ensemble | Ensembl | Ensembl |
Entrez Gene, EntrezGeneID, Entrez_Gene_ID | Entrez gene | Entrez gene |
SwissProt | SwissProt | SwissProt |
EC | EC | EC |
OMIM | OMIM | Online Mendelian Inheritance in Man |
RefSeq Protein ID | RefSeq protein ID | RefSeq protein identifier tag |
RefSeq Transcript ID | RefSeq transcript ID | RefSeq transcript identifier tag |
FlyBase | FlyBase | FlyBase |
AGI | AGI | AGI |
WormBase | WormBase | WormBase |
MGI Name | MGI name | MGI name |
RGD Name | RGD name | RGD name |
SGD accession number | SGD accession number | SGD accession number |
InterPro | InterPro | InterPro |
Trans Membrane | Trans membrane | Trans membrane |
QTL | QTL | QTL |
Annotation Description | Annotation description | Annotation description |
Annotation Transcript Cluster | Annotation transcript cluster | Annotation transcript cluster |
Transcript Assignments | Transcript assignments | Transcript assignments |
mrna_assignments | mRNA assignments | mRNA assignments |
Annotation Notes | Annotation notes | Annotation notes |
GO, Ontology | Go annotations | Go annotations |
Cytoband | Cytoband | Cytoband |
PrimaryAccession | Primary accession | Primary accession |
RefSeqAccession | RefSeq accession | RefSeq accession |
GeneName | Gene name | Gene name |
TIGRID | TIGR Id | TIGR Id |
Description | Description | Description |
GenomicCoordinates | Genomic coordinates | Genomic coordinates |
Search_key | Search key | Search key |
Target | Target | Target |
Gid, GI | Genbank identifier | Genbank identifier |
Accession | GenBank accession | GenBank accession |
Symbol | Gene symbol | Gene symbol |
Probe_Type | Probe type | Probe type |
crosshyb_type | Crosshyb type | Crosshyb type |
category | category | category |
Start, Probe_Start | Start | Start |
Stop | Stop | Stop |
Definition | Definition | Definition |
Synonym, Synonyms | Synonym | Synonym |
Source | Source | Source |
Source_Reference_ID | Source reference id | Source reference id |
RefSeq_ID | Reference sequence id | Reference sequence id |
ILMN_Gene | Illumina Gene | Illumina Gene |
Protein_Product | Protein product | Protein product |
protein_domains | Protein domains | Protein domains |
Array_Address_Id | Array adress id | Array adress id |
Probe_Sequence | Sequence | Sequence |
seqname | Seqname | Seqname |
Chromosome | Chromosome | Chromosome |
strand | Strand | Strand |
Probe_Chr_Orientation | Probe chr orientation | Probe chr orientation |
Probe_Coordinates | Probe coordinates | Probe coordinates |
Obsolete_Probe_Id | Obsolete probe id | Obsolete probe id |