QIAGEN Bioinformatics Manuals

GEO (Gene Expression Omnibus)

The GEO (Gene Expression Omnibus) sample and series formats are supported. Figure 39.10 shows how to download the data from GEO in the right format. GEO is located at https://www.ncbi.nlm.nih.gov/geo/.

Figure 39.10: Selecting Samples, SOFT and Data before clicking go will give you the format supported by the CLC Genomics Workbench.

The GEO sample files are tab-delimited .txt files. They have three required lines:

^SAMPLE = GSM21610
!sample_table_begin
...
!sample_table_end

The first line must start with ^SAMPLE = followed by the sample name. The other two required lines are !sample_table_begin and !sample_table_end. The lines between !sample_table_begin and !sample_table_end hold the column contents of the sample.
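To make the required structure concrete, here is a minimal Python sketch that reads one sample record and checks for the three required lines. It is an illustration only and not the importer used by the CLC Genomics Workbench; the function name parse_geo_sample is hypothetical, and the sketch assumes strictly tab-delimited table lines whose first line holds the column names.

```python
def parse_geo_sample(text):
    """Parse one GEO sample record into (name, header, rows).

    Assumes tab-delimited table lines between the two table markers.
    """
    name = None
    header = None
    rows = []
    in_table = False
    saw_end = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("^SAMPLE ="):
            # Everything after the first "=" is the sample name.
            name = line.split("=", 1)[1].strip()
        elif line == "!sample_table_begin":
            in_table = True
        elif line == "!sample_table_end":
            in_table = False
            saw_end = True
        elif in_table and line:
            fields = [field.strip() for field in raw.split("\t")]
            if header is None:
                header = fields  # first table line holds the column names
            else:
                rows.append(fields)
    if name is None or header is None or not saw_end:
        raise ValueError("not a complete GEO sample record")
    return name, header, rows


example = (
    "^SAMPLE = GSM21610\n"
    "!sample_table_begin\n"
    "ID_REF\tVALUE\n"
    "id1\t105.8\n"
    "id2\t32\n"
    "!sample_table_end\n"
)
name, header, rows = parse_geo_sample(example)
print(name, header, rows)
```

The values are kept as strings here so that the same sketch works for files with extra columns.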

Note that the GEO sample importer also works for concatenated GEO sample files, allowing multiple samples to be imported in one go. Download a sample file containing concatenated sample files here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFilesConcatenated.txt
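If you want to inspect a concatenated file before importing it, the following Python sketch splits it into one record per sample. It assumes that every record starts with its own ^SAMPLE = line, as in the single-sample layout above; the function name split_concatenated_samples and the second sample's value are made up for illustration.

```python
def split_concatenated_samples(text):
    """Split concatenated GEO sample text into {sample_name: record_text}."""
    records = {}
    current_name = None
    current_lines = []
    for line in text.splitlines():
        if line.startswith("^SAMPLE ="):
            # A new ^SAMPLE line closes the previous record, if any.
            if current_name is not None:
                records[current_name] = "\n".join(current_lines)
            current_name = line.split("=", 1)[1].strip()
            current_lines = [line]
        elif current_name is not None:
            current_lines.append(line)
    if current_name is not None:
        records[current_name] = "\n".join(current_lines)
    return records


example = (
    "^SAMPLE = GSM21610\n"
    "!sample_table_begin\n"
    "ID_REF\tVALUE\n"
    "id1\t105.8\n"
    "!sample_table_end\n"
    "^SAMPLE = GSM21611\n"
    "!sample_table_begin\n"
    "ID_REF\tVALUE\n"
    "id1\t1781.8\n"
    "!sample_table_end\n"
)
records = split_concatenated_samples(example)
print(list(records))
```

Lines that appear before the first ^SAMPLE = line are ignored by this sketch.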

Below you can find examples of the formatting of the GEO formats.

GEO sample file, simple

This format is very simple and includes two columns: one for feature id (e.g. gene name) and one for the expression value.

^SAMPLE = GSM21610
!sample_table_begin
ID_REF   VALUE
id1	     105.8
id2	     32
id3      50.4
id4      57.8
id5	     2914.1
!sample_table_end

Download the sample file here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFileSimple.txt

GEO sample file, including present/absent calls

This format includes an extra column for absent/present calls that can also be imported.

^SAMPLE = GSM21610
!sample_table_begin
ID_REF   VALUE   ABS_CALL
id1	     105.8   M
id2      32      A
id3      50.4    A
id4      57.8    A
id5      2914.1  P
!sample_table_end

Download the sample file here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFileAbsentPresent.txt

GEO sample file, including present/absent calls and p-values

This format includes two extra columns, one for absent/present calls and one for the p-values of those calls, both of which can also be imported.

^SAMPLE = GSM21610
!sample_table_begin
ID_REF    VALUE     ABS_CALL     DETECTION P-VALUE
id1       105.8     M            0.00227496
id2       32        A            0.354441
id3       50.4      A            0.904352
id4       57.8      A            0.937071
id5       2914.1    P            6.02111e-05
!sample_table_end

Download the sample file here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFileAbsentPresentCallAndPValue.txt

GEO sample file: using absent/present call and p-value columns for sequence information

The CLC Genomics Workbench assumes that if there is a third column in the GEO sample file, it contains present/absent calls, and that if there is a fourth column, it contains p-values for these calls. This means that the contents of the third column are assumed to be text and those of the fourth column to be numbers. As long as these two basic requirements are met, the sample should be recognized and interpreted correctly.
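The two requirements can be checked before import with a small Python sketch such as the one below. It is not the Workbench's own validation; the function name check_extra_columns is hypothetical, and the sketch simply treats any third-column content as text and tests whether each fourth-column entry parses as a number.

```python
def check_extra_columns(rows):
    """Return a list of problems found in the fourth column.

    rows: one list of string fields per table line, header excluded.
    Any content is accepted as text in the third column, so only the
    fourth column is tested.
    """
    problems = []
    for row_number, fields in enumerate(rows, start=1):
        if len(fields) >= 4:
            try:
                float(fields[3])
            except ValueError:
                problems.append(
                    f"row {row_number}: fourth column {fields[3]!r} is not a number"
                )
    return problems


good_rows = [
    ["id1", "105.8", "M", "0.00227496"],
    ["id5", "2914.1", "P", "6.02111e-05"],
    ["id2", "32", "A"],  # a row without a fourth column is not tested
]
bad_rows = [["probe1", "755.07", "seq1", "pos-1452"]]
print(check_extra_columns(good_rows))
print(check_extra_columns(bad_rows))
```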

You can thus use these two columns to carry additional information on your probes. For example, the absent/present column can carry sequence tags, as shown below:

^SAMPLE = GSM21610
!sample_table_begin
ID_REF      VALUE     ABS_CALL
id1        105.8      AAA
id2        32         AAC
id3        50.4       ATA
id4        57.8       ATT
id5        2914.1     TTA
!sample_table_end

Download the sample file here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFileSimpleSequenceTag.txt

Or, if you have multiple probes per sequence, you could use the present/absent column to hold the sequence name and the p-value column to hold the interrogation position of your probes:

^SAMPLE = GSM21610
!sample_table_begin
ID_REF    VALUE    ABS_CALL    DETECTION P-VALUE
probe1    755.07   seq1        1452
probe2    587.88   seq1        497
probe3    716.29   seq1        1447
probe4    1287.18  seq2        1899
!sample_table_end

Download the sample file here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFileSimpleSequenceTagAndProbe.txt
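As an illustration of how this layout keeps probe and sequence information together, the Python sketch below groups the rows of the example above by sequence name and orders each group by interrogation position. The function name group_probes_by_sequence is hypothetical, and the sketch only shows one way such a table could be read outside the Workbench.

```python
from collections import defaultdict

# Rows copied from the example: (probe id, value, sequence name, position).
rows = [
    ("probe1", 755.07, "seq1", 1452),
    ("probe2", 587.88, "seq1", 497),
    ("probe3", 716.29, "seq1", 1447),
    ("probe4", 1287.18, "seq2", 1899),
]


def group_probes_by_sequence(rows):
    """Return {sequence_name: [(position, probe_id, value), ...]}."""
    grouped = defaultdict(list)
    for probe_id, value, sequence_name, position in rows:
        grouped[sequence_name].append((position, probe_id, value))
    # Sort each sequence's probes by interrogation position.
    return {name: sorted(probes) for name, probes in grouped.items()}


by_sequence = group_probes_by_sequence(rows)
for name, probes in by_sequence.items():
    print(name, probes)
```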

GEO series file, simple

The series file includes expression values for multiple samples. Each sample in the file is imported as its own element, named after the sample. The first row of the table lists the sample names.

!Series_title	"Myb specificity determinants"
!series_matrix_table_begin
"ID_REF"	"GSM21610"	"GSM21611"	"GSM21612"
"id1"	     2541	     1781.8     1804.8
"id2"	     11.3	     621.5	    50.2
"id3"	     61.2	     149.1	    22
"id4"	     55.3	     328.8	    97.2
"id5"        183.8       378.3      423.2	
!series_matrix_table_end
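The following Python sketch shows how the series layout maps to one set of values per sample. It is an illustration under stated assumptions rather than the Workbench importer: the function name parse_series_matrix is hypothetical, the table is assumed to be strictly tab-delimited, and every expression value is assumed to be numeric.

```python
import csv


def parse_series_matrix(text):
    """Return {sample_name: {feature_id: value}} from GEO series text."""
    table_lines = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "!series_matrix_table_begin":
            in_table = True
        elif stripped == "!series_matrix_table_end":
            in_table = False
        elif in_table and stripped:
            table_lines.append(line)
    if not table_lines:
        raise ValueError("no series matrix table found")
    # csv.reader removes the double quotes around ids and sample names.
    reader = csv.reader(table_lines, delimiter="\t", quotechar='"')
    header = [cell.strip() for cell in next(reader)]
    sample_names = header[1:]  # the first cell is "ID_REF"
    samples = {name: {} for name in sample_names}
    for fields in reader:
        feature_id = fields[0].strip()
        for name, cell in zip(sample_names, fields[1:]):
            samples[name][feature_id] = float(cell)
    return samples


example = (
    '!Series_title\t"Myb specificity determinants"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM21610"\t"GSM21611"\t"GSM21612"\n'
    '"id1"\t2541\t1781.8\t1804.8\n'
    '"id2"\t11.3\t621.5\t50.2\n'
    "!series_matrix_table_end\n"
)
samples = parse_series_matrix(example)
print(samples["GSM21611"])
```

Lines starting with ! outside the table, such as !Series_title, are skipped by this sketch.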