Fitting a GLM to expression data

The GLM is easiest to understand through an example. Imagine an experiment that examines the effect of two drug treatments while controlling for the gender of the patient:

Test differential expression due to Treatment with three groups: drugA, drugB, placebo
While controlling for Gender with groups: Male, Female

In an abuse of mathematical notation, the underlying GLM for each gene looks like

$\displaystyle \log{y_i} = \mathrm{(placebo\ and\ Male)} + \mathrm{drugA} + \mathrm{drugB} + \mathrm{Female} + \mathrm{constant_i}$

(29.1)

where $y_i$ is the expression level for the gene in sample $i$; the combined term $\mathrm{(placebo\ and\ Male)}$ describes an arbitrarily chosen baseline expression level (that of males given a placebo); and the other terms $\mathrm{drugA}$, $\mathrm{drugB}$ and $\mathrm{Female}$ are numbers describing the effect of each group relative to this baseline. The $\mathrm{constant_i}$ accounts for differences in library size between samples. For example, if a patient is male and given a placebo, we predict the expression level to be

$\displaystyle \log{y_i} = \mathrm{(placebo\ and\ Male)} + \mathrm{constant_i}.$

If instead he had been given drug B, the predicted log expression level would be shifted by the $\mathrm{drugB}$ coefficient, resulting in

$\displaystyle \log{y_i} = \mathrm{(placebo\ and\ Male)} + \mathrm{drugB} + \mathrm{constant_i}.$
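
The predictions above can be sketched in a few lines of Python. This is only an illustration of how the terms combine: the coefficient values and the names `COEFFS`, `design_row` and `predict_log_expression` are hypothetical and are not taken from the Workbench.

```python
# Hypothetical coefficients on the log scale (illustrative values only).
COEFFS = {"baseline": 2.0, "drugA": 0.5, "drugB": -0.3, "Female": 0.2}


def design_row(treatment, gender):
    """Indicator coding with (placebo and Male) as the baseline."""
    return {
        "baseline": 1,
        "drugA": int(treatment == "drugA"),
        "drugB": int(treatment == "drugB"),
        "Female": int(gender == "Female"),
    }


def predict_log_expression(treatment, gender, constant_i=0.0, coeffs=COEFFS):
    """Sum the coefficients that apply to a sample, plus its library-size constant."""
    row = design_row(treatment, gender)
    return sum(coeffs[name] * row[name] for name in row) + constant_i


placebo_male = predict_log_expression("placebo", "Male")
drugb_male = predict_log_expression("drugB", "Male")
# The difference between the two predictions is the drugB coefficient.
print(placebo_male, drugb_male, drugb_male - placebo_male)
```

Each sample switches on only the coefficients of the groups it belongs to, so a male placebo sample receives the baseline alone, and a male drug B sample receives the baseline plus the drugB term.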

We assume that the expression levels follow a Negative Binomial distribution. This distribution has a free parameter, the dispersion. The greater the dispersion, the greater the variation in expression levels for a gene.
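
To make the role of the dispersion concrete, the sketch below assumes the commonly used parametrization in which the variance of a Negative Binomial count is the mean plus the dispersion times the squared mean. The text above does not state which parametrization the Workbench uses, so treat this formula and the helper name `nb_variance` as assumptions.

```python
def nb_variance(mean, dispersion):
    """Variance under the assumed parametrization: mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


# With zero dispersion the variance equals the mean (the Poisson case);
# larger dispersions inflate the variance for the same mean.
for dispersion in (0.0, 0.1, 1.0):
    print(dispersion, nb_variance(10.0, dispersion))
```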

The most likely values of the dispersion and of the coefficients $\mathrm{drugA}$, $\mathrm{drugB}$ and $\mathrm{Female}$ are determined simultaneously by fitting the GLM to the data. To see why this simultaneous fitting is necessary, imagine an experiment where we observe counts {3,10,4} for Males and {30,20,8} for Females. The most natural fit is for the coefficient $\mathrm{Female}$ to describe a clear fold change (the Female mean is roughly 3.4 times the Male mean) and for the dispersion to be small, but an alternative fit has no fold change and a larger dispersion. Under this second fit the variation in the counts is greater, and it is just by chance that all three Female values are larger than all three Male values.
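
The two competing fits can be compared numerically. The following is a minimal sketch, not the Workbench's fitting procedure: it assumes the parametrization variance = mean + dispersion × mean², ignores library-size constants, fixes the group means at the sample means, and finds the dispersion by a coarse grid search. All function names are hypothetical.

```python
import math

MALE = [3, 10, 4]
FEMALE = [30, 20, 8]


def nb_log_pmf(y, mean, dispersion):
    """Log-probability of count y for a Negative Binomial with the given
    mean and dispersion (assumed variance = mean + dispersion * mean**2)."""
    r = 1.0 / dispersion
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mean)) + y * math.log(mean / (r + mean)))


def log_likelihood(groups, dispersion):
    """groups is a list of (counts, fitted_mean) pairs sharing one dispersion."""
    return sum(nb_log_pmf(y, mean, dispersion)
               for counts, mean in groups for y in counts)


def best_dispersion(groups):
    """Grid search over dispersions 0.01 .. 3.00 for the highest likelihood."""
    grid = [k / 100 for k in range(1, 301)]
    best = max(grid, key=lambda d: log_likelihood(groups, d))
    return best, log_likelihood(groups, best)


def average(values):
    return sum(values) / len(values)


# Fit 1: Males and Females have different means (a Female fold change).
with_effect = [(MALE, average(MALE)), (FEMALE, average(FEMALE))]
# Fit 2: one shared mean for all six samples (no fold change).
pooled = MALE + FEMALE
no_effect = [(pooled, average(pooled))]

disp_effect, ll_effect = best_dispersion(with_effect)
disp_none, ll_none = best_dispersion(no_effect)
print("with fold change: dispersion", disp_effect, "log-likelihood", ll_effect)
print("no fold change:   dispersion", disp_none, "log-likelihood", ll_none)
```

In this sketch the fit without a fold change can only explain the data by choosing a larger dispersion, which is why the coefficients and the dispersion cannot be estimated independently of each other.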