Fitting a GLM to expression data

It is easiest to understand how the GLM works through an example. Imagine an experiment looking at the effect of two drug treatments while controlling for the gender of each patient:

Test differential expression due to Treatment with three groups: drugA, drugB, placebo
While controlling for Gender with groups: Male, Female

In an abuse of mathematical notation, the underlying GLM for each gene looks like

$\displaystyle \log{y_i} = (\text{placebo and Male}) + \mathrm{drugA} + \mathrm{drugB} + \mathrm{Female} + \mathrm{constant_i} \qquad (26.1)$

where $y_i$ is the expression level for the gene in sample $i$; the combined term $(\text{placebo and Male})$ describes an arbitrarily chosen baseline expression level (of males being given a placebo); and the other terms $\mathrm{drugA}$, $\mathrm{drugB}$ and $\mathrm{Female}$ are numbers describing the effect of each group with respect to this baseline. The $\mathrm{constant_i}$ accounts for differences in the library size between samples. For example, if a patient is male and given a placebo we predict the expression level to be

$\displaystyle \log{y_i} = (\text{placebo and Male}) + \mathrm{constant_i}.$

If instead he had been given drug B, we would predict the expression level to be augmented with the $\mathrm{drugB}$ coefficient, resulting in

$\displaystyle \log{y_i} = (\text{placebo and Male}) + \mathrm{drugB} + \mathrm{constant_i}.$
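
The way the coefficients combine into a prediction can be sketched in a few lines of Python. This is only an illustration of the linear predictor described above: the coefficient values and the function name `predicted_log_expression` are hypothetical, and in a real analysis the coefficients would be estimated from the data rather than typed in.

```python
# Hypothetical coefficient values, chosen only for illustration.
# In practice they are estimated by fitting the GLM for each gene.
COEFFICIENTS = {
    "baseline": 2.0,   # the combined (placebo and Male) term
    "drugA": 0.5,
    "drugB": -0.25,
    "Female": 0.125,
}


def predicted_log_expression(treatment, gender, constant_i, coef=COEFFICIENTS):
    """Return the predicted log expression for one sample.

    treatment:  "placebo", "drugA" or "drugB"
    gender:     "Male" or "Female"
    constant_i: per-sample term accounting for library size
    """
    # Every sample starts from the baseline plus its library-size constant.
    value = coef["baseline"] + constant_i
    # Non-baseline groups add their own coefficient.
    if treatment != "placebo":
        value += coef[treatment]
    if gender == "Female":
        value += coef["Female"]
    return value


# A male given a placebo: baseline + constant only.
print(predicted_log_expression("placebo", "Male", constant_i=1.0))
# The same patient given drug B: the drugB coefficient is added.
print(predicted_log_expression("drugB", "Male", constant_i=1.0))
```

The baseline group contributes no extra coefficient, which is why only two treatment terms and one gender term appear in the model even though there are three treatments and two genders.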

We assume that the expression levels follow a Negative Binomial distribution. This distribution has a free parameter, the dispersion. The greater the dispersion, the greater the variation in expression levels for a gene.
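
One way to make the role of the dispersion concrete is through the variance. The relation below assumes the mean-dispersion parameterization commonly used for count data; this section does not state which parameterization is used, and the symbols $\mu$ (mean expression level) and $\phi$ (dispersion) are introduced here only for illustration:

$\displaystyle \mathrm{Var}(y) = \mu + \phi \mu^{2}.$

Under this assumption, a dispersion close to zero gives variation close to that of a Poisson distribution, where the variance equals the mean, and the excess variation grows with the dispersion.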

The most likely values of the dispersion and the coefficients $\mathrm{drugA}$, $\mathrm{drugB}$ and $\mathrm{Female}$ are determined simultaneously by fitting the GLM to the data. To see why this simultaneous fitting is necessary, imagine an experiment where we observe counts {3,10,4} for Males and {30,20,8} for Females. The most natural fit is for the coefficient $\mathrm{Female}$ to correspond to a clear fold change (with equal library sizes, the mean Female count of about 19.3 is roughly 3.4 times the mean Male count of about 5.7) and for the dispersion to be small, but an alternative fit has no fold change and a larger dispersion. Under this second fit the variation in the counts is greater, and it is just by chance that the Female values tend to be larger than the Male values.
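
The two competing fits can be compared numerically. The sketch below is not the fitting procedure used by the Workbench. It assumes equal library sizes, a Negative Binomial with variance `mu + phi * mu**2`, and a simple grid search over the dispersion `phi`; all function names are hypothetical. For a fixed dispersion, the most likely mean of a group is its sample mean, so only the dispersion needs to be searched.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a Negative Binomial with mean mu
    and dispersion phi (assumed variance: mu + phi * mu**2)."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def log_likelihood(groups, phi, shared_mean):
    """Total log-likelihood of all counts for a given dispersion.

    shared_mean=True  -> one mean for every sample (no fold change)
    shared_mean=False -> a separate mean per group (fold change allowed)
    """
    counts = [y for group in groups for y in group]
    total = 0.0
    for group in groups:
        if shared_mean:
            mu = sum(counts) / len(counts)
        else:
            mu = sum(group) / len(group)
        total += sum(nb_logpmf(y, mu, phi) for y in group)
    return total


def best_dispersion(groups, shared_mean, grid):
    """Dispersion on the grid that maximizes the log-likelihood."""
    return max(grid, key=lambda phi: log_likelihood(groups, phi, shared_mean))


males = [3, 10, 4]
females = [30, 20, 8]
groups = [males, females]
grid = [k / 100 for k in range(1, 301)]  # candidate dispersions 0.01 .. 3.00

phi_groups = best_dispersion(groups, shared_mean=False, grid=grid)
phi_common = best_dispersion(groups, shared_mean=True, grid=grid)

print("fold change allowed: dispersion", phi_groups,
      "log-likelihood", log_likelihood(groups, phi_groups, False))
print("no fold change:      dispersion", phi_common,
      "log-likelihood", log_likelihood(groups, phi_common, True))
```

The fit without a fold change has to absorb the difference between the groups into a larger dispersion, which illustrates why the coefficients and the dispersion cannot be estimated independently of each other.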