Working with data in AWS S3

NGS data can be imported from Amazon S3 locations, either directly or using on-the-fly import in workflows. Data can also be exported to Amazon S3 locations.

To do this, AWS credentials for one or more AWS accounts must be provided, as described below.

Working with data on S3 is of particular relevance when submitting jobs to run on a CLC Genomics Cloud Engine, where execution takes place on AWS, close to the data. When launching analyses on the CLC Genomics Workbench or a CLC Genomics Server, files selected for import from an AWS S3 location are first downloaded to a temporary folder and are subsequently imported.
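The two-step behavior described above (download to a temporary folder, then import) can be pictured with a minimal Python sketch. It is an illustration only, not how the Workbench or Server is implemented. The names `staged_import`, `download` and `import_file` are hypothetical; the two callables stand in for the S3 transfer and the importer.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def staged_import(
    remote_keys: Iterable[str],
    download: Callable[[str, Path], None],
    import_file: Callable[[Path], str],
) -> List[str]:
    """Download each remote object to a temporary folder, then import it.

    `download` and `import_file` are hypothetical callables supplied by the
    caller; they are assumptions of this sketch, not part of any CLC API.
    """
    results = []
    # The temporary folder and its contents are removed when the block exits.
    with tempfile.TemporaryDirectory(prefix="s3_import_") as tmp:
        for key in remote_keys:
            # Keep only the file name; keys sharing a file name would collide.
            local_path = Path(tmp) / Path(key).name
            download(key, local_path)                # step 1: fetch to temp storage
            results.append(import_file(local_path))  # step 2: import the local copy
    return results
```

The point of the sketch is that the importer only ever sees local files, so enough local disk space for the downloaded data must be available for the duration of the import.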

Configuring AWS credentials

To configure AWS accounts for data import and export, go to:

        Connections | Manage AWS S3 Locations (Image cloud_access_16_n_p)

The same configuration dialog can be opened by clicking on the (Image cloud_access_16_n_p) icon at the bottom left of the Workbench frame.

This dialog (figure 6.2) allows you to register the credentials for one or more AWS accounts. To add an AWS account, click on Add Amazon S3 location. After adding one or more AWS data locations, it is possible to Edit or Remove them.

Image data_dialog
Figure 6.2: The AWS S3 Locations configuration dialog

To add or edit AWS account credentials, provide the information below (figure 6.3). All fields except Description are required. The administrator of the AWS account should be able to provide this information if you do not already have it.

Name: A short name of your choice, identifying the AWS account. This name will be shown as the name of the data location when importing data from or exporting data to Amazon S3.

Description: An optional description of the AWS account.

AWS access key ID: The access key ID for programmatic access, set up for the AWS IAM user.

AWS secret access key: The secret access key for programmatic access, set up for the AWS IAM user.

AWS partition: The partition under which the AWS user is registered.

The dialog continually validates the settings that have been entered. When the settings are valid, the Status box will contain the text "Valid" and a green icon will be shown. Click on OK to save the settings.
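As a rough illustration of the fields listed above, the following Python sketch models one location entry and performs simple local checks. It does not reproduce the dialog's own validation, which is not described here. The class name `S3LocationSettings`, the "Invalid" status text and the default partition are assumptions of this sketch; the partition identifiers `aws`, `aws-cn` and `aws-us-gov` are the ones AWS publishes for its commercial, China and GovCloud (US) regions.

```python
from dataclasses import dataclass
from typing import List

# Partition identifiers published by AWS for its commercial, China,
# and GovCloud (US) regions.
KNOWN_PARTITIONS = ("aws", "aws-cn", "aws-us-gov")


@dataclass(frozen=True)
class S3LocationSettings:
    """Hypothetical record mirroring the fields of the configuration dialog."""

    name: str
    access_key_id: str
    secret_access_key: str
    partition: str = "aws"  # assumed default, not taken from the manual
    description: str = ""   # optional, as in the dialog

    def problems(self) -> List[str]:
        """Return a list of locally detectable problems (empty if none)."""
        issues = []
        if not self.name.strip():
            issues.append("Name is missing")
        if not self.access_key_id.strip():
            issues.append("AWS access key ID is missing")
        if not self.secret_access_key.strip():
            issues.append("AWS secret access key is missing")
        if self.partition not in KNOWN_PARTITIONS:
            issues.append(f"Unknown AWS partition: {self.partition!r}")
        return issues

    def status(self) -> str:
        # "Valid" matches the dialog text; "Invalid" is an assumed counterpart.
        return "Valid" if not self.problems() else "Invalid"
```

Checks like these only catch missing or malformed entries. Whether the keys actually grant access can only be confirmed against AWS itself.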

Image data_dialog_add
Figure 6.3: The dialog for adding an AWS account configuration

When one or more AWS data locations have been added, they will be listed as data locations when importing and exporting data.

When the connection status icon at the bottom of the CLC Workbench looks like (Image cloud_access_16_n_p), a connection has been established to Amazon S3.

Importing data from AWS S3

Next-generation sequencing (NGS) data can be imported from configured AWS S3 locations using the NGS import tools or using on-the-fly import when running a workflow. The configured AWS S3 locations are available in the import tool wizard and the workflow wizard (figure 6.4). CLC format files can also be imported on-the-fly as part of a workflow run.

Image ngs_import
Figure 6.4: Configured AWS S3 locations can be selected when using NGS importers.

Exporting data to AWS S3

To export data to an AWS S3 location, launch the exporter, and in the final configuration step, select the desired export location from the drop-down menu (figure 6.5). New folders in S3 can be created by right-clicking on a folder or bucket and selecting "New folder".

Image export_to_cloud
Figure 6.5: An AWS S3 location can be selected as the export destination.
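For background on folders in S3: a bucket has a flat key namespace, and a "folder" is a key prefix ending in "/". Tools such as the AWS console represent an empty folder as a zero-byte object whose key is that prefix. The Python sketch below shows how such keys are composed. Whether the Workbench creates a marker object in this way is not stated here and is an assumption. The function names `folder_marker_key` and `export_key` are hypothetical.

```python
def folder_marker_key(parent_prefix: str, folder_name: str) -> str:
    """Compose the key prefix that represents a new folder in a bucket.

    `parent_prefix` is the existing folder path inside the bucket
    (empty for the bucket root); `folder_name` is the new folder's name.
    """
    if not folder_name or "/" in folder_name:
        raise ValueError("folder name must be non-empty and contain no '/'")
    parent = parent_prefix.strip("/")
    # Folder keys end with '/', which is what makes listings show a folder.
    return f"{parent}/{folder_name}/" if parent else f"{folder_name}/"


def export_key(folder_key: str, file_name: str) -> str:
    """Compose the object key for a file exported into a folder."""
    return folder_key + file_name
```

For example, a folder named `exports` created under `projects/run1/` corresponds to the prefix `projects/run1/exports/`, and a file exported into it receives a key that starts with that prefix.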