Input data location considerations
Input data for your workflow may be on your local system or in the cloud, either in an S3 bucket or in Illumina BaseSpace.
- Input data from your local system
When a workflow is launched, data is uploaded to the GCE cache bucket[4.1] unless it is already present there.
The location of the data and the time of its latest modification are used to determine whether the most recent version is already in the cache bucket.
See also the section below about additional considerations relating to reference data.
- Input data from an S3 bucket
Your AWS administrator can grant role-based access to AWS S3 buckets to GCE. This is described in the GCE Administration manual.
If you select input data from an S3 bucket that GCE does not have permission to access directly, the files are transferred using presigned URLs. These grant GCE time-limited access to the selected files. By default, presigned URLs are valid for 7 days, which is the maximum AWS allows at the time of writing.
- There is no charge for data transfer when using data in an S3 bucket that is in the same region as GCE.
- There is a small charge for cross-region traffic when using data in an S3 bucket that GCE has access to but that is in a different region from GCE.
See the Amazon documentation for more on S3 pricing (https://aws.amazon.com/s3/pricing/). At the time of writing, AWS does not charge for uploading data to S3, but does charge for storage in S3 and for downloads from S3.
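The S3 access and charging rules above can be summarised in a short Python sketch. This is an illustration only, not GCE code: the `S3Input` class and the functions `access_mode`, `transfer_charge` and `presigned_url_expiry` are hypothetical names, and the only facts taken from the text are the two access routes, the same-region and cross-region charging rules, and the default 7-day validity of presigned URLs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Presigned URL lifetime stated in the text: 7 days by default.
PRESIGNED_URL_LIFETIME = timedelta(days=7)


@dataclass(frozen=True)
class S3Input:
    """Hypothetical description of an S3 input selection."""
    bucket_region: str         # AWS region of the bucket holding the data
    gce_has_role_access: bool  # True if GCE was granted role-based access


def access_mode(source: S3Input) -> str:
    """Return how GCE reaches the files, following the text above."""
    return "role-based access" if source.gce_has_role_access else "presigned URLs"


def transfer_charge(source: S3Input, gce_region: str) -> Optional[str]:
    """Return the transfer charge described in the text, or None when the
    text does not cover the case."""
    if source.bucket_region == gce_region:
        return "no charge"
    if source.gce_has_role_access:
        return "small cross-region charge"
    return None  # different region via presigned URLs: not described here


def presigned_url_expiry(issued_at: datetime) -> datetime:
    """Return the time at which a presigned URL issued at issued_at expires."""
    return issued_at + PRESIGNED_URL_LIFETIME


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(presigned_url_expiry(issued).isoformat())  # 2024-01-08T00:00:00+00:00
```

The region names used with this sketch are whatever AWS region identifiers apply to your setup. Cases the text does not describe return `None` rather than a guessed charge.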
Additional considerations relating to reference data
An analysis often needs data that it does not itself act upon, for example, reference sequences to map against, or target regions that limit the focus of the analysis. Such reference data flows into parameter input channels in workflows.
Reference data transfer costs differ depending on the data source:
- QIAGEN Reference Data Elements are already present in AWS S3 in all regions where GCE is supported.
These elements are thus not uploaded to S3 when a workflow is launched. Rather, a copy of the data already in AWS is used.
QIAGEN reference data is available under the QIAGEN Sets tab of the CLC Genomics Workbench.
- Other reference data is handled like any other data input: it is uploaded to the cache bucket unless the most recent version is already present there.
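The rule that data is uploaded only when the most recent version is not already in the cache bucket can be pictured with a small Python sketch. How GCE actually records and compares cached files is not described here, so the `FileVersion` record and the `needs_upload` and `select_uploads` functions are hypothetical. The sketch assumes only what the text states: location and time of latest modification together identify a version.

```python
from typing import Iterable, List, NamedTuple, Set


class FileVersion(NamedTuple):
    """Hypothetical identity of one version of an input file."""
    location: str    # where the file lives on the local system
    modified: float  # time of latest modification, in seconds since the epoch


def needs_upload(candidate: FileVersion, cached: Set[FileVersion]) -> bool:
    """A file is uploaded unless this exact location and modification time
    is already recorded for the cache bucket."""
    return candidate not in cached


def select_uploads(inputs: Iterable[FileVersion],
                   cached: Set[FileVersion]) -> List[FileVersion]:
    """Return the inputs that would have to be uploaded at workflow launch."""
    return [f for f in inputs if needs_upload(f, cached)]
```

Under this assumption, editing a local file changes its modification time, so it would be uploaded again at the next launch, while an untouched file would be reused from the cache bucket.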
Footnotes
- [4.1] The cache bucket is configured by your GCE administrator. It is a cloud-based location for the temporary storage of input data that you selected from a local system when launching a workflow. By default, files in the cache bucket are retained for 30 days after their last use. Your GCE administrator can adjust this period, so please check with them if in doubt.