Exporting data from Google Cloud Storage to Amazon S3

I would like to transfer data from a table in BigQuery, into another one in Redshift.
My planned data flow is as follows:
BigQuery -> Google Cloud Storage -> Amazon S3 -> Redshift
I know about Google Cloud Storage Transfer Service, but I'm not sure it can help me. From Google Cloud documentation:
Cloud Storage Transfer Service
This page describes Cloud Storage Transfer Service, which you can use
to quickly import online data into Google Cloud Storage.
I understand that this service can be used to import data into Google Cloud Storage and not to export from it.
Is there a way I can export data from Google Cloud Storage to Amazon S3?

You can use gsutil to copy data from a Google Cloud Storage bucket to an Amazon bucket, using a command such as:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
Note that the -d option above will cause gsutil rsync to delete objects from your S3 bucket that aren't present in your GCS bucket (in addition to adding new objects). You can leave off that option if you just want to add new objects from your GCS to your S3 bucket.
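Since -d deletes objects on the S3 side, it can be worth previewing the sync first. The sketch below wraps the command above in a small function; gsutil rsync has a -n flag that only reports what would be done. The function name sync_gcs_to_s3 and the --delete / --dry-run options are made up for this example, and the GSUTIL variable is only there so the command can be printed (GSUTIL=echo) instead of executed.

```shell
#!/bin/bash
# Sketch: preview a GCS -> S3 sync before running it for real.
# GSUTIL may be overridden (e.g. GSUTIL=echo) to print the command instead of running it.
GSUTIL="${GSUTIL:-gsutil}"

# usage: sync_gcs_to_s3 <gcs-bucket> <s3-bucket> [--delete] [--dry-run]
# (hypothetical helper, not part of gsutil)
sync_gcs_to_s3() {
    if [ "$#" -lt 2 ]; then
        echo "usage: sync_gcs_to_s3 <gcs-bucket> <s3-bucket> [--delete] [--dry-run]" >&2
        return 2
    fi
    local src="gs://$1" dst="s3://$2"
    shift 2
    local flags="-r"
    local arg
    for arg in "$@"; do
        case "$arg" in
            --delete)  flags="${flags}d" ;;   # -d: delete destination objects missing from the source
            --dry-run) flags="${flags}n" ;;   # -n: only report what would be copied or deleted
            *) echo "unknown option: $arg" >&2; return 2 ;;
        esac
    done
    "$GSUTIL" -m rsync "$flags" "$src" "$dst"
}
```

For example, sync_gcs_to_s3 your-gcs-bucket your-s3-bucket --delete --dry-run shows the planned changes, and dropping --dry-run performs them.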

Go to any instance or Cloud Shell in GCP.
First of all, configure your AWS credentials on that GCP machine:
aws configure
If the aws command is not recognised, install the AWS CLI by following this guide: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
For details on aws configure, follow this URL:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Then use gsutil:
gsutil -m rsync -rd gs://storagename s3://bucketname
16 GB of data transferred in a few minutes.

Using Rclone (https://rclone.org/).
Rclone is a command line program to sync files and directories to and from
Google Drive
Amazon S3
Openstack Swift / Rackspace cloud files / Memset Memstore
Dropbox
Google Cloud Storage
Amazon Drive
Microsoft OneDrive
Hubic
Backblaze B2
Yandex Disk
SFTP
The local filesystem
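A minimal sketch of what the Rclone route could look like, assuming two remotes have already been set up with rclone config. The remote names gcs and s3, the function name, and the RCLONE override (set RCLONE=echo to print the command instead of running it) are assumptions for this example. It uses rclone copy, which adds and updates files at the destination without deleting anything there.

```shell
#!/bin/bash
# Sketch: copy a GCS bucket to an S3 bucket through two pre-configured rclone remotes.
# Assumes remotes named "gcs" and "s3" exist (created with `rclone config`).
RCLONE="${RCLONE:-rclone}"

# usage: rclone_gcs_to_s3 <gcs-bucket> <s3-bucket> [extra rclone flags...]
# (hypothetical helper, not part of rclone)
rclone_gcs_to_s3() {
    local src="gcs:$1" dst="s3:$2"
    shift 2
    # Extra flags such as --dry-run or --progress are passed straight through
    "$RCLONE" copy "$@" "$src" "$dst"
}
```

For example, rclone_gcs_to_s3 your-gcs-bucket your-s3-bucket --dry-run previews the copy.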

Using the gsutil tool we can do a wide range of bucket and object management tasks, including:
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects.
Moving, copying, and renaming objects.
We can copy data from a Google Cloud Storage bucket to an Amazon S3 bucket using the gsutil rsync and gsutil cp operations.
gsutil rsync collects all metadata from the bucket and syncs the data to S3:
gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket
gsutil cp copies the files one by one; as the transfer rate is good, it copies approximately 1 GB per minute. Copying a whole bucket needs the -r option:
gsutil cp -r gs://<gcs-bucket> s3://<s3-bucket-name>
If you have a large number of files with a high data volume, use the Bash script below and run it in the background, with multiple threads, using the screen command on an Amazon or GCP instance with AWS credentials configured and GCP auth verified.
Before running the script, list all the files, redirect the listing to a file, and have the script read that file as its input:
gsutil ls gs://<gcs-bucket> > file_list_part.out
Bash script:
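#!/bin/bash
# Copy every object listed in file_list_part.out from GCS to S3, one at a time.
echo "start processing"
input="file_list_part.out"
while IFS= read -r line
do
    now=$(date '+%Y-%m-%d %H:%M:%S')   # timestamp for the log line
    echo "command :: gsutil cp ${line} s3://<bucket-name> :: $now"
    # Call gsutil directly rather than through eval, so unusual object names stay intact
    if ! gsutil cp "${line}" "s3://<bucket-name>"; then
        echo "Error copying file"
        exit 1
    fi
    echo "Copy completed successfully"
done < "$input"
echo "completed processing"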
Execute the Bash script and write the output to a log file, so you can check the progress of completed and failed files:
bash file_copy.sh > /root/logs/file_copy.log 2>&1
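The script above stops at the first failed copy. If you would rather keep going and retry the failures later, a variant along these lines might help. This is only a sketch: the function name copy_list, the failed-list file, and the COPY_CMD override (which defaults to gsutil cp) are assumptions for this example.

```shell
#!/bin/bash
# Sketch: copy every URL in a list file, keep going on errors,
# and record the failed URLs so they can be retried later.
# COPY_CMD may be overridden for testing; it is word-split on purpose.
COPY_CMD="${COPY_CMD:-gsutil cp}"

# usage: copy_list <list-file> <s3-url> <failed-list-file>
# (hypothetical helper) returns non-zero if any copy failed
copy_list() {
    local input="$1" dest="$2" failed="$3"
    local line failures=0
    : > "$failed"                         # start with an empty failed list
    while IFS= read -r line; do
        [ -z "$line" ] && continue        # skip blank lines
        # stdin is redirected so the copy command cannot swallow the list
        if $COPY_CMD "$line" "$dest" < /dev/null; then
            echo "OK   $line"
        else
            echo "FAIL $line"
            echo "$line" >> "$failed"
            failures=$((failures + 1))
        fi
    done < "$input"
    [ "$failures" -eq 0 ]
}
```

For example, copy_list file_list_part.out "s3://<bucket-name>" failed.out, and afterwards failed.out can be fed back in as the list file for a retry.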

I needed to transfer 2 TB of data from a Google Cloud Storage bucket to an Amazon S3 bucket.
For the task, I created a Google Compute Engine instance with 8 vCPUs (30 GB).
Allow login using SSH on the Compute Engine instance.
Once logged in, create an empty .boto configuration file and add the AWS credential information to it. I added the AWS credentials by taking the reference from the mentioned link.
Then run the command:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
The data transfer rate is ~1 GB/s.
Hope this helps.
(Do not forget to terminate the compute instance once the job is done)

For large numbers of large files (100 MB+) you might get issues with broken pipes and other annoyances, probably due to the multipart upload requirement (as Pathead mentioned).
In that case you're left with simply downloading all the files to your machine and uploading them back. Depending on your connection and the amount of data, it might be more effective to create a VM instance, to take advantage of its high-speed connection and the ability to run the transfer in the background on a machine other than yours.
Create a VM (make sure the service account has access to your buckets), connect via SSH, install the AWS CLI (apt install awscli) and configure access to S3 (aws configure).
Run these two lines, or make them into a bash script if you have many buckets to copy:
gsutil -m cp -r "gs://$1" ./
aws s3 cp --recursive "./$1" "s3://$1"
(It's better to use rsync in general, but cp was faster for me)
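For the many-buckets case, the two lines above could be wrapped roughly as follows. This is a sketch that assumes, as above, that each S3 bucket has the same name as its GCS counterpart and that the working directory has enough free disk space. The function names and the GSUTIL / AWS overrides (set them to echo to print the commands instead of running them) are invented for this example.

```shell
#!/bin/bash
# Sketch: download each GCS bucket to local disk, then upload it to the
# same-named S3 bucket. GSUTIL and AWS may be overridden for a dry run.
GSUTIL="${GSUTIL:-gsutil}"
AWS="${AWS:-aws}"

# usage: relay_bucket <bucket-name>   (hypothetical helper)
relay_bucket() {
    local bucket="$1"
    "$GSUTIL" -m cp -r "gs://$bucket" ./ || return 1
    "$AWS" s3 cp --recursive "./$bucket" "s3://$bucket" || return 1
}

# usage: relay_all <bucket-name>...   (stops at the first bucket that fails)
relay_all() {
    local b
    for b in "$@"; do
        relay_bucket "$b" || { echo "failed: $b" >&2; return 1; }
    done
}
```

For example, relay_all bucket-one bucket-two copies both buckets in turn.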

Tools like gsutil and aws s3 cp won't use multipart uploads/downloads, so will have poor performance for large files.
Skyplane is a much faster alternative for transferring data between clouds (up to 110x for large files). You can transfer data with the command:
skyplane cp -r s3://aws-bucket-name/ gcs://google-bucket-name/
(disclaimer: I am a contributor)

Related

gsutil cannot copy to s3 due to authentication

I need to copy many (1000+) files to S3 from GCS to leverage an AWS Lambda function. I have edited ~/.boto.cfg and commented out the 2 aws authentication parameters but a simple gsutil ls s3://mybucket fails from either a GCE or EC2 VM.
The error is: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.
I use gsutil version 4.28, and the locations of the GCS and S3 buckets are respectively US-CENTRAL1 and US East (Ohio), in case this is relevant.
I am clueless as the AWS key is valid and I enabled http/https. Downloading from GCS and uploading to S3 using my laptop's Cyberduck is impracticable (>230 GB).

As per https://issuetracker.google.com/issues/62161892, gsutil v4.28 does support AWS v4 signatures by adding to ~/.boto a new [s3] section like
[s3]
# Note that we specify region as part of the host, as mentioned in the AWS docs:
# http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
host = s3.us-east-2.amazonaws.com
use-sigv4 = True
This section is inherited from boto, but it is currently not created by gsutil config, so it needs to be added explicitly for the target endpoint.
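If you need this section for a different bucket region, a small helper like the one below can print it. This is a sketch only: the function name is made up, and it assumes the standard regional endpoint form s3.<region>.amazonaws.com, which should be checked against the AWS endpoint list for your region. US East (Ohio) is us-east-2.

```shell
#!/bin/bash
# Sketch: print the [s3] section for ~/.boto for a given AWS region.
# Assumes the regional endpoint has the form s3.<region>.amazonaws.com.

# usage: boto_s3_section <aws-region>   (hypothetical helper)
boto_s3_section() {
    local region="$1"
    printf '[s3]\nhost = s3.%s.amazonaws.com\nuse-sigv4 = True\n' "$region"
}
```

For example, boto_s3_section us-east-2 >> ~/.boto appends the section for an Ohio bucket. Check first that ~/.boto does not already contain an [s3] section.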
For s3-to-GCS, I will consider the more server-less Storage Transfer Service API.

I had a similar problem. Here is what I ended up doing on a GCE machine:
Step 1: Using gsutil, I copied files from GCS to my GCE hard drive
Step 2: Using the AWS CLI (aws s3 cp ...), I copied files from the GCE hard drive to the S3 bucket
The above methodology has worked reliably for me. I tried using gsutil rsync but it failed unexpectedly.
Hope this helps.

gsutil - How to copy/download all files from Google private cloud?

Google Play Developer account reports are stored in a private Google Cloud Storage bucket.
Every Google Play Developer account has a Google Cloud Storage bucket ID.
To access it, I have installed gsutil on my Windows machine.
Now I am using this command to copy all files from the bucket:
gsutil cp -r dir gs://[bucket_id]
It says:
CommandException: No URLs matched
When I list all directories on bucket, this command works
gsutil ls gs://[bucket_id]
Can anyone help here to understand the gsutil exception ?

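This exception occurs because the source and destination are the wrong way round. gsutil cp takes the source first and the destination last, so in your command dir is treated as a local source path, and nothing matches it.
To download everything from the bucket, it should be like...
gsutil cp -r gs://[bucket_id] [local_destination_dir]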

Limiting 'ls' command output in s3fs

My Amazon S3 bucket has millions of files and I am mounting it using s3fs. Anytime a ls command is issued (not intentionally) the terminal hangs.
Is there a way to limit the number of results returned to 100 when a ls command is issued in a s3fs mounted path?

Try goofys (https://github.com/kahing/goofys). It doesn't limit the number of items returned for ls, but ls is about 40x faster than with s3fs when there are lots of files.

It is not recommended to use s3fs in production situations. Amazon S3 is not a filesystem, so attempting to mount it can lead to some synchronization issues (and other issues like you have experienced).
It would be better to use the AWS Command-Line Interface (CLI), which has commands to list, copy and sync files to/from Amazon S3. It can also do partial listing of S3 buckets by path.
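As a rough illustration of a capped listing with the AWS CLI, the sketch below uses aws s3api list-objects-v2 with --prefix and the CLI's --max-items pagination option, so at most 100 entries come back by default. The function name and the AWS override (set AWS=echo to print the command instead of running it) are assumptions for this example. The output is JSON and includes a NextToken when more objects remain.

```shell
#!/bin/bash
# Sketch: list at most N objects under a prefix instead of the whole bucket.
# AWS may be overridden (e.g. AWS=echo) to print the command instead of running it.
AWS="${AWS:-aws}"

# usage: list_first_keys <bucket> <prefix> [max-items, default 100]
# (hypothetical helper, not part of the AWS CLI)
list_first_keys() {
    local bucket="$1" prefix="$2" max="${3:-100}"
    "$AWS" s3api list-objects-v2 --bucket "$bucket" --prefix "$prefix" --max-items "$max"
}
```

For example, list_first_keys mybucket logs/ returns up to 100 objects whose keys start with logs/.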

migration from s3 to google cloud storage and ACL

I am currently planning a possible migration from S3 to Google Cloud Storage (GCS). I have decided to spin up a GCE instance and use gsutil to rsync several million files. I would like to know whether the permissions will be preserved.
For example, if a file has public read on Amazon S3, what will the ACL be on GCS?
thanks

If you use the gsutil cp command you can specify a canned ACL on the command line, like this:
gsutil cp -R -a public-read s3://your-s3-bucket gs://your-gs-bucket
The rsync command doesn't have a way to do that. However, the other option is you can set a default object ACL on the destination bucket, using the gsutil defacl command. Then you can use gsutil cp without specifying the canned ACL, or you could use gsutil rsync.
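The default-ACL route could look roughly like this: set the default object ACL on the destination bucket with gsutil defacl set, then run the rsync. It is only a sketch; the function name and the GSUTIL override (set GSUTIL=echo to print the commands instead of running them) are made up for this example. Note that every copied object gets the same ACL, so this does not carry over per-object differences from S3.

```shell
#!/bin/bash
# Sketch: give the destination bucket a default object ACL, then sync S3 -> GCS.
# GSUTIL may be overridden (e.g. GSUTIL=echo) to print the commands instead of running them.
GSUTIL="${GSUTIL:-gsutil}"

# usage: migrate_with_default_acl <s3-bucket> <gcs-bucket> <canned-acl>
# (hypothetical helper, not part of gsutil)
migrate_with_default_acl() {
    local src="s3://$1" dst="gs://$2" acl="$3"
    # Objects written to the bucket from now on pick up this default object ACL
    "$GSUTIL" defacl set "$acl" "$dst" || return 1
    "$GSUTIL" -m rsync -r "$src" "$dst"
}
```

For example, migrate_with_default_acl your-s3-bucket your-gs-bucket public-read.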

getting large datasets onto amazon elastic map reduce

There are some large datasets (25gb+, downloadable on the Internet) that I want to play around with using Amazon EMR. Instead of downloading the datasets onto my own computer, and then re-uploading them onto Amazon, what's the best way to get the datasets onto Amazon?
Do I fire up an EC2 instance, download the datasets (using wget) into S3 from within the instance, and then access S3 when I run my EMR jobs? (I haven't used Amazon's cloud infrastructure before, so not sure if what I just said makes any sense.)

I recommend the following...
fire up your EMR cluster
elastic-mapreduce --create --alive --other-options-here
log on to the master node and download the data from there
wget http://blah/data
copy into HDFS
hadoop fs -copyFromLocal data /data
There's no real reason to put the original dataset through S3. If you want to keep the results you can move them into S3 before shutting down your cluster.
If the dataset is represented by multiple files you can use the cluster to download it in parallel across the machines. Let me know if this is the case and I'll walk you through it.
Mat

If you're just getting started and experimenting with EMR, I'm guessing you want these on s3 so you don't have to start an interactive Hadoop session (and instead use the EMR wizards via the AWS console).
The best way would be to start a micro instance in the same region as your S3 bucket, download to that machine using wget and then upload to S3 using something like s3cmd (which you'll probably need to install on the instance). On Ubuntu:
wget -O dataset http://example.com/mydataset
sudo apt-get install s3cmd
s3cmd --configure
s3cmd put dataset s3://mybucket/
The reason you'll want your instance and S3 bucket in the same region is to avoid extra data transfer charges. Although you'll be charged for inbound bandwidth to the instance for the wget, the transfer to S3 will be free.

I'm not sure about it, but to me it seems like hadoop should be able to download files directly from your sources.
just enter http://blah/data as your input, and hadoop should do the rest. It certainly works with s3, why should it not work with http?
