Migrating data from S3 to Google Cloud Storage

I need to move a large number of files (on the order of tens of terabytes in total) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500 MB.
So far I have tried using gsutil cp with the parallel option (-m), with S3 as the source and GS as the destination directly. Even after tweaking the multi-processing and multi-threading parameters, I haven't been able to get above 30mb/s.
What I am now contemplating:
Load the data in batches from S3 into HDFS using distcp, and then find a way of distcp-ing all the data into Google Storage (not supported as far as I can tell), or:
Set up a Hadoop cluster where each node runs a parallel gsutil cp job with S3 and GS as src and dst.
If the first option were supported, I would really appreciate details on how to do that. However, it seems like I'm gonna have to find out how to do the second one. I'm unsure how to pursue this avenue, because I would need to keep track of the gsutil resumable transfer feature on many nodes and I'm generally inexperienced at running this sort of Hadoop job.
Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.

You could set up a Google Compute Engine (GCE) account and run gsutil from GCE to import the data. You can start up multiple GCE instances, each importing a subset of the data. That's part of one of the techniques covered in the talk we gave at Google I/O 2013 called Importing Large Data Sets into Google Cloud Storage.
One other thing you'll want to do if you use this approach is to use the gsutil cp -L and -n options. -L creates a manifest that records details about what has been transferred, and -n allows you to avoid re-copying files that were already copied (in case you restart the copy from the beginning, e.g., after an interruption). I suggest you update to gsutil version 3.30 (which will come out in the next week or so), which improves how the -L option works for this kind of copying scenario.
Mike Schwartz, Google Cloud Storage team
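To make "each instance imports a subset of the data" concrete, here is a minimal sketch of one way to split the object list, assuming every instance knows its own index and the total number of instances. The bucket name, the object names, and the helper names (shard_for, objects_for_instance) are made up for illustration; hashing the name is just one possible scheme, not something gsutil does for you.

```python
import hashlib

def shard_for(object_name: str, num_instances: int) -> int:
    """Map an object name to an instance index in a stable, repeatable way."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_instances

def objects_for_instance(object_names, instance_index, num_instances):
    """Return the subset of object names that one instance should copy."""
    return [name for name in object_names
            if shard_for(name, num_instances) == instance_index]

# Hypothetical listing; in practice it would come from listing the S3 bucket.
names = [f"s3://my-source-bucket/data/part-{i:05d}" for i in range(1000)]
shards = [objects_for_instance(names, i, 4) for i in range(4)]
```

Because the assignment depends only on the object name, an instance that restarts computes the same subset again, which fits well with the -n and -L options described above.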

Google has recently released the Cloud Storage Transfer Service which is designed to transfer large amounts of data from S3 to GCS:
https://cloud.google.com/storage/transfer/getting-started
(I realize this answer is a little late for the original question but it may help future visitors with the same question.)

Related

How to copy Big Data from GCS to S3?

How to copy a few terabytes of data from GCS to S3?
There's a nice "Transfer" feature in GCS that lets you import data from S3 to GCS. But how do I do the export, the other way (besides moving the data generation jobs to AWS)?
Q: Why not gsutil?
Yes, gsutil supports s3://, but the transfer is limited by that machine's network throughput. How can I do it more easily in parallel?
I tried Dataflow (now Apache Beam), which would work fine because it's easy to parallelize across a hundred or so nodes, but I don't see a simple "just copy it from here to there" function.
UPDATE: Also, Beam seems to compute the list of source files on the local machine in a single thread before starting the pipeline. In my case that takes around 40 minutes. It would be nice to distribute that in the cloud.
UPDATE 2: So far I'm inclined to write two scripts of my own:
Script A: Lists all objects to transfer and enqueues a transfer task for each one into a Pub/Sub queue.
Script B: Executes these transfer tasks. Runs in the cloud (e.g. on Kubernetes), with many instances in parallel.
The drawback is that this means writing code that may contain bugs, rather than using a built-in solution like GCS "Transfer".
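For what it's worth, the Script A / Script B split can be sketched locally like this. It is only an illustration of the pattern: an in-process queue.Queue stands in for Pub/Sub, threads stand in for the Kubernetes workers, and fake_copy is a hypothetical placeholder for the real S3-to-GCS copy. A worker on a real queue would keep polling instead of exiting when the queue is empty.

```python
import queue
import threading

def enqueue_tasks(object_names, task_queue):
    """Script A: put one transfer task per object onto the queue."""
    for name in object_names:
        task_queue.put(name)

def run_worker(task_queue, copy_object, done, failed, lock):
    """Script B: pull tasks until the queue is empty, recording the outcome of each."""
    while True:
        try:
            name = task_queue.get_nowait()
        except queue.Empty:
            return
        try:
            copy_object(name)
            outcome = done
        except Exception:
            outcome = failed  # keep going; failed tasks can be re-enqueued later
        with lock:
            outcome.append(name)

def run_transfer(object_names, copy_object, num_workers=4):
    task_queue = queue.Queue()
    enqueue_tasks(object_names, task_queue)
    done, failed, lock = [], [], threading.Lock()
    workers = [
        threading.Thread(target=run_worker,
                         args=(task_queue, copy_object, done, failed, lock))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return done, failed

def fake_copy(name):
    """Hypothetical stand-in for the real copy; fails for names ending in 7."""
    if name.endswith("7"):
        raise IOError("simulated failure")

done, failed = run_transfer([f"obj-{i}" for i in range(20)], fake_copy, num_workers=4)
```

Tracking failures separately is the part that tends to matter in practice, since it lets the listing script re-enqueue only what did not make it across.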
You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine).
Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40 KB per file, which makes it not very cost-efficient for small files; for those it is better to use an Infrequent Access storage class or to zip the content
I gave it a try and created a vault, but I couldn't run the aws glacier command. I get an SSL error, apparently related to a Python library, whether I run it on my Mac or from a dedicated container.
Also, it seems that using the Glacier API directly is a pain (as is keeping the right file information), and that it's simpler to go through a dedicated bucket.
What about that? Is there something in AWS that does what I want? Or any advice on doing it in a not too tedious way? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volume, plus a charge per request. It is less efficient to store many small files in Glacier. Instead, it is recommended to create archives (e.g. zip files) so that there are fewer, larger files. This can make it harder to retrieve specific files.
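A rough back-of-the-envelope sketch of why small files are inefficient, taking the ~40 KB per-file overhead quoted in the question as an assumption (check current AWS pricing for the real figure). The file counts and sizes are invented for illustration.

```python
# Assumption: the ~40 KB per-object overhead mentioned in the question.
OVERHEAD_PER_OBJECT_BYTES = 40 * 1024

def billed_bytes(num_objects, avg_object_bytes, overhead=OVERHEAD_PER_OBJECT_BYTES):
    """Stored bytes including the fixed per-object overhead."""
    return num_objects * (avg_object_bytes + overhead)

raw = 1_000_000 * 10 * 1024                      # one million 10 KB files
small = billed_bytes(1_000_000, 10 * 1024)       # stored one object per file
bundled = billed_bytes(1_000, 1_000 * 10 * 1024) # same data in 1,000 archives
overhead_ratio = small / raw
```

With these made-up numbers the per-file layout is billed for five times the raw data, while the bundled layout adds well under one percent.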
If you are going to use Glacier directly, it is much easier to use a utility such as Cloudberry Backup. However, these utilities are designed to back up from a computer to Glacier. They probably won't back up S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
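As an illustration of what such a lifecycle policy looks like, here is a small sketch that builds the rule as a JSON document. The rule ID, the empty prefix (meaning the whole bucket), and the number of days are illustrative choices, not values from the question.

```python
import json

def glacier_lifecycle_config(days, prefix="", rule_id="archive-to-glacier"):
    """Build a lifecycle configuration that moves matching objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": rule_id,                 # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix = every object
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }

config = glacier_lifecycle_config(days=1)
print(json.dumps(config, indent=2))
```

The printed JSON should be usable with aws s3api put-bucket-lifecycle-configuration, or you can set up the equivalent rule in the S3 management console.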
Using an S3 archiving bucket did the job.
Here is how I proceeded:
First, I created an S3 bucket called mycompany-archive, with a lifecycle rule that changes the storage class to Glacier 1 day after file creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp replacements, and did the following one:
Replace ^\S*\s\S*\s(.*)$ with aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
That gave me one big command, from which I removed the trailing && \ at the end and the lines corresponding to the buckets I didn't want to copy (mainly, mycompany-archive itself had to be removed). With that, I had what I needed to do the transfers.
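The same editing step could be scripted. Below is a minimal sketch that applies the regexp from above to the output of aws s3 ls and joins the resulting commands with && \. The sample listing and the build_copy_command helper are invented for illustration; the output format of aws s3 ls is assumed to be "date time bucket-name" per line.

```python
import re

# Same pattern as in the editor replacement: skip date and time, capture the bucket name.
LS_LINE = re.compile(r"^\S*\s\S*\s(.*)$")

def build_copy_command(ls_output, archive_bucket, skip=()):
    """Turn `aws s3 ls` output into one chained copy command."""
    commands = []
    for line in ls_output.splitlines():
        match = LS_LINE.match(line.strip())
        if not match:
            continue
        bucket = match.group(1)
        # Never copy the archive bucket into itself; also drop unwanted buckets.
        if bucket == archive_bucket or bucket in skip:
            continue
        commands.append(
            f"aws s3 cp --recursive s3://{bucket} s3://{archive_bucket}/{bucket}"
        )
    return " && \\\n".join(commands)

# Hypothetical listing for illustration.
sample = """2016-03-01 10:15:02 mycompany-archive
2015-07-21 08:00:11 old-reports
2014-11-05 17:42:37 legacy-uploads"""
command = build_copy_command(sample, "mycompany-archive")
print(command)
```

Joining with && \ instead of appending it to every line avoids having to trim the trailing one by hand.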
That command could be executed directly, but I prefer to run such commands using the screen util, to make sure the process wouldn't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under macOS, I ran caffeinate to make sure the computer wouldn't sleep before the transfer was over. To do so, I issued ps|grep aws to locate the process ID of the command, and then caffeinate -w 31299 (the process ID) to ensure my Mac wouldn't go to sleep before the process was done.
It did the job (well, it's still running). I now have a bucket containing a folder for each archived bucket. The next step will be to remove the undesired S3 buckets.
Of course, this approach could be improved in many ways, mainly by turning everything into a fault-tolerant, replayable script. In this case I have to be pragmatic: thinking about how to improve it would take far more time for almost no gain.

Bulk transfer of all S3 assets to Google Cloud

I'm looking to move all S3 assets to Google Cloud for a bunch of reasons. However, I have ~25 buckets, with thousands of files in each. I'm aware of the Google Storage Transfer tool - https://cloud.google.com/storage/transfer/getting-started - but that only works on buckets one at a time. Is there anything to do all of them at once?
The Google Cloud Storage Transfer service is still your best bet, especially if your buckets are very, very large.
If your buckets aren't large enough to bother setting it up, you could use the gsutil command-line tool with a little bit of scripting to accomplish this:
for bucket in bucket1 bucket2 bucket3 bucket4 etc; do
  gsutil -m cp -r s3://$bucket/* gs://$bucket
done

Best way to copy millions of files from S3 to GCS?

I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket over to Google Cloud Storage but so far am having issues.
Currently I am using gsutil because it has native support for communicating between both S3 and GCS but I am getting less than great performance. Maybe I am just doing things wrong but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I would have enough horsepower, but the box doesn't appear to get any better throughput than a 2GB 2CPU box. I don't get it.
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure that increasing the thread count by a factor of 10 would help, but in my testing so far it hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDKs and am halfway tempted to write something in Java.
Google Cloud Storage Online Cloud Import was built specifically to import large volumes and large numbers of files to GCS from either a large list of URLs or from an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer, I am the PM for the project)

Speeding up S3 to GCS transfer using GCE and gsutil

I plan on using a GCE cluster and gsutil to transfer ~50 TB of data from Amazon S3 to GCS. So far I have a good way to distribute the load over however many instances I'll have to use, but I'm getting pretty slow transfer rates compared with what I achieved with my local cluster. Here are the details of what I'm doing:
Instance type: n1-highcpu-8-d
Image: debian-6-squeeze
typical load average during jobs: 26.43, 23.15, 21.15
average transfer speed on a 70gb test (for a single instance): ~21mbps
average file size: ~300mb
.boto process count: 8
.boto thread count: 10
I'm calling gsutil on around 400 S3 files at a time:
gsutil -m cp -InL manifest.txt gs://my_bucket
I need some advice on how to make this transfer faster on each instance. I'm also not 100% sure whether the n1-highcpu-8-d instance is the best choice. I was thinking of possibly parallelizing the job myself using Python, but I think that tweaking the gsutil settings could yield good results. Any advice is greatly appreciated.
If you're seeing 21Mbps per object and running around 20 objects at a time, you're getting around 420Mbps throughput from one machine. On the other hand, if you're seeing 21Mbps total, that suggests that you're probably getting throttled pretty heavily somewhere along the path.
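To put numbers on the two readings, here is a quick sketch of the arithmetic. It assumes the ~50 TB from the question means decimal terabytes and that the rate is in megabits per second; the function names are just for illustration.

```python
def aggregate_mbps(per_object_mbps, concurrent_objects):
    """Total throughput if each concurrent object transfers at the given rate."""
    return per_object_mbps * concurrent_objects

def transfer_days(total_terabytes, throughput_mbps):
    """Days needed to move the data at a sustained rate (decimal units assumed)."""
    total_bits = total_terabytes * 1e12 * 8
    seconds = total_bits / (throughput_mbps * 1e6)
    return seconds / 86400

per_machine = aggregate_mbps(21, 20)        # 21 Mbps per object, 20 objects at once
days_fast = transfer_days(50, per_machine)  # if 21 Mbps is per object
days_slow = transfer_days(50, 21)           # if 21 Mbps is the machine total
```

On a single machine that works out to roughly 11 days in the first case and roughly 220 days in the second, so it is worth finding out which one you are actually measuring.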
I'd suggest that you may want to use multiple smaller instances to spread the requests across multiple IP addresses; for example, using 4 n1-standard-2 instances may result in better total throughput than one n1-standard-8. You'll need to split up the files to transfer across the machines in order to do this.
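If object sizes vary a lot, splitting by count can leave one machine with most of the bytes. A minimal sketch of a size-aware split, assuming you already have (name, size) pairs from a bucket listing: it uses a simple greedy heuristic (largest object first, onto the least-loaded machine), which is my choice for illustration and not something gsutil provides. The object names and sizes are made up.

```python
import heapq

def balance_by_size(objects, num_machines):
    """Assign (name, size_bytes) pairs to machines, largest first, least-loaded machine next."""
    heap = [(0, i) for i in range(num_machines)]  # (bytes assigned so far, machine index)
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_machines)]
    for name, size in sorted(objects, key=lambda o: o[1], reverse=True):
        load, idx = heapq.heappop(heap)
        assignments[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignments

# Hypothetical objects with sizes in MB.
objects = [("a", 500), ("b", 300), ("c", 300), ("d", 200), ("e", 100), ("f", 100)]
plan = balance_by_size(objects, 2)
```

Each machine's list could then be fed to its own gsutil invocation.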
I'm also wondering, based on your comments, how many streams you're keeping open at once. In most of the tests I've seen, you get diminishing returns from extra threads/streams by the time you've reached 8-16 streams, and often a single stream is at least 60-80% as fast as multiple streams with chunking.
One other thing you may want to investigate is what download/upload speeds you're seeing; copying the data to local disk and then re-uploading it will let you get individual measurements for download and upload speed, and using local disk as a buffer might speed up the entire process if gsutil is blocking reading from one pipe due to waiting for writes to the other one.
One other thing you haven't mentioned is which zone you're running in. I'm presuming you're running in one of the US regions rather than an EU region, and downloading from Amazon's us-east S3 location.
Use the parallel_thread_count and parallel_process_count values in your boto configuration file (usually ~/.boto).
You can get more info on the -m option by typing:
gsutil help options