Speeding up S3 to GCS transfer using GCE and gsutil - amazon-s3

I plan on using a GCE cluster and gsutil to transfer ~50Tb of data from Amazon S3 to GCS. So far I have a good way to distribute the load over however many instances I'll have to use but I'm getting pretty slow transfer rates in comparison to what I achieved with my local cluster. Here are the details of what I'm doing
Instance type: n1-highcpu-8-d
Image: debian-6-squeeze
typical load average during jobs: 26.43, 23.15, 21.15
average transfer speed on a 70gb test (for a single instance): ~21mbps
average file size: ~300mb
.boto process count: 8
.boto thread count: 10
Im calling gsutil on around 400 s3 files at a time:
gsutil -m cp -InL manifest.txt gs://my_bucket
I need some advice on how to make this transfer faster on each instance. I'm also not 100% on whether the n1-highcpu-8-d instance is the best choice. I was thinking of possibly parallelizing the job myself using python, but I think that tweaking the gsutil settings could yield good results. Any advice is greatly appreciated

If you're seeing 21Mbps per object and running around 20 objects at a time, you're getting around 420Mbps throughput from one machine. On the other hand, if you're seeing 21Mbps total, that suggests that you're probably getting throttled pretty heavily somewhere along the path.
I'd suggest that you may want to use multiple smaller instances to spread the requests across multiple IP addresses; for example, using 4 n1-standard-2 instances may result in better total throughput than one n1-standard-8. You'll need to split up the files to transfer across the machines in order to do this.
I'm also wondering, based on your comments, how many streams you're keeping open at once. In most of the tests I've seen, you get diminishing returns from extra threads/streams by the time you've reached 8-16 streams, and often a single stream is at least 60-80% as fast as multiple streams with chunking.
One other thing you may want to investigate is what download/upload speeds you're seeing; copying the data to local disk and then re-uploading it will let you get individual measurements for download and upload speed, and using local disk as a buffer might speed up the entire process if gsutil is blocking reading from one pipe due to waiting for writes to the other one.
One other thing you haven't mentioned is which zone you're running in. I'm presuming you're running in one of the US regions rather than an EU region, and downloading from Amazon's us-east S3 location.

use the parallel_thread_count and parallel_process_count values in your boto configuration (usually, ~/.boto) file.
You can get more info on the -m option by typing:
gsutil help options

Related

How to copy Big Data from GCS to S3?

How to copy a few terabytes of data from GCS to S3?
There's nice "Transfer" feature in GCS that allows to import data from S3 to GCS. But how to do the export, the other way (besides moving data generation jobs to AWS)?
Q: Why not gsutil?
Yes, gsutil supports s3://, but transfer is limited by that machine network throughput. How to easier do it in parallel?
I tried Dataflow (aka Apache Beam now), that would work fine, because it's easy to parallelize on like a hundred of nodes, but don't see there's simple 'just copy it from here to there' function.
UPDATE: Also, Beam seems to be computing a list of source files on the local machine in a single thread, before starting the pipeline. In my case that takes around 40 minutes. Would be nice to distribute it on the cloud.
UPDATE 2: So far I'm inclined to use two own scripts that would:
Script A: Lists all objects to transfer, and enqueue a transfer task for each one into a PubSub queue.
Script B: Executes these transfer tasks. Runs on cloud (e.g. Kubernetes), many instances in parallel
The drawback is that it's writing a code that may contain bugs etc, not using a built-in solution like GCS "Transfer".
You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine).
Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.

Best way to copy millions of files from S3 to GCS?

I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket over to Google Cloud Storage but so far am having issues.
Currently I am using gsutil because it has native support for communicating between both S3 and GCS but I am getting less than great performance. Maybe I am just doing things wrong but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I could have enough horse power but it doesn't appear that the box is getting any better throughput than a 2GB 2CPU box, I don't get it?
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure increasing the thread count by a factor of 10x would help but from my testing so far hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDK's and am half way tempted to write something in Java.
Google Cloud Storage Online Cloud Import was built specifically to import large sizes and number of files to GCS from either a large list of URLs or from an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer, I am the PM for the project)

Migrating data from S3 to Google cloud storage

I need to move a large amount of files (on the order of tens of terabytes) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500mb.
So far I have tried using gsutil cp with the parallel option (-m) to using S3 as source and GS as destination directly. Even tweaking the multi-processing and multi-threading parameters I haven't been able to achieve a performance of over 30mb/s.
What I am now contemplating:
Load the data in batches from S3 into hdfs using distcp and then finding a way of distcp-ing all the data into google storage (not supported as far as I can tell), or:
Set up a hadoop cluster where each node runs a gsutil cp parallel job with S3 and GS as src and dst
If the first option were supported, I would really appreciate details on how to do that. However, it seems like I'm gonna have to find out how to do the second one. I'm unsure of how to pursue this avenue because I would need to keep track of the gsutil resumable transfer feature on many nodes and I'm generally inexperienced running this sort of hadoop job.
Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.
You could set up a Google Compute Engine (GCE) account and run gsutil from GCE to import the data. You can start up multiple GCE instances, each importing a subset of the data. That's part of one of the techniques covered in the talk we gave at Google I/O 2013 called Importing Large Data Sets into Google Cloud Storage.
One other thing you'll want to do if you use this approach is to use the gsutil cp -L and -n options. -L creates a manifest that records details about what has been transferred, and -n allows you to avoid re-copying files that were already copied (in case you restart the copy from the beginning, e.g., after an interruption). I suggest you update to gsutil version 3.30 (which will come out in the next week or so), which improves how the -L option works for this kind of copying scenario.
Mike Schwartz, Google Cloud Storage team
Google has recently released the Cloud Storage Transfer Service which is designed to transfer large amounts of data from S3 to GCS:
https://cloud.google.com/storage/transfer/getting-started
(I realize this answer is a little late for the original question but it may help future visitors with the same question.)

Moving 1 million image files to Amazon S3

I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried doing an RSYNC and it took RSYNC over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as GZIP them to another local hard drive and then transfer / unzip that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories or is it fine to have all million+ files in the same directory?
One option might be to perform the migration in a lazy fashion.
All new images go to Amazon S3.
Any requests for images not yet on Amazon trigger a migration of that one image to Amazon S3. (queue it up)
This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
Putting a million files into one flat directory is a bad idea from a performance perspective. while some file systems would cope with this fairly well with O(logN) filename lookup times, others do not with O(N) filename lookup. Multiply that by N to access all files in a directory. An additional problem is that utilities that need to access files in order of file names may slow down significantly if they need to sort a million file names. (This may partly explain why rsync took 1 day to do the indexing.)
Putting all of your image files in one directory is a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
One option you could use instead of transferring the files over the network is to put them on a harddrive and ship it to amazon's import/export service. You don't have to worry about saturating your server's network connection etc.

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belong to one of 2M files. For each file, we want to keep the last 10 entries, so we can look them up later.
The way our client application works :
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations!)
We then gave a shot at Riak. But it seems we can't get more than 60 write/sec on each node, which is very very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak - ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not within the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. Its a fully distributed file store. Its accessible via a REST-based API, and can run on commodity hardware. You can define the number of copies retained on a file by file basis. It supports an Eventually Consistent model so you can balance file consistency with performance.