Best way to copy millions of files from S3 to GCS? - amazon-s3

I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket over to Google Cloud Storage but so far am having issues.
Currently I am using gsutil because it has native support for communicating between both S3 and GCS but I am getting less than great performance. Maybe I am just doing things wrong but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I could have enough horse power but it doesn't appear that the box is getting any better throughput than a 2GB 2CPU box, I don't get it?
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure increasing the thread count by a factor of 10x would help but from my testing so far hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDK's and am half way tempted to write something in Java.

Google Cloud Storage Online Cloud Import was built specifically to import large sizes and number of files to GCS from either a large list of URLs or from an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer, I am the PM for the project)

Related

Uploading large excel files to Big Query

how can i upload a large excel file and dataset (larger than 10 mb) to big query ?
Can anyone help ?
I tried to research a way to do it, but everything i found was a little bit complicated.
I just started working with SQL and I don't have much experience.
!!! WARNING prior to answer.
The below answer will incur costs. You can use --dry-run to estimate the number of bytes that are processed for bq and then combine it with Pricing calculator:
https://cloud.google.com/products/calculator/
What you are trying to achieve, is possible by loading a dataset from a google cloud storage bucket.
To make it easy for new users, there is a "Guide me" option from GCP on how to upload datasets larger than 10 mb.
GCP "Guide me" site
Don't forget to first estimate the costs.
An example of how this can be used using a Google Storage URI is referenced in this page:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
Again, use pricing calculator to estimate costs.
To be able to create buckets on a specific project, you need specific IAM permissions, also referenced here

Running Spark application using HDFS or S3

In my spark application, I just want to access a big file, and distribute the computation across many nodes on EC2.
Initially, my file is stored on S3.
It's very convenient for me to load the file with sc.textFile() function from S3.
However, I can put some efforts to load the data to HDFS and then read the data from there.
My question is, will the performance be better with HDFS?
My code involves the spark partitions(mapPartitions transforamtion), so does it really matter what is my initial file system?
Obviously when using S3 the latency is higher and the data throughput is lower compared to HDFS on local disk.
But it depends what you do with your data. It seems most of programs are limited more by CPU power than network throughput. So you should be fine with the 1Gbps throughput that you get from S3.
Anyway you can check recent slides from Aaron Davidson's talk on Spark Summit 2015. This topic is discussed there.
http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users/16

Bulk transfer of all S3 assets to Google Cloud

I'm looking to move all S3 assets to Google Cloud for a bunch of reasons. However, I have ~25 buckets, with thousands of files in each. I'm aware of the Google Storage Transfer tool - https://cloud.google.com/storage/transfer/getting-started - but that only works on buckets one at a time. Is there anything to do all of them at once?
The Google Cloud Storage Transfer service is still your best bet, especially if your buckets are very, very large.
If your buckets aren't large enough to bother setting it up, you could use the gsutil command-line tool with a little bit of scripting to accomplish this:
for bucket in bucket1 bucket2 bucket3 bucket4 etc; do
gsutil -m cp -r s3://$bucket/* gs://$bucket
done

Speeding up S3 to GCS transfer using GCE and gsutil

I plan on using a GCE cluster and gsutil to transfer ~50Tb of data from Amazon S3 to GCS. So far I have a good way to distribute the load over however many instances I'll have to use but I'm getting pretty slow transfer rates in comparison to what I achieved with my local cluster. Here are the details of what I'm doing
Instance type: n1-highcpu-8-d
Image: debian-6-squeeze
typical load average during jobs: 26.43, 23.15, 21.15
average transfer speed on a 70gb test (for a single instance): ~21mbps
average file size: ~300mb
.boto process count: 8
.boto thread count: 10
Im calling gsutil on around 400 s3 files at a time:
gsutil -m cp -InL manifest.txt gs://my_bucket
I need some advice on how to make this transfer faster on each instance. I'm also not 100% on whether the n1-highcpu-8-d instance is the best choice. I was thinking of possibly parallelizing the job myself using python, but I think that tweaking the gsutil settings could yield good results. Any advice is greatly appreciated
If you're seeing 21Mbps per object and running around 20 objects at a time, you're getting around 420Mbps throughput from one machine. On the other hand, if you're seeing 21Mbps total, that suggests that you're probably getting throttled pretty heavily somewhere along the path.
I'd suggest that you may want to use multiple smaller instances to spread the requests across multiple IP addresses; for example, using 4 n1-standard-2 instances may result in better total throughput than one n1-standard-8. You'll need to split up the files to transfer across the machines in order to do this.
I'm also wondering, based on your comments, how many streams you're keeping open at once. In most of the tests I've seen, you get diminishing returns from extra threads/streams by the time you've reached 8-16 streams, and often a single stream is at least 60-80% as fast as multiple streams with chunking.
One other thing you may want to investigate is what download/upload speeds you're seeing; copying the data to local disk and then re-uploading it will let you get individual measurements for download and upload speed, and using local disk as a buffer might speed up the entire process if gsutil is blocking reading from one pipe due to waiting for writes to the other one.
One other thing you haven't mentioned is which zone you're running in. I'm presuming you're running in one of the US regions rather than an EU region, and downloading from Amazon's us-east S3 location.
use the parallel_thread_count and parallel_process_count values in your boto configuration (usually, ~/.boto) file.
You can get more info on the -m option by typing:
gsutil help options

Migrating data from S3 to Google cloud storage

I need to move a large amount of files (on the order of tens of terabytes) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500mb.
So far I have tried using gsutil cp with the parallel option (-m) to using S3 as source and GS as destination directly. Even tweaking the multi-processing and multi-threading parameters I haven't been able to achieve a performance of over 30mb/s.
What I am now contemplating:
Load the data in batches from S3 into hdfs using distcp and then finding a way of distcp-ing all the data into google storage (not supported as far as I can tell), or:
Set up a hadoop cluster where each node runs a gsutil cp parallel job with S3 and GS as src and dst
If the first option were supported, I would really appreciate details on how to do that. However, it seems like I'm gonna have to find out how to do the second one. I'm unsure of how to pursue this avenue because I would need to keep track of the gsutil resumable transfer feature on many nodes and I'm generally inexperienced running this sort of hadoop job.
Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.
You could set up a Google Compute Engine (GCE) account and run gsutil from GCE to import the data. You can start up multiple GCE instances, each importing a subset of the data. That's part of one of the techniques covered in the talk we gave at Google I/O 2013 called Importing Large Data Sets into Google Cloud Storage.
One other thing you'll want to do if you use this approach is to use the gsutil cp -L and -n options. -L creates a manifest that records details about what has been transferred, and -n allows you to avoid re-copying files that were already copied (in case you restart the copy from the beginning, e.g., after an interruption). I suggest you update to gsutil version 3.30 (which will come out in the next week or so), which improves how the -L option works for this kind of copying scenario.
Mike Schwartz, Google Cloud Storage team
Google has recently released the Cloud Storage Transfer Service which is designed to transfer large amounts of data from S3 to GCS:
https://cloud.google.com/storage/transfer/getting-started
(I realize this answer is a little late for the original question but it may help future visitors with the same question.)