I need to move data from an parameterized S3 Bucket into Google Cloud Storage. Basic Data dump. I don't own the S3 bucket. It has the following syntax,
s3://data-partner-bucket/mykey/folder/date=2020-10-01/hour=0
I was able to transfer data at the hourly granularity using the Amazon S3 Client provided by Data Fusion. I wanted to bring over a days worth of data so I reset the path in the client to:
s3://data-partner-bucket/mykey/folder/date=2020-10-01
It seemed like it was working until it stopped. The status is "Stopped." When I review the logs just before it stopped I see a warning, "Stage 0 contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."
I examined the data in the S3 bucket. Each folder contains a series of log files. None of them are "big". The largest folder contains a total of 3MB of data.
I saw a similar question for this error, but the answer involved Spark coding that I don't have access to in Data Fusion.
Screenshot of Advanced Settings in Amazon S3 Client
These are the settings I see in the client. Maybe there is another setting somewhere I need to set? What do I need to do so that Data Fusion can import these files from S3 to GCS?
When you deploy the pipeline you are redirected to a new page with a Ribbon at the top. one of the tools in the Ribbon is Configure.
In the resources section of the Configure Modal you can specify the memory resources. Fiddled around with the numbers. 1000MB worked. 6MB was not enough. (For me.)
I processed 756K records in about 46 min.
I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket over to Google Cloud Storage but so far am having issues.
Currently I am using gsutil because it has native support for communicating between both S3 and GCS but I am getting less than great performance. Maybe I am just doing things wrong but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I could have enough horse power but it doesn't appear that the box is getting any better throughput than a 2GB 2CPU box, I don't get it?
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure increasing the thread count by a factor of 10x would help but from my testing so far hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDK's and am half way tempted to write something in Java.
Google Cloud Storage Online Cloud Import was built specifically to import large sizes and number of files to GCS from either a large list of URLs or from an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer, I am the PM for the project)
For the last 6 or years, I've been running an online video hosting site.
Back then, hosting your content on services like S3 wasn't the biggie that it is today, so everything is currently stored on our very expensive dedicated servers. The overages on data are much more expensive.
My question is this: What is the process of moving TB's of data on to a service like Amazon Aws where the site is all coded to link to files on our current server.
Editing hundreds-of-thousands of video links to point to the new Amazon AWS locations surely wouldn't be the ideal?
In this situation, is there a more "easy" approach to this dilemma.
All files are in a singular folder structure. IE; example.com/files/video.mp4
Would it possibly be more of a, leave the current files were they are and just have all future videos put on AWS?
I feel like I'm missing a piece of how this works as if for example IMGUR (who has trillions of images) wanted to move from AWS to another similar storage, it would be impossible to re-link trillions of links.
Any help is appreciated.
we are running a product oriented service, that requires us to daily download and resize thousands and thousands of photos from various web sources and then upload them to Amazon's S3 bucket and use Cloud Front to serve them...
now the problem is that downloading and resizing is really resource consuming and it would take a lot of hours to process them all...
What we are looking for is a service, that would do this for us fast and of course for a reasonable price...
anybody knows such a service? I tried to google it but don't really know how to form the search to get what I need
What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.
One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).
Cool, I may look into writing a script to host on Ec2. The main purpose of the backup is to guard against human error on our side -- if a user accidentally deletes a bucket or something like that.
If you're worried about deletion, you should probably look at S3's new Versioning feature.
I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job backup up the data so I don't think you necessarily need to worry about data loss, unless your back up for archival purposes, or you want to safeguard against accidentally deleting files.
You can make an application or service that responsible to create two instances of AmazonS3Client one for the source and the other for the destination, then the source AmazonS3Client start looping in the source bucket and streaming objects in, and the destination AmazonS3Client streaming them out to the destination bucket.
Note: this doesn't work for cross-account syncing, but this works for cross-region on the same account.
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
There are a variety of ways to run this nightly. One is example is the AWS instance-schedule (personally unverified) https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html