What can I do to speed up S3 uploads/updates?

I have been trying all day to upload something small to S3: about 20k files in 500 directories, totaling about 3 GB. That seems absolutely reasonable for a service called Simple Storage Service. I can upload to other places at about 500 KB/s to 1 MB/s on average (between 1.8 and 3.6 GB/h). To S3, after a full day of trying, my aggregate rate has been dismal (something like 100 MB/h).
I have tried:
the S3 web console, with a variety of browsers on a variety of operating systems
boto, using a variety of scripts I've written and found online (mainly here on SO)
My problems, which I was hoping you could help me diagnose, are the following:
Dragging and dropping to the S3 console takes about an hour just to count the 20k files. Why? Unless I can solve this, the web console is mostly useless to me.
The upload itself is extremely slow, seldom faster than 100 KB/s.
After uploading all day I noticed a simple problem with the filenames. Not wanting to spend all night uploading again, I used the script from "Amazon S3 boto: How do you rename a file in a bucket?", which everyone claims works really fast. It manages to rename about one 200 KB file every 2-3 seconds. Why?
After uploading, making all the files public (using the web console) has taken about 4 hours and still has not finished.
It is really frustrating; there must be something I am doing wrong. I expect everything to work about 10x faster, and it doesn't. I've read that S3 runs faster if I split the files, and that the region (I'm in NYC) is really important. What change will give me the biggest increase in upload speed?

Maybe the slow upload can be fixed by changing the AWS region.
I just figured out what the problem was in my case. Upload duration for a 35 MB file:
Oregon, US (us-west-2): 5-6 min
Frankfurt, Germany (eu-central-1): 1 min (that's about the maximum of my connection)
I'm based in Vienna, not in the US, so check your AWS region.

The upload itself is extremely slow
You can try Bucket Explorer, which runs upload operations in hundreds of parallel queues, so the upload process is faster.
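Any client that uploads many files concurrently should help in a similar way, because with about 20k small files the per-request latency tends to dominate, not the bandwidth. Here is a minimal sketch in Python, assuming boto3 is installed and AWS credentials are configured. The names `iter_files`, `upload_tree`, and `make_s3_uploader`, and the default of 32 workers, are illustrative choices and not something from the question.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def iter_files(root):
    """Yield (local_path, s3_key) pairs for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Use forward slashes in keys regardless of the local OS.
            key = os.path.relpath(path, root).replace(os.sep, "/")
            yield path, key


def upload_tree(root, upload_one, workers=32):
    """Call upload_one(path, key) for every file under root, in parallel.

    Returns a list of (key, exception) pairs for the uploads that failed.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload_one, p, k): k for p, k in iter_files(root)}
        for fut in as_completed(futures):
            err = fut.exception()
            if err is not None:
                failures.append((futures[fut], err))
    return failures


def make_s3_uploader(bucket):
    """Build an upload_one callable backed by boto3 (imported lazily)."""
    import boto3  # third-party; needs AWS credentials configured

    s3 = boto3.client("s3")  # low-level clients can be shared across threads

    def upload_one(path, key):
        s3.upload_file(path, bucket, key)

    return upload_one
```

A call would look like `upload_tree("./site", make_s3_uploader("my-bucket"))`, where both the directory and the bucket name are placeholders. The right worker count depends on your connection, so it is worth trying a few values.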
After uploading, making all the files public (using the web console) has taken like 4 hours and still it has not finished.
You can set a policy on the bucket that gives public access to its objects.
The following example policy allows access to anonymous users.
{
  "Id": "ds",
  "Statement": [{
    "Action": "s3:GetObject",
    "Effect": "Allow",
    "Principal": {"AWS": "*"},
    "Resource": [
      "arn:aws:s3:::testbucket",
      "arn:aws:s3:::testbucket/*"
    ],
    "Sid": "1"
  }],
  "Version": "2008-10-17"
}
Disclosure: I am one of the developers of Bucket Explorer.
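A bucket policy like this can also be generated and applied from a script. The sketch below assumes boto3 and a placeholder bucket name; `public_read_policy` and `apply_policy` are hypothetical helper names. It differs from the policy above in two ways: it uses the newer "2012-10-17" policy version, and it lists only the object ARN pattern, since s3:GetObject is an object-level action. On accounts where S3 Block Public Access is enabled, the call is likely to be rejected until that setting is relaxed for the bucket.

```python
import json


def public_read_policy(bucket):
    """Return a bucket policy (as a dict) that lets anyone GET objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                # Object-level action, so only the object ARN pattern is listed.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def apply_policy(bucket):
    """Attach the public-read policy to the bucket (needs AWS credentials)."""
    import boto3  # third-party, imported lazily

    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(public_read_policy(bucket)))
```

Because the policy applies to every current and future object, there is no need to change the ACL of each of the 20k files one by one.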

Related

How to resolve this error in Google Data Fusion: "Stage x contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."

I need to move data from a parameterized S3 bucket into Google Cloud Storage: a basic data dump. I don't own the S3 bucket. Paths have the following syntax:
s3://data-partner-bucket/mykey/folder/date=2020-10-01/hour=0
I was able to transfer data at hourly granularity using the Amazon S3 Client provided by Data Fusion. I wanted to bring over a day's worth of data, so I reset the path in the client to:
s3://data-partner-bucket/mykey/folder/date=2020-10-01
It seemed like it was working until it stopped. The status is "Stopped." When I review the logs just before it stopped I see a warning, "Stage 0 contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."
I examined the data in the S3 bucket. Each folder contains a series of log files. None of them are "big". The largest folder contains a total of 3MB of data.
I saw a similar question for this error, but the answer involved Spark coding that I don't have access to in Data Fusion.
Screenshot of Advanced Settings in Amazon S3 Client
These are the settings I see in the client. Maybe there is another setting somewhere I need to set? What do I need to do so that Data Fusion can import these files from S3 to GCS?

When you deploy the pipeline, you are redirected to a new page with a ribbon at the top. One of the tools in the ribbon is Configure.
In the Resources section of the Configure modal you can specify the memory resources. I fiddled around with the numbers: 1000 MB worked; 6 MB was not enough (for me).
I processed 756K records in about 46 min.

Best way to copy millions of files from S3 to GCS?

I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket to Google Cloud Storage, but so far I am having issues.
Currently I am using gsutil because it has native support for both S3 and GCS, but I am getting less than great performance. Maybe I am just doing things wrong, but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I would have enough horsepower, but the box doesn't appear to get any better throughput than a 2GB 2CPU box. I don't get it.
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure that increasing the thread count by a factor of 10 would help, but in my testing so far it hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDKs and am halfway tempted to write something in Java.

Google Cloud Storage Online Cloud Import was built specifically to import large volumes and large numbers of files to GCS, from either a large list of URLs or an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer, I am the PM for the project)
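If you stay with gsutil, one thing that may be worth trying is to split the transfer by top-level prefix and run each piece as its own process, or on its own machine, since a single box did not scale for you. The sketch below only builds the command lines. It uses `gsutil -m rsync -r` instead of the `cp -R` from the question, because rsync can be rerun after an interruption. The function name and the example prefixes are hypothetical.

```python
import shlex


def sharded_copy_commands(src_bucket, dst_bucket, prefixes):
    """Build one `gsutil -m rsync -r` command per key prefix."""
    commands = []
    for prefix in prefixes:
        prefix = prefix.strip("/")
        commands.append([
            "gsutil", "-m", "rsync", "-r",
            f"s3://{src_bucket}/{prefix}",
            f"gs://{dst_bucket}/{prefix}",
        ])
    return commands


# Print the commands so each can be started on a different worker.
for cmd in sharded_copy_commands("bucket", "bucket", ["2019", "2020"]):
    print(shlex.join(cmd))
```

Whether this helps depends on where the bottleneck is, so measure one shard before launching many.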

Putting an aged sites files on Amazon AWS (S3)

For the last 6 or years, I've been running an online video hosting site.
Back then, hosting your content on services like S3 wasn't as big a thing as it is today, so everything is currently stored on our very expensive dedicated servers. The overages on data are much more expensive.
My question is this: what is the process for moving TBs of data onto a service like Amazon AWS when the site is coded to link to files on our current server?
Editing hundreds of thousands of video links to point to the new Amazon AWS locations surely wouldn't be ideal?
Is there an easier approach to this dilemma?
All files are in a single folder structure, i.e. example.com/files/video.mp4
Would it make more sense to leave the current files where they are and just put all future videos on AWS?
I feel like I'm missing a piece of how this works. If, for example, Imgur (which has trillions of images) wanted to move from AWS to another similar storage service, it would be impossible to re-link trillions of links.
Any help is appreciated.

High load image uploader/resizer in conjunction with Amazon S3

We are running a product-oriented service that requires us to download and resize thousands and thousands of photos daily from various web sources, then upload them to an Amazon S3 bucket and use CloudFront to serve them.
The problem is that downloading and resizing is really resource-intensive, and it would take many hours to process them all.
What we are looking for is a service that would do this for us quickly and, of course, for a reasonable price.
Does anybody know of such a service? I tried to google it, but I don't really know how to phrase the search to get what I need.

S3: Duplicate buckets

What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.

One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).

Cool, I may look into writing a script to host on EC2. The main purpose of the backup is to guard against human error on our side, for example if a user accidentally deletes a bucket or something like that.

If you're worried about deletion, you should probably look at S3's new Versioning feature.

I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job of backing up the data, so I don't think you necessarily need to worry about data loss, unless you back up for archival purposes or you want to safeguard against accidentally deleting files.

You can write an application or service that creates two AmazonS3Client instances, one for the source and one for the destination. The source client loops over the source bucket and streams objects in, and the destination client streams them out to the destination bucket.
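The same two-client idea can be sketched in Python with boto3 (AmazonS3Client is the Java SDK class; the boto3 calls below are a swapped-in equivalent, not the code this answer had in mind). The function name `copy_bucket` is hypothetical, and the clients are passed in so that each can be created with the credentials of its own account.

```python
def copy_bucket(src_client, src_bucket, dst_client, dst_bucket):
    """Stream every object from src_bucket to dst_bucket; return the count."""
    copied = 0
    paginator = src_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            body = src_client.get_object(Bucket=src_bucket, Key=key)["Body"]
            # upload_fileobj reads from the stream, so large objects are
            # sent in parts without being written to local disk first.
            dst_client.upload_fileobj(body, dst_bucket, key)
            copied += 1
    return copied
```

The clients could be created with something like `boto3.Session(profile_name="source").client("s3")` and `boto3.Session(profile_name="backup").client("s3", region_name="eu-west-1")`, where the profile names and the region are assumptions. Every byte passes through the machine running the script, so running it on EC2, as suggested above, keeps the transfer inside Amazon's network.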

Note: this doesn't work for cross-account syncing, but this works for cross-region on the same account.
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
There are a variety of ways to run this nightly. One example is the AWS Instance Scheduler (personally unverified): https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html
