How to make AWS S3 replication faster?

I want to back up my S3 storage to Glacier, so I set up some replication rules on the bucket. But when I check the metrics, the processing speed is about 6,000 objects per day, which is very slow. I have 200k objects in my bucket, so it may take a month to complete the replication. Is there any way to make replication faster?
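For reference, the kind of replication rule described above can be expressed with boto3 roughly like this (the bucket names, rule ID, and IAM role ARN are placeholders, and this is only a sketch of the setup, not a recommendation; both buckets would need versioning enabled):

```python
import boto3

s3 = boto3.client("s3")

# Sketch of a replication rule like the one described above.
# Bucket names, rule ID, and the IAM role ARN are placeholders.
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-to-glacier",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # apply to the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::backup-bucket",
                    # store the replicas directly in the Glacier storage class
                    "StorageClass": "GLACIER",
                },
            }
        ],
    },
)
```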

Related

DynamoDB backup and restore using Data Pipeline: how long does it take to back up and recover?

I'm planning to use Data Pipeline as a backup and recovery tool for our DynamoDB tables. We will be using Amazon's prebuilt pipelines to back up to S3, and use the prebuilt recovery pipeline to recover to a new table in case of a disaster.
This will also serve the dual purpose of data archival for legal and compliance reasons. We have explored snapshots, but these can get quite expensive compared to S3. Does anyone have an estimate of how long it takes to back up a 1TB database, and how long it takes to recover a 1TB database?
I've read the Amazon docs, which say it can take up to 20 minutes to restore from a snapshot, but there is no mention of how long a Data Pipeline restore takes. Does anyone have any clues?
Does the newly released feature of exporting from DynamoDB to S3 do what you want for your use case? To use this feature, you must have continuous backups enabled though. Perhaps that will give you the short term backup you need?
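As a rough illustration (the table name, ARN, and bucket below are placeholders), driving that export from boto3 looks something like this:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# The export-to-S3 feature requires point-in-time recovery (continuous backups).
# In practice PITR may need a moment to become active after being enabled.
dynamodb.update_continuous_backups(
    TableName="MyTable",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Export the table contents to S3 without consuming table read capacity.
export = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",
    S3Bucket="my-dynamodb-exports",
    S3Prefix="backups/",
    ExportFormat="DYNAMODB_JSON",
)
print(export["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```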
It would be interesting to know why you're not planning to use the built-in backup mechanism. It offers point-in-time recovery and it is highly predictable in terms of cost and performance.
The Data Pipeline backup is unpredictable, will very likely cost more, and is operationally much less reliable. Plus, getting a consistent snapshot (i.e. point in time) requires stopping the world. Speaking from experience, I don't recommend using Data Pipeline for backing up DynamoDB tables!
Regarding how long it takes to take a backup, that depends on a number of factors, but mostly on the size of the table, the provisioned capacity you're willing to throw at it, and the size of the EMR cluster you're willing to work with. So it could take anywhere from a minute to several hours.
Restore time depends on pretty much the same variables, provisioned capacity and total size, and it can also take anywhere from a minute to many hours.
Point-in-time backups offer consistent, predictable and, most importantly, reliable performance regardless of the size of the table: use that!
And if you're just interested in dumping the data from the table (i.e. not necessarily the restore part), use the new export to S3.
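For completeness, a point-in-time restore of the kind recommended above looks roughly like this with boto3 (the table names are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Restore into a new table at the latest restorable time.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="MyTable",
    TargetTableName="MyTable-restored",
    UseLatestRestorableTime=True,
)
```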

How to have EMRFS consistent view on S3 buckets with retention policy?

I am using an AWS EMR compute cluster (version 5.27.0), which uses S3 for data persistence.
This cluster both reads from and writes to S3.
S3 is eventually consistent, so newly written data cannot always be listed immediately. Because of this, I use EMRFS consistent view, which records newly written paths in DynamoDB so they can be listed right away.
The problem now is that I have to set a retention policy on S3, under which data more than a month old gets deleted from S3. However, the corresponding entries do not get deleted from the EMRFS DynamoDB table, leading to consistency issues.
My question is: how can I ensure that when the S3 retention policy deletes objects, the same paths get deleted from the DynamoDB table?
One naive solution I have come up with is to define a Lambda that fires periodically and manually sets a TTL of, say, 1 day on the DynamoDB records. Is there a better approach than this?
You can configure DynamoDB with the same expiration policy as your S3 objects:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
That way you ensure DynamoDB and S3 hold the same set of existing objects.
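A minimal sketch of that idea (the table name below is the EMRFS default, EmrFSMetadata, and the TTL attribute name is an assumption; something, e.g. the periodic Lambda you mention, still has to stamp an epoch expiry attribute onto the records):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Turn on DynamoDB TTL for the EMRFS metadata table so items expire
# automatically. The attribute must hold an expiry time as a Unix epoch
# number; something (e.g. the Lambda from the question) still has to
# write that attribute onto the records.
dynamodb.update_time_to_live(
    TableName="EmrFSMetadata",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expiresAt"},
)
```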

AWS EMR S3DistCp - does AWS charge for S3 actions?

I was looking through the S3DistCp documentation (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html), but I could not find anything explicit about the cost of each action. This is a sample scenario:
I have a bucket with 10,000 objects.
An EMR cluster with 3 instances will be used to execute s3distcp.
s3distcp has --deleteOnSuccess option enabled.
Source and destination bucket, along with the EMR, are on the same account and region.
The question is:
Apart from the EMR usage (based on instance type, elapsed time, etc.), is there an additional cost for S3 actions (i.e. if I move all 10K objects, does that mean 10K COPY requests)? Or will billing be associated only with the EMR costs?
AWS S3 costs will be billed as separate line items from your AWS EMR costs, so you should factor in S3 costs in addition to the EMR charges. S3 costs in this scenario will include charges for requests and possibly data transfer as per the S3 pricing guide.
That said, the associated S3 charges are likely to be trivial relative to the cost of an EMR cluster.
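As a rough back-of-the-envelope check (the request mix per object is an assumption about how S3DistCp copies, and the prices are illustrative S3 Standard rates; check the current pricing page):

```python
# Rough cost estimate for moving 10,000 objects with S3DistCp.
# Assumed request mix per object: one GET on the source, one PUT on the
# destination, and one DELETE (--deleteOnSuccess). Prices below are
# illustrative S3 Standard rates; check the current pricing page.
objects = 10_000
put_per_1k = 0.005   # PUT/COPY/POST/LIST, USD per 1,000 requests
get_per_1k = 0.0004  # GET, USD per 1,000 requests
# DELETE requests are free in S3 Standard.

request_cost = objects / 1000 * (put_per_1k + get_per_1k)
print(f"~${request_cost:.2f} in S3 request charges")  # ~$0.05
```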

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EBS) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds to Glacier are not necessarily good, but since there is no online element to this process it seems feasible, especially since I could always increase the size of my local EBS volume if I'm crawling faster than I can store to Glacier.
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
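For concreteness, the "combine, compress, and shuttle to Glacier" thread described above might look roughly like this with boto3 (the vault name and paths are placeholders, and large archives would really want the multipart upload API instead):

```python
import tarfile
import boto3

glacier = boto3.client("glacier")

def archive_batch(page_dir: str, batch_name: str) -> str:
    """Tar+gzip a directory of crawled pages and upload it to Glacier."""
    tarball = f"/mnt/crawl/{batch_name}.tar.gz"      # local EBS scratch space
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(page_dir, arcname=batch_name)

    with open(tarball, "rb") as body:
        resp = glacier.upload_archive(
            vaultName="crawl-archive",               # placeholder vault
            archiveDescription=batch_name,
            body=body,
        )
    # Keep the archive ID somewhere durable; it is needed for retrieval.
    return resp["archiveId"]
```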
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling, take a snapshot to push the volume into S3 and decommission the volume and EC2 instance. Then, when you're ready to do the analysis, just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
Personally, I would probably push the data into Redshift. :-)
--
Chris
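A rough sketch of that snapshot approach with boto3 (the volume ID and availability zone are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# When crawling is done: snapshot the EBS volume, then the volume and the
# instance can be decommissioned.
snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="crawl data, ready for later analysis",
)

# Later, once the snapshot has completed and you're ready to analyze,
# recreate a volume from it in the AZ of your analysis instance.
ec2.create_volume(
    SnapshotId=snap["SnapshotId"],
    AvailabilityZone="us-east-1a",
)
```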
If your analysis will not be immediate, then you can adopt one of the following two approaches:
Approach 1) Amazon EC2 crawler -> store on EBS disks -> move the files frequently to Amazon S3 -> archive regularly to Glacier (a lifecycle sketch follows after the tips below). You can keep your last X days of data in Amazon S3 and use it for ad hoc processing as well.
Approach 2) Amazon EC2 crawler -> store on EBS disks -> move the files frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools.
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store on EBS disks -> move the files frequently to Amazon S3 -> analyze with EMR or other tools, store the processed results in S3/DB/MPP, and move the raw files to Glacier.
Approach 4) If your data is structured, then Amazon EC2 crawler -> store on EBS disks -> load them into Amazon Redshift and move the raw files to Glacier.
Additional tips:
If you can retrieve the data again (from the source), then you can use ephemeral disks for your crawlers instead of EBS.
Amazon has introduced the Data Pipeline service; check whether it fits your needs for data movement.
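For Approach 1, the "archive regularly to Glacier" step can be an S3 lifecycle rule rather than a manual move; a sketch with boto3 (the bucket name, prefix, and 30-day window are assumptions):

```python
import boto3

s3 = boto3.client("s3")

# Transition raw crawl output to Glacier after it has aged out of the
# "last X days" window kept in S3 for ad hoc processing.
s3.put_bucket_lifecycle_configuration(
    Bucket="crawl-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-crawl",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```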

Should I persist images on EBS or S3?

I am migrating my Java, Tomcat, MySQL server to AWS EC2.
I have already attached an EBS volume for storing MySQL data. In my web application people may upload images, so I need to persist them. There are two alternatives in my mind:
Save uploaded images to the EBS volume.
Use the S3 service.
The following are my notes; please be skeptical about them, as my expertise is in software development, not servers.
EBS plus: S3 storage is more expensive ($0.15/GB > $0.10/GB).
S3 plus: Serving static files from EBS may hurt my web server's performance. Is this true? Does serving images affect server performance notably? With S3, my server will not be responsible for serving static files.
S3 plus: Serving static files from EBS incurs I/O cost, though probably a minor one.
EBS plus: People say EBS is faster.
S3 plus: People say S3 is safer for persistence.
EBS plus: No need to learn an API; it is straightforward to save the images to an EBS volume.
In short, I cannot decide, and would be happy if you could guide me.
Thanks
The price comparison is not quite right:
S3 charges are $0.14 per GB USED, whereas EBS charges are $0.10 per GB PROVISIONED (the size of your EBS volume), whether you use it or not. As a result, S3 may or may not be cheaper than EBS.
I'm currently using S3 for a project and it's working extremely well.
EBS means you need to manage a volume + machines to attach it to. You need to add space as it's filling up and perform backups (not saying you shouldn't back up your S3 data, just that it's not as critical).
It also makes it harder to scale: when you want to add additional machines, you either need to pull off the images to a separate machine or clone the images across all. This also means you're adding a bottleneck: you'll have to manage your own upload process that will either upload to all machines or have a single machine managing it.
I recommend S3: it's set and forget. Any number of machines can be performing uploads in parallel and you don't really need to notify other machines about the upload.
In addition, you can use Amazon CloudFront as a cheap CDN in front of the images instead of downloading directly from S3.
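For illustration, the S3-plus-CloudFront upload path is only a few lines of boto3 (the bucket name and distribution domain below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-app-images"                       # placeholder bucket
CDN_DOMAIN = "dxxxxxxxxxxxx.cloudfront.net"    # placeholder distribution

def store_upload(local_path: str, key: str) -> str:
    """Upload a user image to S3 and return the CDN URL to serve it from."""
    s3.upload_file(
        local_path, BUCKET, key,
        ExtraArgs={"ContentType": "image/jpeg"},
    )
    # Serve through CloudFront rather than hitting S3 directly.
    return f"https://{CDN_DOMAIN}/{key}"

# e.g. store_upload("/tmp/upload.jpg", "uploads/42/avatar.jpg")
```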
I have architected solutions on AWS for stock photography sites that store millions of images spanning terabytes of data. I would like to share some AWS best practices for your requirement:
P1) Store the original image files using the S3 Standard storage class.
P2) Store reproducible images, like thumbnails, using the S3 Reduced Redundancy Storage (RRS) option to save costs.
P3) Metadata about the images, including the S3 URL, can be stored in Amazon RDS or Amazon DynamoDB, depending on query complexity, and queried from there. If your queries are complex, it is also common practice to store the metadata in Amazon CloudSearch or Apache Solr.
P4) Deliver your thumbnails to users with low latency using Amazon CloudFront.
P5) Queue your image conversions through either SQS or RabbitMQ on Amazon EC2 (see the sketch after this list).
P6) If you are planning to use EBS, note that a volume does not scale across multiple EC2 instances, so ideally you can use GlusterFS as a common storage pool for all your images. Multiple auto-scaled Amazon EC2 instances can still connect to it and read/write images.
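A rough sketch tying P1, P2 and P5 together (the bucket names, queue URL, and key layout are assumptions, not a prescribed implementation):

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-conversion"

def store_original_and_enqueue(image_bytes: bytes, image_id: str) -> None:
    # P1: the original goes to the default Standard storage class.
    s3.put_object(Bucket="photo-originals", Key=f"orig/{image_id}.jpg",
                  Body=image_bytes)
    # P5: queue the thumbnail-conversion work instead of doing it inline.
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"image_id": image_id}))

def store_thumbnail(thumb_bytes: bytes, image_id: str) -> None:
    # P2: thumbnails are reproducible, so Reduced Redundancy is enough.
    s3.put_object(Bucket="photo-thumbs", Key=f"thumbs/{image_id}.jpg",
                  Body=thumb_bytes, StorageClass="REDUCED_REDUNDANCY")
```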
You already outlined the advantages and disadvantages of both.
If you are planning to store terabytes of images, with storage requirements increasing day after day, S3 will probably be your best bet as it is built especially for these kinds of situations. You get unlimited storage space, without having to worry about sharding your data over many EBS volumes.
The recurring cost of S3 is that it is roughly 50% more expensive than EBS. You will also have to learn the API and implement it in your application, but that is a one-off expense which I think you should be able to absorb very quickly.
Do you expect the images to last indefinitely?
The Amazon EBS FAQ is pretty clear; the annual failure rate is not "essentially zero"; they quote 0.1% to 0.5%. It's better than the disk under your desk, but it would need some kind of backup.