DynamoDB S3 Imports - amazon-s3

When importing from S3 to DynamoDB, does this count towards provisioned write throughput?
I have a service that is only read from, except for batch updates from a multi-gigabyte file in S3. We don't want to pay for provisioned writes all month, and scaling from 0 writes to several million could take a while given the AWS policy of only allowing provisioned rates to double at one time.

Yes. EMR integration relies on the same API as any client application. As such, it is subject to the same throughput policy.
A minor correction:
minimum throughput = 1 (not 0)
maximum throughput = 10,000 (not > 1,000,000)
By the way, a large scaling operation can easily be automated, provided that you only double the throughput at each step. Each step only takes a couple of minutes to run. You could also consider storing an "incremental" diff instead of the full "multi-gigabyte file in S3"; it would save a lot...
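If you want to automate that doubling, a minimal sketch with boto3 could look like the following (the table name and target capacity are hypothetical, and it assumes the table uses provisioned throughput):

```python
import boto3

dynamodb = boto3.client("dynamodb")

TABLE_NAME = "my-table"        # hypothetical table name
TARGET_WRITE_CAPACITY = 10000  # hypothetical target for the import window


def scale_writes_up(table_name, target_wcu):
    """Double provisioned write capacity step by step until the target is reached."""
    while True:
        table = dynamodb.describe_table(TableName=table_name)["Table"]
        throughput = table["ProvisionedThroughput"]
        current_wcu = throughput["WriteCapacityUnits"]
        if current_wcu >= target_wcu:
            break
        dynamodb.update_table(
            TableName=table_name,
            ProvisionedThroughput={
                "ReadCapacityUnits": throughput["ReadCapacityUnits"],
                "WriteCapacityUnits": min(current_wcu * 2, target_wcu),
            },
        )
        # Wait until the table is ACTIVE again before the next doubling step.
        dynamodb.get_waiter("table_exists").wait(TableName=table_name)


scale_writes_up(TABLE_NAME, TARGET_WRITE_CAPACITY)
```

Run it before the import to scale up, and again afterwards with a small target to scale back down.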
The official optimization guide for DynamoDB also provides some useful hints on how to optimize your import.

Related

DynamoDB backup and restore using Data pipelines. How long does it take to backup and recover?

I'm planning to use Data Pipelines as a backup and recovery tool for our DynamoDB tables. We will be using Amazon's prebuilt pipelines to back up to S3, and use the prebuilt recovery pipeline to recover to a new table in case of a disaster.
This will also serve the dual purpose of data archival for legal and compliance reasons. We have explored snapshots, but this can get quite expensive compared to S3. Does anyone have an estimate on how long it takes to back up a 1 TB database? And how long it takes to recover a 1 TB database?
I've read the Amazon docs, which say it can take up to 20 minutes to restore from a snapshot, but there is no mention of how long a Data Pipeline takes. Does anyone have any clues?
Does the newly released feature of exporting from DynamoDB to S3 do what you want for your use case? To use this feature, you must have continuous backups enabled though. Perhaps that will give you the short term backup you need?
It would be interesting to know why you're not planning to use the built-in backup mechanism. It offers point-in-time recovery and it is highly predictable in terms of cost and performance.
The Data Pipelines backup is unpredictable, will very likely cost more, and operationally it is much less reliable. Plus, getting a consistent snapshot (i.e. point-in-time) requires stopping the world. Speaking from experience, I don't recommend using Data Pipelines for backing up DynamoDB tables!
Regarding how long it takes to take a backup, that depends on a number of factors but mostly on the size of the table and the provisioned capacity you're willing to throw at it, as well as the size of the EMR cluster you're willing to work with. So, it could take anywhere from a minute to several hours.
Restoring time also depends on pretty much the same variables: provisioned capacity and total size. And it can also take anywhere from a minute to many hours.
Point in time backups offer consistent, predictable and most importantly reliable performance regardless of the size of the table: use that!
And if you're just interested in dumping the data from the table (i.e. not necessarily the restore part), use the new export to S3.
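If you go the export route, a minimal sketch with boto3 (the table ARN, bucket, and prefix are placeholders, and the table must have continuous backups / point-in-time recovery enabled):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Placeholder ARN and bucket; PITR must be enabled on the table.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
    S3Bucket="my-dynamodb-exports",
    S3Prefix="exports/my-table/",
    ExportFormat="DYNAMODB_JSON",
)

# The export runs asynchronously; poll describe_export with this ARN if needed.
print(response["ExportDescription"]["ExportArn"])
```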

Cost effective BigQuery loading data from s3

I have about 2 TB in 20k files in S3, created over the course of each day, that I need to load into a date-partitioned BigQuery table. Files are rolled over every 5 minutes.
What is the most cost-effective way to get the data into BigQuery?
I am looking for cost optimization in both the AWS S3 to GCP network egress and the actual data loading.
Late 2020 update: you could consider using BigQuery Omni in order to not have to move your data from S3 and still have the BigQuery capabilities you're looking for.
(disclaimer: I'm not affiliated in any way to Google, I just find it remarkable that they've started providing multi-cloud support thanks to Anthos. I hope the other cloud providers will follow suit...)
Google Cloud supports (in beta) a BigQuery Transfer Service for S3; details are mentioned here. The other mechanism is S3 -> GCS -> BigQuery, which I believe will also incur GCS storage cost.
As per Google Cloud's pricing docs, the transfer service itself is "no charge" from the Google Cloud point of view, with limits applicable.
The pricing for data transfer from S3 to Google Cloud over the Internet (I am assuming it's not over VPN) is mentioned here. Your data is around 2 TB, so at the $0.09 per GB rate from the table, that works out to roughly 2,048 GB × $0.09 ≈ $184 per full transfer.
There are several ways to optimize the transfer and the load.
First of all, the network egress from AWS can't be avoided. If you can, gzip your files before storing them in S3: you will reduce the egress bandwidth, and BigQuery can load compressed files directly.
If the workload that writes to S3 can't gzip the files, you have to weigh the processing time spent gzipping against the egress cost of the uncompressed files.
For GCS, we often speak about cost in GB/month, but that's misleading: when you look at the detailed billing, the cost is actually calculated in GB-seconds. So the less time you leave your files in storage, the less you pay, and if you load the files into BigQuery quickly after the transfer, you will pay almost nothing for GCS.
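To illustrate the load step, here is a minimal sketch with the google-cloud-bigquery Python client, assuming the gzipped CSV files have already been copied to a hypothetical GCS bucket and the destination is a date-partitioned table (project, dataset, bucket, and the partitioning column are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table (date-partitioned on a hypothetical column).
table_id = "my-project.my_dataset.events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_date",  # hypothetical DATE/TIMESTAMP column
    ),
)

# .gz files are decompressed by BigQuery automatically on load.
load_job = client.load_table_from_uri(
    "gs://my-bucket/incoming/2020-11-01/*.csv.gz",
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```

Deleting the GCS objects right after the load keeps the GB-second storage cost close to zero.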
BigQuery data ingestion
You have a few options for getting your S3 data ingested into BigQuery, all depending on how quickly you need your data available in BigQuery. Also, any requirements for data transformation (enrichment, deduplication, aggregation) should be factored into the overall cost.
The fastest way to get data into BigQuery is the streaming API (with delays within seconds), which comes with a $0.010 per 200 MB charge. Streaming API Pricing
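For reference, a minimal streaming-insert sketch with the google-cloud-bigquery Python client (the table ID and row contents are made up for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and rows.
table_id = "my-project.my_dataset.events"
rows = [
    {"event_time": "2020-11-01T12:00:00Z", "user_id": "abc", "value": 42},
]

# Streaming inserts are billed per MB ingested; errors come back per row.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert errors:", errors)
```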
BigQuery Transfer Service is another choice; it is the easiest and is free of charge, but it currently only allows you to schedule a transfer to run no more than once a day. In your case, where data is continuously produced, that would be the slowest method to get data into BigQuery.
Transfer Service Pricing
If you need complex transformation, you may also consider Cloud Dataflow, which is not free of charge. Cloud Dataflow Pricing
Lastly, you may also consider a serverless solution, which is fully event-driven, allowing data ingestion in close to real time. With this, you would pay for Lambda and Cloud Function execution, which should be around a few dollars per day plus egress cost.
For data mirroring between AWS S3 and Google Cloud Storage, you could use serverless Cloud Storage Mirror, which comes with payload size optimization with either data compression or dynamic AVRO transcoding.
For getting data loaded into BigQuery, you can use serverless BqTail, which allows you to run loads in batches. To not exceed the 1K loads per table per day BigQuery quota, you could comfortably use a 90-second batch window, which would get your data loaded into BigQuery within a few minutes' delay in the worst-case scenario. Optionally you can also run data deduplication, data enrichment, and aggregation.
Egress cost consideration
In your scenario, where the transfer size is relatively small (2 TB per day), I would accept the egress cost; however, if you expect to grow to 40 TB+ per day, you may consider using a direct interconnect to GCP. With a simple proxy, that should come with a substantial cost reduction.

Lambda triggers high s3 costs

I created a new Lambda based on a 2MB zip file (it has a heavy dependency). After that, my S3 costs really increased (from $12.27 to $31).
Question 1: As this is uploaded from a CI/CD pipeline, could it be that it's storing every version and then increasing costs?
Question 2: Is this storage alternative more expensive than directly choosing an S3 bucket I own instead of the private one owned by Amazon where this zip goes? Looking at the S3 price list, 2 MB alone can't result in 19 dollars.
Thanks!
A few things you can do to mitigate cost:
Use Lambda Layers for dependencies
Use S3 Infrequent access for your lambda archive
Since I don't have your full S3 configuration, it's hard to tell what is causing the cost... things like S3 versioning would do it.
The reason was that object versioning was enabled, and after some stress tests those versions had accumulated and were being stored. Costs went back to $12 after they were removed.
It's key to keep the "Show versions" option enabled in the S3 console to keep track of those files.
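If you want to keep versioning enabled but stop stale versions from piling up, one option is a lifecycle rule that expires noncurrent versions. A minimal boto3 sketch (bucket name and retention period are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-lambda-artifacts"  # hypothetical bucket holding the CI/CD artifacts

# Expire noncurrent object versions 7 days after they are superseded.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
            }
        ]
    },
)
```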

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128 MB), and therefore I have 96 data files daily, but I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function gets invoked when Firehose streams a new file into S3. The function then reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file from the destination bucket (if it already exists with previous daily data, otherwise it creates a new one), decodes both response bodies to strings, concatenates them, and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that when the aggregated file reaches 150+ MB, the Lambda function hits its ~1,500 MB memory limit when reading the two files, and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will grow exponentially in the future. It seems odd to me that Lambda has such low limits and that they are already reached with such small files.
What are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or by a scheduled job, for example one scheduled daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP (a minimal handler sketch follows at the end of this answer).
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
Of course, if it is not possible to keep the Spark data ready (but not yet queryable) for a reasonable amount of money, this may not be an option. It may also be that it's extremely time-consuming to inject small chunks of data, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
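For the event-listener part mentioned above, a minimal sketch of an S3-triggered Lambda handler in Python; what it hands each new object off to (an EMR step, a queue, etc.) is left out and depends entirely on your pipeline:

```python
import urllib.parse


def handler(event, context):
    """Invoked by S3 ObjectCreated events on the Firehose output bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Hand the new object off to the downstream injector/transformer here.
        print(f"New Firehose object: s3://{bucket}/{key}")
```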
You may create a Lambda function that is invoked only once a day using Scheduled Events, and in your Lambda function you can use Upload Part - Copy, which does not need to download your files to the Lambda function. There is already an example of this in this thread.
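For reference, a minimal boto3 sketch of that server-side concatenation using multipart Upload Part - Copy (bucket and keys are hypothetical; note that every part except the last must be at least 5 MB):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and keys; each source object except the last must be >= 5 MB.
BUCKET = "my-firehose-bucket"
SOURCE_KEYS = ["2020/11/01/part-000", "2020/11/01/part-001", "2020/11/01/part-002"]
DEST_KEY = "daily/2020-11-01.data"


def concatenate(bucket, source_keys, dest_key):
    """Concatenate S3 objects server-side, without downloading them to Lambda."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)["UploadId"]
    parts = []
    for part_number, key in enumerate(source_keys, start=1):
        result = s3.upload_part_copy(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload_id,
            PartNumber=part_number,
            CopySource={"Bucket": bucket, "Key": key},
        )
        parts.append({"PartNumber": part_number, "ETag": result["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=dest_key,
        UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )


concatenate(BUCKET, SOURCE_KEYS, DEST_KEY)
```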

On what factors does the download speed of assets from Amazon S3 depend?

How fast can we download files from Amazon S3? Is there an upper limit (distributed between all the requests from the same user), or does it only depend on my internet connection's download speed? I couldn't find it in their SLA.
What other factors does it depend on? Do they throttle the data transfer rate at some level to prevent abuse?
This has been addressed in the recent Amazon S3 team post Amazon S3 Performance Tips & Tricks:
First: for smaller workloads (<50 total requests per second), none of the below applies, no matter how many total objects one has! S3 has a bunch of automated agents that work behind the scenes, smoothing out load all over the system, to ensure the myriad diverse workloads all share the resources of S3 fairly and snappily. Even workloads that burst occasionally up over 100 requests per second really don't need to give us any hints about what's coming...we are designed to just grow and support these workloads forever. S3 is a true scale-out design in action.

S3 scales to both short-term and long-term workloads far, far greater than this. We have customers continuously performing thousands of requests per second against S3, all day every day. [...] We worked with other customers through our Premium Developer Support offerings to help them design a system that would scale basically indefinitely on S3. Today we’re going to publish that guidance for everyone’s benefit.

[emphasis mine]
You may want to read the entire post to gain more insight into the S3 architecture and resulting challenges for really massive workloads (i.e., as stressed by the S3 team, it won't apply at all for most use cases).