I was looking at our bill and apparently we are charged more than $600 for Amazon Simple Storage Service USE2-Requests-Tier2, meaning that we have more than 1 billion GET requests a month, so over 30 million every day? We made sure that none of our S3 buckets are public, so attacks should not be possible. I have no idea how we are getting so many requests, as we only have about 20 active users of our app every day. Assuming that each of them makes about 10 GET requests to our API, which uses Lambda and boto3 to download 10 files from an S3 bucket to the Lambda's /tmp folder and then returns a value, that only adds up to a couple of thousand GET requests a day, nowhere near 30 million.
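For context, the handler is roughly along these lines (simplified; the bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and keys; the real handler derives these from the request.
BUCKET = "our-data-bucket"
KEYS = [f"inputs/file_{i}.bin" for i in range(10)]

def handler(event, context):
    # One download per file, so roughly 10 S3 GET requests per invocation.
    for key in KEYS:
        s3.download_file(BUCKET, key, f"/tmp/{key.rsplit('/', 1)[-1]}")
    # ... process the downloaded files ...
    return {"status": "ok"}
```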
We also have another EventBridge-triggered Lambda, which uses Athena to query our database (S3) and runs every 2 hours. Could this be a potential cause? Can anyone shed some light on this, and on how we can take a better look into where and why we are getting so many GET requests? Thank you.
When you execute a query in Athena, during the initial query planning phase it will list the location of the table, or the locations of all the partitions of the table involved in the query. In the next phase it will make a GET request for each and every one of the objects that it found during query planning.
If your tables consist of many small files, it is not uncommon to see S3 charges that are comparable to or higher than the Athena charge. If those small files are Parquet files, the problem can be bigger, because Athena will also do GET requests for those during query planning to figure out splits.
One way to figure out if this is the case is to enable S3 access logging on the bucket, create a new IAM session, and run a query. Wait a few minutes and then look for all S3 operations that were issued with that session; that gives you an estimate of the S3 operations per query.
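For reference, turning on server access logging is a single boto3 call; the bucket names below are placeholders, and the target bucket has to allow the S3 log delivery service to write to it:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names: the bucket Athena reads from, and a separate bucket for the logs
# (the log bucket needs a policy/ACL that lets S3 log delivery write to it).
SOURCE_BUCKET = "my-data-bucket"
LOG_BUCKET = "my-access-log-bucket"

s3.put_bucket_logging(
    Bucket=SOURCE_BUCKET,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": LOG_BUCKET,
            "TargetPrefix": f"{SOURCE_BUCKET}/",
        }
    },
)
```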
Related
How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and large enough, Athena will spin up workers that read partial files. This improves performance through parallelization. Parquet files, for example, are splittable.
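To make that concrete, here is what a ranged read looks like with boto3; each such request is billed as a separate GET, which is how a single file can account for many GET requests (bucket and key are made up):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical object; each ranged read is a separate, separately billed GET request.
response = s3.get_object(
    Bucket="my-data-bucket",
    Key="data/part-00000.parquet",
    Range="bytes=0-1048575",  # read only the first 1 MiB of the file
)
chunk = response["Body"].read()
print(len(chunk), "bytes read")
```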
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object-level logging in CloudTrail for the bucket. You should be able to see all the request parameters, like which byte ranges are read. If you assume a role with a unique session name and make only a single query with the credentials you get, you should be able to isolate all the S3 operations made by Athena for that query.
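A rough sketch of that approach with boto3; the role ARN, database, table, and output location are all placeholders. The unique session name ends up in the userIdentity ARN of the log entries, so you can filter on it afterwards:

```python
import uuid
import boto3

sts = boto3.client("sts")

# Placeholder role; use any role that is allowed to run the Athena query.
session_name = f"athena-audit-{uuid.uuid4()}"
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/athena-query-role",
    RoleSessionName=session_name,
)["Credentials"]

athena = boto3.client(
    "athena",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",  # the query to measure
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Filter the object-level logs for session name:", session_name)
```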
I have some strange things occurring with my AWS DynamoDB tables. To give you some context, I have several tables for an AWS Lambda function to query and modify. The source code for the function is housed in an S3 bucket. The function is triggered by an AWS API.
A few days ago I noticed a massive spike in the amount of read and write requests I was being charged for in AWS. To be specific, the number of read and write requests increased by 3,000 from what my tables usually experience (they usually have fewer than 750 requests). Additionally, I have seen similar numbers in my Tier 1 S3 requests, with an increase of nearly 4,000 requests in the past six days.
Immediately, I suspected something malicious had happened, and I suspended all IAM roles and changed their keys. I couldn't see anything in the logs from Lambda denoting it was coming from my function, nor had the API received a volume of requests consistent with what was happening on the tables or the bucket.
When I was looking through the logs on the tables, I was met with this very strange behavior relating to the provisioned write and read capacity of the table. It seems like the table's capacities are ping-ponging back and forth wildly, as shown in the photo.
I'm relatively new to DynamoDB and AWS as a whole, but I thought I had set the table up with very specific provisioned write and read limits. The requests have continued to come in, and I am unable to figure out where in the world they're coming from.
Would one of you AWS Wizards mind helping me solve this bizarre situation?
Any advice or insight would be wildly appreciated.
Turns out refreshing the table view in the DynamoDB console causes the table to be read from, hence the unexplained jump in reads. I was doing it the whole time 🤦♂️
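For anyone else trying to track down unexplained reads, pulling the consumed-capacity metric from CloudWatch shows exactly when they happen; a minimal sketch, with the table name as a placeholder:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder table name; sums consumed read capacity per hour over the last week.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "my-table"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```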
I have a lot of timeseries-related data chunked into one-hour intervals in Parquet files stored in AWS S3 (one file for every hour). The purpose is to have a web application displaying that data. Since we cannot crawl every Parquet file on S3 per request, my approach would be to use ETL processes to aggregate those series and store them both as a single Parquet file and in a DynamoDB table for different aggregated views like year, month, week, day, hour, or even minute. Furthermore, that processed Parquet file would be made available with AWS Athena to be queried (not from the frontend, since I expect the queries to take too long to execute).
Sounds like a reasonable plan. I assume you want some pointers to how to achieve that?
Depending on the size of your current data set and your requirements for querying it with Athena, you may not need to do the ETL.
You can set up a table with a location that is a prefix of all the Parquet files (e.g. s3://example/dir/ if the files are stored with keys like s3://example/dir/2020/03/13/12/file.parquet). If your total data set is not more than a couple of gigabytes I would recommend this. If your data set is larger, and it is organised into prefixes that contain each day or hour, you can create a partitioned table and add partitions with locations that use the prefix structure (e.g. s3://example/dir/2020/03/13 and s3://example/dir/2020/03/12 for daily partitions, or s3://example/dir/2020/03/13/11 and s3://example/dir/2020/03/13/12 for hourly partitions). Unless you have hundreds of gigabytes of data per day, or the queries you will run with Athena almost always only look at a few hours of data, I would recommend partitioning by date rather than hour to keep the number of partitions down.
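As a rough sketch of adding a daily partition for that layout (database, table, bucket, and output location are made up; the same approach works for hourly partitions with an hour column):

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names matching the s3://example/dir/YYYY/MM/DD/ layout above.
add_partition = """
ALTER TABLE my_database.events
ADD IF NOT EXISTS PARTITION (day = '2020-03-13')
LOCATION 's3://example/dir/2020/03/13/'
"""

athena.start_query_execution(
    QueryString=add_partition,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```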
If your existing Parquet files are very small, less than a hundred megabytes, and performance of the Athena queries is very important, you could try to ETL the files into bigger files to see if it helps. It may or may not help; it will depend. I recommend that you use Athena itself for the ETL. You can use INSERT INTO to create new partitions in a table based on data in another table. I suggest automating this by creating an EventBridge rule with a schedule that either triggers a Lambda function that runs the conversion query in Athena, or a Step Functions state machine if you want to avoid paying for the Lambda sitting idle waiting for the query to finish (or need to wait more than 15 minutes). AWS has a service called Glue ETL which was made for this kind of thing, but in my experience it is not worth it. Using Athena, Lambda, and Step Functions beats it in terms of usability and flexibility.
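A sketch of what the scheduled Lambda could look like; the table names, partition value, and output location are hypothetical, and in the Step Functions variant you would poll get_query_execution in a separate state instead of waiting inside the function:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical tables: 'raw_events' points at the many small hourly files,
# 'compacted_events' is a partitioned Parquet table that receives larger files.
ETL_QUERY = """
INSERT INTO my_database.compacted_events
SELECT * FROM my_database.raw_events
WHERE day = '2020-03-13'
"""

def handler(event, context):
    # Triggered on a schedule by EventBridge; only starts the query and returns.
    response = athena.start_query_execution(
        QueryString=ETL_QUERY,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return response["QueryExecutionId"]
```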
You can use the same mechanism for loading pre-calculated time series into DynamoDB: use EventBridge to schedule a Lambda function that runs queries in Athena and converts the results to store in DynamoDB. Use Step Functions to avoid paying for idle time while waiting for the queries to finish.
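A sketch of the loading step, assuming the query has already finished (the DynamoDB table name and column layout are made up; a Step Functions wait/poll state would sit in front of this):

```python
import boto3

athena = boto3.client("athena")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("precalculated-series")  # hypothetical table name

def load_results(query_execution_id):
    # Page through the Athena results and batch-write them into DynamoDB.
    paginator = athena.get_paginator("get_query_results")
    header_skipped = False
    with table.batch_writer() as batch:
        for page in paginator.paginate(QueryExecutionId=query_execution_id):
            for row in page["ResultSet"]["Rows"]:
                if not header_skipped:
                    header_skipped = True  # first row of the first page is the header
                    continue
                bucket_ts, value = (col.get("VarCharValue") for col in row["Data"])
                batch.put_item(Item={"bucket": bucket_ts, "value": value})
```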
If Amazon Timestream ever gets released it may be a better target for storing the time series than DynamoDB. Also consider storing the pre-calculated time series as JSON, CSV, or Apache Arrow on S3 instead of DynamoDB; that can be cheaper and in some ways simpler, depending on your use case.
Please let me know if you want more details on how to set up the partitioned Athena tables, but provide some more context so I can make it more concrete.
I'm using CloudSearch to index a large number of small JSON data objects which need to be updated regularly (I have a 5-minute cron job) based on the value of an expression and some business logic.
Everything was working until last week. I see the cron job is still running without encountering any error messages, the objects in S3 are still being updated correctly, but when I execute a CloudSearch request ordered by the ranking property, I get stale search data. Not just by a couple of minutes, but by weeks.
I tried re-indexing but that did not result in any change. Does CloudSearch have some sort of update threshold which prevents you from posting updates after a certain number of requests a day? I imagine updating once every 5 minutes would fall well below such a number.
I haven't been able to find any indication in AWS's docs as to whether they do some sort of update throttling.
What I ended up finding was a mention in CloudSearch's FAQ: http://aws.amazon.com/cloudsearch/faqs/
Q: How much data can I upload to my search domain?
The number of partitions you need depends on your data and configuration, so the maximum data you can upload is the data set that, when your search configuration is applied, results in 10 search partitions. When you exceed your search partition limit, your domain will stop accepting uploads until you delete documents and re-index your domain. If you need more than 10 search partitions, please contact us.
I deleted a large amount of data from CloudSearch that I was no longer using and found my updates working again. I had unwittingly run into a data limit in CloudSearch's index. CloudSearch was still returning a success when I submitted the batch update, but it silently ignored the update operations.
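If you want to keep an eye on how close a domain is to that limit, the domain status includes the current partition count; a quick sketch with a placeholder domain name:

```python
import boto3

cloudsearch = boto3.client("cloudsearch")

# Placeholder domain name; 10 search partitions is the ceiling mentioned in the FAQ.
status = cloudsearch.describe_domains(DomainNames=["my-search-domain"])
for domain in status["DomainStatusList"]:
    print(domain["DomainName"], "search partitions:", domain["SearchPartitionCount"])
```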
I'm trying to host videos on S3 and want to put data transfer limits in place so I don't get charged for more than, say, 20 GB. Is there any way to do that?
Not that I know of
But you could try issuing query string authentication URLs to S3 resources with expiration dates set, and then use the server access logs to track total downloads. If you miss by a GB or so it's not going to cost too much :-)
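In boto3 terms those are presigned URLs; a minimal sketch with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; the URL stops working once ExpiresIn seconds have passed.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-video-bucket", "Key": "videos/intro.mp4"},
    ExpiresIn=3600,  # valid for one hour
)
print(url)  # hand this out instead of a public S3 URL
```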