AWS DynamoDB Strange Behavior -- Provisioned Capacity & Queries

I have some strange things occurring with my AWS DynamoDB tables. To give you some context, I have several tables that an AWS Lambda function queries and modifies. The source code for the function is housed in an S3 bucket, and the function is triggered by an AWS API.
A few days ago I noticed a massive spike in the number of read and write requests I was being charged for in AWS. To be specific, the number of read and write requests increased by 3,000 over what my tables usually experience (normally fewer than 750 requests). Additionally, I have seen similar numbers in my Tier 1 S3 requests, with an increase of nearly 4,000 requests in the past six days.
Immediately, I suspected something malicious had happened, so I suspended all IAM roles and rotated their keys. I couldn't see anything in the Lambda logs indicating the traffic was coming from my function, nor had the API received a volume of requests consistent with what was happening on the tables or the bucket.
When I was looking through the logs on the tables, I was met with very strange behavior relating to the provisioned write and read capacity of the table. It seems like the table's capacities are ping-ponging back and forth wildly, as shown in the photo.
I'm relatively new to DynamoDB and AWS as a whole, but I thought I had set the table up with very specific provisioned write and read limits. The requests have continued to come in, and I am unable to figure out where in the world they're coming from.
Would one of you AWS Wizards mind helping me solve this bizarre situation?
Any advice or insight would be wildly appreciated.

Turns out refreshing the table view in the DynamoDB management console causes the table to be read from, hence the unexplained jump in reads. I was doing it the whole time 🤦‍♂️
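For anyone hitting something similar, a minimal sketch of how you might confirm when the extra reads happened, assuming boto3 and a placeholder table name (DynamoDB publishes per-table consumed-capacity metrics to CloudWatch):

```python
from datetime import datetime, timedelta
import boto3

# Hypothetical sketch: pull the consumed read capacity for one table from
# CloudWatch to see exactly when the unexpected reads occurred.
# "my-table" is a placeholder.
cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "my-table"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,                 # one data point per hour
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```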

Related

Millions of GET requests (Amazon S3 USE2-Requests-Tier2) every day?

I was looking at our bill and apparently we are charged more than $600 for Amazon Simple Storage Service USE2-Requests-Tier2, meaning that we have more than 1 billion GET requests a month, so about 3 million every day? We made sure that none of our S3 buckets are public, so attacks should not be possible. I have no idea how we are getting so many requests, as we only have about 20 active users of our app every day. Assuming that each of them makes about 10 GET requests to our API, which uses Lambda and boto3 to download 10 files from an S3 bucket to the Lambda's /tmp folder and then returns a value, it still wouldn't make sense for us to have about 3 million GET requests a day.
We also have another EventBridge-triggered Lambda, which uses Athena to query our database (S3) and runs every 2 hours. I don't know if this is a potential cause? Can anyone shed some light on this, and on how we can take a better look into where and why we are getting so many GET requests? Thank you.
When you execute a query in Athena, during the initial query planning phase it will list the location of the table, or the locations of all the partitions of the table involved in the query. In the next phase it will make a GET request for each and every one of the objects that it found during query planning.
If your tables consist of many small files, it is not uncommon to see S3 charges that are comparable to or higher than the Athena charge. If those small files are Parquet files, the problem can be even bigger, because Athena will also make GET requests for them during query planning to figure out splits.
One way to figure out if this is the case is to enable S3 access logging on the bucket, create a new IAM session, and run a query. Wait a few minutes and then look for all S3 operations that were issued with that session; that gives you an estimate of the S3 operations per query.
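Turning on access logging is a one-off call. A minimal sketch, assuming boto3 and placeholder bucket names (the target bucket must already allow the S3 logging service to write to it):

```python
import boto3

# Hypothetical sketch: enable S3 server access logging on the data bucket so
# every request made against it (including Athena's GETs) is recorded.
# Both bucket names below are placeholders.
s3 = boto3.client("s3")

s3.put_bucket_logging(
    Bucket="my-athena-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-access-log-bucket",
            "TargetPrefix": "athena-query-test/",
        }
    },
)
```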

Allowing many users to view stale BigQuery data query results concurrently

If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
The BigQuery documentation mentions that 50 concurrent queries are permitted for on-the-spot accurate data, a limit I would exceed if all of these users needed to view on-the-spot accurate data - which they don't.
The documentation also mentions that batch jobs are permitted and that results can be saved into destination tables, which I'm hoping would allow a reliable solution for my scenario. However, I'm having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and on whether someone querying results that exist in those destination tables counts toward the 50-concurrent-queries limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them, and you respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.
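A minimal sketch of that daily batch-query-and-export step, assuming the google-cloud-bigquery Python client (project, dataset, table, and bucket names are placeholders):

```python
from google.cloud import bigquery

# Hypothetical sketch: run the daily query as a BATCH-priority job, write the
# result to a staging table, then export that table to GCS as flat files.
# All project/dataset/table/bucket names are placeholders.
client = bigquery.Client()

dest = bigquery.TableReference.from_string("my-project.reporting.daily_user_snapshot")
job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH,        # run as a batch, not interactive, job
    destination=dest,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace yesterday's snapshot
)

query_job = client.query(
    "SELECT user_id, payload FROM `my-project.analytics.facts`",
    job_config=job_config,
)
query_job.result()  # wait for the batch job to finish

# Export the snapshot; the wildcard lets BigQuery shard large outputs into multiple files.
extract_job = client.extract_table(dest, "gs://my-report-bucket/daily/users-*.csv")
extract_job.result()
```

Your frontend (or the Cloud Function suggested above) then serves each user their slice of the exported output, so end-user requests never hit BigQuery at all.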

Microsoft Azure Blob Storage Upload Performance

I am running an Azure web role which stores very small blobs in Azure storage. (The blob upload is done from the server, not from the browser.) I have searched Stack Overflow and the rest of the internet for tips on optimizing blob storage performance, and I believe I've checked and implemented all of the usual suspects: uploading async, and allowing unlimited outgoing web connections (which now seems to be the default setting on web roles and no longer needs to be explicitly set in web.config or in code).
Tweaking the number of concurrent uploads I allow makes some difference, but regardless of what I've tried, I seem to max out at around 1,000 blob uploads per second. This is when running in the Azure web role, in the same region as the storage account (East US). My rate when running this from home over a good internet connection isn't much less, ~700 blobs/sec, which seems to tell me that it's not the network latency limiting the rate; it's the actual processing time of the storage service.
I wouldn't normally consider these rates horrible for this kind of service, but I've read that Microsoft boasts a rate of ~20,000 storage transactions per second, so I've been a little disappointed with these results.
I'd like to get some feedback from those who have really tried to push the limits of blob storage. Does ~1,000 small uploads per second sound about right? Or is there possibly something else I should be doing to improve this? I'll post the code if I need to, but I'd rather not receive speculative answers; I'd like to hear from developers who can either confirm that my results are reasonable, or say that they've seen much higher throughput.
I should add that I'm currently running this in a small web role. I've also tried it in a medium web role and didn't see any significant difference.
EDIT:
After a few days of development and testing, my upload rate seemed to suddenly increase. Not by a lot, but maybe by another ~200 per second. In looking around the web, I noticed a comment in the Azure documentation stating "A storage account scales automatically as usage increases." So I'm wondering if it really is capable of much higher rates, but will not automatically scale up until it sees a sustained period of high volume. Some confirmation of that would also be greatly appreciated.
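For reference, the upload pattern described in the question might look roughly like this in the current Python storage SDK (the original code was .NET, so this is only an illustrative sketch; the connection string, container name, and worker count are placeholders to tune):

```python
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient

# Hypothetical sketch of the approach in the question: upload many very small
# blobs concurrently and tune the number of workers. The connection string,
# container name, and worker count are placeholders.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("small-blobs")

def upload(i: int) -> None:
    container.upload_blob(name=f"blob-{i}", data=b"tiny payload", overwrite=True)

with ThreadPoolExecutor(max_workers=64) as pool:  # concurrency knob to tweak
    list(pool.map(upload, range(10_000)))
```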
Depending on how small your requests are, the problem might be Nagle's algorithm (see "Nagle's Algorithm is Not Friendly towards Small Requests") - although usually I see that with queue/table operations. Try disabling Nagle's algorithm and let me know if that makes any difference. As an FYI, you have to disable it prior to establishing the connection, otherwise the change will not take effect.
Jason

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should push (via SFTP) a daily CSV file containing all the data (currently 6,718 rows and growing by four each day) to the web server. The web server would then read data from the file and display it whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price-correction scenario, all data would be delivered in the file. What are the problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try to 'cache' it in some way (apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from making these requests from the webserver to the database server. With this little data you will not find performance to be an issue, and even if it does become one as everything grows, there is a lot to be gained on the database side (indexes, etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment, so again, a direct query would be best.
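To illustrate how small that direct query really is, here is a hedged sketch assuming a Python webserver and any DB-API-style driver for the warehouse (sqlite3 is used only so the snippet is self-contained; the table and column names are made up to match the question's schema). Binding the values as parameters also sidesteps the SQL-injection concern discussed further down:

```python
import sqlite3  # stand-in driver so the sketch is self-contained; use your warehouse's DB-API driver

# Hypothetical sketch: answer one user request with a single parameterized query.
# Table/column names follow the question; the schema and connection are placeholders.
def get_prices(conn, start_date: str, end_date: str, products: list[str]):
    placeholders = ",".join("?" for _ in products)
    sql = (
        "SELECT product_name, date, price "
        "FROM daily_prices "
        f"WHERE date BETWEEN ? AND ? AND product_name IN ({placeholders}) "
        "ORDER BY date"
    )
    # Values are bound by the driver, so user input can't inject SQL.
    return conn.execute(sql, [start_date, end_date, *products]).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # placeholder connection
    conn.execute("CREATE TABLE daily_prices (product_name TEXT, date TEXT, price REAL)")
    conn.execute("INSERT INTO daily_prices VALUES ('Widget', '2024-01-02', 9.99)")
    print(get_prices(conn, "2024-01-01", "2024-12-31", ["Widget"]))
```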
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify the CSV method. Examples of such issues, and why you need not worry about them:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day, the chance of a 1-in-10,000 failure might seem better for the once-a-day method. Still, these issues should not come up very often, and if they do, you should be able to handle them (retry the request, show the user a message). This is what an enormous number of websites do, so trust me when I say this will not be an issue. Also, think of what it would mean if your daily update failed: that would be a bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level: use a caching system (non-CSV) on the database server, use a caching system on the webserver, or fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise.
Your second database (even on your webserver) receives all updates from the master and is read-only; you can only query it for data.
Your webserver cannot connect to the data warehouse, but it can connect to the slave to read information. Even if there were an injection attack, it wouldn't matter, as the slave is read-only.
Now there is no single moment where you have to update the queried database (master-slave replication keeps it up to date), and there is no chance that queries from the webserver put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If that is the only form, just ensure that the only values it accepts are dates, and something like DROP TABLE isn't possible. Getting access to the database is another issue, but a separate file containing just the connection function should do fine in most cases, so that a user can't, say, open your webpage in an HTML viewer and see your database connection string.
As for the CSV, I would have to say that querying the database per user request, especially when the feature is only used ~10 times weekly, would be much more efficient. The CSV is overkill: with only ~10 requests a week, exporting an updated CSV every day is too much work for such little payoff.
EDIT:
Also, if an attack is a big concern (which really depends on the nature of the business, the data being stored, and the visitors you receive), you could always create a backup as another option. I don't really see a reason for this as the question is currently stated, but even with the best security an attack could happen; that mainly depends on whether attackers want the information you have.

Has anyone ever reached a read or write upper-bound for an Amazon S3 bucket?

Are there known limitations of S3 scaling? Anyone ever had so many simultaneous reads or writes that a bucket started returning errors? I'm a bit more interested in writes than reads because S3 is likely to be optimized for reads.
Eric's comment sums it up already on a conceptual level, as addressed in the FAQ "What happens if traffic from my application suddenly spikes?" as well:
Amazon S3 was designed from the ground up to handle traffic for any Internet application. [...] Amazon S3’s massive scale enables us to spread load evenly, so that no individual application is affected by traffic spikes.
Of course, you still need to account for possible issues and "Tune [your] Application for Repeated SlowDown errors" (see Amazon S3 Error Best Practices):
As with any distributed system, S3 has protection mechanisms which detect intentional or unintentional resource over-consumption and react accordingly. SlowDown errors can occur when a high request rate triggers one of these mechanisms. Reducing your request rate will decrease or eliminate errors of this type. Generally speaking, most users will not experience these errors regularly; however, if you would like more information or are experiencing high or unexpected SlowDown errors, please post to our Amazon S3 developer forum http://developer.amazonwebservices.com/connect/forum.jspa?forumID=24 or sign up for AWS Premium Support http://aws.amazon.com/premiumsupport/. [emphasis mine]
While rare, these slowdowns do happen of course - here is an AWS team response illustrating the issue (pretty dated though):
Amazon S3 will return this error when the request rate is high enough that servicing the requests would cause degraded service for other customers. This error is very rarely triggered. If you do receive it, you should exponentially back off. If this error occurs, system resources will be reactively rebalanced/allocated to better support a higher request rate. As a result, the time period during which this error would be thrown should be relatively short. [emphasis mine]
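If you do hit SlowDown responses, the exponential back-off the AWS team describes can be as simple as this hedged boto3 sketch (bucket and key names are placeholders; boto3's built-in retry configuration can also handle this for you):

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

# Hypothetical sketch: retry a PUT with exponential backoff plus jitter when S3
# returns a SlowDown error. Bucket and key names are placeholders.
s3 = boto3.client("s3")

def put_with_backoff(bucket: str, key: str, body: bytes, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return s3.put_object(Bucket=bucket, Key=key, Body=body)
        except ClientError as err:
            if err.response["Error"]["Code"] != "SlowDown":
                raise  # only back off on throttling
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... plus jitter
    raise RuntimeError(f"still throttled after {max_attempts} attempts")

# put_with_backoff("my-bucket", "some/key", b"payload")
```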
Your assumption about read vs. write optimization is confirmed there as well:
The threshold where this error is triggered varies and will depend, in part, on the request type and pattern. In general, you'll be able to achieve higher rps with GETs vs. PUTs, and with lots of GETs for a small number of keys vs. lots of GETs for a large number of keys. When getting or putting a large number of keys you'll be able to achieve higher rps if the keys are in alphanumeric order vs. random/hashed order.