How can I monitor the bandwidth of an object in the Amazon S3 service? - amazon-s3

How can I programmatically monitor the bandwidth used by an object in the AWS S3 service? I would like to do this to prevent excessive bandwidth usage by clients who are using our services and costing us more than we can afford. We would like to limit each object to 1 TB of bandwidth.

The detailed usage reports are just per bucket, not per object.
What you could do is enable server access logging and parse the logs once an hour or so. It's certainly not instant, but it would prevent people from going way over your usage limits.
Also, s3stat is a good option up to a point. Once you start doing more than ~50 million requests per month, they have trouble crunching the data.
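A minimal sketch of that log-parsing approach, assuming server access logging is already delivering logs to a separate bucket (the bucket name, prefix, and 1 TB threshold below are placeholders to adjust); it sums the "Bytes Sent" field per object key so you can flag anything approaching the cap:

```python
import re
from collections import defaultdict

import boto3

# Placeholder names -- adjust for your environment.
LOG_BUCKET = "my-access-logs-bucket"
LOG_PREFIX = "s3-access-logs/"
BANDWIDTH_CAP_BYTES = 1 * 1024 ** 4   # 1 TB per object, per the question

# S3 server access log line (simplified): the object key is field 8 and
# "Bytes Sent" is field 12; the bracketed timestamp and quoted request URI
# make naive splitting unreliable, so match only the fields we need.
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ \S+ (?P<key>\S+) '
    r'"[^"]*" \S+ \S+ (?P<bytes_sent>\S+)'
)

def bytes_sent_per_object():
    """Sum bytes served per object key across all delivered log files."""
    s3 = boto3.client("s3")
    totals = defaultdict(int)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=LOG_PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=LOG_BUCKET, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                match = LOG_PATTERN.match(line.decode("utf-8", "replace"))
                if not match or match["key"] == "-" or match["bytes_sent"] == "-":
                    continue
                totals[match["key"]] += int(match["bytes_sent"])
    return totals

if __name__ == "__main__":
    for key, sent in bytes_sent_per_object().items():
        if sent > BANDWIDTH_CAP_BYTES:
            print(f"{key} has served {sent} bytes -- over the 1 TB cap")
```

You could run something like this from whatever scheduled job drives the hourly parsing (cron, Lambda, etc.) and act on the offending keys from there.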

Related

Packet transfer from Kafka Connect to an AWS S3 bucket is slow. How to increase the throughput?

I am using Kafka Connect S3 to push records to an AWS S3 bucket. We are using the TimeBasedPartitioner in s3.sink.properties, and we are noticing only 10 records per second. Please guide me in increasing the throughput.
Your question mentions "packet transfer", but given that the maximum size of a Kafka record defaults to 1 MB, and that a 10 MB file has to go over the network, that could be a lot of packets...
Ultimately, the network is not the only factor here. You need to consider your ISP rates at the very least (assuming Connect is not running in AWS, which it should be if you want to save $$ on VPC/S3 ingest fees), then look at how much data the producers are actually sending to the topic.
Once that is determined: since you use the time-based partitioner, you have not mentioned your time period, flush size, or flush interval (scheduled or not). If you have a daily partition and the flush settings are ridiculously high, then of course you'll have little to no data in S3 immediately.
You can increase throughput by increasing consumer.max.poll.records and lowering the flush settings and s3.part.size, but first thoughts suggest this is not your issue.
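As a rough illustration only (the connector name, topic, bucket, and the specific values are placeholder guesses to tune, not settings from this answer), adjusting those knobs through the Kafka Connect REST API might look like this; a smaller flush.size plus a scheduled rotate.schedule.interval.ms bounds how long records sit in the connector before an S3 object is written, and consumer.override.max.poll.records raises the per-poll batch:

```python
import json

import requests  # assumes the Connect worker's REST API is reachable

CONNECT_URL = "http://localhost:8083"   # placeholder Connect worker address
CONNECTOR = "s3-sink"                   # placeholder connector name

# Illustrative values only -- tune flush/rotation against your real producer rate.
config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "sensor-events",                          # placeholder topic
    "s3.bucket.name": "my-sink-bucket",                 # placeholder bucket
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "3600000",                 # hourly partitions
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "locale": "en-US",
    "timezone": "UTC",
    # Smaller flush.size + a scheduled rotation bound the delay before data
    # actually lands in S3.
    "flush.size": "10000",
    "rotate.schedule.interval.ms": "60000",
    "s3.part.size": "5242880",                          # 5 MB minimum part size
    # Requires connector client config overrides to be enabled on the worker.
    "consumer.override.max.poll.records": "2000",
}

resp = requests.put(f"{CONNECT_URL}/connectors/{CONNECTOR}/config", json=config)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```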

How to increase queries per minute of Google Cloud SQL?

As in the question, I want to increase the number of queries per second on GCS. Currently, my application is on my local machine; when it runs, it repeatedly sends queries to and receives data back from the GCS server. More specifically, my location is in Vietnam, and the server (free tier, though) is in Singapore. The maximum QPS I can get is ~80, which is unacceptable. I know I can get better QPS by putting my application on the cloud, in the same location as the SQL server, but that alone requires a lot of configuration and work. Are there any solutions for this?
Thank you in advance.
Colocating your application front-end layer with the data persistence layer should be your priority: deploy your code to the cloud as well.
Use persistent connections/connection pooling to cut down on connection establishment overhead (see the sketch after this list).
Free tier instances for Cloud SQL do not exist. What are you referring to here? f1-micro GCE instances are not free in the Singapore region either.
Depending on the complexity of your queries, your read/write pattern, the size of your dataset, etc., the performance of your DB could be I/O bound. Ensuring your instance is provisioned with SSD storage and/or increasing the data disk size can help lift IOPS limits, further improving DB performance.
Side note: don't confuse the commonly used abbreviation GCS (Google Cloud Storage) with Google Cloud SQL.
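A minimal sketch of the pooling point, assuming a MySQL-flavoured Cloud SQL instance reached via SQLAlchemy (the host, credentials, table, and pool sizes are placeholders); reusing pooled connections avoids paying the cross-region connection setup cost on every query:

```python
from sqlalchemy import create_engine, text

# Placeholder connection string -- point it at your Cloud SQL instance
# (ideally over a private address once the app runs in the same region).
engine = create_engine(
    "mysql+pymysql://appuser:secret@10.0.0.5:3306/appdb",
    pool_size=10,          # keep a handful of connections open and reuse them
    max_overflow=5,        # allow short bursts beyond the base pool
    pool_pre_ping=True,    # transparently replace connections the server dropped
    pool_recycle=1800,     # recycle before idle timeouts kick in
)

def run_query(item_id: int):
    # Each call borrows a pooled connection instead of opening a new one,
    # which matters a great deal when the client is far from the server.
    with engine.connect() as conn:
        return conn.execute(
            text("SELECT name, updated_at FROM items WHERE id = :id"),  # placeholder table
            {"id": item_id},
        ).first()
```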

Google Compute Engine Load Balancer limits

I'm thinking of using Google Compute Engine to run a LOT of instances in a target pool behind a network load balancer. Each of those instances will end up processing many large data streams in real time, so at full scale and at peak times there might be multiple terabytes per second going through.
Question:
Is there a quota or limit on the data you can push through those load balancers? Is there a limit on the number of instances you can have in a target pool? (The documentation does not seem to specify this.)
It seems like load balancers have a dedicated IP (does that mean it's a single machine?).
There's no limit on the amount of data that you can push through a LB. As for instances, there are default limits on CPUs and persistent or SSD disks; you can see those quotas in the Developers Console at 'Compute' > 'Compute Engine' > 'Quotas', but you can always request a quota increase at this link. You can have as many instances as you need in a target pool. Take a look at the Compute Engine Autoscaler, which will help you spin up machines as your service needs them. The single IP provided for your LB is in charge of distributing incoming traffic across your multiple instances.
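If you'd rather check those quotas programmatically than in the console, a sketch along these lines (using the google-cloud-compute client; the project ID is a placeholder) lists the limit and current usage for each project-level Compute Engine quota:

```python
from google.cloud import compute_v1

PROJECT_ID = "my-project"  # placeholder project ID

def print_compute_quotas(project_id: str) -> None:
    # projects.get returns the project-level Compute Engine quotas
    # (CPUs, disk GB, addresses, ...); regional quotas live on regions.get.
    project = compute_v1.ProjectsClient().get(project=project_id)
    for quota in project.quotas:
        print(f"{quota.metric}: {quota.usage} / {quota.limit}")

if __name__ == "__main__":
    print_compute_quotas(PROJECT_ID)
```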

On what factors does the download speed of assets from Amazon S3 depend?

How fast can we download files from Amazon S3? Is there an upper limit (and do they distribute it between all the requests from the same user), or does it only depend on my internet connection's download speed? I couldn't find it in their SLA.
What other factors does it depend on? Do they throttle the data transfer rate at some level to prevent abuse?
This has been addressed in the recent Amazon S3 team post Amazon S3 Performance Tips & Tricks:
First: for smaller workloads (<50 total requests per second), none of the below applies, no matter how many total objects one has! S3 has a bunch of automated agents that work behind the scenes, smoothing out load all over the system, to ensure the myriad diverse workloads all share the resources of S3 fairly and snappily. Even workloads that burst occasionally up over 100 requests per second really don't need to give us any hints about what's coming...we are designed to just grow and support these workloads forever. S3 is a true scale-out design in action.
S3 scales to both short-term and long-term workloads far, far greater than this. We have customers continuously performing thousands of requests per second against S3, all day every day. [...] We worked with other customers through our Premium Developer Support offerings to help them design a system that would scale basically indefinitely on S3. Today we’re going to publish that guidance for everyone’s benefit.
[emphasis mine]
You may want to read the entire post to gain more insight into the S3 architecture and the resulting challenges for really massive workloads (which, as the S3 team stresses, won't apply at all for most use cases).
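In practice, then, for a single large object the bottleneck is usually your own connection and how you issue the request rather than an S3-side cap. A hedged boto3 sketch (bucket, key, and sizes are placeholders) that downloads the object in concurrent ranged parts often gets much closer to saturating your link than one sequential GET:

```python
import boto3
from boto3.s3.transfer import TransferConfig

BUCKET = "my-bucket"            # placeholder bucket
KEY = "large/dataset.bin"       # placeholder key

# Fetch the object in 8 MB ranged parts, up to 10 in flight at once; for a
# single large object the limit is then usually your own connection, not S3.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=10,
)

s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, "dataset.bin", Config=config)
```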

Would Amazon S3 work as storage for sensor data streaming in from millions of endpoints?

I am looking for options for reliable (and speedy) storage for small amounts of sensor data that would be coming in from (getting optimistic here) millions of endpoints.
The scale I'm talking about is 1M endpoints, each sending 100 bytes every minute. This data needs to be available for analysis shortly after it arrives.
Additionally, this data will be kept for a few years and may exceed 100TB of total storage.
Is S3 the solution to this, or would I be better off hosting my own NoSQL cluster like Cassandra/MongoDB etc?
Please let me know if I have not specified any information.
Yes, you could. But there is no query mechanism in S3, nor any method of reading multiple objects in one request. You would also not have any mechanism to inspect the data before it's written.
This might be a better idea:
Have clients write sensor data onto an SQS queue.
Your application reads messages off the queue and writes the data into SimpleDB or another data store.
This decouples receipt of the data from the data load/storage phase (a rough sketch follows below).
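A rough sketch of that decoupling with boto3 (the queue URL, message shape, and the stubbed-out store step are all placeholders; in practice the consumer would write to SimpleDB, DynamoDB, Cassandra, or whatever store you settle on):

```python
import json
import time

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/sensor-readings"  # placeholder

sqs = boto3.client("sqs")

# --- Client side: each endpoint just enqueues its reading -----------------
def send_reading(sensor_id: str, payload: bytes) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "sensor_id": sensor_id,
            "ts": int(time.time()),
            "data": payload.hex(),   # ~100 bytes of sensor data per the question
        }),
    )

# --- Application side: drain the queue and load into the real store -------
def store(record: dict) -> None:
    # Placeholder: write to SimpleDB, DynamoDB, Cassandra, etc.
    print("storing", record["sensor_id"], record["ts"])

def consume_forever() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,   # batch reads to cut per-request cost
            WaitTimeSeconds=20,       # long polling avoids empty receives
        )
        for msg in resp.get("Messages", []):
            store(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Batching receives and using long polling, as in the consumer loop, also trims the per-request charges discussed next.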
Note that many Amazon services have a per-request charge. For SQS it's $0.01 per 10,000 requests. If you want 1 million clients to each write one message every minute, the request charges alone would be over $40,000 a month, and that doubles once reading the messages is taken into account.
(((1000000 * (60*24*30)) / 10000) * $0.01) * 2 = $86,400
For S3, it's $0.01 per 1,000 POSTs (client writes) and $0.01 per 10,000 GETs (reads). For 1 million clients, your per-request charges alone could easily reach $500,000 per month.
Ultimately, at 1 million clients, you likely need to run your own receiving endpoints simply due to economic factors.