How does partition count affect search speed? - amazon-cloudsearch

I'm a little confused by the partition_count setting on aws cloudsearch. Obviously adding more replicas will increase search throughput as there are more instances available to service requests, but how does partition count relate, if at all, to this?

Related

How to handle resource limits for apache in kubernetes

I'm trying to deploy a scalable web application on google cloud.
I have kubernetes deployment which creates multiple replicas of apache+php pods. These have cpu/memory resources/limits set.
Lets say that memory limit per replica is 2GB. How do I properly configure apache to respect this limit?
I can modify maximum process count and/or maximum memory per process to prevent memory overflow (thus the replicas will not be killed because of OOM). But this does create new problem, this settings will also limit maximum number of requests that my replica could handle. In case of DDOS attack (or just more traffic) the bottleneck could be the maximum process limit, not memory/cpu limit. I think that this could happen pretty often, as these limits are set to worst case scenario, not based on average traffic.
I want to configure autoscaler so that it will create multiple replicas in case of such event, not only when the cpu/memory usage is near limit.
How do I properly solve this problem? Thanks for help!
I would recommend doing the following instead of trying to configuring apache to limit itself internally:
Enforce resource limits on pods. i.e let them OOM. (but see NOTE*)
Define an autoscaling metric for your deployment based on your load.
Setup a namespace wide resource-quota. This enforeces a clusterwide limit on the resources pods in that namespace can use.
This way you can let your Apache+PHP pods handle as many requests as possible until they OOM, at which point they respawn and join the pool again, which is fine* (because hopefully they're stateless) and at no point does your over all resource utilization exceed the resource limits (quotas) enforced on the namespace.
* NOTE: This is only true if you're not doing fancy stuff like websockets or stream based HTTP, in which case an OOMing Apache instance takes down other clients that are holding an open socket to the instance. If you want, you should always be able to enforce limits on apache in terms of the number of threads/processes it runs anyway, but it's best not to unless you have solid need for it. With this kind of setup, no matter what you do, you'll not be able to evade DDoS attacks of large magnitudes. You're either doing to have broken sockets (in the case of OOM) or request timeouts (not enough threads to handle requests). You'd need far more sophisticated networking/filtering gear to prevent "good" traffic from taking a hit.

s3 rate limit against website endpoint

I'm hitting my s3 bucket via its website endpoint with various paths/keys. I'm able to get ok (200) responses when I'm hitting it at 1,000 requests per second over the course of 5 minutes. I'm using a popular tool: https://github.com/tsenart/vegeta so I have confidence in these stats.
This is suprising considering the documentation says that anything above is 800 per second is problematic.
Is using a website endpoint different than an API call in terms of throttling? Is 800 a real rate limit or a crude theshhold?
It's a soft limit, and not really a limit from the bucket level perspective. Read carefully. The documentation warns of a rapid request rate increase beyond 800 requests per second potentially resulting in temporary rate limits on your request rate.
S3 increases available capacity by keyspace partition splitting and it takes some time for this to happen... but buckets scale up with workload.
If you are requesting the same object(s) repeatedly, you are also not likely to be imposing as much load on the available resources as you would be if you were hitting 800 unique objects per second and reading between the lines, that is the threshold under discussion -- the time to look up keys in the bucket index. Recent hits are probably already more accessible than cold spots in the index.
The problem that document highlights is that of your object keys are lexically sequential, then S3 will be unable to split the partitions meaningfully, because you will always be creating new objects on only one side of the split or the other and thus working against the scaling algorithm of S3.
The documentation has been updated in meantime and the limits have been increased. Now the limits are per bucket prefix and 1000 req/s isn't a problem any more. For more see the mentioned doc.

Add a random prefix to the key names to improve S3 performance?

You expect this bucket to immediately receive over 150 PUT requests per second. What should the company do to ensure optimal performance?
A) Amazon S3 will automatically manage performance at this scale.
B) Add a random prefix to the key names.
The correct answer was B and I'm trying to figure out why that is. Can someone please explain the significance of B and if it's still true?
As of a 7/17/2018 AWS announcement, hashing and random prefixing the S3 key is no longer required to see improved performance:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
S3 prefixes used to be determined by the first 6-8 characters;
This has changed mid-2018 - see announcement
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
But that is half-truth. Actually prefixes (in old definition) still matter.
S3 is not a traditional “storage” - each directory/filename is a separate object in a key/value object store. And also the data has to be partitioned/ sharded to scale to quadzillion of objects. So yes this new sharding is kinda of “automatic”, but not really if you created a new process that writes to it with crazy parallelism to different subdirectories. Before the S3 learns from the new access pattern, you may run into S3 throttling before it reshards/ repartitions data accordingly.
Learning new access patterns takes time. Repartitioning of the data takes time.
Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if data is partitioned properly. Although to be fair, this may not be applied to you if you don't have a ton of data, or pattern how you access data is not hugely parallel (e.g. running a Hadoop/Spark cluster on many Tbs of data in S3 with hundreds+ of tasks accessing same bucket in parallel).
TLDR:
"Old prefixes" still do matter.
Write data to root of your bucket, and first-level directory there will determine "prefix" (make it random for example)
"New prefixes" do work, but not initially. It takes time to accommodate to load.
PS. Another approach - you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to be flooding it soon.
#tagar That's true especially if you are not in a read intensive scenario !
You have to understand the small characters of the doc to reverse engineer how it is working internally and how your are limited by the system. There is no magic !
503 Slow Down errors are emitted typically when a single shard of S3 is in a hot spot scenario : too much requests to a single shard. What is difficult to understand is how sharding is done internally and that the advertised limit of request is not guaranteed.
pre-2018 behavior gives the details : it was advised to start the first 6-8 digits of the prefix with random characters to avoid hot spots.
One can them assume that initial sharding of an S3 bucket is done based on the first 8 digits of the prefix.
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
post-2018 : an automatic sharding was put in place and AWS does no longer advise to bother about the first digits of the prefix... However from this doc :
http-5xx-errors-s3
amazon-s3-performance-tips-fb76daae65cb
One can understand that this automatic shard rebalancing can only work well if load to a prefix is PROGRESSIVELY scaled up to advertised limits:
If the request rate on the prefixes increases gradually, Amazon S3
scales up to handle requests for each of the two prefixes. (S3 will
scale up to handle 3,500 PUT/POST/DELETE or 5,500 GET requests per
second.) As a result, the overall request rate handled by the bucket
doubles.
From my experience 503 can appear way before the advertised levels and there is no guarantee on the speed of the internal rebalancing made internally by S3.
If you are in a write intensive scenario for exemple uploading a lot of small objects, the automatic scaling won't be efficient to rebalance your load.
In short : if you are relying on S3 performance I advise to stick to pre-2018 rules so that the initial sharding of your storage works immediately and does not rely on the auto-rebalancing algorithm of S3.
hash first 6 digits of prefix or design a datamodel balancing partitions uniformly across first 6 digits of prefix
avoid small objects (target size of object ~128MB)
Lookup/writes work means using filenames that are similar or ordered can harm performance.
Adding hashes/random ids prefixing the S3 key is still advisable to alleviate high loads on heavily accessed objects.
Amazon S3 Performance Tips & Tricks
Request Rate and Performance Considerations
How to introduce randomness to S3 ?
Prefix folder names with random hex hashes. For example: s3://BUCKET/23a6-FOLDERNAME/FILENAME.zip
Prefix file names with timestamps. For example: s3://BUCKET/ FOLDERNAME/2013-26-05-15-00-00-FILENAME.zip
B is correct because, when you add randomness (called entropy or some disorderness), that can place all the objects locat close to each other in the same partition in an index.(for example, a key prefixed with the current year) When your application experiences an increase in traffic, it will be trying to read from the same section of the index, resulting in decreased performance.So, app devs add some random prefixes to avoid this.
Note: AWS might have taken care of this so Dev won't need to take care but just wanted to attempt to give the correct answer for the question asked.
As of June 2021.
As mentioned on AWS guidebook Best practice design pattern: optimizing Amazon S3 performance, the application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.
I think the random prefix will help to scale S3 performance.
for example, if we have 10 prefixes in one S3 bucket, it will have up to 35000 put/copy/post/delete requests and 55000 read requests.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

how to keep visited urls and maintain the job queue when writing a crawler

I'm writing a crawler. I keep the visited urls in redis set,and maintain the job queue using redis list. As data grows,memory is used up, my memory is 4G. How to maintain these without redis? I have no idea,if I store these in files,they also need to be in memory.
If I use a mysql to store that,I think it maybe much slower than redis.
I have 5 machines with 4G memory,if anyone has some material to set up a redis cluster,it also helps a lot. I have some material to set up a cluster to be failover ,but what I need is to set a load balanced cluster.
thx
If you are just doing the basic operations of adding/removing from sets and lists, take a look at twemproxy/nutcracker. With it you can use all of the nodes.
Regarding your usage pattern itself, are you removing or expiring jobs and URLs? How much repetition is there in the system? For example, are you repeatedly crawling the same URLs? If so, perhaps you only need a mapping of URLs to their last crawl time, and instead of a job queue you pull URLs that are new or outside a given window since their last run.
Without the details on how your crawler actually runs or interacts with Redis, that is about what I can offer. If memory grows continually, it likely means you aren't cleaning up the DB.

What is "Excessive resource usage" in SQL Azure?

I searched online for awhile about what is "Excessive resource usage" on SQL Azure, still cannot get an idea.
Some articles suggest query takes too long, too much memory etc will cause "Excessive resource usage". But If I use simple query, simple data structure, what will happen?
For example: I get a 1G SQL Azure as session state. Since session is a very small string, and save/delete all the time, I don't think it will grow to 1G for millions of session simultaneously. You can calculate, for 1 million session, 20 char each, only take 20M space, consider 20 minutes expire etc. Cannot even close to 1G. But the queries, should be lots and lots. Each query will be very simple and fast by index.
I wanna know, if this use will be consider as "Excessive resource usage"? Is there any hard number to limit you on the usage?
Btw, as example above, if all happen in same datacenter, so all cost is 1G database which is $10 a month, right?
Unfortunately the answer is 'it depends'. I think that probably the best reference (with guidance) on the SQL Azure Query Throttle is here: TechNet Article on SQL Azure Perormance This will povide details about the metrics that are monitored and the mechanism of the throttle.
The reason that I say it depends is that the throttle is non-deterministic for any given user. This is because the throttle will be activated based on the total load on the node (physical SQL Server in Azure DC). While the subscribers who will get throttled are the subscribers delivering the greatest load the level at which the throttle kicks in will depend on the total load on the node. SO if you are on a quiet node (where other tenant DBs are relatively inactive) then you will be able to put through a bunch more throughput than if you are on a busy node.
It is very appealing to use 1GB SQL Azure DBs for session state storage; you've identified the cost benefits. You are taking a risk though. One way to mitigate this risk is to partition across at least two SQL Azure 1GB DBs and adjust the load yourself based on whether one of the DBs starts hitting the throttle.
Another option if you want determinism for throughput is to use the WIndows Azure Cache to back your sesion state store. The Cache has hard pre-defined limits for query throughput so you can plan for it more easily Azure Caching FAQ including Limits. The Cache approach is probably a bit more expensive but with a lower risk of problems.