In Vulkan:
Multiple VkDescriptorPools can be created.
From each single VkDescriptorPool multiple VkDescriptorSets can be allocated.
Is there any limit to the number of VkDescriptorPools you can create? (apart from available memory)
Is there any indication in the spec of the overhead (memory, cpu-time, gpu-time) of using many small VkDescriptorPools versus using a few large ones? Or doesn't it generally matter?
Descriptor pools are limited by memory (or at least an out-of-memory error code is what you get, whatever the underlying problem). Update-after-bind descriptor pools are additionally limited by maxUpdateAfterBindDescriptorsInAllPools.
The Vulkan specification usually does not comment on performance, as that may differ between GPUs or change in future GPUs.
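For reference, a minimal host-side sketch of the usual pattern (the VkDevice, the set layout and the single uniform-buffer pool size are placeholder assumptions, not anything the spec prescribes): allocate sets from a pool, and when allocation fails simply create another pool and retry there.

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Create one descriptor pool able to hold `maxSets` sets of one uniform buffer each.
// Creation only fails with generic out-of-memory errors; there is no dedicated
// "too many pools" error code.
VkDescriptorPool createPool(VkDevice device, uint32_t maxSets) {
    VkDescriptorPoolSize poolSize{};
    poolSize.type            = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    poolSize.descriptorCount = maxSets;

    VkDescriptorPoolCreateInfo info{};
    info.sType         = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
    info.maxSets       = maxSets;
    info.poolSizeCount = 1;
    info.pPoolSizes    = &poolSize;

    VkDescriptorPool pool = VK_NULL_HANDLE;
    if (vkCreateDescriptorPool(device, &info, nullptr, &pool) != VK_SUCCESS)
        return VK_NULL_HANDLE;
    return pool;
}

// Allocate `count` sets from the given pool. On VK_ERROR_OUT_OF_POOL_MEMORY
// (or VK_ERROR_FRAGMENTED_POOL) the caller can create a fresh pool and retry.
bool allocateSets(VkDevice device, VkDescriptorPool pool,
                  VkDescriptorSetLayout layout, uint32_t count,
                  std::vector<VkDescriptorSet>& out) {
    std::vector<VkDescriptorSetLayout> layouts(count, layout);

    VkDescriptorSetAllocateInfo alloc{};
    alloc.sType              = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
    alloc.descriptorPool     = pool;
    alloc.descriptorSetCount = count;
    alloc.pSetLayouts        = layouts.data();

    out.resize(count);
    return vkAllocateDescriptorSets(device, &alloc, out.data()) == VK_SUCCESS;
}
```

Whether you prefer one large pool or many small ones per frame/thread is exactly the kind of thing the spec leaves open, so benchmark on your target hardware.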
The hash table does not rehash.
We use the simple division method as the hash function.
We assume the hash function distributes the entries evenly.
The goal is to have O(1) insertion, deletion and find.
The optimal number of buckets is a compromise between memory consumption and hash collisions, for the intended usage patterns.
For example, if something is very frequently used you might limit the size of the hash table to half the size of a CPU's cache to reduce the chance of "cache miss accessing hash table"; this can be faster than using a larger hash table (with worse cache misses but a lower chance of hash collisions). Alternatively, if it's used infrequently (and therefore you expect cache misses regardless of hash table size) then a larger size is more likely to be optimal.
Of course real systems have multiple caches (L1, L2, L3) plus virtual memory translation caches (TLBs) plus RAM limits (plus swap space limits); real software has more than just one hash table competing for resources in the memory hierarchy; and often the software developers have no idea what other processes might be running (competing for physical RAM, polluting caches, etc) or what any end user's hardware is (sizes of caches, etc). All of this makes it virtually impossible to determine "optimal" with any method (including extensive benchmarking).
The only practical option is to take an educated guess based on various assumptions (about usage, the amount of data and how good the hashing function will be in practice, the CPU, the other things that might be using CPUs and memory, ...); and make the source code configurable (e.g. #define HASH_TABLE_SIZE ..) so you can easily re-assess the guess later.
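As a concrete illustration of "educated guess plus configurable constant", here is a minimal fixed-size, separately chained hash table using the division method; the bucket count, the integer key type and the entry layout are hypothetical placeholders to be re-assessed (and benchmarked) against the real workload.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Educated guess: sized so the hot part of the table has a chance of staying
// in a mid-level CPU cache. Re-assess and benchmark later.
#define HASH_TABLE_SIZE 32768u

struct Entry {
    uint64_t key;
    uint64_t value;
};

class HashTable {
public:
    HashTable() : buckets_(HASH_TABLE_SIZE) {}

    // Division method: bucket index = key mod table size.
    static std::size_t bucket(uint64_t key) { return key % HASH_TABLE_SIZE; }

    void insert(uint64_t key, uint64_t value) {            // expected O(1)
        buckets_[bucket(key)].push_back({key, value});
    }

    bool find(uint64_t key, uint64_t& value) const {       // expected O(1)
        for (const Entry& e : buckets_[bucket(key)])
            if (e.key == key) { value = e.value; return true; }
        return false;
    }

    bool erase(uint64_t key) {                              // expected O(1)
        auto& chain = buckets_[bucket(key)];
        for (std::size_t i = 0; i < chain.size(); ++i) {
            if (chain[i].key == key) {
                chain[i] = chain.back();                    // swap-and-pop, order not kept
                chain.pop_back();
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::vector<Entry>> buckets_;               // fixed size, never rehashed
};
```

The O(1) expectation only holds while the hash function keeps the chains short, which is exactly the assumption stated above.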
Redis can be scaled using replicas and shards. However:
replicas scale only reads, but can provide HA
shards scale both reads and writes, and have the added benefit of requiring less memory than adding a replica (each shard holds only part of the dataset, whereas a replica duplicates all of it).
Based on these facts, if I'm not interested in HA does it make sense to always use shards and not replicas since I get the benefit of scaling both reads and writes, with a smaller memory footprint (and lower costs)?
Yes you can.
About HA, you have to be sure you define/know what the application behaviour is if a shard becomes unavailable (data loss, service unavailable, ...).
On the replica-read question, without information about your application it is hard to tell; but most of the time a single Redis instance (shard) is enough to deal with a lot of load. A very rough rule of thumb is that a shard can handle 25 GB of data and 25,000 operations/second with sub-ms latency without any problem. Obviously this depends on the type of operations, data and commands you are running; it could be a lot more ops/sec if you only do basic set/get.
And usually when we have more than this, we use Clustering to distribute the load.
So before going down the "replica-read" route (which I try to avoid as much as possible), take a look at your application and do some benchmarks on a single shard: you will probably see that everything is fine (at least from the workload point of view, though you will have a SPOF if you do not replicate).
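For context on how that clustering spreads both reads and writes: Redis Cluster maps every key to one of 16384 hash slots via CRC16 of the key, and each shard owns a subset of the slots. A rough sketch of the slot computation (it ignores the "{...}" hash-tag rule that the real implementation also applies):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// CRC16-CCITT (XMODEM variant), the checksum Redis Cluster uses for key -> slot mapping.
static uint16_t crc16(const char* buf, std::size_t len) {
    uint16_t crc = 0;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint16_t>(static_cast<unsigned char>(buf[i])) << 8;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
    }
    return crc;
}

// HASH_SLOT = CRC16(key) mod 16384. Every key lands on exactly one shard,
// so adding shards scales writes as well as reads.
static uint16_t hashSlot(const std::string& key) {
    return crc16(key.data(), key.size()) % 16384;
}
```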
After going through a couple of resources on Google and Stack Overflow (mentioned below), I have a high-level understanding of when to use what, but I still have a couple of questions.
My understanding:
When used as pure in-memory databases, both have comparable performance. But for big data, where the complete dataset cannot fit in memory (or could fit, but at a higher cost), Aerospike (AS) can be a good fit, as it provides a mode where indexes are kept in memory and the data on SSD. I believe performance will be somewhat degraded compared to a fully in-memory database (though the way AS handles reads/writes from SSD makes it faster than traditional disk I/O), but it saves cost and still performs better than keeping the complete data on disk. So when the complete dataset fits in memory both can be equally good, but when memory is the constraint AS can be the better choice. Is that right?
Also, it is said that AS provides a rich and easy-to-set-up clustering feature, whereas some of the clustering features in Redis need to be handled at the application level. Does that still hold, or was it only true until a couple of years back (I believe so, as I see Redis also provides a clustering feature now)?
How is aerospike different from other key-value nosql databases?
What are the use cases where Redis is preferred to Aerospike?
Your assumption in (1) is off, because it applies to (mostly) synthetic situations where all data fits in memory. What happens when you have a system that grows to many terabytes or even petabytes of data? Would you want to try and fit that data in a very expensive, hard-to-manage, fully in-memory system containing many nodes? A modern machine can hold a lot more SSD/NVMe storage than memory. If you look at the new i3en instance family from Amazon EC2, the i3en.24xl has 768 GB of RAM and 60 TB of NVMe storage (8 x 7.5 TB). That kind of machine works very well with Aerospike, as it only stores the indexes in memory. A very large amount of data can be stored on a small cluster of such dense nodes, and it performs exceptionally well.
Aerospike is used in the real world in clusters that have grown to hundreds of terabytes or even petabytes of data (tens to hundreds of billions of objects), serving millions of operations per-second, and still hitting sub-millisecond to single digit millisecond latencies. See https://www.aerospike.com/summit/ for several talks on that topic.
Another aspect affecting (1) is that the performance of a single instance of Redis is misleading if in reality you'll be deploying on multiple servers, each with multiple instances of Redis on them. Redis isn't a distributed database in the way Aerospike is: it requires application-side sharding (which becomes a bit of a clustering and horizontal scaling nightmare) or a separate proxy, which often ends up being the bottleneck. It's great that a single shard can do a million operations per second, but if the proxy can't handle the combined throughput, and competes with the shards for CPU and memory, there is more to the performance-at-scale picture than just in-memory versus data on SSD.
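To make the application-side sharding point concrete, here is a sketch of the naive approach (the instance list and the hashing choice are hypothetical): the application routes each key by hash modulo the number of instances, which works until an instance is added or removed, at which point most keys map to a different server and the application itself has to rebalance them.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Naive application-side sharding: the client picks a Redis instance per key.
struct ShardRouter {
    std::vector<std::string> instances;   // e.g. {"redis-1:6379", "redis-2:6379"}

    const std::string& instanceFor(const std::string& key) const {
        std::size_t h = std::hash<std::string>{}(key);
        return instances[h % instances.size()];
    }
};

// Growing from N to N+1 instances changes (h % N) for most keys, so nearly the
// whole keyspace must be migrated by the application; consistent hashing or a
// cluster-aware client reduces, but does not remove, that burden.
```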
Unless you're looking at a tiny amount of objects or a small amount of data that isn't likely to grow, you should probably compare the two for yourself with a proof-of-concept test.
I know little about how leading RDBMSs go about retrieving data. So these questions may seem a bit rudimentary:
Does each SELECT in commonly used RDBMSs such as Oracle, SQL Server, MySQL, PostgreSQL etc. always mean a trip to disk to read the data, or do they, to the extent the hardware allows, cache commonly requested data to avoid the expensive I/O operation?
How do they determine which data segments to cache?
How do they go about synchronizing the cache once an update of some of the cached data occurs by a different process?
Is there a comparison matrix on how different RDBMSs cache frequently requested data?
Thanks
I'll answer for SQL Server:
Reads are served from cache if possible; otherwise an I/O occurs.
From what has been written and from what I observe, it is an LRU algorithm; I don't think this is documented anywhere. The LRU items are 8 KB database pages (see the sketch after this answer).
SQL Server is the only process which has access to the database files. So no other process can cause modifications. Regarding concurrent transactions: Multiple transactions can modify the same page. Locking (mostly at row-level, sometimes page or table level) ensures that the transactions do not disturb each other.
I don't know.
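A minimal sketch of the general shape described above (not SQL Server's actual buffer manager; the pool size is a hypothetical number, and dirty-page write-back is omitted): fixed-size pages are kept in a map, a read that hits the cache avoids I/O, and when the cache is full the least recently used page is evicted.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr std::size_t PAGE_SIZE   = 8 * 1024;  // 8 KB pages, as in SQL Server
constexpr std::size_t CACHE_PAGES = 1024;      // hypothetical buffer pool size (8 MB)

struct Page { std::vector<uint8_t> bytes = std::vector<uint8_t>(PAGE_SIZE); };

class BufferPool {
public:
    // Returns the cached page, reading it from disk only on a miss.
    Page& get(uint64_t pageId) {
        auto it = index_.find(pageId);
        if (it != index_.end()) {                    // hit: mark as most recently used
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (lru_.size() == CACHE_PAGES) {            // full: evict the LRU page
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(pageId, readPageFromDisk(pageId));  // miss: do the I/O
        index_[pageId] = lru_.begin();
        return lru_.front().second;
    }

private:
    Page readPageFromDisk(uint64_t /*pageId*/) { return Page{}; }  // stand-in for real disk I/O

    std::list<std::pair<uint64_t, Page>> lru_;       // front = most recently used
    std::unordered_map<uint64_t,
        std::list<std::pair<uint64_t, Page>>::iterator> index_;
};
```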
The answers for Informix are pretty similar to those given for SQL Server:
Reads and writes both use the cache if at all possible. If the page needed is not already in cache, an appropriate collection of I/O operations occurs (typically, evicting some page from cache, perhaps a dirty page that must be written before a new page can be read in, and then reading the new page where the old one was).
There are various algorithms, but page size and usage are the key parts. There are LRU queues for each page size.
The DBMS as a whole is an ensemble of processes that use a buffer pool in shared memory (and, where possible, direct disk I/O instead of going through the kernel cache), and uses various forms of locking (semaphores, spin-locks, mutexes, etc) to handle concurrency and synchronization. (On Windows, Informix uses a single process with multiple threads; on Unix, it uses multiple processes.)
Probably not.
When accessing 2D arrays in global memory, using the Texture Cache has many benefits, like filtering and not having to care as much about memory access patterns. The CUDA Programming Guide names only one downside:
However, within the same kernel call, the texture cache is not kept coherent with respect to global memory writes, so that any texture fetch to an address that has been written to via a global write in the same kernel call returns undefined data.
If I don't have a need for that, because I never write to the memory I read from, are there any downsides/pitfalls/problems when using the Texture Cache (or Image2D, as I am working in OpenCL) instead of plain global memory? Are there any cases where I will lose performance by using the Texture Cache?
Textures can be faster, the same speed, or slower than "naked" global memory access. There are no general rules of thumb for predicting performance using textures, as the speed up (or lack of speed up) is determined by data usage patterns within your code and the texture hardware being used.
In the worst case, where cache hit rates are very low, using textures is slower than normal memory access: each thread first incurs a cache miss and only then triggers the global memory fetch, so the resulting total latency is higher than a direct read from memory. I almost always write two versions of any serious code I am developing where textures might be useful (one with and one without), and then benchmark them. Often it is possible to develop heuristics to select which version to use based on inputs. CUBLAS uses this strategy extensively.
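In CUDA terms, the "two versions" approach can be as small as the pair of kernels below (a sketch with a trivial copy-style access pattern; the texture object setup is omitted, and in OpenCL the same split would be an image2d_t path versus a plain buffer path):

```cuda
#include <cuda_runtime.h>

// Version 1: read-only 2D data fetched through the texture cache.
__global__ void gatherTex(cudaTextureObject_t tex, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // A miss here adds texture-cache latency on top of the global memory fetch,
    // which is where the worst-case slowdown comes from.
    out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

// Version 2: the same access done as plain global loads.
__global__ void gatherGlobal(const float* in, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    out[y * width + x] = in[y * width + x];
}
```

Time both on representative inputs (cudaEvent timers are enough) and keep whichever wins; as noted above, the answer can change between data sets and GPU generations.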