Redis HyperLogLog limitations

I am trying to solve a problem in a hacky way using Redis HyperLogLog, and what I am trying to understand are the limitations and assumptions HyperLogLog makes about the data or its distribution.
Count-min sketches and Bloom filters have their own well-known limitations, but searching hasn't turned up much about the applications and limitations of HyperLogLog.
I am using Redis HyperLogLog, and as antirez describes, there are no practical limits to the cardinality of the sets we can count. But from a theoretical perspective, does HyperLogLog make any assumptions or impose any constraints on the data or its distribution?

The HyperLogLog algorithm assumes that a strong universal hash function is used. Redis uses MurmurHash64A, which should be good enough from a practical point of view. The Redis HyperLogLog implementation uses 6 bits per register, which is enough to represent any run length within a 64-bit hash value. Hence, the only limitation I see is the 64-bit hash value itself: if the cardinality is on the order of 2^64, there will be many hash collisions, which would ultimately lead to large estimation errors. However, cardinalities of this order of magnitude never occur in practice.
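To make the register/run-length idea concrete, here is a minimal, illustrative HyperLogLog sketch. It is not Redis's implementation: Redis uses 16384 registers and MurmurHash64A, while this toy uses 1024 registers and SHA-1 purely so the example stays self-contained.

```python
import hashlib

# Toy HyperLogLog (illustrative only, NOT Redis's implementation).
M = 1024          # number of registers (must be a power of two)
P = 10            # log2(M): bits of the hash used to pick a register

def hll_add(registers, item):
    # 64-bit hash of the item (Redis uses MurmurHash64A; SHA-1 here
    # just to stay dependency-free).
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    idx = h & (M - 1)      # low P bits select a register
    rest = h >> P          # remaining 54 bits: measure the leading-zero run
    # Position of the leftmost 1-bit in `rest`, counted from the left;
    # the maximum is 55, which is why 6-bit registers suffice.
    rank = (64 - P) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)

def hll_estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)   # bias-correction constant
    return alpha * M * M / sum(2.0 ** -r for r in registers)

registers = [0] * M
for i in range(100_000):
    hll_add(registers, f"user:{i}")

est = hll_estimate(registers)
print(est)   # close to 100000; typical error is ~1.04/sqrt(M) ≈ 3%
```

Note the distribution of the *values* never matters, only that the hash output is uniform, which is exactly the "strong universal hash" assumption above.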

Related

Which common database library will rack up the least cost (e.g. from memory and CPU usage) on Google Cloud Run and similar services?

I want to make a CRUD (create-read-update-delete) API through which users can interact with a key-value store database. It'll be hosted on Cloud Run (e.g. see this example) or a similar service, running all day to serve requests.
All data will have a short TTL (time-to-live) of around 1 minute, and keys and values will just be short strings. Furthermore, speed, security, redundancy, etc. aren't concerns (within reason).
In this case, which common database backend will be the cheapest in terms of its CPU and memory usage? I was thinking of using Redis, but I worried that it might be unnecessarily CPU/memory-intensive compared to, say, SQLite or PostgreSQL.
Or is it the case that basically all these databases will have similar CPU/memory usage?
Edit:
Keys are 256-bit numbers, and values are <140-character strings. Every minute, a user requests to write/read at most 100 of these, and let's just say there are 100k users.
Redis would do fine for this kind of use case. An RDBMS would also do the job, but from what you explained, you don't need a relational database, since your data is key/value. Redis is very fast for this case, and with good data modeling you can reduce the memory usage.
Since your requirements are key/value and the keys and values have reasonable sizes, you can take advantage of Redis hashes. In addition, since you don't need persistent storage, you can use EXPIRE to manage your memory usage easily. Redis's benchmark tool can help you benchmark both strings and hashes to decide which one uses less memory.
A couple of hours ago, I answered a question about reducing Redis memory usage by using hashes over strings here; it may give some insight.
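As an illustration of the hashes trick mentioned above, here is a hedged sketch of client-side key bucketing. The `bucket:<n>` naming and the bucket count are assumptions for the example, not a Redis API; the idea is to group many small keys into fewer Redis hashes so that Redis's compact hash encoding can apply.

```python
import hashlib

# Illustrative bucketing: instead of SET <key> <value>, you would issue
# HSET bucket:<n> <key> <value>, with <n> derived from the key.
NUM_BUCKETS = 16384   # assumed; tune for your dataset

def bucket_for(key: str) -> str:
    # Stable mapping of a key to a bucket (MD5 here is for distribution,
    # not security).
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return f"bucket:{h % NUM_BUCKETS}"

# With 100k users x 100 keys = 10M entries, each bucket would hold about
# 10_000_000 / 16384 ≈ 610 fields. Keeping fields-per-bucket under
# hash-max-ziplist-entries (hash-max-listpack-entries in Redis 7+)
# enables the compact encoding.
key = "a" * 64   # a 256-bit key rendered as 64 hex characters
print(bucket_for(key))
```

One caveat: EXPIRE applies to a whole key (here, the whole hash), not to individual fields, so this layout fits best when the entries in a bucket share roughly the same TTL.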

Redis Sharding performance and o(1) time complexity for get key

Please, I need a simple explanation regarding this.
Redis claims that the time complexity of getting a key is O(1).
As such, whether I have 1,000,000 or 1,000,000,000,000,000 key-value pairs, the time to get a key is the same.
My question now is:
I have a requirement to hold about 1 billion key-value pairs. If memory is not a problem (meaning, assume I have a single server with enough memory to hold that much data), is there any advantage to sharding? That is to say, will there be any performance advantage to separating these 1 billion key-value pairs across 10 Redis instances, each holding 100 million records, as opposed to a single Redis instance holding all the records?
Thanks so much for your anticipated response.
There is a definite advantage to sharding in terms of performance, as it can use multiple CPU cores (ideally, one per shard). Being (mostly) single-threaded, a single Redis instance can use only one core (and a little more). Sharding effectively increases the parallelism of the deployment, thus contributing positively to performance (but adding to the administrative overhead).
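To illustrate the idea, here is a minimal client-side routing sketch. The instance addresses are hypothetical, and CRC32 is used only to keep the example dependency-free; Redis Cluster itself routes keys with CRC16(key) mod 16384 hash slots.

```python
import binascii

# Hypothetical shard addresses; in a real deployment these would be your
# 10 Redis instances.
SHARDS = [f"redis-{i}:6379" for i in range(10)]

def shard_for(key: str) -> str:
    # Deterministically route a key to one shard. Every client must use
    # the same function, or lookups will miss.
    slot = binascii.crc32(key.encode()) % len(SHARDS)
    return SHARDS[slot]

# Each GET/SET is still O(1) on its shard; the win from sharding is that
# ten instances can use ten CPU cores in parallel, not a faster lookup.
print(shard_for("user:12345"))
```

Note this sketch omits what makes real sharding hard: resharding when the shard count changes, and multi-key operations that span shards.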

Redis vs RocksDB

I have read about Redis and RocksDB, and I don't get the advantages of Redis over RocksDB.
I know that Redis is all in-memory, while RocksDB is in-memory plus flash storage. If all the data fits in memory, which one should I choose? Do they have the same performance? Does Redis scale linearly with the number of CPUs? I guess there are other differences that I'm missing.
I have a dataset which fits in memory, and I was going to choose Redis, but it seems that RocksDB offers me the same, and if one day the dataset grows too much, I wouldn't have to worry about memory.
They have nothing in common. You are trying to compare apples and oranges here.
Redis is a remote in-memory data store (similar to memcached). It is a server. A single Redis instance is very efficient, but not scalable at all (regarding CPU). A Redis cluster is scalable (regarding CPU).
RocksDB is an embedded key/value store (similar to BerkeleyDB or, more exactly, LevelDB). It is a library, supporting multi-threading and persistence based on log-structured merge trees.
While Didier Spezia's answer is correct in his distinction between the two projects, they are linked by a project called LedisDB. LedisDB is an abstraction layer written in Go that implements much of the Redis API on top of storage engines like RocksDB. In many cases you can use the same Redis client library directly with LedisDB, making it almost a drop-in replacement for Redis in certain situations. Redis is obviously faster, but as the OP mentioned in his question, the main benefit of using RocksDB is that your dataset is not limited to the amount of available memory. I find that useful not because I'm processing super-large datasets, but because RAM is expensive and you can get more mileage out of smaller virtual servers.
Redis, in general, has more functionality than RocksDB. It can natively understand the semantics of complex data structures such as lists and sets. RocksDB, in contrast, looks at stored values as opaque blobs of data. If you want to do any further processing, you need to bring the data to your program and process it there (in other words, you can't delegate the processing to the database engine, i.e. RocksDB).
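To make the blob-vs-native distinction concrete, here is an illustrative sketch using a plain dict as a stand-in for an embedded blob store: appending to a list means a full client-side read-deserialize-modify-write cycle, whereas Redis would do the same work with a single server-side command (RPUSH).

```python
import json

# Stand-in for an embedded blob store like RocksDB: keys map to opaque bytes.
store = {}

def blob_append(key, item):
    # The engine only sees bytes, so the client must do the list logic:
    blob = store.get(key, b"[]")
    lst = json.loads(blob)                   # 1. read + deserialize
    lst.append(item)                         # 2. modify in the client
    store[key] = json.dumps(lst).encode()    # 3. reserialize + write back

blob_append("events", "login")
blob_append("events", "logout")
print(json.loads(store["events"]))           # ['login', 'logout']

# With Redis, the equivalent is one server-side command with no
# round-trip of the whole value:  RPUSH events login
```

Over a network, step 1 and step 3 each transfer the entire value, which is exactly the processing you cannot delegate to a blob engine.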
RocksDB only runs on a single server. Redis also has a clustered version (though it was not free at the time of writing).
Redis is built for in-memory computation; it also supports backing the data up to persistent storage, but the main use cases are in-memory ones. RocksDB, by contrast, is usually used for persisting data and in most cases stores the data on a persistent medium.
RocksDB has better multi-threaded support (especially for reads; writes still suffer from concurrent access).
Many memcached deployments use Redis under the hood (the protocol used is memcached, but the underlying server is Redis). This doesn't use most of Redis's functionality, but it is one case where Redis and RocksDB function similarly (as a KVS, though still in different contexts: a Redis-based memcached is a cache, while RocksDB is a database, though not an enterprise-grade one).
@Guille If you know that the behavior of your hot data (fetched frequently) is time-stamp based, then RocksDB would be a smart choice, but do optimize the fallback path using Bloom filters. If your hot data is random, then go for Redis. Running RocksDB entirely in memory is generally not recommended; log-structured databases like RocksDB are specifically optimized for SSD and flash storage. So my recommendation would be to understand the use case and pick the DB for that particular use case.
Redis is a distributed, in-memory data store, whereas RocksDB is an embedded key-value store and is not distributed.
Both are key-value stores, so they do have something in common.
As others mentioned, RocksDB is embedded (as a library), while Redis is a standalone server. Moreover, Redis can be sharded.
| RocksDB | Redis |
| --- | --- |
| persisted on disk | stored in memory |
| strictly serializable | eventually consistent |
| sorted collections | no sorting |
| vertical scaling | horizontal scaling |
If you don't need horizontal scaling, RocksDB is often a superior choice. Some people assume that an in-memory store is strictly faster than a persistent one, but that is not always true: embedded storage has no networking bottleneck, which matters greatly in practice, especially for vertical scaling on bigger machines.
If you need to serve RocksDB over a network or need high-level language bindings, the most efficient approach would be to use the UKV project. It, however, also supports other embedded stores as engines and provides higher-level functionality, such as graph collections, similar to RedisGraph, and document collections, like RedisJSON.

What are the use cases where Redis is preferred to Aerospike?

We are currently using Redis, and it's a great in-memory datastore. We're starting to look at some new problems where the in-memory limitation is a factor, and we're looking at other options. One we came across is Aerospike; it seems very fast, even faster than Redis for in-memory single-shard operations.
Now that we're adding this to our stack, I'm trying to understand: what are the use cases where Aerospike would not be able to replace Redis?
Aerospike supports fewer data types than Redis; for example, pub/sub is not available in Aerospike. However, Aerospike is a distributed key-value store and has superior clustering features.
The two are both great databases. It really depends on how big of a dataset you're handling, and your expectations of growth.
Redis:
Key/value store; the dataset fits into RAM on a single machine, or you can shard it yourself across multiple machines (and/or cores, since it's single-threaded); persists data to disk; has data structures like lists/sets; basic pub/sub; simple slave replication; Lua scripting.
Aerospike:
Key/value row store (meaning a value contains bins with values, and those values can themselves be maps/lists/values, giving multiple levels of nesting); multithreaded to use all cores; built for clustering across machines with replication, including cross-datacenter replication; Lua scripting for UDFs. Can run directly on SSDs, so you can store much more data without it fitting into RAM.
Comparison:
If you just have a smaller dataset or are fine with single-core performance then Redis is great. Quick to install, simple to run, easy to just attach a slave with 1 command if you need more read scalability. Redis also has more unique functionality with list/set/bitmap operations so you can do "more" out of the box.
If you want to store more complicated or nested data or need more performance on a single machine or clustering, then Aerospike gets the job done really well with less operational overhead. Very fast performance and easy cluster setup with all nodes being exactly the same role so you can scale reads and writes.
That's the big difference: scalability beyond a single core or server. With Lua scripting, you can usually replicate in Aerospike any native feature that Redis has. If you have lots of data (like TBs), then Aerospike's SSD feature means you get RAM-like performance without the RAM cost.
Have you looked at the benchmarks? I believe each one performs differently under different conditions and use cases:
http://www.aerospike.com/when-to-use-aerospike-vs-redis/
https://redislabs.com/blog/nosql-performance-aerospike-cassandra-datastax-couchbase-redis
Redis and Aerospike are different, and both have their pros and cons, but Redis seems a better fit than Aerospike in the two following use cases:
When we don't need replication:
We are using a big cache with intensive writes and a very short TTL (20 s) for deduplication. There is no point in replicating this data. Redis would probably use half as much CPU and less than half as much RAM as Aerospike. It would be cheaper and as fast, or even faster thanks to pipelining.
When we need cross-data-center replication:
We have one large database that we need to access from 5 data centers, with lots of writes and intensive reads. There is no perfect solution, but the best one so far seems to be storing the central database in Redis and a copy in each data center using Redis master-slave replication.
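A back-of-envelope calculation (with assumed latency numbers) shows why pipelining can make the unreplicated Redis setup as fast or faster for bursts of writes:

```python
# Assumed numbers for illustration only; measure your own deployment.
rtt_ms = 0.5          # network round-trip time per request
per_cmd_ms = 0.005    # server-side processing time per command
n = 1000              # commands in a write burst

# Without pipelining, every command pays a full round trip:
sequential = n * (rtt_ms + per_cmd_ms)    # ≈ 505 ms
# Pipelined, one round trip carries the whole burst:
pipelined = rtt_ms + n * per_cmd_ms       # ≈ 5.5 ms

print(sequential, pipelined)
```

With write-heavy, short-TTL deduplication traffic, the round-trip term dominates, which is why batching commands matters far more than raw per-command speed.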

Implementations of GRGPF Clustering algorithm

I am looking for source code that implements the clustering algorithm found in Ganti et al., "Clustering Large Datasets in Arbitrary Metric Spaces." In particular, I have a large-data problem that I seek to cluster (so it is a one-pass clustering problem), and I do not have binary operators on the space (so finding "average" elements between elements is not an option).
I am agnostic to language (although I would prefer a simple I/O mechanism).
Any thoughts?