Let's say that I have a HyperLogLog in Redis which counts messages. Is there any provision whereby I can, to some degree, account for deleted messages?
No, the HyperLogLog doesn't support the concept of deletion. Instead, use a different counter (could be an integer, Set or HyperLogLog) and subtract the totals.
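As a rough sketch of the "subtract the totals" idea, one HyperLogLog could count every message seen and a second one could count deleted messages. The key names and function names below are hypothetical, and `client` is assumed to be a redis-py style client (for example `redis.Redis()`). Both counts are approximate, so the difference is approximate too.

```python
ADDED_KEY = "messages:added"      # hypothetical key name
DELETED_KEY = "messages:deleted"  # hypothetical key name


def record_message(client, message_id):
    # Count the message in the "added" HyperLogLog.
    client.pfadd(ADDED_KEY, message_id)


def record_deletion(client, message_id):
    # HyperLogLog has no removal, so count deletions in a second HLL.
    client.pfadd(DELETED_KEY, message_id)


def estimated_live_messages(client):
    # Subtract the two approximate totals; clamp at zero because the
    # estimation errors of the two counters can make the result negative.
    added = client.pfcount(ADDED_KEY)
    deleted = client.pfcount(DELETED_KEY)
    return max(added - deleted, 0)
```

This only works if a message ID is never added again after being deleted, since the "added" counter would not notice the second insertion.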
I am looking to have a set which will store elements and that I can get the cardinality after. I noticed I could use the commands SADD or PFADD then use SCARD or PFCOUNT. What is the difference between these two? What are the advantages/disadvantages?
When using SADD, you store data in a SET.
When using PFADD, you store data in a HyperLogLog, which is a different kind of data structure.
A Set is used to store unique values when you need to access those values again.
A HyperLogLog gives you an approximate count of the number of unique values added with PFADD. It is useful when you have a large number of distinct values and don't need to get them back. For example, it can be used to count the unique visitors to a given page on a given day on a high-traffic website (you just add the unique visitor IDs to the HLL).
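To make the difference concrete, here is a small sketch of the unique-visitor example that writes to both structures side by side. In practice you would pick one; the key layout and function names are made up for illustration, and `client` is assumed to be a redis-py style client.

```python
def record_visit(client, page, day, visitor_id):
    # Exact: the Set stores every visitor ID, so memory grows with each one.
    client.sadd(f"visitors:exact:{page}:{day}", visitor_id)
    # Approximate: the HyperLogLog keeps only a small fixed-size sketch.
    client.pfadd(f"visitors:approx:{page}:{day}", visitor_id)


def unique_visitors(client, page, day):
    # SCARD is exact; PFCOUNT is an estimate of the same quantity.
    exact = client.scard(f"visitors:exact:{page}:{day}")
    approx = client.pfcount(f"visitors:approx:{page}:{day}")
    return exact, approx
```

With the Set you could also list the visitors later (for example with SMEMBERS); with the HyperLogLog only the count is available.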
SADD and SCARD are for "Set".
PFADD and PFCOUNT are for "HyperLogLog".
Advantage of "HyperLogLog":
"HyperLogLog" takes much less memory than "Set".
The video below explains "HyperLogLog" in about 5 minutes.
https://youtu.be/UAL2dxl1fsE
I'm using Redis implementation of HyperLogLog to count distinct values for given keys.
The keys are based on an hourly window. After the calendar hour changes, I want to reset the count of incoming values. I don't see any direct API for clearing the values through Jedis.
SET cannot be used here because it would corrupt the hash. Is there a way to correctly "reset" the count for a given key?
Use the DEL command to delete the key, which will effectively reset the count.
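A minimal sketch of the reset, written in Python with a redis-py style `client` (the function names are hypothetical). Jedis exposes the same command as `del`, so the idea carries over directly.

```python
def add_value(client, key, value):
    # PFADD creates the HyperLogLog on first use if the key does not exist.
    client.pfadd(key, value)


def reset_count(client, key):
    # DEL removes the key entirely; the next PFADD starts from an empty HLL.
    client.delete(key)


def current_count(client, key):
    # PFCOUNT on a missing key returns 0.
    return client.pfcount(key)
```

Calling `reset_count` when the calendar hour changes gives a fresh counter for the new hour.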
I have a requirement to process multiple records from a queue, but due to some external issues the items may sporadically occur multiple times.
I need to process each item only once.
My plan is to PFADD every record (as an md5sum) into Redis and then check the return value. If it shows no increment, the record is a duplicate; otherwise I process the record.
This seems pretty straightforward, but I am getting too many false positives while using PFADD.
Is there a better way to do this?
Being the probabilistic data structure that it is, Redis' HyperLogLog exhibits a standard error of 0.81%. You can reduce (but never get rid of) the probability of false positives by using multiple HLLs, each counting the value of a different hash function on your record.
Also note that if you're using a single HLL there's no real need to hash the record - just PFADD as is.
Alternatively, use a Redis Set to keep all the identifiers/hashes/records and have 100%-accurate membership tests with SISMEMBER. This approach requires more (RAM) resources as you're storing each processed element, but unless your queue is really huge that shouldn't be a problem for a modest Redis instance. To keep memory consumption under control, switch between Sets according to the date and set an expiry on the Set keys (another approach is to use a single Sorted Set and manually remove old items from it by keeping their timestamp in the score).
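The Set-based approach could look roughly like this. The key naming, the two-day retention, and the function name are assumptions made for the sketch; `client` is assumed to be a redis-py style client. It relies on SADD returning 1 when the member is new and 0 when it was already present, which combines the membership test and the insert in one command.

```python
from datetime import date

SECONDS_PER_DAY = 86400


def is_first_time(client, record_id, today=None):
    """Return True if record_id has not been seen in today's Set."""
    today = today or date.today()
    key = f"processed:{today.isoformat()}"  # one Set per calendar day
    added = client.sadd(key, record_id)     # 1 if new, 0 if already present
    # Let old Sets disappear on their own to keep memory bounded.
    client.expire(key, 2 * SECONDS_PER_DAY)
    return added == 1
```

A duplicate that arrives just after midnight lands in a new Set and would not be caught by this sketch; checking the previous day's Set with SISMEMBER as well would close that gap.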
In general, in distributed systems you have to choose between processing items either:
at most once
at least once
Processing something exactly once would be convenient; however, this is generally impossible.
That being said there could be acceptable workarounds for your specific use case, and as you suggest storing the items already processed could be an acceptable solution.
Be aware though that PFADD uses HyperLogLog, which is fast and scales but is approximate about the count of the items, so in this case I do not think this is what you want.
However if you are fine with having a small probability of errors, the most appropriate data structure here would be a Bloom filter (as described here for Redis), which can be implemented in a very memory-efficient way.
A simple, efficient, and recommended solution would be to use a simple Redis key (for instance a hash) storing a boolean-like value ("0"/"1" or "true"/"false"), written for instance with HSET, or with SET and its NX option. You could also put it under a namespace if you wish. It has the added benefit that keys can be given an expiry.
This lets you avoid using a Set (the data type behind commands such as SINTER and SUNION, not the SET command), which doesn't necessarily work well with Redis Cluster if you want to scale to more than one node. SISMEMBER is still fine though (but individual set members cannot be given a time to live, unlike separate keys).
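The SET with NX idea can be sketched as follows. The `processed:` namespace, the default TTL, and the function name are assumptions for illustration; `client` is assumed to be a redis-py style client, whose `set` returns a truthy value when the key was written and `None` when NX prevented the write.

```python
def claim_record(client, record_id, ttl_seconds=86400):
    """Return True only for the first caller that sees this record_id."""
    key = f"processed:{record_id}"  # hypothetical namespace
    # SET key "1" NX EX ttl: writes only if the key does not exist yet,
    # and attaches an expiry so old markers are cleaned up automatically.
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))
```

A consumer would call `claim_record` for each item and process it only when the call returns True. Because the check and the write happen in a single command, two consumers racing on the same record cannot both win.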
If you use a hash, I would also advise you to pick a hash function that has a lower chance of collisions than md5 (a collision means that two different objects end up with the same hash).
An alternative approach to the hash would be to assign a UUID to every item when putting it in the queue (or a squuid if you want to have some time information).
I'm trying to move data from a SQL DB to Redis in order to gain much higher throughput, because the workload is very high-throughput. I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender.
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each getting 20 rows.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates - I would use Hashes.
But because of the 10% of Reads I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record as a blob associated with a key (i.e. the string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash datatype). All write operations can be done in one roundtrip (even if they are partial). Several read operations can also be done in one roundtrip provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands consumes a bit more CPU than using MGET, but not much more. In terms of latency, it should not be significantly different, unless you execute hundreds of thousands of them per second on the Redis instance.
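Both options might be sketched like this. JSON as the blob encoding, the key prefixes, and the function names are assumptions made for the example; `client` is assumed to be a redis-py style client.

```python
import json


# Option 1: whole-record writes, stored as a string blob.
def write_user_blob(client, user_id, user):
    # One roundtrip: the full record is overwritten each time.
    client.set(f"user:{user_id}", json.dumps(user))


def read_users_blob(client, user_ids):
    # One roundtrip for all rows via MGET; missing keys come back as None.
    raw = client.mget([f"user:{uid}" for uid in user_ids])
    return [json.loads(v) if v is not None else None for v in raw]


# Option 2: partial updates, stored as a hash.
def update_user_fields(client, user_id, fields):
    # One roundtrip, touching only the fields that changed.
    client.hset(f"userhash:{user_id}", mapping=fields)


def read_users_hash(client, user_ids):
    # Pipelining queues the HGETALL commands and sends them together.
    pipe = client.pipeline(transaction=False)
    for uid in user_ids:
        pipe.hgetall(f"userhash:{uid}")
    return pipe.execute()
```

For the workload in the question, a read of 20 rows is a single MGET in the first option or a single pipelined batch of 20 HGETALL commands in the second.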
I am using Jedis, a Redis Java client. I have a queue of string items. As usual, I am using LPUSH, LPOP, RPUSH and RPOP for the necessary operations. But I would like to set an expiry for each individual item in the queue. Is it possible?
This is not possible in Redis, by design, for the sake of keeping Redis simple and fast.
You can either store an expire value along with the string in the list, or store a separate list of expire times to let your application know if the item has expired.
There is also an alternative solution discussed here. You can store values in a sorted set with expire timestamps as scores and select only those members whose scores are greater than a certain timestamp. (This of course leaves it up to your app to clear the expired elements from the set.)
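The sorted-set variant could be sketched as below, in Python rather than Jedis. The function names are hypothetical and `client` is assumed to be a redis-py style client. Note that a sorted set orders members by score (here, by expiry time) and stores each member only once, so this is not a drop-in replacement for a FIFO list.

```python
import time


def push_with_expiry(client, queue, item, ttl_seconds, now=None):
    # The score is the absolute time at which the item stops being valid.
    now = time.time() if now is None else now
    client.zadd(queue, {item: now + ttl_seconds})


def live_items(client, queue, now=None):
    # Select members whose expiry timestamp is strictly greater than now;
    # the "(" prefix makes the lower bound exclusive.
    now = time.time() if now is None else now
    return client.zrangebyscore(queue, f"({now}", "+inf")


def purge_expired(client, queue, now=None):
    # The application has to clear expired members itself.
    now = time.time() if now is None else now
    return client.zremrangebyscore(queue, "-inf", now)
```

Running `purge_expired` periodically (or before each read) keeps the sorted set from growing without bound.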