What is the difference between SADD vs PFADD? - redis

I am looking for a set that will store elements and let me get its cardinality afterwards. I noticed I could use the commands SADD or PFADD and then use SCARD or PFCOUNT. What is the difference between these two? What are the advantages/disadvantages?

When using SADD, you store data in a SET.
When using PFADD, you store data in a HyperLogLog, which is a different kind of data structure.
A SET is used to store unique values when you need to access those values again.
A HyperLogLog gives you an approximate count of the unique values added with PFADD. It is useful when you have a large number of distinct values and don't need to get them back. For example, it can be used to count the unique visitors to a given page on a given day on a high-traffic website (you just add the unique visitor IDs to the HLL).

SADD and SCARD are for "Set".
PFADD and PFCOUNT are for "HyperLogLog".
Advantage of "HyperLogLog":
"HyperLogLog" takes much less memory than "Set".
The video below explains HyperLogLog in about 5 minutes.
https://youtu.be/UAL2dxl1fsE

Related

How to store unique visits in Redis

I want to know how many people visited each blog page. For that, I have a column in the Blogs table (MS SQL DB) to keep the total visits count. But I also want the visits to be as unique as possible.
So I keep the user's unique Id and blog Id in the Redis cache, and every time a user visits a page, I check if she has visited this page before, if not, I will increase the total visit count.
My question is, what is the best way of storing such data?
Currently, I create a key like this "project-visit-{blogId}-{userId}" and use StringSetAsync and StringGetAsync. But I don't know if this method is efficient or not.
Any ideas?
If you can sacrifice some precision, the HyperLogLog (HLL) probabilistic data structure is a great solution for counting unique visits because:
It only uses 12K of memory, and that size is fixed: it doesn't grow with the number of unique visits
You don't need to store user data, which makes your service more privacy-oriented
The HyperLogLog algorithm is really smart, but you don't need to understand its inner workings in order to use it: some years ago Redis added it as a data structure. So all you, as a user, need to know is that with HyperLogLogs you can count unique elements (visits) in a fixed memory space of 12K, with a standard error of 0.81%.
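The 12K and 0.81% figures can be sanity-checked with a little arithmetic. This is only a rough sketch: it assumes the dense layout described in the Redis release post (16384 registers of 6 bits each) and the usual HyperLogLog error estimate of 1.04 / sqrt(m). The function names are made up for illustration.

```python
import math

REGISTERS = 16384       # assumed: number of registers in Redis' dense HLL encoding
BITS_PER_REGISTER = 6   # assumed: bits stored per register


def hll_dense_bytes(registers: int = REGISTERS, bits: int = BITS_PER_REGISTER) -> int:
    """Size of the register array in bytes (ignores the small header)."""
    return registers * bits // 8


def hll_standard_error(registers: int = REGISTERS) -> float:
    """Standard error of the estimate, using the 1.04 / sqrt(m) rule."""
    return 1.04 / math.sqrt(registers)


print(hll_dense_bytes())               # 12288 bytes, i.e. 12K
print(hll_standard_error() * 100)      # roughly 0.81 (percent)
```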
Let's say you want to keep a count of unique visits per day; you would have to have one HyperLogLog per day, named something like cnt:page-name:20200917 and every time a user visits a page you would add them to the HLL:
> PFADD cnt:page-name:20200917 {userID}
If you add the same user multiple times, they will still only be counted once.
To get the count you run:
> PFCOUNT cnt:page-name:20200917
You can change the granularity of unique users by having different HLLs for different time intervals, for example cnt:page-name:202009 for the month of September, 2020.
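As a small sketch of that naming scheme, the helper below builds the per-day and per-month key names from a date. The function name `hll_keys` and the returned dict layout are assumptions for illustration; only the `cnt:page-name:YYYYMMDD` pattern comes from the answer.

```python
from datetime import date


def hll_keys(page: str, day: date) -> dict:
    """Build HyperLogLog key names for the day and the month of a visit."""
    return {
        "day": f"cnt:{page}:{day:%Y%m%d}",
        "month": f"cnt:{page}:{day:%Y%m}",
    }


keys = hll_keys("page-name", date(2020, 9, 17))
print(keys["day"])    # cnt:page-name:20200917
print(keys["month"])  # cnt:page-name:202009
```

On each visit you would then PFADD the user ID to every key you want to maintain. Another option is to keep only the daily HLLs and combine them with PFMERGE when a coarser count is needed.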
This quick explainer lays it out pretty well: https://www.youtube.com/watch?v=UAL2dxl1fsE
This blog post might help too: https://redislabs.com/redis-best-practices/counting/hyperloglog/
And if you're curious about the internal implementation Antirez's release post is a great read: http://antirez.com/news/75
NOTE: With this solution you lose the information about which users visited the page; you only have the count.
Your solution is not atomic, unless you wrap the get and set operation in a transaction or Lua script.
A better solution is to save project-visit-{blogId}-{userId} into a Redis set. When you get a visit, call SADD to add an item to the set. Redis adds the item only if the user has not visited this page before. To get the total count, just call SCARD to get the size of the set.
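The point of SADD here is that the "have I seen this user?" check and the insert happen in one step. The sketch below is a pure-Python model of that behaviour, not a Redis client. It assumes one set per blog (a hypothetical key such as `project-visit-42`) with user IDs as members, which is one reading of the suggestion above.

```python
from collections import defaultdict


class VisitSets:
    """In-memory stand-in for Redis sets, modelling SADD and SCARD only."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key: str, member: str) -> int:
        """Like SADD with one member: return 1 if it was new, 0 otherwise."""
        if member in self._sets[key]:
            return 0
        self._sets[key].add(member)
        return 1

    def scard(self, key: str) -> int:
        """Like SCARD: number of members, 0 for a missing key."""
        return len(self._sets.get(key, ()))


visits = VisitSets()
key = "project-visit-42"            # hypothetical: one set per blog id
print(visits.sadd(key, "user-1"))   # 1 (first visit)
print(visits.sadd(key, "user-1"))   # 0 (repeat visit, not counted again)
print(visits.sadd(key, "user-2"))   # 1
print(visits.scard(key))            # 2
```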
Regardless of the back-end technology (programming language etc.), you can use Redis Streams. Streams are a new feature in Redis 5 that lets you define publishers and subscribers on a topic (stream) created in Redis. Then, on each user visit, you append a new record (asynchronously, of course) to this stream. You can hold whatever info you want in that record (user IP, ID, etc.).
Defining a key for each unique visit is not a good idea at all, because:
It makes life harder for Redis GC
Performance for this use case is not comparable to Streams, especially if you use that Redis instance for other purposes
Constantly collecting these unique visits and processing them is not efficient: you always have to scan through all the keys
Conclusion:
If you want to use Redis, go with Redis Stream. If Redis can be changed, go with Kafka for sure (or a similar technology).

String vs Hash for string type? Hash will have only one key instead of many

For example, I see many people are doing something like the following:
> set data:1000 "some string 1"
> set data:1001 "some string 2"
But what about using a hash to minimize the number of keys?
> hset data 1000 "some string 1"
> hset data 1001 "some string 2"
In the second way, it will only create one data key instead of creating many keys in the first way.
Which way is recommended?
I see that some people and tutorials do hset data:10 01 xxx. That is not related to my question. My question is simply which is recommended: set data:1001 xxx or hset data 1001 xxx.
And I don't plan to modify hash-max-zipmap-entries and hash-max-zipmap-value. That means the hash will eventually exceed the two values. In such a config, are the two ways the same? Or which way is recommended?
Reasons to use strings:
you need per value timeouts
the values are semantically isolated
you're on cluster and want the values to be sharded over different nodes to spread load (sharding is based on the key)
Reasons to use hashes:
you want to be able to purge all of them at once (del/unlink), or have a timeout that impacts all of those values at once
you want to be able to enumerate them (prefer hscan/hgetall over scan/keys)
slightly better memory usage on the keys themselves
the values are semantically related
it is OK for all the values to be on the same node (whether single-server or cluster)
This all depends on the tradeoffs you want to support. In general, using hashes will have a smaller memory footprint than using simple keys. In fact, it can be about an order of magnitude less memory, as long as the hashes stay small enough to use the compact encoding. And access to hash values is constant time. So, if you are using Redis simply as a key-value store, then hashes are way more efficient than simple keys.
However, if you need to support expiration, keyspace notifications, etc., then you will need to use simple keys.
Just be careful to tweak the values for hash-max-zipmap-entries and hash-max-zipmap-value in your redis.conf in order to ensure that hashes are treated correctly for your environment.
You can read all about the details in the memory optimization section of the documentation.

Out of Process in memory database table that supports queries for high speed caching

I have a SQL table that is accessed continually but changes very rarely.
The Table is partitioned by UserID and each user has many records in the table.
I want to save database resources and move this table closer to the application in some kind of memory cache.
In process caching is too memory intensive so it needs to be external to the application.
Key Value stores like Redis are proving inefficient due to the overhead of serializing and deserializing the table to and from Redis.
I am looking for something that can store this table (or partitions of data) in memory, but let me query only the information I need without serializing and deserializing large blocks of data for each read.
Is there anything that would provide Out of Process in memory database table that supports queries for high speed caching?
Searching has shown that Apache Ignite might be a possible option, but I am looking for more informed suggestions.
Since it's out-of-process, it has to do serialization and deserialization. Your real concern is how to reduce the serialization/deserialization work. If you use Redis' STRING type, you CANNOT reduce this work.
However, you can use a HASH to solve the problem by mapping your SQL table to a HASH.
Suppose you have the following table: person: id(varchar), name(varchar), age(int). You can take the person id as the key, and name and age as fields. When you want to look up someone's name, you only need to get the name field (HGET person-id name); the other fields won't be deserialized.
Ignite is indeed a possible solution for you since you may optimize serialization/deserialization overhead by using internal binary representation for accessing objects' fields. You may refer to this documentation page for more information: https://apacheignite.readme.io/docs/binary-marshaller
Access overhead may also be reduced by disabling the copy-on-read option: https://apacheignite.readme.io/docs/performance-tips#section-do-not-copy-value-on-read
Data collocation by user id is also possible with Ignite: https://apacheignite.readme.io/docs/affinity-collocation
As @for_stack said, a Hash will be very suitable for your case.
You said that each user has many rows in the DB, indexed by user_id and tag_id. So (user_id, tag_id) uniquely identifies one row. Every row is functionally dependent on this tuple, so you could use the tuple as the HASH key.
For example, if you want to save the row (user_id, tag_id, username, age) with the values ("123456", "FDSA", "gsz", 20) into Redis, you could do this:
HMSET 123456:FDSA username "gsz" age 20
When you want to query the username by user_id and tag_id, you could do this:
HGET 123456:FDSA username
So every Hash key will be a combination of user_id and tag_id. If you want the key to be more human-readable, you could add a prefix string such as "USERINFO", e.g. USERINFO:123456:FDSA.
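The row-to-hash mapping described above can be sketched as a small helper that splits a row into a key (built from the primary-key columns) and the remaining fields. The function name `row_to_hash` and its parameters are hypothetical; only the key layout and the example values come from the answer.

```python
def row_to_hash(row: dict, key_columns: tuple, prefix: str = "USERINFO"):
    """Split a table row into (hash key, hash fields).

    The key is the prefix plus the primary-key values joined by ':';
    every other column becomes a field of the hash.
    """
    key = ":".join([prefix, *(str(row[col]) for col in key_columns)])
    fields = {col: value for col, value in row.items() if col not in key_columns}
    return key, fields


row = {"user_id": "123456", "tag_id": "FDSA", "username": "gsz", "age": 20}
key, fields = row_to_hash(row, ("user_id", "tag_id"))
print(key)     # USERINFO:123456:FDSA
print(fields)  # {'username': 'gsz', 'age': 20}
```

With a client library such as redis-py, the pair could then be written with something like `r.hset(key, mapping=fields)`, and a single column read back with `r.hget(key, "username")`.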
BUT if you want to query with only a user_id and get all rows for that user_id, the method above will not be enough.
You could build secondary indexes in Redis for your HASHes.
As said above, we use user_id:tag_id as the HASH key because it uniquely identifies one row. Suppose we want to query all the rows for one user_id.
We could use a sorted set to build a secondary index recording which Hashes store the info about this user_id.
We could add this to the sorted set:
ZADD user_index 0 123456:FDSA
As above, we set the member to the HASH key string and set the score to 0. The rule is that all scores in this zset must be 0, so that we can use lexicographical order to do range queries. See ZRANGEBYLEX.
E.g., to get all the rows for user_id 123456:
ZRANGEBYLEX user_index [123456 (123457
It will return all the HASH keys whose prefix is 123456; we then use each string as a HASH key with HGET or HMGET to retrieve the information we want.
[ means inclusive, and ( means exclusive. Why do we use 123457? Because every key starting with 123456 sorts before 123457. So when we want to get all rows for a user_id, we should build the upper bound by adding 1 to the ASCII value of the last char of the user_id string.
For more about lexicographical indexes, you could refer to the article I mentioned above.
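To make the bound calculation concrete, here is a pure-Python sketch that computes the ZRANGEBYLEX arguments for a user_id and runs them against a tiny in-memory model of the index. It is not a Redis client; both function names are hypothetical. One deliberate difference from the command above: the sketch includes the ":" separator in the bounds, which is an assumption added here so that IDs of different lengths cannot collide.

```python
import bisect


def lex_range_for_user(user_id: str, sep: str = ":"):
    """Return (min, max) ZRANGEBYLEX arguments matching 'user_id<sep>...'."""
    prefix = user_id + sep
    # Upper bound: same prefix with its last character bumped by one.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return "[" + prefix, "(" + upper


def zrangebylex(sorted_members, lo, hi):
    """Model of ZRANGEBYLEX on a sorted list; supports '[' and '(' bounds only.

    Python compares str by code point, which matches byte order for ASCII.
    """
    find_lo = bisect.bisect_left if lo[0] == "[" else bisect.bisect_right
    find_hi = bisect.bisect_right if hi[0] == "[" else bisect.bisect_left
    return sorted_members[find_lo(sorted_members, lo[1:]):find_hi(sorted_members, hi[1:])]


index = sorted(["123456:ASDF", "123456:FDSA", "1234567:QWER", "123457:ZXCV"])
lo, hi = lex_range_for_user("123456")
print(lo, hi)                       # [123456: (123456;
print(zrangebylex(index, lo, hi))   # ['123456:ASDF', '123456:FDSA']
```

In this model, the bare bounds `[123456` and `(123457` would also return `1234567:QWER`, because that member sorts between them. If all user IDs have the same length this cannot happen, and the shorter bounds are fine.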
You can try Apache Mnemonic, started by Intel. Link: http://incubator.apache.org/projects/mnemonic.html. It supports serdeless features.
For a read-dominant workload, the MySQL MEMORY engine should work fine (write DMLs lock the whole table). This way you don't need to change your data retrieval logic.
Alternatively, if you're okay with changing the data retrieval logic, then Redis is also an option. To add to what @GuangshengZuo has described, there's ReJSON, a dynamically loadable Redis module (for Redis 4+) which implements a document store on top of Redis. It can further relax the requirements for marshalling big structures back and forth over the network.
With just 6 principles (which I collected here), it is very easy for a SQL-minded person to adapt to the Redis approach. Briefly, they are:
The most important thing is: don't be afraid to generate lots of key-value pairs. So feel free to store each row of the table in a different key.
Use Redis' hash map data type
Form key name from primary key values of the table by a separator (such as ":")
Store the remaining fields as a hash
When you want to query a single row, directly form the key and retrieve its results
When you want to query a range, use the wildcard "*" in your key pattern. But please be aware that scanning keys blocks other Redis operations, so use this method only if you really have to.
The link just gives a simple table example and how to model it in Redis. Following those 6 principles you can continue to think like you do for normal tables. (Of course without some not-so-relevant concepts as CRUD, constraints, relations, etc.)
Using a combination of Memcached and Redis on top of MySQL comes to mind.

Want to use Redis as an events statistics store

I am really interested in Redis. I have an idea and wanted to know whether it is a suitable use case or, if it is not, whether there are other suggestions for a data store. Any tips on storing the data would also be appreciated.
My idea is just a simple event system: an event happens and it is stored in Redis as follows
Key | Value
[unixtimestamp]:[system]:[event] | [result]
The data could be anything: sales, impressions, errors, API response times, page load times, any real-time analytics. I then want to be able to make graphs based on that data.
This isn't an ideal design because it won't support your read pattern effectively and it will probably be wasteful in terms of RAM if your [result] is short/small. Instead, look into using Redis' sorted sets with the timestamp as score, in the following fashion:
ZADD [system]:[event] [timestamp] [result]
Note that set members have to be unique so if [result]'s cardinality is low, make it unique by concatenating the timestamp to it (and filtering it out when you graph), i.e.:
ZADD [system]:[event] [timestamp] [result]:[timestamp]
This way you'll be able to fetch ranges of measurements by calling ZRANGEBYSCORE and graphing the results.
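The uniqueness trick can be illustrated with a small pure-Python model in which the sorted set is a dict from member to score. This is a sketch, not a Redis client; the helper names and the sample measurements are invented for illustration.

```python
def make_member(result: str, timestamp: int) -> str:
    """Append the timestamp so that equal results stay distinct members."""
    return f"{result}:{timestamp}"


def parse_member(member: str) -> str:
    """Strip the timestamp suffix again before graphing."""
    result, _, _timestamp = member.rpartition(":")
    return result


def zrangebyscore(zset: dict, lo: int, hi: int) -> list:
    """Model of ZRANGEBYSCORE: members with lo <= score <= hi, ordered by score."""
    ordered = sorted(zset.items(), key=lambda item: (item[1], item[0]))
    return [member for member, score in ordered if lo <= score <= hi]


# Model of the sorted set "[system]:[event]" as {member: score}.
zset = {}
samples = [(1600300800, "200"), (1600300860, "200"), (1600300920, "500")]
for ts, result in samples:
    zset[make_member(result, ts)] = ts   # like: ZADD key ts result:ts

print(len(zset))  # 3 (the two "200" results did not overwrite each other)
points = [(zset[m], parse_member(m)) for m in zrangebyscore(zset, 1600300800, 1600300900)]
print(points)     # [(1600300800, '200'), (1600300860, '200')]
```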

Redis suggestion for selecting data type

We have a question-based site where the home page shows 2 lists:
Questions by date modified
Questions with the most views and answer count. In both lists, if questions have the same views or answer count, sorting is based on date.
Previously I queried the MySQL database directly and fetched the values, so it was easy.
But having each page request hit MySQL is a bit expensive, so I started caching.
I started using Redis. The following are the cases where I use the Redis cache
The issue is that in the second list I have to display questions by votes and unanswered status combined.
How can I store this type of data in Redis so that it loads faster, sorted by 2 conditions: votes with time, and answer count with time?
You can use sorted sets in Redis. Your view or answer count can be the score. Create the member based on the timestamp. The sorted set command ZREVRANGEBYSCORE will give you the correct order.
You can set the member of the sorted set as:
'YEAR_MONTH_DATE_HOUR_MINUTE_SECONDS:question_id'
This way, questions with the same score will be returned in lexicographical order, so a question which came later will be placed higher if you use ZREVRANGEBYSCORE.
You can create a hash map from timestamp to question_id for faster lookup.
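The tie-breaking behaviour can be checked with a pure-Python model of the ordering: highest score first, and members with equal scores in reverse lexicographical order. This is a sketch under the assumption that every timestamp field is zero-padded (otherwise string order and time order diverge). The helper names and sample data are invented.

```python
from datetime import datetime


def make_member(modified: datetime, question_id: int) -> str:
    """Build 'YEAR_MONTH_DATE_HOUR_MINUTE_SECONDS:question_id', zero-padded."""
    return f"{modified:%Y_%m_%d_%H_%M_%S}:{question_id}"


def zrevrangebyscore(zset: dict) -> list:
    """Model of ZREVRANGEBYSCORE over the whole set.

    Sorts by score descending; equal scores fall back to reverse
    lexicographical member order, so later timestamps come first.
    """
    return sorted(zset, key=lambda member: (zset[member], member), reverse=True)


# {member: score}, where the score is the view or answer count.
zset = {
    make_member(datetime(2020, 9, 16, 8, 0, 0), 1): 10,
    make_member(datetime(2020, 9, 17, 9, 30, 0), 2): 10,
    make_member(datetime(2020, 9, 15, 12, 0, 0), 3): 25,
}
order = [member.rsplit(":", 1)[1] for member in zrevrangebyscore(zset)]
print(order)  # ['3', '2', '1']
```

Question 3 comes first because of its higher score; questions 2 and 1 share a score, so the more recently modified one (2) is listed first.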
I asked a similar question, where I also proposed a solution. I wanted something different, but it will do exactly what you want.
Redis zrevrangebyscore, sorting other than lexicographical order