How to store unique visits in Redis

I want to know how many people visited each blog page. For that, I have a column in the Blogs table (MS SQL DB) to keep the total visit count. But I also want the visits to be as unique as possible.
So I keep the user's unique Id and the blog Id in the Redis cache, and every time a user visits a page, I check whether she has visited this page before; if not, I increase the total visit count.
My question is, what is the best way of storing such data?
Currently, I create a key like "project-visit-{blogId}-{userId}" and use StringSetAsync and StringGetAsync. But I don't know whether this method is efficient.
Any ideas?

If you can sacrifice some precision, the HyperLogLog (HLL) probabilistic data structure is a great solution for counting unique visits because:
It uses only 12 KB of memory, and that is fixed; it doesn't grow with the number of unique visits
You don't need to store user data, which makes your service more privacy-oriented
The HyperLogLog algorithm is really smart, but you don't need to understand its inner workings in order to use it: Redis added it as a data structure some years ago. So all you, as a user, need to know is that with HyperLogLogs you can count unique elements (visits) in a fixed memory space of 12 KB, with a 0.81% margin of error.
Let's say you want to keep a count of unique visits per day; you would have to have one HyperLogLog per day, named something like cnt:page-name:20200917 and every time a user visits a page you would add them to the HLL:
> PFADD cnt:page-name:20200917 {userID}
If you add the same user multiple times, they will still only be counted once.
To get the count you run:
> PFCOUNT cnt:page-name:20200917
You can change the granularity of unique users by keeping different HLLs for different time intervals, for example cnt:page-name:202009 for September 2020.
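For illustration, a minimal redis-py sketch of the same idea (following the cnt:page-name:yyyymmdd naming above; the helper names are just placeholders):

import datetime
import redis

r = redis.Redis()

def record_visit(page, user_id):
    # One HLL per page per day; PFADD is idempotent per element,
    # so repeat visits by the same user do not change the count.
    key = f"cnt:{page}:{datetime.date.today():%Y%m%d}"
    r.pfadd(key, user_id)

def unique_visits(page, day):
    # day is a string like "20200917"
    return r.pfcount(f"cnt:{page}:{day}")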
This quick explainer lays it out pretty well: https://www.youtube.com/watch?v=UAL2dxl1fsE
This blog post might help too: https://redislabs.com/redis-best-practices/counting/hyperloglog/
And if you're curious about the internal implementation Antirez's release post is a great read: http://antirez.com/news/75
NOTE: with this solution you lose the information of which user visited the page; you only have the count

Your solution is not atomic unless you wrap the get and set operations in a transaction or a Lua script.
A better solution is to save project-visit-{blogId}-{userId} into a Redis set. When you get a visit, call SADD to add the item to the set; Redis adds it only if the user has not visited this page before. If you want the total count, just call SCARD to get the size of the set.
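As a sketch in redis-py (one reasonable structuring: a set per blog, with the user id as the member, so SCARD gives that blog's unique-visitor count; increment_total_visits is a hypothetical wrapper around the SQL update):

import redis

r = redis.Redis()

def record_visit(blog_id, user_id):
    # SADD returns the number of members actually added:
    # 1 -> first visit by this user, 0 -> already seen.
    if r.sadd(f"project-visit-{blog_id}", user_id) == 1:
        increment_total_visits(blog_id)  # hypothetical MS SQL update

def unique_visit_count(blog_id):
    return r.scard(f"project-visit-{blog_id}")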

Regardless of the back-end technology (programming language, etc.), you can use a Redis stream. Streams were introduced in Redis 5 and let you define publishers and subscribers on a topic (stream) created in Redis. On each user visit, you then append a new record (asynchronously, of course) to this stream. You can hold whatever info you want in that record (user IP, id, etc.).
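For example, the producer side could be as simple as this redis-py sketch (the stream name "visits" and the record fields are placeholders):

import redis

r = redis.Redis()

def publish_visit(user_id, ip):
    # XADD appends a record to the stream; a consumer reads it later
    # (e.g. with XREAD or a consumer group) and aggregates the visits.
    r.xadd("visits", {"user_id": user_id, "ip": ip})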
Defining a key for each unique visit is not a good idea at all, because:
It makes life harder for the Redis GC
Performance, for this use case, is not comparable to a stream's, especially if you use that Redis instance for other purposes
Constantly collecting these unique visits and processing them is not efficient; you always have to scan through all the keys
Conclusion:
If you want to use Redis, go with a Redis stream. If Redis can be swapped out, go with Kafka for sure (or a similar technology).

Related

Best way of storing an array in an SQL database?

For an Android launcher (home screen) app project, I want to implement a feature called "Sort by usage". This will sort apps by their launch count within a user-settable timeframe.
The current idea for the implementation is to store an array of unix epoch timestamps, one for each launch.
Additionally, it'll store a counter caching the current number of launches within the selected timeframe, incremented with every launch. Of course, this would regularly have to be rebuilt as time passes, but only every few hours, or at least every x percent of the selected timeframe, so computations definitely wouldn't run as often as without the counter, since this information is required every time any app entries on screen need to be sorted. But I'm not quite sure whether it matters in any way during actual use.
I am now unsure how to store the timestamp array inside the SQL database. As there is a table holding one record with information about each launcher entry, I thought about the following options:
Store the array of unix epochs in serialized form (maybe a JSON array) in one field of the entry's record
Create a separate table for launch times with
a. each record starting with an id associated with an entry, followed by all launch times, one per field
b. each record being a combination of entry id and one launch time
These options would obviously have the advantage of storing the timestamps using an appropriate type.
I probably didn't quite understand why you need a second piece of data for your launch counter: the fact that you saved a timestamp already means a launch, so why not just count timestamps? Less updating, less record locking, more concurrency.
Now, let's say you've got a separate table with timestamps in a classic one to many setting.
Pros of this setup: you never need to update anything; just keep inserting. You can easily cluster your table by timestamp, filter on your timeframe, and issue a GROUP BY to count rows. The client then gets the numbers and sorts by count (I believe it's generally better not to sort in SQL). Cons: you need a join to the parent table and probably need to get your indexes right.
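To make that concrete, a small sqlite3 sketch of the one-to-many setup (table and column names are made up; Android's SQLite accepts the same schema and query):

import sqlite3
import time

conn = sqlite3.connect("launcher.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS launches (
    entry_id    INTEGER NOT NULL REFERENCES entries(id),
    launched_at INTEGER NOT NULL  -- unix epoch seconds
);
CREATE INDEX IF NOT EXISTS idx_launches ON launches (entry_id, launched_at);
""")

def record_launch(entry_id):
    # Inserts only; nothing is ever updated.
    with conn:
        conn.execute("INSERT INTO launches VALUES (?, ?)",
                     (entry_id, int(time.time())))

def launch_counts(since_epoch):
    # One query returns the count per entry within the timeframe;
    # the client can then sort by count.
    return conn.execute(
        "SELECT entry_id, COUNT(*) FROM launches"
        " WHERE launched_at >= ? GROUP BY entry_id",
        (since_epoch,)).fetchall()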
Alternatively, you store the timestamps in a text blob (JSON, CSV, whatever) with your main records. This definitely means you'll have to update your records a lot, which potentially opens you up to locking issues. And I'm not entirely sure what you'd have to do to get your final launch counts: read all entities, deserialise all timestamps, filter by timeframe and then count? It does feel a bit more convoluted in your case.
I don't think there's such a thing as a "best" way; you have to weigh the pros and cons. From what I gather, you might be better off with the classic SQL approach, unless there's something I didn't catch that outweighs the points above.

Redis write back cache still a manual task?

I am working on an assignment. The REST API (developed in Spring) has a method m() which simulates the cleaning of windows by a person. Towards the end, the cleaner has to write a unique phrase (a string) on the window. Phrases written by all cleaners are eventually saved in the MySQL DB, so each time m() is executed, a query is made to the DB to fetch all phrases written so far today. The cleaner method m() then generates a random string as a phrase, checks it against the queried phrases to make sure it's unique, and writes it to the DB. So there is one query per m() to fetch all phrases and one to write the phrase; both happen on the same table.
This is a scenario that can take advantage of caching, and I went with Redis. I also think a write-back cache is the best solution: every write goes to the cache instead of the DB, and every read comes from the cache as well. The cache can be copied to the DB in a new thread every hour (or something configurable). I was reading Can Redis write out to a database like PostgreSQL? and it seems that some years back you had to do this manually.
My questions:
Is doing this manually still the way to go? If not, can someone point me to a Redis resource I can make use of?
If manual is the way to go, this is how I plan to implement it. Is it ideal?
Phrases written each hour will be appended to a list of (userid, phrase) objects in Redis; the list for midnight to 1 am will be called phrases_1, the one for 1 to 2 am phrases_2, and so on. Each hour, a background thread will write the entire hour's list to the DB. Every time all phrases need to be fetched for the uniqueness check, I will load all the day's lists from the cache (phrases_1, phrases_2, ...) in a loop and consolidate them. (Later, when the number of users grows, I will have to shard, but that is not my immediate concern.)
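A rough redis-py sketch of what I have in mind (write_phrases_to_mysql stands in for the actual DB write):

import json
import redis

r = redis.Redis()

def save_phrase(hour, user_id, phrase):
    # Append the (userid, phrase) pair to this hour's list.
    r.rpush(f"phrases_{hour}",
            json.dumps({"userid": user_id, "phrase": phrase}))

def all_phrases_today(current_hour):
    # Consolidate the day's hourly lists for the uniqueness check.
    out = []
    for hour in range(current_hour + 1):
        out += [json.loads(x) for x in r.lrange(f"phrases_{hour}", 0, -1)]
    return out

def flush_hour_to_db(hour):
    # Hourly background thread: copy the finished hour's list to MySQL.
    # The list stays in Redis for the rest of the day's checks.
    items = [json.loads(x) for x in r.lrange(f"phrases_{hour}", 0, -1)]
    write_phrases_to_mysql(items)  # hypothetical DB writer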
Thanks.
Check https://github.com/RedisGears/rgsync (and https://redislabs.com/solutions/use-cases/caching/), which tries to address both the write-back and write-through cases.
I'm yet to do a functionality test.
It is also interesting to note that a 2020 CMU paper (https://www.pdl.cmu.edu/PDL-FTP/Storage/2020.apocs.writeback.pdf) claims that "writeback-aware caching is NP-complete and Max-SNP hard".
Instead of going to Redis for uniqueness of data, you should create a unique index on the field you want to be unique, and MySQL will take care of the rest for you.

Real time analytic processing system design

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends, etc.).
The system should work fast, and be robust and scalable.
The system is Java-based (running on Linux).
The data arrives from a system that generates log files (CSV-based) of user transactions.
The system generates a file every minute, and each file contains the transactions of different users (sorted by time); each file may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
...
The system I am planning should process the files and perform some analysis in real time.
It has to gather the input, send it to several algorithms and other systems, and store computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions, for example trends.
The first algorithm I am planning to use requires at least 10 user records for best operation; if it cannot find 10 records after 5 minutes, it should use whatever data is available.
I would like to use Storm for the implementation, but I would prefer to keep this discussion at the design level as much as possible.
A list of system components:
A task that monitors incoming files every minute.
A task that reads each file, parses it, and makes it available to other system components and algorithms.
A component that buffers 10 records per user (for no longer than 5 minutes); when 10 records are gathered, or 5 minutes have passed, it is time to send the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm field grouping (which means the same task gets called for the same user) and tracking the collection of 10 of a user's records inside the task; of course, I plan to have several of these tasks, each handling a portion of the users.
There are other components that work on a single transaction; for them I plan to create other tasks that receive each transaction as it gets parsed (in parallel with other tasks).
I need your help with #3.
What are the best practices for designing such a component?
It obviously needs to maintain the data of 10 records per user.
A key-value map may help. Is it better to have the map managed in the task itself, or to use a distributed cache?
For example, Redis, a key-value store (which I have never used before).
Thanks for your help
I have worked with Redis quite a bit, so I'll comment on your idea of using Redis.
Requirement #3 breaks down into three parts:
A buffer per user
A buffer capped at 10 records
The buffer should expire after 5 min
1. Buffer Per User:
Redis is just a key-value store. Although it supports a wide variety of datatypes, they are always values mapped to a STRING key. So you should decide how to identify a user uniquely, in case you need a per-user buffer. In Redis you will never get an error when you overwrite a key with a new value, so one solution might be to check for a key's existence before writing.
2. Buffer of 10 records: You can obviously implement a queue in Redis, but restricting its size is left to you, e.g. using LPUSH and LTRIM, or using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you decided on in part 1.
3. Buffer expires in 5 min: This is the toughest part. In Redis, every key, irrespective of its value's datatype, can have an expiry, but the expiry process is silent: you won't get notified when a key expires. So you will silently lose your buffer if you rely on this property. One workaround is to keep an index that maps a timestamp to the keys that need to be expired at that timestamp. Then, in the background, you can read the index every minute, manually delete the keys (after reading them) from Redis, and call your desired process with the buffer data. For such an index you can look at sorted sets, where the timestamp is your score and the set member is the key (the unique per-user key decided in part 1, which maps to a queue) you wish to delete at that timestamp. You can run ZRANGEBYSCORE to read all set members with the specified timestamp.
Overall:
Use a Redis list to implement each queue.
Use LLEN to make sure you are not exceeding the limit of 10.
Whenever you create a new list, make an entry in the index (sorted set) with the score set to the current timestamp + 5 min and the value set to the list's key.
When LLEN reaches 10, read the data, remove the key from the index (sorted set) and from the db (delete the key -> list), and trigger your process with the data.
Every minute, generate the current timestamp, read the index, and for every due key: read its data, remove the key from the db, and trigger your process.
This is how I might implement it; there may be better ways to model your data in Redis.
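To make that concrete, a minimal redis-py sketch of the scheme (the buf:{user} key pattern, the index name, and run_algorithm are placeholders):

import time
import redis

r = redis.Redis()
INDEX = "expiry-index"  # sorted set: member = buffer key, score = deadline

def add_record(user_id, record):
    key = f"buf:{user_id}"
    if r.rpush(key, record) == 1:
        # First record for this user: schedule the 5-minute deadline.
        r.zadd(INDEX, {key: time.time() + 300})
    if r.llen(key) >= 10:
        flush(key)

def sweep():
    # Run every minute: flush buffers whose deadline has passed.
    for key in r.zrangebyscore(INDEX, 0, time.time()):
        flush(key)

def flush(key):
    # Note: under concurrency these steps should be wrapped in a
    # MULTI/EXEC transaction or a Lua script to stay atomic.
    records = r.lrange(key, 0, -1)
    r.zrem(INDEX, key)
    r.delete(key)
    run_algorithm(records)  # hypothetical downstream processing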
For your requirements 1 & 2: [Apache Flume or Kafka]
For your requirement #3: [an Esper bolt inside Storm; to accomplish this in Redis you would have to rewrite the Esper logic]

Creating a variable on database to hold global stats

Let's pretend I've got a social network.
I always show the user how many users are registered and have activated their profile.
So, every time a single user logs in, the app goes to the DB and runs:
select count(*) from users where status = 'activated'
So if 5,000 users log in, or simply refresh the page, the SQL above will run 5,000 times.
I was wondering whether it would be better to have a variable somewhere (I still have no idea where to put it) that gets incremented every time a user activates his profile; then, when I want to show how many users are registered on the social network, I just read the value of this variable.
How can I do this? Is it really a better solution than what I've got?
You could use an indexed view, which SQL Server will automatically maintain:
create table dbo.users (
ID int not null,
Activated bit not null
)
go
create view dbo.user_status_stats (Activated,user_count)
with schemabinding
as
select Activated,COUNT_BIG(*) from dbo.users group by Activated
go
create unique clustered index IX_user_status_stats on dbo.user_status_stats (Activated)
go
This just has two possible statuses, but could expand to more using a different data type. As I say, in this case, SQL Server will maintain the counts behind the scenes, so you can just query the view:
SELECT user_count from user_status_stats with (NOEXPAND) where Activated = 1
and it won't have to query the underlying table. You need to use the WITH (NOEXPAND) hint on editions below Enterprise (Developer behaves like Enterprise).
Although, as @Jim suggested, doing a COUNT(*) against an index when the index column(s) can satisfy the query criteria using equality comparisons should be pretty quick too.
As you've already guessed - it's not a great idea to calculate this value every time someone hits the site.
You could do as you suggest, and update a central value as users are added, although you'll have to ensure that you don't end up with two processes updating the number simultaneously.
Alternatively you could have a job which runs your SQL routinely and updates the central 'user count' value.
Alternatively (option #2), you could use something like MemCache to hold the calculated value for a period of time, and recalculate it when the cache expires.
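That expiring-cache variant is simple to sketch; here in Python with an in-process cache and a hypothetical query_db helper (MemCache would work the same way, just shared across servers):

import time

_cache = {"count": None, "expires_at": 0.0}
TTL_SECONDS = 60  # how stale the figure is allowed to be

def activated_user_count():
    now = time.time()
    if _cache["count"] is None or now >= _cache["expires_at"]:
        # Recalculate only when the cached value has expired.
        _cache["count"] = query_db(
            "select count(*) from users where status = 'activated'")
        _cache["expires_at"] = now + TTL_SECONDS
    return _cache["count"]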
There are a few options you could consider:
1) Like you say, maintain a global count each time a profile is activated to save the hit on the users table each time. You could store that count in a "Stats" table and query the value from there.
2) Don't show the actual "live" count; show a count that's "pretty much up to date", e.g. cache the count in your application and have the value expire periodically so you requery it less frequently. Or, if you store the count in a "Stats" table as above, you could have a scheduled job that updates the count every hour instead of on every profile activation.
It depends on whether you want to show the exact figure in real time or whether you can live with a delay. Obviously, data volumes matter too: if you have a large database, a slightly out-of-date cached value could be worthwhile.
From a purely SQL Server standpoint, no, you are not going to find a better way of doing this, unless, perhaps, your social network is Facebook-sized. Denormalizing your data design (such as keeping a count in a separate table) introduces possible ways for the data to get out of sync. It doesn't have to get out of sync if it is coded properly, but it can...
Just make sure that you have an index on Status. Then SQL will not scan the table for the count; it will scan the index instead. The index will be much smaller, i.e. more data will fit in a disk page. If you were to convert your status to an int, smallint, or tinyint, you would get even more index leaves in a disk page and thus much less IO. To get your descriptions ('activated', etc.), use a reference table. The reference table would be so small that SQL would just keep the whole thing in RAM after the first access.
Now, if you still think this is too much overhead (and it shouldn't be), you could come up with a hybrid method: store your count in a separate table (which SQL would keep in RAM if it is just the one record), or, assuming your site is in ASP.NET, create an Application variable to keep track of the count, incrementing it in Session_Start and decrementing it in Session_End. But you will have to come up with a way of making the increment and decrement thread-safe so two sessions don't try to update the value at the same time.
You can also use a global temporary table; you will always get fast retrieval, even if you are polling every 30 seconds. The example triggers (Link1, Link2) maintain such activity in this table.

Best approach to cache Counts from SQL tables?

I would like to develop a Forum from scratch, with special needs and customization.
I would like to prepare my forum for intensive usage, and I am wondering how to cache things like user post counts and user reply counts.
Having only three tables, tblForum, tblForumTopics, and tblForumReplies, what is the best approach to caching the user topic and reply counts?
Consider a simple scenario: a user follows a link and opens the Replies.aspx?id=x&page=y page, and starts reading replies. On the HTTP request, the server runs an SQL command which fetches all replies for that page, also inner joining with tblForumReplies to find the number of replies by each user who replied:
select
tblForumReplies.*,
tblFR.TotalReplies
from
tblForumReplies
inner join
(
select IdRepliedBy, count(*) as TotalReplies
from tblForumReplies
group by IdRepliedBy
) as tblFR
on tblFR.IdRepliedBy = tblForumReplies.IdRepliedBy
Unfortunately this approach is very CPU-intensive, and I would like to hear your ideas on how to cache things like table counts.
If I count replies for each user on insert/delete and store the count in a separate field, how do I synchronize it with manual data changes? Suppose I manually delete replies in SQL.
These are the three approaches I'd be thinking of:
1) Maybe SQL Server performance will be good enough that you don't need to cache. You might be underestimating how well SQL Server can do its job. If you do your joins right, it's just one query to get all the counts of all the users that are in that thread. If you are thinking of this as one query per user, that's wrong.
2) Don't cache. Redundantly store the user counts on the user table, and update the user row whenever a post is inserted or deleted (see the sketch after this list).
3) If you have thousands of users, even many thousands, but not millions, you might find it practical to cache users and their counts in the web layer's memory - for ASP.NET, the "Application" cache.
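A minimal sketch of approach 2, keeping the post insert and the counter update in one transaction (sqlite3-style DB-API here; the table and column names beyond tblForumReplies are made up):

import sqlite3

def add_reply(conn, topic_id, user_id, body):
    # One transaction: the reply insert and the counter update
    # either both happen or neither does.
    with conn:  # sqlite3 connection as a context manager = transaction
        conn.execute(
            "INSERT INTO tblForumReplies (IdTopic, IdRepliedBy, Body)"
            " VALUES (?, ?, ?)",
            (topic_id, user_id, body))
        conn.execute(
            "UPDATE tblUsers SET TotalReplies = TotalReplies + 1"
            " WHERE IdUser = ?",
            (user_id,))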
I would not bother with caching until I need it for sure. From my experience there is no way to predict the places that will require caching. Try an iterative approach: implement without a cache, then gather statistics, and then implement the right caching (there are many kinds: content, data, aggregates, distributed, and so on).
BTW, I do not think your query is CPU-consuming. SQL Server will optimize that stuff, and COUNT(*) will run in ticks...
tbl prefixes suck -- as much as Replies.aspx?id=x&page=y URIs do. Consider ASP.NET MVC, or at least its routing part.
Second, do not optimize prematurely. However, if you really need to, denormalize your data: add a TotalReplies column to your ForumTopics table, and either rely on your DAL/BL to keep this field up to date (possibly with a scheduled task to resync it), or use triggers.
For each reply you would keep TotalReplies and TotalDirectReplies. That way, you can support a tree-like structure of replies and keep the counts updated throughout the entire hierarchy without needing to count each time.