Real-time analytics processing system design - Redis

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends, etc.).
The system should be fast, robust, and scalable.
The system is Java-based (on Linux).
The data arrives from a system that generates log files (CSV-based) of user transactions.
That system generates a file every minute, and each file contains the transactions of different users (sorted by time); a single file may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
.
.
.
The system I am planning should process the files and perform some analysis in real-time.
It has to gather the input, send it to several algorithms and other systems, and store the computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions, for example trends.
The first algorithm I plan to use works best with at least 10 records per user; if it cannot find 10 records within 5 minutes, it should use whatever data is available.
I would like to use Storm for the implementation, but I would prefer to keep this discussion at the design level as much as possible.
A list of system components:
1. A task that monitors incoming files every minute.
2. A task that reads each file, parses it, and makes the data available to other system components and algorithms.
3. A component that buffers 10 records per user (for no longer than 5 minutes); once 10 records are gathered, or 5 minutes have passed, it is time to send the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm field grouping (which means the same task gets called for the same user) and tracking the collection of each user's 10 records inside the task. Of course, I plan to have several of these tasks, each handling a portion of the users.
There are other components that work on a single transaction; for them I plan to create other tasks that receive each transaction as it gets parsed (in parallel to the other tasks).
I need your help with #3.
What are the best practices for designing such a component?
It obviously needs to maintain up to 10 records per user.
A key-value map may help. Is it better to have the map managed in the task itself or to use a distributed cache?
For example Redis, a key-value store (I have never used it before).
Thanks for your help

I have worked with Redis quite a bit, so I'll comment on your thought of using Redis.
#3 has 3 requirements:
A buffer per user
A buffer of 10 records
The buffer should expire after 5 minutes
1. Buffer Per User:
Redis is just a key-value store. Although it supports a wide variety of data types, values are always mapped to a STRING key. So you should decide how to identify a user uniquely if you need a per-user buffer, because Redis will never give you an error when you overwrite an existing key with a new value. One solution might be to check for the key's existence before writing.
2. Buffer of 10 records: You can obviously implement a queue in Redis, but restricting its size is left to you, e.g. using LPUSH and LTRIM, or using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you decided on in part 1.
3. Buffer expires in 5 minutes: This is the toughest part. In Redis, every key, irrespective of the underlying data type of its value, can have an expiry, but the expiry process is silent: you won't be notified when a key expires, so you will silently lose your buffer if you rely on this property. One workaround is to maintain an index that maps a timestamp to the keys that need to expire at that time. A background job can then read the index every minute, manually read and delete each due key from Redis, and call your desired process with the buffered data. For such an index, look at sorted sets: the timestamp is the score and the set member is the key [the unique per-user key decided in part 1, which maps to a queue] you wish to delete at that timestamp. You can use ZRANGEBYSCORE to read all set members up to a specified timestamp.
Overall:
Use a Redis list to implement each queue.
Use LLEN to make sure you do not exceed the limit of 10.
Whenever you create a new list, add an entry to the index [sorted set] with the score set to the current timestamp + 5 minutes and the value set to the list's key.
When LLEN reaches 10, read the data, remove the key from the index [sorted set] and from the db [delete the key -> list], then trigger your process with the data.
Every minute, generate the current timestamp, read the index, and for every key that is due: read its data, remove the key from the db, and trigger your process.
This is how I might implement it; there may be a better way to model your data in Redis. A rough sketch of this scheme follows.
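Below is a minimal sketch of this list + sorted-set-index scheme, using the Python redis-py client purely for illustration. The key names (buf:<user>, buf:index), the process() callback, and the 300-second window are placeholders of mine, and the sketch ignores atomicity; a production version would wrap the flush in MULTI/EXEC or a Lua script.

import time
import redis

r = redis.Redis()

def add_record(user_id, record, process):
    key = f"buf:{user_id}"                      # per-user queue (part 1)
    if r.rpush(key, record) == 1:               # first record: register expiry in the index
        r.zadd("buf:index", {key: time.time() + 300})
    if r.llen(key) >= 10:                       # buffer full: flush (part 2)
        flush(key, process)

def flush(key, process):
    records = r.lrange(key, 0, -1)              # read the buffered records
    r.delete(key)                               # remove the list from the db
    r.zrem("buf:index", key)                    # remove the key from the index
    if records:
        process(key, records)                   # trigger the algorithm

def expire_due_buffers(process):                # run this every minute (part 3)
    for key in r.zrangebyscore("buf:index", 0, time.time()):
        flush(key.decode(), process)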

For your requirements #1 & #2: Apache Flume or Kafka.
For your requirement #3: Esper Bolt inside Storm. To accomplish this in Redis you would have to rewrite the Esper logic yourself.

How to store unique visits in Redis

I want to know how many people have visited each blog page. For that, I have a column in the Blogs table (MS SQL DB) to keep the total visit count. But I also want the visits to be as unique as possible.
So I keep the user's unique ID and the blog ID in the Redis cache, and every time a user visits a page, I check whether she has visited this page before; if not, I increase the total visit count.
My question is, what is the best way of storing such data?
Currently, I create a key like "project-visit-{blogId}-{userId}" and use StringSetAsync and StringGetAsync, but I don't know whether this method is efficient.
Any ideas?
If you can sacrifice some precision, the HyperLogLog (HLL) probabilistic data structure is a great solution for counting unique visits because:
It uses only 12 KB of memory, and that footprint is fixed: it doesn't grow with the number of unique visits
You don't need to store user data, which makes your service more privacy-oriented
The HyperLogLog algorithm is really smart, but you don't need to understand its inner workings in order to use it; Redis added it as a data structure some years ago. So all you, as a user, need to know is that with HyperLogLogs you can count unique elements (visits) in a fixed memory space of 12 KB, with a 0.81% margin of error.
Let's say you want to keep a count of unique visits per day; you would have to have one HyperLogLog per day, named something like cnt:page-name:20200917 and every time a user visits a page you would add them to the HLL:
> PFADD cnt:page-name:20200917 {userID}
If you add the same user multiple times, they will still only be counted once.
To get the count you run:
> PFCOUNT cnt:page-name:20200917
You can change the granularity of unique users by having different HLLs for different time intervals, for example cnt:page-name:202009 for the month of September, 2020.
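For illustration, a tiny redis-py sketch of the per-day counter described above; the cnt:{page}:{YYYYMMDD} key pattern mirrors the example, everything else is an assumption.

from datetime import date
import redis

r = redis.Redis()

def record_visit(page, user_id):
    # adding the same user twice on the same day has no effect on the count
    r.pfadd(f"cnt:{page}:{date.today():%Y%m%d}", user_id)

def unique_visits(page, day):
    # day given as "YYYYMMDD", e.g. "20200917"
    return r.pfcount(f"cnt:{page}:{day}")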
This quick explainer lays it out pretty well: https://www.youtube.com/watch?v=UAL2dxl1fsE
This blog post might help too: https://redislabs.com/redis-best-practices/counting/hyperloglog/
And if you're curious about the internal implementation Antirez's release post is a great read: http://antirez.com/news/75
NOTE: with this solution you lose the information of which users visited the page; you only have the count.
Your solution is not atomic unless you wrap the get and set operations in a transaction or Lua script.
A better solution is to save project-visit-{blogId}-{userId} into a Redis set. When you get a visit, call SADD to add an item to the set. Redis adds the item only if the user has not visited this page before. If you want the total count, just call SCARD to get the size of the set.
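One way to read this suggestion is to keep a set per blog with the user IDs as members; here is a short redis-py sketch under that assumption (the blog:visitors:{blogId} key name is mine):

import redis

r = redis.Redis()

def record_visit(blog_id, user_id):
    # SADD returns 1 only for a first-time visitor, so no separate
    # "check then set" round trip is needed and the operation is atomic
    return r.sadd(f"blog:visitors:{blog_id}", user_id) == 1

def total_unique_visits(blog_id):
    return r.scard(f"blog:visitors:{blog_id}")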
Regardless of the back-end technology (programming language, etc.), you can use a Redis stream. Streams were introduced in Redis 5 and let you define publishers and consumers for a topic (stream) created in Redis. Then, on each user visit, you append a new record (asynchronously, of course) to this stream. You can hold whatever info you want in that record (user IP, ID, etc.).
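A rough redis-py sketch of the stream approach, assuming a single stream named "visits" and arbitrary field names:

import redis

r = redis.Redis()

def publish_visit(blog_id, user_id, ip):
    # XADD appends an entry to the stream; consumers process it later
    r.xadd("visits", {"blog": blog_id, "user": user_id, "ip": ip})

def consume_visits(last_id="0-0", count=100):
    # XREAD returns entries added after last_id; a real consumer would keep
    # track of last_id (or use consumer groups with XREADGROUP)
    return r.xread({"visits": last_id}, count=count, block=1000)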
Defining a key for each unique visit is not a good idea at all, because:
It makes life harder for the Redis GC
Performance, for this use case, is not comparable to streams, especially if you use that Redis instance for other purposes
Constantly collecting and processing these unique visits is not efficient: you always have to scan through all the keys
Conclusion:
If you want to use Redis, go with Redis streams. If Redis can be replaced, go with Kafka for sure (or a similar technology).

Which approach is better when using Redis?

I'm facing following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create one list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only thought of my own comfort, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me this for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures as shown here. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time (the list has to be traversed to get to the element). This can hurt performance if there are a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add a username/task combination that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be many more keys in Redis, which can increase the memory footprint (see the FAQ excerpt and the short sketch below).
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, or Sorted Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
Every hash, list, set, and sorted set can hold 2^32 elements.
In other words your limit is likely the available memory in your system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
An empty instance uses ~3MB of memory.
1 million small key -> string value pairs use ~85MB of memory.
1 million keys -> hash values, representing an object with 5 fields, use ~160MB of memory.
Testing your use case is trivial using the redis-benchmark utility to generate random data sets; then check the space used with the INFO memory command.
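Going back to scenario 2 with sets, a brief redis-py sketch; the dispatched_tasks:{username} key pattern follows the question, while the function names are mine:

import redis

r = redis.Redis()

def dispatch(username, task):
    r.sadd(f"dispatched_tasks:{username}", task)    # O(1); duplicates are ignored

def user_tasks(username):
    return r.smembers(f"dispatched_tasks:{username}")

def complete(username, task):
    r.srem(f"dispatched_tasks:{username}", task)    # O(1) removal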

optimization: stream of many key updates, small number of keys

I have a program that receives a constant stream of data.
From this stream of data I populate a hashtable. Every piece of data I receive is translated into either:
a key update;
or a key insertion, if the key doesn't already exist.
I store the incoming raw data in a queue before it is processed.
The number of keys in the hashtable is very small. 99% of the data I receive
corresponds to key updates.
The problem is that I have so many key updates that the queue becomes
too big for my consumers.
Obviously, of the thousands of key updates, many concern the same key, so only the last one has real value while all the others are useless.
What is the best way for me to handle this case? Which data structure should I
be using?
What can you tell us about your keys? How many are there? Are they numeric (and if so, what range of values might they take?), textual? Any limit on the number of bytes per key? What kind of hash table are you inserting to (e.g. closed hashing, open hashing)? What contention/locking is there on the hash table? How many updates per second? What programming language are you using?
How many keys
A few hundred or maybe a few thousand. Not a lot!
Numeric keys
The keys themselves are alphanumeric; they are not very long, around 30 characters at most. The values, however, are all numbers (integers).
Limit on the number of bytes per key
My keys are 30 characters long, at most.
Kind of hash table
I'm simply using Python's defaultdict
Contention/locking
Python's dictionaries are considered thread-safe
Number of updates per second
It can go from 1 every 3 seconds to more than 100 per second
Programming language
I'm using Python
Instead of using a simple queue, you can use another hashtable: each incoming message is stored in a per-key stack, keyed by its key. The consumer then takes the top element of each stack (which is the most recent item for that key); you can optionally clear each stack when you pull an item out.
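A small Python sketch of that idea (Python being the question's language); since only the most recent item per key matters, the sketch keeps just the latest value per key instead of a full stack. The class and method names are mine.

import threading

class CoalescingBuffer:
    def __init__(self):
        self._latest = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        # overwrite: older updates for the same key are dropped
        with self._lock:
            self._latest[key] = value

    def drain(self):
        # the consumer calls this periodically instead of popping a queue
        with self._lock:
            batch, self._latest = self._latest, {}
        return batch    # key -> most recent value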
ConcurrentDictionary should fit the bill nicely.
But what you really need here is a (maybe adaptive) throttling mechanism that detects when the consumers can no longer keep up with the queue and starts collapsing the data.

Correct modeling in Redis for writing single entity but querying multiple

I'm trying to move data that currently lives in a SQL DB to Redis, in order to gain much higher throughput (the workload is very high-throughput). I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender.
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each fetching 20 rows.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates, I would use hashes.
But because of the 10% of reads, I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record in a blob associated with a key (i.e. the string data type). All write operations can be done in one round trip, since the record is always written at once. Several read operations can be done in one round trip as well, using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash data type). All write operations can be done in one round trip (even partial ones). Several read operations can also be done in one round trip, provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU-consuming than using MGET, but not that much. In terms of latency, it should not be significantly different, except if you execute hundreds of thousands of them per second on the Redis instance.
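For illustration, a short redis-py sketch of both options; the user:{id} key pattern and the JSON blob encoding are assumptions, not part of the answer:

import json
import redis

r = redis.Redis()

# Option A: whole record as a string blob (no partial updates)
def write_user_blob(user):
    r.set(f"user:{user['id']}", json.dumps(user))

def read_users_blob(ids):
    # MGET fetches all 20 records in a single round trip
    return [json.loads(v) for v in r.mget([f"user:{i}" for i in ids]) if v]

# Option B: record as a hash (partial updates possible)
def update_user_field(user_id, field, value):
    r.hset(f"user:{user_id}", field, value)

def read_users_hash(ids):
    # pipeline the HGETALL calls so all 20 reads share one round trip
    pipe = r.pipeline()
    for i in ids:
        pipe.hgetall(f"user:{i}")
    return pipe.execute()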

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somewhat confused...
So should I just create a table with the UUID as the hash key and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range keys as described, and without a secondary index, how would I query for all messages that are in a certain time range regardless of their UUIDs?
DynamoDB has introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time-series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then today's date would likely become a "hot" key - one that is written to with excessive frequency. This can create a serious bottleneck, because the throughput available to a particular DynamoDB partition equals the total provisioned throughput divided by the number of partitions. That means that if all your writes go to a single key (today's key), you have provisioned 20 writes per second, and the table has 20 partitions, your effective throughput is 1 write per second; any requests beyond that are throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan (a sketch of the query loop follows below). Another possibility is to use Elastic MapReduce jobs, but I have not tried that myself yet, so I cannot say how easy or effective it is to work with.
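As a hedged illustration only, here is a Python (boto3) sketch of that per-day, per-bucket query loop; the table name, the composite day#bucket hash key encoding, the attribute names, and the NUM_BUCKETS value are all assumptions of mine:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Messages")
NUM_BUCKETS = 200    # should exceed the number of underlying partitions

def messages_for_day(day):
    # day is a date string such as "2014-06-15"
    items = []
    for bucket in range(1, NUM_BUCKETS + 1):
        resp = table.query(
            KeyConditionExpression=Key("day_bucket").eq(f"{day}#{bucket}")
        )
        items.extend(resp["Items"])
        # a production version would also follow LastEvaluatedKey pagination
    return items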
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST include the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You are also not able to issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table, with filters based on the timestamp value, in order to find the items to delete. Note that filtering will not reduce the consumed capacity, as the scan still reads the whole table.