Producer and multiple Consumer problem in Redis

I am new to Redis. I wrote a script which gets keys matching a constant pattern from the Redis server and then processes them. The problem is that several instances of the script run at the same instant, so the same keys get processed multiple times.
Deleting all those keys after getting them will not work either, because two instances of the script might fetch the same keys at the same moment and both process them. That leads to the same problem of processing the same keys multiple times, which causes redundancy.
For example, the keys look like this:
record_user-12
record_user-13
record_user-14
record_user-15
record_user-16
record_user-17
record_user-18
record_user-19
Constant pattern = "record*"
Suppose there are two instances of the same script, S1 and S2, and both start running at t=0. Since the pattern is the same for every instance, both scripts fetch the same keys and process them, which causes the redundancy: I don't want to process the same key more than once.
Please help me out
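To make the race concrete, here is a minimal sketch of the pattern the question describes, in Python with redis-py (the client library is an assumption; process is a hypothetical handler). Two copies started at t=0 both see the full key list before either deletes anything:

import redis

r = redis.Redis()

def process(value):
    """Hypothetical per-record handler."""
    print("processing", value)

# Both S1 and S2 run this at t=0 and receive the same key list,
# so every record_user-* key gets processed twice.
for key in r.keys("record*"):
    value = r.get(key)
    process(value)
    r.delete(key)  # deleting afterwards does not close the race window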

Related

How can data be synchronized between processes in SQL?

I'm wondering about something perhaps extremely stupid, but I can't seem to find an answer (which is not a good sign, usually).
Assume we have a SQL server (MySQL, PostgreSQL; this question even applies to SQLite3, though there's no server there) and several clients connected to it. I've seen countless queries that, in my opinion, might be hard to synchronize.
So let's assume we have a table (usage statistics, say) with a row per day.
statistic (
day,
num_requests
)
(I avoid mentioning data types, since that's not the point, but the number of requests should be numeric.)
So when a new web request comes in, the web server asks this table for the current statistic and increases the number of requests. No biggie, right?
cursor.execute("""
    SELECT num_requests FROM statistic
    WHERE ...
""")
(number,) = cursor.fetchone()  # fetch the current value (execute() does not return it)
number += 1                    # incremented client-side: this is the race window
cursor.execute("""
    UPDATE statistic SET num_requests=?
    WHERE ...
""", (number,))
But what happens if two requests are handled more or less simultaneously, perhaps from several clients or different processes? They each ask for today's current statistic (just a read operation, non-blocking), they get the number of requests from the row (a step that doesn't involve the server), and then they increment it by 1. If both requests run at about the same time, they have each incremented the same number once, and each sends an UPDATE with its number.
In the end, the number of requests for today's statistic has increased by one, although there were two requests. I know there are mechanisms to ensure proper data synchronization, but I fail to see how they address this case. Reads are usually non-blocking as far as I know. Writes can block, but since the other process's read happened before, the second write simply overwrites the first. And I don't see any way to express that constraint logically.
In other words, this looks like the point where, in most programming languages, we would lock the row and say "from here onward, you can neither read it nor write it, I'm working on it". The first request would read (lock), increment and write, and then unlock; the second request would have to wait patiently for the lock to be released. I don't see that mechanism in SQL. Is it transparent and not even necessary? If so, how does it work? Or have we lived our entire lives with problems like this?
Thanks!
Make the increment a single atomic statement, so the read-modify-write happens entirely on the server:

cursor.execute("""
    UPDATE statistic SET num_requests=num_requests+1
    WHERE ...
""")

How many keys can be deleted in a single redis del command?

I want to delete multiple Redis keys using a single delete command from the Redis client.
Is there any limit on the number of keys that can be deleted?
I will be using del key1 key2 ....
There's no hard limit on the number of keys, but the query buffer limit does provide a bound: connections are closed when the buffer hits 1 GB, so in practice this is fairly hard to hit.
Docs:
https://redis.io/topics/clients
However! You may want to take into consideration that Redis is single-threaded: a time-consuming command will block all other commands until completed. Depending on your use-case this may make a good case for "chunking" up your deletes into groups of, say, 1000 at a time, because it allows other commands to squeeze in between. (Whether or not this is tolerable is something you'll need to determine based on your specific scenario.)
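A sketch of that chunking idea in Python with redis-py (the client is an assumption; 1000 keys per DEL, per the suggestion above):

import redis

r = redis.Redis()

def delete_in_chunks(keys, chunk_size=1000):
    """Issue one DEL per chunk so other clients' commands can run in between."""
    for i in range(0, len(keys), chunk_size):
        r.delete(*keys[i:i + chunk_size])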

compare redis command: multi and mget

There are two systems sharing a Redis database; one system only reads from Redis, the other updates it.
The read system is so busy that Redis can't keep up. To reduce the number of requests to Redis, I found "mget", but I also found "multi".
I'm sure MGET will reduce the number of requests, but will MULTI do the same? I think MULTI forces the Redis server to keep some information about the transaction and to collect the requests in it from the client one by one, so the total number of requests sent stays the same and only the results are returned together. Right?
So if I just read keyA, keyB and keyC inside MULTI while the other (writing) system changes keyB's value, what will happen?
Short answer: you should use MGET.
MULTI is used for transactions, and it won't reduce the number of requests. Also, the MULTI command might be deprecated in the future, since there's a better choice: Lua scripting.
So if I just read keyA, keyB and keyC inside MULTI while the other (writing) system changes keyB's value, what will happen?
Since the MULTI (with EXEC) command ensures a transaction, all three GET commands (read operations) execute atomically. If the update happens before the transaction executes, you'll get the new value; otherwise, you'll get the old value.
By the way, there's another option for reducing RTT: pipelining. In your case, though, MGET should be the best option.
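For comparison, a sketch of the three options in Python with redis-py (client choice and key names assumed). Note that a raw MULTI sends each command as its own request, which is why it doesn't cut request counts the way MGET does; redis-py's pipeline object additionally buffers the queued commands into one client-side batch:

import redis

r = redis.Redis()

# MGET: one request, one reply carrying all three values.
a, b, c = r.mget("keyA", "keyB", "keyC")

# MULTI/EXEC (a transactional pipeline in redis-py): the GETs execute
# atomically on the server, so a concurrent write to keyB lands either
# entirely before or entirely after all three reads.
with r.pipeline(transaction=True) as p:
    p.get("keyA")
    p.get("keyB")
    p.get("keyC")
    a, b, c = p.execute()

# Plain pipeline: one round trip, but no atomicity guarantee.
with r.pipeline(transaction=False) as p:
    p.get("keyA")
    p.get("keyB")
    p.get("keyC")
    a, b, c = p.execute()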

ServiceStack.Redis SearchKeys

I am using the ServiceStack.Redis client from C#.
I added about 5 million records of type1 using the key pattern a::name::1 and 11 million records of type2 using the pattern b::RecId::1.
Now I am using the Redis typed client as client = redis.As<String>. I want to retrieve all the keys of type2, using the following pattern:
var keys = client.SearchKeys("b::RecID::*");
But it takes forever (approximately 3-5 minutes) to retrieve the keys.
Is there a faster, more efficient way to do this?
You should work hard to avoid the need to scan the keyspace. KEYS is literally a server stopper, but even if you have SCAN available: don't do that. You could instead keep the keys of the things you have in a set somewhere, but there is no SRANGE etc. - in Redis 2.x you'd have to use SMEMBERS, which still needs to return a few million records, but at least they will all be available. In later server versions you have access to SCAN (think: KEYS) and SSCAN (think: SMEMBERS), but ultimately you simply have the issue of wanting millions of rows, which is never free.
If possible, mitigate the impact by using a master/slave pair and running the expensive operations on the slave. At least other clients will be able to do something while you're killing the server.
The KEYS command in Redis is slow (well, not slow, but time-consuming). It also blocks your server from accepting any other command while it's running.
If you really want to iterate over all of your keys, take a look at the SCAN command instead - although I have no idea how ServiceStack exposes it.
You can use the SCAN command to search in a loop, where each iteration is restricted to a smaller number of keys. For a complete example, refer to this article: http://blog.bossma.cn/csharp/nservicekit-redis-support-scan-solution/
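The question is about C#, but the SCAN loop itself is small; here is a sketch in Python with redis-py (the client and the handle function are assumptions - check the ServiceStack.Redis docs for its SCAN-based equivalent):

import redis

r = redis.Redis()

def handle(key):
    """Hypothetical per-key processing."""
    print(key)

# scan_iter drives the SCAN cursor for you, fetching keys in batches of
# roughly `count`, so the server stays responsive between iterations.
for key in r.scan_iter(match="b::RecId::*", count=1000):
    handle(key)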

Real time analytic processing system design

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends, etc.).
The system should be fast, robust and scalable.
The system is Java-based (running on Linux).
The data arrives from a system that generates log files (CSV-based) of user transactions.
The system generates a file every minute, and each file contains the transactions of different users (sorted by time); each file may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
...
The system I am planning should process the files and perform some analysis in real time.
It has to gather the input, send it to several algorithms and other systems, and store the computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions - trends, for example.
The first algorithm I plan to use needs at least 10 records per user to operate well; if it cannot find 10 records within 5 minutes, it should use whatever data is available.
I would like to use Storm for the implementation, but I would prefer to keep this discussion at the design level as much as possible.
A list of system components:
1. A task that monitors incoming files every minute.
2. A task that reads a file, parses it and makes it available to the other system components and algorithms.
3. A component that buffers 10 records per user (for no longer than 5 minutes); when 10 records are gathered, or 5 minutes have passed, it sends the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm field grouping (so the same task gets called for the same user) and tracking the collection of the user's 10 records inside the task. Of course I plan to have several of these tasks, each handling a portion of the users.
There are other components that work on a single transaction; for them I plan to create other tasks that receive each transaction as it is parsed (in parallel with the other tasks).
I need your help with #3.
What are the best practices for designing such a component?
It obviously needs to maintain up to 10 records per user.
A key-value map may help. Is it better to have the map managed in the task itself, or to use a distributed cache?
For example Redis, a key-value store (I have never used it before).
Thanks for your help.
I have worked with Redis quite a bit, so I'll comment on your thought of using Redis.
#3 has three requirements:
A buffer per user
A buffer of 10 records
Expiry after 5 minutes
1. Buffer per user:
Redis is just a key-value store. Although it supports a wide variety of data types, they are always values mapped to a string key. So you should decide how to identify a user uniquely if you need a per-user buffer, because Redis will never give you an error when you overwrite a key with a new value. One solution might be to check for the key's existence before writing.
2. Buffer of 10 records: You can obviously implement a queue in Redis, but restricting its size is left to you - for example, use LPUSH with LTRIM, or use LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the per-user key you decided on in part 1.
3. Buffer expires in 5 minutes: This is the toughest part. In Redis, every key, irrespective of the underlying data type of its value, can have an expiry, but the expiry process is silent: you won't get notified when a key expires, so you would silently lose your buffer if you relied on this property. One workaround is to keep an index that maps a timestamp to the keys that need to expire at that timestamp. In the background you then read the index every minute, manually delete each key from Redis (after reading it) and call your desired process with the buffered data. For such an index, look at sorted sets: the timestamp is the score, and the set members are the keys (the unique per-user keys from part 1, each mapping to a queue) you wish to delete at that timestamp. You can use ZRANGEBYSCORE to read all members up to a given timestamp.
Overall:
Use a Redis list to implement each queue.
Use LLEN to make sure you do not exceed the limit of 10.
Whenever you create a new list, make an entry in the index (sorted set) with the current timestamp + 5 minutes as the score and the list's key as the value.
When LLEN reaches 10, read the data, then remove the key from the index (sorted set) and from the database (delete the key and its list), and trigger your process with the data.
Every minute, take the current timestamp, read the index, and for every due key: read its data, remove the key from the database and trigger your process.
This is how I might implement it (see the sketch below); there may be better ways to model your data in Redis.
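A minimal sketch of this scheme in Python with redis-py (the client, the key names and the process() callback are all assumptions, not part of the question):

import time
import redis

r = redis.Redis()

BUFFER_LIMIT = 10
EXPIRY_SECONDS = 5 * 60

def process(records):
    """Placeholder for the algorithm that consumes a user's buffered records."""
    print(f"processing {len(records)} records")

def add_record(user_id, record):
    """Buffer one record for a user; flush when 10 are gathered (parts 1 and 2)."""
    key = f"buffer:{user_id}"          # hypothetical per-user key (part 1)
    if r.rpush(key, record) == 1:      # first record: register in the expiry index
        r.zadd("buffer-index", {key: time.time() + EXPIRY_SECONDS})
    if r.llen(key) >= BUFFER_LIMIT:    # part 2: size check with LLEN
        flush(key)

def sweep():
    """Run every minute: flush buffers older than 5 minutes (part 3)."""
    for key in r.zrangebyscore("buffer-index", 0, time.time()):
        flush(key)

def flush(key):
    records = r.lrange(key, 0, -1)     # read the buffered data...
    r.delete(key)                      # ...then remove it from Redis
    r.zrem("buffer-index", key)        # and from the index
    process(records)                   # not atomic; a MULTI/pipeline would tighten this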
For your requirements 1 & 2: look at Apache Flume or Kafka.
For your requirement #3: look at the Esper bolt for Storm. To accomplish this in Redis, you would have to rewrite the Esper logic.