Finding Redis data by last update - redis

I'm new to Redis and I want to use the following scheme:
key: EMPLOYEE_*ID*
value: *EMPLOYEE DATA*
I was thinking of adding a timestamp to the end of the key, but I'm not sure that would even help. Basically I want to be able to get a list of the employees who are the most stale, i.e. the ones that have not been updated for the longest time. What's the best way to accomplish this in Redis?

Keep another key that tracks the employees (their key names) along with their last-update timestamps - the best candidate for that is a Sorted Set. To maintain that key's data integrity, you'll have to update it whenever you update one of the employees' keys.
With that data structure in place, you can easily get the key names of the least-recently-updated (most stale) employees with the ZRANGE command, since ZRANGE returns members ordered by ascending score.
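For illustration, a minimal redis-py sketch of that approach (the sorted-set name "employee:last_update" and the helper functions are assumptions, not part of the question):

# Minimal sketch using redis-py; names like "employee:last_update" are assumptions.
import time
import redis

r = redis.Redis()

def save_employee(emp_id, data):
    key = f"EMPLOYEE_{emp_id}"
    r.set(key, data)
    # Record the update time in a sorted set, scored by Unix timestamp.
    r.zadd("employee:last_update", {key: time.time()})

def most_stale(n=10):
    # Lowest scores = oldest updates = most stale employees.
    return r.zrange("employee:last_update", 0, n - 1)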

Have you tried filtering by expiration time? You could set the same expiration on all keys and refresh the expiration each time a key is updated. Then, with a Lua script, you could iterate over the keys and filter by remaining TTL: the keys with the smallest remaining TTL are the ones that have not been updated recently.
This would only work under certain assumptions; it depends on how your system works. The approach is also O(N) in the number of employees, so while it saves space, it will not scale well as the number of entries and the scan frequency grow.
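A rough client-side sketch of that idea with redis-py (doing the SCAN/TTL loop from Python instead of Lua; the key pattern and the TTL threshold are assumptions):

# Rough sketch of the TTL-based approach; key pattern and threshold are assumptions.
import redis

r = redis.Redis()
EXPIRY_SECONDS = 7 * 24 * 3600  # every employee key is written with this TTL

def stale_employees(max_remaining_ttl):
    stale = []
    # O(N) over all employee keys; SCAN avoids blocking the server like KEYS would.
    for key in r.scan_iter(match="EMPLOYEE_*"):
        ttl = r.ttl(key)
        # The less TTL is left, the longer ago the key was last written.
        if 0 <= ttl <= max_remaining_ttl:
            stale.append(key)
    return stale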

Related

Is using a timestamp as a hash key on a GSI in DynamoDB a good approach

I have a large (2B + records) DynamoDB table.
I want to implement a distributed locking process by adding a new field, 'index_due_at' when an item is created or updated. After the create/update, I will do some further processing on the item and then remove the 'index_due_at' field.
I'd like to create a sweeper job which will periodically extract any records with an outstanding 'index_due_at' field (on the assumption that something about the above process failed) to give those records further treatment. I would anticipate at most 100s of records in this state at any one time, more likely 10s.
To optimise the performance of the sweeper, I want to create a GSI including the new field (and project the key data into it).
It seems that using a timestamp (in millis) as the GSI HASH key ought to give a good distribution. And I don't need to query on this field's value, just on its presence. Can anyone identify any drawbacks in this approach and if so, suggest an alternative?
Issues I can anticipate include:
* Non-uniqueness in timestamps at milli level.
* Possible hash key problems with numeric values?
* Possible hash key problems with numeric values that don't vary much in the most significant digits.
This is less of a problem than you might think. GSI hash keys don't actually have to be unique, so you're fine on that front.
You probably already know this, but your GSI will only contain items with GSI keys, so your GSI should be pretty small (100s of items).
One thought I have is that the index_due_at might actually be better as a GSI sort key rather than hash key. Data is sorted within a partition by sort key. So you could have a GSI hash key of index_due_at_flag which would be Y if present, then a sort key of index_due_at. This would mean all your data would be sorted naturally, so you could process it in date order.
That said, you are probably never going to Query this GSI, so I suspect your choice of keys hardly matters at all. Presumably you will just do a Scan, get all the items and try and process them all. In which case you would never even use the keys. Just having a key attribute present would put the item in the GSI.
Another thought is that you need to handle the fact that GSIs are not perfectly synchronous with the base table. It's possible (admittedly unlikely) that an item in your GSI has actually just been processed. Therefore, if your sweeper script picks up an item from the GSI, you should handle the possibility that it has already been updated in the base table (e.g. by checking the base table item before attempting to process it).
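To make that concrete, here is a rough boto3 sketch of such a sweeper; the table name, index name, and "id" key attribute are assumptions based on the question:

# Rough sweeper sketch with boto3; table/index/key names are assumptions.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my_table")          # assumed table name

def sweep():
    # Scan the sparse GSI; only items that still carry index_due_at appear here.
    resp = table.scan(IndexName="index_due_at-index")
    for item in resp.get("Items", []):
        # Re-read the base table item, because the GSI may lag slightly behind it.
        current = table.get_item(Key={"id": item["id"]}).get("Item")
        if current and "index_due_at" in current:
            process(current)                # hypothetical processing step

def process(item):
    ...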
Good luck with it. I answered because I liked your bio! Hope staying on the right side of barrel shaped is working out :)
This should be a perfect scenario for using a DynamoDB sparse index.
Use 'index_due_at' as the sort key in the GSI; only the items you are interested in will be in the index, which greatly reduces the space needed and improves performance.
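For reference, a sparse GSI like that could be added to an existing table roughly as follows with boto3. The index name, the index_due_at_flag hash key (borrowed from the other answer), and the throughput values are all assumptions:

# Sketch of adding a sparse GSI with boto3; names and throughput values are assumptions.
import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="my_table",
    AttributeDefinitions=[
        {"AttributeName": "index_due_at_flag", "AttributeType": "S"},
        {"AttributeName": "index_due_at", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "index_due_at-index",
            "KeySchema": [
                {"AttributeName": "index_due_at_flag", "KeyType": "HASH"},
                {"AttributeName": "index_due_at", "KeyType": "RANGE"},
            ],
            # Project only the keys; the sweeper re-reads the base table anyway.
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 1,
                "WriteCapacityUnits": 1,
            },
        }
    }],
)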

Way to get notified of key:value on expiry

I have incoming data which I have to aggregate for some time, and when the key expires, process the data.
I have tried using Redis keyspace notifications, but they only give the key, not the value.
Is there a better way to handle this scenario?
Instead of setting an expiry, aggregate the data into a list or set depending on your use case. Put a timestamp in the key itself. For example, if you want to aggregate data for 1 hour, your key can be mydata:2018-26-06-1300, mydata:2018-26-06-1400, mydata:2018-26-06-1500 and so on.
Then you simply run a cron job every hour, read all the values from the key, and delete the key when you are done.
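A minimal sketch of that pattern with redis-py (the "mydata:" prefix and a year-month-day-hour bucket format are assumptions along the lines of the example above):

# Minimal sketch of hourly bucketed aggregation with redis-py; names are assumptions.
from datetime import datetime, timedelta
import redis

r = redis.Redis()

def add_data(value):
    # Bucket key contains the current hour, e.g. mydata:2018-06-26-13
    bucket = "mydata:" + datetime.utcnow().strftime("%Y-%m-%d-%H")
    r.rpush(bucket, value)

def process_previous_hour():
    # Run this from an hourly cron job.
    previous = datetime.utcnow() - timedelta(hours=1)
    bucket = "mydata:" + previous.strftime("%Y-%m-%d-%H")
    values = r.lrange(bucket, 0, -1)
    # ... process values ...
    r.delete(bucket)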

How to clear the values of a key in Redis HyperLogLog

I'm using the Redis implementation of HyperLogLog to count distinct values for given keys.
The keys are based on an hour window. After the calendar hour changes, I want to reset the count of incoming values. I don't see any direct API for 'clearing' the values through Jedis.
SET cannot be used here because it would corrupt the hash. Is there a way to correctly "reset" the count for a given key?
Use the DEL command to delete the key, which will effectively reset the count.
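For example, since the keys already encode the hour, a sketch along those lines with redis-py (the key format is an assumption; the equivalent commands exist in Jedis):

# Sketch of resetting an hourly HyperLogLog by deleting its key; key format is an assumption.
from datetime import datetime
import redis

r = redis.Redis()

def hll_key():
    return "distinct:" + datetime.utcnow().strftime("%Y-%m-%d-%H")

def track(value):
    r.pfadd(hll_key(), value)

def reset_current_hour():
    # DEL removes the HyperLogLog entirely, so the count starts again from zero.
    r.delete(hll_key())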

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somewhat confused...
So should I just create a table with the UUID as the hash and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain time range regardless of their UUIDs?
DynamoDB introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, as the total throughput for a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions - that means if all your writes are to a single key (today's key) and you have a throughput of 20 writes per second, then with 20 partitions, your total throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket (a rough sketch follows below).
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic Map Reduce jobs, but I have not tried that myself yet, so I cannot say how easy/effective it is to work with.
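A rough sketch of this scheme with boto3 (the table name, bucket count, and attribute names are all assumptions):

# Rough sketch of date+bucket sharding with boto3; names and the bucket count are assumptions.
import random
import uuid
import boto3
from boto3.dynamodb.conditions import Key

NUM_BUCKETS = 10
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("time_series")

def put_record(date_str, payload):
    # Spread writes for the same day across NUM_BUCKETS hash keys to avoid a hot partition.
    shard = f"{date_str}#{random.randint(1, NUM_BUCKETS)}"
    table.put_item(Item={"day_bucket": shard, "uuid": str(uuid.uuid4()), "payload": payload})

def records_for_day(date_str):
    items = []
    # Issue one Query per bucket and merge the results.
    for bucket in range(1, NUM_BUCKETS + 1):
        resp = table.query(KeyConditionExpression=Key("day_bucket").eq(f"{date_str}#{bucket}"))
        items.extend(resp.get("Items", []))
    return items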
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. Every DynamoDB Query must specify the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You are also not able to issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a Scan of the table, adding a filter on the timestamp value in order to find the items to delete. Note that filtering will not reduce the consumed capacity, since the scan still reads the whole table.
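For illustration, a rough boto3 sketch of that scan-and-delete workaround (the table name, attribute names, and key schema are assumptions based on the question):

# Rough sketch of the scan-and-delete workaround with boto3; names and key schema are assumptions.
import time
import boto3
from boto3.dynamodb.conditions import Attr

MAX_AGE_SECONDS = 7 * 24 * 3600
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("messages")

def delete_old_messages():
    cutoff = int(time.time()) - MAX_AGE_SECONDS
    # The filter is applied after the read, so the whole table's capacity is still consumed.
    resp = table.scan(FilterExpression=Attr("messageTimestamp").lt(cutoff))
    for item in resp.get("Items", []):
        table.delete_item(Key={"uuid": item["uuid"], "messageTimestamp": item["messageTimestamp"]})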

faster redis lookup for collection of keys

I'm looking for a faster way to look up a collection of keys in Redis.
This is what I need to do:
HGET "user:001:coins" "2013-05-01"
It looks up stored coins for a user on a specific day.
Now I want to look up all stored coins for a date range of a month:
HGET "user:001:coins" "2013-05-01"
HGET "user:001:coins" "2013-05-02"
....
That's getting slow because I have to do that for 120 different users over 2 months. Is there a faster/better way to do this?
One idea I had would be to add another key which holds the calculated coin amount for a month, and always recalculate that key if there is a change.
HGET "user:001:coins" "2013-05"
But that would mean additional programming logic, which I would like to avoid.
Restructuring your data is not a bad idea even if it does require additional work. Fetching once is always faster than fetching N times.
If you can group your operations together, why not use HMGET?
HMGET "user:001:coins" "2013-05-01" "2013-05-02" ...