TTL update on every access vs. requery after expiry (Redis)

I am storing a user's details in Redis for one day.
Which of the approaches below is better, assuming this user is viewed 2k times a day?

1. TTL set one time
   - Set a TTL of 24 hours.
   - On every access to the user, fetch the data from Redis without updating the TTL.
   - After 24 hours, once the key has expired, query the data again and store it with a new TTL.
2. TTL set on every request
   - Set a TTL of 4 hours.
   - On every access to the user, fetch the data from Redis and update the TTL to 4 hours from now.
   - Unused keys expire automatically.

Question:
I feel the first approach is better because, unlike the second approach, we don't need to update the TTL on every view and only need to fire the query once a day.
Which approach is better?
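
To make the trade-off concrete, here is a minimal sketch of both approaches. It assumes a redis-py style client (anything offering `get`, `set(..., ex=...)` and `expire`); the `user:<id>` key format and the `load_user` callback that queries the primary store are hypothetical names chosen for illustration.

```python
import json

DAY = 24 * 60 * 60
FOUR_HOURS = 4 * 60 * 60


def get_user_fixed_ttl(client, user_id, load_user):
    """Approach 1: the TTL is set once; reads never touch it."""
    key = f"user:{user_id}"  # hypothetical key format
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    user = load_user(user_id)  # cache miss: query the primary store
    client.set(key, json.dumps(user), ex=DAY)
    return user


def get_user_sliding_ttl(client, user_id, load_user):
    """Approach 2: every read pushes the expiry 4 hours into the future."""
    key = f"user:{user_id}"  # hypothetical key format
    cached = client.get(key)
    if cached is not None:
        client.expire(key, FOUR_HOURS)  # one extra write command per read
        return json.loads(cached)
    user = load_user(user_id)
    client.set(key, json.dumps(user), ex=FOUR_HOURS)
    return user
```

With approach 1 a hot user costs one read per view plus one requery per day. With approach 2 each view costs a read and an `EXPIRE`, but a user who stops being viewed leaves the cache after 4 hours, and a user who is viewed continuously is never refreshed from the primary store.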

Related

How can I force expired big query partitions to delete immediately?

I see in https://stackoverflow.com/a/49105008/6191970 that expired partitions still take some unknown amount of time to be deleted, though they are no longer included in queries after expiration. I experimented with setting the partition expiration on a table that is partitioned hourly, like so:
ALTER TABLE `my.table`
SET OPTIONS ( partition_expiration_days=.1 )
I was surprised that, a few hours later, when I set the expiration back to its original limit of 90 days, all of the data was still there.
Is there any way to force deletion specifically of all expired partitions?
If not, what time frame is to be expected for this data to clear out?
This is a sensitive data security problem for my use case where we do not want old data to exist.
BigQuery retains deleted data for 7 days by default as part of its officially documented time travel feature. Account for that in your organization's policy.
https://cloud.google.com/bigquery/docs/time-travel

Finding Redis data by last update

I'm new to Redis and I want to use the following scheme:
key: EMPLOYEE_*ID*
value: *EMPLOYEE DATA*
I was thinking of adding a timestamp to the end of the key, but I'm not sure if that'll even help. Basically, I want to be able to get a list of the employees that are the most stale, i.e. those that have gone the longest without being updated. What's the best way to accomplish this in Redis?
Keep another key that maps the employees' key names to their last-update timestamps; the best candidate for that is a Sorted Set. To maintain that key's integrity, you'll have to update it whenever you update one of the employees' keys.
With that data structure in place, you can easily get the key names of the least recently updated employees with the ZRANGE command (it returns the lowest scores first), or the most recently updated ones with ZREVRANGE.
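
A minimal sketch of that index, assuming a redis-py style client (`pipeline`, `set`, `zadd` with a mapping, `zrange`). The index key name `employees:by_update` and the helper names are hypothetical; the `EMPLOYEE_<id>` key format comes from the question.

```python
import time

INDEX_KEY = "employees:by_update"  # hypothetical name for the Sorted Set index


def save_employee(client, employee_id, data, now=None):
    """Write the employee and record the update time in the index."""
    key = f"EMPLOYEE_{employee_id}"
    ts = time.time() if now is None else now
    pipe = client.pipeline()  # redis-py wraps this in MULTI/EXEC by default
    pipe.set(key, data)
    pipe.zadd(INDEX_KEY, {key: ts})  # score = last update time
    pipe.execute()


def stalest_employees(client, count):
    """Return up to `count` key names, least recently updated first."""
    if count <= 0:
        return []  # ZRANGE 0 -1 would otherwise return the whole set
    return client.zrange(INDEX_KEY, 0, count - 1)
```

Writing the value and the index entry in one pipeline keeps the two from drifting apart if the application fails between the two commands.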
Have you tried filtering by expiration time? You could set the same expiration on all keys and reset it each time a key is updated. Then, with a Lua script, you could iterate through the keys and filter by remaining time to live. The keys with the smallest remaining TTL are the ones that have gone longest without an update.
This works only under some assumptions, depending on how your system works. The approach is also O(N) in the number of employees, so although it saves space, it will not scale well with the number of entries or the frequency of the scan.

how to set expiry for every item in redis queue

I am using Jedis, a Redis Java client. I have a queue of string items. As usual, I am using LPUSH, LPOP, RPUSH and RPOP for the necessary operations. But I would like to set an expiry for each individual item in the queue. Is that possible?
This is not possible in redis by design for the sake of keeping redis simple and fast.
You can either store an expiry value along with each string in the list, or store a separate list of expiry times, so that your application can tell whether an item has expired.
There is also an alternative solution discussed here. You can store the values in a sorted set with expiry timestamps as scores and select only those members whose scores are greater than a certain timestamp. (This of course leaves it up to your app to clear the expired elements from the set.)
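
A minimal sketch of the sorted-set variant, written in Python rather than Jedis and assuming a redis-py style client (`zadd` with a mapping, `zrangebyscore`, `zremrangebyscore`). The function names are hypothetical. Note that a sorted set orders members by expiry time rather than insertion order and stores each distinct string only once, so it is not a drop-in replacement for a list-based queue.

```python
import time


def push_with_expiry(client, key, item, ttl_seconds, now=None):
    """Add an item whose score is the moment it expires."""
    now = time.time() if now is None else now
    client.zadd(key, {item: now + ttl_seconds})


def live_items(client, key, now=None):
    """Return items that have not expired yet (score strictly greater than now)."""
    now = time.time() if now is None else now
    return client.zrangebyscore(key, f"({now}", "+inf")  # "(" = exclusive bound


def purge_expired(client, key, now=None):
    """Delete expired items; returns how many were removed."""
    now = time.time() if now is None else now
    return client.zremrangebyscore(key, "-inf", now)
```

`purge_expired` would typically run from a periodic job, since Redis itself never removes individual members of a sorted set.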

Is it possible to determine the initial TTL of a volatile key?

In a key-value persistence API I'm porting to Redis, I'm trying to implement a function that updates the time to live for a key. The original code stores the TTL as a timestamp and a number of minutes; the TTL is updated by writing a new timestamp (the key expires after timestamp + delta).
I've noticed that Redis provides a TTL command, but that only provides the time remaining.
What I'm wondering is if there is a way to retrieve the original TTL from Redis (set with EXPIRE, etc), or if I need to add a TTL meta field to the values I'm storing (like the original code does).
Edit:
I'm using Redis Server v2.4.10
Internally, Redis converts the TTL into a Unix timestamp (see the function expireGenericCommand in db.c). So Redis cannot return the TTL you specified, simply because it does not store it in that form.
You will need to add a TTL meta field if it is important for your application.
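
One way to keep such a meta field is to store the value and its original TTL side by side in a hash. The sketch below assumes a redis-py style client (`pipeline`, `hset`, `hget`, `expire`); the field names `value` and `ttl` and both helper names are hypothetical. It sets one hash field per HSET call, which as far as I know also works on older servers such as 2.4.

```python
def put_with_ttl(client, key, value, ttl_seconds):
    """Store the value together with the TTL it was created with."""
    pipe = client.pipeline()
    pipe.hset(key, "value", value)
    pipe.hset(key, "ttl", ttl_seconds)  # the meta field Redis does not keep for us
    pipe.expire(key, ttl_seconds)
    pipe.execute()


def touch(client, key):
    """Reset the expiry to the original TTL. Returns False if the key is gone."""
    ttl = client.hget(key, "ttl")
    if ttl is None:
        return False
    # int() accepts the bytes or str that the client returns for the field
    return bool(client.expire(key, int(ttl)))
```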

Real time analytic processing system design

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends).
The system should be fast, robust, and scalable.
The system is Java-based (on Linux).
The data arrives from a system that generates CSV log files of user transactions.
That system generates a file every minute, and each file contains the transactions of different users (sorted by time); each file may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
.
.
.
The system I am planning should process the files and perform some analysis in real-time.
It has to gather the input, send it to several algorithms and other systems, and store the computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions, for example trends.
The first algorithm I am planning to use works best with at least 10 records per user; if it cannot find 10 records within 5 minutes, it should use whatever data is available.
I would like to use Storm for the implementation, but I would prefer to keep this discussion at the design level as much as possible.
A list of system components:
1. A task that monitors incoming files every minute.
2. A task that reads the file, parses it, and makes it available to other system components and algorithms.
3. A component that buffers 10 records per user (for no longer than 5 minutes). When 10 records are gathered, or 5 minutes have passed, it sends the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm's fields grouping (which means the same task gets called for the same user) and tracking the collection of each user's 10 records inside the task. Of course, I plan to have several of these tasks, each handling a portion of the users.
There are other components that work on a single transaction. For them I plan to create other tasks that receive each transaction as it gets parsed (in parallel with the other tasks).
I need your help with #3.
What are the best practices for designing such a component?
Obviously it needs to maintain up to 10 records per user.
A key-value map may help. Is it better to have the map managed in the task itself or to use a distributed cache?
For example, Redis, a key-value store (I have never used it before).
Thanks for your help.
I have worked with Redis quite a bit, so I'll comment on your idea of using it.
#3 has 3 requirements:
- Buffer per user
- Buffer of 10 records
- Buffer should expire after 5 minutes
1. Buffer per user:
Redis is just a key-value store. Although it supports a wide variety of data types, they are always values mapped to a string key. So you should decide how to identify a user uniquely if you need a per-user buffer, because Redis never raises an error when you overwrite a key with a new value. One solution might be to check for existence before writing.
2. Buffer of 10 records: You can certainly implement a queue in Redis, but restricting its size is left to you, e.g. using LPUSH and LTRIM, or using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you decided on in part 1.
3. Buffer expires in 5 minutes: This is the toughest part. In Redis every key, irrespective of the data type of its value, can have an expiry. But the expiry process is silent: you are not handed the value when a key expires. So you will silently lose your buffer if you rely on this property. One workaround is to keep an index that maps a timestamp to the keys that are due to expire at that timestamp. Then, in the background, you can read the index every minute, read each due key, delete it from Redis manually, and call your desired process with the buffer data. For such an index, look at Sorted Sets: the timestamp is the score and the set members are the keys (the unique per-user key decided in part 1, which maps to a queue) you wish to delete at that timestamp. You can use ZRANGEBYSCORE to read all set members whose score is at or before the current timestamp.
Overall:
- Use a Redis List to implement the queue.
- Use LLEN to make sure you do not exceed your limit of 10.
- Whenever you create a new list, add an entry to the index (Sorted Set) with the current timestamp + 5 minutes as the score and the list's key as the member.
- When LLEN reaches 10, read the list, then remove its key from the index (Sorted Set) and from the database (delete the key and its list). Then trigger your process with the data.
- Every minute, generate the current timestamp, read the index, and for every due key read the data, remove the key from the database, and trigger your process.
This is how I might implement it. There may be other, better ways to model your data in Redis.
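
The steps above can be sketched roughly as follows, in Python rather than Java and assuming a redis-py style client (`rpush`, `lrange`, `delete`, `zadd`, `zrem`, `zrangebyscore`). The key names `buffer:<user>` and `buffer:deadlines`, the helper names, and the `process` callback are hypothetical. The sketch uses the list length returned by RPUSH instead of a separate LLEN call, and it is not atomic; a production version would likely wrap each flush in MULTI/EXEC or a Lua script.

```python
import time

INDEX = "buffer:deadlines"  # hypothetical Sorted Set: member = list key, score = flush deadline
LIMIT = 10                  # records per user before the algorithm runs
MAX_AGE = 5 * 60            # seconds a buffer may wait


def add_record(client, user, record, process, now=None):
    """Append a record to the user's buffer; flush once LIMIT is reached."""
    now = time.time() if now is None else now
    key = f"buffer:{user}"
    length = client.rpush(key, record)  # RPUSH returns the new list length
    if length == 1:
        # First record for this user: register the flush deadline.
        client.zadd(INDEX, {key: now + MAX_AGE})
    if length >= LIMIT:
        flush(client, key, process)


def flush(client, key, process):
    """Read the buffer, remove it and its index entry, then hand the data on."""
    records = client.lrange(key, 0, -1)
    client.delete(key)
    client.zrem(INDEX, key)
    if records:
        process(key, records)


def flush_due(client, process, now=None):
    """Run every minute: flush all buffers whose deadline has passed."""
    now = time.time() if now is None else now
    for key in client.zrangebyscore(INDEX, "-inf", now):
        flush(client, key, process)
```

A scheduler would call `flush_due` once a minute, while the parsing task calls `add_record` for each transaction.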
For your requirements 1 & 2: [Apache Flume or Kafka]
For your requirement #3: [Esper Bolt inside Storm. To accomplish this in Redis, you would have to rewrite the Esper logic yourself.]