Calculate the number of events in the last 24 hours with Redis

It seems like a common task, but I haven't found a solution.
I need to count a user's events (for example, how many comments they left) in the last 24 hours. Older data doesn't interest me, so information about comments added a month ago should be removed from Redis.
Right now I see only one solution: make keys that include the user's ID and the hour of the day, and increment their values. Then we read 24 values and calculate their sum. Each key expires after 24 hours.
For example,
Event at Jun 22, 13:27 -> creating key _22_13 = 1
Event at Jun 22, 13:40 -> incrementing key _22_13 = 2
Event at Jun 22, 18:45 -> creating key _22_18 = 1
Event at Jun 23, 16:00 -> creating key _23_16 = 1
Getting the sum of events at Jun 23, 16:02 -> sum of keys _22_17 - _22_23 and _23_00 - _23_16: of our keys only _22_18 and _23_16 still exist, so the result is 2. Key _22_13 has expired.
This method is not accurate (it counts events over a window of 24 hours plus up to one extra hour) and not universal (which keys would I pick if I needed the number of events in the last 768 minutes, or the last 2.5 months?).
Do you have better solution with Redis data types?

Your model seems fine. Of course it isn't universal, but that's what you have to sacrifice to stay performant.
I suggest you keep doing this. If you ever need another timespan for reporting (768 minutes), you can always compute it from MySQL, where your comments are stored (you do store them there, right?). That will be slower, but at least the query can be served.
If you need faster responses or higher precision, you can store counters in Redis at minute resolution (one counter per minute), as sketched below.
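For illustration, a minimal sketch of the minute-resolution variant, assuming redis-py; the events:<user_id>:<minute> key pattern and the function names are made up for the example:

import time
import redis

r = redis.Redis()

WINDOW_MINUTES = 24 * 60  # last 24 hours at minute resolution

def record_event(user_id):
    minute = int(time.time() // 60)                  # current Unix minute
    key = f"events:{user_id}:{minute}"
    pipe = r.pipeline()
    pipe.incr(key)                                   # one counter per user per minute
    pipe.expire(key, (WINDOW_MINUTES + 1) * 60)      # keep slightly longer than the window
    pipe.execute()

def count_last_24h(user_id):
    now = int(time.time() // 60)
    keys = [f"events:{user_id}:{m}" for m in range(now - WINDOW_MINUTES + 1, now + 1)]
    return sum(int(v) for v in r.mget(keys) if v is not None)

The same idea generalizes to other windows (say, 768 minutes) by changing WINDOW_MINUTES, at the cost of one MGET over as many keys as there are minutes in the window.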

You can use the Redis EXPIRE command after each key creation.
SET comment "Hello, world"
EXPIRE comment 1000 // in seconds
PEXPIRE comment 1000 // in milliseconds
Details here.
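The same commands from redis-py, as a quick sketch (the expiry can also be set atomically via SET's ex/px options):

import redis

r = redis.Redis()

r.set("comment", "Hello, world")
r.expire("comment", 1000)        # TTL in seconds
r.pexpire("comment", 1000)       # TTL in milliseconds

# or in one call:
r.set("comment", "Hello, world", ex=1000)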

Related

Azure Stream Analytics takes too long to process the events

I'm trying to configure an Azure Stream Analytics job, but I'm consistently getting bad performance. A client system pushes data into an Event Hub, and the ASA job queries it into an Azure SQL database.
A few days ago I noticed that it was generating a large number of InputEventLateBeyondThreshold errors. Here is an example event from the ASA input. The Timestamp element is set by the client system.
{
    "Tag": "MANG_POWER_5",
    "Value": 1.08411181,
    "ValueType": "Analogue",
    "Timestamp": "2022-02-01T09:00:00.0000000Z",
    "EventProcessedUtcTime": "2022-02-01T09:36:05.1482308Z",
    "PartitionId": 0,
    "EventEnqueuedUtcTime": "2022-02-01T09:00:00.8980000Z"
}
You can see that the event arrives quite quickly, but it takes more than 30 minutes to be processed. To try to avoid InputEventLateBeyondThreshold errors, I increased the late event threshold. This may be contributing to the increased processing time, but having it too low also increases the number of InputEventLateBeyondThreshold errors.
The Watermark Delay metric is consistently high, yet SU usage is around 5%. I have increased the SUs as high as I can for this query.
I'm trying to figure out why it takes so long to process the events once they have arrived.
This is the query I'm using:
WITH PIDataSet AS (SELECT * FROM [<event-hub>] TIMESTAMP BY timestamp)
--Write data to SQL joining with a lookup
SELECT
i.Timestamp as timestamp,
i.Value as value
INTO [<sql-database>]
FROM PIDataSet as i
INNER JOIN [tagmapping-ref-alias] tm ON tm.sourcename = i.Tag
----Write data to AzureTable joining with a lookup
SELECT
DATEDIFF(second,CAST('1970-01-01' as DateTime), I1.Timestamp) As Rowkey,
I2.TagId as PartitionKey,
I1.Value as Value,
UDF.formatTime(I1.Timestamp) as DeviceTimeStamp
into [<azure-table>]
FROM PIDataSet as I1
JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag
--Get an hourly count into a SQL Table.
SELECT
I2.TagId,
System.Timestamp() as WindowEndTime, COUNT(I2.TagId) AS InputCount
into [tagmeta-ref-alias]
FROM PIDataSet as I1
JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag
GROUP BY I2.TagId, TumblingWindow(Duration(hour, 1))
When you set up a 59-minute out-of-order window, you set up a 59-minute buffer for that input. When records land in that buffer, they wait 59 minutes before they get out. What you get in exchange is the opportunity to re-order these events so they appear in order to the job.
Using a 1-hour window is an extreme setting that will, by definition, automatically give you 59 minutes of watermark delay. This is very surprising, and I'm wondering why you need such a high value.
Edit
Now looking at the late arrival policy.
You are using event time (TIMESTAMP BY timestamp), which means that your events can now be late; see this doc and this one.
What this means is that when a record arrives more than 1 hour late (its timestamp is older than 1 hour compared to the wall clock on our servers, in UTC), we adjust its timestamp to our wall clock minus 1 hour and send it to the query. It also means that your tumbling window always has to wait an additional hour to be sure it isn't missing those late records.
Here is what I would do: restore the default settings (no out-of-order window, 5-second late events, adjust events). Then, when you get InputEventLateBeyondThreshold, it means the job received a timestamp that was more than 5 seconds in the past. You're not losing the data; we adjust its System.Timestamp() to a more recent value (but not the timestamp field, which we don't change).
What we then need to understand is why it takes more than 5 seconds for a record in your pipeline to go from production to consumption. Is it because you have big delays in your ingestion pipeline, or because there is a time skew on your producer's clock? Do you know?
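As a rough sanity check, here is a small Python sketch that compares the timestamps from the sample event above to see where the time accumulates; the field names are taken from that sample:

from datetime import datetime

def parse(ts):
    # the sample uses 7 fractional digits, strptime's %f accepts at most 6
    head, frac = ts[:-1].split(".")
    return datetime.strptime(f"{head}.{frac[:6]}+0000", "%Y-%m-%dT%H:%M:%S.%f%z")

produced  = parse("2022-02-01T09:00:00.0000000Z")   # Timestamp (set by the client)
enqueued  = parse("2022-02-01T09:00:00.8980000Z")   # EventEnqueuedUtcTime
processed = parse("2022-02-01T09:36:05.1482308Z")   # EventProcessedUtcTime

print("producer -> Event Hub:", (enqueued - produced).total_seconds(), "s")                 # ~0.9 s
print("Event Hub -> ASA processing:", (processed - enqueued).total_seconds() / 60, "min")   # ~36 min

Here the ingestion delay is under a second, so producer clock skew and Event Hub are not the bottleneck; the delay accumulates on the Stream Analytics side.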

How cache entry's valid period is calculated in MULE4?

If I cache a payload, how long will it be valid?
There are two settings in the caching strategy:
Entry TTL and
Expiration Interval.
If I want to invalidate my cached value after 8 hours, how should I set these parameters?
What is the 'Invalidate Cache' processor used for?
Entry TTL is how long an entry should live in the cache. Expiration Interval is how frequently the object store will check the entries to see whether any of them should be deleted. In your case, entryTTL should be 8 hours. Be mindful of the units used for each attribute. Expiration Interval is a bit trickier: you may want to check entries much more frequently so they don't live much longer than 8 hours before expiring. That could be 10 minutes, 30 minutes, 1 hour, or whatever works for you.
I explain it in more detail in my blog: https://medium.com/#adobni/configuring-an-object-store-in-mule-4-5da609e3456a

How do I create unique values for labeling daylight saving extra hours?

I need to manage daylight saving time data, so hours that repeat twice in my table, or that have no data at all, need to be coded with a unique value that tells me which hour corresponds to the daylight-saving switch. I am trying to do this in Pentaho Kettle.
If you need a sequence or timeseries, base it on UTC
There's a conflict in your requirements, because there simply is no clean sequence for daylight-saving hours on the switching days that still maps automatically onto a 24-hour clock.
If you use sequential numbers, you end up with 23 or 25 records on the switching days, which means you would need a custom mapping back to UTC to keep things working after hour 3. For any kind of time series, UTC is much more convenient and less error-prone.
If you keep 24 hours in a day and loosen the sequence instead, you will need to insert something like a 2.0 for the lost hour in spring and a 2.1 and a 2.2 for the autumn switch. That just adds more complexity, though:
The dummy 2.0 record cannot have activity associated with it (by definition the hour is skipped entirely), and if it did, it would break date parsing wherever you tried to use it.
The 2.1 and 2.2 records at least let you see which activity occurred first, but you would need custom coding for this case everywhere, both splitting and merging the records from the double hour based on their UTC timestamps... notice a pattern here?
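To make that concrete, a small Python sketch showing how the repeated autumn hour disambiguates cleanly once you key on UTC; the Europe/Amsterdam zone and the 2022 switch date are just illustrative assumptions:

from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("Europe/Amsterdam")
# 02:30 occurs twice on 2022-10-30; fold distinguishes the two occurrences
first  = datetime(2022, 10, 30, 2, 30, tzinfo=tz, fold=0)
second = datetime(2022, 10, 30, 2, 30, tzinfo=tz, fold=1)

for local in (first, second):
    print(local.isoformat(), "->", local.astimezone(ZoneInfo("UTC")).isoformat())
# 2022-10-30T02:30:00+02:00 -> 2022-10-30T00:30:00+00:00
# 2022-10-30T02:30:00+01:00 -> 2022-10-30T01:30:00+00:00

The UTC values stay unique and strictly ordered, so a sequence or time series built on them never needs the 2.0/2.1/2.2 workaround.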

Keep item in list for certain time

I am not a Redis expert at all. Today I ran into an idea, but I don't know whether it is possible in Redis.
I want to store a list of values, but only for some time; for example, a list of IP addresses that visited a page in the last 5 minutes. As far as I know, I can't set EXPIRE on a single list/hash item, right? So I push 1, 2, 3 into a list/hash, but after a certain constant time I want each item to expire/disappear. Or maybe instead of a list a hash structure would be more suitable: { '1': timestamp-when-to-disappear, ... }?
Or maybe the only solution is
SET test.1.1 1
EXPIRE test.1.1 60
SET test.1.2 2
EXPIRE test.1.2 60
SET test.1.3 3
EXPIRE test.1.3 60
# to retrieve, can I pipeline KEYS output to MGET?
KEYS test.1.*
Use a sorted set instead.
Log the IP along with the timestamp (as the score) in a sorted set. During retrieval, use that timestamp to select the range you need. In a scheduler, periodically delete entries whose score falls outside the range.
Example:
zadd test 1465371055 1.1
zadd test 1465381055 1.3
zadd test 1465391055 1.1
Your sorted set will contain 1.1 and 1.3, where 1.1 now has the new score 1465391055.
Now on retrieval use
zrangebyscore test min max
min -> currenttime - (5*60)
max -> currenttime
(the scores are Unix timestamps in seconds, as in the zadd examples above), and you will get the IPs that visited in the last 5 minutes.
In another scheduler-like thread you need to delete the entries that have aged out:
zremrangebyscore test -inf max
max -> currenttime - (10*60) -> use any retention period at least as long as your query window
This removes every entry older than 10 minutes.
Also understand that if the number of distinct IPs is very large, the sorted set will grow rapidly. Your scheduler thread must run reliably to keep memory under control.
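Put together as a minimal redis-py sketch (the visits:<page> key name and the function names are made up for the example; the scores are Unix timestamps in seconds):

import time
import redis

r = redis.Redis()
WINDOW = 5 * 60  # last 5 minutes

def record_visit(page, ip):
    # member = the IP, score = visit time; a repeat visit just refreshes the score
    r.zadd(f"visits:{page}", {ip: time.time()})

def visitors_last_5_min(page):
    key = f"visits:{page}"
    now = time.time()
    r.zremrangebyscore(key, "-inf", now - WINDOW)    # trim aged-out entries inline
    return [ip.decode() for ip in r.zrangebyscore(key, now - WINDOW, now)]

Trimming on read keeps the set bounded without a separate scheduler, though a periodic cleanup job still helps for pages that are rarely read.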

Bitcoin Exchange API - more frequent high low

Is there any way to get a high/low value more frequently than every 24 hours from, say, the Bitstamp API ticker?
This link only tells you how to get the value over the last 24 hours:
https://www.bitstamp.net/api/
(This also seems to be a problem with every other exchange I've tried.)
The 24-hour figure is computed as time_now - 24 hours, so it should give you updated rates every second, or maybe every minute, depending on how the API is configured.
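If a shorter window is what you're after, one workaround is to poll the ticker yourself and keep your own rolling high/low. A rough Python sketch, assuming Bitstamp's public v2 ticker endpoint and its "last" field (check the API docs linked above for the current URL and schema):

import json
import time
from collections import deque
from urllib.request import urlopen

TICKER_URL = "https://www.bitstamp.net/api/v2/ticker/btcusd/"  # assumed endpoint
WINDOW = 60 * 60  # keep a 1-hour rolling window instead of the API's 24 hours

samples = deque()  # (timestamp, last_price)

def poll_high_low():
    with urlopen(TICKER_URL) as resp:
        last = float(json.load(resp)["last"])
    now = time.time()
    samples.append((now, last))
    while samples and samples[0][0] < now - WINDOW:
        samples.popleft()                 # drop prices older than the window
    prices = [p for _, p in samples]
    return min(prices), max(prices)

# call poll_high_low() from a scheduler, e.g. once a minute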