In a Postgres DB I have entries for "events", associated with an id, and when they happened. I need to count them with a special rule.
When an event happens the counter is incremented and for the next 14 days all events of this type are not counted.
Example:
event
created_at
blockdate
action
16
2021-11-11 11:15
25.11.21
count
16
2021-11-11 11:15
25.11.21
block
16
2021-11-13 10:45
25.11.21
block
16
2021-11-16 10:40
25.11.21
block
16
2021-11-23 11:15
25.11.21
block
16
2021-11-23 11:15
25.11.21
block
16
2021-12-10 13:00
24.12.21
count
16
2021-12-15 13:25
24.12.21
block
16
2021-12-15 13:25
24.12.21
block
16
2021-12-15 13:25
24.12.21
block
16
2021-12-20 13:15
24.12.21
block
16
2021-12-23 13:15
24.12.21
block
16
2021-12-31 13:25
14.01.22
count
16
2022-02-05 15:00
19.02.22
count
16
2022-02-05 15:00
19.02.22
block
16
2022-02-13 17:15
19.02.22
block
16
2022-02-21 10:09
07.03.22
count
43
2021-11-26 11:00
10.12.21
count
43
2022-01-01 15:00
15.01.22
count
43
2022-04-13 10:07
27.04.22
count
43
2022-04-13 10:09
27.04.22
block
43
2022-04-13 10:09
27.04.22
block
43
2022-04-13 10:09
27.04.22
block
43
2022-04-13 10:10
27.04.22
block
43
2022-04-13 10:10
27.04.22
block
43
2022-04-13 10:47
27.04.22
block
43
2022-05-11 20:25
25.05.22
count
75
2021-10-21 12:50
04.11.21
count
75
2021-11-02 12:50
04.11.21
block
75
2021-11-18 11:15
02.12.21
count
75
2021-11-18 12:55
02.12.21
block
75
2021-11-18 16:35
02.12.21
block
75
2021-11-24 11:00
02.12.21
block
75
2021-12-01 11:00
02.12.21
block
75
2021-12-14 13:25
28.12.21
count
75
2021-12-15 13:35
28.12.21
block
75
2021-12-26 13:25
28.12.21
block
75
2022-01-31 15:00
14.02.22
count
75
2022-02-02 15:30
14.02.22
block
75
2022-02-03 15:00
14.02.22
block
75
2022-02-17 15:00
03.03.22
count
75
2022-02-17 15:00
03.03.22
block
75
2022-02-18 15:00
03.03.22
block
75
2022-02-23 15:00
03.03.22
block
75
2022-02-25 15:00
03.03.22
block
75
2022-03-04 10:46
18.03.22
count
75
2022-03-08 21:05
18.03.22
block
In Excel I simply add two columns. In one column I carry over a "blockdate", a date until when events have to be blocked. In the other column I compare the ID with the previous ID and the previous "blockdate".
When the IDs a different or the blockdate is less then the current date, I have to count. When I have to count, I set the row's blockdate to the current date + 14 days, otherwise I carry over the previous blockdate.
I tried now to solve this in Postgres with ...
window functions
recursive CTEs
lateral joins
... and all seemed a bit promising, but in the end I failed to implement this tricky count.
For example, my recursive CTE failed with:
aggregate functions are not allowed in WHERE
with recursive event_count AS (
select event
, min(created_at) as created
from test
group by event
union all
( select event
, created_at as created
from test
join event_count
using(event)
where created_at >= max(created) + INTERVAL '14 days'
order by created_at
limit 1
)
)
select * from event_count
Window functions, using lag() to access the previous row don't seem to work because they cannot access columns in the previous row which were created using the window function.
Adding a "block-or-count" information upon entering a new event entry by simply comparing with the last entry wouldn't solve the issue as event entries "go away" after about half a year. So when the first entry goes away, the next one becomes the first and the logic has to be applied on the new situation.
Above test data can be created with:
CREATE TABLE test (
event INTEGER,
created_at TIMESTAMP
);
INSERT INTO test (event, created_at) VALUES
(16, '2021-11-11 11:15'),(16, '2021-11-11 11:15'),(16, '2021-11-13 10:45'),(16, '2021-11-16 10:40'),
(16, '2021-11-23 11:15'),(16, '2021-11-23 11:15'),(16, '2021-12-10 13:00'),(16, '2021-12-15 13:25'),
(16, '2021-12-15 13:25'),(16, '2021-12-15 13:25'),(16, '2021-12-20 13:15'),(16, '2021-12-23 13:15'),
(16, '2021-12-31 13:25'),(16, '2022-02-05 15:00'),(16, '2022-02-05 15:00'),(16, '2022-02-13 17:15'),
(16, '2022-02-21 10:09'),
(43, '2021-11-26 11:00'),(43, '2022-01-01 15:00'),(43, '2022-04-13 10:07'),(43, '2022-04-13 10:09'),
(43, '2022-04-13 10:09'),(43, '2022-04-13 10:09'),(43, '2022-04-13 10:10'),(43, '2022-04-13 10:10'),
(43, '2022-04-13 10:47'),(43, '2022-05-11 20:25'),
(75, '2021-10-21 12:50'),(75, '2021-11-02 12:50'),(75, '2021-11-18 11:15'),(75, '2021-11-18 12:55'),
(75, '2021-11-18 16:35'),(75, '2021-11-24 11:00'),(75, '2021-12-01 11:00'),(75, '2021-12-14 13:25'),
(75, '2021-12-15 13:35'),(75, '2021-12-26 13:25'),(75, '2022-01-31 15:00'),(75, '2022-02-02 15:30'),
(75, '2022-02-03 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-18 15:00'),
(75, '2022-02-23 15:00'),(75, '2022-02-25 15:00'),(75, '2022-03-04 10:46'),(75, '2022-03-08 21:05');
This lends itself to a procedural solution, since it has to walk the whole history of existing rows for each event. But SQL can do it, too.
The best solution heavily depends on cardinalities, data distribution, and other circumstances.
Assuming unfavorable conditions:
Big table.
Unknown number and identity of relevant events (event IDs).
Many rows per event.
Some overlap the 14-day time frame, some don't.
Any number of duplicates possible.
You need an index like this one:
CREATE INDEX test_event_created_at_idx ON test (event, created_at);
Then the following query emulates an index-skip scan. If the table is vacuumed enough, it operates with index-only scans exclusively, in a single pass:
WITH RECURSIVE hit AS (
(
SELECT event, created_at
FROM test
ORDER BY event, created_at
LIMIT 1
)
UNION ALL
SELECT t.*
FROM hit h
CROSS JOIN LATERAL (
SELECT t.event, t.created_at
FROM test t
WHERE (t.event, t.created_at)
> (h.event, h.created_at + interval '14 days')
ORDER BY t.event, t.created_at
LIMIT 1
) t
)
SELECT count(*) AS hits FROM hit;
fiddle
I cannot stress enough how fast it's going to be. :)
It's a recursive CTE using a LATERAL subquery, all based on the magic of ROW value comparison (which not all major RDBMS supported properly).
Effectively, we make Postgres skip over the above index once and only take qualifying rows.
For detailed explanation, see:
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Efficiently selecting distinct (a, b) from big table
Optimize GROUP BY query to retrieve latest row per user (chapter 1a)
Different approach?
Like you mention yourself, the unfortunate task definition forces you to re-compute all newer rows for events where old data changes.
Consider working with a constant raster instead. Like a 14-day grid starting from Jan 1 every year. Then the state of each event could be derived from the local frame. Much cheaper and more reliable.
I cannot think of how to do this without recursion.
with recursive ordered as ( -- Order and number the event instances
select event, created_at,
row_number() over (partition by event
order by created_at) as n
from test
), walk as (
-- Get and keep first instances
select event, created_at, n, created_at as current_base, true as keep
from ordered
where n = 1
union all
-- Carry base dates forward and mark records to keep
select c.event, c.created_at, c.n,
case
when c.created_at >= p.current_base + interval '14 days'
then c.created_at
else p.current_base
end as current_base,
(c.created_at >= p.current_base + interval '14 days') as keep
from walk p
join ordered c
on (c.event, c.n) = (p.event, p.n + 1)
)
select *
from walk
order by event, n;
Fiddle Here
I have a db with 6 tables. Each table has a list of date and datetime columns as shown below
Table 1 Table 2 .... Table 6
Date_of_birth Exam_date exam_datetime Result_date Result_datetime
2190-01-13 2192-01-13 2192-01-13 09:00:00 2194-04-13 2194-04-13 07:12:00
2184-05-21 2186-05-21 2186-05-21 07:00:00 2188-02-03 2188-02-03 09:32:00
2181-06-17 2183-06-17 2183-06-17 05:00:00 2185-07-23 2185-07-23 12:40:00
What I would like to do is shift all these future days back to the past date (definitely has to be less than the current date) but retain the same chronological order. Meaning, we can see that the person was born first, then he took the exam, and finally, he got his results.
In addition, I should be able to revert the changes and get back the future dates again.
I expect my output to be something like below
Stage 1 - shift back to old days (it can be any day but it has to be in the past and retain chronological order)
Table 1 Table 2 .... Table 6
Date_of_birth Exam_date exam_datetime Result_date Result_datetime
1990-01-13 1992-01-13 1992-01-13 09:00:00 1994-04-13 1994-04-13 07:12:00
1984-05-21 1986-05-21 1986-05-21 07:00:00 1988-02-03 1988-02-03 09:32:00
1981-06-17 1983-06-17 1983-06-17 05:00:00 1985-07-23 1985-07-23 12:40:00
Stage 2 - Shift forward to future days as how it was earlier
Table 1 Table 2 .... Table 6
Date_of_birth Exam_date exam_datetime Result_date Result_datetime
2190-01-13 2192-01-13 2192-01-13 09:00:00 2194-04-13 2194-04-13 07:12:00
2184-05-21 2186-05-21 2186-05-21 07:00:00 2188-02-03 2188-02-03 09:32:00
2181-06-17 2183-06-17 2183-06-17 05:00:00 2185-07-23 2185-07-23 12:40:00
Subtract two centuries:
update table1
set date_of_birth = date_of_birth - interval '200 year';
You can do something similar for all the other dates.
I am dealing with the following problem in SQL (using Vertica):
In short -- Create a timeline for each ID (in a table where I have multiple lines, orders in my example, per ID)
What I would like to achieve -- At my disposal I have a table on historical order date and I would like to compute new customer (first order ever in the past month), active customer- (>1 order in last 1-3 months), passive customer- (no order for last 3-6 months) and inactive customer (no order for >6 months) rates.
Which steps I have taken so far -- I was able to construct a table similar to the example presented below:
CustomerID Current order date Time between current/previous order First order date (all-time)
001 2015-04-30 12:06:58 (null) 2015-04-30 12:06:58
001 2015-09-24 17:30:59 147 05:24:01 2015-04-30 12:06:58
001 2016-02-11 13:21:10 139 19:50:11 2015-04-30 12:06:58
002 2015-10-21 10:38:29 (null) 2015-10-21 10:38:29
003 2015-05-22 12:13:01 (null) 2015-05-22 12:13:01
003 2015-07-09 01:04:51 47 12:51:50 2015-05-22 12:13:01
003 2015-10-23 00:23:48 105 23:18:57 2015-05-22 12:13:01
A little bit of intuition: customer 001 placed three orders from which the second one was 147 days after its first order. Customer 002 has only placed one order in total.
What I think that the next steps should be -- I would like to know for each date (also dates on which a certain user did not place an order), for each CustomerID, how long it has been since his/her last order. This would imply that I would create some sort of timeline for each CustomerID. In the example presented above I would get 287 (days between 1st of May 2015 and 11th of February 2016, the timespan of this table) lines for each CustomerID. I have difficulties solving this previous step. When I have performed this step I want to create a field which shows at each date the last order date, the period between the last order date and the current date, and what state someone is in at the current date. For the example presented earlier, this would look something like this:
CustomerID Last order date Current date Time between current date /last order State
001 2015-04-30 12:06:58 2015-05-01 00:00:00 0 00:00:00 New
...
001 2015-04-30 12:06:58 2015-06-30 00:00:00 60 11:53:02 Active
...
001 2015-09-24 17:30:59 2016-02-01 00:00:00 129 11:53:02 Passive
...
...
002 2015-10-21 17:30:59 2015-10-22 00:00:00 0 06:29:01 New
...
002 2015-10-21 17:30:59 2015-11-30 00:00:00 39 06:29:01 Active
...
...
003 2015-05-22 12:13:01 2015-06-23 00:00:00 31 11:46:59 Active
...
003 2015-07-09 01:04:51 2015-10-22 00:00:00 105 11:46:59 Inactive
...
At the dots there should be all the inbetween dates but for sake of space I have left these out of the table.
When I know for each date what the state is of each customer (active/passive/inactive) my plan is to sum the states and group by date which should give me the sum of new, active, passive and inactive customers. From here on I can easily compute the rates at each date.
Anybody that knows how I can possibly achieve this task?
Note -- If anyone has other ideas how to achieve the goal presented above (using some other approach compared to the approach I had in mind) please let me know!
EDIT
Suppose you start from a table like this:
SQL> select * from ord order by custid, ord_date ;
custid | ord_date
--------+---------------------
1 | 2015-04-30 12:06:58
1 | 2015-09-24 17:30:59
1 | 2016-02-11 13:21:10
2 | 2015-10-21 10:38:29
3 | 2015-05-22 12:13:01
3 | 2015-07-09 01:04:51
3 | 2015-10-23 00:23:48
(7 rows)
You can use Vertica's Timeseries Analytic Functions TS_FIRST_VALUE(), TS_LAST_VALUE() to fill gaps and interpolate last_order date to the current date:
Then you just have to join this with a Vertica's TimeSeries generated from the same table with interval one day starting from the first day each customer did place his/her first order up to now (current_date):
select
custid,
status_dt,
last_order_dt,
case
when status_dt::date - last_order_dt::date < 30 then case
when nord = 1 then 'New' else 'Active' end
when status_dt::date - last_order_dt::date < 90 then 'Active'
when status_dt::date - last_order_dt::date < 180 then 'Passive'
else 'Inactive'
end as status
from (
select
custid,
last_order_dt,
status_dt,
conditional_true_event (first_order_dt is null or
last_order_dt > lag(last_order_dt))
over(partition by custid order by status_dt) as nord
from (
select
custid,
ts_first_value(ord_date) as first_order_dt ,
ts_last_value(ord_date) as last_order_dt ,
dt::date as status_dt
from
( select custid, ord_date from ord
union all
select distinct(custid) as custid, current_date + 1 as ord_date from ord
) z timeseries dt as '1 day' over (partition by custid order by ord_date)
) x
) y
where status_dt <= current_date
order by 1, 2
;
And you will get something like this:
custid | status_dt | last_order_dt | status
--------+------------+---------------------+---------
1 | 2015-04-30 | 2015-04-30 12:06:58 | New
1 | 2015-05-01 | 2015-04-30 12:06:58 | New
1 | 2015-05-02 | 2015-04-30 12:06:58 | New
...
1 | 2015-05-29 | 2015-04-30 12:06:58 | New
1 | 2015-05-30 | 2015-04-30 12:06:58 | Active
1 | 2015-05-31 | 2015-04-30 12:06:58 | Active
...
etc.
I have a table where our product records its activity log. The product starts working at 23:00 every day and usually works one or two hours. This means that once a batch started at 23:00, it finishes about 1:00am next day.
Now, I need to take statistics on how many posts are registered per batch but cannot figure out a script that would allow me achiving this. So far I have following SQL code:
SELECT COUNT(*), DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
which results in following
count day hour
....
1189 9 23
8611 10 0
2754 10 23
6462 11 0
1885 11 23
I.e. I want the number for 9th 23:00 grouped with the number for 10th 00:00, 10th 23:00 with 11th 00:00 and so on. How could I do it?
You can do it very easily. Use DATEADD to add an hour to the original registrationtime. If you do so, all the registrationtimes will be moved to the same day, and you can simply group by the day part.
You could also do it in a more complicated way using CASE WHEN, but it's overkill on the view of this easy solution.
I had to do something similar a few days ago. I had fixed timespans for work shifts to group by where one of them could start on one day at 10pm and end the next morning at 6am.
What I did was:
Define a "shift date", which was simply the day with zero timestamp when the shift started for every entry in the table. I was able to do so by checking whether the timestamp of the entry was between 0am and 6am. In that case I took only the date of this DATEADD(dd, -1, entryDate), which returned the previous day for all entries between 0am and 6am.
I also added an ID for the shift. 0 for the first one (6am to 2pm), 1 for the second one (2pm to 10pm) and 3 for the last one (10pm to 6am).
I was then able to group over the shift date and shift IDs.
Example:
Consider the following source entries:
Timestamp SomeData
=============================
2014-09-01 06:01:00 5
2014-09-01 14:01:00 6
2014-09-02 02:00:00 7
Step one extended the table as follows:
Timestamp SomeData ShiftDay
====================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00
2014-09-01 14:01:00 6 2014-09-01 00:00:00
2014-09-02 02:00:00 7 2014-09-01 00:00:00
Step two extended the table as follows:
Timestamp SomeData ShiftDay ShiftID
==============================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00 0
2014-09-01 14:01:00 6 2014-09-01 00:00:00 1
2014-09-02 02:00:00 7 2014-09-01 00:00:00 2
If you add one hour to registrationtime, you will be able to group by the date part:
GROUP BY
CAST(DATEADD(HOUR, 1, registrationtime) AS date)
If the starting hour must be reflected accurately in the output (as 9, 23, 10, 23 rather than as 10, 0, 11, 0), you could obtain it as MIN(registrationtime) in the SELECT clause:
SELECT
count = COUNT(*),
day = DATEPART(DAY, MIN(registrationtime)),
hour = DATEPART(HOUR, MIN(registrationtime))
Finally, in case you are not aware, you can reference columns by their aliases in ORDER BY:
ORDER BY
day,
hour
just so that you do not have to repeat the expressions.
The below query will give you what you are expecting..
;WITH CTE AS
(
SELECT COUNT(*) Count, DATEPART(DAY,registrationtime) Day,DATEPART(HOUR,registrationtime) Hour,
RANK() over (partition by DATEPART(HOUR,registrationtime) order by DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)) Batch_ID
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
)
SELECT SUM(COUNT) Count,Batch_ID
FROM CTE
GROUP BY Batch_ID
ORDER BY Batch_ID
You can write a CASE statement as below
CASE WHEN DATEPART(HOUR,registrationtime) = 23
THEN DATEPART(DAY,registrationtime)+1
END,
CASE WHEN DATEPART(HOUR,registrationtime) = 23
THEN 0
END
I have tried many different things but I am really struggling with this issue. I am used to MySQL, SQLite, and other databases but I can't seem to figure this one out in Access.
I have two tables that I want to join based on if the timestamps of table1 fall within a range of timestamps in table2 grouped by ID. See the following:
Table1:
ID Timestamp
8:00 AM
8:01 AM
8:02 AM
8:03 AM
8:04 AM
8:05 AM
8:06 AM
8:07 AM
8:08 AM
8:09 AM
8:10 AM
8:11 AM
8:12 AM
8:13 AM
8:14 AM
8:15 AM
8:16 AM
8:17 AM
8:18 AM
8:19 AM
Table2:
ID Timestamp
1 8:00 AM
1 8:02 AM
1 8:04 AM
1 8:06 AM
2 8:10 AM
2 8:12 AM
2 8:14 AM
2 8:16 AM
What I want to happen in Table1:
ID Timestamp
1 8:00 AM
1 8:01 AM
1 8:02 AM
1 8:03 AM
1 8:04 AM
1 8:05 AM
1 8:06 AM
8:07 AM
8:08 AM
8:09 AM
2 8:10 AM
2 8:11 AM
2 8:12 AM
2 8:13 AM
2 8:14 AM
2 8:15 AM
8:16 AM
8:17 AM
8:18 AM
8:19 AM
Here is what I tried initially (and wish would work) but have gone through many iterations of different queries without getting anywhere.
UPDATE Table1
SET Table1.ID = Table2.ID
WHERE Table1.Timestamp IN (SELECT Table2.Timestamp GROUP BY Table2.ID);
I either get no output (Table1.ID remains empty) or I get the error "Operation must use an updatable query".
You need to create a temp table and use it as a temporary recordset to use to search the records. The reason for this is that you need the Min/Max timestamp per ID, which requires an aggregate query, which cannot be used in an update query.
SELECT Table2.ID,
Min(Table2.TS) AS MinOfTS,
Max(Table2.TS) AS MaxOfTS
INTO try '<- this is your temporary table.
FROM Table2
GROUP BY Table2.ID;
Now that we have values we can use to search with in our temp table, we can just reference that in our UPDATE statement.
UPDATE Table1, try SET Table1.ID = [try].[ID]
WHERE (((Table1.TS) Between [try].[minofts] And [try].[maxofts]));
Edit: I suppose you could use a DLookup - but they tend to run extremely slow compared to this method.