How to add rows into table by condition in ClickHouse? - sql

I have a table in ClickHouse that stores connect and disconnect events for a system.
The query select timestamp, username, event from table gives the following result:
timestamp                | username | event
-------------------------+----------+-----------
December 20, 2022, 18:24 | 1        | Connect
December 20, 2022, 18:30 | 1        | Disconnect
December 20, 2022, 18:34 | 1        | Connect
December 21, 2022, 12:07 | 1        | Disconnect
December 20, 2022, 12:15 | 2        | Connect
December 20, 2022, 12:47 | 2        | Disconnect
Every session must be shown in the table as finished by the end of the day. If a user connected on 20 December and has no 'Disconnect' event later that same day, I need a query that adds such a 'Disconnect' event to the table. I also need to add a 'Connect' row at 00:00 of the next day.
For example, the sample table shows that user #1 did not finish the session on 20 December, so I want the following result:
timestamp                | username | event
-------------------------+----------+-----------
December 20, 2022, 18:24 | 1        | Connect
December 20, 2022, 18:30 | 1        | Disconnect
December 20, 2022, 18:34 | 1        | Connect
December 20, 2022, 23:59 | 1        | Disconnect
December 21, 2022, 00:00 | 1        | Connect
December 21, 2022, 12:07 | 1        | Disconnect
December 20, 2022, 12:15 | 2        | Connect
December 20, 2022, 12:47 | 2        | Disconnect
Is there any way to amend the query so it works as described above? As far as I know, ClickHouse is not as common as Postgres or SQL Server, so code in the Postgres dialect is fine; I will work out how to do the same in ClickHouse.

You do not need a lateral join to achieve the desired result in ClickHouse. (JOINs in ClickHouse tend to be expensive: by default ClickHouse takes the right-hand table and builds a hash table for it in RAM.)
You can combine UNION ALL with ARRAY JOIN to generate the missing rows:
CREATE TABLE connections
(
    `timestamp` DateTime,
    `username` LowCardinality(String),
    `event` Enum('Connect', 'Disconnect')
)
ENGINE = Memory;
INSERT INTO connections VALUES
    ('2022-12-20 18:24:00'::DateTime, '1', 'Connect'),
    ('2022-12-20 18:30:00'::DateTime, '1', 'Disconnect'),
    ('2022-12-20 18:34:00'::DateTime, '1', 'Connect'),
    ('2022-12-21 12:07:00'::DateTime, '1', 'Disconnect'),
    ('2022-12-20 12:15:00'::DateTime, '2', 'Connect'),
    ('2022-12-20 12:47:00'::DateTime, '2', 'Disconnect');
SELECT * FROM
(
    SELECT timestamp, username, event
    FROM connections
    UNION ALL
    SELECT timestamp, username, event
    FROM
    (
        SELECT
            [toStartOfDay(timestamp) + INTERVAL '1 DAY' - INTERVAL '1 SECOND',
             toStartOfDay(timestamp) + INTERVAL '1 DAY'] AS timestamps,
            username,
            ['Disconnect', 'Connect']::Array(Enum('Connect', 'Disconnect')) AS events
        FROM connections
        GROUP BY toStartOfDay(timestamp), username
        -- argMax picks the event with the latest timestamp in each group;
        -- anyLast depends on read order and does not guarantee that
        HAVING argMax(event, timestamp) = 'Connect'
    )
    ARRAY JOIN
        timestamps AS timestamp,
        events AS event
)
ORDER BY username, timestamp;
Here is the result:
┌───────────timestamp─┬─username─┬─event──────┐
│ 2022-12-20 18:24:00 │ 1        │ Connect    │
│ 2022-12-20 18:30:00 │ 1        │ Disconnect │
│ 2022-12-20 18:34:00 │ 1        │ Connect    │
│ 2022-12-20 23:59:59 │ 1        │ Disconnect │
│ 2022-12-21 00:00:00 │ 1        │ Connect    │
│ 2022-12-21 12:07:00 │ 1        │ Disconnect │
│ 2022-12-20 12:15:00 │ 2        │ Connect    │
│ 2022-12-20 12:47:00 │ 2        │ Disconnect │
└─────────────────────┴──────────┴────────────┘
8 rows in set. Elapsed: 0.011 sec.
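For readers who want to check the logic outside the database, here is a minimal Python sketch of the same idea: find each (user, day) whose last event is a Connect and emit an end-of-day Disconnect plus a midnight Connect. The function name close_open_sessions and the tuple layout are assumptions made for illustration; this is not part of the ClickHouse answer itself.

```python
from datetime import datetime, timedelta


def close_open_sessions(events):
    """Return events plus synthetic end-of-day Disconnect / midnight Connect rows.

    events: list of (timestamp, username, event) tuples (hypothetical layout).
    """
    # Walk the events in time order so the last write per (user, day) wins.
    last_event = {}
    for ts, user, event in sorted(events, key=lambda r: r[0]):
        last_event[(user, ts.date())] = event

    extra = []
    for (user, day), event in last_event.items():
        if event == "Connect":  # the day ended with an open session
            midnight = datetime.combine(day, datetime.min.time()) + timedelta(days=1)
            extra.append((midnight - timedelta(seconds=1), user, "Disconnect"))
            extra.append((midnight, user, "Connect"))

    # Same ordering as the query: username first, then timestamp.
    return sorted(list(events) + extra, key=lambda r: (r[1], r[0]))


events = [
    (datetime(2022, 12, 20, 18, 24), "1", "Connect"),
    (datetime(2022, 12, 20, 18, 30), "1", "Disconnect"),
    (datetime(2022, 12, 20, 18, 34), "1", "Connect"),
    (datetime(2022, 12, 21, 12, 7), "1", "Disconnect"),
    (datetime(2022, 12, 20, 12, 15), "2", "Connect"),
    (datetime(2022, 12, 20, 12, 47), "2", "Disconnect"),
]

for row in close_open_sessions(events):
    print(*row)
```

Like the query, this only bridges one midnight per open day; a session that stays open across several days with no events in between would need extra handling.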

First, identify the Disconnect events that need to be preceded by a Disconnect/Connect pair: the CTE t flags them with the overnight attribute. Then insert a Disconnect/Connect pair into the_table for every row of t where overnight is true.
with t as
(
    select *,
        "timestamp"::date > lag("timestamp") over (partition by username order by "timestamp")::date as overnight
    from the_table
    where "event" = 'Disconnect'
)
insert into the_table ("timestamp", "username", "event")
select date_trunc('day', "timestamp") - interval '1 second', "username", 'Disconnect'
from t where overnight
union all
select date_trunc('day', "timestamp"), "username", 'Connect'
from t where overnight;
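The lag-based check can be traced in plain Python. This is only a sketch under assumed names (overnight_pairs, the tuple layout); it mirrors the comparison of each Disconnect's date with the previous Disconnect's date for the same user.

```python
from datetime import datetime, timedelta


def overnight_pairs(events):
    """Return the Disconnect/Connect rows to insert, mimicking the lag() check.

    events: list of (timestamp, username, event) tuples (hypothetical layout).
    """
    prev_disconnect_day = {}  # username -> date of the previous Disconnect
    pairs = []
    for ts, user, event in sorted(events, key=lambda r: (r[1], r[0])):
        if event != "Disconnect":
            continue
        prev_day = prev_disconnect_day.get(user)
        # First Disconnect of a user has no predecessor, so it is never flagged.
        if prev_day is not None and ts.date() > prev_day:
            midnight = datetime.combine(ts.date(), datetime.min.time())
            pairs.append((midnight - timedelta(seconds=1), user, "Disconnect"))
            pairs.append((midnight, user, "Connect"))
        prev_disconnect_day[user] = ts.date()
    return pairs


events = [
    (datetime(2022, 12, 20, 18, 24), "1", "Connect"),
    (datetime(2022, 12, 20, 18, 30), "1", "Disconnect"),
    (datetime(2022, 12, 20, 18, 34), "1", "Connect"),
    (datetime(2022, 12, 21, 12, 7), "1", "Disconnect"),
    (datetime(2022, 12, 20, 12, 15), "2", "Connect"),
    (datetime(2022, 12, 20, 12, 47), "2", "Disconnect"),
]

for row in overnight_pairs(events):
    print(*row)
```

Note that this check flags any Disconnect that falls on a later day than the previous Disconnect, whether or not the matching Connect happened on an earlier day, so it fits data shaped like the sample and may need a stricter condition elsewhere.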

Related

Calculate time difference by filtering and combining related entries from same table

I have the following table
TicketID | Operator | Datestamp                   | Remarks
---------+----------+-----------------------------+--------------------------
1        | p1       | July 20, 2022, 10:30 PM     | Changed from State A to B
1        | p1       | July 20, 2022, 11:30 PM     | Changed from State B to C
1        | p2       | July 21, 2022, 10:01 PM     | Changed from State D to B
1        | p3       | July 21, 2022, 11:41 PM     | Changed from State B to A
2        | p1       | November 13, 2022, 11:01 PM | Changed from State C to B
3        | p5       | November 13, 2022, 09:10 AM | Changed from State A to B
3        | p1       | November 13, 2022, 11:10 AM | Changed from State B to C
3        | p1       | November 13, 2022, 11:41 PM | Changed from State C to B
I need to find out how long each ticket (identified by TicketID) has spent in State B.
To clarify with the table above: Ticket 1 was in State B from July 20, 2022, 10:30 PM to 11:30 PM (1 hr) and from July 21, 2022, 10:01 PM to 11:41 PM (1 hr 40 min), for a total of 2 hrs 40 min.
Similarly, Ticket 2 has just one state change to B and no entry for a state change from B, so we assume it is still in State B and the duration becomes CurrentTime minus November 13, 2022, 11:01 PM.
I'm having a hard time figuring out how to achieve this in a T-SQL view. Any help is highly appreciated.
Assuming the current time is November 13, 2022, 11:51 PM, the final view output is supposed to look something like this:
TicketID | Duration (in minutes)
---------+----------------------
1        | 160
2        | 50
3        | 130
OK, with the changed data it can be done as follows.
First, determine which rows are a change to B and which are a change from B.
Then get the LEAD of the datetime.
Finally, sum up the differences in minutes.
You should still always provide dates as yyyy-mm-dd hh:MM:ss.
WITH CTE AS (
    SELECT
        [TicketID], [Operator], [Datestamp],
        CASE WHEN [Remarks] LIKE '%to B%' THEN 1
             WHEN [Remarks] LIKE '%B to%' THEN 2
             ELSE 0 END AS flg
    FROM tab1
),
CTE2 AS (
    SELECT
        [TicketID], flg, [Datestamp],
        LEAD([Datestamp]) OVER (PARTITION BY [TicketID] ORDER BY [Datestamp]) AS date2
    FROM CTE
)
SELECT
    [TicketID],
    SUM(ABS(DATEDIFF(minute, [Datestamp], COALESCE(date2, GETDATE())))) AS duration
FROM CTE2
WHERE flg = 1
GROUP BY [TicketID]
TicketID | duration
---------+---------
1        | 160
2        | 450
3        | 610
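To sanity-check the arithmetic against the numbers expected in the question, here is a small Python sketch of the same three steps (flag the "to B" rows, look at the next row per ticket, sum the minutes). The function name minutes_in_state_b is an assumption, and the current time is fixed to November 13, 2022, 11:51 PM as the question states.

```python
from collections import defaultdict
from datetime import datetime

rows = [
    (1, "p1", "2022-07-20 22:30", "Changed from State A to B"),
    (1, "p1", "2022-07-20 23:30", "Changed from State B to C"),
    (1, "p2", "2022-07-21 22:01", "Changed from State D to B"),
    (1, "p3", "2022-07-21 23:41", "Changed from State B to A"),
    (2, "p1", "2022-11-13 23:01", "Changed from State C to B"),
    (3, "p5", "2022-11-13 09:10", "Changed from State A to B"),
    (3, "p1", "2022-11-13 11:10", "Changed from State B to C"),
    (3, "p1", "2022-11-13 23:41", "Changed from State C to B"),
]


def minutes_in_state_b(rows, now):
    """Sum, per ticket, the minutes between each change to B and the next change."""
    by_ticket = defaultdict(list)
    for ticket_id, _operator, stamp, remarks in rows:
        by_ticket[ticket_id].append((datetime.strptime(stamp, "%Y-%m-%d %H:%M"), remarks))

    totals = {}
    for ticket_id, changes in by_ticket.items():
        changes.sort()  # order by datestamp, like ORDER BY in the window
        total = 0
        for i, (stamp, remarks) in enumerate(changes):
            if not remarks.endswith("to B"):
                continue  # only rows that enter State B start an interval
            # LEAD(): next change for the ticket, or "now" if still in B.
            end = changes[i + 1][0] if i + 1 < len(changes) else now
            total += int((end - stamp).total_seconds() // 60)
        totals[ticket_id] = total
    return totals


print(minutes_in_state_b(rows, now=datetime(2022, 11, 13, 23, 51)))
```

With the fixed "now" this reproduces the 160 / 50 / 130 minutes from the question; a live clock gives different values for tickets that are still in State B.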

Use Django Model to find 1 record per hour, and the most recent one from another timestamp

Django 3.2.9
db: (PostgreSQL) 14.0
Model
class InventoryForecast(models.Model):
    count = models.IntegerField()
    forecast_for = models.DateTimeField(null=False)
    forecasted_at = models.DateTimeField(null=False)
Data
id | count | forecast_for     | forecasted_at
---+-------+------------------+----------------------------------------
8  | 40910 | 2022-10-10 11:00 | 2022-09-04 12:00
9  | 40909 | 2022-10-10 11:00 | 2022-09-05 12:00
10 | 50202 | 2022-10-10 11:00 | 2022-09-06 12:00 (most recent forecast)
11 | 50301 | 2022-10-10 12:00 | 2022-09-04 12:00
12 | 50200 | 2022-10-10 12:00 | 2022-09-05 12:00
13 | 50309 | 2022-10-10 12:00 | 2022-09-06 12:00 (most recent forecast)
How would I use Django Model to find 1 record per forecast_for hour, and the most recent one for the forecasted_at value? So in this example, 2 records.
Desired results
id | count | forecast_for     | forecasted_at
---+-------+------------------+-----------------
10 | 50202 | 2022-10-10 11:00 | 2022-09-06 12:00
13 | 50309 | 2022-10-10 12:00 | 2022-09-06 12:00
What I've tried
>>> from django.db.models.functions import TruncHour
>>> from django.db.models import Max
>>> InventoryForecast.objects.annotate(
...     hour=TruncHour('forecast_for')
... ).values('hour').annotate(
...     most_recent_forecasted_at=Max('forecasted_at')
... ).values('hour', 'most_recent_forecasted_at')
SELECT DATE_TRUNC('hour', "app_inventoryforecast"."forecast_for" AT TIME ZONE 'UTC') AS "hour",
MAX("app_inventoryforecast"."forecasted_at") AS "most_recent_forecasted_at"
FROM "app_inventoryforecast"
GROUP BY DATE_TRUNC('hour', "app_inventoryforecast"."forecast_for" AT TIME ZONE 'UTC')
LIMIT 21
Execution time: 0.000353s [Database: default]
<QuerySet [{'hour': datetime.datetime(2022, 10, 10, 12, 0, tzinfo=<UTC>), 'most_recent_forecasted_at': datetime.datetime(2022, 9, 6, 11, 0, tzinfo=<UTC>)}, {'hour': datetime.datetime(2022, 10, 10, 11, 0, tzinfo=<UTC>), 'most_recent_forecasted_at': datetime.datetime(2022, 9, 6, 11, 0, tzinfo=<UTC>)}]>
That works correctly in the GROUP BY, but I need the count value. The trick is, when I add that into the values it changes my group by to return too many records.
>>> InventoryForecast.objects.annotate(hour=TruncHour('forecast_for')).values('hour').annotate(most_recent_forecasted_at=Max('forecasted_at')).values('hour', 'most_recent_forecasted_at', 'count', 'id').all().count()
SELECT COUNT(*)
FROM (
SELECT "app_inventoryforecast"."count" AS Col1,
"app_inventoryforecast"."id" AS Col2,
DATE_TRUNC('hour', "app_inventoryforecast"."forecast_for" AT TIME ZONE 'UTC') AS "hour",
MAX("app_inventoryforecast"."forecasted_at") AS "most_recent_forecasted_at"
FROM "app_inventoryforecast"
GROUP BY DATE_TRUNC('hour', "app_inventoryforecast"."forecast_for" AT TIME ZONE 'UTC'),
"app_inventoryforecast"."id"
) subquery
Execution time: 0.002036s [Database: default]
6
So that returns all the example rows, 6. I need to select all my columns and group by just the truncated hour, or similar and return the 2 recent forecasted rows.
This solution annotates a new field, forecast_for_hour, which uses TruncHour to truncate the forecast_for timestamp to the whole hour, then orders by forecast_for_hour ascending and forecasted_at descending. Since you are using PostgreSQL, we can then call distinct on forecast_for_hour; thanks to the descending order on forecasted_at, it keeps the newest forecast for each hour:
qs = (
    InventoryForecast.objects
    .annotate(forecast_for_hour=TruncHour('forecast_for'))
    .order_by('forecast_for_hour', '-forecasted_at')
    .distinct('forecast_for_hour')
)
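As an illustration of what "one row per hour, newest forecast wins" selects, here is a plain-Python sketch over the sample data. It does not use the Django ORM; the tuple layout and the name latest_per_hour are assumptions made for this example.

```python
from datetime import datetime

# (id, count, forecast_for, forecasted_at), copied from the sample data.
forecasts = [
    (8, 40910, datetime(2022, 10, 10, 11, 0), datetime(2022, 9, 4, 12, 0)),
    (9, 40909, datetime(2022, 10, 10, 11, 0), datetime(2022, 9, 5, 12, 0)),
    (10, 50202, datetime(2022, 10, 10, 11, 0), datetime(2022, 9, 6, 12, 0)),
    (11, 50301, datetime(2022, 10, 10, 12, 0), datetime(2022, 9, 4, 12, 0)),
    (12, 50200, datetime(2022, 10, 10, 12, 0), datetime(2022, 9, 5, 12, 0)),
    (13, 50309, datetime(2022, 10, 10, 12, 0), datetime(2022, 9, 6, 12, 0)),
]


def latest_per_hour(rows):
    """Keep, for each truncated forecast_for hour, the row with the newest forecasted_at."""
    best = {}
    for row in rows:
        _id, _count, forecast_for, forecasted_at = row
        hour = forecast_for.replace(minute=0, second=0, microsecond=0)  # like TruncHour
        if hour not in best or forecasted_at > best[hour][3]:
            best[hour] = row
    return [best[hour] for hour in sorted(best)]


for row in latest_per_hour(forecasts):
    print(row[0], row[1])
```

On the sample data this keeps ids 10 and 13, the two rows listed under "Desired results".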

The Average Number of Rides Completed in 4 Hours

I have a dataset with each ride having its own ride_id and its completion time. I want to know how many rides happen every 4 hours, on average.
Sample Dataset:
dropoff_datetime ride_id
2022-08-27 11:42:02 1715
2022-08-24 05:59:26 1713
2022-08-23 17:40:05 1716
2022-08-28 23:06:01 1715
2022-08-27 03:21:29 1714
For example, I would like to find out between 2022-8-27 12 PM to 2022-8-27 4 PM how many rides happened that time? And then again from 2022-8-27 4 PM to 2022-8-27 8 PM how many rides happened in that 4 hour period?
What I've tried:
I first truncate my dropoff_datetime into the hour. (DATE_TRUNC)
I then group by that hour to get the count of rides per hour.
Example Query:
Note: calling the above table - final.
SELECT DATE_TRUNC('hour', dropoff_datetime) as by_hour
,count(ride_id) as total_rides
FROM final
WHERE 1=1
GROUP BY 1
Result:
by_hour total_rides
2022-08-27 4:00:00 3756
2022-08-27 5:00:00 6710
My question is:
How can I make it so it's grouping every 4 hours instead?
The question actually consists of two parts: how to generate the date range and how to aggregate the data. One possible approach is to use the minimum and maximum dates in the data to generate the range and then join it with the data again:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-24 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716)),
-- query part
min_max as (
select min(date_trunc('hour', dropoff_datetime)) d_min, max(date_trunc('hour', dropoff_datetime)) d_max
from dataset
),
date_ranges as (
select h
from min_max,
unnest (sequence(d_min, d_max, interval '4' hour)) t(h)
)
select h, count_if(ride_id is not null)
from date_ranges
-- half-open window, so a ride exactly on a boundary is counted only once
left join dataset on dropoff_datetime >= h and dropoff_datetime < h + interval '4' hour
group by h
order by h;
This produces the following output:
h                   | _col1
--------------------+------
2022-08-23 17:00:00 | 1
2022-08-23 21:00:00 | 0
2022-08-24 01:00:00 | 0
2022-08-24 05:00:00 | 2
2022-08-24 09:00:00 | 1
Note that this can be quite performance-intensive for large amounts of data.
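The range-then-join idea can be sketched in Python as well. This is an illustration only, with an assumed function name count_per_window; it builds 4-hour windows from the truncated minimum to the truncated maximum and counts a ride in a window when start <= dropoff < start + 4h.

```python
from datetime import datetime, timedelta


def count_per_window(dropoffs, hours=4):
    """Count drop-offs in consecutive windows, including empty ones."""
    step = timedelta(hours=hours)
    truncated = [d.replace(minute=0, second=0, microsecond=0) for d in dropoffs]
    start, last = min(truncated), max(truncated)

    counts = {}
    while start <= last:  # same windows as sequence(d_min, d_max, 4 hours)
        counts[start] = sum(1 for d in dropoffs if start <= d < start + step)
        start += step
    return counts


dropoffs = [
    datetime(2022, 8, 24, 11, 42, 2),
    datetime(2022, 8, 24, 5, 59, 26),
    datetime(2022, 8, 24, 5, 29, 26),
    datetime(2022, 8, 23, 17, 40, 5),
]

for window, total in count_per_window(dropoffs).items():
    print(window, total)
```

Every window rescans all rows here, which is the same cost concern as the join; it is fine for a sample but not for large inputs.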
Another approach is to pick a "reference point" and count from it, for example the minimum date in the dataset:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-27 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716),
(timestamp '2022-08-28 23:06:01', 1715),
(timestamp '2022-08-27 03:21:29', 1714)),
-- query part
base_with_curr AS (
select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
date_trunc('hour', dropoff_datetime) dropoff_datetime
from dataset)
select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4)*4, base) as four_hour,
count(*)
from base_with_curr
group by 1;
Output:
four_hour           | _col1
--------------------+------
2022-08-23 17:00:00 | 1
2022-08-28 21:00:00 | 1
2022-08-24 05:00:00 | 2
2022-08-27 09:00:00 | 1
2022-08-27 01:00:00 | 1
Then you can use the sequence approach to generate the missing dates if needed.
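The reference-point arithmetic is easy to verify in Python. A minimal sketch, assuming a hypothetical helper four_hour_buckets: hours elapsed since the base are integer-divided by 4 and multiplied back, which snaps each drop-off to the start of its bucket.

```python
from collections import Counter
from datetime import datetime, timedelta


def four_hour_buckets(dropoffs, hours=4):
    """Group drop-offs into buckets of `hours`, counted from the earliest truncated hour."""
    truncated = [d.replace(minute=0, second=0, microsecond=0) for d in dropoffs]
    base = min(truncated)
    counts = Counter()
    for d in truncated:
        elapsed = int((d - base).total_seconds() // 3600)  # like date_diff('hour', ...)
        bucket_start = base + timedelta(hours=(elapsed // hours) * hours)
        counts[bucket_start] += 1
    return dict(counts)


dropoffs = [
    datetime(2022, 8, 27, 11, 42, 2),
    datetime(2022, 8, 24, 5, 59, 26),
    datetime(2022, 8, 24, 5, 29, 26),
    datetime(2022, 8, 23, 17, 40, 5),
    datetime(2022, 8, 28, 23, 6, 1),
    datetime(2022, 8, 27, 3, 21, 29),
]

for bucket, total in sorted(four_hour_buckets(dropoffs).items()):
    print(bucket, total)
```

Only buckets that contain at least one ride appear in the result, which matches the query output above.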

Count unique values per day, if timestamp has hours

I have dataset:
timestamp event user
2020-04-28 20:07:55.503 log_in john
2020-04-28 20:08:01.996 log_out john
2020-04-28 20:08:02.470 log_in john
2020-04-28 20:08:03.996 log_out john
2020-04-28 20:08:05.729 log_failed john
2020-04-29 10:06:45.683 log_in mark
2020-04-29 10:08:58.299 password_change mark
2020-04-30 14:19:24.921 log_in jeff
2020-04-30 14:20:31.266 log_out jeff
2020-04-30 14:21:44.438 create_new_user jeff
2020-04-30 14:22:44.455 create_new_user jeff
How do I write a SQL query to count all unique events per day? The part that is unclear to me is the presence of hours in the timestamp. The desired result looks like this:
timestamp count
2020-04-28 3
2020-04-29 2
2020-04-30 3
I think the Clickhouse syntax is:
select distinct toDate(timestamp), event
from t;
EDIT:
If you want to count the events, use count(distinct):
select toDate(timestamp), count(distinct event)
from t
group by toDate(timestamp);
create table xx(timestamp DateTime64(3), event String, user String) Engine=Memory;
insert into xx values
('2020-04-28 20:07:55.503','log_in', 'john'),
('2020-04-28 20:08:01.996','log_out','john'),
('2020-04-28 20:08:02.470','log_in','john'),
('2020-04-28 20:08:03.996','log_out','john'),
('2020-04-28 20:08:05.729','log_failed','john'),
('2020-04-29 10:06:45.683','log_in','mark'),
('2020-04-29 10:08:58.299','password_change','mark'),
('2020-04-30 14:19:24.921','log_in','jeff'),
('2020-04-30 14:20:31.266','log_out','jeff'),
('2020-04-30 14:21:44.438','create_new_user','jeff'),
('2020-04-30 14:22:44.455','create_new_user','jeff');
SELECT
toDate(timestamp) AS d,
uniq(event)
FROM xx
GROUP BY d
┌──────────d─┬─uniq(event)─┐
│ 2020-04-28 │ 3 │
│ 2020-04-29 │ 2 │
│ 2020-04-30 │ 3 │
└────────────┴─────────────┘
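The same count can be checked in Python by dropping the time part and collecting distinct events per date. This is a sketch with an assumed function name, unique_events_per_day; it counts exactly, in the manner of count(distinct event).

```python
from collections import defaultdict
from datetime import datetime

rows = [
    ("2020-04-28 20:07:55.503", "log_in", "john"),
    ("2020-04-28 20:08:01.996", "log_out", "john"),
    ("2020-04-28 20:08:02.470", "log_in", "john"),
    ("2020-04-28 20:08:03.996", "log_out", "john"),
    ("2020-04-28 20:08:05.729", "log_failed", "john"),
    ("2020-04-29 10:06:45.683", "log_in", "mark"),
    ("2020-04-29 10:08:58.299", "password_change", "mark"),
    ("2020-04-30 14:19:24.921", "log_in", "jeff"),
    ("2020-04-30 14:20:31.266", "log_out", "jeff"),
    ("2020-04-30 14:21:44.438", "create_new_user", "jeff"),
    ("2020-04-30 14:22:44.455", "create_new_user", "jeff"),
]


def unique_events_per_day(rows):
    """Map each calendar date to the number of distinct event names seen that day."""
    seen = defaultdict(set)
    for timestamp, event, _user in rows:
        day = datetime.fromisoformat(timestamp).date()  # like toDate(timestamp)
        seen[day].add(event)
    return {day.isoformat(): len(events) for day, events in sorted(seen.items())}


print(unique_events_per_day(rows))
```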

Calculating datetime intervals taking into an account the current date

I have a subquery that returns this:
item_id, item_datetime, item_duration_in_days
1, '7-dec-2016-12:00', 3
2, '8-dec-2016-11:00', 4
3, '20-dec-2016-05:00', 10
4, '2-jan-2017-14:00', 50
5, '29-jan-2017-22:00', 89
I want to get the "item_id" whose interval contains "now()". The algorithm is:
1) var duration_days = interval 'item_duration_in_days[i]'
2) for the very first item:
new_datetime[i] = item_datetime[i] + duration_days
3) for others:
- if a new_datetime from the previous step overlaps with the current item_datetime[i]:
new_datetime[i] = new_datetime[i - 1] + duration_days
- else:
new_datetime[i] = item_datetime[i] + duration_days
4) return an item for each iteration:
{id, item_datetime, new_datetime}
That is, there'll be something like:
item_id item_datetime new_datetime
1 7 dec 2016 10 dec 2016
2 11 dec 2016 15 dec 2016
3 20 dec 2016 30 dec 2016
4 2 jan 2017 22 feb 2017 <------- found because now() == Feb 5
5 22 feb 2017 21 may 2017
How can I do that? I think it should be something like a "fold" function. Can it be done in a single SQL query, or does it have to be a PL/pgSQL procedure that stores intermediate variables?
Otherwise, please give me some pointers on how to calculate it.
If I understand your task correctly, you need a recursive query. It takes the first row as the starting point and then processes each following row in turn.
WITH RECURSIVE x AS (
    SELECT *
    FROM (
        SELECT item_id,
               item_datetime,
               item_datetime + (item_duration_in_days::text || ' day')::interval AS cur_end
        FROM ti
        ORDER BY item_datetime
        LIMIT 1
    ) AS first
    UNION ALL
    SELECT item_id,
           cur_start,
           cur_start + (item_duration_in_days::text || ' day')::interval
    FROM (
        SELECT item_id,
               CASE WHEN item_datetime > prev_end THEN
                   item_datetime
               ELSE
                   prev_end
               END AS cur_start,
               item_duration_in_days
        FROM (
            SELECT ti.item_id,
                   ti.item_datetime,
                   x.cur_end + '1 day'::interval AS prev_end,
                   item_duration_in_days
            FROM x
            JOIN ti ON (
                ti.item_id != x.item_id
                AND ti.item_datetime >= x.item_datetime
            )
            ORDER BY ti.item_datetime
            LIMIT 1
        ) AS a
    ) AS a
) SELECT * FROM x;
Result:
item_id | item_datetime | cur_end
---------+---------------------+---------------------
1 | 2016-12-07 12:00:00 | 2016-12-10 12:00:00
2 | 2016-12-11 12:00:00 | 2016-12-15 12:00:00
3 | 2016-12-20 05:00:00 | 2016-12-30 05:00:00
4 | 2017-01-02 14:00:00 | 2017-02-21 14:00:00
5 | 2017-02-22 14:00:00 | 2017-05-22 14:00:00
(5 rows)
To see the current job:
....
) SELECT * FROM x WHERE item_datetime <= now() AND cur_end >= now();
item_id | item_datetime | cur_end
---------+---------------------+---------------------
4 | 2017-01-02 14:00:00 | 2017-02-21 14:00:00
(1 row)
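The "fold" the question asks about can be written as a simple loop that carries the previous end forward. Here is a minimal Python sketch of that idea; the names schedule and current_item are assumptions, and the one-day gap after the previous end follows the `cur_end + '1 day'::interval` used in the query above.

```python
from datetime import datetime, timedelta

# (item_id, item_datetime, item_duration_in_days), copied from the question.
items = [
    (1, datetime(2016, 12, 7, 12), 3),
    (2, datetime(2016, 12, 8, 11), 4),
    (3, datetime(2016, 12, 20, 5), 10),
    (4, datetime(2017, 1, 2, 14), 50),
    (5, datetime(2017, 1, 29, 22), 89),
]


def schedule(items):
    """Return (item_id, start, end), pushing overlapping items after the previous end."""
    result = []
    prev_end = None
    for item_id, start, days in sorted(items, key=lambda r: r[1]):
        if prev_end is not None:
            # An item cannot start before the day after the previous item ends.
            start = max(start, prev_end + timedelta(days=1))
        end = start + timedelta(days=days)
        result.append((item_id, start, end))
        prev_end = end
    return result


def current_item(items, now):
    """Return the id of the item whose interval contains `now`, or None."""
    for item_id, start, end in schedule(items):
        if start <= now <= end:
            return item_id
    return None


for row in schedule(items):
    print(*row)
print(current_item(items, now=datetime(2017, 2, 5)))
```

On the sample data this reproduces the start and end values in the result table, and with "now" set to 5 February 2017 it finds item 4.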