Calculating difference (or deltas) between current and previous row with ClickHouse

It would be awesome if there was a way to index rows during a query.
Is there a way to SELECT (compute) the difference of a single column between consecutive rows?
Let's say, something like the following query
SELECT
toStartOfDay(stamp) AS day,
count(day) AS events,
events[current] - events[previous] AS difference, -- how do I calculate this
events[current] / events[previous] AS percent -- and this
FROM records
GROUP BY day
ORDER BY day
I want to get the integer and percentage difference between the current row's 'events' column and the previous one for something similar to this:
day                  events  difference  percent
2022-01-06 00:00:00  197     NULL        NULL
2022-01-07 00:00:00  656     459         3.32
2022-01-08 00:00:00  15      -641        0.02
2022-01-09 00:00:00  7       -8          0.46
2022-01-10 00:00:00  137     130         19.5

My version of ClickHouse doesn't support window functions but, while reading about the LAG() function mentioned in the comments, I found neighbor(), which works perfectly for what I'm trying to do:
SELECT
toStartOfDay(stamp) AS day,
count(day) AS events,
(events - neighbor(events, -1)) as diff,
(events / neighbor(events, -1)) as perc
FROM records
GROUP BY day
ORDER BY day
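As a cross-check of the numbers, here is a minimal Python sketch (not ClickHouse) that computes the same previous-row difference and ratio for the sample counts from the question. The function name previous_row_deltas is hypothetical. It returns None for the first row to match the NULLs in the desired output; as far as I know, neighbor() instead falls back to a default value (0 for integers) when there is no previous row, so it is worth checking how the first row comes out in your version.

```python
from typing import List, Optional, Tuple


def previous_row_deltas(
    events: List[int],
) -> List[Tuple[int, Optional[int], Optional[float]]]:
    """Return (events, difference, ratio) per row, comparing each row to the one before."""
    rows = []
    previous = None
    for current in events:
        if previous is None:
            # First row: there is no previous row to compare against.
            rows.append((current, None, None))
        else:
            # Guard against division by zero when the previous count is 0.
            ratio = current / previous if previous != 0 else None
            rows.append((current, current - previous, ratio))
        previous = current
    return rows


# Daily event counts from the sample table in the question.
daily_events = [197, 656, 15, 7, 137]
for row in previous_row_deltas(daily_events):
    print(row)
```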

Related

How to build in product expiration in SQL?

I have a table that looks like the following and from it I want to get days remaining of total doses:
USER|PURCHASE_DATE|DOSES
1111|2017-07-27|15
2222|2020-07-17|3
3333|2021-02-01|5
If the doses do not have an expiration and each can be used for 90 days then the SQL I use is:
SUM(DOSES)*90-DATEDIFF(DAY,MIN(PURCHASE_DATE),GETDATE())
USER|DAYS_REMAINING
1111|0
2222|6
3333|385
But what if I want each dose to expire one year after purchase? How can I modify my SQL to get the following desired answer:
USER|DAYS_REMAINING
1111|-985
2222|6
3333|300
It probably involves taking the MIN between when doses expire and how long they would last but I don't know how to aggregate in the expiry logic.
MIN is an aggregate function; you want LEAST to pick the smaller of the two values:
WITH data(user,purchase_date, doses) AS (
SELECT * FROM VALUES
(1111,'2017-07-27',15),
(2222,'2020-07-17',3),
(3333,'2021-02-01',5)
)
SELECT
d.*,
d.doses * 90 AS doses_duration,
365::number AS year_duration,
least(doses_duration, year_duration) as max_duration,
DATEADD('day', max_duration, d.purchase_date)::date as last_dose_day,
DATEDIFF('day', current_date, last_dose_day) as day_remaining
FROM data AS d
ORDER BY 1;
gives:
USER PURCHASE_DATE DOSES DOSES_DURATION YEAR_DURATION MAX_DURATION LAST_DOSE_DAY DAY_REMAINING
1111 2017-07-27 15 1350 365 365 2018-07-27 -986
2222 2020-07-17 3 270 365 270 2021-04-13 5
3333 2021-02-01 5 450 365 365 2022-02-01 299
which can all be rolled together with a tiny fix on the date_diff, as:
WITH data(user,purchase_date, doses) AS (
SELECT * FROM VALUES
(1111,'2017-07-27',15),
(2222,'2020-07-17',3),
(3333,'2021-02-01',5)
)
SELECT
d.user,
DATEDIFF('day', current_date, DATEADD('day', least(d.doses * 90, 365::number), d.purchase_date)::date)+1 as day_remaining
FROM data AS d
ORDER BY 1;
giving:
USER DAY_REMAINING
1111 -985
2222 6
3333 300
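The same arithmetic can be checked outside the database. This is a minimal Python sketch, not the SQL itself: the function name days_remaining is hypothetical, and the reference date 2021-04-08 is an assumption inferred from the sample output above (it is the date for which the DAY_REMAINING values come out as shown).

```python
from datetime import date, timedelta

DAYS_PER_DOSE = 90   # from the question: each dose can be used for 90 days
EXPIRY_DAYS = 365    # from the answer: doses expire after 365 days


def days_remaining(purchase_date: date, doses: int, today: date) -> int:
    """Days left, counting the last dose day itself (the +1 in the SQL)."""
    usable_days = min(doses * DAYS_PER_DOSE, EXPIRY_DAYS)          # LEAST(...)
    last_dose_day = purchase_date + timedelta(days=usable_days)    # DATEADD
    return (last_dose_day - today).days + 1                        # DATEDIFF + 1


# Assumed "current date", inferred from the sample output.
today = date(2021, 4, 8)
purchases = {
    1111: (date(2017, 7, 27), 15),
    2222: (date(2020, 7, 17), 3),
    3333: (date(2021, 2, 1), 5),
}
results = {user: days_remaining(d, n, today) for user, (d, n) in purchases.items()}
print(results)  # {1111: -985, 2222: 6, 3333: 300}
```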

Calculate overlap time in seconds for groups in SQL

I have a bunch of timestamps grouped by ID and type in the sample data shown below.
I would like to find the overlapping time, in seconds, between the start_time and end_time columns for each ID, across each lead and follower combination. I want to show the overlap time only on the first record of each group, which will always be of type "lead".
For example, for ID 1, the follower's start and end times in row 3 overlap with the lead's in row 1 for 193 seconds (from 09:00:00 to 09:03:13). The follower's times in row 3 also overlap with the lead's in row 2 for 133 seconds (from 09:01:00 to 09:03:13). That's a total of 326 seconds (193 + 133).
I used the partition clause to rank rows by ID and type and order them by start_time as a start.
How do I get the overlap column?
row#  ID  type      start_time           end_time             rank  overlap
1     1   lead      2020-05-07 09:00:00  2020-05-07 09:03:34  1     326
2     1   lead      2020-05-07 09:01:00  2020-05-07 09:03:13  2
3     1   follower  2020-05-07 08:59:00  2020-05-07 09:03:13  1
4     2   lead      2020-05-07 11:23:00  2020-05-07 11:33:00  1     540
4     2   follower  2020-05-07 11:27:00  2020-05-07 11:32:00  1
5     3   lead      2020-05-07 14:45:00  2020-05-07 15:00:00  1     305
6     3   follower  2020-05-07 14:44:00  2020-05-07 14:44:45  1
7     3   follower  2020-05-07 14:50:00  2020-05-07 14:55:05  2
In your example, the times completely cover the total duration. If this is always true, you can use the following logic:
select id,
(sum(datediff(second, start_time, end_time)) -
datediff(second, min(start_time), max(end_time))
) as overlap
from t
group by id;
To add this as an additional column, then either use window functions or join in the result from the above query.
If the overall time has gaps, then the problem is quite a bit more complicated. I would suggest that you ask a new question and set up a db fiddle for the problem.
I tried this a couple of ways and got it to work.
I first joined two tables holding the individual records for each type, 'lead' and 'follower', and wrote CASE expressions to pick the later start time and the earlier end time for each lead and follower combination. I stored the result in a temp table.
CASE
WHEN lead_table.start_time >= follower_table.start_time THEN lead_table.start_time
ELSE follower_table.start_time
END as overlap_start_time,
CASE
WHEN follower_table.end_time <= lead_table.end_time THEN follower_table.end_time
ELSE lead_table.end_time
END as overlap_end_time
Then I wrote an outer query against the temp table to find the difference in seconds between the overlap start and end times for each lead and follower combination:
select temp_table.id,
temp_table.overlap_start_time,
temp_table.overlap_end_time,
DATEDIFF_BIG(second,
temp_table.overlap_start_time,
temp_table.overlap_end_time) as overlap_time
FROM temp_table
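The pairwise idea (later start, earlier end, difference in seconds) can be sketched in Python to check the expected numbers. This is only an illustration with hypothetical helper names (overlap_seconds, ts), using the rows for ID 1 from the sample data. Unlike a plain DATEDIFF, the sketch clamps the result at 0 so that a lead and follower that do not overlap at all contribute nothing rather than a negative value.

```python
from datetime import datetime


def overlap_seconds(lead, follower):
    """Seconds shared by two (start, end) intervals; 0 when they do not overlap."""
    start = max(lead[0], follower[0])  # later of the two start times
    end = min(lead[1], follower[1])    # earlier of the two end times
    return max(0, int((end - start).total_seconds()))


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Rows for ID 1 from the sample data.
leads = [
    (ts("2020-05-07 09:00:00"), ts("2020-05-07 09:03:34")),
    (ts("2020-05-07 09:01:00"), ts("2020-05-07 09:03:13")),
]
followers = [(ts("2020-05-07 08:59:00"), ts("2020-05-07 09:03:13"))]

# Sum the overlap over every lead and follower combination.
total_overlap = sum(overlap_seconds(l, f) for l in leads for f in followers)
print(total_overlap)  # 193 + 133 = 326
```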

Multiple day on day changes based on dates in data as not continuous

See table A. It holds the number of sales per date. The dates are not continuous.
I want table B, which gives the change in sales relative to the previous date in the dataset.
I am trying to do this in SQL but I'm stuck. I can compute an individual day-on-day difference by entering the dates, but I want a query where I don't need to enter the dates manually.
A
Date Sales
01/01/2019 100
05/01/2019 200
12/01/2019 50
25/01/2019 25
31/01/2019 200
B
Date DOD Move
01/01/2019 -
05/01/2019 +100
12/01/2019 -150
25/01/2019 -25
31/01/2019 +175
Use lag():
select t.*,
(sales - lag(sales) over (order by date)) as dod_move
from t;
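To try this without a database server, here is a small Python sketch that runs the same lag() query through the standard-library sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (needed for window functions) and stores the dates in ISO format, because text dates written as DD/MM/YYYY would not sort chronologically. The table and column names follow the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, sales INTEGER)")
# Table A from the question, with dates rewritten as YYYY-MM-DD.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("2019-01-01", 100),
        ("2019-01-05", 200),
        ("2019-01-12", 50),
        ("2019-01-25", 25),
        ("2019-01-31", 200),
    ],
)

# lag(sales) looks at the previous row in date order; the first row has none.
rows = conn.execute(
    "SELECT date, sales - LAG(sales) OVER (ORDER BY date) AS dod_move "
    "FROM t ORDER BY date"
).fetchall()
for row in rows:
    print(row)
```

The first row comes back with None for dod_move, which corresponds to the "-" in table B.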

TSQL reduce the amount of data returned by a query to a parametric defined sample

I have a table containing a large amount of data which is stored on change.
tbl_bigOne
----------
timestamp | var01 | var02 | ...
2016-01-14 15:20:21 | 10.1 | 100.6 | ...
2016-01-14 15:20:26 | 11.2 | 110.3 | ...
2016-01-14 15:21:27 | 52.1 | 620.1 | ...
2016-01-14 15:35:00 | 13.5 | 230.6 | ...
...
2016-01-15 09:18:01 | 94.4 | 140.0 | ...
2016-01-15 10:01:15 | 105.3 | 188.7 | ...
...
and so on for years of data
What I would like to obtain is a query/stored procedure that given two datetime references (date_from and date_to) gives the required selected data.
That query on its own is pretty straightforward. What I would also like is to cap the number of rows returned per day (if data is available) by averaging the values.
Let's give a few examples:
date_from: 2016-01-14 00:00:00
date_to: 2016-01-20 23:59:59
max_points:12
In this case the time window is 7 days, and I would like a maximum of 12 rows for each of those days, giving a maximum total of 84 rows. Each day's data is split into 12 partitions and the values in each partition are averaged.
You can think of this partitioning as averaging every two hours' worth of data for that day, with each average producing one of the 12 rows required for the day.
date_from: 2016-01-14 00:00:00
date_to: 2016-01-14 23:59:59
max_points:1440
In this case the time window is one day and, if available, I would like a maximum of 1440 rows (for each day) for the selected period.
In this way the parameter defines the maximum number of rows for each day. The minimum time window is one day, nothing below that.
Can something like this be achieved just using TSQL?
Thank you.
Edited to address the observations raised by @Thorsten Kettner.
Use the analytic function ROW_NUMBER() to number the matching rows per day. Then only keep rows up to the given limit. If you want the rows arbitrarily chosen when there exist more than needed, then number the rows in random order using NEWID().
select timestmp, var01, var02, var03
from
(
select
mytable.*,
row_number() over (partition by convert(date, timestmp) order by newid()) as rn
from mytable
where convert(date, timestmp) between @start_date and @end_date
) numbered
where rn <= @limit
order by timestmp;
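Here is a rough way to try the ROW_NUMBER() idea locally, using Python's sqlite3 module rather than SQL Server. The dialect differs, so date(...) and random() stand in for convert(date, ...) and newid(), and the sample rows and the limit of 4 are made up for illustration. It assumes SQLite 3.25 or newer for window functions. Note that this caps the rows per day by random sampling; it does not average the values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (timestmp TEXT, var01 REAL)")
# Made-up sample: 10 rows on the first day, 3 rows on the second.
rows = [(f"2016-01-14 15:{m:02d}:00", float(m)) for m in range(10)]
rows += [(f"2016-01-15 09:{m:02d}:00", float(m)) for m in range(3)]
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

limit = 4  # hypothetical stand-in for the limit parameter
sampled = conn.execute(
    """
    SELECT timestmp, var01
    FROM (
        SELECT mytable.*,
               ROW_NUMBER() OVER (PARTITION BY date(timestmp)
                                  ORDER BY random()) AS rn
        FROM mytable
        WHERE date(timestmp) BETWEEN ? AND ?
    ) AS numbered
    WHERE rn <= ?
    ORDER BY timestmp
    """,
    ("2016-01-14", "2016-01-15", limit),
).fetchall()

# Count how many rows survived for each day.
per_day = {}
for timestmp, _ in sampled:
    day = timestmp[:10]
    per_day[day] = per_day.get(day, 0) + 1
print(per_day)  # {'2016-01-14': 4, '2016-01-15': 3}
```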

GROUP BY several hours

I have a table where our product records its activity log. The product starts working at 23:00 every day and usually works one or two hours. This means that once a batch started at 23:00, it finishes about 1:00am next day.
Now, I need statistics on how many posts are registered per batch, but I cannot figure out a script that would allow me to achieve this. So far I have the following SQL code:
SELECT COUNT(*), DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
which results in the following:
count day hour
....
1189 9 23
8611 10 0
2754 10 23
6462 11 0
1885 11 23
I.e. I want the number for 9th 23:00 grouped with the number for 10th 00:00, 10th 23:00 with 11th 00:00 and so on. How could I do it?
You can do it very easily. Use DATEADD to add an hour to the original registrationtime. All the registration times of one batch then fall on the same day, and you can simply group by the date part.
You could also do it in a more complicated way using CASE WHEN, but that's overkill given this easy solution.
I had to do something similar a few days ago. I had fixed timespans for work shifts to group by where one of them could start on one day at 10pm and end the next morning at 6am.
What I did was:
Define a "shift date" for every entry in the table, which was simply the day on which the shift started, with the time set to zero. I did this by checking whether the timestamp of the entry was between 0am and 6am. In that case I took only the date part of DATEADD(dd, -1, entryDate), which returned the previous day for all entries between 0am and 6am.
I also added an ID for the shift: 0 for the first one (6am to 2pm), 1 for the second one (2pm to 10pm) and 2 for the last one (10pm to 6am).
I was then able to group over the shift date and shift IDs.
Example:
Consider the following source entries:
Timestamp SomeData
=============================
2014-09-01 06:01:00 5
2014-09-01 14:01:00 6
2014-09-02 02:00:00 7
Step one extended the table as follows:
Timestamp SomeData ShiftDay
====================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00
2014-09-01 14:01:00 6 2014-09-01 00:00:00
2014-09-02 02:00:00 7 2014-09-01 00:00:00
Step two extended the table as follows:
Timestamp SomeData ShiftDay ShiftID
==============================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00 0
2014-09-01 14:01:00 6 2014-09-01 00:00:00 1
2014-09-02 02:00:00 7 2014-09-01 00:00:00 2
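The two steps above can be sketched in Python to make the rule explicit. The function name shift_key is hypothetical, and the shift boundaries (6am, 2pm, 10pm) are taken from the description; this illustrates the logic and is not the SQL that was actually used.

```python
from datetime import date, datetime, timedelta
from typing import Tuple


def shift_key(timestamp: datetime) -> Tuple[date, int]:
    """Return (shift_day, shift_id) for a timestamp."""
    hour = timestamp.hour
    # Entries between 0am and 6am belong to the shift that began the previous day.
    if hour < 6:
        shift_day = timestamp.date() - timedelta(days=1)
    else:
        shift_day = timestamp.date()
    if 6 <= hour < 14:
        shift_id = 0  # 6am to 2pm
    elif 14 <= hour < 22:
        shift_id = 1  # 2pm to 10pm
    else:
        shift_id = 2  # 10pm to 6am
    return shift_day, shift_id


# The three source entries from the example.
for stamp in (
    datetime(2014, 9, 1, 6, 1),
    datetime(2014, 9, 1, 14, 1),
    datetime(2014, 9, 2, 2, 0),
):
    print(stamp, shift_key(stamp))
```

Grouping by the returned pair then keeps a night shift together even though it spans two calendar days.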
If you add one hour to registrationtime, you will be able to group by the date part:
GROUP BY
CAST(DATEADD(HOUR, 1, registrationtime) AS date)
If the starting hour must be reflected accurately in the output (as 9, 23, 10, 23 rather than as 10, 0, 11, 0), you could obtain it as MIN(registrationtime) in the SELECT clause:
SELECT
count = COUNT(*),
day = DATEPART(DAY, MIN(registrationtime)),
hour = DATEPART(HOUR, MIN(registrationtime))
Finally, in case you are not aware, you can reference columns by their aliases in ORDER BY:
ORDER BY
day,
hour
just so that you do not have to repeat the expressions.
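To see what the one-hour shift does to the numbers in the question, here is a small Python sketch that applies the same grouping to the hourly counts shown there. The year and month (September 2014) are assumed from the WHERE clause, since the output only lists day and hour.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (start of hour, count) pairs taken from the query output in the question;
# year and month are assumptions based on the WHERE clause.
hourly_counts = [
    (datetime(2014, 9, 9, 23), 1189),
    (datetime(2014, 9, 10, 0), 8611),
    (datetime(2014, 9, 10, 23), 2754),
    (datetime(2014, 9, 11, 0), 6462),
    (datetime(2014, 9, 11, 23), 1885),
]

batch_totals = defaultdict(int)
for hour_start, count in hourly_counts:
    # Same idea as CAST(DATEADD(HOUR, 1, registrationtime) AS date).
    batch_day = (hour_start + timedelta(hours=1)).date()
    batch_totals[batch_day] += count

totals = {day.isoformat(): total for day, total in sorted(batch_totals.items())}
print(totals)  # {'2014-09-10': 9800, '2014-09-11': 9216, '2014-09-12': 1885}
```

The 23:00 rows move onto the following day, so each batch ends up under a single date.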
The query below will give you what you are expecting.
;WITH CTE AS
(
SELECT COUNT(*) Count, DATEPART(DAY,registrationtime) Day,DATEPART(HOUR,registrationtime) Hour,
RANK() over (partition by DATEPART(HOUR,registrationtime) order by DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)) Batch_ID
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
)
SELECT SUM(COUNT) Count,Batch_ID
FROM CTE
GROUP BY Batch_ID
ORDER BY Batch_ID
You can write CASE expressions as below:
CASE WHEN DATEPART(HOUR,registrationtime) = 23
THEN DATEPART(DAY,registrationtime)+1
ELSE DATEPART(DAY,registrationtime)
END,
CASE WHEN DATEPART(HOUR,registrationtime) = 23
THEN 0
ELSE DATEPART(HOUR,registrationtime)
END