Count events with a cool-down period after each instance - sql

In a Postgres DB I have entries for "events", associated with an id, and when they happened. I need to count them with a special rule.
When an event happens the counter is incremented and for the next 14 days all events of this type are not counted.
Example:
event  created_at        blockdate  action
16     2021-11-11 11:15  25.11.21   count
16     2021-11-11 11:15  25.11.21   block
16     2021-11-13 10:45  25.11.21   block
16     2021-11-16 10:40  25.11.21   block
16     2021-11-23 11:15  25.11.21   block
16     2021-11-23 11:15  25.11.21   block
16     2021-12-10 13:00  24.12.21   count
16     2021-12-15 13:25  24.12.21   block
16     2021-12-15 13:25  24.12.21   block
16     2021-12-15 13:25  24.12.21   block
16     2021-12-20 13:15  24.12.21   block
16     2021-12-23 13:15  24.12.21   block
16     2021-12-31 13:25  14.01.22   count
16     2022-02-05 15:00  19.02.22   count
16     2022-02-05 15:00  19.02.22   block
16     2022-02-13 17:15  19.02.22   block
16     2022-02-21 10:09  07.03.22   count
43     2021-11-26 11:00  10.12.21   count
43     2022-01-01 15:00  15.01.22   count
43     2022-04-13 10:07  27.04.22   count
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:09  27.04.22   block
43     2022-04-13 10:10  27.04.22   block
43     2022-04-13 10:10  27.04.22   block
43     2022-04-13 10:47  27.04.22   block
43     2022-05-11 20:25  25.05.22   count
75     2021-10-21 12:50  04.11.21   count
75     2021-11-02 12:50  04.11.21   block
75     2021-11-18 11:15  02.12.21   count
75     2021-11-18 12:55  02.12.21   block
75     2021-11-18 16:35  02.12.21   block
75     2021-11-24 11:00  02.12.21   block
75     2021-12-01 11:00  02.12.21   block
75     2021-12-14 13:25  28.12.21   count
75     2021-12-15 13:35  28.12.21   block
75     2021-12-26 13:25  28.12.21   block
75     2022-01-31 15:00  14.02.22   count
75     2022-02-02 15:30  14.02.22   block
75     2022-02-03 15:00  14.02.22   block
75     2022-02-17 15:00  03.03.22   count
75     2022-02-17 15:00  03.03.22   block
75     2022-02-18 15:00  03.03.22   block
75     2022-02-23 15:00  03.03.22   block
75     2022-02-25 15:00  03.03.22   block
75     2022-03-04 10:46  18.03.22   count
75     2022-03-08 21:05  18.03.22   block
In Excel I simply add two columns. In one column I carry over a "blockdate", the date until which events have to be blocked. In the other column I compare the ID with the previous ID and the previous "blockdate".
When the IDs are different or the blockdate is less than the current date, I have to count. When I have to count, I set the row's blockdate to the current date + 14 days; otherwise I carry over the previous blockdate.
I tried now to solve this in Postgres with ...
window functions
recursive CTEs
lateral joins
... and all seemed a bit promising, but in the end I failed to implement this tricky count.
For example, my recursive CTE failed with:
aggregate functions are not allowed in WHERE
with recursive event_count AS (
    select event
         , min(created_at) as created
    from test
    group by event

    union all

    ( select event
           , created_at as created
      from test
      join event_count using (event)
      where created_at >= max(created) + INTERVAL '14 days'
      order by created_at
      limit 1
    )
)
select * from event_count
Window functions using lag() to access the previous row don't seem to work either, because they cannot access columns in the previous row that were themselves created by the window function.
Adding "block-or-count" information when a new event entry is inserted, by simply comparing with the last entry, wouldn't solve the issue either, as event entries "go away" after about half a year. So when the first entry goes away, the next one becomes the first, and the logic has to be applied to the new situation.
Above test data can be created with:
CREATE TABLE test (
event INTEGER,
created_at TIMESTAMP
);
INSERT INTO test (event, created_at) VALUES
(16, '2021-11-11 11:15'),(16, '2021-11-11 11:15'),(16, '2021-11-13 10:45'),(16, '2021-11-16 10:40'),
(16, '2021-11-23 11:15'),(16, '2021-11-23 11:15'),(16, '2021-12-10 13:00'),(16, '2021-12-15 13:25'),
(16, '2021-12-15 13:25'),(16, '2021-12-15 13:25'),(16, '2021-12-20 13:15'),(16, '2021-12-23 13:15'),
(16, '2021-12-31 13:25'),(16, '2022-02-05 15:00'),(16, '2022-02-05 15:00'),(16, '2022-02-13 17:15'),
(16, '2022-02-21 10:09'),
(43, '2021-11-26 11:00'),(43, '2022-01-01 15:00'),(43, '2022-04-13 10:07'),(43, '2022-04-13 10:09'),
(43, '2022-04-13 10:09'),(43, '2022-04-13 10:09'),(43, '2022-04-13 10:10'),(43, '2022-04-13 10:10'),
(43, '2022-04-13 10:47'),(43, '2022-05-11 20:25'),
(75, '2021-10-21 12:50'),(75, '2021-11-02 12:50'),(75, '2021-11-18 11:15'),(75, '2021-11-18 12:55'),
(75, '2021-11-18 16:35'),(75, '2021-11-24 11:00'),(75, '2021-12-01 11:00'),(75, '2021-12-14 13:25'),
(75, '2021-12-15 13:35'),(75, '2021-12-26 13:25'),(75, '2022-01-31 15:00'),(75, '2022-02-02 15:30'),
(75, '2022-02-03 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-17 15:00'),(75, '2022-02-18 15:00'),
(75, '2022-02-23 15:00'),(75, '2022-02-25 15:00'),(75, '2022-03-04 10:46'),(75, '2022-03-08 21:05');

This lends itself to a procedural solution, since it has to walk the whole history of existing rows for each event. But SQL can do it, too.
The best solution heavily depends on cardinalities, data distribution, and other circumstances.
Assuming unfavorable conditions:
Big table.
Unknown number and identity of relevant events (event IDs).
Many rows per event.
Some overlap the 14-day time frame, some don't.
Any number of duplicates possible.
You need an index like this one:
CREATE INDEX test_event_created_at_idx ON test (event, created_at);
Then the following query emulates an index-skip scan. If the table is vacuumed enough, it operates with index-only scans exclusively, in a single pass:
WITH RECURSIVE hit AS (
   (
   SELECT event, created_at
   FROM   test
   ORDER  BY event, created_at
   LIMIT  1
   )
   UNION ALL
   SELECT t.*
   FROM   hit h
   CROSS  JOIN LATERAL (
      SELECT t.event, t.created_at
      FROM   test t
      WHERE  (t.event, t.created_at)
           > (h.event, h.created_at + interval '14 days')
      ORDER  BY t.event, t.created_at
      LIMIT  1
      ) t
   )
SELECT count(*) AS hits FROM hit;
fiddle
I cannot stress enough how fast it's going to be. :)
It's a recursive CTE using a LATERAL subquery, all based on the magic of ROW value comparison (which not all major RDBMSs support properly).
Effectively, we make Postgres skip over the above index once and only take qualifying rows.
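If you need the count per event rather than one grand total, only the final SELECT changes; a small variation (not part of the original answer):
SELECT event, count(*) AS hits
FROM   hit
GROUP  BY event
ORDER  BY event;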
For detailed explanation, see:
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Efficiently selecting distinct (a, b) from big table
Optimize GROUP BY query to retrieve latest row per user (chapter 1a)
Different approach?
Like you mention yourself, the unfortunate task definition forces you to re-compute all newer rows for events where old data changes.
Consider working with a constant raster instead. Like a 14-day grid starting from Jan 1 every year. Then the state of each event could be derived from the local frame. Much cheaper and more reliable.
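One way to read that suggestion, as a rough sketch (the 14-day buckets anchored at Jan 1 of each year are an assumption, not part of the question):
-- counts at most one hit per event and per fixed 14-day frame
SELECT event
     , count(DISTINCT date_trunc('year', created_at)
             + ((extract(doy FROM created_at)::int - 1) / 14) * interval '14 days') AS hits
FROM   test
GROUP  BY event;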

I cannot think of how to do this without recursion.
with recursive ordered as ( -- Order and number the event instances
select event, created_at,
row_number() over (partition by event
order by created_at) as n
from test
), walk as (
-- Get and keep first instances
select event, created_at, n, created_at as current_base, true as keep
from ordered
where n = 1
union all
-- Carry base dates forward and mark records to keep
select c.event, c.created_at, c.n,
case
when c.created_at >= p.current_base + interval '14 days'
then c.created_at
else p.current_base
end as current_base,
(c.created_at >= p.current_base + interval '14 days') as keep
from walk p
join ordered c
on (c.event, c.n) = (p.event, p.n + 1)
)
select *
from walk
order by event, n;
Fiddle Here
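If you only need the resulting counts, you can aggregate the keep flag from the walk CTE above, e.g.:
SELECT event, count(*) FILTER (WHERE keep) AS hits
FROM   walk
GROUP  BY event
ORDER  BY event;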


Why does my cumulative column not work as expected?

Here is my query. I have a column called cum_balance which is supposed to calculate the cumulative balance, but after row number 10 there is an anomaly and it doesn't work as expected; all I notice is that from row number 10 onwards the hour column has the same value. What's the right syntax for this?
select
    hour,
    symbol,
    amount_usd,
    category,
    sum(amount_usd) over (
        order by hour asc
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cum_balance
from
    combined_transfers_usd_netflow
order by
    hour
I have tried removing RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, adding a partition by hour, and grouping by hour. None of them gave the expected result or an error.
Row Number  Hour                 SYMBOL  AMOUNT_USD    CATEGORY  CUM_BALANCE
1           2021-12-02 23:00:00  WETH    227.2795      in        227.2795
2           2021-12-03 00:00:00  WETH    -226.4801153  out       0.7993847087
3           2022-01-05 21:00:00  WETH    5123.716203   in        5124.515587
4           2022-01-18 14:00:00  WETH    -4466.2366    out       658.2789873
5           2022-01-19 00:00:00  WETH    2442.618599   in        3100.897586
6           2022-01-21 14:00:00  USDC    99928.68644   in        103029.584
7           2022-03-01 16:00:00  UNI     8545.36098    in        111574.945
8           2022-03-04 22:00:00  USDC    -2999.343     out       108575.602
9           2022-03-09 22:00:00  USDC    -5042.947675  out       103532.6543
10          2022-03-16 21:00:00  USDC    -4110.6579    out       98594.35101
11          2022-03-16 21:00:00  UNI     -3.209306045  out       98594.35101
12          2022-03-16 21:00:00  UNI     -16.04653022  out       98594.35101
13          2022-03-16 21:00:00  UNI     -16.04653022  out       98594.35101
14          2022-03-16 21:00:00  UNI     -16.04653022  out       98594.35101
15          2022-03-16 21:00:00  UNI     -6.418612089  out       98594.35101
The "problem" with your data in all the ORDER BY values after row 10 are the same.
So if we shrink the data down a little, and use for groups to repeat the experiment:
with data(grp, date, val) as (
select * from values
(1,'2021-01-01'::date, 10),
(1,'2021-01-02'::date, 11),
(1,'2021-01-03'::date, 12),
(2,'2021-01-01'::date, 20),
(2,'2021-01-02'::date, 21),
(2,'2021-01-02'::date, 22),
(2,'2021-01-04'::date, 23)
)
select d.*
,sum(val) over ( partition by grp order by date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as cum_val_1
,sum(val) over ( partition by grp order by date ) as cum_val_2
from data as d
order by 1,2;
we get:
GRP  DATE        VAL  CUM_VAL_1  CUM_VAL_2
1    2021-01-01  10   10         10
1    2021-01-02  11   21         21
1    2021-01-03  12   33         33
2    2021-01-01  20   20         20
2    2021-01-02  21   63         63
2    2021-01-02  22   63         63
2    2021-01-04  23   86         86
With group 1 we see the values accumulate as we expect. For group 2 we put in duplicate values and see that those rows get the same cumulative value, while the rows after them "work as expected again".
This tells us how the function behaves across unstable data (values that sort the same): they are all stepped up in one leap.
Thus, if you want each row to be different, the ORDER BY needs more distinctness. This could be forced by adding random values, but random values can themselves collide, so you really should use ROW_NUMBER or a sequence (SEQx) to get unique values.
Also, the second formula (without the explicit frame clause) gives the same results, which shows the problem is the ORDER BY, not the framing of which rows are used.
with data(grp, date, val) as (
select * from values
(1,'2021-01-01'::date, 10),
(1,'2021-01-02'::date, 11),
(1,'2021-01-03'::date, 12),
(2,'2021-01-01'::date, 20),
(2,'2021-01-02'::date, 21),
(2,'2021-01-02'::date, 22),
(2,'2021-01-04'::date, 23)
)
select d.*
,seq8() as s
,sum(val) over ( partition by grp order by date ) as cum_val_1
,sum(val) over ( partition by grp order by date, s ) as cum_val_2
,sum(val) over ( partition by grp order by date, seq8() ) as cum_val_3
from data as d
order by 1,2;
gives:
GRP  DATE        VAL  S  CUM_VAL_1  CUM_VAL_2
1    2021-01-01  10   0  10         10
1    2021-01-02  11   1  21         21
1    2021-01-03  12   2  33         33
2    2021-01-01  20   3  20         20
2    2021-01-02  21   4  63         41
2    2021-01-02  22   5  63         63
2    2021-01-04  23   6  86         86
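For completeness, a sketch of the ROW_NUMBER() tiebreaker mentioned above, on the same sample data (the extra CTE is needed because one window function cannot be nested inside another's ORDER BY):
with data(grp, date, val) as (
    select * from values
    (1,'2021-01-01'::date, 10),
    (1,'2021-01-02'::date, 11),
    (1,'2021-01-03'::date, 12),
    (2,'2021-01-01'::date, 20),
    (2,'2021-01-02'::date, 21),
    (2,'2021-01-02'::date, 22),
    (2,'2021-01-04'::date, 23)
), numbered as (
    select d.*,
        row_number() over (partition by grp order by date) as rn
    from data as d
)
select grp, date, val,
    sum(val) over (partition by grp order by date, rn) as cum_val
from numbered
order by grp, date, rn;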

The Average Number of Rides Completed in 4 Hours

I have a dataset with each ride having its own ride_id and its completion time. I want to know how many rides happen every 4 hours, on average.
Sample Dataset:
dropoff_datetime ride_id
2022-08-27 11:42:02 1715
2022-08-24 05:59:26 1713
2022-08-23 17:40:05 1716
2022-08-28 23:06:01 1715
2022-08-27 03:21:29 1714
For example, I would like to find out how many rides happened between 2022-08-27 12 PM and 2022-08-27 4 PM, and then again how many rides happened in the 4-hour period from 2022-08-27 4 PM to 2022-08-27 8 PM.
What I've tried:
I first truncate my dropoff_datetime into the hour. (DATE_TRUNC)
I then group by that hour to get the count of rides per hour.
Example Query:
Note: calling the above table - final.
SELECT DATE_TRUNC('hour', dropoff_datetime) as by_hour
,count(ride_id) as total_rides
FROM final
WHERE 1=1
GROUP BY 1
Result:
by_hour total_rides
2022-08-27 4:00:00 3756
2022-08-27 5:00:00 6710
My question is:
How can I make it so it's grouping every 4 hours instead?
The question actually consists of two parts: how to generate the date range and how to aggregate the data. One possible approach is to use the minimum and maximum dates in the data to generate the range and then join with the data again:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-24 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716)),
-- query part
min_max as (
select min(date_trunc('hour', dropoff_datetime)) d_min, max(date_trunc('hour', dropoff_datetime)) d_max
from dataset
),
date_ranges as (
select h
from min_max,
unnest (sequence(d_min, d_max, interval '4' hour)) t(h)
)
select h, count_if(ride_id is not null)
from date_ranges
left join dataset on dropoff_datetime between h and h + interval '4' hour
group by h
order by h;
Which produces the following output:
h                    _col1
2022-08-23 17:00:00  1
2022-08-23 21:00:00  0
2022-08-24 01:00:00  0
2022-08-24 05:00:00  2
2022-08-24 09:00:00  1
Note that this can be quite performance-intensive for a large amount of data.
Another approach is to pick some "reference point" and start counting from it, for example using the minimum date in the dataset:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-27 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716),
(timestamp '2022-08-28 23:06:01', 1715),
(timestamp '2022-08-27 03:21:29', 1714)),
-- query part
base_with_curr AS (
select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
date_trunc('hour', dropoff_datetime) dropoff_datetime
from dataset)
select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4)*4, base) as four_hour,
count(*)
from base_with_curr
group by 1;
Output:
four_hour            _col1
2022-08-23 17:00:00  1
2022-08-28 21:00:00  1
2022-08-24 05:00:00  2
2022-08-27 09:00:00  1
2022-08-27 01:00:00  1
Then you can use the sequence approach to generate missing dates if needed.
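A sketch of that, combining the reference-point bucketing with a generated sequence() grid so that empty 4-hour buckets show up as 0 (reusing the sample data and the Trino/Presto-style functions from above):
with dataset (dropoff_datetime, ride_id) AS
    (VALUES (timestamp '2022-08-27 11:42:02', 1715),
            (timestamp '2022-08-24 05:59:26', 1713),
            (timestamp '2022-08-24 05:29:26', 1712),
            (timestamp '2022-08-23 17:40:05', 1716)),
base_with_curr AS (
    select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
           date_trunc('hour', dropoff_datetime) dropoff_datetime
    from dataset),
bucketed as (
    select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4) * 4, base) as four_hour,
           count(*) as rides
    from base_with_curr
    group by 1),
grid as (
    select h
    from (select min(four_hour) d_min, max(four_hour) d_max from bucketed) m,
         unnest(sequence(d_min, d_max, interval '4' hour)) t(h))
select h, coalesce(rides, 0) as rides
from grid
left join bucketed on four_hour = h
order by h;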

Extracting minutes between two timestamps and assigning different weights

I'm breaking my head over how to achieve this in Teradata.
I have two tables, and I need to extract minutes from the Run table and assign hourly weights to them based on the Weights table.
Table 1: Run
Machine Begin End
A 1/1/2010 08:00 AM 1/1/2010 10:45 AM
B 1/2/2010 10:00 AM 1/2/2010 11:45 AM
Table 2: Weights
Weights are assigned for every hour (Record 1 says weight is 10 for every run min between 8am and 9am)
Hour Weight
1/1/2010 08:00 AM 10
1/1/2010 09:00 AM 15
1/1/2010 10:00 AM 16
1/1/2010 11:00 AM 20
1/1/2010 12:00 AM 25
Needed Result:
Mach Hour Weight Mins Total (Weight*Mins)
A 1/1/2010 08:00 AM 10 60 600
A 1/1/2010 09:00 AM 15 60 900
A 1/1/2010 10:00 AM 16 45 720
B 1/2/2010 10:00 AM 16 60 960
B 1/2/2010 11:00 AM 20 45 900
Any guidance appreciated. Thanks in advance.
Edit: Here are the sample tables
CREATE TABLE RUNS(NAME VARCHAR(50),START_DT timestamp(0),END_dt timestamp(0));
INSERT INTO RUNS VALUES ('A','2020-01-01 08:00:00','2020-01-01 10:15:00');
INSERT INTO RUNS VALUES ('B','2020-01-02 10:00:00','2020-01-02 11:45:00');
CREATE TABLE WEIGHTS(HOUR_MS timestamp(0),WEIGHT INTEGER);
INSERT INTO WEIGHTS VALUES ('2020-01-01 08:00:00', 10);
INSERT INTO WEIGHTS VALUES ('2020-01-01 09:00:00', 15);
INSERT INTO WEIGHTS VALUES ('2020-01-01 10:00:00', 16);
INSERT INTO WEIGHTS VALUES ('2020-01-01 11:00:00', 20);
INSERT INTO WEIGHTS VALUES ('2020-01-02 10:00:00', 20);
INSERT INTO WEIGHTS VALUES ('2020-01-02 11:00:00', 25);
This is a brute force approach using a non-equi-join based on OVERLAPS:
select
machine
,weight
-- get the number of minutes within the hour
,cast((interval(period(begin, end) p_intersect period(hour, hour + interval '1' hour)) minute(4)) as int) as mins
,mins * weight
from run join weights
on period(begin, end) overlaps period(hour, hour + interval '1' hour)
Explain will show a Product Join, which results in high CPU usage.
There's a smarter approach using EXPAND ON, but it's too late for me, maybe tomorrow :-)
An alternate approach using EXPAND ON in a subquery, followed by equality join:
SELECT machine
,TheHour
,weight
,CAST((INTERVAL(pd P_INTERSECT xpd) MINUTE(4)) AS INTEGER) mins
,mins*weight
FROM (
SELECT machine, PERIOD(begin, end) AS pd, xpd, BEGIN(xpd) AS begin_xpd
FROM run
EXPAND ON pd AS xpd
BY ANCHOR PERIOD ANCHOR_HOUR
) x
JOIN weights
ON begin_xpd = TheHour;

Showing results of a SELECT statement including every 30-min range of the day, filling 0's when no records exist

I have a table arrivals like this:
HHMM Car
---- ---
0001 01
0001 02
0001 03
0002 04
...
0029 20
0029 21
0030 22
...
0059 56
I need to know how many cars arrived at each range of 30 minutes.
I wrote a query like this:
WITH
PREVIOUS_QUERIES AS
(
-- in fact, "PREVIOUS_QUERIES" represent a sequence of queries where I get
-- the timestamp HHMM and convert it in a range where HOURS = HH and
-- MINUTES_START can be 0 (if MM<30) or 30 (if MM>=30).
),
INTERVALS AS
(
SELECT
TO_CHAR(HOURS,'FM00')||':'||TO_CHAR(MINUTES_START,'FM00')||' - '
||TO_CHAR(HOURS,'FM00')||':'||TO_CHAR(MINUTES_START +29,'FM00') AS INTERVAL,
CAR
FROM
PREVIOUS_QUERIES
)
SELECT
INTERVAL,
COUNT(DISTINCT CAR) AS CARS
FROM INTERVALS
GROUP BY INTERVAL
ORDER BY INTERVAL
;
My query produces the following results.
Interval Cars
------------- ----
00:00 - 00:29 21
00:30 - 00:59 35
01:00 - 01:29 41
02:30 - 02:59 5
03:00 - 03:29 12
03:30 - 03:59 13
...
That means, if there are no arrivals in some interval, my query doesn't show a line with Cars=0. I need these rows in my results:
01:30 - 01:59 0
02:00 - 02:29 0
How could I add these rows? Can it be done with a change to my query, or should this query be completely rewritten?
What I imagine is that I should generate the 48 ranges of 30 minutes, from '00:00-00:29' to '23:30-23:59', and then use them as input for the SELECT, but I don't know how to do it.
You could just select those 48 values into another CTE, something like 'cteIntervals', and then adjust the final query to be something like:
SELECT
I.Interval
, NVL(Q.Cars, 0) Cars
FROM
cteIntervals I
LEFT JOIN
(
-- Your current query
) Q
ON I.Interval = Q.Interval
This has the effect of creating a template for the final query to fit into.
For a slightly more dynamic solution you could look into producing cteIntervals using a recursive CTE, or you could store the values in a table, etc.
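For example, the 48 values can be generated on the fly. Here is a sketch assuming an Oracle-style dialect (the question already uses TO_CHAR and NVL), built with CONNECT BY; a recursive CTE would work just as well:
WITH cteIntervals AS
(
    SELECT TO_CHAR(TRUNC((LEVEL - 1) / 2), 'FM00') || ':' ||
           TO_CHAR(MOD(LEVEL - 1, 2) * 30, 'FM00') || ' - ' ||
           TO_CHAR(TRUNC((LEVEL - 1) / 2), 'FM00') || ':' ||
           TO_CHAR(MOD(LEVEL - 1, 2) * 30 + 29, 'FM00') AS INTERVAL
    FROM dual
    CONNECT BY LEVEL <= 48  -- 48 half-hour ranges: '00:00 - 00:29' ... '23:30 - 23:59'
)
SELECT INTERVAL FROM cteIntervals ORDER BY INTERVAL;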

GROUP BY several hours

I have a table where our product records its activity log. The product starts working at 23:00 every day and usually works for one or two hours. This means that a batch started at 23:00 finishes at about 1:00 AM the next day.
Now, I need to gather statistics on how many posts are registered per batch, but I cannot figure out a script that would allow me to achieve this. So far I have the following SQL code:
SELECT COUNT(*), DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
which results in the following:
count day hour
....
1189 9 23
8611 10 0
2754 10 23
6462 11 0
1885 11 23
I.e. I want the number for 9th 23:00 grouped with the number for 10th 00:00, 10th 23:00 with 11th 00:00 and so on. How could I do it?
You can do it very easily. Use DATEADD to add an hour to the original registrationtime. If you do so, all the registrationtimes will be moved to the same day, and you can simply group by the day part.
You could also do it in a more complicated way using CASE WHEN, but that's overkill in view of this easy solution.
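A minimal sketch of that idea against the table and filter from the question (the alias names are only for illustration):
SELECT COUNT(*) AS batch_count,
       CAST(DATEADD(HOUR, 1, registrationtime) AS date) AS batch_day
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY CAST(DATEADD(HOUR, 1, registrationtime) AS date)
ORDER BY batch_day;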
I had to do something similar a few days ago. I had fixed timespans for work shifts to group by where one of them could start on one day at 10pm and end the next morning at 6am.
What I did was:
Define a "shift date", which was simply the day with zero timestamp when the shift started for every entry in the table. I was able to do so by checking whether the timestamp of the entry was between 0am and 6am. In that case I took only the date of this DATEADD(dd, -1, entryDate), which returned the previous day for all entries between 0am and 6am.
I also added an ID for the shift. 0 for the first one (6am to 2pm), 1 for the second one (2pm to 10pm) and 3 for the last one (10pm to 6am).
I was then able to group over the shift date and shift IDs.
Example:
Consider the following source entries:
Timestamp SomeData
=============================
2014-09-01 06:01:00 5
2014-09-01 14:01:00 6
2014-09-02 02:00:00 7
Step one extended the table as follows:
Timestamp SomeData ShiftDay
====================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00
2014-09-01 14:01:00 6 2014-09-01 00:00:00
2014-09-02 02:00:00 7 2014-09-01 00:00:00
Step two extended the table as follows:
Timestamp SomeData ShiftDay ShiftID
==============================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00 0
2014-09-01 14:01:00 6 2014-09-01 00:00:00 1
2014-09-02 02:00:00 7 2014-09-01 00:00:00 2
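A sketch of the shift-date / shift-ID computation described above in T-SQL; the table and column names (ShiftLog, EntryTime) are assumptions for illustration:
SELECT x.ShiftDay, x.ShiftID, COUNT(*) AS Entries
FROM ShiftLog t
CROSS APPLY (
    SELECT
        -- entries between 0am and 6am belong to the shift that started the previous day
        ShiftDay = CAST(CASE WHEN DATEPART(HOUR, t.EntryTime) < 6
                             THEN DATEADD(dd, -1, t.EntryTime)
                             ELSE t.EntryTime END AS date),
        -- 0 = 6am-2pm, 1 = 2pm-10pm, 2 = 10pm-6am
        ShiftID = CASE WHEN DATEPART(HOUR, t.EntryTime) < 6  THEN 2
                       WHEN DATEPART(HOUR, t.EntryTime) < 14 THEN 0
                       WHEN DATEPART(HOUR, t.EntryTime) < 22 THEN 1
                       ELSE 2 END
) x
GROUP BY x.ShiftDay, x.ShiftID
ORDER BY x.ShiftDay, x.ShiftID;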
If you add one hour to registrationtime, you will be able to group by the date part:
GROUP BY
CAST(DATEADD(HOUR, 1, registrationtime) AS date)
If the starting hour must be reflected accurately in the output (as 9, 23, 10, 23 rather than as 10, 0, 11, 0), you could obtain it as MIN(registrationtime) in the SELECT clause:
SELECT
count = COUNT(*),
day = DATEPART(DAY, MIN(registrationtime)),
hour = DATEPART(HOUR, MIN(registrationtime))
Finally, in case you are not aware, you can reference columns by their aliases in ORDER BY:
ORDER BY
day,
hour
just so that you do not have to repeat the expressions.
The query below will give you what you are expecting:
;WITH CTE AS
(
SELECT COUNT(*) Count, DATEPART(DAY,registrationtime) Day,DATEPART(HOUR,registrationtime) Hour,
RANK() over (partition by DATEPART(HOUR,registrationtime) order by DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)) Batch_ID
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
)
SELECT SUM(COUNT) Count,Batch_ID
FROM CTE
GROUP BY Batch_ID
ORDER BY Batch_ID
You can write a CASE statement as below:
CASE WHEN DATEPART(HOUR, registrationtime) = 23
     THEN DATEPART(DAY, registrationtime) + 1
     ELSE DATEPART(DAY, registrationtime)
END,
CASE WHEN DATEPART(HOUR, registrationtime) = 23
     THEN 0
     ELSE DATEPART(HOUR, registrationtime)
END
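Used in a full query against the question's table, the same expression goes in both the SELECT and the GROUP BY (a sketch; the hour CASE above can be added the same way if you also want it in the output):
SELECT COUNT(*) AS cnt,
       CASE WHEN DATEPART(HOUR, registrationtime) = 23
            THEN DATEPART(DAY, registrationtime) + 1
            ELSE DATEPART(DAY, registrationtime) END AS batch_day
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY CASE WHEN DATEPART(HOUR, registrationtime) = 23
              THEN DATEPART(DAY, registrationtime) + 1
              ELSE DATEPART(DAY, registrationtime) END
ORDER BY batch_day;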