Grouping/aggregating SQL results into 1-hour buckets

Similar to this question, I need to group a large number of records into 1-hour "buckets". For example, let's say I've got a typical ORDER table with a datetime attached to each order. And I want to see the total number of orders per hour. So I'm using SQL roughly like this:
SELECT datepart(hh, order_date), COUNT(order_id)
FROM ORDERS
GROUP BY datepart(hh, order_date)
The problem is that if there are no orders in a given 1-hour "bucket", no row is emitted into the result set. I'd like the result set to have a row for each of the 24 hours, but if no orders were made during a particular hour, just record the number of orders as 0.
Is there any way to do this in a single query?
See also Getting Hourly Statistics Using SQL.

Some of the previous answers recommend using a table of hours and populating it using a UNION query; this can be better done with a Common Table Expression:
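; WITH [Hours] ([Hour]) AS
(
-- ROW_NUMBER() starts at 1, but DATEPART(hh, ...) returns 0-23, so shift by one
SELECT TOP 24 ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 AS [Hour]
FROM sys.objects
ORDER BY [object_id]
)
-- ISNULL turns the NULL produced by the outer join into 0 for empty hours
SELECT h.[Hour], ISNULL(o.[OrderCount], 0) AS [OrderCount]
FROM [Hours] h
LEFT OUTER JOIN (
SELECT datepart(hh, order_date) as [Hour], COUNT(order_id) as [OrderCount]
FROM Orders
GROUP BY datepart(hh, order_date)
) o
ON h.[Hour] = o.[Hour]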

You need to have a pre-populated table (or a function returning a table result set) to join with, that contains all the 1-hour slots you want in your result.
Then you do an OUTER JOIN against it, and every slot shows up in the result. Counting a column from the ORDERS side gives 0 for slots with no orders.
Something like this:
SELECT SLOT_HOUR, COUNT(order_id)
FROM
ONEHOURSLOTS
LEFT JOIN ORDERS ON DATEPART(hh, order_date) = SLOT_HOUR
GROUP BY SLOT_HOUR
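For what it's worth, the slot-table join can be tried end to end. Below is a minimal sketch in Python using the standard-library sqlite3 module rather than SQL Server, so strftime('%H', ...) stands in for DATEPART(hh, ...). The table onehourslots, its column slot_hour, and the sample orders are made up for illustration. It counts with COUNT(order_id) so that hours without orders come out as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    INSERT INTO orders (order_date) VALUES
        ('2024-01-01 00:15:00'),
        ('2024-01-01 09:05:00'),
        ('2024-01-01 09:45:00'),
        ('2024-01-02 23:59:00');
    -- Hypothetical slot table holding the hours 0..23.
    CREATE TABLE onehourslots (slot_hour INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO onehourslots VALUES (?)", [(h,) for h in range(24)])

# LEFT JOIN keeps every slot; COUNT(o.order_id) ignores the NULLs that the
# outer join produces for hours without orders, so those hours count as 0.
rows = conn.execute("""
    SELECT s.slot_hour, COUNT(o.order_id) AS order_count
    FROM onehourslots AS s
    LEFT JOIN orders AS o
           ON CAST(strftime('%H', o.order_date) AS INTEGER) = s.slot_hour
    GROUP BY s.slot_hour
    ORDER BY s.slot_hour
""").fetchall()

counts = dict(rows)
print(len(rows), counts[9], counts[10])
```

Note that the hour is taken regardless of the day, just like in the question, so orders from different days fall into the same bucket.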

Create a table of hours, either persisted or even synthesized 'on the fly':
SELECT h.hour, COALESCE(s.order_count, 0) AS order_count
FROM (
SELECT 0 as hour
UNION ALL SELECT 1
UNION ALL SELECT 2
...
UNION ALL SELECT 23) as h
LEFT OUTER JOIN (
SELECT datepart(hh, order_date) as hour, COUNT(order_id) as order_count
FROM ORDERS
GROUP BY datepart(hh, order_date) ) as s
ON h.hour = s.hour;

Related

SQL: Difference between consecutive rows

I have a table with 3 columns: order id, member id, order date.
I need to pull the distribution of orders broken down by the number of days between two consecutive orders for each member id.
What I have is this:
SELECT
a1.member_id,
count(distinct a1.order_id) as num_orders,
a1.order_date,
DATEDIFF(DAY, a1.order_date, a2.order_date) as days_since_last_order
from orders as a1
inner join orders as a2
on a2.member_id = a1.member_id+1;
It's not helping me completely as the output I need is:
You can use lag() to get the date of the previous order by the same customer:
select o.*,
datediff(
order_date,
lag(order_date) over(partition by member_id order by order_date, order_id)
) days_diff
from orders o
When there are two rows for the same date, the smallest order_id is considered first. Also note that I fixed your datediff() syntax: in Hive, the function just takes two dates, and no unit.
I just don't follow the logic you want to use to compute num_orders.
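In case it helps to see the lag() approach run, here is a rough sketch using Python's sqlite3 module instead of Hive (window functions need SQLite 3.25 or newer). SQLite has no datediff(), so the difference of two julianday() values is used in its place. The sample rows are invented, and the final Counter step is only my guess at what "distribution" means in the question.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, member_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 10, '2020-01-01'),
        (2, 10, '2020-01-04'),
        (3, 10, '2020-01-04'),
        (4, 20, '2020-02-10'),
        (5, 20, '2020-03-01');
""")

# LAG() looks at the previous order of the same member; the first order of
# each member has no predecessor, so its days_diff is NULL (None in Python).
rows = conn.execute("""
    SELECT order_id,
           member_id,
           CAST(julianday(order_date)
                - julianday(LAG(order_date) OVER (
                      PARTITION BY member_id
                      ORDER BY order_date, order_id)) AS INTEGER) AS days_diff
    FROM orders
    ORDER BY member_id, order_date, order_id
""").fetchall()

# How many orders fall into each gap size, ignoring first orders.
distribution = Counter(d for _, _, d in rows if d is not None)
print(rows)
print(distribution)
```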
Maybe something like this:
SELECT
a1.member_id,
count(distinct a1.order_id) as num_orders,
a1.order_date,
DATEDIFF(DAY, a2.order_date, a1.order_date) as days_since_last_order
from orders as a1
inner join orders as a2
on a2.member_id = a1.member_id
and a2.order_date < a1.order_date
-- keep only the pair where a2 is the immediately preceding order of the same member
where not exists (
select 1
from orders as intermediate_order
where intermediate_order.member_id = a1.member_id
and intermediate_order.order_date < a1.order_date and intermediate_order.order_date > a2.order_date)
group by a1.member_id, a1.order_date, a2.order_date;

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that generates the range of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invocies_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' are the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('FIRST 10' or 'LAST 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the top 10 for its month (counting from either direction)
Add an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
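Here is a small way to try the ranking idea. This sketch runs on Python's sqlite3 module, not PostgreSQL (window functions need SQLite 3.25 or newer), and it uses ROW_NUMBER() instead of dense_rank(). That is my substitution: when several invoices share a date, dense_rank() gives them the same rank and can let more than N rows through. N is 2 here instead of 10 to keep the invented sample data short, and the final grouping is done in Python rather than with ARRAY_AGG.

```python
import sqlite3

N = 2  # the thread uses 10; 2 keeps the sample data small

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, invoice_date TEXT);
    INSERT INTO invoice VALUES
        (1, '2018-01-02'), (2, '2018-01-05'), (3, '2018-01-11'),
        (4, '2018-01-20'), (5, '2018-01-28'),
        (6, '2018-02-03'), (7, '2018-02-14'), (8, '2018-02-25');
""")

# Number each invoice within its month from both ends, then keep the rows
# that are within N of either end.
rows = conn.execute("""
    WITH ranked AS (
        SELECT id,
               strftime('%Y-%m', invoice_date) AS invoice_month,
               ROW_NUMBER() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                  ORDER BY invoice_date, id) AS month_rank,
               ROW_NUMBER() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                  ORDER BY invoice_date DESC, id DESC) AS month_reverse_rank
        FROM invoice
    )
    SELECT invoice_month,
           CASE WHEN month_rank <= :n THEN 'beginning' ELSE 'end' END AS time_of_month,
           id
    FROM ranked
    WHERE month_rank <= :n OR month_reverse_rank <= :n
    ORDER BY invoice_month, id
""", {"n": N}).fetchall()

# Collect the ids per (month, beginning/end) pair.
result = {}
for month, part, invoice_id in rows:
    result.setdefault((month, part), []).append(invoice_id)
print(result)
```

February has only three invoices, fewer than 2 * N, so the middle one is labelled "beginning" and only one is left for "end", which matches the caveat above.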

cross join to get all dates and hours and avoid duplicate values

We have 2 tables:
sales
hourt (only 1 field (hourt) of numbers: 0 to 23)
The goal is to list all dates and all 24 hours for each day and group hours that have sales. For hours that do not have sales, zero will be shown.
This query cross joins the sales table with the hourt table and does list all dates and 24 hours. However, there are also many duplicate rows. How can we avoid the duplicates?
We're using Amazon Redshift (based on Postgres 8.0).
with h as (
SELECT
a.purchase_date,
CAST(DATE_PART("HOUR", AT_TIME_ZONE(AT_TIME_ZONE(CAST(a.purchase_date AS
DATETIME), "0:00"), "PST")) as INTEGER) AS Hour,
COUNT(a.quantity) AS QtyCount,
SUM(a.quantity) AS QtyTotal,
SUM(a.price) AS Price
FROM sales a
GROUP BY CAST(DATE_PART("HOUR",
AT_TIME_ZONE(AT_TIME_ZONE(CAST(a.purchase_date AS DATETIME), "0:00"),
"PST")) as INTEGER),
DATE_FORMAT(AT_TIME_ZONE(AT_TIME_ZONE(CAST(a.purchase_date AS DATETIME),
"0:00"), "PST"), "yyyy-MM-dd")
ORDER by a.purchase_date
),
hr as (
SELECT
CAST(hourt AS INTEGER) AS hourt
FROM hourt
),
joined as (
SELECT
purchase_date,
hourt,
QtyCount,
QtyTotal,
Price
FROM h
cross JOIN hr
)
SELECT *
FROM joined
Order by purchase_date,hourt
Sample Tables:
Before the cross join, query returned correct sales and grouped hours, as seen in the below table.
Desired results table:
You need to create a series of all the hour values and left join your data back to it. Comments inline explain the logic.
WITH data AS (-- Do the basic aggregation first
SELECT DATE_TRUNC('hour',a.purchase_date) purchase_hour --Truncate timestamp to the hour is simpler
,COUNT(a.quantity) AS qty_count --Names match the columns used in the final SELECT
,SUM(a.quantity) AS qty_total
,SUM(a.price) AS price
FROM sales a
GROUP BY DATE_TRUNC('hour',a.purchase_date)
ORDER BY DATE_TRUNC('hour',a.purchase_date)
-- SELECT '2017-01-13 12:00:00'::TIMESTAMP purchase_hour, 1 qty_count, 1 qty_total, 119 price
-- UNION ALL SELECT '2017-01-13 15:00:00'::TIMESTAMP purchase_hour, 1 qty_count, 1 qty_total, 119 price
-- UNION ALL SELECT '2017-01-14 21:00:00'::TIMESTAMP purchase_hour, 1 qty_count, 1 qty_total, 119 price
)
,time_range AS (--Calculate the start and end **date** values
SELECT DATE_TRUNC('day',MIN(purchase_hour)) start_date
, DATE_TRUNC('day',MAX(purchase_hour))+1 end_date
FROM data
)
,hr AS (--Generate all hours between start and end
SELECT (SELECT start_date
FROM time_range
LIMIT 1) --Limit 1 so the optimizer knows it's not a correlated subquery
+ ((n-1) --Make the series start at zero so we don't miss the starting value
* INTERVAL '1 hour') AS "hour"
FROM (SELECT ROW_NUMBER() OVER () n
FROM stl_query --Can use any table here as long as it has enough rows
LIMIT 100) series
WHERE "hour" < (SELECT end_date FROM time_range LIMIT 1)
)
--Use NVL to replace missing values with zeroes
SELECT hr.hour AS purchase_hour --Timestamp like `2017-01-13 12:00:00`
, NVL(data.qty_count, 0) AS qty_count
, NVL(data.qty_total, 0) AS qty_total
, NVL(data.price, 0) AS price
FROM hr
LEFT JOIN data
ON hr.hour = data.purchase_hour
ORDER BY hr.hour
;
I achieved the desired results by using Left Join (table A with table B) instead of Cross Join of these two tables:
Table A has all the dates and hours
Table B is the first part of the original query
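To make the "all dates and hours LEFT JOIN the aggregated sales" idea concrete, here is a hedged sketch using Python's sqlite3 module rather than Redshift. Since there is no stl_query table to count rows from, a recursive CTE generates the hour series instead; the three sample sales are invented, and time zones are ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (purchase_date TEXT, quantity INTEGER, price REAL);
    INSERT INTO sales VALUES
        ('2017-01-13 12:10:00', 1, 119),
        ('2017-01-13 12:40:00', 2, 238),
        ('2017-01-14 21:05:00', 1, 119);
""")

rows = conn.execute("""
    WITH RECURSIVE
    data AS (   -- one row per hour that actually has sales
        SELECT strftime('%Y-%m-%d %H:00:00', purchase_date) AS purchase_hour,
               COUNT(quantity) AS qty_count,
               SUM(quantity)   AS qty_total,
               SUM(price)      AS price
        FROM sales
        GROUP BY strftime('%Y-%m-%d %H:00:00', purchase_date)
    ),
    hr(slot) AS (   -- every hour from the first sale day through the last one
        SELECT (SELECT datetime(MIN(purchase_date), 'start of day') FROM sales)
        UNION ALL
        SELECT datetime(slot, '+1 hour') FROM hr
        WHERE datetime(slot, '+1 hour')
              < (SELECT datetime(MAX(purchase_date), 'start of day', '+1 day') FROM sales)
    )
    SELECT hr.slot,
           COALESCE(data.qty_count, 0),
           COALESCE(data.qty_total, 0),
           COALESCE(data.price, 0)
    FROM hr
    LEFT JOIN data ON data.purchase_hour = hr.slot
    ORDER BY hr.slot
""").fetchall()

by_hour = {slot: values for slot, *values in rows}
print(len(rows), by_hour["2017-01-13 12:00:00"], by_hour["2017-01-13 13:00:00"])
```

Because the sales are aggregated before the join, each hour appears exactly once, which is what avoids the duplicates the cross join produced.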

Joining 3 grouped values

I have 3 queries, each against a different table, that count the number of rows per company during a certain month. All three return the same columns: qty, month and company_name.
Instead of this, I need to return 1 table with the same 3 columns, where qty is the sum of the values from all 3 separate queries.
Can you suggest the best way to join or execute them so that I do not lose execution speed?
Here is an example of one of the queries. The other 2 queries have exactly the same syntax, except that instead of T_CUSTSK they use T_CUSTSK2 and T_CUSTSK3:
SELECT
COUNT(*) as qty,
DATEPART (MONTH, [start_date]) AS [month],
T_SYSCOM.company_name
FROM
T_CUSTSK
INNER JOIN
T_SYSCOM ON T_CUSTSK.company_id = T_SYSCOM.company_id
WHERE
DATEPART (MONTH, [start_date]) = 12
GROUP BY
DATEPART (MONTH, [start_date]), T_SYSCOM.company_name
ORDER BY
month, qty DESC
UNION ALL will combine your queries.
You can then wrap the combined statement in an outer query that sums. The ORDER BY moves to the outer query, and the derived table needs an alias:
select sum(qty) as qty, month, company_name from (
Select count(*) as qty, datepart(month, [start_date]) as [month], T_SYSCOM.company_name
from T_CUSTSK
INNER JOIN T_SYSCOM
ON T_CUSTSK.company_id=T_SYSCOM.company_id
where datepart(month, [start_date])=12
group by datepart(month, [start_date]), T_SYSCOM.company_name
union all
<second query>
union all
<third query> ) as combined
group by month, company_name
order by month, qty DESC
Alternatively, you could UNION ALL the T_CUSTSKx tables in a subquery and join that result once with the T_SYSCOM table.
I'm assuming start_date is in the T_CUSTSK tables.
Select count(*) as qty, datepart(month, [start_date]) as [month], T2.company_name
from (
SELECT company_id,start_date FROM T_CUSTSK
UNION ALL
SELECT company_id,start_date FROM T_CUSTSK2
UNION ALL
SELECT company_id,start_date FROM T_CUSTSK3
) T1
INNER JOIN T_SYSCOM T2
ON T1.company_id=T2.company_id
where datepart(month, [start_date])=12
group by datepart(month, [start_date]), T2.company_name
order by month, qty DESC
kindly let me know if this works
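If you want to check the UNION ALL-then-join version on toy data first, this is a minimal sketch with Python's sqlite3 module instead of SQL Server, so strftime('%m', ...) replaces DATEPART(MONTH, ...). The table names are from the question; the company names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_SYSCOM  (company_id INTEGER PRIMARY KEY, company_name TEXT);
    CREATE TABLE T_CUSTSK  (company_id INTEGER, start_date TEXT);
    CREATE TABLE T_CUSTSK2 (company_id INTEGER, start_date TEXT);
    CREATE TABLE T_CUSTSK3 (company_id INTEGER, start_date TEXT);
    INSERT INTO T_SYSCOM VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO T_CUSTSK  VALUES (1, '2016-12-01'), (1, '2016-12-15'), (2, '2016-11-30');
    INSERT INTO T_CUSTSK2 VALUES (1, '2016-12-20'), (2, '2016-12-05');
    INSERT INTO T_CUSTSK3 VALUES (2, '2016-12-31'), (2, '2016-12-24'), (2, '2016-12-25');
""")

# Stack the three tables first, then join to T_SYSCOM and count once.
rows = conn.execute("""
    SELECT COUNT(*) AS qty,
           CAST(strftime('%m', t.start_date) AS INTEGER) AS month,
           c.company_name
    FROM (
        SELECT company_id, start_date FROM T_CUSTSK
        UNION ALL
        SELECT company_id, start_date FROM T_CUSTSK2
        UNION ALL
        SELECT company_id, start_date FROM T_CUSTSK3
    ) AS t
    INNER JOIN T_SYSCOM AS c ON c.company_id = t.company_id
    WHERE CAST(strftime('%m', t.start_date) AS INTEGER) = 12
    GROUP BY CAST(strftime('%m', t.start_date) AS INTEGER), c.company_name
    ORDER BY qty DESC
""").fetchall()
print(rows)
```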

How to get value by a range of dates?

I have a table like so
With this code I get the 5 latest values for each DomainId:
;WITH grp AS
(
SELECT DomainId, [Date],Passed, DatabasePerformance,ServerPerformance,
rn = ROW_NUMBER() OVER
(PARTITION BY DomainId ORDER BY [Date] DESC)
FROM dbo.DomainDetailDataHistory H
)
SELECT g.DomainId, g.[Date],g.Passed, g.ServerPerformance, g.DatabasePerformance
FROM grp g
INNER JOIN #Latest T ON T.DomainId = g.DomainId
WHERE rn < 7 AND t.date != g.[Date]
ORDER BY DomainId, [Date] DESC
What I Want
I would like to know how many tickets were sold for each of these 5 latest rows, with the following condition:
Each of these rows has its own date.
For each date, I want to check how many tickets were sold in the preceding 15 minutes and how many in the preceding 30 minutes.
Example:
I get these 5 rows for each DomainId
I want to extend the above with two columns, "soldTicketsLast15" and "soldTicketsLast30".
The Date column contains all the dates I need; for each of these dates I want to go back 15 minutes and 30 minutes and get how many tickets were sold.
Example:
SELECT MAX(SoldTickets) FROM DomainDetailDataHistory
WHERE [Date] >= DATEADD(minute, -15, '2016-04-12 12:10:28.2270000')
SELECT MAX(SoldTickets) FROM DomainDetailDataHistory
WHERE [Date] >= DATEADD(minute, -30, '2016-04-12 12:10:28.2270000')
How can I accomplish this?
I'd use OUTER APPLY or CROSS APPLY.
;WITH grp AS
(
SELECT
DomainId, [Date], Passed, DatabasePerformance, ServerPerformance,
rn = ROW_NUMBER() OVER (PARTITION BY DomainId ORDER BY [Date] DESC)
FROM dbo.DomainDetailDataHistory H
)
SELECT
g.DomainId, g.[Date],g.Passed, g.ServerPerformance, g.DatabasePerformance
,A15.SoldTicketsLast15
,A30.SoldTicketsLast30
FROM
grp g
INNER JOIN #Latest T ON T.DomainId = g.DomainId
OUTER APPLY
(
SELECT MAX(H.SoldTickets) - MIN(H.SoldTickets) AS SoldTicketsLast15
FROM DomainDetailDataHistory AS H
WHERE
H.DomainId = g.DomainId AND
H.[Date] >= DATEADD(minute, -15, g.[Date]) AND
H.[Date] <= g.[Date] -- do not count rows newer than the current one
) AS A15
OUTER APPLY
(
SELECT MAX(H.SoldTickets) - MIN(H.SoldTickets) AS SoldTicketsLast30
FROM DomainDetailDataHistory AS H
WHERE
H.DomainId = g.DomainId AND
H.[Date] >= DATEADD(minute, -30, g.[Date]) AND
H.[Date] <= g.[Date] -- do not count rows newer than the current one
) AS A30
WHERE
rn < 7
AND T.[date] != g.[Date]
ORDER BY DomainId, [Date] DESC;
To make the correlated APPLY queries efficient there should be an appropriate index, like the following:
CREATE NONCLUSTERED INDEX [IX_DomainId_Date] ON [dbo].[DomainDetailDataHistory]
(
[DomainId] ASC,
[Date] ASC
)
INCLUDE ([SoldTickets])
This index may also help to make the main part of your query (grp) efficient.
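For anyone who wants to experiment with the per-row look-back without a SQL Server instance, below is a rough sketch in Python's sqlite3 module. SQLite has no OUTER APPLY, so correlated scalar subqueries are used in its place, and datetime(..., '-15 minutes') stands in for DATEADD. It assumes SoldTickets is a running total (which is what MAX minus MIN implies) and bounds each window at the row's own date. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DomainDetailDataHistory (DomainId INTEGER, Date TEXT, SoldTickets INTEGER);
    INSERT INTO DomainDetailDataHistory VALUES
        (1, '2016-04-12 11:35:00', 100),
        (1, '2016-04-12 11:50:00', 110),
        (1, '2016-04-12 12:00:00', 125),
        (1, '2016-04-12 12:10:00', 140),
        (2, '2016-04-12 12:10:00', 999);
""")

# For every row g, look back 15 and 30 minutes within the same DomainId and
# take the growth of the running total inside that window.
rows = conn.execute("""
    SELECT g.DomainId, g.Date,
           (SELECT MAX(h.SoldTickets) - MIN(h.SoldTickets)
              FROM DomainDetailDataHistory AS h
             WHERE h.DomainId = g.DomainId
               AND h.Date >= datetime(g.Date, '-15 minutes')
               AND h.Date <= g.Date) AS SoldTicketsLast15,
           (SELECT MAX(h.SoldTickets) - MIN(h.SoldTickets)
              FROM DomainDetailDataHistory AS h
             WHERE h.DomainId = g.DomainId
               AND h.Date >= datetime(g.Date, '-30 minutes')
               AND h.Date <= g.Date) AS SoldTicketsLast30
    FROM DomainDetailDataHistory AS g
    WHERE g.DomainId = 1
    ORDER BY g.Date DESC
""").fetchall()

for row in rows:
    print(row)
```

The row for DomainId 2 is there only to show that the DomainId condition keeps other domains out of the window.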
If I understood your question correctly, you want the tickets sold in the 15 minutes and in the 30 minutes leading up to each of your dates (in the Date column). For a single date, the following should work:
SELECT MAX(SoldTickets) FROM DomainDetailDataHistory
WHERE [Date] BETWEEN DATEADD(minute, -15, '2016-04-12 12:10:28.2270000') AND '2016-04-12 12:10:28.2270000'
The BETWEEN operator returns rows whose value lies between two bounds, inclusive, and the lower bound must come first. No GROUP BY is needed here: an aggregate function such as MAX applied to the whole filtered set returns a single row.
The SQL above covers the 15 minutes before the given date. You can do the same for the 30 minutes before it by changing -15 to -30.