PostgreSQL group by with interval - sql

Well, I have a seemingly simple set of data but it gives me a lot of trouble.
This is an example of what my data look like:
quantity  price1  price2  date
100       1       0       2018-01-01 10:00:00
200       1       0       2018-01-02 10:00:00
50        5       0       2018-01-02 11:00:00
100       1       1       2018-01-03 10:00:00
100       1       1       2018-01-03 11:00:00
300       1       0       2018-01-03 12:00:00
I need to sum the "quantity" column grouped by "price1" and "price2". That would be easy on its own, but I also need to take into account how "price1" and "price2" change over time. The data is sorted by "date".
The last row must not be grouped with the first two, even though it has the same values for "price1" and "price2". I also need the minimum and maximum date of each interval.
The end result should look like this:
quantity  price1  price2  dateStart            dateEnd
300       1       0       2018-01-01 10:00:00  2018-01-02 10:00:00
50        5       0       2018-01-02 11:00:00  2018-01-02 11:00:00
200       1       1       2018-01-03 10:00:00  2018-01-03 11:00:00
300       1       0       2018-01-03 12:00:00  2018-01-03 12:00:00
Any suggestions for a SQL query?

This is a gaps-and-islands problem. Use the following code:
select sum(quantity), price1, price2, min(date) dateStart, max(date) dateend
from
(
select *,
row_number() over (order by date) -
row_number() over (partition by price1, price2 order by date) grp
from data
) t
group by price1, price2, grp
order by dateStart
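The solution identifies consecutive runs of rows with the same price1 and price2 by computing the grp column: the difference between a row's position in the overall date order and its position within its (price1, price2) partition stays constant for as long as the run is uninterrupted. Once the runs are isolated, a simple group by that includes grp does the rest.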
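To make the grp trick concrete, here is a minimal Python sketch that replays the two row_number() calls on the sample data from the question. The function name islands and the tuple layout are illustrative assumptions, not part of the answer; the rows are assumed to be sorted by date already.

```python
from collections import defaultdict

# Sample rows from the question, already sorted by date:
# (quantity, price1, price2, date)
rows = [
    (100, 1, 0, "2018-01-01 10:00:00"),
    (200, 1, 0, "2018-01-02 10:00:00"),
    (50, 5, 0, "2018-01-02 11:00:00"),
    (100, 1, 1, "2018-01-03 10:00:00"),
    (100, 1, 1, "2018-01-03 11:00:00"),
    (300, 1, 0, "2018-01-03 12:00:00"),
]


def islands(rows):
    """Hypothetical helper: sum quantity per uninterrupted (price1, price2) run."""
    seen = defaultdict(int)  # row counter per (price1, price2) partition
    groups = {}              # (price1, price2, grp) -> [quantity, start, end]
    for overall_rn, (qty, p1, p2, date) in enumerate(rows, start=1):
        seen[(p1, p2)] += 1
        # Overall row number minus per-partition row number, like grp in the SQL.
        grp = overall_rn - seen[(p1, p2)]
        key = (p1, p2, grp)
        if key not in groups:
            groups[key] = [qty, date, date]
        else:
            groups[key][0] += qty
            groups[key][2] = date  # rows arrive in date order, so this is the max
    result = [(q, p1, p2, start, end)
              for (p1, p2, _), (q, start, end) in groups.items()]
    return sorted(result, key=lambda r: r[3])  # order by dateStart


result = islands(rows)
for row in result:
    print(row)
```

On this data the grp values are 0, 0, 2, 3, 3, 3, so the last (1, 0) row ends up in its own group (1, 0, 3) rather than joining the first two rows in (1, 0, 0).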

I modified the accepted answer slightly to handle the case where two adjacent rows have exactly the same "date". I added a second sort key so that the rows are ordered deterministically (my table has an "oid" column):
select sum(quantity), price1, price2, min(date) dateStart, max(date) dateend
from
(
select *,
row_number() over (order by date, oid) -
row_number() over (partition by price1, price2 order by date, oid) grp
from data
) t
group by price1, price2, grp
order by dateStart

Related

SQL query to generate summary file base on change in price per item

I need help writing a query to generate a summary file of quantity purchase per item, and per cost from a purchase history file. To run the query the ORDER BY would be ITEM_NO, PO_DATE, AND COST.
SAMPLE DATA - PURCHASE HISTORY
OUTPUT FILE - SUMMARY
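We can flag each row whose cost differs from the previous row of the same item, turn the running count of those flags into a group number, and then group by item_no, cost and that group number to get all the info we need.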
select item_no
,cost
,min(po_date) as start_date
,max(po_date) as end_date
,sum(qty) as qty
from (
select *
,count(chng) over(partition by item_no order by po_date) as grp
from (
select *
,case when lag(cost) over(partition by item_no order by po_date) <> cost then 1 end as chng
from t
) t
) t
group by item_no, cost, grp
order by item_no, start_date
item_no  cost  start_date           end_date             qty
12345    1.25  2021-01-02 00:00:00  2021-01-04 00:00:00  150
12345    2.00  2021-02-01 00:00:00  2021-02-03 00:00:00  60
78945    5.25  2021-06-10 00:00:00  2021-06-12 00:00:00  90
78945    4.50  2021-10-18 00:00:00  2021-10-19 00:00:00  150

Making groups of dates in SQL Server

I have a table that contains ids and dates, and I want to build groups of consecutive dates for each id:
id date
------------------
1 2019-01-01
2 2019-01-01
1 2019-01-02
2 2019-01-02
2 2019-01-03
1 2019-01-04
1 2019-01-05
2 2019-01-05
2 2019-01-06
I want to find where the gaps in the dates are for each id, to get output like this:
id from to
------------------------------------
1 2019-01-01 2019-01-02
1 2019-01-04 2019-01-05
2 2019-01-01 2019-01-03
2 2019-01-05 2019-01-06
This is a form of gaps-and-islands problem. The simplest solution is to generate a sequential number for each id and subtract that number of days from the date. The result is constant for dates that are consecutive.
So:
select id, min(date), max(date)
from (select t.*, row_number() over (partition by id order by date) as seqnum
from t
) t
group by id, dateadd(day, -seqnum, date)
order by id, min(date);
Here is a db<>fiddle.
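As a rough illustration of why "date minus sequence number" is constant within an island, here is a small Python sketch over the sample rows from the question. The helper name date_islands is hypothetical, and the sketch assumes there are no duplicate dates for the same id.

```python
from collections import defaultdict
from datetime import date, timedelta

# (id, date) pairs from the question
rows = [
    (1, date(2019, 1, 1)), (2, date(2019, 1, 1)),
    (1, date(2019, 1, 2)), (2, date(2019, 1, 2)),
    (2, date(2019, 1, 3)), (1, date(2019, 1, 4)),
    (1, date(2019, 1, 5)), (2, date(2019, 1, 5)),
    (2, date(2019, 1, 6)),
]


def date_islands(rows):
    """Hypothetical helper: group consecutive dates per id."""
    seqnum = defaultdict(int)  # row_number() per id, ordered by date
    spans = {}                 # (id, anchor) -> [first_date, last_date]
    for id_, d in sorted(rows):
        seqnum[id_] += 1
        # Same idea as dateadd(day, -seqnum, date): consecutive dates share an anchor.
        anchor = d - timedelta(days=seqnum[id_])
        span = spans.setdefault((id_, anchor), [d, d])
        span[1] = d  # dates arrive in ascending order, so this is the max
    return sorted((id_, first, last) for (id_, _), (first, last) in spans.items())


result = date_islands(rows)
for id_, first, last in result:
    print(id_, first, last)
```

For id 1 the anchors are 2018-12-31, 2018-12-31, 2019-01-01, 2019-01-01, which splits its four dates into the two expected ranges.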
A typical approach to this gaps-and-islands problem is to build the groups by comparing the date of the current record to the "previous" date of the same id. When dates are not consecutive, a new group starts:
select id, min(date) from_date, max(date) to_date
from (
select
t.*,
sum(case when date = dateadd(day, 1, lag_date) then 0 else 1 end)
over(partition by id order by date) grp
from (
select
t.*,
lag(date) over(partition by id order by date) lag_date
from mytable t
) t
) t
group by id, grp
order by id, from_date

Oracle SQL - Select users between two date by month

I am learning SQL and I was wondering how to select active users by month, depending on their starting and ending date (both timestamp(6)). My table looks like this:
Cust_Num | Start_Date | End_Date
1 | 2018-01-01 | 2019-01-01
2 | 2018-01-01 | NULL
3 | 2019-01-01 | 2019-06-01
4 | 2017-01-01 | 2019-03-01
So, counting the active users by month, I should have an output like:
As of. | Count
2018-06-01 | 3
...
2019-02-01 | 3
2019-07-01 | 1
So far, I do a manual operation by entering each month:
Select
201906,
count(distinct a.cust_num)
From
active_users a
Where
to_date('20190630', 'yyyymmdd') between a.start_date and nvl(a.end_date, '31-dec-9999')
union all
Select
201905,
count(distinct a.cust_num)
From
active_users a
Where
to_date('20190531', 'yyyymmdd') between a.start_date and nvl(a.end_date, '31-dec-9999')
union all
...
This is neither efficient nor sustainable if I want to cover 10 years, i.e. 120 months.
Any help is welcome. Thanks a lot!
This query shows the active-user-count effective as-of the end of the month.
How it works:
Convert each input row (with StartDate and EndDate value) into two rows that represent the points in time when the active-user-count is incremented (on StartDate) and decremented (on EndDate). We need to convert NULL to a far-off date value because SQL Server sorts NULL values before, rather than after, non-NULL values in ascending order.
This makes your data look like this:
OnThisDate Change
2018-01-01 1
2019-01-01 -1
2018-01-01 1
9999-12-31 -1
2019-01-01 1
2019-06-01 -1
2017-01-01 1
2019-03-01 -1
Then we simply SUM OVER the Change values (after sorting) to get the active-user-count as of that specific date:
So first, sort by OnThisDate:
OnThisDate Change
2017-01-01 1
2018-01-01 1
2018-01-01 1
2019-01-01 1
2019-01-01 -1
2019-03-01 -1
2019-06-01 -1
9999-12-31 -1
Then SUM OVER:
OnThisDate ActiveCount
2017-01-01 1
2018-01-01 2
2018-01-01 3
2019-01-01 4
2019-01-01 3
2019-03-01 2
2019-06-01 1
9999-12-31 0
Then we PARTITION (not group!) the rows by month and sort them by their date so we can identify the last ActiveCount row for that month (this actually happens in the WHERE of the outermost query, using ROW_NUMBER() and COUNT() for each month PARTITION):
OnThisDate ActiveCount IsLastInMonth
2017-01-01 1 1
2018-01-01 2 0
2018-01-01 3 1
2019-01-01 4 0
2019-01-01 3 1
2019-03-01 2 1
2019-06-01 1 1
9999-12-31 0 1
Then filter on that where IsLastInMonth = 1 (actually, where ROW_NUMBER() = COUNT(*) inside each PARTITION) to give us the final output data:
At-end-of-month Active-count
2017-01 1
2018-01 3
2019-01 3
2019-03 2
2019-06 1
9999-12 0
This does result in "gaps" in the result-set because the At-end-of-month column only shows rows where the Active-count value actually changed rather than including all possible calendar months - but that's ideal (as far as I'm concerned) because it excludes redundant data. Filling in the gaps can be done inside your application code by simply repeating output rows for each additional month until it reaches the next At-end-of-month value.
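The gap-filling step mentioned above is not shown in the answer, so here is one possible Python sketch of it. The function name fill_months and the ((year, month), count) tuple layout are assumptions for illustration only, and the 9999-12 sentinel row is assumed to have been dropped before the call.

```python
# Sparse end-of-month output from the walkthrough, without the 9999-12 sentinel row.
sparse = [
    ((2017, 1), 1),
    ((2018, 1), 3),
    ((2019, 1), 3),
    ((2019, 3), 2),
    ((2019, 6), 1),
]


def fill_months(sparse):
    """Hypothetical helper: repeat each count for every month until the next listed month."""
    filled = []
    for i, ((year, month), count) in enumerate(sparse):
        # The next listed month, or None when this is the last row.
        stop = sparse[i + 1][0] if i + 1 < len(sparse) else None
        while True:
            filled.append(((year, month), count))
            # Advance one calendar month.
            year, month = (year + 1, 1) if month == 12 else (year, month + 1)
            if stop is None or (year, month) >= stop:
                break
    return filled


filled = fill_months(sparse)
for (year, month), count in filled:
    print(f"{year}-{month:02d}", count)
```

The last listed month is emitted once and not extended, since the sparse data alone does not say how far into the future the report should run.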
Here's the query using T-SQL on SQL Server (I don't have access to Oracle right now). And here's the SQLFiddle I used to come to a solution: http://sqlfiddle.com/#!18/ad68b7/24
SELECT
OtdYear,
OtdMonth,
ActiveCount
FROM
(
-- This query adds columns to indicate which row is the last-row-in-month ( where RowInMonth == RowsInMonth )
SELECT
OnThisDate,
OtdYear,
OtdMonth,
ROW_NUMBER() OVER ( PARTITION BY OtdYear, OtdMonth ORDER BY OnThisDate ) AS RowInMonth,
COUNT(*) OVER ( PARTITION BY OtdYear, OtdMonth ) AS RowsInMonth,
ActiveCount
FROM
(
SELECT
OnThisDate,
YEAR( OnThisDate ) AS OtdYear,
MONTH( OnThisDate ) AS OtdMonth,
SUM( [Change] ) OVER ( ORDER BY OnThisDate ASC ) AS ActiveCount
FROM
(
SELECT
StartDate AS [OnThisDate],
1 AS [Change]
FROM
tbl
UNION ALL
SELECT
ISNULL( EndDate, DATEFROMPARTS( 9999, 12, 31 ) ) AS [OnThisDate],
-1 AS [Change]
FROM
tbl
) AS sq1
) AS sq2
) AS sq3
WHERE
RowInMonth = RowsInMonth
ORDER BY
OtdYear,
OtdMonth
This query can be flattened into fewer nested queries by using aggregate and window functions directly instead of using aliases (like OtdYear, ActiveCount, etc) but that would make the query much harder to understand.
I have created a query that returns a result for every month, from the minimum start date in the table to the maximum end date.
You can narrow the range by adding a condition to the WHERE clause.
-- table creation
CREATE TABLE ACTIVE_USERS (CUST_NUM NUMBER, START_DATE DATE, END_DATE DATE);
-- data creation
INSERT INTO ACTIVE_USERS
SELECT * FROM
(
SELECT 1, DATE '2018-01-01', DATE '2019-01-01' FROM DUAL UNION ALL
SELECT 2, DATE '2018-01-01', NULL FROM DUAL UNION ALL
SELECT 3, DATE '2019-01-01', DATE '2019-06-01' FROM DUAL UNION ALL
SELECT 4, DATE '2017-01-01', DATE '2019-03-01' FROM DUAL
);
-- data in the actual table
SELECT * FROM ACTIVE_USERS ORDER BY CUST_NUM;
CUST_NUM START_DATE END_DATE
---------- ---------- ----------
1 2018-01-01 2019-01-01
2 2018-01-01
3 2019-01-01 2019-06-01
4 2017-01-01 2019-03-01
Query to fetch desired result
WITH CTE ( START_DATE, END_DATE ) AS
(
SELECT
ADD_MONTHS( START_DATE, LEVEL - 1 ),
ADD_MONTHS( START_DATE, LEVEL ) - 1
FROM
(
SELECT
MIN( START_DATE ) AS START_DATE,
MAX( END_DATE ) AS END_DATE
FROM
ACTIVE_USERS
)
CONNECT BY LEVEL <= CEIL( MONTHS_BETWEEN( END_DATE, START_DATE ) ) + 1
)
--
--
SELECT
C.START_DATE,
COUNT(1) AS CNT
FROM
CTE C
JOIN ACTIVE_USERS D ON
(
C.END_DATE BETWEEN
D.START_DATE
AND
CASE
WHEN D.END_DATE IS NOT NULL THEN D.END_DATE
ELSE C.END_DATE
END
)
GROUP BY
C.START_DATE
ORDER BY
C.START_DATE;
-- output --
START_DATE CNT
---------- ----------
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-05-01 1
2017-06-01 1
2017-07-01 1
2017-08-01 1
2017-09-01 1
2017-10-01 1
2017-11-01 1
2017-12-01 1
2018-01-01 3
2018-02-01 3
2018-03-01 3
2018-04-01 3
2018-05-01 3
2018-06-01 3
2018-07-01 3
2018-08-01 3
2018-09-01 3
2018-10-01 3
2018-11-01 3
2018-12-01 3
2019-01-01 3
2019-02-01 3
2019-03-01 2
2019-04-01 2
2019-05-01 2
2019-06-01 1
30 rows selected.
Cheers!!

Datediff between multiple rows for certain ranges

My DB Table has a data set with datetime values.
How can I return a result set containing the datediff between the smallest and the highest date of each range, where rows belong to the same range only if the datediff between two consecutive values is not larger than 5 minutes?
Date
2018-01-01 00:00:00
2018-01-01 00:01:00
2018-01-01 00:02:00
2018-01-01 00:03:00
2018-01-01 00:04:00
2018-01-01 00:13:00
2018-01-01 00:14:00
2018-01-01 00:15:00
2018-01-01 00:19:00
2018-01-01 00:54:00
2018-01-01 00:59:00
2018-01-01 01:00:00
Result set should look like this:
Ranges(min)
5
4
1
2
What would be an approach for that query?
You can put breaks in whenever there is a gap of more than 5 minutes. Then accumulate the number of breaks to define a group and aggregate:
select min(dte), max(dte), count(*) as cnt
from (select t.*,
sum(isbreak) over (order by dte) as grp
from (select t.*,
(case when lag(dte) over (order by dte) > dateadd(minute, -5, dte)
then 0 else 1
end) as isbreak
from t
) t
) t
group by grp;
For some reason (not clear to me right now), I thought your question involved SQL Server, so it uses that syntax. lag() is ANSI standard functionality and available in most databases; date arithmetic does vary among databases.
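For readers who want to trace the break logic outside the database, the following Python sketch applies the same comparison as the query (previous timestamp strictly later than current minus 5 minutes) to the sample data. The name split_runs is an illustrative assumption. Note that, as in the SQL, a gap of exactly 5 minutes starts a new group.

```python
from datetime import datetime, timedelta

# Sample timestamps from the question (all on 2018-01-01)
stamps = [
    datetime(2018, 1, 1, h, m)
    for h, m in [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4),
                 (0, 13), (0, 14), (0, 15), (0, 19),
                 (0, 54), (0, 59), (1, 0)]
]


def split_runs(stamps, max_gap=timedelta(minutes=5)):
    """Hypothetical helper: start a new run whenever the gap is max_gap or more."""
    runs = []
    for ts in sorted(stamps):
        # Mirrors: lag(dte) > dateadd(minute, -5, dte)  ->  no break
        if runs and runs[-1][-1] > ts - max_gap:
            runs[-1].append(ts)
        else:
            runs.append([ts])
    return runs


runs = split_runs(stamps)
# Equivalent of: select min(dte), max(dte), count(*) ... group by grp
summary = [(run[0], run[-1], len(run)) for run in runs]
counts = [cnt for _, _, cnt in summary]
for first, last, cnt in summary:
    print(first, last, cnt)
```

On this data the row counts per group come out as 5, 4, 1, 2, which lines up with the numbers listed in the question.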

Running Total on date column

I have the following data in my table:
id invoice_id date ammount
1 1 2012-01-01 100.00
20 1 2012-01-31 50.00
470 1 2012-01-15 300.00
Now, I need to calculate running total for an invoice in some period. So, the output for this data sample should look like this:
id invoice_id date ammount running_total
1 1 2012-01-01 100.00 100.00
470 1 2012-01-15 300.00 400.00
20 1 2012-01-31 50.00 450.00
I tried the samples at http://www.sqlusa.com/bestpractices/runningtotal/ and several others, but the problem is that the ids are not in date order: I could have entries like id 20, date 2012-01-31 and id 120, date 2012-01-01, so I couldn't use NO = ROW_NUMBER(over by date)... in the first select and then ID < NO in the second select to calculate the running total.
DECLARE @DateStart DATE='2012-01-01';
WITH cte
AS (SELECT id = Row_number() OVER(ORDER BY [date]),
DATE,
myid = id,
invoice_id,
orderdate = CONVERT(DATE, DATE),
ammount
FROM [Table_2]
WHERE DATE >= @DateStart)
SELECT myid,
invoice_id,
DATE,
ammount,
runningtotal = (SELECT SUM(ammount)
FROM cte
WHERE id <= a.id)
FROM cte AS a
ORDER BY id
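The same idea, ordering by date first and only then accumulating, can be sketched in Python to check the expected output. The function name running_totals is hypothetical, and breaking date ties by id is an added assumption that the query above does not make.

```python
from itertools import accumulate

# (id, invoice_id, date, ammount) rows from the question
rows = [
    (1, 1, "2012-01-01", 100.00),
    (20, 1, "2012-01-31", 50.00),
    (470, 1, "2012-01-15", 300.00),
]


def running_totals(rows, date_start="2012-01-01"):
    """Hypothetical helper: running total in date order, from date_start onwards."""
    # ISO-formatted date strings compare correctly as plain strings.
    kept = sorted(
        (r for r in rows if r[2] >= date_start),
        key=lambda r: (r[2], r[0]),  # date first, id as an assumed tie-breaker
    )
    totals = accumulate(r[3] for r in kept)
    return [r + (total,) for r, total in zip(kept, totals)]


result = running_totals(rows)
for row in result:
    print(row)
```

Like the query, this sketch accumulates over all rows in the period; with several invoices in the table, the rows would need to be split per invoice_id first.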