I have data like this:
GroupId DateFrom DateTo value_
Gr1 2022-03-01 2022-08-01 10
Gr2 2022-01-01 2022-12-31 20
Gr3 2022-01-01 2022-12-31 30
I'm trying to construct an SCD2 type dimension by doing an unpivot on data above
WITH UnPivoted AS (SELECT 'Gr1' AS GroupId, '2022-03-01' AS DateFrom, '2022-08-01' AS DateTo, 10 as value_ UNION ALL
SELECT 'Gr2', '2022-01-01', '2022-12-31', 20 UNION ALL
SELECT 'Gr3', '2022-01-01', '2022-12-31', 30
)
SELECT DateFrom, DateTo, SUM([Gr1]) Gr1, SUM([Gr2]) Gr2, SUM([Gr3]) Gr3
FROM UnPivoted
PIVOT (
SUM(value_) FOR GroupId IN ([Gr1],[Gr2],[Gr3])
) pvt
GROUP BY DateFrom, DateTo
with result:
DateFrom DateTo Gr1 Gr2 Gr3
2022-03-01 2022-08-01 10 NULL NULL
2022-01-01 2022-12-31 NULL 20 30
But, as you can see, date ranges are not identical so my GROUP BY does not work. And there is an overlap in date ranges so output is not correct.
I would like to get this result instead:
DateFrom DateTo Gr1 Gr2 Gr3
2022-01-01 2022-03-01 20 30
2022-03-01 2022-08-01 10 20 30
2022-08-01 2022-12-31 20 30
The best approach that I can come up with is to get all distinct values of DateFrom and DateTo and go through intervals between them one by one, constructing a new row for each interval.
Is there an easier way of getting the desired result?
In case someone else has the same situation, script below works. It also has some additional logic to adjust end dates so they do not overlap with start dates.
input (Unpivoted CTE):
GroupId DateFrom DateTo value_
Gr1 2022-03-01 2022-08-01 10
Gr2 2022-01-01 2022-12-31 20
Gr3 2022-01-01 2022-12-31 30
script:
WITH UnPivoted AS (SELECT 'Gr1' AS GroupId, CAST('2022-03-01' AS date) AS DateFrom, CAST('2022-08-01' AS date) AS DateTo, 10 as value_ UNION ALL
SELECT 'Gr2', '2022-01-01', '2022-12-31', 20 UNION ALL
SELECT 'Gr3', '2022-01-01', '2022-12-31', 30
)
,UniqueDateRanges AS (
SELECT DISTINCT DateFrom
FROM UnPivoted
UNION
SELECT DISTINCT DATEADD(d,1,DateTo)
FROM UnPivoted
)
,DateIntervals_SCD2 AS (
SELECT DateFrom
,CAST(NULLIF(LEAD(DateFrom,1,NULL) OVER(PARTITION BY '1' ORDER BY DateFrom),NULL) AS date) AS DateTo1
,CAST(DATEADD(d,-1,NULLIF(LEAD(DateFrom,1,NULL) OVER(PARTITION BY '1' ORDER BY DateFrom),NULL)) AS date) AS DateTo1_adjusted
FROM UniqueDateRanges
)
,Dataset_Fixed_SCD2 AS (
SELECT di.DateFrom, DateTo1_adjusted AS DateTo, up.GroupId, up.value_
FROM DateIntervals_SCD2 di
LEFT JOIN UnPivoted up ON di.DateFrom BETWEEN up.DateFrom AND up.DateTo AND DateTo1_adjusted BETWEEN up.DateFrom AND up.DateTo
WHERE DateTo1_adjusted IS NOT NULL
)
SELECT *
FROM Dataset_Fixed_SCD2
PIVOT (
SUM(value_) FOR GroupId IN ([Gr1],[Gr2],[Gr3])
) pvt
output:
DateFrom DateTo Gr1 Gr2 Gr3
2022-01-01 2022-02-28 20 30
2022-03-01 2022-08-01 10 20 30
2022-08-02 2022-12-31 20 30
I have table records as -
date n_count
2020-02-19 00:00:00 4
2020-07-14 00:00:00 1
2020-07-17 00:00:00 1
2020-07-30 00:00:00 2
2020-08-03 00:00:00 1
2020-08-04 00:00:00 2
2020-08-25 00:00:00 2
2020-09-23 00:00:00 2
2020-09-30 00:00:00 3
2020-10-01 00:00:00 11
2020-10-05 00:00:00 12
2020-10-19 00:00:00 1
2020-10-20 00:00:00 1
2020-10-22 00:00:00 1
2020-11-02 00:00:00 376
2020-11-04 00:00:00 72
2020-11-11 00:00:00 1
I want to be grouped all the records into months for finding month total count which is working, but there is a missing of month. how to fill this gap.
time month_count
"2020-02-01" 4
"2020-07-01" 4
"2020-08-01" 5
"2020-09-01" 5
"2020-10-01" 26
"2020-11-01" 449
This is what I have tried.
SELECT (date_trunc('month', date))::date AS time,
sum(n_count) as month_count
FROM table1
group by time
order by time asc
You can use generate_series() to generate all starts of months between the earliest and latest date available in the table, then bring the table with a left join:
select d.dt, coalesce(sum(t.n_count), 0) as month_count
from (
select generate_series(date_trunc('month', min(date)), date_trunc('month', max(date)), '1 month') as dt
from table1
) as d(dt)
left join table1 t on t.date >= d.dt and t.date < d.dt + interval '1 month'
group by d.dt
order by d.dt
I would simply UNION a date series, generated from MIN and MAX date:
demo:db<>fiddle
WITH cte AS ( -- 1
SELECT
*,
date_trunc('month', date)::date AS time
FROM
t
)
SELECT
time,
SUM(n_count) as month_count --3
FROM (
SELECT
time,
n_count
FROM cte
UNION
SELECT -- 2
generate_series(
(SELECT MIN(time) FROM cte),
(SELECT MAX(time) FROM cte),
interval '1 month'
)::date,
0
) s
GROUP BY time
ORDER BY time
Use CTE to calculate date_trunc only once. Could be left out if you like to call your table twice in the UNION below
Generate monthly date series from MIN to MAX date containing your n_count value = 0. Add it to the table
Do your calculation
I'm trying to solve the following problem:
a user took three loans with running times of 3,4 and 5 months.
How to calculate in BigQuery for each point in time, how much he owns?
I know to do this calculation in R or Python but would clearly prefer a BigQuery/SQL solution.
Thank you!
I have the data:
Take Date Return Date Sum
2016-01-01 2016-03-31 10
2016-02-01 2016-05-31 20
2016-03-01 2016-07-31 50
I need the output like this:
Date Sum
2016-01-01 10
2016-02-01 30
2016-03-01 80
2016-04-01 70
2016-05-01 70
2016-06-01 50
2016-07-01 50
2016-08-01 0
Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2016-01-01' take_date, DATE '2016-03-31' return_date, 10 amount
UNION ALL SELECT 1, DATE '2016-02-01', DATE '2016-05-31', 20
UNION ALL SELECT 1, DATE '2016-03-01', DATE '2016-07-31', 50
), dates AS (
SELECT id, day
FROM (
SELECT id, GENERATE_DATE_ARRAY(
MIN(take_date),
DATE_ADD(DATE_TRUNC(MAX(return_date), MONTH), INTERVAL 1 MONTH),
INTERVAL 1 MONTH
) days
FROM `project.dataset.table`
GROUP BY id
), UNNEST(days) day
)
SELECT d.id, d.day, SUM(IF(d.day BETWEEN t.take_date AND t.return_date, amount, 0)) amount
FROM dates d
LEFT JOIN `project.dataset.table` t
ON d.id = t.id
GROUP BY d.id, d.day
ORDER BY d.day
with result as
Row id day amount
1 1 2016-01-01 10
2 1 2016-02-01 30
3 1 2016-03-01 80
4 1 2016-04-01 70
5 1 2016-05-01 70
6 1 2016-06-01 50
7 1 2016-07-01 50
8 1 2016-08-01 0
I have the following table structure:
id int -- more like a group id, not unique in the table
AddedOn datetime -- when the record was added
For a specific id there is at most one record each day. I have to write a query that returns contiguous (at day level) date intervals for each id.
The expected result structure is:
id int
StartDate datetime
EndDate datetime
Note that the time part of AddedOn is available but it is not important here.
To make it clearer, here is some input data:
with data as
(
select * from
(
values
(0, getdate()), --dummy record used to infer column types
(1, '20150101'),
(1, '20150102'),
(1, '20150104'),
(1, '20150105'),
(1, '20150106'),
(2, '20150101'),
(2, '20150102'),
(2, '20150103'),
(2, '20150104'),
(2, '20150106'),
(2, '20150107'),
(3, '20150101'),
(3, '20150103'),
(3, '20150105'),
(3, '20150106'),
(3, '20150108'),
(3, '20150109'),
(3, '20150110')
) as d(id, AddedOn)
where id > 0 -- exclude dummy record
)
select * from data
And the expected result:
id StartDate EndDate
1 2015-01-01 2015-01-02
1 2015-01-04 2015-01-06
2 2015-01-01 2015-01-04
2 2015-01-06 2015-01-07
3 2015-01-01 2015-01-01
3 2015-01-03 2015-01-03
3 2015-01-05 2015-01-06
3 2015-01-08 2015-01-10
Although it looks like a common problem I couldn't find a similar enough question. Also I'm getting closer to a solution and I will post it when (and if) it works but I feel that there should be a more elegant one.
Here's answer without any fancy joining, but simply using group by and row_number, which is not only simple but also more efficient.
WITH CTE_dayOfYear
AS
(
SELECT id,
AddedOn,
DATEDIFF(DAY,'20000101',AddedOn) dyID,
ROW_NUMBER() OVER (ORDER BY ID,AddedOn) row_num
FROM data
)
SELECT ID,
MIN(AddedOn) StartDate,
MAX(AddedOn) EndDate,
dyID-row_num AS groupID
FROM CTE_dayOfYear
GROUP BY ID,dyID - row_num
ORDER BY ID,2,3
The logic is that the dyID is based on the date so there are gaps while row_num has no gaps. So every time there is a gap in dyID, then it changes the difference between row_num and dyID. Then I simply use that difference as my groupID.
In Sql Server 2008 it is a little bit pain without LEAD and LAG functions:
WITH data
AS ( SELECT * ,
ROW_NUMBER() OVER ( ORDER BY id, AddedOn ) AS rn
FROM ( VALUES ( 0, GETDATE()), --dummy record used to infer column types
( 1, '20150101'), ( 1, '20150102'), ( 1, '20150104'),
( 1, '20150105'), ( 1, '20150106'), ( 2, '20150101'),
( 2, '20150102'), ( 2, '20150103'), ( 2, '20150104'),
( 2, '20150106'), ( 2, '20150107'), ( 3, '20150101'),
( 3, '20150103'), ( 3, '20150105'), ( 3, '20150106'),
( 3, '20150108'), ( 3, '20150109'), ( 3, '20150110') )
AS d ( id, AddedOn )
WHERE id > 0 -- exclude dummy record
),
diff
AS ( SELECT d1.* ,
CASE WHEN ISNULL(DATEDIFF(dd, d2.AddedOn, d1.AddedOn),
1) = 1 THEN 0
ELSE 1
END AS diff
FROM data d1
LEFT JOIN data d2 ON d1.id = d2.id
AND d1.rn = d2.rn + 1
),
parts
AS ( SELECT * ,
( SELECT SUM(diff)
FROM diff d2
WHERE d2.rn <= d1.rn
) AS p
FROM diff d1
)
SELECT id ,
MIN(AddedOn) AS StartDate ,
MAX(AddedOn) AS EndDate
FROM parts
GROUP BY id ,
p
Output:
id StartDate EndDate
1 2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1 2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2 2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2 2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3 2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3 2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3 2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3 2015-01-08 00:00:00.000 2015-01-10 00:00:00.000
Walkthrough:
diff
This CTE returns data:
1 2015-01-01 00:00:00.000 1 0
1 2015-01-02 00:00:00.000 2 0
1 2015-01-04 00:00:00.000 3 1
1 2015-01-05 00:00:00.000 4 0
1 2015-01-06 00:00:00.000 5 0
You are joining same table on itself to get the previous row. Then you calculate difference in days between current row and previous row and if the result is 1 day then pick 0 else pick 1.
parts
This CTE selects result from previous step and sums up the new column(it is a cumulative sum. sum of all values of new column from starting till current row), so you are getting partitions to group by:
1 2015-01-01 00:00:00.000 1 0 0
1 2015-01-02 00:00:00.000 2 0 0
1 2015-01-04 00:00:00.000 3 1 1
1 2015-01-05 00:00:00.000 4 0 1
1 2015-01-06 00:00:00.000 5 0 1
2 2015-01-01 00:00:00.000 6 0 1
2 2015-01-02 00:00:00.000 7 0 1
2 2015-01-03 00:00:00.000 8 0 1
2 2015-01-04 00:00:00.000 9 0 1
2 2015-01-06 00:00:00.000 10 1 2
2 2015-01-07 00:00:00.000 11 0 2
3 2015-01-01 00:00:00.000 12 0 2
3 2015-01-03 00:00:00.000 13 1 3
The last step is just a grouping by ID and new column and picking min and max values for dates.
I took the "Islands Solution #3 from SQL MVP Deep Dives" solution from https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/ and applied to your test data:
with
data as
(
select * from
(
values
(0, getdate()), --dummy record used to infer column types
(1, '20150101'),
(1, '20150102'),
(1, '20150104'),
(1, '20150105'),
(1, '20150106'),
(2, '20150101'),
(2, '20150102'),
(2, '20150103'),
(2, '20150104'),
(2, '20150106'),
(2, '20150107'),
(3, '20150101'),
(3, '20150103'),
(3, '20150105'),
(3, '20150106'),
(3, '20150108'),
(3, '20150109'),
(3, '20150110')
) as d(id, AddedOn)
where id > 0 -- exclude dummy record
)
,CTE_Seq
AS
(
SELECT
ID
,SeqNo
,SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) AS rn
FROM
data
CROSS APPLY
(
SELECT DATEDIFF(day, '20150101', AddedOn) AS SeqNo
) AS CA
)
SELECT
ID
,DATEADD(day, MIN(SeqNo), '20150101') AS StartDate
,DATEADD(day, MAX(SeqNo), '20150101') AS EndDate
FROM CTE_Seq
GROUP BY ID, rn
ORDER BY ID, StartDate;
Result set
ID StartDate EndDate
1 2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1 2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2 2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2 2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3 2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3 2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3 2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3 2015-01-08 00:00:00.000 2015-01-10 00:00:00.000
I'd recommend you to examine the intermediate results of CTE_Seq to understand how it actually works. Just put
select * from CTE_Seq
instead of the final SELECT ... GROUP BY .... You'll get this result set:
ID SeqNo rn
1 0 -1
1 1 -1
1 3 0
1 4 0
1 5 0
2 0 -1
2 1 -1
2 2 -1
2 3 -1
2 5 0
2 6 0
3 0 -1
3 2 0
3 4 1
3 5 1
3 7 2
3 8 2
3 9 2
Each date is converted into a sequence number by DATEDIFF(day, '20150101', AddedOn). ROW_NUMBER() generates a set of sequential numbers without gaps, so when these numbers are subtracted from a sequence with gaps the difference jumps/changes. The difference stays the same until the next gap, so in the final SELECT GROUP BY ID, rn brings all rows from the same island together.
Here is a simple solution that does not use analytics. I tend not to use analytics because I work with many different DBMSs and many don't (yet) have them emplemented and even those who do have different syntaxes. I just have the habit of writing generic code whenever possible.
with
Data( ID, AddedOn )as(
select 1, convert( date, '20150101' ) union all
select 1, '20150102' union all
select 1, '20150104' union all
select 1, '20150105' union all
select 1, '20150106' union all
select 2, '20150101' union all
select 2, '20150102' union all
select 2, '20150103' union all
select 2, '20150104' union all
select 2, '20150106' union all
select 2, '20150107' union all
select 3, '20150101' union all
select 3, '20150103' union all
select 3, '20150105' union all
select 3, '20150106' union all
select 3, '20150108' union all
select 3, '20150109' union all
select 3, '20150110'
)
select d.ID, d.AddedOn StartDate, IsNull( d1.AddedOn, '99991231' ) EndDate
from Data d
left join Data d1
on d1.ID = d.ID
and d1.AddedOn =(
select Min( AddedOn )
from data
where ID = d.ID
and AddedOn > d.AddedOn );
In your situation I assume that ID and AddedOn form a composite PK and so are indexed. Thus, the query will run impressively fast even on very large tables.
Also, I used the outer join because it seemed like the last AddedOn date of each ID should be seen in the StartDate column. Instead of NULL I used a common MaxDate value. The NULL could work just as well as a "this is the latest StartDate row" flag.
Here is the output for ID=1:
ID StartDate EndDate
----------- ---------- ----------
1 2015-01-01 2015-01-02
1 2015-01-02 2015-01-04
1 2015-01-04 2015-01-05
1 2015-01-05 2015-01-06
1 2015-01-06 9999-12-31
I'd like to post my own solution too because it's yet another approach:
with data as
(
...
),
temp as
(
select d.id
,d.AddedOn
,dprev.AddedOn as PrevAddedOn
,dnext.AddedOn as NextAddedOn
FROM data d
left JOIN
data dprev on dprev.id = d.id
and dprev.AddedOn = dateadd(d, -1, d.AddedOn)
left JOIN
data dnext on dnext.id = d.id
and dnext.AddedOn = dateadd(d, 1, d.AddedOn)
),
starts AS
(
select id
,AddedOn
from temp
where PrevAddedOn is NULL
),
ends as
(
select id
,AddedOn
from temp
where NextAddedon is NULL
)
SELECT s.id as id
,s.AddedOn as StartDate
,(select min(e.AddedOn) from ends e where e.id = s.id and e.AddedOn >= s.AddedOn) as EndDate
from starts s
I m working on a very weird problem with SQL where I have to compare previous rows
Number start_date end_date
----- ------- ------------
1 2011-06-07 00:00:00.000 2011-07-10 00:00:00.000
2 2011-10-11 00:00:00.000 2011-10-11 00:00:00.000
3 2011-10-26 00:00:00.000 2011-10-29 00:00:00.000
4 2011-10-29 00:00:00.000 2011-11-15 00:00:00.000
Here , I have to compare the start_date and end_date on the two different line and create a view out of it.
(If the start_date is less than the previous end_date , then criteria is set to 1).
Well it should compare 2011-10-26 00:00:00.000 for 3 and 2011-10-27 00:00:00.000 on 2 for 30 days
Number start_date end_date Criteria
----- ----------- ---------------- ------------
1 2011-06-07 00:00:00.000 2011-07-10 00:00:00.000 0
2 2011-10-11 00:00:00.000 2011-10-11 00:00:00.000 0
3 2011-10-26 00:00:00.000 2011-10-29 00:00:00.000 1
4 2011-10-30 00:00:00.000 2011-11-15 00:00:00.000 1
I m confused how should I proceed with this.
Any help would be helpful !!!!
Thanks !!!
The most straightforward way to do this is to use a subquery:
select A.number, a.start_date, a.end_date,
CASE WHEN start_date < dateadd(d,30,(select TOP(1) b.end_date
from mytable B
where B.number < A.number
order by B.number desc)) then 1 else 0 end Criteria
from mytable A
Note: If the start date is the 29th day following the previous row's end date, Criteria becomes 1. By the 30th day onwards, it is 0. Tweak the 30 in the query as required.
Sample:
create table mytable (
Number int primary key,
start_date datetime,
end_date datetime);
insert mytable
select 1, '2011-06-07', '2011-07-10' union all
select 2, '2011-10-11', '2011-10-27' union all
select 3, '2011-10-26', '2011-10-29' union all
select 4, '2011-10-29', '2011-11-15'
Result:
number start_date end_date Criteria
1 2011-06-07 00:00:00.000 2011-07-10 00:00:00.000 0
2 2011-10-11 00:00:00.000 2011-10-27 00:00:00.000 0
3 2011-10-26 00:00:00.000 2011-10-29 00:00:00.000 1
4 2011-10-29 00:00:00.000 2011-11-15 00:00:00.000 0
Try using case like this:
create view vDates as
select Number,start_date,end_date,
case
when start_date<end_date
then 0
else 1
end as Criteria
from tab
SQL Fiddle Demo
A more readable way is create a function and send the correct dates:
Function:
create function [dbo].[CompareDates] (
#START_DATE datetime,
#PREVIOUS_END_DATE datetime
)
RETURNS int
AS
BEGIN
if #START_DATE < #PREVIOUS_END_DATE
return 1
return 0
END
Query (using subquery):
declare #dates table
(
number int,
start datetime,
end_date datetime
)
insert into #dates values
(1, '2011-06-07 00:00:00.000', '2011-07-10 00:00:00.000'),
(2, '2011-10-11 00:00:00.000', '2011-10-27 00:00:00.000'),
(3, '2011-10-26 00:00:00.000', '2011-10-29 00:00:00.000'),
(4, '2011-10-29 00:00:00.000', '2011-11-15 00:00:00.000')
select *, dbo.CompareDates(dates.end_date, dates.previous_end_date) from
(
select number, start, end_date,
(select TOP 1 end_date
from #dates d2
where d2.number < d1.number
order by d2.number desc) as previous_end_date
from #dates d1
) dates