Aggregate only sequential values - SQL

I have a table with 3 columns
Create table test
(
    Created datetime
    , Flag bit
    , Amount money
)
that looks like this
Created Flag Amount
2019-12-01 00:00:00.000 1 50,40
2019-11-21 00:00:00.000 1 50,40
2019-11-06 00:00:00.000 0 50,40
2019-10-04 00:00:00.000 1 50,40
2019-09-08 00:00:00.000 1 50,40
2019-09-01 00:00:00.000 0 50,40
2019-08-04 00:00:00.000 1 50,40
2019-07-24 00:00:00.000 1 50,40
2019-07-23 00:00:00.000 1 50,40
2019-06-01 00:00:00.000 0 50,40
2019-05-05 00:00:00.000 0 50,40
2019-04-25 00:00:00.000 1 50,40
2019-03-11 00:00:00.000 0 50,40
2019-02-03 00:00:00.000 0 50,40
2019-02-02 00:00:00.000 0 50,40
2019-02-01 00:00:00.000 0 50,40
2019-01-31 00:00:00.000 1 50,40
2019-01-26 00:00:00.000 0 50,40
2019-01-26 00:00:00.000 0 50,40
2019-01-01 00:00:00.000 1 50,40
As you can see, it is ordered by Created in descending order.
Imagine that all these rows are transactions. When the flag is 1, we have a checkpoint. So, for example, lines 20 to 17 form one period (always counting from older to newer), lines 17 to 12 another period, and so on.
Please notice that in lines 9, 8 and 7 we have 3 consecutive flags with a value of 1. When this happens, i.e. consecutive 1s without a 0 in between, I want to treat all the consecutive 1s as one group: they should appear as a single row with the summed Amount, keeping the MIN(Created) of the group.
For example, rows 9-7 should be grouped into one row where Amount is 151,20, Flag is 1 and Created is 2019-07-23 00:00:00.000 (the MIN(Created) of the three rows).
An example output would be the following:
Created Flag Amount
2019-11-21 00:00:00.000 1 100,80
2019-11-06 00:00:00.000 0 50,40
2019-09-08 00:00:00.000 1 100,80
2019-09-01 00:00:00.000 0 50,40
2019-07-23 00:00:00.000 1 151,20
2019-06-01 00:00:00.000 0 50,40
2019-05-05 00:00:00.000 0 50,40
2019-04-25 00:00:00.000 1 50,40
2019-03-11 00:00:00.000 0 50,40
2019-02-03 00:00:00.000 0 50,40
2019-02-02 00:00:00.000 0 50,40
2019-02-01 00:00:00.000 0 50,40
2019-01-31 00:00:00.000 1 50,40
2019-01-26 00:00:00.000 0 50,40
2019-01-26 00:00:00.000 0 50,40
2019-01-01 00:00:00.000 1 50,40

If you just want to collapse adjacent "1"s, then one approach is to assign a grouping based on the count of preceding 0s and aggregate. So for aggregating the "1"s:
select min(created), 1 as flag, sum(amount)
from (select t.*,
             sum(1 - flag) over (order by created) as grouping
      from t
     ) t
where flag = 1
group by grouping;
This does not quite work when we include 0s, because the 0s would get combined with the 1s. So I think the simplest method is union all:
select min(created), 1 as flag, sum(amount)
from (select t.*,
             sum(1 - flag) over (order by created) as grouping
      from t
     ) t
where flag = 1
group by grouping
union all
select created, flag, amount
from t
where flag = 0;
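As a sanity check, the union all version can be run against the sample data with SQLite (3.25+ for window functions) from Python; table and column names follow the answer, and the locale's 50,40 is written as 50.4:

```python
import sqlite3

# Sample data from the question, in ascending order of Created.
rows = [
    ("2019-01-01", 1), ("2019-01-26", 0), ("2019-01-26", 0), ("2019-01-31", 1),
    ("2019-02-01", 0), ("2019-02-02", 0), ("2019-02-03", 0), ("2019-03-11", 0),
    ("2019-04-25", 1), ("2019-05-05", 0), ("2019-06-01", 0), ("2019-07-23", 1),
    ("2019-07-24", 1), ("2019-08-04", 1), ("2019-09-01", 0), ("2019-09-08", 1),
    ("2019-10-04", 1), ("2019-11-06", 0), ("2019-11-21", 1), ("2019-12-01", 1),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (created text, flag int, amount real)")
con.executemany("insert into t values (?, ?, 50.4)", rows)

# Collapse adjacent 1s via the running count of 0s; pass 0-rows through as-is.
result = con.execute("""
    select min(created) as created, 1 as flag, sum(amount) as amount
    from (select t.*, sum(1 - flag) over (order by created) as grp from t) t
    where flag = 1
    group by grp
    union all
    select created, flag, amount from t where flag = 0
    order by created desc
""").fetchall()

for created, flag, amount in result:
    print(created, flag, round(amount, 2))
```

The 16 rows that come back match the example output, including the group of three 1s summing to 151.2 that starts at 2019-07-23.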
I originally misinterpreted the question as wanting a summary for all periods, not just the adjacent "1"s. You can do this with a cumulative sum to identify the groups:
select t.*,
       sum(flag) over (order by created) as grouping
from t;
And then use a subquery to aggregate this:
select min(created), max(created), count(*) as num_transactions,
       sum(amount) as total_amount
from (select t.*,
             sum(flag) over (order by created) as grouping
      from t
     ) t
group by grouping;

You want to aggregate all consecutive rows flagged 1. You can achieve this with a running count of rows flagged 0. You can see in the table below that flag + running count of zeros identifies the groups.
Created | Amount | Flag | COUNT_0
-----------+--------+------+--------
2019-12-01 | 50,40 | 1 | 0 \ both rows flag=1, count_0=0 => one group
2019-11-21 | 50,40 | 1 | 0 /
2019-11-06 | 50,40 | 0 | 1 > the only row with flag=0, count_0=1 => one group
2019-10-04 | 50,40 | 1 | 1 \ both rows flag=1, count_0=1 => one group
2019-09-08 | 50,40 | 1 | 1 /
2019-09-01 | 50,40 | 0 | 2 > the only row with flag=0, count_0=2 => one group
2019-08-04 | 50,40 | 1 | 2 \
2019-07-24 | 50,40 | 1 | 2 | all three rows flag=1, count_0=2 => one group
2019-07-23 | 50,40 | 1 | 2 /
2019-06-01 | 50,40 | 0 | 3 > the only row with flag=0, count_0=3 => one group
2019-05-05 | 50,40 | 0 | 4 > the only row with flag=0, count_0=4 => one group
2019-04-25 | 50,40 | 1 | 4 > the only row with flag=1, count_0=4 => one group
2019-03-11 | 50,40 | 0 | 5 > the only row with flag=0, count_0=5 => one group
2019-02-03 | 50,40 | 0 | 6 > the only row with flag=0, count_0=6 => one group
2019-02-02 | 50,40 | 0 | 7 > the only row with flag=0, count_0=7 => one group
2019-02-01 | 50,40 | 0 | 8 > the only row with flag=0, count_0=8 => one group
2019-01-31 | 50,40 | 1 | 8 > the only row with flag=1, count_0=8 => one group
2019-01-26 | 50,40 | 0 | 9 > the only row with flag=0, count_0=9 => one group
2019-01-26 | 50,40 | 0 | 10 > the only row with flag=0, count_0=10 => one group
2019-01-01 | 50,40 | 1 | 10 > the only row with flag=1, count_0=10 => one group
The related query:
select min(created), min(flag), sum(amount)
from
(
    select
        m.*,
        count(case when flag = 0 then 1 end) over (order by created) as count_0
    from mytable m
) t
group by flag, count_0
order by min(created);
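This query can be replayed in SQLite from Python against the sample data (amounts as 50.4). Two small adjustments, both assumptions worth flagging: the derived table gets an alias, which SQL Server requires anyway, and the running count uses an explicit rows unbounded preceding frame, because with the default range frame the two rows sharing the 2019-01-26 timestamp would receive the same count_0 and collapse into one group, unlike in the illustration above:

```python
import sqlite3

# Sample data from the question, ascending by Created.
rows = [
    ("2019-01-01", 1), ("2019-01-26", 0), ("2019-01-26", 0), ("2019-01-31", 1),
    ("2019-02-01", 0), ("2019-02-02", 0), ("2019-02-03", 0), ("2019-03-11", 0),
    ("2019-04-25", 1), ("2019-05-05", 0), ("2019-06-01", 0), ("2019-07-23", 1),
    ("2019-07-24", 1), ("2019-08-04", 1), ("2019-09-01", 0), ("2019-09-08", 1),
    ("2019-10-04", 1), ("2019-11-06", 0), ("2019-11-21", 1), ("2019-12-01", 1),
]

con = sqlite3.connect(":memory:")
con.execute("create table mytable (created text, flag int, amount real)")
con.executemany("insert into mytable values (?, ?, 50.4)", rows)

# Group key = (flag, running count of zeros); ROWS frame keeps duplicate
# timestamps in separate groups.
result = con.execute("""
    select min(created) as created, min(flag) as flag, sum(amount) as amount
    from (select m.*,
                 count(case when flag = 0 then 1 end)
                     over (order by created rows unbounded preceding) as count_0
          from mytable m
         ) t
    group by flag, count_0
    order by min(created) desc
""").fetchall()
```

This produces the 16 rows of the expected output in one pass, without a union.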


unstack start and end date to month

cust_id start end subs_price_p_month
1 2019-01-01 2019-12-10 50.00
1 2020-02-03 2020-08-05 39.99
2 2019-12-11 2020-11-08 29.99
I would like to "unstack" the table above, so that each row contains the subs price for 1 month:
cust_id month subs_price_p_month
1 2019-01-01 50.00
1 2019-02-01 50.00
1 2019-03-01 50.00
....
1 2019-12-01 50.00
1 2020-02-01 39.99
1 2020-03-01 39.99
1 2020-04-01 39.99
....
1 2020-08-01 39.99
2 2019-12-01 29.99
2 2020-01-01 29.99
2 2020-02-01 29.99
...
2 2020-11-01 29.99
Text explanation:
Customer ID 1 has 2 subscriptions with different prices. The first one runs from 1 January 2019 until 10 December 2019, the second one from 3 February 2020 to 5 August 2020.
Customer ID 2 has only 1 subscription, from December 2019 to November 2020.
I want each row equals 1 customer ID, 1 month for easier data manipulation.
generate_series() generates the sequence of dates that you need. However, it is tricky to get the date arithmetic just right for your results.
You seem to want:
select t.cust_id, yyyymm, t.subs_price_p_month
from t cross join lateral
     generate_series(date_trunc('month', startd),
                     date_trunc('month', endd),
                     interval '1 month'
                    ) gs(yyyymm);
However, if there are multiple rows within the same month, you would get duplicates. This question does not clarify what to do in that case. If you need to handle that case, I would suggest asking a new question.
Here is a db<>fiddle.
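The month arithmetic is easy to get wrong, so here is a plain-Python sketch of the series the query above should produce for the sample data (the subscription rows are hard-coded from the question):

```python
from datetime import date

# (cust_id, start, end, subs_price_p_month) rows from the question.
subs = [
    (1, date(2019, 1, 1), date(2019, 12, 10), 50.00),
    (1, date(2020, 2, 3), date(2020, 8, 5), 39.99),
    (2, date(2019, 12, 11), date(2020, 11, 8), 29.99),
]

def month_series(start, end):
    """Yield the first day of each month from start's month through end's month,
    mirroring generate_series(date_trunc('month', ...), ..., '1 month')."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

# One output row per customer per covered month.
unstacked = [(cust, d, price)
             for cust, s, e, price in subs
             for d in month_series(s, e)]
```

For the sample data this yields 12 + 7 + 12 = 31 rows, matching the expected output.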
Use generate_series with an interval of 1 month and the range from start to end (note that start and end have to be quoted, since end is a reserved word), e.g.:
SELECT
    cust_id,
    generate_series("start", "end", interval '1 month'),
    subs_price_p_month
FROM t;
cust_id | generate_series | subs_price_p_month
---------+------------------------+--------------------
1 | 2019-01-01 00:00:00+01 | 50.00
1 | 2019-02-01 00:00:00+01 | 50.00
1 | 2019-03-01 00:00:00+01 | 50.00
1 | 2019-04-01 00:00:00+02 | 50.00
1 | 2019-05-01 00:00:00+02 | 50.00
1 | 2019-06-01 00:00:00+02 | 50.00
1 | 2019-07-01 00:00:00+02 | 50.00
...
Perhaps even formatting the dates to 'Month YYYY' would better display your result set:
SELECT
cust_id,
to_char(generate_series(start_,end_,interval '1 month'),'Month YYYY')
subs_price_p_month
FROM t;
cust_id | to_char | subs_price_p_month
---------+----------------+--------------------
1 | January 2019 | 50.00
1 | February 2019 | 50.00
1 | March 2019 | 50.00
1 | April 2019 | 50.00
1 | May 2019 | 50.00
1 | June 2019 | 50.00
1 | July 2019 | 50.00
...
Demo: db<>fiddle

Output data by putting lag in SQL Server

I have a table in SQL Server like the one below.
SELECT * FROM OverlappingDateRanges
Id startDate EndDate
10001 2020-04-01 00:00:00.000 2020-05-25 00:00:00.000
10001 2020-05-26 00:00:00.000 2020-07-15 00:00:00.000
10001 2020-07-17 00:00:00.000 2020-08-15 00:00:00.000
10001 2020-08-16 00:00:00.000 2020-10-15 00:00:00.000
10001 2020-10-16 00:00:00.000 2020-12-31 00:00:00.000
10002 2020-05-01 00:00:00.000 2020-05-29 00:00:00.000
10002 2020-05-30 00:00:00.000 2020-07-08 00:00:00.000
10002 2020-07-09 00:00:00.000 2020-10-01 00:00:00.000
10002 2020-10-03 00:00:00.000 2020-12-31 00:00:00.000
I want output like below: if there is no gap between the end date of a row and the next start date for the same id, the date range should continue; it should break whenever the end date and the next start date are not contiguous.
Output should be:
id startDate endDate
10001 2020-04-01 00:00:00.000 2020-07-15 00:00:00.000
10001 2020-07-17 00:00:00.000 2020-12-31 00:00:00.000
10002 2020-05-01 00:00:00.000 2020-10-01 00:00:00.000
10002 2020-10-03 00:00:00.000 2020-12-31 00:00:00.000
This is a type of gaps-and-islands problem. Identify where each output row starts by looking at the previous end. Then do a cumulative sum and aggregate:
select id, min(startdate), max(enddate)
from (select t.*,
             sum(case when prev_enddate >= dateadd(day, -1, startdate) then 0 else 1 end)
                 over (partition by id order by startdate) as grp
      from (select t.*,
                   lag(enddate) over (partition by id order by startdate) as prev_enddate
            from t
           ) t
     ) t
group by id, grp;
Here is a db<>fiddle.
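A minimal reproduction of the same gaps-and-islands logic in SQLite from Python, with date(startdate, '-1 day') standing in for SQL Server's dateadd(day, -1, startdate):

```python
import sqlite3

# (id, startDate, EndDate) rows from the question, as ISO date strings.
ranges = [
    (10001, "2020-04-01", "2020-05-25"), (10001, "2020-05-26", "2020-07-15"),
    (10001, "2020-07-17", "2020-08-15"), (10001, "2020-08-16", "2020-10-15"),
    (10001, "2020-10-16", "2020-12-31"), (10002, "2020-05-01", "2020-05-29"),
    (10002, "2020-05-30", "2020-07-08"), (10002, "2020-07-09", "2020-10-01"),
    (10002, "2020-10-03", "2020-12-31"),
]

con = sqlite3.connect(":memory:")
con.execute("create table r (id int, startdate text, enddate text)")
con.executemany("insert into r values (?, ?, ?)", ranges)

# A new island starts whenever the previous end is not adjacent to this start.
merged = con.execute("""
    select id, min(startdate) as s, max(enddate) as e
    from (select t.*,
                 sum(case when prev_end >= date(startdate, '-1 day') then 0 else 1 end)
                     over (partition by id order by startdate) as grp
          from (select t.*,
                       lag(enddate) over (partition by id order by startdate) as prev_end
                from r t
               ) t
         ) t
    group by id, grp
    order by id, s
""").fetchall()
```

ISO date strings compare correctly as text, so no datetime conversion is needed here; the four merged rows match the expected output.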

Wrong results with group by for distinct count

I have these two queries for calculating a distinct count from a table for a particular date range. In my first query I group by location, aRID (which is a rule) and date. In my second query I don't group by date.
I am expecting the same total distinct count in both results, but I get 6147 in the first result and 6359 in the second. What is wrong here? The only difference is the group by.
select
    r.loc,
    cast(r.date as DATE) as dateCol,
    count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
    and r.loc = 1
group by r.loc, r.aRId, cast(r.date as DATE)
select
    r.loc,
    count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
    and r.loc = 1
group by r.loc, r.aRId
loc dateCol dC_count
1 2018-01-22 1
1 2018-03-09 2
1 2018-01-28 3
1 2018-01-05 1
1 2018-05-28 143
1 2018-02-17 1
1 2018-05-08 187
1 2018-05-31 146
1 2018-01-02 3
1 2018-02-14 1
1 2018-05-11 273
1 2018-01-14 1
1 2018-03-18 2
1 2018-02-03 1
1 2018-05-20 200
1 2018-05-14 230
1 2018-01-11 5
1 2018-01-31 1
1 2018-05-17 209
1 2018-01-20 2
1 2018-03-01 1
1 2018-01-03 3
1 2018-05-06 253
1 2018-05-26 187
1 2018-03-24 1
1 2018-02-09 1
1 2018-03-04 1
1 2018-05-03 269
1 2018-05-23 187
1 2018-05-29 133
1 2018-03-21 1
1 2018-03-27 1
1 2018-05-15 202
1 2018-03-07 1
1 2018-06-01 155
1 2018-02-21 1
1 2018-01-26 2
1 2018-02-15 2
1 2018-05-12 331
1 2018-03-10 1
1 2018-01-09 3
1 2018-02-18 1
1 2018-03-13 2
1 2018-05-09 184
1 2018-01-12 2
1 2018-03-16 1
1 2018-05-18 198
1 2018-02-07 1
1 2018-02-01 1
1 2018-01-15 3
1 2018-02-24 4
1 2018-03-19 1
1 2018-05-21 161
1 2018-02-10 1
1 2018-05-04 250
1 2018-05-30 148
1 2018-05-24 153
1 2018-01-24 1
1 2018-05-10 199
1 2018-03-08 1
1 2018-01-21 1
1 2018-05-27 151
1 2018-01-04 3
1 2018-05-07 236
1 2018-03-25 1
1 2018-03-11 2
1 2018-01-10 1
1 2018-01-30 1
1 2018-03-14 1
1 2018-02-19 1
1 2018-05-16 192
1 2018-01-13 5
1 2018-01-07 1
1 2018-03-17 3
1 2018-01-27 2
1 2018-02-22 1
1 2018-05-13 200
1 2018-02-08 2
1 2018-01-16 2
1 2018-03-03 1
1 2018-05-02 217
1 2018-05-22 163
1 2018-03-20 1
1 2018-02-05 2
1 2018-02-11 1
1 2018-01-19 2
1 2018-02-28 1
1 2018-05-05 332
1 2018-05-25 211
1 2018-03-23 1
1 2018-05-19 219
loc dC_count
1 6359
From "COUNT (Transact-SQL)"
COUNT(DISTINCT expression) evaluates expression for each row in a group, and returns the number of unique, nonnull values.
The distinct is relative to the group, not to the whole table (or selected subset). I think this might be your misconception here.
To better understand what this means, take the following simplified example:
CREATE TABLE group_test
(a varchar(1),
 b varchar(1),
 c varchar(1));

INSERT INTO group_test (a, b, c)
VALUES ('a', 'r', 'x'),
       ('a', 's', 'x'),
       ('b', 'r', 'x'),
       ('b', 's', 'y');
If we GROUP BY a and select count(DISTINCT c)
SELECT a,
count(DISTINCT c) #
FROM group_test
GROUP BY a;
we get
a | #
----|----
a | 1
b | 2
As there is only c = 'x' for a = 'a', there is a distinct count of 1 for this group, but 2 for the other group, as it has both 'x' and 'y' in c. The sum of the counts is 3 here.
Now if we GROUP BY a, b
SELECT a,
b,
count(DISTINCT c) #
FROM group_test
GROUP BY a,
b;
we get
a | b | #
----|----|----
a | r | 1
a | s | 1
b | r | 1
b | s | 1
We get 1 for every count here as each value of c is the only one in the group. And all of a sudden the sum of counts is 4.
And if we get the distinct count of c for the whole table
SELECT count(DISTINCT c) #
FROM group_test;
we get
#
----
2
which sums up to 2.
The sum of the counts is different in each case, but correct nonetheless.
The more groups there are, the higher the chance for a value to be unique within that group. So your results seem totally plausible.
db<>fiddle
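The three groupings above can be replayed verbatim in SQLite from Python to confirm the counts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table group_test (a varchar(1), b varchar(1), c varchar(1))")
con.executemany("insert into group_test values (?, ?, ?)",
                [("a", "r", "x"), ("a", "s", "x"), ("b", "r", "x"), ("b", "s", "y")])

# Distinct count of c per group of a: ('a', 1) and ('b', 2), summing to 3.
by_a = con.execute(
    "select a, count(distinct c) from group_test group by a order by a").fetchall()

# Per (a, b) group every c is unique, so each count is 1 and the sum is 4.
by_ab = con.execute(
    "select a, b, count(distinct c) from group_test group by a, b order by a, b").fetchall()

# Over the whole table there are only two distinct values of c.
total = con.execute("select count(distinct c) from group_test").fetchone()[0]
```

The sums 3, 4 and 2 differ exactly as described: the finer the grouping, the more often a value counts as distinct within its own group.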

SQL consolidate consecutive date

Hi, could anyone help me with this SQL query?
I want to display all the consecutive dates in the DB where there are at least 2 consecutive dates.
Below is an example of the output I was hoping for.
Here is a list of dates:
2016-06-24 00:00:00.000
2016-06-24 00:00:00.000
2016-06-24 00:00:00.000
2016-06-25 00:00:00.000
2016-06-25 00:00:00.000
2016-06-26 00:00:00.000
2016-05-26 00:00:00.000
2016-05-25 00:00:00.000
2016-04-04 00:00:00.000
2016-06-26 00:00:00.000
----------output----------
| Start Date | End Date | Count | consecutive Date |
| 2016-05-25 00:00:00.000 | 2016-05-25 00:00:00.000 | 1 | 2 |
| 2016-05-25 00:00:00.000 | 2016-05-26 00:00:00.000 | 1 | 2 |
| 2016-06-24 00:00:00.000 | 2016-06-25 00:00:00.000 | 2 | 2 |
| 2016-06-24 00:00:00.000 | 2016-06-26 00:00:00.000 | 2 | 3 |
| 2016-06-25 00:00:00.000 | 2016-06-26 00:00:00.000 | 2 | 2 |
This is what I have currently:
WITH t AS (
    SELECT SWITCHOFFSET(CONVERT(DATETIMEOFFSET, dateAvailable), '+08:00') d,
           ROW_NUMBER() OVER (ORDER BY dateAvailable) i
    FROM user_dateTbl
    GROUP BY dateAvailable
)
SELECT MIN(d), MAX(d)
FROM t
GROUP BY DATEDIFF(day, i, d)
Please help me out, thank you. If you're unsure of what I am asking, feel free to write in the comments below.
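A common way to consolidate runs like this is the gaps-and-islands trick the attempt above is reaching for: over the distinct dates, date minus row number is constant within a consecutive run. A sketch in SQLite via Python, noting two assumptions: it returns each maximal run once with its length, rather than every overlapping sub-range as in the sample output, and it leaves out the SWITCHOFFSET timezone shift:

```python
import sqlite3

# The list of dates from the question, duplicates included.
dates = [
    "2016-06-24 00:00:00.000", "2016-06-24 00:00:00.000", "2016-06-24 00:00:00.000",
    "2016-06-25 00:00:00.000", "2016-06-25 00:00:00.000", "2016-06-26 00:00:00.000",
    "2016-05-26 00:00:00.000", "2016-05-25 00:00:00.000", "2016-04-04 00:00:00.000",
    "2016-06-26 00:00:00.000",
]

con = sqlite3.connect(":memory:")
con.execute("create table user_dateTbl (dateAvailable text)")
con.executemany("insert into user_dateTbl values (?)", [(d,) for d in dates])

# julianday(d) - row_number() is constant across a run of consecutive dates,
# so it serves as the group key; HAVING keeps runs of at least 2 days.
runs = con.execute("""
    with d as (select distinct date(dateAvailable) as d from user_dateTbl),
         g as (select d, julianday(d) - row_number() over (order by d) as grp from d)
    select min(d) as start_date, max(d) as end_date, count(*) as consecutive_days
    from g
    group by grp
    having count(*) >= 2
    order by start_date
""").fetchall()
```

For the sample list this finds the run 2016-05-25 to 2016-05-26 (2 days) and 2016-06-24 to 2016-06-26 (3 days), while the isolated 2016-04-04 is dropped.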

SQL - Compare rows by id, date and amount

I need to SELECT the rows in which ISSUE_DATE equals the MATURITY_DATE of another row with the same ID and the same AMOUNT_USD.
I tried a self join, but I do not get the right result.
Here is a simplified version of my table:
ID ISSUE_DATE MATURITY_DATE AMOUNT_USD
1 2010-01-01 00:00:00.000 2015-12-01 00:00:00.000 5000
1 2010-01-01 00:00:00.000 2001-09-19 00:00:00.000 700
2 2014-04-09 00:00:00.000 2019-04-09 00:00:00.000 400
1 2015-12-01 00:00:00.000 2016-12-31 00:00:00.000 5000
5 2015-02-24 00:00:00.000 2015-02-24 00:00:00.000 8000
4 2012-11-29 00:00:00.000 2015-11-29 00:00:00.000 10000
3 2015-01-21 00:00:00.000 2018-01-21 00:00:00.000 17500
2 2015-02-02 00:00:00.000 2015-12-05 00:00:00.000 12000
1 2015-01-12 00:00:00.000 2018-01-12 00:00:00.000 18000
2 2015-12-05 00:00:00.000 2016-01-10 00:00:00.000 12000
Result should be:
ID ISSUE_DATE MATURITY_DATE AMOUNT_USD
1 2015-12-01 00:00:00.000 2016-12-31 00:00:00.000 5000
2 2015-12-05 00:00:00.000 2016-01-10 00:00:00.000 12000
Thanks in advance!
Do the following: http://sqlfiddle.com/#!6/c0a02/1
select a.id, a.issue_date, a.maturity_date, a.amount_usd
from tbl a
inner join tbl b
    on a.id = b.id
    and a.maturity_date = b.issue_date
-- added to prevent same maturity date and issue date
where a.maturity_date <> a.issue_date
Output:
| id | issue_date | maturity_date | amount_usd |
|----|----------------------------|----------------------------|------------|
| 1 | January, 01 2010 00:00:00 | December, 01 2015 00:00:00 | 5000 |
| 2 | February, 02 2015 00:00:00 | December, 05 2015 00:00:00 | 12000 |
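Note the answer above returns the earlier row of each matching pair. To get the asker's expected result, the later rows, the join can be turned around so that a row's issue_date is matched against the other row's maturity_date, with the amount compared as well; a sketch in SQLite via Python:

```python
import sqlite3

# (ID, ISSUE_DATE, MATURITY_DATE, AMOUNT_USD) rows from the question.
loans = [
    (1, "2010-01-01", "2015-12-01", 5000), (1, "2010-01-01", "2001-09-19", 700),
    (2, "2014-04-09", "2019-04-09", 400), (1, "2015-12-01", "2016-12-31", 5000),
    (5, "2015-02-24", "2015-02-24", 8000), (4, "2012-11-29", "2015-11-29", 10000),
    (3, "2015-01-21", "2018-01-21", 17500), (2, "2015-02-02", "2015-12-05", 12000),
    (1, "2015-01-12", "2018-01-12", 18000), (2, "2015-12-05", "2016-01-10", 12000),
]

con = sqlite3.connect(":memory:")
con.execute("create table tbl (id int, issue_date text, maturity_date text, amount_usd int)")
con.executemany("insert into tbl values (?, ?, ?, ?)", loans)

# Keep row a when its issue date is the maturity date of another row b
# with the same id and amount; exclude rows that would match themselves.
result = con.execute("""
    select a.id, a.issue_date, a.maturity_date, a.amount_usd
    from tbl a
    join tbl b
      on a.id = b.id
     and a.issue_date = b.maturity_date
     and a.amount_usd = b.amount_usd
    where a.issue_date <> a.maturity_date
    order by a.id
""").fetchall()
```

This returns exactly the two rows the asker listed as the expected result.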