Finding missing number series when data is grouped in SQL Server

I need to write a query that calculates the missing numbers, with their counts, in a sequence when the data is "grouped". The data is in multiple groups and each group is a sequence.
For example, I have number series like 1001-1050, 1245-1270, 4571-4590. All the numbers in each series (1001, 1002, 1003, ..., 1050, and so on) are stored in Table1, and some of those numbers are also stored in another table, Table2, e.g. 1001, 1002, 1003, 1004, 1005.
I want to get output like this:
Utilized Numbers | Balance Numbers
---------------- | ----------------
1001 - 1005 = 5  | 1006 - 1050 = 45
1245 - 1251 = 7  | 1252 - 1270 = 19
4571 - 4573 = 3  | 4574 - 4590 = 17
Each number of a series is a single field, stored in both tables.

You haven't really explained your data, but my guess is that "Utilized" means the numbers found in both Table1 and Table2, and "Balance" means the numbers only in Table1.
You can get the result at least this way; it's a little bit messy, mostly because of formatting the results:
Edit: This is a new version that does not use lag.
select
    min(case when C2 = 1 then MINID end), max(case when C2 = 1 then MAXID end), max(case when C2 = 1 then [ROWS] end),
    min(case when C2 = 0 then MINID end), max(case when C2 = 0 then MAXID end), max(case when C2 = 0 then [ROWS] end)
from (
    -- One row per consecutive run of IDs, numbered separately for utilized/balance runs
    select min(ID) as MINID, max(ID) as MAXID, count(*) as [ROWS], C2,
           row_number() over (partition by C2 order by min(ID)) as GRP3
    from (
        -- GRP1/GRP2 stay constant within each consecutive run (gaps-and-islands)
        select *, ID - RN as GRP1, ID - RN2 as GRP2
        from (
            select
                T1.ID,
                row_number() over (order by T1.ID) as RN,
                case when T2.ID is NULL then 0 else 1 end as C2,  -- 1 = utilized, 0 = balance
                row_number() over (partition by case when T2.ID is NULL then 0 else 1 end
                                   order by T1.ID) as RN2,
                T2.ID as ID2
            from #Table1 T1
            left outer join #Table2 T2 on T1.ID = T2.ID
        ) X
    ) Y
    group by GRP1, GRP2, C2
) Z
group by GRP3
order by 1
The idea here is to compute a row number ordered by Table1.ID and compare it to Table1.ID itself; where the difference changes, a new group starts (the classic gaps-and-islands trick). The same logic is used a second time, but now partitioned by whether the row exists in Table2, to handle the changes between "Utilized" and "Balance".
From those groupings you can get the min and max value plus the number of rows. There's one additional grouping with min/max and CASE to format the result into two columns.
See the demo.
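To experiment with this, here is a minimal setup sketch. The temp table and column names (#Table1, #Table2, a single ID column) are read off the query above; the sample data mirroring the first two series from the question is my assumption:
-- Hypothetical test data: #Table1 holds the full series, #Table2 the utilized numbers
create table #Table1 (ID int primary key);
create table #Table2 (ID int primary key);
-- Fill the series 1001-1050 and 1245-1270 from a small 0-49 tally
with nums as (
    select top (50) row_number() over (order by (select null)) - 1 as n
    from sys.all_objects
)
insert into #Table1 (ID)
select 1001 + n from nums
union all
select 1245 + n from nums where n < 26;
-- Utilized numbers: 1001-1005 and 1245-1251, as in the expected output
insert into #Table2 (ID)
select ID from #Table1 where ID between 1001 and 1005
union all
select ID from #Table1 where ID between 1245 and 1251;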

Filter rows and select into other columns in SQL?

I have a table like below.
If (OBJECT_ID('tempdb..#Temp') Is Not Null)
Begin
    Drop Table #Temp
End
create table #Temp
(
    Type int,
    Code Varchar(50)
)
Insert Into #Temp
SELECT 1, '1' UNION
SELECT 1, '2' UNION
SELECT 1, '3' UNION
SELECT 2, '4' UNION
SELECT 2, '5' UNION
SELECT 2, '6'

select * from #Temp
And I would like to get the result below.
Type_1 | Code_1 | Type_2 | Code_2
-------|--------|--------|-------
1      | 1      | 2      | 4
1      | 2      | 2      | 5
1      | 3      | 2      | 6
I have tried with UNION and INNER JOIN, but I am not getting the desired result. Please help.
You can use a full outer join and a CTE as follows. Note that the type filters are applied before the full join (in derived tables); putting them in the On clause would also return the non-matching rows of each type, padded with nulls:
With cte as
    (Select type, code,
            Row_number() over (partition by type order by code) as rn
     From your_table t)
Select t1.type, t1.code, t2.type, t2.code
From (Select * From cte Where type = 1) t1
Full join (Select * From cte Where type = 2) t2
    On t1.rn = t2.rn
Here is a query which will produce the output you expect:
WITH cte AS (
SELECT t.[Type], t.Code
, rn = ROW_NUMBER() OVER (PARTITION BY t.[Type] ORDER BY t.Code)
FROM #Temp t
)
SELECT Type_1 = t1.[Type], Code_1 = t1.Code
, Type_2 = t2.[Type], Code_2 = t2.Code
FROM cte t1
JOIN cte t2 ON t1.rn = t2.rn AND t2.[Type] = 2
AND t1.[Type] = 1
This query will filter out any Type_1 records which do not have a Type_2 record. This means that if there is an uneven number of Type_1 vs Type_2 records, the extra records will get eliminated.
Explanation:
Since there is no obvious way to join the two sets of data, because there is no shared key between them, we need to create one.
So we use this query:
SELECT t.[Type], t.Code
, rn = ROW_NUMBER() OVER (PARTITION BY t.[Type] ORDER BY t.Code)
FROM #Temp t
This assigns a ROW_NUMBER to every row. It restarts the numbering for every Type value, and it orders the numbering by the Code.
So it will produce:
| Type | Code | rn |
|------|------|----|
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 3 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
| 2 | 6 | 3 |
Now you can see that we have assigned a key to each row of Type 1's and Type 2's which we can use for the joining process.
In order for us to re-use this output, we can stick it in a CTE and perform a self join (not an actual type of join, it just means we want to join a table to itself).
That's what this query is doing:
SELECT *
FROM cte t1
JOIN cte t2 ON t1.rn = t2.rn AND t2.[Type] = 2
AND t1.[Type] = 1
It's saying, "give me a list of all Type 1 records, and then join all Type 2 records to that using the new ROW_NUMBER we've generated".
Note: All of this works based on the assumption that you always want to join the Type 1's and Type 2's based on the order of their Code.
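If you do need to keep the unmatched records when the Type 1 and Type 2 counts are uneven, a hedged sketch of the same idea with a FULL OUTER JOIN (filtering each side before the join so unmatched rows survive as NULLs):
-- Sketch: keep rows from either side when the two types have different row counts
WITH cte AS (
    SELECT t.[Type], t.Code
    , rn = ROW_NUMBER() OVER (PARTITION BY t.[Type] ORDER BY t.Code)
    FROM #Temp t
)
SELECT Type_1 = t1.[Type], Code_1 = t1.Code
, Type_2 = t2.[Type], Code_2 = t2.Code
FROM (SELECT * FROM cte WHERE [Type] = 1) t1
FULL OUTER JOIN (SELECT * FROM cte WHERE [Type] = 2) t2
    ON t1.rn = t2.rn;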
You can also do this using aggregation:
select max(case when type = 1 then type end) as type_1,
       max(case when type = 1 then code end) as code_1,
       max(case when type = 2 then type end) as type_2,
       max(case when type = 2 then code end) as code_2
from (select type, code,
             row_number() over (partition by type order by code) as seqnum
      from your_table t
     ) t
group by seqnum;
It would be interesting to know which is faster -- a join approach or aggregation.
Here is a db<>fiddle.

SELECT statement that shows continuous data with condition

I consider myself good at SQL, but I failed at this problem.
I need a SELECT statement that shows all rows with volume above 100, if there are 3 or more such rows next to each other.
Given Table "Trend":
| id | volume |
+----+--------+
|  0 |    200 |
|  1 |     90 |
|  2 |    101 |
|  3 |    120 |
|  4 |    200 |
|  5 |     10 |
|  6 |    400 |
I need a SELECT statement to produce:
| id | volume |
+----+--------+
|  2 |    101 |
|  3 |    120 |
|  4 |    200 |
I suspect that you are after the following logic:
select *
from (
    select t.*,
           sum(case when volume > 100 then 1 else 0 end)
               over (order by id rows between 2 preceding and 2 following) as cnt
    from mytable t
) t
where volume > 100 and cnt >= 3
This counts how many values are above 100 in the range made of the two preceding rows, the current row and the next two rows. Then we filter on rows whose window count is 3 or more.
This uses a syntax that most databases support (provided that window functions are available). Neater expressions may be available depending on the actual database you are using.
In MySQL:
sum(volume > 100) over(order by id rows between 2 preceding and 2 following) cnt
In Postgres:
count(*) filter(where volume > 100) over(order by id rows between 2 preceding and 2 following) cnt
Or:
sum((volume > 100)::int) over(order by id rows between 2 preceding and 2 following) cnt
This is tricky because you want the original rows... I am going to suggest lag() and lead():
select id, volume
from (select t.*,
             lag(volume, 2) over (order by id) as prev_volume_2,
             lag(volume) over (order by id) as prev_volume,
             lead(volume, 2) over (order by id) as next_volume_2,
             lead(volume) over (order by id) as next_volume
      from t
     ) t
where volume > 100 and
      ( (prev_volume_2 > 100 and prev_volume > 100) or
        (prev_volume > 100 and next_volume > 100) or
        (next_volume_2 > 100 and next_volume > 100)
      );
Another method is to treat this as a gaps-and-islands problem, which makes the solution more generalizable. You can assign a group to each row by counting the number of values less than or equal to 100 up to that row. Then count how many rows in each group are greater than 100 to see whether the group qualifies for the final results:
select id, volume
from (select t.*,
             sum(case when volume > 100 then 1 else 0 end) over (partition by grp) as cnt
      from (select t.*,
                   sum(case when volume <= 100 then 1 else 0 end) over (order by id) as grp
            from t
           ) t
     ) t
where volume > 100 and cnt >= 3;
Here is a db<>fiddle with these two approaches.
The key point here is "3 rows or more". MATCH_RECOGNIZE could be used:
SELECT *
FROM trend
MATCH_RECOGNIZE (
    ORDER BY id                    -- ordering of a streak
    MEASURES FINAL COUNT(*) AS l   -- count "per" match
    ALL ROWS PER MATCH             -- get all rows
    PATTERN (a{3,})                -- 3 or more
    DEFINE a AS volume > 100       -- condition of a streak
)
ORDER BY l DESC FETCH FIRST 1 ROWS WITH TIES;
-- choose the group that has the longest streak
The strength of this approach is the PATTERN part, which could be modified to handle different scenarios, like a{3,5} for between 3 and 5 occurrences, a{4} for exactly 4 occurrences, and so on. More conditions could be defined, which allows building complex pattern detection.
db<>fiddle demo
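As a sketch of the variations mentioned above (same Oracle-style syntax, same trend table; only the PATTERN line changes), limiting matches to streaks of between 3 and 5 rows might look like:
SELECT *
FROM trend
MATCH_RECOGNIZE (
    ORDER BY id
    ALL ROWS PER MATCH
    PATTERN (a{3,5})          -- between 3 and 5 occurrences only
    DEFINE a AS volume > 100
);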
Get the min value of volume over every window of 3 consecutive rows of the table.
Then join back to the table and keep only the rows belonging to a window that has min > 100:
select distinct t.*
from Trend t
inner join (
    select t.*,
           min(t.volume) over (order by t.id rows between current row and 2 following) min_volume,
           lead(t.id, 1) over (order by t.id) next1,
           lead(t.id, 2) over (order by t.id) next2
    from Trend t
) m on t.id in (m.id, m.next1, m.next2)
where m.min_volume > 100 and m.next1 is not null and m.next2 is not null
See the demo for SQL Server, MySql, Postgresql, Oracle, SQLite.
Results:
| id | volume |
|---:|-------:|
|  2 |    101 |
|  3 |    120 |
|  4 |    200 |
A simplistic approach:
--CREATE TABLE Trend (id integer, volume integer);
--insert into Trend VALUES
-- (0,200),
-- (1,90),
-- (2,101),
-- (3,120),
-- (4,200),
-- (5,10),
-- (6,400);
SELECT
    t1.id, t1.volume
    --,t2.id, t2.volume
    --,t3.id, t3.volume
FROM Trend t1
INNER JOIN Trend t2
    ON t2.id > t1.id AND t2.volume > 100
   AND NOT EXISTS (SELECT * FROM Trend t5 WHERE t5.id BETWEEN t1.id + 1 AND t2.id - 1)
INNER JOIN Trend t3
    ON t3.id > t2.id AND t3.volume > 100
   AND NOT EXISTS (SELECT * FROM Trend WHERE id BETWEEN t2.id + 1 AND t3.id - 1)
WHERE t1.volume > 100
UNION ALL
SELECT
    --t1.id, t1.volume
    t2.id, t2.volume
    --,t3.id, t3.volume
FROM Trend t1
INNER JOIN Trend t2
    ON t2.id > t1.id AND t2.volume > 100
   AND NOT EXISTS (SELECT * FROM Trend t5 WHERE t5.id BETWEEN t1.id + 1 AND t2.id - 1)
INNER JOIN Trend t3
    ON t3.id > t2.id AND t3.volume > 100
   AND NOT EXISTS (SELECT * FROM Trend WHERE id BETWEEN t2.id + 1 AND t3.id - 1)
WHERE t1.volume > 100
UNION ALL
SELECT
    --t1.id, t1.volume
    --t2.id, t2.volume
    t3.id, t3.volume
FROM Trend t1
INNER JOIN Trend t2
    ON t2.id > t1.id AND t2.volume > 100
   AND NOT EXISTS (SELECT * FROM Trend t5 WHERE t5.id BETWEEN t1.id + 1 AND t2.id - 1)
INNER JOIN Trend t3
    ON t3.id > t2.id AND t3.volume > 100
   AND NOT EXISTS (SELECT * FROM Trend WHERE id BETWEEN t2.id + 1 AND t3.id - 1)
WHERE t1.volume > 100

SELECT SQL Matching Number

I have millions of rows of data with similar values, like this:
| Id | Reff | Amount |
|----|------|--------|
|  1 | a1   |   1000 |
|  2 | a2   |  -1000 |
|  3 | a3   |  -2500 |
|  4 | a4   |  -1500 |
|  5 | a5   |   1500 |
Every amount should have both a positive and a negative value. The question is: how do I show only the records that don't have a matching opposite value, like the row with Id 3? Thanks for the help.
You can use not exists:
select t.*
from mytable t
where not exists (select 1 from mytable t1 where t1.amount = -1 * t.amount)
A left join anti-join pattern would also get the job done:
select t.*
from mytable t
left join mytable t1 on t1.amount = -1 * t.amount
where t1.id is null
Demo on DB Fiddle:
| Id | Reff | Amount |
|----|------|--------|
|  3 | a3   |  -2500 |
SQL Fiddle
MS SQL Server 2017 Schema Setup:
CREATE TABLE Test (
    Id int,
    Reff varchar(2),
    Amount int
);

INSERT INTO Test (Id, Reff, Amount) VALUES (1, 'a1',  1000);
INSERT INTO Test (Id, Reff, Amount) VALUES (2, 'a2', -1000);
INSERT INTO Test (Id, Reff, Amount) VALUES (3, 'a3', -2500);
INSERT INTO Test (Id, Reff, Amount) VALUES (4, 'a4', -1500);
INSERT INTO Test (Id, Reff, Amount) VALUES (5, 'a5',  1500);
Query 1:
select t.*
from Test t
left join Test t1 on t1.amount = ABS(t.amount)
where t1.id is null
Results:
| Id | Reff | Amount |
|----|------|--------|
| 3 | a3 | -2500 |
Using a NOT EXISTS or a LEFT JOIN will work fine for finding the amounts that don't have an opposite amount anywhere in the data.
But what about really finding the amounts that don't balance out, pairing them off in Id order?
Such a SQL puzzle can be handled as a gaps-and-islands problem. So the solution might appear a bit more complicated, but it's actually quite simple.
It first calculates a ranking per absolute value. Based on that ranking, it filters the last amount of each ranking group where the SUM over the group isn't balanced out (not 0):
SELECT Id, Reff, Amount
FROM
(
SELECT *,
SUM(Amount) OVER (PARTITION BY Rnk) AS SumAmountByRank,
ROW_NUMBER() OVER (PARTITION BY Rnk ORDER BY Id DESC) AS Rn
FROM
(
SELECT Id, Reff, Amount,
ROW_NUMBER() OVER (ORDER BY Id) - ROW_NUMBER() OVER (PARTITION BY ABS(Amount) ORDER BY Id) AS Rnk
FROM YourTable
) AS q1
) AS q2
WHERE SumAmountByRank != 0
AND Rn = 1
ORDER BY Id;
A test on rextester here
If the sequence doesn't matter and just the balance matters, then the query can be simplified:
SELECT Id, Reff, Amount
FROM
(
SELECT Id, Reff, Amount,
SUM(Amount) OVER (PARTITION BY ABS(Amount)) AS SumByAbsAmount,
ROW_NUMBER() OVER (PARTITION BY ABS(Amount) ORDER BY Id DESC) AS Rn
FROM YourTable
) AS q
WHERE SumByAbsAmount != 0
AND Rn = 1
ORDER BY Id;
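Both queries above read from YourTable. To try them against the sample data from the fiddle setup earlier in this thread, a quick sketch (SQL Server syntax) is to copy the Test table under that name:
-- Copy the sample rows into the table name used by the two queries above
SELECT Id, Reff, Amount
INTO YourTable
FROM Test;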

Cumulative distinct count filtered by last value - T-SQL

I am trying to come up with exactly the same answer as here:
Cumulative distinct count filtered by last value - DAX
but in SQL Server. For convenience I am copying the whole problem description.
I have a dataset:
| month | name | flag  |
|-------|------|-------|
| 1     | abc  | TRUE  |
| 2     | xyz  | TRUE  |
| 3     | abc  | TRUE  |
| 4     | xyz  | TRUE  |
| 5     | abc  | FALSE |
| 6     | abc  | TRUE  |
I want to calculate the month-cumulative distinct count of 'name', filtered by the last 'flag' value (TRUE). I.e. I want to have this result:
| month | count |
|-------|-------|
| 1     | 1     |
| 2     | 2     |
| 3     | 2     |
| 4     | 2     |
| 5     | 1     |
| 6     | 2     |
In month 5, 'abc' should be excluded because the flag switched to 'FALSE'; in month 6 it is counted again because the flag switched back to 'TRUE' (see the update below).
I am thinking about using the OVER clause with PARTITION BY, but I don't have any experience with it, so it's a struggle for me.
UPDATE
I have updated the last row in the exemplary source data.
Was: 6 abc FALSE
Is: 6 abc TRUE
And the last row in the output data.
Was: 6 1
Is: 6 2
It might not have been obvious from the description that it should work this way, and the proposed answer does not solve this problem.
UPDATE 2
I have managed to create a query that gives the result, but it's ugly and I think it could be shrunk by using the OVER clause. Can you help me with that?
select t5.month_current, count(*) as [count]
from
    (select t3.month month_current, t4.month months_until_current, t3.name, t4.flag
     from (select name, month
           from (select distinct name from Source_data) t1,
                (select distinct month from Source_data) t2) t3
     left join Source_data t4
            on t3.name = t4.name and t3.month >= t4.month) t5
inner join
    (select t3.month month_current, max(t4.month) real_max_month_until_current, t3.name
     from (select name, month
           from (select distinct name from Source_data) t1,
                (select distinct month from Source_data) t2) t3
     left join Source_data t4
            on t3.name = t4.name and t3.month >= t4.month
     group by t3.month, t3.name) t6
        on t5.month_current = t6.month_current
       and t5.months_until_current = t6.real_max_month_until_current
       and t5.name = t6.name
where t5.flag = 'TRUE'
group by t5.month_current
You can do a cumulative distinct count as:
select t.*,
       sum(case when seqnum = 1 then 1 else 0 end) over (order by month) as cnt
from (select t.*,
             row_number() over (partition by name order by month) as seqnum
      from t
     ) t;
I don't fully understand the logic for the flag, but you can replicate the results in the question by incorporating it:
select t.*,
       sum(case when seqnum = 1 and flag = 'true' then 1
                when seqnum = 1 and flag = 'false' then -1
                else 0
           end) over (order by month) as cnt
from (select t.*,
             row_number() over (partition by name, flag order by month) as seqnum
      from t
     ) t;
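As the question's update points out, a running sum of the flag cannot produce 2 for month 6 once 'abc' flips back to TRUE. The updated requirement needs, for each month, the flag of each name's latest row up to that month. A hedged sketch of that logic, assuming a Source_data table with month, name and flag columns as in the question:
-- For each month, count the names whose most recent flag (as of that month) is TRUE
select m.month, count(x.name) as [count]
from (select distinct month from Source_data) m
outer apply (
    select s.name
    from Source_data s
    where s.flag = 'TRUE'
      and s.month = (select max(s2.month)
                     from Source_data s2
                     where s2.name = s.name
                       and s2.month <= m.month)
) x
group by m.month
order by m.month;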

How to replicate a SAS merge

I have two tables, t1 and t2:
t1
person | visit | code1 | type1
     1 |     1 |    50 |    50
     1 |     1 |    50 |    50
     1 |     2 |    75 |    50
t2
person | visit | code2 | type2
     1 |     1 |    50 |    50
     1 |     1 |    50 |    50
     1 |     1 |    50 |    50
When SAS runs the following code:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
It generates the following dataset:
person | visit | code1 | type1 | code2 | type2
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     2 |    75 |    50 |       |
I want to replicate this process in SQL, and my idea was to use a full outer join. This works unless there are duplicate rows. When we have duplicate rows like in the above example, a full outer join produces the following table:
person | visit | code1 | type1 | code2 | type2
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     1 |    50 |    50 |    50 |    50
     1 |     2 |    75 |    50 |       |
I'm wondering how I would get the SQL table to match the SAS table.
Gordon's answer is close, but it misses one point. Here's its output:
person | visit | code1 | type1 | seqnum | person | visit | code2 | type2 | seqnum
     1 |     1 |     1 |     1 |      1 |      1 |     1 |     1 |     1 |      1
     1 |     1 |     2 |     2 |      2 |      1 |     1 |     2 |     2 |      2
  NULL |  NULL |  NULL |  NULL |   NULL |      1 |     1 |     3 |     3 |      3
     1 |     2 |     1 |     3 |      1 |   NULL |  NULL |  NULL |  NULL |   NULL
The third row's nulls are incorrect, while the fourth's are correct.
As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:
1. Matching person/visit, matching seqnums
2. Matching person/visit, left has more seqnums
3. Matching person/visit, right has more seqnums
4. Left has unmatched person/visit
5. Right has unmatched person/visit
I think the last two might be workable into one query, but I think the second and third have to be separate queries. You can union everything together, of course.
So here's an example, using some temporary tables that are a little better suited to seeing what's going on. Note that the third row is now filled in for code1 and type1, even though those are 'extra'. I've only added three of the five criteria (the three you had in your initial example), but the other two aren't too hard.
Note that this is an example of something far faster in SAS, because SAS has a row-wise concept: it's capable of going one row at a time. SQL tends to take a lot longer at these with large tables, unless it's possible to partition things very neatly and have very good indexes, and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept, of course; SQL has its own advantages, one of which is probably price...
Here's my example code. I'm sure it's not terribly elegant; hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables); the same thing should work with some changes (to use temporary tables) in other variants, assuming they implement windowing. (SAS of course can't do this particular thing, as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this, please feel free to edit and/or copy any bit you wish to a new or existing answer.
declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
declare @t2 table (person INT, visit INT, code2 INT, type2 INT);

insert into @t1 values (1, 1, 1, 1);
insert into @t1 values (1, 1, 2, 2);
insert into @t1 values (1, 2, 1, 3);
insert into @t2 values (1, 1, 1, 1);
insert into @t2 values (1, 1, 2, 2);
insert into @t2 values (1, 1, 3, 3);

-- 1. Matching person/visit, matching seqnums
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1 inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum = t2.seqnum
union all
-- 3. Matching person/visit, right has more seqnums: repeat the last left row
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (
     (select person, visit, MAX(seqnum) as max_rownum
      from (select person, visit,
                   row_number() over (partition by person, visit order by type1) as seqnum
            from @t1) t1_f
      group by person, visit
     ) t1_m inner join
     (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1
     on t1.person = t1_m.person and t1.visit = t1_m.visit
        and t1.seqnum = t1_m.max_rownum
     inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum < t2.seqnum
     )
union all
-- 4. Left has unmatched person/visit
select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
from @t1 t1 left join @t2 t2
     on t2.person = t1.person and t2.visit = t1.visit
where t2.code2 is null
You can replicate a SAS merge by adding a row_number() to each table:
select t1.*, t2.*
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
Notes:
The ?? means to put in the column(s) used for ordering. SAS datasets have an intrinsic order. SQL tables do not, so the ordering needs to be specified.
You should list the columns explicitly (instead of using t1.*, t2.* in the outer query). I think SAS only includes person and visit once in the resulting dataset.
EDIT:
Note: the above produces separate values for the key columns. This is easy enough to fix:
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
That fixes the columns issue. You can fix the copying issue by using first_value()/last_value() or by using a more complicated join condition:
select coalesce(t1.person, t2.person) as person,
coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
   (t1.seqnum = t2.seqnum or
    (t1.cnt > t2.cnt and t1.seqnum > t2.seqnum and t2.seqnum = t2.cnt) or
    (t2.cnt > t1.cnt and t2.seqnum > t1.seqnum and t1.seqnum = t1.cnt));
This implements the "keep the last row" logic in a single join. For performance reasons, you would probably want to split this into separate left joins following the original logic.
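To make the basic version concrete for the question's tables: the ?? placeholder could be filled with any deterministic ordering column. Using code1/code2 here is an assumption for illustration; SAS actually pairs duplicate rows by physical order, which SQL cannot reference directly:
-- Sketch: pairing duplicate person/visit rows by code1/code2 order (assumed ordering)
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
             row_number() over (partition by person, visit order by code1) as seqnum
      from t1
     ) t1 full outer join
     (select t2.*,
             row_number() over (partition by person, visit order by code2) as seqnum
      from t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum = t2.seqnum;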