What's the best way to get intersected data from one table?

Suppose I have the below table
CREATE TABLE [dbo].[TestData](
[ID] [bigint] NOT NULL,
[InstanceID] [int] NOT NULL,
[Field] [int] NULL,
[UserID] [bigint] NOT NULL
) ON [PRIMARY]
GO
INSERT [dbo].[TestData] ([ID], [InstanceID], [Field], [UserID])
VALUES (1, 1, NULL, 1000),(2, 1, NULL, 1002),(3, 1, NULL, 1000),
(4, 1, NULL, 1003),(5, 2, NULL, 1002), (6, 2, NULL, 1005),
(7, 2, NULL, 1006),(8, 2, NULL, 1007),(9, 3, NULL, 1002),
(10, 3, NULL, 1006),(11, 3, NULL, 1009),(12, 3, NULL, 1010),
(13, 1, NULL, 1006),(14, 2, NULL, 1002),(15, 3, NULL, 1003)
GO
I am looking for the best way to write a query that returns the full rows for the UserIDs shared by two instances.
For example, the UserIDs common to InstanceID 1 and 2 are (1002, 1006). To get the results, I wrote the query in two different ways:
Select * From TestData
Where UserID in
(
Select T1.UserID From TestData T1 Where InstanceID = 1
Intersect
Select T2.UserID From TestData T2 Where InstanceID = 2
)
and InstanceID in (1,2) Order By 1
Second
Select * From TestData
Where UserID in
(
Select Distinct T1.UserID
From TestData T1 join TestData T2 on T1.UserID = T2.UserID
Where T1.InstanceID = 1 and T2.InstanceID = 2
)
and InstanceID in (1,2) Order By 1
So the results will be the rows with ID 2, 5, 7, 13, and 14.
Is one of the above queries the best way to get the results?

Using EXISTS can be better than using IN. With an IN subquery, the entire result set may be processed, whereas EXISTS can stop searching as soon as a match is found. As for your question, I think the INTERSECT implementation simply does the join anyway, so there shouldn't be a difference.
EDIT: a linked post says that for IN vs. EXISTS the optimizer treats them the same as well (as of 2008). So my guess, together with what I just read, boils down to this: they will perform the same because the optimizer knows.

Here's an example of the query if you were to use EXISTS statements:
SELECT *
FROM TestData td
WHERE td.InstanceID IN (1, 2)
AND EXISTS
(SELECT 1
FROM TestData sub
WHERE td.UserID = sub.UserID
AND sub.InstanceID = 2)
AND EXISTS
(SELECT 1
FROM TestData sub
WHERE td.UserID = sub.UserID
AND sub.InstanceID = 1)
ORDER BY 1;
For the sample data provided, there was no noticeable performance difference between any of the three solutions. However, I agree with Scotch that using EXISTS statements will help performance over IN statements under specific scenarios.
The best thing you can do to improve performance is create the table with a PRIMARY KEY. Setting the ID field as a PRIMARY KEY will bolster performance by 50% since the highest cost of your query is sorting the data.

You can also do this with an aggregation and join:
select td.*
from TestData td join
(select userid
from TestData
group by userid
having sum(case when InstanceId = 1 then 1 else 0 end) > 0 and
sum(case when InstanceId = 2 then 1 else 0 end) > 0
) td2
on td.userid = td2.userid
where td.InstanceId in (1, 2)
The advantage to the aggregation is that the having clause makes it very flexible in terms of the conditions you can represent. Performance will be best if you have an index on userId, InstanceId.
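As a rough check that the approaches discussed so far agree, the sketch below loads the question's sample rows into an in-memory SQLite database and runs an INTERSECT, an EXISTS, and an aggregation variant. SQLite (through Python's sqlite3 module) is only a stand-in for SQL Server here, which is an assumption of this sketch; the aggregation variant filters the outer query to InstanceID IN (1, 2) so that all three return the same rows. It says nothing about relative performance on SQL Server.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; only the query logic is checked.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestData (ID INTEGER NOT NULL, InstanceID INTEGER NOT NULL, "
    "Field INTEGER, UserID INTEGER NOT NULL)"
)
rows = [
    (1, 1, 1000), (2, 1, 1002), (3, 1, 1000), (4, 1, 1003), (5, 2, 1002),
    (6, 2, 1005), (7, 2, 1006), (8, 2, 1007), (9, 3, 1002), (10, 3, 1006),
    (11, 3, 1009), (12, 3, 1010), (13, 1, 1006), (14, 2, 1002), (15, 3, 1003),
]
conn.executemany(
    "INSERT INTO TestData (ID, InstanceID, Field, UserID) VALUES (?, ?, NULL, ?)",
    rows,
)

QUERIES = {
    "intersect": """
        SELECT ID FROM TestData
        WHERE UserID IN (
            SELECT UserID FROM TestData WHERE InstanceID = 1
            INTERSECT
            SELECT UserID FROM TestData WHERE InstanceID = 2
        )
        AND InstanceID IN (1, 2)
        ORDER BY ID
    """,
    "exists": """
        SELECT td.ID FROM TestData td
        WHERE td.InstanceID IN (1, 2)
          AND EXISTS (SELECT 1 FROM TestData s
                      WHERE s.UserID = td.UserID AND s.InstanceID = 1)
          AND EXISTS (SELECT 1 FROM TestData s
                      WHERE s.UserID = td.UserID AND s.InstanceID = 2)
        ORDER BY td.ID
    """,
    "aggregate": """
        SELECT td.ID FROM TestData td
        JOIN (
            SELECT UserID FROM TestData
            GROUP BY UserID
            HAVING SUM(CASE WHEN InstanceID = 1 THEN 1 ELSE 0 END) > 0
               AND SUM(CASE WHEN InstanceID = 2 THEN 1 ELSE 0 END) > 0
        ) td2 ON td.UserID = td2.UserID
        WHERE td.InstanceID IN (1, 2)
        ORDER BY td.ID
    """,
}

# Collect the matching IDs per approach so they can be compared directly.
results = {name: [r[0] for r in conn.execute(sql)] for name, sql in QUERIES.items()}
for name, ids in results.items():
    print(name, ids)
```

All three variants should list the same IDs for the sample data, which is consistent with the point made above that the choice is mostly about readability and indexing.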

The following script uses two Index Seek operations and one Distinct Sort operation.
SELECT ID, InstanceID, Field, UserID
FROM [dbo].[TestData] t
WHERE InstanceID IN(1, 2)
AND EXISTS (
SELECT 1
FROM [dbo].[TestData] t2
WHERE InstanceID IN(1, 2) AND t.UserID = t2.UserID
HAVING COUNT(DISTINCT t2.InstanceID) = 2
)
ORDER BY t.ID
OR
;WITH cte AS
(
SELECT ID, InstanceID, Field, UserId
,COUNT(*) OVER(PARTITION BY InstanceID, UserID) AS cntInstanceUser
FROM [dbo].[TestData] t
WHERE InstanceID IN(1, 2)
)
SELECT c.ID, c.InstanceID, c.Field, c.UserID
FROM cte c
WHERE EXISTS (
SELECT 1
FROM cte c2
WHERE c2.UserId = c.UserID
HAVING COUNT(*) != c.cntInstanceUser
)
ORDER BY c.ID
For improving performance use this index:
CREATE INDEX x ON [dbo].[TestData](InstanceID, UserID) INCLUDE(Id, Field)

Related

Query based on multiple rows and multiple columns

Database : Azure SQL server 2019, .net core 3.0
I'm using stored procedure for querying data.
--Table Structure
create table yourtable
(
id int,
class int,
islab bit,
isschool bit
);
insert into yourtable
values (1, 1, 1, 1),
(1, 2, 1, 1),
(1, 3, 1, 1),
(2, 1, 1, 0),
(2, 2, 1, 1),
(2, 3, 1, 1)
Now I want a query that returns all unique ids where class = 1 and 2, islab = 1, and isschool = 1. It should return only id = 1 because:
a) id = 1 has both classes (1, 2), and in both classes islab = 1 and isschool = 1
b) id = 2 does not meet the condition because for class 1 isschool = 0
Can you help me write this query? Currently I fetch all rows for the input classes and then check the conditions in C# using lists. It works, but I want to do it all in SQL.
I think I could get the same result with a cursor in a stored procedure, but it is easier in C# because collections offer methods such as intersecting lists.

Assuming that islab and isschool are actually a bit (as bool doesn't exist in SQL Server), one method would be to use a HAVING with conditional aggregation. So, for the first one, you would do the following:
SELECT id
FROM dbo.yourtable
WHERE islab = 1
AND isschool = 1
GROUP BY id
HAVING COUNT(CASE class WHEN 1 THEN 1 END) = 1
AND COUNT(CASE class WHEN 2 THEN 1 END) = 1;

The question is not clear, but based on your expected output, I think you want a query that, for a given set of classes, finds the ids where there are no records with isschool = 0 or islab = 0. You can do this with a NOT EXISTS condition:
WITH mytab AS
(
SELECT *
FROM yourtable
WHERE class IN (1,2,3) -- Change this line to get your 3 different outputs
)
SELECT DISTINCT id
FROM mytab t1
WHERE NOT EXISTS
(
SELECT *
FROM mytab t2
WHERE t1.id = t2.id
AND (t2.islab = 0 OR t2.isschool = 0)
)
For class IN (1,2) this returns id 1.
For class IN (2,3) this returns ids 1 and 2.
For class IN (1,2,3) this returns id 1.
The CTE limits to the classes we want to consider. The subquery in the NOT EXISTS finds ids that should be eliminated because either isschool = 0 or islab = 0.
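The three outputs listed above can be reproduced with a small sketch. It runs the same NOT EXISTS query against an in-memory SQLite database through Python's sqlite3 module, which is an assumption standing in for Azure SQL; the helper name ids_without_failures is hypothetical and only exists for this illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Azure SQL; bit columns become INTEGER.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (id INTEGER, class INTEGER, islab INTEGER, isschool INTEGER)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 1), (1, 2, 1, 1), (1, 3, 1, 1),
     (2, 1, 1, 0), (2, 2, 1, 1), (2, 3, 1, 1)],
)

def ids_without_failures(conn, classes):
    """Return ids that have no row with islab = 0 or isschool = 0
    among the given classes (hypothetical helper for this sketch)."""
    placeholders = ", ".join("?" for _ in classes)
    sql = f"""
        WITH mytab AS (
            SELECT * FROM yourtable WHERE class IN ({placeholders})
        )
        SELECT DISTINCT id
        FROM mytab t1
        WHERE NOT EXISTS (
            SELECT 1 FROM mytab t2
            WHERE t1.id = t2.id
              AND (t2.islab = 0 OR t2.isschool = 0)
        )
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, list(classes))]

print(ids_without_failures(conn, (1, 2)))     # [1]
print(ids_without_failures(conn, (2, 3)))     # [1, 2]
print(ids_without_failures(conn, (1, 2, 3)))  # [1]
```

Note that this query does not check that an id actually has a row for every requested class; it only removes ids that have a failing row among those classes.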
An alternative way of doing this, using a LEFT JOIN instead of the NOT EXISTS condition is:
WITH mytab AS
(
SELECT *
FROM yourtable
WHERE class IN (1,2,3) -- Change this line to get your 3 different version
)
SELECT DISTINCT t1.id
FROM mytab t1 LEFT OUTER JOIN mytab t2
on t1.id = t2.id AND (t2.islab = 0 OR t2.isschool = 0)
WHERE t2.id is null

Group by absorb NULL unless it's the only value

I'm trying to group by a primary column and a secondary column. I want to ignore NULL in the secondary column unless it's the only value.
CREATE TABLE #tempx1 ( Id INT, [Foo] VARCHAR(10), OtherKeyId INT );
INSERT INTO #tempx1 ([Id],[Foo],[OtherKeyId]) VALUES
(1, 'A', NULL),
(2, 'B', NULL),
(3, 'B', 1),
(4, 'C', NULL),
(5, 'C', 1),
(6, 'C', 2);
I'm trying to get output like
Foo OtherKeyId
A NULL
B 1
C 1
C 2
This question is similar, but takes the MAX of the column I want, so it ignores other non-NULL values and won't work.
I tried to work out something based on this question, but I don't quite understand what that query does and can't get my output to work
-- Doesn't include Foo='A', creates duplicates for 'B' and 'C'
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [Foo] ORDER BY [OtherKeyId]) rn1
FROM #tempx1
)
SELECT c1.[Foo], c1.[OtherKeyId], c1.rn1
FROM cte c1
INNER JOIN cte c2 ON c2.[OtherKeyId] = c1.[OtherKeyId] AND c2.rn1 = c1.rn1
This is for a modern SQL Server: Microsoft SQL Server 2019

You can use a GROUP BY expression with a HAVING clause like the one below:
SELECT [Foo],[OtherKeyId]
FROM #tempx1 t
GROUP BY [Foo],[OtherKeyId]
HAVING SUM(CASE WHEN [OtherKeyId] IS NULL THEN 0 END) IS NULL
OR ( SELECT COUNT(*) FROM #tempx1 WHERE [Foo] = t.[Foo] ) = 1

Hmmm . . . I think you want filtering:
select t.*
from #tempx1 t
where t.otherkeyid is not null or
not exists (select 1
from #tempx1 t2
where t2.foo = t.foo and t2.otherkeyid is not null
);
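To see what this filter keeps, here is a minimal sketch that runs it on the question's sample rows. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server (an assumption), and a regular table named tempx1 in place of the #tempx1 temp table.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; tempx1 replaces #tempx1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tempx1 (Id INTEGER, Foo TEXT, OtherKeyId INTEGER)")
conn.executemany(
    "INSERT INTO tempx1 (Id, Foo, OtherKeyId) VALUES (?, ?, ?)",
    [(1, "A", None), (2, "B", None), (3, "B", 1),
     (4, "C", None), (5, "C", 1), (6, "C", 2)],
)

# Keep non-NULL keys; keep a NULL key only when the Foo has no non-NULL key.
sql = """
    SELECT t.Foo, t.OtherKeyId
    FROM tempx1 t
    WHERE t.OtherKeyId IS NOT NULL
       OR NOT EXISTS (
            SELECT 1 FROM tempx1 t2
            WHERE t2.Foo = t.Foo AND t2.OtherKeyId IS NOT NULL
          )
    ORDER BY t.Id
"""
result = conn.execute(sql).fetchall()
print(result)  # [('A', None), ('B', 1), ('C', 1), ('C', 2)]
```

The filter does not remove duplicates, so if the real data repeats a (Foo, OtherKeyId) pair, a DISTINCT or GROUP BY on top would still be needed.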

My actual problem is a bit more complicated than presented here, so I ended up using the idea from Barbaros Özhan's solution of counting the number of items. This results in two inner queries on the data set with two different GROUP BY clauses. I'm able to get the results I need on my real dataset using a query like the following:
SELECT
a.[Foo],
b.[OtherKeyId]
FROM (
SELECT
[Foo],
COUNT([OtherKeyId]) [C]
FROM #tempx1 t
GROUP BY [Foo]
) a
JOIN (
SELECT
[Foo],
[OtherKeyId]
FROM #tempx1 t
GROUP BY [Foo], [OtherKeyId]
) b ON b.[Foo] = a.[Foo]
WHERE
(b.[OtherKeyId] IS NULL AND a.[C] = 0)
OR (b.[OtherKeyId] IS NOT NULL AND a.[C] > 0)

Where clause on Running total

I have this table which stores containers by region and the number of coffee pouches in each of the containers.
if object_id( 'dbo.Container' ) is not null
drop table dbo.Container
go
create table dbo.Container
(
Id int not null,
Region int not null,
NumberOfCoffeePouches int not null,
constraint pkc_Container__Id primary key clustered(Id asc)
)
go
insert into dbo.Container
( Id , Region , NumberOfCoffeePouches )
values
( 1, 1, 10 ),
( 2, 1, 30 ),
( 3, 1, 5),
( 4, 1, 7),
( 5, 1, 1),
( 6, 1, 3),
( 7, 2, 4),
( 8, 2, 4),
( 9, 2, 4)
I need to list out the container Ids that will be used to fulfill an order of, say 50, coffee pouches. Over supplying is OK.
Here is query I have come up with
declare @RequiredCoffeePouches int = 50
select
sq2.Id,
sq2.NumberOfCoffeePouches,
sq2.RunningTotal,
sq2.LagRunningTotal
from
(
select
sq1.Id,
sq1.NumberOfCoffeePouches,
sq1.RunningTotal,
lag(sq1.RunningTotal, 1, 0) over (order by sq1.Id asc)
as 'LagRunningTotal'
from
(
select
c.Id,
c.NumberOfCoffeePouches,
sum(c.NumberOfCoffeePouches)
over (order by c.Id asc) as 'RunningTotal'
from
dbo.Container as c
where
c.Region = 1
) as sq1
) as sq2
where
sq2.LagRunningTotal <= @RequiredCoffeePouches
It gives the expected result
Id NumberOfCoffeePouches RunningTotal LagRunningTotal
----------- --------------------- ------------ ---------------
1 10 10 0
2 30 40 10
3 5 45 40
4 7 52 45
Question:
Is there a better and more optimized way to achieve this?
Especially since the Container table is very large, I think the subquery sq1 will unnecessarily calculate the RunningTotal for all the containers in the region. I was wondering if there is any way to have sq1 stop processing more rows once the RunningTotal exceeds @RequiredCoffeePouches.

Two things:
Apply the WHERE clause as close as possible to where the filtered value is computed, so the outer levels handle fewer rows. A column produced by a window function cannot be referenced in the WHERE clause of the same SELECT, so the filter has to sit one level above it. Using your example, the previous running total is simply RunningTotal - NumberOfCoffeePouches, which removes the LAG level and lets the filter move one level in:
SELECT
sq1.Id,
sq1.NumberOfCoffeePouches,
sq1.RunningTotal,
sq1.RunningTotal - sq1.NumberOfCoffeePouches AS 'LagRunningTotal'
FROM
(
SELECT
c.Id,
c.NumberOfCoffeePouches,
SUM(c.NumberOfCoffeePouches) OVER (order by c.Id asc) AS 'RunningTotal'
FROM dbo.Container AS c
WHERE c.Region = 1
) AS sq1
WHERE sq1.RunningTotal - sq1.NumberOfCoffeePouches <= @RequiredCoffeePouches
CTEs can also make the query easier to read and maintain:
;WITH sql1CTE AS (
SELECT
c.Id,
c.NumberOfCoffeePouches,
SUM(c.NumberOfCoffeePouches) OVER (order by c.Id asc) AS 'RunningTotal'
FROM dbo.Container AS c
WHERE c.Region = 1
),
sql2CTE AS (
SELECT
Id,
NumberOfCoffeePouches,
RunningTotal,
lag(RunningTotal, 1, 0) over (order by Id asc) AS 'LagRunningTotal'
FROM sql1CTE
)
SELECT
Id,
NumberOfCoffeePouches,
RunningTotal,
LagRunningTotal
FROM sql2CTE
WHERE LagRunningTotal <= @RequiredCoffeePouches
If you're using SSMS, select "Include Client Statistics" and "Include Actual Execution Plan" to keep track of how your query performs while you're crafting it.
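For a quick correctness check outside SSMS, the sketch below runs a running-total query on the question's sample data and compares it with the expected result shown there. It assumes an in-memory SQLite database (version 3.25 or later, for window functions) accessed through Python's sqlite3 module as a stand-in for SQL Server, and it keeps the filter in the outermost SELECT because a LAG result cannot be filtered in the same SELECT that computes it. It does not measure performance.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for SQL Server (window functions needed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Container (Id INTEGER PRIMARY KEY, Region INTEGER NOT NULL, "
    "NumberOfCoffeePouches INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Container (Id, Region, NumberOfCoffeePouches) VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 30), (3, 1, 5), (4, 1, 7), (5, 1, 1),
     (6, 1, 3), (7, 2, 4), (8, 2, 4), (9, 2, 4)],
)

sql = """
    WITH running AS (
        -- Running total of pouches per container, in Id order.
        SELECT Id, NumberOfCoffeePouches,
               SUM(NumberOfCoffeePouches) OVER (ORDER BY Id) AS RunningTotal
        FROM Container
        WHERE Region = :region
    ),
    lagged AS (
        -- Running total as it stood before the current container.
        SELECT Id, NumberOfCoffeePouches, RunningTotal,
               LAG(RunningTotal, 1, 0) OVER (ORDER BY Id) AS LagRunningTotal
        FROM running
    )
    SELECT Id, NumberOfCoffeePouches, RunningTotal, LagRunningTotal
    FROM lagged
    WHERE LagRunningTotal <= :required
    ORDER BY Id
"""
result = conn.execute(sql, {"region": 1, "required": 50}).fetchall()
for row in result:
    print(row)
```

For the sample data this lists containers 1 through 4, matching the expected result in the question.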

T-SQL - Copying & Transposing Data

I'm trying to copy data from one table to another, while transposing it and combining it into appropriate rows, with different columns in the second table.
First time posting. Yes this may seem simple to everyone here. I have tried for a couple hours to solve this. I do not have much support internally and have learned a great deal on this forum and managed to get so much accomplished with your other help examples. I appreciate any help with this.
Table 1 has the data in this format.
Type Date Value
--------------------
First 2019 1
First 2020 2
Second 2019 3
Second 2020 4
Table 2 already has the Date rows populated and columns created. It is waiting for the Values from Table 1 to be placed in the appropriate column/row.
Date First Second
------------------
2019 1 3
2020 2 4

For an update, I might use two joins:
update t2
set first = tf.value,
second = ts.value
from table2 t2 left join
table1 tf
on t2.date = tf.date and tf.type = 'First' left join
table1 ts
on t2.date = ts.date and ts.type = 'Second'
where tf.date is not null or ts.date is not null;

use conditional aggregation
select date,max(case when type='First' then value end) as First,
max(case when type='Second' then value end) as Second from t
group by date

You can do conditional aggregation:
select date,
max(case when type = 'First' then value end) as first,
max(case when type = 'Second' then value end) as Second
from table1 t
group by date;
After that you can use a CTE for the update:
with cte as (
select date,
max(case when type = 'First' then value end) as first,
max(case when type = 'Second' then value end) as Second
from table1 t
group by date
)
update t2
set t2.First = t1.First,
t2.Second = t1.Second
from table2 t2 inner join
cte t1
on t1.date = t2.date;
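The conditional aggregation step can be checked in isolation. The sketch below loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module (an assumption standing in for SQL Server; the table name table1 is also assumed) and prints the transposed rows. The update back into the second table is not part of this sketch.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; table1 is an assumed table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (Type TEXT, Date INTEGER, Value INTEGER)")
conn.executemany(
    "INSERT INTO table1 (Type, Date, Value) VALUES (?, ?, ?)",
    [("First", 2019, 1), ("First", 2020, 2),
     ("Second", 2019, 3), ("Second", 2020, 4)],
)

# One output row per Date; each CASE picks the value for one Type,
# and MAX collapses the group to that single non-NULL value.
sql = """
    SELECT Date,
           MAX(CASE WHEN Type = 'First'  THEN Value END) AS "First",
           MAX(CASE WHEN Type = 'Second' THEN Value END) AS "Second"
    FROM table1
    GROUP BY Date
    ORDER BY Date
"""
transposed = conn.execute(sql).fetchall()
print(transposed)  # [(2019, 1, 3), (2020, 2, 4)]
```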

Seems like you're after a PIVOT
DECLARE @Table1 TABLE
(
[Type] NVARCHAR(100)
, [Date] INT
, [Value] INT
);
DECLARE @Table2 TABLE(
[Date] int
,[First] int
,[Second] int
)
INSERT INTO @Table1 (
[Type]
, [Date]
, [Value]
)
VALUES ( 'First', 2019, 1 )
, ( 'First', 2020, 2 )
, ( 'Second', 2019, 3 )
, ( 'Second', 2020, 4 );
INSERT INTO @Table2 (
[Date]
)
VALUES (2019),(2020)
--Show us what's in the tables
SELECT * FROM @Table1
SELECT * FROM @Table2
--How to pivot the data from Table 1
SELECT * FROM @Table1
PIVOT (
MAX([Value]) --Aggregate this column
FOR [Type] IN ( [First], [Second] ) --Make one column per listed [Type] value
) AS [pvt] --Table alias
--which gives
--Date First Second
------------- ----------- -----------
--2019 1 3
--2020 2 4
--Using that we can update @Table2
UPDATE [tbl2]
SET [tbl2].[First] = pvt.[First]
,[tbl2].[Second] = pvt.[Second]
FROM @Table1 tbl1
PIVOT (
MAX([Value]) --Aggregate this column
FOR [Type] IN ( [First], [Second] ) --Make one column per listed [Type] value
) AS [pvt] --Table alias
INNER JOIN @Table2 tbl2 ON [tbl2].[Date] = [pvt].[Date]
--Results from @Table2 after the update
SELECT * FROM @Table2
--which gives
--Date First Second
------------- ----------- -----------
--2019 1 3
--2020 2 4

SQL: How to update multiple fields so empty field content is moved to the logically last columns - lose blank address lines

I have three address line columns, aline1, aline2, aline3 for a street
address. As staged from inconsistent data, any or all of them can be
blank. I want to move the first non-blank to addrline1, 2nd non-blank
to addrline2, and clear line 3 if there aren't three non blank lines,
else leave it. ("First" means aline1 is first unless it's blank,
aline2 is first if aline1 is blank, aline3 is first if aline1 and 2
are both blank)
The rows in this staging table do not have a key and there could be
duplicate rows. I could add a key.
Not counting a big case statement that enumerates the possible
combination of blank and non blank and moves the fields around, how
can I update the table? (This same problem comes up with a lot more
than 3 lines, so that's why I don't want to use a case statement)
I'm using Microsoft SQL Server 2008

Another alternative. It uses the undocumented %%physloc%% function to work without a key. You would be much better off adding a key to the table.
CREATE TABLE #t
(
aline1 VARCHAR(100),
aline2 VARCHAR(100),
aline3 VARCHAR(100)
)
INSERT INTO #t VALUES(NULL, NULL, 'a1')
INSERT INTO #t VALUES('a2', NULL, 'b2')
;WITH cte
AS (SELECT *,
MAX(CASE WHEN RN=1 THEN value END) OVER (PARTITION BY %%physloc%%) AS new_aline1,
MAX(CASE WHEN RN=2 THEN value END) OVER (PARTITION BY %%physloc%%) AS new_aline2,
MAX(CASE WHEN RN=3 THEN value END) OVER (PARTITION BY %%physloc%%) AS new_aline3
FROM #t
OUTER APPLY (SELECT ROW_NUMBER() OVER (ORDER BY CASE WHEN value IS NULL THEN 1 ELSE 0 END, idx) AS
RN, idx, value
FROM (VALUES(1,aline1),
(2,aline2),
(3,aline3)) t (idx, value)) d)
UPDATE cte
SET aline1 = new_aline1,
aline2 = new_aline2,
aline3 = new_aline3
SELECT *
FROM #t
DROP TABLE #t

Here's an alternative
Sample table for discussion, don't worry about the nonsensical data, they just need to be null or not
create table taddress (id int,a varchar(10),b varchar(10),c varchar(10));
insert taddress
select 1,1,2,3 union all
select 2,1, null, 3 union all
select 3,null, 1, 2 union all
select 4,null,null,2 union all
select 5,1, null, null union all
select 6,null, 4, null
The query, which really just normalizes the data
;with tmp as (
select *, rn=ROW_NUMBER() over (partition by t.id order by sort)
from taddress t
outer apply
(
select 1, t.a where t.a is not null union all
select 2, t.b where t.b is not null union all
select 3, t.c where t.c is not null
--- EXPAND HERE
) u(sort, line)
)
select t0.id, t1.line, t2.line, t3.line
from taddress t0
left join tmp t1 on t1.id = t0.id and t1.rn=1
left join tmp t2 on t2.id = t0.id and t2.rn=2
left join tmp t3 on t3.id = t0.id and t3.rn=3
--- AND HERE
order by t0.id
EDIT - for the update back into table
;with tmp as (
select *, rn=ROW_NUMBER() over (partition by t.id order by sort)
from taddress t
outer apply
(
select 1, t.a where t.a is not null union all
select 2, t.b where t.b is not null union all
select 3, t.c where t.c is not null
--- EXPAND HERE
) u(sort, line)
)
UPDATE taddress
set a = t1.line,
b = t2.line,
c = t3.line
from taddress t0
left join tmp t1 on t1.id = t0.id and t1.rn=1
left join tmp t2 on t2.id = t0.id and t2.rn=2
left join tmp t3 on t3.id = t0.id and t3.rn=3

Update - Changed statement to an Update statement. Removed Case statement solution
With this solution, you will need a unique key in the staging table.
With Inputs As
(
Select PK, 1 As LineNum, aline1 As Value
From StagingTable
Where aline1 Is Not Null
Union All
Select PK, 2, aline2
From StagingTable
Where aline2 Is Not Null
Union All
Select PK, 3, aline3
From StagingTable
Where aline3 Is Not Null
)
, ResequencedInputs As
(
Select PK, Value
-- Renumber the non-blank lines within each row of the staging table
, Row_Number() Over( Partition By PK Order By LineNum ) As LineNum
From Inputs
)
, NewValues As
(
Select S.PK
, Min( Case When R.LineNum = 1 Then R.Value End ) As addrline1
, Min( Case When R.LineNum = 2 Then R.Value End ) As addrline2
, Min( Case When R.LineNum = 3 Then R.Value End ) As addrline3
From StagingTable As S
Left Join ResequencedInputs As R
On R.PK = S.PK
Group By S.PK
)
Update OtherTable
Set addrline1 = T2.addrline1
, addrline2 = T2.addrline2
, addrline3 = T2.addrline3
From OtherTable As T
Left Join NewValues As T2
On T2.PK = T.PK

R. A. Cyberkiwi, Thomas, and Martin, thanks very much - these were very generous responses by each of you. All of these answers were the type of spoonfeeding I was looking for. I'd say they all rely on a key-like device and work by dividing addresses into lines, some of which are empty and some of which aren't, and excluding the empties. In the case of lines of addresses, in my opinion this is semantically a gimmick to make the problem fit what SQL does well, and it's not a natural way to conceptualize the problem. Address lines are not "really" separate rows in a table that just got denormalized for a report. But that's debatable, and whether you agree or not, I (a rank beginner) think each of your alternatives is an idiomatic solution worth elaborating on and studying.
I also get lots of similar cases where there really is normalization to be done - e.g., collatDesc1, collatCode1, collatLastAppraisal1, ... collatLastAppraisal5, with more complex criteria about what to exclude and how to order than with addresses, and I think techniques from your answers will be helpful.
%%physloc%% is fun - since I'm able to create a key in this case I won't use it (as Martin advises). There was other stuff in Martin's answer I wasn't familiar with too, and I'm still tossing it all around.
FWIW, here's the trigger I tried out. I don't know that I'll actually use it for the problem at hand. I think this qualifies as a "bubble sort", with the swapping expressed in a peculiar way.
create trigger fixit on lines
instead of insert as
declare @maybeBlank1 as varchar(max)
declare @maybeBlank2 as varchar(max)
declare @maybeBlank3 as varchar(max)
set @maybeBlank1 = (select line1 from inserted)
set @maybeBlank2 = (select line2 from inserted)
set @maybeBlank3 = (select line3 from inserted)
declare @counter int
set @counter = 0
while @counter < 3
begin
set @counter = @counter + 1
if @maybeBlank2 = ''
begin
set @maybeBlank2 = @maybeBlank3
set @maybeBlank3 = ''
end
if @maybeBlank1 = ''
begin
set @maybeBlank1 = @maybeBlank2
set @maybeBlank2 = ''
end
end
select * into #kludge from inserted
update #kludge
set line1 = @maybeBlank1,
line2 = @maybeBlank2,
line3 = @maybeBlank3
insert into lines
select * from #kludge

You could create an insert and update trigger that checks whether the fields are empty and then moves them.
