Optimizing multiple IF-IN-SELECT statements in SQL

I'm trying to create "flag" columns to see if the primary keys of my main table are in other tables:
SELECT
    id
    ,IIF(id IN (
        SELECT DISTINCT id
        FROM dbo.example1
    ), 1, 0) AS example1_flag
    ,IIF(id IN (
        SELECT DISTINCT id
        FROM dbo.example2
    ), 1, 0) AS example2_flag
    --etc.
FROM dbo.main_table
I'm doing this multiple times with around ten tables (i.e. creating about ten new columns, each from a different table), and all the tables involved have around a couple million rows. So far, it's a lot slower than I expected. Is there a better way to write this query, or is there any way to optimize it?

Use EXISTS with SELECT 1. The nice thing about it is that you don't need a DISTINCT and you don't need to fetch all the records; you just need to verify that one matching id exists.
SELECT
    id
    ,CASE WHEN EXISTS (
        SELECT 1 FROM dbo.example1 b WHERE a.id = b.id) THEN 1 ELSE 0 END AS example1_flag
    ,CASE WHEN EXISTS (
        SELECT 1 FROM dbo.example2 b WHERE a.id = b.id) THEN 1 ELSE 0 END AS example2_flag
    --etc.
FROM dbo.main_table a
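As a side note (not from the original answer), each probed table benefits from an index on id, so that every EXISTS probe becomes a seek rather than a scan. A sketch, with hypothetical index names:
CREATE INDEX IX_example1_id ON dbo.example1 (id);
CREATE INDEX IX_example2_id ON dbo.example2 (id);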

Related

Finding the id's which include multiple criteria in long format

Suppose I have a table like this,
id  tagId
--  -----
1   1
1   2
1   5
2   1
2   5
3   2
3   4
3   5
3   8
I want to select ids where tagId includes both 2 and 5. For this fake data set, it should return 1 and 3.
I tried,
select id from [dbo].[mytable] where tagId IN(2,5)
But that treats 2 and 5 independently, returning ids that have either tag. I also did not want to keep my table in wide format, since tagId is dynamic and can reach any number of columns. I also considered filtering with two different queries and (somehow) finding the intersection. However, since I may search for more than two values inside tagId in real life, that sounds inefficient to me.
I am sure that this is something faced before when tag searching. What do you suggest? Changing table format?
One option is to count the number of distinct tagIds (from the ones you're looking for) each id has:
SELECT id
FROM [dbo].[mytable]
WHERE tagId IN (2,5)
GROUP BY id
HAVING COUNT(DISTINCT tagId) = 2
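Note that the constant 2 must match the number of distinct tags searched for; with three tags, for instance, the same pattern becomes (a sketch):
SELECT id
FROM [dbo].[mytable]
WHERE tagId IN (2, 5, 8)
GROUP BY id
HAVING COUNT(DISTINCT tagId) = 3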
This is actually a Relational Division With Remainder question.
First, you have to place your input into proper table format. I suggest you use a Table Valued Parameter if executing from client code. You can also use a temp table or table variable.
DECLARE @ids TABLE (tagId int PRIMARY KEY);
INSERT @ids VALUES (2), (5);
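If the ids come from client code, a table-valued parameter keeps the same shape. A sketch, where the type and procedure names are mine and the body uses the count-comparison query shown further down:
CREATE TYPE dbo.TagIdList AS TABLE (tagId int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.FindIdsWithAllTags @ids dbo.TagIdList READONLY
AS
    SELECT mt.id
    FROM mytable mt
    JOIN @ids i ON i.tagId = mt.tagId
    GROUP BY mt.id
    HAVING COUNT(*) = (SELECT COUNT(*) FROM @ids);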
There are a number of different solutions to this type of question.
Classic double-negative EXISTS
SELECT DISTINCT mt.id
FROM mytable mt
WHERE NOT EXISTS (
    SELECT 1
    FROM @ids i
    WHERE NOT EXISTS (
        SELECT 1
        FROM mytable mt2
        WHERE mt2.id = mt.id
          AND mt2.tagId = i.tagId
    )
);
This is not usually efficient, though.
Comparing to the total number of IDs to match
SELECT mt.id
FROM mytable mt
JOIN @ids i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM @ids);
This is much more efficient. You can also do this using a window function; it may be more or less efficient, YMMV.
SELECT mt.Id
FROM mytable mt
JOIN (
    SELECT *,
           total = COUNT(*) OVER ()
    FROM @ids i
) i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = MIN(i.total);
Another solution involves cross-joining everything and checking how many matches there are using conditional aggregation:
SELECT mt.id
FROM (
    SELECT
        mt.id,
        mt.tagId,
        matches = SUM(CASE WHEN i.tagId = mt.tagId THEN 1 END),
        total = COUNT(*)
    FROM mytable mt
    CROSS JOIN @ids i
    GROUP BY
        mt.id,
        mt.tagId
) mt
GROUP BY mt.id
HAVING SUM(matches) = MIN(total)
   AND MIN(matches) >= 0;
There are other solutions also; see High Performance Relational Division in SQL Server.

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c' (doesn't matter which), since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ million records...you can imagine it would be very slow (I was later informed of the number of records; the sample data I was given was much smaller):
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
       (2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
    SELECT t1.GroupID, t1.SubSet
         , NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value])
                         FROM @values t2
                         WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet
                         ORDER BY t2.[Value]
                         FOR XML PATH(''))
    FROM @values t1
    GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and SubSet and concatenating all of the values into a comma-delimited string, then grouping on GroupID and the value list and taking the MIN SubSet.
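As an aside, on SQL Server 2017 or later the FOR XML PATH trick can be replaced with STRING_AGG; a sketch of the same query under that version assumption:
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
    SELECT GroupID, SubSet,
           NameValues = STRING_AGG(CONVERT(VARCHAR(10), [Value]), ',')
                            WITHIN GROUP (ORDER BY [Value])
    FROM @values
    GROUP BY GroupID, SubSet
) x
GROUP BY x.GroupID, x.NameValues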
I'd go with something like this:
;with cte as
(
    select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
    from @values v
    group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
    select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
    from cte c1
    join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
) x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values
From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
This is because it is a sum of the values: 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6. Note that the last equality also shows how two different sets ({1,2,3} and {3,3}) can collide on the same aggregate, which is why the query above compares avg as well.
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
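A quick sketch (mine, not from the answer) showing both behaviors side by side:
-- Same three bytes in different orders give different hashes:
SELECT HashBytes('MD5', 0x010203) AS hash_123,
       HashBytes('MD5', 0x030201) AS hash_321;
-- ...while CHECKSUM_AGG ignores row order (both results are equal):
SELECT (SELECT CHECKSUM_AGG(v) FROM (VALUES (1),(2),(3)) t(v)) AS cs_123,
       (SELECT CHECKSUM_AGG(v) FROM (VALUES (3),(2),(1)) t(v)) AS cs_321;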
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
       (2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
  DistinctGroups as (
    select distinct GroupId, Subset
    from @values ),
  GroupConcatenatedValues as (
    select GroupId, Subset, Convert( VarBinary(256), (
        select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
        from @values as V
        where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
        order by Value
        for XML Path('') ), 2 ) as GroupedBinary
    from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select:
--   select * from DistinctGroups;
--   select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
    select groupId,
           subset,
           cs = checksum_agg(value)
    from @values v
    group by groupId,
             subset
),
maybeBadSubsets as (
    select groupId,
           subset,
           cs,
           deleteSubset =
               case
                   when row_number() over (
                            partition by groupId, cs
                            order by subset
                        ) > 1
                   then 1
               end
    from getChecksums
)
delete v
from @values v
where exists (
    select 0
    from maybeBadSubsets mbs
    where v.groupId = mbs.groupId
      and v.SubSet = mbs.subset
      and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
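For example, here is a sketch (mine, not from the answer) that uses the checksum purely as a prefilter and verifies colliding pairs exactly before anything is deleted; it assumes values are distinct within each GroupID/SubSet, as in the sample data:
with sigs as (
    select GroupID, SubSet,
           checksum_agg([Value]) as cs,   -- cheap signature
           count(*)              as cnt
    from @values
    group by GroupID, SubSet
)
select s1.GroupID, s2.SubSet as DuplicateSubSet   -- keeps the first SubSet
from sigs s1
join sigs s2
  on s2.GroupID = s1.GroupID
 and s2.SubSet  > s1.SubSet
 and s2.cs      = s1.cs
 and s2.cnt     = s1.cnt
where not exists (                                -- exact verification
    select 1
    from @values v2
    where v2.GroupID = s2.GroupID and v2.SubSet = s2.SubSet
      and not exists (
          select 1
          from @values v1
          where v1.GroupID = s1.GroupID and v1.SubSet = s1.SubSet
            and v1.[Value] = v2.[Value]));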
Note: CTEs can have a quirk performance-wise. If you find that the query engine is re-running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
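If materializing does turn out to be necessary, the checksum step can be staged once up front (again a sketch, not part of the original answer):
select groupId, subset, checksum_agg(value) as cs
into #checksums
from @values
group by groupId, subset;
-- then run the row_number/delete logic against #checksums instead of the CTE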
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.

SELECT VALUES in Teradata

I know that it's possible in other SQL flavors (T-SQL) to "select" provided data without a table. Like:
SELECT *
FROM (VALUES (1,2), (3,4)) tbl
How can I do this using Teradata?
Teradata has strange syntax for this:
select t.*
from (select * from (select 1 as a, 2 as b) x
union all
select * from (select 3 as a, 4 as b) x
) t;
I don't have access to a TD system to test, but you might be able to remove one of the nested SELECTs from the answer above:
select x.*
from (
select 1 as a, 2 as b
union all
select 3 as a, 4 as b
) x
If you need to generate some random rows, you can always do a SELECT from a system table, like sys_calendar.calendar:
SELECT 1, 2
FROM sys_calendar.calendar
SAMPLE 10;
Updated example:
SELECT TOP 1000 -- Limit to 1000 rows (you can use SAMPLE too)
ROW_NUMBER() OVER() MyNum, -- Sequential numbering
MyNum MOD 7, -- Modulo operator
RANDOM(1,1000), -- Random number between 1 and 1000
HASHROW(MyNum) -- Rowhash value of given column(s)
FROM sys_calendar.calendar; -- Use as table to source rows
A couple of notes:
- make sure you pick a system table that will always be present and have rows
- if you need more rows than are available in the source table, do a UNION to get more rows
- you can always easily create a one-column table and populate it to whatever number of rows you want by INSERT/SELECT into it:
CREATE TABLE DummyTable (c1 INT); -- Create table
INSERT INTO DummyTable VALUES (1); -- Seed table
INSERT INTO DummyTable SELECT * FROM DummyTable; -- Run this to double the rows as many times as you want
Then use this table to create whatever resultset you want, similar to the query above with sys_calendar.calendar.
I don't have a TD system to test so you might get syntax errors...but that should give you a basic idea.
I am a bit late to this thread, but recently ran into the same problem.
I solved this by simply using
select distinct 1 as a, 2 as b from DBC.tables
union all
select distinct 3 as a, 4 as b from DBC.tables
Here, DBC.tables is a backend DB table with only a few rows, so the query runs fast as well.

Doing a join only if count is greater than one

I wonder if the following, somewhat contrived, example is possible without using intermediary variables and a conditional clause.
Consider an intermediary query which can produce a result set containing either no rows, one row, or multiple rows. Most of the time it produces just one row, but when there are multiple rows, one should join the resulting rows to another table to prune them down to either one or no rows. After this, if there is one row (as opposed to no rows), one would want to return multiple columns as produced by the original intermediary query.
I have in my mind something like the following, but it obviously won't work (multiple columns in a switch-case, no join etc.), but maybe it illustrates the point. What I would like is to just return what is currently in the SELECT clause in case @@ROWCOUNT = 1, or in case it is greater, do an INNER JOIN to Auxilliary, which prunes x down to either one row or no rows, and then return that. I don't want to search Main more than once, and Auxilliary only when x here contains more than one row.
SELECT x.MainId, x.Data1, x.Data2, x.Data3,
    CASE
        WHEN @@ROWCOUNT IS NOT NULL AND @@ROWCOUNT = 1 THEN
            1
        WHEN @@ROWCOUNT IS NOT NULL AND @@ROWCOUNT > 1 THEN
            -- Use here @id or MainId to join to Auxilliary and there
            -- FilteringCondition = @filteringCondition to prune x to either
            -- one or zero rows.
    END
FROM
(
    SELECT
        MainId,
        Data1,
        Data2,
        Data3
    FROM Main
    WHERE MainId = @id
) AS x;
CREATE TABLE Main
(
    -- This Id may introduce more than one row, so it is joined to
    -- Auxilliary for further pruning with the given conditions.
    MainId INT,
    Data1 NVARCHAR(MAX) NOT NULL,
    Data2 NVARCHAR(MAX) NOT NULL,
    Data3 NVARCHAR(MAX) NOT NULL,
    AuxilliaryId INT NOT NULL
);
CREATE TABLE Auxilliary
(
    AuxilliaryId INT IDENTITY(1, 1) PRIMARY KEY,
    FilteringCondition NVARCHAR(1000) NOT NULL
);
Would this be possible in one query without a temporary table variable and a conditional? Without using a CTE?
Some sample data would be
INSERT INTO Auxilliary(FilteringCondition)
VALUES
(N'SomeFilteringCondition1'),
(N'SomeFilteringCondition2'),
(N'SomeFilteringCondition3');
INSERT INTO Main(MainId, Data1, Data2, Data3, AuxilliaryId)
VALUES
(1, N'SomeMainData11', N'SomeMainData12', N'SomeMainData13', 1),
(1, N'SomeMainData21', N'SomeMainData22', N'SomeMainData23', 2),
(2, N'SomeMainData31', N'SomeMainData32', N'SomeMainData33', 3);
And a sample query, which actually behaves as I'd like it to, with the caveat that I'd want to do the join only if querying Main directly with the given ID produces more than one result.
DECLARE @id AS INT = 1;
DECLARE @filteringCondition AS NVARCHAR(1000) = N'SomeFilteringCondition1';
SELECT *
FROM Main
INNER JOIN Auxilliary AS aux ON aux.AuxilliaryId = Main.AuxilliaryId
WHERE MainId = @id AND aux.FilteringCondition = @filteringCondition;
You don't usually use a join to reduce the result set of the left table. To limit a result set you'd use the where clause instead. In combination with another table this would be WHERE [NOT] EXISTS.
So let's say this is your main query:
select * from main where main.col1 = 1;
It returns one of the following results:
- no rows, then we are done
- one row, then we are also done
- more than one row, then we must extend the where clause
The query with the extended where clause:
select * from main where main.col1 = 1
and exists (select * from other where other.col2 = main.col3);
which returns one of the following results:
- no rows, which is okay
- one row, which is okay
- more than one row - you say this is not possible
So the task is to do this in one step instead. I count records and look for a match in the other table for every record. Then ...
- if the count is zero we get no result anyway
- if it is one I take that row
- if it is greater than one, I take the row for which a match exists in the other table, or none when there is no match
Here is the full query:
select *
from
(
    select
        main.*,
        count(*) over () as cnt,
        case when exists (select * from other where other.col2 = main.col3) then 1 else 0 end
            as other_exists
    from main
    where main.col1 = 1
) counted_and_checked
where cnt = 1 or other_exists = 1;
UPDATE: I understand that you want to avoid unnecessary access to the other table. This is rather difficult to do however.
In order to only use the subquery when necessary, we could move it outside:
select *
from
(
    select
        main.*,
        count(*) over () as cnt
    from main
    where main.col1 = 1
) counted_and_checked
where cnt = 1 or exists (select * from other where other.col2 = counted_and_checked.col3);
This looks much better in my opinion. However, there is no evaluation-order guarantee between the two expressions left and right of an OR, so the DBMS may still execute the subselect on every record before evaluating cnt = 1.
The only operation that I know of that evaluates left to right, i.e. doesn't look further once a condition on the left-hand side is matched, is COALESCE. So we could do the following:
select *
from
(
    select
        main.*,
        count(*) over () as cnt
    from main
    where main.col1 = 1
) counted_and_checked
where coalesce( case when cnt = 1 then 1 else null end ,
                (select count(*) from other where other.col2 = counted_and_checked.col3)
              ) > 0;
This may look a bit strange, but it should prevent the subquery from being executed when cnt is 1.
You may try something like
select * from Main m
where MainId = @id
and @filteringCondition = case when (select count(*) from Main m2 where m2.MainId = @id) > 1
    then (select FilteringCondition from Auxilliary a where a.AuxilliaryId = m.AuxilliaryId)
    else @filteringCondition end
but it's hardly a very fast query. I'd rather use a temp table, or just an IF and two queries.
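To spell out that last suggestion, here is a sketch of the "IF and two queries" route, using the question's own tables and variables (note it does read Main twice, which the question hoped to avoid):
DECLARE @id AS INT = 1;
DECLARE @filteringCondition AS NVARCHAR(1000) = N'SomeFilteringCondition1';
IF (SELECT COUNT(*) FROM Main WHERE MainId = @id) > 1
    -- More than one row: prune via Auxilliary.
    SELECT m.MainId, m.Data1, m.Data2, m.Data3
    FROM Main m
    INNER JOIN Auxilliary aux ON aux.AuxilliaryId = m.AuxilliaryId
    WHERE m.MainId = @id AND aux.FilteringCondition = @filteringCondition;
ELSE
    -- Zero or one row: return it as-is.
    SELECT m.MainId, m.Data1, m.Data2, m.Data3
    FROM Main m
    WHERE m.MainId = @id;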

Count(*) with 0 for boolean field

Let's say I have a boolean field in a database table and I want to get a tally of how many are 1 and how many are 0. Currently I am doing:
SELECT 'yes' AS result, COUNT( * ) AS num
FROM `table`
WHERE field = 1
UNION
SELECT 'no' AS result, COUNT( * ) AS num
FROM `table`
WHERE field = 0;
Is there an easier way to get the result so that even if there are no false values I will still get:
----------
|yes | 3 |
|no | 0 |
----------
One way would be to outer join onto a lookup table. So, create a lookup table that maps field values to names:
create table field_lookup (
    field int,
    description varchar(3)
)
and populate it
insert into field_lookup values (0, 'no')
insert into field_lookup values (1, 'yes')
now the next bit depends on your SQL vendor, the following has some Sybase (or SQL Server) specific bits (the outer join syntax and isnull to convert nulls to zero):
select description, isnull(num,0)
from (select field, count(*) num from `table` group by field) d, field_lookup fl
where d.field =* fl.field
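Note that =* is the legacy outer-join syntax, which newer SQL Server versions reject; the ANSI equivalent would be something like this (a sketch, untested):
select fl.description, coalesce(d.num, 0)
from field_lookup fl
left join (select field, count(*) num from `table` group by field) d
       on d.field = fl.field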
You are on the right track, but the first answer will not be correct. Here is a solution that will give you Yes and No even if there is no "No" in the table:
SELECT 'Yes', (SELECT COUNT(*) FROM Tablename WHERE Field <> 0)
UNION ALL
SELECT 'No', (SELECT COUNT(*) FROM tablename WHERE Field = 0)
Be aware that I've checked Yes as <> 0, because some front-end systems that use SQL Server as a backend use -1 and 1 as yes.
This will result in two columns:
SELECT SUM(field) AS yes, COUNT(*) - SUM(field) AS no FROM `table`
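If you need the two-row shape shown in the question, the same sums can be stacked (a sketch; the COALESCE guards against SUM returning NULL on an empty table):
SELECT 'yes' AS result, COALESCE(SUM(field), 0) AS num FROM `table`
UNION ALL
SELECT 'no', COUNT(*) - COALESCE(SUM(field), 0) FROM `table`;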
Because there aren't any existing rows for false, if you want to see a summary value for it you need to LEFT JOIN to a table or derived table/inline view that provides one. Assuming there's no TYPE_CODES table to look up the values, use:
SELECT x.desc_value AS result,
       COALESCE(COUNT(t.field), 0) AS num
FROM (SELECT 1 AS value, 'yes' AS desc_value
      UNION ALL
      SELECT 0, 'no') x
LEFT JOIN `table` t ON t.field = x.value
GROUP BY x.desc_value
SELECT COUNT(*) count, field FROM table GROUP BY field;
Not exactly the same output format, but it's the same data you get back.
If one of the values has no rows, you won't get that row back, but that should be easy enough to check for in your code.