SQL - most efficient way to find if a pair of rows does NOT exist

I can't seem to find a similar situation to mine online. I have a table for orders, called 'Order', and a table for details on those orders, called 'order detail'. A certain type of order is defined by having one of two pairs of order details (Value-Unit pairs). So, my order detail table might look like this:
order_id | detail
---------|-------
1 | X
1 | Y
1 | Z
2 | X
2 | Z
2 | B
3 | A
3 | Z
3 | B
The two pairs that go together are (X & Y) and (A & B). What is an efficient way of retrieving only those order_ids that DO NOT contain either one of these pairs? e.g. For the above table, I need to receive only the order_id 2.
The only solution I can come up with is essentially to use two queries and perform a self join:
select distinct o.order_id
from orders o
where o.order_id not in (
    select od1.order_id
    from order_detail od1
    join order_detail od2 on od2.order_id = od1.order_id and od2.detail = 'Y'
    where od1.detail = 'X'
)
and o.order_id not in (
    select od1.order_id
    from order_detail od1
    join order_detail od2 on od2.order_id = od1.order_id and od2.detail = 'B'
    where od1.detail = 'A'
)
The problem is that performance is an issue, my order_detail table is HUGE, and I am quite inexperienced in query languages. Is there a faster way to do this with a lower cardinality? I also have zero control over the schema of the tables, so I can't change anything there.

First and foremost I'd like to emphasise that finding the most efficient query is a combination of a good query and a good index. Far too often I see questions here where people look for magic to happen in only one or the other.
E.g. of a variety of solutions, yours is the slowest (after fixing syntax errors) when there are no indexes, but it is quite a bit better with an index on (detail, order_id).
Please also note that you have the actual data and table structures. You'll need to experiment with various combinations of queries and indexes to find what works best; not least because you haven't indicated what platform you're using and results are likely to vary between platforms.
[/rant-off]
Query
Without further ado, Gordon Linoff has provided some good suggestions. There's another option likely to offer similar performance. You said you can't control the schema, but you can use a sub-query to transform the data into a 'friendlier' structure.
Specifically, if you:
pivot the data so you have a row per order_id
and columns for each detail you want to check
and the intersection is a count of how many rows for that order have that detail...
Then your query is simply: where (x=0 or y=0) and (a=0 or b=0). The following uses SQL Server's temporary tables to demonstrate with sample data. The queries below work regardless of duplicate (id, val) pairs.
/*Set up sample data*/
create table #t (
id int,
val char(1)
)
insert #t(id, val)
values (1, 'x'), (1, 'y'), (1, 'z'),
(2, 'x'), (2, 'z'), (2, 'b'),
(3, 'a'), (3, 'z'), (3, 'b')
/*Option 1 manual pivoting*/
select t.id
from (
select o.id,
sum(case when o.val = 'a' then 1 else 0 end) as a,
sum(case when o.val = 'b' then 1 else 0 end) as b,
sum(case when o.val = 'x' then 1 else 0 end) as x,
sum(case when o.val = 'y' then 1 else 0 end) as y
from #t o
group by o.id
) t
where (x = 0 or y = 0) and (a = 0 or b = 0)
/*Option 2 using Sql Server PIVOT feature*/
select t.id
from (
select id ,[a],[b],[x],[y]
from (select id, val from #t) src
pivot (count(val) for val in ([a],[b],[x],[y])) pvt
) t
where (x = 0 or y = 0) and (a = 0 or b = 0)
It's interesting to note that the query plans for options 1 and 2 above are slightly different. This suggests the possibility of different performance characteristics over large data sets.
Indexes
Note that the above will likely process the whole table. So there is little to be gained from indexes. However, if the table has "long rows", an index on only the 2 columns you're working with means that less data needs to be read from disk.
The query structure you provided is likely to benefit from an index such as (detail, order_id). This is because the server can more efficiently check the NOT IN sub-query conditions. How beneficial it is will depend on the distribution of data in your table.
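For example, such an index could be created along these lines (the index name is made up; the syntax shown is SQL Server, so adjust for your platform):
/* Hypothetical covering index for the order_detail lookups */
create index ix_order_detail_detail_order_id
on order_detail (detail, order_id);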
As a side note I tested various query options including a fixed version of yours and Gordon's. (Only a small data size though.)
Without the above index, your query was slowest in the batch.
With the above index, Gordon's second query was slowest.
Alternative Queries
Your query (fixed):
select distinct o.id
from #t o
where o.id not in (
select od1.id
from #t od1
inner join #t od2 on
od2.id = od1.id
and od2.val='y'
where od1.val= 'x'
)
and o.id not in (
select od1.id
from #t od1
inner join #t od2 on
od2.id = od1.id
and od2.val='a'
where od1.val= 'b'
)
A mixture of Gordon's first and second queries; it fixes the duplicate issue in the first and the performance issue in the second:
select id
from #t od
group by id
having ( sum(case when val in ('X') then 1 else 0 end) = 0
or sum(case when val in ('Y') then 1 else 0 end) = 0
)
and( sum(case when val in ('A') then 1 else 0 end) = 0
or sum(case when val in ('B') then 1 else 0 end) = 0
)
Using INTERSECT and EXCEPT:
select id
from #t
except
(
select id
from #t
where val = 'a'
intersect
select id
from #t
where val = 'b'
)
except
(
select id
from #t
where val = 'x'
intersect
select id
from #t
where val = 'y'
)

I would use aggregation and having:
select order_id
from order_detail od
group by order_id
having sum(case when detail in ('X', 'Y') then 1 else 0 end) < 2 and
sum(case when detail in ('A', 'B') then 1 else 0 end) < 2;
This assumes that orders do not have duplicate rows with the same detail. If that is possible:
select order_id
from order_detail od
group by order_id
having count(distinct case when detail in ('X', 'Y') then detail end) < 2 and
count(distinct case when detail in ('A', 'B') then detail end) < 2;

Related

SQL Query to Update Object Count Based on Event Name

Imagine I am an owner of many bookstores. I keep a database of all events that occur in all of my many bookstores. Two events of note are "Book Added" and "Book Removed", for when a book is added to the inventory of a store, and when it is sold from a store. An example schema would be bookstore_id, event_name, time.
Now say I have a second table, which maintains the current state of each bookstore, so the schema would be bookstore_id, num_books.
I want to be able to use the first table to get the count of all the "Book Added" events per bookstore, subtract the count of all the "Book Removed" events per bookstore, and then update the number of books in each bookstore in the second table.
The only way I can think to do it requires using a cursor, but I'm assuming there's a more "SQL-esque" way to do it that is more set-based and doesn't require a cursor.
You can count the events by using a GROUP BY clause.
If we create two subqueries, one counting the added books and one counting the removed books, we can simply subtract the results and update them in the parent table. This will look like:
UPDATE b
SET b.numbooks = AddedBooks.BooksAdded - RemovedBooks.BooksRemoved
FROM dbo.Books b
INNER JOIN (SELECT be.book_id, count(*) AS BooksAdded
FROM dbo.BookEvents be
WHERE be.event = 'BookAdded'
GROUP BY be.book_id, be.event) AS AddedBooks
ON b.bookid = AddedBooks.book_id
INNER JOIN (SELECT be.book_id, count(*) AS BooksRemoved
FROM dbo.BookEvents be
WHERE be.event = 'BookRemoved'
GROUP BY be.book_id, be.event) AS RemovedBooks
ON b.bookid = RemovedBooks.book_id
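If a store can have added books but no removed books, the second INNER JOIN would drop it from the update; a variant with a LEFT JOIN and COALESCE (a sketch reusing the same table and column names) keeps those rows:
UPDATE b
SET b.numbooks = AddedBooks.BooksAdded - COALESCE(RemovedBooks.BooksRemoved, 0)
FROM dbo.Books b
INNER JOIN (SELECT be.book_id, count(*) AS BooksAdded
            FROM dbo.BookEvents be
            WHERE be.event = 'BookAdded'
            GROUP BY be.book_id) AS AddedBooks
    ON b.bookid = AddedBooks.book_id
LEFT JOIN (SELECT be.book_id, count(*) AS BooksRemoved
           FROM dbo.BookEvents be
           WHERE be.event = 'BookRemoved'
           GROUP BY be.book_id) AS RemovedBooks
    ON b.bookid = RemovedBooks.book_id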
select bookstore_id
, sum(case when event_name = 'Book Removed' then -1 else 1 end) as "num books"
from bookstores
group by bookstore_id
If there are more than 2 event types:
select bookstore_id
, sum(case when event_name = 'Book Removed' then -1
when event_name = 'Book Added' then 1
end) as "num books"
from bookstores
group by bookstore_id
And I would just make it a view unless you come up with performance issues.
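For example (a sketch assuming SQL Server syntax; the view name is made up):
create view dbo.BookstoreInventory
as
select bookstore_id
, sum(case when event_name = 'Book Removed' then -1
           when event_name = 'Book Added' then 1
      end) as num_books
from bookstores
group by bookstore_id
Querying the view then always reflects the current event log, at the cost of re-aggregating on every read.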
We can use CTEs to get details individually and process them.
With CTE_Add AS
(
Select Bkstr_ID, Count(event_Name) As Added From temp Where event_Name = 'Added' Group by Bkstr_ID
), CTE_Rem As
(
Select Bkstr_ID, Count(event_Name) As Removed From temp Where event_Name = 'Removed' Group by Bkstr_ID
)
Select A.Bkstr_ID, Added - Coalesce(Removed, 0)
From CTE_Add A
Left Join CTE_Rem R On A.Bkstr_ID = R.Bkstr_ID
This will give you ID and count.
Instead of a select, you can use an insert statement.
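For example, a sketch that writes the CTE results into a counts table instead (the bookstore_counts table and its columns are assumptions; the CTEs and the temp table come from the code above):
With CTE_Add AS
(
Select Bkstr_ID, Count(event_Name) As Added From temp Where event_Name = 'Added' Group by Bkstr_ID
), CTE_Rem As
(
Select Bkstr_ID, Count(event_Name) As Removed From temp Where event_Name = 'Removed' Group by Bkstr_ID
)
Insert Into bookstore_counts (bookstore_id, num_books)
Select A.Bkstr_ID, A.Added - Coalesce(R.Removed, 0)
From CTE_Add A
Left Join CTE_Rem R On A.Bkstr_ID = R.Bkstr_ID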
I'd use SUM(CASE WHEN ...). Below is an example.
If object_id('tempdb..#BoookStores') Is Not Null Drop Table #BoookStores
Create Table #BoookStores (bookstore_id int, num_books int)
/* We have 3 stores */
Insert #BoookStores (bookstore_id, num_books)
Values (1, 0), (2, 0), (3, 0)
If object_id('tempdb..#Events') Is Not Null Drop Table #Events
Create Table #Events (bookstore_id int, event_name varchar(10), time dateTime Default(GetDate()) )
Insert #Events (bookstore_id, event_name)
Values
(1, 'Added'), (1, 'Added'), (1, 'Added'), (1, 'Added'), -- Added 4 books to 1. store
(2, 'Added'), (2, 'Added'), (2, 'Added'), -- Added 3 books to 2. store
(3, 'Added'), (3, 'Added'), -- Added 2 books to 3. store
/* removed 2 books from each stores */
(1, 'Removed'), (1, 'Removed'),
(2, 'Removed'), (2, 'Removed'),
(3, 'Removed'), (3, 'Removed')
/* Calculate adds and removes. Update the results */
;With Tmp As (
Select E.bookstore_id,
Sum(Case When E.event_name = 'Added' Then 1 Else 0 End) As AddCount,
Sum(Case When E.event_name = 'Removed' Then 1 Else 0 End) As RemoveCount
From #Events E
Group By E.bookstore_id
)
Update BS Set num_books = T.AddCount-T.RemoveCount
From #BoookStores BS
Inner Join Tmp T On T.bookstore_id = BS.bookstore_id
/* check results*/
Select * From #BoookStores BS
Something like this will get you in the ball park. Similar logic could be used for INSERT.
UPDATE TableA
SET TableA.num_books = TableB.num_books
FROM secondTable AS TableA
INNER JOIN (
    SELECT bookstore_id,
        SUM(CASE WHEN event_name = 'Book Added' THEN 1 ELSE 0 END)
        - SUM(CASE WHEN event_name = 'Book Removed' THEN 1 ELSE 0 END) AS num_books
    FROM firstTable
    GROUP BY bookstore_id
) TableB ON TableA.bookstore_id = TableB.bookstore_id
You can try a query like below:
update t1
set num_books=inventory
FROM bs t1 LEFT JOIN
(select bookstore_id,SUM(case when event_name like 'A' then 1 when event_name like 'R' then -1 else NULL end) as inventory
from bse
group by bookstore_id) t2
on t1.bookstore_id=t2.bookstore_id
UPDATE bsc
SET bsc.num_books = bse.num_books
FROM bookstorecounts bsc
JOIN (SELECT bookstore_id,
SUM(CASE event_name
WHEN 'Book Removed' THEN -1
WHEN 'Book Added' THEN 1
END) AS num_books
FROM bookstoreevents
GROUP BY bookstore_id
) bse ON bsc.bookstore_id = bse.bookstore_id

2 Rows to 1 Row - Nested Query

I have a response column that stores 2 different values for the same product, based on question 1 and question 2. That creates 2 rows for each product, but I want only one row for each product.
Example:
select Product, XNumber from MyTable where QuestionID IN ('Q1','Q2')
result shows:
Product XNumber
Bat abc
Bat abc12
I want it to display like below:
Product Xnumber1 Xnumber2
Bat abc abc12
Please help.
Thanks.
If you always have two different values you can try this:
SELECT a.Product, a.XNumber as XNumber1, b.XNumber as XNumber2
FROM MyTable a
INNER JOIN MyTable b
ON a.Product = b.Product
WHERE a.QuestionId = 'Q1'
AND b.QuestionId = 'Q2'
I assume that XNumber1 is the result for Q1 and XNumber2 is the result for Q2.
This will work even if you don't have answers for both Q1 and Q2 for all products:
SELECT a.Product, b.XNumber as XNumber1, c.XNumber as XNumber2
FROM (SELECT DISTINCT Product FROM MyTable) a
LEFT JOIN MyTable b ON a.Product = b.Product AND b.QuestionID = 'Q1'
LEFT JOIN MyTable c ON a.Product = c.Product AND c.QuestionID = 'Q2'
This is one way to achieve your expected results. However, it relies on knowing that only the xNumber values abc and abc12 exist. If this is not the case, then a dynamic pivot would likely be needed.
SELECT product, max(case when XNumber = 'abc' then xNumber end) as XNumber1,
max(Case when xNumber = 'abc12' then xNumber end) as xNumber2
FROM MyTable
GROUP BY Product
The problem is that SQL needs to know how many columns will be in the result at the time it compiles the SQL. Since the number of columns could depend on the data itself (2 rows vs 5 rows), it can't complete the request. Using dynamic SQL you can first find out how many values there are, then pass those values in as the column names, which is why the dynamic SQL approach works.
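For reference, a dynamic pivot might look like the sketch below (assuming SQL Server and the MyTable / Product / XNumber names from the question; note that the generated columns end up named after the values themselves rather than XNumber1/XNumber2):
declare @cols nvarchar(max), @sql nvarchar(max);
/* Build the pivot column list from the distinct values, e.g. [abc],[abc12] */
select @cols = stuff((select distinct ',' + quotename(XNumber)
                      from MyTable
                      for xml path('')), 1, 1, '');
set @sql = N'select Product, ' + @cols + N'
from (select Product, XNumber, XNumber as XValue from MyTable) src
pivot (max(XValue) for XNumber in (' + @cols + N')) pvt;';
exec sp_executesql @sql;
To get positional XNumber1/XNumber2 headings instead, the same idea can be combined with the rank()/row_number() approach shown further down.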
This will get you two columns, the first will be the product, and the 2nd will be a comma delimited list of xNumbers.
SELECT DISTINCT T.Product,
xNumbers = Stuff((SELECT DISTINCT ', ' + T1.XNumber
FROM MyTable T1
WHERE t.Product = T1.Product
FOR XML PATH ('')),1,2,'')
FROM MyTable T
To get what you want, we need to know how many columns there will be, what to name them, and how to determine which value goes into which column.
Been using rank() a lot in code we have been working on at my day job, so this fun variant came to mind for your solution.
It uses rank() to get the 1st, 2nd, and 3rd possible item identifiers, then groups them to create a simulated pivot:
DECLARE #T TABLE (PRODUCT VARCHAR(50), XNumber VARCHAR(50))
INSERT INTO #T VALUES
('Bat','0-12345-98765-6'),
('Bat','0-12345-98767-2'),
('Bat','0-12345-98768-1'),
('Ball','0-12345-98771-6'),
('Ball','0-12345-98772-7'),
('Ball','0-12345-98777-9'),
('Hat','0-12345-98711-6'),
('Hat','0-12345-98712-3'),
('Tee','0-12345-98465-1')
SELECT
PRODUCT,
MAX(CASE WHEN I = 1 THEN XNumber ELSE '' END) AS Xnumber1,
MAX(CASE WHEN I = 2 THEN XNumber ELSE '' END) AS Xnumber2,
MAX(CASE WHEN I = 3 THEN XNumber ELSE '' END) AS Xnumber3
FROM
(
SELECT
PRODUCT,
XNumber,
RANK() OVER(PARTITION BY PRODUCT ORDER BY XNumber) AS I
FROM #T
) AS DATA
GROUP BY
PRODUCT

sql select same column multiple times (using online oracle explorer)

I have exploratory access to a db using the online explorer (Oracle); only 1000 records are returned at a time.
Thus, I need to make sure that the data returned contains a sufficient subset.
As a basic example, take tableA with columns:
id chartid
1 a
1 b
1 c
2 d
2 b
select id
from tableA
where chartid in ('a','d')
and chartid in ('b')
and chartid in ('c')
should return
id
1
I want to make sure that the ids I then run my query on contain enough of the needed data (otherwise, it's sparse).
-- thank you
How is this done in general?
Is this a limitation of the explorer/online interface?
This is an example of a "set-within-sets" query. I like to approach these using group by and having:
select id
from TableA
group by id
having sum(case when chartid in ('a', 'd') then 1 else 0 end) > 0 and
sum(case when chartid in ('b') then 1 else 0 end) > 0 and
sum(case when chartid in ('c') then 1 else 0 end) > 0;
Each condition in the having clause counts the number of rows, for a given id, that match each condition. The > 0 guarantees that at least one row exists that meets each of the conditions.
EDIT:
I'm not actually sure what your conditions are, because it may not be that you need to meet all three. If you want to meet just one of them, then use or. If you want two out of the three, then use:
having (max(case when chartid in ('a', 'd') then 1 else 0 end) +
max(case when chartid in ('b') then 1 else 0 end) +
max(case when chartid in ('c') then 1 else 0 end)
) >= 2;
The answer Gordon Linoff provided is probably the best, but just for reference a general approach to this kind of problem would be to break it down into smaller parts (i.e. form multiple sets) and then join them. Something looking like this:
select t1.id
from tableA t1
inner join tableA t2 on t1.id = t2.id
inner join tableA t3 on t1.id = t3.id
where
t1.chartid in ('a','d')
and t2.chartid = 'b'
and t3.chartid = 'c'
This query would form three distinct sets and then return the intersect of them. At least for me, thinking in terms of sets often is helpful when doing relational database queries (although the solution might not be the most efficient).

How do I determine if a group of data exists in a table, given the data that should appear in the group's rows?

I am writing data to a table and allocating a "group-id" for each batch of data that is written. To illustrate, consider the following table.
GroupId Value
------- -----
1 a
1 b
1 c
2 a
2 b
3 a
3 b
3 c
3 d
In this example, there are three groups of data, each with similar but varying values.
How do I query this table to find a group that contains a given set of values? For instance, if I query for (a,b,c) the result should be group 1. Similarly, a query for (b,a) should result in group 2, and a query for (a, b, c, e) should result in the empty set.
I can write a stored procedure that performs the following steps:
select distinct GroupId from Groups -- and store locally
for each distinct GroupId: perform a set-difference (except) between the input and table values (for the group), and vice versa
return the GroupId if both set-difference operations produced empty sets
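For concreteness, that per-group double set-difference can be written with EXCEPT, something like the sketch below (SQL Server syntax; the @input table variable holding the values to check is made up):
/* Hypothetical input set of values to match against */
declare @input table (Value varchar(10));
insert @input (Value) values ('a'), ('b'), ('c');

select g.GroupId
from (select distinct GroupId from Groups) g
where not exists (select Value from Groups where GroupId = g.GroupId
                  except
                  select Value from @input)
  and not exists (select Value from @input
                  except
                  select Value from Groups where GroupId = g.GroupId);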
This seems a bit excessive, and I'm hoping to leverage some other commands in SQL to simplify. Is there a simpler way to perform a set-comparison in this context, or to select the group ID that contains the exact input values for the query?
This is a set-within-sets query. I like to solve it using group by and having:
select groupid
from GroupValues gv
group by groupid
having sum(case when value = 'a' then 1 else 0 end) > 0 and
sum(case when value = 'b' then 1 else 0 end) > 0 and
sum(case when value = 'c' then 1 else 0 end) > 0 and
sum(case when value not in ('a', 'b', 'c') then 1 else 0 end) = 0;
The first three conditions in the having clause check that each element exists. The last condition checks that there are no other values. This method is quite flexible for various exclusion and inclusion conditions on the values you are looking for.
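For example, to require 'a' and 'b', forbid 'd', and allow anything else, only the conditions change (same table and columns as above):
select groupid
from GroupValues gv
group by groupid
having sum(case when value = 'a' then 1 else 0 end) > 0 and
       sum(case when value = 'b' then 1 else 0 end) > 0 and
       sum(case when value = 'd' then 1 else 0 end) = 0;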
EDIT:
If you want to pass in a list, you can use:
with thelist as (
select 'a' as value union all
select 'b' union all
select 'c'
)
select groupid
from GroupValues gv left outer join
thelist
on gv.value = thelist.value
group by groupid
having count(distinct gv.value) = (select count(*) from thelist) and
count(distinct (case when gv.value = thelist.value then gv.value end)) = count(distinct gv.value);
Here the having clause counts the number of matching values and makes sure that this is the same size as the list.
EDIT:
The query failed to compile because of a missing table alias; it has been updated with the right table alias.
This is kind of ugly, but it works. On larger datasets I'm not sure what performance would look like, but the nested instances of #GroupValues key off GroupID in the main table so I think as long as you have a good index on GroupID it probably wouldn't be too horrible.
If Object_ID('tempdb..#GroupValues') Is Not Null Drop Table #GroupValues
Create Table #GroupValues (GroupID Int, Val Varchar(10));
Insert #GroupValues (GroupID, Val)
Values (1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(3,'a'),(3,'b'),(3,'c'),(3,'d');
If Object_ID('tempdb..#FindValues') Is Not Null Drop Table #FindValues
Create Table #FindValues (Val Varchar(10));
Insert #FindValues (Val)
Values ('a'),('b'),('c');
Select Distinct gv.GroupID
From (Select Distinct GroupID
From #GroupValues) gv
Where Not Exists (Select 1
From #FindValues fv2
Where Not Exists (Select 1
From #GroupValues gv2
Where gv.GroupID = gv2.GroupID
And fv2.Val = gv2.Val))
And Not Exists (Select 1
From #GroupValues gv3
Where gv3.GroupID = gv.GroupID
And Not Exists (Select 1
From #FindValues fv3
Where gv3.Val = fv3.Val))

SQL statement for maximum common element in a set

I have a table like
id contact value
1 A 2
2 A 3
3 B 2
4 B 3
5 B 4
6 C 2
Now I would like to get the maximum common value for a given set of contacts.
For example:
if my contact set was {A,B} it would return 3;
for the set {A,C} it would return 2
for the set {B} it would return 4
What SQL statement(s) can do this?
Try this:
SELECT value, count(distinct contact) as cnt
FROM my_table
WHERE contact IN ('A', 'C')
GROUP BY value
HAVING cnt = 2
ORDER BY value DESC
LIMIT 1
This is MySQL syntax; it may differ for your database. The number (2) in the HAVING clause is the number of elements in the set.
SELECT max(value) FROM table WHERE contact IN ('A', 'C')
Edit: max common
declare #contacts table ( contact nchar(10) )
insert into #contacts values ('a')
insert into #contacts values ('b')
select MAX(value)
from MyTable
where (select COUNT(*) from #contacts) =
(select COUNT(*)
from MyTable t
join #contacts c on c.contact = t.contact
where t.value = MyTable.value)
Most will tell you to use:
SELECT MAX(t.value)
FROM TABLE t
WHERE t.contact IN ('A', 'C')
GROUP BY t.value
HAVING COUNT(DISTINCT t.contact) = 2
Couple of caveats:
The DISTINCT is key, otherwise you could have two rows of t.contact = 'A'.
The COUNT(DISTINCT t.contact) has to equal the number of values specified in the IN clause.
My preference is to use JOINs:
SELECT MAX(t.value)
FROM TABLE t
JOIN TABLE t2 ON t2.value = t.value AND t2.contact = 'C'
WHERE t.contact = 'A'
The downside to this is that you have to do a self join (join to the same table) for every criterion (contact value in this case).
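For example, the set {A, B, C} would need one more self join (keeping the placeholder TABLE name from above):
SELECT MAX(t.value)
FROM TABLE t
JOIN TABLE t2 ON t2.value = t.value AND t2.contact = 'B'
JOIN TABLE t3 ON t3.value = t.value AND t3.contact = 'C'
WHERE t.contact = 'A'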