Best way to filter union of data from 2 tables by value in shared 3rd table - sql

For the sake of example, let's assume 3 tables:
PHYSICAL_ITEM: ID, SELLER_ID, NAME, COST, DIMENSIONS, WEIGHT
DIGITAL_ITEM: ID, SELLER_ID, NAME, COST, DOWNLOAD_PATH
SELLER: ID, NAME
Item IDs are guaranteed unique across both item tables. I want to select, in order, with a type label, all item IDs for a given seller. I've come up with:
Query A
SELECT PI.ID AS ID, 'PHYSICAL' AS TYPE
FROM PHYSICAL_ITEM PI
JOIN SELLER S ON PI.SELLER_ID = S.ID
WHERE S.NAME = 'name'
UNION
SELECT DI.ID AS ID, 'DIGITAL' AS TYPE
FROM DIGITAL_ITEM DI
JOIN SELLER S ON DI.SELLER_ID = S.ID
WHERE S.NAME = 'name'
ORDER BY ID
Query B
SELECT ITEM.ID, ITEM.TYPE
FROM (SELECT ID, SELLER_ID, 'PHYSICAL' AS TYPE
FROM PHYSICAL_ITEM
UNION
SELECT ID, SELLER_ID, 'DIGITAL' AS TYPE
FROM DIGITAL_ITEM) AS ITEM
JOIN SELLER ON ITEM.SELLER_ID = SELLER.ID
WHERE SELLER.NAME = 'name'
ORDER BY ITEM.ID
Query A seems like it would be the most efficient, but it also looks unnecessarily duplicative (2 table joins to the same table, 2 where clauses on the same table column). Query B looks cleaner in a way to me (no duplication), but it also looks much less efficient, since it has a subquery. Is there a way to get the best of both worlds, so to speak?

In both cases, replace the union with union all. Union incurs the overhead of removing duplicates, which is unnecessary here because the item IDs are unique across both tables.
I would expect Query A to be more efficient, because the optimizer has more information when doing the join (although I think Oracle is pretty good at using indexes even after a union). In addition, the first query reduces the amount of data before the union.
This is, however, only an opinion. The real test is to time the two queries, multiple times to avoid cache-fill delays, and see which is better.
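As a rough illustration of the advice above, here is a minimal sketch that runs both query shapes with UNION ALL and checks that they return the same rows. It uses Python's built-in sqlite3 module only so that it can be run as-is; the sample sellers and items are made up, and the tables carry only the columns the queries need. It says nothing about relative performance on a real database, which still has to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SELLER (ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE PHYSICAL_ITEM (ID INTEGER PRIMARY KEY, SELLER_ID INTEGER, NAME TEXT);
    CREATE TABLE DIGITAL_ITEM (ID INTEGER PRIMARY KEY, SELLER_ID INTEGER, NAME TEXT);
    -- Hypothetical sample data; item IDs are unique across both item tables.
    INSERT INTO SELLER VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO PHYSICAL_ITEM VALUES (10, 1, 'desk'), (11, 2, 'lamp');
    INSERT INTO DIGITAL_ITEM VALUES (20, 1, 'ebook'), (21, 2, 'song');
""")

# Query A: filter each branch by seller first, then combine with UNION ALL.
query_a = """
    SELECT PI.ID AS ID, 'PHYSICAL' AS TYPE
    FROM PHYSICAL_ITEM PI
    JOIN SELLER S ON PI.SELLER_ID = S.ID
    WHERE S.NAME = :name
    UNION ALL
    SELECT DI.ID AS ID, 'DIGITAL' AS TYPE
    FROM DIGITAL_ITEM DI
    JOIN SELLER S ON DI.SELLER_ID = S.ID
    WHERE S.NAME = :name
    ORDER BY ID
"""

# Query B: combine both item tables first, then join and filter once.
query_b = """
    SELECT ITEM.ID, ITEM.TYPE
    FROM (SELECT ID, SELLER_ID, 'PHYSICAL' AS TYPE FROM PHYSICAL_ITEM
          UNION ALL
          SELECT ID, SELLER_ID, 'DIGITAL' AS TYPE FROM DIGITAL_ITEM) AS ITEM
    JOIN SELLER ON ITEM.SELLER_ID = SELLER.ID
    WHERE SELLER.NAME = :name
    ORDER BY ITEM.ID
"""

rows_a = conn.execute(query_a, {"name": "alice"}).fetchall()
rows_b = conn.execute(query_b, {"name": "alice"}).fetchall()
print(rows_a)
print(rows_a == rows_b)
```

Because the type label differs between the two branches and the IDs never collide, UNION ALL cannot introduce duplicates here, so dropping the duplicate-removal step does not change the result.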

Related

exists(A) and not exists(negA) vs custom aggregation

Many times, I have to select the customers that have made {criteria set A} of transactions and not any OTHER type of transactions. Sample data:
create table customer (name nvarchar(max))
insert customer values
('George'),
('Jack'),
('Leopold'),
('Averel')
create table trn (id int,customer nvarchar(max),product char(1))
insert trn values
(1,'George','A'),
(2,'George','B'),
(3,'Jack','B'),
(4,'Leopold','A')
Let's say we want to find all customers who bought product 'A' and not anything else (in this case, B).
The most typical way to do this includes joining the transaction table with itself:
select * from customer c
where exists(select 1 from trn p where p.customer=c.name and product='A')
and not exists(select 1 from trn n where n.customer=c.name and product='B')
I was wondering if there is a better way to do this. Keep in mind that the transaction table should typically be huge.
What about this alternative:
select * from customer c
where exists
(
select 1
from trn p
where p.customer=c.name
group by p.customer
having max(case when product='B' then 2 when product='A' then 1 else 0 end)=1
)
Will the fact that the transaction table is used only once offset the aggregation calculation needed?
You need to test performance on your data. If you have an index on trn(customer, product), then the exists would generally have very reasonable performance.
This is particularly true when you are using the customers table.
How well does the aggregation version compare? First, the best aggregation would be:
select customer
from trn
where product in ('A', 'B')
group by customer
having min(product) = 'A' and max(product) = 'A';
If you have an index on product -- and there are lots of products (or few customers that have "a" and "b"), then this can be faster than the not exists version.
In general, I advocate using the group by, even though its performance is not always the best when only a couple of products are involved. Why?
- The having clause is quite flexible for handling all kinds of "set-within-set" conditions.
- Adding additional conditions doesn't have a large effect on performance.
- If you are not using a customer table but instead doing something like (select distinct customer from trn), then the exists/not exists version is likely to be more expensive.
That said, I advocate group by and having for their flexibility, not because they are always the fastest. Under the right circumstances, other solutions should be used.
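To make the comparison concrete, the following sketch runs the exists/not exists form and an aggregation form against the sample data from the question and checks that both pick out the customers who bought 'A' and never 'B'. It uses Python's built-in sqlite3 module purely as a convenient test harness (the question targets a different database), the column types are simplified to TEXT, and the index name is an assumption. The HAVING clause requires both the minimum and the maximum product to be 'A', which is one way of saying "only A".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (name TEXT);
    INSERT INTO customer VALUES ('George'), ('Jack'), ('Leopold'), ('Averel');
    CREATE TABLE trn (id INTEGER, customer TEXT, product TEXT);
    INSERT INTO trn VALUES
        (1, 'George', 'A'),
        (2, 'George', 'B'),
        (3, 'Jack', 'B'),
        (4, 'Leopold', 'A');
    -- Index suggested in the answer; the name is hypothetical.
    CREATE INDEX trn_customer_product ON trn (customer, product);
""")

# Correlated subqueries: the customer has an 'A' row and no 'B' row.
exists_sql = """
    SELECT c.name
    FROM customer c
    WHERE EXISTS (SELECT 1 FROM trn p WHERE p.customer = c.name AND p.product = 'A')
      AND NOT EXISTS (SELECT 1 FROM trn n WHERE n.customer = c.name AND n.product = 'B')
    ORDER BY c.name
"""

# Aggregation: among the 'A'/'B' rows of a customer, every product is 'A'.
group_sql = """
    SELECT customer
    FROM trn
    WHERE product IN ('A', 'B')
    GROUP BY customer
    HAVING MIN(product) = 'A' AND MAX(product) = 'A'
    ORDER BY customer
"""

exists_rows = conn.execute(exists_sql).fetchall()
group_rows = conn.execute(group_sql).fetchall()
print(exists_rows, group_rows)
```

On this data both forms return only Leopold: George also bought 'B', Jack never bought 'A', and Averel has no transactions at all.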
You could try the following statement. It may be faster than your statements under certain circumstances, since it always determines the customers with product A transactions first and then checks, only for those customers, whether there are transactions for other products. Whether there is any benefit at all depends on the data and indexes of your real tables, so you have to try.
WITH customerA AS (SELECT DISTINCT customer FROM trn WHERE product = 'A')
SELECT DISTINCT customer.*
FROM customerA JOIN customer ON customerA.customer = customer.name
WHERE not exists(select 1 from trn n where n.customer = customerA.customer and
product <> 'A')

Should I JOIN or should I UNION

I have four different tables I am trying to query. The first table is where I will be doing most of the querying, but if there is no match in Cars, I need to look at other fields in the other tables to see if there is a match for a VIN parameter.
Example:
Select
c.id,
c.VIN,
c.uniqueNumber,
c.anotheruniqueNumber
FROM Cars c, Boat b
WHERE
c.VIN = #VIN(parameter),
b.SerialNumber = #VIN
Now say that I have no match in Cars, but there is a match in Boat. How would I be able to pull the matching Boat record instead of the car record? I have tried to JOIN the tables, but the tables have no unique identifier to reference the other table.
I am trying to figure out the best way to search all the tables by one parameter with the least amount of code. I thought about using UNION ALL, but I am not sure that is what I really want for this situation, seeing as the number of records could get extremely large.
I am currently using SQL Server 2012. Thanks in advance!
UPDATED:
CAR table
ID  VIN         UniqueIdentifier  AnotherUniqueIdentifier
1   2002034434  HH54545445        2016-A23
2   2002035555  TT4242424242      2016-A24
3   1999034534  AGH0000034        2016-A25
BOAT table
ID  SerialNumber  Miscellaneous
1   32424234243   545454545445
2   65656565656   FF24242424242
3   20023232323   AGH333333333
Expected Result if #VIN parameter matches a Boat identifier:
BOAT
ID  SerialNumber  Miscellaneous
2   65656565656   FF24242424242
Some sort of union all might be the best approach -- at least the fastest with the right indexes:
Select c.id, c.VIN, c.uniqueNumber, c.anotheruniqueNumber
from Cars c
where c.VIN = #VIN
union all
select b.id, b.VIN, b.uniqueNumber, b.anotheruniqueNumber
from Boats b
where b.VIN = #VIN and
not exists (select 1 from Cars C where c.VIN = #VIN);
This assumes that you have the corresponding columns in each of the tables (which your question implies is true).
The chain of not exists can get longer as you add more entity types. A simple way around this is to sort instead, assuming you want only one row:
select top 1 x.*
from (Select c.id, c.VIN, c.uniqueNumber, c.anotheruniqueNumber, 1 as priority
from Cars c
where c.VIN = #VIN
union all
select b.id, b.VIN, b.uniqueNumber, b.anotheruniqueNumber, 2 as priority
from Boats b
where b.VIN = #VIN
) x
order by priority;
There is a slight overhead for the order by. But frankly speaking, ordering 1-4 rows is trivial from a performance perspective.
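The priority-ordering idea can be sketched as follows. This is an illustration only: it runs on Python's built-in sqlite3 module, so LIMIT 1 stands in for SQL Server's TOP 1, and the helper name find_vehicle, the reduced column set, and boat 4 (added to force a collision with a car VIN) are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cars (id INTEGER, VIN TEXT);
    CREATE TABLE Boats (id INTEGER, SerialNumber TEXT);
    INSERT INTO Cars VALUES (1, '2002034434'), (2, '2002035555'), (3, '1999034534');
    -- Boat 4 is hypothetical: its serial number collides with car 3's VIN.
    INSERT INTO Boats VALUES (1, '32424234243'), (2, '65656565656'),
                             (3, '20023232323'), (4, '1999034534');
""")

def find_vehicle(conn, vin):
    """Return the best match for vin, preferring Cars over Boats, or None."""
    sql = """
        SELECT x.kind, x.id, x.identifier
        FROM (SELECT 'CAR' AS kind, c.id AS id, c.VIN AS identifier, 1 AS priority
              FROM Cars c
              WHERE c.VIN = :vin
              UNION ALL
              SELECT 'BOAT', b.id, b.SerialNumber, 2
              FROM Boats b
              WHERE b.SerialNumber = :vin) x
        ORDER BY x.priority
        LIMIT 1
    """
    return conn.execute(sql, {"vin": vin}).fetchone()

print(find_vehicle(conn, "65656565656"))  # only a boat matches
print(find_vehicle(conn, "1999034534"))   # car and boat match; the car wins
print(find_vehicle(conn, "no-such-id"))   # nothing matches
```

Adding another entity type then only means adding one more UNION ALL branch with the next priority number, rather than lengthening a chain of not exists conditions.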

Multiple Many-to-many bi-directional self-inner-joins without repeating whole query

I have a data model such that items can have many-to-many relationships with other items in the same table using a second table to define relationships. Let's call the primary table items, keyed by item_id and the relationships table item_assoc with columns item_id and other_item_id and assoc_type. Generally, you might use a union to pick up on relationships that may be defined in either direction in the item_assoc table, but you would wind up repeating other parts of the same query just to be sure to pick up associations defined in either direction.
Let's say that you're trying to put together a fairly complex query similar to the following where you want to find a list of items that have related items that COULD have associated cancellation items, but select those that do not have cancellation items:
select
orig.*
from items as orig
join item_assoc as orig2related
on orig.item_id = orig2related.item_id
join items as related
on orig2related.other_item_id = related.item_id
and orig2related.assoc_type = 'Related'
left join item_assoc as related2cancel
on related.item_id = related2cancel.item_id
left join items as cancel
on related2cancel.other_item_id = cancel.item_id
and related2cancel.assoc_type = 'Cancellation'
where cancel.item_id is null
This query obviously only picks up items whose relationships are defined in one direction. For a less complex query, I might solve this by adding a union at the bottom for every permutation of the reverse relationships, but I think that would make the query unnecessarily long and hard to understand.
Is there a way I can define both directions of each relationship without repeating the other parts of the query?
A UNION within item_assoc could help. Assuming your database has no WITH clause, you would have to define a view:
CREATE VIEW bidirec_item_assoc AS
(
SELECT item_id, other_item_id, assoc_type, 1 as direction FROM item_assoc
UNION
SELECT other_item_id, item_id, assoc_type, 2 as direction FROM item_assoc
)
You can now use bidirec_item_assoc in your queries wherever you used item_assoc before.
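A small sketch of how such a view behaves, run on Python's built-in sqlite3 module with two made-up association rows: item 1 points at item 2 in one row, and item 3 points at item 1 in another. Querying the base table by item_id finds only the first; querying the view finds both, without repeating the rest of the query for the reverse direction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_assoc (item_id INTEGER, other_item_id INTEGER, assoc_type TEXT);
    -- Hypothetical data: the 1-2 link is stored forwards, the 3-1 link backwards.
    INSERT INTO item_assoc VALUES (1, 2, 'Related'), (3, 1, 'Related');

    CREATE VIEW bidirec_item_assoc AS
        SELECT item_id, other_item_id, assoc_type, 1 AS direction FROM item_assoc
        UNION
        SELECT other_item_id, item_id, assoc_type, 2 AS direction FROM item_assoc;
""")

one_way = conn.execute("""
    SELECT other_item_id FROM item_assoc
    WHERE item_id = 1 AND assoc_type = 'Related'
    ORDER BY other_item_id
""").fetchall()

both_ways = conn.execute("""
    SELECT other_item_id FROM bidirec_item_assoc
    WHERE item_id = 1 AND assoc_type = 'Related'
    ORDER BY other_item_id
""").fetchall()

print(one_way, both_ways)
```

The direction column keeps the two halves of the view distinguishable, so a query can still tell how a relationship was originally stored if that matters.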
Edited Out: You could add columns for direction and relationtype, of course
Simplify, simplify, simplify: Don't involve tables in the query that aren't needed.
The following query should be equivalent to your sample query and more expressive of your intent:
select i.*
from items i
where not exists ( select *
from item_assoc r
join item_assoc c on c.item_id = r.other_item_id
and c.assoc_type = 'Cancellation'
where r.item_id = i.item_id
and r.assoc_type = 'Related'
)
It should select the set of items that aren't related to an item that has been cancelled. There's no need to join against the items table 3 times.
Further, your original query will have duplicate rows: every row in the first item table (orig) will be duplicated once for every related item.

SQL Join IN Query with AND?

I have the following tables:
Option
-------
id - int
name - varchar
Product
---------
id - int
name -varchar
ProductOptions
------------------
id - int
product_id - int
option_id - int
If I have a list of option ids, how can I retrieve all Products that have all the options in that list? I know that SQL "IN" behaves like an "OR", but I need an "AND". Thank you!
If the option ids in your list are not repeated, count how many ids the list contains (NEEDED below, with OPTIONS standing for the list itself). Then you just run:
SELECT product_id FROM ProductOptions
WHERE option_id IN ( OPTIONS )
GROUP BY product_id
HAVING COUNT(product_id) = NEEDED;
Without the GROUP BY, if you had five option ids, and product 27 had fifteen options among which there were those five, you'd get five rows with the same product_id. The GROUP BY collapses those rows into one. Since you want ALL options, and the options all have different IDs, asking for "products with all of them" is equivalent to asking for "products with as many matching options as the size of the desired option set".
Plus, you run the big query on ProductOptions only, which should be really fast.
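The counting trick can be sketched end to end like this. The example runs on Python's built-in sqlite3 module; the helper name products_with_all_options and the sample rows are made up for illustration. It de-duplicates the requested ids and counts distinct option_id values, so it does not rely on the "ids are not repeated" assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductOptions (id INTEGER PRIMARY KEY, product_id INTEGER, option_id INTEGER);
    -- Hypothetical data: product 27 has options 1-5, product 28 has 1-2, product 29 has 3.
    INSERT INTO ProductOptions (product_id, option_id) VALUES
        (27, 1), (27, 2), (27, 3), (27, 4), (27, 5),
        (28, 1), (28, 2),
        (29, 3);
""")

def products_with_all_options(conn, option_ids):
    """Return the ids of products that have every option in option_ids."""
    wanted = sorted(set(option_ids))  # drop repeated ids so the count is meaningful
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT product_id
        FROM ProductOptions
        WHERE option_id IN ({placeholders})
        GROUP BY product_id
        HAVING COUNT(DISTINCT option_id) = ?
        ORDER BY product_id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

print(products_with_all_options(conn, [1, 2]))
print(products_with_all_options(conn, [1, 2, 3]))
print(products_with_all_options(conn, [1, 9]))
```

Only placeholders are interpolated into the SQL text; the ids themselves are passed as bound parameters.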
One way to approach queries like this is with a group by and having clause. It is best if you start with your list of required options in a list:
with list as (
select <optionname1> as optionname union all
select <optionname2> union all . . .
)
select po.product_id
from list l join
Option o
on l.optionname = o.name join
ProductOptions po
on po.option_id = o.id
group by po.product_id
-- compare against the size of the whole list, not just the rows in this group
having count(distinct o.name) = (select count(distinct optionname) from list)
This guarantees that all are in the list. By the way, I used SQL Server syntax to generate the list.
If you have the list in another format, such as a delimited string, there are other options, and further possibilities depending on the database you are using. However, the above idea should work on any database, with two caveats:
- The with statement might just become a subquery in the FROM clause where "list" is.
- The method for creating the list (a table of constants) varies among databases.
If you have a list of ids, you basically have only 2 options:
- either run one select per id,
- or use IN () or OR.
Using IN is the recommended option, as running a single statement usually performs better (and if you have indexes on all your id columns, no table scan should be required).
I'd use following statement:
select Product.* from Product, Option, ProductOptions where Option.id IN ( 1, 2, ... ) and Option.id = ProductOptions.option_id and ProductOptions.product_id = Product.id
One more remark, why do you have id column in ProductOptions table? It's useless from my point of view, you should rather have composite primary key from columns product_id and option_id (as this couple is unique).
Will this work?:
select p.id, p.name
from Product as p inner join
ProductOptions as po on p.id=po.product_id
where po.option_id in (1,2,3,4)

How to join table in many-to-many relationship?

Here is a simplified version of my problem. I have two tables. Each table has a unique ID field, but it's irrelevant in this case.
shipments has 3 fields: shipment_id, receive_by_datetime, and qty.
deliveries has 4 fields: delivery_id, shipment_id, delivered_on_datetime, and qty.
In shipments, the shipment_id and receive_by_datetime fields always match up. There are many rows in the table that would appear to be duplicates based off of those two columns (but they aren't... other fields are different).
In deliveries, the shipment_id matches up to the shipments table. There are also many rows that would appear to be duplicates based off of the delivery_id and delivered_on_datetime fields (but they aren't again... other fields exist that I didn't list).
I am trying to pull one row per aggregate delivered_on_datetime and receive_by_datetime, but because of the many-to-many relationships, it's difficult. Is a query somewhere along these lines correct?
SELECT d.delivered_on_datetime, s.receive_by_datetime, SUM(d.qty)
FROM deliveries d
LEFT JOIN (
SELECT DISTINCT s1.shipment_id, s1.receive_by_datetime
FROM shipments s1
) s ON (s.shipment_id = d.shipment_id)
GROUP BY d.delivered_on_datetime, s.receive_by_datetime
You will run into problems where the total SUM(d.qty) is larger than the value from SELECT SUM(qty) FROM deliveries.
Something like this might be better suited for you:
-- s.qty is the per-shipment total; it repeats on every delivery row of that shipment
SELECT d.delivered_on_datetime, s.receive_by_datetime, SUM(d.qty) AS delivered_qty, SUM(s.qty) AS shipped_qty
FROM deliveries d
LEFT JOIN (
SELECT s1.shipment_id, s1.receive_by_datetime, SUM(s1.qty) AS qty
FROM shipments s1
GROUP BY s1.shipment_id, s1.receive_by_datetime
) s ON (s.shipment_id = d.shipment_id)
GROUP BY d.delivered_on_datetime, s.receive_by_datetime
If a shipment_id has (or might have) multiple values for receive_by_datetime, and it is best practice to assume that something may have corrupted the data slightly, you can prevent rows in the deliveries table from being duplicated while still returning a valid result with:
SELECT d.delivered_on_datetime, s.receive_by_datetime, SUM(d.qty) AS delivered_qty, SUM(s.qty) AS shipped_qty
FROM deliveries d
LEFT JOIN (
SELECT s1.shipment_id, MAX(s1.receive_by_datetime) AS receive_by_datetime, SUM(s1.qty) AS qty
FROM shipments s1
GROUP BY s1.shipment_id
) s ON (s.shipment_id = d.shipment_id)
GROUP BY d.delivered_on_datetime, s.receive_by_datetime
Yep, the problem with many-to-many is that you get the Cartesian product of the matching rows, so you end up counting the same row more than once: once for each row it matches on the other side.
You wrote: "In shipments, the shipment_id and receive_by_datetime fields always match up."
If this means there cannot be two shipments with the same ID but different dates, then your query will work. But in general it is not safe: if the DISTINCT subselect could return more than one row per shipment ID, you will be subject to the double-counting issue. In general this is a very tricky problem to solve; in fact, I see no way it could be solved with this data model.
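The double-counting effect is easy to reproduce on a tiny data set. The sketch below uses Python's built-in sqlite3 module and made-up rows: one shipment stored as three rows, one of which carries a different receive_by_datetime (the "corrupted" case), and two deliveries totalling 10 units. It compares the delivered total under a plain join, under the DISTINCT subselect, and under a subselect grouped by shipment_id alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipments (shipment_id INTEGER, receive_by_datetime TEXT, qty INTEGER);
    CREATE TABLE deliveries (delivery_id INTEGER, shipment_id INTEGER,
                             delivered_on_datetime TEXT, qty INTEGER);
    -- Hypothetical data: shipment 1 has three rows, one with a conflicting date.
    INSERT INTO shipments VALUES (1, '2024-01-10', 5), (1, '2024-01-10', 7), (1, '2024-01-11', 1);
    INSERT INTO deliveries VALUES (100, 1, '2024-01-09', 4), (101, 1, '2024-01-09', 6);
""")

def delivered_total(shipments_source):
    """Sum d.qty after left-joining deliveries to the given shipments source."""
    sql = f"""
        SELECT SUM(d.qty)
        FROM deliveries d
        LEFT JOIN {shipments_source} s ON s.shipment_id = d.shipment_id
    """
    return conn.execute(sql).fetchone()[0]

true_total = conn.execute("SELECT SUM(qty) FROM deliveries").fetchone()[0]

# Plain join: every delivery row repeats once per shipment row.
naive_total = delivered_total("shipments")

# DISTINCT subselect: still two rows for shipment 1 because the dates differ.
distinct_total = delivered_total(
    "(SELECT DISTINCT shipment_id, receive_by_datetime FROM shipments)")

# One row per shipment_id: each delivery row is counted exactly once.
grouped_total = delivered_total(
    "(SELECT shipment_id, MAX(receive_by_datetime) AS receive_by_datetime "
    "FROM shipments GROUP BY shipment_id)")

print(true_total, naive_total, distinct_total, grouped_total)
```

Only the subselect that is guaranteed to return one row per shipment_id reproduces the true delivered total; the other two inflate it by the number of matching shipment rows.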