exists(A) and not exists(negA) vs custom aggregation - sql

Many times, I have to select the customers that have made {criteria set A} of transactions and not any OTHER type of transactions. Sample data:
create table customer (name nvarchar(max))
insert customer values
('George'),
('Jack'),
('Leopold'),
('Averel')
create table trn (id int,customer nvarchar(max),product char(1))
insert trn values
(1,'George','A'),
(2,'George','B'),
(3,'Jack','B'),
(4,'Leopold','A')
Let's say we want to find all customers who bought product 'A' and not anything else (in this case, B).
The most typical way to do this probes the transaction table twice, once per condition:
select * from customer c
where exists(select 1 from trn p where p.customer=c.name and product='A')
and not exists(select 1 from trn n where n.customer=c.name and product='B')
I was wondering if there is a better way to do this. Keep in mind that the transaction table is typically huge.
What about this alternative:
select * from customer c
where exists
(
select 1
from trn p
where p.customer=c.name
group by p.customer
having max(case when product='B' then 2 when product='A' then 1 else 0 end)=1
)
Will the fact that the transaction table is used only once offset the aggregation calculation needed?

You need to test performance on your data. If you have an index on trn(customer, product), then the exists would generally have very reasonable performance.
This is particularly true when you are using the customers table.
How well does the aggregation version compare? First, the best aggregation would be:
select customer
from trn
where product in ('A', 'B')
group by customer
having min(product) = 'A' and max(product) = 'A';
If you have an index on product -- and there are lots of products (or few customers that have "A" or "B"), then this can be faster than the not exists version.
In general, I advocate using the group by, even though its performance is not always best on a couple of products. Why?
The use of the having clause is quite flexible for handling all different "set-within-set" conditions.
Adding additional conditions doesn't have a large effect on performance.
If you are not using a customer table but instead doing something like (select distinct customer from trn), then the exists/not exists version is likely to be more expensive.
That said, I advocate using group by and having because they are more flexible, not because they are always fastest. Under the right circumstances, other solutions should be used.
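Before timing anything on real data, it can help to confirm that the two formulations return the same customers. The sketch below loads the question's sample rows into an in-memory SQLite database through Python's built-in sqlite3 module. SQLite is only a stand-in here (the question targets SQL Server, so the column types are simplified to text), and the run says nothing about performance on a huge table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customer (name text);
    insert into customer values ('George'), ('Jack'), ('Leopold'), ('Averel');
    create table trn (id int, customer text, product text);
    insert into trn values (1, 'George', 'A'), (2, 'George', 'B'),
                           (3, 'Jack', 'B'), (4, 'Leopold', 'A');
""")

# Formulation 1: exists(A) and not exists(B), driven from the customer table.
exists_sql = """
    select c.name from customer c
    where exists (select 1 from trn p where p.customer = c.name and p.product = 'A')
      and not exists (select 1 from trn n where n.customer = c.name and n.product = 'B')
    order by c.name
"""

# Formulation 2: one pass over trn; keep customers whose only product among A and B is A.
group_by_sql = """
    select customer from trn
    where product in ('A', 'B')
    group by customer
    having min(product) = 'A' and max(product) = 'A'
    order by customer
"""

exists_result = [row[0] for row in conn.execute(exists_sql)]
group_by_result = [row[0] for row in conn.execute(group_by_sql)]
print(exists_result, group_by_result)
```

Both lists contain only Leopold: George also bought B, Jack never bought A, and Averel has no transactions at all.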

You could try the following statement. It may be faster than your statements under certain circumstances, since it is written to determine the customers with product A transactions first and then to check only those customers for transactions on other products (the optimizer is still free to pick a different plan). Whether there is any benefit at all depends on the data and indexes of your real tables, so you have to try.
WITH customerA AS (SELECT DISTINCT customer FROM trn WHERE product = 'A')
SELECT DISTINCT customer.*
FROM customerA JOIN customer ON customerA.customer = customer.name
WHERE not exists(select 1 from trn n where n.customer = customerA.customer and
product <> 'A')

Related

Best way to filter union of data from 2 tables by value in shared 3rd table

For sake of example, let's assume 3 tables:
PHYSICAL_ITEM
ID
SELLER_ID
NAME
COST
DIMENSIONS
WEIGHT
DIGITAL_ITEM
ID
SELLER_ID
NAME
COST
DOWNLOAD_PATH
SELLER
ID
NAME
Item IDs are guaranteed unique across both item tables. I want to select, in order, with a type label, all item IDs for a given seller. I've come up with:
Query A
SELECT PI.ID AS ID, 'PHYSICAL' AS TYPE
FROM PHYSICAL_ITEM PI
JOIN SELLER S ON PI.SELLER_ID = S.ID
WHERE S.NAME = 'name'
UNION
SELECT DI.ID AS ID, 'DIGITAL' AS TYPE
FROM DIGITAL_ITEM DI
JOIN SELLER S ON DI.SELLER_ID = S.ID
WHERE S.NAME = 'name'
ORDER BY ID
Query B
SELECT ITEM.ID, ITEM.TYPE
FROM (SELECT ID, SELLER_ID, 'PHYSICAL' AS TYPE
FROM PHYSICAL_ITEM
UNION
SELECT ID, SELLER_ID, 'DIGITAL' AS TYPE
FROM DIGITAL_ITEM) AS ITEM
JOIN SELLER ON ITEM.SELLER_ID = SELLER.ID
WHERE SELLER.NAME = 'name'
ORDER BY ITEM.ID
Query A seems like it would be the most efficient, but it also looks unnecessarily duplicative (2 table joins to the same table, 2 where clauses on the same table column). Query B looks cleaner in a way to me (no duplication), but it also looks much less efficient, since it has a subquery. Is there a way to get the best of both worlds, so to speak?
In both cases, replace the union with union all. Union does extra work to remove duplicates, which cannot occur here because item IDs are unique across both item tables.
I would expect Query A to be more efficient, because the optimizer has more information when doing the join (although I think Oracle is pretty good with using indexes even after a union). In addition, the first query reduces the amount of data before the union.
This is, however, only an opinion. The real test is to time the two queries -- multiple times to avoid cache fill delays -- to see which is better.
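As a rough illustration of the union all advice, the sketch below runs Query A with UNION ALL against a tiny in-memory SQLite database via Python's sqlite3 module. The sample rows and the reduced column lists are made up for the example, and SQLite stands in for whatever engine is actually in use, so this checks the result only, not the timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table seller (id integer primary key, name text);
    create table physical_item (id integer primary key, seller_id int, name text);
    create table digital_item (id integer primary key, seller_id int, name text);
    insert into seller values (1, 'name'), (2, 'other');
    insert into physical_item values (10, 1, 'chair'), (30, 2, 'desk');
    insert into digital_item values (20, 1, 'ebook'), (40, 1, 'song');
""")

# Query A from the question, with UNION replaced by UNION ALL.
query_a = """
    select p.id as id, 'PHYSICAL' as type
    from physical_item p
    join seller s on p.seller_id = s.id
    where s.name = ?
    union all
    select d.id as id, 'DIGITAL' as type
    from digital_item d
    join seller s on d.seller_id = s.id
    where s.name = ?
    order by id
"""
items = conn.execute(query_a, ("name", "name")).fetchall()

# With IDs unique across both tables, plain UNION returns the same rows.
items_union = conn.execute(query_a.replace("union all", "union"), ("name", "name")).fetchall()
print(items)
```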

Querying and adding rows

OK, this is a second attempt to resolve my issue. For those reading it a second time, I hope it is now clear enough to understand the problem.
I am developing a query for a report. While retrieving data from the database, the report should also populate some rows that do not exist. For illustration purposes, let's say I have these tables:
Table 1 - Companies
Table 2 - Transactions.
Table 3 - Transaction types.
An important detail is that most of the companies do not have transactions of all transaction types. The report logic, however, requires displaying a company with all of them: "real" ones with real money values, and nonexistent ones with just $0. The problem starts here, because transaction types are combined in logical groups. Say a company has only 1 real transaction of type_1: the report should contain "$0" records of the other types associated with type_1, like type_2, type_3 and type_4. If a company has transactions of type_1 and type_2, the report should be populated with some other transaction types from a different transaction type group, etc.
The catch is that the environment where it must be executed is pure SQL. (Being a Java programmer, I understand how easy it is to query the database, load the data into an array[][] and add the missing transaction types.) The query has to run on UNIX inside a PL/SQL batch, so it should be a single (or joined) select.
Thanks in advance. Any help or ideas would be very appreciated!
It sounds like you just need some sort of outer join. I'm guessing at how your tables relate to each other but it appears that you want something like
SELECT c_typ_cross_join.company_name,
c_typ_cross_join.transaction_type,
nvl( sum( t.transaction_amount ), 0 ) total_amt
FROM (SELECT c.company_id,
c.company_name,
typ.transaction_type
FROM companies c
CROSS JOIN transaction_type typ) c_typ_cross_join
LEFT OUTER JOIN transactions t ON ( c_typ_cross_join.company_id = t.company_id
AND c_typ_cross_join.transaction_type = t.transaction_type)
GROUP BY c_typ_cross_join.company_name,
c_typ_cross_join.transaction_type
This should produce one row for every company for every transaction type and the sum of the related transactions (or 0 if there are no transactions for the combination of companies and transaction types).
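The "every company for every transaction type, zero when nothing matches" pattern can be tried out on a few rows. The sketch below is only an approximation under stated assumptions: it uses SQLite through Python's sqlite3 module instead of Oracle (so coalesce replaces nvl), the table and column names are guesses just like in the answer, and it ignores the question's transaction type groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table companies (company_id int, company_name text);
    create table transaction_type (transaction_type text);
    create table transactions (company_id int, transaction_type text, transaction_amount real);
    insert into companies values (1, 'Acme'), (2, 'Globex');
    insert into transaction_type values ('type_1'), ('type_2');
    insert into transactions values (1, 'type_1', 100), (1, 'type_1', 50), (2, 'type_2', 7);
""")

report_sql = """
    select ct.company_name, ct.transaction_type,
           coalesce(sum(t.transaction_amount), 0) as total_amt
    from (select c.company_id, c.company_name, typ.transaction_type
          from companies c
          cross join transaction_type typ) ct       -- every company x every type
    left join transactions t                        -- keep combinations with no transactions
           on ct.company_id = t.company_id
          and ct.transaction_type = t.transaction_type
    group by ct.company_name, ct.transaction_type
    order by ct.company_name, ct.transaction_type
"""
report = conn.execute(report_sql).fetchall()
for row in report:
    print(row)
```

Restricting the zero rows to the types in the same logical group would need an extra table mapping types to groups, which the question does not describe.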
You could use two sub-queries: one to find all transactions per company based on the existing types the company has, and a second to find the totals.
SELECT companies.companyid, all_transactions.transaction, COALESCE(sums.total_amount, 0)
FROM companies
JOIN (SELECT ct.companyid, t.transaction
FROM transactions ct
JOIN transactions t ON t.transactiontype = ct.transactiontype
GROUP BY ct.companyid, t.transaction) all_transactions ON all_transactions.companyid = companies.companyid
LEFT JOIN (SELECT ct.companyid, SUM(ct.amount) as total_amount
FROM transactions ct
GROUP BY ct.companyid) sums ON sums.companyid = companies.companyid

Some SQL Questions

I have been using SQL for years, but have mostly been using the query designer within SQL Studio (etc.) to put together my queries. I've recently found some time to actually "learn" what everything is doing and have set myself the following fairly simple tasks. Before I begin, I'd like to ask the SOF community their thoughts on the questions, possible answers and any tips they may have.
The questions are;
Find all records w/ a duplicate in a particular column (e.g. a linking id is in more than 1 record throughout table)
SUM price from a linked table within the same query (select within a select?)
Explain the difference between the 4 joins; LEFT, RIGHT, OUTER, INNER
Copy data from one table to another based on SELECT and WHERE criteria
Input welcomed & appreciated.
Chris
I recommend that you start by following some tutorials on this topic. Your questions are not uncommon questions for someone moving from a beginner to intermediate level in SQL. SQLZoo is an excellent resource for learning SQL so consider following that.
In response to your questions:
1) Find all records with a duplicate in a particular column
There are two steps here: find duplicate records and select those records. To find the duplicate records you should be doing something along the lines of:
select possible_duplicate_field, count(*)
from table
group by possible_duplicate_field
having count(*) > 1
What we're doing here is selecting everything from a table, then grouping it by the field we want to check for duplicates. The count function then gives me a count of the number of items within that group. The HAVING clause indicates that we want to filter AFTER the grouping to only show the groups which have more than one entry.
This is all fine in itself but it doesn't give you the actual records that have those values on them. If you knew the duplicate values then you'd write this:
select * from table where possible_duplicate_field = 'known_duplicate_value'
We can use the SELECT within a select to get a list of the matches:
select *
from table
where possible_duplicate_field in (
select possible_duplicate_field
from table
group by possible_duplicate_field
having count(*) > 1
)
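The two steps can be seen on a small table. This is a minimal sketch using SQLite through Python's sqlite3 module; the table name records and the column link_id are hypothetical stand-ins for "table" and "possible_duplicate_field".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table records (id int, link_id int);
    insert into records values (1, 100), (2, 200), (3, 100), (4, 300), (5, 100);
""")

# Step 1: which values occur more than once?
dup_values = conn.execute("""
    select link_id, count(*)
    from records
    group by link_id
    having count(*) > 1
""").fetchall()

# Step 2: fetch the full rows that carry one of those values.
dup_rows = conn.execute("""
    select *
    from records
    where link_id in (
        select link_id
        from records
        group by link_id
        having count(*) > 1
    )
    order by id
""").fetchall()
print(dup_values, dup_rows)
```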
2) SUM price from a linked table within the same query
This is a simple JOIN between two tables with a SUM of the two:
select sum(tableA.X + tableB.Y)
from tableA
join tableB on tableA.keyA = tableB.keyB
What you're doing here is joining two tables together where those two tables are linked by a key field. In this case, this is an inner join (what a plain JOIN gives you), which operates as you would expect (i.e. get me every row from the left table which has a matching record in the right table, paired with that record).
3) Explain the difference between the 4 joins; LEFT, RIGHT, OUTER, INNER
Consider two tables A and B. The concept of "LEFT" and "RIGHT" in this case are slightly clearer if you read your SQL from left to right. So, when I say:
select x from A join B ...
The left table is "A" and the right table is "B". When you explicitly say LEFT or RIGHT, you are declaring which of the two tables is the primary table, meaning the one whose rows are all kept even when nothing matches. It does not dictate which table gets scanned first; that is up to the query optimizer. Incidentally, if you omit LEFT or RIGHT and write a plain JOIN, SQL performs an INNER join, not a LEFT one.
INNER and OUTER declare what to do when matches don't exist in one of the tables. INNER declares that you want only the rows that have a matching record in both tables, so it is never combined with LEFT or RIGHT. Hence, if table A contains keys "X", "Y" and "Z", and table B contains keys "X" and "Z", then an INNER join will only return "X" and "Z" records from the two tables.
OUTER is the optional keyword that goes with LEFT, RIGHT or FULL (LEFT JOIN and LEFT OUTER JOIN mean the same thing). With it we're saying: Give me everything from the primary table and anything that matches from the secondary table. Hence, in the previous example, A LEFT OUTER JOIN B would give "X", "Y" and "Z" records in the output record set. However, there would be NULLs in the fields which should have come from the secondary table for key value "Y" as it doesn't exist in the secondary table.
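The X/Y/Z example can be run directly. The sketch below uses SQLite through Python's sqlite3 module, with made-up tables a and b; it sticks to INNER and LEFT OUTER joins because older SQLite builds do not support RIGHT or FULL joins. It also compares a plain JOIN with an explicit INNER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (k text);
    create table b (k text, payload text);
    insert into a values ('X'), ('Y'), ('Z');
    insert into b values ('X', 'bx'), ('Z', 'bz');
""")

def run(sql):
    return conn.execute(sql).fetchall()

inner_rows = run("select a.k, b.payload from a inner join b on a.k = b.k order by a.k")
plain_rows = run("select a.k, b.payload from a join b on a.k = b.k order by a.k")
left_rows = run("select a.k, b.payload from a left outer join b on a.k = b.k order by a.k")

print(inner_rows)  # only keys present in both tables
print(plain_rows)  # a plain JOIN gives the same rows as INNER JOIN
print(left_rows)   # every row of a; payload is None (NULL) where b has no match
```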
4) Copy data from one table to another based on SELECT and WHERE criteria
This is pretty trivial and I'm surprised you've never encountered it. It's a simple nested SELECT in an INSERT statement (this may not be supported by your database - if not, try the next option):
insert into new_table select * from old_table where x = y
This assumes the tables have the same structure. If you have different structures then you'll need to specify the columns:
insert into new_table (list, of, fields)
select list, of, fields from old_table where x = y
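Both forms of INSERT ... SELECT can be tried on throwaway tables. The sketch below uses SQLite via Python's sqlite3 module; old_table, new_table and archive, along with their columns, are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table old_table (id int, x int, label text);
    create table new_table (id int, x int, label text);
    create table archive (label text, source_id int);
    insert into old_table values (1, 1, 'keep'), (2, 0, 'skip'), (3, 1, 'keep too');
""")

# Same structure: copy whole rows that satisfy the WHERE criteria.
conn.execute("insert into new_table select * from old_table where x = 1")

# Different structure: name the target columns and select the matching source columns.
conn.execute(
    "insert into archive (source_id, label) select id, label from old_table where x = 1"
)

copied = conn.execute("select id, x, label from new_table order by id").fetchall()
archived = conn.execute("select label, source_id from archive order by source_id").fetchall()
print(copied, archived)
```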
Let's say you have 2 tables named :
[OrderLine] with the columns [Id, OrderId, ProductId, Qty, Status]
[Product] with [Id, Name, Price]
1) all orders having more than 1 order line (it's technically the same as looking for duplicates on OrderId :) :
select OrderId, count(*)
from OrderLine
group by OrderId
having count(*) > 1
2) total price for all order lines of order 1000
select sum(p.Price * ol.Qty) as Price
from OrderLine ol
inner join Product p on ol.ProductId = p.Id
where ol.OrderId = 1000
3) difference between joins:
a inner join b => take all a that have a match in b. if no b is found, a will not be returned
a left join b => take all a, match them with b, include a even if b is not found
a right join b => b left join a
a full outer join b => (a left join b) union (a right join b)
4) copy order lines to a history table :
insert into OrderLinesHistory
(CopiedOn, OrderLineId, OrderId, ProductId, Qty)
select
getDate(), Id, OrderId, ProductId, Qty
from
OrderLine
where
status = 'Closed'
To answer #4 and to perhaps show at least some understanding of SQL and the fact this isn't HW, just me trying to learn best practise;
SET NOCOUNT ON;
DECLARE @rc int
if @what = 1
BEGIN
select id from color_mapper where product = @productid and color = @colorid;
select @rc = @@rowcount
if @rc = 0
BEGIN
exec doSavingSPROC @colorid, @productid;
END
END
END

Uses of unequal joins

Of all the thousands of queries I've written, I can probably count on one hand the number of times I've used a non-equijoin. e.g.:
SELECT * FROM tbl1 INNER JOIN tbl2 ON tbl1.date > tbl2.date
And most of those instances were probably better solved using another method. Are there any good/clever real-world uses for non-equijoins that you've come across?
Bitmasks come to mind. In one of my jobs, we had permissions for a particular user or group on an "object" (usually corresponding to a form or class in the code) stored in the database. Rather than including a row or column for each particular permission (read, write, read others, write others, etc.), we would typically assign a bit value to each one. From there, we could then join using bitwise operators to get objects with a particular permission.
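A bitmask join of that kind might look like the sketch below. Everything in it is hypothetical (the permission and object_acl tables, the bit values, the user names), and it runs on SQLite through Python's sqlite3 module rather than on the database that job actually used. The join condition is a bitwise AND instead of an equality.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table permission (name text, bit int);
    insert into permission values ('read', 1), ('write', 2),
                                  ('read_others', 4), ('write_others', 8);
    create table object_acl (object_name text, user_name text, mask int);
    insert into object_acl values ('InvoiceForm', 'alice', 3),   -- read + write
                                  ('InvoiceForm', 'bob', 1),     -- read
                                  ('ReportForm', 'alice', 5);    -- read + read_others
""")

# Non-equijoin: a permission row matches when its bit is set in the stored mask.
granted = conn.execute("""
    select a.user_name, a.object_name, p.name
    from object_acl a
    join permission p on (a.mask & p.bit) <> 0
    order by a.user_name, a.object_name, p.bit
""").fetchall()
for row in granted:
    print(row)
```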
How about for checking for overlaps?
select ...
from employee_assignments ea1
, employee_assignments ea2
where ea1.emp_id = ea2.emp_id
and ea1.end_date >= ea2.start_date
and ea1.start_date <= ea2.end_date
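Two intervals overlap when each one starts no later than the other ends, which is where the unequal comparisons come from. The sketch below checks that on a few rows using SQLite via Python's sqlite3 module. The id column and the sample dates are assumptions added for the example; the extra ea1.id < ea2.id condition stops a row from matching itself and reports each pair once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee_assignments (id int, emp_id int, start_date text, end_date text);
    insert into employee_assignments values
        (1, 7, '2020-01-01', '2020-03-31'),
        (2, 7, '2020-03-15', '2020-06-30'),
        (3, 7, '2020-07-01', '2020-09-30'),
        (4, 8, '2020-02-01', '2020-02-28');
""")

# ISO-8601 date strings compare correctly as text, so <= and >= work here.
overlaps = conn.execute("""
    select ea1.id, ea2.id
    from employee_assignments ea1
    join employee_assignments ea2
      on ea1.emp_id = ea2.emp_id
     and ea1.id < ea2.id
     and ea1.start_date <= ea2.end_date
     and ea1.end_date >= ea2.start_date
    order by ea1.id, ea2.id
""").fetchall()
print(overlaps)
```

Only assignments 1 and 2 of employee 7 are reported; assignment 3 starts the day after assignment 2 ends.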
Whole-day intervals in date_time fields:
date_time_field >= begin_date and date_time_field < end_date_plus_1
Just found another interesting use of an unequal join in the MCTS 70-433 (SQL Server 2008 Database Development) Training Kit book. Verbatim below.
By combining derived tables with unequal joins, you can calculate a variety of cumulative aggregates. The following query returns a running aggregate of orders for each salesperson (my note - with reference to the ubiquitous AdventureWorks sample db):
select
SH3.SalesPersonID,
SH3.OrderDate,
SH3.DailyTotal,
SUM(SH4.DailyTotal) RunningTotal
from
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH3
join
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH4
on SH3.SalesPersonID = SH4.SalesPersonID AND SH3.OrderDate >= SH4.OrderDate
group by SH3.SalesPersonID, SH3.OrderDate, SH3.DailyTotal
order by SH3.SalesPersonID, SH3.OrderDate
The derived tables are used to combine all orders for salespeople who have more than one order on a single day. The join on SalesPersonID ensures that you are accumulating rows for only a single salesperson. The unequal join allows the aggregate to consider only the rows for a salesperson where the order date is earlier than the order date currently being considered within the result set.
In this particular example, the unequal join is creating a "sliding window" kind of sum on the daily total column in SH4.
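The same running-total idea can be reproduced without AdventureWorks. The sketch below assumes the daily totals have already been computed into a small hypothetical table named daily, and runs the unequal self-join on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table daily (sales_person_id int, order_date text, daily_total int);
    insert into daily values
        (1, '2008-01-01', 100), (1, '2008-01-02', 50), (1, '2008-01-05', 25),
        (2, '2008-01-01', 10),  (2, '2008-01-03', 40);
""")

# Each "cur" row is joined to every row of the same salesperson dated on or
# before it; summing those rows gives the running total.
running = conn.execute("""
    select cur.sales_person_id, cur.order_date, cur.daily_total,
           sum(prev.daily_total) as running_total
    from daily cur
    join daily prev
      on cur.sales_person_id = prev.sales_person_id
     and cur.order_date >= prev.order_date
    group by cur.sales_person_id, cur.order_date, cur.daily_total
    order by cur.sales_person_id, cur.order_date
""").fetchall()
for row in running:
    print(row)
```

Because the join pairs every day with all earlier days, the work grows roughly quadratically per salesperson, so this is worth measuring on large tables.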
Duplicates:
SELECT
*
FROM
table a, (
SELECT
id,
min(rowid) AS min_rowid
FROM
table
GROUP BY
id
) b
WHERE
a.id = b.id
and a.rowid > b.min_rowid;
If you wanted to get all of the products to offer to a customer and don't want to offer them products that they already have:
SELECT
C.customer_id,
P.product_id
FROM
Customers C
INNER JOIN Products P ON
P.product_id NOT IN
(
SELECT
O.product_id
FROM
Orders O
WHERE
O.customer_id = C.customer_id
)
Most often though, when I use a non-equijoin it's because I'm doing some kind of manual fix to data. For example, the business tells me that a person in a user table should be given all access roles that they don't already have, etc.
If you want to do a dirty join of two not really related tables, you can join with a <>.
For example, you could have a Product table and a Customer table. Hypothetically, if you want to show a list of every product with every customer, you could do something like this:
SELECT *
FROM Product p
JOIN Customer c on p.SKU <> c.SSN
It can be useful. Be careful, though, because it can create ginormous result sets.
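One caveat with the <> trick is that it is not quite "every product with every customer": any pair whose two values happen to be equal is dropped, as is any pair where either value is NULL. The sketch below shows the difference against a plain CROSS JOIN, using SQLite via Python's sqlite3 module and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table product (sku text, name text);
    create table customer (ssn text, name text);
    insert into product values ('1', 'widget'), ('2', 'gadget');
    insert into customer values ('1', 'Ann'), ('9', 'Bob');
""")

# Every product paired with every customer: 2 x 2 rows.
cross_count = conn.execute(
    "select count(*) from product p cross join customer c"
).fetchone()[0]

# The <> join loses the pair where SKU and SSN coincide ('1' and '1').
neq_count = conn.execute(
    "select count(*) from product p join customer c on p.sku <> c.ssn"
).fetchone()[0]
print(cross_count, neq_count)
```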

Select based on the number of appearances of an id in another table

I have a table B with cids and cities. I also have a table C that has these cids with extra information. I want to list all the cids in table C that are associated with ALL appearances of a given city in Table B.
My current solution relies on counting the number of times the given city appears in Table B and selecting only the cids that appear that many times. I don't know all the SQL syntax yet, but is there a way to select for this kind of pattern?
My current solution:
SELECT Agents.aid
FROM Agents, Customers, Orders
WHERE (Customers.city='Duluth')
AND (Agents.aid = Orders.aid)
AND (Customers.cid = Orders.cid)
GROUP BY Agents.aid
HAVING count(Agents.aid) > 1
It only works because I already know the count right now and hard-coded it in the HAVING clause.
Thanks for the help. I wasn't sure how to google this problem, since it's pretty specific.
EDIT: I'm pinpointing my problem a bit. I need to know how to determine if EVERY row in a table has a certain value for a field. Declaring a variable, counting the rows in a sub-selection and filtering my results down to the IDs that appear that many times works, but it's really ugly.
There HAS to be a way to do this without explicitly count()ing rows. I hope.
Not an answer to your question, but a general improvement.
I'd recommend using JOIN syntax to join your tables together.
This would change your query to be:
SELECT Agents.aid
FROM Agents
INNER JOIN Orders
ON Agents.aid = Orders.aid
INNER JOIN Customers
ON Customers.cid = Orders.cid
WHERE Customers.city='Duluth'
GROUP BY Agents.aid
HAVING count(Agents.aid) > 1
What variant of SQL are you using?
To start with, you can (and should) use JOIN instead of doing it in the WHERE clause, e.g.,
select Agents.aid
from Agents
join Orders on Agents.aid = Orders.aid
join Customers on Customers.cid = Orders.cid
where Customers.city = 'Duluth'
group by Agents.aid
having count(Agents.aid) > 1
After that, I'm afraid I might be a little lost. Using the table names in your example query, what (in English, not pseudocode) are you trying to retrieve? For example, I think your sample query is retrieving the PK for all Agents that have been involved in at least 2 Orders involving Customers in Duluth.
Also, some table definitions for Agents, Orders, and Customers might help (then again, they might be irrelevant).
I'm not sure if I understood your problem, but I think the following query is what you want:
SELECT *
FROM customers b
INNER JOIN orders c USING (cid)
WHERE b.city = 'Duluth'
AND NOT EXISTS (SELECT 1
FROM customers b2
WHERE b2.city = b.city
AND b2.cid <> b.cid);
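If the goal is "agents who have taken orders from EVERY customer in a given city" (an assumption about the question's intent), one way to express it without counting rows is a double NOT EXISTS: keep an agent when there is no customer in that city for whom the agent has no order. The sketch below tries this on made-up rows in SQLite through Python's sqlite3 module; the table layouts are guesses based on the column names used in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (cid text, city text);
    create table agents (aid text);
    create table orders (aid text, cid text);
    insert into customers values ('c1', 'Duluth'), ('c2', 'Duluth'), ('c3', 'Dallas');
    insert into agents values ('a1'), ('a2'), ('a3');
    insert into orders values ('a1', 'c1'), ('a1', 'c2'), ('a1', 'c3'),
                              ('a2', 'c1'), ('a2', 'c1'),
                              ('a3', 'c3');
""")

# "There is no customer in the city that this agent has not served."
all_city_agents = conn.execute("""
    select a.aid
    from agents a
    where not exists (
        select 1
        from customers c
        where c.city = ?
          and not exists (
              select 1 from orders o
              where o.aid = a.aid and o.cid = c.cid
          )
    )
    order by a.aid
""", ("Duluth",)).fetchall()
print(all_city_agents)
```

Agent a2 has two Duluth orders but both are for the same customer, so a "count > 1" filter would wrongly accept it while this query does not. If the city has no customers at all, every agent qualifies, which may or may not be what is wanted.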
Probably you will need some indexes on these columns.