Making simple SQL more efficient - sql

SQL Fiddle.
I'm having a slow start to the morning. I thought there was a more efficient way to make the following query using a join, instead of two independent selects -- am I wrong?
Keep in mind that I've simplified/reduced my query into this example for SO purposes, so let me know if you have any questions as well.
SELECT DISTINCT c.*
FROM customers c
WHERE c.customer_id IN (select customer_id from customers_cars where car_make = 'BMW')
AND c.customer_id IN (select customer_id from customers_cars where car_make = 'Ford')
;
Sample Table Schemas
-- Simple tables to demonstrate point
CREATE TABLE customers (
customer_id serial,
name text
);
CREATE TABLE customers_cars (
customer_id integer,
car_make text
);
-- Populate tables
INSERT INTO customers(name) VALUES
('Joe Dirt'),
('Penny Price'),
('Wooten Nagen'),
('Captain Planet')
;
INSERT INTO customers_cars(customer_id,car_make) VALUES
(1,'BMW'),
(1,'Merc'),
(1,'Ford'),
(2,'BMW'),
(2,'BMW'), -- Notice car_make is not unique
(2,'Ferrari'),
(2,'Porche'),
(3,'BMW'),
(3,'Ford');
-- ids 1 and 3 both have BMW and Ford
Other Expectations
There are ~20 distinct car_make values in the database
There are typically 1-3 car_make per customer_id
No more than 50 car_make assignments are expected per customer_id (generally 20-30)
The query will generally only look for 2-3 specific car_make values per customer (e.g., BMW and Ford), not 10-20

And here is another option; I don't know which one would be fastest on large tables.
SELECT customers.*
FROM customers
JOIN customers_cars USING(customer_id)
WHERE car_make = ANY(ARRAY['BMW','Ford'])
GROUP BY
customer_id, name
HAVING array_agg(car_make) #> ARRAY['BMW','Ford'];
vol7ron:
Fiddle
The following is a modification of the above, taking the same idea of using an array for comparison. I'm not sure how much more efficient it would be compared to the dual-query approach, since it has to build an array in one pass and then do a heavier comparison on the array elements.
SELECT DISTINCT c.*
FROM customers c
WHERE customer_id IN (
select customer_id
from customers_cars
group by customer_id
having array_agg(car_make) #> ARRAY['BMW','Ford']
);

I would write it as
SELECT DISTINCT c.customer_id
FROM customers c
JOIN customers_cars cc_f on c.customer_id = cc_f.customer_id and cc_f.car_make = 'Ford'
JOIN customers_cars cc_b on c.customer_id = cc_b.customer_id and cc_b.car_make = 'BMW'
;
Whether this is better or not I don't know. In some RDBMSs plain joins like this work better than subqueries, but I don't know about Postgres. From a readability point of view it is also questionable.

It seems to me that you are trying to find customers who have at least one BMW and at least one Ford.
This query should get that for you:
SELECT
customers.customer_id
FROM
customers
INNER JOIN customers_cars ON
customers.customer_id = customers_cars.customer_id
AND customers_cars.car_make IN ('BMW', 'Ford')
GROUP BY
customers.customer_id
HAVING
COUNT(CASE WHEN car_make = 'BMW' THEN 1 ELSE NULL END) > 0
AND COUNT(CASE WHEN car_make = 'Ford' THEN 1 ELSE NULL END) > 0
Make sure you have indexes on customers_cars.customer_id and customers_cars.car_make to achieve maximum performance.
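For reference, a minimal sketch of what those indexes could look like against the question's customers_cars table (the index names are just illustrative; a single composite index on (car_make, customer_id) may also serve both the filter and the grouping):
-- Hypothetical index names; adjust to your naming convention.
CREATE INDEX customers_cars_customer_id_idx ON customers_cars (customer_id);
CREATE INDEX customers_cars_car_make_idx ON customers_cars (car_make);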

You don't need to join to customers at all (given relational integrity).
Generally, this is a case of relational division. We assembled an arsenal of techniques under this related question:
How to filter SQL results in a has-many-through relation
Unique combinations
If (customer_id, car_make) was defined unique in customers_cars, it would get much simpler:
SELECT customer_id
FROM customers_cars
WHERE car_make IN ('BMW', 'Ford')
GROUP BY 1
HAVING count(*) = 2;
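For completeness, a sketch of how that uniqueness could be declared, assuming you are free to alter the table (the constraint name is illustrative):
-- Illustrative constraint name; this also implicitly creates a supporting index.
ALTER TABLE customers_cars
  ADD CONSTRAINT customers_cars_customer_id_car_make_key UNIQUE (customer_id, car_make);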
Combinations not unique
Since (customer_id, car_make) is not unique, we need an extra step.
For only a few cars, your original query is not that bad. But (especially with duplicates!) EXISTS is typically faster than IN, and we don't need the final DISTINCT:
SELECT customer_id -- no DISTINCT needed.
FROM customers c
WHERE EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'BMW')
AND EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'Ford');
The above query gets verbose and less efficient for a longer list of cars. For an arbitrary number of cars I suggest:
SELECT customer_id
FROM (
SELECT customer_id, car_make
FROM customers_cars
WHERE car_make IN ('BMW', 'Ford')
GROUP BY 1, 2
) sub
GROUP BY 1
HAVING count(*) = 2;
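To illustrate, here is the same pattern extended to three makes; just keep the IN list and the final count in sync:
SELECT customer_id
FROM (
   SELECT customer_id, car_make
   FROM   customers_cars
   WHERE  car_make IN ('BMW', 'Ford', 'Merc')  -- 3 makes listed
   GROUP  BY 1, 2
   ) sub
GROUP  BY 1
HAVING count(*) = 3;  -- must match the number of makes listed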
SQL Fiddle.

Related

Postgres Question: Aren't both a and b correct?

For questions below, use the following schema definition.
restaurant(rid, name, phone, street, city, state, zip)
customer(cid, fname, lname, phone, street, city, state, zip)
carrier(crid, fname, lname, lp)
delivery(did, rid, cid, tim, size, weight)
pickup(did, tim, crid)
dropoff(did, tim, crid)
It's a schema for a food delivery business that employs food carriers (carrier table).
Customers (customer table) order food from restaurants (restaurant table).
The restaurants order a delivery (delivery table) to deliver food from the restaurant to the customer.
The pickup table records when a carrier picks up food at a restaurant.
The dropoff table records when a carrier drops off food at a customer.
1. Find customers who have fewer than 5 deliveries.
a. select cid,count(*)
from delivery
group by cid
having count(*) < 5;
b. select a.cid,count(*)
from customer a
inner join delivery b
using(cid)
group by a.cid
having count(*) < 5;
c. select a.cid,count(*)
from customer a
left outer join delivery b
on a.cid=b.cid
group by a.cid
having count(*) < 5;
d. select cid,sum(case when b.cid is not null then 1 else 0 end)
from customer a
left outer join delivery b
using (cid)
group by cid
having sum(case when b.cid is not null then 1 else 0 end) < 5;
e. (write your own answer)
No, they are not correct. They miss customers who have had no deliveries.
The last is the best of a bunch of not so good queries. A better version would be:
select c.cid, count(d.cid)
from customer c left outer join
delivery d
on c.cid = d.cid
group by c.cid
having count(d.cid) < 5;
The sum(case) is overkill. And Postgres even offers a better solution than that!
count(*) filter (where d.cid is not null)
But count(d.cid) is still more concise.
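For completeness, here is how that FILTER form (Postgres 9.4+) would look plugged into the full query; count(d.cid) remains the shorter spelling:
select c.cid, count(*) filter (where d.cid is not null) as deliveries
from customer c left outer join
     delivery d
     on c.cid = d.cid
group by c.cid
having count(*) filter (where d.cid is not null) < 5;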
Also note the use of meaningful table aliases. Don't get into the habit of using arbitrary letters for tables. That just makes queries hard to understand.

Handling nested case statements in Redshift

I am writing a Redshift query which requires the use of multiple case statements.
Pretext:
Customers can be associated with more than one organization, like sweet or salt, etc.
Ask:
Customers associated with the 'SWEETS' organization should be picked first; if no affiliation with 'SWEETS' is available, then we have to take the id of the organization where flag = 1.
I have to use a case statement in Redshift to derive the result.
There are three different tables: a customer table, an organization table, and a third table that determines how customers are associated with organizations.
The code I have tried is below, but after executing it I am still getting two organization ids instead of one id, which should be that of the sweet org.
SELECT customer_id
, organization_id
FROM customer_details AS customer
LEFT JOIN organization AS org
ON customer.customer_id
AND organization_id = CASE WHEN organization_id IN (SELECT organization_id
FROM organization_type
WHERE organization_type = 'SWEET')
THEN organization_id
ELSE org.organization_id END
You can use window functions:
select customer_id, organization_id
from (select c.customer_id, o.organization_id,
row_number() over (partition by c.customer_id order by o.organization_type = 'SWEET' desc) as seqnum
from customer_details c left join
organization o
on c.customer_id = o.organization_id
) co
where seqnum = 1;
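Since organization_type lives in a separate table in the question's code, a variant that joins it in explicitly might look like the sketch below. The join keys are assumptions on my part (the real ones are in the schema image, which isn't visible here), and the flag = 1 fallback would slot into the order by once you know which table holds flag:
select customer_id, organization_id
from (select c.customer_id, o.organization_id,
             row_number() over (partition by c.customer_id
                                order by case when ot.organization_type = 'SWEET' then 0 else 1 end) as seqnum
      from customer_details c
      left join organization o
             on o.customer_id = c.customer_id            -- assumed join key
      left join organization_type ot
             on ot.organization_id = o.organization_id   -- assumed join key
     ) co
where seqnum = 1;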

Subquery and normal query comes out with different results

I'm a beginner with Oracle. Currently, I'm working on a question using a subquery (without JOIN) and a normal query (with JOIN), but the results of these two queries are different.
I can't figure out this problem; does anyone know?
The question asks to list the details of dog owners who have booked at least twice on this platform.
SELECT PET_OWNER.Owner_id,Oname,OAdd,COUNT(*) AS BOOKING
FROM PET_OWNER
WHERE Owner_id IN(
SELECT Owner_id
FROM PET
WHERE PType = 'DOG' AND Pet_id IN(SELECT Pet_id FROM BOOKING))
GROUP BY PET_OWNER.Owner_id,Oname,OAdd
HAVING COUNT(*) >=2
ORDER BY PET_OWNER.Owner_id;
This query with the subquery shows "no rows selected".
SELECT PET_OWNER.Owner_id,Oname,OAdd,COUNT(*) AS BOOKING
FROM PET_OWNER,PET,BOOKING
WHERE PET_OWNER.Owner_id = PET.Owner_id AND
PET.Pet_id = BOOKING.Pet_id AND
PType = 'DOG'
GROUP BY PET_OWNER.Owner_id,Oname,OAdd
HAVING COUNT(*) >=2
ORDER BY PET_OWNER.Owner_id;
This query returns 10 records, which is the correct answer for this question.
I expected these two queries to produce the same result, but they don't.
Does anyone know what is wrong with it?
Can anyone show me how to convert this code to the subquery form?
Because a duplicated join key will cause duplication in the result.
In your case, the Owner_id should be non-unique in the PET table.
It is still possible to get the correct answer by using a join. And since the owner_id in the subquery t is unique, the execution plan should be the same as for the subquery version.
select p.* from Pet_Owner p
join (
select PET.Owner_id
from PET
inner join Booking on Booking.Pet_id = PET.Pet_id
where pType = 'DOG'
group by PET.Owner_id
having count(1) >= 2) t
on t.Owner_id = p.Owner_id
order by p.Owner_id
By the way, your SQL code is quite old-school: it uses ANSI-89 join syntax, while explicit joins have been available since ANSI-92. I know many school teachers still love the old style; I hope you can read both, but only write code the ANSI-92 way.
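To illustrate the difference, here is the working query from the question rewritten in explicit ANSI-92 join syntax (same logic, only the join style changes):
SELECT po.Owner_id, po.Oname, po.OAdd, COUNT(*) AS BOOKING
FROM PET_OWNER po
JOIN PET p ON p.Owner_id = po.Owner_id
JOIN BOOKING b ON b.Pet_id = p.Pet_id
WHERE p.PType = 'DOG'
GROUP BY po.Owner_id, po.Oname, po.OAdd
HAVING COUNT(*) >= 2
ORDER BY po.Owner_id;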
What happens is that it gives you distinct values of your PET_OWNER.Owner_id, Oname, OAdd. So what we need is to group by owner_id first.
Here's your query: first get those owner_id values with a count >= 2 as a subquery.
select * from Pet_Owner where Owner_id in (
select t1.Owner_id from PET_OWNER t1
inner join PET t2 on t1.Owner_id = t2.Owner_id
inner join Booking t3 on t3.Pet_id = t2.Pet_id
where pType = 'DOG'
group by t1.Owner_id
having count(1) >= 2)
order by Owner_id
Not using a join, nested subqueries are our only option:
select * from Pet_Owner where Owner_id in (
select owner_id from Pet_Owner where Owner_id in
(select Owner_id from Pet where Pet_id in
(select Pet_id from Booking) and PType='DOG')
group by owner_id
having count(1) >= 2)
order by Owner_id
If you are trying to count the # of dogs per owner:
select * from Pet_Owner where Owner_id in (
select Owner_id from Pet where Pet_id in
(select Pet_id from Booking) and PType='DOG'
group by owner_id
having count(1) >= 2
) order by Owner_id

Select all customers loyal to one company?

I've got tables:
TABLE | COLUMNS
----------+----------------------------------
CUSTOMER | C_ID, C_NAME, C_ADDRESS
SHOP | S_ID, S_NAME, S_ADDRESS, S_COMPANY
ORDER | S_ID, C_ID, O_DATE
I want to select id of all customers who made order only from shops of one company - 'Samsung' ('LG', 'HP', ... doesn't really matter, it's dynamic).
I've come only with one solution, but I consider it ugly:
( SELECT DISTINCT c_id FROM order JOIN shop USING(s_id) WHERE s_company = 'Samsung' )
EXCEPT
( SELECT DISTINCT c_id FROM order JOIN shop USING(s_id) WHERE s_company != 'Samsung' );
The same SQL query twice, just with the operator reversed. Isn't there an aggregate approach that solves such a query better?
I mean, there could be millions of orders (I don't really have orders, I've got something that occurs more often).
Is it efficient to select thousands of orders and then compare them to hundreds of thousands of orders from a different company? I know that it compares sorted sets, so it's O(m + n + sort(n) + sort(m)). But that's still large for millions of records, isn't it?
And one more question: how could I select all customer values (name, address)? Can I join them just like this:
SELECT CUSTOMER.* FROM CUSTOMER JOIN ( (SELECT...) EXCEPT (SELECT...) ) USING (C_ID);
Disclaimer: This question isn't homework. It's preparation for an exam and a desire to do things more effectively. My solution would be accepted at the exam, but I like efficient programming.
I like to approach this type of question using group by and a having clause. You can get the list of customers using:
select o.c_id
from orders o join
shops s
on o.s_id = s.s_id
group by c_id
having min(s.s_company) = max(s.s_company);
If you care about the particular company, then:
having min(s.s_company) = max(s.s_company) and
max(s.s_company) = 'Samsung'
If you want full customer information, you can join the customers table back in.
Whether this works better than the except version is something that would have to be tested on your system.
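For example, a minimal sketch of joining the full customer rows back in (column names as given in the question; "order" is quoted here only because ORDER is a reserved word):
select c.*
from customer c
join (
    select o.c_id
    from "order" o
    join shop s on o.s_id = s.s_id
    group by o.c_id
    having min(s.s_company) = max(s.s_company)
       and max(s.s_company) = 'Samsung'
) loyal on loyal.c_id = c.c_id;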
How about a query that uses no aggregate functions like Min and Max?
select o.C_ID, s.S_COMPANY
from "order" o join shop s on o.S_ID = s.S_ID
group by o.C_ID, s.S_COMPANY;
Now we have a distinct list of customers and all the companies they shopped at. The loyal customers will be the ones who only appear once in the list.
select C_ID
from Q1
group by C_ID
having count(*) = 1;
Join back to the first query to get the company:
with
Q1 as(
select o.C_ID, s.S_COMPANY
from "order" o join shop s on o.S_ID = s.S_ID
group by o.C_ID, s.S_COMPANY
),
Q2 as(
select C_ID
from Q1
group by C_ID
having count(*) = 1
)
select Q1.C_ID, Q1.S_COMPANY
from Q1
join Q2
on Q2.C_ID = Q1.C_ID;
Now you have a list of loyal customers and the one company each is loyal to.

How to ensure outer join with filter still returns all desired rows?

Imagine I have two tables in a DB like so:
products:
product_id name
----------------
1 Hat
2 Gloves
3 Shoes
sales:
product_id store_id sales
----------------------------
1 1 20
2 2 10
Now I want to do a query to list ALL products, and their sales, for store_id = 1. My first crack at it would be to use a left join, and filter to the store_id I want, or a null store_id, in case the product didn't get any sales at store_id = 1, since I want all the products listed:
SELECT name, coalesce(sales, 0)
FROM products p
LEFT JOIN sales s ON p.product_id = s.product_id
WHERE store_id = 1 or store_id is null;
Of course, this doesn't work as intended, instead I get:
name sales
---------------
Hat 20
Shoes 0
No Gloves! This is because Gloves did get sales, just not at store_id = 1, so the WHERE clause has filtered them out.
How then can I get a list of ALL products and their sales for a specific store?
Here are some queries to create the test tables:
create temp table test_products as
select 1 as product_id, 'Hat' as name;
insert into test_products values (2, 'Gloves');
insert into test_products values (3, 'Shoes');
create temp table test_sales as
select 1 as product_id, 1 as store_id, 20 as sales;
insert into test_sales values (2, 2, 10);
UPDATE: I should note that I am aware of this solution:
SELECT name, case when store_id = 1 then sales else 0 end as sales
FROM test_products p
LEFT JOIN test_sales s ON p.product_id = s.product_id;
However, it is not ideal: in reality I need to create this query for a BI tool in such a way that the tool can simply append a WHERE clause to the query and get the desired results. Inserting the required store_id into the correct place in this query is not supported by the tool. So I'm looking for other options, if there are any.
Move the WHERE condition into the LEFT JOIN clause to prevent rows from going missing.
SELECT p.name, coalesce(s.sales, 0)
FROM products p
LEFT JOIN sales s ON p.product_id = s.product_id
AND s.store_id = 1;
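Against the test tables from the question, the same fix would read as below and should return Hat 20, Gloves 0, Shoes 0:
SELECT p.name, coalesce(s.sales, 0) AS sales
FROM test_products p
LEFT JOIN test_sales s ON p.product_id = s.product_id
                      AND s.store_id = 1;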
Edit for additional request:
I assume you can manipulate the SELECT items? Then this should do the job:
SELECT p.name
,CASE WHEN s.store_id = 1 THEN coalesce(s.sales, 0) ELSE NULL END AS sales
FROM products p
LEFT JOIN sales s USING (product_id)
Also simplified the join syntax in this case.
I'm not near SQL, but give this a shot:
SELECT name, coalesce(sales, 0)
FROM products p
LEFT JOIN sales s ON p.product_id = s.product_id AND store_id = 1
You don't want a WHERE on the whole query, just on your join.