Why does a JOIN query give a different result than a subquery? - sql

I am learning PostgreSQL and working with the Northwind database.
Now I am testing JOIN and a subquery with ANY.
I want to select the product_name of every product that was ordered in a quantity of exactly 10 (column quantity in order_details).
So I have two different queries:
SELECT product_name FROM products
WHERE product_id = ANY(
SELECT product_id FROM order_details
WHERE quantity = 10
)
and
SELECT products.product_name FROM products
JOIN order_details ON order_details.product_id = products.product_id
WHERE order_details.quantity = 10
But they give different results!
The first one returns only 60 rows.
The second one returns 181 rows.
Why is that, and which result is right?

The first query will output each products row at most once.
The second query can have several result rows for a single products row: one for each matching order_details row.
Which of the queries is better depends on your requirements.
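The difference can be reproduced on a tiny data set. The following is a minimal sketch, not the Northwind data: the rows are made up, and SQLite (through Python's sqlite3 module) is used only so that the example runs standalone. SQLite has no `= ANY (subquery)`, so the equivalent `IN (subquery)` is used instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER, quantity INTEGER);
    -- Made-up rows: 'Chai' was ordered with quantity 10 in two different orders.
    INSERT INTO products VALUES (1, 'Chai'), (2, 'Chang'), (3, 'Tofu');
    INSERT INTO order_details VALUES
        (100, 1, 10), (101, 1, 10), (102, 1, 5), (103, 2, 10), (104, 3, 7);
""")

# Subquery form: each products row appears at most once.
subquery_rows = conn.execute("""
    SELECT product_name FROM products
    WHERE product_id IN (SELECT product_id FROM order_details WHERE quantity = 10)
    ORDER BY product_name
""").fetchall()

# Join form: one result row per matching order_details row.
join_rows = conn.execute("""
    SELECT p.product_name FROM products p
    JOIN order_details od ON od.product_id = p.product_id
    WHERE od.quantity = 10
    ORDER BY p.product_name
""").fetchall()

# Adding DISTINCT to the join collapses the repeated names again.
distinct_join_rows = conn.execute("""
    SELECT DISTINCT p.product_name FROM products p
    JOIN order_details od ON od.product_id = p.product_id
    WHERE od.quantity = 10
    ORDER BY p.product_name
""").fetchall()

print(subquery_rows)       # 'Chai' once
print(join_rows)           # 'Chai' twice
print(distinct_join_rows)  # same as the subquery form
```

With these rows the join returns 'Chai' twice because two order lines match, which is the same effect as 181 rows versus 60 in the question.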

Related

How to do a 3-table join with aggregate functions and a group by clause

I can't figure out how to do a 3-table join with aggregate functions and a group by clause.
I have 3 tables:
Products: id, ...
Sales: id, product_id, sale_price, ...
Punches: id, punchable_id, punchable_type, ...
I want to get all of the product data together with each product's total number of sales, total number of views (each row in the punches table represents a view), and total revenue.
I have the following query:
@products = Product.joins("INNER JOIN sales ON sales.product_id = products.id INNER JOIN punches ON sales.product_id = punches.punchable_id AND punches.punchable_type = 'Product'").group("products.id").select("products.*, SUM(sales.sale_price) as revenue, COUNT(sales.id) as total_sales_count, COUNT(punches.id) as views").distinct
However, this is giving me the wrong data for the aggregate functions. The numbers are very high (I think it's summing and counting the rows a bunch of extra times).
When I do the Inner Join query with the products table and only one other table (either the punches or sales table), the data is fine. It blows up whenever I do the 2 joins in one query.
Does anyone know what's wrong with the above and how to properly write the query?
You should join punches directly to products (products.id = punches.punchable_id, etc.), not via sales.product_id.
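Whichever column the punches join goes through, joining two one-to-many tables to the same product pairs every sale of that product with every punch of that product, which would explain the inflated numbers. One common way around this is to aggregate each child table in its own subquery before joining. The sketch below assumes a made-up product with 2 sales and 3 punches, and uses SQLite via Python's sqlite3 only to stay self-contained; the subquery aliases `s` and `v` are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id INTEGER, sale_price INTEGER);
    CREATE TABLE punches (id INTEGER PRIMARY KEY, punchable_id INTEGER, punchable_type TEXT);
    -- Hypothetical data: one product, 2 sales (10 + 20), 3 views.
    INSERT INTO products VALUES (1, 'Widget');
    INSERT INTO sales (product_id, sale_price) VALUES (1, 10), (1, 20);
    INSERT INTO punches (punchable_id, punchable_type)
        VALUES (1, 'Product'), (1, 'Product'), (1, 'Product');
""")

# Both joins in one query: 2 sales x 3 punches = 6 joined rows per product.
naive = conn.execute("""
    SELECT p.id, SUM(s.sale_price), COUNT(s.id), COUNT(pu.id)
    FROM products p
    JOIN sales s ON s.product_id = p.id
    JOIN punches pu ON pu.punchable_id = p.id AND pu.punchable_type = 'Product'
    GROUP BY p.id
""").fetchall()

# Aggregate each child table first, then join one row per product.
pre_aggregated = conn.execute("""
    SELECT p.id, s.revenue, s.total_sales_count, v.views
    FROM products p
    JOIN (SELECT product_id,
                 SUM(sale_price) AS revenue,
                 COUNT(*) AS total_sales_count
          FROM sales
          GROUP BY product_id) s ON s.product_id = p.id
    JOIN (SELECT punchable_id, COUNT(*) AS views
          FROM punches
          WHERE punchable_type = 'Product'
          GROUP BY punchable_id) v ON v.punchable_id = p.id
""").fetchall()

print(naive)           # inflated: revenue 90, 6 sales, 6 views
print(pre_aggregated)  # revenue 30, 2 sales, 3 views
```

In Rails the same SQL could be passed as join fragments, but that wiring is left out here because it depends on the application's models.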

Finding the most frequently occurring combination

I have two tables named Orders and Products. The Orders table contains the orders made by customers, and the products included in each order are in the Products table.
My requirement is to get the total number of orders for the most frequently occurring product combinations.
That is, for a combination such as Product 1, Product 2, and Product 3, what is the total number of orders? If an order contains 10 products that include Product 1, Product 2, and Product 3, that order should be counted.
A single order_id can have multiple products, and I'm confused about how to get this result. Can anyone share or suggest a solution?
I'm using PostgreSQL.
Below is the sample query:
SELECT
"Orders"."order_id",pr.product_name
FROM
"data"."orders" AS "Orders"
LEFT JOIN data.items i On i."order_id"="Orders"."order_id"
LEFT join data.products pr on pr."product_id"=i."product_id"
WHERE TO_CHAR("Orders"."created_at_order",'YYYY-MM-DD') BETWEEN '2019-02-01' AND '2019-04-30'
ORDER BY "Orders"."order_id"
The desired result looks like this: the most purchased product combination together with the number of orders in which it occurs.
Product 1, Product 2, Product 3, etc., Number Of Orders
This is the sample output. I need the product list that is purchased together most often. (For now I have shown only 3 product columns as a sample, but the number may vary with the number of products in an order.)
And an example:
SELECT
"Orders"."order_id",
string_agg(DISTINCT pr.product_name::character varying, ',') AS product_name,
count(1) AS product_no
FROM
"data"."orders" AS "Orders"
LEFT JOIN data.items i On i."order_id"="Orders"."order_id"
LEFT join data.products pr on pr."product_id"=i."product_id"
WHERE TO_CHAR("Orders"."created_at_order",'YYYY-MM-DD') BETWEEN '2019-02-01' AND '2019-04-30'
GROUP BY "Orders"."order_id"
ORDER BY count(1);
You can try using a GROUP BY clause.
If you just want the number of orders per product, count the rows grouped by product in the table that links orders to products (data.items in your query), then sort in descending order so the most frequent product comes first. The query should look something like this:
SELECT product_id, COUNT(*)
FROM data.items
GROUP BY product_id
ORDER BY COUNT(*) DESC
LIMIT 1;
Hope this helps!
Try using GROUP BY and take the highest count, as below:
SELECT
pr.product_name,
COUNT(DISTINCT "Orders".order_id)
FROM
"data"."orders" AS "Orders"
LEFT JOIN data.items i On i."order_id"="Orders"."order_id"
LEFT join data.products pr on pr."product_id"=i."product_id"
WHERE TO_CHAR("Orders"."created_at_order",'YYYY-MM-DD') BETWEEN '2019-02-01' AND '2019-04-30'
GROUP BY pr.product_name
ORDER BY COUNT(DISTINCT "Orders".order_id) DESC
LIMIT 1 -- Use LIMIT or not, as per requirement
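The queries above count orders per single product. To count whole combinations, as the question asks, one possible approach is to reduce each order to its sorted set of product names and then count how often each set occurs. The sketch below uses made-up rows, leaves out the date filter, and does the final grouping in Python so that it runs standalone on SQLite. In PostgreSQL the same idea could presumably be expressed with `string_agg(... ORDER BY ...)` per order followed by a second GROUP BY on the aggregated string.

```python
import sqlite3
from collections import Counter, defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE items (order_id INTEGER, product_id INTEGER);
    -- Hypothetical data: orders 10 and 11 contain the same two products.
    INSERT INTO products VALUES (1, 'Product 1'), (2, 'Product 2'), (3, 'Product 3');
    INSERT INTO items VALUES (10, 1), (10, 2), (11, 2), (11, 1), (12, 1), (12, 3);
""")

# One row per (order, product), sorted so every combination has a canonical order.
rows = conn.execute("""
    SELECT DISTINCT i.order_id, pr.product_name
    FROM items i
    JOIN products pr ON pr.product_id = i.product_id
    ORDER BY i.order_id, pr.product_name
""").fetchall()

# Collect the product names of each order.
products_by_order = defaultdict(list)
for order_id, product_name in rows:
    products_by_order[order_id].append(product_name)

# Count how many orders share exactly the same combination.
combination_counts = Counter(tuple(names) for names in products_by_order.values())
top_combination, order_count = combination_counts.most_common(1)[0]

print(top_combination, order_count)
```

Note that this counts exact combinations only. An order containing Product 1, Product 2, and seven other products counts toward its own nine-product combination, not toward the pair.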

SQL query using EXISTS operator - unexpected records in result

Using the AdventureWorks2014 database, I was experimenting with the EXISTS keyword. Please note the following query:
select p.color, p.productid, p.name, th.Quantity
from production.product p, production.TransactionHistory th
where p.ProductID=th.ProductID and EXISTS(
select *
from Production.TransactionHistory t
where t.Quantity = 1000
and t.ProductID=p.ProductID
)
I was expecting to see only products that were ordered 1000 at a time (there is only one transaction that meets this condition), but instead I get hundreds of rows where th.Quantity is < 1000.
Removing the joined TransactionHistory table from the outer query solves the problem, but I just want to know why the original query returns the rows I am seeing.
Thanks
Edit:
For clarification, I understand how to solve the question that I want. I just wanted to understand the behavior of EXISTS and why I'm not getting the results I expected.
The following subquery (which is part of the EXISTS subquery), only returns a single result.
select *
from Production.TransactionHistory t
where t.Quantity = 1000
Therefore, if this is inside EXISTS it will return true every time. The caveat is that I am linking t.ProductID with p.ProductID in the subquery. So, for every row in the outer query, the product ID should be matching the product ID in the inner query. EXISTS should only return true when the product ID matches and the quantity is exactly 1000. To be precise, EXISTS should only return true when the product ID is 994, because there is only one transaction in the entire table (with that product ID) that satisfies both the product ID requirement and the 1000 quantity requirement.
Notice the rest of the EXISTS subquery...
where t.Quantity = 1000 and t.ProductID=p.ProductID
The product ID has to match the outer record's product ID AND the quantity must be 1000.
To me, this query says "Give me the color, product id and name of all products, join in transactions, and then only include each row where there is at least one record in the transaction table whose product id matches the id of the CURRENT outer row, AND the order quantity is 1000". But this is not how it behaves. Just trying to understand why.
Your query reads like this:
Get all transaction history entries of a product if any of its history entries has
Quantity equal to 1000.
EXISTS returns true or false, so
EXISTS(
select *
from Production.TransactionHistory t
where t.Quantity = 1000
and t.ProductID=p.ProductID
)
will return true for all TransactionHistory rows of any product that has a transaction with Quantity = 1000.
In addition:
The subquery above is evaluated for every row of the "main" query and returns true for every row of such a product. That's why you get all of that product's rows.
EXISTS returns true if the subquery returns at least one row.
You are looking for a query something like below:
SELECT p.color, p.productid, p.name, th.Quantity
FROM production.product p, production.TransactionHistory th
WHERE p.ProductID = th.ProductID and th.Quantity = 1000
Or you can replace it with a more readable JOIN query, which looks like this:
SELECT p.color, p.productid, p.name, th.Quantity
FROM production.product p
INNER JOIN production.TransactionHistory th ON p.ProductID = th.ProductID
WHERE th.Quantity = 1000
It's because the EXISTS clause is tied to the outer query only through ProductID, not through the specific joined transaction row. When at least one transaction with Quantity = 1000 exists for a product, the condition is true for every transaction row of that product. So all transactions of a product that has a transaction with quantity equal to 1000 are displayed.
Basically, your query is saying:
Give me all products and their transaction history WHERE there is
EVER a transaction with a quantity of 1000
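The behaviour described in these answers can be checked on a small data set. This is a minimal sketch with made-up rows and simplified table names (no schema prefix), run on SQLite through Python's sqlite3 only to keep it self-contained: product 994 has one transaction of 1000 and two smaller ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (ProductID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE TransactionHistory (
        TransactionID INTEGER PRIMARY KEY, ProductID INTEGER, Quantity INTEGER);
    -- Hypothetical rows; the product names are placeholders.
    INSERT INTO product VALUES (994, 'Product A'), (1, 'Product B');
    INSERT INTO TransactionHistory (ProductID, Quantity)
        VALUES (994, 1000), (994, 1), (994, 2), (1, 5);
""")

# EXISTS is correlated on p.ProductID only, so it is true for every
# joined transaction row of product 994, not just the one with 1000.
exists_rows = conn.execute("""
    SELECT p.ProductID, th.Quantity
    FROM product p
    JOIN TransactionHistory th ON th.ProductID = p.ProductID
    WHERE EXISTS (SELECT 1
                  FROM TransactionHistory t
                  WHERE t.Quantity = 1000
                    AND t.ProductID = p.ProductID)
    ORDER BY th.Quantity
""").fetchall()

# Filtering the joined row itself keeps only the matching transaction.
filtered_rows = conn.execute("""
    SELECT p.ProductID, th.Quantity
    FROM product p
    JOIN TransactionHistory th ON th.ProductID = p.ProductID
    WHERE th.Quantity = 1000
""").fetchall()

print(exists_rows)    # all three transactions of product 994
print(filtered_rows)  # only the transaction with Quantity = 1000
```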

Aggregate after join without duplicates

Consider this query:
select
count(p.id),
count(s.id),
sum(s.price)
from
(select * from orders where <condition>) as s,
(select * from products where <condition>) as p
where
s.id = p.order;
There are, for example, 200 records in products and 100 in orders (one order can contain one or more products).
I need to join them and then:
1) count products (should return 200)
2) count orders (should return 100)
3) sum one of the orders fields (should return the sum of 100 prices)
The problem is that after the join, p and s have the same length. For 2) I can write count(distinct s.id), but for 3) I get duplicates (for example, if a sale has 2 products, its price is summed twice), so sum works on the entire 200-record set when it should cover only 100.
Any thoughts on how to sum only the distinct records from the joined table without breaking the other selects?
For example, the joined table has:
id sale price
0 0 4
0 0 4
1 1 3
2 2 4
2 2 4
2 2 4
So the sum(s.price) will return:
4+4+3+4+4+4=23
but I need:
4+3+4=11
If the products table is really more of an "order lines" table, then the query would make sense. You can do what you want in several ways. Here I'm going to suggest conditional aggregation:
select count(distinct p.id), count(distinct s.id),
sum(case when seqnum = 1 then s.price end)
from (select o.* from orders o where <condition>) s join
(select p.*, row_number() over (partition by p.order order by p.order) as seqnum
from products p
where <condition>
) p
on s.id = p.order;
Normally, a table called "products" would have one row per product, with things like a description and name. A table called something like "OrderLines" or "OrderProducts" or "OrderDetails" would have the products within a given order.
You are not interested in single product records, but only in their number. So join the aggregate (one record per order) instead of the single rows:
select
count(*) as count_orders,
sum(p.cnt) as count_products,
sum(s.price)
from orders as s
join
(
select "order", count(*) as cnt
from products
where <condition>
group by "order"
) as p on p.order = s.id
where <condition>;
Your main problem is with the table design. You currently have no way of knowing the price of a product if there were no sales of it. The price should be in the product table: a product costs a certain price. Then you can count all the products of a sale and also get the total price of the sale.
Also, why are you using subqueries? When you do this, no indexes will be used when joining the two subqueries. If your joins are that complicated, use views. In most databases they can be indexed.
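The pre-aggregation approach can be checked against the sample numbers from the question (prices 4, 3, 4 with 2, 1, and 3 products per order). This is a sketch only: the column is named `order_id` here instead of `order` to avoid quoting a reserved word, the `<condition>` filters are omitted, and SQLite via Python's sqlite3 is used so that it runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, price INTEGER);
    CREATE TABLE products (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (0, 4), (1, 3), (2, 4);
    INSERT INTO products (order_id) VALUES (0), (0), (1), (2), (2), (2);
""")

# Plain join: each order's price is repeated once per product row.
naive = conn.execute("""
    SELECT COUNT(p.id), COUNT(DISTINCT s.id), SUM(s.price)
    FROM orders s
    JOIN products p ON p.order_id = s.id
""").fetchone()

# Join one pre-aggregated row per order, so each price is summed once.
pre_aggregated = conn.execute("""
    SELECT SUM(p.cnt) AS count_products,
           COUNT(*)   AS count_orders,
           SUM(s.price)
    FROM orders s
    JOIN (SELECT order_id, COUNT(*) AS cnt
          FROM products
          GROUP BY order_id) p ON p.order_id = s.id
""").fetchone()

print(naive)           # sum is 23: 4+4+3+4+4+4
print(pre_aggregated)  # sum is 11: 4+3+4
```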

SQL SUM, COUNT for only unique id

I want to calculate sum and count for only unique ids.
SELECT COUNT(orders.id), SUM(orders.total), SUM(orders.shipping) FROM "orders"
INNER JOIN "designer_orders" ON "designer_orders"."order_id" = "orders"."id"
WHERE (designer_orders.state = 'pending' OR
designer_orders.state = 'dispatched' OR
designer_orders.state = 'completed')
Do this only for unique orders ids.
Add orders.total only if orders.id is unique. Same goes for shipping.
Avoid adding duplicates.
For example, orders table inner joined designer_orders table:
OrderId  Total  Some designer order column
1        1000   2
1        1000   3
1        1000   5
2        100    7
3        133    8
4        1000   10
4        1000   20
In this case:
count of orders should be 4.
total of orders should be 2233.
Schema:
One order has many designer orders.
One designer order has only one order.
Try it this way
SELECT COUNT(o.id) no_of_orders,
SUM(o.total) total,
SUM(o.shipping) shipping
FROM orders o JOIN
(
SELECT DISTINCT order_id
FROM designer_orders
WHERE state IN('pending', 'dispatched', 'completed')
) d
ON o.id = d.order_id
Since you are only interested in whether any row with a qualifying status exists in the table designer_orders, the most obvious query style would be an EXISTS semi-join. It is typically fastest when there may be many duplicate rows in the n-table:
SELECT COUNT(o.id) AS no_of_orders
,SUM(o.total) AS total
,SUM(o.shipping) AS shipping
FROM orders o
WHERE EXISTS (
SELECT 1
FROM designer_orders d
WHERE d.state = ANY('{pending, dispatched, completed}')
AND d.order_id = o.id
);
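Both suggested queries can be compared with the plain join on the sample data from the question. This is a minimal sketch: the shipping column is left out, the designer order states are made up, and an extra order 5 with only a 'cancelled' designer order is added (not in the question) to show the state filter at work. SQLite, used here through Python's sqlite3 to stay self-contained, has no array literals, so `IN (...)` replaces `= ANY('{...}')`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
    CREATE TABLE designer_orders (
        id INTEGER PRIMARY KEY, order_id INTEGER, state TEXT);
    INSERT INTO orders VALUES (1, 1000), (2, 100), (3, 133), (4, 1000), (5, 50);
    INSERT INTO designer_orders (order_id, state) VALUES
        (1, 'pending'), (1, 'dispatched'), (1, 'completed'),
        (2, 'pending'), (3, 'completed'),
        (4, 'pending'), (4, 'dispatched'),
        (5, 'cancelled');
""")

# Plain join: orders 1 and 4 are counted once per designer order.
naive = conn.execute("""
    SELECT COUNT(o.id), SUM(o.total)
    FROM orders o
    JOIN designer_orders d ON d.order_id = o.id
    WHERE d.state IN ('pending', 'dispatched', 'completed')
""").fetchone()

# Join against the distinct qualifying order ids.
distinct_join = conn.execute("""
    SELECT COUNT(o.id), SUM(o.total)
    FROM orders o
    JOIN (SELECT DISTINCT order_id
          FROM designer_orders
          WHERE state IN ('pending', 'dispatched', 'completed')) d
      ON d.order_id = o.id
""").fetchone()

# EXISTS semi-join: each order is tested once and never duplicated.
semi_join = conn.execute("""
    SELECT COUNT(o.id), SUM(o.total)
    FROM orders o
    WHERE EXISTS (SELECT 1
                  FROM designer_orders d
                  WHERE d.state IN ('pending', 'dispatched', 'completed')
                    AND d.order_id = o.id)
""").fetchone()

print(naive)          # 7 rows counted, total 5233
print(distinct_join)  # 4 orders, total 2233
print(semi_join)      # 4 orders, total 2233
```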
For fast SELECT queries with bigger tables (and at some cost for write performance), you would have a partial index like:
CREATE INDEX designer_orders_order_id_idx ON designer_orders (order_id)
WHERE state = ANY('{pending, dispatched, completed}');
The index condition must match the WHERE condition of the query to talk the query planner into actually using the index.
A partial index is particularly attractive if there are many rows with a status that does not qualify. Else, an index without condition might be the better choice overall.