PostgreSQL ON vs WHERE when joining tables? - sql

I have 2 tables, customer and coupons. A customer may or may not have a reward_id assigned to them, so it's a nullable column. A customer can have many coupons, and a coupon belongs to a customer.
+-------------+------------+
| coupons     | customers  |
+-------------+------------+
| id          | id         |
| customer_id | first_name |
| code        | reward_id  |
+-------------+------------+
The customer_id column is indexed.
I would like to join the 2 tables.
My attempt is:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id and cust.reward_id is not null
However, I think there isn't an index on reward_id, so I should move cust.reward_id is not null to the WHERE clause:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id
where cust.reward_id is not null
I wonder if the second attempt would be more efficient than the first attempt.

It would be better to look at the execution plan yourself. Add EXPLAIN ANALYZE before your select statement and execute both versions to see the differences.
Here's how:
EXPLAIN ANALYZE select ...
What does it do? It actually executes the select statement and gives you back the execution plan chosen by the query optimizer. Without the ANALYZE keyword it would only estimate the execution plan, without actually executing the statement in the background.
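For example, for the two variants from the question (just a sketch; the actual plans and timings will depend on your data):
EXPLAIN ANALYZE
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
  on c.customer_id = cust.id and cust.reward_id is not null;

EXPLAIN ANALYZE
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
  on c.customer_id = cust.id
where cust.reward_id is not null;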
The database won't use two indexes on the same table at once, so with the index on customer(id) being used for the join, it won't also use an index on customer(reward_id). That condition will be treated as a filter condition, which is correct behaviour.
You could experiment with the performance of a partial index created as: customer(id) where reward_id is not null. This would decrease the index size, as it would only store the ids of customers that have a reward_id assigned.
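A sketch of such a partial index, using the table and column names from the question (the index name is made up):
create index customer_reward_partial_idx
    on customer (id)
    where reward_id is not null;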
I generally like to keep the join logic separate from the filter conditions, and I put the filters in the WHERE clause because they are more visible there and easier to read and change in the future.
I suggest you measure the possible performance gain yourself, because it depends on how much data there is and on the cardinality of reward_id. For example, if most rows have this column filled with a value, it wouldn't make much of a difference, as the index sizes (normal vs partial) would be almost the same.

In a PostgreSQL inner join, whether a filter condition is placed in the ON clause or the WHERE clause does not impact the query result or performance.
Here is a guide that explores this topic in more detail: https://app.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins

Related

Put many columns in group by clause in Oracle SQL

In an Oracle 11g database, suppose we have the tables CUSTOMER and PAYMENT as follows:
Customer
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
--------------------------------------------------------------------
001         | John          | 30           | 1 Jan 2017
002         | Jack          | 10           | 2 Jan 2017
003         | Jim           | 50           | 3 Jan 2017
Payment
CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT |
-------------------------------------------
001         | 900        | 100.00         |
001         | 901        | 200.00         |
001         | 902        | 300.00         |
003         | 903        | 999.00         |
We want to write a SQL query to get all columns from the CUSTOMER table together with the sum of all payments for each customer. There are many possible ways to do this, but I would like to ask which of the following is better.
Solution 1
SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;
Solution 2
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE
Please notice in Solution 1 that I use MAX not because I actually want the maximum value, but because I want "ONE" row from the columns that I know are equal for all rows with the same CUSTOMER_ID.
In Solution 2, I avoid putting the misleading MAX in the SELECT list by putting the columns in the GROUP BY clause instead.
With my current knowledge, I prefer Solution 1 because it is more important that the logic in the GROUP BY clause be easy to comprehend than the SELECT list. I would put only a set of unique keys there to express the intention of the query, so the expected number of rows can be inferred. But I don't know about the performance.
I ask this question because I am reviewing a code change to a big SQL statement that puts 50 columns in the GROUP BY clause because its author wants to avoid the MAX function in the SELECT list. I know we could refactor the query to avoid putting the irrelevant columns in both the GROUP BY and SELECT parts, but please set that option aside, because it would affect the application logic and require more time to test.
Update
I have just tested both versions of my big query, as everyone suggested. The query is complex: it is 69 lines long, involves more than 20 tables, and its execution plan is more than 190 lines, so I think this is not the place to show it.
My production data is quite small for now, about 4000 customers, and the query was run against the whole database. Only the CUSTOMER table and a few reference tables show TABLE ACCESS FULL in the execution plan; the other tables are accessed via indexes. The execution plans for the two versions differ slightly in the aggregation step (HASH GROUP BY vs SORT AGGREGATE) in some parts.
Both versions take about 13 minutes; there is no significant difference.
I have also tested simplified versions similar to the SQL in the question. Both versions have exactly the same execution plan and elapsed time.
With the current information, I think the most reasonable answer is that the outcome is unpredictable unless you test both versions, as the optimizer will do the job either way. I would very much appreciate any information that confirms or rejects this idea.
Another option is
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM PAYMENT
GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
To decide which of the three is better, just test them and compare the execution plans.
Neither. Do the sum on payment, then join the results.
select C.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer C
left join -- I've used left in case you want to include customers with no orders
(
select customer_id, sum(payment_amount) as total_payment
from Payment
group by customer_id
) p
on p.customer_id = c.customer_id
Solution 1 is costly. Even though the optimizer could avoid the unnecessary sorting, at some point you will be forced to add indexes/constraints over irrelevant columns to improve performance. Not a good practice in the long term.
Solution 2 is the Oracle way. The Oracle documentation states that the SELECT list may contain only aggregate functions and columns that appear in the GROUP BY clause. Oracle's engineers had valid reasons for that; however, this does not apply to some other RDBMSs, where you can simply write GROUP BY c.customerID and all will be fine.
For the sake of code readability, a -- comment would be cheaper. In general, not embracing a platform's principles has a cost: more code, weird code, memory, disk space, performance, etc.
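For instance, a one-line comment on Solution 1 can document the intent without widening the GROUP BY (a sketch):
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME -- MAX() only collapses duplicates; every row in the group has the same name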
In Solution 1 the query will repeat the MAX function for each column. I don't know exactly how the MAX function works, but I assume that it sorts all the elements in the column and then picks the first (best-case scenario). It is kind of a time bomb: when your table gets bigger, this query will get worse very fast. So if you are concerned about performance, you should pick Solution 2. It looks messier but will be better for the application.

SQL Join using Parallel Processing

I'm new to the concept of parallel processing. I read through Oracle's white paper here to learn the basics, but I'm unsure how best to construct a SQL join to take advantage of parallel processing. I'm querying my company's database, which is massive. The first table is products, with one entry per product holding the product details; the other is sales by week, by store, by product.
Sales:
Week  Store  Product  OtherColumns
1     S100   prodA
2     S100   prodB
3     S100   prodC
1     S200   prodA
2     S200   prodB
3     S200   prodC
I need to join the 2 tables based on a list of products I specify. My query looks like this:
select *
from
  (select prod_id, upc
     from prod_tbl
    where upc in (...)) prod_tbl
join
  (select location, prod_id, sum(adj_cost), sum(sales),
          row_number() over (partition by loc_id order by sum(adj_cost))
     from wk_sales
    group by ...
   having sum(adj_cost) < 0) sales_tbl
  on prod_tbl.prod_id = sales_tbl.prod_id
The left side of the join processes a lot faster because it's just one entry per product. The right side is incredibly slow, even without the calculations. So here are my questions:
To process the right table (sales_tbl) in parallel, do I restructure it like so:
...
join
select location, sum(), ...more
from (select ...fields... from same_tbl) --no calculations in table subquery
where
group by
on ...
Am I able to change the redistribution method to broadcast since the first return set is drastically smaller?
To use parallel execution, all you need to do is add a PARALLEL hint. Optionally you can also specify the degree, like:
/*+ parallel(4) */
In your query you need to make sure that it uses full scans and hash joins. To do that you need to check your plan. Parallel execution is not very efficient for nested loops and merge joins.
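For example, applied to the outer query from the question (a sketch only; the degree of 4 is arbitrary and the elided parts are left as in the question):
select /*+ parallel(4) */ *
from
  (select prod_id, upc
     from prod_tbl
    where upc in (...)) prod_tbl
join
  (select location, prod_id, sum(adj_cost), sum(sales),
          row_number() over (partition by loc_id order by sum(adj_cost))
     from wk_sales
    group by ...
   having sum(adj_cost) < 0) sales_tbl
  on prod_tbl.prod_id = sales_tbl.prod_id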
Update: a small hint regarding parallel execution. Bear in mind that a parallel scan bypasses the buffer cache, so if you read a big table many times in different sessions it might be better to use serial reads. Consider parallel only for one-off tasks like ETL jobs and data migrations.

Performance impact due to Join

I have two tables PRODUCT and ACCOUNT.
PRODUCT table columns
product_id (PK)
subscription_id
ACCOUNT table columns
account_nbr
subscription_id
(account_nbr and subscription_id are primary key columns in account table)
... Other columns
I have to find account_nbr and subscription_id for a product_id.
I get product_id as input. Using it I can get the subscription_id from the PRODUCT table, and
using the subscription_id I can get the account_nbr value from the ACCOUNT table.
Instead of getting the info in two queries, can it be done in one query?
Something like the below:
select distinct a.acct_nbr,p.subscription_id
from ACCOUNT a,PRODUCT p
where v.product_id = ' val 1' and
v.subscription_id = p.subscription_id
Will the performance of the above query be low compared to two separate queries?
I have to find account_nbr and subscription_id for a product_id.
So, you're correct in your approach: you need to JOIN the two result sets together:
select a.account_nbr, p.subscription_id
from account a
join product p
  on a.subscription_id = p.subscription_id
where p.product_id = :something
Some points:
You have an alias v in your query; I don't know where this came from.
Learn to use the ANSI join syntax; if you make a mistake it's a lot more obvious.
You're selecting a.acct_nbr, which doesn't exist as-is.
There's no need for a DISTINCT; ACCOUNT_NBR and SUBSCRIPTION_ID are the primary key of ACCOUNT.
Will the performance of the above query be low compared to two separate queries?
Probably not. If your query is correct and your tables are correctly indexed it's highly unlikely that whatever you've coded could beat the SQL engine. The database is designed to select data quickly.

Applying calculations based on a number of criteria stored in separate tables?

What I need to create is a table containing "Rules" such as overriding prices and applying percentage increases to the price of stock.
For Example:
The sales price is selected from the table containing information about products; then the system needs to check another table to see whether that customer/product/product category has any price rules set against it, such as a percentage discount or a set price to override with.
How do I first check whether the customer in question exists in the table, then whether the product exists, and then whether the category exists, and then apply the price change that is stored?
So far we have a PriceRules table that contains the headers:
RuleID | CustomerID | Product Code | Category | Price | Percentage | DateApplied | AppliedBy
The plan is to store the different variables in each column and then search based on the columns.
I'm sure this sounds really confusing so I will be around to answer queries as quickly as possible.
Thanks in advance,
Bob P
You can get these results using SQL JOINs:
SELECT ...
Product.ProductPrice as Price,
CustomerRules.ProductPriceRules as Rules
FROM Product
LEFT JOIN Customer
ON ...
LEFT JOIN CustomerRules
ON Product.ProductID = CustomerRules.ProductID
AND Customer.CustomerID = CustomerRules.CustomerID
LEFT JOIN will return the matching CustomerRules values where they exist; if no matching record exists, all the CustomerRules fields will contain NULL values.
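For the precedence described in the question (customer-specific rule first, then product, then category), one possible shape is to LEFT JOIN the PriceRules table once per level and take the first non-NULL price with COALESCE. This is only a sketch under assumptions: the Product table with SalesPrice/Category columns and the @CustomerID parameter are hypothetical, it assumes at most one rule per level, and percentage rules would need the same treatment as the fixed prices shown here:
SELECT p.ProductCode,
       COALESCE(cust_rule.Price,       -- customer + product specific override
                prod_rule.Price,       -- product-level override
                cat_rule.Price,        -- category-level override
                p.SalesPrice) AS EffectivePrice  -- no rule found: keep the list price
FROM Product p
LEFT JOIN PriceRules cust_rule
       ON cust_rule.CustomerID = @CustomerID
      AND cust_rule.ProductCode = p.ProductCode
LEFT JOIN PriceRules prod_rule
       ON prod_rule.CustomerID IS NULL
      AND prod_rule.ProductCode = p.ProductCode
LEFT JOIN PriceRules cat_rule
       ON cat_rule.CustomerID IS NULL
      AND cat_rule.ProductCode IS NULL
      AND cat_rule.Category = p.Category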

SQL aggregation question

I have three tables:
unmatched_purchases table:
unmatched_purchases_id --primary key
purchases_id --foreign key to events table
location_id --which store
purchase_date
item_id --item purchased
purchases table:
purchases_id --primary key
location_id --which store
customer_id
credit_card_transactions:
transaction_id --primary key
trans_timestamp --timestamp of when the transaction occurred
item_id --item purchased
customer_id
location_id
All three tables are very large. The purchases table has 590130404 records. (Yes, half a billion) Unmatched_purchases has 192827577 records. Credit_card_transactions has 79965740 records.
I need to find out how many purchases in the unmatched_purchases table match up with entries in the credit_card_transactions table. I need to do this one location at a time (i.e. run the query for location_id = 123, then run it for location_id = 456). "Match up" is defined as:
1) same customer_id
2) same item_id
3) the trans_timestamp is within a certain window of the purchase_date
(EG if the purchase_date is Jan 3, 2005
and the trans_timestamp is 11:14PM Jan 2, 2005, that's close enough)
I need the following aggregated:
1) How many unmatched purchases are there for that location
2) How many of those unmatched purchases could have been matched with credit_card_transactions for a location.
So, what is a query (or queries) to get this information that won't take forever to run?
Note: all three tables are indexed on location_id
EDIT: as it turns out, the credit_card_transactions table has been partitioned on location_id. So that will help speed this up for me. I'm asking our DBA if the others could be partitioned as well, but the decision is out of my hands.
CLARIFICATION: I only will need to run this on a few of our many locations, not all of them separately. I need to run it on 3 locations. We have 155 location_ids in our system, but some of them are not used in this part of our system.
Try this (I have no idea how fast it will be; that depends on your indexes):
Select Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null
                Then 1 Else 0 End) MatchablePurchases
From unmatched_purchases u
Join purchases p
  On p.purchases_id = u.purchases_id
Left Join credit_card_transactions c
  On c.customer_id = p.customer_id
 And c.item_id = u.item_id
 And c.trans_timestamp - u.purchase_date < #DelayThreshold
Where u.location_id = #Location
At least, you'll need more indexes. I propose at least the following: an index on unmatched_purchases(purchases_id), one on purchases(location_id), and another on credit_card_transactions(location_id, customer_id, item_id, trans_timestamp). Without those indexes, there is little hope IMO.
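A sketch of the corresponding DDL (the index names are made up; adjust them to your own conventions):
create index up_purchases_id_idx on unmatched_purchases (purchases_id);
create index pur_location_id_idx on purchases (location_id);
create index cct_loc_cust_item_ts_idx
    on credit_card_transactions (location_id, customer_id, item_id, trans_timestamp);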
I suggest you query ALL locations at once. It will cost you 3 full scans (each table once) plus sorting. I bet this will be faster than querying the locations one by one.
But if you don't want to guess, you at least need to examine the EXPLAIN PLAN and a 10046 trace of your query...
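A sketch of what the all-locations version could look like, reusing the query from the answer above and grouping by location (assuming location_id is taken from unmatched_purchases):
Select u.location_id,
       Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null Then 1 Else 0 End) MatchablePurchases
From unmatched_purchases u
Join purchases p
  On p.purchases_id = u.purchases_id
Left Join credit_card_transactions c
  On c.customer_id = p.customer_id
 And c.item_id = u.item_id
 And c.trans_timestamp - u.purchase_date < #DelayThreshold
Group By u.location_id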
The query ought to be straightforward, but the tricky part is getting it to perform. I'd question why you need to run it once for each location when it would probably be more efficient to run it for every location in a single query.
The join would be a big challenge, but the aggregation ought to be straightforward. I would guess that your best hope performance-wise for the join would be a hash join on the customer and item columns, with a subsequent filter operation on the date range. You might have to fiddle with putting the customer and item join in an inline view and then try to stop the date predicate from being pushed into the inline view.
The hash join would be much more efficient with tables that are being equi-joined both having the same hash partitioning key on all join columns, if that can be arranged.
Whether to use the location index or not ...
Whether the index is worth using or not depends on the clustering factor for the location index, which you can read from the user_indexes table. Can you post the clustering factor along with the number of blocks that the table contains? That will give a measure of the way that values for each location are distributed throughout the table. You could also extract the execution plan for a query such as:
select some_other_column
from my_table
where location_id in (value 1, value 2, value 3)
... and see if Oracle thinks the index is useful.
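To read the clustering factor and block count mentioned above, something along these lines should work (a sketch; MY_LOCATION_IDX is a placeholder for your actual index name):
select i.index_name, i.clustering_factor, t.blocks, t.num_rows
from   user_indexes i
join   user_tables  t on t.table_name = i.table_name
where  i.index_name = 'MY_LOCATION_IDX';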