Put many columns in group by clause in Oracle SQL

In an Oracle 11g database, suppose we have the tables CUSTOMER and PAYMENT as follows:
Customer
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
--------------------------------------------------------------------
001         | John          | 30           | 1 Jan 2017
002         | Jack          | 10           | 2 Jan 2017
003         | Jim           | 50           | 3 Jan 2017
Payment
CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT |
-------------------------------------------
001         | 900        | 100.00         |
001         | 901        | 200.00         |
001         | 902        | 300.00         |
003         | 903        | 999.00         |
We want to write an SQL statement that returns all columns from table CUSTOMER together with the sum of all payments of each customer. There are many possible ways to do this, but I would like to ask which one of the following is better.
Solution 1
SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;
Solution 2
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE
Please notice that in Solution 1 I use MAX not because I actually want the maximum value, but because I want "ONE" row from the columns which I know are equal for all rows with the same CUSTOMER_ID.
In Solution 2, I avoid putting the misleading MAX in the SELECT part by putting the columns in the GROUP BY part instead.
With my current knowledge, I prefer Solution 1 because it is more important to comprehend the logic in the GROUP BY part than in the SELECT part. I would put only a set of unique keys there to express the intention of the query, so the reader can infer the expected number of rows. But I don't know about the performance.
I ask this question because I am reviewing a code change to a big SQL statement that puts 50 columns in the GROUP BY clause because the author wanted to avoid the MAX function in the SELECT part. I know we can refactor the query in some way to avoid putting the irrelevant columns in both the GROUP BY and SELECT parts, but please discard that option because it would affect the application logic and require more time for testing.
Update
I have just tested both versions of my big query as everyone suggested. The query is complex: it has 69 lines, involves more than 20 tables, and its execution plan is more than 190 lines, so I think this is not the place to show it.
My production data is quite small for now, about 4,000 customers, and the query was run against the whole database. Only table CUSTOMER and a few reference tables have TABLE ACCESS FULL in the execution plan; the other tables are accessed by indexes. The execution plans of the two versions differ slightly in the aggregation step (HASH GROUP BY vs SORT AGGREGATE) in some parts.
Both versions take about 13 minutes, with no significant difference.
I have also tested simplified versions similar to the SQL in the question. Both versions have exactly the same execution plan and elapsed time.
With the current information, I think the most reasonable answer is that the relative quality of the two versions is unpredictable without testing, since the optimizer does the job. I would very much appreciate any information that confirms or rejects this idea.

Another option is
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM PAYMENT
GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
To decide which one of the three is better, just test them and compare the execution plans.
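For example, to look at the estimated plan of one of the candidates in Oracle, a minimal sketch (Solution 1, trimmed to a few columns):
EXPLAIN PLAN FOR
SELECT C.CUSTOMER_ID
     , MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
     , SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
  FROM CUSTOMER C
  JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
 GROUP BY C.CUSTOMER_ID;

-- Display the estimated plan that was just generated
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);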

Neither. Do the sum on payment, then join the results.
select C.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer C
left join -- I've used left in case you want to include customers with no orders
(
select customer_id, sum(payment_amount) as total_payment
from Payment
group by customer_id
) p
on p.customer_id = c.customer_id

Solution 1 is costly.
Even though the optimizer could avoid the unnecessary sorting,
at some point you will be forced to add indexes/constraints
over irrelevant columns to improve performance.
Not a good practice in the long term.
Solution 2 is the Oracle way.
Oracle's documentation states that non-aggregated expressions in the SELECT list
must appear in the GROUP BY clause.
Oracle's engineers had valid reasons for that,
however this does not apply to every RDBMS: in some of them you
can simply write GROUP BY c.customer_id and all will be fine.
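For instance, PostgreSQL (9.1 and later) accepts the shorter form below when CUSTOMER_ID is the primary key of CUSTOMER, because the other CUSTOMER columns are then functionally dependent on the grouping column (a sketch, assuming that primary key exists):
SELECT C.CUSTOMER_ID
     , C.CUSTOMER_NAME
     , C.CUSTOMER_AGE
     , C.CUSTOMER_CREATION_DATE
     , SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
  FROM CUSTOMER C
  JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
 GROUP BY C.CUSTOMER_ID;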
For the sake of code readability, a --comment would be cheaper.
In general, not embracing a platform's principles has a cost:
more code, weird code, memory, disk space, performance, etc.

In Solution 1 the query will repeat the MAX function for each column. I don't know exactly how the MAX function works, but I assume that it sorts all elements in the column and then picks the first (best-case scenario). It is kind of a time bomb: when your table gets bigger, this query will get worse very fast. So if you are concerned about performance, you should pick Solution 2. It looks messier but will be better for the application.

Related

Joining and Aggregating a Large Number of Fact Tables Efficiently in Redshift

I have a number of (10M+ rows) fact tables in Redshift, each with a natural key memberid and each with a column timestamp. Let's say I have three tables: transactions, messages, app_opens, with transactions looking like this (all the other tables have similar structure):
memberid  | revenue | timestamp
----------+---------+--------------------
374893978 | 3.99    | 2021-02-08 18:34:01
374893943 | 7.99    | 2021-02-08 19:34:01
My goal is to create a daily per-memberid aggregation table that looks like this, with a row for each memberid and date:
memberid  | date       | daily_revenue | daily_app_opens | daily_messages
----------+------------+---------------+-----------------+---------------
374893978 | 2021-02-08 | 4.95          | 31              | 45
374893943 | 2021-02-08 | 7.89          | 23              | 7
The SQL I'm currently using for this is the following, which involves unioning separate subqueries:
SELECT memberid,
date,
max(NVL(daily_revenue,0)) daily_revenue,
max(NVL(daily_app_opens,0)) daily_app_opens,
max(NVL(daily_messages,0)) daily_messages
FROM
(
SELECT memberid,
trunc(timestamp) as date,
sum(revenue) daily_revenue,
NULL AS daily_app_opens,
NULL AS daily_messages
FROM transactions
GROUP BY 1,2
UNION ALL
SELECT memberid,
trunc(timestamp) as date,
NULL AS daily_revenue,
count(*) daily_app_opens,
NULL AS daily_messages
FROM app_opens
GROUP BY 1,2
UNION ALL
SELECT memberid,
trunc(timestamp) as date,
NULL AS daily_revenue,
NULL AS daily_app_opens,
count(*) daily_messages
FROM messages
GROUP BY 1,2
)
GROUP BY memberid, date
This works fine and produces the expected output, but I'm wondering if this is the most efficient way to carry out this kind of query. I have also tried using FULL OUTER JOIN in place of UNION ALL, but the performance is essentially identical.
What's the most efficient way to achieve this in Redshift?
Seeing the EXPLAIN plan would help as it would let us see what the most costly parts of the query are. Based on a quick read of the SQL it looks pretty good. The cost of scanning the fact tables is likely meaningful but this is a cost you have to endure. If you can restrict the amount of data read with a where clause this can be reduced but doing this may not meet your needs.
One place that you should review is the distribution of these tables. Since you are grouping by memberid, having this as the distribution key will make this process faster. Grouping needs to bring rows with the same memberid value together, so distributing on this column will greatly cut down on network traffic within the cluster.
At large data sizes, and with everything else optimized, I'd expect UNION ALL to outperform FULL OUTER JOIN, but this will depend on a number of factors (like how much the data size is reduced by the per-memberid aggregation). 10M rows is not very big in Redshift terms (I have 160M rows of wide data on a minimal cluster), so I don't think you will see much difference between these plans at these sizes.
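As a rough sketch of the distribution idea, hypothetical DDL for one of the fact tables (the column types and keys are assumptions, not taken from the question):
-- Distribute the fact table on memberid so rows for the same member land on
-- the same slice, reducing data movement during the GROUP BY.
CREATE TABLE transactions (
    memberid    BIGINT,
    revenue     DECIMAL(12,2),
    "timestamp" TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (memberid)
SORTKEY ("timestamp");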

Performance Comparison of Multiple Join and Multiple Subqueries

I know that this type of question has been asked before, but I couldn't find one with my exact problem.
I'll try to give an exaggerated example.
Let's say that we want to find companies with at least one employee older than 40 and at least one customer younger than 20.
The query my colleague wrote for this problem is like this:
SELECT DISTINCT(c.NAME) FROM COMPANY c
LEFT JOIN EMPLOYEE e ON c.COMPANY_ID = e.COMPANY_ID
LEFT JOIN CUSTOMER u ON c.COMPANY_ID = u.COMPANY_ID
WHERE e.AGE > 40 and u.AGE < 20
I'm new to databases. But looking at this query (like a time complexity problem), it will create an unnecessarily huge intermediate result. It will have employeeAmount x customerAmount rows for each company.
So, I re-wrote the query:
SELECT c.NAME FROM COMPANY c
WHERE EXISTS (SELECT * FROM EMPLOYEE e WHERE e.AGE > 40 AND c.COMPANY_ID = e.COMPANY_ID )
OR EXISTS (SELECT * FROM CUSTOMER u WHERE u.AGE < 20 AND c.COMPANY_ID = u.COMPANY_ID )
I do not know if this query will be worse since it will run 2 subqueries for each company.
I know that there can be better ways to write this. For example, writing 2 different subqueries for the 2 age conditions and then UNION'ing them may be better. But I really want to learn if there is something wrong with one or both of these two queries.
Note: You can increase the join/subquery amount. For example, "we want to find companies with at least one employee older than 40 and at least one customer younger than 20 and at least one order bigger than 1000$"
Thanks.
The exists version should have much better performance in general, especially if you have indexes on company_id in each of the subtables.
Why? The JOIN version creates an intermediate result with all employees over 40 and all customers under 20. That could be quite large if these groups are large for a particular company. Then, the query does additional work to remove duplicates.
There might be some edge cases where the first version has fine performance. I would expect this, for instance, if either of the groups were empty -- no employees ever over 40 or no customers ever under 20. Then the intermediate result set is empty and removing duplicates is not necessary. For the general case, though, I recommend exists.
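If indexes like that do not exist yet, a minimal sketch of what is meant (the index names are made up):
CREATE INDEX employee_company_id_idx ON employee (company_id);
CREATE INDEX customer_company_id_idx ON customer (company_id);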
To know what really happens in your current environment, with your database settings and with your data, you need to compare real execution plans (not just EXPLAIN PLAN, which gives only the estimated plan). Only a real execution plan can show the detailed resources used by the query, like CPU and IO, in addition to the detailed steps used by Oracle (full table scans, joins, etc.).
Try:
ALTER SESSION SET STATISTICS_LEVEL = ALL;
<your query>
SELECT * FROM TABLE(dbms_xplan.display_cursor(NULL, NULL, format=>'ALLSTATS LAST'));
Do not assume, just test.

PostgreSQL ON vs WHERE when joining tables?

I have 2 tables, customer and coupons. A customer may or may not have a reward_id assigned to it, so it's a nullable column. A customer can have many coupons, and a coupon belongs to a customer.
+-------------+------------+
| coupons | customers |
+-------------+------------+
| id | id |
| customer_id | first_name |
| code | reward_id |
+-------------+------------+
The customer_id column is indexed.
I would like to make a join between the 2 tables.
My attempt is:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id and cust.reward_id is not null
However, I think there isn't an index on reward_id, so I should move cust.reward_id is not null to the WHERE clause:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id
where cust.reward_id is not null
I wonder if the second attempt would be more efficient than the first attempt.
It would be best if you looked at the execution plans yourself. Add EXPLAIN ANALYZE before your select statement and execute both versions to see the differences.
Here's how:
EXPLAIN ANALYZE select ...
What does it do? It actually executes the select statement and gives you back the execution plan that was chosen by the query optimizer. Without the ANALYZE keyword it would only estimate the execution plan without actually executing the statement.
The database typically won't use two indexes on the same table at one time here, so if it uses the index on customer(id) for the join, it won't also use an index on customer(reward_id). That condition will instead be treated as a filter condition, which is correct behaviour.
You could experiment with the performance of a partial index created like this: customer(id) where reward_id is not null. This would decrease the index size, as it would only store the customer ids for which there is a reward_id assigned.
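A minimal sketch of that partial index (the index name is made up):
CREATE INDEX customer_id_with_reward_idx
    ON customer (id)
    WHERE reward_id IS NOT NULL;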
I generally like to split the relationship/join logic from the filter conditions, and I put the filters in the WHERE clause because they are more visible there and easier to read if there are any more changes in the future.
I suggest you measure the possible performance gain yourself, because it depends on how much data there is and on the cardinality of reward_id. For example, if most rows have this column filled with a value, it wouldn't make much of a difference, as the index size (normal vs partial) would be almost the same.
In a PostgreSQL inner join, whether a filter condition is placed in the ON clause or the WHERE clause does not affect the query result or performance.
Here is a guide that explores this topic in more detail: https://app.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins

SQL Join using Parallel Processing

I'm new to the parallel processing concept. I read through Oracle's white paper to learn the basics, but am unsure of how best to construct a SQL join to take advantage of parallel processing. I'm querying my company's database, which is massive. The first table is products, with 1 entry per product and its product details, and the other is sales by week, by store, by product.
Sales:
Week Store Product OtherColumns
1 S100 prodA
2 S100 prodB
3 S100 prodC
1 S200 prodA
2 S200 prodB
3 S200 prodC
I need to join the 2 tables based on a list of product I specify. My query looks like this:
select *
from
(select prod_id, upc
from prod_tbl
where upc in (...)) prod_tbl
join
(select location, prod_id, sum(adj_cost), sum(sales), row_number() over (partition by loc_id order by sum(adj_cost))
from wk_sales
group by...
having sum(adj_cost)< 0) sales_tbl
on prod_tbl.prod_id = sales_tbl.prod_id
The left table in the join processes a lot faster because it's just one entry per product. The right table is incredibly slow even without the calculations. So here's my question(s):
To parallel process the right table (sales_tbl), do I restructure like so:
...
join
select location, sum(), ...more
from (select ...fields... from same_tbl) --no calculations in table subquery
where
group by
on ...
Am I able to change the redistribution method to broadcast since the first return set is drastically smaller?
To use parallel execution, all you need is to add the PARALLEL hint. Optionally you can also specify the degree, like:
/*+ parallel(4) */
In your query you need to make sure that it uses full scans and hash joins. To do that you need to check your plan. Parallel is not very efficient for nested loops and merge joins.
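As a rough sketch of where the hint goes in a query shaped like the one above (degree 4 is an arbitrary choice, and the query is simplified):
SELECT /*+ PARALLEL(4) */
       prod_tbl.*, sales_tbl.*
  FROM prod_tbl
  JOIN wk_sales sales_tbl
    ON prod_tbl.prod_id = sales_tbl.prod_id;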
Update: a small note regarding parallel - bear in mind that a parallel scan bypasses the buffer cache. So if you read a big table many times in different sessions, it might be better to use a serial read. Consider parallel only for one-off tasks like ETL jobs and data migration.

SQL JOIN returning multiple rows when I only want one row

I am having a slow brain day...
The tables I am joining:
Policy_Office:
PolicyNumber OfficeCode
1 A
2 B
3 C
4 D
5 A
Office_Info:
OfficeCode AgentCode OfficeName
A 123 Acme
A 456 Acme
A 789 Acme
B 111 Ace
B 222 Ace
B 333 Ace
... ... ....
I want to perform a search to return all policies that are affiliated with an office name. For example, if I search for "Acme", I should get two policies: 1 & 5.
My current query looks like this:
SELECT
*
FROM
Policy_Office P
INNER JOIN Office_Info O ON P.OfficeCode = O.OfficeCode
WHERE
O.OfficeName = 'Acme'
But this query returns multiple rows, which I know is because there are multiple matches from the second table.
How do I write the query to only return two rows?
SELECT DISTINCT a.PolicyNumber
FROM Policy_Office a
INNER JOIN Office_Info b
ON a.OfficeCode = b.OfficeCode
WHERE b.officeName = 'Acme'
SQLFiddle Demo
To further gain more knowledge about joins, kindly visit the link below:
Visual Representation of SQL Joins
A simple join returns the Cartesian product of the matching rows: you have 2 rows with A in the first table and 3 rows with A in the second table, so you get 6 results. If you want only the policy number, then you should do a DISTINCT on it.
(using MS-Sqlserver)
I know this thread is 10 years old, but I don't like DISTINCT (in my head it means that the engine gathers all possible rows, computes a hash over every selected column in each row, and adds it to a tree ordered by that hash; I may be wrong, but it seems inefficient).
Instead, I use a CTE and the function row_number(). This may very well be a much slower approach, but it's pretty, easy to maintain, and I like it:
Given a person table and a telephone table tied together with a foreign key (in the telephone table): this construct means that a person can have multiple numbers, but I only want the first, so that each person appears only once in the result set (I ought to be able to concatenate multiple telephone numbers into one string (pivot, I think), but that's another issue; see the sketch after the query).
; -- don't forget this one!
with telephonenumbers
as
(
select [id]
, [person_id]
, [number]
, row_number() over (partition by [person_id] order by [activestart] desc) as rowno
from [dbo].[telephone]
where ([activeuntil] is null or [activeuntil] > getdate())
)
select p.[id]
,p.[name]
,t.[number]
from [dbo].[person] p
left join telephonenumbers t on t.person_id = p.id
and t.rowno = 1
This does the trick (in fact the last line does), and the syntax is readable and easy to expand. The example is simple, but when creating large scripts that join tables left and right (literally), it is difficult to avoid results containing unwanted duplicates - and difficult to identify which tables create them. CTEs work great for me.
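As for the concatenation aside mentioned above the query, a minimal sketch of one way to do it on SQL Server 2017+ with STRING_AGG instead of picking only the first number (the active-date filter is dropped for brevity):
select p.[id]
      ,p.[name]
      ,string_agg(t.[number], ', ') as numbers -- all numbers in one string per person
from [dbo].[person] p
left join [dbo].[telephone] t on t.person_id = p.id
group by p.[id], p.[name]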