SQL Join using Parallel Processing - sql

I'm new to the parallel processing concept. I read through Oracle's white paper here to learn the basics, but am unsure of how best to construct a SQL join to take advantage of parallel processing. I'm querying my company's database, which is massive. The first table is products, which has 1 entry per product with product details, and the other is sales by week, by store, by product.
Sales:
Week | Store | Product | OtherColumns
-----|-------|---------|--------------
1    | S100  | prodA   |
2    | S100  | prodB   |
3    | S100  | prodC   |
1    | S200  | prodA   |
2    | S200  | prodB   |
3    | S200  | prodC   |
I need to join the 2 tables based on a list of products I specify. My query looks like this:
select *
from
    (select prod_id, upc
     from prod_tbl
     where upc in (...)) prod_tbl
join
    (select location, prod_id, sum(adj_cost), sum(sales),
            row_number() over (partition by loc_id order by sum(adj_cost))
     from wk_sales
     group by ...
     having sum(adj_cost) < 0) sales_tbl
on prod_tbl.prod_id = sales_tbl.prod_id
The left table in the join processes a lot faster because it's just one entry per product. The right table is incredibly slow even without the calculations. So here are my questions:
To parallel process the right table (sales_tbl), do I restructure like so:
...
join
select location, sum(), ...more
from (select ...fields... from same_tbl) --no calculations in table subquery
where
group by
on ...
Am I able to change the redistribution method to broadcast since the first return set is drastically smaller?

To use parallel execution, all you need to do is add the PARALLEL hint. Optionally you can also specify the degree, like:
/*+ parallel(4) */
In your query you need to make sure that it uses full scans and hash joins. To do that, check your execution plan. Parallel is not very efficient for nested loops and merge joins.
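For example, a sketch of how the hint could be applied to the query from the question (the IN-list and GROUP BY columns are left as placeholders, exactly as in the question):

select /*+ parallel(4) */ *
from
    (select prod_id, upc
     from prod_tbl
     where upc in (...)) prod_tbl
join
    (select location, prod_id, sum(adj_cost), sum(sales)
     from wk_sales
     group by ...
     having sum(adj_cost) < 0) sales_tbl
on prod_tbl.prod_id = sales_tbl.prod_id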
Update: a small hint regarding parallel - bear in mind that a parallel scan bypasses the buffer cache. So if you read a big table many times in different sessions, it might be better to use a serial read. Consider parallel only for one-off tasks like ETL jobs and data migration.

Related

Joining and Aggregating a Large Number of Fact Tables Efficiently in Redshift

I have a number of (10M+ rows) fact tables in Redshift, each with a natural key memberid and each with a column timestamp. Let's say I have three tables: transactions, messages, app_opens, with transactions looking like this (all the other tables have similar structure):
memberid  | revenue | timestamp
----------|---------|--------------------
374893978 | 3.99    | 2021-02-08 18:34:01
374893943 | 7.99    | 2021-02-08 19:34:01
My goal is to create a daily per-memberid aggregation table that looks like this, with a row for each memberid and date:
memberid  | date       | daily_revenue | daily_app_opens | daily_messages
----------|------------|---------------|-----------------|---------------
374893978 | 2021-02-08 | 4.95          | 31              | 45
374893943 | 2021-02-08 | 7.89          | 23              | 7
The SQL I'm currently using for this is the following, which involves unioning separate subqueries:
SELECT memberid,
date,
max(NVL(daily_revenue,0)) daily_revenue,
max(NVL(daily_app_opens,0)) daily_app_opens,
max(NVL(daily_messages,0)) daily_messages
FROM
(
SELECT memberid,
trunc(timestamp) as date,
sum(revenue) daily_revenue,
NULL AS daily_app_opens,
NULL AS daily_messages
FROM transactions
GROUP BY 1,2
UNION ALL
SELECT memberid,
trunc(timestamp) as date,
NULL AS daily_revenue,
count(*) daily_app_opens,
NULL AS daily_messages
FROM app_opens
GROUP BY 1,2
UNION ALL
SELECT memberid,
trunc(timestamp) as date,
NULL AS daily_revenue,
NULL AS daily_app_opens,
count(*) daily_messages
FROM messages
GROUP BY 1,2
)
GROUP BY memberid, date
This works fine and produces the expected output, but I'm wondering if this is the most efficient way to carry out this kind of query. I have also tried using FULL OUTER JOINs in place of UNION ALL, but the performance is essentially identical.
What's the most efficient way to achieve this in Redshift?
Seeing the EXPLAIN plan would help as it would let us see what the most costly parts of the query are. Based on a quick read of the SQL it looks pretty good. The cost of scanning the fact tables is likely meaningful but this is a cost you have to endure. If you can restrict the amount of data read with a where clause this can be reduced but doing this may not meet your needs.
One place that you should review is the distribution of these tables. Since you are grouping by memberid, having this as the distribution key will make this process faster. Grouping needs to bring rows with the same memberid value together, so distributing on these values will greatly cut down on network traffic within the cluster.
At large data sizes and with everything else optimized I'd expect UNION ALL to outperform FULL OUTER JOIN, but this will depend on a number of factors (like how much the data size is reduced by the memberid aggregation). 10M rows is not very big in Redshift terms (I have 160M rows of wide data on a minimal cluster), so I don't think you will see much difference between these plans at these sizes.
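As a sketch of that distribution change (assuming your Redshift version supports ALTER DISTKEY; otherwise the tables would need to be recreated with DISTKEY(memberid)):

-- redistribute each fact table on the grouping key
ALTER TABLE transactions ALTER DISTKEY memberid;
ALTER TABLE app_opens ALTER DISTKEY memberid;
ALTER TABLE messages ALTER DISTKEY memberid;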

PostgreSQL ON vs WHERE when joining tables?

I have 2 tables, customer and coupons. A customer may or may not have a reward_id assigned to it, so it's a nullable column. A customer can have many coupons and a coupon belongs to a customer.
+-------------+------------+
| coupons | customers |
+-------------+------------+
| id | id |
| customer_id | first_name |
| code | reward_id |
+-------------+------------+
customer_id column is indexed
I would like to make a join between 2 tables.
My attempt is:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id and cust.reward_id is not null
However, I think there isn't an index on reward_id, so I should move cust.reward_id is not null into the where clause:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id
where cust.reward_id is not null
I wonder if the second attempt would be more efficient than the first attempt.
It would be better if you looked at the execution plan yourself. Add EXPLAIN ANALYZE before your select statement and execute both queries to see the differences.
Here's how:
EXPLAIN ANALYZE select ...
What does it do? It actually executes the select statement and gives you back the execution plan that was chosen by the query optimizer. Without the ANALYZE keyword it would only estimate the execution plan, without actually executing the statement in the background.
The database typically won't use two indexes on the same table at one time, so having used the index on customer(id) for the join, it won't also use an index on customer(reward_id). That condition will instead be treated as a filter condition, which is correct behaviour.
You could experiment with the performance of a partial index created like this: customer(id) where reward_id is not null. This would decrease the index size, as it would only store those customer ids for which a reward_id is assigned.
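A minimal sketch of that partial index, using the table and column names from the question (the index name is made up):

-- only rows that have a reward_id are stored in this index
CREATE INDEX customer_id_with_reward_idx
    ON customer (id)
    WHERE reward_id IS NOT NULL;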
I generally like to split the relationship/join logic from the filter conditions, and I put the filters within the WHERE clause because they are more visible there and easier to read in the future if there are any more changes.
I suggest you measure the possible performance gain yourself, because it depends on how much data there is and on the cardinality of reward_id. For example, if most rows have this column filled with a value, it wouldn't make that much of a difference, as the index sizes (normal vs partial) would be almost the same.
In a PostgreSQL inner join, whether a filter condition is placed in the ON clause or the WHERE clause does not impact a query result or performance.
Here is a guide that explores this topic in more detail: https://app.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins

Put many columns in group by clause in Oracle SQL

In an Oracle 11g database, suppose we have tables CUSTOMER and PAYMENT as follows:
Customer
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
--------------------------------------------------------------------
001         | John          | 30           | 1 Jan 2017
002         | Jack          | 10           | 2 Jan 2017
003         | Jim           | 50           | 3 Jan 2017
Payment
CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT |
-------------------------------------------
001         | 900        | 100.00         |
001         | 901        | 200.00         |
001         | 902        | 300.00         |
003         | 903        | 999.00         |
We want to write a SQL query to get all columns from table CUSTOMER together with the sum of all payments of each customer. There are many possible ways to do this, but I would like to ask which one of the following is better.
Solution 1
SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;
Solution 2
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE
Please notice in Solution 1 that I use MAX not because I actually want the max results, but because I want "ONE" value from columns which I know are equal for all rows with the same CUSTOMER_ID.
While in Solution 2, I avoid putting the misleading MAX in the SELECT part by putting the columns in the GROUP BY part instead.
With my current knowledge, I prefer Solution 1 because it is more important to comprehend the logic in the GROUP BY part than in the SELECT part. I would put only the set of unique keys there to express the intention of the query, so the reader can infer the expected number of rows. But I don't know about the performance.
I ask this question because I am reviewing a code change to a big SQL query that puts 50 columns in the GROUP BY clause because the editor wants to avoid the MAX function in the SELECT part. I know we can refactor the query in some way to avoid putting the irrelevant columns in both the GROUP BY and SELECT parts, but please discard that option because it will affect the application logic and require more time for testing.
Update
I have just done the test on my big query in both versions, as everyone suggested. The query is complex: it has 69 lines, involves more than 20 tables, and its execution plan is more than 190 lines, so I think this is not the place to show it.
My production data is quite small at the moment: about 4,000 customers, and the query was run against the whole database. Only the CUSTOMER table and a few reference tables have TABLE ACCESS FULL in the execution plan; the other tables are accessed by indexes. The execution plans for both versions differ slightly in the aggregation method (HASH GROUP BY vs SORT AGGREGATE) in some parts.
Both versions take about 13 minutes; no significant difference.
I have also done the test on simplified versions similar to the SQL in the question. Both versions have exactly the same execution plan and elapsed time.
With the current information, I think the most reasonable answer is that the relative quality of the two versions is unpredictable unless you test them, as the optimizer will do the job. I would very much appreciate it if anyone could give any information to support or reject this idea.
Another option is
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM PAYMENT
GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
To decide which one of the three is better, just test them and see the execution plans.
Neither. Do the sum on payment, then join the results.
select C.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer C
left join -- I've used left in case you want to include customers with no orders
(
select customer_id, sum(payment_amount) as total_payment
from Payment
group by customer_id
) p
on p.customer_id = c.customer_id
Solution 1 is costly. Even though the optimizer could avoid the unnecessary sorting, at some point you will be forced to add indexes/constraints over irrelevant columns to improve performance. Not a good practice in the long term.
Solution 2 is the Oracle way. The Oracle documentation states that every non-aggregate expression in the SELECT list must appear in the GROUP BY clause. Oracle's engineers had valid reasons to do that; however, this does not apply to some other RDBMSs, where you can simply put GROUP BY c.CUSTOMER_ID and all will be fine.
For the sake of code readability a --comment would be cheaper. In general, not embracing the platform's principles has a cost: more code, weird code, memory, disk space, performance, etc.
In Solution 1 the query will repeat the MAX function for each column. I don't know exactly how the MAX function works, but I assume that it sorts all elements in the column and then picks the first (best case scenario). It is kind of a time bomb: when your table gets bigger this query will get worse very fast. So if you are concerned about performance you should pick Solution 2. It looks messier but will be better for the application.

SQL Multiple Joins - How do they work exactly?

I'm pretty sure this works universally across various SQL implementations. Suppose I have a many-to-many relationship between 2 tables:
Customer: id, name
has many:
Order: id, description, total_price
and this relationship is in a junction table:
Customer_Order: order_date, customer_id, order_id
Now I want to write a SQL query to join all of these together, showing the customer's name, the order's description and total price, and the order date:
SELECT name, description, total_price, order_date FROM Customer
JOIN Customer_Order ON Customer_Order.customer_id = Customer.id
JOIN Order ON Order.id = Customer_Order.order_id
This is all well and good. This query will also work if we change the order so it's FROM Customer_Order JOIN Customer, or put the Order table first. Why is this the case? Somewhere I've read that JOIN works like an arithmetic operator (+, *, etc.), taking 2 operands, and you can chain operators together so you can have 2+3+5, for example. Following this logic, first we have to calculate 2+3 and then take that result and add 5 to it. Is it the same with JOINs?
Is it that under the hood, the first JOIN must be completed in order for the second JOIN to take place? So basically, the first JOIN will create a table out of the 2 operands left and right of it. Then, the second JOIN will take that resulting table as its left operand and perform the usual joining. Basically, I want to understand how multiple JOINs work under the hood.
In many ways I think ORMs are the bane of modern programming, unleashing a barrage of underprepared coders. Oh well, diatribe out of the way: you're asking a question about set theory. There are potentially other framings that center on relational algebra, but SQL is fundamentally set-theory based. Here are a couple of links to get you started:
Using set theory to understand SQL
A visual explanation of SQL
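To illustrate the left-to-right association the question asks about, the query can be written with explicit parentheses (purely illustrative, reusing the question's table names; the optimizer is still free to execute the physical joins in whatever order preserves the result):

-- note: Order is a reserved word in most dialects and would normally need quoting
SELECT name, description, total_price, order_date
FROM (Customer
      JOIN Customer_Order ON Customer_Order.customer_id = Customer.id)
     JOIN Order ON Order.id = Customer_Order.order_id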

Slow but simple Query, how to make it quicker?

I have a database which is 6GB in size, with a multitude of tables; however, the smaller queries seem to have the most problems, and I want to know what can be done to optimise them. For example, there are Stock, Items and Orders tables.
The Stock table holds the items in stock; it has around 100,000 records with 25 fields storing ProductCode, Price and other stock-specific data.
The Items table stores the information about the items; there are over 2,000,000 of these, with over 50 fields storing item names and other details about the item or product in question.
The Orders table stores the orders of stock items, i.e. when the order was placed plus the price sold for, and has around 50,000 records.
Here is a query from this Database:
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
Given the information here, what could be done to improve this query? There are others like this; what alternatives are there? The nature of the data and the database cannot be detailed further, so general optimisation tricks and methods, or anything which applies to databases in general, will be fine.
The Database is MS SQL 2000 on Windows Server 2003 with the latest service packs for each.
DB Upgrade / OS Upgrade are not options for now.
Edit
Indices on the tables mentioned are Stock.SKU, Items.ProductCode and Orders.OrderID.
Execution time is 13-16 seconds for a query like this, with 75% of the time spent in Stock.
Thanks for all the responses so far - indexing seems to be the problem, and all the different examples given have been helpful, despite a few mistakes with the query. This has helped me a lot: some of these queries have run quicker, and combined with the index suggestions I think I might be on the right path now. Thanks for the quick responses - this has really helped me and made me consider things I did not think of or know about before!
Indexes ARE my problem. I added one to the foreign key on Orders (Customer) and this has improved performance by halving the execution time!
Looks like I got tunnel vision and focused on the query - I have been working with DBs for a couple of years now, but this has been very helpful. Thanks for all the query examples too; they use combinations and features I had not considered and may be useful as well!
Is your code correct??? I'm sure you're missing something:
INNER JOIN Batch ON Order.OrderID = Orders.OrderID
and you have a stray ) in the code ...
you can always test some variants against the execution plan tool, like
SELECT
s.SKU, i.Name, s.ProductCode
FROM
Stock s, Orders o, Batch b, Items i
WHERE
b.OrderID = o.OrderID AND
s.ProductCode = i.ProductCode AND
s.Status IN (1, 2) AND
o.Customer = 12345
ORDER BY
o.OrderDate DESC;
You should also return just a fraction of the rows, like TOP 10... it will take some milliseconds to choose the TOP 10, but you will save plenty of time when binding the result to your application.
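For example, a sketch based on the question's original query (table and column names taken from the thread; adjust to your schema):

SELECT TOP 10
    s.SKU, i.Name, s.ProductCode
FROM
    Stock s
    INNER JOIN Orders o ON o.OrderID = s.OrderID
    INNER JOIN Items i ON i.ProductCode = s.ProductCode
WHERE
    s.Status IN (1, 2) AND
    o.Customer = 12345
ORDER BY
    o.OrderDate DESC;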
The most important thing (if not already done): define primary keys for the tables and add indexes for the foreign keys and for the columns you are using in the joins.
Did you specify indexes? On
Items.ProductCode
Stock.ProductCode
Orders.OrderID
Orders.Customer
Sometimes IN can be faster than OR, but this is not as important as having indexes.
See balexandre's answer; your query looks wrong.
Some general pointers
Are all of the fields that you are joining on indexed?
Is the ORDER BY necessary?
What does the execution plan look like?
BTW, you don't seem to be referencing the Order table in the question query example.
Table indexes will certainly help, as Cătălin Pitiș suggested.
Another trick is to reduce the size of the joined rows by either using sub-selects or, more extreme, temp tables. For example, rather than joining on the whole Orders table, join on
(SELECT * FROM Orders WHERE Customer = 12345)
Also, don't join directly on the Stock table; join on
(SELECT * FROM Stock WHERE Status = 1 OR Status = 2)
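Putting those two sub-selects together, the whole query might look something like this (a sketch only, keeping the table and column names from the question):

SELECT s.SKU, i.Name, s.ProductCode
FROM (SELECT * FROM Stock WHERE Status = 1 OR Status = 2) s
INNER JOIN (SELECT * FROM Orders WHERE Customer = 12345) o
    ON o.OrderID = s.OrderID
INNER JOIN Items i
    ON s.ProductCode = i.ProductCode
ORDER BY o.OrderDate DESC;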
Setting the correct indexes on the tables is usually what makes the biggest difference for performance.
In Management Studio (or Query Analyzer for earlier versions) you can choose to view the execution plan of the query when you run it. In the execution plan you can see what the database is really doing to get the result, and which parts take the most work. There are some things to look for there, like table scans, which are usually the most costly part of a query.
The primary key of a table normally has an index, but you should verify that it's actually so. Then you probably need indexes on the fields that you use to look up records, and fields that you use for sorting.
Once you have added an index, you can rerun the query and see in the execution plan if it's actually using the index. (You may need to wait a while after creating the index for the database to build the index before it can use it.)
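For reference, a minimal way to get a text version of the plan in Query Analyzer (the SELECT here is just a placeholder statement):

SET SHOWPLAN_TEXT ON
GO
SELECT Stock.SKU FROM Stock WHERE Stock.Status = 1
GO
SET SHOWPLAN_TEXT OFF
GO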
Could you give it a go?
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID AND (Order.Customer = 12345) AND (Stock.Status = 1 OR Stock.Status = 2)
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
ORDER BY Order.OrderDate DESC;
Elaborating on what Cătălin Pitiș said already: in your query
SELECT Stock.SKU, Items.Name, Stock.ProductCode
FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
the criterion Order.Customer = 12345 looks very specific, whereas (Stock.Status = 1 OR Stock.Status = 2) sounds unspecific. If this is correct, an efficient query consists of
1) first finding the orders belonging to a specific customer,
2) then finding the corresponding rows of Stock (with the same OrderID), keeping only those with Status in (1, 2),
3) and finally finding the items with the same ProductCode as the rows of Stock in 2)
For 1) you need an index on Customer for the table Order, for 2) an index on OrderID for the table Stock and for 3) an index on ProductCode for the table Items.
As long as your query does not become much more complicated (like being a subquery in a bigger query, or Stock, Order and Items being only views rather than tables), the query optimizer should be able to find this plan from your query as written. Otherwise, you'll have to do what kuoson is suggesting (but the 2nd suggestion does not help if Status in (1, 2) is not very selective and/or Status is not indexed on the Stock table). But also remember that keeping indexes up to date costs performance if you do many inserts/updates on the table.
To shorten the answer I gave 2 hours ago (when my cookies were switched off):
You need three indexes: Customer for table Order, OrderID for Stock and ProductCode for Items.
If you miss any of these, you'll have to wait for a complete table scan on the corresponding table.
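A sketch of the corresponding DDL (the index names are made up; adjust them and the Orders table name to your own schema):

CREATE INDEX IX_Orders_Customer ON Orders (Customer);
CREATE INDEX IX_Stock_OrderID ON Stock (OrderID);
CREATE INDEX IX_Items_ProductCode ON Items (ProductCode);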