Improving run time of SQL - currently 61 hours

Complex select statement with approximately 20 left outer join statements. Many of the joins exist only to obtain data from a single column in the joined table (poorly designed database). The current run time estimated by EXPLAIN is 61 hours (45GB).
I have limited options due to user permissions. How can I optimise the SQL? So far I have tried:
- identifying and removing unnecessary joins
- writing statements to include the data I need rather than exclude the data I don't
- trying to get permission to CREATE TABLE ('hell no')
- trying to get access to a sandpit-like space on a server to create a view ('oh hells no no no').
SELECT t1.column1, t1.column2, t2.column1, t3.column2, t4.column3
       --- (etc - approximately 30 items)
     , CASE WHEN t1.column2 IS NULL
            THEN t2.column3
            ELSE t1.column2
       END AS Derived_Column_1
FROM TABLE1 t1
LEFT OUTER JOIN TABLE2 t2
    ON t1.column1 = t2.column3
LEFT OUTER JOIN TABLE3 t3
    ON t1.column5 = t3.column6
    AND t1.column6 = t3.column7
LEFT OUTER JOIN TABLE4 t4
    ON t2.Column4 = t4.Column8
    AND t2.Column5 = '16'
--- (etc - approximately 16 other joins, some of which are only required to connect table 1 to 5, because they have no direct common fields)
--- select data that was timestamped in the last 120 days
WHERE CAST(t1.Column3 AS DATE) > CURRENT_DATE - 120
-- de-duplicate the data by four values and keep the latest entry
QUALIFY RANK() OVER (PARTITION BY t1.column1, t2.column1, t3.column2, t3.column4 ORDER BY t1.Column3 DESC) = 1
The desired result is a single output with 30 fields plus the derived column, for data that was timestamped in the last 120 days.
I would like to remove duplicates based on four fields, but the QUALIFY RANK() OVER (PARTITION BY t1.column1, t2.column1, t3.column2, t3.column4 ORDER BY t1.Column3 DESC) = 1 step adds a lot of time to the run.

I think you could CREATE VOLATILE TABLE ... ON COMMIT PRESERVE ROWS to store some intermediate data. It may need some checking, but I believe you would not need any special rights to do that (only the spool space quota you already have in order to run your SELECTs).
The usual optimization technique is as follows: you take control of the execution plan by cutting your large SELECT into pieces that sequentially compute intermediate results (saving those into volatile tables) and redistribute them (by specifying the PRIMARY INDEX of the volatile tables) to take advantage of Teradata's parallelism.
Usually you choose the columns used in the join conditions as the primary index; you may encounter skew, which you can solve by cutting an intermediate volatile table in two and choosing different primary indexes for the two parts. That makes your code more elaborate, but much more efficient.
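For instance, a minimal sketch of the approach might look like the following (the volatile table name and the choice of columns are illustrative only, not taken from the original query):
CREATE VOLATILE TABLE vt_recent_t1 AS (
    SELECT t1.column1, t1.column2, t1.column3, t1.column5, t1.column6
    FROM TABLE1 t1
    WHERE CAST(t1.column3 AS DATE) > CURRENT_DATE - 120
) WITH DATA
PRIMARY INDEX (column1)
ON COMMIT PRESERVE ROWS;

-- On recent Teradata releases you can also collect statistics on volatile tables:
COLLECT STATISTICS ON vt_recent_t1 COLUMN (column1);

-- Later steps then join the much smaller volatile table
-- instead of re-filtering TABLE1 inside one giant SELECT:
SELECT vt.column1, t2.column1 AS t2_column1   -- etc.
FROM vt_recent_t1 vt
LEFT OUTER JOIN TABLE2 t2
    ON vt.column1 = t2.column3;
Distributing the volatile table on the join column (column1 here) keeps the subsequent join AMP-local, which is the point of the redistribution step described above.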
By the way, do not let the "hours" estimate of the Teradata plan fool you; those are not the actual hours, minutes or seconds, only synthetic ones. Usually, they are pretty far from the actual query run time.

Related

SAS multiple large tables join - ERROR: Sort execution failure

When running a large query of the form (using the undocumented _method option to output the query method):
PROC SQL _method; CREATE TABLE output AS
SELECT
    t1.foo
   ,t2.bar
   ,t3.bat
   ,t4.fat
   ,t5.baa
FROM table1 t1
LEFT JOIN table2 t2
    ON t1.key2 = t2.key2
LEFT JOIN table3 t3
    ON t1.key3 = t3.key3
LEFT JOIN table4 t4
    ON t1.key4 = t4.key4
...
LEFT JOIN tablen tn
    ON t1.keyn = tn.keyn
;
Where t1 is ca. 6 Gb, t2 is a view on a table that is ca. 500 Gb, and t3, t4 ... tn are each data tables ca. 1-10 Mb (there are typically six or seven of these), I run into the following error:
NOTE: SAS threaded sort was used.
ERROR: Sort execution failure.
NOTE: View WORK.table2.VIEW used (Total process time):
      real time           17:02.55
      user cpu time       2:40.12
      system cpu time     2:19.41
      memory              303785.64k
      OS Memory           322280.00k
      Timestamp           11/03/2014 08:13:25 PM
When I sample a very small percentage of t1 to make it only ca. 30 Mb the query runs okay, but even 10% of table1 causes a similar issue.
How can I profile this query:
- to help me choose a better strategy
- to enable me to perform the operation on the whole dataset
- to limit the need for excessive I/O on the file system (i.e. I could process this batchwise and union the results)
First, this is a really big set of data, and the problem may be with the view. Second, if the data is in a database, you might want a pass-through query, so the processing is all done on the database side.
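As a rough sketch of what explicit pass-through could look like (Teradata is used here only as an example DBMS; the connection options, credentials and dataset names are placeholders for whatever your site actually uses):
proc sql;
   /* connection options are placeholders for your site's settings */
   connect to teradata (server="dbsrv" user=myuser password=XXXXXX);
   create table work.output as
   select * from connection to teradata
   ( select t1.foo, t2.bar
     from table1 t1
     left join table2 t2 on t1.key2 = t2.key2
     /* ... remaining joins ... */
   );
   disconnect from teradata;
quit;
With pass-through, the join and any sorting happen on the database server, so the SAS session only has to land the final result set.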
If the left joins are just looking up values, particularly individual values, you can rephrase the query as:
SELECT t1.foo,
       (SELECT t2.bar FROM table2 t2 WHERE t1.key2 = t2.key2) as bar,
       (SELECT t3.bat FROM table3 t3 WHERE t1.key3 = t3.key3) as bat,
       . . .
FROM table1 t1;
This should eliminate any possible sort that would occur on table1.
If the joins are returning multiple rows, this won't work; it will generate errors.

Bigquery JOIN optimization

We are running a query every 5 minutes with a JOIN. On one side of the JOIN is table1#time1-time2 (as we only look at the incremental part); on the other side of the JOIN is table2, which keeps changing as we stream data into it. The JOIN now looks like:
[table1#time1-time2] AS T1 INNER JOIN EACH table2 AS T2 ON T1.id = T2.id
Since this query involves the whole of T2 every time, is there any optimization I can do, such as using a cache or something else, to minimize the monetary cost?
EDIT
The query: [posted as a screenshot, not reproduced here]
Copy-pasting the text would be better; the query is hard to read in that screenshot.
That said, I see a SELECT * for the second table. Selecting only the needed columns would query just a fraction of the table instead of all of it.
Also, why are you generating a row_in and joining on a different one?
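As a rough illustration of the column-pruning point (following the legacy-SQL notation used in the question; the column names other than id are made up):
SELECT T1.id, T2.attribute          -- list only the columns you actually need
FROM [table1#time1-time2] AS T1
INNER JOIN EACH table2 AS T2
ON T1.id = T2.id
In legacy BigQuery SQL, cost is driven by the columns scanned across all referenced tables, so naming just id and attribute means only those columns of table2 are billed, instead of the whole table via SELECT *.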

JOIN versus EXISTS performance

Generally speaking, is there a performance difference between using a JOIN to select rows versus an EXISTS where clause? Searching various Q&A web sites suggests that a join is more efficient, but I recall learning a long time ago that EXISTS was better in Teradata.
I do see other SO answers, like this and this, but my question is specific to Teradata.
For example, consider these two queries, which return identical results:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
join MY_TARGET_TABLE x
on x.srv_accs_id=svc.srv_accs_id
group by 1
order by 1
-and-
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where exists(
select 1
from MY_TARGET_TABLE x
where x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1
The primary index (unique) on both tables is 'srv_accs_id'. MY_BASE_TABLE is rather large (200 million rows) and MY_TARGET_TABLE relatively small (200,000 rows).
There is one significant difference in the EXPLAIN plans: The first says the two tables are joined "by way of a RowHash match scan" and the second says "by way of an all-rows scan". Both say it is "an all-AMPs JOIN step" and the total estimated time is identical (0.32 seconds).
Both queries perform the same (I'm using Teradata 13.10).
A similar experiment to find non-matches comparing a LEFT OUTER JOIN with a corresponding IS NULL where clause to a NOT EXISTS sub-query does show a performance difference:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
left outer join MY_TARGET_TABLE x
on x.srv_accs_id=svc.srv_accs_id
where x.srv_accs_id is null
group by 1
order by 1
-and-
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where not exists(
select 1
from MY_TARGET_TABLE x
where x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1
The second query's plan is faster (2.14 seconds versus 2.21 for the first, as estimated by EXPLAIN).
My example may be too trivial to see a difference; I'm just looking for coding guidance.
NOT EXISTS is more efficient than using a LEFT OUTER JOIN with an IS NULL condition to exclude records that are missing from the participating table, because the optimizer will elect to use an EXCLUSION MERGE JOIN for the NOT EXISTS predicate.
While your second test did not yield impressive results for the data sets you were using, the performance increase from NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volumes increase. Keep in mind that the tables need to be hash-distributed on the columns that participate in the NOT EXISTS join, just as they would be for the LEFT JOIN; data skew can therefore impact the performance of the EXCLUSION MERGE JOIN.
EDIT:
Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.
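For reference, a sketch of the IN form this EDIT is contrasting with, reusing the tables above purely for illustration:
-- Equivalent inclusion logic written with IN instead of EXISTS:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where svc.srv_accs_id in (select x.srv_accs_id
                          from MY_TARGET_TABLE x)
group by 1
order by 1
-- Caveat: the NOT IN counterpart returns no rows at all if the sub-query
-- produces even a single NULL, which is one reason to prefer (NOT) EXISTS
-- when the compared column is nullable.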

WHERE and JOIN order of operation

My question is similar to this SQL order of operations but with a little twist, so I think it's fair to ask.
I'm using Teradata. And I have 2 tables: table1, table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT *
FROM table2
WHERE val<100
) table3
ON table1.id=table3.id
My question is, will the query optimizer be smart enough to
- execute the WHERE clause first then JOIN later in Statement 1
- know that table 3 isn't actually needed in Statement 2
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.
This would depend on many, many things (table size, indexes, key distribution, etc.); you should just check the execution plan.
You don't say which database, but here are some ways:
- MySQL: EXPLAIN
- SQL Server: SET SHOWPLAN_ALL (Transact-SQL)
- Oracle: EXPLAIN PLAN
- Teradata: What is EXPLAIN in Teradata?
- Teradata: Capture and compare plans faster with Visual Explain and XML plan logging
Depending on the availability of statistics and indexes for the tables in question, the query rewrite mechanism in the optimizer may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing and statistics, you may find that the optimizer is not eliminating records in the query plan when you feel that it should, even if you have a derived table such as the one in your example. You can force the optimizer to process the derived table by simply placing a GROUP BY in it; the optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT table2.id, table2.val
FROM table2
WHERE val<100
GROUP BY 1,2
) table3
ON table1.id=table3.id
This is not to say that your standard approach should be to run with this throughout your code. It is typically one of my last resorts, for when a query plan simply doesn't eliminate extraneous records early enough and too much data ends up being scanned and carried around through the various SPOOL files. It is simply a technique you can put in your toolkit for when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.
Unless I'm missing something, why do you even need Table1?
Just query Table2:
Select id, val
From table2
WHERE val<100
Or are you using the rows in table1 as a filter? I.e., does table1 only contain a subset of the ids in Table2?
If so, then this will work as well ...
Select id, val
From table2
Where val<100
And id In (Select id
From table1)
But to answer your question: yes, the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the stored statistics that the database maintains on each table to determine what to do (what type of join logic to use, for example), as well as what order to perform the operations in, in order to minimize disk I/Os and processing costs.
Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of the inner join, i.e. table2 INNER JOIN table1, then I guess the WHERE clause can be processed before the JOIN operation, during the preparation phase. However, even if you don't change the original query, the optimizer should be able to switch the order itself if it thinks the join would be too expensive when fetching whole rows, and apply the WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query in such a way that the derived table is considered necessary, so it will keep processing the operations that involve table3.

SQL: Optimization problem, has rows?

I got a query with five joins on some rather large tables (the largest table has 10 mil. records), and I want to know if rows exist. So far I've done this to check whether rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write it like this: IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. do your stuff
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options:
- Try COUNT(*) in place of TOP 1 tbl.Id
- An index per column may not be good enough: you may need to use composite indexes
- Are you on SQL Server 2005? If so, you can find missing indexes, or try the Database Tuning Advisor
- Also, it's possible that you don't need 5 joins
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
For either the 5 tables or the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance all the joins will join on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
Beyond that, you will likely need to provide more of the query for anyone to understand how it works.
Maybe you could offload/cache this fact-finding mission. If it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
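A hedged sketch of the precompute-and-cache idea (T-SQL; the cache table name, the fk_id join column and the @search parameter are made up for illustration):
-- Build (or rebuild) the cache on a schedule, not at request time:
IF OBJECT_ID('dbo.PrecomputedMatches') IS NOT NULL
    DROP TABLE dbo.PrecomputedMatches;

SELECT tbl.Id, tbl.xxx
INTO dbo.PrecomputedMatches
FROM table1 tbl
INNER JOIN table2 t2 ON t2.fk_id = tbl.Id
-- ... remaining joins ...
;

CREATE CLUSTERED INDEX IX_PrecomputedMatches_xxx
    ON dbo.PrecomputedMatches (xxx);

-- The runtime check then touches only the small cached table:
DECLARE @search int;
SET @search = 42;   -- placeholder
IF EXISTS (SELECT 1 FROM dbo.PrecomputedMatches WHERE xxx = @search)
    PRINT 'rows exist';
The trade-off is staleness: the existence check is only as fresh as the last rebuild of the cache table.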
Use the table with the maximum number of rows first in every join, and if there is more than one condition in the WHERE clause, the sequence of the conditions is important: put the condition that affects the maximum number of rows first.
Use filters very carefully when optimizing the query.