SAS multiple large tables join - ERROR: Sort execution failure

When running a large query of the form (using the undocumented _method option to output the query method):
PROC SQL _method;
CREATE TABLE output AS
SELECT
    t1.foo
    ,t2.bar
    ,t3.bat
    ,t4.fat
    ,t5.baa
FROM table1 t1
LEFT JOIN table2 t2
    ON t1.key2 = t2.key2
LEFT JOIN table3 t3
    ON t1.key3 = t3.key3
LEFT JOIN table4 t4
    ON t1.key4 = t4.key4
...
LEFT JOIN tablen tn
    ON t1.keyn = tn.keyn
;
QUIT;
Where t1 is ca. 6 GB, t2 is a view on a table that is ca. 500 GB, and t3, t4, ..., tn are each data tables of ca. 1-10 MB (there are typically six or seven of these), I run into the following error:
NOTE: SAS threaded sort was used.
ERROR: Sort execution failure.
NOTE: View WORK.table2.VIEW used (Total process time):
      real time           17:02.55
      user cpu time       2:40.12
      system cpu time     2:19.41
      memory              303785.64k
      OS Memory           322280.00k
      Timestamp           11/03/2014 08:13:25 PM
When I sample a very small percentage of t1 to make it only ca. 30 MB, the query runs okay, but even a 10% sample of table1 causes a similar failure.
How can I profile this query:
- to help me choose a better strategy,
- to enable me to perform the operation on the whole dataset, and
- to limit the need for excessive I/O on the file system (e.g. I could process this batchwise and union the results)?

First, this is a really big set of data, and the problem may be with the view. Second, if the data is in a database, you might want a pass-through query, so the processing is all done on the database side.
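For illustration, a minimal explicit pass-through sketch, assuming the 500 GB source sits in Oracle; the connection options and the inner query are placeholders, not the poster's actual setup:
PROC SQL;
CONNECT TO oracle AS db (path="mydb" user=myuser password=mypass); /* placeholder credentials */
CREATE TABLE output AS
SELECT * FROM CONNECTION TO db (
    /* everything inside here executes on the database, not in SAS */
    SELECT t1.foo, t2.bar
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.key2 = t2.key2
);
DISCONNECT FROM db;
QUIT;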
If the left joins are just looking up values, particularly individual values, you can rephrase the query as:
SELECT t1.foo,
       (SELECT t2.bar FROM table2 t2 WHERE t1.key2 = t2.key2) as bar,
       (SELECT t3.bat FROM table3 t3 WHERE t1.key3 = t3.key3) as bat,
       ...
FROM table1 t1;
This should eliminate any possible sort that would occur on table1.
If the joins are returning multiple rows, this won't work; it will generate errors.
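Alternatively, since t3 ... tn are only 1-10 MB each, they fit comfortably in memory, and a DATA step hash lookup (a different technique from the question's PROC SQL, sketched here under that assumption) avoids the sort altogether. For a single lookup table, using the names from the question:
data output;
    if _n_ = 1 then do;
        if 0 then set table3;                  /* define key3 and bat in the PDV */
        declare hash h3(dataset: 'table3');    /* load the small table into memory once */
        h3.defineKey('key3');
        h3.defineData('bat');
        h3.defineDone();
    end;
    set table1;
    /* left-join semantics: bat is set to missing when no match is found */
    if h3.find() ne 0 then call missing(bat);
run;
Extending this with one hash object per small lookup table is mechanical.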

Related

Improving run time of SQL - currently 61 hours

Complex SELECT statement with approximately 20 left outer joins. Many of the joins are there only to obtain data from a single column in that table (poorly designed database). The current runtime estimated by EXPLAIN is 61 hours (45 GB).
I have limited options due to user permissions. How can I optimise the SQL? Options I have considered:
- identifying and removing unnecessary joins
- writing statements to include the data I need rather than exclude the data I don't need
- trying to get user permission to CREATE TABLE ('hell no')
- trying to get access to a sandpit-like space on a server to create a view ('oh hells no no no').
SELECT t1.column1, t1.column2, t2.column1, t3.column2, t4.column3
-- (etc - approximately 30 items)
, CASE WHEN t1.column2 IS NULL
       THEN t2.column3
       ELSE t1.column2
  END AS Derived_Column_1
FROM TABLE1 t1
LEFT OUTER JOIN TABLE2 t2
    ON t1.column1 = t2.column3
LEFT OUTER JOIN TABLE3 t3
    ON t1.column5 = t3.column6
    AND t1.column6 = t3.column7
LEFT OUTER JOIN TABLE4 t4
    ON t2.Column4 = t4.Column8
    AND t2.Column5 = '16'
-- (etc - approximately 16 other joins, some of which are only required to connect table 1 to 5, because they have no direct common fields)
-- select data that was timestamped in the last 120 days
WHERE CAST(t1.Column3 AS DATE) > CURRENT_DATE - 120
-- de-duplicate the data by four values and use the latest entry
QUALIFY RANK() OVER (PARTITION BY t1.column1, t2.column1, t3.column2, t3.column4 ORDER BY t1.Column3 DESC) = 1
The desired result is a single output that has the 30 fields plus the derived column, for data that was timestamped in the last 120 days.
I would like to remove duplicates based on four fields, but the QUALIFY RANK() OVER (PARTITION BY t1.column1, t2.column1, t3.column2, t3.column4 ORDER BY t1.Column3 DESC) = 1 adds a lot of time to the run.
I think you could CREATE VOLATILE TABLE ... ON COMMIT PRESERVE ROWS to store some intermediate data. It may need some checking, but I think you would not need any special rights to do that (only the spool space quota you already have as a means to run your SELECTs).
The usual optimization technique is as follows: you take control of the execution plan by cutting your large SELECT into pieces which sequentially compute intermediate results (saving those into volatile tables) and redistribute them (by specifying the PRIMARY INDEX of the volatile tables) to take advantage of Teradata's parallelism.
Usually you choose the columns that are used in join conditions as the primary index; you may encounter skew, which you can solve by cutting your intermediate volatile table in two and choosing different primary indexes for the two parts. That makes your code more sophisticated, but much faster.
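A minimal sketch of that pattern; the table and column names are placeholders taken from the query above, and the right PRIMARY INDEX depends on your actual join keys and skew:
CREATE VOLATILE TABLE vt_stage AS (
    SELECT t1.column1, t1.column2, t1.Column3, t1.column5, t1.column6, t2.Column4
    FROM TABLE1 t1
    LEFT OUTER JOIN TABLE2 t2
        ON t1.column1 = t2.column3
    -- filter to the last 120 days as early as possible
    WHERE CAST(t1.Column3 AS DATE) > CURRENT_DATE - 120
) WITH DATA
PRIMARY INDEX (column1)
ON COMMIT PRESERVE ROWS;
-- subsequent pieces of the big SELECT then join against vt_stage instead of TABLE1/TABLE2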
By the way, do not let the "hours" estimate of the Teradata plan fool you; those are not actual hours, minutes or seconds, only synthetic ones. Usually they are pretty far from the actual query run time.

ORACLE join multiple tables performance

I have a kinda complex question.
Let's say that I have 7 tables (20 mil+ rows each) (Table1, Table2, ...) with corresponding PKs (pk1, pk2, ...), and the cardinality among all tables is 1:1.
I want to get my final table (using hash join) as:
Create table final_table as select
t1.column1,
t2.column2,
t3.column3,
t4.column4,
t5.column5,
t6.column6,
t7.column7
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
join table3 t3 on t1.pk1 = t3.pk3
join table4 t4 on t1.pk1 = t4.pk4
join table5 t5 on t1.pk1 = t5.pk5
join table6 t6 on t1.pk1 = t6.pk6
join table7 t7 on t1.pk1 = t7.pk7;
I would like to know: would it be faster to create partial tables and then the final table, like this?
Create table partial_table1 as select
    t1.pk1,        -- carry the key forward so the next join can use it
    t1.column1,
    t2.column2
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2;

create table partial_table2 as select
    t1.pk1, t1.column1, t1.column2,
    t3.column3
from partial_table1 t1
join table3 t3 on t1.pk1 = t3.pk3;

create table partial_table3 as select
    t1.pk1, t1.column1, t1.column2, t1.column3,
    t4.column4
from partial_table2 t1
join table4 t4 on t1.pk1 = t4.pk4;
...
...
...
I know it depends on RAM (because I want to use hash joins), actual server usage, etc. I am not looking for a specific answer; I am looking for some explanation of why, and in what situations, it would be better to use partial results, or why it would be better to do all 7 joins in one select.
Thanks, I hope that my question is easy to understand.
In general, it is not better to create temporary tables. SQL engines have an optimization phase, and this optimization phase should do well at figuring out the best query plan.
In the case of a bunch of joins, this is mostly about join order, use of indexes, and the optimal algorithm.
This is a good default attitude. Does it mean that temporary tables are never useful for performance optimization? Not at all. Here are some exceptions:
The optimizer generates a suboptimal query plan. In this case, query hints can push the optimizer in the right direction. And, temporary tables can help.
Indexing the temporary tables. Sometimes an index on the temporary tables can be a big win for performance. The optimizer might not pick this up.
Re-use of temporary tables across queries.
For your particular goal of using hash joins, you can use a query hint to ensure that the optimizer does what you would like. I should note that if the joins are on primary keys, then a hash join might not be the optimal algorithm.
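For instance, a sketch of such a hint for the query in the question (USE_HASH is a standard Oracle hint; whether forcing it actually beats the optimizer's own choice is exactly what you would need to test):
CREATE TABLE final_table AS
SELECT /*+ USE_HASH(t2 t3 t4 t5 t6 t7) */
       t1.column1, t2.column2, t3.column3, t4.column4,
       t5.column5, t6.column6, t7.column7
FROM table1 t1
JOIN table2 t2 ON t1.pk1 = t2.pk2
JOIN table3 t3 ON t1.pk1 = t3.pk3
JOIN table4 t4 ON t1.pk1 = t4.pk4
JOIN table5 t5 ON t1.pk1 = t5.pk5
JOIN table6 t6 ON t1.pk1 = t6.pk6
JOIN table7 t7 ON t1.pk1 = t7.pk7;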
It is not a good idea to create temporary tables in your database. To optimize your query for reporting purposes or for faster results, try using views instead; that can lead to much better results.
For your specific case: you say you want to use a hash join. Can you please explain a bit more why you want that in particular? The optimizer will determine the best plan by itself, and you don't normally need to worry about the type of join it performs.
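A minimal sketch of the view suggestion, assuming CREATE VIEW rights (which may well be the same permissions hurdle as creating tables):
CREATE OR REPLACE VIEW final_view AS
SELECT t1.column1, t2.column2, t3.column3
FROM table1 t1
JOIN table2 t2 ON t1.pk1 = t2.pk2
JOIN table3 t3 ON t1.pk1 = t3.pk3;
-- (remaining joins as in the original query)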

Including column improves query performance SQL Server 2008

A query's performance is affected by whether or not a column is included, but the weird thing is that the effect is positive (reduced execution time) when the column is included.
The query includes a few joins to a view, some tables and table-valued functions, like the following:
SELECT
    v1.field1, t2.field2
FROM
    view1 v1 WITH (NOLOCK)
INNER JOIN
    table1 t1 WITH (NOLOCK) ON v1.field1 = t1.field1
INNER JOIN
    table2 t2 WITH (NOLOCK) ON t2.field2 = t1.field2
INNER JOIN
    function1(@param) f1 ON f1.field3 = t2.field3
WHERE
    (v1.date1 = @param OR v1.date2 = @param)
The thing is, if I include in the SELECT a varchar(200) NOT NULL column which is part of the view (it is not indexed in the original table or the view, and it's not part of a constraint), the query takes X seconds; but if I don't include it, the time goes up to 4X seconds, which is a lot of difference just for including a column. So the query with the best performance looks like:
SELECT
    v1.field1, t2.field2, v1.fieldWhichAffectsPerformance
FROM
    view1 v1 WITH (NOLOCK)
INNER JOIN
    table1 t1 WITH (NOLOCK) ON v1.field1 = t1.field1
INNER JOIN
    table2 t2 WITH (NOLOCK) ON t2.field2 = t1.field2
INNER JOIN
    function1(@param) f1 ON f1.field3 = t2.field3
WHERE
    (v1.date1 = @param OR v1.date2 = @param)
It's mandatory to remove the column that improves the query performance, but without negatively affecting the current performance. Any ideas?
EDIT: As suggested, I've reviewed the execution plans. The query without the column runs an extra hash match (left outer join) and uses an index scan, which costs a lot of CPU, instead of the index seek that appears in the plan for the query with the column included. How can I remove the column without affecting the performance? Any ideas?
Optimizers are complicated. Without query plans, there is only speculation.
You need to look at the query plans to get a real answer.
One possibility is the order of processing. The select could equivalently be written as:
SELECT t1.field1, t2.field2
because the ON condition specifies that the columns in the two tables are the same. The optimizer may recognize that the OR prevents the use of indexes on the view (which is probably not applicable anyway). So, instead of scanning the view, it decides to scan table1 and then bring in the view.
By including an additional column in the SELECT, you are pushing the optimizer to scan the view, and this might be the better execution plan.
This is all hypothetical, but it gives a mechanism on how your observed timings could happen.
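If the OR really is what blocks the index, one classic rewrite to test (purely hypothetical without seeing the plans) is to split the predicate into two index-friendly branches with UNION ALL:
SELECT v1.field1, t2.field2
FROM view1 v1
INNER JOIN table1 t1 ON v1.field1 = t1.field1
INNER JOIN table2 t2 ON t2.field2 = t1.field2
INNER JOIN function1(@param) f1 ON f1.field3 = t2.field3
WHERE v1.date1 = @param
UNION ALL
SELECT v1.field1, t2.field2
FROM view1 v1
INNER JOIN table1 t1 ON v1.field1 = t1.field1
INNER JOIN table2 t2 ON t2.field2 = t1.field2
INNER JOIN function1(@param) f1 ON f1.field3 = t2.field3
WHERE v1.date2 = @param
  AND (v1.date1 <> @param OR v1.date1 IS NULL)  -- exclude rows the first branch already returned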

Bigquery JOIN optimization

We are running a query every 5 minutes with a JOIN. On one side of the JOIN is table1#time1-time2 (as we only look at the incremental part); on the other side of the JOIN is table2, which keeps changing as we stream data into it. The JOIN is now like:
[table1#time1-time2] AS T1 INNER JOIN EACH table2 AS T2 ON T1.id = T2.id
Since every run of this query involves the whole of T2, is there any possible optimization I can do, such as using a cache or something else, in order to minimize the monetary cost?
EDIT: The query: [screenshot of the query, not reproduced here]
Copy-pasting the text would be better; the query is hard to read in that screenshot.
That said, I see a SELECT * for the second table. Selecting only the needed columns would query only a fraction of the table, instead of all of it.
Also, why are you generating a row_in and joining on a different one?
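A sketch of the column-pruning suggestion, written in the legacy BigQuery dialect and table-decorator notation the question uses (foo and bar are placeholder column names):
SELECT T1.id, T1.foo, T2.bar
FROM [table1#time1-time2] AS T1
INNER JOIN EACH (
    -- subselect only the columns the join and the output need;
    -- BigQuery bills by the bytes in the columns you touch, so this cuts the cost of reading T2
    SELECT id, bar FROM table2
) AS T2
ON T1.id = T2.id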

Nested sql joins process explanation needed

I want to understand how nested join clauses in SQL queries are processed. Can you explain this example with pseudocode? (What is the order of joining the tables?)
FROM table1 AS t1 (nolock)
INNER JOIN table2 AS t2 (nolock)
    INNER JOIN table3 AS t3 (nolock)
    ON t2.id = t3.id
ON t1.mainId = t2.mainId
In SQL we basically have 3 physical ways to join two tables:
Nested Loops (good if one table has a small number of rows),
Hash Join (good if both tables have a very large number of rows; it does an expensive hash build in memory),
Merge Join (good when we have sorted data to join).
From your question it seems that you are asking about the Nested Loops case.
Let us say t1 has 20 rows and t2 has 500 rows.
A nested loops join will then work like:
For each row in t1
    find the rows in t2 where t1.mainId = t2.mainId
The output of that will then be joined to t3.
Note that with the nested ON syntax in your example, t2 is logically joined to t3 first (ON t2.id = t3.id) and that result is then joined to t1 (ON t1.mainId = t2.mainId). The physical order of joining depends on the optimizer, expected row counts, etc.
Try an EXPLAIN query.
It tells you exactly what's going on. :)
Of course, that doesn't work in SQL Server. For that you can try Razor SQLServer Explain Plan, or even SET SHOWPLAN_ALL.
If you're using SQL Server Query Analyzer, look for "Show Execution Plan" under the "Query" menu, and enable it.
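A minimal sketch of the SET SHOWPLAN_ALL route, using the nested-join query from the question:
SET SHOWPLAN_ALL ON;
GO
-- the statement below is not executed; SQL Server returns its estimated plan rows instead
SELECT t1.mainId
FROM table1 AS t1
INNER JOIN table2 AS t2
    INNER JOIN table3 AS t3
    ON t2.id = t3.id
ON t1.mainId = t2.mainId;
GO
SET SHOWPLAN_ALL OFF;
GO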