Is there any way that I could avoid spool errors like this?

Scenarios where Spool error occurs in Teradata
Can a join clause cause a spool error?
I have implemented a query where there is a join similar to
SELECT <case_stmt>
FROM Table_1 A
JOIN Table_2 B
ON A.column = B.column SAMPLE 10;
I also tried with other columns, suspecting that only one AMP was taking the load because only one column was selected:
SELECT <some_columns from table_1 and table_2>, <case_stmt>
FROM Table_1 A
JOIN Table_2 B
ON A.column = B.column SAMPLE 10;
where Table_1 is a big fact table, and Table_2 is a dim table
I encountered an error: No more Spool Space for my account
But when I run a query similar to
SELECT *
FROM Table_1 SAMPLE 10;
it runs fine and I get the results.
Question: What exactly caused the spool error in this case? Can a join clause create a spool error?
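A common first diagnostic in Teradata (a sketch using the placeholder names from the queries above, not part of the original question) is to check the plan and make sure statistics exist on the join columns:
-- EXPLAIN shows whether the optimizer redistributes or duplicates
-- rows for the join; large redistribution steps are where spool goes
EXPLAIN
SELECT <case_stmt>
FROM Table_1 A
JOIN Table_2 B
ON A.column = B.column SAMPLE 10;
-- accurate statistics on the join columns help the optimizer pick a
-- plan that avoids redistributing the big fact table
COLLECT STATISTICS ON Table_1 COLUMN (column);
COLLECT STATISTICS ON Table_2 COLUMN (column);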

Related

Execute SQL query based on results of a condition

My SQL knowledge is somewhat limited. I have 2 SELECT statements that return separate sets of results. Since each SELECT statement queries a different set of tables, I would like to know if it is possible to pick which query to execute based on the result returned from a SELECT statement on a completely different table, for example:
Select C.ManBtchNum from Table C
If OITM.ManBtchNum = 'Y' then
Select * from Table A
else
Select * from Table B
Apologies for using pseudocode, but it's the shortest way I can explain this.
I'm not sure if a CASE expression can be used here. Any advice would be helpful. Thanks
I have tried both a CASE as well as a UNION, but I'm not getting the results needed. Granted, I might be doing something wrong considering my limited knowledge.
As an example,
select t1.Val, t2.Val, t3.Val,
    case when t1.Val = Something then t2.Val else t3.Val end
from tablex t1
inner join tabley t2 on t1.commonfield = t2.commonfield
inner join tablez t3 on t1.commonfield = t3.commonfield
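One common way to express this kind of branching in plain SQL (a sketch, not from the thread; it assumes TableA and TableB have union-compatible column lists and reuses the OITM flag from the pseudocode):
-- both branches are guarded by complementary EXISTS tests,
-- so only one of them ever returns rows
select a.* from TableA a
where exists (select 1 from OITM where ManBtchNum = 'Y')
union all
select b.* from TableB b
where not exists (select 1 from OITM where ManBtchNum = 'Y');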

Calculate row count when facing SPOOL space issue

SEL COUNT(*) FROM DATABASE_A.QF
Count = 37,011,480
SEL COUNT(*) FROM DATABASE_A_INC.QFA
Count = 368,454
Query 1
DELETE A
FROM
DATABASE_A.QF A,
DATABASE_A_INC.QFA B
WHERE
A.Q_NUM = B.Q_NUM
AND
A.ID = B.ID
AND
A.LOCATION_ID=1;
The above DELETE query runs into a SPOOL space issue.
So I rewrote it in another form.
Query 2
DELETE FROM DATABASE_A.QF A WHERE (Q_NUM,ID) IN
(SELECT Q_NUM,ID FROM DATABASE_A_INC.QFA B)
AND LOCATION_ID=1;
368454 rows processed.
DELETE Command Complete
My questions:
Are Query 1 and Query 2 logically the same? Are they deleting the same records?
How do I verify the count from Query 1 without running into a SPOOL
space issue? I have tried a general COUNT function and increasing spool space to a certain extent.
Is there a better way to check the count for Query 1?
The queries are logically the same, yes. My guess is that the reason for your SPOOL space issue is that you are listing your tables with commas instead of joining them explicitly. Try counting Query 1 like this:
SELECT COUNT(*)
FROM DATABASE_A.QF A
INNER JOIN DATABASE_A_INC.QFA B
    ON A.Q_NUM = B.Q_NUM
    AND A.ID = B.ID
WHERE A.LOCATION_ID = 1;
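One caveat to add: if DATABASE_A_INC.QFA contains duplicate (Q_NUM, ID) pairs, the join count above will exceed the number of rows the DELETE touches, since each target row is counted once per match. A count that mirrors Query 2 exactly (a sketch, relying on Teradata's multi-column IN syntax):
-- each QF row is counted at most once, matching the DELETE semantics
SELECT COUNT(*)
FROM DATABASE_A.QF A
WHERE (A.Q_NUM, A.ID) IN (SELECT Q_NUM, ID FROM DATABASE_A_INC.QFA)
AND A.LOCATION_ID = 1;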

Hive query stuck at 99%

I am inserting records using a left join in Hive. When I set LIMIT 1 the query works, but for all records the query gets stuck at 99% of the reduce job.
Below query works
Insert overwrite table tablename select a.id, b.name from a left join b on a.id = b.id limit 1;
But this does not
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased the number of reducers but it still doesn't work.
Here are a few Hive optimizations that might help the query optimizer and reduce overhead of data sent across the wire.
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there's a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds, see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join depending on your hardware constraints. (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins)
If your query is getting stuck at 99%, check out the following options:
Data skewness: if you have skewed data, it is possible that one reducer is doing all the work.
Duplicate keys on both sides: if you have many duplicate join keys on both sides, your output might explode and the query might get stuck.
If one of your tables is small, try a map join or, if possible, an SMB join, which is a huge performance gain over a reduce-side join (see the sketch after this list).
Go to the resource manager logs and see the amount of data the job is accessing and writing.
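As an illustration of the map-join option above (a sketch, not from the original answer; it reuses the table1/table2 names from the question and assumes table2 is the small side):
-- honor explicit MAPJOIN hints (ignored by default in newer Hive)
set hive.ignore.mapjoin.hint=false;
-- load table2 into memory and join on the map side, skipping the
-- reduce phase; table2 must be small enough to fit in memory
select /*+ MAPJOIN(t2) */ t1.id, t2.name
from table1 t1
left join table2 t2 on t1.id = t2.id;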
Hive automatically does some optimizations when it comes to joins and loads one side of the join into memory if it fits the requirements. However, in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times, and the way I have avoided it is by explicitly specifying some settings to Hive. Try the settings below and see if it works for you.
hive.auto.convert.join=false
mapred.compress.map.output=true
hive.exec.parallel=true
Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck on 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id matching variables in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server.
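A quick way to test for this (an added sketch; my_table is a hypothetical stand-in for the table being joined, member_id the key column mentioned above):
-- list the most duplicated join keys; heavy duplication on both
-- sides of a join multiplies out into an exploding result
select member_id, count(*) as cnt
from my_table
group by member_id
having count(*) > 1
order by cnt desc
limit 20;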
Use these configuration settings and try again:
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m;
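As an aside (not from the original answer): these -Xmx heap sizes follow the common rule of thumb of roughly 80% of the matching container sizes, here 7200m against 9000 MB, leaving headroom for non-heap JVM memory.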
I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
I made an analysis based on the answers already given and saw two of the problems mentioned:
The left table was more than 100x bigger than the right table:
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variables was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in other answers, plus some extra parameters I found, without success:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
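For context (an added note, not from the original answer): with hive.optimize.skewjoin enabled, any join key that exceeds hive.skewjoin.key rows is treated as skewed and handled in a follow-up map-join job, which is what the skewjoin.mapjoin.* parameters above tune.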
Lastly, I found out that the left table had a really high percentage of null values in the join columns: bt.ident and bt.cate.
So I tried one last thing, which finally worked for me: splitting the left table depending on whether bt.ident and bt.cate were null, and then doing a union all of both branches:
select * from
(select bt.*, sm.newparam
from (select * from big_table where ident is not null or cate is not null) bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
union all
select nbt.*, null as newparam
from big_table nbt
where ident is null and cate is null) combined

Inner query with same source table as Outer Query

I went through some PL/SQL code and found a query that I don't actually understand. Hoping to get some technical advice here.
The query is shown below:
SELECT a.ROWID
FROM TableA a
WHERE a.object_name IN ('HEADERS','LINES','DELIVERIES')
AND a.change_type IN ('A','C')
AND a.ROWID NOT IN (SELECT MAX (b.ROWID)
FROM TableA b
WHERE b.object_name = a.object_name
AND b.change_type = a.change_type
AND b.pk1 = a.pk1
AND b.object_identifier = a.object_identifier
);
From what I know, the inner query should run first (correct me if I am wrong), and then the inner query's result will be used by the outer query.
For the above query, how does the inner query run when it needs data from the outer query (data from the alias TableA a)?
I hope to get some guidance on this, as I am very new to PL/SQL development.
Thanks!
It is not PL/SQL, just a classic SQL statement.
The purpose seems to be to retrieve all the rows which are not the "last version" (the biggest ROWID for a given pair of pk1 and object_identifier).
The NOT IN part retrieves the max ROWID for each (pk1, object_identifier) pair, and then the outer query retrieves all the rows which do not carry that max ROWID.
In terms of the execution process, you can take a look at the explain plan to see what Oracle is going to do.
The inner query does not run first. Conceptually, you can think of it running like this:
Run the outer query,
For each row in the outer query, run the inner query using that row's specific values for the a.* columns,
If that row's ROWID is not among what the inner query returns, output the outer query row to the result set.
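For intuition, the same "all rows except the latest per group" logic can be expressed without correlation using an analytic function (an added sketch in Oracle syntax, not from the original thread):
-- compute the max ROWID per (object_name, change_type, pk1,
-- object_identifier) group, then keep the rows that are not it
SELECT rid
FROM (SELECT a.ROWID AS rid,
             MAX(a.ROWID) OVER (PARTITION BY a.object_name, a.change_type,
                                             a.pk1, a.object_identifier) AS max_rid
      FROM TableA a
      WHERE a.object_name IN ('HEADERS','LINES','DELIVERIES')
        AND a.change_type IN ('A','C'))
WHERE rid <> max_rid;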

BigQuery - joining on a repeated field

I'm trying to run a join on a repeated field.
Originally I get an error:
Cannot join on repeated field payload.pages.action
I fix this by running FLATTEN on the relevant table (this is only an example query; it would give an empty result even if it ran successfully):
SELECT
t1.repository.forks
FROM publicdata:samples.github_nested t1
left join each flatten(publicdata:samples.github_nested,payload.pages) t2
on t2.payload.pages.action=t1.repository.url
I get a different error:
Table wildcard function 'FLATTEN' can only appear in FROM clauses
This used to work in the past. Is there some syntax change?
I don't think there has been a syntax change, but you should be able to wrap the flatten statement in a subselect. That is,
SELECT
t1.repository.forks
FROM publicdata:samples.github_nested t1
left join each (SELECT * FROM flatten(publicdata:samples.github_nested,payload.pages)) t2
on t2.payload.pages.action=t1.repository.url
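For what it's worth, in BigQuery's standard SQL dialect the FLATTEN idiom is replaced by UNNEST; a rough equivalent (an added sketch, assuming the sample dataset is reachable under its standard-SQL name bigquery-public-data.samples):
-- unnest payload.pages into one row per page, then join as before
SELECT t1.repository.forks
FROM `bigquery-public-data.samples.github_nested` t1
LEFT JOIN (
  SELECT page.action AS action
  FROM `bigquery-public-data.samples.github_nested`, UNNEST(payload.pages) AS page
) t2
ON t2.action = t1.repository.url;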