Hive query stuck at 99%

I am inserting records using a left join in Hive. When I set LIMIT 1 the query works, but for all records the query gets stuck at 99% in the reduce job.
This query works:
Insert overwrite table tablename select a.id , b.name from a left join b on a.id = b.id limit 1;
But this one does not:
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased the number of reducers, but it still doesn't work.

Here are a few Hive optimizations that might help the query optimizer and reduce the overhead of data sent across the wire.
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there is a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds, see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join, depending on your hardware constraints (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins), as sketched below.
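For example, a map join can be enabled like this (a minimal sketch reusing the table1/table2 names from the question; the threshold value is an assumption you should size to your small table):
set hive.auto.convert.join=true;                 -- let Hive convert the join to a map join
set hive.mapjoin.smalltable.filesize=50000000;   -- small-table threshold, in bytes
-- or request it explicitly with a hint, table2 being the small (right) side;
-- note that newer Hive versions ignore this hint unless hive.ignore.mapjoin.hint=false
INSERT OVERWRITE TABLE tablename
SELECT /*+ MAPJOIN(table2) */ table1.id, table2.name
FROM table1 LEFT JOIN table2 ON table1.id = table2.id;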

If your query is getting stuck at 99%, check the following options (a quick diagnostic for the first two is sketched after this list):
Data skewness: if your data is skewed, it is possible that one reducer is doing all the work.
Duplicate keys on both sides: if you have many duplicate join keys on both sides, your output can explode and the query may get stuck.
If one of your tables is small, try a map join, or if possible an SMB (sort-merge-bucket) join, which is a huge performance gain over a reduce-side join.
Go to the Resource Manager logs and see the amount of data the job is reading and writing.
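A quick way to check the first two points is to count rows per key and look at the largest groups (a sketch; substitute your real table and join-key names):
SELECT id, COUNT(*) AS cnt
FROM table2
GROUP BY id
ORDER BY cnt DESC
LIMIT 20;   -- a few huge counts = skew; many counts > 1 = duplicate join keys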

Hive automatically does some optimizations when it comes to joins and loads one side of the join into memory if it fits the requirements. However, in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times, and the way I have avoided it is by explicitly passing some settings to Hive. Try the settings below and see if they work for you.
set hive.auto.convert.join=false;
set mapred.compress.map.output=true;
set hive.exec.parallel=true;

Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck on 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id join keys in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server.
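If the duplicates carry no extra information, one workaround is to deduplicate on the join key before joining (a sketch; my_table and member_id stand in for the real names):
SELECT * FROM (
    SELECT t.*, row_number() OVER (PARTITION BY member_id ORDER BY member_id) AS rn
    FROM my_table t
) d
WHERE d.rn = 1;   -- keeps exactly one row per member_id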

Use these configurations and try again:
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m;

I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
I made an analysis based on the answers already given and saw two of the described problems:
The left table was more than 100x bigger than the right table:
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join columns was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in the other answers, plus some extra skew-join parameters I found, without success:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly, I found out that the left table had a really high percentage of null values in the join columns bt.ident and bt.cate.
So I tried one last thing, which finally worked for me: splitting the left table depending on whether bt.ident and bt.cate are null, and then making a union all of both branches:
select * from
(select bt.*, sm.newparam from
(select * from big_table where ident is not null or cate is not null) bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
union all
select nbt.*, null as newparam from big_table nbt where ident is null and cate is null) combined

Related

Determine datatypes of columns - SQL selection

Is it possible to determine the data type of each column after a SQL selection, based on the received results? I know it is possible through information_schema.columns, but the data I receive comes from multiple tables and is joined together, and the columns are renamed. Besides that, I'm not able to see or use the query, or execute other queries myself.
My job is to store this received data in another table, but without knowing beforehand what I will receive. I'm obviously able to check, for example, whether a certain column contains numbers or text, but not whether it was originally stored as a TINYINT(1) or a BIGINT(128). How should I approach this? To clarify, it is all right if the data types of the source and destination columns aren't exactly the same, but I don't want to reserve too much space beforehand (or too little, for that matter).
As I'm typing, I realize I'm formulating the question wrong. What would be the best approach to handle the described situation? I thought about altering tables on the run (e.g. increasing size if needed), but that seems a bit, well, wrong and not the proper way.
Thanks
Can you issue the following query about your new table after you create it?
SELECT *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'JoinedQueryResults'
Is the query too big to run before knowing how big the results will be? Get an idea of how many rows it may return; the trick with queries with joins is to group on the columns you are joining on, to help your estimate return more quickly. Here's an example of just returning a row count from the query above, which would have created the JoinedQueryResults table.
SELECT SUM(A.NumRows * B.NumRows)
FROM (SELECT ID, COUNT(*) AS NumRows
FROM TableA
GROUP BY ID) AS A
INNER JOIN (SELECT ID, COUNT(*) AS NumRows
FROM TableB
GROUP BY ID) AS B ON A.ID = B.ID
The query above will run faster if all you need is a record count to help you estimate a size.
Also try instantiating a table for your results with a query like this.
SELECT TOP 0 *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID

SAS Enterprise: left join and right join difference?

I joined a new company that uses SAS Enterprise Guide.
I have 2 tables: table A has 100 rows, and table B has over 30M rows (50-60 columns).
I tried to do a right join from A (100) to B (30M); it took over 2 hours and no results came back. I want to ask: will it help if I do a left join instead? I used the GUI and created the following query.
30M Record <- 100 Record ?
or
100 Record -> 30M Record ?
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CASE_NUMBER AS
SELECT t2.EMPGRPCOM,
t2.SEQINVNUM,
t2.SBSID,
t2.SBSLASTNAME,
t2.SBSFIRSTNAME,
t2.PMTDUEDATE,
t2.PREMAMT,
t2.ITEMDESC,
t2.EFFDATE,
t2.PAYAMT,
t2.MCAIDRATECD,
t2.REBILLIND,
t2.BILLTYPE
FROM WORK.'CASE NUMBER'n t1
LEFT JOIN DW.BILLING t2 ON (t1.CaseNumber = t2.SBSID)
WHERE t2.LOB = 'MD' AND t2.PMTDUEDATE BETWEEN '1Jan2015:0:0:0'dt AND '31Dec2017:0:0:0'dt AND t2.SITEID = '0001';
QUIT;
Left join and right join, all other things aside, are equivalent if you implement them the same way. I.e.,
select a.*
from a
left join
b
on a.id=b.id
;
vs
select a.*
from b
right join
a
on b.id=a.id
;
Same exact query, no difference, same time used. SQL is a declarative language: the SQL processor looks at what you send it and figures out the best way to execute it, so it sees both queries and knows in both cases to do the same thing.
You can read about this in all sorts of articles, this one is a good starting point, or if that link ages just search for "right join vs left join".
Now, what you might want to consider is writing this in a different way, namely not using SQL; this is the kind of query SQL should be good at, but sometimes isn't, for some reason. I would write it as a hash table search: load the smaller case_number dataset into memory, then iterate over the larger table in a data step and check whether each row is found in the smaller dataset - if so, then great, return it.
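A minimal sketch of that hash approach, assuming work.case_number stands in for the WORK.'CASE NUMBER'n dataset and that CaseNumber and SBSID have comparable types:
data want;
    if _n_ = 1 then do;
        if 0 then set work.case_number;              /* define CaseNumber in the PDV */
        declare hash h(dataset: "work.case_number"); /* load the small table into memory */
        h.defineKey("CaseNumber");
        h.defineDone();
    end;
    set dw.billing;
    if h.find(key: SBSID) = 0;                       /* keep only rows whose SBSID matches a case number */
run;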
I'd also think about whether a left/right join is what you want, vs. an inner join. Seems to me that if you're returning solely t2 values, a right/left join isn't correct (when t1 is the "primary"): you'll just get empty rows for the non-matches. In fact, your WHERE clause filters on t2 columns, which already discards those non-matching rows and makes the join behave like an inner join. Either return a t1 variable, or use an inner join, as in the sketch below.
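For instance, the same query written as an inner join (a sketch based on the question's query, with the column list abbreviated to t2.*):
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CASE_NUMBER AS
SELECT t2.*
FROM WORK.'CASE NUMBER'n t1
INNER JOIN DW.BILLING t2 ON (t1.CaseNumber = t2.SBSID)
WHERE t2.LOB = 'MD'
  AND t2.PMTDUEDATE BETWEEN '1Jan2015:0:0:0'dt AND '31Dec2017:0:0:0'dt
  AND t2.SITEID = '0001';
QUIT;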

Hive: Can't select one random match on right table in left outer join

EDIT - I don't care about the skewness or things being slow. I found out that the slowness was caused more by a many-to-many join on the many matches in my left outer join... Please skip down to the bottom.
I have an issue with a skewed table, that is, some join keys appear in many more rows than others. My problem is that I have more than one key with many appearances in the rows.
Stats on this table and table I am joining with:
Larger table: totalSize=47431500000, numRows=509500000, rawDataSize=47022050000, 21052 distinct keys
Smaller table: totalSize=1154984612, numRows=13780692, rawDataSize=1141203920, 39313 distinct keys
The smaller table also has repeated rows of keys. The other challenge is that I need to randomly select a matching key from the smaller table.
What I have tried so far:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=1155000000; -- this setting takes bytes; a value like "1155mb" will not parse
and
CREATE TABLE joined_table AS SELECT * FROM (
select * from larger_table as a
LEFT OUTER JOIN smaller_table as b
on a.key = b.key
ORDER BY RAND()
)b;
It has been running for a day now.
I thought about handling this manually, but I have more than one key with a ton of rows, so I would have to make a bunch of tables and merge them, which I can do if that is my only option.
But I wanted to reach out to you all on SO first.
Thanks for the help in advance
EDIT June 20th
I found these settings to try:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 200000;
I had already created a few separate tables to split out and join the highest-appearing keys, such that the highest-appearing key in the remainder was now 200k. Running the query to join the rest took 25 minutes and finished all tasks successfully, according to the job tracker on the web interface. But on the command line in the Hive shell it just sits there, and when I go to check, the table does not exist.
EDIT #2 - After a lot of reading and trying out a lot of Hive SQL, the one solution that should have worked in theory did not work; specifically, the ORDER BY rand() never even happened...
CREATE TABLE joined_table AS SELECT * FROM (
select * from larger_table as a
JOIN
(SELECT *, row_number() over (partition by key order by rand()) as row_num
from smaller_table) as b
on a.key = b.key
and b.row_num=1
)b;
In the results it is being matched with the first row, not a random row at all.
Any other options, or did I do anything incorrectly here?
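One more thing that might be worth trying (a sketch, untested here): materialize the rand() value in an inner query first, so the window function ranks on a stored column instead of re-evaluating rand(); other_col is a placeholder for the smaller-table columns you actually need, since a bare select * would give Hive CTAS duplicate column names:
CREATE TABLE joined_table AS
SELECT * FROM (
    SELECT a.*, b.other_col
    FROM larger_table a
    JOIN (
        SELECT s.*, row_number() OVER (PARTITION BY key ORDER BY rnd) AS row_num
        FROM (SELECT t.*, rand() AS rnd FROM smaller_table t) s
    ) b ON a.key = b.key
    WHERE b.row_num = 1
) c;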

Why and when to use CROSS JOIN instead of INNER JOIN with UPDATE statements?

I have been coding in T-SQL for three months or so, and I've just seen for the first time the use of a CROSS JOIN in an UPDATE statement in some code, and I'm not able to figure out the use cases of such a construct.
Does anyone know?
Edit: here is sample code that I can't understand well yet.
UPDATE a
SET a.COL1 = b.COL1
FROM Table1 AS a
CROSS JOIN Table2 AS b
And there are other updates in the code that provide a WHERE clause like:
UPDATE a
SET a.COL1 = b.COL1
FROM Table1 AS a
CROSS JOIN Table2 AS b
WHERE condition_on_columns_from_a_and_from_b
And the point is that, for each row of Table1, a select on the cross join with the filtering returns more than one row.
I'm a bit confused about this behavior.
PS: the table Table1 takes more than 5 gigabytes of space.
A cross join generates the cartesian product of two tables. This means it combines EVERY row of table A with EVERY row of table B. When Table A has n rows and table B has m rows, the result set has n*m rows.
There is no good reason that I can imagine to do this. The query is either written incorrectly, or is just a test to slow down your system or to invalidate the target table's data (or perhaps just to see what it does).
It will probably set COL1 of every row in Table1 to the same single arbitrary value from Table2's COL1 (probably either the first or last such value). But it will do so very inefficiently (unless the optimizer in later versions of SQL Server has optimized out this useless case; I haven't tested it in years myself).
To understand the use case, you would need to look at the data. I could easily see using the first update if I were positive Table2 would always and only contain one record, especially if that one record has no field to join to Table1 on. In this case you are updating all the rows in Table1 with the value of that field in Table2. Normally this type of thing, where all records are updated, would only be for resetting values.
To see what would be updated, do this:
UPDATE a
SET a.COL1 = b.COL1
--select a.COL1,b.COL1, *
FROM Table1 AS a
CROSS JOIN Table2 AS b
WHERE condition_on_columns_from_a_and_from_b
Now you can run just the select part to see what value a.COL1 would be replaced with, and see the other fields in the tables to check whether the join and where clause appear to be correct. This will help you understand what the cross join is doing.
You could then temporarily replace the cross join with a left join and an inner join to understand how its behavior differs from the other types of joins. Play around with the select for a while until you really understand what is happening.
I never write an update without having the select in comments, so I can ensure I am updating what I think I should be before I move the code to prod. This is especially true if you write complex updates like I do, which can involve ten or fifteen joins and several where conditions.
Okay, with this query:
UPDATE a
SET COL1 = b.COL1
FROM Table1 AS a
CROSS JOIN Table2 AS b
WHERE condition_on_columns_from_a_and_from_b
If we take the set formed by a CROSS JOIN b (before considering the WHERE clause), then we have a Cartesian product, where every row from a is paired with every row from b.
If we now consider the WHERE clause - unless this WHERE clause is sufficient to guarantee that each row from a is represented only once, we will have an indeterminate result. That is, if there are two rows in the set which are both derived from the same row of a (but different rows of b), there is no way to know, for sure, which of those two rows will be used to compute the SET a.COL1 = b.COL1 assignment.
I don't think it's even guaranteed, if we had the following:
UPDATE a
SET COL1 = b.COL1, COL2 = b.COL2
FROM --As before
that the same row from b will be used for both assignments.
All of the above is true for any UPDATE statement using the T-SQL FROM clause extension - unless you're careful to constrain your join conditions, then multiple assignments for the same row may be possible. But a CROSS JOIN just seems to make it far more likely to occur. And SQL Server issues no diagnostic messages if this occurs.
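If the intent really is "pick one related Table2 row per Table1 row", one deterministic alternative in T-SQL is to make that choice explicit with CROSS APPLY (a sketch; SomeKey and SomeSortColumn are hypothetical names):
UPDATE a
SET COL1 = b.COL1
FROM Table1 AS a
CROSS APPLY (SELECT TOP 1 t2.COL1
             FROM Table2 AS t2
             WHERE t2.SomeKey = a.SomeKey      -- hypothetical correlation key
             ORDER BY t2.SomeSortColumn        -- makes "which row" deterministic
            ) AS b;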

SQL: Optimization problem, has rows?

I have a query with five joins on some rather large tables (the largest table has 10 million records), and I want to know if rows exist. So far I've done this to check whether rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to instant. Is this even possible? What can I do to speed it up?
I have indexes on the fields that I'm joining on and on the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write it like this: IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. and do your stuff.
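A fleshed-out version of that idea (a sketch; the table, column, and parameter names are stand-ins for the ones in the question, and the remaining joins would be repeated the same way):
IF EXISTS (SELECT 1
           FROM table1 tbl
           INNER JOIN table2 t2 ON t2.tbl_Id = tbl.Id   -- repeat for the other joins
           WHERE tbl.xxx = @value)
    SELECT 1 AS HasRows;
ELSE
    SELECT 0 AS HasRows;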
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution plan of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options:
Try COUNT(*) in place of TOP 1 tbl.Id.
An index per column may not be good enough: you may need composite indexes (a sketch follows this list).
Are you on SQL Server 2005? If so, you can find missing indexes, or try the Database Engine Tuning Advisor.
Also, it's possible that you don't need all 5 joins (see the parent-child discussion below).
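For the composite-index point, the idea is one index that serves both the WHERE filter and the join probe (a sketch with hypothetical column names):
CREATE NONCLUSTERED INDEX IX_table1_xxx_joincol
ON table1 (xxx, join_col);   -- xxx from the WHERE clause, join_col from the ON clause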
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS, either for all 5 tables or for the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early in your first select will help if you can do it; if you filter the data in the first instance, all the joins will operate on reduced data.
SELECT TOP 1 tbl1.Id
FROM
(
    SELECT TOP 1 * FROM table1 tbl1
    WHERE Key = @Key   -- placeholder for your actual filter condition
) tbl1
INNER JOIN ...
inner join ...
Beyond that, you would likely need to provide more of the query for us to understand how it works.
Maybe you could offload/cache this fact-finding mission. If it doesn't need to be done dynamically or at runtime, just cache the result in a much smaller table and then query that. Also, make sure all the tables you're querying have appropriate clustered indexes. Granted, you may be using these tables for other types of queries, but for the absolute fastest results you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Use the table with the most rows first in every join, and if you have more than one condition in the WHERE clause, the sequence of the conditions is important: put first the condition that filters out the most rows.
Use filters very carefully when optimizing a query.