Hive Index Performance - hive

I have two tables, Table A and Table B, which are 100GB and 35GB in size respectively. Both tables have a compact index on the same column, prodID.
I am facing an issue where I get the same response time with or without the index for the query below. It takes 30 minutes to process.
select a.* from TableA a inner join TableB b on a.prodID=b.prodID;
I have a 19-node cluster setup. Can you please advise whether I am missing any configuration here or doing something wrong?
Regards,
Prabu

I think you should try putting the large table, i.e. Table A, last, or streaming Table A, to improve performance. You can try the following query to stream the table:
select /*+STREAMTABLE(a)*/ a.* from TableA a inner join TableB b on a.prodID=b.prodID;
Please refer to Tips using joins in Hive for more information.
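A minimal sketch of the first suggestion, reordering so that the 100GB table sits last and is streamed rather than buffered (Hive buffers all but the last table of a join in memory and streams the last one):
select a.* from TableB b inner join TableA a on b.prodID = a.prodID;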

Related

INNER JOIN query performance very slow

I have a SQL query that looks like this:
SELECT *
FROM tableA ta
INNER JOIN tableB tb ON tb.someColumn = ta.someOtherColumn
Neither someColumn nor someOtherColumn is the primary key of its table. Both are of datatype int.
TableA has ~500,000 records, tableB ~250,000. The query takes about 2 minutes to finish, which is much too long in my opinion.
The query execution plan (screenshot) shows a Hash Match join.
I already tried to (a) use OPTION (RECOMPILE) and (b) create an INDEX on the respective tables. To no avail.
My question is: How can the performance of this query be improved?
Create an index on tb.someColumn, and create another index on ta.someOtherColumn.
Then when you run this query, the Hash Match should be replaced with a Nested Loops join, and it will be much faster.
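A minimal sketch of those indexes (the index names are illustrative):
CREATE INDEX IX_tableB_someColumn ON tableB (someColumn);
CREATE INDEX IX_tableA_someOtherColumn ON tableA (someOtherColumn);
Since the query selects *, the optimizer may still prefer a scan if the key lookups would be too expensive, so check the new execution plan after creating them.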

Determine datatypes of columns - SQL selection

Is it possible to determine the datatype of each column after a SQL selection, based on the received results? I know it is possible through INFORMATION_SCHEMA.COLUMNS, but the data I receive comes from multiple tables, is joined together, and the columns are renamed. Besides that, I'm not able to see or use the source query or execute other queries myself.
My job is to store this received data in another table, but without knowing beforehand what I will receive. I'm obviously able to check, for example, whether a certain column contains numbers or text, but not whether it was originally stored as a TINYINT(1) or a BIGINT(128). How should I approach this? To clarify, it is alright if the datatypes of the source and destination columns aren't exactly the same, but I don't want to reserve too much space beforehand (or too little, for that matter).
As I'm typing, I realize I'm formulating the question wrong. What would be the best approach to handle the described situation? I thought about altering tables on the run (e.g. increasing size if needed), but that seems a bit, well, wrong and not the proper way.
Thanks
Can you create the new table with the query below, and then ask INFORMATION_SCHEMA about it?
-- create the table from the join, then inspect its columns
SELECT *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID;

SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'JoinedQueryResults';
Is the query too big to run before knowing how big the results will be? To get an idea of how many rows it may return, the trick with join queries is to group on the columns you are joining on, so the estimate returns more quickly. Here's an example that just returns a row count for the query that would have created the JoinedQueryResults table above.
SELECT SUM(A.NumRows * B.NumRows)
FROM (SELECT ID, COUNT(*) AS NumRows
      FROM TableA
      GROUP BY ID) AS A
INNER JOIN (SELECT ID, COUNT(*) AS NumRows
            FROM TableB
            GROUP BY ID) AS B ON A.ID = B.ID
The query above will run faster if all you need is a record count to help you estimate a size.
Also try instantiating a table for your results with a query like this.
SELECT TOP 0 *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
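Alternatively, if you are on SQL Server 2012 or later and can wrap the source query in a string, sp_describe_first_result_set reports the result-set metadata without creating a table at all; a minimal sketch, reusing the join above:
EXEC sp_describe_first_result_set
    @tsql = N'SELECT * FROM TableA AS A INNER JOIN TableB AS B ON A.ID = B.ID';
It returns one row per output column, including the inferred system type and maximum length, which you can use to size the destination table.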

Very bad performance using a 3-table join on SQL Server

I have a serious performance issue when I execute a SQL statement which involves 3 tables, as follows:
TableA <---- TableB ----> TableC
In particular, these tables are in a data warehouse, and the table in the middle is a dimension table while the others are fact tables. TableA has about 9 million records, TableC about 3 million, and the dimension table (TableB) only 74 records.
The syntax of the query is very simple; TableA is called _PG, TableB is _MDT, and TableC is called _FM:
SELECT _MDT.codiceMandato AS Customer,
       SUM(_FM.Totale) AS Revenue,
       SUM(_PG.ErogatoTotale) AS Paid
FROM _PG
INNER JOIN _MDT ON _PG.idMandato = _MDT.idMandato
INNER JOIN _FM ON _FM.idMandato = _MDT.idMandato
GROUP BY _MDT.codiceMandato
Actually, I have never seen this query finish :-(
_PG has a nonclustered index on idMandato, and the same goes for _FM.
_MDT has a clustered index on idMandato.
The execution plan is the following (plan screenshot: a Merge Join feeding a Stream Aggregate):
As you can see, the bottleneck is due to the Stream Aggregate (33% of cost) and the Merge Join (66% of cost). In particular, the Stream Aggregate shows about 400 billion estimated rows!!
I don't know the reason, and I don't know how to proceed to solve this issue.
I use SQL Server 2016 SP1 installed on a virtual server running Windows Server 2012 Standard with 4 CPU cores and 32 GB of RAM, plus 1.5 TB on a dedicated volume made up of SAS disks with SSD cache.
I hope somebody can help me understand.
Thanks in advance
The most likely cause is that you are getting a Cartesian product along two dimensions: for each idMandato, every _PG row is matched with every _FM row, which multiplies the rows unnecessarily. The solution is to aggregate before doing the join.
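As a rough sanity check (assuming rows are spread evenly across the 74 mandati): each mandato matches about 9M / 74 ≈ 122K _PG rows and 3M / 74 ≈ 40K _FM rows, so the join produces roughly 122,000 * 40,000 ≈ 4.9 billion rows per mandato, or about 360 billion rows in total, which lines up with the ~400 billion estimated rows feeding the Stream Aggregate.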
You haven't provided sample data, but this is the idea:
SELECT m.codiceMandato AS Customer, f.Revenue, p.Paid
FROM _MDT m
INNER JOIN (SELECT p.idMandato, SUM(p.ErogatoTotale) AS Paid
            FROM _PG p
            GROUP BY p.idMandato) p
    ON p.idMandato = m.idMandato
INNER JOIN (SELECT f.idMandato, SUM(f.Totale) AS Revenue
            FROM _FM f
            GROUP BY f.idMandato) f
    ON f.idMandato = m.idMandato;
I'm not 100% sure this will fix the problem, because your data structure is not clear.
You can try doing a subquery between TableA and TableC without aggregation, then joining this subquery with TableB and applying the GROUP BY:
SELECT _MDT.codiceMandato, SUM(A.Totale) AS Revenue, SUM(A.ErogatoTotale) AS Paid
FROM (SELECT _PG.idMandato, _FM.Totale, _PG.ErogatoTotale
      FROM _PG
      INNER JOIN _FM ON _FM.idMandato = _PG.idMandato) A
INNER JOIN _MDT ON A.idMandato = _MDT.idMandato
GROUP BY _MDT.codiceMandato

Hive query stuck at 99%

I am inserting records using a left join in Hive. When I add LIMIT 1, the query works, but for all records the query gets stuck at 99% of the reduce job.
The query below works:
Insert overwrite table tablename select a.id , b.name from a left join b on a.id = b.id limit 1;
But this one does not:
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased the number of reducers, but it still doesn't work.
Here are a few Hive optimizations that might help the query optimizer and reduce the overhead of data sent across the wire:
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there's a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds, see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join, depending on your hardware constraints (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins), as sketched below.
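A minimal sketch of the map-side join settings, reusing the insert from the question (the size threshold shown is Hive's default and is illustrative; note that in a left outer join only the right-hand table can be held in memory):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
insert overwrite table tablename select table1.id, table2.name from table1 left join table2 on table1.id = table2.id;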
If your query is getting stuck at 99%, check out the following options:
Data skewness: if your data is skewed, it is possible that one reducer is doing all the work.
Duplicate keys on both sides: if you have many duplicate join keys on both sides, your output might explode and the query might get stuck (see the sketch after this list).
If one of your tables is small, try a map join, or if possible an SMB (sort-merge-bucket) join, which is a huge performance gain over a reduce-side join.
Go to the Resource Manager logs and see the amount of data the job is reading and writing.
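To check the duplicate-key point above, a minimal sketch using the table and column names from the question:
select id, count(*) as cnt from table1 group by id having count(*) > 1;
select id, count(*) as cnt from table2 group by id having count(*) > 1;
If a key occurs m times on one side and n times on the other, the join emits m * n rows for that key, which is how the output explodes.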
Hive automatically applies some optimizations to joins, loading one side of the join into memory if it fits the requirements. However, in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times, and the way I have avoided it is by explicitly passing some settings to Hive. Try the settings below and see if they work for you.
set hive.auto.convert.join=false;
set mapred.compress.map.output=true;
set hive.exec.parallel=true;
Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck on 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id values in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server.
Use these configuration settings and try again (note that the -Xmx heap sizes are roughly 80% of the container sizes, leaving headroom for non-heap memory):
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m;
I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam
from big_table bt
left outer join small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
I made an analysis based on the answers already given, and I saw two of the described problems:
The left table was more than 100x bigger than the right table:
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variables was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in the other answers, plus some extra parameters I found here, without success:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly, I found out that the left table had a really high percentage of null values in the join columns bt.ident and bt.cate.
So I tried one last thing, which finally worked for me: splitting the left table depending on bt.ident and bt.cate being null or not, and then making a union all of both branches:
select * from
(select bt.*, sm.newparam
 from (select * from big_table where ident is not null or cate is not null) bt
 left outer join small_table sm
 on bt.ident = sm.ident
 and bt.cate = sm.cate
 union all
 select nbt.*, null as newparam
 from big_table nbt
 where ident is null and cate is null) combined
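The likely reason this works: rows whose join keys are null can never match the right table anyway, and in a reduce-side join Hive hashes all null keys to the same reducer, so splitting them out removes that one overloaded reducer from the shuffle and shrinks the joined side.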

SQL Server, fetching data from multiple joined tables. Why is it slow?

I have a performance problem when retrieving data from SQL Server.
My SQL query looks something like this:
SELECT
table_1.id,
table_1.value,
table_2.id,
table_2.value,...,
table_20.id,
table_20.value
FROM table_1
INNER JOIN table_2
ON table_1.id = table_2.table_1_id
INNER JOIN table_3
ON table_2.id = table_3.table_2_id...
WHERE table_1.row_number BETWEEN 1 AND 20
So I am fetching 20 rows.
This query takes about 5 seconds to execute.
When I select only table_1.id, it returns the results instantly.
Because of that, I guess the problem is not in the JOINs themselves but in retrieving the data from multiple tables.
Any suggestions on how I could speed up this query?
Assuming your tables are designed properly (have a useful primary key etc.), the first thing I would check is this:
are there indices on each of the foreign key columns in the child tables?
SQL Server does not automatically create indices on foreign key columns, yet those are very helpful for speeding up your JOINs. For example:
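A minimal sketch, assuming the naming pattern from the question (each child table carries a table_N_id column pointing at its parent; the index names are illustrative):
CREATE INDEX IX_table_2_table_1_id ON table_2 (table_1_id);
CREATE INDEX IX_table_3_table_2_id ON table_3 (table_2_id);
-- ...and so on for each child table in the join chain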
Other than that: just look at the query plans! They should tell you everything about this query: which indices are being used (or not), and what operations are being executed to get the results.
Without knowing a lot more about your tables, their structure and the data they contain (how much? what kind of values? etc.), there's not much more we can do to help here.
BETWEEN can really slow a query; what do you want to achieve with it?
Also:
Do you have an index on the columns you are joining on?
You could use WITH (NOLOCK) on the tables (beware that it allows dirty reads).
Check the execution plan to see what's taking so long.
How about this one:
SELECT
    table_1.id,
    table_1.value,
    table_2.id,
    table_2.value,...,
    table_20.id,
    table_20.value
FROM
    table_1
    INNER JOIN table_2 ON table_1.id = table_2.table_1_id AND table_1.row_number BETWEEN 1 AND 20
    INNER JOIN table_3 ON table_2.id = table_3.table_2_id
The idea is that you restrict the range of rows before joining to the next table.
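The same idea can also be written with a derived table that pre-filters table_1 before any join; a minimal sketch reusing the assumed column names from above:
SELECT t1.id, t1.value, table_2.id, table_2.value
FROM (SELECT * FROM table_1 WHERE row_number BETWEEN 1 AND 20) AS t1
INNER JOIN table_2 ON t1.id = table_2.table_1_id
-- ...remaining joins unchanged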