HIVE comparing strings in join - hive

The join keys in my Hive query are strings, and the join looks like it is taking forever. Can I create an index or something similar instead?
select
*
from e
left join tabele a
on e.string1=a.string2
How can I make this faster?

For performance, try storing the Hive tables in ORC format.
Also try using a map join in the query.
http://grisha.org/blog/2013/04/19/mapjoin-a-simple-way-to-speed-up-your-hive-queries/
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
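A minimal HiveQL sketch of both suggestions, assuming the right-hand table (tabele in the question) is small enough to be held in memory. The name tabele_orc is hypothetical, and whether the join is actually converted depends on the table size and your Hive configuration.

```sql
-- Store a copy of the joined table as ORC (tabele_orc is a hypothetical name).
CREATE TABLE tabele_orc STORED AS ORC AS
SELECT * FROM tabele;

-- Allow Hive to convert the join into a map join when one side is small enough.
SET hive.auto.convert.join=true;

SELECT *
FROM e
LEFT JOIN tabele_orc a
  ON e.string1 = a.string2;
```

Running EXPLAIN on the final query should show whether a map join operator was chosen.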

Related

SQL Replace in join condition affect performance badly

I have two tables in Teradata that I need to join.
In one of the tables I need to remove a string from the joined column.
When I used:
JOIN ON OReplace(A.id, '/','')= B.id
The query ran for over ten minutes.
I tried to change different parts of the query. One of the attempts was to move the Replace into Select, like:
SELECT OReplace(A.id, '/','') as id_new
...
JOIN ON id_new= B.id
And this ran in a few seconds!
From my point of view, these are the same script, and even the Explain output is exactly the same... Can someone explain to me why there is such a huge difference in performance? Thank you.
Regards, Robert

Optimize select Query performance that is having join with various other tables in MS SQL

I have a select query that retrieves a huge amount of data through joins with other tables, and all of the tables are being used by other processes (some of them write data to these tables and others read from them). These simultaneous operations put locks on the tables.
Is there any way to improve the response time of the select query even when there is a write/shared lock on a table? Would "With (NOLOCK)" on the tables help?
Thanks
Manoj
You can try these options too:
- Remove unnecessary left joins
- Move where clause filters into the inner join condition where possible
- Create indexes on the columns you filter and join on
- Select only the columns you need; avoid using * for all columns
- Avoid declaring large lengths for your columns
With (NOLOCK) will improve performance, but it can give you dirty reads, that is, rows that have not been committed yet.
Ideally it should not be used on transactional tables, but if you are fine with dirty reads, you can use it.
Other optimizations: maintain proper indexes on the table columns used in joins, join the tables from smallest to largest, and avoid function calls in the select clause and the where clause.
Hope this helps!
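As a rough illustration of the NOLOCK hint discussed above, here is a sketch in T-SQL. The tables dbo.Orders and dbo.Customers and their columns are hypothetical and not taken from the question.

```sql
-- dbo.Orders and dbo.Customers are hypothetical tables, for illustration only.
-- NOLOCK reads without taking shared locks, so uncommitted (dirty) rows
-- may be returned.
SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o WITH (NOLOCK)
INNER JOIN dbo.Customers AS c WITH (NOLOCK)
    ON c.CustomerID = o.CustomerID;
```

The hint has to be repeated on every table that should be read this way; tables without it still take shared locks as usual.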

Comparing two partition's data in hive

I have 9 million records in each of my partitions in Hive, and I have two partitions. The table has 20 columns. Now I want to compare the data between the partitions based on an id column. What is the best way to do it, considering that a self join with 9 million records will create performance issues?
Can you try an SMB (sort-merge-bucket) join? It works much like merging two sorted lists. However, in this case you will need to create two more tables.
Another option would be to write a UDF to do the same, but that would be a project by itself. The first option is easier.
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually do the full cartesian product.
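-- filter each side before joining: a where clause on both sides applied
-- after a full outer join would drop the unmatched rows
select a.foo, b.foo
from (select * from my_table where partition = 'x') a
full outer join (select * from my_table where partition = 'y') b
on a.id <=> b.id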
To do a full comparison of 2 tables (or of 2 partitions of the same table), my experience is that a checksum mechanism is a more effective and reliable solution than joining the tables (which causes performance problems, as you mentioned, and also runs into difficulties when keys are repeated, for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same and using the "--source-where" and "--destination-where" to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.

Slow Join on Varchar

I have a query in SQL Server with a join that is taking forever. I'm hoping someone might have a tip to speed it up.
I think the problem is that I'm joining on a field called Reseller_CSN, which has values like '_0070000050'.
I've tried using the substring function in the join to return everything but the underscore, for example '0070000050', but I keep getting an error when I try to cast or convert the result to int or bigint.
Any tips would be greatly appreciated, the query is below:
SELECT
t1.RESELLER_CSN
,t1.FIRST_YEAR_RENEWAL
,t1.SEAT_SEGMENT
,t2.Target_End_Date_CY
,t2.Target_End_Date_PQ
,t2.Target_End_Date_CY_1
,t2.Target_End_Date_CY_2
,t1.ASSET_SUB_END_DATE
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
ON SUBSTRING(t1.RESELLER_CSN,2,11) = SUBSTRING(t2.RESELLER_CSN,2,11)
A join on processed columns invariably takes more effort than a join on raw columns. In this case, you can improve performance by using computed columns. For instance, on the first table, you can do:
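-- the computed column refers to the table's own column, so no t1 alias here
alter table new_all_renewals add CSN_num as SUBSTRING(RESELLER_CSN,2,11);
-- an index needs a name; IX_new_all_renewals_CSN_num is only illustrative
create index IX_new_all_renewals_CSN_num on new_all_renewals(CSN_num);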
This will generate an index on the column, which should speed the query. (Note: you'll need to reference the computed column rather than actually using the function.)
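A sketch of what the rewritten join might look like, assuming the same CSN_num computed column has also been added (and indexed) on the second table. That second column is an assumption, not something stated in the question.

```sql
-- Assumption: CSN_num exists as an indexed computed column on both tables.
SELECT
    t1.RESELLER_CSN
    ,t1.FIRST_YEAR_RENEWAL
    ,t2.Target_End_Date_CY
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
    ON t1.CSN_num = t2.CSN_num;
```

The join now compares the stored column values directly instead of calling SUBSTRING for every row on both sides.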

How to improve the performance of multiple joins

I have a query with multiple joins in it. When I execute the query, it takes too long. Can you please suggest how to improve this query?
ALTER View [dbo].[customReport]
As
SELECT DISTINCT ViewUserInvoicerReport.Owner,
ViewUserAll.ParentID As Account , ViewContact.Company,
Payment.PostingDate, ViewInvoice.Charge, ViewInvoice.Tax,
PaymentProcessLog.InvoiceNumber
FROM
ViewContact
Inner Join ViewUserInvoicerReport on ViewContact.UserID = ViewUserInvoicerReport.UserID
Inner Join ViewUserAll on ViewUserInvoicerReport.UserID = ViewUserAll.UserID
Inner Join Payment on Payment.UserID = ViewUserAll.UserID
Inner Join ViewInvoice on Payment.UserID = ViewInvoice.UserID
Inner Join PaymentProcessLog on ViewInvoice.UserID = PaymentProcessLog.UserID
GO
Work on removing the distinct.
That is not a join issue. The problem is that ALL rows have to go into a temp table to find out which ones are duplicates. If you analyze the query plan (programmers 101: learn to use that fast), you will likely see that the join is not the big problem but the distinct is.
And IIRC that distinct is USELESS because all rows are unique anyway... not 100% sure, but the field list seems to indicate it.
Use distincts VERY rarely please ;)
You should see the Query Execution Plan and optimize the query section by section.
The overall optimization process consists of two main steps:
1. Isolate long-running queries.
2. Identify the cause of long-running queries.
See "How To: Optimize SQL Queries" for step-by-step instructions.
It's difficult to say how to improve the performance of a query without knowing things like how many rows of data are in each table, which columns are indexed, what performance you're looking for and which database you're using.
Most important:
1. Make sure that all columns used in joins are indexed
2. Make sure that the query execution plan indicates that you are using the indexes you expect
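For point 1, a minimal sketch in T-SQL, assuming SQL Server (the view in the question uses [dbo] and GO) and assuming Payment and PaymentProcessLog are base tables rather than views. The index names are hypothetical.

```sql
-- Index names are hypothetical; Payment and PaymentProcessLog are assumed
-- to be base tables (the View* objects would need indexes on their
-- underlying tables instead).
CREATE INDEX IX_Payment_UserID
    ON dbo.Payment (UserID);
CREATE INDEX IX_PaymentProcessLog_UserID
    ON dbo.PaymentProcessLog (UserID);
```

For point 2, looking at the actual execution plan for the view's select should show whether these indexes are being used.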