Hadoop - Hive - Impala - rewrite a query for performance

I have 2 tables with below columns
Table1
col1 col2 col3 val
11 221 38 10
null 90 null 989
78 90 null 77
table2
col1 col2 col3
12 221 78
23 null 67
78 90 null
I want to join these 2 tables on col1 first; if the values match, stop. If not, join on col2; if that matches, stop; otherwise join on col3. If any of the columns matches, populate val, else leave it null, and put the name of whichever column matched in a matchingcol column. So, the output should look like this:
col1 col2 col3 val matchingcol
11 221 38 10 col2
null 90 null null null
78 90 null 77 col1
I was able to do this using the query below, but it is very slow. Please let me know if there is a better way of writing it for faster performance:
select *
from table1 t1 left join
table2 t2_1
on t2_1.col1 = t1.col1 left join
table2 t2_2
on t2_2.col2 = t1.col2 and t2_1.col1 is null
left join table2 t2_3 on t2_3.col3 = t1.col3 and t2_2.col2 is null
PS: I asked the same question before, but there was no better answer.

What you describe is:
select t1.col1, t1.col2, t1.col3,
(case when t2_1.col1 is not null or t2_2.col2 is not null or t2_3.col3 is not null then t1.val end) as val,
(case when t2_1.col1 is not null then 'col1'
when t2_2.col2 is not null then 'col2'
when t2_3.col3 is not null then 'col3'
end) as matchingcol
from table1 t1 left join
table2 t2_1
on t2_1.col1 = t1.col1 left join
table2 t2_2
on t2_2.col2 = t1.col2 and t2_1.col1 is null left join
table2 t2_3
on t2_3.col3 = t1.col3 and t2_1.col1 is null and t2_2.col2 is null;
This is probably the best approach.
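To check the cascading-join logic on the sample rows from the question, here is a minimal sketch. It assumes an in-memory SQLite database driven from Python, purely so the example is runnable; it says nothing about how Hive or Impala will plan or perform the query. It also assumes the col3 join should be skipped whenever col1 or col2 has already matched.

```python
import sqlite3

# Assumption: SQLite stands in for Hive/Impala only to make the logic checkable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (col1 int, col2 int, col3 int, val int);
    create table table2 (col1 int, col2 int, col3 int);
""")
conn.executemany("insert into table1 values (?, ?, ?, ?)",
                 [(11, 221, 38, 10), (None, 90, None, 989), (78, 90, None, 77)])
conn.executemany("insert into table2 values (?, ?, ?)",
                 [(12, 221, 78), (23, None, 67), (78, 90, None)])

query = """
select t1.col1, t1.col2, t1.col3,
       case when t2_1.col1 is not null or t2_2.col2 is not null
                 or t2_3.col3 is not null then t1.val end as val,
       case when t2_1.col1 is not null then 'col1'
            when t2_2.col2 is not null then 'col2'
            when t2_3.col3 is not null then 'col3' end as matchingcol
from table1 t1
left join table2 t2_1 on t2_1.col1 = t1.col1
left join table2 t2_2 on t2_2.col2 = t1.col2 and t2_1.col1 is null
left join table2 t2_3 on t2_3.col3 = t1.col3
                     and t2_1.col1 is null and t2_2.col2 is null
order by t1.rowid  -- SQLite-specific: keeps the insertion order
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On this data the second row (col2 = 90) comes back as a col2 match with val 989, whereas the expected output in the question shows nulls for it, so either the sample or the stated rule may need a second look.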

Related

SQL join two tables that have the same columns, with an overlapping `id` column, but merge based on if table1.col1 >= table2.col1

I want to join two tables that have the same columns, with an overlapping id column, but merge based on if table1.col1 >= table2.col1. This is in SQL.
If table1.col1>=table2.col1, use the columns from table1.
If table1.col1< table2.col1, then use columns from table2.
If the id does not exist in table1 but exists in table2, use the columns from table2
If the id does not exist in table2 but exists in table1, use the columns from table1
For example:
Table1:
id col1 col2 col3
A 3 5 4
B 1 2 3
C 8 9 7
Table2:
id col1 col2 col3
A 2 5 6
B 5 7 8
D 2 3 4
I want the result to be:
id col1 col2 col3
A 3 5 4
B 5 7 8
C 8 9 7
D 2 3 4
I have tried UNION, FULL OUTER JOIN, and CASE expressions, but I am stuck.

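I think individual case expressions for each column might be best. The conditions are written so that an id missing from t1 falls through to the t2 values (this assumes col1 is never null in an existing row):
select id,
(case when t1.col1 >= t2.col1 or t2.col1 is null then t1.col1 else t2.col1 end) as col1,
(case when t1.col1 >= t2.col1 or t2.col1 is null then t1.col2 else t2.col2 end) as col2,
(case when t1.col1 >= t2.col1 or t2.col1 is null then t1.col3 else t2.col3 end) as col3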
from t1 full join
t2
using (id);
If that is cumbersome, another approach uses not exists:
select t1.*
from t1
where not exists (select 1
from t2
where t2.id = t1.id and t2.col1 > t1.col1
)
union all
select t2.*
from t2
where not exists (select 1
from t1
where t2.id = t1.id and t1.col1 >= t2.col1
);
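As a quick check of the not exists version against the sample tables, here is a small sketch. It assumes SQLite through Python's sqlite3 module just to make it runnable; the outer select with order by id is added only so the output order is deterministic.

```python
import sqlite3

# Assumption: an in-memory SQLite database holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (id text, col1 int, col2 int, col3 int);
    create table t2 (id text, col1 int, col2 int, col3 int);
""")
conn.executemany("insert into t1 values (?, ?, ?, ?)",
                 [("A", 3, 5, 4), ("B", 1, 2, 3), ("C", 8, 9, 7)])
conn.executemany("insert into t2 values (?, ?, ?, ?)",
                 [("A", 2, 5, 6), ("B", 5, 7, 8), ("D", 2, 3, 4)])

query = """
select * from (
    -- t1 rows win unless t2 has the same id with a strictly larger col1
    select t1.* from t1
    where not exists (select 1 from t2
                      where t2.id = t1.id and t2.col1 > t1.col1)
    union all
    -- t2 rows win only when no t1 row with the same id has col1 >= t2.col1
    select t2.* from t2
    where not exists (select 1 from t1
                      where t1.id = t2.id and t1.col1 >= t2.col1)
) merged
order by id
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

Because the two conditions are exact complements for ids present in both tables, each id shows up once, and ties on col1 go to t1 as the question asks.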

Another solution:
SELECT DISTINCT ON (id) *
FROM (
SELECT *
FROM table1
UNION ALL
SELECT *
FROM table2
) AS aux
ORDER BY id, col1 DESC;
I tried it in PostgreSQL (DISTINCT ON is PostgreSQL-specific).

SAP HANA | JOIN with LIKE OPERATOR

I have two tables -
TABLE T1 having below data -
COL1 COL2
1 A,B,C
TABLE T2 having below data -
COL3 COL4
A 10
B 20
C 30
D 40
The output I want is -
COL1 COL3 COL4
1 A 10
1 B 20
1 C 30
T1 and T2 have around one million rows.
I have already tried the following:
select T1.COL1,T2.COL3,T2.COL4
from T1
inner join T2 on T1.COL2 LIKE '%' || T2.COL3 || '%';
select T1.COL1,T2.COL3,T2.COL4
from T1
inner join T2 on instr(T1.COL2, T2.COL3 )>0;
select T1.COL1,T2.COL3,T2.COL4
from T1
inner join T2 on T1.COL2 LIKE_REGEXPR T2.COL3;
All three statements work fine with test data, but running them on the actual data results in a memory allocation error.
Is there a better way to do this?
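One direction that may be worth trying, offered only as a sketch: all three attempts join on a pattern test rather than on equality, so the engine likely has to compare every T1 row with every T2 row. If COL2 can be normalized into one row per element first, the join becomes a plain equality on COL3. The example below assumes SQLite through Python's sqlite3 module and uses a recursive CTE (the name parts is made up for illustration) to do the split; how best to split the list in SAP HANA itself is not covered here.

```python
import sqlite3

# Assumption: SQLite is used only to illustrate "split first, then equi-join".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table T1 (COL1 int, COL2 text);
    create table T2 (COL3 text, COL4 int);
""")
conn.execute("insert into T1 values (1, 'A,B,C')")
conn.executemany("insert into T2 values (?, ?)",
                 [("A", 10), ("B", 20), ("C", 30), ("D", 40)])

query = """
with recursive parts(col1, item, rest) as (
    -- seed: nothing extracted yet; a trailing comma terminates the last element
    select COL1, '', COL2 || ',' from T1
    union all
    -- peel off the text before the first comma, keep the remainder
    select col1,
           substr(rest, 1, instr(rest, ',') - 1),
           substr(rest, instr(rest, ',') + 1)
    from parts
    where rest <> ''
)
select p.col1, T2.COL3, T2.COL4
from parts p
inner join T2 on T2.COL3 = p.item   -- equality join instead of LIKE
order by p.col1, T2.COL3
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

An equality join also avoids false positives: a pattern such as '%A%' would match an element like 'AB' as well.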

SQL Compare Two tables with column value difference

I have 2 tables with exactly the same structure, and I would like to compare the column values and display the differences in a specific format. I'm new to SQL. I tried the MINUS operator, but it's not helping. The scenario is below.
Table 1
Key Col1 Col2
1 110 AAA
2 120 BBB
Table 2
Key Col1 Col2
1 111 CCC
2 120 DDD
I need output in below format
Key Field Table1 Table2
1 Col1 110 111
1 Col2 AAA CCC
2 Col2 BBB DDD
How can this be accomplished?
Thanks,
Milind

This is an arcane structure for bringing the tables together. I think this will work:
select t1.key,
(case when t2.key is not null then 'col2' else 'col1' end) as field,
(case when t2.key is not null then t1.col2
when seqnum = 1 then t1.col1
when seqnum = 2 then t1.col2
end) as Table1,
(case when t2.key is not null then t2.col2
when seqnum = 1 then t2.col1
when seqnum = 2 then t2.col2
end) as Table2
from table1 t1 left join
table2 t2
on t1.key = t2.key and t1.col1 = t2.col1 left join
(select tt2.*, row_number() over (partition by tt2.key order by tt2.key) as seqnum
from table2 tt2
) tt2
on t1.key = tt2.key and t2.key is null;
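A simpler alternative, offered only as a sketch and not as a repair of the query above: compare one column per SELECT and glue the results together with UNION ALL. The example assumes SQLite through Python's sqlite3 module; the aliases k, table1_value and table2_value are illustrative, key is quoted because it is a keyword in some dialects, and the numeric column is cast to text so both branches return the same type. Rows where either side is NULL are not reported by the <> comparison.

```python
import sqlite3

# Assumption: in-memory SQLite tables holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 ("key" int, col1 int, col2 text);
    create table table2 ("key" int, col1 int, col2 text);
""")
conn.executemany("insert into table1 values (?, ?, ?)",
                 [(1, 110, "AAA"), (2, 120, "BBB")])
conn.executemany("insert into table2 values (?, ?, ?)",
                 [(1, 111, "CCC"), (2, 120, "DDD")])

query = """
-- one branch per compared column; each keeps only the keys that differ
select t1."key" as k, 'Col1' as field,
       cast(t1.col1 as text) as table1_value,
       cast(t2.col1 as text) as table2_value
from table1 t1 join table2 t2 on t1."key" = t2."key"
where t1.col1 <> t2.col1
union all
select t1."key", 'Col2', t1.col2, t2.col2
from table1 t1 join table2 t2 on t1."key" = t2."key"
where t1.col2 <> t2.col2
order by 1, 2
"""
diffs = conn.execute(query).fetchall()
for row in diffs:
    print(row)
```

The cost is one extra branch per column, which is tolerable for a handful of columns but gets verbose for wide tables.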

SQL Query - Indirect joining of two tables

I have two tables like the following
Table1
COL1 COL2 COL3
A 10 ABC
A 11 ABC
A 1 DEF
A 2 DEF
B 10 ABC
B 11 ABC
B 1 DEF
C 3 DEF
C 12 ABC
C 21 GHI
Table2
COL1 GHI ABC DEF
A1 21 10 1
A2 21 12 1
A3 21 10 1
A4 23 10 1
A5 25 11 3
A6 21 14 3
A7 25 11 1
A8 23 10 1
A9 29 10 2
A10 21 12 3
I have created another temporary table that returns all the distinct values from tbl1.col1
The values of col3 in tbl1 are columns in tbl2, which are populated by some values.
What I need is for each of these distinct values of table1.column1, (A, B, C) in this case, return a combination of table2.column1 and table1.column1 such that
the ABC value of table2.column1 matches any of the ABC value of the "group" from table1,
AND the DEF value of table2.column1 matches any of the DEF value of the "group" from table1,
AND IF THE GROUP CONTAINS GHI VALUES, the GHI value of table2.column1 matches any of the GHI value of the "group" from table1
So, I would need something like the following
Output Table
Table2.COL1 Table1.Col1
A1 A
A3 A
A4 A
A7 A
A8 A
A9 A
A1 B
A3 B
A4 B
A7 B
A8 B
A10 C
I tried something like this, but I'm not sure if this is the right way of approaching it:
select table2.col1, temp_distinct_table.column1
from table2, temp_distinct_table
where table2.def IN (SELECT col2
FROM table1
WHERE table1.col1 = temp_distinct_table.col1
AND table1.col3 = 'DEF')
AND table2.abc IN (SELECT col2
FROM table1
WHERE table1.col1 = temp_distinct_table.col1
AND table1.col3 = 'ABC')
AND (
table2.ghi IN (SELECT col2
FROM table1
WHERE table1.col1 = temp_distinct_table.col1
AND table1.col3 = 'GHI')
OR NOT EXISTS (SELECT col2
FROM table1
WHERE table1.col1 = temp_distinct_table.col1
AND table1.col3 = 'GHI')
)
where temp_distinct_table consists of all the distinct values from table1.col1.
Could someone guide me on the matter?

Another approach, counting how many matches there are for each t1.col/t2.col combination after joining all the possible matches:
select distinct t2_col1, t1_col1
from (
select t2.col1 as t2_col1, t1.col1 as t1_col1, t1.ghi_count as t1_ghi_count,
count(case when t1.col3 = 'ABC' then 1 end)
over (partition by t1.col1, t2.col1) as abc_matches,
count(case when t1.col3 = 'DEF' then 1 end)
over (partition by t1.col1, t2.col1) as def_matches,
count(case when t1.col3 = 'GHI' then 1 end)
over (partition by t1.col1, t2.col1) as ghi_matches
from (
select t1.*,
count(case when t1.col3 = 'GHI' then 1 end)
over (partition by t1.col1) as ghi_count
from table1 t1
) t1
join table2 t2
on (t1.col3 = 'ABC' and t2.abc = t1.col2)
or (t1.col3 = 'DEF' and t2.def = t1.col2)
or (t1.col3 = 'GHI' and t2.ghi = t1.col2)
)
where abc_matches > 0
and def_matches > 0
and (t1_ghi_count = 0 or ghi_matches > 0)
order by t1_col1, t2_col1;
Which with your sample data gets:
T2_COL T1_COL
------ ------
A1 A
A3 A
A4 A
A7 A
A8 A
A9 A
A1 B
A3 B
A4 B
A7 B
A8 B
A10 C
Not sure if the efficiency of that will be significantly different to MTO's cross join with your real data.

This becomes quite simple when you use collections (and you only need to do one table scan for each table):
Oracle Setup:
CREATE TYPE intlist AS TABLE OF INT;
/
Query:
SELECT t2.col1 AS t2_col1,
t1.col1 AS t1_col1
FROM (
SELECT col1,
CAST( COLLECT( CASE col3 WHEN 'ABC' THEN col2 END ) AS INTLIST ) AS abc,
CAST( COLLECT( CASE col3 WHEN 'DEF' THEN col2 END ) AS INTLIST ) AS def,
CAST( COLLECT( CASE col3 WHEN 'GHI' THEN col2 END ) AS INTLIST ) AS ghi
FROM table1
GROUP BY col1
) t1
INNER JOIN table2 t2
ON ( t2.abc MEMBER OF t1.abc
AND t2.def MEMBER OF t1.def
AND ( t2.ghi MEMBER OF t1.ghi OR t1.ghi IS EMPTY ) );
Output:
t2_col1 t1_col1
------- -------
A1 A
A3 A
A4 A
A7 A
A8 A
A9 A
A1 B
A3 B
A4 B
A7 B
A8 B
A10 C
Update
An alternative query without using collections (it is going to be more efficient than your query but probably less efficient than collections):
SELECT t2.col1,
t1.col1
FROM table1 t1
CROSS JOIN
table2 t2
GROUP BY t1.col1, t2.col1
HAVING COUNT( CASE WHEN t1.col2 = t2.abc AND t1.col3 = 'ABC' THEN 1 END ) > 0
AND COUNT( CASE WHEN t1.col2 = t2.def AND t1.col3 = 'DEF' THEN 1 END ) > 0
AND ( COUNT( CASE WHEN t1.col2 = t2.ghi AND t1.col3 = 'GHI' THEN 1 END ) > 0
OR COUNT( CASE t1.col3 WHEN 'GHI' THEN 1 END ) = 0 )
ORDER BY t1.col1, t2.col1;
Update 2:
Changed from CROSS JOIN to INNER JOIN:
SELECT t2.col1 AS t2_col1,
t1.col1 AS t1_col1
FROM (
SELECT t1.*,
COUNT( CASE col3 WHEN 'GHI' THEN 1 END )
OVER ( PARTITION BY col1 ) AS has_ghi
FROM table1 t1
) t1
INNER JOIN table2 t2
ON ( t1.col3 = 'ABC' AND t2.abc = t1.col2 )
OR ( t1.col3 = 'DEF' AND t2.def = t1.col2 )
OR ( t1.col3 = 'GHI' AND t2.ghi = t1.col2 )
GROUP BY t1.col1, t2.col1, t1.has_ghi
HAVING COUNT( CASE t1.col3 WHEN 'ABC' THEN 1 END ) > 0
AND COUNT( CASE t1.col3 WHEN 'DEF' THEN 1 END ) > 0
AND ( COUNT( CASE t1.col3 WHEN 'GHI' THEN 1 END ) > 0 OR has_ghi = 0 )
ORDER BY t1.col1, t2.col1;
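For anyone who wants to reproduce the result without an Oracle instance, here is a small sketch of the CROSS JOIN / HAVING variant run against the sample data. It assumes SQLite through Python's sqlite3 module, so it only checks the logic and says nothing about how the variants compare on real data volumes.

```python
import sqlite3

# Assumption: in-memory SQLite tables holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (col1 text, col2 int, col3 text);
    create table table2 (col1 text, ghi int, abc int, def int);
""")
conn.executemany("insert into table1 values (?, ?, ?)", [
    ("A", 10, "ABC"), ("A", 11, "ABC"), ("A", 1, "DEF"), ("A", 2, "DEF"),
    ("B", 10, "ABC"), ("B", 11, "ABC"), ("B", 1, "DEF"),
    ("C", 3, "DEF"), ("C", 12, "ABC"), ("C", 21, "GHI"),
])
conn.executemany("insert into table2 values (?, ?, ?, ?)", [
    ("A1", 21, 10, 1), ("A2", 21, 12, 1), ("A3", 21, 10, 1), ("A4", 23, 10, 1),
    ("A5", 25, 11, 3), ("A6", 21, 14, 3), ("A7", 25, 11, 1), ("A8", 23, 10, 1),
    ("A9", 29, 10, 2), ("A10", 21, 12, 3),
])

query = """
select t2.col1 as t2_col1, t1.col1 as t1_col1
from table1 t1
cross join table2 t2
group by t1.col1, t2.col1
-- a pair qualifies when ABC and DEF each match at least once, and GHI
-- either matches or does not occur in the table1 group at all
having count(case when t1.col2 = t2.abc and t1.col3 = 'ABC' then 1 end) > 0
   and count(case when t1.col2 = t2.def and t1.col3 = 'DEF' then 1 end) > 0
   and (count(case when t1.col2 = t2.ghi and t1.col3 = 'GHI' then 1 end) > 0
        or count(case t1.col3 when 'GHI' then 1 end) = 0)
order by t1.col1, t2.col1
"""
pairs = conn.execute(query).fetchall()
for pair in pairs:
    print(pair)
```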

Ordering in Access SQL

I don't know how to order the following data in Access SQL:
Col1 Col2
1 1
1 2
1 3
2 4
2 5
2 6
3 7
3 8
3 9
Where it grabs the lowest value in Col2 where Col1 = 1, and then the lowest value in Col2 where Col1 = 2 etc, leading to sorted data of:
Col1 Col2
1 1
2 4
3 7
1 2
2 5
3 8
1 3
2 6
3 9
Col1 can range from 1 to any number, and Col2 doesn't necessarily start from 1 or increase by a constant step (but it is still in increasing order).
The table also has an auto ID primary key if that helps.
Thanks to shA.t, whose answer works perfectly. I added a simple table join, which works as well:
SELECT t1.Col1, t1.Col2 FROM
(SELECT Table1.Col1, Table2.Col2 FROM Table2 INNER JOIN Table1 ON Table2.ID = Table1.ID) t1
INNER JOIN
(SELECT Table1.Col1, Table2.Col2 FROM Table2 INNER JOIN Table1 ON Table2.ID = Table1.ID) t2
ON t1.Col1 = t2.Col1 and t1.Col2 >= t2.Col2
Group by t1.Col1, t1.Col2
ORDER BY Count(t2.Col2), t1.Col1

I think you can use a query like this:
SELECT t1.Col1, t1.Col2
FROM t t1
INNER JOIN t t2 ON t1.Col1 = t2.Col1 and t1.Col2 >= t2.Col2
Group by t1.Col1, t1.Col2
ORDER BY Count(t2.Col2), t1.Col1;
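The idea is that the self-join counts, for each row, how many rows in the same Col1 group have a Col2 that is less than or equal to its own, which is effectively its rank within the group; sorting by that rank first interleaves the groups. A minimal sketch to see it on the sample data, assuming SQLite through Python's sqlite3 module rather than Access:

```python
import sqlite3

# Assumption: SQLite stands in for Access; the table name t follows the answer.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (col1 int, col2 int)")
conn.executemany("insert into t values (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6),
                  (3, 7), (3, 8), (3, 9)])

query = """
select t1.col1, t1.col2
from t t1
inner join t t2 on t1.col1 = t2.col1 and t1.col2 >= t2.col2
group by t1.col1, t1.col2
-- count(t2.col2) is the row's rank inside its col1 group
order by count(t2.col2), t1.col1
"""
ordered = conn.execute(query).fetchall()
for row in ordered:
    print(row)
```

Note that the self-join grows with the square of the group size, so it may get slow when a single Col1 value has many rows.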
