Optimize Hive query with similar tables

I have two big tables that are very similar to each other and are made of 6 left joins. The only difference between them is in the first table to which the others are left joined, otherwise the main select clause and the rest of the tables are the same.
A simple example would be:
Create table A as
Select a.attr, b.attr, ...
From
  (Select attr
   From table a
   Where cond1 and cond2 and cond3) a
Left join
  (Select attr
   From table) b
  on a.whatever = b.whatever
Left join ...;

Create table B as
Select a.attr, b.attr, ...
From
  (Select attr
   From table a
   Where cond1) a
Left join
  (Select attr
   From table) b
  on a.whatever = b.whatever
Left join ...;
I hope this is clear. The only difference is the where conditions of table 'a' to which everything else is joined. How could I optimize this so I don't have to write two almost identical queries?

Maybe you can drop the extra restrictions on table a at first, build the joined result once, and then apply those restrictions when you use it.
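One way to read that suggestion, as a rough sketch: cond2 and cond3 only touch columns of the driving table, and a left join never drops rows from that table, so the join chain can be built once with the shared cond1 and both tables derived from it. The snippet below runs on Python's sqlite3 rather than Hive, and the tables (src, dim), the columns (flag2, flag3) and the concrete conditions are all made up for illustration. In Hive, a view or an intermediate table could play the role of joined_base.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-ins for the real tables.
CREATE TABLE src (id INTEGER, attr TEXT, flag2 INTEGER, flag3 INTEGER);
CREATE TABLE dim (id INTEGER, attr TEXT);
INSERT INTO src VALUES (1, 'x', 1, 1), (2, 'y', 1, 0), (3, 'z', 0, 1), (-1, 'w', 1, 1);
INSERT INTO dim VALUES (1, 'd1'), (3, 'd3');

-- Run the shared join chain once, filtered only by the common cond1
-- (assumed here to be "id > 0"). Keep the columns cond2/cond3 need.
CREATE TABLE joined_base AS
SELECT s.id, s.attr AS a_attr, s.flag2, s.flag3, d.attr AS b_attr
FROM (SELECT * FROM src WHERE id > 0) s
LEFT JOIN dim d ON s.id = d.id;

-- Table B only needs cond1, so it is the base result as-is.
CREATE TABLE B AS SELECT id, a_attr, b_attr FROM joined_base;

-- Table A adds cond2 and cond3 (assumed: flag2 = 1 AND flag3 = 1).
CREATE TABLE A AS
SELECT id, a_attr, b_attr FROM joined_base WHERE flag2 = 1 AND flag3 = 1;
""")

rows_a = conn.execute("SELECT * FROM A ORDER BY id").fetchall()
rows_b = conn.execute("SELECT * FROM B ORDER BY id").fetchall()
print(rows_a)
print(rows_b)
```

This only works because the extra conditions are on the left-most table. A condition on one of the left-joined tables would have to stay inside its subquery or ON clause.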

Related

duplicate query result when join table

I am getting duplicate data when I join two tables. Here is my sample data:
-- Table A
which I want to join with
-- Table B
This is the query I use to join the two tables:
select a.trans_id, name
from tableA a
inner join tableB b
on a.ID_Trans = b.trans_id
And this is the result. Why do I get duplicated data, when only two rows should be shown? Please help me solve this.

Firstly, as you have been told multiple times in the comments, this is working exactly as you have written it and (more importantly) as intended. You have 2 rows in tableA, and each of those 2 rows matches 2 rows in tableB according to the ON clause. This means that the join produces 2 rows for each of the rows in tableA; thus 4 rows (2 * 2 = 4).
Considering that your table TableA only has one column, it seems that you should be cleaning up that data and deleting the duplicates. There are plenty of examples of how to do that already.
Perhaps the column you show us in TableA is one of many, in which case you have a denormalisation issue, and there should be another table holding the details of Id_trans with a PRIMARY KEY or UNIQUE CONSTRAINT/INDEX on it. You would then join from that table to TableB.
Finally, what you might be after is an EXISTS, which would look like this:
SELECT B.trans_id, B.[name]
FROM dbo.TableB B
WHERE EXISTS(SELECT 1
FROM dbo.TableA A
WHERE A.ID_Trans = B.trans_id); --Odd that it's called ID_Trans in one table, and Trans_ID in another

As the comments mentioned, your query does exactly what you asked it to do, but I think you wanted something like:
select a.trans_id, a.name, b.name
from tableA a
inner join tableB b on a.trans_id = b.trans_id
group by a.trans_id, a.name, b.name

Since there are two rows with the same ID in both tables, the join turns them into four. You can use distinct to remove the duplicates:
select distinct a.trans_id, name
from tableA a
inner join tableB b
on a.id_trans = b.trans_id
But I would suggest using exists:
select trans_id, name
from tableB b
where exists (select 1 from tableA a where a.id_trans = b.trans_id)
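To make the 2 * 2 = 4 effect from the answers above concrete, here is a small sketch using Python's sqlite3. The sample rows are invented, since the original table data is not shown: tableA holds the same id_trans twice and tableB has two rows for that id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample data; the question's real rows are not available.
CREATE TABLE tableA (id_trans TEXT);
CREATE TABLE tableB (trans_id TEXT, name TEXT);
INSERT INTO tableA VALUES ('T1'), ('T1');
INSERT INTO tableB VALUES ('T1', 'first'), ('T1', 'second');
""")

# Plain inner join: every tableA row pairs with every matching tableB row.
joined = conn.execute("""
    SELECT b.trans_id, b.name
    FROM tableA a
    INNER JOIN tableB b ON a.id_trans = b.trans_id
""").fetchall()

# DISTINCT collapses the repeated pairs after the join has produced them.
distinct_rows = conn.execute("""
    SELECT DISTINCT b.trans_id, b.name
    FROM tableA a
    INNER JOIN tableB b ON a.id_trans = b.trans_id
    ORDER BY b.name
""").fetchall()

# EXISTS only tests for a match, so tableB rows are never multiplied.
exists_rows = conn.execute("""
    SELECT b.trans_id, b.name
    FROM tableB b
    WHERE EXISTS (SELECT 1 FROM tableA a WHERE a.id_trans = b.trans_id)
    ORDER BY b.name
""").fetchall()

print(len(joined), distinct_rows, exists_rows)
```

Note that DISTINCT and EXISTS only agree here because the two tableB rows differ. If tableB itself held identical rows, DISTINCT would merge them while EXISTS would keep both.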

Selecting Fields and Join Clause

I have two tables, TableA:
- Original_Location
- Units
and TableB:
- Original_Loc
- Adjacent_Loc
- Direction (up/down/left/right from original loc)
My goal is to return the original location, the adjacent location, the direction, and the number of units at the adjacent loc. So far, I've only been able to return the units from the original loc.
Here is what I've tried so far:
Select Original_Location,
Units,
TableB.Adjacent_Loc,
TableB.Direction
From TableA
Inner Join
Select *
From TableB
On TableA.Original_Location = TableB.Original_Loc
My thought is that I might need to change the fields I'm selecting before the inner join, or potentially join on Original_Location = Adjacent_Loc.

Your join syntax is not right. First, you don't need to "select" anything from TableB; your initial SELECT reads from the joined tables (A and B) as if they were one table. Second, you need to specify which fields to join the tables on.
Your join will be something like:
From TableA Inner Join TableB
on TableA.Original_Location = TableB.Original_Loc
Once you have got your joins right, you need to make another join to TableA to get the Units. This time you are joining the Adjacent_Loc in Table B to the Original_Location in Table A - which will have the Units value you need.
My example below uses aliases to identify each table (there are now 2 references to TableA so they need to be identified separately). So when you do the second join to TableA, this has the alias of c to differentiate it from the first TableA reference. You then need to select the Units from c.
Select a.Original_Location, c.Units, b.Adjacent_Loc, b.Direction
From TableA a Inner Join TableB b On a.Original_Location = b.Original_Loc
inner join TableA c on b.Adjacent_Loc = c.Original_Location
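As a quick check of the double join above, the sketch below runs the same query through Python's sqlite3. The location codes and unit counts are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (Original_Location TEXT, Units INTEGER);
CREATE TABLE TableB (Original_Loc TEXT, Adjacent_Loc TEXT, Direction TEXT);
-- Invented sample rows.
INSERT INTO TableA VALUES ('A1', 10), ('A2', 20), ('B1', 30);
INSERT INTO TableB VALUES ('A1', 'A2', 'right'), ('A1', 'B1', 'down');
""")

# a = the original location's row, c = the adjacent location's row.
# Units is taken from c, so it is the unit count at the adjacent location.
rows = conn.execute("""
    SELECT a.Original_Location, c.Units, b.Adjacent_Loc, b.Direction
    FROM TableA a
    INNER JOIN TableB b ON a.Original_Location = b.Original_Loc
    INNER JOIN TableA c ON b.Adjacent_Loc = c.Original_Location
    ORDER BY b.Adjacent_Loc
""").fetchall()

print(rows)
```

With inner joins, an adjacent location that has no row in TableA drops out of the result; a left join on c would keep it with NULL units.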

Are the SQL concepts LEFT OUTER JOIN and WHERE NOT EXISTS basically the same?

What's the difference between using a LEFT OUTER JOIN and using a sub-query that starts with WHERE NOT EXISTS (...)?

No, they are not the same thing, as they will not return the same rowset in the simplest use case.
The LEFT OUTER JOIN will return all rows from the left table, both where rows exist in the related table and where they do not. The WHERE NOT EXISTS() subquery will only return rows where the relationship is not met.
However, if you do a LEFT OUTER JOIN and test for IS NULL on the foreign key column in the WHERE clause, you get behavior equivalent to the WHERE NOT EXISTS.
For example this:
SELECT
t_main.*
FROM
t_main
LEFT OUTER JOIN t_related ON t_main.id = t_related.id
/* IS NULL in the WHERE clause */
WHERE t_related.id IS NULL
Is equivalent to this:
SELECT
t_main.*
FROM t_main
WHERE
NOT EXISTS (
SELECT t_related.id
FROM t_related
WHERE t_main.id = t_related.id
)
But this one is not equivalent:
It will return rows from t_main both having and not having related rows in t_related.
SELECT
t_main.*
FROM
t_main
LEFT OUTER JOIN t_related ON t_main.id = t_related.id
/* WHERE clause does not exclude NULL foreign keys */
Note: this does not speak to how the queries are compiled and executed, which differs as well; it only addresses a comparison of the rowsets they return.
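The three queries in this answer can be compared side by side with a small sqlite3 sketch in Python. The ids are invented sample data; t_related deliberately contains a repeated id so the unfiltered join visibly returns more rows than t_main has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_main (id INTEGER);
CREATE TABLE t_related (id INTEGER);
-- Invented sample ids: 2 has no related row, 1 has two.
INSERT INTO t_main VALUES (1), (2), (3);
INSERT INTO t_related VALUES (1), (1), (3);
""")

left_join_is_null = conn.execute("""
    SELECT t_main.*
    FROM t_main
    LEFT OUTER JOIN t_related ON t_main.id = t_related.id
    WHERE t_related.id IS NULL
""").fetchall()

not_exists = conn.execute("""
    SELECT t_main.*
    FROM t_main
    WHERE NOT EXISTS (
        SELECT t_related.id FROM t_related WHERE t_main.id = t_related.id
    )
""").fetchall()

# No IS NULL filter: matched and unmatched rows both come back.
plain_left_join = conn.execute("""
    SELECT t_main.*
    FROM t_main
    LEFT OUTER JOIN t_related ON t_main.id = t_related.id
    ORDER BY t_main.id
""").fetchall()

print(left_join_is_null, not_exists, plain_left_join)
```

The IS NULL test is only a reliable "no match" signal when the tested column cannot itself be NULL in t_related, which is why the join column is used here.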

As Michael already answered your question, here is a quick sample to illustrate the difference:
Table A
Key Data
1 somedata1
2 somedata2
Table B
Key Data
1 data1
Left outer join:
SELECT *
FROM A
LEFT OUTER JOIN B
ON A.Key = B.Key
Result:
Key Data Key Data
1 somedata1 1 data1
2 somedata2 null null
EXISTS use:
SELECT *
FROM A WHERE EXISTS ( SELECT B.Key FROM B WHERE A.Key = B.Key )
Result:
Key Data
1 somedata1
NOT EXISTS use:
SELECT *
FROM A WHERE NOT EXISTS ( SELECT B.Key FROM B WHERE A.Key = B.Key )
Result:
Key Data
2 somedata2

Left outer join is more flexible than where not exists. You must use a left outer join if you want to return any of the columns from the child table. You can also use the left outer join to return records that match the parent table as well as all records in the parent table that have no match. Where not exists only lets you return the records with no match.
However, in the case where they do return the equivalent rows and you do not need any of the columns from the right table, where not exists is likely to be the more performant choice (at least in SQL Server; I don't know about other databases).

I suspect the answer ultimately is, both are used (among other constructs) to perform the relational operation antijoin in SQL.

I suspect the OP wanted to know which construct is better when they are functionally the same (ie I want to see only rows where there is no match in the secondary table).
As such, WHERE NOT EXISTS will typically be as quick or quicker (the actual plan depends on the database's optimizer), so it is a good habit to get into.

Join SQL query to get data from two tables

I'm a newbie, just learning SQL, and have this question: I have two tables with the same columns. Some records (registers) are in both tables, but others are in only one of them. To illustrate, suppose table A = (1,2,3,4) and table B = (3,4,5,6), where the numbers are records. I need to select all records in table B that are not in table A, that is, result = (5,6). What query should I use? Maybe a join? Thanks.

You can either use a NOT IN query like this:
SELECT col from B where col not in (select col from A)
or use an outer join:
select B.col
from B LEFT OUTER JOIN A on B.col=A.col
where A.col is NULL
The first is easier to understand, but the second is easier to use with more tables in the query.

Select register from TABLE_B b
Where not exists (Select register from TABLE_A a where a.register = b.register)
I assumed you have a column named register in TABLE_A and TABLE_B
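The approaches in these answers can be checked against the question's own numbers with a short sqlite3 sketch in Python. The column name col is an assumption, as in the first answer. The last part adds a NULL to table A to show a known difference: NOT IN returns nothing once the subquery yields a NULL, while NOT EXISTS is unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (col INTEGER);
CREATE TABLE B (col INTEGER);
INSERT INTO A VALUES (1), (2), (3), (4);
INSERT INTO B VALUES (3), (4), (5), (6);
""")

NOT_IN = "SELECT col FROM B WHERE col NOT IN (SELECT col FROM A) ORDER BY col"
OUTER_JOIN = """
    SELECT B.col
    FROM B LEFT OUTER JOIN A ON B.col = A.col
    WHERE A.col IS NULL
    ORDER BY B.col
"""
NOT_EXISTS = """
    SELECT b.col FROM B b
    WHERE NOT EXISTS (SELECT 1 FROM A a WHERE a.col = b.col)
    ORDER BY b.col
"""

not_in = conn.execute(NOT_IN).fetchall()
outer_join = conn.execute(OUTER_JOIN).fetchall()
not_exists = conn.execute(NOT_EXISTS).fetchall()

# With a NULL in A, "col NOT IN (...)" evaluates to NULL for every
# unmatched row, so the NOT IN form filters everything out.
conn.execute("INSERT INTO A VALUES (NULL)")
not_in_with_null = conn.execute(NOT_IN).fetchall()
not_exists_with_null = conn.execute(NOT_EXISTS).fetchall()

print(not_in, outer_join, not_exists)
print(not_in_with_null, not_exists_with_null)
```

If the compared column is nullable, the NOT EXISTS or outer join form is therefore the safer default.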

MySQL: Multi-column join on several tables

I have several tables that I am joining that I need to add another table to and I can't seem to get the right query. Here is what I have now -
Table 1
carid, catid, makeid, modelid, caryear
Table 2
makeid, makename
Table 3
modelid, modelname
Table 4
catid, catname
The query I am using to join these is:
SELECT * FROM table1 a
JOIN table2 b on a.makeid=b.makeid
JOIN table3 c on a.modelid=c.modelid
JOIN table4 d on a.catid=d.catid
WHERE a.carid = $carid;
Now I need to add a 5th table that I am getting from a 3rd party that I am having a hard time adding to my existing query. The new table has these fields -
Table 5
id, year, make, model, citympg, hwympg
I need the citympg and hwympg based on caryear from table 1, makename from table 2, and modelname from table 3. I know I can do a second query with those values, but I would prefer to do a single query and have all of the data in a single row. Can this be done in a single query? If so, how?

It's possible to have more than one condition in a join.
Does this work?
SELECT a.*, e.citympg, e.hwympg
FROM table1 a
JOIN table2 b on a.makeid=b.makeid
JOIN table3 c on a.modelid=c.modelid
JOIN table4 d on a.catid=d.catid
Join table5 e on b.makename = e.make
and c.modelname = e.model
and a.caryear = e.year
WHERE a.carid = $carid;
...though your question is not clear. Did you only want to join table5 to the others, or was there something else you wanted to do with table5?

Without indexes, it won't be efficient, but you can do this (using the aliases from your query, with e for table5):
LEFT JOIN table5 e ON (b.makename = e.make AND c.modelname = e.model AND a.caryear = e.year)
This also assumes the make, model and year values match exactly.
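For reference, here is a minimal sketch of the multi-column join, run on sqlite3 from Python rather than MySQL. The table and column names follow the question; the rows (MakeX, ModelY, the mpg numbers) are invented placeholders, not real fuel-economy data. It uses the LEFT JOIN variant, so a car with no matching table5 row still comes back, with NULL mpg values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (carid INTEGER, catid INTEGER, makeid INTEGER, modelid INTEGER, caryear INTEGER);
CREATE TABLE table2 (makeid INTEGER, makename TEXT);
CREATE TABLE table3 (modelid INTEGER, modelname TEXT);
CREATE TABLE table4 (catid INTEGER, catname TEXT);
CREATE TABLE table5 (id INTEGER, year INTEGER, make TEXT, model TEXT, citympg INTEGER, hwympg INTEGER);
-- Made-up sample rows, not real fuel-economy figures.
INSERT INTO table1 VALUES (1, 1, 1, 1, 2010), (2, 1, 1, 1, 1999);
INSERT INTO table2 VALUES (1, 'MakeX');
INSERT INTO table3 VALUES (1, 'ModelY');
INSERT INTO table4 VALUES (1, 'CatZ');
INSERT INTO table5 VALUES (1, 2010, 'MakeX', 'ModelY', 25, 36),
                          (2, 2011, 'MakeX', 'ModelY', 28, 39);
""")

QUERY = """
SELECT a.carid, b.makename, c.modelname, d.catname, e.citympg, e.hwympg
FROM table1 a
JOIN table2 b ON a.makeid = b.makeid
JOIN table3 c ON a.modelid = c.modelid
JOIN table4 d ON a.catid = d.catid
LEFT JOIN table5 e
  ON b.makename = e.make AND c.modelname = e.model AND a.caryear = e.year
WHERE a.carid = ?
"""

def car_with_mpg(carid):
    # The carid is bound as a parameter rather than pasted into the SQL text.
    return conn.execute(QUERY, (carid,)).fetchone()

matched = car_with_mpg(1)    # 2010 MakeX ModelY has a table5 row
unmatched = car_with_mpg(2)  # 1999 has none, so the mpg columns are NULL
print(matched, unmatched)
```

Binding the id as a parameter is a design choice of this sketch; the question's $carid would be passed the same way through whatever database API the application uses.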
