Remove Duplicate Join Columns in Hive Joins

I am performing joins in Hive:
select * from
(select * from
(select * from A join B on A.x = B.x) t1
join C on t1.y = C.y) t2
join D on t2.x = D.x
I am getting "column x cannot be resolved" because A and B both contain a column x. How should I use a qualified name, or is there a way to drop the duplicate column in Hive?

Because tables A and B both have a column x, you must assign an alias to that column within this select:
select * from A join B on A.x = B.x
Something like this:
select A.x as x1, B.x as x2, ...
from A join B on A.x = B.x
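The effect of the aliases can be tried out without a Hive cluster. The following is only a sketch using Python's built-in sqlite3 module, not Hive itself; the columns y and b_val and the sample rows are made up for illustration. It shows that select * over the join exposes two columns named x, while the aliased select gives every column a unique name that an outer query can refer to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: only the shared column x comes from the question.
    CREATE TABLE A (x INTEGER, y INTEGER);
    CREATE TABLE B (x INTEGER, b_val TEXT);
    INSERT INTO A VALUES (1, 10);
    INSERT INTO B VALUES (1, 'b1');
""")

# SELECT * over the join returns both A.x and B.x under the same name.
star = conn.execute("SELECT * FROM A JOIN B ON A.x = B.x")
star_names = [d[0] for d in star.description]

# Aliasing each x gives the enclosing query unambiguous names to work with.
aliased = conn.execute(
    "SELECT A.x AS x1, B.x AS x2, A.y AS y, B.b_val AS b_val "
    "FROM A JOIN B ON A.x = B.x"
)
aliased_names = [d[0] for d in aliased.description]

print(star_names)
print(aliased_names)
```

In the nested query from the question, the same aliasing would go into the innermost select, so that t1 has only one column called x.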

You can do something like the following, but it means you cannot use special characters in column names.
set hive.support.quoted.identifiers=none;
select * from
(select C.*,t1.`(y)?+.+` from
(select A.*,B.`(x)?+.+` from A join B on A.x = B.x) t1
join C on t1.y = C.y) t2
join D on t2.x = D.x
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
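To see which columns a pattern such as `(x)?+.+` keeps, here is a small sketch in Python. It assumes that Hive matches the quoted pattern against the whole column name, and the column names in the list are made up. Python's re module only supports possessive quantifiers like ?+ from version 3.11, so the sketch uses a negative lookahead that accepts the same names: everything except a column called exactly x.

```python
import re

# Stand-in for the Java pattern (x)?+.+ : any non-empty name except exactly "x".
# Assumption: the pattern has to match the entire column name.
everything_but_x = re.compile(r"(?!x$).+")

# Hypothetical column names of table B.
columns = ["x", "y", "xy", "x_id", "z"]

kept = [name for name in columns if everything_but_x.fullmatch(name)]
print(kept)  # ['y', 'xy', 'x_id', 'z']
```

Note that names merely starting with x (xy, x_id) are kept; only the exact name x is dropped.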

I had exactly the same problem, and the solution for me was simply to rename the duplicate columns by recreating the DataFrame with a modified schema. Here is some sample code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
import scala.collection.mutable

def renameDuplicatedColumns(df: DataFrame): DataFrame = {
  // Column names that occur more than once in the DataFrame
  val duplicatedColumns = df.columns
    .groupBy(identity)
    .filter(_._2.length > 1)
    .keys
    .toSet
  // Next numeric suffix to hand out for each duplicated name
  val newIndexes = mutable.Map[String, Int]().withDefaultValue(0)
  val schema: StructType = StructType(
    df.schema
      .collect {
        // Duplicated names become name__0, name__1, ...
        case field if duplicatedColumns.contains(field.name) =>
          val idx = newIndexes(field.name)
          newIndexes.update(field.name, idx + 1)
          field.copy(name = field.name + "__" + idx)
        case field =>
          field
      }
  )
  // Same rows, new schema with unique column names
  df.sqlContext.createDataFrame(df.rdd, schema)
}

Related

how to update BigQuery table using struct column

My BigQuery table T1 is as below.
T1
A string,
B string,
C Record,
C.key string,
C.formula string,
D string
I want to update column D based on B and C, with query something like below.
update T1
set D = 'd1'
where B = 'b1' and C.formula = 'f1' ;
How can I do that in BigQuery?

Even simpler version
update T1 t
set D = 'd1'
where B = 'b1' and
'f1' in (select formula from t.C);

You don't have a record for C. You have an array of records. So, you can use unnest():
update T1
set D = 'd1'
where B = 'b1' and
exists (select 1
from unnest(T1.C) as c_el
where c_el.formula = 'f1'
);

SQL : matching two tables with two possibles conditions

Is there a way in SQL, when joining two tables table_A and table_B, to try a second criterion, criteria_Y, only if the two tables cannot be matched on a first criterion, criteria_X, at all?
Something like this:
select *
from table_A, table_B
where table_A.id = table_B.id2
and (if there is no row where table_B.criteria_X = X then try table_B.criteria_Y = Y)
The following query is not a solution:
..
and (table_B.criteria_X = X OR table_B.criteria_Y = Y)
Thanks

This is a "find the best match" query:
select *
from
(
select *,
row_number() -- based on priority, #1 criteria_X, #2 criteria_Y
over (partition by table_A.id
order by case when table_B.criteria_X = X then 1
else 2
end) as best_match
from table_A, table_B
where table_A.id = table_B.id2
and (table_B.criteria_X = X OR table_B.criteria_Y = Y)
) dt
where best_match = 1
If the ORed condition results in losing indexed access, you might try splitting it into two UNION ALL selects.
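One possible shape of that UNION ALL split, sketched with Python's sqlite3 module: the first branch takes the criteria_X matches, and the second branch takes criteria_Y matches only for ids that have no criteria_X match at all. The sample rows, the colx column and the literals 'X' and 'Y' are made up. Unlike the row_number() query, this variant returns every row of the winning criterion rather than exactly one row per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; colx stands for any payload column.
    CREATE TABLE table_A (id INTEGER PRIMARY KEY);
    CREATE TABLE table_B (id2 INTEGER, criteria_X TEXT, criteria_Y TEXT, colx TEXT);
    INSERT INTO table_A VALUES (1), (2), (3);
    INSERT INTO table_B VALUES
        (1, 'X', 'Y', 'preferred for 1'),
        (1, 'n', 'Y', 'fallback for 1'),
        (2, 'n', 'Y', 'fallback for 2'),
        (3, 'n', 'n', 'unmatched');
""")

sql = """
    -- Branch 1: rows matching the preferred criterion.
    SELECT a.id, b.colx
    FROM table_A a JOIN table_B b ON a.id = b.id2
    WHERE b.criteria_X = :x
    UNION ALL
    -- Branch 2: fallback rows, only for ids without any criteria_X match.
    SELECT a.id, b.colx
    FROM table_A a JOIN table_B b ON a.id = b.id2
    WHERE b.criteria_Y = :y
      AND NOT EXISTS (SELECT 1 FROM table_B bx
                      WHERE bx.id2 = a.id AND bx.criteria_X = :x)
    ORDER BY 1
"""
rows = conn.execute(sql, {"x": "X", "y": "Y"}).fetchall()
print(rows)  # [(1, 'preferred for 1'), (2, 'fallback for 2')]
```

Each branch has a single equality predicate on table_B, which is what gives the optimizer a chance to use an index per branch.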

A typical method uses left join twice, once for each criterion, and then uses coalesce() in the select. With indexes on the join keys, this should also have very good performance:
select a.*, coalesce(b1.colx, b2.colx)
from table_A a left join
table_B b1
on a.id = b1.id2 and b1.criteria_X = X left join
table_B b2
on a.id = b2.id2 and b2.criteria_Y = Y
where b1.id2 is not null or b2.id2 is not null;
The where clause ensures that at least one of the two joins found a match.
This does not work under all circumstances -- in particular, each join needs to return only 0 or 1 matching rows. This is often the situation with this type of "priority" join.
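The double left join and its caveat can be tried out with Python's sqlite3 module. This is only a sketch: the sample rows, the colx values and the literals 'X' and 'Y' are made up, and the second join is keyed on b2.id2. Id 1 matches on criteria_X, while id 2 has two rows matching criteria_Y, so it comes back twice, which is exactly the "0 or 1 matching rows" limitation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE table_A (id INTEGER PRIMARY KEY);
    CREATE TABLE table_B (id2 INTEGER, criteria_X TEXT, criteria_Y TEXT, colx TEXT);
    INSERT INTO table_A VALUES (1), (2), (3);
    INSERT INTO table_B VALUES
        (1, 'X', 'n', 'via X'),
        (2, 'n', 'Y', 'via Y #1'),
        (2, 'n', 'Y', 'via Y #2');
""")

rows = conn.execute("""
    SELECT a.id, COALESCE(b1.colx, b2.colx) AS matched
    FROM table_A a
    LEFT JOIN table_B b1 ON a.id = b1.id2 AND b1.criteria_X = 'X'
    LEFT JOIN table_B b2 ON a.id = b2.id2 AND b2.criteria_Y = 'Y'
    WHERE b1.id2 IS NOT NULL OR b2.id2 IS NOT NULL
    ORDER BY a.id, matched
""").fetchall()

print(rows)  # [(1, 'via X'), (2, 'via Y #1'), (2, 'via Y #2')]
```

Id 3 has no row in table_B at all and is filtered out by the where clause.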
An alternative version uses row_number(). This is similar to @dnoeth's approach, but the row number calculation is done before the join:
select a.*, b.colx
from table_A a join
(select b.*,
row_number() over (partition by id2
order by (case when criteria_x = X then 1
when criteria_y = Y then 2
end)
) as seqnum
from table_B b
where criteria_x = X or criteria_y = Y
) b
on a.id = b.id2 and seqnum = 1

Update table for all entries from select statement

In my scenario, I select all the entries from a table where the condition is true, put them into a vector, and run an update statement in a loop, passing in the vector's values. It works.
SELECT * FROM MAP AS A WHERE EXISTS
(SELECT (X, Y) FROM MAP AS B WHERE B.X = A.X + 1 AND B.Y = A.Y ) AND EXISTS
(SELECT (X, Y) FROM MAP AS C WHERE C.X = A.X - 1 AND C.Y = A.Y ) ;
for...
UPDATE MAP SET VAL = 2 WHERE X = ? AND Y = ?;
...
But I wanted to use a single statement to achieve this. While a table can be updated using a select statement, in my scenario there are two keys that need to be checked before selecting a record, so I'm not able to put a where condition on x and y together.
UPDATE MAP SET USED = 1 WHERE EXISTS (
SELECT * FROM MAP AS A WHERE EXISTS
(SELECT (X, Y) FROM MAP AS B WHERE B.X = A.X + 1 AND B.Y = A.Y ) AND EXISTS
(SELECT (X, Y) FROM MAP AS C WHERE C.X = A.X - 1 AND C.Y = A.Y ) );
When I put in a where exists condition like the one above, it updates all the entries. How do I update the table in one query?

The problem with your subquery is that it does not refer to the table (MAP) in the UPDATE statement.
Just drop the MAP AS A subquery and refer to MAP directly (UPDATE does not allow table aliases):
UPDATE MAP
SET USED = 1
WHERE EXISTS (SELECT 1 FROM MAP AS B WHERE B.X = MAP.X + 1 AND B.Y = MAP.Y)
AND EXISTS (SELECT 1 FROM MAP AS C WHERE C.X = MAP.X - 1 AND C.Y = MAP.Y)
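The difference between the two statements can be reproduced with Python's sqlite3 module. This is a sketch with made-up cells: a horizontal strip (1,1), (2,1), (3,1) plus an isolated cell (9,9), so only (2,1) has neighbours on both sides. The helper names build_map and used_cells are hypothetical.

```python
import sqlite3

def build_map():
    """Create a fresh in-memory MAP table with a few sample cells."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MAP (X INTEGER, Y INTEGER, USED INTEGER DEFAULT 0)")
    conn.executemany("INSERT INTO MAP (X, Y) VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 1), (9, 9)])
    return conn

def used_cells(conn):
    return conn.execute(
        "SELECT X, Y FROM MAP WHERE USED = 1 ORDER BY X, Y").fetchall()

# Statement from the question: the subquery never mentions the row being
# updated, so EXISTS is either true for every row or for none.
conn = build_map()
conn.execute("""
    UPDATE MAP SET USED = 1 WHERE EXISTS (
        SELECT * FROM MAP AS A
        WHERE EXISTS (SELECT 1 FROM MAP AS B WHERE B.X = A.X + 1 AND B.Y = A.Y)
          AND EXISTS (SELECT 1 FROM MAP AS C WHERE C.X = A.X - 1 AND C.Y = A.Y))
""")
uncorrelated = used_cells(conn)

# Correlated version: each subquery refers to MAP.X and MAP.Y of the current row.
conn = build_map()
conn.execute("""
    UPDATE MAP SET USED = 1
    WHERE EXISTS (SELECT 1 FROM MAP AS B WHERE B.X = MAP.X + 1 AND B.Y = MAP.Y)
      AND EXISTS (SELECT 1 FROM MAP AS C WHERE C.X = MAP.X - 1 AND C.Y = MAP.Y)
""")
correlated = used_cells(conn)

print(uncorrelated)  # [(1, 1), (2, 1), (3, 1), (9, 9)]
print(correlated)    # [(2, 1)]
```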

Since you've verified the subquery is returning the rows you want to update, your update should then look something like this:
UPDATE MAP SET USED = 1
WHERE (X,Y) IN (
SELECT X, Y FROM MAP AS A WHERE EXISTS
(SELECT X, Y FROM MAP AS B WHERE B.X = A.X + 1 AND B.Y = A.Y ) AND EXISTS
(SELECT X, Y FROM MAP AS C WHERE C.X = A.X - 1 AND C.Y = A.Y ) );

update Informix table with joins

Is this the correct syntax for an Informix update?
update table1
set table1.code = 100
from table1 a, table2 b, table3 c
where a.key = c.key
a.no = b.no
a.key = c.key
a.code = 10
b.tor = 'THE'
a.group = 4183
a.no in ('1111','1331','1345')
I get the generic -201 'A syntax error has occurred' message, but I can't see what's wrong.

Unfortunately, the accepted answer causes a syntax error in Informix Dynamic Server Version 11.50.
This is the only way I found to avoid the syntax error:
update table1
set code = (
select 100
from table2 b, table3 c
where table1.key = c.key
and table1.no = b.no
and table1.key = c.key
and table1.code = 10
and b.tor = 'THE'
and table1.group = 4183
and table1.no in ('1111','1331','1345')
)
BTW, to get the Informix version, run the following SQL:
select first 1 dbinfo("version", "full") from systables;
Updated: also see this answer.
Updated: also see the docs.
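One thing to watch with a scalar subquery in SET and no WHERE clause on the UPDATE itself: in standard SQL, and in SQLite as used in the sketch below, rows for which the subquery finds nothing get NULL assigned. Whether Informix behaves identically is worth checking on your version. The schema is a simplified, hypothetical one (item_no instead of no, only two tables), and the guarded variant with EXISTS is an assumption about how to restrict the update, not something taken from the answer.

```python
import sqlite3

def fresh():
    """Fresh in-memory database with a simplified, made-up schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (item_no TEXT, code INTEGER);
        CREATE TABLE table2 (item_no TEXT, tor TEXT);
        INSERT INTO table1 VALUES ('1111', 10), ('2222', 10);
        INSERT INTO table2 VALUES ('1111', 'THE');
    """)
    return conn

def codes(conn):
    return conn.execute(
        "SELECT item_no, code FROM table1 ORDER BY item_no").fetchall()

# Scalar subquery only: every row of table1 is assigned a value,
# and rows without a match receive NULL.
conn = fresh()
conn.execute("""
    UPDATE table1
    SET code = (SELECT 100 FROM table2 b
                WHERE table1.item_no = b.item_no
                  AND table1.code = 10
                  AND b.tor = 'THE')
""")
unguarded = codes(conn)

# Guarded: the WHERE clause limits the update to rows that have a match.
conn = fresh()
conn.execute("""
    UPDATE table1
    SET code = 100
    WHERE code = 10
      AND EXISTS (SELECT 1 FROM table2 b
                  WHERE b.item_no = table1.item_no AND b.tor = 'THE')
""")
guarded = codes(conn)

print(unguarded)  # [('1111', 100), ('2222', None)]
print(guarded)    # [('1111', 100), ('2222', 10)]
```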

Your syntax error is table1.code:
set table1.code = 100
Change this to:
set a.code = 100
Full code:
update table1
set a.code = 100
from table1 a, table2 b, table3 c
where a.key = c.key
and a.no = b.no
and a.key = c.key
and a.code = 10
and b.tor = 'THE'
and a.group = 4183
and a.no in ('1111','1331','1345')

The original SQL in the question was:
update table1
set table1.code = 100
from table1 a, table2 b, table3 c
where a.key = c.key
a.no = b.no
a.key = c.key
a.code = 10
b.tor = 'THE'
a.group = 4183
a.no in ('1111','1331','1345')
This is unconditionally missing a series of AND keywords. The accepted solution also identifies a problem in the SET clause with the use of table1 instead of its alias a. That might be material; I can't test it (see discussion below). So, assuming that the join UPDATE is accepted at all, the corrected SQL should read:
UPDATE table1
SET a.code = 100
FROM table1 a, table2 b, table3 c
WHERE a.key = c.key
AND a.no = b.no
AND a.key = c.key
AND a.code = 10
AND b.tor = 'THE'
AND a.group = 4183
AND a.no IN ('1111','1331','1345')
This is the same as the (syntax-corrected) accepted answer. However, I'm curious to know which version of Informix you are using that accepts the FROM syntax (maybe XPS?). I'm using IDS 11.70.FC2 (3 fix packs behind the current 11.70.FC5 version) on Mac OS X 10.7.4, and I can't get the UPDATE with FROM syntax to work. Further, the manual for UPDATE at IBM's Informix 11.70 Information Center does not mention it. I'm not sure whether it would make any difference if you're using ODBC or JDBC; it shouldn't, but I'm using ESQL/C, which sends the SQL unchanged to the server.
The notation I tried is (+ is the prompt):
+ BEGIN;
+ CREATE TABLE a(a INTEGER NOT NULL, x CHAR(10) NOT NULL, y DATE NOT NULL);
+ INSERT INTO a(a, x, y) VALUES(1, 'obsoletely', '2012-04-01');
+ INSERT INTO a(a, x, y) VALUES(2, 'absolutely', '2012-06-01');
+ CREATE TABLE b(b INTEGER NOT NULL, p CHAR(10) NOT NULL, q DATE NOT NULL);
+ INSERT INTO b(b, p, q) VALUES(3, 'daemonic', '2012-07-01');
+ SELECT * FROM a;
1|obsoletely|2012-04-01
2|absolutely|2012-06-01
+ SELECT * FROM b;
3|daemonic|2012-07-01
+ SELECT *
FROM a, b
WHERE a.a < b.b
AND b.p MATCHES '*a*e*';
1|obsoletely|2012-04-01|3|daemonic|2012-07-01
2|absolutely|2012-06-01|3|daemonic|2012-07-01
+ UPDATE a
SET x = 'crumpet'
FROM a, b
WHERE a.a < b.b
AND b.p MATCHES '*a*e*';
SQL -201: A syntax error has occurred.
SQLSTATE: 42000 at <<temp>>:23
+ SELECT * FROM a;
1|obsoletely|2012-04-01
2|absolutely|2012-06-01
+ ROLLBACK;

It depends on the version you are using. If you are using at least 11.50 the best solution would be:
MERGE INTO table1 as t1
USING table2 as t2
ON t1.ID = t2.ID
WHEN MATCHED THEN UPDATE set (t1.col1, t1.col2) = (t2.col1, t2.col2);
The UPDATE ... SET ... FROM syntax was removed in versions later than 11.50.
If you are using an earlier version, you can go with:
UPDATE t SET a = t2.a FROM t, t2 WHERE t.b = t2.b;

For Informix SE 7.25, the UPDATE ... FROM ... syntax does not exist.
You also get the error "Cannot modify table or view used in subquery" when using Rockallite's answer.
Another solution is to break the update down into two queries.
First, get the ROWIDs for the required records (filtered on multiple tables):
SELECT a.ROWID
FROM table1 a, table2 b, table3 c
WHERE a.key = c.key
AND a.no = b.no
AND a.key = c.key
AND a.code = 10
AND b.tor = 'THE'
AND a.group = 4183
AND a.no IN ('1111','1331','1345')
Put the result into a comma-separated string.
Then, update only those records for the main table where the ROWID was found in the first query:
UPDATE table1 a
SET a.code = 100
WHERE a.ROWID in ([comma separated ROWIDs found above])
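The two steps can be sketched from client code as follows. This uses Python's sqlite3 module as a stand-in, since SQLite also exposes a ROWID; an Informix driver would follow the same pattern but is not shown here. The schema is simplified and hypothetical (item_no instead of no, two tables instead of three). Instead of pasting the ROWIDs into the SQL text as a comma-separated string, the sketch generates one ? placeholder per value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, made-up schema and rows.
    CREATE TABLE table1 (item_no TEXT, code INTEGER);
    CREATE TABLE table2 (item_no TEXT, tor TEXT);
    INSERT INTO table1 VALUES ('1111', 10), ('1331', 10), ('2222', 10);
    INSERT INTO table2 VALUES ('1111', 'THE'), ('1331', 'THE'), ('2222', 'A');
""")

# Step 1: collect the ROWIDs of the rows to change, filtering across both tables.
rowids = [row[0] for row in conn.execute("""
    SELECT a.ROWID
    FROM table1 a, table2 b
    WHERE a.item_no = b.item_no
      AND a.code = 10
      AND b.tor = 'THE'
""")]

# Step 2: update only those rows, with one placeholder per collected ROWID.
if rowids:
    placeholders = ", ".join("?" for _ in rowids)
    conn.execute(
        f"UPDATE table1 SET code = 100 WHERE ROWID IN ({placeholders})",
        rowids,
    )

result = conn.execute(
    "SELECT item_no, code FROM table1 ORDER BY item_no").fetchall()
print(result)  # [('1111', 100), ('1331', 100), ('2222', 10)]
```

Because the two steps are separate statements, rows can change in between unless both run inside one transaction.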

full outer join 3 tables with matching index in PostgreSQL

I have an SQL query
SELECT * FROM A FULL OUTER JOIN B ON A.z = B.z WHERE A.z = 1 OR B.z = 1
where A.z and B.z are primary keys.
The purpose is to do a full outer join on two tables where their primary keys match a given value, so that only one row is returned.
But I got confused about how to extend it to 3 or more tables. The restriction remains that their primary keys match a given index, so that only one row is returned in total. How do you do it?

First, note that in the provided query, the FULL OUTER JOIN that you request could be rewritten as:
SELECT *
FROM (SELECT * FROM A WHERE z = 1) A
FULL OUTER JOIN (SELECT * FROM B WHERE z = 1) B ON A.z = B.z
which (IMO) makes it clearer what the data sources are and what the join condition is. For a moment, with your WHERE condition, I had the feeling that you actually wanted an INNER JOIN.
With this form you can probably extend the query more easily:
SELECT *
FROM (SELECT * FROM A WHERE z = 1) A
FULL OUTER JOIN (SELECT * FROM B WHERE z = 1) B ON A.z = B.z
FULL OUTER JOIN (SELECT * FROM C WHERE z = 1) C ON COALESCE(A.z,B.z) = C.z
FULL OUTER JOIN (SELECT * FROM D WHERE z = 1) D ON COALESCE(A.z,B.z,C.z) = D.z
