Dynamically add a prefix to Spark Dataset columns while joining, without specifying individual column names - apache-spark-sql

I am joining two Spark datasets as follows:
Dataset<Row> dataDF = merc.as("merc").join(ded.as("ded"),col("merc.id").equalTo(col("ded.id")).and(
col("merc.mid").equalTo(col("ded.mid"))), "outer");
Both datasets have the same schema.
Schema: id, mid, pid, zid
The dataDF schema is id, mid, pid, zid, id, mid, pid, zid, since I am doing an outer join.
When I try to write the result in Parquet format, I get the following error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/xyz: `id`, `mid`, `pid`, `zid`;
How can I rename the columns of dataDF dynamically while doing the join (for example by adding the table name as a prefix, as in merc.id), without specifying individual columns, so that the schema of dataDF becomes
ded.id, ded.mid, ded.pid, ded.zid, merc.id, merc.mid, merc.pid, merc.zid

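Try renaming the columns of one of the tables before the join.
For example, in PySpark:
from pyspark.sql.functions import col
# Rename the join keys on the merc side first (add pid and zid the same way if they are needed).
merc_renamed = merc.select(col("id").alias("merc_id"), col("mid").alias("merc_mid"))
merc_renamed.join(ded.alias("ded"), (col("merc_id") == col("ded.id")) \
& (col("merc_mid") == col("ded.mid")), how="outer")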
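To avoid naming columns one by one, the new names can be derived from the DataFrame's own column list. The following is a minimal sketch, not a tested Spark job: it assumes a PySpark DataFrame (only its columns attribute and its toDF(*names) method are used), and the helper names prefixed_names and with_prefix are made up for illustration. It uses an underscore rather than a dot as the separator, because a column literally named merc.id has to be escaped with backticks every time it is referenced.

```python
def prefixed_names(columns, prefix, sep="_"):
    """Build '<prefix><sep><name>' for every column name, keeping the order."""
    return [f"{prefix}{sep}{name}" for name in columns]


def with_prefix(df, prefix, sep="_"):
    """Return df with every column renamed to carry the prefix.

    Assumes df is a PySpark DataFrame: df.columns lists the column names and
    df.toDF(*names) returns a DataFrame with the columns renamed positionally.
    """
    return df.toDF(*prefixed_names(df.columns, prefix, sep))


# Hypothetical usage with the merc and ded DataFrames from the question:
# merc_p = with_prefix(merc, "merc")
# ded_p = with_prefix(ded, "ded")
# joined = merc_p.join(
#     ded_p,
#     (merc_p["merc_id"] == ded_p["ded_id"]) & (merc_p["merc_mid"] == ded_p["ded_mid"]),
#     "outer",
# )
```

With both sides renamed this way, the joined result would carry merc_id, merc_mid, merc_pid, merc_zid, ded_id, ded_mid, ded_pid, ded_zid, so the Parquet write should no longer see duplicate names. The Java Dataset API also has a toDF(String...) method, so the same idea should carry over to the Java code in the question.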

Related

Why is Big Query creating a new column instead of joining two columns when using a Join?

When I use a JOIN in BigQuery, the query completes, but the result contains extra columns named Id_1 and Date_1 that hold the same information as the key columns. What could cause this? Here is the code.
SELECT
*
FROM
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`
JOIN
`bellabeat-case-study-373821.bellabeat_case_study.sleep_day`
ON
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`.Id = `bellabeat-case-study-373821.bellabeat_case_study.sleep_day`.Id
AND `bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`.Date = `bellabeat-case-study-373821.bellabeat_case_study.sleep_day`.Date
I ran the query expecting the tables to be joined on the primary keys Id and Date, but instead it produced two new columns with the same information.
When you use * in the select list, the ON variant of a JOIN clause produces all columns from both tables in the result set. If columns with the same name exist on both sides, both show up in the result (with slightly different names), as you can see.
You can use the USING variant of the JOIN clause instead, that merges the columns and produces only one resulting column for each column mentioned in the USING clause. This is probably what you want. See BigQuery - INNER JOIN.
Your query could take the form:
SELECT
*
FROM
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`
JOIN
`bellabeat-case-study-373821.bellabeat_case_study.sleep_day`
USING (Id, Date)
Note: USING can only be used when the columns you want to join with have the exact same name. It won't be possible to use it if a column is, for example, called id in one table and employee_id in the other one.
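The difference between the two join forms can be checked on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery, with two made-up tables (activity and sleep, and a key column named day rather than Date); it only compares the column lists that SELECT * produces. SQLite keeps duplicate names as they are rather than adding a _1 suffix as BigQuery does, but the ON form still returns the key columns twice and the USING form once.

```python
import sqlite3

# In-memory database with two hypothetical tables that share the keys id and day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (id INTEGER, day TEXT, steps INTEGER);
    CREATE TABLE sleep (id INTEGER, day TEXT, minutes INTEGER);
    INSERT INTO activity VALUES (1, '2016-04-12', 13162);
    INSERT INTO sleep VALUES (1, '2016-04-12', 327);
""")


def column_names(sql):
    """Run a query and return the names of its result columns."""
    return [d[0] for d in conn.execute(sql).description]


on_cols = column_names(
    "SELECT * FROM activity JOIN sleep "
    "ON activity.id = sleep.id AND activity.day = sleep.day"
)
using_cols = column_names("SELECT * FROM activity JOIN sleep USING (id, day)")

print(on_cols)     # key columns appear once per table
print(using_cols)  # key columns appear once in total
```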

SQL query : name columns by "columnname.field"

Hello, I've written the following query:
SELECT *
FROM [woJob]
LEFT JOIN [woJobTask]
ON [woJob].jobID=[woJobTask].jobID
The result it returns has duplicate column names. Is it possible to name each column as table.field, for example woJob.jobID and woJobTask.jobID?
My workflow is to use SQL to get the data out of the database and then use pandas (a Python library) to explore it. Duplicate column names make analyzing the data in Python more complicated. I want to pull all the data out with column names that show which table each column belongs to, and then analyze it in pandas, where I can drop any columns I don't want.
You need to enumerate the columns and assign aliases as needed.
You did not say what the columns of the tables are, so here is a contrived example, assuming columns jobid, name and value in both tables:
SELECT j.jobid, j.name, j.value, jt.name as jt_name, jt.value as jt_value
FROM [woJob] j
LEFT JOIN [woJobTask] jt ON j.jobid = jt.jobid
Or more simply:
SELECT j.*, jt.name as jt_name, jt.value as jt_value
FROM [woJob] j
LEFT JOIN [woJobTask] jt ON j.jobid = jt.jobid
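Since the data ends up in pandas anyway, the aliased select list could also be generated rather than typed by hand. The following is a rough sketch under several assumptions: it runs against an in-memory SQLite database instead of the real one, the table contents are invented, the helper name qualified_select_list is hypothetical, and the column names are read with SQLite's PRAGMA table_info (another database would need its own catalog query, and its identifier quoting rules may differ).

```python
import sqlite3


def qualified_select_list(tables):
    """Build a select list that aliases every column as 'table.column'.

    tables maps each table name to the list of its column names.
    """
    parts = []
    for table, columns in tables.items():
        for column in columns:
            parts.append(f'"{table}"."{column}" AS "{table}.{column}"')
    return ", ".join(parts)


# Stand-in database with invented contents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE woJob (jobID INTEGER, name TEXT);
    CREATE TABLE woJobTask (jobID INTEGER, name TEXT);
    INSERT INTO woJob VALUES (1, 'repair');
    INSERT INTO woJobTask VALUES (1, 'inspect');
""")

# Read each table's column names from the SQLite catalog (name is field 1).
tables = {
    table: [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    for table in ("woJob", "woJobTask")
}

sql = (
    f"SELECT {qualified_select_list(tables)} "
    "FROM woJob LEFT JOIN woJobTask ON woJob.jobID = woJobTask.jobID"
)
cursor = conn.execute(sql)
labels = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(labels)
```

The resulting SQL string could then be passed to pandas.read_sql_query, which should give a DataFrame whose column labels show the table each column came from.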

How to drop one join key when joining two tables

I have two tables, both with a lot of columns. They share a common column called ID, on which I want to join.
Since ID is present in both tables, if I simply do this:
select a.*,b.*
from table_a as a
left join table_b as b on a.id=b.id
This gives an error because id is duplicated (it is present in both tables and gets included from both).
I don't want to write out each column of b separately in the select statement; there are a lot of columns and that is a pain. Can I rename the ID column of b in the join itself, similar to SAS data merge statements?
I am using Postgres.
Postgres would not give you an error for duplicate output column names, but some clients do. (Duplicate names are also not very useful.)
Either way, use the USING clause as join condition to fold the two join columns into one:
SELECT *
FROM tbl_a a
LEFT JOIN tbl_b b USING (id);
If you join a table to itself (self-join), there will be more duplicate column names, and the query would hardly make sense to begin with. It starts to make sense for different tables, as you stated in your question: I have two tables ...
To avoid all duplicate column names, you have to list them in the SELECT clause explicitly - possibly dealing out column aliases to get both instances with different names.
Or you can use a NATURAL join - if that fits your unexplained use case:
SELECT *
FROM tbl_a a
NATURAL LEFT JOIN tbl_b b;
This joins on all columns that share the same name and folds those automatically - exactly the same as listing all common column names in a USING clause. You need to be aware of rules for possible NULL values ...
Details in the manual.
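One thing worth checking before reaching for NATURAL is whether the tables share any column besides the intended key, because every shared name becomes part of the join condition. The sketch below illustrates this with Python's built-in sqlite3 module standing in for Postgres; the table contents and the extra shared column note are invented for the example.

```python
import sqlite3

# Two hypothetical tables that share the key id and, by accident, a column note.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_a (id INTEGER, note TEXT);
    CREATE TABLE tbl_b (id INTEGER, note TEXT, extra TEXT);
    INSERT INTO tbl_a VALUES (1, 'from a');
    INSERT INTO tbl_b VALUES (1, 'from b', 'x');
""")


def run(sql):
    """Run a query and return (column names, rows)."""
    cursor = conn.execute(sql)
    return [d[0] for d in cursor.description], cursor.fetchall()


# USING (id) folds only id; both note columns survive and the rows match.
using_names, using_rows = run("SELECT * FROM tbl_a a LEFT JOIN tbl_b b USING (id)")

# NATURAL joins on id AND note; the notes differ, so nothing from tbl_b matches.
natural_names, natural_rows = run("SELECT * FROM tbl_a a NATURAL LEFT JOIN tbl_b b")

print(using_names, using_rows)
print(natural_names, natural_rows)
```

In this sketch the NATURAL form returns the row from tbl_a with NULL for extra, even though the ids match, which is why listing the key explicitly in USING is usually the safer choice when the tables have many columns.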

SQL Change View Name / Joins

I am trying to join two views I created, using their common field (cAuditNumber).
The issue is that once I have done the join, it will not let me create the view, because the view cannot contain the field name cAuditNumber twice.
Is the cAuditNumber the PK I should use?
How do I correct this and still join the tables?
CREATE VIEW KFF_Sales_Data_Updated AS
SELECT CustSalesUpdated.*, StkSalesUpdated.*
FROM CustSalesUpdated
INNER JOIN StkSalesUpdated
ON StkSalesUpdated.cAuditNumber = CustSalesUpdated.cAuditNumber
I get the following error:
Msg 4506, Level 16, State 1, Procedure KFF_Sales_Data_Updated, Line 2
Column names in each view or function must be unique. Column name 'cAuditNumber' in view or function 'KFF_Sales_Data_Updated' is specified more than once.
Substitute your own column names for ColumnA, ColumnB, etc., but it should follow this format:
CREATE VIEW KFF_Sales_Data_Updated AS
SELECT CustSalesUpdated.cAuditNumber
,CustSalesUpdated.ColumnA
,CustSalesUpdated.ColumnB
,CustSalesUpdated.ColumnC
,StkSalesUpdated.ColumnA as StkColumnA
,StkSalesUpdated.ColumnB as StkColumnB
,StkSalesUpdated.ColumnC as StkColumnC
FROM CustSalesUpdated
INNER JOIN StkSalesUpdated
ON StkSalesUpdated.cAuditNumber = CustSalesUpdated.cAuditNumber
You only have to alias the duplicate columns using "as", though you can also use it to rename any column you like.
CREATE VIEW KFF_Sales_Data_Updated AS
SELECT csu.cAuditNumber cAuditNumber1 , ssu.cAuditNumber cAuditNumber2
FROM CustSalesUpdated csu
INNER JOIN StkSalesUpdated ssu
ON ssu.cAuditNumber = csu.cAuditNumber
You can add any other columns from the two tables to the select statement, but if two columns have the same name you should give them aliases.
Using select * is bad practice in general, whereas aliasing your table names and columns is good practice. In your case especially, both the table names and the identically named columns across the two tables could use aliases. The database cannot tell which cAuditNumber comes from where, so aliases come in handy.
CREATE VIEW KFF_Sales_Data_Updated
AS
SELECT
csu.cAuditNumber
,csu.Col1
,csu.Col2
,csu.Col3
,ssu.Col1 AS StkCol1
,ssu.Col2 AS StkCol2
,ssu.Col3 AS StkCol3
FROM CustSalesUpdated csu
INNER JOIN StkSalesUpdated ssu ON csu.cAuditNumber = ssu.cAuditNumber

Error "for pooled tables cluster tables and projection views join is not allowed: "T588B""

I'm new to ABAP development. I am trying to join T588B and T588T and get this error: "for pooled tables cluster tables and projection views join is not allowed: "T588B"".
SELECT a~mandt AS mandt a~userg AS userg a~mntyp AS mntyp a~menue AS menue
a~infty AS infty b~sprsl AS sprsl b~dtext As dtext
INTO CORRESPONDING FIELDS OF TABLE zfinaltable
FROM T588B AS a LEFT JOIN T588T AS b ON a~mntyp = b~mntyp
WHERE a~mntyp = 'I'
I just want to join the two tables and store the output in zfinaltable, which is a custom table.
Any idea on how to accomplish this join? An example would be really helpful!
From the documentation: "Pooled and cluster tables cannot be joined using join expressions."
http://help.sap.com/abapdocu_731/en/abapselect_join.htm
You need to use a SELECT ... FOR ALL ENTRIES instead.
You could try using SELECT ... ENDSELECT to select the data from table T588B and, inside the loop, read the matching data from T588T. An example could look like this; I think it could easily be adapted to your needs.
DATA: ls_T588B TYPE T588B.
DATA: lt_T588T TYPE TABLE OF T588T.
SELECT mntyp menue
FROM T588B
INTO CORRESPONDING FIELDS OF ls_t588b.
SELECT *
FROM T588T
APPENDING TABLE lt_T588T
WHERE MNTYP = ls_t588b-mntyp
AND MENUE = ls_t588b-menue.
ENDSELECT.