Join two tables and concatenate columns using PySpark (Databricks) - dataframe

I have two tables in my database. I need to perform a left outer join on these two tables with the condition table1.id = table2.id; the source columns should also match.
Below are my two source tables.
Table 1:
source  id        type
eu2     10000162  N4
sus     10000162  M1
pda     10000162  XM

Table 2:
source  id        code1     code2
eu2     10000162  CDNG_GRP  PROB_CD
sus     10000162  AANV      NW
pda     10000162  PM2       VLPD

Expected output:
source  id        type  concat
eu2     10000162  N4    CDNG_GRP-PROB_CD
sus     10000162  M1    AANV-NW
pda     10000162  XM    PM2-VLPD
I want this result in a DataFrame.
Thanks in advance!

Spark always returns a DataFrame (unless specified not to).
Try this, assuming your tables are already Spark DataFrames:
left_join = table1.join(table2, table1.id == table2.id, "leftouter")
left_join.show()

To get the desired result, you need to perform the join on both the source and id columns.
import pyspark.sql.functions as F
...
df = df1.join(df2, on=['id', 'source'], how='left') \
    .withColumn('concat', F.concat('code1', F.lit('-'), 'code2')) \
    .select(['source', 'id', 'type', 'concat'])
df.show(truncate=False)
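For reference, here is a minimal, self-contained sketch that rebuilds the question's two tables and applies this join; the SparkSession setup is assumed, and df1/df2 stand in for Table 1 and Table 2:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question.
df1 = spark.createDataFrame(
    [('eu2', '10000162', 'N4'),
     ('sus', '10000162', 'M1'),
     ('pda', '10000162', 'XM')],
    ['source', 'id', 'type'])
df2 = spark.createDataFrame(
    [('eu2', '10000162', 'CDNG_GRP', 'PROB_CD'),
     ('sus', '10000162', 'AANV', 'NW'),
     ('pda', '10000162', 'PM2', 'VLPD')],
    ['source', 'id', 'code1', 'code2'])

# Join on both keys, then build the hyphen-separated concat column.
df = df1.join(df2, on=['id', 'source'], how='left') \
    .withColumn('concat', F.concat('code1', F.lit('-'), 'code2')) \
    .select('source', 'id', 'type', 'concat')
df.show(truncate=False)  # matches the expected output above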

Related

Duplicate data 1 to 2 to 1 : 1 sql join

I have a problem with my query where it duplicates rows, as follows:
![result query](https://i.stack.imgur.com/a4uS8.png)
As you can see, the join multiplies the number of rows, since the rows share the same id. Do you know how to avoid it?
SELECT DISTINCT
    tt.sourceId, tt.name, tt.date,
    ts.content_plainText, ts.itemId,
    ttk.content_Plaintext, ttk.ticketId
FROM ticket_tickets AS tt
INNER JOIN ticket_ticketSolutions ts
    ON tt.sourceId = ts.itemId
INNER JOIN ticket_ticketTasks ttk
    ON ts.itemId = ttk.ticketId
Would it be possible to get one row combining the three matching results, instead of the multiplied rows? I have even tried GROUP BY on the id, but it doesn't work.
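The multiplication happens because each ticket matches several solutions and several tasks, so the join yields solutions × tasks rows. One common workaround, shown here as a hedged PySpark sketch with hypothetical table and column names (the question's schema isn't fully known), is to number the child rows per ticket and join on that row number as well:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for ticket_ticketSolutions and ticket_ticketTasks.
solutions = spark.createDataFrame(
    [(1, 'solution a'), (1, 'solution b'), (1, 'solution c')],
    ['itemId', 'solution_text'])
tasks = spark.createDataFrame(
    [(1, 'task a'), (1, 'task b'), (1, 'task c')],
    ['ticketId', 'task_text'])

# Number rows within each ticket so solution i pairs with task i
# instead of every solution pairing with every task.
s = solutions.withColumn(
    'rn', F.row_number().over(Window.partitionBy('itemId').orderBy('solution_text'))).alias('s')
t = tasks.withColumn(
    'rn', F.row_number().over(Window.partitionBy('ticketId').orderBy('task_text'))).alias('t')

paired = s.join(t, (F.col('s.itemId') == F.col('t.ticketId')) & (F.col('s.rn') == F.col('t.rn')))
paired.show()  # 3 rows instead of 3 x 3 = 9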

Join-Group PySpark - SQL to PySpark

I am trying to join 2 tables based on this SQL query using pyspark.
%sql
SELECT c.cust_id, AVG(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b
    ON c.pp = b.pp
GROUP BY c.cust_id
I tried the following in PySpark, but I am not sure if it's the right way, as I was stuck on displaying my data, so I just chose .max().
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.avg(gender_score) as pub_masc
.groupBy('cust_id').max()
Any help would be appreciated.
Thanks in advance!
Your Python code contains an invalid line, .avg(gender_score) as pub_masc. Also, you should group by first and then average, not the other way round.
import pyspark.sql.functions as F

df.select('cust_id', 'pp') \
    .join(pub_df, on=['pp'], how='left') \
    .groupBy('cust_id') \
    .agg(F.avg('gender_score').alias('pub_masc'))
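Alternatively, if you prefer to keep the original SQL, you can register both DataFrames as temp views and run the corrected query directly. A sketch, assuming spark is the active SparkSession and df/pub_df are the DataFrames from the question:

df.createOrReplaceTempView('df')
pub_df.createOrReplaceTempView('pub_df')

result = spark.sql("""
    SELECT c.cust_id, AVG(b.gender_score) AS pub_masc
    FROM df c
    LEFT JOIN pub_df b ON c.pp = b.pp
    GROUP BY c.cust_id
""")
result.show()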

How can I broadcast the tempTable instead of getting a sort merge join when I use Spark SQL?

I set spark.sql.autoBroadcastJoinThreshold to 200 MB; table a is 20 KB, table b is 20 KB, and table c is 100 GB.
I found that "a left join b on ..." is a broadcast join, and I register the result as a temp table (the temp table is 30 KB). My question: when I do "c left join TempTable on ...", I expect the TempTable to be auto-broadcast so that a broadcast join is used, but Spark performs a sort merge join instead. I also tried caching the TempTable and broadcasting the DataFrame behind the TempTable, but it doesn't work...
How can I broadcast the TempTable to get a broadcast join with Spark SQL?
I'm using Spark 1.6.1.
Thanks!
It's very difficult to understand what you have tried so far without a code snippet. I am sharing some sample code, tried on Spark 2.3, where a broadcast hint is applied to a temp view (temp tables are deprecated in Spark 2). In the following Scala code, I force a broadcast hash join.
import org.apache.spark.sql.functions.broadcast

val adf = spark.range(99999999)
val bdf = spark.range(99999999)
adf.createOrReplaceTempView("a")
bdf.createOrReplaceTempView("b")

val bdJoinDF = spark.sql("select /*+ BROADCASTJOIN(b) */ a.id, b.id from a join b on a.id = b.id")
val normalJoinDF = spark.sql("select a.id, b.id from a join b on a.id = b.id")

println(normalJoinDF.queryExecution) // == Physical Plan == *(5) SortMergeJoin [id#39512L], [id#39514L], Inner
println(bdJoinDF.queryExecution)     // == Physical Plan == *(2) BroadcastHashJoin [id#39611L], [id#39613L], Inner, BuildRight, false
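The same effect is available from the DataFrame API via the broadcast() function. A minimal PySpark sketch, assuming Spark 2.x (the range sizes here are arbitrary stand-ins for the question's tables):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Stand-ins for tables a, b (small) and c (large).
a_df = spark.range(1000)
b_df = spark.range(1000)
c_df = spark.range(100000000)

# The small intermediate result (the "TempTable" in the question).
small = a_df.join(b_df, 'id', 'left')

# Wrapping the small side in broadcast() forces a broadcast hash join,
# regardless of the autoBroadcastJoinThreshold size estimate.
joined = c_df.join(broadcast(small), 'id', 'left')
joined.explain()  # the physical plan should show BroadcastHashJoin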

Oracle SQL query that deals with inner Joins and values

SELECT sc.TAAC_SHARE_CLASS_ID,
SCS.SHARE_CLASS_SID,
SCS.REPORTING_DT,
SCS.SHARE_CLASS_SNAPSHOT_SID,
SCS.DIST_UNMOD_30_DAY_YIELD_PCT,
SCS.DER_DIST_12_MO_YIELD_PCT,
SCS.DER_SEC_30_DAY_YIELD_PCT AS SCS_DER_SEC_30_DAY_YIELD_PCT,
SCS.DER_SEC_RESTATED_YIELD_PCT AS SCS_DER_SEC_RESTATED_YIELD_PCT
FROM SHARE_CLASS sc
INNER JOIN PORTFOLIO P ON (P.PORTFOLIO_SID=SC.PORTFOLIO_SID)
INNER JOIN SHARE_CLASS_SNAPSHOT SCS ON (SCS.SHARE_CLASS_SID=sc.SHARE_CLASS_SID)
WHERE SCS.REPORTING_DT = '24-JUL-17' AND P.PORTFOLIO_ID = 638;
I ran this query and got the output shown in a screenshot (not reproduced here).
Here, instead of getting separate rows for the same TAAC_SHARE_CLASS_ID, I want to merge the rows that share the same TAAC_SHARE_CLASS_ID.
For example, the first row, with TAAC_SHARE_CLASS_ID = 000648, should have values for all 4 columns:
SCS.DIST_UNMOD_30_DAY_YIELD_PCT,
SCS.DER_DIST_12_MO_YIELD_PCT,
SCS.DER_SEC_30_DAY_YIELD_PCT,
SCS.DER_SEC_RESTATED_YIELD_PCT.
Hence the first row should have the values 2.96, 3.2972596, 7541.085263433, and 7550 for those columns.
The last 4 rows of my output are not really required, as their data has now been merged into the first 4 rows.
How can I alter this query to achieve this? Please help.
I suggest you group your results by the TAAC_SHARE_CLASS_ID column and take MAX() of the remaining columns; since MAX() ignores NULLs, the single non-NULL value in each group survives the merge. Something like this:
SELECT sc.TAAC_SHARE_CLASS_ID,
max(SCS.SHARE_CLASS_SID) as SHARE_CLASS_SID,
max(SCS.REPORTING_DT) as REPORTING_DT,
max(SCS.SHARE_CLASS_SNAPSHOT_SID) as SHARE_CLASS_SNAPSHOT_SID,
max(SCS.DIST_UNMOD_30_DAY_YIELD_PCT) as DIST_UNMOD_30_DAY_YIELD_PCT,
max(SCS.DER_DIST_12_MO_YIELD_PCT) as DER_DIST_12_MO_YIELD_PCT,
max(SCS.DER_SEC_30_DAY_YIELD_PCT) AS SCS_DER_SEC_30_DAY_YIELD_PCT,
max(SCS.DER_SEC_RESTATED_YIELD_PCT) AS SCS_DER_SEC_RESTATED_YIELD_PCT
FROM SHARE_CLASS sc
INNER JOIN PORTFOLIO P ON (P.PORTFOLIO_SID=SC.PORTFOLIO_SID)
INNER JOIN SHARE_CLASS_SNAPSHOT SCS ON (SCS.SHARE_CLASS_SID=sc.SHARE_CLASS_SID)
WHERE SCS.REPORTING_DT = '24-JUL-17' AND P.PORTFOLIO_ID = 638
GROUP BY sc.TAAC_SHARE_CLASS_ID;

Join by a prefix, similar to SQL's ON ... LIKE

How does one perform a join/merge using a prefix (of varying length) of a column as a key? I am trying to translate the following SQL code:
SELECT a.person_id, a.tn_code, b.list_id
FROM tblA a
INNER JOIN tblB b
    ON a.tn_code LIKE b.TnCode + '%'
tblA
person_id  tn_code
1          C18.4
2          M8820/9
3          X20.
...

tblB
ListID  TnCode
1.01    A0.1
1.01    A0.2
...
I have ideas such as preparing a new key TnCode_prefix = gsub("^(.*)\\.(.*)$", "\\1", TnCode) and then joining on the new column, or using data.table's rolling join, but those are only approximate translations. Is there an exact equivalent in R?
I am aware that I could use sqldf and simply pass the original SQL statement to it, but I'm wondering if there is another way.
What about creating a prefix on the fly and joining on that? I used dplyr to create the prefix and do the join.
library(dplyr)

# Fake data
set.seed(1093)
tblA = data.frame(
  person_id = sample(1:10, 50, replace = TRUE),
  tn_code = paste0(
    sample(paste0(rep(LETTERS[1:3], 3), c(40:42, 401:403, 421:423)), 50, replace = TRUE),
    ".", sample(160:170, 50, replace = TRUE)))

tblB = data.frame(
  ListID = paste0(sample(1:10, 50, replace = TRUE), ".",
                  sample(10:20, 50, replace = TRUE)),
  TnCode = paste0(
    sample(paste0(rep(LETTERS[1:3], 3), c(40:42, 401:403, 421:423)), 50, replace = TRUE),
    ".", sample(160:170, 50, replace = TRUE)))

# Join on the part of tn_code and TnCode before the dot
newTbl = tblA %>%
  mutate(join_prefix = gsub("(.*)\\..*", "\\1", tn_code)) %>%
  left_join(tblB %>% mutate(join_prefix = gsub("(.*)\\..*", "\\1", TnCode)),
            by = "join_prefix")