Join-Group PySpark - SQL to PySpark

I am trying to join 2 tables based on this SQL query using pyspark.
%sql
SELECT c.cust_id, avg(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b
ON c.pp = b.pp
GROUP BY c.cust_id
I tried the following in PySpark, but I am not sure it is the right way. I got stuck trying to display my data, so I just chose .max:
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.avg(gender_score) as pub_masc
.groupBy('cust_id').max()
Any help would be appreciated.
Thanks in advance.

Your Python code contains an invalid line: .avg(gender_score) as pub_masc. The "as" alias is SQL syntax, not Python. Also, you should group by first and then average, not the other way round.
import pyspark.sql.functions as F
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.groupBy('cust_id')\
.agg(F.avg('gender_score').alias('pub_masc'))
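
To check what the SQL in the question should return, here is a minimal sketch that runs the same query in an in-memory SQLite database from Python. The sample rows are made up for illustration; only the table and column names come from the question. The PySpark chain above should produce the same shape: one row per cust_id, with a null average for customers that have no match in pub_df.

```python
import sqlite3

# Hypothetical sample data; only the table/column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE df (cust_id INTEGER, pp TEXT);
    CREATE TABLE pub_df (pp TEXT, gender_score REAL);
    INSERT INTO df VALUES (1, 'a'), (1, 'b'), (2, 'c'), (3, 'z');
    INSERT INTO pub_df VALUES ('a', 0.25), ('b', 0.75), ('c', 1.0);
""")

# Same query as in the question: join first, then group, then average.
rows = conn.execute("""
    SELECT c.cust_id, AVG(b.gender_score) AS pub_masc
    FROM df c
    LEFT JOIN pub_df b ON c.pp = b.pp
    GROUP BY c.cust_id
    ORDER BY c.cust_id
""").fetchall()

# Customer 3 has no matching pp, so the left join keeps it with a NULL average.
for cust_id, pub_masc in rows:
    print(cust_id, pub_masc)
```

Customer 3 only survives because of the left join; with an inner join it would be dropped before the grouping step.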

Related

Join two tables and concat columns using Pyspark (databricks)

I have two tables in my database. I need to perform a left outer join on these two tables with the condition table1.id = table2.id; the source column should also match.
Below are my two source tables.
Table 1:
source  id        type
eu2     10000162  N4
sus     10000162  M1
pda     10000162  XM
Table 2:
source  id        code1     code2
eu2     10000162  CDNG_GRP  PROB_CD
sus     10000162  AANV      NW
pda     10000162  PM2       VLPD
Expected output:
source  id        type  concat
eu2     10000162  N4    CDNG_GRP-PROB_CD
sus     10000162  M1    AANV-NW
pda     10000162  XM    PM2-VLPD
I want this result as a DataFrame.
Thanks in advance!

A Spark join always returns a DataFrame (unless you explicitly convert the result to something else).
Try this, assuming your tables are already Spark DataFrames:
left_join = table1.join(table2, table1.id == table2.id, "leftouter")
left_join.show()

To get the desired result, you need to join on both the source and id columns.
import pyspark.sql.functions as F
...
df = df1.join(df2, on=['id', 'source'], how='left') \
.withColumn('concat', F.concat('code1', F.lit('-'), 'code2')) \
.select(['source', 'id', 'type', 'concat'])
df.show(truncate=False)
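
To see why both keys are needed, here is a small sketch that loads the two tables from the question into an in-memory SQLite database and compares a join on id alone with a join on id and source. SQLite is only a stand-in here to make the row counts easy to check; it is not part of the original answer.

```python
import sqlite3

# Tables and rows copied from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (source TEXT, id INTEGER, type TEXT);
    CREATE TABLE table2 (source TEXT, id INTEGER, code1 TEXT, code2 TEXT);
    INSERT INTO table1 VALUES
        ('eu2', 10000162, 'N4'), ('sus', 10000162, 'M1'), ('pda', 10000162, 'XM');
    INSERT INTO table2 VALUES
        ('eu2', 10000162, 'CDNG_GRP', 'PROB_CD'),
        ('sus', 10000162, 'AANV', 'NW'),
        ('pda', 10000162, 'PM2', 'VLPD');
""")

# Joining on id alone pairs every table1 row with every table2 row,
# because all rows share the same id.
id_only = conn.execute("""
    SELECT t1.source, t2.source
    FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id
""").fetchall()

# Joining on id AND source gives one row per table1 row.
both_keys = conn.execute("""
    SELECT t1.source, t1.id, t1.type, t2.code1 || '-' || t2.code2 AS concat
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.id = t2.id AND t1.source = t2.source
    ORDER BY t1.rowid
""").fetchall()

print(len(id_only))  # 9
for row in both_keys:
    print(row)
```

With the sample data every row has the same id, so the id-only join produces 3 x 3 = 9 rows, while the two-key join produces the three expected rows.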

Using query to combine pandas dataframes

I'm working on a problem where I need to merge two dataframes and apply a condition similar to the WHERE clause in SQL. To start, I have two dataframes:
Member_Timepoints = pd.DataFrame(list(zip([1001,1001,1002,1003],['2016-09-02','2018-01-30','2018-03-17','2019-01-10'])),columns = ['Member_ID','Discharge_Date'])
Enrollment_Information = pd.DataFrame(list(zip([1001,1001,1002,1003,1003,1003,1003], ['2015-07-01','2018-01-01','2018-03-01','2017-11-01','2018-08-01','2019-07-01','2019-09-01'], ['2018-01-01','2262-04-11','2018-08-01','2018-08-01','2019-06-01','2019-08-01','2262-04-11'])), columns = ['Member_ID','Coverage_Effective_Date','Coverage_Cancel_Date'])
Member_Timepoints['Discharge_Date'] = pd.to_datetime(Member_Timepoints['Discharge_Date'])
Enrollment_Information['Coverage_Effective_Date'] = pd.to_datetime(Enrollment_Information['Coverage_Effective_Date'])
Enrollment_Information['Coverage_Cancel_Date'] = pd.to_datetime(Enrollment_Information['Coverage_Cancel_Date'])
I need to join these dataframes on 'Member_ID' and use the following condition as the filter criterion:
Coverage_Effective_Date <= Discharge_Date and Coverage_Cancel_Date >= Discharge_Date + 30
I started from the question "Join pandas dataframes based on different conditions". However, I am still struggling to merge the dataframes with the above condition applied.
Can anyone please help me to implement this in pandas using query?

The first thing I noticed in this condition is the addition of an integer to a date. You cannot add a plain integer to a datetime value; use a timedelta instead:
from datetime import timedelta
some_date_type + timedelta(days=30)
For the query part, you can use .loc after merging:
data = Enrollment_Information.merge(Member_Timepoints, on=['Member_ID'])
data.loc[(data['Coverage_Effective_Date'] <= data['Discharge_Date']) &
         (data['Coverage_Cancel_Date'] >= data['Discharge_Date'] + timedelta(days=30))]
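
As a cross-check of which rows the condition from the question should keep, here is a dependency-free sketch that applies the same logic with the standard library only. The data is copied from the question; the variable names (member_timepoints, enrollment_information, window, matches) are mine, and this is not a replacement for the pandas merge, just a way to verify its result by hand.

```python
from datetime import date, timedelta

# (Member_ID, Discharge_Date), copied from the question
member_timepoints = [
    (1001, date(2016, 9, 2)),
    (1001, date(2018, 1, 30)),
    (1002, date(2018, 3, 17)),
    (1003, date(2019, 1, 10)),
]

# (Member_ID, Coverage_Effective_Date, Coverage_Cancel_Date), copied from the question
enrollment_information = [
    (1001, date(2015, 7, 1), date(2018, 1, 1)),
    (1001, date(2018, 1, 1), date(2262, 4, 11)),
    (1002, date(2018, 3, 1), date(2018, 8, 1)),
    (1003, date(2017, 11, 1), date(2018, 8, 1)),
    (1003, date(2018, 8, 1), date(2019, 6, 1)),
    (1003, date(2019, 7, 1), date(2019, 8, 1)),
    (1003, date(2019, 9, 1), date(2262, 4, 11)),
]

window = timedelta(days=30)

# Pair rows on Member_ID, then keep pairs where coverage starts on or before
# the discharge date and lasts at least 30 days past it.
matches = [
    (member_id, discharge, effective, cancel)
    for member_id, discharge in member_timepoints
    for enr_id, effective, cancel in enrollment_information
    if member_id == enr_id
    and effective <= discharge
    and cancel >= discharge + window
]

for row in matches:
    print(row)
print(len(matches))  # 4
```

Each discharge date ends up matched to exactly one coverage period in this data set, which is what the merged and filtered pandas frame should show as well.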

Un-nesting a nested SQL query

I am new to building SQL queries and could use some help. I built a query that works fine as a standalone query. The problem is that I need to use it in a report through the ExecuteScalar function, where nested queries are not allowed. I tried to rebuild it using joins, but I seem to be lost.
Can anyone help me "un-nest" this query?
SELECT
StockType2Job.Loaded
FROM
StockType2Job
WHERE
StockType2Job.IdStockType =
(SELECT StockType.IdStockType
FROM StockType
WHERE StockType.Number = '1001716.00')
AND
StockType2Job.IdStockType2JobGroup =
(SELECT StockType2JobGroup.IdStockType2JobGroup
FROM StockType2JobGroup
WHERE StockType2JobGroup.IdJob =
(SELECT Job.IdJob
FROM Job
WHERE Job.Number = '18-0085.02'
AND StockType2JobGroup.Caption = 'Breakout Room 1'))
Any help appreciated. Thanks

This query should work (on Oracle DB). Because the tables are aliased, the selected column is referenced through the alias a:
SELECT
a.Loaded
FROM
(((StockType2Job a JOIN StockType b ON a.IdStockType=b.IdStockType)
JOIN StockType2JobGroup c ON a.IdStockType2JobGroup=c.IdStockType2JobGroup)
JOIN Job d ON c.IdJob=d.IdJob)
WHERE
b.Number = '1001716.00' AND
d.Number = '18-0085.02' AND
c.Caption = 'Breakout Room 1'
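
One way to gain confidence in a rewrite like this is to run the nested and the joined form against the same small data set and compare the results. The sketch below does that with an in-memory SQLite database from Python. The rows are invented for illustration, the column types are guesses, and the join is written without the nested parentheses, which does not change its meaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; only table and column names come from the question.
conn.executescript("""
    CREATE TABLE StockType (IdStockType INTEGER, Number TEXT);
    CREATE TABLE Job (IdJob INTEGER, Number TEXT);
    CREATE TABLE StockType2JobGroup (IdStockType2JobGroup INTEGER, IdJob INTEGER, Caption TEXT);
    CREATE TABLE StockType2Job (IdStockType INTEGER, IdStockType2JobGroup INTEGER, Loaded INTEGER);
    INSERT INTO StockType VALUES (1, '1001716.00'), (2, '2002000.00');
    INSERT INTO Job VALUES (10, '18-0085.02'), (11, '19-0001.01');
    INSERT INTO StockType2JobGroup VALUES
        (100, 10, 'Breakout Room 1'), (101, 10, 'Main Hall'), (102, 11, 'Breakout Room 1');
    INSERT INTO StockType2Job VALUES (1, 100, 5), (1, 101, 7), (2, 100, 3), (1, 102, 9);
""")

# The nested query from the question.
nested = conn.execute("""
    SELECT StockType2Job.Loaded
    FROM StockType2Job
    WHERE StockType2Job.IdStockType =
          (SELECT StockType.IdStockType FROM StockType
           WHERE StockType.Number = '1001716.00')
      AND StockType2Job.IdStockType2JobGroup =
          (SELECT StockType2JobGroup.IdStockType2JobGroup FROM StockType2JobGroup
           WHERE StockType2JobGroup.IdJob =
                 (SELECT Job.IdJob FROM Job
                  WHERE Job.Number = '18-0085.02'
                    AND StockType2JobGroup.Caption = 'Breakout Room 1'))
""").fetchall()

# The join-based rewrite, selecting through the alias.
joined = conn.execute("""
    SELECT a.Loaded
    FROM StockType2Job a
    JOIN StockType b ON a.IdStockType = b.IdStockType
    JOIN StockType2JobGroup c ON a.IdStockType2JobGroup = c.IdStockType2JobGroup
    JOIN Job d ON c.IdJob = d.IdJob
    WHERE b.Number = '1001716.00'
      AND d.Number = '18-0085.02'
      AND c.Caption = 'Breakout Room 1'
""").fetchall()

print(nested, joined)
```

Both forms return the single Loaded value for the row that matches all three conditions. Note that the two forms can differ if a scalar subquery matches more than one row, so the comparison is only as good as the test data.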

Equivalent of SQL SELECT in Pig

I am new to Pig and am trying to understand the basic commands. I have a data set A which I inner joined to data set B. I want to keep only some of the variables in the resultant data set. How do I do that? This is what I have so far
A = LOAD 'science_scores';
B = LOAD 'math_scores';
AB = JOIN A BY Name, B BY Student_Name;
Now both A and B have a lot of other columns that I don't need. In SQL I would do something like this:
SELECT A.science_score, B.math_score
FROM A
INNER JOIN B
ON A.Name = B.Student_Name
Can someone please help me figure how to do this?
Thanks!

You are looking for the FOREACH and GENERATE keywords.
selected = FOREACH AB GENERATE science_score, math_score;

A = LOAD 'science_scores';
B = LOAD 'math_scores';
AB = JOIN A BY Name, B BY Student_Name;
dump AB;
Please refer to the question linked below:
How can I do this inner join properly in Apache PIG?

SQL DB2 Conditional Select

I'm working on a DB2 stored procedure and am having a little trouble getting the results I want. The problem with the following query is that it does not return rows from table A that don't pass the final WHERE clause. I would like to receive all rows from table A that meet the first WHERE clause (WHERE A.GENRC_CD_TYPE = 'MDAA'), and then add an email column from table B for each of those rows (where A.DESC_30 = B.MATL_PLNR_ID).
SELECT A.GENRC_CD,
A.DESC_30,
A.DOL,
A.DLU,
A.LU_LID,
B.EMAIL_ID_50
FROM GENRCCD A,
MPPLNR B
WHERE A.GENRC_CD_TYPE = 'MDAA'
AND (A.DESC_30) = B.MATL_PLNR_ID;
Any help is much appreciated, thanks!

Then what you need is a LEFT JOIN:
SELECT A.GENRC_CD,
A.DESC_30,
A.DOL,
A.DLU,
A.LU_LID,
B.EMAIL_ID_50
FROM GENRCCD A LEFT JOIN
MPPLNR B on A.DESC_30=B.MATL_PLNR_ID
WHERE A.GENRC_CD_TYPE = 'MDAA'
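
The difference between the two queries can be seen on a tiny data set. The following sketch uses an in-memory SQLite database from Python rather than DB2, keeps only the columns needed for the comparison, and uses made-up rows, so treat it as an illustration of the join semantics and not of DB2 itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; table and column names follow the question.
conn.executescript("""
    CREATE TABLE GENRCCD (GENRC_CD TEXT, GENRC_CD_TYPE TEXT, DESC_30 TEXT);
    CREATE TABLE MPPLNR (MATL_PLNR_ID TEXT, EMAIL_ID_50 TEXT);
    INSERT INTO GENRCCD VALUES
        ('C1', 'MDAA', 'P01'), ('C2', 'MDAA', 'P99'), ('C3', 'OTHR', 'P01');
    INSERT INTO MPPLNR VALUES ('P01', 'p01@example.com');
""")

# Original form: a comma join plus WHERE behaves like an inner join,
# so C2 (no matching planner) disappears.
inner_rows = conn.execute("""
    SELECT A.GENRC_CD, B.EMAIL_ID_50
    FROM GENRCCD A, MPPLNR B
    WHERE A.GENRC_CD_TYPE = 'MDAA'
      AND A.DESC_30 = B.MATL_PLNR_ID
    ORDER BY A.GENRC_CD
""").fetchall()

# LEFT JOIN form: every 'MDAA' row is kept, with NULL where no email exists.
left_rows = conn.execute("""
    SELECT A.GENRC_CD, B.EMAIL_ID_50
    FROM GENRCCD A
    LEFT JOIN MPPLNR B ON A.DESC_30 = B.MATL_PLNR_ID
    WHERE A.GENRC_CD_TYPE = 'MDAA'
    ORDER BY A.GENRC_CD
""").fetchall()

print(inner_rows)
print(left_rows)
```

The match condition belongs in the ON clause. If a condition on table B were moved into the WHERE clause, rows with a NULL email would be filtered out again and the query would behave like the inner join.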
