I have 3 tables. The first table "A" is the master table:
id_grp|group_name |created_on |status|
------+--------------+-----------------------+------+
17|Teller |2022-09-09 16:00:44.842| 1|
18|Combined Group|2022-09-09 10:16:42.473| 1|
16|admnistrator |2022-09-08 10:11:14.313| 1|
Then I have another table "B":
id_config|id_grp|id_utilis|
---------+------+---------+
159| 16| 1|
161| 16| 54|
164| 17| 55|
438| 17| 88|
166| 18| 39|
167| 18| 20|
439| 16| 89|
198| 18| 51|
Then I have the last table "C":
id_config|id_grp|id_pol|
---------+------+------+
46| 16| 7|
48| 17| 8|
51| 18| 8|
52| 18| 7|
84| 18| 9|
113| 17| 9|
But when I use GROUP BY with multiple joins as follows:
SELECT
a.id_grp,
a.group_name,
a.created_on,
a.status,
count(b.id_utilis) AS users,
count(c.id_pol) AS policy
FROM a
inner JOIN b on a.id_grp = b.id_grp
inner JOIN c on a.id_grp = c.id_grp
GROUP BY a.id_grp, a.group_name, a.created_on, a.status
But I am getting the wrong result: the two joins form a cross product per group (every matching row of "B" is paired with every matching row of "C"), so the two counts multiply each other.
id_grp|group_name |created_on |status|users|policy|
------+--------------+-----------------------+------+-----+------+
17|Teller |2022-09-09 16:00:44.842| 1| 10| 10|
16|admnistrator |2022-09-08 10:11:14.313| 1| 3| 3|
18|Combined Group|2022-09-09 10:16:42.473| 1| 18| 18|
Aggregate each table in a subquery first, then join the pre-aggregated counts so the two joins can no longer multiply each other:
select *
from a
join (select id_grp, count(*) as users  from b group by id_grp) b using(id_grp)
join (select id_grp, count(*) as policy from c group by id_grp) c using(id_grp)
id_grp|group_name    |created_on         |status|users|policy|
------+--------------+-------------------+------+-----+------+
    17|Teller        |2022-09-09 16:00:44|     1|    2|     2|
    18|Combined Group|2022-09-09 10:16:42|     1|    3|     3|
    16|admnistrator  |2022-09-08 10:11:14|     1|    3|     1|
I have three dimension tables and a fact table, and I need to write a query that joins the dimension columns with the fact table to find the top 10 ATMs with the most transactions in the 'inactive' state. I tried the query below with a cartesian (comma) join, but I don't know if this is the right way to join the tables.
select a.atm_number, a.atm_manufacturer, b.location,
       count(c.trans_id) as total_transaction_count,
       count(c.atm_status) as inactive_count
from dimen_atm a, dimen_location b, fact_atm_trans c
where a.atm_id = c.atm_id and b.location = c.location
order by inactive_count desc limit 10;
dimen_card_type
+------------+---------+
|card_type_id|card_type|
+------------+---------+
| 1| CIRRUS|
| 2| Dankort|
dimen_atm
+------+----------+----------------+---------------+
|atm_id|atm_number|atm_manufacturer|atm_location_id|
+------+----------+----------------+---------------+
| 1| 1| NCR| 16|
| 2| 2| NCR| 64|
+------+----------+----------------+---------------+
dimen_location
+-----------+--------------------+----------------+-------------+-------+------+------+
|location_id| location| streetname|street_number|zipcode| lat| lon|
+-----------+--------------------+----------------+-------------+-------+------+------+
| 1|Intern København|Rådhuspladsen| 75| 1550|55.676|12.571|
| 2| København| Regnbuepladsen| 5| 1550|55.676|12.571|
+-----------+--------------------+----------------+-------------+-------+------+------+
fact_atm_trans
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
|trans_id|atm_id|weather_loc_id|date_id|card_type_id|atm_status|currency| service|transaction_amount|message_code|message_text|rain_3h|clouds_all|weather_id|weather_main|weather_description|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
| 1| 1| 16| 5229| 3| Active| DKK|Withdrawal| 5980| null| null| 0.0| 80| 803| Clouds| broken cloudsr|
| 2| 1| 16| 4090| 10| Active| DKK|Withdrawal| 3992| null| null| 0.0| 32| 802| Clouds| scattered cloudsr|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
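The sample output above looks like Spark, so here is a minimal PySpark sketch of one way to get a real inactive count. It assumes the table names are available as DataFrame variables, that the join keys are atm_id and atm_location_id = location_id, and that the status literal is 'Inactive'; none of these are confirmed details, just a reading of the schemas shown.
from pyspark.sql import functions as F

# Sketch only: join the fact table to the two dimensions on the keys that are
# visible in the sample schemas, then count 'Inactive' transactions per ATM.
inactive_top10 = (
    fact_atm_trans.alias("c")
    .join(dimen_atm.alias("a"), F.col("c.atm_id") == F.col("a.atm_id"))
    .join(dimen_location.alias("b"), F.col("a.atm_location_id") == F.col("b.location_id"))
    .groupBy("a.atm_number", "a.atm_manufacturer", "b.location")
    .agg(
        F.count("c.trans_id").alias("total_transaction_count"),
        F.sum(F.when(F.col("c.atm_status") == "Inactive", 1).otherwise(0)).alias("inactive_count"),
    )
    .orderBy(F.col("inactive_count").desc())
    .limit(10)
)
inactive_top10.show()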
I want to delete the records from an old_df if a new_df has a del flag for the key metric_id. What is the right way to achieve this?
old_df (flag here is filled with nulls on purpose)
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 10| null| value2|
| 10| null| value9|
| 12| null|updated_value|
| 15| null| test_value2|
+---------+--------+-------------+
new_df
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 10| del| value2|
| 12| pass|updated_value|
| 15| del| test_value2|
+---------+--------+-------------+
result_df
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 12| pass|updated_value|
+---------+--------+-------------+
One easy way to do this is to join then filter:
from pyspark.sql.functions import lit

result_df = (
    old_df.join(new_df, on='metric_id', how='left')
          .where(new_df['flag'].isNull() | (new_df['flag'] != lit('del')))
          .select('metric_id', new_df['flag'], new_df['value'])
)
Which produces
+---------+----+-------------+
|metric_id|flag| value|
+---------+----+-------------+
| 12|pass|updated_value|
+---------+----+-------------+
I'm using a left join because there might be records in old_df for which the primary key is not present in new_df (and you don't want to delete those).
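For reference, here is a self-contained sketch that rebuilds the sample frames and applies the same join-then-filter (assuming a local SparkSession; the explicit schema is only there because old_df's flag column is entirely null, so Spark cannot infer its type):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("metric_id", IntegerType()),
    StructField("flag", StringType()),
    StructField("value", StringType()),
])

old_df = spark.createDataFrame(
    [(10, None, "value2"), (10, None, "value9"),
     (12, None, "updated_value"), (15, None, "test_value2")], schema)
new_df = spark.createDataFrame(
    [(10, "del", "value2"), (12, "pass", "updated_value"),
     (15, "del", "test_value2")], schema)

result_df = (
    old_df.join(new_df, on="metric_id", how="left")
          .where(new_df["flag"].isNull() | (new_df["flag"] != lit("del")))
          .select("metric_id", new_df["flag"], new_df["value"])
)
result_df.show()  # only metric_id 12 ('pass') remains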
I have SQL code which works perfectly:
val sql ="""
select a.*,
b.fOOS,
b.prevD
from dataFrame as a
left join dataNoPromoFOOS as b on
a.shopId = b.shopId and a.skuId = b.skuId and
a.Date > b.date and a.date <= b.prevD
"""
result:
+------+------+----------+-----+-----+------------------+---+----------+------------------+----------+
|shopId| skuId| date|stock|sales| salesRub| st|totalPromo| fOOS| prevD|
+------+------+----------+-----+-----+------------------+---+----------+------------------+----------+
| 200|154057|2017-03-31|101.0| 49.0| 629.66| 1| 0|58.618803952304724|2017-03-31|
| 200|154057|2017-09-11|116.0| 76.0| 970.67| 1| 0| 63.3344597217295|2017-09-11|
| 200|154057|2017-11-10| 72.0| 94.0| 982.4599999999999| 1| 0|59.019226118850405|2017-11-10|
| 200|154057|2018-10-08|126.0| 34.0| 414.44| 1| 0| 55.16878756270067|2018-10-08|
| 200|154057|2016-08-03|210.0| 27.0| 307.43| 1| 0|23.530049844711286|2016-08-03|
| 200|154057|2016-09-03| 47.0| 20.0| 246.23| 1| 0|24.656378380329674|2016-09-03|
| 200|154057|2016-12-31| 66.0| 30.0| 386.5| 1| 1| 26.0423103074891|2017-01-09|
| 200|154057|2017-02-28| 22.0| 61.0| 743.2899999999998| 1| 0| 54.86808157636879|2017-02-28|
| 200|154057|2017-03-16| 79.0| 41.0|505.40999999999997| 1| 0| 49.79449369431623|2017-03-16|
When I use Scala, this code doesn't work:
dataFrame.join(dataNoPromoFOOS,
    dataFrame("shopId") === dataNoPromoFOOS("shopId") &&
    dataFrame("skuId") === dataNoPromoFOOS("skuId") &&
    dataFrame("date").lt(dataNoPromoFOOS("date")) &&
    dataFrame("date").geq(dataNoPromoFOOS("prevD")),
    "left"
  ).select(dataFrame("*"), dataNoPromoFOOS("fOOS"), dataNoPromoFOOS("prevD"))
result:
+------+------+----------+-----+-----+------------------+---+----------+----+-----+
|shopId| skuId| date|stock|sales| salesRub| st|totalPromo|fOOS|prevD|
+------+------+----------+-----+-----+------------------+---+----------+----+-----+
| 200|154057|2016-09-24|288.0| 34.0| 398.66| 1| 0|null| null|
| 200|154057|2017-06-11| 40.0| 38.0| 455.32| 1| 1|null| null|
| 200|154057|2017-08-18| 83.0| 20.0|226.92000000000002| 1| 1|null| null|
| 200|154057|2018-07-19|849.0| 58.0| 713.12| 1| 0|null| null|
| 200|154057|2018-08-11|203.0| 52.0| 625.74| 1| 0|null| null|
| 200|154057|2016-09-01|120.0| 24.0| 300.0| 1| 1|null| null|
| 200|154057|2016-12-22| 62.0| 30.0| 378.54| 1| 0|null| null|
| 200|154057|2017-05-11|105.0| 49.0| 597.12| 1| 0|null| null|
| 200|154057|2016-12-28| 3.0| 36.0| 433.11| 1| 1|null| null|
Does somebody know why the SQL code works and the Scala code doesn't join the left table? I think it's the date column, but I don't understand how I can find my error.
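One thing that stands out when comparing the two snippets: the DataFrame version inverts the range conditions (lt where the SQL uses >, geq where the SQL uses <=). A tiny PySpark toy (my own made-up rows, not the data above) showing how a flipped range condition turns a left-join match into nulls:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(200, "2017-03-31")], ["shopId", "date"])
b = spark.createDataFrame([(200, "2017-03-01", "2017-04-30")], ["shopId", "date", "prevD"])

# Same direction as the SQL: a.date > b.date AND a.date <= b.prevD  -> the row matches
matched = a.join(b, (a.shopId == b.shopId) & (a.date > b.date) & (a.date <= b.prevD), "left")

# Inverted direction (as in the DataFrame snippet) -> no match, right-hand columns come back null
unmatched = a.join(b, (a.shopId == b.shopId) & (a.date < b.date) & (a.date >= b.prevD), "left")

matched.show()
unmatched.show()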
Let's say I have a Spark DataFrame df1, with several columns (among which the column id), and a DataFrame df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
Asterisk (*) works with alias. Ex:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.' + xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])
The trick is in:
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
Without using an alias:
df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then, c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+
I believe that this would be the easiest and most intuitive way:
final = (
    df1.alias('df1')
       .join(df2.alias('df2'), on=df1['id'] == df2['id'], how='inner')
       .select('df1.*', 'df2.other')
)
Drop the duplicate b_id:
c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
Here is a code snippet that does the inner join, selects the columns from both dataframes, and aliases the same column to a different column name.
emp_df = spark.read.csv('Employees.csv', header=True)
dept_df = spark.read.csv('dept.csv', header=True)
emp_dept_df = emp_df.join(dept_df, 'DeptID') \
                    .select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':
+---+---------+------+------+
| ID| Name|Salary|DeptID|
+---+---------+------+------+
| 1| John| 20000| 1|
| 2| Rohit| 15000| 2|
| 3| Parth| 14600| 3|
| 4| Rishabh| 20500| 1|
| 5| Daisy| 34000| 2|
| 6| Annie| 23000| 1|
| 7| Sushmita| 50000| 3|
| 8| Kaivalya| 20000| 1|
| 9| Varun| 70000| 3|
| 10|Shambhavi| 21500| 2|
| 11| Johnson| 25500| 3|
| 12| Riya| 17000| 2|
| 13| Krish| 17000| 1|
| 14| Akanksha| 20000| 2|
| 15| Rutuja| 21000| 3|
+---+---------+------+------+
Output for 'dept_df.show()':
+------+----------+
|DeptID| Name|
+------+----------+
| 1| Sales|
| 2|Accounting|
| 3| Marketing|
+------+----------+
Join Output:
+---+---------+------+------+----------+
| ID| Name|Salary|DeptID| DName|
+---+---------+------+------+----------+
| 1| John| 20000| 1| Sales|
| 2| Rohit| 15000| 2|Accounting|
| 3| Parth| 14600| 3| Marketing|
| 4| Rishabh| 20500| 1| Sales|
| 5| Daisy| 34000| 2|Accounting|
| 6| Annie| 23000| 1| Sales|
| 7| Sushmita| 50000| 3| Marketing|
| 8| Kaivalya| 20000| 1| Sales|
| 9| Varun| 70000| 3| Marketing|
| 10|Shambhavi| 21500| 2|Accounting|
| 11| Johnson| 25500| 3| Marketing|
| 12| Riya| 17000| 2|Accounting|
| 13| Krish| 17000| 1| Sales|
| 14| Akanksha| 20000| 2|Accounting|
| 15| Rutuja| 21000| 3| Marketing|
+---+---------+------+------+----------+
I got an error: 'a not found' using the suggested code:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])
I changed a.columns to df1.columns and it worked out.
A function to drop duplicate columns after joining. Check it out:
def dropDupeDfCols(df):
    newcols = []
    dupcols = []

    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)

    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))

    return df.toDF(*newcols)
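A quick usage sketch for that helper; the tiny df1/df2 frames below are made up just to show the duplicated id column being dropped:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 'foo')], ['id', 'extra'])
df2 = spark.createDataFrame([(1, 'bar')], ['id', 'other'])

joined = df1.join(df2, df1.id == df2.id)  # the result carries two 'id' columns
deduped = dropDupeDfCols(joined)          # keeps the first 'id', drops the second
deduped.show()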
I just dropped the columns I didn't need from df2 and joined:
sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
Note that id should be in `columns_of_interest`, though.
df1.join(df2, ['id']).drop(df2.id)
If you need multiple columns from the other PySpark dataframe, you can use this.
Based on a single join condition:
x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])
Based on multiple join conditions:
x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])
I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in Databricks, and presumably works in a typical Spark environment (replacing the keyword "spark" with "sqlContext"):
df.createOrReplaceTempView('t1')   # temp table t1
df2.createOrReplaceTempView('t2')  # temp table t2

output = spark.sql("""
    select
        t1.*
        ,t2.desired_field(s)
    from
        t1
    left (or inner) join t2 on t1.id = t2.id
""")
You could just do the join and after that select the wanted columns: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join