Coalesce columns in PySpark dataframes

res=to.join(tc, to.id1 == tc.id,how='left').select(to.id1.alias('Employee_id'), tc.name.alias('Employee_Name'), to.dept.alias('Employee_Dept'))
res.show()
+-----------+-------------+-------------+
|Employee_id|Employee_Name|Employee_Dept|
+-----------+-------------+-------------+
| 12| Prad| Physics|
| 13| null| Chem|
| 14| null| Maths|
+-----------+-------------+-------------+
I want to replace the null with, say, NONAME. Please advise on the select syntax.

Try something like this:
df.withColumn("EmployeeNameNoNull",coalesce(df.Employee_Name,lit('NONAME'))).show()
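Since the question asks for the select syntax specifically, the same coalesce can also be applied inside the original select. A small sketch against the to/tc frames from the question (not tested here), reusing the coalesce/lit imports above:
res = to.join(tc, to.id1 == tc.id, how='left').select(
    to.id1.alias('Employee_id'),
    coalesce(tc.name, lit('NONAME')).alias('Employee_Name'),
    to.dept.alias('Employee_Dept'))
res.show()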

Related

SQL query to find an output table

I have three dimension tables and a fact table, and I need to write a query that joins all the dimension columns with the fact table to find the top 10 ATMs where most transactions are in the 'inactive' state. I tried the query below with a cartesian join, but I don't know if this is the right way to join the tables. (A sketch of one possible way to structure the join follows the sample tables below.)
select a.atm_number, a.atm_manufacturer, b.location,
       count(c.trans_id) as total_transaction_count,
       count(c.atm_status) as inactive_count
from dimen_atm a, dimen_location b, fact_atm_trans c
where a.atm_id = c.atm_id and b.location = c.location
order by inactive_count desc limit 10;
dimen_card_type
+------------+---------+
|card_type_id|card_type|
+------------+---------+
| 1| CIRRUS|
| 2| Dankort|
dimen_atm
+------+----------+----------------+---------------+
|atm_id|atm_number|atm_manufacturer|atm_location_id|
+------+----------+----------------+---------------+
| 1| 1| NCR| 16|
| 2| 2| NCR| 64|
+------+----------+----------------+---------------+
dimen_location
+-----------+--------------------+----------------+-------------+-------+------+------+
|location_id| location| streetname|street_number|zipcode| lat| lon|
+-----------+--------------------+----------------+-------------+-------+------+------+
| 1|Intern København|Rådhuspladsen| 75| 1550|55.676|12.571|
| 2| København| Regnbuepladsen| 5| 1550|55.676|12.571|
+-----------+--------------------+----------------+-------------+-------+------+------+
fact_atm_trans
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
|trans_id|atm_id|weather_loc_id|date_id|card_type_id|atm_status|currency| service|transaction_amount|message_code|message_text|rain_3h|clouds_all|weather_id|weather_main|weather_description|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
| 1| 1| 16| 5229| 3| Active| DKK|Withdrawal| 5980| null| null| 0.0| 80| 803| Clouds| broken cloudsr|
| 2| 1| 16| 4090| 10| Active| DKK|Withdrawal| 3992| null| null| 0.0| 32| 802| Clouds| scattered cloudsr|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+-----------
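For what it's worth, here is a minimal PySpark sketch of one way to structure this aggregation. It is an untested outline, not a verified answer: it assumes the three tables are loaded as DataFrames with the names shown above, that fact_atm_trans.weather_loc_id references dimen_location.location_id, and that the status value is spelled 'Inactive'; the join keys and grouping columns may need adjusting to the real schema.
from pyspark.sql import functions as F

inactive_top10 = (
    fact_atm_trans.alias('c')
    .join(dimen_atm.alias('a'), F.col('a.atm_id') == F.col('c.atm_id'))
    .join(dimen_location.alias('b'), F.col('b.location_id') == F.col('c.weather_loc_id'))
    .where(F.col('c.atm_status') == 'Inactive')
    .groupBy(F.col('a.atm_number'), F.col('a.atm_manufacturer'), F.col('b.location'))
    .agg(F.count(F.col('c.trans_id')).alias('inactive_count'))
    .orderBy(F.desc('inactive_count'))
    .limit(10)
)
inactive_top10.show()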

Pyspark - how to drop records by primary keys?

I want to delete the records from an old_df if a new_df has a del flag for the key metric_id. What is the right way to achieve this?
old_df (flag here is filled with nulls on purpose)
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 10| null| value2|
| 10| null| value9|
| 12| null|updated_value|
| 15| null| test_value2|
+---------+--------+-------------+
new_df
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 10| del| value2|
| 12| pass|updated_value|
| 15| del| test_value2|
+---------+--------+-------------+
result_df
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 12| pass|updated_value|
+---------+--------+-------------+
One easy way to do this is to join then filter:
from pyspark.sql.functions import lit

result_df = (
    old_df.join(new_df, on='metric_id', how='left')
          .where((new_df['flag'].isNull()) | (new_df['flag'] != lit('del')))
          .select('metric_id', new_df['flag'], new_df['value'])
)
Which produces
+---------+----+-------------+
|metric_id|flag| value|
+---------+----+-------------+
| 12|pass|updated_value|
+---------+----+-------------+
I'm using a left join because there might be records in old_df for which the primary key is not present in new_df (and you don't want to delete those).
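If keeping old_df's own columns (rather than pulling flag and value from new_df) is acceptable, an anti-join against the keys flagged del is another option. This is a sketch of an alternative, not part of the original answer:
from pyspark.sql import functions as F

del_keys = new_df.where(F.col('flag') == 'del').select('metric_id')
survivors = old_df.join(del_keys, on='metric_id', how='left_anti')
Note that survivors keeps old_df's null flag column here, unlike the result_df shown above.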

Joining tables and finding difference

I have a table which contains the following schema:
Table1
|student_id|project_id|name|project_name|approved|evaluation_type|grade|cohort_number|
I have another table with the following:
Table2
+-------------+----------+
|cohort_number|project_id|
My problem is: I want to get, for each student_id, the projects that he has not completed (no rows). The way I know all the projects he should have done is by checking the cohort_number. Basically I need the "difference" between the 2 tables. I want to fill table 1 with the missing entries by comparing against table 2's project_id for that cohort_number.
I am not sure if I was clear.
I tried using LEFT JOIN, but I only get records where it matches. (I need the opposite)
Example:
Table1
|student_id|project_id|name| project_name| approved|evaluation_type| grade|cohort_number|
+----------+----------+--------------------+------+--------------------+--------+---------------+------------------
| 13| 18|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 7|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 27|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
Table2
+-------------+----------+
|cohort_number|project_id|
+-------------+----------+
| 3| 18|
| 3| 27|
| 4| 15|
| 3| 7|
| 3| 35|
I want:
|student_id|project_id|name| project_name| approved|evaluation_type| grade|cohort_number|
+----------+----------+--------------------+------+--------------------+--------+---------------+------------------
| 13| 18|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 7|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 27|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 35|Name| project/sd-03-bloc...| false| standard| 0| 3|
Thanks in advance
If I followed you correctly, you can get all distinct (student_id, cohort_number, name) tuples from table1, and then bring all corresponding rows from table2. This basically gives you one row for each project that a student should have completed.
You can then bring table1 with a left join. "Missing" projects are identified by null values in columns project_name, approved, evaluation_type, grade.
select
s.student_id,
t2.project_id,
s.name,
t1.project_name,
t1.approved,
t1.evaluation_type,
t1.grade,
s.cohort_number
from (select distinct student_id, cohort_number, name from table1) s
inner join table2 t2
on t2.cohort_number = s.cohort_number
left join table1 t1
on t1.student_id = s.student_id
and t1.project_id = t2.project_id
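If this needs to run from PySpark rather than a SQL client, the same query can be issued through temp views. A sketch, assuming the two tables are available as DataFrames named table1_df and table2_df and that a SparkSession named spark exists:
table1_df.createOrReplaceTempView('table1')
table2_df.createOrReplaceTempView('table2')

result = spark.sql("""
    select
        s.student_id, t2.project_id, s.name,
        t1.project_name, t1.approved, t1.evaluation_type, t1.grade,
        s.cohort_number
    from (select distinct student_id, cohort_number, name from table1) s
    inner join table2 t2 on t2.cohort_number = s.cohort_number
    left join table1 t1 on t1.student_id = s.student_id
                       and t1.project_id = t2.project_id
""")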

extracting numpy array from Pyspark Dataframe

I have a dataframe gi_man_df where group can take n values:
+------------------+-----------------+--------+--------------+
| group | number|rand_int| rand_double|
+------------------+-----------------+--------+--------------+
| 'GI_MAN'| 7| 3| 124.2|
| 'GI_MAN'| 7| 10| 121.15|
| 'GI_MAN'| 7| 11| 129.0|
| 'GI_MAN'| 7| 12| 125.0|
| 'GI_MAN'| 7| 13| 125.0|
| 'GI_MAN'| 7| 21| 127.0|
| 'GI_MAN'| 7| 22| 126.0|
+------------------+-----------------+--------+--------------+
and I am expecting a numpy nd_array, i.e. gi_man_array:
[[[124.2],[121.15],[129.0],[125.0],[125.0],[127.0],[126.0]]]
containing the rand_double values after applying the pivot.
I tried the following 2 approaches:
FIRST: I pivot the gi_man_df as follows:
gi_man_pivot = gi_man_df.groupBy("number").pivot('rand_int').sum("rand_double")
and the output I got is:
Row(number=7, group=u'GI_MAN', 3=124.2, 10=121.15, 11=129.0, 12=125.0, 13=125.0, 21=127.0, 23=126.0)
but the problem here is that, to get the desired output, I can't convert it to a matrix and then convert it again to a numpy array.
SECOND:
I created the vector in the dataframe itself using:
assembler = VectorAssembler(inputCols=["rand_double"],outputCol="rand_double_vector")
gi_man_vector = assembler.transform(gi_man_df)
gi_man_vector.show(7)
and I got the following output:
+----------------+-----------------+--------+--------------+--------------+
| group| number|rand_int| rand_double| rand_dbl_Vect|
+----------------+-----------------+--------+--------------+--------------+
| GI_MAN| 7| 3| 124.2| [124.2]|
| GI_MAN| 7| 10| 121.15| [121.15]|
| GI_MAN| 7| 11| 129.0| [129.0]|
| GI_MAN| 7| 12| 125.0| [125.0]|
| GI_MAN| 7| 13| 125.0| [125.0]|
| GI_MAN| 7| 21| 127.0| [127.0]|
| GI_MAN| 7| 22| 126.0| [126.0]|
+----------------+-----------------+--------+--------------+--------------+
but the problem here is that I can't pivot it on rand_dbl_Vect.
So my questions are:
1. Is either of the 2 approaches the correct way of achieving the desired output? If so, how can I proceed further to get the desired result?
2. What other way can I proceed with so that the code is optimal and the performance is good?
This
import numpy as np
np.array(gi_man_df.select('rand_double').collect())
produces
array([[ 124.2 ],
       [ 121.15],
       .........])
To convert the spark df to numpy array, first convert it to pandas and then apply the to_numpy() function.
spark_df.select(<list of columns needed>).toPandas().to_numpy()
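If the nested shape from the question ([[[124.2], [121.15], ...]]) is specifically needed, a small follow-up sketch (assuming that ordering by rand_int reproduces the intended pivot order):
import numpy as np

rows = gi_man_df.orderBy('rand_int').select('rand_double').collect()
gi_man_array = np.array([[[r.rand_double] for r in rows]])
# gi_man_array.shape is (1, 7, 1) for the sample data above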

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
Asterisk (*) works with alias. Ex:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
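To also carry the single column from df2 that the question mentions, the same aliased frames can feed both into the select (a small sketch):
df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')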
Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])
The trick is in:
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
Without using alias.
df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then, c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+
I believe that this would be the easiest and most intuitive way:
final = (
    df1.alias('df1')
       .join(df2.alias('df2'), on=df1['id'] == df2['id'], how='inner')
       .select('df1.*', 'df2.other')
)
To drop the duplicate b_id:
c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
Here is a code snippet that does the inner join, selects the columns from both dataframes, and aliases the same column to a different column name.
emp_df = spark.read.csv('Employees.csv', header=True)
dept_df = spark.read.csv('dept.csv', header=True)
emp_dept_df = emp_df.join(dept_df,'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':
+---+---------+------+------+
| ID| Name|Salary|DeptID|
+---+---------+------+------+
| 1| John| 20000| 1|
| 2| Rohit| 15000| 2|
| 3| Parth| 14600| 3|
| 4| Rishabh| 20500| 1|
| 5| Daisy| 34000| 2|
| 6| Annie| 23000| 1|
| 7| Sushmita| 50000| 3|
| 8| Kaivalya| 20000| 1|
| 9| Varun| 70000| 3|
| 10|Shambhavi| 21500| 2|
| 11| Johnson| 25500| 3|
| 12| Riya| 17000| 2|
| 13| Krish| 17000| 1|
| 14| Akanksha| 20000| 2|
| 15| Rutuja| 21000| 3|
+---+---------+------+------+
Output for 'dept_df.show()':
+------+----------+
|DeptID| Name|
+------+----------+
| 1| Sales|
| 2|Accounting|
| 3| Marketing|
+------+----------+
Join Output:
+---+---------+------+------+----------+
| ID| Name|Salary|DeptID| DName|
+---+---------+------+------+----------+
| 1| John| 20000| 1| Sales|
| 2| Rohit| 15000| 2|Accounting|
| 3| Parth| 14600| 3| Marketing|
| 4| Rishabh| 20500| 1| Sales|
| 5| Daisy| 34000| 2|Accounting|
| 6| Annie| 23000| 1| Sales|
| 7| Sushmita| 50000| 3| Marketing|
| 8| Kaivalya| 20000| 1| Sales|
| 9| Varun| 70000| 3| Marketing|
| 10|Shambhavi| 21500| 2|Accounting|
| 11| Johnson| 25500| 3| Marketing|
| 12| Riya| 17000| 2|Accounting|
| 13| Krish| 17000| 1| Sales|
| 14| Akanksha| 20000| 2|Accounting|
| 15| Rutuja| 21000| 3| Marketing|
+---+---------+------+------+----------+
I got an error: 'a not found' using the suggested code:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])
I changed a.columns to df1.columns and it worked out.
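For reference, the corrected line in full (other1 and other2 are placeholder column names carried over from the earlier answer):
from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')])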
Here is a function to drop duplicate columns after joining. Check it out:
def dropDupeDfCols(df):
    newcols = []
    dupcols = []

    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)

    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))

    return df.toDF(*newcols)
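A possible usage sketch, assuming two frames df1 and df2 whose join duplicates some column names:
joined = df1.join(df2, df1.id == df2.id)  # result may contain duplicate column names
deduped = dropDupeDfCols(joined)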
I just dropped the columns I didn't need from df2 and joined:
sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
**id should be in `columns_of_interest`, though.
df1.join(df2, ['id']).drop(df2.id)
If you need multiple columns from the other pyspark dataframe, then you can use this.
Based on a single join condition:
x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])
Based on multiple join conditions:
x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])
I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in databricks, and presumably works in a typical spark environment (replacing keyword "spark" with "sqlcontext"):
df.createOrReplaceTempView('t1') #temp table t1
df2.createOrReplaceTempView('t2') #temp table t2
output = spark.sql("""
    select
        t1.*
        ,t2.desired_field(s)
    from
        t1
        left (or inner) join t2 on t1.id = t2.id
""")
You could just make the join and after that select the wanted columns https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join