Merge rows in a Spark Scala DataFrame and apply an aggregate function

I have a following Dataframe:
+---------------+----------+----------+----------+
|notification_id|       el1|       el2|is_deleted|
+---------------+----------+----------+----------+
|notificationId1|element1_1|element1_2|     false|
|notificationId2|element2_1|element2_2|     false|
|notificationId3|element3_1|element3_2|     false|
|notificationId1|      null|      null|      true|
|notificationId4|      null|      null|      true|
+---------------+----------+----------+----------+
The primary key in this example is notification_id.
Rows with is_deleted = true always have null values in every column except the primary key.
Rows with is_deleted = false have a unique primary key.
I would like to merge the rows with the same primary key in order to obtain a dataframe with a merged is_deleted column:
+---------------+----------+----------+----------+
|notification_id|       el1|       el2|is_deleted|
+---------------+----------+----------+----------+
|notificationId1|element1_1|element1_2|      true|
|notificationId2|element2_1|element2_2|     false|
|notificationId3|element3_1|element3_2|     false|
|notificationId4|      null|      null|      true|
+---------------+----------+----------+----------+

You can group by the primary key and use an any() aggregator on the is_deleted column, which will yield true if any of the rows with the same primary key have a true value for is_deleted:
import org.apache.spark.sql.functions._

val df_result = df_in.groupBy("notification_id").agg(
  first("el1", ignoreNulls = true).alias("el1"),
  first("el2", ignoreNulls = true).alias("el2"),
  expr("any(is_deleted)").alias("is_deleted")
)
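If I remember correctly, the any / bool_or aggregate is only available from Spark 3.0 onwards. On older versions, a max over a 0/1 flag is a simple substitute; a minimal sketch, untested against your data:

import org.apache.spark.sql.functions._

// Pre-3.0 alternative: is_deleted becomes true if any row in the group has it set
val df_result_pre30 = df_in.groupBy("notification_id").agg(
  first("el1", ignoreNulls = true).alias("el1"),
  first("el2", ignoreNulls = true).alias("el2"),
  (max(when(col("is_deleted"), 1).otherwise(0)) === 1).alias("is_deleted")
)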

Related

Compare columns from two different dataframes based on id

I have two dataframes to compare; the order of records is different, and the column names might be different. I have to compare columns (more than one) based on the unique key (id).
Example: consider dataframes df1 and df2
df1:
+---+-------+-----+
| id|student|marks|
+---+-------+-----+
| 1| Vijay| 23|
| 4| Vithal| 24|
| 2| Ram| 21|
| 3| Rahul| 25|
+---+-------+-----+
df2:
+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
| 3| Rahul| 25|
| 2| Ram| 23|
| 1| Vijay| 23|
| 4| Vithal| 24|
+-----+--------+------+
Here, based on id and newId, I need to compare the student and marks values and check whether the student with the same id has the same name and marks.
In this example, the student with id 2 has 21 marks in df1 but 23 marks in df2.
df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// | 2| Ram| 21|
// +---+-------+-----+
I think diff will give the result you are looking for.
scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])
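Since the column names differ between the two dataframes, another option is to rename df2's columns to match and join on the key. A sketch under that assumption (the renamed names student_b and marks_b are just illustrative):

import org.apache.spark.sql.functions._

// Align df2's column names with df1, join on the key, and keep rows where any value disagrees
val df2r = df2
  .withColumnRenamed("newId", "id")
  .withColumnRenamed("student1", "student_b")
  .withColumnRenamed("marks1", "marks_b")

df1.join(df2r, Seq("id"), "inner")
  .filter(col("student") =!= col("student_b") || col("marks") =!= col("marks_b"))
  .show()
// expected to flag id 2 (21 in df1 vs 23 in df2)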

Create new columns in Spark dataframe based on a hundred column pairs

I am trying to create around 9-10 columns based on the values in 100 columns (schm0, schm1, ..., schm100); the values of these new columns would be the values in the columns (idsm0, idsm1, ..., idsm100), which are part of the same dataframe.
There are additional columns as well, apart from these two sets of 100.
The problem is that not all the scheme columns (schm0, schm1, ..., schm100) will have values in them; we have to traverse through each one to find the values and create the columns accordingly. 85+ of the columns will be empty most of the time, so we need to ignore them.
Input dataframe example:
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
|col1|col2|col3|sch0|idsm0|schm1|idsm1|schm2|idsm2|schm3|idsm3|
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c| 0| 1| 2| 3| 4| 5| null| null|
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
schm and idsm can go up to 100, so it's basically 100 key-value column pairs.
Expected output:
+----+----+----+----------+-------+-------+
|col1|col2|col3|found_zero|found_2|found_4|
+----+----+----+----------+-------+-------+
| a| b| c| 1| 3| 5|
+----+----+----+----------+-------+-------+
Note: there is no fixed value in any column; any column can have any value. The columns that we create have to be based on the values found in any of the scheme columns (schm0...schm100), and the values in the created columns would be the corresponding idsymbol values (idsm0...idsm100).
I am finding it difficult to formulate a plan to do it; any help would be greatly appreciated.
Edited - adding another input example:
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|col1|col2|schm_0|idsm_0|schm_1|idsm_1|schm_2|idsm_2|schm_3|idsm_3|schm_4|idsm_4|schm_5|idsm_5|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
| 2| 6| b1| id1| i| id2| xs| id3| ch| id4| null| null| null| null|
| 3| 5| b2| id5| x2| id6| ch| id7| be| id8| null| null| db| id15|
| 4| 7| b1| id9| ch| id10| xs| id11| us| id12| null| null| null| null|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
For one particular record, the columns (schm_0, schm_1, ..., schm_100) can have around 9 to 10 unique values, as not all of the columns will be populated.
We need to create 9 different columns based on those 9 unique values. In short, for one row we need to iterate over each of the 100 scheme columns and collect all the values found there; based on the found values, separate columns need to be created, and the values in those created columns would be the values in idsm (idsm_0, idsm_1, ..., idsm_100).
I.e. if schm_0 has the value 'cb', we need to create a new column, e.g. 'col_cb', and the value in this column 'col_cb' would be the value in the 'idsm_0' column.
We need to do the same for all 100 columns (leaving out the empty ones).
Expected output:
+----+----+------+------+-----+------+------+------+------+------+------+
|col1|col2|col_b1|col_b2|col_i|col_x2|col_ch|col_xs|col_be|col_us|col_db|
+----+----+------+------+-----+------+------+------+------+------+------+
| 2| 6| id1| null| id2| null| id4| id3| null| null| null|
| 3| 5| null| id5| null| id6| id7| null| id8| null| id15|
| 4| 7| id9| null| null| null| id10| id11| null| id12| null|
+----+----+------+------+-----+------+------+------+------+------+------+
Hope this clears up the problem statement.
Any help on this would be highly appreciated.
Editing again for a small issue:
As we see in the above example, the columns that we create are based on the values found in the scheme-symbol columns, and there is already a defined set of columns that will get created, which is 10 in number. The columns, for example, would be (col_a, col_b, col_c, col_d, col_e, col_f, col_g, col_h, col_i, col_j).
Not all 10 keywords, i.e. (a, b, c, ..., j), will be present at all times in the dataset under (scheme0...scheme99).
The requirement is that we need to produce all 10 columns; if some of the keys (a, b, c, ..., j) are not present, the corresponding column should still be created, filled with null values.
You can get the output you are expecting, but it would be a multi-step process.
First, you would have to create two separate dataframes out of the original dataframe, i.e. one that contains the schm columns and another that contains the idsm columns. You will have to unpivot the schm columns and the idsm columns.
Then you would join both dataframes based on the unique combination of columns and filter out the null values. You would then group by the unique columns, pivot on the schm values, and take the first value of the idsm column.
//Sample Data
import org.apache.spark.sql.functions._
val initialdf = Seq(
  (2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),
  (3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),
  (4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)
).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
//creating two separate dataframes
val schmdf = initialdf.selectExpr("col1","col2",
  "stack(6, 'schm_0',schm_0, 'schm_1',schm_1, 'schm_2',schm_2, 'schm_3',schm_3, 'schm_4',schm_4, 'schm_5',schm_5) as (schm,schm_value)")
  .withColumn("id", split($"schm", "_")(1))
val idsmdf = initialdf.selectExpr("col1","col2",
  "stack(6, 'idsm_0',idsm_0, 'idsm_1',idsm_1, 'idsm_2',idsm_2, 'idsm_3',idsm_3, 'idsm_4',idsm_4, 'idsm_5',idsm_5) as (idsm,idsm_value)")
  .withColumn("id", split($"idsm", "_")(1))
//joining two dataframes and applying filter operation and giving alias for the column names to be used in next operation
val df = schmdf.join(idsmdf, Seq("col1","col2","id"), "inner")
  .filter($"idsm_value" =!= "null")
  .select("col1","col2","schm","schm_value","idsm","idsm_value")
  .withColumn("schm_value", concat(lit("col_"), $"schm_value"))
df.groupBy("col1","col2").pivot("schm_value").agg(first("idsm_value")).show
You can see the output below:
+----+----+------+------+------+------+------+------+-----+------+------+------+
|col1|col2|col_b1|col_b2|col_be|col_ch|col_db|col_es|col_i|col_us|col_x2|col_xs|
+----+----+------+------+------+------+------+------+-----+------+------+------+
| 2| 6| id1| null| null| id4| null| null| id2| null| null| id3|
| 3| 5| null| id5| id8| id7| id15| null| null| null| id6| null|
| 4| 7| id9| null| null| id10| null| id00| null| id12| null| id11|
+----+----+------+------+------+------+------+------+-----+------+------+------+
Updated Answer using Map:
If you have n columns and you know them in advance, you can use the approach below, which is more generic than the one above.
//sample Data
val initialdf = Seq(
  (2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),
  (3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),
  (4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)
).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
import org.apache.spark.sql.functions._
val schmcols = Seq("schm_0", "schm_1", "schm_2","schm_3","schm_4","schm_5")
val schmdf = initialdf.select($"col1", $"col2", explode(array(
    schmcols.map(column =>
      struct(
        lit(column).alias("schm"),
        col(column).alias("schm_value")
      )): _*
  )).alias("schmColumn"))
  .withColumn("id", split($"schmColumn.schm", "_")(1))
  .withColumn("schm", $"schmColumn.schm")
  .withColumn("schm_value", $"schmColumn.schm_value")
  .drop("schmColumn")
val idcols = Seq("idsm_0", "idsm_1", "idsm_2","idsm_3","idsm_4","idsm_5")
val idsmdf = initialdf.select($"col1", $"col2", explode(array(
    idcols.map(column =>
      struct(
        lit(column).alias("idsm"),
        col(column).alias("idsm_value")
      )): _*
  )).alias("idsmColumn"))
  .withColumn("id", split($"idsmColumn.idsm", "_")(1))
  .withColumn("idsm", $"idsmColumn.idsm")
  .withColumn("idsm_value", $"idsmColumn.idsm_value")
  .drop("idsmColumn")
val df = schmdf.join(idsmdf, Seq("col1","col2","id"), "inner")
  .filter($"idsm_value" =!= "null")
  .select("col1","col2","schm","schm_value","idsm","idsm_value")
  .withColumn("schm_value", concat(lit("col_"), $"schm_value"))
df.groupBy("col1","col2").pivot("schm_value")
  .agg(first("idsm_value")).show
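Two small additions that may help with the real 100-column case and with the fixed set of output columns asked for in the edit; both are sketches, assuming the columns follow the schm_*/idsm_* naming shown above:

// Derive the column lists from the schema by prefix instead of typing out 100 names
val schmcols = initialdf.columns.filter(_.startsWith("schm_")).toSeq
val idcols = initialdf.columns.filter(_.startsWith("idsm_")).toSeq

// If the output must always contain a fixed set of columns (col_a ... col_j),
// pass the expected values to pivot; keys that never occur come out as all-null columns
val expectedCols = Seq("a","b","c","d","e","f","g","h","i","j").map("col_" + _)
df.groupBy("col1","col2").pivot("schm_value", expectedCols).agg(first("idsm_value")).show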

Pyspark - how to drop records by primary keys?

I want to delete records from old_df if new_df has a del flag for the key metric_id. What is the right way to achieve this?
old_df (flag here is filled with nulls on purpose)
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 10| null| value2|
| 10| null| value9|
| 12| null|updated_value|
| 15| null| test_value2|
+---------+--------+-------------+
new_df
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 10| del| value2|
| 12| pass|updated_value|
| 15| del| test_value2|
+---------+--------+-------------+
result_df
+---------+--------+-------------+
|metric_id| flag | value|
+---------+--------+-------------+
| 12| pass|updated_value|
+---------+--------+-------------+
One easy way to do this is to join then filter:
from pyspark.sql.functions import lit

result_df = (
    old_df.join(new_df, on='metric_id', how='left')
    .where((new_df['flag'].isNull()) | (new_df['flag'] != lit('del')))
    .select('metric_id', new_df['flag'], new_df['value'])
)
Which produces
+---------+----+-------------+
|metric_id|flag| value|
+---------+----+-------------+
| 12|pass|updated_value|
+---------+----+-------------+
I'm using a left join because there might be records in old_df for which the primary key is not present in new_df (and you don't want to delete those).
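If you only need to drop the flagged rows and can keep old_df's own flag and value columns, a left anti join is another option. A minimal sketch (written in Scala here; the PySpark call is analogous, with how='left_anti'):

import org.apache.spark.sql.functions.col

// Remove the old_df rows whose metric_id is marked 'del' in new_df
val toDelete = new_df.filter(col("flag") === "del").select("metric_id")
val result = old_df.join(toDelete, Seq("metric_id"), "left_anti")

Note that this returns old_df's columns, so for metric_id 12 the flag would stay null rather than show 'pass' as in result_df above.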

how to update a row based on another row with same id

With a Spark dataframe, I want to update a row's value based on other rows with the same id.
For example,
I have records below,
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result as below
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize: the value column is null in some rows, and I want to update them if there is another row with the same id which has a valid value.
In SQL, I can simply write an UPDATE statement with an inner join, but I didn't find the same approach in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in sql)
Let's use the SQL method to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: This code assumes that there is only one non-null value for any particular id. When we aggregate over the partition we have to use an aggregation function, and I have used sum. If there are 2 non-null values for any id, they will be summed up. If an id could have multiple non-null values, then it's better to use min/max, so that we get one of the values rather than their sum.
df1 = sqlContext.sql(
    'select id, max(value) over (partition by id) as value from table_view'
)
You can use a window to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
    [1, 10],
    [1, None],
    [1, None],
    [2, 20],
    [2, None],
    [2, None],
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
.withColumn('value', F.first('value').over(window)) \
.show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in Scala.
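For reference, a minimal sketch of the same window approach in Scala (it assumes a SparkSession named spark so the implicits are available):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, Some(10)), (1, None), (1, None), (2, Some(20)), (2, None), (2, None))
  .toDF("id", "value")

// first with ignoreNulls picks the non-null value within each id partition
val byId = Window.partitionBy("id")
df.withColumn("value", first("value", ignoreNulls = true).over(byId)).show()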

Join two data frames, select all columns from one and some columns from the other

Let's say I have a Spark data frame, df1, with several columns (among which the column id), and a data frame df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
Asterisk (*) works with alias. Ex:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
Not sure if it's the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
    .select([col('a.' + xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])
The trick is in:
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
Without using an alias:
df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then, c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+
I believe that this would be the easiest and most intuitive way:
final = (df1.alias('df1').join(df2.alias('df2'),
on = df1['id'] == df2['id'],
how = 'inner')
.select('df1.*',
'df2.other')
)
Drop the duplicate b_id:
c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
Here is a code snippet that does the inner join, selects the columns from both dataframes, and aliases the duplicated column to a different name.
emp_df = spark.read.csv('Employees.csv', header=True)
dept_df = spark.read.csv('dept.csv', header=True)
emp_dept_df = emp_df.join(dept_df, 'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':
+---+---------+------+------+
| ID| Name|Salary|DeptID|
+---+---------+------+------+
| 1| John| 20000| 1|
| 2| Rohit| 15000| 2|
| 3| Parth| 14600| 3|
| 4| Rishabh| 20500| 1|
| 5| Daisy| 34000| 2|
| 6| Annie| 23000| 1|
| 7| Sushmita| 50000| 3|
| 8| Kaivalya| 20000| 1|
| 9| Varun| 70000| 3|
| 10|Shambhavi| 21500| 2|
| 11| Johnson| 25500| 3|
| 12| Riya| 17000| 2|
| 13| Krish| 17000| 1|
| 14| Akanksha| 20000| 2|
| 15| Rutuja| 21000| 3|
+---+---------+------+------+
Output for 'dept_df.show()':
+------+----------+
|DeptID| Name|
+------+----------+
| 1| Sales|
| 2|Accounting|
| 3| Marketing|
+------+----------+
Join Output:
+---+---------+------+------+----------+
| ID| Name|Salary|DeptID| DName|
+---+---------+------+------+----------+
| 1| John| 20000| 1| Sales|
| 2| Rohit| 15000| 2|Accounting|
| 3| Parth| 14600| 3| Marketing|
| 4| Rishabh| 20500| 1| Sales|
| 5| Daisy| 34000| 2|Accounting|
| 6| Annie| 23000| 1| Sales|
| 7| Sushmita| 50000| 3| Marketing|
| 8| Kaivalya| 20000| 1| Sales|
| 9| Varun| 70000| 3| Marketing|
| 10|Shambhavi| 21500| 2|Accounting|
| 11| Johnson| 25500| 3| Marketing|
| 12| Riya| 17000| 2|Accounting|
| 13| Krish| 17000| 1| Sales|
| 14| Akanksha| 20000| 2|Accounting|
| 15| Rutuja| 21000| 3| Marketing|
+---+---------+------+------+----------+
I got an error: 'a not found' using the suggested code:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])
I changed a.columns to df1.columns and it worked out.
A function to drop duplicate columns after joining; check it out:
def dropDupeDfCols(df):
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)
I just dropped the columns I didn't need from df2 and joined:
sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
Note: `id` should be in `columns_of_interest`, though.
df1.join(df2, ['id']).drop(df2.id)
If you need multiple columns from the other PySpark dataframe, then you can use this.
Based on a single join condition:
x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])
Based on multiple join conditions:
x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])
I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in Databricks, and presumably works in a typical Spark environment (replacing the keyword "spark" with "sqlContext"):
df.createOrReplaceTempView('t1')   # temp table t1
df2.createOrReplaceTempView('t2')  # temp table t2
output = (
    spark.sql("""
        select
            t1.*
            ,t2.desired_field(s)
        from
            t1
            left (or inner) join t2 on t1.id = t2.id
    """)
)
You could just do the join and select the wanted columns afterwards: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join