I have the following dataset with the name 'data':
+---------+-------------+------+
| name | subject| mark |
+---------+-------------+------+
| Anna| math| 80|
| Vlad| history| 67|
| Jack| art| 78|
| David| math| 71|
| Monica| art| 65|
| Alex| lit| 59|
| Mark| math| 82|
+---------+-------------+------+
I would like to do a map-reduce job.
The result should look like this or similar:
Anna, David : 1
Anna, Mark : 1
David, Mark : 1
Vlad, None : 1
Jack, Monica: 1
Alex, None : 1
I have tried to do the following:
data_new = data.select(['name', 'subject'])
data_new.show()
+---------+-------------+
| name | subject|
+---------+-------------+
| Anna| math|
| Vlad| history|
| Jack| art|
| David| math|
| Monica| art|
| Alex| lit|
| Mark| math|
+---------+-------------+
data_new.groupBy('name','subject').count().show(10)
However, this command does not give what I need.
You can do a self left join using the subject, get the distinct pairs, and add a column of 1.
import pyspark.sql.functions as F
result = data.alias('t1').join(
    data.alias('t2'),
    F.expr("t1.subject = t2.subject and t1.name != t2.name"),
    'left'
).select(
    F.concat_ws(
        ', ',
        F.least('t1.name', F.coalesce('t2.name', F.lit('None'))),
        F.greatest('t1.name', F.coalesce('t2.name', F.lit('None')))
    ).alias('pair')
).distinct().withColumn('val', F.lit(1))
result.show()
+------------+---+
| pair|val|
+------------+---+
| Alex, None| 1|
| Anna, David| 1|
| Anna, Mark| 1|
| None, Vlad| 1|
| David, Mark| 1|
|Jack, Monica| 1|
+------------+---+
The process could be:
Group the students with the same subject into an array
Call a UDF to create the permutations of the array items
Add a column that numbers each subject
Call the explode function to create separate rows for each item in the array
Let's do the steps one by one:
Step 1: Grouping
import pyspark.sql.functions as F
grouped_df = data_new.groupBy('subject').agg(F.collect_set('name').alias('students_array'))
Step 2: UDF
from itertools import permutations
from pyspark.sql.types import ArrayType, StringType
def permutation(df_col):
    return [list(p) for p in sorted(set(permutations(df_col)))]
permutation_udf = F.udf(permutation, ArrayType(ArrayType(StringType())))
grouped_df = grouped_df.select('*', permutation_udf('students_array').alias('permutations'))
Step 3: Create a new numeric column for each subject
from pyspark.sql import Window
grouped_df = grouped_df.withColumn('subject_no', F.row_number().over(Window.orderBy('subject')))
Step 4: Create separate rows
grouped_df.select(grouped_df.subject_no, F.explode(grouped_df.permutations)).show(truncate=False)
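The four steps above stop short of the "name1, name2 : 1" rows and don't cover subjects with a single student. Here is a minimal sketch to finish the job, assuming the data_new frame from the question; I use combinations instead of permutations so each unordered pair appears only once:
from itertools import combinations
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def make_pairs(names):
    # one "a, b" string per unordered pair; a single student is paired with None
    names = sorted(names)
    if len(names) < 2:
        return [names[0] + ", None"]
    return [a + ", " + b for a, b in combinations(names, 2)]

pairs_udf = F.udf(make_pairs, ArrayType(StringType()))

result = (data_new
          .groupBy('subject')
          .agg(F.collect_set('name').alias('students_array'))
          .select(F.explode(pairs_udf('students_array')).alias('pair'))
          .withColumn('val', F.lit(1)))
result.show()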
Related
I have the below two data frames, with a hash added as an additional column to identify differences for the same id from both data frames.
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen  |           |NY2    |103|2f34
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2=df1.join(df2,df1.hash==df2.hash,"leftanti")
df2_un_match_indf1=df2.join(df1,df2.hash==df1.hash,"leftanti")
# The above lists rows from both data frames, since all hashes for the same id are different
Now I am trying to find the difference in row values for the same id between 'df1_un_match_indf2' and 'df2_un_match_indf1', so that it shows the differences row by row.
df3=df1_un_match_indf2
df4=df2_un_match_indf1
common_diff=df3.join(df4,df3.id==df4.id,"inner")
common_diff.show()
but the result shows the difference like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I tried different ways, but couldn't find the right solution to produce this expected format.
Can anyone give a solution or idea for this?
Thanks
What you want to use is likely collect_list or maybe collect_set.
This is well illustrated by the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
.groupby("id")
.agg(F.collect_set("code"),
F.collect_list("name"))
.show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3=df1_un_match_indf2
df4=df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
.groupby("id")
.agg(F.collect_set("name"),
F.collect_list("department"))
.show())
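If you want one array per column, as in your expected output, a sketch along the same lines could aggregate every non-id column in one pass (this assumes df3 and df4 have identical schemas so the union lines the columns up; note that collect_list drops nulls, so a null state will simply be missing from its array):
from pyspark.sql import functions as F

# one array per shared column, per id
agg_cols = [F.collect_list(c).alias(c) for c in common_diff.columns if c != 'id']
common_diff.groupBy('id').agg(*agg_cols).show(truncate=False)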
If you can't do a union, just use an array:
from pyspark.sql.functions import array
common_diff.select(
common_diff.id,
array(
common_diff.thisState,
common_diff.thatState
).alias("State"),
array(
common_diff.thisDept,
common_diff.thatDept
).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby is likely cleaner and clearer (see the sketch below).
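For completeness, here is a sketch of the renaming step the array() example above presumes; the this_/that_ prefixes and the inner join are my own assumptions, not part of the original answer. Unlike collect_list, array() keeps nulls, so ['NY', null] comes out as shown in the expected output:
from pyspark.sql import functions as F

# prefix each side of the join so the combined frame has unambiguous column names
left = df3.select([F.col(c).alias('this_' + c) for c in df3.columns])
right = df4.select([F.col(c).alias('that_' + c) for c in df4.columns])
joined = left.join(right, left.this_id == right.that_id, 'inner')

common_diff = joined.select(
    F.col('this_id').alias('id'),
    F.array('this_name', 'that_name').alias('name'),
    F.array('this_department', 'that_department').alias('department'),
    F.array('this_state', 'that_state').alias('state'),
    F.array('this_hash', 'that_hash').alias('hash'),
)
common_diff.show(truncate=False)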
I have data like below
+---+----------------------------+--------+
|Id |DateTime                    |products|
+---+----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000|1       |
|1  |2017-08-24T00:00:00.000+0000|2       |
|1  |2017-08-24T00:00:00.000+0000|3       |
|1  |2016-05-24T00:00:00.000+0000|1       |
+---+----------------------------+--------+
I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results as below, getting the second date from the second row, which is the same as the first row:
+---+----------------------------+----------------------------+--------+
|Id |DateTime                    |secondtime                  |Products|
+---+----------------------------+----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|1       |
|1  |2017-08-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|2       |
|1  |2017-08-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|3       |
|1  |2016-05-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|1       |
+---+----------------------------+----------------------------+--------+
Please help me find the second most recent datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
"second_recent",
F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array before taking second_recent:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
"second_recent",
F.sort_array(
F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
False
)[1]
)
I have a sparsely populated table with values for various segments for unique user ids. I need to create an array with user_id and the relevant segment headers only.
Please note that this is just an indicative dataset. I have several hundreds of segments like these.
------------------------------------------------
| user_id | seg1 | seg2 | seg3 | seg4 | seg5 |
------------------------------------------------
| 100 | M | null| 25 | null| 30 |
| 200 | null| null| 43 | null| 250 |
| 300 | F | 3000| null| 74 | null|
------------------------------------------------
I am expecting the output to be
-------------------------------
| user_id| segment_array |
-------------------------------
| 100 | [seg1, seg3, seg5] |
| 200 | [seg3, seg5] |
| 300 | [seg1, seg2, seg4] |
-------------------------------
Is there any function available in pyspark or pyspark-sql to accomplish this?
Thanks for your help!
I cannot find a direct way, but you can do this.
from pyspark.sql.functions import array, array_remove, col, lit, when
cols = df.columns[1:]
r = df.withColumn('array', array(*[when(col(c).isNotNull(), lit(c)).otherwise('notmatch') for c in cols])) \
    .withColumn('array', array_remove('array', 'notmatch'))
r.show()
+-------+----+----+----+----+----+------------------+
|user_id|seg1|seg2|seg3|seg4|seg5| array|
+-------+----+----+----+----+----+------------------+
| 100| M|null| 25|null| 30|[seg1, seg3, seg5]|
| 200|null|null| 43|null| 250| [seg3, seg5]|
| 300| F|3000|null| 74|null|[seg1, seg2, seg4]|
+-------+----+----+----+----+----+------------------+
Not sure this is the best way but I'd attack it this way:
There's the collect_set function, which will always give you a unique set of values across the list of values you aggregate over.
Do a union of one small frame per segment (a looped version is sketched after the code below):
from pyspark.sql import functions as fn
df_seg_1 = df.select(
    'user_id',
    fn.when(
        fn.col('seg1').isNotNull(),
        fn.lit('seg1')
    ).alias('segment')
)
# repeat for all segments
df = df_seg_1.union(df_seg_2).union(...)
df.groupBy('user_id').agg(fn.collect_list('segment'))
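The "repeat for all segments" part can also be generated in a loop. A rough sketch, assuming the original wide frame is available as seg_df (that name is only for this sketch):
from functools import reduce
from pyspark.sql import functions as fn

# seg_df is the original wide frame (user_id, seg1 ... segN)
seg_cols = [c for c in seg_df.columns if c != 'user_id']
seg_dfs = [
    seg_df.select('user_id', fn.when(fn.col(c).isNotNull(), fn.lit(c)).alias('segment'))
    for c in seg_cols
]
all_segs = reduce(lambda a, b: a.union(b), seg_dfs)
result = (all_segs
          .where(fn.col('segment').isNotNull())
          .groupBy('user_id')
          .agg(fn.collect_list('segment').alias('segment_array')))
result.show(truncate=False)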
I want to get the distinct values and their respective counts of every column of a dataframe and store them as (k,v) in another dataframe.
Note: My columns are not static, they keep changing. So I cannot hardcode the column names; instead I should loop through them.
For Example, below is my dataframe
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| Blaze | IND| 19950312|
| Scarlet | USA| 19950313|
| Jonas | CAD| 19950312|
| Blaze | USA| 19950312|
| Jonas | CAD| 19950312|
| mark | USA| 19950313|
| mark | CAD| 19950313|
| Smith | USA| 19950313|
| mark | UK | 19950313|
| scarlet | CAD| 19950313|
My final result should be created in a new dataframe as (k,v) where k is the distinct record and v is the count of it.
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| (Blaze,2) | (IND,1) |(19950312,3)|
| (Scarlet,2) | (USA,4) |(19950313,6)|
| (Jonas,3) | (CAD,4) | |
| (mark,3) | (UK,1) | |
| (smith,1) | | |
Can anyone please help me with this, I'm using Spark 2.4.0 and Scala 2.11.12
Note: My columns are dynamic, so I can't hardcode the columns and do groupby on them.
I don't have an exact solution to your query, but I can provide some help to get you started on your issue.
Create dataframe
scala> val df = Seq(("Blaze ","IND","19950312"),
| ("Scarlet","USA","19950313"),
| ("Jonas ","CAD","19950312"),
| ("Blaze ","USA","19950312"),
| ("Jonas ","CAD","19950312"),
| ("mark ","USA","19950313"),
| ("mark ","CAD","19950313"),
| ("Smith ","USA","19950313"),
| ("mark ","UK ","19950313"),
| ("scarlet","CAD","19950313")).toDF("name", "country","dob")
Next, calculate the count of distinct elements of each column
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
Create a range to iterate over distCount
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create data frame:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
| name| country| dob|
+--------------------+--------------------+--------------------+
|[Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
I hope this helps you in some way.
Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes?
Pandas:
df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime', 'flt_flightnumber'], keep='first')
Spark dataframe (I use Spark 1.6.0) doesn't have the keep option
df.orderBy(['actual_datetime']).dropDuplicates(subset=['scheduled_datetime', 'flt_flightnumber'])
Imagine scheduled_datetime and flt_flightnumber are columns 6, 17. By creating keys based on the values of these columns we can also deduplicate:
def get_key(x):
    return "{0}{1}".format(x[6], x[17])

df = df.map(lambda x: (get_key(x), x)).reduceByKey(lambda x, y: x)
but how to specify to keep the first row and get rid of the other duplicates ? What about the last row ?
To everyone saying that dropDuplicates keeps the first occurrence - this is not strictly correct.
dropDuplicates keeps the 'first occurrence' of a sort operation - only if there is 1 partition. See below for some examples.
However this is not practical for most Spark datasets. So I'm also including an example of 'first occurrence' drop duplicates operation using Window function + sort + rank + filter.
See bottom of post for example.
This is tested in Spark 2.4.0 using pyspark.
dropDuplicates examples
import pandas as pd
# generating some example data with pandas, will convert to spark df below
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
print(dfall)
col1 datestr
0 0 2018-01-01
1 1 2018-01-01
2 2 2018-01-01
3 3 2018-01-01
4 4 2018-01-01
0 0 2018-02-01
1 1 2018-02-01
2 2 2018-02-01
3 3 2018-02-01
4 4 2018-02-01
0 0 2018-03-01
1 1 2018-03-01
2 2 2018-03-01
3 3 2018-03-01
4 4 2018-03-01
# first example
# does not give first (based on datestr)
(spark.createDataFrame(dfall)
.orderBy('datestr')
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates NOT based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-03-01|
| 1|2018-02-01|
| 3|2018-02-01|
| 2|2018-02-01|
| 4|2018-01-01|
+----+----------+
# second example
# testing what happens with repartition
(spark.createDataFrame(dfall)
.orderBy('datestr')
.repartition('datestr')
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates NOT based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-02-01|
| 1|2018-01-01|
| 3|2018-02-01|
| 2|2018-02-01|
| 4|2018-02-01|
+----+----------+
#third example
# testing with coalesce(1)
(spark
.createDataFrame(dfall)
.orderBy('datestr')
.coalesce(1)
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 2|2018-01-01|
| 3|2018-01-01|
| 4|2018-01-01|
+----+----------+
# fourth example
# testing with reverse sort then coalesce(1)
(spark
.createDataFrame(dfall)
.orderBy('datestr', ascending = False)
.coalesce(1)
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-03-01|
| 1|2018-03-01|
| 2|2018-03-01|
| 3|2018-03-01|
| 4|2018-03-01|
+----+----------+
window, sort, rank, filter example
# generating some example data with pandas
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
# into spark df
df_s = (spark.createDataFrame(dfall))
from pyspark.sql import Window
from pyspark.sql.functions import rank, col
window = Window.partitionBy("col1").orderBy("datestr")
(df_s.withColumn('rank', rank().over(window))
.filter(col('rank') == 1)
.drop('rank')
.show()
)
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 3|2018-01-01|
| 2|2018-01-01|
| 4|2018-01-01|
+----+----------+
# however this fails if ties/duplicates exist in the windowing partitions
# and so a tie breaker for the 'rank' function must be added
# generating some example data with pandas, will convert to spark df below
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-01-01' # note duplicates in this dataset
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
print(dfall)
col1 datestr
0 0 2018-01-01
1 1 2018-01-01
2 2 2018-01-01
3 3 2018-01-01
4 4 2018-01-01
0 0 2018-01-01
1 1 2018-01-01
2 2 2018-01-01
3 3 2018-01-01
4 4 2018-01-01
0 0 2018-03-01
1 1 2018-03-01
2 2 2018-03-01
3 3 2018-03-01
4 4 2018-03-01
# this will not fully deduplicate, since duplicates exist within the window partitions
# and no way to specify a ranking style exists in the pyspark rank() fn
df_s = spark.createDataFrame(dfall)
window = Window.partitionBy("col1").orderBy("datestr")
(df_s.withColumn('rank', rank().over(window))
.filter(col('rank') == 1)
.drop('rank')
.show()
)
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 0|2018-01-01|
| 1|2018-01-01|
| 1|2018-01-01|
| 3|2018-01-01|
| 3|2018-01-01|
| 2|2018-01-01|
| 2|2018-01-01|
| 4|2018-01-01|
| 4|2018-01-01|
+----+----------+
# to deal with ties within window partitions, a tiebreaker column is added
from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id
window = Window.partitionBy("col1").orderBy("datestr",'tiebreak')
(df_s
.withColumn('tiebreak', monotonically_increasing_id())
.withColumn('rank', rank().over(window))
.filter(col('rank') == 1).drop('rank','tiebreak')
.show()
)
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 3|2018-01-01|
| 2|2018-01-01|
| 4|2018-01-01|
+----+----------+
Use window and row_number functions.
Order by ascending or descending to select first or last.
from pyspark.sql import Window
from pyspark.sql import functions as f
window = Window.partitionBy("col1").orderBy(f.col("datestr").asc())
df = (df.withColumn('row', f.row_number().over(window))
      .filter(f.col('row') == 1)
      .drop('row'))
df.show()
I did the following:
dataframe.groupBy("uniqueColumn").min("time")
This will group by the given column and, within each group, keep the minimum time. Note the result only contains the grouping column and the min time; to keep the other columns of that first row you still need to join this back to the original dataframe.
solution 1
add a new column row_num (incremental column) and drop duplicates based on the min row after grouping on all the columns you are interested in (you can include all the columns for dropping duplicates except the row_num col).
solution 2:
turn the dataframe into an RDD (df.rdd), then group the RDD on one or more (or all) keys, run a lambda function on each group to drop the rows you don't want, and return only the row that you are interested in (see the sketch below).
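A rough sketch of solution 2; the column names col1/datestr follow the examples above, and the tie-breaking rule is an assumption:
# key each Row by col1 on the RDD and keep, per key, the row with the smallest datestr
keyed = df.rdd.map(lambda row: (row['col1'], row))
first_rows = keyed.reduceByKey(lambda a, b: a if a['datestr'] <= b['datestr'] else b)
result = spark.createDataFrame(first_rows.values(), schema=df.schema)
result.show()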
One of my friends (Sameer) mentioned that the old solution below didn't work for him:
use the dropDuplicates method; by default it keeps the first occurrence.
You can use a window with row_number:
import pandas as pd
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = spark.createDataFrame(pd.concat([df1,df2,df3]))
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col,row_number
window = Window.partitionBy('col1').orderBy(col('datestr'))
dfall.select('*', row_number().over(window).alias('posicion')).show()
dfall.select('*', row_number().over(window).alias('posicion')).where('posicion ==1').show()
+----+----------+--------+
|col1| datestr|posicion|
+----+----------+--------+
| 0|2018-01-01| 1|
| 0|2018-02-01| 2|
| 0|2018-03-01| 3|
| 1|2018-01-01| 1|
| 1|2018-02-01| 2|
| 1|2018-03-01| 3|
| 3|2018-01-01| 1|
| 3|2018-02-01| 2|
| 3|2018-03-01| 3|
| 2|2018-01-01| 1|
| 2|2018-02-01| 2|
| 2|2018-03-01| 3|
| 4|2018-01-01| 1|
| 4|2018-02-01| 2|
| 4|2018-03-01| 3|
+----+----------+--------+
+----+----------+--------+
|col1| datestr|posicion|
+----+----------+--------+
| 0|2018-01-01| 1|
| 1|2018-01-01| 1|
| 3|2018-01-01| 1|
| 2|2018-01-01| 1|
| 4|2018-01-01| 1|
+----+----------+--------+
I just did something perhaps similar to what you guys need, using pyspark's drop_duplicates.
The situation is this. I have 2 dataframes (coming from 2 files) which are exactly the same except for 2 columns, file_date (file date extracted from the file name) and data_date (row date stamp). Annoyingly, I have rows with the same data_date (and all other column cells too) but a different file_date, as they get replicated in every new file with the addition of one new row.
I needed to capture all rows from the new file, plus that one row left over from the previous file. That row is not in the new file. The remaining columns to the right of data_date are the same between the two files for the same data_date.
file_1_20190122 - df1
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-22|2019-01-16| <- One row we want to keep where file_date 22nd
| AGGH|2019-01-22|2019-01-17|
| AGGH|2019-01-22|2019-01-18|
| AGGH|2019-01-22|2019-01-19|
| AGGH|2019-01-22|2019-01-20|
| AGGH|2019-01-22|2019-01-21|
| AGGH|2019-01-22|2019-01-22|
file_2_20190123 - df2
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-23|2019-01-17| \/ ALL rows we want to keep where file_date 23rd
| AGGH|2019-01-23|2019-01-18|
| AGGH|2019-01-23|2019-01-19|
| AGGH|2019-01-23|2019-01-20|
| AGGH|2019-01-23|2019-01-21|
| AGGH|2019-01-23|2019-01-22|
| AGGH|2019-01-23|2019-01-23|
This will require us to sort and concat df's, then deduplicate them on all columns but one.
Let me walk you through.
union_df = df1.union(df2) \
.sort(['station_code', 'data_date'], ascending=[True, True])
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-22|2019-01-16| <- keep
| AGGH|2019-01-23|2019-01-17| <- keep
| AGGH|2019-01-22|2019-01-17| x- drop
| AGGH|2019-01-22|2019-01-18| x- drop
| AGGH|2019-01-23|2019-01-18| <- keep
| AGGH|2019-01-23|2019-01-19| <- keep
| AGGH|2019-01-22|2019-01-19| x- drop
| AGGH|2019-01-23|2019-01-20| <- keep
| AGGH|2019-01-22|2019-01-20| x- drop
| AGGH|2019-01-22|2019-01-21| x- drop
| AGGH|2019-01-23|2019-01-21| <- keep
| AGGH|2019-01-23|2019-01-22| <- keep
| AGGH|2019-01-22|2019-01-22| x- drop
| AGGH|2019-01-23|2019-01-23| <- keep
Here we drop duplicates from the already sorted rows, deduplicating on all columns except 'file_date'.
nonduped_union_df = union_df \
.drop_duplicates(['station_code', 'data_date', 'time_zone',
'latitude', 'longitude', 'elevation',
'highest_temperature', 'lowest_temperature',
'highest_temperature_10_year_normal',
'another_50_columns'])
And the result holds ONE row with the earliest date from DF1, which is not in DF2, and ALL rows from DF2.
nonduped_union_df.select(['station_code', 'file_date', 'data_date',
'highest_temperature', 'lowest_temperature']) \
.sort(['station_code', 'data_date'], ascending=[True, True]) \
.show(30)
+------------+----------+----------+-------------------+------------------+
|station_code| file_date| data_date|highest_temperature|lowest_temperature|
+------------+----------+----------+-------------------+------------------+
| AGGH|2019-01-22|2019-01-16| 90| 77| <- df1 22nd
| AGGH|2019-01-23|2019-01-17| 90| 77| \/- df2 23rd
| AGGH|2019-01-23|2019-01-18| 91| 75|
| AGGH|2019-01-23|2019-01-19| 88| 77|
| AGGH|2019-01-23|2019-01-20| 88| 77|
| AGGH|2019-01-23|2019-01-21| 88| 77|
| AGGH|2019-01-23|2019-01-22| 90| 75|
| AGGH|2019-01-23|2019-01-23| 90| 75|
| CWCA|2019-01-22|2019-01-15| 23| -2|
| CWCA|2019-01-23|2019-01-16| 7| -8|
| CWCA|2019-01-23|2019-01-17| 28| -6|
| CWCA|2019-01-23|2019-01-18| 0| -13|
| CWCA|2019-01-23|2019-01-19| 25| -15|
| CWCA|2019-01-23|2019-01-20| -4| -18|
| CWCA|2019-01-23|2019-01-21| 27| -6|
| CWCA|2019-01-22|2019-01-22| 30| 17|
| CWCA|2019-01-23|2019-01-22| 30| 13|
| CWCO|2019-01-22|2019-01-15| 34| 29|
| CWCO|2019-01-23|2019-01-16| 33| 13|
| CWCO|2019-01-22|2019-01-16| 33| 13|
| CWCO|2019-01-22|2019-01-17| 23| 7|
| CWCO|2019-01-23|2019-01-17| 23| 7|
+------------+----------+----------+-------------------+------------------+
only showing top 30 rows
It may not be the most suitable answer for this case, but it's the one that worked for me.
Let me know if you get stuck somewhere.
BTW, if anyone can tell me how to select all columns in a df except one, without listing them in a list, I will be very thankful.
Regards
G
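On the closing aside about selecting all columns except one, here are two common tricks (the column name is only an example):
# drop the one column you don't want
union_df.drop('file_date').show()
# or build the list explicitly from df.columns
union_df.select([c for c in union_df.columns if c != 'file_date']).show()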
I would try this way:
Assuming your data_df looks like this, and we want to keep the rows with the highest value in col1 per datestr:
col1 datestr
0 2018-01-01
1 2018-01-01
2 2018-01-01
3 2018-01-01
4 2018-01-01
0 2018-02-01
1 2018-02-01
2 2018-02-01
3 2018-02-01
4 2018-02-01
0 2018-03-01
1 2018-03-01
2 2018-03-01
3 2018-03-01
4 2018-03-01
you can do:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('datestr')
data_df = (data_df.withColumn("max", F.max(F.col("col1")).over(w))
           .where(F.col('max') == F.col('col1'))
           .drop("max"))
this results in:
col1 datestr
4 2018-01-01
4 2018-02-01
4 2018-03-01
Given the below table:
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 2|2018-01-01|
| 3|2018-01-01|
| 4|2018-01-01|
| 0|2018-02-01|
| 1|2018-02-01|
| 2|2018-02-01|
| 3|2018-02-01|
| 4|2018-02-01|
| 0|2018-03-01|
| 1|2018-03-01|
| 2|2018-03-01|
| 3|2018-03-01|
| 4|2018-03-01|
+----+----------+
You can do it in two steps:
Group the given table by col1 and pick the min date.
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 2|2018-01-01|
| 3|2018-01-01|
| 4|2018-01-01|
+----+----------+
Left join the resultant table with the original table on col1 and min_datestr (see the sketch below).
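A sketch of those two steps, assuming the table above is loaded as df:
from pyspark.sql import functions as F

# Step 1: min datestr per col1
min_dates = df.groupBy('col1').agg(F.min('datestr').alias('min_datestr'))
# Step 2: join back to the original table to keep the matching rows
result = (min_dates
          .join(df, (min_dates.col1 == df.col1) & (min_dates.min_datestr == df.datestr), 'left')
          .select(df.col1, df.datestr))
result.show()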
If the dataset isn't large, convert it to a pandas dataframe, drop duplicates keeping the first or last, then convert back.
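A sketch of that round trip, with the column names taken from the examples above:
# sort so "first" means the earliest datestr, then dedupe in pandas
pdf = df.orderBy('datestr').toPandas()
pdf = pdf.drop_duplicates(subset=['col1'], keep='first')   # or keep='last'
df_dedup = spark.createDataFrame(pdf)
df_dedup.show()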