How to filter a dates column with one condition from another column in PySpark?

Assume I have the following data frame, named table_df, in PySpark:
sid | date | label
------------------
1033| 20170521 | 0
1033| 20170520 | 0
1033| 20170519 | 1
1033| 20170516 | 0
1033| 20170515 | 0
1033| 20170511 | 1
1033| 20170511 | 0
1033| 20170509 | 0
.....................
The data frame table_df contains many different IDs across its rows; the above is simply one typical case for a single ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509.
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid | filtered_date
-------------------
1033| 20170516
1033| 20170509
Any help is highly appreciated. I tried but could not find any smart ways.
Thanks

We can use a window partitioned by sid and ordered by date, and take the difference between each row's label and the next row's label:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('sid').orderBy('date')
# diff == 1 only when this row has label 0 and the next (later) date has label 1,
# i.e. this row is the closest label-0 date before a label-1 date
df.withColumn('diff', F.lead('label').over(w) - df['label']) \
  .where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+
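If you need the exact sid | filtered_date shape from the question, a select after the filter should be enough (a small follow-up sketch reusing df, w and F from above):
result = (df.withColumn('diff', F.lead('label').over(w) - df['label'])
            .where(F.col('diff') == 1)
            .select('sid', F.col('date').alias('filtered_date')))
result.show()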

Related

How to efficiently split a dataframe in Spark based on a condition?

I have a situation like this with a Spark dataframe:
id | value
----------
 1 |     0
 1 |     3
 2 |     4
 1 |     0
 2 |     2
 3 |     0
 4 |     1
Now what I want is to efficiently split this single dataframe into 3 different ones, such that each dataframe extracted from the original spans the rows between two 0s in the "value" column (with each zero indicating the beginning of a new dataframe), using Apache Spark, so that I would obtain this as the result:
Dataframe 1 (rows from first 0 value to the last value before the next 0):
id | value
----------
 1 |     0
 1 |     3
 2 |     4
Dataframe 2 (rows from the second zero value to the last value before the 3rd zero):
id | value
----------
 1 |     0
 2 |     2
Dataframe 3:
id | value
----------
 3 |     0
 4 |     1
As samkart said, there is no efficient/easy way to break data apart based on the order of rows. Still, if you are using Spark v3.2+ you can leverage pandas-on-Spark to do it in a Spark way, like below:
import pyspark.pandas as ps
from pyspark.sql import functions as F
from pyspark.sql import Window

# read with pandas-on-Spark so we get a positional index column when converting
pdf = ps.read_csv("/FileStore/tmp4/pand.txt")
sdf = pdf.to_spark(index_col='index')

# running count of zeros seen so far = group number for each row
sdf = sdf.withColumn("run", F.sum(F.when(F.col("value") == 0, 1).otherwise(0)).over(Window.orderBy("index")))

# create one dataframe per group: sdf1, sdf2, ...
toval = sdf.agg(F.max(F.col("run"))).collect()[0][0]
for x in range(1, toval + 1):
    globals()[f"sdf{x}"] = sdf.filter(F.col("run") == x).drop("index", "run")
For the above data it will create 3 dataframes sdf1, sdf2, sdf3, as shown below:
sdf1.show()
sdf2.show()
sdf3.show()
#output
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 1| 3|
| 2| 4|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 2| 2|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 4| 1|
+---+-----+
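If you prefer to stay in plain PySpark, the same running-sum idea works on any dataframe that already carries an ordering column. A minimal self-contained sketch (the explicit index column and the in-memory rows here are assumptions for illustration; in practice you need some column that reflects the original row order, since Spark dataframes have no inherent order):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# example data with an explicit ordering column (hypothetical)
rows = [(0, 1, 0), (1, 1, 3), (2, 2, 4), (3, 1, 0), (4, 2, 2), (5, 3, 0), (6, 4, 1)]
sdf = spark.createDataFrame(rows, ["index", "id", "value"])

# every 0 in "value" bumps the running count, which becomes the group id
sdf = sdf.withColumn(
    "run",
    F.sum(F.when(F.col("value") == 0, 1).otherwise(0)).over(Window.orderBy("index"))
)

# collect the distinct group ids and build one dataframe per group
groups = sorted(r["run"] for r in sdf.select("run").distinct().collect())
dataframes = [sdf.filter(F.col("run") == g).drop("index", "run") for g in groups]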

How can I replace the values in one pyspark dataframe column with the values from another column in a sub-section of the dataframe?

I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the activity columns, however, has to be populated with the sum of the "cost" column.
Data frame before group-by and pivot:
+----+-----------+-----------+-----------+-----------+
| id | quantity | cost | activity | category |
+----+-----------+-----------+-----------+-----------+
| 1 | 2 | 2 | skiing | outdoor |
| 2 | 0 | 2 | swimming | outdoor |
+----+-----------+-----------+-----------+-----------+
pivot code:
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")
result:
+----+-----------+-----------+-----------+
| id | category | skiing | swimming |
+----+-----------+-----------+-----------+
| 1 | outdoor | 2 | 5 |
| 2 | outdoor | 4 | 7 |
+----+-----------+-----------+-----------+
The problem is that for one of these activities, I need the activity column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I can just exchange the values in the quantity column for the ones in the cost column wherever the activity column value corresponds to the relevant activity. However, I can't find an example of how to do this in a pyspark data frame.
Any help would be much appreciated.
You can provide more than 1 aggregation after the pivot.
Let's say the input dataframe looks like the following
# +---+---+----+--------+-------+
# | id|qty|cost| act| cat|
# +---+---+----+--------+-------+
# | 1| 2| 2| skiing|outdoor|
# | 2| 0| 2|swimming|outdoor|
# | 3| 1| 2| skiing|outdoor|
# | 4| 2| 4|swimming|outdoor|
# +---+---+----+--------+-------+
Do a pivot and use agg() to provide more than 1 aggregation.
from pyspark.sql import functions as func

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')). \
    show()
# +---+-------+-----------+----------+-------------+------------+
# | id| cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# | 2|outdoor| null| null| 2| 0|
# | 1|outdoor| 2| 2| null| null|
# | 3|outdoor| 2| 1| null| null|
# | 4|outdoor| null| null| 4| 2|
# +---+-------+-----------+----------+-------------+------------+
Notice the field names. Pyspark automatically assigned the suffix based on the alias provided in the aggregations. Use a drop or select to retain the columns required and rename them per your choice.
Removing id from the groupBy makes the result much better.
data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')). \
    show()
# +-------+-----------+----------+-------------+------------+
# | cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor| 4| 3| 6| 2|
# +-------+-----------+----------+-------------+------------+
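If, as in the question, only one activity should come from cost and the rest from quantity, a final select over the pivoted result should do it. A sketch, assuming skiing is the cost-based activity and using the column names produced above:
pivoted = data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty'))

pivoted.select(
    'cat',
    func.col('skiing_cost').alias('skiing'),     # this activity is populated from sum(cost)
    func.col('swimming_qty').alias('swimming')   # the other activities use sum(quantity)
).show()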

How to use Window.unboundedPreceding, Window.unboundedFollowing on Distinct datetime

I have data like below
|Id | DateTime                     | products |
|---|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 1        |
I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the result below, where the second datetime is taken from the second row, which is the same as the first row:
|Id | DateTime                     | secondtime                   | Products |
|---|------------------------------|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
Please help me find the second latest datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array before taking second_recent:
from pyspark.sql import functions as F, Window

df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
        False
    )[1]
)
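Another option, if you would rather avoid collecting arrays, is to rank the distinct datetimes per id with dense_rank and join the result back (a sketch using the same column names as the code above):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('VipId').orderBy(F.col('LastModifiedTime').desc())

# rank the distinct datetimes per id and keep the 2nd most recent one
second = data.select('VipId', 'LastModifiedTime').distinct() \
    .withColumn('rnk', F.dense_rank().over(w)) \
    .filter(F.col('rnk') == 2) \
    .select('VipId', F.col('LastModifiedTime').alias('second_recent'))

df3 = data.join(second, on='VipId', how='left')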

pyspark filter a dataframe using the min value for each id

Given a table like the following:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 1| 2.80033330216659| 315.320000|
| 1| 2.8222365762732| 315.320000|
| 1| 5.64822705794013| 325.320000|
| 1| 5.686557787521759| 335.320000|
| 2| 5.70572315231258| 315.320000|
| 2| 5.724888517103389| 315.320000|
| 3| 5.744053881894209| 315.320000|
| 3|5.7604813374292005| 315.320000|
| 3| 5.77993740687426| 315.320000|
+--+------------------+-----------+
I'm trying to reduce the amount of records per id by only considering the diagnoses with the least diagnosis age per id. In SQL you would join the table to itself, something like:
SELECT a.id, a.diagnosis_age, a.diagnosis
FROM tbl1 a
INNER JOIN
(SELECT id, MIN(diagnosis_age) AS min_diagnosis_age
FROM tbl1
GROUP BY id) b
ON b.id = a.id
WHERE b.min_diagnosis_age = a.diagnosis_age
If it were an rdd you could do something like:
rdd.map(lambda x: (x["id"], [(x["diagnosis_age"], x["diagnosis"])]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda x: (x[0], [i for i in x[1] if i[0] == min(x[1])[0]]))
How would you achieve the same using only Spark dataframe operations, if this is possible? Specifically no SQL/RDD operations.
Thanks
You can use a window with the first function, and then filter out all the other rows.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("id").orderBy("diagnosis_age")
df.withColumn("least_age", F.first("diagnosis_age").over(w))\
.filter("diagnosis_age=least_age").drop("least_age").show()
+---+------------------+---------+
| id| diagnosis_age|diagnosis|
+---+------------------+---------+
| 1|2.1843037179180302| 315.32|
| 3| 5.744053881894209| 315.32|
| 2| 5.70572315231258| 315.32|
+---+------------------+---------+
You can also do this without a window function, using groupBy with min and first:
from pyspark.sql import functions as F
df.orderBy("diagnosis_age").groupBy("id")\
.agg(F.min("diagnosis_age").alias("diagnosis_age"), F.first("diagnosis").alias("diagnosis"))\
.show()
+---+------------------+---------+
| id| diagnosis_age|diagnosis|
+---+------------------+---------+
| 1|2.1843037179180302| 315.32|
| 3| 5.744053881894209| 315.32|
| 2| 5.70572315231258| 315.32|
+---+------------------+---------+
Note that I am ordering by diagnosis_age before the groupBy to handle those cases where your required diagnosis value does not appear in the first row of the group. However, if your data is already ordered by diagnosis_age you can use the above code without the orderBy.
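If you are on Spark 3.3+ (an assumption about your environment), min_by expresses the same thing directly and does not depend on row order within the group:
from pyspark.sql import functions as F

# min_by picks the diagnosis associated with the smallest diagnosis_age per id
df.groupBy("id").agg(
    F.min("diagnosis_age").alias("diagnosis_age"),
    F.min_by("diagnosis", "diagnosis_age").alias("diagnosis")
).show()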

How to merge rows in hive?

I have a production table in Hive which gets incremental (changed/new records) data from an external source on a daily basis. The values for a row are possibly spread across different dates; for example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| |
| 3| | b3|
+---+----+----+
On the second day, we get the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 4| a4| |
| 2| | b2 |
| 3| a3| |
+---+----+----+
This has a new record as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing the following output:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2 |
| 3| a3| b3|
| 4| a4| b4|
+---+----+----+
The number of columns is pretty huge, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(P.S.: it doesn't have to be sorted.)
This can be achieved using COALESCE and a FULL OUTER JOIN.
SELECT COALESCE(a.id,   b.id)   AS id,
       COALESCE(a.col1, b.col1) AS col1,
       COALESCE(a.col2, b.col2) AS col2
FROM tbl1 a
FULL OUTER JOIN table2 b
  ON a.id = b.id
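To persist the merged view, the same query can feed an INSERT OVERWRITE. A sketch with a hypothetical target table name (merged_table); note that COALESCE keeps the first non-NULL argument, so swap the argument order if the incremental side should win when both values are present, and add NULLIF handling if blanks are stored as '' rather than NULL:
-- write the merged result into a (hypothetical) target table
INSERT OVERWRITE TABLE merged_table
SELECT COALESCE(a.id,   b.id)   AS id,
       COALESCE(a.col1, b.col1) AS col1,
       COALESCE(a.col2, b.col2) AS col2
FROM tbl1 a
FULL OUTER JOIN table2 b
  ON a.id = b.id;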