I have a PySpark dataframe that looks like this:
data = [(2010, 3, 12, 0, 'p1', 'state1'),
(2010, 3, 12, 0, 'p2', 'state2'),
(2010, 3, 12, 0, 'p3', 'state1'),
(2010, 3, 12, 0, 'p4', 'state2'),
(2010, 3, 12, 2, 'p1', 'state3'),
(2010, 3, 12, 2, 'p2', 'state1'),
(2010, 3, 12, 2, 'p3', 'state3'),
(2010, 3, 12, 4, 'p1', 'state1'),
(2010, 3, 12, 6, 'p1', 'state1')]
columns = ['year', 'month', 'day', 'hour', 'process_id','state']
df = spark.createDataFrame(data=data, schema=columns)
df.show()
+----+-----+---+----+----------+------+
|year|month|day|hour|process_id| state|
+----+-----+---+----+----------+------+
|2010| 3| 12| 0| p1|state1|
|2010| 3| 12| 0| p2|state2|
|2010| 3| 12| 0| p3|state1|
|2010| 3| 12| 0| p4|state2|
|2010| 3| 12| 2| p1|state3|
|2010| 3| 12| 2| p2|state1|
|2010| 3| 12| 2| p3|state3|
|2010| 3| 12| 4| p1|state1|
|2010| 3| 12| 6| p1|state1|
+----+-----+---+----+----------+------+
The dataframe is already sorted in increasing order by the four columns year, month, day and hour, as shown above. The timestamps increase in 2-hour intervals.
I would like to find out, for each process_id, how many times its state changes within each day. For that, I intend to use groupby, something like this:
chg_count_df = df.groupby('process_id', 'year', 'month', 'day') \
                 .agg(.....)
For this example, the expected output is:
+----+-----+---+----------+----------+
|year|month|day|process_id| chg_count|
+----+-----+---+----------+----------+
|2010| 3| 12| p1| 2|
|2010| 3| 12| p2| 1|
|2010| 3| 12| p3| 1|
|2010| 3| 12| p4| 0|
+----+-----+---+----------+----------+
What should go into the agg(...) function? Or is there a better way to do this?
You could use the lag window function to check whether the state changed, then groupBy and sum the changes.
from pyspark.sql import functions as F, Window as W
w = W.partitionBy('year', 'month', 'day', 'process_id').orderBy(F.desc('hour'))
df = df.withColumn('change', F.coalesce((F.lag('state').over(w) != F.col('state')).cast('int'), F.lit(0)))
df = df.groupBy('year', 'month', 'day', 'process_id').agg(F.sum('change').alias('chg_count'))
df.show()
# +----+-----+---+----------+---------+
# |year|month|day|process_id|chg_count|
# +----+-----+---+----------+---------+
# |2010| 3| 12| p1| 2|
# |2010| 3| 12| p2| 1|
# |2010| 3| 12| p3| 1|
# |2010| 3| 12| p4| 0|
# +----+-----+---+----------+---------+
chg_count_df = df.groupby('process_id', 'year', 'month', 'day').count()
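A plain groupby(...).count() as in the line above counts rows per group rather than state changes. If you want to keep the change-counting logic entirely inside agg(), as the question asks, here is a sketch using Spark's higher-order aggregate function, applied to the original dataframe from the question (an assumption: Spark 2.4+):
from pyspark.sql import functions as F

# Sketch: collect (hour, state) pairs per group, sort by hour, then fold over
# the array counting adjacent pairs whose states differ.
chg_count_df = df.groupBy('year', 'month', 'day', 'process_id').agg(
    F.expr("""
        aggregate(
            sort_array(collect_list(struct(hour, state))),
            named_struct('prev', cast(null as string), 'cnt', 0),
            (acc, x) -> named_struct(
                'prev', x.state,
                'cnt', acc.cnt + if(acc.prev is not null and acc.prev != x.state, 1, 0)),
            acc -> acc.cnt
        )
    """).alias('chg_count')
)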
I have a dataframe that looks like this.
This dataframe is sorted by date, and col1 is just some random value.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True)])
TEST_data = [('2020-08-01',3),('2020-08-02',1),('2020-08-03',-1),('2020-08-04',-1),('2020-08-05',3),
             ('2020-08-06',-1),('2020-08-07',6),('2020-08-08',4),('2020-08-09',5)]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()
+----------+----+
| date|col1|
+----------+----+
|2020-08-01| 3|
|2020-08-02| 1|
|2020-08-03| -1|
|2020-08-04| -1|
|2020-08-05| 3|
|2020-08-06| -1|
|2020-08-07| 6|
|2020-08-08| 4|
|2020-08-09| 5|
+----------+----+
LOGIC: normally want = lead(col1) + 1. If the next col1 is -1, skip it and use the next non -1 value instead, adding 1 more for each -1 skipped (one -1 skipped means +2, two consecutive -1s mean +3, and so on).
The resulting dataframe will look like this (the WANT column is what I want as output):
+----------+----+----+
| date|col1|WANT|
+----------+----+----+
|2020-08-01| 3| 2|
|2020-08-02| 1| 6|
|2020-08-03| -1| 5|
|2020-08-04| -1| 4|
|2020-08-05| 3| 8|
|2020-08-06| -1| 7|
|2020-08-07| 6| 5|
|2020-08-08| 4| 6|
|2020-08-09| 5| -1|
+----------+----+----+
Look at the last row, where col1 == 5: that 5 plus 1 shows up as want == 6 on the previous row (2020-08-08).
If the next value is -1, we add 1 more; if -1 is repeated twice, we add 2 more, and so on.
This is hard to explain in words. Lastly, the last row has no following row to take a value from, which would give null, so replace it with -1. I have a diagram.
You can check whether the following code and logic work for you:
Create a sub-group label g as the running sum of int(col1 != -1) over w1; we only care about rows with col1 == -1, so g is left null on all other rows.
Set the residual to 1 on normal rows; on rows with col1 == -1 it is the running count over window w2 plus 1.
Take prev_col1 as the last value over w1 that is not -1 (using nullif). The name prev_col1 might be confusing: it is a fill in the w1 order (the typical PySpark way to do ffill), so it only replaces col1 when col1 == -1 and otherwise keeps the original value.
Set val = prev_col1 + residual, take its lag over w1, and replace the resulting null (on the last date) with -1.
Code below:
from pyspark.sql.functions import when, col, expr, count, desc, lag, coalesce, lit
from pyspark.sql import Window
w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('g').orderBy(desc('date'))
TEST_df.withColumn('g', when(col('col1') == -1, expr("sum(int(col1!=-1))").over(w1))) \
.withColumn('residual', when(col('col1') == -1, count('*').over(w2) + 1).otherwise(1)) \
.withColumn('prev_col1',expr("last(nullif(col1,-1),True)").over(w1)) \
.withColumn('want', coalesce(lag(expr("prev_col1 + residual")).over(w1),lit(-1))) \
.orderBy('date').show()
+----------+----+----+--------+---------+----+
| date|col1| g|residual|prev_col1|want|
+----------+----+----+--------+---------+----+
|2020-08-01| 3|null| 1| 3| 2|
|2020-08-02| 1|null| 1| 1| 6|
|2020-08-03| -1| 4| 3| 3| 5|
|2020-08-04| -1| 4| 2| 3| 4|
|2020-08-05| 3|null| 1| 3| 8|
|2020-08-06| -1| 3| 2| 6| 7|
|2020-08-07| 6|null| 1| 6| 5|
|2020-08-08| 4|null| 1| 4| 6|
|2020-08-09| 5|null| 1| 5| -1|
+----------+----+----+--------+---------+----+
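The forward-fill trick used for prev_col1 (last over an ordered window, ignoring nulls) is handy on its own. A minimal sketch, assuming a hypothetical dataframe some_df with columns date and value:
from pyspark.sql import functions as F, Window as W

# Carry the last non-null 'value' forward in date order (hypothetical columns).
w = W.orderBy('date').rowsBetween(W.unboundedPreceding, W.currentRow)
filled = some_df.withColumn('value_ffill', F.last('value', ignorenulls=True).over(w))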
I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained max number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts as shown below. I have converted my spark df to pandas df here, but I need to be able to perform similar operation on the spark df directly.
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times'%(top_val,top_val_cnt))
The output tells me that the value 1 occurred the most number of times (twice in this case):
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occured 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
This is Scala code, but the Python version should be similar; maybe it will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
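A rough PySpark equivalent of the Scala snippet above, as a sketch (assuming Spark 3.1+ for Python lambdas in higher-order functions):
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)])
values = F.array(*[F.col(c) for c in df1.columns])
# Map each distinct value in the row to the number of times it appears.
value_counts = F.map_from_arrays(
    F.array_distinct(values),
    F.transform(F.array_distinct(values),
                lambda v: F.size(F.filter(values, lambda x: x == v))))
df1.select(values.alias('values'), value_counts.alias('value_counts')).show(truncate=False)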
Let's have a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
newdf = df.rdd.map(lambda x : (x[0],x[1],x[2:])) \
.map(lambda x : (x[0],x[1],x[2][0],x[2][1],x[2][2],[max(set(x[2]),key=x[2].count )])) \
.toDF(['id','test','run_1','run_2','run_3','most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case when every item in the list is different, i.e. return null:
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = None
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
    return num  # None when every value occurs only once
The counter is initialized to 1, so a value is returned only if it occurs more than once; otherwise the UDF returns null.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+
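If you prefer to avoid a Python UDF, here is a sketch using only built-in functions (an assumption: Spark 3.1+ for Python lambdas in higher-order functions; structs compare field by field, so array_max picks the entry with the highest count):
from pyspark.sql import functions as F

runs = F.array('run_1', 'run_2', 'run_3')
# For each distinct value, build a (count, value) struct and take the maximum.
top = F.array_max(F.transform(
    F.array_distinct(runs),
    lambda v: F.struct(F.size(F.filter(runs, lambda x: x == v)).alias('cnt'),
                       v.alias('val'))))
df.withColumn('top_val', F.when(top['cnt'] > 1, top['val'])) \
  .withColumn('top_val_cnt', F.when(top['cnt'] > 1, top['cnt'])) \
  .show()
As with the UDF, top_val stays null when all three runs differ.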
I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Let's say I have chosen a as my preferred Type. In the example above there are three unique values of Day (1, 2, 3). For each unique value of Day that has a row with the chosen Type a (that is, days 1 and 2 in the above data), I want to create a dataframe containing all rows with the chosen Type and Day. In the example above, I will have two dataframes, which will look as below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of columns, so I want to know the most efficient way to achieve this.
You can use the below mentioned code to generate the example given above.
from pyspark.sql import *
import numpy as np
Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)
stats = [stat1, stat2, stat3, stat4, stat5, stat6, stat7, stat8,
         stat9, stat10, stat11, stat12]
df = spark.createDataFrame(stats)
You can just use df.repartition("Type", "Day")
Docs for the same.
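Note that repartition only co-locates rows within a single DataFrame; if the goal is separate physical outputs per (Type, Day), a common follow-up is to write with partitionBy. A sketch (the output path is hypothetical):
df.repartition("Type", "Day") \
  .write.partitionBy("Type", "Day") \
  .mode("overwrite") \
  .parquet("/tmp/stats_by_type_day")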
When I validate the partitioning using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4
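If you actually need separate DataFrame objects per Day for the chosen Type, as the question asks, a simple sketch is to collect the distinct days for that Type and filter once per day (reasonable when the number of days is small):
from pyspark.sql import functions as F

chosen_type = 'a'
days = [r['Day'] for r in
        df.filter(F.col('Type') == chosen_type).select('Day').distinct().collect()]
frames = {d: df.filter((F.col('Type') == chosen_type) & (F.col('Day') == d)) for d in days}
frames[1].show()   # the Day == 1 dataframe from the example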
Can I compute a discounted future cumulative sum using spark sql? Below is an example that computes the undiscounted cum future sum using window functions, and I hard coded in what I mean by the discounted cum sum:
from pyspark.sql.window import Window
from pyspark.sql import functions as F, types as T
def undiscountedCummulativeFutureReward(df):
    windowSpec = Window \
        .partitionBy('user') \
        .orderBy('time') \
        .rangeBetween(0, Window.unboundedFollowing)
    tot_reward = F.sum('reward').over(windowSpec)
    df_tot_reward = df.withColumn('undiscounted', tot_reward)
    return df_tot_reward
def makeData(spark, gamma=0.5):
    data = [{'user': 'bob', 'time': 3, 'reward': 10, 'discounted_cum': 10 + (gamma * 9) + ((gamma ** 2) * 11)},
            {'user': 'bob', 'time': 4, 'reward': 9, 'discounted_cum': 9 + gamma * 11},
            {'user': 'bob', 'time': 5, 'reward': 11, 'discounted_cum': 11.0},
            {'user': 'jo', 'time': 4, 'reward': 6, 'discounted_cum': 6 + gamma * 7},
            {'user': 'jo', 'time': 5, 'reward': 7, 'discounted_cum': 7.0},
            ]
    schema = T.StructType([T.StructField('user', T.StringType(), False),
                           T.StructField('time', T.IntegerType(), False),
                           T.StructField('reward', T.IntegerType(), False),
                           T.StructField('discounted_cum', T.FloatType(), False)])
    return spark.createDataFrame(data=data, schema=schema)
def main(spark):
    df = makeData(spark)
    df = undiscountedCummulativeFutureReward(df)
    df.orderBy('user', 'time').show()
    return df
When you run it you get:
+----+----+------+--------------+------------+
|user|time|reward|discounted_cum|undiscounted|
+----+----+------+--------------+------------+
| bob| 3| 10| 17.25| 30|
| bob| 4| 9| 14.5| 20|
| bob| 5| 11| 11.0| 11|
| jo| 4| 6| 9.5| 13|
| jo| 5| 7| 7.0| 7|
+----+----+------+--------------+------------+
That is, the discounted column is the sum of \gamma^k r_k for k = 0 to \infty, where r_0 is the current row's reward and r_k is the reward k steps later for the same user.
I'm wondering if I can compute the discounted column with window functions, e.g. introduce a column with the rank and a literal with gamma and multiply things together, but it's still not quite clear to me how. I suppose I could do it with some kind of UDF, but I think I'd have to first collect the rewards per user as a list, return a new list with the cumulative discounted sum, and then explode the list.
Suppose you were starting with the following DataFrame:
df.show()
#+----+----+------+
#|user|time|reward|
#+----+----+------+
#| bob| 3| 10|
#| bob| 4| 9|
#| bob| 5| 11|
#| jo| 4| 6|
#| jo| 5| 7|
#+----+----+------+
You can join this DataFrame to itself on the user column, and keep only those rows where the time column of the right table is greater than or equal to the time column of the left table. We make this easier by aliasing the DataFrames l and r.
After the join, you can group by user, time and reward from the left table and aggregate the reward column from the right table. However it seems that a groupBy followed by an orderBy is not guaranteed to maintain that order, so you should use a Window to be explicit.
from pyspark.sql import Window, functions as f
w = Window.partitionBy("user", "l.time", "l.reward").orderBy("r.time")
df = df.alias("l").join(df.alias("r"), on="user")\
.where("r.time>=l.time")\
.select(
"user",
f.col("l.time").alias("time"),
f.col("l.reward").alias("reward"),
f.collect_list("r.reward").over(w).alias("rewards")
)
df.show()
#+----+----+------+-----------+
#|user|time|reward| rewards|
#+----+----+------+-----------+
#| jo| 4| 6| [6]|
#| jo| 4| 6| [6, 7]|
#| jo| 5| 7| [7]|
#| bob| 3| 10| [10]|
#| bob| 3| 10| [10, 9]|
#| bob| 3| 10|[10, 9, 11]|
#| bob| 4| 9| [9]|
#| bob| 4| 9| [9, 11]|
#| bob| 5| 11| [11]|
#+----+----+------+-----------+
Now you have all of the elements required to compute your discounted_cum column.
Spark 2.1 and above:
You can use pyspark.sql.functions.posexplode to explode the rewards array along with the index in the list. This will make a new row for each value in the rewards array. Use distinct to drop duplicates that were introduced by using the Window function (instead of groupBy).
We'll call the index k and the reward rk. Now you can apply your function using pyspark.sql.functions.pow
gamma = 0.5
df.select("user", "time", "reward", f.posexplode("rewards").alias("k", "rk"))\
.distinct()\
.withColumn("discounted", f.pow(f.lit(gamma), f.col("k"))*f.col("rk"))\
.groupBy("user", "time")\
.agg(f.first("reward").alias("reward"), f.sum("discounted").alias("discounted_cum"))\
.show()
#+----+----+------+--------------+
#|user|time|reward|discounted_cum|
#+----+----+------+--------------+
#| bob| 3| 10| 17.25|
#| bob| 4| 9| 14.5|
#| bob| 5| 11| 11.0|
#| jo| 4| 6| 9.5|
#| jo| 5| 7| 7.0|
#+----+----+------+--------------+
Older Versions of Spark
For older versions of Spark, you'll have to use row_number() - 1 to get the values for k after using explode:
df.select("user", "time", "reward", f.explode("rewards").alias("rk"))\
.distinct()\
.withColumn(
"k",
f.row_number().over(Window.partitionBy("user", "time").orderBy("time"))-1
)\
.withColumn("discounted", f.pow(f.lit(gamma), f.col("k"))*f.col("rk"))\
.groupBy("user", "time")\
.agg(f.first("reward").alias("reward"), f.sum("discounted").alias("discounted_cum"))\
.show()
#+----+----+------+--------------+
#|user|time|reward|discounted_cum|
#+----+----+------+--------------+
#| jo| 4| 6| 9.5|
#| jo| 5| 7| 7.0|
#| bob| 3| 10| 17.25|
#| bob| 4| 9| 14.5|
#| bob| 5| 11| 11.0|
#+----+----+------+--------------+
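The question also asks whether window functions alone can do this. On newer Spark versions the self-join can be skipped by collecting the future rewards with a window frame and folding them with a running power of gamma. A minimal sketch, starting again from the original user/time/reward DataFrame and assuming Spark 3.1+ for Python lambdas in higher-order functions:
from pyspark.sql import Window, functions as f

gamma = 0.5
w = Window.partitionBy('user').orderBy('time') \
          .rowsBetween(Window.currentRow, Window.unboundedFollowing)

# Collect the current and all future rewards per row (in time order), then fold
# them while multiplying in one extra factor of gamma at each step.
df.withColumn('rewards', f.collect_list('reward').over(w)) \
  .withColumn(
      'discounted_cum',
      f.aggregate(
          'rewards',
          f.struct(f.lit(0.0).alias('acc'), f.lit(1.0).alias('g')),
          lambda acc, r: f.struct((acc['acc'] + acc['g'] * r).alias('acc'),
                                  (acc['g'] * gamma).alias('g')),
          lambda acc: acc['acc'])) \
  .drop('rewards') \
  .orderBy('user', 'time').show()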