groupby with when in pyspark - apache-spark-sql

I want to manipulate my transactional data frame based on some conditions. My actual data frame:
+---+------------+---+-------------------+
| id|install_date|age| txn_date|
+---+------------+---+-------------------+
| 1| 2019-10-01| 33|2019-09-20 15:27:22|
| 1| 2019-10-01| 33|2019-10-02 14:17:20|
| 1| 2019-10-01| 33|2019-10-07 15:17:12|
| 2| 2019-08-10| 45|2019-06-01 14:07:19|
| 2| 2019-08-10| 45|2019-05-01 15:27:22|
| 3| 2019-09-01| 37|2019-09-20 16:17:20|
| 3| 2019-09-01| 37|2019-10-10 15:27:22|
+---+------------+---+-------------------+
After some manipulation I reach this state:
+---+------------+---+----------+--------------------+------------------------------------+
| id|install_date|age| txn_date|app_install_duration|first_txn_duration_after_app_install|
+---+------------+---+----------+--------------------+------------------------------------+
| 1| 2019-10-01| 33|2019-09-20| 16| -11|
| 1| 2019-10-01| 33|2019-10-02| 16| 1|
| 1| 2019-10-01| 33|2019-10-07| 16| 6|
| 2| 2019-08-10| 45|2019-06-01| 68| -70|
| 2| 2019-08-10| 45|2019-05-01| 68| -101|
| 3| 2019-09-01| 37|2019-09-20| 46| 19|
| 3| 2019-09-01| 37|2019-10-10| 46| 39|
+---+------------+---+----------+--------------------+------------------------------------+
Now, I want my dataframe to look like this:
+---+------------+---+----------+--------------------+------------------------------------+---------+
| id|install_date|age|  txn_date|app_install_duration|first_txn_duration_after_app_install|is_active|
+---+------------+---+----------+--------------------+------------------------------------+---------+
|  1|  2019-10-01| 33|2019-10-02|                  16|                                   1|        1|
|  2|  2019-08-10| 45|2019-06-01|                  68|                                 -70|        0|
|  3|  2019-09-01| 37|2019-09-20|                  46|                                  19|        1|
+---+------------+---+----------+--------------------+------------------------------------+---------+
What I have done so far:
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date, unix_timestamp

df = spark.createDataFrame([(1,'2019-10-01',33,'2019-09-20 15:27:22'),
(1,'2019-10-01',33,'2019-10-02 14:17:20'),
(1,'2019-10-01',33,'2019-10-07 15:17:12'),
(2,'2019-08-10',45,'2019-06-01 14:07:19'),
(2,'2019-08-10',45,'2019-05-01 15:27:22'),
(3,'2019-09-01',37,'2019-09-20 16:17:20'),
(3,'2019-09-01',37,'2019-10-10 15:27:22')],
['id','install_date','age','txn_date'])
df = df.withColumn('install_date', to_date(unix_timestamp(F.col('install_date'), 'yyyy-MM-dd').cast("timestamp")))
df = df.withColumn('app_install_duration', F.datediff(F.current_date(), df.install_date))
df = df.withColumn('txn_date', to_date(unix_timestamp(F.col('txn_date'), 'yyyy-MM-dd HH:mm:ss').cast("timestamp")))
df = df.withColumn('first_txn_duration_after_app_install', F.datediff(df.txn_date, df.install_date))
df.show()
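From here, the remaining step is to collapse each id to a single row. A minimal sketch, assuming the intended rule is: per id, keep the transaction closest to install_date, preferring transactions on or after the install, and set is_active to 1 only when such a post-install transaction exists (the rn column is just a throwaway helper):
from pyspark.sql import functions as F, Window

# Post-install rows (duration >= 0) sort before pre-install ones,
# then ties are broken by distance from the install date.
w = Window.partitionBy("id").orderBy(
    F.when(F.col("first_txn_duration_after_app_install") >= 0, 0).otherwise(1),
    F.abs(F.col("first_txn_duration_after_app_install")),
)

result = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
    .withColumn(
        "is_active",
        F.when(F.col("first_txn_duration_after_app_install") >= 0, 1).otherwise(0),
    )
)
result.show()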

Related

assign max+1 to Sequence field if the Register Number set reappears

I have a dataframe like the one below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 2|
| 11| 11121000| 13| 2|
| 13| 11121000| 14| 2|
| 14| 11121000| 15| 2|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Each record set starts with recType equal to 01 and ends with recType equal to either 13 or 14.
In some cases there are duplicate registerNumber values, which results in a duplicate sequence being assigned to a record set.
In the given dataframe, the registerNumber value 11121000 is duplicated.
I want to assign a new sequence value to the duplicated registerNumber value 11121000, so the output dataframe should look as below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 6|
| 11| 11121000| 13| 6|
| 13| 11121000| 14| 6|
| 14| 11121000| 15| 6|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Please guide me on how to approach this problem.
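One possible approach, sketched under the assumption that mnId defines the row order, recType is stored as a string and sequence is numeric (untested beyond the sample above): derive a block id from the record-set boundaries, then give every block whose registerNumber already appeared in an earlier block a fresh sequence starting at max(sequence) + 1.
from pyspark.sql import functions as F, Window

# Each record set starts at recType == '01'; a running sum over mnId yields a block id.
# Note: an un-partitioned window pulls all rows onto a single partition.
w_order = Window.orderBy("mnId")
df = df.withColumn("block_id", F.sum((F.col("recType") == "01").cast("int")).over(w_order))

# The earliest block of each registerNumber keeps its original sequence.
df = df.withColumn("first_block", F.min("block_id").over(Window.partitionBy("registerNumber")))

# Number the repeated blocks after the current maximum sequence.
max_seq = df.agg(F.max("sequence")).first()[0]
repeats = (
    df.filter(F.col("block_id") != F.col("first_block"))
    .select("block_id").distinct()
    .withColumn("new_seq", F.lit(max_seq) + F.dense_rank().over(Window.orderBy("block_id")))
)

result = (
    df.join(repeats, "block_id", "left")
    .withColumn("sequence", F.coalesce("new_seq", "sequence"))
    .drop("block_id", "first_block", "new_seq")
)
result.orderBy("mnId").show(30)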

Add total per group as a new row in dataframe in Pyspark

Referring to my previous question here, I am trying to compute and add a total (of usage) row for each brand, parent and week_num.
Here is a dummy sample:
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)
This snippet adds a total row per channel:
from pyspark.sql import functions as f

# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel", f.lit("Total"))
)
# create a temp variable to sort
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))
# Union dataframes, sort, drop the temp variable and show
df1 = df0.unionByName(totals).sort(["week_num", "parent", "brand", "sort_id"]).drop("sort_id")
df1.show()
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 5| C| c25| c25app| 2658|
| 5| C| c25| Total| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c27| Total| 1100|
| 5| C| c28| c26app| 1200|
| 5| C| c28| Total| 1200|
+--------+------+-----+-------+-----+
That is fine for the channel column. To get something like the output below, I simply repeat the first process (groupby + sum) and then union the result back:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
Here it is in two steps:
# add brand total row
df2 = (
    df0.groupBy(["week_num", "parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand", f.lit("Total"))
    .withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])
# add weeknum total row
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent", f.lit("Total"))
    .withColumn("brand", f.lit(""))
    .withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| A|Total| | 7500|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 4| B|Total| |13200|
| 4| Total| | |20700|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c25| c25app| 2658|
| 5| C| c27| Total| 1100|
+--------+------+-----+-------+-----+
First question: is there an alternative or more efficient approach that avoids repeating the groupby + union for each level?
Second: what if I want the total always shown at the top of each group, regardless of the parent/brand/channel alphabetical order? How can I sort it like this (this is dummy data, but I hope it is clear enough):
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| A1| A2app| 5500|
| 4| A| AD| Total| 2000|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B25| B25app| 7600|
| 4| B| B26| Total| 5600|
| 4| B| B26| B26app| 5600|
I think you just need the rollup method.
from pyspark.sql import functions as F, Window

agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
    df0.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .orderBy(agg_cols)
)
agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
| null| null| null| null|31658| 15|
| 2| null| null| null| 6000| 7|
| 2| A| null| null| 6000| 3|
| 2| A| A2| null| 6000| 1|
| 2| A| A2| A2TV| 3500| 0|
| 2| A| A2| A2web| 2500| 0|
| 4| null| null| null|20700| 7|
| 4| A| null| null| 7500| 3|
| 4| A| A1| null| 5500| 1|
| 4| A| A1| A2app| 5500| 0|
| 4| A| AD| null| 2000| 1|
| 4| A| AD| ADapp| 2000| 0|
| 4| B| null| null|13200| 3|
| 4| B| B25| null| 7600| 1|
| 4| B| B25| B25app| 7600| 0|
| 4| B| B26| null| 5600| 1|
| 4| B| B26| B26app| 5600| 0|
| 5| null| null| null| 4958| 7|
| 5| C| null| null| 4958| 3|
| 5| C| c25| null| 2658| 1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
The rest is purely cosmetic. It is probably not a good idea to do this in Spark; it is better done in the reporting tool you use downstream.
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))
TOTAL = "Total"
agg_df = (
agg_df.withColumn(
"parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
)
.withColumn(
"brand",
F.when(F.col("lvl") == 3, TOTAL).otherwise(
F.coalesce(F.col("brand"), F.lit(""))
),
)
.withColumn(
"channel",
F.when(F.col("lvl") == 2, TOTAL).otherwise(
F.coalesce(F.col("channel"), F.lit(""))
),
)
)
agg_df.where(F.col("lvl") != 5).orderBy(
"week_num", F.col("lvl").desc(), "parent", "brand", "channel"
).drop("lvl").show(500)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| AD| Total| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B26| Total| 5600|
| 4| A| A1| A2app| 5500|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B26| B26app| 5600|
| 5| Total| | | 4958|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c27| Total| 1100|
| 5| C| c28| Total| 1200|
| 5| C| c25| c25app| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c28| c26app| 1200|
+--------+------+-----+-------+-----+

Pyspark dataframes group by

I have a dataframe like below:
+-----+-----+-----+
|  123|  124|  125|
+-----+-----+-----+
|    1|    2|    3|
|    9|    9|    4|
|    4|   12|    1|
|    2|    4|    8|
|    7|    6|    3|
|   19|   11|    2|
|   21|   10|   10|
+-----+-----+-----+
I need the data to be in this form:
1:[123,125]
2:[123,124,125]
3:[125]
The output order does not matter. I am new to dataframes in pyspark; any help would be appreciated.
There is no melt (unpivot) API in pyspark that will accomplish this directly. Instead, flatMap over the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value of the column and the column name:
cols = df.columns
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.groupby("value").agg(f.collect_list("column_name"))
.show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
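If duplicate column names per value are unwanted (the desired output lists 3 only as [125], while collect_list returns [125, 125] because 3 appears twice in that column), f.collect_set can be swapped in to deduplicate:
(df.rdd
 .flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
 .groupby("value").agg(f.collect_set("column_name").alias("columns"))
 .show())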

Pyspark: Add a new column containing the value from one column that corresponds to a value in another column meeting a specified condition

I want to add a new column containing the value from one column that corresponds to a value in another column that meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each of col1's groups, I need to repeat the value of col2 that corresponds to 1 in col3. If a group has more than one row with col3 = 1, repeat the minimum of those values.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3=df.filter(df.col3==1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2, I followed the accepted answer in this link: How to find exact median for grouped data in Spark
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a=df.join(df6,['col1'],'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?
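A shorter alternative that avoids the join (a sketch of the same logic with a window function): take, per col1 group, the minimum of col2 over only the rows where col3 equals 1. F.when without otherwise yields null for non-matching rows, and min ignores nulls.
from pyspark.sql import functions as F, Window

df = df.withColumn(
    "new_column",
    F.min(F.when(F.col("col3") == 1, F.col("col2"))).over(Window.partitionBy("col1")),
)
df.show()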

How to flatten a pyspark dataframe? (spark 1.6)

I'm working with Spark 1.6
Here are my data:
from pyspark.sql import Row

eDF = sqlsc.createDataFrame([Row(v=1, eng_1=10, eng_2=20),
                             Row(v=2, eng_1=15, eng_2=30),
                             Row(v=3, eng_1=8, eng_2=12)])
eDF.select('v','eng_1','eng_2').show()
+---+-----+-----+
| v|eng_1|eng_2|
+---+-----+-----+
| 1| 10| 20|
| 2| 15| 30|
| 3| 8| 12|
+---+-----+-----+
I would like to 'flatten' this table.
That is to say :
+---+-----+---+
| v| key|val|
+---+-----+---+
| 1|eng_1| 10|
| 1|eng_2| 20|
| 2|eng_1| 15|
| 2|eng_2| 30|
| 3|eng_1| 8|
| 3|eng_2| 12|
+---+-----+---+
Note that since I'm working with Spark 1.6, I can't use pyspark.sql.functions.create_map or pyspark.sql.functions.posexplode.
Use rdd.flatMap to flatten it:
# Spark 1.6 has no SparkSession, so build the new dataframe from the SQLContext.
df = sqlsc.createDataFrame(
    eDF.rdd.flatMap(
        # one output Row per (v, key, val) triple, for each engine column of each input row
        lambda r: [Row(v=r.v, key=col, val=r[col]) for col in ['eng_1', 'eng_2']]
    )
)
df.show()
+-----+---+---+
| key| v|val|
+-----+---+---+
|eng_1| 1| 10|
|eng_2| 1| 20|
|eng_1| 2| 15|
|eng_2| 2| 30|
|eng_1| 3| 8|
|eng_2| 3| 12|
+-----+---+---+
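A sketch of an alternative that stays in the DataFrame API; array, struct, explode and lit all exist in Spark 1.6, although I have not verified this exact snippet on that version:
from pyspark.sql import functions as F

# Build one (key, val) struct per engine column, explode the array of structs,
# then flatten the struct fields back out.
kv = F.explode(F.array(*[
    F.struct(F.lit(c).alias("key"), F.col(c).alias("val"))
    for c in ["eng_1", "eng_2"]
])).alias("kv")

eDF.select("v", kv).select("v", "kv.key", "kv.val").show()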