Pyspark dataframes group by - dataframe

I have a dataframe like the one below:
+-----+-----+-----+
|123 |124 |125 |
+-----+-----+-----+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+-----+-----+-----+
I need the data in this form:
1:[123,125]
2:[123,124,125]
3:[125]
The output does not need to be sorted. I am new to dataframes in PySpark; any help would be appreciated.

No single melt or pivot call in PySpark produces this shape directly (Spark 3.4 and later do add DataFrame.unpivot / melt, which could cover the flattening step). Instead, flatMap over the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the cell value and the name of the column it came from:
cols = df.columns
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.groupby("value").agg(f.collect_list("column_name"))
.show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
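Note that collect_list keeps duplicates, which is why value 3 shows [125, 125] above while the question asks for 3:[125]. Swapping in f.collect_set (also in pyspark.sql.functions) should remove them. To check the logic without a Spark session, here is a plain-Python model of the same two steps: flatten each row into (value, column_name) pairs, then group by value. It only illustrates the logic and is not Spark code; the names pairs and grouped are made up for this sketch.

```python
from collections import defaultdict

# Column names and rows copied from the question.
cols = ["123", "124", "125"]
rows = [
    (1, 2, 3),
    (9, 9, 4),
    (4, 12, 1),
    (2, 4, 8),
    (7, 6, 3),
    (19, 11, 2),
    (21, 10, 10),
]

# Step 1 (the flatMap): one (value, column_name) pair per cell.
pairs = [(value, name) for row in rows for value, name in zip(row, cols)]

# Step 2 (groupby + collect_set): a set per value drops repeated column names.
grouped = defaultdict(set)
for value, name in pairs:
    grouped[value].add(name)

for value in (1, 2, 3):
    print(value, sorted(grouped[value]))
```

Using a set here plays the role of collect_set; replacing it with a list would reproduce the duplicate seen with collect_list.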

Related

Average time between actions by user (SQL and pandas)

Let's say I have a table like this, with user_id and the time difference between actions already calculated.
|user_id |sec_between_actions|
| 329| 1|
| 329| 211|
| 329| 911|
| 329| 11|
| 329| 9|
| 12| 2|
| 12| 3|
| 12| 8|
| 12| 7|
| 12| 7|
| 1| 1|
| 1| 1|
| 111| 3|
| 111| 11|
| 18| 4|
| 29| 5|
| 29| 1|
(Imagine many more records and many more users.)
My desired output would be something like this (using SQL):
|user_id |avg_time_between_actions|
| 329| 228,6|
| 12| 5,4|
| 1| 1|
| 111| 7|
| 18| 4|
| 29| 3|
To do this in SQL, use GROUP BY to group the rows that share a user_id, then apply the AVG aggregate function to get the average of each group.
SQL code:
SELECT user_id, AVG(sec_between_actions) AS avg_time_between_actions
FROM table_name
GROUP BY user_id;
I am not sure why your output uses a comma instead of a period as the decimal separator. You can get that by reformatting the output, but it does not seem logical.
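The GROUP BY / AVG approach can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database stands in for the real one; the table name actions is made up for the example, and the rows are the ones from the question.

```python
import sqlite3

# Sample rows from the question: (user_id, sec_between_actions).
rows = [
    (329, 1), (329, 211), (329, 911), (329, 11), (329, 9),
    (12, 2), (12, 3), (12, 8), (12, 7), (12, 7),
    (1, 1), (1, 1),
    (111, 3), (111, 11),
    (18, 4),
    (29, 5), (29, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions (user_id INTEGER, sec_between_actions INTEGER)"
)
conn.executemany("INSERT INTO actions VALUES (?, ?)", rows)

# Same shape as the query in the answer, without a trailing comma before FROM.
query = """
    SELECT user_id, AVG(sec_between_actions) AS avg_time_between_actions
    FROM actions
    GROUP BY user_id
"""
averages = dict(conn.execute(query).fetchall())
conn.close()

for user_id, avg in averages.items():
    print(user_id, avg)
```

The averages come back as floats with a period as the decimal separator; switching to a comma would be a display-formatting step done afterwards.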

Add total per group as a new row in dataframe in Pyspark

Following up on my previous question, I am trying to compute and add a total row (total of usage) for each brand, parent and week_num.
Here is a dummy sample:
df0 = spark.createDataFrame(
[
(2, "A", "A2", "A2web", 2500),
(2, "A", "A2", "A2TV", 3500),
(4, "A", "A1", "A2app", 5500),
(4, "A", "AD", "ADapp", 2000),
(4, "B", "B25", "B25app", 7600),
(4, "B", "B26", "B26app", 5600),
(5, "C", "c25", "c25app", 2658),
(5, "C", "c27", "c27app", 1100),
(5, "C", "c28", "c26app", 1200),
],
["week_num", "parent", "brand", "channel", "usage"],
)
This snippet adds a total row over the channels of each brand:
from pyspark.sql import functions as f
# Group by and sum to get the totals
totals = (
df0.groupBy(["week_num", "parent", "brand"])
.agg(f.sum("usage").alias("usage"))
.withColumn("channel", f.lit("Total"))
)
# create a temp variable to sort
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))
# Union dataframes, sort, drop the temp variable and show
df1 = (
df0.unionByName(totals)
.sort(["week_num", "parent", "brand", "sort_id"])
.drop("sort_id")
)
df1.show()
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 5| C| c25| c25app| 2658|
| 5| C| c25| Total| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c27| Total| 1100|
| 5| C| c28| c26app| 1200|
| 5| C| c28| Total| 1200|
+--------+------+-----+-------+-----+
That works for the channel column. To get something like the output below, I simply repeat the first process (groupBy + sum) and union the result back:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
Here it is in two steps:
# add brand total row
df2 = (
df0.groupBy(["week_num", "parent"])
.agg(f.sum("usage").alias("usage"))
.withColumn("brand", f.lit("Total"))
.withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])
# add weeknum total row
df3 = (
df0.groupBy(["week_num"])
.agg(f.sum("usage").alias("usage"))
.withColumn("parent", f.lit("Total"))
.withColumn("brand", f.lit(""))
.withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| A|Total| | 7500|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 4| B|Total| |13200|
| 4| Total| | |20700|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c25| c25app| 2658|
| 5| C| c27| Total| 1100|
+--------+------+-----+-------+-----+
First question: is there an alternative or more efficient approach that avoids the repetition?
Second: what if I want the total always shown at the top of each group, regardless of the alphabetical order of the parent/brand/channel names? How can I sort for that? Like this (it is dummy data, but I hope it is clear enough):
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| A1| A2app| 5500|
| 4| A| AD| Total| 2000|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B25| B25app| 7600|
| 4| B| B26| Total| 5600|
| 4| B| B26| B26app| 5600|
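I think you just need the rollup method.
from pyspark.sql import functions as F
# df is the input dataframe (df0 in the question)
agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
df.rollup(agg_cols)
.agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
.orderBy(agg_cols)
)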
agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
| null| null| null| null|31658| 15|
| 2| null| null| null| 6000| 7|
| 2| A| null| null| 6000| 3|
| 2| A| A2| null| 6000| 1|
| 2| A| A2| A2TV| 3500| 0|
| 2| A| A2| A2web| 2500| 0|
| 4| null| null| null|20700| 7|
| 4| A| null| null| 7500| 3|
| 4| A| A1| null| 5500| 1|
| 4| A| A1| A2app| 5500| 0|
| 4| A| AD| null| 2000| 1|
| 4| A| AD| ADapp| 2000| 0|
| 4| B| null| null|13200| 3|
| 4| B| B25| null| 7600| 1|
| 4| B| B25| B25app| 7600| 0|
| 4| B| B26| null| 5600| 1|
| 4| B| B26| B26app| 5600| 0|
| 5| null| null| null| 4958| 7|
| 5| C| null| null| 4958| 3|
| 5| C| c25| null| 2658| 1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
The rest is purely cosmetic. It is probably not a good idea to do it in Spark; better to do it in the reporting tool you will use afterwards.
from pyspark.sql import Window
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))
TOTAL = "Total"
agg_df = (
agg_df.withColumn(
"parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
)
.withColumn(
"brand",
F.when(F.col("lvl") == 3, TOTAL).otherwise(
F.coalesce(F.col("brand"), F.lit(""))
),
)
.withColumn(
"channel",
F.when(F.col("lvl") == 2, TOTAL).otherwise(
F.coalesce(F.col("channel"), F.lit(""))
),
)
)
agg_df.where(F.col("lvl") != 5).orderBy(
"week_num", F.col("lvl").desc(), "parent", "brand", "channel"
).drop("lvl").show(500)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| AD| Total| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B26| Total| 5600|
| 4| A| A1| A2app| 5500|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B26| B26app| 5600|
| 5| Total| | | 4958|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c27| Total| 1100|
| 5| C| c28| Total| 1200|
| 5| C| c25| c25app| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c28| c26app| 1200|
+--------+------+-----+-------+-----+
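The lvl values in the rollup output above (0, 1, 3, 7, 15) come from F.grouping_id(), which the Spark documentation describes as (grouping(c1) << (n-1)) + ... + grouping(cn): one bit per grouping column, set when that column was aggregated away. The dense_rank step then renumbers those values 1 to 5, which is why the answer compares lvl to 2, 3 and 4 and filters out 5 (the grand total). The plain-Python sketch below models that arithmetic; it is not Spark code, and it assumes the data itself contains no nulls, so None can stand for "rolled up".

```python
AGG_COLS = ["week_num", "parent", "brand", "channel"]


def grouping_id(row):
    """Bit vector over AGG_COLS; a bit is 1 when the column is rolled up."""
    level = 0
    for col in AGG_COLS:
        # Shift left, then set the lowest bit if this column was aggregated away.
        level = (level << 1) | (1 if row[col] is None else 0)
    return level


# One row per rollup level, taken from the week_num == 2 part of the output.
rollup_rows = [
    {"week_num": None, "parent": None, "brand": None, "channel": None},
    {"week_num": 2, "parent": None, "brand": None, "channel": None},
    {"week_num": 2, "parent": "A", "brand": None, "channel": None},
    {"week_num": 2, "parent": "A", "brand": "A2", "channel": None},
    {"week_num": 2, "parent": "A", "brand": "A2", "channel": "A2TV"},
]

levels = [grouping_id(r) for r in rollup_rows]

# dense_rank over the distinct levels in ascending order, starting at 1.
dense_rank = {lvl: rank for rank, lvl in enumerate(sorted(set(levels)), start=1)}

print(levels)
print(dense_rank)
```

With this mapping, rank 4 is the per-week total, rank 3 the per-parent total and rank 2 the per-brand total, matching the three F.when conditions in the answer.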

groupby with when in pyspark

I want to manipulate my transactional data frame based on some conditions. My actual data frame:
+---+------------+---+-------------------+
| id|install_date|age| txn_date|
+---+------------+---+-------------------+
| 1| 2019-10-01| 33|2019-09-20 15:27:22|
| 1| 2019-10-01| 33|2019-10-02 14:17:20|
| 1| 2019-10-01| 33|2019-10-07 15:17:12|
| 2| 2019-08-10| 45|2019-06-01 14:07:19|
| 2| 2019-08-10| 45|2019-05-01 15:27:22|
| 3| 2019-09-01| 37|2019-09-20 16:17:20|
| 3| 2019-09-01| 37|2019-10-10 15:27:22|
After some manipulation, I reach this state:
+---+------------+---+----------+--------------------+------------------------------------+
| id|install_date|age| txn_date|app_install_duration|first_txn_duration_after_app_install|
+---+------------+---+----------+--------------------+------------------------------------+
| 1| 2019-10-01| 33|2019-09-20| 16| -11|
| 1| 2019-10-01| 33|2019-10-02| 16| 1|
| 1| 2019-10-01| 33|2019-10-07| 16| 6|
| 2| 2019-08-10| 45|2019-06-01| 68| -70|
| 2| 2019-08-10| 45|2019-05-01| 68| -101|
| 3| 2019-09-01| 37|2019-09-20| 46| 19|
| 3| 2019-09-01| 37|2019-10-10| 46| 39|
+---+------------+---+----------+--------------------+------------------------------------+
Now, I want my dataframe to look like this:
+---+------------+---+----------+--------------------+------------------------------------+---------+
| id|install_date|age| txn_date|app_install_duration|first_txn_duration_after_app_install|is_active|
+---+------------+---+----------+--------------------+------------------------------------+---------+
| 1| 2019-10-01| 33|2019-10-02| 16| 1| 1|
| 2| 2019-08-10| 45|2019-06-01| 68| -70| 0|
| 3| 2019-09-01| 37|2019-09-20| 46| 19| 1|
+---+------------+---+----------+--------------------+------------------------------------+---------+
What I have done so far:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,'2019-10-01',33,'2019-09-20 15:27:22'),
(1,'2019-10-01',33,'2019-10-02 14:17:20'),
(1,'2019-10-01',33,'2019-10-07 15:17:12'),
(2,'2019-08-10',45,'2019-06-01 14:07:19'),
(2,'2019-08-10',45,'2019-05-01 15:27:22'),
(3,'2019-09-01',37,'2019-09-20 16:17:20'),
(3,'2019-09-01',37,'2019-10-10 15:27:22')],
['id','install_date','age','txn_date'])
# parse the string columns into dates
df = df.withColumn('install_date', F.to_date(F.unix_timestamp(F.col('install_date'), 'yyyy-MM-dd').cast("timestamp")))
# days since install, relative to today
df = df.withColumn('app_install_duration', F.datediff(F.current_date(), df.install_date))
df = df.withColumn('txn_date', F.to_date(F.unix_timestamp(F.col('txn_date'), 'yyyy-MM-dd HH:mm:ss').cast("timestamp")))
# days from install to each transaction (negative when the transaction came first)
df = df.withColumn('first_txn_duration_after_app_install', F.datediff(df.txn_date, df.install_date))
df.show()
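The question does not state the selection rule in words, so the following is only one reading that happens to reproduce all three rows of the desired output: per id, keep the earliest transaction on or after install_date and set is_active to 1; if there is none, keep the transaction closest to the install date and set is_active to 0. The plain-Python sketch below checks that reading against the sample data. It is not PySpark code, and the names by_id and summary are made up; in PySpark the same rule would presumably be expressed with F.when inside a grouped or windowed aggregation.

```python
from collections import defaultdict
from datetime import date

# (id, install_date, txn_date) taken from the question, times dropped.
rows = [
    (1, date(2019, 10, 1), date(2019, 9, 20)),
    (1, date(2019, 10, 1), date(2019, 10, 2)),
    (1, date(2019, 10, 1), date(2019, 10, 7)),
    (2, date(2019, 8, 10), date(2019, 6, 1)),
    (2, date(2019, 8, 10), date(2019, 5, 1)),
    (3, date(2019, 9, 1), date(2019, 9, 20)),
    (3, date(2019, 9, 1), date(2019, 10, 10)),
]

# Days from install to each transaction, grouped by id.
by_id = defaultdict(list)
for user_id, install_date, txn_date in rows:
    by_id[user_id].append(((txn_date - install_date).days, txn_date))

summary = {}
for user_id, candidates in by_id.items():
    after_install = [c for c in candidates if c[0] >= 0]
    if after_install:
        # Assumed rule: earliest transaction on or after the install date.
        duration, txn_date = min(after_install)
        is_active = 1
    else:
        # Assumed fallback: the pre-install transaction closest to install.
        duration, txn_date = max(candidates)
        is_active = 0
    summary[user_id] = (txn_date, duration, is_active)

for user_id, picked in sorted(summary.items()):
    print(user_id, picked)
```

If the intended rule differs (for example, "first row as listed" when nothing follows the install), only the fallback branch would need to change.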

Pyspark: Add a new column containing the value of one column that corresponds to a value in another column meeting a specified condition

I want to add a new column that repeats, within each group, the value of one column taken from the row where another column meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each col1 group, I need to repeat the col2 value from the row where col3 is 1. If a group has more than one row with col3 = 1, the minimum of those col2 values should be repeated.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3=df.filter(df.col3==1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2, I followed the accepted answer to the question "How to find exact median for grouped data in Spark":
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a=df.join(df6,['col1'],'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?
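One option that avoids the filter, temp view and join would be a conditional minimum over a window, something along the lines of F.min(F.when(F.col("col3") == 1, F.col("col2"))).over(Window.partitionBy("col1")); those functions exist in PySpark, but the expression is an untested suggestion here. The plain-Python sketch below models the same rule on the sample data so the expected result can be checked; it is not Spark code, and group_min and result are names made up for the illustration.

```python
# (col1, col2, col3) rows from the question.
rows = [
    ("A", 17, 1), ("A", 16, 2), ("A", 18, 2), ("A", 30, 3),
    ("B", 35, 1), ("B", 34, 2), ("B", 36, 2),
    ("C", 20, 1), ("C", 30, 1), ("C", 43, 1),
]

# Conditional minimum per group: only rows with col3 == 1 take part.
group_min = {}
for col1, col2, col3 in rows:
    if col3 == 1:
        group_min[col1] = min(col2, group_min.get(col1, col2))

# Attach the group's value to every row; groups without a col3 == 1 row get None.
result = [(col1, col2, col3, group_min.get(col1)) for col1, col2, col3 in rows]

for row in result:
    print(row)
```

A group with no col3 = 1 row ends up with None here, which mirrors the null a left outer join would give in the original approach.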

How to flatten a pyspark dataframe? (spark 1.6)

I'm working with Spark 1.6.
Here is my data:
eDF = sqlsc.createDataFrame([Row(v=1, eng_1=10,eng_2=20),
Row(v=2, eng_1=15,eng_2=30),
Row(v=3, eng_1=8,eng_2=12)])
eDF.select('v','eng_1','eng_2').show()
+---+-----+-----+
| v|eng_1|eng_2|
+---+-----+-----+
| 1| 10| 20|
| 2| 15| 30|
| 3| 8| 12|
+---+-----+-----+
I would like to 'flatten' this table, that is to say:
+---+-----+---+
| v| key|val|
+---+-----+---+
| 1|eng_1| 10|
| 1|eng_2| 20|
| 2|eng_1| 15|
| 2|eng_2| 30|
| 3|eng_1| 8|
| 3|eng_2| 12|
+---+-----+---+
Note that since I'm working with Spark 1.6, I can't use pyspark.sql.functions.create_map or pyspark.sql.functions.posexplode.
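Use rdd.flatMap to flatten it:
from pyspark.sql import Row
# on Spark 1.6 there is no SparkSession; use the SQLContext (sqlsc in the question) in place of spark
df = spark.createDataFrame(
eDF.rdd.flatMap(
# emit one (v, key, val) row per eng_* column of each input row
lambda r: [Row(v=r.v, key=col, val=r[col]) for col in ['eng_1', 'eng_2']]
)
)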
df.show()
+-----+---+---+
| key| v|val|
+-----+---+---+
|eng_1| 1| 10|
|eng_2| 1| 20|
|eng_1| 2| 15|
|eng_2| 2| 30|
|eng_1| 3| 8|
|eng_2| 3| 12|
+-----+---+---+
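The columns come out as key, v, val rather than v, key, val because, on Spark versions before 3.0, a Row built from keyword arguments sorts its fields by name; a trailing .select('v', 'key', 'val') should restore the order asked for. For readers who want to see what flatMap does to the data, here is a plain-Python model: map every input row to a list of output rows, then concatenate the lists. It is an illustration only, not Spark code, and explode_row is a made-up helper name.

```python
from itertools import chain

# The three input rows from the question.
rows = [
    {"v": 1, "eng_1": 10, "eng_2": 20},
    {"v": 2, "eng_1": 15, "eng_2": 30},
    {"v": 3, "eng_1": 8, "eng_2": 12},
]


def explode_row(r):
    """Return one (v, key, val) tuple per eng_* column of a single row."""
    return [(r["v"], col, r[col]) for col in ("eng_1", "eng_2")]


# flatMap = map each row to a list, then flatten the lists into one sequence.
flat = list(chain.from_iterable(explode_row(r) for r in rows))

for item in flat:
    print(item)
```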