I have a dataframe like this:
df =
--------------
|col1 | col2 |
--------------
| A | 1 |
| A | 5 |
| B | 0 |
| A | 2 |
| B | 6 |
| B | 8 |
--------------
I want to partition by col1, find the median of col2 in each partition, and append the result to form a new column. The result should look like this:
result =
---------------------
|col1 | col2 | col3 |
---------------------
| A | 1 | 2 |
| A | 5 | 2 |
| B | 0 | 6 |
| A | 2 | 2 |
| B | 6 | 6 |
| B | 8 | 6 |
---------------------
For now, I'm using this code:
val df2 = df
.withColumn("tmp", percent_rank().over(Window.partitionBy('col1).orderBy('col2)))
.where("tmp <= 0.5")
.groupBy("col1").agg(max('col2) as "col3")
val result = df.join(df2, df("col1") === df2("col1")).drop(df2("col1"))
But this takes too much time and memory when the dataframe is big. Is there a more efficient way to do this?
Any help is much appreciated!
With the data you have, you can compute the median of each group with percentile_approx inside a groupBy, then join the result back.
// Creating the `df` dataset
val df = Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)).toDF("col1", "col2")
df.createOrReplaceTempView("df")
Use percentile_approx with groupBy to perform median calculation:
val df2 = spark.sql("select col1, percentile_approx(col2, 0.5) as median from df group by col1 order by col1")
df2.show()
with the output of df2 being:
+----+------+
|col1|median|
+----+------+
| A| 2.0|
| B| 6.0|
+----+------+
And now running the join to recreate the final result:
val result = df.join(df2, df("col1") === df2("col1"))
result.show()
// output
+----+----+----+------+
|col1|col2|col1|median|
+----+----+----+------+
| A| 1| A| 2.0|
| A| 5| A| 2.0|
| B| 0| B| 6.0|
| A| 2| A| 2.0|
| B| 6| B| 6.0|
| B| 8| B| 6.0|
+----+----+----+------+
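To make the two steps (aggregate per group, then join back) easy to check on small data, here is a minimal plain-Python sketch of the same logic. It is not Spark code. Using median_low is my assumption to mimic a median that returns an actual element of the group; percentile_approx is approximate and may pick a different element on large groups.

```python
from collections import defaultdict
from statistics import median_low

rows = [("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)]

# Step 1: aggregate, one median per col1 value (the groupBy step)
groups = defaultdict(list)
for col1, col2 in rows:
    groups[col1].append(col2)
medians = {key: median_low(values) for key, values in groups.items()}

# Step 2: attach the group's median to every original row (the join step)
result = [(col1, col2, medians[col1]) for col1, col2 in rows]

for row in result:
    print(row)
```

The aggregated table has one row per key, so the join side stays small even when the input is large.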
Related
I have a dataset like this one:
+----------+------------+
|id |event |
+----------+------------+
| 1 |A |
| 2 |B |
| 3 |C |
| 4 |C |
| 5 |A |
| 6 |D |
| 7 |B |
+----------+------------+
And I would like either to modify id or add another column where all the equal values in column "event" have the same id. And I would like the rows to remain in the same order as they are now.
This is how I would like the data to look at the end (the value of "id" doesn't matter as long as it's unique for each event)
+----------+------------+
|id |event |
+----------+------------+
| 1 |A |
| 2 |B |
| 3 |C |
| 3 |C |
| 1 |A |
| 4 |D |
| 2 |B |
+----------+------------+
UPDATE
Add monotonically_increasing_id() before setting the id, so that you can sort the data back into its original input order afterwards:
The generated ID is guaranteed to be monotonically increasing and
unique, but not consecutive. The current implementation puts the
partition ID in the upper 31 bits, and the record number within each
partition in the lower 33 bits. The assumption is that the data frame
has less than 1 billion partitions, and each partition has less than 8
billion records.
output_df = (input_df
.withColumn('order', f.monotonically_increasing_id())
.withColumn('id', f.first('id').over(Window.partitionBy('event'))))
output_df.sort('order').show()
+---+-----+-----------+
| id|event| order|
+---+-----+-----------+
| 1| A| 8589934592|
| 2| B|17179869184|
| 3| C|25769803776|
| 3| C|34359738368|
| 1| A|42949672960|
| 6| D|51539607552|
| 2| B|60129542144|
+---+-----+-----------+
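The order values look odd because of the bit layout quoted above. As a small plain-Python sketch (the helper name split_id is mine, not part of Spark), the values from the table can be split back into partition ID and record number:

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition

def split_id(generated_id):
    """Split a generated ID into (partition_id, record_number)."""
    partition_id = generated_id >> RECORD_BITS
    record_number = generated_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number

# The 'order' values from the output above
order_values = [8589934592, 17179869184, 25769803776, 34359738368,
                42949672960, 51539607552, 60129542144]
decoded = [split_id(v) for v in order_values]
print(decoded)
```

In this run every row decodes to record 0 of partitions 1 through 7, which is why the values are multiples of 2**33 rather than 0, 1, 2 and so on.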
OLD
To "preserve" the dataframe order, create another column and keep id intact to sort whenever you want:
from pyspark.sql import Window
import pyspark.sql.functions as f
input_df = spark.createDataFrame([
[1, 'A'],
[2, 'B'],
[3, 'C'],
[4, 'C'],
[5, 'A'],
[6, 'D'],
[7, 'B']
], ['id', 'event'])
output_df = input_df.withColumn('group_id', f.first('id').over(Window.partitionBy('event')))
output_df.sort('id').show()
+---+-----+--------+
| id|event|group_id|
+---+-----+--------+
| 1| A| 1|
| 2| B| 2|
| 3| C| 3|
| 4| C| 3|
| 5| A| 1|
| 6| D| 6|
| 7| B| 2|
+---+-----+--------+
Just group your original dataframe by column event and aggregate with max(col('id')), and you will get a new dataframe like:
+----------+------------+
|event |maxid |
+----------+------------+
| A |5 |
| B |7 |
| C |4 |
| D |6 |
+----------+------------+
The next step is to join this new dataframe with your original dataframe (on column event); the column maxid is what you want.
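Both answers give equal events a shared id, but not the consecutive 1, 2, 3, 4 numbering shown in the question (which the asker says does not matter). If that numbering is wanted, the logic is "number each event by first appearance". A minimal plain-Python sketch of that logic, not Spark code:

```python
events = ["A", "B", "C", "C", "A", "D", "B"]

ids = {}      # event -> id, assigned in order of first appearance
result = []
for event in events:
    if event not in ids:
        ids[event] = len(ids) + 1
    result.append((ids[event], event))

for row in result:
    print(row)
```

In Spark, a comparable numbering could presumably be obtained by applying dense_rank over the group_id column from the earlier answer; that is an assumption I have not run here.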
Given a dataset with 2 columns:
| col1 | col2 |
| 1 | 2 |
| 2 | 2 |
| 1 | 2 |
| 1 | 2 |
I would like to add a column with the sum of col1 and col2
| col1 | col2 | col3 |
| 1 | 2 | 3 |
| 2 | 2 | 4 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
I have found this question which basically seems to do exactly the same but in Scala.
Any tip?
Assuming your data is in df, you can get the desired output in either of the following ways.
Using DataFrame operations
df.select("col1", "col2", (df.col1 + df.col2).alias("col3")).show()
Using Spark SQL
df.createOrReplaceTempView("temp_data")
spark.sql("select *, (col1 + col2) as col3 from temp_data").show()
Output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 2| 2| 4|
| 1| 2| 3|
| 1| 2| 3|
+----+----+----+
In Scala, you can create the new column in df with withColumn:
val df1 = df.withColumn("new col", col("col1") + col("col2"))
df1.show
I have a pyspark dataframe with two id columns id and id2. Each id is repeated exactly n times. All id's have the same set of id2's. I'm trying to "flatten" the matrix resulting from each unique id into one row according to id2.
Here's an example to explain what I'm trying to achieve, my dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each unique id and for each column col, I will have n new columns col_1,... for each of the n id2 values.
Any help would be appreciated!
In Spark 2.4 you can do it this way:
var df3 =Seq((1,1,54 , 2 ),(1,2,0 , 6 ),(1,3,578, 14),(2,1,10 , 1 ),(2,2,6 , 32),(2,3,0 , 0 ),(3,1,12 , 2 ),(3,2,20 , 5 ),(3,3,63 , 22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Pivot on id2 and take the first value of each column for every id (coalesce with a single argument just returns that argument):
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Renaming columns
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions about this.
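To check the pivot on small data, and to get the column order the question asks for (all value1 columns first, then all value2 columns), here is a minimal plain-Python sketch of the same flattening. It is not Spark code; the variable names are mine.

```python
rows = [(1, 1, 54, 2), (1, 2, 0, 6), (1, 3, 578, 14),
        (2, 1, 10, 1), (2, 2, 6, 32), (2, 3, 0, 0),
        (3, 1, 12, 2), (3, 2, 20, 5), (3, 3, 63, 22)]

value_names = ["value1", "value2"]
id2_values = sorted({id2 for _, id2, _, _ in rows})

# Column order from the question: value1_1..value1_n, then value2_1..value2_n
columns = ["id"] + [f"{name}_{k}" for name in value_names for k in id2_values]

# One record per id, one entry per (value column, id2) pair
flat = {}
for id_, id2, value1, value2 in rows:
    record = flat.setdefault(id_, {"id": id_})
    record[f"value1_{id2}"] = value1
    record[f"value2_{id2}"] = value2

table = [[flat[id_][column] for column in columns] for id_ in sorted(flat)]
print(columns)
for line in table:
    print(line)
```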
Let's say I've got a dataset like this:
| item | event | timestamp | user |
|:-----------|------------:|:---------:|:---------:|
| titanic | view | 1 | 1 |
| titanic | add to bag | 2 | 1 |
| titanic | close | 3 | 1 |
| avatar | view | 6 | 1 |
| avatar | close | 10 | 1 |
| titanic | view | 20 | 1 |
| titanic | purchase | 30 | 1 |
and so on. I need to calculate a sessionId for each user, where a session is a run of consecutive events on the same item.
So for this data the output should be the following:
| item | event | timestamp | user | sessionId |
|:-----------|------------:|:---------:|:---------:|:--------------:|
| titanic | view | 1 | 1 | session1 |
| titanic | add to bag | 2 | 1 | session1 |
| titanic | close | 3 | 1 | session1 |
| avatar | view | 6 | 1 | session2 |
| avatar | close | 10 | 1 | session2 |
| titanic | view | 20 | 1 | session3 |
| titanic | purchase | 30 | 1 | session3 |
I tried an approach similar to the one described in "Spark: How to create a sessionId based on userId and timestamp", with this window:
Window.partitionBy("user", "item").orderBy("timestamp")
But that doesn't work, because the same user-item combination can occur in different sessions (see session1 and session3), and with that window they end up in the same session.
How else can I implement this?
Here's one approach: first generate a column that holds the timestamp only on rows that start a session (null otherwise), then use last(ts, ignoreNulls) with rowsBetween to carry the last non-null timestamp forward, and finally build sessionId with dense_rank:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("titanic", "view", 1, 1),
("titanic", "add to bag", 2, 1),
("titanic", "close", 3, 1),
("avatar", "view", 6, 1),
("avatar", "close", 10, 1),
("titanic", "view", 20, 1),
("titanic", "purchase", 30, 1)
).toDF("item", "event", "timestamp", "user")
val win1 = Window.partitionBy($"user").orderBy($"timestamp")
val win2 = Window.partitionBy($"user").orderBy($"sessTS")
df.
withColumn( "firstTS",
when( row_number.over(win1) === 1 || $"item" =!= lag($"item", 1).over(win1),
$"timestamp" )
).
withColumn( "sessTS",
last($"firstTS", ignoreNulls = true).
over(win1.rowsBetween(Window.unboundedPreceding, 0))
).
withColumn("sessionId", concat(lit("session"), dense_rank.over(win2))).
show
// +-------+----------+---------+----+-------+------+---------+
// | item| event|timestamp|user|firstTS|sessTS|sessionId|
// +-------+----------+---------+----+-------+------+---------+
// |titanic| view| 1| 1| 1| 1| session1|
// |titanic|add to bag| 2| 1| null| 1| session1|
// |titanic| close| 3| 1| null| 1| session1|
// | avatar| view| 6| 1| 6| 6| session2|
// | avatar| close| 10| 1| null| 6| session2|
// |titanic| view| 20| 1| 20| 20| session3|
// |titanic| purchase| 30| 1| null| 20| session3|
// +-------+----------+---------+----+-------+------+---------+
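The rule this implements is "a new session starts whenever the item differs from the previous row of the same user". A minimal plain-Python sketch of that rule, useful as a reference when testing the Spark version on small data (the function name add_session_ids is mine):

```python
from itertools import groupby

events = [
    ("titanic", "view", 1, 1),
    ("titanic", "add to bag", 2, 1),
    ("titanic", "close", 3, 1),
    ("avatar", "view", 6, 1),
    ("avatar", "close", 10, 1),
    ("titanic", "view", 20, 1),
    ("titanic", "purchase", 30, 1),
]

def add_session_ids(events):
    """Append a per-user sessionId; a session is a run of consecutive rows with the same item."""
    out = []
    # Order by user, then timestamp (tuple layout: item, event, timestamp, user)
    ordered = sorted(events, key=lambda e: (e[3], e[2]))
    for _, user_events in groupby(ordered, key=lambda e: e[3]):
        # groupby only merges adjacent equal keys, so a repeated item starts a new run
        runs = groupby(user_events, key=lambda e: e[0])
        for number, (_, run) in enumerate(runs, start=1):
            for event in run:
                out.append(event + (f"session{number}",))
    return out

for row in add_session_ids(events):
    print(row)
```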
You seem to need to count the number of "view" records cumulatively. If so:
select t.*,
sum(case when event = 'view' then 1 else 0 end) over (partition by user order by timestamp) as session
from t;
I have a pyspark dataframe with 3 columns:
ID, each appearing multiple times;
DATE;
DELAY, 0 if this bill was paid on time, 1 otherwise.
It's already ordered by ID and DATE.
I need to create a column named CONSECUTIVE that shows, for each ID, how many bills in a row have been paid with DELAY=1 up to and including the current one.
Example of data, and expected result:
ID | DATE | DELAY | CONSECUTIVE
101 | 1 | 1 | 1
101 | 2 | 1 | 2
101 | 3 | 1 | 3
101 | 4 | 0 | 0
101 | 5 | 1 | 1
101 | 6 | 1 | 2
213 | 1 | 1 | 1
213 | 2 | 1 | 2
Is there a way to do it without using Pandas? If so, how do I do it?
You can do this with two window transformations: a running count of on-time bills (delay = 0) labels each streak, and a running sum of delay within a streak gives its length so far.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
(101, 1, 1),
(101, 2, 1),
(101, 3, 0),
(101, 4, 1)
], ["id", 'date', 'delay'])
window = Window.partitionBy('id').orderBy('date')
# Every on-time bill (delay == 0) starts a new streak within the id
streak = F.sum(F.when(F.col('delay') == 0, 1).otherwise(0)).over(window)
# Running sum of delay inside one streak: 0 on the on-time bill, then 1, 2, ...
consecutive = F.sum('delay').over(Window.partitionBy('id', 'streak').orderBy('date'))
df \
.withColumn('streak', streak) \
.withColumn('consecutive', consecutive) \
.orderBy('id', 'date').show()
results:
+---+----+-----+------+-----------+
| id|date|delay|streak|consecutive|
+---+----+-----+------+-----------+
|101|   1|    1|     0|          1|
|101|   2|    1|     0|          2|
|101|   3|    0|     1|          0|
|101|   4|    1|     1|          1|
+---+----+-----+------+-----------+
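The sample above has only four rows, so it is worth checking any Spark implementation against the full example from the question, where a streak of two follows a reset. A minimal plain-Python sketch of the expected counter (reset at each DELAY=0 row and at each new ID); the function name add_consecutive is mine:

```python
# (ID, DATE, DELAY) rows from the question, already ordered by ID and DATE
rows = [(101, 1, 1), (101, 2, 1), (101, 3, 1), (101, 4, 0),
        (101, 5, 1), (101, 6, 1), (213, 1, 1), (213, 2, 1)]

def add_consecutive(rows):
    """Append the length of the current run of DELAY=1 rows for each ID."""
    out = []
    previous_id = None
    streak = 0
    for id_, date, delay in rows:
        if id_ != previous_id:
            streak = 0  # a new ID starts from scratch
        streak = streak + 1 if delay == 1 else 0  # an on-time bill resets the run
        out.append((id_, date, delay, streak))
        previous_id = id_
    return out

for row in add_consecutive(rows):
    print(row)
```

Rows 5 and 6 of ID 101 are the interesting case: the counter must restart at 1 after the on-time bill instead of continuing from the row's position in the partition.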