Spark Dataframe - Create 12 rows for each cell of a master table

I have a table containing Employee IDs and I'd like to add an additional column for Month containing 12 values (1 for each month). I'd like to create a new table where there are 12 rows for each ID in my list.
Take the following example:
+-----+
|GFCID|
+-----+
| 1|
| 2|
| 3|
+-----+
+---------+
|Yearmonth|
+---------+
| 202101|
| 202102|
| 202203|
| 202204|
| 202205|
+---------+
My desired output is something along the lines of:
ID Month
1 Jan
1 Feb
1 March
2 Jan
2 March
and so on. I am using pyspark and my current syntax is as follows:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.join(df2, df.GFCID == df2.Yearmonth, "outer")
df3.show()
And the output is
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| null| 202101|
| 3| null|
| null| 202205|
| null| 202102|
| null| 202204|
| 1| null|
| null| 202203|
| 2| null|
+-----+---------+
I understand this is wrong because there is no common key for the dataframes to join on. I would appreciate your help on this.

Here is your code modified with the proper join, crossJoin:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.crossJoin(df2)
df3.show()
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| 1| 202101|
| 1| 202102|
| 1| 202203|
| 1| 202204|
| 1| 202205|
| 2| 202101|
| 2| 202102|
| 2| 202203|
| 2| 202204|
| 2| 202205|
| 3| 202101|
| 3| 202102|
| 3| 202203|
| 3| 202204|
| 3| 202205|
+-----+---------+
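If you also want abbreviated month names as in the desired output (Jan, Feb, ...), here is a minimal sketch on top of df3, assuming Yearmonth is always a yyyyMM string:

from pyspark.sql import functions as F

# Parse the yyyyMM string into a date, then format it as an abbreviated month name
df3.withColumn("Month", F.date_format(F.to_date("Yearmonth", "yyyyMM"), "MMM")).show()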
Another way of doing it without using a join:
from pyspark.sql import functions as F
df2.withColumn("GFCID", F.explode(F.array([F.lit(i) for i in range(1, 13)]))).show()
+---------+-----+
|Yearmonth|GFCID|
+---------+-----+
| 202101| 1|
| 202101| 2|
| 202101| 3|
| 202101| 4|
| 202101| 5|
| 202101| 6|
| 202101| 7|
| 202101| 8|
| 202101| 9|
| 202101| 10|
| 202101| 11|
| 202101| 12|
| 202102| 1|
| 202102| 2|
| 202102| 3|
| 202102| 4|
...
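Note that this explodes month numbers onto df2. If what you actually need is 12 month rows for every GFCID (as described in the question), the same trick can be applied to df instead; a small sketch:

from pyspark.sql import functions as F

# Attach month numbers 1..12 to every GFCID row
df.withColumn("Month", F.explode(F.array([F.lit(m) for m in range(1, 13)]))).show()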

Related

Return constant numbers over partition

I have the following table:
data = [(1, "user_1", 'foo.com'), (2, "user_1", 'foo.com'), (3, "user_1", 'bar.com'), (4, "user_1", 'foo.com')]
schema = ['event_actions_order', 'user_name', 'website']
df = spark.createDataFrame(data, schema=schema)
df.show()
+-------------------+---------+---------+
|event_actions_order|user_name| website |
+-------------------+---------+---------+
| 1| user_1 | foo.com |
| 2| user_1 | foo.com |
| 3| user_1 | bar.com |
| 4| user_1 | foo.com |
+-------------------+---------+---------+
I want to be able to have a session number identifier each time a user moves from one website to another. I tried this:
w = Window.partitionBy('website').orderBy('event_actions_order')
df.select('event_actions_order', 'user_name', 'website').withColumn('test', row_number().over(w)).orderBy('event_actions_order').show()
+-------------------+---------+-------+----+
|event_actions_order|user_name|website|test|
+-------------------+---------+-------+----+
| 1| user_1|foo.com| 1|
| 2| user_1|foo.com| 2|
| 3| user_1|bar.com| 1|
| 4| user_1|foo.com| 3|
+-------------------+---------+-------+----+
but this is the output I'd like to have:
+-------------------+---------+-------+----+
|event_actions_order|user_name|website|test|
+-------------------+---------+-------+----+
| 1| user_1|foo.com| 1|
| 2| user_1|foo.com| 1|
| 3| user_1|bar.com| 2|
| 4| user_1|foo.com| 3|
+-------------------+---------+-------+----+
How can I achieve this?
You can use the lag function to compare against the previous row in your window, then do a rolling sum to get the desired result. Note that the window is partitioned by user_name rather than website, so the comparison follows each user's events in order:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('user_name').orderBy('event_actions_order')
(df
 .withColumn('change', f.when(f.lag('website').over(w) == f.col('website'), 0).otherwise(1))
 .withColumn('test', f.sum('change').over(w))
 .drop('change')
).show()
+-------------------+---------+-------+----+
|event_actions_order|user_name|website|test|
+-------------------+---------+-------+----+
| 1| user_1|foo.com| 1|
| 2| user_1|foo.com| 1|
| 3| user_1|bar.com| 2|
| 4| user_1|foo.com| 3|
+-------------------+---------+-------+----+

Pyspark crossJoin with specific condition

The crossJoin of two dataframes of 5 rows each gives a dataframe of 25 rows (5*5).
What I want is to do a crossJoin which is not "full".
For example:
df1: df2:
+-----+ +-----+
|index| |value|
+-----+ +-----+
| 0| | A|
| 1| | B|
| 2| | C|
| 3| | D|
| 4| | E|
+-----+ +-----+
The result must be a dataframe with fewer than 25 rows, where for each row in index the number of rows from value it is paired with is chosen randomly.
It will be something like that:
+-----+-----+
|index|value|
+-----+-----+
| 0| D|
| 0| A|
| 1| A|
| 1| D|
| 1| B|
| 1| C|
| 2| A|
| 2| E|
| 3| D|
| 4| A|
| 4| B|
| 4| E|
+-----+-----+
Thank you
You can try sample(withReplacement, fraction, seed=None) to get fewer rows after the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df.join(df1).sample(False,0.6).show()
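Note that sample() keeps a roughly fixed fraction of all 25 combinations. A closely related alternative is to filter the cross join with a random predicate, which keeps each pair independently (a sketch using the same dataframe names as the snippet above; neither variant guarantees that every index keeps at least one value):

from pyspark.sql import functions as F

# Keep each (index, value) pair independently with probability 0.6, so counts vary per index
df.crossJoin(df1).where(F.rand() < 0.6).show()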

Pyspark dataframes group by

I have a dataframe like the one below:
+---+---+---+
|123|124|125|
+---+---+---+
|  1|  2|  3|
|  9|  9|  4|
|  4| 12|  1|
|  2|  4|  8|
|  7|  6|  3|
| 19| 11|  2|
| 21| 10| 10|
+---+---+---+
I need the data to be in:
1:[123,125]
2:[123,124,125]
3:[125]
The order does not need to be sorted. I am new to dataframes in pyspark; any help would be appreciated.
There is no melt/unpivot API in older pyspark versions that will accomplish this directly. Instead, flatMap from the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value of the column and the column name:
cols = df.columns
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.groupby("value").agg(f.collect_list("column_name"))
.show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
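An alternative that stays in the DataFrame API (no RDD round trip) is to explode an array of (value, column_name) structs; a sketch using the same column names:

from pyspark.sql import functions as f

# Build one struct per column, explode them into rows, then aggregate as before
pairs = f.explode(f.array([f.struct(f.col(c).alias("value"), f.lit(c).alias("column_name")) for c in df.columns]))
(df.select(pairs.alias("pair"))
 .select("pair.value", "pair.column_name")
 .groupby("value").agg(f.collect_list("column_name"))
 .show())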

Spark: How to aggregate/reduce records based on time difference?

I have time series data in CSV from a vehicle with the following information:
trip-id
timestamp
speed
The data looks like this:
trip-id | timestamp | speed
001 | 1538204192 | 44.55
001 | 1538204193 | 47.20 <-- start of brake
001 | 1538204194 | 42.14
001 | 1538204195 | 39.20
001 | 1538204196 | 35.30
001 | 1538204197 | 32.22 <-- end of brake
001 | 1538204198 | 34.80
001 | 1538204199 | 37.10
...
001 | 1538204221 | 55.30
001 | 1538204222 | 57.20 <-- start of brake
001 | 1538204223 | 54.60
001 | 1538204224 | 52.15
001 | 1538204225 | 49.27
001 | 1538204226 | 47.89 <-- end of brake
001 | 1538204227 | 50.57
001 | 1538204228 | 53.72
...
A braking event occurs when there's a decrease in speed in 2 consecutive records based on timestamp.
I want to extract the braking events from the data in terms of event start timestamp, end timestamp, start speed & end speed.
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+
Here's my take:
Defined a window spec with partition according to trip-id, ordered by timestamp.
Applied window lag to move over consecutive rows and calculate speed difference.
Filter out records which have a positive speed difference, as I am interested in braking events only.
Now that I only have records belonging to braking events, I want to group records belonging to the same event. I guess I can do this based on the timestamp difference. If the difference between 2 records is 1 second, those 2 records belong to the same braking event.
I am stuck here as I do not have a key identifying records of the same group, with which I could apply key-based aggregation.
My question is:
How can I map to add a key column based on the difference in timestamp? So if 2 records have a difference of 1 second, they should have a common key. That way, I can reduce a group based on the newly added key.
Is there any better and more optimized way to achieve this? My approach could be very inefficient as it relies on row-by-row comparisons. What are the other possible ways to detect these kinds of "sub-events" (e.g. braking events) in a data stream belonging to a specific event (data from a single vehicle trip)?
Thanks in advance!
Appendix:
Example data file for a trip: https://www.dropbox.com/s/44a0ilogxp60w...
For Pandas users, there is a fairly common programming pattern using shift() + cumsum() to set up a group label that identifies consecutive rows matching some specific pattern/condition. With pyspark, we can use the Window functions lag() + sum() to do the same and find this group label (d2 in the following code):
Data Setup:
from pyspark.sql import functions as F, Window
>>> df.orderBy('timestamp').show()
+-------+----------+-----+
|trip-id| timestamp|speed|
+-------+----------+-----+
| 001|1538204192|44.55|
| 001|1538204193|47.20|
| 001|1538204194|42.14|
| 001|1538204195|39.20|
| 001|1538204196|35.30|
| 001|1538204197|32.22|
| 001|1538204198|34.80|
| 001|1538204199|37.10|
| 001|1538204221|55.30|
| 001|1538204222|57.20|
| 001|1538204223|54.60|
| 001|1538204224|52.15|
| 001|1538204225|49.27|
| 001|1538204226|47.89|
| 001|1538204227|50.57|
| 001|1538204228|53.72|
+-------+----------+-----+
>>> df.printSchema()
root
|-- trip-id: string (nullable = true)
|-- timestamp: integer (nullable = true)
|-- speed: double (nullable = true)
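For reference, a sketch of how this sample dataframe can be reproduced (values taken from the show() output above; Spark will infer the column types, e.g. a long timestamp):

data = [('001', 1538204192, 44.55), ('001', 1538204193, 47.20), ('001', 1538204194, 42.14),
        ('001', 1538204195, 39.20), ('001', 1538204196, 35.30), ('001', 1538204197, 32.22),
        ('001', 1538204198, 34.80), ('001', 1538204199, 37.10), ('001', 1538204221, 55.30),
        ('001', 1538204222, 57.20), ('001', 1538204223, 54.60), ('001', 1538204224, 52.15),
        ('001', 1538204225, 49.27), ('001', 1538204226, 47.89), ('001', 1538204227, 50.57),
        ('001', 1538204228, 53.72)]
df = spark.createDataFrame(data, ['trip-id', 'timestamp', 'speed'])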
Set up two Window Spec (w1, w2):
# Window spec used to find previous speed F.lag('speed').over(w1) and also do the cumsum() to find flag `d2`
w1 = Window.partitionBy('trip-id').orderBy('timestamp')
# Window spec used to find the minimal value of flag `d1` over the partition(`trip-id`,`d2`)
w2 = Window.partitionBy('trip-id', 'd2').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Three flags (d1, d2, d3):
d1 : flag to identify if the previous speed is greater than the current speed, if true d1 = 0, else d1 = 1
d2 : flag to mark the consecutive rows for speed-drop with the same unique number
d3 : flag to identify the minimal value of d1 over the partition('trip-id', 'd2'); only when d3 == 0 can the row belong to a group with a speed drop. This will be used to filter out unrelated rows
df_1 = df.withColumn('d1', F.when(F.lag('speed').over(w1) > F.col('speed'), 0).otherwise(1))\
.withColumn('d2', F.sum('d1').over(w1)) \
.withColumn('d3', F.min('d1').over(w2))
>>> df_1.orderBy('timestamp').show()
+-------+----------+-----+---+---+---+
|trip-id| timestamp|speed| d1| d2| d3|
+-------+----------+-----+---+---+---+
| 001|1538204192|44.55| 1| 1| 1|
| 001|1538204193|47.20| 1| 2| 0|
| 001|1538204194|42.14| 0| 2| 0|
| 001|1538204195|39.20| 0| 2| 0|
| 001|1538204196|35.30| 0| 2| 0|
| 001|1538204197|32.22| 0| 2| 0|
| 001|1538204198|34.80| 1| 3| 1|
| 001|1538204199|37.10| 1| 4| 1|
| 001|1538204221|55.30| 1| 5| 1|
| 001|1538204222|57.20| 1| 6| 0|
| 001|1538204223|54.60| 0| 6| 0|
| 001|1538204224|52.15| 0| 6| 0|
| 001|1538204225|49.27| 0| 6| 0|
| 001|1538204226|47.89| 0| 6| 0|
| 001|1538204227|50.57| 1| 7| 1|
| 001|1538204228|53.72| 1| 8| 1|
+-------+----------+-----+---+---+---+
Remove rows which are not of concern:
df_1 = df_1.where('d3 == 0')
>>> df_1.orderBy('timestamp').show()
+-------+----------+-----+---+---+---+
|trip-id| timestamp|speed| d1| d2| d3|
+-------+----------+-----+---+---+---+
| 001|1538204193|47.20| 1| 2| 0|
| 001|1538204194|42.14| 0| 2| 0|
| 001|1538204195|39.20| 0| 2| 0|
| 001|1538204196|35.30| 0| 2| 0|
| 001|1538204197|32.22| 0| 2| 0|
| 001|1538204222|57.20| 1| 6| 0|
| 001|1538204223|54.60| 0| 6| 0|
| 001|1538204224|52.15| 0| 6| 0|
| 001|1538204225|49.27| 0| 6| 0|
| 001|1538204226|47.89| 0| 6| 0|
+-------+----------+-----+---+---+---+
Final Step:
Now for df_1, group by trip-id and d2, find the min and max of F.struct('timestamp', 'speed') which will return the first and last records in the group, select the corresponding fields from the struct to get the final result:
df_new = df_1.groupby('trip-id', 'd2').agg(
F.min(F.struct('timestamp', 'speed')).alias('start')
, F.max(F.struct('timestamp', 'speed')).alias('end')
).select(
'trip-id'
, F.col('start.timestamp').alias('start timestamp')
, F.col('end.timestamp').alias('end timestamp')
, F.col('start.speed').alias('start speed')
, F.col('end.speed').alias('end speed')
)
>>> df_new.show()
+-------+---------------+-------------+-----------+---------+
|trip-id|start timestamp|end timestamp|start speed|end speed|
+-------+---------------+-------------+-----------+---------+
| 001| 1538204193| 1538204197| 47.20| 32.22|
| 001| 1538204222| 1538204226| 57.20| 47.89|
+-------+---------------+-------------+-----------+---------+
Note: Removing the intermediate dataframe df_1, we can have the following:
df_new = df.withColumn('d1', F.when(F.lag('speed').over(w1) > F.col('speed'), 0).otherwise(1))\
.withColumn('d2', F.sum('d1').over(w1)) \
.withColumn('d3', F.min('d1').over(w2)) \
.where('d3 == 0') \
.groupby('trip-id', 'd2').agg(
F.min(F.struct('timestamp', 'speed')).alias('start')
, F.max(F.struct('timestamp', 'speed')).alias('end')
)\
.select(
'trip-id'
, F.col('start.timestamp').alias('start timestamp')
, F.col('end.timestamp').alias('end timestamp')
, F.col('start.speed').alias('start speed')
, F.col('end.speed').alias('end speed')
)
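As an aside, the 1-second timestamp gap mentioned in the question can also be turned into a group key with the same lag() + sum() trick; a sketch applied to the filtered frame df_1 and window w1 from above (the gap and event_key column names are just illustrative):

# A new braking event starts whenever the gap to the previous kept row exceeds 1 second
df_keyed = (df_1
    .withColumn('gap', F.col('timestamp') - F.lag('timestamp').over(w1))
    .withColumn('event_key', F.sum(F.when(F.col('gap').isNull() | (F.col('gap') > 1), 1).otherwise(0)).over(w1)))
df_keyed.show()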
Hope this helps.

Scala code:
Output
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+
CODE
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
import org.apache.spark.sql.functions._
scala> df.show
+-------+----------+-----+
|trip-id| timestamp|speed|
+-------+----------+-----+
| 001|1538204192|44.55|
| 001|1538204193| 47.2|
| 001|1538204194|42.14|
| 001|1538204195| 39.2|
| 001|1538204196| 35.3|
| 001|1538204197|32.22|
| 001|1538204198| 34.8|
| 001|1538204199| 37.1|
| 001|1538204221| 55.3|
| 001|1538204222| 57.2|
| 001|1538204223| 54.6|
| 001|1538204224|52.15|
| 001|1538204225|49.27|
| 001|1538204226|47.89|
| 001|1538204227|50.57|
| 001|1538204228|53.72|
+-------+----------+-----+
val overColumns = Window.partitionBy("trip-id").orderBy("timestamp")
val breaksDF = df
.withColumn("speeddiff", lead("speed", 1).over(overColumns) - $"speed")
.withColumn("breaking", when($"speeddiff" < 0, 1).otherwise(0))
scala> breaksDF.show
+-------+----------+-----+-------------------+--------+
|trip-id| timestamp|speed| speeddiff|breaking|
+-------+----------+-----+-------------------+--------+
| 001|1538204192|44.55| 2.6500000000000057| 0|
| 001|1538204193| 47.2| -5.060000000000002| 1|
| 001|1538204194|42.14|-2.9399999999999977| 1|
| 001|1538204195| 39.2|-3.9000000000000057| 1|
| 001|1538204196| 35.3|-3.0799999999999983| 1|
| 001|1538204197|32.22| 2.5799999999999983| 0|
| 001|1538204198| 34.8| 2.3000000000000043| 0|
| 001|1538204199| 37.1| 18.199999999999996| 0|
| 001|1538204221| 55.3| 1.9000000000000057| 0|
| 001|1538204222| 57.2|-2.6000000000000014| 1|
| 001|1538204223| 54.6| -2.450000000000003| 1|
| 001|1538204224|52.15|-2.8799999999999955| 1|
| 001|1538204225|49.27|-1.3800000000000026| 1|
| 001|1538204226|47.89| 2.6799999999999997| 0|
| 001|1538204227|50.57| 3.1499999999999986| 0|
| 001|1538204228|53.72| null| 0|
+-------+----------+-----+-------------------+--------+
val outputDF = breaksDF
.withColumn("breakevent",
when(($"breaking" - lag($"breaking", 1).over(overColumns)) === 1, "start of break")
.when(($"breaking" - lead($"breaking", 1).over(overColumns)) === 1, "end of break"))
scala> outputDF.show
+-------+----------+-----+-------------------+--------+--------------+
|trip-id| timestamp|speed| speeddiff|breaking| breakevent|
+-------+----------+-----+-------------------+--------+--------------+
| 001|1538204192|44.55| 2.6500000000000057| 0| null|
| 001|1538204193| 47.2| -5.060000000000002| 1|start of break|
| 001|1538204194|42.14|-2.9399999999999977| 1| null|
| 001|1538204195| 39.2|-3.9000000000000057| 1| null|
| 001|1538204196| 35.3|-3.0799999999999983| 1| end of break|
| 001|1538204197|32.22| 2.5799999999999983| 0| null|
| 001|1538204198| 34.8| 2.3000000000000043| 0| null|
| 001|1538204199| 37.1| 18.199999999999996| 0| null|
| 001|1538204221| 55.3| 1.9000000000000057| 0| null|
| 001|1538204222| 57.2|-2.6000000000000014| 1|start of break|
| 001|1538204223| 54.6| -2.450000000000003| 1| null|
| 001|1538204224|52.15|-2.8799999999999955| 1| null|
| 001|1538204225|49.27|-1.3800000000000026| 1| end of break|
| 001|1538204226|47.89| 2.6799999999999997| 0| null|
| 001|1538204227|50.57| 3.1499999999999986| 0| null|
| 001|1538204228|53.72| null| 0| null|
+-------+----------+-----+-------------------+--------+--------------+
scala> outputDF.filter("breakevent is not null").select("trip-id", "timestamp", "speed", "breakevent").show
+-------+----------+-----+--------------+
|trip-id| timestamp|speed| breakevent|
+-------+----------+-----+--------------+
| 001|1538204193| 47.2|start of break|
| 001|1538204196| 35.3| end of break|
| 001|1538204222| 57.2|start of break|
| 001|1538204225|49.27| end of break|
+-------+----------+-----+--------------+
outputDF.filter("breakevent is not null")
  .withColumn("breakID",
    when($"breakevent" === "start of break", concat($"trip-id", $"timestamp"))
    .when($"breakevent" === "end of break", concat($"trip-id", lag($"timestamp", 1).over(overColumns))))
  .groupBy("breakID")
  .agg(first($"timestamp") as "start timestamp",
       last($"timestamp") as "end timestamp",
       first($"speed") as "start speed",
       last($"speed") as "end speed")
  .show
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+

Use list and replace a pyspark column

Suppose I have a list new_id_acc = [6,8,1,2,4] and I have a PySpark DataFrame like:
id_acc | name |
10 | ABC |
20 | XYZ |
21 | KBC |
34 | RAH |
19 | SPD |
I want to replace the pyspark column id_acc with the new_id_acc values. How can I achieve this?
I tried and found that lit() can be used for a constant value, but I didn't find anything on how to do this for a list.
After replacement I want my PySpark Dataframe to look like this
id_acc | name |
6 | ABC |
8 | XYZ |
1 | KBC |
2 | RAH |
4 | SPD |
Probably a long answer, but it works.
df = spark.sparkContext.parallelize([(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'ABC'),(19,'SPD')]).toDF(('id_acc', 'name'))
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| ABC|
| 19| SPD|
+------+----+
new_id_acc = [6,8,1,2,4]
indx = ['ABC','XYZ','KBC','ABC','SPD']
from pyspark.sql.types import *
myschema= StructType([ StructField("indx", StringType(), True),StructField("new_id_ac", IntegerType(), True)])
df1=spark.createDataFrame(zip(indx,new_id_acc),schema = myschema)
df1.show()
+----+---------+
|indx|new_id_ac|
+----+---------+
| ABC| 6|
| XYZ| 8|
| KBC| 1|
| ABC| 2|
| SPD| 4|
+----+---------+
dfnew = (df.join(df1, df.name == df1.indx, how='left')
         .drop(df1.indx)
         .select('new_id_ac', 'name')
         .sort('name')
         .dropDuplicates(['new_id_ac']))
dfnew.show()
+---------+----+
|new_id_ac|name|
+---------+----+
| 1| KBC|
| 6| ABC|
| 4| SPD|
| 8| XYZ|
| 2| ABC|
+---------+----+
The idea is to create a column of consecutive serial/row numbers and then use them to get the corresponding values from the list.
# Creating the requisite DataFrame
from pyspark.sql.functions import row_number,lit, udf
from pyspark.sql.window import Window
valuesCol = [(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'RAH'),(19,'SPD')]
df = spark.createDataFrame(valuesCol,['id_acc','name'])
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| RAH|
| 19| SPD|
+------+----+
You can create row/serial numbers as done here.
Note that 'A' below is just a dummy value, as we don't need to order the values; we just want the row number.
w = Window().orderBy(lit('A'))
df = df.withColumn('serial_number', row_number().over(w))
df.show()
+------+----+-------------+
|id_acc|name|serial_number|
+------+----+-------------+
| 10| ABC| 1|
| 20| XYZ| 2|
| 21| KBC| 3|
| 34| RAH| 4|
| 19| SPD| 5|
+------+----+-------------+
As a final step, we access the elements of the list provided by the OP using the row number. For this we use a udf.
new_id_acc = [6,8,1,2,4]
mapping = udf(lambda x: new_id_acc[x-1])
df = df.withColumn('id_acc', mapping(df.serial_number)).drop('serial_number')
df.show()
+------+----+
|id_acc|name|
+------+----+
| 6| ABC|
| 8| XYZ|
| 1| KBC|
| 2| RAH|
| 4| SPD|
+------+----+
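An alternative to the udf for the last step: build a literal array column from the list and index it with the serial number (a sketch, starting from the dataframe that still has the serial_number column, i.e. before the udf step above; array indexing is 0-based, hence the -1):

from pyspark.sql import functions as F

# Pick the list element at position serial_number - 1 without a Python udf
lookup = F.array([F.lit(v) for v in new_id_acc])
df.withColumn('id_acc', lookup[F.col('serial_number') - 1]).drop('serial_number').show()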