Shift rows dynamically based on column value - dataframe

Below is my input dataframe:
+---+----------+--------+
|ID |date      |shift_by|
+---+----------+--------+
|1  |2021-01-01|2       |
|1  |2021-02-05|2       |
|1  |2021-03-27|2       |
|2  |2022-02-28|1       |
|2  |2022-04-30|1       |
+---+----------+--------+
I need to groupBy "ID" and shift based on the "shift_by" column. In the end, the result should look like below:
+---+----------+----------+
|ID |date1     |date2     |
+---+----------+----------+
|1  |2021-01-01|2021-03-27|
|2  |2022-02-28|2022-04-30|
+---+----------+----------+
I have implemented the logic using a UDF, but it makes my code slow. I would like to understand whether this logic can be implemented without a UDF.
Below is a sample dataframe:
import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

data2 = [(1, datetime.date(2021, 1, 1), datetime.date(2021, 3, 27)),
         (2, datetime.date(2022, 2, 28), datetime.date(2022, 4, 30))]
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("date1", DateType(), True),
    StructField("date2", DateType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)

Based on the comments and chats, you can try calculating the first and last values of the lat/lon fields of concern (the foo column below stands in for any such field).
import sys
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('foo_first', func.first('foo').over(wd.partitionBy('id').orderBy('date').rowsBetween(-sys.maxsize, sys.maxsize))). \
    withColumn('foo_last', func.last('foo').over(wd.partitionBy('id').orderBy('date').rowsBetween(-sys.maxsize, sys.maxsize))). \
    select('id', 'foo_first', 'foo_last'). \
    dropDuplicates()
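As a side note, the sys.maxsize bounds can be replaced with Spark's built-in unbounded frame markers, which read a little more clearly; a minimal sketch of the same window (same placeholder data_sdf and foo column as above):
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# Same unbounded window as above, written with the built-in frame markers.
w = wd.partitionBy('id').orderBy('date') \
      .rowsBetween(wd.unboundedPreceding, wd.unboundedFollowing)

data_sdf \
    .withColumn('foo_first', func.first('foo').over(w)) \
    .withColumn('foo_last', func.last('foo').over(w)) \
    .select('id', 'foo_first', 'foo_last') \
    .dropDuplicates()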
Or, you can create structs and take their min/max; struct comparison goes field by field in declared order, so putting the date first means min/max pick the earliest and latest rows:
data_sdf = spark.createDataFrame(
    [(1, '2021-01-01', 2, 2),
     (1, '2021-02-05', 3, 2),
     (1, '2021-03-27', 4, 2),
     (2, '2022-02-28', 1, 5),
     (2, '2022-04-30', 5, 1)],
    ['ID', 'date', 'lat', 'lon'])

data_sdf. \
    withColumn('dt_lat_lon_struct', func.struct('date', 'lat', 'lon')). \
    groupBy('id'). \
    agg(func.min('dt_lat_lon_struct').alias('min_dt_lat_lon_struct'),
        func.max('dt_lat_lon_struct').alias('max_dt_lat_lon_struct')
        ). \
    selectExpr('id',
               'min_dt_lat_lon_struct.lat as lat_first', 'min_dt_lat_lon_struct.lon as lon_first',
               'max_dt_lat_lon_struct.lat as lat_last', 'max_dt_lat_lon_struct.lon as lon_last'
               )
# +---+---------+---------+--------+--------+
# | id|lat_first|lon_first|lat_last|lon_last|
# +---+---------+---------+--------+--------+
# | 1| 2| 2| 4| 2|
# | 2| 1| 5| 5| 1|
# +---+---------+---------+--------+--------+

Aggregation using min and max seems like it could work in your case.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, '2021-01-01', 2),
     (1, '2021-02-05', 2),
     (1, '2021-03-27', 2),
     (2, '2022-02-28', 1),
     (2, '2022-04-30', 1)],
    ['ID', 'date', 'shift_by'])

df = df.groupBy('ID').agg(
    F.min('date').alias('date1'),
    F.max('date').alias('date2'),
)
df.show()
# +---+----------+----------+
# | ID| date1| date2|
# +---+----------+----------+
# | 1|2021-01-01|2021-03-27|
# | 2|2022-02-28|2022-04-30|
# +---+----------+----------+
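If the shift genuinely has to be driven by the shift_by value rather than just taking the min and max dates, here is a hedged, UDF-free sketch; it assumes df is the original (ID, date, shift_by) frame shown in the question and that shift_by is constant within each ID, as in the sample: collect the sorted dates per ID and pick positions with element_at.
from pyspark.sql import functions as F

# Assumes df is the original (ID, date, shift_by) frame, with shift_by
# constant within each ID.
result = (
    df.groupBy('ID')
      .agg(F.sort_array(F.collect_list('date')).alias('dates'),
           F.first('shift_by').alias('shift_by'))
      .select(
          'ID',
          F.element_at('dates', 1).alias('date1'),
          # element_at is 1-based, so shift_by=2 picks the 3rd date
          F.expr('element_at(dates, shift_by + 1)').alias('date2'),
      )
)
result.show()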

Related

Split row into multiple rows to limit length of array in column (spark / scala)

I have a dataframe that looks like this:
+--------------+--------------------+
|id | items |
+--------------+--------------------+
| 1|[a, b, .... x, y, z]|
+--------------+--------------------+
| 1|[q, z, .... x, b, 5]|
+--------------+--------------------+
| 2|[q, z, .... x, b, 5]|
+--------------+--------------------+
I want to split the rows so that the array in the items column is at most length 20. If an array has length greater than 20, I would want to make new rows and split the array up so that each array is of length 20 or less. So for the first row in my example dataframe, if we assume the length is 10 and I want at most length 3 for each row, I would like for it to be split like this:
+--------------+--------------------+
|id | items |
+--------------+--------------------+
| 1|[a, b, c] |
+--------------+--------------------+
| 1|[z, y, z] |
+--------------+--------------------+
| 1|[e, f, g] |
+--------------+--------------------+
| 1|[q] |
+--------------+--------------------+
Ideally, all rows should be of length 3 except the last row if the length of the array is not evenly divisible by the max desired length. Note - the id column is not unique
Using the higher-order functions transform + filter along with slice, you can split the array into sub-arrays of at most 20 elements and then explode it:
import org.apache.spark.sql.functions._

val l = 20
val df1 = df.withColumn(
  "items",
  explode(
    expr(
      s"filter(transform(items, (x, i) -> IF(i % $l = 0, slice(items, i + 1, $l), null)), x -> x is not null)"
    )
  )
)
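For readers following along in PySpark, a rough equivalent of the same expression (a hedged sketch, assuming the same df with an items array column):
import pyspark.sql.functions as F

# Same SQL expression as the Scala version, built as a Python f-string.
l = 20
df1 = df.withColumn(
    "items",
    F.explode(
        F.expr(
            f"filter(transform(items, (x, i) -> IF(i % {l} = 0, slice(items, i + 1, {l}), null)), "
            "x -> x is not null)"
        )
    )
)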
You could try this:
import pandas as pd

max_item_length = 3
df = pd.DataFrame(
    {"fake_index": [1, 2, 3],
     "items": [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"], ["k", "l"]]}
)

# Split every list into chunks of at most max_item_length items, producing
# one output row per chunk. (DataFrame.append was removed in pandas 2.x,
# so the rows are collected in a list and turned into a frame at the end.)
rows = []
for _, row in df.iterrows():
    items = row["items"]
    for start in range(0, len(items), max_item_length):
        rows.append({"fake_index": row["fake_index"],
                     "items": items[start:start + max_item_length]})
df = pd.DataFrame(rows)
print(df)
Input:
   fake_index            items
0           1  [a, b, c, d, e]
1           2  [f, g, h, i, j]
2           3           [k, l]
Output:
   fake_index      items
0           1  [a, b, c]
1           1     [d, e]
2           2  [f, g, h]
3           2     [i, j]
4           3     [k, l]
Since this requires a more complex transformation, I've used datasets. This might not be as performant, but it will get what you want.
Setup
Creating some sample data to mimic your data.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

val arrayData = Seq(
  Row(1, List(1, 2, 3, 4, 5, 6, 7)),
  Row(2, List(1, 2, 3, 4)),
  Row(3, List(1, 2)),
  Row(4, List(1, 2, 3))
)
val arraySchema = new StructType().add("id", IntegerType).add("values", ArrayType(IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
/*
+---+---------------------+
|id |values |
+---+---------------------+
|1 |[1, 2, 3, 4, 5, 6, 7]|
|2 |[1, 2, 3, 4] |
|3 |[1, 2] |
|4 |[1, 2, 3] |
+---+---------------------+
*/
Transformations
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// encoder for the custom type produced by the transformation
implicit val encoder = ExpressionEncoder[(Int, Array[Array[Int]])]
// Here we are using a sliding window of size 3 and step 3.
// This can be made into a generic function for a window of size k.
val df2 = df.map(r => {
  val id = r.getInt(0)
  val a = r.getSeq[Int](1).toArray
  val arrays = a.sliding(3, 3).toArray
  (id, arrays)
})
/*
+---+---------------------------------------------------------------+
|_1 |_2 |
+---+---------------------------------------------------------------+
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7)]|
|2 |[WrappedArray(1, 2, 3), WrappedArray(4)] |
|3 |[WrappedArray(1, 2)] |
|4 |[WrappedArray(1, 2, 3)] |
+---+---------------------------------------------------------------+
*/
val df3 = df2
  .withColumnRenamed("_1", "id")
  .withColumnRenamed("_2", "values")
/*
+---+---------------------------------------------------------------+
|id |values |
+---+---------------------------------------------------------------+
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7)]|
|2 |[WrappedArray(1, 2, 3), WrappedArray(4)] |
|3 |[WrappedArray(1, 2)] |
|4 |[WrappedArray(1, 2, 3)] |
+---+---------------------------------------------------------------+
*/
Use explode
Explode will create a new row for each array entry in the second column.
import org.apache.spark.sql.functions
val df4 = df3.withColumn("values", functions.explode($"values"))
/*
+---+---------+
|id |values |
+---+---------+
|1 |[1, 2, 3]|
|1 |[4, 5, 6]|
|1 |[7] |
|2 |[1, 2, 3]|
|2 |[4] |
|3 |[1, 2] |
|4 |[1, 2, 3]|
+---+---------+
*/
Limitations
This approach is not without limitations.
Primarily, it will not be as performant on larger datasets, since this code no longer benefits from the DataFrame API's built-in optimizations. However, a pure DataFrame solution might require window functions, which can also perform poorly depending on the size of the data. If it's possible to reshape this data at the source, that would be recommended.
This approach also requires defining an encoder for a more complex type; if the data schema changes, a different encoder will be needed.

Divide function in pyspark

Suppose I have this dataframe on PySpark:
df = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])
I want to create a function that divides column v2 by column v1, with the condition that it returns NaN when v1 is 0:
import numpy as np
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_div(a, b):
    if b == 0:
        return np.nan
    else:
        return (a / b)
However, the result turns out to be this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The output that I want should be like this:
+---------+---------+---+
|color_new|fruit_new|div|
+---------+---------+---+
| red| banana|10 |
| blue| banana|20 |
| red| carrot|30 |
| blue| grape|40 |
| red| carrot|50 |
| black| carrot|60 |
| red| banana|70 |
| red| grape|80 |
+---------+---------+---+
All you needed was a when and otherwise. See the example below.
# Create data frame
df = spark.createDataFrame([
    ['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
    ['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
    ['red', 'banana', 7, 70], ['red', 'grape', 8, 80], ['orange', 'grapefruit', 0, 100]], schema=['color', 'fruit', 'v1', 'v2'])
# display result
df.show()
+------+----------+---+---+
| color| fruit| v1| v2|
+------+----------+---+---+
| red| banana| 1| 10|
| blue| banana| 2| 20|
| red| carrot| 3| 30|
| blue| grape| 4| 40|
| red| carrot| 5| 50|
| black| carrot| 6| 60|
| red| banana| 7| 70|
| red| grape| 8| 80|
|orange|grapefruit| 0|100|
+------+----------+---+---+
# Import functions
import pyspark.sql.functions as f
# apply case when
df1 = df.withColumn("divide", f.when(f.col("v1") == 0, None).otherwise(f.lit(f.col("v2")/f.col("v1"))))
# display result
df1.show()
+------+----------+---+---+------+
| color| fruit| v1| v2|divide|
+------+----------+---+---+------+
| red| banana| 1| 10| 10.0|
| blue| banana| 2| 20| 10.0|
| red| carrot| 3| 30| 10.0|
| blue| grape| 4| 40| 10.0|
| red| carrot| 5| 50| 10.0|
| black| carrot| 6| 60| 10.0|
| red| banana| 7| 70| 10.0|
| red| grape| 8| 80| 10.0|
|orange|grapefruit| 0|100| null|
+------+----------+---+---+------+
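If you specifically wanted to keep the pandas_udf approach from the question, the error comes from applying a Python if to a whole pandas Series; a hedged sketch of a vectorised version (assuming Spark 3.x with PyArrow available, and a double return type rather than long) could look like this:
import numpy as np
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.functions import pandas_udf

# Hypothetical vectorised version of the question's pandas_div; 'double' is
# used as the return type because dividing v2 by v1 yields floats.
@pandas_udf('double')
def pandas_div(a: pd.Series, b: pd.Series) -> pd.Series:
    # np.where works element-wise, so no Python `if` is applied to a whole
    # Series (which is what raised the ambiguity error).
    return pd.Series(np.where(b == 0, np.nan, a / b))

df2 = df.withColumn("div", pandas_div(f.col("v2"), f.col("v1")))
df2.show()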

Counting longest sequence of specific elements in a list contained within a spark.sql database column

I have the following problem I would like to solve.
I have the following Dataframe that I created from a query
val temp = spark.sql("select Id, collect_list(from) as letter from f group by Id")
+-----------+---------------+
|         Id|         letter|
+-----------+---------------+
|        106|            [c]|
|        101|            [p]|
|        104|[c, c, c, t, u]|
|        100|[d, t, j, j, c]|
|        110|      [p, n, f]|
|        113|[s, c, c, b, ..|
|        115|[u, s, t, c, ..|
|         11|   [c, c, i, s]|
|        117|   [d, d, p, s]|
|        118|[a, s, c, t, ..|
|        123|         [d, n]|
|        125|         [n, b]|
|        128|            [c]|
|        131|   [c, t, c, u]|
|        132|      [c, u, i]|
|        134|[c, p, j, u, c]|
|        136|[b, a, t, n, c]|
|        137|         [b, a]|
|        138|      [b, t, c]|
|        141|            [s]|
+-----------+---------------+
I would like to create a new column called "n"
This column would contain a numerical value representing the longest sequence of letters in a cell before "c" appears. The longest sequence can be anywhere in the list.
For example the solution column for this section (assuming nothing is cut off by the ....) would be
0, 1, 3, 5, 3, 2, 4, 4, 4, 4, 2, 2, 1, 4, 2, 5, 5, 2, 3, 1
Any help would be greatly appreciated. Thank you!
Here is how you can use Spark's built-in functions; the given Scala (UDF) logic can be converted to these functions as below.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

df.withColumn("n_trip",
    array_max(
      transform(
        filter(
          split(array_join($"trip", " "), "co"),
          (col: Column) => (col =!= "" || col =!= null)
        ),
        (col: Column) => size(split(trim(col), " "))
      )
    ))
  .withColumn("n_trip", when($"n_trip".isNull, 0).otherwise($"n_trip"))
  .show(false)
Update: Easy to understand
df.withColumn("split", split(array_join($"trip", " "), "co"))
.withColumn("filter", filter($"split", (col: Column) => col =!= "" || col =!= null))
.withColumn("n_trip", array_max(transform($"filter", (col: Column) => size(split(trim(col), " ")))))
.withColumn("n_trip", when($"n_trip".isNull, 0).otherwise($"n_trip"))
.drop("split", "filter")
.show(false)
Output:
+-----------+--------------------+------+
|passengerId|trip |n_trip|
+-----------+--------------------+------+
|10096 |[co] |0 |
|10351 |[pk] |1 |
|10436 |[co, co, cn, tj, us]|3 |
|1090 |[dk, tj, jo, jo, ch]|5 |
|11078 |[pk, no, fr] |3 |
|11332 |[sg, cn, co, bm] |2 |
|11563 |[us, sg, th, cn] |4 |
|1159 |[ca, cl, il, sg] |4 |
|11722 |[dk, dk, pk, sg] |4 |
|11888 |[au, se, ca, tj] |4 |
|12394 |[dk, nl] |2 |
|12529 |[no, be] |2 |
|12847 |[cn] |1 |
|13192 |[cn, tk, cg, uk] |4 |
|13282 |[co, us, iq] |2 |
|13442 |[cn, pk, jo, us, ch]|5 |
|13610 |[be, ar, tj, no, ch]|5 |
|13772 |[be, at] |2 |
|13865 |[be, th, cn] |3 |
|14157 |[sg] |1 |
+-----------+--------------------+------+
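Since the surrounding thread is PySpark-oriented, here is a hedged PySpark translation of the same split-on-"co" idea (assuming Spark 2.4+ for the higher-order SQL functions, and a df with the passengerId/trip columns used above):
import pyspark.sql.functions as F

# Join the trip array into a string, split it on "co", drop empty pieces,
# count the codes in each remaining piece, and take the maximum run length.
result = df.withColumn(
    "n_trip",
    F.expr(
        "array_max(transform(filter(split(array_join(trip, ' '), 'co'), "
        "x -> trim(x) != ''), x -> size(split(trim(x), ' '))))"
    )
).withColumn("n_trip", F.coalesce(F.col("n_trip"), F.lit(0)))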
You could write a user-defined function (UDF) that would compute what you wish. There are plenty of ways to compute the longest sequence. One simple way is to split the sequence on "co", compute the size of each subsequence, and take the max.
val longuest_seq = udf((x: Seq[String]) => {
  x.reduce(_ + " " + _)
    .split(" *co *")
    .map(_.count(_ == ' ') + 1)
    .max
})
val df = Seq(
  (1, Array("x", "y", "co", "z")),
  (2, Array("x")),
  (3, Array("co", "t")),
  (4, Array("a", "b", "c", "d", "co", "e"))
).toDF("id", "trip")
df.withColumn("n_trips", longuest_seq('trip)).show
which yields
+---+-------------------+-------+
| id| trip|n_trips|
+---+-------------------+-------+
| 1| [x, y, co, z]| 2|
| 2| [x]| 1|
| 3| [co, t]| 1|
| 4|[a, b, c, d, co, e]| 4|
+---+-------------------+-------+

How to get L2 norm of an array type column in PySpark?

I have a PySpark dataframe.
df1 = spark.createDataFrame([
    ("u1", [0, 1, 2]),
    ("u1", [1, 2, 3]),
    ("u2", [2, 3, 4]),
], ['user_id', 'features'])
print(df1.printSchema())
df1.show(truncate=False)
Output-
root
|-- user_id: string (nullable = true)
|-- features: array (nullable = true)
| |-- element: long (containsNull = true)
None
+-------+---------+
|user_id|features |
+-------+---------+
|u1 |[0, 1, 2]|
|u1 |[1, 2, 3]|
|u2 |[2, 3, 4]|
+-------+---------+
I want to get the L2 norm of the features, so I wrote a UDF-
def norm_2_func(features):
    return features / np.linalg.norm(features, 2)

norm_2_udf = udf(norm_2_func, ArrayType(FloatType()))
df2 = df1.withColumn('l2_features', norm_2_udf(F.col('features')))
But it is throwing some errors. How can I achieve this?
The expected output is -
+-------+---------+----------------------+
|user_id|features | L2_norm|
+-------+---------+----------------------+
|u1 |[0, 1, 2]| [0.000, 0.447, 0.894]|
|u1 |[1, 2, 3]| [0.267, 0.534, 0.801]|
|u2 |[2, 3, 4]| [0.371, 0.557, 0.742]|
+-------+---------+----------------------+
NumPy arrays contain NumPy dtypes, which need to be cast to normal Python types (float/int etc.) before returning:
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, FloatType
def norm_2_func(features):
    return [float(i) for i in features / np.linalg.norm(features, 2)]
    # you can also use
    # return list(map(float, features / np.linalg.norm(features, 2)))

norm_2_udf = F.udf(norm_2_func, ArrayType(FloatType()))
df2 = df1.withColumn('l2_features', norm_2_udf(F.col('features')))
df2.show(truncate=False)
+-------+---------+-----------------------------------+
|user_id|features |l2_features |
+-------+---------+-----------------------------------+
|u1 |[0, 1, 2]|[0.0, 0.4472136, 0.8944272] |
|u1 |[1, 2, 3]|[0.26726124, 0.5345225, 0.80178374]|
|u2 |[2, 3, 4]|[0.37139067, 0.557086, 0.74278134] |
+-------+---------+-----------------------------------+
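As an aside, on Spark 2.4+ the same L2 normalisation can be done without a UDF by combining the aggregate and transform SQL functions; a minimal sketch:
import pyspark.sql.functions as F

# Sum of squares via aggregate(), then divide each element by its square root.
# (As with the UDF, an all-zero vector would divide by zero.)
df2 = df1.withColumn(
    "l2_features",
    F.expr("transform(features, x -> x / sqrt(aggregate(features, 0D, (acc, y) -> acc + y * y)))")
)
df2.show(truncate=False)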

Pyspark: Aggregate data by checking if a value exists or not (not count or sum)

I have a dataset like this,
test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (4, 1, 1, "2018-06-05", "Region C"),
    (5, 3, 2, "2018-06-03", "Region D"),
    (6, 1, 2, "2018-06-03", "Region A"),
    (7, 4, 4, "2018-06-03", "Region A"),
    (8, 4, 4, "2018-06-03", "Region B"),
    (9, 5, 4, "2018-06-03", "Region A"),
    (10, 5, 4, "2018-06-03", "Region B"),
])\
    .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()
And I can aggregate each customer's order for each region like this:
temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0)
temp_result.show()
Now, instead of sum or count, I'd like to simply aggregate the data by determining whether a value exists or not, i.e. a table of 0/1 flags instead of counts.
I can obtain the above result by
from pyspark.sql.functions import col, when

for field in temp_result.schema.fields:
    if str(field.name) not in ['customerid', "overall_count", "overall_amount"]:
        name = str(field.name)
        temp_result = temp_result.withColumn(name,
                                             when(col(name) >= 1, 1).otherwise(0))
but is there simpler way to obtain it?
You're basically almost there - only a little tweak required to get your desired result. Within your aggregation, add the count comparison and convert boolean to integer (if necessary at all):
temp_result = test.groupBy("customerid")\
.pivot("location")\
.agg((count("orderid")>0).cast("integer"))\
.na.fill(0)
temp_result.show()
Results into:
+----------+--------+--------+--------+--------+
|customerid|Region A|Region B|Region C|Region D|
+----------+--------+--------+--------+--------+
| 5| 1| 1| 0| 0|
| 1| 1| 1| 1| 0|
| 3| 0| 0| 0| 1|
| 2| 0| 1| 0| 0|
| 4| 1| 1| 0| 0|
+----------+--------+--------+--------+--------+
In case you get a spark error, you might use this solution instead which does the count comparison via an additional step:
temp_result = test.groupBy("customerId", "location")\
.agg(count("orderid").alias("count"))\
.withColumn("count", (col("count")>0).cast("integer"))\
.groupby("customerId")\
.pivot("location")\
.agg(sum("count")).na.fill(0)
temp_result.show()