Average of array column of two dataframes and find the maximum index in pyspark - dataframe

I want to Combine the column values of two dataframe after performing some operations to create a new dataframe in pyspark. The columns of each dataframe are vectors with integer values. The operations done are taking the average of each values in the vectors of the dataframe and finding the index of the maximum element of the new vectors created.
Dataframe1:
|id| |value1 |
|:.| |:......|
| 0| |[0,1,2]|
| 1| |[3,4,5]|
Dataframe2:
|id| |value2 |
|:.| |:......|
| 0| |[1,2,3]|
| 1| |[4,5,6]|
Dataframe3:
|value3 |
|:............|
|[0.5,1.5,2.5]|
|[3.5,4.5,5.5]|
Dataframe4:
|value4|
|:.....|
|2 |
|2 |
Dataframe3 is obtained by taking the average of each elements of each vectors of dataframe 1 and 2 i.e.: first vector of dataframe3 [0.5,1.5,2.5] is obtained by [0+1/2,1+2/2,2+3/2]. Dataframe4 is obtained by taking the index of maximum value of each vector.i.e; Take first vector of dataframe3[0.5,1.5,2.5] maximum value is 2.5 and it occurs at index 2 so first element in Dataframe4 is 2. How we can implement this in pyspark .
V1:
+--------------------------------------+---+
|p1 |id |
+--------------------------------------+---+
|[0.01426862, 0.010903089, 0.9748283] |0 |
|[0.068229124, 0.89613986, 0.035630997]|1 |
+--------------------------------------+---+
V2:
+-------------------------+---+
|p2 |id |
+-------------------------+---+
|[0.0, 0.0, 1.0] |0 |
|[2.8160464E-27, 1.0, 0.0]|1 |
+-------------------------+---+
when df3 = v1.join(v2,on="id") is used
df3=
this is what I get
+-------------------------------------+---------------+
|p1 |p2 |
+-------------------------------------+---------------+
|[0.02203844, 0.010056663, 0.9679049] |[0.0, 0.0, 1.0]|
|[0.039553806, 0.015186918, 0.9452593]|[0.0, 0.0, 1.0]|
+-------------------------------------+---------------+
and when
df3 = df3.withColumn( "p3", F.expr("transform(arrays_zip(p1, p2), x -> (x.p1 + x.p2) / 2)"),)
df4 = df3.withColumn("p4",F.expr("array_position(p3, array_max(p3))"))
were p3 is the average value .I get all values of df4 as zero

First, I recreate your test data :
a = [
[0, [0,1,2]],
[1, [3,4,5]],
]
b = ["id", "value1"]
df1 = spark.createDataFrame(a,b)
c = [
[0, [1,2,3]],
[1, [4,5,6]],
]
d = ["id", "value2"]
df2 = spark.createDataFrame(c,d)
then, I process the data :
join
df3 = df1.join(df2, on="id")
df3.show()
+---+---------+---------+
| id| value1| value2|
+---+---------+---------+
| 0|[0, 1, 2]|[1, 2, 3]|
| 1|[3, 4, 5]|[4, 5, 6]|
+---+---------+---------+
create the average array
from pyspark.sql import functions as F, types as T
#F.udf(T.ArrayType(T.FloatType()))
def avg_array(array1, array2):
return list(map(lambda x: (x[0] + x[1]) / 2, zip(array1, array2)))
df3 = df3.withColumn("value3", avg_array(F.col("value1"), F.col("value2")))
# OR without UDF
df3 = df3.withColumn(
"value3",
F.expr("transform(arrays_zip(value1, value2), x -> (x.value1 + x.value2) / 2)"),
)
df3.show()
+---+---------+---------+---------------+
| id| value1| value2| value3|
+---+---------+---------+---------------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]|
+---+---------+---------+---------------+
get the index (the array_position start at 1, you can do a -1 if necessary)
df4 = df3.withColumn("value4",F.expr("array_position(value3, array_max(value3))"))
df4.show()
+---+---------+---------+---------------+------+
| id| value1| value2| value3|value4|
+---+---------+---------+---------------+------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]| 3|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]| 3|
+---+---------+---------+---------------+------+

Related

is there a PySpark function that will merge data from a column for rows with same id?

I have the following dataframe:
+---+---+
| A | B |
+---+---+
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | f |
| 2 | g |
| 3 | j |
+---+---+
I need it to be in a df/rdd format
(1, [a, b, c])
(2, [f, g])
(3, [j])
I'm new to spark and was wondering if this operation can be performed by a single function
I tried using flatmap but I don't think I'm using it correctly
You can group by "A" and then use aggregate function for example collect_set or collect_array
import pyspark.sql.functions as F
df = [
{"A": 1, "B": "a"},
{"A": 1, "B": "b"},
{"A": 1, "B": "c"},
{"A": 2, "B": "f"},
{"A": 2, "B": "g"},
{"A": 3, "B": "j"}
]
df = spark.createDataFrame(df)
df.groupBy("A").agg(F.collect_set(F.col("B"))).show()
Output
+---+--------------+
| A|collect_set(B)|
+---+--------------+
| 1| [c, b, a]|
| 2| [g, f]|
| 3| [j]|
+---+--------------+
First step, create sample data.
#
# 1 - Create sample dataframe + view
#
# array of tuples - data
dat1 = [
(1, "a"),
(1, "b"),
(1, "c"),
(2, "f"),
(2, "g"),
(3, "j")
]
# array of names - columns
col1 = ["A", "B"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# make temp hive view
df1.createOrReplaceTempView("sample_data")
Second step, play around with temporary table.
%sql
select * from sample_data
%sql
select A, collect_list(B) as B_LIST from sample_data group by A
Last step, write code to execute Spark SQL to create dataframe that you want.
df2 = spark.sql("select A, collect_list(B) as B_LIST from sample_data group by A")
display(df2)
In summary, you can use the dataframe methods to create the same output. However, the Spark SQL looks clean and makes more sense.

How to convert 1 row 4 columns dataframe to 4 rows 2 columns dataframe in pyspark or sql

I have a dataframe which returns the output as
I would like to transpose this into
Can someone help to understand how to prepare the pyspark code to achieve this result dynamically. I have tried Unpivot in sql but no luck.
df =spark.createDataFrame([
(78,20,19,90),
],
('Machines', 'Books', 'Vehicles', 'Plants'))
Create a new array of struct column that combines column names and value names. Use the magic inline to explode the struct field. Code below
df.withColumn('tab', F.array(*[F.struct(lit(x).alias('Fields'), col(x).alias('Count')).alias(x) for x in df.columns])).selectExpr('inline(tab)').show()
+--------+-----+
| Fields|Count|
+--------+-----+
|Machines| 78|
| Books| 20|
|Vehicles| 19|
| Plants| 90|
+--------+-----+
As mentioned in unpivot-dataframe tutoral use:
df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
Or to generalise:
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
Full example:
df = spark.createDataFrame(data=[[78,20,19,90]], schema=['Machines','Books','Vehicles','Plants'])
# Hard coded
# df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
# Generalised
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
[Out]:
+--------+-----+
|Fields |Count|
+--------+-----+
|Machines|78 |
|Books |20 |
|Vehicles|19 |
|Plants |90 |
+--------+-----+

How do I create a new column has the count of all the row values that are greater than 0 in pyspark?

Suppose I have a pyspark data frame as:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called check where it counts the number of values that are greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I was trying this. But, it didn't help and errors out as below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.
Don't know if there is a simpler SQL based solution or not, but it's pretty straight forward with a udf.
count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
Not sure if it'll handle nulls. Add null check (if a and a > 0) in udf if needed.
Idea: https://stackoverflow.com/a/42540401/496289
Your code shows you doing a sum of non-zero columns, not count. If you need sum then
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
Create a new column array and filter the newly created column finally count the elements in the column.
Example:
df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1 |2 |-3 |
#|2 |null|5 |
#+----+----+----+
df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1 |2 |-3 |2 |
#|2 |null|5 |1 |
#+----+----+----+-----+
You can use functools.reduce to sum the list of columns from df.columns if > 0 like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce
df = spark.createDataFrame([
(1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])
df = df.withColumn(
"check",
reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#| 1| 2| -3| 2|
#| 2|null| 5| 2|
#| 4| 4| 8| 3|
#| 1| 0| 9| 2|
#+----+----+----+-----+

Aggregate while dropping duplicates in pyspark

I want to groupby aggregate a pyspark dataframe, while removing duplicates (keep last value) based on another column of this dataframe.
In summary, I would like to apply a dropDuplicates to a GroupedData object. So, for each group, I could keep only one row by some column, dynamically.
Example
The straight forward group aggregation, for the dataframe bellow, would be:
from pyspark.sql import functions
dataframe = spark.createDataFrame(
[
(1, "2020-01-01", 1, 1),
(2, "2020-01-01", 2, 1),
(3, "2020-01-02", 1, 1),
(2, "2020-01-02", 1, 1)
],
("id", "ts", "feature", "h3")
).withColumn("ts", functions.col("ts").cast("timestamp"))
# +---+-------------------+-------+---+
# | id| ts|feature| h3|
# +---+-------------------+-------+---+
# | 1|2020-01-01 00:00:00| 1| 1|
# | 2|2020-01-01 00:00:00| 2| 1|
# | 3|2020-01-02 00:00:00| 1| 1|
# | 2|2020-01-02 00:00:00| 1| 1|
# +---+-------------------+-------+---+
aggregated = dataframe.groupby("h3",
functions.window(
timeColumn="ts",
windowDuration="3 days",
slideDuration="1 day",
)
).agg(
functions.sum("feature")
)
aggregated.show(truncate=False)
resulting in the following dataframe:
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|5 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|5 |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|2 |
+---+------------------------------------------+------------+
The problem
I want the aggregation to use only the latest state of each id. In this case, id=2 have been updated to feature=1 at ts=2020-01-02 00:00:00, so all aggregations with base timestamp bigger than 2020-01-02 00:00:00 should use only this state for column feature when id=2. The expected aggregated dataframe is:
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|3 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|3 |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|2 |
+---+------------------------------------------+------------+
How can I do this with pyspark?
Update
I have assumed that a MapType variable should not have duplicate keys in Spark. With that assumption, I thought I could aggregate the column creating a map id -> feature and then just aggregate the map values with sum (or whatever the final aggregation should be).
So I did:
aggregated = dataframe.groupby("h3",
functions.window(
timeColumn="ts",
windowDuration="3 days",
slideDuration="1 day",
)
).agg(
functions.map_from_entries(
functions.collect_list(
functions.struct("id","feature")
)
).alias("id_feature")
)
aggregated.show(truncate=False)
But then I've found that maps can have duplicate keys:
+---+------------------------------------------+--------------------------------+
|h3 |window |id_feature |
+---+------------------------------------------+--------------------------------+
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|[1 -> 1, 2 -> 2, 3 -> 1, 2 -> 1]|
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|[1 -> 1, 2 -> 2, 3 -> 1, 2 -> 1]|
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|[1 -> 1, 2 -> 2] |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|[3 -> 1, 2 -> 1] |
+---+------------------------------------------+--------------------------------+
so it doesn't solve my problem. Instead, I just found another problem. When using the display function in a Databricks' notebook, it shows the MapType column without duplicated keys.
First, you can find the latest record for each id and time window and then join with the original dataframe with the latest records.
time_window = window(timeColumn="ts", windowDuration="3 days", slideDuration="1 day")
df2 = df.groupBy("h3", time_window, "id").agg(max("ts").alias("latest"))
df2.alias("a").join(df.alias("b"), (col("a.id") == col("b.id")) & (col("a.latest") == col("b.ts")), "left") \
.select("a.*", "feature") \
.groupBy("h3", "window") \
.agg(sum("feature")) \
.orderBy("window") \
.show(truncate=False)
Then, the result is the same as your expected one.
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-29 00:00:00, 2020-01-01 00:00:00]|3 |
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|3 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|2 |
+---+------------------------------------------+------------+
Since you are using Spark 2.4+, one way you can try is to use Spark SQL aggregate function, see below:
aggregated = dataframe.groupby("h3",
functions.window(
timeColumn="ts",
windowDuration="3 days",
slideDuration="1 day",
)
).agg(
functions.sort_array(functions.collect_list(
functions.struct("ts", "id", "feature")
), False).alias("id_feature")
)
I added ts field into the resulting array of structs from functions.collect_list. use functions.sort_array to sort the list by ts in descending order(to keep the latest record if duplicate exists). In the following aggregate function, we set the zero_value using a named_struct containing two fields: ids (MapType) to cache all processed id and total to do the sum only when the new id not exist in the cached ids.
aggregated.selectExpr("h3", "window", """
aggregate(
id_feature,
/* zero_value */
(map() as ids, 0L as total),
/* merge */
(acc, y) -> named_struct(
/* add y.id into the ids map */
'ids', map_concat(acc.ids, map(y.id,1)),
/* sum to total only when y.id doesn't exist in acc.ids map */
'total', acc.total + IF(acc.ids[y.id] is null,y.feature,0)
),
/* finish, take only acc.total, discard acc.ids map */
acc -> acc.total
) as id_features
""").show()
+---+--------------------+----------+
| h3| window|id_feature|
+---+--------------------+----------+
| 1|[2020-01-01 00:00...| 3|
| 1|[2019-12-31 00:00...| 3|
| 1|[2019-12-30 00:00...| 3|
| 1|[2020-01-02 00:00...| 2|
+---+--------------------+----------+

pyspark dataframe sum

I am trying to perform the following operation on pyspark.sql.dataframe
from pyspark.sql.functions import sum as spark_sum
df = spark.createDataFrame([
('a', 1.0, 1.0), ('a',1.0, 0.2), ('b', 1.0, 1.0),
('c' ,1.0, 0.5), ('d', 0.55, 1.0),('e', 1.0, 1.0)
])
>>> df.show()
+---+----+---+
| _1| _2| _3|
+---+----+---+
| a| 1.0|1.0|
| a| 1.0|0.2|
| b| 1.0|1.0|
| c| 1.0|0.5|
| d|0.55|1.0|
| e| 1.0|1.0|
+---+----+---+
Then, I am trying to do the following operation.
1) Select the rows when column df[_2] > df[_3]
2) For each row of selected from above, multiply df[_2] * df[_3], then take their sum
3) divide the result from above by the sum of column of df[_3]
Here is what I did:
>>> filter_df = df.where(df['_2'] > df['_3'])
>>> filter_df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a|1.0|0.2|
| c|1.0|0.5|
+---+---+---+
>>> result = spark_sum(filter_df['_2'] * filter_df['_3'])
/ spark_sum(filter_df['_3'])
>>> df.select(result).show()
+--------------------------+
|(sum((_2 * _3)) / sum(_3))|
+--------------------------+
| 0.9042553191489361|
+--------------------------+
But the answer should be (1.0 * 0.2 + 1.0 * 0.5) / (0.2+0.5) = 1.0
This is not correct. What??
It seems to me that such operation only taken on the original df, but not the filter_df. WTF?
You need to call it in filter_df.
>>> result = spark_sum(filter_df['_2'] * filter_df['_3'])
/ spark_sum(filter_df['_3'])
This is a transformation function which returns a column and gets applied on dataframe we apply it (lazy evaluation). Sum is an aggregate function and when called without any groups, it applies on whole dataset.
>>> filter_df.select(result).show()
+--------------------------+
|(sum((_2 * _3)) / sum(_3))|
+--------------------------+
| 1.0|
+--------------------------+