Perform groupBy with array of strings - Scala - dataframe

I have an array of strings, stored in a variable, that contains column names. I would like to group by those columns and get the count.
I am trying the code below, but it throws an error.
val keys = Array("Col1", "Col2")
val grouppedByDf = myDf.groupBy(keys.mkString(",").count
Can you please tell me what I am doing wrong here?

import org.apache.spark.sql.functions.col
import spark.implicits._
val df = Seq(("βήτα", "άλφα", 20), ("άλφα", "βήτα", 10), ("άλφα", "βήτα", 20), ("βήτα", "άλφα", 10)).toDF("α", "β", "ω")
val keys = Array("α", "β")
df
  .groupBy(keys.map(col(_)): _*) // expand the array into one Column argument per key
  .count()
  .show()
+----+----+-----+
|   α|   β|count|
+----+----+-----+
|βήτα|άλφα|    2|
|άλφα|βήτα|    2|
+----+----+-----+
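
For context on why the original attempt fails: keys.mkString(",") builds the single string "Col1,Col2", which groupBy then looks up as one column name, whereas the working version passes one argument per key. The plain-Python sketch below is only an analogy, not Spark code; the rows list is hypothetical sample data invented for illustration.

```python
from collections import Counter

# Hypothetical sample rows; Col1/Col2 are the key names from the question.
rows = [
    {"Col1": "a", "Col2": "x", "Col3": 20},
    {"Col1": "b", "Col2": "y", "Col3": 10},
    {"Col1": "b", "Col2": "y", "Col3": 20},
    {"Col1": "a", "Col2": "x", "Col3": 10},
]
keys = ["Col1", "Col2"]

# Joining the names gives ONE string that matches no column.
joined = ",".join(keys)
joined_is_a_column = joined in rows[0]

# Using each key separately gives one tuple of values per row,
# which is what grouping by several columns amounts to.
counts = Counter(tuple(row[k] for k in keys) for row in rows)
```

Here counts holds one entry per distinct (Col1, Col2) pair, mirroring the count column in the Spark output.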

Related

How to count hypotenuses with pandas udf, pyspark

I want to write a pandas UDF that takes two arguments, cathetus1 and cathetus2, from another DataFrame and returns the hypotenuse.
# This list holds the cathetus pairs.
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1",DoubleType(),True),StructField("cathetus2",DoubleType(),True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
# This creates a DataFrame that contains only the catheti.
This is the function I have written so far:
def pandaUdf(cat1, cat2):
    leg1 = []
    leg2 = []
    for i in data:
        x = 0
        leg1.append(i[x])
        leg2.append(i[x+1])
        hypoData.append(np.hypot(leg1,leg2))
    return np.hypot(leg1,leg2)
    #example_series = pd.Series(data)
I am trying to create a new column in df whose values are the hypotenuses:
df.withColumn(col('Hypo'), pandaUdf(example_df.cathetus1,example_df.cathetus2)).show()
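But this gives me the error: col should be Column.
I don't understand how to fix this error or why it occurs.
The error occurs because withColumn expects a column name (a plain string) as its first argument and a Column expression as its second; an ordinary Python function called on the columns does not return a Column. Wrap the function in a pandas UDF instead. You can then apply np.hypot to the two catheti directly, without extracting individual values.
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *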
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1",DoubleType(),True),StructField("cathetus2",DoubleType(),True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
"""
+---------+---------+
|cathetus1|cathetus2|
+---------+---------+
|      3.0|      4.0|
|      6.0|      8.0|
|      3.3|      5.6|
+---------+---------+
"""
def hypot(cat1: pd.Series, cat2: pd.Series) -> pd.Series:
    return np.hypot(cat1, cat2)
hypot_pandas_df = F.pandas_udf(hypot, returnType=FloatType())
df.withColumn("Hypo", hypot_pandas_df("cathetus1", "cathetus2")).show()
"""
+---------+---------+----+
|cathetus1|cathetus2|Hypo|
+---------+---------+----+
|      3.0|      4.0| 5.0|
|      6.0|      8.0|10.0|
|      3.3|      5.6| 6.5|
+---------+---------+----+
"""

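To make the computation itself concrete, here is a minimal plain-Python sketch of what the UDF calculates for each row, using the standard library's math.hypot as a stand-in for np.hypot. It is not Spark code, and hypot_rows is a hypothetical helper name used only for illustration.

```python
import math

# Same cathetus pairs as in the example DataFrame.
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]


def hypot_rows(pairs):
    """Return sqrt(a**2 + b**2) for each (cathetus1, cathetus2) pair."""
    return [math.hypot(a, b) for a, b in pairs]


hypos = hypot_rows(data)
```

The pandas UDF does the same thing, but receives whole batches of values as pandas Series rather than one pair at a time.
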
Recommendation - Creating a new dataframe with conditions

I've been studying Spark for a while, but today I got stuck. I'm working on a recommendation model using the Audioscrobbler dataset.
My model is based on ALS, and I have the following function for making recommendations:
def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
  val toRecommend = model.itemFactors.select($"id".as("artist")).withColumn("user", lit(userID))
  model.transform(toRecommend).
    select("artist", "prediction", "user").
    orderBy($"prediction".desc).
    limit(howMany)
}
It generates the expected output, but now I would like to create a new list of DataFrames using the Predictions DF and the User Data DF.
DataFrame Example
The new list of DataFrames should consist of the predicted value from "Predictions DF" and a "Listened" column that is 0 if the user has not listened to the artist and 1 if the user has, something like this:
Expected DF
I tried the following solution:
val recommendationsSeq = someUsers.map { userID =>
  // Gets the artists from user in testData
  val artistsOfUser = testData.where($"user".===(userID)).select("artist").rdd.map(r => r(0)).collect.toList
  // Recommendations for each user
  val recoms = makeRecommendations(model, userID, numRecom)
  // Insert a column listened with 1 if the artist is in the test set for the user and 0 otherwise
  val recomOutput = recoms.withColumn("listened", when($"artist".isin(artistsOfUser: _*), 1.0).otherwise(0.0)).drop("artist")
  (recomOutput)
}.toSeq
But it is very time-consuming when recommendations are needed for more than 30 users. I believe there is a better way to do it.
Could someone give me some ideas?
Thanks,
You can try joining the DataFrames, then groupBy and count:
scala> val df1 = Seq((1205,0.9873411,1000019)).toDF("artist","prediction","user")
scala> df1.show()
+------+----------+-------+
|artist|prediction|   user|
+------+----------+-------+
|  1205| 0.9873411|1000019|
+------+----------+-------+
scala> val df2 = Seq((1000019,1205,40)).toDF("user","artist","playcount")
scala> df2.show()
+-------+------+---------+
|   user|artist|playcount|
+-------+------+---------+
|1000019|  1205|       40|
+-------+------+---------+
scala> df1.join(df2,Seq("artist","user")).groupBy('prediction).count().show()
+----------+-----+
|prediction|count|
+----------+-----+
| 0.9873411|    1|
+----------+-----+
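
Note that the inner join above keeps only the (artist, user) pairs present in both DataFrames. If the goal is the 0/1 "listened" flag from the question, one option is to keep every prediction and mark whether a matching play exists, which is what a left join followed by a null check would give. The plain-Python sketch below illustrates only that logic; it is not Spark code, and the second prediction row is made up so that both outcomes appear.

```python
# Hypothetical stand-ins for the two DataFrames.
predictions = [  # (artist, prediction, user)
    (1205, 0.9873411, 1000019),
    (2000, 0.75, 1000019),  # made-up row: an artist this user never played
]
plays = [  # (user, artist, playcount)
    (1000019, 1205, 40),
]

# Build the lookup once for all users instead of once per user.
listened_pairs = {(user, artist) for user, artist, _ in plays}

# Keep every prediction and flag whether a matching play exists.
flagged = [
    (prediction, user, 1.0 if (user, artist) in listened_pairs else 0.0)
    for artist, prediction, user in predictions
]
```

The point of the design is that the listened data is consulted in a single pass for all users, rather than being collected separately for each user as in the question's loop.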

Update pyspark dataframe from a column having the target column values

I have a dataframe with a column ('target_column' in this case) that names another column, and I need to update the named column with the value in the 'val' column.
I have tried UDFs and .withColumn, but they all expect a fixed column name. In my case it varies per row. RDD map transformations did not work either, as RDDs are immutable.
from pyspark.sql import SparkSession

def test():
    data = [("jose_1", 'mase', "firstname", "jane"), ("li_1", "ken", 'lastname', 'keno'), ("liz_1", 'durn', 'firstname', 'liz')]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()

if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()
Input:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
|   jose_1|    mase|    firstname|jane|
|     li_1|     ken|     lastname|keno|
|    liz_1|    durn|    firstname| liz|
+---------+--------+-------------+----+
Expected output:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
|     jane|    mase|    firstname|jane|
|     li_1|    keno|     lastname|keno|
|      liz|    durn|    firstname| liz|
+---------+--------+-------------+----+
For example, in the first input row the target_column is 'firstname' and val is 'jane', so I need to update firstname to 'jane' in that row.
Thanks
You can loop over all your columns:
from pyspark.sql import functions as F
# df is the source DataFrame (source_df in the question)
for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
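
To see what the loop does to each row, here is a minimal plain-Python sketch of the same rule applied to dictionaries: a column takes the value of val only when target_column names it, and keeps its own value otherwise. This is an illustration rather than Spark code, and apply_target is a hypothetical helper name.

```python
# The three input rows from the question.
rows = [
    {"firstname": "jose_1", "lastname": "mase", "target_column": "firstname", "val": "jane"},
    {"firstname": "li_1", "lastname": "ken", "target_column": "lastname", "val": "keno"},
    {"firstname": "liz_1", "lastname": "durn", "target_column": "firstname", "val": "liz"},
]


def apply_target(row):
    """Copy row['val'] into whichever column row['target_column'] names."""
    return {
        name: (row["val"] if row["target_column"] == name else value)
        for name, value in row.items()
    }


updated = [apply_target(row) for row in rows]
```

The target_column and val columns pass through unchanged, because no row names either of them as its target.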

using numpy busday_count function in pyspark

Given two columns in PySpark that are not null:
df2 = sqlContext.sql("SELECT AssignedDate,primaryid from parts_so where AssignedDate is not null ")
df2 = df2.withColumn('sdate', df2.AssignedDate )
df2 = df2.withColumn('edate',current_timestamp() )
When I run
df2 = df2.withColumn('days_outstanding', lit(get_hours2(df2.sdate, df2.edate)))
with
def get_hours2(sdate, edate):
    biz_days = np.busday_count(sdate, edate)
    return biz_days
I get the following error:
object of type 'Column' has no len()
First of all let's create some random data
import datetime
import random
import pandas as pd
import numpy as np
sdate = [datetime.datetime.now() + datetime.timedelta(i) for i in range(5)]
edate = [date + datetime.timedelta(random.random()+3) for date in sdate]
data = {
    'sdate': sdate,
    'edate': edate
}
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)
df.show()
+--------------------+--------------------+
|               edate|               sdate|
+--------------------+--------------------+
|2019-12-06 22:55:...|2019-12-03 08:14:...|
|2019-12-07 19:42:...|2019-12-04 08:14:...|
|2019-12-08 21:26:...|2019-12-05 08:14:...|
|2019-12-09 18:57:...|2019-12-06 08:14:...|
|2019-12-11 04:08:...|2019-12-07 08:14:...|
+--------------------+--------------------+
You cannot use a bare Python function to create another column in PySpark; you have to wrap it in a UDF.
NOTE: Remember to cast the result of the computation to int, because otherwise you might run into a problem pickling the NumPy type.
import pyspark.sql.types as T
import pyspark.sql.functions as F

@F.udf(returnType=T.IntegerType())
def get_hours2(sdate, edate):
    biz_days = np.busday_count(sdate, edate)
    return int(biz_days)

Finally, we can use the UDF on the created DataFrame.
df = df.withColumn('days_outstanding', F.lit(get_hours2('sdate', 'edate')))
df.show()
+--------------------+--------------------+----------------+
|               edate|               sdate|days_outstanding|
+--------------------+--------------------+----------------+
|2019-12-06 22:55:...|2019-12-03 08:14:...|               3|
|2019-12-07 19:42:...|2019-12-04 08:14:...|               3|
|2019-12-08 21:26:...|2019-12-05 08:14:...|               2|
|2019-12-09 18:57:...|2019-12-06 08:14:...|               1|
|2019-12-11 04:08:...|2019-12-07 08:14:...|               2|
+--------------------+--------------------+----------------+
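
As a cross-check on the numbers above, here is a minimal standard-library sketch of the count that np.busday_count performs with its default settings: weekdays (Monday to Friday) in the half-open range from the start date up to, but not including, the end date. It is a stand-in rather than NumPy itself: it ignores holidays and returns 0 when the end precedes the start, whereas NumPy returns a negative count. The dates are the ones shown in the table, truncated to whole days (an assumption, since the displayed timestamps are cut off), and busday_count here is a local hypothetical function.

```python
from datetime import date, timedelta


def busday_count(start, end):
    """Count Monday-Friday dates in the half-open range [start, end)."""
    days = (end - start).days
    return sum(
        1
        for offset in range(days)
        if (start + timedelta(days=offset)).weekday() < 5  # 0-4 is Mon-Fri
    )


# (sdate, edate) pairs taken from the table, truncated to dates.
pairs = [
    (date(2019, 12, 3), date(2019, 12, 6)),
    (date(2019, 12, 4), date(2019, 12, 7)),
    (date(2019, 12, 5), date(2019, 12, 8)),
    (date(2019, 12, 6), date(2019, 12, 9)),
    (date(2019, 12, 7), date(2019, 12, 11)),
]
counts = [busday_count(s, e) for s, e in pairs]
```

The resulting counts agree with the days_outstanding column shown above.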
I hope this helps you.

How to sort array of struct type in Spark DataFrame by particular field?

Given the following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App {
  val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
  import spark.implicits._
  case class ABC(a: Int, b: Int, c: Int)
  val first = Seq(
    ABC(1, 2, 3),
    ABC(1, 3, 4),
    ABC(2, 4, 5),
    ABC(2, 5, 6)
  ).toDF("a", "b", "c")
  val second = Seq(
    (1, 2, (Date.valueOf("2018-01-02"), 30)),
    (1, 3, (Date.valueOf("2018-01-01"), 20)),
    (2, 4, (Date.valueOf("2018-01-02"), 50)),
    (2, 5, (Date.valueOf("2018-01-01"), 60))
  ).toDF("a", "b", "c")
  first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b")).groupBy("a").agg(sort_array(collect_list("c2")))
    .show(false)
}
Spark produces the following result:
+---+----------------------------------+
|a |sort_array(collect_list(c2), true)|
+---+----------------------------------+
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
+---+----------------------------------+
This implies that Spark sorts the array by date (since it is the first field), but I want to tell Spark to sort by a specific field of the nested struct.
I know I can reshape the array to (value, date), but that seems inconvenient. I want a general solution (imagine a big nested struct, 5 layers deep, that I want to sort by a particular column).
Is there a way to do that? Am I missing something?
According to the Hive Wiki:
sort_array(Array<T>) : Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
This means that the array will be sorted lexicographically which holds true even with complex data types.
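
The lexicographic ordering described above can be illustrated with plain Python tuples, which compare the same way: the first field decides, and later fields matter only on ties. This is an analogy rather than Spark code; the pairs list reuses the (date, value) structs collected for a = 2 in the question.

```python
from datetime import date

# The (date, value) structs collected for a = 2 in the question.
pairs = [(date(2018, 1, 2), 50), (date(2018, 1, 1), 60)]

# Natural ordering compares the first field (the date) first.
by_natural_order = sorted(pairs)

# Sorting on the second field has to be requested explicitly.
by_value = sorted(pairs, key=lambda pair: pair[1])
```

The first result matches what sort_array produced for that group, and the second is the ordering the question asks for.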
Alternatively, you can create a UDF that sorts by the second element (at the cost of some performance):
import org.apache.spark.sql.Row
val sortUdf = udf { (xs: Seq[Row]) =>
  xs.sortBy(_.getAs[Int](1))
    .map { case Row(x: java.sql.Date, y: Int) => (x, y) }
}
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a")
  .agg(sortUdf(collect_list("c2")))
  .show(false)
//+---+----------------------------------+
//|a |UDF(collect_list(c2, 0, 0)) |
//+---+----------------------------------+
//|1 |[[2018-01-01,20], [2018-01-02,30]]|
//|2 |[[2018-01-02,50], [2018-01-01,60]]|
//+---+----------------------------------+
For Spark 3+, you can pass a custom comparator function to array_sort:
The comparator will take two arguments representing two elements of
the array. It returns -1, 0, or 1 as the first element is less than,
equal to, or greater than the second element. If the comparator
function returns other values (including null), the function will fail
and raise an error.
val df = first
  .join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a")
  .agg(collect_list("c2").alias("list"))
val df2 = df.withColumn(
  "list",
  expr(
    "array_sort(list, (left, right) -> case when left._2 < right._2 then -1 when left._2 > right._2 then 1 else 0 end)"
  )
)
df2.show(false)
//+---+------------------------------------+
//|a |list |
//+---+------------------------------------+
//|1 |[[2018-01-01, 20], [2018-01-02, 30]]|
//|2 |[[2018-01-02, 50], [2018-01-01, 60]]|
//+---+------------------------------------+
Here _2 is the name of the struct field you want to sort by.
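
The comparator contract quoted above (return -1, 0, or 1) can be sketched in plain Python with functools.cmp_to_key. This is an analogy to the SQL lambda, not Spark code; by_second_field is a hypothetical name, and index 1 plays the role of the _2 field.

```python
from datetime import date
from functools import cmp_to_key


def by_second_field(left, right):
    """Return -1, 0, or 1 by comparing only the second field."""
    if left[1] < right[1]:
        return -1
    if left[1] > right[1]:
        return 1
    return 0


# The (date, value) structs collected for a = 2 in the question.
group = [(date(2018, 1, 1), 60), (date(2018, 1, 2), 50)]
sorted_group = sorted(group, key=cmp_to_key(by_second_field))
```

The struct with value 50 now comes first, matching the row for a = 2 in the output above.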
If you have complex objects, it is much better to use a statically typed Dataset.
case class Result(a: Int, b: Int, c: Int, c2: (java.sql.Date, Int))
val joined = first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
joined.as[Result]
  .groupByKey(_.a)
  .mapGroups((key, xs) => (key, xs.map(_.c2).toSeq.sortBy(_._2)))
  .show(false)
// +---+----------------------------------+
// |_1 |_2 |
// +---+----------------------------------+
// |1 |[[2018-01-01,20], [2018-01-02,30]]|
// |2 |[[2018-01-02,50], [2018-01-01,60]]|
// +---+----------------------------------+
In simple cases it is also possible to use a UDF, but in general this leads to inefficient and fragile code that quickly gets out of control as the complexity of the objects grows.