Scala Spark DataFrame: sum a list of JSON values in a column

I have a Spark data frame as given below:

+---+---------+-------------------+
| id|     col1|               col2|
+---+---------+-------------------+
|  1|[{"a":1}]| [{"d": 3, "e": 4}]|
|  2|[{"a":2}]|[{"d": 5, "e": 10}]|
+---+---------+-------------------+
I want to obtain the following data frame:

+---+--------+
| id|col2_sum|
+---+--------+
|  1|       7|
|  2|      15|
+---+--------+
Datatypes:
id:StringType
col1:StringType
col2:StringType
Thanks in advance

Convert the JSON string into an array of maps using from_json, then use the aggregate higher-order function (Spark 2.4+) to sum the map values:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, """[{"a":1}]""", """[{"d": 3, "e": 4}]"""),
  (2, """[{"a":2}]""", """[{"d": 5, "e": 10}]""")
).toDF("id", "col1", "col2")

val df1 = (df
  // parse the JSON string into array<map<string,int>>
  .withColumn("col2", from_json(col("col2"), lit("array<map<string,int>>")))
  // collect every map's values into one flat array of ints
  .withColumn("col2", flatten(expr("transform(col2, x -> map_values(x))")))
  // sum the array elements
  .withColumn("col2_sum", expr("aggregate(col2, 0, (acc, x) -> acc + x)"))
  .drop("col1", "col2")
)
df1.show
//+---+--------+
//| id|col2_sum|
//+---+--------+
//|  1|       7|
//|  2|      15|
//+---+--------+
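If you would rather avoid the aggregate higher-order function, the same result can be reached by exploding the parsed maps and grouping back. This is only a sketch under the same schema assumption (Spark 2.4+ for from_json with a column schema); note that ids whose array is empty would drop out here:

import org.apache.spark.sql.functions._

val df2 = df
  // parse col2, then produce one row per map and one row per map value
  .withColumn("m", explode(from_json(col("col2"), lit("array<map<string,int>>"))))
  .select(col("id"), explode(map_values(col("m"))).as("v"))
  // sum the values back per id
  .groupBy("id")
  .agg(sum("v").as("col2_sum"))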

Zip array of structs with array of ints into array of structs column

I have a dataframe that looks like this:

val sourceData = Seq(
  Row(List(Row("a"), Row("b"), Row("c")), List(1, 2, 3)),
  Row(List(Row("d"), Row("e")), List(4, 5))
)
val sourceSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(StructField("structField", StringType))))),
  StructField("ints", ArrayType(IntegerType))
))
val sourceDF = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(sourceData), sourceSchema)
I want to transform it into a dataframe that looks like this:

val targetData = Seq(
  Row(List(Row("a", 1), Row("b", 2), Row("c", 3))),
  Row(List(Row("d", 4), Row("e", 5)))
)
val targetSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(
    StructField("structField", StringType),
    StructField("value", IntegerType)))))
))
val targetDF = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(targetData), targetSchema)
My best idea so far is to zip the two columns then run a UDF that puts the int value into the struct.
Is there an elegant way to do this, namely without UDFs?
Using the zip_with function (Spark 2.4+):

sourceDF.selectExpr(
  "zip_with(structs, ints, (x, y) -> (x.structField as structField, y as value)) as structs"
).show(false)
//+------------------------+
//|structs                 |
//+------------------------+
//|[[a, 1], [b, 2], [c, 3]]|
//|[[d, 4], [e, 5]]        |
//+------------------------+
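If you prefer to stay in the Column API, Spark 3.0+ also exposes zip_with in org.apache.spark.sql.functions; a sketch of the same transformation (assumes Spark 3.x):

import org.apache.spark.sql.functions._

sourceDF.select(
  // pair each struct with its int and rebuild a two-field struct
  zip_with(col("structs"), col("ints"), (x, y) =>
    struct(x.getField("structField").as("structField"), y.as("value"))
  ).as("structs")
).show(false)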
You can use the arrays_zip function to zip the structs and ints columns, then use the transform function on the zipped column to get the required output.
sourceDF.withColumn("structs", arrays_zip('structs, 'ints))
  .withColumn("structs",
    expr("transform(structs, s -> struct(s.structs.structField as structField, s.ints as value))"))
  .select("structs")
  .show(false)
+------------------------+
|structs                 |
+------------------------+
|[{a, 1}, {b, 2}, {c, 3}]|
|[{d, 4}, {e, 5}]        |
+------------------------+

Create a select with a struct within a list in PySpark

I have the following DataFrame view df_view:

+---+---+
| b | c |
+---+---+
| 1 | 3 |
+---+---+
I need to select this data to form a key with a list of structs:

{
  "a": [
    {
      "b": 1,
      "c": 3
    }
  ]
}
With the select below, it only creates a struct, not the list:
df = spark.sql(
    '''
    SELECT
      named_struct(
        'b', b,
        'c', c
      ) AS a
    FROM df_view
    '''
)
And after that I'll save it to the database:

(df.write
    .mode("overwrite")
    .format("com.microsoft.azure.cosmosdb.spark")
    .options(**cosmosConfig)
    .save())
How is it possible to create a struct inside a list in SQL?
You can wrap the struct in array():
df = spark.sql(
    '''
    SELECT
      array(named_struct(
        'b', b,
        'c', c
      )) AS a
    FROM df_view
    '''
)

In apache spark SQL, how to remove the duplicate rows when using collect_list in window function?

I have the below dataframe:
+----+-----+----+--------+
|year|month|item|quantity|
+----+-----+----+--------+
|2019|    1|  TV|       8|
|2019|    2|  AC|      10|
|2018|    1|  TV|       2|
|2018|    2|  AC|       3|
+----+-----+----+--------+
Using a window function, I want to get the below output:
val partitionWindow = Window.partitionBy("year").orderBy("month")
val itemsList = collect_list(struct("item", "quantity")).over(partitionWindow)
df.select(col("year"), itemsList.as("items"))
Expected output:
+----+-------------------+
|year|items              |
+----+-------------------+
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
But when I use the window function, I get duplicate rows for each item.
Current output:
+----+-------------------+
|year|items              |
+----+-------------------+
|2019|[[TV, 8]]          |
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2]]          |
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
What is the best way to remove the duplicate rows?
I believe the interesting part here is that the aggregated list of items has to be sorted by month. So I've written code for three approaches:
Creating a sample dataset:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

case class data(year: Int, month: Int, item: String, quantity: Int)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val inputDF = spark.createDataset(Seq(
  data(2018, 2, "AC", 3),
  data(2019, 2, "AC", 10),
  data(2019, 1, "TV", 2),
  data(2018, 1, "TV", 2)
)).toDF()
Approach 1: Aggregate month, item and quantity into a list, then sort the items by month using a UDF:
case class items(item: String, quantity: Int)

def getItemsSortedByMonth(itemsRows: Seq[Row]): Seq[items] = {
  if (itemsRows == null || itemsRows.isEmpty) {
    null
  } else {
    itemsRows.sortBy(r => r.getAs[Int]("month"))
      .map(r => items(r.getAs[String]("item"), r.getAs[Int]("quantity")))
  }
}

val itemsSortedByMonthUDF = udf(getItemsSortedByMonth(_: Seq[Row]))

val outputDF = inputDF.groupBy(col("year"))
  .agg(collect_list(struct("month", "item", "quantity")).as("items"))
  .withColumn("items", itemsSortedByMonthUDF(col("items")))
Approach 2: Using window functions
val monthWindowSpec = Window.partitionBy("year").orderBy("month")
val rowNumberWindowSpec = Window.partitionBy("year").orderBy("row_number")
val runningList = collect_list(struct("item", "quantity")).over(rowNumberWindowSpec)

val tempDF = inputDF
  // using row_number for continuous ranks if there are multiple items in the same month
  .withColumn("row_number", row_number().over(monthWindowSpec))
  .withColumn("items", runningList)
  .drop("month", "item", "quantity")
tempDF.persist()

// keep only the row with the complete list for each year
val yearToSelect = tempDF.groupBy("year").agg(max("row_number").as("row_number"))
val outputDF = tempDF.join(yearToSelect, Seq("year", "row_number")).drop("row_number")
Edit:
Added a third approach for posterity, using the Dataset API's groupByKey and mapGroups:
// encoding to the data case class can be avoided if inputDF is not converted to a Dataset of Row objects
val outputDF = inputDF.as[data].groupByKey(_.year).mapGroups { case (year, rows) =>
  val itemsSortedByMonth = rows.toSeq.sortBy(_.month).map(s => items(s.item, s.quantity))
  (year, itemsSortedByMonth)
}.toDF("year", "items")
Initially I was looking for an approach without a UDF. That was OK except for one aspect that I could not solve elegantly. With a simple map UDF it is extremely simple, and simpler than the other answers. So here it is, for posterity and a little late due to other commitments.
Try this...
import spark.implicits._
import org.apache.spark.sql.functions._

case class abc(year: Int, month: Int, item: String, quantity: Int)

val itemsList = collect_list(struct("month", "item", "quantity"))

val my_udf = udf { items: Seq[Row] =>
  val res = items.map { r => (r.getAs[String](1), r.getAs[Int](2)) }
  res
}

// Gen some data, however, not the thrust of the problem.
val df0 = Seq(abc(2019, 1, "TV", 8), abc(2019, 7, "AC", 10), abc(2018, 1, "TV", 2), abc(2018, 2, "AC", 3), abc(2019, 2, "CO", 7)).toDS()
val df1 = df0.toDF()

val df2 = df1.groupBy($"year")
  .agg(itemsList as "items")
  .withColumn("sortedCol", sort_array($"items", asc = true))
  .withColumn("sortedItems", my_udf(col("sortedCol")))
  .drop("items").drop("sortedCol")
  .orderBy($"year".desc)

df2.show(false)
df2.printSchema()
Noting the following, which you should fix:
ordering later is better, imho
mistakes in the data (fixed now)
ordering the month as a String is not a good idea; it needs converting to a month number
Returns:
+----+----------------------------+
|year|sortedItems                 |
+----+----------------------------+
|2019|[[TV, 8], [CO, 7], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]]          |
+----+----------------------------+
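For completeness, a UDF-free sketch of the same idea, using the inputDF from the sample dataset above (assumes Spark 2.4+ for transform): because month is the first field of the collected struct, sort_array orders the structs by month, and transform then strips month back out.

import org.apache.spark.sql.functions._

val sortedDF = inputDF
  .groupBy("year")
  .agg(sort_array(collect_list(struct("month", "item", "quantity"))).as("items"))
  // drop the month field from each struct after sorting
  .withColumn("items", expr("transform(items, x -> named_struct('item', x.item, 'quantity', x.quantity))"))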

Is there any method to find the number of columns having data in a PySpark data frame?

I have a PySpark data frame that has 7 columns. I have to add a new column named "sum" holding, for each row, the number of columns that have data (not null).
This sum can be calculated like this:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([
    (1, "a", "xxx", None, "abc", "xyz", "fgh"),
    (2, "b", None, 3, "abc", "xyz", "fgh"),
    (3, "c", "a23", None, None, "xyz", "fgh")
], ("ID", "flag", "col1", "col2", "col3", "col4", "col5"))

# cast each "is not null" flag to int and add the flags up with Python's builtin sum
df2 = df.withColumn("sum", sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns]))
df2.show()
+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
|  1|   a| xxx|null| abc| xyz| fgh|  6|
|  2|   b|null|   3| abc| xyz| fgh|  6|
|  3|   c| a23|null|null| xyz| fgh|  5|
+---+----+----+----+----+----+----+---+
Hope this helps!
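For reference, the same idea in Scala (a sketch, assuming df is an equivalent Scala DataFrame): cast each isNotNull flag to an int and add the per-column flags with reduce.

import org.apache.spark.sql.functions.col

// 1 for a non-null cell, 0 for a null one, summed across all columns
val withSum = df.withColumn(
  "sum",
  df.columns.map(c => col(c).isNotNull.cast("int")).reduce(_ + _)
)
withSum.show()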

SparkSQL: conditional sum using two columns

I hope you can help me with this.
I have a DF as follows:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
I would like to group by prodId and aggregate 'value', summing it over ranges of dates defined by the difference between the 'dateIns' and 'dateTrans' columns. In particular, I would like a way to define a conditional sum that adds up all values within a predefined maximum difference between those columns, i.e. all values that happened within 10, 20, or 30 days of dateIns ('dateTrans' - 'dateIns' <= 10, 20, 30).
Is there any predefined aggregate function in Spark that allows doing conditional sums? Do you recommend developing an aggregation UDF (if so, any suggestions)?
I'm using PySpark, but very happy to get Scala solutions as well. Thanks a lot!
Let's make your data a little bit more interesting so there are some events in the window:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-30", "2015-01-01", 100),
  (2, "a", "2014-12-21", "2015-01-02", 150),
  (3, "a", "2014-12-10", "2015-01-03", 120),
  (4, "b", "2014-12-05", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
What you need is more or less something like this:
import org.apache.spark.sql.functions.{col, datediff, lit, sum, when}

// Find the difference in tens of days
val diff = (datediff(col("dateTrans"), col("dateIns")) / 10)
  .cast("integer") * 10

val dfWithDiff = df.withColumn("diff", diff)

val aggregated = dfWithDiff
  .where((col("diff") < 30) && (col("diff") >= 0))
  .groupBy(col("prodId"), col("diff"))
  .agg(sum(col("value")))
And the results
aggregated.show
// +------+----+----------+
// |prodId|diff|sum(value)|
// +------+----+----------+
// |     a|  20|       120|
// |     b|  20|       100|
// |     a|   0|       100|
// |     a|  10|       150|
// +------+----+----------+
where diff is a lower bound for the range (0 -> [0, 10), 10 -> [10, 20), ...). This will work in PySpark as well if you remove val and adjust imports.
Edit (aggregate per column):
val exprs = Seq(0, 10, 20).map(x =>
  sum(when(col("diff") === lit(x), col("value")).otherwise(lit(0))).alias(x.toString)
)

dfWithDiff.groupBy(col("prodId")).agg(exprs.head, exprs.tail: _*).show
// +------+---+---+---+
// |prodId|  0| 10| 20|
// +------+---+---+---+
// |     a|100|150|120|
// |     b|  0|  0|100|
// +------+---+---+---+
with Python equivalent:
from pyspark.sql.functions import *

def make_col(x):
    cnd = when(col("diff") == lit(x), col("value")).otherwise(lit(0))
    return sum(cnd).alias(str(x))

exprs = [make_col(x) for x in range(0, 30, 10)]
dfWithDiff.groupBy(col("prodId")).agg(*exprs).show()
## +------+---+---+---+
## |prodId|  0| 10| 20|
## +------+---+---+---+
## |     a|100|150|120|
## |     b|  0|  0|100|
## +------+---+---+---+
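If the intent is cumulative buckets instead (everything within <= 10, <= 20, <= 30 days of dateIns, as the question's wording suggests), the same pattern works with an inequality. A sketch in Scala, with illustrative alias names:

import org.apache.spark.sql.functions.{col, datediff, lit, sum, when}

// one sum per threshold: a value is counted in every bucket whose limit covers it
val cumulativeExprs = Seq(10, 20, 30).map(x =>
  sum(when(datediff(col("dateTrans"), col("dateIns")) <= x, col("value")).otherwise(lit(0)))
    .alias(s"within_$x")
)
dfWithDiff.groupBy(col("prodId")).agg(cumulativeExprs.head, cumulativeExprs.tail: _*).show()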