How do I get a SQL row_number equivalent for a Spark RDD?

I need to generate a full list of row_numbers for a data table with many columns.
In SQL, this would look like this:
select
key_value,
col1,
col2,
col3,
row_number() over (partition by key_value order by col1, col2 desc, col3)
from
temp
;
Now, let's say in Spark I have an RDD of the form (K, V), where V=(col1, col2, col3), so my entries are like
(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))
etc.
I want to order these using commands like sortBy(), sortWith(), sortByKey(), zipWithIndex, etc. and end up with a new RDD that has the correct row_number:
(key1, (1,2,3), 2)
(key1, (1,4,7), 1)
(key1, (2,2,3), 3)
(key2, (5,5,5), 1)
(key2, (5,5,9), 2)
(key2, (7,5,5), 3)
etc.
(I don't care about the parentheses, so the form can also be (K, (col1,col2,col3,rownum)) instead)
How do I do this?
Here's my first attempt:
val sample_data = Seq(((3,4),5,5,5),((3,4),5,5,9),((3,4),7,5,5),((1,2),1,2,3),((1,2),1,4,7),((1,2),2,2,3))
val temp1 = sc.parallelize(sample_data)
temp1.collect().foreach(println)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
// ((1,2),1,2,3)
// ((1,2),1,4,7)
// ((1,2),2,2,3)
temp1.map(x => (x, 1)).sortByKey().zipWithIndex.collect().foreach(println)
// ((((1,2),1,2,3),1),0)
// ((((1,2),1,4,7),1),1)
// ((((1,2),2,2,3),1),2)
// ((((3,4),5,5,5),1),3)
// ((((3,4),5,5,9),1),4)
// ((((3,4),7,5,5),1),5)
// note that this isn't ordering with a partition on key value K!
val temp2 = temp1.???
Also note that sortBy cannot be applied directly to the RDD here (at least not in the Spark version I'm on); one has to run collect() first, and then the output isn't an RDD any more, but an array:
temp1.collect().sortBy(a => a._2 -> -a._3 -> a._4).foreach(println)
// ((1,2),1,4,7)
// ((1,2),1,2,3)
// ((1,2),2,2,3)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
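(As an aside, on newer Spark versions sortBy does seem to work directly on an RDD, which avoids the collect() detour, though it still doesn't partition by key:)
// global sort on the RDD itself; returns an RDD, but the numbering is still not per key
temp1.sortBy(a => (a._2, -a._3, a._4)).zipWithIndex.collect().foreach(println)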
Here's a little more progress, but still not partitioned:
val temp2 = sc.parallelize(temp1.map(a => (a._1,(a._2, a._3, a._4))).collect().sortBy(a => a._2._1 -> -a._2._2 -> a._2._3)).zipWithIndex.map(a => (a._1._1, a._1._2._1, a._1._2._2, a._1._2._3, a._2 + 1))
temp2.collect().foreach(println)
// ((1,2),1,4,7,1)
// ((1,2),1,2,3,2)
// ((1,2),2,2,3,3)
// ((3,4),5,5,5,4)
// ((3,4),5,5,9,5)
// ((3,4),7,5,5,6)

The row_number() over (partition by ... order by ...) functionality was added to Spark 1.4. This answer uses PySpark/DataFrames.
Create a test DataFrame:
from pyspark.sql import Row, functions as F
testDF = sc.parallelize(
(Row(k="key1", v=(1,2,3)),
Row(k="key1", v=(1,4,7)),
Row(k="key1", v=(2,2,3)),
Row(k="key2", v=(5,5,5)),
Row(k="key2", v=(5,5,9)),
Row(k="key2", v=(7,5,5))
)
).toDF()
Add the partitioned row number:
from pyspark.sql.window import Window
(testDF
.select("k", "v",
F.rowNumber()  # renamed to F.row_number() in Spark >= 1.6
.over(Window
.partitionBy("k")
.orderBy("k")  # ordering by the partition key just for the demo; use your real ordering columns
)
.alias("rowNum")
)
.show()
)
+----+-------+------+
| k| v|rowNum|
+----+-------+------+
|key1|[1,2,3]| 1|
|key1|[1,4,7]| 2|
|key1|[2,2,3]| 3|
|key2|[5,5,5]| 1|
|key2|[5,5,9]| 2|
|key2|[7,5,5]| 3|
+----+-------+------+
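For reference, the Scala DataFrame equivalent is essentially the same; a sketch, assuming a Scala DataFrame with the same k/v columns (also called testDF here), and noting that in Spark 1.4/1.5 the Scala function was named rowNumber, while from 1.6 onwards it is row_number:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// same window spec as the PySpark version above; in practice you would
// orderBy your real ordering columns rather than the partition key
val w = Window.partitionBy("k").orderBy("k")
testDF.withColumn("rowNum", row_number().over(w)).show()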

This is an interesting problem you're bringing up. I will answer it in Python but I'm sure you will be able to translate seamlessly to Scala.
Here is how I would tackle it:
1- Simplify your data:
temp2 = temp1.map(lambda x: (x[0],(x[1],x[2],x[3])))
temp2 is now a "real" key-value pair RDD. It looks like this:
[
((3, 4), (5, 5, 5)),
((3, 4), (5, 5, 9)),
((3, 4), (7, 5, 5)),
((1, 2), (1, 2, 3)),
((1, 2), (1, 4, 7)),
((1, 2), (2, 2, 3))
]
2- Then, use the group-by function to reproduce the effect of the PARTITION BY:
temp3 = temp2.groupByKey()
temp3 is now an RDD with 2 rows:
[((1, 2), <pyspark.resultiterable.ResultIterable object at 0x15e08d0>),
((3, 4), <pyspark.resultiterable.ResultIterable object at 0x15e0290>)]
3- Now, you need to apply a rank function for each value of the RDD. In Python, I would use the built-in sorted function (enumerate will create your row_number column):
temp4 = temp3.flatMap(lambda x: [(x[0], (v, i)) for i, v in enumerate(sorted(x[1]))])
temp4.take(10)
Note that to implement your particular order, you would need to feed the right "key" argument to sorted (in Python, I would just create a lambda like this one):
lambda t: (t[0], -t[1], t[2])
At the end (without the key function) it looks like this; note that enumerate numbers from 0, so add 1 if you want SQL-style 1-based row numbers:
[
((1, 2), ((1, 2, 3), 0)),
((1, 2), ((1, 4, 7), 1)),
((1, 2), ((2, 2, 3), 2)),
((3, 4), ((5, 5, 5), 0)),
((3, 4), ((5, 5, 9), 1)),
((3, 4), ((7, 5, 5), 2))
]
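Since the question is in Scala, here is a rough translation of steps 1-3 (just a sketch; the same caveat applies that each key's group has to fit in memory):
// 1- key the rows, 2- groupByKey for the PARTITION BY, 3- sort and enumerate within each group
val temp2 = temp1.map(x => (x._1, (x._2, x._3, x._4)))
val temp4 = temp2.groupByKey().flatMap { case (k, vs) =>
  vs.toSeq
    .sortBy { case (c1, c2, c3) => (c1, -c2, c3) } // col1 asc, col2 desc, col3 asc
    .zipWithIndex
    .map { case (v, i) => (k, (v, i)) }
}
temp4.take(10).foreach(println)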
Hope that helps!
Good luck.

val test = Seq(("key1", (1,2,3)),("key1",(4,5,6)), ("key2", (7,8,9)), ("key2", (0,1,2)))
test: Seq[(String, (Int, Int, Int))] = List((key1,(1,2,3)), (key1,(4,5,6)), (key2,(7,8,9)), (key2,(0,1,2)))
test.foreach(println)
(key1,(1,2,3))
(key1,(4,5,6))
(key2,(7,8,9))
(key2,(0,1,2))
val rdd = sc.parallelize(test, 2)
rdd: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ParallelCollectionRDD[41] at parallelize at <console>:26
val rdd1 = rdd.groupByKey.map(x => (x._1,x._2.toArray)).map(x => (x._1, x._2.sortBy(x => x._1).zipWithIndex))
rdd1: org.apache.spark.rdd.RDD[(String, Array[((Int, Int, Int), Int)])] = MapPartitionsRDD[44] at map at <console>:25
val rdd2 = rdd1.flatMap{
elem =>
val key = elem._1
elem._2.map(row => (key, row._1, row._2))
}
rdd2: org.apache.spark.rdd.RDD[(String, (Int, Int, Int), Int)] = MapPartitionsRDD[45] at flatMap at <console>:25
rdd2.collect.foreach(println)
(key1,(1,2,3),0)
(key1,(4,5,6),1)
(key2,(0,1,2),0)
(key2,(7,8,9),1)
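To match the ordering from the question (col1 asc, col2 desc, col3 asc) and get a 1-based number, only the sortBy key and the index need to change, for example (rdd1b is just an illustrative name):
// same pipeline, different sort key, and +1 for a 1-based row number
val rdd1b = rdd.groupByKey.map(x => (x._1,
  x._2.toArray.sortBy(v => (v._1, -v._2, v._3)).zipWithIndex.map { case (v, i) => (v, i + 1) }))
rdd1b can then be flattened with the same flatMap as rdd1 above.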

Using Spark SQL, read the data files:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

val df = spark.read.json("s3://s3bucket/key/activity/year=2018/month=12/date=15/*");
The files above have the fields user_id, pageviews and clicks.
Generate the activity id (row_number) partitioned by user_id and ordered by clicks:
val output = df.withColumn("activity_id", functions.row_number().over(Window.partitionBy("user_id").orderBy("clicks")).cast(DataTypes.IntegerType));
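Equivalently, a small sketch of the same thing with the original SQL syntax through a temp view (assuming the same df; output2 is just an illustrative name):
df.createOrReplaceTempView("activity");
val output2 = spark.sql("select user_id, pageviews, clicks, row_number() over (partition by user_id order by clicks) as activity_id from activity");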

Related

Zip array of structs with array of ints into array of structs column

I have a dataframe that looks like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val sourceData = Seq(
  Row(List(Row("a"), Row("b"), Row("c")), List(1, 2, 3)),
  Row(List(Row("d"), Row("e")), List(4, 5))
)
val sourceSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(StructField("structField", StringType))))),
  StructField("ints", ArrayType(IntegerType))
))
// createDataFrame needs an RDD[Row] (or a java.util.List[Row]) plus the schema
val sourceDF = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(sourceData), sourceSchema)
I want to transform it into a dataframe that looks like this:
val targetData = Seq(
Row(List(Row("a", 1), Row("b", 2), Row("c", 3))),
Row(List(Row("d", 4), Row("e", 5)))
)
val targetSchema = StructType(List(
StructField("structs", ArrayType(StructType(List(
StructField("structField", StringType),
StructField("value", IntegerType)))))
))
val targetDF = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(targetData), targetSchema)
My best idea so far is to zip the two columns then run a UDF that puts the int value into the struct.
Is there an elegant way to do this, namely without UDFs?
Using zip_with function:
sourceDF.selectExpr(
"zip_with(structs, ints, (x, y) -> (x.structField as structField, y as value)) as structs"
).show(false)
//+------------------------+
//|structs |
//+------------------------+
//|[[a, 1], [b, 2], [c, 3]]|
//|[[d, 4], [e, 5]] |
//+------------------------+
You can use the arrays_zip function to zip the structs and ints columns, then use the transform function on the zipped column to get the required output.
sourceDF.withColumn("structs", arrays_zip('structs, 'ints))
.withColumn("structs",
expr("transform(structs, s-> struct(s.structs.structField as structField, s.ints as value))"))
.select("structs")
.show(false)
+------------------------+
|structs |
+------------------------+
|[{a, 1}, {b, 2}, {c, 3}]|
|[{d, 4}, {e, 5}] |
+------------------------+

Array Pair Loading from databases using Qlik Sense

Does anyone has experience how to load/prepare data:
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
taken from an SQL database (stored there as a value) into a Qlik Sense table:
ID, Value
1, a
2, b
3, c
4, d
Check out the annotated script below.
After it runs, the result table will contain the ID and Value columns shown in the question.
set vSQLData = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')];
SQLData:
Load
// at this point the data will look like: "1, a", "2, b"
// will split the string on "," and will
// get the first value as ID
// and the second one as Value
SubField(TempField2, ',', 1) as ID,
SubField(TempField2, ',', 2) as Value
;
Load
// split the string by ")," and generate N number of rows
// then for each row remove "(", ")" and "'" characters
PurgeChar(SubField(TempField1, '),'), '''()''') as TempField2
;
Load
// remove "[" and "]" characters
PurgeChar('$(vSQLData)', '[]') as TempField1
AutoGenerate(1)
;

Split a list into three in Kotlin

I have the following data class:
data class Foo(val a: Int = 0, val b: Int = 0)
I have a list of Foo's with the following structure:
[ Foo(a = 1), Foo(a = 2), ..., Foo(b = 22), Foo(a = 5), Foo(a = 6), ... ]
(a group of items with a's, then one b, then a's again)
I would like to split above list into three sub-lists like that:
[ Foo(a = 1), Foo(a = 2), ...]
[ Foo(b = 22) ]
[ Foo(a = 5), Foo(a = 6), ...]
sublist of elements that have non-zero a property
list of one element that has non-zero b
remaining sublist of elements that have non-zero a property
Is it possible to achieve using groupBy or partition?
It is not possible to do it via groupBy or partition, because those operations cannot look back at earlier elements' state. However, you can do it via a fold operation using mutable lists. Not sure if it fits your needs, but here it goes:
val input = listOf(Foo(a = 1), Foo(a = 2), Foo(b = 22), Foo(a = 5), Foo(a = 6))
val output: List<List<Foo>> = input.fold(mutableListOf<MutableList<Foo>>(mutableListOf())) { acc, foo ->
val lastList = acc.last()
val appendToTheLastList =
lastList.isEmpty() ||
(foo.a != 0 && lastList.last().a != 0) ||
(foo.b != 0 && lastList.last().b != 0)
when {
appendToTheLastList -> lastList.add(foo)
else -> acc.add(mutableListOf(foo))
}
return@fold acc
}
println(output)
outputs:
[[Foo(a=1, b=0), Foo(a=2, b=0)], [Foo(a=0, b=22)], [Foo(a=5, b=0), Foo(a=6, b=0)]]
Note: I have to point out that this solution is not better than a solution with regular loops.
So you want to 1) ignore the first Foos where a=0, 2) start collecting them when you see Foos where a is non-zero, 3) when you hit a Foo where a=0, put that in another list, because b will be non-zero, 4) start collecting non-zero a's again in a third list?
If that's what you want (this is an extremely specific thing you want and you haven't been clear about it at all) you could do it this way:
data class Foo(val a: Int, val b: Int)
val stuff = listOf(Foo(0,1), Foo(1,2), Foo(3,0), Foo(0, 4), Foo(0, 5), Foo(6, 1), Foo(0,7))
fun main(args: Array<String>) {
fun aIsZero(foo: Foo) = foo.a == 0
// ignore initial zero a's if there are any
with(stuff.dropWhile(::aIsZero)) {
val bIndex = indexOfFirst(::aIsZero)
val listOne = take(bIndex)
val listTwo = listOf(elementAt(bIndex))
val listThree = drop(bIndex+1).filterNot(::aIsZero)
listOf(listOne, listTwo, listThree).forEach(::println)
}
}
You can't use partition or groupBy because your predicate depends on the value of a, but also on whether it happens to represent that one element you want to put in the b list, and for the others whether they appear before or after that b element. Which you don't know before you start processing the list.
You could mess around with indices and stuff, but honestly your use case seems so specific that it's probably better to just do it imperatively instead of trying to cram it into a functional approach.

In apache spark SQL, how to remove the duplicate rows when using collect_list in window function?

I have the dataframe below:
+----+-----+----+--------+
|year|month|item|quantity|
+----+-----+----+--------+
|2019|1 |TV |8 |
|2019|2 |AC |10 |
|2018|1 |TV |2 |
|2018|2 |AC |3 |
+----+-----+----+--------+
By using a window function I wanted to get the output below:
val partitionWindow = Window.partitionBy("year").orderBy("month")
val itemsList = collect_list(struct("item", "quantity")).over(partitionWindow)
df.select($"year", itemsList as "items")
Expected output:
+----+-------------------+
|year|items |
+----+-------------------+
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
But when I use the window function, there are duplicate rows for each item.
Current output:
+----+-------------------+
|year|items |
+----+-------------------+
|2019|[[TV, 8]] |
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2]] |
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
I wanted to know the best way to remove the duplicate rows.
I believe the interesting part here is that the aggregated list of items has to be sorted by month. So I've written code for three approaches:
Creating a sample dataset:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class data(year : Int, month : Int, item : String, quantity : Int)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val inputDF = spark.createDataset(Seq(
data(2018, 2, "AC", 3),
data(2019, 2, "AC", 10),
data(2019, 1, "TV", 2),
data(2018, 1, "TV", 2)
)).toDF()
Approach 1: aggregate month, item and quantity into a list, then sort the items by month using a UDF:
case class items(item : String, quantity : Int)
def getItemsSortedByMonth(itemsRows : Seq[Row]) : Seq[items] = {
if (itemsRows == null || itemsRows.isEmpty) {
null
}
else {
itemsRows.sortBy(r => r.getAs[Int]("month"))
.map(r => items(r.getAs[String]("item"), r.getAs[Int]("quantity")))
}
}
val itemsSortedByMonthUDF = udf(getItemsSortedByMonth(_: Seq[Row]))
val outputDF = inputDF.groupBy(col("year"))
.agg(collect_list(struct("month", "item", "quantity")).as("items"))
.withColumn("items", itemsSortedByMonthUDF(col("items")))
Approach 2: using window functions
val monthWindowSpec = Window.partitionBy("year").orderBy("month")
val rowNumberWindowSpec = Window.partitionBy("year").orderBy("row_number")
val runningList = collect_list(struct("item", "quantity")). over(rowNumberWindowSpec)
val tempDF = inputDF
// using row_number for continuous ranks if there are multiple items in the same month
.withColumn("row_number", row_number().over(monthWindowSpec))
.withColumn("items", runningList)
.drop("month", "item", "quantity")
tempDF.persist()
val yearToSelect = tempDF.groupBy("year").agg(max("row_number").as("row_number"))
val outputDF = tempDF.join(yearToSelect, Seq("year", "row_number")).drop("row_number")
Edit:
Added a third approach, for posterity, using the Dataset API's groupByKey and mapGroups:
// encoding to the data class can be avoided if inputDF is not converted to a Dataset of Row objects
val outputDF = inputDF.as[data].groupByKey(_.year).mapGroups{ case (year, rows) =>
val itemsSortedByMonth = rows.toSeq.sortBy(_.month).map(s => items(s.item, s.quantity))
(year, itemsSortedByMonth)
}.toDF("year", "items")
Initially I was looking for an approach without a UDF. That was OK except for one aspect that I could not solve elegantly. With a simple map UDF it becomes very simple, simpler than the other answers. So, for posterity, and a little later than planned due to other commitments:
Try this...
import spark.implicits._
import org.apache.spark.sql.functions._
case class abc(year: Int, month: Int, item: String, quantity: Int)
val itemsList= collect_list(struct("month", "item", "quantity"))
val my_udf = udf { items: Seq[Row] =>
// each Row is the (month, item, quantity) struct; keep only (item, quantity)
val res = items.map { r => (r.getAs[String](1), r.getAs[Int](2)) }
res
}
// Gen some data, however, not the thrust of the problem.
val df0 = Seq(abc(2019, 1, "TV", 8), abc(2019, 7, "AC", 10), abc(2018, 1, "TV", 2), abc(2018, 2, "AC", 3), abc(2019, 2, "CO", 7)).toDS()
val df1 = df0.toDF()
val df2 = df1.groupBy($"year")
.agg(itemsList as "items")
.withColumn("sortedCol", sort_array($"items", asc = true))
.withColumn("sortedItems", my_udf(col("sortedCol") ))
.drop("items").drop("sortedCol")
.orderBy($"year".desc)
df2.show(false)
df2.printSchema()
A few things worth noting / fixing:
applying the orderBy at the end is better imho
there were mistakes in the sample data (fixed now)
ordering the month as a String is not a good idea; convert it to a month number
Returns:
+----+----------------------------+
|year|sortedItems |
+----+----------------------------+
|2019|[[TV, 8], [CO, 7], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+----------------------------+
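For what it's worth, on Spark 2.4+ the same idea also works without any UDF, using sort_array plus transform; a sketch, assuming the inputDF and imports from the sample above (structs compare field by field, so putting month first makes sort_array order each list by month):
val noUdfDF = inputDF
  .groupBy($"year")
  // month goes first in the struct so sort_array effectively sorts by month
  .agg(sort_array(collect_list(struct($"month", $"item", $"quantity"))).as("items"))
  // then drop month again, keeping only (item, quantity)
  .withColumn("items", expr("transform(items, x -> named_struct('item', x.item, 'quantity', x.quantity))"))
noUdfDF.show(false)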

Kotlin: How to convert list to map with list?

I have a list as below
{("a", 1), ("b", 2), ("c", 3), ("a", 4)}
I want to convert it to a map of list as below
{("a" (1, 4)), ("b", (2)), ("c", (3)))}
i.e. for a, we have a list of 1 and 4, since the key is the same.
The answer in
How to convert List to Map in Kotlin? only shows unique values (instead of duplicates like mine).
I tried associateBy in Kotlin
data class Combine(val alpha: String, val num: Int)
val list = arrayListOf(Combine("a", 1), Combine("b", 2), Combine("c", 3), Combine("a", 4))
val mapOfList = list.associateBy ( {it.alpha}, {it.num} )
println(mapOfList)
But it doesn't seem to work. How could I do it in Kotlin?
Code
fun main(args: Array<String>) {
data class Combine(val alpha: String, val num: Int)
val list = arrayListOf(Combine("a", 1), Combine("b", 2), Combine("c", 3), Combine("a", 4))
val mapOfList = list.associateBy ( {it.alpha}, {it.num} )
println(mapOfList)
val changed = list
.groupBy ({ it.alpha }, {it.num})
println(changed)
}
Output
{a=4, b=2, c=3}
{a=[1, 4], b=[2], c=[3]}
How it works
First it takes the list
It groups the Combines by their alpha value to their num values
You may group the list by alpha first and then map the value to List<Int>:
data class Combine(val alpha: String, val num: Int)
val list = arrayListOf(Combine("a", 1), Combine("b", 2), Combine("c", 3), Combine("a", 4))
val mapOfList = list
.groupBy { it.alpha }
.mapValues { it.value.map { it.num } }
println(mapOfList)
Here's a slightly more concise version of Jacky Choi's solution.
It combines the grouping and the transforming into one call to groupBy().
val mapOfList = list
.groupBy (
keySelector = { it.alpha },
valueTransform = { it.num },
)