How to sort array of struct type in Spark DataFrame by particular field? - dataframe

Given the following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App{
val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
import spark.implicits._
case class ABC(a: Int, b: Int, c: Int)
val first = Seq(
ABC(1, 2, 3),
ABC(1, 3, 4),
ABC(2, 4, 5),
ABC(2, 5, 6)
).toDF("a", "b", "c")
val second = Seq(
(1, 2, (Date.valueOf("2018-01-02"), 30)),
(1, 3, (Date.valueOf("2018-01-01"), 20)),
(2, 4, (Date.valueOf("2018-01-02"), 50)),
(2, 5, (Date.valueOf("2018-01-01"), 60))
).toDF("a", "b", "c")
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b")).groupBy("a").agg(sort_array(collect_list("c2"))).show(false)
}
Spark produces the following result:
|a |sort_array(collect_list(c2), true)|
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
This implies that Spark sorts the array by date (since it is the first field), but I want to tell Spark to sort by a specific field of that nested struct.
I know I can reshape the array to (value, date), but that seems inconvenient. I want a general solution (imagine I have a big nested struct, 5 layers deep, and I want to sort that structure by a particular column).
Is there a way to do that? Am I missing something?

According to the Hive Wiki:
sort_array(Array<T>) : Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
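This means that the array is sorted by the natural ordering of its elements, which holds true even for complex data types: structs are compared field by field, starting with the first one, which is why the result above is ordered by date.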
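As a plain-Python analogy (not Spark code, and only a sketch), tuples follow the same field-by-field ordering. The variable names below are made up for illustration; the values are the two structs collected for a = 2 in the question.

```python
from datetime import date

# Each tuple mirrors one (date, value) struct collected for a = 2 in the question.
structs = [(date(2018, 1, 2), 50), (date(2018, 1, 1), 60)]

# Natural ordering compares the first field and only falls back to the
# second field on ties, so the result is ordered by date.
by_natural_order = sorted(structs)

# Ordering by the second field has to be requested explicitly.
by_value = sorted(structs, key=lambda s: s[1])

print(by_natural_order[0])  # the 2018-01-01 entry comes first
print(by_value[0])          # the entry with value 50 comes first
```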
Alternatively, you can create a UDF that sorts by the second element (at the cost of some performance):
import org.apache.spark.sql.Row
// Sort the collected structs by their second field (index 1), then rebuild the tuples
val sortUdf = udf { (xs: Seq[Row]) => xs.sortBy(_.getAs[Int](1))
  .map { case Row(x: java.sql.Date, y: Int) => (x, y) } }
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a").agg(sortUdf(collect_list("c2")))
  .show(false)
//|a |UDF(collect_list(c2, 0, 0)) |
//|1 |[[2018-01-01,20], [2018-01-02,30]]|
//|2 |[[2018-01-02,50], [2018-01-01,60]]|

For Spark 3+, you can pass a custom comparator function to array_sort:
The comparator will take two arguments representing two elements of
the array. It returns -1, 0, or 1 as the first element is less than,
equal to, or greater than the second element. If the comparator
function returns other values (including null), the function will fail
and raise an error.
val df = first
  .join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a")
  .agg(collect_list("c2").alias("list"))
val df2 = df.withColumn(
  "list",
  expr("array_sort(list, (left, right) -> case when left._2 < right._2 then -1 when left._2 > right._2 then 1 else 0 end)")
)
df2.show(false)
//|a |list |
//|1 |[[2018-01-01, 20], [2018-01-02, 30]]|
//|2 |[[2018-01-02, 50], [2018-01-01, 60]]|
Here _2 is the name of the struct field you want to sort by.
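The comparator contract quoted above (return -1, 0, or 1) can be tried out without a cluster. The following is a plain-Python sketch of the same idea using functools.cmp_to_key; it is an analogy, not Spark's implementation, and compare_by_second is a hypothetical helper name.

```python
from datetime import date
from functools import cmp_to_key

def compare_by_second(left, right):
    """Return -1, 0 or 1 depending on how the second fields compare."""
    if left[1] < right[1]:
        return -1
    if left[1] > right[1]:
        return 1
    return 0

# The two structs collected for a = 2 in the question.
items = [(date(2018, 1, 1), 60), (date(2018, 1, 2), 50)]

# cmp_to_key adapts a two-argument comparator to the key-based sort API.
sorted_items = sorted(items, key=cmp_to_key(compare_by_second))
print(sorted_items[0])  # the entry with value 50 comes first
```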

If you have complex objects, it is much better to use a statically typed Dataset.
case class Result(a: Int, b: Int, c: Int, c2: (java.sql.Date, Int))
val joined = first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
joined.as[Result]
  .groupByKey(_.a)
  // Sort each group's (date, value) tuples by the value
  .mapGroups((key, xs) => (key, xs.map(_.c2).toSeq.sortBy(_._2)))
  .show(false)
// +---+----------------------------------+
// |_1 |_2 |
// +---+----------------------------------+
// |1 |[[2018-01-01,20], [2018-01-02,30]]|
// |2 |[[2018-01-02,50], [2018-01-01,60]]|
// +---+----------------------------------+
In simple cases it is also possible to use a udf, but in general this leads to inefficient and fragile code that quickly gets out of control as the complexity of the objects grows.


Perform groupBy with array of strings - Scala

I have an array of strings, defined in a variable, which contains the names of the columns. I would like to group by these columns and get the count.
I am trying the code below, but it throws an error.
val keys = Array("Col1", "Col2")
val grouppedByDf = myDf.groupBy(keys.mkString(",")).count
Can you please tell me what I am doing wrong here?
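groupBy takes its columns as varargs, so keys.mkString(",") is treated as a single column named "Col1,Col2", which does not exist. Expand the array with : _* instead:
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = Seq(("βήτα", "άλφα", 20), ("άλφα", "βήτα", 10), ("άλφα", "βήτα", 20), ("βήτα", "άλφα", 10)).toDF("α", "β", "ω")
val keys = Array("α", "β")
df.groupBy(keys.map(col): _*).count().show()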
| α| β|count|
|βήτα|άλφα| 2|
|άλφα|βήτα| 2|
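To make the expected result concrete, here is a plain-Python sketch of grouping on a list of key names and counting. It is only an analogy for the DataFrame call; the rows are the sample data from the answer, and the dictionary layout is an assumption made for illustration.

```python
from collections import Counter

rows = [
    {"α": "βήτα", "β": "άλφα", "ω": 20},
    {"α": "άλφα", "β": "βήτα", "ω": 10},
    {"α": "άλφα", "β": "βήτα", "ω": 20},
    {"α": "βήτα", "β": "άλφα", "ω": 10},
]
keys = ["α", "β"]

# Each key name is looked up separately; joining the names into one string
# would instead look for a single key called "α,β", which no row has.
counts = Counter(tuple(row[k] for k in keys) for row in rows)

for group, n in counts.items():
    print(group, n)
```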

Kotlin: zip two lists with index

Is there a way to zip two lists in Kotlin, knowing current index in the transform lambda?
Something like this?
val list1 = listOf("a", "b", "c")
val list2 = listOf("x", "y", "z")
val joined = list1.zipWithIndex(list2){ a,b,i -> listOf(a,b,i)}
print(joined) // [[a, x, 0], [b, y, 1], [c, z, 2]]
If you have only 2 lists, then the closest you can get is probably by using withIndex() and then destructuring inside the lambda passed to zip():
list1.withIndex().zip(list2) { (i, a), b -> listOf(i, a, b) }
Compared to the solution in the comment, this may be a little more confusing, since we add the index first and the second list afterwards. But this way we don't need map(), and because the lambda receives 2 arguments we can destructure them fully.

Scala: how to get the mean and variance and covariance of a matrix?

I am new to scala and I desperately need some guidance on the following problem:
I have a dataframe like the one below (some elements may be NULL)
val dfDouble = Seq(
(1.0, 1.0, 1.0, 3.0),
(1.0, 2.0, 0.0, 0.0),
(1.0, 3.0, 1.0, 1.0),
(1.0, 4.0, 0.0, 2.0)).toDF("m1", "m2", "m3", "m4")
| m1| m2| m3| m4|
I need to get the following statistics out of this dataframe:
a vector that contains the mean of each column (some elements might be NULL and I want to calculate the mean using only the non-NULL elements); I would also like to refer to each element of the vector by name for example, vec_mean["m1_mean"] would return the first element
vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)
a variance-covariance matrix that is (4 x 4), where the diagonals are var(m1), var(m2),..., and the off-diagonals are cov(m1,m2), cov(m1,m3) ... Here, I would also like to only use the non-NULL elements in the variance-covariance calculation
A vector that contains the number of non-null for each column
vec_n: Vector(m1_n, m2_n, m3_n, m4_n)
A vector that contains the standard deviation of each column
vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)
In R I would convert everything to a matrix and then the rest is easy. But in scala, I'm unfamiliar with matrices and there are apparently multiple types of matrices, which are confusing (DenseMatrix, IndexedMatrix, etc.)
Edited: apparently it makes a difference if the content of the dataframe is Double or Int. Revised the elements to be double
Used the following command per suggested answer and it worked!
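val rdd = dfDouble.rdd.map {
  case a: Row => (0 until a.length).foldRight(Array[Double]())((b, acc) => {
    // getAs[Double] returns a primitive, so test the cell with isNullAt instead of k == null
    if (a.isNullAt(b)) acc.+:(0.0)
    else acc.+:(a.getAs[Double](b))
  }).map(_.toDouble)
}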
You can work with Spark's RowMatrix. It provides operations such as computing the covariance matrix using each row as an observation, the mean, the variance, etc. The only thing you have to know is how to build it from a DataFrame.
A DataFrame in Spark carries a schema describing the types of the values stored in it, and these are not only arrays of floating-point numbers. So the first step is to transform this DataFrame into an RDD of vectors (dense vectors in this case).
Having this DF:
val df = Seq(
(1, 1, 1, 3),
(1, 2, 0, 0),
(1, 3, 1, 1),
(1, 4, 0, 2),
(1, 5, 0, 1),
(2, 1, 1, 3),
(2, 2, 1, 1),
(2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")
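Convert it to an RDD[DenseVector] representation. There must be dozens of ways of doing this. One could be:
import org.apache.spark.sql.Row
val rdd = df.rdd.map {
  case a: Row =>
    (0 until a.length).foldRight(Array[Int]())((b, acc) => {
      // Replace null cells with 0; isNullAt is needed because getAs[Int] returns a primitive
      if (a.isNullAt(b)) acc.+:(0) else acc.+:(a.getAs[Int](b))
    }).map(_.toDouble)
}
As you can see in your IDE, the inferred type is RDD[Array[Double]]. Now convert this to an RDD of dense vectors. As simple as doing:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rowsRdd = rdd.map(arr => Vectors.dense(arr))
And now you can build your matrix:
val mat: RowMatrix = new RowMatrix(rowsRdd)
Once you have the matrix, you can easily compute the different metrics per column: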
println("Mean: " + mat.computeColumnSummaryStatistics().mean)
println("Variance: " + mat.computeColumnSummaryStatistics().variance)
It gives:
Mean: [1.375,2.625,0.5,1.375]
You can read more about the capabilities of Spark and these distributed matrix types in the Spark MLlib documentation.
You can also compute the covariance matrix (mat.computeCovariance()), perform an SVD, etc.
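For checking the numbers on a small sample, the requested statistics can also be written out by hand. The following is a plain-Python sketch, not Spark code: it assumes missing cells are represented as None, uses the sample (n - 1) variance, and computes each covariance only over rows where both values are present. All helper and key names (mean, covariance, the "_mean", "_n" and "_stdev" suffixes) are illustrative choices.

```python
from math import sqrt

columns = ["m1", "m2", "m3", "m4"]
rows = [
    (1, 1, 1, 3), (1, 2, 0, 0), (1, 3, 1, 1), (1, 4, 0, 2),
    (1, 5, 0, 1), (2, 1, 1, 3), (2, 2, 1, 1), (2, 3, 0, 0),
]

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = mean([x for x, _ in pairs])
    mean_y = mean([y for _, y in pairs])
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (len(pairs) - 1)

# Column-wise view of the data, plus a copy with missing cells dropped.
by_column = {name: [row[i] for row in rows] for i, name in enumerate(columns)}
present = {name: [v for v in vals if v is not None] for name, vals in by_column.items()}

vec_n = {f"{name}_n": len(vals) for name, vals in present.items()}
vec_mean = {f"{name}_mean": mean(vals) for name, vals in present.items()}
# The diagonal entries cov[(c, c)] are the variances.
cov = {(a, b): covariance(by_column[a], by_column[b]) for a in columns for b in columns}
vec_stdev = {f"{name}_stdev": sqrt(cov[(name, name)]) for name in columns}

print(vec_mean["m1_mean"])
```

The means agree with the RowMatrix output shown above. Note that covariance needs at least two complete pairs, otherwise the division by len(pairs) - 1 fails.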

What is the equivalent of Python list, set, and map comprehensions in Kotlin?

In Python, there are list comprehensions and similar constructs for maps and sets. In Kotlin there is nothing at all in any of the documentation with a similar name.
What are the equivalents of these comprehensions? For example, those found in Python 3 Patterns, Recipes and Idioms, which include list, set, and dictionary comprehensions.
Note: this question is intentionally written and answered by the author (Self-Answered Questions), so that the idiomatic answers to commonly asked Kotlin topics are present in SO.
Taking examples from Python 3 Patterns, Recipes and Idioms we can convert each one to Kotlin using a simple pattern. The Python version of a list comprehension has 3 parts:
output expression
input list/sequence and variable
optional predicate
These directly correlate to Kotlin functional extensions to collection classes. The input sequence, followed by the optional predicate in a filter lambda, followed by the output expression in a map lambda. So for this Python example:
# === PYTHON
a_list = [1, 2, 3, 4, 5, 6]
# output | var | input | filter/predicate
even_ints_squared = [ e*e for e in a_list if e % 2 == 0 ]
# output: [ 4, 16, 36 ]
Becomes:
// === KOTLIN
val aList = listOf(1, 2, 3, 4, 5, 6)
// input | filter | output
val evenIntsSquared = aList.filter { it % 2 == 0 }.map { it * it }
// output: [ 4, 16, 36 ]
Notice that the variable is not needed in the Kotlin version since the implied it variable is used within each lambda. In Python you can turn these into a lazy generator by using the () instead of square brackets:
# === PYTHON
even_ints_squared = ( e**2 for e in a_list if e % 2 == 0 )
And in Kotlin it is more obviously converted to a lazy sequence by changing the input via a function call asSequence():
// === KOTLIN
val evenIntsSquared = aList.asSequence().filter { it % 2 == 0 }.map { it * it }
Nested comprehensions in Kotlin are created by just nesting one within the other's map lambda. For example, take this Python sample, changed slightly to use both a set and a list comprehension:
# === PYTHON
noprimes = {j for i in range(2, 8) for j in range(i*2, 100, i)}
primes = [x for x in range(2, 100) if x not in noprimes]
# output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
// === KOTLIN
val nonprimes = (2..7).flatMap { (it*2..99).step(it).toList() }.toSet()
val primes = (2..99).filterNot { it in nonprimes }
// output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
Notice that the nested comprehension produces a list of lists which is converted to a flat list using flatMap() and then converted to a set using toSet(). Also, Kotlin ranges are inclusive, whereas a Python range is exclusive so you will see the numbers are slightly different in the ranges.
You can also use a sequence generator with co-routines in Kotlin to yield the values without needing the call to flatMap() or flatten():
// === KOTLIN
val nonprimes = sequence {
    (2..7).forEach { (it*2..99).step(it).forEach { value -> yield(value) } }
}
val primes = (2..99).filterNot { it in nonprimes }
Another example from the referenced Python page is generating a matrix:
# === PYTHON
matrix = [ [ 1 if item_idx == row_idx else 0 for item_idx in range(0, 3) ] for row_idx in range(0, 3) ]
# [[1, 0, 0],
# [0, 1, 0],
# [0, 0, 1]]
And in Kotlin:
// === KOTLIN
val matrix = (0..2).map { row -> (0..2).map { col -> if (col == row) 1 else 0 }}
// [[1, 0, 0],
// [0, 1, 0],
// [0, 0, 1]]
Or in Kotlin instead of lists, you could also generate arrays:
// === KOTLIN
val matrix2 = Array(3) { row ->
    IntArray(3) { col -> if (col == row) 1 else 0 }
}
Another of the examples for set comprehensions is to generate a unique set of properly cased names:
# === PYTHON
names = [ 'Bob', 'JOHN', 'alice', 'bob', 'ALICE', 'J', 'Bob' ]
fixedNames = { name[0].upper() + name[1:].lower() for name in names if len(name) > 1 }
# output: {'Bob', 'Alice', 'John'}
Is translated to Kotlin:
// === KOTLIN
val names = listOf( "Bob", "JOHN", "alice", "bob", "ALICE", "J", "Bob" )
val fixedNames = names.filter { it.length > 1 }
    .map { it.take(1).uppercase() + it.drop(1).lowercase() }
    .toSet()
// output: [Bob, John, Alice]
And the example for map comprehension is a bit odd, but can also be implemented in Kotlin. The original:
# === PYTHON
mcase = {'a':10, 'b': 34, 'A': 7, 'Z':3}
mcase_frequency = { k.lower() : mcase.get(k.lower(), 0) + mcase.get(k.upper(), 0) for k in mcase.keys() }
# output: {'a': 17, 'z': 3, 'b': 34}
And the converted, which is written to be a bit more "wordy" here to make it clearer what is happening:
// === KOTLIN
val mcase = mapOf("a" to 10, "b" to 34, "A" to 7, "Z" to 3)
val mcaseFrequency = mcase.map { (key, _) ->
    val newKey = key.lowercase()
    val newValue = mcase.getOrDefault(key.lowercase(), 0) +
                   mcase.getOrDefault(key.uppercase(), 0)
    newKey to newValue
}.toMap()
// output: {a=17, b=34, z=3}
Further reading:
Kotlin adds more power than list/set/map comprehensions because of its extensive functional transforms that you can make to these collection types. See What Java 8 Stream.collect equivalents are available in the standard Kotlin library?
for more examples.
See Get Factors of Numbers in Kotlin
which shows another example of a Python comprehension versus Kotlin.
See Kotlin Extensions Functions for Collections in the API reference guide.
Just for exercise the closest to python will be:
infix fun <I, O> ((I) -> O).`in`(range: Iterable<I>): List<O> = range.map(this)
infix fun <I> Iterable<I>.`if`(cond: (I) -> Boolean): List<I> = this.filter(cond)
fun main() {
    { it: Int -> it + 1 } `in` 1..2 `if` {it > 0}
}
val newls = (1..100).filter { it % 7 == 0 }
in Kotlin is equivalent to the following Python code
newls = [i for i in range(1, 101) if i % 7 == 0]
Map comprehension
import kotlin.math.sqrt
val numbers = "1,2,3,4".split(",")
val roots = numbers.associate { n -> n.toInt() to sqrt(n.toFloat()) }
println(roots) // prints {1=1.0, 2=1.4142135, 3=1.7320508, 4=2.0}
If keys are untransformed elements of source list, even simpler:
val roots = numbers.associateWith { n -> sqrt(n.toFloat()) }
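For comparison, the Python side of these two snippets would be dictionary comprehensions. This is a small sketch added for symmetry with the earlier examples; the name roots_by_string is made up here.

```python
from math import sqrt

numbers = "1,2,3,4".split(",")

# Counterpart of associate: both the key and the value are derived from n.
roots = {int(n): sqrt(float(n)) for n in numbers}

# Counterpart of associateWith: the key is the untransformed element.
roots_by_string = {n: sqrt(float(n)) for n in numbers}

print(roots[4])
```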

Ungrouping the grouped table in Scala

I have the following data:
Next I need to group the above data by using select and group. so it will be something like:
where the last number shows the number of similar records. Now I need to filter the number of equivalent records > 2, and store it in a separate file
I grouped the data with SQL queries from Scala. To ungroup it, I thought we could create a table and insert the grouped data line by line with INSERT commands. It works, but it is very slow: it took around one hour for a few records. Are there any better ways to do this in Scala? Many thanks.
The command is shown below:
import spark.sqlContext.implicits._
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
case class Rating(age: Double,edu: String, sex: String, salary: String)
val Result = sc.textFile("hdfs://NameNode01:9000/input/adult.csv").map(_.split(",")).map(p => Rating(p(0).trim.toDouble,p(1),p(2),p(3))).toDF()
val sal1=spark.sqlContext.sql("SELECT age,edu,sex,salary,count(*) as cnt from Start group by age,edu,sex,salary")
val sal2=spark.sqlContext.sql("SELECT age,edu,sex,salary,cnt from adult WHERE cnt>3")
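var ag=sal2.map(age => ""+age(0)).collect()
var ed=sal2.map(edu => ""+edu(1)).collect()
var se=sal2.map(sex => ""+sex(2)).collect()
var sa=sal2.map(salary => ""+salary(3)).collect()
var cn=sal2.map(cnt => ""+cnt(4)).collect()
//convert age to double
val ages=ag.map(_.toDouble)
//convert the cnt to integer
val counts=cn.map(_.toInt)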
//length of the array
var cnt_length=counts.size
//create a table and add the sal2 records in it
val adlt2=spark.sqlContext.sql("CREATE TABLE adult3 (age double, edu string, sex string, salary string)")
//loop and enter the number of cn
var sql_querys="query"
var i=0
var j=0
var loop_cnt=0
for(i <-0 to cnt_length-1){
  loop_cnt=counts(i)
  for(j <-0 to loop_cnt-1){
    sql_querys="INSERT into adult3 values ("+ages(i)+",'"+ed(i)+"','"+se(i)+"','"+sa(i)+"')"
    val adlt3=spark.sqlContext.sql("INSERT into adult3 values ("+ages(i)+",'"+ed(i)+"','"+se(i)+"','"+sa(i)+"')")
  }
}
The main part is the loop at the end of the code.
Here is a shorter solution which uses only rdds:
val result = sc
  .textFile("hdfs://NameNode01:9000/input/adult.csv")
  .map({ (line: String) =>
    val p = line.split(",")
    (Rating(p(0).trim.toDouble,p(1),p(2),p(3)), 1)
  })
  .reduceByKey(_ + _)
  .filter(_._2 > 2)
  .flatMap(rating => Array.fill(rating._2)(rating._1))
It works as follows:
textFile loads the RDD from the file
map transforms the lines into pairs of the form (rating, 1)
reduceByKey groups the pairs by rating and sums the 1s (i.e. counts the occurrences of each rating)
filter discards the ratings which appear fewer than 3 times
flatMap repeats each rating as many times as its count and then flattens all the results into a single RDD
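The same count, filter and expand steps can be traced on a handful of records in plain Python. This is only a local sketch of the logic, not a distributed implementation; the six sample tuples follow the small example table used elsewhere in this thread, and the names frequent and ungrouped are illustrative.

```python
from collections import Counter

ratings = [
    (53.0, "Male", "11th", "<=50K"),
    (53.0, "Male", "11th", "<=50K"),
    (53.0, "Male", "11th", "<=50K"),
    (20.0, "Female", "Masters", ">50K"),
    (20.0, "Female", "Masters", ">50K"),
    (33.0, "Male", "Bachelors", "<=50K"),
]

# Count the occurrences of each distinct record (the reduceByKey step).
counts = Counter(ratings)

# Keep only the records that occur more than twice (the filter step).
frequent = {rating: n for rating, n in counts.items() if n > 2}

# Repeat each remaining record as many times as its count (the flatMap step).
ungrouped = [rating for rating, n in frequent.items() for _ in range(n)]

print(len(ungrouped))
```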
Here are some reasons why the initial approach is not performant:
collect is used on the dataframe to read its contents on a local machine. This means that you directly lose all parallelization and clustering benefits of Spark.
The for loop executes single insertions into the dataframe. The available transformations of Spark objects (e.g. map, filter, reduce, single SQL queries) are highly optimised to perform those actions in a distributed fashion. By using a for loop to perform single-row actions, you lose this advantage, and you additionally risk extreme overhead from dataframes that are copied during each iteration of the loop.
(minor) Converting an RDD to a dataframe adds some additional computational cost. Thus unless you plan to perform several operations which would benefit from performance features of dataframes or datasets, I would suggest keeping it simple by using just rdds.
You might want to consider ungrouping your dataframe using explode in accordance with the groupBy count:
import org.apache.spark.sql.functions._
case class Rating(age: Double, edu: String, sex: String, salary: String)
val Result = sc.textFile("/Users/leo/projects/spark/files/testfile.csv").
  map(_.split(",")).
  map(p => Rating(p(0).trim.toDouble, p(1).trim, p(2).trim, p(3).trim)).
  toDF
val saDF1 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt")
val saDF2 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt").where($"cnt" > 2)
// Create a UDF to fill array of 1's to be later exploded
val fillArr = (n: Int) => Array.fill(n)(1)
val fillArrUDF = udf(fillArr)
val expandedDF1 = saDF1.withColumn("arr", fillArrUDF($"cnt"))
| age| edu| sex|salary|cnt| arr|
|33.0| Male|Bachelors| <=50K| 1| [1]|
|20.0|Female| Masters| >50K| 2| [1, 1]|
|53.0| Male| 11th| <=50K| 3|[1, 1, 1]|
// Ungroup dataframe using explode
val ungroupedDF1 = expandedDF1.withColumn("a", explode($"arr")).
select("age", "edu", "sex", "salary")
| age| edu| sex|salary|
|33.0| Male|Bachelors| <=50K|
|20.0|Female| Masters| >50K|
|20.0|Female| Masters| >50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
As I understand your question, you want to filter out similar records that occur more than 2 times and write the rest to a file. If that's so, the following can be your solution.
You must already have the original dataframe as
|age |edu |sex |salary|
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|20.0|Female|Masters |>50K |
|20.0|Female|Masters |>50K |
|33.0|Male |Bachelors|<=50K |
You don't need to write complex SQL queries to find the count; you can just use the built-in functions:
val columnNames = Result.columns
val finalTemp = Result.groupBy(columnNames.map(col): _*).agg(count("salary").as("similar records"))
This should give the output
|age |edu |sex |salary|similar records|
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|53.0|Male |11th |<=50K |3 |
Now to filter, you can just use filter function as
val finalTable = finalTemp.filter($"similar records" < 3)
final output is
|age |edu |sex |salary|similar records|
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
you can save it to a file
finalTable.write.format("com.databricks.spark.csv").save("output path")
If you want the original rows that pass the filter, you can simply join back to the original dataframe:
Result.join(finalTable, Seq(columnNames: _*)).show(false)
output is
|age |edu |sex |salary|similar records|
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|20.0|Female|Masters |>50K |2 |
You can save it to a file as above
Note: you will need following import for above functions to work
import org.apache.spark.sql.functions._