Scope DataFrame transformations in Spark

I need to transform some DataFrame rows for which a specific flag is set and leave all other rows untouched.
df.withColumn("a", when($"flag".isNotNull, lit(1)).otherwise($"a"))
.withColumn("b", when($"flag".isNotNull, "$b" + 1).otherwise($"b"))
.withColumn("c", when($"flag".isNotNull, concat($"c", "++")).otherwise($"c"))
There might be more columns like that and I am looking for a way to refactor this into something nicer.
I thought about:
df.filter($"flag".isNotNull)
.withColumn("a", lit(1))
.withColumn("b", $"b" + 1)
.withColumn("c", concat($"c", "++"))
.union(df.filter($"flag".isNull))
but it scans/recalculates df twice. Even if I cache it, the plan contains the lineage of each branch separately, and since I actually chain multiple similar transformations, the final plan explodes exponentially and crashes.
Would it be possible to implement something like:
df.withScope($"flag".isNotNull) { scoped =>
  scoped.withColumn("a", lit(1))
    .withColumn("b", $"b" + 1)
    .withColumn("c", concat($"c", "++"))
}

Using when expressions is ok. You can write something like this:
val updates = Map(
  "a" -> lit(1),
  "b" -> $"b" + 1,
  "c" -> concat($"c", "++")
)
val df2 = updates.foldLeft(df) { case (acc, (c, v)) =>
  acc.withColumn(c, when($"flag".isNotNull, v).otherwise(col(c)))
}
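If you want something closer to the withScope syntax from the question, the same foldLeft can be wrapped in a small helper via an implicit class. This is only a sketch, not a Spark API; the withScope name and the Map-based signature are made up here:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper: applies each update only where `cond` holds,
// leaving all other rows untouched.
implicit class ScopedUpdates(df: DataFrame) {
  def withScope(cond: Column)(updates: Map[String, Column]): DataFrame =
    updates.foldLeft(df) { case (acc, (c, v)) =>
      acc.withColumn(c, when(cond, v).otherwise(col(c)))
    }
}

// Usage with the columns from the question:
// df.withScope($"flag".isNotNull)(Map(
//   "a" -> lit(1),
//   "b" -> $"b" + 1,
//   "c" -> concat($"c", "++")
// ))

Because every update stays a conditional expression within a single plan, there is no second scan of df and no union of branches.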

Related

scala spark foldLeft on map with array

I have a configuration in a form of a map:
val config: Map[String, Array[String]] = Map("column1" -> Array("field1"), "column2" -> Array("field1","field2"))
I want to use foldLeft to apply this to a dataframe and dynamically filter out nested fields using the dropFields function.
I came up with:
val result = config.foldLeft(""){(k, v) =>
df.withColumn( v._1, col(k + v._1).dropFields(v._2: _*))
}
but I struggle to make the foldLeft work; any help would be appreciated.
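For what it's worth, one way to make the foldLeft compile is to use the DataFrame itself as the accumulator instead of a String. A sketch, assuming df is the input DataFrame, each key of config names a struct column, and the array holds the nested fields to drop (Column.dropFields requires Spark 3.1+):

import org.apache.spark.sql.functions.col

val result = config.foldLeft(df) { case (acc, (column, fields)) =>
  // drop the listed nested fields from the struct column named by the key
  acc.withColumn(column, col(column).dropFields(fields: _*))
}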

How to get size of specific value inside array Kotlin

Here is an example of the list. I want to make it dynamic, since there may be more values later.
val list = arrayListOf("A", "B", "C", "A", "A", "B") //Maybe they will be more
I want the output to be like:
val result = list[i] + " size: " + list[i].size
So the output will display every String with the size.
A size: 3
B size: 2
C size: 1
If I add more values, the result should increase as well.
You can use groupBy in this way:
val result = list.groupBy { it }.map { it.key to it.value.size }.toMap()
Jeoffrey's way is better actually, since he uses .mapValues() directly instead of an extra call to .toMap(). I'm just leaving this answer here since I believe the other info I put is relevant.
This will give a Map<String, Int>, where the Int is the count of the occurrences.
This result will not change when you change the original list; that is not how the language works. If you want something like that, you'd need quite a bit of work, like overriding the add function of your collection to refresh the result map.
Also, I see no reason for you to use an ArrayList; especially since you expect that collection to grow, I'd stick with MutableList if I were you.
I think the terminology you're looking for is "frequency" here: the number of times an element appears in a list.
You can usually count elements in a list using the count method like this:
val numberOfAs = list.count { it == "A" }
This approach is pretty inefficient if you need to count all elements though, in which case you can create a map of frequencies the following way:
val freqs = list.groupBy { it }.mapValues { (_, g) -> g.size }
freqs here will be a Map where each key is a unique element from the original list, and the value is the corresponding frequency of that element in the list.
This works by first grouping elements that are equal to each other via groupBy, which returns a Map<String, List<String>> where each key is a unique element from the original list, and each value is the group of all elements in the list that were equal to the key.
Then mapValues will transform that map so that the values are the sizes of the groups instead of the groups themselves.
An improved approach, as suggested by @broot, is to make use of Kotlin's Grouping class, which has a built-in eachCount method:
val freqs = list.groupingBy { it }.eachCount()

Extend groupBy to include multiple aggregations

I implemented a groupBy function which successfully groups columns based on a particular aggregation. The issue is that I am passing the chosen columns and aggregations as a Map[String,String], which means multiple aggregations cannot be performed on one column, for example sum, mean and max all on one column.
Below is what works so far:
groupByFunction(input, Map("someSignal" -> "mean"))
def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Map[String, String],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(
      col(someSession),
      col(someSignal)
    ).agg(columnsWithOperation)
}
Upon looking into it a bit more, the agg function can take a list of columns like below
userData
  .groupBy(
    window(
      (col(timeStampColumnName) / lit(millisSecondsPerSecond)).cast(TimestampType),
      timeWindowInS.toString.concat(" seconds")
    ),
    col(sessionColumnName),
    col(signalColumnName)
  ).agg(
    mean("physicalSignalValue"),
    sum("physicalSignalValue")).show()
So I decided to try to manipulate the input to look like that, below is how I did it:
val signalIdColumn = columnsWithOperation.toSeq.flatMap { case (key, list) => list.map(key -> _) }
val result = signalIdColumn.map(tuple =>
  if (tuple._2 == "mean")
    mean(tuple._1)
  else if (tuple._2 == "sum")
    sum(tuple._1)
  else if (tuple._2 == "max")
    max(tuple._1))
Now I have a list of columns, which is still a problem for the agg function.
I was able to solve it by using a sequence of tuples, Seq[(String, String)], instead of Map[String,String]:
def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Seq[(String, String)],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(
      col(someSession),
      col(someSignal)
    ).agg(columnsWithOperation)
}
and then, with the information from the post below:
https://stackoverflow.com/a/34955432/2091294
userData
  .groupBy(
    col(someSession),
    col(someSignal)
  ).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
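The same head/tail trick also works if you first turn the (column, aggregation) pairs into a list of Columns, as in the mean/sum/max mapping above, because agg has an overload that takes a Column followed by Column*. A sketch along those lines (the match cases and variable names are mine):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, max, mean, sum}

val aggColumns: Seq[Column] = columnsWithOperation.map {
  case (c, "mean") => mean(c)
  case (c, "sum")  => sum(c)
  case (c, "max")  => max(c)
  // any other aggregation name throws a MatchError; extend as needed
}

dataframeDummy
  .groupBy(col(someSession), col(someSignal))
  .agg(aggColumns.head, aggColumns.tail: _*)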

Spark Dataframe size check on columns does not work as expected using vararg and if else - Scala

I do not want to use foldLeft or withColumn with when over all columns in a dataframe, but want a select as per https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015, embellished with an if else statement and the cols passed as a vararg. All I want is to replace an empty array column in a Spark dataframe using Scala. I am using size, but it never computes the zero (0) case correctly.
val resDF2 = aggDF.select(cols.map { col =>
  (if (size(aggDF(col)) == 0) lit(null) else aggDF(col)).as(s"$col")
}: _*)
if (size(aggDF(col)) == 0) lit(null) does not work here functionally, but it does run, and size(aggDF(col)) returns the correct length if I return that instead.
I am wondering what the silly issue is; it must be something I am obviously overlooking!
if/else won't work with the DataFrame API: these are plain Scala expressions, evaluated once while the query is being built rather than per row. With DataFrames you need when/otherwise:
val resDF2 = aggDF.select(cols.map { col => ( when(size(aggDF(col)) === 0,lit(null)).otherwise(aggDF(col))).as(s"$col") }: _*)
This can further be simplified because when without otherwise automatically returns null (i.e. otherwise(lit(null)) is the default):
val resDF2 = aggDF.select(cols.map { col => when(size(aggDF(col)) > 0,aggDF(col)).as(s"$col") }: _*)
See also https://stackoverflow.com/a/48074218/1138523
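To see the difference directly: == is plain Scala equality between a Column object and an Int, decided once while the query is built (and always false here), whereas === builds a Column predicate that is evaluated for every row. Roughly:

// plain Scala equality: Column vs Int, evaluated at query-construction time
size(aggDF(col)) == 0   // Boolean, always false
// Column expression: a per-row predicate the engine evaluates
size(aggDF(col)) === 0  // Column predicate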

generating DataFrames in a for loop in Scala Spark causes out of memory

I'm generating small DataFrames in a for loop. At each iteration of the loop, I pass the generated DataFrame to a function which returns a double. This simple process (which I thought could easily be taken care of by the garbage collector) blows up my memory. When I look at the Spark UI, each iteration of the loop adds a new "SQL{1-500}" entry (my loop runs 500 times). My question is how to drop this SQL object before generating a new one?
my code is something like this:
Seq.fill(500) {
  val data = (1 to 1000).map(_ => Random.nextInt(1000))
  val dataframe = createDataFrame(data)
  myFunction(dataframe)
  dataframe.unpersist()
}
def myFunction(df: DataFrame) = {
  df.count()
}
I tried to solve this problem by dataframe.unpersist() and sqlContext.clearCache() but neither of them worked.
You have two places where I suspect something fishy is happening:
1. In the definition of myFunction: you really need to put the = before the body of the definition. I have had typos like that compile but produce really weird errors (note that I changed your myFunction for debugging purposes).
2. It is better to fill your Seq with something you know and then apply foreach or some such.
(You also need to replace random.nexInt with Random.nextInt. Also, you can only create a DataFrame from a Seq of a type that is a subtype of Product, such as a tuple, and you need to use sqlContext to call createDataFrame.)
This code works with no memory issues:
Seq.fill(500)(0).foreach { i =>
  val data = (1 to 1000).map(_.toDouble).toList.zipWithIndex
  val dataframe = sqlContext.createDataFrame(data)
  myFunction(dataframe)
}
def myFunction(df: DataFrame) = {
  println(df.count())
}
Edit: parallelizing the computation (across 10 cores) and returning the RDD of counts:
sc.parallelize(Seq.fill(500)(0), 10).map { i =>
  val data = (1 to 1000).map(_.toDouble).toList.zipWithIndex
  val dataframe = sqlContext.createDataFrame(data)
  myFunction(dataframe)
}
def myFunction(df: DataFrame) = {
  df.count()
}
Edit 2: the difference between declaring the function myFunction with = and without = is that the first is a (usual) function definition, while the other is a procedure definition, which can only be used for methods that return Unit. See explanation. Here is this point illustrated in spark-shell:
scala> def myf(df:DataFrame) = df.count()
myf: (df: org.apache.spark.sql.DataFrame)Long
scala> def myf2(df:DataFrame) { df.count() }
myf2: (df: org.apache.spark.sql.DataFrame)Unit