Spark Dataframe size check on columns does not work as expected using vararg and if else - Scala - dataframe

I do not want to use foldLeft or withColumn with when over all columns in a dataframe, but rather a single select as per https://medium.com/#manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015, combined with an if-else statement and cols passed as varargs. All I want is to replace empty array columns in a Spark dataframe with null, using Scala. I am using size, but the comparison against zero (0) never evaluates as expected.
val resDF2 = aggDF.select(cols.map { col =>
  (if (size(aggDF(col)) == 0) lit(null) else aggDF(col)).as(s"$col")
}: _*)
if (size(aggDF(col)) == 0) lit(null) does not work here functionally, but it does run, and size(aggDF(col)) returns the correct length if I return that instead.
I am wondering what the silly issue is. Must be something I am obviously overlooking!

if-else won't work with the DataFrame API; it is a plain Scala construct that is evaluated once while the query plan is built, not a column expression. With DataFrames you need when/otherwise:
val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) === 0, lit(null)).otherwise(aggDF(col)).as(s"$col")
}: _*)
This can further be simplified because when without otherwise automatically returns null (i.e. otherwise(lit(null)) is the default):
val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) > 0, aggDF(col)).as(s"$col")
}: _*)
See also https://stackoverflow.com/a/48074218/1138523
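For completeness, a minimal runnable sketch of this approach; the sample data, column names, and local SparkSession are invented for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{size, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sample: two array columns, some of them empty
val aggDF = Seq(
  (Seq(1, 2), Seq.empty[Int]),
  (Seq.empty[Int], Seq(3))
).toDF("x", "y")

val cols = aggDF.columns.toSeq

val resDF2 = aggDF.select(cols.map { c =>
  when(size(aggDF(c)) > 0, aggDF(c)).as(c)
}: _*)

resDF2.show()  // empty arrays come back as null, non-empty arrays are unchanged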

Related

Scope DataFrame transformations in Spark

I need to transform some DataFrame rows for which specific flag is set and leave all other rows untouched.
df.withColumn("a", when($"flag".isNotNull, lit(1)).otherwise($"a"))
.withColumn("b", when($"flag".isNotNull, "$b" + 1).otherwise($"b"))
.withColumn("c", when($"flag".isNotNull, concat($"c", "++")).otherwise($"c"))
There might be more columns like that and I am looking for a way to refactor this into something nicer.
I thought about:
df.filter($"flag".isNotNull)
.withColumn("a", lit(1))
.withColumn("b", $"b" + 1)
.withColumn("c", concat($"c", "++"))
.union(df.filter($"flag".isNull))
but it scans/recalculates df twice. Even if I cache it, the plan keeps the lineage of each branch separately, and since I actually chain multiple similar transformations, the final plan grows exponentially and crashes.
Would it be possible to implement something like:
df.withScope($"flag".isNotNull) { scoped =>
scoped.withColumn("a", lit(1))
.withColumn("b", $"b" + 1)
.withColumn("c", concat($"c", "++"))
}
Using when expressions is ok. You can write something like this:
val updates = Map(
  "a" -> lit(1),
  "b" -> $"b" + 1,
  "c" -> concat($"c", lit("++"))
)

val df2 = updates.foldLeft(df) { case (acc, (c, v)) =>
  acc.withColumn(c, when($"flag".isNotNull, v).otherwise(col(c)))
}
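If the number of updated columns grows large, the same updates map can also be applied in a single select instead of a chain of withColumn calls (the projection-per-withColumn cost is what the Medium post linked above describes); a sketch under the same names:
import org.apache.spark.sql.functions.{col, when}

val df2 = df.select(df.columns.map { c =>
  updates.get(c) match {
    case Some(v) => when($"flag".isNotNull, v).otherwise(col(c)).as(c)  // update only when flag is set
    case None    => col(c)                                              // pass other columns through
  }
}: _*)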

How to read Key Value pair in spark SQL?

How do I get this output using Spark SQL or Scala? I have a table with columns storing such values and need to split them into separate columns.
Input:
Output:
It pretty much depends on which libraries you want to use (as you mentioned, Scala or Spark).
Using Spark:
import spark.implicits._  // needed for .toDS

val rawJson = """
{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
"""

spark.read.json(Seq(rawJson).toDS)
Using common JSON libraries:
// play
Json.parse(rawJson) match {
  case obj: JsObject =>
    val values = obj.values
    val keys = obj.keys
    // construct a dataframe from the keys and values
  case other => // handle other types (JsArray, etc.)
}

// circe
import io.circe._, io.circe.parser._

parse(rawJson) match {
  case Right(json)      => // fetch key/values and construct the df, much like above
  case Left(parseError) => ...
}
You can use almost any JSON library to parse your JSON object and then convert it to a Spark df very easily.
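If the JSON strings already live in a column of an existing DataFrame (as the question suggests), a hedged sketch using from_json to split the keys into separate columns; the column name "raw" and the explicit schema are assumptions:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// hypothetical table with one string column "raw" holding the JSON
val df = Seq(
  """{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}"""
).toDF("raw")

val schema = StructType(Seq(
  StructField("Name", StringType),
  StructField("UploaddedById", StringType),
  StructField("UploadedByName", StringType)
))

df.select(from_json($"raw", schema).as("kv"))
  .select("kv.*")  // one output column per key
  .show()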

Extend groupBy to include multiple aggregations

I implemented a groupBy function which successfully groups columns based on a particular aggregation. The issue is that I am using a Map[String, String] argument for the chosen columns and aggregations, which means multiple aggregations cannot be performed on one column, for example sum, mean and max all on one column.
Below is what works so far:
groupByFunction(input, Map("someSignal" -> "mean"))

def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Map[String, String],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(
      col(someSession),
      col(someSignal)
    ).agg(columnsWithOperation)
}
Upon looking into it a bit more, I found that the agg function can take a list of columns, like below:
userData
  .groupBy(
    window(
      (col(timeStampColumnName) / lit(millisSecondsPerSecond)).cast(TimestampType),
      timeWindowInS.toString.concat(" seconds")
    ),
    col(sessionColumnName),
    col(signalColumnName)
  ).agg(
    mean("physicalSignalValue"),
    sum("physicalSignalValue")).show()
So I decided to try to manipulate the input to look like that; below is how I did it:
val signalIdColumn = columnsWithOperation.toSeq.flatMap { case (key, list) => list.map(key -> _) }

val result = signalIdColumn.map(tuple =>
  if (tuple._2 == "mean")
    mean(tuple._1)
  else if (tuple._2 == "sum")
    sum(tuple._1)
  else if (tuple._2 == "max")
    max(tuple._1))
Now I have a list of columns, which is still a problem for the agg function.
I was able to solve it by using a sequence of tuples, Seq[(String, String)], instead of Map[String, String]:
def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Seq[(String, String)],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(
      col(someSession),
      col(someSignal)
    ).agg(columnsWithOperation)
}
and then, with the information from this post: https://stackoverflow.com/a/34955432/2091294
userData
  .groupBy(
    col(someSession),
    col(someSignal)
  ).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
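Putting the pieces together, a minimal sketch of the reworked function (names kept from the question; the example call at the end is an assumption):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Seq[(String, String)],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame =
  dataframeDummy
    .groupBy(col(someSession), col(someSignal))
    .agg(columnsWithOperation.head, columnsWithOperation.tail: _*)

// several aggregations on the same column are now possible
groupByFunction(input, Seq("someSignal" -> "mean", "someSignal" -> "sum", "someSignal" -> "max"))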

Read parquet file having mixed data type in a column

I want to read a parquet file using spark sql in which one column has mixed datatype (string and integer).
val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")
This throws the exception: Failed to merge incompatible data types IntegerType and StringType
Is there a way to explicitly cast the column during the read?
The only way that I have found is to manually cast one of the fields so that they match. You can do this by reading the individual parquet files into a sequence and iteratively modifying them, as follows:
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionReduce(dfs: Seq[DataFrame]) = {
  dfs.reduce { (x, y) =>
    def schemaTruncate(df: DataFrame) = df.schema.map(schema => schema.name -> schema.dataType)
    // (name, type) pairs present in y but not in x
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(error) => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}
The above function first finds the columns whose name or type appear in Y but not in X. It then adds those columns to X by attempting to cast the existing column and, on failure, adding the column as a literal null. Finally it selects only Y's columns from the fixed X, in case X has columns not present in Y, and returns the union.
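A hedged usage sketch; the part-file paths are hypothetical, and each file is read separately so Spark does not try to merge the conflicting schemas up front:
// hypothetical part-file paths under /tmp/data
val paths = Seq("/tmp/data/part-00000", "/tmp/data/part-00001")
val dfs = paths.map(p => sqlContext.read.parquet(p))

val merged = unionReduce(dfs)
merged.printSchema()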

Strings concatenation in Spark SQL query

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following:
val result = sim.as('s)
  .join(
    event.as('e),
    Inner,
    Option("s.codeA".attr === "e.codeA".attr))
  .select("1" + "s.codeA".attr, "e.name".attr)
Let's say my tables contain:
sim:
codeA,codeB
0001,abcd
0002,efgh
events:
codeA,name
0001,freddie
0002,mercury
And I would want as output:
10001,freddie
10002,mercury
In SQL or HiveQL I know I have the concat function available, but it seems Spark SQL doesn't support this feature. Can somebody suggest a workaround for my issue?
Thank you.
Note:
I'm using Language Integrated Queries, but a "standard" Spark SQL query would also be fine, depending on the eventual solution.
The value you add at the beginning does not seem to be part of your selection or your SQL logic, if I understand correctly. Why don't you format the output as a further step?
val results = sqlContext.sql("SELECT s.codeA, e.code FROM foobar")
results.map(t => ("1" + t(0), t(1))).collect()
It's relatively easy to implement new Expression types directly in your project. Here's what I'm using:
case class Concat(children: Expression*) extends Expression {
  override type EvaluatedType = String

  override def foldable: Boolean = children.forall(_.foldable)
  def nullable: Boolean = children.exists(_.nullable)
  def dataType: DataType = StringType

  def eval(input: Row = null): EvaluatedType = {
    children.map(_.eval(input)).mkString
  }
}
val result = sim.as('s)
  .join(
    event.as('e),
    Inner,
    Option("s.codeA".attr === "e.codeA".attr))
  .select(Concat("1", "s.codeA".attr), "e.name".attr)
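For what it's worth, newer Spark versions ship concat in org.apache.spark.sql.functions, so a custom Expression is no longer necessary; a sketch with the DataFrame API (column names taken from the question, the DataFrame names simDF and eventDF are assumptions):
import org.apache.spark.sql.functions.{col, concat, lit}

// simDF and eventDF are assumed DataFrame equivalents of the sim and event tables
val result = simDF.as("s")
  .join(eventDF.as("e"), col("s.codeA") === col("e.codeA"), "inner")
  .select(concat(lit("1"), col("s.codeA")).as("codeA"), col("e.name"))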