How do I convert a Column DataType to String in Spark Scala?

I am trying to call the from_json method and want to fetch the schema of the JSON dynamically. The issue is with the third .withColumn line as it doesn't seem to like Seq[Column].
val randomStringGen = udf((length: Int) => {
  scala.util.Random.alphanumeric.take(length).mkString
})
val randomKeyGen = udf((key: String, value: String) => {
  s"""{"${key}": "${value}"}"""
})
val resultDF = initDF
  .withColumn("value", randomStringGen(lit(10)))
  .withColumn("keyValue", randomKeyGen(lit("key"), col("value")))
  .withColumn("key", from_json(col("keyValue"), spark.read.json(Seq(col("keyValue")).toDS).schema))
error: value toDS is not a member of Seq[org.apache.spark.sql.Column]
.withColumn("key", from_json(col("keyValue"), spark.read.json(Seq(col("keyValue")).toDS).schema))
I have a known solution which is simply to hard code a sample JSON:
val jsData = """{"key": "value"}"""
and replace the col("keyValue") with the hardcoded variable.
.withColumn("key", from_json(col("keyValue"), spark.read.json(Seq(jsData).toDS).schema))
This works and produces exactly what I want, but if I have a large JSON, this method can be quite cumbersome.

There are two small errors in what you're writing.
First, if you want to use the toDS method on a Seq, you'll need to import spark.implicits._, which is where that method is defined. Doing that should get rid of your first error.
Secondly, the from_json function you're trying to use has the following function signature:
def from_json(e: Column, schema: StructType): Column
So the second argument, schema, should be a StructType, which is what the .schema method of a Dataset returns. That means one of your parentheses is in the wrong position.
Instead of
.withColumn("key", from_json(col("keyValue"), spark.read.json(Seq(col("keyValue")).toDS.schema)))
you should have
.withColumn("key", from_json(col("keyValue"), spark.read.json(Seq(col("keyValue")).toDS).schema))

Related

calling a double from an object seems to return the position, not the value, of the double

In a function I have this:
val sunRise = SunEquation(2459622)
binding.timeDisplay.setText("$sunRise.n")
The SunEquation-Class looks like this:
class SunEquation(var jDate: Int) {
val jYear = 2451545
val ttOffset = .0008
var n = jDate - jYear + ttOffset
}
the button text that appears is:
com.example.soluna.SunEquation#6d1a94b.n
I would expect a double value.
You have to add curly brackets around the value you want to inject into the String, like this:
binding.timeDisplay.setText("${sunRise.n}")
The shorthand syntax without brackets only works for a single variable, but not
for access to a nested field or other more complex expressions.
In your case, this results in the object itself being injected into the String via its toString() call, which by default yields the class name and an identity hash of the object, i.e. com.example.soluna.SunEquation#6d1a94b, followed by the literal String .n.
Alternatively, you could extract the value into a val beforehand and reference that.
val customN = sunRise.n
binding.timeDisplay.setText("$customN")

Scala Spark: Parse SQL string to Column

I have two functions, foo and bar, that I want to write as follows:
def foo(df : DataFrame, conditionString : String) = {
  val conditionColumn : Column = something(conditionString) // help me define "something"
  bar(df, conditionColumn)
}
def bar(df : DataFrame, conditionColumn : Column) = {
  df.where(conditionColumn)
}
Where the condition is a SQL string like "person.age >= 18 AND person.citizen == true" or something similar.
Because reasons, I don't want to change the type signatures here. I feel this should work because if I could change the type signatures, I could just write:
def foobar(df : DataFrame, conditionString : String) = {
  df.where(conditionString)
}
As .where is happy to accept a SQL string expression.
So, how can I turn a string representing a column expression into a column? If the expression were just the name of a single column in df I could just do col(colName), but that doesn't seem to take the range of expressions that .where does.
If you need more context for why I'm doing this, I'm working on a databricks notebook that can only accept string arguments (and needs to take a condition as an argument), which calls a library I want to take column-typed arguments.
You can use functions.expr:
def expr(expr: String): Column
Parses the expression string into the column that it represents
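So "something" from the question can simply be expr. A minimal sketch (foo, bar and the example condition come from the question; people is a hypothetical DataFrame, and the string is assumed to be valid Spark SQL expression syntax):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

def foo(df: DataFrame, conditionString: String): DataFrame = {
  // expr parses the SQL expression string into the Column it represents
  val conditionColumn: Column = expr(conditionString)
  bar(df, conditionColumn)
}

def bar(df: DataFrame, conditionColumn: Column): DataFrame =
  df.where(conditionColumn)

// e.g. foo(people, "person.age >= 18 AND person.citizen == true")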

How to pass a Map into a UDF in Spark

Here is my problem: I have a Map[Array[String],String], and I want to pass it into a UDF.
Here is my UDF:
def lookup(lookupMap:Map[Array[String],String]) =
udf((input:Array[String]) => lookupMap.lift(input))
And here is my Map variable:
val srdd = df.rdd.map { row => (
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
row.getString(7)
)}
Here is how I call the function:
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c","d"))
I first got an error about an immutable array, so I changed my array to an immutable type; then I got an error about a type mismatch. I googled a bit, and apparently I can't pass a non-Column type directly into a UDF. Can somebody help? Kudos.
Update: So I did convert everything to a wrapped array. Here is what I did:
val srdd = df.rdd.map{row => (WrappedArray.make[String](Array(row.getString(1),row.getString(5),row.getString(8))),row.getString(7))}
val lookupMap = srdd.collectAsMap()
def lookup(lookupMap:Map[collection.mutable.WrappedArray[String],String]) = udf((input:collection.mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c",$"d"))
Now I am having an error like this:
required: Map[scala.collection.mutable.WrappedArray[String],String]
-ksh: Map[scala.collection.mutable.WrappedArray[String],String]: not found [No such file or directory]
I tried to do something like this:
val m = collection.immutable.Map(1->"one",2->"Two")
val n = collection.mutable.Map(m.toSeq: _*)
but then I just got back to the error of column type.
First, you have to pass a Column as an argument of the UDF. Since you want this argument to be an array, you should use the array function in org.apache.spark.sql.functions, which creates an array Column from a series of other Columns. So the UDF call would be:
lookup(lookupMap)(array($"b",$"c",$"d"))
Now, since array columns are deserialized into mutable.WrappedArray, in order for the map lookup to succeed you'd best make sure that's the type used by your UDF:
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
So altogether:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.mutable
// Create an RDD[(mutable.WrappedArray[String], String)]:
val srdd = df.rdd.map { row: Row => (
mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))),
row.getString(7)
)}
// collect it into a map (I assume this is what you're doing with srdd...)
val lookupMap: Map[mutable.WrappedArray[String], String] = srdd.collectAsMap().toMap // collectAsMap returns a scala.collection.Map, so convert to an immutable Map
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap)(array($"b",$"c",$"d")))
Anna, your code for srdd/lookupMap is of type org.apache.spark.rdd.RDD[(Array[String], String)]:
val srdd = df.rdd.map { row => (
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
row.getString(7)
)}
Whereas in the lookup method you are expecting a Map as a parameter:
def lookup(lookupMap:Map[Array[String],String]) =
udf((input:Array[String]) => lookupMap.lift(input))
That is the reason why you are getting the type mismatch error.
First turn srdd from an RDD of tuples into an RDD of Maps, and then convert that RDD into a single Map to resolve this error (a sketch of that last step follows the code below).
val srdd = df.rdd.map { row => Map(
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString) ->
row.getString(7)
)}
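For what it's worth, here is a hedged sketch of that last step, collapsing the RDD of single-entry Maps into one driver-side Map. Note that Array keys compare by reference in Scala, so for the lookups themselves to hit, the WrappedArray-keyed approach from the answer above is still what you want:
val lookupMap: Map[Array[String], String] =
  srdd.flatMap(_.toSeq)  // flatten the single-entry Maps back into (key, value) pairs
    .collectAsMap()      // collect to the driver as a scala.collection.Map
    .toMap               // convert to an immutable Map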

generating DataFrames in a for loop in Scala Spark causes out of memory

I'm generating small DataFrames in a for loop. At each round of the for loop, I pass the generated DataFrame to a function which returns a double. This simple process (which I thought could be easily taken care of by the garbage collector) blows up my memory. When I look at the Spark UI, at each round of the for loop it adds a new "SQL{1-500}" entry (my loop runs 500 times). My question is: how do I drop this SQL object before generating a new one?
My code is something like this:
Seq.fill(500){
  val data = (1 to 1000).map(_=>Random.nextInt(1000))
  val dataframe = createDataFrame(data)
  myFunction(dataframe)
  dataframe.unpersist()
}
def myFunction(df: DataFrame)={
  df.count()
}
I tried to solve this problem by dataframe.unpersist() and sqlContext.clearCache() but neither of them worked.
You have two places where I suspect something fishy is happening:
in the definition of myFunction: you really need to put the = before the body of the definition. I have had typos like that compile but produce really weird errors (note I changed your myFunction for debugging purposes)
it is better to fill your Seq with something you know and then apply foreach or some such
(You also need to replace random.nexInt with Random.nextInt; also, you can only create a DataFrame from a Seq of a type that is a subtype of Product, such as a tuple, and you need to use sqlContext to call createDataFrame.)
This code works with no memory issues:
Seq.fill(500)(0).foreach{ i =>
  val data = {1 to 1000}.map(_.toDouble).toList.zipWithIndex
  val dataframe = sqlContext.createDataFrame(data)
  myFunction(dataframe)
}
def myFunction(df: DataFrame) = {
  println(df.count())
}
Edit: parallelizing the computation (across 10 cores) and returning the RDD of counts:
sc.parallelize(Seq.fill(500)(0), 10).map{ i =>
  val data = {1 to 1000}.map(_.toDouble).toList.zipWithIndex
  val dataframe = sqlContext.createDataFrame(data)
  myFunction(dataframe)
}
def myFunction(df: DataFrame) = {
  df.count()
}
Edit 2: the difference between declaring the function myFunction with = and without = is that the first is a (usual) function definition, while the other is a procedure definition, which is only used for methods that return Unit. Here is this point illustrated in the Spark shell:
scala> def myf(df:DataFrame) = df.count()
myf: (df: org.apache.spark.sql.DataFrame)Long
scala> def myf2(df:DataFrame) { df.count() }
myf2: (df: org.apache.spark.sql.DataFrame)Unit

lua call function from a string with function name

Is it possible in Lua to execute a function from a string representing its name?
i.e. I have the string x = "foo"; is it possible to do x()?
If yes, what is the syntax?
To call a function in the global namespace (as mentioned by #THC4k) is easily done, and does not require loadstring().
x='foo'
_G[x]() -- calls foo from the global namespace
You would need to use loadstring() (or walk each table) if the function is in another table, such as if x='math.sqrt'.
If loadstring() is used, you would want to not only append parentheses with a vararg expression (...) to allow for parameters, but also add return to the front.
x='math.sqrt'
print(assert(loadstring('return '..x..'(...)'))(25)) --> 5
or walk the tables:
function findfunction(x)
assert(type(x) == "string")
local f=_G
for v in x:gmatch("[^%.]+") do
if type(f) ~= "table" then
return nil, "looking for '"..v.."' expected table, not "..type(f)
end
f=f[v]
end
if type(f) == "function" then
return f
else
return nil, "expected function, not "..type(f)
end
end
x='math.sqrt'
print(assert(findfunction(x))(121)) -->11
I frequently put a bunch of functions in a table:
functions = {
f1 = function(arg) print("function one: "..arg) end,
f2 = function(arg) print("function two: "..arg..arg) end,
...,
fn = function(arg) print("function N: argh") end,
}
Then you can use a string as a table index and run your function like this:
print(functions["f1"]("blabla"))
print(functions["f2"]("blabla"))
This is the result:
function one: blabla
function two: blablablabla
I find this to be cleaner than using loadstring(). If you don't want to create a special function table you can use _G['foo'].
loadstring is not the answer here. For starters you would need a return in the string, and other details I won't go into.
THC4k has the right idea; if you have the function name in the variable x, then the call you want is
_G[x](arg1, arg2, ...)
Names are not unique; there can be many functions named foo in different namespaces. But _G['foo'] is foo in the global namespace.
It sounds like you want to do an 'eval', which is supported in Lua like so:
assert(loadstring(x))()
You'll probably want to concatenate the "()" onto x first, though.