How to pass a map into a UDF in Spark SQL

Here is my problem: I have a map of type Map[Array[String],String], and I want to pass it into a UDF.
Here is my UDF:
def lookup(lookupMap:Map[Array[String],String]) =
udf((input:Array[String]) => lookupMap.lift(input))
And here is my Map variable:
val srdd = df.rdd.map { row => (
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
row.getString(7)
)}
Here is how I call the function:
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c","d"))
I first got an error about an immutable array, so I changed my array to an immutable type, and then I got an error about a type mismatch. I googled a bit; apparently I can't pass a non-Column type directly into a UDF. Can somebody help? Kudos.
Update: So I did convert everything to a wrapped array. Here is what I did:
val srdd = df.rdd.map{row => (WrappedArray.make[String](Array(row.getString(1),row.getString(5),row.getString(8))),row.getString(7))}
val lookupMap = srdd.collectAsMap()
def lookup(lookupMap:Map[collection.mutable.WrappedArray[String],String]) = udf((input:collection.mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c",$"d"))
Now I am having an error like this:
required: Map[scala.collection.mutable.WrappedArray[String],String]
-ksh: Map[scala.collection.mutable.WrappedArray[String],String]: not found [No such file or directory]
I tried to do something like this:
val m = collection.immutable.Map(1->"one",2->"Two")
val n = collection.mutable.Map(m.toSeq: _*)
but then I just got the column-type error again.

First, you have to pass a Column as an argument of the UDF. Since you want this argument to be an array, you should use the array function in org.apache.spark.sql.functions, which creates an array Column from a series of other Columns. So the UDF call would be:
lookup(lookupMap)(array($"b",$"c",$"d"))
Now, since array columns are deserialized into mutable.WrappedArray, in order for the map lookup to succeed you'd best make sure that's the type used by your UDF:
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
So altogether:
import scala.collection.mutable

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

// Create an RDD[(mutable.WrappedArray[String], String)]:
val srdd = df.rdd.map { row: Row => (
  mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))),
  row.getString(7)
)}

// Collect it into a map (I assume this is what you're doing with srdd...).
// collectAsMap() returns a scala.collection.Map, so convert it to an immutable Map
// to match the signature of lookup below:
val lookupMap: Map[mutable.WrappedArray[String], String] = srdd.collectAsMap().toMap

def lookup(lookupMap: Map[mutable.WrappedArray[String], String]) =
  udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))

val combinedDF = dftemp.withColumn("a", lookup(lookupMap)(array($"b", $"c", $"d")))

Anna, your code for srdd/lookupMap produces an org.apache.spark.rdd.RDD[(Array[String], String)]:
val srdd = df.rdd.map { row => (
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
row.getString(7)
)}
whereas the lookup method expects a Map as a parameter:
def lookup(lookupMap:Map[Array[String],String]) =
udf((input:Array[String]) => lookupMap.lift(input))
That is why you are getting the type-mismatch error.
To resolve it, first turn srdd from an RDD of tuples into an RDD of single-entry Maps, and then convert that RDD into a Map (one way to do the final conversion is sketched after the code below):
val srdd = df.rdd.map { row => Map(
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString) ->
row.getString(7)
)}
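If you go that route, srdd becomes an RDD of single-entry Maps; a minimal sketch of the final collection step (the flatMap/collectAsMap line is an addition, not code from the answer) could be:
// Sketch only: flatten the one-entry maps and collect them on the driver.
val lookupMap = srdd.flatMap(_.toSeq).collectAsMap()
// Caveat: Array[String] keys compare by reference in Scala, so lookups against a
// freshly built array will miss; keying by mutable.WrappedArray (as in the answer
// above) avoids that.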

Related

How can I loop over an array of sets/maps with varying data types

I want to keep my code DRY and create 3 (or more, or fewer) buttons that share mostly the same structure. So I create a list of objects to loop over and put the data inside each object to use in several places in the AppButton.
I might be thinking a bit too Pythonically, because that's my main language and I only recently started using Kotlin. This is what I would normally do in Python:
app_buttons = [
dict(
text="....",
icon="....",
uri_string="....",
),
...
]
I've tried something similar in Kotlin with mapOf:
val appButtons = arrayOf(
mapOf(
"title" to getString(R.string.app_btn_example1),
"icon" to R.drawable.ic_some_icon_1_64,
"uriString" to "myapp://example1",
),
...
)
and then loop over them, getting values from the map:
for (entry in appButtons) {
buttons.add(
AppButton(
entry.get("text"),
entry.get("icon"),
) {
val intent = Intent(Intent.ACTION_VIEW, Uri.parse(entry.get("uriString"))).apply {
val name = getString(R.string.saved_account_key)
putExtra(name, sharedPref.getString(name, null))
}
startActivity(intent)
}
)
}
But then I get Type mismatch. Required String. Found {Comparable & java.io.Serializable}?. I don't know what types to put where...
OK, a different approach, using setOf and destructuring:
val appButtons = arrayOf(
setOf(
getString(R.string.app_btn_example1),
R.drawable.ic_some_icon_1_64,
"myapp://example1",
),
...
)
for ((text, icon, uriString) in appButtons) {
buttons.add(
AppButton(
text,
icon
) {
...
}
)
}
But now I get the following:
Destructuring declaration initializer of type Set<{Comparable<*> & java.io.Serializable}> must have a 'component1()' function
Destructuring declaration initializer of type Set<{Comparable<*> & java.io.Serializable}> must have a 'component2()' function
Destructuring declaration initializer of type Set<{Comparable<*> & java.io.Serializable}> must have a 'component3()' function
How do I make this work? How do I create a basic list of objects and loop over them with the correct types? It feels so simple in Python. I'm clearly missing something.
Rather than using maps, you should create a data class. For example:
data class ButtonModel(
val title: String,
val icon: Int,
val uriString: String,
)
You can then create the array like this:
val appButtons = arrayOf(
ButtonModel(
title = getString(R.string.app_btn_example1),
icon = R.drawable.ic_some_icon_1_64,
uriString = "myapp://example1",
),
...
)
Or without the named arguments, if you prefer:
val appButtons = arrayOf(
ButtonModel(
getString(R.string.app_btn_example1),
R.drawable.ic_some_icon_1_64,
"myapp://example1",
),
...
)
Then, rather than getting them with get or [], you can just use the dot syntax:
buttons.add(
AppButton(
entry.title,
entry.icon,
) {
val intent = Intent(Intent.ACTION_VIEW, Uri.parse(entry.uriString)).apply {
val name = getString(R.string.saved_account_key)
putExtra(name, sharedPref.getString(name, null))
}
startActivity(intent)
}
)

VarcharType mismatch Spark dataframe

I am trying to change the schema of a dataframe. Every time I have a column of string type, I want to change its type to VarcharType(max), where max is the maximum length of the strings in that column. I wrote the following code. (I want to export the dataframe to SQL Server later, and I don't want nvarchar in SQL Server, so I am trying to limit the length on the Spark side.)
val df = spark.sql(s"SELECT * FROM $tableName")
var l : List [StructField] = List()
val schema = df.schema
schema.fields.foreach(x => {
if (x.dataType == StringType) {
val dataColName = x.name
val maxLength = df.select(dataColName).reduce((x, y) => {
if (x.getString(0).length >= y.getString(0).length) {
x
} else {
y
}
}).getString(0).length
val dataType = VarcharType(maxLength)
l = l :+ StructField(dataColName, dataType)
} else {
l = l :+ x
}
})
val newSchema = StructType(l)
val newDf = spark.createDataFrame(df.rdd, newSchema)
However, when running it I get this error:
20/01/22 15:29:44 ERROR ApplicationMaster: User class threw exception: scala.MatchError:
VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
scala.MatchError: VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
Can a dataframe column be of type VarcharType(n)?
The datatype mapping between a dataframe and a database happens in the dialect class. For MS SQL Server the class is org.apache.spark.sql.jdbc.MsSqlServerDialect. You can inherit from it and override getJDBCType to influence the datatype mapping from a dataframe to a table. Then register your dialect for it to take effect.
I have done this for Oracle (not SQL Server), but it can be done similarly.
// Change this
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
  case StringType    => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
  case BooleanType   => Some(JdbcType("BIT", java.sql.Types.BIT))
  case _             => None
}
You can't use VarcharType because Spark doesn't accept it in a dataframe schema (hence the MatchError). Also, you can't check the length of the actual data because it is not exposed; you only have access to dt: DataType, so you can set a default size for NVARCHAR if MAX is not acceptable.
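To make the dialect approach concrete for SQL Server, here is a minimal sketch; the object name and the fixed VARCHAR(4000) length are assumptions, and it extends the public JdbcDialect base class directly since the built-in MsSqlServerDialect may not be accessible from user code:
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch only: map every StringType column to a bounded VARCHAR instead of NVARCHAR(MAX).
object SqlServerVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(4000)", java.sql.Types.VARCHAR))
    case _          => None // defer to the default mapping for other types
  }
}

// Register it before writing the dataframe over JDBC:
JdbcDialects.registerDialect(SqlServerVarcharDialect)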

From Column to Array Scala Spark

I am trying to apply a function to a Column in Scala, but I am encountering some difficulties.
This is the error:
found : org.apache.spark.sql.Column
required: Array[Double]
Is there a way to convert a Column to an Array?
Thank you
Update:
Thank you very much for your answer; I think I am getting closer to what I am trying to achieve. To give you a bit more context:
Here is the code:
object Targa_Indicators_Full {
def get_quantile (variable: Array[Double], perc:Double) : Double = {
val sorted_vec:Array[Double]=variable.sorted
val pos:Double= Math.round(perc*variable.length)-1
val quant:Double=sorted_vec(pos.toInt)
quant
}
def main(args: Array[String]): Unit = {
val get_quantileUDF = udf(get_quantile _)
val plate_speed =
trips_df.groupBy($"plate").agg(sum($"time_elapsed").alias("time"),sum($"space").alias("distance"),
stddev_samp($"distance"/$"time_elapsed").alias("sd_speed"),
get_quantileUDF($"distance"/$"time_elapsed",.75).alias("Quant_speed")).
withColumn("speed", $"distance" / $"time")
}
Now I get this error:
type mismatch;
[error] found : Double(0.75)
[error] required: org.apache.spark.sql.Column
[error] get_quantileUDF($"distanza"/$"tempo_intermedio",.75).alias("IQR_speed")
^
[error] one error found
What can I do?
Thanks.
You cannot directly apply a function to a Dataframe column; you have to convert your existing function into a UDF. Spark allows users to define custom user-defined functions (UDFs).
For example:
Say you have a dataframe with an array column:
scala> val df=sc.parallelize((1 to 100).toList.grouped(5).toList).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: array<int>]
You have defined a function to apply to the array-typed column:
def convert( arr:Seq[Int] ) : String = {
arr.mkString(",")
}
You have to convert this to a UDF before applying it to the column:
val convertUDF = udf(convert _)
And then you can apply your function:
df.withColumn("new_col", convertUDF(col("value")))

Mapping Slick query to default projection after modifying column value

When creating a table query, I would like to modify my select statement by mapping the default table query. However, I cannot find a way to map the value of a column and still map to my case class:
case class MyRecord(id: Int, name: String, value: Int)
class MyTable(tag: Tag) extends Table[MyRecord](tag, "MYTABLE") {
def id = column[Int]("id")
def name = column[String]("name")
def value = column[Int]("value")
def * = (id, name, value) <> (MyRecord.tupled, MyRecord.unapply)
}
lazy val tableQuery = TableQuery[MyTable]
I would like to trim the value of name with this function:
def trimLeading0: (Rep[String]) => Rep[String] = SimpleExpression.unary[String, String] {
(str, queryBuilder) =>
import slick.util.MacroSupport._
import queryBuilder._
b"TRIM(LEADING 0 FROM $str)"
}
Now I am at a loss about what to do here:
val trimmedTableQuery: Query[MyTable, MyRecord, Seq] = tableQuery.map(s => ???)
I have tried mapping the Rep like I would do with a case class:
val trimmedTableQuery = tableQuery.map(s => s.copy(name = trimLeading0(s.name)))
This refuses to compile with value copy is not a member of MyTable
My current workaround is to use a custom function instead of MyRecord.tupled for the default projection:
def trimming(t: (Int, String, Int)) = MyRecord(t._1, t._2.dropWhile(_ == "0"), t._3)
def * = (id, name, value) <> (trimming, MyRecord.unapply)
Alternatively, I could map the returned result of the DBIOAction returning a tuple to the case class, which is much less elegant:
val action = tableQuery.map{ s => (s.id, trimLeading0(s.name), s.value)}.result
val futureTuples: Future[Seq[(Int, String, Int)]] = db.run(action)
val records = futureTuples map (s => s.map(MyRecord.tupled))
But how can I do it inside the map method while building the query? Or would it be better to change the def name column definition?
You can't mess with the default projection (i.e. def *) in MyTable as it needs to be symmetric. It's used for query and insert. But you can create a trimmedTableQuery based on a specialisation of MyTable with an overridden default projection. Then you can also have tableQuery based on the symmetric default projection. You will get an error if you try to do inserts based on the trimmedTableQuery (but you shouldn't need to do that, just use tableQuery for inserts).
lazy val tableQuery = TableQuery[MyTable]
lazy val trimmedTableQuery = new TableQuery(new MyTable(_) {
override def * = (id, trimLeading0(name), value) <> (MyRecord.tupled, MyRecord.unapply)
})
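A hypothetical usage sketch (Slick 3 style, assuming a configured db: Database and made-up record values): reads go through the trimmed projection, while inserts keep using the plain tableQuery.
import scala.concurrent.Future

// Sketch only; `db` and the record values are assumptions.
val insert: Future[Int] = db.run(tableQuery += MyRecord(1, "007", 42))
val trimmed: Future[Seq[MyRecord]] = db.run(trimmedTableQuery.result)
// A stored name of "007" comes back as "7" through the trimmed projection.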

Slick 2 - Update columns in a table and return whole table object

How would you update a few columns in a table while returning the entire updated row, using Slick?
Assuming SomeTables is some TableQuery, you would typically write a query like this if you want to, for example, add an item to the table (and return the newly added item):
val returnedItem = SomeTables returning SomeTables += someTable
How would you do the same if you want to update an item and get the whole item back? I suspect you would do something like this:
val q = SomeTables.filter(_.id === id).map(x => (x.someColumn,x.anotherColumn)) returning SomeTables
val returnedItem = q.update((3,"test"))
The preceding code, however, does not work, and I can't find any documentation on how to do this.
Note that I am aware you can just query the item beforehand, update it, and then use copy on the original object; however, this requires a lot of boilerplate (and extra DB round trips).
This feature is not supported in Slick (v2 or v3-M1); although I don't see any specific reason prohibiting its implementation, UPDATE ... RETURNING is not a standard SQL feature (for example, H2 does not support it: http://www.h2database.com/html/grammar.html#update). I'll leave it as an exercise to the reader to explore how one might safely and efficiently emulate the feature for RDBMSes lacking UPDATE ... RETURNING.
When you call "returning" on a scala.slick.lifted.Query, it gives you a JdbcInsertInvokerComponent$ReturningInsertInvokerDef. You'll find no update method, although there is an insertOrUpdate method; however, insertOrUpdate only returns the returning expression result if an insert occurs, None is returned for updates, so no help here.
From this we can conclude that if you want to use the UPDATE ... RETURNING SQL feature, you'll either need to use StaticQuery or roll your own patch to Slick. You can manually write your queries (and re-implement your table projections as GetResult / SetParameter serializers), or you can try this snippet of code:
package com.spingo.slick

import scala.slick.driver.JdbcDriver.simple.{queryToUpdateInvoker, Query}
import scala.slick.driver.JdbcDriver.{updateCompiler, queryCompiler, quoteIdentifier}
import scala.slick.jdbc.{ResultConverter, CompiledMapping, JdbcBackend, JdbcResultConverterDomain, GetResult, SetParameter, StaticQuery => Q}
import scala.slick.util.SQLBuilder
import slick.ast._

object UpdateReturning {
  implicit class UpdateReturningInvoker[E, U, C[_]](updateQuery: Query[E, U, C]) {

    def updateReturning[A, F](returningQuery: Query[A, F, C], v: U)(implicit session: JdbcBackend#Session): List[F] = {
      // Compile the update query to get at the generated SQL and its parameter converter.
      val ResultSetMapping(_,
        CompiledStatement(_, sres: SQLBuilder.Result, _),
        CompiledMapping(_updateConverter, _)) = updateCompiler.run(updateQuery.toNode).tree

      // Extract the column names selected by the returning query.
      val returningNode = returningQuery.toNode
      val fieldNames = returningNode match {
        case Bind(_, _, Pure(Select(_, col), _)) =>
          List(col.name)
        case Bind(_, _, Pure(ProductNode(children), _)) =>
          children map { case Select(_, col) => col.name } toList
        case Bind(_, TableExpansion(_, _, TypeMapping(ProductNode(children), _, _)), Pure(Ref(_), _)) =>
          children map { case Select(_, col) => col.name } toList
      }

      implicit val pconv: SetParameter[U] = {
        val ResultSetMapping(_, compiled, CompiledMapping(_converter, _)) = updateCompiler.run(updateQuery.toNode).tree
        val converter = _converter.asInstanceOf[ResultConverter[JdbcResultConverterDomain, U]]
        SetParameter[U] { (value, params) =>
          converter.set(value, params.ps)
        }
      }

      implicit val rconv: GetResult[F] = {
        val ResultSetMapping(_, compiled, CompiledMapping(_converter, _)) = queryCompiler.run(returningNode).tree
        val converter = _converter.asInstanceOf[ResultConverter[JdbcResultConverterDomain, F]]
        GetResult[F] { p => converter.read(p.rs) }
      }

      val fieldsExp = fieldNames map (quoteIdentifier) mkString ", "
      val sql = sres.sql + s" RETURNING ${fieldsExp}"
      val unboundQuery = Q.query[U, F](sql)
      unboundQuery(v).list
    }
  }
}
I'm certain the above can be improved; I've written it based on my somewhat limited understanding of Slick internals, and it works for me and can leverage the projections / type-mappings you've already defined.
Usage:
import com.spingo.slick.UpdateReturning._
val tq = TableQuery[MyTable]
val st = tq filter(_.id === 1048003) map { e => (e.id, e.costDescription) }
st.updateReturning(tq map (identity), (1048003, Some("such cost")))
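Since updateReturning takes an implicit JdbcBackend#Session, a hypothetical call site would wrap it in a session block; the db value is an assumption, and the query mirrors the usage example above:
import com.spingo.slick.UpdateReturning._

// Sketch only: `db` is an already-configured Slick 2 Database.
db.withSession { implicit session =>
  val tq = TableQuery[MyTable]
  val st = tq filter (_.id === 1048003) map { e => (e.id, e.costDescription) }
  // Returns the full updated rows thanks to the generated RETURNING clause.
  val updatedRows = st.updateReturning(tq map (identity), (1048003, Some("such cost")))
}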