From Column to Array in Scala Spark SQL

I am trying to apply a function to a Column in Scala Spark, but I am running into some difficulties.
I get this error:
found : org.apache.spark.sql.Column
required: Array[Double]
Is there a way to convert a Column to an Array?
Thank you
Update:
Thank you very much for your answer; I think I am getting closer to what I am trying to achieve. Here is a bit more context.
Here is the code:
object Targa_Indicators_Full {

  def get_quantile(variable: Array[Double], perc: Double): Double = {
    val sorted_vec: Array[Double] = variable.sorted
    val pos: Double = Math.round(perc * variable.length) - 1
    val quant: Double = sorted_vec(pos.toInt)
    quant
  }

  def main(args: Array[String]): Unit = {
    val get_quantileUDF = udf(get_quantile _)

    val plate_speed =
      trips_df.groupBy($"plate").agg(
        sum($"time_elapsed").alias("time"),
        sum($"space").alias("distance"),
        stddev_samp($"distance" / $"time_elapsed").alias("sd_speed"),
        get_quantileUDF($"distance" / $"time_elapsed", .75).alias("Quant_speed")
      ).withColumn("speed", $"distance" / $"time")
  }
Now I get this error:
type mismatch;
[error] found : Double(0.75)
[error] required: org.apache.spark.sql.Column
[error] get_quantileUDF($"distanza"/$"tempo_intermedio",.75).alias("IQR_speed")
^
[error] one error found
What can I do?
Thanks.

You cannot apply a plain Scala function directly to a DataFrame column. You have to convert your existing function to a UDF; Spark lets users define custom user-defined functions (UDFs).
For example, say you have a DataFrame with an array column:
scala> val df=sc.parallelize((1 to 100).toList.grouped(5).toList).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: array<int>]
and a function you want to apply to the array-typed column:
def convert(arr: Seq[Int]): String = {
  arr.mkString(",")
}
You have to convert this to a UDF before applying it to the column:
val convertUDF = udf(convert _)
And then you can apply your function:
df.withColumn("new_col", convertUDF(col("value")))
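As a follow-up to the update (this goes beyond the original answer, so treat it as a sketch): the new error appears because every argument of a UDF must be a Column, so the bare .75 needs to be wrapped with lit. Note also that inside agg the UDF receives one value per row rather than an Array[Double] per group; one way to get a per-group quantile is to aggregate the per-row speeds with collect_list first. Assuming the plate, space and time_elapsed columns from the question:
import org.apache.spark.sql.functions.{collect_list, lit, udf}

// Quantile over the collected per-row values; lit(0.75) turns the constant into a Column.
val get_quantileUDF = udf((xs: Seq[Double], perc: Double) => {
  val sorted = xs.sorted
  sorted(Math.round(perc * xs.length).toInt - 1)   // same index rule as get_quantile above
})

val plate_quantiles = trips_df
  .groupBy($"plate")
  .agg(get_quantileUDF(collect_list($"space" / $"time_elapsed"), lit(0.75)).alias("Quant_speed"))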

Related

Spark DataFrame CountVectorizerModel Error With DataType String

I have the following piece of code that tries to perform a simple action: converting a sparse vector to a dense vector. Here is what I have so far:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.mllib.linalg.Vector
import spark.implicits._
// Identify how many distinct values are in the OCEAN_PROXIMITY column
val distinctOceanProximities = dfRaw.select(col("ocean_proximity")).distinct().as[String].collect()
val cvmDF = new CountVectorizerModel(tags)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(dfRaw)
val exprs = (0 until distinctOceanProximities.size).map(i => $"features".getItem(i).alias(s"${distinctOceanProximities(i)}"))
val vecToSeq = udf((v: Vector) => v.toArray)
val df2 = cvmDF.withColumn("features", vecToSeq($"sparseFeatures")).select(exprs:_*)
df2.show()
When I run this script, I get the following error:
java.lang.IllegalArgumentException: requirement failed: Column ocean_proximity must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type string.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:63)
at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema(CountVectorizer.scala:97)
at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema$(CountVectorizer.scala:95)
at org.apache.spark.ml.feature.CountVectorizerModel.validateAndTransformSchema(CountVectorizer.scala:272)
at org.apache.spark.ml.feature.CountVectorizerModel.transformSchema(CountVectorizer.scala:338)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.feature.CountVectorizerModel.transform(CountVectorizer.scala:306)
... 101 elided
I think it is expecting a Seq of String for the datatype, but I just have a String. Any ideas on how to fix this?
It was pretty simple. All I had to do was convert the column from a String to an array of String, like this:
val oceanProximityAsArrayDF = dfRaw.withColumn("ocean_proximity", array("ocean_proximity"))
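With that change the column has type array&lt;string&gt;, which is what CountVectorizerModel expects. A quick sketch of the rest, reusing the names from the question and assuming the vocabulary `tags` was meant to be distinctOceanProximities:
// Sketch: the model now transforms the array-wrapped column without the type error.
val cvmDF = new CountVectorizerModel(distinctOceanProximities)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(oceanProximityAsArrayDF)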

VarcharType mismatch Spark dataframe

I'm trying to change the schema of a DataFrame. Every time I have a column of string type, I want to change its type to VarcharType(max), where max is the maximum length of the strings in that column. I wrote the following code. (I want to export the DataFrame to SQL Server later, and I don't want to end up with nvarchar in SQL Server, so I'm trying to limit the length on the Spark side.)
val df = spark.sql(s"SELECT * FROM $tableName")
var l: List[StructField] = List()
val schema = df.schema
schema.fields.foreach(x => {
  if (x.dataType == StringType) {
    val dataColName = x.name
    val maxLength = df.select(dataColName).reduce((x, y) => {
      if (x.getString(0).length >= y.getString(0).length) {
        x
      } else {
        y
      }
    }).getString(0).length
    val dataType = VarcharType(maxLength)
    l = l :+ StructField(dataColName, dataType)
  } else {
    l = l :+ x
  }
})
val newSchema = StructType(l)
val newDf = spark.createDataFrame(df.rdd, newSchema)
However, when running it I get this error:
20/01/22 15:29:44 ERROR ApplicationMaster: User class threw exception: scala.MatchError:
VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
scala.MatchError: VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
Can a DataFrame column be of type VarcharType(n)?
The data mapping between a database and a DataFrame happens in the dialect class. For MS SQL Server the class is org.apache.spark.sql.jdbc.MsSqlServerDialect. You can inherit from it and override getJDBCType to influence the datatype mapping from a DataFrame to a table, then register your dialect for it to take effect.
I have done this for Oracle (not SQL Server); however, it can be done similarly.
// Change this
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
  case StringType    => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
  case BooleanType   => Some(JdbcType("BIT", java.sql.Types.BIT))
  case _             => None
}
You can't use VarcharType because it is not a supported DataType for DataFrame columns. You also can't check the length of the actual data, because it is not exposed to the dialect; you only have access to dt: DataType, so you can set a default size for NVARCHAR if MAX is not acceptable.
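For completeness, here is a rough sketch of the registration step (not from the original answer; it extends the public JdbcDialect base class rather than MsSqlServerDialect, which may not be accessible from user code, and the VARCHAR(4000) default is just an illustrative choice):
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Hypothetical dialect that maps Spark strings to a bounded VARCHAR instead of NVARCHAR(MAX).
object BoundedVarcharSqlServerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(4000)", java.sql.Types.VARCHAR)) // default size chosen here
    case _          => None // fall back to the built-in mapping
  }
}

// Register it before writing the DataFrame over JDBC.
JdbcDialects.registerDialect(BoundedVarcharSqlServerDialect)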

How to pass a map into a UDF in Spark

Here is my problem: I have a map of type Map[Array[String],String], and I want to pass it into a UDF.
Here is my UDF:
def lookup(lookupMap:Map[Array[String],String]) =
udf((input:Array[String]) => lookupMap.lift(input))
And here is my Map variable:
val srdd = df.rdd.map { row => (
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
row.getString(7)
)}
Here is how I call the function:
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c","d"))
I first got an error about an immutable array, so I changed my array to an immutable type; then I got an error about a type mismatch. I googled a bit, and apparently I can't pass a non-Column type directly into a UDF. Can somebody help? Kudos.
Update: So I converted everything to a wrapped array. Here is what I did:
val srdd = df.rdd.map{row => (WrappedArray.make[String](Array(row.getString(1),row.getString(5),row.getString(8))),row.getString(7))}
val lookupMap = srdd.collectAsMap()
def lookup(lookupMap:Map[collection.mutable.WrappedArray[String],String]) = udf((input:collection.mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c",$"d"))
Now I am getting an error like this:
required: Map[scala.collection.mutable.WrappedArray[String],String]
-ksh: Map[scala.collection.mutable.WrappedArray[String],String]: not found [No such file or directory]
I tried to do something like this:
val m = collection.immutable.Map(1->"one",2->"Two")
val n = collection.mutable.Map(m.toSeq: _*)
but then I just got the column type error again.
First, you have to pass a Column as an argument of the UDF; since you want this argument to be an array, you should use the array function in org.apache.spark.sql.functions, which creates an array Column from a series of other Columns. So the UDF call would be:
lookup(lookupMap)(array($"b",$"c",$"d"))
Now, since array columns are deserialized into mutable.WrappedArray, in order for the map lookup to succeed you'd best make sure that's the type used by your UDF:
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
So altogether:
import spark.implicits._
import org.apache.spark.sql.functions._
// Create an RDD[(mutable.WrappedArray[String], String)]:
val srdd = df.rdd.map { row: Row => (
mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))),
row.getString(7)
)}
// collect it into a map (I assume this is what you're doing with srdd...)
val lookupMap: Map[mutable.WrappedArray[String], String] = srdd.collectAsMap()
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap)(array($"b",$"c",$"d")))
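A side note on the lookup itself: lookupMap.lift(input) returns an Option[String], and Spark unwraps Option results from Scala UDFs, so keys that are missing from the map should simply show up as null in column a rather than throwing.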
Anna, your code for srdd/lookupMap is of type org.apache.spark.rdd.RDD[(Array[String], String)]:
val srdd = df.rdd.map { row => (
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
row.getString(7)
)}
whereas in the lookup method you are expecting a Map as a parameter:
def lookup(lookupMap:Map[Array[String],String]) =
udf((input:Array[String]) => lookupMap.lift(input))
That is the reason why you are getting the type mismatch error.
First turn srdd from an RDD of tuples into an RDD of Maps, and then try converting the RDD into a Map to resolve this error:
val srdd = df.rdd.map { row => Map(
Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString) ->
row.getString(7)
)}

Convert RDD[(String, String)] to RDD[(Int, Int)]

I'm new to Spark and am having trouble figuring out how to convert the data types of RDD elements. I have the following text file:
1 2
2 3
3 4
When I create a new RDD, it takes the String data type by default:
val exampleRDD = sc.textFile("example.txt").map(x => (x.split(" ")(0),x.split(" ")(1)))
exampleRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[5] at map at <console>:27
But I want it to be RDD[(Int, Int)]. I tried
val exampleRDD: RDD[(Int,Int)] = sc.textFile("example.txt").map(x => (x.split(" ")(0),x.split(" ")(1)))
but it gives an error:
error: not found: type RDD
Any help would be appreciated.
The error "error: not found: type RDD" is because, you would need to full class name as org.apache.spark.rdd.RDD.
But that doesn't still solve the problem. To return Int, you would have to convert string to Int.
val exampleRDD = sc.textFile("example.txt").map(x => (x.split(" ")(0).toInt,x.split(" ")(1).toInt))
Result:
exampleRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[36] at map at <console>:34
sc.textFile("two.txt").map(_.split(" ")).map(ar => (ar(0).toInt, ar(1).toInt))
If you have a more complex format, spark-csv is a better choice for parsing the data.
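For instance, Spark's built-in CSV reader (the same options work with the external spark-csv package on older versions) can split and type the file in one step; a sketch, assuming a SparkSession named spark and the space-delimited example.txt above:
// Sketch: let the CSV reader split on spaces and infer integer columns.
val typedDF = spark.read
  .option("delimiter", " ")
  .option("inferSchema", "true")
  .csv("example.txt")
  .toDF("a", "b")

// Back to an RDD[(Int, Int)] if you really need the RDD API.
val exampleRDD = typedDF.rdd.map(r => (r.getInt(0), r.getInt(1)))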

ReactiveMongo: How to write a Macros handler for an Enumeration object?

I use ReactiveMongo 0.10.0, and I have the following User case class and Gender Enumeration object:
case class User(
  _id: Option[BSONObjectID] = None,
  name: String,
  gender: Option[Gender.Gender] = None)

object Gender extends Enumeration {
  type Gender = Value
  val MALE = Value("male")
  val FEMALE = Value("female")
  val BOTH = Value("both")
}
And I declare two implicit macro handlers:
implicit val genderHandler = Macros.handler[Gender.Gender]
implicit val userHandler = Macros.handler[User]
but when I run the application, I get the following error:
Error:(123, 48) No apply function found for reactive.userservice.Gender.Gender
implicit val genderHandler = Macros.handler[Gender.Gender]
^
Error:(125, 46) Implicit reactive.userservice.Gender.Gender for 'value gender' not found
implicit val userHandler = Macros.handler[User]
^
Does anybody know how to write a Macros handler for an Enumeration object?
Thanks in advance!
I stumbled upon your question a few times searching for the same answer. I did it this way:
import myproject.utils.EnumUtils
import play.api.libs.json.{Reads, Writes}
import reactivemongo.bson._
object DBExecutionStatus extends Enumeration {
  type DBExecutionStatus = Value
  val Error = Value("Error")
  val Started = Value("Success")
  val Created = Value("Running")

  implicit val enumReads: Reads[DBExecutionStatus] = EnumUtils.enumReads(DBExecutionStatus)
  implicit def enumWrites: Writes[DBExecutionStatus] = EnumUtils.enumWrites

  // Store the enum as a BSONString; look the value up by name on read.
  implicit object BSONEnumHandler extends BSONHandler[BSONString, DBExecutionStatus] {
    def read(doc: BSONString) = DBExecutionStatus.withName(doc.value)
    def write(stats: DBExecutionStatus) = BSON.write(stats.toString)
  }
}
You have to create a read/write pair by hand and populate it with your values.
Hope you have already solved this issue, given the question's age :D
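Applied to the Gender enumeration from the question, a hand-written handler along the same lines might look like this (a sketch against the ReactiveMongo 0.10.x BSON API used above):
import reactivemongo.bson._

// Store the enum as a BSONString and look it up by name on read.
implicit object GenderHandler extends BSONHandler[BSONString, Gender.Gender] {
  def read(bson: BSONString): Gender.Gender = Gender.withName(bson.value)
  def write(gender: Gender.Gender): BSONString = BSONString(gender.toString)
}

// With this handler in scope, the macro for User can resolve the gender field.
implicit val userHandler = Macros.handler[User]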