Facing issue while using SparkUDF with multiple arguments - apache-spark-sql

I am trying to encript the data using SHA-256 by passing as an argument in Spark UDF but getting below error. Please find the program snippet and error details below.
Code Snippet:
package com.sample
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import java.security.MessageDigest
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction
import javax.xml.bind.DatatypeConverter;
import org.apache.spark.sql.Column
object Customer {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Customer-data").setMaster("local[2]").set("spark.executor.memory", "1g");
val sc = new SparkContext(conf)
val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
//val hash_algm=sc.getConf.get("halgm")
val hash_algm="SHA-256"
val df = spark.read.format("csv").option("header", "true").load("file:///home/tcs/Documents/KiranDocs/Data_files/sample_data")
spark.udf.register("encriptedVal1", encriptedVal)
//calling encription UDF function
//val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df("customer_id"))).show()
val resDF2 = df.withColumn(("ssn_number"), encriptedVal(array("customer_id", hash_algm))).show()
println("data set"+resDF2)
sc.stop()
}
def encriptedVal = udf((s: String,s1:String) => {
val digest = MessageDigest.getInstance(s1)
val hash = digest.digest(s.getBytes("UTF-8"))
DatatypeConverter.printHexBinary(hash)
})
}
Error details are below:
Exception in thread "main" 2019-01-21 19:42:48 INFO SparkContext:54 -
Invoking stop() from shutdown hook java.lang.ClassCastException:
com.sample.Customer$$anonfun$encriptedVal$1 cannot be cast to
scala.Function1 at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.(ScalaUDF.scala:104)
at
org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:85)
at com.sample.Customer$.main(Customer.scala:26) at
com.sample.Customer.main(Customer.scala)

The problem here is how you call the defined UDF. You should use it like the following:
val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df.col("customer_id"), lit(hash_algm)))
because it accepts two Columns object (both Columns must be String type as defined in your UDF).

Related

MapStruct Property has no write accessor in MetaData for target name

I have a FlatDTO that needs to be mapped to a nested Response containing InfoData and MetaData
The code for the response is generated by OpenAPI. So the below definitions can't be changed.
package org.mapstruct.example.kotlin.openapi
import com.fasterxml.jackson.annotation.JsonProperty
import javax.validation.Valid
data class Response(
#field:Valid #field:JsonProperty("infoData", required = true) val infoData: InfoData,
#field:Valid #field:JsonProperty("metaData", required = true) val metaData: MetaData
)
data class InfoData(
#field:JsonProperty("id", required = true) val id: kotlin.String,
)
data class MetaData(
#field:JsonProperty("firstProperty") val firstProperty: String? = null,
)
I have defined FlatDTO as follows.
package org.mapstruct.example.kotlin.models
data class FlatDTO(
var id: String? = null,
var firstProperty: String,
)
Here is my mapper class which maps FlatDTO to Response
package org.mapstruct.example.kotlin.mapper
import org.mapstruct.Mapper
import org.mapstruct.Mapping
import org.mapstruct.Mappings
import org.mapstruct.example.kotlin.models.FlatDTO
import org.mapstruct.example.kotlin.openapi.Response
#Mapper
interface DataMapper {
#Mappings(
Mapping(target = "infoData.id", source = "id"),
Mapping(target = "metaData.firstProperty", source = "firstProperty")
)
fun flatToResponse(flatDTO: FlatDTO): Response
}
When I try to build the code using mvn clean install
I get the following error.
error: Property "firstProperty" has no write accessor in MetaData for target name "metaData.firstProperty".
[ERROR] #org.mapstruct.Mappings(value = {#org.mapstruct.Mapping(target = "infoData.id", source = "id"), #org.mapstruct.Mapping(target = "metaData.firstProperty", source = "firstProperty")})
I understand that this message is trying to say that there is no setter function for firstProperty because it's defined as val but that code cannot be edited. I can write my own custom mapper that works just fine.
I wanted to understand if there is a way to use MapStruct in this scenario.

How to convert spark dataframe[double , String] to LabeledPoint?

Following is the code that am experimenting with. Am trying to convert SalesData in csv to DF and then to LabeledPoints. However in the last step am getting following compilation error
package macros contains object and package with same name: blackbox
Can you please give me pointers on what am doing wrong here ? Thank you
--EDIT--
Compilation Issue solved by adding 2.11 mllib to build.gradle . but mlData.show fails with
ERROR: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector
val path = "SalesData.csv"
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
.set("spark.broadcast.compress", "false")
.setAppName("local-spark-kafka-consumer-client")
val sparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
data.cache()
import org.apache.spark.sql.DataFrameNaFunctions
data.na.drop()
data.show
//get monthly sales totals
val summary = data.select("OrderMonthYear","SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear","SaleAmount")
summary.show
// convert ordermonthyear to integer type
//val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
import org.apache.spark.sql.functions._
val test = summary.withColumn("OrderMonthYear", (regexp_replace(col("OrderMonthYear").cast("String"),"-",""))).toDF("OrderMonthYear","SaleAmount")
test.printSchema()
test.show
import sparkSession.implicits._
val mlData = test.select("OrderMonthYear", "SaleAmount").
map(row => org.apache.spark.ml.feature.LabeledPoint(
row.getAs[Double](1),
row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
mlData.show

Case Class serialization in Spark

In a Spark app (Spark 2.1) I'm trying to send a case class as input parameter of a function that is meant to run on executors
object TestJob extends App {
val appName = "TestJob"
val out = "out"
val p = Params("my-driver-string")
val spark = SparkSession.builder()
.appName(appName)
.getOrCreate()
import spark.implicits._
(1 to 100).toDF.as[Int].flatMap(i => Dummy.process(i, p))
.write
.option("header", "true")
.csv(out)
}
object Dummy {
def process(i: Int, v:Params): Vector[String] = {
Vector { if( i % 2 == 1) v + "_odd" else v + "_even" }
}
}
case class Params(v: String)
When I run it with master local[*] everything goes well, while when running in a cluster, Params class state is not getting serialized and the output results in
null_even
null_odd
...
Could you please help me understanding what I'm doing wrong?
Googling around I found this post that gave me the solution:Spark broadcasted variable returns NullPointerException when run in Amazon EMR cluster
In the end the problem is due to the extend Apps

sparkSql no such method error

I'm new to learning sparkSQL, and I'm trying to run the examples provided by the spark document, but got error like:
enter image description here
my program like this:
enter image description here
what should I do.
note:I'm using the IDEA to edit my program
all of the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._
object SqlTest1 {
case class Person(name: String, age:Long)
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
runBasicDataFrameExample(spark)
}
private def runBasicDataFrameExample(spark: SparkSession)={
val df = spark.read.json("resorces/people.json")
df.show()
}
}
Val Conf = new SparkConf(true).setAppName(“appName”)
Val spark = SparkSession.builder().config(Conf).getOrCreate()
Val df = spark.read.option(“timestampFormat”,”yyyy/MM/dd HH:mm:ss ZZ).json(path)
it may be the scala version does not match. you can check the scala verion and spark version.

object not serializable org.apache.spark.SparkContext

I was learning about broadcast variables in Spark so I tried to make use of it. I'm using spark-shell (Version 1.6.0). Following is my code:
scala> val pageurls = sc.parallelize(List(("www.google.com","Google"),("www.yahoo.com","Yahoo"))
pageurls: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
pageCounts: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:27
scala> val pageMaps = pageurls.collectAsMap
pageMaps: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val bMaps = sc.broadcast(pageMaps)
bMaps: org.apache.spark.broadcast.Broadcast[scala.collection.Map[String,String]] = Broadcast(2)
scala> bMaps.value
res0: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val newRdd = pageCounts.map{
| case (url,count) => (url,bMaps.value(url),count)}
newRdd: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[2] at map at <console>:35
scala> newRdd.collect
res1: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
The code worked fine when I run spark-shell and use the default SparkContext sc which get created when spark-shell is invoked. However, I created my own SparkContext and tried to run the same sequence of code. Before creating my own context, I stop the default created SparkContext using sc.stop
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
When I create SparkContext like this and use broadcasts variable, I get the following exception: org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
Why does it happen like that and what shall I do so that I don't get these errors?Anything I'm missing?
When you start spark-shell, spark-shell creates sparkcontext [ sc ] for you. One jvm can only have one spark-shell. You are trying to create another spark-shell in same jvm. It seems that the version of spark you are on, sparkConf is throwing the exception of class that is not serializable. To avoid this exception use :
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf()
conf.setAppName("MyApp")
conf.set("spark.driver.allowMultipleContexts", "true")
conf.setMaster("local")
val sc = new SparkContext(conf)
References :
a] Multiple SparkContext detected in the same JVM
b] https://issues.apache.org/jira/browse/SPARK-2243
Edit
Solution 1:
Make function for broadcasting variable and call it from shell :
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
object Test{
def bVar( sc:SparkContext, pageMaps: scala.collection.Map[String, String] ) = {
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
}}
val result = Test.bVar(sc, pageMaps)
result: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Spark Accumulator throws "Task not serializable" error
Solution 2 : If you insist not to use function from shell, make sparkcontext and sparkconf as transient.
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
#transient val conf = new SparkConf().setMaster("local").setAppName("MyApp")
#transient val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
res3: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Should I leave the variable as transient?
http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/