I have a FlatDTO that needs to be mapped to a nested Response containing InfoData and MetaData.
The code for the response is generated by OpenAPI, so the definitions below can't be changed.
package org.mapstruct.example.kotlin.openapi
import com.fasterxml.jackson.annotation.JsonProperty
import javax.validation.Valid
data class Response(
@field:Valid @field:JsonProperty("infoData", required = true) val infoData: InfoData,
@field:Valid @field:JsonProperty("metaData", required = true) val metaData: MetaData
)
data class InfoData(
#field:JsonProperty("id", required = true) val id: kotlin.String,
)
data class MetaData(
#field:JsonProperty("firstProperty") val firstProperty: String? = null,
)
I have defined FlatDTO as follows.
package org.mapstruct.example.kotlin.models
data class FlatDTO(
var id: String? = null,
var firstProperty: String,
)
Here is my mapper class which maps FlatDTO to Response
package org.mapstruct.example.kotlin.mapper
import org.mapstruct.Mapper
import org.mapstruct.Mapping
import org.mapstruct.Mappings
import org.mapstruct.example.kotlin.models.FlatDTO
import org.mapstruct.example.kotlin.openapi.Response
@Mapper
interface DataMapper {
@Mappings(
Mapping(target = "infoData.id", source = "id"),
Mapping(target = "metaData.firstProperty", source = "firstProperty")
)
fun flatToResponse(flatDTO: FlatDTO): Response
}
When I try to build the code using mvn clean install, I get the following error:
error: Property "firstProperty" has no write accessor in MetaData for target name "metaData.firstProperty".
[ERROR] @org.mapstruct.Mappings(value = {@org.mapstruct.Mapping(target = "infoData.id", source = "id"), @org.mapstruct.Mapping(target = "metaData.firstProperty", source = "firstProperty")})
I understand that this message is saying there is no setter function for firstProperty because it's defined as val, but that code cannot be edited. I can write my own custom mapper that works just fine.
I wanted to understand if there is a way to use MapStruct in this scenario.
Following is the code I am experimenting with. I am trying to convert SalesData in CSV to a DataFrame and then to LabeledPoints. However, in the last step I am getting the following compilation error:
package macros contains object and package with same name: blackbox
Can you please give me pointers on what I am doing wrong here? Thank you.
--EDIT--
The compilation issue was solved by adding the 2.11 mllib dependency to build.gradle, but mlData.show fails with:
ERROR: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector
val path = "SalesData.csv"
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
.set("spark.broadcast.compress", "false")
.setAppName("local-spark-kafka-consumer-client")
val sparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
data.cache()
import org.apache.spark.sql.DataFrameNaFunctions
data.na.drop()
data.show
//get monthly sales totals
val summary = data.select("OrderMonthYear","SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear","SaleAmount")
summary.show
// convert ordermonthyear to integer type
//val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
import org.apache.spark.sql.functions._
val test = summary.withColumn("OrderMonthYear", (regexp_replace(col("OrderMonthYear").cast("String"),"-",""))).toDF("OrderMonthYear","SaleAmount")
test.printSchema()
test.show
import sparkSession.implicits._
val mlData = test.select("OrderMonthYear", "SaleAmount").
map(row => org.apache.spark.ml.feature.LabeledPoint(
row.getAs[Double](1),
row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
mlData.show
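For reference, the ClassCastException comes from reading column 0 (a string after the regexp_replace) with getAs[org.apache.spark.ml.linalg.Vector]. A minimal sketch of one way around it, assuming OrderMonthYear is meant to be the single feature and SaleAmount the label, is to build the feature vector explicitly:
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
// LabeledPoint expects (label: Double, features: Vector), so wrap the
// feature column in a Vector instead of casting the String column to Vector.
val mlData = test.select("OrderMonthYear", "SaleAmount")
  .map(row => LabeledPoint(
    row.getAs[Double]("SaleAmount"),
    Vectors.dense(row.getAs[String]("OrderMonthYear").toDouble)))
  .toDF
mlData.show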
In a Spark app (Spark 2.1) I'm trying to send a case class as an input parameter to a function that is meant to run on executors:
object TestJob extends App {
val appName = "TestJob"
val out = "out"
val p = Params("my-driver-string")
val spark = SparkSession.builder()
.appName(appName)
.getOrCreate()
import spark.implicits._
(1 to 100).toDF.as[Int].flatMap(i => Dummy.process(i, p))
.write
.option("header", "true")
.csv(out)
}
object Dummy {
def process(i: Int, v:Params): Vector[String] = {
Vector { if( i % 2 == 1) v + "_odd" else v + "_even" }
}
}
case class Params(v: String)
When I run it with master local[*] everything goes well, but when running in a cluster the Params class state does not get serialized and the output results in
null_even
null_odd
...
Could you please help me understanding what I'm doing wrong?
Googling around I found this post that gave me the solution: Spark broadcasted variable returns NullPointerException when run in Amazon EMR cluster
In the end, the problem is due to extending App.
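A minimal sketch of the fix, based on the code above (Params and Dummy stay as they are): replace extends App with an explicit main method, so that fields like p are initialized before the closure is serialized. With App the body runs via DelayedInit, so those fields can still be null when the object is loaded on the executors.
import org.apache.spark.sql.SparkSession

object TestJob {
  def main(args: Array[String]): Unit = {
    val appName = "TestJob"
    val out = "out"
    // Initialized eagerly inside main, so it is set before serialization.
    val p = Params("my-driver-string")

    val spark = SparkSession.builder()
      .appName(appName)
      .getOrCreate()

    import spark.implicits._

    (1 to 100).toDF.as[Int]
      .flatMap(i => Dummy.process(i, p))
      .write
      .option("header", "true")
      .csv(out)
  }
}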
I'm new to learning Spark SQL, and I'm trying to run the examples provided by the Spark documentation, but I got an error (the error and my program were attached as screenshots).
What should I do?
Note: I'm using IDEA to edit my program.
All of the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._
object SqlTest1 {
case class Person(name: String, age:Long)
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
runBasicDataFrameExample(spark)
}
private def runBasicDataFrameExample(spark: SparkSession)={
val df = spark.read.json("resorces/people.json")
df.show()
}
}
val conf = new SparkConf(true).setAppName("appName")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(path)
It may be that the Scala version does not match. You can check the Scala version and the Spark version.
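As a quick sanity check, both can be printed at runtime (a minimal sketch; spark here is the SparkSession from the code above):
// Print the Spark version and the Scala version the job is running on.
println(s"Spark version: ${spark.version}")
println(s"Scala version: ${scala.util.Properties.versionString}")
The Scala version the application is compiled against should match the one the Spark distribution was built for (e.g. both 2.11).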
I was learning about broadcast variables in Spark, so I tried to make use of them. I'm using spark-shell (version 1.6.0). Following is my code:
scala> val pageurls = sc.parallelize(List(("www.google.com","Google"),("www.yahoo.com","Yahoo")))
pageurls: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
pageCounts: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:27
scala> val pageMaps = pageurls.collectAsMap
pageMaps: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val bMaps = sc.broadcast(pageMaps)
bMaps: org.apache.spark.broadcast.Broadcast[scala.collection.Map[String,String]] = Broadcast(2)
scala> bMaps.value
res0: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val newRdd = pageCounts.map{
| case (url,count) => (url,bMaps.value(url),count)}
newRdd: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[2] at map at <console>:35
scala> newRdd.collect
res1: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
The code worked fine when I ran spark-shell and used the default SparkContext sc, which gets created when spark-shell is invoked. However, I then created my own SparkContext and tried to run the same sequence of code. Before creating my own context, I stopped the default SparkContext using sc.stop:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
When I create the SparkContext like this and use a broadcast variable, I get the following exception: org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
Why does it happen like that, and what shall I do so that I don't get these errors? Is there anything I'm missing?
When you start spark-shell, it creates a SparkContext [sc] for you. One JVM can only have one SparkContext, and you are trying to create another one in the same JVM. It seems that on the Spark version you are on, SparkConf is the class throwing the not-serializable exception. To avoid this exception, use:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf()
conf.setAppName("MyApp")
conf.set("spark.driver.allowMultipleContexts", "true")
conf.setMaster("local")
val sc = new SparkContext(conf)
References :
a] Multiple SparkContext detected in the same JVM
b] https://issues.apache.org/jira/browse/SPARK-2243
Edit
Solution 1:
Make a function that uses the broadcast variable and call it from the shell:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
object Test{
def bVar( sc:SparkContext, pageMaps: scala.collection.Map[String, String] ) = {
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
}}
val result = Test.bVar(sc, pageMaps)
result: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Spark Accumulator throws "Task not serializable" error
Solution 2: If you insist on not using a function from the shell, make the SparkContext and SparkConf transient.
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
@transient val conf = new SparkConf().setMaster("local").setAppName("MyApp")
@transient val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
res3: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Should I leave the variable as transient?
http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/