How to convert spark dataframe[double , String] to LabeledPoint? - apache-spark-sql

Below is the code I am experimenting with. I am trying to load SalesData from a CSV file into a DataFrame and then convert it to LabeledPoints. However, the last step fails with the following compilation error:
package macros contains object and package with same name: blackbox
Can you please give me pointers on what I am doing wrong here? Thank you.
--EDIT--
The compilation issue was solved by adding the Scala 2.11 build of mllib to build.gradle, but mlData.show now fails with:
ERROR: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector
val path = "SalesData.csv"
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
.set("spark.broadcast.compress", "false")
.setAppName("local-spark-kafka-consumer-client")
val sparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
data.cache()
import org.apache.spark.sql.DataFrameNaFunctions
data.na.drop()
data.show
//get monthly sales totals
val summary = data.select("OrderMonthYear","SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear","SaleAmount")
summary.show
// convert ordermonthyear to integer type
//val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
import org.apache.spark.sql.functions._
val test = summary.withColumn("OrderMonthYear", (regexp_replace(col("OrderMonthYear").cast("String"),"-",""))).toDF("OrderMonthYear","SaleAmount")
test.printSchema()
test.show
import sparkSession.implicits._
val mlData = test.select("OrderMonthYear", "SaleAmount").
map(row => org.apache.spark.ml.feature.LabeledPoint(
row.getAs[Double](1),
row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
mlData.show
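The ClassCastException is consistent with column 0 holding a String (the output of regexp_replace), which row.getAs[org.apache.spark.ml.linalg.Vector](0) cannot turn into a Vector. One possible fix, as a minimal sketch: parse the string to a Double and wrap it in a one-element dense vector. The names LabeledPointSketch and toLabeledPoints are hypothetical, and the sketch assumes OrderMonthYear is a digits-only String and SaleAmount is a Double; if the summed column is inferred as a Long, it would need a cast first.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object LabeledPointSketch {
  // Builds (label, features) rows from a frame with a String "OrderMonthYear"
  // column and a Double "SaleAmount" column (both are assumptions about the schema).
  def toLabeledPoints(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    df.select("OrderMonthYear", "SaleAmount")
      .map { row =>
        // Parse the digits-only month string and wrap it in a one-element vector.
        val feature = row.getAs[String]("OrderMonthYear").toDouble
        LabeledPoint(row.getAs[Double]("SaleAmount"), Vectors.dense(feature))
      }
      .toDF()
  }
}
```

With the question's variables this would be called as LabeledPointSketch.toLabeledPoints(sparkSession, test), giving a DataFrame with label and features columns.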

Related

Facing issue while using SparkUDF with multiple arguments

I am trying to encrypt data using SHA-256 by passing the algorithm name as an argument to a Spark UDF, but I get the error below. The program snippet and error details follow.
Code Snippet:
package com.sample
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import java.security.MessageDigest
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction
import javax.xml.bind.DatatypeConverter;
import org.apache.spark.sql.Column
object Customer {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Customer-data").setMaster("local[2]").set("spark.executor.memory", "1g");
val sc = new SparkContext(conf)
val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
//val hash_algm=sc.getConf.get("halgm")
val hash_algm="SHA-256"
val df = spark.read.format("csv").option("header", "true").load("file:///home/tcs/Documents/KiranDocs/Data_files/sample_data")
spark.udf.register("encriptedVal1", encriptedVal)
//calling encription UDF function
//val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df("customer_id"))).show()
val resDF2 = df.withColumn(("ssn_number"), encriptedVal(array("customer_id", hash_algm))).show()
println("data set"+resDF2)
sc.stop()
}
def encriptedVal = udf((s: String,s1:String) => {
val digest = MessageDigest.getInstance(s1)
val hash = digest.digest(s.getBytes("UTF-8"))
DatatypeConverter.printHexBinary(hash)
})
}
Error details are below:
2019-01-21 19:42:48 INFO SparkContext:54 - Invoking stop() from shutdown hook
Exception in thread "main" java.lang.ClassCastException: com.sample.Customer$$anonfun$encriptedVal$1 cannot be cast to scala.Function1
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:104)
at org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:85)
at com.sample.Customer$.main(Customer.scala:26)
at com.sample.Customer.main(Customer.scala)
The problem is how you call the UDF. It is defined with two String parameters, so it must be called with two Column arguments rather than a single array column:
val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df.col("customer_id"), lit(hash_algm)))
Both columns must be of String type, as declared in your UDF, and the algorithm name has to be wrapped in lit(...) so that it is passed as a literal column.
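To make the two-argument shape concrete, here is a minimal sketch of the plain function that such a UDF wraps, written without Spark so it can be tried on its own. HashSketch and hashHex are hypothetical names. The hex formatting via "%02X" is swapped in for DatatypeConverter.printHexBinary, because javax.xml.bind is not shipped with JDK 11 and later.

```scala
import java.security.MessageDigest

object HashSketch {
  // Hashes `value` with the named algorithm and returns upper-case hex,
  // the same output format as DatatypeConverter.printHexBinary.
  def hashHex(value: String, algorithm: String): String = {
    val digest = MessageDigest.getInstance(algorithm)
    digest.digest(value.getBytes("UTF-8")).map(b => "%02X".format(b)).mkString
  }
}
```

HashSketch.hashHex("abc", "SHA-256") returns BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD. Because the function takes two Strings, a UDF built from it expects two Column arguments at the call site, which is why the constant algorithm name goes through lit.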

Unable load a CSV file as dataframe in spark

I am trying to load a CSV file into a DataFrame, using the first row of the file as the column names. With the code below I get this error:
Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
Code:
def main(args : Array[String]): Unit = {
val spark : SparkSession = SparkSession
.builder()
.master("local")
.appName("SparkSessioncsvExample")
.config("spark.some.config.option", "some-value")
.getOrCreate()
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("D:/Scala/C2ImportCalEventSample.csv")}
However, I am able to load the file with this code:
val df = spark.sparkContext
.textFile("D:/Scala/C2ImportCalEventSample1.csv")
//.flatMap(header='true')
.map(line => line.split(","))
// .map(line => line.map()
.toDF()
The second snippet loads the file successfully, but the first row is not used as the column names of the DataFrame.
Spark version: spark-2.3.2
scala 2.11.3
jdk1.8.0_20
sbt-1.2.7
Thanks to anyone who can help me with this.
java.lang.AbstractMethodError almost always means that the libraries on the runtime classpath differ from the ones used at compilation time. In this case I would check that you have the correct version of Scala (and only one version of Scala) on the classpath.
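One way to see which Scala library is actually loaded at runtime is to print it from inside the application. This is a minimal sketch, not part of the original answer; VersionCheck and its members are hypothetical names, while scala.util.Properties.versionNumberString is the standard-library call that reports the loaded Scala library version.

```scala
object VersionCheck {
  // Version of the Scala library that is actually on the runtime classpath.
  def scalaRuntimeVersion: String = scala.util.Properties.versionNumberString

  // Binary version (major.minor); this is the part that has to match the
  // _2.xx suffix of the Spark artifacts the project depends on.
  def scalaBinaryVersion: String =
    scalaRuntimeVersion.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit =
    println(s"Scala on classpath: $scalaRuntimeVersion (binary $scalaBinaryVersion)")
}
```

If the printed binary version differs from the suffix of the Spark jars (for example spark-sql_2.11), the classpath is mixing Scala versions.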

Case Class serialization in Spark

In a Spark app (Spark 2.1) I'm trying to pass a case class as an input parameter to a function that is meant to run on executors:
object TestJob extends App {
val appName = "TestJob"
val out = "out"
val p = Params("my-driver-string")
val spark = SparkSession.builder()
.appName(appName)
.getOrCreate()
import spark.implicits._
(1 to 100).toDF.as[Int].flatMap(i => Dummy.process(i, p))
.write
.option("header", "true")
.csv(out)
}
object Dummy {
def process(i: Int, v:Params): Vector[String] = {
Vector { if( i % 2 == 1) v + "_odd" else v + "_even" }
}
}
case class Params(v: String)
When I run it with master local[*] everything goes well, but when running on a cluster the state of the Params class is not serialized and the output is:
null_even
null_odd
...
Could you please help me understand what I'm doing wrong?
Googling around, I found this post, which gave me the solution: Spark broadcasted variable returns NullPointerException when run in Amazon EMR cluster
In the end, the problem is caused by extending App (object TestJob extends App).
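The effect can be reproduced without Spark. In Scala 2, the body of an object that extends App is deferred until its main method runs, so a field read from elsewhere before that (as happens when the object is loaded on an executor) is still uninitialized. The following is a minimal sketch of that behavior; WithApp, WithMain and DelayedInitDemo are hypothetical names.

```scala
// The val below is assigned inside the deferred App body (Scala 2 semantics).
object WithApp extends App {
  val p = "my-driver-string"
}

// A plain object initializes its fields when it is first accessed.
object WithMain {
  val p = "my-driver-string"
  def main(args: Array[String]): Unit = println(p)
}

object DelayedInitDemo {
  def main(args: Array[String]): Unit = {
    println(WithApp.p)  // WithApp.main was never invoked in this JVM
    println(WithMain.p) // initialized normally on first access
  }
}
```

On Scala 2, which Spark 2.1 uses, the first println is expected to show null, matching the null_even and null_odd output in the question. Replacing extends App with an explicit def main(args: Array[String]) in the job object avoids the deferred initialization.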

sparkSql no such method error

I'm new to Spark SQL and am trying to run the examples from the Spark documentation, but I get a "no such method" error. What should I do?
Note: I'm using IDEA to edit my program.
The full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._
object SqlTest1 {
case class Person(name: String, age:Long)
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
runBasicDataFrameExample(spark)
}
private def runBasicDataFrameExample(spark: SparkSession)={
val df = spark.read.json("resorces/people.json")
df.show()
}
}
val conf = new SparkConf(true).setAppName("appName")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(path)
The Scala version may not match. Check your Scala version against the Scala version your Spark version was built for.

object not serializable org.apache.spark.SparkContext

I was learning about broadcast variables in Spark so I tried to make use of it. I'm using spark-shell (Version 1.6.0). Following is my code:
scala> val pageurls = sc.parallelize(List(("www.google.com","Google"),("www.yahoo.com","Yahoo")))
pageurls: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
pageCounts: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:27
scala> val pageMaps = pageurls.collectAsMap
pageMaps: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val bMaps = sc.broadcast(pageMaps)
bMaps: org.apache.spark.broadcast.Broadcast[scala.collection.Map[String,String]] = Broadcast(2)
scala> bMaps.value
res0: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val newRdd = pageCounts.map{
| case (url,count) => (url,bMaps.value(url),count)}
newRdd: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[2] at map at <console>:35
scala> newRdd.collect
res1: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
The code worked fine when I ran spark-shell and used the default SparkContext sc, which is created when spark-shell is invoked. However, I then created my own SparkContext and tried to run the same sequence of code. Before creating my own context, I stopped the default SparkContext using sc.stop:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
When I create the SparkContext like this and use a broadcast variable, I get the following exception: org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
Why does this happen, and what should I do to avoid these errors? Am I missing anything?
When you start spark-shell, it creates a SparkContext (sc) for you. One JVM can hold only one active SparkContext, and you are trying to create another one in the same JVM. It seems that, in the version of Spark you are on, SparkConf is not serializable, which is what raises the exception. To avoid this exception, use:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf()
conf.setAppName("MyApp")
conf.set("spark.driver.allowMultipleContexts", "true")
conf.setMaster("local")
val sc = new SparkContext(conf)
References :
a] Multiple SparkContext detected in the same JVM
b] https://issues.apache.org/jira/browse/SPARK-2243
Edit
Solution 1:
Wrap the broadcast in a function and call it from the shell:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
object Test{
def bVar( sc:SparkContext, pageMaps: scala.collection.Map[String, String] ) = {
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
}}
val result = Test.bVar(sc, pageMaps)
result: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Spark Accumulator throws "Task not serializable" error
Solution 2: If you prefer not to call a function from the shell, mark the SparkContext and SparkConf as @transient.
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
@transient val conf = new SparkConf().setMaster("local").setAppName("MyApp")
@transient val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
res3: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Should I leave the variable as transient?
http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/
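Why @transient helps can be shown with plain Java serialization, which is what Spark uses by default when it serializes a closure. This is a minimal sketch with hypothetical names (TransientSketch, Handle, Plain, Marked, trySerialize); Handle stands in for a non-serializable object such as the SparkConf reported in the exception.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object TransientSketch {
  // Stand-in for a non-serializable object (does not extend Serializable).
  class Handle(val name: String)

  // Serializing Plain drags the Handle field along and fails.
  class Plain extends Serializable {
    val handle = new Handle("conf")
  }

  // The @transient field is skipped by Java serialization.
  class Marked extends Serializable {
    @transient val handle = new Handle("conf")
  }

  // Returns true if the object can be written with Java serialization.
  def trySerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

TransientSketch.trySerialize(new TransientSketch.Plain) returns false, while the same call with new TransientSketch.Marked returns true. A transient field is not restored on the receiving side, so this only suits values, like the driver-side context, that the task itself never reads.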