sparkSql no such method error - apache-spark-sql

I'm new to Spark SQL and I'm trying to run the examples provided in the Spark documentation, but I get a "no such method" error (the error and my program were originally posted as screenshots; the full code is below).
What should I do?
Note: I'm using IntelliJ IDEA to edit my program.
All of the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._

object SqlTest1 {

  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._
    runBasicDataFrameExample(spark)
  }

  private def runBasicDataFrameExample(spark: SparkSession) = {
    val df = spark.read.json("resorces/people.json")
    df.show()
  }
}

val conf = new SparkConf(true).setAppName("appName")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(path)

It may be that your Scala version does not match the one your Spark build was compiled against. Check both the Scala version and the Spark version.
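As a quick sanity check (a hedged illustration, not part of the original answer), you can print the versions that are actually on the classpath at runtime:

// Run from spark-shell or inside the application once the SparkSession exists.
println(s"Scala version: ${scala.util.Properties.versionString}")
println(s"Spark version: ${spark.version}")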

Related

error: value show is not a member of Unit CaseFileDFTemp.show()

I ran the below code in a Databricks Scala notebook but I am getting an error.
LIBRARY ADDED : azure-cosmosdb-spark_2.4.0_2.11-1.3.4-uber
CODE :
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType,LongType,FloatType,DoubleType, TimestampType}
import org.apache.spark.sql.cassandra._
//datastax Spark connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.driver.core.{ConsistencyLevel, DataType}
import com.datastax.spark.connector.writer.WriteConf
//Azure Cosmos DB library for multiple retry
import com.microsoft.azure.cosmosdb.cassandra
import sqlContext.implicits._
spark.conf.set("x","x")
spark.conf.set("x","x")
spark.conf.set("x","x")
spark.conf.set("x","x")
val CaseFileDFTemp = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "case_files", "keyspace" -> "shared"))
  .load().show()
CaseFileDFTemp.show()
ERROR:
error: value show is not a member of Unit CaseFileDFTemp.show()
Can you please try creating the SQL context first and then calling the show function:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
Please let me know if it helps.
If you write
val CaseFileDFTemp = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "case_files", "keyspace" -> "shared"))
  .load().show()
then CaseFileDFTemp will have type Unit, because show() "consumes" your DataFrame: it prints the rows and returns Unit. So remove show() from the assignment and it will work.
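For reference, the corrected assignment described above looks like this (same format and options as in the question):
val CaseFileDFTemp = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "case_files", "keyspace" -> "shared"))
  .load() // returns a DataFrame instead of Unit
CaseFileDFTemp.show()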

How to convert spark dataframe[double , String] to LabeledPoint?

Following is the code I am experimenting with. I am trying to convert SalesData in CSV to a DataFrame and then to LabeledPoints. However, in the last step I am getting the following compilation error:
package macros contains object and package with same name: blackbox
Can you please give me pointers on what I am doing wrong here? Thank you.
--EDIT--
The compilation issue was solved by adding the 2.11 mllib dependency to build.gradle, but mlData.show now fails with:
ERROR: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector
val path = "SalesData.csv"
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
.set("spark.broadcast.compress", "false")
.setAppName("local-spark-kafka-consumer-client")
val sparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
data.cache()
import org.apache.spark.sql.DataFrameNaFunctions
data.na.drop()
data.show
//get monthly sales totals
val summary = data.select("OrderMonthYear","SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear","SaleAmount")
summary.show
// convert ordermonthyear to integer type
//val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
import org.apache.spark.sql.functions._
val test = summary.withColumn("OrderMonthYear", (regexp_replace(col("OrderMonthYear").cast("String"),"-",""))).toDF("OrderMonthYear","SaleAmount")
test.printSchema()
test.show
import sparkSession.implicits._
val mlData = test.select("OrderMonthYear", "SaleAmount").
map(row => org.apache.spark.ml.feature.LabeledPoint(
row.getAs[Double](1),
row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
mlData.show
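A hedged sketch of a likely fix (this is not an answer from the original thread): LabeledPoint expects a Double label and an ml.linalg.Vector of features, so the String column cannot simply be cast to Vector. Assuming the summed SaleAmount is the label and the numeric form of OrderMonthYear is the single feature, something along these lines avoids the ClassCastException:

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import sparkSession.implicits._

val mlData = test.select("OrderMonthYear", "SaleAmount")
  .map(row => LabeledPoint(
    row.getAs[Double](1),                          // label: summed SaleAmount
    Vectors.dense(row.getAs[String](0).toDouble))) // single-element feature vector
  .toDF()
mlData.show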

Facing issue while using SparkUDF with multiple arguments

I am trying to encrypt data with SHA-256 by passing it as an argument to a Spark UDF, but I am getting the error below. Please find the program snippet and the error details below.
Code Snippet:
package com.sample

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import java.security.MessageDigest
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.Column

object Customer {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Customer-data").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
    //val hash_algm=sc.getConf.get("halgm")
    val hash_algm = "SHA-256"

    val df = spark.read.format("csv").option("header", "true").load("file:///home/tcs/Documents/KiranDocs/Data_files/sample_data")

    spark.udf.register("encriptedVal1", encriptedVal)

    //calling encription UDF function
    //val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df("customer_id"))).show()
    val resDF2 = df.withColumn(("ssn_number"), encriptedVal(array("customer_id", hash_algm))).show()

    println("data set" + resDF2)
    sc.stop()
  }

  def encriptedVal = udf((s: String, s1: String) => {
    val digest = MessageDigest.getInstance(s1)
    val hash = digest.digest(s.getBytes("UTF-8"))
    DatatypeConverter.printHexBinary(hash)
  })
}
Error details are below:
Exception in thread "main" 2019-01-21 19:42:48 INFO SparkContext:54 -
Invoking stop() from shutdown hook java.lang.ClassCastException:
com.sample.Customer$$anonfun$encriptedVal$1 cannot be cast to
scala.Function1 at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.(ScalaUDF.scala:104)
at
org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:85)
at com.sample.Customer$.main(Customer.scala:26) at
com.sample.Customer.main(Customer.scala)
The problem here is how you call the defined UDF: array("customer_id", hash_algm) builds a single array column, while your UDF expects two separate String arguments. You should call it like the following:
val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df.col("customer_id"), lit(hash_algm)))
because it accepts two Column objects (both columns must be of String type, as defined in your UDF).
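Since the UDF is also registered as "encriptedVal1", an equivalent call (an illustrative alternative, not from the original answer) can go through callUDF from org.apache.spark.sql.functions, which is already imported in the snippet:

// Same fix, but invoking the UDF by its registered name.
val resDF1 = df.withColumn("ssn_number", callUDF("encriptedVal1", df.col("customer_id"), lit(hash_algm)))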

Unable to load a CSV file as a dataframe in Spark

I am trying to load a CSV file into a data frame, and my objective is to use the first row of the CSV file as the column names. But while using the below code, I am getting the error:
Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
Code:
def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("SparkSessioncsvExample")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  val df = spark.read
    .format("csv")
    .option("header", "true") // reading the headers
    .load("D:/Scala/C2ImportCalEventSample.csv")
}
But I'm able to load the file with the code:
val df = spark.sparkContext
.textFile("D:/Scala/C2ImportCalEventSample1.csv")
//.flatMap(header='true')
.map(line => line.split(","))
// .map(line => line.map()
.toDF()
With the second snippet the file loads successfully, but the first row does not become the column names of the data frame.
Spark version: spark-2.3.2
Scala: 2.11.3
JDK: jdk1.8.0_20
sbt: sbt-1.2.7
Thanks to anyone who can help me with this.
java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than were present at compilation time. In this case I would check to make sure you have the correct version of Scala (and only one version of Scala) on the classpath.
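As an illustration of keeping the classpath consistent (a hedged sketch; the versions are examples only), an sbt setup where the Scala version matches the Scala suffix of the Spark artifacts looks like this:

// build.sbt -- Spark 2.3.x is built for Scala 2.11, so pin one 2.11.x Scala
// and let %% append the matching _2.11 suffix to every Spark artifact.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.2",
  "org.apache.spark" %% "spark-sql"  % "2.3.2"
)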

object not serializable org.apache.spark.SparkContext

I was learning about broadcast variables in Spark, so I tried to make use of them. I'm using spark-shell (version 1.6.0). Following is my code:
scala> val pageurls = sc.parallelize(List(("www.google.com","Google"),("www.yahoo.com","Yahoo")))
pageurls: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
pageCounts: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:27
scala> val pageMaps = pageurls.collectAsMap
pageMaps: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val bMaps = sc.broadcast(pageMaps)
bMaps: org.apache.spark.broadcast.Broadcast[scala.collection.Map[String,String]] = Broadcast(2)
scala> bMaps.value
res0: scala.collection.Map[String,String] = Map(www.yahoo.com -> Yahoo, www.google.com -> Google)
scala> val newRdd = pageCounts.map{
| case (url,count) => (url,bMaps.value(url),count)}
newRdd: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[2] at map at <console>:35
scala> newRdd.collect
res1: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
The code worked fine when I ran spark-shell and used the default SparkContext sc that is created when spark-shell is invoked. However, I created my own SparkContext and tried to run the same sequence of code. Before creating my own context, I stop the default SparkContext using sc.stop:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
When I create the SparkContext like this and use broadcast variables, I get the following exception: org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
Why does this happen, and what should I do so that I don't get these errors? Is there anything I'm missing?
When you start spark-shell, it creates a SparkContext [ sc ] for you. One JVM can only have one active SparkContext, and you are trying to create another one in the same JVM. It seems that on the Spark version you are on, SparkConf is throwing the exception because the class is not serializable. To avoid this exception use:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf()
conf.setAppName("MyApp")
conf.set("spark.driver.allowMultipleContexts", "true")
conf.setMaster("local")
val sc = new SparkContext(conf)
References :
a] Multiple SparkContext detected in the same JVM
b] https://issues.apache.org/jira/browse/SPARK-2243
Edit
Solution 1:
Make a function that does the broadcasting and call it from the shell:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
object Test {
  def bVar(sc: SparkContext, pageMaps: scala.collection.Map[String, String]) = {
    val bMaps = sc.broadcast(pageMaps)
    bMaps.value
    val newRdd = pageCounts.map { case (url, count) => (url, bMaps.value(url), count) }
    newRdd.collect
  }
}
val result = Test.bVar(sc, pageMaps)
result: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Spark Accumulator throws "Task not serializable" error
Solution 2: If you insist on not wrapping the code in a function called from the shell, mark the SparkConf and SparkContext as @transient.
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
@transient val conf = new SparkConf().setMaster("local").setAppName("MyApp")
@transient val sc = new SparkContext(conf)
val pageurls = sc.parallelize(List(("www.google.com","Google"), ("www.yahoo.com","Yahoo")))
val pageCounts = sc.parallelize(List(("www.google.com",90),("www.yahoo.com",10)))
val pageMaps = pageurls.collectAsMap
val bMaps = sc.broadcast(pageMaps)
bMaps.value
val newRdd = pageCounts.map{case (url,count) => (url,bMaps.value(url),count)}
newRdd.collect
res3: Array[(String, String, Int)] = Array((www.google.com,Google,90), (www.yahoo.com,Yahoo,10))
Reference : Should I leave the variable as transient?
http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/