Unable to load a CSV file as a DataFrame in Spark - apache-spark-sql

I am trying to load a CSV file into a data frame, and my objective is to use the first row of the CSV file as the column names of the data frame. But while using the code below, I am getting the following error:
Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
Code:
def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("SparkSessioncsvExample")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
  val df = spark.read
    .format("csv")
    .option("header", "true") // reading the headers
    .load("D:/Scala/C2ImportCalEventSample.csv")
}
But I am able to load the file with this code:
val df = spark.sparkContext
  .textFile("D:/Scala/C2ImportCalEventSample1.csv")
  //.flatMap(header='true')
  .map(line => line.split(","))
  // .map(line => line.map()
  .toDF()
With the second snippet the file loads successfully, but the first row is not used as the column names of the data frame.
Spark version: spark-2.3.2
Scala: 2.11.3
JDK: 1.8.0_20
sbt: 1.2.7
Thanks to anyone who can help me with this.

java.lang.AbstractMethodError almost always means that the libraries on the classpath at runtime differ from the ones present at compile time. In this case I would check that you have the correct version of Scala (and only one version of Scala) on the classpath.
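For reference, here is a minimal build.sbt sketch that keeps the Scala and Spark artifacts aligned; the exact patch versions are assumptions, so adjust them to your environment. Using %% lets sbt resolve the artifact built for the declared scalaVersion, which avoids mixing Scala binary versions on the classpath.

// build.sbt (sketch): one Scala version and matching Spark artifacts
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.2",
  "org.apache.spark" %% "spark-sql"  % "2.3.2"
)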

Related

How to pipe a Spark RDD to Python and return an RDD back from Python

Here is my code in the Scala file:
val data = List("cdn.resized-images.covethome.com")
val dataRDD = sc.makeRDD(data)
val scriptPath = "/Users/hareshbhuriya/Documents/final-python-integration/web_domain_cleansing.py"
val result = dataRDD.pipe(scriptPath)
I am reading that string in the Python file using:
for line in sys.stdin:
    output = sample_function(line)
    print(output)
Here the output from the Python file is a pandas DataFrame, and I want to get it back inside the Scala file from where I called the pipe function.
If, instead of passing line as the parameter to sample_function, I pass the string "cdn.resized-images.covethome.com" directly, I get the desired output, like this:
for line in sys.stdin:
    output = sample_function("cdn.resized-images.covethome.com")
    print(output)
But I want to pass the string dynamically. How should I handle/manipulate the string coming from sys.stdin? Please help.
I needed to use line.rstrip(); then everything works great. Thank you.
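For completeness, a minimal Scala-side sketch (not from the original answer) of how the piped output can be pulled back into a DataFrame: pipe() returns an RDD[String] with one element per line the Python script prints, and the column name "cleaned" below is purely illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("PipeExample").getOrCreate()
import spark.implicits._

val data = List("cdn.resized-images.covethome.com")
val dataRDD = spark.sparkContext.makeRDD(data)

// Each RDD element is written to the script's stdin; whatever the script
// prints to stdout comes back as an RDD[String], one element per line.
val scriptPath = "/Users/hareshbhuriya/Documents/final-python-integration/web_domain_cleansing.py"
val result = dataRDD.pipe(scriptPath)

// Turn the returned lines into a single-column DataFrame ("cleaned" is a hypothetical name).
val resultDF = result.toDF("cleaned")
resultDF.show(false)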

How to convert a Spark DataFrame[Double, String] to LabeledPoint?

Following is the code that I am experimenting with. I am trying to convert SalesData from CSV to a DataFrame and then to LabeledPoints. However, in the last step I am getting the following compilation error:
package macros contains object and package with same name: blackbox
Can you please give me pointers on what I am doing wrong here? Thank you.
--EDIT--
The compilation issue was solved by adding the 2.11 mllib dependency to build.gradle, but mlData.show fails with:
ERROR: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector
val path = "SalesData.csv"
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
  .set("spark.broadcast.compress", "false")
  .setAppName("local-spark-kafka-consumer-client")
val sparkSession = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()

val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
data.cache()

import org.apache.spark.sql.DataFrameNaFunctions
data.na.drop()
data.show

// get monthly sales totals
val summary = data.select("OrderMonthYear", "SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear", "SaleAmount")
summary.show

// convert OrderMonthYear to integer type
//val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
import org.apache.spark.sql.functions._
val test = summary.withColumn("OrderMonthYear", regexp_replace(col("OrderMonthYear").cast("String"), "-", "")).toDF("OrderMonthYear", "SaleAmount")
test.printSchema()
test.show

import sparkSession.implicits._
val mlData = test.select("OrderMonthYear", "SaleAmount").
  map(row => org.apache.spark.ml.feature.LabeledPoint(
    row.getAs[Double](1),
    row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
mlData.show
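Not an authoritative fix, but for illustration: column 0 ("OrderMonthYear") is a plain string/numeric column, not an ml Vector, so getAs[Vector] triggers exactly this ClassCastException. Below is a sketch of the last step that builds the feature vector explicitly, assuming the dash-stripped OrderMonthYear should become a single numeric feature and the summed SaleAmount is the label.

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import sparkSession.implicits._

val mlData = test.select("OrderMonthYear", "SaleAmount")
  .map { row =>
    // OrderMonthYear was cast to String above, e.g. "201610"
    val feature = row.getAs[String]("OrderMonthYear").toDouble
    // sum() may have produced a Long or a Double depending on the inferred schema
    val label = row.getAs[Number]("SaleAmount").doubleValue()
    LabeledPoint(label, Vectors.dense(feature))
  }
  .toDF()
mlData.show()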

How can I save Kafka Structured Streaming data as a DataFrame and apply parsing on it?

I am trying to read real-time streaming data from Kafka topics through Spark Structured Streaming. My understanding is that I would need the streaming to stop at some point so I can apply my parsing logic on it and push it to MongoDB. Is there a way to save the streaming data into a separate dataframe, with or without stopping the streaming?
I checked the guide and other blogs, and I am not getting a straightforward answer for my requirement.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092, host:9092, host:9092")
  .option("subscribe", "TOPIC_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
  .option("startingOffsets", "earliest")
  .load()

val dfs = df.selectExpr("CAST(value AS STRING)")

val consoleOutput = dfs.writeStream
  .outputMode("append")
  .format("console")
  .start()
consoleOutput.awaitTermination()
consoleOutput.stop()
I need the streaming data to be saved somehow in a dataframe, either by stopping the streaming or without stopping it.
Below is the parsing logic I have. Instead of picking the dataset from a file path, I need the streamed data to be my new dataset, and I should be able to apply the rest of my logic and get the output. Saving it to Mongo is not my primary focus now.
val log = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("C:\\Users\\raheem_mohammed\\IdeaProjects\\diag.csv")
log.createOrReplaceTempView("logs")

val df = spark.sql("select _raw, _time from logs").toDF

// Adds an Id number to each of the events
val logs = df.withColumn("Id", monotonicallyIncreasingId() + 1)

// Register the Dataframe as a temp table
logs.createOrReplaceTempView("logs")
val dfss = spark.sql("select Id, value from logs")
// Extracts columns from the _raw column. Also finds the probabilities of compositeNames.
// If true then the compositeName belongs to one of the four possibilities.
val extractedDF = dfss.withColumn("managed_server", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]", 2))
  .withColumn("alert_summary", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]", 3))
  .withColumn("oracle_details", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]", 5))
  .withColumn("ecid", regexp_extract($"_raw", "(?<=ecid: )(.*?)(?=,)", 1))
  //.withColumn("CompName", regexp_extract($"_raw", """.*(composite_name|compositename|composites|componentDN):\s+(\S+)\]""", 2))
  .withColumn("CompName", regexp_extract($"_raw", """.*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""", 2))
  .withColumn("composite_name", col("_raw").contains("composite_name"))
  .withColumn("compositename", col("_raw").contains("compositename"))
  .withColumn("composites", col("_raw").contains("composites"))
  .withColumn("componentDN", col("_raw").contains("componentDN"))

// Filters out any NULL values if found
val finalData = extractedDF.filter(
  col("managed_server").isNotNull &&
  col("alert_summary").isNotNull &&
  col("oracle_details").isNotNull &&
  col("ecid").isNotNull &&
  col("CompName").isNotNull &&
  col("composite_name").isNotNull &&
  col("compositename").isNotNull &&
  col("composites").isNotNull &&
  col("componentDN").isNotNull)
finalData.show(false)
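Not from the original post, but one commonly used pattern (assuming Spark 2.4 or later) is foreachBatch, which hands each micro-batch to your code as an ordinary, static DataFrame, so the existing parsing logic can run on it without stopping the stream. A minimal sketch with a single hypothetical topic and broker:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("KafkaBatchParse").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "TOPIC_P2_R2")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING)")

val query = stream.writeStream
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // batchDF is an ordinary, non-streaming DataFrame holding this micro-batch,
    // so the regexp_extract / filter logic above can be applied to it directly.
    val withId = batchDF.withColumn("Id", monotonically_increasing_id() + 1)
    withId.show(false)
    // ...apply the rest of the parsing and write to MongoDB here if needed
  }
  .start()

query.awaitTermination()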

sparkSql no such method error

I'm new to Spark SQL, and I'm trying to run the examples provided in the Spark documentation, but I get a NoSuchMethodError (the error and my program were posted as screenshots). What should I do?
Note: I'm using IntelliJ IDEA to edit my program.
Here is all of the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._

object SqlTest1 {

  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._
    runBasicDataFrameExample(spark)
  }

  private def runBasicDataFrameExample(spark: SparkSession) = {
    val df = spark.read.json("resorces/people.json")
    df.show()
  }
}
val conf = new SparkConf(true).setAppName("appName")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(path)
It may also be that the Scala version does not match; check that your Scala version and Spark version are compatible.
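As a quick sanity check (a small sketch, not part of the original answer), you can print the versions that are actually on the classpath at runtime:

// Scala version the application is actually running against
println(scala.util.Properties.versionString)
// Spark version of the running session
println(spark.version)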

Spark SQL - Recursive reading in folder

I am trying to use HiveContext to take advantage of some window functions from HiveQL in Spark SQL. But it does not let me read data files recursively from a folder (partitioned by year and month).
My folders:
data/outputOozie/22/year=2016
data/outputOozie/22/year=2016/month=10
data/outputOozie/22/year=2016/month=9
data/outputOozie/22/year=2016/month=10/1
data/outputOozie/22/year=2016/month=10/2
data/outputOozie/22/year=2016/month=10/3
data/outputOozie/22/year=2016/month=9/1
data/outputOozie/22/year=2016/month=9/2
data/outputOozie/22/year=2016/month=9/3
Here is how I initiate my Hive Context:
val conf = new SparkConf().setAppName("Extraction process for ").setIfMissing("spark.master", "local[*]")
val sc = SparkContext.getOrCreate(conf)
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
sc.hadoopConfiguration.set("hive.mapred.supports.subdirectories","true")
val hiveContext = sqlContext.asInstanceOf[HiveContext]
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
hiveContext.setConf("mapred.input.dir.recursive","true")
hiveContext.setConf("hive.mapred.supports.subdirectories","true")
Read file:
hiveContext.read.parquet(URLDecoder.decode(partitionLocation.get.toString, "UTF-8"))
==> Exception: file not found
But it works fine with SQLContext:
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Thanks for any suggestions!!!!
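No accepted answer was posted here, but for what it's worth, a minimal sketch of constructing the HiveContext directly from the SparkContext (rather than casting an existing SQLContext) with the same recursive-read settings. Whether this resolves the file-not-found error for this particular folder layout is not confirmed; the numeric subfolders under each month= directory may still need handling depending on the Spark version.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Extraction process for ").setIfMissing("spark.master", "local[*]")
val sc = SparkContext.getOrCreate(conf)

// Build the HiveContext directly from the SparkContext instead of casting an
// existing SQLContext (that cast only succeeds when sqlContext already is a
// HiveContext, as in a spark-shell started with Hive support).
val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
hiveContext.setConf("mapred.input.dir.recursive", "true")
hiveContext.setConf("hive.mapred.supports.subdirectories", "true")

// Read from the partitioned folder (path shortened for illustration).
val df = hiveContext.read.parquet("data/outputOozie/22/year=2016")
df.show(false)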