ElasticSearch to Spark RDD - serialization

I was testing Elasticsearch and Spark integration on my local machine, using some test data loaded into Elasticsearch.
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.elasticsearch.hadoop.mr.EsInputFormat

val sparkConf = new SparkConf().setAppName("Test").setMaster("local")
val sc = new SparkContext(sparkConf)

val conf = new JobConf()
conf.set("spark.serializer", classOf[KryoSerializer].getName)
conf.set("es.nodes", "localhost:9200")
conf.set("es.resource", "bank/account")
conf.set("es.query", "?q=firstname:Daniel")

val esRDD = sc.hadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])

esRDD.first()
esRDD.collect()
The code runs fine and returns the correct result with
esRDD.first()
However, esRDD.collect() throws an exception:
java.io.NotSerializableException: org.apache.hadoop.io.Text
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I believe this is related to the issue mentioned at http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html, so I added this line accordingly:
conf.set("spark.serializer", classOf[KryoSerializer].getName)
Am I supposed to do something else to get it working?
Thank you
Updates:
The serialization setup problem was solved by using
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
instead of
conf.set("spark.serializer", classOf[KryoSerializer].getName)
Now there is another problem.
There are 1000 distinct records in this dataset.
esRDD.count()
returns 1000, no problem. However,
esRDD.distinct().count()
returns 5! If I print the records with
esRDD.foreach(println)
It prints out the 1000 records correctly. But if I use collect or take
esRDD.collect().foreach(println)
esRDD.take(10).foreach(println)
they print DUPLICATED records, and indeed only 5 UNIQUE records show up, which seem to be a random subset of the entire dataset - they are not the first 5 records.
If I save the RDD and read it back
esRDD.saveAsTextFile("spark-output")
val esRDD2 = sc.textFile("spark-output")
esRDD2.distinct().count()
esRDD2.collect().foreach(println)
esRDD2.take(10).foreach(println)
esRDD2 behaves as expected. I wonder if there is a bug, or something I don't understand about the behavior of collect/take, or whether it is because I'm running everything locally.
By default the Spark RDD seems to use 5 partitions, as shown by the number of part-xxxx files in the "spark-output" directory. That's probably why esRDD.collect() and esRDD.distinct() returned 5 unique records instead of some other random number, but that's still not right.
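One thing worth ruling out here (my assumption, not something confirmed in this thread): Hadoop InputFormats reuse the same Writable instance for every record, so collecting the raw Text/MapWritable pairs to the driver can leave many references to a single mutated object, while foreach prints each record before it is overwritten. A minimal sketch of copying into plain Scala types before collect/distinct:

import org.apache.hadoop.io.{MapWritable, Text}
import scala.collection.JavaConverters._

// Sketch only: convert each (possibly reused) Writable pair into immutable Scala values,
// so every record that reaches the driver is an independent object.
val materialized = esRDD.map { case (key, value) =>
  (key.toString, value.asScala.map { case (k, v) => k.toString -> v.toString }.toMap)
}

materialized.distinct().count()          // should now reflect the real number of distinct records
materialized.collect().foreach(println)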

You should use the following code to initialize:
val sparkConf = new SparkConf()
  .setAppName("Test")
  .setMaster("local")
  .set("spark.serializer", classOf[KryoSerializer].getName)
val sc = new SparkContext(sparkConf)
val conf = new JobConf()
conf.set("es.nodes", "localhost:9200")
conf.set("es.resource", "bank/account")
conf.set("es.query", "?q=firstname:Daniel")

You can also try the elasticsearch-spark Spark SQL integration; it returns Rows rather than Hadoop Writables, so the Text serialization issue does not come up:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ES")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("es.nodes", "localhost")
  .config("es.port", "9200")
  .getOrCreate()

val data = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.query", "?q=firstname:Daniel")
  .load("bank/account")
  .rdd

data.first()
data.collect()

Related

I cannot read a big CSV file with krangl, but I can read a small one

I am just starting with Kotlin and IntelliJ IDEA. I am making some progress, with the help of people at SO. But now I am stuck again. The following program runs flawlessly:
import krangl.*

val dataPath = "C:\\Users\\fsald\\Dropbox\\Temp\\AAPL.CSV"
val dataPath1 = "C:\\Users\\fsald\\Dropbox\\Code\\Julia\\IOFiles\\Input\\AllData.CSV"

fun main(args: Array<String>) {
    krangl.irisData.print()
    val data = krangl.DataFrame.readCSV(dataPath)
    data.print()
}
That is, the two data frames irisData and data are printed correctly on the screen (output omitted).
However, if I add (just before the final curly bracket) a single line that attempts to read a large CSV file (about 600 columns and 25000 rows), the program crashes. The additional line and the error messages are here:
val data1 = krangl.DataFrame.readCSV(dataPath1)
Exception in thread "main" java.lang.NumberFormatException: invalid boolean cell value
at krangl.TableIOKt.cellValueAsBoolean(TableIO.kt:336)
at krangl.TableIOKt.dataColFactory(TableIO.kt:372)
at krangl.TableIOKt.dataColFactory(TableIO.kt:376)
at krangl.TableIOKt.readDelim(TableIO.kt:175)
at krangl.TableIOKt.readDelim$default(TableIO.kt:133)
at krangl.TableIOKt.readDelim(TableIO.kt:129)
at krangl.TableIOKt.readCSV(TableIO.kt:43)
at krangl.TableIOKt.readCSV$default(TableIO.kt:39)
at MainKt.main(Main.kt:12)
Process finished with exit code 1
Any ideas on the reason for this?

Using unionByName to merge 2 DFs with different columns not working with Spark 3.2.1

Hi, I am trying to merge 2 DFs with different columns in Spark, and I came across
unionByName, which has an allowMissingColumns parameter.
I am actually facing issues while using it. Here is my piece of code:
import spark.implicits._
val data = Seq(("1","[unassigned]"))
val DefaultResponseDF = data.toDF("try_number","assessment_item_response_xid")
val data2 = Seq(("2"))
val DefaultResponseDF2 = data2.toDF("try_number")
DefaultResponseDF.unionByName(DefaultResponseDF2, allowMissingColumns=True).show()
When I run this on Spark 3.2.1 and Scala 2.12 on a Databricks cluster, I get this error:
error: not found: value True
DefaultResponseDF.unionByName(DefaultResponseDF2, allowMissingColumns=True).show()
^
Let me know if I am missing anything in using this. I believe it is available from Spark 3.1 onwards, and I am on Spark 3.2, so that can't be the issue.
Let me know if anyone has faced this before.
Never mind, it was because Scala has no True; it should be lowercase true:
DefaultResponseDF.unionByName(DefaultResponseDF2, allowMissingColumns = true).show()
This works
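For reference, a self-contained sketch of the corrected call (assuming a local SparkSession); with allowMissingColumns = true, the column missing from the second DataFrame is filled with nulls:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unionByNameExample").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("1", "[unassigned]")).toDF("try_number", "assessment_item_response_xid")
val df2 = Seq("2").toDF("try_number")

// allowMissingColumns requires Spark 3.1+; missing columns are filled with null.
df1.unionByName(df2, allowMissingColumns = true).show()
// Expected output (schematically):
// +----------+----------------------------+
// |try_number|assessment_item_response_xid|
// +----------+----------------------------+
// |         1|                [unassigned]|
// |         2|                        null|
// +----------+----------------------------+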

Copy records from one table to another using spark-sql-jdbc

I am trying to do a POC in PySpark with a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but at first I am trying to do it only for one table and later enhance it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the job gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish within a few seconds, but it just never completes.
Spark UI: (screenshot omitted)
Could you please suggest how should I handle it ?
Host: local machine
Spark version: 3.0.0
Database: Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser

# read configuration file
config = ConfigParser()
config.read('config.ini')

# setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)

# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Can you check the executor and driver details in the Spark UI and try increasing the resources? Also share the error with which it fails; you can get it from the same UI, or use the CLI to fetch the logs: yarn logs -applicationId <applicationId>.
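One likely reason for the single-threaded read (my assumption; the answer above does not go into it): numPartitions on its own does not split a JDBC read. Spark only parallelizes the JDBC source when partitionColumn, lowerBound and upperBound are supplied as well. A sketch of the relevant DataFrameReader options in Scala (the option names are identical from PySpark); the connection values, the "ID" column and the bounds are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DB refresh").getOrCreate()

// Placeholder connection settings; in the original code these come from config.ini.
val url = "jdbc:oracle:thin:@//dbhost:1521/SERVICE"
val dbUsr = "user"
val dbPwd = "password"
val dbDrvr = "oracle.jdbc.OracleDriver"

val prprDF = spark.read.format("jdbc")
  .option("url", url)
  .option("user", dbUsr)
  .option("password", dbPwd)
  .option("driver", dbDrvr)
  .option("dbtable", "(SELECT * FROM main_table) main_tbl")
  .option("partitionColumn", "ID")  // hypothetical numeric key column of main_table
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "4")     // together with the three options above, this yields 4 parallel reads
  .load()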

calculate median, average using hadoop spark1.6 dataframe, Failed to start database 'metastore_db'

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
1. Using SQLContext
~~~~~~~~~~~~~~~~~~~~
import org.apache.spark.sql.SQLContext

val sqlctx = new SQLContext(sc)
import sqlctx._

val df = sqlctx.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .option("header", "true")
  .load("/user/cloudera/data.csv")

df.select(avg($"col1")).show() // this works fine

sqlctx.sql("select percentile_approx(balance,0.5) as median from port_bank_table").show()
// or
sqlctx.sql("select percentile(balance,0.5) as median from port_bank_table").show()
// both are not working, getting the below error:
org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
2. Using HiveContext
~~~~~~~~~~~~~~~~~~~~
So I tried using HiveContext:
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hivectx = new HiveContext(sc)
18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5be91161
scala> import hivectx._
import hivectx._
Getting the below error:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4,
see the next exception for details.
I can't find any percentile_approx or percentile function among Spark's aggregation functions. This functionality does not seem to be built into Spark DataFrames. For more information, please see How to calculate Percentile of column in a DataFrame in spark?
I hope it helps you.
I don't think so; it should work. For that, you should save the DataFrame as a table using saveAsTable. Then you will be able to run your query using HiveContext.
someDF.write.mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("Table_name")
// In my case "mode" works as mode("Overwrite")
hivectx.sql("select avg(col1) as median from Table_name").show()
It will work.
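Putting the two pieces together, a minimal sketch of that approach (assuming Spark 1.6 with Hive support on the classpath and a working metastore; percentile_approx is a Hive UDAF, so it resolves through HiveContext but not through a plain SQLContext):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hivectx = new HiveContext(sc)
val df = hivectx.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .option("header", "true")
  .load("/user/cloudera/data.csv")

// Persist the data as a table, then call the Hive UDAF through HiveContext.
df.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("port_bank_table")
hivectx.sql("select percentile_approx(balance, 0.5) as median from port_bank_table").show()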

How to use NamedDataFrame from spark job server

I used SJS for my project and would like to know how NamedDataFrame from SJS works.
My first program does this
val schemaString = "parm1:int,parm2:string,parm3:string,parm4:string,parm5:int,parm6:string,parm7:int,parm8:int"
val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName.split(":")(0), getFieldTypeInSchema(fieldName.split(":")(1)),true)))
val eDF1 = hive.applySchema(rowRDD1, schema)
this.namedObjects.getOrElseCreate("edf1", new NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY))
My second program does this to retrieve the DataFrame:
val eDF1: Option[NamedDataFrame] = this.namedObjects.get("eDF1")
Here I am only able to get an Option. How can I cast the NamedDataFrame to a Spark DataFrame?
Is something equivalent to this available?
this.namedObjects.get[(Int,String,String,String,Int,String,Int,Int)]("eDF1")
Thanks!!
Edit1:
To be precise, without SJS persistence, this could be done on the DataFrame:
eDF1.filter(eDF1.col("parm1") % 2 !== 0)
How can I perform the same operation on a saved named object?
Take a look at https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server-extras/src/spark.jobserver/NamedObjectsTestJob.scala for an example.
The following works on a NamedDataFrame.
Job1:
this.namedObjects.getOrElseCreate("df:esDF1", new NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY))
Job2:
val NamedDataFrame(eDF1, _, _) = namedObjects.get[NamedDataFrame]("df:esDF1").get
Now I can operate on eDF1 in the second job as a Spark DataFrame.
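Tying this back to the edit in the question: once the NamedDataFrame is unwrapped, eDF1 is an ordinary Spark DataFrame, so the filter from the question can be applied directly. A short sketch, assuming the object was stored under "df:esDF1" as in Job1 (note that .get throws if the name is missing):

// The pattern match above already bound eDF1 to the underlying DataFrame,
// so standard DataFrame operations work on it in the second job:
eDF1.filter(eDF1.col("parm1") % 2 !== 0).show()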