How to use NamedDataFrame from spark job server - apache-spark-sql

I used SJS for my project and would like to know how NamedDataFrame from SJS works.
My first program does this
val schemaString = "parm1:int,parm2:string,parm3:string,parm4:string,parm5:int,parm6:string,parm7:int,parm8:int"
val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName.split(":")(0), getFieldTypeInSchema(fieldName.split(":")(1)),true)))
val eDF1 = hive.applySchema(rowRDD1, schema)
this.namedObjects.getOrElseCreate("eDF1", new NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY))
My second program does this to retrieve the DataFrame.
val eDF1: Option[NamedDataFrame] = this.namedObjects.get("eDF1")
Here I am only able to get an Option. How can I turn the NamedDataFrame back into a Spark DataFrame?
Is something equivalent to this available?
this.namedObjects.get[(Int,String,String,String,Int,String,Int,Int)]("eDF1")
Thanks!!
Edit1:
To be precise, without SJS persistence this could be done directly on the DataFrame:
eDF1.filter(eDF1.col("parm1") % 2 !== 0)
How can I perform the same operation from a saved namedObject?

Take a look at https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server-extras/src/spark.jobserver/NamedObjectsTestJob.scala for an example

The following works on NamedDataFrame
Job1
this.namedObjects.getOrElseCreate("df:esDF1", new NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY))
Job2
val NamedDataFrame(eDF1, _, _) = namedObjects.get[NamedDataFrame]("df:esDF1").get
Now I can operate on eDF1 in the second job as a regular Spark DataFrame.
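To address the Edit1 part: once the DataFrame has been unpacked from the NamedDataFrame, the original filter works unchanged. A minimal sketch, assuming Spark 1.x (where Column still defines the !== operator; on 2.x use =!= instead):
// Job 2: retrieve the persisted DataFrame and apply the same filter as before.
val NamedDataFrame(eDF1, _, _) = namedObjects.get[NamedDataFrame]("df:esDF1").get
val oddParm1 = eDF1.filter(eDF1.col("parm1") % 2 !== 0)
oddParm1.show()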

Related

Copy records from one table to another using spark-sql-jdbc

I am trying to do a POC in PySpark for a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but at first I am trying to do it for only one table and later extend it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just never completes.
Spark UI: (screenshot not included here)
Could you please suggest how I should handle it?
Host: local machine
Spark version: 3.0.0
Database: Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser
#read configuration file
config = ConfigParser()
config.read('config.ini')
#setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)
# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Check the executor and driver details in the Spark UI and try increasing the resources. Also share the error it is failing with; you can get it from the same UI, or via the CLI with "yarn logs -applicationId <application_id>".
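One thing worth checking in the reader itself: Spark's JDBC source reads everything through a single partition (one task, one connection) unless partitionColumn, lowerBound and upperBound are supplied together with numPartitions; numPartitions on its own does not split the read. Below is a hedged sketch, written in Scala but using the same option keys as the pyspark reader above; the partition column name and bounds are placeholders that would need to match a numeric, reasonably evenly distributed column in main_table:
// Sketch only: "ID_COLUMN" and the bounds are hypothetical values.
val prprDF = spark.read.format("jdbc")
  .option("url", url)
  .option("user", dbUsr)
  .option("password", dbPwd)
  .option("driver", dbDrvr)
  .option("dbtable", "(SELECT * FROM main_table) main_tbl")
  .option("partitionColumn", "ID_COLUMN") // numeric column used to split the read
  .option("lowerBound", "1")              // approximate minimum of ID_COLUMN
  .option("upperBound", "1000000")        // approximate maximum of ID_COLUMN
  .option("numPartitions", "8")           // number of parallel JDBC connections
  .option("fetchsize", "10000")           // rows fetched per round trip
  .load()
On the write side, raising the batchsize option (default 1000) can likewise speed up the inserts into backup_tbl.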

How to read tables from a location and write data to a table of other cluster

I read table statistics from one metastore by starting the Spark application with hive.metastore.uris set. However, I need to write the data to another Hive.
I've tried clearing the active and default sessions and building another session with the new metastore URI, but Spark keeps trying to write to the table of the first Hive.
val spark = SparkSession.builder()
.appName(appName)
.enableHiveSupport()
.config("hive.metastore.uris", FIRST_METASTORE)
.config("spark.sql.hive.convertMetastoreOrc", "false")
.config("spark.sql.caseSensitive", "false")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
val df = spark.sql("DESCRIBE FORMATTED source_table")
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val spark2 = SparkSession.builder()
.appName(appName)
.enableHiveSupport()
.config("hive.metastore.uris", NEW_MESTASTORE)
.config("spark.sql.hive.convertMetastoreOrc", "false")
.config("spark.sql.caseSensitive", "false")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
SparkSession.setDefaultSession(spark2)
SparkSession.setActiveSession(spark2)
df.write
.format("parquet")
.mode(SaveMode.Overwrite)
.insertInto("other_cluster_table")
As I said, I would expect the DataFrame to be written to the table location of the new metastore and catalog, but it isn't. This happens because DataFrameWriter gets its information from df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) in order to insert into an existing table. How can I deal with this?
After reading about multiple SparkContexts, I solved this by writing the parquet files directly to namenode/directory/to/partition/ and then adding the partition to the table using beeline.
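For reference, the workaround looks roughly like the sketch below; the HDFS path, database, table and partition values are hypothetical placeholders, not the actual ones from this job:
// Write the parquet files straight into the partition directory on the second cluster.
df.write
  .mode(SaveMode.Overwrite)
  .parquet("hdfs://other-namenode:8020/warehouse/other_db.db/other_cluster_table/dt=2020-01-01")
// Then register the partition through beeline against the second cluster's HiveServer2:
//   ALTER TABLE other_db.other_cluster_table ADD IF NOT EXISTS PARTITION (dt='2020-01-01')
//   LOCATION 'hdfs://other-namenode:8020/warehouse/other_db.db/other_cluster_table/dt=2020-01-01';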

How to auto update %spark.sql result in zeppelin for structured streaming query

I'm running Structured Streaming (Spark 2.1.0 with Zeppelin 0.7) on data coming from Kafka, and I'm trying to visualize the streaming result with %spark.sql, as below:
%spark2
val spark = SparkSession
.builder()
.appName("Spark structured streaming Kafka example")
.master("yarn")
.getOrCreate()
val inputstream = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "n11.hdp.com:6667,n12.hdp.com:6667,n13.hdp.com:6667 ,n10.hdp.com:6667, n9.hdp.com:6667")
.option("subscribe", "st")
.load()
val stream = inputstream.selectExpr("CAST( value AS STRING)").as[(String)].select(
expr("(split(value, ','))[0]").cast("string").as("pre_post_paid"),
expr("(split(value, ','))[1]").cast("double").as("DataUpload"),
expr("(split(value, ','))[2]").cast("double").as("DataDowndownload"))
.filter("DataUpload is not null and DataDowndownload is not null")
.groupBy("pre_post_paid").agg(sum("DataUpload") + sum("DataDowndownload") as "size")
val query = stream.writeStream
.format("memory")
.outputMode("complete")
.queryName("test")
.start()
After it is running, I query "test" as below:
%sql
select *
from test
It updates only when I run it manually. My question is: how can I make it update as new data is processed (streaming visualization), as in this example:
Insights Without Tradeoffs: Using Structured Streaming in Apache Spark
Replace the lines
"%sql
select *
from test"
with
%spark
spark.table("test").show()

ElasticSearch to Spark RDD

I was testing ElasticSearch and Spark integration on my local machine, using some test data loaded in elasticsearch.
val sparkConf = new SparkConf().setAppName("Test").setMaster("local")
val sc = new SparkContext(sparkConf)
val conf = new JobConf()
conf.set("spark.serializer", classOf[KryoSerializer].getName)
conf.set("es.nodes", "localhost:9200")
conf.set("es.resource", "bank/account")
conf.set("es.query", "?q=firstname:Daniel")
val esRDD = sc.hadoopRDD(conf,classOf[EsInputFormat[Text, MapWritable]],
classOf[Text], classOf[MapWritable])
esRDD.first()
esRDD.collect()
The code runs fine and returns the correct result with
esRDD.first()
However, esRDD.collect() generates an exception:
java.io.NotSerializableException: org.apache.hadoop.io.Text
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I believe this is related to the issue mentioned here http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
so I added this line accordingly
conf.set("spark.serializer", classOf[KryoSerializer].getName)
Am I supposed to do something else to get it working?
Thank you
Update:
The serialization setup problem was solved by using
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
instead of
conf.set("spark.serializer", classOf[KryoSerializer].getName)
Now there is another issue.
There are 1000 distinct records in this dataset.
esRDD.count()
returns 1000, no problem; however
esRDD.distinct().count()
returns 5! If I print the records with
esRDD.foreach(println)
it prints all 1000 records correctly. But if I use collect or take,
esRDD.collect().foreach(println)
esRDD.take(10).foreach(println)
they print DUPLICATED records, and indeed only 5 UNIQUE records show up, which seem to be a random subset of the entire dataset - it's not the first 5 records.
If I save the RDD and read it back
esRDD.saveAsTextFile("spark-output")
val esRDD2 = sc.textFile("spark-output")
esRDD2.distinct().count()
esRDD2.collect().foreach(println)
esRDD2.take(10).foreach(println)
esRDD2 behaves as expected. I wonder if there is a bug, or something I don't understand about the behavior of collect/take, or whether it is because I'm running everything locally.
By default the RDD seems to use 5 partitions, as shown by the number of part-xxxx files in the "spark-output" directory. That's probably why esRDD.collect() and esRDD.distinct() returned 5 unique records, instead of some other random number. But that's still not right.
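Regarding the duplicated records from collect()/take(): a likely explanation is that SparkContext.hadoopRDD hands back the same Writable instances that the Hadoop RecordReader reuses for every record, so collecting the RDD ends up holding many references to a handful of mutated objects (count and foreach are unaffected because they never retain the records). A minimal sketch of the usual fix, copying the Writables into plain Scala values before collecting:
import scala.collection.JavaConverters._
// Materialise immutable copies of the reused Text/MapWritable instances.
val copied = esRDD.map { case (key, value) =>
  (key.toString,
   value.entrySet.asScala.map(e => e.getKey.toString -> e.getValue.toString).toMap)
}
copied.distinct().count()         // expected to report 1000 again
copied.collect().foreach(println)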
You should use the following code to initialize:
val sparkConf = new SparkConf().setAppName("Test").setMaster("local").set("spark.serializer", classOf[KryoSerializer].getName)
val sc = new SparkContext(sparkConf)
val conf = new JobConf()
conf.set("es.nodes", "localhost:9200")
conf.set("es.resource", "bank/account")
conf.set("es.query", "?q=firstname:Daniel")
you can try
val spark = SparkSession.builder()
.appName("ES")
.master("local[*]")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("es.nodes", "localhost")
.config("es.port", "9200")
.getOrCreate()
val data = spark.read
.format("org.elasticsearch.spark.sql")
.option("es.query", "?q=firstname:Daniel")
.load("bank/account")
.rdd
data.first()
data.collect()

Extensions for Computationally-Intensive Cypher queries

As a follow-up to a previous question of mine, I want to find all 30 pathways that exist between two given nodes within a depth of 4. Something to this effect:
start startnode = node(1), endnode = node(1000)
match startnode-[r:rel_Type*1..4]->endnode
return r
limit 30;
My database contains ~50k nodes and 2M relationships.
As expected, the computation time for this query is very, very long; I even ended up with the following GC message in the message.log file: GC Monitor: Application threads blocked for an additional 14813ms [total block time: 182.589s]. This message keeps occurring and blocks all threads for an indefinite period of time. I am therefore looking for a way to lower the computational strain of this query on the server by optimizing the query.
Is there any extension I could use to help optimize this query?
Give this one a try:
https://github.com/wfreeman/findpaths
You can query the extension like so:
.../findpathslen/1/1000/4/30
And it will give you a json response with the paths found. Hopefully that helps you.
The meat of it is here, using the built-in graph algorithm to find paths of a certain length:
@GET
@Path("/findpathslen/{id1}/{id2}/{len}/{count}")
@Produces(Array("application/json"))
def fof(@PathParam("id1") id1: Long, @PathParam("id2") id2: Long, @PathParam("len") len: Int, @PathParam("count") count: Int, @Context db: GraphDatabaseService) = {
  val node1 = db.getNodeById(id1)
  val node2 = db.getNodeById(id2)
  val pathFinder = GraphAlgoFactory.pathsWithLength(Traversal.pathExpanderForAllTypes(Direction.OUTGOING), len)
  val pathIterator = pathFinder.findAllPaths(node1, node2).asScala
  val jsonMap = pathIterator.take(count).map(p => obj(p))
  Response.ok(compact(render(decompose(jsonMap))), MediaType.APPLICATION_JSON).build()
}