Facing issue when writing a query on createOrReplaceTempView in Spark Structured Streaming - apache-spark-sql

Below is my Spark Structured Streaming code, which does its work inside foreachBatch:
df.writeStream.trigger(Trigger.ProcessingTime("10 seconds")).foreachBatch((batchDF: DataFrame, batchId: Long) => {
  batchDF.persist
  batchDF.createOrReplaceTempView("all_notifis")
  batchDF.write.mode(SaveMode.Append).saveAsTable("api_notifications_topics")
  val meta_data = spark.sql("select topic,partition,max(msg_timestamp) as msg_ts ,max(off_set) as max_offset from all_notifis group by topic,partition")
  meta_data.write.mode(SaveMode.Append).saveAsTable("api_notifics_metadata")
  batchDF.unpersist()
}).start().awaitTermination()
Even though I created the temp view all_notifis, Spark tries to fetch it as a table from the Hive default database and throws the error below:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view all_notifis not found in database 'default';
Can anyone help me understand what the issue is?

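This is resolved. Instead of
val meta_data = spark.sql("select topic,partition,max(msg_timestamp) as msg_ts ,max(off_set) as max_offset from all_notifis group by topic,partition")
we have to query through the session that batchDF belongs to:
val meta_data = batchDF.sparkSession.sql("select topic,partition,max(off_set) as max_offset, min(off_set) as min_offset,max(msg_timestamp) as msg_ts from all_notifis group by topic,partition")
The temp view is registered in the session attached to batchDF. Inside foreachBatch that session is not necessarily the same one as the outer spark variable, so spark.sql may not see the view and falls back to looking for a table in the default database.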
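For readers using PySpark, the same pattern can be sketched roughly as follows. This is a minimal sketch, not the original poster's code: the handler name process_batch and the constant META_QUERY are hypothetical, and it assumes a PySpark version in which DataFrame exposes the sparkSession property (3.3 or later, as far as I know). Table and view names are taken from the question.

```python
# Aggregation over the temp view registered for the current micro-batch.
META_QUERY = """
    select topic, partition,
           max(msg_timestamp) as msg_ts,
           max(off_set) as max_offset
    from all_notifis
    group by topic, partition
"""


def process_batch(batch_df, batch_id):
    """foreachBatch handler: append the batch, then append its offset metadata."""
    batch_df.persist()
    try:
        batch_df.createOrReplaceTempView("all_notifis")
        batch_df.write.mode("append").saveAsTable("api_notifications_topics")
        # Query through the session that owns batch_df (assumption: the
        # sparkSession property is available), not through a global `spark`.
        meta_data = batch_df.sparkSession.sql(META_QUERY)
        meta_data.write.mode("append").saveAsTable("api_notifics_metadata")
    finally:
        # Release the cached batch even if one of the writes fails.
        batch_df.unpersist()
```

The handler would then be attached with something like df.writeStream.trigger(processingTime="10 seconds").foreachBatch(process_batch).start().awaitTermination(). The try/finally is an addition of this sketch so that the cached batch is released on failure.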

Related

Pyspark not reading all months from hive database

I am trying to read data from Hive into PySpark in order to write CSV files. The following SQL query returns 5 months:
select distinct posting_date from my_table
When I read the data with PySpark I get only 4 months:
sql_query = 'select * from my_table'
data = spark_session.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
I had the same problem in the past and solved it by using the deprecated API for running SQL:
from pyspark.sql import SQLContext
sql_context = SQLContext(spark_session.sparkContext)
data = sql_context.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
The problem is that my current project has the same issue and neither method solves it.
I also tried HiveContext instead of SQLContext, with no luck.

Error in SQL statement: AnalysisException: Table or view not found:

I've just started with Hive. I'm working on Databricks Community. I normally write Python but wanted to write something in SQL, and I get an error I cannot understand. I cannot see anything wrong in my code. Please help me.
spark.sql("create table happiness_perm as select * from happiness_tmp");
%sql
select Country, count(*) from happiness_perm group by Country
I tried using my DataFrame df_happiness instead of happiness_perm and I still get this:
Error in SQL statement: AnalysisException: Table or view not found: happiness_perm; line 1 pos 30;
'Aggregate ['Country], ['Country, unresolvedalias(count(1), None)]
+- 'UnresolvedRelation [happiness_perm], [], false
I would really appreciate your help!
Try this:
df = spark.sql("select * from happiness_tmp")
df.createOrReplaceTempView("happiness_perm")
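First you load your data into a DataFrame, then you register that DataFrame as a temporary view named happiness_perm. A temporary view does not write any data; it only makes the DataFrame queryable by name within the current Spark session.
You can then query the view.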

SparkSQL Staging Table Row Count vs Hive Row count

I am attempting to extract data from Cassandra into a specific partitioned Hive table using Spark 2.1.1 on Hadoop 2.7. To do this, I load all the data from Cassandra into an RDD, transform it into a DataFrame via rdd.toDF(), and pass it into the following function:
// tablename, db, t, destPartition and LOG are assumed to be defined in the enclosing scope
def writeToHive(ss: SparkSession, df: DataFrame): Unit = {
  df.createOrReplaceTempView(tablename)
  val cols = df.columns
  val schema = df.schema
  // logs 358
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${tablename}""").first().getLong(0).toString)
  val outdf = ss.sql(s"""INSERT INTO TABLE ${db}.${t} PARTITION (date="${destPartition}") SELECT * FROM ${tablename}""")
  // Have also tried the following lines below, but yielded the same results
  // var dfInput_1 = dfInput.withColumn("region", lit(s"${destPartition}"))
  // dfInput_1.write.mode("append").insertInto(s"${db}.${t}")
  // logs 358
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${tablename}""").first().getLong(0).toString)
  // logs 423
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${db}.${t} where date='${destPartition}'""").first().getLong(0).toString)
}
When looking in Cassandra, there are indeed 358 rows in the table. I saw this post on Hortonworks https://community.hortonworks.com/questions/51322/count-msmatch-while-using-the-parquet-file-in-spar.html but there doesn't seem to be a solution. I have tried setting spark.sql.hive.metastorePartitionPruning to true, but no changes were seen in the row counts.
Would love any feedback as to why there is a discrepancy between the row counts. Thanks!
EDIT: The cause was bad data coming in. I should have seen that coming.
Sometimes data contains characters that are not valid UTF-8, for example Japanese or Chinese text stored in another encoding. Check whether your data contains any such characters.
If that is the case, store the table in ORC format. The default storage format is text, which may not handle such characters correctly.

PySpark and HIVE/Impala

I want to build a classification model in PySpark. The input to this model is the result of a select query or a view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that to the model?
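Yes. For this you need to use HiveContext with your SparkContext. (In Spark 2.0 and later, HiveContext is deprecated in favor of a SparkSession built with enableHiveSupport().)
Here is an example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # sc is an existing SparkContext
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a DataFrame that carries the table's schema; check it with tableData.printSchema()
tableData.collect()  # collect() executes the query and returns all rows to the driver
Or refer to the Spark SQL programming guide: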
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html

SparkSQL errors when using SQL DATE function

In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I manually built by reading a csv file and converting the columns into the right data type.
Specifically, the table I'm talking about is the LINEITEM table from the [TPC-H specification][1]. Contrary to the specification, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.
In my single scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
When I submit the packaged jar using spark-submit, I get the following error:
Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but but `;' found
When I omit the semicolon and do the same thing, I get the following error:
Exception in thread "main" java.util.NoSuchElementException: key not found: date
Spark version is 1.4.0.
Does anyone have an idea what's the problem with these queries?
[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf
SQL queries passed to SQLContext.sql should not end with a semicolon. This is the source of your first problem.
The DATE UDF expects a date in the YYYY-MM-DD form, and DATE('1998-12-01 00:00:00') evaluates to null. As long as the timestamp can be cast to DATE, the correct query string looks like this:
"SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
DATE is a Hive UDF. That means you have to use HiveContext rather than a standard SQLContext. This is the source of your second problem.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // where sc is a SparkContext
In Spark >= 1.5 it is also possible to use the to_date function:
import org.apache.spark.sql.functions.{lit, to_date}
import sqlContext.implicits._ // enables the $"column" syntax
df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))
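The first point in this answer (no trailing semicolon) can also be handled defensively before a query string ever reaches the sql method. A minimal sketch in plain Python, with no Spark dependency; the helper name strip_trailing_semicolon is hypothetical and not part of any Spark API:

```python
def strip_trailing_semicolon(query: str) -> str:
    """Return the query without trailing whitespace or statement-ending semicolons."""
    # Drop whitespace first, then any semicolons, then whitespace left before them.
    return query.rstrip().rstrip(";").rstrip()


raw = "SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01');"
clean = strip_trailing_semicolon(raw)
```

The cleaned string is what would be passed to sqlContext.sql. The helper only touches the end of the string, so semicolons inside string literals in the query are left alone.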
Try the Hive function CAST(expression AS toDatatype).
It converts an expression from one data type to another,
e.g. CAST('2016-06-17 00.00.000' AS DATE) converts a String to a Date.
In your case (note that the query string must not end with a semicolon):
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE CAST(l.shipdate as DATE) <= CAST('1998-12-01 00:00:00' AS DATE)")
The supported data type conversions are listed in the Hive documentation under Casting Dates.