Hive error: "Unsupported language features in query" in Spark execution - hive

Spark SQL Hive error for a NOT EXISTS clause in an SQL query.
Platform: CDH 5.6.0
Hive version: Hive 1.1.0
The NOT EXISTS query below runs fine at the Hive prompt:
SELECT a,b,c,d FROM interim_t WHERE NOT EXISTS (SELECT a FROM xyz_n ABC where (a=a) AND (b=b) AND (c=c))
But the same statement gives the error "Unsupported language features in query" when executed through Spark.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
sqlContext.sql("INSERT INTO abc_p PARTITION (SRVC_TYPE_CD='1') SELECT a,b,c,d FROM interim_t WHERE NOT EXISTS (SELECT a FROM xyz_n ABC where (a=a) AND (b=b) AND (c=c))")
Execution:
spark-submit --verbose --deploy-mode client /data/abc.py
Error message:
Unsupported language features in query: INSERT INTO abc_p PARTITION
(SRVC_TYPE_CD='1') SELECT a,b,c,d FROM interim_t WHERE NOT EXISTS
(SELECT a FROM xyz_n ABC where (a=a) AND (b=b) AND (c=c))
I think sqlContext.sql does not support NOT EXISTS in Hive queries. Could you please suggest a solution or some alternatives?
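If the parser on your platform really does not accept NOT EXISTS, one commonly used SQL-level workaround is to rewrite the anti-join as a LEFT OUTER JOIN with an IS NULL filter. This is only a sketch under that assumption, reusing the tables and columns from the query above (the aliases i and x are illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Same anti-join semantics as NOT EXISTS, expressed without a subquery.
sqlContext.sql("INSERT INTO abc_p PARTITION (SRVC_TYPE_CD='1') "
               "SELECT i.a, i.b, i.c, i.d FROM interim_t i "
               "LEFT OUTER JOIN xyz_n x ON i.a = x.a AND i.b = x.b AND i.c = x.c "
               "WHERE x.a IS NULL")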

I tried the below in the pyspark shell; it executed just fine with no errors.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SELECT a,b,c,d FROM table1 i WHERE NOT EXISTS (SELECT a FROM table2 x where i.a=x.a AND i.b=x.b AND i.c=x.c)");
I have the following content in test.py:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
sqlContext.sql("SELECT a,b,c,d FROM table1 i WHERE NOT EXISTS (SELECT a FROM table2 x where i.a=x.a AND i.b=x.b AND i.c=x.c)")
Executed the script:
spark-submit --verbose --deploy-mode client test.py
It executed successfully. Can you give it a try?
My setup is Hive 2.1.0 and Spark 2.0.2.
I suspect your Hive version is the issue.

I had the same problem; the solution below worked for me. Put these lines in your file and test:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
# Note: the partition column (SRVC_TYPE_CD) needs to be among the selected columns for partitionBy to apply.
df = sqlContext.sql("SELECT a,b,c,d FROM interim_t WHERE NOT EXISTS (SELECT a FROM xyz_n ABC where (a=a) AND (b=b) AND (c=c))")
df.write.mode("overwrite").partitionBy("SRVC_TYPE_CD").saveAsTable("abc_p")
Apart from this, you can try a few more options: the mode can also be "append", and you can choose the save format too, e.g. mode("append").format("parquet"). The format can be parquet, orc, etc.; see the sketch below.
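As a concrete illustration of those options, here is a minimal sketch of the append + parquet variant, assuming the same df and table name as above (not tested against your setup):

# Append rows to the table instead of overwriting it, stored as parquet.
df.write \
  .mode("append") \
  .format("parquet") \
  .partitionBy("SRVC_TYPE_CD") \
  .saveAsTable("abc_p")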

Related

PySpark not reading all months from Hive database

I am trying to read data from Hive into PySpark in order to write CSV files. The following SQL code returns 5 months:
select distinct posting_date from my_table
When I read the data with PySpark, I only get 4 months:
sql_query = 'select * from my_table'
data = spark_session.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
I had the same problem in the past and solved it by using the deprecated API for reading SQL:
from pyspark.sql import SQLContext

sql_context = SQLContext(spark_session.sparkContext)
data = sql_context.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
The problem is that I have the same issue in my current project and cannot solve it with any of these methods.
I also tried using HiveContext instead of SQLContext, but had no luck.
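One way to narrow this down is to compare the two entry points side by side after refreshing the table's cached metadata; a stale metadata cache hiding a recently added month is only an assumption here, not a confirmed cause (table name my_table as in the question):

from pyspark.sql import SparkSession, SQLContext

spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()

# Drop any cached file/partition metadata for the table before re-reading it.
spark_session.sql("REFRESH TABLE my_table")

# Count distinct posting_date via both entry points and compare the results.
new_api_count = spark_session.sql("select distinct posting_date from my_table").count()
old_api_count = SQLContext(spark_session.sparkContext).sql("select distinct posting_date from my_table").count()
print(new_api_count, old_api_count)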

How to run an SQL query in a PySpark notebook

I have an SQL query which I run in Azure Synapse Analytics to query data from ADLS.
Can I run the same query in a notebook using PySpark in Azure Synapse Analytics?
I googled some ways to run SQL in a notebook, but it looks like some modifications need to be made to the code to do this.
%%sql or spark.sql("")
Query
SELECT *
FROM OPENROWSET(
BULK 'https://xxx.xxx.xxx.xxx.net/datazone/Test/parquet/test.snappy.parquet',
FORMAT = 'PARQUET'
)
Read the data lake file into a dataframe, write it out with saveAsTable, and query the table as shown below.
df = spark.read.load('abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<filename>', format='parquet')
df.write.mode("overwrite").saveAsTable("testdb.test2")
Using %%sql
%%sql
select * from testdb.test2
Using %%pyspark
%%pyspark
df = spark.sql("select * from testdb.test2")
display(df)
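If you would rather not persist a table just to query the file, a temporary view also works; a minimal sketch using the same parquet path as above (the view name parquet_view is illustrative):

df = spark.read.load('abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<filename>', format='parquet')

# Register a session-scoped view instead of writing a persistent table.
df.createOrReplaceTempView("parquet_view")

df2 = spark.sql("select * from parquet_view")
display(df2)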

Cannot create table with Spark SQL: Hive support is required to CREATE Hive TABLE (AS SELECT)

I'm trying to create a table in Spark (Scala) and then insert values from two existing dataframes, but I got this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `stat_type_predicate_percentage`, ErrorIfExists
Here is the code:
import java.io.File

import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}

case class stat_type_predicate_percentage(type1: Option[String], predicate: Option[String], outin: Option[Int], percentage: Option[Float])

object LoadFiles1 {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "LoadFiles1")
    val sqlContext = new SQLContext(sc)
    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()
    import sqlContext.implicits._

    //statistics
    val create = spark.sql("CREATE TABLE stat_type_predicate_percentage (type1 String, predicate String, outin INT, percentage FLOAT) USING hive")
    val insert1 = spark.sql("INSERT INTO stat_type_predicate_percentage SELECT types.type, res.predicate, 0, 1.0*COUNT(subject)/(SELECT COUNT(subject) FROM MappingBasedProperties AS resinner WHERE res.predicate = resinner.predicate) FROM MappingBasedProperties AS res, MappingBasedTypes AS types WHERE res.subject = types.resource GROUP BY res.predicate, types.type")
    val select = spark.sql("SELECT * FROM stat_type_predicate_percentage")
  }
}
How should I solve it?
--- You have to enable Hive support in your SparkSession:
val spark = SparkSession
  .builder()
  .appName("JOB2")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()
This problem may be twofold.
For one, you might want to do what #Tanjin suggested in the comments; it might work afterwards (try adding .config("spark.sql.catalogImplementation","hive") to your SparkSession.builder).
But if you actually want to use an existing Hive instance with its own metadata, which you can query from outside your job, or if you want to use existing tables, you might like to add hive-site.xml to your configuration.
This configuration file contains properties you probably want, such as hive.metastore.uris, which lets your context add new tables that will be saved in the metastore. It will also be able to read from tables in your Hive instance thanks to the metastore, which holds table definitions and locations.
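For reference, a minimal PySpark sketch of the same idea without a hive-site.xml file; the metastore URI and host below are placeholders and an assumption about your environment, not a verified setup:

from pyspark.sql import SparkSession

# Placeholder metastore URI; in practice this usually comes from hive-site.xml.
spark = (SparkSession.builder
         .appName("JOB2")
         .config("spark.sql.catalogImplementation", "hive")
         .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES").show()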

PySpark and HIVE/Impala

I want to build a classification model in PySpark. My input to this model is the result of a select query or view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that to our model?
Yes, for this you need to use HiveContext with the SparkContext.
Here is an example:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is an existing SparkContext
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a dataframe holding a reference to the table's schema; check it with tableData.printSchema()
tableData.collect()  # collect executes the query and returns all rows to the driver
Or you may refer to the Spark SQL programming guide:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html
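Note that the linked guide targets Spark 1.6; on Spark 2.x and later the SparkSession entry point replaces HiveContext. A minimal sketch, with my_table as an illustrative table name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The query result stays in a DataFrame, so it can feed a model directly
# without writing an intermediate text file.
features = spark.sql("SELECT * FROM my_table")
features.printSchema()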

Reading partitioned parquet file into Spark results in fields in incorrect order

For a table with
create table mytable (
..
)
partitioned by (my_part_column String)
We are executing a Hive SQL query as follows:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
data = hc.sql("select * from my_table limit 10")
The values read back show "my_part_column" as the FIRST item in each row instead of the last one.
It turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1:
https://issues.apache.org/jira/browse/SPARK-5049
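On versions that still carry the bug, one possible workaround is to select the columns explicitly in the order you expect so the partition column lands last; the column names below are illustrative placeholders:

from pyspark.sql import HiveContext

hc = HiveContext(sc)
data = hc.sql("select * from my_table limit 10")

# Reorder explicitly: data columns first, partition column last (placeholder names).
ordered = data.select("col1", "col2", "my_part_column")
ordered.show()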