Unable to query using file in Data Proc Hive Operator - hive

I am unable to run a query from a .sql file with DataProcHiveOperator, even though the documentation says that a file can be used. Link to the documentation Here.
It works fine when I give the query directly.
Here is my sample code, which works fine when the query is written inline:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query='CREATE TABLE TABLE_NAME(NAME STRING);',
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)
Querying with a file:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query='gs://us-central1-bucket/data/sample_hql.sql',
                                          query_uri="gs://us-central1-bucket/data/sample_hql.sql",
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)
There is no error in the sample_hql.sql script itself.
The operator is reading the file location as the query and throwing this error:
Query: 'gs://bucketpath/filename.q'
Error occurring: cannot recognize input near 'gs' ':' '/'
A similar issue has also been raised Here.

The issue is that you have passed query='gs://us-central1-bucket/data/sample_hql.sql' as well.
You should pass exactly one of query or query_uri.
The code in your question has both of them, so remove query or use the following code:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query_uri="gs://us-central1-bucket/data/sample_hql.sql",
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)

Related

Some clarification around SQL verbiage within R, specifically with dbListFields

I'm using R to connect to an Oracle Database. I'm able to do so with this code:
library(rJava)
library(RJDBC)
jdbcDriver <- JDBC(driverClass="oracle.jdbc.driver.OracleDriver", classPath="jar files/ojdbc8.jar")
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=blablablablabla.com)(PORT=1234))(CONNECT_DATA=(SERVICE_NAME=pdblhs)))", "myusername", "mypassword")
DataFrame <- dbGetQuery(jdbcConnection, "select protocol_id, disease_site_code, disease_site
From abc_place_prod.Rv_sip_PCL_disease_site
order by protocol_id")
# Close connection
dbDisconnect(jdbcConnection)
This snippet of code works. It downloads the data I want into a neat data frame and works as planned.
But I'm trying to learn how to navigate R/SQL more, and I wanted to use dbListFields to explore all the possible column names in the table I'm pulling from. Per the documentation here, you can type something like this:
dbListFields(con, "mtcars")
and it would work. So I tried:
dbListFields(jdbcConnection, "abc_place_prod.Rv_sip_PCL_disease_site")
and it returned this error:
Error in dbSendQuery(conn, paste("SELECT * FROM ", dbQuoteIdentifier(conn, :
Unable to retrieve JDBC result set
JDBC ERROR: ORA-00933: SQL command not properly ended
It seems like someone else experienced something similar here, and I'm not sure it ever got answered. I've tried it with a semicolon at the end (that "ended" something else properly for me, and it's mentioned in the comments of this question) with no luck.

Delta Table : org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'FROM'

I am trying to run the query on EMR/EMR Notebooks (Spark with Scala) -
SELECT max(version), max(timestamp) FROM (DESCRIBE HISTORY delta.`s3://a/b/c/d`)
But I am getting the ParseException mentioned in the title (mismatched input 'FROM').
The same query works fine on Databricks.
Another doubt I have: why does the colour of the s3 location change after the //?
So I tried to break the above query apart and only run the DESCRIBE HISTORY part. For some reason it fails with the following error log:
An error was encountered:
org.apache.spark.sql.AnalysisException: Table or view not found: HISTORY;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:835)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:787)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:817)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:810)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:71)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:810)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:756)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1$$anonfun$2.apply(RuleExecutor.scala:92)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1$$anonfun$2.apply(RuleExecutor.scala:92)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:91)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:88)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:164)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$execute$1.apply(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$execute$1.apply(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withLocalMetrics(Analyzer.scala:104)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:125)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:630)
at org.apache.spark.sql.execution.command.DescribeColumnCommand.run(tables.scala:714)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3391)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3390)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:196)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:81)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:644)
... 50 elided
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'history' not found in database 'default';
at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:141)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:723)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:723)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:98)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:722)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:706)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:832)
UPDATE (18-Feb-2021): what I have tried so far.
Query using Spark SQL:
spark.sql("SELECT max(version), max(timestamp) FROM (DESCRIBE HISTORY delta.`s3://a/b/c/d`)")
But this didn't work; same error.
Created the Spark session with
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
and spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog,
but it's throwing the same error.
UPDATE 2 (18-Feb-2021): trying the approach mentioned by @alex, using PySpark.
It was working partly, but not completely.
Thanks in advance.
Per the documentation, to get support for DESCRIBE HISTORY you need to configure the Spark SQL extensions and catalog by passing two properties (see docs):
spark.sql.extensions set to io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog set to org.apache.spark.sql.delta.catalog.DeltaCatalog
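For illustration, a minimal PySpark sketch of a session created with these two properties (assuming Spark 3.x and a matching delta-core package on the classpath; the app name is a placeholder):
from pyspark.sql import SparkSession

# Build a session with the Delta SQL extension and catalog enabled
spark = (SparkSession.builder
         .appName("delta-history")  # placeholder app name
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# With the extension active, DESCRIBE HISTORY should parse; aggregating the result
# through the DataFrame API avoids nesting it inside a FROM subquery.
history = spark.sql("DESCRIBE HISTORY delta.`s3://a/b/c/d`")
history.selectExpr("max(version)", "max(timestamp)").show()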
Update:
For Spark 2.4.x, Delta 0.6.1 should be used, and its documentation has the following code snippet to activate the extensions:
from pyspark.sql import SparkSession  # needed to re-wrap the cloned Java session

spark.sparkContext._jvm.io.delta.sql.DeltaSparkSessionExtension() \
    .apply(spark._jsparkSession.extensions())
spark = SparkSession(spark.sparkContext, spark._jsparkSession.cloneSession())
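On Spark 2.4 / Delta 0.6.x the table history can also be read through the DeltaTable Python API, which avoids the SQL parser entirely; a minimal sketch, assuming the delta-core package is available and using the question's placeholder path:
from delta.tables import DeltaTable

# Load the Delta table by path and aggregate its commit history
dt = DeltaTable.forPath(spark, "s3://a/b/c/d")
dt.history().selectExpr("max(version)", "max(timestamp)").show()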

Create table name using username in Hive query running in Oozie workflow?

I've got a Hive SQL script/action as part of an Oozie workflow. I'm doing a CREATE TABLE AS SELECT to output the results. I want to name the table using the username plus an appended string (e.g. "User123456_output_table"), but can't seem to get the correct syntax.
set tablename=${hivevar:current_user()};
CREATE TABLE `${hiveconf:tablename}_output_table` AS SELECT ...
That doesn't work and gives:
Error while compiling statement: FAILED: IllegalArgumentException java.net.URISyntaxException: Relative path in absolute URI: ${hivevar:current_user()%7D_output_table
Or changing the first line to set tablename=${current_user()}; starts running the SELECT query but eventually stops with:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: [${current_user()}_output_table]: is not a valid table name
Or changing the first line to set tablename=current_user(); starts running the SELECT query but eventually stops with:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: [current_user()_output_table]: is not a valid table name
Alternatively, is there a way to pass the username from the Oozie workflow via a parameter?
I'm using Hue to do all this rather than the command line.
Thanks
This is wrong: set tablename=${hivevar:current_user()}; it will not be resolved; it is substituted as is.
Hive does not evaluate variables before substitution; it substitutes them as is, and functions inside variables are NOT calculated. Variables are just text replacement.
This:
set tablename=current_user();
CREATE TABLE `${hiveconf:tablename}_output_table` ...
gets resolved as
CREATE TABLE `current_user()_output_table` ...
Functions are not supported in table names, so it will not work this way.
The solution is to calculate such values outside the script (for example, in the Oozie workflow or a shell action) and pass them in as parameters.
See this blog: https://prodlife.wordpress.com/2013/12/06/parameterizing-hive-actions-in-oozie-workflows/

Issue in SoapUI with SQL using Groovy script: the working code stopped working

I am hitting a DB2 database using Groovy scripts in SoapUI, and the code below was working fine for me for some time.
However, it suddenly stopped working and I am stuck, unable to find a solution.
The query: SELECT Column_ID FROM Table_Name where Column_TIMESTAMP > '2016-10-26 05:37:22'
Below is the code
def sqlQueryPopID = "SELECT Column_ID FROM Table_Name where Column_TIMESTAMP>'2016-10-26 05:37:22'"
def popID
// The code is not entering the loop below; it was working fine a day back, not sure what happened
sqlITOD.query(sqlQueryPopID) { resultSet ->
    while (resultSet.next()) {
        popID = resultSet.getString(1)
        log.info("Population ID is:" + popID)
    }
} // end of result set
log.info("The Sql Query:" + sqlITOD.query(sqlQueryPopID)) // This line is giving the error below
ERROR:
Wed Oct 26 05:37:58 EDT 2016:INFO:groovy.lang.MissingMethodException: No signature of method: groovy.sql.Sql.query() is applicable for argument types: (java.lang.String) values: [SELECT Column_ID FROM Table_Name where Column_TIMESTAMP>'2016-10-26 05:37:22']
Possible solutions: query(java.lang.String, groovy.lang.Closure), query(groovy.lang.GString, groovy.lang.Closure), query(java.lang.String, java.util.List, groovy.lang.Closure), query(java.lang.String, java.util.Map, groovy.lang.Closure), query(java.util.Map, java.lang.String, groovy.lang.Closure), every()
Please help me figure out what went wrong, as this code was working fine and somehow stopped working.
The signature of the method is Sql.query(groovy.lang.GString, groovy.lang.Closure).
So this line in your code, log.info("The Sql Query:" + sqlITOD.query(sqlQueryPopID)), cannot work, since there is no query(String) method in Sql.
The rest of your code seems correct; I think the problem is with your data. Your query filters on where Column_TIMESTAMP > '2016-10-26 05:37:22', so probably there is simply no row in the DB that matches this criterion.
By the way, there is a more Groovy way to do the same thing using eachRow instead:
sqlITOD.eachRow(sqlQueryPopID) { row ->
    // row properties are the selected column names
    log.info("Population ID is: ${row.Column_ID}")
}

SQL STATE 37000 [Microsoft][ODBC Microsoft Access Driver] Syntax Error or Access Violation

Good day!
I get this error:
SQL STATE 37000 [Microsoft][ODBC Microsoft Access Driver] Syntax Error or Access Violation
when trying to run an embedded SQL statement in PowerScript.
I am using MS SQL Server 2008 and PowerBuilder 10.5, and the OS is Windows 7. I was able to determine one of the queries that is causing the problem:
SELECT top 1 CONVERT(DATETIME,:ls_datetime)
into :ldtme_datetime
from employee_information
USING SQLCA;
if SQLCA.SQLCODE = -1 then
Messagebox('SQL ERROR',SQLCA.SQLERRTEXT)
return -1
end if
I was able to come up with a solution to this by just using PowerBuilder's datetime() function. But there are other parts of the program that are causing this, and I am having a hard time identifying which part of the program causes it. I find this very weird because I am running the same scripts on my dev PC with no problems at all, but when trying to run the program on my client's workstation I get this error. I haven't found any differences between the workstation and my dev PC. I also tried following the instructions here, but the problem still occurs.
UPDATE: I was able to identify the other script that is causing the problem:
/////////////////////////////////////////////////////////////////////////////
// f_datediff
// Computes the time difference (in number of minutes) between adtme_datefrom and adtme_dateto
////////////////////////////
decimal ld_time_diff
SELECT top 1 DATEDIFF(MINUTE,:adtme_datefrom,:adtme_dateto)
into :ld_time_diff
FROM EMPLOYEE_INFORMATION
USING SQLCA;
if SQLCA.SQLCODE = -1 then
Messagebox('SQL ERROR',SQLCA.SQLERRTEXT)
return -1
end if
return ld_time_diff
Seems like passing datetime variables causes the error above. Other scripts are working fine.
Create a transaction user object inherited from transaction.
Put logic in the SQLPreview event of your object to capture and log the SQL statement being sent to the DB.
Instantiate it, connect to the DB, and use it in your embedded SQL.
Assuming the user gets the error, you can then check what was being sent to the DB and go from there.
The error in your first statement is likely the second parameter to the CONVERT function.
Its type is not a string; it must be a valid expression:
https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql
So I would expect that your
CONVERT(DATETIME,:ls_datetime)
evaluates to
CONVERT(DATETIME, 'ls_datetime')
but it should be
CONVERT(DATETIME, DateTimeColumn)
The error in your second statement could be that you're providing an incorrect datetime format.
So please check whether the error still occurs when you use SET DATEFORMAT
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-dateformat-transact-sql
with the correct datetime format.