Delta Table : org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'FROM' - amazon-emr

I am trying to run the following query on EMR / EMR Notebooks (Spark with Scala):
SELECT max(version), max(timestamp) FROM (DESCRIBE HISTORY delta.`s3://a/b/c/d`)
But I am getting the error shown in the title: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'FROM'.
The same query works fine on Databricks.
Another doubt I have: why does the colour of the S3 location change after the //?
So I tried to break the query apart and run only the DESCRIBE HISTORY part, and for some reason that fails as well.
Error log:
An error was encountered:
org.apache.spark.sql.AnalysisException: Table or view not found: HISTORY;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:835)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:787)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:817)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:810)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:71)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:810)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:756)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1$$anonfun$2.apply(RuleExecutor.scala:92)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1$$anonfun$2.apply(RuleExecutor.scala:92)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:91)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:88)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:164)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$execute$1.apply(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$execute$1.apply(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withLocalMetrics(Analyzer.scala:104)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:125)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:630)
at org.apache.spark.sql.execution.command.DescribeColumnCommand.run(tables.scala:714)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3391)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3390)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:196)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:81)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:644)
... 50 elided
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'history' not found in database 'default';
at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:141)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:723)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:723)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:98)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:722)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:706)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:832)
UPDATE (18-Feb-2021): What I have tried so far.
Querying using spark.sql:
spark.sql("SELECT max(version), max(timestamp) FROM (DESCRIBE HISTORY delta.s3://a/b/c/d)")
But this didn't work; same error.
Created the Spark session with:
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
But it is still throwing the same error.
UPDATE 2 (18-Feb-2021): Tried the approach mentioned by @Alex, using PySpark.
It works partly, but not completely.
Thanks in advance.

Per the documentation, to get support for DESCRIBE HISTORY you need to configure the Spark SQL extensions and catalog by passing two properties (see the docs and the sketch after this list):
spark.sql.extensions set to io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog set to org.apache.spark.sql.delta.catalog.DeltaCatalog
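A minimal PySpark sketch of what that looks like when you build the session yourself; it assumes Spark 3.x and a matching delta-core package (the io.delta:delta-core_2.12:0.8.0 coordinates are an assumption). On EMR Notebooks the same two properties would normally be supplied through the session or cluster configuration rather than in code.

from pyspark.sql import SparkSession

# Enable the Delta SQL extensions and catalog at session creation time.
# The delta-core coordinates are an assumption; use the version matching your Spark release.
spark = (
    SparkSession.builder
    .appName("delta-describe-history")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# With the extensions active, DESCRIBE HISTORY parses; the aggregation is then done
# on the returned DataFrame rather than by nesting it inside a FROM clause.
# s3://a/b/c/d is the placeholder path from the question.
spark.sql("DESCRIBE HISTORY delta.`s3://a/b/c/d`") \
    .selectExpr("max(version)", "max(timestamp)") \
    .show()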
Update:
For Spark 2.4.x, Delta 0.6.1 should be used, and its documentation has the following code snippet to activate the extensions:
from pyspark.sql import SparkSession

# Register the Delta SQL extensions on the existing session, then clone it so they take effect.
spark.sparkContext._jvm.io.delta.sql.DeltaSparkSessionExtension() \
    .apply(spark._jsparkSession.extensions())
spark = SparkSession(spark.sparkContext, spark._jsparkSession.cloneSession())
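If the SQL route still proves awkward on EMR, the history can also be read through the DeltaTable Python API and aggregated as a regular DataFrame. This is a minimal sketch, assuming delta-core is already on the classpath; s3://a/b/c/d is the placeholder path from the question.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# history() returns the commit log as a DataFrame, so no Delta-specific SQL syntax is needed.
history_df = DeltaTable.forPath(spark, "s3://a/b/c/d").history()
history_df.select(F.max("version"), F.max("timestamp")).show()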

Related

AsterixDB ERROR: Code: 1 "HYR0010: Node asterix_nc2 does not exist" M1 Mac

I'm trying to set up a sample cluster with AsterixDB on my M1 Mac. I have my environment up and running, and I am able to successfully make SQL queries with the following code:
drop dataverse csv if exists;
create dataverse csv;
use csv;
create type csv_type as {
    lat: int32,
    long: int32
};
create dataset csv_set (csv_type)
    primary key lat;
However, when I try to load the dataset with a CSV file it seems to brick my sample cluster and throws the error: Error Code: 1 "HYR0010: Node asterix_nc2 does not exist". The code which causes this is below.
use csv;
load dataset csv_set using localfs
(("path"="127.0.0.1:///Users/nicholassantini/Downloads/test.csv"),
("format"="delimited-text"));
Thus far I have tried both Java's newest release (18) and 17.0.3, as well as a variety of ports for the queries. I'm not sure what else to try. Some logs that I think are relevant say that it is failing to connect to the node; I am not sure if that's an issue with the port or the node itself. Here is a snippet of those logs:
(screenshot of the logs omitted)
Also in case it matters, my CSV is a simple 2 column 2 row file with all single-digit integer values.
I appreciate any and all help.
After consulting the developer help email thread, I was able to find that the issue stems from the release of AsterixDB that I was using (0.9.7.1). Upgrading to the newest release (0.9.8) fixed the issue.
The link can be found here:
https://ci-builds.apache.org/job/AsterixDB/job/asterixdb-snapshot-integration/lastSuccessfulBuild/artifact/asterixdb/asterix-server/target/asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip

Create table name using username in Hive query running in Oozie workflow?

I've got a Hive SQL script/action as part of an Oozie workflow. I'm doing a CREATE TABLE AS SELECT to output the results. I want to name the table using the username plus an appended string (e.g. "User123456_output_table"), but can't seem to get the correct syntax.
set tablename=${hivevar:current_user()};
CREATE TABLE `${hiveconf:tablename}_output_table` AS SELECT ...
That doesn't work and gives:
Error while compiling statement: FAILED: IllegalArgumentException java.net.URISyntaxException: Relative path in absolute URI: ${hivevar:current_user()%7D_output_table
Or changing the first line to set tablename=${current_user()}; starts running the SELECT query but eventually stops with:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: [${current_user()}_output_table]: is not a valid table name
Or changing the first line to set tablename=current_user(); starts running the SELECT query but eventually stops with:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: [current_user()_output_table]: is not a valid table name
Alternatively, is there a way to pass the username from the Oozie workflow via a parameter?
I'm using Hue to do all this rather than the command line.
Thanks
This is wrong: set tablename=${hivevar:current_user()}; the variable will not be resolved, it is substituted as is.
Hive does not calculate variables before substitution; it substitutes them as is, and functions inside variables are NOT calculated. Variables are just text replacement.
This:
set tablename=current_user();
CREATE TABLE `${hiveconf:tablename}_output_table` ...
gets resolved as
CREATE TABLE `current_user()_output_table` ...
And functions are not supported in table names, so it will not work this way.
The solution is to calculate functions outside the script and pass them in as parameters (see the sketch after the link below).
See this blog: https://prodlife.wordpress.com/2013/12/06/parameterizing-hive-actions-in-oozie-workflows/
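Outside of Oozie, the same idea can be sketched in Python: compute the value first, then hand it to Hive as a plain variable so the script only does text substitution. This is a minimal sketch; the variable name and script path are hypothetical.

import getpass
import subprocess

# Calculate the "function" result in the calling code, not inside Hive.
username = getpass.getuser()

# Pass the pre-computed value in as a hivevar; Hive substitutes it as plain text.
subprocess.run(
    ["hive",
     "--hivevar", f"tablename={username}_output_table",  # hypothetical variable name
     "-f", "create_output_table.hql"],                   # hypothetical script path
    check=True,
)

# create_output_table.hql would then contain:
#   CREATE TABLE `${hivevar:tablename}` AS SELECT ...

In an Oozie workflow the equivalent is to compute the value in an earlier action and pass it into the Hive action as a parameter, which is essentially the approach the linked post describes.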

DBT 404 Not found: Dataset hello-data-pipeline:staging_benjamin was not found in location EU

When doing "DBT run" I get the following error
{{ config(materialized='table') }}
SELECT customer_id FROM `hello-data-pipeline.adwords.google_ads_campaign_stats`
I am making sure that my FROM location contains 3 parts
A project (hello-data-pipeline)
A database (adwords)
A table (google_ads_campaign_stats)
But I get the following error
15:41:51 | 2 of 3 START table model staging_benjamin.yo......................... [RUN]
15:41:51 | 2 of 3 ERROR creating table model staging_benjamin.yo................ [ERROR in 0.32s]
Runtime Error in model yo (models/yo.sql)
404 Not found: Dataset hello-data-pipeline:staging_benjamin was not found in location EU
NB: BigQuery does not show any error when running this query in the BigQuery editor.
NB 2: dbt does not show any error when running the SQL directly in the script editor.
What am I doing wrong?
You may need to specify a location where your query will run. Queries that run in a specific location may only reference data in that location. You may choose auto-select to run the query in the location where the data resides.
Read more about Dataset locations
OK, I found it. I needed to specify the location in the profiles.yml file.
=> https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile/#dataset-locations
In dbt Cloud you will find it when setting up your project.
I had a similar error to your 'hello-data-pipeline:staging_benjamin was not found in location EU'.
However, my issue was not that the dataset was in the incorrect location; it was that dbt was not targeting the schema I wanted.
E.g. in your example, hello-data-pipeline:staging_benjamin would actually not be the target schema you initially wanted.
Adding this bit of code on top of my query solved the issue.
{{ config(schema='marketing') }}
select ...
cf. dbt's custom schemas: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-schemas
Here is another doc that helped me understand why this was happening:
"dbt Cloud IDE: The values are defined by your connection and credentials. To check any of these values, head to your account (via your profile image in the top right hand corner), and select the project under "Credentials".
https://docs.getdbt.com/reference/dbt-jinja-functions/target

How to do count in Flink SQL

I'd like to do count(0) in Flink SQL, but it gives an exception like:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: SQL parse failed. UDT in DDL is not supported yet.
I don't know whether there is anything wrong with it; I expect the query below to work fine:
INSERT INTO request_join
select requestId,count(0) from requests
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR),requestId;
The schema of the table is here
name: request_join
schema:
- '`requestId` VARCHAR'
- '`count` LONG'
properties:
'connector.type': 'kafka'
'connector.version': 'universal'
'connector.topic': 'request_join_test'
'connector.startup-mode': 'latest-offset'
'connector.properties.0.key': 'zookeeper.connect'
'connector.properties.0.value': '10.XXXXXXXXX'
'connector.properties.1.key': 'bootstrap.servers'
'connector.properties.1.value': '10.XXXXXXXXX'
'connector.properties.2.key': 'group.id'
'connector.properties.2.value': 'request_join_test'
'update-mode': 'append'
'format.type': 'json'
'format.json-schema': '{type: "object", properties: {requestId: { type: "string"}, count: {type: "number"}}}'
I didn't find anything wrong, but it just doesn't work. If I do not count, and delete the count from the schema, it works well, so I'm sure the SQL itself is good.
I checked the Flink SQL docs, and they say some functions are not supported in DDL, so does Flink not support count? I can see from the examples that it supports SUM very well.
There is something wrong with your schema: LONG is not a valid type here, it should be BIGINT:
schema:
- 'requestId VARCHAR'
- 'count BIGINT'

Unable to query using file in Data Proc Hive Operator

I am unable to query using a .sql file in DataProcHiveOperator.
The documentation says that we can query using a file. Link to the documentation: Here
It works fine when I give the query directly.
Here is my sample code, which works fine when the query is written directly:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query='CREATE TABLE TABLE_NAME(NAME STRING);',
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)
Querying with a file:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query='gs://us-central1-bucket/data/sample_hql.sql',
                                          query_uri="gs://us-central1-bucket/data/sample_hql.sql",
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)
There is no error in the sample_hql.sql script.
It reads the file location as a query and throws the following error:
Query: 'gs://bucketpath/filename.q'
Error occuring - cannot recognize input near 'gs' ':' '/'
Similar issue has also been raised Here
The issue is that you have passed query='gs://us-central1-bucket/data/sample_hql.sql' as well.
You should pass exactly one of query or query_uri.
The code in your question has both of them, so remove query or use the following code:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query_uri="gs://us-central1-bucket/data/sample_hql.sql",
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)