I have been trying to query some really simple Hive views using Glue Data Catalog and Presto/Trino on EMR with no luck.
The error is either 'View not found' or 'Hive views not supported'. I have tried to configure Trino with both the legacy and the experimental Hive view handling modes (as explained in this doc), but when I override the default behavior in Trino (which is to ignore the views), the trino-server service just does not start.
The views I'm trying to read are really simple and should be supported by either the legacy or the experimental mode; they are also defined in plain ANSI SQL.
Is there a known incompatibility between the Data Catalog and Presto? Or maybe with EMR? I know the problem is not a version issue, as I have tested this behavior with multiple versions and it's always the same.
Update -> The CREATE VIEW statement I get back from Hive looks like this for all my views; only ANSI SQL is used:
CREATE VIEW `schema`.`view` AS
select
`table`.`col1`,
`table`.`col2`,
`table`.`col3`
from
`schema`.`table`
where
table.datetime = '20220623'
Update -> I was able to read the views using Trino once I applied the setting "hive.views-execution.enabled": "true" under the trino-connector-hive classification in the EMR Software Settings, but I still can't query views using Presto with this configuration:
[
  {
    "classification": "presto-connector-hive",
    "properties": {
      "hive.metastore.glue.datacatalog.enabled": "true",
      "hive.views-execution.enabled": "true"
    }
  }
]
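For reference, the Trino-side classification mentioned in the update above would look roughly like this (a sketch: only "hive.views-execution.enabled" is taken from the update, and carrying the Glue setting over to the Trino classification is an assumption):
[
  {
    "classification": "trino-connector-hive",
    "properties": {
      "hive.metastore.glue.datacatalog.enabled": "true",
      "hive.views-execution.enabled": "true"
    }
  }
]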
This may be a very trivial question.
What is the actual difference between the STRUCT and RECORD types in GCP BigQuery? Can I use them interchangeably? If I have a table created with a column defined as STRUCT, will it show a "schema" mismatch if I try to re-run a Terraform script with the field type changed to RECORD?
I believe they are mostly the same thing, or you may view them as the same concept surfacing in different components of BigQuery.
For historical reasons, the Legacy SQL and storage documentation talks mostly about RECORD, while the Standard SQL dialect uses STRUCT.
A column created with Standard SQL DDL as STRUCT will appear as RECORD in the storage UI, and a Terraform script using RECORD should be compatible.
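As a quick illustration (project, dataset, and table names here are made up), a column declared as STRUCT in Standard SQL DDL comes back as RECORD when you inspect the table schema:
-- Standard SQL DDL: the column is declared as a STRUCT
CREATE TABLE `my_project.my_dataset.people` (
  name STRING,
  address STRUCT<street STRING, city STRING>
);
-- bq show --schema --format=prettyjson my_dataset.people reports the same
-- column as: {"name": "address", "type": "RECORD", "fields": [...]}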
I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of all the dt sub-partitions of a Hive table. For example: my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I got back from the ExecuteSQL processor is corrupted data, including rows with dt=NULL, while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use HiveJDBC4 jar.
Later I tried using the jar matching my CDH release; that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently my returned data includes extra rows... a few thousand of them, which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, the column names retrieved from the ResultSetMetaData had the table names prepended, separated by a period (.). This is some of the custom code in HiveJdbcCommon (used by SelectHiveQL) versus JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify the auth information in the JDBC URL rather than in a hive-site.xml file, for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I'll do my best to help you on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
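For anyone hitting the same issue, one way to apply that property is per-session before the query (a sketch; it can just as well go into hive-site.xml on the cluster):
-- Set the result file format for this session, then run the original query
SET hive.query.result.fileformat=SequenceFile;
SELECT * FROM my_table WHERE dt=1000;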
Is it possible to use the equivalent of --autodetect in DataFlow?
i.e. can we load data into a BQ table without specifying a schema, equivalent to how we can load data from a CSV with --autodetect?
(potentially related question)
If you are using protocol buffers as objects in your PCollections (which should perform very well on the Dataflow back-end), you might be able to use a util I wrote in the past. It will parse the schema of the protobuf into a BigQuery schema at runtime, based on inspection of the protobuf descriptor.
I quickly uploaded it to GitHub; it's a WIP, but you might be able to use it or be inspired to write something similar using Java reflection (I might do it myself at some point).
You can use the util as follows:
TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());
enhanced_events.apply(BigQueryIO.Write.to(tableToWrite).withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
where the create disposition will create the table with the specified schema, and ProtobufClass is the class generated from your Protobuf schema by the proto compiler.
I'm not sure about reading from BQ, but for writes I think something like this will work on the latest Java SDK.
.apply("WriteBigQuery", BigQueryIO.Write
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.to(outputTableName));
Note: BigQuery Table must be of the form: <project_name>:<dataset_name>.<table_name>.
On March 23, 2016, Google BigQuery announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". It says here "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes.". However, I'd expect the feature to work.
I didn't find any code examples on how to use the Avro format for loading, nor did I find examples of using the bq tool for loading.
Here's my practical issue: I haven't been able to load data into BigQuery in Avro format.
The following happens using the bq tool. The dataset, table name, and bucket name have been obfuscated:
$ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE
$ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.
Basically, I am extracting from a table and inserting to the same table causing an internal error.
Additionally, I have a Java program that does the same (extract from table X and load into table X) with the same result (internal error). But I think the above illustrates the problem as clearly as possible, so I'm not sharing the code here. In Java, if I extract from an empty table and insert that, the insert job doesn't fail.
My questions are:
I think the BigQuery API should never fail with an internal error. Why is that happening in my test?
Is the extracted Avro file compatible with an insert job?
There seems to be no specification of what the Avro schema for an insert job should look like, at least I couldn't find any. Could the documentation be created?
UPDATED 2016-04-25:
So far I've managed to get an Avro load job not to give an internal error based on the hint of not using REQUIRED fields. However, I haven't managed to load non-null values.
Consider this Avro-schema:
{
  "type": "record",
  "name": "root",
  "fields": [
    {
      "name": "x",
      "type": "string"
    }
  ]
}
The BigQuery table has one column, x, which is NULLABLE.
If I insert N rows (I've tried with one and two, x being e.g. "1"), I get N rows in BigQuery, but x always has the value null.
If I change the table so that x is REQUIRED, I get an internal error.
There is no exact match from a BQ schema to an Avro schema, and vice versa, so when you export a BQ table to an Avro file and then import it back, the schema will be different. I see the destination table of your load already exists; in this case we throw an error when the schema of the destination table doesn't match the schema we converted from the Avro schema. This should be an external error, though; we're investigating why it's an internal error.
We're in the middle of upgrading the export pipeline, and the new import pipeline has a bug that doesn't work with the Avro files exported by the current pipeline. The fix should be deployed in a couple of weeks. After that, if you import the exported file into a non-existent destination table, or a destination table with a compatible schema, it should work. Meanwhile, importing your own Avro files should work. You can also query them directly on GCS without importing them.
There's a problem with the error mapping for the AVRO reader here. The error should have been along the lines of: "The reference schema differs from the existing data: The required field 'api_key' is missing"
Looking at your load job configuration, it includes REQUIRED fields. It sounds like some of the data you are trying to load doesn't specify these required fields, so the operation fails.
I suggest avoiding required fields.
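In Avro terms, a field that allows NULL values is declared as a union with "null", so a sketch of the example schema above without REQUIRED semantics would look like this:
{
  "type": "record",
  "name": "root",
  "fields": [
    {
      "name": "x",
      "type": ["null", "string"],
      "default": null
    }
  ]
}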
So, there's a bug in BigQuery: an insert job using the Avro format does not work if the destination table exists; it gives an internal error instead. The workaround is to use createDisposition CREATE_IF_NEEDED and not have a pre-existing table there. I verified that this works.
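With the bq tool the workaround looks roughly like this (the target table name is hypothetical; the point is that it must not exist yet, so the table gets created from the Avro schema instead of being matched against an existing one):
$ bq load --source_format=AVRO dataset.events_avro_new gs://BUCKET/events_bq_tool.avro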
Hua Zung's comment says that "the fix should be deployed in a couple weeks". Needless to say, existing major bugs in the live system should be documented somewhere.
While updating the system, I really recommend improving the Avro documentation. Currently there's no mention of what the Avro schema should look like (type record, name root, and a fields array holding the columns?), nor of the fact that each record in the Avro file maps to a row in the destination table (obvious, but it should be mentioned). What happens on a schema mismatch is not documented either.
Thanks for the help, I'll be now switching to Avro-format. It's so much better than CSV.
It is my understanding that Spark SQL reads HDFS files directly - no need for M/R here. Specifically, none of the MapReduce-based Hadoop InputFormats/OutputFormats are employed (except in special cases like HBase).
So then, are there any built-in dependencies on a functioning Hive server? Or is it only required to have:
a) Spark Standalone
b) HDFS and
c) Hive metastore server running
i.e. YARN/MRv1 are not required?
The Hadoop-related I/O formats for accessing Hive files seem to include:
TextInputFormat / TextOutputFormat
ParquetFileInputFormat / ParquetFileOutputFormat
Can Spark SQL/Catalyst read Hive tables stored in those formats - with only the Hive Metastore server running?
Yes.
The Spark SQL Readme says:
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
This is implemented by depending on Hive libraries for reading the data, but the processing happens inside Spark, so there is no need for MapReduce or YARN.
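As a minimal sketch of that setup (Spark 1.x-era API; the master URL, database, and table names below are placeholders):
// A standalone Spark application reading a Hive table with only the Hive
// metastore available; no YARN or MapReduce is involved.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class HiveReadExample {
  public static void main(String[] args) {
    // Spark Standalone master.
    SparkConf conf = new SparkConf().setAppName("HiveReadExample").setMaster("spark://master:7077");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Requires a hive-site.xml on the classpath pointing at the metastore server.
    HiveContext hiveContext = new HiveContext(jsc.sc());

    // Spark reads the table's underlying files (text, Parquet, ...) directly;
    // the metastore only supplies the location, schema, and SerDe information.
    DataFrame df = hiveContext.sql("SELECT * FROM my_db.my_table LIMIT 10");
    df.show();

    jsc.stop();
  }
}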