Redshift Spectrum and Hive Metastore - Ambiguous Error

From Redshift, I created an external schema using the Hive Metastore. I can see the Redshift metadata about the tables (for example with: select * from SVV_EXTERNAL_TABLES), however when querying one of these tables I get an ambiguous error: "error: Assert".
I can query the metadata about the tables, but cannot actually query the tables themselves.
I created the external schema as follows:
create external schema hive_schema
from hive metastore
database 'my_database_name'
uri 'my_ip_address' port 9083
iam_role 'arn:aws:iam::123456789:role/my_role_name';
Here is the error message when running "select * from hive_schema.my_table_name;"
-----------------------------------------------
error: Assert
code: 1000
context: loc->length() > 5 && loc->substr(0, 5) == "s3://" -
query: 1764
location: scan_range_manager.cpp:221
process: padbmaster [pid=26902]
-----------------------------------------------

What is the LOCATION of your Hive table? It seems Redshift is asserting that the location starts with s3://.
You can see the LOCATIONs of your tables by running this query:
select location from SVV_EXTERNAL_TABLES
Where are your Hive tables stored? Is it perhaps HDFS? I doubt that Redshift supports any location other than S3; the section "Considerations When Using AWS Glue Data Catalog" of the AWS guide describes how to set up your Hive Metastore to store data in S3.
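If your table is indeed stored on HDFS, one rough way forward (a sketch only; the table name and bucket are hypothetical, and it assumes you have already copied the data to S3, e.g. with distcp) is to repoint the Hive table at an S3 location and then re-check SVV_EXTERNAL_TABLES:
-- run in Hive, after the data has been copied to s3://my-bucket/my_table/
ALTER TABLE my_table SET LOCATION 's3://my-bucket/my_table/';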

Related

BigQuery error in mk operation: Error while reading table... "failed to add partition key < > (type: TYPE_INT64) to schema

I have successfully loaded a great deal of externally hive partitioned data (in parquets) into bigquery without issue.
All my tables are sharing the same schema and are in the same dataset. However, some tables don't work when calling the bq mk command using the external table definition files I've defined.
The full output of the error after bq mk is as follows
BigQuery error in mk operation: Error while reading table: my_example,
error message: Failed to add partition key my_common_key (type:
TYPE_INT64) to schema, because another column with the same name was
already present. This is not allowed. Full partition schema:
[my_common_key:TYPE_INT64, my_secondary_common_key:TYPE_INT64].
External table definition files look like this
{ "hivePartitioningOptions": {
"mode": "AUTO",
"sourceUriPrefix": "gs://common_bucket/my_example" },
"parquetOptions": {
"enableListInference": false,
"enumAsString": false }, "sourceFormat": "PARQUET", "sourceUris": [
"gs://common_bucket/my_example/*" ]
You will see that I am relying on auto-inference for the schema with a source URI prefix, as there are numerous parquet files nested within two sub-directories which Hive uses as partition keys. The full path is gs://common_bucket/my_example/my_common_key=2022/my_secondary_common_key=1, and within that directory there are several parquet files.
Please check the data files under the bucket: has the Hive table evolved over time, so that older files contain the partition column ("my_common_key") as a regular data column, while the table is now partitioned by it? I recently migrated a large set of Hive tables and hit a similar issue; the current Hive table structure and the underlying data had evolved over time.
One way to solve this is to read the data with Dataproc Hive, export it back to a GCS location, and then load it into BigQuery. You can also use Spark SQL to do this.
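As a rough illustration of that approach, here is a hedged Spark SQL sketch (the database and table names and the GCS output path are hypothetical, and it assumes a Hive table my_db.my_example is already defined over the existing files in Dataproc): it rewrites the data to a new prefix so the partition columns no longer appear inside the parquet data files themselves.
-- Spark SQL on Dataproc: rewrite the data partitioned by the two keys;
-- Spark writes the partition columns only into the directory names,
-- not into the parquet data files.
CREATE TABLE my_db.my_example_clean
USING PARQUET
PARTITIONED BY (my_common_key, my_secondary_common_key)
LOCATION 'gs://common_bucket/my_example_clean/'
AS SELECT * FROM my_db.my_example;
You would then point sourceUriPrefix and sourceUris in the external table definition at the new gs://common_bucket/my_example_clean/ prefix.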

How to read data partitions in S3 from Trino

I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specified Avro schema, which I put on the local file system.
Then I created an external Hive table pointing to the data location in S3 and to the Avro schema on the local file system.
The table is created.
Then, normally, I should be able to query my data and partitions in S3 from Trino.
trino> select * from hive.default.my_table;
It returns only the column names.
trino> select * from hive.default."my_table$partitions";
It returns only the names of the partition columns.
Could you please suggest a solution for how I can read data partitions in S3 from Trino?
Note that I'm using Apache Hive 2; even when I query the table in Hive to return the table partitions, it returns OK but does not display anything. I think this is because with Hive 2 we should use the MSCK command.
In Hive, uploading partition folders and files into S3 and creating the table is not enough; the partition metadata must also be created. Normally you can have folders that are not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
It will create partition metadata in Hive metastore and partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS
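If you only need a few specific partitions, you can also mount them explicitly instead of scanning the whole table location (a sketch only; the table name, partition column, and S3 path are hypothetical):
-- run in Hive: register one partition and its S3 location in the metastore
ALTER TABLE my_table ADD IF NOT EXISTS
PARTITION (dt='2022-01-01') LOCATION 's3://my-bucket/my_table/dt=2022-01-01/';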
I faced the same issue. Once the table is created, we need to manually sync the partition metadata to the metastore using the Trino command below.
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html

unioning tables from ec2 with aws glue

I have two MySQL databases, each on their own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog in a database called db1. Then I'm using AWS Glue to copy the tables from the EC2 instances into an S3 bucket. Then I'm querying the tables with Redshift. I get the external schema into Redshift from the AWS crawler using the script below in the query editor. I would like to union the two tables together into one table and add a column 'source' with a flag to indicate the original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is to create an ETL pipeline that does that before the data gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';
You can create a view that unions the two tables using Athena, and that view will then be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3 FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3 FROM db1.mysql_table_2
;
run the above using Athena (not Redshift)
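To also get the 'source' flag the question asks for, the view can add a literal column to each branch of the union (a sketch; the literal values are just examples):
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3, 'instance_1' AS source FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3, 'instance_2' AS source FROM db1.mysql_table_2
;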

Query fails on presto-cli for a table created in hive in orc format with data residing in s3

I set up an Amazon EMR cluster with 1 master and 1 core node (m4.large), with the following version details:
EMR: 5.5.0
Presto: 0.170
Hadoop: 2.7.3 (HDFS)
Hive: 2.1.1 (Metastore)
My Spark app wrote out the data in ORC to Amazon S3. Then I created the table in Hive (create external table TABLE ... partition() stored as ORC location 's3a://...'), and tried to query it from presto-cli. I get the following error for the query SELECT * from TABLE:
Query 20170615_033508_00016_dbhsn failed: com.facebook.presto.spi.type.DoubleType
The only query that works is:
SELECT COUNT(*) from TABLE
Any ideas?
Found out the problem: the column order when the data was stored as ORC did not match the column order of the table created in Hive!
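For illustration only (the column names, types, and S3 path below are hypothetical), the point is that the Hive DDL must declare the columns in exactly the order the Spark app wrote them into the ORC files:
-- the column order here must match the order of the columns in the ORC files
CREATE EXTERNAL TABLE my_table (
  id BIGINT,
  price DOUBLE,
  name STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 's3a://my-bucket/my-orc-data/';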

What does the hive metastore and name node do in a cluster?

In a cluster with Hive installed, what do the metastore and the NameNode hold? I understand that the metastore has all the table schemas, partition details, and metadata. Now, what is this metadata? Then, what does the NameNode hold? And where is this metastore present in a cluster?
The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It also keeps track of all the DataNodes (dead and live) through a heartbeat mechanism, and it helps clients with reads and writes by receiving their requests and redirecting them to the appropriate DataNodes.
The metadata which the metastore stores contains things like:
IDs of databases
IDs of tables
IDs of indexes
The time of creation of an index
The time of creation of a table
IDs of roles assigned to a particular user
The InputFormat used for a table
The OutputFormat used for a table, and so on.
Is this what you wanted to know?
And it is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can be used for the metastore.
HTH
P.S : You might find the E/R diagram of metastore useful.
Hive data (not metadata) is spread across Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes can communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore:
Here are the tables in the Hive metastore:
SQL> select table_name from user_tables;
DBS
DATABASE_PARAMS
SEQUENCE_TABLE
SERDES
TBLS
SDS
CDS
BUCKETING_COLS
TABLE_PARAMS
PARTITION_KEYS
SORT_COLS
SD_PARAMS
COLUMNS_V2
SERDE_PARAMS
You can describe the structure of each table:
SQL> describe partition_keys;
TBL_ID NUMBER
PKEY_COMMENT VARCHAR2(4000)
PKEY_NAME VARCHAR2(128)
PKEY_TYPE VARCHAR2(767)
INTEGER_IDX NUMBER(10)
And find the contents of each table:
SQL> select * from partition_keys;
So if in Hive you run "CREATE TABLE xxx (...) PARTITIONED BY (...)", the Hive partitioning metadata is stored in the metastore database (Oracle, MySQL, ...).
For example, in Hive if you create a table like this:
hive> create table employee_table (id bigint, name string) partitioned by (region string);
You will find this in the metastore:
SQL> select tbl_id,pkey_name from partition_keys;
TBL_ID PKEY_NAME
------ ---------
8 region
SQL> select tbl_name from tbls where tbl_id=8;
TBL_NAME
--------
employee_table
When you insert data into employee_table, the data will be stored in HDFS on Hadoop DataNodes and the NameNode will keep track of which DataNodes have the data.
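And once rows are inserted for a given region (the value below is hypothetical, and this assumes your metastore schema also contains the standard PARTITIONS table, which the listing above does not show), the concrete partition instances are recorded too:
SQL> select part_name from partitions where tbl_id=8;
PART_NAME
---------
region=us-west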
Metastore - It's a database which stores metadata, i.e. all the details about the tables you create in Hive. By default, Hive comes with and uses the Derby database, but you can use any other database such as MySQL or Oracle.
Use of the metastore: Whenever you fire a query from your Hive CLI, the execution engine gathers all the details regarding the table and creates an execution plan (job). These details come from the metastore. Finally, the execution engine sends the job to Hadoop, where a common Hadoop MapReduce job is executed and the result is sent back to Hive. The NameNode communicates with the execution engine to successfully execute the MR job.
Regarding the Hive metastore (not a Hadoop metastore):
It is not compulsory to have a metastore in your Hadoop environment, as it is only required if you are using Hive on top of your HDFS cluster.
The metastore is the metadata repository for Hive only, and is used by Hive to store meta information about the database objects you create (not the actual data, which is already in HDFS, because Hive does not store data itself; Hive uses data already stored in the file system).
A Hive installation requires a metastore service backed by an RDBMS.
Regarding the NameNode (Hadoop NameNode):
It is a core part of Hadoop and behaves like a metastore for the cluster.
It is not an RDBMS; it stores the file system metadata in files on the file system only.