How to update hive table metadata with latest AVRO schema file

How to update hive table metadata with latest AVRO schema file - hive

FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: /master_data/XYZ/DA12195/business_date=20181126/_schema.avsc)
The schema file exists in new partition with business_date=20181129, but hive table is still pointing to schema file in older partition.

Dropping external table and recreating it helped solve this problem.
Also MSCK REPAIR command helped recreating hive partitions.
File _schema.avsc file contain schema information about the AVRO table.We need to point hive table metadata to correct location of this file. serde and tblproperties needs to be updated for making this change

Related

How read data partitons in S3 from Trino

I'm trying to read data partitons in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specified avro schema, I put it in file local system.
Then I created an external hive table to point to the data location in S3 and to the avro schema in file local system.
Table is created.
Then, normaly I can query my data and partitions in S3 from Trino.
Trino>select * from hive.default.my_table;
It return only columns names.
trino>select * from hive.default."my_table$partitions";
it return only name of partitions.
Could you please suggest me a solution how can I read data partitons in S3 from Trino ?
Knowing that I'm using Apache Hive 2, even when I query the table in hive to return the table partitions, it return Ok, and display any thing. I think because Hive 2 we should use MSCK command

In Hive uploading partition folders and files into S3 and creating table is not enough, partition metadata should be created. Normally you can have folders not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
It will create partition metadata in Hive metastore and partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS

Faced the same issue. Once the table is created, we need to manually sync up the schema to the metastore using the below command of trino.
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html

How Can I create a Hive Table on top of a Parquet File

Facing issue on creating hive table on top of parquet file. Can someone help me on the same.? I have read many articles and followed the guidelines but not able to load a parquet file in Hive Table.

According "Using Parquet Tables in Hive" it is often useful to create the table as an external table pointing to the location where the files will be created, if a table will be populated with data files generated outside of Hive.
hive> create external table parquet_table_name (<yourParquetDataStructure>)
STORED AS PARQUET
LOCATION '/<yourPath>/<yourParquetFile>';

How to make a table that is automatically updated Hive

I have created an external table that in Hive that uses data from a Parquet store in HDFS.
When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.
How I create the table in Hive:
CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";
The relevant error:
Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet
I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.
I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.
The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the the database.
Relevant information:
Hive 0.13.0
Spark Streaming 1.4.1
PostgreSQL 9.3
CentOS 7
EDIT:
After examining the Parquet files, I found that the part files have seemingly incompatible file names.
-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
These files are the files that causes Hive to error when it can't find it in the error described above. This means that the external table is not acting dynamically, accepting any files in the directory (if you call it that in HDFS), but instead is probably just keeping track of the list of parquet files inside the directory when it was created.
Sample Spark code:
nodes.foreachRDD(rdd => {
if (!rdd.isEmpty())
sqlContext.createDataFrame(rdd.map(
n => Row(n.stuff), ParquetStore.nodeSchema)
.write.mode(SaveMode.Append).parquet(node_name)
})
Where the nodeSchema is the schema and node_name is "/hdfs/nodes"
See my other question about getting Hive external tables to detect new files.

In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:
/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
I realize this isn't automatic, but it at least works.

Ok, so probably you need to encapsulate the file in a folder. Hive external table must be mapped on a folder where there could be more than one file.
try to write the file to: /path/to/hdfs/nodes/file
and then map the external table to /path/to/hdfs/nodes
so in the folder nodes you will have only the parquet file and it should works

Where does hive stores its table?

I am new to Hadoop and I just started working on Hive, I my understanding it provides a query language to process data in HDFS. With HiveQl we can create tables and load data into it from HDFS.
So my question is: where are those tables stored? Specifically if we have 100 GB file in our HDFS and we want to make a hive table out of that data what will be the size of that table and where is it stored?
If my understanding about this concept is wrong please correct me ..

If the table is 100GB you should consider an Hive External Table (as opposed to a "managed table", for the difference, see this).
With an external table the data itself will be still stored on the HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a map of it in the meta-store whereas the managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data as opposed to dropping a hive external table which only drops the meta-data from the meta-store referencing that data.
Either way you are using only 100GB as viewed by the user and are taking advantage of the HDFS' robustness though duplication of the data.

Hive will create a directory on HDFS. If you didn't specify any location it will create a directory at /user/hive/warehouse on HDFS. After load command the files are moved to the /warehouse/tablename. You can also point to the HDFS directory if it contains partitions (if the files are partitioned), or use external table concept.

HIVE Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

I am getting the below error on creating a hive database
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/facebook/fb303/FacebookService$Iface
Hadoop version:**hadoop-1.2.1**
HIVE Version: **hive-0.12.0**
Hadoop path:/home/hadoop_test/data/hadoop-1.2.1
hive path :/home/hadoop_test/data/hive-0.12.0
I have copied hive*.jar ,jline-.jar,antlr-runtime.jar from hive-0.12.0/lib to hadoop-1.2./lib

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
Make sure the location is specified correctly

In the following way, I solved the problem.
set hive.msck.repair.batch.size=1;
set hive.msck.path.validation=ignore;
If you can not set the value, and get the error.Error: Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime (state=42000,code=1)
add content in hive-site:
key:
hive.security.authorization.sqlstd.confwhitelist.append
value:
hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size

Set hive.metastore.schema.verification property in hive-site.xml to true, by default it is false.
For further details check this link.

Amazon Athena
If you get here because of Amazon Athena errors, you might use this bit below. First check that all you files have the same schema:
If you run an ALTER TABLE ADD PARTITION (or MSCK REPAIR TABLE) statement and mistakenly specify a partition that already exists and an incorrect Amazon S3 location, zero byte placeholder files of the format partition_value_$folder$ are created in Amazon S3. You must remove these files manually.
We removed the files with the awscli.
aws s3 rm s3://bucket/key/table/ --exclude="*" --include="*folder*" --recursive --dryrun
See also the docs with some extra steps included.

To proper fix this with MSCK
Remove the older partitions from metastore, if their path not exists, using
ALTER TABLE dbname.tablename DROP PARTITION IF EXISTS (partition_column_name > 0);
RUN MSCK REPAIR COMMAND
MSCK REPAIR TABLE dbname.tablename;
Why the step 1 is required because MSCK Repair command will through error if the partition is removed from the file system (HDFS), so by removing all the partitions from the metastore first and then sync with MSCK will properly add the required partitions

The reason why we got this error was we added a new column to the external Hive table. set hive.msck.path.validation=ignore; worked upto fixing hive queries but Impala had additional issues which were solved with below steps:
After doing an invalidate metadata, Impala queries started failing with Error: incompatible Parquet schema for column
Impala error SOLUTION: set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
if you're using Cloudera distribution below steps will make the change permanent and you don't have to set the option per session.
Cloudera Manager -> Clusters -> Impala -> Configuration -> Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)
Add the value: PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
NOTE: do not use SET or semi-colon when setting the parameter in Cloudera Manager

open hive cli using "hive --hiveconf hive.root.logger=DEBUG,console" to enable logs and debug from there, in my case a camel case name for partition was written on hdfs and i created hive table with its name fully in lowercase.

None of proposed solutions worked for me.
I discovered a 0B file named _$folder$ inside my table location path (at same level of partitions).
Removing it allowed me to run a MSCK REPAIR TABLE t without issues.
This file was comming from a s3 restore (roll back to a previous versionned state)

I faced the same error. Reason in my case was a directory created in the HDFS warehouse with the same name. When this directory was deleted, it resolved my issue.

It's probably because your metastore_db is corrubpted. Delete .lck files from metastore_db.

hive -e "msck repair table database.tablename"
it will repair table metastore schema of table;

setting the below property and then doing msck repair worked for me :
set hive.mapred.mode=unstrict;

I faced similar issue when the underlying hdfs directory got updated with new partitions and hence the hive metastore went out of sync.
Solved using the following two steps:
MSCK table table_name showed what all partitions are out of sync.
MSCK REPAIR table table_name added the missing partitions.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas