HIVE Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask - hive

I am getting the below error on creating a hive database
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/facebook/fb303/FacebookService$Iface
Hadoop version:**hadoop-1.2.1**
HIVE Version: **hive-0.12.0**
Hadoop path:/home/hadoop_test/data/hadoop-1.2.1
hive path :/home/hadoop_test/data/hive-0.12.0
I have copied hive*.jar ,jline-.jar,antlr-runtime.jar from hive-0.12.0/lib to hadoop-1.2./lib

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
Make sure the location is specified correctly

In the following way, I solved the problem.
set hive.msck.repair.batch.size=1;
set hive.msck.path.validation=ignore;
If you can not set the value, and get the error.Error: Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime (state=42000,code=1)
add content in hive-site:
key:
hive.security.authorization.sqlstd.confwhitelist.append
value:
hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size

Set hive.metastore.schema.verification property in hive-site.xml to true, by default it is false.
For further details check this link.

Amazon Athena
If you get here because of Amazon Athena errors, you might use this bit below. First check that all you files have the same schema:
If you run an ALTER TABLE ADD PARTITION (or MSCK REPAIR TABLE) statement and mistakenly specify a partition that already exists and an incorrect Amazon S3 location, zero byte placeholder files of the format partition_value_$folder$ are created in Amazon S3. You must remove these files manually.
We removed the files with the awscli.
aws s3 rm s3://bucket/key/table/ --exclude="*" --include="*folder*" --recursive --dryrun
See also the docs with some extra steps included.

To proper fix this with MSCK
Remove the older partitions from metastore, if their path not exists, using
ALTER TABLE dbname.tablename DROP PARTITION IF EXISTS (partition_column_name > 0);
RUN MSCK REPAIR COMMAND
MSCK REPAIR TABLE dbname.tablename;
Why the step 1 is required because MSCK Repair command will through error if the partition is removed from the file system (HDFS), so by removing all the partitions from the metastore first and then sync with MSCK will properly add the required partitions

The reason why we got this error was we added a new column to the external Hive table. set hive.msck.path.validation=ignore; worked upto fixing hive queries but Impala had additional issues which were solved with below steps:
After doing an invalidate metadata, Impala queries started failing with Error: incompatible Parquet schema for column
Impala error SOLUTION: set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
if you're using Cloudera distribution below steps will make the change permanent and you don't have to set the option per session.
Cloudera Manager -> Clusters -> Impala -> Configuration -> Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)
Add the value: PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
NOTE: do not use SET or semi-colon when setting the parameter in Cloudera Manager

open hive cli using "hive --hiveconf hive.root.logger=DEBUG,console" to enable logs and debug from there, in my case a camel case name for partition was written on hdfs and i created hive table with its name fully in lowercase.

None of proposed solutions worked for me.
I discovered a 0B file named _$folder$ inside my table location path (at same level of partitions).
Removing it allowed me to run a MSCK REPAIR TABLE t without issues.
This file was comming from a s3 restore (roll back to a previous versionned state)

I faced the same error. Reason in my case was a directory created in the HDFS warehouse with the same name. When this directory was deleted, it resolved my issue.

It's probably because your metastore_db is corrubpted. Delete .lck files from metastore_db.

hive -e "msck repair table database.tablename"
it will repair table metastore schema of table;

setting the below property and then doing msck repair worked for me :
set hive.mapred.mode=unstrict;

I faced similar issue when the underlying hdfs directory got updated with new partitions and hence the hive metastore went out of sync.
Solved using the following two steps:
MSCK table table_name showed what all partitions are out of sync.
MSCK REPAIR table table_name added the missing partitions.

Related

How to fix corrupted delta lake table on AWS S3

I ended up manually deleting some delta lake entries(hosted on S3) .
Now my spark job is failing because the delta transaction logs point to files that do not exist in the file system.
I came across this https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-fsck.html
but I am not sure how should I run this utility in my case.
You could easily do that following the document that you have attached.
I have done that as below if you have hive table on top of your S3:
%sql
FSCK REPAIR TABLE schema.testtable DRY RUN
Using DRY RUN will list the files that needs to be deleted. You can first run the above command and verify the files that actually need to be deleted.
Once you have verified that you can run the actual above command without DRY RUN and it should do what you needed.
%sql
FSCK REPAIR TABLE schema.testtable
Now if you have not created a hive table and have a path(delta table) where you have files than you can do it as below:
%sql
FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN
I am doing this from databricks and have mounted my S3 bucket path to databricks.
you need to make sure that you have that ` symbol after delta. and before the actual path otherwise it wont work.
here also in order to perform the actual repair operation you can remove the DRY RUN from the above command and it should do the stuff that you wat.

How read data partitons in S3 from Trino

I'm trying to read data partitons in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specified avro schema, I put it in file local system.
Then I created an external hive table to point to the data location in S3 and to the avro schema in file local system.
Table is created.
Then, normaly I can query my data and partitions in S3 from Trino.
Trino>select * from hive.default.my_table;
It return only columns names.
trino>select * from hive.default."my_table$partitions";
it return only name of partitions.
Could you please suggest me a solution how can I read data partitons in S3 from Trino ?
Knowing that I'm using Apache Hive 2, even when I query the table in hive to return the table partitions, it return Ok, and display any thing. I think because Hive 2 we should use MSCK command
In Hive uploading partition folders and files into S3 and creating table is not enough, partition metadata should be created. Normally you can have folders not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
It will create partition metadata in Hive metastore and partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS
Faced the same issue. Once the table is created, we need to manually sync up the schema to the metastore using the below command of trino.
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html

How to update hive table metadata with latest AVRO schema file

FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: /master_data/XYZ/DA12195/business_date=20181126/_schema.avsc)
The schema file exists in new partition with business_date=20181129, but hive table is still pointing to schema file in older partition.
Dropping external table and recreating it helped solve this problem.
Also MSCK REPAIR command helped recreating hive partitions.
File _schema.avsc file contain schema information about the AVRO table.We need to point hive table metadata to correct location of this file. serde and tblproperties needs to be updated for making this change

Hadoop Distcp aborting when copying data from one cluster to another

I am trying to copy data of a partitioned Hive table from one cluster to another.
I am using distcp to copy the data but the data underlying data is of a partitioned hive table.
I used the following command.
hadoop distcp -i {src} {tgt}
But as the table was partitioned the directory structure was created according to the partitioned tables. So it is showing error creating duplicates and aborting job.
org.apache.hadoop.toolsCopyListing$DulicateFileException: File would cause duplicates. Aborting
I also used -skipcrccheck -update -overwrite but none worked.
How to copy the data of a table from partitioned file path to destination?
Try to use this option -strategy dynamic
By default distcp is using uniformsize.
Check the below settings to see if they are false.Set them to true.
hive> set hive.mapred.supports.subdirectories;
hive.mapred.supports.subdirectories=false
hive> set mapreduce.input.fileinputformat.input.dir.recursive;
mapreduce.input.fileinputformat.input.dir.recursive=false
hadoop distcp -Dmapreduce.map.memory.mb=20480 -Dmapreduce.map.java.opts=-Xmx15360m -Dipc.client.fallback-to-simple-auth-allowed=true -Ddfs.checksum.type=CRC32C -m 500 \
-pb -update -delete {src} {target}
Ideally there can't be same file names. So, what's happening in your case is you trying to copy partitioned table from one cluster to other. And, 2 different named partitions have same file name.
Your solution is to correct Source path {src} in your command, such that you provide path uptil partitioned sub directory not the file.
For ex - Refer below :
/a/partcol=1/file1.txt
/a/partcol=2/file1.txt
If you use {src} as "/a/*/*" then you will get the error "File would cause duplicates."
But, if you use {src} as "/a" then you will not get error in copying.

hive 0.13 msck repair table only lists partitions not in metastore

I'm trying to use Hive(0.13) msck repair table command to recover partitions and it only lists the partitions not added to metastore instead of adding them to metastore as well.
here's the ouput of the command
partitions not in metastore externalexample:CreatedAt=26 04%3A50%3A56 UTC 2014/profileLocation="Chicago"
here's how I'm creating the external table
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Am I missing something?
I had a similar issue with the MSCK REPAIR TABLE listing the partitions that were not in the metastore but not actually adding them (and no error message).
I tried manually adding the partition with the ALTER TABLE ADD PARTITION command, and this gave me an error message, leading me to the root cause which was that the HDFS folder containing the 'missing' partition had been set up with incorrect permissions.
Once the permissions issue was resolved, then the MSCK REPAIR TABLE command worked correctly.
If you encounter this issue, it may be worthwhile to try adding it manually with the ALTER TABLE ADD PARTITION command. It may produce a useful error message that would help you determine the root cause of the problem.
Please make sure that the name of the partitions defined in your table definition match the name of the partition on hdfs.
For example, in your table creation example, I see that you haven't defined any paritions at all.
I think you want to do something like this (note the use of PARTITIONED BY):
create external table ExternalExample(tweetId BIGINT, username STRING, txt STRING,favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT) PARTITIONED BY (CreatedAt STRING, profileLocation STRING) COMMENT 'This is the Twitter streaming data' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE location '/user/hue/exttable/';
Then on hdfs you should have the following folder structure:
/user/hue/exttable/CreatedAt=<someString>/profileLocation=<someString>/your-data-file
The partition names for MSCK REPAIR TABLE ExternalTable should be in lowercase then only it will add it to hive metastore, I faced the similar issue in hive 1.2.1 where there was no support for ALTER TABLE ExternalTable RECOVER PARTITION, but after spending some time debugging found the issue that the partition names should be in lowercase i.e /some_external_path/mypartion=01 is valid and /some_external_path/myParition=01 is invalid;
Make your profileLocation to profilelocation or profile_location and test it should work.
My question is here Not able to recover partitions through alter table in Hive 1.2
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (manually by hadoop fs -put command), the metastore will not be aware of these partitions.
you need to add partition
ALTER TABLE ExternalExample ADD PARTITION
for every partition
or in short you can run
MSCK REPAIR TABLE ExternalExample;
It will add any partitions that exist on HDFS but not in metastore to the metastore.
Ref https://issues.apache.org/jira/browse/HIVE-874
1) You need to specify partitions
2) Partition names must have all lower case letters . See this - https://singhanuvrat.com/hive-partition-column-name-camelcase-bad-idea-b89796d4e741#.16d7uqfot
you might not be running as the hive user:
sudo -u hive** hive -e "set hive.msck.path.validation=ignore;msck repair table T1"
set hive.msck.path.validation=ignore; ( this is for tables with large number of partitions.)
You are just missing the PARTITIONED BY (CreatedAt STRING, profileLocation STRING).