I ended up manually deleting some Delta Lake entries (hosted on S3).
Now my Spark job is failing because the Delta transaction logs point to files that no longer exist in the file system.
I came across this https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-fsck.html
but I am not sure how to run this utility in my case.
You can do that by following the document you attached.
If you have a Hive table on top of your S3 data, I have done it as below:
%sql
FSCK REPAIR TABLE schema.testtable DRY RUN
Using DRY RUN will list the files that need to be deleted. You can first run the above command and verify the files that actually need to be deleted.
Once you have verified them, you can run the same command without DRY RUN and it should do what you need.
%sql
FSCK REPAIR TABLE schema.testtable
Now, if you have not created a Hive table and instead have a path (Delta table) where the files live, you can do it as below:
%sql
FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN
I am doing this from Databricks and have mounted my S3 bucket path to Databricks.
You need to make sure you have the backtick (`) after delta. and around the actual path, otherwise it won't work.
Here too, to perform the actual repair operation, you can remove DRY RUN from the above command and it should do what you want.
Is there a way to locate the HDFS/local path of a particular UDF's jar/class file?
When I run "show functions" I am able to see this UDF, but I want to find out its location.
Hive has two types of functions: permanent (built-in) and temporary.
Permanent: built-in functions are part of hive-exec*.jar under the package "org.apache.hadoop.hive.ql.udf". The jar is under HIVE_HOME/lib/hive-exec*.jar.
Temporary: these functions are added manually. In the case of a temporary function you will find the details in your Hive HQL file, e.g.
ADD JAR xyz.jar;
CREATE TEMPORARY FUNCTION temp AS 'com.example.hive.udf.Temp';
In Hive 0.13+, a UDF can also be added permanently as a plugin.
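A sketch of that permanent registration, assuming Hive 0.13+; the database, function name and jar location below are hypothetical, not from the question:
-- register a permanent function whose jar lives on HDFS
CREATE FUNCTION mydb.myfunc AS 'com.example.hive.udf.Temp' USING JAR 'hdfs:///user/hive/udfs/xyz.jar';
For a function registered this way the jar URI is stored in the metastore, and depending on the Hive version DESCRIBE FUNCTION EXTENDED may also report it as a resource.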
To find out details about a Hive function:
DESCRIBE FUNCTION EXTENDED function_name;
DESCRIBE FUNCTION function_name;
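For a temporary function, a sketch of how to track down the jar within the current session (temp is the example function from above; LIST JARS is a standard Hive CLI resource command):
-- show the implementing class of the function
DESCRIBE FUNCTION EXTENDED temp;
-- show the jars added to this session; the temporary UDF's jar should appear here
LIST JARS;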
I am seeing weird behaviour when using Hive 0.13.1 on Amazon EMR.
It happens when I try to both use a UDF and run an external shell script that runs hive -e "..." commands.
We have been using shell scripts to add partitions dynamically to a table and never encountered any problem in Hive 0.11.
However, in Hive 0.13.1 the following simplified example breaks:
add jar myjar;
create temporary function myfunc as '...';
create external table mytable...
!hive -e "";
select myfunc(someCol) from mytable;
This results in: The UDF implementation class '...' is not present in the class path
If I remove the shell command (!hive -e ""), the error disappears.
Adding the jar and function again after the shell command also makes the error disappear (adding just the function without the jar does not get rid of the error).
Is this known behavior or a bug? Can I do anything besides reloading the jar and function before every usage?
AFAIK, this is the way it has always been. One Hive shell cannot pass the additional jars added to its classpath on to a child shell, and definitely not the function definitions.
We provide Hive/Hadoop etc. as a service at Qubole and have the notion of a Hive bootstrap that is used, for cases like this, to capture common statements required for all queries. This is used extensively by most users. (Caveat: I am one of Qubole's and Hive's founders, but I would recommend using Qubole over EMR for Hive.)
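In practice that means re-issuing the ADD JAR and CREATE TEMPORARY FUNCTION after the child shell returns, as the question already observed. A minimal sketch using the names from the example above:
!hive -e "";
-- re-add the jar and the function once the child shell has returned
add jar myjar;
create temporary function myfunc as '...';
select myfunc(someCol) from mytable;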
I am getting the below error when creating a Hive database:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/facebook/fb303/FacebookService$Iface
Hadoop version: hadoop-1.2.1
Hive version: hive-0.12.0
Hadoop path: /home/hadoop_test/data/hadoop-1.2.1
Hive path: /home/hadoop_test/data/hive-0.12.0
I have copied hive*.jar, jline-*.jar and antlr-runtime*.jar from hive-0.12.0/lib to hadoop-1.2.1/lib.
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
Make sure the location is specified correctly.
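One way to check the location the metastore has registered (not part of the original answer, just a common check) is DESCRIBE FORMATTED:
-- the "Location:" field in the output shows the path the metastore expects
DESCRIBE FORMATTED table_name;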
I solved the problem in the following way.
set hive.msck.repair.batch.size=1;
set hive.msck.path.validation=ignore;
If you cannot set the value and instead get the error Error: Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime (state=42000,code=1),
then add the following to hive-site.xml:
key:
hive.security.authorization.sqlstd.confwhitelist.append
value:
hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size
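For reference, a sketch of the corresponding hive-site.xml entry, assuming you edit the file directly rather than through a management console:
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size</value>
</property>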
Set the hive.metastore.schema.verification property in hive-site.xml to true; by default it is false.
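A sketch of the hive-site.xml entry, assuming you edit the file directly:
<property>
  <name>hive.metastore.schema.verification</name>
  <value>true</value>
</property>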
Amazon Athena
If you got here because of Amazon Athena errors, you might find the bit below useful. First check that all your files have the same schema:
If you run an ALTER TABLE ADD PARTITION (or MSCK REPAIR TABLE) statement and mistakenly specify a partition that already exists and an incorrect Amazon S3 location, zero byte placeholder files of the format partition_value_$folder$ are created in Amazon S3. You must remove these files manually.
We removed the files with the awscli.
aws s3 rm s3://bucket/key/table/ --exclude="*" --include="*folder*" --recursive --dryrun
See also the docs with some extra steps included.
To properly fix this with MSCK:
Remove the older partitions from the metastore, if their paths do not exist, using
ALTER TABLE dbname.tablename DROP PARTITION IF EXISTS (partition_column_name > 0);
Run the MSCK REPAIR command:
MSCK REPAIR TABLE dbname.tablename;
Why step 1 is required: the MSCK REPAIR command will throw an error if a partition has been removed from the file system (HDFS), so by removing all such partitions from the metastore first and then syncing with MSCK, the required partitions are properly added.
The reason we got this error was that we had added a new column to the external Hive table. set hive.msck.path.validation=ignore; worked as far as fixing the Hive queries, but Impala had additional issues, which were solved with the steps below:
After doing an INVALIDATE METADATA, Impala queries started failing with Error: incompatible Parquet schema for column
Solution for the Impala error: set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using a Cloudera distribution, the steps below will make the change permanent so you don't have to set the option per session.
Cloudera Manager -> Clusters -> Impala -> Configuration -> Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)
Add the value: PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
NOTE: do not use SET or semi-colon when setting the parameter in Cloudera Manager
Open the Hive CLI using "hive --hiveconf hive.root.logger=DEBUG,console" to enable logs and debug from there. In my case, a camelCase name for the partition had been written to HDFS while I had created the Hive table with its name fully in lowercase.
None of the proposed solutions worked for me.
I discovered a zero-byte file named _$folder$ inside my table location path (at the same level as the partitions).
Removing it allowed me to run a MSCK REPAIR TABLE t without issues.
This file was coming from an S3 restore (a rollback to a previous versioned state).
I faced the same error. The reason in my case was a directory created in the HDFS warehouse with the same name. Deleting this directory resolved my issue.
It's probably because your metastore_db is corrupted. Delete the .lck files from metastore_db.
hive -e "msck repair table database.tablename"
It will repair the table's metastore schema.
Setting the below property and then doing the MSCK repair worked for me:
set hive.mapred.mode=nonstrict;
I faced a similar issue when the underlying HDFS directory was updated with new partitions and hence the Hive metastore went out of sync.
Solved it using the following two steps:
MSCK TABLE table_name showed which partitions were out of sync.
MSCK REPAIR TABLE table_name added the missing partitions.
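In command form, with table_name standing in for your table:
-- list the partitions that are out of sync
MSCK TABLE table_name;
-- add the missing partitions
MSCK REPAIR TABLE table_name;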
I have created a table "sample" in Hive and loaded a CSV file "sample.txt" into it.
Now I need that data from "sample" in my local /opt/zxy/sample.txt.
How can I do that?
Hortonworks' Sandbox lets you do it through its HCatalog menu. Otherwise, the syntax is
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/c' SELECT a.* FROM b
as per the Hive language manual.
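If you want the exported data in a readable delimited form, Hive 0.11+ also accepts a row format clause on this statement. A sketch using the table from the question (the output directory is just an example; Hive writes files such as 000000_0 inside it):
INSERT OVERWRITE LOCAL DIRECTORY '/opt/zxy/sample_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM sample;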
Since your intention is just to copy the entire file from HDFS to your local FS, I would not suggest doing it through a Hive query, for the following reasons:
It'll start a MapReduce job, which will take more time than a normal copy.
It'll create file(s) with different names (000000_0, 000001_0 and so on), which will require you to rename the file manually afterwards.
You might face problems opening these files, as they have no extension. Your OS would be unable to choose an application to open them on its own. In that case you would either have to rename the file or manually select an application to open it.
To avoid these problems you could use the HDFS get command:
bin/hadoop fs -get /user/hive/warehouse/sample/sample.txt /opt/zxy/sample.txt
Simple and easy. But if you need to copy only selected data, then you have to use a Hive query.
HTH
I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select * from sample' > /opt/zxy/sample.txt
Hope that helps.
Readers who are accessing Hive from Windows OS can check out this script on Github.
It's a Python+paramiko script that extracts Hive data to local Windows OS file-system.