How to find out whether a Hive table is ACID enabled?

What are all the possible ways to find out whether a given Hive table is ACID or non-ACID?
As mentioned here, one way to achieve this is with the below command in a shell script and checking its output:
hive -e "describe extended <Database>.<tablename>;" | grep "transactional=true"
What are the other possible ways to achieve this? The solution can be in a shell script, Apache Pig, or Java, and it will be invoked via an Oozie workflow.


Can't access external Hive metastore with Pyspark

I am trying to run a simple piece of code that simply shows the databases I created previously on my hive2-server. (Note: there are both Python and Scala versions of this example, and both give the same results.)
If I log into a Hive shell and list my databases, I see a total of 3 databases.
When I start the Spark shell (2.3) with pyspark, I do the usual and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
and restart the SparkContext within my session.
If I run either of the following lines to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see that the parameter has been added. I then start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But if I list my databases:
hiveContext.sql("SHOW DATABASES").show()
it does not show the same results as the Hive shell.
I'm a bit lost. For some reason it looks like it is ignoring the config parameter, even though I am sure the one I'm using is my metastore, since the address I get from running:
hive -e "SET" | grep metastore.uris
is the same address I also get if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permissions issue? For example, some tables are not set to be visible outside the Hive shell/user.
Thanks
Managed to solve the issue: because of a communication issue, Hive was not hosted on that machine. I corrected the code and everything is fine.
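For reference, one way to make sure the metastore URI is applied before any SparkContext or SparkSession exists is to pass it when launching pyspark from the shell. This is only a sketch under assumptions (a Spark 2.x build with Hive support; the host below is a placeholder), not the fix used above:
pyspark \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083
Inside the pyspark shell, spark.sql("SHOW DATABASES").show() should then list the databases registered in that metastore.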

Hive - Is there a way to dynamically create tables from a list

I'm using Hive to aggregate stats, and I want to do a breakdown by the industry our customers fall under. Ideally, I'd like to write the stats to a separate output file per industry (e.g. industry1_stats, industry2_stats, etc.). I have a list of the various industries our customers are in, but that list isn't pre-set.
So far, everything I've seen in the Hive documentation indicates that I need to know beforehand which tables I want and hard-code those into my Hive script. Is there a way to do this dynamically, either in the Hive script itself (preferable) or through some external code before kicking off the Hive script?
I would suggest going for a shell script.
Get the list of industry names:
hive -e 'select distinct industry_name from [dbname].[table_name];' > list
Iterate over every line, passing each line (industry name) of list as an argument to the hive command inside the while loop:
tail -n +1 list | while IFS=' ' read -r industry_name
do
hive -hiveconf MY_VAR="$industry_name" -f my_script.hql
done
Save the shell script as test.sh,
and in my_script.hql:
use uvtest;
create table ${hiveconf:MY_VAR} (id INT, name CHAR(10));
You'll have to place both test.sh and my_script.hql in the same folder.
The command below should create all the tables from the list of industry names:
sh test.sh
Follow this link for using hive in shell scripts:
https://www.mapr.com/blog/quick-tips-using-hive-shell-inside-scripts
I wound up achieving this using Hive's dynamic partitioning (each partition writes to a separate directory on disk, so I can just iterate over those directories). The official Hive documentation on partitioning and this blog post were particularly helpful for me.
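A minimal sketch of that dynamic-partitioning approach, runnable from a shell script; the industry_stats and customers tables and the count aggregate are hypothetical stand-ins for the real stats query:
hive -e "
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table if not exists industry_stats (customer_count int)
partitioned by (industry_name string);

-- each distinct industry_name value becomes its own partition directory under the table location
insert overwrite table industry_stats partition (industry_name)
select count(*) as customer_count, industry_name
from customers
group by industry_name;
"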

How to save the results of an impala query

I've loaded a large set of data from S3 into HDFS, and then inserted the data into a table in Impala.
I then ran a query against this data, and I'm looking to get these results back into S3.
I'm using Amazon EMR, with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename option, but that appears to only work on the local Linux file system.
I thought this would have been a common scenario, but I'm having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the knowledge above, I am including the command that writes the query results to a file with a delimiter declared via the option --output_delimiter, while also using the option --delimited, which switches off the default tab delimiter:
impala-shell -q "query " --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do if it's a smallish result set is run the query from the command line and then upload the result to S3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is to use Hive as the last step in your data pipeline after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
If you have the AWS CLI installed, you can pipe the standard output of impala-shell straight to S3 using a Unix pipe, with - telling aws s3 cp to stream from stdin:
impala-shell -q "query" | aws s3 cp - s3://mybucket/outputfilename
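If you also want control over the output format, as in the earlier impala-shell example, the streaming variant might look like the sketch below. The query reuses the table and column names from the pipeline above, and the bucket/object name is a placeholder:
impala-shell -B --output_delimiter=',' -q "select blah from processed_data" | aws s3 cp - s3://mybucket/results.csv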

Hive 13.0 The UDF implementation class '...' is not present in the class path

I encounter weird behaviour when using Hive 0.13.1 on Amazon EMR.
It happens when I try to both use a UDF and run an external shell script that issues hive -e "..." commands.
We have been using shell scripts to add partitions dynamically to a table and never encountered any problem in Hive 0.11.
However, in Hive 0.13.1 the following simplified example breaks:
add jar myjar;
create temporary function myfunc as '...';
create external table mytable...
!hive -e "";
select myfunc(someCol) from mytable;
This results in: The UDF implementation class '...' is not present in the class path
Removing the shell command (!hive -e "") makes the error disappear.
Adding the jar and function again after the shell command also makes the error disappear (adding just the function without the jar does not get rid of the error).
Is this known behavior or a bug? Can I do anything besides reloading the jar and function before every usage?
AFAIK, this is the way it's always been. One Hive shell cannot pass the additional jars added to its classpath on to the child shell, and definitely not the function definitions.
We provide Hive/Hadoop etc. as a service at Qubole and have the notion of a Hive bootstrap that is used, for cases like this, to capture common statements required for all queries. This is used extensively by most users. (Caveat: I am one of Qubole's and Hive's founders, but I would recommend using Qubole over EMR for Hive.)
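In plain Hive on EMR, the workaround from the question (re-adding the jar and re-creating the function after the child shell returns) can be scripted roughly like this. It is only a sketch: the jar path and UDF class are hypothetical stand-ins for the elided ones, and mytable is assumed to already exist as in the question:
cat > rerun_udf.hql <<'HQL'
add jar /path/to/my-udfs.jar;                             -- hypothetical jar path
create temporary function myfunc as 'com.example.MyUDF';  -- hypothetical UDF class
!hive -e "";
-- the child hive shell above does not share this session's classpath,
-- so add the jar and re-create the function before using the UDF again
add jar /path/to/my-udfs.jar;
create temporary function myfunc as 'com.example.MyUDF';
select myfunc(someCol) from mytable;
HQL
hive -f rerun_udf.hql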

Hive: dropping an internal table does not delete the warehouse files in that folder

Hive files on hdfs not being deleted when managed (not external) table is dropped
I followed the link but that didn't help.
Can anyone please suggest a solution?
Yes, even I had the same scenario, and the solution described in the post did not work for me.
So what I did was a brute-force delete from the Hive warehouse folder. I know this might not be the best way to handle the situation, but it did help me move ahead and create the table again without banging my head too much.
You can do the same if you want, using the following command from the Hadoop shell:
hadoop fs -rm -r /user/hive/warehouse/schemaname/
The above works, but you may get permission denied because of the sticky bit.
1) Either do it as a superuser, or
2) Run: HADOOP_USER_NAME=<dir owner> hdfs dfs -rm -r <hdfs dir>
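Putting those two options together, a sketch of the cleanup for a single dropped table (the owner and the table directory name are placeholders; -skipTrash bypasses the .Trash copy):
# run as the warehouse directory owner to get past the sticky bit
HADOOP_USER_NAME=hive hdfs dfs -rm -r -skipTrash /user/hive/warehouse/schemaname/tablename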