Give MapReduce parameters to hive query - hive

Is there any way to pass MapReduce parameters to a Hive query?
For example, I am doing this, and it does not set that parameter in the MR job:
hive (default)> set mapreduce.map.output.value.class=org.apache.orc.mapred.OrcValue;
hive (default)> select count(*) from myTable;
When I check the configuration of the MR job Hive launches, the value of mapreduce.map.output.value.class is not org.apache.orc.mapred.OrcValue.

If you need to change the default Hive file format, use the hive.default.fileformat option, e.g.:
set hive.default.fileformat=orc;
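For instance, with that default in place a table created without an explicit STORED AS clause is written as ORC. A minimal sketch, using a hypothetical table name:
set hive.default.fileformat=orc;
create table my_orc_table (id int, name string);  -- stored as ORC because of the session-level default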

Related

hive update lastAccessTime

I wanted to update lastAccessTime on a Hive table. After searching the web, I found a solution:
set hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
But if I have two databases A & B and run the following Hive SQL:
set hive.exec.pre.hooks =
org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
use A;
insert overwrite table A.xxx
select c1,c2 from B.xxx;
Hive returned:
org.apache.hadoop.hive.ql.metadata.InvalidTableException(Table not found B.xxx)
To retrieve a table's 'LastAccessTime', run the following commands through the Hive shell, replacing [database_name] and [table_name] with the relevant values.
use [database_name];
show table extended like '[table_name]';
This will return several metrics including the number of milliseconds (rather than the number of seconds) elapsed since Epoch. To format that number as a string representing the timestamp of that moment in the current system's time zone, remove the last three digits from the number and run it through the following command:
select from_unixtime([last_access_time]);
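Equivalently, you can divide the millisecond value by 1000 instead of trimming the last three digits; a minimal sketch with a hypothetical value:
select from_unixtime(cast(1528995871000 / 1000 as bigint));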
I happened to want the same effect.
It took some time, but I finally made it work.
Your method is right; just the value matters:
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive\.exec\.pre\.hooks</value>
</property>
<property>
  <name>hive.exec.pre.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec</value>
</property>
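With that whitelist entry in place, the hook can also be enabled per session rather than globally in hive-site.xml; a minimal sketch:
set hive.exec.pre.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;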
If it still does not work, your Hive may be affected by the bug HIVE-18060 (UpdateInputAccessTimeHook fails for non-current database), which is fixed in CDH 5.15.0 (see the CDH 5.15.0 Release Notes) if you use CDH.

drop hive table partition through pig script

Currently we drop the table daily and run the script that loads the data into the tables. The script takes 3-4 hours, during which the data is not available. Our aim now is to keep the old Hive data available to analysts until the new data load completes.
I achieve this in an HQL script by loading the daily data into Hive tables partitioned on load_year, load_month and load_day, and removing yesterday's data by dropping its partition.
But what is the option in a Pig script to achieve the same? Can we alter the table through a Pig script? I don't want to execute another HQL script to drop the partition after Pig.
Thanks
Since HDP 2.3 you can use HCatalog commands inside Pig scripts, so you can drop a Hive table partition directly from Pig. The following is an example of dropping a Hive partition:
-- Set the correct hcat path
set hcat.bin /usr/bin/hcat;
-- Drop a table partition or execute any other HCatalog command
sql ALTER TABLE midb1.mitable1 DROP IF EXISTS PARTITION(activity_id = "VENTA_ALIMENTACION",transaction_month = 1);
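Note that the script needs to run with HCatalog on the classpath; a hedged sketch of the invocation, assuming a hypothetical script name:
pig -useHCatalog drop_partition.pig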
Another way is to use sh command execution inside the Pig script. However, I had some problems escaping special characters in ALTER commands, so the first approach is the best option in my opinion.
Regards,
Roberto Tardío

HIVE : Why does Hive generate mapreduce job on select column from tablename Vs not generating mapreduce for select * from tablename?

Why does Hive generate mapreduce job on select column from tablename Vs not generating mapreduce for select * from tablename?
When a simple statement like select * from tablename is executed, Hive simply fetches the data from the file stored in HDFS and presents it as tabular output. Basically, it performs the equivalent of:
hadoop fs -cat hdfs://schemaname/tablename.txt
hadoop fs -cat hdfs://schemaname/tablename.rc
hadoop fs -cat hdfs://schemaname/tablename.orc
Or in whichever format your table's file is stored.
If you select a specific column, add a WHERE clause, or use any aggregate on the table, MR comes into the picture for obvious reasons.
Whenever you run a plain 'select *', a fetch task is created instead of a MapReduce job; it just dumps the data as-is without processing it. Whereas whenever you do a 'select column', a map-only job internally picks out that particular column and produces the output.
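On more recent Hive versions this behavior is controlled by hive.fetch.task.conversion; a hedged sketch, assuming a hypothetical column mycolumn (supported values and defaults depend on your Hive version):
set hive.fetch.task.conversion=more;
select mycolumn from tablename;  -- simple column selects and filters may now run as a fetch task, without MapReduce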
There was also a bug filed for this to make 'select column' query run without mapreduce. Check the details here: https://issues.apache.org/jira/browse/HIVE-887

Just get column names from hive table

I know that you can get column names from a table via the following trick in hive:
hive> set hive.cli.print.header=true;
hive> select * from tablename;
Is it also possible to just get the column names from the table?
I dislike having to change a setting for something I only need once.
My current solution is the following:
hive> set hive.cli.print.header=true;
hive> select * from tablename;
hive> set hive.cli.print.header=false;
This seems too verbose and against the DRY-principle.
If you simply want to see the column names, this one line should provide it without changing any settings:
describe database.tablename;
However, if that doesn't work for your version of Hive, the following will provide it, but your current database will then be switched to the one you are using:
use database;
describe tablename;
You could also do show columns in $table, or see "Hive, how do I retrieve all the database's tables columns" for access to Hive metadata.
The solution is
show columns in table_name;
This is simpler than using
describe tablename;
Thanks a lot.
Use desc tablename from the Hive CLI or Beeline to get all the column names. If you want the column names in a file, run the command below from the shell.
$ hive -e 'desc dbname.tablename;' > ~/columnnames.txt
where dbname is the name of the Hive database in which your table resides.
You can find the file columnnames.txt in your home directory:
$cd ~
$ls
The best way to do this is to set the two properties below:
set hive.cli.print.header=true;
set hive.resultset.use.unique.column.names=false;
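With those set, a query that returns no rows should still print the header, giving you just the column names; a minimal sketch:
select * from tablename limit 0;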

Best equivalent of SQL Server UPDATE command in Hive

What is the best (least expensive) equivalent of the SQL Server UPDATE ... SET command in Hive?
For example, consider the case in which I want to convert the following query:
UPDATE TABLE employee
SET visaEligibility = 'YES'
WHERE experienceMonths > 36
to equivalent Hive query.
I'm assuming you have a table without partitions, in which case you should be able to run the following command:
INSERT OVERWRITE TABLE employee
SELECT employeeId, employeeName, experienceMonths, salary,
       CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility
FROM employee;
There are other ways, but they are much more convoluted; I think the way Bejoy described is the most efficient.
(source: Bejoy KS blog)
Note that if you have to do this on a partitioned table (which is likely if you have a lot of data), you would probably need to overwrite the affected partitions when doing this.
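For a partitioned table, the same pattern can target a single partition; a hedged sketch assuming a hypothetical partition column country:
INSERT OVERWRITE TABLE employee PARTITION (country = 'US')
SELECT employeeId, employeeName, experienceMonths, salary,
       CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility
FROM employee
WHERE country = 'US';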
You can create an external table and use 'INSERT OVERWRITE LOCAL DIRECTORY'; in case you want to change column values, you can use 'CASE WHEN', 'IF' or other conditional operations. Then copy the output file back to the HDFS location.
You can upgrade your Hive to 0.14.0.
Starting from 0.14.0, Hive supports the UPDATE operation.
To use it, you need to create Hive tables that support the ACID output format and set additional properties in hive-site.xml.
See: How to do CRUD operations in Hive
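A hedged sketch of what that looks like, assuming Hive 0.14+ with the transaction manager enabled (the table definition and bucket count are illustrative):
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- ACID tables must be bucketed, stored as ORC, and marked transactional
CREATE TABLE employee_acid (
  employeeId INT,
  employeeName STRING,
  experienceMonths INT,
  salary DOUBLE,
  visaEligibility STRING
)
CLUSTERED BY (employeeId) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
UPDATE employee_acid SET visaEligibility = 'YES' WHERE experienceMonths > 36;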