Hive calling MapReduce for simple SELECT query

I created a table in Hive and loaded data into it. When I execute a simple "select * from" it invokes MapReduce.
A simple SELECT statement without a WHERE clause is effectively just a cat of the HDFS file and should not require MapReduce.
Can you please suggest how to avoid invoking MapReduce for a simple SELECT?

Way 1) Run set hive.fetch.task.conversion=minimal; in your Hive session, then trigger the select * operation in the same session; it will run as a fetch task instead of a MapReduce job.
Way 2) Add the following property to the hive-site configuration to avoid the MapReduce call for all sessions:
<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
</property>
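To confirm that a query will run as a fetch task rather than a MapReduce job, you can inspect its plan with EXPLAIN (the table name below is a placeholder):

```sql
-- Convert simple SELECT * queries to a single fetch task for this session.
SET hive.fetch.task.conversion=minimal;

-- The plan should show a "Fetch Operator" stage rather than a "Map Reduce" stage.
EXPLAIN SELECT * FROM my_table;
```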

Related

Give MapReduce parameters to hive query

Is there any way to pass MapReduce parameters to a Hive query?
For example, I am doing this, and it does not set that parameter in the MR job:
hive (default)> set mapreduce.map.output.value.class=org.apache.orc.mapred.OrcValue;
hive (default)> select count(*) from myTable;
When I check the configuration of the MR job Hive launches, the value of mapreduce.map.output.value.class is not org.apache.orc.mapred.OrcValue.
If you need to change the default Hive file format, use the hive.default.fileformat option, i.e.:
set hive.default.fileformat=orc;
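For example, with that option set, newly created tables default to ORC without an explicit STORED AS clause (the table and column names below are hypothetical):

```sql
SET hive.default.fileformat=orc;

-- No STORED AS clause needed; the table is created as ORC.
CREATE TABLE my_orc_table (id INT, name STRING);
```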

Mapreduce job not launching when running hive query with where clause

I am using apache-hive-1.2.2 on Hadoop 2.6.0. When I run a Hive query with a WHERE clause, it returns results immediately without launching any MapReduce job. I'm not sure what is happening. The table has over 100k records.
I am quoting this from the Hive documentation:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task,
minimizing latency. Currently the query should be single sourced not
having any subquery and should not have any aggregations or distincts
(which incur RS – ReduceSinkOperator, requiring a MapReduce task),
lateral views and joins.
Any aggregation such as max, min, or count requires a MapReduce job, so it depends on the query you run over your dataset.
select * from tablename;
This just reads raw data from files in HDFS, so it is much faster and does not need MapReduce.
This is due to the property "hive.fetch.task.conversion". The default value is "more" (Hive 2.1.0), which makes Hive try to go straight at the data by launching a single fetch task instead of a MapReduce job wherever possible.
This behaviour, however, might not be desirable if you have a huge table (say 500 GB+), as it causes a single thread to read the data instead of the multiple tasks a MapReduce job would launch.
You can set this property to "minimal" or "none" in hive-site.xml to bypass the behaviour.
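As a sketch, the three values behave roughly as follows, per the hive.fetch.task.conversion description quoted above:

```sql
-- none: always compile a MapReduce job, even for SELECT *.
SET hive.fetch.task.conversion=none;

-- minimal: fetch task only for SELECT *, filters on partition columns, and LIMIT.
SET hive.fetch.task.conversion=minimal;

-- more: also allow single-source SELECTs with filters and expressions (the default).
SET hive.fetch.task.conversion=more;
```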

hive update lastAccessTime

I wanted to update lastAccessTime on a Hive table. After googling, I found a solution:
set hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
But if I have two databases, A and B, and run this Hive SQL:
set hive.exec.pre.hooks =
org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
use A;
insert overwrite A.xxx
select c1,c2 from B.xxx;
Hive returned:
org.apache.hadoop.hive.ql.metadata.InvalidTableException(Table not
found B.xxx
To retrieve a table's 'LastAccessTime', run the following commands through the Hive shell, replacing [database_name] and [table_name] with the relevant values.
use [database_name];
show table extended like '[table_name]';
This will return several metrics including the number of milliseconds (rather than the number of seconds) elapsed since Epoch. To format that number as a string representing the timestamp of that moment in the current system's time zone, remove the last three digits from the number and run it through the following command:
select from_unixtime([last_access_time]);
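For example, since show table extended reports lastAccessTime in milliseconds, dropping the last three digits is the same as dividing by 1000 (the value below is made up for illustration):

```sql
-- 1524604800000 ms since epoch -> 1524604800 s since epoch,
-- which from_unixtime() formats in the system time zone.
SELECT from_unixtime(CAST(1524604800000 / 1000 AS BIGINT));
```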
I happened to want the same effect, and after some time I finally got it working.
Your method is right; just the whitelist value matters:
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive\.exec\.pre\.hooks</value>
</property>
<property>
  <name>hive.exec.pre.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec</value>
</property>
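If you cannot edit hive-site.xml, a sketch of passing the same two settings when starting the CLI (assuming you have permission to override these configs at startup):

```shell
hive --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='hive\.exec\.pre\.hooks' \
     --hiveconf hive.exec.pre.hooks='org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec'
```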
If it still doesn't work, your Hive may be affected by HIVE-18060 (UpdateInputAccessTimeHook fails for non-current database), which is fixed in CDH 5.15.0 (see the CDH 5.15.0 Release Notes) if you use CDH.

HIVE : Why does Hive generate mapreduce job on select column from tablename Vs not generating mapreduce for select * from tablename?

When a simple statement like select * from tablename is executed, Hive simply fetches the data from the file stored in HDFS and writes it out in the table's column layout. Essentially it runs the equivalent of:
hadoop fs -cat hdfs://schemaname/tablename.txt
hadoop fs -cat hdfs://schemaname/tablename.rc
hadoop fs -cat hdfs://schemaname/tablename.orc
Or whichever format your table's file is stored in.
If you try selecting a specific column, adding a WHERE clause, or using any aggregate on the table, MR comes into the picture for obvious reasons.
Whenever you run a plain 'select *', a fetch task is created rather than a MapReduce task; it just dumps the data as-is without doing anything to it. Whereas whenever you do a 'select column', a map job internally picks out that particular column and produces the output.
There was also a bug filed for this to make 'select column' query run without mapreduce. Check the details here: https://issues.apache.org/jira/browse/HIVE-887
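You can see the difference by comparing the query plans (table and column names are placeholders; the exact behaviour depends on your hive.fetch.task.conversion setting, as described in the answers above):

```sql
-- Typically shows only a Fetch Operator stage.
EXPLAIN SELECT * FROM tablename;

-- Depending on hive.fetch.task.conversion, this may instead show a Map Reduce stage.
EXPLAIN SELECT col1 FROM tablename;
```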

Hive: Any way to disable partition statistics?

Summary of the issue:
Whenever I insert data into a dynamically partitioned table, far too much time is being spent updating the partition statistics in the metastore.
More details:
I have several queries that select data from one hive table and insert it into another table that is dynamically partitioned into about 8000 partitions. The queries complete quickly and correctly. The output files are copied into the partition directories very quickly. But then this happens for every partition:
INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(253)) - ugi=hive ip=unknown-ip-addr cmd=append_partition : db=default tbl=some_table[14463,1410]
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(284)) - Updating partition stats fast for: some_table
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(292)) - Updated size to 1042
Each such partition update is taking about 500 milliseconds. But Hive puts an exclusive lock on the entire table while these updates are happening, and with 8000 such partitions this means that my table is locked for an unacceptably long time.
It seems to me that there must be some way to disable these partition statistics without hurting Hive's performance too badly; after all, I could just manually copy files into these partitions without involving Hive at all.
I've tried setting some of the "hive.stats" settings, but there's very little documentation on them, so I don't know exactly what they're supposed to do. Specifically, I've tried setting:
set hive.stats.autogather=false;
set hive.stats.collect.rawdatasize=false;
Any suggestions on how to prevent Hive from trying to keep track of partition statistics would be greatly appreciated!
Using set hive.stats.autogather=false within a session will not take effect. The reason is that when the Hive connection is created, it configures the Hive settings for the metastore, and once configured they cannot be modified for that connection anymore.
You can disable the statistics in two ways:
1. Via the Hive shell
Start the shell with hive --hiveconf hive.stats.autogather=false.
2. Updating hive-site.xml
Update the following in hive-site.xml and restart the Hive session.
<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>
According to the Hive statistics documentation (https://cwiki.apache.org/confluence/display/Hive/StatsDev), this should disable statistics on partitions:
set hive.stats.autogather=false;