hive update lastAccessTime

I wanted to update lastAccessTime on a Hive table. After googling around, I found a solution:
set hive.exec.pre.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
But if I have two databases, A and B, and run the following Hive SQL:
set hive.exec.pre.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
use A;
insert overwrite table A.xxx
select c1,c2 from B.xxx;
Hive returned:
org.apache.hadoop.hive.ql.metadata.InvalidTableException(Table not found B.xxx

To retrieve a table's 'LastAccessTime', run the following commands through the Hive shell, replacing [database_name] and [table_name] with the relevant values.
use [database_name];
show table extended like '[table_name]';
This will return several metrics, including the number of milliseconds (rather than seconds) elapsed since the epoch. To format that number as a timestamp string in the current system's time zone, remove the last three digits (converting milliseconds to seconds) and run it through the following command:
select from_unixtime([last_access_time]);
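For example, a minimal sketch with a hypothetical value: if show table extended reports lastAccessTime:1712345678000, drop the last three digits and convert the remaining seconds:
select from_unixtime(1712345678); -- prints the corresponding timestamp string in the session's time zone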

I wanted the same effect and finally got it working.
Your method is right; it's just the configuration values that matter:
<property>
<name>hive.security.authorization.sqlstd.confwhitelist.append</name>
<value>hive\.exec\.pre\.hooks</value>
</property>
<property>
<name>hive.exec.pre.hooks</name>
<value>org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec</value>
</property>
If it still doesn't work, your Hive may have the bug HIVE-18060 (UpdateInputAccessTimeHook fails for the non-current database).
It is fixed in CDH 5.15.0 (see the CDH 5.15.0 Release Notes) if you use CDH.
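With the whitelist entry in place, the session-level set from the question should now be accepted. A minimal sketch, reusing the tables from the question (A.xxx and B.xxx) and assuming the HIVE-18060 fix is present:
set hive.exec.pre.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
use A;
insert overwrite table A.xxx
select c1, c2 from B.xxx;
-- afterwards, lastAccessTime should be updated for the input table B.xxx as well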

Related

Replace data in a partition of a table in BigQuery

I have a use case where I have to replace the data in a partition of a BigQuery table every 15 minutes. Are there any functions available in BigQuery similar to partition exchange, or any provision to truncate the data of a partition?
Regarding your requirement to load new data every fifteen minutes into a partitioned table, you could use Data Manipulation Language (DML).
In order to update rows in a partitioned table, you could use the UPDATE statement, as stated in the documentation.
Also, in case you want to overwrite partitions, you could use a load job through the CLI, as stated here. Using --noreplace or --replace you can specify whether you want to append to or truncate the given partition.
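For the load-job route, a minimal sketch, assuming a hypothetical dataset mydataset, a table mytable partitioned by day, and a local CSV file; the $YYYYMMDD decorator selects the partition and --replace truncates it before loading:
bq load --replace --source_format=CSV \
  "mydataset.mytable\$20240101" ./new_rows.csv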

HQL failed sqlexception

When I run
Select * from table_a
using my Hive IDE, I receive:
SQLException pa current sql input files exceed the maximum limit. Please optimize. Maxfile--→100000
I could find nothing when I googled this error.
It seems like your table has exceeded the maximum number of HDFS files Hive allows.
Go to Hive (or Beeline) and run the following command; by default the value is 100000:
set hive.exec.max.created.files;
To fix the issue, you need to check your insert queries and understand why they are creating so many files.
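As a rough sketch, these standard Hive merge settings make inserts combine small output files before they are committed (the values shown are illustrative; defaults vary by version):
set hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;              -- merge small files produced by map-reduce jobs
set hive.merge.tezfiles=true;                 -- the same, when running on Tez
set hive.merge.smallfiles.avgsize=134217728;  -- merge when the average output file is under 128 MB
set hive.merge.size.per.task=268435456;       -- target file size after the merge (256 MB)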
In the meantime, you can concatenate some of the partitions (better to do it for all of them) using the following command:
alter table dbname.tblName partition (col_name=value) concatenate;

BigQuery: Atomically replace a date partition using DML

I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' date-partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning), because they don't allow one to use the $ decorator.
Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.
However, I'm not sure how to perform the equivalent operation using DML. I find myself running separate DELETE and INSERT commands. This isn't great because it increases complexity and the number of queries, and the operation isn't atomic.
I want to know how to do this using the MERGE statement to keep it all contained within a single, atomic operation, but I can't wrap my head around the MERGE statement's syntax and haven't found an example for this use case. Does anyone know how this should be done?
The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.
"because they don't allow one to use the $ decorator"
But they do: you can use table_name$YYYYMMDD when you load into a column-based partitioned table as well. For example, I made a partitioned table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
Then I loaded into a specific partition:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
I tried to load into the wrong partition for the input data, and received an error:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
Some rows belong to different partitions rather than destination partition 20181105
I then replaced the data for the partition:
$ echo "2,0.11,2018-11-07" > row.csv
$ bq load --replace "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.
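For the MERGE route, a sketch along these lines should replace the 2018-10-05 partition atomically, assuming a hypothetical dataset mydataset and the my_table/source_table setup from the question (both share a schema; my_table is partitioned on the DATE column day):
MERGE mydataset.my_table AS target
USING mydataset.source_table AS source
ON FALSE                                            -- never matches, so every row falls into a NOT MATCHED branch
WHEN NOT MATCHED AND source.day = DATE '2018-10-05' THEN
  INSERT ROW                                        -- insert the new rows for the partition
WHEN NOT MATCHED BY SOURCE AND target.day = DATE '2018-10-05' THEN
  DELETE                                            -- remove the existing rows from that partition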

List Impala tables that need invalidate/refresh

How can I programmatically find all Impala tables that need an INVALIDATE METADATA statement (because they were created in Hive, but are not yet known to Impala) or a REFRESH (because a column was added, a data file was added, etc.)?
Invalidate Metadata:
As a workaround, create a shell script to do the steps below; a minimal sketch follows the notes.
Using Beeline, connect to a particular database, run a show tables statement, and save the output to a file.
Using impala-shell, connect to the same database, run a show tables statement, and save the output to another file.
Now compare both files to remove the duplicates; the unique tables remaining from the first file are the tables that exist only in Hive and not in Impala.
Note:
a. Instead of handling one particular database at a time in steps 1 and 2, you can loop over all databases and save the output to a file. Inside the loop, redirect and append the output to a final output file in some format like database.table or database_table, so that all tables from all databases end up in a single file. Then follow step 3.
b. The unique tables remaining in the second output file after removing duplicates are tables that were deleted in Hive; INVALIDATE METADATA needs to be run in Impala to remove them from Impala's list.
c. A rename of a table in Impala is recognized by Hive, but the reverse is not; INVALIDATE METADATA should be run for both the old and the new table names to remove and add them respectively in Impala. This applies to most operations, not just renaming a table.
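A minimal sketch of such a script, assuming a hypothetical database mydb and placeholder connection strings:
# List Hive tables (tsv2 output avoids Beeline's ASCII table borders)
beeline -u jdbc:hive2://hiveserver:10000 --silent=true --outputformat=tsv2 \
  -e "show tables in mydb;" | sort > hive_tables.txt
# List Impala tables (-B gives plain, delimited output)
impala-shell -i impalad-host --quiet -B -q "show tables in mydb;" | sort > impala_tables.txt
# Only in Hive: new tables that need INVALIDATE METADATA in Impala
comm -23 hive_tables.txt impala_tables.txt
# Only in Impala: tables dropped in Hive; INVALIDATE METADATA removes them from Impala's list too
comm -13 hive_tables.txt impala_tables.txt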
Refresh:
Consider a text-format table with 2 columns and 1 row of data.
Now suppose a third column is added to that table in Beeline.
select * from table;  -- gives 3 columns in Beeline and 2 columns in Impala, since REFRESH has not been run in Impala for this table.
If we run COMPUTE STATS in Impala before running REFRESH in this case, then the newly added column from Beeline will be removed from the table schema in Hive as well.
select * from table;  -- gives 2 columns in Beeline and 2 columns in Impala, since COMPUTE STATS from Impala deleted the extra column's metadata, although the data for that column still resides in HDFS. This might cause parsing issues in Impala if the column was added somewhere in the middle or at the front instead of at the end.
So it is advised to run REFRESH table_name in Impala right after adding a new column or making any other modification in Beeline to an existing table, so you don't lose the table schema as explained in the scenario above.
refresh table;  -- Right after the modification in Hive, run REFRESH in Impala.
select * from table;  -- gives 3 columns in Beeline and 3 columns in Impala, since REFRESH was run before COMPUTE STATS in Impala.
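As a concrete sketch of that scenario, with a hypothetical table mydb.t1:
-- In Beeline (Hive): add the third column
alter table mydb.t1 add columns (c3 string);
-- In impala-shell, before running any COMPUTE STATS:
refresh mydb.t1;
select * from mydb.t1;  -- now returns all 3 columns in both engines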

Hive: Any way to disable partition statistics?

Summary of the issue:
Whenever I insert data into a dynamically partitioned table, far too much time is being spent updating the partition statistics in the metastore.
More details:
I have several queries that select data from one hive table and insert it into another table that is dynamically partitioned into about 8000 partitions. The queries complete quickly and correctly. The output files are copied into the partition directories very quickly. But then this happens for every partition:
INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(253)) - ugi=hive ip=unknown-ip-addr cmd=append_partition : db=default tbl=some_table[14463,1410]
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(284)) - Updating partition stats fast for: some_table
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(292)) - Updated size to 1042
Each such partition update is taking about 500 milliseconds. But Hive puts an exclusive lock on the entire table while these updates are happening, and with 8000 such partitions this means that my table is locked for an unacceptably long time.
It seems to me that there must be some way to disable these partition statistics without affecting the performance of Hive too terribly; after all, I could just manually copy files to these partitions without involving Hive at all.
I've tried setting some of the "hive.stats" settings, but there's very little documentation on them, so I don't know exactly what they're supposed to do. Specifically, I've tried setting:
set hive.stats.autogather=false;
set hive.stats.collect.rawdatasize=false;
Any suggestions on how to prevent Hive from trying to keep track of partition statistics would be greatly appreciated!
Using set hive.stats.autogather=false will not take effect within the application. The reason is that when the Hive connection is created, the Hive configs are passed to the metastore, and once they are configured they cannot be modified anymore.
You can disable the statistics in two ways:
1. Via the Hive shell
Using the Hive shell, type hive --hiveconf hive.stats.autogather=false.
2. Updating hive-site.xml
Update the following in hive-site.xml and restart the Hive session.
<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>
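After restarting the Hive session, a quick check (it should now print false):
set hive.stats.autogather;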
https://cwiki.apache.org/confluence/display/Hive/StatsDev
According to the Hive documentation, this should disable the statistics on partitions:
set hive.stats.autogather=false;