We have a Hive table containing the location and name of each employee.
In the next update of the table, an employee's location is changed.
Both ORC files (the original and the updated one) are present in the table's warehouse location.
Now, how do we write the SELECT query so that it returns the latest location of the employee?
I'm running into an issue and wondering if anyone would be able to help. There is a designated table in our BQ project that hosts sales data, myproject_dataset.sales_table. This table is not partitioned by _PARTITIONTIME but by the date identifier in the sales files, Sales_Date, so I can't query data in this table by the day it was ingested, only by the date in the sales file.
A file was loaded into the myproject_dataset.sales_table table with incorrect data for a particular date, e.g. 2022-10-19. The issue is that this file also contains records from previous dates, so executing the following command to remove the incorrect data won't solve the issue:
DELETE FROM myproject_dataset.sales_table
WHERE Sales_Date = '2022-10-19';
I queried INFORMATION_SCHEMA.PARTITIONS to get the partition_ID of the incorrect file loaded into myproject_dataset.sales_table on the particular date, e.g. 2022-10-19.
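For reference, a query along these lines returns that metadata (dataset and table names taken from above; the selected columns are illustrative):

SELECT table_name, partition_id, total_rows, last_modified_time
FROM myproject_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'sales_table';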
Is there a way to delete records by partition metadata, e.g. partition_ID, in a BQ table?
I have a table with thousands of partitions. I want to change all the partition locations to point at a different cluster.
For example, for table test_table and partition day=2021041600:
Old location: hdfs://cluster1/dir1/dir2/day=2021041600/<files>
New location: hdfs://cluster2/dir1/dir2/day=2021041600/<files>
I can achieve this in two ways:
1. Fetch the list of all partitions and update the partition location for every partition, one by one.
2. Change the base location of the table and run the MSCK REPAIR command on the table.
My question is: which option is the better approach to take?
The 1st approach will work.
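For example, with the partition from the example above, each partition can be repointed like this (a sketch; you would generate one such statement per partition):

ALTER TABLE test_table PARTITION (day=2021041600)
SET LOCATION 'hdfs://cluster2/dir1/dir2/day=2021041600';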
The 2nd approach (MSCK REPAIR):
MSCK REPAIR will not work if you only change the table location, because the existing partitions remain mounted to their old locations outside the new table location.
Instead, make the table EXTERNAL, DROP it, CREATE it with the new location, and run MSCK REPAIR:
alter table test_table SET TBLPROPERTIES('EXTERNAL'='TRUE');
drop table test_table;
create table test_table ... location 'hdfs://cluster2/dir1/dir2';
MSCK REPAIR TABLE test_table;
I have a partitioned table Student which already has one partition column, dept. I need to add a new partition column, gender.
Is it possible to add this new partition column to an already partitioned Hive table?
The table data does not have a gender column; it is a new constant column to be added to the Hive table.
Partitions are hierarchical folders like table_location/dept=Accounting/gender=male/
The folder structure should exist. You can easily add a non-partition column as the last column, and it will return NULLs if the data does not contain that column. But to add a partition column, the easiest way is to create a new table partitioned the way you want, INSERT OVERWRITE that table from the old one (selecting the partition columns last), drop the old table, and rename the new one.
See this answer about dynamic partitions load: https://stackoverflow.com/a/48901871/2700344
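A minimal sketch of that create/insert/drop/rename sequence, assuming Student has data columns id and name (invented for illustration) and filling the new gender partition with a constant:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE Student_new (id INT, name STRING)
PARTITIONED BY (dept STRING, gender STRING);

-- partition columns go last in the SELECT; gender is a constant here
INSERT OVERWRITE TABLE Student_new PARTITION (dept, gender)
SELECT id, name, dept, 'unknown' FROM Student;

DROP TABLE Student;
ALTER TABLE Student_new RENAME TO Student;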
I have a Hive table that's partitioned:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first load is done from Oracle to Hive via PySpark using:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates partitions dynamically during the run. The problem is incremental loading: the daily loads create an individual file for a single record under the partition:
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a way to have values appended to the existing Parquet file under the partition until it reaches the block size, without smaller files being created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for Hive:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
These still don't help with the daily inserts. Any alternate approach that could be followed would be really helpful.
As far as I know, we can't store a single file for daily partition data, since the data will be stored in separate part files for each daily insert into the partition.
Since you mention that you are importing the data from an Oracle DB, you can import the entire data each time from Oracle and overwrite it into HDFS. That way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.
I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with an entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table, selecting from the temp_CUSTOMER_PART table.
In this case you will end up with a final table without small files in it.
NOTE: you need to make sure no new data is inserted into the CUSTOMER_PART table after the temp table has been created.
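A sketch of Approach 1, using the temp table named above (the snapshot table itself is left unpartitioned):

-- snapshot the current state of the table
CREATE TABLE temp_CUSTOMER_PART STORED AS PARQUET AS
SELECT * FROM CUSTOMER_PART;

-- rewrite the final table in one pass, compacting the small files
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID)
SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM temp_CUSTOMER_PART;

DROP TABLE temp_CUSTOMER_PART;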
Approach 2:
Using the input_file_name() function:
Check how many distinct filenames there are in each partition, then select only the partitions that have more than, say, 10 files each.
Create a temporary table with these partitions and overwrite only the selected partitions of the final table.
NOTE: you need to make sure no new data is inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
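A sketch of the per-partition file count; note that in HiveQL the virtual column is INPUT__FILE__NAME (input_file_name() is the Spark SQL spelling):

SELECT CUSTOMER_ID, COUNT(DISTINCT INPUT__FILE__NAME) AS file_count
FROM CUSTOMER_PART
GROUP BY CUSTOMER_ID
HAVING COUNT(DISTINCT INPUT__FILE__NAME) > 10;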
Approach 3:
Hive (not Spark) allows overwriting a table while selecting from it, i.e.:
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any job writing to the table will have to wait.
In addition: the ORC format offers CONCATENATE, which merges small ORC files to create a new, larger file:
ALTER TABLE <db_name>.<orc_table_name> [PARTITION (partition_column='val')] CONCATENATE;
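For example, to compact a single partition of a hypothetical ORC table (names invented for illustration):

ALTER TABLE mydb.customer_orc PARTITION (CUSTOMER_ID=3) CONCATENATE;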
How do I apply a partition to a Hive table which is already partitioned? I am not able to fetch the partitioned data from the folder after the data is loaded.
The first rule of partitioning in Hive is that the partitioning column should be the last column in the data. Since the data is already partitioned, say we partition the data further on gender (M/F): two directories, gender=M and gender=F, will be created, the respective gender's data will be available inside each of the directories, and the last column in that data will again be gender.
If you want to partition data again on a partitioned table, use INSERT INTO ... SELECT and make sure the last column you select is the partition column you want to apply to the partitioned data.
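A hedged sketch of that pattern, with table and column names invented for illustration (gender as in the example above):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE target_table PARTITION (gender)
SELECT id, name, gender FROM source_table;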
Did you add a partition manually with an HDFS command? In that case the metastore will not keep track of the partitions being added unless you specify ALTER TABLE ... ADD PARTITION ...
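For example, a manually created directory can be registered like this (partition spec and path are illustrative):

ALTER TABLE table_name ADD PARTITION (gender='M')
LOCATION '/path/to/table/gender=M';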
Alternatively, try this:
MSCK REPAIR TABLE table_name;
If that is not the case, then try dropping the partitions and creating them again, using the ALTER TABLE command; but note that you will lose the data. And your partitioning column value should be mentioned as the last column if you are doing a dynamic partition insert.