I have a piece of functionality where I want to delete data from a partition, and if all the data in the partition is deleted, remove the partition as well.
These are the steps we follow:
Search for data in the partition, which is keyed on a date.
Delete the data from the partition.
Check whether any records are still present; if so, run compute statistics (for columns as well).
If there are no records, drop the partition and run compute statistics.
When we then run a select query, it returns an error for the deleted partition.
Even after calling compute statistics, it still reports that the partition does not exist.
Am I missing anything?
We have tried to compute the statistics with some code changes, but it does not work. These are the steps we are following.
If the partition was deleted, we run these two statements next:
ALTER TABLE transactions DROP IF EXISTS PARTITION (eventday=" + eventDay + ")
ALTER TABLE transactions ADD IF NOT EXISTS PARTITION (eventday=" + eventDay + ")
These steps are executed every time:
ANALYZE TABLE transactions PARTITION(eventday=" + eventDay + ") COMPUTE STATISTICS FOR COLUMNS
ANALYZE TABLE transactions PARTITION(eventday=" + eventDay + ") COMPUTE STATISTICS
The expected result is to avoid this exception, since the partition no longer exists; instead, a select query against the deleted partition returns this error:
java.sql.SQLException: Query failed (#20190619_060000_00416_bztpj): Partition location does not exist: hdfs://XXXX/XX/XX
Is it possible to see the total number of partitions of a table in Impala?
For example, db.table has 40,500 partitions.
Use the SHOW PARTITIONS statement.
SHOW PARTITIONS [database_name.]table_name
It will print the partition list, and you can count the rows in the output minus the header (3 rows) and footer (1 row). Unfortunately, there is no command that returns an already-calculated partition count, except for Kudu tables: SHOW TABLE STATS prints the number of partitions in a Kudu table.
Of course you can execute select count(distinct part_col1, part_col2 ...) from table, but it is not as efficient as SHOW PARTITIONS.
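If you do use the SQL route, a subquery over the distinct partition values is a safer sketch, since multi-column count(distinct ...) support varies by engine (part_col1, part_col2 and db.table are the placeholders from above):

select count(*) as partition_count
from (
  select distinct part_col1, part_col2
  from db.table
) p;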
One of our Synapse tables has 300 million rows and keeps growing. Every row has a status column, active_row (int), which is either 0 or 1. Users only query with active_row = 1, which matches only about 28 million rows; the remaining ~270 million rows are inactive.
To improve performance and avoid a full table scan for active_row = 1, I converted it into a partitioned table on active_row as below:
CREATE TABLE [repo].[STXXXXX]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED INDEX (
[ID] ASC
),
PARTITION
(
active_Row RANGE LEFT FOR VALUES (0,1)
)
)
as
select * from repo.nonptxx;
Users reported no performance improvement after moving to the partitioned table. When I compared the partitioned and non-partitioned queries below, I saw no difference in the query explain plan in terms of estimated subtree cost, operations, etc., and all the stats showed the same figures. From sys.dm_pdw_nodes_db_partition_stats I can see 3 partitions were created: partition 1 holds the 270 million rows split across 60 nodes, partition 2 holds 30 million rows split across 60 nodes, and partition 3 is empty.
select * from [repo].[STXXXXX] where active_row =1
vs
select * from repo.nonptxx where active_row =1
Please advise what is wrong, why there is no improvement after moving to the partitioned table, and how to tune it.
Are statistics updated?
Run UPDATE STATISTICS [schema_name].[table_name] and rerun your tests (OR create Stats if they don't exist).
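For example (the table name follows the question; the statistics name is illustrative):

UPDATE STATISTICS [repo].[STXXXXX];
-- or, if no statistics exist on the partitioning column yet:
CREATE STATISTICS stat_active_row ON [repo].[STXXXXX] (active_Row);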
You should see a Filter step w/ the smaller number of rows returned when querying a single partition in the tsql query plan right after the Get step. You won't see it in the dsql query plan. You won't see any subtree cost for a Select * which translates to a single Return operation from the individual nodes, however you will see the estimated number of rows per execution get smaller as you filter by partition (w/ stats up to date). Missing or outdated stats can produce some odd query plan results because the optimizer essentially doesn't have enough information to make a good decision...therefore unpredictable and sometimes poor results.
Another option you may want to consider if it doesn't give you the performance you're looking for is keeping the data w/o partitions and simply creating a non-clustered index on the column. Indexes don't always get used or behave exactly how you'd expect w/ SQL server, however in this use case typically a one column index will greatly help performance. The benefit with the index is if you have data moving from active to inactive it doesn't need to move records between physical partitions.
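A minimal sketch of that alternative, run against the original non-partitioned table (the index name is illustrative):

CREATE INDEX ix_active_row ON repo.nonptxx (active_row);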
My script is failing with a heap-space issue because it processes too many partitions. To avoid the issue, I am trying to insert all the partitions into a single partition, but I am getting the error below:
FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different ''2021-01-16'': Table insclause-0 has 78 columns, but query has 79 columns.
set hive.exec.dynamic.partition=true;
set mapreduce.reduce.memory.mb=6144;
set mapreduce.reduce.java.opts=-Xmx5g;
set hive.exec.dynamic.partition=true;
insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
select * from db_temp.travel_history_denorm_temp_bq
distribute by start_date;
Can someone please suggest what the issue is? I checked the schemas of the tables and they are the same.
You are inserting into a static partition (the partition value is specified in the target table's partition clause); in this case you should not have the partition column in the select. select * returns the partition column (the last one), which is why the query fails: the select should contain no partition column.
Static partition insert:
insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
select col1, col2, col3 ... --All columns except start_date partition column
from ...
Dynamic partition:
insert overwrite table db_temp.travel_history_denorm partition (start_date)
select * --All columns in the same order, including partition
from ...
Adding distribute by triggers an additional reduce step: all records are grouped according to the distribute by expression and each reducer receives a single partition. This can help solve the OOM problem when loading many dynamic partitions, because without distribute by each reducer creates files in every partition, keeping too many buffers open simultaneously.
In addition to distribute by, you can set the maximum bytes per reducer. This setting limits the amount of data processed by a single reducer and may also help with OOM:
set hive.exec.reducers.bytes.per.reducer=16777216; --adjust for optimal performance
If this figure is too small, it will trigger too many reducers; if too big, each reducer will process too much data. Adjust accordingly.
Also try this setting for dynamic partition load:
set hive.optimize.sort.dynamic.partition=true;
When enabled, the dynamic partitioning column will be globally sorted. This way only one record writer needs to be kept open for each partition value in the reducer, thereby reducing the memory pressure on reducers.
You can combine all these methods: distribute by partition key, bytes.per.reducer and sort.dynamic.partition for dynamic partition loading.
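Put together, a dynamic-partition load for this table could look like the sketch below (the bytes-per-reducer figure is just a starting point to tune; hive.exec.dynamic.partition.mode=nonstrict is assumed because no static partition value is given):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=16777216; -- adjust for optimal performance
set hive.optimize.sort.dynamic.partition=true;

insert overwrite table db_temp.travel_history_denorm partition (start_date)
select * -- all columns in the same order, partition column last
from db_temp.travel_history_denorm_temp_bq
distribute by start_date;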
The exception message can also help you understand where exactly the OOM happens, so you can fix it accordingly.
I applied partitioning on a DateTime column of an MSSQL table.
I created a partition function, a partition scheme, and 4 filegroups, and supplied the boundary values.
I then queried the table with a WHERE condition on the partitioned column.
How can I tell whether the query reads all records or only the relevant filegroup?
In other words, how do I know whether the query is using the partitions or not?
One way is with the actual query execution plan. The Actual Partition Count of the seek/scan operator will show the actual number of partitions touched.
Another method is to run the query with SET STATISTICS IO ON, where the scan count of the table will reflect the number of partitions used.
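For example, with a hypothetical table dbo.Orders partitioned on OrderDate:

SET STATISTICS IO ON;

SELECT *
FROM dbo.Orders
WHERE OrderDate >= '2021-01-01' AND OrderDate < '2021-02-01';
-- The Messages output shows "Scan count N" for dbo.Orders; with partition
-- elimination, N reflects only the partitions actually touched.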
I've recently moved to using AvroSerDe for my External tables in Hive.
Select col_name,count(*)
from table
group by col_name;
The above query gives me a count, whereas the query below does not:
Select count(*)
from table;
The reason is that Hive just looks at the table metadata and fetches the value from there; the statistics for the table were never updated in Hive, which is why count(*) returns 0.
The statistics are written with zero data rows at table-creation time, and for any data appends or changes Hive needs this statistics entry in the metadata to be updated.
Running the ANALYZE command gathers statistics and writes them into the Hive Metastore.
ANALYZE TABLE table_name COMPUTE STATISTICS;
Visit the Apache Hive wiki for more details about the ANALYZE command.
Other methods to solve this issue:
Using 'limit' and 'group by' clauses triggers a MapReduce job to get the count of rows and gives the correct value.
Setting fetch task conversion to none forces Hive to run a MapReduce job to count the number of rows:
hive> set hive.fetch.task.conversion=none;
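Then rerun the count in the same session; it should now launch a job instead of reading the stale metadata:

hive> select count(*) from table;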