ORC files with Hive: java.io.IOException: Two readers

I have an ACID Hive table with files in ORC format. When attempting a compaction, I end up with the following error: Task: ... exited : java.io.IOException: Two readers for ... The full error is as follows:
2019-06-03 07:01:05,357 ERROR [IPC Server handler 2 on 41085] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1558939181485_29861_m_000001_0 - exited : java.io.IOException: Two readers for {originalWriteId: 143, bucket: 536870912(1.0.0), row: 3386, currentWriteId 210}: new [key={originalWriteId: 143, bucket: 536870912(1.0.0), row: 3386, currentWriteId 210}, nextRecord={2, 143, 536870912, 3386, 210, null}, reader=Hive ORC Reader(hdfs://HdfsNameService/tbl/delete_delta_0000209_0000214/bucket_00001, 9223372036854775807)], old [key={originalWriteId: 143, bucket: 536870912(1.0.0), row: 3386, currentWriteId 210}, nextRecord={2, 143, 536870912, 3386, 210, null}, reader=Hive ORC Reader(hdfs://HdfsNameService/tbl/delete_delta_0000209_0000214/bucket_00000, 9223372036854775807)]
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.ensurePutReader(OrcRawRecordMerger.java:1171)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1126)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRawReader(OrcInputFormat.java:2402)
at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:964)
at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:941)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
This table is created and updated by MERGEing Avro files into the ORC table, hence the pile of deltas, both delete_delta and delta (a rough sketch of the ingestion follows).
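For illustration only, the ingestion is roughly of this shape (the staging table name, the match condition and the column assignments below are hypothetical, not the actual job):
MERGE INTO contact_group t
USING contact_group_avro_staging s   -- hypothetical external table over the incoming Avro files
ON t.id = s.id
WHEN MATCHED AND s.deleted_on_utc IS NOT NULL THEN DELETE
WHEN MATCHED THEN UPDATE SET remarks = s.remarks, updated_on_utc = s.updated_on_utc, load_ts = s.load_ts
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.license_name, s.campaign_id, s.name, s.is_system, s.is_test, s.is_active, s.remarks, s.updated_on_utc, s.created_on_utc, s.deleted_on_utc, s.sys_schema_version, s.sys_server_ipv4, s.sys_server_name, s.load_ts);
Each such MERGE writes a new delta_* directory for inserts/updates and a delete_delta_* directory for deletes, which is where the bucket files from the error message live.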
I have many other such tables, which do not have this issue. This table has nothing out of the ordinary, is actually quite small (<100k rows, 2.5 MB on disk), and was updated about 100 times in the last month (20k rows updated, 5 MB of update data). The DDL is:
CREATE TABLE `contact_group`(
`id` bigint,
`license_name` string,
`campaign_id` bigint,
`name` string,
`is_system` boolean,
`is_test` boolean,
`is_active` boolean,
`remarks` string,
`updated_on_utc` timestamp,
`created_on_utc` timestamp,
`deleted_on_utc` timestamp,
`sys_schema_version` int,
`sys_server_ipv4` bigint,
`sys_server_name` string,
`load_ts` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://HdfsNameService/dwh/vault/contact_group'
TBLPROPERTIES (
'bucketing_version'='2',
'last_modified_by'='hive',
'last_modified_time'='1553512639',
'transactional'='true',
'transactional_properties'='default',
'transient_lastDdlTime'='1559522011')
This happens every few months. As everything else (select, merge) works, the fix is usually to create a second table (create table t as select * from contact_group) and switch the tables (a sketch follows), but I would like to find the real underlying reason.
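Concretely, the workaround looks roughly like this (a sketch only; whether the rebuilt table comes out as full ACID depends on the Hive defaults for managed tables):
-- rebuild the table from a full scan, then swap the names
CREATE TABLE contact_group_rebuilt STORED AS ORC AS SELECT * FROM contact_group;
ALTER TABLE contact_group RENAME TO contact_group_broken;
ALTER TABLE contact_group_rebuilt RENAME TO contact_group;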
The only reference I found about my error is in the code itself, which does not help me much.
This is on HDP 3.1, with Hive 3.

In my case, I could not resolve the problem using the solution suggested by #ShuBham ShaRma.
After looking at #cang_yun's finding, I tried deleting one of the bucket files (bucket_00001) and was able to run select statements on that table again. I'm not sure this is the right way to do it, but it worked in my case.

I have faced the problem too. Using orc-tools I scanned all the files under delete_delta and found that the rows are identical across these files (for example, there are 7 rows in bucket_00000, and the same 7 rows in bucket_00001). So the same key (originalTransaction-bucket-rowId-currentWriteId) shows up again when the merger iterates into the next bucket file.
Another fix is to create the table with explicit bucketing, which may avoid the problem.

In my case, it was caused by user error: two tables were referring to the same HDFS directory. When creating the table, I set the location name and accidentally copied the same directory to another table.
My program then performed changes on both transactional tables, resulting in delta files that could not be resolved.
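For illustration, the mistake looked roughly like this (table names and the path are hypothetical; the point is that both DDLs share one LOCATION, so both tables write their deltas into the same directory):
CREATE TABLE table_a (id BIGINT) STORED AS ORC
LOCATION 'hdfs://nameservice/warehouse/table_a'
TBLPROPERTIES ('transactional'='true');
CREATE TABLE table_b (id BIGINT) STORED AS ORC
LOCATION 'hdfs://nameservice/warehouse/table_a'   -- copy-paste error: same directory as table_a
TBLPROPERTIES ('transactional'='true');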

Here is a summary of the issue as observed for one of our users:
The table fails during the fetch task operation from disk; it got corrupted with duplicate key identifiers in the delete_delta files (https://issues.apache.org/jira/browse/HIVE-22318). There is a temporary workaround to read the table by setting hive.fetch.task.conversion=none, but this will not help compaction or any fetch-task operation to succeed.
Steps performed to create a backup of the table:
Connect with beeline and set the property below in the session:
set hive.fetch.task.conversion=none ;
Now you'll be able to run select statements over the mentioned table.
Run the statement below to create a backup of the table:
create table <backup_tbl_name> as select * from <problem_tbl> ;
Once you have the backup ready, log out of the session and check the backup table without setting any property (check the row count and table consistency from a data-quality perspective):
select * from <backup_tbl_name> ;
To recreate the original table from the backup table:
Now you can drop the problem table and replace it with the backup table
drop table <problem_tbl> ;
alter table <backup_tbl_name> rename to <original_tbl_name> ;
Note: To avoid this issue in the future, create the table with a bucketing column in the DDL, for example:
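A minimal sketch, reusing the table from the question with an illustrative bucket count (column list trimmed for brevity):
CREATE TABLE contact_group (
  id bigint,
  license_name string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');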

Related

Can't copy BigQuery table with DDL modifications?

What's the problem
I have a table in BigQuery whose schema had to change over time.
I used DDL to change the schema, doing the following:
Switching some columns from INT to FLOAT
Deleting a FLOAT column and recreating it as a STRING column.
I attempted to copy the table into a new blank table and I get the following error:
Operation could not be completed. Error message: Table project_id:dataset.table_id with column level ddl operation does not support table copy.
So I can no longer copy or snapshot this table? I can't find any documentation on this error at all.
How can I copy this table and what is going on?
My best guess
I assumed DDL changed the table and that was it.
I guess perhaps it works more like Django migrations, and since the DDL statements can't be copied, I can now never copy that table again?
I wouldn't have altered this table had I known that was the case. Does that mean we're back to exporting all our data to GCS and reloading?

Old records appear in the Hadoop table after dropping it and creating a new table with the same old name

I have a question regarding creating tables in Hadoop.
I create an external table the following way:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')
After I created this table, I ran a Jenkins job (with a Sqoop process) which loaded 70.000.000 records into it.
Then I needed to remove this table, so I ran:
DROP TABLE SCHEMA.TABLENAME
Later on I want to create a table with the same name as the previous one, but I need it to be empty. I run the same query as before:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')
But when I create the table this way, it contains 70.000.000 records again, although I didn't run any job to populate it.
This is why I have two questions:
When I drop and recreate a table with the old name, does it recover the records from the old table?
How can I drop (or truncate) a table in Big SQL/Hive so that I end up with an empty table with the old name?
I am using bigsql and hive.
Dropping an external table doesn't remove the stored data, only the metadata from the Hive Metastore.
Refer to Managed vs External Tables.
Key points:
Use external tables when the files are already present or in remote locations
The files should remain even if the table is dropped
Create a managed table (remove EXTERNAL from your query) if you want to be able to DROP and/or TRUNCATE (a sketch follows after this list)
Or have your Jenkins job run hadoop fs -rm -skipTrash before the import.
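A sketch of the managed variant of the same DDL (Db2 Big SQL syntax, with the EXTERNAL keyword removed; once the table is managed, dropping or truncating it should also remove the underlying files):
CREATE HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')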

Rename a table in Amazon Redshift

I've been trying to rename a table from "fund performance" to fund_performance in SQLWorkbench for a Redshift database. Commands I have tried are:
alter table schemaname."fund performance"
rename to fund_performance;
I received a message that the command executed successfully, and yet the table name did not change.
I then tried copying the table to rename it that way. I used
#CREATE TABLE fund_performance LIKE "schema_name.fund performance";
CREATE TABLE fund_performance AS SELECT * FROM schema_name."fund performance";
In both these cases I also received a message that the statements executed successfully, but nothing changed. Does anyone have any ideas?
Use the following; it may work out for you:
SELECT * into schema_name.fund_performance FROM schema_name.[fund performance]
It will copy the data by creating a new table named fund_performance, but it won't carry over any constraints or identity columns.
To rename the specific table without disturbing existing constraints:
EXEC sp_rename 'schema_name.[fund performance]', 'schema_name.fund_performance';

Databricks - is not empty but it's not a Delta table

I run a query on Databricks:
DROP TABLE IF EXISTS dublicates_hotels;
CREATE TABLE IF NOT EXISTS dublicates_hotels
...
I'm trying to understand why I receive the following error:
Error in SQL statement: AnalysisException: Cannot create table ('default.dublicates_hotels'). The associated location ('dbfs:/user/hive/warehouse/dublicates_hotels') is not empty but it's not a Delta table
I already found a way to solve it (by removing the directory manually):
dbutils.fs.rm('.../dublicates_hotels',recurse=True)
But I can't understand why the table is still being kept, even though I created a new cluster (and terminated the previous one) and am running this query with the new cluster attached.
Can anyone help me understand this?
I also faced a similar problem; using CREATE OR REPLACE TABLE instead solved it for me.
DROP TABLE & CREATE TABLE work with entries in the metastore, which is a kind of database keeping the metadata about databases and tables. It can happen that the entries don't exist in the metastore, so DROP TABLE IF EXISTS doesn't do anything. But when CREATE TABLE is executed, it additionally checks for the location on DBFS and fails if the directory exists (possibly with data). This directory could be left over from previous experiments, when data were written without going through the metastore.
If the table was created with a LOCATION specified, it is an EXTERNAL table, so when you drop it you drop only the Hive metadata for that table and the directory contents remain as they are. You can restore the table with CREATE TABLE if you specify the same LOCATION (Delta keeps the table structure along with its data in the directory); see the sketch below.
If no LOCATION was specified when the table was created, it is a MANAGED table, and DROP will destroy both the metadata and the directory contents.
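If the leftover directory still holds a valid Delta table, it can be re-registered instead of being deleted, roughly like this (a sketch based on the path in the error message above):
CREATE TABLE dublicates_hotels
USING DELTA
LOCATION 'dbfs:/user/hive/warehouse/dublicates_hotels';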

HDFS data corruption issue

We have a data corruption issue on our Hadoop cluster. We have a managed Hive table which contains three years of data, partitioned by year.
The queries below run fine without any issue:
select count(*) from tkt_hist where yr=2015
select count(*) from tkt_hist where yr=2016
select count(*) from tkt_hist where yr=2017
However, when we try to group by year, the error below is shown:
Error while compiling statement: FAILED: SemanticException java.io.FileNotFoundException: File hdfs://ASIACELLHDP/apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old/yr=2015/mn=01/dy=01 does not exist. [ERROR_STATUS]
Even select will not work when we specify a year other than 2015.
-- this works fine
Select * from tkt_hist where yr=2015 limit 10;
-- below throws the same error mentioned above
Select * from tkt_hist where yr=2016;
Try increasing the Java heap space (and increase reducer memory if that doesn't work).
For example:
set mapreduce.map.java.opts = -Xmx15360m
You will have to drop the partitions manually because MSCK REPAIR TABLE only adds partitions but doesn't remove existing ones.
You will have to iterate through the corrupt partitions list. For internal tables, you'll have to be specific, as dropping a partition deletes the underlying physical files.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr=2015, mn=01, dy=01);
You will need to do this for each partition. You could put the statements in a bash script and execute them with hive -e or beeline -e, which accept a quoted query string, or use the multi-partition form sketched below.
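Hive also accepts several partition specs in a single DROP statement, which avoids looping entirely (the partition values here are illustrative):
ALTER TABLE tkt_hist DROP IF EXISTS
PARTITION(yr=2015, mn=01, dy=01),
PARTITION(yr=2015, mn=01, dy=02);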
If you are using an external table, then it's much easier to remove all partitions and then repair the table.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr<>'', mn<>'', dy<>'');
Make sure to repair the table as the user owning the Hive DB as well as the HDFS path.
MSCK REPAIR TABLE tkt_hist;
This should add the partition folders currently present in the table path without re-adding the invalid partitions.
Note: If your user isn't the owner of the directory, ensure you have write permissions and do your work in the Hive access client, as beeline requires absolute ownership rights to work.