How to find corrupted avro file in unix

I have a few .avro files on a Unix system. How can I find out whether any of them are corrupted?
I am trying to insert Avro data from tableA into tableB with a Hive INSERT statement, and I keep getting a "Vertex failed" error. I assume that one particular Avro file is corrupted.
Error:
Vertex failed, vertexName=Map 1, vertexId=vertex_1523309222013_3304_3_00, diagnostics=[Task failed, taskId=task_1523309222013_3304_3_00_000001, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row <json rows>

You could use the Avro DataFileRepairTool to weed out bad Avro files. It is exposed as the repair command of avro-tools, and it has a mode (-o report) that only reports errors. The tool is available in the 1.8.x releases of Avro.
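As a rough sketch of how that check could be scripted over a directory of local files: the Python below first looks for the four magic bytes that every Avro object container file starts with, then runs the repair tool in report mode and reads the corrupt-block and corrupt-record counts from its summary. The function names, the jar file name and the directory layout are assumptions for illustration, and the parser assumes the summary lines look like the report output quoted further down this page.

```python
import re
import subprocess
from pathlib import Path

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def has_avro_magic(path):
    """Cheap first check: does the file start with the Avro container magic?"""
    with open(path, "rb") as f:
        return f.read(4) == AVRO_MAGIC


def parse_repair_report(text):
    """Extract the corrupt block/record counts from a 'repair -o report' summary.

    A count is None when the expected summary line is not present.
    """
    counts = {}
    for key in ("blocks", "records"):
        match = re.search(rf"Number of corrupt {key}:\s*(\d+)", text)
        counts[key] = int(match.group(1)) if match else None
    return counts


def check_file(path, tools_jar="avro-tools-1.8.1.jar"):
    """Run avro-tools in report mode on one file (jar name is an assumption)."""
    proc = subprocess.run(
        ["java", "-jar", tools_jar, "repair", "-o", "report", str(path)],
        capture_output=True,
        text=True,
    )
    return parse_repair_report(proc.stdout + proc.stderr)


def find_suspect_files(directory, tools_jar="avro-tools-1.8.1.jar"):
    """Return (path, reason) pairs for files that look damaged."""
    suspects = []
    for path in sorted(Path(directory).glob("*.avro")):
        if not has_avro_magic(path):
            suspects.append((path, "missing Avro magic bytes"))
            continue
        counts = check_file(path, tools_jar)
        if any(v is None or v > 0 for v in counts.values()):
            suspects.append((path, f"repair report: {counts}"))
    return suspects
```

Calling find_suspect_files("/path/to/avro/dir") would then list the files worth a closer look. Files that live in HDFS would have to be copied to the local file system first for this sketch to apply.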

Related

Unable to select count of rows of an ORC table through Hive Beeline command

I am using the following components: Hadoop 3.1.4, Hive 3.1.3 and Tez 0.9.2.
I have an ORC table whose rows I am trying to count. Running select count(*) from ORC_TABLE throws the exceptions below:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0182_1_00, diagnostics=[Vertex vertex_1670915386694_0182_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0182_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
... 17 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:1117)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1464)
at java.util.Collections.sort(Collections.java:177)
at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1115)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:1207)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$1500(OrcInputFormat.java:1142)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1179)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1176)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1176)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Another post, "ORC Split Generation issue with Hive Table", describes the same problem,
but it has no solution yet.
I also tried running CONCATENATE on the ORC table, but that didn't help either.
What does work is select * from ORC_TABLE, with or without LIMIT; it returns the records. I reckon the issue is limited to aggregate functions, or maybe I don't fully understand it yet.
I am also using Spark 3.3.1, and through Spark SQL I can get the same count and fetch the rows as well, so there are no issues on the Spark side.
In addition, when I change the execution engine to MR, the query works. It fails only on the Tez engine.
Any leads to resolve this issue are much appreciated.

The issue was resolved with the steps below, based on my earlier analysis:
The class org.apache.hadoop.fs.FileStatus is part of the hadoop-common jar.
We were using Hadoop 3.1.4 and Tez 0.9.2.
Tez 0.9.2 ships a tez.tar.gz that has to be placed on HDFS. This tez.tar.gz contained hadoop-common-2.7.2.jar, which does not have the compareTo(FileStatus) method named in the NoSuchMethodError above.
Solution 1:
We extracted tez.tar.gz and replaced all the Hadoop 2.7.2 jars with the Hadoop 3.1.4 jars. Do this if you don't want to reconfigure everything for a new Tez version; otherwise, follow Solution 2.
We then recreated the tarball and placed it in all dependent locations, including HDFS. For us that was /user/tez/share/tez.tar.gz; the location will differ per installation.
The error disappeared after these steps, and I can now count the records of any table.
Solution 2:
Alternatively, use Tez 0.10.x, which contains libraries for Hadoop 3.x, rather than Tez 0.9.2, which is compatible with Hadoop 2.7.x.
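Before repacking anything, it can help to confirm which Hadoop jars a given tez.tar.gz actually bundles. The sketch below is one way to do that with Python's tarfile module. The function names are made up for illustration, the jar-name pattern is an assumption about how the jars are named (artifact-version.jar), and jars whose names start with hadoop-shim are skipped on the assumption that they are Tez's own shim jars and carry the Tez version rather than a Hadoop version.

```python
import re
import tarfile

# Assumed naming scheme: hadoop-<artifact>-<version>.jar, e.g. hadoop-common-2.7.2.jar
HADOOP_JAR = re.compile(r"^(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)+)\.jar$")


def bundled_hadoop_jars(tarball_path):
    """Return {artifact name: version} for each hadoop-*.jar inside the tarball."""
    found = {}
    with tarfile.open(tarball_path, "r:gz") as tar:
        for member in tar.getmembers():
            base = member.name.rsplit("/", 1)[-1]
            if base.startswith("hadoop-shim"):
                continue  # assumed to be Tez's own shim jars, versioned with Tez
            match = HADOOP_JAR.match(base)
            if match:
                found[match.group(1)] = match.group(2)
    return found


def mismatched_jars(found, cluster_version):
    """Keep only the jars whose version differs from the cluster's Hadoop version."""
    return {name: ver for name, ver in found.items() if ver != cluster_version}
```

For the setup described above, mismatched_jars(bundled_hadoop_jars("tez.tar.gz"), "3.1.4") would be expected to list hadoop-common with version 2.7.2 among the entries to replace.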

Hadoop replication Issue during hive inserts

I've been getting this error over and over when running a test that inserts a bunch of records into a table using Hive. The nature of the error suggests it is not related to my code but to some system issue; I just can't figure out what that might be. The code I'm running does about 100 individual inserts into a table via Hive JDBC.
org.apache.ibatis.exceptions.PersistenceException:
### Error updating database. Cause: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10293]: Unable to create temp file for insert values File /tmp/hive/hive/4a7308b4-34e5-4fee-91c8-3e8e946ebfd5/_tmp_space.db/Values__Tmp__Table__242/data_file could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:843)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)

Error while loading AVRO files to BigQuery

I have successfully loaded a large number of Avro files (all with the same schema, into the same table), stored on Google Storage, using the bq CLI utility.
However, for some of the Avro files I get a very cryptic error while loading into BigQuery. The error says:
The Apache Avro library failed to read data with the follwing error: EOF
reached (error code: invalid)
I validated with avro-tools that the Avro file is not corrupted. Report output:
java -jar avro-tools-1.8.1.jar repair -o report 2017-05-15-07-15-01_48a99.avro
Recovering file: 2017-05-15-07-15-01_48a99.avro
File Summary:
Number of blocks: 51 Number of corrupt blocks: 0
Number of records: 58598 Number of corrupt records: 0
I tried creating a brand new table from one of the failing files, in case the problem was a schema mismatch, but that didn't help; the error was exactly the same.
I need help figuring out what could be causing this error.

There is no way to pinpoint the issue without more information, but I ran into this error message and filed a ticket.
In my case, a number of files in a single load job were missing columns, which was causing the error.
Explanation from the ticket:
BigQuery uses the alphabetically last file from the directory as the avro schema to read the other Avro files. I suspect the issue is with schema incompatibility between the last file and the "problematic" file. Do you know if all the files have the exact same schema or differ? One thing you could try to help verify this is to copy the alphabetically last file of the directory and the "problematic" file to a different folder and try to load those two files in one BigQuery load job and see if the error reproduces.
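To check for that kind of mismatch locally before trying another load job, one option is to read the writer schema out of each file's header and compare field names. The sketch below does this in plain Python, without any Avro library: it reads the container header (magic bytes, then a metadata map that holds the avro.schema entry) as laid out in the Avro specification. All function names are hypothetical, it assumes the top-level schema is a record, and it only compares top-level field names; it does not reproduce whatever compatibility checks BigQuery performs.

```python
import json

MAGIC = b"Obj\x01"  # Avro object container file magic


def _read_long(f):
    """Decode one zig-zag, variable-length Avro long from a binary stream."""
    shift = 0
    value = 0
    while True:
        byte = f.read(1)
        if not byte:
            raise EOFError("unexpected end of file in Avro header")
        value |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (value >> 1) ^ -(value & 1)


def _read_bytes(f):
    """Read a length-prefixed byte string."""
    length = _read_long(f)
    data = f.read(length)
    if len(data) != length:
        raise EOFError("truncated value in Avro header")
    return data


def read_avro_schema(path):
    """Return the writer schema (parsed JSON) stored in the file header."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError(f"{path} is not an Avro container file")
        meta = {}
        while True:
            count = _read_long(f)
            if count == 0:
                break  # end of the metadata map
            if count < 0:
                _read_long(f)  # block size in bytes, not needed here
                count = -count
            for _ in range(count):
                key = _read_bytes(f).decode("utf-8")
                meta[key] = _read_bytes(f)
    return json.loads(meta["avro.schema"])


def missing_fields(reference_path, other_path):
    """Top-level fields present in the reference file but absent in the other."""
    ref = [fld["name"] for fld in read_avro_schema(reference_path)["fields"]]
    other = {fld["name"] for fld in read_avro_schema(other_path)["fields"]}
    return [name for name in ref if name not in other]
```

Passing the alphabetically last file as reference_path and a failing file as other_path would show which columns, if any, the failing file lacks.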

SemanticException [Error 10001]: Line 1:14 Table not found 'test1' through in hive

I am fetching data from Hive (HiveServer2) in SpagoBI 5.1, an open source tool.
When I create a dashboard, it shows an error: Impossible to load dataset [test_connection] due to the following service errors: Method not supported;
In the background, the Hive terminal shows this error: SemanticException [Error 10001]: Line 1:14 Table not found 'test1'.
I start HiveServer2 with the command hive --service hiveserver2 10000 &.
Thanks

pig script unable to load nullable parquet data

I am trying to write a Pig script to compact small files that hold data in Parquet format. The lines below load the small files in the directory and then store them. The files have complex nested structures that are nullable, and they contain lots of NULLs.
LOGS = LOAD '/dt=20150307/hr=2015030700/*' USING parquet.pig.ParquetLoader();
STORE LOGS INTO '/user/compaction_output' USING parquet.pig.ParquetStorer();
I am getting the following error:
2015-04-29 17:00:45,883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2118: Cannot build an empty group
My suspicion is that this is caused by the null values in the input files.
Can someone help out?
