Spark parquet reading error - apache-spark-sql

I am working on a Spark project. I have a file in Parquet format, and when I try to load it using Java it gives me the error below. But when I load the same file in Hive from the same path and run select * from table_name, it works fine and the data comes back properly. Please help me with this issue.
java.io.IOException: Could not read footer: java.lang.RuntimeException: corrupted file: the footer index is not within the file
    at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:754)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:743)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: corrupted file: the footer index is not within the file
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)

You can try the options below:
1) sqlContext.read.parquet("path")
2) sqlContext.read.format(fileFormat)
       .option("header", header)           // use the first line of all files as the header
       .option("inferSchema", inferSchema) // automatically infer data types
       .load(source)
If that does not resolve your issue, please post a sample of your code.
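A "footer index is not within the file" error usually means the Parquet file is truncated or was not completely written, since Parquet keeps its metadata footer at the end of the file; it is also worth confirming that the path Spark reads is exactly the file Hive reads. A minimal read sketch, assuming a Spark 1.5.x-era SQLContext and a placeholder path:
// hypothetical path; substitute the actual location of the Parquet file
val df = sqlContext.read
  .format("parquet")
  .load("hdfs:///path/to/file.parquet")
df.printSchema() // fails here if the footer itself cannot be read
df.show(5)       // forces a small read of the row data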

Related

On New or Updated File SFTP connector issue with pipe closed

I am using the On New or Updated File SFTP connector to read around 3000+ files daily. Everything runs smoothly. However, at the end of the processing phase, the SFTP connector throws the error below and does not process the last file, which then remains in the SFTP folder for the next run. This error scenario repeats on each run, so the last file of each run never gets processed.
20:03:37.733 06/15/2022 Worker-0 [MuleRuntime].uber.33: [demo-data-api].prcsFiles-Error-SuccessFlow.CPU_INTENSIVE #38f99b06 ERROR
event:c3ccc560-f1be-11ec-a890-02732233ad66
********************************************************************************
Message : "org.mule.weave.v2.module.reader.ReaderParsingException: org.mule.runtime.api.exception.MuleRuntimeException - Exception was found trying to retrieve the contents of file /home/transaction/data.json
org.mule.runtime.api.exception.MuleRuntimeException: Exception was found trying to retrieve the contents of file /home/transaction/data.json
at org.mule.extension.sftp.internal.connection.SftpClient.exception(SftpClient.java:427)
at org.mule.extension.sftp.internal.connection.SftpClient.exception(SftpClient.java:423)
at org.mule.extension.sftp.internal.connection.SftpClient.getFileContent(SftpClient.java:349)
at org.mule.extension.sftp.internal.connection.SftpFileSystem.retrieveFileContent(SftpFileSystem.java:117)
at org.mule.extension.sftp.internal.SftpInputStream$SftpFileInputStreamSupplier.getContentInputStream(SftpInputStream.java:111)
at org.mule.extension.sftp.internal.SftpInputStream$SftpFileInputStreamSupplier.getContentInputStream(SftpInputStream.java:93)
at org.mule.extension.file.common.api.AbstractConnectedFileInputStreamSupplier.getContentInputStream(AbstractConnectedFileInputStreamSupplier.java:81)
at org.mule.extension.file.common.api.AbstractFileInputStreamSupplier.get(AbstractFileInputStreamSupplier.java:65)
at org.mule.extension.file.common.api.AbstractFileInputStreamSupplier.get(AbstractFileInputStreamSupplier.java:33)
at org.mule.extension.file.common.api.stream.LazyStreamSupplier.lambda$new$1(LazyStreamSupplier.java:29)
at org.mule.extension.file.common.api.stream.LazyStreamSupplier.get(LazyStreamSupplier.java:42)
at org.mule.extension.file.common.api.stream.AbstractNonFinalizableFileInputStream.lambda$createLazyStream$0(AbstractNonFinalizableFileInputStream.java:48)
at $java.io.InputStream$$EnhancerByCGLIB$$55e4687e.read(<generated>)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:102)
at org.mule.runtime.core.internal.streaming.bytes.AbstractInputStreamBuffer.consumeStream(AbstractInputStreamBuffer.java:111)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.consumeForwardData(FileStoreInputStreamBuffer.java:239)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.consumeForwardData(FileStoreInputStreamBuffer.java:202)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.doGet(FileStoreInputStreamBuffer.java:125)
at org.mule.runtime.core.internal.streaming.bytes.AbstractInputStreamBuffer.get(AbstractInputStreamBuffer.java:93)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.assureDataInLocalBuffer(BufferedCursorStream.java:126)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.doRead(BufferedCursorStream.java:101)
at org.mule.runtime.core.internal.streaming.bytes.AbstractCursorStream.read(AbstractCursorStream.java:124)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.read(BufferedCursorStream.java:26)
at java.io.InputStream.read(InputStream.java:101)
at org.mule.runtime.core.internal.streaming.bytes.ManagedCursorStreamDecorator.read(ManagedCursorStreamDecorator.java:96)
at org.mule.weave.v2.el.SeekableCursorStream.read(MuleTypedValue.scala:306)
at org.mule.weave.v2.module.reader.UTF8StreamSourceReader.handleBOM(SeekableStreamSourceReader.scala:179)
at org.mule.weave.v2.module.reader.UTF8StreamSourceReader.readAscii(SeekableStreamSourceReader.scala:163)
at org.mule.weave.v2.module.json.reader.JsonTokenizer.$init$(JsonTokenizer.scala:21)
at org.mule.weave.v2.module.json.reader.indexed.IndexedJsonTokenizer.<init>(IndexedJsonTokenizer.scala:15)
at org.mule.weave.v2.module.json.reader.indexed.IndexedJsonParser.parser(IndexedJsonParser.scala:17)
at org.mule.weave.v2.module.json.reader.JsonReader.readValue(JsonReader.scala:40)
at org.mule.weave.v2.module.json.reader.JsonReader.doRead(JsonReader.scala:30)
at org.mule.weave.v2.module.reader.Reader.read(Reader.scala:35)
at org.mule.weave.v2.module.reader.Reader.read$(Reader.scala:33)
at org.mule.weave.v2.module.json.reader.JsonReader.read(JsonReader.scala:20)
at org.mule.weave.v2.el.MuleTypedValue.value(MuleTypedValue.scala:147)
at org.mule.weave.v2.model.values.wrappers.DelegateValue.valueType(DelegateValue.scala:17)
at org.mule.weave.v2.model.values.wrappers.DelegateValue.valueType$(DelegateValue.scala:16)
at org.mule.weave.v2.el.MuleTypedValue.valueType(MuleTypedValue.scala:177)
at org.mule.weave.v2.model.types.ObjectType$.accepts(Type.scala:1068)
at org.mule.weave.v2.interpreted.node.executors.BinaryOverloadedStaticExecutor.executeBinary(BinaryOverloadedStaticExecutor.scala:45)
at org.mule.weave.v2.interpreted.node.ChainedBinaryOpNode.doExecute(ChainedBinaryOpNode.scala:37)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.ChainedBinaryOpNode.execute(ChainedBinaryOpNode.scala:7)
at org.mule.weave.v2.interpreted.node.NullSafeNode.doExecute(NullSafeNode.scala:14)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.NullSafeNode.execute(NullSafeNode.scala:8)
at org.mule.weave.v2.interpreted.node.BinaryOpNode.doExecute(BinaryOpNode.scala:15)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.BinaryOpNode.execute(BinaryOpNode.scala:9)
at org.mule.weave.v2.interpreted.node.structure.DocumentNode.doExecute(DocumentNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.structure.DocumentNode.execute(DocumentNode.scala:11)
at org.mule.weave.v2.interpreted.InterpretedMappingExecutableWeave.execute(InterpreterMappingCompilerPhase.scala:196)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:250)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.$anonfun$evaluate$2(WeaveExpressionLanguageSession.scala:101)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.doEvaluate(WeaveExpressionLanguageSession.scala:285)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:100)
at org.mule.runtime.core.internal.el.dataweave.DataWeaveExpressionLanguageAdaptor$1.evaluate(DataWeaveExpressionLanguageAdaptor.java:274)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluate(DefaultExpressionManagerSession.java:51)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluateBoolean(DefaultExpressionManagerSession.java:72)
at org.mule.runtime.core.internal.routing.ProcessorExpressionRoute.accepts(ProcessorExpressionRoute.java:34)
at org.mule.runtime.core.internal.routing.ExecutableRoute.shouldExecute(ExecutableRoute.java:41)
at org.mule.runtime.core.internal.routing.ChoiceRouter$SinkRouter.lambda$route$0(ChoiceRouter.java:161)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
at org.mule.runtime.core.internal.routing.ChoiceRouter$SinkRouter.route(ChoiceRouter.java:161)
at org.mule.runtime.core.api.util.func.CheckedConsumer.accept(CheckedConsumer.java:19)
at org.mule.runtime.core.api.rx.Exceptions.lambda$checkedConsumer$0(Exceptions.java:51)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:482)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:351)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:213)
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner.onNext(MonoFlatMapMany.java:242)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:496)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:180)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:351)
at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:447)
at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:534)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.mule.service.scheduler.internal.AbstractRunnableFutureDecorator.doRun(AbstractRunnableFutureDecorator.java:151)
at org.mule.service.scheduler.internal.RunnableFutureDecorator.run(RunnableFutureDecorator.java:54)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.mule.extension.sftp.api.SftpConnectionException: Error occurred while trying to connect to host
... 112 more
Caused by: org.mule.runtime.api.connection.ConnectionException:
at org.mule.extension.sftp.api.SftpConnectionException.<init>(SftpConnectionException.java:38)
... 112 more
Caused by: org.mule.runtime.api.connection.ConnectionException:
... 112 more
Caused by: 4:
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1540)
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1290)
at org.mule.extension.sftp.internal.connection.SftpClient.getFileContent(SftpClient.java:347)
... 110 more
Caused by: java.io.IOException: Pipe closed
at java.io.PipedInputStream.read(PipedInputStream.java:307)
at com.jcraft.jsch.Channel$MyPipedInputStream.updateReadSide(Channel.java:362)
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1311)
... 112 mor, while reading `comingData` as Json.
Trace:
at main (Unknown)
at org.mule.weave.v2.el.utils.ExceptionHandler$.handleLocatableException(ExceptionHandler.scala:24)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.doEvaluate(WeaveExpressionLanguageSession.scala:291)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:100)
at org.mule.runtime.core.internal.el.dataweave.DataWeaveExpressionLanguageAdaptor$1.evaluate(DataWeaveExpressionLanguageAdaptor.java:274)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluate(DefaultExpressionManagerSession.java:51)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluateBoolean(DefaultExpressionManagerSession.java:72)
at org.mule.runtime.core.internal.routing.ProcessorExpressionRoute.accepts(ProcessorExpressionRoute.java:34)
at org.mule.runtime.core.internal.routing.ExecutableRou... [truncated]
SFTP Configuration:
<sftp:config name="SFTP_Config" doc:name="SFTP Config" doc:id="81f37ff8-d629-4f64-ab2c-5632350b8fca">
    <sftp:connection host="sample.com" port="00" username="example1" password="111111" connectionTimeout="30" responseTimeout="30">
        <pooling-profile maxActive="10" maxIdle="10" maxWait="10" evictionCheckIntervalMillis="60000" minEvictionMillis="120000"/>
    </sftp:connection>
</sftp:config>
<flow name="prcsFiles" doc:id="55ac597a-378f-40b6-8041-df7ca8254ebe" maxConcurrency="1">
    <sftp:listener doc:name="On New or Updated File" doc:id="9a5d6eae-0fc6-46a5-be10-de98fb7ee16a" config-ref="SFTP_Config" directory="home/transaction/" outputMimeType="application/json" timeBetweenSizeCheckUnit="MILLISECONDS">
        <reconnect-forever/>
        <scheduling-strategy>
            <cron expression="0 15 10 ? * *" timeZone="UTC"/>
        </scheduling-strategy>
    </sftp:listener>
</flow>
I am using a cron expression that runs once daily. Has anyone come across this issue before?
Any thoughts will be appreciated.
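One detail in the configuration above may be worth checking (an observation, not a confirmed fix): the listener sets timeBetweenSizeCheckUnit but no timeBetweenSizeCheck value, so no size-stability wait is actually applied before a file is read. If the last file of a run is still being written when the listener picks it up, the read can fail mid-stream with a closed pipe. A sketch of the listener with the wait enabled (the 1000 ms value is an arbitrary example):
<sftp:listener doc:name="On New or Updated File" config-ref="SFTP_Config" directory="home/transaction/" outputMimeType="application/json" timeBetweenSizeCheck="1000" timeBetweenSizeCheckUnit="MILLISECONDS">
    <reconnect-forever/>
    <scheduling-strategy>
        <cron expression="0 15 10 ? * *" timeZone="UTC"/>
    </scheduling-strategy>
</sftp:listener>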

Spark 1.4.1 - Using pyspark

I tried the code below and I get an error.
Code
instances = sqlContext.sql("SELECT instance_id, instance_usage_code FROM ib_instances WHERE (instance_usage_code) = 'OUT_OF_ENTERPRISE'")
instances.write.format("orc").save("instances2")
hivectx.sql("""CREATE TABLE IF NOT EXISTS instances2 (instance_id string, instance_usage_code STRING)""")
hivectx.sql("LOAD DATA LOCAL INPATH '/home/hduser/instances2' INTO TABLE instances2")
Error
Traceback (most recent call last):
  File "/home/hduser/spark_script.py", line 57, in <module>
    instances.write.format("orc").save("instances2")
  File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 304, in save
  File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o55.save.
: java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.hive.orc.DefaultSource.createRelation(OrcRelation.scala:54)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:322)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
My guess is that you created a standard SQLContext instead of a Hive one (which adds a couple of options). Create your sqlContext as a HiveContext instance. The Scala version is:
val sqlContext = new HiveContext(sc)
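Since the question uses pyspark, the equivalent in Python for Spark 1.4 is:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # sc is the existing SparkContext
Any DataFrame written through this context can then use the ORC data source.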

Getting error message Unexpected character ' ' when running LOAD command in PIG

I have a file stored in HDFS at this path: /user/hdfs/countries
(the file is in comma-separated format).
To import this HDFS data into Pig, I ran the command below:
test = load ‘/ user/hdfs/countries’ using PigStorage(',') as (id:int, Name:chararray, Language:chararray);
where:
ID is the primary key column in the HDFS file
Name and Language are the column names in the HDFS file
I get the error below when I run the above Pig command:
Pig Stack Trace
ERROR 1200: <line 1, column 12> Unexpected character ''
Failed to parse: <line 1, column 12> Unexpected character ''
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:243)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:179)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Can someone please help me with this? Is my command incorrect, or is a jar file missing?
Thank you in advance!
It tells you exactly where the problem is: the ‘ should be replaced by ', which is not the same character.
Also, the space after the / seems fishy.
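For reference, the command from the question rewritten with straight quotes and without the extra space:
test = LOAD '/user/hdfs/countries' USING PigStorage(',') AS (id:int, Name:chararray, Language:chararray);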

Pig script fails with java.io.EOFException: Unexpected end of input stream

I have a Pig script that picks up a set of fields using a regular expression and stores the data to a Hive table.
--Load data
cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);
--There are two types of data, filter type1 - The field dst_country seems unique there
cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');
--Parse each line and pick up the required fields
cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
) AS (
rmonth:charArray, rdate:charArray, rtime:charArray, ip:charArray, date:charArray, time:charArray,
devname:charArray, deviceid:charArray, logid:charArray, type:charArray, subtype:charArray,
pri:charArray, vd:charArray, src:charArray, srcport:charArray, srcint:charArray, dst:charArray,
dstport:charArray, dstint:charArray, sn:charArray, status:charArray, policyid:charArray,
dstcountry:charArray, srccountry:charArray, rest:charArray );
--Store to hive table
STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();
The script works fine on a small file but breaks with the following exception on a bigger file (750 MB). Any idea how I can debug this and find the root cause?
2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
Check the size of the text you are loading into line:chararray. If the size is greater than the HDFS block size (64 MB), then you will get an error.
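If you want to check that from the grunt shell, here is a quick sketch using Pig's built-in SIZE function (the relation name comes from the script above; note that SIZE of a chararray returns its character count):
lineSizes = FOREACH cisoFortiGateDataAll GENERATE SIZE(line) AS len:long;
longLines = FILTER lineSizes BY len > 67108864L; -- the 64 MB threshold mentioned above
DUMP longLines;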

pig file load error

I am trying to run this command in the Pig environment:
grunt> A = LOAD inp;
But I am getting this error in the log files:
Pig Stack Trace:
ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Failed to parse: mismatched input 'inp' expecting QUOTEDSTRING
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1565)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
And in the console I am getting this:
grunt> A = LOAD inp;
2012-10-26 12:18:34,627 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Details at logfile: /usr/local/hadoop/pig_1351232517175.log
Can anybody provide an appropriate solution for this?
The LOAD syntax has been used incorrectly. Check out the correct example provided here:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Example from http://pig.apache.org/docs
I believe the error log is self-explanatory: it says expecting QUOTEDSTRING.
Please put the file name in single quotes to solve this issue.
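Applied to the command from the question (assuming inp is meant to be a path relative to your HDFS home directory):
A = LOAD 'inp';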