We have a DataFrame that we want to write to S3 in Parquet format, in overwrite mode.
Every time we write the DataFrame, it goes to a new folder. The code that writes to the S3 location is as follows:
df.write
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
.option("maxRecordsPerFile", maxRecordsPerFile)
.mode("overwrite")
.format(format)
.save(output)
What we observe is that at times we get a FileNotFoundException (full trace below). Can somebody help me understand:
When I am writing to a new S3 location (meaning nobody is reading from that location), why does the writing program throw the exception below?
How do I fix it? I see a couple of Stack Overflow posts pointing to this exception, but they say it happens when you read while a write is in progress. My case is not like that: I don't read while the write is happening.
My Spark version is 2.3.2, on EMR-5.18.1; the code is written in Scala.
I am using s3:// as the output folder path. Should I change it to s3n:// or s3a://? Will that help?
Caused by: java.io.FileNotFoundException: No such file or directory 's3://BUCKET/snapshots/FOLDER/_bid_9223370368440344985/part-00020-693dfbcb-74e9-45b0-b892-0b19fa92365c-c000.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:104)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
I was finally able to solve the problem.
The df: DataFrame was built from the same S3 folder to which it was being written in overwrite mode.
So during the overwrite, the source folder was being cleared, which resulted in the error.
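For anyone who hits the same thing, a minimal sketch of one workaround in Scala (the paths and the _staging suffix are hypothetical placeholders, not part of my actual job): because the DataFrame reads lazily from the source folder, never overwrite that same folder directly; stage the result elsewhere first and only then overwrite the original.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("snapshot-writer").getOrCreate()

// The DataFrame is lazy: it still points at the source files under FOLDER.
val df = spark.read.parquet("s3://BUCKET/snapshots/FOLDER")

// Write the result to a separate staging prefix first, so the source files
// under FOLDER stay intact while the DataFrame is being computed...
df.write.mode("overwrite").parquet("s3://BUCKET/snapshots/FOLDER_staging")

// ...then read the staged copy back and overwrite the original folder;
// at this point nothing is reading from FOLDER while it is being cleared.
spark.read.parquet("s3://BUCKET/snapshots/FOLDER_staging")
  .write.mode("overwrite").parquet("s3://BUCKET/snapshots/FOLDER")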
Hope this helps somebody.
Snappy job written in Scala aborts with exception:
java.lang.ClassCastException: com.....$Class1 cannot be cast to com.....$Class1.
Class1 is a custom class that is stored in the RDD. The interesting thing is that this error is thrown while casting to the same class. So far, no pattern has been found.
In the job, we fetch data from HBase, enrich it with analytical metadata using DataFrames, and push it to a table in SnappyData. We are using SnappyData 1.2.0.1.
Not sure why this is happening.
Below is the stack trace:
Job aborted due to stage failure: Task 76 in stage 42.0 failed 4 times, most recent failure: Lost task 76.3 in stage 42.0 (TID 3550, HostName, executor XX.XX.x.xxx(10360):7872): java.lang.ClassCastException: cannot be cast to
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:86)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$2.hasNext(WholeStageCodegenExec.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.hasNext(WholeStageCodegenExec.scala:514)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:233)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1006)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:997)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.computeInternal(WholeStageCodegenExec.scala:557)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.(WholeStageCodegenExec.scala:504)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:503)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:103)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:126)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:326)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.spark.executor.SnappyExecutor$$anon$2$$anon$3.run(SnappyExecutor.scala:57)
at java.lang.Thread.run(Thread.java:748)
Classes are not unique by name. They're unique by name + classloader.
ClassCastException of the kind you're seeing happens when you pass data between parts of the app where one or both parts are loaded in a separate classloader.
You might need to clean up your classpath, you might need to resolve the classes from the same classloader, or you might have to serialize the data (especially if you have features that rely on reloading code at runtime).
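A stand-alone sketch of that rule, with made-up names (com.example.Class1 and the isolated-classes directory are purely illustrative): the same class file loaded through two isolated classloaders yields two distinct runtime classes, so an instance created via one loader is not an instance of the "same" class from the other, which is exactly the cast that blows up.

import java.io.File
import java.net.URLClassLoader

object ClassLoaderDemo {
  def main(args: Array[String]): Unit = {
    // Assume com.example.Class1 is compiled under ./isolated-classes
    // and is NOT on the application classpath.
    val classpath = Array(new File("isolated-classes").toURI.toURL)

    // parent = null, so each loader resolves Class1 itself instead of delegating upward.
    val loaderA = new URLClassLoader(classpath, null)
    val loaderB = new URLClassLoader(classpath, null)

    val classA = loaderA.loadClass("com.example.Class1")
    val classB = loaderB.loadClass("com.example.Class1")

    println(classA == classB)            // false: same name, different Class objects
    val instance = classA.getDeclaredConstructor().newInstance()
    println(classB.isInstance(instance)) // false: casting it to loaderB's Class1 throws ClassCastException
  }
}

In a Spark/Snappy job this typically shows up when the same class ends up both on the cluster classpath and inside the job's own jar, so it gets loaded from two places.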
I am following this WordCount example using the Google BigQuery-Hadoop connector:
https://developers.google.com/hadoop/writing-with-bigquery-connector#completecode
The example works fine as it is.
To test an array in the output schema, I altered just one line in the code, adding an array object definition to the output schema:
String outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Number','type': 'INTEGER'},{'name':'Persons','mode':'REPEATED','type':'RECORD','fields':[{'name': 'name','type': 'STRING'},{'name': 'age','type': 'INTEGER'}]}]";
Now when I run the WordCount example, it gives this exception:
java.lang.IllegalStateException
at com.google.gson.JsonArray.getAsString(JsonArray.java:133)
at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.getSchemaFromString(BigQueryUtils.java:97)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getRecordWriter(BigQueryOutputFormat.java:121)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:568)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Does anyone know what the issue is?
Thank you
This is actually a bug in the current version of the BigQuery connector which prevents it from supporting inner records with more than 1 field.
We have a fix internally and it's slated to go out with the next release (0.4.3), which may still be a couple of weeks out; if you'd like to help try out a staging build, feel free to reach out to gcp-hadoop-contact@google.com and we can provide instructions.
I'm using the current Cloudera Quick Start VM. I've created a Hive table with some data. Then I created an external table using the Hive HBase storage handler. I was able to populate the HBase table. However, while querying the Hive/HBase table, I got the following error (NullPointerException):
14/04/16 01:18:51 ERROR security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:BeeswaxException(message:java.io.IOException: java.lang.NullPointerException, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, handle:QueryHandle(id:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11), SQLState: )
14/04/16 01:18:51 ERROR beeswax.BeeswaxServiceImpl: Caught BeeswaxException
BeeswaxException(message:java.io.IOException: java.lang.NullPointerException, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, handle:QueryHandle(id:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11), SQLState: )
at com.cloudera.beeswax.BeeswaxServiceImpl$RunningQueryState.fetch(BeeswaxServiceImpl.java:545)
at com.cloudera.beeswax.BeeswaxServiceImpl$5.run(BeeswaxServiceImpl.java:986)
at com.cloudera.beeswax.BeeswaxServiceImpl$5.run(BeeswaxServiceImpl.java:981)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at com.cloudera.beeswax.BeeswaxServiceImpl.doWithState(BeeswaxServiceImpl.java:772)
at com.cloudera.beeswax.BeeswaxServiceImpl.fetch(BeeswaxServiceImpl.java:980)
at com.cloudera.beeswax.api.BeeswaxService$Processor$fetch.getResult(BeeswaxService.java:987)
at com.cloudera.beeswax.api.BeeswaxService$Processor$fetch.getResult(BeeswaxService.java:971)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:244)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
I embedded the Guava, ZooKeeper, HBase, and hive-hbase-handler JARs. I followed the instructions in this tutorial: http://www.n10k.com/blog/hbase-via-hive-pt2/
I am using the current Cloudera Quick Start VM. The JobTracker and TaskTracker logs, as well as the Beeswax logs, tell me nothing.
Do you have any ideas about what I am doing wrong?
I am thankful for any advice!
Best regards, Lena
This is the solution:
Nullpointer exception in HBase MapReduce
The logs were misleading (for me). HBase or Hive was not able to resolve the NameNode.
I am using the command below to load the file. When I try to DUMP or ILLUSTRATE the loaded data, it fails with the error below. I have checked the sanity of the data: each line contains the correct number of delimiters, but when a field is empty the next delimiter immediately follows. I tried loading the single sample line below, and it does not work.
hs_2_inr = LOAD 'hs_2_inr.dat' USING PigStorage('^') as ( year:chararray, country:chararray, s_no:chararray, hs_8:chararray, hs_8_desc:chararray, prevyr_inr:chararray, curyr_inr:chararray, growth:chararray, dummy:chararray);
Here is the sample data
1997^BOTSWANA^1.^10063001^*RICE PARBOILED^^2.43^^
Below is the exception
2013-06-30 21:02:23,015 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:84)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:739)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:626)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:323)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
2013-06-30 21:02:23,016 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception
So how do I load a file with empty fields in Pig?
Your code works fine. As you mentioned in your comment, ILLUSTRATE was your problem. Per the docs, ILLUSTRATE wasn't being maintained for a while. Do not rely on it. You shouldn't need it in any non-diagnostic code anyway. Use DESCRIBE instead.
In newer Pig versions, the warning on ILLUSTRATE seems to have gone away, so it may be safe again, but I'd still rely more heavily on DESCRIBE to avoid a source of potential issues. In Pig 0.10, which I'm on, ILLUSTRATE still gave me the same error you received.
I am trying to upload a file to the database as a blob, during which the file upload program writes a temporary file to disk.
I'm using the WSO2 Stratos application server (based on Tomcat), which prevents such temp files from being written to disk for security reasons. I have attached the stack trace of the error.
I'm using the Apache Commons FileUpload library. Here is my upload class: http://paste.org/47685 and the error is thrown from line 57. I need to avoid writing temp files. How can I solve this problem?
This is my error log:
java.security.AccessControlException: access denied (java.io.FilePermission F:\WSO2ST~1.2\bin\..\tmp\upload_4e2fd9dc_1368bb5a330__7ffa_00000002.tmp write)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:323)
at java.security.AccessController.checkPermission(AccessController.java:546)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at java.lang.SecurityManager.checkWrite(SecurityManager.java:962)
at java.io.FileOutputStream.<init>(FileOutputStream.java:169)
at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
at org.apache.commons.io.output.DeferredFileOutputStream.thresholdReached(DeferredFileOutputStream.java:178)
at org.apache.commons.io.output.ThresholdingOutputStream.checkThreshold(ThresholdingOutputStream.java:224)
at org.apache.commons.io.output.ThresholdingOutputStream.write(ThresholdingOutputStream.java:128)
at org.apache.commons.fileupload.util.Streams.copy(Streams.java:103)
at org.apache.commons.fileupload.util.Streams.copy(Streams.java:66)
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:366)
at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at controler.UploadDocumentServlet.doPost(UploadDocumentServlet.java:62)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:273)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:305)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:165)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:298)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:57)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:189)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:164)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at org.wso2.carbon.server.CarbonStuckThreadDetectionValve.invoke(CarbonStuckThreadDetectionValve.java:154)
at org.wso2.carbon.server.TomcatServer$1.invoke(TomcatServer.java:254)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:563)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:399)
at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:396)
at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:356)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1534)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:705)
So you don't have permission to write to F:\WSO2ST~1.2\tmp\upload_4e2fd9dc_1368bb5a330__7ffa_00000002.tmp. Do you have permission to write to any directory on the file system? (If that tmp folder doesn't exist, that might be your problem.)
If so, you just need to configure the tmp directory of your factory to be a directory you can write to (there should be a tmp folder for the active user where you can store files, like C:\Documents and Settings\MyUser\Temp, or something like that):
DiskFileItemFactory.setRepository(File)
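For example, a short sketch in Scala (using java.io.tmpdir as an assumed writable location; pick whatever directory your security policy actually allows) of pointing the factory's temp-file repository at a directory the server may write to:

import java.io.File
import org.apache.commons.fileupload.disk.DiskFileItemFactory
import org.apache.commons.fileupload.servlet.ServletFileUpload

object UploadConfig {
  // Build a ServletFileUpload whose spilled temp files go to a writable directory.
  def buildUpload(): ServletFileUpload = {
    val factory = new DiskFileItemFactory()
    // "java.io.tmpdir" normally resolves to the current user's temp folder,
    // instead of the server's protected <WSO2>\tmp location.
    factory.setRepository(new File(System.getProperty("java.io.tmpdir")))
    new ServletFileUpload(factory)
  }
}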
I figured out how to solve this temp file issue.
By default, the DiskFileItemFactory() size threshold is 10,240 bytes; if a file exceeds that amount, a temp file is created to store it. That was the bug: my files are larger than 10,240 bytes. Increasing the size threshold of the factory object solves the problem. Refer to this link:
http://www.techiepark.com/tutorials/file-upload-using-java
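A minimal sketch of that fix in Scala (the 50 MB threshold is an arbitrary example value, not a recommendation): raise the factory's in-memory threshold above your largest expected upload so items are never spilled to a temp file.

import org.apache.commons.fileupload.disk.DiskFileItemFactory
import org.apache.commons.fileupload.servlet.ServletFileUpload

object InMemoryUpload {
  def buildUpload(): ServletFileUpload = {
    val factory = new DiskFileItemFactory()
    // Default threshold is 10,240 bytes; anything larger is spilled to a temp file on disk.
    // Keep items up to 50 MB entirely in memory so no temp file is ever written.
    factory.setSizeThreshold(50 * 1024 * 1024)
    new ServletFileUpload(factory)
  }
}

The trade-off is heap usage: every in-flight upload below the threshold is held completely in memory, so size the threshold with your concurrent upload volume in mind.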