I am trying to upload a huge file (more than 5 GB) using a Struts 1.2 FormFile and Apache commons-fileupload 1.0. I saw that the maximum limit for file upload in Struts 1 is 256M. Is there any way to change this?
I am getting the exception below.
org.apache.commons.fileupload.FileUploadBase$UnknownSizeException: the request was rejected because its size is unknown
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:305)
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:268)
at org.apache.struts.upload.CommonsMultipartRequestHandler.handleRequest(CommonsMultipartRequestHandler.java:182)
at org.apache.struts.util.RequestUtils.populate(RequestUtils.java:389)
at org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:191)
at org.apache.struts.action.ActionServlet.process(ActionServlet.java:1858)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:745)
By default the file size limit is 250 MB. I increased it to 350 MB and it worked fine.
After that I increased it to 10 GB and got the exception shown above.
Is it possible to upload a huge file using Struts 1.2? Is there any other way to upload a huge file?
Configure the maximum limit on the controller element in struts-config.xml:

<controller processorClass="your class" nocache="true" locale="true"
            contentType="text/html;charset=UTF-8" maxFileSize="15G"/>
When you are using commons-fileupload 1.1 there is a constraint that FileUpload refuses to parse requests of unknown length. In the succeeding version a streaming API was introduced to overcome the size issue.
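For illustration, here is a minimal, hedged sketch of that streaming API. The servlet context, target path and error handling are placeholders, and note that this bypasses Struts' normal form-bean population, so it would sit in your own servlet or handler rather than inside the Struts upload handler:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;

public class StreamingUploadSketch {

    // Reads each part of the multipart request as a stream, so the file is
    // never buffered in memory and no overall request size has to be known up front.
    public void handle(HttpServletRequest request) throws Exception {
        ServletFileUpload upload = new ServletFileUpload();
        FileItemIterator iter = upload.getItemIterator(request);
        while (iter.hasNext()) {
            FileItemStream item = iter.next();
            if (!item.isFormField()) {
                // "/tmp/upload.bin" is just an example target; write wherever you need.
                try (InputStream in = item.openStream();
                     OutputStream out = new FileOutputStream("/tmp/upload.bin")) {
                    Streams.copy(in, out, false); // copy the bytes; try-with-resources closes the streams
                }
            }
        }
    }
}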
The following issues were reported against commons-fileupload 1.2 and 1.3:
1. After uploading, the temp file is not removed.
2. The input stream is not closed, which leads to a memory leak.
References: "refuses parsing request of unknown length", "Memory leak when stream is not closed", "Memory leak".
Hope it helps.
We have a data frame which we want to write to S3 in Parquet format and in overwrite mode.
Every time we write the dataframe, it is always to a new folder. The code that writes to the S3 location is as follows:
df.write
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
.option("maxRecordsPerFile", maxRecordsPerFile)
.mode("overwrite")
.format(format)
.save(output)
What we observe is that at times we get a FileNotFoundException (full trace below). Can somebody help me understand:
when I am writing to a new S3 location (meaning nobody is reading from that location), why does the writing program throw the exception below?
how to fix it? I see a couple of Stack Overflow posts pointing to this exception, but they say it happens when you try to read while a write is happening. My case is not like that; I don't read while the write happens.
My Spark is 2.3.2, on EMR-5.18.1; the code is written in Scala.
I am using s3:// as the output folder path. Should I change it to s3n:// or s3a://? Will that help?
Caused by: java.io.FileNotFoundException: No such file or directory 's3://BUCKET/snapshots/FOLDER/_bid_9223370368440344985/part-00020-693dfbcb-74e9-45b0-b892-0b19fa92365c-c000.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:104)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
I was finally able to solve the problem.
The df: DataFrame was created from the same S3 folder to which it was being written in overwrite mode.
So during the overwrite the source folder was getting cleared, which resulted in the error.
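To make that concrete, here is a minimal Java sketch of the pattern (the question's code is Scala, but the behaviour is the same). The paths are placeholders, and the checkpoint-based workaround is my own suggestion rather than something from the original post:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OverwriteSameFolderSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("overwrite-sketch").getOrCreate();
        String path = "s3://your-bucket/snapshots/folder/"; // placeholder path

        // The DataFrame is lazily backed by the very files we are about to overwrite...
        Dataset<Row> df = spark.read().parquet(path);

        // ...so mode("overwrite") deletes those files before the job reading them finishes,
        // which surfaces as the FileNotFoundException above.
        // One possible workaround: materialize the data first (eager checkpoint), then write back.
        spark.sparkContext().setCheckpointDir("s3://your-bucket/checkpoints/"); // placeholder
        Dataset<Row> materialized = df.checkpoint(); // eager by default: computes and stores the data

        materialized.write().mode("overwrite").parquet(path);
        spark.stop();
    }
}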
Hope this helps somebody.
I have a PySpark dataframe whose dimensions are (28002528, 21), and I tried to convert it to a pandas dataframe using the following line of code:
pd_df=spark_df.toPandas()
I got this error:
First part:
Py4JJavaError: An error occurred while calling o170.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 39.0 failed 1 times, most recent failure: Lost task 3.0 in stage 39.0 (TID 89, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
...
...
Second part:
Exception happened during processing of request from ('127.0.0.1', 56842)
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:56657)
Traceback (most recent call last):
...
...
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
...
...
I also tried to take a sample of the original PySpark dataframe:
sample_pd_df = spark_df.sample(0.05).toPandas()
and I got an error that looks like only the first part of the previous error.
You get java.lang.OutOfMemoryError, which probably means that you are trying to load all the data into a single node which doesn't have enough RAM to handle the entire DataFrame. If you are using a cloud provider such as Databricks, try increasing the amount of cluster RAM.
What toPandas() does is collect the whole dataframe into a single node (as explained in #ulmefors's answer).
More specifically, it collects it to the driver. The specific option you should be fine-tuning is spark.driver.memory; increase it accordingly.
Otherwise, if you're planning on doing further transformations on this (rather large) pandas dataframe, consider doing them in PySpark first and then collecting the (smaller) result into the driver, which will hopefully fit in memory.
More details are available in the Spark configuration documentation.
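The question uses PySpark, but the idea is language-independent. Below is a minimal Java sketch of the "aggregate on the cluster first, collect only the small result" approach; the path and column name are made up for illustration:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CollectSmallResultSketch {
    public static void main(String[] args) {
        // Note: spark.driver.memory must be set before the driver JVM starts,
        // e.g. spark-submit --driver-memory 8g, or in spark-defaults.conf;
        // setting it programmatically here has no effect in client mode.
        SparkSession spark = SparkSession.builder().appName("collect-sketch").getOrCreate();

        Dataset<Row> df = spark.read().parquet("s3://your-bucket/big-table/"); // placeholder input

        // Heavy work stays distributed; only the (small) aggregate is pulled to the driver.
        Dataset<Row> summary = df.groupBy("some_key").count(); // "some_key" is a placeholder column
        List<Row> rows = summary.collectAsList();

        System.out.println("Collected " + rows.size() + " summary rows");
        spark.stop();
    }
}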
I have an RDF file that can be imported without any issues into another RDF store (Stardog), but it keeps failing in GraphDB with this error:
15:58:18.900 [import-task-3-thread-1] ERROR c.o.f.i.MultipartFileImportRunnableTask - Could not import file
java.lang.NullPointerException: null
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at org.eclipse.rdf4j.common.lang.service.ServiceRegistry.get(ServiceRegistry.java:95)
at org.eclipse.rdf4j.rio.Rio.createParser(Rio.java:100)
at org.eclipse.rdf4j.rio.Rio.createParser(Rio.java:118)
at org.eclipse.rdf4j.repository.util.RDFLoader.loadInputStreamOrReader(RDFLoader.java:279)
at org.eclipse.rdf4j.repository.util.RDFLoader.load(RDFLoader.java:197)
at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.add(AbstractRepositoryConnection.java:329)
at com.ontotext.trree.monitorRepository.MonitorRepositoryConnection.add(MonitorRepositoryConnection.java:159)
at com.ontotext.trree.parallel.ParallelRDFLoader.add(ParallelRDFLoader.java:125)
at com.ontotext.forest.impex.ParallelAwareImporter.lambda$add$3(ParallelAwareImporter.java:48)
at com.ontotext.forest.impex.ParallelAwareImporter.wrapInBeginCommit(ParallelAwareImporter.java:66)
at com.ontotext.forest.impex.ParallelAwareImporter.add(ParallelAwareImporter.java:48)
at com.ontotext.forest.impex.MultipartFileImportRunnableTask.load(MultipartFileImportRunnableTask.java:38)
at com.ontotext.forest.impex.ImportRunnableTask.run(ImportRunnableTask.java:80)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The file can be found here: http://boetik-artistik.be/humidity_by_city.owls
All referenced ontologies are resolvable from my machine.
Thanks for your help.
Kind regards,
Johan
I have just tried this out myself on GraphDB 8.3.1. I got a similar error when I allowed GraphDB to auto-detect the import format. However, when I selected the format "RDF/XML", it imported without a problem.
The problem is the file extension: it should be .rdf rather than .owls.
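For completeness, the same thing can be seen programmatically with the RDF4J 2.x API that GraphDB is built on: format auto-detection by file name gives nothing for .owls, while passing the format explicitly works. A rough sketch (repository setup, base URI and error handling are placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

public class ExplicitFormatImportSketch {
    public static void load(Repository repo, String path) throws Exception {
        // For "humidity_by_city.owls" this should be empty, since ".owls" is not a
        // registered file extension -- which is why auto-detection fails.
        System.out.println(Rio.getParserFormatForFileName(path));

        try (RepositoryConnection conn = repo.getConnection();
             InputStream in = new FileInputStream(path)) {
            conn.begin();
            // Passing RDFFormat.RDFXML explicitly skips the detection step entirely.
            conn.add(in, "http://example.org/base", RDFFormat.RDFXML);
            conn.commit();
        }
    }
}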
Our application is a Java-based desktop application which downloads binary data from a source, parses it, and adds it to an HSQLDB database. When downloading from the sources individually, the application works perfectly. But when doing the same from multiple sources simultaneously, with each source in its own thread, I get the error
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 23 in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
or sometimes,
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 1016 in statement [CHECKPOINT]
followed by
java.sql.SQLException: File input/output error: C:\ProgramData\test\data\database\db.script.new in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
Java: 1.8
HSQLDB version: 1.8.10
We are not in a position to migrate HSQLDB to the latest version for various reasons.
HSQL Properties:
hsqldb.script_format=0
runtime.gc_interval=0
sql.enforce_strict_size=false
hsqldb.cache_size_scale=8
readonly=false
hsqldb.nio_data_file=true
hsqldb.cache_scale=14
version=1.8.0
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
hsqldb.log_size=200
modified=yes
hsqldb.cache_version=1.7.0
hsqldb.original_version=1.8.0
hsqldb.compatible_version=1.8.0
Any help or hint will be appreciated.
This is a 7-year-old version which is not ideal for multi-threaded usage.
The simple solution is to perform the database updates with a single thread. You can retrofit your multi-threaded application with a synchronized block over a singleton object around the code that performs the database update.
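A minimal sketch of that suggestion, assuming the download/parse threads stay parallel and only the database update is funnelled through one lock (class name, method and SQL are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;

public final class DbWriter {

    // Single shared lock object: only one thread executes an update against HSQLDB
    // at a time, so statements (including the internal CHECKPOINT) never run concurrently.
    private static final Object DB_LOCK = new Object();

    public static void insertParsedRecord(Connection conn, String value) throws Exception {
        synchronized (DB_LOCK) {
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO parsed_data (value) VALUES (?)")) {
                ps.setString(1, value);
                ps.executeUpdate();
            }
        }
    }
}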
When I index data from a MySQL database into an Apache Solr server running under Tomcat 6 on port 8180, I receive a 400 Bad Request error. Investigating the Tomcat 6 server logs, the following is the exception:
INFO: {add=[(null)]} 0 1
Jan 25, 2012 3:37:46 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=null] unknown field 'job_id'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:331)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:662)
Please tell me a solution to this.
Thanks
Your index is defined by a schema.xml file, where all the fields you want to index appear. However, you are trying to add a Solr document with a field named job_id. This field is NOT IN YOUR SCHEMA. Either add this field to the schema or remove it from the document.
Look around "job_id"; it does not exist where you think it is or should be:
ERROR: [doc=null] unknown field 'job_id'
Yes, either define the schema or use ElasticSearch :)
You should look at dynamicField in schema.xml. See example at http://wiki.apache.org/solr/SchemaXml
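For example, a field definition (or a dynamicField pattern covering *_id) in schema.xml could look like this; the type and attributes below are illustrative, so adapt them to your own schema:

<!-- explicit field -->
<field name="job_id" type="string" indexed="true" stored="true"/>

<!-- or a dynamic field that matches any *_id field -->
<dynamicField name="*_id" type="string" indexed="true" stored="true"/>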