iText OutOfMemoryError while attempting to count the number of pages in a PDF file

I'm trying to execute the following code:
PdfReader reader = new PdfReader("/path/to/file.pdf");
int pages = reader.getNumberOfPages();
It works on most files, but on one particular file it crashes with the following error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuffer.append(StringBuffer.java:320)
at com.itextpdf.text.pdf.PRTokeniser.readString(PRTokeniser.java:158)
at com.itextpdf.text.pdf.PRTokeniser.getStartxref(PRTokeniser.java:224)
at com.itextpdf.text.pdf.PRTokeniser.getStartxref(PRTokeniser.java:229)
...goes on for a while
at com.itextpdf.text.pdf.PRTokeniser.getStartxref(PRTokeniser.java:229)
I know that something is wrong with the input file. I'm just wondering whether there's a way of knowing, before making the method call, that the file is going to cause a problem.

It turns out it was a bug in the version of iText I am using (5.0.1). I logged a query with the developers, and they put in a fix, which I tested and which will hopefully find its way into the next version (5.0.2).
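On the original question of checking a file before handing it to the reader: the stack trace shows the tokeniser searching for the `startxref` keyword, which in a well-formed PDF sits near the end of the file. A rough pre-check, sketched here in Python, could look for that keyword in the file's tail. The function name `has_startxref` and the 1024-byte window are assumptions made for illustration, and nothing in the thread confirms that this check would have flagged the file in question.

```python
import os


def has_startxref(path, tail_size=1024):
    """Return True if the 'startxref' keyword appears near the end of the file.

    Hypothetical helper: a heuristic only, not a full PDF validation.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read only the last tail_size bytes (or the whole file if it is smaller).
        f.seek(max(0, size - tail_size))
        tail = f.read()
    return b"startxref" in tail
```

A file that fails this check is probably truncated or not a PDF at all. A file that passes can still be damaged in other ways.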

Related

Syncfusion.Pdf.PdfException: "Could not find valid signature (%PDF-)."

string docuAddr = @"C:\Users\psimmon\source\repos\PDFTESTAPP\PDFTESTAPP\TempForms\forms-www.courts.state.co.us-Forms-PDF-JDF1117.pdf";
byte[] bytes = Encoding.Unicode.GetBytes(docuAddr);
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(bytes, true); // blows up here
PdfLoadedForm myForm = loadedDocument.Form;
PdfLoadedFormFieldCollection fields = myForm.Fields;
I'm not sure what I have done wrong here. The PDF file opens fine, either in a browser or in a File Explorer window, so the problem has to be on my side. I guessed at most of this code, so I could use your gray matter. Forgive my stupidity.
The reported exception “Could not find valid signature (%PDF-)” may occur when the file is not a PDF document. We suspect that a file in another format was saved with the “.pdf” extension. We cannot open or repair this type of document on our end; the details are already covered in our documentation.
The following page lists some of the corruption error messages for documents that cannot be repaired:
UG: https://help.syncfusion.com/file-formats/pdf/open-and-save-pdf-file-in-c-sharp-vb-net#possible-error-messages-of-invalid-pdf-documents-while-loading
If you want to detect this type of corrupted document, the Syncfusion PDF Library can check and report whether an existing PDF document is corrupted, along with corruption details and structure-level syntax errors.
UG: https://help.syncfusion.com/file-formats/pdf/working-with-document#find-corrupted-pdf-document
Blog: https://www.syncfusion.com/blogs/post/how-to-find-corrupted-pdf-files-in-c-sharp.aspx
KB: https://www.syncfusion.com/kb/9686/how-to-identify-the-corrupted-pdf-document-using-c-and-vb-net
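The signature check behind this exception can be illustrated with a small Python sketch. The helper name `looks_like_pdf` and the 1024-byte search window are assumptions for illustration, not Syncfusion's implementation. It also shows why the question's snippet appears to fail: `Encoding.Unicode.GetBytes(docuAddr)` encodes the text of the path as UTF-16 rather than reading the file's contents, so the resulting bytes contain no `%PDF-` marker.

```python
def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the '%PDF-' signature appears near the start of the data.

    Hypothetical helper: a heuristic check, not a full validation.
    """
    return b"%PDF-" in data[:search_window]


# Bytes of the path text itself (UTF-16 LE, like .NET's Encoding.Unicode).
path_bytes = "C:\\TempForms\\example.pdf".encode("utf-16-le")

# Bytes as they might appear at the start of a real PDF file.
file_bytes = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

path_is_pdf = looks_like_pdf(path_bytes)  # False: this is only a file name
file_is_pdf = looks_like_pdf(file_bytes)  # True: the signature is present
```

In the C# code, reading the file's bytes (for example with `File.ReadAllBytes`) instead of encoding the path would be the likely fix, though the thread does not confirm it.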

Programmatically rename file in ocrmypdf

I'm a new programmer and I'm making a first attempt at a larger data science project. To do this I have made a class that is supposed to open PDFs with ocrmypdf and then uses a while statement to walk through all the documents in a folder.
class DocumentReader:
    # This class is used to open and read a
    # document using OCR and then
    # creating the document in its place
    def __init__(self, file):
        self.file = file
    def convert(self):
        ocrmypdf.ocr(self.file, new_doc, deskew=True)
and here is the while statement:
count = 0
while count < final:
    for file in os.listdir('PayStubs'):
        if file.endswith(".pdf"):
            index = str(file).find('.pdf')
            new_doc = file[:index] + '_new' + file[index:]
            d1 = DocumentReader(file)
            d1.convert()
I can make each of the pieces work if I run them individually, but the '.pdf' extension is what trips me up when I try to run them programmatically. Does anyone know how to create a new file name programmatically for the second argument in the ocrmypdf command?
I have tried several different ways of making this work but I keep getting errors. The most common errors that my attempts have yielded are:
InputFileError: File not found - 20070928ch6495.pdf.pdf
and
IsADirectoryError: [Errno 21] Is a directory: '_new/'
I'm at the point where I'm running in circles. Any help would be greatly appreciated. Thanks!
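One way to build the output name is with `pathlib`, which splits the stem from the extension and so avoids doubled or misplaced `.pdf` suffixes. This is a minimal sketch, not a confirmed fix for the errors above. The helper names `output_path_for` and `convert_folder` are made up for illustration, and the sketch assumes the "File not found" error comes from `os.listdir` returning bare file names without the `PayStubs` folder in front.

```python
from pathlib import Path


def output_path_for(pdf_path):
    """Build '<name>_new.pdf' in the same folder as the input file."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + "_new" + pdf_path.suffix)


def convert_folder(folder):
    """Run OCR on every PDF in the folder, writing '<name>_new.pdf' beside it."""
    # Imported here so the naming helper can be used without ocrmypdf installed.
    import ocrmypdf

    # sorted() materialises the listing before any new files are created.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        if pdf_path.stem.endswith("_new"):
            continue  # skip outputs left over from an earlier run
        ocrmypdf.ocr(pdf_path, output_path_for(pdf_path), deskew=True)
```

Calling `convert_folder('PayStubs')` would then process each file once, with no counter or `while` loop needed. `Path.glob` yields paths that already include the folder, so the input and output arguments both point inside `PayStubs`.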

How to use very old iText (under 0.99) to create bookmarks / outlines?

May I know how to use old iText (a very old version under 0.99, package path = com.lowagie.xxx) to create bookmarks that jump to a location inside the same PDF?
Something like this API in the newer iText jar:
PdfOutline outoline2 = com.itextpdf.pdf.PdfAction.gotoLocalPage("destinationName", false)
We found the code below for creating a bookmark, but the old iText needs a file name (see outFileName in the code below). What we want is a jump within the same PDF (not to a remote PDF):
olineSignature = new PdfOutline(root, new PdfAction(outFileName, "Signature2TxtDestination"), "Signature2TxtOutline");
FYI, we don't know the page number in advance, so we cannot use the old API PdfAction.gotoLocalPage(int, PdfDestination, PdfWriter).
Can anybody help me? Thanks. @Bruno Lowagie, @itext :)
We are in the process of upgrading to the new iText (iText 5+), but for now we have a request to create bookmarks (using old iText) so that others can retrieve them.
My memory can't go that far back but local destinations are most probably not supported. Your only chance is to do an interim upgrade to the Jurassic 2.1.7 that should be more or less compatible with that Pleistocene 0.99.

Can't replace mongo document

I am attempting to save documents to a MongoDB cluster (sharded replica sets) and am having a strange issue. I am using pymongo 2.7.2 and TokuMX 1.5 (mongodb 2.4.10).
When I attempt to save (overwrite) existing documents I am getting an exception that looks like the document I am saving is too large:
doc = db.collection.find_one()
db.collection.save(doc)
pymongo.errors.OperationFailure: BSONObj size: 18798961 (0x71D91E01) is invalid. Size must be between 0 and 16793600(16MB) First element: op: "u"
However this works fine:
doc = db.collection.find_one()
db.collection.remove({'_id': doc['_id']})
db.collection.save(doc)
The document in question is about 9 MB, so it looks like replacing the document somehow doubles its size, exceeding the 16 MB limit.
Any ideas as to what could cause this behavior?
Apparently this is a known issue with TokuMX. Oplog entries are twice the size of the document, so replacing a 9 MB document results in an 18 MB oplog entry, which raises the exception.
The solution would be to limit document writes to less than 8 MB so that oplog entries never exceed 16 MB.
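The arithmetic behind that limit can be sketched in Python. The limit of 16793600 bytes is the one quoted in the error message, and the factor of 2 is the doubling described above. The function name `oplog_entry_fits` is hypothetical, and the real oplog entry carries some extra overhead that this estimate ignores.

```python
MAX_BSON_SIZE = 16793600  # limit quoted in the error message, in bytes


def oplog_entry_fits(doc_size_bytes, factor=2):
    """Estimate whether replacing a document of this size stays under the limit.

    Assumes the oplog entry is roughly `factor` times the document size.
    """
    return doc_size_bytes * factor <= MAX_BSON_SIZE


nine_mb_ok = oplog_entry_fits(9 * 1024 * 1024)   # False: 18874368 > 16793600
eight_mb_ok = oplog_entry_fits(8 * 1024 * 1024)  # True: 16777216 <= 16793600
```

Under this estimate the largest replaceable document is 16793600 // 2 = 8396800 bytes, which is where the "less than 8 MB" guideline comes from.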
I think this is a side effect of how save is implemented in PyMongo.
Under the hood, if the document has an _id, then save(doc) is turned into an update(doc, doc). That is where the doubling comes into play, since the query plus the update is 18 MB.
When you removed the _id, you changed the save(doc) into an insert(doc) of a new document with a new _id. I don't think that is what you wanted.
Rather than use save, I would recommend constructing a query with just the _id field from the original document and doing the update call manually. I would even go so far as to suggest filing a Jira ticket to get PyMongo to do this for you.
HTH,
Rob.
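The manual update suggested in this answer could look roughly like the sketch below. `Collection.update(spec, document)` is the PyMongo 2.x call (it was later removed in PyMongo 4.0); the wrapper name `replace_by_id` is made up for illustration. The thread does not confirm that this avoids the size error, and the other answer attributes the doubling to TokuMX's oplog rather than to PyMongo.

```python
def replace_by_id(collection, doc):
    """Overwrite the stored document whose _id matches doc['_id'].

    Hypothetical wrapper: `collection` is expected to be a PyMongo 2.x
    Collection (or any object with a compatible update(spec, document)).
    """
    # The query carries only the _id, not the whole document.
    spec = {"_id": doc["_id"]}
    # With no $ operators in `doc`, update() replaces the entire document.
    return collection.update(spec, doc)
```

With PyMongo 3.0 or later, `collection.replace_one({"_id": doc["_id"]}, doc)` is the equivalent call.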

Error 3013 thrown when writing a file in Adobe AIR

I'm trying to write/create a JSON file from an AIR app, and I'm trying not to show a 'Save as' dialogue box.
Here's the code I'm using:
var fileDetails:Object = CreativeMakerJSX.getFileDetails();
var fileName:String = String(fileDetails.data.filename);
var path:String = String(fileDetails.data.path);
var f:File = File.userDirectory.resolvePath( path );
var stream:FileStream = new FileStream();
stream.open(f, FileMode.WRITE );
stream.writeUTFBytes( jsonToExport );
stream.close();
The problem is that I get 'Error 3013. File or directory in use'. The path comes from a Creative Suite Extension I'm building; it is the same directory as the FLA being developed in CS that the Extension is being used with.
So I'm not sure if the problem is that there are already files in the directory I'm writing the JSON file to?
Do I need to add a timer to close the stream after a slight delay, giving the file some time to be written?
Can you set up some trace() commands? I would need to know what the values of the String variables are, and the f.url.
Can you read from the file that you are trying to write to, or does nothing work?
Where is CreativeMakerJSX.getFileDetails() coming from? Is it giving you data about a file that is in use?
And from Googling around, this seems like it may be a bug. Try setting up a listener for when you are finished, if you have had the file open previously.
I rewrote how the file was written and am no longer running into this issue.