GhostScript .NET not continuing past certain pages - pdf

I've created a program that needs to convert PDF files into image files, and GhostScript seemed the best choice for this. Once in a while, however, the library stalls completely on a page and never continues; it just keeps using CPU, as though it were caught in an infinite loop. The problem is easily reproducible, since it happens every time on the specific PDF files affected, but GhostScript reports no error of any kind, and as far as I can see nothing is out of the ordinary in the PDF files themselves.
I have, however, found that the stalling is caused by a specific element or elements in the PDF files. Deleting those elements lets the PDF render without trouble in GhostScript, but that is not a solution I can use.
PDF link* - http://www.filedropper.com/usjunis1-32webtest
*saved with free version of PDF-XChange Editor, so it has watermarks at the top, but it is the square that creates the stalling. I've also seen it happen on vector graphics objects, so it is not limited to squares.
Code -
private void startImageProcessing(String pdfFile)
{
    GhostscriptVersionInfo gvi = new GhostscriptVersionInfo(new Version(0, 0, 0), Directory.GetCurrentDirectory() + @"\gsdll32.dll", string.Empty, GhostscriptLicense.GPL);
    Ghostscript.NET.Processor.GhostscriptProcessor processor = new Ghostscript.NET.Processor.GhostscriptProcessor(gvi, true);
    processor.StartProcessing(CreateTestArgs(pdfFile, pdfFile.Substring(0, pdfFile.Length - 4) + "\\" + prefix + "-%03d.jpg", 72 * scale), new ConsoleStdIO(true));
}
private static string[] CreateTestArgs(string inputPath, string outputPath, int dpi)
{
    List<string> gsArgs = new List<string>();
    gsArgs.Add("-dSAFER");
    gsArgs.Add("-dBATCH");
    gsArgs.Add("-dNOPAUSE");
    gsArgs.Add("-sDEVICE=jpeg");
    gsArgs.Add("-r" + dpi);
    gsArgs.Add("-dJPEGQ=100");
    gsArgs.Add("-dNumRenderingThreads=" + Environment.ProcessorCount.ToString());
    gsArgs.Add("-dTextAlphaBits=4");
    gsArgs.Add("-dGraphicsAlphaBits=4");
    gsArgs.Add(@"-sOutputFile=" + outputPath);
    gsArgs.Add(@"-f" + inputPath);
    return gsArgs.ToArray();
}
I've also created a PDF file containing only one of the offending elements for testing. It shows the error whether it is saved by Adobe Acrobat or by PDF-XChange Editor, so the problem is not tied to the program I used to save the PDF.

Related

iText 7 Chinese characters and merge with existing pdf template

I have to rephrase my question. My request is straightforward: I want to display Asian characters in a PDF file generated with iText 7.
So far I have downloaded the NotoSansCJKsc-Regular.otf file and assigned its path to a variable. Below is my code:
public static string FONT = @"D:\Projects\Resources\NotoSansCJKsc-Regular.otf";
PdfWriter writer = new PdfWriter(@"C:\temp\test.pdf");
PdfDocument pdfDoc = new PdfDocument(writer);
Document doc = new Document(pdfDoc, PageSize.A4);
PdfFont fontChinese = PdfFontFactory.CreateFont(FONT, PdfEncodings.IDENTITY_H);
doc.SetFont(fontChinese);
The issue I am facing now is that whenever the code reaches this line:
PdfFont fontChinese = PdfFontFactory.CreateFont(FONT, PdfEncodings.IDENTITY_H);
I always get this error: The request could not be performed because of an I/O device error. The error doesn't make sense to me and I am struggling to find a solution. Has anyone here had a similar issue? The code is in C#.
Many thanks.
I can confirm that the above code works as expected. The .otf file that I originally downloaded was corrupted, which is why I got the error.

OutOfMemory on custom extractor

I have stitched a lot of small XML files into one file, and then made a custom extractor to return rows with one byte array that corresponds to each file.
Run on remote/master:
Run it for one file (gzipped, 11 MB): it works fine.
Run it for more than one file: I get a System.OutOfMemoryException.
Run on local/master:
Run it for one or more files (gzipped, 500+ MB): works fine.
Extractor looks like this:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();
        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);
        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;
        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml.ToString()));
            yield return output.AsReadOnly();
        }
        yield break;
    }
}
and error message looks like this:
==== Caught exception System.OutOfMemoryException
at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924
So what am I doing wrong? And how do I debug this on remote?
Thanks!
Unfortunately, a local run does not enforce memory limits, so you would have to check memory usage in local vertex debug yourself.
Looking at your code above, I see that you are loading XML documents into a DOM. Please note that an XML DOM can explode the data size from the string representation up to a factor of 10 or more (I have seen 2 to 12 in my times as the resident SQL XML guru).
Each UDO today only gets 1/2 GB of RAM to play with. So what I assume is that your XML DOM document(s) start going beyond that.
The normal recommendation is to use the XmlReader interface (there is a reader extractor in the samples on http://usql.io as well) and scan through the document(s) to find the information you are looking for.
If your documents are always small enough (e.g., <20MB), you may want to make sure that you release the memory of the other documents and operate one document at a time.
We do have plans to allow you to annotate your UDO with memory needs, but that is still a bit out.
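The streaming approach recommended above can be sketched roughly as follows. This is only an illustration in Java, using the standard StAX API (javax.xml.stream) as a stand-in for .NET's XmlReader; it is not a U-SQL extractor. The class name, method name and callback are hypothetical. It assumes the stitched file has a single root whose direct children are the original documents, and that those documents use no namespace prefixes (only local names are copied; comments and processing instructions are dropped).

```java
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: emits one byte array per direct child of the root
// element without ever building a DOM for the whole stitched file.
class StitchedXmlSplitter {

    static void splitChildren(Reader stitchedXml, Consumer<byte[]> emit) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance().createXMLStreamReader(stitchedXml);
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
            StringWriter buffer = null;   // holds only the child currently being copied
            XMLStreamWriter out = null;
            int depth = 0;                // 1 = root element, 2 = a stitched document

            while (in.hasNext()) {
                int event = in.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // A new stitched document starts: begin a fresh buffer.
                        buffer = new StringWriter();
                        out = outFactory.createXMLStreamWriter(buffer);
                    }
                    if (depth >= 2) {
                        out.writeStartElement(in.getLocalName());
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            out.writeAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth >= 2) {
                        out.writeEndElement();
                    }
                    if (depth == 2) {
                        // The stitched document is complete: hand it over and drop it.
                        out.flush();
                        out.close();
                        emit.accept(buffer.toString().getBytes(StandardCharsets.UTF_8));
                        buffer = null;
                        out = null;
                    }
                    depth--;
                } else if (depth >= 2 && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    out.writeCharacters(in.getText());
                }
            }
            in.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
    }
}
```

Because each child is passed to the callback as soon as its end tag is read, only one stitched document is buffered at a time, which is the point of scanning rather than loading everything into a DOM.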

Save an image present in PDF on local File System

This is my first experience of using PDFBox jar files. Also, I have recently started working on TestComplete. In short, all these things are new for me and I have been stuck on one issue for last few hours. I will try to explain as much as I can. Would really appreciate any help!
Objective:
To save an image present in a PDF file on the file system
Issue:
When this line gets executed objImage.write2file_2(strSavePath);, I get the error Object doesn't support this property or method.
I am taking some help from here
Code:
function fn_PDFImage()
{
    var objPdfFile, strPdfFilePath, strSavePath, objPages, objPage, objImages, objImage, imgbuffer;
    strPdfFilePath = "C:\\Users\\aabb\\Desktop\\name.pdf";
    strSavePath = "C:\\Users\\aabb\\Desktop\\abc";
    objPdfFile = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(strPdfFilePath);
    objPages = objPdfFile.getDocumentCatalog().getAllPages();
    //getting a page with index=1
    objPage = objPages.get(1);
    objImages = objPage.getResources().getXObjects().values().toArray();
    Log.Message(objImages.length); //This is returning 14. i.e, 14 images
    //getting an image with index=1
    objImage = objImages.items(1);
    Log.Message(typeof objImage); //returns "Object" which means it is not null
    //saving the image
    objImage.write2file_2(strSavePath); //<---GETTING AN ERROR HERE
}
ERROR:
If you are bothered about the method name write2file_2, please read this excerpt from the link which I have shared:
In Java, the constructor of a class has the name of this class.
TestComplete changes the constructor names to newInstance(). If a
class has overloaded constructors, TestComplete names them like
newInstance, newInstance_2, newInstance_3 and so on.
Additional Info:
I have imported the JAR file (pdfbox-app-1.8.13.jar) and its classes into TestComplete. I am not sure if I need to import some other JAR file or class here:
XObjects are not always image XObjects. And write2file is in the class PDXObjectImage so you need to check your object type first.
Re the second question asked in the comment: a form XObject isn't something you can save. Form XObjects are content streams with resources etc., similar to pages. What you can do, however, is explore these too and check whether their resources contain images. See how this is done in the ExtractImages source code of PDFBox 1.8.
However there are other places where there can be images (e.g. patterns, soft masks, inline images); this is only available in PDFBox 2.*, see the ExtractImages source code there. (Note that the class names are different).
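The type check and the descent into form XObjects described above can be sketched as follows. To stay self-contained, the example uses small stand-in classes (XObject, ImageXObject, FormXObject) instead of the real PDFBox types; all of these names are hypothetical. In PDFBox 1.8 the corresponding classes would be PDXObjectImage and PDXObjectForm, and the stand-ins model only what the traversal needs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class XObjectWalk {

    // Hypothetical stand-ins for the PDFBox XObject hierarchy.
    static class XObject { }

    static class ImageXObject extends XObject {
        final String label;
        ImageXObject(String label) { this.label = label; }
    }

    static class FormXObject extends XObject {
        // A form carries its own resources, much like a page does.
        final Map<String, XObject> resources = new LinkedHashMap<>();
    }

    /** Collects image XObjects only, descending into the resources of form XObjects. */
    static List<ImageXObject> collectImages(Map<String, XObject> xobjects) {
        List<ImageXObject> images = new ArrayList<>();
        for (XObject x : xobjects.values()) {
            if (x instanceof ImageXObject) {
                // Only image XObjects can be written out as image files.
                images.add((ImageXObject) x);
            } else if (x instanceof FormXObject) {
                // Forms cannot be saved themselves, but may contain images.
                images.addAll(collectImages(((FormXObject) x).resources));
            }
            // Any other XObject type is skipped.
        }
        return images;
    }
}
```

With the real library, the save call would go only on the objects that pass the image check, which avoids calling write2file on a form XObject.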

Adobe breaks stamped PDF when saving as new file / what is difference in Adobe 'save as' vs. Foxit Reader 'save as' feature

I'm reaching out to the larger developer community for help in understanding the real cause and possibly finding a fix. I have asked Aspose about this, and they are tracking the issue (PDFNET-42880) in their system. I don't think they are going to investigate it anytime soon, as it is a very specific case. So I am posting here to ask for more details about:
What is difference in Adobe 'save as' vs. Foxit Reader 'save as' vs. Windows Reader 'save as' feature?
Issues with Adobe product that are not so obvious to figure out. I don't even know what to ask :D
Link to their (Aspose) old forum: https://www.aspose.com/community/forums/thread/845549/removing-stamps-fails-after-saving-stamped-file-from-adobe-acrobat.aspx
Case:
Created PDF with forms using OpenOffice (version 3.4.0), stamped with Aspose PDF, opened with Adobe Reader DC (or Adobe Acrobat XI), filled, saved as new file. Now this new file is fine, but when I try to remove stamps using Aspose (and replace with new stamp later), this is where things get interesting.
Files that I've tested with: https://1drv.ms/f/s!Auvpijam7a73iDzOqc6wZPuY9l81
Stamp_Location.png
OoPdfFormExample_WithStamp.pdf
OoPdfFormExample_WithStamp_StampRemoved.pdf
OoPdfFormExample_WithStamp_SavedFromFoxit.pdf
OoPdfFormExample_WithStamp_SavedFromFoxit_StampRemoved.pdf
OoPdfFormExample_WithStamp_SavedFromWindowsReader.pdf
OoPdfFormExample_WithStamp_SavedFromWindowsReader_StampRemoved.pdf
OoPdfFormExample_WithStamp_SavedFromAdobeReader.pdf
OoPdfFormExample_WithStamp_SavedFromAcrobat_StampRemoved.pdf
C# code that is used to remove the stamp(s):
/// <summary>
/// Removes stamps from PDF file.
/// </summary>
/// <param name="pdfFile"></param>
private static void RemoveStamps( string pdfFile )
{
    // Create PDF content editor.
    Aspose.Pdf.Facades.PdfContentEditor contentEditor = new Aspose.Pdf.Facades.PdfContentEditor();
    // Open the temp file.
    contentEditor.BindPdf( pdfFile );
    // Process all pages.
    foreach ( Page page in contentEditor.Document.Pages )
    {
        // Get the stamp infos.
        Aspose.Pdf.Facades.StampInfo[] stampInfos = contentEditor.GetStamps( page.Number );
        //Process all stamp infos
        foreach ( Aspose.Pdf.Facades.StampInfo stampInfo in stampInfos )
        {
            // Use try catch so we can output possible error w/out break point.
            try
            {
                contentEditor.DeleteStampById( stampInfo.StampId );
            }
            catch ( Exception e )
            {
                Console.WriteLine( e );
            }
        }
    }
    // Save changes to the temp file.
    contentEditor.Save( StampRemovedPdfFile );
}
Using Adobe: Removing the stamp runs without errors, but opening the resulting file reports a problem with it:
"An error exists on this Page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem."
EDIT: After more testing, I found that just opening the file in Aspose and saving it without modifications didn't break the file. It was only broken once the stamp was removed with the Aspose method.
Using Foxit: The only difference in the process is that the file is opened in Foxit Reader and saved from there. The stamp is removed and the file is fine; it works with any PDF reader.
Using Windows (10) Reader: The only difference in the process is that the file is opened in Windows Reader and saved from there. The stamp is removed and the file is fine; it works with any PDF reader.
Ok - The thing you are referring to is not a stamp annotation. It's an XObject that gets drawn into the page content. Why Aspose refers to it as a Stamp is... well... a mystery. When you remove the "stamp" (not a stamp) Aspose seems to be removing the XObject but not the instructions to draw it from the page Contents stream... that's why you're getting the error in Acrobat. The other applications are more permissive with bad PDF and my guess is when they write out the file, they are removing references to non-existent objects. You can make Acrobat attempt to fix problems like this by selecting Save As Optimized PDF. However, you are far better off removing the drawing instruction in addition to the XObject.
Because of the way you've created the file and added the "stamp", your page content stream is an array of streams. Remove the last item in the array, which is the instruction to draw the XObject, and your file will work without errors in all the viewers. Note: It won't always be the case that the last item in the content array will be your stamp. It's just that your stamp is the last thing to get drawn so it goes at the end.
If your intention is to "replace" the "stamp", you'll want to do so by removing the XObject as you are doing now, then remove the instruction, then add the new "stamp".
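Since the stamp will not always be the last entry in the Contents array, one way to locate the drawing instruction is to look for the `Do` operator that invokes the XObject by name. The following is a minimal sketch in plain Java, not Aspose code: it assumes the content streams have already been decoded to text, that the XObject's resource name (here the made-up name `Stamp1`) is known, and that the name appears literally in the stream. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ContentStreamCleanup {

    // Matches "/<name> Do", i.e. the operator that paints the named XObject,
    // plus at most one trailing whitespace character.
    static Pattern drawOperator(String xobjectName) {
        return Pattern.compile("/" + Pattern.quote(xobjectName) + "\\s+Do(?=\\s|$)\\s?");
    }

    /** Returns the indices of the content streams that draw the named XObject. */
    static List<Integer> streamsDrawing(List<String> contents, String xobjectName) {
        Pattern p = drawOperator(xobjectName);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < contents.size(); i++) {
            if (p.matcher(contents.get(i)).find()) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** Removes the drawing instruction for the named XObject from one decoded stream. */
    static String withoutDraw(String stream, String xobjectName) {
        return drawOperator(xobjectName).matcher(stream).replaceAll("");
    }
}
```

If the matching stream contains nothing but the stamp, dropping that array entry entirely is simpler. The in-stream removal is for the case where the same stream also draws other content.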

Remove PDFont caching with Apache tika

I am trying to extract only the text from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can autodetect the document type and extract text accordingly. I am only interested in the text, not the formatting etc.
My application ends up with a big memory leak, and on investigating it I found that it comes from caching in the PDFont class of the PDFBox dependency. I am not interested in caching font metrics and other font formatting details from PDFs, as I only want to extract the text.
I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using autodetection:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File(child.getPath()));
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
String s = handler.toString();
// Drop references and try to release the cached PDFBox font resources
handler = null;
context = null;
inputstream.close();
PDFont.clearResources();
So I fudged a workaround and just called System.gc() every time a file had finished being processed, which works a treat but doesn't really answer the question.