Text Extraction, Not Image Extraction

Text Extraction, Not Image Extraction - pdf

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions because the ParseContentMethod tries to parse inline images? The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream but I have a PDF file failing to extract text because of inline images. It returns an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then it finally fails with and InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file so I assume this failure is because of the /DCTDECODE exception. But again, I don't care about images, I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
try {
IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
PdfReader.DecodeBytes(samples, imageDictionary, handlers);
return true;
} catch (IOException e) {
return false;
}
}
public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
{
return b;
}
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Related

JavaFX - set programmatically the destination path to print directly a node to pdf file

I want to print a node to a pdf file using "Microsoft Print to PDF" printer. Supposing that the Printer object is already extracted I have the next function which is working perfectly.
public static void printToPDF(Printer printer, Node node) {
PrinterJob job = PrinterJob.createPrinterJob(printer);
if (job != null) {
job.getJobSettings().setPrintQuality(PrintQuality.HIGH);
PageLayout pageLayout = job.getPrinter().createPageLayout(Paper.A4, PageOrientation.PORTRAIT,
Printer.MarginType.HARDWARE_MINIMUM);
boolean printed = job.printPage(pageLayout, node);
if (printed) {
job.endJob();
} else {
System.out.println("Printing failed.");
}
} else {
System.out.println("Could not create a printer job.");
}
}
The only issue that I have here, is that a dialog box is popping up and asking for a destination path to save the pdf. I was struggling to find a solution to set the path programmatically, but with no success. Any suggestions? Thank you in advance.

After some more research I came with an ugly hack. I accessed jobImpl private field from PrinterJob, and I took attributes out of it. Therefore I inserted the destination attribute, and apparently it is working as requested. I know it is not nice, but ... is kind of workable. If you have any nicer suggestion, please do not hesitate to post them.
try {
java.lang.reflect.Field field = job.getClass().getDeclaredField("jobImpl");
field.setAccessible(true);
PrinterJobImpl jobImpl = (PrinterJobImpl) field.get(job);
field.setAccessible(false);
field = jobImpl.getClass().getDeclaredField("printReqAttrSet");
field.setAccessible(true);
PrintRequestAttributeSet printReqAttrSet = (PrintRequestAttributeSet) field.get(jobImpl);
field.setAccessible(false);
printReqAttrSet.add(new Destination(new java.net.URI("file:/C:/deleteMe/wtv.pdf")));
} catch (Exception e) {
System.err.println(e);
}

Trying to replace graphics resources in a PDF - PDFBox 2.0.8

I'm trying to manipulate image resources in some PDF files; the workflow is: extract image resources -> process each -> replace old ones with the new.
Simple task really, I have working code for extracting and replacing, but when I replace, the new file size is nearly twice the original.
To replace the images, I use PDResources.put(COSName, PDXObject). Any ideas what would cause the size increase in the resulting document? It happens even if I completely omit the middle step in the workflow to process each image resource.
public static void PDFBoxReplaceImages() throws Exception {
PDDocument document = PDDocument.load(new File("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf"));
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName c : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(c);
if (o instanceof PDImageXObject) {
counter++;
String path = "C:\\Users\\Markus\\workspace\\pdf-test\\images\\"+counter+".png";
PDImageXObject newImg =
PDImageXObject.createFromFile(path, document);
pdResources.put(c, newImg);
}
}
}
document.save("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf");
}

How to merge 10000 pdf into one using pdfbox in most effective way

PDFBox api is working fine for less number of files. But i need to merge 10000 pdf files into one, and when i pass 10000 files(about 5gb) it's taking 5gb ram and finally goes out of memory.
Is there some implementation for such requirement in PDFBox.
I tried to tune it for that i used AutoClosedInputStream which gets closed automatically after read, But output is still same.

I have a similar scenario here, but I need to merge only 1000 documents in a single one.
I tried to use PDFMergerUtility class, but I getting an OutOfMemoryError. So I did refactored my code to read the document, load the first page (my source documents have one page only), and then merge, instead of using PDFMergerUtility. And now works fine, with no more OutOfMemoryError.
public void merge(final List<Path> sources, final Path target) {
final int firstPage = 0;
try (PDDocument doc = new PDDocument()) {
for (final Path source : sources) {
try (final PDDocument sdoc = PDDocument.load(source.toFile(), setupTempFileOnly())) {
final PDPage spage = sdoc.getPage(firstPage);
doc.importPage(spage);
}
}
doc.save(target.toAbsolutePath().toString());
} catch (final IOException e) {
throw new IllegalStateException(e);
}
}

replace string in PDF document (ITextSharp or PdfSharp)

We use non-manage DLL that has a funciton to replace text in PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to managed solution (ITextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?

According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general purpose PDF library, it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
using (PdfReader reader = new PdfReader(OrigFile))
{
byte[] contentBytes = reader.GetPageContent(1);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace(origText, replaceText);
reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
}
}
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if during your move to managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.

for pdfsharp users heres a somewhat usable function, i copied from my project and it uses an utility method which is consumed by othere methods hence the unused result.
it ignores whitespaces created by Kerning, and therefore may mess up the result (all characters in the same space) depending on the source material
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
ModifyPdfContentStreams(contentPage, stream =>
{
if (!stream.TryUnfilter())
return false;
var search = string.Join("\\s*", source.Select(c => c.ToString()));
var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
if (!Regex.IsMatch(stringStream, search))
return false;
stringStream = Regex.Replace(stringStream, search, target);
stream.Value = Encoding.Default.GetBytes(stringStream);
stream.Zip();
return false;
});
}
public static void ModifyPdfContentStreams(PdfPage contentPage,Func<PdfDictionary.PdfStream, bool> Modification)
{
for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
return;
var resources = contentPage.Elements?.GetDictionary("/Resources");
var xObjects = resources?.Elements.GetDictionary("/XObject");
if (xObjects == null)
return;
foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
{
var stream = (item.Value as PdfDictionary)?.Stream;
if (stream != null)
if (Modification(stream))
return;
}
}

Docx4j Test - No File is Output

Attempting to write my first class with docx4j (http://www.docx4java.org). Basically the idea is to find a string of text in the .docx file and replace it with another string of text. Essentially a mail merge. While I'm not receiving any errors, the merged document itself is not being saved in the path I've suggested. This makes me think it's a file path problem but I don't see anything wrong with it.
package efi.mailmerge.servlets;
import java.util.List;
import javax.xml.bind.JAXBElement;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.Text;
public class WordDocTest {
/**
* Open word document /Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample.docx, replace a piece of text and save
* the result to /Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample-Out.docx.
*
* The text <<CUS_FNAME>> will be replaced with John.
*
* #param args
*/
public static void main(String[] args) {
// Text nodes begin with w:t in the word document
final String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
try {
// Open the input file
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File("/Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample.docx"));
// Build a list of "text" elements
List texts = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
// Loop through all "text" elements
for (Object obj : texts) {
Text text = (Text) ((JAXBElement) obj).getValue();
// Get the text value
String textValueBefore = text.getValue();
// Perform the replacement
String textValueAfter = textValueBefore.replaceAll("<<CUS_FNAME>>", "John");
// Show the element before and after the replacement
System.out.println("textValueBefore = " + textValueBefore);
System.out.println("textValueAfter = " + textValueAfter);
// Update the text element now that we have performed the replacement
text.setValue(textValueAfter);
}
wordMLPackage.save(new java.io.File("/Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample-Out.docx"));
} catch (Docx4JException e) {
Logger.getLogger(WordDocTest.class.getName()).log(Level.SEVERE, null, e);
e.printStackTrace();
} catch (Exception e) {
Logger.getLogger(WordDocTest.class.getName()).log(Level.SEVERE, null, e);
e.printStackTrace();
}
}
}
On lines 26 and 50 you can see the input/output paths. I've confirmed that the Sample.docx input file does exist and that the uploads directory has write permissions. Can you see anything wrong with my file paths here? I could be completely on the wrong path, but this is all very new to me so I'm learning as I go.
Any and all help is very much appreciated.

At first sight, I would suggest trying with your path written the following way :
wordMLPackage.save(new java.io.File("\\Users\\Jeff\\Development\\ReServe-Unleashed\\Dev\\MailMerge\\uploads\\Sample-Out.docx"));
If it still not works, please provide the stack traces ? It could help. (if no doc is saved, there must be an exception thrown)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Text Extraction, Not Image Extraction - pdf

Related

JavaFX - set programmatically the destination path to print directly a node to pdf file

Trying to replace graphics resources in a PDF - PDFBox 2.0.8

How to merge 10000 pdf into one using pdfbox in most effective way

replace string in PDF document (ITextSharp or PdfSharp)

Docx4j Test - No File is Output

Categories

Resources