PDFBox getText not returning all of the visible text - pdf

I am using PDFBox to extract text from my PDF document. It retrieves the text, but not all of it (specifically, seems like title/header and footer texts are missing). The parts that are missing are not images and are extracted when using text view in foxit reader.
I am using version 1.8.12 and made a test case with 2.0.2 just to see if it would return more of the content.
This is the code i used for 2.0.2:
public static void main(String[] args) {
File file = new File("D:\\\\file.pdf");
try {
PDDocument doc = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
//stripper.setSuppressDuplicateOverlappingText(false);
stripper.getText(doc);
} catch (Exception e) {
System.out.println("Exc errirs ");
}
}
Now I wonder are there any settings I missed? Is PDFBox failing because text is on top of some decorative elements (rectangle under text)?
Thanks
EDIT: link to file in question

As discussed in the comments, the text wasn't missing, but at the "wrong" position. By default, PDFBox text extraction extracts the characters as they come in the content stream, but they don't always come in a "natural" way. PDF files are created by software, not by humans.
An alternative is to use the sort option:
stripper.setSortByPosition(true)
However, as mkl pointed out, if the text is in two columns, you won't like the result either.

Related

Why is flying saucer always printing PDF on A4 paper?

I'm trying to save an html document to PDF using flyingsaucer but the generated document always ends up having an A4 dimension when I look at the Document Properties from Adobe Reader (Page Size: 8.26 x 11.69 in).
I did read the documentation and I'm passing the css #page {size: letter;} style. And while it does have an effect on the output, the page size always remains 8.26 x 11.69 in Adobe Reader. For example, if I set the page size to legal, my PDF is still the size of a A4 but the top of the document is missing as if it had fell off the "paper".
I'm not sure if the problem falls on the itext side or the flying saucer side. I was using a fairly old version so my first step was to upgrade to the latest 9.1.6 version of flying saucer. I also moved from itext 2.0.8 to openPDF 1.0.1 but I'm still getting the same behavior.
I also traced in the debugger up to the com.lowagie.text.Document creation in ITextRenderer and at this point the document size passed is correct. That makes me think that the issue might be in openPDF / iText but I can't find what I'm doing wrong.
It turns out the PDF generation was correctly using the #page size declaration and the problem was occurring later in our software. What I had not noticed is that after the generation of the PDF another method was called to merge multiple PDFs into one. This method should probably not have been called, but that's another story.
The bottom line is this method created a new com.lowagie.text.Document(), which by default creates an A4 sized document, and then was iterating over all pages of the pdf, adding the pages to the new document using pdfWriter.getImportedPage(pdfReader, currentPage++). These imported pages did not retain their original size.
I fixed it by passing the page size of the fist page when creating the merged document object:
document = new Document(pdfReader.getPageSize(1));
The real problem is that you're (unwittingly) using software that is no longer supported. Anything that still has the namespace lowagie (the founder and CTO of iText) is really outdated.
If you simply want to convert HTML to pdf, why not use iText directly and cut out the middle-man?
We have multiple options for you.
XMLWorker (iText5 based code that converts HTML to pdf)
pdfHTML (iText7 based add-on that converts HTML5/CSS3 to pdf)
This is a rather extensive code-sample for using pdfHTML:
public void createPdf(String src, String dest, String resources) throws IOException {
try {
FileOutputStream outputStream = new FileOutputStream(dest);
WriterProperties writerProperties = new WriterProperties();
//Add metadata
writerProperties.addXmpMetadata();
PdfWriter pdfWriter = new PdfWriter(outputStream, writerProperties);
PdfDocument pdfDoc = new PdfDocument(pdfWriter);
pdfDoc.getCatalog().setLang(new PdfString("en-US"));
//Set the document to be tagged
pdfDoc.setTagged();
pdfDoc.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
//Set meta tags
PdfDocumentInfo pdfMetaData = pdfDoc.getDocumentInfo();
pdfMetaData.setAuthor("Joris Schellekens");
pdfMetaData.addCreationDate();
pdfMetaData.getProducer();
pdfMetaData.setCreator("iText Software");
pdfMetaData.setKeywords("example, accessibility");
pdfMetaData.setSubject("PDF accessibility");
//Title is derived from html
// pdf conversion
ConverterProperties props = new ConverterProperties();
FontProvider fp = new FontProvider();
fp.addStandardPdfFonts();
fp.addDirectory(resources);//The noto-nashk font file (.ttf extension) is placed in the resources
props.setFontProvider(fp);
props.setBaseUri(resources);
//Setup custom tagworker factory for better tagging of headers
DefaultTagWorkerFactory tagWorkerFactory = new AccessibilityTagWorkerFactory();
props.setTagWorkerFactory(tagWorkerFactory);
HtmlConverter.convertToPdf(new FileInputStream(src), pdfDoc, props);
pdfDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
You can find more information at http://itextpdf.com/itext7/pdfHTML

decrease font size on exisiting text in pdf

I have a huge pdf containing >1000 pages, need to edit existing text, offcourse it is added by me using pdfbox addtext to each page example ... the text font size was very big text runs out of page..
now i want to decrease the size of font so that it will be within page limits... or i can clear the existing text and replace a the same text with new font...
credits to Tilman Hausherr for answer
If you used the code you linked to, then you will find the "added message" in the content stream array, as the second last item. PDPage.getCosObject().getItem(COSName.Contents) and save the file.
public void removeStamp(File src) throws IOException {
PDDocument doc = PDDocument.load(src);
PDPageTree pages = doc.getPages();
for (PDPage page : pages) {
COSArray array = ((COSArray) page.getCOSObject().getItem(COSName.CONTENTS));
array.remove(array.size() - 1);
}
doc.save(src);
}

Possible to put HTML annotations on PDF?

I know that we can now put text, links and videos..but can we put HTML as annotation as well?
If there's a SDK, please point me to it as well.
I have tried to search as much as possible but couldn't find anything on it.
Updated: okay, here are more details. I'm creating a script to create a PDF from an image, and at the same time have to place annotations on top of the image. When the person click the annotation, the HTML will be shown. I understand there are link annotations and shape annotation, but what I'm looking for is the ability to place HTML markup/codes in the annotation. For example, i would be able to design a simple form or style some text or even a embed YouTube video.
I hope I'm clear.
Thanks!
Here goes a very basic sample code : Please add Itext jar in your project
Code :
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Image;
//input is image in String format
public void createfromimage(String input){
Document document = new Document(PageSize.A4.rotate());
document.setMargins(0,0,0,0);
String output = "C:/Users/username/Downloads/text.pdf";
try {
FileOutputStream fileOutputStream = new FileOutputStream(output);
PdfWriter pdfWriter = PdfWriter.getInstance(document, fileOutputStream);
Image image = Image.getInstance(input);
document.setPageSize(new Rectangle(image.getWidth(),image.getHeight()));
document.open();
pdfWriter.open();
document.add(image);
document.close();
pdfWriter.close();
} catch (Exception e){
e.printStackTrace();
}
}
You can add annotations in above way, and for Link annotations, refer to link below :
https://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationLink.html
please note, this is just a simple example.

PDFBOX, Reading a pdf line by line and extracting text properties

I am using pdfbox to extract text from pdf files. I read the pdf document as follows
PDFParser parser = null;
String text = "";
PDFTextStripper stripper = null;
PDDocument pdoc = null;
COSDocument cdoc = null;
File file = new File("path");
try {
parser = new PDFParser(new FileInputStream(file));
} catch (IOException e) {
e.printStackTrace();
}
try {
parser.parse();
cdoc = parser.getDocument();
stripper = new PDFTextStripper();
pdoc = new PDDocument(cdoc);
stripper.setStartPage(1);
stripper.setEndPage(2);
text = stripper.getText(pdoc);
System.out.println(text);
} catch (IOException e) {
e.printStackTrace();
}
But what I want to do is read the document line by line and to extract the text properties such as bold,italic, from each line.
How can I achieve this with pdfbox library
extract the text properties such as bold,italic, from each line. How can I achieve this with pdfbox library
Properties such as bold and italic are not first-class properties in a PDF.
Bold or italic writing in PDFs is achieved either using
different fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold or italic by
looking at the font name: it may contain a substring "bold", "italic", "oblique"...
looking at some optional properties of the font, e.g. font weight...
inspecting embedded font file.
Neither of these methods is fool-proof; or
using the same font as for non-bold, non-italic text but using special techniques to make them appear bold or italic (aka poor man's bold), e.g.
not only filling the glyph contours but also drawing a thicker line along it for a bold impression,
drawing the glyph twice, the second time slightly displaced, also for a bold impression,
using a text or transformation matrix to slant the letters for an italic impression.
By overriding the PDFTextStripper methods with such tests accordingly, you may achieve a fairly good guess rate for styles during PDF text extraction.

Some pdf file watermark does not show using iText

Our company using iText to stamp some watermark text (not image) on some pdf forms. I noticed 95% forms shows watermark correctly, about 5% does not. I tested, copy 2 original pdf files, one was marked ok, other one does not ok, then tested in via a small program, same result: one got marked, the other does not. I then tried the latest version of iText jar file (version 5.0.6), same thing. I checked pdf file properties, security settings etc, seems nothing shows any hint. The result file does changed size and markd "changed by iText version...." after executed program.
Here is the sample watermark code (using itext jar version 2.1.7), note topText, mainText, bottonText parameters passed in, make 3 lines of watermarks show in the pdf as watermark.
Any help appreciated !!
public class WatermarkGenerator {
private static int TEXT_TILT_ANGLE = 25;
private static Color MEDIUM_GRAY = new Color(160, 160, 160);
private static int SUPPORT_FONT_SIZE = 42;
private static int PRIMARY_FONT_SIZE = 54;
public static void addWaterMark(InputStream pdfInputStream,
OutputStream outputStream, String topText,
String mainText, String bottomText) throws Exception {
PdfReader reader = new PdfReader(pdfInputStream);
int numPages = reader.getNumberOfPages();
// Create a stamper that will copy the document to the output
// stream.
PdfStamper stamp = new PdfStamper(reader, outputStream);
int page=1;
BaseFont baseFont =
BaseFont.createFont(BaseFont.HELVETICA_BOLDOBLIQUE,
BaseFont.WINANSI, BaseFont.EMBEDDED);
float width;
float height;
while (page <= numPages) {
PdfContentByte cb = stamp.getOverContent(page);
height = reader.getPageSizeWithRotation(page).getHeight() / 2;
width = reader.getPageSizeWithRotation(page).getWidth() / 2;
cb = stamp.getUnderContent(page);
cb.saveState();
cb.setColorFill(MEDIUM_GRAY);
// Top Text
cb.beginText();
cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
cb.showTextAligned(Element.ALIGN_CENTER, topText, width,
height+PRIMARY_FONT_SIZE+16, TEXT_TILT_ANGLE);
cb.endText();
// Primary Text
cb.beginText();
cb.setFontAndSize(baseFont, PRIMARY_FONT_SIZE);
cb.showTextAligned(Element.ALIGN_CENTER, mainText, width,
height, TEXT_TILT_ANGLE);
cb.endText();
// Bottom Text
cb.beginText();
cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
cb.showTextAligned(Element.ALIGN_CENTER, bottomText, width,
height-PRIMARY_FONT_SIZE-6, TEXT_TILT_ANGLE);
cb.endText();
cb.restoreState();
page++;
}
stamp.close();
}
}
We solved problem by change Adobe LifecycleSave file option. File->Save->properties->Save as, then look at Save as type, default is Acrobat 7.0.5 Dynamic PDF Form File, we changed to use 7.0.5 Static PDF Form File (actually any static one will work). File saved in static one do not have this watermark disappear problem. Thanks Mark for pointing to the right direction.
You're using the underContent rather than the overContent. Don't do that. It leaves you at the mercy of big, white-filled rectangles that some folks insist on drawing first thing. It's a hold over from less-than-good PostScript interpreters and hasn't been necessary for Many Years.
Okay, having viewed your PDF, I can see the problem is that this is an XFA-based form (from LiveCycle Designer). Acrobat can (and often does) rebuild the entire file based on the XFA (a type of xml) it contains. That's how your changes are lost. When Acrobat rebuilds the PDF from the XFA, all the existing PDF information is pitched, including your watermark.
The only way to get this to work would be to define the watermark as part of the XFA file contained in the PDF.
Detecting these forms isn't all that hard:
PdfReader reader = new PdfReader(...);
AcroFields acFields = reader.getAcroFields();
XfaForm xfaForm = acFields.getXfaForm();
if (xfaForm != null && xfaForm.isXfaPresent()) {
// Ohs nose.
throw new ItsATrapException("We can't repel XML of that magnitude!");
}
Modifying them on the other hand could be Quite Challenging, but here's the specs.
Once you've figured out what needs to be changed, it's a simple matter of XML manipulation... but that "figure it out" part could be interesting.
Good hunting.