PDFBOX, Reading a pdf line by line and extracting text properties

PDFBOX, Reading a pdf line by line and extracting text properties - pdfbox

I am using pdfbox to extract text from pdf files. I read the pdf document as follows
PDFParser parser = null;
String text = "";
PDFTextStripper stripper = null;
PDDocument pdoc = null;
COSDocument cdoc = null;
File file = new File("path");
try {
parser = new PDFParser(new FileInputStream(file));
} catch (IOException e) {
e.printStackTrace();
}
try {
parser.parse();
cdoc = parser.getDocument();
stripper = new PDFTextStripper();
pdoc = new PDDocument(cdoc);
stripper.setStartPage(1);
stripper.setEndPage(2);
text = stripper.getText(pdoc);
System.out.println(text);
} catch (IOException e) {
e.printStackTrace();
}
But what I want to do is read the document line by line and to extract the text properties such as bold,italic, from each line.
How can I achieve this with pdfbox library

extract the text properties such as bold,italic, from each line. How can I achieve this with pdfbox library
Properties such as bold and italic are not first-class properties in a PDF.
Bold or italic writing in PDFs is achieved either using
different fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold or italic by
looking at the font name: it may contain a substring "bold", "italic", "oblique"...
looking at some optional properties of the font, e.g. font weight...
inspecting embedded font file.
Neither of these methods is fool-proof; or
using the same font as for non-bold, non-italic text but using special techniques to make them appear bold or italic (aka poor man's bold), e.g.
not only filling the glyph contours but also drawing a thicker line along it for a bold impression,
drawing the glyph twice, the second time slightly displaced, also for a bold impression,
using a text or transformation matrix to slant the letters for an italic impression.
By overriding the PDFTextStripper methods with such tests accordingly, you may achieve a fairly good guess rate for styles during PDF text extraction.

Related

How to use font in other PDF files? (itext7 PDF)

I am now trying to modify a PDF file with ONLY text content. When I use
TextRenderInfo.getFont()
it returns me a Font which is actually an indirect object.
pdf.inderect.object.belong.to.other.pdf.document.Copy.object.to.current.pdf.document
would be thrown in this case when close the PdfDocument.
Is there a way to let me reuse this Font in a new PDF file? OR, is there a way to in-place edit the text content in PDF (without changing the font, color, fontSize)?
I'm using itext7.
Thanks

First of all, from the error message I see that you are not using the latest version of iText, which is 7.0.2 at the moment. So I recommend that you update your iText version.
Secondly, it is indeed possible to use a font in another document. But to do that, you first have to copy the corresponding font object to that other document (as stated in the exception message by the way). But you should be warned that this approach has some limitations, e.g. in case of a font subset, you will only be able to use the glyphs that are present in the original font subset in the source document and will not be able to use other glyphs.
PdfFont font = textRenderInfo.getFont(); // font from source document
PdfDocument newPdfDoc = ... // new PdfDocument you want to write some text to
// copy the font dictionary to the new document
PdfDictionary fontCopy = font.getPdfObject().copyTo(newPdfDoc);
// create a PdfFont instance corresponding to the font in the new document
PdfFont newFont = PdfFontFactory.createFont(fontCopy);
// Use newFont in newPdfDoc, e.g.:
Document doc = new Document(newPdfDoc);
doc.add(new Paragraph("Hello").setFont(newFont));

PDFBox getText not returning all of the visible text

I am using PDFBox to extract text from my PDF document. It retrieves the text, but not all of it (specifically, seems like title/header and footer texts are missing). The parts that are missing are not images and are extracted when using text view in foxit reader.
I am using version 1.8.12 and made a test case with 2.0.2 just to see if it would return more of the content.
This is the code i used for 2.0.2:
public static void main(String[] args) {
File file = new File("D:\\\\file.pdf");
try {
PDDocument doc = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
//stripper.setSuppressDuplicateOverlappingText(false);
stripper.getText(doc);
} catch (Exception e) {
System.out.println("Exc errirs ");
}
}
Now I wonder are there any settings I missed? Is PDFBox failing because text is on top of some decorative elements (rectangle under text)?
Thanks
EDIT: link to file in question

As discussed in the comments, the text wasn't missing, but at the "wrong" position. By default, PDFBox text extraction extracts the characters as they come in the content stream, but they don't always come in a "natural" way. PDF files are created by software, not by humans.
An alternative is to use the sort option:
stripper.setSortByPosition(true)
However, as mkl pointed out, if the text is in two columns, you won't like the result either.

PDFBox pdf to image generates overlapping text

For a side project I started using PDFBox to convert pdf file to image. This is the pdf file I am using to convert to image file https://bitcoin.org/bitcoin.pdf.
This is the code I am using. It is very simple code which calls PDFToImage. But the output jpg image file looks really bad with lot of commas inserted and some overlapping text.
String [] args_2 = new String[7];
String pdfPath = "C:\\bitcoin.pdf";
args_2[0] = "-startPage";
args_2[1] = "1";
args_2[2] = "-endPage";
args_2[3] = "1";
args_2[4] = "-outputPrefix";
args_2[5] = "my_image_2";
//args_2[6] = "-resolution";
//args_2[7] = "1000";
args_2[6] = pdfPath;
try {
PDFToImage.main(args_2);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

If you look at the logging outputs (maybe you need to activate logging in your environment). you'll see many entries like these (generated using PDFBox 1.8.5):
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
So PDFBox uses different fonts than the fonts indicated by the PDF for rendering the text of it. This explains both the lots of commas inserted and the overlapping text:
different fonts may have different encodings. It looks like your sample PDF uses an encoding which has a comma where the default font assumed by PDFBox has a space character;
different fonts have different glyph widths. In your sample PDF the different glyph widths cause overlapping text.
This results in
The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox currently under development, instead. Be aware, though, the classes for rendering have been changed.
Using PDFBox 2.0.0-SNAPSHOT
Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:
PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
#SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();
PDFRenderer renderer = new PDFRenderer(document);
for (int i = 0; i < pages.size(); i++)
{
BufferedImage image = renderer.renderImage(i);
ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}
The result with this code is:
Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.
PS: As proposed by Tilman Hausherr you may want to replace the ImageIO.write call by
ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);
ImageIOUtil is a PDFBox helper class which tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.
If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.

How to merge PDFs to a single file without multiple copies of the same font?

I create PDFs and concatenate them into a single PDF.
My resulting PDF is a lot bigger than I had expected in file size.
I realised that my output PDF has a ton of duplicate font, and it is the reason of unexpectedly big file size.
Here, my question is:
I would like to create PDFs which only embed font information, so let they use Windows System Font.
When I merge them into a single PDF, I insert actual font which PDF needs.
If possible, please let me know how to do it.

I've created the MergeAndAddFont example to explain the different options.
We'll create PDFs using this code snippet:
public void createPdf(String filename, String text, boolean embedded, boolean subset) throws DocumentException, IOException {
// step 1
Document document = new Document();
// step 2
PdfWriter.getInstance(document, new FileOutputStream(filename));
// step 3
document.open();
// step 4
BaseFont bf = BaseFont.createFont(FONT, BaseFont.WINANSI, embedded);
bf.setSubset(subset);
Font font = new Font(bf, 12);
document.add(new Paragraph(text, font));
// step 5
document.close();
}
We use this code to create 3 test files, 1, 2, 3 and we'll do this 3 times: A, B, C.
The first time, we use the parameters embedded = true and subset = true, resulting in the files testA1.pdf with text "abcdefgh" (3.71 KB), testA2.pdf with text "ijklmnopq" (3.49 KB) and testA3.pdf with text "rstuvwxyz" (3.55 KB). The font is embedded and the file size is relatively low because we only embed a subset of the font.
Now we merge these files using the following code, using the smart parameter to indicate whether we want to use PdfCopy or PdfSmartCopy:
public void mergeFiles(String[] files, String result, boolean smart) throws IOException, DocumentException {
Document document = new Document();
PdfCopy copy;
if (smart)
copy = new PdfSmartCopy(document, new FileOutputStream(result));
else
copy = new PdfCopy(document, new FileOutputStream(result));
document.open();
PdfReader[] reader = new PdfReader[3];
for (int i = 0; i < files.length; i++) {
reader[i] = new PdfReader(files[i]);
copy.addDocument(reader[i]);
}
document.close();
for (int i = 0; i < reader.length; i++) {
reader[i].close();
}
}
When we merge the document, be it with PdfCopy or PdfSmartCopy, the different subsets of the same font will be copied as separate objects in the resulting PDF testA_merged1.pdf / testA_merged2.pdf (both 9.75 KB).
This is the problem you are experiencing: PdfSmartCopy can detect and reuse identical objects, but the different subsets of the same font aren't identical and iText can't merge different subsets of the same font into one font.
The second time, we use the parameters embedded = true and subset = false, resulting in the files testB1.pdf (21.38 KB), testB2.pdf (21.38 KB) and testA3.pdf (21.38 KB). The font is fully embedded and the file size of a single file is a lot bigger than before because the full font is embedded.
If we merge the files using PdfCopy, the font will be present in the merged document redundantly, resulting in the bloated file testB_merged1.pdf (63.16 KB). This is definitely not what you want!
However, if we use PdfSmartCopy, iText detects an identical font stream and reuses it, resulting in testB_merged2.pdf (21.95 KB) which is much smaller than we had with PdfCopy. It's still bigger than the document with the subsetted fonts, but if you're concatenating a huge amount of files, the result will be better if you embed the complete font.
The third time, we use the parameters embedded = false and subset = false, resulting in the files testC1.pdf (2.04 KB), testC2.pdf (2.04 KB) and testC3.pdf (2.04 KB). The font isn't embedded, resulting in an excellent file size, but if you compare with one of the previous results, you'll see that the font looks completely different.
We merge the files using PdfSmartCopy, resulting in testC_merged1.pdf (2.6 KB). Again, we have an excellent file size, but again we have the problem that the font isn't visualized correctly.
To fix this, we need to embed the font:
private void embedFont(String merged, String fontfile, String result) throws IOException, DocumentException {
// the font file
RandomAccessFile raf = new RandomAccessFile(fontfile, "r");
byte fontbytes[] = new byte[(int)raf.length()];
raf.readFully(fontbytes);
raf.close();
// create a new stream for the font file
PdfStream stream = new PdfStream(fontbytes);
stream.flateCompress();
stream.put(PdfName.LENGTH1, new PdfNumber(fontbytes.length));
// create a reader object
PdfReader reader = new PdfReader(merged);
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(result));
PdfName fontname = new PdfName(BaseFont.createFont(fontfile, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED).getPostscriptFontName());
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary())
continue;
font = (PdfDictionary)object;
if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
&& fontname.equals(font.get(PdfName.FONTNAME))) {
PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
font.put(PdfName.FONTFILE2, objref.getIndirectReference());
}
}
stamper.close();
reader.close();
}
Now, we have the file testC_merged2.pdf (22.03 KB) and that's actually the answer to your question. As you can see, the second option is better than this third option.
Caveats: This example uses the Gravitas One font as a simple font. As soon as you use the font as a composite font (you tell iText to use it as a composite font by choosing the encoding IDENTITY-H or IDENTITY-V), you can no longer choose whether or not to embed the font, whether or not to subset the font. As defined in ISO-32000-1, iText will always embed composite fonts and will always subset them.
This means that you can't use the above solutions when you need special fonts (Chinese, Japanese, Korean). In that case, you shouldn't embed the fonts, but use so-called CJK fonts. They CJK fonts will use font packs that can be downloaded by Adobe Reader.

iTextSharp - PDF field contents become invisible

I have a PDF where I add some TextFields.
var txtFld = new TextField(stamper.Writer, new Rectangle(cRightX - cWidthX, cTopY3, cRightX, cTopY), FieldNameProtocol) { Font = bf, FontSize = cHeaderFontSize, Alignment = Element.ALIGN_RIGHT, Options = PdfFormField.FF_MULTILINE };
stamper.AddAnnotation(txtFld.GetTextField(), 1);
The ‘bf’ above is a Unicode font that gets embedded in the PDF:
BaseFont bf = BaseFont.CreateFont(UnicodeFontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED); // Create a Unicode font to write in Greek...
Later-on I fill those fields with greek text.
var acrof = stamper.AcroFields;
acrof.SetField(fieldName, field.Value/*, field.Value*/); // Set the text of the form field.
acrof.SetFieldProperty(fieldName, "setfflags", PdfFormField.FF_READ_ONLY, null); // Make it readonly.
When I view the PDF, most of the times the text is missing and if I click on the (invisible) TextField in Acrobat, then the text becomes visible (until it loses focus again).
Any idea what is going on here?
I have also tried using non-embedded font, but I get the same thing (although I still seem to get embedded fonts in PDF that are similar to the font I use). I don't know if I am missing sth.

It seemed that I was doing the following at the wrong order (the following is the correct):
acrof.SetFieldProperty(field.Name, "setfflags", PdfFormField.FF_READ_ONLY, null); // Make it readonly.
acrof.SetFieldProperty(field.Name, "textfont", bf, null);
acrof.SetField(field.Name, field.Value/*, field.Value*/); // Set the text of the form field.
At least that's hat I think it was wrong.
I have made many changes.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

PDFBOX, Reading a pdf line by line and extracting text properties - pdfbox

Related

How to use font in other PDF files? (itext7 PDF)

PDFBox getText not returning all of the visible text

PDFBox pdf to image generates overlapping text

How to merge PDFs to a single file without multiple copies of the same font?

iTextSharp - PDF field contents become invisible

Categories

Resources