I am parsing a pdf in PDFBox to extract all the text from it
public static void main(String args[]) {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("C:\\Users\\admin\\Downloads\\Airtel.pdf");
try {
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(1);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
BUT its not giving any text in output
help
PDFBox extracts text similar to how Adobe Reader copies&pastes text.
If you open your document in Adobe Reader and press <Ctrl-A> to mark all text (you'll see that hardly anything is marked) and copy&paste it to an editor, you'll find that Adobe Reader also hardly extracts anything.
The reason why neither PDFBox nor Adobe Reader (nor any other normal text extractor) extracts the text from your document is that there virtually is no text at all in it! The "text" you see is not drawn using text drawing operations but instead is painted using by defining the outlines of each "character" as path and filling the area in that path. Thus, there is no indication to a text extractor that there even is text.
There actually are two characters of real text in your document, the '-' sign between the "Previous balance" and the "Payments" boxes and the '-' sign between the "Payments" and the "Adjustments" boxes. And even those two characters are not extracted as desired because the font does not provide the information which Unicode codepoint those characters represent.
Your virtually only chance, therefore, to extract the text content of the document is to apply OCR to the document.
Related
I created a pdf document using the code looks like the following:
// The text parameter equels 'שדג' it is Hebrew. unicode equivalent is '\u05E9\u05D3\u05D2'
private static void createSimplePdf(String filename, String text) throws Exception {
final String path = RunItextApp.class.getResource("/Arial.ttf").getPath();
final PdfFont font = PdfFontFactory.createFont(path, PdfEncodings.IDENTITY_H);
Style hebrewStyle = new Style()
.setBaseDirection(BaseDirection.RIGHT_TO_LEFT)
.setFontSize(14)
.setFont(font);
final PdfWriter pdfWriter = new PdfWriter(filename);
final PdfDocument pdfDocument = new PdfDocument(pdfWriter);
final Document pdf = new Document(pdfDocument);
pdf.add(
new Paragraph(text)
.setFontScript(Character.UnicodeScript.HEBREW)
.addStyle(hebrewStyle)
);
pdf.close();
System.out.println("The document '" + filename + "' has been created.");
}
and after that, I tried to open this document using pdfbox util and I got the following data:
but I got an unexpected result in the Contents:stream section especially Tj tag. I expected string like the following 05E905D305D2 but I got 02b902a302a2. I tried to convert this hex string to normal string and I got the following result: ʹʣʢ but I expected that string שדג.
What do I wrong? Hot to convert this 02b902a302a2 string and get שדג?
This answer writes in a comment #usr2564301. Thanks for the help!
The numbers you get are not Unicode characters but font indexes instead. (Check how the font is embedded!) The text in a PDF does not specifically care about Unicode – it may or may not be this. Good PDF creators add a /ToUnicode table to help decoding, but it's optional.
We have a method to check if a checkbox in a PDF (No forms) is checked or not and it works great on one company's PDF. But on another, there is no way to tell if the checkbox is checked or not.
Here is the code that works on one company's PDF
protected static final String[] HOLLOW_CHECKBOX = {"\uF06F", "\u0086"};
protected static final String[] FILLED_CHECKBOX = {"\uF06E", "\u0084"};
protected boolean isBoxChecked(String boxLabel, String content) {
content = content.trim();
for (String checkCharacter : FILLED_CHECKBOX) {
String option = String.format("%s %s", checkCharacter, boxLabel);
String option2 = String.format("%s%s", checkCharacter, boxLabel);
if (content.contains(option) || content.contains("\u0084 ") || content.contains(option2)) {
return true;
}
}
return false;
}
However, when I do the same for another company's PDF there is nothing in the extracted text near the checkbox to tell us if it is checked or not.
The big issue is we have no XML Schema, no Metadata, and no forms on these PDFs, it is just raw String, so you can see a checkbox is difficult to have in a String, but that is all we have. Here is code example of pulling the String in the PDF from a page to some other page, all the text in between
protected String getTextFromPages(int startPage, int endPage, PDDocument document) throws IOException {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(startPage);
stripper.setEndPage(endPage);
return stripper.getText(document);
}
I wish the pdfs had an easier way to extract the text/data, but these vendors that make the PDFs decided it was better to keep that out of them.
No we cannot have the vendor's/ other companies change anything, we receive these PDFs from the courts system that have been submitted by lawyers that we don't know and that the lawyers bought the PDF software that generates these files.
We also cannot do it the even longer way of trying to figure out the object model that PDFBox creates of the document with things like
o.apache.pdfbox.util.PDFStreamEngine - processing substream token: PDFOperator{Tf}
because these are 80-100 page PDFs and would take us years just to code to parse one vendor's format.
I'm going in trouble using PDFBox 2.0.0-RC3 and producing a digital signature field into a PDF.
This is the piece of code i use:
public static void main(String[] args) throws IOException, URISyntaxException
{
PDDocument document;
document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
PDSignatureField signatureBox = new PDSignatureField(acroForm);
signatureBox.setPartialName("ENSGN-MY_SIGNATURE_FIELD-001");
acroForm.getFields().add(signatureBox);
PDAnnotationWidget widget = signatureBox.getWidgets().get(0);
PDRectangle rect = new PDRectangle();
rect.setLowerLeftX(50);
rect.setLowerLeftY(750);
rect.setUpperRightX(250);
rect.setUpperRightY(800);
widget.setRectangle(rect);
page.getAnnotations().add(widget);
try {
document.save("/tmp/mySignatureFieldGEN_PDFBOX.pdf");
document.close();
} catch (Exception io) {
System.out.println(io);
}
}
The code generates a pdf document, i open it with Acrobat Reader and this is the result:
PDF BOX Generated
As you can see, the signature panel on the left is void but the signature field on the left is present and works.
I generate the same PDF with PDFTron. This is the result:
PDF Tron Generated
In this case the signature panel on the left show correctly the presence of the signature field.
I would like to obtain this second case (correct) but i don't understand why PDF Box can do this.
Many thanks
add this:
widget.setPage(page);
This sets the /P entry.
Now the panel on the left appears. How did I get the idea? I got a document with such an empty signature field (from here), and compared it with yours with PDFDebugger.
Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions because the ParseContentMethod tries to parse inline images? The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream but I have a PDF file failing to extract text because of inline images. It returns an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then it finally fails with and InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file so I assume this failure is because of the /DCTDECODE exception. But again, I don't care about images, I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
try {
IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
PdfReader.DecodeBytes(samples, imageDictionary, handlers);
return true;
} catch (IOException e) {
return false;
}
}
public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
{
return b;
}
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5
I want to return the Document object from below code.
At present I get a document has no pages exception.
private static Document GeneratePdfAcroFields(PdfReader reader, Document docReturn)
{
if (File.Exists(System.Configuration.ConfigurationSettings.AppSettings["TEMP_PDF"]))
File.Delete(System.Configuration.ConfigurationSettings.AppSettings["TEMP_PDF"]);
PdfStamper stamper = new PdfStamper(reader, new FileStream(System.Configuration.ConfigurationSettings.AppSettings["TEMP_PDF"],FileMode.Create));
AcroFields form = stamper.AcroFields;
///INSERTING TEXT DYNAMICALLY JUST FOR EXAMPLE.
form.SetField("topmostSubform[0].Page16[0].topmostSubform_0_\\.Page78_0_\\.TextField3_9_[0]", "This value was dynamically added.");
stamper.FormFlattening = false;
stamper.Close();
FileStream fsRead = new FileStream(System.Configuration.ConfigurationSettings.AppSettings["TEMP_PDF"], FileMode.Open);
Document docret = new Document(reader.GetPageSizeWithRotation(1));
return docret;
}
Thanks Chris.
Just to reiterate what #BrunoLowagie is saying, passing Document objects around almost never makes >sense. Despite what the name might sound like, a Document doesn't represent a PDF in any way. Calling >ToString() or GetBytes() (if that method actually existed) wouldn't get you a PDF. A Document is just a >one-way funnel for passing human-friendly commands over to an engine that actually writes raw PDF >tokens. The engine, however, is also not even a PDF. The only thing that truly is a PDF is the raw >bytes of the stream that is being written to. – Chris Haas