I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, some of the PDFs I an scraping have text that is in slightly different places from page to page. I'm looking for help in how to deal with this.
In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.
Is there a way to do this?
Related
i have a question about search text in a PDF file in attach here:
pdf shared link google drive.
If I search text example "1500" , I see 4 occurences but there are only 2 occurenes in page 2.....the same if I search text "musei" find 2 occurrences, but this text is only in page 1.
The research parse the single page and find all document text in every single page, because I have double results.
Can anyone explain why happen this?
Did this PDF file generated in a particular way respect other where searching text is ok?
Thanks a lot
That PDF is indeed special, each page contains the text of both pages. On the first page the text from the second page is right of the right page border, and on the second page the text from the first page is left of the left page border. Furthermore, the contents of the respectively other page are additionally outside the clip area.
I enlarged the page boxes (media box, crop box, ...) of the first page to the right and of the second page to the left, and then marked all text (Ctrl-A) to show even the text outside the clip area, and you see:
For text extraction that only extracts the text in the visible areas, you should restrict your text extraction routine to the crop box of the respective page.
When using PDFBox to populate a text field in a form in a PDF, it is possible that the text overflows the text field and is not visible when opening the PDF in a viewer.
Question: Is it possible to use PDFBox to detect how much text within a text field is visible?
At the risk of falling victim to an XY problem, here is the context in which this came up.
I have a PDF which is provided by the Danish government, and the software I am creating needs to be able to fill out this form programmatically. On pages 5 and 6 of this document, there is a large blank area that needs to be filled out. The way the PDF creators designed it, they just made two text fields (named Text57 and Text58), which a person directly filling out the form would manually need to jump between.
The problem is, I need to be able to populate these fields with text, and if the text is too large to fit in the first text field, then it needs to overflow into the second text field. However, I do not seem to have any way of actually detecting when the text overflows in the first text field.
One workaround which could be acceptable, would be if I could modify the document to remove the second text field, and just have the first text field span multiple pages, but while playing around in Acrobat, this does not seem to be possible.
The PDF in question can be found here: https://www.trafikstyrelsen.dk/~/media/Dokumenter/10%20Bolig/Bolig/Private%20lejeboliger/Lejekontrakt/typeformular-a.pdf
Here is a code snippet which populates the problematic field with 100 lines numbered from 1 to 100.
PDDocument document = PDDocument.load(new File("typeformular-a.pdf"));
PDField text57 = document.getDocumentCatalog().getAcroForm().getField("Text57");
text57.setValue(IntStream.range(1, 101).mapToObj(Integer::toString)
.collect(Collectors.joining(System.lineSeparator())));
document.save("typeformular-a.out.pdf");
After the code is run, we can see that the text gets cut off after line 44. Of course I cannot simply count lines in my text, because under normal circumstances the lines in the text will wrap, which would invalidate that approach.
Auxiliary question: Is there any other approach that could solve this original problem of splitting text across multiple pages?
I have a PDF file that contains several annotations.If you notice the image there are several boxes in Yellow and Beige. These boxes can be edited in Adobe Reader. Could anyone help me find-out the total number of these boxes present in the pdf file using VBA?
Also, I tried converting the pdf to word using vba, but those boxes weren't present in the word file; so it didn't work out.
Here is the pdf file: https://drive.google.com/file/d/0B7uN4B3mxUlZMjB1T3BuM0o1VGs/view?usp=sharing
The text in those boxes is always blue while other text is black. Maybe that could be used.
Another way would be to use pdfseparate from http://www.foolabs.com/xpdf/download.html, and count how often the string <</AP <</N occurs in the generated file.
Or you could convert the pdf to an image and then count the number of colored rectangles.
You could also use one of the commercial tools available for creating/editing pdfs e.g. http://www.pdflib.com/, I believe that one supports VBA.
I have a visual basic program that extracts text from a PDF and imports the text into excel. It relies on reading the text like a human, reading left to right across the page. However, there are instances on this particular PDF where if I go to select the text with my mouse, I click and drag straight across but Adobe starts to select/highlight words on the above and below lines before continuing to highlight across the page. This gives me data that I do not want/need. The page has renderable text and is not from a scanned document.
Is there a way to "reset" the way Adobe interprets the text on the PDF? Since the information on the left is far from the information on the right, it treats them almost like separate columns.
I've tried saving the PDF in different formats such as a txt or postscript and distilling to another PDF but they all seem to result in the same outcome. This is weird to me because I have other similar PDFs where this isn't an issue.
Any help or thoughts would be greatly appreciated, thanks.
As PDF (in its basic form) essentially means placing strings on a canvas, the concept of "sentence" or "reading order" is not built in.
In order to extract text, you would have to read out the bounding box of the piece of text, and then use some logic and heuristics to assemble your text based on the coordinates of the bounding box.
Things can be easier if the PDF is a structured PDF, where the text contents is embedded as text in the document. This is also the prime requirement for an accessible document. So, if your document is accessible, you can rely on the structure for the correct reading order.
Using iTextSharp or some other library, is it possible to determine if a given part of the text is visible?
Using other word, is it possible to extract (to simple text) only the visible parts?
For example, if a given text segment has its font color white, then that text segment would be invisible when printing or displaying on screen.
My goal is to find a way to extract only the text that would be visible when printing that document!