How to detect visible text in a text field in a PDF?

When using PDFBox to populate a text field in a form in a PDF, it is possible that the text overflows the text field and is not visible when opening the PDF in a viewer.
Question: Is it possible to use PDFBox to detect how much text within a text field is visible?
At the risk of falling victim to an XY problem, here is the context in which this came up.
I have a PDF which is provided by the Danish government, and the software I am creating needs to be able to fill out this form programmatically. On pages 5 and 6 of this document, there is a large blank area that needs to be filled out. The way the PDF creators designed it, they just made two text fields (named Text57 and Text58), which a person directly filling out the form would manually need to jump between.
The problem is, I need to be able to populate these fields with text, and if the text is too large to fit in the first text field, then it needs to overflow into the second text field. However, I do not seem to have any way of actually detecting when the text overflows in the first text field.
One acceptable workaround would be to modify the document, remove the second text field, and have the first text field span multiple pages. From playing around in Acrobat, though, this does not seem to be possible.
The PDF in question can be found here: https://www.trafikstyrelsen.dk/~/media/Dokumenter/10%20Bolig/Bolig/Private%20lejeboliger/Lejekontrakt/typeformular-a.pdf
Here is a code snippet which populates the problematic field with 100 lines numbered from 1 to 100.
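// Imports needed: java.io.File, java.util.stream.IntStream, java.util.stream.Collectors,
// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.pdmodel.interactive.form.PDField
try (PDDocument document = PDDocument.load(new File("typeformular-a.pdf"))) {
    PDField text57 = document.getDocumentCatalog().getAcroForm().getField("Text57");
    // Fill the field with the numbers 1 to 100, one per line.
    text57.setValue(IntStream.range(1, 101).mapToObj(Integer::toString)
            .collect(Collectors.joining(System.lineSeparator())));
    document.save("typeformular-a.out.pdf");
}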
After the code is run, we can see that the text gets cut off after line 44. Of course I cannot simply count lines in my text, because under normal circumstances the lines in the text will wrap, which would invalidate that approach.
Auxiliary question: Is there any other approach that could solve this original problem of splitting text across multiple pages?

Related

How to convert a multi-page PDF table to a spreadsheet format?

I have a huge PDF file with 300+ pages on which a big 10+ column table is spread. I am using Linux and would like to have a simple command line command which would convert this table to a text importable to a spreadsheet.
Currently I am using pdftotext -layout, which gives quite good results, except that every page is treated independently and column widths and positions change from page to page (due to the different maximum column content width on each page). As a result, I cannot simply import the resulting text file into a spreadsheet application and split it into columns by a fixed column width.
I have tried to crop every column on every page (their position is identical across the whole PDF file), but in the result the empty rows are merged together, so the rows with content will be shifted with respect to each other.
If pdftotext had an option to convert the file with a STRICT LAYOUT (not by column content width), that would help. Or if I could stack all pages in PDF file to a single page, that could also solve it.
What are the options to solve this problem?

You are misunderstanding the nature of the content of a PDF file. There are no tables in PDF, and there is (generally) no metadata to describe the content as a table. The text you see on the page may not be laid out in the reading order.
For example the PDF file might contain a line of text drawn at the top of the page, then one at the bottom, then a paragraph in the middle before jumping back up to the top for a headline.
In addition, there may be no spaces between two text fragments. Text is drawn at an absolute position on the page, so you can draw (for example) cell A, then move the current point by, say, 1 cm, then draw cell B, and so on. Since there are no 'space' characters between the two cells, a naive text extraction will, naturally, assume the two pieces of text are continuous.
The STRICT LAYOUT you want isn't impossible, but you can't do it with a simple text file, because the original layout isn't made up of simple text characters. Sometimes the space between two characters, or between two fragments of text, is produced by moving the current point before drawing the text.
Ghostscript's txtwrite device in its simplest mode attempts to replicate the layout by replacing the white space with actual space characters in a fixed pitch font. This 'might' be good enough for you, but it equally well might not. That's because it operates by defining the smallest distance used on the page as being one space character. All distances between text fragments are then replaced by as many space characters as are required to make up the space. This can (and often does) result in very wide output files with a lot of white space.
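To make the fixed pitch idea concrete, here is a minimal sketch in plain Java. It is not Ghostscript's code: the Fragment type, the render method and the unit parameter (the page distance that is taken to equal one space character) are all hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class FixedPitchLayout {
    /** Hypothetical fragment: a piece of text drawn at an absolute x position. */
    static final class Fragment {
        final double x;
        final String text;

        public Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Renders the fragments of one line (already sorted by x) as plain text.
     * Each fragment starts at column round(x / unit); gaps are padded with spaces.
     */
    public static String render(List<Fragment> fragments, double unit) {
        StringBuilder line = new StringBuilder();
        for (Fragment f : fragments) {
            int column = (int) Math.round(f.x / unit);
            // Keep at least one space between fragments so that
            // neighbouring cells never run together.
            if (line.length() > 0 && column <= line.length()) {
                column = line.length() + 1;
            }
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(f.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Fragment> row = Arrays.asList(
                new Fragment(0.0, "A"),
                new Fragment(28.35, "B"),  // about 1 cm further right, in points
                new Fragment(56.7, "C"));
        // unit = 7.0 is an assumed width of one space character
        System.out.println(render(row, 7.0));
    }
}
```

The smaller the unit is compared to the gaps on the page, the more space characters each gap turns into, which is where the very wide output comes from.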
Essentially what you seem to want isn't really possible, you can't take a rich format like PDF and replicate it, including the layout, with nothing more than text characters.

Forcing screen-reader read alt-text on a text element when creating a PDF in iText7

Is there any way to add alt text to a text element in iText? I have seen there is a way to do it for images. Basically, I would like the screen reader to read something besides the actual text that is being displayed. There are two situations in my document that I would need to do this.
One is when the screen reader is reading an acronym: I would like the alt-text to force the screen reader to read each letter instead of trying to read a word (i.e., read DIET as D-I-E-T instead of "diet").
The second is when it is reading a phone number: I would like it to read out loud "phone" before the number. In the document it is currently just the number, which would be a little confusing for disabled users. I am unable to actually change the layout to include the word "phone" for non-technical reasons.

There is a method for that.
new Paragraph("Lorem").getAccessibilityProperties().setActualText("Ipsum")
You can call this method on every class that implements IAccessibleElement.

PDFBox retrieve text from overlapping boxes

I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, some of the PDFs I am scraping have text that is in slightly different places from page to page. I'm looking for help in how to deal with this.
In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.
Is there a way to do this?

Determine the Text that can Display in Multiline PDTextField

Is there a way to determine the text that will actually display in a PDTextField when the PDF prints? If I call setValue and then getValue, it returns all of the text even though it will not all display.
I am trying to fill out a form with a limited size multiline text field that has the notation to attach another page for more details. I would like to limit the text to that which will display and generate the added detail page.
Thanks for indulging a PDFbox newbie.

There is no direct way to find that out, as the details of the text layout, such as line breaks, padding and line spacing, are hidden inside the non-public class PlainTextFormatter inside the org.apache.pdfbox.pdmodel.interactive.form package. So you'd need to replicate that code.
PDFBox tries to resemble the calculations done by Adobe Acrobat and Adobe Reader but the details of such calculations are not part of the PDF specification. So doing your calculation is only valid for a similar layout model. Other form filling applications might have a slightly different layout model and as a result your results will not apply to these.
In addition to that, Acrobat (and PDFBox) place text even though it might be partially clipped. Look at the results of the AlignmentTest.java unit test to see what I mean. So one might have a different expectation of what 'fitting' really means.
As I've thought about passing the information about which text fitted back to the calling application anyway I've opened an enhancement request https://issues.apache.org/jira/browse/PDFBOX-3413 for that.
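Until something like that exists, a rough estimate can be computed by hand. The following is only a sketch of a greedy word wrap in plain Java; it does not replicate PlainTextFormatter, and it ignores padding and the exact line spacing a viewer uses. The class and method names are made up. The width function is passed in, so with PDFBox one could plug in a measure based on PDFont.getStringWidth(text) / 1000 * fontSize (that method throws IOException, so it needs wrapping); the demo uses a stand-in where every character has the same width.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

class FieldFitEstimator {
    /** Greedily wraps text into lines no wider than maxWidth, honoring explicit line breaks. */
    public static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> width) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : text.split("\\R", -1)) {
            StringBuilder current = new StringBuilder();
            for (String word : paragraph.split(" +")) {
                String candidate = current.length() == 0 ? word : current + " " + word;
                if (current.length() > 0 && width.applyAsDouble(candidate) > maxWidth) {
                    // The word does not fit any more: close the line and start a new one.
                    lines.add(current.toString());
                    current = new StringBuilder(word);
                } else {
                    current = new StringBuilder(candidate);
                }
            }
            lines.add(current.toString());
        }
        return lines;
    }

    /** Number of whole lines that fit into a field of the given height. */
    public static int visibleLineCount(double fieldHeight, double lineHeight) {
        return (int) Math.floor(fieldHeight / lineHeight);
    }

    public static void main(String[] args) {
        // Stand-in measure (assumption for the demo): every character is 5 units wide.
        ToDoubleFunction<String> measure = s -> s.length() * 5.0;
        List<String> lines = wrap("one two three four\nfive", 45.0, measure);
        int visible = Math.min(visibleLineCount(25.0, 12.0), lines.size());
        System.out.println("fits:     " + lines.subList(0, visible));
        System.out.println("overflow: " + lines.subList(visible, lines.size()));
    }
}
```

The lines reported as overflow are the part that could go on the attached detail page or into a second field. Since viewers differ in their layout model, it is probably wise to leave a safety margin of a line or so.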

Is there a way to change the order/way Acrobat selects text of a PDF?

I have a Visual Basic program that extracts text from a PDF and imports the text into Excel. It relies on reading the text like a human, left to right across the page. However, there are places in this particular PDF where, if I select the text with my mouse by clicking and dragging straight across, Adobe starts to select/highlight words on the lines above and below before continuing to highlight across the page. This gives me data that I do not want/need. The page has renderable text and is not from a scanned document.
Is there a way to "reset" the way Adobe interprets the text on the PDF? Since the information on the left is far from the information on the right, it treats them almost like separate columns.
I've tried saving the PDF in different formats such as a txt or postscript and distilling to another PDF but they all seem to result in the same outcome. This is weird to me because I have other similar PDFs where this isn't an issue.
Any help or thoughts would be greatly appreciated, thanks.

As PDF (in its basic form) essentially means placing strings on a canvas, the concept of "sentence" or "reading order" is not built in.
In order to extract text, you would have to read out the bounding box of the piece of text, and then use some logic and heuristics to assemble your text based on the coordinates of the bounding box.
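As a rough illustration of such a heuristic, here is a sketch in plain Java. The Fragment type and the tolerance parameter are made up for this example; a real extractor would get the coordinates from whatever PDF library is in use. Fragments whose baselines lie within the tolerance of each other are treated as one line, lines are ordered top to bottom, and fragments within a line left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {
    /** Hypothetical fragment: text plus its position in PDF coordinates (y grows upward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into lines by y (within tolerance) and joins each line left to right. */
    public static List<String> toLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Larger y is higher on the page, so sort by descending y.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= tolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Fragments in the order they might appear in the content stream.
        List<Fragment> page = Arrays.asList(
                new Fragment("Total", 50, 700),
                new Fragment("Headline", 50, 780),
                new Fragment("42.00", 400, 701.5),
                new Fragment("Footer", 50, 40));
        toLines(page, 3.0).forEach(System.out::println);
    }
}
```

With a tolerance of 3, "Total" and "42.00" end up on the same line even though they are far apart horizontally and their baselines differ slightly. Multi-column pages would need additional logic, since this sketch always reads straight across.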
Things can be easier if the PDF is a structured PDF, where the text content is embedded as text in the document. This is also the prime requirement for an accessible document. So, if your document is accessible, you can rely on the structure for the correct reading order.
