Programmatically determine text visibility in PDF document


Using iTextSharp or some other library, is it possible to determine whether a given part of the text is visible?
In other words, is it possible to extract (as plain text) only the visible parts?
For example, if a text segment has a white font color, it would be invisible on a white page when printed or displayed on screen.
My goal is to find a way to extract only the text that would be visible when printing the document.
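To make the white-text example concrete, here is a rough sketch of the kind of check involved, assuming the text rendering mode and the fill and stroke colors of a text segment have already been obtained from a parser. `TextVisibility` and `isLikelyVisible` are hypothetical names, not part of iTextSharp or any other library, and the page background is an assumption passed in by the caller. The sketch ignores clipping, page boxes, and objects painted over the text.

```java
import java.awt.Color;

/** Hypothetical helper: guesses whether a text segment paints anything visible. */
final class TextVisibility {

    /**
     * @param renderMode text rendering mode (0..7) as defined by the PDF specification
     * @param fill       fill color in effect for the segment
     * @param stroke     stroke color in effect for the segment
     * @param background assumed page background color (an assumption, often white)
     */
    static boolean isLikelyVisible(int renderMode, Color fill, Color stroke, Color background) {
        // Modes 0, 2, 4 and 6 fill the glyphs; modes 1, 2, 5 and 6 stroke them.
        // Mode 3 paints nothing, and mode 7 only adds the glyphs to the clipping path.
        boolean paintsFill = renderMode == 0 || renderMode == 2
                || renderMode == 4 || renderMode == 6;
        boolean paintsStroke = renderMode == 1 || renderMode == 2
                || renderMode == 5 || renderMode == 6;
        boolean fillShows = paintsFill && !fill.equals(background);
        boolean strokeShows = paintsStroke && !stroke.equals(background);
        return fillShows || strokeShows;
    }

    public static void main(String[] args) {
        // White filled text on a white page: prints false
        System.out.println(isLikelyVisible(0, Color.WHITE, Color.BLACK, Color.WHITE));
        // Black filled text on a white page: prints true
        System.out.println(isLikelyVisible(0, Color.BLACK, Color.BLACK, Color.WHITE));
    }
}
```

Comparing colors for exact equality is the simplest possible choice; near-white text would need a tolerance.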

Related

Search for text in a PDF - double results

I have a question about searching for text in the PDF file linked here:
PDF shared via a Google Drive link.
If I search for the text "1500", for example, I see 4 occurrences, but there are only 2 occurrences, both on page 2. Likewise, if I search for the text "musei", I find 2 occurrences, but this text appears only on page 1.
The search seems to parse each page and find the text of the whole document on every single page, which is why I get double results.
Can anyone explain why this happens?
Was this PDF file generated in some particular way, compared with other files where text search works fine?
Thanks a lot

That PDF is indeed special: each page contains the text of both pages. On the first page, the text from the second page lies to the right of the right page border, and on the second page, the text from the first page lies to the left of the left page border. Furthermore, the contents belonging to the other page also lie outside the clip area.
I enlarged the page boxes (media box, crop box, ...) of the first page to the right and of the second page to the left, and then selected all text (Ctrl-A) to show even the text outside the clip area; the text of the other page then becomes visible next to the page's own text.
For text extraction that returns only the text in the visible areas, you should restrict your text extraction routine to the crop box of the respective page.
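The idea of restricting extraction to the crop box can be sketched without any PDF library, assuming each piece of text has already been extracted together with its bounding box in page coordinates. `CropBoxFilter` and `Chunk` are hypothetical names, and the coordinates in `main` are made up for illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CropBoxFilter {

    /** Hypothetical container: a piece of text and its bounding box on the page. */
    static final class Chunk {
        final String text;
        final Rectangle2D bounds;

        Chunk(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /** Keeps only the chunks whose bounding box overlaps the crop box. */
    static List<String> visibleText(List<Chunk> chunks, Rectangle2D cropBox) {
        List<String> result = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (cropBox.intersects(chunk.bounds)) {
                result.add(chunk.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed crop box: an A4 page measured in points.
        Rectangle2D cropBox = new Rectangle2D.Double(0, 0, 595, 842);
        List<Chunk> chunks = Arrays.asList(
                new Chunk("musei", new Rectangle2D.Double(72, 700, 40, 12)),
                // This chunk sits to the right of the right page border.
                new Chunk("1500", new Rectangle2D.Double(667, 700, 30, 12)));
        System.out.println(visibleText(chunks, cropBox)); // prints [musei]
    }
}
```

Using `intersects` keeps text that is only partly inside the crop box; `cropBox.contains(chunk.bounds)` would be the stricter alternative.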

How to detect visible text in a text field in a PDF?

When using PDFBox to populate a text field in a form in a PDF, it is possible that the text overflows the text field and is not visible when opening the PDF in a viewer.
Question: Is it possible to use PDFBox to detect how much text within a text field is visible?
At the risk of falling victim to an XY problem, here is the context in which this came up.
I have a PDF which is provided by the Danish government, and the software I am creating needs to be able to fill out this form programmatically. On pages 5 and 6 of this document, there is a large blank area that needs to be filled out. The way the PDF creators designed it, they just made two text fields (named Text57 and Text58), which a person directly filling out the form would manually need to jump between.
The problem is, I need to be able to populate these fields with text, and if the text is too large to fit in the first text field, then it needs to overflow into the second text field. However, I do not seem to have any way of actually detecting when the text overflows in the first text field.
One acceptable workaround would be to modify the document to remove the second text field and have the first text field span multiple pages, but from playing around in Acrobat, this does not seem to be possible.
The PDF in question can be found here: https://www.trafikstyrelsen.dk/~/media/Dokumenter/10%20Bolig/Bolig/Private%20lejeboliger/Lejekontrakt/typeformular-a.pdf
Here is a code snippet which populates the problematic field with 100 lines numbered from 1 to 100.
PDDocument document = PDDocument.load(new File("typeformular-a.pdf"));
PDField text57 = document.getDocumentCatalog().getAcroForm().getField("Text57");
// Fill the field with the numbers 1 to 100, one per line.
text57.setValue(IntStream.rangeClosed(1, 100).mapToObj(Integer::toString)
        .collect(Collectors.joining(System.lineSeparator())));
document.save("typeformular-a.out.pdf");
document.close();
After the code is run, we can see that the text gets cut off after line 44. Of course I cannot simply count lines in my text, because under normal circumstances the lines in the text will wrap, which would invalidate that approach.
Auxiliary question: Is there any other approach that could solve this original problem of splitting text across multiple pages?
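One possible direction, sketched here under loud assumptions, is to estimate the wrapping yourself: wrap the text to the field width, count how many lines fit in the field height, and put the remainder into the second field. `FieldOverflow`, `wrap`, and `split` are hypothetical names. The width function is supplied by the caller (with PDFBox it could be based on `PDFont.getStringWidth`, which reports widths in 1/1000 of a text space unit), and the field size and leading are assumed to be known. A viewer's actual layout depends on the field's appearance settings, so this is only an approximation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class FieldOverflow {

    /** Greedy word wrap. A single word wider than maxWidth still gets its own line. */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> width) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : text.split("\\R", -1)) {
            StringBuilder line = new StringBuilder();
            for (String word : paragraph.split(" ")) {
                String candidate = line.length() == 0 ? word : line + " " + word;
                if (line.length() > 0 && width.applyAsDouble(candidate) > maxWidth) {
                    lines.add(line.toString()); // current line is full, start a new one
                    line = new StringBuilder(word);
                } else {
                    line = new StringBuilder(candidate);
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    /** Returns {text for the first field, remaining text for the second field}. */
    static String[] split(String text, double fieldWidth, double fieldHeight,
                          double leading, ToDoubleFunction<String> width) {
        List<String> lines = wrap(text, fieldWidth, width);
        int fit = Math.min((int) Math.floor(fieldHeight / leading), lines.size());
        String first = String.join("\n", lines.subList(0, fit));
        String rest = String.join("\n", lines.subList(fit, lines.size()));
        return new String[] { first, rest };
    }

    public static void main(String[] args) {
        // Assumed metric for the demo: every character is 5 units wide.
        ToDoubleFunction<String> width = s -> s.length() * 5.0;
        String[] parts = split("1\n2\n3\n4\n5", 100, 30, 10, width);
        System.out.println(parts[0].replace("\n", ",")); // prints 1,2,3
        System.out.println(parts[1].replace("\n", ",")); // prints 4,5
    }
}
```

Note that the sketch turns soft wraps into hard line breaks when it joins the lines again, which may or may not be acceptable for the form in question.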

PDFBox retrieve text from overlapping boxes

I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, some of the PDFs I am scraping have text that is in slightly different places from page to page. I'm looking for help with how to deal with this.
In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.
Is there a way to do this?

QML formatting text in TextEdit

I'm new to QML, and I'm trying to create a text editor in which you can format text: bold, italic, underline, justification, etc. Basically, I want it to act as a general text editor (like LibreOffice Writer or similar).
The next step is to convert the formatted text in the TextEdit field to HTML code, so that if text in the field is bold, <B>...</B> is added around it, etc.
I managed to create this kind of editor in GDK using a text buffer and tags, but I don't know where to start with QML.

You can still use tags for formatting text inside a TextEdit Element, as shown in the documentation.

PDF itext TOC generation

I have to merge multiple PDF documents into a single PDF document. Besides this, I have to generate a TOC. The original documents contain text with a specific style (say H1). This specially styled text becomes part of the TOC.
I have used iText for merging multiple PDF files, but I am unable to find an example or API for parsing a document to find all the content that has style H1.
Generating the TOC is the next challenge.

You don't. PDFs don't have styles. They have a "current graphics state", which includes:
current transformation matrix (CTM)
stroke & fill colors
clipping path
font & size
gobs of other text state stuff (char spacing, word spacing, leading, text render mode...), including a separate text transformation matrix which is combined with the CTM
So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is, and latch on to all the text that has that size on screen, taking the CTM, text matrix, and font size into account (which iText will do for you again, IIRC).
And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.
You'll need to write a TextRenderListener that determines the final size of a given piece of text (and whether or not it's a part of the last piece) and filters out all the stuff that's too small. You'll then build your TOC based on the text you find.
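To illustrate what "final size" means here, the sketch below combines a font size with a text matrix and a CTM using plain `java.awt.geom.AffineTransform` and measures how tall the text ends up. It does not use iText; `EffectiveFontSize`, `effectiveSize`, and `looksLikeHeading` are hypothetical names, and the matrices are assumed to have been captured by whatever listener tracks the graphics state.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class EffectiveFontSize {

    /** Size of the text after applying the text matrix first and the CTM second. */
    static double effectiveSize(double fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix); // points go through textMatrix, then through ctm
        // Transform a vertical vector of length fontSize, ignoring translation.
        Point2D up = combined.deltaTransform(new Point2D.Double(0, fontSize), null);
        return Math.hypot(up.getX(), up.getY());
    }

    /** True if the size is within the given tolerance of the assumed H1 size. */
    static boolean looksLikeHeading(double size, double h1Size, double tolerance) {
        return Math.abs(size - h1Size) <= tolerance;
    }

    public static void main(String[] args) {
        // A nominal size of 1 scaled to 24 by the text matrix, under an identity CTM.
        double size = effectiveSize(1, AffineTransform.getScaleInstance(24, 24),
                new AffineTransform());
        System.out.println(looksLikeHeading(size, 24, 0.5)); // prints true
    }
}
```

Measuring the length of a transformed vector rather than reading a scale factor directly keeps the result meaningful for rotated text.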
