Which sequence of PDF operators is needed to set the background color of Tj and TJ contents?

I need to manipulate the content streams of a page so that, if the string shown by a Tj operator, or one of the elements of a TJ array, matches specific values, the background area of those glyphs changes color.
I think I would need to save the graphics state after creating two different string objects, then apply a fill operator, then restore the graphics state.
My question is: would the fill operator recognize the area of the matched string and fill just that?
Second: would I need to repeat this sequence for each element of the TJ array?

It's not quite that simple.
You have to determine the position of the text yourself by keeping track of the current transformation matrix for the whole page content stream and of the text matrix for the text object in which the text in question is drawn. Then insert instructions that outline that area as a path and fill it, just before the text object in question.
But this in particular means that it is not necessary to split the strings of the text drawing instructions to have the search text be drawn by itself.
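The bookkeeping described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not a complete implementation: the class TextBackground and its methods are hypothetical names, the width, ascent and descent values are assumed to be supplied by the caller from the font metrics (in glyph units divided by 1000), and horizontal scaling, character and word spacing, and text rise are ignored. Because the path is inserted just before the text object, the CTM in effect there applies to it as well, so only the text matrix is needed for the inserted operators; multiplying by the tracked CTM gives page coordinates instead.

```java
import java.util.Locale;

/** Sketch: build a filled quadrilateral to draw behind a run of text. */
final class TextBackground {

    /** PDF affine matrix [a b c d e f]; points are row vectors, so p' = p x M. */
    static final class Matrix {
        final double a, b, c, d, e, f;

        Matrix(double a, double b, double c, double d, double e, double f) {
            this.a = a; this.b = b; this.c = c; this.d = d; this.e = e; this.f = f;
        }

        /** Returns this x o: apply this matrix first, then o. */
        Matrix multiply(Matrix o) {
            return new Matrix(
                    a * o.a + b * o.c, a * o.b + b * o.d,
                    c * o.a + d * o.c, c * o.b + d * o.d,
                    e * o.a + f * o.c + o.e, e * o.b + f * o.d + o.f);
        }

        double[] transform(double x, double y) {
            return new double[] {x * a + y * c + e, x * b + y * d + f};
        }
    }

    /**
     * Corners of the text box after scaling by the font size and applying m.
     * width, ascent and descent are assumptions supplied by the caller
     * (glyph units / 1000, taken from the font metrics).
     */
    static double[][] corners(Matrix m, double fontSize,
                              double width, double ascent, double descent) {
        Matrix scaled = new Matrix(fontSize, 0, 0, fontSize, 0, 0).multiply(m);
        return new double[][] {
            scaled.transform(0, descent), scaled.transform(width, descent),
            scaled.transform(width, ascent), scaled.transform(0, ascent)};
    }

    /** Operators meant to be inserted just before the BT of the matched text. */
    static String backgroundOps(Matrix tm, double fontSize, double width,
                                double ascent, double descent,
                                double r, double g, double b) {
        double[][] p = corners(tm, fontSize, width, ascent, descent);
        // q/Q keep the fill color change from leaking into the following text.
        return String.format(Locale.ROOT,
                "q %.2f %.2f %.2f rg %.2f %.2f m %.2f %.2f l %.2f %.2f l %.2f %.2f l h f Q",
                r, g, b, p[0][0], p[0][1], p[1][0], p[1][1],
                p[2][0], p[2][1], p[3][0], p[3][1]);
    }

    public static void main(String[] args) {
        Matrix tm = new Matrix(1, 0, 0, 1, 100, 700); // text matrix at the matched string
        Matrix ctm = new Matrix(2, 0, 0, 2, 0, 0);    // CTM tracked up to the text object
        System.out.println(backgroundOps(tm, 12, 2.5, 0.8, -0.2, 1, 1, 0));
        // Page coordinates (e.g. for a markup annotation): apply the CTM as well.
        double[] lowerLeft = corners(tm.multiply(ctm), 12, 2.5, 0.8, -0.2)[0];
        System.out.println(lowerLeft[0] + " " + lowerLeft[1]);
    }
}
```

Emitting the four corners with m/l/h rather than a single re keeps the filled area aligned with the text when the text matrix contains a rotation.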
By the way, if this change of background is meant to represent something like a text marker highlight, an alternative to changing the page content would be to create text markup annotations for the determined coordinates. That way you merely have to parse the page content stream for the coordinates; you don't have to change it. This may simplify the code, in particular if the text drawing instructions may also be in a form XObject referenced from the page content instead of in the page content itself.

Related

qpdf - replace text in existing PDF file

This is the first time I'm working with PDFs on this level, so please be patient with
my noob question. I understand the logical and physical structure of a PDF file
on a basic level.
I have a PDF that contains a dummy ID that needs to be replaced. To check whether there
is a way to do this, I used qpdf to expand the PDF using
qpdf --qdf --object-streams=disable orig.pdf expanded.pdf
Using a hex editor I located the dummy ID in expanded.pdf and changed the value by
simply swapping two digits
<001800180017> Tj => <001700170018> Tj
and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original
ID 443 is still rendered, but searching for "443" doesn't find it. When searching for
"334", the modified content, I get the rendered original ID 443 highlighted.
The PDF consists of text and vector graphics. When I insert additional digits (which obviously
invalidates the offsets in the xref), I get an error message regarding a missing font and
all digits are shown as dots, but the vector graphics are still in place. This seems to indicate
that the ID is not part of the graphics.
What did I miss?
EDIT 1:
After mkl's comment, I did a deeper analysis of my PDF and found that, besides the obvious graphic content, all text was rendered by a series of m/l/c commands followed by a BT/ET section. The stroking and non-stroking colors were both 0,0,0 in the BT/ET section.
Is this because of the used embedded non-standard font?
Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?
Looking back, I wonder what I did to get the dots when I first modified the
content. It seems impossible and I can't reproduce it either.
Thanks
Tom

First off, the following is merely guesswork, as you could not share the PDF in question. Educated guesswork, but guesswork nonetheless.
You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.
Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.
The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands) is in PDFs whose producer didn't want text extraction to be possible and therefore replaced the text drawing instructions with arbitrary vector graphics instructions.
This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.
Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.
Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.

How to detect visible text in a text field in a PDF?

When using PDFBox to populate a text field in a form in a PDF, it is possible that the text overflows the text field and is not visible when opening the PDF in a viewer.
Question: Is it possible to use PDFBox to detect how much text within a text field is visible?
At the risk of falling victim to an XY problem, here is the context in which this came up.
I have a PDF which is provided by the Danish government, and the software I am creating needs to be able to fill out this form programmatically. On pages 5 and 6 of this document, there is a large blank area that needs to be filled out. The way the PDF creators designed it, they just made two text fields (named Text57 and Text58), which a person directly filling out the form would manually need to jump between.
The problem is, I need to be able to populate these fields with text, and if the text is too large to fit in the first text field, then it needs to overflow into the second text field. However, I do not seem to have any way of actually detecting when the text overflows in the first text field.
One workaround which could be acceptable, would be if I could modify the document to remove the second text field, and just have the first text field span multiple pages, but while playing around in Acrobat, this does not seem to be possible.
The PDF in question can be found here: https://www.trafikstyrelsen.dk/~/media/Dokumenter/10%20Bolig/Bolig/Private%20lejeboliger/Lejekontrakt/typeformular-a.pdf
Here is a code snippet which populates the problematic field with 100 lines numbered from 1 to 100.
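import java.io.File;
import java.util.stream.*;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
PDDocument document = PDDocument.load(new File("typeformular-a.pdf"));
PDField text57 = document.getDocumentCatalog().getAcroForm().getField("Text57");
text57.setValue(IntStream.range(1, 101).mapToObj(Integer::toString)
        .collect(Collectors.joining(System.lineSeparator())));
document.save("typeformular-a.out.pdf");
document.close();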
After the code is run, we can see that the text gets cut off after line 44. Of course I cannot simply count lines in my text, because under normal circumstances the lines in the text will wrap, which would invalidate that approach.
Auxiliary question: Is there any other approach that could solve this original problem of splitting text across multiple pages?

Determine the Text that can Display in Multiline PDTextField

Is there a way to determine the text that will actually display in a PDTextField when the PDF prints? If I call setValue and then getValue, it returns all of the text even though it will not all display.
I am trying to fill out a form with a limited size multiline text field that has the notation to attach another page for more details. I would like to limit the text to that which will display and generate the added detail page.
Thanks for indulging a PDFbox newbie.

There is no direct way to find that out, as the details of the text layout such as line breaks, padding, and line spacing are hidden inside the non-public class PlainTextFormatter in the org.apache.pdfbox.pdmodel.interactive.form package. So you'd need to replicate that code.
PDFBox tries to resemble the calculations done by Adobe Acrobat and Adobe Reader, but the details of such calculations are not part of the PDF specification. So your own calculation is only valid for a similar layout model. Other form filling applications might have a slightly different layout model, and as a result your results will not apply to them.
In addition to that, Acrobat (and PDFBox) place text even though it might be partially clipped. Look at the results of the AlignmentTest.java unit test to see what I mean. So one might have a different expectation of what 'fitting' really means.
As I've thought about passing the information about which text fitted back to the calling application anyway I've opened an enhancement request https://issues.apache.org/jira/browse/PDFBOX-3413 for that.
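To give an idea of what replicating such a layout calculation involves, here is a rough sketch in plain Java. It is not PDFBox's PlainTextFormatter algorithm: the class FieldFitEstimator, the greedy word wrap, and the padding and leading parameters are all assumptions for illustration, and the width function must be supplied by the caller (with PDFBox it could be based on PDFont.getStringWidth, which reports widths in 1/1000 text units). Since viewers may lay text out differently, the result is an estimate only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch: estimate which part of a text fits into a multiline field. */
final class FieldFitEstimator {

    /** Greedy word wrap: hard breaks at newlines, soft breaks between words. */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : text.split("\\R", -1)) {
            StringBuilder line = new StringBuilder();
            for (String word : paragraph.split(" +")) {
                if (word.isEmpty()) {
                    continue;
                }
                String candidate = line.length() == 0 ? word : line + " " + word;
                if (line.length() > 0 && widthOf.applyAsDouble(candidate) > maxWidth) {
                    lines.add(line.toString());      // current line is full
                    line = new StringBuilder(word);  // word starts the next line
                } else {
                    line = new StringBuilder(candidate);
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    /** Number of whole lines that fit; padding and leading are assumed values. */
    static int visibleLineCount(double boxHeight, double padding, double leading) {
        return Math.max(0, (int) Math.floor((boxHeight - 2 * padding) / leading));
    }

    /** Returns {visible part, overflow}; soft wraps become explicit line breaks. */
    static String[] splitAtCapacity(List<String> lines, int capacity) {
        int cut = Math.min(Math.max(capacity, 0), lines.size());
        return new String[] {
            String.join("\n", lines.subList(0, cut)),
            String.join("\n", lines.subList(cut, lines.size()))};
    }

    public static void main(String[] args) {
        // Hypothetical width function: every character is 5 units wide.
        ToDoubleFunction<String> widthOf = s -> s.length() * 5.0;
        List<String> lines = wrap("aaa bbb ccc ddd\neee", 50, widthOf);
        String[] parts = splitAtCapacity(lines, visibleLineCount(30, 2, 12));
        System.out.println("first field:\n" + parts[0]);
        System.out.println("overflow:\n" + parts[1]);
    }
}
```

The overflow part could then go into a second field or onto an added detail page. It seems prudent to leave a safety margin of a line or so, because the real field layout may break lines differently.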

PDF itext TOC generation

I have to merge multiple PDF documents into a single PDF document. Besides this, I have to generate TOC. The original documents will contain text with a specific style (say H1). This special text becomes part of TOC.
I have used iText for merging multiple PDF files, but I am unable to find an example or API for parsing the document to find all the content having style H1.
Generating the TOC is the next challenge.

You don't. PDFs don't have styles. They have a "current graphics state", which includes:
current transformation matrix (CTM)
stroke & fill colors
clipping path
font & size
gobs of other text state stuff (char spacing, word spacing, leading, text render mode...), including a separate text transformation matrix which is combined with the CTM
So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is, and latch on to all the text that is that size on screen, taking the CTM, text matrix, and font size into account (which iText will do for you again, IIRC).
And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.
You'll need to write a TextRenderListener that determines the final size of a given piece of text (and whether or not it's a part of the last piece) and filters out all the stuff that's too small. You'll then build your TOC based on the text you find.
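The size filtering can be sketched independently of iText. In this minimal plain-Java illustration, HeadingFilter and Chunk are hypothetical names: Chunk stands in for whatever a render listener receives per piece of text and is not an iText class. The effective size is taken as the font size times the vertical stretch of the text matrix combined with the CTM, which matters because a producer may set a font size of 1 and do all the scaling in a matrix.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: pick heading candidates by their effective (rendered) text size. */
final class HeadingFilter {

    /** Hypothetical stand-in for the data a render listener sees per text piece. */
    static final class Chunk {
        final String text;
        final double fontSize; // size operand of the Tf operator
        final double[] tm;     // text matrix [a b c d e f]
        final double[] ctm;    // current transformation matrix [a b c d e f]

        Chunk(String text, double fontSize, double[] tm, double[] ctm) {
            this.text = text;
            this.fontSize = fontSize;
            this.tm = tm;
            this.ctm = ctm;
        }
    }

    /** Font size times the length of the unit y-vector mapped through Tm x CTM. */
    static double effectiveSize(Chunk ch) {
        double c = ch.tm[2] * ch.ctm[0] + ch.tm[3] * ch.ctm[2];
        double d = ch.tm[2] * ch.ctm[1] + ch.tm[3] * ch.ctm[3];
        return ch.fontSize * Math.hypot(c, d);
    }

    /** Keeps the text of every chunk rendered at minSize or larger. */
    static List<String> headings(List<Chunk> chunks, double minSize) {
        List<String> result = new ArrayList<>();
        for (Chunk ch : chunks) {
            if (effectiveSize(ch) >= minSize) {
                result.add(ch.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] identity = {1, 0, 0, 1, 0, 0};
        List<Chunk> chunks = new ArrayList<>();
        // Font size 1, all scaling done in the text matrix.
        chunks.add(new Chunk("Chapter 1", 1, new double[] {24, 0, 0, 24, 72, 700}, identity));
        chunks.add(new Chunk("body text", 12, identity, identity));
        // Font size 12, doubled by the CTM.
        chunks.add(new Chunk("Chapter 2", 12, identity, new double[] {2, 0, 0, 2, 0, 0}));
        System.out.println(headings(chunks, 20));
    }
}
```

The threshold (20 here) is an arbitrary example value; in practice it would have to be derived from the documents being merged.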

Adjust size of File Upload Button

I want to adjust the size of the "Browse" section seen in the file upload button in HTML. When I try to adjust the size using the "size" or "width" attributes, only the size of the whole element is reduced. But I want only the "Browse" button part to be reduced, not the textbox part which displays the file path. Can I do this without using CSS? If yes, how?
The file input element is notoriously difficult to style. One of the problems is that it's really a single element, even though it renders as two elements.
One approach is to obscure the entire element behind the scenes and present the user with custom elements instead. Here's an article about it. Basically the file input element is hidden and some custom elements backed by some JavaScript are handling the UI and passing the necessary information to the file input.
It is very difficult to change the appearance of the Browse button as it is typically hardwired into the browser.
However, the Quirksmode.org article "Styling an input type="file"" is a long post that discusses complex CSS techniques for changing the appearance of file input elements.