PDFBox converting inches or centimeters into the coordinate system - pdfbox

I am new to PDFBox (and PDF generation) and I am having difficulties generating my own PDF.
I have text with certain coordinates in inches/centimeters and I need to convert them to the units PDFBox uses. Any suggestions/utilities that can do this automatically?
PDPageContentStream.moveTextPositionByAmount(x,y) is making no sense to me.

In general PDFBox uses the PDF user space coordinates when creating a PDF. This means:
The coordinates of a page are delimited by its CropBox defaulting to its MediaBox, the values increasing left to right and bottom to top. Thus, if you create a page using new PDPage() or new PDPage(PDPage.PAGE_SIZE_*) the origin of the coordinate system starts in the lower left corner of the page.
The unit in user space starts as the default user space unit which is defined by the UserUnit of the page. Most often (e.g. if you create a page using any of the PDPage constructors and don't explicitly change that value) it is not explicitly set and, therefore, its default kicks in which is 1⁄72 inch.
The user space coordinate system can be changed pretty arbitrarily by concatenating
some matrix to the current transformation matrix. The current transformation matrix starts as the identity matrix.
In PDFBox you do this using one of the PDPageContentStream.concatenate2CTM() overloads.
As soon as you switch to text mode using PDPageContentStream.beginText(), the coordinate system used is furthermore influenced by the transformation introduced by the text matrix.
In PDFBox you set the text matrix using one of the PDPageContentStream.setTextMatrix() overloads.
As you are new to PDFBox (as you say) and new to PDF in general (as I presume because otherwise you would likely have recognized the coordinates), I would advise you to initially refrain from using transformations wherever possible and, therefore, remain in a state where the coordinate system starts in the lower left, is neither rotated nor skewed, and has a unit length of 1/72 inch.
For this context you actually can use constants provided by PDFBox for conversion:
Multiply coordinates in inches by PDPage.DEFAULT_USER_SPACE_UNIT_DPI to get default user space coordinates.
Multiply coordinates in mm by PDPage.MM_TO_UNITS to get default user space coordinates.
If you want to have fun with coordinates, though, look at the PDF specification ISO-32000-1 and study the sections 8.3 Coordinate Systems and 9.4.4 Text Space Details.
The PDPage constants pointed to above used to be accessible in early PDFBox 1.8.x versions but then got hidden (private), and eventually were removed in the transition to PDFBox 2.x.
For reference, the constants were defined as
private static final int DEFAULT_USER_SPACE_UNIT_DPI = 72;
private static final float MM_TO_UNITS = 1/(10*2.54f)*DEFAULT_USER_SPACE_UNIT_DPI;
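Since those constants are no longer accessible in PDFBox 2.x, the same conversion is easy to reproduce by hand. The following is a minimal sketch, assuming the default user space unit of 1/72 inch and no transformations applied; the class PdfUnits and its method names are made up for illustration and are not part of PDFBox.

```java
// Hypothetical helper (not part of PDFBox): converts physical lengths to
// default user space units, assuming UserUnit is unset (1 unit = 1/72 inch)
// and the current transformation matrix is still the identity matrix.
final class PdfUnits {
    static final float UNITS_PER_INCH = 72f;
    static final float UNITS_PER_MM = UNITS_PER_INCH / 25.4f;

    private PdfUnits() {}

    static float fromInches(float inches) {
        return inches * UNITS_PER_INCH;
    }

    static float fromMillimeters(float mm) {
        return mm * UNITS_PER_MM;
    }

    static float fromCentimeters(float cm) {
        return fromMillimeters(cm * 10f);
    }

    // PDF y coordinates grow from the bottom edge upwards, so a distance
    // measured from the top of the page has to be flipped.
    static float yFromTop(float pageHeightUnits, float distanceFromTopUnits) {
        return pageHeightUnits - distanceFromTopUnits;
    }
}
```

For example, a position 1 inch from the left and 2 cm from the top of an A4 page (297 mm high) would be x = PdfUnits.fromInches(1f) and y = PdfUnits.yFromTop(PdfUnits.fromMillimeters(297f), PdfUnits.fromCentimeters(2f)); these values can then be passed to moveTextPositionByAmount(x, y).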

Related

qpdf - replace text in existing PDF file

This is the first time I'm working with PDFs on this level, so please be patient with
my noob question. I understand the logical and physical structure of a PDF file
on a basic level.
I have a PDF that contains a dummy ID that needs to be replaced. To check whether there
is a way to do this, I used qpdf to expand the PDF using
qpdf --qdf --object-streams=disable orig.pdf expanded.pdf
Using a hex editor I located the dummy ID in expanded.pdf and changed the value by
simply swapping two digits
<001800180017> Tj => <001700170018> Tj
and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original
ID 443 is still rendered, but searching for "443" doesn't find it. When searching for
"334", the modified content, I get the rendered original ID 443 highlighted.
The PDF consists of text and vector graphics. When I insert additional digits (which obviously
invalidates the offsets in the xref), I get an error message regarding a missing font and
all digits are shown as dots but the vector graphic is still in place. This seems to indicate
that the ID is not part of the graphic.
What did I miss?
EDIT 1:
After mkl's comment, I did a deeper analysis of my PDF and found that, besides the obvious graphic content, all text was rendered by a series of m/l/c commands followed by a BT/ET section. The color for stroke and non-stroke was 0,0,0 for both in the BT/ET section.
Is this because of the used embedded non-standard font?
Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?
Looking back I wonder what I did to get the dots when I first modified the
content. It seems impossible and I can't reproduce it either.
Thanks
Tom
First off, the following is merely guesswork as you could not share the pdf in question. Educated guesswork but guesswork nonetheless.
You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.
Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.
The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands) is in pdfs in which the producer didn't want text extraction to be possible and replaced text drawing instructions by arbitrary vector graphics instructions.
This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.
Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.
Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.
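To see what the operands from the question actually encode, it can help to split the hex string into character codes. The following is a small sketch, assuming the font uses two-byte character codes (as the groups of four hex digits in <001800180017> suggest); the class and method names are invented for illustration. Note that these are font-specific codes, not ASCII or Unicode values, so mapping them to digits requires the font's encoding or ToUnicode map.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting hex string operands of Tj instructions.
final class HexStringCodes {
    private HexStringCodes() {}

    // Splits a PDF hex string such as "<001800180017>" into two-byte codes.
    // Assumption: every character code in the font is exactly two bytes long.
    static List<Integer> twoByteCodes(String hexString) {
        String hex = hexString.replaceAll("[<>\\s]", "");
        // In a PDF hex string, a missing final digit is read as 0.
        if (hex.length() % 2 != 0) {
            hex += "0";
        }
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("not a whole number of two-byte codes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }
}
```

For the two operands in the question this yields the codes 0x18, 0x18, 0x17 and 0x17, 0x17, 0x18 respectively, which is consistent with the searchable text changing from 443 to 334 while the separately drawn outlines stayed the same.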

Is the /Widths array of a PDF font object redundant information?

The PDF reference states that the dictionary defining a font resource needs to contain a /Widths entry, described as follows:
(Required except for the standard 14 fonts; indirect reference
preferred) An array of ( LastChar − FirstChar + 1) widths, each
element being the glyph width for the character code that equals
FirstChar plus the array index. For character codes outside the range
FirstChar to LastChar , the value of MissingWidth from the
FontDescriptor entry for this font is used. The glyph widths are
measured in units in which 1000 units corresponds to 1 unit in text
space. These widths must be consistent with the actual widths given in
the font program. (See implementation note 61 in Appendix H.)
emphasis added.
What good is it to provide the widths again if they are obviously included in the font program?
Plainly: Can somebody confirm or reject whether the information one is supposed to provide here, the glyph width, is blatantly redundant, considering it is even mentioned to be contained in the font program?
Or do some font programs include glyphs without specifying their widths?
Is it because there are font programs that do not include the widths, or is this merely an exercise in patience, intended to complicate the generation of PDF files, hoping people then stick to Adobe software?
Are the /Widths entries required to test whether a referenced font that is not embedded is "correct" (i.e. is the pdf viewer supposed to check, by comparing the /Widths, whether the font program found on the platform is the one wanted by the pdf)?
The Widths array is documented as being present so that application programs can determine the metrics of glyphs without being required to decode a font. This might be of use (for example) when drawing a selection box around text, or highlighting text in some manner.
See pages 393 and 394 of the PDF 1.7 specification:
The width information for each glyph is stored both in the font
dictionary and in the font program itself. (The two sets of widths
must be identical; storing this information in the font dictionary,
although redundant, enables a consumer application to determine glyph positioning without having to look inside
the font program.)
I should also mention that there are many PDF producers which regard abusing the Widths array as a convenient way to alter the spacing of a font. Where the Widths array of the font does not match the metrics of the glyphs in the font program, Acrobat uses the Widths array values (which is the implementation note in Appendix H referred to by the text you quoted). I also seem to recall that the latest version of the specification lifts the exception for the base 14 fonts; all fonts are supposed to have a /Widths array now.
We've got numerous examples of PDF files where the Widths array does not match the metrics in the font program.
Note that the Preflight checker in Acrobat Pro, when checking for PDF/A compatibility, will throw an error if the Widths and metrics differ.
So while it is technically true that the /Widths array is redundant, because the same information can be retrieved from the font, it is convenient for some applications to have the information in a more readily accessible form, and if (as a PDF consumer) you hope to match the rendering from Acrobat, you need to use it.
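To illustrate how a consumer can position glyphs from the font dictionary alone, here is a minimal sketch of the lookup rule quoted in the question: the width for a character code is Widths[code - FirstChar], with MissingWidth as the fallback outside that range, in units of 1/1000 of a text space unit. It assumes a simple (single-byte) font and ignores character spacing, word spacing and horizontal scaling; the class name and the sample values in the usage note are made up.

```java
// Hypothetical sketch of a /Widths lookup for a simple font.
final class SimpleFontWidths {
    private final int firstChar;
    private final float[] widths;       // the /Widths array
    private final float missingWidth;   // /MissingWidth from the FontDescriptor

    SimpleFontWidths(int firstChar, float[] widths, float missingWidth) {
        this.firstChar = firstChar;
        this.widths = widths.clone();
        this.missingWidth = missingWidth;
    }

    // Width in glyph space (1000 units correspond to 1 text space unit).
    float glyphWidth(int code) {
        int index = code - firstChar;
        return (index >= 0 && index < widths.length) ? widths[index] : missingWidth;
    }

    // Total horizontal advance of the given codes at the given font size,
    // without ever looking inside the font program.
    float advance(int[] codes, float fontSize) {
        float total = 0f;
        for (int code : codes) {
            total += glyphWidth(code) / 1000f * fontSize;
        }
        return total;
    }
}
```

With FirstChar 32, a Widths array of {250, 333, 408} and MissingWidth 500, code 33 has width 333, while code 35 falls outside the array and gets 500.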

LaTeX (Context) document size over 226in

I've been looking left, right and center but was not able to find a definitive answer as to how and whether it is possible to increase document size in LaTeX over the magic 200inch limit.
For a project I work on I need to be able to dynamically generate PDF/EPS that can be used to print on a medium up to 32 yards long. The important factor is that this print job must be seamless and contain color stripes, symbols and texts on repeat throughout the entire length.
I know that PDF 1.6+ supports a much larger document size via setting UserUnits. Is that something that can be used in LaTeX and if so, how would I go about this?

Extracting text from PDF with correct/sensible coordinates

My company licenses both iTextSharp and PdfTools. Trying to figure out the root cause I built Apache's PdfBox: All show the same behavior, so rather than creating two support requests and a post on the PdfBox list I'm trying SO first for the general problem.
For a real world PDF (according to the document's properties it was created by "SAP NetWeaver 740") all extracted text coordinates are way off, while the content is fine. Across all the tools I listed above:
The page size (as in, mediabox and cropbox) is 842.0 x 595.0 - a portrait invoice. My default test word (all are off, but that's the one that caused my investigation) starts at roughly 80% in. All tools report the coordinates of that text with x=778 - outside of the page bounds. The y coordinate seems to be fine though. Probably related, the width is off (too wide by a large margin) while the height is again fine.
Now, maybe the PDF is broken in some way. But then again: The text is rendered fine of course. If I select the text in - say - Acrobat Reader, that works fine (i.e. the selection rectangle matches the text on the screen). And I assume that SAP generates rather bland/unsophisticated documents, tbh.
I guess my question boils down to: Under which circumstances would text appear to be outside of the page's boundaries? What might cause the horizontal position to be totally out of whack (and always too large)?

Get x/y and width/height of all characters in a PDF using GhostScript

I need to get the x/y, width/height, and page number of each individual character in a PDF, ideally as percentages.
Clearly, Ghostscript is able to do this as it wouldn't be possible to convert PDFs to raster images otherwise. Is there a simple way to get Ghostscript to give me this information or am I going to need to modify the source to hook into this functionality?
Glyphs are rendered to bitmaps (using FreeType) and stored in the glyph cache tagged with the font and matrix so that they can be uniquely identified. When text is rendered to the page the cache is consulted first and if a hit exists that bitmap is drawn at the current point. If not then the glyph is rendered and cached.
However, extremely large point sizes are left uncached, and rendered each time to avoid filling up or overflowing the cache.
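The caching behaviour described above can be sketched roughly as follows. This is only an illustration of the idea (look up by font and matrix, render and store on a miss, bypass the cache for very large sizes), not Ghostscript's actual implementation, which is written in C; all names and the size threshold are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical sketch of a glyph bitmap cache keyed by font, glyph and matrix.
final class GlyphCacheSketch {
    static final class Key {
        private final String fontId;
        private final int glyphCode;
        private final String matrix;

        Key(String fontId, int glyphCode, String matrix) {
            this.fontId = fontId;
            this.glyphCode = glyphCode;
            this.matrix = matrix;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return glyphCode == k.glyphCode
                    && fontId.equals(k.fontId)
                    && matrix.equals(k.matrix);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fontId, glyphCode, matrix);
        }
    }

    private final Map<Key, byte[]> cache = new HashMap<>();
    private final float maxCachedPointSize; // assumed threshold, not a real Ghostscript value
    private int renderCount = 0;

    GlyphCacheSketch(float maxCachedPointSize) {
        this.maxCachedPointSize = maxCachedPointSize;
    }

    byte[] bitmapFor(Key key, float pointSize, Supplier<byte[]> renderer) {
        if (pointSize > maxCachedPointSize) {
            // Very large glyphs are rendered every time and never stored.
            renderCount++;
            return renderer.get();
        }
        byte[] bitmap = cache.get(key);
        if (bitmap == null) {
            // Cache miss: render once and keep the bitmap for later hits.
            renderCount++;
            bitmap = renderer.get();
            cache.put(key, bitmap);
        }
        return bitmap;
    }

    int renderCount() {
        return renderCount;
    }
}
```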
So in order to retrieve this information using Ghostscript you would need to write a device which has a set of text methods. You would need to capture the bitmaps from the glyph in order to determine the width and height of the glyphs, and the current point would give you the position on the page. The output_page method would tell you that a page had completed, so you would need to track the page number yourself.
You could look at the txtwrite device to see how text is processed, and the epswrite device to see how to retrieve bitmaps; you'll need some combination of both.
Be aware that 'text' in a PDF file need not be text. What appears to be text can be bitmaps, or vectors. Text can be encoded in unusual ways, and there may be no way to retrieve Unicode or other identifiable information about the glyphs (again the txtwrite device shows how you might extract such information if possible).
Also, fonts are not always embedded in PDF files, in which case a substitute font is used, which would mess up your width/height information.
This is quite a big project.