pdf.js get info about embedded fonts - pdf

I am using pdf.js. Fetching the Text I get blocks with font info
Object {
str: "blabla",
dir: "ltr",
width: 191.433141,
height: 12.546,
transform: Array[6],
fontName: "g_d0_f2"
}
Is it possible to get somehow more information about g_d0_f2.

Notice the PDF.js getTextContent will not and not suppose to match glyphs in PDFs. The PDF32000 specification has two different algorithms for text display and extraction. Even if you can lookup font data in the page.commonObjs, it might not be really helpful for extracted text content display due to glyphs encoding mismatch.
The page's getTextContent is doing text extraction and getOperatorList gets (glyph) display operators. See how src/display/svg.js renderer displays glyphs.

Related

How to format images with markdown, Sphinx latexpdf

I have a markdown document that I wish to output in pdf format. I cannot find a way to format the images correctly.
If I use
<center><img src="_images/parameter_form.png" alt="Select Parameters" width="400"></center>
I works perfectly with make html, but with make latexpdf the image does not appear at all.
If I use (based on this stack overflow answer)
![Select Parameters](_images/parameter_form.png) { width=400 }
The images is not sized and the format string { width=400 } appears in the text
What am I doing wrcng?
The answer is to use MyST. This Sphinx parser allows a richer syntax in markdown and specifically the image tag
```{image} _images/parameter_form.png
:alt: Select Parameters
:width: 400px
:align: center
```
Perfect!

How to correctly position a header image with docx4j?

I am trying to convert this Word document with a header showing an image on the right
http://www.filesnack.com/files/cduiejc7
to PDF using this sample code:
https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ConvertOutPDF.java
Here's the result:
http://www.filesnack.com/files/ctjs659h
While the Word document has the header image on the right, the converted PDF shows it on the left.
How can I make docx4j to reproduce the original document as PDF?
Your image is positioned relative to a paragraph:
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="791936E3" wp14:editId="575B92C8">
<wp:simplePos x="0" y="0"/>
<wp:positionH relativeFrom="column">
<wp:posOffset>5317388</wp:posOffset>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>-325755</wp:posOffset>
</wp:positionV>
docx4j potential to support stuff like that in PDF output is limited by what XSL FO supports. See docx4j's TextBoxTest class for what we can do with text boxes.
Currently, although we can position some textBoxes; we don't do the same for floating images: https://github.com/plutext/docx4j/issues/127
In the meantime, a possible workaround for some cases (eg float right) is to use a table.
Or possibly, you could try putting the image inside a text box!

Adjust PDF options for PhantomJS

I use PhantomJS for PDF generation.
This is my command:
./phantomjs rasterize.js <someurl> test.pdf
It generates pdf file but:
The PDF looks nothing like the original website
I can't set the page orientation
Also is there any other options I can use for pdf generation?
The following change to rasterize.js also doesn't seem to work:
{ format: system.args[3], orientation: 'Letter', margin: '1cm' }
Rasterize.js is a very basic example of screen capture. There are some default values in this example that you can change to your needs.
page.viewportSize
Simulates the size of the window like in a traditional browser. In rasterize.js, it's { width: 600, height: 600 } ; not a common resolution and you may need to change this.
page.paperSize
Defines the size of the web page when rendered as a PDF. There are two modes : Manual (given a width and a height) or automatic (given a format). Do not hesitate to read the webpage documentation and the wiki page.
In your case, orientation: 'Letter' is invalid.
Supported formats are 'A3', 'A4', 'A5', 'Legal', 'Letter', 'Tabloid'.
Supported orientation are 'portrait' and 'landscape'.
Take a look at the source code and change it to your needs !

How to find x,y location of a text in pdf

Is there any tool to find the X-Y location on a text content in a pdf file ?
Docotic.Pdf Library can do it. See C# sample below:
using (PdfDocument doc = new PdfDocument("your_pdf.pdf"))
{
foreach (PdfTextData textData in doc.Pages[0].Canvas.GetTextData())
Console.WriteLine(textData.Position + " " + textData.Text);
}
Try running "Preflight..." in Acrobat and choosing PDF Analysis -> List page objects, grouped by type of object.
If you locate the text objects within the results list, you will notice there is a position value (in points) within the Text Properties -> * Font section.
TET, the Text Extraction Toolkit from the pdflib family of products can do that. TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...)
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.

Is there a way to add "alt text" to links in PDFs in Adobe Acrobat?

In Adobe Acrobat Pro, it's not that difficult to add links to a page, but I'm wondering if there's also a way to add "alt text" (sometimes called "title text") to links as well. In HTML, this would be done as such:
link
Then when the mouse is hovering over the link, the text appears as a little tooltip. Is there an equivalent for PDFs? And if so, how do you add it?
Actually PDF does support alternate text. It's part of Logical Structure documented PDF Reference 1.7 section 10.8.2 "Alternate Descriptions"
/Span << /Lang (en-us) /Alt (six-point star) >> BDC (✡) Tj EMC
In PDF syntax, Link annotations support a Contents entry to serve as an alternate description:
/Annots [
<<
/Type /Annot
/Subtype /Link
/Border [1 1 1]
/Dest [4 0 R /XYZ 0 0 null]
/Rect [ 50 50 80 60 ]
/Contents (Link)
>>
]
Quoting "PDF Reference - 6th edition - Adobe® Portable Document Format - Version 1.7 - November 2006" :
Contents text string (Optional) Text to be displayed for the annotation or, if this type of annotation does not display text, an alternate description of the annotation’s contents in human-readable form. In either case, this text is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes
And later on:
For all other annotation types (Link , Movie , Widget , PrinterMark , and TrapNet), the Contents entry provides an alternate representation of the annotation’s contents in human-readable form
This is displayed well with Sumatra PDF v3.1.2, when a border is present:
However this is not widely supported by other PDF readers.
The W3C, in its PDF Techniques for WCAG 2.0 recommend another ways to display alternative texts on links for accessibility purposes:
PDF11: Providing links and link text using the Link annotation and the /Link structure element in PDF documents
PDF13: Providing replacement text using the /Alt entry for links in PDF documents
No, it's not possible to add alt text to a link in a PDF. There's no provision for this in the PDF reference.
On a related note, links in PDFs and links in HTML documents are handled quite differently. A link in a PDF is actually a type of annotation, which sits on top of the page at specified co-ordinates, and has no technical relationship to the text or image in the document. Where as links in HTML documents bare a direct relationpship to the elements which they hyperlink.
Alt text, or alternate text, is not the same as title text. Title text is meta data intended for human consumption. Alt text is alternate text content upon media in case the media fails to load.
There is also a trick using an invisible form button that doesn't do anything but allows a small popup tooltip text to be added when the mouse hovers over it.
Officially, per PDF 1.7 as defined in ISO 32000-1 14.9.3 (see Adobe website for a free download of a document that is equivaent to the ISO standard for PDF 1.7), one would provide alternate text for an annotation - like for example a Link annotation - by adding a key "Contents" to its data structure and provide the alt text as a text string value for that key.
Unfortunately Acrobat does not seem to provide a user interface to add or edit this "Contents" text string for Link annotations, and even if it is present it will not be used for the tool tip. Instead the tool tip always seems to be the target of the Link annotation, at least if it points to a URL.
So on a visual level one could hack around this by adding some other invisible elements on top of the area of the Link annotation that have the desired behavior. Not a very nice hack, at least for my taste. In addition it does not help with accessibility of the PDF, as it introduces yet another stray object...
Facing the same problem I used the JS lib "pdf-lib" (https://pdf-lib.js.org/docs/api/classes/pdfdocument) to edit the content of the pdf file and add the missing attributes on annotations.
const pdfLib = require('pdf-lib');
const fs = require('fs');
function getNewMap(pdfDoc, str){
return pdfDoc.context.obj(
{
Alt: new pdfLib.PDFString(str),
Contents: new pdfLib.PDFString(str)
}).dict;
}
const pdfData = await fs.readFile('your-pdf-document.pdf');
const pdfDoc = await pdfLib.PDFDocument.load(pdfData);
pdfDoc.context.enumerateIndirectObjects().forEach(_o => {
const pdfRef = _o[0];
const pdfObject = _o[1];
if (typeof pdfObject?.lookup === "function"){
if (pdfObject.lookup(pdfLib.PDFName.of('Type'))?.encodedName === "/Annot"){
// We use the link URI to implement annotation "Alt" & "Contents" attributes
const annotLinkUri = pdfObject.lookup(pdfLib.PDFName.of('A')).get(pdfLib.PDFName.of('URI')).value;
const titleFromUri = annotLinkUri.replace(/(http:|)(^|\/\/)(.*?\/)/g, "").replace(/\//g, "").trim();
// We build the new map with "Alt" and "Contents" attributes and assign it to the "Annot" object dictionnary
const newMap = getNewMap(pdfDoc, titleFromUri);
const newdict = new Map([...pdfObject.dict, ...newMap]);
pdfObject.dict = newdict;
}
}
})
// We save the file
const pdfBytes = await pdfDoc.save();
await fs.promises.writeFile("your-pdf-document-accessible.pdf", pdfBytes);