Is there a way to add "alt text" to links in PDFs in Adobe Acrobat? - pdf

In Adobe Acrobat Pro, it's not that difficult to add links to a page, but I'm wondering if there's also a way to add "alt text" (sometimes called "title text") to links as well. In HTML, this would be done as such:
link
Then when the mouse is hovering over the link, the text appears as a little tooltip. Is there an equivalent for PDFs? And if so, how do you add it?

Actually PDF does support alternate text. It's part of Logical Structure documented PDF Reference 1.7 section 10.8.2 "Alternate Descriptions"
/Span << /Lang (en-us) /Alt (six-point star) >> BDC (✡) Tj EMC

In PDF syntax, Link annotations support a Contents entry to serve as an alternate description:
/Annots [
<<
/Type /Annot
/Subtype /Link
/Border [1 1 1]
/Dest [4 0 R /XYZ 0 0 null]
/Rect [ 50 50 80 60 ]
/Contents (Link)
>>
]
Quoting "PDF Reference - 6th edition - Adobe® Portable Document Format - Version 1.7 - November 2006" :
Contents text string (Optional) Text to be displayed for the annotation or, if this type of annotation does not display text, an alternate description of the annotation’s contents in human-readable form. In either case, this text is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes
And later on:
For all other annotation types (Link , Movie , Widget , PrinterMark , and TrapNet), the Contents entry provides an alternate representation of the annotation’s contents in human-readable form
This is displayed well with Sumatra PDF v3.1.2, when a border is present:
However this is not widely supported by other PDF readers.
The W3C, in its PDF Techniques for WCAG 2.0 recommend another ways to display alternative texts on links for accessibility purposes:
PDF11: Providing links and link text using the Link annotation and the /Link structure element in PDF documents
PDF13: Providing replacement text using the /Alt entry for links in PDF documents

No, it's not possible to add alt text to a link in a PDF. There's no provision for this in the PDF reference.
On a related note, links in PDFs and links in HTML documents are handled quite differently. A link in a PDF is actually a type of annotation, which sits on top of the page at specified co-ordinates, and has no technical relationship to the text or image in the document. Where as links in HTML documents bare a direct relationpship to the elements which they hyperlink.

Alt text, or alternate text, is not the same as title text. Title text is meta data intended for human consumption. Alt text is alternate text content upon media in case the media fails to load.

There is also a trick using an invisible form button that doesn't do anything but allows a small popup tooltip text to be added when the mouse hovers over it.

Officially, per PDF 1.7 as defined in ISO 32000-1 14.9.3 (see Adobe website for a free download of a document that is equivaent to the ISO standard for PDF 1.7), one would provide alternate text for an annotation - like for example a Link annotation - by adding a key "Contents" to its data structure and provide the alt text as a text string value for that key.
Unfortunately Acrobat does not seem to provide a user interface to add or edit this "Contents" text string for Link annotations, and even if it is present it will not be used for the tool tip. Instead the tool tip always seems to be the target of the Link annotation, at least if it points to a URL.
So on a visual level one could hack around this by adding some other invisible elements on top of the area of the Link annotation that have the desired behavior. Not a very nice hack, at least for my taste. In addition it does not help with accessibility of the PDF, as it introduces yet another stray object...

Facing the same problem I used the JS lib "pdf-lib" (https://pdf-lib.js.org/docs/api/classes/pdfdocument) to edit the content of the pdf file and add the missing attributes on annotations.
const pdfLib = require('pdf-lib');
const fs = require('fs');
function getNewMap(pdfDoc, str){
return pdfDoc.context.obj(
{
Alt: new pdfLib.PDFString(str),
Contents: new pdfLib.PDFString(str)
}).dict;
}
const pdfData = await fs.readFile('your-pdf-document.pdf');
const pdfDoc = await pdfLib.PDFDocument.load(pdfData);
pdfDoc.context.enumerateIndirectObjects().forEach(_o => {
const pdfRef = _o[0];
const pdfObject = _o[1];
if (typeof pdfObject?.lookup === "function"){
if (pdfObject.lookup(pdfLib.PDFName.of('Type'))?.encodedName === "/Annot"){
// We use the link URI to implement annotation "Alt" & "Contents" attributes
const annotLinkUri = pdfObject.lookup(pdfLib.PDFName.of('A')).get(pdfLib.PDFName.of('URI')).value;
const titleFromUri = annotLinkUri.replace(/(http:|)(^|\/\/)(.*?\/)/g, "").replace(/\//g, "").trim();
// We build the new map with "Alt" and "Contents" attributes and assign it to the "Annot" object dictionnary
const newMap = getNewMap(pdfDoc, titleFromUri);
const newdict = new Map([...pdfObject.dict, ...newMap]);
pdfObject.dict = newdict;
}
}
})
// We save the file
const pdfBytes = await pdfDoc.save();
await fs.promises.writeFile("your-pdf-document-accessible.pdf", pdfBytes);

Related

iText 7 need to skip reading page header elements

I am using EventHandler to create page header for my pdf. The content of the header are added into a Table before adding to Canvas. As part of 508 compliance, i need to exclude the header content from being read out loud. How do i accomplice this?
public class TEirHeaderEventHandler : IEventHandler
{
public void HandleEvent(Event e)
{
PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
PdfDocument pdf = docEvent.GetDocument();
PdfPage page = docEvent.GetPage();
PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
Rectangle headerRect = new Rectangle(60, 725, 495, 96);
Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
//creating content for header
CreateHeaderContent(headerCanvas);
headerCanvas.Close();
}
private void CreateHeaderContent(Canvas canvas)
{
//Create header content
Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 } ));
table.SetWidth(UnitValue.CreatePercentValue(100));
Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
cell1.SetBorder(Border.NO_BORDER);
table.AddCell(cell1);
Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
cell2.SetBorder(Border.NO_BORDER);
table.AddCell(cell2);
Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
cell3.SetBorder(Border.NO_BORDER);
table.AddCell(cell3);
canvas.Add(table);
}
}
public static void CreatePdf()
{
using (MemoryStream writeStream = new MemoryStream())
using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
{
PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
pdf.SetTagged();
iTextDocument document = new iTextDocument(pdf);
TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
//Convert html to pdf
HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
document.Close();
byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
File.WriteAllBytes(outputPdfFile, bytes);
}
}
Note that i have set the document to be tagged. but i still get the "Reading Untagged Document" screen when i open the file. However, all of the content are read including the header when i activate the Read Out Loud feature. Any input or suggestion would be appreciated. Thank you in advance for your help.
General
The approach suggested by Alexey Subach is generally correct. You mark the content as artifact to differentiate it from real content.
element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This marks the content in the content stream and it excludes the element from the structure tree.
Your case
However, your specific case is more nuanced.
For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.
Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.
If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.
If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.
Summary: looking at the structure tree, there's no difference between your original code and the approach of General.
Adobe reading order / accessibility settings
From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:
From https://helpx.adobe.com/reader/using/accessibility-features.html:
Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.
In my tests, the only way I can make Adobe Reader read out loud the header content created with your original code, is when I select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it is basically ignoring the tagging and just processing the content per the location on the page.
With Override The Reading Order In Tagged Documents disabled, the header content is not read, for your original code and with explicit artifacts.
Conclusion
Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.
Headers and footers are typically pagination artifacts and should be marked as such in the following way:
table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This will exclude the table from being read. Please note that you can mark any element implementing IAccessibleElement interface as artifact.

acroform field.setRichTextValue is not working

I have a field from acroform and I see field.setValue() and field.setRichTextValue(...). The first one set the correct value, but second one seems not working, rich text value is not display.
Here is code im using :
PDDocument pdfDocument = PDDocument.load(new File(SRC));
pdfDocument.getDocument().setIsXRefStream(true);
PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(false);
acroForm.getField("tenantDataValue").setValue("Deuxième texte");
acroForm.getField("tradingAddressValue").setValue("Text replacé");
acroForm.getField("buildingDataValue").setValue("Deuxième texte");
acroForm.getField("oldRentValue").setValue("750");
acroForm.getField("oldChargesValue").setValue("655");
acroForm.getField("newRentValue").setValue("415");
acroForm.getField("newChargesValue").setValue("358");
acroForm.getField("increaseEffectiveDateValue").setValue("Texte 3eme contenu");
// THIS RICH TEXT NOT SHOW ANYTHING
PDTextField field = (PDTextField) acroForm.getField("tableData");
field.setRichText(true);
String val = "\\rtpara[size=12]{para1}{This is 12pt font, while \\span{size=8}{this is 8pt font.} OK?}";
field.setRichTextValue(val);
I expect field named "tableData" to be setted with rich text value!
You can download the PDF form I am using with this code : download pdf form
and you can download the output after runn this code and flatten form data download output here
To sum up what has been said in the comments to the question plus some studies of the working version...
Wrong rich text format
The OP in his original code used this as rich text
String val = "\\rtpara[size=12]{para1}{This is 12pt font, while \\span{size=8}{this is 8pt font.} OK?}";
which he took from this document. But that document is the manual for the LaTeX richtext package which provides commands and documentation needed to “easily” produce such rich strings. I.e. the \rtpara... above is not PDF rich text but instead a LaTeX command that produces PDF rich text (if executed in a LaTeX context).
The document actually even demonstrates this using the example
\rtpara[indent=first]{para1}{Now is the time for
\span{style={bold,italic,strikeit},color=ff0000}{J\374rgen}
and all good men to come to the aid of \it{their}
\bf{country}. Now is the time for \span{style=italic}
{all good} women to do the same.}
for which the instruction generates two values, a rich text value and a plain text value:
\useRV{para1}: <p dir="ltr" style="text-indent:12pt;
margin-top:0pt;margin-bottom:0pt;">Now is the time
for <span style="text-decoration:line-through;
font-weight:bold;font-style:italic;color:#ff0000;
">J\374rgen</span> and all good men to come to the
aid of <i>their</i> <b>country</b>. Now is the
time for <span style="font-style:italic;">all
good</span> women to do the same.</p>
\useV{para1}: Now is the time for J\374rgen and all
good men to come to the aid of their country. Now
is the time for all good women to do the same.
As one can see in the \useRV{para1} result, PDF rich text uses (cut down) HTML markup for rich text.
For more details please lookup the PDF specification, e.g. section 12.7.3.4 "Rich Text Strings" in the copy of ISO 32000-1 published by Adobe here
PDFBox does not create rich text appearances
The OP in his original code uses
acroForm.setNeedAppearances(false);
This sets a flag that claims that all form fields have appearance streams (in which the visual appearance of the respective form field plus its content are elaborated) and that these streams represent the current value of the field, so it effectively tells the next processor of the PDF that it can use these appearance streams as-is and does not need to generate them itself.
As #Tilman quoted from the JavaDocs, though,
/**
* Set the fields rich text value.
*
* <p>
* Setting the rich text value will not generate the appearance
* for the field.
* <br>
* You can set {#link PDAcroForm#setNeedAppearances(Boolean)} to
* signal a conforming reader to generate the appearance stream.
* </p>
*
* Providing null as the value will remove the default style string.
*
* #param richTextValue a rich text string
*/
public void setRichTextValue(String richTextValue)
So setRichTextValue does not create an appropriate appearance stream for the field. To signal the next processor of the PDF (in particular a viewer or form flattener) that it has to generate appearances, therefore, one needs to use
acroForm.setNeedAppearances(true);
Making Adobe Acrobat (Reader) generate the appearance from rich text
When asked to generate field appearances for a rich text field, Adobe Acrobat has the choice to do so either based on the rich text value RV or the flat text value V. I did some quick checks and Adobe Acrobat appears to use these strategies:
If RV is set and the value of V equals the value of RV without the rich text markup, Adobe Acrobat assumes the value of RV to be up-to-date and generates an appearance from this rich text string according to the PDF specification. Else the value of RV (if present at all) is assumed to be outdated and ignored!
Otherwise, if the V value contains rich text markup, Adobe Acrobat assumes this value to be rich text and creates the appearance according to this styling.
This is not according to the PDF specification.
Probably some software products used to falsely put the rich text into the V value and Adobe Acrobat started to support this misuse for larger compatibility.
Otherwise the V value is used as a plain string and an appearance is generated accordingly.
This explains why the OP's original approach using only
field.setRichTextValue(val);
showed no change - the rich text value was ignored by Adobe Acrobat.
And it also explains his observation
then instead of setRichTextValue simply using field.setValue("<body xmlns=\"http://www.w3.org/1999/xhtml\"><p style=\"color:#FF0000;\">Red
</p><p style=\"color:#1E487C;\">Blue
</p></body>") works ! in acrobat reader (without flatten) the field is correctly formatted
Be aware, though, that this is beyond the PDF specification. If you want to generate valid PDF, you have to set both RV and V and have the latter contain the plain version of the rich text of the former.
For example use
String val = "<?xml version=\"1.0\"?>"
+ "<body xfa:APIVersion=\"Acroform:2.7.0.0\" xfa:spec=\"2.1\" xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
+ "<p dir=\"ltr\" style=\"margin-top:0pt;margin-bottom:0pt;font-family:Helvetica;font-size:12pt\">"
+ "This is 12pt font, while "
+ "<span style=\"font-size:8pt\">this is 8pt font.</span>"
+ " OK?"
+ "</p>"
+ "</body>";
String valClean = "This is 12pt font, while this is 8pt font. OK?";
field.setValue(valClean);
field.setRichTextValue(val);
or
String val = "<body xmlns=\"http://www.w3.org/1999/xhtml\"><p style=\"color:#FF0000;\">Red
</p><p style=\"color:#1E487C;\">Blue
</p></body>";
String valClean = "Red\rBlue\r";
field.setValue(valClean);
field.setRichTextValue(val);

PDF metadata to open document in Actual Size (100%) view

I am generating a PDF document using jsPDF. Is there a way to store metadata in the PDF document that will force Acrobat to open it in 100% view mode (Actual Size) vs sized to fit?
In other words does PDF document specification allow that to specify it in the document itself?
This is definitely possible, because a PDF document can contain information on how it should open.
You might create such a document in Acrobat and then find the opening information, and/or you might have a look at the Portable Document Format Reference, which is part of the Acrobat SDK, downloadable from the Adobe website.
However, I don't know whether you can insert that structure into the PDF with your tool.
I figured it out; in the Catalog section of the PDF document, there is a OpenAction section where we can specify how the view can show the file, among other things.
I changed this
putCatalog = function () {
out('/Type /Catalog');
out('/Pages 1 0 R');
// #TODO: Add zoom and layout modes
out('/OpenAction [3 0 R /FitH null]');
out('/PageLayout /OneColumn');
events.publish('putCatalog');
},
to this
putCatalog = function () {
out('/Type /Catalog');
out('/Pages 1 0 R');
// #TODO: Add zoom and layout modes
out('/OpenAction [3 0 R 1 100]'); //change from standard code to use zoom to 100 % instead of fit to width
out('/PageLayout /OneColumn');
events.publish('putCatalog');
},

Add a hyperlink into a PDF document

I'm currently extending our custom PDF writer to be able to write links to websites. However, I have a problem because I can't find anywhere how to place a link into the PDF.
This is, what prints a text:
BT
70 50 TD
/F1 12 Tf
(visit my website!) Tj
ET
What I need now is to wrap this into a hyperlink so the user gets redirected to my website, when clicking "visit my website!"
Any idea how to do so? I can't use a tool or so - I need to know how to write the right PDF commands to the file, since a lot of documents are generated dynamically using C#. Currently I'm using iTextSharp - but I couldn't find any functionality to write a hyperlink, so I decided to add this functionality.
Here's how the spec side of things: Links are created by having Link Annotations placed on the page. A link annotation is represented by either the Rect key or by a set of quadrilaterals. Let's assume that you're working with rectangles. In order to place the link, you'll need a dictionary like this at a minimum:
<< /Type /Annot /Subtype /Link /Rect [ x1 y1 x2 y2 ] >>
(x1, y1) and (x2, y2) describe the corners of the rectangle where the link's active area lives.
To work with this, this should be an indirect object in the PDF and referenced from your page's Annots array.
If you can create this, you'll get a link on the page that goes nowhere.
To get the link to go somewhere you'll need either a /Dest or an /A entry in the link annot (but not both). /Dest is an older artifact for page-level navigation - you won't use this. Instead, use the /A entry which is an action dictionary. So if you wanted to navigate to the url http://www.google.com, you would make your annotation look like this:
<< /Type /Annot /Subtype /Link /Rect [ x1 y1 x2 y2 ]
/A << /Type /Action /S /URI /URI (http://www.google.com) >>
>>
I can't help you specifically with how to do this in iTextSharp. I don't particularly like the model or abstraction that they use. I write a PDF toolkit for Atalasoft and I'll show you how I would do that in my own toolkit. Again, I'm making no effort to conceal that this is a commercial product and it's what I do for a living. I just want you to see that there are other options available.
// make a document, add a font, get its metrics
PdfGeneratedDocument doc = new PdfGeneratedDocument();
string fontResource = doc.Resources.Fonts.AddFromFontName("Times New Roman");
PdfFontMetrics mets = doc.Resources.Fonts.Get(fontResource).Metrics;
// make a page, place a line of text
PdfGeneratedPage page = doc.Pages.AddPage(PdfDefaultPages.Letter);
PdfTextLine line = new PdfTextLine(fontResource, 12.0, "Visit my web site.",
new PdfPoint(72, 400));
page.DrawingList.Add(line);
// get the bounds of the text we place, make an annotation
PdfBounds bounds = mets.GetTextBounds(12.0, "Visit my web site.");
bounds = new PdfBounds(72, 400, bounds.Width, bounds.Height);
LinkAnnotation annot = new LinkAnnotation(bounds, new PdfURIAction(new URI("my url")));
page.Annotations.Add(annot);
// save the content
doc.Save("finaldoc.pdf");
The only thing that is "tricky" is that there is a disassociation between what content is on the page and the link annotation - but this is because this is how Acrobat models links. If you were modifying an existing document, you would construct a PdfGeneratedDocument from the existing file/stream, add the annotation and then save.
Currently I'm using iTextSharp - but I couldn't find any functionality to write a hyperlink.
Have a look at the iText in Action, 2nd Edition Webified iTextSharp Examples MovieLinks2.cs or LinkActions.cs:
// create an external link
Chunk imdb = new Chunk("Internet Movie Database", FilmFonts.ITALIC);
imdb.SetAnchor(new Uri("http://www.imdb.com/"));
p = new Paragraph("Click on a country, and you'll get a list of movies, containing links to the ");
p.Add(imdb);
p.Add(".");
document.Add(p);
Thus, it really is simple to add links using iTextSharp.
If you still want to do that manually, have a look at the PDF specification. Section 12.5.6.5 explains Link Annotations, and section 12.6.4.7 shows the URI Actions to use in that link annotation.

How to find x,y location of a text in pdf

Is there any tool to find the X-Y location on a text content in a pdf file ?
Docotic.Pdf Library can do it. See C# sample below:
using (PdfDocument doc = new PdfDocument("your_pdf.pdf"))
{
foreach (PdfTextData textData in doc.Pages[0].Canvas.GetTextData())
Console.WriteLine(textData.Position + " " + textData.Text);
}
Try running "Preflight..." in Acrobat and choosing PDF Analysis -> List page objects, grouped by type of object.
If you locate the text objects within the results list, you will notice there is a position value (in points) within the Text Properties -> * Font section.
TET, the Text Extraction Toolkit from the pdflib family of products can do that. TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...)
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.