Detecting Headers and Borders in PDF Tables using PDF Clown - pdf

I am using PDF Clown's TextInfoExtractionSample to extract a PDF table into Excel and I was able to do it except merged cells. In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders. Anyone know what object represents borders in PDF table OR how to detect if a text is a header of the table?
private void Extract(ContentScanner level, PrimitiveComposer composer)
{
if(level == null)
return;
while(level.MoveNext())
{
ContentObject content = level.Current;
}
}

I am using PDF Clown's TextInfoExtractionSample...
In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders.
while(level.MoveNext())
{
ContentObject content = level.Current;
}
A) Visit all content
In your loop code you removed very important blocks from the original example,
if(content is XObject)
{
// Scan the external level!
Extract(((XObject)content).GetScanner(level), composer);
}
and
if(content is ContainerObject)
{
// Scan the inner level!
Extract(level.ChildLevel, composer);
}
These blocks make the sample recurse into complex objects (the XObject, ContainerObject you mention) which in turn contain their own simple content.
B) Inspect all content
Anyone know what object represents borders in PDF table
Unfortunately there is nothing like a border attribute in PDF content. Instead, borders are independent objects, usually vector graphics, either lines or very thin rectangles.
Thus, while scanning the page content (recursively, as indicated in A) you will have to look for Path instances (namespace org.pdfclown.documents.contents.objects) containing
moveTo m, lineTo l, and stroke S operations or
rectangle re and fill f operations.
(This answer may help)
When you come across such lines, you will have to interpret them. These lines may be borders, but they may also be used as underlines, page decorations, ...
If the PDF happens to be tagged, things may be a bit easier insofar as you have to interpret less. Instead you can read the tagging information which may tell you where a cell starts and ends, so you do not need to interpret graphical lines. Unfortunately still less PDFs are tagged than not.
OR how to detect if a text is a header of the table?
Just as above, unless you happen to inspect a tagged PDF, there is nothing immediately telling you some text is a table header. You have to interpret again. Is that text outside of lines you determined to form a table? Is it inside at the top? Or just anywhere inside? Is it drawn in a specific font? Or larger? Different color? Etc.

Related

iText's Alt-Text adding sample code not working for PDFs tagged using Acrobat

I'm working on a PDF accessibility assignment, which is to add alternative text in a tagged PDF. I got the sample code for the same at: Add alternative text for an image in tagged pdf (PDF/UA) using iText
Very much excited about that my task is going to end in a very short time, without much R&D.
Created a Java project based on the code, and when I executed it, it worked perfectly for the input PDF used in iText.
Unfortunately, the same source code is not working with PDFs tagged using Acrobat.
Sample Inputs: iText PDF: no_alt_attribute.pdf   &   My PDF: SARO_Sample_v1.7.pdf
Issue:
// This line works and returns RootElement
PdfDictionary structTreeRoot = catalog.getAsDict(PdfName.STRUCTTREEROOT);
// --> This line always returns NULL,
// Instead of returning the child elements of RootElement
PdfArray kids = structTreeRoot.getAsArray(PdfName.K);
// --> As per the structure Kids are present
Compared the structure of both PDFs and the following are my observations:
Tagging Structure - exactly same in both PDFs Tagging Structure
Content Structure - almost same, but a few additions are available in the PDF created by me. Content Structure
Tag Tree Structure - almost same respective to Tags, but with a major difference: iText's PDF tags are marked with /T:StructElem whereas that's not found in MY-PDF Even re-tagging doesn't help. Tag Tree Structure
Verified with various tagged PDFs available with us and all are similar (without /T:StructElem). These PDFs are validated and have passed accessibility compliance.
Need some thoughts on how to make this source code work with the PDFs we have. Alternatively, I need a way to ADD the missing /T:StructElem automatically in the PDFs while tagging in Acrobat.
Any help will be much appreciated!
Please do let me know if any further information is needed.
Note: I'm still not sure adding this /T:StructElem will work, since the PDFs were passed in PAC.
If this is really an issue, then those PDFs wont be passed the validations, right? But this is the only difference I found between those two PDFs.
PS: The Acrobat version I'm using is "Adobe Acrobat (Pro) DC."
-- Thanks,SaRaVaNaN
Bruno's code in the referenced answer does not walk the whole structure tree because he did not implement all cases of the K contents. The structure element K entry is specified like this:
The children of this structure element. The value of this entry may be one of the following objects or an array consisting of one or more of the following objects in any combination: [...]
(ISO 32000-2, Table 355 — Entries in a structure element dictionary)
Bruno's code, though, always assumes the value to be an array:
PdfArray kids = element.getAsArray(PdfName.K);
(Most likely he implemented that code with just the structure tree of the PDF in question there in mind.)
Thus, replace
PdfArray kids = element.getAsArray(PdfName.K);
if (kids == null) return;
for (int i = 0; i < kids.size(); i++)
manipulate(kids.getAsDict(i));
by something like
PdfObject kid = element.getDirectObject(PdfName.K);
if (kid instanceof PdfDictionary) {
manipulate((PdfDictionary)kid);
} else if (kid instanceof PdfArray) {
PdfArray kids = (PdfArray)kid;
for (int i = 0; i < kids.size(); i++)
manipulate(kids.getAsDict(i));
}
As you did not share an example document, I could not test the code. If there are problems, please share an example PDF.

Fill textfield with pdfbox cause an offset [duplicate]

I am using Apache PDFBox for configuration of PDTextField's on a PDF document where I load Lato onto the document using:
font = PDType0Font.load(
#j_pd_document,
java.io.FileInputStream.new('/path/to/Lato-Regular.ttf')
) # => Lato-Regular
font_name = pd_default_resources.add(font).get_name # => F4
I then pass the font_name into a default_appearance_string for the PDTextField like so:
j_text_field.set_default_appearance("/#{font_name} 0 Tf 0 g") # where font_name is
# passed in from above
The issue now occurs when I proceed to invoke setValue on the PDTextField. Because I set the font_size in the defaultAppearanceString to 0, according to the library's example, the text should scale itself to fit in the text box's given area. However, the behaviour of this 'scale-to-fit' is inconsistent for certain fields: it does not always choose the largest font size to fit in the PDTextField. Might there be any further configuration that might allow for this to happen? Below are the PDFs where I've noticed this problem occurring.
Unfilled, with fonts loaded:
http://www.filedropper.com/0postfontload
Filled, with inconsisteny textbox text sizing:
http://www.filedropper.com/file_327
Side Note: I am using PDFBox through jruby which is just a integration layer that allows Ruby to invoke Java libraries. All java methods for the library available; a java method like thisExampleMethod would have a one-to-one translation into ruby this_example_method.
Updates
In response to comments, the appearances that are incorrect in the second uploaded file example are:
1st page Resident Name field (two text fields that have text that is too small for the given input field size)
2nd page Phone fields (four text fields that have text that overflows the given input field size)
Especially the appearances of the Resident Name fields, the Phone fields, and the Care Providers Address fields appear conspicuous. Only the former two are mentioned by the OP.
Let's inspect these fields; all screen shots are made using Adobe Reader DC on MS Windows:
The Resident Name fields
The filled in Resident Name fields look like this
While the height is appropriate, the glyphs are narrower than they should be. Actually this effect can already be seen in the original PDF:
This horizontal compression is caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
The widget rectangles: [ 45.72 601.44 118.924 615.24 ] and [ 119.282 601.127 192.486 614.927 ], i.e. 73.204*13.8 in both cases.
The appearance bounding box: [ 0 0 147.24 13.8 ], i.e. 147.24*13.8.
So they have the same height but the appearance bounding box is approximately twice as wide as the widget rectangle. Thus, the text drawn normally in the appearance stream gets compressed to half its width when the appearance is displayed in the widget rectangle.
When setting the value of a field PDFBox unfortunately re-uses the appearance stream as is and only updates details from the default appearance, i.e. font name, font size, and color, and the actual text value, apparently assuming the other properties of the appearance are as they are for a reason. Thus, the PDFBox output also shows this horizontal compression
To make PDFBox create a proper appearance, it is necessary to remove the old appearances before setting the new value.
The Phone fields
The filled in Phone fields look like this
and again there is a similar display in the original file
That only the first two letters are shown even though there is enough space for the whole word, is due to the configuration of these fields: They are configured as comb fields with a maximum length of 2 characters.
To have a value here set with PDFBox displayed completely and not so spaced out, you have to remove the maximum length (or at least have to make it no less than the length of your value) and unset the comb flag.
The Care Providers Address fields
Filled in they look like this:
Originally they look similar:
This vertical compression is again caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
A widget rectangle: [ 278.6 642.928 458.36 657.96 ], i.e. 179.76*15.032.
The appearance bounding box: [ 0 0 179.76 58.56 ], i.e. 179.76*58.56.
Just like in the case of the Resident Name fields above it is necessary to remove the old appearances before setting the new value to make PDFBox create a proper appearance.
A complication
Actually there is an additional issue when filling in the Care Providers Address fields, after removing the old appearances they look like this:
This is due to a shortcoming of PDFBox: These fields are configured as multi line text fields. While PDFBox for single line text fields properly calculates the font size based on the content and later finely makes sure that the text vertically fits quite well, it proceeds very crudely for multi line fields, it selects a hard coded font size of 12 and does not fine tune the vertical position, see the code of the AppearanceGeneratorHelper methods calculateFontSize(PDFont, PDRectangle) and insertGeneratedAppearance(PDAnnotationWidget, PDAppearanceStream, OutputStream).
As in your form these address fields anyways are only one line high, an obvious solution would be to make these fields single line fields, i.e. clear the Multiline flag.
Example code
Using Java one can implement the solutions explained above like this:
final int FLAG_MULTILINE = 1 << 12;
final int FLAG_COMB = 1 << 24;
PDDocument doc = PDDocument.load(originalStream);
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDType0Font font = PDType0Font.load(doc, fontStream, false);
String font_name = acroForm.getDefaultResources().add(font).getName();
for (PDField field : acroForm.getFieldTree()) {
if (field instanceof PDTextField) {
PDTextField textField = (PDTextField) field;
textField.getCOSObject().removeItem(COSName.MAX_LEN);
textField.getCOSObject().setFlag(COSName.FF, FLAG_COMB | FLAG_MULTILINE, false);;
textField.setDefaultAppearance(String.format("/%s 0 Tf 0 g", font_name));
textField.getWidgets().forEach(w -> w.getAppearance().setNormalAppearance((PDAppearanceEntry)null));
textField.setValue("Test");
}
}
(FillInForm test testFill0DropOldAppearanceNoCombNoMaxNoMultiLine)
Screen shots of the output of the example code
The Resident Name field value now is not vertically compressed anymore:
The Phone and Care Providers Address fields also look appropriate now:

Getting the width of a path drawn on a XAML canvas

I'm drawing a canvas programmatically, given a bunch of path data from somewhere else and adding it to the canvas as
// This is actually done more elaborately, but will do for now
PathFigureCollection figures = GetPathFigureCollection();
var path = new Path
{
Data = new PathGeometry { Figures = figures },
Fill = GetFill(),
Stroke = GetStroke(),
StrokeThickness = GetThickness()
};
MyCanvas.Children.Add(path);
Now, I have the canvas in a ScrollViewer, so I want to make sure that I can scroll all the way to reveal the entire path (actually paths - I have several, generated the same way) but no further. I tried this:
var drawingWidth = MyCanvas.Children
.OfType<FrameworkElement>()
.Max(e => Canvas.GetLeft(e) + e.ActualWidth);
MyCanvas.Width = drawingWidth;
This works well for some other elements (the drawing also has a few text blocks and ellipses), but for the paths both Canvas.GetLeft(e) and e.ActualWith (as well as some other things I tried like e.RenderSize.Width and e.DesiredSize.With) all return 0. Since the element that extends farthest to the right is a path, this results in a canvas that is too small.
How do I get the width of the Path elements too?
Ha, found it!
Rewriting the LINQ query as a loop, I could cast paths to Path, and use path.Data.Bounds.Right as the right edge of that element.** I might be able to convert the code back to a LINQ query now that I know what I want to do (I always find them more readable than stateful loops...).
I found this when I, after having perused the link provided by markE where, as a side note, it was stated that
If your design requirements allow more rough approximates, then you will find that cubic Bezier curves are always contained within their control points.
So, if I could find the right-most control point of all the path figures in my path, I would be home. Intellisense did the rest of the job for me :)

How to add text to an existing pdf at a fixed position?

I must insert a number at a fix position in an existing A4 pdf.
I've tried the following as a first test, but that doesn't work(not text is added).
What goes wrong?
Here's my code:
byte[] omrMarks = omrFrame.getOmrImage();
Jpeg img = new Jpeg(omrMarks);
PdfImportedPage page = stamper.getImportedPage(source, pageNum);
PdfContentByte pageContent = stamper.getOverContent(pageNum);
pageContent.addImage(
img, img.getWidth(), 0, 0, img.getHeight(), 15f, (page.getHeight() - 312));
pageContent.moveTo(10, 200);
pageContent.beginText();
pageContent.setLiteral("Test");
pageContent.endText();
There are many issues with this question.
This is certainly wrong:
pageContent.moveTo(10, 200);
pageContent.beginText();
pageContent.setLiteral("Test");
pageContent.endText();
The moveTo() method doesn't make sense; it has no effect on the text state object.
The text state object is illegal because there's no setFontAndSize() (it's very odd that this doesn't throw a RuntimeException, are you using an obsolete version of iText?)
The setLiteral() method should only be used to add some literal PDF syntax to a content stream.
For instance, something like:
pageContent.setLiteral("\n100 100 m\n100 200 l\nS\n");
should only be used if you understand that the following PDF syntax draws a line:
100 100 m
100 200 l
S
It's clear from your question that you don't understand PDF syntax, so you shouldn't use these methods. Instead you should use convenience methods such as the showTextAligned() method, which hide the complexity of PDF and save you a couple of lines.
Maybe you have a good reason to opt for the "hard way", but in that case, you should read the documentation, otherwise you'll continue using methods such as setLiteral() instead of showText(), moveTo() instead of moveText(), and so on, resulting in code you don't want your employer to see.
Furthermore, you're making the assumption that the lower left corner of the page has the coordinates (0,0). That's probably true for the majority of PDF documents found in the wild, but that's not true for all PDF documents. The MediaBox doesn't have to be [0 0 595 842], it could as well be [595 842 1190 1684]. Moreover: what if there's a CropBox? Maybe you're adding content that isn't visible because it's cropped away...

Create pdf with tooltips in R

Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
There's always a way. But the devil is in the details, so the real question is: are you willing to get your hands dirty?
PDF does support tooltips for certain kinds of objects such as hyperlinks. So, if there is a way to insert raw PDF statements indicating there should be a hyperlink-like object at some position in your plot, then there is a way to pop up a tooltip.
Now, the only way I know of to generate and compile raw PDF statements is to create the document using TeX. There are definitely other ways to do it, but this is the one I am familiar with. The following examples use a graphics device that Cameron Bracken and I wrote, the tikzDevice, to render R graphics to LaTeX code. The tikzDevice has preliminary support for injecting arbitrary LaTeX code into the graphics output stream through the tikzAnnotate function---we will be using this to drop PDF callouts into the plot.
The steps involved are:
Set up a LaTeX macro to generate the PDF commands required to produce a callout.
Set up an R function that uses tikzAnnotate to invoke the LaTeX macro at specific points in the plot.
???
Profit!
In the examples that follow, one major caveat is attached to step 2. The coordinate calculations used will only work with base graphics, not grid graphics such as ggplot2.
Simple Tooltips
Step 1
The tikzDevice allows you to create R graphics that include the execution of arbitrary LaTeX commands. Usually this is done to insert things like $\alpha$ into plot titles to generate greek letters, α, or complex equations. Here we are going to use this feature to invoke some raw PDF voodoo.
Any LaTeX macros that you wish to be available during the generation of a tikzDevice graphic need to be defined up-front by setting the tikzLatexPackages option. Here we are going to append some stuff to that declaration:
require(tikzDevice) # So that default options are set
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'), # The original contents: required stuff
# Avert your eyes for a sec, all will be explained below
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa\\setbox\\tempboxa=\\hbox{}\\immediate\\pdfxform\\tempboxa \\edef\\emptyicon{\\the\\pdflastxform}",
"\\newcommand\\tooltip[1]{\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R >>}\\tooltiptarget\\pdfendlink}"
))
If all that quoted nonsense were to be written out as LaTeX code by someone who cared about readability, it would look like this:
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\setbox\tempboxa=\hbox{}
\immediate\pdfxform\tempboxa
\edef\emptyicon{\the\pdflastxform}
\newcommand\tooltip[1]{%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
For those programmers who have never taken a walk on the wild side and done some "programming" in TeX, here's a blow-by-blow for the above code (as I understand it anyway, TeX can get very weird---especially when you get down in the trenches with it):
Line 1: Define an object, tooltiptarget, which is non-visible (a phantom) and is a 1mm x 1mm rectangle (a rule). This will be the onscreen area which we will use to detect mouseovers.
Line 2: Create a new box, which is like a "page fragment" of typset material. Can contain pretty much anything, including other boxes (sort of like an R list). Call it tempboxa.
Line 3: Assign the contents of tempboxa to contain an empty box that arranges its contents using a horizontal layout (which is unimportant, could have used a vbox or other box).
Line 4: Create a PDF Form XObject using the contents of tempboxa. A Form XObject can be used by PDF files to store graphics, like logos, that may be used over and over. Here we are creating a "blank icon" that we can use later to cut down on visual clutter. TeX defers output operations, like writing objects to a PDF file, until certain conditions have been met---such as a page has filled up. Immediate makes sure this operation is not deferred.
Line 5: This line captures an integer value that serves as a reference to the PDF XObject we just created and assigns it the name emptyicon.
Line 6: Starts the definition of a new macro called tooltip that takes 1 argument which is referred to in the body of the macro as #1. Each line in the macro ends with a comment character, %, to keep TeX from noticing the newlines that have been added for readability (newlines can have strange effects inside macros).
Line 7: Output raw PDF commands (pdfstartlink). This begins the creation of a new PDF annotation object (\Type \Annot) of which there are about 20 different subtypes---among them are hyperlinks. Every line following this contains raw PDF markup with a couple of TeX macros.
Line 8: Declare the annotation subtype we are going to use. Here I am going with a plain Text annotation such as a comment or sticky note.
Line 9: Declare the contents of the annotation. This will be the contents of our tooltip and is set to #1, the argument to the tooltip macro.
Lines 10-12: Normally text annotations are marked by an icon, such as a sticky note, to highlight their location in the text. This behavior will cause a visual mess if we allow it to splatter little sticky notes all over our graphs. Here we use an appearance array (\AP << >>) set the "normal" annotation icon (\N) to be the blank icon we created earlier. The integer we stored in emptyicon along with 0 R forms a reference to the Form XObject we made on Line 4 using an empty box.
Line 14: If we were making a normal hyperlink, here is where the text/image/whatever would go that would serve as the link body. Instead we insert tooltiptarget, our invisible phantom object which does not show up on the screen but does react to mouseovers.
Step 2
Allright, now that we have told LaTeX how to create tooltips, it is time to make them usable from R. This involves writing a function that will take coordinates on our graph, such as (1,1), and convert them into canvas or "device" coordinates. In the case of the tikzDevice the required measurement is "TeX points" (1/72.27 of an inch) from the absolute bottom left of the plotting area. Fortunately for base graphics, there are handy functions to calculate this for us. Grid graphics work differently, so the approach taken in the examples here won't work for them.
The final task for our R function is to call tikzAnnotate to insert a TikZ "node" into the output stream that is located at the coordinates we computed. Nodes can contain arbitrary TeX commands, in this case we will be calling upon tooltip.
Here is an R function that contains the above functionality:
place_PDF_tooltip <- function(x, y, text){
# Calculate coordinates
tikzX <- round(grconvertX(x, to = "device"), 2)
tikzY <- round(grconvertY(y, to = "device"), 2)
# Insert node
tikzAnnotate(paste(
"\\node at (", tikzX, ",", tikzY, ") ",
"{\\tooltip{", text, "}};",
sep = ''
))
invisible()
}
Step 3
Try it out on a plot:
# standAlone creates a complete LaTeX document. Default output makes
# stripped-down graphs ment for inclusion in other LaTeX documents
tikz('tooltips_ahoy.tex', standAlone = TRUE)
plot(1,1)
place_PDF_tooltip(1,1, 'Hello, World!')
dev.off()
require(tools)
texi2dvi('tooltips_ahoy.tex', pdf = TRUE)
Step 4
Behold the result (download a pdf):
Advanced Tooltips
Step 1
So, now that we have simple tooltips out of the way, why not crank it to 11? In the previous example, we used an empty hbox to get rid of the tooltip icon. But what if we had put something in that box, like text or a drawing? And what if there was a way to make it so that the icon only appeared during mouseover events?
The following TeX macro is a little rough around the edges, but it shows that this is possible:
\usetikzlibrary{shapes.callouts}
\tikzset{tooltip/.style = {
rectangle callout,
draw,
callout absolute pointer = {(-2em, 1em)}
}}
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\newcommand\tooltip[1]{%
\def\tooltipcallout{\tikz{\node[tooltip]{#1};}}%
\setbox\tempboxa=\hbox{\phantom{\tooltipcallout}}%
\immediate\pdfxform\tempboxa%
\edef\emptyicon{\the\pdflastxform}%
\setbox\tempboxa=\hbox{\tooltipcallout}%
\immediate\pdfxform\tempboxa%
\edef\tooltipicon{\the\pdflastxform}%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
/R \tooltipicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
The following modifications have been made compared to the simple callout.
The shapes.callouts library is loaded which contains templates for TikZ to use when drawing callout boxes.
A tooltip style is defined which contains some TikZ graphics boilerplate. It specifies a rectangular callout box that is to be visible (draw). The callout absolute pointer business is a hack because I've had too many beers by this point to figure out how to place annotation icons using dynamically generated PDF primitives. This relies on the default anchoring of icons at their upper left corner and so pulls the pointer of the callout box toward that location. The result is that the boxes will always appear to the lower right of the pointer and if the callout text is long enough, they won't look right.
Inside the macro, the tooltip is generated using a one-shot tikz command that is stuffed into the tooltipcallout macro. A form XObject is generated from tooltipcallout and assigned to tooltipicon.
emptyicon is also dynamically generated by evaluating tooltipcallout inside of phantom. This is required because the size of the default icon apparently sets the viewport available for the rollover icon.
When generating PDF commands, a new row is added to the /AP array, /R for rollover, that uses the XObject referenced by tooltipicon.
The ready to consume R version is:
require(tikzDevice)
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'),
"\\usetikzlibrary{shapes.callouts}",
"\\tikzset{tooltip/.style = {rectangle callout,draw,callout absolute pointer = {(-2em, 1em)}}}",
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa",
"\\newcommand\\tooltip[1]{\\def\\tooltipcallout{\\tikz{\\node[tooltip]{#1};}}\\setbox\\tempboxa=\\hbox{\\phantom{\\tooltipcallout}}\\immediate\\pdfxform\\tempboxa\\edef\\emptyicon{\\the\\pdflastxform}\\setbox\\tempboxa=\\hbox{\\tooltipcallout}\\immediate\\pdfxform\\tempboxa\\edef\\tooltipicon{\\the\\pdflastxform}\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R/R \\tooltipicon\\space 0 R>>}\\tooltiptarget\\pdfendlink}"
))
Step 2
The R-level code is unchanged.
Step 3
Let's try a slightly more complicated graph:
tikz('tooltips_with_callouts.tex', standAlone = TRUE)
x <- 1:10
y <- runif(10, 0, 10)
plot(x,y)
place_PDF_tooltip(x,y,x)
dev.off()
require(tools)
texi2dvi('tooltips_with_callouts.tex', pdf = TRUE)
Step 4
The result (download a PDF):
As you can see, there is an issue with both the tooltip and the callout being displayed. Setting \Contents () so that the tooltip has an empty string won't help anything. This can probably be solved by using a different annotation type, but I'm not going to spend any more time on it at the moment.
Caveats
Lots of TeX commands contain backslashes, you will need to double the backslashes when expressing things in R code.
Certain characters are special to TeX, such as _, ^, %, complete list here, you will need to ensure these are properly escaped when using the tikzDevice.
Even though PDF is supposed to be superior to HTML in that it has a consistant rendering across platforms, your mileage will vary significantly depending on which viewer is being used. The screenshots were taken in Acrobat 8 on OS X, Preview also did a passable job but did not render the rollover callout. On Linux, xpdf didn't render anything and okular showed a tooltip, but did not suppress the tooltip icon and displayed a stock icon that looked a little garish in the middle of a plot.
Alternative Implementations
cooltooltips and fancytooltips are LaTeX packages that provide tooltip functionality that could probably be used from the tikzDevice. Some ideas used in the above examples were taken from cooltooltips.
Concluding Thoughts
Was this worth it? Well, by the end of the day I did learn two things:
I am not a PDF expert and I had to search a couple mailing lists and digest some parts of a 700+ page specification to even answer the question "is this possible?" let alone come up with an implementation.
I am not a SVG expert and yet I know that tooltips in SVG is not a question of "is this possible?" but rather "how sexy do you want it to look?".
Given that observation, I say that it is getting to be time for PDF to ride off into the sunset and take it's 700 page spec with it. We need a new page description markup language and personally I'm hoping SVG Print will be complete enough to do the job. I would even accept Adobe Mars if the specification is accesible enough. Just give us an XML-based format that can exploit the full power of today's JavaScript libraries and then some real innovation can begin. A LuaTeX backend would be a nice bonus.
References
tikzDevice: A device for outputting R graphics as LaTeX code.
comp.text.tex: Source of much insight into esoteric TeX details. Posts by Herbert Voß and Mycroft provided implementation ideas.
The pdfTeX manual: Source of information concerning TeX commands that can generate raw PDF code, such as \pdfstartlink
TeX by Topic: Go-to manual for low-level TeX details such as how \immediate actually works.
The Adobe PDF Specification: The lair of details concerning PDF primitives.
The TikZ Manual: Quite possibly the finest example of open source documentation ever written.
Well, not pdf, but you could include tooltips (among ohers) in svg format with SVGAnnotation or RSVGTipsDevice package.
The pdf2 package works fine for me. (https://r-forge.r-project.org/R/?group_id=193). You can just include a 'popup' in the regular text command. It's not current as of 2.14, but I imagine he'll get round to it before too long.
text(x,y,'hard copy text',popup='tooltip text')