itextsharp not working on converted pdf from msoffice applications - vb.net

I'm confused about why iTextSharp can't read or extract images from a PDF that was converted from an MS Office document (Word, Excel, PowerPoint).
Here's what I did: I opened an MS Word file, converted it to PDF, and then read the PDF using iTextSharp. It doesn't recognize that the PDF contains an image or shape.
I also tried converting from PowerPoint to PDF; when reading that PDF, the images aren't found either.
Here's the code (edited; it now appears below the images).
This is the image that can't be extracted:
This is the image I tested a while ago that works. I don't know why the other image can't be detected or causes an error.
As of now I have changed the code to the following, but it still doesn't detect that the circle shape is an image:
For pn As Integer = 1 To pc
    Dim pg As PdfDictionary = pdfr.GetPageN(pn)
    Dim res As PdfDictionary = DirectCast(PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)), PdfDictionary)
    Dim xobj As PdfDictionary = DirectCast(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)), PdfDictionary)
    MessageBox.Show("THE ERROR IS HERE, IT BYPASS, SO XOBJ IS NOTHING IN THAT IMAGE")
    If xobj IsNot Nothing Then
        For Each name As PdfName In xobj.Keys
            Dim obj As PdfObject = xobj.Get(name)
            If obj.IsIndirect() Then
                Dim tg As PdfDictionary = DirectCast(PdfReader.GetPdfObject(obj), PdfDictionary)
                Dim type As PdfName = DirectCast(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)), PdfName)
                Dim XrefIndex As Integer = Convert.ToInt32(DirectCast(obj, PRIndirectReference).Number.ToString(System.Globalization.CultureInfo.InvariantCulture))
                Dim pdfObj As PdfObject = pdfr.GetPdfObject(XrefIndex)
                Dim pdfStrem As PdfStream = DirectCast(pdfObj, PdfStream)
                If PdfName.IMAGE.Equals(type) Then
                    Dim bytes As Byte() = PdfReader.GetStreamBytesRaw(DirectCast(pdfStrem, PRStream))
                    If (bytes IsNot Nothing) Then
                        Dim strat As New ImageInfoTextExtractionStrategy()
                        iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(pdfr, pn, strat)
                    End If
                End If
            End If
        Next
    End If
Next

Why your current code does not find or extract those shapes:
The smiley image and the flower image are completely different in nature: The flower image is a bitmap image stored in the PDF as an /XObject (eXternal Object) of subtype /Image, while the smiley is a vector image stored in the PDF as part of the page content stream as a (not necessarily contiguous) sequence of path definition and drawing operations.
Your code only searches for bitmap images stored as external objects, and it does so in a somewhat convoluted way: It first scans for image xobjects using low-level methods, and only if it finds such an xobject does it employ the iText high-level extraction capabilities. If it started out using only the iText image extraction capabilities, it would be less complex, and at the same time it would also recognize inlined bitmap images.
You might want to look at the iText in Action — 2nd Edition chapter 15 Webified iTextSharp Examples, especially ExtractImages.cs which utilizes MyImageRenderListener.cs for this. Taking inspiration from that code could improve your current code, but it won't help you with your issue at hand.
What you have to do to find or extract the shapes using iText:
Unfortunately your question is not entirely clear on what you actually are trying to achieve.
Do you merely need to detect whether there is some image (bitmap or vector graphic) on some page?
Do you need some information on the image, e.g. size or position on the page?
Or do you actually want to extract the image?
While these objectives can easily be implemented for bitmap graphics using the aforementioned iText high-level extraction capabilities, they are fairly difficult to fulfill for vector graphics.
For generic PDFs they are virtually impossible to implement because the drawing operations for an individual figure need not be together, and even worse, the drawing operations for different figures on the same page, for underlines on the page, and for other graphic effects might even be mixed in one big heap of operations in seemingly random order.
In your case, though, you have one advantage: Office seems to properly tag the figures in the PDF. This at least makes detection of the number of different (i.e. differently tagged) vector graphics on a page easy and also allows for the differentiation which drawing operation belongs to which figure.
Thus, here are some pointers on how to achieve the goals mentioned above for PDFs tagged like your sample PDF. As I'm not using VB myself, I don't have sample code. But as your sample code shows that you already know how to follow object references and how to interpret PDF object information, these pointers should suffice to show the way.
1. Detecting whether there is some image on some page.
As the page content is tagged, it suffices to scan the structure hierarchy starting from the /StructTreeRoot entry in the document catalogue (use PdfReader.Catalog, select the value of PdfName.STRUCTTREEROOT in it, and dig into it).
E.g. for page 1 (in PDF object 4 0) of your sample (with the "1233" at the top and the smiley below) you'll find an array with dictionaries:
<<
/Pg 4 0 R
/K [0]
/S /P
/P 24 0 R
>>
and
<<
/Pg 4 0 R
/K [1]
/Alt ()
/S /Figure
/P 22 0 R
>>
each of which references the page (/Pg 4 0 R). The first one is of type /P, a paragraph (your "1233"), and the second one is of type /Figure, a figure (your smiley). The presence of the second element indicates the presence of a figure on the page. Thus, the goal 1 is achieved already with these data.
(For details cf. the PDF specification ISO 32000-1:2008 section 14.7 and 14.8.)
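The scan described above can be sketched independently of any PDF library. The following Python sketch is only an illustration: it models already-resolved structure elements as plain dictionaries (an assumption; with iText you would follow the indirect references via PdfReader instead), and the function name count_figures and the /Document wrapper element are hypothetical.

```python
def count_figures(struct_elem, counts=None):
    """Recursively count /Figure structure elements per page reference."""
    if counts is None:
        counts = {}
    if isinstance(struct_elem, dict):
        # A structure element of type /Figure that points to a page via /Pg.
        if struct_elem.get("/S") == "/Figure" and "/Pg" in struct_elem:
            page = struct_elem["/Pg"]
            counts[page] = counts.get(page, 0) + 1
        # /K holds the kids: either a single kid or an array of kids.
        kids = struct_elem.get("/K", [])
        if not isinstance(kids, list):
            kids = [kids]
        for kid in kids:
            # Integer kids are marked content IDs, not elements; they are skipped.
            count_figures(kid, counts)
    return counts


# Hypothetical, already-resolved model of the sample's structure tree.
struct_tree_root = {
    "/Type": "/StructTreeRoot",
    "/K": [
        {"/S": "/Document", "/K": [
            {"/S": "/P", "/Pg": "4 0 R", "/K": [0]},
            {"/S": "/Figure", "/Pg": "4 0 R", "/Alt": "", "/K": [1]},
        ]},
    ],
}

figures_per_page = count_figures(struct_tree_root)
print(figures_per_page)
```

A page whose reference appears in the result carries at least one tagged figure.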
2. Retrieving some information on the image, e.g. size or position on the page.
For this you have to extract the graphics operators responsible for creating the figure in question. As it is tagged, you have to extract the operators in a marked content block associated with the marked content ID given by /K [1] in the /Figure dictionary above, i.e. 1.
In the content stream you'll find this:
/P <</MCID 1>> BDC 0.31 0.506 0.741 rg
108.6 516.6 m
108.6 569.29 160.18 612 223.8 612 c
287.42 612 339 569.29 339 516.6 c
339 463.91 287.42 421.2 223.8 421.2 c
160.18 421.2 108.6 463.91 108.6 516.6 c
h
f*
[...]
108.6 516.6 m
108.6 569.29 160.18 612 223.8 612 c
287.42 612 339 569.29 339 516.6 c
339 463.91 287.42 421.2 223.8 421.2 c
160.18 421.2 108.6 463.91 108.6 516.6 c
h
S
EMC
This section between BDC for /MCID 1 and EMC contains the graphics operations you seek. If you want to get some information on the figure they represent, you have to analyze them.
This is a very low-level view of all this, and one might wish for a higher-level API to retrieve this.
iText does have a high-level API for the analogous operations for text and bitmap image processing, using the parser namespace class PdfReaderContentParser together with some apropos RenderListener implementation like your ImageInfoTextExtractionStrategy. Unfortunately, though, PdfReaderContentParser does not yet properly pre-process the vector graphics related operators.
To do this with iText, therefore, you either have to extend the underlying PdfContentStreamProcessor to add the missing pre-processing (which is do-able as that parser class is implemented using separate listeners for the individual operations, and you can easily register new listeners for the graphics operators); or you have to retrieve the page content and parse it yourself.
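As an illustration of the "parse it yourself" route, here is a minimal Python sketch that works on an already decoded content stream given as text. It is a sketch under assumptions, not a full parser: the function name marked_content_bbox is hypothetical, the current transformation matrix (cm) is ignored, nested marked content is not handled, and Bézier control points are included, so the result is an upper bound of the figure's extent.

```python
import re


def marked_content_bbox(content, mcid):
    """Return (min_x, min_y, max_x, max_y) over all path points inside the
    marked content block tagged with the given MCID, or None if not found."""
    pattern = re.compile(r"<<\s*/MCID\s+%d\s*>>\s*BDC(.*?)EMC" % mcid, re.S)
    match = pattern.search(content)
    if match is None:
        return None
    xs, ys, operands = [], [], []
    for token in match.group(1).split():
        try:
            operands.append(float(token))  # numeric operand: collect it
            continue
        except ValueError:
            pass  # not a number, so it is an operator
        if token in ("m", "l", "c", "v", "y"):
            # Operands are x/y pairs (end points and control points).
            xs.extend(operands[0::2])
            ys.extend(operands[1::2])
        elif token == "re" and len(operands) == 4:
            x, y, w, h = operands
            xs.extend([x, x + w])
            ys.extend([y, y + h])
        operands = []  # every operator consumes its operands
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


sample = """/P <</MCID 1>> BDC 0.31 0.506 0.741 rg
108.6 516.6 m
108.6 569.29 160.18 612 223.8 612 c
287.42 612 339 569.29 339 516.6 c
339 463.91 287.42 421.2 223.8 421.2 c
160.18 421.2 108.6 463.91 108.6 516.6 c
h
f*
EMC"""

bbox = marked_content_bbox(sample, 1)
print(bbox)
```

For the first path of the sample block this gives the extent of the filled circle in default user space units.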
3. Extracting the image.
As the vector images inside a PDF use PDF specific vector graphics operators, you first have to decide in which format you want to export the image. Unless you are interested in the raw PDF operators, you will most likely require some library helping you to create a file in the desired format.
Once that is decided, you first extract the graphics operators in question as explained before and then feed them to that library to create an exportable image of your choice.


iText throws ClassCastException: PdfNumber cannot be cast to PdfLiteral

I am using iText v5.5.1 to read a PDF and extract text from it:
pdfReader = new PdfReader(new CloseShieldInputStream(is));
pdfParser = new PdfReaderContentParser(pdfReader);
int maxPageNumber = pdfReader.getNumberOfPages();
int pageNumber = 1;
StringBuilder sb = new StringBuilder();
SimpleTextExtractionStrategy extractionStrategy = new SimpleTextExtractionStrategy();
while (pageNumber <= maxPageNumber) {
    pdfParser.processContent(pageNumber, extractionStrategy);
    sb.append(extractionStrategy.getText());
    pageNumber++;
}
On one PDF file the following exception is thrown:
java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNumber cannot be cast to com.itextpdf.text.pdf.PdfLiteral
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:382)
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:80)
That PDF file seems to be broken, but maybe its contents still makes sense...
Indeed, "That PDF file seems to be broken".
The content streams of all pages look like this:
/GS1 gs
q
595.00 0 0
It looks like they are all cut off early, as the last line is not a complete operation. This certainly can make a parser hiccup, as iText does.
Furthermore, the content should be longer, because even the compressed stream is a bit larger than this decoded content. This indicates streams broken on the byte level.
Looking at the bytes of the PDF file one cannot help but notice that
even inside binary streams the codes 13 and 10 only occur together and
cross-reference offset values are less than the actual positions.
So I assume that this PDF has been transmitted using a transport method handling it as textual data, especially replacing any kind of assumed line break (CR or LF or CR LF) with the CR LF now omnipresent in the file (CR = Carriage Return = 13; LF = Line Feed = 10). Such replacements will automatically break any compressed data stream like the content streams in your file.
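The first of the two observations can be checked mechanically. The following Python sketch is a heuristic only (the function name is hypothetical): it reports whether carriage returns and line feeds occur exclusively as CR LF pairs. A small PDF without binary streams can legitimately look like this, so the result is only telling for files with sizeable compressed streams.

```python
def looks_text_mode_mangled(data: bytes) -> bool:
    """Heuristic: True if the data contains line breaks and every CR (13)
    and every LF (10) is part of a CR LF pair."""
    cr = data.count(b"\r")
    lf = data.count(b"\n")
    crlf = data.count(b"\r\n")
    # If each CR is followed by LF and each LF is preceded by CR,
    # all three counts are identical.
    return crlf > 0 and cr == crlf and lf == crlf


# Binary-looking data with a lone LF is what an intact stream tends to contain.
intact = b"%PDF-1.4\r\nstream\r\n\x00\x0a\xff\x0d\x01\r\nendstream\r\n"
mangled = b"%PDF-1.4\r\nstream\r\n\x00\r\n\xff\r\n\x01\r\nendstream\r\n"

print(looks_text_mode_mangled(intact), looks_text_mode_mangled(mangled))
```

To apply it to a file, read it in binary mode, e.g. `open(path, "rb").read()`, and pass the bytes to the function.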
Unfortunately, though, as for "but maybe its contents still makes sense": not much. There is one big image associated with each page. Considering the small size of the content streams and the large image size, I would assume that the PDF only contains scanned pages. But the images are also broken due to the replacements mentioned above.
This isn't the best solution, but I had this exact problem and unfortunately can't share the exact PDFs I was having issues with.
I made a fork of itextpdf that catches the ClassCastException and just skips PdfObjects that it takes issue with. It prints to System.out what the text contained and what type itextpdf thinks it was. I haven't been able to map this out to some systemic problem with my PDFs (someone smarter than me will need to do that), and this exception only happens once in a blue moon. Anyway, in case it helps anyone, this fork at least doesn't crash your code, lets you parse the majority of your PDFs, and gives you a bit of info on what types of bytestrings seem to give itextpdf indigestion.
https://github.com/njhwang/itextpdf

First page of PDF's uploaded to web server appearing blank

I've uploaded the PDFs to a client's web server (using the upload in cPanel and FileZilla), and the first page of each multi-page PDF appears blank.
http://red-rockfinancial.com/downloads/Critical%20Illness.pdf
The text is there: when you highlight and copy, you do copy the text. But I've tried everything I can think of, and I can't get the PDFs to upload correctly.
It's worth noting that pdf's have been correctly uploaded to this site several times, however this particular "batch" of pdf is the only set having this issue. I've renamed, and resaved the files several times with no luck.
That PDF seems to be made to test the limits of PDF viewers.
PDFs have a page dictionary for each page. This dictionary has an entry Contents whose value is a stream or an array of streams describing the content of the page.
A part of the content of the page may be moved into a separate object, though, a so called form xobject, which merely is referenced from the page content. This allows the easy re-use of these content parts as such a xobject can be referenced from many pages. It also allows for grouping of content to allow for special transparency effects.
Form xobjects in their contents can again reference other xobjects.
The first page of your document uses this structure to the extreme. Its content is
q Q q 0 0 595 841 re W n /Fm1 Do Q
i.e. it saves the current graphic state, sets a clip path around the border, includes form xobject Fm1, and restores the graphic state.
Fm1 has the content
q Q q 0 0 595 841 re W n /Fm2 Do Q
i.e. it saves the current graphic state, sets a clip path around the border, includes form xobject Fm2, and restores the graphic state.
Fm2 has the content
q Q q 0 0 595 814 re W n /Fm3 Do Q
...
Fm29 has the content
q Q q 0 0 595 814 re W n /Fm30 Do Q
And finally Fm30 has the actual page content.
The second page, on the other hand, immediately describes the page in its Contents stream.
PDF viewers do have certain limits concerning the nesting of xobjects and, more specifically, the stacking of graphic states.
E.g. the nesting as observed above in essence means that the graphic state is saved 30 times, then the actual content is drawn, then the graphic state is restored 30 times. But the PDF specification ISO 32000-1 in section C.2 Architectural limits (which describes the minimum architectural limits that should be accommodated by conforming readers) gives a limit of 28 for q/Q nesting.
Thus, a conforming PDF viewer does not need to support your PDF as it nests graphic state saving operations deeper than the specification requires it to support.
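To make the counting concrete, here is a small Python sketch that follows the Do references and tracks how deep the explicit q operators stack up. It is a simplified model under assumptions: the content streams are given as plain text in a dictionary, the names (max_q_depth, streams) are hypothetical, the placeholder content of Fm30 is invented, cyclic references are not guarded against, and the implicit graphics state save a viewer performs when painting a form xobject is not counted.

```python
Q_NESTING_LIMIT = 28  # q/Q nesting limit cited above from ISO 32000-1, C.2


def max_q_depth(stream_name, streams, depth=0):
    """Return the deepest explicit q nesting reached while interpreting
    stream_name and all form xobjects it paints via Do."""
    deepest = depth
    tokens = streams[stream_name].split()
    for i, token in enumerate(tokens):
        if token == "q":
            depth += 1
            deepest = max(deepest, depth)
        elif token == "Q":
            depth -= 1
        elif token == "Do" and i > 0:
            name = tokens[i - 1].lstrip("/")
            if name in streams:
                # The xobject is painted at the current nesting depth.
                deepest = max(deepest, max_q_depth(name, streams, depth))
    return deepest


# Model of the first page as described above: page -> Fm1 -> ... -> Fm30.
streams = {"page": "q Q q 0 0 595 841 re W n /Fm1 Do Q"}
for n in range(1, 30):
    streams["Fm%d" % n] = "q Q q 0 0 595 841 re W n /Fm%d Do Q" % (n + 1)
streams["Fm30"] = "BT ET"  # placeholder standing in for the actual page content

depth_reached = max_q_depth("page", streams)
print(depth_reached, depth_reached > Q_NESTING_LIMIT)
```

In this model the nesting reaches 30, which exceeds the limit of 28.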

Programmatically change the color of a black box in a PDF file?

I have a PDF file generated by Microsoft Word. The user has specified a "highlight" color of black to make the text look like it's a black box (and make the text look like it has been redacted). I'd like to change the black boxes to yellow so that the text is highlighted instead.
Ideally, I'd like to do this in Python.
Thanks!
Option 1: If a commercial library is an option, you can easily implement this with Amyuni PDF Creator .Net, the C# code would look like this:
using System.IO;
using Amyuni.PDFCreator;
using System.Collections;
//open a pdf document
FileStream testfile = new FileStream("test1.pdf", FileMode.Open, FileAccess.Read, FileShare.Read);
IacDocument document = new IacDocument(null);
document.Open(testfile, "");
//get the first page
IacPage page1 = document.GetPage(1);
//get all graphic objects on the page
IacAttribute attribute = page1.AttributeByName("Objects");
// listobj is an arraylist of objects
ArrayList listobj = (ArrayList)attribute.Value;
foreach (IacObject iacObj in listobj)
{
    // if the object is a rectangle and the background color is black then set it to yellow
    if ((IacObjectType)iacObj.AttributeByName("ObjectType").Value == IacObjectType.acObjectTypeFrame
        && (int)iacObj.Attribute("BackColor").Value == 0)
    {
        iacObj.Attribute("BackColor").Value = 0x00FFFF; // Yellow
    }
}
I suppose you could translate this to IronPython instead.
Usual disclaimer applies for this suggestion
Option 2: If a commercial library is not an option and you are not developing a commercial closed-source application, you could try a bit of unreliable hacking on the page content using iText:
You can try decoding the page content (see ContentByteUtils class in iText for details), inserting a color selection operator before every fill operator, then resave the file. For more details on these operators see the TABLE 4.10 Path-painting operators of the Adobe PDF reference document.
Operator f: Fills the path, using the nonzero winding number rule to determine the region to fill (see “Nonzero Winding Number Rule” on page 232).
Operator rg: Sets the nonstroking color space to DeviceRGB and sets the nonstroking color to the specified value.
Operator q: Saves the current graphics state.
Operator Q: Restores the saved graphics state.
So if you have a sequence of operators on your page:
0.0 0.0 0.0 rg % Set nonstroking color to black
25 175 175 -150 re % Construct rectangular path
f % Fill path
It should become:
0.0 0.0 0.0 rg % Set nonstroking color to black
25 175 175 -150 re % Construct rectangular path
q % Saves the current graphic state
1.0 1.0 0.0 rg % Set nonstroking color to yellow
f % Fill path
Q % Restores the saved graphic state
Some remarks:
- This approach will turn every non-text drawing yellow (including lines, curves, etc., but excluding raster images), and it will also draw in yellow any text that is painted on the page using the same drawing operators as other PDF drawings.
- Form XObjects (XForms) and annotations used on the page will not be processed.
- If the documents you will process are produced by the same tool in the same way, you may just test a few files and see how it goes.
Important: This is just an untested idea off the top of my head; it may work, or it may not.
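Since the question asked for Python, the rewriting step of Option 2 could be sketched as follows. This is only an illustration of the text transformation on an already decoded content stream; it assumes one operator per line as in the example above, and the function name highlight_fills is hypothetical. Note also that, strictly read, the PDF graphics object model does not allow graphics state operators between path construction and painting, so some viewers may object to the result; treat it as experimental, like the idea itself.

```python
def highlight_fills(content, rgb=(1.0, 1.0, 0.0)):
    """Wrap every fill operator (f, F, f*) in q ... Q and select the given
    nonstroking colour right before it. Assumes one operator per line."""
    colour = "%s %s %s rg" % rgb
    out = []
    for line in content.splitlines():
        # Ignore a trailing % comment when deciding what the line contains.
        code = line.split("%", 1)[0].strip()
        if code in ("f", "F", "f*"):
            out.extend(["q", colour, code, "Q"])
        else:
            out.append(line)
    return "\n".join(out)


original = "0.0 0.0 0.0 rg\n25 175 175 -150 re\nf"
rewritten = highlight_fills(original)
print(rewritten)
```

Decoding the page content and writing the result back into the PDF still has to be done with a PDF library and is not shown here.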

How to add text to an existing pdf at a fixed position?

I must insert a number at a fixed position in an existing A4 PDF.
I've tried the following as a first test, but it doesn't work (no text is added).
What goes wrong?
Here's my code:
byte[] omrMarks = omrFrame.getOmrImage();
Jpeg img = new Jpeg(omrMarks);
PdfImportedPage page = stamper.getImportedPage(source, pageNum);
PdfContentByte pageContent = stamper.getOverContent(pageNum);
pageContent.addImage(
img, img.getWidth(), 0, 0, img.getHeight(), 15f, (page.getHeight() - 312));
pageContent.moveTo(10, 200);
pageContent.beginText();
pageContent.setLiteral("Test");
pageContent.endText();
There are many issues with this question.
This is certainly wrong:
pageContent.moveTo(10, 200);
pageContent.beginText();
pageContent.setLiteral("Test");
pageContent.endText();
The moveTo() method doesn't make sense; it has no effect on the text state object.
The text state object is illegal because there's no setFontAndSize() (it's very odd that this doesn't throw a RuntimeException, are you using an obsolete version of iText?)
The setLiteral() method should only be used to add some literal PDF syntax to a content stream.
For instance, something like:
pageContent.setLiteral("\n100 100 m\n100 200 l\nS\n");
should only be used if you understand that the following PDF syntax draws a line:
100 100 m
100 200 l
S
It's clear from your question that you don't understand PDF syntax, so you shouldn't use these methods. Instead you should use convenience methods such as the showTextAligned() method, which hide the complexity of PDF and save you a couple of lines.
Maybe you have a good reason to opt for the "hard way", but in that case, you should read the documentation, otherwise you'll continue using methods such as setLiteral() instead of showText(), moveTo() instead of moveText(), and so on, resulting in code you don't want your employer to see.
Furthermore, you're making the assumption that the lower left corner of the page has the coordinates (0,0). That's probably true for the majority of PDF documents found in the wild, but that's not true for all PDF documents. The MediaBox doesn't have to be [0 0 595 842], it could as well be [595 842 1190 1684]. Moreover: what if there's a CropBox? Maybe you're adding content that isn't visible because it's cropped away...
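The coordinate issue can be illustrated with a little arithmetic. The following Python sketch (the function name is hypothetical) translates a position measured from the lower left corner of the visible page area into the coordinates that have to be used in the content stream. It is a simplification under assumptions: page rotation is ignored, and the CropBox is taken as given rather than intersected with the MediaBox.

```python
def visible_to_user_space(x, y, media_box, crop_box=None):
    """Translate (x, y), measured from the lower left corner of the visible
    page area, into default user space coordinates."""
    llx, lly, urx, ury = crop_box if crop_box is not None else media_box
    # The two corners of a box may be given in any order, so normalise them.
    left, bottom = min(llx, urx), min(lly, ury)
    return (left + x, bottom + y)


# The usual case: the origin is the lower left corner, nothing changes.
print(visible_to_user_space(10, 200, (0, 0, 595, 842)))
# A shifted MediaBox: the same visual position has other coordinates.
print(visible_to_user_space(10, 200, (595, 842, 1190, 1684)))
# A CropBox hides the margins, so the visible corner is at (50, 50).
print(visible_to_user_space(10, 200, (0, 0, 595, 842), (50, 50, 545, 792)))
```

With iText, the box values themselves would come from the reader for the page in question.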

Create pdf with tooltips in R

Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
There's always a way. But the devil is in the details, so the real question is: are you willing to get your hands dirty?
PDF does support tooltips for certain kinds of objects such as hyperlinks. So, if there is a way to insert raw PDF statements indicating there should be a hyperlink-like object at some position in your plot, then there is a way to pop up a tooltip.
Now, the only way I know of to generate and compile raw PDF statements is to create the document using TeX. There are definitely other ways to do it, but this is the one I am familiar with. The following examples use a graphics device that Cameron Bracken and I wrote, the tikzDevice, to render R graphics to LaTeX code. The tikzDevice has preliminary support for injecting arbitrary LaTeX code into the graphics output stream through the tikzAnnotate function---we will be using this to drop PDF callouts into the plot.
The steps involved are:
Set up a LaTeX macro to generate the PDF commands required to produce a callout.
Set up an R function that uses tikzAnnotate to invoke the LaTeX macro at specific points in the plot.
???
Profit!
In the examples that follow, one major caveat is attached to step 2. The coordinate calculations used will only work with base graphics, not grid graphics such as ggplot2.
Simple Tooltips
Step 1
The tikzDevice allows you to create R graphics that include the execution of arbitrary LaTeX commands. Usually this is done to insert things like $\alpha$ into plot titles to generate greek letters, α, or complex equations. Here we are going to use this feature to invoke some raw PDF voodoo.
Any LaTeX macros that you wish to be available during the generation of a tikzDevice graphic need to be defined up-front by setting the tikzLatexPackages option. Here we are going to append some stuff to that declaration:
require(tikzDevice) # So that default options are set
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'), # The original contents: required stuff
# Avert your eyes for a sec, all will be explained below
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa\\setbox\\tempboxa=\\hbox{}\\immediate\\pdfxform\\tempboxa \\edef\\emptyicon{\\the\\pdflastxform}",
"\\newcommand\\tooltip[1]{\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R >>}\\tooltiptarget\\pdfendlink}"
))
If all that quoted nonsense were to be written out as LaTeX code by someone who cared about readability, it would look like this:
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\setbox\tempboxa=\hbox{}
\immediate\pdfxform\tempboxa
\edef\emptyicon{\the\pdflastxform}
\newcommand\tooltip[1]{%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
For those programmers who have never taken a walk on the wild side and done some "programming" in TeX, here's a blow-by-blow for the above code (as I understand it anyway, TeX can get very weird---especially when you get down in the trenches with it):
Line 1: Define an object, tooltiptarget, which is non-visible (a phantom) and is a 1mm x 1mm rectangle (a rule). This will be the onscreen area which we will use to detect mouseovers.
Line 2: Create a new box, which is like a "page fragment" of typeset material. It can contain pretty much anything, including other boxes (sort of like an R list). Call it tempboxa.
Line 3: Assign the contents of tempboxa to contain an empty box that arranges its contents using a horizontal layout (which is unimportant, could have used a vbox or other box).
Line 4: Create a PDF Form XObject using the contents of tempboxa. A Form XObject can be used by PDF files to store graphics, like logos, that may be used over and over. Here we are creating a "blank icon" that we can use later to cut down on visual clutter. TeX defers output operations, like writing objects to a PDF file, until certain conditions have been met---such as a page has filled up. Immediate makes sure this operation is not deferred.
Line 5: This line captures an integer value that serves as a reference to the PDF XObject we just created and assigns it the name emptyicon.
Line 6: Starts the definition of a new macro called tooltip that takes 1 argument which is referred to in the body of the macro as #1. Each line in the macro ends with a comment character, %, to keep TeX from noticing the newlines that have been added for readability (newlines can have strange effects inside macros).
Line 7: Output raw PDF commands (pdfstartlink). This begins the creation of a new PDF annotation object (/Type /Annot) of which there are about 20 different subtypes, among them hyperlinks. Every line following this contains raw PDF markup with a couple of TeX macros.
Line 8: Declare the annotation subtype we are going to use. Here I am going with a plain Text annotation such as a comment or sticky note.
Line 9: Declare the contents of the annotation. This will be the contents of our tooltip and is set to #1, the argument to the tooltip macro.
Lines 10-12: Normally text annotations are marked by an icon, such as a sticky note, to highlight their location in the text. This behavior will cause a visual mess if we allow it to splatter little sticky notes all over our graphs. Here we use an appearance dictionary (/AP << >>) to set the "normal" annotation icon (/N) to be the blank icon we created earlier. The integer we stored in emptyicon along with 0 R forms a reference to the Form XObject we made on Line 4 using an empty box.
Line 14: If we were making a normal hyperlink, here is where the text/image/whatever would go that would serve as the link body. Instead we insert tooltiptarget, our invisible phantom object which does not show up on the screen but does react to mouseovers.
Step 2
All right, now that we have told LaTeX how to create tooltips, it is time to make them usable from R. This involves writing a function that will take coordinates on our graph, such as (1,1), and convert them into canvas or "device" coordinates. In the case of the tikzDevice the required measurement is "TeX points" (1/72.27 of an inch) from the absolute bottom left of the plotting area. Fortunately for base graphics, there are handy functions to calculate this for us. Grid graphics work differently, so the approach taken in the examples here won't work for them.
The final task for our R function is to call tikzAnnotate to insert a TikZ "node" into the output stream that is located at the coordinates we computed. Nodes can contain arbitrary TeX commands, in this case we will be calling upon tooltip.
Here is an R function that contains the above functionality:
place_PDF_tooltip <- function(x, y, text){
  # Calculate coordinates
  tikzX <- round(grconvertX(x, to = "device"), 2)
  tikzY <- round(grconvertY(y, to = "device"), 2)
  # Insert node
  tikzAnnotate(paste(
    "\\node at (", tikzX, ",", tikzY, ") ",
    "{\\tooltip{", text, "}};",
    sep = ''
  ))
  invisible()
}
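A short aside on the units involved: the device coordinates computed above are in TeX points (1/72.27 of an inch), whereas the default unit inside a PDF file is 1/72 of an inch. The following Python sketch only illustrates that arithmetic; the function names are hypothetical and nothing here is part of the tikzDevice.

```python
TEX_POINTS_PER_INCH = 72.27  # TeX point, as stated above
PDF_POINTS_PER_INCH = 72.0   # default PDF user space unit


def inches_to_tex_points(inches):
    """Convert a length in inches to TeX points."""
    return inches * TEX_POINTS_PER_INCH


def tex_points_to_pdf_points(tex_points):
    """Convert a length in TeX points to PDF default user space units."""
    return tex_points * PDF_POINTS_PER_INCH / TEX_POINTS_PER_INCH


# A position 2.5 inches from the left edge of the plotting area:
x_tex = inches_to_tex_points(2.5)
x_pdf = tex_points_to_pdf_points(x_tex)
print(round(x_tex, 2), round(x_pdf, 2))
```

The difference is small, but it matters when comparing node positions in the generated .tex file with coordinates seen in the final PDF.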
Step 3
Try it out on a plot:
# standAlone creates a complete LaTeX document. Default output makes
# stripped-down graphs meant for inclusion in other LaTeX documents
tikz('tooltips_ahoy.tex', standAlone = TRUE)
plot(1,1)
place_PDF_tooltip(1,1, 'Hello, World!')
dev.off()
require(tools)
texi2dvi('tooltips_ahoy.tex', pdf = TRUE)
Step 4
Behold the result (download a pdf):
Advanced Tooltips
Step 1
So, now that we have simple tooltips out of the way, why not crank it to 11? In the previous example, we used an empty hbox to get rid of the tooltip icon. But what if we had put something in that box, like text or a drawing? And what if there was a way to make it so that the icon only appeared during mouseover events?
The following TeX macro is a little rough around the edges, but it shows that this is possible:
\usetikzlibrary{shapes.callouts}
\tikzset{tooltip/.style = {
rectangle callout,
draw,
callout absolute pointer = {(-2em, 1em)}
}}
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\newcommand\tooltip[1]{%
\def\tooltipcallout{\tikz{\node[tooltip]{#1};}}%
\setbox\tempboxa=\hbox{\phantom{\tooltipcallout}}%
\immediate\pdfxform\tempboxa%
\edef\emptyicon{\the\pdflastxform}%
\setbox\tempboxa=\hbox{\tooltipcallout}%
\immediate\pdfxform\tempboxa%
\edef\tooltipicon{\the\pdflastxform}%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
/R \tooltipicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
The following modifications have been made compared to the simple callout.
The shapes.callouts library is loaded which contains templates for TikZ to use when drawing callout boxes.
A tooltip style is defined which contains some TikZ graphics boilerplate. It specifies a rectangular callout box that is to be visible (draw). The callout absolute pointer business is a hack because I've had too many beers by this point to figure out how to place annotation icons using dynamically generated PDF primitives. This relies on the default anchoring of icons at their upper left corner and so pulls the pointer of the callout box toward that location. The result is that the boxes will always appear to the lower right of the pointer and if the callout text is long enough, they won't look right.
Inside the macro, the tooltip is generated using a one-shot tikz command that is stuffed into the tooltipcallout macro. A form XObject is generated from tooltipcallout and assigned to tooltipicon.
emptyicon is also dynamically generated by evaluating tooltipcallout inside of phantom. This is required because the size of the default icon apparently sets the viewport available for the rollover icon.
When generating PDF commands, a new entry is added to the /AP dictionary, /R for rollover, that uses the XObject referenced by tooltipicon.
The ready to consume R version is:
require(tikzDevice)
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'),
"\\usetikzlibrary{shapes.callouts}",
"\\tikzset{tooltip/.style = {rectangle callout,draw,callout absolute pointer = {(-2em, 1em)}}}",
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa",
"\\newcommand\\tooltip[1]{\\def\\tooltipcallout{\\tikz{\\node[tooltip]{#1};}}\\setbox\\tempboxa=\\hbox{\\phantom{\\tooltipcallout}}\\immediate\\pdfxform\\tempboxa\\edef\\emptyicon{\\the\\pdflastxform}\\setbox\\tempboxa=\\hbox{\\tooltipcallout}\\immediate\\pdfxform\\tempboxa\\edef\\tooltipicon{\\the\\pdflastxform}\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R/R \\tooltipicon\\space 0 R>>}\\tooltiptarget\\pdfendlink}"
))
Step 2
The R-level code is unchanged.
Step 3
Let's try a slightly more complicated graph:
tikz('tooltips_with_callouts.tex', standAlone = TRUE)
x <- 1:10
y <- runif(10, 0, 10)
plot(x,y)
place_PDF_tooltip(x,y,x)
dev.off()
require(tools)
texi2dvi('tooltips_with_callouts.tex', pdf = TRUE)
Step 4
The result (download a PDF):
As you can see, there is an issue with both the tooltip and the callout being displayed. Setting /Contents () so that the tooltip has an empty string won't help anything. This can probably be solved by using a different annotation type, but I'm not going to spend any more time on it at the moment.
Caveats
Lots of TeX commands contain backslashes, you will need to double the backslashes when expressing things in R code.
Certain characters are special to TeX, such as _, ^, %, complete list here, you will need to ensure these are properly escaped when using the tikzDevice.
Even though PDF is supposed to be superior to HTML in that it has a consistent rendering across platforms, your mileage will vary significantly depending on which viewer is being used. The screenshots were taken in Acrobat 8 on OS X; Preview also did a passable job but did not render the rollover callout. On Linux, xpdf didn't render anything, and okular showed a tooltip but did not suppress the tooltip icon and displayed a stock icon that looked a little garish in the middle of a plot.
Alternative Implementations
cooltooltips and fancytooltips are LaTeX packages that provide tooltip functionality that could probably be used from the tikzDevice. Some ideas used in the above examples were taken from cooltooltips.
Concluding Thoughts
Was this worth it? Well, by the end of the day I did learn two things:
I am not a PDF expert and I had to search a couple mailing lists and digest some parts of a 700+ page specification to even answer the question "is this possible?" let alone come up with an implementation.
I am not a SVG expert and yet I know that tooltips in SVG is not a question of "is this possible?" but rather "how sexy do you want it to look?".
Given that observation, I say that it is getting to be time for PDF to ride off into the sunset and take its 700 page spec with it. We need a new page description markup language and personally I'm hoping SVG Print will be complete enough to do the job. I would even accept Adobe Mars if the specification is accessible enough. Just give us an XML-based format that can exploit the full power of today's JavaScript libraries and then some real innovation can begin. A LuaTeX backend would be a nice bonus.
References
tikzDevice: A device for outputting R graphics as LaTeX code.
comp.text.tex: Source of much insight into esoteric TeX details. Posts by Herbert Voß and Mycroft provided implementation ideas.
The pdfTeX manual: Source of information concerning TeX commands that can generate raw PDF code, such as \pdfstartlink
TeX by Topic: Go-to manual for low-level TeX details such as how \immediate actually works.
The Adobe PDF Specification: The lair of details concerning PDF primitives.
The TikZ Manual: Quite possibly the finest example of open source documentation ever written.
Well, not PDF, but you could include tooltips (among others) in SVG format with the SVGAnnotation or RSVGTipsDevice package.
The pdf2 package works fine for me. (https://r-forge.r-project.org/R/?group_id=193). You can just include a 'popup' in the regular text command. It's not current as of 2.14, but I imagine he'll get round to it before too long.
text(x,y,'hard copy text',popup='tooltip text')