Fill textfield with pdfbox cause an offset [duplicate] - pdfbox

I am using Apache PDFBox for configuration of PDTextField's on a PDF document where I load Lato onto the document using:
font = PDType0Font.load(
#j_pd_document,
java.io.FileInputStream.new('/path/to/Lato-Regular.ttf')
) # => Lato-Regular
font_name = pd_default_resources.add(font).get_name # => F4
I then pass the font_name into a default_appearance_string for the PDTextField like so:
j_text_field.set_default_appearance("/#{font_name} 0 Tf 0 g") # where font_name is
# passed in from above
The issue now occurs when I proceed to invoke setValue on the PDTextField. Because I set the font_size in the defaultAppearanceString to 0, according to the library's example, the text should scale itself to fit in the text box's given area. However, the behaviour of this 'scale-to-fit' is inconsistent for certain fields: it does not always choose the largest font size to fit in the PDTextField. Might there be any further configuration that might allow for this to happen? Below are the PDFs where I've noticed this problem occurring.
Unfilled, with fonts loaded:
http://www.filedropper.com/0postfontload
Filled, with inconsisteny textbox text sizing:
http://www.filedropper.com/file_327
Side Note: I am using PDFBox through jruby which is just a integration layer that allows Ruby to invoke Java libraries. All java methods for the library available; a java method like thisExampleMethod would have a one-to-one translation into ruby this_example_method.
Updates
In response to comments, the appearances that are incorrect in the second uploaded file example are:
1st page Resident Name field (two text fields that have text that is too small for the given input field size)
2nd page Phone fields (four text fields that have text that overflows the given input field size)

Especially the appearances of the Resident Name fields, the Phone fields, and the Care Providers Address fields appear conspicuous. Only the former two are mentioned by the OP.
Let's inspect these fields; all screen shots are made using Adobe Reader DC on MS Windows:
The Resident Name fields
The filled in Resident Name fields look like this
While the height is appropriate, the glyphs are narrower than they should be. Actually this effect can already be seen in the original PDF:
This horizontal compression is caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
The widget rectangles: [ 45.72 601.44 118.924 615.24 ] and [ 119.282 601.127 192.486 614.927 ], i.e. 73.204*13.8 in both cases.
The appearance bounding box: [ 0 0 147.24 13.8 ], i.e. 147.24*13.8.
So they have the same height but the appearance bounding box is approximately twice as wide as the widget rectangle. Thus, the text drawn normally in the appearance stream gets compressed to half its width when the appearance is displayed in the widget rectangle.
When setting the value of a field PDFBox unfortunately re-uses the appearance stream as is and only updates details from the default appearance, i.e. font name, font size, and color, and the actual text value, apparently assuming the other properties of the appearance are as they are for a reason. Thus, the PDFBox output also shows this horizontal compression
To make PDFBox create a proper appearance, it is necessary to remove the old appearances before setting the new value.
The Phone fields
The filled in Phone fields look like this
and again there is a similar display in the original file
That only the first two letters are shown even though there is enough space for the whole word, is due to the configuration of these fields: They are configured as comb fields with a maximum length of 2 characters.
To have a value here set with PDFBox displayed completely and not so spaced out, you have to remove the maximum length (or at least have to make it no less than the length of your value) and unset the comb flag.
The Care Providers Address fields
Filled in they look like this:
Originally they look similar:
This vertical compression is again caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
A widget rectangle: [ 278.6 642.928 458.36 657.96 ], i.e. 179.76*15.032.
The appearance bounding box: [ 0 0 179.76 58.56 ], i.e. 179.76*58.56.
Just like in the case of the Resident Name fields above it is necessary to remove the old appearances before setting the new value to make PDFBox create a proper appearance.
A complication
Actually there is an additional issue when filling in the Care Providers Address fields, after removing the old appearances they look like this:
This is due to a shortcoming of PDFBox: These fields are configured as multi line text fields. While PDFBox for single line text fields properly calculates the font size based on the content and later finely makes sure that the text vertically fits quite well, it proceeds very crudely for multi line fields, it selects a hard coded font size of 12 and does not fine tune the vertical position, see the code of the AppearanceGeneratorHelper methods calculateFontSize(PDFont, PDRectangle) and insertGeneratedAppearance(PDAnnotationWidget, PDAppearanceStream, OutputStream).
As in your form these address fields anyways are only one line high, an obvious solution would be to make these fields single line fields, i.e. clear the Multiline flag.
Example code
Using Java one can implement the solutions explained above like this:
final int FLAG_MULTILINE = 1 << 12;
final int FLAG_COMB = 1 << 24;
PDDocument doc = PDDocument.load(originalStream);
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDType0Font font = PDType0Font.load(doc, fontStream, false);
String font_name = acroForm.getDefaultResources().add(font).getName();
for (PDField field : acroForm.getFieldTree()) {
if (field instanceof PDTextField) {
PDTextField textField = (PDTextField) field;
textField.getCOSObject().removeItem(COSName.MAX_LEN);
textField.getCOSObject().setFlag(COSName.FF, FLAG_COMB | FLAG_MULTILINE, false);;
textField.setDefaultAppearance(String.format("/%s 0 Tf 0 g", font_name));
textField.getWidgets().forEach(w -> w.getAppearance().setNormalAppearance((PDAppearanceEntry)null));
textField.setValue("Test");
}
}
(FillInForm test testFill0DropOldAppearanceNoCombNoMaxNoMultiLine)
Screen shots of the output of the example code
The Resident Name field value now is not vertically compressed anymore:
The Phone and Care Providers Address fields also look appropriate now:

Related

JSX (Photoshop) - document resolution in dpi

I'm working with a jsx script in Photoshop that resizes images to a specific size. The resolution is set at 200 dpi. After running the script, I can check this under Image > Image Size.
Problem is, depending on the image, it initially tends to show the resolution in dots/cm instead of dots/inch. The number itself is correct either way, but I'd like to see it mentioned there as the latter. Is there a way to realize this in JSX?
Thanks!
J
The easy way is to open your Info Panel by going to Window > Info, and then click on the x/y coordinates dropdown in the Info Panel and select inches. The dropdown is the + toward the lower-left of the panel, with the little down arrow at the bottom right of the + symbol (The plus is actually an x axis and y axis representing a coordinate plane). After that, when you check under Image > Image Size, it should show you all information in inches instead of centimeters. This should also show you inches anywhere else you look in Photoshop's interface, too, such as the rulers.
An exception would be that when using selection tools, such as the marquee tool with a setting like "fixed size" selected, you can override the units setting by typing in another unit in the Width and Height sections at the top of the window. You can even mix and match units, making a precise selection that is, for example, exactly 250 pixels (px in the Width setting) by 30 points (pt in the Height setting). And when you check your image size, it should still show you results in inches.
And finally, to answer your question as it was asked, the following code will change your rulerUnits preference without opening the Info Panel.
#target Photoshop
preferences.rulerUnits = Units.INCHES;
Note that if you want to write other scripts, you can change the rulerUnits to whatever units the script calls for, and then at the end of the script put your units back the way you had them.
#target Photoshop
// Save the original rulerUnits setting to a variable
var originalRulerUnits = preferences.rulerUnits;
// Change the rulerUnits to Inches
preferences.rulerUnits = Units.INCHES;
//
// Do magical scripty stuff here...
//
// Restore the original setting
preferences.rulerUnits = originalRulerUnits;
// List of rulerUnits settings available
// Units.CM
// Units.INCHES
// Units.MM
// Units.PERCENT
// Units.PICAS
// Units.PIXELS
// Units.POINTS

Detecting Headers and Borders in PDF Tables using PDF Clown

I am using PDF Clown's TextInfoExtractionSample to extract a PDF table into Excel and I was able to do it except merged cells. In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders. Anyone know what object represents borders in PDF table OR how to detect if a text is a header of the table?
private void Extract(ContentScanner level, PrimitiveComposer composer)
{
if(level == null)
return;
while(level.MoveNext())
{
ContentObject content = level.Current;
}
}
I am using PDF Clown's TextInfoExtractionSample...
In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders.
while(level.MoveNext())
{
ContentObject content = level.Current;
}
A) Visit all content
In your loop code you removed very important blocks from the original example,
if(content is XObject)
{
// Scan the external level!
Extract(((XObject)content).GetScanner(level), composer);
}
and
if(content is ContainerObject)
{
// Scan the inner level!
Extract(level.ChildLevel, composer);
}
These blocks make the sample recurse into complex objects (the XObject, ContainerObject you mention) which in turn contain their own simple content.
B) Inspect all content
Anyone know what object represents borders in PDF table
Unfortunately there is nothing like a border attribute in PDF content. Instead, borders are independent objects, usually vector graphics, either lines or very thin rectangles.
Thus, while scanning the page content (recursively, as indicated in A) you will have to look for Path instances (namespace org.pdfclown.documents.contents.objects) containing
moveTo m, lineTo l, and stroke S operations or
rectangle re and fill f operations.
(This answer may help)
When you come across such lines, you will have to interpret them. These lines may be borders, but they may also be used as underlines, page decorations, ...
If the PDF happens to be tagged, things may be a bit easier insofar as you have to interpret less. Instead you can read the tagging information which may tell you where a cell starts and ends, so you do not need to interpret graphical lines. Unfortunately still less PDFs are tagged than not.
OR how to detect if a text is a header of the table?
Just as above, unless you happen to inspect a tagged PDF, there is nothing immediately telling you some text is a table header. You have to interpret again. Is that text outside of lines you determined to form a table? Is it inside at the top? Or just anywhere inside? Is it drawn in a specific font? Or larger? Different color? Etc.

OpenCV - Variable value range of trackbar

I have a set of images and want to make a cross matching between all and display the results using trackbars using OpenCV 2.4.6 (ROS Hydro package). The matching part is done using a vector of vectors of vectors of cv::DMatch-objects:
image[0] --- image[3] -------- image[8] ------ ...
| | |
| cv::DMatch-vect cv::DMatch-vect
|
image[1] --- ...
|
image[2] --- ...
|
...
|
image[N] --- ...
Because we omit matching an image with itself (no point in doing that) and because a query image might not be matched with all the rest each set of matched train images for a query image might have a different size from the rest. Note that the way it's implemented right I actually match a pair of images twice, which of course is not optimal (especially since I used a BruteForce matcher with cross-check turned on, which basically means that I match a pair of images 4 times!) but for now that's it. In order to avoid on-the-fly drawing of matched pairs of images I have populated a vector of vectors of cv::Mat-objects. Each cv::Mat represents the current query image and some matched train image (I populate it using cv::drawMatches()):
image[0] --- cv::Mat[0,3] ---- cv::Mat[0,8] ---- ...
|
image[1] --- ...
|
image[2] --- ...
|
...
|
image[N] --- ...
Note: In the example above cv::Mat[0,3] stands for cv::Mat that stores the product of cv::drawMatches() using image[0] and image[3].
Here are the GUI settings:
Main window: here I display the current query image. Using a trackbar - let's call it TRACK_QUERY - I iterate through each image in my set.
Secondary window: here I display the matched pair (query,train), where the combination between the position of TRACK_QUERY's slider and the position of the slider of another trackbar in this window - let's call it TRACK_TRAIN - allows me to iterate through all the cv::Mat-match-images for the current query image.
The issue here comes from the fact that each query can have a variable number of matched train images. My TRACK_TRAIN should be able to adjust to the number of matched train images, that is the number of elements in each cv::Mat-vector for the current query image. Sadly so far I was unable to find a way to do that. The cv::createTrackbar() requires a count-parameter, which from what I see sets the limit of the trackbar's slider and cannot be altered later on. Do correct me if I'm wrong since this is exactly what's bothering me. A possible solution (less elegant and involving various checks to avoid out-of-range erros) is to take the size of the largest set of matched train images and use it as the limit for my TRACK_TRAIN. I would like to avoid doing that if possible. Another possible solution involves creating a trackbar per query image with the appropriate value range and swap each in my secondary windows according to the selected query image. For now this seems to be the more easy way to go but poses a big overhead of trackbars not to mention that fact that I haven't heard of OpenCV allowing you to hide GUI controls. Here are two example that might clarify things a little bit more:
Example 1:
In main window I select image 2 using TRACK_QUERY. For this image I have managed to match 5 other images from my set. Let's say those are image 4, 10, 17, 18 and 20. The secondary window updates automatically and shows me the match between image 2 and image 4 (first in the subset of matched train images). TRACK_TRAIN has to go from 0 to 4. Moving the slider in both directions allows me to go through image 4, 10, 17, 18 and 20 updating each time the secondary window.
Example 2:
In main window I select image 7 using TRACK_QUERY. For this image I have managed to match 3 other images from my set. Let's say those are image 0, 1, 11 and 19. The secondary window updates automatically and shows me the match between image 2 and image 0 (first in the subset of matched train images). TRACK_TRAIN has to go from 0 to 2. Moving the slider in both directions allows me to go through image 0, 1, 1 and 19 updating each time the secondary window.
If you have any questions feel free to ask and I'll to answer them as well as I can. Thanks in advance!
PS: Sadly the way the ROS package is it has the bare minimum of what OpenCV can offer. No Qt integration, no OpenMP, no OpenGL etc.
After doing some more research I'm pretty sure that this is currently not possible. That's why I implemented the first proposition that I gave in my question - use the match-vector with the most number of matches in it to determine a maximum size for the trackbar and then use some checking to avoid out-of-range exceptions. Below there is a more or less detailed description how it all works. Since the matching procedure in my code involves some additional checks that does not concern the problem at hand, I'll skip it here. Note that in a given set of images we want to match I refer to an image as object-image when that image (example: card) is currently matched to a scene-image (example: a set of cards) - top level of the matches-vector (see below) and equal to the index in processedImages (see below). I find the train/query notation in OpenCV somewhat confusing. This scene/object notation is taken from http://docs.opencv.org/doc/tutorials/features2d/feature_homography/feature_homography.html. You can change or swap the notation to your liking but make sure you change it everywhere accordingly otherwise you might end up with a some weird results.
// stores all the images that we want to cross-match
std::vector<cv::Mat> processedImages;
// stores keypoints for each image in processedImages
std::vector<std::vector<cv::Keypoint> > keypoints;
// stores descriptors for each image in processedImages
std::vector<cv::Mat> descriptors;
// fill processedImages here (read images from files, convert to grayscale, undistort, resize etc.), extract keypoints, compute descriptors
// ...
// I use brute force matching since I also used ORB, which has binary descriptors and HAMMING_NORM is the way to go
cv::BFmatcher matcher;
// matches contains the match-vectors for each image matched to all other images in our set
// top level index matches.at(X) is equal to the image index in processedImages
// middle level index matches.at(X).at(Y) gives the match-vector for the Xth image and some other Yth from the set that is successfully matched to X
std::vector<std::vector<std::vector<cv::DMatch> > > matches;
// contains images that store visually all matched pairs
std::vector<std::vector<cv::Mat> > matchesDraw;
// fill all the vectors above with data here, don't forget about matchesDraw
// stores the highest count of matches for all pairs - I used simple exclusion by simply comparing the size() of the current std::vector<cv::DMatch> vector with the previous value of this variable
long int sceneWithMaxMatches = 0;
// ...
// after all is ready do some additional checking here in order to make sure the data is usable in our GUI. A trackbar for example requires AT LEAST 2 for its range since a range (0;0) doesn't make any sense
if(sceneWithMaxMatches < 2)
return -1;
// in this window show the image gallery (scene-images); the user can scroll through all image using a trackbar
cv::namedWindow("Images", CV_GUI_EXPANDED | CV_WINDOW_AUTOSIZE);
// just a dummy to store the state of the trackbar
int imagesTrackbarState = 0;
// create the first trackbar that the user uses to scroll through the scene-images
// IMPORTANT: use processedImages.size() - 1 since indexing in vectors is the same as in arrays - it starts from 0 and not reducing it by 1 will throw an out-of-range exception
cv::createTrackbar("Images:", "Images", &imagesTrackbarState, processedImages.size() - 1, on_imagesTrackbarCallback, NULL);
// in this window we show the matched object-images relative to the selected image in the "Images" window
cv::namedWindow("Matches for current image", CV_WINDOW_AUTOSIZE);
// yet another dummy to store the state of the trackbar in this new window
int imageMatchesTrackbarState = 0;
// IMPORTANT: again since sceneWithMaxMatches stores the SIZE of a vector we need to reduce it by 1 in order to be able to use it for the indexing later on
cv::createTrackbar("Matches:", "Matches for current image", &imageMatchesTrackbarState, sceneWithMaxMatches - 1, on_imageMatchesTrackbarCallback, NULL);
while(true)
{
char key = cv::waitKey(20);
if(key == 27)
break;
// from here on the magic begins
// show the image gallery; use the position of the "Images:" trackbar to call the image at that position
cv::imshow("Images", processedImages.at(cv::getTrackbarPos("Images:", "Images")));
// store the index of the current scene-image by calling the position of the trackbar in the "Images:" window
int currentSceneIndex = cv::getTrackbarPos("Images:", "Images");
// we have to make sure that the match of the currently selected scene-image actually has something in it
if(matches.at(currentSceneIndex).size())
{
// store the index of the current object-image that we have matched to the current scene-image in the "Images:" window
int currentObjectIndex = cv::getTrackbarPos("Matches:", "Matches for current image");
cv::imshow(
"Matches for current image",
matchesDraw.at(currentSceneIndex).at(currentObjectIndex < matchesDraw.at(currentSceneIndex).size() ? // is the current object index within the range of the matches for the current object and current scene
currentObjectIndex : // yes, return the correct index
matchesDraw.at(currentSceneIndex).size() - 1)); // if outside the range show the last matched pair!
}
}
// do something else
// ...
The tricky part is the trackbar in the second window responsible for accessing the matched images to our currently selected image in the "Images" window. As I've explained above I set the trackbar "Matches:" in the "Matches for current image" window to have a range from 0 to (sceneWithMaxMatches-1). However not all images have the same amount of matches with the rest in the image set (applies tenfold if you have done some additional filtering to ensure reliable matches for example by exploiting the properties of the homography, ratio test, min/max distance check etc.). Because I was unable to find a way to dynamically adjust the trackbar's range I needed a validation of the index. Otherwise for some of the images and their matches the application will throw an out-of-range exception. This is due to the simple fact that for some matches we try to access a match-vector with an index greater than it's size minus 1 because cv::getTrackbarPos() goes all the way to (sceneWithMaxMatches - 1). If the trackbar's position goes out of range for the currently selected vector with matches, I simply set the matchDraw-image in "Matches for current image" to the very last in the vector. Here I exploit the fact that the indexing can't go below zero as well as the trackbar's position so there is not need to check this but only what comes after the initial position 0. If this is not your case make sure you check the lower bound too and not only the upper.
Hope this helps!

Programmatically change the color of a black box in a PDF file?

I have a PDF file generated by Microsoft Word. The user has specified a "highlight" color of black to make the text look like it's a black box (and make the text look like its been redacted). I'd like to change the black boxes to yellow so that the text is highlighted instead.
Ideally, I'd like to do this in Python.
Thanks!
Option 1: If a commercial library is an option, you can easily implement this with Amyuni PDF Creator .Net, the C# code would look like this:
using System.IO;
using Amyuni.PDFCreator;
using System.Collections;
//open a pdf document
FileStream testfile = new FileStream("test1.pdf", FileMode.Open, FileAccess.Read, FileShare.Read);
IacDocument document = new IacDocument(null);
document.Open(testfile, "");
//get the first page
IacPage page1 = document.GetPage(1);
//get all graphic objects on the page
IacAttribute attribute = page1.AttributeByName("Objects");
// listobj is an arraylist of objects
ArrayList listobj = (ArrayList)attribute.Value;
foreach (IacObject iacObj in listobj)
{
//if the object is a rectangle and the background color is black then set it to yellow
if ((IacObjectType)iacObj.AttributeByName("ObjectType").Value == (IacObjectType.acObjectTypeFrame && (int)obj.Attribute("BackColor").Value == 0)
{
obj.Attribute("BackColor").Value = 0x00FFFF; //Yellow
}
}
I suppose you could translate this to IronPython instead.
Usual disclaimer applies for this suggestion
Option 2: If a commercial library is not an option and you are not developing a commercial closed-source application, you could try a bit of unreliable hacking on the page content using iText:
You can try decoding the page content (see ContentByteUtils class in iText for details), inserting a color selection operator before every fill operator, then resave the file. For more details on these operators see the TABLE 4.10 Path-painting operators of the Adobe PDF reference document.
Operand f:
Fill the path, using the nonzero winding number rule to determine the region to fill (see “Nonzero Winding Number Rule” on page 232).
Operand rg: sets the nonstroking color space to DeviceRGB, and sets the nonstroking color to the specified value
Operand q: saves the current graphic state
Operand Q: Restores the saved graphic state
So if you have a sequence of operators on your page:
0.0 0.0 0.0 rg % Set nonstroking color to black
25 175 175 −150 re % Construct rectangular path
f % Fill path
It should become:
0.0 0.0 0.0 rg % Set nonstroking color to black
25 175 175 −150 re % Construct rectangular path
q % Saves the current graphic state
1.0 1.0 0.0 rg % Set nonstroking color to yellow
f % Fill path
Q % Restores the saved graphic state
Some remarks:
-This approach will turn every non-text drawing into yellow (including lines, curves, etc and excluding raster images) and it will also draw as yellow any text that is drawn on the page using the same drawing operators as other PDF drawings.
-Xforms and annotations used on the page will not be processed.
-If the documents you will process are produced by the same tool in the same way you may just test a few files and see how it goes.
Important: This is just an untested idea from the top of my head, it may work, or it may not.

Create pdf with tooltips in R

Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
There's always a way. But the devil is in the details, so the real question is: are you willing to get your hands dirty?
PDF does support tooltips for certain kinds of objects such as hyperlinks. So, if there is a way to insert raw PDF statements indicating there should be a hyperlink-like object at some position in your plot, then there is a way to pop up a tooltip.
Now, the only way I know of to generate and compile raw PDF statements is to create the document using TeX. There are definitely other ways to do it, but this is the one I am familiar with. The following examples use a graphics device that Cameron Bracken and I wrote, the tikzDevice, to render R graphics to LaTeX code. The tikzDevice has preliminary support for injecting arbitrary LaTeX code into the graphics output stream through the tikzAnnotate function---we will be using this to drop PDF callouts into the plot.
The steps involved are:
Set up a LaTeX macro to generate the PDF commands required to produce a callout.
Set up an R function that uses tikzAnnotate to invoke the LaTeX macro at specific points in the plot.
???
Profit!
In the examples that follow, one major caveat is attached to step 2. The coordinate calculations used will only work with base graphics, not grid graphics such as ggplot2.
Simple Tooltips
Step 1
The tikzDevice allows you to create R graphics that include the execution of arbitrary LaTeX commands. Usually this is done to insert things like $\alpha$ into plot titles to generate greek letters, α, or complex equations. Here we are going to use this feature to invoke some raw PDF voodoo.
Any LaTeX macros that you wish to be available during the generation of a tikzDevice graphic need to be defined up-front by setting the tikzLatexPackages option. Here we are going to append some stuff to that declaration:
require(tikzDevice) # So that default options are set
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'), # The original contents: required stuff
# Avert your eyes for a sec, all will be explained below
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa\\setbox\\tempboxa=\\hbox{}\\immediate\\pdfxform\\tempboxa \\edef\\emptyicon{\\the\\pdflastxform}",
"\\newcommand\\tooltip[1]{\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R >>}\\tooltiptarget\\pdfendlink}"
))
If all that quoted nonsense were to be written out as LaTeX code by someone who cared about readability, it would look like this:
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\setbox\tempboxa=\hbox{}
\immediate\pdfxform\tempboxa
\edef\emptyicon{\the\pdflastxform}
\newcommand\tooltip[1]{%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
For those programmers who have never taken a walk on the wild side and done some "programming" in TeX, here's a blow-by-blow for the above code (as I understand it anyway, TeX can get very weird---especially when you get down in the trenches with it):
Line 1: Define an object, tooltiptarget, which is non-visible (a phantom) and is a 1mm x 1mm rectangle (a rule). This will be the onscreen area which we will use to detect mouseovers.
Line 2: Create a new box, which is like a "page fragment" of typset material. Can contain pretty much anything, including other boxes (sort of like an R list). Call it tempboxa.
Line 3: Assign the contents of tempboxa to contain an empty box that arranges its contents using a horizontal layout (which is unimportant, could have used a vbox or other box).
Line 4: Create a PDF Form XObject using the contents of tempboxa. A Form XObject can be used by PDF files to store graphics, like logos, that may be used over and over. Here we are creating a "blank icon" that we can use later to cut down on visual clutter. TeX defers output operations, like writing objects to a PDF file, until certain conditions have been met---such as a page has filled up. Immediate makes sure this operation is not deferred.
Line 5: This line captures an integer value that serves as a reference to the PDF XObject we just created and assigns it the name emptyicon.
Line 6: Starts the definition of a new macro called tooltip that takes 1 argument which is referred to in the body of the macro as #1. Each line in the macro ends with a comment character, %, to keep TeX from noticing the newlines that have been added for readability (newlines can have strange effects inside macros).
Line 7: Output raw PDF commands (pdfstartlink). This begins the creation of a new PDF annotation object (\Type \Annot) of which there are about 20 different subtypes---among them are hyperlinks. Every line following this contains raw PDF markup with a couple of TeX macros.
Line 8: Declare the annotation subtype we are going to use. Here I am going with a plain Text annotation such as a comment or sticky note.
Line 9: Declare the contents of the annotation. This will be the contents of our tooltip and is set to #1, the argument to the tooltip macro.
Lines 10-12: Normally text annotations are marked by an icon, such as a sticky note, to highlight their location in the text. This behavior will cause a visual mess if we allow it to splatter little sticky notes all over our graphs. Here we use an appearance array (\AP << >>) set the "normal" annotation icon (\N) to be the blank icon we created earlier. The integer we stored in emptyicon along with 0 R forms a reference to the Form XObject we made on Line 4 using an empty box.
Line 14: If we were making a normal hyperlink, here is where the text/image/whatever would go that would serve as the link body. Instead we insert tooltiptarget, our invisible phantom object which does not show up on the screen but does react to mouseovers.
Step 2
Allright, now that we have told LaTeX how to create tooltips, it is time to make them usable from R. This involves writing a function that will take coordinates on our graph, such as (1,1), and convert them into canvas or "device" coordinates. In the case of the tikzDevice the required measurement is "TeX points" (1/72.27 of an inch) from the absolute bottom left of the plotting area. Fortunately for base graphics, there are handy functions to calculate this for us. Grid graphics work differently, so the approach taken in the examples here won't work for them.
The final task for our R function is to call tikzAnnotate to insert a TikZ "node" into the output stream that is located at the coordinates we computed. Nodes can contain arbitrary TeX commands, in this case we will be calling upon tooltip.
Here is an R function that contains the above functionality:
place_PDF_tooltip <- function(x, y, text){
# Calculate coordinates
tikzX <- round(grconvertX(x, to = "device"), 2)
tikzY <- round(grconvertY(y, to = "device"), 2)
# Insert node
tikzAnnotate(paste(
"\\node at (", tikzX, ",", tikzY, ") ",
"{\\tooltip{", text, "}};",
sep = ''
))
invisible()
}
Step 3
Try it out on a plot:
# standAlone creates a complete LaTeX document. Default output makes
# stripped-down graphs ment for inclusion in other LaTeX documents
tikz('tooltips_ahoy.tex', standAlone = TRUE)
plot(1,1)
place_PDF_tooltip(1,1, 'Hello, World!')
dev.off()
require(tools)
texi2dvi('tooltips_ahoy.tex', pdf = TRUE)
Step 4
Behold the result (download a pdf):
Advanced Tooltips
Step 1
So, now that we have simple tooltips out of the way, why not crank it to 11? In the previous example, we used an empty hbox to get rid of the tooltip icon. But what if we had put something in that box, like text or a drawing? And what if there was a way to make it so that the icon only appeared during mouseover events?
The following TeX macro is a little rough around the edges, but it shows that this is possible:
\usetikzlibrary{shapes.callouts}
\tikzset{tooltip/.style = {
rectangle callout,
draw,
callout absolute pointer = {(-2em, 1em)}
}}
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\newcommand\tooltip[1]{%
\def\tooltipcallout{\tikz{\node[tooltip]{#1};}}%
\setbox\tempboxa=\hbox{\phantom{\tooltipcallout}}%
\immediate\pdfxform\tempboxa%
\edef\emptyicon{\the\pdflastxform}%
\setbox\tempboxa=\hbox{\tooltipcallout}%
\immediate\pdfxform\tempboxa%
\edef\tooltipicon{\the\pdflastxform}%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
/R \tooltipicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
The following modifications have been made compared to the simple callout.
The shapes.callouts library is loaded which contains templates for TikZ to use when drawing callout boxes.
A tooltip style is defined which contains some TikZ graphics boilerplate. It specifies a rectangular callout box that is to be visible (draw). The callout absolute pointer business is a hack because I've had too many beers by this point to figure out how to place annotation icons using dynamically generated PDF primitives. This relies on the default anchoring of icons at their upper left corner and so pulls the pointer of the callout box toward that location. The result is that the boxes will always appear to the lower right of the pointer and if the callout text is long enough, they won't look right.
Inside the macro, the tooltip is generated using a one-shot tikz command that is stuffed into the tooltipcallout macro. A form XObject is generated from tooltipcallout and assigned to tooltipicon.
emptyicon is also dynamically generated by evaluating tooltipcallout inside of phantom. This is required because the size of the default icon apparently sets the viewport available for the rollover icon.
When generating PDF commands, a new row is added to the /AP array, /R for rollover, that uses the XObject referenced by tooltipicon.
The ready to consume R version is:
require(tikzDevice)
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'),
"\\usetikzlibrary{shapes.callouts}",
"\\tikzset{tooltip/.style = {rectangle callout,draw,callout absolute pointer = {(-2em, 1em)}}}",
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa",
"\\newcommand\\tooltip[1]{\\def\\tooltipcallout{\\tikz{\\node[tooltip]{#1};}}\\setbox\\tempboxa=\\hbox{\\phantom{\\tooltipcallout}}\\immediate\\pdfxform\\tempboxa\\edef\\emptyicon{\\the\\pdflastxform}\\setbox\\tempboxa=\\hbox{\\tooltipcallout}\\immediate\\pdfxform\\tempboxa\\edef\\tooltipicon{\\the\\pdflastxform}\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R/R \\tooltipicon\\space 0 R>>}\\tooltiptarget\\pdfendlink}"
))
Step 2
The R-level code is unchanged.
Step 3
Let's try a slightly more complicated graph:
tikz('tooltips_with_callouts.tex', standAlone = TRUE)
x <- 1:10
y <- runif(10, 0, 10)
plot(x,y)
place_PDF_tooltip(x,y,x)
dev.off()
require(tools)
texi2dvi('tooltips_with_callouts.tex', pdf = TRUE)
Step 4
The result (download a PDF):
As you can see, there is an issue with both the tooltip and the callout being displayed. Setting \Contents () so that the tooltip has an empty string won't help anything. This can probably be solved by using a different annotation type, but I'm not going to spend any more time on it at the moment.
Caveats
Lots of TeX commands contain backslashes, you will need to double the backslashes when expressing things in R code.
Certain characters are special to TeX, such as _, ^, %, complete list here, you will need to ensure these are properly escaped when using the tikzDevice.
Even though PDF is supposed to be superior to HTML in that it has a consistant rendering across platforms, your mileage will vary significantly depending on which viewer is being used. The screenshots were taken in Acrobat 8 on OS X, Preview also did a passable job but did not render the rollover callout. On Linux, xpdf didn't render anything and okular showed a tooltip, but did not suppress the tooltip icon and displayed a stock icon that looked a little garish in the middle of a plot.
Alternative Implementations
cooltooltips and fancytooltips are LaTeX packages that provide tooltip functionality that could probably be used from the tikzDevice. Some ideas used in the above examples were taken from cooltooltips.
Concluding Thoughts
Was this worth it? Well, by the end of the day I did learn two things:
I am not a PDF expert and I had to search a couple mailing lists and digest some parts of a 700+ page specification to even answer the question "is this possible?" let alone come up with an implementation.
I am not a SVG expert and yet I know that tooltips in SVG is not a question of "is this possible?" but rather "how sexy do you want it to look?".
Given that observation, I say that it is getting to be time for PDF to ride off into the sunset and take it's 700 page spec with it. We need a new page description markup language and personally I'm hoping SVG Print will be complete enough to do the job. I would even accept Adobe Mars if the specification is accesible enough. Just give us an XML-based format that can exploit the full power of today's JavaScript libraries and then some real innovation can begin. A LuaTeX backend would be a nice bonus.
References
tikzDevice: A device for outputting R graphics as LaTeX code.
comp.text.tex: Source of much insight into esoteric TeX details. Posts by Herbert Voß and Mycroft provided implementation ideas.
The pdfTeX manual: Source of information concerning TeX commands that can generate raw PDF code, such as \pdfstartlink
TeX by Topic: Go-to manual for low-level TeX details such as how \immediate actually works.
The Adobe PDF Specification: The lair of details concerning PDF primitives.
The TikZ Manual: Quite possibly the finest example of open source documentation ever written.
Well, not pdf, but you could include tooltips (among ohers) in svg format with SVGAnnotation or RSVGTipsDevice package.
The pdf2 package works fine for me. (https://r-forge.r-project.org/R/?group_id=193). You can just include a 'popup' in the regular text command. It's not current as of 2.14, but I imagine he'll get round to it before too long.
text(x,y,'hard copy text',popup='tooltip text')