Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
Simple question: Is there a way to plot a graph from R in a pdf file and include tooltips?
There's always a way. But the devil is in the details, so the real question is: are you willing to get your hands dirty?
PDF does support tooltips for certain kinds of objects such as hyperlinks. So, if there is a way to insert raw PDF statements indicating there should be a hyperlink-like object at some position in your plot, then there is a way to pop up a tooltip.
Now, the only way I know of to generate and compile raw PDF statements is to create the document using TeX. There are definitely other ways to do it, but this is the one I am familiar with. The following examples use a graphics device that Cameron Bracken and I wrote, the tikzDevice, to render R graphics to LaTeX code. The tikzDevice has preliminary support for injecting arbitrary LaTeX code into the graphics output stream through the tikzAnnotate function---we will be using this to drop PDF callouts into the plot.
The steps involved are:
Set up a LaTeX macro to generate the PDF commands required to produce a callout.
Set up an R function that uses tikzAnnotate to invoke the LaTeX macro at specific points in the plot.
???
Profit!
In the examples that follow, one major caveat is attached to step 2. The coordinate calculations used will only work with base graphics, not grid graphics such as ggplot2.
Simple Tooltips
Step 1
The tikzDevice allows you to create R graphics that include the execution of arbitrary LaTeX commands. Usually this is done to insert things like $\alpha$ into plot titles to generate greek letters, α, or complex equations. Here we are going to use this feature to invoke some raw PDF voodoo.
Any LaTeX macros that you wish to be available during the generation of a tikzDevice graphic need to be defined up-front by setting the tikzLatexPackages option. Here we are going to append some stuff to that declaration:
require(tikzDevice) # So that default options are set
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'), # The original contents: required stuff
# Avert your eyes for a sec, all will be explained below
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa\\setbox\\tempboxa=\\hbox{}\\immediate\\pdfxform\\tempboxa \\edef\\emptyicon{\\the\\pdflastxform}",
"\\newcommand\\tooltip[1]{\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R >>}\\tooltiptarget\\pdfendlink}"
))
If all that quoted nonsense were to be written out as LaTeX code by someone who cared about readability, it would look like this:
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\setbox\tempboxa=\hbox{}
\immediate\pdfxform\tempboxa
\edef\emptyicon{\the\pdflastxform}
\newcommand\tooltip[1]{%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
For those programmers who have never taken a walk on the wild side and done some "programming" in TeX, here's a blow-by-blow for the above code (as I understand it anyway, TeX can get very weird---especially when you get down in the trenches with it):
Line 1: Define an object, tooltiptarget, which is non-visible (a phantom) and is a 1mm x 1mm rectangle (a rule). This will be the onscreen area which we will use to detect mouseovers.
Line 2: Create a new box, which is like a "page fragment" of typset material. Can contain pretty much anything, including other boxes (sort of like an R list). Call it tempboxa.
Line 3: Assign the contents of tempboxa to contain an empty box that arranges its contents using a horizontal layout (which is unimportant, could have used a vbox or other box).
Line 4: Create a PDF Form XObject using the contents of tempboxa. A Form XObject can be used by PDF files to store graphics, like logos, that may be used over and over. Here we are creating a "blank icon" that we can use later to cut down on visual clutter. TeX defers output operations, like writing objects to a PDF file, until certain conditions have been met---such as a page has filled up. Immediate makes sure this operation is not deferred.
Line 5: This line captures an integer value that serves as a reference to the PDF XObject we just created and assigns it the name emptyicon.
Line 6: Starts the definition of a new macro called tooltip that takes 1 argument which is referred to in the body of the macro as #1. Each line in the macro ends with a comment character, %, to keep TeX from noticing the newlines that have been added for readability (newlines can have strange effects inside macros).
Line 7: Output raw PDF commands (pdfstartlink). This begins the creation of a new PDF annotation object (\Type \Annot) of which there are about 20 different subtypes---among them are hyperlinks. Every line following this contains raw PDF markup with a couple of TeX macros.
Line 8: Declare the annotation subtype we are going to use. Here I am going with a plain Text annotation such as a comment or sticky note.
Line 9: Declare the contents of the annotation. This will be the contents of our tooltip and is set to #1, the argument to the tooltip macro.
Lines 10-12: Normally text annotations are marked by an icon, such as a sticky note, to highlight their location in the text. This behavior will cause a visual mess if we allow it to splatter little sticky notes all over our graphs. Here we use an appearance array (\AP << >>) set the "normal" annotation icon (\N) to be the blank icon we created earlier. The integer we stored in emptyicon along with 0 R forms a reference to the Form XObject we made on Line 4 using an empty box.
Line 14: If we were making a normal hyperlink, here is where the text/image/whatever would go that would serve as the link body. Instead we insert tooltiptarget, our invisible phantom object which does not show up on the screen but does react to mouseovers.
Step 2
Allright, now that we have told LaTeX how to create tooltips, it is time to make them usable from R. This involves writing a function that will take coordinates on our graph, such as (1,1), and convert them into canvas or "device" coordinates. In the case of the tikzDevice the required measurement is "TeX points" (1/72.27 of an inch) from the absolute bottom left of the plotting area. Fortunately for base graphics, there are handy functions to calculate this for us. Grid graphics work differently, so the approach taken in the examples here won't work for them.
The final task for our R function is to call tikzAnnotate to insert a TikZ "node" into the output stream that is located at the coordinates we computed. Nodes can contain arbitrary TeX commands, in this case we will be calling upon tooltip.
Here is an R function that contains the above functionality:
place_PDF_tooltip <- function(x, y, text){
# Calculate coordinates
tikzX <- round(grconvertX(x, to = "device"), 2)
tikzY <- round(grconvertY(y, to = "device"), 2)
# Insert node
tikzAnnotate(paste(
"\\node at (", tikzX, ",", tikzY, ") ",
"{\\tooltip{", text, "}};",
sep = ''
))
invisible()
}
Step 3
Try it out on a plot:
# standAlone creates a complete LaTeX document. Default output makes
# stripped-down graphs ment for inclusion in other LaTeX documents
tikz('tooltips_ahoy.tex', standAlone = TRUE)
plot(1,1)
place_PDF_tooltip(1,1, 'Hello, World!')
dev.off()
require(tools)
texi2dvi('tooltips_ahoy.tex', pdf = TRUE)
Step 4
Behold the result (download a pdf):
Advanced Tooltips
Step 1
So, now that we have simple tooltips out of the way, why not crank it to 11? In the previous example, we used an empty hbox to get rid of the tooltip icon. But what if we had put something in that box, like text or a drawing? And what if there was a way to make it so that the icon only appeared during mouseover events?
The following TeX macro is a little rough around the edges, but it shows that this is possible:
\usetikzlibrary{shapes.callouts}
\tikzset{tooltip/.style = {
rectangle callout,
draw,
callout absolute pointer = {(-2em, 1em)}
}}
\def\tooltiptarget{\phantom{\rule{1mm}{1mm}}}
\newbox\tempboxa
\newcommand\tooltip[1]{%
\def\tooltipcallout{\tikz{\node[tooltip]{#1};}}%
\setbox\tempboxa=\hbox{\phantom{\tooltipcallout}}%
\immediate\pdfxform\tempboxa%
\edef\emptyicon{\the\pdflastxform}%
\setbox\tempboxa=\hbox{\tooltipcallout}%
\immediate\pdfxform\tempboxa%
\edef\tooltipicon{\the\pdflastxform}%
\pdfstartlink user{%
/Subtype /Text
/Contents (#1)
/AP <<
/N \emptyicon\space 0 R
/R \tooltipicon\space 0 R
>>
}%
\tooltiptarget%
\pdfendlink%
}
The following modifications have been made compared to the simple callout.
The shapes.callouts library is loaded which contains templates for TikZ to use when drawing callout boxes.
A tooltip style is defined which contains some TikZ graphics boilerplate. It specifies a rectangular callout box that is to be visible (draw). The callout absolute pointer business is a hack because I've had too many beers by this point to figure out how to place annotation icons using dynamically generated PDF primitives. This relies on the default anchoring of icons at their upper left corner and so pulls the pointer of the callout box toward that location. The result is that the boxes will always appear to the lower right of the pointer and if the callout text is long enough, they won't look right.
Inside the macro, the tooltip is generated using a one-shot tikz command that is stuffed into the tooltipcallout macro. A form XObject is generated from tooltipcallout and assigned to tooltipicon.
emptyicon is also dynamically generated by evaluating tooltipcallout inside of phantom. This is required because the size of the default icon apparently sets the viewport available for the rollover icon.
When generating PDF commands, a new row is added to the /AP array, /R for rollover, that uses the XObject referenced by tooltipicon.
The ready to consume R version is:
require(tikzDevice)
options(tikzLatexPackages = c(
getOption('tikzLatexPackages'),
"\\usetikzlibrary{shapes.callouts}",
"\\tikzset{tooltip/.style = {rectangle callout,draw,callout absolute pointer = {(-2em, 1em)}}}",
"\\def\\tooltiptarget{\\phantom{\\rule{1mm}{1mm}}}",
"\\newbox\\tempboxa",
"\\newcommand\\tooltip[1]{\\def\\tooltipcallout{\\tikz{\\node[tooltip]{#1};}}\\setbox\\tempboxa=\\hbox{\\phantom{\\tooltipcallout}}\\immediate\\pdfxform\\tempboxa\\edef\\emptyicon{\\the\\pdflastxform}\\setbox\\tempboxa=\\hbox{\\tooltipcallout}\\immediate\\pdfxform\\tempboxa\\edef\\tooltipicon{\\the\\pdflastxform}\\pdfstartlink user{/Subtype /Text/Contents (#1)/AP <</N \\emptyicon\\space 0 R/R \\tooltipicon\\space 0 R>>}\\tooltiptarget\\pdfendlink}"
))
Step 2
The R-level code is unchanged.
Step 3
Let's try a slightly more complicated graph:
tikz('tooltips_with_callouts.tex', standAlone = TRUE)
x <- 1:10
y <- runif(10, 0, 10)
plot(x,y)
place_PDF_tooltip(x,y,x)
dev.off()
require(tools)
texi2dvi('tooltips_with_callouts.tex', pdf = TRUE)
Step 4
The result (download a PDF):
As you can see, there is an issue with both the tooltip and the callout being displayed. Setting \Contents () so that the tooltip has an empty string won't help anything. This can probably be solved by using a different annotation type, but I'm not going to spend any more time on it at the moment.
Caveats
Lots of TeX commands contain backslashes, you will need to double the backslashes when expressing things in R code.
Certain characters are special to TeX, such as _, ^, %, complete list here, you will need to ensure these are properly escaped when using the tikzDevice.
Even though PDF is supposed to be superior to HTML in that it has a consistant rendering across platforms, your mileage will vary significantly depending on which viewer is being used. The screenshots were taken in Acrobat 8 on OS X, Preview also did a passable job but did not render the rollover callout. On Linux, xpdf didn't render anything and okular showed a tooltip, but did not suppress the tooltip icon and displayed a stock icon that looked a little garish in the middle of a plot.
Alternative Implementations
cooltooltips and fancytooltips are LaTeX packages that provide tooltip functionality that could probably be used from the tikzDevice. Some ideas used in the above examples were taken from cooltooltips.
Concluding Thoughts
Was this worth it? Well, by the end of the day I did learn two things:
I am not a PDF expert and I had to search a couple mailing lists and digest some parts of a 700+ page specification to even answer the question "is this possible?" let alone come up with an implementation.
I am not a SVG expert and yet I know that tooltips in SVG is not a question of "is this possible?" but rather "how sexy do you want it to look?".
Given that observation, I say that it is getting to be time for PDF to ride off into the sunset and take it's 700 page spec with it. We need a new page description markup language and personally I'm hoping SVG Print will be complete enough to do the job. I would even accept Adobe Mars if the specification is accesible enough. Just give us an XML-based format that can exploit the full power of today's JavaScript libraries and then some real innovation can begin. A LuaTeX backend would be a nice bonus.
References
tikzDevice: A device for outputting R graphics as LaTeX code.
comp.text.tex: Source of much insight into esoteric TeX details. Posts by Herbert Voß and Mycroft provided implementation ideas.
The pdfTeX manual: Source of information concerning TeX commands that can generate raw PDF code, such as \pdfstartlink
TeX by Topic: Go-to manual for low-level TeX details such as how \immediate actually works.
The Adobe PDF Specification: The lair of details concerning PDF primitives.
The TikZ Manual: Quite possibly the finest example of open source documentation ever written.
Well, not pdf, but you could include tooltips (among ohers) in svg format with SVGAnnotation or RSVGTipsDevice package.
The pdf2 package works fine for me. (https://r-forge.r-project.org/R/?group_id=193). You can just include a 'popup' in the regular text command. It's not current as of 2.14, but I imagine he'll get round to it before too long.
text(x,y,'hard copy text',popup='tooltip text')
Related
I am using Apache PDFBox for configuration of PDTextField's on a PDF document where I load Lato onto the document using:
font = PDType0Font.load(
#j_pd_document,
java.io.FileInputStream.new('/path/to/Lato-Regular.ttf')
) # => Lato-Regular
font_name = pd_default_resources.add(font).get_name # => F4
I then pass the font_name into a default_appearance_string for the PDTextField like so:
j_text_field.set_default_appearance("/#{font_name} 0 Tf 0 g") # where font_name is
# passed in from above
The issue now occurs when I proceed to invoke setValue on the PDTextField. Because I set the font_size in the defaultAppearanceString to 0, according to the library's example, the text should scale itself to fit in the text box's given area. However, the behaviour of this 'scale-to-fit' is inconsistent for certain fields: it does not always choose the largest font size to fit in the PDTextField. Might there be any further configuration that might allow for this to happen? Below are the PDFs where I've noticed this problem occurring.
Unfilled, with fonts loaded:
http://www.filedropper.com/0postfontload
Filled, with inconsisteny textbox text sizing:
http://www.filedropper.com/file_327
Side Note: I am using PDFBox through jruby which is just a integration layer that allows Ruby to invoke Java libraries. All java methods for the library available; a java method like thisExampleMethod would have a one-to-one translation into ruby this_example_method.
Updates
In response to comments, the appearances that are incorrect in the second uploaded file example are:
1st page Resident Name field (two text fields that have text that is too small for the given input field size)
2nd page Phone fields (four text fields that have text that overflows the given input field size)
Especially the appearances of the Resident Name fields, the Phone fields, and the Care Providers Address fields appear conspicuous. Only the former two are mentioned by the OP.
Let's inspect these fields; all screen shots are made using Adobe Reader DC on MS Windows:
The Resident Name fields
The filled in Resident Name fields look like this
While the height is appropriate, the glyphs are narrower than they should be. Actually this effect can already be seen in the original PDF:
This horizontal compression is caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
The widget rectangles: [ 45.72 601.44 118.924 615.24 ] and [ 119.282 601.127 192.486 614.927 ], i.e. 73.204*13.8 in both cases.
The appearance bounding box: [ 0 0 147.24 13.8 ], i.e. 147.24*13.8.
So they have the same height but the appearance bounding box is approximately twice as wide as the widget rectangle. Thus, the text drawn normally in the appearance stream gets compressed to half its width when the appearance is displayed in the widget rectangle.
When setting the value of a field PDFBox unfortunately re-uses the appearance stream as is and only updates details from the default appearance, i.e. font name, font size, and color, and the actual text value, apparently assuming the other properties of the appearance are as they are for a reason. Thus, the PDFBox output also shows this horizontal compression
To make PDFBox create a proper appearance, it is necessary to remove the old appearances before setting the new value.
The Phone fields
The filled in Phone fields look like this
and again there is a similar display in the original file
That only the first two letters are shown even though there is enough space for the whole word, is due to the configuration of these fields: They are configured as comb fields with a maximum length of 2 characters.
To have a value here set with PDFBox displayed completely and not so spaced out, you have to remove the maximum length (or at least have to make it no less than the length of your value) and unset the comb flag.
The Care Providers Address fields
Filled in they look like this:
Originally they look similar:
This vertical compression is again caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
A widget rectangle: [ 278.6 642.928 458.36 657.96 ], i.e. 179.76*15.032.
The appearance bounding box: [ 0 0 179.76 58.56 ], i.e. 179.76*58.56.
Just like in the case of the Resident Name fields above it is necessary to remove the old appearances before setting the new value to make PDFBox create a proper appearance.
A complication
Actually there is an additional issue when filling in the Care Providers Address fields, after removing the old appearances they look like this:
This is due to a shortcoming of PDFBox: These fields are configured as multi line text fields. While PDFBox for single line text fields properly calculates the font size based on the content and later finely makes sure that the text vertically fits quite well, it proceeds very crudely for multi line fields, it selects a hard coded font size of 12 and does not fine tune the vertical position, see the code of the AppearanceGeneratorHelper methods calculateFontSize(PDFont, PDRectangle) and insertGeneratedAppearance(PDAnnotationWidget, PDAppearanceStream, OutputStream).
As in your form these address fields anyways are only one line high, an obvious solution would be to make these fields single line fields, i.e. clear the Multiline flag.
Example code
Using Java one can implement the solutions explained above like this:
final int FLAG_MULTILINE = 1 << 12;
final int FLAG_COMB = 1 << 24;
PDDocument doc = PDDocument.load(originalStream);
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDType0Font font = PDType0Font.load(doc, fontStream, false);
String font_name = acroForm.getDefaultResources().add(font).getName();
for (PDField field : acroForm.getFieldTree()) {
if (field instanceof PDTextField) {
PDTextField textField = (PDTextField) field;
textField.getCOSObject().removeItem(COSName.MAX_LEN);
textField.getCOSObject().setFlag(COSName.FF, FLAG_COMB | FLAG_MULTILINE, false);;
textField.setDefaultAppearance(String.format("/%s 0 Tf 0 g", font_name));
textField.getWidgets().forEach(w -> w.getAppearance().setNormalAppearance((PDAppearanceEntry)null));
textField.setValue("Test");
}
}
(FillInForm test testFill0DropOldAppearanceNoCombNoMaxNoMultiLine)
Screen shots of the output of the example code
The Resident Name field value now is not vertically compressed anymore:
The Phone and Care Providers Address fields also look appropriate now:
I am using PDF Clown's TextInfoExtractionSample to extract a PDF table into Excel and I was able to do it except merged cells. In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders. Anyone know what object represents borders in PDF table OR how to detect if a text is a header of the table?
private void Extract(ContentScanner level, PrimitiveComposer composer)
{
if(level == null)
return;
while(level.MoveNext())
{
ContentObject content = level.Current;
}
}
I am using PDF Clown's TextInfoExtractionSample...
In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders.
while(level.MoveNext())
{
ContentObject content = level.Current;
}
A) Visit all content
In your loop code you removed very important blocks from the original example,
if(content is XObject)
{
// Scan the external level!
Extract(((XObject)content).GetScanner(level), composer);
}
and
if(content is ContainerObject)
{
// Scan the inner level!
Extract(level.ChildLevel, composer);
}
These blocks make the sample recurse into complex objects (the XObject, ContainerObject you mention) which in turn contain their own simple content.
B) Inspect all content
Anyone know what object represents borders in PDF table
Unfortunately there is nothing like a border attribute in PDF content. Instead, borders are independent objects, usually vector graphics, either lines or very thin rectangles.
Thus, while scanning the page content (recursively, as indicated in A) you will have to look for Path instances (namespace org.pdfclown.documents.contents.objects) containing
moveTo m, lineTo l, and stroke S operations or
rectangle re and fill f operations.
(This answer may help)
When you come across such lines, you will have to interpret them. These lines may be borders, but they may also be used as underlines, page decorations, ...
If the PDF happens to be tagged, things may be a bit easier insofar as you have to interpret less. Instead you can read the tagging information which may tell you where a cell starts and ends, so you do not need to interpret graphical lines. Unfortunately still less PDFs are tagged than not.
OR how to detect if a text is a header of the table?
Just as above, unless you happen to inspect a tagged PDF, there is nothing immediately telling you some text is a table header. You have to interpret again. Is that text outside of lines you determined to form a table? Is it inside at the top? Or just anywhere inside? Is it drawn in a specific font? Or larger? Different color? Etc.
I'm very confused, why iTextsharp can't read or get the image from pdf(pdf converted from msword,excel,powerpoint)
Here's what I did, I opened msword file, then convert the msword file to pdf, then read the pdf file using iTextsharp, it doesn't recognize if pdf file has an image or shape.
I also tried from powerpoint to pdf, then read the pdf file, it doesn't read the images also.
Here's the code:
Below the images....EDITED...
This is the image that can't be extracted:
This is the image that I test a while ago that is good, and I don't know why the other image can't be detected, or it errors.
As of now I change the code to this:
But also can't detect its an image on the circle shape.
For pn As Integer = 1 To pc
Dim pg As PdfDictionary = pdfr.GetPageN(pn)
Dim res As PdfDictionary = DirectCast(PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)), PdfDictionary)
Dim xobj As PdfDictionary = DirectCast(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)), PdfDictionary)
MessageBox.Show("THE ERROR IS HERE, IT BYPASS, SO XOBJ IS NOTHING IN THAT IMAGE")
If xobj IsNot Nothing Then
For Each name As PdfName In xobj.Keys
Dim obj As PdfObject = xobj.Get(name)
If obj.IsIndirect() Then
Dim tg As PdfDictionary = DirectCast(PdfReader.GetPdfObject(obj), PdfDictionary)
Dim type As PdfName = DirectCast(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)), PdfName)
Dim XrefIndex As Integer = Convert.ToInt32(DirectCast(obj, PRIndirectReference).Number.ToString(System.Globalization.CultureInfo.InvariantCulture))
Dim pdfObj As PdfObject = pdfr.GetPdfObject(XrefIndex)
Dim pdfStrem As PdfStream = DirectCast(pdfObj, PdfStream)
If PdfName.IMAGE.Equals(type) Then
Dim bytes As Byte() = PdfReader.GetStreamBytesRaw(DirectCast(pdfStrem, PRStream))
If (bytes IsNot Nothing) Then
Dim strat As New ImageInfoTextExtractionStrategy()
iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(pdfr, pn, strat)
End If
End If
End If
Next
End If
Next
Why your current code does not find or extract those shapes:
The smiley image and the flower image are completely different in nature: The flower image is a bitmap image stored in the PDF as an /XObject (eXternal Object) of subtype /Image while the smiley is a vector images stored in the PDF as a part of the page content stream as a (not necessarily continuous) sequence of path definition and drawing operations.
Your code only searches for bitmap images stored as external objects and it does so in a somewhat convoluted way: It first scans for image xobjects using lowlevel methods, and only if it finds such a xobject, it employs the iText highlevel extraction capabilities. If it started out using only the iText image extraction capabilities, it would be less complex, and at the same time it would also recognize inlined bitmap images.
You might want to look at the iText in Action — 2nd Edition chapter 15 Webified iTextSharp Examples, especially ExtractImages.cs which utilizes MyImageRenderListener.cs for this. While inspiration by that code could improve your current code, it won't yet help you with your issue at hand, though.
What you have to do to find or extract the shapes using iText:
Unfortunately your question is not entirely clear on what you actually are trying to achieve.
Do you merely need to detect whether there is some image (bitmap or vector graphic) on some page?
Do you need some information on the image, e.g. size or position on the page?
Or do you actually want to extract the image?
While these objectives can easily be implemented for bitmap graphics using the afore mentioned iText highlevel extraction capabilities, they are fairly difficult to fulfill for vector graphics.
For generic PDFs they are virtually impossible to implement because the drawing operations for an individual figure need not be together, and even worse, the drawing operations for different figures on the same page, for underlines on the page, and for other graphic effects might even be mixed in one big heap of operations in seemingly random order.
In your case, though, you have one advantage: Office seems to properly tag the figures in the PDF. This at least makes detection of the number of different (i.e. differently tagged) vector graphics on a page easy and also allows for the differentiation which drawing operation belongs to which figure.
Thus, here some pointers on how to achieve the goals mentioned above for PDFs tagged like your sample PDF. As I'm not using VB myself, I don't have sample code. But as your sample code shows that you already know how to follow object references and how to interpret PDF object information, these pointers should suffice to show the way.
1. Detecting whether there is some image on some page.
As the page content is tagged, it suffices to scan the structure hierarchy starting from the /StructTreeRoot entry in the document catalogue (use PdfReader.Catalog, select the value of PdfName.STRUCTTREEROOT in it, and dig into it).
E.g. for page 1 (in PDF object 4 0) of your sample (with the "1233" at the top and the smiley below) you'll find an array with dictionaries:
<<
/Pg 4 0 R
/K [0]
/S /P
/P 24 0 R
>>
and
<<
/Pg 4 0 R
/K [1]
/Alt ()
/S /Figure
/P 22 0 R
>>
each of which references the page (/Pg 4 0 R). The first one is of type /P, a paragraph (your "1233"), and the second one is of type /Figure, a figure (your smiley). The presence of the second element indicates the presence of a figure on the page. Thus, the goal 1 is achieved already with these data.
(For details cf. the PDF specification ISO 32000-1:2008 section 14.7 and 14.8.)
2. Retrieving some information on the image, e.g. size or position on the page.
For this you have to extract the graphics operators responsible for creating the figure in question. As it is tagged, you have to extract the operators in a marked content block associated with the marked content ID given by /K [1] in the /Figure dictionary above, i.e. 1.
In the content stream you'll find this:
/P <</MCID 1>> BDC 0.31 0.506 0.741 rg
108.6 516.6 m
108.6 569.29 160.18 612 223.8 612 c
287.42 612 339 569.29 339 516.6 c
339 463.91 287.42 421.2 223.8 421.2 c
160.18 421.2 108.6 463.91 108.6 516.6 c
h
f*
[...]
108.6 516.6 m
108.6 569.29 160.18 612 223.8 612 c
287.42 612 339 569.29 339 516.6 c
339 463.91 287.42 421.2 223.8 421.2 c
160.18 421.2 108.6 463.91 108.6 516.6 c
h
S
EMC
This section between BDC for /MCID 1 and EMC contains the graphics operations you seek. If you want to get some information on the figure they represent, you have to analyze them.
This is a very low-level view on all this and one might whish for a higher level API to retrieve this.
iText does have a high level API for the analogous operations for text and bitmap image processing using the parser namespace class PdfReaderContentParser together with some apropos RenderListener implementation like your ImageInfoTextExtractionStrategy. Unfortunately, though, PdfReaderContentParser does not yet properly pre-process the vector graphics related operators.
To do this with iText, therefore, you either have to extend the underlying PdfContentStreamProcessor to add the missing pre-processing (which is do-able as that parser class is implemented using separate listeners for the individual operations, and you can easily register new listeners for the graphics operators); or you have to retrieve the page content and parse it yourself.
3. Extracting the image.
As the vector images inside a PDF use PDF specific vector graphics operators, you first have to decide in which format you want to export the image. Unless you are interested in the raw PDF operators, you will most likely require some library helping you to create a file in the desired format.
Once that is decided, you first extract the graphics operators in question as explained before and then feed them to that library to create an exportable image of your choice.
I must insert a number at a fix position in an existing A4 pdf.
I've tried the following as a first test, but that doesn't work(not text is added).
What goes wrong?
Here's my code:
byte[] omrMarks = omrFrame.getOmrImage();
Jpeg img = new Jpeg(omrMarks);
PdfImportedPage page = stamper.getImportedPage(source, pageNum);
PdfContentByte pageContent = stamper.getOverContent(pageNum);
pageContent.addImage(
img, img.getWidth(), 0, 0, img.getHeight(), 15f, (page.getHeight() - 312));
pageContent.moveTo(10, 200);
pageContent.beginText();
pageContent.setLiteral("Test");
pageContent.endText();
There are many issues with this question.
This is certainly wrong:
pageContent.moveTo(10, 200);
pageContent.beginText();
pageContent.setLiteral("Test");
pageContent.endText();
The moveTo() method doesn't make sense; it has no effect on the text state object.
The text state object is illegal because there's no setFontAndSize() (it's very odd that this doesn't throw a RuntimeException, are you using an obsolete version of iText?)
The setLiteral() method should only be used to add some literal PDF syntax to a content stream.
For instance, something like:
pageContent.setLiteral("\n100 100 m\n100 200 l\nS\n");
should only be used if you understand that the following PDF syntax draws a line:
100 100 m
100 200 l
S
It's clear from your question that you don't understand PDF syntax, so you shouldn't use these methods. Instead you should use convenience methods such as the showTextAligned() method, which hide the complexity of PDF and save you a couple of lines.
Maybe you have a good reason to opt for the "hard way", but in that case, you should read the documentation, otherwise you'll continue using methods such as setLiteral() instead of showText(), moveTo() instead of moveText(), and so on, resulting in code you don't want your employer to see.
Furthermore, you're making the assumption that the lower left corner of the page has the coordinates (0,0). That's probably true for the majority of PDF documents found in the wild, but that's not true for all PDF documents. The MediaBox doesn't have to be [0 0 595 842], it could as well be [595 842 1190 1684]. Moreover: what if there's a CropBox? Maybe you're adding content that isn't visible because it's cropped away...
I'm using iText in Java to select a few pages out of a big PDF document and save as a new, smaller PDF. At the same time I'd like to change their colours.
For example, suppose my pages all use shades of grey, and I'd like to make it green. All the colours used are shades of gray. I'd like to replace each of those colours with a corresponding colour in green.
Mark Storer asks:
What exactly are you trying to accomplish?
Turn this... into this:
I have some documents, on which I'm already using iText to select a smaller set of pages from the document based on user input - cutting more than 100 pages down to about 5. At the same time I wish to produce green, blue, yellow, pink etc versions of them. Not every page is in grayscale, but all the ones that matter are, so I can force their colour space if need be.
Update:
Following Mark Storer's suggestion of blending modes, here's what I have:
val reader = new PdfReader(file.toURL)
val document = new Document
val writer = PdfWriter.getInstance(document, outputStream)
document.open()
/* draw a white background behind the page, so the
blend always has something to transform, otherwise
it just fills. */
val canvas = writer.getDirectContent
canvas.setColorFill(new CMYKColor(0.0f, 0.0f, 0.0f, 0.0f))
canvas.rectangle(10f, 0f, 100f, 100f)
canvas.fill
/* Put the imported page on top of that */
val page = writer.getImportedPage(reader, 1)
canvas.addTemplate(page, 0, 0)
/* Fill a box with colour and a blending mode */
canvas.setColorFill(new CMYKColor(0.6f,0.1f,0.0f,0.5f))
val gstate = new PdfGState
gstate.setBlendMode(PdfGState.BM_SCREEN)
canvas.setGState(gstate)
canvas.rectangle(0f, 0f, 100f, 100f)
canvas.fill
document.close()
(It's in Scala, but the iText library is just the same as in Java)
The problem is, all the blending modes iText has available are "separable" modes: they operate on each colour channel independently. That means I can separately adjust the cyan, magenta, yellow or black values, but I can't turn gray into green.
To do that, I'd need to use the Color blending mode, which is "non-separable", ie the colour channels affect each other. As far as I can tell, iText doesn't offer that - none of the non-separale blending modes are listed among the constants in PdfGState. I'm using iText 5.0.5, which is the latest version as of writing this.
Is there a way of accessing these blending modes in iText, or even of hacking them in? Is there another way of achieving the result?
This adobe document described blending modes.
Update:
Even setting the blend mode to Color didn't work. I did this in code to force it:
val gstate = new PdfGState
gstate.put(PdfName.BM, new PdfName("Color"))
canvas.setGState(gstate)
and I checked the resulting PDF in a text editor to make sure it said the right thing. Sadly the result on screen just didn't work. I've no idea why, according to the PDF specification that should be the right blend mode.
Mark Storer asks:
"Color" didn't work? Funky. Can we see the PDF?
Here's a PDF.
Putting it on the web, I can now see that the "Color" mode works correctly in Chrome, but doesn't work in Acrobat 9 Pro (CS4). So the technique is correct, but Adobe fails at rendering!
I wonder if there isn't some way of "flattening" the effect of the blending mode, so the PDF contains the desired colour object directly rather than a blending intended to result in the right colour.
Idea: Turn this upside down. Use the existing page as an alpha channel on a page filled
entirely with the desired color rather than the other way around.
How? I'm not sure the GState applies to adding a template?
Also, the imported page would need the white background adding first, or it will simply flood with colour wherever there isn't an object rather then blending.
I tried doing this:
val canvas = writer.getDirectContent
canvas.setColorFill(new CMYKColor(0.6f,0.1f,0.0f,0.0f))
canvas.rectangle(10f, 0f, 500f, 500f)
canvas.fill
val template = canvas.createTemplate(500f, 500f)
template.setColorFill(new CMYKColor(0f, 0f, 0f, 0f))
template.rectangle(0f, 0f, 500f, 500f)
template.fill
val page = writer.getImportedPage(reader, 1)
template.addTemplate(page, 0, 0)
val gstate = new PdfGState
gstate.put(PdfName.BM, new PdfName("Color"))
canvas.setGState(gstate)
canvas.addTemplate(template, 0, 0)
And here's the PDF it produced. Not quite right, either in Chrome or Acrobat :)
Edit: Silly me. I changed the mode to "Luminosity", producing this file. As before, this looks correct in Chrome but not in Acrobat.
I just checked, and even Adobe Reader X doesn't render it properly. Which probably means what I'm asking for is impossible. :(
Solution
Leonard Rosenthal from Adobe got back to me, and clarified the problem: the "Color" blend mode only works when the transformation space is RGB, not CMYK. My PDF wasn't specifying the space, so Adobe products default to CMYK, while others default to RGB.
The solution in iText was to add this line near the top:
writer.setRgbTransparencyBlending(true)
Of course, for the sake of colour accuracy you don't want any more colour space conversions than absolutely necessary, so only use this line if you really do need to use RGB blend modes.
The colour pages produced look a little strange from a Photoshop user's perspective: it looks like light greys have been made more saturate than dark greys. I'm investigating ways of combining filters to adjust that output.
Here's the result!
Many thanks to Mark Storer for helping me reach this solution.
If you consistently want to go from "shades of gray" to "shades of Color X", you might be able to use transparency with some funky blending mode.
If you want to go through all the content streams and edit the existing color commands, that's a pretty tall order. You have to take into account a Wide Variety of colo(u)r spaces. DeviceGray, DeviceRGB, DeviceCMYK, ICC profiles, calibrated RGB and CMYK, spot colors, and So Forth.
What exactly are you trying to accomplish?
"Color" didn't work? Funky. Can we see the PDF?
Idea: Turn this upside down. Use the existing page as an alpha channel on a page filled entirely with the desired color rather than the other way around.
One more try. Rather than blending, use a transfer function. You'd need to build a Function dictionary. You're sticking with CMYK, so cramming any and all inputs into a specific output should be fairly simple.
something like this:
C: [0 1] -> [0 0.6]
M: [0 1] -> [0 0.1]
Y: [0 1] -> [0 0]
K: [0 1] -> [0 0]
(I swiped the 0.6 0.1 0 0 from your PDF)
Urgh... only your existing page is all deviceGray, right? Nope... CMYK as well, only just K's. You'd need a transfer function that took the K values, and mapped them to CMYK based on the color output you wanted.
And then I looked at how you define a function in PDF. So much for simple. Domains and ranges and samples oh my! Not exactly trivial.
Still, this just might work.
(though I still think you should find a blended PDF that works in Acrobat and see what the differences are)
Last ditch effort:
PM Leonard Rosenthol. He has an account here on SO. He's the Acrobat developer relations guy for Adobe. Tell him that Mark Storer is stumped. That should get his attention. ;)