How do I get the font height/weight from TextRenderInfo?

When I parse an existing PDF using iText(Sharp), I create an object which implements IRenderListener which I pass into PdfReaderContentParser.ProcessContent() and sure enough, my object's RenderText() gets called repeatedly with all the text in the PDF.
The problem is, the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold). Is this a known deficiency of iText(Sharp) or am I missing something?

the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold)
Height
Unfortunately iTextSharp does not provide a public font size method or member in the TextRenderInfo. Some people worked around this by using the distance between its GetAscentLine() and its GetDescentLine().
If you are ready to use Reflection, though, you can do better by exposing and using the private TextRenderInfo member GraphicsState gs, e.g. like in this render listener:
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf.parser;
public class LocationTextSizeExtractionStrategy : LocationTextExtractionStrategy
{
    // Collected font size, text, and font name of each chunk
    public List<SizeAndTextAndFont> myChunks = new List<SizeAndTextAndFont>();
    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);
        // Read the private graphics state of the render info via reflection
        GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
        myChunks.Add(new SizeAndTextAndFont(gs.FontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
    }
    // Reflection handle for the private TextRenderInfo member "gs"
    FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
}
// Helper class that stores the font size, text, and font name of a chunk
public class SizeAndTextAndFont
{
    public float Size;
    public String Text;
    public String Font;
    public SizeAndTextAndFont(float size, String text, String font)
    {
        this.Size = size;
        this.Text = text;
        this.Font = font;
    }
}
You can extract information with such a render listener like this:
using (var pdfReader = new PdfReader(testFile))
{
    // Loop through each page of the document
    for (var page = startPage; page < endPage; page++)
    {
        Console.WriteLine("\n Page {0}", page);
        LocationTextSizeExtractionStrategy strategy = new LocationTextSizeExtractionStrategy();
        PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        foreach (SizeAndTextAndFont p in strategy.myChunks)
        {
            Console.WriteLine(string.Format("<{0}> in {2} at {1}", p.Text, p.Size, p.Font));
        }
    }
}
This produces an output like this:
Page 1
< The Philippine Stock Exchange, Inc> in Helvetica-Bold at 8
< Daily Quotations Report> in Helvetica-Bold at 8
< March 23 , 2015> in Helvetica-Bold at 8
<Name> in Helvetica at 7
<Symbol> in Helvetica at 7
<Bid> in Helvetica at 7
[...]
Considering transformations
The numbers you see in the output as font sizes are the values of the font size property in the PDF graphics state at the time the respective text is drawn.
Due to the flexibility of PDF, though, this may not be the font size you eventually see in the output: a custom transformation may stretch the output considerably. Some PDF producers even always use a font size of 1 and rely on transformations to scale the output accordingly.
To get a good value for font sizes in such documents, you can improve the LocationTextSizeExtractionStrategy method RenderText like this:
public override void RenderText(TextRenderInfo wholeRenderInfo)
{
    base.RenderText(wholeRenderInfo);
    GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
    Matrix textToUserSpaceTransformMatrix = (Matrix) TextToUserSpaceTransformMatrixField.GetValue(wholeRenderInfo);
    float transformedFontSize = new Vector(0, gs.FontSize, 0).Cross(textToUserSpaceTransformMatrix).Length;
    myChunks.Add(new SizeAndTextAndFont(transformedFontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
}
with this additional reflection FieldInfo member:
FieldInfo TextToUserSpaceTransformMatrixField = typeof(TextRenderInfo).GetField("textToUserSpaceTransformMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
Weight
As you can see in the output above, the name of the font may contain not only the font family name but also a weight indicator:
< March 23 , 2015> in Helvetica-Bold at 8
In your example, therefore,
the TextRenderInfo tells me about the base font (in my case, Helvetica)
the Helvetica without any decorations would imply a regular weight.
Helvetica is one of the standard 14 fonts which every PDF viewer must provide out-of-the-box: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique. Thus, these names are pretty dependable.
Unfortunately font names in general may be chosen arbitrarily; a bold font may have "Bold" or "Black" or other indicators of boldness in its name or none at all.
One might also try to use the font's FontDescriptor dictionary, for which an entry FontWeight is specified. Unfortunately, this entry is optional, so you cannot count on it being there at all.
Furthermore, a font in a PDF can be artificially bold'ed, cf. this answer:
All these numbers are drawn using the same font, merely adding a rising outline line width.
Thus, I'm afraid there is no dependable way to find the exact font weight, merely a number of heuristics which may or may not return acceptable approximations.
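One such heuristic, checking the font name for boldness markers, could be sketched as follows. This is only an illustration in plain Java, not part of iText(Sharp): the class name, the method names, and the marker list are hypothetical, and the sketch assumes that the name may carry a subset prefix of six uppercase letters and a plus sign, which it strips first.

```java
import java.util.Locale;

/** Heuristic sketch: guess boldness from a PostScript font name. */
final class FontWeightGuess {

    /** Removes a subset prefix such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetPrefix(String postscriptName) {
        if (postscriptName.matches("[A-Z]{6}\\+.+")) {
            return postscriptName.substring(7);
        }
        return postscriptName;
    }

    /** Returns true if the name carries a known boldness marker; false proves nothing. */
    static boolean looksBold(String postscriptName) {
        String name = stripSubsetPrefix(postscriptName).toLowerCase(Locale.ROOT);
        // Hypothetical marker list; extend it for the fonts you actually meet.
        String[] markers = {"bold", "black", "heavy"};
        for (String marker : markers) {
            if (name.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksBold("Helvetica-Bold"));
        System.out.println(looksBold("Helvetica"));
        System.out.println(looksBold("ABCDEF+Arial-BoldMT"));
    }
}
```

Note that a name such as "Semibold" also matches the "bold" marker, and that artificially emboldened text or arbitrarily named fonts slip through entirely, so the result is an approximation at best.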

Related

PDF Formfield font size: Default appearance vs. appearance stream

I extract the font size of a form field to get information about its size (using iText). This works well for most documents; for some, however, I get a font size of 1 because the font size in the appearance stream is 1. Yet if I open the PDF in several different viewers, the size of this text field is always 8. I thought a form field should be rendered according to its appearance? So why do PDF viewers use the default appearance and not the font size defined in the appearance stream?
Update: As mentioned by MKL I did forget to consider the text matrix.
I did implement my own RenderListener for the font. Does anyone know how to apply the scaling?
public class PdfStreamFontExtractor implements RenderListener {
    @Override
    public void usedFont(DocumentFont font, float fontSize) {
        this.font = font;
        this.fontSize = fontSize;
    }
    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // get scaling factor from textToUserSpaceTransformMatrix?
    }
    ...
}
But the effective font size is 8 even in the appearance stream!
Have a look at the whole text object:
BT
8 0 0 8 2 5.55 Tm
/TT1 1 Tf
[...] TJ
ET
The text matrix set at the start scales everything by a factor of 8. Thus, the text drawn thereafter has an effective font size of 8 × 1 = 8.
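The arithmetic can be checked with a small stand-alone sketch. It is plain Java without iText, the class and method names are made up for illustration, and it assumes that no further scaling comes from the current transformation matrix, so only the Tm operands matter.

```java
/** Sketch: effective font size from a Tf size and the six operands of a Tm instruction. */
final class EffectiveFontSize {

    /**
     * tm holds the Tm operands a b c d e f. A vertical vector (0, fontSize) is mapped to
     * (c * fontSize, d * fontSize); the translation e, f does not affect a direction.
     */
    static double effectiveFontSize(double fontSize, double[] tm) {
        double x = tm[2] * fontSize;
        double y = tm[3] * fontSize;
        return Math.hypot(x, y);
    }

    public static void main(String[] args) {
        double[] tm = {8, 0, 0, 8, 2, 5.55};           // from "8 0 0 8 2 5.55 Tm"
        System.out.println(effectiveFontSize(1, tm));  // size 1 from "/TT1 1 Tf"
    }
}
```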
Admittedly, while you can see scaling text matrices in combination with size 1 Tf instructions in regular contents pretty often, I have not seen that in form field appearances yet. It's pretty uncommon, I'd assume.
Concerning your update...
Does anyone know how to apply the scaling?
@Override
public void renderText(TextRenderInfo renderInfo) {
    // get scaling factor from textToUserSpaceTransformMatrix?
}
Well, it depends on how you want to measure the transformed size.
One approach would be to take a vertical vector as long as the font size (as given in the Tf instruction), transform it by textToUserSpaceTransformMatrix, and take the length of the transformation result:
@Override
public void renderText(TextRenderInfo renderInfo) {
    scaledfontSize = renderInfo.getTransformedFontSize(unScaledfontSize);
}
public class TextRenderInfo {
    ...
    public float getTransformedFontSize(float fontSize) {
        return new Vector(0, fontSize, 0).cross(this.textToUserSpaceTransformMatrix).length();
    }
    ...
}
If the transformation only consists of reflections, rotations and scaling, the result should be as desired. If skewing effects are involved, you might want to project that transformed vector onto the plane perpendicular to the transformed writing direction before taking the length.
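The difference between the plain length and the projected length can be illustrated with a stand-alone sketch. It uses plain Java arrays instead of the iText Vector and Matrix classes, all names are hypothetical, and it assumes a non-degenerate matrix given as the six PDF operands a b c d e f.

```java
/** Sketch: font size under a skewing transformation, with and without projection. */
final class SkewAwareFontSize {

    /** Applies the linear part a b c d of a PDF matrix to the row vector (x, y). */
    static double[] transformDirection(double x, double y, double[] m) {
        return new double[] {m[0] * x + m[2] * y, m[1] * x + m[3] * y};
    }

    /** Length of the transformed vertical vector, as in the approach above. */
    static double plainLength(double fontSize, double[] m) {
        double[] v = transformDirection(0, fontSize, m);
        return Math.hypot(v[0], v[1]);
    }

    /** Length of the part of the transformed vector perpendicular to the writing direction. */
    static double perpendicularLength(double fontSize, double[] m) {
        double[] w = transformDirection(1, 0, m);        // transformed writing direction
        double[] v = transformDirection(0, fontSize, m); // transformed vertical vector
        // |w x v| / |w| is the length of v's component perpendicular to w.
        return Math.abs(w[0] * v[1] - w[1] * v[0]) / Math.hypot(w[0], w[1]);
    }

    public static void main(String[] args) {
        double[] skewed = {8, 0, 2, 8, 0, 0}; // scale by 8 plus a horizontal skew
        System.out.println(plainLength(1, skewed));
        System.out.println(perpendicularLength(1, skewed));
    }
}
```

For this example matrix the plain length is the square root of 68, a bit more than 8, while the projected length is 8, the height the glyphs actually span above the baseline.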

iText - PDFAppearence issue

We're using iText to put a text inside a signature placeholder in a PDF. We use a code snippet similar to this to define the signature appearance:
PdfStamper stp = PdfStamper.createSignature(inputReader, os, '\0', tempFile2, true);
sap = stp.getSignatureAppearance();
sap.setVisibleSignature(placeholder);
sap.setRenderingMode(PdfSignatureAppearance.RenderingMode.DESCRIPTION);
sap.setCertificationLevel(PdfSignatureAppearance.NOT_CERTIFIED);
Calendar cal = Calendar.getInstance();
sap.setSignDate(cal);
sap.setLayer2Text(text+"\n"+cal.getTime().toString());
sap.setReason(text+"\n"+cal.getTime().toString());
Everything works fine, but the signature text does not fill the whole signature placeholder area as we expected; instead, the filled area seems to have a height of approximately 70% of the available space.
As a result, sometimes, especially if the signature text is quite long, the signature text does not fit in the placeholder and is stripped away.
Example of filled Signature:
I looked into the PdfSignatureAppearance class and found this code snippet in the getAppearance() method, which is responsible for this behaviour and is invoked when
sap.setRenderingMode(PdfSignatureAppearance.RenderingMode.DESCRIPTION);
is being called
else {
    dataRect = new Rectangle(
        MARGIN,
        MARGIN,
        rect.getWidth() - MARGIN,
        rect.getHeight() * (1 - TOP_SECTION) - MARGIN);
}
I don't get the reason for that, because I expect that the text could use all the available placeholder height, with the proper margin.
Is there any way to bypass this behaviour?
We are using iText 5.4.2, but newer versions contain the same code snippet, so I expect the behaviour to be the same.
As @JJ. already commented,
TOP_SECTION is connected with acro6layers rendering and the code [determining the datarect in pure DESCRIPTION mode] does not take into account the value of the acro6layer flag.
Unless one wants to fix this in the iText 5 code itself, the easiest way to make one's description use the whole signature space is to construct the layer 2 appearance oneself.
To do so one merely has to retrieve a PdfTemplate from PdfSignatureAppearance.getLayer(2) and fill it as desired after one has called PdfSignatureAppearance.setVisibleSignature. The PdfSignatureAppearance remembers that you already have retrieved the layer 2 and doesn't change it anymore.
For the case at hand we essentially copy the PdfSignatureAppearance.getAppearance code for generating layer 2 in pure DESCRIPTION mode, merely correcting the code determining the datarect:
PdfSignatureAppearance appearance = ...;
[...]
appearance.setVisibleSignature(new Rectangle(36, 748, 144, 780), 1, "sig");
PdfTemplate layer2 = appearance.getLayer(2);
String text = "We're using iText to put a text inside a signature placeholder in a PDF. "
        + "We use a code snippet similar to this to define the Signature Appearence.\n"
        + "Everything works fine, but the signature text does not fill all the signature "
        + "placeholder area as expected by us, but the area filled seems to have an height "
        + "that is approximately the 70% of the available space.\n"
        + "As a result, sometimes especially if the length of the signature text is quite "
        + "big, the signature text does not fit in the placeholder and the text is striped "
        + "away.";
Font font = new Font();
float size = font.getSize();
final float MARGIN = 2;
Rectangle dataRect = new Rectangle(
        MARGIN,
        MARGIN,
        appearance.getRect().getWidth() - MARGIN,
        appearance.getRect().getHeight() - MARGIN);
if (size <= 0) {
    Rectangle sr = new Rectangle(dataRect.getWidth(), dataRect.getHeight());
    size = ColumnText.fitText(font, text, sr, 12, appearance.getRunDirection());
}
ColumnText ct = new ColumnText(layer2);
ct.setRunDirection(appearance.getRunDirection());
ct.setSimpleColumn(new Phrase(text, font), dataRect.getLeft(), dataRect.getBottom(), dataRect.getRight(), dataRect.getTop(), size, Element.ALIGN_LEFT);
ct.go();
(CreateSignature.java test signWithCustomLayer2)
(As description text I used some paragraphs from the question body.)
By adapting the MARGIN value in the code above, one can use even more area. As that can result in the text touching the border, though, the result might not look very good.
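To put rough numbers on the gain, the following stand-alone sketch compares the usable text height of both data rectangle formulas for the 32 point high signature rectangle used above. It is plain Java without iText, the names are hypothetical, and the TOP_SECTION value of 0.3 is an assumption that matches the roughly 70% observed in the question; check the constant in the iText version you use.

```java
/** Sketch: usable text height of the library's data rectangle versus the corrected one. */
final class DataRectHeight {
    static final double MARGIN = 2;        // margin value used in the code above
    static final double TOP_SECTION = 0.3; // assumption, consistent with the observed 70%

    /** Height between bottom and top edge as computed in pure DESCRIPTION mode. */
    static double libraryTextHeight(double rectHeight) {
        double top = rectHeight * (1 - TOP_SECTION) - MARGIN;
        return top - MARGIN;
    }

    /** Height between bottom and top edge of the corrected data rectangle. */
    static double customTextHeight(double rectHeight) {
        double top = rectHeight - MARGIN;
        return top - MARGIN;
    }

    public static void main(String[] args) {
        double rectHeight = 780 - 748; // signature rectangle (36, 748, 144, 780)
        System.out.println(libraryTextHeight(rectHeight));
        System.out.println(customTextHeight(rectHeight));
    }
}
```

Under these assumptions the text height grows from about 18.4 to 28 points.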
As an aside:
if the signature text is quite long, the signature text does not fit in the placeholder and is stripped away.
If you initialize the size variable above with a non-positive value, the code in the if (size <= 0) block will calculate a font size which allows all of the text to fit into the signature rectangle. This does happen in the code above as new Font() returns a font with a size of UNDEFINED which is a constant -1.

How to measure the text length using PDFsharp library

I have a requirement to measure the text length in a PDF and wrap the line if the length exceeds a certain amount. I am already using PDFsharp library.
I already used the following code to determine the length of the text.
public static Size MeasureString(string s, Font font)
{
    SizeF result;
    using (var image = new Bitmap(1, 1))
    {
        using (var g = Graphics.FromImage(image))
        {
            result = g.MeasureString(s, font);
        }
    }
    return result.ToSize();
}
As I understand it, I depend on the resolution and DPI to convert the Height and Width properties of the Size class to millimeters. But according to the PDFsharp team's answer in this post, "PDF files are vector files that have no DPI".
So I am a bit confused about the right way to measure the text length using this library.
PDF files have no pixels, PDF files have no DPI.
The standard unit with PDFsharp is points. There are 72 points per inch.
You can have the length of the text in points, mm, cm, inch, ...
You can have the width of the page in points, mm, cm, inch, ...
The XTextFormatter class can do simple line-wrapping for you:
http://www.pdfsharp.net/wiki/TextLayout-sample.ashx
This sample shows how to call MeasureString:
http://www.pdfsharp.net/wiki/Graphics-sample.ashx#Show_how_to_get_text_metric_information_19
Use the correct MeasureString method with an XGraphics object and you will get an XSize object with the text dimensions - no pixels, but mm, cm, inch, point, ...
Use MigraDoc for line-wrapping with sophisticated text formatting.
The Wikipedia article on Points: https://en.wikipedia.org/wiki/Point_(typography)
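The unit relationship and the wrapping idea can be sketched independently of PDFsharp. The following is plain Java, not PDFsharp code: all names are hypothetical, and the measure parameter is a stand-in for a real text measurer (such as a MeasureString call on an XGraphics object) that returns widths in points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch: point/millimeter conversion and a greedy word wrap driven by a width measurer. */
final class PointUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double pointsToMm(double points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /** Starts a new line whenever adding the next word would exceed maxWidthPoints. */
    static List<String> wrap(String text, double maxWidthPoints, ToDoubleFunction<String> measure) {
        List<String> lines = new ArrayList<>();
        String current = "";
        for (String word : text.trim().split("\\s+")) {
            String candidate = current.isEmpty() ? word : current + " " + word;
            if (!current.isEmpty() && measure.applyAsDouble(candidate) > maxWidthPoints) {
                lines.add(current);
                current = word;
            } else {
                current = candidate;
            }
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(mmToPoints(210)); // width of an A4 page in points
        // Toy measurer: pretends every character is 6 points wide.
        System.out.println(wrap("aa bb cc", 30, s -> s.length() * 6.0));
    }
}
```

A single word wider than the limit still gets a line of its own here; real layout code would have to break or shrink it.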

PDFBox pdf to image generates overlapping text

For a side project I started using PDFBox to convert pdf file to image. This is the pdf file I am using to convert to image file https://bitcoin.org/bitcoin.pdf.
This is the code I am using. It is very simple code which calls PDFToImage. But the output jpg image file looks really bad, with a lot of commas inserted and some overlapping text.
String [] args_2 = new String[7];
String pdfPath = "C:\\bitcoin.pdf";
args_2[0] = "-startPage";
args_2[1] = "1";
args_2[2] = "-endPage";
args_2[3] = "1";
args_2[4] = "-outputPrefix";
args_2[5] = "my_image_2";
//args_2[6] = "-resolution";
//args_2[7] = "1000";
args_2[6] = pdfPath;
try {
    PDFToImage.main(args_2);
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
If you look at the logging output (you may need to activate logging in your environment), you'll see many entries like these (generated using PDFBox 1.8.5):
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
So PDFBox renders the text with fonts different from the ones indicated by the PDF. This explains both the many inserted commas and the overlapping text:
different fonts may have different encodings. It looks like your sample PDF uses an encoding which has a comma where the default font assumed by PDFBox has a space character;
different fonts have different glyph widths. In your sample PDF the different glyph widths cause overlapping text.
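The width effect can be illustrated with a toy calculation. The sketch below is plain Java, not PDFBox code, and every number in it is hypothetical: it assumes that glyphs are positioned using the widths of the font named in the PDF while the substituted font paints glyphs of other widths.

```java
/** Toy sketch: how far substituted glyphs reach into the space of the following glyph. */
final class GlyphOverlap {

    /**
     * advances: widths used to position each glyph (from the font the PDF names);
     * drawnWidths: widths of the glyphs the substitute font actually paints.
     * Returns, per glyph, how far it extends past the start of the next glyph (0 if it fits).
     */
    static double[] overlaps(double[] advances, double[] drawnWidths) {
        double[] result = new double[advances.length];
        for (int i = 0; i < advances.length; i++) {
            result[i] = Math.max(0, drawnWidths[i] - advances[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] advances = {500, 500, 500};    // hypothetical widths, 1/1000 text space units
        double[] drawnWidths = {722, 278, 611}; // hypothetical substitute glyph widths
        for (double overlap : overlaps(advances, drawnWidths)) {
            System.out.println(overlap);
        }
    }
}
```

Glyphs wider than their advance run into their neighbours, while narrower ones leave gaps.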
The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox currently under development, instead. Be aware, though, the classes for rendering have been changed.
Using PDFBox 2.0.0-SNAPSHOT
Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:
PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
@SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();
PDFRenderer renderer = new PDFRenderer(document);
for (int i = 0; i < pages.size(); i++)
{
    BufferedImage image = renderer.renderImage(i);
    ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}
Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.
PS: As proposed by Tilman Hausherr you may want to replace the ImageIO.write call by
ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);
ImageIOUtil is a PDFBox helper class which tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.
If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.
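As a rough guide to what a resolution setting means for the image size, the following stand-alone sketch converts page dimensions in points to pixels. It is plain Java rather than PDFBox code, the names are hypothetical, the US Letter page size is only an example, and the exact rounding PDFBox applies may differ.

```java
/** Sketch: pixel dimensions of a page rendered at a given resolution. */
final class RenderSize {

    /** A page measures 72 points per inch, so pixels = points * dpi / 72. */
    static int pixels(double points, double dpi) {
        return (int) Math.round(points * dpi / 72.0);
    }

    public static void main(String[] args) {
        double widthPt = 612;  // example: US Letter width in points
        double heightPt = 792; // example: US Letter height in points
        System.out.println(pixels(widthPt, 72) + " x " + pixels(heightPt, 72));
        System.out.println(pixels(widthPt, 300) + " x " + pixels(heightPt, 300));
    }
}
```

At 72 dpi one point maps to one pixel; at 300 dpi the same page needs more than four times as many pixels in each direction.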

Some pdf file watermark does not show using iText

Our company uses iText to stamp watermark text (not an image) on some PDF forms. I noticed that about 95% of the forms show the watermark correctly and about 5% do not. To test this, I copied two original PDF files, one that had been marked correctly and one that had not, and ran them through a small program, with the same result: one got marked, the other did not. I then tried the latest version of the iText jar file (version 5.0.6) and got the same result. I checked the PDF file properties, security settings, etc., but nothing gives any hint. The result file does change in size and is marked "changed by iText version...." after the program has run.
Here is the sample watermark code (using the iText jar version 2.1.7). Note that the topText, mainText and bottomText parameters passed in produce three lines of watermark text in the PDF.
Any help appreciated !!
public class WatermarkGenerator {
    private static int TEXT_TILT_ANGLE = 25;
    private static Color MEDIUM_GRAY = new Color(160, 160, 160);
    private static int SUPPORT_FONT_SIZE = 42;
    private static int PRIMARY_FONT_SIZE = 54;
    public static void addWaterMark(InputStream pdfInputStream,
            OutputStream outputStream, String topText,
            String mainText, String bottomText) throws Exception {
        PdfReader reader = new PdfReader(pdfInputStream);
        int numPages = reader.getNumberOfPages();
        // Create a stamper that will copy the document to the output
        // stream.
        PdfStamper stamp = new PdfStamper(reader, outputStream);
        int page=1;
        BaseFont baseFont =
            BaseFont.createFont(BaseFont.HELVETICA_BOLDOBLIQUE,
                BaseFont.WINANSI, BaseFont.EMBEDDED);
        float width;
        float height;
        while (page <= numPages) {
            PdfContentByte cb = stamp.getOverContent(page);
            height = reader.getPageSizeWithRotation(page).getHeight() / 2;
            width = reader.getPageSizeWithRotation(page).getWidth() / 2;
            cb = stamp.getUnderContent(page);
            cb.saveState();
            cb.setColorFill(MEDIUM_GRAY);
            // Top Text
            cb.beginText();
            cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
            cb.showTextAligned(Element.ALIGN_CENTER, topText, width,
                height+PRIMARY_FONT_SIZE+16, TEXT_TILT_ANGLE);
            cb.endText();
            // Primary Text
            cb.beginText();
            cb.setFontAndSize(baseFont, PRIMARY_FONT_SIZE);
            cb.showTextAligned(Element.ALIGN_CENTER, mainText, width,
                height, TEXT_TILT_ANGLE);
            cb.endText();
            // Bottom Text
            cb.beginText();
            cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
            cb.showTextAligned(Element.ALIGN_CENTER, bottomText, width,
                height-PRIMARY_FONT_SIZE-6, TEXT_TILT_ANGLE);
            cb.endText();
            cb.restoreState();
            page++;
        }
        stamp.close();
    }
}
We solved the problem by changing the save option in Adobe LiveCycle: File->Save->properties->Save as, then look at "Save as type". The default is "Acrobat 7.0.5 Dynamic PDF Form File"; we changed it to "7.0.5 Static PDF Form File" (actually, any static one will work). Files saved as static forms do not have the disappearing watermark problem. Thanks, Mark, for pointing us in the right direction.
You're using the underContent rather than the overContent. Don't do that. It leaves you at the mercy of big, white-filled rectangles that some folks insist on drawing first thing. It's a holdover from less-than-good PostScript interpreters and hasn't been necessary for Many Years.
Okay, having viewed your PDF, I can see the problem is that this is an XFA-based form (from LiveCycle Designer). Acrobat can (and often does) rebuild the entire file based on the XFA (a type of xml) it contains. That's how your changes are lost. When Acrobat rebuilds the PDF from the XFA, all the existing PDF information is pitched, including your watermark.
The only way to get this to work would be to define the watermark as part of the XFA file contained in the PDF.
Detecting these forms isn't all that hard:
PdfReader reader = new PdfReader(...);
AcroFields acFields = reader.getAcroFields();
XfaForm xfaForm = acFields.getXfaForm();
if (xfaForm != null && xfaForm.isXfaPresent()) {
    // Ohs nose.
    throw new ItsATrapException("We can't repel XML of that magnitude!");
}
Modifying them on the other hand could be Quite Challenging, but here's the specs.
Once you've figured out what needs to be changed, it's a simple matter of XML manipulation... but that "figure it out" part could be interesting.
Good hunting.