I extract the font size of a form field to get information about its size (using iText). This works well for most documents; however, for some I get a font size of 1 because the font size in the appearance stream is 1. Yet if I open the PDF in several different viewers, the text in this field is always rendered at size 8. I thought a form field should be rendered according to its appearance, so why do PDF viewers seem to use the default appearance and not the font size defined in the appearance stream?
Update: As mentioned by MKL, I forgot to consider the text matrix.
I implemented my own RenderListener to capture the font. Does anyone know how to apply the scaling?
public class PdfStreamFontExtractor implements RenderListener {
    private DocumentFont font;
    private float fontSize;

    @Override
    public void usedFont(DocumentFont font, float fontSize) {
        this.font = font;
        this.fontSize = fontSize;
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // get scaling factor from textToUserSpaceTransformMatrix?
    }
    ...
}
But the effective font size is 8 even in the appearance stream!
Have a look at the whole text object:
BT
8 0 0 8 2 5.55 Tm
/TT1 1 Tf
[...] TJ
ET
The text matrix set at the start scales everything by a factor of 8. Thus, the text drawn thereafter has an effective font size of 8 × 1 = 8.
Admittedly, while you can see scaling text matrices combined with size 1 Tf instructions in regular content streams pretty often, I have not yet seen that in form field appearances. It's pretty uncommon, I'd assume.
Concerning your update...
Does anyone know how to apply the scaling?
@Override
public void renderText(TextRenderInfo renderInfo) {
    // get scaling factor from textToUserSpaceTransformMatrix?
}
Well, it depends on how you want to measure the transformed size.
One approach would be to take a vertical vector as long as the font size (as given in the Tf instruction), transform it by textToUserSpaceTransformMatrix, and take the length of the transformation result:
@Override
public void renderText(TextRenderInfo renderInfo) {
    scaledFontSize = renderInfo.getTransformedFontSize(unscaledFontSize);
}

where getTransformedFontSize is a method added to iText's TextRenderInfo class:

public class TextRenderInfo {
    ...
    public float getTransformedFontSize(float fontSize) {
        return new Vector(0, fontSize, 0).cross(this.textToUserSpaceTransformMatrix).length();
    }
    ...
}
If the transformation only consists of reflections, rotations and scaling, the result should be as desired. If skewing effects are involved, you might want to project that transformed vector onto the plane perpendicular to the transformed writing direction before taking the length.
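For illustration, such a projection might look like the following C# (iTextSharp) sketch; the helper is my own, not a library method, and it only uses the Vector operations shown elsewhere in this thread:

static float GetTransformedFontSize(float fontSize, Matrix textToUserSpaceTransformMatrix)
{
    // vertical vector of length fontSize; z = 0 makes Cross ignore the translation part
    Vector up = new Vector(0, fontSize, 0).Cross(textToUserSpaceTransformMatrix);
    // transformed writing direction (the baseline)
    Vector dir = new Vector(1, 0, 0).Cross(textToUserSpaceTransformMatrix);
    Vector unit = dir.Multiply(1f / dir.Length);
    // drop the component of 'up' parallel to the baseline, keep the perpendicular rest
    Vector perpendicular = up.Subtract(unit.Multiply(up.Dot(unit)));
    return perpendicular.Length;
}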
Related
When I parse an existing PDF using iText(Sharp), I create an object which implements IRenderListener and pass it into PdfReaderContentParser.ProcessContent(), and sure enough, my object's RenderText() gets called repeatedly with all the text in the PDF.
The problem is that the TextRenderInfo tells me about the base font (in my case, Helvetica), but I can't tell the height of the font or its weight (regular vs. bold). Is this a known deficiency of iText(Sharp), or am I missing something?
the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold)
Height
Unfortunately iTextSharp does not provide a public font size method or member in the TextRenderInfo. Some people worked around this by using the distance between its GetAscentLine() and its GetDescentLine().
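That workaround boils down to something like this inside a RenderText implementation (a sketch; note it measures the transformed glyph box height, which only approximates the nominal font size):

// height of the chunk's glyph box: distance from the descent line to the ascent line
float approxFontSize = renderInfo.GetAscentLine().GetStartPoint()
    .Subtract(renderInfo.GetDescentLine().GetStartPoint()).Length;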
If you are ready to use Reflection, though, you can do better by exposing and using the private TextRenderInfo member GraphicsState gs, e.g. like in this render listener:
public class LocationTextSizeExtractionStrategy : LocationTextExtractionStrategy
{
    // Holds the size, text, and font of each chunk
    public List<SizeAndTextAndFont> myChunks = new List<SizeAndTextAndFont>();

    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);
        GraphicsState gs = (GraphicsState)GsField.GetValue(wholeRenderInfo);
        myChunks.Add(new SizeAndTextAndFont(gs.FontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
    }

    FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
}
// Helper class that stores our size, text, and font
public class SizeAndTextAndFont
{
    public float Size;
    public String Text;
    public String Font;

    public SizeAndTextAndFont(float size, String text, String font)
    {
        this.Size = size;
        this.Text = text;
        this.Font = font;
    }
}
You can extract information with such a render listener like this:
using (var pdfReader = new PdfReader(testFile))
{
    // Loop through each page of the document
    for (var page = startPage; page < endPage; page++)
    {
        Console.WriteLine("\n Page {0}", page);

        LocationTextSizeExtractionStrategy strategy = new LocationTextSizeExtractionStrategy();
        PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        foreach (SizeAndTextAndFont p in strategy.myChunks)
        {
            Console.WriteLine(string.Format("<{0}> in {2} at {1}", p.Text, p.Size, p.Font));
        }
    }
}
This produces an output like this:
Page 1
< The Philippine Stock Exchange, Inc> in Helvetica-Bold at 8
< Daily Quotations Report> in Helvetica-Bold at 8
< March 23 , 2015> in Helvetica-Bold at 8
<Name> in Helvetica at 7
<Symbol> in Helvetica at 7
<Bid> in Helvetica at 7
[...]
Considering transformations
The numbers you see in the output as font sizes are the values of the font size property in the PDF graphics state at the time the respective text is drawn.
Due to the flexibility of PDF, though, this may not be the font size you eventually see in the output: a custom transformation may stretch the output considerably. Some PDF producers even always use a font size of 1 and rely on transformations to stretch the output accordingly.
To get a good value for font sizes in such documents, you can improve the LocationTextSizeExtractionStrategy method RenderText like this:
public override void RenderText(TextRenderInfo wholeRenderInfo)
{
    base.RenderText(wholeRenderInfo);
    GraphicsState gs = (GraphicsState)GsField.GetValue(wholeRenderInfo);
    Matrix textToUserSpaceTransformMatrix = (Matrix)TextToUserSpaceTransformMatrixField.GetValue(wholeRenderInfo);
    float transformedFontSize = new Vector(0, gs.FontSize, 0).Cross(textToUserSpaceTransformMatrix).Length;
    myChunks.Add(new SizeAndTextAndFont(transformedFontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
}
with this additional reflection FieldInfo member:
FieldInfo TextToUserSpaceTransformMatrixField = typeof(TextRenderInfo).GetField("textToUserSpaceTransformMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
Weight
As you can see in the output above, the name of the font may contain more than the mere font family name; it may also include a weight indicator:
< March 23 , 2015> in Helvetica-Bold at 8
In your example, therefore,
the TextRenderInfo tells me about the base font (in my case, Helvetica)
the Helvetica without any decorations would imply a regular weight.
Helvetica is one of the standard 14 fonts which every PDF viewer must provide out-of-the-box: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique. Thus, these names are pretty dependable.
Unfortunately, font names in general may be chosen arbitrarily; a bold font may have "Bold" or "Black" or other indicators of boldness in its name, or none at all.
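If you want to go down that route anyway, a crude name-based heuristic might look like this (the indicator list is my own assumption and certainly not exhaustive):

static bool LooksBold(string postscriptFontName)
{
    // common weight indicators seen in PostScript font names
    string[] hints = { "Bold", "Black", "Heavy", "Semibold" };
    foreach (string hint in hints)
    {
        if (postscriptFontName.IndexOf(hint, StringComparison.OrdinalIgnoreCase) >= 0)
            return true;
    }
    return false;
}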
One might also try to use the font's FontDescriptor dictionary, for which an entry FontWeight is specified. Unfortunately this entry is optional, so you cannot count on it being there at all.
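Reading that entry with iTextSharp might look like this (a sketch; I'm assuming the underlying font dictionary is reachable via DocumentFont.FontDictionary):

PdfDictionary fontDict = renderInfo.GetFont().FontDictionary;
PdfDictionary descriptor = fontDict.GetAsDict(PdfName.FONTDESCRIPTOR);
PdfNumber weight = descriptor == null ? null : descriptor.GetAsNumber(new PdfName("FontWeight"));
if (weight != null)
    Console.WriteLine("FontWeight: {0}", weight.IntValue); // 400 = regular, 700 = bold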
Furthermore, a font in a PDF can be artificially emboldened, cf. this answer:
All these numbers are drawn using the same font, merely adding an increasing outline line width.
Thus, I'm afraid there is no dependable way to find the exact font weight, merely a number of heuristics which may or may not return acceptable approximations.
I thought this would be a pretty simple task, but I have now tried for hours and can't figure out how to get around it.
I have a list of friends which should be displayed in a scrollable list. Each friend has a profile image and a name associated with him, so each item in the list should display the image and the name.
The problem is that I can't figure out how to make a flexible container that contains both the image and the name label. I want to be able to change the width and height dynamically so that the image and the text scale and move accordingly.
I am using Unity 5 and Unity UI.
I want to achieve the following for the container:
The width and height of the container should be flexible
The image is a child of the container and should be left aligned, the height should fill the container height and should keep its aspect ratio.
The name label is a child of the container and should be left-aligned to the image with 15 px left padding. The width of the text should fill the rest of the space in the container.
Hope this is illustrated well in the following attached image:
I asked the same question here on Unity Answers, but no answers so far. Is it really possible that such a simple task is not doable in Unity UI without using code?
Thanks a lot for your time!
Looks like this can be achieved with layout components.
The image is a child of the container and should be left aligned, the height should fill the container height and should keep its aspect ratio.
For this, try adding an Aspect Ratio Fitter component with Aspect Mode set to Width Controls Height.
The name label is a child of the container and should be left aligned to the image with 15 px left padding. The width of the text should fill the rest of the space in the container.
For this you can simply anchor and stretch your label to the container size and use the Best Fit option on the Text component.
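If you prefer to set the same thing up from code, a minimal sketch (the component references and the square aspect ratio are assumptions on my part):

using UnityEngine;
using UnityEngine.UI;

public class FriendItemSetup : MonoBehaviour
{
    public Image profileImage; // assumed assigned in the Inspector
    public Text nameLabel;     // assumed assigned in the Inspector

    void Start()
    {
        // keep the profile image's proportions, as suggested above
        var fitter = profileImage.gameObject.AddComponent<AspectRatioFitter>();
        fitter.aspectMode = AspectRatioFitter.AspectMode.WidthControlsHeight;
        fitter.aspectRatio = 1f; // assuming square profile pictures

        // scale the label's font with the rect it is stretched into
        nameLabel.resizeTextForBestFit = true;
    }
}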
We never found a way to do this without code. I am very unsatisfied that such a simple task cannot be done in the current UI system.
We did create the following layout script that does the trick (thanks to Angry Ant for helping us out). The script is attached to the text label:
using UnityEngine;
using UnityEngine.EventSystems;

[RequireComponent (typeof (RectTransform))]
public class IndentByHeightFitter : UIBehaviour, UnityEngine.UI.ILayoutSelfController
{
    public enum Edge
    {
        Left,
        Right
    }

    [SerializeField] Edge m_Edge = Edge.Left;
    [SerializeField] float border;

    public virtual void SetLayoutHorizontal ()
    {
        UpdateRect ();
    }

    public virtual void SetLayoutVertical () {}

#if UNITY_EDITOR
    protected override void OnValidate ()
    {
        UpdateRect ();
    }
#endif

    protected override void OnRectTransformDimensionsChange ()
    {
        UpdateRect ();
    }

    Vector2 GetParentSize ()
    {
        RectTransform parent = transform.parent as RectTransform;
        return parent == null ? Vector2.zero : parent.rect.size;
    }

    RectTransform.Edge IndentEdgeToRectEdge (Edge edge)
    {
        return edge == Edge.Left ? RectTransform.Edge.Left : RectTransform.Edge.Right;
    }

    void UpdateRect ()
    {
        RectTransform rect = (RectTransform)transform;
        Vector2 parentSize = GetParentSize ();

        // indent by the parent's height (the square image) plus the border,
        // and take up the remaining width
        rect.SetInsetAndSizeFromParentEdge (IndentEdgeToRectEdge (m_Edge), parentSize.y + border, parentSize.x - parentSize.y);
    }
}
I have a requirement to measure the text length in a PDF and wrap the line if the length exceeds a certain amount. I am already using the PDFsharp library.
I used the following code to determine the length of the text:
// uses System.Drawing (GDI+)
public static Size MeasureString(string s, Font font)
{
    SizeF result;
    using (var image = new Bitmap(1, 1))
    {
        using (var g = Graphics.FromImage(image))
        {
            result = g.MeasureString(s, font);
        }
    }
    return result.ToSize();
}
As I understand it, I am pretty dependent on the resolution and DPI when converting the Height and Width properties of the Size class to millimeters. But according to the PDFsharp team's answer in this post, "PDF files are vector files that have no DPI".
So I am a bit confused about the right way to measure the text length using this library.
PDF files have no pixels, PDF files have no DPI.
The standard unit with PDFsharp is points. There are 72 points per inch.
You can have the length of the text in points, mm, cm, inch, ...
You can have the width of the page in points, mm, cm, inch, ...
The XTextFormatter class can do simple line-wrapping for you:
http://www.pdfsharp.net/wiki/TextLayout-sample.ashx
This sample shows how to call MeasureString:
http://www.pdfsharp.net/wiki/Graphics-sample.ashx#Show_how_to_get_text_metric_information_19
Use the correct MeasureString method with an XGraphics object and you will get an XSize object with the text dimensions - no pixels, but mm, cm, inch, point, ...
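For illustration, a minimal sketch of both calls (PDFsharp API; the font, text, and rectangle values are just examples):

using PdfSharp.Drawing;
using PdfSharp.Drawing.Layout;
using PdfSharp.Pdf;

var document = new PdfDocument();
var page = document.AddPage();
using (XGraphics gfx = XGraphics.FromPdfPage(page))
{
    var font = new XFont("Arial", 12);

    // MeasureString returns an XSize in points (1 pt = 1/72 inch)
    XSize size = gfx.MeasureString("Hello, world!", font);
    double widthMm = XUnit.FromPoint(size.Width).Millimeter;

    // XTextFormatter wraps the string inside the given rectangle
    var tf = new XTextFormatter(gfx);
    tf.DrawString("A longer text that should be wrapped onto several lines...",
        font, XBrushes.Black, new XRect(40, 40, 200, 400), XStringFormats.TopLeft);
}
document.Save("measured.pdf");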
Use MigraDoc for line-wrapping with sophisticated text formatting.
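A minimal MigraDoc sketch for comparison (document-object-model API; names and sizes are examples):

using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;

var doc = new Document();
Section section = doc.AddSection();
Paragraph paragraph = section.AddParagraph(
    "A long text that MigraDoc measures, wraps and paginates automatically.");
paragraph.Format.Font.Size = 12;

// render the abstract document model to a PDF file
var renderer = new PdfDocumentRenderer(true) { Document = doc }; // true = unicode
renderer.RenderDocument();
renderer.PdfDocument.Save("wrapped.pdf");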
The Wikipedia article on Points: https://en.wikipedia.org/wiki/Point_(typography)
I'm looking for a library or command-line program that can compress PDFs.
Compression speed and file size are very important.
The PDFs are full of very large print-quality images.
Adobe Acrobat does high-quality, fast compression but does not allow "reduced size PDFs" to be saved through a programmatic interface.
Ghostscript does high-quality compression but takes way too long (minutes).
If a commercial library is an option, you could give Amyuni PDF Creator a try. There is a .NET version (C#/VB.NET, etc.) and an ActiveX version (for C++/Delphi/VB/PHP, etc.).
You can iterate through all the objects of each page, pick those that are images, and reduce their size. You have several possibilities there:
Setting a lower compression rate.
Down-sampling (extracting the image, resizing it to a lower resolution, and putting it back in your file).
Combining the previous two.
Here is how the code would look for the first option, in C#, using Amyuni PDF Creator .Net:

// open a PDF document ('document' is an IacDocument instance created elsewhere)
document.Open("c:\\temp\\myfile.pdf", "");
IacPage page1 = document.GetPage(1);
Amyuni.PDFCreator.IacAttribute attribute = page1.AttributeByName("Objects");

// listobj is an array list of graphic objects
System.Collections.ArrayList listobj = (System.Collections.ArrayList)attribute.Value;

// 'dynamic' saves us from casting each graphic object to its concrete type
foreach (dynamic pdfObj in listobj)
{
    if ((IacObjectType)pdfObj.AttributeByName("ObjectType").Value == IacObjectType.acObjectTypePicture)
    {
        if ((IacImageCompressionConstants)pdfObj.AttributeByName("Compression").Value == IacImageCompressionConstants.acCompressionJPegMedium)
            pdfObj.AttributeByName("Compression").Value = IacImageCompressionConstants.acCompressionJPegLow;
        else if ((IacImageCompressionConstants)pdfObj.AttributeByName("Compression").Value == IacImageCompressionConstants.acCompressionJPegHigh)
            pdfObj.AttributeByName("Compression").Value = IacImageCompressionConstants.acCompressionJPegMedium;
        // (...)
    }
}
usual disclaimer applies
You might want to try the Docotic.Pdf library for your task.
Here is code that scales all images with width or height greater than or equal to 256. Scaled images are then encoded using JPEG compression with quality set to 65.
public static void RecompressToJpeg(string path, string outputPath)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        foreach (PdfImage image in doc.Images)
        {
            // an image that is used as a mask, or an image with an attached mask,
            // is not a good candidate for recompression
            if (!image.IsMask && image.Mask == null && (image.Width >= 256 || image.Height >= 256))
                image.Scale(0.5, PdfImageCompression.Jpeg, 65);
        }

        doc.Save(outputPath);
    }
}
You could also just recompress images without changing their size using one of the RecompressWithJpeg methods (or one of the other RecompressXXX methods).
And images can be resized to a specified width and height using one of the ResizeTo methods. Please note that you will need to take the aspect ratio into account in the latter case, as in the sketch below.
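For example, an aspect-preserving resize might be sketched like this (the exact ResizeTo overload and the 1024 px target are assumptions of mine, so check the library docs):

// scale to a fixed width, computing the height from the original aspect ratio
int targetWidth = 1024; // example target, an assumption
if (image.Width > targetWidth)
{
    int targetHeight = (int)Math.Round(image.Height * (double)targetWidth / image.Width);
    image.ResizeTo(targetWidth, targetHeight, PdfImageCompression.Jpeg, 65);
}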
Disclaimer: I work for the vendor of the library.
Problem
There are PDF documents with different types of objects inside. There are simple texts. There can be scanned images that are black and white, and also other images that are true color. The resolution can be quite high for both (~1789×2711).
I need to convert the PDF into a set of single-page TIFF files. There are quite good tools for that, for example IrfanView and ImageMagick. The problem is that I have to define a single compression type for all the pages.
Using JPEG for all pages would result in losing details of the B&W images, and they would be huge compared to lossless fax compression.
Using lossless fax for all pages would lose the colors and details of the true-color images.
Idea
It would be nice to examine the PDF page by page. I could check the content of the page: what kinds of images are inside, and which compression is recommended for that particular page. I think this can be done with iText, but I don't know exactly how. A second thing is that I want to do this analysis without fully reading the PDF file. Is that possible?
Maybe the fastest solution would be to create a list of pages for each compression type via iText analysis and then call IrfanView to process the chosen pages with the proper compression.
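For the iText part, such a per-page inspection might look roughly like this, sketched with iTextSharp, the C# port (the classification rule, 1 bit per component means fax and everything else means JPEG, is a simplistic assumption of mine):

using (PdfReader reader = new PdfReader(path))
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        PdfDictionary resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
        PdfDictionary xobjects = resources == null ? null : resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects == null)
            continue;

        foreach (PdfName name in xobjects.Keys)
        {
            // image XObjects are streams; skip forms and non-image XObjects
            PdfStream xobject = xobjects.GetAsStream(name);
            if (xobject == null || !PdfName.IMAGE.Equals(xobject.GetAsName(PdfName.SUBTYPE)))
                continue;

            PdfNumber bpc = xobject.GetAsNumber(PdfName.BITSPERCOMPONENT);
            bool bilevel = bpc != null && bpc.IntValue == 1;
            Console.WriteLine("Page {0}: image {1}: {2}", page, name,
                bilevel ? "B&W, fax compression" : "continuous tone, JPEG");
        }
    }
}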
Any ideas and recommendations are welcome.
UPDATE:
I now have an answer. It does not cover all requirements, and it's not freeware. Any open-source ideas? Maybe Java-based solutions?
This can be done with DotImage DotPdf from Atalasoft (cue the obligatory "I work there and work on these products"). Here is how I would do this task in C#:
PdfImageSource source = new PdfImageSource(pdfStream);
while (source.HasMoreImages()) {
    AtalaImage image = source.AcquireNext();
    string fileName = GetNextTiffName();

    using (FileStream outStm = new FileStream(fileName, FileMode.Create)) {
        TiffEncoder encoder = new TiffEncoder();
        encoder.Compression = SelectCompression(image.PixelFormat);
        image.Save(outStm, encoder, null);
    }
    source.Release(image);
}
private TiffCompression SelectCompression(PixelFormat pf)
{
    switch (pf) {
        // 1 bit? use CCITT G4
        case PixelFormat.Pixel1bppIndexed: return TiffCompression.Group4FaxEncoding;
        // 24 bit? use JPEG
        case PixelFormat.Pixel24bppBgr: return TiffCompression.JpegCompression;
        // all else, LZW
        default: return TiffCompression.Lzw;
    }
}
You can make SelectCompression do pretty much whatever you want. If you select an invalid compression for that pixel format, the encoder will use an appropriate lossless one in its place (for example, if you select CCITT for 24bit color, the encoder will instead use Lzw).
Our PDF decoder knows when a PDF page is just gray and returns a gray image. It does NOT do anything to get you to 1 bit (this is so antialiased text looks good); however, you could threshold the gray image and look at the overall difference between the result and the original gray image to determine if the page could go to 1 bit.
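A library-agnostic sketch of that threshold-and-compare idea (the helper name and the tolerance are illustrative, not Atalasoft API):

// decide whether a grayscale page can be binarized without losing much detail
static bool CanGoBilevel(byte[] grayPixels, double tolerance = 0.01)
{
    long error = 0;
    foreach (byte g in grayPixels)
    {
        byte bilevel = g < 128 ? (byte)0 : (byte)255; // hard threshold at mid-gray
        error += Math.Abs(g - bilevel);
    }
    // average per-pixel deviation, normalized to the 0..1 range
    double average = (double)error / grayPixels.Length / 255.0;
    return average < tolerance;
}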
Here's how you could do a set of pages:
public void ExtractNPages(Stream pdfStream, params int[] pageIndexes)
{
    PdfImageSource source = new PdfImageSource(pdfStream);
    foreach (int i in pageIndexes) {
        AtalaImage image = source[i]; // implied Acquire
        string fileName = GetNextTiffName();

        using (FileStream outStm = new FileStream(fileName, FileMode.Create)) {
            TiffEncoder encoder = new TiffEncoder();
            encoder.Compression = SelectCompression(image.PixelFormat);
            image.Save(outStm, encoder, null);
        }
        source.Release(image);
    }
}
so now you can just do ExtractNPages(stm, 0, 2, 4, 6);