Page by page conversion of PDF into TIFF with proper compression - optimization

Problem
The PDF documents contain different types of objects: plain text, scanned images that are B&W, and other images that are true color. The resolution can be quite high for both kinds of image (~1789x2711).
I need to convert each PDF into a set of single-page TIFF files. There are quite good tools for that, for example IrfanView and ImageMagick. The problem is that I have to define a single compression type for all the pages.
Using JPG for all pages would lose detail in the B&W images, and the files would be huge compared to lossless fax compression.
Using lossless fax compression for all pages would wipe out the colors and details of the true color images.
Idea
It would be nice to examine the PDF page by page: check what kind of images each page contains and which compression is recommended for that page. I think this can be done with iText, but I don't know exactly how. A second point is that I want to do this analysis without fully reading the PDF file. Is that possible?
Maybe the fastest solution would be to use iText analysis to create a list of pages for each compression type, and then call IrfanView to process the chosen pages with the proper compression.
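A rough sketch of the "list of pages per compression type" step, in plain Java. Everything here is an assumption for illustration: the `Compression` names are made up, the per-page bit depths are taken as already known (extracting them from the PDF is the part still open), and the 1-bit / 24-bit rule is just one plausible choice.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class PageGrouping {
    // Hypothetical compression choices; the names are illustrative only.
    enum Compression { CCITT_G4, JPEG, LZW }

    // Assumed rule: 1-bit pages go to fax, 24-bit pages to JPEG, the rest to LZW.
    static Compression choose(int bitsPerPixel) {
        switch (bitsPerPixel) {
            case 1:  return Compression.CCITT_G4;
            case 24: return Compression.JPEG;
            default: return Compression.LZW;
        }
    }

    // bitsPerPage[i] is the deepest image depth found on page i + 1 (pages are 1-based).
    static Map<Compression, List<Integer>> group(int[] bitsPerPage) {
        Map<Compression, List<Integer>> groups = new EnumMap<>(Compression.class);
        for (int i = 0; i < bitsPerPage.length; i++) {
            groups.computeIfAbsent(choose(bitsPerPage[i]), k -> new ArrayList<>()).add(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Pages 1 and 3 are bilevel scans, page 2 is true color, page 4 is 8-bit gray.
        System.out.println(group(new int[] {1, 24, 1, 8}));
    }
}
```

Each resulting list could then be handed to the external converter in one call per compression type.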
Any ideas and recommendations are welcome.
UPDATE:
I now have an answer. It does not cover all requirements, and it's not freeware. Any open source ideas? Maybe Java-based solutions?

This can be done with DotImage DotPdf from Atalasoft (cue the obligatory "I work there and work on these products"). Here is how I would do this task in C#:
PdfImageSource source = new PdfImageSource(pdfStream);
while (source.HasMoreImages()) {
    AtalaImage image = source.AcquireNext();
    string fileName = GetNextTiffName();
    using (FileStream outStm = new FileStream(fileName, FileMode.Create)) {
        TiffEncoder encoder = new TiffEncoder();
        encoder.Compression = SelectCompression(image.PixelFormat);
        image.Save(outStm, encoder, null);
    }
    source.Release(image);
}

private TiffCompression SelectCompression(PixelFormat pf)
{
    switch (pf) {
        // 1 bit? use CCITT G4
        case PixelFormat.Pixel1bppIndexed: return TiffCompression.Group4FaxEncoding;
        // 24 bit? use JPEG
        case PixelFormat.Pixel24bppBgr: return TiffCompression.JpegCompression;
        // all else, Lzw
        default: return TiffCompression.Lzw;
    }
}
You can make SelectCompression do pretty much whatever you want. If you select an invalid compression for that pixel format, the encoder will use an appropriate lossless one in its place (for example, if you select CCITT for 24bit color, the encoder will instead use Lzw).
Our PDF decoder knows when a PDF page is just gray and returns a gray image. It does NOT do anything to get you to 1 bit (this is so antialiased text looks good). However, you could threshold the gray image and look at the overall differences between the thresholded result and the original gray image to determine whether it could go to 1 bit.
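The threshold-and-compare idea is independent of any imaging library, so here is a minimal sketch of it in plain Java. It assumes the gray page is available as an array of 0..255 samples; the method name, the fixed threshold and the mean-error tolerance are all illustrative choices, not part of DotImage.

```java
class BilevelCheck {
    // Returns true when snapping every pixel to black or white changes the
    // image by at most maxMeanError gray levels per pixel on average.
    static boolean canGoOneBit(int[] gray, int threshold, double maxMeanError) {
        if (gray.length == 0) {
            return false; // nothing to judge
        }
        long totalError = 0;
        for (int g : gray) {
            int bilevel = g >= threshold ? 255 : 0; // thresholded value
            totalError += Math.abs(g - bilevel);    // distance from the original
        }
        return (double) totalError / gray.length <= maxMeanError;
    }

    public static void main(String[] args) {
        int[] scannedText = {0, 255, 3, 250};    // nearly pure black and white
        int[] photo = {100, 128, 150, 90};       // mostly mid-tones
        System.out.println(canGoOneBit(scannedText, 128, 8.0)); // true
        System.out.println(canGoOneBit(photo, 128, 8.0));       // false
    }
}
```

A page with a lot of antialiased text will sit somewhere in between, so the tolerance would need tuning against real documents.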
Here's how you could do a set of pages:
public void ExtractNPages(Stream pdfStream, params int[] pageIndexes)
{
    PdfImageSource source = new PdfImageSource(pdfStream);
    foreach (int i in pageIndexes) {
        AtalaImage image = source[i]; // implied Acquire
        string fileName = GetNextTiffName();
        using (FileStream outStm = new FileStream(fileName, FileMode.Create)) {
            TiffEncoder encoder = new TiffEncoder();
            encoder.Compression = SelectCompression(image.PixelFormat);
            image.Save(outStm, encoder, null);
        }
        source.Release(image);
    }
}
So now you can just call ExtractNPages(stm, 0, 2, 4, 6);

Related

Why is flying saucer always printing PDF on A4 paper?

I'm trying to save an html document to PDF using flyingsaucer but the generated document always ends up having an A4 dimension when I look at the Document Properties from Adobe Reader (Page Size: 8.26 x 11.69 in).
I did read the documentation and I'm passing the CSS @page {size: letter;} style. While it does have an effect on the output, the page size always remains 8.26 x 11.69 in Adobe Reader. For example, if I set the page size to legal, my PDF is still the size of an A4 page but the top of the document is missing, as if it had fallen off the "paper".
I'm not sure if the problem falls on the itext side or the flying saucer side. I was using a fairly old version so my first step was to upgrade to the latest 9.1.6 version of flying saucer. I also moved from itext 2.0.8 to openPDF 1.0.1 but I'm still getting the same behavior.
I also traced in the debugger up to the com.lowagie.text.Document creation in ITextRenderer and at this point the document size passed is correct. That makes me think that the issue might be in openPDF / iText but I can't find what I'm doing wrong.
It turns out the PDF generation was correctly using the @page size declaration and the problem was occurring later in our software. What I had not noticed is that after the generation of the PDF another method was called to merge multiple PDFs into one. This method should probably not have been called, but that's another story.
The bottom line is this method created a new com.lowagie.text.Document(), which by default creates an A4 sized document, and then was iterating over all pages of the pdf, adding the pages to the new document using pdfWriter.getImportedPage(pdfReader, currentPage++). These imported pages did not retain their original size.
I fixed it by passing the page size of the first page when creating the merged document object:
document = new Document(pdfReader.getPageSize(1));
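For anyone wondering where 8.26 x 11.69 comes from: PDF page sizes are expressed in points, 72 to the inch, and A4 is 595 x 842 points. A small sanity check in plain Java (the class and method names are just for illustration):

```java
import java.util.Locale;

class PageSizeInches {
    static final double POINTS_PER_INCH = 72.0;

    static double toInches(double points) {
        return points / POINTS_PER_INCH;
    }

    // Formats a page size given in points the way a viewer would show it in inches.
    static String describe(double widthPt, double heightPt) {
        return String.format(Locale.ROOT, "%.2f x %.2f in", toInches(widthPt), toInches(heightPt));
    }

    public static void main(String[] args) {
        System.out.println("A4:     " + describe(595, 842));  // 8.26 x 11.69 in
        System.out.println("Letter: " + describe(612, 792));  // 8.50 x 11.00 in
        System.out.println("Legal:  " + describe(612, 1008)); // 8.50 x 14.00 in
    }
}
```

So a document that reports 8.26 x 11.69 in is using the A4 default, whatever the imported pages looked like originally.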
The real problem is that you're (unwittingly) using software that is no longer supported. Anything that still has the namespace lowagie (the founder and CTO of iText) is really outdated.
If you simply want to convert HTML to pdf, why not use iText directly and cut out the middle-man?
We have multiple options for you.
XMLWorker (iText5 based code that converts HTML to pdf)
pdfHTML (iText7 based add-on that converts HTML5/CSS3 to pdf)
This is a rather extensive code-sample for using pdfHTML:
public void createPdf(String src, String dest, String resources) throws IOException {
    try {
        FileOutputStream outputStream = new FileOutputStream(dest);
        WriterProperties writerProperties = new WriterProperties();
        //Add metadata
        writerProperties.addXmpMetadata();
        PdfWriter pdfWriter = new PdfWriter(outputStream, writerProperties);
        PdfDocument pdfDoc = new PdfDocument(pdfWriter);
        pdfDoc.getCatalog().setLang(new PdfString("en-US"));
        //Set the document to be tagged
        pdfDoc.setTagged();
        pdfDoc.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
        //Set meta tags
        PdfDocumentInfo pdfMetaData = pdfDoc.getDocumentInfo();
        pdfMetaData.setAuthor("Joris Schellekens");
        pdfMetaData.addCreationDate();
        pdfMetaData.getProducer();
        pdfMetaData.setCreator("iText Software");
        pdfMetaData.setKeywords("example, accessibility");
        pdfMetaData.setSubject("PDF accessibility");
        //Title is derived from html
        // pdf conversion
        ConverterProperties props = new ConverterProperties();
        FontProvider fp = new FontProvider();
        fp.addStandardPdfFonts();
        fp.addDirectory(resources);//The noto-nashk font file (.ttf extension) is placed in the resources
        props.setFontProvider(fp);
        props.setBaseUri(resources);
        //Setup custom tagworker factory for better tagging of headers
        DefaultTagWorkerFactory tagWorkerFactory = new AccessibilityTagWorkerFactory();
        props.setTagWorkerFactory(tagWorkerFactory);
        HtmlConverter.convertToPdf(new FileInputStream(src), pdfDoc, props);
        pdfDoc.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
You can find more information at http://itextpdf.com/itext7/pdfHTML

SSRS ReportViewer - How to improve image quality when exporting to PDF

For a while now I have noticed whenever I export reports from the ReportViewer control (Webforms version) to PDF format, any included images lose quality and appear slightly pixelated.
They look just fine in the ReportViewer however.
From what I have read, the PDF renderer will size any included images at 96 dpi, no matter what dpi the image is originally.
I have done some digging and came across this post here
I have tried this approach in my own code behind by wiring up a button like so
protected void btnExport_Click(object sender, EventArgs e)
{
    string mimeType, encoding, fileNameExtension;
    Warning[] warnings;
    string[] streams;
    var sb = new StringBuilder(1024);
    var xr = XmlWriter.Create(sb);
    xr.WriteStartElement("DeviceInfo");
    xr.WriteElementString("DpiX", "300");
    xr.WriteElementString("DpiY", "300");
    xr.Close();
    byte[] bytes = ReportViewer1.ServerReport.Render("PDF", sb.ToString(), out mimeType, out encoding,
        out fileNameExtension, out streams, out warnings);
    Response.ContentType = "application/pdf";
    Response.AppendHeader("Content-Disposition", "attachment; filename=Test.pdf");
    Response.BinaryWrite(bytes);
}
This actually makes my images appear much smaller in the exported PDF compared to what shows in the Report Viewer control.
My original images are 600x600 at 300dpi, I have tried using these images as they are and on the image properties in the report RDL designer set the sizing property to 'Fit' and sizing the image to 0.25in x 0.25in. Again, all looking great in preview mode in the Report Viewer control but then quality is lost when exporting to PDF.
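To put rough numbers on this, here is a back-of-the-envelope sketch (written in Java purely for illustration). The 96 dpi figure is only what I have read about the PDF renderer, not something I have confirmed, and `pixelsFor` / `inchesFor` are made-up helper names.

```java
class DpiMath {
    // Number of pixels needed to cover a length at a given resolution.
    static long pixelsFor(double inches, int dpi) {
        return Math.round(inches * dpi);
    }

    // Physical length covered by a pixel count at a given resolution.
    static double inchesFor(int pixels, int dpi) {
        return (double) pixels / dpi;
    }

    public static void main(String[] args) {
        System.out.println(inchesFor(600, 300)); // 2.0 : native size of the source image in inches
        System.out.println(pixelsFor(0.25, 300)); // 75 : pixels across a 0.25in box at 300 dpi
        System.out.println(pixelsFor(0.25, 96));  // 24 : pixels across the same box at 96 dpi
    }
}
```

If the exporter really does resample to 96 dpi, a 0.25in image would end up with only about 24 pixels across, which would match the pixelation I am seeing.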
I tried resizing the images to 0.25in x 0.25in in my image editor (paint.net) leaving at 300 dpi, but still no difference in the results.
I'm just going round in circles now, no doubt I am missing something. I hope there is a way and someone can shed some light for me?
Thanks!

PDF lossy compression

I'm looking for a library or command-line program that can compress PDFs.
Compression speed and file size are very important.
The PDFs are full of very large print-quality images.
Adobe Acrobat does high-quality, fast compression but does not allow "reduced size pdfs" to be saved through a programmatic interface.
Ghostscript does high-quality compression but takes way too long (minutes).
If a commercial library is an option, you could give Amyuni PDF Creator a try. There is a .NET version (C#/VB.NET etc.) and an ActiveX version (for C++/Delphi/VB/PHP etc.).
You can iterate through all the objects of each page, pick those that are images, and reduce their size. You have several possibilities there:
Setting a lower compression rate.
Down-sampling (extracting the image, re-sizing it to a lower resolution, and putting it back in your file)
Combining the previous two.
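Independent of any PDF library, the down-sampling and recompression steps themselves can be sketched with nothing but the Java standard library. This is only an illustration of the image-side work (the class and method names are invented here); extracting the image from the PDF and putting it back is left to whatever PDF library you use.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

class Downsample {
    // Scales an image by the given factor (0.5 halves width and height).
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    // Encodes an RGB image as JPEG; quality runs from 0.0 (smallest) to 1.0 (best).
    static byte[] toJpeg(BufferedImage img, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(img, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage big = new BufferedImage(512, 512, BufferedImage.TYPE_INT_RGB);
        BufferedImage small = scale(big, 0.5);
        byte[] jpeg = toJpeg(small, 0.65f);
        System.out.println(small.getWidth() + "x" + small.getHeight() + ", " + jpeg.length + " bytes");
    }
}
```

Halving both dimensions cuts the pixel count to a quarter before the JPEG quality setting even comes into play, which is why combining the two options tends to give the biggest savings.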
Here is what the code would look like for the first option, in C#, using Amyuni PDF Creator .Net:
//open a pdf document
document.Open("c:\\temp\\myfile.pdf","");
IacPage page1 = document.GetPage (1);
Amyuni.PDFCreator.IacAttribute attribute = page1.AttributeByName ("Objects");
// listobj is an array list of graphic objects
System.Collections.ArrayList listobj = (System.Collections.ArrayList) attribute.Value;
// iterate as IacObject (not object) so that AttributeByName is available
foreach ( Amyuni.PDFCreator.IacObject pdfObj in listobj )
{
    if ((IacObjectType)pdfObj.AttributeByName("ObjectType").Value == IacObjectType.acObjectTypePicture)
    {
        if ((IacImageCompressionConstants)pdfObj.AttributeByName("Compression").Value == IacImageCompressionConstants.acCompressionJPegMedium)
            pdfObj.AttributeByName("Compression").Value = IacImageCompressionConstants.acCompressionJPegLow;
        if ((IacImageCompressionConstants)pdfObj.AttributeByName("Compression").Value == IacImageCompressionConstants.acCompressionJPegHigh)
            pdfObj.AttributeByName("Compression").Value = IacImageCompressionConstants.acCompressionJPegMedium;
        // (...)
    }
}
Usual disclaimer applies.
You might want to try Docotic.Pdf library for your task.
Here is code that scales all images whose width or height is greater than or equal to 256. The scaled images are then encoded using JPEG compression with quality set to 65.
public static void RecompressToJpeg(string path, string outputPath)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        foreach (PdfImage image in doc.Images)
        {
            // image that is used as mask or image with attached mask are
            // not good candidates for recompression
            if (!image.IsMask && image.Mask == null && (image.Width >= 256 || image.Height >= 256))
                image.Scale(0.5, PdfImageCompression.Jpeg, 65);
        }
        doc.Save(outputPath);
    }
}
You could also just recompress images without changing their sizes using one of the RecompressWithJpeg methods (or one of other RecompressXXX methods).
And images can be resized to specified width and height using one of the ResizeTo methods. Please note that you will need to take aspect ratio into account in the latter case.
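The aspect-ratio bookkeeping is plain arithmetic, so here is a small library-independent sketch of it in Java. The helper name `fitWithin` is made up for this example, and the choice never to enlarge small images is an assumption, not something the library imposes.

```java
class AspectFit {
    // Returns {width, height} scaled to fit inside maxW x maxH without
    // distortion. Images that already fit are returned unchanged.
    static int[] fitWithin(int w, int h, int maxW, int maxH) {
        double scale = Math.min(1.0, Math.min((double) maxW / w, (double) maxH / h));
        return new int[] {
            Math.max(1, (int) Math.round(w * scale)),
            Math.max(1, (int) Math.round(h * scale))
        };
    }

    public static void main(String[] args) {
        int[] size = fitWithin(600, 300, 256, 256);
        System.out.println(size[0] + " x " + size[1]); // 256 x 128
    }
}
```

The resulting pair is what you would pass as the target width and height, so the image shrinks without being stretched.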
Disclaimer: I work for the vendor of the library.

Some pdf file watermark does not show using iText

Our company uses iText to stamp watermark text (not an image) on some PDF forms. I noticed that about 95% of the forms show the watermark correctly and about 5% do not. To test, I copied two original PDF files, one that had been marked correctly and one that had not, and ran them through a small program, with the same result: one got marked, the other did not. I then tried the latest version of the iText jar (version 5.0.6), with the same outcome. I checked the PDF file properties, security settings and so on, and nothing gives any hint. The result file does change size and is marked "changed by iText version...." after the program runs.
Here is the sample watermark code (using iText jar version 2.1.7). The topText, mainText and bottomText parameters passed in produce three lines of watermark text in the PDF.
Any help appreciated !!
public class WatermarkGenerator {
    private static int TEXT_TILT_ANGLE = 25;
    private static Color MEDIUM_GRAY = new Color(160, 160, 160);
    private static int SUPPORT_FONT_SIZE = 42;
    private static int PRIMARY_FONT_SIZE = 54;

    public static void addWaterMark(InputStream pdfInputStream,
            OutputStream outputStream, String topText,
            String mainText, String bottomText) throws Exception {
        PdfReader reader = new PdfReader(pdfInputStream);
        int numPages = reader.getNumberOfPages();
        // Create a stamper that will copy the document to the output
        // stream.
        PdfStamper stamp = new PdfStamper(reader, outputStream);
        int page=1;
        BaseFont baseFont =
            BaseFont.createFont(BaseFont.HELVETICA_BOLDOBLIQUE,
                BaseFont.WINANSI, BaseFont.EMBEDDED);
        float width;
        float height;
        while (page <= numPages) {
            PdfContentByte cb = stamp.getOverContent(page);
            height = reader.getPageSizeWithRotation(page).getHeight() / 2;
            width = reader.getPageSizeWithRotation(page).getWidth() / 2;
            cb = stamp.getUnderContent(page);
            cb.saveState();
            cb.setColorFill(MEDIUM_GRAY);
            // Top Text
            cb.beginText();
            cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
            cb.showTextAligned(Element.ALIGN_CENTER, topText, width,
                height+PRIMARY_FONT_SIZE+16, TEXT_TILT_ANGLE);
            cb.endText();
            // Primary Text
            cb.beginText();
            cb.setFontAndSize(baseFont, PRIMARY_FONT_SIZE);
            cb.showTextAligned(Element.ALIGN_CENTER, mainText, width,
                height, TEXT_TILT_ANGLE);
            cb.endText();
            // Bottom Text
            cb.beginText();
            cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
            cb.showTextAligned(Element.ALIGN_CENTER, bottomText, width,
                height-PRIMARY_FONT_SIZE-6, TEXT_TILT_ANGLE);
            cb.endText();
            cb.restoreState();
            page++;
        }
        stamp.close();
    }
}
We solved the problem by changing the save option in Adobe LiveCycle: File->Save->properties->Save as, then look at "Save as type". The default is Acrobat 7.0.5 Dynamic PDF Form File; we changed it to 7.0.5 Static PDF Form File (actually any static type will work). Files saved as static forms do not have the disappearing watermark problem. Thanks, Mark, for pointing us in the right direction.
You're using the underContent rather than the overContent. Don't do that. It leaves you at the mercy of big, white-filled rectangles that some folks insist on drawing first thing. It's a holdover from less-than-good PostScript interpreters and hasn't been necessary for Many Years.
Okay, having viewed your PDF, I can see the problem is that this is an XFA-based form (from LiveCycle Designer). Acrobat can (and often does) rebuild the entire file based on the XFA (a type of xml) it contains. That's how your changes are lost. When Acrobat rebuilds the PDF from the XFA, all the existing PDF information is pitched, including your watermark.
The only way to get this to work would be to define the watermark as part of the XFA file contained in the PDF.
Detecting these forms isn't all that hard:
PdfReader reader = new PdfReader(...);
AcroFields acFields = reader.getAcroFields();
XfaForm xfaForm = acFields.getXfaForm();
if (xfaForm != null && xfaForm.isXfaPresent()) {
    // Ohs nose.
    throw new ItsATrapException("We can't repel XML of that magnitude!");
}
Modifying them on the other hand could be Quite Challenging, but here's the specs.
Once you've figured out what needs to be changed, it's a simple matter of XML manipulation... but that "figure it out" part could be interesting.
Good hunting.

Mirrored (Flipped) Printing A PDF File

I am generating ID cards for the students of my college as a PDF file using ASP.NET (Framework 3.5) and Crystal Reports. I want to print the cards on a transparent sheet and paste it onto a plastic card of the same size, so I need everything to be printed mirrored.
I tried designing the Crystal Report in mirrored form itself, but could not find a way to write text mirrored. Can anyone suggest a way to do this? All I want is to flip the contents of the PDF file or the Crystal Report.
A couple of ideas:
1) Render the PDF to an image (e.g. TIFF) using Ghostscript/ImageMagick or a commercial PDF library, and mirror the image for printing
2) Mirror the PDF itself; this might be possible with iTextSharp
3) Use the reporting tool and try some kind of reversed font (could be a quick option)
Any API that lets you import pages and write directly to the PDF content stream will let you do this.
In iText (Java), it'd look something like this:
PdfReader reader = new PdfReader(pdfPath);
Document doc = new Document();
PdfWriter writer = PdfWriter.getInstance( doc, new FileOutputStream(outPath) );
doc.open(); // the document must be open before content is added
for (int pageNum = 1; pageNum <= reader.getNumberOfPages(); ++pageNum) {
    PdfImportedPage page = writer.getImportedPage(reader, pageNum);
    PdfContentByte pageContent = writer.getDirectContent();
    // flip around vertical axis: x' = width - x, y' = y
    pageContent.addTemplate(page, -1f, 0f, 0f, 1f, page.getWidth(), 0f);
    doc.newPage();
}
doc.close();
The above code is making the following ass-u-me-ptions:
The default Document() page size matches the size of the current PdfImportedPage.
The source pages aren't rotated.
There are no annotations, optional content groups (layers), and various other bits that aren't just represented in the page contents.
Some workarounds:
// keep the page size consistent
PdfImportedPage page = writer.getImportedPage(reader, pageNum);
doc.newPage(page.getBoundingBox());
PdfContentByte pageContent = writer.getDirectContent();
pageContent.addTemplate(...);

// to compensate for a page's rotation, you need to either rotate the target page
// (easy in PdfStamper, virtually impossible with `Document` / `PdfWriter`)
// or counter-rotate the imported content, as attempted here
AffineTransform unRotate = AffineTransform.getRotateInstance(Math.toRadians(360 - pageRotation), pageCenterX, pageCenterY);
// mirror around the vertical axis: x' = width - x, y' = y
AffineTransform flip = new AffineTransform(-1f, 0f, 0f, 1f, page.getWidth(), 0f);
AffineTransform finalTrans = flip;
finalTrans.concatenate(unRotate);
pageContent.addTemplate(page, finalTrans);
FAIR WARNING: My 2d matrix-fu isn't all that strong. I'm almost certainly doing something wrong. Debugging these sorts of things is a real PITA. Stuff either "looks right" or is so badly screwed up it's off the page entirely (ergo invisible, so you don't know which way it went). I often change the page rectangles by [-1000 -1000 1000 1000] just so I can see where it all went. Fun stuff.
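One cheap way to catch the "off the page entirely" case before generating any PDF is to push the four page corners through the matrix and see where they land. A minimal sketch using only `java.awt.geom` (the class and method names are invented for this example, and it assumes an unrotated page with its origin at the lower-left corner):

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class FlipCheck {
    // Horizontal mirror: x' = pageWidth - x, y' = y (PDF matrix -1 0 0 1 w 0).
    static AffineTransform mirrorHorizontally(double pageWidth) {
        return new AffineTransform(-1, 0, 0, 1, pageWidth, 0);
    }

    // True when all four page corners still land inside [0,w] x [0,h].
    static boolean staysOnPage(AffineTransform t, double w, double h) {
        double[][] corners = { {0, 0}, {w, 0}, {0, h}, {w, h} };
        for (double[] c : corners) {
            Point2D p = t.transform(new Point2D.Double(c[0], c[1]), null);
            if (p.getX() < 0 || p.getX() > w || p.getY() < 0 || p.getY() > h) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double w = 612, h = 792; // US Letter in points
        // The horizontal mirror keeps every corner on the page.
        System.out.println(staysOnPage(mirrorHorizontally(w), w, h)); // true
        // A vertical flip shifted sideways (1 0 0 -1 w 0) pushes content off the page.
        System.out.println(staysOnPage(new AffineTransform(1, 0, 0, -1, w, 0), w, h)); // false
    }
}
```

The same check works for a concatenated flip-plus-rotation transform, which is where the mistakes are hardest to spot by eye.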
As for copying annotations and such... ouch. PdfCopy does all that for you, via its addPage() method, but that doesn't let you transform the page content first. Any changes you make to the PdfImportedPage are ignored. You're really stuck with The Hard Way... manually copying all the fiddly bits and changing them to compensate for your flipped page... or messing with the source to addPage() to get the results you want. Both require some in-depth knowledge of PDF.
Given the specifics, you probably don't need to worry about it, but it's worth mentioning in case someone with a different situation comes along with the same goal.