I use this method to copy pages from an original PDF by page number, scale them, and add them to a generated PDF that contains only the selected, scaled pages.
private static void addScaledPage(PdfDocument pdf, PdfDocument srcDoc, String pageNumber) throws IOException {
    PdfPage page = pdf.addNewPage(PageSize.A4);
    PdfCanvas canvas = new PdfCanvas(page);
    // Scale everything drawn on this canvas to 86%
    AffineTransform transformationMatrix = AffineTransform.getScaleInstance(0.86, 0.86);
    canvas.concatMatrix(transformationMatrix);
    // Copy the source page into the target document as a form XObject
    PdfFormXObject pageCopy = srcDoc.getPage(Integer.parseInt(pageNumber)).copyAsFormXObject(pdf);
    canvas.addXObject(pageCopy, 50, 30);
}
This code works fine, but there is a small issue: when I take 3 pages from an original PDF that has 140 pages and is approx. 10 MB in size, the generated PDF with the 3 selected pages is also approx. 10 MB.
Also, whether I copy 3 pages or 10 pages from the original document, the generated PDF always has the same size, so it seems that references are copied from the source PDF.
I would appreciate some advice: did I do something wrong in the implementation? Or is there another approach?
Kindest regards,
It depends a lot on the resources embedded in the document. If a large image that uses CMYK color, or a font with CJK glyphs (either of these resources could easily be several MB in size), is used on the pages you are copying, that entire resource will be copied into the PDF you're creating. The fact that you are only copying three out of 140 pages wouldn't make much difference: the bulk of the file size will be taken up by the resource, and the pages won't display properly without it.
A solution would be a workflow that optimizes your document during or after copying the pages. This could convert images to an equivalent, smaller color space, or subset the font so that you only carry the required glyphs. Both of these techniques can substantially reduce the size of the file (but this is all dependent on how the source file itself is constructed, of course).
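To see why the number of copied pages hardly matters in such a case, here is a toy model in plain Java. It is only a sketch: the resource names and sizes are hypothetical, and it assumes that a copy carries each referenced resource exactly once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: a page references named resources; a copy carries each referenced resource once.
class SharedResourceModel {

    /** Sums the sizes of all distinct resources referenced by the selected pages. */
    static long copiedSize(Map<Integer, List<String>> resourcesByPage,
                           Map<String, Long> resourceSizes,
                           List<Integer> selectedPages) {
        Set<String> needed = new HashSet<>();
        for (int page : selectedPages) {
            needed.addAll(resourcesByPage.get(page));
        }
        long total = 0;
        for (String name : needed) {
            total += resourceSizes.get(name);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: one 9 MB font shared by every page, 5 KB of content per page.
        Map<String, Long> sizes = Map.of(
                "cjkFont", 9_000_000L,
                "content1", 5_000L, "content2", 5_000L, "content3", 5_000L);
        Map<Integer, List<String>> pages = Map.of(
                1, List.of("cjkFont", "content1"),
                2, List.of("cjkFont", "content2"),
                3, List.of("cjkFont", "content3"));

        System.out.println("1 page:  " + copiedSize(pages, sizes, List.of(1)));
        System.out.println("3 pages: " + copiedSize(pages, sizes, List.of(1, 2, 3)));
    }
}
```

In this model one page and three pages differ by only 10,000 bytes, because the shared font dominates both results.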
Given a PDF (with multiple pages), a page (stream) and an index, how can I replace the page at the target index with my stream and save the PDF?
public Stream ReplacePDFPageAtIndex(Stream pdfStream, int index, Stream replacePageStream)
{
    List<Stream> pdfPages = //fetch pdf pages
    pdfPages.RemoveAt(index);
    pdfPages.Insert(index, replacePageStream);
    Stream newPdf = //create new pdf using the edited list
    return newPdf;
}
I was wondering if there is an open source solution, or whether I can easily roll my own, given that I just need to split the PDF into pages and replace one of them.
I have also checked iTextSharp, but I do not see the functionality that I require.
I have an existing source PDF document. I copy selected pages from it and generate a destination PDF with the selected pages. Every page in the source document was scanned at a different resolution, so the pages vary in size:
generated document with 4 pages => 175 kb
generated document with 4 pages => 923 kb (I suppose this is because of higher scan resolution of each page in source document)
What would be the best practice for compressing these pages?
Is there a code sample for compressing / reducing the size of a final PDF that consists of pages copied from a source document with different resolutions?
Kindest regards
If you are just adding scans to a pdf document, it makes sense for the size of the resulting document to go up if you're using a high resolution image.
Keep in mind that iText is a pdf library. Not an image-manipulation library.
You could of course use regular old java to attempt to compress the images.
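import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

public static void writeJPG(BufferedImage bufferedImage, OutputStream outputStream, float quality) throws IOException {
    Iterator<ImageWriter> iterator = ImageIO.getImageWritersByFormatName("jpg");
    ImageWriter imageWriter = iterator.next();
    // quality ranges from 0.0f (smallest file) to 1.0f (best quality)
    ImageWriteParam imageWriteParam = imageWriter.getDefaultWriteParam();
    imageWriteParam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
    imageWriteParam.setCompressionQuality(quality);
    ImageOutputStream imageOutputStream = new MemoryCacheImageOutputStream(outputStream);
    imageWriter.setOutput(imageOutputStream);
    IIOImage iioimage = new IIOImage(bufferedImage, null, null);
    imageWriter.write(null, iioimage, imageWriteParam);
    imageOutputStream.flush();
    imageWriter.dispose();
}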
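As a rough, self-contained check of what the quality parameter buys you, the sketch below encodes the same image at two quality settings and compares the byte counts. The noisy test image is only a stand-in for a scanned page, and the class and method names are made up for this example; real scans will behave differently.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

class JpegQualityDemo {

    /** Encodes the image as JPEG at the given quality (0.0f to 1.0f) and returns the bytes. */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the image output stream flushes any cached data into 'bytes'.
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** Builds a gray noise image in RGB; a hypothetical stand-in for a scanned page. */
    static BufferedImage syntheticScan(int width, int height, long seed) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Random random = new Random(seed);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int gray = random.nextInt(256);
                image.setRGB(x, y, (gray << 16) | (gray << 8) | gray);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        BufferedImage scan = syntheticScan(400, 400, 42L);
        System.out.println("quality 0.9: " + encodeJpeg(scan, 0.9f).length + " bytes");
        System.out.println("quality 0.2: " + encodeJpeg(scan, 0.2f).length + " bytes");
    }
}
```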
But really, putting scanned images into a pdf makes life so much more difficult. Imagine the people who have to handle that document after you. They open it, see text, try to select it, and nothing happens.
Additionally, you might change the WriterProperties when creating your PdfWriter instance:
PdfWriter writer = new PdfWriter(dest,
new WriterProperties().setFullCompressionMode(true));
Full compression mode will compress certain objects into an object stream, and it will also compress the cross-reference table of the PDF. Since most of the objects in your document will be images (which are already compressed), compressing objects won't have much effect, but if you have a large number of pages, compressing the cross-reference table may result in smaller PDF files.
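The intuition that already compressed data won't shrink further, while repetitive structural data will, can be checked with java.util.zip alone. This is only an analogy, not how iText implements full compression: the random bytes stand in for an image stream, and the repetitive text stands in for table-like bookkeeping data. Both inputs are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Random;
import java.util.zip.Deflater;

class DeflateEffectDemo {

    /** Returns the number of bytes the input occupies after DEFLATE compression. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for an already compressed image stream: high-entropy random bytes.
        byte[] imageLike = new byte[100_000];
        new Random(1).nextBytes(imageLike);

        // Stand-in for table-like bookkeeping data: many similar fixed-width text entries.
        StringBuilder table = new StringBuilder();
        for (int i = 0; i < 5_000; i++) {
            table.append(String.format(Locale.ROOT, "%010d 00000 n \n", i * 137));
        }
        byte[] tableLike = table.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println("image-like: " + imageLike.length + " -> " + deflatedSize(imageLike));
        System.out.println("table-like: " + tableLike.length + " -> " + deflatedSize(tableLike));
    }
}
```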
We are currently trying to merge multiple PDFs and create a PDF/A (1B) out of it.
Currently we face a problem when we want to fix the color profiles. The PDF we receive has no embedded color profiles, so during the merge functionality of PDFBox, no OutputIntents are merged. So in the last step we try to add the color profiles.
If we do not add any color profile, we get validation issues for RGB and CMYK. If we add both color profiles to the PDDocumentCatalog, then only the validation issues for the first one are gone. So if we add RGB first, we only get CMYK validation issues and vice versa.
Here is a part of the code when we add the color profiles:
public void convertToPDFA(PDDocument doc, String file){
    PDMetadata metadata = new PDMetadata(doc);
    PDDocumentCatalog cat = doc.getDocumentCatalog();
    cat.setMetadata(metadata);
    // do metadata stuff, just removed it for now
    InputStream colorProfile = PDFService.class.getResourceAsStream("/pdfa/sRGB Color Space Profile.icm");
    PDOutputIntent oi = new PDOutputIntent(doc, colorProfile);
    oi.setInfo("sRGB IEC61966-2.1");
    oi.setOutputCondition("sRGB IEC61966-2.1");
    oi.setOutputConditionIdentifier("sRGB IEC61966-2.1");
    oi.setRegistryName("http://www.color.org");
    cat.addOutputIntent(oi);
}
This is the code for RGB, we also add another *.icm color profile for CMYK.
So the color profiles seem to be fine, because dependent on the one we add first, the validation issues are gone.
It feels like we are just missing a small thing to get both color profiles accepted. Or could it be that only one color profile can be used when creating a PDF/A?
Thanks in advance and kind regards
Only a single output intent is allowed, see here. An alternative is also mentioned there, which would be to use only ICC based colorspaces.
What should be possible (although beyond the scope of the question) would be to assign ICC profiles to /DeviceGray, /DeviceRGB, or /DeviceCMYK by adding DefaultGray, DefaultRGB, or DefaultCMYK entries to the ColorSpace subdictionary of the resource dictionary, as explained in section 8.6.5.6 of the PDF specification:
When a device colour space is selected, the ColorSpace subdictionary
of the current resource dictionary (see 7.8.3, "Resource
Dictionaries") is checked for the presence of an entry designating a
corresponding default colour space (DefaultGray, DefaultRGB, or
DefaultCMYK, corresponding to DeviceGray, DeviceRGB, or DeviceCMYK,
respectively). If such an entry is present, its value shall be used as
the colour space for the operation currently being performed.
Be aware that making a PDF file PDF/A-1b conformant is often trickier than just adding output intents. Check your file with PDFBox preflight or with the online validator from PDF Tools; there are many possible errors, which is why there are products from Callas Software or PDF Tools that convert PDF files to PDF/A.
I have searched in many places but have been unable to find a good solution.
So what I am trying to achieve is as below:
My program will have quite a lot of PDF docs which I will have to send via mail. There is a mail server limitation of 4 MB. So if all the PDFs are less than 4 MB it will be sent as a single mail. Else I will have to create multiple files each less than 4 MB.
Now my program works fine for the following cases:
1: Lots of files, each less than 4 MB: I keep track of the size during merging so that none of the merged files exceeds 4 MB.
2: All files are pretty small, so merging them together does not reach the 4 MB limit.
But there can be a scenario where there is one file which is, say, 14 MB. I can split that document by pages, but that is not a good solution either, as the size is not evenly distributed across the pages. I have used iText and PDFBox. Any help/pointer will be highly appreciated!
Imagine a 3000 KB document with ten pages and the following objects:
four font subsets used on every page, each about 50 KB
ten images that figure on a single page, each about 200 KB (one image per page)
four images that figure on every page, each about 50 KB
ten pages with content streams of about 25 KB each
about 350 KB for objects such as the catalog, the info dictionary, the page tree, the cross-reference table, etc...
A single page will need at least:
- the four font subsets: 4 times 50 KB
- the single image: 1 time 200 KB
- the four images: 4 times 50 KB
- a single content stream: 1 time 25 KB
- a slightly reduced cross-reference table, a slightly reduced page tree, an almost identical catalog, an info dictionary of identical size,... 200 KB
Together that's 825 KB. This means that you end up with 8250 KB (10 times 825 KB) if you split up a 10-page 3000 KB PDF document into 10 separate pages.
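The arithmetic can be checked with a few lines of Java. This sketch takes the object sizes from the list at the top of this answer at face value (including 25 KB per content stream and the 200 KB guess for the per-file overhead of a single-page file). The class and method names are made up for illustration.

```java
// Back-of-the-envelope sizes in KB, taken from the example object list in this answer.
class SplitSizeEstimate {
    static final int PAGES = 10;
    static final int FONT_SUBSETS = 4, FONT_SUBSET_KB = 50;      // used on every page
    static final int PAGE_IMAGE_KB = 200;                        // one unique image per page
    static final int SHARED_IMAGES = 4, SHARED_IMAGE_KB = 50;    // used on every page
    static final int CONTENT_STREAM_KB = 25;                     // one per page
    static final int WHOLE_DOC_OVERHEAD_KB = 350;                // catalog, page tree, xref, ...
    static final int SINGLE_PAGE_OVERHEAD_KB = 200;              // guessed overhead of a one-page file

    /** Size of the original document: shared objects are stored once. */
    static int wholeDocumentKb() {
        return FONT_SUBSETS * FONT_SUBSET_KB
                + PAGES * PAGE_IMAGE_KB
                + SHARED_IMAGES * SHARED_IMAGE_KB
                + PAGES * CONTENT_STREAM_KB
                + WHOLE_DOC_OVERHEAD_KB;
    }

    /** Size of one extracted page: every shared object has to come along. */
    static int singlePageKb() {
        return FONT_SUBSETS * FONT_SUBSET_KB
                + PAGE_IMAGE_KB
                + SHARED_IMAGES * SHARED_IMAGE_KB
                + CONTENT_STREAM_KB
                + SINGLE_PAGE_OVERHEAD_KB;
    }

    public static void main(String[] args) {
        System.out.println("whole document: " + wholeDocumentKb() + " KB");
        System.out.println("one page:       " + singlePageKb() + " KB");
        System.out.println("ten one-page files: " + PAGES * singlePageKb() + " KB");
    }
}
```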
This example is the result of guess work (based on experience) and it assumes that the PDF is predictable. Most PDFs aren't:
some pages will require high-definition images (maybe even megabytes), other pages won't have any images;
some pages will need many different fonts and font subsets (lots of kilobytes), other pages will consist of merely some vector drawings (tiny content stream if compressed);
different pages can share a large amount of resources (Form XObjects, Image XObjects,...), other pages won't share any resources;
and so on...
You have noticed that yourself, as you write: I can split that document by pages. But that is also not a good solution as the pagesize is also not evenly distributed across the pages.
That's exactly why your question can have no other answer than: you'll have to do trial and error. No software can predict how much space is needed by a page before you look at what is needed by that page.
Update:
As David indicates in the comments, it is possible to calculate all the resources needed for a page, and to check if the current resources plus the needed resources exceed the maximum file size.
I have written a small example:
public void manipulatePdf(String src, String dest)
        throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
    document.open();
    PdfReader reader = new PdfReader(src);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // check resources needed for reader.getPageN(i);
        copy.addPage(copy.getImportedPage(reader, i));
        System.out.println("After adding page: " + copy.getOs().getCounter());
    }
    document.close();
    System.out.println("After closing document: " + copy.getOs().getCounter());
    reader.close();
}
I have executed the example on a PDF sample with 18 pages and this was the output:
After adding page: 56165
After adding page: 111398
After adding page: 162691
After adding page: 210035
After adding page: 253419
After adding page: 273429
After adding page: 330696
After adding page: 351564
After adding page: 400351
After adding page: 456545
After adding page: 495321
After adding page: 523640
After adding page: 576468
After adding page: 633525
After adding page: 751504
After adding page: 907490
After adding page: 957164
After adding page: 999140
After closing document: 1002509
You see how the file size of the copy gradually grows with each page that is added. After all pages are added, the size is 999140 bytes, and then the page tree and cross-reference stream are written, adding another 3369 bytes.
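To see how unevenly the individual pages contribute, you can subtract consecutive counter values. The snippet below hard-codes the 18 values printed above and doesn't touch iText at all; the class name is just for illustration.

```java
// Derives the bytes added by each page from the cumulative counter values printed above.
class PageGrowth {

    /** Turns cumulative byte counts into per-page increments. */
    static long[] deltas(long[] cumulative) {
        long[] result = new long[cumulative.length];
        long previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            result[i] = cumulative[i] - previous;
            previous = cumulative[i];
        }
        return result;
    }

    public static void main(String[] args) {
        long[] counters = {
            56165, 111398, 162691, 210035, 253419, 273429,
            330696, 351564, 400351, 456545, 495321, 523640,
            576468, 633525, 751504, 907490, 957164, 999140
        };
        long[] perPage = deltas(counters);
        for (int i = 0; i < perPage.length; i++) {
            System.out.println("Page " + (i + 1) + " added " + perPage[i] + " bytes");
        }
    }
}
```

For this sample the cheapest page (page 6) adds 20010 bytes and the most expensive one (page 16) adds 155986 bytes, so the pages are far from uniform.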
Where it says // check resources needed for reader.getPageN(i);, you could make a guesstimate of the size that will be added for the page and break out of the loop if it exceeds a maximum value.
Why would this be a guesstimate:
You could be counting objects that are already added. If you keep track of the objects (not that difficult), your guess will be more accurate.
I'm using PdfSmartCopy. Suppose that there are two identical objects inside your PDF. Bad PDF software often causes such problems. For instance: the same image bytes are added twice to the file. PdfSmartCopy can detect this and will reuse the first object it encounters instead of adding the redundant bytes of the extra object.
We currently don't have a reader.getTotalPageBytes() in PdfReader because PdfReader tries to use as little memory as possible. It won't load any objects into memory as long as these objects aren't needed. Hence it doesn't know the size of each object before the page is imported.
However, I'll make sure that such a method is added in the next release.
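A minimal sketch of that break-out logic, assuming you already have some per-page size estimate. The long[] below is hypothetical input, not something iText hands you, and the class is made up for this example. It simply treats the estimates as additive, so it inherits all the guesstimate caveats listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Greedy planner: keeps adding pages to the current part until the next page would exceed the limit.
class GreedySplitPlanner {

    /** Returns, for each part, the 1-based page numbers it contains. */
    static List<List<Integer>> plan(long[] estimatedPageBytes, long maxBytes) {
        List<List<Integer>> parts = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < estimatedPageBytes.length; i++) {
            long cost = estimatedPageBytes[i];
            // Start a new part when this page no longer fits; a page that is too
            // large on its own still gets a part to itself.
            if (!current.isEmpty() && currentBytes + cost > maxBytes) {
                parts.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i + 1);
            currentBytes += cost;
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        long[] estimates = {100, 100, 100, 500, 50}; // hypothetical per-page estimates in KB
        System.out.println(plan(estimates, 250));
    }
}
```

With these made-up estimates and a limit of 250, page 4 ends up alone in a part that is still over the limit, which is the case you would have to handle separately.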
Update:
In the next version, you'll find a tool named SmartPdfSplitter that depends on a new class named PdfResourceCounter. You can use it like this:
PdfReader reader = new PdfReader(src);
SmartPdfSplitter splitter = new SmartPdfSplitter(reader);
int part = 1;
while (splitter.hasMorePages()) {
    splitter.split(new FileOutputStream("results/merge/part_" + part + ".pdf"), 200000);
    part++;
}
reader.close();
Note that this can result in a single-page PDF that exceeds the limit (which was set to 200000 bytes in the code sample) if that single page cannot be reduced to fewer bytes. In that case, splitter.isOverSized() will return true and you'll have to find another way to reduce the PDF.
PDF Clown supports page data size prediction without need of trial and error: since 2010 it has been featuring a dedicated method (org.pdfclown.tools.PageManager.getSize(Page)) that calculates in memory the actual page data size without the need to write it to a file for trial.
Furthermore, there's another method (org.pdfclown.tools.PageManager.split(long maxDataSize)) purposely implemented to address your kind of scenario which leverages the above-mentioned PageManager.getSize method: it automatically splits a file based on a size limit without creating any intermediate, ugly, stupid, temporary file for trial and error.
You can see a practical example of its use in the org.pdfclown.samples.cli.PageManagementSample (PageDataSizeCalculation and DocumentSplitOnMaximumFileSize cases) included in the downloadable distribution -- here is an example of console output from the PageDataSizeCalculation case:
Page 1: 29380 (full); 29380 (differential); 29380 (incremental)
Page 2: 30493 (full); 1501 (differential); 30881 (incremental)
Page 3: 21888 (full); 1432 (differential); 32313 (incremental)
Page 4: 33781 (full); 4789 (differential); 37102 (incremental)
. . .
where:
full is the page data size encompassing all its dependencies (like shared resources) -- this is the size of the page when extracted as a single-page document;
differential is the additional page data size -- this is the extra content that's not shared with previous pages;
incremental is the data size of the page sublist encompassing all the previous pages and the current one.
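As a quick sanity check on these definitions, the incremental column in the output above should be the running sum of the differential column. The snippet below hard-codes the four differential values shown above and does not use PDF Clown; the class name is made up.

```java
// Checks that "incremental" is the running sum of "differential" for the sample output above.
class IncrementalSizes {

    /** Running sum: element i is the sum of differential[0..i]. */
    static long[] runningSum(long[] differential) {
        long[] incremental = new long[differential.length];
        long sum = 0;
        for (int i = 0; i < differential.length; i++) {
            sum += differential[i];
            incremental[i] = sum;
        }
        return incremental;
    }

    public static void main(String[] args) {
        long[] differential = {29380, 1501, 1432, 4789};
        long[] incremental = runningSum(differential);
        for (int i = 0; i < incremental.length; i++) {
            // prints 29380, 30881, 32313, 37102, matching the incremental column
            System.out.println("Page " + (i + 1) + ": " + incremental[i] + " (incremental)");
        }
    }
}
```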
I create a PDF document with EVO PDF library from a HTML page using the code below:
HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();
byte[] outPdfBuffer = htmlToPdfConverter.ConvertUrl(url);
Response.AddHeader("Content-Type", "application/pdf");
Response.AddHeader("Content-Disposition", String.Format("attachment; filename=Merge_HTML_with_Existing_PDF.pdf; size={0}", outPdfBuffer.Length.ToString()));
Response.BinaryWrite(outPdfBuffer);
Response.End();
This produces a PDF document but I have another PDF document that I would like to use as cover page in the final PDF document.
One possibility I was thinking about was to create the PDF document and then merge my cover page PDF with the PDF produced by the converter, but this looks like an inefficient solution. Saving the PDF and loading it back for the merge seems to introduce unnecessary overhead. I would like to merge the cover page while the PDF document produced by the converter is still in memory.
The following line added in your code right after you create the HTML to PDF converter object should do the trick:
// Set the PDF file to be inserted before conversion result
htmlToPdfConverter.PdfDocumentOptions.AddStartDocument("CoverPage.pdf");