The code below works fine on iTextSharp 5.2.1:
var file= new FileInfo(args[0]);
string name = file.Name.Substring(0,file.Name.LastIndexOf("."));
// we create a reader for a certain document
var reader = new PdfReader(args[0]);
// we retrieve the total number of pages
int numberOfPages = reader.NumberOfPages;
Console.WriteLine("There are " + numberOfPages + " pages in the original file.");
Document document;
string filename;
PdfCopy copy;
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++)
{
filename = pageNumber.ToString();
filename = "_" + filename + ".pdf";
// step 1: creation of a document-object
document = new Document(reader.GetPageSizeWithRotation(pageNumber ));
// step 2: we create a writer that listens to the document
copy = new PdfCopy(document, new FileStream(name + filename,FileMode.Create));
// step 3: we open the document
document.Open();
// step 4: we add the imported page
copy.AddPage(copy.GetImportedPage(reader, pageNumber));
// step 5: we close the document
document.Close();
}
But it throws the error below on iTextSharp 5.5.0:
The page 1 was requested but the document has only 0 pages.
It seems that the last line tampers with the reader instance. Could someone help me figure it out? For now I work around this by recreating a PdfReader instance for each page, but that is slow for large PDF files.
I finally nailed it. A change in a recent version of iText caused this issue. Specifically, the problem lies in the WriteAllPages method in the PdfReaderInstance.cs file. In it, the line
file.Close();
is the culprit. The reason is that the Document.Close method calls the PdfCopy.Close method, which in turn calls the PdfReaderInstance.WriteAllPages method, and then all hell breaks loose.
On the surface, it seems a good practice to close the file, but really, it's none of your business, damn you, PdfReaderInstance.
Hope this info helps others.
iTextSharp page numbers are one-based, not zero. These two lines are accounting for this correctly:
pagenumber = i + 1;
document = new Document(reader.GetPageSizeWithRotation(pagenumber));
However this line is still going back to the zero-based index of your loop:
copy.AddPage(copy.GetImportedPage(reader, i));
You could just change that to:
copy.AddPage(copy.GetImportedPage(reader, pagenumber));
Or you could just make your entire loop one-based and not have to think about adding one every once in a while:
for (int pagenumber = 1; pagenumber <= n; pagenumber++)
{
filename = pagenumber.ToString();
while (filename.Length< digits) filename = "0" + filename;
filename = "_" + filename + ".pdf";
// step 1: creation of a document-object
document = new Document(reader.GetPageSizeWithRotation(pagenumber));
// step 2: we create a writer that listens to the document
copy = new PdfCopy(document, new FileStream(name+filename,FileMode.Create));
// step 3: we open the document
document.Open();
copy.AddPage(copy.GetImportedPage(reader, pagenumber));
// step 5: we close the document
document.Close();
}
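To look at the one-based numbering and the zero-padded file names apart from iTextSharp, here is a small Python sketch. The function name page_filenames and the base name "report" are made up for illustration, and it assumes that digits in the loop above is the width of the highest page number.

```python
def page_filenames(base_name, page_count):
    """Return one output file name per page, numbered from 1 and zero-padded."""
    # Assumption: pad to the number of digits in the highest page number.
    digits = len(str(page_count))
    names = []
    # range(1, page_count + 1) yields 1..page_count, matching the
    # one-based page numbers that iTextSharp expects.
    for page_number in range(1, page_count + 1):
        names.append("{}_{}.pdf".format(base_name, str(page_number).zfill(digits)))
    return names


names = page_filenames("report", 12)
print(names[0], names[-1])
```

Because the loop variable is the page number itself, the same value can be used for the file name, the page size lookup and the imported page.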
Related
I have a section of my PDF in which I need to use one font for its unicode symbol and the rest of the paragraph should be a different font. (It is something like "1. a 2. b 3. c" where "1." is the unicode symbol/font and "a" is another font) I have followed the method Bruno describes here: iText 7: How to build a paragraph mixing different fonts? and it works fine to generate the PDF. The issue is that the file size of the PDF goes from around 20MB to around 100MB compared to using only one font and one Text element. This section is used repeatedly in the document thousands of times. I am wondering if there is a way to reduce the impact of switching fonts or to reduce the file size of the entire document in some way.
Style creation pseudocode:
Style style1 = new Style();
Style style2 = new Style();
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile1), PdfEncodings.IDENTITY_H, true);
style1.setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile2), "", false);
style2.setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Writing text/paragraph pseudocode:
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph();
for (int i = 0; i < 25; i++) { // up to 25 entries per paragraph
Text unicodeText = new Text(unicodeSymbol + " ").addStyle(style1);
paragraph.add(unicodeText);
Text plainTextElement = new Text(plainText + " ").addStyle(style2);
paragraph.add(plainTextElement);
}
div.add(paragraph);
This writing of text/paragraph is done thousands of times and makes up most of the document. Basically the document consists of thousands of "buildings" that have corresponding codes and the codes have categories. I need to have the index for the category as the unicode symbol and then all of the corresponding codes within the paragraph for the building.
Here is reproducible code:
float offSet = 50;
Integer leading = 10;
DateFormat format = new SimpleDateFormat("yyyy_MM_dd_kkmmss");
String formattedDate = format.format(new Date());
String path = "/tmp/testing_pdf_"+formattedDate + ".pdf";
File targetPdfFile = new File(path);
PdfWriter writer = new PdfWriter(path, new WriterProperties().addXmpMetadata());
PdfDocument pdf = new PdfDocument(writer);
pdf.setTagged();
PageSize pageSize = PageSize.LETTER;
Document document = new Document(pdf, pageSize);
document.setMargins(offSet, offSet, offSet, offSet);
byte[] font1file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Garamond-Premier-Pro-Regular.ttf"));
byte[] font2file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Quivira.otf"));
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(font1file), "", true);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(font2file), PdfEncodings.IDENTITY_H, true);
Style style1 = new Style().setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Style style2 = new Style().setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
float columnGap = 5;
float columnWidth = (pageSize.getWidth() - offSet * 2 - columnGap * 2) / 3;
float columnHeight = pageSize.getHeight() - offSet * 2;
Rectangle[] columns = {
new Rectangle(offSet, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth + columnGap, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth * 2 + columnGap * 2, offSet, columnWidth, columnHeight)};
document.setRenderer(new ColumnDocumentRenderer(document, columns));
for (int j = 0; j < 5000; j++) {
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph().setFixedLeading(leading);
// StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 26; i++) {
paragraph.add(new Text("\u3255 ").addStyle(style2));
paragraph.add(new Text("test ").addStyle(style1));
// stringBuilder.append("\u3255 ").append(" test ");
}
// paragraph.add(stringBuilder.toString()).addStyle(style2);
div.add(paragraph);
document.add(div);
}
document.close();
In creating the reproducible code I found that this is related to the document being tagged. If you remove the line that marks it as tagged, the file size drops greatly.
You can also reduce the file size by using the commented out string builder with one font instead of two. (Comment out the two "paragraph.add"s in the for-loop) This mirrors the issue I have in my code.
The problem is not in the fonts themselves. The issue comes from the fact that you are creating a tagged PDF. Tagged documents contain a lot of PDF objects that need a lot of space in the file.
I wasn't able to reproduce your 20MB vs 100MB results. On my machine the resulting file size is ~44MB with two Text elements, whether I use one font or two.
To compress the file when creating large tagged documents, you should use full compression mode, which compresses all PDF objects, not only streams.
To activate full compression mode, create a PdfWriter instance with WriterProperties:
PdfWriter writer = new PdfWriter(outFileName,
new WriterProperties().setFullCompressionMode(true));
This setting reduced the file size for me from >40MB to ~5MB.
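A rough way to see why this helps: without full compression only streams are deflated, while the many small structure objects of a tagged PDF are written out as plain text. Full compression packs such objects into compressed object streams. The Python sketch below is not iText code; it only imitates the effect with zlib on made-up structure-element dictionaries, and the chunk size of 200 is an arbitrary assumption.

```python
import zlib

# Made-up stand-ins for the small structure-element objects of a tagged PDF.
objects = [
    "{} 0 obj\n<< /Type /StructElem /S /Span /P {} 0 R /K {} >>\nendobj\n".format(
        n, n - 1, n % 50
    ).encode("ascii")
    for n in range(1, 5001)
]

# Objects written as-is, one after another.
plain_size = sum(len(obj) for obj in objects)

# Objects packed into chunks and deflated, loosely like object streams.
chunk = 200  # arbitrary illustrative value
packed_size = sum(
    len(zlib.compress(b"".join(objects[i:i + chunk])))
    for i in range(0, len(objects), chunk)
)

print("plain bytes: ", plain_size)
print("packed bytes:", packed_size)
```

The exact ratio depends on the data, but highly repetitive dictionaries like these shrink considerably once they are compressed together.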
Please note that you are using iText 7.0.x while 7.1.x line has already been released and is now the main line of iText, so I recommend that you update to the latest version.
Good Evening (UK)
I'm trying to filter down a 1500+ page PDF file to only the pages which include a certain text string (typically one or two words). My laptop is locked down with respect to installing more software BUT I have used action(script)s quite a bit
I get the error below when I try to install an action called "Extract Commented Pages" into Adobe Acrobat X Pro (Win 7). It is supposed to work with X and XI and looks like what I want.
[screen dump of error]
I wondered if there was something simple causing the problem but the actionscript file is rather... busy to say the least.
I used to have an action that I think was based on a legal redaction script but it is filed somewhere!
If you already have an action that does this, or a version of the above that doesn't give the error I get (unable to import the Action.... The file is either invalid or corrupt), I will be forever indebted to you.
Many thanks, have a good weekend!
I recently came across a script found at the following link: http://forums.adobe.com/thread/1077118
I'm having some issues getting the script to run in Acrobat, despite everything looking alright in the script itself. I'll update if I find any errors.
Here is a copy of the script:
// Set the word to search for here
var sWord = "forms";
// Source document = current document
var sd = this;
var nWords, currWord, fp, fpa = [], nd;
var fn = sd.documentFileName.replace(/\.pdf$/i, "");
// Loop through the pages
for (var i = 0; i < sd.numPages; i += 1) {
// Get the number of words on the page
nWords = sd.getPageNumWords(i);
// Loop through the words on the page
for (var j = 0; j < nWords; j += 1) {
// Get the current word
currWord = sd.getPageNthWord(i, j);
if (currWord === sWord) {
// Extract the current page to a new file
fp = fn + "_" + i + ".pdf";
fpa.push(fp);
sd.extractPages({nStart: i, nEnd: i, cPath: fp});
// Stop searching this page
break;
}
}
}
// Combine the individual pages into one PDF
if (fpa.length) {
// Open the document that's the first extracted page
nd = app.openDoc({cPath: fpa[0], oDoc: sd});
// Append any other pages that were extracted
if (fpa.length > 1) {
for (var i = 1; i < fpa.length; i += 1) {
nd.insertPages({nPage: i - 1, cPath: fpa[i], nStart: 0, nEnd: 0});
}
}
// Save to a new document and close this one
nd.saveAs({cPath: fn + "_searched.pdf"});
nd.closeDoc({bNoSave: true});
}
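Note that the script above compares single words, so a search string of two words never matches. One way around that is to compare a sliding window of consecutive words. The Python sketch below shows only that matching logic on plain lists of words; the function names and sample data are invented, the comparison is case-insensitive (unlike the script), and in Acrobat the word lists would come from getPageNthWord as in the script.

```python
def contains_phrase(page_words, phrase):
    """Return True if the words of `phrase` occur consecutively in `page_words`."""
    wanted = [w.lower() for w in phrase.split()]
    words = [w.lower() for w in page_words]
    if not wanted:
        return False
    # Slide a window of len(wanted) words across the page.
    for start in range(len(words) - len(wanted) + 1):
        if words[start:start + len(wanted)] == wanted:
            return True
    return False


def matching_pages(pages, phrase):
    """Return the zero-based indexes of pages whose words contain `phrase`."""
    return [i for i, words in enumerate(pages) if contains_phrase(words, phrase)]


pages = [
    ["Annual", "report", "2012"],
    ["Please", "return", "the", "tax", "forms", "today"],
    ["Tax", "Forms", "appendix"],
]
print(matching_pages(pages, "tax forms"))
```

The returned indexes are zero-based, like the page numbers in Acrobat JavaScript, so they could be passed to extractPages unchanged.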
In a project I have to split a PDF document into two documents, one containing all blank pages, and one containing all pages with content.
For this job, I use a PdfReader to read the source file, and two pdfCopy objects (one for the blank pages document, one for the pages with content document) to write the files to.
I use GetImportedPage to read a PdfImportedPage, which is then added to one of the PdfCopy writers.
Now, the problem is the following: the source file is using the "tagged PDF format". To preserve this (which is absolutely required), I use the SetTagged() method on both PdfCopy writers, and use the extra third parameter in GetImportedPage(...) to keep the tagged format. However, when calling the AddPage(...) on the PdfCopy writer, I get an invalid cast exception:
"Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' to type 'iTextSharp.text.pdf.PRIndirectReference'."
Does anyone have any ideas on how to solve this? Any hints?
Also: the project currently references version 5.1.0.0 of the iText libraries. In 5.4.4.0 the third parameter to GetImportedPage does not seem to be there anymore.
Below, you can find a code extract:
iTextSharp.text.Document targetPdf = new iTextSharp.text.Document();
iTextSharp.text.Document blankPdf = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader sourcePdfReader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.pdf.PdfCopy targetPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(targetPdf, new FileStream(outputFile, FileMode.Create));
iTextSharp.text.pdf.PdfCopy blankPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(blankPdf, new FileStream(blanksFile, FileMode.Append));
targetPdfWriter.SetTagged();
blankPdfWriter.SetTagged();
try
{
iTextSharp.text.pdf.PdfImportedPage page = null;
int n = sourcePdfReader.NumberOfPages;
targetPdf.Open();
blankPdf.Open();
blankPdf.Add(new iTextSharp.text.Phrase("This document contains the blank pages removed from " + inputFile));
blankPdf.NewPage();
for (int i = 1; i <= n; i++)
{
byte[] pageBytes = sourcePdfReader.GetPageContent(i);
string pageText = "";
iTextSharp.text.pdf.PRTokeniser token = new iTextSharp.text.pdf.PRTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(pageBytes));
while (token.NextToken())
{
if (token.TokenType == iTextSharp.text.pdf.PRTokeniser.TokType.STRING)
{
pageText += token.StringValue;
}
}
if (pageText.Length >= 15)
{
page = targetPdfWriter.GetImportedPage(sourcePdfReader, i, true);
targetPdfWriter.AddPage(page);
}
else
{
page = blankPdfWriter.GetImportedPage(sourcePdfReader, i, true);
blankPdfWriter.AddPage(page);
blankPageCount++;
}
}
}
catch (Exception ex)
{
Console.WriteLine("Exception at LOC1: " + ex.Message);
}
The error occurs in the call to targetPdfWriter.AddPage(page); near the end of the code sample.
Thank you very much for your help.
Koen.
I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK ImageMagick won't work for this. Can iText do this? I have never used it and it's pretty complex, so I would need some hints.
EDIT:
The marked answer led me to the solution. For completeness, here is my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i <= n; i++) { // page numbers are one-based, so include page n
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found open Document. Adding additional page to Document...");
// skip the page only if blank pages should be removed and this page is blank
if (!removeBlankPages || !isBlankPage(reader, i)) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples for (current versions of) the iText library, consult iText in Action, 2nd Edition; its code samples are online and searchable by keyword.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF into single-page PDFs. In your case you need to split by different criteria, but that only means that in the loop you sometimes have to add more than one imported page (and thus decouple the loop index from the page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
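The decoupling of loop index and record boundaries can be tried out without iText first. The Python sketch below only groups page numbers; it assumes the page texts have already been extracted (for example with PdfTextExtractor.getTextFromPage), and the function name and the "RECORD START" header are placeholders.

```python
import re


def group_pages_by_header(page_texts, header_regex):
    """Group one-based page numbers into records that start at a header page."""
    pattern = re.compile(header_regex)
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        if pattern.search(text):
            # Header found: this page opens a new record.
            records.append([page_number])
        elif records:
            # No header: continuation page of the current record.
            records[-1].append(page_number)
        # Pages before the first header belong to no record and are skipped.
    return records


texts = [
    "RECORD START Alice ...",
    "RECORD START Bob ...",
    "... Bob, continued",
    "RECORD START Carol ...",
]
print(group_pages_by_header(texts, r"RECORD START"))
```

Each inner list then corresponds to one output document: open a new Document and PdfCopy per list and add every page number in it.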
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print input1.getDocumentInfo()
print input1.getNumPages()
text = input1.getPage(0).extractText()
print text.encode("windows-1250", errors='backslashreplace')
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.
Dear Team,
In my application, I want to split a PDF using iTextSharp. If I upload a 10-page PDF with a file size of 10 MB, the combined size of the resulting PDFs after splitting is above 20 MB. Is it possible to reduce the file size of each PDF?
Please help me to solve the issue.
Thanks in advance
This may have to do with the resources in the file. If the original document uses the same embedded font on each page, for example, then there will be only one instance of the font in the original file. When you split it, each resulting file is required to have that font as well. The total overhead will be n pages × sizeof(each font). Elements that cause this kind of bloat include fonts, images, color profiles, document templates (aka forms), XMP, etc.
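To put numbers on that, here is a back-of-the-envelope estimate in Python. The sizes are hypothetical (they are not measurements of any real file), and the sketch assumes every page uses all shared resources, so each output file carries a full copy of them.

```python
def estimate_split_total(page_content_sizes, shared_resource_size):
    """Estimate the combined size of the single-page files after a split."""
    page_count = len(page_content_sizes)
    # Page-specific content is written once; shared resources once per file.
    return sum(page_content_sizes) + page_count * shared_resource_size


# Hypothetical 10-page file, sizes in KB: 1500 KB of shared fonts and
# images plus 850 KB of page-specific content per page.
shared_kb = 1500
page_kb = [850] * 10

original_kb = shared_kb + sum(page_kb)
after_split_kb = estimate_split_total(page_kb, shared_kb)
print(original_kb, after_split_kb)
```

With these made-up numbers a file of about 10 MB grows to more than 20 MB in total once each page has its own copy of the shared resources.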
And while it doesn't help you in your immediate problem, if you use the PDF tools in Atalasoft dotImage, your task becomes a 1 liner:
PdfDocument.Separate(userpassword, ownerpassword, origPath, destFolder, "Separated Page{0}.pdf", true);
This takes the PDF at origPath and creates one file per page in destFolder, each named according to the pattern. The bool at the end says whether to overwrite an existing file.
Disclaimer: I work for Atalasoft and wrote the PDF library (also used to work at Adobe on Acrobat versions 1, 2, 3, and 4).
Hi guys, I modified the above code to split a PDF file into multiple PDF files.
iTextSharp.text.pdf.PdfReader reader = null;
int currentPage = 1;
int pageCount = 0;
//string filepath_New = filepath + "\\PDFDestination\\";
System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
//byte[] arrayofPassword = encoding.GetBytes(ExistingFilePassword);
reader = new iTextSharp.text.pdf.PdfReader(filepath);
reader.RemoveUnusedObjects();
pageCount = reader.NumberOfPages;
string ext = System.IO.Path.GetExtension(filepath);
for (int i = 1; i <= pageCount; i++)
{
iTextSharp.text.pdf.PdfReader reader1 = new iTextSharp.text.pdf.PdfReader(filepath);
string outfile = filepath.Replace((System.IO.Path.GetFileName(filepath)), (System.IO.Path.GetFileName(filepath).Replace(".pdf", "") + "_" + i.ToString()) + ext);
reader1.RemoveUnusedObjects();
iTextSharp.text.Document doc = new iTextSharp.text.Document(reader.GetPageSizeWithRotation(currentPage));
iTextSharp.text.pdf.PdfCopy pdfCpy = new iTextSharp.text.pdf.PdfCopy(doc, new System.IO.FileStream(outfile, System.IO.FileMode.Create));
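// request full compression before opening the document;
// newer iTextSharp versions may reject it once the document is open
pdfCpy.SetFullCompression();
doc.Open();
iTextSharp.text.pdf.PdfImportedPage page = pdfCpy.GetImportedPage(reader1, currentPage);
pdfCpy.AddPage(page);
currentPage += 1;
doc.Close();
pdfCpy.Close();
reader1.Close();
}
// close the outer reader only after the last page has been processed
reader.Close();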
Have you tried setting the compression on the writer?
Document doc = new Document();
using (MemoryStream ms = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(doc, ms);
writer.SetFullCompression();
}