Corrupted pdf after some compression using iTextSharp

The following code first creates a .pdf, which is fine and looks perfect. I took the rest of the code (which does the compression) from another post on this site. The problem is that the compressed.pdf file is only 1 KB, and Acrobat says the file is damaged and cannot be repaired. I have never written a PDF compressor before. Please take a look at my code and, if possible, suggest corrections to make it work.
private void btnEndScan_Click(object sender, EventArgs e)
{
Document doc1 = new Document(PageSize.A4, 0, 0, 0, 0);
string filename = "Prot_" + label.Text + ".pdf";
try
{
PdfWriter.GetInstance(doc1, new FileStream("C:/" + filename, FileMode.Create));
doc1.Open();
for (int i = 0; i < imageArray.Length; i++)
{
iTextSharp.text.Image pic = iTextSharp.text.Image.GetInstance(imageArray[i], System.Drawing.Imaging.ImageFormat.Bmp);
pic.ScalePercent(36f);
doc1.Add(pic);
}
}
catch (Exception ex)
{
MessageBox.Show("Error creating pdf file" + ex);
}
finally
{
doc1.Close();
PdfReader reader = new PdfReader("C:/" + filename);
string filepath = "C:/compressed/" + filename;
using (MemoryStream ms = new MemoryStream())
{
PdfStamper stamper = new PdfStamper(reader, ms, PdfWriter.VERSION_1_5);
PdfWriter writer = stamper.Writer;
writer.CompressionLevel = PdfStream.BEST_COMPRESSION;
reader.RemoveFields();
reader.RemoveUnusedObjects();
stamper.Reader.RemoveUnusedObjects();
stamper.SetFullCompression();
stamper.Writer.SetFullCompression();
byte[] compressed = ms.ToArray();
reader.Close();
stamper.Close();
using (FileStream fs = File.Create("C:/compressed/compressed.pdf"))
{
fs.Write(compressed, 0, (int)compressed.Length);
fs.Close();
}
}
}
}

You are cutting the file too short.
Take a look at these lines:
byte[] compressed = ms.ToArray();
reader.Close();
stamper.Close();
They should be ordered like this:
stamper.Close();
reader.Close();
byte[] compressed = ms.ToArray();
The order in your code is wrong because:
You should close the reader after the stamper, because the stamper may need access to the reader while closing.
The file is only complete when you close the stamper. At the moment you create the byte[] not all the PDF data has been written yet. The file is incomplete.
Because of the incomplete byte[], you are removing a substantial part of your file when you do this:
fs.Write(compressed, 0, (int)compressed.Length);
The value of compressed.Length is too small: the complete file is larger than the array you captured before closing the stamper.
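The same ordering problem can be reproduced without any PDF library. The following is a minimal Java sketch, not iTextSharp code: it assumes a GZIPOutputStream as a stand-in for the stamper, since both only finish writing their output when they are closed. The class and method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CloseBeforeSnapshot {

    /** Returns { bytes captured before close, bytes captured after close }. */
    static byte[][] capture(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Stand-in for the stamper: a writer that finishes its output on close().
        GZIPOutputStream gzip = new GZIPOutputStream(buffer);
        gzip.write(text.getBytes(StandardCharsets.UTF_8));

        // Wrong order: snapshot taken while the writer still holds pending data.
        byte[] tooEarly = buffer.toByteArray();

        // close() pushes the remaining data and the trailer into the buffer.
        gzip.close();

        // Right order: snapshot taken after the writer has been closed.
        byte[] complete = buffer.toByteArray();
        return new byte[][] { tooEarly, complete };
    }

    public static void main(String[] args) throws IOException {
        byte[][] result = capture("example payload");
        System.out.println("before close: " + result[0].length + " bytes");
        System.out.println("after close:  " + result[1].length + " bytes");
    }
}
```

The array captured before close() is a truncated file, which is the situation described above for the byte[] taken before stamper.Close().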

Related

iTextSharp Html Arabic Mixed Content to PDF

We are working on a project in the education domain where we need to import Arabic content, images (URIs or base64-encoded data), and HTML tables.
We run into a problem when using HTMLWorker with stream data: it throws the error "the document has no pages".
string str="html content contains arabic font texts/images";
using (MemoryStream ms = new MemoryStream())
{
using (iTextSharp.text.Document document = new iTextSharp.text.Document(iTextSharp.text.PageSize.A4, 25, 25, 30, 30))
{
using (iTextSharp.text.pdf.PdfWriter writer = iTextSharp.text.pdf.PdfWriter.GetInstance(document,ms))
{
using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(document))
{
//HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
using (var sr = new StreamReader(str)) /// We are facing issue at this juncture where it throws error.
{
//Parse the HTML
htmlWorker.Parse(sr);
}
}
document.Close();
writer.Close();
ms.Close();
Response.ContentType = "pdf/application";
Response.AddHeader("content-disposition", "attachment;filename=First_PDF_document.pdf");
Response.OutputStream.Write(ms.ToArray(), 0, ms.ToArray().Length);
}
}
}
Could you please help us with this?
I have since tried another approach:
using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(document))
{
using (Stream s = GenerateStreamFromString(str))
{
using (var srt = new StreamReader(s))
{
//Parse the HTML
htmlWorker.Parse(srt); //this line throws error now.
}
}
}
public Stream GenerateStreamFromString(string s)
{
MemoryStream stream = new MemoryStream();
StreamWriter writer = new StreamWriter(stream);
writer.Write(s);
writer.Flush();
stream.Position = 0;
return stream;
}
Error message:
"The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters."
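The quoted message is what one would expect when the HTML text itself is interpreted as a file name: in .NET, the StreamReader(string) constructor takes a path, whereas StringReader reads the characters of the string (as the comment in the first snippet already hints). The sketch below shows the same distinction in Java, as an analogy only and not as iTextSharp code; the class and method names are made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadFromString {

    /** Drains any Reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(reader)) {
            int c;
            while ((c = br.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    /** Hands the markup itself to the consumer, not a file with that name. */
    static String parse(String html) throws IOException {
        // new java.io.FileReader(html) would treat the markup as a file name
        // and fail; StringReader reads the characters of the string directly.
        return readAll(new StringReader(html));
    }

    public static void main(String[] args) throws IOException {
        // Arabic text written with Unicode escapes to avoid source-encoding issues.
        System.out.println(parse("<p>\u0645\u0631\u062d\u0628\u0627</p>"));
    }
}
```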

How to remove one indirectly referenced image from a PDF and keep all others?

I would like to parse a PDF and find the logo via known attributes and when I find a match, remove that image and then copy everything else.
I am using the code below to replace an image with a blank white image to remove a logo from PDFs that are to be printed on letterhead. It replaces the image with a white image of the same size. Is there a way to modify this to actually remove the image (and thus save some space, etc.?).
private static void Main(string[] args)
{
ManipulatePdf(@"C:\in.pdf", @"C:\out.pdf");
Console.WriteLine("Finished - press a key");
Console.ReadKey();
}
public static void ManipulatePdf(String src, String dest)
{
Console.WriteLine("Start");
PdfReader reader = new PdfReader(src);
// first read all references and find the one we wish to work on.
PdfDictionary page = reader.GetPageN(1); // all resources are available to every page (?)
PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
page = reader.GetPageN(1);
resources = page.GetAsDict(PdfName.RESOURCES);
xobjects = resources.GetAsDict(PdfName.XOBJECT);
foreach (PdfName pdfName in xobjects.Keys)
{
PRStream stream = (PRStream) xobjects.GetAsStream(pdfName);
if (stream.Length > 100000)
{
PdfImage image = new PdfImage(MakeBlankImg(), "", null);
Console.WriteLine("Calling replace stream");
ReplaceStream(stream, image);
}
}
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
stamper.Close();
reader.Close();
}
public static iTextSharp.text.Image MakeBlankImg()
{
Console.WriteLine("Making small blank image");
byte[] array;
using (MemoryStream ms = new MemoryStream())
{
//var drawingImage = image.GetDrawingImage();
using (Bitmap newBi = new Bitmap(1, 1))
{
using (Graphics g = Graphics.FromImage(newBi))
{
g.Clear(Color.White);
g.Flush();
}
newBi.Save(ms, ImageFormat.Jpeg);
}
array = ms.ToArray();
}
Console.WriteLine("Image array is " + array.Length + " bytes.");
return iTextSharp.text.Image.GetInstance(array);
}
public static void ReplaceStream(PRStream orig, PdfStream stream)
{
orig.Clear();
MemoryStream ms = new MemoryStream();
stream.WriteContent(ms);
orig.SetData(ms.ToArray(), false);
Console.WriteLine("Iterating keys");
foreach (KeyValuePair<PdfName, PdfObject> keyValuePair in stream)
{
Console.WriteLine("Key: " + keyValuePair.Key.ToString());
orig.Put(keyValuePair.Key, stream.Get(keyValuePair.Key));
}
}
}

Pdfbox - adding pdf embedded File and save the PDDocument to OutputStream does not keep the embedded Files

I'm using PDFBox (1.8.8) to add attachments to a PDF. My problem is that when one of the attachments is a .pdf and I save the PDDocument to an OutputStream, the final PDF document does not include the attachments. If I save the PDDocument to a file instead of an OutputStream, everything works fine, and if the attachments do not include any PDF, saving to either a file or an OutputStream works fine.
I would like to know if there is any way to add embedded PDF files and save the PDDocument to an OutputStream while keeping the attached files in the final PDF that is generated.
The code I'm using is:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
final PDDocument doc;
Boolean hasPdfAttach = false;
try {
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
// final PDFTextStripper pdfStripper = new PDFTextStripper();
// final String text = pdfStripper.getText(doc);
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
final Map embeddedFileMap = new HashMap();
PDEmbeddedFile embeddedFile;
File file = null;
for (Attachment attach : attachmentsResources) {
// first create the file specification, which holds the embedded file
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile(attach.getFilename());
file = AttachmentUtils.getAttachmentFile(attach);
final InputStream is = new FileInputStream(file.getAbsolutePath());
embeddedFile = new PDEmbeddedFile(doc, is);
// set some of the attributes of the embedded file
if ("application/pdf".equals(attach.getMimetype())) {
hasPdfAttach = true;
}
embeddedFile.setSubtype(attach.getMimetype());
embeddedFile.setSize((int) (long) attach.getFilesize());
fileSpecification.setEmbeddedFile(embeddedFile);
// now add the entry to the embedded file tree and set in the document.
embeddedFileMap.put(attach.getFilename(), fileSpecification);
// final String text2 = pdfStripper.getText(doc);
}
// final String text3 = pdfStripper.getText(doc);
efTree.setNames(embeddedFileMap);
// ((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS); (this not work for me)
// attachments are stored as part of the "names" dictionary in the document catalog
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
// final ByteArrayOutputStream pdfboxToDocumentStream = new ByteArrayOutputStream();
final String tmpfile = "temporary.pdf";
if (hasPdfAttach) {
final File f = new File(tmpfile);
doc.save(f);
doc.close();
//i have try with parser but without success too
// PDFParser parser = new PDFParser(new FileInputStream(tmpfile));
// parser.parse();
// PDDocument doc2 = parser.getPDDocument();
final PDDocument doc2 = PDDocument.loadNonSeq(f, new RandomAccessFile(new File(getHomeTMP()
+ "tempppp.pdf"), "r"));
doc2.save(out);
doc2.close();
} else {
doc.save(out);
doc.close();
}
//that does not work too
// final InputStream in = new FileInputStream(tmpfile);
// IOUtils.copy(in, out);
// out = new FileOutputStream(tmpFile);
// doc.save (out);
} catch (IOException e1) {
e1.printStackTrace();
} catch (Exception e2) {
e2.printStackTrace();
}
}
Best regards
Solution:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
final PDDocument doc;
try {
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
((ByteArrayOutputStream) out).reset();
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
final Map embeddedFileMap = new HashMap();
PDEmbeddedFile embeddedFile;
File file = null;
for (Attachment attach : attachmentsResources) {
// first create the file specification, which holds the embedded file
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile(attach.getFilename());
file = AttachmentUtils.getAttachmentFile(attach);
final InputStream is = new FileInputStream(file.getAbsolutePath());
embeddedFile = new PDEmbeddedFile(doc, is);
// set some of the attributes of the embedded file
embeddedFile.setSubtype(attach.getMimetype());
embeddedFile.setSize((int) (long) attach.getFilesize());
fileSpecification.setEmbeddedFile(embeddedFile);
// now add the entry to the embedded file tree and set in the document.
embeddedFileMap.put(attach.getFilename(), fileSpecification);
}
efTree.setNames(embeddedFileMap);
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
// attachments are stored as part of the "names" dictionary in the document catalog
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
doc.save(out);
doc.close();
} catch (IOException e1) {
e1.printStackTrace();
} catch (Exception e2) {
e2.printStackTrace();
}
}

You store the new PDF after the original PDF in out:
Look at all the uses of out in your method:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
...
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
...
doc2.save(out);
...
doc.save(out);
So you get as input a ByteArrayOutputStream and take its current content as input (i.e. the ByteArrayOutputStream is not empty but already contains a PDF) and after some processing you append the modified PDF to the ByteArrayOutputStream. Depending on the PDF viewer you present this to, you will be shown either the original or the manipulated PDF or a (very correct) error message that the file is garbage.
If you want the ByteArrayOutputStream to contain only the manipulated PDF, simply add
((ByteArrayOutputStream) out).reset();
or (if you are not sure about the state of the stream)
out = new ByteArrayOutputStream();
right after
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
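The append behaviour described above can be seen with plain byte arrays. This is a minimal Java sketch using only the standard library; the strings "ORIGINAL" and "MODIFIED" stand in for the two PDFs, and the class and method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ResetBeforeRewrite {

    /** Reads the stream's content, then writes a modified version back into it. */
    static String rewrite(boolean resetFirst) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write("ORIGINAL".getBytes(StandardCharsets.US_ASCII));

        // toByteArray() copies the content; it does not empty the stream.
        byte[] input = out.toByteArray();

        if (resetFirst) {
            // Discard what is already in the stream before writing the new version.
            out.reset();
        }

        // Stand-in for the processing step: produce the modified document.
        byte[] modified = "MODIFIED".getBytes(StandardCharsets.US_ASCII);
        out.write(modified);

        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rewrite(false)); // prints ORIGINALMODIFIED
        System.out.println(rewrite(true));  // prints MODIFIED
    }
}
```

One caveat when adapting this to a method that receives the stream as a parameter: in Java, assigning a new ByteArrayOutputStream to the parameter only rebinds the local variable, so the caller would not see the new stream unless it is returned.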
PS: According to the comments the OP tried the above proposed changes to his code without success.
I cannot run the code as presented in the question because it is not self-contained. Thus, I reduced it to the essentials to get a self-contained test:
@Test
public void test() throws IOException, COSVisitorException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (
InputStream sourceStream = getClass().getResourceAsStream("test.pdf");
InputStream attachStream = getClass().getResourceAsStream("artificial text.pdf"))
{
final PDDocument document = PDDocument.load(sourceStream);
final PDEmbeddedFile embeddedFile = new PDEmbeddedFile(document, attachStream);
embeddedFile.setSubtype("application/pdf");
embeddedFile.setSize(10993);
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile("artificial text.pdf");
fileSpecification.setEmbeddedFile(embeddedFile);
final Map<String, PDComplexFileSpecification> embeddedFileMap = new HashMap<String, PDComplexFileSpecification>();
embeddedFileMap.put("artificial text.pdf", fileSpecification);
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
efTree.setNames(embeddedFileMap);
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(document.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
document.getDocumentCatalog().setNames(names);
document.save(baos);
document.close();
}
Files.write(Paths.get("attachment.pdf"), baos.toByteArray());
}
As you can see, PDFBox uses only streams here.
Thus, PDFBox has no problem storing a PDF into which it has embedded a PDF file attachment.
The problem, therefore, most likely has nothing to do with this workflow as such.

itextsharp form name and saving pdf

I am using iTextSharp in ASP.NET. We populate a PDF with fields taken from one of our online forms. I need to change the way we handle the documents: we need to use some of the fields as the name of the document (firstname-lastname.pdf) and save that PDF into a directory. Here is the code I am using now:
PdfStamper ps = null;
DataTable dt = BindData();
if (dt.Rows.Count > 0)
{
PdfReader r = new PdfReader(new RandomAccessFileOrArray("http://www.example.com/Documents/ppd-certificate.pdf"), null);
ps = new PdfStamper(r, Response.OutputStream);
AcroFields af = ps.AcroFields;
af.SetField("fullName", dt.Rows[0]["fullName"].ToString());
af.SetField("presentationTitle", dt.Rows[0]["presentationTitle"].ToString());
af.SetField("presenterName", dt.Rows[0]["presenterFullName"].ToString());
af.SetField("date", Convert.ToDateTime(dt.Rows[0]["date"]).ToString("MM/dd/yyyy"));
ps.FormFlattening = true;
ps.Close();
}

PdfStamper and PdfWriter both work with the generic Stream class, so instead of Response.OutputStream you can use a FileStream or a MemoryStream.
This example writes directly to disk. Set testFile to whatever you want; I'm using the desktop here.
//Your file path here:
var testFile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
PdfReader r = new PdfReader(new RandomAccessFileOrArray("http://www.example.com/Documents/ppd-certificate.pdf"), null);
var ps = new PdfStamper(r, fs);
//..code (set the fields, then call ps.Close() so the PDF is complete)
}
This next example is my preferred method. It creates a MemoryStream, then creates a PDF inside of it, and finally grabs the raw bytes. Once you've got the raw bytes you can both write them to disk AND Response.BinaryWrite() them.
byte[] bytes;
using (var ms = new MemoryStream()) {
PdfReader r = new PdfReader(new RandomAccessFileOrArray("http://www.example.com/Documents/ppd-certificate.pdf"), null);
var ps = new PdfStamper(r, ms);
//..code (set the fields, then call ps.Close() before reading the bytes)
bytes = ms.ToArray();
}
//Your file path here:
var testFile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
//Write to disk
System.IO.File.WriteAllBytes(testFile, bytes);
//Send to HTTP client
Response.BinaryWrite(bytes);
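The "bytes first, then write them wherever you need" idea, combined with a file name built from two form fields, might look like the following Java sketch. It uses only the standard library and is not iTextSharp code: fileNameFor and saveAndSend are hypothetical helpers, the sanitising rule is an assumption, and the OutputStream parameter stands in for the HTTP response stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveAndSend {

    /**
     * Hypothetical helper: builds "first-last.pdf" from two field values.
     * Anything outside A-Z, a-z, 0-9, '.', '_' and '-' becomes '_'
     * (an assumed rule; it also replaces non-ASCII letters).
     */
    static String fileNameFor(String firstName, String lastName) {
        String raw = firstName.trim() + "-" + lastName.trim();
        return raw.replaceAll("[^A-Za-z0-9._-]", "_") + ".pdf";
    }

    /** Writes the same bytes to a file in the directory and to the response stream. */
    static Path saveAndSend(byte[] pdfBytes, Path directory, String fileName,
                            OutputStream response) throws IOException {
        Files.createDirectories(directory);
        Path target = directory.resolve(fileName);
        Files.write(target, pdfBytes);   // copy 1: on disk
        response.write(pdfBytes);        // copy 2: to the client
        response.flush();
        return target;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = { 1, 2, 3 };      // placeholder for the finished PDF bytes
        Path dir = Files.createTempDirectory("certificates");
        ByteArrayOutputStream fakeResponse = new ByteArrayOutputStream();
        Path saved = saveAndSend(bytes, dir, fileNameFor("Ada", "Lovelace"), fakeResponse);
        System.out.println("saved to " + saved);
    }
}
```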

how to append one pdf to other pdf file using itextsharp

How can I append pages from one PDF file to another without creating a new PDF, using iTextSharp? The first PDF has metadata attached, so I want to add only the pages of the other PDF and leave the first PDF's metadata as it is.
Regards
Himvj

Assuming you have 2 pdf files: file1.pdf and file2.pdf that you want to concatenate and save the resulting pdf to file1.pdf (by replacing its contents) you could try the following:
using (var output = new MemoryStream())
{
var document = new Document();
var writer = new PdfCopy(document, output);
document.Open();
foreach (var file in new[] { "file1.pdf", "file2.pdf" })
{
var reader = new PdfReader(file);
int n = reader.NumberOfPages;
PdfImportedPage page;
for (int p = 1; p <= n; p++)
{
page = writer.GetImportedPage(reader, p);
writer.AddPage(page);
}
}
document.Close();
File.WriteAllBytes("file1.pdf", output.ToArray());
}

You can try this; it adds the whole document, including metadata:
public static void MergeFiles(string destinationFile, string[] sourceFiles)
{
try
{
//1: Create the MemoryStream for the destination document.
using (MemoryStream ms = new MemoryStream())
{
//2: Create the PdfCopyFields object.
PdfCopyFields copy = new PdfCopyFields(ms);
// - Set the security and other settings for the destination file.
//copy.Writer.SetEncryption(PdfWriter.STRENGTH128BITS, null, "1234", PdfWriter.AllowPrinting | PdfWriter.AllowCopy | PdfWriter.AllowFillIn);
copy.Writer.ViewerPreferences = PdfWriter.PageModeUseOutlines;
// - Create an arraylist to hold bookmarks for later use.
ArrayList outlines = new ArrayList();
int pageOffset = 0;
int f = 0;
//3: Import the documents specified in args[1], args[2], etc...
while (f < sourceFiles.Length)
{
// Grab the file from args[] and open it with PdfReader.
string file = sourceFiles[f];
PdfReader reader = new PdfReader(file);
// Import the pages from the current file.
copy.AddDocument(reader);
// Create an ArrayList of bookmarks in the file being imported.
// ArrayList bookmarkLst = SimpleBookmark.GetBookmark(reader);
// Shift the pages to accommodate any pages that were imported before the current document.
// SimpleBookmark.ShiftPageNumbers(bookmarkLst, pageOffset, null);
// Fill the outlines ArrayList with each bookmark as a HashTable.
// foreach (Hashtable ht in bookmarkLst)
// {
// outlines.Add(ht);
// }
// Set the page offset to the last page imported.
//copy.Writer.SetPageSize(rec);
pageOffset += reader.NumberOfPages;
f++;
}
//4: Put the outlines from all documents under a new "Root" outline and
// set them for destination document
// copy.Writer.Outlines = GetBookmarks("Root", ((Hashtable)outlines[0])["Page"], outlines);
//5: Close the PdfCopyFields object.
copy.Close();
//6: Save the MemoryStream to a file.
MemoryStreamToFile(ms, destinationFile);
}
}
catch (System.Exception e)
{
System.Console.Error.WriteLine(e.Message);
System.Console.Error.WriteLine(e.StackTrace);
System.Console.ReadLine();
}
}
public static void MemoryStreamToFile(MemoryStream MS, string FileName)
{
using (FileStream fs = new FileStream(@FileName, FileMode.Create))
{
byte[] data = MS.ToArray();
fs.Write(data, 0, data.Length);
fs.Close();
}
}
