Unable to merge streams with PDFBox

Currently I am able to merge two PDF files when using java.io.File, but I am unable to merge them when using input and output streams.
The code below works and generates the merged PDF successfully.
File mainDoc = new File(path...);
File additionalDoc = new File(path...);
PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.setDestinationFileName(path + "/merged.pdf");
pdfMerger.addSource(mainDoc);
pdfMerger.addSource(additionalDoc);
pdfMerger.mergeDocuments(null);
I then tried to do the same by using streams.
ByteArrayOutputStream out = new ByteArrayOutputStream();
InputStream mainDocStream = new FileInputStream(path...);
InputStream additionalDocSteam = new FileInputStream(path...);
PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.addSource(mainDocStream);
pdfMerger.addSource(additionalDocSteam);
pdfMerger.setDestinationStream(out);
pdfMerger.mergeDocuments(null);
When the code above reaches the line pdfMerger.mergeDocuments(null); it throws the following exception:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2005)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1988)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:269)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1143)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1059)
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:263)
Finally, I tried to follow this answer (Merge Pdf Files Using PDFBox) as an example, but my generated PDF does not seem to contain both merged PDFs.
This is the code that I tried.
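public InputStream createPDF() {
// Declared outside the try block so it is still in scope for the return statement.
InputStream pdfInputStream = null;
ByteArrayOutputStream out = new ByteArrayOutputStream();
try{
// Note: I have also tried to use java.io.File instead of an
// InputStream, but the result was the same.
// File mainDoc = new File(path...);
// PDDocument document = PDDocument.load(mainDoc);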
InputStream mainDocStream = new FileInputStream(path...);
PDDocument document = PDDocument.load(mainDocStream);
InputStream additionalDocSteam = new FileInputStream(path...);
PDDocument additionalDocument = PDDocument.load(additionalDocSteam);
PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.appendDocument(additionalDocument, document);
document.save(out);
document.close();
PDDocument.load(out.toByteArray());
pdfInputStream = new ByteArrayInputStream(out.toByteArray());
}catch(...){
....
}
return pdfInputStream;
}
The code above does generate a PDF, but the newly created PDF contains only the content of the main document and nothing from the second one. So it looks like I am missing something and the documents are not merged.

I was able to find a solution, but I still cannot understand what is going wrong when using streams. In detail:
The following code throws an exception (java.io.IOException: Error: End-of-File, expected line):
ByteArrayOutputStream out = new ByteArrayOutputStream();
InputStream mainDocStream = new FileInputStream(path...);
InputStream additionalDocStream = new FileInputStream(path...);
PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.addSource(mainDocStream);
pdfMerger.addSource(additionalDocStream);
pdfMerger.setDestinationStream(out);
pdfMerger.mergeDocuments(null);
When using a File in the addSource method, everything seems to work as required.
public InputStream createPDF() {
ByteArrayOutputStream out = new ByteArrayOutputStream();
InputStream pdfInputStream = null;
try{
File mainDoc = new File(...);
File additionalDoc = new File(path...);
PDFMergerUtility pdfMerger = new PDFMergerUtility();
pdfMerger.addSource(mainDoc);
pdfMerger.addSource(additionalDoc);
pdfMerger.setDestinationStream(out);
pdfMerger.mergeDocuments(null);
pdfInputStream = new ByteArrayInputStream(out.toByteArray());
}catch(...){
...
}
return pdfInputStream;
}
I would still like to know why the first approach using streams throws an exception while using the file directly works.
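The stack trace above fails in COSParser.parseHeader, which suggests that the parser reached the end of its input while still reading the first line of the file. A quick way to check whether a stream still delivers a PDF header before it is passed to addSource is sketched below using only JDK classes (Java 11 or later for readNBytes). The class and method names are hypothetical helpers, not part of PDFBox, and this is a diagnostic aid under those assumptions rather than a confirmed explanation for this particular case.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfHeaderCheck {

    // Every PDF file starts with these five bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the next bytes of the stream are "%PDF-".
     * The stream is rewound afterwards, so it can still be handed on.
     */
    static boolean startsWithPdfHeader(BufferedInputStream in) {
        try {
            in.mark(MAGIC.length);                     // remember the current position
            byte[] head = in.readNBytes(MAGIC.length); // shorter or empty at end of stream
            in.reset();                                // go back to the remembered position
            return Arrays.equals(head, MAGIC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping each FileInputStream in a BufferedInputStream and calling the check right before the merge shows whether the stream is empty or has already been consumed at that point.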

Related

Error using OpenXML to read a .docx file from a memorystream to a WordprocessingDocument to a string and back

I have an existing library that I can use to receive a docx file and return it. The software is .Net Core hosted in a Linux Docker container.
It's very limited in scope though and I need to perform some actions it can't do. As these are straightforward I thought I would use OpenXML, and for my proof of concept all I need to do is to read a docx as a memorystream, replace some text, turn it back into a memorystream and return it.
However the docx that gets returned is unreadable. I've commented out the text replacement below to eliminate that, and if I comment out the call to the method below then the docx can be read so I'm sure the issue is in this method.
Presumably I'm doing something fundamentally wrong here but after a few hours googling and playing around with the code I am not sure how to correct this; any ideas what I have wrong?
Thanks for the help
private MemoryStream SearchAndReplace(MemoryStream mem)
{
mem.Position = 0;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(mem, true))
{
string docText = null;
StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream());
docText = sr.ReadToEnd();
//Regex regexText = new Regex("Hello world!");
//docText = regexText.Replace(docText, "Hi Everyone!");
MemoryStream newMem = new MemoryStream();
newMem.Position = 0;
StreamWriter sw = new StreamWriter(newMem);
sw.Write(docText);
return newMem;
}
}
If your real requirement is to search and replace text in a WordprocessingDocument, you should have a look at this answer.
The following unit test shows how you can make your approach work if the use case really demands that you read a string from a part, "massage" the string, and write the changed string back to the part. It also shows one of the shortcomings of any other approach than the one described in the answer already mentioned above, e.g., by demonstrating that the string "Hello world!" will not be found in this way if it is split across w:r elements.
[Fact]
public void CanSearchAndReplaceStringInOpenXmlPartAlthoughThisIsNotTheWayToSearchAndReplaceText()
{
// Arrange.
using var docxStream = new MemoryStream();
using (var wordDocument = WordprocessingDocument.Create(docxStream, WordprocessingDocumentType.Document))
{
MainDocumentPart part = wordDocument.AddMainDocumentPart();
var p1 = new Paragraph(
new Run(
new Text("Hello world!")));
var p2 = new Paragraph(
new Run(
new Text("Hello ") { Space = SpaceProcessingModeValues.Preserve }),
new Run(
new Text("world!")));
part.Document = new Document(new Body(p1, p2));
Assert.Equal("Hello world!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
// Act.
SearchAndReplace(docxStream);
// Assert.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, false))
{
MainDocumentPart part = wordDocument.MainDocumentPart;
Paragraph p1 = part.Document.Descendants<Paragraph>().First();
Paragraph p2 = part.Document.Descendants<Paragraph>().Last();
Assert.Equal("Hi Everyone!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
}
private static void SearchAndReplace(MemoryStream docxStream)
{
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, true))
{
// If you wanted to read the part's contents as text, this is how you
// would do it.
string partText = ReadPartText(wordDocument.MainDocumentPart);
// Note that this is not the way in which you should search and replace
// text in Open XML documents. The text might be split across multiple
// w:r elements, so you would not find the text in that case.
var regex = new Regex("Hello world!");
partText = regex.Replace(partText, "Hi Everyone!");
// If you wanted to write changed text back to the part, this is how
// you would do it.
WritePartText(wordDocument.MainDocumentPart, partText);
}
docxStream.Seek(0, SeekOrigin.Begin);
}
private static string ReadPartText(OpenXmlPart part)
{
using Stream partStream = part.GetStream(FileMode.OpenOrCreate, FileAccess.Read);
using var sr = new StreamReader(partStream);
return sr.ReadToEnd();
}
private static void WritePartText(OpenXmlPart part, string text)
{
using Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write);
using var sw = new StreamWriter(partStream);
sw.Write(text);
}
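One detail worth noting in the question's SearchAndReplace method is that the StreamWriter is never flushed or disposed before newMem is returned, whereas the helpers above wrap the writer in a using declaration. The effect of a buffered writer that is not flushed can be reproduced with plain JDK classes. The following is a minimal Java sketch of that behaviour; the FlushDemo class is hypothetical and only an analogy to the .NET StreamWriter, not the author's code.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class FlushDemo {

    /** Writes text through a buffered writer and returns the bytes that reached the target. */
    static byte[] write(String text, boolean flush) {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        try {
            BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(target, StandardCharsets.UTF_8));
            writer.write(text); // short text stays in the writer's internal buffer
            if (flush) {
                writer.flush(); // pushes the buffered characters down to the target stream
            }
            // The writer is deliberately not closed so the unflushed case stays visible.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toByteArray();
    }
}
```

For a short string, write(text, false) returns an empty array while write(text, true) returns the encoded text, which is why a stream handed back before its writer is flushed can look empty to the caller.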

Xamarin Android: How to Share PDF File From Assets Folder? Via WhatsApp I get message that the file you picked was not a document

I use Xamarin Android. I have a PDF file stored in the Assets folder.
I want to share this file in WhatsApp, but I receive the message:
The file you picked was not a document.
I tried two ways:
This is the first way
var SendButton = FindViewById<Button>(Resource.Id.SendButton);
SendButton.Click += (s, e) =>
{
// Create a new file in external storage and copy the file from the assets folder to the external storage folder
Java.IO.File dstFile = new Java.IO.File(Environment.ExternalStorageDirectory.Path + "/my-pdf-File--2017.pdf");
dstFile.CreateNewFile();
var inputStream = new FileInputStream(Assets.OpenFd("my-pdf-File--2017.pdf").FileDescriptor);
var outputStream = new FileOutputStream(dstFile);
CopyFile(inputStream, outputStream);
// let the system scan the file and detect it
Intent intent = new Intent(Intent.ActionMediaScannerScanFile);
intent.SetData(Uri.FromFile(dstFile));
this.SendBroadcast(intent);
//share the Uri of the file
var sharingIntent = new Intent();
sharingIntent.SetAction(Intent.ActionSend);
sharingIntent.PutExtra(Intent.ExtraStream, Uri.FromFile(dstFile));
sharingIntent.SetType("application/pdf");
this.StartActivity(Intent.CreateChooser(sharingIntent, "#string/QuotationShare"));
};
This is the second
//Other way
var SendButton2 = FindViewById<Button>(Resource.Id.SendButton2);
SendButton2.Click += (s, e) =>
{
Intent intent = new Intent(Intent.ActionSend);
intent.SetType("application/pdf");
Uri uri = Uri.Parse(Environment.ExternalStorageDirectory.Path + "/my-pdf-File--2017.pdf");
intent.PutExtra(Intent.ExtraStream, uri);
try
{
StartActivity(Intent.CreateChooser(intent, "Share PDF file"));
}
catch (System.Exception ex)
{
Toast.MakeText(this, "Error: Cannot open or share created PDF report. " + ex.Message, ToastLength.Short).Show();
}
};
Also, when I share via email, the PDF file arrives empty (corrupt file).
What can I do?
The solution is to copy the .pdf file from the assets folder to local storage. Then we are able to share the file.
First copy the file:
string fileName = "my-pdf-File--2017.pdf";
var localFolder = Android.OS.Environment.ExternalStorageDirectory.AbsolutePath;
var MyFilePath = System.IO.Path.Combine(localFolder, fileName);
using (var streamReader = new StreamReader(Assets.Open(fileName)))
{
using (var memstream = new MemoryStream())
{
streamReader.BaseStream.CopyTo(memstream);
var bytes = memstream.ToArray();
//write to local storage
System.IO.File.WriteAllBytes(MyFilePath, bytes);
MyFilePath = $"file://{localFolder}/{fileName}";
}
}
Then share the file, from local storage:
var fileUri = Android.Net.Uri.Parse(MyFilePath);
var intent = new Intent();
intent.SetFlags(ActivityFlags.ClearTop);
intent.SetFlags(ActivityFlags.NewTask);
intent.SetAction(Intent.ActionSend);
intent.SetType("*/*");
intent.PutExtra(Intent.ExtraStream, fileUri);
intent.AddFlags(ActivityFlags.GrantReadUriPermission);
var chooserIntent = Intent.CreateChooser(intent, title);
chooserIntent.SetFlags(ActivityFlags.ClearTop);
chooserIntent.SetFlags(ActivityFlags.NewTask);
Android.App.Application.Context.StartActivity(chooserIntent);
I had this issue when trying to share a .pdf file from the assets folder via WhatsApp; it gave me the same error as in your question:
the file you picked was not a document
I finally solved it by copying the .pdf file from the assets folder to the Download folder, which works fine:
var pathFile = Android.OS.Environment.GetExternalStoragePublicDirectory(Android.OS.Environment.DirectoryDownloads);
Java.IO.File dstFile = new Java.IO.File(pathFile.AbsolutePath + "/my-pdf-File--2017.pdf");

Apache PDFBox and PDF/A-3

Is it possible to use Apache PDFBox to process PDF/A-3 documents? (Especially for changing field values?)
The PDFBox 1.8 Cookbook says that it is possible to create PDF/A-1 documents with pdfaid.setPart(1);
Can I apply pdfaid.setPart(3) for a PDF/A-3 document?
If not: is it possible to read in a PDF/A-3 document, change some field values, and save it, so that I have no need for creation/conversion to PDF/A-3 but the document is still PDF/A-3?
How to create a valid PDF/A-2 or PDF/A-3 file (conformance level A, B, or U): in this example I convert the PDF to an image, then create a valid PDF/A file from that image. This uses PDFBox 2.0.x.
public static void main(String[] args) throws IOException, TransformerException
{
String resultFile = "result/PDFA-x.PDF";
FileInputStream in = new FileInputStream("src/PDFOrigin.PDF");
PDDocument doc = new PDDocument();
try
{
PDPage page = new PDPage();
doc.addPage(page);
doc.setVersion(1.7f);
/*
// A PDF/A file needs to have the font embedded if the font is used for text rendering
// in rendering modes other than text rendering mode 3.
//
// This requirement includes the PDF standard fonts, so don't use their static PDFType1Font classes such as
// PDFType1Font.HELVETICA.
//
// As there are many different font licenses it is up to the developer to check if the license terms for the
// font loaded allows embedding in the PDF.
String fontfile = "/org/apache/pdfbox/resources/ttf/ArialMT.ttf";
PDFont font = PDType0Font.load(doc, new File(fontfile));
if (!font.isEmbedded())
{
throw new IllegalStateException("PDF/A compliance requires that all fonts used for"
+ " text rendering in rendering modes other than rendering mode 3 are embedded.");
}
*/
PDPageContentStream contents = new PDPageContentStream(doc, page);
try
{
PDDocument docSource = PDDocument.load(in);
PDFRenderer pdfRenderer = new PDFRenderer(docSource);
int numPage = 0;
BufferedImage imagePage = pdfRenderer.renderImageWithDPI(numPage, 200);
PDImageXObject pdfXOImage = LosslessFactory.createFromImage(doc, imagePage);
contents.drawImage(pdfXOImage, 0,0, page.getMediaBox().getWidth(), page.getMediaBox().getHeight());
contents.close();
}catch (Exception e) {
// TODO: handle exception
}
// add XMP metadata
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
PDDocumentCatalog catalogue = doc.getDocumentCatalog();
Calendar cal = Calendar.getInstance();
try
{
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
// dc.setTitle(file);
dc.addCreator("My APPLICATION Creator");
dc.addDate(cal);
PDFAIdentificationSchema id = xmp.createAndAddPFAIdentificationSchema();
id.setPart(3); //value => 2|3
id.setConformance("A"); // value => A|B|U
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
catalogue.setMetadata(metadata);
}
catch(BadFieldValueException e)
{
throw new IllegalArgumentException(e);
}
// sRGB output intent
InputStream colorProfile = CreatePDFA.class.getResourceAsStream(
"../../../pdmodel/sRGB.icc");
PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
intent.setInfo("sRGB IEC61966-2.1");
intent.setOutputCondition("sRGB IEC61966-2.1");
intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
intent.setRegistryName("http://www.color.org");
catalogue.addOutputIntent(intent);
catalogue.setLanguage("en-US");
PDViewerPreferences pdViewer =new PDViewerPreferences(page.getCOSObject());
pdViewer.setDisplayDocTitle(true);
catalogue.setViewerPreferences(pdViewer);
PDMarkInfo mark = new PDMarkInfo(); // new PDMarkInfo(page.getCOSObject());
PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
catalogue.setMarkInfo(mark);
catalogue.setStructureTreeRoot(treeRoot);
catalogue.getMarkInfo().setMarked(true);
PDDocumentInformation info = doc.getDocumentInformation();
info.setCreationDate(cal);
info.setModificationDate(cal);
info.setAuthor("My APPLICATION Author");
info.setProducer("My APPLICATION Producer");
info.setCreator("My APPLICATION Creator");
info.setTitle("PDF title");
info.setSubject("PDF to PDF/A{2,3}-{A,U,B}");
doc.save(resultFile);
}catch (Exception e) {
throw new IllegalArgumentException(e);
}
}
PDFBox supports that, but be aware that because PDFBox is a low-level library, you have to ensure the conformance yourself, i.e. there is no 'Save as PDF/A-3'. You might want to take a look at http://www.mustangproject.org, which uses PDFBox to support ZUGFeRD (electronic invoicing), which also needs PDF/A-3.

Converting Asp.net page to pdf with Itextsharp (XMLWorker) returning damaged/blank pdf

I am not sure whether I skipped a step in my code. I am using iTextSharp version 5.5.1 and XML Worker version 5.5.1. doc.Close() throws the exception "the document has no pages", although sw.ToString() contains the HTML content when I inspect it.
private void ExporttoPDF()
{
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.AddHeader("Content-Disposition", "attachment;filename=RequestSummaryReport.pdf");
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.ContentType = "application/pdf";
StringWriter sw = new StringWriter();
HtmlTextWriter htw = new HtmlTextWriter(sw);
var doc = new Document(PageSize.A3, 45, 5, 5, 5);
PdfWriter writer = PdfWriter.GetInstance(doc, Response.OutputStream);
doc.Open();
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
ICSSResolver cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(false);
IPipeline pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext, new PdfWriterPipeline(doc, writer)));
XMLWorker worker = new XMLWorker(pipeline, true);
XMLParser xmlParse = new XMLParser(true, worker);
pnlReport.RenderControl(htw);
StringReader sr = new StringReader(sw.ToString());
xmlParse.Parse(sr);
xmlParse.Flush();
doc.Close();
Response.Write(doc);
}
I just spent almost two hours with the same symptoms, and finally figured out the cause of the problem. I'm guessing you may have figured it out already (since, the question was asked 5 months ago) but I thought I'd post the answer in case there are others who run into the same problem.
When you create your PdfWriter, you pass in the stream (in your case Response.OutputStream) that is to be the destination for the generated PDF content. As the PdfWriter writes content to the stream, the stream position is incremented accordingly. When the writer finishes, the stream position is at the end of the content, which makes sense because any further calls to write should continue where the PdfWriter left off.
The problem is that when the MVC pipeline takes the Response.OutputStream (after your method returns) and attempts to read it to send its contents to the client (or in general, whenever the PDF destination stream is consumed), the stream position is at the end of the content, and that means that to the consumer it appears that the stream is empty, hence the empty PDF output.
To solve this, simply reset the position of the stream immediately after you have finished writing to it, and before anything tries to read from it. In your case insert:
Response.OutputStream.Position = 0;
right after the line with xmlParse.Flush(), since that is the last line that will write to the stream.

PDF header signature not found error?

I am working on an ASP.NET MVC application with Azure. When I upload a PDF document to Azure blob storage, it is uploaded perfectly using the code below.
var filename = Document.FileName;
var contenttype = Document.ContentType;
int pdfocument = Request.ContentLength;
//uploading document in to azure blob
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
var storageAccount = CloudStorageAccount.DevelopmentStorageAccount(FromConfigurationSetting("Connection"));
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("containername");
container.CreateIfNotExists();
var permissions = container.GetPermissions();
permissions.PublicAccess = BlobContainerPublicAccessType.Blob;
container.SetPermissions(permissions);
string uniqueBlobName = string.Format(filename );
CloudBlockBlob blob = container.GetBlockBlobReference(uniqueBlobName);
blob.Properties.ContentType = ;
blob.UploadFromStream(Request.InputStream);
After uploading the document to the blob, I try to read the PDF document and get the error "PDF header signature not found." The code that raises the error is:
byte[] pdf = new byte[pdfocument];
HttpContext.Request.InputStream.Read(pdf, 0, pdfocument);
PdfReader pdfReader = new PdfReader(pdf); //error getting here
One more thing I forgot: if I comment out the code above (uploading the document to the Azure blob), I do not get that error.
In your combined use case you try to read Request.InputStream twice: once during the upload and once later when reading it into your byte[] pdf. The first read consumes the stream up to its end, so the second read most likely does not get any data at all.
As you intend to read the PDF into memory anyway (the aforementioned byte[] pdf), you could in your combined use case
first read the data into that array
int pdfocument = Request.ContentLength;
byte[] pdf = new byte[pdfocument];
HttpContext.Request.InputStream.Read(pdf, 0, pdfocument);
then upload that array using CloudBlob.UploadByteArray
var storageAccount = CloudStorageAccount.DevelopmentStorageAccount(FromConfigurationSetting("Connection"));
[...]
CloudBlockBlob blob = container.GetBlockBlobReference(uniqueBlobName);
blob.Properties.ContentType = ; // <--- something missing in your code...
blob.UploadByteArray(pdf); // <--- upload byte[] instead of stream
and then feed your PDF reader
PdfReader pdfReader = new PdfReader(pdf);
This way you read the stream only once, and a byte[] should be re-usable...
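The same read-once behaviour can be shown with JDK streams. The following Java sketch (Java 9 or later for readAllBytes; the class name and the sample bytes are made up for illustration) drains a stream, shows that a second read returns nothing, and then reuses the buffered array for two independent consumers. Note that a single read call, in .NET as in Java, may return fewer bytes than requested, so draining in a loop or through a helper such as readAllBytes is the safer choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReadOnceDemo {

    /** Drains the stream completely and returns everything that was left in it. */
    static byte[] drain(InputStream in) {
        try {
            return in.readAllBytes(); // reads until end of stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the uploaded request body (hypothetical content).
        InputStream request = new ByteArrayInputStream(
                "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII));

        byte[] pdf = drain(request);        // the first consumer gets all bytes
        byte[] secondRead = drain(request); // the stream is exhausted by now

        System.out.println("first read:  " + pdf.length + " bytes");
        System.out.println("second read: " + secondRead.length + " bytes");

        // The buffered array can feed any number of independent consumers.
        InputStream forUpload = new ByteArrayInputStream(pdf);
        InputStream forParser = new ByteArrayInputStream(pdf);
        System.out.println("upload copy: " + drain(forUpload).length + " bytes");
        System.out.println("parser copy: " + drain(forParser).length + " bytes");
    }
}
```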