read word document from varbinary, edit, create a write protected pdf - pdf

I have a method that uses Microsoft.Office.Interop to read a Word document, edit it, and write it as a PDF. I have a separate method that uses Itext7 to read this PDF and write it to a different PDF that can be viewed and printed, but cannot be easily altered.
The first method reads the word document from disk; however, I've been asked to make it read the document from an sql query from a variable stored as varbinary and write the final result as a varbinary -without using any intermediate files on disk. I think I need to read these as "streams"
Here's what I have:
class clsMakeCert
{
string myName;
public string myDate;
public clsMakeCert(string name, DateTime date)
{
myName = name;
myDate = date.ToString("MM/dd/yyyy");
}
public void createCertPdf(string certFilename)
{
// Get the document out of the SQL table
System.Data.DataTable dtContents = new System.Data.DataTable();
SqlDataReader rdr_Contents = null;
using (SqlConnection conn = new SqlConnection("Server=KGREEN3-LT\\SQLEXPRESS;Initial Catalog=SAN;Integrated Security=SSPI"))
{
conn.Open();
SqlCommand cmd = new SqlCommand("Select [file_data] From cert_files where filename=#certFilename", conn);
{
cmd.CommandType = CommandType.Text;
cmd.Parameters.AddWithValue("#certFilename", certFilename);
rdr_Contents = cmd.ExecuteReader(CommandBehavior.CloseConnection);
dtContents.Load(rdr_Contents);
}
}
byte[] byteArray = (byte[])dtContents.Rows[0]["file_data"];
// Put it into a word document
Application wordApp = new Application { Visible = false };
Document doc = new Document();
using (MemoryStream stream = new MemoryStream())
{
stream.Write(byteArray, 0, (int)byteArray.Length);
doc = wordApp.Documents.Open(stream, ReadOnly: false, Visible: false);
doc.Activate();
FindAndReplace(wordApp, "First Middle Last", myName);
FindAndReplace(wordApp, "11/11/1111", myDate);
}
return ;
}
}
It doesn't look like the open method will accept a stream, though.
It looks like I can use openXML to read / edit the docx as a stream. But it won't convert to pdf.
It looks like I can use Interop to read /edit the docx AND write to PDF, but I can't do it as a stream (only as a file on disk).
Is there a way to get Interop to read a stream (i.e. a file loaded from a varbinary)?
k

No, the Word "Interop" does not support streaming. The closest it comes (and it's the only Office application that has this capability) is to pass in the necessary Word Open XML in the OPC flat file format using the obejct model's InsertXML method. It's not guaranteed, however, that the result will be an exact duplicate of the Word Open XML file being passed in as the target document's settings could override some of the in-coming settings.

Related

Error using OpenXML to read a .docx file from a memorystream to a WordprocessingDocument to a string and back

I have an existing library that I can use to receive a docx file and return it. The software is .Net Core hosted in a Linux Docker container.
It's very limited in scope though and I need to perform some actions it can't do. As these are straightforward I thought I would use OpenXML, and for my proof of concept all I need to do is to read a docx as a memorystream, replace some text, turn it back into a memorystream and return it.
However the docx that gets returned is unreadable. I've commented out the text replacement below to eliminate that, and if I comment out the call to the method below then the docx can be read so I'm sure the issue is in this method.
Presumably I'm doing something fundamentally wrong here but after a few hours googling and playing around with the code I am not sure how to correct this; any ideas what I have wrong?
Thanks for the help
private MemoryStream SearchAndReplace(MemoryStream mem)
{
mem.Position = 0;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(mem, true))
{
string docText = null;
StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream());
docText = sr.ReadToEnd();
//Regex regexText = new Regex("Hello world!");
//docText = regexText.Replace(docText, "Hi Everyone!");
MemoryStream newMem = new MemoryStream();
newMem.Position = 0;
StreamWriter sw = new StreamWriter(newMem);
sw.Write(docText);
return newMem;
}
}
If your real requirement is to search and replace text in a WordprocessingDocument, you should have a look at this answer.
The following unit test shows how you can make your approach work if the use case really demands that you read a string from a part, "massage" the string, and write the changed string back to the part. It also shows one of the shortcomings of any other approach than the one described in the answer already mentioned above, e.g., by demonstrating that the string "Hello world!" will not be found in this way if it is split across w:r elements.
[Fact]
public void CanSearchAndReplaceStringInOpenXmlPartAlthoughThisIsNotTheWayToSearchAndReplaceText()
{
// Arrange.
using var docxStream = new MemoryStream();
using (var wordDocument = WordprocessingDocument.Create(docxStream, WordprocessingDocumentType.Document))
{
MainDocumentPart part = wordDocument.AddMainDocumentPart();
var p1 = new Paragraph(
new Run(
new Text("Hello world!")));
var p2 = new Paragraph(
new Run(
new Text("Hello ") { Space = SpaceProcessingModeValues.Preserve }),
new Run(
new Text("world!")));
part.Document = new Document(new Body(p1, p2));
Assert.Equal("Hello world!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
// Act.
SearchAndReplace(docxStream);
// Assert.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, false))
{
MainDocumentPart part = wordDocument.MainDocumentPart;
Paragraph p1 = part.Document.Descendants<Paragraph>().First();
Paragraph p2 = part.Document.Descendants<Paragraph>().Last();
Assert.Equal("Hi Everyone!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
}
private static void SearchAndReplace(MemoryStream docxStream)
{
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, true))
{
// If you wanted to read the part's contents as text, this is how you
// would do it.
string partText = ReadPartText(wordDocument.MainDocumentPart);
// Note that this is not the way in which you should search and replace
// text in Open XML documents. The text might be split across multiple
// w:r elements, so you would not find the text in that case.
var regex = new Regex("Hello world!");
partText = regex.Replace(partText, "Hi Everyone!");
// If you wanted to write changed text back to the part, this is how
// you would do it.
WritePartText(wordDocument.MainDocumentPart, partText);
}
docxStream.Seek(0, SeekOrigin.Begin);
}
private static string ReadPartText(OpenXmlPart part)
{
using Stream partStream = part.GetStream(FileMode.OpenOrCreate, FileAccess.Read);
using var sr = new StreamReader(partStream);
return sr.ReadToEnd();
}
private static void WritePartText(OpenXmlPart part, string text)
{
using Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write);
using var sw = new StreamWriter(partStream);
sw.Write(text);
}

iTextSharp v5 - How do you concatenate PDFs in memory?

I am having trouble merging PDFs in-memory. I have 2 memory streams, a master and component stream, the idea is that as each component PDF is built up, the component PDF's bytes are added to the master stream. At the very end of all the components, we have a byte array that's a PDF.
I have the code below, but nothing is copying into my masterStream. I think the issue is with CopyPagesTo, but I'm not familiar enough and the documentation/examples are hard to find.
byte[] updated;
using (MemoryStream masterMemoryStream = new MemoryStream())
{
masterStream.WriteTo(masterMemoryStream);
// Read from master stream (ie. all existing components)
masterMemoryStream.Position = 0;
using (iText.Kernel.Pdf.PdfWriter masterPdfWriter = new iText.Kernel.Pdf.PdfWriter(masterMemoryStream))
using (iText.Kernel.Pdf.PdfDocument masterPdfDocument = new iText.Kernel.Pdf.PdfDocument(masterPdfWriter))
{
using (MemoryStream componentMemoryStream = new MemoryStream())
{
componentStream.WriteTo(componentMemoryStream);
// Read from new component
componentMemoryStream.Position = 0;
using (iText.Kernel.Pdf.PdfReader componentPdfReader = new iText.Kernel.Pdf.PdfReader(componentMemoryStream))
using (iText.Kernel.Pdf.PdfDocument componentPdfDocument = new iText.Kernel.Pdf.PdfDocument(componentPdfReader))
{
// Copy pages from component into master
componentPdfDocument.CopyPagesTo(1, componentPdfDocument.GetNumberOfPages(), masterPdfDocument);
}
}
}
updated = masterMemoryStream.GetBuffer();
}
// Write updates to master stream?
masterStream.SetLength(0);
using (MemoryStream temp = new MemoryStream(updated))
temp.WriteTo(masterStream);
Answer
This is mkl's answer with some of my corrections:
using (MemoryStream temporaryStream = new MemoryStream())
{
masterStream.Position = 0;
componentStream.Position = 0;
using (PdfDocument combinedDocument = new PdfDocument(new PdfReader(masterStream), new PdfWriter(temporaryStream)))
using (PdfDocument componentDocument = new PdfDocument(new PdfReader(componentStream)))
{
componentDocument.CopyPagesTo(1, componentDocument.GetNumberOfPages(), combinedDocument);
}
byte[] temporaryBytes = temporaryStream.ToArray();
masterStream.Position = 0;
masterStream.SetLength(temporaryBytes.Length);
masterStream.Capacity = temporaryBytes.Length;
masterStream.Write(temporaryBytes, 0, temporaryBytes.Length);
}
There are a number of issues in your code. I'll first give you a working version and then go into the issues in your code.
A working version (with an important limitation)
You can combine two PDFs given in MemoryStream instances masterStream and componentStream and get the result in the same MemoryStream instance masterStream as follows:
using (MemoryStream temporaryStream = new MemoryStream())
{
masterStream.Position = 0;
componentStream.Position = 0;
using (PdfDocument combinedDocument = new PdfDocument(new PdfReader(masterStream), new PdfWriter(temporaryStream)))
using (PdfDocument componentDocument = new PdfDocument(new PdfReader(componentStream)))
{
componentDocument.CopyPagesTo(1, componentDocument.GetNumberOfPages(), combinedDocument);
}
byte[] temporaryBytes = temporaryStream.ToArray();
masterStream.Position = 0;
masterStream.Capacity = temporaryBytes.Length;
masterStream.Write(temporaryBytes, 0, temporaryBytes.Length);
masterStream.Position = 0;
}
The limitation is that you have to have instantiated the masterStream with an expandable capacity; the MemoryStream class has a number of constructors only some of which create such an expandable instance while the others create non-resizable instances. For details read here.
Issues in your concept and code
Concatenating PDF files does not result in a valid merged PDF
You describe your concept like this
the idea is that as each component PDF is built up, the component PDF's bytes are added to the master stream
This does not work, though, the PDF format does not allow merging PDFs by simply concatenating them. In particular the (active) objects in a PDF have an identifier number which must be unique in the PDF, concatenating would result in a file with non-unique object identifiers; PDFs contain cross reference structures which map each object identifier to its offset from the file start, concatenating would get all these offsets wrong for the added PDFs; furthermore, a PDF has to have a single root object from which the other objects are referenced directly or indirectly, concatenating would result in multiple root objects.
Writing and immediately overwriting
In your code you have
masterStream.WriteTo(masterMemoryStream);
// Read from master stream (ie. all existing components)
masterMemoryStream.Position = 0;
using (iText.Kernel.Pdf.PdfWriter masterPdfWriter = new iText.Kernel.Pdf.PdfWriter(masterMemoryStream))
Here you write the contents of masterStream to masterMemoryStream, then set the masterMemoryStream position to the start and instantiate a PdfWriter which starts writing there. I.e. your original copy of the masterStream contents get overwritten, surely not what you wanted.
Using MemoryStream.GetBuffer
MemoryStream.GetBuffer does not only return the data written into the MemoryStream by design but the whole buffer; i.e. there may be a lot of trash bytes after the actual PDF in what you retrieve here
updated = masterMemoryStream.GetBuffer();
This may cause PDF processors trying to process your result PDFs to be unable to open the file: PDFs have a pointer to the last cross references at their end, so if you have trash bytes following the actual end of your PDF, PDF processors may not find that pointer.
PS
As worked out in the comments, the code above works fine in case of constantly growing stream lengths (which usually will happen in the use case at hand) but in general one needs to restrict the stream size before writing the new content, e.g. like this:
...
masterStream.Position = 0;
masterStream.SetLength(temporaryBytes.Length); // <<<<
masterStream.Capacity = temporaryBytes.Length;
...

Reading PDF Bookmarks in VB.NET using iTextSharp

I am making a tool that scans PDF files and searches for text in PDF bookmarks and body text. I am using Visual Studio 2008 with VB.NET with iTextSharp.
How do I load bookmarks' list from an existing PDF file?
It depends on what you understand when you say "bookmarks".
You want the outlines (the entries that are visible in the bookmarks panel):
The CreateOnlineTree examples shows you how to use the SimpleBookmark class to create an XML file containing the complete outline tree (in PDF jargon, bookmarks are called outlines).
Java:
PdfReader reader = new PdfReader(src);
List<HashMap<String, Object>> list = SimpleBookmark.getBookmark(reader);
SimpleBookmark.exportToXML(list,
new FileOutputStream(dest), "ISO8859-1", true);
reader.close();
C#:
PdfReader reader = new PdfReader(pdfIn);
var list = SimpleBookmark.GetBookmark(reader);
using (MemoryStream ms = new MemoryStream()) {
SimpleBookmark.ExportToXML(list, ms, "ISO8859-1", true);
ms.Position = 0;
using (StreamReader sr = new StreamReader(ms)) {
return sr.ReadToEnd();
}
}
The list object can also be used to examine the different bookmark elements one by one programmatically (this is all explained in the official documentation).
You want the named destinations (specific places in the document you can link to by name):
Now suppose that you meant to say named destinations, then you need the SimpleNamedDestination class as shown in the LinkActions example:
Java:
PdfReader reader = new PdfReader(src);
HashMap<String,String> map = SimpleNamedDestination.getNamedDestination(reader, false);
SimpleNamedDestination.exportToXML(map, new FileOutputStream(dest),
"ISO8859-1", true);
reader.close();
C#:
PdfReader reader = new PdfReader(src);
Dictionary<string,string> map = SimpleNamedDestination
.GetNamedDestination(reader, false);
using (MemoryStream ms = new MemoryStream()) {
SimpleNamedDestination.ExportToXML(map, ms, "ISO8859-1", true);
ms.Position = 0;
using (StreamReader sr = new StreamReader(ms)) {
return sr.ReadToEnd();
}
}
The map object can also be used to examine the different named destinations one by one programmatically. Note the Boolean parameter that is used when retrieving the named destinations. Named destinations can be stored using a PDF name object as name, or using a PDF string object. The Boolean parameter indicates whether you want the former (true = stored as PDF name objects) or the latter (false = stored as PDF string objects) type of named destinations.
Named destinations are predefined targets in a PDF file that can be found through their name. Although the official name is named destinations, some people refer to them as bookmarks too (but when we say bookmarks in the context of PDF, we usually want to refer to outlines).
If someone is still searching the vb.net solution, trying to simplify, I have a large amount of pdf created with reportbuilder and with documentmap I automatically add a bookmarks "Title". So with iTextSharp I read the pdf and extract just the first bookmark value:
Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
Dim list As Object
list = SimpleBookmark.GetBookmark(oReader)
Dim string_book As String
string_book = list(0).item("Title")
It is a little help very simple for someone searching a start point to understand how it works.

Issues with iTextsharp and pdf manipulation

I am getting a pdf-document (no password) which is generated from a third party software with javascript and a few editable fields in it. If I load this pdf-document with the pdfReader class the NumberOfPagesProperty is always 1 although the pdf-document has 17 pages. Oddly enough the document has 17 pages if I save the stream afterwards. When I now try to open the document the Acrobat Reader shows an extended feature warning and the fields are not fillable anymore (I haven't flattened the document). Do anyone know about such a problem?
Background Info:
My job is to remove the javascript code, fill out some fields and save the document afterwards.
I am using the iTextsharp version 5.5.3.0.
Unfortunately I can't upload a sample file because there are some confidental data in it.
private byte[] GetDocumentData(string documentName)
{
var document = String.Format("{0}{1}\\{2}.pdf", _component.OutputDirectory, _component.OutputFileName.Replace(".xml", ".pdf"), documentName);
if (File.Exists(document))
{
PdfReader.unethicalreading = true;
using (var originalData = new MemoryStream(File.ReadAllBytes(document)))
{
using (var updatedData = new MemoryStream())
{
var pdfTool = new PdfInserter(originalData, updatedData) {FormFlattening = false};
pdfTool.RemoveJavascript();
pdfTool.Save();
return updatedData.ToArray();
}
}
}
return null;
}
//Old version that wasn't working
public PdfInserter(Stream pdfInputStream, Stream pdfOutputStream)
{
_pdfInputStream = pdfInputStream;
_pdfOutputStream = pdfOutputStream;
_pdfReader = new PdfReader(_pdfInputStream);
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream);
}
//Solution
public PdfInserter(Stream pdfInputStream, Stream pdfOutputStream, char pdfVersion = '\0', bool append = true)
{
_pdfInputStream = pdfInputStream;
_pdfOutputStream = pdfOutputStream;
_pdfReader = new PdfReader(_pdfInputStream);
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream, pdfVersion, append);
}
public void RemoveJavascript()
{
for (int i = 0; i <= _pdfReader.XrefSize; i++)
{
PdfDictionary dictionary = _pdfReader.GetPdfObject(i) as PdfDictionary;
if (dictionary != null)
{
dictionary.Remove(PdfName.AA);
dictionary.Remove(PdfName.JS);
dictionary.Remove(PdfName.JAVASCRIPT);
}
}
}
The extended feature warning is a hint that the original PDF had been signed using a usage rights signature to "Reader-enable" it, i.e. to tell the Adobe Reader to activate some additional features when opening it, and the OP's operation on it has invalidated the signature.
Indeed, he operated using
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream);
which creates a PdfStamper which completely re-generates the document. To not invalidate the signature, though, one has to use append mode as in the OP's fixed code (for char pdfVersion = '\0', bool append = true):
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream, pdfVersion, append);
If I load this pdf-document with the pdfReader class the NumberOfPagesProperty is always 1 although the pdf-document has 17 pages. Oddly enough the document has 17 pages
Quite likely it is a PDF with a XFA form, i.e. the PDF is only a carrier of some XFA data from which Adobe Reader builds those 17 pages. The actual PDF in that case usually only contains one page saying something like "if you see this, your viewer does not support XFA."
For a final verdict, though, one has to inspect the PDF.

Existing PDF to PDF/A "conversion"

I am trying to make an existing pdf into pdf/a-1b. I understand that itext cannot convert a pdf to pdf/a in the sense making it pdf/a compliant. But it definitely can flag the document as pdf/a. However, I looked at various examples and I cannot seem to figure out how to do it. The major problem is that
writer.PDFXConformance = PdfWriter.PDFA1B;
does not work anymore. First PDFA1B is not recognized, second, pdfwriter seems to have been rewritten and there is not much information about that.
It seems the only (in itext java version) way is:
PdfAWriter writer = PdfAWriter.getInstance(document, new FileOutputStream(filename), PdfAConformanceLevel.PDF_A_1B);
But that requires a document type, ie. it can be used when creating a pdf from scratch.
Can someone give an example of pdf to pdf/a conversion with the current version of itextsharp?
Thank you.
I can't imagine a valid reason for doing this but apparently you have one.
The conformance settings in iText are intended to be used with a PdfWriter and that object is (generally) only intended to be used with new documents. Since iText was never intended to convert documents to conformance that's just the way it was built.
To do what you want to do you could either just open the original document and update the appropriate tags in the document's dictionary or you could create a new document with the appropriate entries set and then import your old document. The below code shows the latter route, it first creates a regular non-conforming PDF and then creates a second document that says it is conforming even though it may or may not. See the code comments for more details. This targets iTextSharp 5.4.2.0.
//Folder that we're working from
var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
//Create a regular non-conformant PDF, nothing special below
var RegularPdf = Path.Combine(workingFolder, "File1.pdf");
using (var fs = new FileStream(RegularPdf, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
doc.Add(new Paragraph("Hello world!"));
doc.Close();
}
}
}
//Create our conformant document from the above file
var ConformantPdf = Path.Combine(workingFolder, "File2.pdf");
using (var fs = new FileStream(ConformantPdf, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
//Use PdfSmartCopy to get every page
using (var copy = new PdfSmartCopy(doc, fs)) {
//Set our conformance levels
copy.SetPdfVersion(PdfWriter.PDF_VERSION_1_3);
copy.PDFXConformance = PdfWriter.PDFX1A2001;
//Open our new document for writing
doc.Open();
//Bring in every page from the old PDF
using (var r = new PdfReader(RegularPdf)) {
for (var i = 1; i <= r.NumberOfPages; i++) {
copy.AddPage(copy.GetImportedPage(r, i));
}
}
//Close up
doc.Close();
}
}
}
Just to be 100% clear, this WILL NOT MAKE A CONFORMANT PDF, just a document that says it conforms.