Merging Documents with Open XML

Merging Documents with Open XML - vb.net

I am looking for mail merge alternatives in my vb.net app. I have used the mail merge feature of word, and find that it is quite buggy when dealing with a large volume of documents. I am looking at alternate methods of generating the merge, and have come across open xml. I think this will probably be the answer I am looking for. I have come to understand that the merge will be entirely code-driven in vb.net. I have started playing around with the following code:
Dim wordprocessingDocument As WordprocessingDocument = wordprocessingDocument.Open("C:\Users\JasonB\Documents\test.docx", True)
'for each simplefield (mergefield)
For Each field In wordprocessingDocument.MainDocumentPart.Document.Body.Descendants(Of SimpleField)()
'get the document instruction values
Dim instruction As String() = field.Instruction.Value.Split(splitChar, StringSplitOptions.RemoveEmptyEntries)
'if mergefield
If instruction(0).ToLower.Equals("mergefield") Then
Dim fieldname As String = instruction(1)
For Each fieldtext In field.Descendants(Of Text)()
fieldtext.Text = "I AM TESTING"
Next
End If
wordprocessingDocument.MainDocumentPart.Document.Save()
wordprocessingDocument.Dispose()
Now this works great and all, but I am realizing that I need to create as many documents as I will have datarows (assuming I use a datatable to handle the data).
One suggestion I found was to loop through each datarow, take my document template, save it to a folder and insert the datarow data. This could mean however that I end up with 12,000 documents in a single folder that need to be joined later and converted to pdf.
Is there another option? The other thing that stood out to me is to create a new word document, and duplicate over the xml from the template, and then replace the values. I dont know however if there is a "simpler" way of doing this, thanks.

If you don't want to save all 12,000 documents to file you should be able to process, convert and email them one at a time using temporary files.
Converting the DOCX to PDF in .NET might be an issue but looks like it's possible using Word Automation (Saving Word DOCX files as PDF).
The bottom line is you don't need to generate all documents before emailing them if you perform the process one document at a time. You can use SmtpClient in VB.NET to email the PDF after it is generated.
In terms of creating the document I have seen reports generated where a simple string replace is used to replace a string such as '%FIRSTNAME%' with the person's name and so on. This isn't necessarily the best approach but can work quite well. This way you can create your template in Word or OpenOffice and then edit it in .NET using OpenXML.

Related

iText7 Merge of 2 PDF MemorStreams Not Working

I am upgrading some older iTextSharp code to the new iText 7 libraries. I am having a lot of trouble determining the proper way to merge 2 PDF MemoryStreams into one PDF MemoryStream that contains all the pages from both source PDF MemoryStreams. It seems simple and I think the code below is set up properly but the resulting PDF memory stream only contains the first file. The second PDF file is never present and never concatenated to the first.
I have found multiple ways documented on the Internet as the "proper" way to do the merge. The actual sample code with iText 7 seems to be unusually complex (in that is mixes multiple concepts into one sample repeatedly - as in doesn't reduce the concept to the simplest possible code), and seems to fail to demonstrate simple concepts. For instance, their PDFMerge documentation has no sample code at all in the documentation (nor does anything else I looked at in the class documentation). The examples they have online actually always mix merging from files (not MemoryStreams) with other concepts like adding page numbers or adding Table of Contents. So they never just show one concept and they never start with anything other than files. My PDFs are coming out of a database and we just need to merge them into one PDF memory stream and save it back out. My concern is that maybe I am not creating the MemoryStream properly when I initialize the PDFWriter. As none of their samples ever do anything but initial with an actual file, I was unable to confirm this was done properly. I also fully qualified all objects in the code because I want to leave the old iTextSharp code in place while I am upgrading to the new iText 7. This was done to make sure an iTextSharp object of the same name wasn't inadvertently being unknowingly used.
Also, in the interest of making the source as easy as possible to read I removed some of the declarations and initialization of objects being used. Everything was traced through and all values are fully loaded with proper values as you trace through the code. The only problem is that the second PDFMerge doesn't seem to do anything. I am assuming the problem is that I didn't prepare the PDF objects properly or that I have to do something special with the PDFWriter on the Destination PDF Document (p_pdfDocument) before the second PDF is written out with the PDFMerge object.
Dim p_bResult As Boolean = False
Dim p_bArray As Byte() = Nothing
Dim p_memStream As New System.IO.MemoryStream
Dim p_pdfWriter As New iText.Kernel.Pdf.PdfWriter(p_memStream)
Dim p_pdfDocument As New iText.Kernel.Pdf.PdfDocument(p_pdfWriter)
Dim p_pdf1Stream As New System.IO.MemoryStream(CType(p_cImage1.ImageFile, Byte()))
Dim p_pdf2Stream As New System.IO.MemoryStream(CType(p_cImage2.ImageFile, Byte()))
Dim p_pdf1Reader As New iText.Kernel.Pdf.PdfReader(p_pdf1Stream)
Dim p_pdf2Reader As New iText.Kernel.Pdf.PdfReader(p_pdf2Stream)
Dim p_pdf1Document As New iText.Kernel.Pdf.PdfDocument(p_pdf1Reader)
Dim p_pdf2Document As New iText.Kernel.Pdf.PdfDocument(p_pdf2Reader)
Dim p_pdfMerger As New iText.Kernel.Utils.PdfMerger(p_pdfDocument)
p_pdfMerger.Merge(p_pdf1Document, 1, p_pdf1Document.GetNumberOfPages())
p_pdfMerger.Merge(p_pdf2Document, 1, p_pdf2Document.GetNumberOfPages())
'Problem is here... the array only has the first PDF in it
'The second p_pdfMerger.Merge didn't seem to do anything
p_bArray = p_memStream.ToArray
p_pdf1Document.Close()
p_pdf2Document.Close()
p_pdfDocument.Close()
I expected the 2 source PDF MemoryStreams to be present in the destination MemoryStream but it only contains the first PDF in it.
Edit:
I changed the ending to...
p_pdfMerger.Merge(p_pdf1Document, 1, p_pdf1Document.GetNumberOfPages())
p_pdfMerger.Merge(p_pdf2Document, 1, p_pdf2Document.GetNumberOfPages())
p_cImage1.PageCount = p_pdfDocument.GetNumberOfPages()
p_pdfDocument.Close()
p_bArray = p_memStream.ToArray
p_pdf1Document.Close()
p_pdf2Document.Close()
Thing is that the p_pdfDocument.GetNumberOfPages() is correct but bytes are still just first PDF document when saved to database and viewed.

I tested your use case, condensing your code a bit, reading the input memory streams from files, and writing the output memory stream to a file as I don't have your database environment:
Using MemoryStream As New MemoryStream,
Pdf1MemoryStream As New MemoryStream(File.ReadAllBytes(MY_FIRST_PDF_FILE)),
Pdf2MemoryStream As New MemoryStream(File.ReadAllBytes(MY_SECOND_PDF_FILE))
Using PdfDocument As New PdfDocument(New PdfWriter(MemoryStream)),
Pdf1 As New PdfDocument(New PdfReader(Pdf1MemoryStream)),
Pdf2 As New PdfDocument(New PdfReader(Pdf2MemoryStream))
Dim Merger As New PdfMerger(PdfDocument)
Merger.Merge(Pdf1, 1, Pdf1.GetNumberOfPages)
Merger.Merge(Pdf2, 1, Pdf2.GetNumberOfPages)
End Using
Dim PdfBytes As Byte() = MemoryStream.ToArray()
Using FileStream As Stream = File.Create("TwoPdfsMergedInMemoryStream.pdf")
FileStream.Write(PdfBytes, 0, PdfBytes.Length)
End Using
End Using
As result I got the contents of both source files in TwoPdfsMergedInMemoryStream.pdf as it should be. Concerning your observation
Thing is that the p_pdfDocument.GetNumberOfPages() is correct but bytes are still just first PDF document when saved to database and viewed.
therefore, I would assume that p_bArray does contain a PDF with the contents of both your source PDFs but that there is an issue in saving to database or viewing.
To test this you could save the contents of the byte array to a file somewhere like I do above; then you can check what really is in the array.

Splitting MS Publisher 2010 document into multiple files

I want to split a multi-page MS Publisher 2010 document into a set of separate documents, one per page.
The starting document is from a mail-merge, and I am trying to produce a set of numbered and named tickets as PDFs to send to people for an event (this is for a charity). The mail-merge seems to work fine and I can save the merged document and it looks OK with e.g. a list of fifty people giving me a 50-page document.
Ideally the result would be a set of PDFs.
I have tried to create some simple VBA code to do this, but it is not working consistently. If I try this very simple macro below , I get the correct number of documents, but only perhaps 1 or 2 documents with the correct contents out of every five. Most of the documents are completely empty.
Sub splitter()
Dim i As Integer
Dim Source As Document
Dim Target As Document
Set Source = ActiveDocument
For i = 1 To Source.Pages.Count
Set Target = Documents.Add
Source.Pages(i).Shapes.Range.Copy
Target.Pages(1).Shapes.Paste
Target.SaveAs Filename:="C:\Temp\Ticket_" & i
Target.Close
Set Target = Nothing
Next i
End Sub
I did sometimes get an error that the clipboard is busy, but not always.
Another approach might be to start with the master document and do this looping over the separate documents and fill in the personal details for each person's ticket and directly produce the PDFs. But that seems more complex, and I am not a VB programmer (but been doing C++ etc for 20+ years, so I can program :-) )
A final annoyance is that it seems to keep opening a new Publisher window for each document. It takes a while to then close 50+ copies of publisher, and the laptop starts to crawl...
Please advise how best to get round these issues. I am probably missing something trivial, being a relative VB(A) newbie.
Thanks in advance for any suggestions

Try coding something like this:
Open Publisher application (CreateObject()?)
Open Publisher document (doc.Open(filename))
Store the total amount of pages in a global variable (doc.Pages.Count)
Close document (doc.Close())
Loop the following for each page
Copy the pub file and rename it to name & "page" & X
Open the new pub file
Remove all Pages except page X from the pub file
doc.Save()
doc.Close()
Copying files with VBA is easy, but copying pages in Publisher VBA is quite a hassle, so this should be easier to achieve

Batch add a macro to word documents?

I have several hundred .doc word documents to which I need to add a macro which runs when the .doc file is opened and creates a header for said document based on the file name. Is there a way to do this as a batch? I have been individually opening each document and going into visual basic --> Project --> This Document then inserting a .txt file which contains the code. Is there a fast way to do this for multiple documents?

As a learning exercise, put this into the "ThisDocument" part of Normal (the Normal.dot template) in the VBE
Open a word document and watch what happens.
I don't think you need to put your code in every single file, I think you should be OK with using the Document_Open event in Normal.dot.
Just make sure it shows up as a reference in your word documents that you open but I don't see why it wouldn't
If you absolutely need it in every file then it can be done but the problem is if you make one small change to the code, you have to go through all this again. The idea with code is to write it once, use it many times.

You can write VBA code that alters the VBA code in other documents, but you need to "Trust access to the VBA project object model" in the Trust Centre options. This could open you up to viral code if you download Word documents with malicious VBA code in them. What you want to do, essentially, is write a VBA virus. There are legitimate reasons for doing this, and also malicious ones, I leave the ethics of the uses of these techniques up to the user. Knowledge itself is not malicious.
Here's the meat, you will need to write your own code to loop through the documents and possibly save them as .docm files.
Sub ReplaceCode()
Set oDoc = ActiveDocument
Set oComponents = oDoc.VBProject.VBComponents
For i = oComponents.Count To 1 Step -1
If oComponents(i).Type = 100 And oComponents(i).Name = "ThisDocument" Then
With oComponents(i).CodeModule
.DeleteLines 1, .CountOfLines
.AddFromFile "C:\ThisDocument.cls"
End With
End If
Next i
End Sub
Also, if you create your code file by exporting from VBA, you will need to remove this from the top of the .cls file:
VERSION 1.0 CLASS
BEGIN
MultiUse = -1 'True
END
Personally, I would drive this from Excel, maybe using a worksheet to hold a list of the files or locations to update, and another sheet for the code to populate with a list of files updated.

Combine several Excel files into one using VB

I have a folder with several excel files that have a date field, i.e. 08-24-2010-123320564.xls. I want to be able to have some VB scripting that will simply take the files that start with todays date and merge them into one file.
08-24-2010-123320564.xls
08-24-2010-123440735.xls
08-24-2010-131450342.xls
into
08-24-2010.xls
Can someone please help?
Thanks
GabrielVA

assuming you just want to append rows in simple spreadsheets, follow this logic:
Psuedocode
use an excel macro
(you could just as well automate excel from vb but why vbscript alone since you need excel anyway?)
have it to a dir listing (dir function)
dim a date_start variable init to "new"
dim a merged_spreadsheet as new doc default to nothing
loop thru result of dir
if date_start <> start of filename
if merged_spreadsheet is not nothing
save it
set it to nothing
store start of date (left mid function) in date_start
if merged_spreadsheet is nothing
make a new one
open the file from the dir command's loop
select all the data
copy it
go to first empty row in merged_spreadsheet
paste it
loop files
if merged_spreadsheet is not nothing
save it
If you're not happy with all those 'nothings', you can set a separate flag to keep track of whether you have a merged_spreadsheet or not. Think about what happens for just one file in a date, no files at all, etc.
Of course you will tear out your hair finding out how to automate those excel functions. The secret to turning 'hard' into 'pretty darn easy' is this:
macro recorder will reveal the automation commands
They are not intuitive. So record a macro. Then do things that you'll need to do in your code. Stop recording and look at the result.
For example:
* load/save files
* select only entered fields
* Select all
* copy / switch files / paste
* create new sheet
* click in various single cells and type (how to examine/set a cell's contents)
In summary-
(1) Know exactly the steps to what you're doing
(2) Use macro recorder to give away the secrets of the excel object model. Steal its secrets.
Really this won't be all that hard if you marry these two concepts cleverly. Since the macro will be vbscript (at least if you use office 97 ;-) you can probably run it from vbs or vb6 if you want. A hop to vb.net shouldn't be that hard either.

Aspose makes it pretty easy to work with excel-files in .NET http://www.aspose.com

Combine several RTF texts into one RTF file using VBA

I'm extracting 'task notes' from MS Project using VBA and want to create a MS Word .DOC file and also copy those texts into EXCEL.
If you use the Notes property of the Task objects you only get 255 characters and formatting will not not be retained.
In order to keep formatting you can convert the .MPP file into .MPD and extract the notes. These notes have been stored using rtf (see PJDB.HTM look for 'sub getRtf').
This way I can extract all notes and write them into a .rtf file.
If I open that file (containing multiple notes [i verified]) using MS Word I ONLY see the first note (and it has been formatted well).
Info I gathered from other sites learns only ONE rtf text in a file will be handled and it is NOT trivial to join several rtf texts.
So my question is:
does anyone know how to combine several rtf lines into ONE rtf text.
I prefer answers using VBA.
Of course, if anyone knows how to extract notes from MS Project and create a .DOC file preserving formats it's ok as well

This is probably not the right answer but you could do this in VBA :
For each RTF files, open and save to the clipboard as rich text (via API), and paste in word.
It's ugly but it works.

I had a similar task to read RTF from a database and create a report for all records preserving the RTF formatting in a Word document. The code gets the RTF from the DB, writes it to a file with .rtf extension, then inserts the file into a table cell. Not exactly what you are doing I guess, but the report does display all the formatted text from N records.
So the "files" are not really "combined."
The Word table did paste into Excel manually and the formatting was preserved. I don't know the equivalent of InsertFile in Excel.
RS.Open SQL, con
Do Until RS.EOF
Set ts = fso.CreateTextFile("c:\temp\temp.rtf", True)
ts.Write RS(0)
ts.Close
If iRow > 1 Then tbl.Rows.Add
tbl.Cell(iRow, 1).Range.InsertFile "C:\temp\temp.rtf", , False
.....
iRow = iRow + 1
RS.MoveNext
Loop
A reference to Microsoft Scripting Runtime is needed and this:
Dim ts As TextStream
Dim fso As New FileSystemObject

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas