Fastest way to split Word document to separate files - vba

I'm looking for the most efficient way to split one huge file into small files. Each small file is one paragraph from the big one.
It's not a problem if the big file has ~100 paragraphs, but if it has over 12k, it takes too long.
Currently I set a bookmark for each paragraph and then insert each bookmark into a new file (I use bookmarks because sometimes I have to insert more than one paragraph, but I don't want to complicate the example, so I'm describing my problem using paragraphs).
This is my code (of course it's a simple example without extra logic and error handling).
Creating, saving, and closing each new file takes the most time.
Private Sub InsertBookmarks()
    Dim p As Paragraph
    Dim counter As Long
    For Each p In ActiveDocument.Paragraphs
        counter = counter + 1
        ActiveDocument.Bookmarks.Add "File" & Format(counter, "00000#"), p.Range
    Next p
    ActiveDocument.Save
    Set p = Nothing
End Sub
Private Sub SplitToSeparateFiles()
    Dim path As String
    Dim doc As Document
    Dim b As Bookmark
    path = ActiveDocument.path & "\"
    WordBasic.DisableAutoMacros
    For Each b In ActiveDocument.Bookmarks
        Set doc = Documents.Add(Visible:=False)
        doc.Range.FormattedText = b.Range
        doc.SaveAs2 FileName:=path & b.Name
        doc.Close wdDoNotSaveChanges
    Next b
    Set b = Nothing
    Set doc = Nothing
End Sub
I considered changing my code to handle the splitting using WordOpenXml behind the scenes, but I didn't find any solution.
I could use a VSTO add-in if someone has an idea for the .NET environment.
Any ideas for a more efficient way?

Here's an excerpt from a C# program I use that reads a Word document with the FreeSpire.Doc NuGet package. I know your question was about VBA, but you mentioned .NET at the end, so I figure you're not averse to creating something in C# or VB (Visual Studio should be free for small-time use).
using (Document document = new Document())
{
    document.LoadFromFileInReadMode(@"C:\temp\word.docx", FileFormat.Docx);
    int pCount = 0; // declared outside the section loop so file names never repeat
    foreach (Section s in document.Sections)
    {
        foreach (Paragraph p in s.Paragraphs)
        {
            File.WriteAllText(@"c:\temp\p" + pCount + ".txt", p.Text);
            pCount++;
        }
    }
}
I don't expect it will take several hours to write 12,000 files, but I don't have a Word document with 12,000 paragraphs to test it with. Could you let me know your results?
Edit:
The following program created 12000 files in 41 seconds on an SSD equipped Core i7:
using System;
using System.IO;
namespace ConsoleApp4
{
    class Program
    {
        static void Main()
        {
            for (int i = 0; i < 12000; i++)
            {
                File.WriteAllText(@"c:\temp\x\" + i + ".txt", Guid.NewGuid().ToString());
            }
        }
    }
}
Using it as a benchmark, and assuming that it could take up to a minute to load a massive document, I would be hopeful that a .NET app doing the splitting would take a few minutes on a Word document with tens of thousands of paragraphs.
Edit2:
Creating word files. The real process might be like this, if you're reading every paragraph out of a source and making a new doc with that same paragraph (try assigning the old paragraph to the new document):
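using (Document document = new Document())
{
    document.LoadFromFileInReadMode(@"C:\temp\word.docx", FileFormat.Docx);
    int pCount = 0; // declared outside the section loop so file names never repeat
    foreach (Section s in document.Sections)
    {
        foreach (Paragraph p in s.Paragraphs)
        {
            // Distinct names: reusing "document" and "s" here would not compile
            Document newDocument = new Document();
            Section newSection = newDocument.AddSection();
            // Untested: the library may require a copy of p rather than p itself
            newSection.Paragraphs.Add(p);
            newDocument.SaveToFile(@"c:\temp\x\" + pCount + ".docx", FileFormat.Docx);
            pCount++;
        }
    }
}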
I created 1200 word documents in 15 seconds:
static void Main()
{
    for (int i = 0; i < 1200; i++)
    {
        Document document = new Document();
        Section s = document.AddSection();
        Paragraph p = s.AddParagraph();
        TextRange textRange1 = p.AppendText(Guid.NewGuid().ToString());
        textRange1.CharacterFormat.TextColor = Color.Blue;
        textRange1.CharacterFormat.FontSize = 15;
        textRange1.CharacterFormat.Bold = true;
        TextRange textRange2 = p.AppendText(Guid.NewGuid().ToString());
        textRange2.CharacterFormat.TextColor = Color.Black;
        textRange2.CharacterFormat.FontSize = 10;
        TextRange textRange3 = p.AppendText(Guid.NewGuid().ToString());
        textRange3.CharacterFormat.TextColor = Color.Red;
        textRange3.CharacterFormat.FontSize = 8;
        textRange3.CharacterFormat.Italic = true;
        document.SaveToFile(@"c:\temp\x\" + i + ".docx", FileFormat.Docx);
        Console.Out.Write("\r" + i);
    }
}
I did note that there was a huge amount of garbage collection going on. Reducing that would perhaps speed things up a bit, if you can work out how.

Related

Create multiple PDF files from one input PDF file, split using bookmarks

I've been working on a VB.NET project to dynamically create report packs in PDF format using a SQL database and a number of input PDF templates. To cut a long story short, due to the way that Business Objects creates the input files it will be much more efficient to allow input of compiled PDF reports rather than individual report template pages. In order for this to work however, we would need to split the input PDF files into sections using the Bookmarks created by BOBJ. We are not sure how many pages will be in the range of each bookmark but require a consistent naming convention of the split files so that the next part of the process can pick the correct templates up and merge them in the required combinations.
The second part of this process is designed and working well using a .Net library called PDFSHARP. I have used the samples on their website to write some code which splits an input PDF file into one section per page of the input file, but do not understand how to split it using the bookmarks.
If I could understand how to parse the PDF and read in the meta data for the bookmarks which contain the start page and end page and the name of the bookmark then I think I could finish it.
An example of the input PDF format is here:
https://drive.google.com/open?id=0B0GZGW6CFCI-UWY2WGRvV0dQSWZSNnNOWlp4R21zbFVPZDBn
There are 5 bookmarks (TID01, TID02 ...) and 6 pages. Section TID04 would have two pages output.
The file names I would need would be in the format of "ExamplePDF_TID01.pdf"
Any help to move forward would be greatly appreciated. - Looking on the wiki for the project it seems that it isn't very active any more and whilst other people have asked questions about this in the past there aren't any answers that I can find.
Code to Split by Page:
Sub Splitfiles()
    Dim inputdir As String = "O:\Transformation\Standardisation\Input PDFs"
    Dim outputdir As String = "O:\Transformation\Standardisation\Input PDFs\output\"
    'inputdir = folder path containing input files
    Dim fileEntries As String() = Directory.GetFiles(inputdir)
    Dim filename As String
    Dim pdfpage As PdfPage
    Dim ccid As String
    Dim pageid As Integer
    Dim outputfilename As String
    For Each filename In fileEntries
        Dim importdoc As PdfDocument = PdfReader.Open(filename, PdfSharp.Pdf.IO.PdfDocumentOpenMode.Import)
        Dim count As Integer = importdoc.PageCount
        Dim x = 0
        Do Until x = count
            Dim outputdoc As PdfDocument = New PdfDocument
            pdfpage = importdoc.Pages(x)
            outputdoc.AddPage(pdfpage)
            ccid = Strings.Right(filename, Len(filename) - Len(inputdir)) 'expand this to find CC ID
            ccid = Strings.Left(ccid, Len(ccid) - 4)
            pageid = x
            outputfilename = outputdir & ccid & "_" & pageid & ".pdf"
            outputdoc.Save(outputfilename)
            x = x + 1
        Loop
    Next
End Sub
And the code I started to split by bookmark but couldn't finish:
Sub SplitPDFByBookmark()
    Dim inputfile As String = "O:\Transformation\Standardisation\Input PDFs\Business Sub Area Report - Project Management - FY16_FP02 - 17062016_0709.PDF"
    Dim outputdir As String = "O:\Transformation\Standardisation\Input PDFs\output\"
    'inputdir = folder path containing input files
    'Dim fileEntries As String() = Directory.GetFiles(inputdir)
    Dim filename As String
    Dim pdfpage As PdfPage
    Dim ccid As String
    Dim pageid As Integer
    Dim outputfilename As String
    filename = inputfile
    'For Each filename In fileEntries
    Dim importdoc As PdfDocument = PdfReader.Open(filename, PdfSharp.Pdf.IO.PdfDocumentOpenMode.Import)
    Dim count As Integer = importdoc.PageCount
    Dim x = 0
    For Each bookmark In importdoc.Outlines
        Dim outputdoc As PdfDocument = New PdfDocument
        pdfpage = importdoc.Pages(importdoc.Outlines.)
        outputdoc.AddPage(pdfpage)
        pageid = x
        outputfilename = outputdir & "OutputFile_" & pageid & ".pdf"
        outputdoc.Save(outputfilename)
        x = x + 1
    Next
    'Next
End Sub
Thanks in advance for your help!
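Once the bookmark names and start pages have been read from the PDF (the part the question is stuck on, and not shown here), turning them into page ranges is plain arithmetic: each section ends on the page before the next bookmark starts, and the last section runs to the end of the file. Below is a minimal, language-neutral sketch in Python. The function name, the 0-based page numbering, and the start pages in the example are all assumptions chosen to match the "5 bookmarks, 6 pages, TID04 has two pages" description, not values read from the real file.

```python
def bookmark_page_ranges(bookmarks, page_count):
    """Map each bookmark name to an inclusive (first_page, last_page) range.

    bookmarks  -- list of (name, start_page) tuples, start_page is 0-based
    page_count -- total number of pages in the input document
    """
    ordered = sorted(bookmarks, key=lambda item: item[1])
    ranges = {}
    for index, (name, start) in enumerate(ordered):
        if index + 1 < len(ordered):
            # A section ends on the page before the next bookmark starts.
            end = ordered[index + 1][1] - 1
        else:
            # The last section runs to the end of the document.
            end = page_count - 1
        ranges[name] = (start, end)
    return ranges


# Hypothetical start pages for the example in the question.
example = [("TID01", 0), ("TID02", 1), ("TID03", 2), ("TID04", 3), ("TID05", 5)]
ranges = bookmark_page_ranges(example, page_count=6)
```

With these assumed inputs, TID04 covers pages 3 and 4, and an output name such as "ExamplePDF_TID04.pdf" can be built from the dictionary key. Two bookmarks pointing at the same page would produce an empty range, which would need handling separately.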

TagLib# using "/" as separator in Performers tag?

The code I'm using...
For Each file As String In My.Computer.FileSystem.GetFiles(directory)
    Dim fi As FileInfo = New FileInfo(file)
    If isNotMusic(fi.Extension.ToString) = True Then Continue For 'Checks file extension for non-music files; if test is true for-loop continues with next file
    trackCounter += 1 'Adds 1 to trackCounter
    Dim song As New musicInfo
    Dim tagFile As TagLib.File = TagLib.File.Create(fi.FullName)
    infoArtist = tagFile.Tag.Performers(0)
    With song
        .track = tagFile.Tag.Track
        .title = tagFile.Tag.Title
        .artist = tagFile.Tag.Performers(0)
        .album = tagFile.Tag.Album
        .extension = fi.Extension.ToString
    End With
    songs.Add(song)
Next
When I use this code on a folder filled with AC/DC songs, tagFile.Tag.Performers(0) returns "AC".
I looked up this problem elsewhere and from what I could see, only other tagging solutions such as MpTagThat and MP1 have addressed this problem and made a patch.
I'm aware that the Performers tag is an array and the other half "DC" is likely stored in tagFile.Tag.Performers(1). However, I will eventually be separating each artist with a ";" in my code and if I left everything as is, AC/DC would be returned as "AC;DC".
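One possible workaround, sketched here in Python rather than VB.NET, is to rejoin neighbouring entries of the performers array when the rejoined name appears on a list of artists known to contain a slash. This is only an illustration of the idea: the allow-list and the function name are hypothetical, and nothing here is part of TagLib#.

```python
# Hypothetical allow-list of artist names that legitimately contain "/".
KNOWN_SLASH_ARTISTS = {"AC/DC"}


def merge_split_performers(performers, known=KNOWN_SLASH_ARTISTS):
    """Rejoin adjacent entries that were split on "/" but form a known name."""
    merged = []
    i = 0
    while i < len(performers):
        candidate = None
        if i + 1 < len(performers):
            candidate = performers[i] + "/" + performers[i + 1]
        if candidate in known:
            merged.append(candidate)
            i += 2  # both halves were consumed
        else:
            merged.append(performers[i])
            i += 1
    return merged


artists = "; ".join(merge_split_performers(["AC", "DC"]))
```

Entries that do not form a known name are left alone, so a genuine multi-artist tag is still joined with ";" as intended.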

Progress bar with VB.NET Console Application

I've written a parsing utility as a Console Application and have it working pretty smoothly. The utility reads delimited files and, based on a user value passed as a command line argument, writes each record to one of 2 files (good records or bad records).
I'm looking to add a progress bar or status indicator to show work performed or remaining while parsing. I could easily write a <.> across the screen within the loop, but would like to give a %.
Thanks!
Here is an example of how to calculate the percentage complete and output it in a progress counter:
Option Strict On
Option Explicit On
Imports System.IO
Module Module1
    Sub Main()
        Dim filePath As String = "C:\StackOverflow\tabSeperatedFile.txt"
        Dim FileContents As String()
        Console.WriteLine("Reading file contents")
        Using fleStream As StreamReader = New StreamReader(IO.File.Open(filePath, FileMode.Open, FileAccess.Read))
            FileContents = fleStream.ReadToEnd.Split(CChar(vbTab))
        End Using
        Console.WriteLine("Sorting Entries")
        Dim TotalWork As Decimal = CDec(FileContents.Count)
        Dim currentLine As Decimal = 0D
        For Each entry As String In FileContents
            'Do something with the file contents
            currentLine += 1D
            Dim progress = CDec((currentLine / TotalWork) * 100)
            Console.SetCursorPosition(0I, Console.CursorTop)
            Console.Write(progress.ToString("00.00") & " %")
        Next
        Console.WriteLine()
        Console.WriteLine("Finished.")
        Console.ReadLine()
    End Sub
End Module
First you have to know how many lines to expect.
In your loop, calculate "currentLine * 100 / totalLines":
int totalLines = GetTotalLines(); // placeholder for however you count the lines
int currentLine = 0;
foreach (var line in lines)
{
    /// YOUR OPERATION
    currentLine++;
    // Multiply before dividing so integer division does not truncate to 0
    int progress = currentLine * 100 / totalLines;
    /// print out the result with the suggested method...
    /// Caution: if there are many updates, consider updating the output only when the value has changed, or only every n-th iteration using the modulo operator or any other useful approach ;)
}
Then print the result at the same position in your loop by using the SetCursorPosition method
(see MSDN: Console.SetCursorPosition).
VB.NET:
Dim totalLines As Integer = 0 ' replace with the real line count (must be greater than 0)
Dim currentLine As Integer = 0
For Each line As String In Lines
    ' Your operation
    currentLine += 1
    Dim progress As Integer = CInt(currentLine * 100 / totalLines)
    ' print out the result with the suggested method...
    ' Caution: if there are many updates, consider updating the output only when the value has changed, or only every n-th iteration using the Mod operator or any other useful approach
Next
Well, the easiest way is to update the progress bar variable often.
For example, if your code consists of around 100 lines, or maybe 100 pieces of functionality,
update the progress bar variable with the percentage after each function or block of lines :)
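The two points made in the answers above (compute the percentage as current * 100 / total, and only redraw when the displayed value changes) can be sketched together. This is a language-neutral illustration in Python, not a translation of anyone's code; the function names are made up for the example, and a carriage return stands in for Console.SetCursorPosition.

```python
import sys


def percent_done(current, total):
    """Integer percentage; multiply before dividing to avoid truncating to 0."""
    if total <= 0:
        return 100  # nothing to do counts as finished
    return current * 100 // total


def process_with_progress(items, out=sys.stdout):
    """Process items and rewrite a single progress line only when it changes."""
    total = len(items)
    last_shown = -1
    for current, item in enumerate(items, start=1):
        # The real per-record work would go here.
        percent = percent_done(current, total)
        if percent != last_shown:
            # "\r" returns to the start of the line so the value is overwritten.
            out.write("\r%3d %%" % percent)
            out.flush()
            last_shown = percent
    out.write("\n")
```

For a file with a million records this writes to the console at most 101 times instead of a million, which matters because console output is usually much slower than the parsing itself.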

VB.Net Replacing Specific Values in a Large Text File

I have some large csv files (1.5gb each) where I need to replace specific values. The method I'm currently using is terribly slow and I'm fairly certain that there should be a way to speed this up but I'm just not experienced enough to know what I should be doing. This is my first post and I tried searching through to find something relevant but didn't come across anything. Any help would be appreciated.
My other thought would be to break the file into chunks so that I can read the entire thing into memory, do all of the replacements there and then output to a consolidated file. I tried this but the way I did it actually ended up seeming slower than my current method.
Thanks!
Sub Main()
    Dim fName As String = "2009.csv"
    Dim wrtFile As String = "2009.1.csv"
    Dim lRead
    Dim lwrite As String
    Dim strRead As New System.IO.StreamReader(fName)
    Dim strWrite As New System.IO.StreamWriter(wrtFile)
    Dim bulkWrite As String
    bulkWrite = ""
    Do While strRead.Peek <> -1
        lRead = Split(strRead.ReadLine(), ",")
        If lRead(9) = "5MM+" Then lRead(9) = "5000000"
        If lRead(9) = "1MM+" Then lRead(9) = "1000000"
        lwrite = ""
        For i = LBound(lRead) To UBound(lRead)
            lwrite = lwrite & lRead(i) & ","
        Next
        strWrite.WriteLine(lwrite)
    Loop
    strRead.Close()
    strWrite.Close()
End Sub
You are splitting and then combining, which can take some time.
Why not just read the line of text, replace any occurrence of "5MM+" and "1MM+" with the appropriate value, and then write the line?
Do While ...
    s = strRead.ReadLine()
    s = s.Replace("5MM+", "5000000")
    s = s.Replace("1MM+", "1000000")
    strWrite.WriteLine(s)
Loop
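The same read, replace, write loop can be sketched in a language-neutral way. The Python version below streams the file one line at a time, so memory use stays flat regardless of file size. The function names are illustrative only, and the replacement table simply mirrors the two values from the question.

```python
import io

# The two substitutions from the question.
REPLACEMENTS = {"5MM+": "5000000", "1MM+": "1000000"}


def replace_in_stream(src, dst, replacements=REPLACEMENTS):
    """Copy src to dst line by line, applying each replacement to the whole line."""
    for line in src:
        for old, new in replacements.items():
            line = line.replace(old, new)
        dst.write(line)


def replace_in_file(in_path, out_path):
    """Stream a file on disk through replace_in_stream."""
    # newline="" keeps the original line endings untouched.
    with open(in_path, "r", newline="") as src, \
            open(out_path, "w", newline="") as dst:
        replace_in_stream(src, dst)


# Small in-memory demonstration.
demo_out = io.StringIO()
replace_in_stream(io.StringIO("a,5MM+,b\nc,1MM+,d\n"), demo_out)
```

Note the trade-off: unlike the original code, this replaces the text wherever it appears on the line, not only in the tenth column, so it is only safe if "5MM+" and "1MM+" cannot occur in other fields.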

Most efficient way to jump through a file and read lines?

I want to use a FileStream and seek from the beginning of the file while moving forward in the file .01% of the file size at a time.
So I want to seek to a position in the file, read the entire line, if it matches my criteria I am done. If not, I seek ahead another .01.
C# is OK but VB.NET preferred.
I used to do it something like this in VB6...
FileOpen(1, CurrentFullPath, OpenMode.Input, OpenAccess.Read, OpenShare.Shared)
Dim FileLength As Long = LOF(1)
For x As Single = 0.99 To 0 Step -0.01
    Seek(1, CInt(FileLength * x))
    Dim S As String = LineInput(1)
    S = LineInput(1)
    filePosition = Seek(1)
    If filePosition < 50000 Then
        filePosition = 1
        Exit For
    End If
    V = Split(S, ",")
    Dim MessageTime As Date = CDate(V(3) & " " & Mid$(V(4), 1, 8))
    Dim Diff As Integer = DateDiff(DateInterval.Minute, MessageTime, CDate(RequestedStartTime))
    If Diff >= 2 Then
        Exit For
    End If
Next
But I don't want to use FileOpen, I want to use a FileStream.
Any help is greatly appreciated!
This is a more or less direct conversion of your code, where we set the Position of the reader's underlying stream to specify where in the file to read:
Using streamReader As System.IO.StreamReader = System.IO.File.OpenText(CurrentFullPath)
    For x As Single = 0.99 To 0 Step -0.01
        streamReader.BaseStream.Position = CLng(streamReader.BaseStream.Length * x)
        ' Drop data buffered from the previous position, otherwise ReadLine returns stale text
        streamReader.DiscardBufferedData()
        Dim S As String = streamReader.ReadLine()
        '... etc.
    Next
End Using
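One detail from the original code is worth keeping in any conversion: after seeking to an arbitrary byte offset you are usually in the middle of a line, which is why the question's code calls LineInput twice and keeps only the second result. A minimal Python sketch of that step follows. The function names and the UTF-8 assumption are mine, and the file is opened in binary mode so that byte offsets are exact.

```python
import io
import os


def line_after_offset(f, offset):
    """Seek a binary file to offset and return the next complete line.

    The first read after a non-zero seek is discarded because it is usually
    only the tail of a line. If the offset happens to fall exactly on a line
    start, that line is skipped too, which is acceptable when sampling.
    """
    f.seek(offset)
    if offset > 0:
        f.readline()  # discard the (probably partial) line
    return f.readline().decode("utf-8").rstrip("\r\n")


def skim(path, steps=100):
    """Yield one complete line from each of `steps` evenly spaced offsets."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        for step in range(steps):
            line = line_after_offset(f, size * step // steps)
            if line:  # empty near the end of the file
                yield line


# Small in-memory demonstration.
sample = io.BytesIO(b"aaa\nbbb\nccc\n")
second = line_after_offset(sample, 1)
```

The caller can then loop over skim(path), test each line against its criteria, and stop at the first match.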
What about something like this (C# version):
using (var file = System.IO.File.OpenText(filename))
{
    while (!file.EndOfStream)
    {
        string line = file.ReadLine();
        //do your logic here
        //Logical test - if true, then break
    }
}
EDIT: VB version here (warning - from a C# dev!)
Using reader As StreamReader = File.OpenText(filename) ' OpenText returns a StreamReader, not a FileStream
    While Not reader.EndOfStream
        Dim line As String = reader.ReadLine()
        ' Test to break
        ' Exit While if condition met
    End While
End Using
I normally prefer vb.net, but C#'s iterator blocks are slowly winning me over:
public static IEnumerable<string> SkimFile(string FileName)
{
    long delta = new FileInfo(FileName).Length / 100;
    long position = 0;
    using (StreamReader sr = new StreamReader(FileName))
    {
        while (position < 100)
        {
            sr.BaseStream.Seek(position * delta, SeekOrigin.Begin);
            // Clear the reader's buffer so the next read starts at the new position
            sr.DiscardBufferedData();
            yield return sr.ReadLine();
            position++;
        }
    }
}
Put it in a class library project and use it from vb like this:
Dim isMatch As Boolean = False
For Each s As String In SkimFile("FileName.txt")
    ' TotalMinutes gives the whole difference; Minutes is only the 0-59 component
    If (RequestedDate - CDate(s.Substring(3, 11))).TotalMinutes > 2 Then
        isMatch = True
        Exit For
    End If
Next s
(I took some liberties with your criteria (assumed fixed-width values rather than delimited) to make the example easier.)
There's an example on MSDN.
Edit in response to comment:
I must admit I'm a bit confused, as you seemed insistent on using a buffered FileStream, but want to read a file a line at a time? You can do that quite simply using a StreamReader. I don't know VB, but in C# it would be something like this:
using (StreamReader sr = File.OpenText(pathToFile))
{
    string line = String.Empty;
    while ((line = sr.ReadLine()) != null)
    {
        // process line
    }
}
See http://msdn.microsoft.com/en-us/library/system.io.file.aspx.