Look for Style Changes in Word - vba

Is there any way when programming MS Word to list the points in the text where a change in character style occurs?
I'm programmatically trying to analyze a paragraph to retrieve all the contiguous blocks of text that have the same style - in other words, split the paragraph at the points where the text style changes. At the moment the way I'm doing it is to take each character and compare its style with the previous character - if the name of the style is different, I know I've found a point to split the results at. That works but is horrendously inefficient (for every character, you have to do a full string comparison of the style name). I'm wondering if there's a way in the Word object model to solve this problem without comparing every character?
The approximate code I'm currently using is as follows. (It's C# code: I'm using COM Interop against Word 2003, but I'd be equally happy with a solution in VBA, since once I know in principle how to do it, converting to C# should be easy.)
// used to store the results as we go
StringBuilder currentText = new StringBuilder();
string currentStyle = null;
// range contains the Range I want to split up
foreach (Range charRng in range.Characters)
{
    // get_Style() returns object under COM interop, hence the cast
    string style = ((Style)charRng.get_Style()).NameLocal;
    if (style == currentStyle)
    {
        currentText.Append(charRng.Text);
    }
    else
    {
        // style changed: flush the previous block (there is none before the first character)
        if (currentStyle != null)
        {
            AddTextBlockToMyResults(currentStyle, currentText.ToString());
        }
        currentText = new StringBuilder(charRng.Text);
        currentStyle = style;
    }
}
// flush the final block
AddTextBlockToMyResults(currentStyle, currentText.ToString());

What version(s) of Office were used to create the Word docs?
If it's Office 2007 or later (or you can convert the docs to that format), then a Word document is really just a .zip archive. If you open a .docx file with an archive utility like WinRAR, you'll see that it has a directory structure like:
_rels
customXml
docProps
word
|_ document.xml
That document.xml is an Office Open XML file that contains all the text and the references to styles in your Word doc. I bet you could parse that XML a heck of a lot faster than doing what you're doing now.
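As a rough illustration of that idea, here is a minimal Python sketch (not the asker's C#/VBA) that groups the runs of each paragraph in document.xml by character style. The function names style_blocks and read_document_xml are made up for this example. It assumes character styles appear as w:rStyle inside w:rPr, and it only looks at runs that are direct children of a paragraph, so runs nested in hyperlinks or content controls are skipped. Note that w:rStyle holds the style ID, which may differ from the name shown in Word's UI.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def style_blocks(document_xml):
    """Return [(style_id_or_None, text), ...], merging adjacent runs
    that share the same character style within each paragraph."""
    root = ET.fromstring(document_xml)
    blocks = []
    for para in root.iter(f"{{{W}}}p"):
        current_style, parts = None, []
        # Assumption: only direct child runs are considered
        for run in para.findall("w:r", NS):
            style_el = run.find("w:rPr/w:rStyle", NS)
            style = style_el.get(f"{{{W}}}val") if style_el is not None else None
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if not text:
                continue
            if parts and style != current_style:
                # Style changed: close the block collected so far
                blocks.append((current_style, "".join(parts)))
                parts = []
            current_style = style
            parts.append(text)
        if parts:
            blocks.append((current_style, "".join(parts)))
    return blocks
```

Typical use would be style_blocks(read_document_xml("some.docx")). Word already stores text in runs with uniform formatting, so the comparison happens once per run rather than once per character.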

Related

Is there a way to save mathematical alphanumerical symbols (the ones that are in unicode) to a PDF or Word document in VB.NET?

Basically, I need to take a question from a text file and format it as a question would be formatted in a maths exam.
Right now, I'm using PDFsharp to do this but it always saves the alphanumerical symbols (for example, 𝑥) as boxes.
I tried copying from the sample on PDFsharp and have this
Dim document As New PdfSharp.Pdf.PdfDocument
Dim page As PdfSharp.Pdf.PdfPage = document.AddPage()
Dim gfx As PdfSharp.Drawing.XGraphics = PdfSharp.Drawing.XGraphics.FromPdfPage(page)
Dim tf As New PdfSharp.Drawing.Layout.XTextFormatter(gfx)
Dim options As New PdfSharp.Drawing.XPdfFontOptions(PdfSharp.Pdf.PdfFontEncoding.Unicode)
Dim font As New PdfSharp.Drawing.XFont("LastResort", 10, PdfSharp.Drawing.XFontStyle.Regular, options)
tf.Alignment = PdfSharp.Drawing.Layout.XParagraphAlignment.Left
tf.DrawString(questionArray(i), font, PdfSharp.Drawing.XBrushes.Black, New PdfSharp.Drawing.XRect(0, 0, page.Width.Point, page.Height.Point), PdfSharp.Drawing.XStringFormats.TopLeft)
Dim filename As String = "test" + Str(i).Trim + ".pdf"
document.Save(filename)
Process.Start(filename)
I know I don't need to keep repeating the "PdfSharp.Pdf" stuff; my plan was to clean it all up once the characters are saving properly.
Last Resort is a font that contains Unicode symbols, including the Mathematical Alphanumeric Symbols block, according to https://www.fileformat.info/info/unicode/block/mathematical_alphanumeric_symbols/fontsupport.htm
My end goal is to take a basic .txt file like "f(x) = 5[𝑥^2] + (k+7)𝑥 + k where k is a real constant." and save it to a PDF to resemble a real math exam question.
So, is there a better way to do this or a way to make PDFsharp do it?
Unicode support in PDFsharp works fine for characters in the range 0x0000 to 0xffff as long as you use a font that supports these characters.
Mathematical Alphanumeric Symbols are in the range U+1D400..U+1D7FF. You have to patch PDFsharp to make use of them. They are not yet supported out of the box as of today (December 2019).
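To check which characters of a question fall outside the 0x0000 to 0xffff range, a small Python sketch like the following can help (it is independent of PDFsharp; the function names are invented for this example). Code points above 0xFFFF occupy two UTF-16 code units, which is how a .NET String stores them.

```python
def non_bmp_chars(text):
    """List (index, character, code point label) for every character
    whose code point lies above U+FFFF."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


# "\U0001D465" is MATHEMATICAL ITALIC SMALL X, the symbol from the question
sample = "5\U0001D465^2"
for index, ch, label in non_bmp_chars(sample):
    print(index, label, utf16_units(ch))
```

Any character this reports would need either the patch mentioned above or a substitute from the supported range, for example a plain "x" set in an italic font.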
In your snippet you give "LastResort" as the name of the font. Do you have a font with that name installed in Windows? Can you use it with e.g. Word or WordPad?
Maybe try "Arial" or "Tahoma" or "Verdana" instead.
Do you see the correct strings in the debugger? Maybe the problem is with the formatting of the text file or the encoding used to open it.
Update:
All characters in LastResort look like boxes, so it is not a good choice for math exam sheets.
Please try a different font.

Turning a string of characters into a hyperlink

I have searched a ton for an answer to this question and I can't seem to find one that is specific to my needs. I think it might be possible that I am just not understanding how hyperlinks work in VBA.
Currently, I have an array of strings (each representing a separate file on my server), and I want to add a hyperlink to each string that points to the file's location on my server. I want each string to be hyperlinked so that when I paste it into Word or Outlook, it is already a hyperlink. In my mind, this should be a fairly straightforward task: you have a string of text, you have a file location, and you want to hyperlink that string of text to the file location.
For example, let's say I have an array like below:
docArray = {"myDoc1", "myDoc2", "myDoc3"}
which contains the names as string of 3 documents.
I have another array with the file location of each doc:
docLocArray = {"C:\Documents\myDoc1.docx", "C:\Documents\myDoc2.docx",
"C:\Documents\myDoc3.docx"}
The pseudo-code for this would be something like:
Hyperlink.Add(docArray(1), docLocArray(2))
Is there any way I can do something like this, or am I completely misunderstanding how hyperlinks can be used?
I am working in Autodesk Inventor if that is of any relevance to anyone.
Try this in Word:
ActiveDocument.Hyperlinks.Add Anchor:=Selection.Range, Address:="C:\Testdir\Testfile.txt", SubAddress:="", ScreenTip:="", TextToDisplay:="MyFile"
Then just loop through the arrays, passing each file location as Address and each name as TextToDisplay.

Word.Range : Move Range index in the formatted text that corresponds to the plain text

I need to analyze the text of my Word document and create bookmarks on the ranges of text that my analyzer has detected (much like a grammar checker).
I don't want to use the Find() utility because my needs are too specific.
Explanations
Here is what I do:
1/ Retrieve the document's plain text
I retrieve the plain text of the main story of my document:
String plainText = ActiveDocument.Range().Text;
2/ Analyze the plain text and get results
I send it to my analyzer tool, which returns a collection of markers with positions.
For example, if I wanted to detect the pattern "my pattern" in the document text, the analyzer could return a marker such as { pattern : "my pattern", start: 5, end : 14 }, where "start" and "end" are the character indexes of the pattern in the plain text that was sent.
3/ Display results in the document
I create bookmarks from these markers.
For the previous example, it would be:
// init a new range and collapse it
Word.Range range = activeDocument.Range();
range.Collapse(WdCollapseDirection.wdCollapseStart);
// move character-by-character in the "formatted" text
range.MoveStart(WdUnits.wdCharacter, Marker.Start); // Marker.Start = 5
// set length (end)
range.SetRange(range.Start, range.Start + (Marker.End - Marker.Start)); // Marker.End = 14
4/ Results
4.1 Global result
Everything is OK when the document's main story contains text, links, lists and headings:
Ranges are well positioned, and the plain text indexes correlate with the formatted text indexes.
4.2 Tables issue
When a document contains a table, ranges are mispositioned by a few characters: the plain text indexes do not correlate exactly with the formatted text indexes.
I found the reason for this issue (it was explained in other forums): it is due to the non-printing char(7), which is a cell delimiter added to the plain text. We can account for these characters when calculating the range position, and then everything is OK!
4.3 Issue with content controls, tables of contents, sections and others
When a document contains these elements, ranges are also mispositioned by a few characters.
Other non-printing characters appear in the plain text, but I don't understand what they mean or how to deal with them when calculating the range position.
When displaying Word element markers with "Developer ribbon > Design Mode", we see 2 markers per element: shifting the plain text indexes by 2 per element resolves the issue. It seems OK.
4.4 Issue with endpaper
I don't know how to say "page de garde" (French) in English; I think it's "endpaper": this is the first page, with a specific header, footer and content controls :)
When a document contains an endpaper, ranges are also mispositioned by a few characters.
But this time, there are no non-printing markers in the plain text.
Additionally, when I display Word element markers with "Developer ribbon > Design Mode", I see endpaper markers.
Questions
How can I detect an endpaper in a Word document range?
Why don't plain text indexes always correlate with formatted text indexes, depending on the elements the Word document contains?
Would XML node manipulation be a more reliable alternative? If yes, could you give me good examples of managing bookmarks or other elements in the current document with the XML API?
Other resources
I found similar issues:
Correlate Range.Text to Range.Start and Range.End
http://www.vbaexpress.com/forum/showthread.php?36710-Strange-character-on-table-range-text
I hope my explanations are clear and that you can help me understand what is wrong or show me a better way to do this.
Thanks, really.
It's not really pretty, but you can try to remove the unwanted characters with a Regex. For example, to remove the \a characters (character code 7):
string j = new string(new char[] { (char)7 });
plainText = Regex.Replace(plainText,string.Format("[{0}]", j), "");
Now you have to identify the other 'evil' characters and add them to the char array. If it works, you will get a string whose length corresponds to the number of Characters in your document. You will probably have to adapt this code by experimenting. (I was not sure which language you are using; I assumed C#.)
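If the analyzer runs on the cleaned string, its marker positions refer to the cleaned text, so it can be handy to keep a map back to the original offsets. The following is a minimal Python sketch of that bookkeeping (the function name strip_with_map is made up here, and the set of unwanted characters is an assumption to be adjusted by experiment). Whether the cleaned or the original offsets line up with Word's Range positions for a given document still has to be verified.

```python
def strip_with_map(text, unwanted="\x07"):
    """Remove every character listed in `unwanted` and return
    (cleaned_text, index_map), where index_map[i] is the offset in
    the original text of the i-th character of cleaned_text."""
    cleaned = []
    index_map = []
    for original_index, ch in enumerate(text):
        if ch not in unwanted:
            cleaned.append(ch)
            index_map.append(original_index)
    return "".join(cleaned), index_map


# Example: two cell delimiters (code 7) inside the plain text
plain = "ab\x07cd\x07ef"
cleaned, index_map = strip_with_map(plain)
start = cleaned.find("cd")          # offset in the cleaned text
original_start = index_map[start]   # offset in the original text
```

With this, a marker found at a cleaned offset can be translated to the original offset through index_map, or the difference between the two can be used as the shift to apply.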
Update
Another idea (if it is applicable to your analyzer tool):
Break your problem down to single paragraphs:
foreach (Word.Paragraph pg in activeDocument.Paragraphs)
{
    // Range is a property of Paragraph, not a method
    Word.Range range = pg.Range;
    string text = range.Text;
    // your stuff here
}
With these paragraph range objects and the text strings they contain, you do the same as you tried to do with the whole document object and its text, just paragraph by paragraph. All these paragraphs are 'addressable' by ranges and Move operations, as you already do. I suppose that the problematic characters are outside or at the end of the paragraphs, so they don't influence the character counting inside these paragraphs.
As I can't reproduce what you call an endpaper, I can't validate it. Besides, I don't know whether special text ranges such as page headers and tables of contents are covered by paragraphs. But at least you can reduce your problem to smaller ranges. I think it is worth trying.

Copy Special Symbols From One PowerPoint Presentation to Another

I need to copy text from one PowerPoint presentation to another. However, I have problems copying special symbols, such as smileys, which appear in the target presentation as empty boxes. Looking at the Open XML file in the original presentation, I can see that the Run containing the smiley has a "SymbolFont" attribute:
<a:sym typeface="Wingdings" />
However, in VBA, the Font of that Run (reached through Shape.TextFrame2.TextRange) shows Arial.
How can I determine the SymbolFont of a text Run using VBA or C# (not XML)?
Then I could specify that SymbolFont in the target presentation.
Perhaps there are other ways of copying the text that do not involve XML?
Note that this problem happens not only with Smileys; other special characters may show different SymbolFonts, such as:
<a:sym typeface="Symbol" pitchFamily="18" charset="2" />
Code example:
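// collect the text and font of every run in the source paragraph
void getRuns(TextRange2 paragraph)
{
    foreach (TextRange2 run in paragraph.get_Runs(-1, -1))
        _myRuns.Add(new MyRun { _text = run.Text, _font = run.Font });
}
// append the collected runs to the target paragraph
void copyRunsToParagraph(TextRange2 paragraph)
{
    foreach (MyRun run in _myRuns)
        paragraph = paragraph.InsertAfter(run._text);
}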
Note: Run.Font seems to return only the Latin font, not the Symbol font, e.g., Arial but not Wingdings. As I wrote, different symbols may have different SymbolFonts, so always using Wingdings does not work.
This can't be done in VBA.

Specifying location of new inlineshape in Word VBA?

I'm working on a document "wizard" for the company that I work for. It's a .dot file with a header consisting of some text and some form fields, and a lot of VBA code. The body of the document is pulled in as an OLE object from a separate .doc file.
Currently, this is being done as a Shape, rather than an InlineShape. I did this because I can absolutely position the Shape, whereas the InlineShape always appears at the beginning of the document.
The problem with this is that a Shape doesn't move when the size of the header changes. If someone needs to add or remove a line from the header due to a special case, they also need to move the object that defines the body. This is a pain, and I'd like to avoid it if possible.
Long story short, how do I position an InlineShape using VBA in Word?
The version I'm using is Word 97.
An InlineShape is treated like a character in the text. Hence, you position it the same way you would position text: through a Range.
ThisDocument.Range(15).InlineShapes.AddPicture "1.gif"
My final code ended up using ThisDocument.Paragraphs to get the range I needed. But GSerg pointed me in the right direction of using a Range to get my object where it needed to be.