Word.Range : Move Range index in the formatted text that corresponds to the plain text - vba

I need to analyze the text of my Word document and create bookmarks on the ranges of text my analyzer has detected (much like a grammar checker).
I don't want to use the Find() utility, because my needs are too specific.
Explanations
Here is what I do:
1/ Retrieve Document plain text
I retrieve the plain text of the main story of my document:
String plainText = ActiveDocument.Range().Text;
2/ Analyze plain text and get results
I send it to my analyzer tool, which returns a collection of markers with positions:
For example, if I wanted to detect the pattern "my pattern" in the document text, the analyzer could return a marker such as { pattern : "my marker", start: 5, end : 14 }, where "start" and "end" are the character indexes of the pattern in the plain text that was sent.
3/ Display results in Document
I create bookmarks from these markers.
For the previous example, it would be:
// init a new range and collapse it to the start of the document
Word.Range range = activeDocument.Range();
range.Collapse(Word.WdCollapseDirection.wdCollapseStart);
// move character-by-character in the "formatted" text
range.MoveStart(Word.WdUnits.wdCharacter, marker.Start); // marker.Start = 5
// set the length (end)
range.SetRange(range.Start, range.Start + (marker.End - marker.Start)); // marker.End = 14
4/ Results
4.1 Global Result
Everything is OK when the document's main story contains text, links, lists and titles:
ranges are well positioned, and plain-text indexes correlate with formatted-text indexes.
4.2 Tables Issue
When a document contains a table, ranges are off by a few characters: plain-text indexes do not correlate exactly with formatted-text indexes.
I found the reason for this issue (it was explained in other forums): it is due to the non-printing char(7), a cell delimiter added to the plain text. We can account for these characters when calculating the range position, and then everything is OK!
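The char(7) adjustment described above is plain index arithmetic. Here is a minimal sketch in Python (the same loop ports to C# or VBA). The function name `to_range_offset` is hypothetical, and it assumes, as observed in the question, that each char(7) occupies a slot in the plain text but no position in the range:

```python
CELL_MARK = "\x07"  # char(7), the cell delimiter seen in the plain text


def to_range_offset(plain_text: str, index: int, skipped: str = CELL_MARK) -> int:
    """Convert an index in the raw plain text to an offset that ignores `skipped` characters.

    Assumption: every character in `skipped` appears in the string
    but does not count as a position when moving the range.
    """
    return index - sum(1 for ch in plain_text[:index] if ch in skipped)


if __name__ == "__main__":
    text = "ab\r\x07cd\r\x07my pattern"
    start = text.find("my pattern")
    end = start + len("my pattern")
    print(to_range_offset(text, start), to_range_offset(text, end))
```

The converted start and end would then be passed to MoveStart and SetRange in place of the raw marker indexes.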
4.3 Issue for Content Controls, Tables of Contents, Sections and others
When a document contains these elements, ranges are also off by a few characters.
Other non-printing characters appear in the plain text, but I don't understand what they mean or how to deal with them when calculating the range position.
By displaying Word element markers with "Developer ribbon > Design Mode", we see 2 markers per element: shifting the plain-text indexes by 2 × the number of elements resolves the issue. It seems OK.
4.4 Issue with Endpaper
I don't know the English term for "page de garde" (French); I think it's "endpaper": this is the first page, with a specific header, footer and content controls :)
When a document contains an endpaper, ranges are also off by a few characters.
But this time, there are no non-printing markers in the plain text.
One more detail: when I display Word element markers with "Developer ribbon > Design Mode", I see endpaper markers.
Questions
How can I detect an endpaper in a Word document range?
How can I understand why plain-text indexes don't always correlate with formatted-text indexes, depending on the elements the Word document contains?
Would XML node manipulation be a more reliable alternative? If so, could you give me good examples of managing bookmarks or other elements in the current document with the XML API?
Other resources
I found similar issues:
Correlate Range.Text to Range.Start and Range.End
http://www.vbaexpress.com/forum/showthread.php?36710-Strange-character-on-table-range-text
I hope my explanations are clear and that you can help me understand what is wrong, or show me a better way to do this.
Thanks, really.

It's not really pretty, but you can try to remove the unwanted characters with a regex. For example, to remove the \a characters (character code 7):
string j = new string(new char[] { (char)7 });
plainText = Regex.Replace(plainText,string.Format("[{0}]", j), "");
Now you have to identify the other 'evil' characters and add them to the char array. If it works, you will get a string whose length corresponds to the number of Characters in your document. You will probably have to adapt this code by experimenting. (I was not sure which language you are using; I assumed C#.)
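To find candidates for those other characters, it may help to list which control characters actually occur in the extracted text. Here is a rough sketch in Python. The function name `control_char_counts` is hypothetical, and the cutoff of code 32 and the default set of expected characters are assumptions rather than anything Word documents:

```python
from collections import Counter


def control_char_counts(plain_text: str, expected: str = "\r\n\t") -> Counter:
    """Count control characters (code < 32) that are not in `expected`.

    The result maps character code -> number of occurrences, which shows
    which codes are worth adding to the list of characters to strip.
    """
    return Counter(
        ord(ch) for ch in plain_text if ord(ch) < 32 and ch not in expected
    )


if __name__ == "__main__":
    sample = "a\x07b\r\x07\x15"
    for code, count in sorted(control_char_counts(sample).items()):
        print(f"char({code}): {count}")
```

Whether a given code should be stripped still has to be checked against the document, as the answer says.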
Update
Another idea (if it is applicable to your analyzer tool):
Break your problem down to single paragraphs:
foreach (Word.Paragraph pg in activeDocument.Paragraphs)
{
    Word.Range range = pg.Range; // Range is a property of Paragraph, not a method
    string text = range.Text;
    // your stuff here
}
With these paragraph range objects and the text strings they contain, you do the same as you tried to do with the whole document object and its text, just paragraph by paragraph. All these paragraphs are 'addressable' by ranges and Move operations, as you already do. I suppose that the problematic characters are outside or at the end of the paragraphs, so they don't influence the character counting inside these paragraphs.
As I can't reproduce what you call an endpaper, I can't validate it. Besides, I don't know whether special text ranges such as page headers and tables of contents are covered by paragraphs. But at least you can reduce your problem to smaller ranges. I think it is worth trying.
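The bookkeeping for the paragraph-by-paragraph approach can be sketched as follows, in Python for brevity. The function name `find_in_paragraphs` and the `(range_start, text)` pairs are hypothetical: in Word, each pair would come from a paragraph's Range.Start and Range.Text. The sketch assumes, as supposed above, that offsets inside a paragraph's text match character positions inside its range:

```python
def find_in_paragraphs(paragraphs, pattern):
    """Return (start, end) document positions of `pattern`.

    `paragraphs` is a list of (range_start, text) pairs, one per paragraph.
    Each local match index is shifted by the paragraph's own start,
    so an offset error in one paragraph cannot leak into the next.
    """
    hits = []
    for range_start, text in paragraphs:
        pos = text.find(pattern)
        while pos != -1:
            hits.append((range_start + pos, range_start + pos + len(pattern)))
            pos = text.find(pattern, pos + 1)
    return hits


if __name__ == "__main__":
    paragraphs = [(0, "Intro text\r"), (40, "see my pattern here\r")]
    print(find_in_paragraphs(paragraphs, "my pattern"))
```

Because every paragraph carries its own start position, a stray character before a paragraph shifts that start but not the offsets computed inside it.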

Related

Replace a blank before a table page break Word VBA

I unchecked "Allow row to break across pages" in the table's properties, so the table is moved to a new page to ensure that all its content is on one page; this works fine. But Word leaves a blank space before the page break, and I need to replace it with some text for a legal reason. I can't use a watermark or shapes because Oracle BI Publisher only prints them in PDF, and I need to export to .docx.
The data is dynamic, so sometimes the text before the table and the text inside the table may change.
Current Version https://imgur.com/a/FTx0q
I need something like this https://imgur.com/a/ySitL
MS Office support told me that it can't be done with Word...
Maybe with VBA code?
Update
Thanks Cindy for your help.
I have a table inside another table, with many paragraphs, checkboxes, etc., and they now fit on a new page. It's working.
I understand there isn't a page break.
It's a paragraph mark.
But what I need to do is insert a kind of mark, a text like XXXX or -----------, instead of leaving "free space".
It's a requirement not to change the font size or any other text formatting.
For a legal requirement, some paragraphs must fit on a new page and the "blank spaces" must be replaced by a kind of mark.
I can't hard-code it because in several cases not all the paragraphs or sections in a page will be shown, and I don't know in advance when a new page is needed.
I am open to using macros or anything else.
What you could do is insert a page-size table into a textbox in the page header and format the body text with a white background. The table will thus be hidden behind any text on the page, but not otherwise (provided you don't pad unused space with empty paragraphs, etc.).

Word VBA search for adjacent (non-space) characters with different formatting

I need to be able to find every place in my document (hundreds of pages) where there is a formatting change without a space. For example:
a bold partnext to regular text
Or red text next to black with no space. I want to have my macro find each "word" (in the vba sense) like this, and execute code based on that character location accordingly. (The loop should identify the character position where the format change occurs... although I can do that part with a loop through the characters within the found word).
Is there a simpler way to do this than by looping character by character through the whole document and checking for a difference in formatting, which would be too resource-intensive?
Thanks for your help.

(MS Word) Removing character style IF text has specific paragraph style applied

My Google-fu must be very weak today, ’cause this seems like an obvious thing to need to do sometimes, yet I cannot find a single case of anyone ever asking about it anywhere…
I have a document that I am preparing for proper typesetting in InDesign, which includes among other things getting rid of local overrides to paragraph and character styles. I did a find-and-replace to replace all instances of italic text with a character style called Italic, but stupidly forgot to limit this to text with the Normal style applied.
There are hundreds of headers strewn throughout the document which are supposed to be italic; that’s part of their paragraph style definitions. Since I forgot to limit the find-and-replace, the Italic style was applied to all these many headers. Annoyingly, since ‘italic’ is something like a boolean switch in Word, this means that all these headers are now not italic in the document.
I didn’t notice this for a while, so I can’t simply undo it now—the file has been saved and worked on since the find-and-replace. The author (who is a cantankerous, octogenarian technophobe) also needs to see the file again before it’s set, so while ultimately it doesn’t matter whether or not the font is italic in the Word document, it will matter to him.
So what I would very much like to do is to search for all text which has both the paragraph style Header and the character style Italic applied, and remove the character style.
This is an easy task in InDesign where paragraph and character styles are separate entities, but not in Word where they’re all lumped together in one big, messy pile. It doesn’t seem like it can be done through the UI, so I’m guessing I’ll have to resort to a VBA macro… which I’m utterly incompetent at.
Is there a way to find text with a particular paragraph style and a particular character style, and then remove the character style from that text?
Here is some code to get you started.
Press F5 to run the code; it will stop at the Stop command.
Examine the Immediate window to determine the header style.
Each paragraph gets selected, so that you can tell which one you are examining.
You can then modify the code with an If/Then statement to make specific paragraphs italic.
Sub aaaa()
    Dim ppp As Paragraph
    Dim ccc As Range
    For Each ppp In ActiveDocument.Content.Paragraphs
        ppp.Range.Select ' visual aid only. not used by any other part of the program
        Debug.Print "style :", ppp.Style
        Debug.Print ppp.Style.Description
        Stop
        For Each ccc In ppp.Range.Characters ' you can probably comment out these 3 lines
            Debug.Print ccc.FormattedText.Italic ' True prints as -1
        Next ccc
        Debug.Print "italic :", ppp.Range.Italic ' prints -1 if all are italic. 9999999 if some. 0 if none
        ppp.Range.Italic = False ' this removes italic from whole paragraph
    Next ppp
End Sub

MS Word, how to change formatting of entire paragraphs automatically in whole document?

I have a 20-page word document punctuated with descriptive notes throughout, like this:
3 Input Data Requirements
Some requirement text.
NOTE: This is a descriptive note about the requirement, which is the paragraph that I would like to use find-and-replace or a VBA script to select automatically and change the formatting to italicized. The notes invariably end in a carriage-return: ¶.
If it was just a text document, not MS-Word, I would just use a regex in a code editor like sublime to wrap it with <I>...</I> or something along those lines.
Preferably, is there a way to do this in Word's "advanced" find-and-replace feature? Or if not, what's the best way to do it in VBA?
I've tried using a search string like this in find-and-replace: NOTE: *[a-z0-9,. A-Z)(-]{1,255}^l but the line-break part doesn't seem to work, and the 255 char max isn't enough for many of the paragraphs.
EDIT: Another slightly important detail: The doc is automatically generated from another piece of software as a .RTF, which I promptly converted to .docx.
Attempt #2: Use Notepad++ to find and replace using regex. Remove quotes.
Find: "( NOTE: .*?)\r"
Replace with: " \i \1 \i0 \r "
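For reference, the Notepad++ attempt above can be expressed as a small Python sketch. The function name `italicize_notes` is hypothetical. It assumes each note ends in a bare carriage return, as in the attempt; a real RTF file may mark paragraph ends differently, so treat it as an illustration of the regex only:

```python
import re


def italicize_notes(text: str) -> str:
    """Wrap each 'NOTE: ...' run, up to the next carriage return, in \\i ... \\i0.

    The lazy `.*?` stops at the first carriage return, so there is no
    255-character limit as in Word's own wildcard search.
    """
    return re.sub(
        r"(NOTE: .*?)\r",
        lambda m: "\\i " + m.group(1) + "\\i0 \r",
        text,
    )


if __name__ == "__main__":
    sample = "Some requirement text.\rNOTE: a descriptive note.\rNext paragraph\r"
    print(repr(italicize_notes(sample)))
```

A replacement function is used instead of a replacement string so that the literal backslashes in \i and \i0 do not need extra escaping.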
//OLD
Sure is. No VBA or fancy tricks needed.
CTRL + H to bring up the replace dialog.
Click "More".
Select "Font" in the drop down menu called "Format".
Click italics.
Enter find and replace text as the same thing. Make sure you set this up right so that you don't accidentally replace substrings (e.g. goal to replace all " test " with " nice ", testing -> niceing).
Should work. If you need to alter entire paragraphs, consistently, then you probably should have used the styles on those paragraphs to begin with. That way, you can change all of them at once by updating the style itself.
You can use Advanced Find, yes. Find Next and then Replace makes the selection italic.

automating word 2010 to generate docs

The web app was built against Office 2007 and I need to convert it so that it works with Office 2010.
I was able to convert the header-generator part of the code, but I have a problem with the body of the doc itself. The code copies the data from a "data" doc and pastes it into the generated doc.
appword.activewindow.activepane.view.seekview = 0
'set appsel1 = appword.activewindow.selection
set appsel1 = appword.window(filepath).selection ' that is the original one
appdoc1.bookmarks("b1").select
appword.selection.insertafter("some text")
appsel1.endkey(6) ' the code stops here
appword.selection.insertafter("some other text")
The Internet Explorer debugger says: ERROR: appsel1 object required. When I inspect appsel1 in the debugger, its value is "empty" instead of "{...}".
Can anyone tell me what I'm doing wrong?
If you need more of the code, tell me.
From MSDN
After this method is applied, the selection expands to include the new
text.
If you use this method with a selection that refers to an entire
paragraph, the text is inserted after the ending paragraph mark (the
text will appear at the beginning of the next paragraph). To insert
text at the end of a paragraph, determine the ending point and
subtract 1 from this location (the paragraph mark is one character).
However, if the selection ends with a paragraph mark that also happens
to be the end of the document, Microsoft Word inserts the text before
the final paragraph mark rather than creating a new paragraph at the
end of the document.
Also, if the selection is a bookmark, Word inserts the specified
text but does not extend the selection or the bookmark to include the
new text.
So I suspect that you still have no selected text.
I wonder if you can do a Selection.Collapse(wdCollapseStart), but that's just a thought.