Finding and redacting text highlighted with a specific color, but while keeping the spaces and line breaks (to maintain doc layout) - vba

I'm trying to use the VBA code from a similar question in this forum to redact text highlighted in a specific color, but I would like to keep the document layout, which means only replacing the words, but not the spaces and paragraph breaks in the document. Alternatively, I would be happy if we could identify the line breaks and put a space there.
At the end, the document would not have large sections of unbroken text where words and spaces were replaced by XXXXXXXX and highlighted black. Instead, the text would look more like XX X XXXX XXX X, but all of it should be highlighted in black.
In other words, the text "Mary had a little lamb." would be redacted to "XXXX XXX X XXXXXX XXXXX" rather than XXXXXXXXXXXXXXXXXXXXXXXX.
I've tried changing the "If flag then" section to include unicode 32 (space) instead of the carriage return (unicode 13), but that doesn't seem to work.
Many thanks.
If flag Then
    If Selection.Range.HighlightColorIndex = wdTurquoise Then
        ' Create replacement string
        ' If last character is a carriage return (unicode 13), then keep that carriage return
        OldText = Selection.Text
        OldLastChar = Right(OldText, 1)
        NewLastChar = ReplaceChar
        If OldLastChar Like String(1, 13) Then NewLastChar = String(1, 13)
        NewText = String(Len(OldText) - 1, ReplaceChar) & NewLastChar
        ' Replace text, black block
        Selection.Text = NewText
        Selection.Font.ColorIndex = wdBlack
        Selection.Font.Underline = False
        Selection.Range.HighlightColorIndex = wdBlack
        Selection.Collapse wdCollapseEnd
    End If
End If

@freeflow has given you an answer in his comment on your post, but if you do that, you should also include in the wildcard search all potential punctuation characters, excluding blank spaces.
However, with that said, I recommend that you not exclude punctuation characters or the spaces between words from the redaction. I’m recommending that because the purpose of redaction is to eliminate the possibility of someone comprehending what the redacted portion of the document originally contained. If you provide them clues, such as how many words are in the sentence ... they can guess and sometimes be quite accurate because of the surrounding non-redacted script.
Of course, that’s just my opinion.
To maintain document formatting, I suggest that you not use as replacement characters letters such as “X” because it is a wide character. I’ve found it better to use a symbol and I recommend a Wingdings character 127. It’s an average width and does a good job of balancing out sentence length ... but for added assurance I also recommend that you include in your replacement a Font.Spacing of -1, which will tighten up each redacted sentence even more.
In redacting, just be aware that maintaining the document formatting, no matter what your replacement character strategy might be, is very difficult. I’ve spent a lot of time experimenting with this and I’ve now shared what I do in my own redaction add-in. I don’t redact paragraph marks. I redact the entire highlighted string, including spaces and punctuation, I use Wingdings character 127, I set the Font.Spacing to -1, and the font color is the same as whatever color I’m using to highlight the redaction.
If you are interested in seeing my add-in, do a Web search on AuthorTec Redactor.
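For the layout-preserving variant the question asks for, one option is to build the replacement string character by character and leave whitespace alone. The following is only a sketch: RedactKeepingLayout is a hypothetical helper name, and the set of characters it preserves (tab, manual line break, paragraph mark, space, nonbreaking space) is an assumption about what counts as layout.

```vba
' Hypothetical helper: replaces every character except whitespace with ReplaceChar.
Function RedactKeepingLayout(ByVal OldText As String, ByVal ReplaceChar As String) As String
    Dim i As Long
    Dim result As String

    result = OldText
    For i = 1 To Len(OldText)
        Select Case AscW(Mid$(OldText, i, 1))
            Case 9, 11, 13, 32, 160
                ' Tab, manual line break, paragraph mark, space, nonbreaking space:
                ' keep as-is (assumed list of layout characters)
            Case Else
                ' Overwrite exactly one character in place
                Mid$(result, i, 1) = ReplaceChar
        End Select
    Next i

    RedactKeepingLayout = result
End Function

Sub DemoRedactKeepingLayout()
    Debug.Print RedactKeepingLayout("Mary had a little lamb.", "X")
    ' prints: XXXX XXX X XXXXXX XXXXX
End Sub
```

In the macro from the question, the line that builds NewText could then become NewText = RedactKeepingLayout(OldText, ReplaceChar). Because the black highlight is applied to the whole selection afterwards, the preserved spaces would still be highlighted black. Note that this keeps word lengths visible, which is exactly the risk the answer above warns about.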

Related

Word VBA search for adjacent (non-space) characters with different formatting

I need to be able to find every place in my document (hundreds of pages) where there is a formatting change without a space. For example:
a bold partnext to regular text
Or red text next to black with no space. I want to have my macro find each "word" (in the vba sense) like this, and execute code based on that character location accordingly. (The loop should identify the character position where the format change occurs... although I can do that part with a loop through the characters within the found word).
Is there a simpler way to do this than by looping character by character through the whole document and checking for a difference in formatting, which would be too resource-intensive?
Thanks for your help.

(MS Word) Removing character style IF text has specific paragraph style applied

My Google-fu must be very weak today, ’cause this seems like an obvious thing to need to do sometimes, yet I cannot find a single case of anyone ever asking about it anywhere…
I have a document that I am preparing for proper typesetting in InDesign, which includes among other things getting rid of local overrides to paragraph and character styles. I did a find-and-replace to replace all instances of italic text with a character style called Italic, but stupidly forgot to limit this to text with the Normal style applied.
There are hundreds of headers strewn throughout the document which are supposed to be italic; that’s part of their paragraph style definitions. Since I forgot to limit the find-and-replace, the Italic style was applied to all these many headers. Annoyingly, since ‘italic’ is something like a boolean switch in Word, this means that all these headers are now not italic in the document.
I didn’t notice this for a while, so I can’t simply undo it now—the file has been saved and worked on since the find-and-replace. The author (who is a cantankerous, octogenarian technophobe) also needs to see the file again before it’s set, so while ultimately it doesn’t matter whether or not the font is italic in the Word document, it will matter to him.
So what I would very much like to do is to search for all text which has both the paragraph style Header and the character style Italic applied, and remove the character style.
This is an easy task in InDesign where paragraph and character styles are separate entities, but not in Word where they’re all lumped together in one big, messy pile. It doesn’t seem like it can be done through the UI, so I’m guessing I’ll have to resort to a VBA macro… which I’m utterly incompetent at.
Is there a way to find text with a particular paragraph style and a particular character style, and then remove the character style from that text?
Here is some code to get you started.
Press F5 to run the code; it will stop at the Stop command.
Examine the Immediate window to determine the header style.
Each paragraph gets selected, so that you can tell which one you are examining.
You can then modify the code with an If/Then statement to make specific paragraphs italic.
Sub aaaa()
    Dim ppp As Paragraph
    Dim ccc As Range
    For Each ppp In ActiveDocument.Content.Paragraphs
        ppp.Range.Select ' visual aid only. not used by any other part of the program
        Debug.Print "style :", ppp.Style
        Debug.Print ppp.Style.Description
        Stop
        For Each ccc In ppp.Range.Characters ' you can probably comment out these 3 lines
            Debug.Print ccc.FormattedText.Italic ' True prints as -1
        Next ccc
        Debug.Print "italic :", ppp.Range.Italic ' prints -1 if all are italic. 9999999 if some. 0 if none
        ppp.Range.Italic = False ' this removes italic from whole paragraph
    Next ppp
End Sub
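Building on the same paragraph and character loops, a rough sketch of what the question actually asks for (removing the character style only inside paragraphs that carry a given paragraph style) might look like the following. The style names "Header" and "Italic" are taken from the question and are assumptions about the document; the macro name is hypothetical. It has not been tested against the asker's file, so try it on a copy first.

```vba
Sub RemoveCharStyleFromStyledParagraphs()
    ' Assumed style names, taken from the question; change to match your document.
    Const PARA_STYLE As String = "Header"
    Const CHAR_STYLE As String = "Italic"

    Dim para As Paragraph
    Dim ch As Range

    For Each para In ActiveDocument.Paragraphs
        ' Only touch paragraphs that use the target paragraph style
        If para.Style = PARA_STYLE Then
            For Each ch In para.Range.Characters
                ' CharacterStyle reports the character style applied to this character
                If ch.CharacterStyle = CHAR_STYLE Then
                    ' Applying Default Paragraph Font clears the character style
                    ch.Style = wdStyleDefaultParagraphFont
                End If
            Next ch
        End If
    Next para
End Sub
```

Going character by character is slow on long documents, but it only descends into paragraphs with the target style, which keeps the work limited to the headers. Once the character style is gone, the text should fall back to whatever the paragraph style defines, assuming no direct formatting has been applied on top.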

How do I create VBA Macros for MS Word using the .find property that can use variables?

Just as one example, I have a paragraph that contains the string "50 million dollars". I would like it to say "$50 million" instead, using the dollar sign instead of spelling out "dollars". Simply replacing that single instance is trivial. But say I want to replace all instances of the form "N million dollars" where N is a number between 1 and 1,000. How do I do that?
It's easy enough to find all instances of "N million dollars" by using something like
With Selection.Find
    .Text = "[1-1000] million dollars"
    .ClearFormatting
    .Replacement.Text = "$[1-1000] million"
    .Replacement.ClearFormatting
    .Execute Replace:=wdReplaceAll, Forward:=True
End With
The problem is with the replacement text. We need to somehow store the numeric value that precedes "million dollars", otherwise, as written, it will replace it with the string "$[1-1000] million". If the original string is "75 million dollars", I want to replace it with "$75 million". What is the best way to do that?
With wildcard searching. You don't even need VBA — in the Find and Replace dialog box, check "Use Wildcards" (hit "More" if you don't see that). Enter the search text
$([0-9]{1,}) million
and the replacement text
\1 million dollars
and hit Replace All.
(Or, for the other way around, search for ([0-9]{1,}) million dollars and replace with $\1 million.)
To do this with VBA, set Selection.Find.MatchWildcards=True, .Text to the search expression above, and .Replacement.Text to the replacement text above.
Explanation: In a wildcard search expression, () surround portions of the text to be saved. The first such group can then be referred to as \1, the second as \2, and so on. [0-9] is any digit, and [0-9]{1,} is one or more digits.
Note that wildcards are case-sensitive and whitespace-sensitive.
If you have to handle uppercase as well, replace any letter in the Find expression above with its lower- and upper-case forms in square brackets. E.g., [Mm]illion [Dd]ollars will match "million dollars", "Million dollars", and so forth.
If you have nonbreaking spaces in your document, or if you might have multiple spaces or tabs between words (e.g., "million dollars" vs. "million   dollars"), replace each space in the Find expression above with [ ^s^t]{1,}. That will handle any number of spaces, NBSPs, or tabs.
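Put together as a macro, the VBA version described above could be sketched as follows, here for the direction the question asks about ("N million dollars" to "$N million"). The macro name is hypothetical, and searching ActiveDocument.Content rather than the Selection is an assumption made so that the whole document body is covered.

```vba
Sub MillionDollarsToDollarSign()
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .MatchWildcards = True
        ' (...) captures the digits so the replacement can reuse them as \1
        .Text = "([0-9]{1,}) million dollars"
        .Replacement.Text = "$\1 million"
        .Forward = True
        .Wrap = wdFindContinue
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

With this, "75 million dollars" should become "$75 million". On systems whose list separator is a semicolon, the repeat count may need to be written {1;} instead of {1,}.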

Space characters in "like" operator in VBA

I am trying to use the like function to distinguish between cells that begin (other than white space characters) with either 2 or 3 numerical digits followed by more white space characters, but seem to be having trouble identifying the latter. For example, for two cells, one containing
11 some text
and another containing
111 some text
I have been trying to write an if statement that is true for the first but not the second. I have tried
if cells(i,1) like "*##[ ]*" then
and
if cells(i,1) like "*##\s*" then
and
if cells(i,1) like "*##[^#]*" then
following information on using regex from various sources (with the first two I was trying to identify 2 digits followed by a white space character, and the third 2 digits followed by a non-digit).
It is part of a for loop, and as in the examples above, the only numerical digits are at the beginning of the string, other than sometimes white space characters. In the first code example the if statement is true for both 2 and 3 numerical digits, and for the second and third, it is true for neither. I don't understand this given what I have read about regex and the like function.
I would greatly appreciate guidance. I expected this to be relatively simple and so I'm sure I am missing something obvious, and apologies if this is the case.
VBA's like operator doesn't support RegEx. Instead, it has its own format. Spaces are matched using the literal value, which does not need escaping.
Input            Op    Pattern   Result
"11 Some Text"   LIKE  "## *"    True
"11Some Text"    LIKE  "## *"    False
For more see Microsoft's documentation.
If you would rather use RegEx take a look at this answer. @PortlandRunner has kindly taken the time to produce a great guide, that includes many examples.
I read through the material on this MSDN site and this seemed to work for me.
If Cells(i, 1).Value Like "## *" Then
    Debug.Print ("Match")
ElseIf Cells(i, 1).Value Like "### *" Then
    Debug.Print ("Match")
End If
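To see why the attempts in the question behave as they do, it may help to run a few patterns side by side. In this sketch StartsWithTwoDigits is a hypothetical helper, and using LTrim$ to skip leading whitespace is an assumption: it strips spaces only, not tabs. In the Like operator, # is a single digit, * is any run of characters, and [!...] negates a character list (the regex forms \s and [^...] are not understood).

```vba
' Hypothetical helper: True when the text, ignoring leading spaces,
' starts with exactly two digits followed by a space.
Function StartsWithTwoDigits(ByVal s As String) As Boolean
    StartsWithTwoDigits = (LTrim$(s) Like "## *")
End Function

Sub DemoLikePatterns()
    Debug.Print StartsWithTwoDigits("11 some text")     ' True
    Debug.Print StartsWithTwoDigits("   11 some text")  ' True
    Debug.Print StartsWithTwoDigits("111 some text")    ' False

    ' A leading * can absorb the first digit, so three digits also match:
    Debug.Print "111 some text" Like "*##[ ]*"          ' True

    ' [!0-9] is the Like way to say "not a digit":
    Debug.Print "11 some text" Like "##[!0-9]*"         ' True
    Debug.Print "111 some text" Like "##[!0-9]*"        ' False
End Sub
```

In the loop from the question this could be used as If StartsWithTwoDigits(Cells(i, 1).Value) Then.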

How do I check if a string has a paragraph character?

I need to check if a string from a Word document contains a paragraph character. I want to only extract the text without the paragraph character. Is there a constant for the paragraph character? I tried checking for vbNewLine and vbCrLf, but these didn't work.
Have a look at the Wikipedia article on newlines.
In total there are 3 characters indicating a new line (in some context), and sometimes they are used in combinations.
I think it does not matter which ones Word uses and which ones it doesn't; you want them all gone.
So, I'd say run through all characters and remove all LF, CR and RS instances, or replace them by spaces (whilst avoiding double spaces).
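As a concrete sketch: Word stores a paragraph mark as a single carriage return, Chr(13), which is the constant vbCr. On Windows, vbNewLine and vbCrLf are the two-character sequence CR+LF, which would explain why checking for them finds nothing. The function names below are hypothetical, and removing the manual line break, Chr(11), as well is an assumption about what "extract the text" should mean.

```vba
' Hypothetical helper: True when the string contains a Word paragraph mark.
Function HasParagraphMark(ByVal s As String) As Boolean
    HasParagraphMark = (InStr(s, vbCr) > 0)
End Function

' Hypothetical helper: removes paragraph marks and other line-break characters.
Function StripLineBreaks(ByVal s As String) As String
    Dim result As String
    result = Replace(s, vbCr, "")           ' paragraph mark (CR)
    result = Replace(result, vbLf, "")      ' line feed (LF)
    result = Replace(result, Chr$(11), "")  ' manual line break (assumed unwanted too)
    StripLineBreaks = result
End Function

Sub DemoParagraphMark()
    Dim s As String
    s = "First paragraph" & vbCr
    Debug.Print HasParagraphMark(s)          ' True
    Debug.Print InStr(s, vbCrLf) > 0         ' False
    Debug.Print Len(StripLineBreaks(s))      ' 15
End Sub
```

To replace the breaks with spaces instead, as suggested above, pass " " as the third argument to Replace and collapse any resulting double spaces afterwards.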