Find & Replace Long Blocks of Text (Microsoft Word) without losing the styles - vba

I would like to find and replace long blocks of text with line breaks in Microsoft Word. With ctrl+h I am able to find and replace text up to 255 characters. But if I want to search for a text which has line breaks, ctrl+h does not work. So, I looked in the internet and found the following code which helped me to delete the copied text from the document. But the Macro is also removing the styles in the document.
Here is the code I took from https://answers.microsoft.com/en-us/msoffice/forum/all/find-replace-long-blocks-of-text-microsoft-word/2fa77e32-9085-4c74-9d11-04d86829442f
Sub Remove_text()
Dim strText As String
Dim strReplacement As String
strText = Selection.Range.Text
strReplacement = "" 'Use this command to delete the instances of the selected text
'strReplacement = InputBox("Enter the text to be used for the replacement", "Find and replace long text")
With ActiveDocument.Range
.Text = Replace(.Text, strText, strReplacement)
End With
End Sub
For e.g. if I run the macro with the fallowing words in a word document and run the macro after selecting two words, the resultant text will be with all the words in bold characters.
Before
Cow
Rabbit
Ducks
Shrimp
Pig
Goat
Crab
Deer
Chicken
Seagull
Ostrich
After
Cow
Rabbit
Ducks
Shrimp
Pig
Goat
Crab
Deer
Chicken
Seagull
Ostrich
If I run the macro with the fist word in the list without bold and italics all the words are becoming plain as shown below.
Before
Cow
Rabbit
Ducks
Shrimp
Pig
Goat
Crab
Deer
Chicken
Seagull
Ostrich
After
Cow
Rabbit
Ducks
Shrimp
Pig
Goat
Crab
Deer
Chicken
Seagull
Ostrich
The above list is just an example. The same thing also happens in a large document. In that document all the styles will be replaced with the first word's style. So the styles of the whole document is depending on the first word's style. Can someone help me to remove the text with line breaks at the same time preserve the styles?

What you are describing in your comments has nothing to do with the format of the document. Deleting text does not change the document's format. Deleting Section breaks, though, can do so drastically.
I note that you have radically changed your description of the problem since first posting it and receiving feedback. Clearly, you've wasted everyone's time by giving a poor problem description in the first place.
The examples in your updated description clearly show that you're replacing content, not merely deleting it. In that case, to preserve the formatting, you need to limit your replacement content to the same range as whatever exhibits the format of the content you want to replace. Either that, or you must type and format the replacement content exactly how you want it to appear, then copy it to the clipboard and use ^c for the replacement expression.

Related

Finding and redacting text highlighted with a specific color, but while keeping the spaces and line breaks (to maintain doc layout)

I'm trying to use the VBA code from a similar question in this forum to redact text highlighted in a specific color, but I would like to keep the document layout, which means only replacing the words, but not the spaces and paragraph breaks in the document. Alternatively, I would be happy if we could identify the line breaks and put a space there.
At the end the document would not have large sections of unbroken text where words and spaces were replaced by XXXXXXXX and highlighted black. It the text would look more like XX X XXXX XXX X but all of it should be highlighted in black.
In other words, the text "Mary had a little lamb." would be redacted to "XXXX XXX X XXXXXX XXXXX" rather than XXXXXXXXXXXXXXXXXXXXXXXX.
I've tried changing the "If flag then" section to include unicode 32 (space) instead of the carriage return (unicode 13), but that doesn't seem to work.
Many thanks.
If flag Then
If Selection.Range.HighlightColorIndex = wdTurquoise Then
' Create replacement string
' If last character is a carriage return (unicode 13), then keep that carriage return
OldText = Selection.Text
OldLastChar = Right(OldText, 1)
NewLastChar = ReplaceChar
If OldLastChar Like String(1, 13) Then NewLastChar = String(1, 13)
NewText = String(Len(OldText) - 1, ReplaceChar) & NewLastChar
' Replace text, black block
Selection.Text = NewText
Selection.Font.ColorIndex = wdBlack
Selection.Font.Underline = False
Selection.Range.HighlightColorIndex = wdBlack
Selection.Collapse wdCollapseEnd
End If
End If
#freeflow has given you an answer in his comment on your post, but if you do that you should also include in the wildcard search, all potential punctuation characters excluding blank spaces.
However, with that said, I recommend you not try and eliminate punctuation characters and do not eliminate spaces between words. I’m recommending that because the purpose of redaction is to eliminate the possibility of someone comprehending what the redacted portion of the document originally contained. If you provide them clues, such as how many words in the sentence ... they can guess and sometimes be quite accurate because of the surrounding non-redacted script.
Oh course, that’s just my opinion.
To maintain document formatting, I suggest that you not use as replacement characters letters such as “X” because it is a wide character. I’ve found it better to use a symbol and I recommend a Wingdings character 127. It’s an average width and does a good job of balancing out sentence length ... but for added assurance I also recommend that you include in your replacement a Font.Spacing of -1, which will tighten up each redacted sentence even more.
In redacting, just be aware that maintaining the document formatting, no matter what your replacement character strategy might be, is very difficult. I’ve spent a lot of time experimenting with this and I’ve now shared what I do in my own redaction add-in. I don’t redact paragraph marks, I redact the entire highlighted string, including spaces and punctuation and I use a Wingding font character 127, set the Font.Spacing to -1, at the font color is the same as whatever color I’m using to highlight the redaction.
If you you are interested in seeing my add-in, do a Web search on AuthorTec Redactor.

(MS Word) Removing character style IF text has specific paragraph style applied

My Google-fu must be very weak today, ’cause this seems like an obvious thing to need to do sometimes, yet I cannot find a single case of anyone ever asking about it anywhere…
I have a document that I am preparing for proper typesetting in InDesign, which includes among other things getting rid of local overrides to paragraph and character styles. I did an find-and-replace to replace all instances of italic text with a character style called Italic, but stupidly forgot to limit this to text with the Normal style applied.
There are hundreds of headers strewn throughout the document which are supposed to be italic; that’s part of their paragraph style definitions. Since I forgot to limit the find-and-replace, the Italic style was applied to all these many headers. Annoyingly, since ‘italic’ is something like a boolean switch in Word, this means that all these headers are now not italic in the document.
I didn’t notice this for a while, so I can’t simply undo it now—the file has been saved and worked on since the find-and-replace. The author (who is a cantankerous, octogenarian technophobe) also needs to see the file again before it’s set, so while ultimately it doesn’t matter whether or not the font is italic in the Word document, it will matter to him.
So what I would very much like to do is to search for all text which has both the paragraph style Header and the character style Italic applied, and remove the character style.
This is an easy task in InDesign where paragraph and character styles are separate entities, but not in Word where they’re all lumped together in one big, messy pile. It doesn’t seem like it can be done through the UI, so I’m guessing I’ll have to resort to a VBA macro… which I’m utterly incompetent at.
Is there a way to find text with a particular paragraph style and a particular character style, and then remove the character style from that text?
here is some code to get you started
press F5 to run code, it will stop at Stop command
examine the Immediate window to determine the header style
each paragraph gets selected, so that you can tell which one you are examining
you can then modify the code with if/then statement to make specific paragraphs italic
Sub aaaa()
Dim ppp As Paragraph
Dim ccc As Range
For Each ppp In ActiveDocument.Content.Paragraphs
ppp.Range.Select ' visual aid only. not used by any other part of the program
Debug.Print "style :", ppp.Style
Debug.Print ppp.Style.Description
Stop
For Each ccc In ppp.Range.Characters ' you can probably comment out these 3 lines
Debug.Print ccc.FormattedText.Italic ' True prints as -1
Next ccc
Debug.Print "italic :", ppp.Range.Italic ' prints -1 if all are italic. 9999999 if some. 0 if none
ppp.Range.Italic = False ' this removes italic from whole paragraph
Next ppp
End Sub

Word VBA superscript issue

I received a word document that I have to parse. I am looping through the paragraph object (after I remove the tables) to get the formatted text.
For the most part this is working well except for the superscript formatting. I placed a debug.print....following...to discover something I thought was strange.
Selection.HomeKey Unit:=wdStory
For Each Para In ActiveDocument.Paragraphs
Debug.Print Para.Range.Text
Debug.Print Para.Range.FormattedText
Next
The output of my debug.print is as follows...both .text and .formattedtext..I have removed the duplicate lines.
kilo (k) 10³
centi (c) 10-2
milli (m) 10-³
micro (µ) 10-6
nano (n) 10-9
pico (p) 10-12
This is the strange part (for me).
The superscripts that worked, lines 1 and 3 have no discernible formatting but as you can see, the debug.print displays the desired format of the superscript.
The lines that did not work, lines 2,4,5,6 are marked as superscript in the font properties of the document.
My question is how can I preserve the superscript formatting as it appears in the word document? All of the lines in my document that should have superscript formatting appear correct in the document but when I try to read the paragraphs the lines that are formatted as superscript do not persist...the paragraphs that have the appearance of superscript but not the formatting codes come across as superscript. This document is the product of a contracted team that I have been unable to contact.
The document can be downloaded at the following link.
https://onedrive.live.com/redir?resid=C636FDD587E837CB!517&authkey=!ADjgA8mlRJqvv6U&ithint=file%2cdoc
Thanks in advance for your help.
Anthony Jaxon, Los Angeles, USA

MigraDoc Pdf long word rendering

What is about word-breaking.
I have a purpose to render LONG word in pdf, and I need to replace part of word on the next row. So now I get word that starts after some indent and finishes after right page side (I don't see word's end).
I use something like this:
var addInfo = paragraph.AddFormattedText("LOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOONG TIIIIIIIIIIIITLE");
I didn't find any tips about in MigraDoc documentation...
Of course I can implement the logic for splitting words up by myself, and I'll do this, if there is no native solution.
MigraDoc breaks sentences at spaces, hyphens, and soft hyphens. A single long word without any of these won't get broken.
Try something useful ("The quick brown fox ...") and breaking will work.
See also:
https://stackoverflow.com/a/40975554/162529

Convert text to image in Microsoft Word

I have a large book written in Microsoft Word and want to create a macro that will find all text using a predefined style and convert that text to an inline image. This text will be in Arabic and generally no longer than 4-5 lines. Is this possible?
UPDATE: Here's an example to show what I'm referring to:
I want to replace that entire line in Arabic with an image (as if I cropped this attached image to only include the Arabic and then replaced the line in Arabic with the image).
The reason I want a macro or script to do this is because there are hundreds of such lines and updating them one by one is cumbersome plus that will make modifications difficult later on.
UPDATE2: I found an interesting option here: http://windowssecrets.com/forums/showthread.php/31344-Convert-Text-to-an-Image-of-Text-in-VBA-(Office-2000-Sr1a)
It looks like you can cut a piece of text and then "Paste Special" as an image. So if there's a way to automate that that might work.
This is not an answer although I hope it will grow into a community answer. At the moment it is an exploration of what is required to solve the problem.
I know from the discussion when this question was posted on Super User that Abdullah wishes to publish his book on Kindle. So the question is really about how to get a document in English and Arabic ready for publication as an e-Book.
The Kindle does not support Arabic. The number of languages it does support is slowly increasing but there is no evidence I can find that Amazon has plans to add Arabic in the foreseeable future.
The format behind an Amazon e-Book is a cut down version of HTML. If a Word document containing Arabic letters is exported to HTML, the Arabic letters are included as character entities; for example: “ﭐ &#amp;64337; ﭒ ﭓ”. Importing the original Word or the HTML version to Kindle, results in the leading bits being discarded so these characters are displayed as P, Q, R and S instead of “ﭐ ﭑ ﭒ ﭓ (Alef Wasla isolated form, Alef Wasla final form, Beeh Wasla isolated form and Beeh Wasla final form).
I have tried Abdullah’s idea of saving some Arabic letters in a PNG file and creating an HTML file containing <p> … </p> <img src= “Arabic.png” > <p> … </p>. The appearance of this file on my Kindle 2 is perfectly acceptable so this has the potential to be a solution. The question is: how can the necessary conversions be performed?
We need to extract each Arabic string from either the Word document or its HTML equivalent and import it into a program that can convert them to PNG files.
The only way that I know of automating this would be to copy each string to a slide within PowerPoint. With PowerPoint’s SaveAs option it is possible to save each slide as a separate PNG file. The slides are named: SLIDE1.PNG, SLIDE2.PNG, SLIDE3.PNG and so on in sequence which would allow a macro to relate the results to the original strings. It would then be possible to replace the Arabic strings in the HTML file with the image elements. None of this would be too difficult to automate but there is a problem with the slides all being the size of the PowerPoint page. The page could be made smallish but what we need is for each slide to be cropped to just bigger than that slide’s text. I cannot think of any way of automating this cropping.
Does anyone have a better approach than converting each Arabic phrase to a PNG file?
I have been looking for PNG editors with some sort of command line interface but can find nothing that would be easier than using PowerPoint. Does anyone know of an alternative to PowerPoint?
Does anyone have any suggestions for automating the cropping of each image? When a string is placed in a PowerPoint slide it is possible to set its width to, say, 6.5cm (which looks good on my Kindle) and get the height determined by PowerPoint. This could be saved for later use if anyone knows how to use it.
Implementing solution
Pending any suggestions for improving the approach described above, the following outlines how I would implement it.
I would not attempt to process the Word document. I would save it as a Web Page, Filtered HTML file, which is a required step on the way to creating a Kindle eBook, and process that.
Within the HTML file created from my test document, the Arabic phrase comes out as:
<p class="MsoNormal"></p>
<p class="MsoNormal" align="center" style="text-align:center"><span dir="RTL"
style="font-size:24.0pt;font-family:Arial">
&#64336;&#64337;&#64338;&#64339;&#64340;&#64341;
&#64342;&#64343;&#65153;&#65154;&#65276;&#65275;
&#65274;&#65273;&#65246;&#65226;&#65227;&#65228;
</span><span style="font-size:24.0pt"></span></p>
<p class="MsoNormal"></p>
<p class="MsoNormal"></p>
I assume Abdullah's document will result in something similar. Note 1: the above is a random collection of Arabic letters. Note 2: they are held left-to-right in reading sequence even though, when displayed or printed, they are read right-to-left.
The whole of this block will have to be replaced with something like:
<br><imc src="xxxx.png"><br>
where the file xxxx.png holds an image of the Arabic text.
The file names, such as xxxx.png, could be systematic (A001.png, A002.png, ...) but I would have thought that transliterating the first ten or twenty characters of the phrase from the Arabic to English alphabets and using the result, with a numeric suffix, as the file name would be more convenient.
I would hold the records necessary to manage the process in an Excel worksheet. I would place the VBA code in the same workbook.
The steps in the conversion process that I envisage are:
VBA macro to extract Arabic strings from latest HTML file and add new strings to the Excel worksheet. (More about the Excel worksheet later.)
VBA macro to create PowerPoint file, with one slide per new string, and use SaveAs in PNG format to create one PNG file per slide before discarding the PowerPoint file.
Human to crop each PNG file. (There appears to be no way of automating the cropping so this task will be minimised by use of data in the Excel worksheet.)
VBA macro to rename each slide from SLIDEnnn.PNG to its permanent name and to record the permanent name in the Excel worksheet.
VBA macro to update the latest HTML file by replacing the block containing the Arabic phrase with the appropriate HTML IMG element.
The Excel worksheet needs two columns: Arabic phrase and PNG file name. If there is any risk of the worksheet being sorted between steps 2 and 4, we may need a sequence number as well.
Macro 1 will extract an Arabic phrase from the HTML file, look down the list in the worksheet for this phrase and add the phrase at the bottom if it is not already present.
Macro 2 will look for phrases in the worksheet that do not have a PNG file name. These new phrases are the ones to be written to the PowerPoint presentation. That is, a phrase only goes into this process once.
Task 3, cropping each PNG file, will be a pain. All I can say is that it will only be once per phrase.
Macro 4 will assume that the SLIDE001.PNG, SLIDE002.PNG, … are in the sequence of phrases without PNG files in the worksheet. If this might not be true (because the worksheet has been sorted) we will either need a sequence number or to retain the PowerPoint file. The macro will assign a unique name to each new phrase, record this name in the worksheet and rename the PNG file.
Macro 5 creates a new copy of the latest HTML file using the contents of the worksheet to determine which phrase to replace with which PNG file.
This process is not ideal but it will achieve the desired result and has no obvious complications. Any suggestions for improving it?
Before you begin these instructions, press record in the Microsoft Word macro editor, so you can see what the VBA code is.
I'm wondering if this will be easier if you convert the docx file to .rtf (rich text format) and replace that line with an image? Go to File > Save As.. > name it "old.rtf", then replace the line with an image and Save As.. again and name it "new.rtf" and then download Beyond Compare or your favorite diff program to see what happened. It should be easy to do this pro-grammatically if you choose to. I think working in text would be easier than Microsoft's binary format unless you can find a good library to modify their doc or docx formats.
Sub CopySelPasteAsPicture()
' Take a picture of a selection and paste it at the
' document end
With Selection
.CopyAsPicture
End With
ActiveDocument.Content.Select
With Selection
.Collapse Direction:=wdCollapseEnd
.TypeParagraph
.TypeParagraph
.PasteSpecial DataType:=wdPasteMetafilePicture
End With
End Sub