Word VBA: iterating through characters incredibly slow

I have a macro that changes single quotes in front of a number to an apostrophe (or close single curly quote). Typically when you type something like "the '80s" in Word, the apostrophe in front of the "8" faces the wrong way. The macro below works, but it is incredibly slow (like 10 seconds per page). In a regular language (even an interpreted one), this would be a fast procedure. Any insight into why it takes so long in VBA on Word 2007? Or if someone has some find+replace skills that can do this without iterating, please let me know.
Sub FixNumericalReverseQuotes()
Dim char As Range
Debug.Print "starting " + CStr(Now)
With Selection
total = .Characters.Count
' Will be looking ahead one character, so we need at least 2 in the selection
If total < 2 Then
Exit Sub
End If
For x = 1 To total - 1
a_code = Asc(.Characters(x))
b_code = Asc(.Characters(x + 1))
' We want to convert a single quote in front of a number to an apostrophe
' Trying to use all numerical comparisons to speed this up
If (a_code = 145 Or a_code = 39) And b_code >= 48 And b_code <= 57 Then
.Characters(x) = Chr(146)
End If
Next x
End With
Debug.Print "ending " + CStr(Now)
End Sub

Besides the two explicit questions (why is it slow, and how to do this without iterating), there is an implied one: how to iterate properly through a Word object collection.
The answer is to use obj.Next rather than access by index.
That is, instead of:
For i = 1 to ActiveDocument.Characters.Count
'Do something with ActiveDocument.Characters(i), e.g.:
Debug.Print ActiveDocument.Characters(i).Text
Next
one should use:
Dim ch as Range: Set ch = ActiveDocument.Characters(1)
Do
'Do something with ch, e.g.:
Debug.Print ch.Text
Set ch = ch.Next 'Note iterating
Loop Until ch is Nothing
Timing: 00:03:30 vs. 00:00:06, more than 3 minutes vs. 6 seconds.
Found on Google, link lost, sorry. Confirmed by personal exploration.

Modified version of @Comintern's "Array method":
Sub FixNumericalReverseQuotes()
Dim chars() As Byte
chars = StrConv(Selection.Text, vbFromUnicode)
Dim pos As Long
For pos = 0 To UBound(chars) - 1
If (chars(pos) = 145 Or chars(pos) = 39) _
And (chars(pos + 1) >= 48 And chars(pos + 1) <= 57) Then
' Make the change directly in the selection so track changes is sensible.
' I have to use 213 instead of 146 for reasons I don't understand--
' probably has to do with encoding on Mac, but anyway, this shows the change.
Selection.Characters(pos + 1) = Chr(213)
End If
Next pos
End Sub

Maybe this?
Sub FixNumQuotes()
Dim MyArr As Variant, MyString As String, X As Long, Z As Long
Debug.Print "starting " + CStr(Now)
For Z = 145 To 146
MyArr = Split(Selection.Text, Chr(Z))
For X = LBound(MyArr) To UBound(MyArr)
If IsNumeric(Left(MyArr(X), 1)) Then MyArr(X) = "'" & MyArr(X)
Next
MyString = Join(MyArr, Chr(Z))
Selection.Text = MyString
Next
Selection.Text = Replace(Replace(Selection.Text, Chr(146) & "'", "'"), Chr(145) & "'", "'")
Debug.Print "ending " + CStr(Now)
End Sub
I am not 100% sure of your criteria; I have turned both the opening and the closing single quote into a ', but you can change that quite easily if you want.
It splits the string into an array on Chr(145), checks whether the first character of each element is numeric, and prefixes it with a single quote if so.
Then it joins the array back into a string on Chr(145) and repeats the whole thing for Chr(146). Finally, it looks through the string for a single quote directly next to either of those curly quotes (which has to be something we just created) and replaces the pair with just the single quote we want. This leaves any occurrence not next to a number intact.
This final replacement is the part you would change if you want a character other than '.

I have been struggling with this for days now. My attempted solution was to use a regular expression on document.text. Then, using the matches in a document.range(start,end), replace the text. This preserves formatting.
The problem is that the start and end in the range do not match the index into text. I think I have found the discrepancy - hidden in the range are field codes (in my case they were hyperlinks). In addition, document.text has a bunch of BEL codes that are easy to strip out. If you loop through a range using the character method, append the characters to a string and print it you will see the field codes that don't show up if you use the .text method.
Amazingly you can get the field codes in document.text if you turn on "show field codes" in one of a number of ways. Unfortunately, that version is not exactly the same as what the range/characters shows - the document.text has just the field code, the range/characters has the field code and the field value. Therefore you can never get the character indices to match.
I have a working version where instead of using range(start,end), I do something like:
Set matchRange = doc.Range.Characters(myMatches(j).FirstIndex + 1)
matchRange.Collapse (wdCollapseStart)
Call matchRange.MoveEnd(WdUnits.wdCharacter, myMatches(j).Length)
matchRange.text = Replacement
As I say, this works but the first statement is dreadfully slow - it appears that Word is iterating through all of the characters to get to the correct point. In doing so, it doesn't seem to count the field codes, so we get to the correct point.
Bottom line, I have not been able to come up with a good way to match the indexing of the document.text string to an equivalent range(start,end) that is not a performance disaster.
Ideas welcome, and thanks.

This is a problem begging for regular expressions. Resolving the .Characters calls that many times is probably what is killing you in performance.
I'd do something like this:
Public Sub FixNumericalReverseQuotesFast()
Dim expression As RegExp
Set expression = New RegExp
Dim buffer As String
buffer = Selection.Range.Text
expression.Global = True
expression.MultiLine = True
expression.Pattern = "[" & Chr$(145) & Chr$(39) & "]\d"
Dim matches As MatchCollection
Set matches = expression.Execute(buffer)
Dim found As Match
For Each found In matches
buffer = Replace(buffer, found, Chr$(146) & Right$(found, 1))
Next
Selection.Range.Text = buffer
End Sub
NOTE: Requires a reference to Microsoft VBScript Regular Expressions 5.5 (or late binding).
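For the late-binding option mentioned in the note, a minimal sketch might look like the following. The function name FixLeadingQuotes is made up for illustration. It assumes the VBScript.RegExp COM object is available on the machine and does the whole substitution in a single Replace call with a backreference, instead of looping over the matches.

```vba
' Hypothetical helper (not from the original answer): returns a copy of
' source in which a straight or opening curly single quote directly in
' front of a digit is replaced by a closing curly quote.
Function FixLeadingQuotes(ByVal source As String) As String
    Dim re As Object
    ' Late binding: no reference to the regex library is needed.
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' Chr$(145) = opening curly quote, Chr$(39) = straight quote.
    re.Pattern = "[" & Chr$(145) & Chr$(39) & "](\d)"
    ' $1 puts the captured digit back after the new quote.
    FixLeadingQuotes = re.Replace(source, Chr$(146) & "$1")
End Function
```

It could be called as Selection.Range.Text = FixLeadingQuotes(Selection.Range.Text). Like the early-bound version, this rewrites the whole range as plain text.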
EDIT:
The solution without using the Regular Expressions library is still avoiding working with Ranges. This can easily be converted to working with a byte array instead:
Sub FixNumericalReverseQuotes()
Dim chars() As Byte
chars = StrConv(Selection.Text, vbFromUnicode)
Dim pos As Long
For pos = 0 To UBound(chars) - 1
If (chars(pos) = 145 Or chars(pos) = 39) _
And (chars(pos + 1) >= 48 And chars(pos + 1) <= 57) Then
chars(pos) = 146
End If
Next pos
Selection.Text = StrConv(chars, vbUnicode)
End Sub
Benchmarks (100 iterations, 3 pages of text with 100 "hits" per page):
Regex method: 1.4375 seconds
Array method: 2.765625 seconds
OP method: (Ended task after 23 minutes)
About half as fast as the Regex, but still roughly 10ms per page.
EDIT 2: Apparently the methods above are not format safe, so method 3:
Sub FixNumericalReverseQuotesVThree()
Dim full_text As Range
Dim cached As Long
Set full_text = ActiveDocument.Range
full_text.Find.ClearFormatting
full_text.Find.MatchWildcards = True
cached = full_text.End
Do While full_text.Find.Execute("[" & Chr$(145) & Chr$(39) & "][0-9]")
full_text.End = full_text.Start + 2
full_text.Characters(1) = Chr$(146)
full_text.Start = full_text.Start + 1
full_text.End = cached
Loop
End Sub
Again, slower than both the above methods, but still runs reasonably fast (on the order of ms).

Related

Extracting Date/Time from comment cell

I have a comment field with cells containing text like this:
Cancelled by user at 2018-01-03 03:11:57 without charge
I want to get the date and time information, but it may not always be in the 3rd/4th from last spaces, otherwise I might try to do some sort of complicated split of the cell. Is there an "in cell" way extract the date time information? Or will this need a VBA script? I prefer the former, but I'm trying to make a macro to simplify my life anyway, so VBA would work too.
I'd propose the following formula:
=MID(A1,FIND("at 20",A1)+3,19)
This would require that the date is always preceded by the word 'at' and the date string starts with 20.
You can try this function. It splits the string checking for items that have the first letter numeric, and builds a result string of just the date information.
Public Function ParseForDate(sCell As String) As String
Dim vSplit As Variant
Dim nIndex As Integer
Dim sResult As String
vSplit = Split(sCell, " ")
For nIndex = 0 To UBound(vSplit)
If IsNumeric(Left$(vSplit(nIndex), 1)) Then
sResult = sResult & vSplit(nIndex) & " "
End If
Next
ParseForDate = Trim$(sResult)
End Function
If you wanted to use it in a formula it would look something like this:
=ParseForDate(A1)
To use it in a VBA routine:
Dim s as String
s = ParseForDate(Range("A1"))
Non-VBA solution: (this is assuming the date format is always the same for all cells)
= MAX(IFERROR(DATEVALUE(MID(A1,ROW(INDEX($A:$A,1):INDEX($A:$A,LEN(A1)-19)),20)),0))
+MAX(IFERROR(TIMEVALUE(MID(A1,ROW(INDEX($A:$A,1):INDEX($A:$A,LEN(A1)-19)),20)),0))
Note this is an array formula, so you must press Ctrl+Shift+Enter instead of just Enter when typing this formula.
You will obviously then need to format the cell as a date and time, but this formula gets the numerical value that Excel uses for its internal date and time system.
Using a regex will enable you to fetch the date and time, irrespective of its placement in the string. The following solution will work if the date and time are of the same format as shown in the example string.
Code:
Sub getDateTime()
Dim objReg, matches, str
str = Sheet1.Cells(1, 1).Value 'Change this as per your requirements
Set objReg = CreateObject("vbscript.regexp")
objReg.Global = True
objReg.Pattern = "\d{4}(?:-\d{2}){2}\s*\d{2}(?::\d{2}){2}"
If objReg.test(str) Then
Set matches = objReg.Execute(str)
strResult = matches.Item(0)
MsgBox strResult
End If
End Sub
Regex Explanation:
\d{4} - matches 4 digits representing the year
(?:-\d{2}){2} - matches - followed by 2 digits. {2} in the end repeats this match 2 times. Once for getting MM and the next time for DD
\s* - matches 0+ whitespaces to match the space between the Date and Time
\d{2} - matches 2 digits representing the HH
(?::\d{2}){2} - matches : followed by 2 digits. The {2} in the end repeats this match 2 times. First time for matching the :MM and the next time for matching the :SS
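Since the question also asked for something usable from a cell, the same pattern could be wrapped in a user-defined function. This is only a sketch: the name ExtractDateTime is hypothetical, and it assumes the timestamp always has the yyyy-mm-dd hh:mm:ss shape with a space between date and time, so that CDate can parse the matched text.

```vba
' Hypothetical UDF: returns the first yyyy-mm-dd hh:mm:ss timestamp found
' in source as a Date, or #N/A if there is none.
Function ExtractDateTime(ByVal source As String) As Variant
    Dim re As Object
    Dim matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d{4}(?:-\d{2}){2}\s*\d{2}(?::\d{2}){2}"
    If re.Test(source) Then
        Set matches = re.Execute(source)
        ' Assumes the matched text is something CDate understands.
        ExtractDateTime = CDate(matches(0).Value)
    Else
        ExtractDateTime = CVErr(xlErrNA)
    End If
End Function
```

In a worksheet this would be used as =ExtractDateTime(A1), with the cell formatted as a date and time.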
This will be good for about 90 years (using cell C3 for example):
Sub GetDate()
Dim s As String
s = Range("C3").Comment.Text
arr = Split(s, " ")
For i = LBound(arr) To UBound(arr)
If Left(arr(i), 2) = "20" Then
msg = arr(i) & " " & arr(i + 1)
MsgBox msg
Exit Sub
End If
Next i
End Sub

Excel VBA Using wildcard to replace string within string

I have a difficult situation and so far no luck in finding a solution.
My VBA collects number figures like $80,000.50. and I'm trying to get VBA to remove the last period to make it look like $80,000.50 but without using right().
The problem is after the last period there are hidden spaces or characters which will be a whole lot of new issue to handle so I'm just looking for something like:
replace("$80,000.50.",".**.",".**")
Is this possible in VBA?
I can't leave a comment, so I'll post this as an answer:
What about InStrRev?
Private Sub this()
Dim this As String
this = "$80,000.50."
this = Left(this, InStrRev(this, ".") - 1)
Debug.Print ; this
End Sub
Mid + Find
You can use Mid and Find functions. Like so:
The Find will find the first dot . character. If all the values you are collecting are currency with 2 decimals, stored as text, this will work well.
The formula is: =MID(A2,1,FIND(".",A2)+2)
VBA solution
Function getStringToFirstOccurence(inputUser As String, FindWhat As String) As String
getStringToFirstOccurence = Mid(inputUser, 1, WorksheetFunction.Find(FindWhat, inputUser) + 2)
End Function
Other possible solutions, hints
Trim + Clear + Substitute(Char(160)): Chandoo -
Untrimmable Spaces – Excel Formula
Ultimately, you can implement Regular expressions into Excel UDF: VBScript’s Regular Expression Support
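As a rough illustration of the regular-expression route, the sketch below removes the last period together with whatever follows it (spaces or hidden characters). The function name StripFromLastPeriod is made up, and it assumes the trailing period is always present; on a value without one it would cut off the decimals instead.

```vba
' Hypothetical UDF: drops the last "." in source and everything after it.
' Assumption: the input always ends in an unwanted period plus optional junk.
Function StripFromLastPeriod(ByVal source As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    ' A period, then any run of non-period characters up to the end of the string.
    re.Pattern = "\.[^.]*$"
    StripFromLastPeriod = re.Replace(source, "")
End Function
```

Called from a cell it would look like =StripFromLastPeriod(A2).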
How about:
Sub dural()
Dim r As Range
For Each r In Selection
s = r.Text
l = Len(s)
For i = l To 1 Step -1
If Mid(s, i, 1) = "." Then
r.Value = Mid(s, 1, i - 1) & Mid(s, i + 1)
Exit For
End If
Next i
Next r
End Sub
This will remove the last period and leave all the other characters intact.
EDIT#1:
This version does not require looping over the characters in the cell:
Sub qwerty()
Dim r As Range
For Each r In Selection
If InStr(r.Value, ".") > 0 Then r.Characters(InStrRev(r.Text, "."), 1).Delete
Next r
End Sub
Shortest Solution
Simply use the Val command. I assume this is meant to be a numerical figure anyway? Get rid of commas and the dollar sign, then convert to value, which will ignore the second point and any other trailing characters! Robustness not tested, but seems to work...
Dim myString as String
myString = "$80,000.50. junk characters "
' Remove commas and dollar signs, then convert to value.
Dim myVal as Double
myVal = Val(Replace(Replace(myString,"$",""),",",""))
' >> myVal = 80000.5
' If you're really set on getting a formatted string back, use Format:
myString = Format(myVal, "$#,##0.00")
' >> myString = $80,000.50
From the Documentation,
The Val function stops reading the string at the first character it can't recognize as part of a number. Symbols and characters that are often considered parts of numeric values, such as dollar signs and commas, are not recognized.
This is why we must first remove the dollar sign, and why it ignores all the junk after the second dot, or for that matter anything non numerical at the end!
Working with Strings
Edit: I wrote this solution first but now think the above method is more comprehensive and shorter - left here for completeness.
Trim() removes leading and trailing spaces from a string (spaces only; it does not touch other hidden characters such as non-breaking spaces). Then you can simply use Left() to get rid of the final period...
' String with trailing spaces and a final dot
Dim myString as String
myString = "$80,000.50. "
' Get rid of whitespace at end
myString = Trim(myString)
' Might as well check if there is a final dot before removing it
If Right(myString, 1) = "." Then
myString = Left(myString, Len(myString) - 1)
End If
' >> myString = "$80,000.50"

Get maximum number of characters that a cell can contain

Excel specifications and limits says:
Total number of characters that a cell can contain: 32,767 characters
Is there a way to get this number programmatically?
I'm asking because hardcoding constants should, in general, be avoided if and when feasible. This number may conceivably change by Office version (It hasn't changed between 2003 and 2013, but who knows what Microsoft has in store for us).
It's pretty easy to get the maximum number of rows in a worksheet:
Sheet1.Rows.Count ' returns 65,536 in Office 2003 and 1,048,576 in Office 2007-2013
but apparently, getting the maximum number of characters that a cell can contain isn't as straightforward.
Note that writing too many characters to a cell will not result in an error; it will silently fail and truncate the string — so proper error handling isn't an option here.
In a loop, append characters one by one to the cell contents. Each time, read cell contents, check if the last character added is present. If it isn't then that's the limit.
Upside: Works and is 100% reliable.
Downside: Really slow. It takes 10-15 seconds to complete, due to the many read-writes to/from sheet.
Obviously, this could be optimised by using a good guess (e.g. 32,767) as the initial condition, and using a hunt & bisect search algorithm rather than incrementing by 1. However if the answer is far enough away from the initial guess, this might still take ~1 second to run — not something you would want to call repeatedly.
Function MaximumNumberOfCharactersACellCanContain(r As Range)
'NB: Range r will be overwritten.
Dim sIn As String
Dim sOut As String
Dim i As Long
Application.ScreenUpdating = False
Do
i = i + 1
sIn = sIn & Chr(97 + (i - 1) Mod 26)
r.Cells(1, 1).Value = sIn
sOut = r.Cells(1, 1).Value
If Right(sOut, 1) <> Right(sIn, 1) Then Exit Do
'If Len(sOut) <> Len(sIn) Then Exit Do
Loop
Application.ScreenUpdating = True
MaximumNumberOfCharactersACellCanContain = i - 1
End Function
Example usage:
MsgBox MaximumNumberOfCharactersACellCanContain(Range("A1"))
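The hunt-and-bisect optimisation mentioned above could be sketched roughly as follows. The names FitsInCell and MaxCellCharsBisect are hypothetical. The sketch assumes, as described in the question, that an over-long write is silently truncated rather than raising an error, and it does not guard against running out of memory if the limit were very large.

```vba
' Hypothetical helper: True if a string of n characters survives a round
' trip through the top-left cell of r. NB: overwrites that cell.
Private Function FitsInCell(ByVal r As Range, ByVal n As Long) As Boolean
    r.Cells(1, 1).Value = String$(n, "a")
    FitsInCell = (Len(r.Cells(1, 1).Value) = n)
End Function

Function MaxCellCharsBisect(ByVal r As Range) As Long
    Dim lo As Long      ' largest length known to fit
    Dim hi As Long      ' smallest length known not to fit (after the hunt)
    Dim middle As Long
    lo = 0
    hi = 1
    ' Hunt: keep doubling until a length no longer fits.
    Do While FitsInCell(r, hi)
        lo = hi
        hi = hi * 2
    Loop
    ' Bisect: narrow the gap between lo and hi down to 1.
    Do While hi - lo > 1
        middle = lo + (hi - lo) \ 2
        If FitsInCell(r, middle) Then lo = middle Else hi = middle
    Loop
    MaxCellCharsBisect = lo
End Function
```

This needs a few dozen cell writes instead of tens of thousands, at the cost of each write being a longer string.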
Alternative: loop, appending a chunk each time, until the resulting length is not what's expected:
Function MaxLen() As Long
' NB: overwrites the active cell.
Const INT_MAX As Integer = 32767
Dim i As Long
ActiveCell.Value = ""
Dim buff As String: buff = Space$(INT_MAX)
Do
i = i + 1
ActiveCell.Value = ActiveCell.Value & buff
If Len(ActiveCell.Value) <> (i * INT_MAX) Then
MaxLen = Len(ActiveCell.Value)
Exit Function
End If
Loop
End Function
Or even
ActiveCell.Value = Space$(A_BIG_NUMBER)
MaxLen = Len(ActiveCell.Value)
Here's a variant where we take exponential steps (larger and larger steps whose size increases by a factor stepFactor each time).
Function MaximumNumberOfCharactersACellCanContain(r As Range, _
Optional ByVal stepFactor As Double = 2)
Dim n As Double
Dim nActual As Long
Dim l As Long
n = 1
Do
n = n * stepFactor
nActual = CLng(n)
r.Cells(1, 1).Value = Space$(nActual)
l = Len(r.Cells(1, 1).Value)
If l <> nActual Then
MaximumNumberOfCharactersACellCanContain = l
Exit Function
End If
Loop
End Function
Example usage:
Debug.Print MaximumNumberOfCharactersACellCanContain(Range("A1"), 8)
The choice of stepFactor is a compromise between:
Reducing the number of iterations (a larger factor is better),
Limiting the cost of the last iteration (the one that fails). If stepFactor is too large, you end up writing a very long string to the cell, which is quite slow.
Making sure the last iteration never hits the out-of-memory ceiling (~130 million characters on my system). (Error handling could be added to deal with this eventuality.)
A stepFactor somewhere between 2 and 8 should be robust and quick.

VBA "out of memory" error when Excel consume only 70MB

Q: Why do I get "out of memory" when my system has plenty of it left (and Office is 64-bit)?
Q: Could it be that the data, when split, causes this strange behavior?
Q: If splitting the string causes the trouble, how can I sanitize/restore it just for storing and restoring that string?
Specs: Win 8.1 Pro + Office 2013 64-bit, 8 GB RAM in the system.
And here is the code, which just takes a single LARGE (~1-2 MB) string and splits it across multiple cells, so that the 32k-characters-per-cell limit does no harm:
Public Sub SaveConst(str As String)
Dim i As Long
i = 0
' Clear prior data
Do While LenB(Range("ConstJSON").Offset(0, i)) <> 0
Range("ConstJSON").Offset(0, i) = ""
i = i + 1
Loop
Dim strLen As Long
With Range("ConstJSON")
.Offset(0, 0) = Left$(str, 30000)
i = 1
strLen = Len(str)
Debug.Print strLen
Do While strLen > i * 30000
.Offset(0, i) = Mid$(str, i * 30000 + 1, 30000)
Debug.Print i
i = i + 1
Loop
End With
End Sub
Right now Len(str) reports ~270k characters, i gets up to the 4th iteration, and then the "Out of memory" error kicks in.
This is the n-th occurrence of this bug in this place. I have simplified/modified the code so that it works sometimes, for the exact same data set.
UPDATE:
Thanks to Jean's code, I'm confident that it is SAVING a partial string to the cell that causes the error.
.Offset(0, i) = Mid$(str, i * 30000 + 1, 30000)
Or
Range("ConstJSON").Resize(nPieces).Value2 = v
Both cause errors.
UPDATE 2:
I used to save that string to a single cell without any fuss. But now that the string has grown too big to fit, splitting it sometimes causes the "Out of memory" error.
Exemplary string:
[...]
""ebiZlecenias"":[{""id"":""91a75940-6d3e-06f8-bcf7-28ecd49e85f2"",""lp"":null,""name"":""ZLECENIE
GŁÓWNE"",""date_entered"":""2014-04-15
08:13:18"",""date_modified"":""2014-04-15
08:13:18"",""modified_user_id"":""2"",""budowa_id"":""8614aab5-29da-ffac-4865-e8c5913c729c"",""rodzaj"":""1"",""etap"":""1"",""data_akceptacji"":null,""opis"":null,""user_id"":null,""data_bazowa_od"":null,""data_bazowa_do"":null,""data_rzeczywista_od"":null,""data_rzeczywista_do"":null,""archiwalny"":null,""deleted"":null,""termin_raportowania"":null,""okres_raportowania"":null,
[...]
EDIT: I believe the problem with your specimen string is that some of the substrings begin with a "-". When that happens, Excel thinks the contents is a formula, and that is what causes the error. Pre-formatting the cell as text did not correct the problem, but preceding each entry with a 'single quote', which coerces the entry to text and will not show up except in the formula bar, seems to have corrected the problem in my macros, even when using your specimen string above as the "base" string.
EDIT2: What seems to be happening is that, if the string length is greater than 8,192 characters (the longest allowed in a formula), and also starts with a token that makes Excel think it might be a formula (e.g: -, +, =), the write to the cell will fail with an out of memory error EVEN IF the cell is formatted as text. This does not happen if the single quote is inserted first.
Below is some code that works on much longer strings.
The code below first creates a long string, in this case the string is slightly more than 100,000,000 characters, and then splits it into sequential columns. No errors:
Option Explicit
Sub MakeLongString()
Dim S As String
Const strLEN As Long = 100 * 10 ^ 6
Const strPAT As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
S = strPAT
Do
S = S & S
Loop Until Len(S) > strLEN
Debug.Print Format(Len(S), "#,###")
SplitString (S)
Debug.Print Range("a1").End(xlToRight).Column
End Sub
Sub SplitString(STR)
Dim R As Range
Dim strLEN As Long
Set R = [a1]
Dim I As Long
strLEN = Len(STR)
Do Until I > strLEN
R(1, I / 30000 + 1) = "'" & Mid(STR, I + 1, 30000)
I = I + 30000
Loop
End Sub
I just ran a test where the range being written to was a multi-cell range, and the target was set by the Offset method as you did, and it also ran to completion without error, filling in the first four rows.
Sub SplitString(STR)
Dim R As Range
Dim strLEN As Long
Set R = [a1:a4]
Dim I As Long
strLEN = Len(STR)
Do Until I > strLEN
R.Offset(, I / 30000) = "'" & Mid(STR, I + 1, 30000)
I = I + 30000
Loop
End Sub
This is worth a try: first split the string into an array, then slap that entire array onto the sheet at once.
Const pieceLength As Long = 3000
Dim s As String
Dim i As Long
Dim nPieces As Long
Dim v As Variant
s = ... ' whatever your string is...
nPieces = WorksheetFunction.Ceiling(Len(s) / pieceLength, 1)
ReDim v(1 To nPieces, 1 To 1)
For i = 1 To nPieces
v(i, 1) = Mid(s, pieceLength * (i - 1) + 1, pieceLength)
Next i
Range("ConstJSON").Resize(nPieces).Value2 = v
I haven't tested your code, so can't say exactly what's wrong with it, but I know that writing to (or reading from) individual cells one at a time is slow and expensive; it's usually much better to read/write large swaths of cells to/from arrays, and manipulate the arrays (instead of the cells).
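The split-into-array idea can be packaged as a pair of pure helpers, which also covers the "restore" half of the question. This is a sketch only: SplitIntoPieces and JoinPieces are hypothetical names, and it deliberately leaves out the sheet write and any single-quote prefixing so that the slicing arithmetic can be checked on its own.

```vba
' Hypothetical helper: cuts s into consecutive pieces of at most
' pieceLength characters and returns them as an (n x 1) Variant array,
' ready to be assigned to a column of n cells.
Function SplitIntoPieces(ByVal s As String, ByVal pieceLength As Long) As Variant
    Dim nPieces As Long
    Dim i As Long
    Dim v() As Variant
    ' Integer ceiling of Len(s) / pieceLength; at least one (empty) piece.
    nPieces = (Len(s) + pieceLength - 1) \ pieceLength
    If nPieces = 0 Then nPieces = 1
    ReDim v(1 To nPieces, 1 To 1)
    For i = 1 To nPieces
        ' Piece i starts at offset pieceLength * (i - 1) + 1 (Mid is 1-based).
        v(i, 1) = Mid$(s, pieceLength * (i - 1) + 1, pieceLength)
    Next i
    SplitIntoPieces = v
End Function

' Hypothetical helper: concatenates the first column of a 2-D array
' (for example one read back from a multi-cell range) into one string.
Function JoinPieces(ByVal v As Variant) As String
    Dim i As Long
    For i = LBound(v, 1) To UBound(v, 1)
        JoinPieces = JoinPieces & v(i, 1)
    Next i
End Function
```

Because the helpers never touch the sheet, a round trip such as JoinPieces(SplitIntoPieces(s, 30000)) = s can be checked in the Immediate window before any cells are written.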

How to normalize filenames listed in a range

I have a list of filenames in a spreadsheet in the form of "Smith, J. 010112.pdf". However, they're in the varying formats of "010112.pdf", "01.01.12.pdf", and "1.01.2012.pdf". How could I change these to one format of "010112.pdf"?
Personally I hate using VBA where worksheet functions will work, so I've worked out a way to do this with worksheet functions. Although you could cram this all into one cell, I've broken it out into a lot of independent steps in separate columns so you can see how it's working, step by step.
For simplicity I'm assuming your file name is in A1
B1 =LEN(A1)
determine the length of the filename
C1 =SUBSTITUTE(A1," ","")
replace spaces with nothing
D1 =LEN(C1)
see how long the string is if you replace spaces with nothing
E1 =B1-D1
determine how many spaces there are
F1 =SUBSTITUTE(A1," ",CHAR(8),E1)
replace the last space with a special character that can't occur in a file name
G1 =SEARCH(CHAR(8), F1)
find the special character. Now we know where the last space is
H1 =LEFT(A1,G1-1)
peel off everything before the last space
I1 =MID(A1,G1+1,255)
peel off everything after the last space
J1 =FIND(".",I1)
find the first dot
K1 =FIND(".",I1,J1+1)
find the second dot
L1 =FIND(".",I1,K1+1)
find the third dot
M1 =MID(I1,1,J1-1)
find the first number
N1 =MID(I1,J1+1,K1-J1-1)
find the second number
O1 =MID(I1,K1+1,L1-K1-1)
find the third number
P1 =TEXT(M1,"00")
pad the first number
Q1 =TEXT(N1,"00")
pad the second number
R1 =TEXT(O1,"00")
pad the third number
S1 =IF(ISERR(K1),M1,P1&Q1&R1)
put the numbers together
T1 =H1&" "&S1&".pdf"
put it all together
It's kind of a mess because Excel hasn't added a single new string manipulation function in over 20 years, so things that should be easy (like "find last space") require severe trickery.
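For comparison, in VBA the "find last space" step is a single InStrRev call. The sketch below applies the same idea as the columns above. The name NormalizeFileName is made up, and it assumes the date is the last space-separated token, that dotted dates have exactly three parts, and that the year should be cut to its last two digits.

```vba
' Hypothetical helper: "Smith, J. 1.01.2012.pdf" -> "Smith, J. 010112.pdf".
' Names without a dotted date are returned unchanged.
Function NormalizeFileName(ByVal fileName As String) As String
    Dim lastSpace As Long
    Dim prefix As String
    Dim tail As String
    Dim parts() As String
    lastSpace = InStrRev(fileName, " ")
    prefix = Left$(fileName, lastSpace)      ' up to and including the last space
    tail = Mid$(fileName, lastSpace + 1)     ' e.g. "1.01.2012.pdf"
    parts = Split(tail, ".")
    If UBound(parts) = 3 Then
        ' Three date parts plus the extension: pad the first two, shorten the year.
        NormalizeFileName = prefix & Format$(CLng(parts(0)), "00") _
            & Format$(CLng(parts(1)), "00") & Right$(parts(2), 2) & "." & parts(3)
    Else
        NormalizeFileName = fileName
    End If
End Function
```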
Here's a screenshot of a simple four-step method based on Excel commands and formulas, as suggested in a comment to the answered post (with a few changes)...
This function below works. I've assumed that the date is in ddmmyy format, but adjust as appropriate if it's mmddyy -- I can't tell from your example.
Function FormatThis(str As String) As String
Dim strDate As String
Dim iDateStart As Long
Dim iDateEnd As Long
Dim temp As Variant
' Pick out the date part
iDateStart = GetFirstNumPosition(str, False)
iDateEnd = GetFirstNumPosition(str, True)
strDate = Mid(str, iDateStart, iDateEnd - iDateStart + 1)
If InStr(strDate, ".") <> 0 Then
' Deal with the dot delimiters in the date
temp = Split(strDate, ".")
strDate = Format(DateSerial( _
CInt(temp(2)), CInt(temp(1)), CInt(temp(0))), "ddmmyy")
Else
' No dot delimiters... assume date is already formatted as ddmmyy
' Do nothing
End If
' Piece it together
FormatThis = Left(str, iDateStart - 1) _
& strDate & Right(str, Len(str) - iDateEnd)
End Function
This uses the following helper function:
Function GetFirstNumPosition(str As String, startFromRight As Boolean) As Long
Dim i As Long
Dim startIndex As Long
Dim endIndex As Long
Dim indexStep As Integer
If startFromRight Then
startIndex = Len(str)
endIndex = 1
indexStep = -1
Else
startIndex = 1
endIndex = Len(str)
indexStep = 1
End If
For i = startIndex To endIndex Step indexStep
If Mid(str, i, 1) Like "[0-9]" Then
GetFirstNumPosition = i
Exit For
End If
Next i
End Function
To test:
Sub tester()
MsgBox FormatThis("Smith, J. 01.03.12.pdf")
MsgBox FormatThis("Smith, J. 010312.pdf")
MsgBox FormatThis("Smith, J. 1.03.12.pdf")
MsgBox FormatThis("Smith, J. 1.3.12.pdf")
End Sub
They all return "Smith, J. 010312.pdf".
You don't need VBA. Start by replacing the "."s with nothing:
=SUBSTITUTE(A1,".","")
This will change the ".PDF" to "PDF", so let's put that back:
=SUBSTITUTE(SUBSTITUTE(A1,".",""),"pdf",".pdf")
Got awk? Get the data into a text file, and
awk -F'.' '{ if(/[0-9]+\.[0-9]+\.[0-9]+/) printf("%s. %02d%02d%02d.pdf\n", $1, $2, $3, length($4) > 2 ? substr($4,3,2) : $4); else print $0; }' your_text_file
Assuming the data are exactly as what you described, e.g.,
Smith, J. 010112.pdf
Mit, H. 01.02.12.pdf
Excel, M. 8.1.1989.pdf
Lec, X. 06.28.2012.pdf
DISCLAIMER:
As @Jean-FrançoisCorbett has mentioned, this does not work for "Smith, J. 1.01.12.pdf". Instead of reworking this completely, I'd recommend his solution!
Option Explicit
Function ExtractNumerals(Original As String) As String
'Pass everything up to and including ".pdf", then concatenate the result of this function with ".pdf".
'This will not return the ".pdf" if passed, which is generally not my ideal solution, but it's a simpler form that still should get the job done.
'If you have varying extensions, then look at the code of the test sub as a guide for how to compensate for the truncation this function creates.
Dim i As Integer
Dim bFoundFirstNum As Boolean
For i = 1 To Len(Original)
If IsNumeric(Mid(Original, i, 1)) Then
bFoundFirstNum = True
ExtractNumerals = ExtractNumerals & Mid(Original, i, 1)
ElseIf Not bFoundFirstNum Then
ExtractNumerals = ExtractNumerals & Mid(Original, i, 1)
End If
Next i
End Function
I used this as a testcase, which does not correctly cover all your examples:
Sub test()
MsgBox ExtractNumerals("Smith, J. 010112.pdf") & ".pdf"
End Sub