How to split a Unicode string into readable characters? - vba

I have a VBA function that splits a string and adds a space between each character. It works fine, but only for ASCII strings. I want to do the same for the Tamil language. Since Tamil is Unicode text, the result is not readable: the function also splits off the auxiliary characters (upper dots, prefix and suffix signs), which should never be separated from their base letter in Tamil, Hindi, Kannada, Malayalam, or any other Indian language. So how can I write a function that splits a Tamil word into readable characters?
Function AddSpace(Str As String) As String
Dim i As Long
For i = 1 To Len(Str)
AddSpace = AddSpace & Mid(Str, i, 1) & " "
Next i
AddSpace = Trim(AddSpace)
End Function
Adding Space is not the important point of this question. Splitting the Unicode string into an array from any of those languages is the requirement.
For example, the word, "பார்த்து" should be separated as "பா ர் த் து", not as "ப ா ர ் த ் த ு". As you can see, the first two letters "பா" (ப + ா) are combined. If I try to manually put a space in between them, I can't do it in any word processor. If you want to test, please put it in Notepad and add space between each character. It won't allow you to separate as ("ப ா"). So "பார்த்து" should be separated as "பா ர் த் து". It is the correct separation in Tamil like languages. This is the one that I am struggling to achieve in VBA.
The Character Code table for Tamil is here.
Tamil, Hindi, and many other Indian languages have (1) consonants, (2) independent vowels, (3) dependent vowel signs, and (4) two-part dependent vowel signs. The first two types are each a separate letter, so there is no issue with them. The last two are dependent and should not be separated from the character they join. For example, the letter பா (ப + ா) contains one standalone letter (ப) and one dependent sign (ா).
If this info is not enough, please comment on what else I should post.
(Note: this is possible in C#/.NET using the code from the MS link by @Codo.)

You can assign a string to a Byte array, so the following might work:
Dim myBytes() As Byte
myBytes = "Tamilstring"
This generates two bytes for each character. You could then create a second byte array twice the size of the first, using Space$ to create a suitable string, and then use a For loop (Step 4) to copy two bytes at a time from the first array to the second. Finally, assign the byte array back to a string.
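To see what the byte array actually holds, here is a minimal sketch. The function name CodeUnitsFromBytes is made up for illustration. It assumes the usual VBA behaviour that a String assigned to a Byte array yields UTF-16 little-endian bytes, low byte first.

```vba
' Sketch: list the UTF-16 code units of a string by reading its bytes in pairs.
' CodeUnitsFromBytes is a hypothetical helper name, not part of the answer above.
Function CodeUnitsFromBytes(ByVal s As String) As String
    Dim bytes() As Byte
    Dim i As Long
    Dim out As String

    If Len(s) = 0 Then Exit Function

    bytes = s                               ' two bytes per character
    For i = LBound(bytes) To UBound(bytes) Step 2
        ' Low byte comes first, so rebuild the code unit from the pair
        out = out & " " & CStr(bytes(i) + 256& * bytes(i + 1))
    Next i

    CodeUnitsFromBytes = Mid$(out, 2)       ' drop the leading space
End Function
```

Calling CodeUnitsFromBytes(ChrW$(2986) & ChrW$(3006)) returns "2986 3006". The byte array still holds the consonant and its vowel sign as two separate units, so any byte-level copy needs its own rule for keeping them together.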

The problem you have is you are looking for what Unicode calls an extended grapheme cluster.
For a Unicode compatible regex engine that is simply /\X/
Not sure how you do that in VBA.

Referring to the link mentioned by @ScottCraner in the comments on the question, and to the character codes for Tamil.
Check the result in cell A2. The dependent vowel signs highlighted in yellow are the ones used in the DepVow string.
Sub Split_Unicode_String()
'https://stackoverflow.com/questions/68774781/how-to-split-an-unicode-string-to-readable-characters
Dim my_string As String
'input string
Dim buff() As String
'array of input string characters
Dim DepVow As String
'Create string of Dependent vowel signs
Dim newStr As String
'result string with spaces as desired
Dim i As Long
my_string = Range("A1").Value
ReDim buff(Len(my_string) - 1) 'array of my_string characters
For i = 1 To Len(my_string)
buff(i - 1) = Mid$(my_string, i, 1)
Cells(1, i + 2) = buff(i - 1)
Cells(2, i + 2) = AscW(buff(i - 1)) 'used this for creating DepVow below
Next i
'Create string of dependent vowel signs, each preceded and succeeded by a comma
DepVow = "," & Join(Array(ChrW$(3006), ChrW$(3021), ChrW$(3009)), ",") & ","
newStr = ""
i = LBound(buff)
Do While i <= UBound(buff)
newStr = newStr & buff(i)
'Attach every dependent sign that follows the current character.
'Checking i < UBound(buff) first avoids reading past the end of the array.
Do While i < UBound(buff)
If InStr(1, DepVow, "," & buff(i + 1) & ",", vbBinaryCompare) = 0 Then Exit Do
i = i + 1
newStr = newStr & buff(i)
Loop
newStr = newStr & " "
i = i + 1
Loop
'result string in range A2
Cells(2, 1) = Left(newStr, Len(newStr) - 1)
End Sub

Try the algorithm below, which concatenates each mark character with the letter character before it.
ReDim letters(0)
For i = 1 To Len(Str)
If AscW(Mid(Str, i, 1)) > 3005 And AscW(Mid(Str, i, 1)) < 3022 Then
'Mark character: append it to the most recent letter
letters(UBound(letters) - 1) = letters(UBound(letters) - 1) & Mid(Str, i, 1)
Else
'Letter character: grow the array and store it in the new slot
ReDim Preserve letters(UBound(letters) + 1)
letters(UBound(letters) - 1) = Mid(Str, i, 1)
End If
Next
MsgBox Join(letters, ", ") 'returns பா, ர், த், து,
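For anyone who wants the result as an array rather than a spaced string, here is a minimal sketch of the same idea. The name SplitTamil is hypothetical. The code-point range is an assumption based on the Tamil block: U+0BBE to U+0BCD (3006 to 3021) plus the AU length mark U+0BD7 (3031). It is not a full Unicode extended-grapheme-cluster implementation, and other scripts need their own ranges.

```vba
Option Explicit

' Sketch: split a Tamil string into display units by attaching each
' dependent sign to the letter before it.
' Assumption: dependent signs are 3006..3021 (U+0BBE..U+0BCD) and 3031 (U+0BD7).
Function SplitTamil(ByVal s As String) As String()
    Dim parts() As String
    Dim n As Long, i As Long, code As Long
    Dim ch As String

    If Len(s) = 0 Then
        SplitTamil = Split(vbNullString)    ' empty array (UBound = -1)
        Exit Function
    End If

    ReDim parts(0 To Len(s) - 1)            ' worst case: one unit per character
    n = -1
    For i = 1 To Len(s)
        ch = Mid$(s, i, 1)
        code = AscW(ch)
        If n >= 0 And ((code >= 3006 And code <= 3021) Or code = 3031) Then
            parts(n) = parts(n) & ch        ' glue the sign onto the previous unit
        Else
            n = n + 1
            parts(n) = ch                   ' start a new unit
        End If
    Next i

    ReDim Preserve parts(0 To n)            ' trim the unused slots
    SplitTamil = parts
End Function
```

With the word from the question in cell A1, Join(SplitTamil(Range("A1").Value), " ") should give "பா ர் த் து". A sign at the very start of the string has no letter to attach to, so it is kept as its own unit.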

Related

Excel VBA Using wildcard to replace string within string

I have a difficult situation and so far no luck in finding a solution.
My VBA collects number figures like $80,000.50. and I'm trying to get VBA to remove the last period to make it look like $80,000.50 but without using right().
The problem is after the last period there are hidden spaces or characters which will be a whole lot of new issue to handle so I'm just looking for something like:
replace("$80,000.50.",".**.",".**")
Is this possible in VBA?
I cant leave a comment so....
what about InStrRev?
Private Sub this()
Dim this As String
this = "$80,000.50."
this = Left(this, InStrRev(this, ".") - 1)
Debug.Print ; this
End Sub
Mid + Find
You can use Mid and Find functions. Like so:
The Find will find the first dot . character. If all the values you are collecting are currency with 2 decimals, stored as text, this will work well.
The formula is: =MID(A2,1,FIND(".",A2)+2)
VBA solution
Function getStringToFirstOccurence(inputUser As String, FindWhat As String) As String
getStringToFirstOccurence = Mid(inputUser, 1, WorksheetFunction.Find(FindWhat, inputUser) + 2)
End Function
Other possible solutions, hints
Trim + Clean + Substitute(Char(160)): Chandoo - Untrimmable Spaces – Excel Formula
Ultimately, you can implement Regular expressions into Excel UDF: VBScript’s Regular Expression Support
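As a rough sketch of the regular-expression route: the function name RemoveLastDot is made up for illustration, and the code assumes Windows, where the VBScript.RegExp COM class can be created with late binding.

```vba
' Sketch: remove only the last period in a string and keep whatever follows it.
' Assumes Windows, where the VBScript.RegExp COM class is available.
Function RemoveLastDot(ByVal s As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\.([^.]*)$"           ' the last dot plus any non-dot tail
    RemoveLastDot = re.Replace(s, "$1") ' keep the tail, drop the dot
End Function
```

RemoveLastDot("$80,000.50.") returns "$80,000.50", and trailing spaces after the final dot are left in place. Note that it always removes the last period it finds, so a value with no stray trailing dot would lose its decimal point; check for that case first if it can occur.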
How about:
Sub dural()
Dim r As Range
For Each r In Selection
s = r.Text
l = Len(s)
For i = l To 1 Step -1
If Mid(s, i, 1) = "." Then
r.Value = Mid(s, 1, i - 1) & Mid(s, i + 1)
Exit For
End If
Next i
Next r
End Sub
This will remove the last period and leave all the other characters intact.
EDIT#1:
This version does not require looping over the characters in the cell:
Sub qwerty()
Dim r As Range
For Each r In Selection
If InStr(r.Value, ".") > 0 Then r.Characters(InStrRev(r.Text, "."), 1).Delete
Next r
End Sub
Shortest Solution
Simply use the Val command. I assume this is meant to be a numerical figure anyway? Get rid of commas and the dollar sign, then convert to value, which will ignore the second point and any other trailing characters! Robustness not tested, but seems to work...
Dim myString as String
myString = "$80,000.50. junk characters "
' Remove commas and dollar signs, then convert to value.
Dim myVal as Double
myVal = Val(Replace(Replace(myString,"$",""),",",""))
' >> myVal = 80000.5
' If you're really set on getting a formatted string back, use Format:
myString = Format(myVal, "$#,##0.00")
' >> myString = $80,000.50
From the Documentation,
The Val function stops reading the string at the first character it can't recognize as part of a number. Symbols and characters that are often considered parts of numeric values, such as dollar signs and commas, are not recognized.
This is why we must first remove the dollar sign, and why it ignores all the junk after the second dot, or for that matter anything non numerical at the end!
Working with Strings
Edit: I wrote this solution first but now think the above method is more comprehensive and shorter - left here for completeness.
Trim() removes leading and trailing spaces from a string. Then you could simply use Left() to get rid of the last point...
' String with trailing spaces and a final dot
Dim myString as String
myString = "$80,000.50. "
' Get rid of whitespace at end
myString = Trim(myString)
' Might as well check if there is a final dot before removing it
If Right(myString, 1) = "." Then
myString = Left(myString, Len(myString) - 1)
End If
' >> myString = "$80,000.50"

UDF to remove special characters, punctuation & spaces within a cell to create unique key for Vlookups

I hacked together the following User Defined Function in VBA that allows me to remove certain non-text characters from any given Cell.
The code is as follows:
Function removeSpecial(sInput As String) As String
Dim sSpecialChars As String
Dim i As Long
sSpecialChars = "\/:*?™""®<>|.&@#(_+`©~);-+=^$!,'" 'This is your list of characters to be removed
For i = 1 To Len(sSpecialChars)
sInput = Replace$(sInput, Mid$(sSpecialChars, i, 1), " ")
Next
removeSpecial = sInput
End Function
This portion of the code obviously defines what characters are to be removed:
sSpecialChars = "\/:*?™""®<>|.&@#(_+`©~);-+=^$!,'"
I also want to include a normal space character, " ", within this criteria. I was wondering if there is some sort of escape character that I can use to do this?
So, my goal is to be able to run this function, and have it remove all specified characters from a given Excel Cell, while also removing all spaces.
Also, I realize I could do this with a =SUBSTITUTE function within Excel itself, but I would like to know if it is possible in VBA.
Edit: It's fixed! Thank you simoco!
Function removeSpecial(sInput As String) As String
Dim sSpecialChars As String
Dim i As Long
sSpecialChars = "\/:*?™""®<>|.&@# (_+`©~);-+=^$!,'" 'This is your list of characters to be removed
For i = 1 To Len(sSpecialChars)
sInput = Replace$(sInput, Mid$(sSpecialChars, i, 1), "") 'this will remove spaces
Next
removeSpecial = sInput
End Function
So after the advice from simoco I was able to modify my for loop:
For i = 1 To Len(sSpecialChars)
sInput = Replace$(sInput, Mid$(sSpecialChars, i, 1), "") 'this will remove spaces
Next
Now every special character found in a given cell of my spreadsheet is removed, that is, replaced with nothing. This is done by the Replace$ and Mid$ functions used together as shown:
sInput = Replace$(sInput, Mid$(sSpecialChars, i, 1), "") 'this will remove spaces
This line is executed once for every character in sSpecialChars, starting with the character at position 1, via my For loop. Each pass removes all occurrences of that one character from the cell text.
Hopefully this answer benefits someone in the future if they stumble upon my original question.

Comparing character only to character at end of string

I am writing a program in Visual Basic 2010 that lists how many times a word of each length occurs in a user-inputted string. Although most of the program is working, I have one problem:
When looping through all of the characters in the string, the program checks whether there is a next character (such that the program does not attempt to loop through characters that do not exist). For example, I use the condition:
If letter = Microsoft.VisualBasic.Right(input, 1) Then
Where letter is the character, input is the string, and Microsoft.VisualBasic.Right(input, 1) extracts the rightmost character from the string. Thus, if letter is the rightmost character, the program will cease to loop through the string.
This is where the problem comes in. Let us say the string is This sentence has five words. The rightmost character is an s, but an s is also the fourth and sixth character. That means that the first and second s will break the loop just as the last one will.
My question is whether there is a way to ensure that only the last s, or whatever character is the last one in the string, can break the loop.
There are a few methods you can use for this, one as Neolisk shows; here are a couple of others:
Dim breakChar As Char = "s"c
Dim str As String = "This sentence has five words"
str = str.Replace(".", " ")
str = str.Replace(",", " ")
str = str.Replace(vbTab, " ")
' other chars to replace
Dim words() As String = str.ToLower.Split(New Char() {" "}, StringSplitOptions.RemoveEmptyEntries)
For Each word In words
If word.StartsWith(breakChar) Then Exit For
Console.WriteLine("M1 Word: ""{0}"" Length: {1:N0}", word, word.Length)
Next
If you need to loop though chars for whatever reason, you can use something like this:
Dim breakChar As Char = "s"c
Dim str As String = "This sentence has five words"
str = str.Replace(".", " ")
str = str.Replace(",", " ")
str = str.Replace(vbTab, " ")
' other chars to replace
'method 2
Dim word As New StringBuilder
Dim words As New List(Of String)
For Each c As Char In str.ToLower.Trim
If c = " "c Then
If word.Length > 0 'support multiple white-spaces (double-space etc.)
Console.WriteLine("M2 Word: ""{0}"" Length: {1:N0}", word.ToString, word.ToString.Length)
words.Add(word.ToString)
word.Clear()
End If
Else
If word.Length = 0 And c = breakChar Then Exit For
word.Append(c)
End If
Next
If word.Length > 0 Then
words.Add(word.ToString)
Console.WriteLine("M2 Word: ""{0}"" Length: {1:N0}", word.ToString, word.ToString.Length)
End If
I wrote these specifically to break on the first letter in a word as you ask, adjust as needed.
VB.NET code to calculate how many times a word of each length occurs in a user-inputted string:
Dim sentence As String = "This sentence has five words"
Dim words() As String = sentence.Split(" ")
Dim v = From word As String In words Group By L = word.Length Into Group Order By L
Line 2 may need to be adjusted to remove punctuation characters, trim extra spaces etc.
In the above example, v(i).L contains the word length, and v(i).Group.Count contains how many words of this length were encountered. For debugging purposes, you also have v(i).Group, which is a collection of String containing all words belonging to this group.

How to normalize filenames listed in a range

I have a list of filenames in a spreadsheet in the form of "Smith, J. 010112.pdf". However, they're in the varying formats of "010112.pdf", "01.01.12.pdf", and "1.01.2012.pdf". How could I change these to one format of "010112.pdf"?
Personally I hate using VBA where worksheet functions will work, so I've worked out a way to do this with worksheet functions. Although you could cram this all into one cell, I've broken it out into a lot of independent steps in separate columns so you can see how it's working, step by step.
For simplicity I'm assuming your file name is in A1
B1 =LEN(A1)
determine the length of the filename
C1 =SUBSTITUTE(A1," ","")
replace spaces with nothing
D1 =LEN(C1)
see how long the string is if you replace spaces with nothing
E1 =B1-D1
determine how many spaces there are
F1 =SUBSTITUTE(A1," ",CHAR(8),E1)
replace the last space with a special character that can't occur in a file name
G1 =SEARCH(CHAR(8), F1)
find the special character. Now we know where the last space is
H1 =LEFT(A1,G1-1)
peel off everything before the last space
I1 =MID(A1,G1+1,255)
peel off everything after the last space
J1 =FIND(".",I1)
find the first dot
K1 =FIND(".",I1,J1+1)
find the second dot
L1 =FIND(".",I1,K1+1)
find the third dot
M1 =MID(I1,1,J1-1)
find the first number
N1 =MID(I1,J1+1,K1-J1-1)
find the second number
O1 =MID(I1,K1+1,L1-K1-1)
find the third number
P1 =TEXT(M1,"00")
pad the first number
Q1 =TEXT(N1,"00")
pad the second number
R1 =TEXT(O1,"00")
pad the third number
S1 =IF(ISERR(K1),M1,P1&Q1&R1)
put the numbers together
T1 =H1&" "&S1&".pdf"
put it all together
It's kind of a mess because Excel hasn't added a single new string manipulation function in over 20 years, so things that should be easy (like "find last space") require severe trickery.
Here's a screenshot of a simple four-step method based on Excel commands and formulas, as suggested in a comment to the answered post (with a few changes)...
This function below works. I've assumed that the date is in ddmmyy format, but adjust as appropriate if it's mmddyy -- I can't tell from your example.
Function FormatThis(str As String) As String
Dim strDate As String
Dim iDateStart As Long
Dim iDateEnd As Long
Dim temp As Variant
' Pick out the date part
iDateStart = GetFirstNumPosition(str, False)
iDateEnd = GetFirstNumPosition(str, True)
strDate = Mid(str, iDateStart, iDateEnd - iDateStart + 1)
If InStr(strDate, ".") <> 0 Then
' Deal with the dot delimiters in the date
temp = Split(strDate, ".")
strDate = Format(DateSerial( _
CInt(temp(2)), CInt(temp(1)), CInt(temp(0))), "ddmmyy")
Else
' No dot delimiters... assume date is already formatted as ddmmyy
' Do nothing
End If
' Piece it together
FormatThis = Left(str, iDateStart - 1) _
& strDate & Right(str, Len(str) - iDateEnd)
End Function
This uses the following helper function:
Function GetFirstNumPosition(str As String, startFromRight As Boolean) As Long
Dim i As Long
Dim startIndex As Long
Dim endIndex As Long
Dim indexStep As Integer
If startFromRight Then
startIndex = Len(str)
endIndex = 1
indexStep = -1
Else
startIndex = 1
endIndex = Len(str)
indexStep = 1
End If
For i = startIndex To endIndex Step indexStep
If Mid(str, i, 1) Like "[0-9]" Then
GetFirstNumPosition = i
Exit For
End If
Next i
End Function
To test:
Sub tester()
MsgBox FormatThis("Smith, J. 01.03.12.pdf")
MsgBox FormatThis("Smith, J. 010312.pdf")
MsgBox FormatThis("Smith, J. 1.03.12.pdf")
MsgBox FormatThis("Smith, J. 1.3.12.pdf")
End Sub
They all return "Smith, J. 010312.pdf".
You don't need VBA. Start by replacing the "."s with nothing:
=SUBSTITUTE(A1,".","")
This will change the ".pdf" to "pdf", so let's put that back:
=SUBSTITUTE(SUBSTITUTE(A1,".",""),"pdf",".pdf")
Got awk? Get the data into a text file, and
awk -F'.' '{ if(/[0-9]+\.[0-9]+\.[0-9]+/) printf("%s., %02d%02d%02d.pdf\n", $1, $2, $3, length($4) > 2 ? substr($4,3,2) : $4); else print $0; }' your_text_file
Assuming the data are exactly as what you described, e.g.,
Smith, J. 010112.pdf
Mit, H. 01.02.12.pdf
Excel, M. 8.1.1989.pdf
Lec, X. 06.28.2012.pdf
DISCLAIMER:
As @Jean-FrançoisCorbett has mentioned, this does not work for "Smith, J. 1.01.12.pdf". Instead of reworking this completely, I'd recommend his solution!
Option Explicit
Function ExtractNumerals(Original As String) As String
'Pass everything up to and including ".pdf", then concatenate the result of this function with ".pdf".
'This will not return the ".pdf" if passed, which is generally not my ideal solution, but it's a simpler form that still should get the job done.
'If you have varying extensions, then look at the code of the test sub as a guide for how to compensate for the truncation this function creates.
Dim i As Integer
Dim bFoundFirstNum As Boolean
For i = 1 To Len(Original)
If IsNumeric(Mid(Original, i, 1)) Then
bFoundFirstNum = True
ExtractNumerals = ExtractNumerals & Mid(Original, i, 1)
ElseIf Not bFoundFirstNum Then
ExtractNumerals = ExtractNumerals & Mid(Original, i, 1)
End If
Next i
End Function
I used this as a testcase, which does not correctly cover all your examples:
Sub test()
MsgBox ExtractNumerals("Smith, J. 010112.pdf") & ".pdf"
End Sub

How to find which delimiter was used during string split (VB.NET)

Let's say I have a string that I want to split based on several characters, like ".", "!", and "?". How do I figure out which one of those characters split my string, so I can add that same character back onto the end of the split segments in question?
Dim linePunctuation As Integer = 0
Dim myString As String = "some text. with punctuation! in it?"
'Count each kind of punctuation mark in myString
For i = 1 To Len(myString)
If Mid$(myString, i, 1) = "." Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(myString, i, 1) = "!" Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(myString, i, 1) = "?" Then linePunctuation += 1
Next
'Upper bound 2 gives exactly three elements (0 to 2)
Dim delimiters(2) As Char
delimiters(0) = "."c
delimiters(1) = "!"c
delimiters(2) = "?"c
Dim currentLineSplit() As String = myString.Split(delimiters)
Dim sentenceArray(linePunctuation) As String
Dim count As Integer = 0
While linePunctuation > 0
sentenceArray(count) = currentLineSplit(count)'Here I want to add what ever delimiter was used to make the split back onto the string before it is stored in the array.'
count += 1
linePunctuation -= 1
End While
If you add a capturing group to your regex like this:
SplitArray = Regex.Split(myString, "([.?!])")
Then the returned array contains both the text between the punctuation, and separate elements for each punctuation character. The Split() function in .NET includes text matched by capturing groups in the returned array. If your regex has several capturing groups, all their matches are included in the array.
This splits your sample into:
some text
.
with punctuation
!
in it
?
You can then iterate over the array to get your "sentences" and your punctuation.
.Split() does not provide this information.
You will need to use a regular expression to accomplish what you are after, which I infer as the desire to split an English-ish paragraph into sentences by splitting on punctuation.
The simplest implementation would look like this.
var input = "some text. with punctuation! in it?";
string[] sentences = Regex.Split(input, @"\b(?<sentence>.*?[\.!?](?:\s|$))");
foreach (string sentence in sentences)
{
Console.WriteLine(sentence);
}
Results
some text.
with punctuation!
in it?
But you are going to find very quickly that language, as spoken/written by humans, does not follow simple rules most of the time.
Here it is in VB.NET for you:
Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")
Once you've called Split with all 3 characters, you've tossed that information away. You could do what you're trying to do by splitting yourself or by splitting on one punctuation mark at a time.