How to find which delimiter was used during string split (VB.NET) - vb.net

lets say I have a string that I want to split based on several characters, like ".", "!", and "?". How do I figure out which one of those characters split my string so I can add that same character back on to the end of the split segments in question?
Dim linePunctuation as Integer = 0
Dim myString As String = "some text. with punctuation! in it?"
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "." Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "!" Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "?" Then linePunctuation += 1
Next
Dim delimiters(3) As Char
delimiters(0) = "."
delimiters(1) = "!"
delimiters(2) = "?"
currentLineSplit = myString.Split(delimiters)
Dim sentenceArray(linePunctuation) As String
Dim count As Integer = 0
While linePunctuation > 0
sentenceArray(count) = currentLineSplit(count)'Here I want to add what ever delimiter was used to make the split back onto the string before it is stored in the array.'
count += 1
linePunctuation -= 1
End While

If you add a capturing group to your regex like this:
SplitArray = Regex.Split(myString, "([.?!])")
Then the returned array contains both the text between the punctuation, and separate elements for each punctuation character. The Split() function in .NET includes text matched by capturing groups in the returned array. If your regex has several capturing groups, all their matches are included in the array.
This splits your sample into:
some text
.
with punctuation
!
in it
?
You can then iterate over the array to get your "sentences" and your punctuation.

.Split() does not provide this information.
You will need to use a regular expression to accomplish what you are after, which I infer as the desire to split an English-ish paragraph into sentences by splitting on punctuation.
The simplest implementation would look like this.
var input = "some text. with punctuation! in it?";
string[] sentences = Regex.Split(input, #"\b(?<sentence>.*?[\.!?](?:\s|$))");
foreach (string sentence in sentences)
{
Console.WriteLine(sentence);
}
Results
some text.
with punctuation!
in it?
But you are going to find very quickly that language, as spoken/written by humans, does not follow simple rules most of the time.
Here it is in VB.NET for you:
Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")

Once you've called Split with all 3 characters, you've tossed that information away. You could do what you're trying to do by splitting yourself or by splitting on one punctuation mark at a time.

Related

How to split an unicode-string to readable characters?

I have a VBA formula-function to split a string and add space between each character. It works fines only for an Ascii string. But I want to do the same for the Tamil Language. Since it is Unicode, the result is not readable. It splits even the auxiliary characters, Upper dots, Prefix, Suffix auxilary characters which should not be separated in Tamil/Hindi/Kanada/Malayalam/All India Languages. So, how to write a function to split a Tamil Word into readable characters.
Function AddSpace(Str As String) As String
Dim i As Long
For i = 1 To Len(Str)
AddSpace = AddSpace & Mid(Str, i, 1) & " "
Next i
AddSpace = Trim(AddSpace)
End Function
Adding Space is not the important point of this question. Splitting the Unicode string into an array from any of those languages is the requirement.
For example, the word, "பார்த்து" should be separated as "பா ர் த் து", not as "ப ா ர ் த ் த ு". As you can see, the first two letters "பா" (ப + ா) are combined. If I try to manually put a space in between them, I can't do it in any word processor. If you want to test, please put it in Notepad and add space between each character. It won't allow you to separate as ("ப ா"). So "பார்த்து" should be separated as "பா ர் த் து". It is the correct separation in Tamil like languages. This is the one that I am struggling to achieve in VBA.
The Character Code table for Tamil is here.
Tamil/Hindi/many Indian languages have (1)Consonants, (2)Independent vowels, (3)Dependent vowel signs, (4)Two-part dependent vowel signs. Among these 4 types, the first two are each one separate lettter, no issues with them. but the last 2 are dependent, they should not be separated from its joint character. For example, the letter, பா (ப + ் ), it contains one independent (ப) and one dependent (ா) letter.
If this info is not enough, please comment what should I post more.
(Note: It is possible in C#.Net using the code from the MS link by #Codo)
You can assign a string to a Byte array so the following might work
Dim myBytes as Byte
myBytes = "Tamilstring"
which generates two bytes for each character. You could then create a second byte array twice the size of the first by using space$ to crate a suitable string and then use a for loop (step 4) to copy two bytes at a time from the first to the second array. Finally, assign the byte array back to a string.
The problem you have is you are looking for what Unicode calls an extended grapheme cluster.
For a Unicode compatible regex engine that is simply /\X/
Not sure how you do that in VBA.
Referring the link mentioned by #ScottCraner in comments on the question and Character code for Tamil.
Check the result in cell A2 and highlighted in yellow are Dependent vowel signs which are used in DepVow string
Sub Split_Unicode_String()
'https://stackoverflow.com/questions/68774781/how-to-split-an-unicode-string-to-readable-characters
Dim my_string As String
'input string
Dim buff() As String
'array of input string characters
Dim DepVow As String
'Create string of Dependent vowel signs
Dim newStr As String
'result string with spaces as desired
Dim i As Long
my_string = Range("A1").Value
ReDim buff(Len(my_string) - 1) 'array of my_string characters
For i = 1 To Len(my_string)
buff(i - 1) = Mid$(my_string, i, 1)
Cells(1, i + 2) = buff(i - 1)
Cells(2, i + 2) = AscW(buff(i - 1)) 'used this for creating DepVow below
Next i
'Create string of Dependent vowel signs preceded and succeeded by comma
DepVow = "," & Join(Array(ChrW$(3006), ChrW$(3021), ChrW$(3009)), ",")
newStr = ""
For i = LBound(buff) To UBound(buff)
If InStr(1, DepVow, ChrW$(AscW(buff(i + 1))), vbTextCompare) > 0 Then
newStr = newStr & ChrW$(AscW(buff(i))) & ChrW$(AscW(buff(i + 1))) & " "
i = i + 1
Else
newStr = newStr & ChrW$(AscW(buff(i))) & " "
End If
Next i
'result string in range A2
Cells(2, 1) = Left(newStr, Len(newStr) - 1)
End Sub
Try below algorithm. which will concat all the mark characters with letter characters.
redim letters(0)
For i=1 To Len(Str)
If ascW(Mid(Str,i,1)) >3005 And ascW(Mid(Str,i,1)) <3022 Then
letters(UBound(letters)-1) = letters(UBound(letters)-1)+Mid(Str,i,1)
Else REDIM PRESERVE
letters(UBound(letters) + 1)
letters(UBound(letters)-1) = Mid(Str,i,1)
End If
Next
MsgBox(join(letters, ", "))'return பா, ர், த், து,

How to increase numeric value present in a string

I'm using this query in vb.net
Raw_data = Alltext_line.Substring(Alltext_line.IndexOf("R|1"))
and I want to increase R|1 to R|2, R|3 and so on using for loop.
I tried it many ways but getting error
string to double is invalid
any help will be appreciated
You must first extract the number from the string. If the text part ("R") is always separated from the number part by a "|", you can easily separated the two with Split:
Dim Alltext_line = "R|1"
Dim parts = Alltext_line.Split("|"c)
parts is a string array. If this results in two parts, the string has the expected shape and we can try to convert the second part to a number, increase it and then re-create the string using the increased number
Dim n As Integer
If parts.Length = 2 AndAlso Integer.TryParse(parts(1), n) Then
Alltext_line = parts(0) & "|" & (n + 1)
End If
Note that the c in "|"c denotes a Char constant in VB.
An alternate solution that takes advantage of the String type defined as an Array of Chars.
I'm using string.Concat() to patch together the resulting IEnumerable(Of Char) and CInt() to convert the string to an Integer and sum 1 to its value.
Raw_data = "R|151"
Dim Result As String = Raw_data.Substring(0, 2) & (CInt(String.Concat(Raw_data.Skip(2))) + 1).ToString
This, of course, supposes that the source string is directly convertible to an Integer type.
If a value check is instead required, you can use Integer.TryParse() to perform the validation:
Dim ValuePart As String = Raw_data.Substring(2)
Dim Value As Integer = 0
If Integer.TryParse(ValuePart, Value) Then
Raw_data = Raw_data.Substring(0, 2) & (Value + 1).ToString
End If
If the left part can be variable (in size or content), the answer provided by Olivier Jacot-Descombes is covering this scenario already.
Sub IncrVal()
Dim s = "R|1"
For x% = 1 To 10
s = Regex.Replace(s, "[0-9]+", Function(m) Integer.Parse(m.Value) + 1)
Next
End Sub

VB.NET - Delete excess white spaces between words in a sentence

I'm a programing student, so I've started with vb.net as my first language and I need some help.
I need to know how I delete excess white spaces between words in a sentence, only using these string functions: Trim, instr, char, mid, val and len.
I made a part of the code but it doesn't work, Thanks.
enter image description here
Knocked up a quick routine for you.
Public Function RemoveMyExcessSpaces(str As String) As String
Dim r As String = ""
If str IsNot Nothing AndAlso Len(str) > 0 Then
Dim spacefound As Boolean = False
For i As Integer = 1 To Len(str)
If Mid(str, i, 1) = " " Then
If Not spacefound Then
spacefound = True
End If
Else
If spacefound Then
spacefound = False
r += " "
End If
r += Mid(str, i, 1)
End If
Next
End If
Return r
End Function
I think it meets your criteria.
Hope that helps.
Unless using those VB6 methods is a requirement, here's a one-line solution:
TextBox2.Text = String.Join(" ", TextBox1.Text.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries))
Online test: http://ideone.com/gBbi55
String.Split() splits a string on a specific character or substring (in this case a space) and creates an array of the string parts in-between. I.e: "Hello There" -> {"Hello", "There"}
StringSplitOptions.RemoveEmptyEntries removes any empty strings from the resulting split array. Double spaces will create empty strings when split, thus you'll get rid of them using this option.
String.Join() will create a string from an array and separate each array entry with the specified string (in this case a single space).
There is a very simple answer to this question, there is a string method that allows you to remove those "White Spaces" within a string.
Dim text_with_white_spaces as string = "Hey There!"
Dim text_without_white_spaces as string = text_with_white_spaces.Replace(" ", "")
'text_without_white_spaces should be equal to "HeyThere!"
Hope it helped!

Comparing character only to character at end of string

I am writing a program in Visual Basic 2010 that lists how many times a word of each length occurs in a user-inputted string. Although most of the program is working, I have one problem:
When looping through all of the characters in the string, the program checks whether there is a next character (such that the program does not attempt to loop through characters that do not exist). For example, I use the condition:
If letter = Microsoft.VisualBasic.Right(input, 1) Then
Where letter is the character, input is the string, and Microsoft.VisualBasic.Right(input, 1) extracts the rightmost character from the string. Thus, if letter is the rightmost character, the program will cease to loop through the string.
This is where the problems comes in. Let us say the string is This sentence has five words. The rightmost character is an s, but an s is also the fourth and sixth character. That means that the first and second s will break the loop just as the others will.
My questions is whether there is a way to ensure that only the last s, or whatever character is the last one in the string can break the loop.
There are a few methods you can use for this, one as Neolisk shows; here are a couple of others:
Dim breakChar As Char = "s"
Dim str As String = "This sentence has five words"
str = str.Replace(".", " ")
str = str.Replace(",", " ")
str = str.Replace(vbTab, " ")
' other chars to replace
Dim words() As String = str.ToLower.Split(New Char() {" "}, StringSplitOptions.RemoveEmptyEntries)
For Each word In words
If word.StartsWith(breakChar) Then Exit For
Console.WriteLine("M1 Word: ""{0}"" Length: {1:N0}", word, word.Length)
Next
If you need to loop though chars for whatever reason, you can use something like this:
Dim breakChar As Char = "s"
Dim str As String = "This sentence has five words"
str = str.Replace(".", " ")
str = str.Replace(",", " ")
str = str.Replace(vbTab, " ")
' other chars to replace
'method 2
Dim word As New StringBuilder
Dim words As New List(Of String)
For Each c As Char In str.ToLower.Trim
If c = " "c Then
If word.Length > 0 'support multiple white-spaces (double-space etc.)
Console.WriteLine("M2 Word: ""{0}"" Length: {1:N0}", word.ToString, word.ToString.Length)
words.Add(word.ToString)
word.Clear()
End If
Else
If word.Length = 0 And c = breakChar Then Exit For
word.Append(c)
End If
Next
If word.Length > 0 Then
words.Add(word.ToString)
Console.WriteLine("M2 Word: ""{0}"" Length: {1:N0}", word.ToString, word.ToString.Length)
End If
I wrote these specifically to break on the first letter in a word as you ask, adjust as needed.
VB.NET code to calculate how many times a word of each length occurs in a user-inputted string:
Dim sentence As String = "This sentence has five words"
Dim words() As String = sentence.Split(" ")
Dim v = From word As String In words Group By L = word.Length Into Group Order By L
Line 2 may need to be adjusted to remove punctuation characters, trim extra spaces etc.
In the above example, v(i) contains word length, and v(i).Group.Count contains how many words of this length were encountered. For debugging purposes, you also have v(i).Group, which is an array of String, containing all words belonging to this group.

Break string up

I have a string that has 2 sections broken up by a -. When I pass this value to my new page I just want the first section.
An example value would be: MS 25 - 25
I just want to show: MS 25
I am looking at IndexOf() and SubString() but I can't find how to get the start of the string and drop the end.
This might help:
http://www.homeandlearn.co.uk/net/nets7p5.html
Basically the substring method takes 2 parameters. Start position and length.
In your case, the start position is 0 and length is going to be the position found by the IndexOf method -1.
For example:
Dim s as String
Dim result as String
s = "MS 25 - 25"
result = s.SubString(0, s.IndexOf("-")-1)
You could use the Split function on the hyphen.
.Split("-")
If you want to stay away from Split, you could use SubString
yourString.Substring(0, yourString.IndexOf("-") - 1)
EDIT
The above code will fail in the instances where there is no hyphen at all or the hyphen is in the beginning of the string, also when there are no spaces surrounding the hyphen, the full leading substring will not be returned. Consider using this for safety:
Dim pos As Integer
Dim result As String
pos = yourString.IndexOf("-")
If (pos > 0) Then
result = yourString.Substring(0, pos)
ElseIf (pos = 0) Then
result = String.Empty
Else
result = yourString
End If