Convert =00 formatted UTF codes in a plain text file to the correct UTF character in VB.NET

I'm writing a simple program to extract all the postal addresses from a big plain text file, and I'm having a problem because some of the addresses use non-standard characters.
This is some source text from the file I need to process:
Rua Vale de Louro, N=BA 97
Bloco 2, 1=BA A
but it needs to read:
Rua Vale de Louro, Nº 97
Bloco 2, 1º A
Now obviously I could do a simple replace for this one character, but I need it to work with every character.
BA is the hex value of the º symbol in UTF-32 (albeit with a load of zeros preceding it), so if I can code something to find all these "=xx" instances in the string and replace them with the correct UTF character, that would solve it. But for the life of me I can't figure out how.
Can anyone help?
Thanks

Use
Dim txt As String = IO.File.ReadAllText("fileName", System.Text.Encoding.UTF8) ' or ASCII, UTF32, Unicode, etc.
Replace UTF8 with whichever encoding matches the file.

It can be done using regular expressions with a match evaluator to calculate the replacement string.
Dim input = "Rua Vale de Louro, N=BA 97 Bloco 2, 1=BA A"
Dim expected = "Rua Vale de Louro, Nº 97 Bloco 2, 1º A"
' Exactly two hex digits, so a hex-looking letter right after the escape (e.g. "=BAA") is not swallowed
Dim regex = New Regex("=([0-9A-Fa-f]{2})", RegexOptions.CultureInvariant, TimeSpan.FromSeconds(10))
Dim evaluator As MatchEvaluator = Function(m As Match) Char.ConvertFromUtf32(Convert.ToInt32(m.Groups(1).Value, 16))
Dim actual = regex.Replace(input, evaluator)
The pattern matches = followed by exactly two hex digits. The hex digits are in group 1.
The evaluator takes the hex digits, converts them to an integer from base 16 and then converts that Unicode code point to a string.
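One caveat worth sketching: the =XX notation looks like quoted-printable encoding, where each =XX stands for one byte in the file's character set rather than for a code point. That is an assumption about where the file came from. For º the two coincide in ISO-8859-1 (byte BA is U+00BA), but a UTF-8 source would write é as =C3=A9, and decoding each escape on its own would give two wrong characters. The sketch below collects each run of escapes into bytes and decodes the run with a chosen encoding. DecodeHexEscapes is a hypothetical helper name, and soft line breaks (a bare = at the end of a line) are not handled.

```vb
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module HexEscapeSketch
    ' Hypothetical helper: decodes runs of =XX escapes as bytes in the given encoding.
    Public Function DecodeHexEscapes(input As String, enc As Encoding) As String
        Return Regex.Replace(input, "(?:=[0-9A-Fa-f]{2})+",
            Function(m As Match)
                ' Each escape is three characters: "=" plus two hex digits.
                Dim bytes(m.Length \ 3 - 1) As Byte
                For i As Integer = 0 To bytes.Length - 1
                    bytes(i) = Convert.ToByte(m.Value.Substring(i * 3 + 1, 2), 16)
                Next
                Return enc.GetString(bytes)
            End Function)
    End Function

    Sub Main()
        ' Assumption: the address file is ISO-8859-1 (Latin-1).
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Console.WriteLine(DecodeHexEscapes("Rua Vale de Louro, N=BA 97", latin1))
        ' The same helper copes with multi-byte UTF-8 sequences.
        Console.WriteLine(DecodeHexEscapes("Caf=C3=A9", Encoding.UTF8))
    End Sub
End Module
```

Note that any literal = followed by two hex digits in the text (for example "x=10") would also be decoded, so this is only safe if the file really is escaped consistently.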

Related

How to shrink 10 digit numeric into 2 character

I have input comprising five character upper-case English letters e.g ABCDE and I need to convert this into two character unique ASCII output.
e.g. ABCDE and ZZZZZ should both give two different outputs
I have converted from ABCDE into hex which gives me 4142434445, but from this can I get to a two character output value I require?
Example:
INPUT1 = ABCDE
Converted to hex = 4142434445
INPUT2 = 4142434445
OUTPUT = ?? Any 2 ASCII Characters
Other examples of INPUT1 =
BIRAL
BRMAL
KLAAX
So you're starting with a 5-digit base-26 number, and you want to squeeze that into some 2-digit scheme with base n?
All possible 1-5 digit base-26 numbers give you a number space of 26^5 = 11,881,376.
So you want the minimum n where n^2 >= 11,881,376.
Which gives you 3447 (3446^2 = 11,874,916 falls just short, while 3447^2 = 11,881,809).
Now it's up to you to go and find a suitable glyph block somewhere in Unicode where you can reliably block out 3447 separate characters to act as your new base/alphabet. And construct a mapping from your 5-char base-26 ABCDE type number onto your 2-char base-3447 weird-glyph number. Good luck with that.
There's not enough variety in ASCII to do this, since it has only 128 characters (and only 95 of those are printable). Limiting yourself to 2 chars of ASCII means you can only address a number space of 128^2 = 16,384.
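The arithmetic above can be checked with a few lines of code. This is only a sketch: MinBase and Power are hypothetical helper names, and the search is a plain linear loop because the numbers involved are small.

```vb
Imports System

Module BaseSizeSketch
    ' Integer power, to avoid floating-point rounding near the boundary.
    Public Function Power(b As Long, e As Integer) As Long
        Dim result As Long = 1
        For i As Integer = 1 To e
            result *= b
        Next
        Return result
    End Function

    ' Smallest base n such that n^digits >= count.
    Public Function MinBase(count As Long, digits As Integer) As Long
        Dim n As Long = 1
        While Power(n, digits) < count
            n += 1
        End While
        Return n
    End Function

    Sub Main()
        Dim total As Long = Power(26, 5)
        Console.WriteLine(total)              ' 11881376
        Console.WriteLine(MinBase(total, 2))  ' 3447
        Console.WriteLine(Power(128, 2))      ' 16384
    End Sub
End Module
```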

Find Each Occurrence of X and Insert a Carriage Return

A colleague has some data he is putting into a flat file (.txt) and needs to insert a carriage return before EACH occurrence of 'POL01', 'SUB01','VEH01','MCO01'.
I did use:
For Each line1 As String In System.IO.File.ReadAllLines(BodyFileLoc)
If line1.Contains("POL01") Or line1.Contains("SUB01") Or line1.Contains("VEH01") Or line1.Contains("MCO01") Then
Writer.WriteLine(Environment.NewLine & line1)
Else
Writer.WriteLine(line1)
End If
Next
But unfortunately it turns out that the file is not formatted in 'lines' by SSIS but as one whole string.
How can I insert a carriage return before every occurrence of the above?
Test Text
POL01CALT302276F 332 NBPM 00101 20151113201511130001201611132359 2015111300010020151113000100SUB01CALT302276F 332 NBPMP01 Akl Abi-Khalil 19670131 M U33 Stoford Close SW19 6TJ 2015111300010020151113000100VEH01CALT302276F 332 NBPM001LV56 LEJ N 2006VAUXHALL CA 2015111300010020151113000100MCO01CALT302276F 332 NBPM0101 0 2015111300010020151113000100POL01CALT742569N
You can use regular expressions for this, specifically by using Regex.Replace to find and replace each occurrence of the strings you're looking for with a newline followed by the matching text:
Dim str as String = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
Dim output as String = Regex.Replace(str, "((?:POL|SUB|VEH|MCO)01)", Environment.NewLine + "$1")
'output contains:
'xxx
'POL01xxx
'SUB01xxx
'VEH01xxx
'MCO01xxx
There may be a better way to construct this regular expression, but this is a simple alternation on the different letters, followed by 01. This matched text is represented by the $1 in the replacement string.
If you're new to regular expressions, there are a number of tools that help you understand them. For example, regex101.com will show you an explanation of the one I have used here.
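As a possible variation, a zero-width lookahead lets the replacement insert the line break without capturing or re-emitting the marker. The sketch below assumes the four markers never occur inside ordinary field data; InsertBreaks is a hypothetical helper name.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RecordBreakSketch
    ' Inserts a line break before each record marker.
    ' The lookahead (?=...) matches a position, not text, so nothing is consumed.
    Public Function InsertBreaks(text As String) As String
        Return Regex.Replace(text, "(?=(?:POL|SUB|VEH|MCO)01)", Environment.NewLine)
    End Function

    Sub Main()
        Console.WriteLine(InsertBreaks("xxxPOL01xxxSUB01xxx"))
    End Sub
End Module
```

For the flat file itself, the whole content could be read with File.ReadAllText, passed through InsertBreaks, and written back with File.WriteAllText. If the file starts with a marker, the output will begin with an empty line, just as with the capture-group version.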

VB Convert RGB String to Hex

The title pretty much explains my issue. I need to convert a single string RGB value into a Hex value. I can do this if the value is given in three separate strings, but as the RGB is given from a color picker I'm unable to do this - unless I split the string which I don't want to do as I feel it's unnecessary.
I want to be able to convert a string such as: 0, 112, 192 into its hexadecimal equivalent. Can I convert the entire string or do I have to split the string into its RGB parts first?
Since you can have varying numbers of decimal digits for the RGB value, you'll need to separate it before you convert it.
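Dim s1 As String = "0, 112, 192"
Dim s2 As String = ""
For Each s As String In s1.Split(","c)
    ' Trim the space after each comma, then format as two hex digits
    s2 &= CInt(s.Trim()).ToString("x2")
Next s
' s2 is now "0070c0"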
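If the value comes from user input, it may be worth wrapping the same idea in a function that rejects malformed strings. A minimal sketch, assuming the input is always three comma-separated values in the range 0 to 255 (RgbToHex is a hypothetical name, and upper-case hex is an arbitrary choice):

```vb
Imports System
Imports System.Linq

Module RgbHexSketch
    ' Converts "r, g, b" to a six-digit hex string.
    ' Byte.Parse throws if a component is not a whole number between 0 and 255.
    Public Function RgbToHex(rgb As String) As String
        Dim parts As String() = rgb.Split(","c)
        If parts.Length <> 3 Then Throw New FormatException("Expected three components.")
        Return String.Concat(parts.Select(Function(p) Byte.Parse(p.Trim()).ToString("X2")))
    End Function

    Sub Main()
        Console.WriteLine(RgbToHex("0, 112, 192")) ' 0070C0
    End Sub
End Module
```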

Converting binary to base 4

What I hope to achieve:
I want to convert text to DNA (which is a base 4 system, "a,G,T,c")
How I plan to do it:
Convert text string to binary,
Dim BinaryConvert As String = ""
For Each C As Char In Textbox1.Text
Dim s As String = System.Convert.ToString(AscW(C), 2).PadLeft(8, "0"c)
BinaryConvert &= s
Next
Textbox1.Text = BinaryConvert '//Changes the textbox1.Text into binary form
Then convert binary to base 4 via Pseudocode solution:
if (length of binary String is an odd number) add a zero to the front (leftmost position) of the String.
Create an empty String to add translated digits to.
While the original String of binary is not empty {
Translate the first two digits only of the binary String into a base-4 digit, and add this digit to the end (rightmost) index of the new String.
After this, remove the same two digits from the binary string and repeat if it is not empty.
}
The idea behind converting binary to DNA is simply setting G and T equal to one, with c and a equal to zero (G=T=1, a=c=0).
So all I have to do is convert the string to binary first, and then into base 4, in order to convert text to genetic code. Could you please help me write the code to convert binary to base 4.
Thank you for the help!
Converting to base 4 from base 2 is pretty simple. Since 4 itself is the 2nd power of 2, this means you can simply combine two bits to create one base 4 place (2 bits can represent 4 possible values, while 1 base 4 place can also represent 4 possible values). For example:
11100100 (base 2) = 3210 (base 4)
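The pairing of bits described above could be sketched as follows. BinaryToBase4 follows the pseudocode from the question (pad to an even length, then translate two bits at a time). Base4ToDna is a hypothetical extra step, and its mapping 0=a, 1=G, 2=T, 3=c is an assumption taken from the order "a,G,T,c" in the question, not an established convention.

```vb
Imports System
Imports System.Text

Module Base4Sketch
    ' Converts a string of 0s and 1s to base 4, two bits per digit.
    Public Function BinaryToBase4(bits As String) As String
        ' Pad to an even length so the bits pair up cleanly.
        If bits.Length Mod 2 = 1 Then bits = "0" & bits
        Dim sb As New StringBuilder()
        For i As Integer = 0 To bits.Length - 2 Step 2
            ' Parsing the pair as base 2 gives a value from 0 to 3.
            sb.Append(Convert.ToInt32(bits.Substring(i, 2), 2))
        Next
        Return sb.ToString()
    End Function

    ' Assumed mapping: 0=a, 1=G, 2=T, 3=c.
    Public Function Base4ToDna(digits As String) As String
        Const letters As String = "aGTc"
        Dim sb As New StringBuilder()
        For Each d As Char In digits
            sb.Append(letters.Chars(AscW(d) - AscW("0"c)))
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(BinaryToBase4("11100100"))             ' 3210
        Console.WriteLine(Base4ToDna(BinaryToBase4("11100100"))) ' cTGa
    End Sub
End Module
```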

Problem with File IO and splitting strings with Environment.NewLine in VB.Net

I was experimenting with basic VB.Net File IO and String splitting. I encountered this problem. I don't know whether it has something to do with the File IO or String splitting.
I am writing text to a file like so
Dim sWriter As New StreamWriter("Data.txt")
sWriter.WriteLine("FirstItem")
sWriter.WriteLine("SecondItem")
sWriter.WriteLine("ThirdItem")
sWriter.Close()
Then, I am reading the text from the file
Dim sReader As New StreamReader("Data.txt")
Dim fileContents As String = sReader.ReadToEnd()
sReader.Close()
Now, I am splitting fileContents using Environment.NewLine as the delimiter.
Dim tempStr() As String = fileContents.Split(Environment.NewLine)
When I print the resulting Array, I get some weird results
For Each str As String In tempStr
Console.WriteLine("*" + str + "*")
Next
I added the *s to the beginning and end of the Array items during printing, to find out what is going on. Since NewLine is used as the delimiter, I expected the strings in the Array to NOT have any NewLine's. But the output was this -
*FirstItem*
*
SecondItem*
*
ThirdItem*
*
*
Shouldn't it be this -
*FirstItem*
*SecondItem*
*ThirdItem*
**
??
Why is there a new line in the beginning of all but the first string?
Update: I did a character by character print of fileContents and got this -
F - 70
i - 105
r - 114
s - 115
t - 116
I - 73
t - 116
e - 101
m - 109
- 13
- 10
S - 83
e - 101
c - 99
o - 111
n - 110
d - 100
I - 73
t - 116
e - 101
m - 109
- 13
- 10
T - 84
h - 104
i - 105
r - 114
d - 100
I - 73
t - 116
e - 101
m - 109
- 13
- 10
It seems 'Environment.NewLine' consists of
- 13
- 10
13 and 10.. I understand. But the empty space in between? I don't know whether it is coming due to printing to the console or is really a part of NewLine.
So, when splitting, only the character equivalent of ASCII value 13, which is the first character of NewLine, is used as delimiter (as explained in the replies) and the remaining stuff is still present in the strings. For some reason, the mysterious empty space in the list above and ASCII value 10 together result in a new line being printed.
Now it is clear. Thanks for the help. :)
First of all, yes, WriteLine tacks on a newline to the end of the string, hence the blank line at the end.
The problem is the way you're calling fileContents.Split(). The only version of that function that takes only one argument takes a char(), not a string. Environment.NewLine is a string, not a char, so (assuming you have Option Strict Off) when you're calling the function it's implicitly converting it to a char, using only the first character in the string. This means that instead of splitting your string on the actual sequence of two characters that make up Environment.NewLine, it's actually splitting only on the first of those characters.
To get your desired output, you need to call it like this:
Dim delims() As String = { Environment.NewLine }
Dim tempStr() As String = fileContents.Split(delims, _
StringSplitOptions.RemoveEmptyEntries)
This will cause it to split on the actual string, rather than the first character as it's doing now, and it will remove any blank entries from the results.
Why not just use File.ReadAllLines? One single call reads the file and returns a string array with the lines.
Dim tempStr() As String = File.ReadAllLines("data.txt")
I just ran into the same issue, and found all the comments very helpful. However, I corrected my issue by replacing "Environment.NewLine" with vbLf (as opposed to vbCrLf, which had the same issue). Any issues with this approach? (It seems more straightforward, but I'm not a programmer, so I wouldn't know of any potential issues.)
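One thing to watch with vbLf alone: on a file written with Windows line endings, each item keeps a trailing carriage return (character 13), which is invisible when printed but breaks string comparisons. A minimal sketch that accepts either style of line ending by passing the separators as a string array (SplitLines is a hypothetical helper name):

```vb
Imports System
Imports Microsoft.VisualBasic

Module LineSplitSketch
    ' Splits on CR+LF, bare LF, or bare CR, and drops empty entries.
    ' Passing a String() selects the overload that matches whole strings.
    Public Function SplitLines(text As String) As String()
        Return text.Split({vbCrLf, vbLf, vbCr}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        Dim windowsText As String = "FirstItem" & vbCrLf & "SecondItem" & vbCrLf & "ThirdItem" & vbCrLf
        For Each item As String In SplitLines(windowsText)
            Console.WriteLine("*" & item & "*")
        Next
    End Sub
End Module
```

Because empty entries are removed, genuinely blank lines in the file are dropped as well. Use StringSplitOptions.None if those need to be kept.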