How to parse specific data in a string line? - vb.net

What makes my question different than many others already asked is that I know how to parse data with delimiters such as a comma or space, but I'm unsure how to parse data that is separated by spaces but also contain spaces. This is an example line:
M 9 12.02 Adam Productions Inc
So I need to parse "M", 9, 12.02, and "Adam Productions Inc" out to different variables but I'm not sure how to do that. Here is my code to parse by spaces.
Dim contents() As String = strRawFile.Split(New [Char]() {CChar(" ")}, StringSplitOptions.RemoveEmptyEntries)
Obviously this does parse my first 3 pieces of data, but then it tears apart the 4th piece. How can I modify my code to overcome this?
Desired Result:
M
9
12.02
Adam Productions Inc

One way is to split it out and then join together items from the 4th element of the array until the array length. This assumes that there will not be any embedded spaces in the first 3 items.

Related

pandas read_csv with multiple separators does not work

I need to be able to parse 2 different types of CSVs with read_csv, the first has ;-separated values and the second has ,-separated values. I need to do this at the same time.
That is, the CSV can have this format:
some;csv;values;here
or this:
some,csv,values,here
or even mixed:
some;csv,values;here
I tried many things like the following regex but nothing worked:
data = pd.read_csv(csv_file, sep=r'[,;]', engine='python')
Am I doing something wrong with the regex?
Instead of reading from a file, I ran your code sample
reading from a string:
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1'''
data = pd.read_csv(io.StringIO(txt), sep='[,;]', engine='python')
and got a proper result:
C1 C2 C3 C4
0 some csv values here
1 some1 csv1 values1 here1
Note that the sep parameter can be even an ordinary (not raw) string,
because it does not contain any backslashes.
So your idea to specify multiple separators as a regex pattern is OK.
The reason that your code failed is probably an "inconsistent" division of
lines into fileds. Maybe you should ensure that each line contains the
same number of commas and semi-colons (at least not too many).
Look thoroughly at your stack trace. There should include some information
about which line of the source file caused the problem.
Then look at the indicated line and correct it.
Edit
To look what happens in a "failure case", I changed the source string to:
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''
i.e. I added one line with 5 fields (one too many).
Then execution of the above code results in an error message:
ParserError: Expected 4 fields in line 4, saw 5. ...
Note words in line 4, precisely indicating the offending input line
(line numbers starts from 1).

Regex to match BIN ranges

I'm trying to write a regex that matches the numbers 456725 to 456744 (Last 2 digits, 25-44), but can't seem to figure out a correct regex format. I've tried ^(4567[2-4][0-9]) but using this also matches 456745 which it shouldn't.
If you do it like ^(4567[2-4][0-9]), you are allowing any number in the range between [2-4] together with any number in the range between [0-9], which is obviously not what you wanted.
So you need to change for something like:
^4567(?:2[5-9]|3[0-9]|4[0-4])
Explanation
^ asserts position at start of the string
4567 matches the characters 4567 literally
Non-capturing group (?:2[5-9]|3[0-9]|4[0-4])
1st Alternative 2[5-9]
2 matches the character 2 literally
Match a single character present in the list [5-9]
2nd Alternative 3[0-9]
3 matches the character 3 literally
Match a single character present in the list [0-9]
3rd Alternative 4[0-4]
4 matches the character 4 literally
Match a single character present in the list [0-4]
You could use the page regex101 to learn more and read good explanations on the subject. Hope it helps.
If your variable is just an integer it is best to just compare it as such...
For the regex though..the ^(4567 is correct your issue is the [2-4] and [0-9] those are independent of each other. You need to put the pieces together so only 25-29 and 40-44 are allowed.
This should get you on the right track:
^(4567(?:2[5-9]|3[0-9]|4[0-4]))$

How hive sentences function breaks each sentence

Before posting, I tried the hive sentences function and did some search but couldn't get a clear understanding, my question is based on what delimiter hive sentences function breaks each sentence? hive manual says "appropriate boundary" what does that mean? Below is an example of my tries, I tried adding period (.) and exclamatory sign(!) at different points of the sentence. I'm getting different outputs, can someone explain on this?
with period (.)
select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 1 array
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
with '!'
select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 2 arrays
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
If you understand the functionality of sentences()..it clears your doubt.
Definition of sentences(str):
Splits str into arrays of sentences, where each sentence is an array
of words.
Example:
SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;
[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]
SELECT sentences('review . language') FROM movies;
[["review","language"]]
An exclamation point is a type of punctuation mark that goes at the end of a sentence. Other examples of related punctuation marks include periods and question marks, which also go at the end of sentences.But as per the definition of sentences() ,Unnecessary punctuation, such as periods and commas in English, is automatically stripped.So,we are able to get two arrays of words with !. It completely involves java.util.Locale.java
I don't know the actual reason but observed after period(.) if you put space and next word first letter as capital then it is working.
Here I changed from where to Where it it worked. However this is not require for !
Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.
And this is giving below output
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

How Do I get this Split Function to Work? (VB.NET)

So, I made a program that for the most part, converts numbers to letters. My problem before was it was converting each individual digit instead of each number e.g. (1-0-1 instead of 101). Someone suggested that I use the Split function:
Dim numbers As String() = DTB.Split(" ")
So now it's reading the number all the way through being that it will only the split if there's a space in between. My problem now is that it's translating for example: "[102, 103, 104]" as "[102", "103" and "104]" because it will only split if there's a space between. Obviously, you can't convert "[102" or "104]" because they aren't actual numbers.
Does anyone have a solution on what I should do to get this to convert no matter the spacing? Would Regex be the way to go?
use a regular expression with \d+ it will match numbers
so
12234abcsdf23434
will return two matches
12234
23434

How to separate words characters and non word characters?

Unicode have categories of characters. Some are alpha numeric. Some are punctuation.
What about if I want to know whether a word belongs to keyword or not
For example,
A,a,b,c, tend to belong to words. So is Ƈ,Ǝ,ǟ, so are all chinese characters.
Sentences like
Hello World, I "like" (to) eat ƇƎǟ and 款开源 ©
Have keywords:
Hello
World
I
like
to
eat
ƇƎǟ
款
开
源
Here, , (),© are not word characters and hence should just be ignored and use.
© doesn't count as punctuation either. '©'.IsPunctuation returns false in vb.net but I want to get rid of that too.
Now I want to make a program that can split sentences into keywords. For that I need to know which characters are word characters and which one is not.
Is there a vb.net function for that?
Do it the other way round: use IsLetter for your test. Or better yet, use regular expressions to split your string by words:
Dim str = "Hello World, I ""like"" (to) eat ƇƎǟ and 款开源 ©"
Dim wordPattern As New Regex("\p{L}+")
For Each match in wordPattern.Matches(str))
Console.WriteLine(match)
Next
Here, \p{L} matches any word character. However, the above matches “款开源” in a single rather than in separate matches since there is no separator between the characters.
u need to deal with "keycodes"
like if u only want letters [a-z]
then
for(c>='a' && c<='z'){
}
or
for(c>=97 && C<=122){
}