How can I remove a word from string? - vb.net

I have a string which contains words with parentheses. I need to remove the whole word from the string.
For example: for the input, "car wheels_(four) klaxon" the result should be, "car klaxon".
Can someone give me an example that would accomplish this?

You can do this with regular expressions. The regular expression you need is:
"\s?\S*[()]\S*\s?"
This matches any word containing ( or ) (or both), together with one optional whitespace character on either side. Replacing each match with a single space removes the word and collapses the surrounding whitespace.
In C# the regular expression could be used like this:
string s = "car wheels_(four) klaxon";
s = Regex.Replace(s, @"\s?\S*[()]\S*\s?", " ");
I'm not entirely sure of the VB translation for this, but hopefully you can figure it out.
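As a language-neutral illustration of the same pattern, here is a minimal sketch in Python rather than VB.NET. The function name remove_paren_words is hypothetical, and the final strip() is an added assumption to tidy the ends when the matched word is first or last in the string.

```python
import re

# A whitespace-delimited word containing "(" or ")", plus one optional
# whitespace character on each side (same pattern as in the answer above).
PAREN_WORD = re.compile(r"\s?\S*[()]\S*\s?")

def remove_paren_words(text):
    # Replace each match with a single space, then trim the ends.
    return PAREN_WORD.sub(" ", text).strip()

print(remove_paren_words("car wheels_(four) klaxon"))  # car klaxon
```

Two such words directly next to each other may still leave a double space, so a final whitespace-collapsing pass can be worth adding.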

Slightly different:
sed "s/\s\+\S*(.\+)\S*\s\+/ /g" yourfile
It works like this:
yourfile:
car wheels_(four) klaxon
ciao (wheel) hey
foo bar (baz) qux
stack overflow_(rulez)_the world
transformed in:
car klaxon
ciao hey
foo bar qux
stack world

If speed isn't an issue and you want to avoid overcomplicated regular expressions, you can use String.Split on " " to create an array of "words", iterate through the words, drop any for which String.Contains("(") is true, then use String.Join with a separator of " " to get your result.
Sorry can't send the codez, don't have a VB.NET compiler on hand.
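The split, filter, and join steps described above can be sketched as follows. This is Python rather than VB.NET, the name drop_paren_words is hypothetical, and checking for ")" as well as "(" is an added assumption.

```python
def drop_paren_words(text):
    # Split on single spaces to get the "words".
    words = text.split(" ")
    # Keep only the words that contain no parenthesis.
    kept = [w for w in words if "(" not in w and ")" not in w]
    # Rejoin with a single space between the surviving words.
    return " ".join(kept)

print(drop_paren_words("car wheels_(four) klaxon"))  # car klaxon
```

Dropping the words before joining, instead of replacing them with empty strings, avoids leaving doubled spaces in the result.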

Related

openrefine how remove certain words from the end of each cell

I have a column in OpenRefine, which has cells with content like:
This dog is a great dog.
This cat is a great cat,
I would like to remove the words dog and cat from the end of each cell (if punctuation could be removed too, that would be great).
I have tried with
\bdog\s*$
but I get errors, or no replacement is done.
I am using openrefine 3.3.
value.replace(\bdog|\bcat\s*$,'')
error i get:
Parsing error at offset 14: Missing number, string, identifier, regex, or parenthesized expression
desired output:
This dog is a great
This cat is a great
It would also be great if I could remove trailing characters such as " : , . (I am actually looking for a regex to cluster publishers in librarian data), so if you could suggest words I should remove from the end of the cells, I would be grateful.
I combined Ettore's answer with the expression value.split(' ')[-1], which selects the last word of a string.
The result is:
replace(value,value.split(' ')[-1],'') + value.split(' ')[-1].replace(/cat|dog/,'')
where
replace(value,value.split(' ')[-1],'') selects your string except the last word
value.split(' ')[-1].replace(/cat|dog/,'') removes cat or dog from the last word, leaving any trailing punctuation in place.
Note that the expression only works because of the punctuation at the end of the string: the first replace removes every occurrence of the last word, so without the punctuation the earlier dog or cat would be removed too. Not a perfect solution, but you may be able to build something from here.
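For comparison, a single anchored regex can remove the trailing word and the punctuation around it in one pass. The sketch below uses Python's re module, not GREL; the name strip_trailing_word and the punctuation set are assumptions. In GREL the pattern would presumably go between slashes, like the /cat|dog/ literal used above.

```python
import re

# Optional punctuation/whitespace, then "dog" or "cat" as a whole word,
# then optional punctuation/whitespace, anchored at the end of the string.
TRAILING = re.compile(r'[\s.,:;"]*\b(?:dog|cat)\b[\s.,:;"]*$')

def strip_trailing_word(value):
    # Only a match that reaches the end of the string is removed.
    return TRAILING.sub("", value)

print(strip_trailing_word("This dog is a great dog."))  # This dog is a great
print(strip_trailing_word("This cat is a great cat,"))  # This cat is a great
```

Because of the $ anchor, the earlier dog or cat in each cell is left untouched.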

How do I replace duplicate whitespaces in a String in Kotlin?

Say I have a string: "Test    me".
How do I convert it to: "Test me"?
I've tried using:
string?.replace("\\s+", " ")
but it appears that \\s is an illegal escape in Kotlin.
The replace function in Kotlin has overloads for both plain-string and Regex patterns.
"Test    me".replace("\\s+", " ")
This replaces occurrences of the literal string \s+, which is the problem.
"Test    me".replace("\\s+".toRegex(), " ")
This line replaces multiple whitespaces with a single space.
Note the explicit toRegex() call, which makes a Regex from a String, thus specifying the overload with Regex as pattern.
There's also an overload which allows you to produce the replacement from the matches. For example, to replace them with the first whitespace encountered, use this:
"Test\n\n me".replace("\\s+".toRegex()) { it.value[0].toString() }
By the way, if the operation is repeated, consider moving the pattern construction out of the repeated code for better efficiency:
val pattern = "\\s+".toRegex()
for (s in strings)
result.add(s.replace(pattern, " "))
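The same literal-versus-regex distinction exists in other languages. As a rough parallel only, here is a Python sketch (not Kotlin): str.replace is literal, while re.sub takes a pattern and also accepts a function that builds the replacement from each match. The names collapse and keep_first are hypothetical.

```python
import re

# Compile once and reuse, as suggested above for repeated calls.
WHITESPACE = re.compile(r"\s+")

def collapse(text):
    # Replace each run of whitespace with a single space.
    return WHITESPACE.sub(" ", text)

def keep_first(text):
    # Replace each run of whitespace with its own first character.
    return WHITESPACE.sub(lambda m: m.group()[0], text)

sample = "Test\n\n me"
print(repr(sample.replace(r"\s+", " ")))  # literal search: nothing is replaced
print(repr(collapse(sample)))
print(repr(keep_first(sample)))
```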

Using groups in OpenRefine regex

I'm wondering if it is possible to use groups in regular expressions in OpenRefine's GREL syntax. I mean, I'd like to replace every dot that is preceded and followed by a character with the same character and dot, followed by a space and then the second character.
Something like:
s.replace(/(.{1})\..({1})/,/(1).\s(2)/)
It should, but your last argument needs to be a string, not a regular expression. Internally Refine uses Java's Matcher#replaceAll method which accepts a string argument.
I think I found out how to deal with this. You need to put $X in your string value to refer to the Xth capture group.
It should be like this:
s.replace(/.?(#capture group 1).?(#capture group 2).*?/, " some text $1 some text $2 some text")
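To make the capture-group idea concrete, here is a minimal sketch of the dot-spacing replacement from the question. It is written in Python, not GREL: Python's re module refers to groups in the replacement as \1 and \2, where Java-style replacement strings use $1 and $2. The function name space_after_dots is hypothetical.

```python
import re

def space_after_dots(text):
    # (\S) captures the non-space character on each side of the dot;
    # \1 and \2 in the replacement put those characters back.
    return re.sub(r"(\S)\.(\S)", r"\1. \2", text)

print(space_after_dots("Hello.World"))  # Hello. World
```

A dot at the end of the string, or one already followed by a space, does not match and is left alone.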

How to separate words characters and non word characters?

Unicode has categories of characters. Some are alphanumeric. Some are punctuation.
What if I want to know whether a character belongs to a word or not?
For example,
A, a, b, c tend to belong to words. So do Ƈ, Ǝ, ǟ, and so do all Chinese characters.
Sentences like
Hello World, I "like" (to) eat ƇƎǟ and 款开源 ©
Have keywords:
Hello
World
I
like
to
eat
ƇƎǟ
款
开
源
Here, the characters , ( ) and © are not word characters and hence should just be ignored.
© doesn't count as punctuation either. '©'.IsPunctuation returns false in vb.net, but I want to get rid of that too.
Now I want to make a program that can split sentences into keywords. For that I need to know which characters are word characters and which ones are not.
Is there a vb.net function for that?
Do it the other way round: use IsLetter for your test. Or better yet, use regular expressions to split your string by words:
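' Requires: Imports System.Text.RegularExpressions
Dim str = "Hello World, I ""like"" (to) eat ƇƎǟ and 款开源 ©"
' \p{L}+ matches each run of consecutive Unicode letters
Dim wordPattern As New Regex("\p{L}+")
For Each match As Match In wordPattern.Matches(str)
    Console.WriteLine(match.Value)
Next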
Here, \p{L} matches any Unicode letter. However, the above returns “款开源” as a single match rather than as separate matches, since there is no separator between the characters.
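The same letter-run idea can be sketched without a regex engine that supports \p{L}. The Python example below is an illustration, not VB.NET: it groups consecutive characters by str.isalpha(), which is true for characters in the Unicode letter categories. The function name letter_runs is hypothetical.

```python
from itertools import groupby

def letter_runs(text):
    # Group consecutive characters by whether they are letters,
    # and keep only the groups made of letters.
    return ["".join(run)
            for is_letter, run in groupby(text, key=str.isalpha)
            if is_letter]

sentence = 'Hello World, I "like" (to) eat ƇƎǟ and 款开源 ©'
for word in letter_runs(sentence):
    print(word)
```

As with the regex, 款开源 comes back as one run; splitting it into single characters would need an extra rule for scripts that do not separate words with spaces.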
You need to deal with character codes.
For example, if you only want the letters [a-z],
then
if (c >= 'a' && c <= 'z') {
}
or
if (c >= 97 && c <= 122) {
}

How to remove strings contained in a list in VB.NET?

How can I find words like and, or, to, a, no, with, for etc. in a sentence using VB.NET and remove them? Also, where can I find a full list of words like the ones above?
Note that unless you use Regex word boundaries you risk falling afoul of the Scunthorpe (Sfannythorpe) problem.
string pattern = @"\band\b";
Regex re = new Regex(pattern);
string input = "a band loves and its fans";
string output = re.Replace(input, ""); // "a band loves  its fans" (the space on each side of "and" remains)
Notice the 'and' in 'band' is untouched.
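The word-boundary approach extends to a whole list of words by joining them into one alternation. The following is a minimal Python sketch, not VB.NET: the STOP_WORDS list is a small hypothetical sample taken from the question, and remove_stop_words is a made-up name.

```python
import re

# Hypothetical sample list; a real stop-word list would be much longer.
STOP_WORDS = ["and", "or", "to", "a", "no", "with", "for"]

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # Build \b(?:and|or|...)\b so only whole words can match.
    alternation = "|".join(re.escape(w) for w in stop_words)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    without = pattern.sub("", sentence)
    # Collapse the gaps left behind by the removed words.
    return " ".join(without.split())

print(remove_stop_words("a band loves and its fans"))  # band loves its fans
```

Here the final split and join removes the doubled spaces that a plain replacement leaves behind.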
You can indeed replace your list of words using the .Replace function (as colithium described) ...
myString.Replace("and", "")
Edit:
... but indeed, a nicer way is to use Regular Expressions (as edg suggested) to avoid replacing parts of words.
As your question suggests that you would like to clean up a sentence to keep the meaningful words, you have to do more than just remove two- and three-letter words.
What you need is a list of stop-words:
http://en.wikipedia.org/wiki/Stop_word
A comma-separated list of stop-words for the English language can be found here:
http://www.textfixer.com/resources/common-english-words.txt
The easiest way is:
myString.Replace("and", "")
You'd loop over your word list and have a statement like the above. Google for a list of common English words?
List of English 2 Letter Words
List of English 3 Letter Words
You can match the words and remove them using regular expressions.