How to split a string based on capitalized initials - openrefine

I was asked how to split a string like "Abc def GhI jKl MNO Pqr" into ["Abc def";"Ghi jKl";"MNO Pqr"]
A colleague found a solution in Python, and I found this one in GREL. It seems to work properly, but I am not sure I understand the regex myself ;-)
value.find(/(\b\p{Lu}[^\s]*(\s+[^\p{Lu}][^\s]*)*)/)
Is there a simpler solution?
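For comparison, here is a minimal Python sketch of the same idea. It is not the colleague's solution (that one is not shown in the question). Python's built-in re module does not support \p{Lu}, so this sketch assumes ASCII capitals and uses [A-Z] instead; the name split_on_capitalized_initials is made up for illustration.

```python
import re

# Assumption: ASCII-only input, so [A-Z] stands in for \p{Lu}.
# \b[A-Z]\S*            a word that starts with a capital letter
# (?:\s+[^A-Z\s]\S*)*   any following words that do NOT start with a capital
CHUNK = re.compile(r"\b[A-Z]\S*(?:\s+[^A-Z\s]\S*)*")

def split_on_capitalized_initials(text):
    """Return the chunks of text that each begin at a capitalized word."""
    return CHUNK.findall(text)

print(split_on_capitalized_initials("Abc def GhI jKl MNO Pqr"))
# ['Abc def', 'GhI jKl', 'MNO', 'Pqr']
```

With this pattern every word that starts with a capital opens a new chunk, so MNO and Pqr end up in separate chunks.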

Related

Convert string to 'Proper' casing

Is there a way to convert a string to 'Proper' casing? I'm using the Excel definition of 'Proper' which will format text such that the first letter of any word is capitalized and the remaining letters are lower case.
Sample Inputs | Outputs
I browsed the Presto string functions and operators documentation, and it seems like this isn't possible, but I'm hoping someone here can prove me wrong!
You can use regexp_replace to turn a string into title case:
select regexp_replace('Hell asdasd QWEEQ aWQW', '(\w)(\w*)', x -> upper(x[1]) || lower(x[2]));
Output:
_col0
Hell Asdasd Qweeq Awqw
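The same two-group trick can be sketched outside Presto as well. This is only an illustration in Python, and proper_case is a hypothetical helper name: the first group captures the first character of each word, the second captures the rest, and a callback upper-cases one and lower-cases the other.

```python
import re

def proper_case(text):
    """Capitalize the first letter of every word and lower-case the rest."""
    # (\w) is the first character of a word, (\w*) is the remainder of it.
    return re.sub(
        r"(\w)(\w*)",
        lambda m: m.group(1).upper() + m.group(2).lower(),
        text,
    )

print(proper_case("Hell asdasd QWEEQ aWQW"))  # Hell Asdasd Qweeq Awqw
```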

using Regex get substring between underscore 2 and underscore 3 of string, vb.net

I have a string like: Title Name_2021-04-13_A+B+C_Division.txt. I need to extract the A+B+C. The A+B+C may be other letters. I believe that using Regex would be the simplest way to do this. In other words I need to get the substring between underscore 2 and underscore 3 of string. All of my code is written in vb.net. I have tried:
boatClass = Regex.Match(myFile, "(?<=_)(.*)(?=_)").ToString
I know this is not right but I think it is close. What do I need to add or change?
A regex that captures the substring between the second and third underscore of a string (in capture group 1) is:
(?:[^_]+_){2}([^_]+)
However, I chose to use the split function:
myString.Split("_"c)(2)
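To compare the two approaches side by side, here is a small Python sketch (not VB.NET) using the file name from the question. The variable names are made up for illustration; the regex is the one given above, read through capture group 1.

```python
import re

file_name = "Title Name_2021-04-13_A+B+C_Division.txt"

# Regex route: skip two "text_" prefixes, then capture up to the next underscore.
match = re.match(r"(?:[^_]+_){2}([^_]+)", file_name)
via_regex = match.group(1) if match else None

# Split route: the third piece (index 2) sits between underscores 2 and 3.
parts = file_name.split("_")
via_split = parts[2] if len(parts) > 2 else None

print(via_regex, via_split)  # A+B+C A+B+C
```

The split version is shorter, but it needs the length check if a file name might have fewer than two underscores.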

Word2Vec word containing numeric values

When I add sentences to a Word2Vec model, it seems to remove the words which end or start with numeric values. For example, "ISO 9001" is returned as "ISO ". I'm guessing it's something simple...
Thanks in advance.
I think you already answered your question in the tags you gave to this question. Most likely your tokenizer splits on blank spaces and leaves out numbers. If you paste the tokenizer code you use here, we will be able to help you further.
Good luck!
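The asker's tokenizer is not shown, so the following is only a hypothetical illustration of how a tokenizer can silently drop numbers. Both patterns are assumptions chosen for the demo, not taken from the question.

```python
import re

sentence = "ISO 9001 certification"

# Hypothetical letters-only tokenizer: numeric tokens never match.
letters_only = re.findall(r"[A-Za-z]+", sentence)

# Tokenizer that keeps digits: \w covers letters, digits and underscore.
keeps_digits = re.findall(r"\w+", sentence)

print(letters_only)  # ['ISO', 'certification']
print(keeps_digits)  # ['ISO', '9001', 'certification']
```

If the tokens fed to the model look like the first list, the fix belongs in the tokenizing step rather than in Word2Vec itself.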

How to remove strings contained in a list in VB.NET?

How can I find words like and, or, to, a, no, with, for etc. in a sentence using VB.NET and remove them. Also where can I find all words list like above.
Note that unless you use Regex word boundaries you risk falling afoul of the Scunthorpe (Sfannythorpe) problem.
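// requires: using System.Text.RegularExpressions;
string pattern = @"\band\b"; // \b is a word boundary, so only the whole word "and" matches
Regex re = new Regex(pattern);
string input = "a band loves and its fans";
string output = re.Replace(input, ""); // "a band loves  its fans" (the spaces around the removed word remain)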
Notice the 'and' in 'band' is untouched.
You can indeed replace your list of words using the .Replace function (as colithium described) ...
myString.Replace("and", "")
Edit:
... but indeed, a nicer way is to use Regular Expressions (as edg suggested) to avoid replacing parts of words.
As your question suggests that you would like to clean up a sentence to keep the meaningful words, you have to do more than just remove two- and three-letter words.
What you need is a list of stop-words:
http://en.wikipedia.org/wiki/Stop_word
A comma-separated list of stop-words for the English language can be found here:
http://www.textfixer.com/resources/common-english-words.txt
The easiest way is:
myString.Replace("and", "")
You'd loop over your word list and have a statement like the above. Google for a list of common English words?
List of English 2 Letter Words
List of English 3 Letter Words
You can match the words and remove them using regular expressions.
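Putting the suggestions from this thread together (a stop-word list plus word boundaries), a minimal sketch could look like this. It is written in Python rather than VB.NET, the short STOP_WORDS list is just the examples from the question, and remove_stop_words is a made-up helper name.

```python
import re

# Assumption: a tiny illustrative list; a real one would be loaded from a stop-word file.
STOP_WORDS = ["and", "or", "to", "a", "no", "with", "for"]

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Remove whole-word occurrences of the stop words, ignoring case."""
    # \b on both sides keeps "and" inside "band" from matching.
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub("", sentence)
    # Collapse the gaps left behind by the removed words.
    return " ".join(stripped.split())

print(remove_stop_words("a band loves and its fans"))  # band loves its fans
```

Building one alternation from the whole list means the sentence is scanned once instead of once per word.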

RegEx and matching codes right to left

Struggling a little bit with the RegEx. I've got 4 codes in a string:
CODE4:CODE3:CODE2:CODE1
each code is optional apart from CODE1
So I could have ab:bc:de:fg
or
bc::fg
or
ab:::fg
In each of the cases above CODE1 = fg, and for the life of me I can't work out the RegEx.
This would be easy to do as a standard string parse, but unfortunately, because of business objects, it needs to be done via regex :-( and the result read back in VB.NET through the match's Groups("Code1"), which should give fg (I hope that makes sense).
Thanks in advance for any help
Ended up with a bit of RegEx that does the job, bit messy but it works
(^(?<code1>[\w]*)$)|(^(?<code2>[\w]*):(?<code1>[\w]*)$)|(^(?<code3>[\w]*):(?<code2>[\w]*):(?<code1>[\w]*)$)|(^(?<code4>[\w]*):(?<code3>[\w]*):(?<code2>[\w]*):(?<code1>[\w]*)$)
Ta all
There's no need to use a regular expression here.
I don't know what language you're using, but split the string on ':' and you'll have an array of codes.
If you really just want to check whether a string is valid for this, then
/^(\w*:){0,3}\w+$/
matches your description and the few examples you've given.
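As a rough sketch of this answer's suggestion (check the shape with a regex, then split on ':'), here is one way it might look in Python. The name parse_codes and the choice to return None for missing or empty codes are assumptions made for the example, not part of the question.

```python
import re

# Up to three optional "code:" prefixes, then a mandatory final code.
VALID = re.compile(r"(?:\w*:){0,3}\w+")

def parse_codes(text):
    """Map a CODE4:CODE3:CODE2:CODE1 string to a dict keyed code1..code4."""
    if VALID.fullmatch(text) is None:
        return None  # not in the expected shape
    # Reverse so that the rightmost (mandatory) part comes first as CODE1.
    parts = text.split(":")[::-1]
    parts += [""] * (4 - len(parts))  # pad the codes that were left out
    return {f"code{i}": (part or None) for i, part in enumerate(parts, start=1)}

print(parse_codes("ab:bc:de:fg"))
# {'code1': 'fg', 'code2': 'de', 'code3': 'bc', 'code4': 'ab'}
print(parse_codes("ab:::fg"))
# {'code1': 'fg', 'code2': None, 'code3': None, 'code4': 'ab'}
```

Reversing the split result is what makes the numbering run right to left, so the regex itself never has to.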
I'm not sure why you have to match the codes right to left. Simply use a regular expression to pick apart the string:
/(.*):(.*):(.*):(.+)/
and then you have CODE1 in $4, CODE2 in $3, CODE3 in $2, CODE4 in $1.
(CODE4)?:(CODE3)?:(CODE2)?:CODE1 would work, if leading colons don't matter. Otherwise, if you can't have leading colons, enumerate:
(CODE4:(CODE3)?:(CODE2)?:|CODE3:(CODE2)?:|CODE2:)?CODE1
There's nothing special about the fact that the right-most part is mandatory, and the left-most parts aren't.