Regex: possible to match repeated patterns in openrefine? - openrefine

In OpenRefine I'm trying, for example, to get all the occurrences of [aeio]+ in "abeadsabmoloei" as an array: ["a","ea","a","o","oei"]
Let's suppose we don't know the content of the string.
Is it possible with the match() function?

The match() function is not made to find multiple instances of a pattern in the same string. This is why a discussion is under way to implement a find() or findAll() function. In the meantime, two lines of Python/Jython will do the trick:
import re
return re.findall(r"[aeio]+", value)
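For a quick check outside OpenRefine, the same call can be run in plain Python. This is only a sketch: in OpenRefine's Jython editor the cell content arrives as value, and here the sample string from the question stands in for it. The name matches is just an illustrative variable.

```python
import re

# Stand-in for OpenRefine's `value` variable (the cell content).
value = "abeadsabmoloei"

# findall returns every non-overlapping match, scanning left to right.
matches = re.findall(r"[aeio]+", value)
print(matches)  # ['a', 'ea', 'a', 'o', 'oei']
```

Inside OpenRefine the result may need to be joined into a single string (for example with ",".join(...)) before it can be stored in a cell; that part is an assumption worth verifying against your version.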

Related

How to write xpath for following example?

For example, I have div tag that has two attributes.
class='hello#123' text='321#he#321llo#321'
<div> class='hello#123' text='321#he#321llo#321'></div>
Here, I want to write an XPath for both the class and text attributes, but the numbers may change dynamically, i.e., "hello#123" may become "345" when we reload, and "321#he#321llo#321" may become "567#he#456llo#321".
Note: I need to write the XPath as a single expression, not separately.
Assuming that you have the (corrected) two-attribute-HTML
<div class='hello#123' text='321#he#321llo#321'>...</div>
you can select it using the following, for example:
Using the contains() function
//div[contains(@class,'hello') and contains(@text,'#he#')]
This is quite specific and only applicable if the "hello" is always split in the same way
Using the translate() function to mask everything except the chars for "hello"
//div[translate(@class,'#0123456789','')='hello' and translate(@text,'#0123456789','')='hello']
This removes all # chars and digits and checks if the remaining string is "hello"
I guess combining these two approaches you will be able to create your own XPath expression fitting your needs. The patterns you provided were not fully clear, so this may only approach a good enough solution.
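To see what the two predicates accept, here is a small Python emulation of the string logic only. It does not evaluate XPath; the helper names (strip_mask, matches_contains, matches_translate) are hypothetical and simply mirror contains() and translate() with an empty replacement string.

```python
MASK = "#0123456789"

def strip_mask(text):
    """Mimic XPath translate(text, '#0123456789', ''): drop every masked character."""
    return text.translate(str.maketrans("", "", MASK))

def matches_contains(cls, txt):
    """Mimic contains(@class,'hello') and contains(@text,'#he#')."""
    return "hello" in cls and "#he#" in txt

def matches_translate(cls, txt):
    """Mimic the translate()-based predicate on both attributes."""
    return strip_mask(cls) == "hello" and strip_mask(txt) == "hello"

# Attribute values taken from the question, before and after a reload.
print(matches_translate("hello#123", "321#he#321llo#321"))  # True
print(matches_translate("345", "567#he#456llo#321"))        # False
```

The second call shows a limitation: if the class attribute really becomes just "345", neither predicate matches on the class part, so the expression has to be adapted to whatever stays constant.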

Regex for extracting certain information from a string

Below is the string that I have -
vdp_plus_forecast_aucc_VDP_20221024_variance_analysis_20221107_backcasting_actuals_asp_True_vlt_True.csv
I need RegEx to take out following items from the string -
20221107
vlt_True
Need help with writing right RegEx for these two extractions. I'm performing the operation on a PySpark DF.
I'm assuming that each value is identified by the label in front of it, so the first alternative captures the digits that follow _variance_analysis_:
(?<=_variance_analysis_)[0-9]+|vlt_(True|False)
This should capture both values you wanted. If you only need the value of vlt, you can replace vlt_ with the lookbehind (?<=vlt_), which matches just the value without the variable name.
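The pattern can be checked in plain Python before wiring it into a DataFrame. This is a sketch using the standard re module, not PySpark itself; in PySpark the same pattern would be passed to a function such as regexp_extract. The variable names are illustrative.

```python
import re

name = ("vdp_plus_forecast_aucc_VDP_20221024_variance_analysis_20221107"
        "_backcasting_actuals_asp_True_vlt_True.csv")

# Combined pattern: digits after "_variance_analysis_", or the vlt flag.
combined = r"(?<=_variance_analysis_)[0-9]+|vlt_(True|False)"
# group(0) is the whole match, regardless of the inner capture group.
found = [m.group(0) for m in re.finditer(combined, name)]
print(found)  # ['20221107', 'vlt_True']

# Lookbehind variant: only the value that follows "vlt_".
vlt_value = re.search(r"(?<=vlt_)(True|False)", name).group(0)
print(vlt_value)  # True
```

Note that the earlier date 20221024 and the asp_True flag are skipped, because neither is preceded by the required label.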

String Template: is it possible to get the n-th element of a Java List in the template?

In String Template one can easily get an element of a Java Map within the template.
Is it possible to get the n-th element of an array in a similar way?
According to the String Template Cheat Sheet you can easily get the first or second element:
You can combine operations to say things like first(rest(names)) to get second element.
but it doesn't seem possible to get the n-th element easily. I usually transform my list into a map with list indexes as keys and do something like
map.("25")
Is there some easier/more straightforward way?
Sorry, there is no mechanism to get a[i].
There is no easy way to get the n-th element of a list.
In my opinion this indicates that your view and business logic are not separated enough: knowledge of what the magic number 25 means is spread across both tiers.
One possible solution might be converting the list of values to an object that gives meaning to the elements. For example, let's say a list of String represents address lines, in which case instead of map.("3") you would write address.street.

xPath last select element

Can someone help me to bring this code working? I have several select fields and I only want the last one in my variable.
variable = browser.elements_by_xpath('//div[@class="nested-field"]//select[last()]')
Thanks!
This is a FAQ: The [] operator in XPath has higher precedence (priority) than the // pseudo-operator. This is why brackets must be used to change the default operator priorities. There are at least several similar questions with good explanations -- search for them and read and understand.
Instead of:
//div[@class="nested-field"]//select[last()]
Use:
(//div[@class="nested-field"]//select)[last()]
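The difference can be illustrated with Python's standard xml.etree.ElementTree, under a few assumptions: the markup below is invented for the demonstration, and ElementTree only implements a subset of XPath that, as far as I know, does not accept the parenthesised form. The parenthesised expression is therefore emulated by taking the last item of the full result list.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: three nested-field divs, one select each.
doc = ET.fromstring("""
<form>
  <div class="nested-field"><select id="s1"/></div>
  <div class="nested-field"><select id="s2"/></div>
  <div class="nested-field"><select id="s3"/></div>
</form>
""")

# [last()] binds to the select step: "last select within its own parent",
# so every select qualifies here.
per_parent = doc.findall('.//div[@class="nested-field"]//select[last()]')
per_parent_ids = [e.get("id") for e in per_parent]
print(per_parent_ids)  # ['s1', 's2', 's3']

# Emulating (...)[last()]: collect all selects first, then take the last one.
all_selects = doc.findall('.//div[@class="nested-field"]//select')
last_overall_id = all_selects[-1].get("id")
print(last_overall_id)  # s3
```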
Is the class attribute an exact match?
If the markup looks like this
<div class="nested-field other">
...
then you'll have to either match the exact class value or use the XPath contains() function.

sorting and getting uniques

I have a string that looks like this
"apples,fish,oranges,bananas,fish"
I want to be able to sort this list and get only the unique values. How do I do it in VB.NET? Please provide code.
A lot of your questions are quite basic, so rather than providing the code I'm going to provide the thought process and let you learn from implementing it.
Firstly, you have a string that contains multiple items separated by commas, so you're going to need to split the string at the commas to get a list. You can use String.Split for that.
You can then use some of the extension methods for IEnumerable<T> to filter and order the list. The ones to look at are Enumerable.Distinct and Enumerable.OrderBy. You can either write these as normal methods, or use Linq syntax.
If you need to get it back into a comma-separated string, then you'll need to re-join the strings using the String.Join method. Note that this needs an array so Enumerable.ToArray will be useful in conjunction.
You can do it using LINQ, like this:
Dim input = "apples,fish,oranges,bananas,fish"
Dim strings = input.Split(","c).Distinct().OrderBy(Function(s) s)
I'm not a VB.NET programmer, but I can give you a suggestion:
Split the string into an array
Create a second array
Cycle through the first array, adding any value that is not in the second.
Upon completion, your second array will have only unique values.
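The loop described in these steps can be sketched as follows. It is written in Python rather than VB.NET, purely to illustrate the logic, and the variable names are illustrative. A final sort is added because the question also asks for sorted output.

```python
text = "apples,fish,oranges,bananas,fish"

# Step 1: split the string into a list.
items = text.split(",")

# Steps 2-3: copy each value into a second list only if it is not there yet.
unique = []
for item in items:
    if item not in unique:
        unique.append(item)

# Sort the unique values alphabetically.
unique.sort()
print(unique)  # ['apples', 'bananas', 'fish', 'oranges']

# Shorter equivalent: a set removes duplicates, sorted() orders the result.
shortcut = sorted(set(text.split(",")))
```

The membership test inside the loop scans the second list each time, which is fine for short inputs; for long lists the set-based shortcut scales better.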