Extract text using GREL in OpenRefine

I'm trying to add a column based on an existing column in OpenRefine using GREL.
I need to extract all the text after the second space in a scientific name.
Here are two examples of the original cell data ---> what I want to extract:
Amandinea punctata (Hoffm.) Coppins & Scheid. ---> (Hoffm.) Coppins & Scheid.
Agonimia tristicula (Nyl.) Zahlbr. ---> (Nyl.) Zahlbr.

Here are three ways to achieve the desired result on the given data, ordered from easy to understand to more advanced.
Use column splitting
You can split the column into three columns by choosing a space as the separator and limiting the number of new columns to 3 in the corresponding dialog. Then you can delete the first two columns, which leaves your desired result.
Use Array functions
You can use the same technique via GREL and arrays: split on a space, discard the first two entries, and join the rest with a space.
value.split(" ").slice(2).join(" ")
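For anyone who wants to check the logic outside OpenRefine, here is a minimal Python sketch of the same split, slice, join idea, run on the two names from the question. It is an illustration rather than GREL, and the function name after_second_space is made up for this example.

```python
def after_second_space(name: str) -> str:
    # Split on single spaces, drop the first two tokens (genus and species),
    # then rejoin whatever is left with single spaces.
    return " ".join(name.split(" ")[2:])


print(after_second_space("Amandinea punctata (Hoffm.) Coppins & Scheid."))
print(after_second_space("Agonimia tristicula (Nyl.) Zahlbr."))
```

A name with only two words yields an empty string here, since nothing is left after the slice.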
Use regular expressions
You can also use the match function with a regular expression. match returns an array of the capture groups, so [0] is the text captured by (.+).
value.match(/\S+\s\S+\s(.+)/)[0]
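The same pattern can be tried in Python as a rough cross-check. This sketch uses re.fullmatch because GREL's match tests the whole string; note that Python numbers capture groups from 1, whereas the GREL array above starts at 0. The helper name authority is hypothetical.

```python
import re
from typing import Optional

# Two runs of non-space characters, each followed by one whitespace
# character, then capture everything that remains.
PATTERN = re.compile(r"\S+\s\S+\s(.+)")


def authority(name: str) -> Optional[str]:
    m = PATTERN.fullmatch(name)
    # group(1) is the first capture group; None signals "no match".
    return m.group(1) if m else None


print(authority("Amandinea punctata (Hoffm.) Coppins & Scheid."))
print(authority("Agonimia tristicula"))  # no third part, so no match
```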

Another solution:
partition on what appears to be a good separator, " (", take the right-hand part, and add the missing "(" back at the beginning.
"("+value.partition(" (")[2]

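Python's str.partition happens to behave much like the GREL partition used in the answer above, so the idea can be sketched as follows. The function name from_first_paren is invented for illustration, and the second call shows a caveat: a name with no " (" in it leaves only the re-added parenthesis, so the approach relies on every name having a parenthesised author.

```python
def from_first_paren(name: str) -> str:
    # partition returns (text before, separator, text after); the separator
    # " (" is consumed, so the opening parenthesis is added back by hand.
    before, sep, after = name.partition(" (")
    return "(" + after


print(from_first_paren("Agonimia tristicula (Nyl.) Zahlbr."))
print(from_first_paren("Agonimia tristicula Zahlbr."))  # separator absent
```
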
Related

Hive SQL regexp_extract (number)_(number)

I'm new to hiveSQL and I'm trying to extract a value from the column col_a from the data df which is in this format:
\\\"id\\\":\\\"101_12345\\\"
I only need to extract 101_12345, but the underscore makes this hard. I tried regexp_extract(col_a, '(\\d+)[_](\\d+)'), but it only outputs 101.
Could I get some help with regexp? Thank you
Simple solution: you don't need the two pairs of parentheses.
Here's a working pattern: '\\d+[_]\\d+'. Pass 0 as the group index, as in regexp_extract(col_a, '\\d+[_]\\d+', 0), so that the whole match is returned; when the index is omitted, Hive returns group 1, which is why your call produced only 101.
When you put tokens into parentheses, the regex engine captures their match as a group, separate from the complete match. So the result comprises the complete match (group 0) plus two extra groups, one for the digits before the underscore and one for the digits after it. To avoid this, just remove the parentheses, as you don't really need them.
In the future, if you want to group part of a regex but don't want the result to contain it separately, use a non-capturing group, written (?:...).
Here's a demo of what your code resulted in, hosted at regex101.com
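The difference between the complete match and the capture groups can be seen in a few lines of Python. This is only an illustration of how groups are numbered, not Hive code, and the sample string is a simplified stand-in for the escaped value in the question.

```python
import re

raw = '"id":"101_12345"'  # simplified stand-in for the column value

# With two capturing groups: group 0 is the whole match,
# groups 1 and 2 are the pieces on either side of the underscore.
m = re.search(r"(\d+)_(\d+)", raw)
print(m.group(0), m.group(1), m.group(2))

# With non-capturing groups there is still a whole match (group 0),
# but no numbered groups are recorded.
n = re.search(r"(?:\d+)_(?:\d+)", raw)
print(n.group(0), n.groups())
```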

How to split a column value to multiple columns based on delimiter in Snowflake

I need to split a column value into multiple columns based on a delimiter.
I also need to create the columns dynamically based on the number of delimiters; the delimiter could be a comma or something similar. Thanks.
You might be better off working with an array in this case. You haven't specified whether each record will have the same number of delimiters, so dynamically creating columns for tables will take some scripting. If it does, you could potentially use SPLIT, LATERAL FLATTEN, and PIVOT, but the pivot needs static column names, so you'll likely need a stored procedure to deal with that.
To answer your first question, you can use SPLIT and/or SPLIT_PART to split the column into values. The SPLIT function creates an array for you, while the SPLIT_PART function creates the array, but outputs a single value from the array.
https://docs.snowflake.com/en/sql-reference/functions/split.html
https://docs.snowflake.com/en/sql-reference/functions/split_part.html
https://docs.snowflake.com/en/sql-reference/functions/flatten.html
https://docs.snowflake.com/en/sql-reference/constructs/pivot.html
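To make the "array first, columns second" point concrete, here is a small Python sketch, not Snowflake SQL. It splits each value into a list and only then works out how many columns are needed, which is the step that a static PIVOT cannot do on its own. The sample values and the function name split_to_columns are hypothetical.

```python
from typing import List, Optional


def split_to_columns(values: List[str], delimiter: str = ",") -> List[List[Optional[str]]]:
    # Step 1: turn every value into an array of parts.
    parts = [v.split(delimiter) for v in values]
    # Step 2: the column count is only known after looking at all rows.
    width = max((len(p) for p in parts), default=0)
    # Step 3: pad short rows with None so every row has the same width.
    return [p + [None] * (width - len(p)) for p in parts]


rows = ["a,b,c", "d,e", "f"]  # hypothetical sample values
for row in split_to_columns(rows):
    print(row)
```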

Extract all elements from a string separated by underscores using regex

I am trying to capture all the elements of the string below, which are separated by underscores (it is the campaign name for an advertisement), and store each one in a separate column. I then want to compare them with a master table holding the true values to determine how accurately the data is being recorded.
eg: Input :
Expected output is :
My first element extraction was: REGEXP_EXTRACT(campaign_name, r"[^_+]{3}") as parsed_campaign_agency
I only extracted the first 3 letters because, according to the naming convention (truth table), the agency name is made up of only 3 letters.
Caveat: some elements can have variable lengths too. E.g. the third element, "CrossBMC", could be 3 letters in length or more.
I am new to regex and the data lies in a SQL table (in BigQuery), so I thought this could be achieved via SQL's REGEXP_EXTRACT, but what I am having trouble with is extracting all the elements at once.
Any help is appreciated :)
If the number of underscores is constant and known, you can use SUBSTRING_INDEX (a MySQL function) like this:
SELECT
SUBSTRING_INDEX(campaign_name,'_',1) first,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',2),'_',-1) second,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',3),'_',-1) third
FROM your_table;
You can try an example on SQLize.online.
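The nested calls can be hard to read at first, so here is a rough Python emulation of how SUBSTRING_INDEX treats positive and negative counts. The campaign name is invented for illustration (only the three-letter agency and "CrossBMC" come from the question), and substring_index is a hand-written helper, not a library function.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter from the left.
        return delim.join(parts[:count])
    if count < 0:
        # Everything to the right of the |count|-th delimiter from the right.
        return delim.join(parts[count:])
    return ""


name = "ABC_Brand_CrossBMC_2021"  # hypothetical campaign name

first = substring_index(name, "_", 1)
# Keep the first two elements, then take the last of those.
second = substring_index(substring_index(name, "_", 2), "_", -1)
third = substring_index(substring_index(name, "_", 3), "_", -1)
print(first, second, third)
```

When an array is acceptable, a plain split on "_" returns every element at once, whatever their individual lengths.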

Split column preserving values

I have a column of words followed by numbers, like this:
I want to split it into two columns, putting the text to the left of the digits in the first column, and the digits and any text that follow into the second column.
I suspect I'll have to add a column based on this column, containing the digits and everything after. Then I'll have to delete the digits and everything after from the previous column.
I'm not great at GREL, and the examples I've found don't work. Help?
There are several ways. If you don't like GREL but you know some regular expressions, you can use "Edit column" -> "Split into several columns" and use this regex as the separator:
\s(?=\d)
It means "any whitespace character that comes immediately before a digit".
(Don't forget to check the box "regular expression".)
If any of your values contain multiple numbers (eg, "text 123 newtext 345 sometext"), specify "split into 2 columns at most".
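The effect of the lookahead, and of limiting the split to two columns, can be checked with Python's re.split, which accepts the same pattern. This is only a sketch of the regex behaviour, not of the OpenRefine dialog; the input is the example value quoted above.

```python
import re

value = "text 123 newtext 345 sometext"

# The lookahead (?=\d) tests for a digit without consuming it,
# so the digits stay at the start of the right-hand piece.
unlimited = re.split(r"\s(?=\d)", value)

# maxsplit=1 plays the role of "split into 2 columns at most".
two_columns = re.split(r"\s(?=\d)", value, maxsplit=1)

print(unlimited)
print(two_columns)
```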

Hive Regular expression - only portion of string needed

Hi, I am trying to extract a portion of the data from one column in my Hive table, but the portion I need does not start at a fixed position.
select value4,regexp_extract(value4,'*****',0) from hive_table;
The column values are shown below:
grade:data:home made;Cat;dinnerbox_grade_Enroll
list:date:may;animal;dinnerbox_list_value
cgrade:made_data;dinnerbox_cgrade_notEnroll
I want the data from dinnerbox to the end of the value.
Can anyone help with this?
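It is a fairly simple regular expression:
.*dinnerbox(.*?)$
The leading .* is greedy, so it consumes as much as it can and the match lands on the last dinnerbox in the line; the $ anchor then forces the non-greedy group to run to the end of the line.
You want capture group 1, which holds everything after dinnerbox.
To get rid of the leading _ as well, you can use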
.*dinnerbox_(.*?)$
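Both patterns can be tried against the three sample values from the question with Python's re module, which treats these particular patterns the same way for this purpose. The helper name tail is hypothetical, and the last pattern is an assumed variant (not from the answer) that moves dinnerbox inside the group in case the word itself should be kept.

```python
import re
from typing import Optional

rows = [
    "grade:data:home made;Cat;dinnerbox_grade_Enroll",
    "list:date:may;animal;dinnerbox_list_value",
    "cgrade:made_data;dinnerbox_cgrade_notEnroll",
]


def tail(pattern: str, line: str) -> Optional[str]:
    m = re.search(pattern, line)
    # Capture group 1 is the part in parentheses.
    return m.group(1) if m else None


for line in rows:
    print(
        tail(r".*dinnerbox(.*?)$", line),   # text after "dinnerbox"
        tail(r".*dinnerbox_(.*?)$", line),  # same, without the underscore
        tail(r".*(dinnerbox.*)$", line),    # variant that keeps "dinnerbox"
    )
```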