Extract all elements from a string separated by underscores using regex - sql

I am trying to capture all elements of the string below, which are separated by underscores (it is a campaign name for an advertisement), and store each one in a separate column. I then wish to compare them with a master table holding the true values to determine how accurately the data is being recorded.
eg: Input :
Expected output is :
My first element extraction was: REGEXP_EXTRACT(campaign_name, r"[^_+]{3}") as parsed_campaign_agency
I only extracted first 3 letters because according to the naming convention(truth table), the agency name is made of only 3 letters.
Caveat: Some elements can have variable lengths too. eg. The third element "CrossBMC" could be 3 letters in length or more.
I am new to regex and the data lies in a SQL table (in BigQuery), so I thought it could be achieved via SQL's regexp_extract, but what I am having trouble with is extracting all elements at once.
Any help is appreciated :)

If the number of underscores is constant and known, you can use SUBSTRING_INDEX like:
SELECT
SUBSTRING_INDEX(campaign_name,'_',1) first,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',2),'_',-1) second,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',3),'_',-1) third
FROM your_table;
Here you can try an example SQLize.online
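For comparison, the same positional split can be sketched outside the database. The Python below is only an illustration under stated assumptions: the campaign name and the master values are hypothetical (the question's input was not shown), and every name is assumed to have the same number of underscore-separated elements. Note that SUBSTRING_INDEX is a MySQL function; in BigQuery the closest built-in is probably SPLIT(campaign_name, '_') indexed with OFFSET(n), which is worth checking against the BigQuery documentation.

```python
import re

# Hypothetical campaign name; the real naming convention was not shown in the question.
campaign_name = "ABC_Display_CrossBMC_2021"

# Split on underscores: each element can have any length.
parts = campaign_name.split("_")
agency = parts[0]

# The question's pattern [^_+]{3} grabs the first run of three characters
# that are neither "_" nor "+", so it only suits fixed 3-letter elements.
first_three = re.search(r"[^_+]{3}", campaign_name).group(0)

# Hypothetical master ("truth") values to compare the parsed agency against.
master_agencies = {"ABC", "XYZ"}
agency_is_valid = agency in master_agencies

print(parts)
print(first_three, agency_is_valid)
```

Splitting first and validating afterwards keeps variable-length elements such as "CrossBMC" intact, which a fixed-width pattern cannot do.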

Related

How to use REGEXP_LIKE() for concatenation in Oracle

I need to make some changes in SQL within a CURSOR. Previously, the maximum length of the values in column 'code' was 4 characters (e.g. K100, K101,....K999) but now it needs to be 8 characters (e.g. K1000, K1001, K1002,....K1000000).
CURSOR c_code(i_prefix VARCHAR2)
IS
SELECT NVL(MAX(SUBSTR(code,2))+1,100) code
FROM users
WHERE code LIKE i_prefix||'___';
The 'code' column value starts from 100 and increments +1 each time a new record is inserted. Currently, the maximum value is 'K999' and I would like it to be K1000, K1001, K1002 and so on.
I have altered and modified the 'code' column to VARCHAR(8) in the users table.
Note: i_prefix value is always 'K'.
I have tried to amend the SQL -
CURSOR c_code(i_prefix VARCHAR2)
IS
SELECT NVL(MAX(SUBSTR(code,2))+1,100) code
FROM users
WHERE code LIKE i_prefix||'________';
However, it restarts from 100 and not from K1000, K1001, K1002, etc. each time a record is inserted.
I have been suggested to use REGEXP_LIKE() but not sure how to properly use it to get the desired outcome in this case.
Can you please guide me on how can we get this result using REGEXP_LIKE().
Thank you.
Your old code
WHERE code LIKE i_prefix||'___';
will match K followed by exactly three characters, which is what you had. Your new code
WHERE code LIKE i_prefix||'________';
will match K followed by exactly eight characters, which is one too many for a start, since you said the total length was eight, which means you need seven wildcard placeholders:
WHERE code LIKE i_prefix||'_______';
... but that still won't work at the moment since your existing values aren't that long. As all your current values are at least four characters long, you could do:
WHERE code LIKE i_prefix||'___%';
which will match K followed by three or more characters, with no upper limit, although your column is restricted to eight characters anyway.
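If you did want to use a regular expression, which is generally slower, you could do:
WHERE REGEXP_LIKE(code, '^'||i_prefix||'.{3,7}$');
which would match K followed by three to seven characters, or:
WHERE REGEXP_LIKE(code, '^'||i_prefix||'\d{3,7}$');
which would only match K followed by three to seven digits. The ^ and $ anchors matter here: without them REGEXP_LIKE matches the pattern anywhere in the value, so longer values or values with a different prefix would also pass.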
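To see how these patterns differ, here is a minimal Python sketch that applies rough regex equivalents to a handful of made-up codes. The sample values are hypothetical, and fullmatch is used so that the whole value has to match, which is how LIKE behaves.

```python
import re

prefix = "K"  # i_prefix is always 'K' according to the question
# Hypothetical sample values, chosen to hit each case.
codes = ["K99", "K100", "K999", "K1000", "K1000000", "K12345678", "KABC"]

# LIKE 'K___'  : K followed by exactly three characters
like_exactly_three = re.compile(re.escape(prefix) + r".{3}")
# LIKE 'K___%' : K followed by three or more characters
like_three_or_more = re.compile(re.escape(prefix) + r".{3,}")
# K followed by three to seven digits only
digits_three_to_seven = re.compile(re.escape(prefix) + r"\d{3,7}")

def matching(pattern):
    # fullmatch requires the entire code to match the pattern
    return [c for c in codes if pattern.fullmatch(c)]

print(matching(like_exactly_three))     # ['K100', 'K999', 'KABC']
print(matching(like_three_or_more))     # everything except 'K99'
print(matching(digits_three_to_seven))  # ['K100', 'K999', 'K1000', 'K1000000']
```

Only the digit-based pattern rejects both the non-numeric KABC and the over-long K12345678.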
However, I would suggest you use a sequence to generate the numeric part, and just prefix that with the K character. The sequence could start from 100 on a new system with no data, or from the current maximum number in an existing system with data.
I would also consider zero-padding the data, including all the existing values, to allow them to be compared; so update K100 to K0000100. Or if you can't do that, once you get past K199 jump to K2000000. Either would then allow the values to be sorted easily as strings. Or, perhaps, add a virtual column that extracts the numeric part as a number.
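The point about comparing values can be illustrated with a small Python sketch. It assumes the part after the prefix is compared as text, which is what a string maximum would do; whether MAX(SUBSTR(code,2)) in the cursor behaves this way depends on how the database types that expression, so treat the parallel as an assumption. The sample codes are hypothetical.

```python
# Hypothetical codes, including one that has already passed K999.
codes = ["K100", "K998", "K999", "K1000"]

# Comparing the suffixes as strings: "999" sorts after "1000".
max_as_text = max(code[1:] for code in codes)

# Comparing the suffixes as numbers gives the value that is actually largest.
max_as_number = max(int(code[1:]) for code in codes)
next_code = f"K{max_as_number + 1}"

# Zero-padding to seven digits makes plain string sorting agree with numeric order.
padded = sorted(f"K{int(code[1:]):07d}" for code in codes)

print(max_as_text, max_as_number, next_code)  # 999 1000 K1001
print(padded)
```

This is why either a numeric extraction or zero-padded values are needed once the codes stop having a uniform length.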

Hive SQL regexp_extract (number)_(number)

I'm new to hiveSQL and I'm trying to extract a value from the column col_a from the data df which is in this format:
\\\"id\\\":\\\"101_12345\\\"
I only need to extract 101_12345, but underscore makes it hard to satisfy my need. I tried using regexp_extract(col_a, '(\\d+)[_](\\d+)') but only outputs 101.
Could I get some help with regexp? Thank you
Simple solution: You don't need the two brackets.
Here's a working solution: '\\d+[_]\\d+'
When you put tokens into parentheses, the regex engine will group its match together, separate from the complete match. So the final result will comprise the complete match, and two extra matches representing the one before and after the underscore. To avoid this, just remove the brackets as you don't really need them.
In the future, if you want to group part of a regex but don't want the result to contain it separately, use a non-capturing group, written (?:...).
Here's a demo of what your code resulted in, hosted at regex101.com
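The difference between the complete match and the captured groups can be reproduced with Python's re module, as a rough sketch. Hive uses Java regular expressions, so details may differ, but the group numbering is the same idea. As far as I know, Hive's regexp_extract also accepts a third index argument where 0 selects the whole match, which is worth confirming on your Hive version.

```python
import re

# The sample value from the question, with its literal backslashes and quotes.
col_a = r'\\\"id\\\":\\\"101_12345\\\"'

with_groups = re.search(r"(\d+)[_](\d+)", col_a)
without_groups = re.search(r"\d+[_]\d+", col_a)

whole_match = with_groups.group(0)        # complete match
first_group = with_groups.group(1)        # digits before the underscore
second_group = with_groups.group(2)       # digits after the underscore
no_group_match = without_groups.group(0)  # same text, no groups involved

print(whole_match, first_group, second_group, no_group_match)
# 101_12345 101 12345 101_12345
```

Getting only 101 back is consistent with group 1 being returned instead of group 0.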

Extract text using GREL in OpenRefine

I'm trying to add a column based on a column in OpenRefine using GREL.
I need to extract all the text after the second space in the scientific name.
Here are two examples of the original cell data ---> what I want to extract:
Amandinea punctata (Hoffm.) Coppins & Scheid. ---> (Hoffm.) Coppins & Scheid.
Agonimia tristicula (Nyl.) Zahlbr. ---> (Nyl.) Zahlbr.
Here are three ways to achieve the desired result on the given data, ordered from easy to understand to more advanced.
Use column splitting
You can split the column into three columns by choosing a whitespace as separator and limit the number of new columns to 3 in the corresponding dialog. Then you can delete the first two columns and have your desired result.
Use Array functions
You can use the same technique via GREL and arrays... split on whitespace, discard the first two entries and join the rest on whitespace.
value.split(" ").slice(2).join(" ")
Use regular expressions
You can also use the match function with a regular expression.
value.match(/\S+\s\S+\s(.+)/)[0]
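For readers who want to check the logic outside OpenRefine, the three approaches above translate fairly directly to Python. This is only a sketch of the same ideas, not GREL itself, and it assumes the words are separated by single spaces as in the examples.

```python
import re

name = "Amandinea punctata (Hoffm.) Coppins & Scheid."

# Column-splitting approach: split into at most three pieces, keep the third.
by_limited_split = name.split(" ", 2)[2]

# Array approach: split on spaces, drop the first two words, rejoin.
by_split = " ".join(name.split(" ")[2:])

# Regex approach: two space-delimited words, then capture the rest.
by_regex = re.match(r"\S+\s\S+\s(.+)", name).group(1)

print(by_limited_split)
print(by_split)
print(by_regex)
```

All three give the same text for the sample names. In this Python version, a name with fewer than three words makes the first and third approaches raise an error, while the second returns an empty string.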
Another solution: partition on what appears to be a good separator, " (", take the right-hand part, and add the missing "(" back at the beginning.
"("+value.partition(" (")[2]

How do I use the contain function in AWS Athena to find certain text

I have a simple table let's say Names. In the column First I want to find all rows that contain the letters Jo. It would then result in Joe and John.
However, when I try to use the contains function it talks about an array and I have no idea how to get it to work.
Thanks.
CONTAINS() is a function that checks whether an element is in an array.
You can use the LIKE operator (which is Standard SQL). If you want names that start with "Jo":
where name like 'Jo%'
If you want to match names with "Jo" anywhere:
where name like '%Jo%'
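The two LIKE patterns correspond to a prefix test and a substring test. A small Python sketch with hypothetical names shows the difference; it assumes case-sensitive matching, which is how LIKE normally behaves in Athena.

```python
# Hypothetical values for the First column.
names = ["Joe", "John", "Mary", "Banjo", "BillyJoel"]

# LIKE 'Jo%'  : the value starts with "Jo"
starts_with_jo = [n for n in names if n.startswith("Jo")]

# LIKE '%Jo%' : "Jo" appears anywhere (case-sensitive, so "Banjo" does not match)
contains_jo = [n for n in names if "Jo" in n]

print(starts_with_jo)  # ['Joe', 'John']
print(contains_jo)     # ['Joe', 'John', 'BillyJoel']
```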

SSIS Transform -- Split one column into multiple columns

I'm trying to find out how to split a column I have in a table into three columns when the result is exported to a CSV file.
For example, I have a field called fullpatientname. It is listed in the following text format:
Smith, John C
The expectation is to have it in three separate columns:
Smith
John
C
I'm quite sure I have to split this in a derived column, but I'm not sure how to proceed with that
You are going to need to use a derived column for this process.
The SUBSTRING and FINDSTRING functions will be key to pull this off.
To get the first segment you would use something like this:
(DT_STR,25,1252) SUBSTRING([fullpatientname], 1, FINDSTRING(",",[fullpatientname],1)-1)
The above should display a substring starting with the beginning of the [fullpatientname] to the position prior to the comma (,).
The next segment would be from the position after the comma to the final space separator, and the final would be everything from the position following the final space separator to the end.
It sounds like your business rule is
The "last name" is all of the characters up to the first comma
The "first name" will be all of the characters after the first comma and a space
The "middle name" will be what (and is it always present)?
the last character in the string (you will only ever have an initial letter)
All of the characters after the second space
This logic will fail in lots of fun ways so be prepared for it. And also remember that once you combine information together, you cannot, with 100% accuracy, restore it to the component parts. Capture first, middle, last/surname and store them separately.
Approach A
A derived column component. Actually, a few of them added to your data flow will cover this. The first Derived Column will be tasked with finding the positions of the name breaks. This could all be done in a single component, but debugging becomes a challenge and you would need to reference the same expression multiple times in a single row. Multiply that by three output columns and it quickly becomes a maintenance nightmare.
The second Derived Column will then use the positions defined in the first to call the LEFT and SUBSTRING functions to access points in the column
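The two-step idea (find the positions first, then slice) can be sketched in Python as follows. It mirrors the logic only, not SSIS expression syntax, and it assumes the LastName-comma-space-FirstName-space-Initial format from the question.

```python
full_name = "Smith, John C"

# Step 1: find the break positions (what the first Derived Column would compute).
comma_pos = full_name.find(",")
last_space_pos = full_name.rfind(" ")

# Step 2: slice using those positions (what the second Derived Column would do).
last_name = full_name[:comma_pos]
first_name = full_name[comma_pos + 2:last_space_pos]
middle_initial = full_name[last_space_pos + 1:]

print(last_name, first_name, middle_initial)  # Smith John C
```

With a value such as "Smith, John" (no middle initial) the last space is the one right after the comma, so the first name comes out empty and "John" lands in the middle position. That is one of the failure modes the business rules need to cover.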
Approach B
I never reach for a script component first and the same should hold true for you. However, this is a mighty fine case for a script. The base .NET string library has a Split function that will break a string into pieces based on whatever delimiter you supply. The default is whitespace. The first call to split will use the ',' as the argument. The zeroeth ordinal string will be the last name. The first ordinal string will contain the first and middle name pieces. Call the string.Split method again, this time using the default value and the last element is the middle name and the remaining elements are called the first name. Or vice versa, the zeroeth element is the first name and everything else is last.
I've had to deal with cleaning names before and so I've seen different rules based on how they want to standardize the name.
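A split-based version of Approach B might look like the Python sketch below. The function name split_name and the rule chosen (last piece is the middle name, everything before it is the first name) are assumptions for illustration; a .NET script component would use string.Split in the same two steps.

```python
def split_name(full_name):
    # First split on the comma: everything before it is the last name.
    last_name, _, rest = full_name.partition(",")
    # Second split on whitespace; split() with no argument ignores the leading space.
    pieces = rest.split()
    if len(pieces) > 1:
        # Rule used here: last piece is the middle name, the others form the first name.
        first_name, middle_name = " ".join(pieces[:-1]), pieces[-1]
    else:
        # Only one piece (or none): treat it as the first name, no middle name.
        first_name, middle_name = " ".join(pieces), ""
    return last_name.strip(), first_name, middle_name

print(split_name("Smith, John C"))  # ('Smith', 'John', 'C')
print(split_name("Smith, John"))    # ('Smith', 'John', '')
```

Handling the missing-middle-name case explicitly is the main advantage over purely positional slicing.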
Try something like this, if your names are always in the same format (LastName-comma-space-FirstName-space-MI):
declare @FullName varchar(25) = 'Smith, John C'
select
substring(@FullName, 1, charindex(',', @FullName)-1 ) as LastName,
substring(@FullName, charindex(',',@FullName) + 2, charindex(' ',@FullName,charindex(',',@FullName)+2) - (charindex(',',@FullName) + 2) ) as FirstName,
substring(@FullName, len(@FullName), 1) as MiddleInitial
I am using SQL SERVER 2016 with SSIS in Visual Studio 2015. If you are using findstring you need to make sure the order is correct. I tried this first -
FINDSTRING(",",[fullpatientname],1), but it wouldn't work. I had to look up the documentation and found the order to be incorrect. FINDSTRING([fullpatientname],",",1) fixed the problem for me. I am not sure if this is due to differences in versions.