sql: regular expression for certain pattern separation - sql

In table "example" I have a column "col1" with following strings
some example text here x2.0.3-a abc
some other example text 1.5 abc
another example text 0.1.4 mnp
some other example text abc
another example text mnp
Now I need following things
Add the part before . to another column "col1"
Add the part . to another column "col2"
So the output should look like this
col1 col2
some example text here x2.0.3-a
some other example text 1.5
another example text 0.1.4
some other example text
another example text
Some of the properties of the string in col1 are
String in the col1 always end with either abc or mnp
Numbers like these x2.0.3-a or 0.1.4 are properties. These properties may not always exists in the col1 string. But if it exits then it always exists before the ending string abc or mnp.
there is always an space before properties and after the properties i.e another space between ending string abc/mnp and the properties.
So my question is how can I separate the properties and add them into col2?
One idea that comes into my head is that try to find something with *.* abc/mnp or *.*.* abc/mnp that is anything.anything. space abc/mnp OR anything.anything.anything space abc/mnp. I am not sure if I explained it properly.

As far as I understood, you'd like your column to be splitted into 3 columns. You should explain your 2nd column's scope and semantics better, so you can make sure to regular expression regularly matches with it.
I built a regex in parallel to the data you supplied, so it may not match for future incoming lines. Regex is here : https://regex101.com/r/seLgca/2/ What it does is, it captures three main groups:
(.+?)\s?([a-z]?\d(?:\.\d){1,2}(?:-[a-z])?)?\s(abc|mnp)
Let's break regex into parts:
(.+?)
\s?
([a-z]?\d(?:.\d){1,2}(?:-[a-z])?)?
\s
(abc|mnp)
Starting in reverse order, fifth part is simply matches abc or mnp. Forth part expects a space. Third part matches your second column if it exists, note that this part is for what you supplied so you can modify this part to suit your data better. Second part expects a space if it exists, this is for the lines contain empty second columns. First part is for the rest.
In Oracle, we have search and substring functions with regex as far as I know. So, you'll need a programming language to capture those groups.
I wrote a Java method for the purpose:
static List<String> getGroups(String content, String regex){
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
List<String> groupsMatched = new ArrayList<String>();
if(matcher.find()){
for(int i=0; i<matcher.groupCount(); i++)
groupsMatched.add(matcher.group(i));
return groupsMatched;
}else
return null;
}
So, if I call the method with the lines you supplied like this:
for(String content : listOfContent){
List<String> groupsMatched = getGroups(content, regex);
if(groupsMatched != null)
System.out.println(groupsMatched.get(1) + "\t" + groupsMatched.get(2) + "\t" + groupsMatched.get(3) );
}
Here is what I have:
some example text here x2.0.3-a abc
some other example text 1.5 abc
another example text 0.1.4 mnp
some other example text null abc
another example text null mnp
Hope this helps.
Cheers,

Related

PostgreSQL - find matching line in char/string column?

How can I find matching line in char/string type column?
For example let say I have column called text and some row has content of:
12345\nabcdf\nXKJKJ
(where \n are real new lines)
Now I want to find related row if any of lines match. For example, I have value 12345,
then it should find match. But if I have value 123, It would not.
I tried using like but it finds in both cases, when I have matching value (like 12345) and partially matching value (like 123).
For example something like this, but to have boundary for checking whole line:
SELECT id
FROM my_table
WHERE text like [SOME_VALUE]
Update
Maybe its not yet clear what Im asking. But basically I want something equivalent what you can do with regular expression,
like this: https://regexr.com/5akj1
Here regular expression /^123$/m would not match my string, it would only match if it would have been with pattern /^12345$/m (when I use pattern, value is dynamic, so pattern would change depending what value I got).
You may use regexp_replace and then check that the replaced string is not equal to the original column value:
select count(*)
from dummy
where regexp_replace(mytext, '(?m)^1234$', '') <> mytext;
You have a demo here.
Bear in mind that I have used the (?m) modifier, which makes ^ and $ match begin and end of line instead of begin and end of string.
You should be able to use ~ for matching:
where mytext ~ '(\n|^)1234(\n|$)'

Search string with regular expression SQL Server

The regex I want to use is: ^(?=.*[,])(,?)ABC(,?)$
What I want to get out is:
^ // start
(?=.*[,]) // contains at least one comma (,)
(,?)ABC(,?) // The comma is either in the beginning or in the end of the string "ABC"
$ // end
Of course ABC is ought to be a variable based on my search term.
So if ABC = 'abc' then ",abc", "abc,", ",abc," will match but not "abc" or "abcd"
Better way to do this is also welcome.
The value in the record looks like "abc,def,ghi,ab,cde..." and I need to find out if it contains my element (i.e. 'abc'). I cannot change the data structure. We can assume that in no case the record will contain only one sub-value, so it is correct to assume that there always is a comma in the value.
If you want to know if a comma delimited string contains abc, then I think like is the easiest method in any database:
where ',' + col + ',' like '%,abc,%'

regexp after a word appear

Im using regexp to find the text after a word appear.
Fiddle demo
The problem is some address use different abreviations for big house: Some have space some have dot
Quinta
QTA
Qta.
I want all the text after any of those appear. Ignoring Case.
I try this one but not sure how include multiple start
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;
Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the begging of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces.
if it doesn't find it, then it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".

SQL - need help in parsing text of a field

I have a select query and it fetches a field with complex data. I need to parse that data in specified format. please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it plain language. Based on what we know (and as others have said, more info is really needed) from your post, some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
SQL> with tbl(str) as (
2 select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
3 )
4 select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^ Match the beginning of the line
. followed by any character
* followed by 0 or more 'any characters' (refers to the previous character class)
= followed by an equal sign
( start remembered group 1
[^ ]+ which is a set of one or more characters that are not a space
) end remembered group one
.*= followed by any number of any characters but ending in an equal sign
([^ ]+) followed by the second remembered group of non-space characters
$ followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 the second remember group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
SQL> with tbl(str) as (
2 select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
3 )
4 select regexp_substr(str, '=([^ |]*)( |||$)', 1, level, null, 1) output, level
5 from tbl
6 connect by level <= regexp_count(str, '=')
7 ORDER BY level;
OUTPUT LEVEL
-------------------- ----------
zz 1
PB 2
I 3
4
test2 5
SQL>

Postgresql query to update fields using a regular expression

I have the following data in my "Street_Address_1" column:
123 Main Street
Using Postgresql, how would I write a query to update the "Street_Name" column in my Address table? In other words, "Street_Name" is blank and I'd like to populate it with the street name value contained in the "Street_Address_1" column.
From what I can tell, I would want to use the "regexp_matches" string method. Unfortunately, I haven't had much luck.
NOTE: You can assume that all addresses are in a "StreetNumber StreetName StreetType" format.
If you just want to take Street_Address_1 and strip out any leading numbers, you can do this:
UPDATE table
SET street_name = regexp_replace(street_address_1, '^[0-9]* ','','');
This takes the value in street_address_1 and replaces any leading string of numbers (plus a single space) with an empty string (the fourth parameter is for optional regex flags like "g" (global) and "i" (case-insensitive)).
This version allows things like "1212 15th Street" to work properly.
Something like...:
UPDATE table
SET Street_Name = substring(Street_Address_1 FROM '^[0-9]+ ([a-zAZ]+) ')
See relevant section from PGSQL 8.3.7 docs, the substring form is detailed shortly after the start of the section.