PostgreSQL String search for partial patterns removing exrtaneous characters

PostgreSQL String search for partial patterns removing exrtaneous characters - sql

Looking for a simple SQL (PostgreSQL) regular expression or similar solution (maybe soundex) that will allow a flexible search. So that dashes, spaces and such are omitted during the search. As part of the search and only the raw characters are searched in the table.:
Currently using:
SELECT * FROM Productions WHERE part_no ~* '%search_term%'
If user types UTR-1 it fails to bring up UTR1 or UTR 1 stored in the database.
But the matches do not happen when a part_no has a dash and the user omits this character (or vice versa)
EXAMPLE search for part UTR-1 should find all matches below.
UTR1
UTR --1
UTR 1
any suggestions...

You may well find the offical, built-in (from 8.3 at least) fulltext search capabilities in postrgesql worth looking at:
http://www.postgresql.org/docs/8.3/static/textsearch.html
For example:
It is possible for the parser to produce overlapping tokens from the
same of text.
As an example, a hyphenated word will be reported both as the entire word
and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
alias | description | token
-----------------+------------------------------------------+---------------
numhword | Hyphenated word, letters and digits | foo-bar-beta1
hword_asciipart | Hyphenated word part, all ASCII | foo
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | bar
blank | Space symbols | -
hword_numpart | Hyphenated word part, letters and digits | beta1

SELECT *
FROM Productions
WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '')
= REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '')
Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '') for this to work fast.

Related

Imapala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore

I am new to regex and need to search a string field in Impala for multiple matches to this exact sequence of characters: ~FC* followed by 11 more * that could have letters/digits between (but could not, they are basically delimiters in this string field). After the 12th * (if you count #1 in ~FC*) it should be immediately followed by Y~.
since the asterisks are not letters or digits, I am unsure on how to search for these delimiters properly.
This is my SQL so far:
select
regexp_extract(col_name, '(~FC\\*).*(\\*Y~)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
data returned:
pattern_found
--------------
~FC*
(~FC\\*) in Impala SQL it returns ~FC* which is great (got it from my other question)
Been trying this (~FC\\*).*(\\*Y~) which obviously isnt counting the number of asterisks but its is also not picking the Y up.
This is a test string, it has 2 occurrences:
N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
results should be these 2, which has an overlapping ~ between them. but will settle for at least the first being found if both cannot.
~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~

figured out a solution but happy to learn of a better way to accomplish this
This is what worked in Impala SQL, needed parentheses and double escape backslashes for allllll the asterisks:
(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)
Full SQL:
select
regexp_extract(col_name, '(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
and here is the RegexDemo without the additional syntax needed for Impala SQL

Only return fields that contain numbers or special characters EXCEPT . Error

In Redshift I want to return fields that contain numbers or special characters EXCEPT . (anything other and a-z and A-Z)
The following gets me anything that contains a number but I need to extend this to any special character except full stop (.)
SELECT DISTINCT name
FROM table
WHERE name ~ '[0-9]'
I need something like:
SELECT DISTINCT name
FROM table
WHERE name ~ '[0-9]' OR name ~'[,#';:#~[]{}etcetc'
Sample Data:
name
john
joh1n1
j!ohn!
jo!h2n
joh.n
jo.&hn
j.3ohn
j.$9ohn
Expected Output:
name
joh1n1
j!ohn!
jo!h2n
jo.&hn
j.3ohn
j.$9ohn

You may use
WHERE name !~ '^[[:alpha:].]+$'
Here, all records that do not consist of only alphabetic or dot symbols will be returned. ^ matches the start of a string position, [[:alpha:].]+ matches one or more letters or dots and $ matches the end of string position.
If it is for PostgreSQL you may use
WHERE name SIMILAR TO '%[^[:alpha:].]%'
The SIMILAR TO operator accepts POSIX character classes and bracket expressions and wildcards, too, and requires a full string match. So, % allows any chars before any 1 char other than letter or dot ([^[:alpha:].]), and then there may also be any other chars till the end of the string.

You can do:
SELECT DISTINCT name FROM table WHERE name !~* '[a-z]'
This means: match on names that do not contain any alphanumeric character.
Operator !~* means:
Does not match regular expression, case insensitive
Edit based on the provided sample data and expected results.
If you want to match on names that contain at least one character other than an alphabetic character or a dot, then you can do:
select * from mytable where name ~* '[^a-z.]'
Demo on DB Fiddle:
with mytable(name) as (values
('john'),
('joh1n1'),
('j!ohn!'),
('jo!h2n'),
('joh.n'),
('jo.&hn'),
('j.3ohn'),
('j.$9ohn')
)
select * from mytable where name ~* '[^a-z.]'
| name |
| :------ |
| joh1n1 |
| j!ohn! |
| jo!h2n |
| jo.&hn |
| j.3ohn |
| j.$9ohn |

BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8})
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}"))

You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly, you only need to match whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})"))
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)"))
See the re2 regex online test.

Try to not use regexp as much as you can, its quite slow. Try substring and instr as example:
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
otherwise Wiktor Stribiżew have probably right answer.

Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')

How to regex_replace in 10th position from CLOB field

I have this code:
SELECT REGEXP_REPLACE(name,'^name\[([[:alpha:][:space:][:digit:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[[:alpha:][:space:][:punct:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:digit:][:punct:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:digit:][:alpha:][:space:]]*)\|\|\|([[:digit:][:alpha:]]*)\|\|\|([[:alpha:][:space:]]*)\|\|\|([[:alpha:]]*).*','[p1=\10]') as replaced
FROM Dual
Editor's note: the above is a single unreadable line. Here is the same regex with line breaks for readability:
SELECT REGEXP_REPLACE(name
,'^name\[([[:alpha:][:space:][:digit:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[[:alpha:][:space:][:punct:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:digit:][:punct:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[:digit:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:digit:][:alpha:][:space:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*).*'
,'[p1=\10]') as replaced
FROM Dual
I want to select tenth position out of it. I am able to select until nine positions but I am not able to make its tenth position on above logic. Any guess or help.
[p1=\9] if I use this expression I am able to select nine positions but I want tenth position string from the above expression.
[p1=\10] if my expression is like this it's selecting first position's value followed by 0.
Any help?

Here's a very basic example of a string that matches your regex:
name[a|||b|||c|||d|||0|||e|||f|||1|||g|||h|||i|||j|||k|||l|||m
So, you want to return 'h', the tenth field, but \10 returns a0.
If you're only interested in the tenth capturing group and none of the previous ones, then you can just remove the brackets on all capturing groups up to that one, and then use \1.
UPDATE: OP wants 2,3,4,8,9,10 and 12th fields, so just add brackets for those fields.
Field | Capture Group number
====================================
2 | \1
3 | \2
4 | \3
8 | \4
9 | \5
10 | \6
12 | \7
The code:
select REGEXP_REPLACE(name
,'^name\[[[:alpha:][:space:][:digit:]]*\|\|\|
([[:alpha:]]*)\|\|\|
([[[:alpha:][:space:][:punct:]]*)\|\|\|
([[:digit:][:alpha:]]*)\|\|\|
[[:digit:][:punct:]]*\|\|\|
[[:alpha:][:space:]]*\|\|\|
[[:alpha:]]*\|\|\|
([[:digit:]]*)\|\|\|
([[:alpha:][:space:]]*)\|\|\|
([[:alpha:]]*)\|\|\|
[[:digit:][:alpha:]]*\|\|\|
([[:digit:][:alpha:][:space:]]*)\|\|\|
[[:digit:][:alpha:]]*\|\|\|
[[:alpha:][:space:]]*\|\|\|
[[:alpha:]]*.*','[p1=\1]') as replaced
FROM Dual
(Linebreaks added to the regex for clarity)
I should add that it looks like the broader question you're asking is how to get the tenth field from a triple-pipe delimited string in Oracle, which may be achievable in other ways that don't involve lengthy regexes like this.

How to search for different character sets in postgresql?

I want to search a table in a postgres DB which contains both Arabic and English text. For example:
id | content
-----------------
1 | دجاج
2 | chicken
3 | دجاج chicken
The result would get me row 3.
I imagine this has to do with limiting characters using regex, but I cannot find a clean solution to select both. I tried:
SELECT regexp_matches(content, '^([x00-\xFF]+[a-zA-Z][x00-\xFF]+)*')
FROM mg.messages;
However, this only matches english and some non english characters within {}.

I know nothing about Arabic text or RTL languages in general, but this worked:
create table phrase (
id serial,
phrase text
);
insert into phrase (phrase) values ('apple pie');
insert into phrase (phrase) values ('فطيرة التفاح');
select *
from phrase
where phrase like ('apple%')
or phrase like ('فطيرة%');
http://sqlfiddle.com/#!15/75b29/2

If you want to find all articles that has at least one Unicode characters from the Arabic range (U+0600 -> U-06FF), you would have to use the following:
SELECT content FROM mg.messages WHERE content ~ E'[\u0600-\u06FF]';
Which would indeed return id 1 (Arabic only),
...you would have to adapt the pattern to match any Arabic character followed or preceded by another ASCII (english?) character.
If you want to search for any other character set (range), here is a list of all the Unicode Blocks (Hebrew, Greek, Cyrillic, Hieroglyphs, Ideographs, dingbats, etc.)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

PostgreSQL String search for partial patterns removing exrtaneous characters - sql

SELECT * FROM Productions WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '') = REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '') Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '') for this to work fast.

Related

Imapala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore

Only return fields that contain numbers or special characters EXCEPT . Error

BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

How to regex_replace in 10th position from CLOB field

How to search for different character sets in postgresql?

Categories

Resources