PostgreSQL Reverse LIKE - sql

I need to test if any part of a column value is in a given string, instead of whether the string is part of a column value.
For instance:
This way, I can find if any of the rows in my table contains the string 'bricks' in column:
SELECT column FROM table
WHERE column ILIKE '%bricks%';
But what I'm looking for, is to find out if any part of the sentence "The ships hung in the sky in much the same way that bricks don’t" is in any of the rows.
Something like:
SELECT column FROM table
WHERE 'The ships hung in the sky in much the same way that bricks don’t' ILIKE '%' || column || '%';
So the row from the first example, where the column contains 'bricks', will show up as result.
I've looked through some suggestions here and some other forums but none of them worked.

Your simple case can be solved with a simple query using the ANY construct and ~*:
SELECT *
FROM tbl
WHERE col ~* ANY (string_to_array('The ships hung in the sky ... bricks don’t', ' '));
~* is the case insensitive regular expression match operator. I use that instead of ILIKE so we can use original words in your string without the need to pad % for ILIKE. The result is the same - except for words containing special characters: %_\ for ILIKE and !$()*+.:<=>?[\]^{|}- for regular expression patterns. You may need to escape special characters either way to avoid surprises. Here is a function for regular expressions:
Escape function for regular expression or LIKE patterns
But I have nagging doubts that will be all you need. See my comment. I suspect you need Full Text Search with a matching dictionary for your natural language to provide useful word stemming ...
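To make the escaping point concrete, here is a rough Python sketch of the same idea. It is only an emulation of `col ~* ANY (...)`, not PostgreSQL code: the function name `matches_any_word` is made up for illustration, and `re.escape` stands in for the escape function mentioned above.

```python
import re

def matches_any_word(value: str, sentence: str) -> bool:
    """Emulate `value ~* ANY (string_to_array(sentence, ' '))` with escaped words."""
    # Drop empty strings: an empty pattern would match every value.
    words = [w for w in sentence.split(" ") if w]
    # re.escape neutralises regex metacharacters such as '.', '?' or '('.
    return any(re.search(re.escape(w), value, re.IGNORECASE) for w in words)

sentence = "The ships hung in the sky in much the same way that bricks don't"
print(matches_any_word("Red BRICKS", sentence))  # True: contains the word 'bricks'
print(matches_any_word("abc", "a.c"))            # False: the dot is matched literally
```

Without the escaping step, the word `a.c` would be treated as a pattern and would match `abc`.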
Related:
IN vs ANY operator in PostgreSQL
PostgreSQL LIKE query performance variations
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

This query:
SELECT
regexp_split_to_table(
'The ships hung in the sky in much the same way that bricks don’t',
'\s' );
gives the following result:
| regexp_split_to_table |
|-----------------------|
| The |
| ships |
| hung |
| in |
| the |
| sky |
| in |
| much |
| the |
| same |
| way |
| that |
| bricks |
| don’t |
Now just do a semijoin against the result of this query to get the desired rows:
SELECT * FROM table t
WHERE EXISTS (
SELECT * FROM (
SELECT
regexp_split_to_table(
'The ships hung in the sky in much the same way that bricks don’t',
'\s' ) x
) x
WHERE t.column LIKE '%'|| x.x || '%'
)
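If you want to try the semijoin without a PostgreSQL server, here is a small sketch using Python's built-in sqlite3 module. This is a stand-in, not the query above: SQLite has no regexp_split_to_table, so the words are split in Python and loaded into a helper table, and the names `tbl`, `col` and `words` plus the sample rows are made up for the test.

```python
import sqlite3

sentence = "The ships hung in the sky in much the same way that bricks don't"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?)",
    [("red bricks",), ("apple",), ("skyline",)],
)

# Stand-in for regexp_split_to_table: one row per word of the sentence.
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in sentence.split()])

# Semijoin: keep a row if at least one word occurs somewhere in its column.
rows = conn.execute(
    """
    SELECT t.col
    FROM tbl t
    WHERE EXISTS (
        SELECT 1 FROM words w
        WHERE t.col LIKE '%' || w.word || '%'
    )
    ORDER BY t.col
    """
).fetchall()
matched = [r[0] for r in rows]
print(matched)  # ['red bricks', 'skyline']
```

Note that 'skyline' comes back as well, because this is substring matching: it contains both 'sky' and 'in'. Also, LIKE in SQLite ignores case for ASCII letters by default, whereas in PostgreSQL you would need ILIKE for that.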

Related

Redshift - How to use column in one table as pattern in SIMILAR TO

I have a problem where I have two tables. One table contains urls and their information, and the other contains groups of urls that should be grouped by a pattern.
Urls table:
------------------------------------------------
| url | files |
| https://myurl1/test/one/es/main.html | 530 |
| https://myurl1/test/one/en/main.html | 530 |
| https://myurl1/test/one/ar/main.html | 530 |
------------------------------------------------
Urls patterns table:
---------------------------------------------
| group | url_pattern |
| group1 | https://myurl1/test/one/(es|en)/%|
| group2 | https://myurl1/test/one/(ar)/% |
---------------------------------------------
I have tried something like this bearing in mind that url_patterns will only have one row per group.
SELECT * FROM urls_table
WHERE url SIMILAR TO (SELECT MAX (url_pattern) FROM url_patterns WHERE group='group1')
LIMIT 10
The main problem here is that it seems that applying SIMILAR TO with a column argument is not working.
Could anyone give me some advice?
Thanks in advance.
You are running into the requirement that regexp patterns are compiled and that SIMILAR TO is a layer on regexp. So what you are trying to do won't work. I believe there are a number of other ways to do this.
I) Change to LIKE pattern matching: LIKE patterns aren't precompiled so can use dynamic patterns. The downside is that they are more limited but I think you can still do what you want. Just change your patterns to be set of pattern columns (if the number of patterns is limited) and test for all the patterns. Unneeded patterns can just be a value that can never match. Definitely a brute force hack.
II) Change to LIKE pattern matching w/ SQL to provide OR behavior: have multiple LIKE patterns in the url_pattern column separated by '|' (for example). Then use split_part to match each sub-pattern - a bit complex and possibly slow but works. Like this:
SELECT url
FROM urls_table
LEFT JOIN (SELECT split_part(url_pattern, '|', part_no::int) as pattern
FROM url_patterns
-- urls_table is only used here to generate the numbers 1..N
CROSS JOIN (SELECT row_number() over () as part_no FROM urls_table) n
WHERE "group" = 'group1'
) p
ON url LIKE p.pattern
WHERE p.pattern IS NOT NULL;
You will also need to change your pattern strings to use the simpler LIKE format and use '|' for multiple possibilities - Ex: Group1 pattern becomes 'https://myurl1/test/one/es/%|https://myurl1/test/one/en/%'
III) Use some front-end query modification to find the pattern for the group and apply it to query BEFORE it is sent to the compiler. This could be an external tool or a stored procedure on Redshift. Get the pattern in one query and use it to issue the second query.
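Option III can be sketched roughly as below. This is only an illustration under assumptions: it uses Python with sqlite3 as a stand-in for a Redshift client, the patterns are stored in the '|'-separated LIKE format from option II, and the helper name `urls_for_group` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE urls_table (url TEXT, files INTEGER);
INSERT INTO urls_table VALUES
  ('https://myurl1/test/one/es/main.html', 530),
  ('https://myurl1/test/one/en/main.html', 530),
  ('https://myurl1/test/one/ar/main.html', 530);
CREATE TABLE url_patterns ("group" TEXT, url_pattern TEXT);
INSERT INTO url_patterns VALUES
  ('group1', 'https://myurl1/test/one/es/%|https://myurl1/test/one/en/%'),
  ('group2', 'https://myurl1/test/one/ar/%');
""")

def urls_for_group(group):
    # Step 1: fetch the stored pattern string for the group.
    row = conn.execute(
        'SELECT url_pattern FROM url_patterns WHERE "group" = ?', (group,)
    ).fetchone()
    if row is None:
        return []
    patterns = row[0].split("|")
    # Step 2: build one LIKE test per sub-pattern; the pattern values are
    # bound as parameters, so they are constants when the query is compiled.
    where = " OR ".join("url LIKE ?" for _ in patterns)
    sql = f"SELECT url FROM urls_table WHERE {where} ORDER BY url"
    return [r[0] for r in conn.execute(sql, patterns)]

print(urls_for_group("group1"))
```

The point of the two steps is that the second query never sees a column as its pattern, only values supplied from outside.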
Do you want exists?
SELECT u.*
FROM urls_table u
WHERE EXISTS (SELECT 1
FROM url_patterns p
WHERE u.url SIMILAR TO p.url_pattern AND
p."group" = 'group1'
)
LIMIT 10;

Oracle Regular Expression using instead of INSTR function

I keep data in table rows like this:
t_course
+------+------------------------------------------+
| sid | courses |
+------+------------------------------------------+
| 1 | cs101.math102.ns202-2.phy104 |
+------+------------------------------------------+
| 2 | cs101.math201.ens202-1.phy104-10.chm105 |
+------+------------------------------------------+
| 3 | cs101.ns202-2.math201.ens202-1.phy104 |
+------+------------------------------------------+
Now I want to count the rows whose courses include both ns202 and ens202. It should return only the record with sid 3, but it also returns records that do not qualify (because of instr). I have tried many approaches, but none of them work. For example:
select count(*) from
t_course
where
instr(courses, 'ns202') > 0
and instr(courses, 'ens202') > 0;
The code above doesn't work properly because it finds ns202 inside ens202.
I tried regular expressions, and I also tried splitting the courses into rows, but that broke the logic and slowed the query down.
How can I do this with regular expressions instead of instr, using starts-with logic (for example ns202%)? That is, ns202 must appear at the start of the string or right after a dot.
You can use regexp_like with word boundaries to get rows which have both ns202 and ens202. Normally you would use \b for word boundaries. As Oracle doesn't support it, the alternative is to use (\s|\W) together with the start ^ and end $ anchors.
\s - space character, \W - non word character. Add more characters as needed, as word-boundaries based on your requirements.
select *
from t_course
where regexp_like(courses,'(^|\s|\W)ns202(\s|\W|$)')
and regexp_like(courses,'(^|\s|\W)ens202(\s|\W|$)')
You will have the same problem with ens202, by the way - what if there is also cens202 or tens202?
You can solve your problem with regular expressions. You can also solve it with the LIKE operator:
select <whatever>
from <table or tables>
where (courses like 'ns202%' or courses like '%.ns202%')
and (courses like 'ens202%' or courses like '%.ens202%')
You can test both approaches to see which works best for your data.
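As a quick sanity check of the LIKE version, here is a sketch that runs it against the three sample rows. It uses Python's sqlite3 module as a stand-in for Oracle (an assumption made only so the example is self-contained); the LIKE conditions themselves are unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_course (sid INTEGER, courses TEXT);
INSERT INTO t_course VALUES
  (1, 'cs101.math102.ns202-2.phy104'),
  (2, 'cs101.math201.ens202-1.phy104-10.chm105'),
  (3, 'cs101.ns202-2.math201.ens202-1.phy104');
""")

# A course code counts only at the start of the string or right after a dot,
# so 'ns202' is no longer found inside 'ens202'.
sids = [
    r[0]
    for r in conn.execute(
        """
        SELECT sid
        FROM t_course
        WHERE (courses LIKE 'ns202%' OR courses LIKE '%.ns202%')
          AND (courses LIKE 'ens202%' OR courses LIKE '%.ens202%')
        ORDER BY sid
        """
    )
]
print(sids)  # [3]
```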

Postgres matching against an array of regular expressions

My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
I could not figure it out by myself sadly :/.
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application stores its data in PostgreSQL xml columns, which means that, after evaluating (xpath('/data/clients/name/text()'))::text[], I in fact have:
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it the way it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use json" are not options.
I am stuck with this datamodel.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
select unnest(array['.*ob', 'ali.*']) as pattern
)
select
o.name, o.officeid
from
office o
where exists (
select null
from patterns p
where o.name ~ p.pattern
)
The semi-join helps protect you from cases where a name like "alicia nob" meets multiple search patterns and would otherwise come back once for every match.
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When casting a string array into text, strings containing special characters or consisting of keywords are enclosed in double quotes kind of like {jimmy,"walter, james"} being two entries. Also when matching with ~ it is matched against any part of the string, not the same as LIKE where it's matched against the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
id | xss | officeid
----+-------------------------+----------
2 | {anthony,walter} | 2
3 | {alicia,walter} | 3
4 | {"walter, james"} | 5
5 | {jimmy,"walter, james"} | 4
(4 rows)
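To spell out the semantics being asked for (any name in the array matches any pattern), here is a small Python sketch. It is not PostgreSQL code, just an emulation of the logic on the sample data from the question; `matching_offices` is a made-up helper name.

```python
import re

# Sample data from the question: officeId -> names extracted from the xml.
offices = {
    1: ["bob", "alice"],
    2: ["anthony", "walter"],
    3: ["alicia", "walter"],
}
patterns = [".*ob", "ali.*"]

def matching_offices(offices, patterns):
    compiled = [re.compile(p) for p in patterns]
    # Like `~`, re.search matches anywhere in the string (not anchored).
    return sorted(
        office_id
        for office_id, names in offices.items()
        if any(rx.search(name) for name in names for rx in compiled)
    )

print(matching_offices(offices, patterns))  # [1, 3]
```

Matching element by element like this avoids the quoting and brace characters that show up when the whole array is cast to text.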

Custom sorting (order by) in PostgreSQL, independent of locale

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to be first just like in C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?
If you want a custom sort, you must define some function that modifies your values in some way so that the natural ordering of the modified values fits your requirement.
For example, you can prepend some character or string if the value starts with uppercase:
CREATE OR REPLACE FUNCTION mysort(text) returns text IMMUTABLE as $$
SELECT CASE WHEN substring($1 from 1 for 1) =
upper( substring($1 from 1 for 1)) then 'AAAA' || $1 else $1 END
;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolproof (you might want to change 'AAAA' to something more apt) and hurts performance, of course.
If you want it efficient, you'll need to create another column that "naturally" sorts correctly (e.g. even in the C locale), and use that as a sorting criterion. For that, you should use the approach of the strxfrm C library function. As a straight-forward strxfrm table for your approach, replace each letter with two ASCII letters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and the regular ASCII sorting will sort it correctly.
If you don't want to create a separate column, you can try to use a function in the ORDER BY clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).
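The two-characters-per-letter idea can be sketched like this. It is a Python illustration of the transformation only, not a PostgreSQL function; the mapping table is an assumption covering the Polish letters, and keeping the case of the base letter is my addition so that capitals still sort first, as in the C locale.

```python
# Hypothetical mapping: each accented letter becomes its base letter plus a
# rank digit; plain letters get rank 0 (so 's' -> 's0', 'ś' -> 's1').
POLISH = {
    "ą": "a1", "ć": "c1", "ę": "e1", "ł": "l1", "ń": "n1",
    "ó": "o1", "ś": "s1", "ź": "z1", "ż": "z2",
}

def sort_key(name: str) -> str:
    parts = []
    for ch in name:
        lower = ch.lower()
        pair = POLISH.get(lower, lower + "0")
        if ch != lower:
            # Keep the capital so uppercase sorts before lowercase in ASCII.
            pair = pair[0].upper() + pair[1]
        parts.append(pair)
    return "".join(parts)

names = ["sępoleński", "świecki", "toruński", "Włocławek"]
print(sort_key("świecki"))            # s1w0i0e0c0k0i0
print(sorted(names, key=sort_key))    # Włocławek, sępoleński, świecki, toruński
```

In the database you would store the output of such a function in the extra column (or wrap it in a SQL function) and sort on that with plain byte ordering.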

Match a Query to a Regular Expression in SQL?

I'm trying to find a way to match a query to a regular expression in a database. As far as I can tell (although I'm no expert), while most DBMS like MySQL have a regex option for searching, you can only do something like:
Find all rows in Column 1 that match the regex in my query.
What I want to be able to do is the opposite, i.e.:
Find all rows in Column 1 such that the regex in Column 1 matches my query.
Simple example - say I had a database structured like so:
+----------+-----------+
| Column 1 | Column 2 |
+----------+-----------+
| [a-z]+ | whatever |
+----------+-----------+
| [\w]+ | whatever |
+----------+-----------+
| [0-9]+ | whatever |
+----------+-----------+
So if I queried "dog", I would want it to return the rows with [a-z]+ and [\w]+, and if I queried 123, it would return the row with [0-9]+.
If you know of a way to do this in SQL, a short SELECT example or a link with an example would be much appreciated.
For MySQL (and maybe other databases too):
SELECT * FROM table WHERE "dog" RLIKE(`Column 1`)
In PostgreSQL it would be:
SELECT * FROM table WHERE 'dog' ~ "Column 1";
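Here is a way to try the same "pattern in the column, value in the query" idea without a MySQL or PostgreSQL server. It is a sketch using Python's sqlite3 module, which is an assumption on my part: SQLite has no built-in regex engine, so a REGEXP function is registered from Python, and the table name `rules` is made up.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite rewrites `X REGEXP Y` as regexp(Y, X): pattern first, value second.
    return re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE rules (pattern TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO rules VALUES (?, ?)",
    [(r"[a-z]+", "whatever"), (r"[\w]+", "whatever"), (r"[0-9]+", "whatever")],
)

def matching_patterns(query):
    # The literal value is on the left, the stored regex column on the right.
    rows = conn.execute(
        "SELECT pattern FROM rules WHERE ? REGEXP pattern ORDER BY rowid",
        (query,),
    )
    return [r[0] for r in rows]

print(matching_patterns("dog"))  # ['[a-z]+', '[\\w]+']
print(matching_patterns("123"))  # ['[\\w]+', '[0-9]+']
```

Note that "123" also matches the [\w]+ row, since \w covers digits. The match is unanchored, so add ^ and $ to the stored patterns if the whole value has to match.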