String matching in PostgreSQL - sql

I need to implement regular expression matching (as I understand it) in PostgreSQL 8.4. It seems regular expression matching is only available in 9.0+.
My need is:
When I give an input 14.1 I need to get these results:
14.1.1
14.1.2
14.1.Z
...
But exclude:
14.1.1.1
14.1.1.K
14.1.Z.3.A
...
The pattern is not limited to a single character. There is always a possibility that a pattern like this will be presented: 14.1.1.2K, 14.1.Z.13.A2 etc., because the pattern is provided by the user. The application has no control over the pattern (it's not a version number).
Any idea how to implement this in Postgres 8.4?
After one more question my issue was solved:
Escaping a LIKE pattern or regexp string in Postgres 8.4 inside a stored procedure

Regular expression matching has been in Postgres practically forever, at least since version 7.1. Use these operators:
~ !~ ~* !~*
For an overview, see:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
The point in your case seems to be to disallow more dots:
SELECT *
FROM tbl
WHERE version LIKE '14.1.%' -- for performance
AND version ~ '^14\.1\.[^.]+$'; -- for correct result
The LIKE expression is redundant, but it is going to improve performance dramatically, even without index. You should have an index, of course.
The LIKE expression can use a basic text_pattern_ops index, while the regular expression cannot, at least in Postgres 8.4.
Or with COLLATE "C" since Postgres 9.1. See:
Is there a difference between text_pattern_ops and COLLATE "C"?
PostgreSQL LIKE query performance variations
[^.] in the regex pattern is a character class that excludes the dot (.). So more characters are allowed, just no more dots.
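To see what each part of that pattern does, here is a minimal Python sketch that builds the same kind of expression from a user-supplied prefix and runs it over the values from the question. The helper name direct_child_pattern is made up for illustration, and Python's re module only stands in for the Postgres regex engine; the two should agree for a pattern this simple, but that is an assumption, not something checked against 8.4.

```python
import re

def direct_child_pattern(prefix):
    """Regex for `prefix` followed by exactly one more dot-separated part.

    Hypothetical helper: re.escape neutralizes regex metacharacters
    (such as the dots) in the user-supplied prefix.
    """
    return re.compile("^" + re.escape(prefix) + r"\.[^.]+$")

values = ["14.1", "14.1.1", "14.1.2", "14.1.Z",
          "14.1.1.1", "14.1.1.K", "14.1.Z.3.A", "14.10.1"]

pattern = direct_child_pattern("14.1")
matches = [v for v in values if pattern.match(v)]
print(matches)
```

Escaping the prefix matters because the pattern comes from the user: an unescaped dot would let 14.1 also match values like 14x1.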
Performance
To squeeze out top performance for this particular query you could add a specialized index:
CREATE INDEX tbl_special_idx ON tbl
((length(version) - length(replace(version, '.', ''))), version text_pattern_ops);
And use a matching query, the same as above, just replace the last line with:
AND length(version) - length(replace(version, '.', '')) = 2
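The dot-counting expression is plain arithmetic, so it can be sketched outside the database. The Python below mirrors length(version) - length(replace(version, '.', '')) and combines it with a prefix test; the function names are hypothetical and the generalization to "one more dot than the prefix" is my assumption about how you would parameterize it.

```python
def dot_count(version):
    # Same arithmetic as length(version) - length(replace(version, '.', ''))
    return len(version) - len(version.replace(".", ""))

def is_direct_child(version, prefix):
    # Mirrors LIKE '<prefix>.%' plus the dot-count condition.
    # Like the SQL, this also accepts "<prefix>." with an empty last part.
    return (version.startswith(prefix + ".")
            and dot_count(version) == dot_count(prefix) + 1)

print(is_direct_child("14.1.Z", "14.1"))    # one extra dot
print(is_direct_child("14.1.1.K", "14.1"))  # two extra dots
```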

You can't do regex matching, but I believe you can use the LIKE operator, so:
SELECT * FROM table WHERE version LIKE '14.1._';
Will match any row with a version of '14.1.' followed by a single character. This should match your examples. Note that this will not match just '14.1', if you needed this as well. You could do this with an OR.
SELECT * FROM table WHERE version LIKE '14.1._' OR version = '14.1';
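A quick way to see what the _ wildcard does and does not cover is to run the same LIKE pattern against a few sample values. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption: _ means "exactly one character" in both); the table name tbl and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (version TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?)",
    [("14.1",), ("14.1.1",), ("14.1.Z",), ("14.1.2K",), ("14.1.1.1",)],
)

# '_' matches exactly one character, so multi-character parts are skipped.
rows = [r[0] for r in conn.execute(
    "SELECT version FROM tbl WHERE version LIKE '14.1._' ORDER BY version")]
print(rows)
```

Note that 14.1.2K is not returned, which matters here because the question says the last part is not limited to a single character.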

Regex matching should be possible with Postgresql-8.4 like this:
SELECT * FROM table WHERE version ~ '^14\.1\..$';

Related

String matching in Peewee (SQL)

I am trying to query in Peewee with results that should have a specific substring in them.
For instance, if I want only activities with "Physics" in the name:
schedule = Session.select().join(Activity).where(Activity.name % "%Physics%").join(Course).join(StuCouRel).join(Student).where(Student.id == current_user.id)
The above example doesn't give any errors, but doesn't work correctly.
In python, I would just do if "Physics" in Activity.name, so I'm looking for an equivalent which I can use in a query.
You could also use these query methods: .contains(substring), .startswith(prefix), .endswith(suffix).
For example, your where clause could be:
.where(Activity.name.contains("Physics"))
I believe this is case-insensitive and behaves the same as LIKE '%Physics%'.
Quick answer:
just use Activity.name.contains('Physics')
Depending on the database backend you're using you'll want to pick the right "wildcard". Postgresql and MySQL use "%", but for Sqlite if you're performing a LIKE query you will actually want to use "*" (although for ILIKE it is "%", confusing).
I'm going to guess you're using SQLite since the above query is failing, so to recap, with SQLite if you want case-sensitive partial-string matching: Activity.name % "*Physics*", and for case-insensitive: Activity.name ** "%Physics%".
http://www.sqlite.org/lang_expr.html#like
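The difference between the two wildcards can be checked directly in SQLite without Peewee. This is a small sketch using Python's built-in sqlite3 module with made-up rows; it compares SQLite's LIKE and GLOB operators, on the assumption that these are what the % and * wildcards above end up hitting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (name TEXT)")
conn.executemany(
    "INSERT INTO activity VALUES (?)",
    [("Physics 101",), ("Applied physics",), ("Chemistry",)],
)

def names(sql, pattern):
    # Run a one-parameter query and return the matching names, sorted.
    return sorted(r[0] for r in conn.execute(sql, (pattern,)))

# LIKE: '%' wildcard, case-insensitive for ASCII by default in SQLite.
like_hits = names("SELECT name FROM activity WHERE name LIKE ?", "%Physics%")
# GLOB: '*' wildcard, case-sensitive.
glob_hits = names("SELECT name FROM activity WHERE name GLOB ?", "*Physics*")
# '%' has no special meaning for GLOB, so this matches nothing here.
glob_percent = names("SELECT name FROM activity WHERE name GLOB ?", "%Physics%")

print(like_hits, glob_hits, glob_percent)
```

The last query finding nothing would explain a query that "doesn't give any errors, but doesn't work correctly".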

SELECT by string prefix using an index

You have a column foo, of some string type, with an index on that column. You want to SELECT from the table WHERE the foo column has the prefix 'pre'. Obviously, the index should be able to help here.
Here is the most obvious way to search by prefix:
SELECT * FROM tab WHERE foo LIKE 'pre%';
Unfortunately, this does not get optimized to use the index (in Oracle or Postgres, at least).
The following, however, does work:
SELECT * FROM tab WHERE 'pre' <= foo AND foo < 'prf';
But are there better ways to accomplish this, or are there ways of making the above more elegant? In particular:
I need a function from 'pre' to 'prf', but this has to work for any underlying collation. Also, it's more complicated than above, because if searching for e.g. 'prz' then the upper bound would have to be 'psa', and so on.
Can I abstract this into a stored function/procedure and still hit the index? So I could write something like ... WHERE prefix('pre', foo);?
Answers for all DBMSes appreciated.
The database is quite important here. It so happens that SQL Server does this optimization for like.
One way is to do something like this:
where foo >= 'pre' and foo <= 'pre~'
'~' has the largest 7-bit ASCII value of a printable character, so it is basically bigger than anything else. This however, may be a problem if you are using wide characters or a non-standard character set.
You cannot abstract this into a function, because use of a function generally precludes the use of indexes. If you are always looking at the first three characters, then in Oracle you can create an index on those three characters (something called a "function-based index").
How about
select * from tab where foo between 'pre' and 'prf' and foo != 'prf'
This enables the index in the same way. The RDBMS must be pretty dumb not to use an index for that.
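For the "function from 'pre' to 'prf'" part, here is a rough Python sketch of how such an upper bound could be computed and used for a range scan over sorted data. It assumes plain code point ordering (roughly what a "C" collation gives you), so it does not solve the arbitrary-collation case, and under that ordering the bound for 'prz' is 'pr{' rather than 'psa'. The function names are hypothetical, and a sorted list with bisect stands in for the index.

```python
import bisect

def prefix_upper_bound(prefix):
    """Smallest string greater than every string starting with `prefix`.

    Assumes code point ordering. Returns None when no bound exists
    (empty prefix, or every character is already the maximum code point).
    Surrogate code points are not treated specially in this sketch.
    """
    chars = list(prefix)
    while chars:
        last = ord(chars[-1])
        if last < 0x10FFFF:
            chars[-1] = chr(last + 1)
            return "".join(chars)
        chars.pop()  # cannot increment this character, carry to the left
    return None

def prefix_range(sorted_rows, prefix):
    # Half-open range scan: prefix <= value < upper bound.
    lo = bisect.bisect_left(sorted_rows, prefix)
    upper = prefix_upper_bound(prefix)
    hi = len(sorted_rows) if upper is None else bisect.bisect_left(sorted_rows, upper)
    return sorted_rows[lo:hi]

rows = sorted(["pra", "pre", "prefix", "pref", "prf", "prz", "przy", "ps"])
print(prefix_range(rows, "pre"))
print(prefix_range(rows, "prz"))
```

Using a half-open range (>= lower, < upper) avoids the extra foo != 'prf' condition.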

REGEXP_LIKE in SQLAlchemy

Does anyone know how I could use the equivalent of REGEXP_LIKE in SQLAlchemy? For example I'd like to be able to do something like:
sa.Session.query(sa.Table).filter(sa.Table.field.like(regex_to_match))
Thanks for your help!
It should (I have no access to Oracle) work like this:
sa.Session.query(sa.Table) \
.filter(sa.func.REGEXP_LIKE(sa.Table.c.column, '[[:digit:]]'))
When you need a database-specific function that is not supported by SQLAlchemy, you can use a literal filter. That way you can still use SQLAlchemy to build the rest of the query for you, i.e. take care of joins etc.
Here is an example of how to put together a literal filter with the PostgreSQL regex matching operator ~:
session.query(sa.Table).filter("%s ~ :regex_pattern" % sa.Table.c.column.name).params(regex_pattern='stack')
Or you can manually specify the table and column as part of the literal string to avoid ambiguous column names:
session.query(sa.Table).filter("table.column ~ :regex_pattern").params(regex_pattern='[123]')
This is not fully portable, but here is a Postgres solution, which uses a ~ operator. We can use arbitrary operators thus:
sa.Session.query(sa.Table).filter(sa.Table.field.op('~', is_comparison=True)(regex_to_match))
Or, assuming a default precedence of 0,
sa.Session.query(sa.Table).filter(sa.Table.field.op('~', 0, True)(regex_to_match))
This also works with ORM constructs:
sa.Session.query(SomeClass).filter(SomeClass.field.op('~', 0, True)(regex_to_match))

Postgresql prefix wildcard for full text

I am trying to run a fulltext query using Postgresql that can cater for partial matches using wildcards.
It seems easy enough to have a postfix wildcard after the search term, however I cannot figure out how to specify a prefix wildcard.
For example, I can perform a postfix search easily enough using something like..
SELECT "t1".*
FROM "t1"
WHERE (to_tsvector('simple', "t1"."city") @@ to_tsquery('simple', 'don:*') )
should return results matching "London"
However I can't seem to do a prefix search like...
SELECT "t1".*
FROM "t1"
WHERE (to_tsvector('simple', "t1"."city") @@ to_tsquery('simple', ':*don') )
Ideally I'd like to have a wildcard prefixed to the front and end of the search term, something like...
SELECT "t1".*
FROM "t1"
WHERE (to_tsvector('simple', "t1"."city") @@ to_tsquery('simple', ':*don:*') )
I can use a LIKE condition however I was hoping to benefit from the performance of the full text search features in Postgres.
Full text search is good for finding words, not substrings.
For substring searches you are better off using LIKE '%don%' with the pg_trgm extension, available from PostgreSQL 9.1, together with a GIN index (using gin (column_name gin_trgm_ops)) or a GiST index (using gist (column_name gist_trgm_ops)). But the index can be very big (even several times bigger than the table), and write performance will not be very good.
There's a very good example of using pg_trgm for substring search on the "select * from depesz" blog.
One wild and crazy way of doing it would be to create a tsvector index of all your documents, reversed. And reverse your queries for postfix search too.
This is essentially what Solr does with its ReversedWildcardFilterFactory
select
reverse('brown fox')::tsvector @@ (reverse('rown') || ':*')::tsquery --true
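The reversal trick works because a match at the end of a word becomes a match at the start of the reversed word, and prefix matches are what a sorted index is good at. A minimal Python sketch of the idea, with a sorted list standing in for the index and made-up city names (the function names are hypothetical):

```python
import bisect

def build_reversed_index(words):
    # Store each word reversed (lowercased) next to its original form, sorted.
    return sorted((w.lower()[::-1], w) for w in words)

def suffix_search(index, suffix):
    # A suffix match on the word is a prefix match on the reversed word.
    key = suffix.lower()[::-1]
    lo = bisect.bisect_left(index, (key,))
    hits = []
    for reversed_word, original in index[lo:]:
        if not reversed_word.startswith(key):
            break  # sorted order: no later entry can match
        hits.append(original)
    return hits

index = build_reversed_index(["London", "Boston", "Donetsk", "Swindon", "Paris"])
print(sorted(suffix_search(index, "don")))
```

This covers a wildcard at the front of the term only; a wildcard on both sides is still a substring search.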

How to create simple fuzzy search with PostgreSQL only?

I have a little problem with search functionality on my RoR based site. I have many Products with some codes. A code can be any string like "AB-123-lHdfj". Now I use the ILIKE operator to find products:
Product.where("code ILIKE ?", "%" + params[:search] + "%")
It works fine, but it can't find product with codes like "AB123-lHdfj", or "AB123lHdfj".
What should I do for this? Maybe Postgres has some string normalization function, or some other methods to help me?
Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.
Example:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.
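If it helps to see where that number comes from, here is a short Python sketch of the textbook dynamic-programming edit distance with unit costs for insert, delete and substitute. It is not the fuzzystrmatch source, just the standard algorithm, which I assume gives the same result for default costs.

```python
def levenshtein(source, target):
    """Edit distance with unit cost for insert, delete and substitute."""
    # previous[j] = distance between the processed part of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("GUMBO", "GAMBOL"))
```

For GUMBO and GAMBOL the two edits are the substitution U to A and the appended L.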
Try this query sample: (with your own object names and data of course)
SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10
This query says:
Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back all rows where the value of code is within 3 edits of 'AB123-lHdfj'...
Note: if you get an error like:
function levenshtein(character varying, unknown) does not exist
Install the fuzzystrmatch extension using:
test=# CREATE EXTENSION fuzzystrmatch;
Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.
If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:
WHERE code ILIKE '%AB-123-lHdfj%'
Use:
WHERE code ILIKE '%AB%123%lHdfj%'
Or, dynamically:
WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%'
% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:
WHERE code ~* 'AB.?123.?lHdfj'
.? ... 0 or 1 characters
Or:
WHERE code ~* 'AB\-?123\-?lHdfj'
\-? ... 0 or 1 dashes
You may want to escape special characters in LIKE or regexp patterns. See:
Escape function for regular expression or LIKE patterns
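If the search term comes from user input, the pattern has to be built in code anyway. The Python sketch below shows one way to generate both variants from a term: a LIKE pattern with each dash replaced by %, and a regex where each dash is optional and the remaining parts are escaped. The function names are hypothetical, Python's re stands in for the Postgres regex engine, and the LIKE helper does not escape % or _ in the term, which you would still need to handle.

```python
import re

def like_pattern(term):
    # '%AB%123%lHdfj%' for 'AB-123-lHdfj'; wildcards in `term` are NOT escaped here.
    return "%" + term.replace("-", "%") + "%"

def dash_optional_regex(term):
    # 'AB-?123-?lHdfj' for 'AB-123-lHdfj', with every part regex-escaped.
    parts = [re.escape(part) for part in term.split("-")]
    return re.compile("-?".join(parts), re.IGNORECASE)

regex = dash_optional_regex("AB-123-lHdfj")
codes = ["AB-123-lHdfj", "AB123-lHdfj", "ab123lhdfj", "AB--123-lHdfj", "AB_123_lHdfj"]
hits = [c for c in codes if regex.search(c)]
print(like_pattern("AB-123-lHdfj"), hits)
```

The regex version is stricter than the % version: it allows at most one dash between the parts and nothing else.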
If your actual problem is more complex and you need something faster then there are various options, depending on your requirements:
There is full text search, of course. But this may be overkill in your case.
A more likely candidate is trigram-matching with the additional module pg_trgm. See:
Using Levenshtein function on each element in a tsvector?
PostgreSQL LIKE query performance variations
Related blog post by Depesz
It can be combined with LIKE, ILIKE, ~, or ~* since PostgreSQL 9.1.
Also interesting in this context: the similarity() function or % operator of that module.
Last but not least you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj --> ab123lhdfj, save it in an additional column and search with terms transformed the same way.
Or use an index on the expression instead of the redundant column. (Involved functions must be IMMUTABLE.) Possibly combine that with pg_trgm from above.
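The normalization step could look something like the Python sketch below: lowercase the code and drop everything that is not an ASCII letter or digit. Which characters to strip is an assumption on my part (the example above only shows dashes and case), and normalize_code is a made-up name; the same transformation would have to be applied to both the stored value and the search term.

```python
import re

def normalize_code(code):
    # Lowercase, then remove anything that is not a-z or 0-9.
    return re.sub(r"[^0-9a-z]", "", code.lower())

stored = normalize_code("AB1-23-lHdfj")
print(stored)
# A search term written without dashes normalizes to the same string.
print(normalize_code("ab123LHDFJ") == stored)
```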
Overview of pattern-matching techniques:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
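For intuition on the trigram matching mentioned above, here is a rough Python approximation of how pg_trgm is documented to work: text is lowercased and split into alphanumeric words, each word is padded with two spaces in front and one behind, and similarity is the share of trigrams the two strings have in common. This is a sketch of the documented idea only; the real module may differ in details such as non-ASCII handling, and the function names simply mirror the SQL ones.

```python
import re

def trigrams(text):
    grams = set()
    # Assumption: only ASCII letters and digits count as word characters.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    # Shared trigrams divided by all distinct trigrams of both strings.
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("cat")))
print(similarity("AB-123-lHdfj", "AB123lHdfj"))
```

Codes that differ only in dashes still share a good part of their trigrams, which is why this kind of matching tolerates the variations in the question better than a plain ILIKE.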