How to get position of regexp match in string in PostgreSQL? - sql

I have a table of book titles. I want to select the books whose title matches a regexp and order the results by the position of the match within the title.
It's easy for single-word searches, e.g.:
TABLE book
id title
1 The Sun
2 The Dead Sun
3 Sun Kissed
I put .* between the words of the client's search term before sending the query to the DB, so the SQL here is written with the regexps already prepared.
SELECT book.id, book.title FROM book
WHERE book.title ~* '.*sun.*'
ORDER BY COALESCE(NULLIF(position('sun' in lower(book.title)), 0), 999999) ASC; -- position() is case-sensitive, hence lower()
RESULT
id title
3 Sun Kissed
1 The Sun
2 The Dead Sun
But if the search term has more than one word, I want to match titles that contain all the words in order, with anything between them, and sort by the position of the match as before. For that I need a function that returns the position of a regexp match, and I didn't find an appropriate one in the official PostgreSQL docs.
TABLE book
id title
4 Deep Space Endeavor
5 Star Trek: Deep Space Nine: The Never Ending Sacrifice
6 Deep Black: Space Espionage and National Security
SELECT book.id, book.title FROM book
WHERE book.title ~* '.*deep.*space.*'
ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);
DESIRED RESULT
id title
4 Deep Space Endeavor
6 Deep Black: Space Espionage and National Security
5 Star Trek: Deep Space Nine: The Never Ending Sacrifice
I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?

One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:
SELECT id, title
FROM book
WHERE title ILIKE '%deep%space%'
ORDER BY length(regexp_replace(title, 'deep.*space.*', '','i'));
Using ILIKE in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace() function ('i'), to make it case insensitive.
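The truncate-and-measure idea can be checked outside the database. The following is a minimal sketch in Python, not PostgreSQL code: Python's re module stands in for regexp_replace(), and the helper name match_offset is made up for illustration. It uses the three titles from the question.

```python
import re

# Sample rows from the question: id -> title
titles = {
    4: "Deep Space Endeavor",
    5: "Star Trek: Deep Space Nine: The Never Ending Sacrifice",
    6: "Deep Black: Space Espionage and National Security",
}

def match_offset(title, pattern=r"deep.*space.*"):
    """Length of what is left after cutting the title at the first match."""
    # count=1 mirrors regexp_replace() without the 'g' flag:
    # only the first match is removed.
    return len(re.sub(pattern, "", title, count=1, flags=re.IGNORECASE))

# Sort ids by how far into the title the match starts.
ranked = sorted(titles, key=lambda book_id: match_offset(titles[book_id]))
print(ranked)
```

Titles 4 and 6 both match at the very start, so they tie and keep their input order. A title that does not match at all is left untouched by the replacement, so its key is its full length; that is why the SQL version filters with WHERE first.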
Alternatives
As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST).
SELECT id, title
,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1
,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2
,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3
,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM book
ORDER BY title ILIKE '%deep%space%' DESC NULLS LAST
,length(regexp_replace(title, 'deep.*space.*', '','i'));
You can find documentation for all of the above in the manual here and here.
-> SQLfiddle demonstrating all.

Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:
strpos(input, (regexp_match(input, pattern, 'i'))[1]);
Or in this case:
SELECT id, title
FROM book
ORDER BY strpos(book.title, (regexp_match(book.title, 'deep.*space.*', 'i'))[1]); -- no leading .* here, otherwise the match always starts at position 1
However, there are a few caveats:
this is not very efficient, as it scans the input string twice.
this ignores lookaround (lookbehind, lookahead) constraints, since the literal text of the match can also occur earlier in the input, before the position where the pattern actually matched.
e.g. for the input 'aba' and the pattern '(?<=b)a', strpos will return 1 (for the 1st 'a') although the actual position is 3 (for the 2nd 'a').
BTW, you should probably use a non-greedy quantifier and narrow your character class as much as you can instead of .* to improve performance (e.g. 'deep[\w\s]*?space')
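The lookaround caveat is easy to reproduce. This is a small sketch using Python's re module as a stand-in for the PostgreSQL regex engine (an assumption made only for illustration); str.find plays the role of strpos.

```python
import re

text = "aba"
match = re.search(r"(?<=b)a", text)  # the 'a' that is preceded by 'b'

# Where the pattern really matched, converted to a 1-based position.
true_position = match.start() + 1

# What searching for the matched literal gives: the first 'a' in the string.
naive_position = text.find(match.group()) + 1

print(true_position, naive_position)
```

The two numbers differ because the literal 'a' carries no memory of the lookbehind that selected it.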

Related

PGSQL Query to Order by with a specific letter in every user name

I have to write a query that sorts (ORDER BY) my user names so that a specific letter within the name takes priority.
For example if I have users Lemon, Loger, Alan, Avon, Bland, Cavin, Clauge then my query should return these in following order:
Lemon
Loger
Alan
Bland
Clauge
Avon
Cavin
i.e. "L" letter should be priority in sorting
You can use the position function to extract the position of l, and sort according to that. There are, however, two caveats to keep in mind:
position is case-sensitive, so you'd have to explicitly deal with cases (e.g., by lowercasing the string to search through).
If the substring you're searching for (l, in this case) isn't in the string, position will return 0, so you'll have to deal with 0 explicitly lest names without Ls come first instead of last:
SELECT name
FROM mytable
ORDER BY CASE POSITION('l' IN LOWER(name))
WHEN 0 THEN NULL
ELSE POSITION('l' IN LOWER(name))
END ASC NULLS LAST,
name
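To see that this ordering produces the list the question asks for, here is a rough Python equivalent of the ORDER BY above. It is only a sketch of the sort key, not database code, and the function name letter_priority_key is made up.

```python
names = ["Lemon", "Loger", "Alan", "Avon", "Bland", "Cavin", "Clauge"]

def letter_priority_key(name, letter="l"):
    # 1-based position of the letter, 0 when absent, like POSITION() on LOWER(name).
    pos = name.lower().find(letter) + 1
    # (pos == 0) sorts names without the letter last, like NULLS LAST;
    # the name itself breaks ties, like the trailing ", name".
    return (pos == 0, pos, name)

ordered = sorted(names, key=letter_priority_key)
print(ordered)
```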

Index for comparing to beginning of every word in a column

So I have a table
id | name | gender
---+-----------------+-------
0 | Markus Meskanen | M
1 | Jack Jackson | M
2 | Jane Jackson | F
And I've created an index
CREATE INDEX people_name_idx ON people (LOWER(name));
And then I query with
SELECT * FROM people WHERE name LIKE LOWER('Jack%');
Where %(name)s is the user's input. However, it now matches only the beginning of the whole column, and I'd like it to match the beginning of any of the words. I'd prefer not to use '%Jack%', since that would also return unwanted matches from the middle of a word.
Is there a way to create an index so that each word gets a separate row?
Edit: If the name is something long like 'Michael Jackson's First Son Bob' it should match to the beginning of any of the words, i.e. Mich would match to Michael and Fir would match to First but ackson wouldn't match to anything since it's not from the beginning.
Edit 2: And we have 3 million rows so performance is an issue, thus I'm looking at indexes mostly.
Postgres has two index types to help with full text searches: GIN and GIST indexes (and I think GIN is the more commonly used one).
There is a brief overview of the indexes in the documentation. There is more extensive documentation for each index class, as well as plenty of blogs on the subject (here is one and here is another).
These can speed up the searches that you are trying to do.
The pg_trgm module does exactly what you want.
You need to create either:
CREATE INDEX people_name_idx ON people USING GIST (name gist_trgm_ops);
Or:
CREATE INDEX people_name_idx ON people USING GIN (name gin_trgm_ops);
See the difference here.
After that, these queries could use one of the indexes above:
SELECT * FROM people WHERE name ILIKE '%Jack%';
SELECT * FROM people WHERE name ~* '\mJack';
As @GordonLinoff answered, full text search is also capable of prefix matching. But FTS is not designed to do that efficiently; it is best at matching lexemes. Still, if you want the best performance, I advise you to try it as well and measure each option. In FTS, your query looks something like this:
SELECT * FROM people WHERE to_tsvector('english', name) @@ to_tsquery('english', 'Jack:*');
Note: if your query filter (Jack) comes from user input, all of the queries above need some protection. In the ILIKE one you need to escape the % and _ characters; in the regexp one you need to escape a lot more; and in the FTS one you'll need to parse the input with some parser and generate a valid tsquery yourself, because to_tsquery() raises an error if its parameter is not valid, and plainto_tsquery() does not support prefix matching.
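For the ILIKE case, the escaping can be done on the client before the value is bound as a query parameter. A minimal sketch, assuming the default backslash escape character of LIKE/ILIKE; the helper name escape_like is hypothetical:

```python
def escape_like(term, escape="\\"):
    """Escape LIKE/ILIKE wildcards so user input is matched literally."""
    # Escape the escape character first, otherwise the backslashes
    # added for % and _ would be doubled again.
    return (
        term.replace(escape, escape + escape)
            .replace("%", escape + "%")
            .replace("_", escape + "_")
    )

user_input = "50%_off"
pattern = "%" + escape_like(user_input) + "%"
print(pattern)
```

The resulting pattern should still be passed as a bound parameter rather than concatenated into the SQL text; escaping wildcards does not replace parameter binding.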
Note 2: the regexp variant with name ~* '\mJack' will work best with English names. If you want to use the whole range of Unicode (i.e. you want to use characters like æ), you'll need a slightly different pattern, something like:
SELECT * FROM people WHERE name ~* '(^|\s|,)Jack';
This will work with most of the names, plus this will work like a real prefix match with some old names too, like O'Brian.
You can use regular expressions to find text inside name:
create table ci(id int, name text);
insert into ci values
(1, 'John McEnroe Blackbird Petrus'),
(2, 'Michael Jackson and Blade');
select id, name
from ci
where name ~ 'Pe+'
;
Returns:
1 John McEnroe Blackbird Petrus
Or you can use something similar: where substring(name, <regex exp>) is not null
Check it here: http://rextester.com/LHA16094
If you know that the words are space-separated, you can do:
SELECT * FROM people WHERE LOWER(name) LIKE LOWER('Jack%') OR LOWER(name) LIKE LOWER('% Jack%');
For more control you can use RegEx with MySQL
see https://dev.mysql.com/doc/refman/5.7/en/regexp.html

Oracle 'Contains' / 'Group' function return incorrect value

I have this query:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, '%sul%', 1) > 0
It produces output below:
The question is:
Why does SCORE(1) produce 9? As I recall, the CONTAINS function returns the number of occurrences of search_string (in this case '%sul%').
I expect the output should be:
Sullivan 1
Sully 1
But when I try this syntax:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, 'sul', 1) >0;
It returns 0 rows selected.
And can someone please explain what the third parameter is for?
Thanks in advance :)
Your second query returns no rows because it looks for the word sul. CONTAINS does not do pattern matching unless you tell it to; it searches for the words you specify in the second parameter. To look for patterns you have to use wildcards, as you did in your first example.
Now, the third parameter of CONTAINS: it is a label, used only to tie a SCORE operator to its CONTAINS clause. You should supply it whenever you use SCORE in your SELECT list. Its importance is clearer when there are multiple SCORE operators.
Quoting directly from the documentation:
label
Specify a number to identify the score produced by the query.
Use this number to identify the CONTAINS clause which returns this
score.
Example
Single CONTAINS
When the SCORE operator is called (for example, in a SELECT clause),
the CONTAINS clause must reference the score label value as in the
following example:
SELECT SCORE(1), title from newsindex
WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
Multiple CONTAINS
Assume that a news database stores and indexes the title and body of
news articles separately. The following query returns all the
documents that include the words Oracle in their title and java in
their body. The articles are sorted by the scores for the first
CONTAINS (Oracle) and then by the scores for the second CONTAINS
(java).
SELECT title, body, SCORE(10), SCORE(20) FROM news WHERE CONTAINS
(news.title, 'Oracle', 10) > 0 OR CONTAINS (news.body, 'java', 20) > 0
ORDER BY SCORE(10), SCORE(20);
The Oracle Text Scoring Algorithm does not score by simply counting the number of occurrences. It uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
Think of a google search. If you search for the term Oracle you will not find (directly) any result that may help to explain your scoring value questioning, so we can consider this term a "noise" to your expectations. But if you search for the term Oracle Text Scoring Algorithm you will find your answer in the first google result.
As for your other questions, I think @Incognito has already answered them well.

search criteria difference between Like vs Contains() in oracle

I created a table with two columns and inserted two rows.
id name
1 narsi reddy
2 narei sia
One column is a NUMBER and the other is a CLOB, so I decided to create a text index on the CLOB column and query it with CONTAINS.
query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected both rows to come back, but only this one did. If I run the same pattern with LIKE, I get what I wanted.
query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
In the end I understood that LIKE searches the whole document or paragraph, whereas CONTAINS looks at individual words.
So how can I get the required output?
LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher - it recognises two wildcards (%) and (_) which match zero-or-more, or exactly-one, character respectively. In your case, %a%e% matches two records in your table - it looks for zero or more characters followed by a, followed by zero or more characters followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched" - no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.
I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
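The difference can be modelled in a few lines. The sketch below is a deliberately simplified model written in Python, not Oracle code: LIKE is treated as a pattern over the whole string, and the wildcard CONTAINS query is approximated as the same pattern tried against each whitespace-separated word. The real context index tokenises and scores text in more elaborate ways; the function names are made up.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern (% and _) into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(text, pattern):
    # LIKE must match the entire string.
    return like_to_regex(pattern).fullmatch(text) is not None

def any_word_like(text, pattern):
    # Simplified stand-in for CONTAINS with wildcards: match word by word.
    return any(like(word, pattern) for word in text.split())

for row in ("narsi reddy", "narei sia"):
    print(row, like(row, "%a%e%"), any_word_like(row, "%a%e%"))
```

In 'narsi reddy' the a and the e sit in different words, so the whole-string match succeeds while no single word matches.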
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.

Custom SQL sort by

Use:
The user searches for a partial postcode such as 'RG20' which should then be displayed in a specific order. The query uses the MATCH AGAINST method in boolean mode where an example of the postcode in the database would be 'RG20 7TT' so it is able to find it.
At the same time it also matches against a list of other postcodes which are in its radius (which is a separate query).
I can't seem to find a way to order by a partial match, e.g.:
ORDER BY FIELD(postcode, 'RG20', 'RG14', 'RG18','RG17','RG28','OX12','OX11')
DESC, city DESC
Because it's not specifically looking for RG20 7TT, I don't think it can make a partial match.
I have tried SUBSTR (postcode, -4) and looked into left and right, but I haven't had any success using 'by field' and could not find another route...
Sorry this is a bit long winded, but I'm in a bit of a bind.
A UK postcode splits into 2 parts, the last section always being 3 characters and within my database there is a space between the two if that helps at all.
Although there is a DESC after the postcodes, I do need them to display in THAT particular order (RG20, RG14 then RG18 etc..) I'm unsure if specifying descending will remove the ordering or not
Order By Case
When postcode Like 'RG20%' Then 1
When postcode Like 'RG14%' Then 2
When postcode Like 'RG18%' Then 3
When postcode Like 'RG17%' Then 4
When postcode Like 'RG28%' Then 5
When postcode Like 'OX12%' Then 6
When postcode Like 'OX11%' Then 7
Else 99
End Asc
, City Desc
You're on the right track, trimming the field down to its first four characters:
ORDER BY FIELD(LEFT(postcode, 4), 'RG20', 'RG14', ...),
-- or SUBSTRING(postcode FROM 1 FOR 4)
-- or SUBSTR(postcode, 1, 4)
Here you don't want DESC.
(If your result set contains postcodes whose prefixes do not appear in your FIELD() ordering list, you'll have a bit more work to do, since those records will otherwise appear before any explicitly ordered records you specify. Before 'RG20' in the example above.)
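The caveat about unlisted prefixes is the main thing to get right, so here is the intended ordering sketched in Python rather than SQL. It assumes, as the question states, that the two parts of the postcode are separated by a space. 'RG20 7TT' comes from the question; the other sample postcodes and the name postcode_rank are made up for illustration.

```python
PRIORITY = ["RG20", "RG14", "RG18", "RG17", "RG28", "OX12", "OX11"]
RANK = {prefix: i for i, prefix in enumerate(PRIORITY, start=1)}

def postcode_rank(postcode):
    # Take the part before the space instead of a fixed four characters,
    # so shorter first parts are not padded with the separator.
    outward = postcode.split(" ")[0]
    # Unlisted prefixes get a rank after every listed one, which is the
    # opposite of FIELD() returning 0 and sorting them first.
    return RANK.get(outward, len(PRIORITY) + 1)

rows = ["SW1A 1AA", "OX11 1AA", "RG14 2BB", "RG20 7TT"]  # hypothetical data
ordered = sorted(rows, key=postcode_rank)
print(ordered)
```

In SQL the same effect is what the ELSE 99 branch of the CASE expression achieves.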
If you want a completely custom sorting scheme, then I only see one way to do it...
Create a table to hold the values upon which to sort, and include a "sequence" or "sort_order" field. You can then join to this table and sort by the sequence field.
One note on the sequence field. It makes sense to create it as an int as... well, sequences are often ints :)
If there is any possibility of changing the sort order, you may want to consider making it alphanumeric. It is a lot easier to insert "5A" between "5" and "6" than it is to insert a number into a sequence of integers.
Another method I use is utilising the charindex function:
order by charindex(substr(postcode,1,4),"RG20RG14RG18...",1)
I think that's the syntax anyway, I'm just doing this in SAS at the moment so I've had to adapt from memory!
But essentially the sooner you hit your desired part of the string, the higher the rank.
If you're trying to rank on a large variety of postcodes then a case statement gets pretty hefty.