Index for comparing to beginning of every word in a column - sql

So I have a table
id | name | gender
---+-----------------+-------
0 | Markus Meskanen | M
1 | Jack Jackson | M
2 | Jane Jackson | F
And I've created an index
CREATE INDEX people_name_idx ON people (LOWER(name));
And then I query with
SELECT * FROM people WHERE name LIKE LOWER('Jack%');
Where %(name)s is the user's input. However, it now matches only to the beginning of the whole column, but I'd like it to match to the beginning of any of the words. I'd prefer not to use '%Jack%' since it would also result into invalid results from the middle of the word.
Is there a way to create an index so that each word gets a separate row?
Edit: If the name is something long like 'Michael Jackson's First Son Bob' it should match to the beginning of any of the words, i.e. Mich would match to Michael and Fir would match to First but ackson wouldn't match to anything since it's not from the beginning.
Edit 2: And we have 3 million rows so performance is an issue, thus I'm looking at indexes mostly.

Postgres has two index types to help with full text searches: GIN and GIST indexes (and I think GIN is the more commonly used one).
There is a brief overview of the indexes in the documentation. There is more extensive documentation for each index class, as well as plenty of blogs on the subject (here is one and here is another).
These can speed up the searches that you are trying to do.

The pg_trgm module does exactly what you want.
You need to create either:
CREATE INDEX people_name_idx ON people USING GIST (name gist_trgm_ops);
Or:
CREATE INDEX people_name_idx ON people USING GIN (name gin_trgm_ops);
See the difference here.
After that, these queries could use one of the indexes above:
SELECT * FROM people WHERE name ILIKE '%Jack%';
SELECT * FROM people WHERE name ~* '\mJack';
As #GordonLinoff answered, full text search is also capable of searching by prefix matches. But FTS is not designed to do that efficiently, it is best in matching lexemes. Though if you want to achieve the best performace, I advise you to give it a try too & measure each. In FTS, your query looks something like this:
SELECT * FROM people WHERE to_tsvector('english', name) ## to_tsquery('english', 'Jack:*');
Note: however if your query filter (Jack) comes from user input, both of these queries above needs some protection (i.e. in the ILIKE one you need to escape % and _ characters, in the regexp one you need to escape a lot more, and in the FTS one, well you'll need to parse the query with some parser & generate a valid FTS' tsquery query, because to_tsquery() will give you an error if its parameter is not valid. And in plainto_tsquery() you cannot use a prefix matching query).
Note 2: the regexp variant with name ~* '\mJack' will work best with english names. If you want to use the whole range of unicode (i.e. you want to use characters, like æ), you'll need a slightly different pattern. Something like:
SELECT * FROM people WHERE name ~* '(^|\s|,)Jack';
This will work with most of the names, plus this will work like a real prefix match with some old names too, like O'Brian.

You can use Regex expressions to find text inside name:
create table ci(id int, name text);
insert into ci values
(1, 'John McEnroe Blackbird Petrus'),
(2, 'Michael Jackson and Blade');
select id, name
from ci
where name ~ 'Pe+'
;
Returns:
1 John McEnroe Blackbird Petrus
Or can use something similar where substring(name, <regex exp>) is not null
Check it here: http://rextester.com/LHA16094

If you know that the words are space separated, You can do
SELECT * FROM people WHERE name LIKE LOWER('Jack%') or name LIKE LOWER(' Jack%') ;
For more control you can use RegEx with MySQl
see https://dev.mysql.com/doc/refman/5.7/en/regexp.html

Related

Parsing location from a search query in postgresql

I have a table of location data that is stored in json format with an attributes column that contains data as below:-
{
"name" : "Common name of a place or a postcode",
"other_name":"Any aliases",
"country": "country"
}
This is indexed as follows:-
CREATE INDEX location_jsonb_ts_vector
ON location
USING gin (jsonb_to_tsvector('simple'::regconfig, attributes,'["string","numeric"]'::jsonb));
I can search this for a location using the query:-
SELECT *
FROM location
WHERE jsonb_to_tsvector('simple'::regconfig, attributes, '["string", "numeric"]'::jsonb) ## plainto_tsquery('place name')
This works well if just using place names. But I want to search using more complex text strings such as:-
'coffee shops with wifi near charing cross'
'all restaurants within 10 miles of swindon centre'
'london nightlife'
I want to get the location found first and then strip it from the search text and go looking for the items in other tables using my location record to narrow down the scope.
This does not work with my current search mechanism as the intent and requirement pollute the text search vector and can cause odd results. I know this is a NLP problem and needs proper parsing of the search string, but this is for a small proof of concept and needs to work entirely in postgres via SQL or PL/PGSQL.
How can I modify my search to get better matches? I've tried splitting into keywords and looking for them individually, but they risk not bring back results unless combined. For example; "Kings Cross" will bring back "Kings".
I've come up with a cheap and cheerful solution:-
WITH tsv AS (
SELECT to_tsquery('english', 'football | matches | in | swindon') AS search_vector,
'football matches in swindon' AS search_text
)
SELECT * FROM
(
SELECT attributes,
position(lower(ATTRIBUTES->>'name1') IN lower(search_text)) AS name1_position
FROM location,tsv
WHERE jsonb_to_tsvector('simple'::regconfig, attributes, '["string", "numeric"]'::jsonb) ## search_vector
) loc
ORDER BY name1_position DESC

How to get position of regexp match in string in PostgreSQL?

I have a table with book titles and I want to select books that have title matching a regexp and to order results by the position of the regexp match in title.
It's easy for a single-word searches. E.g.
TABLE book
id title
1 The Sun
2 The Dead Sun
3 Sun Kissed
I'm going to put .* between words in client's search term before sending query to DB, so I'd write SQL with prepared regexps here.
SELECT book.id, book.title FROM book
WHERE book.title ~* '.*sun.*'
ORDER BY COALESCE(NULLIF(position('sun' in book.title), 0), 999999) ASC;
RESULT
id title
3 Sun Kissed
1 The Sun
2 The Dead Sun
But if search term has more than one word I want to match titles that have all words from search term with anything between them, and sort by the position like before, so I need a function that returns a position of regexp, I didn't find an appropriate one in official PostgreSQL docs.
TABLE books
id title
4 Deep Space Endeavor
5 Star Trek: Deep Space Nine: The Never Ending Sacrifice
6 Deep Black: Space Espionage and National Security
SELECT book.id, book.title FROM book
WHERE book.title ~* '.*deep.*space.*'
ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);
DESIRED RESULT
id title
4 Deep Space Endeavor
6 Deep Black: Space Espionage and National Security
5 Star Trek: Deep Space Nine: The Never Ending Sacrifice
I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?
One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:
SELECT id, title
FROM book
WHERE title ILIKE '%deep%space%'
ORDER BY length(regexp_replace(title, 'deep.*space.*', '','i'));
Using ILIKE in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace() function ('i'), to make it case insensitive.
Alternatives
As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST).
SELECT id, title
,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1
,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2
,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3
,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM book
ORDER BY title ILIKE '%deep%space%' DESC NULLS LAST
,length(regexp_replace(title, 'deep.*space.*', '','i'));
You can find documentation for all of the above in the manual here and here.
-> SQLfiddle demonstrating all.
Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:
strpos(input, (regexp_match(input, pattern, 'i'))[1]);
Or in this case:
SELECT id, title
FROM book
ORDER BY strpos(book.title, (regexp_match(book.title, '.*deep.*space.*', 'i'))[1]);
However, there are few caveats:
this is not very efficient as it will scan the input string twice.
this will ignore lookaround (lookbehind, lookahead) constraints, since the literal match can appear multiple times, before the pattern match.
e.g: for the input 'aba' and pattern '(?<=b)a', strpos will return 1 (for the 1st 'a') although the actual position should be 3 (for the 2nd 'a').
BTW, you should probably use a greedy quantifier and narrow your character class as much as you can instead of .* to increase performance (e.g 'deep [\w\s]*? space')

SQL :: How to find records with most common in search string (Tags field)

We have tbl_Articles:
id title tags
=================================
1 article1 science;
2 article2 art;
3 article3 sports;art;
I am looking for a query to return records from tbl_Articles which have most common words with a specific tags string (Ex: politics;art;):
EX: Select (something from tbl_articles) where Tags has common in "politics;art;"
Result:
tbl_Articles
id title tags
=================================
2 article2 art;
3 article3 sports;art;
Are you looking for this?
select a.*
from articles a
where ';'+tags+';' like '%;politics;%' and
';'+tags+';' like '%;art;%'
Notice that I use the separator at the beginning and end so you can have "art" and "smart" as tags.
You can accomplish this task using LIKE predicate BUT, Please note that it only works on Character Patterns.
I think you are looking for Full-Text Query. The Free Text not only returns the exact wording, but also the nearest meanings attached to it.
For more details check for the following types..
FREETEXT
FREETEXTTABLE
CONTAINS
CONTAINSTABLE
The Performance benefit of using Full-Text search can be best realized when querying against a large amount of unstructured text data. A LIKE query (for example, ‘%microsoft%’) against millions of rows of text data can take minutes to return; whereas a Full-Text query (for ‘microsoft’) can take only seconds or less against the same data, depending on the number of rows that are returned.

search criteria difference between Like vs Contains() in oracle

I created a table with two columns.I inserted two rows.
id name
1 narsi reddy
2 narei sia
one is simply number type and another one is CLOB type.So i decided to use indexing on that. I queried on that by using contains.
query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected 2 would come,but not. But if i give same with like it's given what i wanted.
query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
finally i understood that like is searching whole document or paragraph but contains is looking in words.
so how can i get required output?
LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher - it recognises two wildcards (%) and (_) which match zero-or-more, or exactly-one, character respectively. In your case, %a%e% matches two records in your table - it looks for zero or more characters followed by a, followed by zero or more characters followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched" - no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.
I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.

Custom SQL sort by

Use:
The user searches for a partial postcode such as 'RG20' which should then be displayed in a specific order. The query uses the MATCH AGAINST method in boolean mode where an example of the postcode in the database would be 'RG20 7TT' so it is able to find it.
At the same time it also matches against a list of other postcodes which are in it's radius (which is a separate query).
I can't seem to find a way to order by a partial match, e.g.:
ORDER BY FIELD(postcode, 'RG20', 'RG14', 'RG18','RG17','RG28','OX12','OX11')
DESC, city DESC
Because it's not specifically looking for RG20 7TT, I don't think it can make a partial match.
I have tried SUBSTR (postcode, -4) and looked into left and right, but I haven't had any success using 'by field' and could not find another route...
Sorry this is a bit long winded, but I'm in a bit of a bind.
A UK postcode splits into 2 parts, the last section always being 3 characters and within my database there is a space between the two if that helps at all.
Although there is a DESC after the postcodes, I do need them to display in THAT particular order (RG20, RG14 then RG18 etc..) I'm unsure if specifying descending will remove the ordering or not
Order By Case
When postcode Like 'RG20%' Then 1
When postcode Like 'RG14%' Then 2
When postcode Like 'RG18%' Then 3
When postcode Like 'RG17%' Then 4
When postcode Like 'RG28%' Then 5
When postcode Like 'OX12%' Then 6
When postcode Like 'OX11%' Then 7
Else 99
End Asc
, City Desc
You're on the right track, trimming the field down to its first four characters:
ORDER BY FIELD(LEFT(postcode, 4), 'RG20', 'RG14', ...),
-- or SUBSTRING(postcode FROM 1 FOR 4)
-- or SUBSTR(postcode, 1, 4)
Here you don't want DESC.
(If your result set contains postcodes whose prefixes do not appear in your FIELD() ordering list, you'll have a bit more work to do, since those records will otherwise appear before any explicitly ordered records you specify. Before 'RG20' in the example above.)
If you want a completely custom sorting scheme, then I only see one way to do it...
Create a table to hold the values upon which to sort, and include a "sequence" or "sort_order" field. You can then join to this table and sort by the sequence field.
One note on the sequence field. It makes sense to create it as an int as... well, sequences are often ints :)
If there is any possibility of changing the sort order, you may want to consider making it alpha numeric... It is a lot easier to insert "5A" between "5 and "6" than it is to insert a number into a sequence of integers.
Another method I use is utilising the charindex function:
order by charindex(substr(postcode,4,1),"RG20RG14RG18...",1)
I think that's the syntax anyway, I'm just doing this in SAS at the moment so I've had to adapt from memory!
But essentially the sooner you hit your desired part of the string, the higher the rank.
If you're trying to rank on a large variety of postcodes then a case statement gets pretty hefty.