PostgreSQL tsquery doesn't work with part of a string - sql

I'm using Postgres' tsquery function to search a field that may contain letters in multiple languages as well as numbers.
In every case the search seems to match only part of the phrase and then stops matching until the full phrase is entered.
For example:
Searching for the name '15339' returns the right row when the search term is '15339', but '153' returns nothing.
Searching for Al-Alamya, the term 'al-' works and returns the row, but adding letters after that, for example 'al-alam', returns nothing until I finish writing the full name ('Al-Alamya').
my query:
SELECT *
FROM (SELECT DISTINCT ON ("consumer_api_spot"."id")
             "consumer_api_spot"."id",
             "consumer_api_spot"."name"
      FROM "consumer_api_spot"
      INNER JOIN "consumer_api_account" ON ("consumer_api_spot"."account_id" = "consumer_api_account"."id")
      INNER JOIN "users_user" ON ("consumer_api_account"."id" = "users_user"."account_id")
      WHERE (
            users_user.id = 53 AND consumer_api_spot.active
            AND
            "consumer_api_spot"."vectorized_name" @@ tsquery('153')
      )
      GROUP BY "consumer_api_spot"."id"
     ) AS "Q"
LIMIT 50 OFFSET 0

If you check the documentation, you'll find more information about what you can specify as a tsquery. It supports grouping, combining terms with boolean operators, and also prefix matching, which is probably what you want. An example from the docs:
Also, lexemes in a tsquery can be labeled with * to specify prefix matching:
SELECT 'super:*'::tsquery;
This query will match any word in a tsvector that begins with “super”.
So in your query, you should change tsquery('153') to tsquery('153:*').
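To see the difference quickly, a minimal check (illustrative; the 'simple' configuration is used here only so the number isn't altered by stemming):
SELECT to_tsvector('simple', '15339') @@ to_tsquery('simple', '153');    -- false
SELECT to_tsvector('simple', '15339') @@ to_tsquery('simple', '153:*');  -- true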
Btw. I don't know exactly how you constructed your database schema, but you can add a tsvector index for a column using a GIN index. I will assume that you generate the "consumer_api_spot"."vectorized_name" column from a "consumer_api_spot"."name" column. If that's the case you can create a tsvector index for that column like this:
CREATE INDEX gin_name on consumer_api_spot using gin (to_tsvector('english',name))
And then you could change this query:
"consumer_api_spot"."vectorized_name" ## tsquery('153')
into this:
to_tsvector('english', "consumer_api_spot"."name") @@ to_tsquery('english', '153:*')
and get a potential speed benefit, because the query would utilize an index.
A note about the 'english': you cannot omit the language when creating the index, but it has no effect on queries in other languages or on queries with numbers. Be careful, though: the language must be the same when creating the index and when querying, or PostgreSQL won't be able to use the index.

Related

Efficient way to ignore whitespace in DB2?

I am running queries in a large IBM DB2 database table (let's call it T) and have found that the cells for column Identifier tend to be padded not just on the margins, but in between as well, as in: ' ID1 ID2 '. I do not have rights to update this DB, nor would I, given a number of factors. However, I want a way to ignore the whitespace AT LEAST on the left and right, even if I need to simply add a couple of spaces in between. The following queries work, but are slow, upwards of 20 seconds slow....
SELECT * FROM T WHERE Identifier LIKE '%ID1%ID2%';
SELECT * FROM T WHERE TRIM(Identifier) LIKE 'ID1%ID2';
SELECT * FROM T WHERE TRIM(Identifier) = 'ID1 ID2';
SELECT * FROM T WHERE LTRIM(RTRIM(Identifier)) = 'ID1 ID2';
SELECT * FROM T WHERE LTRIM(Identifier) LIKE 'ID1 ID2%';
SELECT * FROM T WHERE LTRIM(Identifier) LIKE 'ID1%ID2%';
SELECT * FROM T WHERE RTRIM(Identifier) LIKE '%ID1 ID2';
SELECT * FROM T WHERE RTRIM(Identifier) LIKE '%ID1%ID2';
Trying to query something like "Select * FROM T WHERE REPLACE(Identifier, ' ', '')..." of course just freezes up Access until I Ctrl+Break to end the operation. Is there a better, more efficient way to ignore the whitespace?
================================
UPDATE:
As @Paul Vernon describes below, "Trailing spaces are ignored in Db2 for comparison purposes, so you only need to consider the leading and embedded spaces."
This led me to generate combinations of spaces before 'ID1' and 'ID2' and select the records using an IN clause. The number of combinations means the query is slower than if I knew the exact match. This is how it looks in my Java code with JDBC (edited to make it more generic to the key issue):
private static final int MAX_LENGTH = 30;

public List<Parts> queryMyTable(String ID1, String ID2) {
    String query = "SELECT * FROM MYTABLE WHERE ID IN (:ids)";
    // bind the generated combinations to the :ids named parameter (a single-entry map)
    final Map<String, List<String>> parameters =
            Collections.singletonMap("ids", getIDCombinations(ID1, ID2));
    return namedJdbcTemplate.query(query, parameters, new PartsMapper());
}

public static List<String> getIDCombinations(String ID1, String ID2) {
    List<String> combinations = new ArrayList<>();
    final int literalLength = ID1.length() + ID2.length();
    final int maxWhitespace = MAX_LENGTH - literalLength;
    combinations.add(ID1 + ID2);
    for (int x = 1; x <= maxWhitespace; x++) {
        String xSpace = String.format("%1$" + x + "s", "");
        // x spaces embedded between the IDs, and x leading spaces before ID1
        String idZeroSpaceBeforeBase = String.format("%s%s%s", ID1, xSpace, ID2);
        String idZeroSpaceAfterBase = String.format("%s%s%s", xSpace, ID1, ID2);
        combinations.add(idZeroSpaceBeforeBase);
        combinations.add(idZeroSpaceAfterBase);
        for (int y = 1; (x + y) <= maxWhitespace; y++) {
            String ySpace = String.format("%1$" + y + "s", "");
            // x leading spaces plus y embedded spaces
            String id = String.format("%s%s%s%s", xSpace, ID1, ySpace, ID2);
            combinations.add(id);
        }
    }
    return combinations;
}
Trailing spaces are ignored in Db2 for comparison purposes, so you only need to consider the leading and embedded spaces.
Assuming there is an index on the Identifier, your only option (if you can't change the data, or add a functional index or index a generated column), is probably something like this
SELECT * FROM T
WHERE
   Identifier = 'ID1 ID2'
OR Identifier = ' ID1 ID2'
OR Identifier = '  ID1 ID2'
OR Identifier = 'ID1  ID2'
OR Identifier = ' ID1  ID2'
OR Identifier = '  ID1  ID2'
which the Db2 optimizer might implement as 6 index lookups, which would be faster than a full index or table scan.
You could also try this
SELECT * FROM T
WHERE
   Identifier LIKE 'ID1 %ID2'
OR Identifier LIKE ' ID1 %ID2'
OR Identifier LIKE '  ID1 %ID2'
which the Db2 optimizer might implement as 3 index range scans.
In both examples, add more lines to cover the maximum number of leading spaces in your data, if needed. In the first example, add more lines for the embedded spaces too, if needed.
An index on the expression REGEXP_REPLACE(TRIM(Identifier), '\s{2,}', ' ') combined with the following query should make Db2 use that index:
SELECT *
FROM T
WHERE REGEXP_REPLACE(TRIM(Identifier), '\s{2,}', ' ') = 'ID1 ID2'
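If your Db2 version supports expression-based indexes and REGEXP_REPLACE (11.1 and later), creating that index might look roughly like this (a sketch; the index name is made up):
CREATE INDEX ix_identifier_clean
    ON T (REGEXP_REPLACE(TRIM(Identifier), '\s{2,}', ' '));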
If you need to search excluding leading and trailing spaces, then no traditional index can help you, at least for the case you show. To make the query fast, the options I can see are:
Full Text Search
You can use a "full text search" solution. DB2 does include this functionality, but I don't remember if it's included by default in the license or is sold separately. In any case, it requires a bit of indexing or periodic re-indexing of the data to make sure the search is up to date. It's worth the effort if you really need it. You'll need to change your app, since the mechanics are different.
Index on extra, clean column
Another solution is to index the column without the leading or trailing spaces. You'll need to create an extra column; on a massive table this operation can take some time. The good news is that once it's created there's no further delay. For example:
alter table t add column trimmed_id varchar(100)
generated always as (trim(identifier));
Note: You may need to disable/enable integrity checks on the table before and after this clause. DB2 is picky about this; read the manual to make sure it works. The creation of this column will take some time.
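The surrounding statements might look roughly like this (a sketch; verify the SET INTEGRITY options against the manual for your version):
set integrity for t off;

alter table t add column trimmed_id varchar(100)
    generated always as (trim(identifier));

set integrity for t immediate checked force generated;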
Then, you need to index it:
create index ix1 on t (trimmed_id);
The creation of the index will also take some time, but it should be faster than the step above.
Now it's ready. You can query your table using the new column instead of the original one (which is still there), and this time you can forget about leading and trailing spaces. For example:
SELECT * FROM T WHERE trimmed_id LIKE 'ID1%ID2';
The only wildcard now shows up in the middle. This query will be much faster than reading the whole table. In fact, the longer the string ID1 is, the faster the query will be, since the selectivity will be better.
Now, if ID2 is longer than ID1 then you can reverse the index to make it fast.

Find duplicates in case-sensitive query in MS Access

I have a table containing Japanese text, in which I believe that there are some duplicate rows. I want to write a SELECT query that returns all duplicate rows. So I tried running the following query based on an answer from this site (I wasn't able to relocate the source):
SELECT [KeywordID], [Keyword]
FROM Keyword
WHERE [Keyword] IN (SELECT [Keyword]
FROM [Keyword] GROUP BY [Keyword] HAVING COUNT(*) > 1);
The problem is that Access' equality operator treats the two Japanese writing systems - hiragana and katakana - as the same thing, where they should be treated as distinct. Both writing systems have the same phonetic value, although the written characters used to represent the sound are different - e.g. あ (hiragana) and ア (katakana) both represent the sound 'a'.
When I run the above query, however, both of these characters will appear, as according to Access, they're the same character and therefore a duplicate. Essentially it's a case-insensitive search where I need a case-sensitive one.
I got around this issue when doing a simple SELECT to find a Keyword using StrComp to perform a binary comparison, because this method correctly treats hiragana and katakana as distinct. I don't know how I can adapt the query above to use StrComp, though, because it's not directly evaluating one string against another as in the linked question.
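For reference, the simple lookup that does work looks something like this (the keyword literal is just a placeholder):
SELECT [KeywordID], [Keyword]
FROM Keyword
WHERE StrComp([Keyword], "あ", 0) = 0;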
Basically what I'm asking is: how can I do a query that will return all duplicates in a table, case-sensitive?
You can use exists instead:
SELECT [KeywordID], [Keyword]
FROM Keyword AS k
WHERE EXISTS (SELECT 1
              FROM Keyword AS k2
              WHERE STRCOMP(k2.Keyword, k.Keyword, 0) = 0 AND
                    k.KeywordID <> k2.KeywordID
             );
Try with a self join:
SELECT k1.[KeywordID], k1.[Keyword], k2.[KeywordID], k2.[Keyword]
FROM Keyword AS k1 INNER JOIN Keyword AS k2
ON k1.[KeywordID] < k2.[KeywordID] AND STRCOMP(k1.[Keyword], k2.[Keyword], 0) = 0

SQL full text search behavior on numeric values

I have a table with about 200 million records. One of the columns is defined as varchar(100) and is included in a full-text index. Most of the values are numeric; only a few are not.
The problem is that it's not working well. For example, if a row contains the value '123456789' and I look for '567', that row is not returned. It will only return rows where the value is exactly '567'.
What am I doing wrong?
SQL Server 2012.
Thanks.
Full-text search doesn't support leading wildcards.
In my setup, these two queries return the same rows:
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'28400')
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'"2840*"')
This gives zero rows
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'"*840*"')
You'll have to use LIKE or some fancy trigram approach.
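The LIKE fallback is straightforward but can't use the full-text index, so expect a scan on a table this size (illustrative, reusing the log table from the examples above):
SELECT *
FROM [dbo].[somelogtable]
WHERE logmessage LIKE '%567%'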
The problem is probably that you are using the wrong tool. Full-text queries perform linguistic searches, and it sounds like you want a simple LIKE condition.
If you post DDL + DML + the desired result, you can get a solution tailored to your needs.
You can do this:
....your_query.... LIKE '%567%' ;
This will return all rows where 567 appears at the beginning, at the end, or anywhere in between.
99% you're missing a % before and after the string you search for in the LIKE clause.
e.g.:
SELECT * FROM t WHERE att LIKE '66'
is the same as using WHERE att = '66'
If you write:
SELECT * FROM t WHERE att LIKE '%66%'
it will return all rows containing two consecutive sixes anywhere in the value.

PostgreSQL ORDER BY issue - natural sort

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record into the table, I:
1. Select the last record with SELECT * FROM employees ORDER BY em_code DESC
2. Strip the alphabetic prefix from em_code using a regexp and store it in ec_alpha
3. Cast the remaining part to an integer ec_num
4. Increment ec_num by one
5. Pad with sufficient zeros and prefix with ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
The first step returns EM999 instead of EM1000, so it generates EM1000 again as the new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 10, it is possible to specify a collation which will sort columns containing numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
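Combined with the ordering fix above, that could look like this (a sketch; it assumes every em_code contains at least one digit, otherwise the cast fails):
SELECT * FROM employees
ORDER BY regexp_replace(em_code, E'\\D', '', 'g')::int DESC;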
This always comes up in questions and in my own development, and I finally got tired of tricky ways of doing it, so I broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics within strings (by zero-prepending them) so that you can create an index column for full-speed natural sorting. The readme explains the details.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
You can use just this line:
ORDER BY length(substring(em_code FROM '[0-9]+')), em_code
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought about another way of doing this that uses less DB storage than padding and is faster than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.
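For example, indexing and sorting the employees table from the earlier question might look like this (illustrative; the index name is made up):
CREATE INDEX employees_natsort_idx ON employees (natsort(em_code));

SELECT * FROM employees ORDER BY natsort(em_code);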

"ORDER BY ... USING" clause in PostgreSQL

The ORDER BY clause is described in the PostgreSQL documentation as:
ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...]
Can someone give me some examples how to use the USING operator? Is it possible to get an alternating order of the resultset?
A very simple example would be:
> SELECT * FROM tab ORDER BY col USING <
But this is boring, because this is nothing you can't get with the traditional ORDER BY col ASC.
Also the standard catalog doesn't mention anything exciting about strange comparison functions/operators. You can get a list of them:
> SELECT amoplefttype::regtype, amoprighttype::regtype, amopopr::regoper
FROM pg_am JOIN pg_amop ON pg_am.oid = pg_amop.amopmethod
WHERE amname = 'btree' AND amopstrategy IN (1,5);
You will notice that there are mostly < and > operators for primitive types like integer, date, etc., and some more for arrays, vectors and so on. None of these operators will help you get a custom ordering.
In most cases where custom ordering is required you can get away using something like ... ORDER BY somefunc(tablecolumn) ... where somefunc maps the values appropriately. Because that works with every database this is also the most common way. For simple things you can even write an expression instead of a custom function.
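For example, ordering the point table used below by distance from the origin needs nothing more than an expression (illustrative):
SELECT * FROM p ORDER BY (p <-> point(0,0));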
Switching gears up
ORDER BY ... USING makes sense in several cases:
The ordering is so uncommon, that the somefunc trick doesn't work.
You work with a non-primitive type (like point, circle or imaginary numbers) and you don't want to repeat yourself in your queries with strange calculations.
The dataset you want to sort is so large, that support by an index is desired or even required.
I will focus on the complex datatypes: often there is more than one way to sort them in a reasonable way. A good example is point: You can "order" them by the distance to (0,0), or by x first, then by y or just by y or anything else you want.
Of course, PostgreSQL has predefined operators for point:
> CREATE TABLE p ( p point );
> SELECT p <-> point(0,0) FROM p;
But none of them is declared usable for ORDER BY by default (see above):
> SELECT * FROM p ORDER BY p;
ERROR: could not identify an ordering operator for type point
TIP: Use an explicit ordering operator or modify the query.
Simple operators for point are the "below" and "above" operators <^ and >^. They simply compare the y part of the point. But:
> SELECT * FROM p ORDER BY p USING >^;
ERROR: operator > is not a valid ordering operator
TIP: Ordering operators must be "<" or ">" members of btree operator families.
ORDER BY USING requires an operator with defined semantics: Obviously it must be a binary operator, it must accept the same type as arguments and it must return boolean. I think it must also be transitive (if a < b and b < c then a < c). There may be more requirements. But all these requirements are also necessary for proper btree-index ordering. This explains the strange error messages containing the reference to btree.
ORDER BY USING also requires not just one operator to be defined but an operator class and an operator family. While one could implement sorting with only one operator, PostgreSQL tries to sort efficiently and minimize comparisons. Therefore, several operators are used even when you specify only one - the others must adhere to certain mathematical constraints - I've already mentioned transitivity, but there are more.
Switching Gears up
Let's define something suitable: An operator for points which compares only the y part.
The first step is to create a custom operator family which can be used by the btree index access method:
> CREATE OPERATOR FAMILY xyzfam USING btree; -- superuser access required!
CREATE OPERATOR FAMILY
Next we must provide a comparator function which returns -1, 0, +1 when comparing two points. This function WILL be called internally!
> CREATE FUNCTION xyz_v_cmp(p1 point, p2 point) RETURNS int
AS $$BEGIN RETURN btfloat8cmp(p1[1],p2[1]); END $$ LANGUAGE plpgsql;
CREATE FUNCTION
Next we define the operator class for the family. See the manual for an explanation of the numbers.
> CREATE OPERATOR CLASS xyz_ops FOR TYPE point USING btree FAMILY xyzfam AS
OPERATOR 1 <^ ,
OPERATOR 3 ?- ,
OPERATOR 5 >^ ,
FUNCTION 1 xyz_v_cmp(point, point) ;
CREATE OPERATOR CLASS
This step combines several operators and functions and also defines their relationship and meaning. For example OPERATOR 1 means: This is the operator for less-than tests.
Now the operators <^ and >^ can be used in ORDER BY USING:
> INSERT INTO p SELECT point(floor(random()*100), floor(random()*100)) FROM generate_series(1, 5);
INSERT 0 5
> SELECT * FROM p ORDER BY p USING >^;
p
---------
(17,8)
(74,57)
(59,65)
(0,87)
(58,91)
Voila - sorted by y.
To sum it up: ORDER BY ... USING is an interesting look under the hood of PostgreSQL. But nothing you will require anytime soon unless you work in very specific areas of database technology.
Another example can be found in the Postgres docs, with source code for the example here and here. This example also shows how to create the operators.
Samples:
CREATE TABLE test
(
id serial NOT NULL,
"number" integer,
CONSTRAINT test_pkey PRIMARY KEY (id)
)
insert into test("number") values (1),(2),(3),(0),(-1);
select * from test order by number USING >   -- gives 3 => 2 => 1 => 0 => -1
select * from test order by number USING <   -- gives -1 => 0 => 1 => 2 => 3
So it is equivalent to DESC and ASC. But you may use your own operator; that's the essential feature of USING.
Nice answers, but they didn't mention one really valuable case for USING.
When you have created an index with a non-default operator family, for example varchar_pattern_ops (~>~, ~<~, ~>=~, ...) instead of <, >, >=, and you search via that index and want the ORDER BY clause to use it as well, you need to specify USING with the appropriate operator.
This can be illustrated with an example:
CREATE INDEX index_words_word ON words(word text_pattern_ops);
Let's compare these two queries:
SELECT * FROM words WHERE word LIKE 'o%' LIMIT 10;
and
SELECT * FROM words WHERE word LIKE 'o%' ORDER BY word LIMIT 10;
The difference in execution time is nearly 100x on a 500K-word DB! And the results may not even be correct under a non-C locale.
How can this happen?
When you search with LIKE plus an ORDER BY clause, you actually make this call:
SELECT * FROM words WHERE word ~>=~ 'o' AND word ~<~'p' ORDER BY word USING < LIMIT 10;
Your index was created with the ~<~ operator in mind, so PostgreSQL cannot use it for that ORDER BY clause. To get things right, the query must be rewritten to this form:
SELECT * FROM words WHERE word ~>=~ 'o' AND word ~<~'p' ORDER BY word USING ~<~ LIMIT 10;
or
SELECT * FROM words WHERE word LIKE 'o%' ORDER BY word USING ~<~ LIMIT 10;
Optionally one can add the key word ASC (ascending) or DESC
(descending) after any expression in the ORDER BY clause. If not
specified, ASC is assumed by default. Alternatively, a specific
ordering operator name can be specified in the USING clause. An
ordering operator must be a less-than or greater-than member of some
B-tree operator family. ASC is usually equivalent to USING < and DESC
is usually equivalent to USING >.
PostgreSQL 9.0
It may look something like this, I think (I don't have Postgres at hand to verify right now, but will verify later):
SELECT Name FROM Person
ORDER BY NameId USING >