Real number comparison for trigram similarity - sql

I am implementing trigram similarity for word matching in column comum1. similarity() returns real. I have converted 0.01 to real and rounded to 2 decimal digits. Though there are rank values greater than 0.01, I get no results on screen. If I remove the WHERE condition, lots of results are available. Kindly guide me how to overcome this issue.
SELECT *,ROUND(similarity(comum1,"Search_word"),2) AS rank
FROM schema.table
WHERE rank >= round(0.01::real,2)
I have also converted both numbers to numeric and compared, but that also didn't work:
SELECT *,ROUND(similarity(comum1,"Search_word")::NUMERIC,2) AS rank
FROM schema.table
WHERE rank >= round(0.01::NUMERIC,2)
LIMIT 50;

The WHERE clause can only reference input column names, coming from the underlying table(s). rank in your example is the column alias for a result - an output column name.
So your statement is illegal and should return with an error message - unless you have another column named rank in schema.table, in which case you shot yourself in the foot. I would think twice before introducing such a naming collision, especially while not yet completely firm with SQL syntax.
And round() with a second parameter is not defined for real, you would need to cast to numeric like you tried. Another reason your first query is illegal.
Also, the double-quotes around "Search_word" are highly suspicious. If that's supposed to be a string literal, you need single quotes: 'Search_word'.
This should work:
SELECT *, round(similarity(comum1,'Search_word')::numeric,2) AS rank
FROM schema.table
WHERE similarity(comum1, 'Search_word') > 0.01;
But it's still pretty useless as it fails to make use of trigram indexes. Do this instead:
SET pg_trgm.similarity_threshold = 0.01; -- set once
SELECT *
FROM schema.table
WHERE comum1 % 'Search_word';
See:
Finding similar strings with PostgreSQL quickly
That said, a similarity of 0.01 is almost no similarity. Typically, you need a much higher threshold.
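To get a feel for how low 0.01 is, here is a rough Python sketch of a trigram similarity. It is only an approximation under stated assumptions, not the pg_trgm implementation: words are taken to be runs of letters and digits, each word is padded with two spaces in front and one behind, and the score is shared trigrams divided by all distinct trigrams. The function names are made up for illustration.

```python
import re


def trigrams(text):
    """Return the set of trigrams of text, roughly as pg_trgm builds them."""
    result = set()
    # Assumption: a word is a run of lowercase letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)
```

In this sketch "cat" and "car" share 2 of 6 distinct trigrams, so two strings only need one common trigram among a hundred to pass a threshold of 0.01.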

Related

Search Through All Between Values SQL

I have data following data structure..
_ID _BEGIN _END
7003 99210 99217
7003 10225 10324
7003 111111
I want to look through every _BEGIN and _END and return all rows where the input value is between the range of values including the values themselves (i.e. if 10324 is the input, row 2 would be returned)
I have tried this filter but it does not work..
where @theInput between a._BEGIN and a._END
--THIS WORKS
where convert(char(7),'10400') >= convert(char(7),a._BEGIN)
--BUT ADDING THIS BREAKS AND RETURNS NOTHING
AND convert(char(7),'10400') < convert(char(7),a._END)
The less-than (<) and greater-than (>) operators work on xCHAR data types without any syntax error, but the result may be semantically wrong. Look at these examples:
1 - SELECT 'ab' BETWEEN 'aa' AND 'ac' # returns TRUE
2 - SELECT '2' BETWEEN '1' AND '10' # returns FALSE
The character 2 stored in an xCHAR type sorts after any string that starts with 1, such as 10, because strings are compared character by character from the left.
So you should CAST the types here. [Example for MySQL; for standard compatibility, change UNSIGNED to INTEGER]
WHERE CAST(@theInput as UNSIGNED)
BETWEEN CAST(a._BEGIN as UNSIGNED) AND CAST(a._END as UNSIGNED)
You'd better change the column types to numeric to avoid ambiguity in later use.
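If you want to see the difference without touching your own database, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here (the question does not say which DB is used), and the table name ranges is made up; it holds the two complete rows from the question as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Text comparison: '2' sorts after '10', so the BETWEEN test fails.
text_result = conn.execute("SELECT '2' BETWEEN '1' AND '10'").fetchone()[0]

# Numeric comparison after an explicit cast.
numeric_result = conn.execute(
    "SELECT CAST('2' AS INTEGER) "
    "BETWEEN CAST('1' AS INTEGER) AND CAST('10' AS INTEGER)"
).fetchone()[0]

# Hypothetical table with the range bounds stored as text.
conn.execute("CREATE TABLE ranges (_ID INTEGER, _BEGIN TEXT, _END TEXT)")
conn.executemany(
    "INSERT INTO ranges VALUES (?, ?, ?)",
    [(7003, "99210", "99217"), (7003, "10225", "10324")],
)

# Cast both the input and the columns, so the range test is numeric and inclusive.
matches = conn.execute(
    "SELECT _BEGIN, _END FROM ranges "
    "WHERE CAST(? AS INTEGER) "
    "BETWEEN CAST(_BEGIN AS INTEGER) AND CAST(_END AS INTEGER)",
    ("10324",),
).fetchall()
```

Note that casting the columns like this trades index use for correctness, which is one more reason to store the bounds as numbers in the first place.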
This would be the obvious answer...
SELECT *
FROM <YOUR_TABLE_NAME> a
WHERE @theInput between a._BEGIN and a._END
If the data is stored as strings (an assumption, as we don't know which DB this is), you could add this:
Declare @searchArg VARCHAR(30) = CAST(@theInput as VARCHAR(30));
SELECT *
FROM <YOUR_TABLE_NAME> a
WHERE @searchArg between a._BEGIN and a._END
If you care about performance and you have a lot of data and indexes, you won't want function calls on the column values. You could in-line this conversion, but converting only the input keeps the column side free of functions, so your predicates stay sargable.
SELECT * FROM myTable a
WHERE
(CAST(@theInput AS char) >= a._BEGIN AND CAST(@theInput AS char) <= a._END);
I also saw several of the same type of questions:
SQL "between" not inclusive
MySQL "between" clause not inclusive?
When I do queries like this, I usually try one side with the greater/less than on either side and work from there. Maybe that can help. I'm very slow, but I do lots of trial and error.
Or, use Tony's convert.
I suppose you can convert them to anything appropriate for your program, numeric or text.
Also, see here, http://technet.microsoft.com/en-us/library/aa226054%28v=sql.80%29.aspx.
I am not convinced you cannot do your CAST in the SELECT.
Nick, here is a MySQL version from SO, MySQL "between" clause not inclusive?

Absolute maxvalue comparison of columns in Firebird SQL

I want to compare the specified columns in the database. The comparison logic should compare the numbers regardless of their signs, and the result should be the original value with its sign.
For example, below code works well but as can be seen in the select block it returns the absolute value of columns. Is there any trick, cheat in Firebird 2.1 to overcome that?
SELECT a.ELM_NUM,a.COMBO, maxvalue(abs(a.N_1),abs(a.N_2)) as maxN from ntm a order by a.ELM_NUM
You can use a CASE condition:
SELECT a.ELM_NUM,a.COMBO,
CASE WHEN abs(a.N_1) > abs(a.N_2) THEN a.N_1 ELSE a.N_2 END as maxN
from ntm a
order by a.ELM_NUM
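For anyone who wants to check the rule outside the database, the same logic can be sketched in a few lines of Python. The function name is made up for illustration; it mirrors the CASE expression, including its behaviour on ties.

```python
def signed_absmax(n1, n2):
    """Return whichever value has the larger magnitude, keeping its sign."""
    # Same rule as the CASE expression: n1 wins only when its
    # absolute value is strictly larger, otherwise n2 is returned.
    return n1 if abs(n1) > abs(n2) else n2
```

On a tie in magnitude (for example 4 and -4) this returns the second value, just like the ELSE branch does.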

Query speed and expressions with constant value

Is
SELECT * FROM Table WHERE ID=2*2
slower than
SELECT * FROM Table WHERE ID=4
I mean: is 2*2 calculated once or is it calculated when checking each record?
You may ask why write 2*2 and not just simply 4?
Well actually in my real case I have some numbers which represents characters building some text (somebody had a "great" idea of saving array of numeric data as string).
So my query would be like:
SELECT * FROM Table where value1=(CHAR(20)+CHAR(8)+CHAR(32))
I could calculate this string on my side, but there could be problems with encoding, passing characters like \n or \r in a query, etc.
Will this (CHAR(20)+CHAR(8)+CHAR(32)) slow my query if I use it instead of plain string text like '#!_'
The performance will be the same with a constant expression like 2*2, because it is evaluated once rather than per row. However, if you start comparing ID/2 = 2, it will slow down the query, because the expression has to be calculated for every row.
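One way to see the difference is to look at the query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, so it illustrates the principle and says nothing definitive about your DBMS; the table t and the helper plan are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value1 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, str(i)) for i in range(1, 101)]
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)


# Constant expression on the right-hand side: the key lookup is still possible.
constant_plan = plan("SELECT * FROM t WHERE id = 2*2")

# Expression on the column itself: every row has to be computed and checked.
per_row_plan = plan("SELECT * FROM t WHERE id / 2 = 2")

constant_rows = conn.execute("SELECT id FROM t WHERE id = 2*2").fetchall()
```

In SQLite the first plan is a key search and the second is a full scan. Your own DBMS has a similar plan viewer, and that is the place to confirm it for the CHAR(...)+CHAR(...) case.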

Determine MAX Decimal Scale Used on a Column

In MS SQL, I need an approach to determine the largest scale being used by the rows for a certain decimal column.
For example Col1 Decimal(19,8) has a scale of 8, but I need to know if all 8 are actually being used, or if only 5, 6, or 7 are being used.
Sample Data:
123.12345000
321.43210000
5255.12340000
5244.12345000
For the data above, I'd need the query to either return 5, or 123.12345000 or 5244.12345000.
I'm not concerned about performance, I'm sure a full table scan will be in order, I just need to run the query once.
Not pretty, but I think it should do the trick:
-- Find the first non-zero character in the reversed string...
-- And then subtract from the scale of the decimal + 1.
SELECT 9 - PATINDEX('%[1-9]%', REVERSE(Col1))
I like @Michael Fredrickson's answer better and am only posting this as an alternative for specific cases where the actual scale is unknown but is certain to be no more than 18:
SELECT LEN(CAST(CAST(REVERSE(Col1) AS float) AS bigint))
Please note that, although there are two explicit CAST calls here, the query actually performs two more implicit conversions:
As the argument of REVERSE, Col1 is converted to a string.
The bigint is cast as a string before being used as the argument of LEN.
SELECT
MAX(CHAR_LENGTH(
SUBSTRING(column_name::text FROM '\.(\d*?)0*$')
)) AS max_scale
FROM table_name;
*? is the non-greedy version of *, so \d*? catches all digits after the decimal point except trailing zeros.
The pattern contains a pair of parentheses, so the portion of the text that matched the first parenthesized subexpression (that is \d*?) is returned.
References:
https://www.postgresql.org/docs/9.6/static/sql-createcast.html
https://www.postgresql.org/docs/9.6/static/functions-matching.html
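To check what the regular expression above captures, here is a small Python sketch that applies the same pattern to the sample data from the question. It assumes the values are already available as text; the names used_scale and samples are made up for the example.

```python
import re

# Same pattern as in the query: lazily capture the digits after the
# decimal point, leaving any trailing zeros outside the group.
TRAILING_ZEROS = re.compile(r"\.(\d*?)0*$")


def used_scale(text):
    """Number of decimal places in use, ignoring trailing zeros."""
    match = TRAILING_ZEROS.search(text)
    return len(match.group(1)) if match else 0


samples = ["123.12345000", "321.43210000", "5255.12340000", "5244.12345000"]
max_scale = max(used_scale(s) for s in samples)
```

A value without a decimal point does not match the pattern at all, which the sketch treats as scale 0.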
Note this will scan the entire table:
SELECT TOP 1 [Col1]
FROM [Table]
ORDER BY LEN(PARSENAME(CAST([Col1] AS VARCHAR(40)), 1)) DESC

"ORDER BY ... USING" clause in PostgreSQL

The ORDER BY clause is described in the PostgreSQL documentation as:
ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...]
Can someone give me some examples how to use the USING operator? Is it possible to get an alternating order of the resultset?
A very simple example would be:
> SELECT * FROM tab ORDER BY col USING <
But this is boring, because this is nothing you can't get with the traditional ORDER BY col ASC.
Also the standard catalog doesn't mention anything exciting about strange comparison functions/operators. You can get a list of them:
> SELECT amoplefttype::regtype, amoprighttype::regtype, amopopr::regoper
FROM pg_am JOIN pg_amop ON pg_am.oid = pg_amop.amopmethod
WHERE amname = 'btree' AND amopstrategy IN (1,5);
You will notice that these are mostly < and > operators for primitive types like integer, date etc. and some more for arrays, vectors and so on. None of these operators will help you get a custom ordering.
In most cases where custom ordering is required you can get away using something like ... ORDER BY somefunc(tablecolumn) ... where somefunc maps the values appropriately. Because that works with every database this is also the most common way. For simple things you can even write an expression instead of a custom function.
Switching gears up
ORDER BY ... USING makes sense in several cases:
The ordering is so uncommon, that the somefunc trick doesn't work.
You work with a non-primitive type (like point, circle or imaginary numbers) and you don't want to repeat yourself in your queries with strange calculations.
The dataset you want to sort is so large, that support by an index is desired or even required.
I will focus on the complex datatypes: often there is more than one way to sort them in a reasonable way. A good example is point: You can "order" them by the distance to (0,0), or by x first, then by y or just by y or anything else you want.
Of course, PostgreSQL has predefined operators for point:
> CREATE TABLE p ( p point );
> SELECT p <-> point(0,0) FROM p;
But none of them is declared usable for ORDER BY by default (see above):
> SELECT * FROM p ORDER BY p;
ERROR: could not identify an ordering operator for type point
HINT: Use an explicit ordering operator or modify the query.
Simple operators for point are the "below" and "above" operators <^ and >^. They compare simply the y part of the point. But:
> SELECT * FROM p ORDER BY p USING >^;
ERROR: operator >^ is not a valid ordering operator
HINT: Ordering operators must be "<" or ">" members of btree operator families.
ORDER BY USING requires an operator with defined semantics: Obviously it must be a binary operator, it must accept the same type as arguments and it must return boolean. I think it must also be transitive (if a < b and b < c then a < c). There may be more requirements. But all these requirements are also necessary for proper btree-index ordering. This explains the strange error messages containing the reference to btree.
ORDER BY USING also requires not just one operator to be defined but an operator class and an operator family. While one could implement sorting with only one operator, PostgreSQL tries to sort efficiently and minimize comparisons. Therefore, several operators are used even when you specify only one - the others must adhere to certain mathematical constraints - I've already mentioned transitivity, but there are more.
Switching Gears up
Let's define something suitable: An operator for points which compares only the y part.
The first step is to create a custom operator family which can be used by the btree index access method:
> CREATE OPERATOR FAMILY xyzfam USING btree; -- superuser access required!
CREATE OPERATOR FAMILY
Next we must provide a comparator function which returns -1, 0, +1 when comparing two points. This function WILL be called internally!
> CREATE FUNCTION xyz_v_cmp(p1 point, p2 point) RETURNS int
AS $$BEGIN RETURN btfloat8cmp(p1[1],p2[1]); END $$ LANGUAGE plpgsql;
CREATE FUNCTION
Next we define the operator class for the family. See the manual for an explanation of the numbers.
> CREATE OPERATOR CLASS xyz_ops FOR TYPE point USING btree FAMILY xyzfam AS
OPERATOR 1 <^ ,
OPERATOR 3 ?- ,
OPERATOR 5 >^ ,
FUNCTION 1 xyz_v_cmp(point, point) ;
CREATE OPERATOR CLASS
This step combines several operators and functions and also defines their relationship and meaning. For example OPERATOR 1 means: This is the operator for less-than tests.
Now the operators <^ and >^ can be used in ORDER BY USING:
> INSERT INTO p SELECT point(floor(random()*100), floor(random()*100)) FROM generate_series(1, 5);
INSERT 0 5
> SELECT * FROM p ORDER BY p USING >^;
p
---------
(17,8)
(74,57)
(59,65)
(0,87)
(58,91)
Voila - sorted by y.
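If the role of the comparator function is still unclear, the idea can be sketched outside the database. The Python below is only an analogy, not what PostgreSQL runs internally: a function that returns -1, 0 or +1 from the y part alone is enough to drive a sort. The name y_cmp is made up, and the points are the five from the output above in shuffled order.

```python
from functools import cmp_to_key


def y_cmp(p1, p2):
    """Three-way comparison on the y part only: returns -1, 0 or +1."""
    return (p1[1] > p2[1]) - (p1[1] < p2[1])


points = [(58, 91), (0, 87), (17, 8), (59, 65), (74, 57)]

# The sort never looks at x; all ordering decisions come from y_cmp.
by_y = sorted(points, key=cmp_to_key(y_cmp))
```

This is the same contract the btree support function has to fulfil: a consistent three-way answer for any two values of the type.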
To sum it up: ORDER BY ... USING is an interesting look under the hood of PostgreSQL. But nothing you will require anytime soon unless you work in very specific areas of database technology.
Another example can be found in the Postgres docs, with source code for the example here and here. This example also shows how to create the operators.
Samples:
CREATE TABLE test
(
id serial NOT NULL,
"number" integer,
CONSTRAINT test_pkey PRIMARY KEY (id)
);
insert into test("number") values (1),(2),(3),(0),(-1);
select * from test order by number USING >; -- gives 3=>2=>1=>0=>-1
select * from test order by number USING <; -- gives -1=>0=>1=>2=>3
So, it is equivalent to desc and asc. But you may use your own operator, that's the essential feature of USING
Nice answers, but they didn't mention one really valuable case for USING.
When you have created an index with a non-default operator class, for example varchar_pattern_ops (~>~, ~<~, ~>=~, ...) instead of <, >, >=, and you want the index to be used for the ORDER BY clause as well, you need to specify USING with the appropriate operator.
This can be illustrated with the following example:
CREATE INDEX index_words_word ON words(word text_pattern_ops);
Let's compare these two queries:
SELECT * FROM words WHERE word LIKE 'o%' LIMIT 10;
and
SELECT * FROM words WHERE word LIKE 'o%' ORDER BY word LIMIT 10;
The difference between their execution times is nearly 100 times in a 500K-word DB! Also, the results may not be correct in a non-C locale.
How could this happen?
When you search with LIKE and an ORDER BY clause, you actually make this call:
SELECT * FROM words WHERE word ~>=~ 'o' AND word ~<~'p' ORDER BY word USING < LIMIT 10;
Your index was created with the ~<~ operator in mind, so PG cannot use the given index for the given ORDER BY clause. To get things done right, the query must be rewritten in this form:
SELECT * FROM words WHERE word ~>=~ 'o' AND word ~<~'p' ORDER BY word USING ~<~ LIMIT 10;
or
SELECT * FROM words WHERE word LIKE 'o%' ORDER BY word USING ~<~ LIMIT 10;
Optionally one can add the key word ASC (ascending) or DESC
(descending) after any expression in the ORDER BY clause. If not
specified, ASC is assumed by default. Alternatively, a specific
ordering operator name can be specified in the USING clause. An
ordering operator must be a less-than or greater-than member of some
B-tree operator family. ASC is usually equivalent to USING < and DESC
is usually equivalent to USING >.
PostgreSQL 9.0
It may look something like this I think (I don't have postgres to verify this right now, but will verify later)
SELECT Name FROM Person
ORDER BY NameId USING >