Postgres select rows that have a percentage match, and sort by percentage - sql

I have a table of sentences in a postgres database. The table is called sentences and the column that stores the sentence for each row is called sentence.
How can I compare the sentences to a given sentence and return the ones in which, say, 60% of the words (or even better the roots of the words) match and then sort the results by the quality of the match?
Ideally a 90% match would come before 70% match and a 50% match wouldn't show at all.
Ideally it would exclude punctuation as well, but that's not a necessity.

Check out the fuzzystrmatch module, especially the levenshtein function. This calculates the "distance" between two strings, with lower values meaning they are more similar. It's generally used between two words, but as long as the sentences aren't too long (max string length for each argument is 255 bytes), you could use them with sentences as well.
Then you would sort by the output of the levenshtein function ascending, with the results going from most to least similar.
If you wanted to exclude punctuation, call regexp_replace on the strings with a regex to remove all characters you would like, replacing it with empty string, and use those return values as the arguments to levenshtein.

Related

Selecting common parts of a string in GBQ

I have the following column_x in a table.
I would like to write a query to extract only the common parts of the string (it is not a split between numeric and non numeric characters), in other words abc, cbrd abd, abd from this column.
For this I do not want to use a dimensional table, as new strings could emerge in the future. Would you know how I could approach this extraction?

How to use REGEXP_LIKE() for concatenation in Oracle

I need to make some changes in SQL within a CURSOR. Previously, the maximum value for column 'code' was 4 characters (e.g. K100, K101,....K999) but now it needs to be 8 characters (e.g. K1000, K1001, K1002,....K1000000).
CURSOR c_code(i_prefix VARCHAR2)
IS
SELECT NVL(MAX(SUBSTR(code,2))+1,100) code
FROM users
WHERE code LIKE i_prefix||'___';
The 'code' column value starts from 100 and increments +1 each time a new record is inserted. Currently, the maximum value is 'K999' and I would like it to be K1000, K1001, K1002 and so on.
I have altered and modified the 'code' column to VARCHAR(8) in the users table.
Note: i_prefix value is always 'K'.
I have tried to amend the SQL -
CURSOR c_code(i_prefix VARCHAR2)
IS
SELECT NVL(MAX(SUBSTR(code,2))+1,100) code
FROM users
WHERE code LIKE i_prefix||'________';
However, it restarts from 100 and not from K1000, K1001, K1002, etc. each time a record is inserted.
I have been suggested to use REGEXP_LIKE() but not sure how to properly use it to get the desired outcome in this case.
Can you please guide me on how can we get this result using REGEXP_LIKE().
Thank you.
Your old code
WHERE code LIKE i_prefix||'___';
will match K followed by exactly three characters, which is what you had. Your new code
WHERE code LIKE i_prefix||'________';
will match K followed by exactly eight characters, which is one too many for a start, since you said the total length was eigh - which means you need sever wilcard placeholders:
WHERE code LIKE i_prefix||'_______';
... but that still won't work at the moment since your existing values aren't that long. As all your current values are at least four, you could do:
WHERE code LIKE i_prefix||'___%';
which will match K followed by three or more characters - with no upper limit, but your column is restricted to eight too anyway.
If you did want to use a regular expression, which are generally slower, you could do:
WHERE REGEXP_LIKE(code, i_prefix||'.{3,7}');
which would match K followed by three to seven characters, or:
WHERE REGEXP_LIKE(code, i_prefix||'\d{3,7}');
which would only match K followed by three to seven digits.
fiddle
However, I would suggest you use a sequence to generate the numeric part, and just prefix that with the K character. The sequence could start from 100 on a new system with no data, or from the current maximum number in an existing system with data.
I would also consider zero-padding the data, including all the existing values, to allow them to be compared; so update K100 to K0000100. Or if you can't do that, once you get past K199 jump to K2000000. Either would then allow the values to be sorted easily as strings. Or, perhaps, add a virtual column that extracts the numeric part as a number.

Regular expression metacharacter in SQL yields different results in Oracle vs. Postgres

I'm trying to convert some queries from an Oracle environment to Postgres. This is a simplified version of one of the queries:
SELECT * FROM TABLE
WHERE REGEXP_LIKE(TO_CHAR(LINK_ID),'\D')
I believe the equivalent postgreSQL should be this:
SELECT * FROM TABLE
WHERE CAST(LINK_ID AS TEXT) ~ '\D'
But when I run these queries in their respective environments on the exact same dataset, the first query outputs no records (which is correct) and the second query outputs all records in the table. I didn't write the original code, but as I understand it, it's looking for any values in the numeric field LINK_ID that are non-digit characters. Is the \D metacharacter supposed to behave differently in Oracle vs. postgres? I'm not seeing anything in documentation to say they should.
The documentation for Oracle's TO_CHAR(number) states
If you omit fmt, then n is converted to a VARCHAR2 value exactly long enough to hold its significant digits.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions181.htm
This means that the only non-numeric character which might be produced is a negative sign or a decimal point. If the number is positive and has no fractional part, it will not match the regular expression \D.
On the other hand, on PostgreSQL CAST(numeric(38,8)as TEXT) returns a value with the number of decimal places specified by the type specification, in this case 8.
E.g.:
cast( cast(12341234 as numeric(38,8)) as TEXT)
Generates 12341234.00000000 The result of such a cast will always contain a decimal point and therefore will always match the regular expression \D.
You may find that replacing it with this solves your problem:
(LINK_ID % 1) <> 0.0
Alternatively, If you need to use the regex (e.g. to simplify migration work), consider changing it to '\.0*[1-9]' i.e. to find a decimal point with any nonzero digit after it.

Difference between _%_% and __% in sql server

I am learning basics of SQL through W3School and during understanding basics of wildcards I went through the following query:
--Finds any values that start with "a" and are at least 3 characters in length
WHERE CustomerName LIKE 'a_%_%'
as per the example following query will search the table where CustomerName column start with 'a' and have at least 3 characters in length.
However, I try the following query also:
WHERE CustomerName LIKE 'a__%'
The above query also gives me the exact same result.
I want to know whether there is any difference in both queries? Does the second query produce a different output in some specific scenario? If yes what will be that scenario?
Both start with A, and end with %. In the middle part, the first says "one char, then between zero and many chars, then one char", while the second one says "one char, then one char".
Considering that the part that comes after them (the final part) is %, which means "between zero and many chars", I can only see both clauses as identical, as they both essentially just want a string starting with A then at least two following characters. Perhaps if there were at least some limitations on what characters were allowed by the _, then maybe they could have been different.
If I had to choose, I'd go with the second one for being more intuitive. After all, many other masks (e.g. a%%%%%%_%%_%%%%%) will yield the same effect, but why the weird complexity?
For Like operator a single underscore "_" means, any single character, so if you put One underscore like
ColumnName LIKE 'a_%'
you basically saying you need a string where first letter is 'a' then followed by another single character and then followed by anything or nothing.
ColumnName LIKE 'a__%' OR ColumnName LIKE 'a_%_%'
Both expressions mean first letter 'a' then followed by two characters and then followed by anything or nothing. Or in simple English any string with 3 or more character starting with a.

Simple thing about Index on sqlite

I have a sqlite database with 3 column:
id, word, bitmask
I make a bitmask out of the vowels in the word, so I can quickly find every word that contains a certain vowel:
SELECT word FROM words WHERE bitmask & 7 = 0
I have two questions.
Should I add an index? If that case, to what column and how do I write the query
I tried the code below, but didn't see any improvements in performance.
CREATE INDEX bitmask_index ON words (bitmask);
The "bitmask" column contains values from 1-256. Would it be a good thing to sort the "bitmask" column by value? In that case, how do I write the query for this?
Indexing is unlikely to help, because you are applying a function to the value before the search. Typically, this kills the effects of indexing.
Sorting bitmasks rarely makes sense, because bit positions in bitmaps do not correspond to something that is ordered across the rows (their ordering is associated with something inside the same row, e.g. the vowels in some word).