Custom sorting (order by) in PostgreSQL, independent of locale - sql

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to be first just like in C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?

If you want a custom sort, you must define some function that modifies your values in some way so that the natural ordering of the modified values fits your requirement.
For example, you can prepend some character or string if the value starts with an uppercase letter:
CREATE OR REPLACE FUNCTION mysort(text) RETURNS text IMMUTABLE AS $$
    SELECT CASE
        WHEN substring($1 from 1 for 1) = upper(substring($1 from 1 for 1))
        THEN 'AAAA' || $1
        ELSE $1
    END;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolproof (you might want to replace 'AAAA' with something more apt) and it hurts performance, of course.
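To see why this is not foolproof, here is a minimal Python sketch of the same rule. It is not PostgreSQL code; the function only mirrors the CASE expression above, and the sample value "3 Maja" is an assumption added for illustration.

```python
def mysort(value: str, marker: str = "AAAA") -> str:
    """Prefix the marker when the first character equals its own uppercase form."""
    first = value[:1]
    if first and first == first.upper():
        return marker + value
    return value


names = ["sępoleński", "świecki", "toruński", "Włocławek"]
keys = {name: mysort(name) for name in names}

# Digits and punctuation are equal to their uppercase form, so they get the marker too.
digit_key = mysort("3 Maja")
```

Only "Włocławek" gets the prefix among the four sample names, but so would any name starting with a digit or punctuation, which may or may not be what you want.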

If you want it efficient, you'll need to create another column that "naturally" sorts correctly (e.g. even in the C locale), and use that as a sorting criterion. For that, you should use the approach of the strxfrm C library function. As a straight-forward strxfrm table for your approach, replace each letter with two ASCII letters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and the regular ASCII sorting will sort it correctly.
If you don't want to create a separate column, you can try to use a function in the ORDER BY clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).
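The two-characters-per-letter idea can be sketched in Python as follows. This is only an illustration of the transformation, not a PostgreSQL function; the rank table for the Polish letters is my assumption of a reasonable mapping (plain letter 0, accented variant 1, and 2 for ż so that it follows ź).

```python
# Hypothetical rank table: base ASCII letter plus a digit that orders the variants.
POLISH_RANK = {
    "ą": ("a", 1), "ć": ("c", 1), "ę": ("e", 1), "ł": ("l", 1), "ń": ("n", 1),
    "ó": ("o", 1), "ś": ("s", 1), "ź": ("z", 1), "ż": ("z", 2),
}


def sort_key(name: str) -> str:
    """Turn each character into base letter + rank digit, keeping its case."""
    parts = []
    for ch in name:
        base, rank = POLISH_RANK.get(ch.lower(), (ch.lower(), 0))
        if ch.isupper():
            base = base.upper()
        parts.append(f"{base}{rank}")
    return "".join(parts)


names = ["sępoleński", "świecki", "toruński", "Włocławek"]
ordered = sorted(names, key=sort_key)  # plain code point comparison of ASCII keys
```

Because uppercase ASCII letters come before lowercase ones, the key also puts capitalized names first, which is the order asked for in the question. Storing such a key in an extra column would let the database sort it under the C locale.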

Related

What is the maximum value for STRING ordering in SQL (SQLite)?

I have a SQLite database and I want to order my results by ascending order of a String column (name). I want the null-valued rows to be last in ascending order.
Moreover, I am doing some filtering on the same column (WHERE name>"previously obtained value"), which filters out the NULL-valued rows, which I do not want. Plus, the version of SQLite I'm using (I don't have control over this) does not support NULLS LAST. Therefore, to keep it simple I want to use IFNULL(name,"Something") in my ORDER BY and my comparison.
I want this "Something" to be as large as possible, so that my null-valued rows are always last. I have texts in Japanese and Korean, so I can't just use "ZZZ".
Therefore, I see two possible solutions. First, use the "maximum" character used by SQLite in the default ordering of strings, do you know what this value is or how to obtain it? Second, as the cells can contain any type in SQLite, is there a value of any other type that will always be considered larger than any string?
Example:
+----+-----------------+---------------+
| id | name | othercol |
+----+-----------------+---------------+
| 1 | English name | hello |
| 2 | NULL | hi |
| 3 | NULL | hi hello |
| 4 | 暴鬼 | hola |
| 5 | NULL | bonjour hello |
| 6 | 아바키 | hello bye |
+----+-----------------+---------------+
Current request:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (name,id)>("English name",1) ORDER BY (name,id)
Result (by ids): 6
Problems: NULL names are filtered out because of the comparison, and when I have no comparison they are shown first.
What I think would solve these problems:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,"Something"),id)>("English name",1) ORDER BY (IFNULL(name,"Something"),id)
But I need "Something" to be larger than any string I might encounter.
Expected result: 6, 3, 5
I think a simpler way is to use nulls last:
order by column nulls last
This works with both ascending and descending sorts. And it has the advantage that it can make use of an index on the column, which coalesce() would probably prevent.
Change your WHERE clause to:
WHERE SOMECOL > "previously obtained value" OR SOMECOL IS NULL
so the NULLs are not filtered out (since you want them).
You can sort the NULLs last, like this:
ORDER BY SOMECOL IS NULL, SOMECOL
The expression:
SOMECOL IS NULL
evaluates to 1 (True) or 0 (False), so the values that are not NULL will be sorted first.
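A small check of both pieces together, using Python's built-in sqlite3 module and the sample table from the question (the in-memory database and the parameter value are assumptions for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, othercol TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "English name", "hello"),
        (2, None, "hi"),
        (3, None, "hi hello"),
        (4, "暴鬼", "hola"),
        (5, None, "bonjour hello"),
        (6, "아바키", "hello bye"),
    ],
)

# Keep NULL names in the filter, then push them to the end of the ordering.
query = """
    SELECT id FROM mytable
    WHERE name > ? OR name IS NULL
    ORDER BY name IS NULL, name, id
"""
ids = [row[0] for row in conn.execute(query, ("English name",))]
```

The non-NULL names come first in ascending order, followed by the rows whose name is NULL.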
Edit
If you want a string that is greater than any name in the table, then you can get it by:
select max(name) || ' ' from mytable
so in your code use:
ifnull(name, (select max(name) || ' ' from mytable))
Finally found a solution. For anyone looking for a character that sorts after any other you are likely to encounter (as of this writing; the Unicode table might get expanded), here it is:
CAST(x'f48083bf' AS TEXT).
Example in my case:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)>("English name",1) ORDER BY (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)
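A sketch that checks this sentinel with Python's sqlite3 module. Two assumptions: the LIKE pattern is written as '%hello%' (the question's expected result only makes sense with wildcards), and the row-value comparison is expanded into plain comparisons so it also runs on older SQLite builds.

```python
import sqlite3

SENTINEL_SQL = "CAST(x'f48083bf' AS TEXT)"
# The same four bytes decoded as UTF-8: code point U+1000FF (private use area).
sentinel = bytes.fromhex("f48083bf").decode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, othercol TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "English name", "hello"),
        (2, None, "hi"),
        (3, None, "hi hello"),
        (4, "暴鬼", "hola"),
        (5, None, "bonjour hello"),
        (6, "아바키", "hello bye"),
    ],
)

query = f"""
    SELECT id FROM (
        SELECT id, othercol, IFNULL(name, {SENTINEL_SQL}) AS sort_name
        FROM mytable
    )
    WHERE othercol LIKE '%hello%'
      AND (sort_name > :name OR (sort_name = :name AND id > :id))
    ORDER BY sort_name, id
"""
ids = [row[0] for row in conn.execute(query, {"name": "English name", "id": 1})]
```

Note that x'f48083bf' decodes to U+1000FF, which is above every assigned non-private character but is not the highest code point; that would be U+10FFFF, encoded as x'f48fbfbf'. Either works as long as the default BINARY collation is used, since it compares the UTF-8 bytes.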

Efficiently return words that match, or whose synonym(s), match a keyword

I have a database of industry-specific terms, each of which may have zero or more synonyms. Users of the system can search for terms by keyword and the results should include any term that contains the keyword or that has at least one synonym that contains the keyword. The result should then include the term and ONLY ONE of the matching synonyms.
Here's the setup... I have a term table with 2 fields: id and term. I also have a synonym table with 3 fields: id, termId, and synonym. So there would be data like:
term Table
id | term
-- | -----
1 | dog
2 | cat
3 | bird
synonym Table
id | termId | synonym
-- | ------ | --------
1 | 1 | canine
2 | 1 | man's best friend
3 | 2 | feline
A keyword search for (the letter) "i" should return the following as a result:
id | term | synonym
-- | ------ | --------
1 | dog | canine <- because of the "i" in "canine"
2 | cat | feline <- because of the "i" in "feline"
3 | bird | <- because of the "i" in "bird"
Notice how, even though both "dog" synonyms contain the letter "i", only one was returned in the result (doesn't matter which one).
Because I need to return all matches from the term table regardless of whether or not there's a synonym and I need no more than 1 matching synonym, I'm using an OUTER APPLY as follows:
SELECT
    term.id,
    term.term,
    synonyms.synonym
FROM
    term
    OUTER APPLY (
        SELECT
            TOP 1
            term.id,
            synonym.synonym
        FROM
            synonym
        WHERE
            term.id = synonym.termId
            AND synonym.synonym LIKE @keyword
    ) AS synonyms
WHERE
    term.term LIKE @keyword
    OR synonyms.synonym LIKE @keyword
There are indexes on term.term, synonym.termId and synonym.synonym. @keyword is always something like '%foo%'. The problem is that, with close to 50,000 terms (not that much for databases, I know, but...), the performance is horrible. Any thoughts on how this can be done more efficiently?
Just a note, one thing I had thought to try was flattening the synonyms into a comma-delimited list in the term table so that I could get around the OUTER APPLY. Unfortunately though, that list can easily exceed 900 characters which would then prevent SQL Server from adding an index to that column. So that's a no-go.
Thanks very much in advance.
You've got a lot of unnecessary logic in there. There's no telling how SQL server is creating an execution path. It's simpler and more efficient to split this up into two separate db calls and then merge them in your code:
Get matches based on synonyms:
SELECT
    term.id
    ,term.term
    ,synonym.synonym
FROM
    term
    INNER JOIN synonym ON term.id = synonym.termId
WHERE
    synonym.synonym LIKE @keyword
Get matches based on terms:
SELECT
    term.id
    ,term.term
FROM
    term
WHERE
    term.term LIKE @keyword
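The merge step in application code could look roughly like this Python sketch. The function name and the hard-coded rows (what the two queries would return for '%i%' on the sample data) are assumptions for illustration.

```python
def merge_matches(synonym_rows, term_rows):
    """Combine both result sets, keeping at most one synonym per term."""
    merged = {}
    for term_id, term, synonym in synonym_rows:
        # setdefault keeps the first synonym seen for each term id.
        merged.setdefault(term_id, (term_id, term, synonym))
    for term_id, term in term_rows:
        # Terms that matched on their own name only get an empty synonym.
        merged.setdefault(term_id, (term_id, term, None))
    return sorted(merged.values())


synonym_rows = [
    (1, "dog", "canine"),
    (1, "dog", "man's best friend"),
    (2, "cat", "feline"),
]
term_rows = [(3, "bird")]
result = merge_matches(synonym_rows, term_rows)
```

Processing the synonym matches first means a term that matches both ways still shows one of its matching synonyms.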
Regarding "flattening the synonyms into a comma-delimited list in the term table": have you considered using the Full Text Search feature? It would be much faster, even as your data grows.
You can put all synonyms (comma-delimited) in a "synonym" column and put a full-text index on it.
If you also want results that match synonyms of the search words, I recommend FREETEXT. This is an example:
SELECT Title, Text, * FROM [dbo].[Post] where freetext(Title, 'phone')
The previous query matches words related to 'phone' by meaning, not just the exact word. It also compares the inflectional forms of the words. In this case it can return any title that has 'mobile', 'telephone', 'smartphone', etc.
Take a look at this article about SQL Server Full Text Search, hope it helps

Oracle Regular Expression using instead of INSTR function

I keep data in table rows like this:
t_course
+------+------------------------------------------+
| sid | courses |
+------+------------------------------------------+
| 1 | cs101.math102.ns202-2.phy104 |
+------+------------------------------------------+
| 2 | cs101.math201.ens202-1.phy104-10.chm105 |
+------+------------------------------------------+
| 3 | cs101.ns202-2.math201.ens202-1.phy104 |
+------+------------------------------------------+
Now, I want to count the rows whose courses mention both ns202 and ens202. It should only return the record with sid 3, but it returns other records too (because of instr). I have tried many methods for this, but none of them work. For example:
select count(*) from
t_course
where
instr(courses, 'ns202') > 0
and instr(courses, 'ens202') > 0;
The code above doesn't work properly because it looks for ns202, but ens202 itself contains ns202.
I tried using regular expressions, and I tried splitting the courses into rows, but that both broke the working logic and slowed things down.
How can I do this with regular expressions instead of instr, using "begins with" logic (for example ns202%)? That is, ns202 should match only at the start of the string or right after a dot.
You can use regexp_like with word boundaries to get rows which have both ns202 and ens202. Normally you would use \b for word boundaries. As Oracle doesn't support it, the alternative is to use (\s|\W) together with the start ^ and end $ anchors.
\s - space character, \W - non word character. Add more characters as needed, as word-boundaries based on your requirements.
select *
from t_course
where regexp_like(courses,'(^|\s|\W)ns202(\s|\W|$)')
and regexp_like(courses,'(^|\s|\W)ens202(\s|\W|$)')
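The effect of these boundaries can be checked outside Oracle with a small Python sketch. Python's re module stands in for regexp_like here (for this simple pattern the two behave the same), and the helper name is an assumption.

```python
import re


def has_course(courses: str, code: str) -> bool:
    """True when code appears as its own token, not inside a longer one."""
    pattern = r"(^|\W)" + re.escape(code) + r"(\W|$)"
    return re.search(pattern, courses) is not None


rows = {
    1: "cs101.math102.ns202-2.phy104",
    2: "cs101.math201.ens202-1.phy104-10.chm105",
    3: "cs101.ns202-2.math201.ens202-1.phy104",
}
matching = [
    sid
    for sid, courses in rows.items()
    if has_course(courses, "ns202") and has_course(courses, "ens202")
]
```

Row 2 drops out because its only "ns202" is preceded by the word character "e".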
You will have the same problem with ens202, by the way: what if there is also cens202 or tens202?
You can solve your problem with regular expressions. You can also solve it with the LIKE operator:
select <whatever>
from <table or tables>
where (courses like 'ns202%' or courses like '%.ns202%')
and (courses like 'ens202%' or courses like '%.ens202%')
You can test both approaches to see which works best for your data.
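As a quick sanity check of the LIKE version, here is a sketch that runs it against the sample rows with Python's sqlite3 module. SQLite is only a stand-in for Oracle; the LIKE patterns themselves are unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_course (sid INTEGER PRIMARY KEY, courses TEXT)")
conn.executemany(
    "INSERT INTO t_course VALUES (?, ?)",
    [
        (1, "cs101.math102.ns202-2.phy104"),
        (2, "cs101.math201.ens202-1.phy104-10.chm105"),
        (3, "cs101.ns202-2.math201.ens202-1.phy104"),
    ],
)

# A course code counts only at the start of the string or right after a dot.
query = """
    SELECT sid FROM t_course
    WHERE (courses LIKE 'ns202%' OR courses LIKE '%.ns202%')
      AND (courses LIKE 'ens202%' OR courses LIKE '%.ens202%')
    ORDER BY sid
"""
sids = [row[0] for row in conn.execute(query)]
```

Only the row with sid 3 satisfies both conditions.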

Postgres matching against an array of regular expressions

My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
I could not figure it out by myself sadly :/.
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application stores data in Postgres xml columns, which means I do in fact have (after evaluating (xpath('/data/clients/name/text()'))::text[]):
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it the way it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use JSON" are not options.
I am stuck with this datamodel.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
select unnest(array['.*ob', 'ali.*']) as pattern
)
select
o.name, o.officeid
from
office o
where exists (
select null
from patterns p
where o.name ~ p.pattern
)
The semi-join helps protect you from cases where a name like "alicia nob" meets multiple search patterns and would otherwise come back once for every match.
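For the array-valued data from the edit, the same "any name matches any pattern" logic can be sketched in Python. This only illustrates the semi-join idea, not PostgreSQL; the function name is an assumption, and Python's re dialect differs from PostgreSQL's in details that do not matter for these patterns.

```python
import re


def offices_matching(offices, patterns):
    """Return office ids where at least one name matches at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [
        office_id
        for office_id, names in offices.items()
        # any() stops at the first hit, so each office is reported once.
        if any(rx.search(name) for name in names for rx in compiled)
    ]


offices = {
    1: ["bob", "alice"],
    2: ["anthony", "walter"],
    3: ["alicia", "walter"],
}
matched = offices_matching(offices, [".*ob", "ali.*"])
```

re.search is used because the ~ operator in PostgreSQL is also unanchored.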
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When casting a string array into text, strings containing special characters or consisting of keywords are enclosed in double quotes kind of like {jimmy,"walter, james"} being two entries. Also when matching with ~ it is matched against any part of the string, not the same as LIKE where it's matched against the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
id | xss | officeid
----+-------------------------+----------
2 | {anthony,walter} | 2
3 | {alicia,walter} | 3
4 | {"walter, james"} | 5
5 | {jimmy,"walter, james"} | 4
(4 rows)

SQL: Find highest number if its in nvarchar format containing special characters

I need to pull the record containing the highest value, specifically I only need the value from that field. The problem is that the column is nvarchar format that contains a mix of numbers and special characters. The following is just an example:
PK | Column 2 (nvarchar)
-------------------
1 | .1.1.
2 | .10.1.1
3 | .5.1.7
4 | .4.1.
9 | .10.1.2
15 | .5.1.4
Basically, because the column is text, the items in column 2 are compared as strings rather than as numbers. So instead of returning the PK for the row containing ".10.1.2" as the highest value, I get the PK for the row that contains ".5.1.7".
I attempted to write some functions to do this, but what I've written looks way more complicated than it should be. Does anyone have something simple, or are complicated functions the only way?
I want to make clear that I'm trying to grab the PK of the record that contains the highest Column 2 value.
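This query might return what you want, but it is only reliable when stripping the dots preserves the ordering: a value like .5.10.1 becomes 5101 and would beat .10.1.2 (1012). It also returns the value rather than the PK: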
SELECT MAX(CAST(REPLACE(Column2, '.', '') as INT)) FROM table
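If the comparison can be done in application code, a segment-by-segment key avoids that pitfall. A minimal Python sketch, assuming the rows have already been fetched as (PK, value) pairs and that every non-empty segment is a plain integer:

```python
rows = [
    (1, ".1.1."),
    (2, ".10.1.1"),
    (3, ".5.1.7"),
    (4, ".4.1."),
    (9, ".10.1.2"),
    (15, ".5.1.4"),
]


def segment_key(value: str) -> tuple:
    """Split on dots and compare the segments as integers, left to right."""
    return tuple(int(part) for part in value.split(".") if part)


best_pk, best_value = max(rows, key=lambda row: segment_key(row[1]))

# Stripping the dots instead can misorder values with multi-digit segments.
stripped_a = int(".5.10.1".replace(".", ""))
stripped_b = int(".10.1.2".replace(".", ""))
```

Here best_pk is the PK of the row holding ".10.1.2", while the stripped integers show how ".5.10.1" would wrongly win under the REPLACE approach.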