What is the maximum value for STRING ordering in SQL (SQLite)?

I have a SQLite database and I want to order my results by ascending order of a String column (name). I want the null-valued rows to be last in ascending order.
Moreover, I am doing some filtering on the same column (WHERE name>"previously obtained value"), which also filters out the NULL-valued rows, and I do not want that. Plus, the version of SQLite I'm using (I don't have control over this) does not support NULLS LAST. Therefore, to keep it simple I want to use IFNULL(name,"Something") in my ORDER BY and in my comparison.
I want this "Something" to be as large as possible, so that my null-valued rows are always last. I have texts in Japanese and Korean, so I can't just use "ZZZ".
Therefore, I see two possible solutions. First, use the "maximum" character used by SQLite in the default ordering of strings: do you know what this value is or how to obtain it? Second, since a cell can contain any type in SQLite, is there a value of any other type that will always be considered larger than any string?
Example:
+----+--------------+---------------+
| id | name         | othercol      |
+----+--------------+---------------+
| 1  | English name | hello         |
| 2  | NULL         | hi            |
| 3  | NULL         | hi hello      |
| 4  | 暴鬼         | hola          |
| 5  | NULL         | bonjour hello |
| 6  | 아바키       | hello bye     |
+----+--------------+---------------+
Current request:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (name,id)>("English name",1) ORDER BY (name,id)
Result (by ids): 6
Problems: NULL names are filtered out because of the comparison, and when I have no comparison they are shown first.
What I think would solve these problems:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,"Something"),id)>("English name",1) ORDER BY (IFNULL(name,"Something"),id)
But I need "Something" to be larger than any string I might encounter.
Expected result: 6, 3, 5

I think a simpler way is to use nulls last:
order by column nulls last
This works with both ascending and descending sorts. And it has the advantage that it can make use of an index on the column, which coalesce() would probably prevent.
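For example, a minimal sketch against the table from the question (this assumes a SQLite version recent enough to accept NULLS LAST, which the asker's apparently is not):
SELECT *
FROM mytable
ORDER BY name NULLS LAST, id;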

Change your WHERE clause to:
WHERE SOMECOL > "previously obtained value" OR SOMECOL IS NULL
so the NULLs are not filtered out (since you want them).
You can sort the NULLs last, like this:
ORDER BY SOMECOL IS NULL, SOMECOL
The expression:
SOMECOL IS NULL
evaluates to 1 (True) or 0 (False), so the values that are not NULL will be sorted first.
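Put together against the question's table, a sketch (it keeps the comparison on name only and drops the (name,id) row-value tie-breaker for brevity):
SELECT *
FROM mytable
WHERE othercol LIKE 'hello'
  AND (name > 'English name' OR name IS NULL)
ORDER BY name IS NULL, name, id;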
Edit
If you want a string that is greater than any name in the table, then you can get it by:
select max(name) || ' ' from mytable
so in your code use:
ifnull(name, (select max(name) || ' ' from mytable))
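Applied to the question's query, that looks roughly like this (again a sketch that drops the row-value comparison for brevity):
SELECT *
FROM mytable
WHERE othercol LIKE 'hello'
  AND IFNULL(name, (SELECT max(name) || ' ' FROM mytable)) > 'English name'
ORDER BY IFNULL(name, (SELECT max(name) || ' ' FROM mytable)), id;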

Finally found a solution. For anyone looking for a character larger than any other (at least as of this posting; the Unicode table might get expanded), here's your guy:
CAST(x'f48083bf' AS TEXT)
(x'f48083bf' is the UTF-8 encoding of a private-use code point at the very top of the Unicode range, so it collates after any character you are likely to meet in real text.)
Example in my case:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)>("English name",1) ORDER BY (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)

Related

Incremental integer ID in Impala

I am using Impala to query Parquet tables and cannot find a way to generate an incrementing integer column ranging from 1..n. The column is supposed to be used as an ID reference. Currently I am only aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me since I have to pass the ID to another system which expects an ID in the style of 1..n. I also already know that Impala has no auto-increment implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid  | some_content |
|-------|---------------|--------------|
| 1     | 50d53ca4-b... | "a"          |
| 2     | 6ba8dd54-1... | "b"          |
| 3     | 515362df-f... | "c"          |
| 4     | a52db5e9-e... | "d"          |
|-------|---------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one, which specifically handles Kudu tables. However, answers to it should be applicable to this question as well.
Since other Q&As like this one only came up with uuid()-like answers, I put some thought into it and finally came up with this solution:
SELECT
row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
, some_content
FROM some_table
row_number() generates a continuous integer-number over a provided partition. Unlike rank(), row_number() always provides an incremented number on its partition (even if duplicates occur)
PARTITION BY "dummy" partitions the entire table into one partition. This works since "dummy" is interpreted in the execution graph as temporary column yielding only the String-value "dummy". Thus, also something analog to "dummy" works.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), also use the "dummy"-workaround.
The command creates the desired incremental ID without any nested SQL-statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1     | "a"          |
| 2     | "b"          |
| 3     | "c"          |
| 4     | "d"          |
|-------|--------------|
I used Markus's answer for a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no partition by:
SELECT
row_number() OVER (ORDER BY actual_column) as my_id
, some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

Why does postgres return unordered data in a select query after updating a row?

I am a bit confused about the default ordering of the rows returned by postgres.
postgres=# select * from check_user;
id | name
----+------
1 | x
2 | y
3 | z
4 | a
5 | c1\
6 | c2
7 | c3
(7 rows)
postgres=# update check_user set name = 'c1' where name = 'c1\';
UPDATE 1
postgres=# select * from check_user;
id | name
----+------
1 | x
2 | y
3 | z
4 | a
6 | c2
7 | c3
5 | c1
(7 rows)
Before the update, it was returning rows ordered by id, but after the update, the order has changed. So my question is: if order by is not specified, what default ordering does postgres use?
Thanks in advance.
Put very simply, the "default order" is whatever it happens to read from the disk. Updating a row will not change the row in place... Usually it marks the old row as deleted and writes a new one.
When postgres reads rows from pages of memory, it will (probably) read them in the order they are stored on the page. It will read pages in whatever order it thinks is quickest (that may or may not be how they appear on disk). It can change based on whether or not it decides to use an index. So it can suddenly change without your app asking for anything different.
If you don't specify an order by it will not take any action to re-order them.
NEVER rely on the default order. It is undefined behaviour.
SQL tables represent unordered sets.
SQL results sets are unordered unless you explicitly include an order by.
Your select has no order by. Hence, the rows can come back in any order. Even running the same query twice can produce different orders.
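If a deterministic order matters, state it explicitly, e.g. for the table from the question:
SELECT * FROM check_user ORDER BY id;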

In Postgres: Select columns from a set of arrays of columns and check a condition on all of them

I have a table like this:
I want to perform a count on different sets of columns (all subsets where there is at least one element from X and one element from Y). How can I do that in Postgres?
For example, I may have {x1,x2,y3}, {x4,y1,y2,y3}, etc. I want to count the number of "id"s having 1 in each set. So for the first set:
SELECT COUNT(id) FROM table WHERE x1=1 AND x2=1 AND y3=1;
and the same for the second set:
SELECT COUNT(id) FROM table WHERE x4=1 AND y1=1 AND y2=1 AND y3=1;
Is it possible to write a loop that goes over all these sets and queries the table accordingly? The array will have more than 10000 sets, so it cannot be done manually.
You should be able to convert the table columns to an array using ARRAY[col1, col2, ...], then use the array_positions function, setting the second parameter to the value you're checking for. So, given your example above, this query:
SELECT id, array_positions(array[x1,x2,x3,x4,y1,y2,y3,y4], 1)
FROM tbl
ORDER BY id;
Will yield this result:
+----+-----------------+
| id | array_positions |
+----+-----------------+
| a  | {1,4,5}         |
| b  | {1,2,4,7}       |
| c  | {1,2,3,4,6,7,8} |
+----+-----------------+
Here's a SQL Fiddle.
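Building on that, one way to count the ids for a given set is an array-containment check; a sketch, assuming the column order above so that {x1,x2,y3} maps to positions {1,2,7}:
SELECT count(*)
FROM tbl
WHERE array_positions(array[x1,x2,x3,x4,y1,y2,y3,y4], 1) @> ARRAY[1,2,7];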

Custom sorting (order by) in PostgreSQL, independent of locale

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to come first, just like in the C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?
If you want a custom sort, you must define some function that modifies your values in some way so that the natural ordering of the modified values fits your requirement.
For example, you can prepend some character or string if the value starts with an uppercase letter:
CREATE OR REPLACE FUNCTION mysort(text) RETURNS text IMMUTABLE AS $$
  SELECT CASE WHEN substring($1 from 1 for 1) =
              upper(substring($1 from 1 for 1))
         THEN 'AAAA' || $1 ELSE $1 END;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolproof (you might want to change 'AAAA' to something more apt) and hurts performance, of course.
If you want it efficient, you'll need to create another column that "naturally" sorts correctly (e.g. even in the C locale), and use that as a sorting criterion. For that, you should use the approach of the strxfrm C library function. As a straight-forward strxfrm table for your approach, replace each letter with two ASCII letters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and the regular ASCII sorting will sort it correctly.
If you don't want to create a separate column, you can try to use a function in the ORDER BY clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).
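A minimal sketch of such a function, using a hypothetical name pl_sort_key and a placeholder table mytable, and handling only a few lowercase Polish letters for illustration (a complete mapping would cover the whole alphabet plus the capital-letters-first rule):
CREATE OR REPLACE FUNCTION pl_sort_key(text) RETURNS text IMMUTABLE AS $$
  -- Map each letter to two ASCII characters so that byte-wise (C locale)
  -- ordering of the result matches the desired Polish ordering.
  SELECT replace(replace(replace(replace(replace($1,
           's', 's0'), 'ś', 's1'),
           'z', 'z0'), 'ź', 'z1'), 'ż', 'z2');
$$ LANGUAGE SQL;
SELECT * FROM mytable ORDER BY pl_sort_key(name);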

Query Performance with NULL

I would like to know about how NULL values affect query performance in SQL Server 2005.
I have a table similar to this (simplified):
ID | ImportantData | QuickPickOrder
---+---------------+---------------
 1 | 'Some Text'   | NULL
 2 | 'Other Text'  | 3
 3 | 'abcdefg'     | NULL
 4 | 'whatever'    | 4
 5 | 'it is'       | 2
 6 | 'technically' | NULL
 7 | 'a varchar'   | NULL
 8 | 'of course'   | 1
 9 | 'but that'    | NULL
10 | 'is not'      | NULL
11 | 'important'   | 5
And I'm doing a query on it like this:
SELECT *
FROM MyTable
WHERE QuickPickOrder IS NOT NULL
ORDER BY QuickPickOrder
So the QuickPickOrder is basically a column used to single out some commonly chosen items from a larger list. It also provides the order in which they will appear to the user. NULL values mean that it doesn't show up in the quick pick list.
I've always been told that NULL values in a database are somehow evil, at least from a normalization perspective, but is it an acceptable way to filter out unwanted rows in a WHERE constraint?
Would it be better to use specific number value, like -1 or 0, to indicate items that aren't wanted? Are there other alternatives?
EDIT:
The example does not accurately represent the ratio of real values to NULLs. A better example might show at least 10 NULLs for every non-NULL. The table size might be 100 to 200 rows. It is a reference table, so updates are rare.
SQL Server indexes NULL values, so this will most probably just use the Index Seek over an index on QuickPickOrder, both for filtering and for ordering.
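For that to work the column needs to be indexed; a minimal sketch (the index name is made up):
CREATE INDEX IX_MyTable_QuickPickOrder ON MyTable (QuickPickOrder);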
Another alternative would be two tables:
MyTable:
ID | ImportantData
---+---------------
 1 | 'Some Text'
 2 | 'Other Text'
 3 | 'abcdefg'
 4 | 'whatever'
 5 | 'it is'
 6 | 'technically'
 7 | 'a varchar'
 8 | 'of course'
 9 | 'but that'
10 | 'is not'
11 | 'important'
QuickPicks:
MyTableID | QuickPickOrder
----------+---------------
        2 | 3
        4 | 4
        5 | 2
        8 | 1
       11 | 5
SELECT MyTable.*
FROM MyTable JOIN QuickPicks ON QuickPicks.MyTableID = MyTable.ID
ORDER BY QuickPickOrder
This would allow updating QuickPickOrder without locking anything in MyTable or logging a full row transaction for that table. So depending how big MyTable is, and how often you are updating QuickPickOrder, there may be a scalability advantage.
Also, having a separate table will allow you to add a unique index on QuickPickOrder to ensure no duplication, and could be more easily scaled later to allow different kinds of QuickPicks, having them specific to certain contexts or users, etc.
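A sketch of what that table might look like (names and types assumed from the examples above):
CREATE TABLE QuickPicks (
    MyTableID      int NOT NULL REFERENCES MyTable (ID),
    QuickPickOrder int NOT NULL UNIQUE
);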
NULLs do not have a negative performance hit on the database. Remember, NULL is more of a state than a value. Checking for NOT NULL vs. setting that value to -1 makes no difference, other than that the -1 is probably breaking your data integrity, imo.
SQL Server's performance can be affected by using NULLS in your database. There are several reasons for this.
First, NULLS that appear in fixed length columns (CHAR) take up the entire size of the column. So if you have a column that is 25 characters wide, and a NULL is stored in it, then SQL Server must store 25 characters to represent the NULL value. This added space increases the size of your database, which in turn means that it takes more I/O overhead to find the data you are looking for. Of course, one way around this is to use variable length fields instead. When NULLs are added to a variable length column, space is not unnecessarily wasted as it is with fixed length columns.
Second, use of the IS NULL clause in your WHERE clause means that an index cannot be used for the query, and a table scan will be performed. This can greatly reduce performance.
Third, the use of NULLS can lead to convoluted Transact-SQL code, which can mean code that doesn't run efficiently or that is buggy.
Ideally, NULLs should be avoided in your SQL Server databases.
Instead of using NULLs, use a coding scheme similar to this in your databases:
NA: Not applicable
NYN: Not yet known
TUN: Truly unknown
Such a scheme provides the benefits of using NULLs, but without the drawbacks.
NULL looks fine to me for this purpose. Performance is likely to be basically the same as with a non-null column and constant value, or maybe even better for filtering out all NULLs.
The alternative is to normalize QuickPickOrder into a table with a foreign key, and then perform an inner join to filter the nulls out (or a left join with a where clause to filter the non-nulls out).
NULL looks good to me as well. SQL Server has many kinds of indices to choose from. I forget which ones do this, but some only index values in a given range. If you had that kind of index on the column being tested, the NULL valued records would not be in the index, and the index scan would be fast.
Having a lot of NULLs in a column which has an index on it (or starting with it) is generally beneficial to this kind of query.
NULL values are not entered into the index, which means that inserting / updating rows with NULL in there doesn't take the performance hit of having to update another secondary index. If, say, only 0.001% of your rows have a non-null value in that column, the IS NOT NULL query becomes pretty efficient as it just scans a relatively small index.
Of course all of this is relative, if your table is tiny anyway, it makes no appreciable difference.