Why does Postgres return unordered data from a SELECT query after updating a row?

I am a bit confused about the default ordering of the rows returned by Postgres.
postgres=# select * from check_user;
 id | name
----+------
  1 | x
  2 | y
  3 | z
  4 | a
  5 | c1\
  6 | c2
  7 | c3
(7 rows)
postgres=# update check_user set name = 'c1' where name = 'c1\';
UPDATE 1
postgres=# select * from check_user;
 id | name
----+------
  1 | x
  2 | y
  3 | z
  4 | a
  6 | c2
  7 | c3
  5 | c1
(7 rows)
Before the update, it was returning rows ordered by id, but after the update the order has changed. So my question is: if ORDER BY is not specified, what default ordering does Postgres use?
Thanks in advance.

Put very simply, the "default order" is whatever order the rows happen to be read from disk. Updating a row does not change it in place; usually Postgres marks the old row as deleted and writes a new one.
When Postgres reads rows from pages in memory, it will (probably) read them in the order they are stored on the page. It reads pages in whatever order it thinks is quickest (which may or may not be how they appear on disk), and that can change based on whether or not it decides to use an index. So the order can suddenly change without your app asking for anything different.
If you don't specify an order by, it will not take any action to re-order them.
NEVER rely on the default order. It is undefined behaviour.
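One way to see what happened in this particular case (purely as an illustration, not something to rely on) is Postgres's ctid system column, which exposes the physical location of the current row version:
select ctid, id, name from check_user;
-- after the UPDATE, the row with id = 5 has a new ctid, because the update
-- wrote a new row version elsewhere instead of changing the old one in place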

SQL tables represent unordered sets.
SQL result sets are unordered unless you explicitly include an order by.
Your select has no order by. Hence, the rows can come back in any order. Even running the same query twice can produce different orders.
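If a stable order matters, state it explicitly; for the example above:
select * from check_user order by id;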

Related

How to vertically flip the column values top to down in SQL

How do I vertically flip the column values from top to bottom in SQL when they are not in any specific order (neither ascending nor descending)?
Example: Table named 'Country' has a single column c1 with values
| C1    |
---------
| JAPAN |
| NEPAL |
| INDIA |
---------
Now I want to display the Result-set as below:
| C1    |
---------
| INDIA |
| NEPAL |
| JAPAN |
---------
Can anyone kindly suggest a way?
It's not possible using standard SQL (at least before SQL:2003) because, when not explicitly ordered using ORDER BY, a result set is an "unordered set".
This means that the order in which you get your results is essentially arbitrary and may vary from one call of a given query to the next call of the same query. Most of the time you'll simply get them in the order they were inserted into the database, but this is not guaranteed. With more advanced RDBMSs you may, for instance, get your entries sorted according to the last ORDER BY you ran before.
That's also why you don't get any row number when querying a table. Because it just doesn't make sense.
However, since SQL:2003, OLAP (window) functions have been available, including ROW_NUMBER(). For the reasons explained above, it has to be "windowed" using an OVER clause, which breaks the initial order. But you can still lean on a subquery to fetch all rows, adding alongside them a column that never varies:
WITH subquery(c1,c2) AS (SELECT C1,1 AS C2 FROM yourtable)
SELECT c1
FROM subquery
ORDER BY row_number() OVER (PARTITION BY c2) DESC
Don't use RANK() as this will return the same "ex-æquo" row number if two rows are identical.

Incremental integer ID in Impala

I am using Impala to query Parquet tables and cannot find a way to generate an incrementing integer column ranging from 1..n. The column is supposed to be used as an ID reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me since I have to pass the ID to another system which expects an ID in the style of 1..n. I also already know that Impala has no auto-increment implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id` column.
| my_id | example_uuid  | some_content |
|-------|---------------|--------------|
| 1     | 50d53ca4-b... | "a"          |
| 2     | 6ba8dd54-1... | "b"          |
| 3     | 515362df-f... | "c"          |
| 4     | a52db5e9-e... | "d"          |
|-------|---------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.
Since other Q&As like this one only came up with uuid()-style answers, I put some thought into it and finally came up with this solution:
SELECT
    row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") AS my_id
  , some_content
FROM some_table
row_number() generates a continuous integer numbering over a provided partition. Unlike rank(), row_number() always increments the number within its partition, even if duplicates occur.
PARTITION BY "dummy" puts the entire table into a single partition. This works because "dummy" is interpreted in the execution graph as a temporary column yielding only the string value "dummy". Anything analogous to "dummy" works as well.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just use your respective column), the "dummy" workaround is used there too.
The command creates the desired incremental ID without any nested SQL statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1     | "a"          |
| 2     | "b"          |
| 3     | "c"          |
| 4     | "d"          |
|-------|--------------|
I used Markus's answer for a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no partition by:
SELECT
    row_number() OVER (ORDER BY actual_column) AS my_id
  , some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

For SQL performance, several equals or one BETWEEN?

For a new development, I will have a big SQL table (~100M rows).
4 fields will be used to query the data.
Is it better to query one combined field with BETWEEN, or several equality conditions?
Example:
MainTable
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4
   1 | test  |     1 |     4 |     3 |     1
Data in the FK tables is static, for example:
FkTable1
Id | Value
 1 | a
 2 | b
 3 | c
To query the data, the classic SQL query is:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1=1 and FkId2=2 and FkId3 in(2, 3)
The idea to optimize performance is to add one field, "UniqueId", calculated in the backend before the insert:
UniqueId = FkId1*1000000 + FkId2*10000 + FkId3*100 + FkId4
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4 | UniqueId
   1 | test  |     1 |     4 |     3 |     1 |  1040301
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where UniqueId between 1020200 and 1040000
Moreover, with the UniqueId field, an index on this field only will be sufficient.
What do you think ?
Thanks
For this query:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1 = 1 and FkId2 = 2 and FkId3 in (2, 3)
The optimal index is on MainTable(FkId1, FkId2, FkId3). You can also add Label and FkId4 to the index if you want a covering index (so the index can handle the entire query without referring to the original data pages).
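Concretely, that would look something like the following (index names are made up; the INCLUDE syntax is SQL Server / Postgres 11+, on other engines simply append the columns to the index key):
CREATE INDEX ix_maintable_fks ON MainTable (FkId1, FkId2, FkId3);
-- covering variant: the query above can be answered from the index alone
CREATE INDEX ix_maintable_fks_covering ON MainTable (FkId1, FkId2, FkId3) INCLUDE (Label, FkId4);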
There is no need for a computed field for the example you provided.
Since you will have 100M rows, thinking about optimisations from the start seems sensible to me.
However, your proposed solution only works under certain conditions:
Each FK needs its own factor in the formula, i.e. a different power of 10 (using the same factor, e.g. 10000, twice would make the encoding ambiguous).
Your select example has an IN clause (FkId3 in (2, 3)). This will only work if just one of the FKs is queried this way, and that FK should be the one with no factor in the formula for computing UniqueId (i.e. the one that supplies the least significant digits of UniqueId).
Now, seeing Gordon's answer, I agree with him, i.e. using a combined index may be good enough for you (though your solution would probably be slightly better). However, the combined index has a similar restriction: the FK field being queried with the IN clause should be the last field in the index.
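For illustration, assuming each FK id stays below 100 (two decimal digits per component; an assumption, not something stated in the question), the example query would collapse to a single range scan:
-- FkId1 = 1, FkId2 = 2, FkId3 IN (2, 3), FkId4 anywhere from 0 to 99
SELECT Label, FkId1, FkId2, FkId3, FkId4
FROM MainTable
WHERE UniqueId BETWEEN 1020200 AND 1020399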

Pulling items out of a DB with weighted chance

Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?
The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?
One way I can think of is to create another column in the table which is a rolling sum of your weights, then pull your records by generating a random number between 0 and the total of all your weights, and pulling the row with the lowest rolling sum that is greater than the random number.
For example, if you had four rows with the following weights:
+-----+--------+------------+
| row | weight | rollingsum |
+-----+--------+------------+
| a   |      3 |          3 |
| b   |      3 |          6 |
| c   |      4 |         10 |
| d   |      1 |         11 |
+-----+--------+------------+
Then, choose a random number n with 0 <= n < 11, and return row a if 0 <= n < 3, b if 3 <= n < 6, and so on.
Here are some links on generating rolling sums:
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html
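A minimal sketch of that idea, computing the rolling sum on the fly with a window function rather than a stored column (the items table and its id/weight columns are made-up names; this needs an engine with window functions such as MySQL 8+, Postgres, or SQL Server, and RAND() is the MySQL spelling, Postgres calls it random()):
SELECT t.id
FROM (
    SELECT id,
           SUM(weight) OVER (ORDER BY id) AS rollingsum
    FROM items
) t
CROSS JOIN (
    -- draw one random number between 0 and the total weight
    SELECT RAND() * SUM(weight) AS n
    FROM items
) draw
WHERE t.rollingsum > draw.n
ORDER BY t.rollingsum
LIMIT 1;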
I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.
I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:
RowSource
---------
RowID
UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier
You could write a query like this (SQL Server specific):
SELECT TOP 100 rs.RowId, urp.FrequencyMultiplier
FROM RowSource rs
LEFT JOIN UserRowProbability urp ON rs.RowId = urp.RowId
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()
This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
Start with three tables: users, data, and user-data. The user-data table records which rows should be preferred for each user.
Then create one view based on the data rows that are preferred by the user.
Create a second view that has the non-preferred data.
Create a third view which is a union of the first two. The union should select more rows from the preferred data.
Then finally select random rows from the third view.
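A rough sketch of that layout (the names are adapted to valid identifiers, e.g. user_data, and the per-user filter is left out for brevity); repeating the preferred rows in the union makes them come up more often:
CREATE VIEW preferred_data AS
    SELECT d.*
    FROM data d
    JOIN user_data ud ON ud.data_id = d.id;

CREATE VIEW other_data AS
    SELECT d.*
    FROM data d
    WHERE NOT EXISTS (SELECT 1 FROM user_data ud WHERE ud.data_id = d.id);

CREATE VIEW combined_data AS
    SELECT * FROM preferred_data
    UNION ALL
    SELECT * FROM preferred_data   -- listed twice, so preferred rows are drawn more often
    UNION ALL
    SELECT * FROM other_data;
-- then pick random rows from combined_data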

Query Performance with NULL

I would like to know about how NULL values affect query performance in SQL Server 2005.
I have a table similar to this (simplified):
ID | ImportantData | QuickPickOrder
---+---------------+---------------
 1 | 'Some Text'   | NULL
 2 | 'Other Text'  | 3
 3 | 'abcdefg'     | NULL
 4 | 'whatever'    | 4
 5 | 'it is'       | 2
 6 | 'technically' | NULL
 7 | 'a varchar'   | NULL
 8 | 'of course'   | 1
 9 | 'but that'    | NULL
10 | 'is not'      | NULL
11 | 'important'   | 5
And I'm doing a query on it like this:
SELECT *
FROM MyTable
WHERE QuickPickOrder IS NOT NULL
ORDER BY QuickPickOrder
So the QuickPickOrder is basically a column used to single out some commonly chosen items from a larger list. It also provides the order in which they will appear to the user. NULL values mean that it doesn't show up in the quick pick list.
I've always been told that NULL values in a database are somehow evil, at least from a normalization perspective, but is it an acceptable way to filter out unwanted rows in a WHERE constraint?
Would it be better to use a specific number value, like -1 or 0, to indicate items that aren't wanted? Are there other alternatives?
EDIT:
The example does not accurately represent the ratio of real values to NULLs. A better example might show at least 10 NULLs for every non-NULL value. The table size might be 100 to 200 rows. It is a reference table, so updates are rare.
SQL Server indexes NULL values, so this will most probably just use an Index Seek over an index on QuickPickOrder, both for filtering and for ordering.
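That index would be something along these lines (a sketch; the index name is made up):
CREATE INDEX IX_MyTable_QuickPickOrder ON MyTable (QuickPickOrder);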
Another alternative would be two tables:
MyTable:
ID | ImportantData
---+--------------
 1 | 'Some Text'
 2 | 'Other Text'
 3 | 'abcdefg'
 4 | 'whatever'
 5 | 'it is'
 6 | 'technically'
 7 | 'a varchar'
 8 | 'of course'
 9 | 'but that'
10 | 'is not'
11 | 'important'
QuickPicks:
MyTableID | QuickPickOrder
----------+---------------
        2 | 3
        4 | 4
        5 | 2
        8 | 1
       11 | 5
SELECT MyTable.*
FROM MyTable JOIN QuickPicks ON QuickPicks.MyTableID = MyTable.ID
ORDER BY QuickPicks.QuickPickOrder
This would allow updating QuickPickOrder without locking anything in MyTable or logging a full row transaction for that table. So depending how big MyTable is, and how often you are updating QuickPickOrder, there may be a scalability advantage.
Also, having a separate table will allow you to add a unique index on QuickPickOrder to ensure no duplication, and could be more easily scaled later to allow different kinds of QuickPicks, having them specific to certain contexts or users, etc.
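For instance (a sketch; the index name is made up):
CREATE UNIQUE INDEX UX_QuickPicks_Order ON QuickPicks (QuickPickOrder);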
NULLs do not impose a performance hit on the database. Remember, NULL is more of a state than a value. Checking for NOT NULL vs. setting that value to -1 makes no difference, other than that the -1 is probably breaking your data integrity, IMO.
SQL Server's performance can be affected by using NULLS in your database. There are several reasons for this.
First, NULLS that appear in fixed length columns (CHAR) take up the entire size of the column. So if you have a column that is 25 characters wide, and a NULL is stored in it, then SQL Server must store 25 characters to represent the NULL value. This added space increases the size of your database, which in turn means that it takes more I/O overhead to find the data you are looking for. Of course, one way around this is to use variable length fields instead. When NULLs are added to a variable length column, space is not unnecessarily wasted as it is with fixed length columns.
Second, use of the IS NULL clause in your WHERE clause means that an index cannot be used for the query, and a table scan will be performed. This can greatly reduce performance.
Third, the use of NULLS can lead to convoluted Transact-SQL code, which can mean code that doesn't run efficiently or that is buggy.
Ideally, NULLs should be avoided in your SQL Server databases.
Instead of using NULLs, use a coding scheme similar to this in your databases:
NA: Not applicable
NYN: Not yet known
TUN: Truly unknown
Such a scheme provides the benefits of using NULLs, but without the drawbacks.
NULL looks fine to me for this purpose. Performance is likely to be basically the same as with a non-null column and constant value, or maybe even better for filtering out all NULLs.
The alternative is to normalize QuickPickOrder into a table with a foreign key, and then perform an inner join to filter the nulls out (or a left join with a where clause to filter the non-nulls out).
NULL looks good to me as well. SQL Server has many kinds of indices to choose from. I forget which ones do this, but some only index values in a given range. If you had that kind of index on the column being tested, the NULL valued records would not be in the index, and the index scan would be fast.
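The kind of index being alluded to is probably a filtered index, which only arrived in SQL Server 2008 (so not on the 2005 instance in the question); a sketch of what it would look like there:
CREATE INDEX IX_MyTable_QuickPick_NotNull
ON MyTable (QuickPickOrder)
WHERE QuickPickOrder IS NOT NULL;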
Having a lot of NULLs in a column which has an index on it (or starting with it) is generally beneficial to this kind of query.
NULL values are not entered into the index, which means that inserting / updating rows with NULL in there doesn't take the performance hit of having to update another secondary index. If, say, only 0.001% of your rows have a non-null value in that column, the IS NOT NULL query becomes pretty efficient as it just scans a relatively small index.
Of course, all of this is relative; if your table is tiny anyway, it makes no appreciable difference.