COALESCE in PostgreSQL conditional displaying seemingly undocumented behavior?

I have looked at the COALESCE documentation and it mentions the typical case of using COALESCE to make default/situational parameters, e.g.
COALESCE(discount, 5)
which evaluates to 5 if discount is null.
However, I have seen it used where COALESCE actually evaluated all the arguments, despite the documentation explicitly saying it stops evaluating arguments after the first non-null argument.
Here is an example similar to what I encountered, say you have a table like this:
id | wind | rain | snow
1 | null | 2 | 3
2 | 5 | null | 6
3 | null | 7 | 2
Then you run
SELECT *
FROM weather_table
WHERE COALESCE(wind, rain, snow) >= 5;
You would expect this to only select rows with wind >= 5, right? NO! It selects all rows where either wind, rain or snow is >= 5, which in this case is 2 rows, specifically these two:
2 | 5 | null | 6
3 | null | 7 | 2
Honestly, pretty cool functionality, but it really irks me that I couldn't find any example of this online or in the documentation.
Can anyone tell me what's going on? Am I missing something?

You would expect this to only select rows with wind >= 5, right?
No, I expect it to select rows based on what the COALESCE function returns.
The COALESCE function delivers the value of the first non-null parameter. You had COALESCE(wind, rain, snow). The first row had (null, 2, 3), so COALESCE returned 2. The second row had (5, null, 6), so it returned 5. The third row had (null, 7, 2), so it returned 7.
The last two rows meet the condition >=5, so 2 rows are retrieved.
Notice that the value for snow was never returned in your example, because either wind or rain always had a value.
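A minimal sketch that reproduces this, with the table and data taken straight from the question:
CREATE TABLE weather_table (id int, wind int, rain int, snow int);
INSERT INTO weather_table VALUES
  (1, NULL, 2, 3),
  (2, 5, NULL, 6),
  (3, NULL, 7, 2);
-- COALESCE is evaluated once per row: it returns 2, 5 and 7 respectively,
-- so only the last two rows satisfy the >= 5 filter.
SELECT * FROM weather_table WHERE COALESCE(wind, rain, snow) >= 5;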

After writing out the question so clearly, I realized what was going on myself. But I want to answer it here in case anyone else is confused.
It turns out the reason is that the COALESCE function is run once for each row, which I suppose I could have known. Then it all makes sense.
For each row it checks: is wind non-null? If so, the comparison uses wind (so the row is kept only if wind >= 5); if wind is null, it checks rain instead, and so on.
Notably though, if my table had been like this:
id | wind | rain | snow
1 | 0 | 2 | 3
2 | 5 | 0 | 6
3 | 0 | 7 | 2
The query would have worked the way I originally thought (making COALESCE completely useless here), picking only that one row
2 | 5 | 0 | 6
which is equivalent to SELECT * FROM weather_table WHERE wind >= 5.
It only makes a difference if there are columns which are null (0 <> null).
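A sketch of that zero-filled variant (same data as the table above, under a throwaway table name so it can run alongside the first sketch): with no NULLs, COALESCE always returns wind, so the filter degenerates to wind >= 5.
CREATE TABLE weather_table_zero (id int, wind int, rain int, snow int);
INSERT INTO weather_table_zero VALUES
  (1, 0, 2, 3),
  (2, 5, 0, 6),
  (3, 0, 7, 2);
-- Returns only the row with id = 2, exactly like WHERE wind >= 5.
SELECT * FROM weather_table_zero WHERE COALESCE(wind, rain, snow) >= 5;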


Understanding window function frame with RANGE mode

SELECT sum(unique1) OVER () AS total,
       sum(unique1) OVER (PARTITION BY four ORDER BY unique1
                          RANGE BETWEEN 5::int8 PRECEDING AND 6::int2 FOLLOWING),
       unique1,
       four
FROM tenk1
WHERE unique1 < 10;
returns:
total | sum | unique1 | four
-------+-----+---------+------
45 | 4 | 0 | 0
45 | 12 | 4 | 0
45 | 12 | 8 | 0
45 | 6 | 1 | 1
45 | 15 | 5 | 1
45 | 14 | 9 | 1
45 | 8 | 2 | 2
45 | 8 | 6 | 2
45 | 10 | 3 | 3
45 | 10 | 7 | 3
(10 rows)
This is a minor change based on this example.
Since PARTITION BY four makes each partition only 2 or 3 rows, I thought that with a frame of 5 PRECEDING AND 6 FOLLOWING the ROWS/RANGE distinction wouldn't matter in this case: I expected ROWS and RANGE to return the same result, because 5 preceding and 6 following easily covers the 2 or 3 rows in each partition.
However, it does matter. I think I do understand the same query with ROWS instead of RANGE.
Quote from manual:
In RANGE or GROUPS mode, a frame_start of CURRENT ROW means the frame
starts with the current row's first peer row (a row that the window's
ORDER BY clause sorts as equivalent to the current row), while a
frame_end of CURRENT ROW means the frame ends with the current row's
last peer row. In ROWS mode, CURRENT ROW simply means the current row.
Question: How to interpret
partition by four order by unique1 range between 5::int8 preceding and 6::int2 following
The documentation states:
In ROWS mode, the offset must yield a non-null, non-negative integer, and the option means that the frame starts or ends the specified number of rows before or after the current row.
[...]
In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column.
(The emphasis is mine.)
So with ROWS, you will get the 5 rows before and the 6 rows after the current row. With RANGE, you will get those rows where unique1 is no more than 5 less or 6 more than the unique1 of the current row.
In your example, if you consider the first row, ROWS BETWEEN ... AND 6 FOLLOWING would include the third row, but RANGE BETWEEN ... AND 6 FOLLOWING would not, because the difference between 8 (the value of unique1 in the third row) and 0 is greater than 6.
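To make the difference concrete, here is a small sketch using a made-up three-row table t (not the regression-test table tenk1) holding the values 0, 4, 8, i.e. the four = 0 partition from the output above:
CREATE TABLE t (x int);
INSERT INTO t VALUES (0), (4), (8);
SELECT x,
       sum(x) OVER (ORDER BY x ROWS  BETWEEN 5 PRECEDING AND 6 FOLLOWING) AS sum_rows,
       sum(x) OVER (ORDER BY x RANGE BETWEEN 5 PRECEDING AND 6 FOLLOWING) AS sum_range
FROM t;
-- For x = 0, ROWS covers all three rows (sum 12), but RANGE only covers
-- x = 0 and x = 4 (sum 4), because 8 - 0 is greater than the offset 6.
The sum_range column reproduces the 4, 12, 12 seen for the four = 0 partition in the question's output, while ROWS gives 12 for every row.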

What is the expected behaviour for multiple set-returning functions in SELECT clause?

I'm trying to get a "cross join" with the result of two set-returning functions, but in some cases I don't get the "cross join", see example
Behaviour 1: When set lengths are the same, it matches item by item from each set
postgres=# SELECT generate_series(1,3), generate_series(5,7) order by 1,2;
generate_series | generate_series
-----------------+-----------------
1 | 5
2 | 6
3 | 7
(3 rows)
Behaviour 2: When set lengths are different, it "cross join"s the sets
postgres=# SELECT generate_series(1,2), generate_series(5,7) order by 1,2;
generate_series | generate_series
-----------------+-----------------
1 | 5
1 | 6
1 | 7
2 | 5
2 | 6
2 | 7
(6 rows)
I think I'm not understanding something here, can someone explain the expected behaviour?
Another example, even weirder:
postgres=# SELECT generate_series(1,2) x, generate_series(1,4) y order by x,y;
x | y
---+---
1 | 1
1 | 3
2 | 2
2 | 4
(4 rows)
I am looking for an answer to the question in the title, ideally with link(s) to documentation.
Postgres 10 or newer
pads with null values for smaller set(s). Demo with generate_series():
SELECT generate_series( 1, 2) AS row2
, generate_series(11, 13) AS row3
, generate_series(21, 24) AS row4;
row2 | row3 | row4
-----+------+-----
1 | 11 | 21
2 | 12 | 22
null | 13 | 23
null | null | 24
dbfiddle here
The manual for Postgres 10:
If there is more than one set-returning function in the query's select
list, the behavior is similar to what you get from putting the
functions into a single LATERAL ROWS FROM( ... ) FROM-clause item. For
each row from the underlying query, there is an output row using the
first result from each function, then an output row using the second
result, and so on. If some of the set-returning functions produce
fewer outputs than others, null values are substituted for the missing
data, so that the total number of rows emitted for one underlying row
is the same as for the set-returning function that produced the most
outputs. Thus the set-returning functions run “in lockstep” until they
are all exhausted, and then execution continues with the next
underlying row.
This ends the traditionally odd behavior.
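To see what the quoted lockstep behavior looks like when written out explicitly, here is a sketch of the ROWS FROM rewrite for the demo above (LATERAL is not needed here because there is no other FROM item); on Postgres 10 or newer it produces the same null-padded result:
SELECT row2, row3, row4
FROM ROWS FROM (
       generate_series( 1,  2),
       generate_series(11, 13),
       generate_series(21, 24)
     ) AS t(row2, row3, row4);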
Some other details changed with this rewrite. The release notes:
Change the implementation of set-returning functions appearing in a query's SELECT list (Andres Freund)
Set-returning functions are now evaluated before evaluation of scalar
expressions in the SELECT list, much as though they had been placed
in a LATERAL FROM-clause item. This allows saner semantics for cases
where multiple set-returning functions are present. If they return
different numbers of rows, the shorter results are extended to match
the longest result by adding nulls. Previously the results were cycled
until they all terminated at the same time, producing a number of rows
equal to the least common multiple of the functions' periods. In
addition, set-returning functions are now disallowed within CASE and
COALESCE constructs. For more information see Section 37.4.8.
Bold emphasis mine.
Postgres 9.6 or older
The number of result rows (somewhat surprisingly!) is the lowest common multiple of all sets in the same SELECT list. (It only acts like a CROSS JOIN if the set sizes are pairwise coprime!) Demo:
SELECT generate_series( 1, 2) AS row2
, generate_series(11, 13) AS row3
, generate_series(21, 24) AS row4;
row2 | row3 | row4
-----+------+-----
1 | 11 | 21
2 | 12 | 22
1 | 13 | 23
2 | 11 | 24
1 | 12 | 21
2 | 13 | 22
1 | 11 | 23
2 | 12 | 24
1 | 13 | 21
2 | 11 | 22
1 | 12 | 23
2 | 13 | 24
dbfiddle here
Documented in the manual for Postgres 9.6 in the chapter SQL Functions Returning Sets, along with the recommendation to avoid it:
Note: The key problem with using set-returning functions in the select
list, rather than the FROM clause, is that putting more than one
set-returning function in the same select list does not behave very
sensibly. (What you actually get if you do so is a number of output
rows equal to the least common multiple of the numbers of rows
produced by each set-returning function.) The LATERAL syntax produces
less surprising results when calling multiple set-returning functions,
and should usually be used instead.
Bold emphasis mine.
A single set-returning function is OK (but still cleaner in the FROM list), but multiple in the same SELECT list is discouraged now. This was a useful feature before we had LATERAL joins. Now it's merely historical ballast.
Related:
Parallel unnest() and sort order in PostgreSQL
Unnest multiple arrays in parallel
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
I cannot find any documentation for this. However, I can describe the behavior that I observe.
The set-generating functions each return a finite number of rows. Postgres seems to run the set-generating functions until all of them are on their last row -- or, more likely, it stops when all of them are back at their first rows. Technically, this means the number of result rows is the least common multiple (LCM) of the series lengths.
I'm not sure why this is the case. And, as I say in a comment, I think it is generally better to put the functions in the FROM clause.
There is only one note about the issue in the documentation. I'm not sure whether this explains the described behavior or not. Perhaps more important is that such function usage is deprecated:
Currently, functions returning sets can also be called in the select list of a query. For each row that the query generates by itself, the function returning set is invoked, and an output row is generated for each element of the function's result set. Note, however, that this capability is deprecated and might be removed in future releases.

SQL: What is a value?

The Question
One thing that I am confused about is the technical definition of possibly the most basic component of a database: a single value.
Some Examples
I understand and follow (at a minimum) the first three normal forms of database normalization - or so I think. That said, with the introduction of RANGE in PostgreSQL 9.2 I started thinking about what makes a single value.
From the docs:
Range types are useful because they represent many element values in a single range value
So, what are you? Several values, or a single value... nothingness... 42?
Why does this matter?
Because it speaks directly to the Second Normal Form:
Create separate tables for sets of values that apply to multiple records.
Relate these tables with a foreign key.
#1 Ranges
For example, in Postgres 9.1 I had some tables structured like this:
"SomeSchema"."StatusType"
"StatusTypeID" | "StatusType"
--------------------|----------------
1 | Start
2 | Stop
"SomeSchema"."Statuses"
"StatusID" | "Identifier" | "StatusType" | "Value" | "Timestamp"
---------------|----------------|----------------|---------|---------------------
1 | 1 | 1 | 0 | 2000-01-01 00:00:00
2 | 1 | 2 | 5 | 2000-01-02 12:00:00
3 | 2 | 1 | 1 | 2000-01-01 00:00:00
4 | 3 | 1 | 2 | 2000-01-01 00:00:00
5 | 2 | 2 | 7 | 2000-01-01 18:30:00
6 | 1 | 2 | 3 | 2000-01-02 12:00:00
This enabled me to keep an historical record of how things were configured at any given point in time.
This structure takes the position that the data in the "Value" column were all separate values.
Now, in Postgres 9.2 if I do the same thing with a RANGE value it would look like this:
"SomeSchema"."Statuses"
"StatusID" | "Identifier" | "Value" | "Timestamp"
---------------|----------------|-------------|---------------------
1 | 1 | (0, NULL) | 2000-01-01 00:00:00
2 | 1 | (0, 5) | 2000-01-02 12:00:00
3 | 2 | (1, NULL) | 2000-01-01 00:00:00
4 | 3 | (2, NULL) | 2000-01-01 00:00:00
5 | 2 | (1, 7) | 2000-01-01 18:30:00
6 | 1 | (0, 3) | 2000-01-02 12:00:00
Again, this structure would enable me to keep an historical record of how things were configured, but I would be storing the same value several times in separate places. It makes updating (technically inserting a new record) more tricky because I have to make sure the data rolls over from the original record.
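A short sketch of how that range is handled as one value (table and column names here are simplified stand-ins, not the quoted-identifier names above): the whole range sits in a single column of a range type and is compared as a unit, for example with the "contains element" operator @>.
CREATE TABLE statuses (
    status_id  serial PRIMARY KEY,
    identifier int NOT NULL,
    value      int4range NOT NULL,
    ts         timestamp NOT NULL
);
INSERT INTO statuses (identifier, value, ts) VALUES
    (1, int4range(0, NULL), '2000-01-01 00:00:00'),
    (1, int4range(0, 5),    '2000-01-02 12:00:00');
-- Which status rows cover the value 3?
SELECT * FROM statuses WHERE value @> 3;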
#2 Arrays
Arrays have been around for a long time, and while they can be abused, I tend to use them for things like color codes. For example, my project stores information and at times needs to know how to display it. I could create three columns to store red, green, and blue values; but that just seems silly. When would I ever create a foreign key (or even just filter) based on one of the given color codes?
When I created the field it was from the perspective that I needed to store a color in a neutral format so that I could feed anything that accepts a color value. I made the column an array and filled it with the appropriate codes to make the color I want.
#3 PostGIS: Geometry & Geography
When storing a polygon in PostGIS, it stores all the points that make the boundary in a single field. If one point were to change and I wanted to keep an historical record, I would have to store all of the points that have not changed twice in order to store the new polygon along with the old.
So, what is a value? And... if RANGE, ARRAY, and GEOGRAPHY are values, do they really break the second normal form?
The fact that some operation can derive new values from X that appear to be components of X's value doesn't mean X itself isn't "single valued". Thus "range" values and "geography" values should be single values as far as the DBMS's type system is concerned. I don't know enough about PostgreSQL's implementation to know whether "arrays" can be considered as single values in themselves. SQL DBMSs like PostgreSQL are not truly relational DBMSs, and SQL supports various structures that certainly aren't proper relation variables, values or types (pointers, nulls and other exotica).
This is a difficult and sometimes controversial topic however. If you haven't read it then I recommend the book Databases, Types, and the Relational Model - The Third Manifesto by Date and Darwen. It addresses exactly the kind of questions you are asking about.
I don't like your description of 2NF but it's not very relevant here.

Pairwise testing: How to create the table?

Hello, I have a doubt regarding how to create the table for pairwise testing.
For example, if I have three parameters which can each take two different values, how do I create a table of inputs with all possible combinations? Would it look something like this?
| 1 2 3
-----------
1 | 1 1 1
2 | 1 2 2
3 | 1 1 2
4 | 1 2 1
Does each parameter correspond to a column?
However, since I have 3 parameters, each of which can take 2 different values, shouldn't the number of test cases be 2^3?
There's a good article with links to some useful tools here:
http://blog.josephwilk.net/ruby/pairwise-testing-with-cucumber.html
For the parameters: each column is a parameter, and each row is a possible combination. Here is the table:
| 1 2 3
-----------
1 | 1 1 1
2 | 2 1 1
3 | 1 2 1
4 | 1 1 2
5 | 2 2 1
6 | 2 1 2
7 | 1 2 2
8 | 2 2 2
so 2^3=8 possible combinations as you can see :)
For the values: each column is a value, and each row is a possible combination:
| 1 2
--------
1 | 1 1
2 | 2 1
3 | 1 2
4 | 2 2
These are 2^2=4 possible combinations. Hope it helps.
1) Please note that pairwise testing is not about exhaustively scanning all possible combinations of values of all parameters. Firstly, such a scan would give you an enormous number of test cases, so many that almost no existing system would be able to run all of them.
Secondly, pairwise testing for a software system is based on the hope that the two parameters having the highest number of possible values are the culprit for the highest percentage of faults of that system.
This is of course only a hope, and almost no rigorous scientific research has existed so far to prove it.
2) What I often see in documentation discussing pairwise testing, like this, is that the list of all possible combinations (a.k.a. the pairwise test table) is not constructed in a well-thought-out way. This creates confusion.
In your case, all the parameters have the same number of possible values (2 values), therefore you could choose any two parameters of those three to build the table. What you could pay attention to is the ordering of the combinations: you iterate over the rightmost parameter first, then the next parameter to the left, and so on.
Say if you have two parameters p1 and p2, p1 has two possible values apple and orange; and p2 has two possible values red and blue, then your pair-wise test table would be:
index| p1 p2
------------------
1 | apple red
2 | apple blue
3 | orange red
4 | orange blue
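Since the rest of this page is SQL-centric, the same full combination table can also be enumerated with a CROSS JOIN of the two value lists (row order may differ from the table above):
SELECT p1.v AS p1, p2.v AS p2
FROM  (VALUES ('apple'), ('orange')) AS p1(v)
CROSS JOIN (VALUES ('red'), ('blue')) AS p2(v)
ORDER BY p1.v, p2.v;
-- 2 * 2 = 4 rows, one per combination.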

How to represent and insert into an ordered list in SQL?

I want to represent the list "hi", "hello", "goodbye", "good day", "howdy" (with that order), in a SQL table:
pk | i | val
------------
1 | 0 | hi
0 | 2 | hello
2 | 3 | goodbye
3 | 4 | good day
5 | 6 | howdy
'pk' is the primary key column. Disregard its values.
'i' is the "index" that defines that order of the values in the 'val' column. It is only used to establish the order and the values are otherwise unimportant.
The problem I'm having is with inserting values into the list while maintaining the order. For example, if I want to insert "hey" and I want it to appear between "hello" and "goodbye", then I have to shift the 'i' values of "goodbye" and "good day" (but preferably not "howdy") to make room for the new entry.
So, is there a standard SQL pattern to do the shift operation, but only shift the elements that are necessary? (Note that a simple "UPDATE table SET i=i+1 WHERE i>=3" doesn't work, because it violates the uniqueness constraint on 'i', and also it updates the "howdy" row unnecessarily.)
Or, is there a better way to represent the ordered list? I suppose you could make 'i' a floating-point value and choose values in between, but then you need a separate rebalancing operation when no such value exists.
Or, is there some standard algorithm for generating string values between arbitrary other strings, if I were to make 'i' a varchar?
Or should I just represent it as a linked list? I was avoiding that because I'd like to also be able to do a SELECT .. ORDER BY to get all the elements in order.
As I read your post, I kept thinking 'linked list', and at the end I still think that's the way to go.
If you are using Oracle, and the linked list is a separate table (or even the same table with a self-referencing id - which I would avoid), then you can use a CONNECT BY query and the pseudo-column LEVEL to determine sort order.
You can easily achieve this with a cascading trigger that, on the insert/update operation, updates any existing 'index' entry equal to the new one to that index value + 1. This will cascade through all rows until the first gap stops the cascade - see the second example in this blog entry for a PostgreSQL implementation.
This approach should work independent of the RDBMS used, provided it offers support for triggers to fire before an update/insert. It basically does what you'd do if you implemented your desired behavior in code (increase all following index values until you encounter a gap), but in a simpler and more effective way.
Alternatively, if you can live with a restriction to SQL Server, check the hierarchyid type. While mainly geared at defining nested hierarchies, you can use it for flat ordering as well. It somewhat resembles your approach using floats, as it allows insertion between two positions by assigning fractional values, thus avoiding the need to update other entries.
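Another way around the uniqueness problem from the question, different from the trigger above and sketched with assumed table and column names: if the unique constraint on i is declared DEFERRABLE, the shift can be done in a single UPDATE, because uniqueness is only checked at commit. Note that this still shifts every later row, including 'howdy', which the question wanted to avoid.
CREATE TABLE ordered_list (
    pk  serial PRIMARY KEY,
    i   integer NOT NULL,
    val text NOT NULL,
    UNIQUE (i) DEFERRABLE INITIALLY DEFERRED
);
INSERT INTO ordered_list (i, val) VALUES
    (0, 'hi'), (2, 'hello'), (3, 'goodbye'), (4, 'good day'), (6, 'howdy');
BEGIN;
UPDATE ordered_list SET i = i + 1 WHERE i >= 3;   -- make room at i = 3
INSERT INTO ordered_list (i, val) VALUES (3, 'hey');
COMMIT;
SELECT val FROM ordered_list ORDER BY i;          -- hi, hello, hey, goodbye, good day, howdy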
If you use strings instead of numbers, you may have a table like this:
pk | i | val
------------
1 | a0 | hi
0 | a2 | hello
2 | a3 | goodbye
3 | b | good day
5 | b1 | howdy
You may insert a4 between a3 and b, a21 between a2 and a3, a1 between a0 and a2, and so on. You would need a clever function to generate an i for a new value v between p and n; the index can become longer and longer, or you need a big rebalancing from time to time.
Another approach could be to implement a (doubly) linked list in the table, where you don't store indexes but links to the previous and next elements, which means that you normally only have to update 1-2 rows:
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
2 | 0 | goodbye
3 | 2 | good day
5 | 3 | howdy
hey between hello & goodbye:
hey gets pk 6,
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
6 | 0 | hey <- ins
2 | 6 | goodbye <- upd
3 | 2 | good day
5 | 3 | howdy
The previous element of hey is hello with pk=0, and goodbye, which previously linked to hello, now has to link to hey.
But I don't know if it is possible to find an 'order by' mechanism for this in many DB implementations.
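Regarding the 'order by' concern: in PostgreSQL (and any database with recursive CTEs) the order can be reconstructed by walking the links. A sketch, assuming a table list(pk, prev, val) in which the head row has prev IS NULL (the example above uses a different convention for the head):
WITH RECURSIVE ordered AS (
    SELECT pk, prev, val, 1 AS pos
    FROM   list
    WHERE  prev IS NULL                    -- head of the list
    UNION ALL
    SELECT l.pk, l.prev, l.val, o.pos + 1
    FROM   list l
    JOIN   ordered o ON l.prev = o.pk
)
SELECT val FROM ordered ORDER BY pos;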
Since I had a similar problem, here is a very simple solution:
Make your i column floats, but insert integer values for the initial data:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
Then, if you want to insert something in between, just compute a float value in the middle between the two surrounding values:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
6 | 2.5 | hey
This way the number of inserts between the same two values is limited by the resolution of float values, but for almost all cases that should be more than sufficient.
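A sketch of that midpoint insert (assuming a table items(pk, i, val) with i as double precision, as in this answer): the new i is simply the average of the two neighbouring values.
INSERT INTO items (i, val)
SELECT (p.i + n.i) / 2.0, 'hey'
FROM   items AS p, items AS n
WHERE  p.val = 'hello'
AND    n.val = 'goodbye';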