Understanding window function frame with RANGE mode - sql

SELECT
    sum(unique1) OVER () AS total,
    sum(unique1) OVER
        (PARTITION BY four ORDER BY unique1 RANGE BETWEEN 5::int8 PRECEDING AND 6::int2 FOLLOWING),
    unique1,
    four
FROM
    tenk1
WHERE
    unique1 < 10;
Returns:
 total | sum | unique1 | four
-------+-----+---------+------
    45 |   4 |       0 |    0
    45 |  12 |       4 |    0
    45 |  12 |       8 |    0
    45 |   6 |       1 |    1
    45 |  15 |       5 |    1
    45 |  14 |       9 |    1
    45 |   8 |       2 |    2
    45 |   8 |       6 |    2
    45 |  10 |       3 |    3
    45 |  10 |       7 |    3
(10 rows)
Minor change based on this
PARTITION BY four leaves each partition with only 2 or 3 rows. With BETWEEN 5 PRECEDING AND 6 FOLLOWING, I therefore thought it would not matter whether the frame uses ROWS or RANGE mode: either way the frame should be wide enough to cover all 2 or 3 rows of the partition, so both should return the same result.
However, it does matter. I think I understand the same query with ROWS instead of RANGE.
Quote from manual:
In RANGE or GROUPS mode, a frame_start of CURRENT ROW means the frame
starts with the current row's first peer row (a row that the window's
ORDER BY clause sorts as equivalent to the current row), while a
frame_end of CURRENT ROW means the frame ends with the current row's
last peer row. In ROWS mode, CURRENT ROW simply means the current row.
Question: How should I interpret the following?
partition by four order by unique1 range between 5::int8 preceding and 6::int2 following

The documentation states:
In ROWS mode, the offset must yield a non-null, non-negative integer, and the option means that the frame starts or ends the specified number of rows before or after the current row.
[...]
In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column.
(The emphasis is mine.)
So with ROWS, you will get the 5 rows before and the 6 rows after the current row. With RANGE, you will get those rows where unique1 is no more than 5 less or 6 more than the unique1 of the current row.
In your example, if you consider the first row, ROWS BETWEEN ... AND 6 FOLLOWING would include the third row, but RANGE BETWEEN ... AND 6 FOLLOWING would not, because the difference between 8 (the value of unique1 in the third row) and 0 is greater than 6.
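The difference can be seen side by side with a small self-contained query. This is only a sketch: it assumes PostgreSQL 11 or later (needed for RANGE with an offset), replaces tenk1 with generate_series(0, 9), and assumes that four equals unique1 % 4, which matches the values shown in the question.

```sql
-- Stand-in for tenk1: unique1 = 0..9, four = unique1 % 4 (assumption).
SELECT
    unique1,
    unique1 % 4 AS four,
    -- ROWS: up to 5 physical rows before and 6 after, within the partition
    sum(unique1) OVER (PARTITION BY unique1 % 4 ORDER BY unique1
                       ROWS BETWEEN 5 PRECEDING AND 6 FOLLOWING) AS rows_sum,
    -- RANGE: rows whose unique1 lies in [current - 5, current + 6]
    sum(unique1) OVER (PARTITION BY unique1 % 4 ORDER BY unique1
                       RANGE BETWEEN 5 PRECEDING AND 6 FOLLOWING) AS range_sum
FROM generate_series(0, 9) AS g(unique1)
ORDER BY four, unique1;

--  unique1 | four | rows_sum | range_sum
-- ---------+------+----------+-----------
--        0 |    0 |       12 |         4
--        4 |    0 |       12 |        12
--        8 |    0 |       12 |        12
--        1 |    1 |       15 |         6
--        5 |    1 |       15 |        15
--        9 |    1 |       15 |        14
--        2 |    2 |        8 |         8
--        6 |    2 |        8 |         8
--        3 |    3 |       10 |        10
--        7 |    3 |       10 |        10
```

With ROWS every frame covers the whole partition, so rows_sum is constant within each partition. With RANGE the frame for unique1 = 0 only admits values from -5 to 6, which excludes 8 and gives 0 + 4 = 4.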

COALESCE in postgresql conditional displaying seemingly undocumented behavior?

I have looked at the COALESCE documentation and it mentions the typical case of using COALESCE to make default/situational parameters, e.g.
COALESCE(discount, 5)
which evaluates to 5 if discount is NULL.
However, I have seen it used where COALESCE actually evaluated all the arguments, despite the documentation explicitly saying it stops evaluating arguments after the first non-null argument.
Here is an example similar to what I encountered, say you have a table like this:
id | wind | rain | snow
1 | null | 2 | 3
2 | 5 | null | 6
3 | null | 7 | 2
Then you run
SELECT *
FROM weather_table
WHERE
    COALESCE(wind, rain, snow) >= 5
You would expect this to only select rows with wind >= 5, right? NO! It selects all rows where either wind, rain or snow is at least 5. In this case that is 2 rows, specifically these two:
2 | 5 | null | 6
3 | null | 7 | 2
Honestly, pretty cool functionality, but it really irks me that I couldn't find any example of this online or in the documentation.
Can anyone tell me what's going on? Am I missing something?
You would expect this to only select rows with wind >= 5, right?
No, I expect it to select rows with what the Coalesce function returns.
The Coalesce function delivers the value of the first non-null parameter. You had Coalesce(wind,rain,snow). The first row had (null,2,3), so coalesce returned 2. The second row had (5,null,6) so returned 5. The third row had (null,7,2) so returned 7.
The last two rows meet the condition >=5, so 2 rows are retrieved.
Notice that the value for snow was never returned in your example, because either wind or rain always had a value.
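The per-row evaluation described above can be made visible by selecting the COALESCE result next to the comparison. This is a sketch that rebuilds the question's table inline with VALUES; the column aliases first_non_null and selected are made up for illustration.

```sql
-- Inline copy of the question's weather_table.
WITH weather_table (id, wind, rain, snow) AS (
    VALUES (1, NULL::int, 2,    3),
           (2, 5,         NULL, 6),
           (3, NULL,      7,    2)
)
SELECT
    id, wind, rain, snow,
    COALESCE(wind, rain, snow)      AS first_non_null,  -- value actually compared
    COALESCE(wind, rain, snow) >= 5 AS selected         -- what the WHERE clause tests
FROM weather_table
ORDER BY id;

--  id | wind | rain | snow | first_non_null | selected
-- ----+------+------+------+----------------+----------
--   1 |      |    2 |    3 |              2 | f
--   2 |    5 |      |    6 |              5 | t
--   3 |      |    7 |    2 |              7 | t
```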
After writing out the question so clearly, I realized what was going on myself. But I want to answer it here in case anyone else is confused.
It turns out the reason is that the COALESCE function is evaluated once for each row, which I suppose I could have known. Then it all makes sense.
For each row it asks: is wind non-null? If so, the row is added to the result only if wind >= 5. If wind is null, it checks whether rain is non-null, and so on.
Notably though, if my table had been like this:
id | wind | rain | snow
1 | 0 | 2 | 3
2 | 5 | 0 | 6
3 | 0 | 7 | 2
The command would have worked like I thought: the COALESCE function would have been completely useless, and the query would have picked only this one row
2 | 5 | 0 | 6
which is the same as SELECT * FROM weather_table WHERE wind >= 5.
The fallback to the next column only happens when a column is null (0 is not null).
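A quick check of the zero-filled variant, again as a sketch with the table rebuilt inline via VALUES. Because wind is never null here, COALESCE always returns wind and the other columns are never consulted.

```sql
-- Same table as above, but with 0 instead of NULL.
WITH weather_table (id, wind, rain, snow) AS (
    VALUES (1, 0, 2, 3),
           (2, 5, 0, 6),
           (3, 0, 7, 2)
)
SELECT *
FROM weather_table
WHERE COALESCE(wind, rain, snow) >= 5;

--  id | wind | rain | snow
-- ----+------+------+------
--   2 |    5 |    0 |    6
```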

width_bucket not returning buckets of equal width

I'm using Postgres version 9.6.9 and attempting to use width_bucket() to generate a histogram with buckets consisting of equal widths. However, the query I'm using is not returning buckets of equal widths.
As you can see in the example below, the values in the buckets span varying widths, e.g. bucket 1 has a min of 7 and a max of 23 (a span of 16), while bucket 3 has a min of 52 and a max of 55 (a span of 3).
How can I adjust my query to ensure that each bucket has the same width?
Here's what the data looks like:
value
-------
7
7
15
17
18
22
23
25
29
42
52
52
55
60
74
85
90
90
92
95
(20 rows)
Here's the query and resulting histogram:
WITH min_max AS (
    SELECT
        min(value) AS min_val,
        max(value) AS max_val
    FROM table
)
SELECT
    min(value),
    max(value),
    count(*),
    width_bucket(value, min_val, max_val, 5) AS bucket
FROM table, min_max
GROUP BY bucket
ORDER BY bucket;
 min | max | count | bucket
-----+-----+-------+--------
   7 |  23 |     7 |      1
  25 |  42 |     3 |      2
  52 |  55 |     3 |      3
  60 |  74 |     2 |      4
  85 |  92 |     4 |      5
  95 |  95 |     1 |      6
(6 rows)
From https://prestodb.io/docs/current/functions/window.html
Have a look at ntile():
ntile(n) → bigint
Divides the rows for each window partition into n buckets ranging from 1 to at most n. Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket.
For example, with 6 rows and 4 buckets, the bucket values would be as follows: 1 1 2 2 3 4
Or say to rank each runner's 100m race times to find their personal best out of their 10 races:
SELECT
NTILE(10) over (PARTITION BY runners ORDER BY racetimes)
FROM
table
Your buckets are the same size. You just don't have data that accurately represents the end-points.
For instance, would 24 be in the first or second bucket? This is more notable for the ranges without any data, such as 75-83.
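To see the actual bucket boundaries rather than the min and max of the data that happens to fall into each bucket, the edges can be computed directly. This is a sketch for PostgreSQL using the question's numbers (low 7, high 95, 5 buckets); the column names lower_bound and upper_bound are made up for illustration.

```sql
-- Each bucket is (95 - 7) / 5 = 17.6 wide; lower bound inclusive, upper exclusive.
SELECT
    b AS bucket,
    round(7 + (b - 1) * (95 - 7) / 5.0, 1) AS lower_bound,
    round(7 + b       * (95 - 7) / 5.0, 1) AS upper_bound
FROM generate_series(1, 5) AS b
ORDER BY b;

--  bucket | lower_bound | upper_bound
-- --------+-------------+-------------
--       1 |         7.0 |        24.6
--       2 |        24.6 |        42.2
--       3 |        42.2 |        59.8
--       4 |        59.8 |        77.4
--       5 |        77.4 |        95.0

-- 24 is below 24.6, so it lands in the first bucket.
SELECT width_bucket(24, 7, 95, 5);  -- 1
```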
From https://www.oreilly.com/library/view/sql-in-a/9780596155322/re91.html
WIDTH_BUCKET( expression, min, max, buckets)
The buckets argument specifies the number of buckets to create over the range defined by min through max. min is inclusive, whereas max is not.
The maximum is not included, which is why the value 95 ends up in an extra sixth bucket. To include it, set
WIDTH_BUCKET( expression, min, max + 1, buckets)
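Applied to the question's query, the change could look like the following sketch. The sample data is inlined under the made-up CTE name data so the query is self-contained; with max_val + 1 as the upper bound, each bucket is (96 - 7) / 5 = 17.8 wide.

```sql
WITH data (value) AS (
    VALUES (7), (7), (15), (17), (18), (22), (23), (25), (29), (42),
           (52), (52), (55), (60), (74), (85), (90), (90), (92), (95)
),
min_max AS (
    SELECT min(value) AS min_val, max(value) AS max_val
    FROM data
)
SELECT
    min(value),
    max(value),
    count(*),
    -- upper bound is exclusive, so push it just past the largest value
    width_bucket(value, min_val, max_val + 1, 5) AS bucket
FROM data, min_max
GROUP BY bucket
ORDER BY bucket;

--  min | max | count | bucket
-- -----+-----+-------+--------
--    7 |  23 |     7 |      1
--   25 |  42 |     3 |      2
--   52 |  60 |     4 |      3
--   74 |  74 |     1 |      4
--   85 |  95 |     5 |      5
```

The sixth bucket is gone. The min and max columns still show only the data extremes within each bucket, not the bucket edges.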

How do I compare rows of a table against all other rows of the table?

I would like to create a script that takes the rows of a table which have a specific mathematical difference in their ASCII sum and to add the rows to a separate table, or even to flag a different field when they have that difference.
For instance, I am looking to find when the ASCII sum of word A and the ASCII sum of word B, both stored in rows of a table, have a difference of 63 or 31.
I could probably use a loop to select these rows, but SQL is not my greatest virtue.
ItemID | asciiSum |ProperDiff
-------|----------|----------
1 | 100 |
2 | 37 |
3 | 69 |
4 | 23 |
5 | 6 |
6 | 38 |
After running the code, the field ProperDiff will be updated to contain 'yes' for ItemID 1, 2, 3, 5 and 6, since, for example, the asciiSum values of items 1 and 2 differ by 63 (100 - 37).
This will not be fast, but I think it does what you want:
update t
    set ProperDiff = 'yes'
    where exists (select 1
                  from t t2
                  where abs(t2.AsciiSum - t.AsciiSum) in (63, 31)
                 );
It should work okay on small tables.
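Here is the update run against the sample data from the question, as a sketch. It is written for PostgreSQL, and the column types are assumptions, since the question does not state them.

```sql
-- Sample data from the question; column types are assumed.
CREATE TEMP TABLE t (
    ItemID     int PRIMARY KEY,
    AsciiSum   int,
    ProperDiff text
);

INSERT INTO t (ItemID, AsciiSum)
VALUES (1, 100), (2, 37), (3, 69), (4, 23), (5, 6), (6, 38);

-- Flag every row that has some other row at a distance of 63 or 31.
UPDATE t
SET ProperDiff = 'yes'
WHERE EXISTS (SELECT 1
              FROM t AS t2
              WHERE abs(t2.AsciiSum - t.AsciiSum) IN (63, 31));

SELECT ItemID, AsciiSum, ProperDiff FROM t ORDER BY ItemID;

--  itemid | asciisum | properdiff
-- --------+----------+------------
--       1 |      100 | yes
--       2 |       37 | yes
--       3 |       69 | yes
--       4 |       23 |
--       5 |        6 | yes
--       6 |       38 | yes
```

Item 4 stays unflagged because no other asciiSum is 63 or 31 away from 23.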

Ransack search- select rows whose sum adds up to a given value

I'm using Ransack search with Ruby on Rails and trying to output a random set of between 1 and 6 rows whose time values add up to a given value specified by the search.
For example, search for rows whose time values add up to 40. In this case ids 12 and 14 will be returned. Any combination of 1 to 6 rows can be randomly output.
If a combination of 3 rows meets the criteria then 3 rows should be output, and likewise for 1, 2, 4, 5 or 6 rows. If no single row or combination can be found then the output should be nil.
id | title | time
----+-------------------------+-----------
26 | example | 10
27 | example | 26
14 | example | 20
28 | example | 50
12 | example | 20
20 | example | 6
Note - Not sure if ransack search is the best to perform this type of query
Thanks in advance
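One possible way to express the requirement in plain SQL rather than through Ransack is a recursive CTE that builds up combinations. This is only a sketch for PostgreSQL, not something the question's setup is known to use: the inline table name items, the target of 40, and the pruning step (which assumes time values are never negative) are all assumptions.

```sql
WITH RECURSIVE items (id, time) AS (
    VALUES (26, 10), (27, 26), (14, 20), (28, 50), (12, 20), (20, 6)
),
combos AS (
    -- start with every single row
    SELECT ARRAY[id] AS ids, time AS total, id AS last_id
    FROM items
    UNION ALL
    -- extend a combination with a row of higher id, so each set is built once
    SELECT c.ids || i.id, c.total + i.time, i.id
    FROM combos AS c
    JOIN items AS i ON i.id > c.last_id
    WHERE c.total + i.time <= 40      -- prune: assumes non-negative times
      AND cardinality(c.ids) < 6      -- at most 6 rows per combination
)
SELECT ids
FROM combos
WHERE total = 40
ORDER BY random()
LIMIT 1;
```

For the sample data the only matching combination is ids 12 and 14. When nothing matches, the query returns no rows, which the application could map to nil. The number of combinations grows quickly with table size, so this is only practical for small tables.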

Postgres Aggregator which checks for occurrences

Does there exist a Postgres Aggregator such that, when used on the following table:
id | value
----+-----------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 3
6 | 3
7 | 3
8 | 4
9 | 4
10 | 5
in a query such as:
select agg_function(4,value) from mytable where id>5
will return
agg_function
--------------
t
(a boolean true result) because a row or rows with value=4 were selected?
In other words, one argument specifies the value you are looking for, the other argument takes the column specifier, and it returns true if the column value was equal to the specified value for one or more rows?
I have successfully created an aggregate to do just that, but I'm wondering if I have just reinvented the wheel...
select sum(case when value = 4 then 1 else 0 end) > 0
from mytable
where id > 5
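PostgreSQL also has a built-in bool_or aggregate, which returns true if the condition holds for at least one input row. The sketch below runs both forms against the question's data, rebuilt inline with VALUES; the aliases via_sum and via_bool_or are made up for illustration.

```sql
WITH mytable (id, value) AS (
    VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 3),
           (6, 3), (7, 3), (8, 4), (9, 4), (10, 5)
)
SELECT
    sum(CASE WHEN value = 4 THEN 1 ELSE 0 END) > 0 AS via_sum,
    bool_or(value = 4)                             AS via_bool_or
FROM mytable
WHERE id > 5;

--  via_sum | via_bool_or
-- ---------+-------------
--  t       | t
```

Both expressions return NULL rather than false when the WHERE clause selects no rows at all.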