How to use a ring data structure in window functions - sql

I have data that is arranged in a ring structure (or circular buffer); that is, it can be expressed as sequences that cycle: ...-1-2-3-4-5-1-2-3-.... In a 5-part ring, the parts run 1 through 5 and part 5 wraps around to part 1.
I'd like to create a window query that combines the lag and lead items into a three-point array, but I can't figure it out. For example, at part 1 of a 5-part ring the lag/lead sequence is 5-1-2, and at part 4 it is 3-4-5.
Here is an example table of two rings with different numbers of parts (always more than three per ring):
create table rp (ring int, part int);
insert into rp(ring, part) values(1, generate_series(1, 5));
insert into rp(ring, part) values(2, generate_series(1, 7));
Here is a nearly successful query:
SELECT ring, part, array[
         lag(part, 1, NULL) over (partition by ring),
         part,
         lead(part, 1, 1) over (partition by ring)
       ] AS neighbours
FROM rp;
 ring | part | neighbours
------+------+------------
    1 |    1 | {NULL,1,2}
    1 |    2 | {1,2,3}
    1 |    3 | {2,3,4}
    1 |    4 | {3,4,5}
    1 |    5 | {4,5,1}
    2 |    1 | {NULL,1,2}
    2 |    2 | {1,2,3}
    2 |    3 | {2,3,4}
    2 |    4 | {3,4,5}
    2 |    5 | {4,5,6}
    2 |    6 | {5,6,7}
    2 |    7 | {6,7,1}
(12 rows)
The only thing left to do is replace the NULL with the end point of each ring, which is the last value. Now, alongside the lag and lead window functions there is a last_value function, which would be ideal. However, window function calls cannot be nested:
SELECT ring, part, array[
         lag(part, 1, last_value(part) over (partition by ring)) over (partition by ring),
         part,
         lead(part, 1, 1) over (partition by ring)
       ] AS neighbours
FROM rp;
ERROR: window function calls cannot be nested
LINE 2: lag(part, 1, last_value(part) over (partition by ring)) ...
Update: thanks to @Justin's suggestion to use coalesce to avoid nesting window functions. Furthermore, numerous folks have pointed out that first/last values need an explicit ORDER BY on the ring sequence, which happens to be part in this example. So, randomising the input data a bit:
create table rp (ring int, part int);
insert into rp(ring, part) select 1, generate_series(1, 5) order by random();
insert into rp(ring, part) select 2, generate_series(1, 7) order by random();

Use COALESCE like @Justin provided.
With first_value() / last_value() you need to add an ORDER BY clause to the window definition, or the order is undefined. You just got lucky in the example, because the rows happen to be in order right after creating the dummy table.
Once you add ORDER BY, the default window frame ends at the current row, and you need to special-case the last_value() call - or reverse the sort order in the window frame as demonstrated in my first example.
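Before the examples, a minimal illustration of the frame effect (a sketch against the rp table above, assuming PostgreSQL): with the default frame, last_value() simply returns the current row's value, while an explicit frame over the whole partition returns the ring's actual last part.
SELECT ring, part
     , last_value(part) OVER (PARTITION BY ring ORDER BY part) AS lv_default_frame
     , last_value(part) OVER (PARTITION BY ring ORDER BY part
                              RANGE BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING) AS lv_whole_partition
FROM rp;  -- lv_default_frame always equals part; lv_whole_partition is 5 or 7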
When reusing a window definition multiple times, an explicit WINDOW clause simplifies syntax a lot:
SELECT ring, part, ARRAY[
          coalesce(
             lag(part) OVER w
            ,first_value(part) OVER (PARTITION BY ring ORDER BY part DESC))
         ,part
         ,coalesce(
             lead(part) OVER w
            ,first_value(part) OVER w)
       ] AS neighbours
FROM   rp
WINDOW w AS (PARTITION BY ring ORDER BY part);
Better yet, reuse the same window definition, so Postgres can calculate all values in a single scan. For this to work we need to define a custom window frame:
SELECT ring, part, ARRAY[
          coalesce(
             lag(part) OVER w
            ,last_value(part) OVER w)
         ,part
         ,coalesce(
             lead(part) OVER w
            ,first_value(part) OVER w)
       ] AS neighbours
FROM   rp
WINDOW w AS (PARTITION BY ring
             ORDER BY part
             RANGE BETWEEN UNBOUNDED PRECEDING
                       AND UNBOUNDED FOLLOWING)
ORDER  BY 1, 2;
You can even adapt the frame definition for each window function call:
SELECT ring, part, ARRAY[
          coalesce(
             lag(part) OVER w
            ,last_value(part) OVER (w RANGE BETWEEN CURRENT ROW
                                           AND UNBOUNDED FOLLOWING))
         ,part
         ,coalesce(
             lead(part) OVER w
            ,first_value(part) OVER w)
       ] AS neighbours
FROM   rp
WINDOW w AS (PARTITION BY ring ORDER BY part)
ORDER  BY 1, 2;
Might be faster for rings with many parts. You'll have to test.
SQL Fiddle demonstrating all three with an improved test case. Consider query plans.
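To compare the variants yourself, one option (a sketch, assuming PostgreSQL) is to inspect the query plans and check how many separate WindowAgg passes each query needs:
EXPLAIN (ANALYZE, BUFFERS)
SELECT ring, part, ARRAY[
          coalesce(lag(part)  OVER w, last_value(part)  OVER w)
         ,part
         ,coalesce(lead(part) OVER w, first_value(part) OVER w)
       ] AS neighbours
FROM   rp
WINDOW w AS (PARTITION BY ring ORDER BY part
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);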
More about window frame definitions:
In the manual.
PostgreSQL window function: partition by comparison
PostgreSQL query with max and min date plus associated id per row

Query:
SELECT ring, part, array[
         coalesce(lag(part, 1, NULL) over (partition by ring),
                  max(part) over (partition by ring)),
         part,
         lead(part, 1, 1) over (partition by ring)
       ] AS neighbours
FROM rp;
Result:
| RING | PART | NEIGHBOURS |
|------|------|------------|
|    1 |    1 | 5,1,2      |
|    1 |    2 | 1,2,3      |
|    1 |    3 | 2,3,4      |
|    1 |    4 | 3,4,5      |
|    1 |    5 | 4,5,1      |
|    2 |    1 | 7,1,2      |
|    2 |    2 | 1,2,3      |
|    2 |    3 | 2,3,4      |
|    2 |    4 | 3,4,5      |
|    2 |    5 | 4,5,6      |
|    2 |    6 | 5,6,7      |
|    2 |    7 | 6,7,1      |

Related

ORACLE SELECT DISTINCT VALUE ONLY IN SOME COLUMNS

+----+-------+-------+------+---------+
| id | order | value | type | account |
+----+-------+-------+------+---------+
|  1 |     1 | a     |    2 |       1 |
|  1 |     2 | b     |    1 |       1 |
|  1 |     3 | c     |    4 |       1 |
|  1 |     4 | d     |    2 |       1 |
|  1 |     5 | e     |    1 |       1 |
|  1 |     5 | f     |    6 |       1 |
|  2 |     6 | g     |    1 |       1 |
+----+-------+-------+------+---------+
I need to select all fields of this table, but get only one row for each combination of id + type (I don't care which row is picked for each combination). I have tried some approaches without result.
The moment I use DISTINCT, I can't include the rest of the fields to make them available in a subquery. And if I add ROWNUM in the subquery, all rows become distinct, which defeats the purpose.
Any ideas?
My best query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT DISTINCT MAX(ROWID)
FROM MYTABLE
GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
       select id, type, value, account,
              row_number() over (partition by id, type order by null) as rn
       from   your_table
     )
where rn = 1
;
order by null means an arbitrary (unspecified) ordering of rows within each group (partition) by (id, type); this means that the ordering step, which is usually time-consuming, will be trivial in this case. Also, Oracle optimizes such queries for the filter rn = 1.
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from   my_table
match_recognize (
  partition by id, type
  all rows per match
  pattern (^r)
  define r as null is null
);
This partitions the rows by id and type, it doesn't order them (which means an arbitrary ordering), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is an arbitrary order). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.

How to join tables (from one-to-many to one-to-one) based on proportion

I have two tables:
Orders:
| Order | Produce |
|   1   | Apple   |
|   2   | Orange  |
|   3   | Apple   |
|   4   | Apple   |
|   5   | Orange  |

Types:
| Produce | Type  | Proportion |
| Apple   | Red   | 0.67       |
| Apple   | Green | 0.33       |
| Orange  | Sweet | 1.0        |
I would like to assign a type (from the second table) to each order based on the proportion (the assignment can be random).
As an example, because there are:
3 apple orders
2 apple types (red - 67% and green - 33%)
two of the apple orders should be allocated red and one order should be allocated green:
| Order | Produce | Type  |
|   1   | Apple   | Red   |
|   2   | Orange  | Sweet |
|   3   | Apple   | Green |
|   4   | Apple   | Red   |
|   5   | Orange  | Sweet |
I understand the calculation will not always be precise and neat, so if the allocation is a bit off because of the sample size, that's fine.
I've been trying to come up with some logic using SQL window functions (sorting based on row numbers and groups) but couldn't get the results I want.
I've also tried using a data merge in SAS but can't get the random allocation going.
Any ideas/suggestions on the way/logic to do this?
You can enumerate the rows and do a simple calculation:
select o.*, t.type
from (select o.*,
             row_number() over (partition by produce order by newid()) as seqnum,
             count(*) over (partition by produce) as cnt
      from orders o
     ) o join
     (select t.*,
             sum(proportion) over (partition by produce order by proportion) - proportion as lower,
             sum(proportion) over (partition by produce order by proportion) as upper
      from types t
     ) t
     on o.produce = t.produce and
        (seqnum - 1) * 1.0 / cnt >= lower and
        (seqnum - 1) * 1.0 / cnt < upper;
The key here is calculating the upper and lower bounds using a cumulative sum of the proportions, then comparing each row's position within its group, divided by the group count, against those bounds.
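For instance, with the sample data above, Apple has cnt = 3, and the cumulative sums (ordered by proportion ascending) give green the range [0, 0.33) and red the range [0.33, 1.0). The three randomly numbered apple orders then map as follows (a sketch of the arithmetic):
seqnum 1: (1 - 1) / 3 = 0.000 -> green
seqnum 2: (2 - 1) / 3 = 0.333 -> red
seqnum 3: (3 - 1) / 3 = 0.667 -> red
So one apple order gets green and two get red, matching the 33/67 split as closely as the group size allows.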
Here is a db<>fiddle.

Find blocks of data in Hive

I have a table with series of measurements done by some modules, let's call them N and W.
Each row contains one measurement.
I want to assign an unique block identifier to each measurement, see expected output below.
Assumption: records in the table are written in some order based on time and module; therefore one can assume that ROW_NUMBER() OVER (ORDER BY 1) delivers an ordering column.
How can I do this in Hive?
Expected output:
+---------+-------+
| module | block |
+---------+-------+
| W | 1 |
| W | 1 |
| W | 1 |
| N | 2 |
| N | 2 |
| W | 3 |
| W | 3 |
| W | 3 |
+---------+-------+
Sample data:
DROP TABLE IF EXISTS so_sample;
CREATE TABLE so_sample (
module string
);
INSERT INTO TABLE so_sample
VALUES ("W"), ("W"), ("W"), ("N"), ("N"), ("W"), ("W"), ("W")
;
Regards
Paweł
I found a solution: build a block identifier from the module name and a number obtained by subtracting the row number within that module from the global row number (the classic gaps-and-islands trick).
The block identifier is unique and additionally preserves the order of blocks within a module; a worked trace follows the query below:
WITH ordered AS (
  SELECT
    module,
    ROW_NUMBER() OVER (ORDER BY 1) as row_nr
  FROM so_sample
)
SELECT
  module,
  row_nr,
  module ||
  cast(
    row_nr - ROW_NUMBER() OVER (PARTITION BY module ORDER BY row_nr)
    AS STRING ) as block_id
FROM ordered
ORDER BY row_nr
;
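To see why this works, here is a trace of the intermediate values for the sample data (a sketch, not actual query output):
module | row_nr | row number within module | difference | block_id
W      | 1      | 1                        | 0          | W0
W      | 2      | 2                        | 0          | W0
W      | 3      | 3                        | 0          | W0
N      | 4      | 1                        | 3          | N3
N      | 5      | 2                        | 3          | N3
W      | 6      | 4                        | 2          | W2
W      | 7      | 5                        | 2          | W2
W      | 8      | 6                        | 2          | W2
The difference stays constant within each consecutive run of the same module and changes whenever the module changes, so module || difference identifies a block.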
Regards
Paweł

Use something like TOP with GROUP BY

With table table1 like below
+--------+------+------+-----------+------+
| flight | orig | dest | passenger | bags |
+--------+------+------+-----------+------+
| 1111   | sfo  | chi  | david     | 3    |
| 1112   | sfo  | dal  | david     | 7    |
| 1112   | sfo  | dal  | kim       | 10   |
| 1113   | lax  | san  | ameera    | 5    |
| 1114   | lax  | lfr  | tim       | 6    |
| 1114   | lax  | lfr  | jake      | 8    |
+--------+------+------+-----------+------+
I'm aggregating the table by orig like below
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig
I need to add the passenger with the longest name ( length(passenger) ) for each orig group - how do I go about it?
Output expected
+------+------------+----------+--------------+-------------------+
| orig | flight_cnt | pass_cnt | bags_cnt_med | pass_max_len_name |
+------+------------+----------+--------------+-------------------+
| sfo  | 3          | 2        | 7            | david             |
| lax  | 3          | 3        | 6            | ameera            |
+------+------------+----------+--------------+-------------------+
You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON.
Select first row in each GROUP BY group?
But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest to join two separate subqueries:
SELECT *
FROM  ( -- your original query
   SELECT orig
        , count(*) AS flight_cnt
        , count(distinct passenger) AS pass_cnt
        , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
   FROM   table1
   GROUP  BY orig
   ) org_query
JOIN  ( -- my addition
   SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
   FROM   table1
   ORDER  BY orig, length(passenger) DESC NULLS LAST
   ) pas USING (orig);
USING in the join clause conveniently only outputs one instance of orig, so you can simply use SELECT * in the outer SELECT.
If passenger can be NULL, it is important to add NULLS LAST:
PostgreSQL sort by datetime asc, null first?
From multiple passenger names with the same maximum length in the same group, you get an arbitrary pick - unless you add more expressions to ORDER BY as a tiebreaker. Detailed explanation in the answer linked above.
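For example, to make the pick deterministic among equally long names, one might add the name itself as a final sort key (a sketch, assuming an alphabetical tiebreaker is acceptable):
SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
FROM   table1
ORDER  BY orig, length(passenger) DESC NULLS LAST, passenger;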
Performance?
Typically, a single scan is superior, especially with sequential scans.
The above query uses two scans (maybe index / index-only scans). But the second scan is comparatively cheap unless the table is too huge to fit in cache (mostly). Lukas suggested an alternative query with only a single SELECT adding:
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] -- I'd add NULLS LAST
The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)
The same approach can be cheaper with a custom aggregate function first() like instructed in the Postgres Wiki. Or, faster yet, with a version written in C, available on PGXN. That eliminates the extra cost for array handling, but we still need per-group ORDER BY. It may be faster for only a few groups. You would then add:
, first(passenger ORDER BY length(passenger) DESC NULLS LAST)
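A minimal definition of such a first() aggregate, along the lines of the Postgres Wiki version (a sketch; the C implementation on PGXN serves the same purpose, faster):
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT AS
'SELECT $1';

CREATE AGGREGATE first (anyelement) (
  SFUNC = first_agg,
  STYPE = anyelement
);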
Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first - catch 22. Gordon solves this with a subquery - another candidate for good performance with standard Postgres.
first() does the same without subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values. Details:
Optimize GROUP BY query to retrieve latest record per user
The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.
One method uses the window function first_value(). Unfortunately, this is not available as an aggregation function:
select orig,
       count(*) as flight_cnt,
       count(distinct passenger) as pass_cnt,
       percentile_cont(0.5) within group (order by bags ASC) as bag_cnt_med,
       max(longest_name) as longest_name
from (select t1.*,
             first_value(passenger) over (partition by orig order by length(passenger) desc) as longest_name
      from table1 t1
     ) t1
group by orig;
You are looking for something like Oracle's KEEP FIRST/LAST, where you get a value (the passenger name) according to an aggregate (the name length). PostgreSQL doesn't have such a function as far as I know.
One way to go about this is a trick: Combine length and name, get the maximum, then extract the name: '0005david' > '0003kim' etc.
select
  orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group (order by bags ASC) as bag_cnt_med
, substr(max(to_char(char_length(passenger), 'FM0000') || passenger), 5) as name
from table1
group by orig
order by orig;
For small group sizes, you could use array_agg()
SELECT
orig
, COUNT (*) AS flight_cnt
, COUNT (DISTINCT passenger) AS pass_cnt
, PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY bags ASC) AS bag_cnt_med
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] AS pass_max_len_name
FROM table1
GROUP BY orig
Having said so, while this is shorter syntax, a first_value() window function based approach might be faster for larger data sets as array accumulation might become expensive.
But it does not solve the problem if you have several names with the same length:
t=# with p as (select distinct orig,passenger,length(trim(passenger)),max(length(trim(passenger))) over (partition by orig) from s127)
, o as ( select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from s127
group by orig)
select distinct o.*,p.passenger from o join p on p.orig = o.orig where max=length;
  orig | flight_cnt | pass_cnt | bag_cnt_med | passenger
-------+------------+----------+-------------+-----------
 lax   |          3 |        3 |           6 | ameera
 sfo   |          3 |        2 |           7 | david
(2 rows)
populate:
t=# create table s127(flight int,orig text,dest text, passenger text, bags int);
CREATE TABLE
Time: 52.678 ms
t=# copy s127 from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1111 | sfo | chi | david | 3
>> 1112 | sfo | dal | david | 7
>> 1112 | sfo | dal | kim | 10
>> 1113 | lax | san | ameera | 5
>> 1114 | lax | lfr | tim | 6
>> 1114 | lax | lfr | jake | 8
>> \.
COPY 6

Select first & last date in window

I'm trying to select the first & last date in a window, based on the month & year of the date supplied.
Here is example data:
F.rates
| id | c_id | date       | rate |
---------------------------------
| 1  | 1    | 01-01-1991 | 1    |
| 1  | 1    | 15-01-1991 | 0.5  |
| 1  | 1    | 30-01-1991 | 2    |
.................................
| 1  | 1    | 01-11-2014 | 1    |
| 1  | 1    | 15-11-2014 | 0.5  |
| 1  | 1    | 30-11-2014 | 2    |
Here is the PostgreSQL SELECT I came up with:
SELECT c_id, first_value(date) OVER w, last_value(date) OVER w
FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
             ORDER BY date ASC)
Which gives me a result pretty close to what I want:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 15-01-1991 |
| 1 | 01-01-1991 | 30-01-1991 |
.................................
Should be:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 30-01-1991 |
.................................
For some reason last_value(date) returns a different value for every record in the window, which makes me think I'm misunderstanding how windows in SQL work. It's as if SQL forms a new window for each row it iterates through, rather than a fixed set of windows for the entire table based on YEAR and MONTH.
So could anyone kindly explain where I'm wrong and how I can achieve the result I want?
There is a reason why I'm not using MAX/MIN with a GROUP BY clause. My next step would be to retrieve the associated rates for the dates I selected, like:
| c_id | first_date | last_date | first_rate | last_rate | avg rate |
-----------------------------------------------------------------------
| 1 | 01-01-1991 | 30-01-1991 | 1 | 2 | 1.1 |
.......................................................................
If you want your output to be grouped into a single row (or just fewer rows), you should use simple aggregation (i.e. GROUP BY), if avg_rate is enough:
SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)
More about window functions in PostgreSQL's documentation:
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
...
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
...
There are options to define the window frame in other ways ... See Section 4.2.8 for details.
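Applied to the original query, this means widening the frame so that last_value() sees the whole partition (a sketch):
SELECT c_id,
       first_value(date) OVER w AS first_date,
       last_value(date)  OVER w AS last_date
FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
             ORDER BY date
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
This still returns one row per input row (the rows retain their separate identities); use DISTINCT or aggregation to collapse them.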
EDIT:
If you want to collapse your data (min/max aggregation) and want to collect more columns than those listed in GROUP BY, you have 2 choices:
The SQL way
Select the min/max value(s) in a sub-query, then join their original rows back (but this way you have to deal with the fact that the min/max-ed column(s) are usually not unique):
SELECT agg.c_id,
       min first_date,
       max last_date,
       first.rate first_rate,
       last.rate last_rate,
       avg avg_rate
FROM (SELECT c_id, min(date), max(date), avg(rate)
      FROM F.rates
      GROUP BY c_id, date_trunc('month', date)) agg
JOIN F.rates first ON agg.c_id = first.c_id AND agg.min = first.date
JOIN F.rates last ON agg.c_id = last.c_id AND agg.max = last.date
PostgreSQL's DISTINCT ON
DISTINCT ON is typically meant for this task, but it relies heavily on ordering (only one extremum can be searched for this way at a time):
SELECT DISTINCT ON (c_id, date_trunc('month', date))
       c_id,
       date first_date,
       rate first_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date
You can join this query with other aggregated sub-queries of F.rates, but at this point (if you really need both minimum & maximum, and in your case even an average) the SQL-compliant way is more suitable.
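For completeness, a sketch of that combination against the rates table above, joining a first-row query, a last-row query, and the aggregate (assuming date_trunc('month', date) as the common month key):
SELECT f.c_id, f.first_date, l.last_date, f.first_rate, l.last_rate, a.avg_rate
FROM  (SELECT DISTINCT ON (c_id, date_trunc('month', date))
              c_id, date_trunc('month', date) AS month, date AS first_date, rate AS first_rate
       FROM F.rates
       ORDER BY c_id, date_trunc('month', date), date) f
JOIN  (SELECT DISTINCT ON (c_id, date_trunc('month', date))
              c_id, date_trunc('month', date) AS month, date AS last_date, rate AS last_rate
       FROM F.rates
       ORDER BY c_id, date_trunc('month', date), date DESC) l USING (c_id, month)
JOIN  (SELECT c_id, date_trunc('month', date) AS month, avg(rate) AS avg_rate
       FROM F.rates
       GROUP BY c_id, date_trunc('month', date)) a USING (c_id, month);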
Windowing functions aren't appropriate for this. Use aggregate functions instead.
select
c_id, date_trunc('month', date)::date,
min(date) first_date, max(date) last_date
from rates
group by c_id, date_trunc('month', date)::date;
c_id | date_trunc | first_date | last_date
------+------------+------------+------------
1 | 2014-11-01 | 2014-11-01 | 2014-11-30
1 | 1991-01-01 | 1991-01-01 | 1991-01-30
create table rates (
id integer not null,
c_id integer not null,
date date not null,
rate numeric(2, 1),
primary key (id, c_id, date)
);
insert into rates values
(1, 1, '1991-01-01', 1),
(1, 1, '1991-01-15', 0.5),
(1, 1, '1991-01-30', 2),
(1, 1, '2014-11-01', 1),
(1, 1, '2014-11-15', 0.5),
(1, 1, '2014-11-30', 2);