dynamically select arbitrary n columns with window function - sql

Abstract of my real select statement:
select
lag(somecol) over (partition by thing_id) as prev_1,
lag(somecol,2) over (partition by thing_id) as prev_2,
lag(somecol,3) over (partition by thing_id) as prev_3,
othercol,
...
In the real query the over is much more complex which leads to pretty dense, unreadable code. Additionally, getting the last 3 rows is hardcoded (vs n=whatever).
Is there any way in straight SQL to iteratively or recursively specify these prev_x columns, so that 1) the code is more readable and 2) you could dynamically specify the number n of prev cols?

To answer just the first question, to make the code more readable, Postgres allows to define the window, name it and then reference it several times in the query.
See the docs for Window Functions:
When a query involves multiple window functions, it is possible to
write out each one with a separate OVER clause, but this is
duplicative and error-prone if the same windowing behavior is wanted
for several functions. Instead, each windowing behavior can be named
in a WINDOW clause and then referenced in OVER. For example:
SELECT sum(salary) OVER w, avg(salary) OVER w
FROM empsalary
WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);
I don't know if this feature is part of the SQL standard or not, but I know that SQL Server doesn't support it.
So, your query would look like this:
select
lag(somecol) over w as prev_1,
lag(somecol,2) over w as prev_2,
lag(somecol,3) over w as prev_3,
othercol,
...
from
...
WINDOW w AS (partition by thing_id)
;
Regarding your second question how to "dynamically specify the number n of prev cols" - you'll need to generate the text of SELECT statement dynamically to achieve that. RDBMS assume stable schema, i.e. the number of columns in tables and queries is usually fixed, not dynamic.

Related

SQLite alias (AS) not working in the same query

I'm stuck in an (apparently) extremely trivial task that I can't make work , and I really feel no chance than to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records while wanting to make some math operations on the fetched columns. I recall (and re-read some documentation and examples) that the keyword "AS" is going to conveniently rename (alias) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order" ... Query failed. Well, no worries, I had just put the original tab.field name and it worked.
Going further, I have a complex calculation and I want to use its result on the next field, but the same issue popped out: if I alias "ROUND(...)" AS Nearest5, and then try to use this value by multiplying it in the next field, the query will fail.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) thing for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases on the same query or I am using a wrong syntax and I am just too stuck/confused to see the mistake (which I'm sure it has to be really stupid).
Column aliases defined in a SELECT cannot be used:
For other expressions in the same SELECT.
For filtering in the WHERE.
For conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
This is how SQL works. The issue is two things:
The logic order of processing clauses in the query (i.e. how they are compiled). This affects the scoping of parameters.
The order of processing expressions in the SELECT. This is indeterminate. There is no requirement for the ordering of parameters.
For a simple example, what should x refer to in this example?
select x as a, y as x
from t
where x = 2;
By not allowing duplicates, SQL engines do not have to make a choice. The value is always t.x.
You can try with nested queries.
A SELECT query can be nested in another SELECT query within the FROM clause;
multiple queries can be nested, for example by following the following pattern:
SELECT *,[your last Expression] AS LastExp From (SELECT *,[your Middle Expression] AS MidExp FROM (SELECT *,[your first Expression] AS FirstExp FROM yourTables));
Obviously, respecting the order that the expressions of the innermost select query can be used by subsequent select queries:
the first expressions can be used by all other queries, but the other intermediate expressions can only be used by queries that are further upstream.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice FROM (SELECT *, ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5 FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category, Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer, Inventory.price AS PRC, UsedItems.qty_used*X AS To_Order FROM Inventory LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid WHERE UsedItems.pid='1' ORDER BY RefDes, value ASC))

How to shuffle array in PostgreSQL 9.6 and also lower versions?

The following custom stored function -
CREATE OR REPLACE FUNCTION words_shuffle(in_array varchar[])
RETURNS varchar[] AS
$func$
SELECT array_agg(letters.x) FROM
(SELECT UNNEST(in_array) x ORDER BY RANDOM()) letters;
$func$ LANGUAGE sql STABLE;
was shuffling character array in PostgreSQL 9.5.3:
words=> select words_shuffle(ARRAY['a','b','c','d','e','f']);
words_shuffle
---------------
{c,d,b,a,e,f}
(1 row)
But now after I have switched to PostgreSQL 9.6.2 the function stopped working:
words=> select words_shuffle(ARRAY['a','b','c','d','e','f']);
words_shuffle
---------------
{a,b,c,d,e,f}
(1 row)
Probably because the ORDER BY RANDOM() stopped working:
words=> select unnest(ARRAY['a','b','c','d','e','f']) order by random();
unnest
--------
a
b
c
d
e
f
(6 rows)
I am looking please for a better method to shuffle character array, which would work in the new PostgreSQL 9.6, but also in 9.5.
I need it for my word game in development, which uses Pl/PgSQL functions.
UPDATE:
Reply by Tom Lane:
Expansion of SRFs in the targetlist now happens after ORDER BY.
So the ORDER BY is sorting a single dummy row and then the unnest
happens after that. See
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=9118d03a8
Generally, a set returning function should be placed in FROM clause:
select array_agg(u order by random())
from unnest(array['a','b','c','d','e','f']) u
array_agg
---------------
{d,f,b,e,c,a}
(1 row)
For the documentation (emphasis added):
Currently, functions returning sets can also be called in the select list of a query. For each row that the query generates by itself, the function returning set is invoked, and an output row is generated for each element of the function's result set. Note, however, that this capability is deprecated and might be removed in future releases.
No doubt, this is a change and due to some "improvement" in the optimizer. Given that the documentation sort of says that this works, it is frustrating.
However, I would suggest that you not depend on the subquery:
SELECT array_agg(letters.x ORDER BY random())
FROM UNNEST(in_array) l(x);
This should also work in order versions of Postgres.
The documentation says:
Alternatively, supplying the input values from a sorted subquery will
usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
But this syntax is not allowed in the SQL standard, and is not
portable to other database systems.
(I freely admit that "will usually work" is not a guarantee. But having a substandard code sample in the documentation is really misleading. Why doesn't it show the correct sample using the ORDER BY clause in the aggregation function?)
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html

Using part of the select clause without rewriting it

I am using an Oracle SQL Db and I am trying to count the number of terms starting with X letter in a dictionnary.
Here is my query :
SELECT Substr(Lower(Dict.Term),0,1) AS Initialchar,
Count(Lower(Dict.Term))
FROM Dict
GROUP BY Substr(Lower(Dict.Term),0,1)
ORDER BY Substr(Lower(Dict.Term),0,1);
This query is working as expected, but the thing that I'm not really happy about is the fact that I have to rewrite the long "Substr(Lower(Dict.Term),0,1)" in the GROUP BY and ORDER BY clause. Is there any way to reuse the one I defined in the SELECT part ?
Thanks
You can use a subquery. Because Oracle follows the SQL standard, substr() starts counting at 1. Although Oracle does explicitly allow 0 ("If position is 0, then it is treated as 1"), I find it misleading because "0" and "1" refer to the same position.
So:
select first_letter, count(*)
from (select d.*, substr(lower(d.term), 1, 1) as first_letter
from dict d
) d
group by first_letter
order by first_letter;
Not directly. The output columns can only be referred to in the ORDER BY clause, but not used in any other way. The only way would be to make it into a subselect, but it wouldn't be any clearer and might cause issues with performance.
I prefer subquery factoring for this purpose.
with init as (
select substr(lower(d.term), 1, 1) as Initialchar
from dict d)
select Initialchar, count(*)
from init
group by Initialchar
order by Initialchar;
Contrary to opposite meaning, IMO this makes the query much clearer and defines natural order; especially while using more subqueries.
I'm not aware about performance caveats, but there are some limitation, such as it not possible to use with clause within another with clause: ORA-32034: unsupported use of WITH clause.

Output of CSUM() in teradata

Can anyone please help me in unstanding below csum function.
What will be the output in each case.
csum(1,1),
csum(1,1) + emp_no
csum(1,emp_no)+emp_no
CSUM is an old deprecated function from V2R3, over 15 years ago. It can always be rewritten using newer Standard SQL compliant syntax.
CSUM(1,1) returns the same as ROW_NUMBER() OVER (ORDER BY 1), a sequence starting with 1.
But you should never use it like that as ORDER BY 1 within a Windowed Aggregate Function is not the same as the final ORDER BY 1 of a SELECT, it's ordering all rows by the same value 1. Teradata calculates those functions in parallel based on the values in PARTITION BY and ORDER BY, this means all rows with the same PARTITION/ORDER data are processed on a single AMP, if there's only a single value one AMP will process all rows, resulting in a totally skewed distribution.
Instead of ORDER BY 1 you should use a column which is more or less unique in best case.
csum(1,emp_no)+emp_no is probably used with another SELECT to get the current maximum value of a column and add the new sequential values to it, i.e. creating your own gap-less sequence numbers.
This is the best way to do it:
SELECT ROW_NUMBER() OVER (ORDER BY column(s)_with_a_low_number_of_rows_per_value)
+ COALESCE((SELECT MAX(seqnum) FROM table),0)
,....
FROM table

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and than calculate some aggregations (count, sum, etc,) of their attributes.
When i run my queries on a small sample of data, everything is fine and the group by runs OK. but when running on larger data set, i get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY do the work...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
here is a sample query based of wikipedia sample data. it shows the frequency of title editing by different contributors. the where condition is just to limit response size, if you remove the "B" we get results, if we add it we got the "use EACH" recomendation.
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true)
group by title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlight an OVER() limitation: As GROUP BY and ORDER BY, it only works when data fits in one machine, as they are not parallelizable. As the answer to r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)