Selecting positive aggregate value and ignoring negative in Postgres SQL - sql

I must apply a certain transformation fn(argument). Normally argument is equal to value, but not when value is negative. When the first negative value arrives, you "wait" until it sums up with consecutive values and the running sum becomes positive again; then you apply fn(argument). See the table I want to get:
value | argument
------+---------
    2 |        2
    3 |        3
  -10 |        0
    4 |        0
    3 |        0
   10 |        7
    1 |        1
I could have summed all values and applied fn to the sum, but fn can differ from row to row, and it is essential to know the row number to choose the concrete fn.
Since I want a PostgreSQL solution, window functions look like a fit, but I am not experienced enough to write the expression that does this yet. In fact, I am still new to "thinking in SQL", unfortunately. I guess this could easily be done imperatively, but I do not want to write a stored procedure yet.

I suppose I'm late, but this may help someone:
select
value,
greatest(0, value) as argument
from your_table;

This doesn't really fit any of the predefined aggregation functions. You probably need to write your own. Note that in PostgreSQL, aggregate functions can be used as window functions, and in fact that is the only way to write window functions in anything other than C, as of 9.0.
You can write a function that tracks the state of "summing" the values, except that it always returns the input value when the current "sum" is positive, and keeps adding while the "sum" is negative. Then you simply need to take the greater of this sum or zero. To wit:
-- accumulator function: first arg is state, second arg is input
create or replace function ouraggfunc(int, int)
returns int immutable language plpgsql as $$
begin
    raise info 'ouraggfunc: %, %', $1, $2; -- to help you see what's going on

    -- get started by returning the first value ($1 is null - no state - first row)
    if $1 is null then
        return $2;
    end if;

    -- if our state is negative, we're summing until it becomes positive
    -- otherwise, we're just returning the input
    if $1 < 0 then
        return $1 + $2;
    else
        return $2;
    end if;
end;
$$;
You need to create an aggregate function to invoke this accumulator:
create aggregate ouragg(basetype = int, sfunc = ouraggfunc, stype = int);
This defines that the aggregate takes integers as input and stores its state as an integer.
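For reference, basetype = ... is the older CREATE AGGREGATE syntax; on current PostgreSQL versions the same aggregate would more commonly be written as:
create aggregate ouragg(int) (
    sfunc = ouraggfunc,
    stype = int
);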
I copied your example into a table:
steve#steve#[local] =# create table t(id serial primary key, value int not null, argument int not null);
NOTICE: CREATE TABLE will create implicit sequence "t_id_seq" for serial column "t.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_pkey" for table "t"
CREATE TABLE
steve#steve#[local] =# copy t(value, argument) from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 2 2
>> 3 3
>> -10 0
>> 4 0
>> 3 0
>> 10 7
>> 1 1
>> \.
And you can now have those values produced by using the aggregate function with a window clause:
steve#steve#[local] =# select value, argument, ouragg(value) over(order by id) from t;
INFO: ouraggfunc: <NULL>, 2
INFO: ouraggfunc: 2, 3
INFO: ouraggfunc: 3, -10
INFO: ouraggfunc: -10, 4
INFO: ouraggfunc: -6, 3
INFO: ouraggfunc: -3, 10
INFO: ouraggfunc: 7, 1
 value | argument | ouragg
-------+----------+--------
     2 |        2 |      2
     3 |        3 |      3
   -10 |        0 |    -10
     4 |        0 |     -6
     3 |        0 |     -3
    10 |        7 |      7
     1 |        1 |      1
(7 rows)
So as you can see, the final step is that you need to take the output of the function if it is positive, or zero otherwise. This can be done by wrapping the query, or by writing a function to do that:
create function positive(int) returns int immutable strict language sql as
$$ select case when $1 > 0 then $1 else 0 end $$;
and now:
select value, argument, positive(ouragg(value) over(order by id)) as raw_agg from t
This produces the arguments for the function that you specified in the question.
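Alternatively, instead of defining positive() you can wrap the window query and clamp with the built-in greatest(); a sketch of that wrapped-query variant (the running and fn_argument aliases are just illustrative):
select value, argument, greatest(running, 0) as fn_argument
from (
    select value, argument, ouragg(value) over (order by id) as running
    from t
) s;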

Related

Accessing specific row value by id in a scalar SQL function

How can I use a specific value of the field 'ranking' from the table 'course', looked up by 'course_id', in a function? I need to return 1 if the course's ranking is higher than the value given as a parameter and 0 if it is lower. So, I need to get the data from the table somehow, based on course_id as a parameter.
CREATE OR ALTER FUNCTION dbo.f_rank (@course_id INT, @user_ranking INT)
RETURNS INT
AS
BEGIN
    RETURN
        CASE
            WHEN @course_id.ranking > @user_ranking THEN 1
            ELSE 0
        END
END
After the function returns 0 or 1 I need to display:
If the function call returns 1, display 'Ranking of <course_name> is above score'; otherwise display 'Ranking of <course_name> is below score'.
Sample data:
course_id | course_name | ranking
1 | GIT | 10
2 | CSS | 2
3 | C++ | 6
I need to compare the ranking of a course (for example course_id = 1, whose ranking is 10) with the number given as a parameter. course_id is also given as a parameter.
For example:
If the user chooses as input params (course_id = 1 and user_ranking = 5)
Expected result:
'Ranking of GIT is above score' - if function returns 1
'Ranking of GIT is below score' - if function returns 0
I assume that you probably want something like this (no expected results were given when requested):
-- Note: you will have to DROP your old function first if it already exists as a scalar function.
-- You cannot ALTER a scalar function into a table function.
CREATE OR ALTER FUNCTION dbo.f_rank (@course_id INT, @user_ranking INT)
RETURNS table
AS RETURN
    SELECT CONVERT(bit, CASE WHEN C.ranking > @user_ranking THEN 1 ELSE 0 END) AS SomeColumnAlias -- Obviously give this a proper name
    FROM dbo.Course C
    WHERE C.Course_id = @course_id;
GO
As I mentioned, I used an inline table-valued function, as this will likely be more performant, and you don't mention your version of SQL Server (so I don't know if you are on 2019 and could therefore use an inlineable scalar function).
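For the display requirement in the question, a possible usage sketch of that inline table-valued function (SomeColumnAlias is the column name from the answer above; course_id = 1 and 5 are the sample parameters from the question):
SELECT C.course_name,
       CASE WHEN R.SomeColumnAlias = 1
            THEN CONCAT('Ranking of ', C.course_name, ' is above score')
            ELSE CONCAT('Ranking of ', C.course_name, ' is below score')
       END AS message
FROM dbo.Course C
CROSS APPLY dbo.f_rank(C.course_id, 5) AS R
WHERE C.course_id = 1;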

Print limited number of elements in collect_set array using printf function

I want to printf() just the first 3 patients in collect_set() of patient numbers.
A. I have created "patient_list" using collect_set
collect_set(distinct patient_seq) AS patient_list
which yields arrays of patient numbers of varying length (4, 5 or 6 digits)
Example:
["16189","26599","406622","419117","5551"]
["223587","224663","232072","326504","433430","436673","54540","58188","74118"]
B. I then stripped out the commas and quotes and separated by '*' (in order to grab just the first 3 patients, in the next step):
concat_ws('*', patient_list) AS pat_list
This produces:
16189*26599*406622*419117*5551
223587*224663*232072*326504*433430*436673*54540*58188*74118
C. I tried to use SUBSTRING_INDEX() to create a new variable (pat_list_short) containing just the first 3 patients, but this function is not supported in hive 1.1.0 (not supported until 1.3.0).
substring_index(pat_list, '*', 3) AS pat_list_short
What other option do I have?
I want to feed pat_list_short into PRINTF using %s in order to print out just the first three patient numbers for the review team. Since the patient numbers vary in length, I can't just limit the print to a certain length.
Thanks
Using the data you provided
key | pat_id
----+--------
  1 | 16189
  1 | 26599
  1 | 406622
  1 | 419117
  1 | 5551
  2 | 223587
  2 | 224663
  2 | 232072
  2 | 326504
  2 | 433430
  2 | 436673
  2 | 54540
  2 | 58188
  2 | 74118
you can use the Brickhouse TruncateArrayUDF to truncate an array to a desired length. There are instructions on the project's main page on how to build and use the jar.
Query:
add jar /path/to/jar/brickhouse-0.7.1.jar;
create temporary function trunc_array as 'brickhouse.udf.collect.TruncateArrayUDF';

select key
     , concat_ws(' ', trunc_array(collect_set(pat_id), 3)) as pat_list_short
from db.tbl
group by key;
Output:
key | pat_list_short
----+---------------------
  1 | 5551 26599 16189
  2 | 232072 58188 223587
I must admit I'm a bit unclear as to how printf() plays a part in this problem, as the query returns a result and prints it. It is also worth noting that in your query in A, the distinct in collect_set(distinct) is redundant, as collect_set's purpose is to collect distinct elements.
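If adding the Brickhouse jar is not an option, a hedged alternative is to index the collected array directly in an outer query (this assumes patient_seq is a string column and that Hive's concat_ws skips NULLs when a key has fewer than three patients; worth verifying on 1.1.0):
select key,
       concat_ws(' ', pat_list[0], pat_list[1], pat_list[2]) as pat_list_short
from (
    select key, collect_set(patient_seq) as pat_list
    from db.tbl
    group by key
) s;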

Alternating true/false flag for current time given step

I have a query being executed every X milliseconds. As part of the result I would like to have an alternating true/false flag. This flag should change whenever the query is executed again.
Example
Sample query: select 1, <<boolean alternator>>;
1st execution returns: 1 | true
2nd execution returns: 1 | false
3rd execution returns: 1 | true
and so on. It does not matter if it returns true or false for the first time.
For cases when X is an odd number of seconds I have the following solution:
select
mod(right(extract(epoch from current_timestamp)::int::varchar,1)::int, 2) = 0
as alternator
This extracts the last digit from the epoch and then tests whether it is an even number. Because X is defined as an odd number of seconds, this test alternates from one execution to the next.
What would work in the same way when X is different: even, or not a whole number of seconds? I would like to make it work for X like 500 ms, 1200 ms, 2000 ms, ...
Note: I plan to use this with PostgreSQL.
I suggest a dedicated SEQUENCE.
CREATE SEQUENCE tf_seq MINVALUE 0 MAXVALUE 1 START 0 CYCLE;
Each call to nextval() returns 0 and 1 alternately. You can cast to boolean:
0::bool = FALSE
1::bool = TRUE
So:
SELECT nextval('tf_seq'::regclass)::int::bool;
To keep other roles from messing with the state of the sequence, only
GRANT USAGE ON SEQUENCE tf_seq TO $dedicated_role;. Run your query as that role or create a function with SECURITY DEFINER and ALTER FUNCTION foo() OWNER TO $dedicated_role;
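A minimal sketch of such a wrapper function (flip_flag and dedicated_role are placeholder names, not from the answer):
CREATE FUNCTION flip_flag() RETURNS boolean
LANGUAGE sql SECURITY DEFINER AS
$$ SELECT nextval('tf_seq')::int::bool $$;
ALTER FUNCTION flip_flag() OWNER TO dedicated_role;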
Or, simpler yet, just make it the column default and completely ignore it in your inserts:
ALTER TABLE foo ALTER COLUMN bool_col
SET DEFAULT nextval('tf_seq'::regclass)::int::bool;
You need to grant USAGE on the sequence to roles that can insert.
Every next row gets the flipped value automatically.
The usual notes for sequences apply. Like, if you roll back an INSERT, the sequence stays flipped. Sequence states are never rolled back.
A temporary table can save the boolean state:
create temporary table t (b boolean);
insert into t (b) values (true);

with u as (
    update t
    set b = not b
    returning b
)
select 1, b
from t;
I would do the same (boolean flipping with not b) with a pltcl function.
Advantage: variable is cached in your session tcl interpreter. Disadvantage: Any pl??? function has a call overhead.
Test table, insert some values:
strobel=# create table test (i integer);
CREATE TABLE
strobel=# insert into test (select * from generate_series(1, 10));
INSERT 0 10
The function:
create or replace function flipper () returns boolean as $$
if {![info exists ::flag]} {set ::flag false}
return [set ::flag [expr {! $::flag}]]
$$ language pltcl volatile;
The test:
strobel=# select *, flipper() from test;
i | flipper
----+---------
1 | t
2 | f
3 | t
4 | f
5 | t

How to label a big set of “transitive groups” with a constraint?

EDIT after @NealB's solution: @NealB's solution is very, very fast compared with any other and makes this new question about "adding a constraint to improve performance" unnecessary. @NealB's solution needs no improvement; it runs in O(n) time and is very simple.
The problem of "labelling transitive groups with SQL" has an elegant solution using recursion and a CTE... But that solution takes exponential time (!). I need to work with 10000 items: 1000 items need 1 second, 2000 items need 1 day...
Constraint: in my case it is possible to break the problem into pieces of ~100 items or less, but only to select one group of ~10 items and discard all the other ~90 labelled items...
Is there a generic algorithm to add and use this kind of "pre-selection" to reduce the quadratic, O(N^2), time? Perhaps, as suggested by the comments and @wildplasser, to O(N log(N)) time; but I hope that with the "pre-selection" it can be reduced to O(N) time.
(EDIT)
I tried an alternative algorithm, but it needs some improvement to be used as a solution here; or, to really increase performance (to O(N) time), it needs to use the "pre-selection".
The "pre-selection" (constraint) is based on a "super-set grouping"... Starting from the t1 table of the original "How to label 'transitive groups' with SQL?" question,
table T1
(original T1 augmented with a "super-set grouping label" ssg, and one more row)
ID1 | ID2 | ssg
1 | 2 | 1
1 | 5 | 1
4 | 7 | 1
7 | 8 | 1
9 | 1 | 1
10 | 11 | 2
So there are three groups,
g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2: {4,7,8} because "4 t 7" and "7 t 8"
g3: {10,11} because "10 t 11"
The super-group is only an auxiliary grouping,
ssg1: {g1,g2}
ssg2: {g3}
If we have M super-group items and N total T1 items, the average group length will be less than N/M. We can also suppose (for my typical problem) that the maximum ssg length is ~N/M.
So, the "label algorithm" needs to run only M times with ~N/M items each if it uses the ssg constraint.
An SQL-only solution appears to be a bit of a problem here. With the help of some procedural programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline of a solution as it could be implemented using any procedural language invoking SQL.
Declare table R with primary key ID, where ID corresponds to the same domain as ID1 and ID2 of table T1.
Table R contains one other non-key column, a Label number
Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:
Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0
Using a host language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:
Case 1: ID1.Label == 0 and ID2.Label == 0
In this case neither one of these IDs have been "seen" before: Add 1 to the counter and then update both
rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)
Case 2: ID1.Label == 0 and ID2.Label <> 0
In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned to the
same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1
Case 3: ID1.Label <> 0 and ID2.Label == 0
In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned to the
same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2
Case 4: ID1.Label <> 0 and ID2.Label <> 0
In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not,
there is some sort of data integrity problem. Ahhhh... not quite, see the edit below...
EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different then two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged with the same Label value.
Upon completion of the cursor, table R will contain Label values needed to update T2.
Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1
Process table T2
using something along the lines of: set T2.Label to R.Label where T2.ID1 = R.ID. The end result should be:
table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1
This process is purely iterative and should scale to fairly large tables without difficulty.
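For concreteness, here is a minimal plpgsql sketch of that cursor logic, assuming tables t1(id1, id2) and r(id, label) set up as described above (an illustration of the outline, not the answerer's original code):
create or replace function label_groups() returns void language plpgsql as $$
declare
    rec record;
    l1 int;
    l2 int;
    counter int := 0;
begin
    for rec in select id1, id2 from t1 loop
        select label into l1 from r where id = rec.id1;
        select label into l2 from r where id = rec.id2;
        if l1 = 0 and l2 = 0 then          -- case 1: neither ID seen before
            counter := counter + 1;
            update r set label = counter where id in (rec.id1, rec.id2);
        elsif l1 = 0 then                  -- case 2: ID1 is new
            update r set label = l2 where id = rec.id1;
        elsif l2 = 0 then                  -- case 3: ID2 is new
            update r set label = l1 where id = rec.id2;
        elsif l1 <> l2 then                -- case 4: merge the two label groups
            update r set label = l1 where label = l2;
        end if;
    end loop;
end;
$$;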
I suggest you check this and use some
general-purpose language for solving it.
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Traverse the graph, maybe run DFS or BFS from each node,
then use this disjoint set hint. I think this should work.
The @NealB solution is the fastest(!) See an example of a PostgreSQL implementation here.
Below is an example of another "brute force" algorithm, only for curiosity!
As @peter.petrov and @RBarryYoung suggested, some performance problems can be avoided by abandoning the CTE recursion... I fixed some issues in the basic labeller and, on top of it, added the constraint of grouping by a super-set label. This new transgroup1_loop() function is working!
PS: this solution still has performance limitations; please post your answer with a better one, or with some adaptation of this one.
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
  id serial NOT NULL PRIMARY KEY,
  items integer[],        -- two or more items in the transitive relationship
  ssg_label varchar(12),  -- the super-set grouping label
  dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items,ssg_label) values
(array[1, 2],'1'),
(array[1, 5],'1'),
(array[4, 7],'1'),
(array[7, 8],'1'),
(array[9, 1],'1'),
(array[10, 11],'2');
-- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items
Then, with these two functions, we can solve the problem:
CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
  RETURNS integer AS $funcBody$
DECLARE
  cp_dels integer[];
  i integer;
BEGIN
  i := 1;
  LOOP
    UPDATE transgroup1
    SET items = array_uunion(transgroup1.items, t2.items),
        dels  = transgroup1.dels || t2.id
    FROM transgroup1 AS t1, transgroup1 AS t2
    WHERE transgroup1.id = t1.id AND t1.ssg_label = $1 AND
          t1.id > t2.id AND t1.items && t2.items;

    cp_dels := array(
      SELECT DISTINCT unnest(dels) FROM transgroup1
    ); -- ensures all items to delete

    RAISE NOTICE '-- bug, repeating dels, item-%; % dels! %', i,
                 array_length(cp_dels,1), array_to_string(cp_dels,';','*');

    EXIT WHEN i > p_max_i OR array_length(cp_dels,1) = 0;

    DELETE FROM transgroup1
    WHERE ssg_label = $1 AND id IN (SELECT unnest(cp_dels));

    UPDATE transgroup1 SET dels = array[]::integer[];
    i := i + 1;
  END LOOP;

  UPDATE transgroup1 -- only to beautify
  SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc);

  RETURN i;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
To run it and see the results, you can use:
SELECT transgroup1_loop('1'); -- run with ssg-1 items only
SELECT transgroup1_loop('2'); -- run with ssg-2 items only
-- show all with a sequential group label:
SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;
results:
id | items | ssg_label | dels | group_label
----+-----------+-----------+------+-------------
4 | {8,7,4} | 1 | {} | 1
5 | {9,5,2,1} | 1 | {} | 2
6 | {11,10} | 2 | {} | 3
PS: the function array_uunion() is the same as the original one,
CREATE FUNCTION array_uunion(anyarray, anyarray) RETURNS anyarray AS $$
  -- ensures distinct items in a concatenation
  SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;

How to avoid multiple function evals with the (func()).* syntax in a query?

Context
When a function returns a TABLE or a SETOF composite-type, like this one:
CREATE FUNCTION func(n int) returns table(i int, j bigint) as $$
BEGIN
    RETURN QUERY select 1, n::bigint
       union all select 2, n*n::bigint
       union all select 3, n*n*n::bigint;
END
$$ language plpgsql;
the results can be accessed by various methods:
select * from func(3) will produce these output columns :
i | j
---+---
1 | 3
2 | 9
3 | 27
select func(3) will produce only one output column of ROW type.
func
-------
(1,3)
(2,9)
(3,27)
select (func(3)).* will produce the same output as #1:
i | j
---+---
1 | 3
2 | 9
3 | 27
When the function argument comes from a table or a subquery, the syntax #3 is the only possible one, as in:
select N, (func(N)).*
from (select 2 as N union select 3 as N) s;
or as in this related answer. If we had LATERAL JOIN we could use that, but until PostgreSQL 9.3 is out, it's not supported, and the previous versions will still be used for years anyway.
Problem
The problem with syntax #3 is that the function is called as many times as there are columns in the result. There's no apparent reason for that, but it happens.
We can see it in version 9.2 by adding a RAISE NOTICE 'called for %', n in the function. With the query above, it outputs:
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 3
NOTICE: called for 3
Now, if the function is changed to return 4 columns, like this:
CREATE FUNCTION func(n int) returns table(i int, j bigint,k int, l int) as $$
BEGIN
raise notice 'called for %', n;
RETURN QUERY select 1,n::bigint,1,1
union all select 2,n*n::bigint,1,1
union all select 3,n*n*n::bigint,1,1;
END
$$ language plpgsql stable;
then the same query outputs:
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 3
NOTICE: called for 3
NOTICE: called for 3
NOTICE: called for 3
2 function calls were needed, 8 were actually made. The ratio is the number of output columns.
With syntax #2, which produces the same result except for the output column layout, these multiple calls don't happen:
select N, func(N)
from (select 2 as N union select 3 as N) s;
gives:
NOTICE: called for 2
NOTICE: called for 3
followed by the 6 resulting rows:
n | func
---+------------
2 | (1,2,1,1)
2 | (2,4,1,1)
2 | (3,8,1,1)
3 | (1,3,1,1)
3 | (2,9,1,1)
3 | (3,27,1,1)
Questions
Is there a syntax or a construct with 9.2 that would achieve the expected result by doing only the minimum required function calls?
Bonus question: why do the multiple evaluations happen at all?
You can wrap it up in a subquery but that's not guaranteed safe without the OFFSET 0 hack. In 9.3, use LATERAL. The problem is caused by the parser effectively macro-expanding * into a column list.
Workaround
Where:
SELECT (my_func(x)).* FROM some_table;
will evaluate my_func n times for n result columns from the function, whereas this formulation:
SELECT (mf).* FROM (
SELECT my_func(x) AS mf FROM some_table
) sub;
generally will not, and tends not to add an additional scan at runtime. To guarantee that multiple evaluation won't be performed you can use the OFFSET 0 hack or abuse PostgreSQL's failure to optimise across CTE boundaries:
SELECT (mf).* FROM (
SELECT my_func(x) AS mf FROM some_table OFFSET 0
) sub;
or:
WITH tmp(mf) AS (
SELECT my_func(x) FROM some_table
)
SELECT (mf).* FROM tmp;
In PostgreSQL 9.3 you can use LATERAL to get a saner behaviour:
SELECT mf.*
FROM some_table
LEFT JOIN LATERAL my_func(some_table.x) AS mf ON true;
LEFT JOIN LATERAL ... ON true retains all rows like the original query, even if the function call returns no row.
Demo
Create a function that isn't inlineable as a demonstration:
CREATE OR REPLACE FUNCTION my_func(integer)
RETURNS TABLE(a integer, b integer, c integer) AS $$
BEGIN
    RAISE NOTICE 'my_func(%)', $1;
    RETURN QUERY SELECT $1, $1, $1;
END;
$$ LANGUAGE plpgsql;
and a table of dummy data:
CREATE TABLE some_table AS SELECT x FROM generate_series(1,10) x;
then try the above versions. You'll see that the first form raises three notices per input row; the others raise only one.
Why?
Good question. It's horrible.
It looks like:
(func(x)).*
is expanded as:
(func(x)).i, (func(x)).j, (func(x)).k, (func(x)).l
in parsing, according to a look at debug_print_parse, debug_print_rewritten and debug_print_plan. The (trimmed) parse tree looks like this:
:targetList (
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 1
:resulttype 23
:resulttypmod -1
:resultcollid 0
}
:resno 1
:resname i
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 2
:resulttype 20
:resulttypmod -1
:resultcollid 0
}
:resno 2
:resname j
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 3
:...
}
:resno 3
:resname k
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 4
...
}
:resno 4
:resname l
...
}
)
So basically, we're using a dumb parser hack to expand wildcards by cloning nodes.
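A quicker way to see the expansion for yourself, using the demo objects above, is EXPLAIN VERBOSE; the plan's Output line lists the function once per column, something like:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT (my_func(x)).* FROM some_table;
-- Output: (my_func(x)).a, (my_func(x)).b, (my_func(x)).c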