Compare results of two table functions using one column from each - sql

According to the instructions here I have created two functions that use EXECUTE FORMAT and return the same table of (int, smallint).
Sample definitions:
CREATE OR REPLACE FUNCTION function1(IN _tbl regclass, IN _tbl2 regclass,
IN field1 integer)
RETURNS TABLE(id integer, dist smallint)
CREATE OR REPLACE FUNCTION function2(IN _tbl regclass, IN _tbl2 regclass,
IN field1 integer)
RETURNS TABLE(id integer, dist smallint)
Both functions return the exact same number of rows. Sample result (will always be ordered by dist):
(49,0)
(206022,3)
(206041,3)
(92233,4)
Is there a way to compare the values of the second field between the two functions for the same rows, to ensure that both results are the same?
For example:
SELECT
function1('tblp1','tblp2',49),function2('tblp1_v2','tblp2_v2',49)
Returns something like:
(49,0) (49,0)
(206022,3) (206022,3)
(206041,3) (206041,3)
(92233,4) (133,4)
Although I am not expecting identical results (each function is a topK query, and I have ties which are broken arbitrarily, plus some optimizations in the second function for faster performance), I can ensure that both functions return correct results if, for each row, the second numbers in the results are the same. In the example above, I can verify I get correct results, because:
1st row 0 = 0,
2nd row 3 = 3,
3rd row 3 = 3,
4th row 4 = 4
despite the fact that for the 4th row, 92233!=133
Is there a way to get only the 2nd field of each function result, to batch compare them e.g. with something like:
SELECT COUNT(*)
FROM
(SELECT
function1('tblp1','tblp2',49).field2,
function2('tblp1_v2','tblp2_v2',49).field2 ) n2
WHERE function1('tblp1','tblp2',49).field2 != function2('tblp1_v2','tblp2_v2',49).field2;
I am using PostgreSQL 9.3.

Is there a way to get only the 2nd field of each function result, to batch compare them?
All of the following answers assume that rows are returned in matching order.
Postgres 9.3
Using the quirky legacy feature that expands rows from multiple set-returning functions (SRF) in the SELECT list in parallel when they return the same number of rows:
SELECT count(*) AS mismatches
FROM (
SELECT function1('tblp1','tblp2',49) AS f1
, function2('tblp1_v2','tblp2_v2',49) AS f2
) sub
WHERE (f1).dist <> (f2).dist; -- note the parentheses!
The parentheses around the row type are necessary to disambiguate from a possible table reference. Details in the manual here.
If the two functions return different numbers of rows, the rows are cycled up to the least common multiple of both counts instead of being paired one-to-one (which would break this completely for you).
Postgres 9.4
WITH ORDINALITY to generate row numbers on the fly
You can use WITH ORDINALITY to generate a row number on the fly, so you don't need to depend on pairing the results of SRF functions in the SELECT list:
SELECT count(*) AS mismatches
FROM function1('tblp1','tblp2',49) WITH ORDINALITY AS f1(id,dist,rn)
FULL JOIN function2('tblp1_v2','tblp2_v2',49) WITH ORDINALITY AS f2(id,dist,rn) USING (rn)
WHERE f1.dist IS DISTINCT FROM f2.dist;
This works for the same number of rows from each function as well as differing numbers (which would be counted as mismatch).
Related:
PostgreSQL unnest() with element number
ROWS FROM to join sets row-by-row
SELECT count(*) AS mismatches
FROM ROWS FROM (function1('tblp1','tblp2',49)
, function2('tblp1_v2','tblp2_v2',49)) t(id1, dist1, id2, dist2)
WHERE t.dist1 IS DISTINCT FROM t.dist2;
Related answer:
Is it possible to answer queries on a view before fully materializing the view?
Aside:
EXECUTE FORMAT is not a single plpgsql feature. RETURN QUERY EXECUTE is the plpgsql command; format() is just a convenient function for building the query string and can be used anywhere in SQL or plpgsql.
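For illustration, a minimal sketch of how such a function might be put together. The query body, join, and column names here are invented, not the asker's actual logic; the point is only how format() and RETURN QUERY EXECUTE combine:

```sql
CREATE OR REPLACE FUNCTION function1(_tbl regclass, _tbl2 regclass, field1 integer)
  RETURNS TABLE (id integer, dist smallint) AS
$func$
BEGIN
   -- format() builds the query string; %s interpolates the
   -- pre-quoted regclass values safely. RETURN QUERY EXECUTE
   -- runs the string; USING passes field1 as $1.
   RETURN QUERY EXECUTE format(
      'SELECT t.id, t.dist
       FROM   %s t
       JOIN   %s t2 USING (id)
       WHERE  t.id <> $1
       ORDER  BY t.dist'
    , _tbl, _tbl2)
   USING field1;
END
$func$ LANGUAGE plpgsql;
```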

The order in which the rows are returned from the functions is not guaranteed. If you can return row_number() (rn in the example below) from the functions, then:
select
count(f1.dist is distinct from f2.dist or null) as diff_count
from
function1('tblp1','tblp2',49) f1
full join
function2('tblp1_v2','tblp2_v2',49) f2 using (rn)

For future reference:
Checking difference in number of rows:
SELECT
ABS(count(f1a.*)-count(f2a.*))
FROM
(SELECT f1.dist, row_number() OVER(ORDER BY f1.dist) rn
FROM
function1('tblp1','tblp2',49) f1)
f1a FULL JOIN
(SELECT f2.dist, row_number() OVER(ORDER BY f2.dist) rn
FROM
function2('tblp1_v2','tblp2_v2',49) f2) f2a
USING (rn);
Checking difference in dist for same ordered rows:
SELECT
COUNT(*)
FROM
(SELECT f1.dist, row_number() OVER(ORDER BY f1.dist) rn
FROM
function1('tblp1','tblp2',49) f1)
f1a,
(SELECT f2.dist, row_number() OVER(ORDER BY f2.dist) rn
FROM
function2('tblp1_v2','tblp2_v2',49) f2) f2a
WHERE f1a.rn = f2a.rn
AND f1a.dist <> f2a.dist;
A simple OVER() might also work, since the results of the functions are already ordered, but the ORDER BY is added as an extra check.

Related

Adding a single static column to SQL query results

I have a pretty big query (no pun intended) written out in BigQuery that returns about 5 columns. I simply want to append an extra column to it that is not joined to any other table and just returns a single word in every row. As if to be an ID for the entire table.
Just wrap the original select and add the new constant, or add it into the original query. The answer could be more precise if you put your query and expected result into your question.
select q.*, 'JOHN' as new_column
from ( <your_big_query> ) q
previous (now unrelated) answer follows
You can use row_number window function:
select q.*, row_number() over (order by null) as id
from ( <your_big_query> ) q
It returns values 1,2, etc.
Depending on how complicated your query is, the row_number could be inlined directly into your query.
If all you want is one static column, just add an extra static column at the end of your existing select columns list.
select {ALL_COLUMNS_YOU_ARE_JOINING_COMPUTING_ETC}, 'something' as your_new_static_col from {YOUR_QUERY}
This static column does not need to be a string, it can be an int or some other type.

How to call a function using the results of a query ordered

I'm trying to call a function on each of the values that fit my query ordered by date. The reason being that the (black box) function is internally aggregating values into a string and I need the aggregated values to be in the timestamp order.
The things that I have tried are below: (function returns a boolean value and does blackbox things that I do not know and cannot modify)
-- This doesn't work
SELECT
bool_and (
function(MT.id)
)
FROM my_table MT
WHERE ...conditions...
ORDER BY MT.timestamp_value, MT.id
and got the error column "mt.timestamp_value" must appear in the GROUP BY clause or be used in an aggregate function. If I remove the ORDER BY as below, it will also work:
-- This works!
SELECT
bool_and (
function(MT.id)
)
FROM my_table MT
WHERE ...conditions...
I also tried removing the function and selecting only MT.id, and it worked, but with the function it doesn't. So I tried using the GROUP BY clause.
Doing that, I tried:
-- This also doesn't work
SELECT
bool_and(
function(MT.id)
)
FROM my_table MT
WHERE ...conditions...
GROUP BY MT.id, MT.timestamp_value
ORDER BY MT.timestamp_value, MT.id
but this gives the error more than one row returned by a subquery used as an expression. MT.id is the primary key, btw. It also works without the function, with just SELECT MT.id.
Ideally, a fix to either one of the code bits above would be nice or otherwise something that fulfills the following:
-- Not real postgresql code but something I want it to do
SELECT FUNCTION(id)
FOR EACH id in (MY SELECT STATEMENT HERE ORDERED)
In response to @a_horse_with_no_name
This code falls under a section of another query that looks like the below:
SELECT Function2()
WHERE true = (
(my_snippet)
AND (...)
AND (...)
...
)
The error is clear: the snippet used as an expression (inside WHERE true = (...)) returns more than one row, and a subquery used as an expression may return at most one. Adjust the subquery so that it returns a single row.
Issue resolution:
I discovered the reason that everything was failing was because of the AND in
WHERE true = (
(my_snippet)
AND (...)
AND (...)
...
)
What happened was that using GROUP BY and using ORDER BY caused the value returned by my snippet to be multiple rows of true.
Without the GROUP BY and the ORDER BY, it only returned a single row of true.
So my solution was to wrap the code into another bool_and and use that.
SELECT
bool_and(vals)
FROM(
SELECT
bool_and(
function(MT.id)
) as vals
FROM my_table MT
WHERE ...conditions...
GROUP BY MT.id, MT.timestamp_value
ORDER BY MT.timestamp_value, MT.id
) t
Since I have to guess the reason, and the way that you are trying to accomplish your stated goal:
"return value of the function is aggregated into a string"
And since:
You are using: bool_and and therefore the return value of the function must be boolean, and
The only aggregation I can see is the bool_and aggregation into either true or false, and
You mention that the function is a black box to you,
I would presume that instead of:
"return value of the function is aggregated into a string"
You meant to say: function is aggregating (input/transformed input) values into a string,
And you need this aggregating to be in a certain order.
Further I assume that you own the my_table and can create indexes on it.
So, if you need the function being used in the context:
bool_and ( function(MT.id) )
to process (and therefore aggregate into string) MT.id inputs (or their transformed values) in a certain order, you need to create a clustered index in that order for your my_table table.
To accomplish that in postgresql, you need to (instead of using the group by and order by):
create that_index, in the order you need for the aggregation, for your my_table table, and then
run: CLUSTER my_table USING that_index to physically bake in that order to table structure, and therefore ensure the default aggregation order to be in that order in the bool_and ( function(MT.id) ) aggregation.
(see CLUSTER for more info)
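A sketch of those two steps, assuming the rows should be processed in (timestamp_value, id) order; the index name is a placeholder:

```sql
-- 1. Create an index in the order the aggregation should consume rows:
CREATE INDEX my_table_order_idx ON my_table (timestamp_value, id);

-- 2. Physically rewrite the table in that order (takes an exclusive lock):
CLUSTER my_table USING my_table_order_idx;
```

Note that CLUSTER is a one-time operation: the physical order degrades as rows are inserted or updated, and Postgres still does not promise any scan order, so this makes the desired aggregation order likely rather than guaranteed.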

Select random value for each row

I'm trying to select a new random value from a column in another table for each row of a table I'm updating. I'm getting the random value, however I can't get it to change for each row. Any ideas? Here's the code:
UPDATE srs1.courseedition
SET ta_id = teacherassistant.ta_id
FROM srs1.teacherassistant
WHERE (SELECT ta_id FROM srs1.teacherassistant ORDER BY RANDOM()
LIMIT 1) = teacherassistant.ta_id
My guess is that Postgres is optimizing out the subquery, because it has no dependencies on the outer query. Have you simply considered using a subquery?
UPDATE srs1.courseedition
SET ta_id = (SELECT ta.ta_id
FROM srs1.teacherassistant ta
ORDER BY RANDOM()
LIMIT 1
);
I don't think this will fix the problem (smart optimizers, alas). But, if you correlate to the outer query, then it should run each time. Perhaps:
UPDATE srs1.courseedition ce
SET ta_id = (SELECT ta.ta_id
FROM srs1.teacherassistant ta
WHERE ce.ta_id IS NULL -- or something like that
ORDER BY RANDOM()
LIMIT 1
);
You can replace the WHERE clause with something more nonsensical such as WHERE COALESCE(ce.ta_id, 0) IS NOT NULL.
This following solution should be faster by order(s) of magnitude than running a correlated subquery for every row. N random sorts over the whole table vs. 1 random sort. The result is just as random, but we get a perfectly even distribution with this method, whereas independent random picks like in Gordon's solution can (and probably will) assign some rows more often than others. There are different kinds of "random". Actual requirements for "randomness" need to be defined carefully.
Assuming the number of rows in courseedition is bigger than in teacherassistant.
To update all rows in courseedition:
UPDATE srs1.courseedition c1
SET ta_id = t.ta_id
FROM (
SELECT row_number() OVER (ORDER BY random()) - 1 AS rn -- random order
, count(*) OVER () As ct -- total count
, ta_id
FROM srs1.teacherassistant -- smaller table
) t
JOIN (
SELECT row_number() OVER () - 1 AS rn -- arbitrary order
, courseedition_id -- use actual PK of courseedition
FROM srs1.courseedition -- bigger table
) c ON c.rn%t.ct = t.rn -- rownumber of big modulo count of small table
WHERE c.courseedition_id = c1.courseedition_id;
Notes
Match the random rownumber of the bigger table modulo the count of the smaller table to the rownumber of the smaller table.
row_number() - 1 to get a 0-based index. Allows using the modulo operator % more elegantly.
Random sort for one table is enough. The smaller table is cheaper. The second can have any order (arbitrary is cheaper). The assignment after the join is random either way. Perfect randomness would only be impaired indirectly if there are regular patterns in sort order of the bigger table. In this unlikely case, apply ORDER BY random() to the bigger table to eliminate any such effect.
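The modulo pairing from the notes can be illustrated on its own. With, say, 7 rows in the bigger table and 3 in the smaller one, every small-table row is hit either 2 or 3 times:

```sql
SELECT g AS big_rn, g % 3 AS small_rn   -- 3 = row count of the smaller table
FROM   generate_series(0, 6) g;
-- big_rn:   0 1 2 3 4 5 6
-- small_rn: 0 1 2 0 1 2 0
```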

Postgres: limit by the results of a sum function

CREATE TABLE inventory_box (
box_id varchar(10),
value integer
);
INSERT INTO inventory_box VALUES ('1', 10), ('2', 15), ('3', 20);
I prepared a sql fiddle with the schema.
I would like to select a list of inventory boxes with combined value of above 20
possible result 1. box 1 + box 2 (10 + 15 >= 20)
Here is what I am doing right now:
SELECT * FROM inventory_box LIMIT 1 OFFSET 0;
-- count on the client side and see if I got enough
-- got 10
SELECT * FROM inventory_box LIMIT 1 OFFSET 1;
-- count on the client side and see if I got enough
-- got 15, add it to the first query which returned 10
-- total is 25, ok, got enough, return answer
I am looking for a solution where the scan will stop as soon as it reaches the target value
One possible approach scans the table in box_id order until the running total exceeds 30 (the threshold used in the fiddle), returning all previous rows plus the row that tipped the sum over the limit. Note that the scan doesn't actually stop when the sum is reached; it totals the whole table, then goes back over the results to pick the rows.
http://sqlfiddle.com/#!15/1c502/4
SELECT
array_agg(box_id ORDER BY box_id) AS box_ids,
max(boxsum) AS boxsum
FROM
(
SELECT
box_id,
sum(value) OVER (ORDER BY box_id) AS boxsum,
sum(value) OVER (ORDER BY box_id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevboxsum
FROM
inventory_box
) x
WHERE prevboxsum < 30 OR prevboxsum IS NULL;
but really, this is going to be pretty gruesome to do in a general and reliable manner in SQL (or at all).
You can ORDER BY value ASC instead of ORDER BY box_id if you like; this will add boxes from the smallest to the biggest. However, this will catastrophically fail if you then remove all the small boxes from the pool and run it again, and repeat. Soon it'll just be lumping two big boxes together inefficiently.
To solve this for the general case, finding the smallest combination is a hard optimization problem that probably benefits from imprecise, sampling-based and probabilistic methods.
To scan the table in order until the sum reaches the target, lock the table then use PL/PgSQL to read rows from a cursor that returns the rows in value order plus an array_agg(box_id) OVER (ORDER BY value) and sum(value) OVER (order by value). When you reach the desired sum, return the current row's array. This won't produce an optimal solution, but it'll produce a solution, and I think it'll do so without a full table scan if there's a suitable index in place.
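That idea might be sketched like this in PL/pgSQL, using the question's schema. It stops reading as soon as the running sum reaches the target, but makes no attempt at an optimal pick:

```sql
CREATE OR REPLACE FUNCTION pick_boxes(_target int)
  RETURNS text[] AS
$func$
DECLARE
   _ids text[];
   _sum int;
BEGIN
   FOR _ids, _sum IN
      SELECT array_agg(box_id) OVER (ORDER BY value, box_id)  -- running list
           , sum(value)        OVER (ORDER BY value, box_id)  -- running sum
      FROM   inventory_box
   LOOP
      IF _sum >= _target THEN
         RETURN _ids;   -- stop as soon as the target is reached
      END IF;
   END LOOP;
   RETURN NULL;          -- all boxes together stay below the target
END
$func$ LANGUAGE plpgsql;
```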
Your question update clarifies that your actual requirements are much simpler than a full-blown "subset sum problem" as suspected by @GhostGambler:
Just fetch rows until the sum is big enough.
I am sorting by box_id to get deterministic results. You might even drop the ORDER BY altogether to get any valid result a bit faster, yet.
Slow: Recursive CTE
WITH RECURSIVE i AS (
   SELECT *, row_number() OVER (ORDER BY box_id) AS rn
   FROM   inventory_box
   )
, r AS (
   SELECT box_id, value AS val, value AS total, 2 AS rn
   FROM   i
   WHERE  rn = 1

   UNION ALL
   SELECT i.box_id, i.value, r.total + i.value, r.rn + 1
   FROM   r
   JOIN   i USING (rn)
   WHERE  r.total < 20
   )
SELECT box_id, val, total
FROM   r
ORDER  BY box_id;
Fast: PL/pgSQL function with FOR loop
Using sum() as window aggregate function (cheapest this way).
CREATE OR REPLACE FUNCTION f_shop_for(_total int)
  RETURNS TABLE (box_id text, val int, total int) AS
$func$
BEGIN
   total := 0;

   FOR box_id, val, total IN
      SELECT i.box_id, i.value
           , sum(i.value) OVER (ORDER BY i.box_id)
      FROM   inventory_box i
   LOOP
      RETURN NEXT;
      EXIT WHEN total >= _total;
   END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;
SELECT * FROM f_shop_for(35);
I tested both with a big table of 1 million rows. The function only reads the necessary rows from index and table. The CTE is very slow, seems to scan the whole table ...
SQL Fiddle for both.
Aside: sorting a varchar column (box_id) containing numeric data yields dubious results. Maybe it should be a numeric type, really?
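To illustrate: if a box '10' were added to the fiddle data, text order would sort it before '2', while a cast restores numeric order:

```sql
SELECT box_id FROM inventory_box ORDER BY box_id;       -- text order: '1', '10', '2', '3'
SELECT box_id FROM inventory_box ORDER BY box_id::int;  -- numeric order: '1', '2', '3', '10'
```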

Joining Two Same-Sized Resultsets by Row Number

I have two table functions that return a single column each. One function is guaranteed to return the same number of rows as the other.
I want to insert the values into a new two-column table. One column will receive the value from the first udf, the second column from the second udf. The order of the inserts will be the order in which the rows are returned by the udfs.
How can I JOIN these two udfs given that they do not share a common key? I've tried using a ROW_NUMBER() but can't quite figure it out:
INSERT INTO dbo.NewTwoColumnTable (Column1, Column2)
SELECT udf1.[value], udf2.[value]
FROM dbo.udf1() udf1
INNER JOIN dbo.udf2() udf2 ON ??? = ???
This will not help you directly, but SQL does not guarantee row order unless it is asked to explicitly. The idea that rows will be returned in the order you expect may hold for a given set, but as I understand set-based results, it is fundamentally not guaranteed to work properly. You probably want the UDFs to return a key that is associated with something which guarantees the order.
Despite this, you can do the following:
declare @val int
set @val = 1;

Select Val1, Val2 from
(select Value as Val1, ROW_NUMBER() over (order by @val) r from dbo.udf1()) a
join
(select Value as Val2, ROW_NUMBER() over (order by @val) r from dbo.udf2()) b
on a.r = b.r
The variable addresses the issue of needing a column to sort by.
If you have the privileges to edit the UDFs, I think the better practice is to sort the data inside the UDF already; then you can add ident int identity(1,1) to the UDF's output table, which makes the order explicit.
The reason this might matter is if your server decided to split the UDF results into two packets. If the two arrive out of the order you expected, SQL could return them in the order received, which ruins the assumption that the UDF will return rows in order. This may not be an issue, but if the result is needed later for a real system, proper programming here prevents unexpected bugs later.
In SQL, the "order returned by the udfs" is not guaranteed to persist (even between calls).
Try this:
WITH q1 AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY whatever1) rn
FROM udf1()
),
q2 AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY whatever2) rn
FROM udf2()
)
INSERT
INTO dbo.NewTwoColumnTable (Column1, Column2)
SELECT q1.value, q2.value
FROM q1
JOIN q2
ON q2.rn = q1.rn
PostgreSQL 9.4+ can append an INT8 column at the end of the UDF's result using the WITH ORDINALITY suffix:
-- set returning function WITH ORDINALITY
SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
ls | n
-----------------+----
pg_serial | 1
pg_twophase | 2
postmaster.opts | 3
pg_notify | 4
official doc: http://www.postgresql.org/docs/devel/static/functions-srf.html
related blog post: http://michael.otacoo.com/postgresql-2/postgres-9-4-feature-highlight-with-ordinality/
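Applied to this question's two UDFs, that might look like the following (function and column names assumed from the question; Postgres 9.4+ syntax, so no dbo schema):

```sql
INSERT INTO NewTwoColumnTable (Column1, Column2)
SELECT a.value, b.value
FROM udf1() WITH ORDINALITY AS a(value, rn)
JOIN udf2() WITH ORDINALITY AS b(value, rn) USING (rn);
```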