Finding the sum of a column that contains a window function - sql

I have this select query:
SELECT
total,
COALESCE(total - Lag(total)OVER(ORDER BY total), 0) AS dif_total
FROM ( select count(*) as total
FROM
tbl_person
left join
tbl_census
on
tbl_census.person_id = tbl_person.person_id
group by extract(year from tbl_census.date)
) abc
Is there a way I could find the sum of the column dif_total?
I can't use the Sum() because it contains a window function.
I tried saving the column to an array, figuring I could call the function, convert the array back to a column, and then use Sum().
But I messed it up.
Here is my query for the function.
CREATE OR REPLACE function growth() Returns int[] as $$
declare total2 integer[];
BEGIN
SELECT
total,
COALESCE(total - Lag(total)OVER(ORDER BY total), 0) into total2
FROM
( select count(*) as total
from
tbl_person
group by extract(year from bdate)
) abc ;
RETURN total2;
END; $$ LANGUAGE plpgsql;
The function is created successfully without any warning or error, but I think I did something wrong, because when I try to SELECT from it, I get:
Array value must start with "{" or dimension information
I'm very new to using stored functions in PostgreSQL.
What changes should I make to my function to get it to work?
Or what are the other ways for me to sum the column dif_total above?

Why don't you just wrap it with another select?
SELECT total,sum(dif_total) as total_2
FROM ( YOUR QUERY HERE...) AS sub -- a subquery in FROM needs an alias in PostgreSQL before version 16
GROUP BY total
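Note that grouping by total yields one sum per distinct total. If a single grand total of dif_total is what is needed, a minimal sketch is to drop the GROUP BY and aggregate over the whole derived table. It assumes the tables and columns from the question; the alias diffs and the output name sum_dif_total are made up for illustration:

```sql
-- Sketch: compute the window function in a derived table,
-- then aggregate its output in the outer query.
SELECT sum(dif_total) AS sum_dif_total
FROM (
    SELECT total,
           COALESCE(total - lag(total) OVER (ORDER BY total), 0) AS dif_total
    FROM (
        SELECT count(*) AS total
        FROM tbl_person
        LEFT JOIN tbl_census
               ON tbl_census.person_id = tbl_person.person_id
        GROUP BY extract(year FROM tbl_census.date)
    ) abc
) diffs;
```

The window function is evaluated inside the derived table, so the outer sum() only sees an ordinary column and the restriction on mixing the two no longer applies.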

Related

PostgreSQL - multiple aggregate queries from the same function call

I have a function that returns a setof from a table:
CREATE OR REPLACE FUNCTION get_assoc_addrs_from_bbl(_bbl text)
RETURNS SETOF wow_bldgs AS $$
SELECT bldgs.* FROM wow_bldgs AS bldgs
...
$$ LANGUAGE SQL STABLE;
Here's a sample of what the table would return:
Now I'm writing an "aggregate" function that will return only one row with various (aggregated) data points about the table that this function returns. Here is my current working (and naive) example:
SELECT
count(distinct registrationid) as bldgs,
sum(unitsres) as units,
round(avg(yearbuilt), 1) as age,
(SELECT first(corpname) FROM (
SELECT unnest(corpnames) as corpname
FROM get_assoc_addrs_from_bbl('3012380016')
GROUP BY corpname ORDER BY count(*) DESC LIMIT 1
) corps) as topcorp,
(SELECT first(businessaddr) FROM (
SELECT unnest(businessaddrs) as businessaddr
FROM get_assoc_addrs_from_bbl('3012380016')
GROUP BY businessaddr ORDER BY count(*) DESC LIMIT 1
) rbas) as topbusinessaddr
FROM get_assoc_addrs_from_bbl('3012380016') assocbldgs
As you can see, for the two "subqueries" that require a custom grouping/ordering method, I need to repeat the call to get_assoc_addrs_from_bbl(). Ideally, I'm looking for a structure that would avoid the repeated calls as the function requires a lot of processing and I want the capacity for an arbitrary number of subqueries. I've looked into CTEs and window expressions and the like but no luck.
Any tips? Thank you!
Create a simple aggregate function:
create aggregate array_agg2(anyarray) (
sfunc=array_cat,
stype=anyarray);
It aggregates array values into one single-dim array. Example:
# with t(x) as (values(array[1,2]),(array[2,3,4])) select array_agg2(x) from t;
┌─────────────┐
│ array_agg2 │
╞═════════════╡
│ {1,2,2,3,4} │
└─────────────┘
After that your query could be rewritten as
SELECT
count(distinct registrationid) as bldgs,
sum(unitsres) as units,
round(avg(yearbuilt), 1) as age,
(SELECT first(corpname) FROM (
SELECT * FROM unnest(array_agg2(corpnames)) as corpname
GROUP BY corpname ORDER BY count(*) DESC LIMIT 1
) corps) as topcorp,
(SELECT first(businessaddr) FROM (
SELECT * FROM unnest(array_agg2(businessaddrs)) as businessaddr
GROUP BY businessaddr ORDER BY count(*) DESC LIMIT 1
) rbas) as topbusinessaddr
FROM get_assoc_addrs_from_bbl('3012380016') assocbldgs
(assuming I understand your goal correctly)
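Another option, shown here only as a sketch, is to put the function call in a CTE and reference the CTE from every subquery. PostgreSQL versions before 12 always materialize a CTE, and version 12 and later materialize by default any CTE that is referenced more than once, so the function should be evaluated a single time. The aliases b and u are made up for illustration, and the implicit lateral unnest() in FROM assumes Postgres 9.3 or newer:

```sql
-- Sketch: evaluate the expensive function once and reuse its rows.
WITH assocbldgs AS (
    SELECT * FROM get_assoc_addrs_from_bbl('3012380016')
)
SELECT
    count(DISTINCT registrationid) AS bldgs,
    sum(unitsres)                  AS units,
    round(avg(yearbuilt), 1)       AS age,
    -- most frequent corporation name across all rows
    (SELECT u.corpname
     FROM assocbldgs b, unnest(b.corpnames) AS u(corpname)
     GROUP BY u.corpname
     ORDER BY count(*) DESC
     LIMIT 1)                      AS topcorp,
    -- most frequent business address across all rows
    (SELECT u.businessaddr
     FROM assocbldgs b, unnest(b.businessaddrs) AS u(businessaddr)
     GROUP BY u.businessaddr
     ORDER BY count(*) DESC
     LIMIT 1)                      AS topbusinessaddr
FROM assocbldgs;
```

Further "top N" subqueries can be added in the same way without triggering another call to the function.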

Record returned from function has columns concatenated

I have a table which stores account changes over time. I need to join that up with two other tables to create some records for a particular day, if those records don't already exist.
To make things easier (I hope), I've encapsulated the query that returns the correct historical data into a function that takes in an account id, and the day.
If I execute "Select * From account_servicetier_for_day(20424, '2014-08-12')", I get the expected result (all the data returned from the function in separate columns). If I use the function within another query, I get all the columns joined into one:
("2014-08-12 14:20:37",hollenbeck,691,12129,20424,69.95,"2Mb/1Mb 20GB Limit",2048,1024,20.000)
I'm using "PostgreSQL 9.2.4 on x86_64-slackware-linux-gnu, compiled by gcc (GCC) 4.7.1, 64-bit".
Query:
Select
'2014-08-12' As day, 0 As inbytes, 0 As outbytes, acct.username, acct.accountid, acct.userid,
account_servicetier_for_day(acct.accountid, '2014-08-12')
From account_tab acct
Where acct.isdsl = 1
And acct.dslservicetypeid Is Not Null
And acct.accountid Not In (Select accountid From dailyaccounting_tab Where Day = '2014-08-12')
Order By acct.username
Function:
CREATE OR REPLACE FUNCTION account_servicetier_for_day(_accountid integer, _day timestamp without time zone) RETURNS setof account_dsl_history_info AS
$BODY$
DECLARE _accountingrow record;
BEGIN
Return Query
Select * From account_dsl_history_info
Where accountid = _accountid And timestamp <= _day + interval '1 day - 1 millisecond'
Order By timestamp Desc
Limit 1;
END;
$BODY$ LANGUAGE plpgsql;
Generally, to decompose rows returned from a function and get individual columns:
SELECT * FROM account_servicetier_for_day(20424, '2014-08-12');
As for the query:
Postgres 9.3 or newer
Cleaner with JOIN LATERAL:
SELECT '2014-08-12' AS day, 0 AS inbytes, 0 AS outbytes
, a.username, a.accountid, a.userid
, f.* -- but avoid duplicate column names!
FROM account_tab a
, account_servicetier_for_day(a.accountid, '2014-08-12') f -- <-- HERE
WHERE a.isdsl = 1
AND a.dslservicetypeid IS NOT NULL
AND NOT EXISTS (
SELECT FROM dailyaccounting_tab
WHERE day = '2014-08-12'
AND accountid = a.accountid
)
ORDER BY a.username;
The LATERAL keyword is implicit here; functions can always refer to earlier FROM items. The manual:
LATERAL can also precede a function-call FROM item, but in this
case it is a noise word, because the function expression can refer to
earlier FROM items in any case.
Related:
Insert multiple rows in one table based on number in another table
Short notation with a comma in the FROM list is (mostly) equivalent to a CROSS JOIN LATERAL (same as [INNER] JOIN LATERAL ... ON TRUE) and thus removes rows from the result where the function call returns no row. To retain such rows, use LEFT JOIN LATERAL ... ON TRUE:
...
FROM account_tab a
LEFT JOIN LATERAL account_servicetier_for_day(a.accountid, '2014-08-12') f ON TRUE
...
Also, don't use NOT IN (subquery) when you can avoid it. It's the slowest and most tricky of several ways to do that:
Select rows which are not present in other table
I suggest NOT EXISTS instead.
Postgres 9.2 or older
You can call a set-returning function in the SELECT list (which is a Postgres extension of standard SQL). For performance reasons, this is best done in a subquery. Decompose the (well-known!) row type in the outer query to avoid repeated evaluation of the function:
SELECT '2014-08-12' AS day, 0 AS inbytes, 0 AS outbytes
, a.username, a.accountid, a.userid
, (a.rec).* -- but be wary of duplicate column names!
FROM (
SELECT *, account_servicetier_for_day(a.accountid, '2014-08-12') AS rec
FROM account_tab a
WHERE a.isdsl = 1
AND a.dslservicetypeid Is Not Null
AND NOT EXISTS (
SELECT FROM dailyaccounting_tab
WHERE day = '2014-08-12'
AND accountid = a.accountid
)
) a
ORDER BY a.username;
Related answer by Craig Ringer with an explanation, why it's better not to decompose on the same query level:
How to avoid multiple function evals with the (func()).* syntax in an SQL query?
Postgres 10 removed some oddities in the behavior of set-returning functions in the SELECT:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Use the function in the FROM clause:
Select
'2014-08-12' As day,
0 As inbytes,
0 As outbytes,
acct.username,
acct.accountid,
acct.userid,
asfd.*
From
account_tab acct
cross join lateral
account_servicetier_for_day(acct.accountid, '2014-08-12') asfd
Where acct.isdsl = 1
And acct.dslservicetypeid Is Not Null
And acct.accountid Not In (Select accountid From dailyaccounting_tab Where Day = '2014-08-12')
Order By acct.username

How to get a last record when using group by in PostgreSQL

This is my table "AuctionDetails"
The following select:
select string_agg("AuctionNO",',' ) as "AuctionNO"
,sum("QuntityInAuction" ) as "QuntityInAuction"
,"AmmanatPattiID"
,"EntryPassDetailsId"
,"BrokerID"
,"TraderID"
,"IsSold"
,"IsActive"
,"IsExit"
,"IsNew"
,"CreationDate"
from "AuctionDetails"
group by "AmmanatPattiID"
,"EntryPassDetailsId"
,"TraderID"
,"IsSold"
,"IsActive"
,"IsExit"
,"IsNew"
,"BrokerID"
,"CreationDate"
gives me this result:
But I need a record like this:
AuctionNo                             | QunatityInAuction | AmmanatpattiID | EntryPassDetailID | BrokerID | Trader ID | IsSold | ISActive | ISExit | IsNew | CreationDate
AU8797897,AU8797886,AU596220196F37379 | 1050              | -1             | 228,229           | 42       | 42        | f      | t        | f      | t     | 2013-10-10
In the end, I need the latest entry for trader and broker (which in our case is "42"), the sum of the quantity, and the concatenation of the auction numbers.
The Postgres wiki describes how to define your own FIRST and LAST aggregate functions. For example:
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
The page is here: https://wiki.postgresql.org/wiki/First/last_(aggregate)
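Applied to the question, a hedged sketch might look like the following. It assumes the LAST aggregate above has been created in the public schema (the unquoted name folds to last) and that "AuctionID", which appears in another answer here, defines what "latest" means:

```sql
-- Sketch: one summary row with concatenated auction numbers, total quantity,
-- and the broker/trader of the most recent row by "AuctionID" (an assumption).
SELECT string_agg("AuctionNO", ',')                  AS "AuctionNO",
       sum("QuntityInAuction")                       AS "QuntityInAuction",
       public.last("BrokerID" ORDER BY "AuctionID")  AS "BrokerID",
       public.last("TraderID" ORDER BY "AuctionID")  AS "TraderID"
FROM "AuctionDetails";
```

Because last_agg is declared STRICT, NULL inputs are skipped, so each column returns its last non-NULL value in the given order.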
There are various ways to do this. Combinations of aggregate and window functions or a combination of window functions and DISTINCT ...
SELECT a.*, b.*
FROM (
SELECT string_agg("AuctionNO", ',') AS "AuctionNO"
,sum("QuntityInAuction") AS "QuntityInAuction"
FROM "AuctionDetails"
) a
CROSS JOIN (
SELECT "AmmanatPattiID"
,"EntryPassDetailsId"
,"BrokerID"
,"TraderID"
,"IsSold"
,"IsActive"
,"IsExit"
,"IsNew"
,"CreationDate"
FROM "AuctionDetails"
ORDER BY "AuctionID" DESC
LIMIT 1
) b
For the simple case of a single result row for a whole table, this may be simplest.

SQL Server Query invalid column in select list

I am trying to create a function that will return a table of all cows that produced on average more than 20 liters of milk per day.
This is the code I came up with:
CREATE FUNCTION SuperCows (@year int)
RETURNS @supercows TABLE (
Name nvarchar(50),
AvgMilk decimal(4,2)
)
BEGIN
INSERT @supercows
SELECT c.Name, AVG(CAST(p.MilkQuantity AS decimal(4,2))) FROM MilkProduction AS p
INNER JOIN Cows AS c ON c.IDCow = p.CowID
WHERE YEAR(p.Date) = @year
GROUP BY p.CowID
HAVING AVG(CAST(p.MilkQuantity AS decimal(4,2))) > 20
RETURN
END
GO
The error that I get when trying to create the function is this:
Column 'Cows.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
My knowledge of SQL is fairly limited and I was hoping someone could help me with solving this.
You need to add Cows.name to the group by list:
SELECT c.Name, AVG(CAST(p.MilkQuantity AS decimal(4,2))) FROM MilkProduction AS p
INNER JOIN Cows AS c ON c.IDCow = p.CowID
WHERE YEAR(p.Date) = @year
GROUP BY p.CowID, c.Name
HAVING AVG(CAST(p.MilkQuantity AS decimal(4,2))) > 20
If you use GROUP BY, every column you select must either appear in the GROUP BY list or be wrapped in an aggregate function (AVG, MIN, MAX, SUM, etc.), because each non-grouped column can have multiple values per group.
Change
GROUP BY p.CowID
to
GROUP BY c.Name
This won't work if you have multiple cows with the same name - in that case their total MilkQuantity will be combined into a single record.

Is there a better way to calculate the median (not average)

Suppose I have the following table definition:
CREATE TABLE x (i serial primary key, value integer not null);
I want to calculate the MEDIAN of value (not the AVG). The median is a value that divides the set in two subsets containing the same number of elements. If the number of elements is even, the median is the average of the biggest value in the lowest segment and the lowest value of the biggest segment. (See wikipedia for more details.)
Here is how I manage to calculate the MEDIAN but I guess there must be a better way:
SELECT AVG(values_around_median) AS median
FROM (
SELECT
DISTINCT(CASE WHEN FIRST_VALUE(above) OVER w2 THEN MIN(value) OVER w3 ELSE MAX(value) OVER w2 END)
AS values_around_median
FROM (
SELECT LAST_VALUE(value) OVER w AS value,
SUM(COUNT(*)) OVER w > (SELECT count(*)/2 FROM x) AS above
FROM x
GROUP BY value
WINDOW w AS (ORDER BY value)
ORDER BY value
) AS find_if_values_are_above_or_below_median
WINDOW w2 AS (PARTITION BY above ORDER BY value DESC),
w3 AS (PARTITION BY above ORDER BY value ASC)
) AS find_values_around_median
Any ideas?
Yes, with PostgreSQL 9.4, you can use the newly introduced inverse distribution function PERCENTILE_CONT(), an ordered-set aggregate function that is specified in the SQL standard as well.
WITH t(value) AS (
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 100
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
t;
This emulation of MEDIAN() via PERCENTILE_CONT() is also documented here.
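Against the table x from the question, this could be sketched as follows (PostgreSQL 9.4 or newer; the output column names are made up for illustration). percentile_cont() interpolates between the two middle values when the row count is even, while percentile_disc() always returns a value that actually occurs in the set:

```sql
-- Sketch: median of x.value using ordered-set aggregates.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS median_interpolated,
       percentile_disc(0.5) WITHIN GROUP (ORDER BY value) AS median_discrete
FROM x;
```

Both aggregates ignore NULL inputs, so no extra filtering is needed for nullable columns.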
Indeed there IS an easier way. In Postgres you can define your own aggregate functions. I posted functions to do median as well as mode and range to the PostgreSQL snippets library a while back.
http://wiki.postgresql.org/wiki/Aggregate_Median
A simpler query for that:
WITH y AS (
SELECT value, row_number() OVER (ORDER BY value) AS rn
FROM x
WHERE value IS NOT NULL
)
, c AS (SELECT count(*) AS ct FROM y)
SELECT CASE WHEN c.ct%2 = 0 THEN
round((SELECT avg(value) FROM y WHERE y.rn IN (c.ct/2, c.ct/2+1)), 3)
ELSE
(SELECT value FROM y WHERE y.rn = (c.ct+1)/2)
END AS median
FROM c;
Major points
Ignores NULL values.
Core feature is the row_number() window function, which has been there since version 8.4
The final SELECT gets one row for uneven numbers and avg() of two rows for even numbers. Result is numeric, rounded to 3 decimal places.
A test shows that the new version is 4x faster than the query in the question (and, unlike it, yields correct results):
CREATE TEMP TABLE x (value int);
INSERT INTO x SELECT generate_series(1,10000);
INSERT INTO x VALUES (NULL),(NULL),(NULL),(3);
For googlers: there is also http://pgxn.org/dist/quantile
Median can be calculated in one line after installation of this extension.
Simple sql with native postgres functions only:
select
case count(*)%2
when 1 then (array_agg(num order by num))[count(*)/2+1]
else ((array_agg(num order by num))[count(*)/2]::double precision + (array_agg(num order by num))[count(*)/2+1])/2
end as median
from unnest(array[5,17,83,27,28]) num;
Sure you can add coalesce() or something if you want to handle nulls.
CREATE TABLE array_table (id integer, values integer[]) ;
INSERT INTO array_table VALUES ( 1,'{1,2,3}');
INSERT INTO array_table VALUES ( 2,'{4,5,6,7}');
select id, values, cardinality(values) as array_length,
(case when cardinality(values)%2=0 and cardinality(values)>1 then (values[(cardinality(values)/2)]+ values[((cardinality(values)/2)+1)])/2::float
else values[(cardinality(values)+1)/2]::float end) as median
from array_table
Or you can create a function and use it anywhere in your further queries.
CREATE OR REPLACE FUNCTION median (a integer[])
RETURNS float AS $median$
Declare
abc float;
BEGIN
SELECT (case when cardinality(a)%2=0 and cardinality(a)>1 then
(a[(cardinality(a)/2)] + a[((cardinality(a)/2)+1)])/2::float
else a[(cardinality(a)+1)/2]::float end) into abc;
RETURN abc;
END;
$median$
LANGUAGE plpgsql;
select id,values,median(values) from array_table
Use the function below to find the nth percentile:
CREATE or REPLACE FUNCTION nth_percentil(anyarray, int)
RETURNS
anyelement as
$$
SELECT $1[$2/100.0 * array_upper($1,1) + 1] ;
$$
LANGUAGE SQL IMMUTABLE STRICT;
In your case, it's the 50th percentile.
Use the query below to get the median:
SELECT nth_percentil(ARRAY (SELECT Field_name FROM table_name ORDER BY 1),50)
This will give you the 50th percentile, which is basically the median.
Hope this is helpful.