I have a function that returns a setof from a table:
CREATE OR REPLACE FUNCTION get_assoc_addrs_from_bbl(_bbl text)
RETURNS SETOF wow_bldgs AS $$
SELECT bldgs.* FROM wow_bldgs AS bldgs
...
$$ LANGUAGE SQL STABLE;
Here's a sample of what the table would return:
Now I'm writing an "aggregate" function that will return only one row with various (aggregated) data points about the table that this function returns. Here is my current working (and naive) example:
SELECT
count(distinct registrationid) as bldgs,
sum(unitsres) as units,
round(avg(yearbuilt), 1) as age,
(SELECT first(corpname) FROM (
SELECT unnest(corpnames) as corpname
FROM get_assoc_addrs_from_bbl('3012380016')
GROUP BY corpname ORDER BY count(*) DESC LIMIT 1
) corps) as topcorp,
(SELECT first(businessaddr) FROM (
SELECT unnest(businessaddrs) as businessaddr
FROM get_assoc_addrs_from_bbl('3012380016')
GROUP BY businessaddr ORDER BY count(*) DESC LIMIT 1
) rbas) as topbusinessaddr
FROM get_assoc_addrs_from_bbl('3012380016') assocbldgs
As you can see, for the two "subqueries" that require a custom grouping/ordering method, I need to repeat the call to get_assoc_addrs_from_bbl(). Ideally, I'm looking for a structure that would avoid the repeated calls as the function requires a lot of processing and I want the capacity for an arbitrary number of subqueries. I've looked into CTEs and window expressions and the like but no luck.
Any tips? Thank you!
Create a simple aggregate function:
create aggregate array_agg2(anyarray) (
sfunc=array_cat,
stype=anyarray);
It concatenates array values into a single one-dimensional array. Example:
# with t(x) as (values(array[1,2]),(array[2,3,4])) select array_agg2(x) from t;
┌─────────────┐
│ array_agg2 │
╞═════════════╡
│ {1,2,2,3,4} │
└─────────────┘
After that your query could be rewritten as
SELECT
count(distinct registrationid) as bldgs,
sum(unitsres) as units,
round(avg(yearbuilt), 1) as age,
(SELECT first(corpname) FROM (
SELECT * FROM unnest(array_agg2(corpnames)) as corpname
GROUP BY corpname ORDER BY count(*) DESC LIMIT 1
) corps) as topcorp,
(SELECT first(businessaddr) FROM (
SELECT * FROM unnest(array_agg2(businessaddrs)) as businessaddr
GROUP BY businessaddr ORDER BY count(*) DESC LIMIT 1
) rbas) as topbusinessaddr
FROM get_assoc_addrs_from_bbl('3012380016') assocbldgs
(assuming I understand your goal correctly)
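To illustrate what the rewritten query computes, here is a minimal Python sketch of the same idea: fetch the rows once, flatten each array column, and pick the most frequent element. The sample rows and the helper name top_value are invented for illustration and are not taken from the question's data.

```python
from collections import Counter
from itertools import chain

# Hypothetical rows standing in for ONE call to get_assoc_addrs_from_bbl();
# the values are made up for illustration only.
rows = [
    {"corpnames": ["ACME LLC", "FOO CORP"], "businessaddrs": ["1 MAIN ST"]},
    {"corpnames": ["ACME LLC"], "businessaddrs": ["1 MAIN ST", "2 OAK AVE"]},
    {"corpnames": ["BAR INC"], "businessaddrs": ["2 OAK AVE", "1 MAIN ST"]},
]


def top_value(rows, column):
    """Flatten one array column across all rows (like array_agg2 + unnest)
    and return its most frequent element, or None if there are no elements."""
    flat = chain.from_iterable(row[column] or [] for row in rows)
    counts = Counter(flat)
    return counts.most_common(1)[0][0] if counts else None


# Both "top" values come from the same in-memory rows: no repeated fetch.
topcorp = top_value(rows, "corpnames")
topbusinessaddr = top_value(rows, "businessaddrs")
print(topcorp, topbusinessaddr)
```

The point of the design is that the expensive row source is read once, and any number of per-column "most frequent" lookups can then be added without another call.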
I use a Postgres SQL function in my query like this:
SELECT
message.id,
note,
earned_media_direct(
SUM(message_stat.posts_delivered)::int,
CAST(SUM(message_stat.clicks) AS bigint),
team.earned_media_multi_clicks::int,
SUM(message_stat.likes)::int,
team.earned_media_multi_likes::int,
SUM(message_stat.comments)::int,
team.earned_media_multi_comments::int,
SUM(message_stat.shares)::int,
team.earned_media_multi_shares::int
) AS media_points,
count(*) OVER() AS total_count
FROM message
LEFT JOIN team ON team.id = 10
WHERE team_id = 10
GROUP BY message.id, team.id
{$orderBy}
LIMIT 20 OFFSET 1
When returning a list of messages, I want to ORDER BY rank (sorting by "rank" really means sorting by media points).
The function earned_media_direct is defined within Postgres like:
CREATE OR REPLACE FUNCTION public.earned_media_direct(posts bigint, clicks bigint, clicks_multiplier numeric, likes bigint, likes_multiplier numeric, comments bigint, comments_multiplier numeric, reshares bigint, shares_multiplier numeric)
RETURNS numeric
LANGUAGE plpgsql
AS $function$
BEGIN
RETURN COALESCE(clicks, 0) * clicks_multiplier +
COALESCE(likes, 0) * likes_multiplier +
COALESCE(comments, 0) * comments_multiplier +
(COALESCE(posts, 0) + COALESCE(reshares, 0)) * shares_multiplier;
END;
$function$
I tried adding:
ROW_NUMBER() OVER (
ORDER BY earned_media_direct(
SUM(message_stat.posts_delivered),
CAST(SUM(message_stat.clicks) AS bigint),
team.earned_media_multi_clicks,
SUM(message_stat.likes),
team.earned_media_multi_likes,
SUM(message_stat.comments),
team.earned_media_multi_comments,
SUM(message_stat.shares),
team.earned_media_multi_shares) DESC
) AS rank
I am not sure I am using it correctly in my example. Is there another way to perform ORDER BY rank?
Thanks
We can probably use a subquery approach for this.
First: wrap your main query in a subquery and omit the ORDER BY inside it.
Second: apply the ORDER BY on the outer query.
-- outer query
SELECT * FROM (
-- sub query (your main query, without ORDER BY and LIMIT)
SELECT
message.id,
note,
earned_media_direct(
SUM(message_stat.posts_delivered)::int,
CAST(SUM(message_stat.clicks) AS bigint),
team.earned_media_multi_clicks::int,
SUM(message_stat.likes)::int,
team.earned_media_multi_likes::int,
SUM(message_stat.comments)::int,
team.earned_media_multi_comments::int,
SUM(message_stat.shares)::int,
team.earned_media_multi_shares::int
) AS media_points,
count(*) OVER() AS total_count
FROM message
LEFT JOIN team ON team.id = 10
WHERE team_id = 10
GROUP BY message.id, team.id
) A
-- sort on the outer query, then paginate, so LIMIT applies to the sorted rows
-- (append DESC to put the highest media_points first)
ORDER BY A.media_points
LIMIT 20 OFFSET 1;
Hope this helps answer your question.
Reference:
https://www.postgresql.org/docs/current/functions-subquery.htm
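A minimal sketch of sorting on a computed alias from the outer query, using Python's built-in sqlite3 module as a stand-in for Postgres. The message_stat table layout, the sample rows, and the multipliers (2 and 3) are all invented for illustration. LIMIT is placed on the outer query, after ORDER BY, so the page contains the highest-ranked rows rather than an arbitrary slice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified stats table (not the schema from the question).
    CREATE TABLE message_stat (message_id INTEGER, clicks INTEGER, likes INTEGER);
    INSERT INTO message_stat VALUES
        (1, 5, 1), (1, 2, 0), (2, 50, 3), (3, 1, 1), (4, 20, 9);
""")

# Inner query: aggregate per message and compute the score under an alias.
# Outer query: sort by that alias first, then apply LIMIT.
top_two = conn.execute("""
    SELECT id, media_points
    FROM (
        SELECT message_id AS id,
               SUM(clicks) * 2 + SUM(likes) * 3 AS media_points
        FROM message_stat
        GROUP BY message_id
    ) AS a
    ORDER BY a.media_points DESC
    LIMIT 2
""").fetchall()
print(top_two)
```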
It seems we can use a SQL statement as:
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
);
but we can't do
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(
select
(c_foos / c_bars) as the_ratio
);
or
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(c_foos / c_bars) as the_ratio;
Is there a way to do that showing all 3 numbers? Is there a more definite rule as to what can be done and what can't?
You can try this:
Define two CTEs in a WITH clause, then use their results in the main query (here cte_num and cte_den):
WITH
cte_num AS (
SELECT count(*) as c_foos
FROM foos
),
cte_den AS (
SELECT count(*) as c_bars
FROM bars
)
SELECT
cte_num.c_foos,
cte_den.c_bars,
-- count(*) returns bigint, so this is integer division;
-- cast one side to numeric if you want a fractional ratio
cte_num.c_foos / cte_den.c_bars as the_ratio
from cte_num, cte_den;
There is a small number of simple rules... but SQL seems so easy that most programmers prefer to cut to the chase, and later complain they didn't get the plot :)
You can think of a query as a description of a flow: columns in a select share inputs (defined in from), but are evaluated "in parallel", without seeing each other. Your complex example boils down to the fact that you cannot do this:
select 1 as a, 2 as b, a + b;
Fields a and b are defined as outputs of the query, but there are no inputs called a and b. All you have to do is modify the query so that a and b are inputs:
select a + b from (select 1 as a, 2 as b) as inputs
And this will work (this is, btw., the solution for your queries).
Addendum:
The confusion comes from the fact that in most SQL 101 cases outputs are created directly from inputs (data just passes through).
This flow model is useful because it makes more complex cases easier to reason about. It also avoids ambiguities and loops. Think about it in the context of a query like: select name as last_name, last_name as name, name || ' ' || last_name from person;
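The inputs-versus-outputs rule can be tried out quickly. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption: error messages and edge cases differ between engines). It attempts the flat query and then the version where a and b come from a derived table; the variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Output aliases a and b are not visible to a sibling expression
# in the same select list, so this statement is expected to be rejected.
try:
    conn.execute("SELECT 1 AS a, 2 AS b, a + b").fetchall()
    flat_query_works = True
except sqlite3.Error:
    flat_query_works = False

# Turning a and b into inputs (a derived table in FROM) makes them visible,
# and all three values can be returned together.
row = conn.execute(
    "SELECT a, b, a + b AS total FROM (SELECT 1 AS a, 2 AS b) AS inputs"
).fetchone()
print(flat_query_works, row)
```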
Move the subqueries to the FROM clause:
select f.c_foos, b.c_bars, f.c_foos / b.c_bars as the_ratio
from (select count(*) as c_foos from foos
) f cross join
(select count(*) as c_bars from bars
) b;
Ironically, your first version will work in MySQL (see here). I don't actually think this is intentional. I think it is an artifact of their parser -- meaning that it happens to work but might stop working in future versions.
The simplest way is to use a CTE that returns the 2 columns:
with cte as (
select
(select count(*) from foos) as c_foos,
(select count(*) from bars) as c_bars
)
select c_foos, c_bars, (c_foos / c_bars) as the_ratio
from cte
Note that the two column aliases must be set outside each subquery's parentheses, not inside them.
When I run this query:
SELECT
DT.CONTRACT_NUMBER,
DT.ROLE,
DT.TAX_ID,
DT.EFFECTIVE_DATE
FROM DATA_TABLE DT
I get this result.
I'd like to remove results where the TAX ID appears more than once for each contract.
I.e., this result would be gone. If they had 3 results, they would all be gone.
I think window functions might be the way to go:
SELECT DT.CONTRACT_NUMBER, DT.ROLE, DT.TAX_ID, DT.EFFECTIVE_DATE
FROM (SELECT DT.CONTRACT_NUMBER, DT.ROLE, DT.TAX_ID, DT.EFFECTIVE_DATE,
COUNT(*) OVER (PARTITION BY TAX_ID) as cnt
FROM DATA_TABLE DT
WHERE DT.CONTRACT_NUMBER = '551000280'
) DT
WHERE CNT = 1;
If you actually want to keep one row per tax id, then use row_number() instead of count(*).
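A small sketch of the COUNT(*) OVER filter, run through Python's built-in sqlite3 module as a stand-in (this assumes an SQLite build of 3.25 or newer, which is needed for window functions). The sample rows are invented, and the partition here is on both contract and tax id so that it works without filtering on a single contract number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; only the columns needed for the demo.
    CREATE TABLE data_table (contract_number TEXT, role TEXT, tax_id TEXT);
    INSERT INTO data_table VALUES
        ('C1', 'OWNER',   'T1'),
        ('C1', 'INSURED', 'T1'),
        ('C1', 'PAYER',   'T2'),
        ('C2', 'OWNER',   'T1');
""")

# Count how often each tax id occurs within its contract, then keep
# only the rows whose (contract, tax id) pair occurs exactly once.
unique_rows = conn.execute("""
    SELECT contract_number, role, tax_id
    FROM (
        SELECT dt.*,
               COUNT(*) OVER (PARTITION BY contract_number, tax_id) AS cnt
        FROM data_table dt
    ) AS counted
    WHERE cnt = 1
    ORDER BY contract_number, tax_id
""").fetchall()
print(unique_rows)
```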
I need to sum a subarray from an array using postgresql.
I need to create a postgresql query that will dynamically do this as the upper and lower indexes will be different for each array.
These indexes will come from two other columns within the same table.
I had the below query that will get the subarray:
SELECT
SUM(t) AS summed_index_values
FROM
(SELECT UNNEST(int_array_column[34:100]) AS t
FROM array_table
WHERE id = 1) AS t;
...but I then realised I couldn't use variables or SELECT statements when using array slices to make the query dynamic:
int_array_column[SELECT array_index_lower FROM array_table WHERE id = 1; : SELECT array_index_upper FROM array_table WHERE id = 1;]
...does anyone know how I can achieve this query dynamically?
No need for sub-selects, just use the column names:
SELECT SUM(t) AS summed_index_values
FROM (
SELECT UNNEST(int_array_column[tb.array_index_lower:tb.array_index_upper]) AS t
FROM array_table tb
WHERE id = 1
) AS t;
Note that it's not recommended to use set-returning functions (unnest) in the SELECT list. It's better to put that into the FROM clause:
SELECT sum(t.val)
FROM (
SELECT t.val
FROM array_table tb
cross join UNNEST(tb.int_array_column[tb.array_index_lower:tb.array_index_upper]) AS t(val)
WHERE tb.id = 1
) AS t;
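For intuition about what the slice computes, here is a minimal Python sketch. It assumes Postgres' default array indexing (1-based, both bounds inclusive) and bounds that lie within the array; the sample rows and the helper name sum_slice are invented for illustration.

```python
def sum_slice(values, lower, upper):
    """Sum values[lower:upper] using 1-based, inclusive bounds,
    as in int_array_column[lower:upper]. Assumes 1 <= lower <= upper <= len(values)."""
    return sum(values[lower - 1:upper])


# Hypothetical rows: each row carries its own array and its own bounds.
rows = [
    {"id": 1, "int_array_column": [10, 20, 30, 40, 50],
     "array_index_lower": 2, "array_index_upper": 4},
    {"id": 2, "int_array_column": [1, 2, 3],
     "array_index_lower": 1, "array_index_upper": 2},
]

# The bounds are read from the same row as the array, so no sub-select is needed.
sums = {
    row["id"]: sum_slice(row["int_array_column"],
                         row["array_index_lower"],
                         row["array_index_upper"])
    for row in rows
}
print(sums)
```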
Suppose I have the following table definition:
CREATE TABLE x (i serial primary key, value integer not null);
I want to calculate the MEDIAN of value (not the AVG). The median is a value that divides the set in two subsets containing the same number of elements. If the number of elements is even, the median is the average of the biggest value in the lowest segment and the lowest value of the biggest segment. (See wikipedia for more details.)
Here is how I manage to calculate the MEDIAN but I guess there must be a better way:
SELECT AVG(values_around_median) AS median
FROM (
SELECT
DISTINCT(CASE WHEN FIRST_VALUE(above) OVER w2 THEN MIN(value) OVER w3 ELSE MAX(value) OVER w2 END)
AS values_around_median
FROM (
SELECT LAST_VALUE(value) OVER w AS value,
SUM(COUNT(*)) OVER w > (SELECT count(*)/2 FROM x) AS above
FROM x
GROUP BY value
WINDOW w AS (ORDER BY value)
ORDER BY value
) AS find_if_values_are_above_or_below_median
WINDOW w2 AS (PARTITION BY above ORDER BY value DESC),
w3 AS (PARTITION BY above ORDER BY value ASC)
) AS find_values_around_median
Any ideas?
Yes, as of PostgreSQL 9.4 you can use the inverse distribution function PERCENTILE_CONT(), an ordered-set aggregate function that is also specified in the SQL standard.
WITH t(value) AS (
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 100
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
t;
This emulation of MEDIAN() via PERCENTILE_CONT() is also documented here.
Indeed there IS an easier way. In Postgres you can define your own aggregate functions. I posted functions to do median as well as mode and range to the PostgreSQL snippets library a while back.
http://wiki.postgresql.org/wiki/Aggregate_Median
A simpler query for that:
WITH y AS (
SELECT value, row_number() OVER (ORDER BY value) AS rn
FROM x
WHERE value IS NOT NULL
)
, c AS (SELECT count(*) AS ct FROM y)
SELECT CASE WHEN c.ct%2 = 0 THEN
round((SELECT avg(value) FROM y WHERE y.rn IN (c.ct/2, c.ct/2+1)), 3)
ELSE
(SELECT value FROM y WHERE y.rn = (c.ct+1)/2)
END AS median
FROM c;
Major points
Ignores NULL values.
The core feature is the row_number() window function, which has been available since version 8.4.
The final SELECT gets one row for uneven numbers and avg() of two rows for even numbers. Result is numeric, rounded to 3 decimal places.
A test shows that the new version is 4x faster than the query in the question (and, unlike it, yields correct results):
CREATE TEMP TABLE x (value int);
INSERT INTO x SELECT generate_series(1,10000);
INSERT INTO x VALUES (NULL),(NULL),(NULL),(3);
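The logic of the row_number() query can be sketched in a few lines of Python for cross-checking: sort the non-NULL values, then take the middle one, or the average of the two middle ones for an even count. The function name median and the use of None for NULL are choices made for this sketch; the sample mirrors the test data above.

```python
import statistics


def median(values):
    """Median of the non-None values, or None if there are none.
    Mirrors the SQL: middle row for an odd count, avg of two middle rows otherwise."""
    ordered = sorted(v for v in values if v is not None)
    ct = len(ordered)
    if ct == 0:
        return None
    mid = ct // 2
    if ct % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


# Same shape as the test table: 1..10000, three NULLs, and an extra 3.
sample = list(range(1, 10001)) + [None, None, None, 3]
result = median(sample)

# Cross-check against the standard library on the non-NULL values.
expected = statistics.median([v for v in sample if v is not None])
print(result, expected)
```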
For googlers: there is also http://pgxn.org/dist/quantile
Median can be calculated in one line after installation of this extension.
Simple SQL with native Postgres functions only:
select
case count(*)%2
when 1 then (array_agg(num order by num))[count(*)/2+1]
else ((array_agg(num order by num))[count(*)/2]::double precision + (array_agg(num order by num))[count(*)/2+1])/2
end as median
from unnest(array[5,17,83,27,28]) num;
Of course, you can add coalesce() or similar if you want to handle NULLs.
CREATE TABLE array_table (id integer, values integer[]) ;
INSERT INTO array_table VALUES ( 1,'{1,2,3}');
INSERT INTO array_table VALUES ( 2,'{4,5,6,7}');
select id, values, cardinality(values) as array_length,
(case when cardinality(values)%2=0 and cardinality(values)>1 then (values[(cardinality(values)/2)]+ values[((cardinality(values)/2)+1)])/2::float
else values[(cardinality(values)+1)/2]::float end) as median
from array_table
Or you can create a function and use it anywhere in your further queries:
CREATE OR REPLACE FUNCTION median (a integer[])
RETURNS float AS $median$
Declare
abc float;
BEGIN
SELECT (case when cardinality(a)%2=0 and cardinality(a)>1 then
(a[(cardinality(a)/2)] + a[((cardinality(a)/2)+1)])/2::float
else a[(cardinality(a)+1)/2]::float end) into abc;
RETURN abc;
END;
$median$
LANGUAGE plpgsql;
select id,values,median(values) from array_table
Use the function below to find the nth percentile:
CREATE or REPLACE FUNCTION nth_percentil(anyarray, int)
RETURNS
anyelement as
$$
SELECT $1[$2/100.0 * array_upper($1,1) + 1] ;
$$
LANGUAGE SQL IMMUTABLE STRICT;
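In your case it's the 50th percentile.
Use the query below to get the median:
SELECT nth_percentil(ARRAY (SELECT Field_name FROM table_name ORDER BY 1),50)
This returns the element at the 50th percentile, which is essentially the median. Note that it returns a single array element; the two middle values are not averaged for an even count.
Hope this is helpful.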