Is there a better way to calculate the median (not average) - sql

Suppose I have the following table definition:
CREATE TABLE x (i serial primary key, value integer not null);
I want to calculate the MEDIAN of value (not the AVG). The median is the value that divides the set into two subsets containing the same number of elements. If the number of elements is even, the median is the average of the largest value in the lower segment and the smallest value in the upper segment. (See Wikipedia for more details.)
Here is how I manage to calculate the MEDIAN but I guess there must be a better way:
SELECT AVG(values_around_median) AS median
FROM (
SELECT
DISTINCT(CASE WHEN FIRST_VALUE(above) OVER w2 THEN MIN(value) OVER w3 ELSE MAX(value) OVER w2 END)
AS values_around_median
FROM (
SELECT LAST_VALUE(value) OVER w AS value,
SUM(COUNT(*)) OVER w > (SELECT count(*)/2 FROM x) AS above
FROM x
GROUP BY value
WINDOW w AS (ORDER BY value)
ORDER BY value
) AS find_if_values_are_above_or_below_median
WINDOW w2 AS (PARTITION BY above ORDER BY value DESC),
w3 AS (PARTITION BY above ORDER BY value ASC)
) AS find_values_around_median
Any ideas?

Yes, with PostgreSQL 9.4, you can use the newly introduced inverse distribution function PERCENTILE_CONT(), an ordered-set aggregate function that is specified in the SQL standard as well.
WITH t(value) AS (
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 100
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
t;
This emulation of MEDIAN() via PERCENTILE_CONT() is also documented here.
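Since the question also covers the even-count case, it may help to contrast the two standard ordered-set aggregates: percentile_cont() interpolates between the two middle values, while percentile_disc() always returns an actual element of the set. A quick sketch:

```sql
WITH t(value) AS (
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 100
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS cont_median, -- 2.5 (interpolated)
percentile_disc(0.5) WITHIN GROUP (ORDER BY value) AS disc_median  -- 2 (an actual row value)
FROM t;
```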

Indeed there IS an easier way. In Postgres you can define your own aggregate functions. I posted functions to do median as well as mode and range to the PostgreSQL snippets library a while back.
http://wiki.postgresql.org/wiki/Aggregate_Median

A simpler query for that:
WITH y AS (
SELECT value, row_number() OVER (ORDER BY value) AS rn
FROM x
WHERE value IS NOT NULL
)
, c AS (SELECT count(*) AS ct FROM y)
SELECT CASE WHEN c.ct%2 = 0 THEN
round((SELECT avg(value) FROM y WHERE y.rn IN (c.ct/2, c.ct/2+1)), 3)
ELSE
(SELECT value FROM y WHERE y.rn = (c.ct+1)/2)
END AS median
FROM c;
Major points
Ignores NULL values.
The core feature is the row_number() window function, which has been available since version 8.4.
The final SELECT gets one row for odd counts and the avg() of two rows for even counts. The result is numeric, rounded to 3 decimal places.
Testing shows that the new version is 4x faster than (and yields correct results, unlike) the query in the question:
CREATE TEMP TABLE x (value int);
INSERT INTO x SELECT generate_series(1,10000);
INSERT INTO x VALUES (NULL),(NULL),(NULL),(3);

For googlers: there is also http://pgxn.org/dist/quantile
Median can be calculated in one line after installation of this extension.
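A sketch of what that one-liner looks like, assuming the extension provides a quantile(value, fraction) aggregate (check the extension's documentation for the exact signature):

```sql
-- after: CREATE EXTENSION quantile;
SELECT quantile(value, 0.5) AS median
FROM x;
```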

Simple sql with native postgres functions only:
select
case count(*)%2
when 1 then (array_agg(num order by num))[count(*)/2+1]
else ((array_agg(num order by num))[count(*)/2]::double precision + (array_agg(num order by num))[count(*)/2+1])/2
end as median
from unnest(array[5,17,83,27,28]) num;
Sure you can add coalesce() or something if you want to handle nulls.
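For example, NULLs can simply be filtered out before aggregating; otherwise count(*) would count NULL rows and skew the positions. A sketch of the same query with a NULL mixed in:

```sql
select
case count(num)%2
when 1 then (array_agg(num order by num))[count(num)/2+1]
else ((array_agg(num order by num))[count(num)/2]::double precision + (array_agg(num order by num))[count(num)/2+1])/2
end as median
from unnest(array[5,17,83,null,27,28]) num
where num is not null;  -- ignore NULLs, leaving 5 values whose median is 27
```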

CREATE TABLE array_table (id integer, vals integer[]);
INSERT INTO array_table VALUES (1, '{1,2,3}');
INSERT INTO array_table VALUES (2, '{4,5,6,7}');
select id, vals, cardinality(vals) as array_length,
(case when cardinality(vals)%2=0 and cardinality(vals)>1 then (vals[cardinality(vals)/2] + vals[cardinality(vals)/2+1])/2::float
else vals[(cardinality(vals)+1)/2]::float end) as median
from array_table;
(Note: values is a reserved word in PostgreSQL, so the column is named vals here; the original, unquoted column name would not parse. Both versions also assume the array elements are already sorted.)
Or you can create a function and use it anywhere in your further queries:
CREATE OR REPLACE FUNCTION median(a integer[])
RETURNS float AS $median$
DECLARE
abc float;
BEGIN
SELECT (case when cardinality(a)%2=0 and cardinality(a)>1 then
(a[cardinality(a)/2] + a[cardinality(a)/2+1])/2::float
else a[(cardinality(a)+1)/2]::float end) into abc;
RETURN abc;
END;
$median$
LANGUAGE plpgsql;
select id, vals, median(vals) from array_table;

Use the function below for finding the nth percentile:
CREATE OR REPLACE FUNCTION nth_percentil(anyarray, int)
RETURNS anyelement AS
$$
-- array subscripts must be integers, so the computed position is floored and cast
SELECT $1[floor($2/100.0 * array_upper($1,1))::int + 1];
$$
LANGUAGE SQL IMMUTABLE STRICT;
In your case it's the 50th percentile.
Use the query below to get the median:
SELECT nth_percentil(ARRAY(SELECT Field_name FROM table_name ORDER BY 1), 50);
This will give you the 50th percentile, which is basically the median.
Hope this is helpful.

Related

Select all rows where the sum of column X is greater than or equal to Y

I need to find a group of lots to satisfy a demand of X items. I can't do it with aggregate functions alone; it seems to me that I need something more than a window function. Do you know anything that can help me solve this problem?
For example, if I have a demand for 1 item, the query should return any lot with a quantity greater than or equal to 1. But if I have a demand for 15 and there are no lots with that availability, it should return a lot of 10 and another with 5, or one of 10 and two of 3, etc.
With a programming language like Java this is simple, but is it possible with SQL? I am trying to achieve it with window functions, but I cannot find a way to keep adding the available quantity of the current row until reaching the required quantity.
SELECT id,VC_NUMERO_LOTE,SF_FECHA_CREACION,SI_ID_M_ARTICULO,VI_CANTIDAD,NEXT, VI_CANTIDAD + NEXT AS TOT FROM (
SELECT row_number() over (ORDER BY SF_FECHA_CREACION desc) id ,VC_NUMERO_LOTE,SF_FECHA_CREACION,SI_ID_M_ARTICULO,
VI_CANTIDAD,LEAD(VI_CANTIDAD,1) OVER (ORDER BY SF_FECHA_CREACION desc) as NEXT FROM PUBLIC.M_LOTE WHERE SI_ID_M_ARTICULO = 44974
AND VI_CANTIDAD > 0 ) AS T
WHERE MOD(id, 2) != 0
I tried with lead to then sum only odd records but I saw that it is not the way, any suggestions?
You need a recursive query like this:
demo:db<>fiddle
WITH RECURSIVE lots_with_rowcount AS ( -- 1
SELECT
*,
row_number() OVER (ORDER BY avail_qty DESC) as rowcnt
FROM mytable
), lots AS ( -- 2
SELECT -- 3
lot_nr,
avail_qty,
rowcnt,
avail_qty as total_qty
FROM lots_with_rowcount
WHERE rowcnt = 1
UNION
SELECT
t.lot_nr,
t.avail_qty,
t.rowcnt,
l.total_qty + t.avail_qty -- 4
FROM lots_with_rowcount t
JOIN lots l ON t.rowcnt = l.rowcnt + 1
AND l.total_qty < --<your demand here>
)
SELECT * FROM lots -- 5
1. This CTE only provides a row count for each record, which is used within the recursion to join the next record.
2. This is the recursive CTE. A recursive CTE consists of two parts: the initial SELECT statement and the recursion.
3. Initial part: it queries the lot record with the highest avail_qty value. Naturally, you can order the lots any way you like; largest quantity first yields the smallest output.
4. After the UNION comes the recursion part: the current row is joined to the previous output, with the additional condition that the join only happens if the previous running total doesn't yet cover your demand value. In that case, the next total_qty value is calculated from the previous total and the current quantity.
5. The recursion ends when no record is left that satisfies the join condition. Then you can SELECT the entire recursion output.
Notice: if your demand were higher than all your available quantities in total, this would return the entire table, because the recursion runs until either the demand is reached or the table ends. You should run a check beforehand:
SELECT SUM(avail_qty) > demand FROM mytable
I gratefully fiddled around with S-Man's fiddle and found a query that is at least simpler to understand:
select lot_nr, avail_qty, tot_amount from
(select lot_nr, avail_qty,
sum(avail_qty) over (order by avail_qty desc rows between unbounded preceding and current row) as tot_amount,
sum(avail_qty) over (order by avail_qty desc rows between unbounded preceding and current row) - avail_qty as last_amount
from mytable) amounts
where last_amount < 15 -- your amount here
So this lists all rows where the running total up to and including the predecessor (in descending order by avail_qty) hasn't yet reached the limit.
Here is a simple old-school PL/pgSQL version that uses a (slow) loop. It returns only the lot numbers, as an illustration. Basically, it returns lot numbers for a particular item_id in a certain order (reflecting the required business rules) and allocates the available quantities until the allocated quantity equals or exceeds the required quantity.
create function get_lots(required_item integer, required_qty integer) returns setof text as
$$
declare
r record;
allocated_qty integer := 0;
begin
for r in select * from lots where item_id = required_item order by <your biz-rule> loop
return next r.lot_number;
allocated_qty := allocated_qty + r.available_qty;
exit when allocated_qty >= required_qty;
end loop;
end;
$$ language plpgsql;
-- Use
select lot_id from get_lots(1, 17) lot_id;

How to check if a column data is an arithematic progression in PostgreSQL

Suppose I have a column C in a table T, which is as follows:

sr | c
---+----------------
 1 | 34444444444440
 2 | 34444444444442
 3 | 34444444444444
 4 | 34444444444446
 5 | 34444444444448
 6 | 34444444444450

How can I verify or check whether the values in column C form an arithmetic progression?
An arithmetic progression means that the differences are all constant. Assuming that the values are not floating point, you can compare them directly:
select (min(c - prev_c) = max(c - prev_c)) as is_arithmetic_progression
from (select t.*,
lag(c) over (order by sr) as prev_c
from t
) t
If these are floating point values, you probably want some sort of tolerance, such as:
select (max(c - prev_c) - min(c - prev_c)) < 0.001 as is_arithmetic_progression
step-by-step demo:db<>fiddle
SELECT
COUNT(*) = 1 as is_arithmetic_progression -- 4
FROM (
SELECT
difference
FROM (
SELECT
*,
lead(c) OVER (ORDER BY sr) - c as difference -- 1
FROM
mytable
) s
WHERE difference IS NOT NULL -- 2
GROUP BY difference -- 3
) s
Arithmetic progression: the difference between consecutive elements is constant.
1. The lead() window function shifts the next value into the current row; subtracting the current value produces the difference.
2. lead() yields NULL in the last row, because there is no "next" value, so that row is filtered out.
3. The difference values are grouped.
4. If there is only one distinct difference value, this results in a single group. Only one difference value means the difference between elements is constant, which is exactly what an arithmetic progression is. So if the number of groups is exactly 1, you have an arithmetic progression.
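The same idea can be written more compactly: count the distinct consecutive differences and check that there is exactly one. A sketch against the same assumed table mytable(sr, c):

```sql
SELECT count(DISTINCT diff) = 1 AS is_arithmetic_progression
FROM (
  SELECT lead(c) OVER (ORDER BY sr) - c AS diff  -- difference to the next row
  FROM mytable
) s
WHERE diff IS NOT NULL;  -- drop the trailing NULL produced by lead()
```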
You can use exists as follows:
select case when count(*) > 0 then 'no progression' else 'progression' end as res_
from your_table t
where exists
(select 1 from your_table tt
where tt.sr > t.sr
and tt.c < t.c)

PostgreSQL - multiple aggregate queries from the same function call

I have a function that returns a setof from a table:
CREATE OR REPLACE FUNCTION get_assoc_addrs_from_bbl(_bbl text)
RETURNS SETOF wow_bldgs AS $$
SELECT bldgs.* FROM wow_bldgs AS bldgs
...
$$ LANGUAGE SQL STABLE;
Here's a sample of what the table would return:
Now I'm writing an "aggregate" function that will return only one row with various (aggregated) data points about the table that this function returns. Here is my current working (and naive) example:
SELECT
count(distinct registrationid) as bldgs,
sum(unitsres) as units,
round(avg(yearbuilt), 1) as age,
(SELECT first(corpname) FROM (
SELECT unnest(corpnames) as corpname
FROM get_assoc_addrs_from_bbl('3012380016')
GROUP BY corpname ORDER BY count(*) DESC LIMIT 1
) corps) as topcorp,
(SELECT first(businessaddr) FROM (
SELECT unnest(businessaddrs) as businessaddr
FROM get_assoc_addrs_from_bbl('3012380016')
GROUP BY businessaddr ORDER BY count(*) DESC LIMIT 1
) rbas) as topbusinessaddr
FROM get_assoc_addrs_from_bbl('3012380016') assocbldgs
As you can see, for the two "subqueries" that require a custom grouping/ordering method, I need to repeat the call to get_assoc_addrs_from_bbl(). Ideally, I'm looking for a structure that would avoid the repeated calls as the function requires a lot of processing and I want the capacity for an arbitrary number of subqueries. I've looked into CTEs and window expressions and the like but no luck.
Any tips? Thank you!
Create simple aggregate function:
create aggregate array_agg2(anyarray) (
sfunc=array_cat,
stype=anyarray);
It aggregates array values into one single-dim array. Example:
# with t(x) as (values(array[1,2]),(array[2,3,4])) select array_agg2(x) from t;
┌─────────────┐
│ array_agg2 │
╞═════════════╡
│ {1,2,2,3,4} │
└─────────────┘
After that your query could be rewritten as
SELECT
count(distinct registrationid) as bldgs,
sum(unitsres) as units,
round(avg(yearbuilt), 1) as age,
(SELECT first(corpname) FROM (
SELECT * FROM unnest(array_agg2(corpnames)) as corpname
GROUP BY corpname ORDER BY count(*) DESC LIMIT 1
) corps) as topcorp,
(SELECT first(businessaddr) FROM (
SELECT * FROM unnest(array_agg2(businessaddrs)) as businessaddr
GROUP BY businessaddr ORDER BY count(*) DESC LIMIT 1
) rbas) as topbusinessaddr
FROM get_assoc_addrs_from_bbl('3012380016') assocbldgs
(surely if I understand your goal correctly)

Finding the sum of a column that contains a window function

I have this select query:
SELECT
total,
COALESCE(total - Lag(total)OVER(ORDER BY total), 0) AS dif_total
FROM ( select count(*) as total
FROM
tbl_person
left join
tbl_census
on
tbl_census.person_id = tbl_person.person_id
group by extract(year from tbl_census.date)
) abc
Is there a way I could find the sum of the column dif_total?
I can't use the Sum() because it contains a window function.
I tried saving the column to an array because I figure maybe I could call the function and convert the array to a column then use Sum().
But I messed it up.
Here is my query for the function.
CREATE OR REPLACE function growth() Returns int[] as $$
declare total2 integer[];
BEGIN
SELECT
total,
COALESCE(total - Lag(total)OVER(ORDER BY total), 0) into total2
FROM
( select count(*) as total
from
tbl_person
group by extract(year from bdate)
) abc ;
RETURN total2;
END; $$ LANGUAGE plpgsql;
The function query runs successfully and does not show any warning or error, but I think I was doing it wrong, because when I try to SELECT it I get
Array value must start with "{" or dimension information
I'm very new to using stored functions in Postgres.
What changes should I do to my function to work?
Or what are the other ways for me to sum the column dif_total above?
Why don't you just wrap it with another select?
SELECT total, sum(dif_total) as total_2
FROM ( YOUR QUERY HERE... ) sub  -- PostgreSQL requires an alias on the derived table
GROUP BY total
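If the goal is a single grand total of dif_total over all rows, the wrapper can also be written out against the original query like this (a sketch; note the required alias on the derived table):

```sql
SELECT sum(dif_total) AS sum_dif_total
FROM (
  SELECT
    total,
    COALESCE(total - lag(total) OVER (ORDER BY total), 0) AS dif_total
  FROM (
    SELECT count(*) AS total
    FROM tbl_person
    LEFT JOIN tbl_census ON tbl_census.person_id = tbl_person.person_id
    GROUP BY extract(year FROM tbl_census.date)
  ) abc
) wrapped;
```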

Sorting twice on same column

I'm having a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Xyz (20/10/2012)
(notice that the second Bar has moved up one position)
So, the logic behind it is, that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, it must be sorted with the LATEST date at the top, while staying in the other sorting order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a sql server database (the Name and the date are in two different columns). So I'm looking for a SQL-query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)
This works, I think
declare #t table (data varchar(50), date datetime)
insert #t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from #t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from #t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000
A way with better performance than any of the other posted answers is to just do it entirely with an ORDER BY and not a JOIN or using CTE:
DECLARE #t TABLE (myData varchar(50), myDate datetime)
INSERT INTO #t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM #t t1
ORDER BY (SELECT MIN(t2.myDate) FROM #t t2 WHERE t2.myData = t1.myData), T1.myDate DESC
This does exactly what you request and will work with any indexes and much better with larger amounts of data than any of the other answers.
Additionally it's much more clear what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.
This one uses analytic functions to perform the sort, it only requires one SELECT from your table.
The inner query finds gaps, where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by these groups.
I have tried it here (SQL Fiddle) with extended test-data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
I think that this works, including the case I asked about in the comments:
declare #t table (data varchar(50), [date] datetime)
insert #t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from #t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note that this may, though, not be too efficient for large data sets. On the sample data it shows up as requiring 4 table scans of the base table, as well as a spool.
Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE;
If you do not want to use the CASE, you can try to find an implementation of the FIELD function to SQL Server.
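Once defined, the Postgres field() function above can be used just like MySQL's, for example (a sketch; the table, column name, and literal list are placeholders for your data):

```sql
SELECT *
FROM t
ORDER BY field(date_col, ARRAY['14/08/2012','16/09/2012','15/08/2012','20/10/2012']);
```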