Do all groups have equal total power for given subgroup?

I have a PostgreSQL table like this:
CREATE TABLE foo (man_id, subgroup, power, grp)
AS VALUES
( 1, 'Sub_A', 1, 'Group_A' ),
( 2, 'Sub_B', -1, 'Group_A' ),
( 3, 'Sub_A', -1, 'Group_B' ),
( 4, 'Sub_B', 1, 'Group_B' ),
( 5, 'Sub_A', -1, 'Group_A' ),
( 6, 'Sub_B', 1, 'Group_A' ),
( 7, 'Sub_A', -1, 'Group_B' ),
( 8, 'Sub_B', 1, 'Group_B' );
The power calculation works like this:
Total Power of Subgroup Sub_A in the grp Group_A is (1 + (-1) ) = 0
Total Power of Subgroup Sub_B in the grp Group_A is ((-1) + 1 ) = 0
Total Power of Subgroup Sub_A in the grp Group_B is ((-1) + (-1) ) = -2
Total Power of Subgroup Sub_B in the grp Group_B is (1 + 1 ) = 2
So the power of Sub_A in the Group_A is not equal to power of Sub_A in the Group_B
So the power of Sub_B in the Group_A is not equal to power of Sub_B in the Group_B
I want to query the database with a subgroup name. If the total power of that subgroup is equal across all grp values, the query should return True, otherwise False.
As an example, sub_A and sub_B would both return False. What would be the recommended way to do this?
I want something like:
SELECT * FROM foo (solution query will be added)
WHERE subgroup = 'sub_A'
And it returns False.

Read the question carefully
I want to query the database with a subgroup name.
And:
I want something like
SELECT * FROM foo (solution query will be added)
WHERE subgroup = 'Sub_A'
The important point for performance is to exclude irrelevant rows early and only compute aggregates for the given subgroup. Then (assuming more than a few distinct subgroups), an index on (subgroup) can help:
CREATE INDEX ON foo (subgroup);
Each of the following queries returns FALSE if at least two groups have different total sums for the given subgroup, and TRUE in all other cases (with a minor exception for query 5, see below).
Query 1
SELECT count(DISTINCT total_power) = 1
FROM (
SELECT sum(power) AS total_power
FROM foo
WHERE subgroup = 'Sub_A' -- exclude irrelevant rows early!
GROUP BY grp
) sub;
Query 2
SELECT count(*) = 1
FROM (
SELECT true
FROM (
SELECT sum(power) AS total_power
FROM foo
WHERE subgroup = 'Sub_A'
GROUP BY grp
) sub1
GROUP BY total_power
) sub2;
Query 3
SELECT count(*) OVER () = 1
FROM (
SELECT sum(power) AS total_power
FROM foo
WHERE subgroup = 'Sub_A'
GROUP BY grp
) sub
GROUP BY total_power
LIMIT 1;
Query 4
(
SELECT FALSE
FROM (
SELECT sum(power) AS total_power
FROM foo
WHERE subgroup = 'Sub_A'
GROUP BY grp
) sub
GROUP BY total_power
OFFSET 1
LIMIT 1
)
UNION ALL
SELECT TRUE
LIMIT 1;
This one is special. Related answers with explanation:
Return a value if no record is found
Way to try multiple SELECTs till a result is available?
Query 5
SELECT min(total_power) = max(total_power) -- can fail for NULL values
FROM (
SELECT sum(power) AS total_power
FROM foo
WHERE subgroup = 'Sub_A'
GROUP BY grp
) sub;
The last can fail if NULL values in power are allowed. (But you would have to define expected results in this case anyway.)
I ran an extensive test and found all queries to perform about the same under ideal conditions:
db<>fiddle here
Query 5 tended to be a tad bit faster than the rest.
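To try Query 1 without a PostgreSQL server, here is a minimal sketch that runs it from Python against an in-memory SQLite database. SQLite is only a stand-in here (an assumption, the answer targets Postgres), and the helper name same_power_in_all_groups is made up for illustration. The sample rows are the ones from the question.

```python
import sqlite3

# Sample rows copied from the question: (man_id, subgroup, power, grp).
ROWS = [
    (1, "Sub_A", 1, "Group_A"),
    (2, "Sub_B", -1, "Group_A"),
    (3, "Sub_A", -1, "Group_B"),
    (4, "Sub_B", 1, "Group_B"),
    (5, "Sub_A", -1, "Group_A"),
    (6, "Sub_B", 1, "Group_A"),
    (7, "Sub_A", -1, "Group_B"),
    (8, "Sub_B", 1, "Group_B"),
]

# Query 1, with the subgroup passed as a bound parameter.
QUERY_1 = """
SELECT count(DISTINCT total_power) = 1
FROM (
    SELECT sum(power) AS total_power
    FROM foo
    WHERE subgroup = ?
    GROUP BY grp
) sub
"""


def same_power_in_all_groups(conn, subgroup):
    """Return True if every grp has the same total power for `subgroup`."""
    (flag,) = conn.execute(QUERY_1, (subgroup,)).fetchone()
    return bool(flag)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (man_id INTEGER, subgroup TEXT, power INTEGER, grp TEXT)"
)
conn.executemany("INSERT INTO foo VALUES (?, ?, ?, ?)", ROWS)

print(same_power_in_all_groups(conn, "Sub_A"))  # False: totals are 0 and -2
print(same_power_in_all_groups(conn, "Sub_B"))  # False: totals are 0 and 2
```

Note that with this particular query a subgroup that matches no rows also yields False, because the count of distinct totals is then 0.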

Assuming that 'sub_A' in your query is meant to be 'Sub_A', as in your CREATE TABLE statement (string comparisons in Postgres are case-sensitive), and that your power values are actually integers (if they aren't, just add casts; the code below is simpler without them), then you can calculate the power for each subgroup within a group as follows:
select
subgroup,
grp,
sum(power) as sum_power
from
foo
group by
subgroup,
grp
To determine whether all of the total power values for a subgroup are the same, just check that the minimum and maximum values are the same. Convert the previous query into a subquery, where the main query does that comparison, as follows:
select
subgroup
from (
select
subgroup,
grp,
sum(power) as sum_power
from
foo
group by
subgroup,
grp
) as subpwr
group by
subgroup
having
min(sum_power) = max(sum_power);
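The same two-step logic (sum per subgroup and grp, then compare minimum and maximum per subgroup) can be sketched in plain Python. This is only an illustration of the idea, not a replacement for the SQL; the function name subgroups_with_equal_power is hypothetical.

```python
from collections import defaultdict


def subgroups_with_equal_power(rows):
    """rows: iterable of (subgroup, grp, power).

    Returns the subgroup names whose per-grp totals are all equal
    (min == max), sorted by name.
    """
    totals = defaultdict(int)  # (subgroup, grp) -> sum of power
    for subgroup, grp, power in rows:
        totals[(subgroup, grp)] += power

    per_subgroup = defaultdict(list)  # subgroup -> list of per-grp totals
    for (subgroup, _grp), total in totals.items():
        per_subgroup[subgroup].append(total)

    return sorted(s for s, sums in per_subgroup.items() if min(sums) == max(sums))


question_rows = [
    ("Sub_A", "Group_A", 1), ("Sub_B", "Group_A", -1),
    ("Sub_A", "Group_B", -1), ("Sub_B", "Group_B", 1),
    ("Sub_A", "Group_A", -1), ("Sub_B", "Group_A", 1),
    ("Sub_A", "Group_B", -1), ("Sub_B", "Group_B", 1),
]
print(subgroups_with_equal_power(question_rows))  # [] (both subgroups differ)
```

Like the SQL, this returns the matching subgroup names rather than a single boolean; with the question's data neither subgroup qualifies.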

Here's 1 way not yet mentioned by the other answers
SELECT SUM(power) = FIRST_VALUE(SUM(power)) OVER () powpow
FROM foo
WHERE subgroup = 'Sub_A'
GROUP BY grp
ORDER BY powpow
LIMIT 1
-- returns:
-- false if some values differ
-- true if all values are the same
-- no rows if the where condition fails to match any rows.


BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
|id_1|id_2|
|----|----|
|1|a|
|2|a|
|2|b|
|3|b|
|4|c|
|5|c|
|6|d|
|6|e|
|7|f|
I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links, that is, each collection of IDs that are linked together.
The desired output for the example above would look something like this:
|id_1_coll|id_2_coll|
|---------|---------|
|1, 2, 3|a, b|
|4, 5|c|
|6|d, e|
|7|f|
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 so that we get all the links from id_1 to itself, use a recursive common table expression to traverse all the possible paths between id_1 values, and then aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are:
1. Remodel the relationship into a series of self-joins for id_1
2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
3. Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
|id_1|linked_id|lowest_linked_id|
|----|---------|----------------|
|1|2|1|
|2|1|1|
|2|3|2|
|3|2|2|
|4|5|4|
|5|4|4|
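For readers who want to check the base iteration by hand, the self-join plus LEAST can be mimicked in a few lines of Python. This is only a sketch of the same logic outside BigQuery; the variable names are mine.

```python
# Link rows from the question's example table t: (id_1, id_2).
t = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
     (5, "c"), (6, "d"), (6, "e"), (7, "f")]

# Self-join on id_2, keeping only pairs of different id_1 values,
# and record the smaller id_1 of each pair (the LEAST(...) column).
base_iter = sorted(
    (a1, b1, min(a1, b1))
    for a1, a2 in t
    for b1, b2 in t
    if a2 == b2 and a1 != b1
)

for row in base_iter:
    print(row)
# (1, 2, 1)
# (2, 1, 1)
# (2, 3, 2)
# (3, 2, 2)
# (4, 5, 4)
# (5, 4, 4)
```

IDs 6 and 7 do not appear because they share no id_2 with any other id_1, which is why the final query needs the LEFT JOIN and IFNULL in link_back.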
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
We can now link these back to the original table for the id_2 values and then aggregate (3.), as shown in the complete query below.
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
|grp|id1_coll|id2_coll|
|---|--------|--------|
|1|[1,2,3]|[a,b]|
|4|[4,5]|[c]|
|6|[6]|[d,e]|
|7|[7]|[f]|
Limitations/Issues
Unfortunately this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails with the real-world case where we have several million join rows. When trying to execute on this data, BigQuery runs up a huge "Slot time consumed" and eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if an id_1 value and its linked_id are both already in the list of linked_ids, we don't need to check it further).
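To illustrate what "merging as we go" could look like, here is a sketch of a different technique, union-find (disjoint sets), written in Python. It is not a BigQuery solution and not the approach used in the query above; it assumes the link rows can be processed outside SQL, and the function name connected_groups is hypothetical.

```python
def connected_groups(links):
    """Group (id_1, id_2) link rows into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag nodes so that an id_1 value can never collide with an id_2 value.
    for id_1, id_2 in links:
        union(("id_1", id_1), ("id_2", id_2))

    groups = {}
    for node in list(parent):
        kind, value = node
        bucket = groups.setdefault(find(node), {"id_1": set(), "id_2": set()})
        bucket[kind].add(value)

    result = [(sorted(g["id_1"]), sorted(g["id_2"])) for g in groups.values()]
    return sorted(result)


t = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
     (5, "c"), (6, "d"), (6, "e"), (7, "f")]

for id_1_coll, id_2_coll in connected_groups(t):
    print(id_1_coll, id_2_coll)
# [1, 2, 3] ['a', 'b']
# [4, 5] ['c']
# [6] ['d', 'e']
# [7] ['f']
```

Each link row is visited once and components are merged immediately, so no path is walked twice. Whether this is practical for several million rows depends on being able to export the links from the warehouse.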
Using ROW_NUMBER(), the query is as follows:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1: Adds row numbers.
t2: Recursively collects rows that match on either id_1 or id_2.
t3: Builds arrays from id_1 and id_2 with ARRAY_AGG().
However, this may not resolve the problems described under Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order/one or both id's move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates as done by @RomanPekar here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.

Stacking my conditions in a CASE statement it's not returning all cases for each member

SELECT DISTINCT
Member_ID,
CASE
WHEN a.ASTHMA_MBR = 1 THEN 'ASTHMA'
WHEN a.COPD_MBR = 1 THEN 'COPD'
WHEN a.HYPERTENSION_MBR = 1 THEN 'HYPERTENSION'
END AS DX_FLAG
So a member may have more than one, but my statement is only returning one of them.
I'm using Teradata and trying to convert multiple columns of boolean data into one column. The statement is only returning one condition when members may have 2 or more. I tried using Select instead of Select Distinct and it made no difference.
This is a kind of UNPIVOT:
with base_data as
( -- select the columns you want to unpivot
select
member_id
,date_col
-- the aliases will be the final column value
,ASTHMA_MBR AS ASTHMA
,COPD_MBR AS COPD
,HYPERTENSION_MBR AS HYPERTENSION
from your_table
)
,unpvt as
(
select member_id, date_col, x, DX_FLAG
from base_data
-- now unpivot those columns into rows
UNPIVOT(x FOR DX_FLAG IN (ASTHMA, COPD, HYPERTENSION)
) dt
)
select member_id, DX_FLAG, date_col
from unpvt
-- only show rows where the condition is true
where x = 1
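The reason the original CASE returns only one value is that a CASE expression stops at the first WHEN branch that is true, so it can never produce more than one row per input row. The following Python sketch contrasts the two behaviours. The sample member row is made up, and the two functions only imitate the SQL semantics.

```python
FLAG_COLUMNS = {  # column name -> label, as in the question
    "ASTHMA_MBR": "ASTHMA",
    "COPD_MBR": "COPD",
    "HYPERTENSION_MBR": "HYPERTENSION",
}


def case_expression(row):
    """Mimics CASE: only the first matching WHEN branch is returned."""
    for column, label in FLAG_COLUMNS.items():
        if row[column] == 1:
            return label
    return None


def unpivot(row):
    """Mimics UNPIVOT plus the x = 1 filter: one output row per set flag."""
    return [
        (row["Member_ID"], label)
        for column, label in FLAG_COLUMNS.items()
        if row[column] == 1
    ]


# Made-up member with two conditions set.
member = {"Member_ID": 42, "ASTHMA_MBR": 1, "COPD_MBR": 0, "HYPERTENSION_MBR": 1}

print(case_expression(member))  # ASTHMA
print(unpivot(member))          # [(42, 'ASTHMA'), (42, 'HYPERTENSION')]
```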

Count Joins from Multiple Tables

For reference, I am using Postgres 9.2.23.
I have several tables where one table (user_group) is related to some other tables (e.g. posts, group_invites, and a few others). There is also a groups table, but it doesn't hold any data that I need for the purposes of these queries.
Table user_group:
pk_user_group_id, fk_user_id, fk_group_id, fk_invite_id, user_status, ...
Table message:
pk_message_id, fk_user_id, fk_group_id, child_message_id, ...
Table group_prospective_user:
pk_prospective_user_id, fk_group_id, ...
I want to get some statistics for each of the related tables for a list of specified group ids if the user is a member of the group.
Right now I do this with one query for each related table, eg:
select
"public"."user_group"."fk_group_id" as "groupId",
count(case
when (
"public"."message"."child_message_id" is null
and "public"."message"."pk_message_id" is not null
) then "public"."message"."pk_message_id"
end) as "numDiscussions",
count("public"."message"."pk_message_id") as "numDiscussionPosts"
from "public"."user_group"
left outer join "public"."message"
on "public"."message"."fk_group_id" = "public"."user_group"."fk_group_id"
where (
"public"."user_group"."fk_group_id" in (
1, 11, 23, 530, 1070
)
and "public"."user_group"."role" in (
'ADMINISTRATOR', 'MODERATOR', 'MEMBER'
)
and "public"."user_group"."fk_user_id" = 17517
)
group by "public"."user_group"."fk_group_id"
And for invites:
select
"public"."user_group"."fk_group_id" as "groupId",
count(case
when "public"."prospective_user"."status" = 1 then "public"."prospective_user"."pk_prospective_user_id"
end) as "numInviteesExternal"
from "public"."user_group"
left outer join "public"."prospective_user"
on "public"."prospective_user"."fk_group_id" = "public"."user_group"."fk_group_id"
where (
"public"."user_group"."fk_group_id" in (
1, 11, 23, 530, 6176
)
and "public"."user_group"."role" in (
'ADMINISTRATOR', 'MODERATOR', 'MEMBER'
)
and "public"."user_group"."fk_user_id" = 17517
)
group by "public"."user_group"."fk_group_id"
The query to count the number of group invites is very similar to the above query. Just the count when and join on change.
Each of the queries to these tables has the same related logic for checking the groups to which the current user is an active member. Is there an efficient way to merge multiple similar queries like this into a single query?
I tried using multiple LEFT JOINs with select count distinct, but that ran into performance issues on groups with both lots of messages, and lots of invites. Is there a way to easily/efficiently do this with, say, a subquery?
The answer from user @Parfait was the most scalable solution I could find. I based my queries on this tutorial: https://www.sqlteam.com/articles/using-derived-tables-to-calculate-aggregate-values.
While this isn't perfect, and results in a bunch of subqueries running, it does get me all the data at once, and with a single trip to the DB.
It ended up like this:
select
"groups"."groupId",
coalesce(
"members"."member_count",
0
) as "numActiveMembers",
coalesce(
"members"."invitee_count",
0
) as "numInviteesInternal",
coalesce(
"discussions"."discussions_count",
0
) as "numDiscussions",
coalesce(
"discussions"."posts_count",
0
) as "numDiscussionPosts"
from (
select "public"."user_group"."fk_group_id" as "groupId"
from "public"."user_group"
where (
"public"."user_group"."fk_group_id" in (
1, 2, 3, 4, 5
)
and "public"."user_group"."role" = 'ADMINISTRATOR'
and "public"."user_group"."fk_user_id" = 123
)
group by "public"."user_group"."fk_group_id"
) as "groups"
left outer join (
select
"public"."user_group"."fk_group_id" as "members_group_id",
count(distinct case
when "public"."user_group"."role" in (
'ADMINISTRATOR', 'MODERATOR', 'MEMBER'
) then "public"."user_group"."pk_user_group_id"
end) as "member_count",
count(distinct case
when "public"."user_group"."role" = 'INVITEE' then "public"."user_group"."pk_user_group_id"
end) as "invitee_count"
from "public"."user_group"
group by "public"."user_group"."fk_group_id"
) as "members"
on "members_group_id" = "groupId"
left outer join (
select
"public"."message"."fk_group_id" as "discussions_group_id",
count(case
when (
"public"."message"."child_message_id" is null
and "public"."message"."pk_message_id" is not null
) then "public"."message"."pk_message_id"
end) as "discussions_count",
count("public"."message"."pk_message_id") as "posts_count"
from "public"."message"
group by "public"."message"."fk_group_id"
) as "discussions"
on "discussions_group_id" = "groupId"
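Why the derived tables help: joining two child tables directly to the same parent multiplies their rows, while aggregating each child table first keeps one row per group on every side of the join. Here is a small sketch of that effect using an in-memory SQLite database from Python. SQLite stands in for Postgres, and the cut-down tables (including the invite table and its columns) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_group (fk_group_id INTEGER);
    CREATE TABLE message (pk_message_id INTEGER, fk_group_id INTEGER);
    CREATE TABLE invite (pk_invite_id INTEGER, fk_group_id INTEGER);
    INSERT INTO user_group VALUES (1);
    INSERT INTO message VALUES (10, 1), (11, 1), (12, 1);  -- 3 messages
    INSERT INTO invite VALUES (20, 1), (21, 1);            -- 2 invites
""")

# Joining both child tables directly yields 3 x 2 = 6 rows for group 1.
naive = conn.execute("""
    SELECT count(m.pk_message_id), count(i.pk_invite_id)
    FROM user_group g
    LEFT JOIN message m ON m.fk_group_id = g.fk_group_id
    LEFT JOIN invite i ON i.fk_group_id = g.fk_group_id
    GROUP BY g.fk_group_id
""").fetchone()

# Aggregating each child table first keeps one row per group on every side.
pre_aggregated = conn.execute("""
    SELECT coalesce(m.n, 0), coalesce(i.n, 0)
    FROM user_group g
    LEFT JOIN (SELECT fk_group_id, count(*) AS n
               FROM message GROUP BY fk_group_id) m
           ON m.fk_group_id = g.fk_group_id
    LEFT JOIN (SELECT fk_group_id, count(*) AS n
               FROM invite GROUP BY fk_group_id) i
           ON i.fk_group_id = g.fk_group_id
""").fetchone()

print(naive)           # (6, 6)
print(pre_aggregated)  # (3, 2)
```

The naive form can be rescued with count(distinct ...), but the intermediate row explosion is what caused the performance issues mentioned in the question.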

Get only best ranked rows from a subquery

I want to get the price of an article for a specific customer.
There are several levels of prices, which I ranked in my query.
For example, article A has prices at ranks 1, 4 and 6, and article B has prices at ranks 3 and 5. The result should always be the price with the lowest rank.
So article A should return its rank 1 price and article B its rank 3 price.
My query is below.
SELECT p2.* FROM(
SElect ART_ID, MIN(RANG) RANG FROM (
Select p.ART_ID, p.betrag ,
CASE p.PREIS_EBENE WHEN 'KA' THEN 1 WHEN 'KW' THEN 2 WHEN 'W' THEN 7 WHEN 'A' THEN 6 ELSE 99 END RANG
FROM MDART a
INNER JOIN MDPRSVK p ON (a.KLIENT_ID = p.KLIENT_ID AND a.ART_ID = p.ART_ID)
WHERE ICP_KZ.IS_SET(KENNUNG_USER, 'P') = 1
ORDER BY RANG)
GROUP BY ART_ID) T
INNER JOIN MDPRSVK p2 ON (p2.ART_ID = T.ART_ID AND p2.PREIS_EBENE = p.PREIS_EBENE)
I want every article to appear only once in the result.
You have tagged your request PL/SQL, so I guess your DBMS may be Oracle.
If I understand correctly, the table MDPRSVK contains several prices per ART_ID. And you want to select each ART_ID's best price (best to worst: 'KA' -> 'KW' -> 'A' -> 'W' -> any other PREIS_EBENE).
You can use a window function (ROW_NUMBER, RANK or DENSE_RANK) for this:
select *
from mdprsvk
order by row_number()
over (partition by art_id
order by decode(preis_ebene, 'KA', 1, 'KW', 2, 'A', 3, 'W', 4, 5))
fetch first row with ties;
This is standard SQL. In Oracle, FETCH FIRST is available as of version 12c. In earlier versions you'd use a subquery instead:
select *
from
(
select
mdprsvk.*,
row_number() over (partition by art_id
order by decode(preis_ebene, 'KA', 1, 'KW', 2, 'A', 3, 'W', 4, 5))
as rn
from mdprsvk
)
where rn = 1;
Or use Oracle's KEEP FIRST:
select art_id, max(betrag)
keep (dense_rank first
order by decode(preis_ebene, 'KA', 1, 'KW', 2, 'A', 3, 'W', 4, 5))
from mdprsvk
group by art_id;
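All three variants implement the same rule: map each PREIS_EBENE to a rank and keep the row with the smallest rank per ART_ID. A plain Python sketch of that rule, with made-up sample rows and a hypothetical function name, may make the intent easier to verify:

```python
# Lower rank = better price level, mirroring the decode(...) in the queries.
RANK = {"KA": 1, "KW": 2, "A": 3, "W": 4}
DEFAULT_RANK = 5


def best_price_per_article(rows):
    """rows: iterable of (art_id, preis_ebene, betrag).

    Returns {art_id: (preis_ebene, betrag)} for the best-ranked level.
    """
    best = {}
    for art_id, level, amount in rows:
        rank = RANK.get(level, DEFAULT_RANK)
        if art_id not in best or rank < best[art_id][0]:
            best[art_id] = (rank, level, amount)
    return {art_id: (level, amount) for art_id, (_, level, amount) in best.items()}


prices = [  # made-up sample rows
    ("A", "W", 12.0), ("A", "KA", 9.5), ("A", "A", 11.0),
    ("B", "A", 20.0), ("B", "XX", 25.0),
]
print(best_price_per_article(prices))  # {'A': ('KA', 9.5), 'B': ('A', 20.0)}
```

If an article has several rows at the same best level, this sketch keeps the first one it sees; the SQL variants differ on that point (ROW_NUMBER picks one arbitrarily, KEEP FIRST with MAX picks the highest amount).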
It is not clear, how MDART comes into play. It looks like you want to restrict your results to articles for certain clients and KENNUNG_USER is the column in MDART to check. If so, add a WHERE clause:
where exists
(
select *
from mdart
where mdart.klient_id = mdprsvk.klient_id
and mdart.art_id = mdprsvk.art_id
and icp_kz.is_set(mdart.kennung_user, 'P') = 1
)
Or with IN instead of EXISTS:
where (klient_id, art_id) in
(
select klient_id, art_id
from mdart
where icp_kz.is_set(kennung_user, 'P') = 1
)

Simple way to calculate median with MySQL

What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.
Example data:
id | val
--------
1 4
2 7
3 2
4 2
5 9
6 8
7 3
Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val) which == 5.
In MariaDB / MySQL:
SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
FROM data d, (SELECT @rownum:=0) r
WHERE d.val is NOT NULL
-- put some where clause here
ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );
Steve Cohen points out that after the first pass, @rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.
Also AVG(dd.val) and dd.row_number IN(...) is used to correctly produce a median when there are an even number of records. Reasoning:
SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3
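The row-number arithmetic can be checked outside the database. This Python sketch applies the same two positions, floor((n+1)/2) and floor((n+2)/2), to a sorted list; the function name is mine.

```python
import statistics


def median_by_row_numbers(values):
    """Average the rows at 1-based positions floor((n+1)/2) and floor((n+2)/2)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = (n + 1) // 2   # FLOOR((total_rows+1)/2)
    upper = (n + 2) // 2   # FLOOR((total_rows+2)/2)
    picked = [ordered[lower - 1], ordered[upper - 1]]
    return sum(picked) / len(picked)


vals = [4, 7, 2, 2, 9, 8, 3]  # example data from the question
print(median_by_row_numbers(vals))          # 4.0
print(median_by_row_numbers([1, 2, 3, 4]))  # 2.5
print(statistics.median(vals))              # 4
```

For an odd count both positions coincide, so the average is just the middle value; for an even count they are the two middle rows.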
Finally, MariaDB 10.3.3+ contains a MEDIAN function
I just found another answer online in the comments:
For medians in almost any SQL:
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2
Make sure your columns are well indexed and the index is used for filtering and sorting. Verify with the explain plans.
select count(*) from table --find the number of rows
Calculate the "median" row number. Maybe use: median_row = floor(count / 2).
Then pick it out of the list:
select val from table order by val asc limit median_row,1
This should return you one row with just the value you want.
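A sketch of this two-step approach, run from Python against an in-memory SQLite database as a stand-in for MySQL (an assumption; SQLite uses LIMIT 1 OFFSET n where MySQL also accepts LIMIT n,1). The table and data are the example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, val INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(1, 4), (2, 7), (3, 2), (4, 2), (5, 9), (6, 8), (7, 3)],
)

# Step 1: count the rows.
(count,) = conn.execute("SELECT count(*) FROM data").fetchone()

# Step 2: skip floor(count / 2) rows (a 0-based offset) and take one.
median_row = count // 2
(median,) = conn.execute(
    "SELECT val FROM data ORDER BY val ASC LIMIT 1 OFFSET ?", (median_row,)
).fetchone()

print(median)  # 4
```

With an even number of rows this picks the upper of the two middle values rather than their average.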
I found the accepted solution didn't work on my MySQL install, returning an empty set, but this query worked for me in all situations that I tested it on:
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
LIMIT 1
Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.
velcrow's answer above is close, but it does not calculate correctly for result sets with an even number of rows. The median is defined as either 1) the middle number of an odd-sized set, or 2) the average of the two middle numbers of an even-sized set.
So, here's velcrow's solution patched to handle both odd- and even-sized sets:
SELECT AVG(middle_values) AS 'median' FROM (
SELECT t1.median_column AS 'middle_values' FROM
(
SELECT @row:=@row+1 as `row`, x.median_column
FROM median_table AS x, (SELECT @row:=0) AS r
WHERE 1
-- put some where clause here
ORDER BY x.median_column
) AS t1,
(
SELECT COUNT(*) as 'count'
FROM median_table x
WHERE 1
-- put same where clause here
) AS t2
-- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;
To use this, follow these 3 easy steps:
Replace "median_table" (2 occurrences) in the above code with the name of your table
Replace "median_column" (3 occurrences) with the column name you'd like to find a median for
If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your where condition
I propose a faster way.
Get the row count:
SELECT CEIL(COUNT(*)/2) FROM data;
Then take the middle value in a sorted subquery:
SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;
I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.
Install and use this mysql statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/
After that, calculate median is easy:
SELECT median(val) FROM data;
A comment on this page in the MySQL documentation has the following suggestion:
-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:
DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;
INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);
-- Calculating the MEDIAN
SELECT @a := 0;
SELECT
id,
AVG(val) AS MEDIAN
FROM (
SELECT
id,
val
FROM (
SELECT
-- Create an index n for every id
@a := (@a + 1) mod o.c AS shifted_n,
IF(@a mod o.c=0, o.c, @a) AS n,
o.id,
o.val,
-- the number of elements for every id
o.c
FROM (
SELECT
t_o.id,
val,
c
FROM
table_median t_o INNER JOIN
(SELECT
id,
COUNT(1) AS c
FROM
table_median
GROUP BY
id
) t2
ON (t2.id = t_o.id)
ORDER BY
t_o.id,val
) o
) a
WHERE
IF(
-- if there is an even number of elements
-- take the lower and the upper median
-- and use AVG(lower,upper)
c MOD 2 = 0,
n = c DIV 2 OR n = (c DIV 2)+1,
-- if its an odd number of elements
-- take the first if its only one element
-- or take the one in the middle
IF(
c = 1,
n = 1,
n = c DIV 2 + 1
)
)
) a
GROUP BY
id;
-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1
-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4
-- from there we can select the n-th element on the position: count div 2 + 1
If your MySQL version has ROW_NUMBER (window functions are available as of MySQL 8.0), the median can be found like this (inspired by this SQL Server query):
WITH Numbered AS
(
SELECT *, COUNT(*) OVER () AS Cnt,
ROW_NUMBER() OVER (ORDER BY val) AS RowNum
FROM yourtable
)
SELECT id, val
FROM Numbered
WHERE RowNum IN ((Cnt+1)/2, (Cnt+2)/2)
;
The IN is used in case you have an even number of entries.
If you want to find the median per group, then just PARTITION BY group in your OVER clauses.
Rob
Most of the solutions above work only for one field of the table, you might need to get the median (50th percentile) for many fields on the query.
I use this:
SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`
FROM table_name;
You can replace the "50" in the example above with any percentile; it is very efficient.
Just make sure you have enough memory for the GROUP_CONCAT, you can change it with:
SET group_concat_max_len = 10485760; #10MB max length
More details: http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/
I have this below code which I found on HackerRank and it is pretty simple and works in each and every case.
SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE
(SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL ) =
(SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL );
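It is worth testing this query on an even number of rows before relying on it. The sketch below runs the query unchanged from Python against an in-memory SQLite database (SQLite is an assumption here, standing in for MySQL; the helper name run is mine).

```python
import sqlite3

QUERY = """
SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL) =
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL)
"""


def run(values):
    """Load `values` into a fresh table and return the rows the query selects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MEDIAN_TABLE (MEDIAN_COL INTEGER)")
    conn.executemany("INSERT INTO MEDIAN_TABLE VALUES (?)", [(v,) for v in values])
    return [row[0] for row in conn.execute(QUERY)]


print(run([4, 7, 2, 2, 9, 8, 3]))  # [4]
print(run([1, 2, 3, 4]))           # []
```

For the seven-row example it returns the median, but for 1, 2, 3, 4 no value has equally many smaller and larger values, so the result is empty.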
You could use the user-defined function that's found here.
Building off of velcrow's answer, for those of you having to do a median off of something that is grouped by another parameter:
SELECT grp_field, t1.val FROM (
SELECT grp_field, @rownum:=IF(@s = grp_field, @rownum + 1, 0) AS row_number,
@s:=IF(@s = grp_field, @s, grp_field) AS sec, d.val
FROM data d, (SELECT @rownum:=0, @s:=0) r
ORDER BY grp_field, d.val
) as t1 JOIN (
SELECT grp_field, count(*) as total_rows
FROM data d
GROUP BY grp_field
) as t2
ON t1.grp_field = t2.grp_field
WHERE t1.row_number=floor(total_rows/2)+1;
Takes care of an even value count: gives the avg of the two values in the middle in that case.
SELECT AVG(val) FROM
( SELECT x.id, x.val from data x, data y
GROUP BY x.id, x.val
HAVING SUM(SIGN(1-SIGN(IF(y.val-x.val=0 AND x.id != y.id, SIGN(x.id-y.id), y.val-x.val)))) IN (ROUND((COUNT(*))/2), ROUND((COUNT(*)+1)/2))
) sq
My code, efficient without tables or additional variables:
SELECT
((SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', floor(1+((count(val)-1) / 2))), ',', -1))
+
(SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', ceiling(1+((count(val)-1) / 2))), ',', -1)))/2
as median
FROM table;
Single query to achieve the perfect median:
SELECT
COUNT(*) as total_rows,
IF(count(*)%2 = 1, CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*)), ',', -1) AS DECIMAL), ROUND((CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*)), ',', -1) AS DECIMAL)) / 2)) as median,
AVG(val) as average
FROM
data
Optionally, you could also do this in a stored procedure:
DROP PROCEDURE IF EXISTS median;
DELIMITER //
CREATE PROCEDURE median (table_name VARCHAR(255), column_name VARCHAR(255), where_clause VARCHAR(255))
BEGIN
-- Set default parameters
IF where_clause IS NULL OR where_clause = '' THEN
SET where_clause = 1;
END IF;
-- Prepare statement
SET @sql = CONCAT(
"SELECT AVG(middle_values) AS 'median' FROM (
SELECT t1.", column_name, " AS 'middle_values' FROM
(
SELECT @row:=@row+1 as `row`, x.", column_name, "
FROM ", table_name," AS x, (SELECT @row:=0) AS r
WHERE ", where_clause, " ORDER BY x.", column_name, "
) AS t1,
(
SELECT COUNT(*) as 'count'
FROM ", table_name, " x
WHERE ", where_clause, "
) AS t2
-- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
WHERE t1.row >= t2.count/2
AND t1.row <= ((t2.count/2)+1)) AS t3
");
-- Execute statement
PREPARE stmt FROM @sql;
EXECUTE stmt;
END//
DELIMITER ;
-- Sample usage:
-- median(table_name, column_name, where_condition);
CALL median('products', 'price', NULL);
My solution presented below works in just one query without creation of table, variable or even sub-query.
Plus, it allows you to get median for each group in group-by queries (this is what i needed !):
SELECT `columnA`,
SUBSTRING_INDEX(SUBSTRING_INDEX(GROUP_CONCAT(`columnB` ORDER BY `columnB`), ',', CEILING((COUNT(`columnB`)/2))), ',', -1) medianOfColumnB
FROM `tableC`
-- some where clause if you want
GROUP BY `columnA`;
It works because of a smart use of group_concat and substring_index.
But, to allow big group_concat, you have to set group_concat_max_len to a higher value (1024 char by default).
You can set it like that (for current sql session) :
SET SESSION group_concat_max_len = 10000;
-- up to 4294967295 in 32-bits platform.
More infos for group_concat_max_len: https://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len
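To see what the nested SUBSTRING_INDEX calls do, here is a Python sketch that imitates them step by step. The substring_index helper is a minimal stand-in written for this illustration, not a complete reimplementation of the MySQL function.

```python
import math


def substring_index(text, delim, count):
    """Minimal stand-in for MySQL SUBSTRING_INDEX (count must be non-zero)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])


def median_via_concat(values):
    ordered = sorted(values)
    concatenated = ",".join(str(v) for v in ordered)  # GROUP_CONCAT(... ORDER BY ...)
    k = math.ceil(len(ordered) / 2)                   # CEILING(COUNT(...)/2)
    first_k = substring_index(concatenated, ",", k)   # keep the first k items
    return substring_index(first_k, ",", -1)          # keep the last of those


print(median_via_concat([4, 7, 2, 2, 9, 8, 3]))  # 4 (as a string)
```

The result is a string, just as in the SQL, and for an even number of values it is the lower of the two middle values rather than their average.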
Another riff on Velcrow's answer, but uses a single intermediate table and takes advantage of the variable used for row numbering to get the count, rather than performing an extra query to calculate it. Also starts the count so that the first row is row 0 to allow simply using Floor and Ceil to select the median row(s).
SELECT Avg(tmp.val) as median_val
FROM (SELECT inTab.val, @rows := @rows + 1 as rowNum
FROM data as inTab, (SELECT @rows := -1) as init
-- Replace with better where clause or delete
WHERE 2 > 1
ORDER BY inTab.val) as tmp
WHERE tmp.rowNum in (Floor(@rows / 2), Ceil(@rows / 2));
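The zero-based Floor/Ceil selection can be checked with a small Python sketch. This only models the arithmetic, not the MySQL execution; `median_zero_based` is a hypothetical name, and `last` stands in for the final value of the row counter after the scan.

```python
import math


def median_zero_based(values):
    """Average the row(s) at floor(last/2) and ceil(last/2), with rows numbered from 0."""
    if not values:
        raise ValueError("median of an empty set is undefined")
    ordered = sorted(values)
    last = len(ordered) - 1                      # counter value after the last row
    picks = {math.floor(last / 2), math.ceil(last / 2)}
    chosen = [ordered[i] for i in picks]         # one row for odd counts, two for even
    return sum(chosen) / len(chosen)


print(median_zero_based([1, 2, 3]))     # both indexes coincide on the middle row
print(median_zero_based([1, 2, 3, 4]))  # the two middle rows are averaged
```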
If you know the exact row count, you can use this query:
SELECT <value> AS VAL FROM <table> ORDER BY VAL LIMIT 1 OFFSET <half>
Where <half> = ceiling(<size> / 2.0) - 1
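As a quick sanity check of the offset formula, the following Python sketch prints the offset for a few sizes. `median_offset` is a name invented here. For even sizes the formula points at the lower of the two middle rows, so the query returns that row rather than an average.

```python
import math


def median_offset(size):
    """Zero-based OFFSET for the median row: ceiling(size / 2.0) - 1."""
    return math.ceil(size / 2.0) - 1


for size in (1, 2, 5, 6):
    print(size, median_offset(size))
```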
SELECT
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(field ORDER BY field),
',',
((
ROUND(
LENGTH(GROUP_CONCAT(field)) -
LENGTH(
REPLACE(
GROUP_CONCAT(field),
',',
''
)
)
) / 2) + 1
)),
',',
-1
)
FROM
table
The above seems to work for me.
I used a two-query approach:
the first one gets count, min, max and avg;
the second one (a prepared statement) uses "LIMIT @count/2, 1" and "ORDER BY .." clauses to get the median value.
These are wrapped in a function definition, so all values can be returned from one call.
If your ranges are static and your data does not change often, it might be more efficient to precompute/store these values and use the stored values instead of querying from scratch every time.
As I just needed a median AND percentile solution, I made a simple and quite flexible function based on the findings in this thread. I know I am happy when I find ready-made functions that are easy to include in my projects, so I decided to share it:
function mysql_percentile($table, $column, $where, $percentile = 0.5) {
$sql = "
SELECT `t1`.`".$column."` as `percentile` FROM (
SELECT @rownum:=@rownum+1 as `row_number`, `d`.`".$column."`
FROM `".$table."` `d`, (SELECT @rownum:=0) `r`
".$where."
ORDER BY `d`.`".$column."`
) as `t1`,
(
SELECT count(*) as `total_rows`
FROM `".$table."` `d`
".$where."
) as `t2`
WHERE 1
AND `t1`.`row_number`=floor(`total_rows` * ".$percentile.")+1;
";
$result = sql($sql, 1);
if (!empty($result)) {
return $result['percentile'];
} else {
return 0;
}
}
Usage is very easy, example from my current project:
...
$table = DBPRE."zip_".$slug;
$column = 'seconds';
$where = "WHERE `reached` = '1' AND `time` >= '".$start_time."'";
$reaching['median'] = mysql_percentile($table, $column, $where, 0.5);
$reaching['percentile25'] = mysql_percentile($table, $column, $where, 0.25);
$reaching['percentile75'] = mysql_percentile($table, $column, $where, 0.75);
...
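The row that the function above selects is floor(total_rows * percentile) + 1, counted from 1. A minimal Python model of that arithmetic (the names `percentile_row` and `percentile_value` are invented for this sketch) shows which value comes back for a given percentile:

```python
import math


def percentile_row(total_rows, percentile):
    """1-based row number chosen by floor(total_rows * percentile) + 1."""
    return math.floor(total_rows * percentile) + 1


def percentile_value(values, percentile=0.5):
    """Value at the selected row, or None when the row falls past the end."""
    ordered = sorted(values)
    row = percentile_row(len(ordered), percentile)
    if row > len(ordered):
        return None          # no matching row; the PHP function returns 0 in this case
    return ordered[row - 1]


print(percentile_value([10, 20, 30, 40, 50], 0.5))
print(percentile_value([10, 20, 30, 40], 0.25))
```

With an even number of rows and percentile 0.5, this picks the upper of the two middle rows instead of averaging them, and a percentile of 1.0 selects no row at all.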
Here is my way. Of course, you could put it into a procedure :-)
-- Zero-based offset of the middle row (the lower middle row for an even count)
SET @median_counter = (SELECT FLOOR((COUNT(*) - 1) / 2) AS `median_counter` FROM `data`);
SET @median = CONCAT('SELECT `val` FROM `data` ORDER BY `val` LIMIT ', @median_counter, ', 1');
PREPARE median FROM @median;
EXECUTE median;
You could avoid the variable @median_counter if you substitute it:
SET @median = CONCAT( 'SELECT `val` FROM `data` ORDER BY `val` LIMIT ',
(SELECT FLOOR((COUNT(*) - 1) / 2) AS `median_counter` FROM `data`),
', 1'
);
PREPARE median FROM @median;
EXECUTE median;
None of the previous answers matched my actual requirement, so I implemented my own, which doesn't need any procedure or complicated statements. I simply GROUP_CONCAT all values of the column whose median I want, then use COUNT divided by 2 to extract the value from the middle of the list, as the following query does:
(POS is the name of the column whose median I want)
SELECT
SUBSTRING_INDEX (
SUBSTRING_INDEX (
GROUP_CONCAT(pos ORDER BY CAST(pos AS SIGNED INTEGER) desc SEPARATOR ';')
, ';', COUNT(*)/2 )
, ';', -1 ) AS `pos_med`
FROM table_name
GROUP BY any_criterial
I hope this is useful to someone, in the way many other comments on this website were for me.
Based on @bob's answer, this generalizes the query to have the ability to return multiple medians, grouped by some criteria.
Think, e.g., median sale price for used cars in a car lot, grouped by year-month.
SELECT
period,
AVG(middle_values) AS 'median'
FROM (
SELECT t1.sale_price AS 'middle_values', t1.row_num, t1.period, t2.count
FROM (
SELECT
@last_period:=@period AS 'last_period',
@period:=DATE_FORMAT(sale_date, '%Y-%m') AS 'period',
IF (@period<>@last_period, @row:=1, @row:=@row+1) as `row_num`,
x.sale_price
FROM listings AS x, (SELECT @row:=0) AS r
WHERE 1
-- where criteria goes here
ORDER BY DATE_FORMAT(sale_date, '%Y%m'), x.sale_price
) AS t1
LEFT JOIN (
SELECT COUNT(*) as 'count', DATE_FORMAT(sale_date, '%Y-%m') AS 'period'
FROM listings x
WHERE 1
-- same where criteria goes here
GROUP BY DATE_FORMAT(sale_date, '%Y%m')
) AS t2
ON t1.period = t2.period
) AS t3
WHERE
row_num >= (count/2)
AND row_num <= ((count/2) + 1)
GROUP BY t3.period
ORDER BY t3.period;
create table med(id integer);
insert into med(id) values(1);
insert into med(id) values(2);
insert into med(id) values(3);
insert into med(id) values(4);
insert into med(id) values(5);
insert into med(id) values(6);
select (MIN(count)+MAX(count))/2
from (select case
when (select count(*) from med A where A.id<B.id)=(select count(*)/2 from med)
OR (select count(*) from med A where A.id>B.id)=(select count(*)/2 from med)
then cast(B.id as float)
end as count
from med B) C;
?column?
----------
3.5
(1 row)
OR
select cast(avg(id) as float) from
(select t1.id from med t1 JOIN med t2 on t1.id!= t2.id
group by t1.id having ABS(SUM(SIGN(t1.id-t2.id)))=1) A;
Often, we need to calculate the median not for the whole table but per ID, that is, for each ID in a table where each ID has many records. The query below performs well, handles both even and odd counts, and works in many SQL dialects. It relies on integer division in (cnt + 1)/2 and (cnt + 2)/2, as in SQL Server; in MySQL, where / returns a decimal, wrap both expressions in FLOOR(). More about the performance of different median methods: https://sqlperformance.com/2012/08/t-sql-queries/median
SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val,
COUNT(*) OVER (PARTITION BY our_id) AS cnt,
ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rn
FROM our_table
) AS x
WHERE rn IN ((cnt + 1)/2, (cnt + 2)/2)
GROUP BY our_id;
Hope it helps
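To see why picking rows (cnt + 1)/2 and (cnt + 2)/2 works for both odd and even group sizes, here is a small Python model of the per-group logic. It is a sketch only: `median_per_group` is a hypothetical name, rows are assumed to be (our_id, our_val) pairs, and `//` plays the role of integer division.

```python
from collections import defaultdict


def median_per_group(rows):
    """Per-id median using 1-based row numbers (cnt + 1) // 2 and (cnt + 2) // 2."""
    groups = defaultdict(list)
    for our_id, our_val in rows:
        groups[our_id].append(our_val)

    result = {}
    for our_id, vals in groups.items():
        vals.sort()                                    # ORDER BY our_val within the partition
        cnt = len(vals)                                # COUNT(*) OVER (PARTITION BY our_id)
        wanted = {(cnt + 1) // 2, (cnt + 2) // 2}      # same row twice when cnt is odd
        picked = [v for rn, v in enumerate(vals, start=1) if rn in wanted]
        result[our_id] = sum(picked) / len(picked)     # AVG(1.0 * our_val)
    return result


print(median_per_group([("a", 1), ("a", 3), ("a", 2), ("b", 10), ("b", 20)]))
```

For an odd count both expressions give the same row number, so a single value is averaged; for an even count they give the two middle rows.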
MySQL has supported window functions since version 8.0, so you can use ROW_NUMBER. Do not use RANK or DENSE_RANK: both assign the same rank to equal values (as in sports rankings), so row positions would no longer be unique:
SELECT AVG(t1.val) AS median_val
FROM (SELECT val,
ROW_NUMBER() OVER(ORDER BY val) AS row_num
FROM data) t1,
(SELECT COUNT(*) AS num_records FROM data) t2
WHERE t1.row_num IN
(FLOOR((t2.num_records + 1) / 2),
FLOOR((t2.num_records + 2) / 2));
A simple way to calculate Median in MySQL
set @ct := (select count(1) from station);
set @row := 0;
select avg(a.val) as median from
(select * from station order by val) a
where (select @row := @row + 1)
between @ct/2.0 and @ct/2.0 +1;
A simple and fast way to calculate the median in MySQL (for an even number of rows it returns the lower middle value):
select x.col
from (select col,
count(1) over (partition by 'A') as total_rows,
row_number() over (order by col asc) as rank_Order
from station ft) x
where x.rank_Order = round(x.total_rows / 2.0, 0)