SQL BigQuery Aggregation depending on conditions - sql

I am not sure how to solve this problem through an SQL query.
Imagine you have a table like this one:
user_id
timestamp
quantity1
quantity2
A
2021/01/10
10
0
B
2021/01/17
10
0
A
2021/01/19
1
12
B
2021/01/25
10
8
A
2021/01/27
2
8
Now I want to aggregate by performing an ordered difference between quantity 1 and quantity 2.
So by a simple group by and sum I would have this result:
user_id
result
A
-7
B
12
However what I want is that whenever the next intermediary sum is smaller than 0, then that sum is set to 0 before aggregating the next quantity.
In this case I would like to keep B = 12, but A instead should be:
First aggregation: (10-0)
Second aggregation: 10 + (1-12) = -1
Now since this is a negative result, it should be set to 0.
Third aggregation: 0 + (2-8) = -6
And if there was another entry, a negative previous aggregation should always start from 0.
I hope it was clear.
Anybody knows how to do this in SQL?

Consider below approach
declare result, delta, count int64;
declare id, prev_id string;
create temp table results as (select * from (select '' as user_id, 0 as result) where false);
create temp table your_table as (
select 'A' user_id, '2021/01/10' timestamp, 10 quantity1, 0 quantity2 union all
select 'B', '2021/01/17', 10, 0 union all
select 'A', '2021/01/19', 1, 12 union all
select 'B', '2021/01/25', 10, 8 union all
select 'A', '2021/01/27', 2, 8
);
set (count, result, delta, prev_id) = (0, 0, 0, '');
for record in (select * from your_table order by user_id, timestamp) do
if record.user_id != prev_id and prev_id != '' then
insert into results values(prev_id , result + delta);
set result = 0;
end if;
set prev_id = record.user_id;
set result = result + record.quantity1 - record.quantity2;
if result < 0 then set (result, delta) = (0, result); else set delta = 0; end if;
end for;
insert into results values(prev_id , result + delta);
select * from results;
with output (in temp table results)
Obviously, instead of creating temp table your_table - you should just use you real table
Also, note use of just introduced (earlier today) FOR...IN Loop

Related

SQL - Extracting an ID range for a packet of records

I have a table where I have about 40000000 records. Min(id) = 2 and max(80000000).
I would like create a automated script which will be running in a loop.
But I don't want to create about 80 iteration because a part of then will be empty.
Who knows how I can find range min(id) and max(id) for first iteration, and next?
I used mod but it doesn't work correctly:
SELECT MIN(ID), MAX(ID)
FROM (
SELECT mod(id,45), id FROM table
WHERE mod(id,45) = 0
GROUP BY mod(id,45), id
ORDER BY id desc
)
Because I want to:
first iteration has range for 1mln records: min(id) = 2 max(id) = 1 500 000
second iteration has range for 1 mln records: min(id)=1 550 000, max(id) = 5 000 000
and so on
It should be easy for whatever DBMS supporting ordered numbering of rows.
Db2.
Every SELECT returns 2 rows except the last one, which may return less rows.
SELECT 'SELECT * FROM MYTAB WHERE I BETWEEN ' || MIN (I) || ' AND ' || MAX (I) AS STMT
FROM
(
SELECT I, (ROW_NUMBER () OVER (ORDER BY I) - 1) / 2 AS RN_
FROM (VALUES 1, 9, 2, 7, 4) MYTAB (I)
) G
GROUP BY RN_
The result is:
STMT
SELECT * FROM MYTAB WHERE I BETWEEN 1 AND 2
SELECT * FROM MYTAB WHERE I BETWEEN 4 AND 7
SELECT * FROM MYTAB WHERE I BETWEEN 9 AND 9

Computing number of filled columns in SQL table

I have a SQL table with 6 columns. 1 ID int column and 5 Datetime columns Round1, Round2, ..., Round5
The data looks something like this. Either there is a date or the cell is empty.
I would like the query to show the number of filled datetime columns. That is
Can you please give some hint on how to build this query? Would this involve aggregate function?
Thank you
Consider:
SELECT ID, IIf(Round1 Is Null, 0, 1) + IIf(Round2 Is Null, 0, 1) +
IIf(Round 3 Is Null, 0, 1) + IIf(Round4 Is Null, 0, 1) + IIf(Round5 Is Null, 0, 1) AS Cnt
FROM Table;
Aggregate function is not helpful unless you first normalize the data with UNION query.
SELECT ID, Round1 AS Dte, "R1" AS Src FROM table
UNION SELECT ID, Round2, "R2" FROM table
UNION SELECT ID, Round3, "R3" FROM table
UNION SELECT ID, Round4, "R4" FROM table
UNION SELECT ID, Round5, "R5" FROM table;
Then use that query in aggregate SQL.
SELECT ID, Count(Dte) AS CntD FROM Q1 GROUP BY ID;
You can use CASE expressions to return 1 when a value is not NULL or 0 otherwise. Then just add all that.
SELECT id,
CASE
WHEN round1 IS NOT NULL THEN
1
ELSE
0
END
+
...
CASE
WHEN round5 IS NOT NULL THEN
1
ELSE
0
END total
FROM elbat;
And next time do not post images of tables. Post their CREATE statements with sample data as INSERT statements. And tag the specific DBMS you're using.

How to compare range of integer values in PL/SQL?

I am trying to compare a range of integer values between a test table and a reference table. If any range of values from the test table overlaps with the available ranges in the reference table, it should be deleted.
Sorry if it's not clear but here is an example data:
TEST_TABLE:
MIN MAX
10 121
122 648
1200 1599
REFERENCE_TABLE:
MIN MAX
50 106
200 1400
1450 1500
MODIFIED TEST_TABLE: (expected result after running PL/SQL)
MIN MAX
10 49
107 121
122 199
1401 1449
1501 1599
In the first row from the example above, the 10-121 has been cut down into two rows: 10-49 and 107-121 because the values 50, 51, ..., 106 are included in the first row of the reference_table (50-106); and so on..
Here's what I've written so far with nested loops. I've created two additional temp tables that would store all the values that would be found in the reference table. Then it would create new sets of ranges to be inserted to test_table.
But this does not seem to work correctly and might cause performance issues especially if we're dealing with values of millions and above:
CREATE TABLE new_table (num_value NUMBER);
CREATE TABLE new_table_next (num_value NUMBER, next_value NUMBER);
-- PL/SQL start
DECLARE
l_count NUMBER;
l_now_min NUMBER;
l_now_max NUMBER;
l_final_min NUMBER;
l_final_max NUMBER;
BEGIN
FOR now IN (SELECT min_num, max_num FROM test_table) LOOP
l_now_min:=now.min_num;
l_now_max:=now.max_num;
WHILE (l_now_min < l_now_max) LOOP
SELECT COUNT(*) -- to check if number is found in reference table
INTO l_count
FROM reference_table refr
WHERE l_now_min >= refr.min_num
AND l_now_min <= refr.max_num;
IF l_count > 0 THEN
INSERT INTO new_table (num_value) VALUES (l_now_min);
COMMIT;
END IF;
l_now_min:=l_now_min+1;
END LOOP;
INSERT INTO new_table_next (num_value, next_value)
VALUES (SELECT num_value, (SELECT MIN (num_value) FROM new_table t2 WHERE t2.num_value > t.num_value) AS next_value FROM new_table t);
DELETE FROM test_table t
WHERE now.min_num = t.min_num
AND now.max_num = t.max_num;
COMMIT;
SELECT (num_value + 1) INTO l_final_min FROM new_table_next;
SELECT (next_value - num_value - 2) INTO l_final_max FROM new_table_next;
INSERT INTO test_table (min_num, max_num)
VALUES (l_final_min, l_final_max);
COMMIT;
DELETE FROM new_table;
DELETE FROM new_table_next;
COMMIT;
END LOOP;
END;
/
Please help, I'm stuck. :)
The idea behind this approach is to unwind both tables, keeping track of whether the numbers are in the reference table or the original table. This is really cumbersome, because adjacent values can cause problems.
The idea is then to do a "gaps-and-islands" type solution along both dimensions -- and then only keep the values that are in the original table and not in the second. Perhaps this could be called "exclusionary gaps-and-islands".
Here is a working version:
with vals as (
select min as x, 1 as inc, 0 as is_ref
from test_table
union all
select max + 1, -1 as inc, 0 as is_ref
from test_table
union all
select min as x, 0, 1 as is_ref
from reference_table
union all
select max + 1 as x, 0, -1 as is_ref
from reference_table
)
select min, max
from (select refgrp, incgrp, ref, inc2, min(x) as min, (lead(min(x), 1, max(x) + 1) over (order by min(x)) - 1) as max
from (select v.*,
row_number() over (order by x) - row_number() over (partition by ref order by x) as refgrp,
row_number() over (order by x) - row_number() over (partition by inc2 order by x) as incgrp
from (select v.*, sum(is_ref) over (order by x, inc) as ref,
sum(inc) over (order by x, inc) as inc2
from vals v
) v
) v
group by refgrp, incgrp, ref, inc2
) v
where ref = 0 and inc2 = 1 and min < max
order by min;
And here is a db<>fiddle.
The inverse problem of getting the overlaps is much easier. It might be feasible to "invert" the reference table to handle this.
select greatest(tt.min, rt.min), least(tt.max, rt.max)
from test_table tt join
reference_table rt
on tt.min < rt.max and tt.max > rt.min -- is there an overlap?
This is modified from a similar task (using dates instead of numbers) I did on Teradata, it's based on the same base data as Gordon's (all begin/end values combined in a single list), but uses a simpler logic:
WITH minmax AS
( -- create a list of all existing start/end values (possible to simplify using Unpivot or Cross Apply)
SELECT Min AS val, -1 AS prio, 1 AS flag -- main table, range start
FROM test_table
UNION ALL
SELECT Max+1, -1, -1 -- main table, range end
FROM test_table
UNION ALL
SELECT Min, 1, 1 -- reference table, adjusted range start
FROM reference_table
UNION ALL
SELECT Max+1, 1, -1 -- reference table, adjusted range end
FROM reference_table
)
, all_ranges AS
( -- create all ranges from current to next row
SELECT minmax.*,
Lead(val) Over (ORDER BY val, prio desc, flag) AS next_val, -- next value = end of range
Sum(flag) Over (ORDER BY val, prio desc, flag ROWS Unbounded Preceding) AS Cnt -- how many overlapping periods exist
FROM minmax
)
SELECT val, next_val-1
FROM all_ranges
WHERE Cnt = 1 -- 1st level only
AND prio + flag = 0 -- either (prio -1 and flag 1) = range start in base table
-- or (prio 1 and flag -1) = range end in ref table
ORDER BY 1
See db-fiddle
Here's one way to do this. I put the test data in a WITH clause rather than creating the tables (I find this is easier for testing purposes). I used your column names (MIN and MAX); these are very poor choices though, as MIN and MAX are Oracle keywords. They will generate confusion for sure, and they may cause queries to error out.
The strategy is simple - first take the COMPLEMENT of the ranges in REFERENCE_TABLE, which will also be a union of intervals (using NULL as marker for minus infinity and plus infinity); then take the intersection of each interval in TEST_TABLE with each interval in the complement of REFERENCE_TABLE. How that is done is shown in the final (outer) query in the solution below.
with
test_table (min, max) as (
select 10, 121 from dual union all
select 122, 648 from dual union all
select 1200, 1599 from dual
)
, reference_table (min, max) as (
select 50, 106 from dual union all
select 200, 1400 from dual union all
select 1450, 1500 from dual
)
,
prep (min, max) as (
select lag(max) over (order by max) + 1 as min
, min - 1 as max
from ( select min, max from reference_table
union all
select null, null from dual
)
)
select greatest(t.min, nvl(p.min, t.min)) as min
, least (t.max, nvl(p.max, t.max)) as max
from test_table t inner join prep p
on t.min <= nvl(p.max, t.max)
and t.max >= nvl(p.min, t.min)
order by min
;
MIN MAX
---------- ----------
10 49
107 121
122 199
1401 1449
1501 1599
Example to resolve the problem:
CREATE TABLE xrange_reception
(
vdeb NUMBER,
vfin NUMBER
);
CREATE TABLE xrange_transfert
(
vdeb NUMBER,
vfin NUMBER
);
CREATE TABLE xrange_resultat
(
vdeb NUMBER,
vfin NUMBER
);
insert into xrange_reception values (10,50);
insert into xrange_transfert values (15,25);
insert into xrange_transfert values (30,33);
insert into xrange_transfert values (40,45);
DECLARE
CURSOR cr_rec IS SELECT * FROM xrange_reception;
CURSOR cr_tra IS
SELECT *
FROM xrange_transfert
ORDER BY vdeb;
i NUMBER;
vdebSui NUMBER;
BEGIN
FOR rc IN cr_rec
LOOP
i := 1;
vdebSui := NULL;
FOR tr IN cr_tra
LOOP
IF tr.vdeb BETWEEN rc.vdeb AND rc.vfin
THEN
IF i = 1 AND tr.vdeb > rc.vdeb
THEN
INSERT INTO xrange_resultat (vdeb, vfin)
VALUES (rc.vdeb, tr.vdeb - 1);
ELSIF i = cr_rec%ROWCOUNT AND tr.vfin < rc.vfin
THEN
INSERT INTO xrange_resultat (vdeb, vfin)
VALUES (tr.vfin, rc.vfin);
ELSIF vdebSui < tr.vdeb
THEN
INSERT INTO xrange_resultat (vdeb, vfin)
VALUES (vdebSui + 1, tr.vdeb - 1);
END IF;
vdebSui := tr.vfin;
i := i + 1;
END IF;
END LOOP;
IF vdebSui IS NOT NULL THEN
IF vdebSui < rc.vfin
THEN
INSERT INTO xrange_resultat (vdeb, vfin)
VALUES (vdebSui + 1, rc.vfin);
END IF;
ELSE
INSERT INTO xrange_resultat (vdeb, vfin)
VALUES (rc.vdeb, rc.vfin);
END IF;
END LOOP;
END;
So:
Table xrange_reception:
vdeb vfin
10 50
Table xrange_transfert:
vdeb vfin
15 25
30 33
40 45
Table xrange_resultat:
vdeb vfin
10 14
26 29
34 39
46 50

Fill in missing dates in date range from a table

table A
no date count
1 20160401 1
1 20160403 4
2 20160407 3
result
no date count
1 20160401 1
1 20160402 0
1 20160403 4
1 20160404 0
.
.
.
2 20160405 0
2 20160406 0
2 20160407 3
.
.
.
I'm using Oracle and I want to write a query that returns rows for every date within a range based on table A.
Is there some function in Oracle that can help me?
you can use the SEQUENCES.
First create a sequence
Create Sequence seq_name start with 20160401 max n;
where n is the max value till u want to display.
Then use the sql
select seq_name.next,case when seq_name.next = date then count else 0 end from tableA;
Note:- Its better not to use date,count as the column names.
Try this:
with
A as (
select 1 no, to_date('20160401', 'yyyymmdd') dat, 1 cnt from dual union all
select 1 no, to_date('20160403', 'yyyymmdd') dat, 4 cnt from dual union all
select 2 no, to_date('20160407', 'yyyymmdd') dat, 3 cnt from dual),
B as (select min(dat) mindat, max(dat) maxdat from A t),
C as (select level + mindat - 1 dat from B connect by level + mindat - 1 <= maxdat),
D as (select distinct no from A),
E as (select * from D,C)
select E.no, E.dat, nvl(cnt, 0) cnt
from E
full outer join A on A.no = E.no and A.dat = E.dat
order by 1, 2, 3
This isn't an oracle specific answer, you'll need to translate it to oracle yourself.
Create an intervals table, containing all integers from 0 to 999. Something like this:
CREATE TABLE intervals (days int);
INSERT INTO intervals (days) VALUES (0), (1);
DECLARE #rc int;
SELECT #rc = 2;
WHILE (SELECT Count(*) FROM intervals) < 1000 BEGIN
INSERT INTO intervals (days) SELECT days + #rc FROM intervals WHERE days + #rc < 1000;
SELECT #rc = #rc * 2
END;
Then all the dates in the range can be identified by adding intervals.days to the first date you've got, where the first date + intervals.days is <= the end date, and the resultant date is new. Do this by cross joining intervals to your own table. Something like (it would be in SQL, but again you'll need to translate):
SELECT DateAdd(a.date, d, i.days)
FROM (select min(date) from table_A) a, intervals I
WHERE DateAdd(a.date, d, i.days) < (select max(date) from table_A)
AND NOT EXISTS (select 1 from table_A aa where aa.date = DateAdd(a.date, d, i.days))
Hope this gives you a starting point

Simple way to calculate median with MySQL

What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.
Example data:
id | val
--------
1 4
2 7
3 2
4 2
5 9
6 8
7 3
Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val) which == 5.
In MariaDB / MySQL:
SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, #rownum:=#rownum+1 as `row_number`, #total_rows:=#rownum
FROM data d, (SELECT #rownum:=0) r
WHERE d.val is NOT NULL
-- put some where clause here
ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((#total_rows+1)/2), FLOOR((#total_rows+2)/2) );
Steve Cohen points out, that after the first pass, #rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.
Also AVG(dd.val) and dd.row_number IN(...) is used to correctly produce a median when there are an even number of records. Reasoning:
SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3
Finally, MariaDB 10.3.3+ contains a MEDIAN function
I just found another answer online in the comments:
For medians in almost any SQL:
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2
Make sure your columns are well indexed and the index is used for filtering and sorting. Verify with the explain plans.
select count(*) from table --find the number of rows
Calculate the "median" row number. Maybe use: median_row = floor(count / 2).
Then pick it out of the list:
select val from table order by val asc limit median_row,1
This should return you one row with just the value you want.
I found the accepted solution didn't work on my MySQL install, returning an empty set, but this query worked for me in all situations that I tested it on:
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
LIMIT 1
Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.
Velcro's answer from above is close, but it does not calculate correctly for result sets with an even number of rows. Medians are defined as either 1) the middle number on odd numbered sets, or 2) the average of the two middle numbers on even number sets.
So, here's velcro's solution patched to handle both odd and even number sets:
SELECT AVG(middle_values) AS 'median' FROM (
SELECT t1.median_column AS 'middle_values' FROM
(
SELECT #row:=#row+1 as `row`, x.median_column
FROM median_table AS x, (SELECT #row:=0) AS r
WHERE 1
-- put some where clause here
ORDER BY x.median_column
) AS t1,
(
SELECT COUNT(*) as 'count'
FROM median_table x
WHERE 1
-- put same where clause here
) AS t2
-- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;
To use this, follow these 3 easy steps:
Replace "median_table" (2 occurrences) in the above code with the name of your table
Replace "median_column" (3 occurrences) with the column name you'd like to find a median for
If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your where condition
I propose a faster way.
Get the row count:
SELECT CEIL(COUNT(*)/2) FROM data;
Then take the middle value in a sorted subquery:
SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit #middlevalue) x;
I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.
Install and use this mysql statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/
After that, calculate median is easy:
SELECT median(val) FROM data;
A comment on this page in the MySQL documentation has the following suggestion:
-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:
DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;
INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);
-- Calculating the MEDIAN
SELECT #a := 0;
SELECT
id,
AVG(val) AS MEDIAN
FROM (
SELECT
id,
val
FROM (
SELECT
-- Create an index n for every id
#a := (#a + 1) mod o.c AS shifted_n,
IF(#a mod o.c=0, o.c, #a) AS n,
o.id,
o.val,
-- the number of elements for every id
o.c
FROM (
SELECT
t_o.id,
val,
c
FROM
table_median t_o INNER JOIN
(SELECT
id,
COUNT(1) AS c
FROM
table_median
GROUP BY
id
) t2
ON (t2.id = t_o.id)
ORDER BY
t_o.id,val
) o
) a
WHERE
IF(
-- if there is an even number of elements
-- take the lower and the upper median
-- and use AVG(lower,upper)
c MOD 2 = 0,
n = c DIV 2 OR n = (c DIV 2)+1,
-- if its an odd number of elements
-- take the first if its only one element
-- or take the one in the middle
IF(
c = 1,
n = 1,
n = c DIV 2 + 1
)
)
) a
GROUP BY
id;
-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1
-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4
-- from there we can select the n-th element on the position: count div 2 + 1
If MySQL has ROW_NUMBER, then the MEDIAN is (be inspired by this SQL Server query):
WITH Numbered AS
(
SELECT *, COUNT(*) OVER () AS Cnt,
ROW_NUMBER() OVER (ORDER BY val) AS RowNum
FROM yourtable
)
SELECT id, val
FROM Numbered
WHERE RowNum IN ((Cnt+1)/2, (Cnt+2)/2)
;
The IN is used in case you have an even number of entries.
If you want to find the median per group, then just PARTITION BY group in your OVER clauses.
Rob
Most of the solutions above work only for one field of the table, you might need to get the median (50th percentile) for many fields on the query.
I use this:
SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`
FROM table_name;
You can replace the "50" in example above to any percentile, is very efficient.
Just make sure you have enough memory for the GROUP_CONCAT, you can change it with:
SET group_concat_max_len = 10485760; #10MB max length
More details: http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/
I have this below code which I found on HackerRank and it is pretty simple and works in each and every case.
SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE
(SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL ) =
(SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL );
You could use the user-defined function that's found here.
Building off of velcro's answer, for those of you having to do a median off of something that is grouped by another parameter:
SELECT grp_field, t1.val FROM (
SELECT grp_field, #rownum:=IF(#s = grp_field, #rownum + 1, 0) AS row_number,
#s:=IF(#s = grp_field, #s, grp_field) AS sec, d.val
FROM data d, (SELECT #rownum:=0, #s:=0) r
ORDER BY grp_field, d.val
) as t1 JOIN (
SELECT grp_field, count(*) as total_rows
FROM data d
GROUP BY grp_field
) as t2
ON t1.grp_field = t2.grp_field
WHERE t1.row_number=floor(total_rows/2)+1;
Takes care about an odd value count - gives the avg of the two values in the middle in that case.
SELECT AVG(val) FROM
( SELECT x.id, x.val from data x, data y
GROUP BY x.id, x.val
HAVING SUM(SIGN(1-SIGN(IF(y.val-x.val=0 AND x.id != y.id, SIGN(x.id-y.id), y.val-x.val)))) IN (ROUND((COUNT(*))/2), ROUND((COUNT(*)+1)/2))
) sq
My code, efficient without tables or additional variables:
SELECT
((SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', floor(1+((count(val)-1) / 2))), ',', -1))
+
(SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', ceiling(1+((count(val)-1) / 2))), ',', -1)))/2
as median
FROM table;
Single query to archive the perfect median:
SELECT
COUNT(*) as total_rows,
IF(count(*)%2 = 1, CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*)), ',', -1) AS DECIMAL), ROUND((CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val SEPARATOR ','), ',', 50/100 * COUNT(*)), ',', -1) AS DECIMAL)) / 2)) as median,
AVG(val) as average
FROM
data
Optionally, you could also do this in a stored procedure:
DROP PROCEDURE IF EXISTS median;
DELIMITER //
CREATE PROCEDURE median (table_name VARCHAR(255), column_name VARCHAR(255), where_clause VARCHAR(255))
BEGIN
-- Set default parameters
IF where_clause IS NULL OR where_clause = '' THEN
SET where_clause = 1;
END IF;
-- Prepare statement
SET #sql = CONCAT(
"SELECT AVG(middle_values) AS 'median' FROM (
SELECT t1.", column_name, " AS 'middle_values' FROM
(
SELECT #row:=#row+1 as `row`, x.", column_name, "
FROM ", table_name," AS x, (SELECT #row:=0) AS r
WHERE ", where_clause, " ORDER BY x.", column_name, "
) AS t1,
(
SELECT COUNT(*) as 'count'
FROM ", table_name, " x
WHERE ", where_clause, "
) AS t2
-- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
WHERE t1.row >= t2.count/2
AND t1.row <= ((t2.count/2)+1)) AS t3
");
-- Execute statement
PREPARE stmt FROM #sql;
EXECUTE stmt;
END//
DELIMITER ;
-- Sample usage:
-- median(table_name, column_name, where_condition);
CALL median('products', 'price', NULL);
My solution presented below works in just one query without creation of table, variable or even sub-query.
Plus, it allows you to get median for each group in group-by queries (this is what i needed !):
SELECT `columnA`,
SUBSTRING_INDEX(SUBSTRING_INDEX(GROUP_CONCAT(`columnB` ORDER BY `columnB`), ',', CEILING((COUNT(`columnB`)/2))), ',', -1) medianOfColumnB
FROM `tableC`
-- some where clause if you want
GROUP BY `columnA`;
It works because of a smart use of group_concat and substring_index.
But, to allow big group_concat, you have to set group_concat_max_len to a higher value (1024 char by default).
You can set it like that (for current sql session) :
SET SESSION group_concat_max_len = 10000;
-- up to 4294967295 in 32-bits platform.
More infos for group_concat_max_len: https://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len
Another riff on Velcrow's answer, but uses a single intermediate table and takes advantage of the variable used for row numbering to get the count, rather than performing an extra query to calculate it. Also starts the count so that the first row is row 0 to allow simply using Floor and Ceil to select the median row(s).
SELECT Avg(tmp.val) as median_val
FROM (SELECT inTab.val, #rows := #rows + 1 as rowNum
FROM data as inTab, (SELECT #rows := -1) as init
-- Replace with better where clause or delete
WHERE 2 > 1
ORDER BY inTab.val) as tmp
WHERE tmp.rowNum in (Floor(#rows / 2), Ceil(#rows / 2));
Knowing exact row count you can use this query:
SELECT <value> AS VAL FROM <table> ORDER BY VAL LIMIT 1 OFFSET <half>
Where <half> = ceiling(<size> / 2.0) - 1
SELECT
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(field ORDER BY field),
',',
((
ROUND(
LENGTH(GROUP_CONCAT(field)) -
LENGTH(
REPLACE(
GROUP_CONCAT(field),
',',
''
)
)
) / 2) + 1
)),
',',
-1
)
FROM
table
The above seems to work for me.
I used a two query approach:
first one to get count, min, max and avg
second one (prepared statement) with a "LIMIT #count/2, 1" and "ORDER BY .." clauses to get the median value
These are wrapped in a function defn, so all values can be returned from one call.
If your ranges are static and your data does not change often, it might be more efficient to precompute/store these values and use the stored values instead of querying from scratch every time.
as i just needed a median AND percentile solution, I made a simple and quite flexible function based on the findings in this thread. I know that I am happy myself if I find "readymade" functions that are easy to include in my projects, so I decided to quickly share:
function mysql_percentile($table, $column, $where, $percentile = 0.5) {
$sql = "
SELECT `t1`.`".$column."` as `percentile` FROM (
SELECT #rownum:=#rownum+1 as `row_number`, `d`.`".$column."`
FROM `".$table."` `d`, (SELECT #rownum:=0) `r`
".$where."
ORDER BY `d`.`".$column."`
) as `t1`,
(
SELECT count(*) as `total_rows`
FROM `".$table."` `d`
".$where."
) as `t2`
WHERE 1
AND `t1`.`row_number`=floor(`total_rows` * ".$percentile.")+1;
";
$result = sql($sql, 1);
if (!empty($result)) {
return $result['percentile'];
} else {
return 0;
}
}
Usage is very easy, example from my current project:
...
$table = DBPRE."zip_".$slug;
$column = 'seconds';
$where = "WHERE `reached` = '1' AND `time` >= '".$start_time."'";
$reaching['median'] = mysql_percentile($table, $column, $where, 0.5);
$reaching['percentile25'] = mysql_percentile($table, $column, $where, 0.25);
$reaching['percentile75'] = mysql_percentile($table, $column, $where, 0.75);
...
Here is my way . Of course, you could put it into a procedure :-)
SET #median_counter = (SELECT FLOOR(COUNT(*)/2) - 1 AS `median_counter` FROM `data`);
SET #median = CONCAT('SELECT `val` FROM `data` ORDER BY `val` LIMIT ', #median_counter, ', 1');
PREPARE median FROM #median;
EXECUTE median;
You could avoid the variable #median_counter, if you substitude it:
SET #median = CONCAT( 'SELECT `val` FROM `data` ORDER BY `val` LIMIT ',
(SELECT FLOOR(COUNT(*)/2) - 1 AS `median_counter` FROM `data`),
', 1'
);
PREPARE median FROM #median;
EXECUTE median;
After reading all previous ones they didn't match with my actual requirement so I implemented my own one which doesn't need any procedure or complicate statements, just I GROUP_CONCAT all values from the column I wanted to obtain the MEDIAN and applying a COUNT DIV BY 2 I extract the value in from the middle of the list like the following query does :
(POS is the name of the column I want to get its median)
(query) SELECT
SUBSTRING_INDEX (
SUBSTRING_INDEX (
GROUP_CONCAT(pos ORDER BY CAST(pos AS SIGNED INTEGER) desc SEPARATOR ';')
, ';', COUNT(*)/2 )
, ';', -1 ) AS `pos_med`
FROM table_name
GROUP BY any_criterial
I hope this could be useful for someone in the way many of other comments were for me from this website.
Based on #bob's answer, this generalizes the query to have the ability to return multiple medians, grouped by some criteria.
Think, e.g., median sale price for used cars in a car lot, grouped by year-month.
SELECT
period,
AVG(middle_values) AS 'median'
FROM (
SELECT t1.sale_price AS 'middle_values', t1.row_num, t1.period, t2.count
FROM (
SELECT
#last_period:=#period AS 'last_period',
#period:=DATE_FORMAT(sale_date, '%Y-%m') AS 'period',
IF (#period<>#last_period, #row:=1, #row:=#row+1) as `row_num`,
x.sale_price
FROM listings AS x, (SELECT #row:=0) AS r
WHERE 1
-- where criteria goes here
ORDER BY DATE_FORMAT(sale_date, '%Y%m'), x.sale_price
) AS t1
LEFT JOIN (
SELECT COUNT(*) as 'count', DATE_FORMAT(sale_date, '%Y-%m') AS 'period'
FROM listings x
WHERE 1
-- same where criteria goes here
GROUP BY DATE_FORMAT(sale_date, '%Y%m')
) AS t2
ON t1.period = t2.period
) AS t3
WHERE
row_num >= (count/2)
AND row_num <= ((count/2) + 1)
GROUP BY t3.period
ORDER BY t3.period;
create table med(id integer);
insert into med(id) values(1);
insert into med(id) values(2);
insert into med(id) values(3);
insert into med(id) values(4);
insert into med(id) values(5);
insert into med(id) values(6);
select (MIN(count)+MAX(count))/2 from
(select case when (select count(*) from
med A where A.id<B.id)=(select count(*)/2 from med) OR
(select count(*) from med A where A.id>B.id)=(select count(*)/2
from med) then cast(B.id as float)end as count from med B) C;
?column?
----------
3.5
(1 row)
OR
select cast(avg(id) as float) from
(select t1.id from med t1 JOIN med t2 on t1.id!= t2.id
group by t1.id having ABS(SUM(SIGN(t1.id-t2.id)))=1) A;
Often, we may need to calculate Median not just for the whole table, but for aggregates with respect to our ID. In other words, calculate median for each ID in our table, where each ID has many records. (good performance and works in many SQL + fixes problem of even and odds, more about performance of different Median-methods https://sqlperformance.com/2012/08/t-sql-queries/median )
SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val,
COUNT(*) OVER (PARTITION BY our_id) AS cnt,
ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rn
FROM our_table
) AS x
WHERE rn IN ((cnt + 1)/2, (cnt + 2)/2) GROUP BY our_id;
Hope it helps
MySQL has supported window functions since version 8.0, you can use ROW_NUMBER or DENSE_RANK (DO NOT use RANK as it assigns the same rank to same values, like in sports ranking):
SELECT AVG(t1.val) AS median_val
FROM (SELECT val,
ROW_NUMBER() OVER(ORDER BY val) AS rownum
FROM data) t1,
(SELECT COUNT(*) AS num_records FROM data) t2
WHERE t1.row_num IN
(FLOOR((t2.num_records + 1) / 2),
FLOOR((t2.num_records + 2) / 2));
A simple way to calculate Median in MySQL
set #ct := (select count(1) from station);
set #row := 0;
select avg(a.val) as median from
(select * from table order by val) a
where (select #row := #row + 1)
between #ct/2.0 and #ct/2.0 +1;
The most simple and fast way to calculate median in mysql.
select x.col
from (select lat_n,
count(1) over (partition by 'A') as total_rows,
row_number() over (order by col asc) as rank_Order
from station ft) x
where x.rank_Order = round(x.total_rows / 2.0, 0)