SQL ranking query to compute ranks and median in sub groups

I want to compute the Median of y in sub groups of this simple xy_table:
 x   | y   --groups-->  gid | x   | y   --medians-->  gid | x   | y
-----+---               ----+-----+---                ----+-----+---
 0.1 | 4                0.0 | 0.1 | 4                 0.0 | 0.1 | 4
 0.2 | 3                0.0 | 0.2 | 3                     |     |
 0.7 | 5                1.0 | 0.7 | 5                 1.0 | 0.7 | 5
 1.5 | 1                2.0 | 1.5 | 1                     |     |
 1.9 | 6                2.0 | 1.9 | 6                     |     |
 2.1 | 5                2.0 | 2.1 | 5                 2.0 | 2.1 | 5
 2.7 | 1                3.0 | 2.7 | 1                 3.0 | 2.7 | 1
In this example every x is unique and the table is already sorted by x.
I now want to GROUP BY round(x) and get the tuple that holds the median of y in each group.
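For reference, here is a minimal SQLite setup for the sample data (table and column names taken from the diagram above), so the queries below can be tried directly:
-- sample xy_table from the diagram; every x is unique and pre-sorted
CREATE TABLE xy_table (x FLOAT, y FLOAT);
INSERT INTO xy_table (x, y) VALUES
  (0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1),
  (1.9, 6), (2.1, 5), (2.7, 1);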
I can already compute the median for the whole table with this ranking query:
SELECT a.x, a.y
FROM xy_table a, xy_table b
WHERE a.y >= b.y
GROUP BY a.x, a.y
HAVING count(*) = (SELECT round((count(*)+1)/2) FROM xy_table)
Output: 0.1, 4.0
But I have not yet succeeded in writing a query that computes the median for sub groups.
Attention: I do not have a median() aggregate function available. Please also do not propose solutions with special PARTITION, RANK, or QUANTILE constructs (as found in similar but too vendor-specific SO questions). I need plain SQL (i.e., compatible with SQLite, which has no median() function).
Edit: I was actually looking for the Medoid and not the Median.

I suggest doing the computation in your programming language:
for each group:
    for each record_in_group:
        append y to array
    median of array
But if you are stuck with SQLite, you can order each group by y and select the records in the middle like this (http://sqlfiddle.com/#!5/d4c68/55/0):
UPDATE: only the bigger "median" value is important for an even number of rows, so no avg() is needed:
select groups.gid,
       ids.y median
from (
  -- get middle row number in each group (the bigger number for an even nr. of rows)
  -- note the integer division
  select round(x) gid,
         count(*) / 2 + 1 mid_row_right
  from xy_table
  group by round(x)
) groups
join (
  -- for each record get the equivalent of
  --   row_number() over (partition by gid order by y)
  select round(a.x) gid,
         a.x,
         a.y,
         count(*) rownr_by_y
  from xy_table a
  left join xy_table b
    on round(a.x) = round(b.x)
   and a.y >= b.y
  group by a.x
) ids on ids.gid = groups.gid
where ids.rownr_by_y = groups.mid_row_right
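For comparison, the same "bigger middle row" rule can also be folded into a single correlated ranking query in the asker's original style (a sketch; like the query above, it assumes y values are distinct within each group, as they are in the sample):
select a.x, a.y
from xy_table a
join xy_table b
  on round(a.x) = round(b.x)
 and a.y >= b.y
group by a.x, a.y
having count(*) = (select count(*) / 2 + 1
                   from xy_table c
                   where round(c.x) = round(a.x))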

OK, this relies on a temporary table:
create temporary table tmp (x float, y float);
insert into tmp
select * from xy_table order by round(x), y
But you could potentially create this for a range of data you were interested in. Another way would be to ensure the xy_table had this sort order, instead of just ordering on x. The reason for this is SQLite's lack of row numbering capability.
Then:
select tmp4.x as gid, t.*
from (
  select tmp1.x,
         -- for the larger of the two middle values instead of their
         -- average, see the variant below the query
         round((tmp2.y + coalesce(tmp3.y, tmp2.y)) / 2) as y
  from (
    select round(x) as x,
           min(rowid) + (count(*) / 2) as id1,
           (case when count(*) % 2 = 0
                 then min(rowid) + (count(*) / 2) - 1
                 else 0 end) as id2
    from (
      select *, rowid from tmp
    ) t
    group by round(x)
  ) tmp1
  join tmp tmp2 on tmp1.id1 = tmp2.rowid
  left join tmp tmp3 on tmp1.id2 = tmp3.rowid
) tmp4
join xy_table t on tmp4.x = round(t.x) and tmp4.y = t.y
If you wanted to treat the median as the larger of the two middle values, which doesn't fit the definition as @Aprillion already pointed out, then you would simply take the larger of the two y values instead of their average when computing y in the query.
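Concretely, the y expression would become (same aliases as above; tmp3.y is NULL for odd-sized groups, in which case tmp2.y is the single middle value):
(case when tmp2.y >= coalesce(tmp3.y, tmp2.y)
      then tmp2.y else tmp3.y end) as y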

Related

How to get values that are >= 25%, 50%, 75% of a list from a column using SQL

My table has a single column called Speed (integer), and I need to select the values that are greater than or equal to 25%, 50%, and 75% of the values in that list.
Sample data:
+-------+
| Speed |
+-------+
|     1 |
|     2 |
|     3 |
|     4 |
|     5 |
|     6 |
|     7 |
|     8 |
|     9 |
|    10 |
+-------+
Desired output:
+--------+
| OUTPUT |
+--------+
|      3 |
|      5 |
|      8 |
+--------+
Explanation:
3 >= 25% of the numbers in the list
5 >= 50% of the numbers in the list
8 >= 75% of the numbers in the list
I think that I should sort the data, and do something like:
SELECT speed
FROM my_table
WHERE speed IN (ROUND(0.25 * <total_row>), ROUND(0.50 * <total_row>),..)
but I don't know how to get that <total_row> reference. If I could just SELECT COUNT(speed) AS total_row, and use that later, that would be great.
Thank you so much.
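For what it's worth, the asker's own sketch can be completed with a derived table supplying <total_row> (a sketch; it works on this sample only because each speed value coincides with its 1-based rank, and it assumes ROUND() rounds 2.5 up):
SELECT speed
FROM my_table,
     (SELECT COUNT(*) AS total_row FROM my_table) c
WHERE speed IN (ROUND(0.25 * total_row),
                ROUND(0.50 * total_row),
                ROUND(0.75 * total_row));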
create table Speed Engine=Memory
as select number+1 X from numbers(10);
SELECT quantilesExact(0.25, 0.5, 0.75)(X)
FROM Speed
┌─quantilesExact(0.25, 0.5, 0.75)(X)─┐
│ [3,6,8]                            │
└────────────────────────────────────┘
SELECT arrayJoin(quantilesExact(0.25, 0.5, 0.75)(X)) AS q
FROM Speed
┌─q─┐
│ 3 │
│ 6 │
│ 8 │
└───┘
SELECT arrayJoin(quantilesExact(0.25, 0.499999999999, 0.75)(X)) AS q
FROM Speed
┌─q─┐
│ 3 │
│ 5 │
│ 8 │
└───┘
In the ClickHouse realm, joins are usually not applicable because tables typically hold billions of rows.
create table Speed Engine=MergeTree order by X as select number X from numbers(1000000000);
SELECT quantilesExact(0.25, 0.5, 0.75)(X)
FROM Speed
┌─quantilesExact(0.25, 0.5, 0.75)(X)─┐
│ [250000000,500000000,750000000]    │
└────────────────────────────────────┘
1 rows in set. Elapsed: 7.974 sec. Processed 1.00 billion rows,
SELECT quantiles(0.25, 0.5, 0.75)(X)
FROM Speed
┌─quantiles(0.25, 0.5, 0.75)(X)────────┐
│ [244782599,500713390.5,751014086.75] │
└──────────────────────────────────────┘
1 rows in set. Elapsed: 1.274 sec. Processed 1.00 billion rows
This is a bit long for a comment.
Basically, to answer this question in SQL there are three approaches:
Window functions.
Correlated subqueries to calculate cumulative counts.
Self-joins with non-equality conditions and aggregation to calculate cumulative counts.
The first is BY FAR the best approach. But the other two can be used in databases that don't support window functions.
Alas, Clickhouse does not support:
Window functions.
Correlated subqueries.
Non-equijoins.
It might have undocumented features or extensions that support some of this functionality. However, the base product does not seem to support enough SQL to do this in a single query.
EDIT:
There would seem to be a way, assuming that rowNumberInAllBlocks() obeys the ordering specified in order by:
select t.*
from (select t.*,
             rowNumberInAllBlocks() as seqnum,
             tt.cnt
      from t cross join
           (select count(*) as cnt from t) tt
      order by speed
     ) t
where (t.seqnum <= t.cnt * 0.25 and t.seqnum + 1 > t.cnt * 0.25) or
      (t.seqnum <= t.cnt * 0.50 and t.seqnum + 1 > t.cnt * 0.50) or
      (t.seqnum <= t.cnt * 0.75 and t.seqnum + 1 > t.cnt * 0.75);
Sorry for an inefficient but working solution; try this:
Declare a variable for the max value:
declare @maxspeed int = (select max(speed) from my_table)
Then select the relevant values from my_table:
select speed
from my_table
where speed in ((select top 1 speed
                 from my_table
                 where speed >= 0.25 * @maxspeed
                 order by speed),
                (select top 1 speed
                 from my_table
                 where speed >= 0.5 * @maxspeed
                 order by speed),
                (select top 1 speed
                 from my_table
                 where speed >= 0.75 * @maxspeed
                 order by speed))
First you do a self-join so that each row is joined to all the rows with Speed less than or equal to that row's Speed.
Then cross join to the query that returns the total number of rows of the table.
Finally, group by the percentage of rows that each Speed is greater than or equal to, rounded down to the brackets 25, 50, and 75, and get the minimum Speed for each group:
select min(t.speed) Output
from (select count(*) total from tablename) c
cross join (
select t.speed, count(*) counter
from tablename t inner join tablename tt
on tt.speed <= t.speed
group by t.speed
) t
where 25 * floor(floor(100.0 * t.counter / c.total) / 25) in (25, 50, 75)
group by 25 * floor(floor(100.0 * t.counter / c.total) / 25)
This code is tested and working on MySQL, PostgreSQL, and SQL Server.
See the demo.
Results:
| output |
| ------ |
|      3 |
|      5 |
|      8 |

Order rows based on aggregated data

I have a sample table like shown below :
select * from sampleTable;
 label | data
-------+------
 a     |    1
 b     |    2
 c     |    3
 d     |    4
 a     |    5
 b     |    6
(6 rows)
I need the rows sorted by the summed-up values of the 'data' column per label, i.e., c with data of 3 should come first, b with the combined data of 2 and 6 should come last, and the others in between, as shown below:
 label | data
-------+------
 c     |    3
 d     |    4
 a     |    1
 a     |    5
 b     |    2
 b     |    6
I have tried to achieve this with a self-join as shown below, but it seems a bit verbose. Am I doing it right, or is there a better way to achieve the same without joins?
select l, data
from sampleTable
join (select label as l, sum(data) as x
      from sampleTable
      group by l) m on label = m.l
order by x;
 l | data
---+------
 c |    3
 d |    4
 a |    1
 a |    5
 b |    2
 b |    6
(6 rows)
You can avoid the self-join by using a SUM with a windowed function, something like this:
SELECT label
, data
FROM (
SELECT *
, SUM(data) OVER (PARTITION BY label) pts
FROM sampleTable
) AS rez
ORDER BY pts
You don't need a self-join or a subquery. You can use window functions in the order by:
select t.*
from t
order by sum(data) over (partition by label),
label;
Note the inclusion of label as the second key. This is important for distinguishing ties in the data: it ensures that all rows for a given label appear together.
Simply use the sum window function in ORDER BY
SELECT l, d
FROM tab
ORDER BY SUM(d) OVER (PARTITION BY l)
dbfiddle demo

How to return multiple rows with repeating non-unique query values in SQL?

I have a table 'mat' with columns x,y,data, where (x,y) is the multi-column primary key, so the table contains data in matrix form. The problem is how to select multiple rows when I have a "vector" of key pairs and there can be repeating pairs:
SELECT x,y,data FROM mat WHERE (x,y) IN ((0,0),(0,0),(1,1));
quite obviously returns
x | y | data
--+---+-----
0 | 0 | 5
1 | 1 | 7
whereas I would need:
x | y | data
--+---+-----
0 | 0 | 5
0 | 0 | 5
1 | 1 | 7
I could loop the key pairs from outside (in c++/whatever code) to get the correct data but there's a major performance degradation and that's quite critical. Any suggestions? Is it possible? Help appreciated!
I think you need a JOIN for this
SELECT mat.x,mat.y,data
FROM mat
JOIN
(
SELECT 0 x, 0 y
UNION ALL
SELECT 0 x, 0 y
UNION ALL
SELECT 1 x, 1 y
) t ON t.x = mat.x and t.y = mat.y
demo
The IN is just evaluated to true/false/unknown for each row; it cannot multiply your data.
Radim has the right idea. I prefer this syntax:
SELECT m.*
FROM mat m JOIN
(VALUES (0, 0), (0, 0), (1, 1)) v(x, y)
ON m.x = v.x and m.y = v.y;

SQL: select top fewest rows with sum more than 0.7

The raw data table is
+--------+--------+
|   id   | value  |
+--------+--------+
|      1 |    0.1 |
|      1 |    0.2 |
|      1 |    0.3 |
|      1 |    0.2 |
|      1 |    0.2 |
|      2 |    0.4 |
|      2 |    0.5 |
|      2 |    0.1 |
|      3 |    0.5 |
|      3 |    0.5 |
+--------+--------+
For each id, its value sum is 1. I want to select the fewest rows for each id whose value sum is greater than or equal to 0.7, like:
+--------+--------+
|   id   | value  |
+--------+--------+
|      1 |    0.3 |
|      1 |    0.2 |
|      1 |    0.2 |
|      2 |    0.5 |
|      2 |    0.4 |
|      3 |    0.5 |
|      3 |    0.5 |
+--------+--------+
How to solve this problem?
It's neither pretty nor efficient but it's the best I can come up with.
Disclaimer: I'm sure this will perform horribly on any real-world dataset.
with recursive calc (id, row_list, value_list, total_value) as (
  select id, array[ctid], array[value]::numeric(6,2)[], value::numeric(6,2) as total_value
  from data
  union all
  select c.id, p.row_list || c.ctid, (p.value_list || c.value)::numeric(6,2)[],
         (p.total_value + c.value)::numeric(6,2)
  from data as c
  join calc as p on p.id = c.id and c.ctid <> all(p.row_list)
)
select id, unnest(min(value_list)) as value
from (
  select id,
         value_list,
         array_length(row_list, 1) as num_values,
         min(array_length(row_list, 1)) over (partition by id) as min_num_values
  from calc
  where total_value >= 0.7
) as result
where num_values = min_num_values
group by id
SQLFiddle example: http://sqlfiddle.com/#!15/8966b/1
How does this work?
The recursive CTE (the WITH RECURSIVE part) creates all possible combinations of values from the table. To make sure that the same value is not counted twice, I'm collecting the CTIDs (a Postgres-internal unique identifier for each row) of the rows already processed into an array. The recursive join condition (p.id = c.id and c.ctid <> all(p.row_list)) then makes sure only values for the same id are added, and only those that have not yet been processed.
The result of the CTE is then reduced to all rows where the total sum (the column total_value) is >= 0.7.
The final outer select (the alias result) is then filtered down to those rows where the number of values making up the total sum is the smallest. The min(value_list) with group by id picks a single combination per id, and unnest then transforms the array back into a proper "table". Picking one combination is necessary because the CTE collects all orderings, so that for e.g. id=2 the value_list array will contain both {0.40,0.50} and {0.50,0.40}; without it, unnest would return both combinations, doubling the rows.
This also isn't that pretty, but I think it'd be more efficient (and more portable between RDBMSs):
with unique_data as (
select id
, value
, row_number() over ( partition by id order by value desc ) as rn
from my_table
)
, cumulative_sum as (
select id
, value
, sum(value) over ( partition by id order by rn ) as csum
from unique_data
)
, first_over_the_mark as (
select id
, value
, csum
, lag(csum) over ( partition by id order by csum ) as prev_value
from cumulative_sum
)
select *
from first_over_the_mark
where coalesce(prev_value, 0) < 0.7
SQL Fiddle
I've done it with CTEs to make it easier to see what's happening, but there's no need to use them.
It uses a cumulative sum. The first CTE makes the rows unique: without it, two rows that both hold 0.2 count as the same value and get summed together. The second CTE works out the running sum. The third works out the previous cumulative value. If that previous value is strictly less than 0.7, pick up everything: the idea being that if the previous cumulative sum is less than 0.7, the current row is still needed to reach (or pass) that threshold.
It's worth noting that this will break down if you have any rows in your table where the value is 0.
This is a variant on Ben's method, but it is simpler to implement. You just need a cumulative sum, ordered by value in reverse, and then to take everything where the cumulative sum is less than 0.7, plus the first row that reaches or exceeds that value.
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc) as csum
from t
) t
where csum - value < 0.7;
The expression csum - value is the cumulative sum minus the current value (you can also get this using something like rows between unbounded preceding and 1 preceding). Your condition is that this value is less than some threshold.
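Spelled out with that explicit frame, the same filter could look like this (a sketch, keeping the answer's placeholder table name t; the frame sums everything before the current row, so the first row of each id gets NULL and needs a coalesce):
select t.*
from (select t.*,
             coalesce(sum(value) over (partition by id
                                       order by value desc
                                       rows between unbounded preceding
                                            and 1 preceding), 0) as prev_csum
      from t
     ) t
where prev_csum < 0.7;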
EDIT:
Ben's comment is right about duplicate values. His solution is fine. Here is another solution:
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc, random()) as csum
from t
) t
where csum - value < 0.7;

SQL - min() gets the lowest value, max() the highest, what if I want the 2nd (or 5th or nth) lowest value?

The problem I'm trying to solve is that I have a table like this:
a and b refer to points in a different table. distance is the distance between the points.
| id | a_id | b_id | distance | delete |
|  1 |    1 |    1 |        1 |      0 |
|  2 |    1 |    2 |   0.2345 |      0 |
|  3 |    1 |    3 |      100 |      0 |
|  4 |    2 |    1 |   1343.2 |      0 |
|  5 |    2 |    2 |     0.45 |      0 |
|  6 |    2 |    3 |      110 |      0 |
....
The important column I'm looking at is a_id. If I wanted to keep the closest b for each a, I could do something like this:
update mytable set delete = 1
from (select a_id, min(distance) as dist from mytable group by a_id) as x
where mytable.a_id = x.a_id and distance > dist;
delete from mytable where delete = 1;
Which would give me a result table like this:
| id | a_id | b_id | distance | delete |
|  1 |    1 |    1 |        1 |      0 |
|  5 |    2 |    2 |     0.45 |      0 |
....
i.e. I need one row for each value of a_id, and that row should have the lowest value of distance for each a_id.
However, I want to keep the 10 closest points for each a_id. I could do this with a plpgsql function, but I'm curious if there is a more SQL-y way.
min() and max() return the smallest and largest values. If there were an aggregate function like nth() that returned the nth largest/smallest value, I could do this in a similar manner to the above.
I'm using PostgreSQL.
Try this:
SELECT *
FROM (
  SELECT a_id, (
    SELECT b_id
    FROM mytable mib
    WHERE mib.a_id = ma.a_id
    ORDER BY distance
    LIMIT 1 OFFSET s
  ) AS b_id
  FROM (
    SELECT DISTINCT a_id
    FROM mytable mia
  ) ma, generate_series(0, 9) s
) ab
WHERE b_id IS NOT NULL
Checked on PostgreSQL 8.3
I love Postgres, so I took it as a challenge the second I saw this question.
So, for the table:
Table "pg_temp_29.foo"
Column | Type | Modifiers
--------+---------+-----------
value | integer |
With the values:
SELECT value FROM foo ORDER BY value;
 value
-------
     0
     1
     2
     3
     4
     5
     6
     7
     8
     9
    14
    20
    32
(13 rows)
You can do a:
SELECT value FROM foo ORDER BY value DESC LIMIT 1 OFFSET X
Where X = 0 for the highest value, 1 for the second highest, 2... And so forth.
This can be further embedded in a subquery to retrieve the value needed. So, using the dataset provided in the original question, we can get the a_ids with the ten lowest distances by doing:
SELECT t1.a_id, t1.distance
FROM mytable t1
WHERE t1.id IN
  (SELECT t2.id FROM mytable t2
   WHERE t2.a_id = t1.a_id
   ORDER BY t2.distance LIMIT 10)
ORDER BY t1.a_id, t1.distance;
 a_id | distance
------+----------
    1 |   0.2345
    1 |        1
    1 |      100
    2 |     0.45
    2 |      110
    2 |   1343.2
Does PostgreSQL have the analytic function rank()? If so try:
select a_id, b_id, distance
from
  ( select a_id, b_id, distance,
           rank() over (partition by a_id order by distance) rnk
    from mytable
  ) ranked
where rnk <= 10;
This SQL, which finds the Nth lowest value, should work in SQL Server, MySQL, DB2, Oracle, Teradata, and almost any other RDBMS. (Note: low performance because of the subquery.)
SELECT * /*This is the outer query part */
FROM mytable tbl1
WHERE (N-1) = ( /* Subquery starts here */
SELECT COUNT(DISTINCT(tbl2.distance))
FROM mytable tbl2
WHERE tbl2.distance < tbl1.distance)
The most important thing to understand in the query above is that the subquery is evaluated each and every time a row is processed by the outer query. In other words, the inner query cannot be processed independently of the outer query, since it uses the tbl1 value as well.
In order to find the Nth lowest value, we just find the value that has exactly N-1 distinct values lower than itself.
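For instance, with N = 2 against the question's sample table, the query returns the rows whose distance has exactly one distinct distance below it:
SELECT *
FROM mytable tbl1
WHERE 1 = (SELECT COUNT(DISTINCT tbl2.distance)
           FROM mytable tbl2
           WHERE tbl2.distance < tbl1.distance)
With the sample data this returns the row with id 5 (a_id 2, distance 0.45), since only 0.2345 is a smaller distinct distance.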