Compute median of column in SQL common table expression - sql

In MSSQL2008, I am trying to compute the median of a column of numbers from a common table expression using the classic median query as follows:
WITH cte AS
(
SELECT number
FROM table
)
SELECT cte.*,
(SELECT
(SELECT (
(SELECT TOP 1 cte.number
FROM
(SELECT TOP 50 PERCENT cte.number
FROM cte
ORDER BY cte.number) AS medianSubquery1
ORDER BY cte.number DESC)
+
(SELECT TOP 1 cte.number
FROM
(SELECT TOP 50 PERCENT cte.number
FROM cte
ORDER BY cte.number DESC) AS medianSubquery2
ORDER BY cte.number ASC) ) / 2)) AS median
FROM cte
ORDER BY cte.number
The result set that I get is the following:
NUMBER MEDIAN
x1 x1
x1 x1
x1 x1
x2 x2
x3 x3
In other words, the "median" column is the same as the "number" column when I would expect the median column to be "x1" all the way down. I use a similar expression to compute the mode and it works fine over the same common table expression.

Here's a slightly different way to do it:
WITH cte AS
(
SELECT number
FROM table1
)
SELECT T1.number, T3.median
FROM cte T1,
(
SELECT AVG(number) AS median
FROM
(
SELECT number, ROW_NUMBER() OVER(ORDER BY number) AS rn
FROM cte
) T2
WHERE T2.rn = ((SELECT COUNT(*) FROM table1) + 1) / 2
OR T2.rn = ((SELECT COUNT(*) FROM table1) + 2) / 2
) T3
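To see why the two rn comparisons pick out the middle row or rows, here is a minimal sketch on made-up data (the four literal values and the `numbered` CTE name are assumptions for illustration, not part of the original query). With integer division, an odd row count n makes (n + 1) / 2 and (n + 2) / 2 the same row, while an even count makes them the two middle rows.

```sql
-- Hypothetical sample data: four values, so the median is the mean of rows 2 and 3.
WITH cte AS
(
    SELECT number
    FROM (VALUES (10), (20), (30), (45)) AS v(number)
),
numbered AS
(
    SELECT number, ROW_NUMBER() OVER (ORDER BY number) AS rn
    FROM cte
)
SELECT AVG(1.0 * number) AS median  -- 1.0 * avoids an integer average of integer inputs
FROM numbered
WHERE rn = ((SELECT COUNT(*) FROM cte) + 1) / 2   -- (4 + 1) / 2 = 2 with integer division
   OR rn = ((SELECT COUNT(*) FROM cte) + 2) / 2;  -- (4 + 2) / 2 = 3
```

This averages 20 and 30. Note that AVG over an integer column returns an integer in SQL Server, so multiplying by 1.0 matters whenever the two middle values sum to an odd number.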

The problem with your query is that you are doing
SELECT TOP 1 cte.number FROM...
but cte.number there does not refer to the derived table, which is aliased medianSubquery1. It binds to the cte in the outer query instead, so the subquery is irrelevant and each inner SELECT simply returns the current outer row's number. That explains why the median column matches the number column all the way down. Removing the cte. prefix (as below) gives the median of the CTE, which is a constant value. What are you trying to do?
WITH cte AS
( SELECT NUMBER
FROM master.dbo.spt_values
WHERE TYPE='p'
)
SELECT cte.*,
(SELECT
(SELECT (
(SELECT TOP 1 number
FROM
(SELECT TOP 50 PERCENT cte.number
FROM cte
ORDER BY cte.number) AS medianSubquery1
ORDER BY number DESC)
+
(SELECT TOP 1 number
FROM
(SELECT TOP 50 PERCENT cte.number
FROM cte
ORDER BY cte.number DESC) AS medianSubquery2
ORDER BY number ASC) ) / 2)) AS median
FROM cte
ORDER BY cte.number
Returns
NUMBER median
----------- -----------
0 1023
1 1023
2 1023
3 1023
4 1023
5 1023
6 1023
7 1023

This is not an entirely new answer, as it mostly expands on Mark Byers's answer, but there are a couple of options for simplifying the query even further.
The first thing is to really make use of CTEs. Not only can you have multiple CTEs, but they can refer to each other. With this in mind, we can create an additional CTE to compute the median based on the results of the first. This encapsulates the median computation and leaves the actual SELECT to do only what it needs to do. Note that the ROW_NUMBER() had to be moved into the first CTE.
;WITH cte AS
(
SELECT number, ROW_NUMBER() OVER(ORDER BY number) AS rn
FROM table1
),
med AS
(
SELECT AVG(number) AS median
FROM cte
WHERE cte.rn = ((SELECT COUNT(*) FROM cte) + 1) / 2
OR cte.rn = ((SELECT COUNT(*) FROM cte) + 2) / 2
)
SELECT cte.number, med.median
FROM cte
CROSS JOIN med
And to further reduce complexity, you "could" use a custom CLR Aggregate to handle the Median (such as the one provided in the free SQL# library at http://www.SQLsharp.com/ [which I am the author of]).
;WITH cte AS
(
SELECT number
FROM table1
),
med AS
(
SELECT SQL#.Agg_Median(cte.number) AS median
FROM cte
)
SELECT cte.number, med.median
FROM cte
CROSS JOIN med

Related

Random records in Oracle table based on conditions

I have an Oracle table with the following columns
Table Structure
In a query I need to return all the records with CPER>=40 which is trivial. However, apart from CPER>=40 I need to list 5 random records for each CPID.
I have attached a sample list of records. However, in my table I have around 50,000 records.
Appreciate if you can help.
Oracle solution:
with CTE as
(
select t1.*,
row_number() over(partition by CPID order by DBMS_RANDOM.VALUE) as rn -- random order assigned within each CPID
from MyTable t1
where CPER < 40
)
select *
from CTE
where rn <= 5 -- pick 5 at random per CPID
union all
select t2.*, null
from MyTable t2
where CPER >= 40
SQL Server:
with CTE as
(
select t1.*,
row_number() over(partition by CPID order by newid()) as rn -- random order assigned within each CPID
from MyTable t1
where CPER < 40
)
select *
from CTE
where rn <= 5 -- pick 5 at random per CPID
union all
select t2.*, null
from MyTable t2
where CPER >= 40
How about something like this...
SELECT *
FROM (SELECT CID,
CVAL,
CPID,
CPER,
Row_number() OVER (partition BY CPID ORDER BY CPID ASC ) AS RN
FROM Table) tmp
WHERE CPER>=40 OR RN <= 5
However, this is not random.
Assuming that you want five additional random records, you can do:
select t.*
from (select t.*,
row_number() over (partition by cpid,
(case when cper >= 40 then 1 else 2 end)
order by dbms_random.value
) as seqnum
from t
) t
where seqnum <= 5 or cper >= 40;
The row_number() is enumerating the rows for each cpid in two groups -- based on the cper value. The outer where is taking all cper values in the range you want as well as five from the other group.

Applying LIMIT and OFFSET to MS SQL server 2008 queries

I need to apply LIMIT and OFFSET to an original query (without modifying it) in MS SQL Server 2008.
Let's say the original query is:
SELECT * FROM energy_usage
(But it can be any arbitrary SELECT query)
That's what I came up with so far:
1. It does what I need, but the query generates extra column row_number which I don't need.
WITH OrderedTable AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS row_number, * FROM energy_usage
)
SELECT * FROM OrderedTable WHERE row_number BETWEEN 1 AND 10
2. This one doesn't work for some reason and returns the following error.
SELECT real_sql.* FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS row_number, * FROM (SELECT * FROM energy_usage) as real_sql) as subquery
WHERE row_number BETWEEN 1 AND 10
More common case is:
SELECT real_sql.* FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS row_number, * FROM (real sql query) as real_sql) as subquery
WHERE row_number BETWEEN {offset} + 1 AND {limit} + {offset}
Error:
The column prefix 'real_sql' does not match with a table name or alias
name used in the query.
Simply leave it out of the SELECT list:
WITH OrderedTable AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS row_number, *
FROM energy_usage
)
SELECT col1, col2, col3 FROM OrderedTable WHERE row_number BETWEEN 1 AND 10;
SELECT * is a common anti-pattern and should be avoided anyway. Also, ORDER BY (SELECT 1) does not guarantee a stable sort order between executions.
Second, if you only need the first ten rows, use:
SELECT TOP 10 *
FROM energy_usage
ORDER BY ...
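For the general offset case, the two points above (name the columns, sort on something deterministic) can be combined as in the sketch below. The column names id, reading_time and kwh are assumptions for illustration, since the real columns of energy_usage are not shown, and id is assumed to be unique.

```sql
-- Hypothetical columns: id (unique), reading_time, kwh.
DECLARE @offset INT = 20;
DECLARE @limit INT = 10;

WITH OrderedTable AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS row_num,  -- unique key gives a repeatable order
           id, reading_time, kwh
    FROM energy_usage
)
SELECT id, reading_time, kwh          -- row_num is deliberately left out of the output
FROM OrderedTable
WHERE row_num BETWEEN @offset + 1 AND @offset + @limit
ORDER BY row_num;
```

Because the outer SELECT lists its columns explicitly, the helper column never reaches the caller, and ordering on a unique key means the same page comes back on every execution.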
Unfortunately, SQL Server has nothing as convenient as selecting all columns except one, so the following is not valid syntax:
WITH OrderedTable AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS row_number, *
FROM energy_usage
)
SELECT * EXCEPT row_number FROM OrderedTable WHERE row_number BETWEEN 1 AND 10;
This would solve the problem.
DECLARE @offset INT = 1;
DECLARE @limit INT = 10;
WITH Filtered AS (
SELECT TOP (@offset + @limit) *
FROM energy_usage
ORDER BY 1 ASC
), Results AS (
SELECT TOP (@limit) *
FROM Filtered
ORDER BY 1 DESC
)
SELECT *
FROM Results
ORDER BY 1 ASC;

Oracle: I need to select n rows from every k rows of a table

For example:
My table has 10000 rows. First I will divide it in 5 sets of 2000(k) rows. Then from each set of 2000 rows I will select only top 100(n) rows.
With this approach I am trying to scan some rows of table with a specific pattern.
Assuming you are ordering them 1 - 10000 using some logic and want to output only rows 1-100,2001-2100,4001-4100,etc then you can use the ROWNUM pseudocolumn:
SELECT *
FROM (
SELECT t.*,
ROWNUM AS rn -- Secondly, assign a row number to the ordered rows
FROM (
SELECT *
FROM your_table
ORDER BY your_condition -- First, order the data
) t
)
WHERE MOD( rn - 1, 2000 ) < 100; -- Finally, filter the top 100 per 2000.
Or you could use the ROW_NUMBER() analytic function:
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( ORDER BY your_condition ) AS rn
FROM your_table
)
WHERE MOD( rn - 1, 2000 ) < 100;
Is it possible to increase the size of the sample sets exponentially, like 1k, 2k, 4k, 8k, and so on, and then fetch some rows from each of these?
Replace the WHERE clause with:
WHERE rn - POWER(
2,
TRUNC( CAST( LOG( 2, CEIL( rn / 1000 ) ) AS NUMBER(20,4) ) )
) * 1000 + 1000 <= 100
This solution uses the analytic ntile() to split the raw data into five buckets. That result set is labelled using the analytic row_number() which provides a filter to produce the final set:
with sq1 as ( select id, col1, ntile(5) over (order by id asc) as quintile
from t23
)
, sq2 as ( select id, col1, quintile
, row_number() over ( partition by quintile order by id asc) as rn
from sq1 )
select *
from sq2
where rn <= 200
order by quintile, rn
/
Use PARTITION BY and ORDER BY with ROW_NUMBER(). It will look like the following:
row_number() over(partition by partition_column order by order_column) <= 100
partition_column is the expression that divides the rows into sets.
order_column is the expression that decides which rows count as the top 100.
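An analytic function cannot be referenced directly in a WHERE clause, so the comparison has to be applied one level up, in an outer query. A minimal sketch, reusing the placeholder names partition_column and order_column from above and a placeholder table your_table:

```sql
-- your_table, partition_column and order_column are placeholders, not real objects.
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY partition_column
                              ORDER BY order_column) AS rn  -- numbering restarts in each set
    FROM your_table t
)
WHERE rn <= 100;  -- keep the first 100 rows of every set
```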

Calculating cumulative sum in ms-sql

I have a table tblsumDemo with the following structure
billingid qty Percent_of_qty cumulative
1 10 5 5
2 5 8 13(5+8)
3 12 6 19(13+6)
4 1 10 29(19+10)
5 2 11 40(29+11)
this is what I have tried
declare @s int
SELECT billingid, qty, Percent_of_qty,
@s = @s + Percent_of_qty AS cumulative
FROM tblsumDemo
CROSS JOIN (SELECT @s = 0) AS var
ORDER BY billingid
but I'm not able to get the desired output. Any help would be much appreciated, thanks.
You can use CROSS APPLY:
SELECT
t1.*,
x.cumulative
FROM tblSumDemo t1
CROSS APPLY(
SELECT
cumulative = SUM(t2.Percent_of_Qty)
FROM tblSumDemo t2
WHERE t2.billingid <= t1.billingid
)x
For SQL Server 2012+, you can use SUM OVER():
SELECT *,
cumulative = SUM(Percent_of_Qty) OVER(ORDER BY billingId)
FROM tblSumDemo
You can use a correlated subquery, which works in all versions:
select billingid, qty, Percent_of_qty,
(select sum(t2.Percent_of_qty) from tblsumdemo t2 where t2.billingid <= t1.billingid) as csum
from
tblsumdemo t1
From SQL Server 2012 on, you can use window functions as well:
select *,
sum(Percent_of_qty) over (order by billingid rows between unbounded preceding and current row) as csum
from tblsumdemo
Here I am asking, for every row, for the sum of all rows from the first row up to the current row (unbounded preceding and current row). Because billingid is unique, you can leave out the frame clause here and get the same result from the default frame.
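One caveat about dropping the frame clause: in SQL Server, when ORDER BY is given without a frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows with equal ORDER BY values as peers and sums them together. The sketch below uses made-up data with a tie on the ordering column to show the difference; the demo CTE and its values are assumptions for illustration.

```sql
-- Hypothetical data: two rows share qty = 10, so they are peers under RANGE.
WITH demo AS
(
    SELECT billingid, qty
    FROM (VALUES (1, 10), (2, 10), (3, 5)) AS v(billingid, qty)
)
SELECT billingid, qty,
       -- explicit ROWS frame: strictly row by row, the tied rows get 15 and 25
       SUM(qty) OVER (ORDER BY qty
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
       -- default frame (RANGE): both tied rows get 25
       SUM(qty) OVER (ORDER BY qty) AS default_sum
FROM demo
ORDER BY qty, billingid;
```

When the ordering column is unique, as billingid is in the question, the two forms agree.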
Use ROW_NUMBER() to number the rows by billingid in ascending order, then self-join on that number.
Query
;with cte as(
select rn = row_number() over(
order by billingid
), *
from tblSumDemo
)
select t1.billingid, t1.qty, t1.Percent_of_qty,
sum(t2.Percent_of_qty) as cumulative
from cte t1
join cte t2
on t1.rn >= t2.rn
group by t1.billingid, t1.qty, t1.Percent_of_qty;

SQL stored procedure to add up values and stop once the maximum has been reached

I would like to write a SQL query (SQL Server) that will return rows (in a given order) but only up to a given total. My client has paid me a given amount, and I want to return only those rows that are <= to that amount.
For example, if the client paid me $370, and the data in the table is
id amount
1 100
2 122
3 134
4 23
5 200
then I would like to return only rows 1, 2 and 3
This needs to be efficient, since there will be thousands of rows, so a for loop would not be ideal, I guess. Or is SQL Server efficient enough to optimise a stored proc with for loops?
Thanks in advance. Jim.
A couple of options are.
1) Triangular Join
SELECT *
FROM YourTable Y1
WHERE (SELECT SUM(amount)
FROM YourTable Y2
WHERE Y1.id >= Y2.id ) <= 370
2) Recursive CTE
WITH RecursiveCTE
AS (
SELECT TOP 1 id, amount, CAST(amount AS BIGINT) AS Total
FROM YourTable
ORDER BY id
UNION ALL
SELECT R.id, R.amount, R.Total
FROM (
SELECT T.*,
T.amount + Total AS Total,
rn = ROW_NUMBER() OVER (ORDER BY T.id)
FROM YourTable T
JOIN RecursiveCTE R
ON R.id < T.id
) R
WHERE R.rn = 1 AND Total <= 370
)
SELECT id, amount, Total
FROM RecursiveCTE
OPTION (MAXRECURSION 0);
The 2nd one will likely perform better.
In SQL Server 2012 you will be able to do something like
;WITH CTE AS
(
SELECT id,
amount,
SUM(amount) OVER(ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS RunningTotal
FROM YourTable
)
SELECT *
FROM CTE
WHERE RunningTotal <=370
Though there will probably be a more efficient way (to stop the scan as soon as the total is reached)
Straightforward approach:
SELECT a.id, a.amount
FROM table1 a
INNER JOIN table1 b ON (b.id <=a.id)
GROUP BY a.id, a.amount
HAVING SUM(b.amount) <= 370
Unfortunately, it has an O(N^2) performance problem, because every row is joined to all the rows before it.
Something like this:
select id from
(
select t1.id, t1.amount, sum( t2.amount ) s
from tst t1, tst t2
where t2.id <= t1.id
group by t1.id, t1.amount
) x -- SQL Server requires an alias on a derived table
where s <= 370