Finding Median Using MySQL - sql

I know that there are many ways to find the median, but I am trying to use this method to find the median. Can someone explain to me why this does not work? The error here says "Invalid use of group function," but when I use HAVING instead of WHERE, the system doesn't recognize what RowNumber is. I'm very confused.
SELECT
ROUND(AVG(LS.LAT_N))
FROM(
SELECT
LAT_N,
ROW_NUMBER() OVER (ORDER BY LAT_N) AS RowNumber
FROM
STATION
) AS LS
WHERE
RowNumber IN (
IF(
FLOOR(COUNT(LS.LAT_N)/2+0.5) = CEIL(COUNT(LS.LAT_N)/2+0.5),
FLOOR(COUNT(LS.LAT_N)/2+0.5),
FLOOR(COUNT(LS.LAT_N)/2+0.5) AND CEIL(COUNT(LS.LAT_N)/2+0.5)
)

I typically write this as:
SELECT AVG(LAT_N)
FROM (SELECT LAT_N,
ROW_NUMBER() OVER (ORDER BY LAT_N) AS RowNumber,
COUNT(*) OVER () as cnt
FROM STATION
) s
WHERE 2 * RowNumber IN (CNT, CNT + 1, CNT + 2);
Here is a db<>fiddle.
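As a quick sanity check of the `2 * RowNumber IN (cnt, cnt + 1, cnt + 2)` trick, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. The sample values and the helper name `median_via_sql` are made up for illustration, and it assumes the bundled SQLite is 3.25 or newer (needed for window functions); MySQL 8 should behave the same way for this query.

```python
import sqlite3
import statistics

MEDIAN_SQL = """
SELECT AVG(LAT_N)
FROM (SELECT LAT_N,
             ROW_NUMBER() OVER (ORDER BY LAT_N) AS RowNumber,
             COUNT(*) OVER () AS cnt
      FROM STATION
     ) s
WHERE 2 * RowNumber IN (cnt, cnt + 1, cnt + 2)
"""

def median_via_sql(values):
    """Load values into a throwaway STATION table and run the median query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STATION (LAT_N REAL)")
    conn.executemany("INSERT INTO STATION VALUES (?)", [(v,) for v in values])
    (result,) = conn.execute(MEDIAN_SQL).fetchone()
    conn.close()
    return result

odd = [10.0, 30.0, 20.0, 50.0, 40.0]   # hypothetical data, 5 rows
even = [10.0, 30.0, 20.0, 40.0]        # hypothetical data, 4 rows

# Compare the SQL result with Python's own median.
print(median_via_sql(odd), statistics.median(odd))
print(median_via_sql(even), statistics.median(even))
```

With 5 rows only row 3 satisfies the condition (2 * 3 = 6 = cnt + 1); with 4 rows, rows 2 and 3 do (4 = cnt and 6 = cnt + 2), so their average is returned.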

The median is the middle element in an ordered series - or the average of the two middle elements if there is an even number.
SELECT
AVG(LAT_N)
FROM(
SELECT
LAT_N,
ROW_NUMBER() OVER (ORDER BY LAT_N) AS RowNumber
FROM
STATION
) AS q
WHERE
RowNumber >= FLOOR ( (SELECT COUNT(*) FROM STATION)/2 + 0.5)
AND
RowNumber <= CEIL ( (SELECT COUNT(*) FROM STATION)/2 + 0.5)
Here is dbfiddle https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=b31e08f4ece61ecb95d9dde76c389fb1
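To see why the FLOOR/CEIL bounds pick the right rows, here is a small sketch of just the arithmetic in Python. The function name `middle_rows` is only for illustration; Python's `/` is true division, which matches how `/` behaves in MySQL for this expression.

```python
import math

def middle_rows(n):
    """Return the (low, high) 1-based row numbers kept by the WHERE clause."""
    low = math.floor(n / 2 + 0.5)
    high = math.ceil(n / 2 + 0.5)
    return low, high

# Even count: two middle rows; odd count: a single middle row.
for n in (4, 5):
    print(n, middle_rows(n))
```

For an even count the expression `n/2 + 0.5` lands halfway between the two middle rows, so FLOOR and CEIL differ by one; for an odd count it is already a whole number and both bounds coincide.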

Related

How to get the top N percent (e.g., 50%) of a table in BigQuery (standard SQL)?

I have tried the following approaches, none of which worked:
Using SELECT TOP 50 PERCENT: BigQuery does not have a TOP clause.
Using LIMIT (SELECT COUNT(*) FROM tabl)/2: this fails because BigQuery does not accept a non-integer value for LIMIT.
Using SET to store the median value and then filtering with WHERE.
In BigQuery I would use window function percent_rank().
select t.* except (prnk)
from (select t.*, percent_rank() over(order by id) prnk from mytable t) t
where prnk <= 0.5
Note: any answer to your question will require that you provide a column to order your data. I assumed that this column is called id.
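To illustrate what `percent_rank()` produces, here is a rough sketch using SQLite from Python rather than BigQuery. SQLite has no `select * except (...)`, so the columns are listed explicitly; the five-row table is invented for the example, and SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(i,) for i in range(1, 6)])

# percent_rank is (rank - 1) / (row count - 1), so it runs from 0 to 1.
rows = conn.execute("""
    SELECT id, prnk
    FROM (SELECT id, PERCENT_RANK() OVER (ORDER BY id) AS prnk
          FROM mytable) t
    WHERE prnk <= 0.5
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

Note that with 5 rows the ranks are 0, 0.25, 0.5, 0.75 and 1, so `prnk <= 0.5` keeps 3 of the 5 rows; use `<` if the middle row should be excluded.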
One method uses window functions:
select t.* except (seqnum, cnt)
from (select t.*, row_number() over (order by ?) as seqnum,
count(*) over () as cnt
from t
) t
where seqnum <= cnt / 2;
Another possibility would be to limit the data with a WHERE clause instead of LIMIT. This is an example if you want to filter by an ID (it only returns half of the rows when the IDs are consecutive integers starting at 1):
SELECT * FROM table_name as t
WHERE t.id <= (SELECT COUNT(*) FROM table_name)/2;
And if you want to filter by the row number:
SELECT t.* except (rn)
FROM (
SELECT t.*, ROW_NUMBER() OVER () AS rn
FROM table_name as t
) AS t
WHERE t.rn <= (SELECT COUNT(*) FROM table_name)/2;
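The two filters above behave differently when the IDs are not consecutive. Here is a small sketch of that difference using SQLite from Python; the IDs 10, 20, 30, 40 are invented, the window is given an explicit `ORDER BY id` (an addition, so the result is deterministic), and SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?)",
                 [(i,) for i in (10, 20, 30, 40)])

# Filter on the ID value itself: compares 10, 20, ... against COUNT(*)/2 = 2.
by_id = conn.execute("""
    SELECT id FROM table_name AS t
    WHERE t.id <= (SELECT COUNT(*) FROM table_name) / 2
""").fetchall()

# Filter on a generated row number: independent of the actual ID values.
by_row_number = conn.execute("""
    SELECT id
    FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
          FROM table_name) AS t
    WHERE t.rn <= (SELECT COUNT(*) FROM table_name) / 2
    ORDER BY id
""").fetchall()

print(by_id)
print(by_row_number)
```

The ID-based filter returns nothing here because every ID is larger than 2, while the row-number filter returns the first two rows.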
To scale up, you can use an approx algorithm to find the 50% point:
DECLARE mid_date TIMESTAMP DEFAULT (
SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers` )
;
SELECT mid_date
, COUNTIF(creation_date > mid_date) first_half
, COUNTIF(creation_date < mid_date) second_half
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
Looks like it works well:
Now let's get these records out:
CREATE TABLE `temp.fifty_percent`
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
WHERE creation_date < (
SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
)
This method will happily scale, while solutions using OVER(ORDER BY) won't.
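For readers unfamiliar with `APPROX_QUANTILES(x, 2)[OFFSET(1)]`: with 2 buckets the function returns 3 boundaries (roughly the minimum, the median and the maximum), and `OFFSET(1)` picks the middle one. The sketch below is an exact Python stand-in for that idea, not BigQuery's approximate algorithm; the function name and data are hypothetical.

```python
def quantile_boundaries(values, buckets):
    """Exact stand-in: return buckets + 1 boundaries from min to max."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / buckets)] for i in range(buckets + 1)]

values = list(range(1, 10))            # hypothetical data: 1..9
boundaries = quantile_boundaries(values, 2)
mid = boundaries[1]                    # plays the role of [OFFSET(1)]

below = sum(1 for v in values if v < mid)
above = sum(1 for v in values if v > mid)
print(boundaries, mid, below, above)
```

Roughly half of the rows fall on each side of the middle boundary, which is what the COUNTIF check above verifies on the real table.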

Replacement for row_number() in clickhouse

ROW_NUMBER() is not supported by the ClickHouse database, so I am looking for an alternative function.
SELECT company_name AS company,
DOMAIN,
city_name AS city,
state_province_code AS state,
country_code AS country,
location_revenue AS revenueRange,
location_TI_industry AS industry,
location_employeecount_range AS employeeSize,
topic,
location_duns AS duns,
rank AS intensityRank,
dnb_status_code AS locationStatus,
rank_delta AS intensityRankDelta,
company_id,
ROW_NUMBER() OVER (PARTITION BY DOMAIN) AS rowNumber
FROM company_intent c
WHERE c.rank > 0
AND c.rank <= 10
AND c.signal_count > 0
AND c.topic IN ('Cloud Computing')
AND c.country_code = 'US'
AND c.rank IN (7, 8, 9, 10)
GROUP BY c.location_duns,
company_name,
DOMAIN,
city_name,
state_province_code,
country_code,
location_revenue,
location_TI_industry,
location_employeecount_range,
topic,
rank,
dnb_status_code,
rank_delta,
company_id
ORDER BY intensityRank DESC
LIMIT 15
SELECT COUNT (DISTINCT c.company_id) AS COUNT
FROM company_intent c
WHERE c.rank > 0
AND c.rank <= 10
AND c.signal_count > 0
AND c.topic IN ('Cloud Computing')
AND c.country_code = 'US'
AND c.rank IN (7, 8, 9, 10)
When I executed the above query, I got the error below.
Expected one of: SETTINGS, FORMAT, WITH, HAVING, LIMIT, FROM, PREWHERE, token, UNION ALL, Comma, WHERE, ORDER BY, INTO OUTFILE, GROUP BY
Any suggestions are appreciated.
Solution #1
SELECT
*,
rowNumberInAllBlocks()
FROM
(
-- YOUR SELECT HERE
)
https://clickhouse.com/docs/en/sql-reference/functions/other-functions/#rownumberinallblocks says:
rowNumberInAllBlocks() Returns the ordinal number of the row in the data block. This function only considers the affected data blocks.
Solution #2
SELECT
row_number() OVER (),
...
FROM
...
https://clickhouse.com/docs/en/sql-reference/window-functions/
In my tests, both solutions showed identical results. However, keep in mind that, as of the beginning of 2022, window functions ran in single-threaded mode.
ClickHouse doesn't support Window Functions for now. There is a rowNumberInAllBlocks function that might be interesting to you.
SELECT *, rowNumberInAllBlocks() as row_count FROM (SELECT .....)
Something like this (it looks terrible but works well):
SELECT *, rn +1 -min_rn current, max_rn - min_rn + 1 last FROM (
SELECT *, rowNumberInAllBlocks() rn FROM (
SELECT i_device, i_time
FROM tbl
ORDER BY i_device, i_time
) t
) t1 LEFT JOIN (
SELECT i_device, min(rn) min_rn, max(rn) max_rn FROM (
SELECT *, rowNumberInAllBlocks() rn FROM (
SELECT i_device, i_time
FROM tbl
ORDER BY i_device, i_time
) t
) t GROUP BY i_device
) t2 USING (i_device)
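The arithmetic in that query (a global row number minus the group's smallest row number) can be sketched in plain Python. The device names and times are invented, and `enumerate` stands in for `rowNumberInAllBlocks()`; since only differences are used, the result does not depend on whether the global numbering starts at 0 or 1.

```python
# Hypothetical (i_device, i_time) rows, sorted like the ORDER BY in the query.
rows = sorted([("b", 3), ("a", 2), ("a", 1), ("b", 7), ("b", 5)])

# Global row number over the sorted rows (stand-in for rowNumberInAllBlocks()).
numbered = [(dev, t, rn) for rn, (dev, t) in enumerate(rows)]

# Per-device min and max of the global row number (the t2 subquery).
bounds = {}
for dev, _, rn in numbered:
    lo, hi = bounds.get(dev, (rn, rn))
    bounds[dev] = (min(lo, rn), max(hi, rn))

# current = rn + 1 - min_rn, last = max_rn - min_rn + 1 (the outer SELECT).
result = [
    (dev, t, rn + 1 - bounds[dev][0], bounds[dev][1] - bounds[dev][0] + 1)
    for dev, t, rn in numbered
]

for row in result:
    print(row)
```

Each row ends up with its position inside its device group and the size of that group, which is what `ROW_NUMBER() OVER (PARTITION BY i_device ORDER BY i_time)` and a per-group count would give.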

I want to generate continuous numbers by 2 columns and batch-wise

I want to generate continuous numbers from the combination of 2 columns, in batches of 5. Can anybody help me solve this?
An adaptation of @GordonLinoff's answer...
SELECT
name,
rank,
DENSE_RANK() OVER (ORDER BY name DESC, Rank, ((seqnum - 1) / 5)) AS rno
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY name, rank ORDER BY (SELECT null)) AS seqnum
FROM
yourTable
)
sequenced
ORDER BY
3
You can use row_number() and arithmetic:
select name, rank,
((seqnum - 1) / 5) + 1 as rno
from (select t.*,
row_number() over (partition by name, rank order by (select null)) as seqnum
from t
) t
order by seqnum;
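The `((seqnum - 1) / 5) + 1` expression relies on integer division. A minimal Python sketch of the same numbering, with invented rows and a hypothetical helper name, looks like this; on engines where `/` is not integer division, the quotient would need to be wrapped in something like FLOOR.

```python
def batch_numbers(rows, batch_size=5):
    """Number rows within each (name, rank) group, then map them to batches."""
    seen = {}
    out = []
    for name, rank in rows:
        # seqnum plays the role of row_number() over (partition by name, rank).
        seqnum = seen.get((name, rank), 0) + 1
        seen[(name, rank)] = seqnum
        # Integer division: rows 1-5 -> batch 1, rows 6-10 -> batch 2, ...
        rno = (seqnum - 1) // batch_size + 1
        out.append((name, rank, seqnum, rno))
    return out

rows = [("A", 1)] * 7 + [("B", 1)] * 3   # hypothetical data
numbered = batch_numbers(rows)
for row in numbered:
    print(row)
```

Here the batch number restarts for each (name, rank) group; the DENSE_RANK variant above is the one that keeps counting across groups.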

amazon redshift sql: remove top 1% of data as outliers

I'm trying to remove the top 1% of the data, because we consider those values outliers that greatly skew the data. I've tried using SELECT TOP 99 PERC, but Amazon Redshift does not support percentages with TOP.
I've tried something like:
WITH
elapsed_times AS (
SELECT
COALESCE(anonymous_id, distinct_id) as id,
elapsed_time
FROM studio_production.interaction
WHERE project_id = '55062b464a9bc578006987ff'
),
max_elapsed_time AS (
SELECT elapsed_time
FROM elapsed_times
ORDER BY elapsed_time ASC
OFFSET ROUND(0.99 * (SELECT COUNT(*) FROM elapsed_times))
LIMIT 1
),
user_times AS (
SELECT
id,
LEAST(elapsed_time, max_elapsed_time) as elapsed_time
FROM elapsed_times
GROUP BY 1
)
SELECT AVG(elapsed_time) FROM user_times
But I get: argument of OFFSET must not contain subqueries
Thus, my query is now:
WITH
elapsed_times AS (
SELECT
COALESCE(anonymous_id, distinct_id) as id,
elapsed_time,
RANK() OVER (ORDER BY elapsed_time ASC) as rnk
FROM studio_production.interaction
WHERE project_id = '55062b464a9bc578006987ff'
),
user_times AS (
SELECT
id,
LEAST(MAX(elapsed_time), (
SELECT MIN(elapsed_time)
FROM elapsed_times
WHERE rnk > ROUND(0.99 * (SELECT COUNT(*) FROM elapsed_times))
)) as elapsed_time
FROM elapsed_times
GROUP BY 1
)
SELECT AVG(elapsed_time) FROM user_times
This is actually quite slow. What is the correct approach to this problem?
You could use ntile() (see here):
select avg(elapsed_time)
from (select et.*,
ntile(100) over (order by elapsed_time) as thetile
from elapsed_times et
) et
where thetile not in (1, 100);
EDIT:
I admit that I often do this using row_number() and count():
select avg(elapsed_time)
from (select et.*,
row_number() over (order by elapsed_time) as seqnum,
count(*) over () as cnt
from elapsed_times et
) et
where (seqnum > 0.01 * cnt) and (seqnum < 0.99 * cnt);
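As a rough check of the `ntile(100)` approach, here is a sketch that runs it on SQLite from Python instead of Redshift. The table holds the invented values 1 through 200, so each of the 100 tiles contains exactly two rows; SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elapsed_times (elapsed_time REAL)")
conn.executemany("INSERT INTO elapsed_times VALUES (?)",
                 [(float(v),) for v in range(1, 201)])

# Drop the lowest and the highest of 100 equally sized tiles.
kept, avg_kept = conn.execute("""
    SELECT COUNT(*), AVG(elapsed_time)
    FROM (SELECT et.*,
                 NTILE(100) OVER (ORDER BY elapsed_time) AS thetile
          FROM elapsed_times et
         ) et
    WHERE thetile NOT IN (1, 100)
""").fetchone()

print(kept, avg_kept)
```

The filter removes both tails (values 1, 2, 199 and 200 here). If only the top 1% should go, as the question asks, filtering on `thetile < 100` keeps the lowest tile.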

How to query the time at which the max wind occurred from a database?

I want to find the time at which the max wind occurred, the max wind itself, and the total rain from a database. The table has three columns: observerTime, wind and rain. How do I write the SQL statement to get this result?
select observerTime from t where wind = (select max(wind) from t)
Or, if you need the last date on which it occurs:
select max(observerTime) from t where wind = (select max(wind) from t)
You don't mention the database, but one of the following is likely to work:
select top 1 *, (select sum(rain) from t) as TotalRain
from t
order by wind desc
or:
select *, (select sum(rain) from t) as TotalRain
from t
order by wind desc
limit 1
or
select t.*, (select sum(rain) from t) as TotalRain
from (select *
from t
order by wind desc
) t
where rownum = 1
You should be able to use something like this:
select t1.observerTime,
t1.wind,
(select sum(rain) from yourtable) TotalRain
from yourtable t1
inner join
(
select max(wind) MaxWind
from yourtable
) t2
on t1.wind = t2.maxwind
See SQL Fiddle with Demo
Since you are using SQL Server, you can also use row_number():
select observertime,
wind,
(select sum(rain) from yourtable) TotalRain
from
(
select observertime,
wind,
rain,
row_number() over(order by wind desc) rn
from yourtable
) src
where rn = 1
See SQL Fiddle with Demo
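To try the idea without a database server, here is a minimal sketch of the `ORDER BY wind DESC LIMIT 1` variant running on SQLite from Python. The three observations are invented sample data, and the table name `t` follows the earlier answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (observerTime TEXT, wind REAL, rain REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("2012-01-01 00:00", 3.0, 0.5),   # hypothetical observations
    ("2012-01-01 01:00", 9.0, 1.0),
    ("2012-01-01 02:00", 4.0, 2.5),
])

# Row with the strongest wind, plus the total rain over the whole table.
row = conn.execute("""
    SELECT observerTime, wind, (SELECT SUM(rain) FROM t) AS TotalRain
    FROM t
    ORDER BY wind DESC
    LIMIT 1
""").fetchone()

print(row)
```

If several rows tie for the maximum wind, `LIMIT 1` returns just one of them; the join against `max(wind)` shown above returns all of the tied rows.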