What is a SQL frequency-distribution query to count ranges with group-by, and include 0 counts? - sql

Given:
table 'thing':
age
---
3.4
3.4
10.1
40
45
49
I want to count the number of things for each 10-year range, e.g.,
age_range | count
----------+-------
        0 | 2
       10 | 1
       20 | 0
       30 | 0
       40 | 3
This query comes close:
SELECT FLOOR(age / 10) as age_range, COUNT(*)
FROM thing
GROUP BY FLOOR(age / 10) ORDER BY FLOOR(age / 10);
Output:
age_range | count
-----------+-------
        0 | 2
        1 | 1
        4 | 3
However, it doesn't show me the ranges which have 0 counts. How can I modify the query so that it also shows the ranges in between with 0 counts?
I found similar stackoverflow questions for counting ranges, some for 0 counts, but they involve having to specify each range (either hard-coding the ranges into the query, or putting the ranges in a table). I would prefer to use a generic query like that above where I do not have to explicitly specify each range (e.g., 0-10, 10-20, 20-30, ...). I'm using PostgreSQL 9.1.3.
Is there a way to modify the simple query above to include 0 counts?
Similar:
Oracle: how to "group by" over a range?
Get frequency distribution of a decimal range in MySQL

generate_series to the rescue:
select 10 * s.d, count(t.age)
from generate_series(0, 10) s(d)
left outer join thing t on s.d = floor(t.age / 10)
group by s.d
order by s.d
Figuring out the upper bound for generate_series should be trivial with a separate query; I just used 10 as a placeholder.
This:
generate_series(0, 10) s(d)
essentially generates an inline table called s with a single column d which contains the values from 0 to 10 (inclusive).
You could wrap the two queries (one to figure out the range, one to compute the counts) into a function if necessary.
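For instance, a sketch that derives the upper bound from the data itself (assuming the same thing table and age column as above) could look like this:
select 10 * s.d as age_range, count(t.age)
from generate_series(
       0,
       (select floor(max(age) / 10)::int from thing)  -- upper bound derived from the data
     ) s(d)
left outer join thing t on s.d = floor(t.age / 10)
group by s.d
order by s.d;
With the sample data, floor(max(age) / 10) is 4, so the series covers the ranges 0 through 40.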

You need some way to invent the table of age ranges. Row number usually works nicely. Do a cartesian product against a big table to get lots of numbers.
WITH ranges AS (
    SELECT (rownum - 1) * 10 AS age_range
    FROM (SELECT row_number() OVER () AS rownum
          FROM pg_tables
         ) n,
         (SELECT ceil(max(age) / 10) AS range_end
          FROM thing
         ) m
    WHERE n.rownum <= range_end
)
SELECT r.age_range, COUNT(t.age) AS count
FROM ranges r
LEFT JOIN thing t ON r.age_range = FLOOR(t.age / 10) * 10
GROUP BY r.age_range
ORDER BY r.age_range;
EDIT: mu is too short has a much more elegant answer, but if you didn't have a generate_series function on the db, ... :)

Related

Redshift: Generate a sequential range of numbers

I'm currently migrating PostgreSQL code from our existing DWH to a new Redshift DWH and a few queries are not compatible.
I have a table which has id, start_week, end_week and orders_each_week in a single row. I'm trying to generate a sequential series between start_week and end_week so that I get separate rows for each week in the given timeline.
Eg.,
This how it is present in the table
+----+------------+----------+------------------+
| ID | start_week | end_week | orders_each_week |
+----+------------+----------+------------------+
| 1 | 3 | 5 | 10 |
+----+------------+----------+------------------+
This is how I want to have it
+----+------+--------+
| ID | week | orders |
+----+------+--------+
| 1 | 3 | 10 |
+----+------+--------+
| 1 | 4 | 10 |
+----+------+--------+
| 1 | 5 | 10 |
+----+------+--------+
The code below is throwing an error.
SELECT
id,
generate_series(start_week::BIGINT, end_week::BIGINT) AS demand_weeks
FROM client_demand
WHERE createddate::DATE >= '2021-01-01'
[0A000][500310] Amazon Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
[01000] Function "generate_series(bigint,bigint)" not supported.
So basically I am trying to generate a sequential series between two numbers. I couldn't find any solution, and any help here is really appreciated. Thank you.
Gordon Linoff has shown a very common method for doing this, and this approach has the advantage that it isn't generating "rows" that don't already exist. This can make it faster than approaches that generate data on the fly. However, you need to have a table with about the right number of rows lying around, and that isn't always the case. He also shows that this number series needs to be cross joined with your data to perform the function you need.
If you need to generate a large number of numbers in a series without using an existing table, there are a number of ways to do this. Here's my go-to approach:
WITH twofivesix AS (
SELECT
p0.n
+ p1.n * 2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as n
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
),
fourbillion AS (
SELECT (a.n * POWER(256, 3) + b.n * POWER(256, 2) + c.n * 256 + d.n) as n
FROM twofivesix a,
twofivesix b,
twofivesix c,
twofivesix d
)
SELECT ...
This example makes a whole bunch of numbers (about 4B), but you can extend or reduce the length of the series by changing the number of times the tables are cross joined and by adding WHERE clauses (as Gordon Linoff did). I don't expect you need a list anywhere close to this long, but I wanted to show how this can be used to make very long series. (You can also write this in base 10 if that makes more sense to you.)
So if you have a table with more rows than the numbers you need, the existing-table method can be the fastest, but if you don't have such a table, or table lengths vary over time, you may want this pure SQL approach.
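Applied to the question's week expansion, the same idea can be sized down. Here is a self-contained sketch that generates 0-255 (more than enough for week offsets) and joins it to the table from the question; the table and column names are taken from the question, and the 0-255 range is an assumption:
WITH twofivesix AS (
    SELECT p0.n + p1.n * 2 + p2.n * 4 + p3.n * 8
         + p4.n * 16 + p5.n * 32 + p6.n * 64 + p7.n * 128 AS n
    FROM (SELECT 0 AS n UNION SELECT 1) p0,
         (SELECT 0 AS n UNION SELECT 1) p1,
         (SELECT 0 AS n UNION SELECT 1) p2,
         (SELECT 0 AS n UNION SELECT 1) p3,
         (SELECT 0 AS n UNION SELECT 1) p4,
         (SELECT 0 AS n UNION SELECT 1) p5,
         (SELECT 0 AS n UNION SELECT 1) p6,
         (SELECT 0 AS n UNION SELECT 1) p7
)
SELECT cd.id,
       cd.start_week + s.n AS week,   -- one output row per week in the range
       cd.orders_each_week
FROM client_demand cd
JOIN twofivesix s ON s.n <= cd.end_week - cd.start_week
ORDER BY cd.id, week;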
Among the many Postgres features that Redshift does not support is generate_series() (except on the master node). You can generate one yourself.
If you have a table with enough rows in Redshift, then I find that this approach works:
with n as (
select row_number() over () - 1 as n
from client_demand cd
)
select cd.id, cd.start_week + n.n as week, cd.orders_each_week
from client_demand cd join
n
on n.n <= (end_week - start_week);
This assumes that you have a table with enough rows to generate enough numbers for the on clause. If the table is really big, then add something like limit 100 in the n CTE to limit the size.
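For example (a sketch based on the query above; 100 is an arbitrary cap that just needs to exceed the largest end_week - start_week):
with n as (
     select row_number() over () - 1 as n
     from client_demand
     limit 100      -- cap the generated series
)
select cd.id, cd.start_week + n.n as week, cd.orders_each_week
from client_demand cd join
     n
     on n.n <= (cd.end_week - cd.start_week);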
If there are only a handful of values, you can use:
select 0 as n union all
select 1 as n union all
select 2 as n

Finding average of data within a certain range

How do I find the average of a data set within a certain range? Specifically, I am looking to find the average of a data set using only the data points that are within one standard deviation of the original average. Here is an example:
Student_ID Test_Scores
1 3
1 20
1 30
1 40
1 50
1 60
1 95
Average = 42.571
Standard Deviation = 29.854
I want to find all data points that are within one standard deviation of this original average, so within the range (42.571 - 29.854) <= Data <= (42.571 + 29.854), i.e. 12.717 <= Data <= 72.425. From here I want to recalculate a new average.
So my desired data set is:
Student_ID Test_Scores
1 20
1 30
1 40
1 50
1 60
My desired new average is: 40
Here is my SQL code, which didn't yield my desired result:
SELECT
Student_ID,
AVG(Test_Scores)
FROM
Student_Data
WHERE
Test_Scores BETWEEN (AVG(Test_Scores)-STDEV(Test_Scores)) AND (AVG(Test_Scores)+STDEV(Test_Scores))
ORDER BY
Student_ID
Anyone know how I could fix this?
Use either window functions or do the calculation in a subquery:
SELECT sd.Student_ID, sd.Test_Scores
FROM Student_Data sd CROSS JOIN
(SELECT AVG(Test_Scores) as avgts, STDEV(Test_Scores) as stdts
FROM Student_Data
) x
WHERE sd.Test_Scores BETWEEN avgts - stdts AND avgts + stdts
ORDER BY sd.Student_ID;
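Here is a sketch of the window-function variant mentioned above; it also recomputes the new average in one pass. It assumes SQL Server (STDEV, as in the question's code); other databases would use stddev_samp or similar:
SELECT Student_ID, AVG(Test_Scores) AS new_avg
FROM (SELECT Student_ID, Test_Scores,
             AVG(Test_Scores) OVER ()   AS avgts,   -- overall average
             STDEV(Test_Scores) OVER () AS stdts    -- overall standard deviation
      FROM Student_Data
     ) x
WHERE Test_Scores BETWEEN avgts - stdts AND avgts + stdts
GROUP BY Student_ID;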
select avg(test_scores)
from student_data
where test_scores between
      (select avg(test_scores) from student_data) - (select stdev(test_scores) from student_data)
  and (select avg(test_scores) from student_data) + (select stdev(test_scores) from student_data);

Compare every field in table to every other field in same table

Imagine a table with only one column.
+------+
| v |
+------+
|0.1234|
|0.8923|
|0.5221|
+------+
I want to do the following for row K:
Take row K=1 value: 0.1234
Count how many values in the rest of the table are less than or equal to the value in row 1.
Iterate through all rows
Output should be:
+------+-------+
| v |output |
+------+-------+
|0.1234| 0 |
|0.8923| 2 |
|0.5221| 1 |
+------+-------+
Quick update: I was using this approach to compute a statistic at every value of v in the above table. The cross join approach was way too slow for the size of data I was dealing with, so instead I computed my stat for a grid of v values and then matched them to the vs in the original data. v_table is the data table from before and stat_comp is the statistics table.
AS SELECT t1.*
,CASE WHEN v<=1.000000 THEN pr_1
WHEN v<=2.000000 AND v>1.000000 THEN pr_2
FROM v_table AS t1
LEFT OUTER JOIN stat_comp AS t2
Window functions were added to ANSI/ISO SQL in 1999 and to Hive in version 0.11, which was released on 15 May 2013.
What you are looking for is a variation on rank with ties high which in ANSI/ISO SQL:2011 would look like this-
rank () over (order by v with ties high) - 1
Hive currently does not support with ties ... but the logic can be implemented using count(*) over (...)
select v
,count(*) over (order by v) - 1 as rank_with_ties_high_implicit
from mytable
;
or
select v
,count(*) over
(
order by v
range between unbounded preceding and current row
) - 1 as rank_with_ties_high_explicit
from mytable
;
Generate sample data
select 0.1234 as v into #t
union all
select 0.8923
union all
select 0.5221
This is the query
;with ct as (
select ROW_NUMBER() over (order by v) rn
, v
from #t ot
)
select distinct v, a.cnt
from ct ot
outer apply (select count(*) cnt from ct where ct.rn <> ot.rn and v <= ot.v) a
After seeing your edits, it really does look like you could use a Cartesian product, i.e. CROSS JOIN, here. I called your table foo, and cross joined it to itself as bar:
SELECT foo.v, COUNT(foo.v) - 1 AS output
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v
GROUP BY foo.v;
Here's a fiddle.
This query cross joins the table with itself so that every pairing of the column's elements is produced (you can see this yourself by removing the COUNT, GROUP BY, and WHERE clauses and adding bar.v to the SELECT). The WHERE clause then keeps a pairing only when foo.v >= bar.v, and counting those pairings minus 1 (for the self-match) yields the final result.
You can take the full Cartesian product of the table with itself and sum a case statement:
select a.x
, sum(case when b.x < a.x then 1 else 0 end) as count_less_than_x
from (select distinct x from T) a
, T b
group by a.x
This will give you one row per unique value in the table with the count of non-unique rows whose value is less than this value.
Notice that there is neither a join nor a where clause. In this case, we actually want that. For each row of a we get a full copy aliased as b. We can then check each one to see whether or not it's less than a.x. If it is, we add 1 to the count. If not, we just add 0.

PERCENTILE_DISC() in PostgreSQL as a window function

We are porting our system from SQL Server to PostgreSQL. As part of that, we calculate median daily turnovers for all companies on all dates for the past 3 months. Below is a simplified query:
SELECT B.Company, B.Dt, B.Turnover,
       (SELECT DISTINCT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Turnover)
                        OVER (PARTITION BY B.Company, B.Dt)
        FROM Example_Tbl AS G
        WHERE G.Company = B.Company
          AND G.Dt <= B.Dt
          AND G.Dt > DateAdd(dd, -92, B.Dt)) AS Med_3m_Turnover
FROM Example_Tbl B;
The problem is that PostgreSQL doesn't support the use of percentile_disc() as a window function. The error message is:
ERROR: OVER is not supported for ordered-set aggregate percentile_disc
Is there any way I can implement the same functionality using something else in PostgreSQL.
edit: Here is example input data in Example_Tbl
Company Dt Turnover
x 1 10
x 2 45
x 3 20
y 1 300
y 2 100
y 3 200
And the output should be as below. Please note, we are ignoring the 3-month window right now and just have 3 rows per company:
Company Dt Turnover Med_3m_Turnover
x 1 10 10
x 2 45 10 or 45 depending on percentile_disc
x 3 20 20
y 1 300 300
y 2 100 300 or 100 depending on percentile_disc
y 3 200 200
Your partition by clause (PARTITION BY B.Company, B.Dt) is using values from the outer query (alias B), not the subquery (alias G), which wasn't immediately obvious to me at first. Because the values of B.Company and B.Dt are constant for each execution of the subquery, your partition clause is really no different than simply writing it like this:
... over (partition by 1)
You can test it in SQL Server if you want, but you'll find that the results are the same. Now, I don't know if using B.Company, B.Dt was intentional or not, but in effect, it means that the partition by clause is not actually partitioning anything.
So, as a result, the good news for you is that to write the equivalent query in PostgreSQL, you simply need to omit the OVER (PARTITION BY B.Company, B.Dt) clause entirely, and the behavior will be the same as in SQL Server.
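For example, assuming Dt is a date column (so B.Dt - 92 subtracts 92 days, replacing DateAdd), a sketch of the rewritten query in PostgreSQL would be:
SELECT B.Company, B.Dt, B.Turnover,
       (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY G.Turnover)  -- plain ordered-set aggregate, no OVER
        FROM Example_Tbl AS G
        WHERE G.Company = B.Company
          AND G.Dt <= B.Dt
          AND G.Dt > B.Dt - 92) AS Med_3m_Turnover
FROM Example_Tbl B;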

SQL: create sequential list of numbers from various starting points

I'm stuck on this SQL problem.
I have a column that is a list of starting points (prevdoc), and another column that lists how many sequential numbers I need after the starting point (exdiff).
For example, here are the first several rows:
prevdoc | exdiff
----------------
1 | 3
21 | 2
126 | 2
So I need an output to look something like:
2
3
4
22
23
127
128
I'm lost as to where even to start. Can anyone advise me on the SQL code for this solution?
Thanks!
;with a as
(
    select prevdoc + 1 as col, exdiff
    from <table>
    where exdiff > 0
    union all
    select col + 1, exdiff - 1
    from a
    where exdiff > 1
)
select col
from a
If your exdiff is going to be a small number, you can make up a virtual table of numbers using SELECT..UNION ALL as shown below and join to it:
select prevdoc+number
from doc
join (select 1 number union all
select 2 union all
select 3 union all
select 4 union all
select 5) x on x.number <= doc.exdiff
order by 1;
I have provided for 5 but you can expand as required. You haven't specified your DBMS, but in each one there will be a source of sequential numbers, for example in SQL Server, you could use:
select prevdoc+number
from doc
join master..spt_values v on
v.number <= doc.exdiff and
v.number >= 1 and
v.type = 'p'
order by 1;
The master..spt_values table contains numbers between 0-2047 (when filtered by type='p').
If the numbers are not too large, then you can use the following trick in most databases:
select t.prevdoc + nums.seqnum
from t join
     (select row_number() over (order by column_name) as seqnum
      from INFORMATION_SCHEMA.columns
     ) nums
     on nums.seqnum <= t.exdiff
The use of INFORMATION_SCHEMA.columns in the subquery is arbitrary. The only purpose is to generate a sequence of numbers at least as long as the maximum exdiff value.
This approach will work in any database that supports the ranking functions. Most databases have a database-specific way of generating a sequence (such as recursive CTEs in SQL Server and CONNECT BY in Oracle).
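For example, in Oracle the CONNECT BY route would look roughly like this (a sketch; doc is the assumed table name, as in the earlier answers, and 100 is an arbitrary upper bound on exdiff):
select d.prevdoc + n.num
from doc d
join (select level as num          -- generates 1..100
      from dual
      connect by level <= 100) n
  on n.num <= d.exdiff
order by 1;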