Creating bins in Presto SQL programmatically

I am new to Presto SQL syntax and am wondering whether a function exists that will bin rows into n bins over a given range.
For example, I have a table with 1m different integers that range from 1 to 100. What can I do to create 20 bins between 1 and 100 (a bin for 1-5, 6-10, 11-15, etc.) without using 20 separate CASE WHEN statements? Are there any standard SQL functions that will perform the binning?
Any advice would be appreciated!

You can use the standard SQL function width_bucket. For example:
WITH data(value) AS (
SELECT rand(100)+1 FROM UNNEST(sequence(1,10000))
)
SELECT value, width_bucket(value, 1, 101, 20) bucket
FROM data
produces:
value | bucket
-------+--------
100 | 20
98 | 20
38 | 8
42 | 9
67 | 14
74 | 15
6 | 2
...

You can just use integer division:
select (intcol - 1) / 5 as bin
Presto does integer division, so you shouldn't have to worry about the remainder.
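As a rough illustration of the integer-division approach, the sketch below runs the same expression against an in-memory SQLite database from Python. SQLite stands in for Presto here only because it also truncates integer division. The table name t, the helper name bin_rows, and the extra bin_start/bin_end columns are assumptions added for illustration. Note that the bins come out zero-based (0 to 19), unlike width_bucket, which starts at 1.

```python
import sqlite3


def bin_rows(values, width=5):
    """Return (intcol, bin, bin_start, bin_end) rows, one per input value.

    Hypothetical helper: loads the values into a scratch table and bins
    them with integer division, as in the answer above.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (intcol INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        """
        SELECT intcol,
               (intcol - 1) / :w               AS bin,        -- zero-based bin
               ((intcol - 1) / :w) * :w + 1    AS bin_start,  -- lowest value in bin
               ((intcol - 1) / :w) * :w + :w   AS bin_end     -- highest value in bin
        FROM t
        ORDER BY intcol
        """,
        {"w": width},
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in bin_rows([1, 5, 6, 100]):
        print(row)
```

Adding 1 to the bin expression would make the numbering line up with width_bucket(value, 1, 101, 20) for integers in 1 to 100.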

Related

Find correct bucket for the number

I have buckets whose widths follow a repeating 3-3-4 pattern, and I need to find the right bucket for a given number:
1-3 4-6 7-10 11-13 14-16 17-20 21-23 24-26 27-30
What would be an efficient formula for finding the right bucket? For example, the number 5 lies in bucket 4-6, and 18 lies in bucket 17-20.
I need to write a SQL query for that.
I would maintain a separate bona fide table containing the range values. Then, join to it to get the output you want.
Table: ranges
start | end
1 | 3
4 | 6
7 | 10
11 | 13
14 | 16
17 | 20
21 | 23
24 | 26
27 | 30
WITH buckets AS (
SELECT 5 AS val UNION ALL
SELECT 18
)
SELECT b.val, CAST(r.start AS VARCHAR(10)) + '-' + CAST(r.end AS VARCHAR(10))
FROM buckets b
INNER JOIN ranges r
ON b.val BETWEEN r.start AND r.end
ORDER BY b.val;
For non-overlapping ranges, you only need one boundary of each range: the starting value or, equivalently, the ending value. Then INDEX that column. The lookup below, written for the case where n holds each range's ending value, is then very efficient:
SELECT ...
FROM ...
WHERE n >= ?
ORDER BY n
LIMIT 1
Because of the INDEX, this runs in O(1). (Or O(log N) if you include the cost of the BTree lookup.)
More details (couched in IP-addresses): http://mysql.rjweb.org/doc.php/ipranges
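The indexed lookup can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the answerer's exact schema: the column names range_start and range_end, the helper names, and the choice to keep the start value around (only to reject numbers that fall outside every range) are all hypothetical. The primary key on range_end provides the index.

```python
import sqlite3

# The 3-3-4 ranges from the question.
RANGES = [(1, 3), (4, 6), (7, 10), (11, 13), (14, 16),
          (17, 20), (21, 23), (24, 26), (27, 30)]


def make_ranges_db():
    """Create an in-memory table of ranges, indexed on the ending value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ranges (range_start INTEGER, range_end INTEGER PRIMARY KEY)"
    )
    conn.executemany("INSERT INTO ranges VALUES (?, ?)", RANGES)
    return conn


def find_bucket(conn, n):
    """Return (range_start, range_end) for the bucket holding n, or None."""
    row = conn.execute(
        """
        SELECT range_start, range_end
        FROM ranges
        WHERE range_end >= ?     -- first range whose end is not below n
        ORDER BY range_end
        LIMIT 1
        """,
        (n,),
    ).fetchone()
    # Reject values above the last range or below the matched range's start.
    if row is None or n < row[0]:
        return None
    return row


if __name__ == "__main__":
    conn = make_ranges_db()
    print(find_bucket(conn, 5))
    print(find_bucket(conn, 18))
```

If the table stored only starting values, the mirror-image query would filter on start <= ? and order descending.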

Redshift SQL - Histogram With Equal Width Bins

I currently have table like so:
ID | Count Value
----+-------------
1 | 45
2 | 24
3 | 13
4 | 67
5 | 3
6 | 21
... | ...
Anyone know how to create a table that I can use to create a histogram with equal width bins going from min to max?
End result would look something like this:
Bin of Values | Count(of IDs)
---------------+---------------
min-5 | 3
6-10 | 20
11-15 | 5
16-20 | 2
21-25 | 35
... | ...
(max-5)-max | 1
I have used width_bucket in the past, but Redshift does not support the function. Any help would be greatly appreciated. Thank you!
You can use case in Redshift. It's a bit more laborious than width_bucket but the results can be the same:
select
case
when val between 0 and 5
then '0-5'
when val between 6 and 10
then '6-10'
when val between 11 and 15
then '11-15'
when val between 16 and 20
then '16-20'
...
end,
count(1)
from my_table
group by 1;
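Where writing out every CASE branch is impractical, the bin can also be computed arithmetically. The sketch below is one possible way to do that, run against SQLite from Python so it can be tried anywhere. It assumes integer values, a fixed bin width, and bins anchored at the minimum value. The column name val and the helper name histogram are hypothetical, and the string concatenation and integer-division behaviour should be checked against Redshift before relying on it.

```python
import sqlite3


def histogram(values, width=5):
    """Return (label, count) rows for equal-width bins starting at min(val)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (val INTEGER)")
    conn.executemany("INSERT INTO my_table VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        """
        SELECT (m.lo + b.bin * :w) || '-' || (m.lo + b.bin * :w + :w - 1) AS label,
               b.cnt
        FROM (
            -- integer division assigns each value a zero-based bin number
            SELECT (val - (SELECT MIN(val) FROM my_table)) / :w AS bin,
                   COUNT(*) AS cnt
            FROM my_table
            GROUP BY 1
        ) AS b
        CROSS JOIN (SELECT MIN(val) AS lo FROM my_table) AS m
        ORDER BY b.bin
        """,
        {"w": width},
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Made-up sample values, not the asker's data.
    for label, count in histogram([3, 4, 8, 13, 21, 24, 45, 67]):
        print(label, count)
```

Empty bins do not appear in the result; filling them in would need a join against a generated list of bin numbers.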

width_bucket not returning buckets of equal width

I'm using Postgres version 9.6.9 and attempting to use width_bucket() to generate a histogram with buckets consisting of equal widths. However, the query I'm using is not returning buckets of equal widths.
As you can see in the example below, the values in the buckets span varying widths. For example, bucket 1 has a min of 7 and a max of 23, a width of 16, while bucket 3 has a min of 52 and a max of 55, a width of 3.
How can I adjust my query to ensure that each bucket has the same width?
Here's what the data looks like:
value
-------
7
7
15
17
18
22
23
25
29
42
52
52
55
60
74
85
90
90
92
95
(20 rows)
Here's the query and resulting histogram:
WITH min_max AS (
SELECT
min(value) AS min_val,
max(value) AS max_val
FROM table
)
SELECT
min(value),
max(value),
count(*),
width_bucket(value, min_val, max_val, 5) AS bucket
FROM table, min_max
GROUP BY bucket
ORDER BY bucket;
min | max | count | bucket
-----+-----+-------+--------
7 | 23 | 7 | 1
25 | 42 | 3 | 2
52 | 55 | 3 | 3
60 | 74 | 2 | 4
85 | 92 | 4 | 5
95 | 95 | 1 | 6
( 6 rows )
From https://prestodb.io/docs/current/functions/window.html
Have a look at ntile():
ntile(n) → bigint
Divides the rows for each window partition into n buckets ranging from 1 to at most n. Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket.
For example, with 6 rows and 4 buckets, the bucket values would be as follows: 1 1 2 2 3 4
Or, say, to rank each runner's 100m race times across their 10 races and find their personal best:
SELECT
NTILE(10) over (PARTITION BY runners ORDER BY racetimes)
FROM
table
Your buckets are the same size. You just don't have data that accurately represents the end-points.
For instance, would 24 be in the first or second bucket? This is more notable for the ranges without any data, such as 75-83.
From https://www.oreilly.com/library/view/sql-in-a/9780596155322/re91.html
WIDTH_BUCKET( expression, min, max, buckets)
The buckets argument specifies the number of buckets to create over the range defined by min through max. min is inclusive, whereas max is not.
The maximum is not included, so set:
WIDTH_BUCKET( expression, min, max + 1, buckets)
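To see why the extra bucket appears, here is a small Python model of the bucket arithmetic for integer inputs. It is a sketch of the semantics described above (lower bound inclusive, upper bound exclusive, out-of-range values going to bucket 0 or buckets + 1), not PostgreSQL's implementation. The function name and the use of exact integer arithmetic are choices made for this illustration.

```python
from collections import Counter

# The 20 values from the question.
VALUES = [7, 7, 15, 17, 18, 22, 23, 25, 29, 42,
          52, 52, 55, 60, 74, 85, 90, 90, 92, 95]


def width_bucket(value, low, high, buckets):
    """Model of width_bucket for integers: [low, high) split into equal bins."""
    if value < low:
        return 0                  # underflow bucket
    if value >= high:
        return buckets + 1        # overflow bucket; high itself lands here
    return (value - low) * buckets // (high - low) + 1


def bucket_counts(values, low, high, buckets):
    """Count how many values fall into each bucket number."""
    return dict(sorted(Counter(
        width_bucket(v, low, high, buckets) for v in values
    ).items()))


if __name__ == "__main__":
    lo, hi = min(VALUES), max(VALUES)
    print(bucket_counts(VALUES, lo, hi, 5))      # max used as the upper bound
    print(bucket_counts(VALUES, lo, hi + 1, 5))  # max + 1 as the upper bound
```

With max as the upper bound, the value 95 falls into overflow bucket 6, matching the output in the question. With max + 1, all 20 values land in buckets 1 through 5, and the slightly wider buckets also move 60 from bucket 4 into bucket 3.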

How do I compare rows of a table against all other rows of the table?

I would like to create a script that takes the rows of a table which have a specific mathematical difference in their ASCII sum and to add the rows to a separate table, or even to flag a different field when they have that difference.
For instance, I am looking to find when the ASCII sum of word A and the ASCII sum of word B, both stored in rows of a table, have a difference of 63 or 31.
I could probably use a loop to select these rows, but SQL is not my greatest virtue.
ItemID | asciiSum |ProperDiff
-------|----------|----------
1 | 100 |
2 | 37 |
3 | 69 |
4 | 23 |
5 | 6 |
6 | 38 |
After running the code, the field ProperDiff will be updated to contain 'yes' for ItemID 1,2,3,5,6, since the AsciiSum for 1 and 2 (100-37) = 63 etc.
This will not be fast, but I think it does what you want:
update t
set ProperDiff = 'yes'
where exists (select 1
from t t2
where abs(t2.AsciiSum - t.AsciiSum) in (63, 31)
);
It should work okay on small tables.

Select every ten steps SQL

I have the following table:
----------------------------------------------
oNumber oValue1
----------------------------------------------
1 54
2 44
3 89
4 65
ff.
10 33
11 22
ff.
20 43
21 76
ff.
100 45
I want to select every 10th value in oNumber. So the result should be:
----------------------------------------------
oNumber oValue1
----------------------------------------------
10 33
20 43
ff.
100 45
Also, oNumber is not a sequence number. It's just a value. Even though it isn't a sequence number, 10, 20, 30 and so on will always appear in the oNumber field.
Does anyone know the T-SQL for this case?
Thank you.
select * from table where oNumber % 10 = 0
https://msdn.microsoft.com/en-us/library/ms190279.aspx
Use the "Modulo" operator - %. So in this case, the answer would be something like:
SELECT * FROM table WHERE oNumber % 10 = 0
This returns a row only if oNumber is divisible by ten (and therefore leaves a remainder of zero).
In the case you simply want multiples of 10, then just use the modulo operator as stated by Daniel and Ian.
select *
from table
where oNumber % 10 = 0;
However, I felt that you could be alluding to wanting every 10th item in your list. If that's the case, which it may not be, you would number the rows in order of oNumber and apply the modulo operator to that row number.
select *
from (
select *,
RowNum = row_number() over (order by oNumber)
from table) a
where RowNum % 10 = 0;