Get certain percentile values over a SQL table

Let's say I have a table storing users, the number of red balls they have, the total number of balls (blue, yellow, other colors etc.), and the ratio of red to total balls.
Schema looks like this:
**user_id** | **ratio** | **red_balls** | **total_balls**
1 | 0.2 | 2 | 10
2 | 0.3 | 6 | 20
I want to select the 0th, 25th, 50th, 75th, and 100th percentile values based on ordering the red_balls column. This doesn't mean I want the 0, 0.25, etc. values of the ratio column; I want the 25th percentile of the red_balls column. Any suggestions?

One thing you can do is filter for rows whose ratio is exactly one of those fractions:
select *
from your_table
where ratio in (0, 0.25, 0.5, 0.75, 1)
order by red_balls
This finds all rows whose ratio is exactly one of 0, 0.25, 0.5, 0.75, or 1 and sorts them in ascending order by red_balls. Note, though, that it only matches rows where those exact ratios happen to occur; it does not compute percentiles of the red_balls column.
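To get what the question actually asks for, the 0th/25th/50th/75th/100th percentile values of red_balls, an ordered-set aggregate is the usual tool. A minimal sketch, assuming PostgreSQL and a table named your_table:
select
  percentile_disc(0)    within group (order by red_balls) as p0,
  percentile_disc(0.25) within group (order by red_balls) as p25,
  percentile_disc(0.5)  within group (order by red_balls) as p50,
  percentile_disc(0.75) within group (order by red_balls) as p75,
  percentile_disc(1)    within group (order by red_balls) as p100
from your_table;
percentile_disc returns actual values from the column (p0 is the minimum, p100 the maximum); use percentile_cont instead if you want interpolation between neighboring values.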


I want to count the values in one column based on a condition of another

**OP_CARRIER** | **WHY_DELAY**
WN | WEATHER
DL | 0
AA | CARRIER
Each row is a flight into Miami International Airport. WHY_DELAY is a column that states why the flight was delayed; if the value is 0, the flight was on time. I am trying to count how many flights were delayed, broken down by airline. This is my attempt:
df.loc[df['WHY_DELAY'] != 0].groupby('OP_CARRIER').value_counts()
If all you want is the number of rows that satisfy a criterion:
df[df['WHY_DELAY'] == 0].shape[0]
This produces a count of the rows matching the specified criterion, in this case 1.
df[df.WHY_DELAY != '0'].groupby('OP_CARRIER').size()
Out[19]:
OP_CARRIER
AA 1
WN 1
If you want the result with column names, then:
df[df.WHY_DELAY != '0'].groupby('OP_CARRIER').count().reset_index()
Out[21]:
OP_CARRIER WHY_DELAY
0 AA 1
1 WN 1
Changing value_counts() to count() will give you the count of non-zero values grouped by OP_CARRIER.
import pandas as pd

df = pd.DataFrame({'OP_CARRIER': ['WN', 'DL', 'AA', 'DL', 'AA'],
                   'WHY_DELAY': ['WEATHER', 0, 'CARRIER', 'ALIENS', 'ASTROID']})
df.loc[df['WHY_DELAY'] != 0].groupby('OP_CARRIER').count()
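For reference, with that five-row frame the filter drops only the DL / 0 row, so count() should produce something like:
            WHY_DELAY
OP_CARRIER
AA                  2
DL                  1
WN                  1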

Count rows inside each element of an array of buckets

I've made a query to calculate some ranges (or buckets) and now I want to count how many elements fall inside each one of them.
For example, this could be a set of rows:
id | tx_value | date
1 | 30 | 2022-03-04
2 | 0.30 | 2022-03-04
1 | 300 | 2022-03-03
4 | 3000 | 2022-03-05
5 | 30 | 2022-03-04
I've calculated the range with the following clause:
ARRAY(SELECT tx_value_range * avg_tx_value
FROM UNNEST([0.001, 0.01, 0.1, 1, 10, 100]) AS tx_value_range) AS tx_size_buckets
This is a possible range:
[0.003, 0.03, 0.30, 3.0, 30.0, 300.0]
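(That array appears to correspond to avg_tx_value = 3: each element of [0.001, 0.01, 0.1, 1, 10, 100] multiplied by 3.)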
Then what I'm struggling with is counting the number of rows placed into each bucket, something like:
0.003, 3
0.03, 1
0.30, 4
I simply can't come up with even a test query to calculate this. I think I'll need to iterate over the bucket array for each transaction row to determine where to place the row, but I can't seem to articulate that as a query.
Consider the approach below:
select any_value(ranges[offset(range_pos)]) `range` , count(*) rows_count
from your_table,
unnest([struct([0.003, 0.03, 0.30, 3.0, 30.0, 300.0] as ranges)]),
unnest([struct(range_bucket(tx_value, ranges) - 1 as range_pos)])
group by range_pos
If applied to the sample data in your question, the output shows each bucket's lower bound alongside the count of rows that fall into it.
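The key function is RANGE_BUCKET, which returns how many elements of the sorted boundaries array are less than or equal to the point; subtracting 1 turns that into a zero-based offset back into the same array. A worked sketch of the lookup, assuming the six-element array above:
-- five boundaries (0.003, 0.03, 0.30, 3.0, 30.0) are <= 30,
-- so range_bucket returns 5, range_pos becomes 4,
-- and ranges[offset(4)] = 30.0 is the bucket's lower bound
select range_bucket(30, [0.003, 0.03, 0.30, 3.0, 30.0, 300.0]) - 1 as range_pos
Values below the first boundary would produce range_pos = -1, so with real data you may want to filter those rows out or add a lower boundary.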

Redshift - Breaking number into 10 parts and finding which part does a number fall into

I am trying to break a given number into 10 equal parts and then, for each row, see which of the 10 parts another number falls into.
ref_number, number_to_check
70, 34
70, 44
70, 14
70, 24
In the above data set, I would like to break 70 into 10 equal parts (in this case 7, 14, 21, and so on up to 70). Next I would like to see which part the value in column number_to_check falls into.
Output expected:
ref_number, number_to_check, part
70, 34, 5
70, 44, 7
70, 14, 2
70, 24, 4
You want arithmetic. If I understand correctly:
select ceiling(number_to_check * 10.0 / ref_number)
Here is a db<>fiddle (the fiddle happens to use Postgres).
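A minimal sketch of the full query, assuming the data sits in a table named your_table:
select ref_number,
       number_to_check,
       ceiling(number_to_check * 10.0 / ref_number) as part
from your_table;
-- 34 -> ceiling(4.857...) = 5, 44 -> ceiling(6.285...) = 7,
-- 14 -> ceiling(2.0)      = 2, 24 -> ceiling(3.428...) = 4
The 10.0 keeps the division in decimal arithmetic; with plain integer division the ratio would truncate to 0 whenever number_to_check is smaller than ref_number.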

Postgis: How do I select every second point from LINESTRING?

In DBeaver I have a table containing some GPS coordinates stored in PostGIS LINESTRING format.
My question is: if I have, say, this info:
LINESTRING(20 20, 30 30, 40 40, 50 50, 60 60, 70 70)
which built-in ST function can I use to get every N-th element in that LINESTRING? For example, if I choose 2, I would get:
LINESTRING(20 20, 40 40, 60 60)
and if 3:
LINESTRING(20 20, 50 50)
and so on.
I've tried ST_Simplify and ST_PointN, but that's not exactly what I need, because I still want it to stay a LINESTRING, just with fewer points (lower resolution).
Any ideas?
Thanks :-)
Welcome to SO. Have you tried using ST_DumpPoints and applying a modulo % over the vertices' path? E.g., taking every second vertex:
WITH j AS (
SELECT
ST_DumpPoints('LINESTRING(20 20, 30 30, 40 40, 50 50, 60 60, 70 70)') AS point
)
SELECT ST_AsText(ST_MakeLine((point).geom)) FROM j
WHERE (point).path[1] % 2 = 0;
st_astext
-------------------------------
LINESTRING(30 30,50 50,70 70)
(1 row)
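Note that path[1] % 2 = 0 keeps the even-numbered vertices, so the result starts at the second point. To reproduce the output from the question, which always keeps the first point, the filter can be shifted by one; a sketch generalized to every N-th vertex (N = 3 here, purely for illustration):
WITH j AS (
  SELECT
    ST_DumpPoints('LINESTRING(20 20, 30 30, 40 40, 50 50, 60 60, 70 70)') AS point
)
SELECT ST_AsText(ST_MakeLine((point).geom)) FROM j
WHERE ((point).path[1] - 1) % 3 = 0;
-- st_astext: LINESTRING(20 20,50 50)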
Further reading:
ST_MakeLine
CTE
ST_Simplify should return a linestring unless the simplification results in an invalid geometry for a linestring, i.e., fewer than 2 vertices. If you always want a linestring back, consider ST_SimplifyPreserveTopology. It ensures that at least two vertices are returned in a linestring.
https://postgis.net/docs/ST_SimplifyPreserveTopology.html

How to floor a number in sql based on a range

I would like to know if there is a function or some other way to round a number down to the lowest whole value of a specified increment, like 10. Something like floor, but snapping to that increment. So for example:
0.766, 5.0883, 9, 9.9999 would all be floored to 0
11.84848, 15.84763, 19.999 would all be floored to 10
etc...
I'm basically looking to fit numbers into the ranges 0, 10, 20, 30, etc.
Can I also do it with different ranges? For example 0, 100, 200, 300, etc
Thank you.
You can do this with arithmetic and floor():
select 10*floor(val / 10)
You can replace the 10s with whatever value you want.
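A minimal sketch, assuming a numeric column named val in a table named your_table:
select val,
       10  * floor(val / 10)  as floored_to_10,
       100 * floor(val / 100) as floored_to_100
from your_table;
-- val = 15.84763 -> floored_to_10 = 10, floored_to_100 = 0
-- val = 9.9999   -> floored_to_10 = 0,  floored_to_100 = 0
The divisor and the multiplier must be the same increment; writing them as 10.0 and 100.0 avoids integer-division surprises on databases where both val and the divisor are integers.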