Count rows inside each element of an array of buckets - sql

I've made a query to calculate some ranges (or buckets), and now I want to count how many elements fall inside each one of them.
For example, this could be a set of rows:
id  tx_value  date
1   30        2022-03-04
2   0.30      2022-03-04
1   300       2022-03-03
4   3000      2022-03-05
5   30        2022-03-04
I've calculated the range with the following clause:
ARRAY(SELECT tx_value_range * avg_tx_value
FROM UNNEST([0.001, 0.01, 0.1, 1, 10, 100]) AS tx_value_range) AS tx_size_buckets
This is a possible range:
[0.003, 0.03, 0.30, 3.0, 30.0, 300.0]
Then what I'm struggling with is counting the number of rows placed into each bucket, something like:
0.003, 3
0.03, 1
0.30, 4
I simply can't come up with even a test query to calculate this. I think I'll need to iterate over the bucket array for each transaction row in order to determine where to place it, but I can't seem to articulate that as a query.

Consider the approach below:
select any_value(ranges[offset(range_pos)]) `range`, count(*) rows_count
from your_table,
unnest([struct([0.003, 0.03, 0.30, 3.0, 30.0, 300.0] as ranges)]),
unnest([struct(range_bucket(tx_value, ranges) - 1 as range_pos)])
group by range_pos
If applied to the sample data in your question, this groups the sample rows by their bucket's lower bound and returns a count for each bucket.
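For intuition, here is a minimal NumPy/pandas sketch of the same bucketing logic. It is illustrative only: the DataFrame stands in for your table, and np.digitize plays the role of RANGE_BUCKET (both count how many boundaries are less than or equal to the value).

import numpy as np
import pandas as pd

# Sample tx_value column from the question
df = pd.DataFrame({"tx_value": [30, 0.30, 300, 3000, 30]})

# Bucket lower bounds from the question
buckets = np.array([0.003, 0.03, 0.30, 3.0, 30.0, 300.0])

# np.digitize returns, for each value, how many boundaries are <= it
# (mirroring range_bucket); subtracting 1 indexes the bucket's lower bound.
# Values below the first boundary would need separate handling.
pos = np.digitize(df["tx_value"].to_numpy(), buckets) - 1

counts = df.groupby(buckets[pos]).size()
print(counts)  # for the sample rows: 0.3 -> 1, 30.0 -> 2, 300.0 -> 2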

Get certain percentile values over SQL table

Let's say I have a table storing users, the number of red balls they have, the total number of balls (blue, yellow, other colors etc.), and the ratio of red to total balls.
Schema looks like this:
user_id  ratio  red_balls  total_balls
1        .2     2          10
2        .3     6          20
I want to select the 0, 25, 50, 75, and 100 percentile values based on ordering the red_balls column, so this doesn't mean I want the 0, 0.25, etc. values for the ratio column. I want the 25th percentile of the red_balls column. Any suggestions?
I think this can do what you want:
select *
from your_table
where ratio in (0, 0.25, 0.5, 0.75, 1)
order by red_balls
The query finds all rows whose ratio is exactly one of 0, 0.25, 0.5, 0.75, or 1 and sorts them in ascending order by red_balls.
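As an aside, what the question literally asks for (percentiles of the red_balls column itself, not of the precomputed ratio) can be sketched in pandas with Series.quantile. The frame below is just the two sample rows from the question:

import pandas as pd

# The two sample rows from the question
df = pd.DataFrame({
    "user_id": [1, 2],
    "ratio": [0.2, 0.3],
    "red_balls": [2, 6],
    "total_balls": [10, 20],
})

# 0th, 25th, 50th, 75th and 100th percentiles of red_balls
# (linearly interpolated by default)
print(df["red_balls"].quantile([0, 0.25, 0.5, 0.75, 1]))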

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter? The intention is to form a groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1
Interpreting this sentence one way gives my solution below: "The use-case is to find the average price of every 75 sold amounts of the thing." If you are trying to do this calculation the "hard way" instead of with pd.cut, then here is a solution that works well, but its speed and memory use depend on the cumsum() of the amount column, which you can check with df['amount'].cumsum(). The output takes about 1 second per 10 million of that cumsum, as that is how many rows np.repeat creates. Again, this solution is not bad if your cumsum is less than ~10 million (about 1 second) or even 100 million (about 10 seconds):
i = 75
# Repeat each price once per unit sold, so every row represents one sold unit
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
# Assign every block of 75 unit-rows to a group and average its prices
g = df.index // i
df = df.groupby(g)['price'].mean()
# Label each group with the unit range it covers, e.g. '0-75'
df.index = (df.index * i).astype(str) + '-' + (df.index * i + 75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong, but keeping it just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new column with cumsum, then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then group by those bins and take the mean. Finally, use pd.IntervalIndex to clean up the format and convert it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

Mean of consecutive days without selling

I am trying to calculate the mean interval without selling for a product.
I thought that a good way to get this is:
Count (Days without selling) / Count (Intervals of consecutive days without selling)
Units Sold
0 1
1 4
2 0
3 0
4 0
5 7
6 0
7 0
8 0
9 0
10 1
11 0
In this example I had:
8 days without selling
3 Intervals of consecutive days without selling
So, 8/3 = 2.7 should be my result.
To count days with no units sold I am using this:
(x['Units Sold'] == 0).sum()
However, I haven't figured out a good approach to calculate the 'Intervals of consecutive days without selling' in an efficient way (considering I will run this on multiple products).
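For reference, the example column can be rebuilt as a pandas Series so the snippets in the answers below are runnable. Counting the zero days and the runs of consecutive zeros (the same trick the answers use) reproduces the 8 and 3 from the question; this is just a sketch of the worked example, not a new method:

import pandas as pd

# The 12-day example from the question
units = pd.Series([1, 4, 0, 0, 0, 7, 0, 0, 0, 0, 1, 0], name="Units Sold")

zero = units.eq(0)
days = zero.sum()                                    # 8 days without selling
runs = (zero & ~zero.shift(fill_value=False)).sum()  # 3 runs of consecutive zero days

print(days / runs)  # 2.666...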
Another approach using nunique
s = df["Units Sold"].eq(0)
d = s.sum()
i = s[s].index.to_series().diff().ne(1).cumsum().nunique()
final = d/i # 2.6666666666666665
Using eq, cumsum and diff
First we use eq(0) and sum to count the number of days where nothing was sold.
Then we take the cumsum of these zero-days and check whether or not there's a difference between the rows. If this difference is 0, that means there was an interval.
days = x['Units Sold'].eq(0).sum()
# True on selling days (the running count of zero-days did not increase)
intervals = x['Units Sold'].eq(0).cumsum().diff().eq(0)
# True where the next day is a zero-day
mask = x['Units Sold'].shift(-1).eq(0)
# a selling day followed by a zero-day marks the start of an interval
days / (intervals & mask).sum()
Output
2.6666666666666665
You already knew how to get the count of 0s, so try this to find the number of consecutive groups of 0s:
s = df['Units Sold'].eq(0)
# a zero-day not preceded by another zero-day starts a new run
(s & ~s.shift(fill_value=False)).sum()
Out[567]: 3
You can use:
df.eq(0).sum()/((df.eq(0)&df.shift().ne(0)).sum())
Output:
Units Sold 2.666667
dtype: float64

Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
etc.
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?
I'll be as clear as possible without having a full example, as piRSquared suggested you provide.
The output of value_counts is a Series, so the values from your original Series end up in the index of the value_counts result. Displaying the count for a single value is then just a matter of slicing that Series:
my_value_count = weather_data["snowfall"].value_counts()
my_value_count.loc['0.0']
output:
0.0 12683
If you want to display only for a list of variables:
my_value_count.loc[my_value_count.index.isin(['0.0','0.2','0.1'])]
output:
0.0 12683
0.2 32
0.1 31
As you have M and T in your values, I suspect the other values are stored as strings rather than floats. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
output:
0.0 12683
0.2 32
0.1 31
0.3 18
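If the values really are strings because of the M and T flags, one workaround (an assumption about the data, not part of the original answer) is to coerce the value_counts index to numbers with pd.to_numeric; M and T become NaN and fail the comparison, so they simply drop out:

import pandas as pd

# weather_data is the DataFrame from the question
my_value_count = weather_data["snowfall"].value_counts()

# 'M' and 'T' become NaN and are excluded by the comparison below
numeric_index = pd.to_numeric(my_value_count.index, errors="coerce")
print(my_value_count[numeric_index < 0.4])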
Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want. Then you can use groupby and sum.
For example, suppose you want to count the frequency only when a column has a non-null value; in my case, only when ACTUAL_COMPLETION_DATE is non-null:
# Flag rows that have a non-null ACTUAL_COMPLETION_DATE
dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
# Per HAZARD_ID, count how many rows carry the flag
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')

min over one dimension followed by max over another dimension

I have an SQL table that looks like this:
i j x
0 0 0.5
0 1 1.0
0 2 1.5
1 0 1.4
1 1 1.3
1 2 1.2
and so on. I would like to take the average over the j dimension followed by the minimum over the i dimension. In this case, taking the average over the j dimension produces the following:
i x
0 1.0
1 1.3
Taking the minimum over the i dimension then produces the value 1.0, which is the final result. Is there an efficient way to perform a query like the one in this example, i.e., a query in which a sequence of dimension reduction operations is performed in a specified order?
Note that if we reverse the order of operations, the intermediate result is
j x
0 0.5
1 1.0
2 1.2
Taking the average over the j dimension produces a final result of 0.9. Thus, the order of operations is important.
Phillip
http://phillipmfeldman.org
You can do it with a subquery, of course:
SELECT MIN(avg_over_j) FROM (
  SELECT i, AVG(x) AS avg_over_j
  FROM TheTable
  GROUP BY i
) AS t
But this isn't APL or the J language; there are no "dimension reduction operations".
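For comparison, the same two-step reduction is easy to write down outside SQL. A small pandas sketch (built from the sample table in the question) makes the order of operations explicit:

import pandas as pd

# Sample table from the question
df = pd.DataFrame({
    "i": [0, 0, 0, 1, 1, 1],
    "j": [0, 1, 2, 0, 1, 2],
    "x": [0.5, 1.0, 1.5, 1.4, 1.3, 1.2],
})

# Average over j (group by i), then minimum over i
print(df.groupby("i")["x"].mean().min())  # 1.0

# Reversed order: minimum over i (group by j), then average over j
print(df.groupby("j")["x"].min().mean())  # 0.9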