Partition rows based on percent value of the data range - sql

What would the SELECT query be if I want to rank and partition rows based on a percent range of the partitioning column?
For example, let's say I have the table below (the 'Percent_Rank' column is the one that needs to be populated).
I want to rank the rows by score, but only against the rows whose amount lies within +/- 10% of the current row's amount. For the first row, the amount is 2.3, and +/- 10% of 2.3 gives the range 2.07 - 2.53. So when ranking the first row, I should rank by score considering only the rows whose amount falls in 2.07 - 2.53 (here, IDs 1, 5, and 11). Based on this logic the percentile rank is populated in the last column, and the rank for the first row is 0.5. The same steps are then applied to every row.
The question is: how can I do this with PERCENT_RANK(), RANK(), or NTILE() with a partition clause, as part of a SELECT query? The original table does not have the last column; that is the column that needs to be populated. I need the percentile ranking of each row's score within the 10% amount range.
PRODUCT_ID  Amount  Score  Percent_Rank
1            2.3    45     0.5
2            2.7    30     0
3            2.0    40     0.5
4            2.6    50     1
5            2.2    35     0
6            5.1    25     0
7            4.8    40     1
8            6.1    60     0
9           22.1    70     0.33
10           8.2    20     0
11           2.1    50     1
12          22.2    60     0
13          22.3    80     1
14          22.4    75     0.66
I tried using PERCENT_RANK() OVER (PARTITION BY ...), but it does not take the range into account. I cannot use RANGE UNBOUNDED PRECEDING AND FOLLOWING in the window clause, because I need the range to be within 10% of the current row's amount.

You may try PERCENT_RANK() with a self join, like the following:
SELECT PRODUCT_ID, Amount, Score, Percent_Rank
FROM
(
  SELECT A.PRODUCT_ID, A.Amount, A.Score, B.Amount AS B_Amount,
         PERCENT_RANK() OVER (PARTITION BY A.PRODUCT_ID ORDER BY B.Score) AS Percent_Rank
  FROM table_name A JOIN table_name B
    ON B.Amount BETWEEN A.Amount - A.Amount * 0.1 AND A.Amount + A.Amount * 0.1
) T
WHERE Amount = B_Amount
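As a sanity check against the sample data: for PRODUCT_ID 1 the join picks up rows 1, 5, and 11 (amounts 2.3, 2.2, 2.1, all within 2.07 - 2.53); ordered by score (35, 45, 50), the surviving row (score 45) gets PERCENT_RANK = 0.5, so the first row of the result should be:
PRODUCT_ID  Amount  Score  Percent_Rank
1           2.3     45     0.5
Note that the outer filter WHERE Amount = B_Amount assumes Amount values are unique; with duplicate amounts you would instead keep the row where B.PRODUCT_ID matches A.PRODUCT_ID (carrying B.PRODUCT_ID out of the subquery).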

I think you can just nest your percent_rank() in a subquery once you have calculated a bucket number based on equally spaced scores.
The trickiest part of this example is actually getting the fixed-width buckets. It would be simpler if we could use width_bucket(), but some databases don't support it, so I compute the buckets manually (in the 'bucketed' inline table).
Here is the example. I used Postgres to create the mock test table because it has a very nice generate_series(), but the actual example SQL should run on any database.
create table product_scores as (
  select
    product_id,
    score
  from
    generate_series(1,2) product_id,
    generate_series(1,50) score);
This created a table with two product ids and 50 scores for each one.
with ranges as (
  select
    product_id,
    (max(score) - min(score)) * (1 + 1e-10) as range,
    min(score) as minscore
  from product_scores
  group by product_id),
bucketed as (
  select
    ranges.product_id,
    score,
    floor((score - minscore) * 10.0 / range) as bucket
  from ranges
  inner join product_scores
    on ranges.product_id = product_scores.product_id)
select
  product_id,
  score,
  bucket,
  percent_rank() over (partition by product_id, bucket order by score)
from bucketed;
No, the 1e-10 is not a joke. Unfortunately, round-off error would otherwise assign the highest value to a bucket all by itself, so we expand the range by a tiny amount. Once we have a workable range, calculating each row's bucket is simple arithmetic.
Then, having the bucket number as a partition key, you can do the percent_rank() as usual, as shown.
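For what it's worth, on a database that does have width_bucket() (PostgreSQL, for example), the manual bucket arithmetic collapses; a sketch of that variant, under the same ten-bucket assumption:
-- width_bucket(v, lo, hi, 10) returns 11 when v equals hi, so least(..., 10)
-- folds the maximum score back into the top bucket.
with ranges as (
  select product_id, min(score) as minscore, max(score) as maxscore
  from product_scores
  group by product_id)
select
  p.product_id,
  p.score,
  least(width_bucket(p.score, r.minscore, r.maxscore, 10), 10) as bucket,
  percent_rank() over (
    partition by p.product_id,
                 least(width_bucket(p.score, r.minscore, r.maxscore, 10), 10)
    order by p.score)
from product_scores p
join ranges r on r.product_id = p.product_id;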

Related

How to implement RESET WHEN (Teradata) using ANSI SQL only?

I need to write a query that counts the number of times a customer's transactions exceed 250 pounds: add cumulatively until the sum exceeds 250, then reset and start again from the following row until it exceeds 250, and so on. This functionality can be expressed with the Teradata keyword 'RESET WHEN', yet I am supposed to write a query composed only of ANSI SQL syntax.
Can anyone help with that?
SUM(sales) OVER (
  PARTITION BY region
  ORDER BY day_of_calendar
  RESET WHEN sales < /* preceding row */ SUM(sales) OVER (
    PARTITION BY region
    ORDER BY day_of_calendar
    ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
  ROWS UNBOUNDED PRECEDING
)
(Sample customer input: https://i.stack.imgur.com/lu4Jp.png; a second screenshot shows the expected output.)
Every time the customer's running total exceeds 250, I should start summing from 0 again and record the day on which the customer crossed 250.
Without your table definitions, and with just a screenshot of a very limited dataset, it is hard to test my answer on your data. So I'm showing it first on the dataset supplied in the match_recognize tutorial on Live SQL, and then with your columns:
SELECT *
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES nvl(SUM(up.price), 0) AS tot
  ALL ROWS PER MATCH
  PATTERN ( up* )
  DEFINE up AS SUM(up.price) - up.price <= 100
);
So on your table this would be something like:
SELECT *
FROM your_table MATCH_RECOGNIZE (
  PARTITION BY region
  ORDER BY day_of_calendar
  MEASURES nvl(SUM(up.sales), 0) AS tot
  ALL ROWS PER MATCH
  PATTERN ( up* )
  DEFINE up AS SUM(up.sales) - up.sales <= 250
);
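To see why the DEFINE condition implements the reset, note that SUM(up.sales) - up.sales is the sum of the rows already in the current match, excluding the current row. A hypothetical trace with daily sales of 100, 100, 100, 200 against the 250 threshold:
-- day 1: prior match sum   0 <= 250 -> row is 'up', running total 100
-- day 2: prior match sum 100 <= 250 -> row is 'up', running total 200
-- day 3: prior match sum 200 <= 250 -> row is 'up', running total 300 (crosses 250)
-- day 4: prior match sum 300  > 250 -> match ends; a new match starts at this row
So the row that tips the running total over 250 still belongs to the match, and the number of matches is the number of times the threshold was exceeded.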

Count rows with equal values in a window function

I have a time series in a SQLite Database and want to analyze it.
The important part of the time series consists of a column with different but not unique string values.
I want to do something like this:
Value  concat  countValue
A      A       1
A      A,A     2
B      A,A,B   1
B      A,B,B   2
B      B,B,B   3
C      B,B,C   1
B      B,C,B   2
I don't know how to get the countValue column. It should count all values in the window that are equal to the current row's Value.
I tried the following, but it just counts all values in the window rather than the ones equal to the current row's Value (the predicate Value LIKE Value compares each value to itself, so it is always true):
SELECT
  Value,
  group_concat(Value) OVER wind AS concat,
  SUM(CASE WHEN Value LIKE Value THEN 1 ELSE 0 END) OVER wind AS countValue
FROM TimeSeries
WINDOW wind AS (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY date;
The query is also constrained by these factors:
The query should work with any number of unique Values.
The query should work with any frame size (ROWS BETWEEN n PRECEDING AND CURRENT ROW).
Is this even possible using only SQL?
Here is an approach using string functions:
select
  value,
  group_concat(value) over wind as concat,
  (length(group_concat(value) over wind)
     - length(replace(group_concat(value) over wind, value, ''))
  ) / length(value) as cnt_value
from timeseries
window wind as (order by date rows between 2 preceding and current row)
order by date;
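One caveat with the string approach: it miscounts when one value is a substring of another (for example 'A' and 'AB'). A sketch of an alternative that sidesteps string matching by numbering the rows first (the CTE name numbered is mine):
with numbered as (
  select Value, date, row_number() over (order by date) as rn
  from TimeSeries)
select
  a.Value,
  (select count(*)
   from numbered b
   where b.rn between a.rn - 2 and a.rn  -- same frame: 2 preceding to current
     and b.Value = a.Value) as countValue
from numbered a
order by a.date;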

snowflake ntile of engagement rate over month

I'm trying to write a query that ranks a customer's engagement rate by decile within a month, against all other customers.
I tried:
ntile(10) over (partition by rec_month order by engagement_rate) as decile
but I don't think that is getting me what I need. It appears to just be splitting the vhosts up into 10 equally sized groups. I want percentiles.
I also tried:
ntile(10) over (partition by rec_month, vhost order by engagement_rate) as decile
But that is only calculating it within customer (vhost) within the month.
How do I calculate the engagement_rate decile against all other customers (Vhosts) within the month?
At first I suspected you wanted a rank function like DENSE_RANK or RANK (which is sparse, the one I'd use), turned into a percentage by dividing by the COUNT(*) and truncated into deciles via trunc(v/10)*10. But I suspect you might actually want the output of the PERCENT_RANK function; the Snowflake documentation example is not as clarifying as I would hope, so here it is on some sample values:
select column1,
       column2,
       round(percent_rank() over (order by column2), 3) as p_rank,
       trunc(p_rank * 10) * 10 as decile
from values ('a',1),('b',2),('c',3),('d',4),('e',5),('f',6),('g',7);
gives
COLUMN1  COLUMN2  P_RANK  DECILE
a        1        0       0
b        2        0.167   10
c        3        0.333   30
d        4        0.5     50
e        5        0.667   60
f        6        0.833   80
g        7        1       100
But maybe you want to use NTILE instead of truncating the percentage. The ROUND is just there to make the output above less verbose.
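Applied to the columns named in the question (the table name engagement is my assumption), that would look something like:
-- Hypothetical table name; columns taken from the question.
select vhost,
       rec_month,
       engagement_rate,
       trunc(percent_rank() over (partition by rec_month
                                  order by engagement_rate) * 10) * 10 as decile
from engagement;
This partitions only by rec_month, so each vhost's engagement_rate is ranked against all other vhosts in the same month.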

sql DB calculation moving summary

I would like to calculate a moving summary:
Total amount: 100
First receipt: 20
Second receipt: 10
The first row in the calculation column is the difference between the total amount and the first receipt: 100 - 20 = 80.
The second row in the calculation column is the difference between the first calculated row and the second receipt: 80 - 10 = 70.
The result is supposed to show receipt_amount and balance:
receipt_amount | balance
20             | 80
10             | 70
I'd be glad for your help.
Thanks :-)
You didn't really give us much information about your tables and how they are structured.
I'm assuming that there is an orders table that contains the total_amount and a receipt_table that contains each receipt (as a positive value):
As you also didn't specify your DBMS, this is ANSI SQL:
select sum(amount) over (order by receipt_nr) as running_sum
from (
  select 0 as receipt_nr, total_amount as amount  -- total sorts before all receipts
  from orders
  where order_no = 1
  union all
  select receipt_nr, -1 * receipt_amount
  from the_receipt_table
  where order_no = 1
) t
First of all, thanks for your response.
I work with a Caché DB, which accepts both SQL and Oracle syntax.
Basically, the data is located in two different tables, but I have them in one join query:
a couple of rows with different receipt amounts, where each row (receipt) carries the same total amount.
For example:
Receipt_no  Receipt_amount  Total_amount  Balance
1           20              100           80
1           10              100           70
1           30              100           40
2           20              50            30
2           10              50            20
So the calculation should work like this: for the first receipt the difference is taken from the total_amount, and every subsequent receipt (within the same receipt_no) is subtracted from the running balance.
Thanks!
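Given that layout, the running balance can be written directly as a window function. A sketch, assuming the joined result is exposed as receipts and that some column (receipt_date here, an assumption) fixes the order of receipts within a receipt_no:
-- receipts and receipt_date are assumed names; substitute your join query
-- and whatever column orders the receipts.
select
  receipt_no,
  receipt_amount,
  total_amount
    - sum(receipt_amount) over (partition by receipt_no
                                order by receipt_date
                                rows unbounded preceding) as balance
from receipts;
On the example above this yields balances 80, 70, 40 for receipt_no 1 and 30, 20 for receipt_no 2.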

Optimizing a Vertica SQL query to do running totals

I have a table S with time series data like this:
key day delta
For a given key, it's possible but unlikely that days will be missing.
I'd like to construct a cumulative column from the delta values (positive INTs), for the purposes of inserting this cumulative data into another table. This is what I've got so far:
SELECT key, day,
SUM(delta) OVER (PARTITION BY key ORDER BY day asc RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
delta
FROM S
In my SQL flavor, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, but I left it in there to be explicit.
This query is really slow, like an order of magnitude slower than the old broken query, which filled in 0s for the cumulative count. Any suggestions for other methods of generating the cumulative numbers?
I did look at the solutions here:
Running total by grouped records in table
The RDBMS I'm using is Vertica. Vertica SQL precludes the first subselect solution there, and its query planner predicts that the second LEFT OUTER JOIN solution is about 100 times more costly than the analytic form I show above.
I think you're essentially there. You may just need to update the syntax a bit:
SELECT s_qty,
       SUM(s_price) OVER (PARTITION BY NULL
                          ORDER BY s_qty ASC
                          ROWS UNBOUNDED PRECEDING) AS "Cumulative Sum"
FROM sample_sales;
Output:
S_QTY | Cumulative Sum
------+----------------
1 | 1000
100 | 11000
150 | 26000
200 | 28000
250 | 53000
300 | 83000
2000 | 103000
(7 rows)
reference link:
https://dwgeek.com/vertica-cumulative-sum-average-and-example.html/
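Adapted to the table in the question, that suggestion amounts to swapping the RANGE frame for a ROWS frame, which is typically cheaper for the engine to evaluate (note that ROWS and RANGE differ whenever a key has duplicate days):
SELECT key, day, delta,
       SUM(delta) OVER (PARTITION BY key
                        ORDER BY day ASC
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_delta
FROM S;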
Sometimes it's faster to just use a correlated subquery (the predicate has to stay within the same key and run up to the current day, inclusive, to reproduce the window frame above):
SELECT
    t1.key
  , t1.day
  , t1.delta
  , (SELECT SUM(s2.delta)
     FROM S s2
     WHERE s2.key = t1.key    -- same partition
       AND s2.day <= t1.day   -- running total up to and including this day
    ) AS DeltaSum
FROM S t1