Calculate a specific moving average using sql query - sql

Consider that I have a table with one column "A" and I would like to create another column called "B" such that
B[i] = 0.2*A[i] + 0.8*B[i-1]
where B[0]=0.
My problem is that I cannot use the OVER() function because I want to use the values in B while I am trying to construct B. Any idea would be appreciated. Thanks

This is a rather complex mathematical exercise. You want to accumulate exponentially decreasing amounts from previous rows.
It is a little confusing because the amount going in on each row is 20%, but that is just a factor in the formula.
In any case, this seems to do what you want:
select x.*,
       sum(power(0.8, -n) * a * 0.2) over (order by id) / power(0.8, -n) as b
from (select t.*,
             row_number() over (order by id) - 1 as n
      from t
     ) x;
Here is a db<>fiddle using Postgres.
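The reason the query works: unrolling the recurrence gives B[n] = sum over k <= n of 0.2 * 0.8^(n-k) * A[k], which is a running sum of 0.2 * A[k] * 0.8^(-k), divided by 0.8^(-n). Below is a minimal Python sketch that checks the two forms agree. The function names and sample values are made up for illustration, and it assumes B is 0 before the first row.

```python
def ema_recurrence(a, alpha=0.2):
    """B[i] = alpha*A[i] + (1-alpha)*B[i-1], with B = 0 before the first row."""
    out, prev = [], 0.0
    for x in a:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


def ema_running_sum(a, alpha=0.2):
    """Same values via a running sum, mirroring the window-function query."""
    decay = 1 - alpha
    out, running = [], 0.0
    for n, x in enumerate(a):
        running += decay ** (-n) * x * alpha   # sum(power(0.8, -n) * a * 0.2) over (...)
        out.append(running / decay ** (-n))    # ... / power(0.8, -n)
    return out


sample = [10.0, 20.0, 5.0, 0.0, 40.0]   # hypothetical values for column A
rec = ema_recurrence(sample)
run = ema_running_sum(sample)
```

One caveat with this form: power(0.8, -n) grows like 1.25^n, so in floating point it can lose precision or overflow on tables with thousands of rows.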

Related

Count half of rest of a partition by from position

I'm trying to achieve the following results:
now, the group comes from
SUM(CASE WHEN seqnum <= (0.5 * seqnum_rev) THEN i.[P&L] END) OVER(PARTITION BY i.bracket_label ORDER BY i.event_id) AS [P&L 50%],
In each iteration, I need it to count the rows from the current position to the end (seq_inv) and sum the P&L amounts for only half of them, starting from that position.
For example, when
seq = 2
seq_inv will be 13 and half of it is 6, so I need to sum the following 6 positions from seq = 2.
When seq = 4, there are 11 positions until the end (seq_inv = 11), so half is 5, and I want to sum 5 positions from seq = 4.
I hope this makes sense. I'm trying to come up with a rule that adapts to my case, since the PARTITION BY is what gives me the numbers that need to be summed.
I also wondered whether there is something like a PARTITION BY top 50%, but I guess that doesn't exist.
I have the advantage that I've helped him before and have a little extra context.
That context is that this is just the later stage of a very long chain of common table expressions. That means self-joins and/or correlated sub-queries are unfortunately expensive.
Preferably, this should be answerable using window functions, as the data set is already available in the appropriate ordering and partitioning.
My reading is this...
The SUM(5:9) (meaning the sum of rows 5 to row 9, inclusive) is equal to SUM(5:end) - SUM(10:end)
That leads me to this...
WITH
  cumulative AS
(
  SELECT
    *,
    SUM([P&L]) OVER (PARTITION BY bracket_label ORDER BY event_id DESC) AS cumulative_p_and_l
  FROM
    data
)
SELECT
  *,
  cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/2, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_50_perc,
  cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/4, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_25_perc
FROM
  cumulative
NOTE: Using spaces, &, or % in column names is horrendous, don't do it ;)
EDIT: Corrected the ORDER BY in the cumulative sum.
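To make the SUM(5:end) - SUM(10:end) idea concrete, here is a small Python sketch for a single partition. It assumes seq_inv counts the rows from the current one to the end, inclusive, and that "half" means integer division, as in the example from the question. The function name and the sample values are hypothetical.

```python
def half_window_sums(values, divisor=2):
    """For each row, sum the first seq_inv // divisor rows starting at that row."""
    n = len(values)
    # suffix[i] = sum(values[i:]); suffix[n] = 0 plays the role of LEAD's default 0
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + values[i]
    out = []
    for i in range(n):
        seq_inv = n - i              # rows from this one to the end, inclusive
        k = seq_inv // divisor       # how many rows to sum
        out.append(suffix[i] - suffix[i + k])   # SUM(i:end) - SUM(i+k:end)
    return out


p_and_l = list(range(1, 15))         # 14 made-up rows: 1, 2, ..., 14
sums_50 = half_window_sums(p_and_l, divisor=2)
```

With 14 rows, the row at seq = 2 has seq_inv = 13, so 6 rows are summed, matching the numbers in the question.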
I don't think that window functions can do what you want. You could use a correlated subquery instead, with the following logic:
select
t.*,
(
select sum(t1.[P&L])
from mytable t1
where t1.seq - t.seq between 0 and t.seq_inv/2
) [P&L 50%]
from mytable t

Splitting table PK values into roughly same-size ranges

I have a table in Postgres with about half a million rows and an integer primary key.
I'd like to split its entire PK space into N ranges of approximately same size for independent processing. How do I best do it?
I can apparently do it by fetching all PK values to a client and remembering every N-th value. This does a full scan and a fetch of all the values, while I want no more than N+1 of them.
I can select min and max values and cut the range, but if the PKs are not distributed quite evenly, it may give me some ranges of seriously different sizes.
I want ranges for index-based access later on, so modulo-based tricks do not apply.
Is there a nice SQL-based solution that does not involve fetching all the keys to a client? Writing an N-specific query, e.g. with N clauses, is fine.
An example:
IDs in a range, say, from 1234 to 567890, N = 4.
I'd like to get 3 boundary numbers, say 127123, 254789, 379860, so that there are approximately 125k records in each of the ranges of IDs [1234, 127123], [127123, 254789], [254789, 379860], [379860, 567890].
Update:
I've come up with a solution like this:
select
percentile_disc(0.25) within group (order by c.id) over() as pct_25
,percentile_disc(0.50) within group (order by c.id) over() as pct_50
,percentile_disc(0.75) within group (order by c.id) over() as pct_75
from customer c
limit 1
;
It does a decent job of giving me the exact range boundaries, and runs only a few seconds, which is fine for my purposes.
What bothers me is that I have to add the limit 1 clause to get just one row. Without it, I receive identical rows, one per record in the table. Is there a better way to get just one row of percentiles?
I think you can use row_number() for this purpose. Something like this:
select t.*,
floor((seqnum * N) / cnt) as range
from (select t.*,
row_number() over (order by pk) - 1 as seqnum,
count(*) over () as cnt
from t
) t;
This assumes by range that you mean ranges on pk values. You can also move the range expression to a where clause to just select one particular range.
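As a rough illustration of the floor((seqnum * N) / cnt) expression, the Python sketch below assigns each key to a range and then collects the lowest and highest key per range. The function names and sample keys are hypothetical. Collecting the bounds corresponds to grouping the query above by range and taking min(pk) and max(pk), which is my assumption about how the boundaries would be read off, not part of the answer.

```python
def assign_ranges(pks, n_ranges):
    """Pair each key with floor(seqnum * N / cnt), seqnum being its 0-based rank."""
    ordered = sorted(pks)
    cnt = len(ordered)
    return [(pk, (seqnum * n_ranges) // cnt) for seqnum, pk in enumerate(ordered)]


def range_bounds(pks, n_ranges):
    """Return (min_pk, max_pk) for every range, in range order."""
    bounds = {}
    for pk, bucket in assign_ranges(pks, n_ranges):
        lo, hi = bounds.get(bucket, (pk, pk))
        bounds[bucket] = (min(lo, pk), max(hi, pk))
    return [bounds[b] for b in sorted(bounds)]


keys = [1, 2, 3, 50, 51, 52, 900, 901]   # deliberately uneven key distribution
bounds = range_bounds(keys, 4)
```

Because the ranges are cut by row position rather than by key value, each one holds about the same number of rows even when the keys are clustered.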

SQL - Group Values by Percentile/Merge Rankings

I have multiple tables that contain the name of a company/attribute and a ranking.
I would like to write a piece of code that places a range of scores into specific groups based on each score's percentile relative to the table's score total. I provided a very simple use case to demonstrate what I am looking for, splitting a group of 10 companies into 5 groups, but I would like to scale this so that the 5 groups apply to data sets with many rows WITHOUT having to specify values in a CASE statement.
You can use NTILE to divide the data into 5 buckets based on score. However, if the data can't be divided into equal number of bins or if there are ties, one of the groups will have more members.
SELECT t.*, NTILE(5) OVER(ORDER BY score) as grp
FROM tablename t
Read more about NTILE here
NTILE(5) OVER(ORDER BY score) might actually put rows with the same value into different quantiles (This is probably not what you want, at least I never liked that).
It's quite similar to
5 * (row_number() over (order by score) - 1) / count(*) over ()
but if the number of rows can't be evenly divided, NTILE adds the remainder rows to the first quantiles, while the ROW_NUMBER expression spreads them unevenly across the quantiles.
To assign all the rows with the same value to the same quantile you need to do your own calculation:
5 * (rank() over (order by score) - 1) / count(*) over ()
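To see the difference on a small case, here is a Python sketch of the three calculations on seven made-up scores with a three-way tie. The function names are hypothetical, and the code assumes NTILE's usual behavior of giving the extra rows to the first buckets. NTILE numbers buckets from 1, while the two expressions start at 0.

```python
def ntile(n_rows, buckets):
    """NTILE: bucket sizes differ by at most one, larger buckets first (1-based)."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(buckets):
        size = base + (1 if b < extra else 0)
        out.extend([b + 1] * size)
    return out


def row_number_bucket(scores, buckets):
    """buckets * (row_number - 1) / count(*), with integer division (0-based)."""
    n = len(scores)
    return [buckets * i // n for i in range(n)]


def rank_bucket(scores, buckets):
    """buckets * (rank - 1) / count(*); tied scores share a rank (0-based)."""
    ordered = sorted(scores)
    n = len(ordered)
    # rank - 1 equals the index of the first occurrence in the sorted list
    return [buckets * ordered.index(s) // n for s in ordered]


scores = [10, 20, 20, 20, 30, 40, 50]
by_ntile = ntile(len(scores), 5)
by_row_number = row_number_bucket(scores, 5)
by_rank = rank_bucket(scores, 5)
```

Only the rank-based version keeps the three rows with score 20 in one bucket. The price is that some buckets can end up empty.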
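You can try using ROW_NUMBER() and CEILING():
SELECT t.name, t.score,
       CEILING(ROW_NUMBER() OVER (ORDER BY t.score) / 2.0) AS grp
FROM YourTable t
This puts each consecutive pair of rows into one group, based on the ROW_NUMBER() result. Dividing by 2.0 rather than 2 avoids integer division in databases where integer / integer truncates (SQL Server and Postgres, for example), and the alias is grp because GROUP is a reserved word.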

Computing a moving maximum in BigQuery

Given a BigQuery table with some ordering, and some numbers, I'd like to compute a "moving maximum" of the numbers -- similar to a moving average, but for a maximum instead. From Trying to calculate EMA (exponential moving average) using BigQuery it seems like the best way to do this is by using LEAD() and then doing the aggregation myself. (Bigquery moving average suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)
Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.
(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)
One approach is to do the following:
SELECT word,
(CASE
WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
ELSE word_count_2 END
) AS max_count
FROM (
SELECT word, word_count,
LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'
)
This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:
SELECT word,
IF(word_count >= word_count_1,
IF(word_count >= word_count_2, word_count, word_count_2),
IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...
This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:
SELECT word,
IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
SELECT word,
IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
FROM ...
)
It seems like doing 20 nested queries might not be a great idea performance-wise, though.
Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?
A trick I'm using for rolling windows: CROSS JOIN with a table of numbers. In this case, to have a moving window of 3 years, I cross join with the numbers 0,1,2. Then you can create an id for each group (ending_at_year==year-i) and group by that.
SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM
(
SELECT mean_temp, year-i ending_at_year, year
FROM [publicdata:samples.gsod] a
CROSS JOIN
(SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;
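In case the grouping is hard to picture, the Python sketch below imitates the cross join: each row is copied once per offset i and filed under the key year - i, so key k ends up holding years k, k+1 and k+2. The function name and the sample data are made up, and the completeness check stands in for HAVING c=3.

```python
from collections import defaultdict


def rolling_max_via_cross_join(rows, window=3):
    """rows: iterable of (year, value). Returns {key_year: max} for full windows."""
    groups = defaultdict(list)
    for year, value in rows:
        for i in range(window):              # CROSS JOIN with the numbers 0..window-1
            groups[year - i].append((year, value))
    result = {}
    for key in sorted(groups):
        members = groups[key]
        if len({y for y, _ in members}) == window:   # HAVING c = 3
            result[key] = max(v for _, v in members)
    return result


temps = [(2000, 5), (2001, 9), (2002, 4), (2003, 7), (2004, 3)]   # hypothetical data
maxima = rolling_max_via_cross_join(temps)
```

The intermediate data is window times the size of the input, which is worth keeping in mind for a window of around 20.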
I have another way to do the thing you are trying to achieve. See query below
SELECT word, max(words)
FROM
(SELECT word,
word_count AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'),
(SELECT word,
LEAD(word_count, 1) OVER (ORDER BY word) AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'),
(SELECT word,
LEAD(word_count, 2) OVER (ORDER BY word) AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth')
group by word order by word
You can try it and compare performance with your approach (I didn't try that)
There's an example of creating a moving average using a window function in the docs here.
Quoting:
The following example calculates a moving average of the values in the current row and the row preceding it. The window frame comprises two rows that move with the current row.
#legacySQL
SELECT
name,
value,
AVG(value)
OVER (ORDER BY value
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
AS MovingAverage
FROM
(SELECT "a" AS name, 0 AS value),
(SELECT "b" AS name, 1 AS value),
(SELECT "c" AS name, 2 AS value),
(SELECT "d" AS name, 3 AS value),
(SELECT "e" AS name, 4 AS value);
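A frame like the one quoted should also accept MAX in place of AVG, which would give the moving maximum directly, though that is worth checking against the docs for the dialect in use. As a language-neutral reference for what such a frame computes, here is a Python sketch of a moving maximum over the current row and the window - 1 rows before it. The function name and sample values are hypothetical, and the deque technique is my own choice, not something from the docs.

```python
from collections import deque


def moving_max(values, window):
    """Max over the current row and up to window - 1 preceding rows, in O(n)."""
    out = []
    candidates = deque()   # indices whose values are in strictly decreasing order
    for i, v in enumerate(values):
        # Smaller-or-equal older values can never be the maximum again.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it falls out of the frame.
        if candidates[0] <= i - window:
            candidates.popleft()
        out.append(values[candidates[0]])
    return out


counts = [3, 1, 4, 1, 5, 9, 2, 6]    # made-up word counts
result = moving_max(counts, 3)
```

A result like this is handy as a test oracle when comparing the generated CASE, IF, or nested-query variants against each other.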

how to select lines in Mysql while a condition lasts

I have something like this:
Name   Value
A      10
B      9
C      8
Meaning, the values are in descending order. I need to create a new table that will contain the values that make up 60% of the total values. So, this could be a pseudocode:
set Total = sum(value)
set counter = 0
foreach line from table OriginalTable do:
counter = counter + value
if counter > 0.6*Total then break
else insert line into FinalTable
end
As you can see, I'm parsing the sql lines here. I know this can be done using handlers, but I can't get it to work. So, any solution using handlers or something else creative will be great.
It should also have reasonable time complexity: the solution in "how to select values that sum up to 60% of the total" works, but it's slow as hell :(
Thanks!!!!
You'll likely need to use the lead() or lag() window function, possibly with a recursive query to merge the rows together. See this related question:
merge DATE-rows if episodes are in direct succession or overlapping
And in case you're using MySQL, you can work around the lack of window functions by using something like this:
Mysql query problem
I don't know which analytical functions SQL Server (which I assume you are using) supports; for Oracle, you could use something like:
select v.*,
cumulative/overall percent_current,
previous_cumulative/overall percent_previous from (
select
id,
name,
value,
cumulative,
lag(cumulative) over (order by id) as previous_cumulative,
overall
from (
select
id,
name,
value,
sum(value) over (order by id) as cumulative,
(select sum(value) from mytab) overall
from mytab
order by id)
) v
Explanation:
- sum(value) over ... computes a running total for the sum
- lag() gives you the value for the previous row
- you can then combine these to find the first row where percent_current > 0.6 and percent_previous < 0.6
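For reference, the same running-total logic in a short Python sketch. It follows the pseudocode from the question (stop at the first row that pushes the running total above 60%) and also returns that crossing row, which is the one the percent_current / percent_previous condition picks out. The function name is hypothetical, and the rows are assumed to arrive already sorted.

```python
def rows_within_share(rows, share=0.6):
    """rows: list of (name, value) in the desired order.

    Returns (kept, crossing): the rows whose running total stays at or below
    share * total, and the first row that pushes it above (or None).
    """
    total = sum(value for _, value in rows)
    kept, cumulative = [], 0
    for name, value in rows:
        cumulative += value                  # sum(value) over (order by ...)
        if cumulative > share * total:
            return kept, (name, value)       # first row past the threshold
        kept.append((name, value))
    return kept, None


table = [("A", 10), ("B", 9), ("C", 8)]      # sample rows from the question
kept, crossing = rows_within_share(table)
```

This is a single pass over the rows, which is what the window-function query achieves on the database side.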