Multiple/Dependent Sub-Queries - sql

I am currently working within SQL Workbench/J and Redshift. I am still learning a bit, and have a question around creating a sub-query that is dependent upon another sub-query result. In the below example, a sub-query has been implemented in order to produce the mean of multiple records grouped upon a unique symbol. I am then using the mean in the primary query to calculate additional values (USD/UCL/LCL). However, I need to add a where clause on these aggregate values, which I cannot do. How would I implement another layer of sub-query to pre-calculate the UCL/LCL due to it being dependent on the first subquery to generate? I have tried adding it to the first sub-query, but have been unsuccessful. I appreciate the help in advance, as I am just learning.
select
symbol,
mean,
avg(volume) as volume,
(mean * avg(volume) * 0.001) as USD,
STDV,
(MEAN + STDV * 3) as UCL,
(MEAN - STDV * 3) as LCL,
sum((high > ucl)::int) as ucltest,
sum((low < lcl)::int) as lcltest
from
(select
h.*,
avg(close) over (partition by symbol) as mean,
cast(stddev_samp(close) over (partition by symbol) as dec(14,2)) as STDV
from
historical h)
group by
symbol, mean, STDV;

You don't need another layer of query, but there are a few issues. The first thing to remember is that the mean (and stdv) for each symbol is repeated on every row coming out of the sub-query, which is why you need the GROUP BY on those columns to get down to a single value per symbol. That works, but since they are constant within each symbol it is cleaner to use an aggregate such as MIN(MEAN) AS MEAN. Having two values named MEAN at different levels also invites confusion, so the version below renames the inner columns.
Now the real issue: ucl and lcl are select-list aliases local to this level of the query, so they can't be referenced inside an aggregate, which is why you see an error. You just need to repeat the calculation for these values. Like this (untested):
select
symbol,
min(lmean) as mean,
avg(volume) as volume,
(min(lmean) * avg(volume) * 0.001) as USD,
min(LSTDV) as STDV,
min(LMEAN + LSTDV * 3) as UCL,
min(LMEAN - LSTDV * 3) as LCL,
sum((high > (LMEAN + LSTDV * 3))::int) as ucltest,
sum((low < (LMEAN - LSTDV * 3))::int) as lcltest
from
(select
h.*,
avg(close) over (partition by symbol) as lmean,
cast(stddev_samp(close) over (partition by symbol) as dec(14,2)) as LSTDV
from
historical h)
group by
symbol
having
lcltest = 0 and ucltest = 0; -- the HAVING clause excludes any group where lcltest or ucltest is not equal to 0
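If you do want the extra layer the question asks about, the same thing can be written (also untested) by wrapping the aggregation in one more derived table and filtering it with a plain WHERE, since at that point ucltest and lcltest are just ordinary columns:
select *
from (
select
symbol,
min(lmean) as mean,
avg(volume) as volume,
(min(lmean) * avg(volume) * 0.001) as USD,
min(LSTDV) as STDV,
min(LMEAN + LSTDV * 3) as UCL,
min(LMEAN - LSTDV * 3) as LCL,
sum((high > (LMEAN + LSTDV * 3))::int) as ucltest,
sum((low < (LMEAN - LSTDV * 3))::int) as lcltest
from
(select
h.*,
avg(close) over (partition by symbol) as lmean,
cast(stddev_samp(close) over (partition by symbol) as dec(14,2)) as LSTDV
from
historical h) w
group by
symbol
) agg
where ucltest = 0 and lcltest = 0;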


Count half of rest of a partition by from position

I'm trying to achieve the following results:
Right now, the grouping comes from:
SUM(CASE WHEN seqnum <= (0.5 * seqnum_rev) THEN i.[P&L] END) OVER(PARTITION BY i.bracket_label ORDER BY i.event_id) AS [P&L 50%],
What I need is, for each row, to count the total number of rows from that position to the end (seq_inv) and sum the P&L amounts for only half of them, starting from that position.
For example, when seq = 2, seq_inv = 13; half of that is 6, so I need to sum the following 6 positions from seq = 2.
When seq = 4 there are 11 positions until the end (seq_inv = 11), so half is 5, and I want to sum 5 positions from seq = 4.
I hope this makes sense. I'm trying to come up with a rule that adapts to each case, since the PARTITION BY is what gives me the numbers that need to be summed.
I was also wondering whether there is something like a PARTITION BY TOP 50%, but I guess that doesn't exist.
I have the advantage that I've helped him before and have a little extra context.
That context is that this is just the later stage of a very long chain of common table expressions. That means self-joins and/or correlated sub-queries are unfortunately expensive.
Preferably, this should be answerable using window functions, as the data set is already available in the appropriate ordering and partitioning.
My reading is this...
The SUM(5:9) (meaning the sum of rows 5 to row 9, inclusive) is equal to SUM(5:end) - SUM(10:end)
That leads me to this...
WITH
cumulative AS
(
SELECT
*,
SUM([P&L]) OVER (PARTITION BY bracket_label ORDER BY event_id DESC) AS cumulative_p_and_l
FROM
data
)
SELECT
*,
cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/2, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_50_perc,
cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/4, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_25_perc
FROM
cumulative
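Sanity-checking that against the question's own example, reading the window the same way as the SUM(5:9) identity above: at seq = 2, seq_inv = 13, so seq_inv/2 = 6 with integer division, and the expression works out to SUM(rows 2..end) - SUM(rows 8..end) = SUM(rows 2..7), i.e. six rows starting at the current one.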
NOTE: Using spaces, & and % in column names is horrendous, don't do it ;)
EDIT: Corrected the ORDER BY in the cumulative sum.
I don't think that window functions can do what you want. You could use a correlated subquery instead, with the following logic:
select
t.*,
(
select sum(t1.[P&L])
from mytable t1
where t1.seq - t.seq between 0 and t.seq_inv/2
) [P&L 50%]
from mytable t

Calculate a specific moving average using sql query

Consider that I have a table with one column "A" and I would like to create another column called "B" such that
B[i] = 0.2*A[i] + 0.8*B[i-1]
where B[0]=0.
My problem is that I cannot use the OVER() function because I want to use the values in B while I am trying to construct B. Any idea would be appreciated. Thanks
This is a rather complex mathematical exercise. You want to accumulate exponentially decreasing amounts from previous rows.
It is a little confusing because the amount going in on each row is 20%, but that is just a factor in the formula.
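Unrolling the recurrence (with B[0] = 0, as given) shows what is being accumulated:
B[i] = 0.2*A[i] + 0.8*B[i-1]
     = 0.2*A[i] + 0.2*0.8*A[i-1] + 0.2*0.8^2*A[i-2] + ... + 0.2*0.8^(i-1)*A[1]
Each earlier row contributes 0.2*A scaled by a power of 0.8 that depends only on how far back it sits; the query below builds that sum by weighting each row with 0.8^(-n) inside a running window sum and then dividing the running total by 0.8^(-n) for the current row.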
In any case, this seems to do what you want:
select x.*,
sum(power(0.8, -n) * a * 0.2) over (order by id) / power(0.8, -n) as b
from (select t.*,
row_number() over (order by id) - 1 as n
from t
) x;
Here is a db<>fiddle using Postgres.

How to find neighboring records in the SQL table in terms of month and year?

Please help me to optimize my SQL query.
I have a table with the fields: date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields are the primary key. The months are designated with alphabet-ordered letters: e.g. F (for Jan.), G (for Feb.), H (for March), etc. Thus the letter for a month further from January sorts after the letter for a closer one (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3, and the set of months is constant across years.
I need to calculate the difference between prices (the gradient) of neighboring records in terms of exp_month_id and exp_year. As a first step, I want to determine, for every pair (exp_month_id, exp_year), the valid pair (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last one in the year, then next_year = exp_year + 1 and next_month_id should be the first one in the year.
I have written the following query to do the job:
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id
FROM futures
ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
f.exp_month_id,
f.exp_year,
(
WITH [temp] AS (
SELECT exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id
)
SELECT exp_month_id
FROM [temp]
WHERE exp_month_id > f.exp_month_id
UNION ALL
SELECT exp_month_id
FROM [temp]
LIMIT 1
)
AS next_month_id,
(
SELECT CASE WHEN EXISTS (
SELECT commodity_id,
exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id AND
exp_month_id > f.exp_month_id
LIMIT 1
)
THEN f.exp_year ELSE f.exp_year + 1 END
)
AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second, and thus the whole process takes minutes. I wonder if you could help me optimize the query.
Note: The following requires Sqlite 3.25 or newer for window function support:
Lack of sample data (preferably as CREATE TABLE and INSERT statements for easy importing) and expected results makes this hard to test, but if your end goal is computing the difference in prices between expiration dates (making your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to the hint from @Shawn to use window functions, I could rewrite the query in a much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.
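The view then gives the (next_month_id, next_year) pair for each contract. As a rough, untested sketch of the follow-up step (assuming the gradient should compare prices quoted on the same date), it can be joined back to futures like this:
SELECT f.date,
       f.commodity_id,
       f.exp_month_id,
       f.exp_year,
       n.price - f.price AS gradient
FROM futures AS f
JOIN futures_nextmonths_win AS v
  ON v.commodity_id = f.commodity_id
 AND v.exp_month_id = f.exp_month_id
 AND v.exp_year = f.exp_year
JOIN futures AS n
  ON n.date = f.date
 AND n.commodity_id = f.commodity_id
 AND n.exp_month_id = v.next_month_id
 AND n.exp_year = v.next_year;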

Computing a moving maximum in BigQuery

Given a BigQuery table with some ordering, and some numbers, I'd like to compute a "moving maximum" of the numbers -- similar to a moving average, but for a maximum instead. From Trying to calculate EMA (exponential moving average) using BigQuery it seems like the best way to do this is by using LEAD() and then doing the aggregation myself. (Bigquery moving average suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)
Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.
(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)
One approach is to do the following:
SELECT word,
(CASE
WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
ELSE word_count_2 END
) AS max_count
FROM (
SELECT word, word_count,
LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'
)
This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:
SELECT word,
IF(word_count >= word_count_1,
IF(word_count >= word_count_2, word_count, word_count_2),
IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...
This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:
SELECT word,
IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
SELECT word,
IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
FROM ...
)
It seems like doing 20 nested queries might not be a great idea performance-wise, though.
Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?
A trick I'm using for rolling windows: CROSS JOIN with a table of numbers. In this case, to have a moving window of 3 years, I cross join with the numbers 0,1,2. Then you can create an id for each group (ending_at_year==year-i) and group by that.
SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM
(
SELECT mean_temp, year-i ending_at_year, year
FROM [publicdata:samples.gsod] a
CROSS JOIN
(SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;
I have another way to do what you are trying to achieve. See the query below:
SELECT word, max(words)
FROM
(SELECT word,
word_count AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'),
(SELECT word,
LEAD(word_count, 1) OVER (ORDER BY word) AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'),
(SELECT word,
LEAD(word_count, 2) OVER (ORDER BY word) AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth')
group by word order by word
You can try it and compare performance with your approach (I didn't try that)
There's an example of creating a moving average using a window function in the docs here.
Quoting:
The following example calculates a moving average of the values in the current row and the row preceding it. The window frame comprises two rows that move with the current row.
#legacySQL
SELECT
name,
value,
AVG(value)
OVER (ORDER BY value
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
AS MovingAverage
FROM
(SELECT "a" AS name, 0 AS value),
(SELECT "b" AS name, 1 AS value),
(SELECT "c" AS name, 2 AS value),
(SELECT "d" AS name, 3 AS value),
(SELECT "e" AS name, 4 AS value);

Simplify query in H2 database - alternative to TOP X PERCENT

I'm having performance issues with a query and was wondering how to simplify it.
I have a table "Evaluations" (Sample, Category, Jury, Value)
And created some custom functions to get some average values for each sample, so I have this view:
CREATE VIEW Results AS
SELECT Sample,
Category,
IFNULL(COUNT_VALID(Value),0) || ' / ' || COUNT(Value) AS Valid,
CUSTOM_MEAN(Value) AS Mean,
CUSTOM_MEDIAN(Value) AS Median
FROM Evaluations GROUP BY Sample, Category;
Then I want another field telling me whether each sample is within the best-valued 30% of samples in its category. TOP(X) PERCENT would be perfect for this, but it seems H2 doesn't support it, so I made a second view that calculates the position within the category, multiplies it by 100, divides by the total count in the category, and compares the result to 30:
CREATE VIEW Res AS
SELECT R1.*,
CASE
WHEN (
((SELECT COUNT(*) FROM Results R2
WHERE R2.Category = R1.Category
AND (R2.Mean > R1.Mean OR (R2.Mean = R1.Mean AND R2.Median > R1.Median))) + 1) * 100
/
(SELECT COUNT(*) FROM Results R2 WHERE R2.Category = R1.Category) )
> 30
THEN 'over 30%'
ELSE 'within 30%'
END as 30PERCENT
FROM Results R1 ORDER BY Mean DESC, Median DESC;
This works properly but with just 500 records it takes some time to retrieve the results.
Could someone tell me a more efficient way of constructing this query?
Thanks and regards!
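For what it's worth, if the H2 version in use is recent enough to support window functions (1.4.198 or later), the two correlated subqueries per row could be replaced by a single ranking pass over the Results view; a rough, untested sketch of that idea:
SELECT R1.*,
       CASE
         WHEN RANK() OVER (PARTITION BY Category ORDER BY Mean DESC, Median DESC) * 100
              / COUNT(*) OVER (PARTITION BY Category) > 30
         THEN 'over 30%'
         ELSE 'within 30%'
       END AS "30PERCENT"
FROM Results R1
ORDER BY Mean DESC, Median DESC;
RANK() counts the rows that rank strictly better plus one, which matches the (COUNT(...) + 1) numerator in the original view.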