Looking to compute a running cumulative return for a series of daily returns. I know this can be solved using exp() and sum(), but my return series is not calculated using LN.
Hoping to solve this without using loops, as they are very inefficient in SQL.
It's important that this run fast.
Is this what you want?
select t.*,
       (select exp(sum(log(1 + return))) - 1
        from table t2
        where t2.date <= t.date
       ) as cumereturn
from table t;
The functions for exp() and log() may be different in the database you are using. In many databases, you can also use:
select t.*,
       exp(sum(log(1 + return)) over (order by date)) - 1 as cumereturn
from table t;
I don't think any database has a built-in product() aggregation function. Alas.
Converting (1+r)(1+r)(1+r) to exp(log(1+r) + log(1+r) + log(1+r)) does not work when any 1+r is zero or negative: the log of a non-positive number is undefined, which happens when a return is -100% or worse.
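If zero or negative factors are a real possibility, one common workaround is to track zeros and the sign separately and take the log of the absolute value. A sketch, assuming the same table t(date, return) as above (the exact spellings of mod(), log(), and exp() vary by database):
select t.*,
       case when min(abs(1 + return)) over (order by date) = 0
            -- a -100% day wipes out the cumulative product
            then -1
            else exp(sum(log(abs(nullif(1 + return, 0)))) over (order by date))
                 -- flip the sign when the count of negative factors is odd
                 * case when mod(sum(case when 1 + return < 0 then 1 else 0 end)
                                 over (order by date), 2) = 1
                        then -1 else 1 end
                 - 1
       end as cume_return
from t;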
Is there any way to optimize the NTILE function's run time? Currently we have around 51M records with 17 variables.
We are running the query below to divide the dataset into 100 buckets.
create table secondary_table
stored as orc
as
select a.*, NTILE(100) OVER (ORDER BY score) AS score_rank
from main_table a;
Here the score variable holds 12-digit decimal values.
As of now all the load gets dumped onto one reducer, which takes a long time after reaching 99%. Is there any way to optimize this? The current query takes around 35 minutes to execute.
Appreciate any response. Thanks in advance.
This is not quite an answer, but it might provide some guidance.
The issue is the lack of partition by in the window function. Replacing it with equivalent constructs using, say, row_number() and count(*) won't help.
When I have encountered this, I have been able to work around it in one of two ways.
If there are lots of duplicates, then aggregate and use cumulative sums to define the tiles (a sketch follows).
Otherwise, break the values into groups.
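A minimal sketch of the duplicate-heavy case, assuming it is applied to the question's main_table(score) and that the distinct scores are few enough to aggregate cheaply (note that ties land in the same bucket, which is slightly different from NTILE's semantics):
select score,
       -- tile from the cumulative position of each distinct score;
       -- the global sort now runs over distinct values, not all 51M rows
       1 + floor((running_cnt - 1) * 100 / total_cnt) as score_rank
from (select score,
             sum(count(*)) over (order by score) as running_cnt,
             sum(count(*)) over () as total_cnt
      from main_table
      group by score
     ) s;
Joining this back to main_table on score attaches the rank to every row.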
As an example of the second approach, assume the scores range from 0 to 1000 with a fairly even distribution. Then:
select t.*,
       1 + floor((t.seqnum_within + tt.running_cnt - tt.cnt - 1) * 100 / tt.total_cnt) as score_rank
from (select t.*,
             row_number() over (partition by trunc(score) order by score) as seqnum_within
      from t
     ) t join
     (select trunc(score) as score_trunc, count(*) as cnt,
             sum(count(*)) over (order by min(score)) as running_cnt,
             sum(count(*)) over () as total_cnt
      from t
      group by trunc(score)
     ) tt
     on trunc(t.score) = score_trunc;
The GROUP BY and JOIN should make better use of the parallel hardware.
I am a relative newbie with Teradata SQL and have run into this strange (I think strange) situation. I am trying to run a regression (REGR_SLOPE) on sensor data. I am gathering sensor readings for a single day; each day has 80 observations, which is confirmed by the COUNT in the outer SELECT. My query is:
SELECT
d.meter_id,
REGR_SLOPE(d.reading_measure, d.x_axis) AS slope,
COUNT(d.x_axis) AS xcount,
COUNT(d.reading_measure) AS read_count
FROM
(
SELECT
meter_id,
reading_measure,
row_number() OVER (ORDER BY Reading_Dttm) AS x_axis
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
AND Meter_Id IN (11932101, 11419827, 11385229, 11643466)
AND Channel_Num = 5
) d
GROUP BY 1
When I use the IN clause in the subquery to specify Meter_Id, I get slope values, but when I take it out (to run over all meters), all the slopes are 0 (zero). I would simply like to fit a line through a day's worth of observations (80).
I'm using Teradata v15.0.
What am I missing / doing wrong?
I would bet a Pepperoni Pizza that it's the x_axis value.
Instead try ROW_NUMBER() OVER (PARTITION BY meter_id ORDER BY reading_dttm)
This will ensure that the x_axis starts again from 1 for each meter, and each reading will always be 1 away from the previous reading on the x_axis.
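Applied to the question's subquery, the fix could look like this (a sketch keeping the original filters, minus the IN list):
SELECT
    meter_id,
    reading_measure,
    -- restart numbering for each meter so every x_axis runs 1..80
    ROW_NUMBER() OVER (PARTITION BY meter_id ORDER BY Reading_Dttm) AS x_axis
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
    AND Channel_Num = 5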
All of this makes me think you should probably just use reading_dttm as the x_axis value, rather than fabricating one with ROW_NUMBER(). That way, readings with a 5-hour gap between them have a different slope than readings with a 10-day gap between them. You may need to convert reading_dttm's data type with a function like TO_UNIXTIME(reading_dttm), or something similar.
I'll message you my address for the Pizza Delivery. (Joking.)
In addition to @MatBailie's answer:
You probably know that you should use the timestamp instead of the ROW_NUMBER, but you can't, because Teradata doesn't allow timestamps in this place (strange).
There's no built-in TO_UNIXTIME function in Teradata, but you can use this instead:
REPLACE FUNCTION TimeStamp_to_UnixTime (ts TIMESTAMP(6))
RETURNS decimal(18,6)
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
(Cast(ts AS DATE) - DATE '1970-01-01') * 86400
+ (Extract(HOUR From ts) * 3600)
+ (Extract(MINUTE From ts) * 60)
+ (Extract(SECOND From ts));
If you're not allowed to create UDFs, simply cut & paste the calculation.
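For example, the regression could then use real time on the x-axis (a sketch based on the question's query; Reading_Dttm is assumed to be a TIMESTAMP column of the view):
SELECT
    meter_id,
    -- seconds since the epoch as the x-axis: the gap between readings
    -- now influences the slope proportionally
    REGR_SLOPE(reading_measure, TimeStamp_to_UnixTime(Reading_Dttm)) AS slope
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
    AND Channel_Num = 5
GROUP BY meter_id;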
I am trying to get a one-step lag of one of my columns on an irregular time series. The data is as follows:
timestamp (seconds), temperature
1, 20
4, 12
6, 13
7, 18
The new dataset should be as follows:
timestamp (seconds), temperature, lagged_1_temperature
1, 20, 0
4, 12, 0
6, 13, 0
7, 18, 13
As shown, only the last row has a non-zero lag, because its timestamp is exactly one second after the previous row's.
For a typical lag I use the Hive query below inside my Spark application:
"select timestamp, value, lag(value,1) OVER (ORDER BY timestamp) as lagged_1_value"
Can I change the above Hive query to give me the result I want?
You can do this with a case expression.
select t.*,
       case when timestmp - coalesce(lag(timestmp, 1) over (order by timestmp), 0) = 1
            then coalesce(lag(temperature, 1) over (order by timestmp), 0)
            else 0
       end as lagged_1_temperature
from t;
A simple left join might be more efficient:
select t.*,
       coalesce(tprev.temperature, 0) as lagged_1_temperature
from t left join
     t tprev
     on tprev.timestmp = t.timestmp - 1;
I am currently working on a query that will be used in conjunction with SharePoint to run reports. I have a query that I know will work with Oracle, but the company I am working for is running SQL Server 2005.
What the report will do is give the person the ability to select any date and time, and give the count for that specific operation. The problem is that there are large gaps in the time stamps (because it takes a little while for the product to get to the next operation). The date type is varchar, so I used substrings to parse out the year, month, day, and time. I have sample data available.
The people looking at the reports want the ability to say: at this time and day, how many units went through this operation?
I know this is confusing; let me know if you need any clarification.
Here is the Oracle syntax:
SELECT T3.PAYMENT_DATE, T3."Hr", T3."Min",
(SELECT COUNT(*)
FROM INVOICE_ARCHIVE T4
WHERE TO_NUMBER(TO_CHAR(T4.PAYMENT_DATE, 'MM')) <= T3."Hr"
AND TO_NUMBER(TO_CHAR(T4.PAYMENT_DATE, 'DD')) <= T3."Min") AS "NUM"
FROM(SELECT T1.PAYMENT_DATE, T2."Hr", T2."Min"
FROM (SELECT (FLOOR((LEVEL + 359)/60)) AS "Hr",
MOD((LEVEL + 359), 60) AS "Min"
FROM dual CONNECT BY LEVEL <= 961) T2, INVOICE_ARCHIVE T1
ORDER BY T1.PAYMENT_DATE, T2."Hr", T2."Min") T3
The answer to your question is the datepart() function in SQL Server. This will allow you to extract minutes and hours from dates.
The harder part is the "connect by level" portion. How is this being used? You might need to use recursive CTEs to handle this.
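For reference, a recursive CTE that generates the same 1..961 sequence might look like this (a sketch; note the MAXRECURSION hint, since SQL Server's default recursion limit is 100 levels):
WITH numbers(level) AS (
    SELECT 1
    UNION ALL
    SELECT level + 1 FROM numbers WHERE level < 961
)
SELECT (level + 359) / 60 AS "Hr",   -- integer division already floors
       (level + 359) % 60 AS "Min"   -- SQL Server's modulo operator
FROM numbers
OPTION (MAXRECURSION 961);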
With the little hint from spencer, the following may suffice for your query:
SELECT T3.PAYMENT_DATE, T3."Hr", T3."Min",
       (SELECT COUNT(*)
        FROM INVOICE_ARCHIVE T4
        WHERE datepart(month, T4.PAYMENT_DATE) <= T3."Hr" AND
              datepart(day, T4.PAYMENT_DATE) <= T3."Min"
       ) AS "NUM"
FROM (SELECT T1.PAYMENT_DATE, T2."Hr", T2."Min"
      FROM (SELECT FLOOR((level + 359) / 60) AS "Hr",
                   (level + 359) % 60 AS "Min"
            FROM (select top 961 row_number() over (order by (select NULL)) as level
                  from invoice_archive
                 ) t
           ) T2 cross join
           INVOICE_ARCHIVE T1
     ) T3
ORDER BY T3.PAYMENT_DATE, T3."Hr", T3."Min"
I made the following changes:
Changed the date arithmetic to use datepart() instead of to_char().
Replaced the method for getting a list of numbers, using row_number() instead of connect by level.
Replaced MOD() with the % operator, since SQL Server does not have a MOD() function.
Made the cross join explicit.
Moved the order by to the outer query, since neither SQL Server nor Oracle guarantees the results of an order by in a subquery (and SQL Server does not allow it there unless you have a TOP query).
I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens), in seconds or whatever is easiest; I can work with it from there. So, for example, the inter-arrival time for only the first two times above would be 870 (14m 30s). For all three times it would be (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above comes from a different query and is literally just a one-column list of timestamps.
Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row. The result is an interval for each row in the table.
The outer query simply does an average over all differences.
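Side note: consecutive differences telescope, so the average gap is simply (last - first) / (n - 1). Since PostgreSQL 8.1 predates window functions (they arrived in 8.4), this gives a one-liner that works there too (same assumed names as above):
-- average inter-arrival time without window functions;
-- nullif() avoids a division by zero when there is only one row
SELECT (max(timestamp_col) - min(timestamp_col))
       / nullif(count(*) - 1, 0) AS avg_difference
FROM your_table;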
I think you want to find avg(timestamptz).
My solution is avg(current - min value), but since the result is an interval, add it back to the min value again:
SELECT avg(target_col - (select min(target_col) from your_table))
       + (select min(target_col) from your_table)
FROM your_table;
If you cannot upgrade to a version of PG that supports window functions, you may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- `` lead(ts) over (order by ts) ''
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what a_horse_with_no_name suggests, and use window functions.