I have the following SQL query that fails with a syntax error. I'm trying to reference the prior day's close; how do I fix my query so it doesn't error out?
Thanks!
SELECT *
FROM "daily_data"
WHERE date >'2018-01-01' and (open-LAG(close))/LAG(close)>=1.4 and volume > 1000000 and open > 1
Error:
Query execution failed
Reason: SQL Error [42809]: ERROR: window function lag requires an OVER clause
Position: 63
You need to use a subquery, because window functions cannot be used in the WHERE clause. You also need an ORDER BY and potentially a PARTITION BY clause:
SELECT *
FROM (SELECT dd.*,
LAG(close) OVER (ORDER BY date) as prev_close
FROM "daily_data" dd
) dd
WHERE date > '2018-01-01' AND
(open - prev_close) / prev_close >= 1.4 AND
volume > 1000000 AND
open > 1;
lag(close) means "the value of close from the prior record." The phrase by itself is missing something fundamental: how do you define "prior record"? There is never any implied order in an RDBMS.
As with functions such as rank and row_number, to properly form the lead and lag calls you need to establish the prior (or next) record by defining the order of output. In other words, "if you were to sort the output by x, the prior record's close" would look like this:
lag (close) over (order by x)
To order by something descending:
lag (close) over (order by x desc)
You can optionally chunk the data by a field using partition by which may or may not be useful in your problem. For example, "for each item, if you were to sort the output by x, the prior record's close:"
lag (close) over (partition by item order by x)
So the question here is: prior record (lag)... how? By which fields, in which order?
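For example, putting the pieces together in a complete query (a sketch, assuming a hypothetical trades table with item, date, and close columns):
select item, date, close,
       -- for each item, the close from that item's previous date
       lag(close) over (partition by item order by date) as prev_close
from trades;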
As a final thought, analytic/windowing functions cannot be used in the where clause in PostgreSQL. To accomplish this, wrap them in a subquery:
with daily as (
SELECT
d.*,
LAG (d.close) over (order by d.<something>) as prior_close
FROM "daily_data" d
WHERE
d.date >'2018-01-01' and
d.volume > 1000000 and
d.open > 1
)
select *
from daily
where
(open - prior_close) / prior_close >= 1.4
Note that because the volume and open filters sit inside the CTE, prior_close comes from the previous row that passed those filters, which is not necessarily the previous trading day.
Related
I have a table as shown below.
time                 Event
2021-03-19T17:15:05  A
2021-03-19T17:15:11  B
2021-03-19T17:15:11  C
2021-03-19T17:15:12  A
2021-03-19T17:15:14  C
I want to find the average time between an event A and the event following it.
How do I find it using an SQL query?
The desired output here is 4 seconds (the two gaps after A are 6 seconds and 2 seconds).
I really appreciate any help you can provide.
The basic idea is lead() to get the time from the next row; then you calculate the difference. For all rows:
select t.*,
       (to_unix_timestamp(lead(time) over (order by time)) -
        to_unix_timestamp(time)
       ) as diff_seconds
from t;
Then use a subquery, filter to just the A events, and take the average:
select avg(diff_seconds)
from (select t.*,
             (to_unix_timestamp(lead(time) over (order by time)) -
              to_unix_timestamp(time)
             ) as diff_seconds
      from t
     ) t
where event = 'A';
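to_unix_timestamp is not available everywhere; if you are on PostgreSQL, for example, the same idea can be sketched with extract(epoch from ...), assuming time is a timestamp column:
select avg(diff_seconds)
from (select t.*,
             -- lead(time) - time yields an interval; extract(epoch from ...) converts it to seconds
             extract(epoch from lead(time) over (order by time) - time) as diff_seconds
      from t
     ) t
where event = 'A';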
Hi, I am a newbie when it comes to SQL and was hoping someone could help me with this. I've been using the lag function here and there, but I was wondering if there is a way to rewrite it into a sum over a range. Instead of just the prior month, I want to take the prior 12 months and sum them together for each period. I don't want to write 12 lines of lag; is there a way to do this with less code? Note there will be nulls, and if one of the 12 records is null then the result should be null.
I know you can write a subquery to do this, but I was wondering if this is possible. Any help would be much appreciated.
You want the "window frame" part of the window function. A moving 12-month sum would look like:
select t.*,
sum(balance) over (order by period rows between 11 preceding and current row) as moving_sum_12
from t;
You can review window frames in the documentation.
If you want a cumulative sum, you can leave out the window frame entirely.
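A minimal sketch, using the same table; with no frame specified, the sum runs from the first period through the current row:
select t.*,
       sum(balance) over (order by period) as cumulative_sum
from t;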
I should note that you can also do this using lag(), but it is much more complicated:
select t.*,
       (balance +
        lag(balance, 1, 0) over (order by period) +
        lag(balance, 2, 0) over (order by period) +
        . . .
        lag(balance, 11, 0) over (order by period)
       ) as moving_sum_12
from t;
This uses the little known third argument to lag(), which is the default value to use if the record is not available. It replaces a coalesce().
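That is, these two expressions are equivalent:
lag(balance, 1, 0) over (order by period)
coalesce(lag(balance, 1) over (order by period), 0)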
EDIT:
If you want NULL when 12 values are not available, then use case and count() as well; counting balance rather than * also treats a NULL balance within the window as missing, per your requirement:
select t.*,
       (case when count(balance) over (order by period rows between 11 preceding and current row) = 12
             then sum(balance) over (order by period rows between 11 preceding and current row)
        end) as moving_sum_12
from t;
I have an events table I'm querying by month and trying to limit the number of events returned per day to 3.
[39] pry(#<EventsController>)> #events.group("DATE_TRUNC('day', start)").count
CACHE (0.0ms) SELECT COUNT(*) AS count_all, DATE_TRUNC('day', start) AS
date_trunc_day_start FROM "events" WHERE ((start >= '2014-08-31 00:00:00' and
start <= '2014-10-12 00:00:00')) GROUP BY DATE_TRUNC('day', start)
=> {2014-09-24 00:00:00 UTC=>5,
2014-09-18 00:00:00 UTC=>6,
2014-09-25 00:00:00 UTC=>3}
Here we have 5 events on the 24th, 6 on the 18th, and 3 on the 25th.
http://stackoverflow.com/a/12529783/3317093
When I try the query without the .count, I get the error message
PG::GroupingError: ERROR: column "events.id" must appear in the GROUP BY clause or be used in an aggregate function
I looked at using select() to get the grouping to work, but would need to list all the columns in the table. How should I structure the query/scope to return 3 records from each group of events?
Edit - I'm close!
I've found many similar questions, most of them for MySQL, using select. I think using select could be the way to go, either as events.* or as below:
#events.where("exists (select 1 from events GROUP BY DATE_TRUNC('day', start) limit 3)")
yields the SQL
SELECT "events".* FROM "events" WHERE ((start >= '2014-08-31 00:00:00'
and start <= '2014-10-12 00:00:00')) AND (exists (select 1 from events
GROUP BY DATE_TRUNC('day', start) limit 3))
The query returns all #events sorted by id (it seems :id is implicitly part of the grouping). I've tried switching things up but most often get the same grouping error as earlier.
For anyone experiencing a similar issue, I would recommend checking out window functions and this blog post covering different ways to solve a similar question. The three approaches covered in the post include using 1) group_by, 2) SQL subselects, 3) window functions.
My solution, using window functions:
#events.where("(events.id)
IN (
SELECT id FROM
( SELECT DISTINCT id,
row_number() OVER (PARTITION BY DATE_TRUNC('day', start) ORDER BY id) AS rank
FROM events) AS result
WHERE (
start >= '#{startt}' and
start <= '#{endt}' and
rank <= 3
)
)
")
If you don't want to use count, you can use Rails' group_by to list the events, like so:
hash = #events.group_by{ |p| p.start.to_date}
And use this code for limit(3) for each date:
hash.inject({}) { |h, (k, v)| h.merge(k => v.take(3)) }
This maps over the hash and returns a hash instead of an array.
Can you explain why the following works:
select recdate,avg(logtime)
over
(ORDER BY recdate rows between 10 preceding and 0 following) as logtime
from v_download_times;
and the following doesn’t
select recdate,median(logtime)
over
(ORDER BY recdate rows between 10 preceding and 0 following) as logtime
from v_download_times;
(median instead of avg)
I get an ORA-30487 error, and I would be grateful for a workaround.
The error message is ORA-30487: ORDER BY not allowed here. And sure enough, if we consult the documentation for the MEDIAN function it says:
"You can use MEDIAN as an analytic function. You can specify only the
query_partition_clause in its OVER clause."
MEDIAN doesn't allow an ORDER BY clause. As APC points out in his answer, the documentation tells us we can only specify the query_partition_clause.
ORDER BY is redundant as we're looking for the central value -- it's the same regardless of order.
But it is not redundant if you only want the median of a certain number of rows preceding the current one.
A way around that may be to limit your data set just for the median's purpose, like:
select
    median(field) over (partition by field2)
from (select * from dataset
      where period_back between 0 and 2)
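If you really need a rolling median, another sketch is to compute MEDIAN as a plain aggregate over a self-join; note this uses a value-based range on recdate (rows within the prior 10 days) rather than a 10-row frame, and assumes recdate is a DATE:
select a.recdate,
       median(b.logtime) as rolling_median
from v_download_times a
join v_download_times b
  on b.recdate between a.recdate - 10 and a.recdate
group by a.recdate;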
In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: if I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
That runs too, now with 57,862 different groups.
I tried different combinations to reproduce the error, and I was able to get the same one by doubling the amount of initial data. An easy "hack" to double the data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? Since you are already partitioning percentile_rank by day, can you add a filter to only analyze a fraction of the days (for example, only the last month)?