Exclude nulls from postgres window function - sql

For each row, I want to take the average of the last 20 non-null values using a window function.
The window function takes the last 20 rows, including the NULL ones, and calculates the average, while I want the average of the last 20 non-NULL rows.
What I have tried:
WITH adv_calculated AS (
SELECT security_id AS sec_id,
date AS pd,
AVG(volume)
OVER (PARTITION BY (security_id) ORDER BY date ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) AS adv
FROM tbl_financial_index_data
WHERE volume IS NOT NULL
GROUP BY security_id, date
)
UPDATE tbl_financial_index_data
SET adv = amc.adv
FROM adv_calculated amc
WHERE amc.sec_id = security_id
AND amc.pd = date
This solution works well for all rows where volume is NOT NULL, but it does not calculate the adv/average for the rows where volume is NULL.
Then, for the rows where adv and volume are NULL, I have to run this query, which is really slow:
UPDATE tbl_financial_index_data
SET average_daily_volume =
(SELECT avg(t.volume)
FROM (
SELECT a.volume
FROM tbl_financial_index_data a
WHERE a.security_id = tbl_financial_index_data.security_id
AND a.date::date <= tbl_financial_index_data.date::date
AND a.volume IS NOT NULL
ORDER BY a.date DESC
LIMIT 21
) t)
WHERE volume IS NULL;
I want to avoid using the second query and calculate ADV for all rows using the first query (because it is much faster).

Simply omit the WHERE condition WHERE volume IS NOT NULL; then you should get what you want.
You can nest the query in an outer query to remove the undesired values later:
WITH adv_calculated AS (
SELECT ...
FROM (SELECT AVG(volume) OVER (... ROWS BETWEEN ... AND ...),
...
FROM tbl_financial_index_data
GROUP BY security_id, date
) AS subq
WHERE volume IS NOT NULL
)
SELECT ...
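Filled in with the table and column names from the question, the nested form might look roughly like this (just a sketch, not tested; the GROUP BY from the original query is dropped, assuming one row per (security_id, date)). The point is that the window function is evaluated over all rows before the outer WHERE removes the NULL-volume rows:
WITH adv_calculated AS (
    SELECT sec_id, pd, adv
    FROM (SELECT security_id AS sec_id,
                 date        AS pd,
                 volume,
                 AVG(volume) OVER (PARTITION BY security_id
                                   ORDER BY date
                                   ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) AS adv
          FROM tbl_financial_index_data
         ) AS subq
    WHERE volume IS NOT NULL  -- filter applied only after the window has been computed
)
SELECT sec_id, pd, adv
FROM adv_calculated;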

This is just a workaround.
Okay, so I found the solution.
If the volume is NULL for some row, then the adv for that row is going to be equal to the adv of the previous row with a non-NULL volume, so I had to find a way to carry forward the previous non-NULL adv for rows where it is NULL.
I was able to find a way to do that from this answer.
Here's the code to carry forward the non-null value:
WITH temp_adv_filled_values AS (
SELECT security_id,
date,
FIRST_VALUE(adv) OVER W AS adv
FROM (
SELECT security_id,
date,
adv,
SUM(CASE WHEN volume IS NULL THEN 0 ELSE 1 END)
OVER (PARTITION BY security_id ORDER BY date ) AS value_partition
FROM tbl_financial_index_data
) AS q
WINDOW W AS (PARTITION BY security_id, value_partition ORDER BY date)
)
UPDATE tbl_financial_index_data tfid
SET adv = tmcfv.adv
FROM temp_adv_filled_values tmcfv
WHERE tmcfv.security_id = tfid.security_id
AND tmcfv.date = tfid.date;
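To see why this works, here is a minimal sketch on hypothetical toy data (the adv values are made up): the running count of non-NULL volumes stays constant across the NULL rows, so every NULL row lands in the same value_partition as the last non-NULL row before it, and FIRST_VALUE picks up that row's adv.
SELECT date, volume, adv,
       SUM(CASE WHEN volume IS NULL THEN 0 ELSE 1 END)
           OVER (ORDER BY date) AS value_partition
FROM (VALUES
        (date '2023-01-01', 10,   11.0),
        (date '2023-01-02', NULL, NULL),
        (date '2023-01-03', NULL, NULL),
        (date '2023-01-04', 20,   15.0)
     ) AS t(date, volume, adv);
-- value_partition comes out as 1, 1, 1, 2, so
-- FIRST_VALUE(adv) OVER (PARTITION BY value_partition ORDER BY date)
-- carries 11.0 onto the two NULL rows.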

Related

Sum over a given time period

The following code gives the total duration that a light has been switched on.
CREATE TABLE switch_times (
id SERIAL PRIMARY KEY,
is1 BOOLEAN,
id_dec INTEGER,
label TEXT,
ts TIMESTAMP WITH TIME ZONE default current_timestamp
);
CREATE VIEW makecount AS
SELECT *, row_number() OVER (PARTITION BY id_dec ORDER BY id) AS count
FROM switch_times;
select c1.label, SUM(c2.ts-c1.ts) AS sum
from
(makecount AS c1
inner join
makecount AS c2 ON c2.count = c1.count + 1)
where c2.is1=FALSE AND c1.id_dec = c2.id_dec AND c2.is1 != c1.is1
GROUP BY c1.label;
Link to working demo https://dbfiddle.uk/ZR8pLEBk
Any suggestions on how to alter the code so that it would give the sum over a given specific time period, say the 25th, during which all three lights were switched on for 12 hours?
Problem 1: the current code gives the total sum, as follows.
Problem 2: all durations that have not ended are disregarded, because there is no switch-off time.
label sum
0x29 MH3 1 day 03:00:00
0x2B MH1 1 day 01:00:00
0x2C MH2 1 day 02:00:00
The expected result is just over a given date, i.e.
label sum
0x29 MH3 12:00:00
0x2B MH1 12:00:00
0x2C MH2 12:00:00
Assuming the following (which should be defined in the question):
Postgres 15.
The table is big, many rows per label, performance matters, we can add indexes.
All columns are actually NOT NULL; you just forgot to declare them as such.
Every "light" has a distinct id_dec and a distinct label. Having both in switch_times is redundant. (Normalization!)
A light is "switched on" if the most recent earlier entry has is1 IS TRUE. Else it's considered "off".
The order of rows is established by ts, not by id as used in your query (which is typically incorrect).
Consecutive entries do not have to change the state.
No duplicate entries for (id_dec, ts). (There is a unique index enforcing that.)
There is no minimum or maximum time interval between entries.
"The 25th" is supposed to mean tstzrange '[2022-11-25 0:0+02, 2022-11-26 0:0+02)' (Note the time zone offsets.)
You want results for all labels that were switched on at all during the given time interval.
There is a table "labels" with one distinct entry per relevant light. If you don't have one, create it.
Indexes
Have at least these indexes to make everything fast:
CREATE INDEX ON switch_times (id_dec, ts DESC);
CREATE INDEX ON switch_times (ts);
Optional step to create table labels
CREATE TABLE labels AS
WITH RECURSIVE cte AS (
(
SELECT id_dec, label
FROM switch_times
ORDER BY 1
LIMIT 1
)
UNION ALL
(
SELECT s.id_dec, s.label
FROM cte c
JOIN switch_times s ON s.id_dec > c.id_dec
ORDER BY 1
LIMIT 1
)
)
TABLE cte;
ALTER TABLE labels
ADD PRIMARY KEY (id_dec)
, ALTER COLUMN label SET NOT NULL
, ADD CONSTRAINT label_uni UNIQUE (label)
;
Why this way? See:
Optimize GROUP BY query to retrieve latest row per user
Main query
WITH bounds(lo, hi) AS (
SELECT timestamptz '2022-11-25 0:0+02' -- enter time interval here *once*
, timestamptz '2022-11-26 0:0+02'
)
, snapshot AS (
SELECT id_dec, label, is1, ts
FROM switch_times s, bounds b
WHERE s.ts >= b.lo
AND s.ts < b.hi
UNION ALL -- must be separate
SELECT s.*
FROM labels l
JOIN LATERAL ( -- latest earlier entry
SELECT s.id_dec, s.label, s.is1, b.lo AS ts -- cut off at lower bound
FROM switch_times s, bounds b
WHERE s.id_dec = l.id_dec
AND s.ts < b.lo
ORDER BY s.ts DESC
LIMIT 1
) s ON s.is1 -- ... if it's "on"
)
SELECT label, sum(z - a) AS duration
FROM (
SELECT label
, lag(is1, 1, false) OVER w AS last_is1
, lag(ts) OVER w AS a
, ts AS z
FROM snapshot
WINDOW w AS (PARTITION BY label ORDER BY ts ROWS UNBOUNDED PRECEDING)
) sub
WHERE last_is1
GROUP BY 1;
fiddle
CTE bounds is an optional convenience feature to enter the lower and upper bounds of your time interval once.
CTE snapshot collects all rows of interest, which consist of
all rows inside the time interval (1st leg of UNION ALL query)
the latest earlier row if it was "on" (2nd leg of UNION ALL query)
We need to gather 2. separately to cover corner cases where the light was switched on earlier and there is no entry for the given time interval! But we can replace the timestamp with the lower bound immediately.
The final query gets the previous (is1, ts) for every row in a subquery, defaulting to "off" if there was no previous row.
Finally, sum up the intervals in the outer SELECT. Only sum intervals that start in the "on" state (no matter the final state).
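As a tiny illustration of the lag(is1, 1, false) default on hypothetical toy data: the first row per partition has no previous entry, so it is treated as switched off and its leading interval is not counted.
SELECT ts, is1,
       lag(is1, 1, false) OVER (ORDER BY ts) AS last_is1,
       lag(ts)            OVER (ORDER BY ts) AS a
FROM (VALUES (timestamptz '2022-11-25 01:00+02', true),
             (timestamptz '2022-11-25 05:00+02', false),
             (timestamptz '2022-11-25 09:00+02', true)) AS v(ts, is1);
-- only the 05:00 row has last_is1 = true, so only the 01:00 -> 05:00 interval is summed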
Related:
Jump SQL gap over specific condition & proper lead() usage
My assumption:
the actual on time is the time difference from a row where is1 is true to the next row where is1 is false, ordered by ts.
The query below will calculate the total sum of on time between two dates:
select
    id_dec,
    label,
    sum(to_timestamp(nexttime) - ts) as time_def
from (
    select
        id_dec,
        "label",
        ts,
        is1,
        case
            when is1 = true
                then lead(extract(epoch from ts)) over (partition by id_dec order by id_dec, ts asc)
            else 0
        end nexttime
    from switch_times
    where ts between '2022-11-24' and '2022-11-28'
) as a
where nexttime <> 0
group by
    id_dec,
    label

Get value of same column from next row if current column value is null

I have a table and I want to select one column such that if its record is not found (because I have joins with other tables), or it exists but is null, then I select the value of the same column from the next row. I tried to use the isnull and coalesce functions but I was unable to get the value of the next row.
Any help or link would be appreciated.
Here is my query so far
Select
(select top 1 OpenPrice from #tbltempData where Dated=D.Dated) [Open],
ISNULL((select top 1 ClosePrice from #tbltempData where Dated= DATEADD(hour,#Interval-1, D.Dated)),
(select top 1 ClosePrice from #tbltempData where Dated= DATEADD(hour,0, D.Dated))) [Close],
[Min],[Max],Dated
from #tbltempData2 D
Order BY Dated Asc
The Open column has null values.
Here is a screenshot of my sample data,
and here is the output I am getting.
Details: I have records in my sample data for the date '28/06/2019', and the time of the first record is 9 am. I am grouping my data into 2-hour intervals, so after grouping, the first group record of that date is for 8 am. Since I have no value for that time in the sample data, I am getting null values. To avoid this scenario, I want to get the OpenPrice value where the time is 9 am (in the sample data) of the same date, because that time is in the same group.
If you want the "next row" to always be greater than the current time:
[Open] = (
select top 1 OpenPrice
from #tbltempData t
where DATEDIFF(day,t.Dated,D.Dated) = 0 -- make sure the price for same day
AND t.Dated>=D.Dated
ORDER BY t.Dated ASC
)
In case you want the "next row" to be the closest available time slot:
[Open] = (
select top 1 OpenPrice
from #tbltempData t
where DATEDIFF(day,t.Dated,D.Dated) = 0 -- make sure the price for same day
ORDER BY ABS(DATEDIFF(minute,t.Dated,D.Dated)) ASC
)
I think a correlated subquery does what you want:
select d.*,
(select top (1) ClosePrice
from #tbltempData td
where td.Dated <= D.Dated
order by td.Dated desc
) as ClosePrice
from #tbltempData2 d
order by dated Asc

SQL Server - How to fill in missing column values

I have a set of records at day level with 2 columns:
Invoice_date
Invoice_amount
For a few records, the value of invoice_amount is missing.
I need to fill in invoice_amount values where they are NULL using this logic:
Look for next available invoice_amount (in dates later than the blank value record date)
For records with invoice_amount still blank (invoice_amount not present for future dates), look for most previous invoice_amount (in dates before the blank value date)
Note: We have multiple consecutive days where invoice_amount is blank in the dataset.
Use CROSS APPLY to find the next and previous non-null Invoice Amount:
update p
set Invoice_Amount = coalesce(nx.Invoice_Amount, pr.Invoice_Amount)
from Problem p
outer apply -- Next non null value
(
select top 1 *
from Problem x
where x.Invoice_Amount is not null
and x.Invoice_Date > p.Invoice_Date
order by Invoice_Date
) nx
outer apply -- Prev non null value
(
select top 1 *
from Problem x
where x.Invoice_Amount is not null
and x.Invoice_Date < p.Invoice_Date
order by Invoice_Date desc
) pr
where p.Invoice_Amount is null
This updates your table in place. If you need a SELECT query, it can easily be modified into one.
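For example, a read-only variant might look roughly like this (a sketch reusing the same Problem table and the two APPLY subqueries from above):
select p.Invoice_Date,
       coalesce(p.Invoice_Amount, nx.Invoice_Amount, pr.Invoice_Amount) as Invoice_Amount
from Problem p
outer apply ( -- next non-null value
    select top 1 x.Invoice_Amount
    from Problem x
    where x.Invoice_Amount is not null
      and x.Invoice_Date > p.Invoice_Date
    order by x.Invoice_Date
) nx
outer apply ( -- previous non-null value
    select top 1 x.Invoice_Amount
    from Problem x
    where x.Invoice_Amount is not null
      and x.Invoice_Date < p.Invoice_Date
    order by x.Invoice_Date desc
) pr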
Not efficient but seems to work. Try:
update test set invoice_amount =
coalesce ((select top 1 next.invoice_amount from test next
where next.invoiceDate > test.invoiceDate and next.invoice_amount is not null
order by next.invoiceDate),
(select top 1 prev.invoice_amount from test prev
where prev.invoiceDate < test.invoiceDate and prev.invoice_amount is not null
order by prev.invoiceDate desc))
where invoice_amount is null;
As per the given example, you could use a window function with a self join:
update t set t.amount = tt.NewAmount
from table t
inner join (
select Dates, coalesce(min(amount) over (order by dates desc ROWS BETWEEN 1 PRECEDING AND CURRENT ROW),
min(amount) over (order by dates asc ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)) NewAmount
from table t
) tt on tt.dates = t.dates
where t.amount is null

SQL query group by nearby timestamp

I have a table with a timestamp column. I would like to be able to group by an identifier column (e.g. cusip), sum over another column (e.g. quantity), but only for rows that are within 30 seconds of each other, i.e. not in fixed 30 second bucket intervals. Given the data:
cusip| quantity| timestamp
============|=========|=============
BE0000310194| 100| 16:20:49.000
BE0000314238| 50| 16:38:38.110
BE0000314238| 50| 16:46:21.323
BE0000314238| 50| 16:46:35.323
I would like to write a query that returns:
cusip| quantity
============|=========
BE0000310194| 100
BE0000314238| 50
BE0000314238| 100
Edit:
In addition, it would greatly simplify things if I could also get the MIN(timestamp) out of the query.
Starting from Sean G's solution, I removed the GROUP BY on the complete table and readjusted a few parts for Oracle SQL.
First, after finding the previous time, assign a self parent ID. If there is a NULL in the previous time, then we exclude giving it an ID.
Now take the nearest self parent ID, avoiding NULLs, so that all CUSIP rows within 30 seconds fall under one group.
As there is a CUSIP column, I assumed the dataset would be large market transactional data. Instead of using GROUP BY on the complete table, use PARTITION BY on CUSIP and the final group parent ID for better performance.
SELECT
id,
sub.parent_id,
sub.cusip,
timestamp,
quantity,
sum(sub.quantity) OVER(
PARTITION BY cusip, parent_id
) sum_quantity,
MIN(sub.timestamp) OVER(
PARTITION BY cusip, parent_id
) min_timestamp
FROM
(
SELECT
base_sub.*,
CASE
WHEN base_sub.self_parent_id IS NOT NULL THEN
base_sub.self_parent_id
ELSE
LAG(base_sub.self_parent_id) IGNORE NULLS OVER(
PARTITION BY cusip
ORDER BY
timestamp, id
)
END parent_id
FROM
(
SELECT
c.*,
CASE
WHEN nvl(abs(EXTRACT(SECOND FROM to_timestamp(previous_timestamp, 'yyyy/mm/dd hh24:mi:ss') - to_timestamp
(timestamp, 'yyyy/mm/dd hh24:mi:ss'))), 31) > 30 THEN
id
ELSE
NULL
END self_parent_id
FROM
(
SELECT
my_table.id,
my_table.cusip,
my_table.timestamp,
my_table.quantity,
LAG(my_table.timestamp) OVER(
PARTITION BY my_table.cusip
ORDER BY
my_table.timestamp, my_table.id
) previous_timestamp
FROM
my_table
) c
) base_sub
) sub
Input data (table rows) and the resulting output are shown as screenshots.
The following may be helpful to you.
This groups 30-second periods starting from a given time, here '2012-01-01 00:00:00'. DATEDIFF counts the number of seconds between the timestamp value and the starting time. It is then divided by 30 to get the grouping column.
SELECT MIN(TimeColumn) AS TimeGroup, SUM(Quantity) AS TotalQuantity FROM YourTable
GROUP BY (DATEDIFF(ss, TimeColumn, '2012-01-01') / 30)
Here the minimum timestamp of each group is output as TimeGroup, but you can use the maximum, or the grouping column value can even be converted back to a time for display.
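For example, the grouping value can be mapped back to the start of its 30-second bucket like this (a sketch; the DATEDIFF argument order is flipped relative to the query above so the offset counts forward from the base time, and the DATEADD is wrapped in MIN so the select list contains only aggregates):
SELECT MIN(DATEADD(ss, 30 * (DATEDIFF(ss, '2012-01-01', TimeColumn) / 30), '2012-01-01')) AS BucketStart,
       SUM(Quantity) AS TotalQuantity
FROM YourTable
GROUP BY DATEDIFF(ss, '2012-01-01', TimeColumn) / 30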
Looking at the above comments, I'm assuming Chris's first scenario is the one you want (all 3 get grouped even though values 1 and 3 are not within 30 seconds of each other, but are each within 30 seconds of value 2). I'm also going to assume that each row in your table has some unique ID called 'id'. You can do the following:
Create a new grouping, determining if the preceding row in your partition is more than 30 seconds behind the current row (e.g. determine if you need a new 30 second grouping, or to continue the previous). We'll call that parent_id.
Sum quantity over parent_id (plus any other aggregations)
The code could look like this
select
    sub.parent_id,
    sub.cusip,
    min(sub.timestamp) min_timestamp,
    sum(sub.quantity) quantity
from
    (
        select
            base_sub.*,
            case
                when base_sub.self_parent_id is not null
                    then base_sub.self_parent_id
                -- carry the group-starting id forward onto the rows within 30 seconds
                else lag(base_sub.self_parent_id) ignore nulls over (
                        partition by
                            base_sub.cusip
                        order by
                            base_sub.timestamp,
                            base_sub.id
                    )
            end parent_id
        from
            (
                select
                    c.*,
                    -- start a new group when the previous row is more than 30 seconds
                    -- behind (or there is no previous row)
                    case
                        when datediff(
                                second,
                                nvl(c.previous_timestamp, to_date('1900/01/01', 'yyyy/mm/dd')),
                                c.timestamp) > 30
                            then c.id
                        else null
                    end self_parent_id
                from
                    (
                        select
                            my_table.id,
                            my_table.cusip,
                            my_table.timestamp,
                            my_table.quantity,
                            lag(my_table.timestamp) over (
                                partition by
                                    my_table.cusip
                                order by
                                    my_table.timestamp,
                                    my_table.id
                            ) previous_timestamp
                        from
                            my_table
                    ) c
            ) base_sub
    ) sub
group by
    sub.parent_id,
    sub.cusip

SQL: Updating a column based on subquery results

I have a T-SQL table that contains the following columns: Date, StationCode, HDepth, and MaxDepth. Each row's MaxDepth is set to 0 by default. What I am trying to do is find the maximum HDepth by Date and StationCode and update the MaxDepth column on those rows. I have written a SELECT statement to find where the maximums occur, and it is:
SELECT StationCode, [Date], MAX(HDepth) AS Maximum FROM dbo.[DepthTable] GROUP BY [Date], StationCode
How could I put this query into an Update statement to set the MaxDepth to 1 on the rows that are returned by this query?
You might try something like this:
UPDATE a
SET MaxDepth = 1
FROM dbo.[DepthTable] AS a
JOIN (
-- Your original query
SELECT StationCode, [Date], MAX(HDepth) AS Maximum
FROM dbo.[DepthTable]
GROUP BY [Date], StationCode
) AS b ON a.StationCode = b.StationCode
AND a.[DATE] = b.[DATE]
AND a.HDepth = b.Maximum -- Here we get only the max rows
However, if a column is simply based upon other columns, then you might think about putting this logic into a view (to avoid update anomalies). The select for such a view might look like:
SELECT a.[Date], a.StationCode, a.HDepth,
CASE WHEN b.Maximum IS NULL THEN 0 ELSE 1 END AS MaxDepth
FROM dbo.[DepthTable] AS a
LEFT JOIN (
-- Your original query
SELECT StationCode, [Date], MAX(HDepth) AS Maximum
FROM dbo.[DepthTable]
GROUP BY [Date], StationCode
) AS b ON a.StationCode = b.StationCode
AND a.[DATE] = b.[DATE]
AND a.HDepth = b.Maximum -- Here we get only the max rows
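For completeness, wrapped into a view definition it might look like this (a sketch; the view name is just an assumption):
CREATE VIEW dbo.DepthTableWithMax AS  -- hypothetical view name
SELECT a.[Date], a.StationCode, a.HDepth,
       CASE WHEN b.Maximum IS NULL THEN 0 ELSE 1 END AS MaxDepth
FROM dbo.[DepthTable] AS a
LEFT JOIN (
    SELECT StationCode, [Date], MAX(HDepth) AS Maximum
    FROM dbo.[DepthTable]
    GROUP BY [Date], StationCode
) AS b ON a.StationCode = b.StationCode
      AND a.[Date] = b.[Date]
      AND a.HDepth = b.Maximum;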