Replace first and last rows having NULL or missing values with the previous/next available value in PostgreSQL 12 - sql

I am a newbie to PostgreSQL.
I want to replace the first and last rows of table T, which have NULL or missing values, with the next/previous available values. Also, if there are missing values in the middle, they should be replaced with the previous available value. For example:
id  value  EXPECTED
1   NULL   1
2   1      1
3   2      2
4   NULL   2
5   3      3
6   NULL   3
I am aware that there are many similar threads, but none seems to address the case where the start and end also have missing values (in addition to some missing middle rows). Also, some of the concepts such as first_value, partition by, and top 1 (which does not work for Postgres) are very hard to grasp as a newbie.
So far I have referred to the following threads: value from previous row and Previous available value.
Could someone kindly direct me in the right direction to address this problem?
Thank you

Unfortunately, Postgres doesn't have the ignore nulls option on lead() and lag(). In your example, each missing value only needs to borrow from an adjacent row. So:
select t.*,
       coalesce(value,
                lag(value) over (order by id),
                lead(value) over (order by id)) as expected
from t;
If you had multiple NULLs in a row, then this is trickier. One solution is to define "groups" based on when a value starts or stops. You can do this with a cumulative count of the values -- ascending and descending:
select t.*,
       coalesce(value,
                max(value) over (partition by grp_before),
                max(value) over (partition by grp_after)
               ) as expected
from (select t.*,
             count(value) over (order by id asc) as grp_before,
             count(value) over (order by id desc) as grp_after
      from t
     ) t;
Here is a db<>fiddle.
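If you want to try this locally, here is a minimal, self-contained sketch (the table name t and the id/value columns come from the question; the data is the sample above):

-- Minimal reproduction of the question's sample data.
create table t (id int primary key, value int);
insert into t (id, value) values
    (1, null), (2, 1), (3, 2), (4, null), (5, 3), (6, null);

-- Each NULL here has a non-NULL neighbour, so lag()/lead() suffice:
select t.*,
       coalesce(value,
                lag(value) over (order by id),
                lead(value) over (order by id)) as expected
from t;
-- The expected column comes back as 1, 1, 2, 2, 3, 3.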

Related

Function to REPLACE* last previous known value for NULL

I want to fill the NULL values with the last given value for that column. A small sample of the data:
date        country   people_vaccinated
2021-08-15  Bulgaria  1081636
2021-08-16  Bulgaria  1084693
2021-08-17  Bulgaria  1089066
2021-08-18  Bulgaria  NULL
2021-08-19  Bulgaria  NULL
In this example, the NULL values should be 1089066 until I reach the next non-NULL value.
I tried the answer given in this response, but to no avail. Any help would be appreciated, thank you!
EDIT: Sorry, I got sidetracked trying to return the last value and forgot my ultimate goal, which is to replace the NULL values with the previous known value.
Therefore the query should be something like
UPDATE covid_data
SET people_vaccinated = ISNULL(?)
Assuming the number you have is always increasing, you can use the MAX aggregate over a window:
SELECT dt
     , country
     , cnt
     , MAX(cnt) OVER (PARTITION BY country ORDER BY dt)
FROM #data
If the number may decrease, the query becomes a little more complex: we first need to mark the rows that have NULLs as belonging to the same group as the last row without a NULL:
SELECT dt
     , country
     , cnt
     , SUM(cnt) OVER (PARTITION BY country, partition)
FROM (
    SELECT country
         , dt
         , cnt
         , SUM(CASE WHEN cnt IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY country ORDER BY dt) AS partition
    FROM #data
) AS d
ORDER BY dt
Here's a working demo on dbfiddle. It returns the same data with an ever-increasing amount, but if you change the number for 08-17 to be lower than that of 08-16, you'll see the MAX(...) method produce wrong results.
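If you need the UPDATE from the question rather than a SELECT, the same grouping idea can drive an update through a CTE. This is only a sketch, assuming SQL Server and a covid_data table with dt, country, and people_vaccinated columns (names taken from the question's fragment, not verified):

-- Sketch: fill NULL people_vaccinated with the previous known value.
WITH grouped AS (
    SELECT country, dt, people_vaccinated,
           SUM(CASE WHEN people_vaccinated IS NULL THEN 0 ELSE 1 END)
               OVER (PARTITION BY country ORDER BY dt) AS grp
    FROM covid_data
), filled AS (
    SELECT people_vaccinated,
           MAX(people_vaccinated) OVER (PARTITION BY country, grp) AS filled_value
    FROM grouped
)
UPDATE filled
SET people_vaccinated = filled_value
WHERE people_vaccinated IS NULL;
-- Rows before the first known value stay NULL (their group has no value to copy).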
In many datasets it is incorrect to make assumptions about the behaviour of the underlying data. If your goal is simply to fill the blanks that might appear midway through a dataset, then the answer to the post you referenced, A: sql server nulls duplicate last known value in table, is still one of the best solutions. Here is an adaptation:
SELECT dt
     , country
     , cnt
     , ISNULL(source.cnt, excludeNulls.LastCnt)
FROM #data source
OUTER APPLY ( SELECT TOP 1 cnt AS LastCnt
              FROM #data
              WHERE dt < source.dt
                AND cnt IS NOT NULL
              ORDER BY dt DESC ) excludeNulls
ORDER BY dt
MAX and LAST_VALUE will give you a value with respect to the entire record set, which would not work with the existing solutions if you had a value for 2021-08-19: in that case the last value would be used to fill the gaps, not the previous non-NULL value.
When we need to fill gaps that occur part-way through the results, we need to apply a filter to the window query. TOP 1 ... ORDER BY gives us the ability to filter and sort on entirely different fields from the one we want to capture, and it also means we can carry forward the last value for fields that are not numeric. See this fiddle for a few other examples: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=372285d29f97dbb9663e8552af6fb7a2
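As a further hedged illustration of that point, here is the same pattern carrying a text column forward; the #events temp table and its status column are hypothetical, not from the question:

-- Hypothetical: carry the last known status (a varchar) forward over NULLs.
SELECT dt
     , ISNULL(source.status, lastKnown.status) AS status
FROM #events source
OUTER APPLY ( SELECT TOP 1 status
              FROM #events
              WHERE dt < source.dt
                AND status IS NOT NULL
              ORDER BY dt DESC ) lastKnown
ORDER BY dt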

Comparing time difference for every other row

I'm trying to determine the length of time in days between the AR_Event_Creation_Date_Time values of every other row. For example, the number of days between the 1st and 2nd rows, the 3rd and 4th, the 5th and 6th, etc. In other words, there will be a number-of-days value for every even row and NULL for every odd row. My code below works if there are only two rows per borrower number but falls down when there are more than two. In the results, notice the change in 1002092539.
SELECT Borrower_Number,
       Workgroup_Name,
       FORMAT(AR_Event_Creation_Date_Time, 'd', 'en-us') AS Tag_Date,
       Usr_Usrnm,
       DATEDIFF(day,
                LAG(AR_Event_Creation_Date_Time, 1) OVER (PARTITION BY Borrower_Number ORDER BY Borrower_Number),
                AR_Event_Creation_Date_Time) AS Diff
FROM Control_Mail
You need to add in a row number. Also, your window's ORDER BY is non-deterministic, because it orders by the same column it partitions by:
SELECT Borrower_Number,
       Workgroup_Name,
       FORMAT(AR_Event_Creation_Date_Time, 'd', 'en-us') AS Tag_Date,
       Usr_Usrnm,
       DATEDIFF(day,
                LAG(AR_Event_Creation_Date_Time, 1) OVER (PARTITION BY Borrower_Number, (rn - 1) / 2 ORDER BY AR_Event_Creation_Date_Time),
                AR_Event_Creation_Date_Time) AS Diff
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Borrower_Number ORDER BY AR_Event_Creation_Date_Time) AS rn
    FROM Control_Mail
) C
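To see why the (rn - 1) / 2 trick pairs rows up, here is a hedged, self-contained illustration (the values are invented): integer division maps rn 1 and 2 to pair 0, rn 3 and 4 to pair 1, and so on, so LAG within each pair fires only on the second row.

-- Integer division pairs consecutive row numbers: (1,2), (3,4), ...
SELECT rn,
       (rn - 1) / 2 AS pair_number
FROM (VALUES (1), (2), (3), (4)) AS v(rn);
-- rn | pair_number
--  1 | 0
--  2 | 0
--  3 | 1
--  4 | 1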

How do I group a set of entities (people) into 10 equal portions based on their usage, using Oracle's Toad 11g (SQL)

Hi, I have a list of 2+ million people and their usage, put in order from largest to smallest.
I tried ranking using row_number() over (partition by user column order by usage desc) as rnk, but that didn't work: the results were crazy.
Simply put, I just want 10 equal groups, with the first group consisting of the highest usage, in the order in which I had first listed them.
HELP!
You can use ntile():
select t.*, ntile(10) over (order by usage desc) as usage_decile
from t;
The only caveat: this will divide the data into exactly 10 equal-sized groups, so if usage values have duplicates, users with the same usage may end up in different deciles.
If you don't want that behavior, use a more manual calculation:
select t.*,
       ceil(rank() over (order by usage desc) * 10 /
            count(*) over ()
           ) as usage_decile
from t;
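To make the difference concrete, here is a hedged Oracle sketch on an invented four-row dataset, scaled down to halves instead of deciles so the output stays small. The tied users (usage 90) may land in different NTILE buckets, while the rank-based formula keeps them together:

WITH t AS (
    SELECT 1 AS user_id, 100 AS usage_amt FROM dual UNION ALL
    SELECT 2, 90 FROM dual UNION ALL
    SELECT 3, 90 FROM dual UNION ALL
    SELECT 4, 80 FROM dual
)
SELECT user_id, usage_amt,
       NTILE(2) OVER (ORDER BY usage_amt DESC) AS ntile_group,
       CEIL(RANK() OVER (ORDER BY usage_amt DESC) * 2 / COUNT(*) OVER ()) AS manual_group
FROM t;
-- ntile_group:  1, 1, 2, 2 (the tie is split across groups)
-- manual_group: 1, 1, 1, 2 (the tie stays together; groups are uneven)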

SQL Server Lag function adding range

Hi, I am a newbie when it comes to SQL and was hoping someone could help me with this matter. I've been using the lag function here and there, but was wondering if there is a way to rewrite it into a summed range. So instead of the prior one month, I want to take the prior 12 months and sum them together for each period. I don't want to write 12 lines of lag and was wondering if there is a way to get it with less code. Note there will be NULLs, and if one of the 12 records is NULL then the result should be NULL.
I know you can write a subquery to do this, but was wondering if this is possible. Any help would be much appreciated.
You want the "window frame" part of the window function. A moving 12-month sum would look like:
select t.*,
       sum(balance) over (order by period rows between 11 preceding and current row) as moving_sum_12
from t;
You can review window frames in the documentation.
If you want a cumulative sum, you can leave out the window frame entirely.
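For reference, that cumulative version is just the same query without the frame clause (a sketch against the same hypothetical table t):

-- With ORDER BY and no explicit frame, the default frame runs from the
-- start of the partition to the current row, i.e. a running total.
select t.*,
       sum(balance) over (order by period) as cumulative_sum
from t;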
I should note that you can also do this using lag(), but it is much more complicated:
select t.*,
       (balance +
        lag(balance, 1, 0) over (order by period) +
        lag(balance, 2, 0) over (order by period) +
        . . .
        lag(balance, 11, 0) over (order by period)
       ) as moving_sum_12
from t;
This uses the little-known third argument to lag(), which is the default value to use if the record is not available; it replaces a coalesce().
EDIT:
If you want NULL whenever the window does not contain 12 non-NULL values, use case and count(balance) (which counts only non-NULL balances) as well:
select t.*,
       (case when count(balance) over (order by period rows between 11 preceding and current row) = 12
             then sum(balance) over (order by period rows between 11 preceding and current row)
        end) as moving_sum_12
from t;

Calculating deltas in time series with duplicate & missing values

I have an Oracle table that consists of tuples of logtime/value1, value2, ..., plus additional columns such as a metering point id. The values are sampled values of different counters that are each monotonically increasing, i.e. a newer value cannot be less than an older value. However, values can remain equal across several samplings, and values can sometimes be missing, so the corresponding table entry is NULL while other values at the same logtime are valid. Also, the intervals between logtimes are not constant.
In the following, for simplicity, I will consider only the logtime and one counter value.
I have to calculate the deltas from each logtime to the previous one. Using the method described in another question here gives two NULL deltas for each NULL value, because two subtractions are invalid. A second solution fails when consecutive values are identical, since the difference to the previous value is calculated twice.
Another solution is to construct a derived table/view with the NULL values replaced by the latest older valid value. My approach looks like this:
SELECT A.logtime, A.val,
       (A.val - (SELECT MAX(C.val)
                 FROM tab C
                 WHERE C.logtime =
                       (SELECT MAX(B.logtime)
                        FROM tab B
                        WHERE B.logtime < A.logtime
                          AND B.val IS NOT NULL))) AS delta
FROM tab A;
I suspect that this will result in a quite inefficient query, especially when doing this for all N counters in the table, which will result in (1 + 2*N) SELECTs. It also does not take advantage of the fact that the counter is monotonically increasing.
Are there any alternative approaches? I'd think others have similar problems, too.
An obvious solution would of course be to fill in those NULL values by constructing a new table or modifying the existing one, but unfortunately that is not possible in this case. Avoiding/eliminating them on entry isn't possible either.
Any help would be greatly appreciated.
select logtime,
       val,
       last_value(val ignore nulls) over (order by logtime) as not_null_val,
       last_value(val ignore nulls) over (order by logtime) -
       last_value(val ignore nulls) over (order by logtime rows between unbounded preceding and 1 preceding) as delta
from your_tab
order by logtime;
I found a way to avoid the nested SELECT statements using Oracle SQL's built-in LAG function:
SELECT logtime, val,
       NVL(val - LAG(val IGNORE NULLS) OVER (ORDER BY logtime), 0) AS delta
FROM tab;
This seems to work as I intended.
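As a hedged sanity check on invented sample data (tab and its columns follow the question), a run with a NULL in the middle behaves like this:

-- Hypothetical sample: counter values 10, 10, NULL, 13.
WITH tab AS (
    SELECT DATE '2024-01-01' AS logtime, 10 AS val FROM dual UNION ALL
    SELECT DATE '2024-01-02', 10 FROM dual UNION ALL
    SELECT DATE '2024-01-03', NULL FROM dual UNION ALL
    SELECT DATE '2024-01-04', 13 FROM dual
)
SELECT logtime, val,
       NVL(val - LAG(val IGNORE NULLS) OVER (ORDER BY logtime), 0) AS delta
FROM tab;
-- delta: 0 (no previous value), 0, 0 (NULL sample), 3 (13 - 10)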