first_value over non null values not working in Spark SQL - sql

I am trying to run a query on Spark SQL, where I want to fill the missing average_price (NULL) values with the next non null average price
Problem:
Desired Result:
Result I am getting from my query below
Here is the query I am using
spark.sql("""
select *,
CASE
WHEN average_price IS NULL AND store_id = 0 THEN
first_value(average_price, yes)
OVER
(
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between current row and 3 following
)
ELSE 0
END AS new_av_price
from table
""")
what am I doing wrong here?

I understand that your spark-sql version does not support the IGNORE NULLS syntax; see https://issues.apache.org/jira/browse/SPARK-30789.
You can go with this:
select
t1.*,
(select min(t2.average_price)
from Tbl t2
where t1.product_id=t2.product_id
and t2.purchase_dt=(select min(t3.purchase_dt)
from Tbl t3
where t3.product_id = t1.product_id
and t3.purchase_dt >= t1.purchase_Dt
and t3.average_price is not null
)
) as new_average_price
from Tbl t1
or this:
select
t1.*,
t2.average_price
from
Tbl t1
left join
Tbl t2
on t2.product_id = t1.product_id
and t2.average_price is not null
and t2.purchase_dt = (select min(t3.purchase_dt)
from Tbl t3
where t3.product_id=t1.product_id
and t3.purchase_dt>=t1.purchase_dt
and t3.average_price is not null)
These assume that you have only one row per product_id, purchase_dt. If you can have more than one row, you need to add additional logic to get rid of all but one row.
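As a quick sanity check of the first query, here is a minimal sketch that runs it against SQLite through Python's sqlite3 module rather than Spark. The sample rows and the integer purchase_dt values are made up for illustration; the Tbl, product_id, purchase_dt and average_price names follow the query above.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tbl (product_id INTEGER, purchase_dt INTEGER, average_price REAL)"
)
conn.executemany(
    "INSERT INTO Tbl VALUES (?, ?, ?)",
    [
        (1, 1, None),
        (1, 2, None),
        (1, 3, 10.0),
        (1, 4, None),
        (1, 5, 12.5),
        (1, 6, None),  # no later non-NULL price exists for this row
        (2, 1, 7.0),
    ],
)

# For each row, find the earliest purchase_dt at or after the current one
# that has a non-NULL price, then read the price stored at that date.
query = """
SELECT t1.product_id,
       t1.purchase_dt,
       (SELECT MIN(t2.average_price)
          FROM Tbl t2
         WHERE t2.product_id = t1.product_id
           AND t2.purchase_dt = (SELECT MIN(t3.purchase_dt)
                                   FROM Tbl t3
                                  WHERE t3.product_id = t1.product_id
                                    AND t3.purchase_dt >= t1.purchase_dt
                                    AND t3.average_price IS NOT NULL)
       ) AS new_average_price
  FROM Tbl t1
 ORDER BY t1.product_id, t1.purchase_dt
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Rows that already have a price keep it, and a row with no later non-NULL price (purchase_dt 6 in this sample) stays NULL.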
UPDATE 20220405:
If you can't use a JOIN, but you know that the non-NULL value is only up to 3 rows away, could you use:
COALESCE(
average_price
, first_value(average_price) OVER (
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between 1 following and 1 following
)
, first_value(average_price) OVER (
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between 2 following and 2 following
)
/* , ... */
) as new_average_price

Related

How do I exclude null duplicates without excluding valid nulls - SQL Server

I'm using a SQL Server and am trying to pull certain data. However, some rows pop up with duplicate order #s, where one line # is null while the other has a correct value.
The tricky thing is that there also are other line #s that are null, and if I were to do a filter that line_id is not null then I would exclude all the valid order #s with null values. Would I use a case statement? A subquery? I'm at a loss.
Here's an abridged version of my code and what I mean:
select
order_number,
line_number
from table_1
With NOT EXISTS:
select t.*
from tablename t
where t.line_number is not null
or not exists (
select 1 from tablename
where order_number = t.order_number and line_number is not null
)
or with ROW_NUMBER() window function:
select t.order_number, t.line_number
from (
select *,
row_number() over (partition by order_number order by case when line_number is not null then 1 else 2 end) rn
from tablename
) t
where t.rn = 1
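To see what the NOT EXISTS version keeps and drops, here is a small sketch run against SQLite through Python's sqlite3 module instead of SQL Server. The four sample rows are invented: order 100 has a NULL duplicate next to a real line number, order 200 has only a NULL line (a "valid" NULL), and order 300 has only a real line number.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (order_number INTEGER, line_number INTEGER)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [
        (100, None),  # NULL duplicate of order 100, should be dropped
        (100, 1),
        (200, None),  # only row for order 200, should be kept
        (300, 2),
    ],
)

# Keep a row if it has a line number, or if its order has no row
# with a line number at all.
query = """
SELECT t.order_number, t.line_number
  FROM tablename t
 WHERE t.line_number IS NOT NULL
    OR NOT EXISTS (SELECT 1
                     FROM tablename x
                    WHERE x.order_number = t.order_number
                      AND x.line_number IS NOT NULL)
 ORDER BY t.order_number
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike the ROW_NUMBER() version, this form keeps every non-NULL line of an order, not just one row per order.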

Change string split select SQL query into update query

I am trying to split a column in a SQL table into two columns where the data in the column is separated with a “-”, I have managed to edit a query I found online to do that.
The issue is, it only returns the data for viewing as it is a select query.
How can I change this query to update the table?
select
case when CHARINDEX('-',ProjectNumber)>0
then SUBSTRING(ProjectNumber,1,CHARINDEX('-',ProjectNumber)-1)
else ProjectNumber end ProjectNumber,
CASE WHEN CHARINDEX('-',ProjectNumber)>0
THEN SUBSTRING(ProjectNumber,CHARINDEX('-',ProjectNumber)+1,len(ProjectNumber))
ELSE NULL END as Vessel
from dbo.Stock
EDIT:
I have tried this:
update dbo.stock set vessel =
case when CHARINDEX('-',ProjectNumber)>0
then SUBSTRING(ProjectNumber,1,CHARINDEX('-',ProjectNumber)-1)
else ProjectNumber end **ProjectNumber,**
CASE WHEN CHARINDEX('-',ProjectNumber)>0
THEN SUBSTRING(ProjectNumber,CHARINDEX('-',ProjectNumber)+1,len(ProjectNumber))
ELSE NULL END as Vessel
But it is telling me i have a syntax error near ProjectNumber, it's the one i put stars around.
Why not use the STRING_SPLIT function? (https://learn.microsoft.com/es-es/sql/t-sql/functions/string-split-transact-sql?view=sql-server-2017#examples):
UPDATE dbo.Stock
SET vessel = id1,
vessel2 = id2
FROM (
SELECT id,
(SELECT TOP 1 value as val
FROM STRING_SPLIT(id, '-')
ORDER BY (ROW_NUMBER() OVER(ORDER BY 1 ASC)) ASC
) AS id1,
(SELECT TOP 1 value as val
FROM STRING_SPLIT(id, '-')
ORDER BY (ROW_NUMBER() OVER(ORDER BY 1 ASC)) DESC
) AS id2
FROM dbo.Stock
) A inner join dbo.Stock B
ON A.id = B.id;
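The syntax error in the attempted UPDATE comes from carrying over the SELECT-style column aliases (the trailing ProjectNumber and "as Vessel"); in an UPDATE each column needs its own "column = expression" assignment. Below is a minimal sketch of that shape, run against SQLite through Python's sqlite3 module. SQLite's instr and substr stand in for CHARINDEX and SUBSTRING, and the two sample rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for dbo.Stock (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stock (ProjectNumber TEXT, Vessel TEXT)")
conn.executemany(
    "INSERT INTO Stock VALUES (?, ?)",
    [("P100-Aurora", None), ("P200", None)],
)

# One assignment per column. The right-hand sides are evaluated against
# the row's values from before the update, so both CASE expressions
# still see the unsplit ProjectNumber.
conn.execute(
    """
    UPDATE Stock
       SET Vessel = CASE WHEN instr(ProjectNumber, '-') > 0
                         THEN substr(ProjectNumber, instr(ProjectNumber, '-') + 1)
                         ELSE NULL END,
           ProjectNumber = CASE WHEN instr(ProjectNumber, '-') > 0
                                THEN substr(ProjectNumber, 1, instr(ProjectNumber, '-') - 1)
                                ELSE ProjectNumber END
    """
)
rows = conn.execute(
    "SELECT ProjectNumber, Vessel FROM Stock ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

The same two assignments should translate back to T-SQL by swapping in the CHARINDEX and SUBSTRING expressions from the question.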

SQL Get rows based on conditions

I'm currently having trouble writing the business logic to get rows from a table with id's and a flag which I have appended to it.
For example,
id: id seq num: flag: Date:
A 1 N ..
A 2 N ..
A 3 N
A 4 Y
B 1 N
B 2 Y
B 3 N
C 1 N
C 2 N
The end result I'm trying to achieve is that:
For each unique ID I just want to retrieve one row with the condition for that row being that
If the flag was a "Y" then return that row.
Else return the last "N" row.
Another thing to note is that the 'Y' flag is not always necessarily the last
I've been trying to get a case condition using a partition like
OVER (PARTITION BY A."ID" ORDER BY A."Seq num") but so far no luck.
-- EDIT:
From the table, the sample result would be:
id: id seq num: flag: date:
A 4 Y ..
B 2 Y ..
C 2 N ..
Using a window clause is the right idea. You should partition the results by the ID (as you've done), and order them so the Y flag rows come first, then all the N flag rows in descending date order, and pick the first for each id:
SELECT id, id_seq_num, flag, date
FROM (SELECT id, id_seq_num, flag, date,
ROW_NUMBER() OVER (PARTITION BY id
ORDER BY CASE flag WHEN 'Y' THEN 0
ELSE 1
END ASC,
date DESC) AS rk
FROM mytable) t
WHERE rk = 1
My approach is to take a UNION of two queries. The first query simply selects all Yes records, assuming that Yes only appears once per ID group. The second query targets only those ID having no Yes anywhere. For those records, we use the row number to select the most recent No record.
WITH cte1 AS (
SELECT id
FROM yourTable
GROUP BY id
HAVING SUM(CASE WHEN flag = 'Y' THEN 1 ELSE 0 END) = 0
),
cte2 AS (
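SELECT t1.*,
ROW_NUMBER() OVER (PARTITION BY t1.id ORDER BY t1."id seq" DESC) rn
FROM yourTable t1
INNER JOIN cte1 t2
ON t1.id = t2.id
)
-- column names assumed; listed explicitly so both branches of the UNION match
SELECT id, "id seq", flag, date
FROM yourTable
WHERE flag = 'Y'
UNION ALL
SELECT id, "id seq", flag, date
FROM cte2 t2
WHERE t2.rn = 1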
Here's one way (with quite generic SQL):
select t1.*
from Table1 as t1
where t1.id_seq_num = COALESCE(
(select max(id_seq_num) from Table1 as T2 where t1.id = t2.id and t2.flag = 'Y') ,
(select max(id_seq_num) from Table1 as T3 where t1.id = t3.id and t3.flag = 'N') )
Available in a fiddle here: http://sqlfiddle.com/#!9/5f7f9/6
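For a local check without the fiddle, the COALESCE query can be run against SQLite through Python's sqlite3 module. This is only a sketch: the rows are the sample data from the question, with the date column left out because its values were not given.

```python
import sqlite3

# In-memory SQLite database loaded with the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id TEXT, id_seq_num INTEGER, flag TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        ("A", 1, "N"), ("A", 2, "N"), ("A", 3, "N"), ("A", 4, "Y"),
        ("B", 1, "N"), ("B", 2, "Y"), ("B", 3, "N"),
        ("C", 1, "N"), ("C", 2, "N"),
    ],
)

# Prefer the highest 'Y' sequence number for the id; if the id has no
# 'Y' row the first subquery yields NULL and COALESCE falls back to
# the highest 'N' sequence number.
query = """
SELECT t1.id, t1.id_seq_num, t1.flag
  FROM Table1 AS t1
 WHERE t1.id_seq_num = COALESCE(
         (SELECT MAX(id_seq_num) FROM Table1 AS t2
           WHERE t1.id = t2.id AND t2.flag = 'Y'),
         (SELECT MAX(id_seq_num) FROM Table1 AS t3
           WHERE t1.id = t3.id AND t3.flag = 'N'))
 ORDER BY t1.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result matches the sample result in the question: A/4/Y, B/2/Y and C/2/N.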
SELECT DISTINCT id, flag
FROM yourTable

SQL Case depending on previous status of record

I have a table containing the status of records. Something like this:
ID STATUS TIMESTAMP
1 I 01-01-2016
1 A 01-03-2016
1 P 01-04-2016
2 I 01-01-2016
2 P 01-02-2016
3 P 01-01-2016
I want to make a case where I take the newest version of each row, and for all P that has at some point been an I, they should be cased as a 'G' instead of P.
When I try to do something like
Select case when ID in (select ID from TABLE where ID = 'I') else ID END as status)
From TABLE
where ID in (select max(ID) from TABLE)
I get an error that this isn't possible using IN when casing.
So my question is, how do I do it then?
Want to end up with:
ID STATUS TIMESTAMP
1 G 01-04-2016
2 G 01-02-2016
3 P 01-01-2016
DBMS is IBM DB2
Have a derived table which returns each id with its newest timestamp. Join with that result:
select t1.ID, t1.STATUS, t1.TIMESTAMP
from tablename t1
join (select id, max(timestamp) as max_timestamp
from tablename
group by id) t2
ON t1.id = t2.id and t1.TIMESTAMP = t2.max_timestamp
Will return both rows in case of a tie (two rows with same newest timestamp.)
Note that ANSI SQL has TIMESTAMP as reserved word, so you may need to delimit it as "TIMESTAMP".
You can do this by using a common table expression to find all IDs that have had a status of 'I', and then using an outer join with your table to determine which IDs have had a status of 'I' at some point.
To get the final result (with only the newest record), you can use the row_number() OLAP function and select only the "newest" record (this is shown in the ranked common table expression below):
with irecs (ID) as (
select distinct
ID
from
TABLE
where
status = 'I'
),
ranked as (
select
rownumber() over (partition by t.ID order by t.timestamp desc) as rn,
t.id,
case when i.id is null then t.status else 'G' end as status,
t.timestamp
from
TABLE t
left outer join irecs i
on t.id = i.id
)
select
id,
status,
timestamp
from
ranked
where
rn = 1;
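Here is a rough way to try the idea on the question's sample data, run against SQLite through Python's sqlite3 module rather than DB2. Several things are assumptions for the sketch: the table is called status_log, the timestamp column is renamed ts to avoid the reserved word, and the dates are written as ISO strings (read as day-month-year) so they sort correctly. It also swaps ROW_NUMBER() for a correlated MAX subquery and only relabels rows whose newest status is 'P', as the question asks.

```python
import sqlite3

# In-memory SQLite database with the question's sample rows (names assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_log (id INTEGER, status TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO status_log VALUES (?, ?, ?)",
    [
        (1, "I", "2016-01-01"),
        (1, "A", "2016-03-01"),
        (1, "P", "2016-04-01"),
        (2, "I", "2016-01-01"),
        (2, "P", "2016-02-01"),
        (3, "P", "2016-01-01"),
    ],
)

# irecs lists every id that has ever had status 'I'. The outer query keeps
# only the newest row per id and relabels 'P' as 'G' when the id is in irecs.
query = """
WITH irecs (id) AS (
    SELECT DISTINCT id FROM status_log WHERE status = 'I'
)
SELECT t.id,
       CASE WHEN t.status = 'P' AND i.id IS NOT NULL THEN 'G'
            ELSE t.status END AS status,
       t.ts
  FROM status_log t
  LEFT JOIN irecs i ON i.id = t.id
 WHERE t.ts = (SELECT MAX(m.ts) FROM status_log m WHERE m.id = t.id)
 ORDER BY t.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Like the derived-table join earlier, the MAX filter returns more than one row per id if two rows tie on the newest timestamp.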
other solution
with youtableranked as (
select f1.id,
case when (select count(*) from yourtable f2 where f2.ID=f1.ID and f2."TIMESTAMP"<f1."TIMESTAMP" and f2.STATUS='I')>0 then 'G' else f1.STATUS end as STATUS,
rownumber() over(partition by f1.id order by f1.TIMESTAMP desc, rrn(f1) desc) rang,
f1."TIMESTAMP"
from yourtable f1
)
select * from youtableranked f0
where f0.rang=1
ANSI SQL has TIMESTAMP as reserved word, so you may need to delimit it as "TIMESTAMP"
try this
select distinct f1.id, f4.*
from yourtable f1
inner join lateral
(
select
case when (select count(*) from yourtable f3 where f3.ID=f2.ID and f3."TIMESTAMP"<f2."TIMESTAMP" and f3.STATUS='I')>0 then 'G' else f2.STATUS end as STATUS,
f2."TIMESTAMP"
from yourtable f2 where f2.ID=f1.ID
order by f2."TIMESTAMP" desc, rrn(f2) desc
fetch first rows only
) f4 on 1=1
The rrn(f2) ordering breaks ties when several rows share the same latest timestamp.
ANSI SQL has TIMESTAMP as reserved word, so you may need to delimit it as "TIMESTAMP"

LAG functions and NULLS

How can I tell the LAG function to get the last "not null" value?
For example, see my table below where I have a few NULL values on column B and C.
I'd like to fill the nulls with the last non-null value. I tried to do that by using the LAG function, like so:
case when B is null then lag (B) over (order by idx) else B end as B,
but that doesn't quite work when I have two or more nulls in a row (see the NULL value on column C row 3 - I'd like it to be 0.50 as the original).
Any idea how can I achieve that?
(it doesn't have to be using the LAG function, any other ideas are welcome)
A few assumptions:
The number of rows is dynamic;
The first value will always be non-null;
Once I have a NULL, is NULL all up to the end - so I want to fill it with the latest value.
Thanks
You can do it with outer apply operator:
select t.id,
t1.colA,
t2.colB,
t3.colC
from table t
outer apply(select top 1 colA from table where id <= t.id and colA is not null order by id desc) t1
outer apply(select top 1 colB from table where id <= t.id and colB is not null order by id desc) t2
outer apply(select top 1 colC from table where id <= t.id and colC is not null order by id desc) t3;
This will work, regardless of the number of nulls or null "islands". You may have values, then nulls, then again values, again nulls. It will still work.
If, however the assumption (in your question) holds:
Once I have a NULL, is NULL all up to the end - so I want to fill it with the latest value.
there is a more efficient solution. We only need to find the latest (when ordered by idx) values. Modifying the above query, removing the where id <= t.id from the subqueries:
select t.id,
colA = coalesce(t.colA, t1.colA),
colB = coalesce(t.colB, t2.colB),
colC = coalesce(t.colC, t3.colC)
from table t
outer apply (select top 1 colA from table
where colA is not null order by id desc) t1
outer apply (select top 1 colB from table
where colB is not null order by id desc) t2
outer apply (select top 1 colC from table
where colC is not null order by id desc) t3;
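The general "last non-NULL value at or before this row" lookup from the first query of this answer can be tried out in SQLite through Python's sqlite3 module. This is a sketch under assumptions: SQLite has no OUTER APPLY or TOP, so a scalar subquery with ORDER BY ... DESC LIMIT 1 stands in for them, and the readings table with its idx, B and C columns and sample values is made up.

```python
import sqlite3

# In-memory SQLite database with invented sample rows (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (idx INTEGER, B REAL, C REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, 10.0, 0.5),
        (2, 11.0, None),
        (3, None, None),  # two NULLs in a row in column C
        (4, None, None),
    ],
)

# For each row and column, take the newest non-NULL value at or before
# the current idx. A row that already has a value finds itself.
query = """
SELECT t.idx,
       (SELECT s.B FROM readings s
         WHERE s.idx <= t.idx AND s.B IS NOT NULL
         ORDER BY s.idx DESC LIMIT 1) AS B,
       (SELECT s.C FROM readings s
         WHERE s.idx <= t.idx AND s.C IS NOT NULL
         ORDER BY s.idx DESC LIMIT 1) AS C
  FROM readings t
 ORDER BY t.idx
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Consecutive NULLs are filled as well, which is the case the single LAG call in the question misses.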
You could make a change to your ORDER BY, to force the NULLs to be first in your ordering, but that may be expensive...
lag(B) over (order by CASE WHEN B IS NULL THEN -1 ELSE idx END)
Or, use a sub-query to calculate the replacement value once. Possibly less expensive on larger sets, but very clunky.
- Relies on all the NULLs coming at the end
- The LAG doesn't rely on that
COALESCE(
B,
(
SELECT
sorted_not_null.B
FROM
(
SELECT
table.B,
ROW_NUMBER() OVER (ORDER BY table.idx DESC) AS row_id
FROM
table
WHERE
table.B IS NOT NULL
)
sorted_not_null
WHERE
sorted_not_null.row_id = 1
)
)
(This should be faster on larger data sets than LAG or OUTER APPLY with correlated sub-queries, simply because the value is calculated once. For tidiness, you could calculate and store the [last_known_value] for each column in variables, then just use COALESCE(A, @last_known_A), COALESCE(B, @last_known_B), etc.)
If it is NULL all the way to the end, you can take a shortcut:
declare @b varchar(20) = (select top 1 b from table where b is not null order by id desc);
declare @c varchar(20) = (select top 1 c from table where c is not null order by id desc);
select id, isnull(b,@b) as b, isnull(c,@c) as c
from table;
Select max(diff) from(
Select
Case when lag(a) over (order by b) is not null
Then (a -lag(a) over (order by b)) end as diff
From <tbl_name> where
<relevant conditions>
Order by b) k
Works fine in db visualizer.
UPDATE table
SET B = (@n := COALESCE(B , @n))
WHERE B is null;