LAG functions and NULLS - sql

How can I tell the LAG function to get the last "not null" value?
For example, see my table bellow where I have a few NULL values on column B and C.
I'd like to fill the nulls with the last non-null value. I tried to do that by using the LAG function, like so:
case when B is null then lag (B) over (order by idx) else B end as B,
but that doesn't quite work when I have two or more nulls in a row (see the NULL value on column C row 3 - I'd like it to be 0.50 as the original).
Any idea how can I achieve that?
(it doesn't have to be using the LAG function, any other ideas are welcome)
A few assumptions:
The number of rows is dynamic;
The first value will always be non-null;
Once I have a NULL, is NULL all up to the end - so I want to fill it with the latest value.
Thanks

You can do it with outer apply operator:
select t.id,
t1.colA,
t2.colB,
t3.colC
from table t
outer apply(select top 1 colA from table where id <= t.id and colA is not null order by id desc) t1
outer apply(select top 1 colB from table where id <= t.id and colB is not null order by id desc) t2
outer apply(select top 1 colC from table where id <= t.id and colC is not null order by id desc) t3;
This will work, regardless of the number of nulls or null "islands". You may have values, then nulls, then again values, again nulls. It will still work.
If, however the assumption (in your question) holds:
Once I have a NULL, is NULL all up to the end - so I want to fill it with the latest value.
there is a more efficient solution. We only need to find the latest (when ordered by idx) values. Modifying the above query, removing the where id <= t.id from the subqueries:
select t.id,
colA = coalesce(t.colA, t1.colA),
colB = coalesce(t.colB, t2.colB),
colC = coalesce(t.colC, t3.colC)
from table t
outer apply (select top 1 colA from table
where colA is not null order by id desc) t1
outer apply (select top 1 colB from table
where colB is not null order by id desc) t2
outer apply (select top 1 colC from table
where colC is not null order by id desc) t3;

You could make a change to your ORDER BY, to force the NULLs to be first in your ordering, but that may be expensive...
lag(B) over (order by CASE WHEN B IS NULL THEN -1 ELSE idx END)
Or, use a sub-query to calculate the replacement value once. Possibly less expensive on larger sets, but very clunky.
- Relies on all the NULLs coming at the end
- The LAG doesn't rely on that
COALESCE(
B,
(
SELECT
sorted_not_null.B
FROM
(
SELECT
table.B,
ROW_NUMBER() OVER (ORDER BY table.idx DESC) AS row_id
FROM
table
WHERE
table.B IS NOT NULL
)
sorted_not_null
WHERE
sorted_not_null.row_id = 1
)
)
(This should be faster on larger data-sets, than LAG or using OUTER APPLY with correlated sub-queries, simply because the value is calculated once. For tidiness, you could calculate and store the [last_known_value] for each column in variables, then just use COALESCE(A, #last_known_A), COALESCE(B, #last_known_B), etc)

if it is null all the way up to the end then can take a short cut
declare #b varchar(20) = (select top 1 b from table where b is not null order by id desc);
declare #c varchar(20) = (select top 1 c from table where c is not null order by id desc);
select is, isnull(b,#b) as b, insull(c,#c) as c
from table;

Select max(diff) from(
Select
Case when lag(a) over (order by b) is not null
Then (a -lag(a) over (order by b)) end as diff
From <tbl_name> where
<relevant conditions>
Order by b) k
Works fine in db visualizer.

UPDATE table
SET B = (#n := COALESCE(B , #n))
WHERE B is null;

Related

first_value over non null values not working in Spark SQL

I am trying to run a query on Spark SQL, where I want to fill the missing average_price (NULL) values with the next non null average price
Problem:
Desired Result:
Result I am getting from my query below
Here is the query I am using
spark.sql("""
select *,
CASE
WHEN average_price IS NULL AND store_id = 0 THEN
first_value(average_price, yes)
OVER
(
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between current row and 3 following
)
ELSE 0
END AS new_av_price
from table
""")
what am I doing wrong here?
I understand that your spark-sql version does not supportIGNORE NULLS syntax; see https://issues.apache.org/jira/browse/SPARK-30789.
You can go with this:
select
t1.*,
(select min(t2.average_price)
from Tbl t2
where t1.product_id=t2.product_id
and t2.purchase_dt=(select min(t3.purchase_dt)
from Tbl t3
where t3.product_id = t1.product_id
and t3.purchase_dt >= t1.purchase_Dt
and t3.average_price is not null
)
) as new_average_price
from Tbl t1
or this:
select
t1.*,
t2.average_price
from
Tbl t1
left join
Tbl t2
on t2.product_id = t1.product_id
and t2.average_price is not null
and t2.purchase_dt = (select min(t3.purchase_dt)
from Tbl t3
where t3.product_id=t1.product_id
and t3.purchase_dt>=t1.purchase_dt
and t3.average_price is not null)
These assume that you have only one row per product_id, purchase_dt. If you can have more than one row, you need to add additional logic to get rid of all but one row.
UPDATE 20220405:
If you can't use a JOIN, but you know that the non-NULL value is only up to 3 rows away, could you use:
COALESCE(
average_price
, first_value(average_price) OVER (
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between 1 following and 1 following
)
, first_value(average_price) OVER (
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between 2 following and 2 following
)
/* , ... */
) as new_average_price

How to select all columns for rows where I check if just 1 or 2 columns contain duplicate values

I'm having difficulty with what I figure should be an easy problem. I want to select all the columns in a table for which one particular column has duplicate values.
I've been trying to use aggregate functions, but that's constraining me as I want to just match on one column and display all values. Using aggregates seems to require that I 'group by' all columns I'm going to want to display.
If I understood you correctly, this should do:
SELECT *
FROM YourTable A
WHERE EXISTS(SELECT 1
FROM YourTable
WHERE Col1 = A.Col1
GROUP BY Col1
HAVING COUNT(*) > 1)
You can join on a derived table where you aggregate and determine "col" values which are duplicated:
SELECT a.*
FROM Table1 a
INNER JOIN
(
SELECT col
FROM Table1
GROUP BY col
HAVING COUNT(1) > 1
) b ON a.col = b.col
This query gives you a chance to ORDER BY cola in ascending or descending order and change Cola output.
Here's a Demo on SqlFiddle.
with cl
as
(
select *, ROW_NUMBER() OVER(partition by colb order by cola ) as rn
from tbl)
select *
from cl
where rn > 1

how to find the most appears one in a table using sql?

I have a table A with two columns named B and C as following:
('W1','F2')
('W1','F7')
('W2','F1')
('W2','F6')
('W2','F8')
('W4','F7')
('W6','F2')
('W6','F15')
('W7','F1')
('W7','F4')
('W7','F17')
('W8','F13')
How can I find which one in the B column appears with the most time using sql in oracle? (In this case, it's W2 and W7). Thank you!
Use a subquery to calculate the number of items in columC for each value in columnB and rank() the results of the subquery based on that count. Then in your main select return just the values of columnB where the rank of the rows returned by the subquery is 1:
SELECT ColB
FROM (
SELECT ColB,
Count(ColC),
rank() over (ORDER BY Count(ColC) DESC) AS rnk
FROM yourTable
GROUP BY ColB)
WHERE rnk = 1
Here's a sql fiddle: http://sqlfiddle.com/#!4/fa6bd/2
/*
C2 REFERS TO THE COLUMN B
T1 Refers to an alias
*/
WITH T1 AS
(
SELECT C2,COUNT(*) AS COUNT
FROM YOURTABLE
GROUP BY C2
)
SELECT C2,COUNT FROM T1 WHERE COUNT=(SELECT MAX(COUNT) FROM T1 )
;
Select ColB, Count(*)
FROM yourTable
GROUP BY ColB
ORDER BY count(*) desc

Wrap-around in SQL results ordering

If I have a table with items beginning with C, D, and J, is it in any way possible to arrange a query that orders these ascending, but starts with D, and ends with C wrapped around?
E.g. raw table = C,C,C,D,D,J
Desired result order = D,D,J,C,C,C
In ordinary SQL this is, not any functional language? I can't see how the desired order can be achieved without hard-coding, i.e. selecting each individual record in the desired order all union'd together.
You could introduce a custom sort key with a CASE statement.
Select Col1, Col2, Col3,
case left(Col3,1) when 'D' then 1
when 'J' then 2
when 'C' then 3
else 4
end as SortKey
from YourTable
order by SortKey, Col3
That does not seem like a regular order, so you won't be able to do it in a simple way. But you can do this:
(SELECT * FROM yourtable WHERE ID >=C ORDER BY ID) UNION (SELECT SELECT * FROM yourtable WHERE ID <C ORDER BY ID)
You can implement custom sort orders without hardcoding by using a table:
CREATE TABLE SortRules (
Prefix char(1) NOT NULL
,SortOrder int NOT NULL
)
Then join to the SortRules table:
SELECT *
FROM YourTable
LEFT JOIN SortRules
ON YourTable.YourColumn LIKE SortRules.Prefix + '%'
ORDER BY SortRules.SortOrder, YourTable.YourColumn
You can make SortOrder UNIQUE (although that's not required). You can also decide if you want missing ones (unmatched in the join) to be at the top or bottom:
SELECT *
FROM YourTable
LEFT JOIN SortRules
ON YourTable.YourColumn LIKE SortRules.Prefix + '%'
ORDER BY COALESCE(SortRules.SortOrder, 2147483647), YourTable.YourColumn
SELECT *
FROM YourTable
LEFT JOIN SortRules
ON YourTable.YourColumn LIKE SortRules.Prefix + '%'
ORDER BY COALESCE(SortRules.SortOrder, -2147483648), YourTable.YourColumn

Select DISTINCT, return entire row

I have a table with 10 columns.
I want to return all rows for which Col006 is distinct, but return all columns...
How can I do this?
if column 6 appears like this:
| Column 6 |
| item1 |
| item1 |
| item2 |
| item1 |
I want to return two rows, one of the records with item1 and the other with item2, along with all other columns.
In SQL Server 2005 and above:
;WITH q AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY col6 ORDER BY id) rn
FROM mytable
)
SELECT *
FROM q
WHERE rn = 1
In SQL Server 2000, provided that you have a primary key column:
SELECT mt.*
FROM (
SELECT DISTINCT col6
FROM mytable
) mto
JOIN mytable mt
ON mt.id =
(
SELECT TOP 1 id
FROM mytable mti
WHERE mti.col6 = mto.col6
-- ORDER BY
-- id
-- Uncomment the lines above if the order matters
)
Update:
Check your database version and compatibility level:
SELECT ##VERSION
SELECT COMPATIBILITY_LEVEL
FROM sys.databases
WHERE name = DB_NAME()
The key word "DISTINCT" in SQL has the meaning of "unique value". When applied to a column in a query it will return as many rows from the result set as there are unique, different values for that column. As a consequence it creates a grouped result set, and values of other columns are random unless defined by other functions (such as max, min, average, etc.)
If you meant to say you want to return all rows for which Col006 has a specific value, then use the "where Col006 = value" clause.
If you meant to say you want to return all rows for which Col006 is different from all other values of Col006, then you still need to specify what that value is => see above.
If you want to say that the value of Col006 can only be evaluated once all rows have been retrieved, then use the "having Col006 = value" clause. This has the same effect as the "where" clause, but "where" gets applied when rows are retrieved from the raw tables, whereas "having" is applied once all other calculations have been made (i.e. aggregation functions have been run etc.) and just before the result set is returned to the user.
UPDATE:
After having seen your edit, I have to point out that if you use any of the other suggestions, you will end up with random values in all other 9 columns for the row that contains the value "item1" in Col006, due to the constraint further up in my post.
You can group on Col006 to get the distinct values, but then you have to decide what to do with the multiple records in each group.
You can use aggregates to pick a value from the records. Example:
select Col006, min(Col001), max(Col002)
from TheTable
group by Col006
order by Col006
If you want the values to come from a specific record in each group, you have to identify it somehow. Example of using Col002 to identify the record in each group:
select Col006, Col001, Col002
from TheTable t
inner join (
select Col006, min(Col002)
from TheTable
group by Col006
) x on t.Col006 = x.Col006 and t.Col002 = x.Col002
order by Col006
SELECT *
FROM (SELECT DISTINCT YourDistinctField FROM YourTable) AS A
CROSS APPLY
( SELECT TOP 1 * FROM YourTable B
WHERE B.YourDistinctField = A.YourDistinctField ) AS NewTableName
I tried the answers posted above with no luck... but this does the trick!
select * from yourTable where column6 in (select distinct column6 from yourTable);
SELECT *
FROM harvest
GROUP BY estimated_total;
You can use GROUP BY and MIN() to get more specific result.
Lets say that you have id as the primary_key.
And we want to get all the DISTINCT values for a column lets say estimated_total, And you also need one sample of complete row with each distinct value in SQL. Following query should do the trick.
SELECT *, min(id)
FROM harvest
GROUP BY estimated_total;
create table #temp
(C1 TINYINT,
C2 TINYINT,
C3 TINYINT,
C4 TINYINT,
C5 TINYINT,
C6 TINYINT)
INSERT INTO #temp
SELECT 1,1,1,1,1,6
UNION ALL SELECT 1,1,1,1,1,6
UNION ALL SELECT 3,1,1,1,1,3
UNION ALL SELECT 4,2,1,1,1,6
SELECT * FROM #temp
SELECT *
FROM(
SELECT ROW_NUMBER() OVER (PARTITION BY C6 Order by C1) ID,* FROM #temp
)T
WHERE ID = 1