How do I exclude null duplicates without excluding valid nulls in SQL Server?

I'm using SQL Server and am trying to pull certain data. However, some rows come back with duplicate order numbers, where one row has a null line number while the other has a valid value.
The tricky part is that other orders legitimately have a null line number, so if I filter on line_number is not null I would also exclude all the valid orders whose line number is null. Would I use a case statement? A subquery? I'm at a loss.
Here's an abridged version of my code:
select
order_number,
line_number
from table_1

With NOT EXISTS:
select t.*
from tablename t
where t.line_number is not null
or not exists (
select 1 from tablename
where order_number = t.order_number and line_number is not null
)
Or with the ROW_NUMBER() window function:
select t.order_number, t.line_number
from (
select *,
row_number() over (partition by order_number order by case when line_number is not null then 1 else 2 end) rn
from tablename
) t
where t.rn = 1
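The NOT EXISTS filter above can be checked on a small data set. The following is a minimal sketch, assuming an in-memory SQLite database through Python's sqlite3 module rather than SQL Server, and made-up sample rows: order 1 has a null duplicate, order 2 is a valid null-only order.

```python
import sqlite3

# Hypothetical sample data, not taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (order_number INTEGER, line_number INTEGER)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(1, None), (1, 10), (2, None), (3, 5)],
)

# Keep a row if its line_number is set, or if no sibling row
# for the same order has a line_number at all.
rows = conn.execute(
    """
    SELECT t.order_number, t.line_number
    FROM tablename t
    WHERE t.line_number IS NOT NULL
       OR NOT EXISTS (
            SELECT 1
            FROM tablename x
            WHERE x.order_number = t.order_number
              AND x.line_number IS NOT NULL
       )
    ORDER BY t.order_number
    """
).fetchall()
print(rows)
```

Note that the two approaches differ when an order has several non-null lines: NOT EXISTS keeps all of them, while the ROW_NUMBER() version keeps exactly one row per order_number.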

Related

first_value over non null values not working in Spark SQL

I am trying to run a query on Spark SQL, where I want to fill the missing average_price (NULL) values with the next non null average price
Problem:
Desired Result:
Result I am getting from my query below
Here is the query I am using
spark.sql("""
select *,
CASE
WHEN average_price IS NULL AND store_id = 0 THEN
first_value(average_price, yes)
OVER
(
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between current row and 3 following
)
ELSE 0
END AS new_av_price
from table
""")
What am I doing wrong here?
I understand that your Spark SQL version does not support the IGNORE NULLS syntax; see https://issues.apache.org/jira/browse/SPARK-30789.
You can go with this:
select
t1.*,
(select min(t2.average_price)
from Tbl t2
where t1.product_id=t2.product_id
and t2.purchase_dt=(select min(t3.purchase_dt)
from Tbl t3
where t3.product_id = t1.product_id
and t3.purchase_dt >= t1.purchase_Dt
and t3.average_price is not null
)
) as new_average_price
from Tbl t1
or this:
select
t1.*,
t2.average_price
from
Tbl t1
left join
Tbl t2
on t2.product_id = t1.product_id
and t2.average_price is not null
and t2.purchase_dt = (select min(t3.purchase_dt)
from Tbl t3
where t3.product_id=t1.product_id
and t3.purchase_dt>=t1.purchase_dt
and t3.average_price is not null)
These assume that you have only one row per product_id, purchase_dt. If you can have more than one row, you need to add additional logic to get rid of all but one row.
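To see what the correlated-subquery version returns, here is a minimal sketch that runs the same query shape on an in-memory SQLite database through Python's sqlite3 module instead of Spark. The sample rows and the integer purchase_dt values are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tbl (product_id INTEGER, purchase_dt INTEGER, average_price REAL)"
)
# Hypothetical data: one row per (product_id, purchase_dt), as the answer assumes.
conn.executemany(
    "INSERT INTO Tbl VALUES (?, ?, ?)",
    [(1, 1, None), (1, 2, None), (1, 3, 5.0), (1, 4, None), (1, 5, 7.0), (2, 1, None)],
)

# For each row, find the earliest date at or after it that has a price,
# then read the price stored on that date.
filled = conn.execute(
    """
    SELECT t1.product_id,
           t1.purchase_dt,
           (SELECT MIN(t2.average_price)
            FROM Tbl t2
            WHERE t2.product_id = t1.product_id
              AND t2.purchase_dt = (SELECT MIN(t3.purchase_dt)
                                    FROM Tbl t3
                                    WHERE t3.product_id = t1.product_id
                                      AND t3.purchase_dt >= t1.purchase_dt
                                      AND t3.average_price IS NOT NULL)
           ) AS new_average_price
    FROM Tbl t1
    ORDER BY t1.product_id, t1.purchase_dt
    """
).fetchall()
print(filled)
```

A product with no later non-null price (product 2 here) stays NULL, because the inner MIN finds no date to match.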
UPDATE 20220405:
If you can't use a JOIN, but you know that the non-NULL value is at most 3 rows away, you could use:
COALESCE(
average_price
, first_value(average_price) OVER (
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between 1 following and 1 following
)
, first_value(average_price) OVER (
PARTITION BY product_id
ORDER BY cast(purchase_dt as int) asc
range between 2 following and 2 following
)
/* , ... */
) as new_average_price

Finding duplicate values in a table where all the columns are not the same

I am working with a set of data in a table.
For simplicity, the table looks like the one below, with some sample data:
Some of the data in this table came from a different source; those rows are the ones where cqmRecordID is not null.
I need to find duplicate rows in this table and delete the duplicates that came over from the other source (the ones with a cqmRecordID).
Two records are considered duplicates if they have the same values for these columns:
[Name]
Cast([CreatedDate] as Date)
[CreatedBy]
So in the sample data I have above, record #5 and record #6 would be considered duplicates.
As solutions I came up with these two queries:
Query #1:
select * from (
select recordid, cqmrecordid, ROW_NUMBER() over (partition by name, cast(createddate as date), createdby
order by cqmrecordid, recordid) as rownum
from vmsNCR ) A
where cqmrecordid is not null
order by recordid
Query #2:
select A.recordID, A.cqmRecordID, B.RecordID, B.cqmRecordID
from vmsNCR A
join vmsNCR B
on A.Name = B.Name
and cast(A.CreatedDate as date) = cast(B.CreatedDate as date)
and A.CreatedBy = B.CreatedBy
and A.RecordID != B.RecordID
and A.cqmRecordID is not null
order by A.RecordID
Is there a better approach to this? Is one better than the other performance wise?
If you want to fetch all the rows without duplicates, then:
select t.* -- or all columns except seqnum
from (select t.*,
row_number() over (partition by name, cast(createddate as date), createdby
order by (case when cqmRecordId is not null then 1 else 2 end)
) as seqnum
from t
) t
where seqnum = 1;
If you want performance, add persisted computed columns:
alter table t add cqmRecordId_flag as (case when cqmRecordId is null then 0 else 1 end) persisted;
alter table t add createddate_date as (cast(createddate as date)) persisted;
And then an index:
create index idx_t_4 on t(name, createddate_date, createdby, cqmRecordId_flag desc);
EDIT:
If you actually just want to delete the NULL values from the table, you can use:
delete t from t
where t.cqmRecordId is null and
exists (select 1
from t t2
where t2.name = t.name and
convert(date, t2.createddate_date) = convert(date, t.createddate_date) and
t2.createdby = t.createdby and
t2.cqmRecordId is not null
);
You can use the same logic with select to just select the duplicates.
Try the query below; it might work for you:
;WITH TestCTE
AS
(
SELECT *,ROW_NUMBER() OVER(
PARTITION BY [Name],Cast([CreatedDate] as Date),[CreatedBy]
ORDER BY RecordId
) AS RowNumber
FROM vmsNCR
)
DELETE FROM TestCTE
WHERE RowNumber > 1
Use the code below to eliminate duplicates:
;WITH CTE
AS
(
SELECT ROW_NUMBER() OVER(
PARTITION BY [Name],Cast([CreatedDate] as Date),[CreatedBy]
ORDER BY cqmRecordId
) AS Rnk
,*
FROM vmsNCR
)
DELETE FROM CTE
WHERE Rnk <> 1
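The ranking-then-delete idea can be tried out on a toy table. This is a minimal sketch under several assumptions: it uses SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer), it deletes through a RecordID IN (...) subquery because SQLite cannot delete from a CTE the way SQL Server can, and the sample rows are invented. The ordering puts rows without a cqmRecordID first, so the imported copy is the one removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vmsNCR (RecordID INTEGER PRIMARY KEY, Name TEXT, "
    "CreatedDate TEXT, CreatedBy TEXT, cqmRecordID INTEGER)"
)
# Hypothetical rows: 5 and 6 are duplicates (same name, day, author);
# 7 is imported but has no duplicate, so it must survive.
conn.executemany(
    "INSERT INTO vmsNCR VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "2020-01-01 08:00:00", "bob", None),
        (2, "B", "2020-01-02 09:00:00", "amy", None),
        (5, "C", "2020-01-03 10:00:00", "joe", None),
        (6, "C", "2020-01-03 15:30:00", "joe", 77),
        (7, "D", "2020-01-04 11:00:00", "joe", 88),
    ],
)

# Rank rows inside each duplicate group, native rows (NULL cqmRecordID) first,
# then delete everything that is not ranked first.
conn.execute(
    """
    DELETE FROM vmsNCR
    WHERE RecordID IN (
        SELECT RecordID
        FROM (
            SELECT RecordID,
                   ROW_NUMBER() OVER (
                       PARTITION BY Name, date(CreatedDate), CreatedBy
                       ORDER BY CASE WHEN cqmRecordID IS NULL THEN 0 ELSE 1 END,
                                RecordID
                   ) AS rnk
            FROM vmsNCR
        ) AS ranked
        WHERE rnk > 1
    )
    """
)
remaining = [r[0] for r in conn.execute("SELECT RecordID FROM vmsNCR ORDER BY RecordID")]
print(remaining)
```

Running the inner SELECT on its own first is a cheap way to review which rows would be deleted before committing to the DELETE.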

get only row that meet condition if such row exist and if not get the row that meet another condition

This sounds like a simple question, but I just can't find the right way.
Given the simplified table
with t as (
select ordernumber, orderdate, case when ordertype in (5,21) then 1 else 0 end is_restore , ordertype, row_number() over(order by orderdate) rn from
(
select to_date('29.08.08','DD.MM.YY') orderdate,'313' ordernumber, 1 as ordertype from dual union all
select to_date('13.03.15','DD.MM.YY') orderdate, '90/4/2' ordernumber, 5 as ordertype from dual
)
)
select * from t -- where clause should be here
For every row, is_restore is guaranteed to be 1 or 0.
If the table has a row where is_restore=1, then select ordernumber, orderdate of that row and nothing else.
If the table does not have a row where is_restore=1, then select ordernumber, orderdate of the row where rn=1 (a row where rn=1 is guaranteed to exist in the table).
Given the requirements above, what do I need to put in the where clause to get the following?
You could use ROW_NUMBER:
CREATE TABLE t
AS
select ordernumber, orderdate,
case when ordertype in (5,21) then 1 else 0 end is_restore, ordertype,
row_number() over(order by orderdate) rn
from (
select to_date('29.08.08','DD.MM.YY') orderdate,'313' ordernumber,
1 as ordertype
from dual union all
select to_date('13.03.15','DD.MM.YY') orderdate, '90/4/2' ordernumber,
5 as ordertype
from dual);
-------------------
with cte as (
select t.*,
ROW_NUMBER() OVER(/*PARTITION BY ...*/ ORDER BY is_restore DESC, rn) AS rnk
from t
)
SELECT *
FROM cte
WHERE rnk = 1;
Here is SQL that doesn't use window functions; it may be useful for those whose databases don't support OVER ( ... ), or when the query is based on indexed fields.
SELECT
*
FROM t
WHERE t.is_restore = 1
OR (
NOT EXISTS (SELECT 1 FROM t t2 WHERE t2.is_restore = 1)
AND t.rn = 1
)
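Both branches of that WHERE clause can be exercised with two tiny tables, one that contains a restore row and one that does not. This is a minimal sketch assuming SQLite through Python's sqlite3 module instead of Oracle; the helper name pick and the text-typed dates are hypothetical.

```python
import sqlite3

QUERY = """
SELECT ordernumber, orderdate
FROM t
WHERE t.is_restore = 1
   OR (
        NOT EXISTS (SELECT 1 FROM t AS x WHERE x.is_restore = 1)
        AND t.rn = 1
   )
"""


def pick(rows):
    """Load rows into a fresh in-memory table t and run the fallback query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (ordernumber TEXT, orderdate TEXT, is_restore INTEGER, rn INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Case 1: a restore row exists, so only that row is returned.
with_restore = pick([("313", "2008-08-29", 0, 1), ("90/4/2", "2015-03-13", 1, 2)])

# Case 2: no restore row exists, so the query falls back to rn = 1.
without_restore = pick([("313", "2008-08-29", 0, 1), ("90/4/2", "2015-03-13", 0, 2)])

print(with_restore, without_restore)
```

If several rows have is_restore = 1, this form returns all of them, whereas the ROW_NUMBER() version always returns a single row.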

SQL Server RowNumbering

SrNo TextCol
--------------
NULL ABC
NULL ABC
NULL ASC
NULL qwe
I want to update the SrNo column with numbers 1,2,3,4 without changing sequence of other columns.
It only makes sense to speak of a row number if there exists a column that can provide an ordering. Assuming the ordering is specified by the TextCol column, we can try the following:
WITH cte AS (
SELECT SrNo, TextCol, ROW_NUMBER() OVER (ORDER BY TextCol) rn
FROM yourTable
)
UPDATE cte
SET SrNo = rn;
Tables are unordered, so you cannot rely on an "existing sequence". However, a "trick" is to order by (select NULL), which in effect does nothing to the row order. While it works, you should not rely on it as a permanent solution.
WITH cte AS (
SELECT SrNo, TextCol
, ROW_NUMBER() OVER (ORDER BY (select NULL)) rn
FROM yourTable
)
UPDATE cte
SET SrNo = rn;
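Since TextCol contains ties ('ABC' appears twice), a deterministic numbering needs a tiebreaker. The following is a rough sketch of that idea, assuming SQLite through Python's sqlite3 module (3.25 or newer for window functions). SQLite cannot update through a CTE as SQL Server does, so the sketch uses a correlated subquery and SQLite's implicit rowid as the tiebreaker; in SQL Server an identity or key column would play that role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (SrNo INTEGER, TextCol TEXT)")
conn.executemany(
    "INSERT INTO yourTable (SrNo, TextCol) VALUES (NULL, ?)",
    [("ABC",), ("ABC",), ("ASC",), ("qwe",)],
)

# Number rows by TextCol; rowid (SQLite-specific) separates the two 'ABC' rows.
conn.execute(
    """
    UPDATE yourTable
    SET SrNo = (
        SELECT numbered.rn
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (ORDER BY TextCol, rowid) AS rn
            FROM yourTable
        ) AS numbered
        WHERE numbered.rid = yourTable.rowid
    )
    """
)
numbered_rows = conn.execute(
    "SELECT SrNo, TextCol FROM yourTable ORDER BY SrNo"
).fetchall()
print(numbered_rows)
```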

How to get the first not null value from a column of values in BigQuery?

I am trying to extract the first non-null value from a column of values based on timestamp. Can somebody share their thoughts on this? Thank you.
What have I tried so far?
FIRST_VALUE( column ) OVER ( PARTITION BY id ORDER BY timestamp)
Input :-
id,column,timestamp
1,NULL,10:30 am
1,NULL,10:31 am
1,'xyz',10:32 am
1,'def',10:33 am
2,NULL,11:30 am
2,'abc',11:31 am
Output(expected) :-
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am
You can modify your SQL like this to get the data you want:
FIRST_VALUE( column )
OVER (
PARTITION BY id
ORDER BY
CASE WHEN column IS NULL then 0 ELSE 1 END DESC,
timestamp
)
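The effect of sorting NULLs to the end of the window can be verified on the sample input from the question. This is a minimal sketch assuming SQLite through Python's sqlite3 module (3.25 or newer) rather than BigQuery; the column names val and ts are hypothetical stand-ins for column and timestamp, chosen to avoid keyword clashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE original_data (id INTEGER, val TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO original_data VALUES (?, ?, ?)",
    [
        (1, None, "10:30"),
        (1, None, "10:31"),
        (1, "xyz", "10:32"),
        (1, "def", "10:33"),
        (2, None, "11:30"),
        (2, "abc", "11:31"),
    ],
)

# Non-null values sort first (flag 0), then by time, so the first row of
# every partition is the earliest non-null value.
first_values = conn.execute(
    """
    SELECT id,
           FIRST_VALUE(val) OVER (
               PARTITION BY id
               ORDER BY CASE WHEN val IS NULL THEN 1 ELSE 0 END, ts
           ) AS first_val,
           ts
    FROM original_data
    ORDER BY id, ts
    """
).fetchall()
print(first_values)
```

The CASE flag here sorts ascending with 0 for non-null, which is equivalent to the answer's 0/1 flag sorted DESC.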
Try this old trick of string manipulation:
Select
ID,
Column,
ttimestamp,
LTRIM(Right(CColumn,20)) as CColumn,
FROM
(SELECT
ID,
Column,
ttimestamp,
MIN(Concat(RPAD(IF(Column is null, '9999999999999999',STRING(ttimestamp)),20,'0'),LPAD(Column,20,' '))) OVER (Partition by ID) CColumn
FROM (
SELECT
*
FROM (Select 1 as ID, STRING(NULL) as Column, 0.4375 as ttimestamp),
(Select 1 as ID, STRING(NULL) as Column, 0.438194444444444 as ttimestamp),
(Select 1 as ID, 'xyz' as Column, 0.438888888888889 as ttimestamp),
(Select 1 as ID, 'def' as Column, 0.439583333333333 as ttimestamp),
(Select 2 as ID, STRING(NULL) as Column, 0.479166666666667 as ttimestamp),
(Select 2 as ID, 'abc' as Column, 0.479861111111111 as ttimestamp)
))
As far as I know, BigQuery has no options like 'IGNORE NULLS' or 'NULLS LAST'. Given that, this is the simplest solution I could come up with. I would like to see even simpler solutions.
Assuming the input data is in table "original_data",
select w2.id, w1.column, w2.timestamp
from
(select id,column,timestamp
from
(select id,column,timestamp, row_number()
over (partition BY id ORDER BY timestamp) position
FROM original_data
where column is not null
)
where position=1
) w1
right outer join
original_data as w2
on w1.id = w2.id
SELECT id,
(SELECT top(1) column FROM test1 where id=1 and column is not null order by autoID desc) as name
,timestamp
FROM yourTable
Output :-
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am