SQL: exclude out certain rows in a messy dataset

I am cleaning a dataset using the duckdb package in Python. The code is as follows:
dt = db.query(
"""
select *
from dt_orig
where (A != '') and
(B != '' or B != 'Unknown' or B != 'Undefined') and
(C != '') and
(D != '' or D != '[UNKNOWN]')
""").df()
However, when I check the unique levels of each variable (A, B, C, and D), some of the rows that should have been excluded are still there. The Python code used to print the unique levels is as follows:
print("The unique A are", np.unique(dt['A']), ".\n")
print("The unique B are", np.unique(dt['B']), ".\n")
print("The unique C are", np.unique(dt['C']), ".\n")
print("The unique D are", np.unique(dt['D']), ".\n")
The output is as follows:
The unique A are ['A1' 'A2' 'A3' 'A4'] .
The unique B are ['' 'B1' 'Undefined' 'Unknown' 'B2'] .
The unique C are ['C1' 'C2' 'C3' 'C4'] .
The unique D are ['[UNKNOWN]' 'D1' 'D2' 'D3' 'D4'] .
Who can help? Thank you.

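You have a logical error in your where condition:
(B != '' or B != 'Unknown' or B != 'Undefined')
Any value of B differs from at least one of the three literals, so this condition is true for every non-null B and filters nothing. Combine the comparisons with and instead:
(B != '' and B != 'Unknown' and B != 'Undefined')
or
(B not in ('', 'Unknown', 'Undefined'))
The same applies to the condition on D, which should read (D != '' and D != '[UNKNOWN]').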
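A minimal sketch of the difference between the or and the and form of the filter. It uses Python's built-in sqlite3 module instead of duckdb so that it runs without extra packages (an assumption for illustration; the comparison logic is standard SQL), and the five sample values are taken from the output printed in the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dt_orig (B TEXT)")
con.executemany(
    "INSERT INTO dt_orig VALUES (?)",
    [("",), ("B1",), ("Undefined",), ("Unknown",), ("B2",)],
)

# OR form: every value differs from at least one literal, so nothing is removed.
or_rows = [r[0] for r in con.execute(
    "SELECT B FROM dt_orig "
    "WHERE B != '' OR B != 'Unknown' OR B != 'Undefined' ORDER BY B"
)]

# AND form, written with NOT IN: only the real values survive.
and_rows = [r[0] for r in con.execute(
    "SELECT B FROM dt_orig "
    "WHERE B NOT IN ('', 'Unknown', 'Undefined') ORDER BY B"
)]

print(or_rows)   # all five values, including the ones meant to be excluded
print(and_rows)  # ['B1', 'B2']
```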

Related

SQL Create A New Table off Of Query Results , in one Query

Below is my attempt to create a table from a query. I am having a little trouble getting the INSERT INTO statement to work; the rest of the query works great, but I cannot get the returned results into a new table. Any idea how to do this?
CREATE TABLE test (
a varchar(255),
b varchar(255),
)
--I assume the above is wrong , not sure why--
Insert into test (a, b)
select (B , Cor)
from
(
--below works great---
Select B,
CASE
WHEN B = 't' THEN 'test'
WHEN B = '-' THEN 'NULL'
WHEN B = 'Choos' THEN 'NULL'
WHEN B = 'Se Co' THEN 'S'
--
WHEN B LIKE 'Y%' THEN 'di'
WHEN B LIKE 'T%' THEN 'Ten'
--
ELSE B
END AS Cor
FROM
(SELECT WON AS B
FROM ma
UNION SELECT C
FROM ma )
Order By B )

The parentheses are suspicious:
Insert into test (a, b)
select (B, Cor)
-----------^
Some databases might interpret this as a single tuple (struct or record) with two fields. Some will generate an error. Basically, you want to drop the parens:
Insert into test (a, b)
select B, Cor
. . .
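A runnable sketch of the corrected statement, using Python's built-in sqlite3 module (an assumption; the question does not say which database is used). The two rows in ma are made up for illustration, and the CASE expression is shortened. The CREATE TABLE here also omits the trailing comma after the last column, which several databases (SQLite among them) reject.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ma (WON TEXT, C TEXT)")
con.executemany("INSERT INTO ma VALUES (?, ?)", [("t", "Yes"), ("-", "t")])
con.execute("CREATE TABLE test (a VARCHAR(255), b VARCHAR(255))")

con.execute("""
    INSERT INTO test (a, b)
    SELECT B, Cor              -- plain column list, no parentheses
    FROM (
        SELECT B,
               CASE WHEN B = 't' THEN 'test'
                    WHEN B = '-' THEN 'NULL'
                    WHEN B LIKE 'Y%' THEN 'di'
                    ELSE B
               END AS Cor
        FROM (SELECT WON AS B FROM ma UNION SELECT C FROM ma)
    )
""")

rows = con.execute("SELECT a, b FROM test ORDER BY a").fetchall()
print(rows)  # [('-', 'NULL'), ('Yes', 'di'), ('t', 'test')]
```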

Finding rows where two values not equal, with nulls

I'm interested to know what the common practices are for this situation.
You need to find all rows where two columns do not match, and both columns are nullable (exclude rows where both columns are NULL). None of these methods will work:
WHERE A <> B --does not include any NULLs
WHERE NOT (A = B) --does not include any NULLs
WHERE (A <> B OR A IS NULL OR B IS NULL) --includes NULL, NULL
Except this...it does work, but I don't know if there is a performance hit...
WHERE COALESCE(A, '') <> COALESCE(B, '')
Lately I've started using this logic...it's clean, simple and works...would this be considered the common way to handle it?:
WHERE IIF(A = B, 1, 0) = 0
--OR
WHERE CASE WHEN A = B THEN 1 ELSE 0 END = 0

This is a bit painful, but I would advise direct boolean logic:
where (A <> B) or (A is null and B is not null) or (A is not null and B is null)
or:
where not (A = B or A is null and B is null)
It would be much simpler if SQL Server implemented IS DISTINCT FROM, the ANSI-standard NULL-safe comparison operator (it was eventually added in SQL Server 2022).
If you use coalesce(), a typical method is:
where coalesce(A, '') <> coalesce(B, '')
This is used because '' will convert to most types.
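A small sketch comparing the explicit boolean logic with the coalesce() variant, run through Python's built-in sqlite3 module (an assumption; the question targets SQL Server) on made-up rows. It also shows one caveat of the coalesce() approach: a NULL and an empty string are treated as equal.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (A TEXT, B TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", "x"), ("x", "y"), (None, "y"), ("x", None), (None, None), ("", None)],
)

# Explicit boolean logic: differs, or exactly one side is NULL.
explicit = con.execute("""
    SELECT A, B FROM t
    WHERE (A <> B)
       OR (A IS NULL AND B IS NOT NULL)
       OR (A IS NOT NULL AND B IS NULL)
    ORDER BY rowid
""").fetchall()

# coalesce() variant: NULL is replaced by '' before comparing.
coalesced = con.execute("""
    SELECT A, B FROM t
    WHERE COALESCE(A, '') <> COALESCE(B, '')
    ORDER BY rowid
""").fetchall()

print(explicit)   # includes ('', None); excludes (None, None) and ('x', 'x')
print(coalesced)  # same rows, except ('', None) is missing
```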

How about using except?
For example, if I want to get all a and b where a is not equal to b, excluding all null values of a and b:
select a, b from tableX where a is not null and b is not null
except
select a,b from tableX where a=b

How can I improve this conditional UPDATE query?

I have a table t with several columns, let's name them a, b and c. I also have a state column which indicates the current state. There is also an id column.
I want to write the following query: update column a always, but b and c only if the application state is still equal to the database state. Here, the state column is used for optimistic locking.
I wrote this query as follows:
UPDATE t
SET a = $a$,
b = (CASE WHEN state = $state$ THEN $b$ ELSE b END),
c = (CASE WHEN state = $state$ THEN $c$ ELSE c END)
WHERE id = $id$ AND
(
a != $a$ OR
b != (CASE WHEN state = $state$ THEN $b$ ELSE b END) OR
c != (CASE WHEN state = $state$ THEN $c$ ELSE c END)
)
Here, $id$, $a$, ... are input variables from the application. The second part of the WHERE clause is to avoid updates which do not effectively update anything.
This query works as expected, but is very clumsy. I am repeating the same condition several times. I am looking for a way to rewrite this query in a more elegant fashion. If this was a simple SELECT query, I could do something with a LATERAL JOIN, but I cannot see how to apply this here.
How can I improve this query?

Split the query in two:
UPDATE t
SET a = $a$
WHERE id = $id$;
UPDATE t
SET b = $b$,
c = $c$
WHERE id = $id$ AND
state = $state$;
If you need atomicity, wrap in a transaction.
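A sketch of the two-statement approach wrapped in one transaction, using Python's built-in sqlite3 module (an assumption; the question's mention of LATERAL JOIN suggests PostgreSQL). The table contents and the helper name update_row are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, state INTEGER)"
)
con.execute("INSERT INTO t VALUES (1, 'a0', 'b0', 'c0', 7)")


def update_row(con, row_id, a, b, c, state):
    """Always set a; set b and c only if the stored state still matches."""
    with con:  # commits both statements together, rolls back on an exception
        con.execute("UPDATE t SET a = ? WHERE id = ?", (a, row_id))
        con.execute(
            "UPDATE t SET b = ?, c = ? WHERE id = ? AND state = ?",
            (b, c, row_id, state),
        )


# Stale state: only a changes.
update_row(con, 1, "a1", "b1", "c1", state=99)
stale = con.execute("SELECT a, b, c FROM t WHERE id = 1").fetchone()

# Matching state: a, b and c all change.
update_row(con, 1, "a2", "b2", "c2", state=7)
fresh = con.execute("SELECT a, b, c FROM t WHERE id = 1").fetchone()

print(stale)  # ('a1', 'b0', 'c0')
print(fresh)  # ('a2', 'b2', 'c2')
```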

This seems a bit cleaner (untested):
WITH src AS (
SELECT id
, $a$ AS a
, (CASE WHEN state = $state$ THEN $b$ ELSE b END) AS b
, (CASE WHEN state = $state$ THEN $c$ ELSE c END) AS c
FROM t
WHERE id = $id$
)
UPDATE t dst
SET a=src.a, b=src.b, c=src.c
FROM src
WHERE dst.id = src.id
AND (src.a, src.b, src.c) IS DISTINCT FROM (dst.a, dst.b, dst.c)
;

EDIT: It took me a while to realize my mistake here: the question obviously targets a single-row update, while my answer tried to update many rows. However, if you need to execute this update for a set of rows, you could:
Insert the needed parameters into a temporary table
Join that table within the "t2" subquery
Select its columns (e.g. tempTable.b AS tempB)
Replace the parameters (e.g. $b$ -> t2.tempB)
UPDATE t
SET a=source.a,
b=source.b,
c=source.c
FROM
(
SELECT
id,
$a$ AS a,
(CASE WHEN UpdateCondition THEN $b$ ELSE b END) AS b,
(CASE WHEN UpdateCondition THEN $c$ ELSE c END) AS c
FROM
(
SELECT state = $state$ As UpdateCondition, * FROM t
) As t2
WHERE
id = $id$ AND
(
a != $a$ OR
b != (CASE WHEN UpdateCondition THEN $b$ ELSE b END) OR
c != (CASE WHEN UpdateCondition THEN $c$ ELSE c END)
)
) AS source
WHERE t.id=source.id;
The Sub query for t2 gives you your state Condition and executes the calculation for it only once per row.
The subquery for "source" gives you the mapped values and filters those without changes.

The only filter you need is on id = $id$.
The CASE expression already leaves b and c unchanged in the update if the state doesn't match, so you don't need to filter on it.
EDIT
where id = $id$ and (a != $a$
or (state = $state$ and (b != $b$ or c != $c$)))
If you filter on any more than that, then "always update a" will not necessarily be true.
This third attempt also handles the case where a stays the same but b or c changes.

How to set nulls the values of column B and column C if they already exist on column A sql?

Select BillName as A, ConsigneeName as B, ShipperName as C
from Sum_Orders
where (OrderStatus in ('Complete','Invoiced')
)
and
OrderPeriodYear IN (
(
YEAR(GETDATE())-1
)
)
Group by billname,ConsigneeName,ShipperName
I'm getting duplicates across A, B, and C (which is expected).
I'm trying to write a condition that
keeps the value in A and sets to NULL the values that repeat in B or C:
IF A = B or A = C then keep A and set B or C to NULL
Thank you, guys :D

Is this what you want?
update t
set B = (case when B <> A then B end),
C = (case when C <> A then C end)
where B = A or C = A;

If you have to do this inline, then perhaps a CASE expression will help.
Select Billname AS A,
CASE WHEN ConsigneeName = Billname THEN NULL ELSE ConsigneeName END AS B,
CASE WHEN ShipperName = Billname THEN NULL ELSE ShipperName END AS C
from Sum_Orders etc ...
If the table is big, this may be expensive on the query and pushing this logic into the query itself might be better.
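The inline CASE approach can be tried on a few made-up rows with Python's built-in sqlite3 module (an assumption; the question's GETDATE() suggests SQL Server). The company names are hypothetical. The third column uses NULLIF(), a standard shorthand that is not part of the answers here but behaves like the CASE expression.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Sum_Orders (BillName TEXT, ConsigneeName TEXT, ShipperName TEXT)"
)
con.executemany(
    "INSERT INTO Sum_Orders VALUES (?, ?, ?)",
    [
        ("Acme", "Acme", "Bolt"),
        ("Acme", "Core", "Acme"),
        ("Dune", "Dune", "Dune"),
        ("Echo", "Fern", "Gale"),
    ],
)

rows = con.execute("""
    SELECT BillName AS A,
           CASE WHEN ConsigneeName = BillName THEN NULL
                ELSE ConsigneeName END AS B,
           NULLIF(ShipperName, BillName) AS C   -- same effect as the CASE above
    FROM Sum_Orders
    ORDER BY BillName, ConsigneeName
""").fetchall()

for row in rows:
    print(row)
```

Values in B or C that repeat the value in A come back as None (NULL), and all other values pass through unchanged.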

If B or C has the same value as A, it should be set to NULL, so (using MySQL's IF() function; SQL Server's equivalent is IIF()):
Update tablename
Set B = if(A=B, null, B) , C=if(A=C, null, C)
-- where A=B or A=C
You can add the commented-out where clause if you want to avoid touching rows that would not change.
If you only want to select the values:
Select A, if(A=B, null, B) as B , if(A=C, null, C) as C from tablename

T-SQL AND logic

I have table TABLE1 with columns A, B and C. I need to get all rows from the table where columns A, B and C are not all equal to 1, e.g.,
WHERE NOT (A = 1 AND B = 1 AND C = 1)
This works. However, I need to do this in a fashion that only uses AND and OR statements. I had expected this to work:
WHERE A != 1
AND B != 1
AND C != 1
However, this only returns rows where no column equals 1, i.e., too few rows.
Using MS SQL 2008.

WHERE (A <> 1 OR B <> 1 OR C <> 1)
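This follows from De Morgan's law: NOT (x AND y AND z) is equivalent to (NOT x) OR (NOT y) OR (NOT z). A quick check over every combination of 0, 1 and NULL, run through Python's built-in sqlite3 module (an assumption; the question uses MS SQL 2008, but the three-valued logic is the same):

```python
import itertools
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TABLE1 (A INTEGER, B INTEGER, C INTEGER)")
# All 27 combinations of 0, 1 and NULL across the three columns.
con.executemany(
    "INSERT INTO TABLE1 VALUES (?, ?, ?)",
    itertools.product((0, 1, None), repeat=3),
)


def select(where):
    """Return the rows matching the given WHERE clause, in insertion order."""
    sql = f"SELECT A, B, C FROM TABLE1 WHERE {where} ORDER BY rowid"
    return con.execute(sql).fetchall()


negated = select("NOT (A = 1 AND B = 1 AND C = 1)")
or_form = select("A <> 1 OR B <> 1 OR C <> 1")
and_form = select("A != 1 AND B != 1 AND C != 1")

print(negated == or_form)            # True: the OR form matches the original
print(len(or_form), len(and_form))   # 19 1: the AND form returns too few rows
```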
