COUNT() Function in conjunction with NOT IN clause not working properly with varchar field (T-SQL) - sql

I came across a weird situation when trying to count the number of rows that DO NOT have varchar values specified by a select statement. Ok, that sounds confusing even to me, so let me give you an example:
Let's say I have a field "MyField" in "SomeTable" and I want to count in how many rows MyField values do not belong to a domain defined by the values of "MyOtherField" in "SomeOtherTable".
In other words, suppose that I have MyOtherField = {1, 2, 3}, I wanna count in how many rows MyField value are not 1, 2 or 3. For that, I'd use the following query:
SELECT COUNT(*) FROM SomeTable
WHERE ([MyField] NOT IN (SELECT MyOtherField FROM SomeOtherTable))
And it works like a charm. Notice, however, that MyField and MyOtherField are int typed. If I run the exact same query against varchar typed fields, it returns 0 even though I know there are wrong values, I put them there! However, if I count the opposite (how many rows ARE in the domain, rather than how many are not) simply by removing the "NOT" from the query above... Well, THAT works! ¬¬
Yeah, there must be tons of workarounds to this but I'd like to know why it doesn't work the way it should. Furthermore, I can't simply alter the whole query as most of it is built inside a C# code and basically the only part I have freedom to change that won't have an impact in any other part of the software is the select statement that corresponds to the domain (whatever comes in the NOT IN clause). I hope I made myself clear and someone out there could help me out.
Thanks in advance.

NOT IN never evaluates to true if the subquery returns a NULL value: comparing anything with NULL yields UNKNOWN, so no row passes the filter. The accepted answer to this question elegantly describes why.
The nullability of a column is independent of its data type, too: most likely your varchar column has NULL values.
To deal with this, use NOT EXISTS. For non-null values it works the same as NOT IN, so it is compatible:
SELECT COUNT(*) FROM SomeTable S1
WHERE NOT EXISTS (SELECT * FROM SomeOtherTable S2 WHERE S1.[MyField] = S2.MyOtherField)

gbn has a more complete answer, but I can't be bothered to remember all that. Instead I have the religious habit of filtering nulls out of my IN clauses:
SELECT COUNT(*)
FROM SomeTable
WHERE [MyField] NOT IN (
SELECT MyOtherField FROM SomeOtherTable
WHERE MyOtherField is not null
)
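To see the effect described in both answers, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption on my part; the NULL handling of NOT IN shown here is standard SQL behaviour, but the sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (MyField TEXT);
    CREATE TABLE SomeOtherTable (MyOtherField TEXT);
    -- illustrative data: 'x' and 'y' are outside the domain
    INSERT INTO SomeTable VALUES ('a'), ('b'), ('x'), ('y');
    -- the NULL in the domain is what breaks NOT IN
    INSERT INTO SomeOtherTable VALUES ('a'), ('b'), (NULL);
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

# Plain NOT IN: the NULL makes the predicate UNKNOWN for 'x' and 'y'.
not_in = count("""
    SELECT COUNT(*) FROM SomeTable
    WHERE MyField NOT IN (SELECT MyOtherField FROM SomeOtherTable)
""")

# NOT EXISTS is not affected by the NULL.
not_exists = count("""
    SELECT COUNT(*) FROM SomeTable S1
    WHERE NOT EXISTS (SELECT * FROM SomeOtherTable S2
                      WHERE S1.MyField = S2.MyOtherField)
""")

# NOT IN with NULLs filtered out of the subquery.
not_in_filtered = count("""
    SELECT COUNT(*) FROM SomeTable
    WHERE MyField NOT IN (SELECT MyOtherField FROM SomeOtherTable
                          WHERE MyOtherField IS NOT NULL)
""")

print(not_in, not_exists, not_in_filtered)
```

The last variant only touches the subquery, which matches the constraint in the question that only the part inside the NOT IN clause can be changed.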

Related

How to know which column has changed on UPDATE?

In a statement like this:
update tab1 set (col1,col2)=(val1,val2) returning "?"
I send the whole row with new values for the update, and RETURNING * gives back the whole row, but is there a way to check exactly which column has changed when the others remained the same?
I understand that UPDATE rewrites the values, but maybe there is some built-in function for such comparison?
Basically, you need the pre-UPDATE values of updated rows to compare. That's kind of hard as RETURNING only returns post-UPDATE state. But can be worked around. See:
Return pre-UPDATE column values using SQL only
So this does the basic trick:
WITH input(col1, col2) AS (
SELECT 1, text 'post_up' -- "whole row"
)
, pre_upd AS (
UPDATE tab1 x
SET (col1, col2) = (i.col1, i.col2)
FROM input i
JOIN tab1 y USING (col1)
WHERE x.col1 = y.col1
RETURNING y.*
)
TABLE pre_upd
UNION ALL
TABLE input;
This is assuming that col1 in your example is the PRIMARY KEY. We need some way to identify rows unambiguously.
Note that this is not safe against race conditions between concurrent writes. You need to do more to be safe. See related answer above.
The explicit cast to text I added in the CTE above is redundant as text is the default type for string literals anyway. (Like integer is the default for simple numeric literals.) For other data types, explicit casting may be necessary. See:
Casting NULL type when updating multiple rows
Also be aware that all updates write a new row version, even if nothing changes at all. Typically, you'd want to suppress such costly empty updates with appropriate WHERE clauses. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
While "passing whole rows", you'll have to check on all columns that might change, to achieve that.
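If doing the comparison on the client is acceptable, the same idea (fetch the pre-UPDATE row, compare column by column, skip empty updates) can be sketched as below. This is not the CTE approach above: it uses Python's sqlite3 module instead of Postgres, the helper name update_and_diff and the extra column col3 are hypothetical, and it has the same weakness against concurrent writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# col1 plays the role of the primary key; col3 is a made-up extra column
conn.execute("CREATE TABLE tab1 (col1 INTEGER PRIMARY KEY, col2 TEXT, col3 TEXT)")
conn.execute("INSERT INTO tab1 VALUES (1, 'pre_up', 'same')")

COLUMNS = ("col1", "col2", "col3")

def update_and_diff(conn, new_row):
    """Write new_row over the row with the same col1.

    Returns the names of the columns whose value differs from the
    stored row, or None if no such row exists.
    """
    with conn:  # commits on success, rolls back on an exception
        old_row = conn.execute(
            "SELECT col1, col2, col3 FROM tab1 WHERE col1 = ?", (new_row[0],)
        ).fetchone()
        if old_row is None:
            return None
        changed = [name for name, old, new in zip(COLUMNS, old_row, new_row)
                   if old != new]
        if changed:  # suppress the empty update when nothing differs
            conn.execute(
                "UPDATE tab1 SET col2 = ?, col3 = ? WHERE col1 = ?",
                (new_row[1], new_row[2], new_row[0]),
            )
    return changed

changed_first = update_and_diff(conn, (1, "post_up", "same"))
changed_again = update_and_diff(conn, (1, "post_up", "same"))
print(changed_first, changed_again)
```

The second call writes nothing, because the incoming row is identical to what is already stored.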

How to backreference a calculated column value in another column during an INSERT query on Postgres? (query-runtime temporary variable assignment)

In MySQL there's some helpful syntax for doing things like SELECT @calc:=3, @calc, but I can't find a way to do this in PostgreSQL.
The idea would be something like:
SELECT (SET) autogen := UUID_GENERATE_v4() AS id, :autogen AS duplicated_id;
returning a row with 2 columns with same value
EDIT: Not interested in conventional \set, I need to do this for hundreds of rows
You can use a subquery:
select id, id as duplicated_id
from (select UUID_GENERATE_v4() AS id
) x
Postgres does not confuse the select statement by allowing variable assignment. Even if it did, nothing guarantees the order of evaluation of expressions in a select, so you still would not be sure that it worked.

Add column with substring of other column in SQL (Snowflake)

I feel like this should be simple but I'm relatively unskilled in SQL and I can't seem to figure it out. I'm used to wrangling data in python (pandas) or Spark (usually pyspark) and this would be a one-liner in either of those. Specifically, I'm using Snowflake SQL, but I think this is probably relevant to a lot of flavors of SQL.
Essentially I just want to trim the first character off of a specific column. More generally, what I'm trying to do is replace a column with a substring of the same column. I would even settle for creating a new column that's a substring of an existing column. I can't figure out how to do any of these things.
One obvious solution would be to create a temporary table with something like
CREATE TEMPORARY TABLE tmp_sub AS
SELECT id_col, substr(id_col, 2, 10) AS id_col_sub FROM table1
and then join it back and write a new table
CREATE TABLE table2 AS
SELECT
b.id_col_sub as id_col,
a.some_col1, a.some_col2, ...
FROM table1 a
JOIN tmp_sub b
ON a.id_col = b.id_col
My tables have roughly a billion rows though and this feels extremely inefficient. Maybe I'm wrong? Maybe this is just the right way to do it? I guess I could replace the CREATE TABLE table2 AS... to INSERT OVERWRITE INTO table1 ... and at least that wouldn't store an extra copy of the whole thing.
Any thoughts and ideas are most welcome. I come at this humbly from the perspective of someone who is baffled by a language that so many people seem to have mastery over.
I'm not sure the exact syntax/functions in Snowflake but generally speaking there's a few different ways of achieving this.
I guess the general approach that would work universally is using the SUBSTRING function that's available in any database.
Assuming you have a table called Table1 with the following data:
+-------+-----------------------------------------+
| Code  | Desc                                    |
+-------+-----------------------------------------+
| 0001  | 1First Character Will be Removed        |
| 0002  | xCharacter to be Removed                |
+-------+-----------------------------------------+
The SQL code to remove the first character would be:
select SUBSTRING(Desc,2,len(desc)) from Table1
Please note that the name of the "SUBSTRING" function varies between databases. In Oracle, for example, the function is "SUBSTR". You just need to find the Snowflake equivalent.
Another approach that would work at least in SQLServer and MySQL would be using the "RIGHT" function
select RIGHT(Desc,len(Desc) - 1) from Table1
Based on your question I assume you actually want to update the actual data within the table. In that case you can use the same function above in an update statement.
update Table1 set Desc = SUBSTRING(Desc,2,len(desc))
You didn't try this?
UPDATE tableX
SET columnY = substr(columnY, 2, 10 ) ;
-Paul-
There is no need to specify the length, as is evidenced from the following simple test harness:
SELECT $1
,SUBSTR($1, 2)
,RIGHT($1, -2)
FROM VALUES
('abcde')
,('bcd')
,('cdef')
,('defghi')
,('e')
,('fg')
,('')
;
Here SUBSTR(<col>, 2) removes the first character of the <col> column value. Double-check the RIGHT(<col>, -2) variant before relying on it: Snowflake documents that RIGHT returns an empty string when the length is negative.
As for the strategy of using UPDATE versus INSERT OVERWRITE, I do not believe that there will be any difference in performance or outcome, so I might opt for the UPDATE since it is simpler. So, in conclusion, I would use:
UPDATE tableX
SET columnY = SUBSTR(columnY, 2)
;
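For a quick local check of the two-argument form, here is a small sketch using Python's sqlite3 module. SQLite is only a stand-in here (an assumption; its SUBSTR also accepts a start position without a length), and the sample values are borrowed from the test harness above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableX (columnY TEXT)")
conn.executemany(
    "INSERT INTO tableX VALUES (?)",
    [("abcde",), ("e",), ("",)],
)

# Same statement as above: keep everything from the 2nd character onwards.
conn.execute("UPDATE tableX SET columnY = SUBSTR(columnY, 2)")

trimmed = [row[0] for row in
           conn.execute("SELECT columnY FROM tableX ORDER BY rowid")]
print(trimmed)  # ['bcde', '', '']
```

Single-character and empty values simply end up as empty strings, so no special casing is needed for them.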

SQL0802 - invalid numeric data

I'm on a DB2 database on an AS/400 system.
I have a select query that is throwing the error in the title: SQL0802 code 6 which is "invalid numeric data" (translated).
I have tried splitting the query into parts and testing each part on its own. I am 99% convinced that the problem comes from a "CAST" clause I am using in a subquery (to cast CHAR to INT); I just don't understand why the subquery works by itself but not as part of the main query.
So if I run the subquery with the "CAST" clause it works fine, but when I run the main query that uses the subquery it doesn't work and the error arises.
Main query can be divided in 2 queries, see the code below.
query1 looks something like this:
select SUM(Price) from TABLE1
where X = 1
group by Country
having SUM(Price) = (query2);
query2 looks something like this:
SELECT SUM(UnitPrice * AmountStocked)
FROM TABLE2
WHERE J = X and ItemNumber in (
SELECT CAST(ItmNumbr AS INT) from TABLE3
where Id in (select Id from TABLE4 where Z=Y)
)
Notes:
*query2 will return a single number.
*Running query2 by itself works fine.
*Running query1 without the "having" clause works fine too.
*If I substitute the "SELECT CAST..." subquery in query2 with something like "(2002, 9912, 1234)" and then run the main query it works fine, so this pretty much confirms that the problem is the "CAST" clause.
*I have to CAST ItmNumbr to INT because ItemNumber is of Numeric type and ItmNumbr is of Char type.
You said:
*I have to CAST ItmNumbr to INT because ItemNumber is of Numeric type and ItmNumbr is of Char type.
But this is not true. You could cast the other way around:
SELECT SUM(UnitPrice * AmountStocked)
FROM TABLE2
WHERE J = X and CHAR(ItemNumber) in (
SELECT TRIM(ItmNumbr) from TABLE3
where Id in (select Id from TABLE4 where Z=Y)
)
The advantage here is that non-numeric characters in ItmNumbr will not blow you up, and CHAR(ItemNumber) should also not fail.
One thing to know about DB2 for i is that there are two ways to create database tables, and the two differ slightly in the characteristics of the resulting table. If the table is created using DDL (CREATE TABLE ...), then that table cannot contain bad data. The data types are verified on write, no matter how you write the data, it is validated before being written to the table. If the table is created by DDS (CRTPF ...), the table can indeed contain bad data because the data is not validated until it is read and loaded into a variable. Old style programs that write data to DDS tables by writing a record from a program described data structure are able to put whatever they want into a DDS defined table, including numeric data in character fields or worse, character data in numeric fields. This usually is only found in very old databases that have been migrated from the System/36 (circa 1980's) which used flat files rather than database files (it had no notion of a database). I only posit this because it is possible. Check the data in your file using hex() to see if there is anything funky in the ItmNumbr or ItemNumber fields.
I am not sure but I am thinking the issue has to do with your join of "WHERE J = X" since we don't know what "J" is and it may not join to "X" (not the correct data type).
Based on your analysis:
"*If I substitute the "SELECT CAST..." subquery in query2 with something like "(2002, 9912, 1234)" and then run the main query it works fine, so this pretty much confirms that the problem is the "CAST" clause."
Check the content of TABLE3.ItmNumbr. If it is defined as NUMERIC (unpacked decimal) it may contain non-numeric values (typically spaces). That may be causing the error you are observing.
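One way to hunt for such values is to list every row whose character field is blank or contains anything other than digits, together with its hex dump. The sketch below uses Python's sqlite3 module with made-up rows, so it only illustrates the idea: the GLOB pattern is SQLite syntax, and on DB2 for i the hex values would be EBCDIC rather than the ASCII codes shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE3 (Id INTEGER, ItmNumbr TEXT)")
conn.executemany(
    "INSERT INTO TABLE3 VALUES (?, ?)",
    [
        (1, "2002"),   # clean
        (2, "9912 "),  # trailing blank only, fine after TRIM
        (3, "12A4"),   # stray letter
        (4, "    "),   # all blanks
    ],
)

# Rows that would not survive a CAST to INT: empty after trimming,
# or containing at least one non-digit character.
bad_rows = conn.execute("""
    SELECT Id, ItmNumbr, HEX(ItmNumbr)
    FROM TABLE3
    WHERE TRIM(ItmNumbr) = ''
       OR TRIM(ItmNumbr) GLOB '*[^0-9]*'
    ORDER BY Id
""").fetchall()

for row in bad_rows:
    print(row)
```

Any row this returns is a candidate for the value that trips the CAST once the optimizer evaluates it as part of the larger query.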

one column two times in WHERE condition

I have an important question. I need to use a column two times in WHERE condition.
Example is here:
SELECT COL1 as salary, COL1 as money
FROM employees
WHERE salary = '3000' OR money = '5000'
How can I restrict the same column twice? I need a simple solution. Better with alias. Thank you
When a SELECT statement is processed, the WHERE clause is processed before the SELECT clause. This means that, when the WHERE clause is processed, the aliases (which are defined in SELECT) don't exist yet. The query the way you wrote it will fail with a syntax error message, something like "unknown identifier."
Since what you are really filtering on is the value in col1, why do you care if it is using the column name col1 or an alias? Somehow I get the impression that your problem is different and you over-simplified it to the point that it no longer makes sense.
In any case: with what you have shown (which, again, may not be your real problem), you can write the WHERE condition either as
where col1 = 3000 or col1 = 5000
(assuming col1 is of number data type - there's no reason to compare to strings like '3000' and '5000'), or as
where col1 in (3000, 5000)
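A small sketch to confirm that the two forms select the same rows, again with Python's sqlite3 module as a stand-in database (the name column and the sample rows are made up). If you really do want to filter on the alias, a portable option is to define it in a derived table first, which is shown as the third query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, col1 INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 3000), ("Bob", 4000), ("Cid", 5000)],
)

# Filter on the real column with OR.
with_or = conn.execute(
    "SELECT name FROM employees WHERE col1 = 3000 OR col1 = 5000 ORDER BY name"
).fetchall()

# Same filter written with IN.
with_in = conn.execute(
    "SELECT name FROM employees WHERE col1 IN (3000, 5000) ORDER BY name"
).fetchall()

# The alias is defined in the inner query, so the outer WHERE can see it.
with_alias = conn.execute("""
    SELECT name, salary
    FROM (SELECT name, col1 AS salary FROM employees) AS e
    WHERE salary IN (3000, 5000)
    ORDER BY name
""").fetchall()

print(with_or, with_in, with_alias)
```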