SQL query to find columns having at least one non-null value

I am developing a data validation framework with the requirement of checking that table fields have at least one non-null value, i.e., they shouldn't be completely empty with every value NULL.
For a particular column, I can easily check using
select count(distinct column_name) from table_name;
If it's greater than 0, I can tell that the column is not empty. I already have a list of columns, so I could execute this query in a loop for every column, but that would mean a lot of requests and isn't ideal.
What is the better way of doing this? I am using Microsoft SQL Server.

I would not recommend using count(distinct) because it incurs overhead for removing duplicate values. You can just use count().
You can construct the query for counts using a query like this:
select count(col1) as col1_cnt, count(col2) as col2_cnt, . . .
from t;
If you have a list of columns you can do this as dynamic SQL. Something like this:
declare @list nvarchar(max) = N'col1,col2,col3';  -- your list of columns
declare @sql nvarchar(max);
select @sql = concat('select ',
                     string_agg(concat('count(', quotename(s.value), ') as cnt_', s.value), ', '),
                     ' from t'
                    )
from string_split(@list, ',') s;
exec sp_executesql @sql;
This might not quite work if your columns have special characters in them, but it illustrates the idea.
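If the column list lives in the database rather than in a delimited string, a variant of the same idea (just a sketch; dbo.t is a placeholder table name) can build the query from sys.columns and quote the aliases as well, which addresses the special-character caveat:
declare @sql nvarchar(max);
-- one count() per column; quotename() guards both the column reference
-- and the alias against special characters
select @sql = concat('select ',
                     string_agg(concat('count(', quotename(c.name), ') as ',
                                       quotename(concat('cnt_', c.name))), ', '),
                     ' from dbo.t')
from sys.columns c
where c.object_id = object_id('dbo.t');
exec sp_executesql @sql;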

You should probably use exists, since you don't really need a count of anything.
You don't indicate how you want to consume the results of multiple counts; however, one thing you could do is use concat to return a list of the columns meeting your criteria:
The following sample table has 5 columns, 3 of which have a value on at least 1 row.
create table t (col1 int, col2 int, col3 int, col4 int, col5 int)
insert into t select null,null,null,null,null
insert into t select null,2,null,null,null
insert into t select null,null,null,null,5
insert into t select null,null,null,null,6
insert into t select null,4,null,null,null
insert into t select null,6,7,null,null
You can name the result of each case expression and concatenate them; only the columns that have a non-null value are included, because concat_ws ignores the nulls returned by the case expressions.
select Concat_ws(', ',
case when exists (select * from t where col1 is not null) then 'col1' end,
case when exists (select * from t where col2 is not null) then 'col2' end,
case when exists (select * from t where col3 is not null) then 'col3' end,
case when exists (select * from t where col4 is not null) then 'col4' end,
case when exists (select * from t where col5 is not null) then 'col5' end)
Result:
col2, col3, col5

I asked a similar question about a decade ago. The best way of doing this in my opinion would meet the following criteria.
Combine the requests for multiple columns together so they can all be calculated in a single scan.
If the scan encounters a not-null value in every column under consideration, allow it to exit early without reading the rest of the table/index, as reading subsequent rows won't change the result.
This is quite a difficult combination to get in practice.
The following might give you the desired behaviour
SELECT DISTINCT TOP 2 ColumnWithoutNull
FROM YourTable
CROSS APPLY (VALUES(CASE WHEN b IS NOT NULL THEN 'b' END),
(CASE WHEN c IS NOT NULL THEN 'c' END)) V(ColumnWithoutNull)
WHERE ColumnWithoutNull IS NOT NULL
OPTION ( HASH GROUP, MAXDOP 1, FAST 1)
Check the execution plan you get. A hash match usually reads all of its build input first, meaning that no short-circuiting of the scan will happen. If the optimiser gives you the operator in "flow distinct" mode, however, it won't do this, and query execution can potentially stop as soon as TOP receives its first two rows, signalling that a NOT NULL value has been found in both columns.
But there is no hint to request this mode for the hash aggregate, so you are dependent on the whims of the optimiser as to whether you get it in practice. The various hints added to the query above are an attempt to point it in that direction, however.

Related

How to duplicate records, modify and add them to same table

I've got a question and hopefully you can help me out. :)
What I have is a table like this:
ID Col1 Col2 ReverseID
1 Number 1 Number A
2 Number 2 Number B
3 Number 3 Number C
What I want to achieve is:
Create duplicate of every record with switched columns and add them to original table
Add the ID of the duplicate to ReverseID column of original record and vice-versa
So the new table should look like:
ID Col1 Col2 ReverseID
1 Number 1 Number A 4
2 Number 2 Number B 5
3 Number 3 Number C 6
4 Number A Number 1 1
5 Number B Number 2 2
6 Number C Number 3 3
What I've done so far is work with a temporary table:
SELECT * INTO #tbl
FROM myTable
UPDATE #tbl
SET Col1 = Col2,
Col2 = Col1,
ReverseID = ID
INSERT INTO DUPLICATEtable(
Col1,
Col2,
ReverseID
)
SELECT Col1,
Col2,
ReverseID
FROM #tbl
In this example code I used a secondary table just for making sure I do not compromise the original data records.
I think I could skip the SET-part and change the columns in the last SELECT statement to achieve the same, but I am not sure.
Anyway - with this I am ending up at:
ID Col1 Col2 ReverseID
1 Number 1 Number A
2 Number 2 Number B
3 Number 3 Number C
4 Number A Number 1 1
5 Number B Number 2 2
6 Number C Number 3 3
So the question remains: How do I get the ReverseIDs correctly added to original records?
As my SQL knowledge is pretty limited, I am almost sure this is not the simplest way of doing things, so I hope you guys & girls can enlighten me and lead me to a more elegant solution.
Thank you in advance!
br
mrt
Edit:
I'll try to illustrate my initial problem, so this posting gets long. ;)
First of all: my frontend does not allow any SQL statements; I have to focus on classes, attributes, and relations.
First root cause:
Instances of a class B (B1, B2, B3, ...) are linked together in a class Relation; these are many-to-many relations within the same class. My frontend does not allow join tables, so this is a workaround.
Say a user adds a relation with B1 as the first side (I'll just call it 'left') and B2 as the second side (right):
Navigating from B1, there will be two relations showing up (FK_Left, FK_Right), but only one of them will contain a value (let's say FK_Left).
Navigating from B2, the value will be only listed in the other relation (FK_Right).
So from the user's side, there are always two relations displayed, but whether the data is found behind relation_left or relation_right depends on how the record was entered.
That's not practical usability.
If I had all records with vice-versa partners, I can just hide one of the relations and the user sees all information behind one relation, regardless how it was entered.
Second root cause:
The frontend provides some matrix view, which gets the relation class as input and displays left partners in columns and right partners in rows.
Let's say I want to see all instances of A in columns and their partners in rows; this is only possible if all relations regarding the instances of A are entered the same way, e.g. all A-instances as the left partner.
The matrix view shall be freely filterable regarding rows and columns, so if I had duplicate relations, I can filter on any of the partners in rows and columns.
Sorry for the long text; I hope that made my situation a bit clearer.
I would suggest just using a view instead of trying to create and maintain two copies of the same data. Then you just select from the view instead of the base table.
create view MyReversedDataView as
select ID
, Col1
, Col2
from MyTable
UNION ALL
select ID
, Col2
, Col1
from MyTable
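You then query the view exactly as you would the base table; a quick usage sketch:
select ID, Col1, Col2
from MyReversedDataView
where Col1 = 'Number A';  -- finds the reversed counterpart of row 1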
The trick to this kind of thing is to start with a SELECT that gets the data you need. In this case you need a resultset with Col1, Col2, reverseid.
SELECT Col2 Col1, Col1 Col2, ID reverseid
INTO #tmp FROM myTable;
Convince yourself it's correct -- swapped column values etc.
Then do this:
INSERT INTO myTable (Col1, col2, reverseid)
SELECT Col1, Col2, reverseid FROM #tmp;
If you're doing this from a GUI like ssms, don't forget to DROP TABLE #tmp;
BUT, you can get the same result with a pure query, without duplicating rows. Why do it this way?
You save the wasted space for the reversed rows.
The reversed rows are always up to date, even if you forget to re-run the process that reverses and inserts them into the table.
There's no consistency problem if you insert or delete rows from the table.
Here's how you might do this.
SELECT Col1, Col2, null reverseid FROM myTable
UNION ALL
SELECT Col2 Col1, Col1 Col2, ID reverseid FROM myTable;
You can even make it into a view and use it as if it were a table going forward.
CREATE VIEW myTableWithReversals AS
SELECT Col1, Col2, null reverseid FROM myTable
UNION ALL
SELECT Col2 Col1, Col1 Col2, ID reverseid FROM myTable;
Then you can say SELECT * FROM myTableWithReversals WHERE Col1 = 'value' etc.
Let me assume that the id column is auto-incremented. Then, you can do this in two steps:
insert into myTable (Col1, Col2, reverseid)
select col2, col1, id
from myTable t
order by id; -- ensures that they go in in the right order
This inserts the new ids with the right reverseid. Now we have to update the previous values:
update t
set reverseid = tr.id
from myTable t join
myTable tr
on tr.reverseid = t.id;
Note that no temporary tables are needed.
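Putting it together on the sample data, a minimal end-to-end sketch (assuming ID is an identity column):
create table myTable (
    ID        int identity(1,1) primary key,
    Col1      varchar(20),
    Col2      varchar(20),
    ReverseID int null
);
insert into myTable (Col1, Col2)
values ('Number 1', 'Number A'),
       ('Number 2', 'Number B'),
       ('Number 3', 'Number C');
-- step 1: insert the reversed copies, remembering the original id
insert into myTable (Col1, Col2, ReverseID)
select Col2, Col1, ID
from myTable
order by ID;
-- step 2: point the originals at their new copies
update t
set ReverseID = tr.ID
from myTable t join
myTable tr
on tr.ReverseID = t.ID;
select * from myTable order by ID;  -- matches the desired result above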

Using EXCEPT where 1=0

I saw the following posted as a basic way to de-dup entries, without any explanation of how it works. I can see that it works, but I want to understand how it works and the process by which it is evaluated. Below I will post the code and my thoughts. I am hoping that somebody can tell me whether my step-by-step understanding of how this is evaluated is correct, or, if I am off, break it down for me.
CREATE TABLE #DuplicateRcordTable (Col1 INT, Col2 INT)
INSERT INTO #DuplicateRcordTable
SELECT 1, 1
UNION ALL
SELECT 1, 1
UNION ALL
SELECT 1, 1
UNION ALL
SELECT 1, 2
UNION ALL
SELECT 1, 2
UNION ALL
SELECT 1, 3
UNION ALL
SELECT 1, 4
GO
This returns a basic table of seven rows: (1,1), (1,1), (1,1), (1,2), (1,2), (1,3), (1,4).
Then this code is used to exclude duplicates:
SELECT col1,col2
FROM #DuplicateRcordTable
EXCEPT
SELECT col1,col2
FROM #DuplicateRcordTable WHERE 1=0
My understanding is that where 1=0 creates a "temp" table structured the same but has no data.
Does this code then start adding data to the new empty table?
For example, does it look at the first Col1, Col2 pair of 1,1 and say "I don't see it in the table", so it adds it to the "temp" table and the end result; then check the next row, which is also 1,1, and see it is already in the "temp" table, so it's not added to the end result... and so on through the data?
EXCEPT is a set operation that removes duplicates. That is, it takes everything in the first table that is not in the second and then does duplicate removal.
With an empty second set, all that is left is the duplicate removal.
Hence,
SELECT col1, col2
FROM #DuplicateRcordTable
EXCEPT
SELECT col1, col2
FROM #DuplicateRcordTable
WHERE 1 = 0;
is equivalent to:
SELECT DISTINCT col1, col2
FROM #DuplicateRcordTable
This would be the more typical way to write the query.
This would also be equivalent to:
SELECT col1,col2
FROM #DuplicateRcordTable
UNION
SELECT col1,col2
FROM #DuplicateRcordTable
WHERE 1 = 0;
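On the sample data above, all three forms return the same four distinct pairs:
col1 col2
1    1
1    2
1    3
1    4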
The reason that this works is due to the definition of EXCEPT which according to the MS docs is
EXCEPT returns distinct rows from the left input query that aren't
output by the right input query.
The key word here being distinct. Putting where 1 = 0 makes the second query return no results, but the EXCEPT operator itself then reduces the rows from the left query down to those which are distinct.
As @Gordon Linoff says in his answer, there is a simpler, more straightforward way to accomplish this.
The fact that the example uses the same table in the left and right queries could be misleading; the following query will accomplish the same thing, so long as the values in the right query don't exist in the left:
SELECT col1, col2
FROM #DuplicateRecordTable
EXCEPT
SELECT -1, -1
REF: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017

SQL Server inconsistent results over 2 columns using = and <>

I am trying to replace a manual process with a SQL Server (2012) based automated one. Prior to doing this, I need to analyse the data in question over time to produce some data quality measures/statistics.
Part of this entails comparing the values in two columns. I need to count where they match and where they do not so I can prove my varied stats tally. This should be simple but seems not to be.
Basically, I have a table containing two columns both of which are defined identically as type INT with null values permitted.
SELECT * FROM TABLE
WHERE COLUMN1 is NULL
returns zero rows
SELECT * FROM TABLE
WHERE COLUMN2 is NULL
also returns zero rows.
SELECT COUNT(*) FROM TABLE
returns 3780
and
SELECT * FROM TABLE
returns 3780 rows.
So I have established that there are 3780 rows in my table and that there are no NULL values in the columns I am interested in.
SELECT * FROM TABLE
WHERE COLUMN1=COLUMN2
returns zero rows as expected.
Conversely, therefore, in a table of 3780 rows with no NULL values in the columns being compared, I expect the following SQL
SELECT * FROM TABLE
WHERE COLUMN1<>COLUMN2
or in desperation
SELECT * FROM TABLE
WHERE NOT (COLUMN1=COLUMN2)
to return 3780 rows but it doesn't. It returns 3709!
I have tried SELECT * instead of SELECT COUNT(*) in case NULL values in some other columns were having an impact, but this made no difference; I still got 3709 rows.
Also, there are some negative values in 73 rows for COLUMN1 - is this what causes the issue (though 73+3709=3782, not 3780, my number of rows)?
What is a better way of proving the values in these numeric columns never match?
Update 09/09/2016: At Lamak's suggestion below I isolated the 71 missing rows and found that in each one, COLUMN1 = NULL and COLUMN2 = -99. So the issue is NULL values, but why doesn't
SELECT * FROM TABLE WHERE COLUMN1 is NULL
pick them up? Here is the information in Information Schema Views and System Views:
ORDINAL_POSITION COLUMN_NAME DATA_TYPE CHARACTER_MAXIMUM_LENGTH IS_NULLABLE
1 ID int NULL NO
.. .. .. .. ..
7 COLUMN1 int NULL YES
8 COLUMN2 int NULL YES
CONSTRAINT_NAME
PK__TABLE___...
name type_desc is_unique is_primary_key
PK__TABLE___... CLUSTERED 1 1
Suspect the CHARACTER_MAXIMUM_LENGTH of NULL must be the issue?
You can find the counts using the queries below, based on a left join.
--To find COLUMN1=COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.COLUMN1=T2.COLUMN2
WHERE t2.id is not null
--To find COLUMN1<>COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.COLUMN1=T2.COLUMN2
WHERE t2.id is null
Through the exhaustive comment chain above, with all help gratefully received, I suspect this to be a problem with the data types in the table creation script for the columns in question. I have no explanation, from an SQL code point of view, as to why the "is NULL" check intermittently failed to pick up NULL values.
I was able to identify the 71 rows that were not being picked up as expected by using an "except".
i.e. I flipped the SQL that was missing 71 rows, namely:
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
through an except:
SELECT * FROM TABLE
EXCEPT
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
Through that I could see that COLUMN1 was always NULL in the missing 71 rows - even though the "is NULL" was not picking them up for me when I ran
SELECT * FROM TABLE WHERE COLUMN1 IS NULL
which returned zero rows.
Regarding the comparison of values stored in the columns, as my data volumes are low (3780 recs), I am just forcing the issue by using ISNULL and setting to 9999 (a numeric value I know my data will never contain) to make it work.
SELECT * FROM TABLE
WHERE ISNULL(COLUMN1, 9999) <> COLUMN2
I then get the 3780 rows as expected. It's not ideal, but it'll have to do, and it is more or less appropriate: there are null values in there, so they have to be handled.
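As an alternative to the magic number, T-SQL has a NULL-safe comparison idiom: INTERSECT treats NULLs as equal, so the following sketch returns the rows where the two columns differ, counting a NULL against -99 as a difference without inventing a sentinel value:
SELECT *
FROM [TABLE] T
WHERE NOT EXISTS (SELECT T.COLUMN1
                  INTERSECT
                  SELECT T.COLUMN2);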
Also, using Bertrand's tip above, I could view the table creation script, and the columns were definitely set up as INT.

Check a lot of columns for at least one 'true'

I have a table with a lot of columns (say 200), all boolean. I want to know which of those has at least one record set to true. I have come up with the following query, which works fine:
SELECT sum(Case When [column1] = 1 Then 1 Else 0 End) as column1,
       sum(Case When [column2] = 1 Then 1 Else 0 End) as column2,
       sum(Case When [column3] = 1 Then 1 Else 0 End) as column3
FROM [tablename];
It will return the number of rows that are 'true' for each column. However, this is more information than I need and therefore maybe a more expensive query than needed. The query keeps scanning all fields for all records even though that is not necessary.
I just learned something about CHECKSUM(*) that might be useful. Try the following code:
DECLARE #T TABLE (
b1 bit
,b2 bit
,b3 bit
);
DECLARE #T2 TABLE (
b1 bit
,b2 bit
,b3 bit
,b4 bit
,b5 bit
);
INSERT INTO #T VALUES (0,0,0),(1,1,1);
INSERT INTO #T2 VALUES (0,0,0,0,0),(1,1,1,1,1);
SELECT CHECKSUM(*) FROM #T;
SELECT CHECKSUM(*) FROM #T2;
You will see from the results that no matter how many columns are in a row, if they are all bit columns with a value of 0, the result of CHECKSUM(*) is always 0.
This means that you could use WHERE CHECKSUM(*)<>0 in your query to save the engine the trouble of summing rows where all the values are 0. Might improve performance.
And even if it doesn't, it's a neat thing to know.
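Applied to the query from the question, the suggestion would look something like this (a sketch; note that CHECKSUM is a hash, so in principle a row with some 1s could also hash to 0):
SELECT sum(Case When [column1] = 1 Then 1 Else 0 End) as column1,
       sum(Case When [column2] = 1 Then 1 Else 0 End) as column2,
       sum(Case When [column3] = 1 Then 1 Else 0 End) as column3
FROM [tablename]
WHERE CHECKSUM(*) <> 0;  -- rows where every bit column is 0 checksum to 0 and are skipped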
EDIT:
You could do an EXISTS() function on each column. I understand that the EXISTS() function stops scanning when it finds a value that exists. If you have more rows than columns, it might be more performant. If you have more columns than rows, then your current query using SUM() on every column is probably the fastest thing you can do.
If you just want to know the rows that have at least one boolean field set, you will need to test every one of them.
Something like this (maybe):
SELECT ROW.*
FROM TABLE ROW
WHERE ROW.COLUMN_1 = 1
OR ROW.COLUMN_2 = 1
OR ROW.COLUMN_3 = 1
OR ...
OR ROW.COLUMN_N = 1;
If you actually have 200 boolean columns/fields on one table, then something like the following should work.
SELECT CASE WHEN column1 + column2 + column3 + ... + column200 >= 1
            THEN 'Something was true for this record'
            ELSE NULL
       END AS My_Big_Field_Test
FROM [TableName];
I'm not in front of my machine, but you could also try the bitwise or operator:
SELECT * FROM [table name] WHERE column1 | column2 | column3 = 1
The OR answer from Arthur is the other suggestion I would offer. Try a few different suggestions and look at the query plans. Also take a look at disk reads and CPU usage (SET STATISTICS IO ON and SET STATISTICS TIME ON).
See whichever method gives the desired results and the best performance... and then let us know :-)
You can use a query of the form
SELECT
CASE WHEN EXISTS (SELECT * FROM [Table] WHERE [Column1] = 1) THEN 1 ELSE 0 END AS 'Column1',
CASE WHEN EXISTS (SELECT * FROM [Table] WHERE [Column2] = 1) THEN 1 ELSE 0 END AS 'Column2',
...
The efficiency of this critically depends on how sparse your table is. If there are columns where every single row has a 0 value, then any query that searches for a 1 value will require a full table scan, unless an index is in place. A really good choice for this scenario (millions of rows and hundreds of columns) is a columnstore index. These are supported from SQL Server 2012 onwards; from SQL Server 2014 onwards they don't cause the table to be read-only (which is a major barrier to their adoption).
With a columnstore index in place, each subquery should require constant time, and so should the query as a whole (in fact, with hundreds of columns, this query gets so big that you might run into trouble with the input buffer and need to split it up into smaller queries). Without indexes, this query can still be effective as long as the table isn't sparse -- if it "quickly" runs into a row with a 1 value, it stops.
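For reference, creating such an index might look like the following sketch (the index name and column list are placeholders; see the version caveats above regarding writability):
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Table_Flags
ON [Table] ([Column1], [Column2], [Column3] /* ...and the rest */);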

Catching null warnings in aggregate functions in sql

How does one use the debugger in sql 2008 / 2012 to catch null values in records?
See:
drop table abc
create table abc(
a int
)
go
insert into abc values(1)
insert into abc values(null)
insert into abc values(2)
select max(a) from abc
(1 row(s) affected)
Warning: Null value is eliminated by an aggregate or other SET operation.
Now this can be rectified by doing:
SELECT max(isNull(a,0)) FROM abc
which is fine, until I come to 200-line queries with several levels of nesting and a result set of 2000-odd records -- and then have no clue which column is throwing the warning.
How do I add conditional breakpoints ( or break on warning ) in the SQL debugger? ( if it is even possible )
Part 1: About aggregate warnings...
Considering your several levels of nesting, I am afraid there is no straightforward way of seeing which records trigger those warnings.
I think your best shot would be to remove each aggregate function, one at a time, from the SELECT part of the top-level statement and run the query, so you can see which aggregate is causing warnings at the top level (if any).
After that you should move on to the nested queries: move each sub-query that feeds the top-level aggregates to a separate window, run it there, and check for warnings. Repeat this for additional levels of nesting to find out what actually causes the warnings.
You can also employ the following method.
Part 2: About conditional breakpoints...
For the sake of debugging, move each of your nested queries out and put its data into a temp table. After that, check for null values in that temp table, and set a breakpoint in an IF statement. I believe this is the closest thing to a conditional breakpoint. (The IF clause can be altered to build other conditions.)
Here is a concrete example.
Instead of this:
SELECT myTableA.col1, myTableA.col2, SUM(myTableA.col3) as col3
FROM (SELECT X as col1, Y as col2, MAX(Z) as col3
      FROM (SELECT A as X, B as Y, MIN(C) as Z
            FROM myTableC
            GROUP BY A, B
           ) as myTableB
      GROUP BY X, Y
     ) as myTableA
GROUP BY myTableA.col1, myTableA.col2
do this:
SELECT A as X, B as Y, MIN(C) as Z
INTO #tempTableC
FROM myTableC
GROUP BY A, B

IF EXISTS (SELECT *
           FROM #tempTableC
           WHERE X IS NULL ) BEGIN
    SELECT 'A' --- Breakpoint here
END

SELECT X as col1, Y as col2, MAX(Z) as col3
INTO #tempTableB
FROM #tempTableC
GROUP BY X, Y

IF EXISTS (SELECT *
           FROM #tempTableB
           WHERE col1 IS NULL ) BEGIN
    SELECT 'B' --- Breakpoint here
END

SELECT col1, col2, SUM(col3) as col3
FROM #tempTableB as myTableA
GROUP BY col1, col2
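If you would rather see a message in the Messages tab than an extra result set at those breakpoints, a low-severity RAISERROR works as a non-fatal signal; a sketch of the first check:
IF EXISTS (SELECT * FROM #tempTableC WHERE X IS NULL)
    RAISERROR ('NULL values found in #tempTableC.X', 10, 1) WITH NOWAIT;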
Aggregate functions exclude null values by definition, so you can just write
select max(a) from abc
instead of
SELECT max(isNull(a,0)) FROM abc
unless all values of a in abc are null, in which case the second query would return zero instead of null.
If you want to prevent null values being entered, use a not null constraint on the table column.
You can turn off the warning by executing:
set ansi_warnings off
This is explained in the documentation for SET ANSI_WARNINGS. It works, at least on the systems I've tested it on, to remove the warning when aggregating over NULL values.
This setting supposedly also converts numeric overflows and divide-by-zeros to NULLs rather than errors. However, I still get errors for divide by 0 and arithmetic overflows.
As an aside, when using SQL Server Management Studio, one rarely sees this message. When the query is successful, the message is on the "Messages" tab, but SSMS defaults to the "Results" tab, and usually there is no reason to look at the messages (although the warning is there). You only see the warning automatically when there is an error in the query, because SSMS then switches to the Messages tab.
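A quick sketch against the abc table from the question:
set ansi_warnings off;
select max(a) from abc;  -- no "Null value is eliminated..." warning now
set ansi_warnings on;    -- restore the default afterwards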
You'll have to write a second query to pull out the data that you're looking for.
SELECT * FROM abc WHERE a IS NULL
You can put that into an IF statement to write an error message, or log to a table. Other than that, you're out of luck. Sorry. : /
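For example, the logging variant might look like this sketch (the NullAudit table is hypothetical):
IF EXISTS (SELECT * FROM abc WHERE a IS NULL)
BEGIN
    -- NullAudit is a hypothetical logging table you would create yourself
    INSERT INTO NullAudit (TableName, ColumnName, FoundAt)
    VALUES ('abc', 'a', SYSDATETIME());
END;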
Alternatively, you can simply exclude the rows with null values:
SELECT MAX(a) FROM abc WHERE a IS NOT NULL