Check a lot of columns for at least one 'true' - SQL

I have a table with a lot of columns (say 200), all boolean. I want to know which of those columns has at least one record set to true. I have come up with the following query, which works fine:
SELECT SUM(CASE WHEN [column1] = 1 THEN 1 ELSE 0 END) AS column1,
       SUM(CASE WHEN [column2] = 1 THEN 1 ELSE 0 END) AS column2,
       SUM(CASE WHEN [column3] = 1 THEN 1 ELSE 0 END) AS column3
FROM [tablename];
It will return the number of rows that are 'true' for each column. However, this is more information than I need, and therefore possibly a more expensive query than needed: it keeps scanning all fields of all records even when that is not necessary.

I just learned something about CHECKSUM(*) that might be useful. Try the following code:
DECLARE @T TABLE (
    b1 bit
    ,b2 bit
    ,b3 bit
);
DECLARE @T2 TABLE (
    b1 bit
    ,b2 bit
    ,b3 bit
    ,b4 bit
    ,b5 bit
);
INSERT INTO @T VALUES (0,0,0),(1,1,1);
INSERT INTO @T2 VALUES (0,0,0,0,0),(1,1,1,1,1);
SELECT CHECKSUM(*) FROM @T;
SELECT CHECKSUM(*) FROM @T2;
You will see from the results that no matter how many columns are in a row, if they are all bit columns with a value of 0, the result of CHECKSUM(*) is always 0.
This means that you could use WHERE CHECKSUM(*)<>0 in your query to save the engine the trouble of summing rows where all the values are 0. Might improve performance.
And even if it doesn't, it's a neat thing to know.
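For instance, a sketch of that filter applied to the original query (table and column names are placeholders):
SELECT SUM(CASE WHEN [column1] = 1 THEN 1 ELSE 0 END) AS column1,
       SUM(CASE WHEN [column2] = 1 THEN 1 ELSE 0 END) AS column2
FROM [tablename]
WHERE CHECKSUM(*) <> 0; -- rows whose bit columns are all 0 are skipped entirely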
EDIT:
You could do an EXISTS() check on each column. As I understand it, EXISTS() stops scanning as soon as it finds a matching row. If you have more rows than columns, it might be more performant. If you have more columns than rows, then your current query using SUM() on every column is probably the fastest thing you can do.
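A minimal sketch of that per-column check (one CASE per column; names are placeholders):
SELECT CASE WHEN EXISTS (SELECT 1 FROM [tablename] WHERE [column1] = 1)
            THEN 1 ELSE 0 END AS column1_has_true;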

If you just want to know which rows have at least one boolean field set, you will need to test each of them.
Something like this (maybe):
SELECT t.*
FROM [TableName] t
WHERE t.COLUMN_1 = 1
   OR t.COLUMN_2 = 1
   OR t.COLUMN_3 = 1
   OR ...
   OR t.COLUMN_N = 1;

If you actually have 200 boolean columns/fields on one table, then something like the following should work (note that bit columns must be cast to an integer type before they can be added):
SELECT CASE WHEN CAST(column1 AS int) + CAST(column2 AS int) + ... + CAST(column200 AS int) >= 1
            THEN 'Something was true for this record'
            ELSE NULL
       END AS My_Big_Field_Test
FROM [TableName];

I'm not in front of my machine, but you could also try the bitwise OR operator, which yields 1 if any of the bits is 1:
SELECT * FROM [tablename] WHERE (column1 | column2 | column3) = 1
The OR answer from Arthur is the other suggestion I would offer. Try a few different suggestions and look at the query plans. Also take a look at disk reads and CPU usage (SET STATISTICS IO ON and SET STATISTICS TIME ON).
See which method gives the desired results with the best performance... and then let us know :-)
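For example, a simple harness for comparing the candidates (run each candidate query in place of the comment and compare the output in the Messages tab):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- candidate query goes here
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;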

You can use a query of the form (each result column is 1 when that column has at least one record set to true):
SELECT
CASE WHEN EXISTS (SELECT * FROM [Table] WHERE [Column1] = 1) THEN 1 ELSE 0 END AS 'Column1',
CASE WHEN EXISTS (SELECT * FROM [Table] WHERE [Column2] = 1) THEN 1 ELSE 0 END AS 'Column2',
...
The efficiency of this critically depends on how sparse your table is. If there are columns where every single row has a 0 value, then any query that searches for a 1 value will require a full table scan, unless an index is in place. A really good choice for this scenario (millions of rows and hundreds of columns) is a columnstore index. These are supported from SQL Server 2012 onwards; from SQL Server 2014 onwards they don't cause the table to be read-only (which is a major barrier to their adoption).
With a columnstore index in place, each subquery should require constant time, and so should the query as a whole (in fact, with hundreds of columns, this query gets so big that you might run into trouble with the input buffer and need to split it up into smaller queries). Without indexes, this query can still be effective as long as the table isn't sparse -- if it "quickly" runs into a row with a 1 value, it stops.
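A sketch of such an index (the index name is made up, and in practice the column list would include all 200 bit columns):
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Table_AllFlags
ON [Table] ([Column1], [Column2], [Column3]);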

Related

SQL query to find columns having at least one non null value

I am developing a data validation framework where I have a requirement that table fields should have at least one non-null value, i.e. they shouldn't be completely empty with all values null.
For a particular column, I can easily check using
select count(distinct column_name) from table_name;
If it's greater than 0, I can tell that the column is not empty. I already have a list of columns, so I could execute this query in a loop for every column, but that would mean a lot of requests, and it is not the ideal way.
What is the better way of doing this? I am using Microsoft SQL Server.
I would not recommend using count(distinct) because it incurs overhead for removing duplicate values. You can just use count().
You can construct the query for counts using a query like this:
select count(col1) as col1_cnt, count(col2) as col2_cnt, . . .
from t;
If you have a list of columns you can do this as dynamic SQL. Something like this (string_split requires SQL Server 2016+ and string_agg 2017+):
declare @list nvarchar(max) = 'col1,col2';  -- your comma-separated column list
declare @sql nvarchar(max);

select @sql = concat('select ',
                     string_agg(concat('count(', quotename(s.value), ') as cnt_', s.value), ', '),
                     ' from t')
from string_split(@list, ',') s;

exec sp_executesql @sql;
This might not quite work if your columns have special characters in them, but it illustrates the idea.
You should probably use exists, since you don't really need a count of anything.
You don't indicate how you want to consume the results of multiple counts; however, one thing you could do is use concat to return a list of the columns meeting your criteria.
The following sample table has 5 columns, 3 of which have a value on at least 1 row.
create table t (col1 int, col2 int, col3 int, col4 int, col5 int)
insert into t select null,null,null,null,null
insert into t select null,2,null,null,null
insert into t select null,null,null,null,5
insert into t select null,null,null,null,6
insert into t select null,4,null,null,null
insert into t select null,6,7,null,null
You can name the result of each case expression and concatenate them; only the columns that have a non-null value are included, as concat_ws ignores the nulls returned by the case expressions.
select Concat_ws(', ',
case when exists (select * from t where col1 is not null) then 'col1' end,
case when exists (select * from t where col2 is not null) then 'col2' end,
case when exists (select * from t where col3 is not null) then 'col3' end,
case when exists (select * from t where col4 is not null) then 'col4' end,
case when exists (select * from t where col5 is not null) then 'col5' end)
Result:
col2, col3, col5
I asked a similar question about a decade ago. The best way of doing this in my opinion would meet the following criteria.
Combine the requests for multiple columns together so they can all be calculated in a single scan.
If the scan has encountered a non-null value in every column under consideration, allow it to exit early without reading the rest of the table/index, as reading subsequent rows won't change the result.
This is quite a difficult combination to get in practice.
The following might give you the desired behaviour (shown here for two columns, b and c, of YourTable; extend the VALUES clause and the TOP count to match your column list):
SELECT DISTINCT TOP 2 ColumnWithoutNull
FROM YourTable
CROSS APPLY (VALUES(CASE WHEN b IS NOT NULL THEN 'b' END),
(CASE WHEN c IS NOT NULL THEN 'c' END)) V(ColumnWithoutNull)
WHERE ColumnWithoutNull IS NOT NULL
OPTION ( HASH GROUP, MAXDOP 1, FAST 1)
If the plan shows the aggregate implemented as a hash match operator running in "flow distinct" mode, this can work well. A hash match usually reads all of its build input first, meaning no short-circuiting of the scan will happen; in "flow distinct" mode, however, it streams distinct rows as it finds them, and query execution can potentially stop as soon as TOP receives its first two rows, signalling that a NOT NULL value has been found in both columns.
But there is no hint to request that mode for the hash aggregate, so you are dependent on the whims of the optimiser as to whether you get it in practice. The various hints I added to the query above are an attempt to point it in that direction, however.

SQL Server inconsistent results over 2 columns using = and <>

I am trying to replace a manual process with a SQL Server (2012) based automated one. Prior to doing this, I need to analyse the data in question over time to produce some data quality measures/statistics.
Part of this entails comparing the values in two columns. I need to count where they match and where they do not so I can prove my varied stats tally. This should be simple but seems not to be.
Basically, I have a table containing two columns both of which are defined identically as type INT with null values permitted.
SELECT * FROM TABLE
WHERE COLUMN1 is NULL
returns zero rows
SELECT * FROM TABLE
WHERE COLUMN2 is NULL
also returns zero rows.
SELECT COUNT(*) FROM TABLE
returns 3780
and
SELECT * FROM TABLE
returns 3780 rows.
So I have established that there are 3780 rows in my table and that there are no NULL values in the columns I am interested in.
SELECT * FROM TABLE
WHERE COLUMN1=COLUMN2
returns zero rows as expected.
Therefore, in a table of 3780 rows with no NULL values in the columns being compared, I expect the following SQL
SELECT * FROM TABLE
WHERE COLUMN1<>COLUMN2
or in desperation
SELECT * FROM TABLE
WHERE NOT (COLUMN1=COLUMN2)
to return 3780 rows but it doesn't. It returns 3709!
I have tried SELECT * instead of SELECT COUNT(*) in case NULL values in some other columns were having an impact, but this made no difference; I still got 3709 rows.
Also, there are some negative values in 73 rows for COLUMN1 - is this what causes the issue? (But 73 + 3709 = 3782, not 3780, my number of rows.)
What is a better way of proving the values in these numeric columns never match?
Update 09/09/2016: At Lamak's suggestion below, I isolated the 71 missing rows and found that in each one, COLUMN1 was NULL and COLUMN2 was -99. So the issue is NULL values, but why doesn't
SELECT * FROM TABLE WHERE COLUMN1 is NULL
pick them up? Here is the information in Information Schema Views and System Views:
ORDINAL_POSITION  COLUMN_NAME  DATA_TYPE  CHARACTER_MAXIMUM_LENGTH  IS_NULLABLE
1                 ID           int        NULL                      NO
..                ..           ..         ..                        ..
7                 COLUMN1      int        NULL                      YES
8                 COLUMN2      int        NULL                      YES

CONSTRAINT_NAME
PK__TABLE___...

name             type_desc  is_unique  is_primary_key
PK__TABLE___...  CLUSTERED  1          1
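(For reference, the column metadata above can be pulled with a query along these lines; the table name is a placeholder:)
SELECT ORDINAL_POSITION, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TABLE'
ORDER BY ORDINAL_POSITION;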
I suspect the CHARACTER_MAXIMUM_LENGTH of NULL must be the issue?
You can find the counts using the queries below, based on a left join.
--To find COLUMN1=COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.COLUMN1=T2.COLUMN2
WHERE t2.id is not null
--To find COLUMN1<>COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.COLUMN1=T2.COLUMN2
WHERE t2.id is null
Through the exhaustive comment chain above, with all help gratefully received, I suspect this to be a problem with the table creation script's data types for the columns in question. I have no explanation, from an SQL code point of view, as to why "is NULL" intermittently failed to pick up NULL values.
I was able to identify the 71 rows that were not being picked up as expected by using an "except".
i.e. I flipped the SQL that was missing 71 rows, namely:
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
through an except:
SELECT * FROM TABLE
EXCEPT
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
Through that I could see that COLUMN1 was always NULL in the missing 71 rows - even though the "is NULL" was not picking them up for me when I ran
SELECT * FROM TABLE WHERE COLUMN1 IS NULL
which returned zero rows.
Regarding the comparison of values stored in the columns, as my data volumes are low (3780 recs), I am just forcing the issue by using ISNULL and substituting 9999 (a numeric value I know my data will never contain) to make it work.
SELECT * FROM TABLE
WHERE ISNULL(COLUMN1, 9999) <> COLUMN2
I then get the 3780 rows as expected. It's not ideal, but it'll have to do, and it is more or less appropriate, since there are NULL values in there that have to be handled.
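(A NULL-safe alternative that avoids the magic number, assuming only these two columns need comparing:
SELECT * FROM TABLE
WHERE COLUMN1 <> COLUMN2
   OR (COLUMN1 IS NULL AND COLUMN2 IS NOT NULL)
   OR (COLUMN1 IS NOT NULL AND COLUMN2 IS NULL);
Comparisons against NULL evaluate to UNKNOWN rather than TRUE, and WHERE only keeps rows that evaluate to TRUE, which is exactly why the plain <> query silently dropped the 71 rows with a NULL COLUMN1.)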
Also, using Bertrand's tip above, I could view the table creation script, and the columns were definitely set up as INT.

SQL Server "An invalid floating point operation occurred"

This is not really a problem, because I solved it. But I wanted to share something that ate my brain today. So, run this query and check it out:
The query:
select 2 as ID1,0 as ID2 into #TEMP
insert into #TEMP (ID1,ID2)
values (2,1),(2,2),(2,0)
select
case when min(case when ID1=2 and ID2=0 then 0
else 1
end)=0 then 0
else sum(log(ID2))
end
from #TEMP
The fix:
select
case when min(case when ID1=2 and ID2=0 then 0
else 1
end)=0 then 0
else sum(log(case when ID1=2 and ID2<>0 then ID2
else 1
end))
end
from #TEMP
My query was larger and more difficult to debug, but what do you say about the plan that MSSQL builds, and the fact that it gets this query wrong? How can it be modified to work, other than the little fix I showed above? I am guessing that computing the scalars before the query would make things slow if they are not easy to compute, since we would compute them for all values.
SQL Server does not guarantee short-circuit evaluation (i.e. it should not be relied upon). It's a fairly well-known problem.
Ref:
Don’t depend on expression short circuiting in T-SQL (not even with CASE)
Short Circuit
Short-Circuit Evaluation
Is the SQL WHERE clause short-circuit evaluated?
EDIT: I misunderstood the question with my original answer.
I suppose you could add a where clause, as shown below. Side note: this query might benefit from an index on (ID1, ID2).
select
sum(log(convert(float, ID2)))
from #TEMP
where
-- exclude all records with ID1 = 2 if there are any records with ID1 = 2 and ID2 = 0
not exists (
select 1
from #TEMP NoZeros
where
NoZeros.ID1 = #TEMP.ID1
and NoZeros.ID2 = 0
)
Update: Just in case performance is a concern, I got fairly comparable performance from this query after adding the following indexes:
create index ix_TEMP_ID1_ID2 on #TEMP (ID1, ID2)
create index ix_TEMP_ID2 on #TEMP (ID2) include (ID1)
Original Answer
How about modifying your sum function as shown below?
sum(case ID2 when 0 then null else log(convert(float, ID2)) end)
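In context, the whole query might then look like this (the same fix folded into the original CASE):
select
    case when min(case when ID1 = 2 and ID2 = 0 then 0 else 1 end) = 0 then 0
         else sum(case ID2 when 0 then null else log(convert(float, ID2)) end)
    end
from #TEMP
Since the inner CASE maps zero values to NULL before log runs, and SUM ignores NULLs, no invalid floating point operation can occur regardless of evaluation order.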

T-SQL 2005: Counting All Rows and Rows Meeting Criteria

Here's the scenario:
I have a table with 3 columns: 'KeyColumn', 'SubKeyColumn' and 'BooleanColumn', where the first two form the primary key of the table.
For my query, I'd like to count the number of rows there are for any given value of 'KeyColumn', and I'd also like to know how many of them have a value of true for 'BooleanColumn'. My initial thought was to create a query like this:
SELECT
COUNT(*)
,COUNT(CASE WHEN BooleanColumn = 1 THEN 1 ELSE 0 END)
FROM
MyTable
GROUP BY
KeyColumn
However, the 2nd part does not work (I'm not entirely sure why I thought it would to begin with). Is it possible to do something like this in one query? Or am I going to need to do multiple queries to make this happen?
Change COUNT to SUM in the 2nd part. ;) Or keep COUNT but return NULL instead of 0 from the CASE:
... CASE WHEN BooleanColumn = 1 THEN 1 ELSE NULL END ...
COUNT counts the non-NULL rows.
You could also do SUM(CAST(BooleanColumn AS TINYINT))
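Putting both together, a sketch of the corrected query (with KeyColumn added to the select list so you can tell the groups apart):
SELECT
    KeyColumn
    ,COUNT(*) AS TotalRows
    ,SUM(CASE WHEN BooleanColumn = 1 THEN 1 ELSE 0 END) AS TrueRows
FROM
    MyTable
GROUP BY
    KeyColumn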

Check whether a table contains rows or not sql server 2005

How do you check whether a table contains rows or not in SQL Server 2005?
For what purpose?
Quickest for an IF would be IF EXISTS (SELECT * FROM Table)...
For a result set, SELECT TOP 1 1 FROM Table returns either zero or one rows
For exactly one row with a count (0 or non-zero), SELECT COUNT(*) FROM Table
Also, you can use exists
select case when exists (select 1 from table)
            then 'contains rows'
            else 'doesn''t contain rows'
       end
or to check if there are child rows for a particular record :
select * from Table t1
where exists(
select 1 from ChildTable t2
where t1.id = t2.parentid)
or in a procedure
if exists(select 1 from table)
begin
-- do stuff
end
Like others said, you can use something like this:
IF NOT EXISTS (SELECT 1 FROM Table)
BEGIN
--Do Something
END
ELSE
BEGIN
--Do Another Thing
END
For the best performance, use a specific column name instead of * - for example:
SELECT TOP 1 <columnName>
FROM <tableName>
This is optimal because, instead of returning the whole list of columns, it returns just one, which can save some time. Also, returning just the first row, if there are any, makes it even faster: you get exactly one value as the result if there are any rows, and no value if there are none.
If you use the table in a distributed manner, which is most probably the case, then transporting just one value from the server to the client is much faster.
You should also choose wisely among the columns, picking one that takes as few resources as possible.
Can't you just count the rows using select count(*) from table (or an indexed column instead of * if speed is important)?
If not then maybe this article can point you in the right direction.
Fast:
SELECT CASE
           WHEN EXISTS (SELECT TOP (1) <NotNullColumn> FROM <tableName>)
           THEN 'not empty table'
           ELSE 'empty table'
       END AS info