Count the number of nulls in each column - sql

I've run into a DB that has tables that are excessively wide. (600+ columns) Even asking for the top 100 rows with no parameters takes 4 seconds. I'd like to slim these tables down a bit.
To figure out which columns can be most easily moved to new tables, or removed entirely, I would like to know how many nulls are in each column. This should tell me what information is likely to be least important.
How would I write a query that can find all columns and count the nulls inside those columns?
Edit: The DB is SQL Server 2008. I'm really hoping not to type each of the columns individually. It looks like sys.columns could help with this?
Edit 2: The columns are all different types.

Try this
declare @Table_Name nvarchar(max), @Columns nvarchar(max), @stmt nvarchar(max)
declare table_cursor cursor local fast_forward for
    select
        s.name,
        stuff(
            (
                select
                    ', count(case when ' + name +
                    ' is null then 1 else null end) as count_' + name
                from sys.columns as c
                where c.object_id = s.object_id
                for xml path(''), type
            ).value('data(.)', 'nvarchar(max)')
        , 1, 2, '')
    from sys.tables as s
open table_cursor
fetch table_cursor into @Table_Name, @Columns
while @@FETCH_STATUS = 0
begin
    select @stmt = 'select ''' + @Table_Name + ''' as Table_Name, ' + @Columns + ' from ' + @Table_Name
    exec sp_executesql
        @stmt = @stmt
    fetch table_cursor into @Table_Name, @Columns
end
close table_cursor
deallocate table_cursor

select count(case when Column1 is null then 1 end) as Column1NullCount,
count(case when Column2 is null then 1 end) as Column2NullCount,
count(case when Column3 is null then 1 end) as Column3NullCount,
...
from MyTable

Related

How to get metadata from columns that have specific number of distinct values?

I need to find all columns that have 5 or more distinct values. Now my query is like:
SELECT TABLE_NAME,COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'MY_SCHEMA'
AND TABLE_NAME IN ('TABLE_1', 'TABLE_2', 'TABLE_3')
I thought it could be done like simple subquery. Something like:
*code above*
AND (select count(distinct COLUMN_NAME) FROM TABLE_SCHEMA + TABLE_NAME) > 5
I just recently started to learn SQL and thought this kind of thing is easy, but still I can't figure out right query.
With the help of Stu's answer and this answer, I was able to put together a workable solution.
declare @RowsToProcess int
declare @CurrentRow int
declare @SelectCol nvarchar(max)
declare @SelectTable nvarchar(max)
declare @tablesAndColumns table(RowID int not null primary key identity(1,1), table_name nvarchar(max), column_name nvarchar(max))
insert into @tablesAndColumns (table_name, column_name)
select TABLE_NAME, COLUMN_NAME
from INFORMATION_SCHEMA.COLUMNS
where TABLE_SCHEMA = 'my_schema'
and TABLE_NAME in ('myTable', 'myTable2', 'myTable3')
set @RowsToProcess=@@ROWCOUNT
set @CurrentRow=0
while @CurrentRow<@RowsToProcess
begin
    set @CurrentRow=@CurrentRow+1
    select
        @SelectCol=column_name,
        @SelectTable=table_name
    from @tablesAndColumns
    where RowID=@CurrentRow
    declare @QRY NVARCHAR(MAX)
    set @QRY = ' insert into [my_schema].[result_table] (table_name,column_name,distinct_values)
    SELECT ' + '''' +@SelectTable+ '''' + ', ' + '''' +@SelectCol+ '''' + ', count(*) as cnt
    FROM (SELECT DISTINCT ' +@SelectCol+ ' FROM my_schema.'+ @SelectTable+') as a'
    exec SP_EXECUTESQL @QRY
end
I'd like to propose another way. You can run through all the column and table names by using a CURSOR. That way you don't need to store them beforehand and can access them directly in the loop, with the fetch status serving as the WHILE condition.
Also, I went with sys.tables and sys.columns, since I noticed that INFORMATION_SCHEMA also contains views, and sys.tables can be filtered by the table's type.
I added a "HAVING COUNT(*) >= 5" to the dynamic SQL so that columns below the threshold are never saved in the first place, rather than filtered out later.
Finally, I went with "(NOLOCK)" because the tables are only read, and that way the reads take no shared locks that could block other users or interactions. Be aware that NOLOCK can also return uncommitted data.
(@i and @max are just for tracking progress, since I ran the query on ~10k columns and wanted to see how far along it was.)
Hopefully this is helpful as well, although you seem to have solved your problem already.
DECLARE @columnName nvarchar(100),
        @tableName nvarchar(100),
        @sql nvarchar(MAX),
        @i int = 0,
        @max int = (SELECT COUNT(*)
                    FROM sys.tables T
                    INNER JOIN sys.columns C ON T.object_id = C.object_id
                    WHERE T.[type] = 'U')
DROP TABLE IF EXISTS #resultTable
CREATE TABLE #resultTable (ColumnName nvarchar(100), TableName nvarchar(100), ResultCount int)
DECLARE db_cursor CURSOR FOR
    SELECT C.[name], T.[name]
    FROM sys.tables T
    INNER JOIN sys.columns C ON T.object_id = C.object_id
    WHERE T.[type] = 'U'
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @columnName, @tableName
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = CONCAT(' INSERT INTO #resultTable (ColumnName, TableName, ResultCount)
    SELECT ''', @columnName, ''', ''', @tableName, ''', COUNT(*)
    FROM (
        SELECT DISTINCT [', @columnName, ']
        FROM [', @tableName, '] (NOLOCK)
        WHERE [', @columnName, '] IS NOT NULL
    ) t
    HAVING COUNT(*) >= 5')
    EXEC sp_executesql @sql
    SET @i = @i + 1
    PRINT CONCAT(@i, ' / ', @max)
    FETCH NEXT FROM db_cursor INTO @columnName, @tableName
END
CLOSE db_cursor
DEALLOCATE db_cursor
SELECT *
FROM #resultTable

ERROR IN DYNAMIC SQL: Conversion failed when converting the varchar value ', ' to data type int

My goal is to produce a table containing the table name, the column name, the number of rows, and the number of NULL values in each column. But I get an error:
Conversion failed when converting the varchar value ', ' to data type int
These are my queries:
DECLARE @BANG TABLE
(
    TABLE_NAME NVARCHAR(MAX),
    COLUMN_NAME NVARCHAR(MAX),
    ID INT IDENTITY(1, 1)
)
INSERT INTO @BANG (TABLE_NAME, COLUMN_NAME)
SELECT A.NAME AS TABLE_NAME, B.NAME AS COLUMN_NAME
FROM SYS.TABLES AS A
LEFT JOIN SYS.COLUMNS AS B ON A.OBJECT_ID = B.OBJECT_ID
WHERE 1=1
AND A.NAME IN ('CTHD', 'HOADON', 'SANPHAM', 'KHACHHANG', 'NHANVIEN')
DECLARE @RESULT TABLE
(
    TABLE_NAME NVARCHAR(MAX),
    COLUMN_NAME NVARCHAR(MAX),
    TOTAL_ROW INT,
    TOTAL_NULL INT
)
DECLARE @ID INT = 0
WHILE @ID <= (SELECT COUNT(*) FROM @BANG)
BEGIN
    DECLARE @TABLE_NAME NVARCHAR(MAX)
    SET @TABLE_NAME = (SELECT TABLE_NAME FROM @BANG WHERE @ID = ID)
    DECLARE @COLUMN_NAME NVARCHAR(MAX)
    SET @COLUMN_NAME = (SELECT COLUMN_NAME FROM @BANG WHERE ID = @ID)
    DECLARE @TOTAL_ROW INT
    DECLARE @TOTAL_NULL INT
    DECLARE @SQL NVARCHAR(MAX)
    SET @SQL = 'SET @TOTAL_ROW = (SELECT COUNT(*) FROM '+@TABLE_NAME+')
    SET @TOTAL_NULL = (SELECT COUNT(*) FROM '+@TABLE_NAME+' WHERE '+@COLUMN_NAME+' IS NULL)
    INSERT INTO @RESULT
    VALUES ('+@TABLE_NAME+', '+@COLUMN_NAME+', '+@TOTAL_ROW+', '+@TOTAL_NULL+')
    '
    SET @ID += 1
    EXEC (@SQL)
END
I need your help. Thanks in advance
You should be using parameterized SQL. But honestly, the code is such a mess that I'm not going to attempt that fix.
The problem is that variables such as @TOTAL_ROW are integers, not strings. So the + is treated as addition rather than string concatenation, and SQL Server tries to convert the string operands to int.
The simplest immediate fix is to use CONCAT():
SET @SQL = CONCAT('
INSERT INTO @RESULT
VALUES (''', @TABLE_NAME, ''', ''', @COLUMN_NAME, ''', ', @TOTAL_ROW, ', ', @TOTAL_NULL, ')');
You may have the same error elsewhere in the code. You need to fix all places where you have a number and string combined with + and you intend string concatenation rather than addition.
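To make the difference concrete, here is a minimal sketch of the failure and two ways around it. The variable and the literal are made up for illustration and are not taken from the question's script.

```sql
-- Illustration only: @total and the 'Total: ' literal are hypothetical.
DECLARE @total int = 5;

BEGIN TRY
    -- int has higher precedence than varchar, so SQL Server tries to
    -- convert 'Total: ' to int, which fails at run time
    SELECT 'Total: ' + @total AS result;
END TRY
BEGIN CATCH
    SELECT ERROR_MESSAGE() AS error_text;
END CATCH;

-- An explicit conversion keeps + as string concatenation (any version)
SELECT 'Total: ' + CAST(@total AS varchar(11)) AS result;

-- CONCAT (SQL Server 2012 and later) converts its arguments implicitly
SELECT CONCAT('Total: ', @total) AS result;
```

The first SELECT raises the same kind of conversion error reported in the question; the other two return the concatenated string.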
However, the real fix is to not munge query strings with such values. Instead use sp_executesql passing the values in as parameters.
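As a rough sketch of what that could look like for a single column: the counts come back through OUTPUT parameters, so no number is ever concatenated into the query text. The object names dbo.MyTable and Column1 are placeholders, not names from the question.

```sql
-- Sketch only: dbo.MyTable and Column1 are placeholder names.
DECLARE @schema sysname = N'dbo',
        @table  sysname = N'MyTable',
        @column sysname = N'Column1',
        @sql    nvarchar(max),
        @total_row  int,
        @total_null int;

-- Identifiers cannot be passed as parameters, so they are spliced in
-- with QUOTENAME; the counts are returned through OUTPUT parameters.
SET @sql = N'SELECT @rows = COUNT(*), '
         + N'@nulls = COUNT(*) - COUNT(' + QUOTENAME(@column) + N') '
         + N'FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N';';

EXEC sp_executesql
     @sql,
     N'@rows int OUTPUT, @nulls int OUTPUT',
     @rows  = @total_row  OUTPUT,
     @nulls = @total_null OUTPUT;

SELECT @table AS TABLE_NAME, @column AS COLUMN_NAME,
       @total_row AS TOTAL_ROW, @total_null AS TOTAL_NULL;
```

Note that COUNT(column) is not allowed on text, ntext, or image columns, so this sketch assumes the column has some other type.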
The conversion error is during the generation of the dynamic SQL query, not during execution of the statement.
There are a number of issues with the script in your question. Below is a script that uses QUOTENAME to build the SQL statement more securely and uses a parameterized query to execute it. The WHILE pseudo-cursor doesn't provide any value in this case, so this version uses a real cursor.
DECLARE @RESULT TABLE (SCHEMA_NAME sysname, TABLE_NAME sysname, COLUMN_NAME sysname, TOTAL_ROW int, TOTAL_NULL int);
DECLARE @SQL nvarchar(MAX), @SchemaName sysname, @TableName sysname, @ColumnName sysname;
DECLARE BANG CURSOR LOCAL FAST_FORWARD FOR
    SELECT s.name AS SCHEMA_NAME, t.name AS TABLE_NAME, c.name AS COLUMN_NAME
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN sys.columns AS c ON c.object_id = t.object_id
    WHERE t.name IN (N'CTHD', N'HOADON', N'SANPHAM', N'KHACHHANG', N'NHANVIEN');
OPEN BANG;
WHILE 1 = 1
BEGIN
    FETCH NEXT FROM BANG INTO @SchemaName, @TableName, @ColumnName;
    IF @@FETCH_STATUS = -1 BREAK;
    SET @SQL = N'SELECT @SchemaName, @TableName, @ColumnName, COUNT(*), COALESCE(SUM(CASE WHEN ' + QUOTENAME(@ColumnName) + N' IS NULL THEN 1 ELSE 0 END),0)
    FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N';'
    PRINT @SQL
    INSERT INTO @RESULT(SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, TOTAL_ROW, TOTAL_NULL)
    EXEC sp_executesql @SQL
        , N'@SchemaName sysname, @TableName sysname, @ColumnName sysname'
        , @SchemaName = @SchemaName
        , @TableName = @TableName
        , @ColumnName = @ColumnName;
END;
CLOSE BANG;
DEALLOCATE BANG;
SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, TOTAL_ROW, TOTAL_NULL
FROM @RESULT
ORDER BY SCHEMA_NAME, TABLE_NAME, COLUMN_NAME;
GO
If you don't have many tables/columns, you could use a single UNION ALL query and ditch the (pseudo)cursor entirely:
DECLARE @SQL nvarchar(MAX) = (SELECT STRING_AGG(
      N'SELECT ' + QUOTENAME(s.name,'''') + N' AS SCHEMA_NAME,'
    + QUOTENAME(t.name, '''') + N' AS TABLE_NAME,'
    + QUOTENAME(c.name,'''') + N' AS COLUMN_NAME,'
    + 'COUNT(*) AS TOTAL_ROW,'
    + 'COALESCE(SUM(CASE WHEN ' + QUOTENAME(c.name) + ' IS NULL THEN 1 ELSE 0 END),0) AS TOTAL_NULL '
    + 'FROM ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
    , ' UNION ALL ') + N';'
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN sys.columns AS c ON c.object_id = t.object_id
    WHERE t.name IN (N'CTHD', N'HOADON', N'SANPHAM', N'KHACHHANG', N'NHANVIEN'));
EXEC sp_executesql @SQL;
As Gordon said, use sp_executesql properly to execute dynamic SQL. There are also some other issues with this code that are not related to the question.
The most important thing when investigating is to use a PRINT statement before the EXEC statement, so you know what is being executed. Then you'll see where the errors are and why it won't work.
The first iteration of the loop is useless (or maybe it is the one that produces the error). You initialize @ID with the value 0 and then compare it with the identity values in the table @BANG, which start at 1. As a result, @TABLE_NAME and @COLUMN_NAME are set to NULL, and concatenating strings with + rather than CONCAT yields NULL. Nothing is executed on the first iteration.
However, using CONCAT with NULL values does not produce NULL, but an incorrect value (an incorrect query, in your case). For example, this code
SET @SQL = CONCAT('
INSERT INTO @RESULT
VALUES (''', @TABLE_NAME, ''', ''', @COLUMN_NAME, ''', ', @TOTAL_ROW, ', ', @TOTAL_NULL, ')');
will result in something like "INSERT INTO @RESULT VALUES ('CTHD', 'COL1', , )", since both @TOTAL_ROW and @TOTAL_NULL are NULL. You need to use a parameterized query with sp_executesql.
There's no need to execute two counts on the same table, one for the total rows and a second for the NULL values. SELECT COUNT(1) returns the total number of rows, and SELECT COUNT(<column>) returns the number of non-NULL values in that column. So COUNT(1) - COUNT(<column>) gives you the number of NULL values. Here <column> stands for the quoted column name spliced into the dynamic SQL, not the @COLUMN_NAME variable itself. Then use something like this to insert into @RESULT:
INSERT INTO @RESULT (...) SELECT @TABLE_NAME, @COLUMN_NAME, COUNT(1), COUNT(1) - COUNT(<column>) FROM ...
When dealing with SQL Server object names, use the QUOTENAME function. You never know when someone will put a space, quote, bracket, or something else in a schema, table, or column name that might break your query.
Do not use "WHILE @ID <= (SELECT COUNT(*) FROM @BANG)" to test whether there are more rows to process. Use "WHILE EXISTS (SELECT 1 FROM @BANG)" instead, deleting each row from @BANG once it has been processed.
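Two of the points above, the COUNT subtraction and QUOTENAME, can be checked with a small self-contained sketch. The table variable and its three rows are made-up sample data, not part of the original script.

```sql
-- Made-up sample data for illustration
DECLARE @demo TABLE (id int, note varchar(20));
INSERT INTO @demo (id, note) VALUES (1, 'a');
INSERT INTO @demo (id, note) VALUES (2, NULL);
INSERT INTO @demo (id, note) VALUES (3, NULL);

-- COUNT(*) counts rows; COUNT(note) skips rows where note IS NULL
SELECT COUNT(*)               AS total_rows,    -- 3
       COUNT(note)            AS non_null_rows, -- 1
       COUNT(*) - COUNT(note) AS null_rows      -- 2
FROM @demo;

-- QUOTENAME brackets the name and doubles any closing bracket inside it
SELECT QUOTENAME('Order Details') AS with_space,    -- [Order Details]
       QUOTENAME('odd]name')      AS with_bracket;  -- [odd]]name]
```

A single pass over the table yields both the row count and the NULL count, which is why the second COUNT(*) query in the question is unnecessary.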

How can I search multiple fields and count nulls for all?

Is there an easy way to count nulls in all fields in a table without writing 40+ very similar, but slightly different, queries? I would think there is some kind of statistics maintained for all tables, and this may be the easiest way to go with it, but I don't know for sure. Thoughts, anyone? Thanks!!
BTW, I am using SQL Server 2008.
Not sure if you consider this simple or not, but this will total the NULLs by column in a table.
DECLARE @table sysname;
SET @table = 'MyTable'; --replace this with your table name
DECLARE @colname sysname;
DECLARE @sql NVARCHAR(MAX);
DECLARE COLS CURSOR FOR
    SELECT c.name
    FROM sys.tables t
    INNER JOIN sys.columns c ON t.object_id = c.object_id
    WHERE t.name = @table;
SET @sql = 'SELECT ';
OPEN COLS;
FETCH NEXT FROM COLS INTO @colname;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = @sql + 'COUNT(CASE WHEN ' + @colname + ' IS NULL THEN 1 END) AS ' + @colname + '_NULLS,'
    FETCH NEXT FROM COLS INTO @colname;
END;
CLOSE COLS;
DEALLOCATE COLS;
SET @sql = LEFT(@sql,LEN(@sql) - 1) --trim trailing ,
SET @sql = @sql + ' FROM ' + @table;
EXEC sp_executesql @sql;
SELECT COUNT( CASE WHEN field01 IS NULL THEN 1 END) +
COUNT( CASE WHEN field02 IS NULL THEN 1 END) +
...
COUNT( CASE WHEN field40 IS NULL THEN 1 END) as total_nulls
This answer will return a table containing the name of each column of a specified table. (@tab is the name of the table you're trying to count NULLs in.)
You can loop through the column names, count NULLs in each column, and add the result to a total running count.

SQL Server Select data from table without knowing column names

This is sample data table.
I want to select the values, along with their column names, from any row and any column where the value equals 200, but we don't know the column names.
If you know the table name, you can query INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS (SQL Server 2005 or later), or sysobjects and syscolumns (SQL Server 2000), to retrieve the table's columns. After that, you can build a fully qualified query for your needs.
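A minimal sketch of that idea follows, assuming a table named dbo.MyTable (a placeholder, since the question does not name its table) and restricting the search to numeric columns so that the comparison with 200 is valid. It reports which columns contain the value and how often.

```sql
-- Sketch only: dbo.MyTable is a placeholder table name.
DECLARE @schema sysname = N'dbo',
        @table  sysname = N'MyTable',
        @sql    nvarchar(max) = N'';

-- Build one SELECT per numeric column and glue them with UNION ALL
SELECT @sql = @sql
     + CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
     + N'SELECT ' + QUOTENAME(COLUMN_NAME, '''') + N' AS column_name, COUNT(*) AS matches'
     + N' FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
     + N' WHERE ' + QUOTENAME(COLUMN_NAME) + N' = 200'
     + N' HAVING COUNT(*) > 0'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = @schema
  AND TABLE_NAME = @table
  AND DATA_TYPE IN ('tinyint', 'smallint', 'int', 'bigint', 'decimal', 'numeric');

PRINT @sql;               -- inspect the generated statement first
EXEC sp_executesql @sql;  -- an empty string simply does nothing
```

Building the string by assigning to a variable inside a SELECT is a common shortcut rather than a guaranteed behavior, so treat the result as something to inspect with PRINT before relying on it.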
I think the below T-SQL will get you what you want. It was written against AdventureWorks2012LT. In the future, you can get more specific help by including the SQL create statements with your question (so the responder doesn't have to recreate the tables)
(BTW, My example is looking for any field that contains the letter 'S')
DECLARE @column_name nvarchar(200);
DECLARE @statement nvarchar(max);
DECLARE @results TABLE(
    id int,
    colname nvarchar(200),
    value nvarchar(max)
)
DECLARE col_cursor CURSOR FOR
    SELECT C.COLUMN_NAME AS col
    FROM INFORMATION_SCHEMA.COLUMNS C WHERE C.TABLE_NAME LIKE 'Address'
OPEN col_cursor
FETCH NEXT FROM col_cursor INTO @column_name
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @column_name
    SELECT @statement = N'SELECT AddressID, ''' + @column_name + ''' AS ColName, ' + @column_name + ' AS value FROM SalesLT.[Address] WHERE ' + @column_name + ' LIKE ''%S%''';
    INSERT INTO @results
    EXEC(@statement);
    FETCH NEXT FROM col_cursor INTO @column_name
END
CLOSE col_cursor
DEALLOCATE col_cursor
SELECT * FROM @results

Search table columns for values over a certain length

So I have a database with many tables that have a column that contains a GL Account value (for financial purposes). The column name varies by table (i.e. in one table the column is called "gldebitaccount" and in another table it's called "glcreditaccount"). I was able to find all combinations of table / column pairs using the following query:
SELECT c.name AS ColName, t.name AS TableName
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE c.name LIKE '%gl%acc%'
This query returns close to 100 pairs of tables/columns. I am trying to find any value in any of those table/column pairs that exceeds 25 chars in length. For an individual table/column, I'd typically use:
SELECT *
FROM tableName
WHERE LEN(columnName)>25
I want to avoid having to run that query 100 times with each pair. Is there any way I can do a "for each" (which I know is frowned upon in SQL since everything should be set-based). I've done sub-SELECT statements before, but not any that involved change the table in the FROM clause. Any ideas or help would be greatly appreciated!
Thanks in advance!
As in the previous answer, the solution will need dynamic SQL. Here is a way that uses both dynamic SQL and cursors. You can expect slow performance, so use it at your own risk:
DECLARE @TableName NVARCHAR(128), @ColumnName NVARCHAR(128)
DECLARE @Query NVARCHAR(4000)
DECLARE CC CURSOR LOCAL FAST_FORWARD FOR
    SELECT QUOTENAME(t.name), QUOTENAME(c.name)
    FROM sys.columns c
    INNER JOIN sys.tables t
        ON c.object_id = t.object_id
    WHERE c.collation_name IS NOT NULL
    AND c.max_length > 25 AND c.name LIKE '%gl%acc%';
CREATE TABLE #Results(TableName NVARCHAR(128), ColumnName NVARCHAR(128));
OPEN CC
FETCH NEXT FROM CC INTO @TableName, @ColumnName
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Query = 'IF EXISTS(SELECT 1 FROM '+@TableName+'
                    WHERE LEN('+@ColumnName+') > 25)
                    INSERT INTO #Results
                    VALUES(@TableName,@ColumnName)'
    EXEC sp_executesql @Query,
        N'@TableName NVARCHAR(128),@ColumnName NVARCHAR(128)',
        @TableName,
        @ColumnName;
    FETCH NEXT FROM CC INTO @TableName, @ColumnName
END
CLOSE CC
DEALLOCATE CC
SELECT *
FROM #Results
Here's an option without cursors that also doesn't add XML overhead. Note that it also protects you from potential type conflicts (e.g. try the others in a database with hierarchyid columns, like AdventureWorks), from table or column names with apostrophes, and from table names that exist in more than one schema.
DECLARE @sql NVARCHAR(MAX) = N'';
CREATE TABLE #Results
(
    SchemaName NVARCHAR(128), TableName NVARCHAR(128), ColumnName NVARCHAR(128)
);
SELECT @sql += N'INSERT #Results SELECT '''
    + REPLACE(s.name,'''','''''') + ''','''
    + REPLACE(t.name,'''','''''') + ''','''
    + REPLACE(c.name,'''','''''') + '''
    WHERE EXISTS (SELECT 1 FROM ' + QUOTENAME(s.name)
    + '.' + QUOTENAME(t.name) + ' WHERE
    LEN(' + QUOTENAME(c.name) + ') > 25);
    '
FROM sys.columns AS c
INNER JOIN sys.tables AS t
    ON c.[object_id] = t.[object_id]
INNER JOIN sys.schemas AS s
    ON t.[schema_id] = s.[schema_id]
WHERE
(
    c.system_type_id IN (35,99) -- text,ntext
    OR (c.system_type_id IN (167,231) -- varchar,nvarchar, could be max
        AND (c.max_length > 25 OR c.max_length = -1)) -- parenthesized so the type filter applies to both cases
    OR (c.system_type_id IN (175,239) -- char, nchar
        AND c.max_length > 25)
)
AND c.name LIKE N'%gl%acc%';
EXEC sp_executesql @sql;
SELECT SchemaName, TableName, ColumnName FROM #Results;
Yet another solution with dynamic SQL.
But now without cursors. It uses FOR XML statement and should be much faster.
DECLARE @sqlstatement VARCHAR(MAX);
SET @sqlstatement =
    REPLACE (
        STUFF ( (
            SELECT 'UNION ALL SELECT ''' + t.name + ''' as TableName, '''
                + c.name + ''' AS ColumnName, '
                + c.name + ' AS Value FROM '
                + t.name + ' WHERE LEN (' + c.name + ') ' + CHAR(62) + ' 25 '
            FROM sys.columns c
            INNER JOIN sys.tables t ON c.object_id = t.object_id
            WHERE c.name LIKE '%gl%acc%'
            FOR XML PATH('')
        ), 1, 10, '')
    , '&gt;', '>')
EXEC (@sqlstatement)
You may want to add extra filter for columns by their type and max_length:
INNER JOIN sys.types ty ON c.system_type_id = ty.system_type_id
    AND (
        ty.name IN ('text', 'ntext')
        OR (
            ty.name IN ('varchar', 'char', 'nvarchar', 'nchar')
            AND (c.max_length > 25 OR c.max_length = -1)
        )
    )
You will need to create dynamic SQL because you cannot dynamically specify the source table. You could do this using a cursor, or write a SELECT statement that produces a row for each statement you need to run. This shows how to do it with a cursor. Your problem looks like an acceptable use of a cursor:
DECLARE @ColName VARCHAR(MAX);
DECLARE @TableName VARCHAR(MAX);
DECLARE @SomeSQL VARCHAR(MAX);
DECLARE db_cursor CURSOR FOR
    SELECT c.name AS ColName, t.name AS TableName
    FROM sys.columns c
    JOIN sys.tables t ON c.object_id = t.object_id
    WHERE c.name LIKE '%gl%acc%'
OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @ColName, @TableName;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- you need to make dynamic SQL
    SELECT @SomeSQL = 'SELECT * FROM ' + @TableName + ' WHERE LEN(' + @ColName + ') > 25;'
    PRINT(@SomeSQL + CHAR(10));
    -- you could execute it directly if you wish.
    --EXEC (@SomeSQL);
    FETCH NEXT FROM db_cursor INTO @ColName, @TableName;
END
CLOSE db_cursor;
DEALLOCATE db_cursor;
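The other option mentioned in this answer, a SELECT that produces one row per statement to run, might look roughly like the sketch below. The join to sys.schemas and the use of QUOTENAME are additions for safety and are not part of the cursor version above.

```sql
-- Sketch: one generated statement per matching table/column pair.
-- Copy the output rows into a query window, or feed them to EXEC.
SELECT N'SELECT * FROM ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
     + N' WHERE LEN(' + QUOTENAME(c.name) + N') > 25;' AS statement_to_run
FROM sys.columns AS c
JOIN sys.tables  AS t ON c.object_id = t.object_id
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
WHERE c.name LIKE N'%gl%acc%';
```

Nothing is executed here; the query only returns the statement text, which makes it easy to review before running anything.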
I wasn't sure if you needed to do anything with the results, but this will return the records that meet the criteria you posted in your question
Declare @TableName sysname
Declare @ColName sysname
Declare @dynamic_SQL varchar(MAX)
Declare some_cursor CURSOR FOR
    SELECT c.name AS ColName, t.name AS TableName
    FROM sys.columns c
    JOIN sys.tables t ON c.object_id = t.object_id
    WHERE c.name LIKE '%gl%acc%'
OPEN some_cursor
FETCH NEXT FROM some_cursor INTO @ColName, @TableName
WHILE @@FETCH_STATUS = 0
Begin
    select @dynamic_SQL = '
    Select *
    From ' + @TableName + '
    Where LEN('+ @ColName +') > 25
    '
    exec (@dynamic_SQL)
    FETCH NEXT FROM some_cursor INTO @ColName, @TableName
End
CLOSE some_cursor
DEALLOCATE some_cursor