How to find all columns in a database with all NULL values - sql

I have a large Snowflake database with 70+ tables and 3000+ fields. Is there a query I can use across the entire database to find all columns with all NULLs? I have a command I can use to find all the columns
select * from prod_db.information_schema.columns
Is there a way to modify that command to identify which columns are all NULLs? If there is not a way to do it across the entire database. Is there a way to do it across a table? I do not want to type:
select column_name from prod_db.information_schema.table_name
3000+ times. Thanks!

This uses a SQL generator to generate a SQL statement that will locate columns matching two criteria:
The column is in a table with one or more rows
The column has all nulls.
To be highly efficient, rather than checking the each table entirely, it uses a UNION ALL block that looks for a single non-null row in each table. It uses TOP 1 to find a not null row. That way as soon as it finds a not null row, it returns that row and stops scanning that table so it can move to another table scan.
This means that the large UNION ALL section will list tables where it finds a not null row, which is the opposite of what we want. To use this information, a CTE wrapped around the UNION ALL will do an anti-join against the column view in the information schema.
with COLS as
(
select 'select top 1 ''' || C.TABLE_CATALOG || ''' as TABLE_CATALOG, ''' || C.TABLE_SCHEMA ||
''' as TABLE_SCHEMA, ''' || C.TABLE_NAME || ''' as TABLE_NAME, ''' || C.COLUMN_NAME ||
''' as COLUMN_NAME from "' ||
C.TABLE_CATALOG || '"."' || C.TABLE_SCHEMA || '"."' || C.TABLE_NAME || '"' ||
' where "' || C.COLUMN_NAME || '" is not null'
as NULL_CHECK
from INFORMATION_SCHEMA.COLUMNS C
left join INFORMATION_SCHEMA.TABLES T on
C.TABLE_CATALOG = T.TABLE_CATALOG and
C.TABLE_SCHEMA = T.TABLE_SCHEMA and
C.TABLE_NAME = T.TABLE_NAME
where C.IS_NULLABLE = 'YES' and T.TABLE_TYPE = 'BASE TABLE'
and T.ROW_COUNT > 0
), UNIONED as
(
select listagg(NULL_CHECK, '\nunion all\n') as UNIONED from COLS
)
select replace($$
with NON_NULL_COLUMNS as (
!~UNIONED~!
)
select C.TABLE_CATALOG, C.TABLE_SCHEMA, C.TABLE_NAME, C.COLUMN_NAME from INFORMATION_SCHEMA.COLUMNS C
left join NON_NULL_COLUMNS NN
on C.TABLE_CATALOG = NN.TABLE_CATALOG
and C.TABLE_SCHEMA = NN.TABLE_SCHEMA
and C.TABLE_NAME = NN.TABLE_NAME
and C.COLUMN_NAME = NN.COLUMN_NAME
left join INFORMATION_SCHEMA.TABLES T
on C.TABLE_CATALOG = T.TABLE_CATALOG
and C.TABLE_SCHEMA = T.TABLE_SCHEMA
and C.TABLE_NAME = T.TABLE_NAME
where NN.COLUMN_NAME is null and T.ROW_COUNT > 0
;$$, '!~UNIONED~!', UNIONED) as SQL_TO_RUN from UNIONED
;

You can produce a list of SELECT queries for each column as follows
SELECT CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT('SELECT ''', TABLE_NAME), ''', '''), COLUMN_NAME), ''', '), 'COUNT(*) FROM '), TABLE_NAME), ' WHERE '), COLUMN_NAME), ' IS NULL OR '), LEN(COLUMN_NAME)), ' = 0'), ' UNION ')
from information_schema.columns
The result of the above query can then be taken and executed to get the result you need (PS: Remove the UNION on the last row produced from Step 1 before executing)
Hope this helps.

Related

Retrieve tables based on its contents in SQL server

I'd like to retrieve all tables and the associated column values where two of their specific columns (the column names will be passed into) that don't have the exact same content in them.
Here's a more definite break-down of the problem. Suppose, the columns that I need to look into is 'Column_1' and 'Column_2'
First identify from in INFORMATION_SCHEMA which of the tables have both of these columns present in them(possible one sub-query),
And then identify which of these tables don't have exact same content on these 2 columns meaning Column_1 != Column_2.
The following section would retrieve all the tables that has both 'Column_1' and 'Column_2'.
SELECT
TABLE_NAME
FROM
INFORMATION_SCHEMA.TABLES T
WHERE
T.TABLE_CATALOG = 'myDB' AND
T.TABLE_TYPE = 'BASE TABLE'
AND EXISTS (
SELECT T.TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS C
WHERE
C.TABLE_CATALOG = T.TABLE_CATALOG AND
C.TABLE_SCHEMA = T.TABLE_SCHEMA AND
C.TABLE_NAME = T.TABLE_NAME AND
C.COLUMN_NAME = 'Column_1')
AND EXISTS
(
SELECT T.TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS C
WHERE
C.TABLE_CATALOG = T.TABLE_CATALOG AND
C.TABLE_SCHEMA = T.TABLE_SCHEMA AND
C.TABLE_NAME = T.TABLE_NAME AND
C.COLUMN_NAME = 'Column_2')
As the next step, I tried to use this as a sub-query and have the following at the end but that doesn't work and sql-server returns 'Cannot call methods on sysname'. What would the next step on this? This problem assumes all columns has the exact same Data-type.
WHERE SUBQUERY.TABLE_NAME.Column_1 != SUBQUERY.TABLE_NAME.Column_2
This is what's expected :
Table_Name
Column_Name1
Column_Value_1
Column_Name2
Column_Value_2
Table_A
Column_1
abcd
Column_2
abcde
Table_A
Column_1
qwerty
Column_2
qwert
Table_A
Column_1
abcde
Column_2
eabcde
Table_B
Column_1
zxcv
Column_2
zxcde
Table_C
Column_1
asdfgh
Column_2
asdfghy
Table_C
Column_1
aaaa
Column_2
bbbb
If in fact you want to actually compare values (not length) between two columns in tables that contain those two columns, you will need to generate dynamic SQL and then execute it. This could be done semi-automatically with the following:
DECLARE #SqlTemplate VARCHAR(MAX) =
'UNION ALL'
+ ' SELECT Table_Name = <TNAME>'
+ ', Column_Name1 = <C1NAME>, Column_Value_1 = <C1>'
+ ', Column_Name2 = <C2NAME>, Column_Value_2 = <C2>'
+ ' FROM <T>'
+ ' WHERE ISNULL(<C1>, '(null)') <> ISNULL(<C2>, '(null)')'
SELECT T.TABLE_NAME
, REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
#SqlTemplate
, '<TNAME>', QUOTENAME(T.TABLE_SCHEMA + '.' + T.TABLE_NAME, ''''))
, '<C1NAME>', QUOTENAME(C1.COLUMN_NAME, ''''))
, '<C2NAME>', QUOTENAME(C2.COLUMN_NAME, ''''))
, '<T>', QUOTENAME(T.TABLE_SCHEMA) + '.' + QUOTENAME(T.TABLE_NAME))
, '<C1>', QUOTENAME(C1.COLUMN_NAME))
, '<C2>', QUOTENAME(C2.COLUMN_NAME))
FROM INFORMATION_SCHEMA.TABLES T
JOIN INFORMATION_SCHEMA.COLUMNS C1
ON C1.TABLE_CATALOG = T.TABLE_CATALOG
AND C1.TABLE_SCHEMA = T.TABLE_SCHEMA
AND C1.TABLE_NAME = T.TABLE_NAME
AND C1.COLUMN_NAME = 'Column_1'
JOIN INFORMATION_SCHEMA.COLUMNS C2
ON C2.TABLE_CATALOG = T.TABLE_CATALOG
AND C2.TABLE_SCHEMA = T.TABLE_SCHEMA
AND C2.TABLE_NAME = T.TABLE_NAME
AND C2.COLUMN_NAME = 'Column_2'
WHERE T.TABLE_CATALOG = 'myDB'
AND T.TABLE_TYPE = 'BASE TABLE'
This would generate sql for each qualifying table of the form:
UNION ALL SELECT Table_Name = 'dbo.Z', Column_Name1 = 'X', Column_Value_1 = [X], Column_Name2 = 'Y', Column_Value_2 = [Y] FROM [dbo].[Z] WHERE ISNULL([X], '(null)') <> ISNULL([Y], '(null)')
After running the above, you would then cut & paste the generated SQL into another query window, remove the initial 'UNION ALL', and then execute the remaining SQL to get the final results.
There are ways of combining all the SQL into a single string and executing it automatically, but your problem sounds like a one-off process that doesn't warrant the extra complexity.
I believe you need to compare the CHARACTER_MAXIMUM_LENGTH or CHARACTER_OCTET_LENGTH metadata values in the INFORMATION_SCHEMA.COLUMNS table instead of using LEN(). This can be done using something like:
SELECT T.TABLE_NAME
, C1.COLUMN_NAME, C1.DATA_TYPE, C1.CHARACTER_MAXIMUM_LENGTH
, C2.COLUMN_NAME, C2.DATA_TYPE, C2.CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.TABLES T
JOIN INFORMATION_SCHEMA.COLUMNS C1
ON C1.TABLE_CATALOG = T.TABLE_CATALOG
AND C1.TABLE_SCHEMA = T.TABLE_SCHEMA
AND C1.TABLE_NAME = T.TABLE_NAME
AND C1.COLUMN_NAME = 'Column_1'
JOIN INFORMATION_SCHEMA.COLUMNS C2
ON C2.TABLE_CATALOG = T.TABLE_CATALOG
AND C2.TABLE_SCHEMA = T.TABLE_SCHEMA
AND C2.TABLE_NAME = T.TABLE_NAME
AND C2.COLUMN_NAME = 'Column_2'
WHERE T.TABLE_CATALOG = 'myDB'
AND T.TABLE_TYPE = 'BASE TABLE'
AND C1.CHARACTER_MAXIMUM_LENGTH <> C2.CHARACTER_MAXIMUM_LENGTH
The inner joins both limit results to tables having both columns and retrieve the column metadata. The length compare at the end checks for a mismatch.
This assumes character types. You might also want to check DATA_TYPE consistency ("char" vs "varchar" vs "nvarchar") or some of the other precision and scale values for other non-character data types.
To query the data within the columns you need dynamic SQL. I would advise you not to use INFORMATION_SCHEMA (which is for compatibility only) and instead use sys.tables etc. You don't need to check sys.columns twice, you can use aggregation in the EXISTS subquery to check for multiple columns.
To compare the columns, you can do Column_1 <> Column_2, but that will not deal with nulls correctly. If the columns can be nullable then you should instead use the syntax shown in the code below: NOT EXISTS (SELECT Column_1 INTERSECT SELECT Column_2)
DECLARE #sql nvarchar(max);
SELECT
STRING_AGG(CAST('
SELECT
Table_Name = ' + QUOTENAME(t.name, '''') + ',
Column_1,
Column_2
FROM ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + '
WHERE NOT EXISTS (SELECT Column_1 INTERSECT SELECT Column_2)
' AS nvarchar(max)), '
UNION ALL
' )
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
AND s.name = 'myDB'
WHERE EXISTS (SELECT 1
FROM sys.columns c
WHERE c.object_id = t.object_id
AND c.name IN ('Column_1', 'Column_2')
HAVING COUNT(*) = 2
AND COUNT(DISTINCT c.system_type_id) = 1 -- all same type
);
PRINT #sql; -- your friend
EXEC sp_executesql #sql;

A script to find cells with certain values in Redshift

I have this query:
select t.table_schema,
t.table_name,
c.column_name
from information_schema.tables t
inner join information_schema.columns c
on c.table_name = t.table_name
and c.table_schema = t.table_schema
where c.column_name like '%column_name%'
and t.table_schema not in ('information_schema', 'pg_catalog')
and t.table_type = 'BASE TABLE'
order by t.table_schema;
Is there somewhere I can actually search for a specific value and see under which column & table & schema it falls under?
For example,
I would like to search for a value 'WINNER' and find out which columns contain this value (and obviously the table and schema as well)
and the column might be STATUS with value WINNER and under table CUSTOMER and schema ALL_DATA
Can anyone help please?
There is no straight forward way to do this, there is no build-in functionality in any DBMS as far as I know. One way how to do this would be to create SQL which selects all text-like columns and generates another SQL. There is an example:
select 'select '''||t.table_schema||''' as table_schema, '''||
t.table_name||''' as table_name, '''||
c.column_name||''' as column_name,'||
' count(*) as occurrences'
' from '||t.table_schema||'.'||t.table_name||
' where '||c.column_name||' like ''WINNER'''
||' union all '
from information_schema.tables t
inner join information_schema.columns c
on c.table_name = t.table_name
and c.table_schema = t.table_schema
where c.column_name like '%column_name%'
and t.table_schema not in ('information_schema', 'pg_catalog')
and t.table_type = 'BASE TABLE'
and c.data_type = 'character varying'
order by t.table_schema;
According to limitations set in query it will generate as much rows as columns you want to search. Copy this result block in your client (delete 'union all' from last row)
and execute. Try to limit rows as much as possible for better performance. Due to columnar data store Redshift will execute this quite effectively, keep in mind that on row-oriented DBMS performance for such approach will be much worse.

How to quickly see what columns in a table have data?

We are currently undertaking a testing phase which requires us to see if there is any data in each column for each table. Now, the route that is long and labour-intensive is:
SELECT COUNT(Col1), COUNT(Col2)...FROM TABLE
Is there any easier way to do this? We can go down this route by concatenating each column name from our data lineage document with the COUNT() function, but we have a lot of tables and a lot of columns in each table, making this a bit unfeasible.
Essentially we just need a count of records in each column for each table, without having to write long COUNT(Col) queries.
Thanks
This query will return accurate results if the table statistics were recently gathered with the default value for ESTIMATE_PERCENT:
SELECT utab.table_name
, tcol.column_name
, utab.num_rows
from user_tables utab,
user_tab_cols tcol
where utab.table_name = tcol.table_name
and utab.num_rows > 0
and utab.num_rows = tcol.num_nulls;
You could use a dynamic query to build the queries. This will generate all the queries.
SELECT 'SELECT COUNT(' || t.column_name || ' ) FROM ' || t.owner || '.' || t.table_name || ';' FROM dba_tab_columns t
You can generate all the select statements like so:
SELECT CASE WHEN column_id = 1 AND column_id_desc != 1 THEN 'SELECT ''' || LOWER(owner) || '.' || LOWER(table_name) || ''' table_name, ' || CHR(10) || 'COUNT(' || LOWER(column_name) || ') ' || SUBSTR(LOWER(column_name), 1, 26) || '_cnt,'
WHEN column_id = 1 AND column_id_desc = 1 THEN 'SELECT ''' || LOWER(owner) || '.' || LOWER(table_name) || ''' table_name, ' || CHR(10) || 'COUNT(' || LOWER(column_name) || ') ' || SUBSTR(LOWER(column_name), 1, 26) || '_cnt FROM ' || LOWER(owner) || '.' || LOWER(table_name) || ';'
WHEN column_id_desc = 1 THEN ' COUNT(' || LOWER(column_name) || ') ' || SUBSTR(LOWER(column_name), 1, 26) || '_cnt' || CHR(10) || 'FROM ' || LOWER(owner) || '.' || LOWER(table_name) || ';'
ELSE ' COUNT(' || LOWER(column_name) || ') ' || SUBSTR(LOWER(column_name), 1, 26) || '_cnt,'
END sql_text
FROM (SELECT owner,
table_name,
column_name,
column_id,
row_number() OVER (PARTITION BY owner, table_name ORDER BY column_id DESC) column_id_desc
FROM all_tab_columns)
WHERE <predicates to filter on the tables you're interested in>
ORDER BY owner,
table_name,
column_id;
This goes through all the tables you're interested in plus their columns and outputs text that will, when taken together, form a select statement for each table.
The text that is output in the sql_text column depends on whether the column in the list is the first or last (or both!); this way you get the full statement which queries each table once, rather than one per table and column.
You can then copy and paste the results and run that as a script.
It's can help you
SELECT
a.table_name,
a.column_name
FROM
ALL_TAB_COLUMNS a
WHERE owner = '<your user>'
AND a.SAMPLE_SIZE = a.NUM_NULLS

Get count of rows from multiple tables Redshift SQL?

I have a redshift database that is being updated with new tables so I can't just manually list the tables I want. I want to get a count of the rows of all the tables from my query. So far I have:
select 'SELECT ''' || table_name || ''' as table_name, count(*) As con ' ||
'FROM ' || table_name ||
CASE WHEN lead(table_name) OVER (order by table_name ) IS NOT NULL
THEN ' UNION ALL ' END
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%results%'
but when I do this I get the error:
Specified types or functions (one per INFO message) not supported on Redshift tables.
I've searched a lot but I can't seem to find a solution for my problem. Any help would be greatly appreciated. Thanks!
EDIT:
I've changed my approach to this and decided to use a for loop in R to get the row counts of each but I'm running into the issue that 'row_counts' is only saving one number, not the count of each row like I want. Here is the code:
schema <- "x"
table_prefix <- "results"
geos <- ad_districts %>% filter(geo != "geo")
row_count <- list()
i = 1
for (geo in geos){
table_name <- paste0(schema, ".", table_prefix, geo)
row_count[[i]] <- dbGetQuery(con,
paste("SELECT COUNT(*) FROM", table_name))
i = i + 1
}
Your query is doing a select * for all tables, this will take a lot of time and resources. Instead use a system table to get the same info
select name, sum(rows) as rows
from stv_tbl_perm
where name like '%results%'
group by 1
[EDIT] - I think this is the root cause - some sql functions are only supported on the leader node. Try connecting to that node and re-run your SQL.
https://docs.aws.amazon.com/redshift/latest/dg/c_sql-functions-leader-node.html
Hope this helps.
select 'select count(*) as "' || table_schema || '.' || table_name || '" from ' || table_schema || '.' || table_name || ' ;' as sql_text
from information_schema.tables
;
[EDIT - refined this a bit to generate a series of statements that can be run at once]
select rownum, case when rownum > 1 then sql_text else replace(sql_text, 'union all', '') end as sql_text
from
(
select rank() over (order by sql_text DESC) as rownum,
sql_text
from
(
select 'select ''' || table_schema || ' ' || table_name || ''' , count(*) as "' || table_schema || '.' || table_name || '" from ' || table_schema || '.' || table_name || ' union all ' as sql_text
from information_schema.tables
where table_schema = 'public'
order by table_schema, table_name
)X
)Y
order by rownum desc ;
SELECT ' Select count(*) , '''+ tablename + ''' from '+'"' + tablename +'"' +' Union ALL '
FROM pg_table_def
GROUP BY tablename
Above query eliminates any table name with space. Remove UNION ALL at the end of the query and query will be ready to be executed.

POSTGRESQL: Create table as Selecting specific type of columns

I have a table with mixed types of data (real, integrer, character ...) but i would only recover columns that have real values.
I can construct this:
SELECT 'SELECT ' || array_to_string(ARRAY(
select 'o' || '.' || c.column_name
from information_schema.columns as c
where table_name = 'final_datas'
and c.data_type = 'real'), ',') || ' FROM final_datas as o' As sqlstmt
that gives that:
"SELECT o.random,o.struct2d_pred2_num,o.pfam_num,o.transmb_num [...] FROM final_datas as o"
The i would like to create a table with these columns. Of course, do this, doesn't work:
create table table2 as (
SELECT 'SELECT ' || array_to_string(ARRAY(
select 'o' || '.' || c.column_name
from information_schema.columns as c
where table_name = 'final_datas'
and c.data_type = 'real'), ',') || ' FROM final_datas as o' As sqlstmt
)
Suggestions?
You need to generate the whole CREATE TABLE statement as dynamic SQL:
SELECT 'CREATE TABLE table2 AS SELECT ' || array_to_string(ARRAY(
select 'o' || '.' || c.column_name
from information_schema.columns as c
where table_name = 'final_datas'
and c.data_type = 'real'), ',') || ' FROM final_datas as o' As sqlstmt
The result can be run with EXECUTE sqlstmt;