I have more than 100 tables in a SQL database, and each table may have a few hundred rows.
For every table, I want to retrieve only the names of the columns that have at least one non-null entry.
How can I do this?
To return table/column name:
SELECT table_name, column_name
FROM information_schema.columns
That's pretty easy. Here's a query that lists tables with no rows at all (it checks row counts, not individual columns), provided you have permission to read the catalog views:
select t.name as table_name
, s.name as schema_name
, sum(p.rows) as total_rows
from
sys.tables t
join sys.schemas s on (t.schema_id = s.schema_id)
join sys.partitions p on (t.object_id = p.object_id)
where p.index_id in (0,1)
group by t.name, s.name
having sum(p.rows) = 0;
Note: I originally did this in Vertica, and you have to have access to partitions. The query above is written against SQL Server's sys catalog views (schema_id, object_id and index_id are columns there); other databases keep this under information_schema or their own catalog, but the idea is the same.
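For the original question (columns with at least one non-null entry), one possible approach is to let COUNT(column) do the work, since it ignores NULLs. The following is only a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in database; the helper names (quote_ident, non_null_columns, demo) and the sample tables are made up for illustration, and on another DBMS the table and column lists would come from information_schema.columns instead of sqlite_master and PRAGMA table_info.

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier for SQLite by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def non_null_columns(conn):
    """Map each user table to the columns holding at least one non-NULL value."""
    result = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(
            f"PRAGMA table_info({quote_ident(table)})")]
        if not columns:
            continue
        # COUNT(col) ignores NULLs, so one scan per table covers every column.
        counts_sql = ", ".join(f"COUNT({quote_ident(c)})" for c in columns)
        counts = conn.execute(
            f"SELECT {counts_sql} FROM {quote_ident(table)}").fetchone()
        result[table] = [c for c, n in zip(columns, counts) if n > 0]
    return result


def demo():
    """Build a small hypothetical database and run the check on it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (id INTEGER, nickname TEXT, email TEXT);
        INSERT INTO people VALUES (1, NULL, 'a@example.com'), (2, NULL, NULL);
        CREATE TABLE empty_table (x INTEGER);
    """)
    return non_null_columns(conn)


if __name__ == "__main__":
    print(demo())
```

With a few hundred rows per table, one aggregate query per table should be cheap; the same generated SELECT could be run through any other database driver.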
I am testing tables (and the data in them) acquired into our data lake against the source application tables. We do not transform any of the data on acquisition, but we do not always acquire every column of a table, and the acquisition process adds several data lake columns to the table (date acquired, etc.).
So I have to compare two tables where most of the columns are the same but some aren't. Obviously I can deal with this by manually specifying the columns for each SELECT statement. I want to make a testing script that will do this automatically, comparing the common columns and then allowing me to do further queries using that list of columns.
I already test common columns to ensure data type integrity between columns:
SELECT /*fixed*/
b.column_name,
a.data_type AS source_data_type,
b.data_type AS acquired_data_type,
CASE
WHEN a.data_type = b.data_type THEN 'Pass'
ELSE 'Fail'
END AS DATA_TYPE_TEST
FROM
all_tab_cols@&sourcelink a
INNER JOIN all_tab_cols b ON a.column_name = b.column_name
WHERE
a.owner = '&sourceschema'
AND b.owner = 'DATALAKE'
AND a.table_name = '&tableName'
AND b.table_name = '&tableName';
The above works as intended and gets only common columns. How can I save this list of common columns so that when I'm querying the tables directly I can use them in a further query, such as:
SELECT
<my dynamic list of columns here>
FROM
&sourceschema..&tablename@&sourcelink a
INNER JOIN datalake.&tablename b ON a.id = b.id;
Is this possible with Oracle PL/SQL or should I use python instead?
LISTAGG can reduce this to a column list for you:
select listagg(column_name,',') within group ( order by column_id)
from user_tab_columns
where table_name = 'EMP';
LISTAGG(COLUMN_NAME,',')WITHINGROUP(ORDERBYCOLUMN_ID)
--------------------------------------------------------------------------
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
and then you could return a dynamic ref cursor to whatever client you want, i.e.
open my_ref_cur for
'select '||col_list||' from ....';
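If Python turns out to be the more convenient route, the same idea (collect the common columns, then splice them into a generated SELECT) might look roughly like the sketch below. It is only an illustration under assumptions: it uses the standard-library sqlite3 module in place of Oracle, the tables src_emp and lake_emp and the helper names are hypothetical, and against Oracle the column lists would come from all_tab_cols through a driver of your choice.

```python
import sqlite3


def column_names(conn, table):
    """Return the column names of `table` in declaration order."""
    quoted = '"' + table.replace('"', '""') + '"'
    return [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]


def common_columns(conn, source_table, acquired_table):
    """Columns present in both tables, in the source table's column order."""
    acquired = set(column_names(conn, acquired_table))
    return [c for c in column_names(conn, source_table) if c in acquired]


def build_compare_query(source_table, acquired_table, columns, key="id"):
    """Build a SELECT that lines up each common column from both tables."""
    select_list = ", ".join(
        f'a."{c}" AS "src_{c}", b."{c}" AS "acq_{c}"' for c in columns)
    return (f'SELECT {select_list} FROM "{source_table}" a '
            f'INNER JOIN "{acquired_table}" b ON a."{key}" = b."{key}"')


def demo():
    """Hypothetical source and data lake tables with one differing column each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_emp (id INTEGER, ename TEXT, sal REAL, notes TEXT);
        CREATE TABLE lake_emp (id INTEGER, ename TEXT, sal REAL, date_acquired TEXT);
        INSERT INTO src_emp VALUES (1, 'KING', 5000, 'n/a');
        INSERT INTO lake_emp VALUES (1, 'KING', 5000, '2020-01-01');
    """)
    cols = common_columns(conn, "src_emp", "lake_emp")
    query = build_compare_query("src_emp", "lake_emp", cols)
    return cols, conn.execute(query).fetchall()


if __name__ == "__main__":
    print(demo())
```

The column names are taken from the catalog rather than from user input, which is what makes splicing them into the SQL text acceptable here; values should still go through bind parameters.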
Is there a simpler way to find matching column names in multiple tables?
The only way I know how to do this currently is to check each table individually but some tables have a bunch of columns and I know my human eye can miss things.
For SQL Server:
SELECT c.name, string_agg(t.name, ', ')
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
group by c.name
Use information_schema.columns. For instance, to get all column names that appear in more than one table:
select column_name, string_agg( concat(table_schema, '.', table_name), ',')
from information_schema.columns
group by column_name
having count(*) > 1;
The information_schema views are actually standard and available in many databases, including SQL Server.
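The grouping can also be done client-side once the (table, column) pairs have been fetched. Here is a small sketch of that, assuming Python with the standard-library sqlite3 module as a stand-in catalog; the function name shared_column_names and the sample tables are invented for the example, and with another database the pairs would come from information_schema.columns.

```python
import sqlite3
from collections import defaultdict


def shared_column_names(conn):
    """Map each column name used by more than one table to those tables."""
    tables_by_column = defaultdict(list)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # Second field of each PRAGMA table_info row is the column name.
        for row in conn.execute(f"PRAGMA table_info({quoted})"):
            tables_by_column[row[1]].append(table)
    # Same filter as HAVING COUNT(*) > 1 in the SQL version.
    return {col: tabs for col, tabs in sorted(tables_by_column.items())
            if len(tabs) > 1}


def demo():
    """Three hypothetical tables that share some column names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER, name TEXT);
        CREATE TABLE items (id INTEGER, name TEXT, order_id INTEGER);
        CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    """)
    return shared_column_names(conn)


if __name__ == "__main__":
    print(demo())
```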
In Oracle:
Use the DBA_TAB_COLUMNS or ALL_TAB_COLUMNS view; each contains the column names along with the table name and other details.
Vladimir's answer is right and specific to SQL Server.
Another good answer is to use the INFORMATION_SCHEMA views to retrieve what you're looking for. INFORMATION_SCHEMA is a standard followed by some DBMSs to provide a common view of common objects.
https://en.wikipedia.org/wiki/Information_schema
https://learn.microsoft.com/en-us/sql/relational-databases/system-information-schema-views/system-information-schema-views-transact-sql?view=sql-server-ver15
I created the following query:
select
is_tables.table_name
from information_schema.tables is_tables
join pg_tables
on is_tables.table_name=pg_tables.tablename
where
is_tables.table_catalog='<mydatabase>'
and is_tables.table_schema<>'information_schema'
and is_tables.table_schema<>'pg_catalog'
and pg_tables.tableowner='<myuser>';
I assume there is no database vendor independent way of querying this. Is this the easiest/shortest SQL query to achieve what I want in PostgreSQL?
I think you're pretty close. Object owners don't seem to appear in the information_schema views, although I might have overlooked it.
select is_tables.table_schema,
is_tables.table_name
from information_schema.tables is_tables
inner join pg_tables
on is_tables.table_name = pg_tables.tablename
and is_tables.table_schema = pg_tables.schemaname
where is_tables.table_catalog = '<mydatabase>'
and is_tables.table_schema <> 'information_schema'
and is_tables.table_schema <> 'pg_catalog'
and pg_tables.tableowner = '<myuser>';
You need to join on both the table name and the schema name. Table names are unique within a schema; they're not unique within a database.
Strange situation: I am trying to remove some hard coding from my code. There is a situation where I have a field, let's say "CityID", and using this information, I want to find out which table contains a primary key called CityID.
Logically, you'd say that it's probably a table called "City" but it's not... that table is called "Cities". There are some other inconsistencies in database naming hence I can never be sure if removing the string "ID" and finding out the plural will be sufficient.
Note: Once I figure out that CityID refers to a table called Cities, I will perform a join to replace CityID with the city name on the fly. I would appreciate it if someone could also tell me how to find the first varchar field in a table, given the table's name.
SELECT name FROM sysobjects
WHERE id IN ( SELECT id FROM syscolumns WHERE name = 'THE_COLUMN_NAME' )
To get column information from the specified table:
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'myTable'
select table_name from information_schema.columns where column_name='CityID'
You can use the INFORMATION_SCHEMA tables to read metadata about the database.
SELECT
TABLE_NAME
FROM
[db].[INFORMATION_SCHEMA].[COLUMNS]
WHERE
COLUMN_NAME='CityID';
For a primer on what's in the INFORMATION_SCHEMA, see INFORMATION_SCHEMA, a map to your database
The information you seek is all available in the information schema views. Note that you will find many sources telling you how to directly query the underlying system tables that these are views onto - and I must admit that I do the same when it's just to find something out quickly - but the recommended way for applications is to go through these views.
For example, to find your CityID column:
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME = 'CityID'
To find the first varchar field in a table:
SELECT TOP 1 * FROM INFORMATION_SCHEMA.COLUMNS WHERE
TABLE_NAME = 'TableName'
AND DATA_TYPE = 'varchar' -- This is off the top of my head!
ORDER BY ORDINAL_POSITION
As I understand your question, you want to find the tables that have the CityID column in their primary key.
You can use SQL Server system views such as sysindexes and sysindexkeys to query a table's primary key, including composite primary keys formed from more than one column:
SELECT
TBL.name as TableName
FROM sysobjects as PK
INNER JOIN sys.objects as TBL
on TBL.object_id = PK.parent_obj
INNER JOIN sysindexes as IND
on IND.name = PK.name AND
IND.id = TBL.object_id
INNER JOIN SysIndexKeys as KEYS
on KEYS.id = IND.id AND
KEYS.indid = IND.indid
INNER JOIN syscolumns as COL
on COL.id = KEYS.id AND
COL.colid = KEYS.colid
WHERE
PK.xtype = 'PK' AND
COL.name = 'CityID'
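Both lookups from this thread (which table has CityID in its primary key, and which column is the first character field) can be sketched from application code as well. The example below is an illustration only: it assumes Python with the standard-library sqlite3 module rather than SQL Server, the Cities and Offices tables and the function names are hypothetical, and it relies on the pk and type fields that SQLite's PRAGMA table_info reports.

```python
import sqlite3


def _table_info(conn, table):
    """Rows of (cid, name, type, notnull, dflt_value, pk) for `table`."""
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"PRAGMA table_info({quoted})").fetchall()


def tables_with_pk_column(conn, column):
    """Tables whose primary key includes `column` (case-insensitive match)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    matches = []
    for table in tables:
        for _cid, name, _type, _notnull, _default, pk in _table_info(conn, table):
            # pk is the 1-based position within the primary key, 0 otherwise.
            if pk > 0 and name.lower() == column.lower():
                matches.append(table)
    return matches


def first_text_column(conn, table):
    """First column, by position, whose declared type is a character type."""
    for _cid, name, ctype, _notnull, _default, _pk in _table_info(conn, table):
        if "CHAR" in ctype.upper() or "TEXT" in ctype.upper():
            return name
    return None


def demo():
    """Hypothetical schema: CityID is a key in Cities, a plain column in Offices."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Cities (CityID INTEGER PRIMARY KEY, Population INTEGER,
                             CityName VARCHAR(100), Country VARCHAR(50));
        CREATE TABLE Offices (OfficeID INTEGER PRIMARY KEY, CityID INTEGER);
    """)
    return tables_with_pk_column(conn, "CityID"), first_text_column(conn, "Cities")


if __name__ == "__main__":
    print(demo())
```

Filtering on the primary-key flag matters here: a plain name match would also return Offices, which only references CityID.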
[This is on an iSeries/DB2 database if that makes any difference]
I want to write a procedure to identify columns that are left as blank or zero (given a list of tables).
Assuming I can pull out table and column definitions from the central system tables, how should I check the above condition? My first guess is for each column generate a statement dynamically such as:
select count(*) from my_table where my_column != 0
and to check if this returns zero rows, but is there a better/faster/standard way to do this?
NB This just needs to handle simple character, integer/decimal fields, nothing fancy!
To check for columns that contain only NULLs on DB2:
Execute RUNSTATS on your database (http://www.ibm.com/developerworks/data/library/techarticle/dm-0412pay/)
Check the database statistics by querying SYSSTAT.TABLES and SYSSTAT.COLUMNS. Comparing SYSSTAT.TABLES.CARD and SYSSTAT.COLUMNS.NUMNULLS will tell you what you need.
An example could be:
select t.tabschema, t.tabname, c.colname
from sysstat.tables t, sysstat.columns c
where ((t.tabschema = 'MYSCHEMA1' and t.tabname='MYTABLE1') or
(t.tabschema = 'MYSCHEMA2' and t.tabname='MYTABLE2') or
(...)) and
t.tabschema = c.tabschema and t.tabname = c.tabname and
t.card = c.numnulls
More on system stats e.g. here: http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/r0001070.htm and http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/r0001073.htm
Similarly, you can use SYSSTAT.COLUMNS.AVGCOLLEN to check for empty columns (though it doesn't seem to work for LOBs).
EDIT: And, to check for columns that contain only zeros, try comparing HIGH2KEY and LOW2KEY in SYSSTAT.COLUMNS.
Yes, typically, I would do something like this in SQL Server:
SELECT
REPLACE(REPLACE(REPLACE(
'
SELECT COUNT(*) AS [COUNT NON-EMPTY IN {TABLE_NAME}.{COLUMN_NAME}]
FROM [{TABLE_SCHEMA}].[{TABLE_NAME}]
WHERE [{COLUMN_NAME}] IS NOT NULL
AND [{COLUMN_NAME}] <> 0
'
, '{TABLE_SCHEMA}', c.TABLE_SCHEMA)
, '{TABLE_NAME}', c.TABLE_NAME)
, '{COLUMN_NAME}', c.COLUMN_NAME) AS [SQL]
FROM INFORMATION_SCHEMA.COLUMNS c
INNER JOIN INFORMATION_SCHEMA.TABLES t
ON t.TABLE_TYPE = 'BASE TABLE'
AND c.TABLE_CATALOG = t.TABLE_CATALOG
AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
AND c.TABLE_NAME = t.TABLE_NAME
AND c.DATA_TYPE = 'int'
You can get a lot fancier by doing UNIONs of the entire query and checking the IS_NULLABLE on each column and obviously you might have different requirements for different data types, and skipping identity columns, etc.
I'm assuming you mean you want to know whether a given column holds a real value in any of its rows. If your column can have "blanks" (NULLs or empty strings), you will probably need an extra condition in your WHERE clause, such as an IS NOT NULL check, to get the correct answer.
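The generate-one-check-per-column approach discussed in this thread could be driven from a script along the following lines. This is a hedged sketch, not a DB2 implementation: it assumes Python with the standard-library sqlite3 module as a stand-in, the stock table and helper names are made up, and the type test (CHAR/TEXT means character, anything else is treated as numeric) is a simplification that only covers the simple character and integer/decimal fields the question mentions.

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier for SQLite by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def non_empty_predicate(column, declared_type):
    """SQL condition that is true only for a non-blank, non-zero value.

    Both forms evaluate to NULL for NULL input, so NULL rows never count.
    Columns without a character type are assumed to be numeric.
    """
    col = quote_ident(column)
    kind = declared_type.upper()
    if "CHAR" in kind or "TEXT" in kind:
        return f"TRIM({col}) <> ''"
    return f"{col} <> 0"


def blank_or_zero_columns(conn, tables):
    """For each listed table, the columns where no row holds a real value."""
    result = {}
    for table in tables:
        info = conn.execute(
            f"PRAGMA table_info({quote_ident(table)})").fetchall()
        if not info:
            continue
        # One conditional COUNT per column, all evaluated in a single scan.
        counts_sql = ", ".join(
            f"COUNT(CASE WHEN {non_empty_predicate(name, ctype)} THEN 1 END)"
            for _cid, name, ctype, *_rest in info)
        counts = conn.execute(
            f"SELECT {counts_sql} FROM {quote_ident(table)}").fetchone()
        result[table] = [row[1] for row, n in zip(info, counts) if n == 0]
    return result


def demo():
    """Hypothetical table where discount and remark are never really filled in."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stock (sku VARCHAR(10), qty INTEGER,
                            discount DECIMAL(5,2), remark CHAR(20));
        INSERT INTO stock VALUES ('A1', 0, 0, '   '), ('B2', 0, NULL, ''),
                                 ('C3', 3, 0.0, NULL);
    """)
    return blank_or_zero_columns(conn, ["stock"])


if __name__ == "__main__":
    print(demo())
```

Compared with issuing one SELECT COUNT(*) per column, batching the conditional counts into a single statement per table reads each table only once.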