How to find out whether a table has some unique columns - sql

I use MS SQL Server.
I've been handed some large tables with no constraints on them: no keys, no nothing.
I know some of the columns have unique values. Is there a smart way, for a given table, to find the columns that have unique values?
Right now I do it manually for each column by checking whether there are as many DISTINCT values as there are rows in the table.
SELECT COUNT(DISTINCT col) FROM table
I could probably write a cursor to loop over all the columns, but I want to hear if someone knows a smarter way or a built-in function.
Thanks.

Here's an approach that is basically similar to @JNK's, but instead of printing the counts it returns a ready answer for every column, telling you whether the column consists of unique values only:
DECLARE @table varchar(100), @sql varchar(max);
SET @table = 'some table name';
-- Build one CASE expression per column; QUOTENAME guards names with spaces or reserved words
SELECT
    @sql = COALESCE(@sql + ', ', '') + ColumnExpression
FROM (
    SELECT
        ColumnExpression =
            'CASE COUNT(DISTINCT ' + QUOTENAME(COLUMN_NAME) + ') ' +
            'WHEN COUNT(*) THEN ''UNIQUE'' ' +
            'ELSE '''' ' +
            'END AS ' + QUOTENAME(COLUMN_NAME)
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @table
) s;
SET @sql = 'SELECT ' + @sql + ' FROM ' + QUOTENAME(@table);
PRINT @sql; /* in case you want to have a look at the resulting query */
EXEC(@sql);
It simply compares COUNT(DISTINCT column) with COUNT(*) for every column. The result will be a table with a single row, where every column will contain the value UNIQUE for those columns that do not have duplicates, and empty string if duplicates are present.
But the above solution works correctly only for columns that contain no NULLs. Note that SQL Server does not ignore NULLs when you create a unique constraint/index on a column: NULL counts as a value, so at most one is allowed. If a column contains just one NULL and all other values are unique, you can still create a unique constraint on the column (you cannot make it a primary key, though, which requires both uniqueness of values and absence of NULLs).
Therefore you might need a more thorough analysis of the contents, which you could get with the following script:
DECLARE @table varchar(100), @sql varchar(max);
SET @table = 'some table name';
-- Same idea as before, with extra branches for columns that contain NULLs
SELECT
    @sql = COALESCE(@sql + ', ', '') + ColumnExpression
FROM (
    SELECT
        ColumnExpression =
            'CASE COUNT(DISTINCT ' + QUOTENAME(COLUMN_NAME) + ') ' +
            'WHEN COUNT(*) THEN ''UNIQUE'' ' +
            'WHEN COUNT(*) - 1 THEN ' +
                'CASE COUNT(DISTINCT ' + QUOTENAME(COLUMN_NAME) + ') ' +
                'WHEN COUNT(' + QUOTENAME(COLUMN_NAME) + ') THEN ''UNIQUE WITH SINGLE NULL'' ' +
                'ELSE '''' ' +
                'END ' +
            'WHEN COUNT(' + QUOTENAME(COLUMN_NAME) + ') THEN ''UNIQUE with NULLs'' ' +
            'ELSE '''' ' +
            'END AS ' + QUOTENAME(COLUMN_NAME)
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @table
) s;
SET @sql = 'SELECT ' + @sql + ' FROM ' + QUOTENAME(@table);
PRINT @sql; /* in case you still want to have a look at the resulting query */
EXEC(@sql);
This solution takes NULLs into account by checking three values: COUNT(DISTINCT column), COUNT(column) and COUNT(*). It displays the results similarly to the former solution, but the possible diagnoses for the columns are more diverse:
UNIQUE means no duplicate values and no NULLs (can either be a PK or have a unique constraint/index);
UNIQUE WITH SINGLE NULL – as can be guessed, no duplicates, but there's one NULL (cannot be a PK, but can have a unique constraint/index);
UNIQUE with NULLs – no duplicates, two or more NULLs (if you are on SQL Server 2008 or later, you could create a filtered unique index that covers the non-NULL values only);
empty string – there are duplicates, possibly NULLs too.
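The three-count comparison above can be tried out without a SQL Server instance. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table `demo`, its columns, and the helper `classify_columns` are made up for illustration. The labels are the same four diagnoses listed above.

```python
import sqlite3


def quote(name):
    """Double-quote an identifier for SQLite (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def classify_columns(conn, table):
    """Label each column using COUNT(*), COUNT(col) and COUNT(DISTINCT col)."""
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quote(table)})")]
    labels = {}
    for col in columns:
        total, non_null, distinct = conn.execute(
            f"SELECT COUNT(*), COUNT({quote(col)}), COUNT(DISTINCT {quote(col)}) "
            f"FROM {quote(table)}"
        ).fetchone()
        if distinct != non_null:
            labels[col] = ""  # duplicates among the non-NULL values
        elif non_null == total:
            labels[col] = "UNIQUE"
        elif non_null == total - 1:
            labels[col] = "UNIQUE WITH SINGLE NULL"
        else:
            labels[col] = "UNIQUE with NULLs"
    return labels


# Made-up sample data: one column for each possible diagnosis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, email TEXT, nickname TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "ann", "Oslo"),
        (2, "b@example.com", None, "Oslo"),
        (3, None, None, "Bergen"),
    ],
)
labels = classify_columns(conn, "demo")
print(labels)
```

Here `id` has no duplicates and no NULLs, `email` has one NULL, `nickname` has two NULLs, and `city` repeats a value, so each column lands in a different category.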

Here is what I think is probably the cleanest way: use dynamic SQL and a single SELECT statement to build a query that gives you the total row count and the count of distinct values for each column.
Fill in the database name and table name at the top. The database name is really important, since OBJECT_NAME only works in the current database context.
USE DatabaseName
DECLARE @Table varchar(100) = 'TableName'
DECLARE @SQL varchar(max)
SET @SQL = 'SELECT COUNT(*) as ''Total'''
-- Append one COUNT(DISTINCT ...) per column of the table
SELECT @SQL = @SQL + ',COUNT(DISTINCT ' + QUOTENAME(name) + ') as ''' + name + ''''
FROM sys.columns c
WHERE OBJECT_NAME(object_id) = @Table
SET @SQL = @SQL + ' FROM ' + QUOTENAME(@Table)
EXEC (@SQL) -- the parentheses are required to execute a string

If you are using SQL Server 2008, you can use the Data Profiling Task in SSIS to return candidate keys for each table.
This blog entry steps through the process; it's fairly simple:
http://consultingblogs.emc.com/jamiethomson/archive/2008/03/04/ssis-data-profiling-task-part-8-candidate-key.aspx

A few words on what my code does:
Reads all tables and columns
Creates a temp table to hold the table/column pairs that contain duplicate values
For each table/column pair it runs a query; if it finds COUNT(*) > 1 for at least one value, it inserts the pair into the temp table
Selects the tables and columns from the system tables that do not match any pair found to have duplicates
DECLARE @sql VARCHAR(max)
DECLARE @table VARCHAR(100)
DECLARE @column VARCHAR(100)
CREATE TABLE #temp (tname VARCHAR(100), cname VARCHAR(100))
DECLARE mycursor CURSOR FOR
    select t.name, c.name
    from sys.tables t
    join sys.columns c on t.object_id = c.object_id
    where system_type_id not in (34,35,99) -- skip image, text and ntext columns
OPEN mycursor
FETCH NEXT FROM mycursor INTO @table, @column
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Record the table/column pair if any value occurs more than once
    SET @sql = 'INSERT INTO #temp SELECT DISTINCT ''' + @table + ''',''' + @column + ''' FROM ' + QUOTENAME(@table) + ' GROUP BY ' + QUOTENAME(@column) + ' HAVING COUNT(*)>1 '
    EXEC (@sql)
    FETCH NEXT FROM mycursor INTO @table, @column
END
-- Every pair that was not recorded has unique values
select t.name, c.name
from sys.tables t
join sys.columns c on t.object_id = c.object_id
left join #temp on t.name = #temp.tname and c.name = #temp.cname
where system_type_id not in (34,35,99) and #temp.tname IS NULL
DROP TABLE #temp
CLOSE mycursor
DEALLOCATE mycursor

What about one simple line of code:
CREATE UNIQUE INDEX index_name ON table_name (column_name);
If the index is created then your column_name has only unique values. If there are dupes in your column_name, you will get an error message.
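As a rough illustration of this "try to build the index and see" idea, here is a sketch using Python's built-in sqlite3 module instead of SQL Server. The table `demo`, the index name `probe_unique_idx` and the helper `can_be_unique` are made up for the example. One difference to keep in mind: SQLite lets a unique index hold several NULLs, whereas SQL Server allows only one, so the outcome for NULL-heavy columns is not directly comparable.

```python
import sqlite3


def quote(name):
    """Double-quote an identifier for SQLite (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def can_be_unique(conn, table, column):
    """Return True if a unique index can be built on table.column.

    The probe index is dropped again on success, so the schema is left unchanged.
    """
    try:
        conn.execute(
            f"CREATE UNIQUE INDEX probe_unique_idx ON {quote(table)} ({quote(column)})"
        )
    except sqlite3.DatabaseError:
        # Duplicate values make index creation fail; nothing is created.
        return False
    conn.execute("DROP INDEX probe_unique_idx")
    return True


# Made-up sample data: id is unique, city is not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "Oslo"), (2, "Oslo"), (3, "Bergen")])

id_is_unique = can_be_unique(conn, "demo", "id")
city_is_unique = can_be_unique(conn, "demo", "city")
print(id_is_unique, city_is_unique)
```

Dropping the probe index afterwards matters on a real table too; otherwise every successful test leaves an index behind.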

Related

Counting rows in the table which have 1 or more missing values

Could you please advise how to find the number of rows in the table which have 1 or more missing values? The missing values are represented in my table by question marks = '?'. The table has 15 columns and ~50k rows. When I run the following query for some of the columns I can receive some results:
SELECT
COUNT(*)
FROM table_name
WHERE column_name ='?'
However, some columns give me this error instead: "Error converting data type varchar to float"
I would like to find the number of rows in the table which have 1 or more missing values using a single query, rather than running one separately for each column.
Thank you in advance for your support!
Select Count(*)
From mySchema.myTable
Where Cast(Col1 As NVarChar(128)) +
Cast(Col2 As NVarChar(128)) +
Cast(Coln As NVarChar(128)) Like '%?%'
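It's ugly and WILL be slow, and you may need to adjust the casts to suit your columns, but it should do the trick. Note that under SQL Server's default settings a NULL in any column makes the whole concatenation NULL, so rows containing NULLs will not be counted.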
This should work for any column:
select count(*)
from table_name
where column_name is null or cast(column_name as varchar(255)) = '?';
Try the following query.
Just set the table name and it will pick up all the columns.
You can also set value_to_match to '?' (as in your case) or to any other value.
DECLARE @table_name nvarchar(max) = 'table_name'
DECLARE @value_to_match nvarchar(max) = '1'
DECLARE @query nvarchar(max) = ''
-- ' OR ' counts a row if any column has the value; ' AND ' requires every column to have it
DECLARE @Condition nvarchar(max) = ' OR '
SELECT @query = @query + ' cast(' + COLUMN_NAME + ' as nvarchar(500)) = ''' + @value_to_match + '''' + @Condition FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = @table_name
IF @@ROWCOUNT = 0
BEGIN
    SELECT 'Table doesn''t exist'
    RETURN
END
-- Strip the trailing condition keyword left over from the last column
SELECT @query = LEFT(@query, LEN(@query)-3)
PRINT ('select count(9) FROM ' + @table_name + ' WHERE ' + @query)
EXEC ('select count(9) FROM ' + @table_name + ' WHERE ' + @query)
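The "cast every column to text and OR the comparisons together" idea can be sketched in a form that runs anywhere. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than SQL Server, and the table `census`, its columns, and the helper `count_rows_with_marker` are made up. The marker is passed as a bound parameter instead of being spliced into the SQL text.

```python
import sqlite3


def quote(name):
    """Double-quote an identifier for SQLite (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def count_rows_with_marker(conn, table, marker="?"):
    """Count rows in which at least one column, cast to text, equals `marker`."""
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quote(table)})")]
    if not columns:
        raise ValueError(f"table {table!r} does not exist")
    # One comparison per column, joined with OR so any match counts the row once.
    condition = " OR ".join(f"CAST({quote(c)} AS TEXT) = ?" for c in columns)
    sql = f"SELECT COUNT(*) FROM {quote(table)} WHERE {condition}"
    return conn.execute(sql, [marker] * len(columns)).fetchone()[0]


# Made-up sample data: numeric columns that still contain '?' placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (age REAL, job TEXT, income REAL)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [
        (39.0, "clerk", 50000.0),
        ("?", "clerk", 1.0),
        (25.0, "?", "?"),
        (40.0, "nurse", 3.0),
    ],
)
missing_rows = count_rows_with_marker(conn, "census")
print(missing_rows)
```

The third row has two placeholders but is counted once, which is what "rows with 1 or more missing values" asks for.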

How do I create a select statement to return distinct values, column name and table name?

I would like to create a SQL Statement that will return the distinct values of the Code fields in my database, along with the name of the column for the codes and the name of the table on which the column occurs.
I had something like this:
select c.name as 'Col Name', t.name as 'Table Name'
from sys.columns c, sys.tables t
where c.object_id = t.object_id
and c.name like 'CD_%'
It generates the list of columns and tables I want, but obviously doesn't return any of the values for each of the codes in the list.
There are over 100 tables in my database. I could use the above result set and write the query for each one like this:
Select distinct CD_RACE from PERSON
and it will return the values, but it won't return the column and table name, plus I have to do each one individually. Is there any way I can get the value, column name and table name for EACH code in my database?
Any ideas? Thanks...
Just generate your selects and bring in the column and table names as static values. Here's an Oracle version:
select 'select distinct '''||c.column_name||''' as "Col Name", '''||t.table_name||''' as "Table Name", '||c.column_name||' from '||t.table_name||';'
from all_tab_columns c, all_tables t
where c.table_name = t.table_name;
This will give you a bunch of separate statements. If you really want one uber query that returns all your code values at once, you can modify the query a bit to put a UNION between each SELECT.
Here's an approach for SQL Server, since someone else covered Oracle (and no specific DBMS was mentioned). The following steps are completed:
Setup table to receive the schema, table, column name, and column value (in example below only table variable is used)
Build the list of SQL commands to execute (accounting for various schemas and names with spaces and such)
Run each command dynamically inserting values into the setup table from #1 above
Output results from table
Here is the example:
-- Store the values and source of the values
DECLARE @Values TABLE (
    SchemaName VARCHAR(500),
    TableName VARCHAR(500),
    ColumnName VARCHAR(500),
    ColumnValue VARCHAR(MAX)
)
-- Build list of SQL Commands to run
DECLARE @Commands TABLE (
    Id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
    SchemaName VARCHAR(500),
    TableName VARCHAR(500),
    ColumnName VARCHAR(500),
    SqlCommand VARCHAR(1000)
)
INSERT @Commands
SELECT
    [TABLE_SCHEMA],
    [TABLE_NAME],
    [COLUMN_NAME],
    'SELECT DISTINCT '
    + '''' + [TABLE_SCHEMA] + ''', '
    + '''' + [TABLE_NAME] + ''', '
    + '''' + [COLUMN_NAME] + ''', '
    + '[' + [COLUMN_NAME] + '] '
    + 'FROM [' + [TABLE_SCHEMA] + '].[' + [TABLE_NAME] + ']'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE 'CD_%'
-- Loop through commands
DECLARE
    @Sql VARCHAR(1000),
    @Id INT,
    @SchemaName VARCHAR(500),
    @TableName VARCHAR(500),
    @ColumnName VARCHAR(500)
WHILE EXISTS (SELECT * FROM @Commands) BEGIN
    -- Get next set of records
    SELECT TOP 1
        @Id = Id,
        @Sql = SqlCommand,
        @SchemaName = SchemaName,
        @TableName = TableName,
        @ColumnName = ColumnName
    FROM @Commands
    -- Add values for that command
    INSERT @Values
    EXEC (@Sql)
    -- Remove command record
    DELETE @Commands WHERE Id = @Id
END
-- Return the values and sources
SELECT * FROM @Values
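The same steps (list the matching columns, run one SELECT DISTINCT per column, collect the value together with its table and column name) can be sketched in a self-contained way. This is an illustration only: it uses Python's built-in sqlite3 module rather than SQL Server, and the tables `PERSON` and `ADDRESS`, their data, and the helper `distinct_code_values` are made up. Unlike `LIKE 'CD_%'`, where `_` matches any single character, the sketch tests for the literal prefix `CD_`.

```python
import sqlite3


def quote(name):
    """Double-quote an identifier for SQLite (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def distinct_code_values(conn, prefix="CD_"):
    """Return (table, column, value) for every distinct value of every prefixed column."""
    results = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quote(table)})")]
        for column in columns:
            if not column.startswith(prefix):
                continue
            query = (
                f"SELECT DISTINCT {quote(column)} FROM {quote(table)} "
                f"ORDER BY {quote(column)}"
            )
            for (value,) in conn.execute(query):
                results.append((table, column, value))
    return results


# Made-up sample schema with a few code columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PERSON (CD_RACE TEXT, CD_SEX TEXT, NAME TEXT)")
conn.execute("CREATE TABLE ADDRESS (CD_STATE TEXT, STREET TEXT)")
conn.executemany(
    "INSERT INTO PERSON VALUES (?, ?, ?)",
    [("A", "F", "Ann"), ("B", "M", "Bob"), ("A", "F", "Cat")],
)
conn.executemany(
    "INSERT INTO ADDRESS VALUES (?, ?)",
    [("NY", "1 Main St"), ("CA", "2 Oak Ave"), ("NY", "3 Elm Rd")],
)
code_values = distinct_code_values(conn)
for row in code_values:
    print(row)
```

Columns such as `NAME` and `STREET` are skipped because they do not start with the prefix.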

Is there a way to remove '_' from column name while selecting * in sql statement?

All the column names in my table
(there are more than 80 columns, and I can't change the column names now)
contain '_', like First_Name, Last_Name, ...
So I want to use select * from table instead
of using AS.
I want to select them with the '_' removed, in one statement. Is there any way I can do it?
Something like Replace(columnName, '_','') in the select statement?
Thanks
You can simply rename the column in your query. For example:
SELECT FIRST_NAME [First Name],
LAST_NAME [Last Name]
FROM UserTable
You can also use the AS keyword but this is optional. Also note that if you don't want to do this on every query you can use this process to create a view with renamed columns. Then you can use SELECT * the way you want to (although this is considered a bad idea for many reasons).
Best of luck!
Alternative - Map In The Client Code:
One other alternative is to do the mapping in the client code. This solution is going to depend greatly on your ORM. Most ORMs (such as LINQ or EF) will allow you to remap. If nothing else, you could use AutoMapper or similar to rename the columns on the client using convention-based naming.
You can't do this in a single statement unless you're using dynamic SQL. If you're just trying to generate code, you can run a query against Information_Schema and get the info you want ...
DECLARE @MaxColumns INT
DECLARE @TableName VARCHAR(20)
SET @TableName = 'Course'
SELECT @MaxColumns = MAX(ORDINAL_POSITION) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @TableName
SELECT Col
FROM
(
    SELECT 0 Num, 'SELECT' Col
    UNION
    SELECT ROW_NUMBER() OVER (PARTITION BY TABLE_NAME ORDER BY ORDINAL_POSITION) Num, ' [' + COLUMN_NAME + '] AS [' + REPLACE(COLUMN_NAME, '_', '') + ']' + CASE WHEN ORDINAL_POSITION = @MaxColumns THEN '' ELSE ',' END
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @TableName
    UNION
    SELECT @MaxColumns + 1 Num, 'FROM ' + @TableName
) s
ORDER BY num
The question intrigued me and I did find one way. It makes it happen but if you just wanted to give a lot of aliases one time in one query I wouldn't recommend it though.
First I made a stored procedure that extracts all the column names and gives them an alias without '_'.
USE [DataBase]
GO
IF OBJECT_ID('usp_AlterColumnDisplayName', 'P') IS NOT NULL
    DROP PROCEDURE usp_AlterColumnDisplayName
GO
CREATE PROCEDURE usp_AlterColumnDisplayName
    @TableName VARCHAR(50),
    @ret nvarchar(MAX) OUTPUT
AS
Select @ret = @ret + [Column name]
From
(
    SELECT ([name] + ' AS ' + '[' + REPLACE([name], '_', ' ') + '], ') [Column name]
    FROM syscolumns
    WHERE id =
        (Select id
         From sysobjects
         Where type = 'U'
         And [name] = @TableName
        )
) T
GO
Then extract that string and throw it into another string with a query-structure.
Execute that and you are done.
DECLARE @out NVARCHAR(MAX), @DesiredTable VARCHAR(50), @Query NVARCHAR(MAX)
SET @out = ''
SET @DesiredTable = 'YourTable'
EXEC usp_AlterColumnDisplayName
    @TableName = @DesiredTable,
    @ret = @out OUTPUT
SET @out = LEFT(@out, LEN(@out)-1) --Removing trailing ', '
SET @Query = 'Select ' + @out + ' From ' + @DesiredTable + ' WHERE whatever'
EXEC sp_executesql @Query
If you just wanted to give a lot of aliases at once without sitting and typing it out for 80+ columns I would rather suggest doing that with one simple SELECT statement, like the one in the sp, or in Excel and then copy paste into your code.

MS SQL Store Procedure to Merge Multiple Rows into Single Row based on Variable Table and Column Names

I'm working with MS SQL Server 2008. I'm trying to create a stored procedure to merge (perhaps) several rows of data (answers) into a single row on the target table(s). This uses a 'table_name' field and a 'column_name' field from the answers table. The data looks something like this:
answers table
--------------
id int
table_name varchar
column_name varchar
answer_value varchar
So, the target table (insert/update) would come from 'table_name'. Each row from the answers table would fill one column on the target table.
table_name_1 table
--------------
id int
column_name_1 varchar
column_name_2 varchar
column_name_3 varchar
etc...
Note, there can be many target tables (variable from answers table: table_name_1, table_name_2, table_name_3, etc.) that insert into many columns (column_name_1...2...3) on each target table.
I thought about using a WHILE statement to loop through the answers table. This could build a variable holding the insert/update statement(s) for the target tables, which would then be executed somehow. I also noticed that MERGE looks like it might help with this problem (select/update/insert), but I have very little experience with MS SQL stored procedures. Could someone suggest a strategy or solution to this problem?
Note 6/23/2014: I'm considering using a single Merge statement, but I'm not sure it is possible.
I'm probably missing something, but the basic idea to solve the problem is to use meta-programming, like a dynamic pivot.
In this particular case there is another layer that makes the solution more difficult: the results need to be produced in separate executions, one per target table, instead of being grouped together.
The backbone for a possible solution is
DECLARE @cols AS NVARCHAR(MAX)
DECLARE @query AS NVARCHAR(MAX)
--using a cursor on SELECT DISTINCT table_name FROM answers iterate:
--*Cursor Begin Here*
--mock variable for the first value of the cursor
DECLARE @table AS NVARCHAR(MAX) = 't1'
-- Column list
SELECT @cols = STUFF((SELECT distinct
                             ',' + QUOTENAME(column_name)
                      FROM answers with (nolock)
                      WHERE table_name = @table
                      FOR XML PATH(''), TYPE
                     ).value('.', 'NVARCHAR(MAX)')
                    , 1, 1, '')
--Query definition
SET @query = '
SELECT ' + @cols + '
INTO ' + @table + '
FROM (SELECT column_name, answer_value
      FROM answers
      WHERE table_name = ''' + @table + ''') b
PIVOT (MAX(answer_value) FOR column_name IN (' + @cols + ' )) p '
--select @query
EXEC sp_executesql @query
--select to verify the execution
--SELECT * FROM t1
--*Cursor End Here*
SQLFiddle Demo
The cursor definition is omitted because I'm not sure whether it will work on SQLFiddle.
In addition to the usual dynamic pivot template, the column list is filtered by the new table name, and the query definition uses SELECT ... INTO instead of a plain SELECT.
This script does not account for tables that already exist in the database; if that's a possibility, the query can be split in two:
SET @query = '
SELECT TOP 0 ' + @cols + '
INTO ' + @table + '
FROM (SELECT column_name, answer_value
      FROM answers
      WHERE table_name = ''' + @table + ''') b
PIVOT (MAX(answer_value) FOR column_name IN (' + @cols + ' )) p '
to create the table without data, if needed, and
SET @query = '
INSERT INTO ' + @table + '(' + @cols + ')
SELECT ' + @cols + '
FROM (SELECT column_name, answer_value
      FROM answers
      WHERE table_name = ''' + @table + ''') b
PIVOT (MAX(answer_value) FOR column_name IN (' + @cols + ' )) p '
or a MERGE to insert/update the values in the table.
Another possibility will be to DROP and recreate every table.
Approach I took to this complex problem:
Create several temporary tables to work with your data
Select and populate the temporary tables with the data
Use dynamic pivoting to pivot the rows into one row
Use a CURSOR with WHILE loop for multiple table entries
SET #query with the dynamically built MERGE statement
EXECUTE(#query)
Drop temporary tables
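To make the "pivot the answer rows into one row per target table" idea concrete, here is a small sketch. It is not the MERGE-based procedure described above: it only covers the insert case, uses Python's built-in sqlite3 module instead of SQL Server, and the target tables `t1` and `t2`, the sample answers, and the helper `apply_answers` are made up for illustration.

```python
import sqlite3
from collections import defaultdict


def quote(name):
    """Double-quote an identifier for SQLite (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def apply_answers(conn):
    """Turn the rows of `answers` into one INSERT per target table."""
    # Step 1: pivot in memory, giving table_name -> {column_name: answer_value}.
    grouped = defaultdict(dict)
    rows = conn.execute(
        "SELECT table_name, column_name, answer_value FROM answers ORDER BY id"
    ).fetchall()
    for table, column, value in rows:
        grouped[table][column] = value  # a later answer overwrites an earlier one
    # Step 2: build and run one dynamic INSERT per target table.
    for table, row in grouped.items():
        column_list = ", ".join(quote(c) for c in row)
        placeholders = ", ".join("?" for _ in row)
        conn.execute(
            f"INSERT INTO {quote(table)} ({column_list}) VALUES ({placeholders})",
            list(row.values()),
        )


# Made-up sample data: three answers feeding two target tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (id INTEGER, table_name TEXT, column_name TEXT, answer_value TEXT)"
)
conn.execute("CREATE TABLE t1 (first_name TEXT, city TEXT)")
conn.execute("CREATE TABLE t2 (color TEXT)")
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?)",
    [
        (1, "t1", "first_name", "Ann"),
        (2, "t1", "city", "Oslo"),
        (3, "t2", "color", "red"),
    ],
)
apply_answers(conn)
t1_rows = conn.execute("SELECT first_name, city FROM t1").fetchall()
t2_rows = conn.execute("SELECT color FROM t2").fetchall()
print(t1_rows, t2_rows)
```

Only the table and column names are spliced into the statement text; the answer values travel as bound parameters.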

Looping through column names with dynamic SQL

I just came up with an idea for a piece of code to show all the distinct values for each column, and count how many records for each. I want the code to loop through all columns.
Here's what I have so far... I'm new to SQL so bear with the noobness :)
Hard code:
select [Sales Manager], count(*)
from [BT].[dbo].[test]
group by [Sales Manager]
order by 2 desc
Attempt at dynamic SQL:
Declare @sql varchar(max),
        @column as varchar(255)
set @column = '[Sales Manager]'
set @sql = 'select ' + @column + ',count(*) from [BT].[dbo].[test] group by ' + @column + ' order by 2 desc'
exec (@sql)
Both of these work fine. How can I make it loop through all columns? I don't mind if I have to hard code the column names and it works its way through subbing in each one for @column.
Does this make sense?
Thanks all!
You can use dynamic SQL and get all the column names for a table. Then build up the script:
Declare @sql varchar(max) = ''
declare @tablename as varchar(255) = 'test'
select @sql = @sql + 'select [' + c.name + '],count(*) as ''' + c.name + ''' from [' + t.name + '] group by [' + c.name + '] order by 2 desc; '
from sys.columns c
inner join sys.tables t on c.object_id = t.object_id
where t.name = @tablename
EXEC (@sql)
Change @tablename to the name of your table (without the database or schema name).
This is a bit of an XY answer, but if you don't mind hardcoding the column names, I suggest you do just that, and avoid dynamic SQL - and the loop - entirely. Dynamic SQL is generally considered the last resort, opens you up to security issues (SQL injection attacks) if not careful, and can often be slower if queries and execution plans cannot be cached.
If you have a ton of column names you can write a quick piece of code or mail merge in Word to do the substitution for you.
However, as far as how to get column names, assuming this is SQL Server, you can use the following query:
SELECT c.name
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('dbo.test')
Therefore, you can build your dynamic SQL from this query:
SELECT 'select '
+ QUOTENAME(c.name)
+ ',count(*) from [BT].[dbo].[test] group by '
+ QUOTENAME(c.name)
+ ' order by 2 desc'
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('dbo.test')
and loop using a cursor.
Or compile the whole thing together into one batch and execute. Here we use the FOR XML PATH('') trick:
DECLARE @sql VARCHAR(MAX) = (
    SELECT ' select ' --note the extra space at the beginning
        + QUOTENAME(c.name)
        + ',count(*) from [BT].[dbo].[test] group by '
        + QUOTENAME(c.name)
        + ' order by 2 desc'
    FROM sys.columns c
    WHERE c.object_id = OBJECT_ID('dbo.test')
    FOR XML PATH('')
)
EXEC(@sql)
Note I am using the built-in QUOTENAME function to escape column names that need escaping.
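To show what this kind of escaping buys you, here is a rough sketch. The function `quote_name` is a hypothetical Python imitation of bracket quoting (wrap the name in `[` and `]`, double any `]` inside it); it is not SQL Server's QUOTENAME itself and ignores details such as length limits. The table `test` and its data are made up, and the query is run with Python's built-in sqlite3 module, which happens to accept bracket-quoted identifiers.

```python
import sqlite3


def quote_name(identifier):
    """Bracket-quote an identifier, doubling any closing bracket inside it."""
    return "[" + identifier.replace("]", "]]") + "]"


def build_count_query(table, column):
    """Build the per-column 'value, count' query with quoted identifiers."""
    col = quote_name(column)
    return f"select {col}, count(*) from {quote_name(table)} group by {col} order by 2 desc"


# Made-up sample data with a column name that contains a space.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test ([Sales Manager] TEXT, Region TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("Ann", "North"), ("Ann", "South"), ("Bob", "North")],
)
query = build_count_query("test", "Sales Manager")
counts = conn.execute(query).fetchall()
print(query)
print(counts)
```

Without the brackets, the space in `Sales Manager` would split the name into two tokens and the generated statement would not parse as intended.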
You want to know the distinct column values in all the columns of the table? Just replace the table name Employee with your table name in the following code:
declare @SQL nvarchar(max)
set @SQL = ''
;with cols as (
select Table_Schema, Table_Name, Column_Name, Row_Number() over(partition by Table_Schema, Table_Name
order by ORDINAL_POSITION) as RowNum
from INFORMATION_SCHEMA.COLUMNS
)
select @SQL = @SQL + case when RowNum = 1 then '' else ' union all ' end
+ ' select ''' + Column_Name + ''' as Column_Name, count(distinct ' + quotename (Column_Name) + ' ) As DistinctCountValue,
count( '+ quotename (Column_Name) + ') as CountValue FROM ' + quotename (Table_Schema) + '.' + quotename (Table_Name)
from cols
where Table_Name = 'Employee' --print @SQL
execute (@SQL)