SQL: Deleting Identical Columns With Different Names

My original table ("original_table") looks like this (contains both numeric and character variables):
   age  height  height2  gender  gender2
1   18    76.1     76.1       M        M
2   19    77.0     77.0       F        F
3   20    78.1     78.1       M        M
4   21    78.2     78.2       M        M
5   22    78.8     78.8       F        F
6   23    79.7     79.7       F        F
I would like to remove columns from this table that have identical entries but are named differently. In the end, this should look like this ("new_table"):
   age  height  gender
1   18    76.1       M
2   19    77.0       F
3   20    78.1       M
4   21    78.2       M
5   22    78.8       F
6   23    79.7       F
My Question: Is there a standard way to do this in SQL? I did some research and came across the following link: How do I compare two columns for equality in SQL Server?
What I Tried So Far: It seems that something like this might work:
CREATE TABLE new_table AS SELECT * FROM original_table;

ALTER TABLE new_table
ADD does_age_equal_height varchar(255);

UPDATE new_table
SET does_age_equal_height = CASE
    WHEN age = height THEN '1' ELSE '0' END;
From here, if the sum of all values in the "does_age_equal_height" column equals the number of rows in "new_table" (i.e. SELECT COUNT(*) FROM new_table), the two columns must be identical, and one of them can be dropped.
However, this is a very inefficient method, even for tables with a small number of columns. In my example I have 5 columns, which means I would have to repeat the above process "5 choose 2" times, i.e. 5! / (2! * 3!) = 10 times. For example:
ALTER TABLE new_table
ADD does_age_equal_height varchar(255),
    does_age_equal_height2 varchar(255),
    does_age_equal_gender varchar(255),
    does_age_equal_gender2 varchar(255),
    does_height_equal_height2 varchar(255),
    does_height_equal_gender varchar(255),
    does_height_equal_gender2 varchar(255),
    does_height2_equal_gender varchar(255),
    does_height2_equal_gender2 varchar(255),
    does_gender_equal_gender2 varchar(255);
This would then be followed by multiple CASE statements - further complicating the process.
Can someone please show me a more efficient way of doing this?
Thanks!

I hope I've understood your problem correctly. The code below is written for SQL Server; you should adapt it for Netezza SQL.
My idea is:
Calculate an MD5 hash for each column and then compare the hashes; whenever two columns share the same hash, only one of them is kept.
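The per-column hashing idea can be sketched outside SQL as well. A minimal Python illustration (using the asker's sample data and Python's hashlib rather than HASHBYTES):

```python
import hashlib

# Columns of the example table, keyed by name (some duplicated under other names).
columns = {
    "age":     [18, 19, 20, 21, 22, 23],
    "height":  [76.1, 77.0, 78.1, 78.2, 78.8, 79.7],
    "height2": [76.1, 77.0, 78.1, 78.2, 78.8, 79.7],
    "gender":  ["M", "F", "M", "M", "F", "F"],
    "gender2": ["M", "F", "M", "M", "F", "F"],
}

# Hash each column's serialized values; keep the first column seen per hash.
seen = {}
for name, values in columns.items():
    digest = hashlib.md5(repr(values).encode()).hexdigest()
    seen.setdefault(digest, name)

kept = sorted(seen.values())
print(kept)  # ['age', 'gender', 'height']
```

Instead of 10 pairwise comparisons, each column is hashed once and duplicates fall out of the dictionary for free.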
I'm going to create the table below for this problem:
CREATE TABLE Students
(
    Id INT PRIMARY KEY IDENTITY,
    StudentName VARCHAR(50),
    Course VARCHAR(50),
    Score INT,
    lastName VARCHAR(50), -- another alias for StudentName
    metric INT,           -- another alias for Score
    className VARCHAR(50) -- another alias for Course
)
GO
INSERT INTO Students VALUES ('Sally', 'English', 95, 'Sally', 95, 'English');
INSERT INTO Students VALUES ('Sally', 'History', 82, 'Sally', 82, 'History');
INSERT INTO Students VALUES ('Edward', 'English', 45, 'Edward', 45, 'English');
INSERT INTO Students VALUES ('Edward', 'History', 78, 'Edward', 78, 'History');
After creating the table and inserting the sample records, the next task is to find the duplicate columns.
Step 1. Declare variables.
DECLARE @cols_q VARCHAR(max),
        @cols VARCHAR(max),
        @table_name VARCHAR(max) = N'Students',
        @res NVARCHAR(max),
        @newCols VARCHAR(max),
        @finalResQuery VARCHAR(max);
Step 2. Generate a dynamic query that calculates a hash for every column.
SELECT @cols_q = COALESCE(@cols_q + ', ', '') + 'HASHBYTES(''MD5'', CONVERT(varbinary(max), (select ' + COLUMN_NAME + ' as t from Students FOR XML AUTO))) as ' + COLUMN_NAME,
       @cols = COALESCE(@cols + ',', '') + COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @table_name;

SET @cols_q = 'select ' + @cols_q + ' into ##tmp_' + @table_name + ' from ' + @table_name;
Step 3. Run the generated query.
EXEC (@cols_q);
Step 4. Get the column list with duplicate columns removed.
SET @res = N'select uniq_colname into ##temp_colnames
from (
    select max(colname) as uniq_colname from (
        select * from ##tmp_Students
    ) tt
    unpivot (
        md5_hash for colname in ( ' + @cols + N' )
    ) as tbl
    group by md5_hash
) tr';
EXEC (@res);
Step 5. Get the final result.
SELECT @newCols = COALESCE(@newCols + ', ', '') + uniq_colname FROM ##temp_colnames;

SET @finalResQuery = 'select ' + @newCols + ' from ' + @table_name;
EXEC (@finalResQuery);
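For readers without a SQL Server instance, the whole flow can be approximated in Python + SQLite (a sketch using the question's sample data; SQLite has no HASHBYTES, so the column contents are compared client-side instead of by hash):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE original_table (age INT, height REAL, height2 REAL,
                             gender TEXT, gender2 TEXT);
INSERT INTO original_table VALUES
 (18, 76.1, 76.1, 'M', 'M'), (19, 77.0, 77.0, 'F', 'F'),
 (20, 78.1, 78.1, 'M', 'M'), (21, 78.2, 78.2, 'M', 'M'),
 (22, 78.8, 78.8, 'F', 'F'), (23, 79.7, 79.7, 'F', 'F');
""")

cols = [r[1] for r in conn.execute("PRAGMA table_info(original_table)")]
seen, keep = {}, []
for c in cols:
    values = tuple(r[0] for r in conn.execute(
        f'SELECT "{c}" FROM original_table'))
    if values not in seen:        # first column with this exact content wins
        seen[values] = c
        keep.append(c)

# Materialize the deduplicated table, like the @finalResQuery step above.
conn.execute(f'CREATE TABLE new_table AS SELECT {", ".join(keep)} '
             'FROM original_table')
print(keep)  # ['age', 'height', 'gender']
```

The structure mirrors the T-SQL answer: enumerate columns from catalog metadata, fingerprint each column, then build the final SELECT dynamically from the surviving names.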


Use table of column metadata as column header and type

Tables
dbo.Metadata:

Column     Type
ID         int
Name       varchar(50)
Location   varchar(50)

dbo.Data:

Col1  Col2               Col3
1     Awesomenauts Inc.  Germany
2     DataMunchers       France
3     WeBuyStuff         France

Wanted Output:

ID  Name               Location
1   Awesomenauts Inc.  Germany
2   DataMunchers       France
3   WeBuyStuff         France
Is there any simple way to achieve this?
Perhaps with Dynamic SQL?
Oh, and the schema may vary from day to day; everything will be batch-reloaded into the DWH daily.
You will need to have some sort of order defined in your metadata for this to work. For my script, I added ColumnOrder for reference
/*Setup Metadata table*/
DROP TABLE IF EXISTS #Metadata
CREATE TABLE #Metadata (
ColumnOrder INT IDENTITY(1,1) PRIMARY KEY /*Need to have some sort of defined column order, I created one for illustration purposes*/
,[Column] SYSNAME
,[Type] VARCHAR(255)
)
/*Load data*/
INSERT INTO #Metadata
VALUES
('ID','int')
,('Name','varchar(50)')
,('Location','varchar(50)')
/*Create dynamic SQL*/
DECLARE @DynamicSQL NVARCHAR(MAX);
/*Create column list*/
;WITH cte_Column AS (
SELECT ColumnOrder,
[Column]
,[Type]
,DataColName = CONCAT('Col',Row_Number () OVER (ORDER BY A.ColumnOrder))
FROM #Metadata AS A
)
SELECT @DynamicSQL
= STRING_AGG(
Concat(QUOTENAME([Column])
,' = CAST ('
,DataColName
,' AS '
,A.[Type]
,')')
,CONCAT(CHAR(13),CHAR(10),',') /*Line break + comma separators*/
)
WITHIN GROUP (ORDER BY A.ColumnOrder) /*Ensures columns concatenated in order*/
FROM cte_Column AS A
SET @DynamicSQL = CONCAT('SELECT ', @DynamicSQL, CHAR(13), CHAR(10), ' FROM dbo.Data')
PRINT @DynamicSQL
/*Uncomment to execute*/
--EXEC (@DynamicSQL)
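The same metadata-driven SELECT generation can be sketched in a general-purpose language. Here is a hypothetical Python + SQLite version (table contents invented for the demo; the CAST step is dropped because SQLite's type names differ from SQL Server's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Metadata (ColumnOrder INTEGER PRIMARY KEY, "Column" TEXT, "Type" TEXT);
INSERT INTO Metadata ("Column", "Type") VALUES
  ('ID', 'int'), ('Name', 'varchar(50)'), ('Location', 'varchar(50)');
CREATE TABLE Data (Col1 TEXT, Col2 TEXT, Col3 TEXT);
INSERT INTO Data VALUES ('1', 'Awesomenauts Inc.', 'Germany');
""")

# Map Col1, Col2, ... (in metadata order) onto the target column names.
meta = conn.execute(
    'SELECT "Column" FROM Metadata ORDER BY ColumnOrder').fetchall()
select_list = ", ".join(
    f'Col{i} AS "{name}"' for i, (name,) in enumerate(meta, start=1))
dynamic_sql = f"SELECT {select_list} FROM Data"

row = conn.execute(dynamic_sql).fetchone()
print(dynamic_sql)
print(row)
```

As in the T-SQL answer, everything hinges on the metadata rows having a defined order so that positional `ColN` names line up with the intended headers.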

Convert CSV stored in a string variable to table

I've got CSV data stored in a string variable in SQL:
@csvContent =
'date;id;name;position;street;city
19.03.2019 10:06:00;1;Max;President;Langestr. 35;Berlin
19.04.2019 12:36:00;2;Bernd;Vice President;Haupstr. 40;Münster
21.06.2019 14:30:00;3;Franziska;financial;Hofstr. 19;Frankfurt'
What I want to do is to convert it to a #table, so it would look like
SELECT * FROM #table
date id name position street city
---------------------------------------------------------------------
19.03.2019 10:06:00 1 Max President Langestr. 35 Berlin
19.04.2019 12:36:00 2 Bernd Vice President Haupstr. 40 Münster
21.06.2019 14:30:00 3 Franzi financial Hofstr. 19 Frankfurt
The headers aren't fixed, so the CSV could have more or fewer columns with different header names.
I've tried string_split() and PIVOT but didn't find a solution for this.
If you are using SQL server, this might be a solution for your request:
How to split a comma-separated value to columns
Hope it will help you
CREATE TABLE #temp(
    date date,
    id int,
    name varchar(100),
    ... -- add the remaining columns you need
)
DECLARE @sql NVARCHAR(4000) = 'BULK INSERT #temp
FROM ''' + @CSVFILE + ''' WITH
(
    FIELDTERMINATOR = '';'',
    ROWTERMINATOR = ''\n'',
    FIRSTROW = 2
)';
EXEC(@sql);
SELECT * FROM #temp
DROP TABLE #temp
Note that BULK INSERT reads from a file, so @CSVFILE must be a file path; the CSV string would first have to be written out to a file.
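If a detour through a file isn't acceptable, the CSV string can also be parsed directly in a general-purpose language, headers and all, with a variable column count. A Python sketch using the question's data:

```python
import csv
import io

csv_content = """date;id;name;position;street;city
19.03.2019 10:06:00;1;Max;President;Langestr. 35;Berlin
19.04.2019 12:36:00;2;Bernd;Vice President;Haupstr. 40;Münster
21.06.2019 14:30:00;3;Franziska;financial;Hofstr. 19;Frankfurt"""

reader = csv.reader(io.StringIO(csv_content), delimiter=";")
headers = next(reader)                      # first row holds the column names
rows = [dict(zip(headers, row)) for row in reader]

print(headers)
print(rows[0]["name"], rows[0]["city"])  # Max Berlin
```

Because the header row drives the dictionary keys, extra or renamed columns in tomorrow's feed need no code changes.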

How can I update strings within an SQL server table based on a query?

I have two tables, A and B. A has an Id and a string with some embedded information: text plus ids from a table C that is not shown.

Aid| AString
1  | "<thing_5"><thing_6">"
2  | "<thing_5"><thing_6">"

Bid|Cid|Aid
1  | 5 | 1
2  | 6 | 1
3  | 5 | 2
4  | 6 | 2
I realise this is an insane structure but that is life.
I need to update the strings within A so that instead of having the Cid they have the corresponding Bid (related by the Aid and Bid pairing)
Is this even something I should be thinking of doing in SQL? A has about 300 entries and B about 1200, so it's not something to do by hand.
For clarity, I wish for B to remain the same and A to finally look like this:

Aid| AString
1  | "<thing_1"><thing_2">"
2  | "<thing_3"><thing_4">"
This script relies on generating dynamic SQL statements to update the table, then executes those statements.
Taking into account that the cids appear between "thing_" and a double quote, the script:
First replaces the cids using a placeholder ($$$$$$ in this case) to account for the fact that cids and bids may overlap (for example, changing 3->2 and later 2->1)
Then changes the placeholders to the proper bid
CREATE TABLE #a(aid INT,astr VARCHAR(MAX));
INSERT INTO #a(aid,astr)VALUES(1,'<thing_5"><thing_6">'),(2,'<thing_5"><thing_6">');
CREATE TABLE #rep(aid INT,bid INT,cid INT);
INSERT INTO #rep(bid,cid,aid)VALUES(5,6,1),(6,5,1),(3,5,2),(4,6,2);
DECLARE @cmd NVARCHAR(MAX)=(
SELECT
'UPDATE #a '+
'SET astr=REPLACE(astr,''thing_'+CAST(r.cid AS VARCHAR(16))+'"'',''thing_$$$$$$'+CAST(r.cid AS VARCHAR(16))+'"'') '+
'WHERE aid='+CAST(a.aid AS VARCHAR(16))+';'
FROM
(SELECT DISTINCT aid FROM #a AS a) AS a
INNER JOIN #rep AS r ON
r.aid=a.aid
FOR
XML PATH('')
);
EXEC sp_executesql @cmd;
SET @cmd=(
SELECT
'UPDATE #a '+
'SET astr=REPLACE(astr,''thing_$$$$$$'+CAST(r.cid AS VARCHAR(16))+'"'',''thing_'+CAST(r.bid AS VARCHAR(16))+'"'') '+
'WHERE aid='+CAST(a.aid AS VARCHAR(16))+';'
FROM
(SELECT DISTINCT aid FROM #a AS a) AS a
INNER JOIN #rep AS r ON
r.aid=a.aid
FOR
XML PATH('')
);
EXEC sp_executesql @cmd;
SELECT * FROM #a;
DROP TABLE #rep;
DROP TABLE #a;
Result is:
+-----+----------------------+
| aid | astr |
+-----+----------------------+
| 1 | <thing_6"><thing_5"> |
| 2 | <thing_3"><thing_4"> |
+-----+----------------------+
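The reason for the placeholder pass is easiest to see in miniature. A Python sketch (with an invented overlapping mapping, not the question's exact ids) showing how a naive chain of replacements goes wrong when an id is both an old and a new value:

```python
mapping = {5: 6, 6: 7}   # cid -> bid; note 6 is both an old and a new id
s = '<thing_5"><thing_6">'

# Naive chained replacement: the value produced by the first substitution
# (thing_6) is itself rewritten by the second one.
naive = s
for cid, bid in mapping.items():
    naive = naive.replace(f'thing_{cid}"', f'thing_{bid}"')
print(naive)  # <thing_7"><thing_7">  -- wrong

# Two-phase replacement through a placeholder, as in the answer above:
# old id -> placeholder first, then placeholder -> new id.
fixed = s
for cid in mapping:
    fixed = fixed.replace(f'thing_{cid}"', f'thing_$${cid}"')
for cid, bid in mapping.items():
    fixed = fixed.replace(f'thing_$${cid}"', f'thing_{bid}"')
print(fixed)  # <thing_6"><thing_7">
```

The placeholder acts as a temporary namespace that cannot collide with either the old or the new ids, which is exactly what the $$$$$$ token does in the dynamic SQL.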
You could do this with SQL with something like below. It wasn't clear to me how c was related, but you can adjust it as necessary...
create table a (
Aid int null,
AString varchar(25) null)
insert into a values(1,'"<thing_5"><thing_6">"')
insert into a values(2,'"<thing_5"><thing_6">"')
create table b (
Aid int null,
Bid int null,
Cid int null)
insert into b values(1,1,5)
insert into b values(1,2,6)
insert into b values(2,3,5)
insert into b values(2,4,6)
UPDATE Ax
SET Ax.AString = REPLACE(Ax.AString, 'thing_' + cast(Cid as varchar(1)), 'thing_' + cast(Bid as varchar(1)))
FROM a Ax
INNER JOIN b Bx
    on Ax.Aid = Bx.Aid
    and Ax.AString like '%thing_' + cast(Cid as varchar(1)) + '%'

Create a view based on column metadata

Let's assume two tables:
TableA holds various data measurements from a variety of stations.
TableB holds metadata, about the columns used in TableA.
TableA has:
stationID int not null, pk
entryDate datetime not null, pk
waterTemp float null,
waterLevel float null ...etc
TableB has:
id int not null, pk, autoincrement
colname varchar(50),
unit varchar(50) ....etc
So for example, one line of data from tableA reads:
1 | 2013-01-01 00:00 | 2.4 | 3.5
two lines from tableB read:
1| waterTemp | celcius
2| waterLevel | meters
This is a simplified example. In truth, tableA might hold close to 20 different data columns, and table b has close to 10 metadata columns.
I am trying to design a view which will output the results like this:
StationID | entryDate | water temperature | water level |
1 | 2013-01-01 00:00 | 2.4 celcius | 3.5 meters |
So two questions:
1. Other than specifying subselects from TableB ("...where colname='XXX'") for each column, which seems horribly inefficient (not to mention... manual :P), is there a way to get the result I mentioned earlier with an automatic match on colname?
2. I have a hunch that this might be bad database design. Is it so? If yes, what would be a more optimal design? (Bear in mind the complexity of the data structure I mentioned earlier.)
Dynamic SQL with PIVOT is the answer. Though it is messy in terms of debugging (and for a new developer trying to understand the code), it will give you the result you expect.
Check the query below.
We need to prepare two things dynamically: the list of columns in the result set, and the list of values that will appear in the PIVOT query. Notice that in the result I do not have NULL values for Column3, Column5 and Column6.
SET NOCOUNT ON
IF OBJECT_ID('TableA','u') IS NOT NULL
DROP TABLE TableA
GO
CREATE TABLE TableA
(
stationID int not null IDENTITY (1,1)
,entryDate datetime not null
,waterTemp float null
,waterLevel float NULL
,Column3 INT NULL
,Column4 BIGINT NULL
,Column5 FLOAT NULL
,Column6 FLOAT NULL
)
GO
IF OBJECT_ID('TableB','u') IS NOT NULL
DROP TABLE TableB
GO
CREATE TABLE TableB
(
id int not null IDENTITY(1,1)
,colname varchar(50) NOT NULL
,unit varchar(50) NOT NULL
)
INSERT INTO TableA( entryDate ,waterTemp ,waterLevel,Column4)
SELECT '2013-01-01',2.4,3.5,101
INSERT INTO TableB( colname, unit )
SELECT 'WaterTemp','celcius'
UNION ALL SELECT 'waterLevel','meters'
UNION ALL SELECT 'Column3','unit3'
UNION ALL SELECT 'Column4','unit4'
UNION ALL SELECT 'Column5','unit5'
UNION ALL SELECT 'Column6','unit6'
DECLARE @pvtInColumnList NVARCHAR(4000)=''
       ,@SelectColumnist NVARCHAR(4000)=''
       ,@SQL NVARCHAR(MAX)=''
----getting the list of column names to be used in the PIVOT IN-list
SELECT @pvtInColumnList = CASE WHEN @pvtInColumnList=N'' THEN N'' ELSE @pvtInColumnList + N',' END
                          + N'['+ colname + N']'
FROM TableB
--PRINT @pvtInColumnList
----lt and rt are table aliases used in the subsequent join.
SELECT @SelectColumnist = CASE WHEN @SelectColumnist = N'' THEN N'' ELSE @SelectColumnist + N',' END
                          + N'CAST(lt.'+sc.name + N' AS NVARCHAR(MAX)) + SPACE(2) + rt.' + sc.name + N' AS ' + sc.name
FROM sys.objects so
JOIN sys.columns sc
    ON so.object_id=sc.object_id AND so.name='TableA' AND so.type='u'
JOIN TableB tbl
    ON tbl.colname=sc.name
JOIN sys.types st
    ON st.system_type_id=sc.system_type_id
ORDER BY sc.name
IF @SelectColumnist <> '' SET @SelectColumnist = N','+@SelectColumnist
--PRINT @SelectColumnist
----preparing the final SQL to be executed
SELECT @SQL = N'
SELECT
--this is a fixed column list
lt.stationID
,lt.entryDate
'
--dynamic column list
+ @SelectColumnist + N'
FROM TableA lt,
(
    SELECT * FROM
    (
        SELECT colname,unit
        FROM TableB
    )p
    PIVOT
    ( MAX(p.unit) FOR p.colname IN ( '+ @pvtInColumnList + N' ) )q
)rt
'
PRINT @SQL
EXECUTE sp_executesql @SQL
Answer to your second question:
The design above gives neither performance nor flexibility. If the user wants to add new metadata (a column and unit), that cannot be done without changing the table definition of TableA.
If we are OK with writing dynamic SQL to give the user flexibility, we can redesign TableA as below; there is nothing to change in TableB. I would convert TableA into a key-value-pair table. Notice that stationID is no longer an IDENTITY; instead, for a given stationID there will be N rows, where N is the number of columns supplying values for that stationID. With this design, if the user adds a new column and unit to TableB tomorrow, it just adds new rows in TableA; no table definition change is required.
SET NOCOUNT ON
IF OBJECT_ID('TableA_New','u') IS NOT NULL
DROP TABLE TableA_New
GO
CREATE TABLE TableA_New
(
rowID INT NOT NULL IDENTITY (1,1)
,stationID int not null
,entryDate datetime not null
,ColumnID INT
,Columnvalue NVARCHAR(MAX)
)
GO
IF OBJECT_ID('TableB_New','u') IS NOT NULL
DROP TABLE TableB_New
GO
CREATE TABLE TableB_New
(
id int not null IDENTITY(1,1)
,colname varchar(50) NOT NULL
,unit varchar(50) NOT NULL
)
GO
INSERT INTO TableB_New(colname,unit)
SELECT 'WaterTemp','celcius'
UNION ALL SELECT 'waterLevel','meters'
UNION ALL SELECT 'Column3','unit3'
UNION ALL SELECT 'Column4','unit4'
UNION ALL SELECT 'Column5','unit5'
UNION ALL SELECT 'Column6','unit6'
INSERT INTO TableA_New (stationID,entrydate,ColumnID,Columnvalue)
SELECT 1,'2013-01-01',1,2.4
UNION ALL SELECT 1,'2013-01-01',2,3.5
UNION ALL SELECT 1,'2013-01-01',4,101
UNION ALL SELECT 2,'2012-01-01',1,3.6
UNION ALL SELECT 2,'2012-01-01',2,9.9
UNION ALL SELECT 2,'2012-01-01',4,104
SELECT * FROM TableA_New
SELECT * FROM TableB_New
SELECT *
FROM
(
    SELECT lt.stationID, lt.entryDate, rt.colname, lt.Columnvalue + SPACE(3) + rt.unit AS ColValue
    FROM TableA_New lt
    JOIN TableB_New rt
        ON lt.ColumnID = rt.id
) t1
PIVOT
(MAX(ColValue) FOR colname IN ([WaterTemp],[waterLevel],[Column3],[Column4],[Column5],[Column6])) pvt
I would design this database like the following:
A table MEASUREMENT_DATAPOINT that contains the measured data points. It would have the columns ID, measurement_id, value, unit, name.
One entry would be 1, 1, 2.4, 'celcius', 'water temperature'.
A table MEASUREMENTS that contains the data of the measurement itself. Columns: ID, station_ID, entry_date.
You might want to look into the MS-SQL function called PIVOT/UNPIVOT
http://technet.microsoft.com/en-us/library/ms177410(v=sql.105).aspx
you can take column names and have them in rows or vice versa using this command.
Once you have the column name in the column itself you can join that column from tableA to tableB. Then unpivot to get your data back the way you want it. (caveat I may be swapping the use of pivot and unpivot :))
Word to the wise though, if you are working with large tables, pivot is not the fastest of operations.
I think you would have to flip it to a row per metric. Looking at your design above:
1 | 2013-01-01 00:00 | 2.4 | 3.5
How do I know what row in table b that applies to?
I would try something like this:
Table B:

Metric_Key | Metric
1          | WaterLevel in Meters
2          | Temp in Celcius
...

Table A:

StationID | entrydate        | Metric_Key | Value
1         | 2013-01-01 00:00 | 1          | 2.4

Join with dynamic pivot (version 2)

I have some tables with data:
Category
CategoryID CategoryName
1 Home
2 Contact
3 About
Position
PositionID PositionName
1 Main menu
2 Left menu
3 Right menu
...(new row can be added later)
CategoryPosition
CPID CID PID COrder
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 3 5
How can I make a table like this:
CID CName MainMenu LeftMenu RightMenu
1 Home 1 2 3
2 Contact 4 0 5
3 About 0 0 0
And if a new Category or Position row is added later, the query should reflect the change automatically, e.g:
CID CName MainMenu LeftMenu RightMenu BottomMenu
1 Home 1 2 3 0
2 Contact 4 0 5 0
3 About 0 0 0 0
4 News 0 0 0 0
The following dynamic query seems to work:
declare @columnlist nvarchar(4000)
select @columnlist = IsNull(@columnlist + ', ', '') + '[' + PositionName + ']'
from #Position

declare @query nvarchar(4000)
select @query = '
select *
from (
    select CategoryId, CategoryName, PositionName,
           IsNull(COrder,0) as COrder
    from #Position p
    cross join #Category c
    left join #CategoryPosition cp
        on cp.pid = p.PositionId
        and cp.cid = c.CategoryId
) pv
PIVOT (max(COrder) FOR PositionName in (' + @columnlist + ')) as Y
ORDER BY CategoryId, CategoryName
'
exec sp_executesql @query
Some clarification:
The @columnlist variable contains the dynamic field list, built from the Positions table
The cross join creates a list of all categories and all positions
The left join seeks the corresponding COrder
max() selects the highest COrder per category+position, if there is more than one
PIVOT() turns the various PositionNames into separate columns
P.S. My table names begin with #, because I created them as temporary tables. Remove the # to refer to a permanent table.
P.S.2. If anyone wants to try their hand at this, here is a script to create the tables in this question:
set nocount on
if object_id('tempdb..#Category') is not null drop table #Category
create table #Category (
CategoryId int identity,
CategoryName varchar(50)
)
insert into #Category (CategoryName) values ('Home')
insert into #Category (CategoryName) values ('Contact')
insert into #Category (CategoryName) values ('About')
--insert into #Category (CategoryName) values ('News')
if object_id('tempdb..#Position') is not null drop table #Position
create table #Position (
PositionID int identity,
PositionName varchar(50)
)
insert into #Position (PositionName) values ('Main menu')
insert into #Position (PositionName) values ('Left menu')
insert into #Position (PositionName) values ('Right menu')
--insert into #Position (PositionName) values ('Bottom menu')
if object_id('tempdb..#CategoryPosition') is not null
drop table #CategoryPosition
create table #CategoryPosition (
CPID int identity,
CID int,
PID int,
COrder int
)
insert into #CategoryPosition (CID, PID, COrder) values (1,1,1)
insert into #CategoryPosition (CID, PID, COrder) values (1,2,2)
insert into #CategoryPosition (CID, PID, COrder) values (1,3,3)
insert into #CategoryPosition (CID, PID, COrder) values (2,1,4)
insert into #CategoryPosition (CID, PID, COrder) values (2,3,5)
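For what it's worth, the same dynamic-column-list idea works in engines without PIVOT by generating one MAX(CASE ...) expression per Position row instead. A sketch in Python + SQLite using the question's sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (CategoryId INTEGER PRIMARY KEY, CategoryName TEXT);
INSERT INTO Category (CategoryName) VALUES ('Home'), ('Contact'), ('About');
CREATE TABLE Position (PositionID INTEGER PRIMARY KEY, PositionName TEXT);
INSERT INTO Position (PositionName) VALUES ('Main menu'), ('Left menu'), ('Right menu');
CREATE TABLE CategoryPosition (CPID INTEGER PRIMARY KEY, CID INT, PID INT, COrder INT);
INSERT INTO CategoryPosition (CID, PID, COrder) VALUES
 (1,1,1), (1,2,2), (1,3,3), (2,1,4), (2,3,5);
""")

# One MAX(CASE ...) column per Position row -- new positions appear automatically.
positions = conn.execute(
    "SELECT PositionID, PositionName FROM Position").fetchall()
cols = ", ".join(
    f'MAX(CASE WHEN cp.PID = {pid} THEN cp.COrder ELSE 0 END) AS "{name}"'
    for pid, name in positions)
query = f"""
SELECT c.CategoryId, c.CategoryName, {cols}
FROM Category c
LEFT JOIN CategoryPosition cp ON cp.CID = c.CategoryId
GROUP BY c.CategoryId, c.CategoryName
ORDER BY c.CategoryId"""

for row in conn.execute(query):
    print(row)
# (1, 'Home', 1, 2, 3)
# (2, 'Contact', 4, 0, 5)
# (3, 'About', 0, 0, 0)
```

The LEFT JOIN plus ELSE 0 plays the role of the cross join + IsNull(COrder,0) in the T-SQL version: categories with no positions still produce a row of zeroes.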
Since PIVOT requires a static list of columns, I think a dynamic-sql-based approach is really all that you can do: http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx
As mentioned by several posters, dynamic SQL using the PIVOT command is the way to go. I wrote a stored proc named pivot_query.sql a while back that has been very handy for this purpose. It works like this:
-- Define a query of the raw data and put it in a variable (no pre-grouping required)
declare @myQuery varchar(MAX);
set @myQuery = '
select
    cp.cid,
    c.CategoryName,
    p.PositionName,
    cp.COrder
from
    CategoryPosition cp
    JOIN Category c
        on (c.CategoryId = cp.cid)
    JOIN Position p
        on (p.PositionId = cp.pid)';

-- Call the proc, passing the query, row fields, pivot column and summary function
exec dbo.pivot_query @myQuery, 'CategoryName', 'PositionName', 'max(COrder) COrder'
The full syntax of the pivot_query call is:
pivot_query '<query>', '<field list for each row>', '<pivot column>', '<aggregate expression list>', '[<results table>]', '[<show query>]'
it is explained more in the comments at the top of the source code.
A couple of advantages of this proc are that you can specify multiple summary functions like max(COrder),min(COrder) etc. and it has the option to store the output in a table in case you want to join the summary data up with other information.
I guess you need to select using PIVOT. By default, pivots only select a static list of columns. There are some solutions on the net dealing with dynamic column pivots, such as here and here.
My suggestion would be to return your data as a simple join and let the front end sort it out. There are some things for which SQL is excellent, but this particular problem seems like something that the front end should be doing. Of course, I can't know that without knowing your full situation, but that's my hunch.