Select distinct values from multiple columns in same table - sql

I am trying to construct a single SQL statement that returns unique, non-null values from multiple columns all located in the same table.
SELECT distinct tbl_data.code_1 FROM tbl_data
WHERE tbl_data.code_1 is not null
UNION
SELECT tbl_data.code_2 FROM tbl_data
WHERE tbl_data.code_2 is not null;
For example, tbl_data is as follows:
id code_1 code_2
--- -------- ----------
1 AB BC
2 BC
3 DE EF
4 BC
For the above table, the SQL query should return all unique non-null values from the two columns, namely: AB, BC, DE, EF.
I'm fairly new to SQL. My statement above works, but is there a cleaner way to write this SQL statement, since the columns are from the same table?

It's better to include code in your question, rather than ambiguous text data, so that we are all working with the same data. Here is the sample schema and data I have assumed:
CREATE TABLE tbl_data (
id INT NOT NULL,
code_1 CHAR(2),
code_2 CHAR(2)
);
INSERT INTO tbl_data (
id,
code_1,
code_2
)
VALUES
(1, 'AB', 'BC'),
(2, 'BC', NULL),
(3, 'DE', 'EF'),
(4, NULL, 'BC');
As Blorgbeard commented, the DISTINCT clause in your solution is unnecessary because the UNION operator eliminates duplicate rows. There is a UNION ALL operator that does not eliminate duplicates, but it is not appropriate here.
Rewriting your query without the DISTINCT clause is a fine solution to this problem:
SELECT code_1
FROM tbl_data
WHERE code_1 IS NOT NULL
UNION
SELECT code_2
FROM tbl_data
WHERE code_2 IS NOT NULL;
It doesn't matter that the two columns are in the same table. The solution would be the same even if the columns were in different tables.
If you don't like the redundancy of specifying the same filter clause twice, you can encapsulate the union query in a virtual table before filtering that:
SELECT code
FROM (
SELECT code_1
FROM tbl_data
UNION
SELECT code_2
FROM tbl_data
) AS DistinctCodes (code)
WHERE code IS NOT NULL;
I find the syntax of the second more ugly, but it is logically neater. But which one performs better?
I created a sqlfiddle that demonstrates that the query optimizer of SQL Server 2005 produces the same execution plan for the two different queries:
If SQL Server generates the same execution plan for two queries, then they are practically as well as logically equivalent.
Compare the above to the execution plan for the query in your question:
The DISTINCT clause makes SQL Server 2005 perform a redundant sort operation, because the query optimizer does not know that any duplicates filtered out by the DISTINCT in the first query would be filtered out by the UNION later anyway.
This query is logically equivalent to the other two, but the redundant operation makes it less efficient. On a large data set, I would expect your query to take longer to return a result set than the two here. Don't take my word for it; experiment in your own environment to be sure!
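If you want a quick check that the three queries really do return the same rows, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. This is an assumption on my part: the answer above was tested on SQL Server 2005, and SQLite says nothing about SQL Server's execution plans. SQLite also does not accept the `AS DistinctCodes (code)` column-list alias, so the column is aliased inside the subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_data (id INT NOT NULL, code_1 CHAR(2), code_2 CHAR(2));
    INSERT INTO tbl_data (id, code_1, code_2) VALUES
        (1, 'AB', 'BC'), (2, 'BC', NULL), (3, 'DE', 'EF'), (4, NULL, 'BC');
""")

queries = {
    # The query from the question (redundant DISTINCT).
    "distinct_plus_union": """
        SELECT DISTINCT code_1 FROM tbl_data WHERE code_1 IS NOT NULL
        UNION
        SELECT code_2 FROM tbl_data WHERE code_2 IS NOT NULL
    """,
    # UNION alone already removes duplicates.
    "union_only": """
        SELECT code_1 FROM tbl_data WHERE code_1 IS NOT NULL
        UNION
        SELECT code_2 FROM tbl_data WHERE code_2 IS NOT NULL
    """,
    # Filter NULLs once, after the union.
    "filter_after_union": """
        SELECT code FROM (
            SELECT code_1 AS code FROM tbl_data
            UNION
            SELECT code_2 FROM tbl_data
        ) AS DistinctCodes
        WHERE code IS NOT NULL
    """,
}

# Sort in Python so the comparison does not depend on row order.
results = {name: sorted(row[0] for row in conn.execute(sql))
           for name, sql in queries.items()}
for name, codes in results.items():
    print(name, codes)
```

All three should print the same four codes; only the work the engine does to get there differs.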

try something like SubQuery:
SELECT derivedtable.NewColumn
FROM
(
SELECT code_1 as NewColumn FROM tbl_data
UNION
SELECT code_2 as NewColumn FROM tbl_data
) derivedtable
WHERE derivedtable.NewColumn IS NOT NULL
The UNION already returns DISTINCT values from the combined query.

UNION applies wherever the rows being combined are compatible in column count and type. It doesn't matter whether the columns come from the same table or from different ones; the result is the same (as one of the answers above already mentions).
Since you don't want duplicates, there is no point in using UNION ALL, and DISTINCT is simply unnecessary because UNION already returns distinct rows.
Creating a view may be the best choice, since a view is a virtual representation of the table. Modifications can then be made neatly on that view:
Create VIEW getData AS
(
SELECT tbl_data.code_1
FROM tbl_data
WHERE tbl_data.code_1 is not null
UNION
SELECT tbl_data.code_2
FROM tbl_data
WHERE tbl_data.code_2 is not null
);

Try this if you have more than two Columns:
CREATE TABLE #temptable (Name1 VARCHAR(25),Name2 VARCHAR(25))
INSERT INTO #temptable(Name1, Name2)
VALUES('JON', 'Harry'), ('JON', 'JON'), ('Sam','harry')
SELECT t.Name1+','+t.Name2 Names INTO #t FROM #temptable AS t
SELECT DISTINCT ss.value FROM #t AS t
CROSS APPLY STRING_SPLIT(T.Names,',') AS ss
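For comparison, the plain UNION approach also scales to more than two columns if the statement is generated rather than typed by hand. A minimal sketch in Python with sqlite3 follows; the `people` table, its `Name1` to `Name3` columns, and the `distinct_values` helper are all hypothetical names made up for this illustration.

```python
import sqlite3

def distinct_values(conn, table, columns):
    """Return the sorted distinct non-NULL values found in any of `columns`."""
    # Identifiers cannot be bound as parameters, so they are quoted inline;
    # only pass trusted table and column names.
    selects = [
        f'SELECT "{col}" AS val FROM "{table}" WHERE "{col}" IS NOT NULL'
        for col in columns
    ]
    # UNION removes duplicates across all the per-column SELECTs.
    sql = " UNION ".join(selects) + " ORDER BY val"
    return [row[0] for row in conn.execute(sql)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (Name1 TEXT, Name2 TEXT, Name3 TEXT);
    INSERT INTO people VALUES
        ('JON', 'Harry', 'Sam'),
        ('JON', 'JON',   NULL),
        ('Sam', 'Ann',   'Harry');
""")

names = distinct_values(conn, "people", ["Name1", "Name2", "Name3"])
print(names)
```

Unlike the string-concatenation route, this does not break when a value itself contains the separator character.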

Related

Performance issues with UNION of large tables

I have seven large tables that can be storing between 100 and 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4...LargeTable7. These tables are mostly static: there are no updates or new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.
All these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers in the format '000', though in two of these tables they are stored as int because of other system requirements.
I have another, much smaller table called Headquarters with the information of each headquarter. This table has very few entries. At most 1000, actually.
Now, I need to create a stored procedure that returns all those headquarters that appear in the large tables but are either absent in the Headquarters table or have been deleted (this table is deleted logically: it has a DeletionDate field to check this).
This is the query I've tried:
CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
DECLARE @headquartersFiles TABLE
(
headquarter int,
countryFile varchar(MAX)
);
SET NOCOUNT ON
INSERT INTO @headquartersFiles
SELECT headquarter, CONCAT(country, ' (', file, ')')
FROM
(
SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
CONVERT(int, country) as country,
file
FROM LargeTable1
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable2
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable3
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable4
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable5
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable6
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable7
) TC
SELECT RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
MAX(s.deletionDate) as deletionDate,
STUFF
(
(SELECT DISTINCT ', ' + st2.countryFile
FROM @headquartersFiles st2
WHERE st2.headquarter = st.headquarter
FOR XML PATH('')),
1,
1,
''
) countryFile
FROM @headquartersFiles as st
LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
OR s.deletionDate IS NOT NULL
GROUP BY st.headquarter
END
This sp's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following total rows for each table (just to give you an idea about the sizes):
LargeTable1: 1516666 rows
LargeTable2: 645740 rows
LargeTable3: 1950121 rows
LargeTable4: 779336 rows
LargeTable5: 1100999 rows
LargeTable6: 16499 rows
LargeTable7: 24454 rows
What can I do to improve performance? I've tried the following, without much difference:
Inserting into the local table by batches, excluding those headquarters I've already inserted and then updating the countryFile field for those that are repeated
Creating a view for that UNION query
Creating indexes for the LargeTables for the headquarter field
I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I would rather not have to modify its module just to keep that table up to date. But if it's the best possible alternative, I'd go for it.
Thanks
Take this filter
LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
OR s.deletionDate IS NOT NULL
And add it to each individual query in the union that inserts into @headquartersFiles.
It might seem like this adds a lot more filters, but it will actually speed things up because you are filtering before the UNION is processed.
Also take out all your DISTINCTs. It probably won't speed things up, but they are redundant because you are doing a UNION and not a UNION ALL.
Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need . . . along with an index:
alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarter_int);
SELECT DISTINCT headquarter, country, file
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
FROM headquarters s
WHERE s.headquarter_int = lt5.headquarter and s.deletiondate is null
);
Then, you want an index on LargeTable5(headquarter, country, file).
This should take less than 5 seconds to run. If so, then construct the full query, being sure that the types in the correlated subquery match and that you have the right index on the full table. Use union to remove duplicates between the tables.
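To make the "filter each table first, then UNION" idea concrete, here is a minimal sketch using Python's sqlite3 with tiny in-memory tables. Everything here is an assumption for illustration: only two of the seven tables are modelled, the sample rows are invented, and SQLite will not reproduce SQL Server's timings. A headquarter counts as valid only when it has a row in Headquarters with no deletion date, so the NOT EXISTS keeps headquarters that are absent or logically deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Headquarters (headquarter INT, deletionDate TEXT);
    CREATE TABLE LargeTable1 (headquarter INT, country INT, file TEXT);
    CREATE TABLE LargeTable2 (headquarter INT, country INT, file TEXT);

    -- 1 is active, 2 is logically deleted, 3 is missing altogether.
    INSERT INTO Headquarters VALUES (1, NULL), (2, '2015-01-01');
    INSERT INTO LargeTable1 VALUES (1, 10, 'a'), (2, 10, 'b'), (3, 20, 'c');
    INSERT INTO LargeTable2 VALUES (2, 10, 'b'), (3, 30, 'd'), (1, 10, 'e');
""")

# Each branch is filtered before the UNION, so the de-duplication step
# only sees rows for absent or deleted headquarters.
FILTERED = """
    SELECT headquarter, country, file
    FROM {table} AS lt
    WHERE NOT EXISTS (SELECT 1
                      FROM Headquarters AS s
                      WHERE s.headquarter = lt.headquarter
                        AND s.deletionDate IS NULL)
"""
sql = " UNION ".join(FILTERED.format(table=t)
                     for t in ("LargeTable1", "LargeTable2"))

rows = sorted(conn.execute(sql).fetchall())
print(rows)
```

The row for headquarter 2 appears in both tables but comes back once, because UNION removes duplicates across the branches.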
I'd try doing the filtering with each individual table first. You just need to account for the fact that a headquarter might appear in one table, but not another. You can do this like so:
SELECT
headquarter
FROM
(
SELECT DISTINCT
headquarter,
'table1' AS large_table
FROM
LargeTable1 LT
LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
WHERE
HQ.headquarter IS NULL OR
HQ.deletion_date IS NOT NULL
UNION ALL
SELECT DISTINCT
headquarter,
'table2' AS large_table
FROM
LargeTable2 LT
LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
WHERE
HQ.headquarter IS NULL OR
HQ.deletion_date IS NOT NULL
UNION ALL
...
) SQ
GROUP BY headquarter
HAVING COUNT(*) = 5
That would make sure that it's missing from all five tables.
Table variables have horrible performance because SQL Server does not generate statistics for them. Try using a temp table instead, and if headquarter + country + file is unique in the temp table, add a unique constraint (which will create a unique index) in the temp table definition. You can create indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.
Edit: as it turns out, you can in fact create indexes on table variables, even non-unique in 2014+.
Secondly, try not to use functions in your joins or where clauses - doing so often causes performance problems.
The real answer is to create separate INSERT statements for each table with the caveat that data to be inserted does not exist in the destination table.

'In' clause in SQL server with multiple columns

I have a component that retrieves data from database based on the keys provided.
However, I want my Java application to get all the data for all keys in a single database hit to speed things up.
I can use 'in' clause when I have only one key.
While working on more than one key I can use below query in oracle
SELECT * FROM <table_name>
where (value_type,CODE1) IN (('I','COMM'),('I','CORE'));
which is similar to writing
SELECT * FROM <table_name>
where value_type = 1 and CODE1 = 'COMM'
and
SELECT * FROM <table_name>
where value_type = 1 and CODE1 = 'CORE'
together
However, using the 'in' clause as above gives the following error in SQL Server:
ERROR: An expression of non-boolean type specified in a context where a condition is expected, near ','.
Please let me know if there is any way to achieve the same in SQL Server.
This syntax doesn't exist in SQL Server. Use a combination of And and Or.
SELECT *
FROM <table_name>
WHERE
(value_type = 1 and CODE1 = 'COMM')
OR (value_type = 1 and CODE1 = 'CORE')
(In this case, you could make it shorter, because value_type is compared to the same value in both combinations. I just wanted to show the pattern that works like IN in oracle with multiple fields.)
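Since the key pairs come from an application, the OR pattern is usually generated with bound parameters rather than typed out. A minimal sketch, written in Python with sqlite3 rather than Java purely to keep it short and runnable; the `codes` table and the `fetch_by_key_pairs` helper are hypothetical names for this example.

```python
import sqlite3

def fetch_by_key_pairs(conn, pairs):
    """Fetch rows matching any (value_type, code1) pair in a single query."""
    if not pairs:
        return []
    # One "(value_type = ? AND code1 = ?)" group per pair, joined with OR.
    clause = " OR ".join("(value_type = ? AND code1 = ?)" for _ in pairs)
    # Flatten [(a, b), (c, d)] into [a, b, c, d] to match the placeholders.
    params = [value for pair in pairs for value in pair]
    sql = f"SELECT id, value_type, code1 FROM codes WHERE {clause} ORDER BY id"
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE codes (id INT, value_type TEXT, code1 TEXT);
    INSERT INTO codes VALUES
        (1, 'I', 'COMM'), (2, 'I', 'CORE'), (3, 'X', 'COMM'), (4, 'I', 'MISC');
""")

matches = fetch_by_key_pairs(conn, [("I", "COMM"), ("I", "CORE")])
print(matches)
```

Most drivers cap the number of bound parameters per statement, so for very long key lists a temporary table is likely the better fit.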
When using IN with a subquery, you need to rephrase it like this:
Oracle:
SELECT *
FROM foo
WHERE
(value_type, CODE1) IN (
SELECT type, code
FROM bar
WHERE <some conditions>)
SQL Server:
SELECT *
FROM foo
WHERE
EXISTS (
SELECT *
FROM bar
WHERE <some conditions>
AND foo.value_type = bar.type
AND foo.CODE1 = bar.code)
There are other ways to do it, depending on the case, like inner joins and the like.
If you have under 1000 tuples you want to check against and you're using SQL Server 2008+, you can use a table value constructor, and perform a join against it. You can only specify up to 1000 rows in a table value constructor, hence the 1000 tuple limitation. Here's how it would look in your situation:
SELECT <table_name>.* FROM <table_name>
JOIN ( VALUES
('I', 'COMM'),
('I', 'CORE')
) AS MyTable(a, b) ON a = value_type AND b = CODE1;
This is only a good idea if your list of values is going to be unique, otherwise you'll get duplicate values. I'm not sure how the performance of this compares to using many ANDs and ORs, but the SQL query is at least much cleaner to look at, in my opinion.
You can also write this to use EXISTS instead of JOIN. That may have different performance characteristics, and it avoids the problem of producing duplicate results if your values aren't unique. It may be worth trying both EXISTS and JOIN on your use case to see which is a better fit. Here's how EXISTS would look:
SELECT * FROM <table_name>
WHERE EXISTS (
SELECT 1
FROM (
VALUES
('I', 'COMM'),
('I', 'CORE')
) AS MyTable(a, b)
WHERE a = value_type AND b = CODE1
);
In conclusion, I think the best choice is to create a temporary table and query against that. But sometimes that's not possible, e.g. your user lacks the permission to create temporary tables, and then using a table value constructor may be your best choice. Use EXISTS or JOIN, depending on which gives you better performance on your database.
Normally you can not do it, but can use the following technique.
SELECT * FROM <table_name>
where (value_type+'/'+CODE1) IN (('I'+'/'+'COMM'),('I'+'/'+'CORE'));
A better solution is to avoid hardcoding your values and put them in a temporary or persistent table:
CREATE TABLE #t (ValueType VARCHAR(16), Code VARCHAR(16))
INSERT INTO #t VALUES ('I','COMM'),('I','CORE')
SELECT DT.*
FROM <table_name> DT
JOIN #t T ON T.ValueType = DT.value_type AND T.Code = DT.CODE1
This way you avoid storing data in your code (with the persistent table version) and can easily modify the filters without changing the code.
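From the application side, the temporary-table approach might look like the sketch below. It uses Python's sqlite3 as a stand-in for the real driver, and the `codes` and `wanted_keys` names are hypothetical. `INSERT OR IGNORE` is SQLite-specific and is used here only so that a repeated key pair cannot duplicate rows in the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE codes (id INT, value_type TEXT, code1 TEXT);
    INSERT INTO codes VALUES
        (1, 'I', 'COMM'), (2, 'I', 'CORE'), (3, 'X', 'COMM'), (4, 'I', 'MISC');
""")

# Key pairs as the application would supply them; note the repeated pair.
wanted = [("I", "COMM"), ("I", "CORE"), ("I", "COMM")]

# The UNIQUE constraint plus INSERT OR IGNORE keeps each pair only once.
conn.execute("""
    CREATE TEMP TABLE wanted_keys (
        value_type TEXT,
        code       TEXT,
        UNIQUE (value_type, code)
    )
""")
conn.executemany("INSERT OR IGNORE INTO wanted_keys VALUES (?, ?)", wanted)

rows = conn.execute("""
    SELECT c.id, c.value_type, c.code1
    FROM codes AS c
    JOIN wanted_keys AS k
      ON k.value_type = c.value_type AND k.code = c.code1
    ORDER BY c.id
""").fetchall()
print(rows)
```

The query text never changes with the number of keys, which is the main appeal over a generated OR list.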
I think you can try this, combine and and or at the same time.
SELECT
*
FROM
<table_name>
WHERE
value_type = 1
AND (CODE1 = 'COMM' OR CODE1 = 'CORE')
What you can do is 'join' the columns as a string, and pass your values also combined as strings.
where (cast(column1 as text) ||','|| cast(column2 as text)) in (?1)
The other way is to do multiple ands and ors.
I had a similar problem in MS SQL, but a little different. Maybe it will help somebody in the future. In my case I found this solution (not the full code, just an example):
SELECT Table1.Campaign
,Table1.Coupon
FROM [CRM].[dbo].[Coupons] AS Table1
INNER JOIN [CRM].[dbo].[Coupons] AS Table2 ON Table1.Campaign = Table2.Campaign AND Table1.Coupon = Table2.Coupon
WHERE Table1.Coupon IN ('0000000001', '0000000002') AND Table2.Campaign IN ('XXX000000001', 'XYX000000001')
Of course, I have indexes on Coupon and Campaign in the table for fast searching.
Compute it in MS Sql
SELECT * FROM <table_name>
where value_type + '|' + CODE1 IN ('I|COMM', 'I|CORE');

Query to write extra rows in Excel output

I'm trying to accomplish something that seems like it should be straightforward in MS Excel. I want to use a single SQL query - so I can pass it on to others to copy and paste - though I know the following could be achieved with other methods as well. Sheet 1 looks like this:
ID value value_type
1 minneapolis city_name
2 cincinnati city_name
I want an SQL query to return an "exploded" version of those two rows:
ID attr_name attr_value
1 value minneapolis
1 value_type city_name
2 value cincinnati
2 value_type city_name
There's much more I need to do, but this concept gets at the heart of the issue. I've tried a single SELECT statement, but can't seem to make it create two rows from one, and when I tried using UNION ALL I got a syntax error.
In Microsoft Query, how can I construct an SQL statement to create two rows from the existing values in one row?
UPDATE
Thanks for the help so far. First, for reference, here is the default statement that recreates the table in Microsoft Query:
SELECT
`Sheet3$`.ID,
`Sheet3$`.name,
`Sheet3$`.name_type
FROM `path\testconvert.xlsx`.`Sheet3$` `Sheet3$`
So, following @lad2025's lead, I have:
SELECT
ID = `Sheet3$`.ID
,attr_name = 'value'
,attr_value = `Sheet3$`.value
FROM `path\testconvert.xlsx`.`Sheet3$` `Sheet3$`
UNION ALL
SELECT
ID = `Sheet3$`.ID
,attr_name = 'value_type'
,attr_value = `Sheet3$`.value_type
FROM `path\testconvert.xlsx`.`Sheet3$` `Sheet3$`
And the result is this error: Too few parameters. Expected 4.
LiveDemo
CREATE TABLE #mytable(
ID INTEGER NOT NULL PRIMARY KEY
,value VARCHAR(11) NOT NULL
,value_type VARCHAR(9) NOT NULL
);
INSERT INTO #mytable(ID,value,value_type) VALUES (1,'minneapolis','city_name');
INSERT INTO #mytable(ID,value,value_type) VALUES (2,'cincinnati','city_name');
SELECT
ID
,[attr_name] = 'value'
,[attr_value] = value
FROM #mytable
UNION ALL
SELECT
ID
,[attr_name] = 'value_type'
,[attr_value] = value_type
FROM #mytable
ORDER BY id;
Ok, after going back to the original statement and working up from there as per the suggestions from @lad2025, I've come up with this statement which achieves what I was looking for in my original question:
SELECT
ID,
'name' AS [attr_name],
name AS [attr_value]
FROM `path\testconvert.xlsx`.`Sheet3$` `Sheet3$`
UNION ALL
SELECT
ID,
'name_type',
name_type
FROM `path\testconvert.xlsx`.`Sheet3$` `Sheet3$`
ORDER BY ID;
One of the main problems is that the new column names are only defined in the first SELECT statement. Also, brackets are ok, just not how @lad2025 was using them originally.
Microsoft Query is pretty finicky.
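The same UNION ALL pattern can be tried outside Microsoft Query to see the "exploded" shape. A minimal sketch with Python's sqlite3 follows; the `sheet` table stands in for the worksheet and is an assumption of this example, as is the use of SQLite rather than the Excel driver. As noted above, the output column names are taken from the first SELECT only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sheet (ID INTEGER, value TEXT, value_type TEXT);
    INSERT INTO sheet VALUES
        (1, 'minneapolis', 'city_name'),
        (2, 'cincinnati',  'city_name');
""")

cur = conn.execute("""
    SELECT ID, 'value' AS attr_name, value AS attr_value FROM sheet
    UNION ALL
    SELECT ID, 'value_type', value_type FROM sheet
    ORDER BY ID, attr_name
""")
# Column names come from the aliases in the first SELECT.
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(columns)
for row in rows:
    print(row)
```

Each source row yields one output row per unpivoted column, so two rows and two columns give four rows.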

Create temporary table with fixed values

How do I create a temporary table in PostgreSQL that has one column "AC" and consists of these 4-character values:
Zoom
Inci
Fend
In reality the table has more values; these should just serve as an example.
If you only need the temp table for one SQL query, then you can hard-code the data into a Common Table Expression as follows :
WITH temp_table AS
(
SELECT 'Zoom' AS AC UNION
SELECT 'Inci' UNION
SELECT 'Fend'
)
SELECT * FROM temp_table
see it work at http://sqlfiddle.com/#!15/f88ac/2
(that CTE syntax also works with MS SQL)
HTH
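A variant of the same idea replaces the chain of SELECT ... UNION with a VALUES list inside the CTE, which is shorter when there are many values. The sketch below runs it through Python's sqlite3 as an assumption for testability; the `WITH name (column) AS (VALUES ...)` form is also valid PostgreSQL. An ORDER BY is added because the row order is otherwise not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The CTE names its single column AC and takes its rows from a VALUES list.
rows = conn.execute("""
    WITH temp_table (AC) AS (
        VALUES ('Zoom'), ('Inci'), ('Fend')
    )
    SELECT AC FROM temp_table ORDER BY AC
""").fetchall()

values = [row[0] for row in rows]
print(values)
```

Note that VALUES keeps duplicate entries, whereas the UNION version silently removes them.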

SQL Query to return rows even if it is not present in the table

This is a specific problem.
I have an excel sheet containing data. Similar data is present in a relational database table. Some rows may be absent or some additional rows may be present. The goal is to verify the data in the excel sheet with the data in the table.
I have the following query
Select e_no, start_dt,end_dt
From MY_TABLE
Where e_no In
(20231, 457)
In this case, e_no 457 is not present in the database (and hence not returned). But I want my query to return a row even if it is not present: (457, null, null). How do I do that?
For Sql-Server: Use a temporary table or table type variable and left join MY_TABLE with it
Sql-Server fiddle demo
Declare @Temp Table (e_no int)
Insert into @Temp
Values (20231), (457)
Select t.e_no, m.start_dt, m.end_dt
From @Temp t left join MY_TABLE m on t.e_no = m.e_no
If the values you are passing in are a CSV list, then use a split function to get the values inserted into @Temp.
Why not simply populate a temporary table in the database from your spreadsheet and join against that? Any other solution is probably going to be both more work and more difficult to maintain.
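A minimal sketch of that temporary-table approach, using Python's sqlite3 so it can be run as-is. The sample rows in MY_TABLE, the `sheet_keys` table name, and the `keys_from_sheet` list are assumptions for illustration; in practice the keys would be read from the spreadsheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MY_TABLE (e_no INT, start_dt TEXT, end_dt TEXT);
    INSERT INTO MY_TABLE VALUES
        (20231, '2013-01-01', '2013-12-31'),
        (999,   '2012-01-01', NULL);
""")

# Keys as they would come out of the spreadsheet (hypothetical values).
keys_from_sheet = [20231, 457]

conn.execute("CREATE TEMP TABLE sheet_keys (e_no INT)")
conn.executemany("INSERT INTO sheet_keys VALUES (?)",
                 [(k,) for k in keys_from_sheet])

# Driving the LEFT JOIN from the key list keeps keys with no match,
# with NULLs in the columns that come from MY_TABLE.
rows = conn.execute("""
    SELECT k.e_no, m.start_dt, m.end_dt
    FROM sheet_keys AS k
    LEFT JOIN MY_TABLE AS m ON m.e_no = k.e_no
    ORDER BY k.e_no
""").fetchall()
print(rows)
```

Rows that exist only in MY_TABLE (999 here) are not returned; checking in that direction would need the join reversed or a second query.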
You can also do it this way with a UNION
Select
e_no, start_dt ,end_dt
From MY_TABLE
Where e_no In (20231, 457)
UNION
Select 457, null, null