Using EXCEPT clause in PostgreSQL - sql

I am trying to use the EXCEPT clause to retrieve data from a table. I want to get all the rows from table1 except the ones that exist in table2.
As far as I understand, the following would not work:
CREATE TABLE table1(pk_id int, fk_id_tbl2 int);
CREATE TABLE table2(pk_id int);
Select fk_id_tbl2
FROM table1
Except
Select pk_id
FROM table2
The only way I can use EXCEPT seems to be to select from the same tables or select columns that have the same column name from different tables.
Can someone please explain how best to use the EXCEPT clause?

Your query seems perfectly valid:
SELECT fk_id_tbl2 AS some_name
FROM table1
EXCEPT -- you may want to use EXCEPT ALL
SELECT pk_id
FROM table2;
Column names are irrelevant to the query. Only data types must match. The output column name of your query is fk_id_tbl2, just because it's the column name in the first SELECT. You can use any alias.
What's often overlooked: the subtle differences between EXCEPT (which folds duplicates) and EXCEPT ALL - which keeps all individual unmatched rows.
More explanation and other ways to do the same, some of them much more flexible:
Select rows which are not present in other table
Details for EXCEPT in the manual.
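The two points above (column names do not have to match, and plain EXCEPT folds duplicates) can be checked with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in for PostgreSQL, which is an assumption of this example; SQLite supports EXCEPT but not EXCEPT ALL, and the sample rows are made up.

```python
import sqlite3

# SQLite stands in for PostgreSQL here (assumption); sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1(pk_id int, fk_id_tbl2 int);
    CREATE TABLE table2(pk_id int);
    INSERT INTO table1 VALUES (1, 10), (2, 10), (3, 20), (4, 30);
    INSERT INTO table2 VALUES (20);
""")

# The two SELECTs use different column names; only the types need to line up.
cur = conn.execute("""
    SELECT fk_id_tbl2 AS some_name FROM table1
    EXCEPT
    SELECT pk_id FROM table2
""")

# The output column takes its name from the first SELECT (here, the alias).
output_column = cur.description[0][0]
# Sort in Python so the result does not depend on engine ordering.
rows = sorted(r[0] for r in cur.fetchall())
print(output_column, rows)
```

The value 10 occurs twice in table1 but appears once in the result, because EXCEPT removes duplicates.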

Related

SQL - Two DISTINCTs performing very poorly

I've got two tables containing a column with the same name. I try to find out which distinct values exist in Table2 but don't exist in Table1. For that I have two SELECTs:
SELECT DISTINCT Field
FROM Table1
SELECT DISTINCT Field
FROM Table2
Both SELECTs finish within 2 seconds and return about 10 rows each. If I restructure my query to find out which values are missing in Table1, the query takes several minutes to finish:
SELECT DISTINCT Field
FROM Table1
WHERE Field NOT IN
(
SELECT DISTINCT Field
FROM Table2
)
My temporary workaround is inserting the results of the second DISTINCT into a temporary table and comparing against it. But the performance still isn't great.
Does anyone know why this happens? I guess because SQL-Server keeps recalculating the second DISTINCT but why would it? Shouldn't SQL-Server optimize this somehow?
Not sure if this will improve performance, but I'd use EXCEPT:
SELECT Field
FROM Table1
EXCEPT
SELECT Field
FROM Table2
There is no need to use DISTINCT because EXCEPT is a set operator that removes duplicates.
EXCEPT returns distinct rows from the left input query that aren’t
output by the right input query.
The number and the order of the columns must be the same in all queries.
The data types must be compatible.
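As a rough illustration of the answer above, the sketch below runs both forms against invented sample data and compares the results. SQLite (via Python's sqlite3) is used as a stand-in for SQL Server, which is an assumption; it says nothing about SQL Server's query plans or timings. The sample data contains no NULLs, and the comparison is only claimed for that case.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1(Field TEXT);
    CREATE TABLE Table2(Field TEXT);
    INSERT INTO Table1 VALUES ('a'), ('a'), ('b'), ('c'), ('c');
    INSERT INTO Table2 VALUES ('b'), ('b'), ('d');
""")

# Original form: DISTINCT plus NOT IN.
not_in_rows = sorted(r[0] for r in conn.execute("""
    SELECT DISTINCT Field FROM Table1
    WHERE Field NOT IN (SELECT DISTINCT Field FROM Table2)
"""))

# Suggested form: EXCEPT, with no DISTINCT needed on either side.
except_rows = sorted(r[0] for r in conn.execute("""
    SELECT Field FROM Table1
    EXCEPT
    SELECT Field FROM Table2
"""))

print(not_in_rows, except_rows)
```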

DB2/SQL equivalent of SAS's sum(of ) function

SAS has a sum(of col1 - coln ) function which finds the sum of all the values from col1, col2, col3...coln. (ie, you don't have to list out all the column names, as long as they are numbered consecutively). This is a handy shortcut to find a sum of several (suitably named) variables.
Question - Is there a DB2/SQL equivalent of this? I have 50 columns (they are named col1, col2, col3....col50) and I need to find the sum of them.
ie:
select sum(col1, col2, col3,....,col50) AggregateSum
from foo.table
No, DB2 has no such beast, at least to my knowledge. However, you can dynamically create such a query by first querying the database metadata to extract the columns for a given table.
From memory, DB2 has a sysibm.syscolumns table which basically contains the column information that you could use to construct a query on the fly.
You would first use a query like:
select column from sysibm.syscolumns
where schema = 'foo' and tablename = 'table'
and column like 'col%'
(the column names may not match exactly but, since they're not the same on the differing variants of DB2 (DB2/z, DB2/LUW, iSeries DB2, etc) anyway, that hardly matters).
Then use the results of that query to construct your actual query:
select col1+col2+...+colN AggregateSum from foo.table
where the col1+col2+...+colN bit has been built from the previous query.
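The two-step approach can be sketched as follows. This is a minimal illustration, not DB2 code: it assumes SQLite (through Python's sqlite3), where PRAGMA table_info plays the role of sysibm.syscolumns, and the table name foo_table and its rows are hypothetical.

```python
import sqlite3

# SQLite stands in for DB2 (assumption); foo_table and its data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo_table(id INTEGER, col1 INTEGER, col2 INTEGER, "
    "col3 INTEGER, note TEXT)"
)
conn.execute("INSERT INTO foo_table VALUES (1, 10, 20, 30, 'x')")
conn.execute("INSERT INTO foo_table VALUES (2, 1, 2, 3, 'y')")

# Step 1: read the column names from the catalog.
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [
    row[1]
    for row in conn.execute("PRAGMA table_info(foo_table)")
    if row[1].startswith("col")
]

# Step 2: build the col1+col2+...+colN expression and run the real query.
sum_expr = " + ".join(columns)
query = f"SELECT {sum_expr} AS AggregateSum FROM foo_table ORDER BY id"
sums = [r[0] for r in conn.execute(query)]
print(query)
print(sums)
```

The names are interpolated into the SQL text, so they should only ever come from the catalog, never from user input.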
If, as you mention in a comment, you only want the eighteen "highest" columns (e.g., if columns 1 thru 100 exist, you only want 83 thru 100), you can modify the first query to do that, with something like:
select column from sysibm.syscolumns
where schema = 'foo' and tablename = 'table'
and column like 'col%'
order by column desc
fetch first 18 rows only
but, in that case, you may want to call the columns col0001, col0145 and so on, or make the sorting able to handle variable width numbers.
Although it may be easier (if you can't change the column names) to get all the columns colNNN, sort them yourself by the numeric (not string) value after the col, and throw away all but the last eighteen when constructing the second query.
Both these options will return only eighteen rows maximum.
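The "sort them yourself by the numeric value" option might look like the sketch below. The helper name last_n_numbered and the generated column list are hypothetical; in practice the list would come from the metadata query. It assumes n is a positive integer.

```python
import re

def last_n_numbered(columns, n, prefix="col"):
    """Return the n columns with the highest numeric suffix, in ascending order."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbered = []
    for name in columns:
        match = pattern.match(name)
        if match:
            # Keep the integer value so sorting is numeric, not lexicographic.
            numbered.append((int(match.group(1)), name))
    numbered.sort()
    return [name for _, name in numbered[-n:]]

# Hypothetical catalog output: col1..col100 plus two unrelated columns.
all_columns = [f"col{i}" for i in range(1, 101)] + ["main_id", "other_data"]

highest = last_n_numbered(all_columns, 18)

# For contrast: a plain string sort puts col99 last, not col100.
string_sorted = sorted(c for c in all_columns if c.startswith("col"))[-18:]

print(highest)
print(string_sorted)
```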
But you may also want to think, in that case, about moving the variable data to another table, if that's possible in your situation. If you ever find yourself maintaining an array within a table, it's usually better to separate that out.
So your main table would then be something like:
main_id primary key
other_data
and your auxiliary table would be akin to:
main_id foreign key to main(main_id)
sequence_num
other_data
primary key (main_id, sequence_num)
That would allow you to have sparse data if needed, and also to add data without having to change the schema of the main table. The query to get the latest eighteen results would be a little more complicated but still a relatively simple join of the two tables.
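A sketch of that layout and of the "latest eighteen" query follows, again assuming SQLite via Python's sqlite3 rather than DB2. The table name aux and the column name value are hypothetical, and LIMIT is SQLite syntax; the DB2 form would use fetch first 18 rows only as shown earlier.

```python
import sqlite3

# SQLite stands in for DB2 (assumption); aux and value are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main(
        main_id    INTEGER PRIMARY KEY,
        other_data TEXT
    );
    CREATE TABLE aux(
        main_id      INTEGER REFERENCES main(main_id),
        sequence_num INTEGER,
        value        INTEGER,
        PRIMARY KEY (main_id, sequence_num)
    );
""")

# One main row with 100 sequenced values (1..100), replacing col1..col100.
conn.execute("INSERT INTO main VALUES (1, 'first')")
conn.executemany(
    "INSERT INTO aux VALUES (1, ?, ?)", [(i, i) for i in range(1, 101)]
)

# Sum the eighteen rows with the highest sequence_num for main_id = 1.
row = conn.execute("""
    SELECT m.main_id, SUM(latest.value) AS AggregateSum
    FROM main AS m
    JOIN (
        SELECT main_id, value
        FROM aux
        WHERE main_id = 1
        ORDER BY sequence_num DESC
        LIMIT 18
    ) AS latest ON latest.main_id = m.main_id
    GROUP BY m.main_id
""").fetchone()
print(row)
```

Adding a 101st value is then an INSERT into aux, with no change to the schema of main.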

redshift select distinct returns repeated values

I have a database where each object property is stored in a separate row. The attached query does not return distinct values in a redshift database but works as expected when testing in any mysql compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug; the behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on tables but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The only difference is that when you run a SELECT DISTINCT query against a column that doesn't have a primary key declared, it will scan the whole column and get unique values, whereas if you run the same query on a column that has a primary key constraint, it will just return the output without performing unique list filtering. This is how you can get duplicate entries if you insert them.
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if you don't need to check constraint validity for every row that you copy or insert. If you want, you can declare a primary key constraint as part of your data model, but you will need to uphold it yourself by removing duplicates or by designing your ETL so that none are created.
More information with specific examples in this Heap blog post Redshift Pitfalls And How To Avoid Them
Perhaps you can solve this by using appropriate joins.
For example, I have duplicate values in table1 and I want the values of table1 by joining it to table2; the join logic will depend on your conditions.
So I can do something like this:
select distinct table1.col1 from table1 left outer join table2 on table1.col1 = table2.col1
This worked very well for me: I got unique values from table1 and could remove duplicates.

SELECT COUNT(*) ;

I have a database, database1, with two tables (Table1, Table2) in it.
There are 3 rows in Table1 and 2 rows in Table2. Now if I execute the following SQL query SELECT COUNT(*); on database1, then the output is "1".
Does anyone have any idea what this "1" signifies?
The definition of the two tables is as below.
CREATE TABLE Table1
(
ID INT PRIMARY KEY,
NAME NVARCHAR(20)
)
CREATE TABLE Table2
(
ID INT PRIMARY KEY,
NAME NVARCHAR(20)
)
Normally all selects are of the form SELECT [columns, scalar computations on columns, grouped computations on columns, or scalar computations] FROM [table or joins of tables, etc]
Because this allows plain scalar computations we can do something like SELECT 1 + 1 FROM SomeTable and it will return a recordset with the value 2 for every row in the table SomeTable.
Now, if we didn't care about any table, but just wanted to do our scalar computation, we might want to do something like SELECT 1 + 1. This isn't allowed by the standard, but it is useful and most databases allow it (Oracle doesn't unless it's changed recently, at least it used to not).
Hence such bare SELECTs are treated as if they had a from clause which specified a table with one row and no column (impossible of course, but it does the trick). Hence SELECT 1 + 1 becomes SELECT 1 + 1 FROM ImaginaryTableWithOneRow which returns a single row with a single column with the value 2.
Mostly we don't think about this, we just get used to the fact that bare SELECTs give results and don't even think about the fact that there must be some one-row thing selected to return one row.
In doing SELECT COUNT(*) you did the equivalent of SELECT COUNT(*) FROM ImaginaryTableWithOneRow which of course returns 1.
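The reasoning above can be tried out with a short sketch. It uses SQLite through Python's sqlite3 as a stand-in for SQL Server, which is an assumption of this example; SQLite also treats a SELECT without FROM as reading from a single implicit row. The three inserted rows are invented.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1(ID INT PRIMARY KEY, NAME NVARCHAR(20))")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

# A scalar computation against a real table: one result per row of the table.
per_row = conn.execute("SELECT 1 + 1 FROM Table1").fetchall()

# A bare SELECT behaves as if it read from an imaginary one-row table.
bare_scalar = conn.execute("SELECT 1 + 1").fetchall()

# Counting that imaginary one-row table gives 1, whatever the real tables hold.
bare_count = conn.execute("SELECT COUNT(*)").fetchone()[0]

# Counting an actual table gives its row count.
table_count = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]

print(per_row, bare_scalar, bare_count, table_count)
```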
Along similar lines the following also returns a result.
SELECT 'test'
WHERE EXISTS (SELECT *)
The explanation for that behavior (from this Connect item) also applies to your question.
In ANSI SQL, a SELECT statement without FROM clause is not permitted -
you need to specify a table source. So the statement "SELECT 'test'
WHERE EXISTS(SELECT *)" should give syntax error. This is the correct
behavior.
With respect to the SQL Server implementation, the FROM
clause is optional and it has always worked this way. So you can do
"SELECT 1" or "SELECT @v" and so on without requiring a table. In
other database systems, there is a dummy table called "DUAL" with one
row that is used to do such SELECT statements like "SELECT 1 FROM
dual;" or "SELECT @v FROM dual;". Now, coming to the EXISTS clause -
the project list doesn't matter in terms of the syntax or result of
the query and SELECT * is valid in a sub-query. Couple this with the
fact that we allow SELECT without FROM, you get the behavior that you
see. We could fix it but there is not much value in doing it and it
might break existing application code.
It's because you have executed select count(*) without specifying a table.
The count function returns the number of rows in the specified dataset. If you don't specify a table to select from, a single select will only ever return a single row - therefore count(*) will return 1. (In some versions of SQL, such as Oracle, you have to specify a table or similar database object; Oracle includes a dummy table (called DUAL) which can be selected from when no specific table is required.)
You wouldn't normally execute a select count(*) without specifying a table to query against. Your database server is probably giving you a count of "1" based on a default system table it is querying.
Try using
select count(*) from Table1
Without a table name it makes no sense.
Without a table name it always returns 1, whichever database you run it in.
Since this is tagged SQL Server, MSDN states:
COUNT always returns an int data type value.
Also,
COUNT(*) returns the number of items in a group. This includes NULL
values and duplicates.
Thus, since you didn't provide a table to do a COUNT from, the default (assumption) is that it returns a 1.
The COUNT function returns the number of rows as its result. If you don't specify any table, it returns 1 by default, i.e. COUNT(*), COUNT(1), COUNT(2), ... will always return 1.
Select *
without a from clause is "Select ALL from the Universe" since you have filtered out nothing.
In your case, you are asking "How many universes?"
This is exactly how I would teach it. I would write on the board on the first day,
Select * and ask what it means. Answer: Give me the world.
And from there I would teach how to filter the universe down to something meaningful.
I must admit, I never thought of Select Count(*), which would make it more interesting but still brings back a true answer. We have only one world.
Without consulting Stephen Hawking, SQL will have to contend with only 1.
The result of the query is correct.

Optimize query that compares two tables with similar schema in different databases

I have two different tables with similar schemas in different databases. What is the best way to compare records between these two tables? I need to find
records that exist in the first table whose corresponding record does not exist in the second table, filtering records from the first table with some where clauses.
So far I have come with this SQL construct:
Select t1_col1, t1_col2 from table1
where t1_col1=<condition> AND
t1_col2=<condition> AND
NOT EXISTS
(SELECT * FROM
table2
WHERE
t1_col1=t2_col1 AND
t1_col2=t2_col2)
Is there a better way to do this?
The above query seems fine, but I suspect it is doing a row-by-row comparison without first evaluating the conditions in the first part of the query, even though those conditions would reduce the result set considerably. Is this happening?
Just use the EXCEPT keyword:
Select t1_col1, t1_col2 from table1
where t1_col1=<condition> AND
t1_col2=<condition>
except
SELECT t2_col1, t2_col2 FROM table2
It returns any distinct values from the query to the left of the EXCEPT operator that are not also returned from the right query.
For more information on MSDN
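One practical difference between the two forms is worth seeing on data: EXCEPT returns distinct rows, while NOT EXISTS keeps duplicates from the left table. The sketch below assumes SQLite via Python's sqlite3 as a stand-in for SQL Server, puts both tables in one database for brevity, and uses invented rows and an invented filter (t1_col1 = 1).

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); rows and filter are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1(t1_col1 INT, t1_col2 TEXT);
    CREATE TABLE table2(t2_col1 INT, t2_col2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (1, 'a'), (1, 'b'), (2, 'a');
    INSERT INTO table2 VALUES (1, 'b'), (3, 'c');
""")

# NOT EXISTS: every matching row of table1 is kept, duplicates included.
not_exists_rows = conn.execute("""
    SELECT t1_col1, t1_col2 FROM table1
    WHERE t1_col1 = 1
      AND NOT EXISTS (
          SELECT * FROM table2
          WHERE t1_col1 = t2_col1 AND t1_col2 = t2_col2
      )
""").fetchall()

# EXCEPT: the same filter, but the result is reduced to distinct rows.
except_rows = conn.execute("""
    SELECT t1_col1, t1_col2 FROM table1
    WHERE t1_col1 = 1
    EXCEPT
    SELECT t2_col1, t2_col2 FROM table2
""").fetchall()

print(not_exists_rows)
print(except_rows)
```

If duplicate rows in the first table matter, NOT EXISTS is the form that preserves them.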
If the data in both tables is expected to have the same primary key, you can use the IN keyword to filter out the rows that are not found in the other table. This could be the simplest way.
If you are open to third-party tools like Redgate Data Compare, you can try it; it's a very nice tool. Visual Studio 2010 Ultimate edition also has this feature.