SQL Distinct keyword bogs down performance?

I have received a SQL query that makes use of the distinct keyword. When I tried running the query it took at least a minute to join two tables with hundreds of thousands of records and actually return something.
I then took out the DISTINCT and it came back in 0.2 seconds. Does the DISTINCT keyword really make things that bad?
Here's the query:
SELECT DISTINCT c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
FROM management.contacts c
JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY/MM/DD')

Yes, because DISTINCT will (sometimes, according to a comment) cause the results to be sorted, and sorting hundreds of thousands of records takes time.
Try GROUP BY on all your columns; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle I noticed a significant performance gain).
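To illustrate that the GROUP BY rewrite returns the same rows as DISTINCT, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Oracle, and the tables and rows are made up for the example. It only shows that the results match; whether GROUP BY is actually faster depends on your optimizer and has to be measured on the real database.

```python
import sqlite3

# SQLite stands in for Oracle here; the schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (custID INTEGER, username TEXT);
    CREATE TABLE orders (orderno INTEGER, custID INTEGER, totalcredits REAL);
    -- The duplicated contact row makes the join produce duplicate rows.
    INSERT INTO contacts VALUES (1, 'alice'), (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (100, 1, 5.0), (101, 2, 7.5);
""")

distinct_sql = """
    SELECT DISTINCT c.username, o.orderno, o.totalcredits
    FROM contacts c JOIN orders o ON c.custID = o.custID
"""
group_by_sql = """
    SELECT c.username, o.orderno, o.totalcredits
    FROM contacts c JOIN orders o ON c.custID = o.custID
    GROUP BY c.username, o.orderno, o.totalcredits
"""

# Sort in Python so the comparison does not depend on the engine's row order.
distinct_rows = sorted(conn.execute(distinct_sql).fetchall())
group_by_rows = sorted(conn.execute(group_by_sql).fetchall())
print(distinct_rows == group_by_rows)
```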

Distinct always sets off alarm bells to me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes there is a large cost to using it.
What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)
Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)
Finally, make sure you have indexed the custID on the orders table (and maybe recDate too).

The purpose of DISTINCT is to prune duplicate records from the result set, comparing all the selected columns.
If any of the selected columns is unique after the join, you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of the selected columns is unique, you can drop DISTINCT.
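One way to check this on your own data is to compare the row count with and without DISTINCT: if the two counts match, DISTINCT is not removing anything for that column list. A rough sketch follows, again using sqlite3 as a stand-in and a made-up schema (count_rows is a hypothetical helper, not part of any library). Keep in mind that matching counts only describe the current data; a key or constraint is what guarantees it.

```python
import sqlite3

# Hypothetical schema: orderno is the primary key of orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (custID INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE orders (orderno INTEGER PRIMARY KEY, custID INTEGER, reason TEXT);
    INSERT INTO contacts VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (100, 1, 'x'), (101, 1, 'x'), (102, 2, 'y');
""")

base = "FROM contacts c JOIN orders o ON c.custID = o.custID"

def count_rows(select_list, distinct):
    """Count the rows the query returns by wrapping it in SELECT COUNT(*)."""
    keyword = "DISTINCT " if distinct else ""
    sql = f"SELECT COUNT(*) FROM (SELECT {keyword}{select_list} {base})"
    return conn.execute(sql).fetchone()[0]

# The select list contains the unique orderno: DISTINCT changes nothing.
with_key = (count_rows("c.username, o.orderno", False),
            count_rows("c.username, o.orderno", True))

# No unique column in the select list: DISTINCT collapses duplicate rows.
without_key = (count_rows("c.username, o.reason", False),
               count_rows("c.username, o.reason", True))

print(with_key, without_key)
```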
Actually, with properly designed databases you rarely need DISTINCT, and in the cases where you do, it is usually obvious that you need it. The RDBMS, however, cannot leave it to chance and must actually build an indexing structure to establish uniqueness.
Normally you find DISTINCT all over the place when people are not sure about JOINs and the relationships between tables.
Also, in classes on pure relational databases, where the result should be a proper set (with no repeating elements, i.e. records), it is quite common for people to stick DISTINCT in to guarantee this property for the sake of theoretical correctness. Sometimes this creeps into production systems.

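You can try a GROUP BY like this:
SELECT c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
FROM management.contacts c,
management.orders o
WHERE c.custID = o.custID
AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
Also verify that you have an index on o.recDate.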


Best practices of Oracle LEFT OUTER JOIN

I am new to SQL. I use SQL Developer (Oracle DB).
When I need to select some data that may have null values, I write one of these selects:
SELECT i.number
WHERE a.parameter = 'aaa' AND a.item_nr = i.number) AS atr_value
SELECT i.number
,a.value as atr_value
left outer join ATTRIBUTES a
on a.parameter = 'aaa'
and a.item_nr = i.number
What is the difference?
What is the first approach called (so I can google it)? Where can I read about it?
Which one should I use going forward (what is the best practice)? Is there maybe a better way to select the same data?
Example of tables:
Your two queries are not exactly the same. If you have multiple matches in the second table, then the first generates an error and the second generates multiple rows.
Which is better? As a general rule, the LEFT JOIN method (the second method) is considered better practice than the correlated subquery (the first method). Oracle has a pretty good optimizer and it offers many ways of optimizing joins. I also think Oracle can use JOIN algorithms for correlated subqueries (not all databases are so smart). And, with the right indexes, the two forms probably have very similar performance.
There are situations where correlated subqueries have better performance than the equivalent JOIN construct. For this example, though, I would expect the performance to be similar.
In case there is never more than one matching row in table attributes, the queries do the same thing. They are just two ways to query the same data. Both queries are fine and straightforward.
In case there can be more than one match, query one (which is using a correlated subquery) would fail. It would be inappropriate for the given task then.
The query with the outer join is easier to extend, when you want a second column from the attributes table.
The first query makes it crystal-clear that you expect zero or one matches in table attributes for each item. In case of data inconsistency, or if you have an error in your query such as a forgotten criterion, it will fail, which is good.
The second query would simply retrieve more rows in case of such error, which may not be desired.
So which query to choose is a matter of personal preference and of how you want the query to deal with inconsistencies.
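The behaviour described in these answers can be sketched as follows. This is only an approximation: it runs on SQLite through Python's sqlite3 module, the items table name and the sample rows are assumptions, and SQLite differs from Oracle in one important way. When the scalar subquery finds more than one match, Oracle raises ORA-01427 (single-row subquery returns more than one row), whereas SQLite silently keeps one of the matching values.

```python
import sqlite3

# SQLite stand-in; "items" and the sample data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (number INTEGER PRIMARY KEY);
    CREATE TABLE attributes (item_nr INTEGER, parameter TEXT, value TEXT);
    INSERT INTO items VALUES (1), (2), (3);
    INSERT INTO attributes VALUES (1, 'aaa', 'red'), (2, 'bbb', 'n/a');
""")

# First approach: correlated scalar subquery in the select list.
subquery_sql = """
    SELECT i.number,
           (SELECT a.value FROM attributes a
            WHERE a.parameter = 'aaa' AND a.item_nr = i.number) AS atr_value
    FROM items i
    ORDER BY i.number
"""
# Second approach: LEFT OUTER JOIN.
join_sql = """
    SELECT i.number, a.value AS atr_value
    FROM items i
    LEFT OUTER JOIN attributes a
      ON a.parameter = 'aaa' AND a.item_nr = i.number
    ORDER BY i.number
"""

# With at most one match per item, both forms return the same rows.
subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()

# Add a second 'aaa' attribute for item 1 to simulate inconsistent data.
conn.execute("INSERT INTO attributes VALUES (1, 'aaa', 'blue')")
join_rows_after = conn.execute(join_sql).fetchall()          # one extra row
subquery_rows_after = conn.execute(subquery_sql).fetchall()  # Oracle would fail here

print(len(join_rows_after), len(subquery_rows_after))
```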

Inefficient JOIN Method?

I'm trying to query two fairly large tables here to pull some results and am having some trouble with efficiency.
Note: I've only included relevant columns to make this not look so messy!
TableA (Stock) has productID, ownerID, and count columns
TableB (Owners) has ID, accountHolderID, and name columns
What I'm trying to do is query TableA and where productID = X pull up Stock.productID, Stock.accountHolderID and Owners.name. The relation between these two tables is Stock.ownerID = Owners.ID so if the WHERE condition pulled say five productIDs then I'd want the name from TableB that matched up to the ownerID from TableA.
The only unique ID in this situation is Owners.ID from TableB
Just doing a basic SELECT query on TableA for those products takes 15 seconds; however, when I add an INNER JOIN to match things up to TableB, the query takes significantly longer, upwards of 10 minutes. I'm guessing I've designed this query inefficiently.
FROM Stock
INNER JOIN Owners
ON Stock.ownerID = Owners.ID
WHERE Stock.productID = 42301679
How can I make this query more efficient?
Would adding ORs to the WHERE condition allow me to pull multiple productIDs at once?
Based on your comment, it looks like you're missing a very critical index on the owners.id field. Now, keep in mind this index will help this query, but you have to take into consideration all of the other queries that run against this table to determine if it is a good idea to add that index.
At 29M rows, having an index on a table that is frequently inserted to may have a noticeable effect on insert times.
This may be a situation where different applications need different indexes - namely your OLTP app and your reporting app (which may just be you running ad hoc queries). A common solution is to have a second server that runs your reporting/data warehouse queries that has indexes properly tuned to this function.
Best of luck.
Your query looks right.
Perhaps we can see the schema?
To pull multiple productIDs at once, you can use the IN operator instead of OR:
FROM Stock
INNER JOIN Owners
ON Stock.ownerID = Owners.ID
WHERE Stock.productID IN (42301679,123232,232324)
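As a quick check that IN and a chain of ORs select the same rows, here is a small sketch using Python's sqlite3 module. The sample rows, the index name idx_stock_product and the two-column select list are assumptions made for the example; only the table and column names come from the question.

```python
import sqlite3

# SQLite stand-in with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Owners (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Stock (productID INTEGER, ownerID INTEGER);
    CREATE INDEX idx_stock_product ON Stock (productID);
    INSERT INTO Owners VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO Stock VALUES (42301679, 1), (123232, 2), (999, 1);
""")

product_ids = [42301679, 123232, 232324]

select_part = """
    SELECT Stock.productID, Owners.name
    FROM Stock
    INNER JOIN Owners ON Stock.ownerID = Owners.ID
"""

# Build "?,?,?" so the ids are passed as bound parameters, not pasted into the SQL.
placeholders = ",".join("?" * len(product_ids))
in_sql = f"{select_part} WHERE Stock.productID IN ({placeholders}) ORDER BY Stock.productID"

or_clause = " OR ".join("Stock.productID = ?" for _ in product_ids)
or_sql = f"{select_part} WHERE {or_clause} ORDER BY Stock.productID"

in_rows = conn.execute(in_sql, product_ids).fetchall()
or_rows = conn.execute(or_sql, product_ids).fetchall()
print(in_rows == or_rows)
```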
If productID is unique in the Stock table, it makes sense to index it, which can greatly improve performance, as others have mentioned.
Another performance gain can come from giving the Owners.name field a specific length. In MySQL, VARCHAR is used for strings of varying length, while a CHAR(32) column means the name will always occupy 32 characters. The unused space is just padded, so you can think of the (32) as a maximum length. The performance advantage comes from the fact that the database then knows exactly how many bytes each row occupies and can use this information to improve lookup time.

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't matter for the results, but does it affect performance?
One can assume that the data in table C is much larger than that in A or B.
For each SQL statement, the engine creates a query plan. So no matter what order you put the tables in, the engine will choose its own path to execute the query.
For more on query plans, see http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the join order and plan using hints, if you feel that the engine does not choose the correct path.
Sometimes the order of the tables makes a difference here (when you are using different join types).
Joins work on the cross-product concept.
If you write a query like A join B join C,
it will be treated as ((A*B)*C).
That means A and B are joined first, and the intermediate result is then joined with C.
So if inner joining A (100 records) and B (200 records) gives 100 records,
those 100 records are then compared with the 1000 records of C.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.
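For the part of the question about results, a small sketch shows that writing the inner joins in a different order returns the same rows. It uses Python's sqlite3 module with made-up data, and the id and payload columns are hypothetical additions. It says nothing about performance; for that you would compare the execution plans on your own RDBMS.

```python
import sqlite3

# SQLite stand-in; the rows and the extra columns are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (id INTEGER, propertyA INTEGER);
    CREATE TABLE TABLE_B (propertyA INTEGER, propertyB INTEGER);
    CREATE TABLE TABLE_C (propertyB INTEGER, payload TEXT);
    INSERT INTO TABLE_A VALUES (1, 10), (2, 20);
    INSERT INTO TABLE_B VALUES (10, 100), (20, 200), (30, 300);
    INSERT INTO TABLE_C VALUES (100, 'x'), (100, 'y'), (200, 'z'), (400, 'w');
""")

# Tables listed as A, B, C.
a_first_sql = """
    SELECT a.id, c.payload
    FROM TABLE_A a
    JOIN TABLE_B b ON a.propertyA = b.propertyA
    JOIN TABLE_C c ON b.propertyB = c.propertyB
"""
# Same inner joins, tables listed as C, B, A.
c_first_sql = """
    SELECT a.id, c.payload
    FROM TABLE_C c
    JOIN TABLE_B b ON b.propertyB = c.propertyB
    JOIN TABLE_A a ON a.propertyA = b.propertyA
"""

# Sort in Python because neither query has an ORDER BY.
a_first_rows = sorted(conn.execute(a_first_sql).fetchall())
c_first_rows = sorted(conn.execute(c_first_sql).fetchall())
print(a_first_rows == c_first_rows)
```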

Does SQL performance degrade as the number elements in an "IN" clause increases?

I have a query like this,
SELECT Name FROM Customers WHERE Id IN (1,4,3,6,7)
There might be millions of customers in the database. Will there be an efficiency problem with this query as the number of Ids inside the IN clause grows? If so, why, and is there any workaround?
I use SQL Server. Below is my table structure.
Id is the primary key, with a non-clustered index.
This query is as basic as it can get.
If you need to find the name of 5 customers, there is simply no other sane way of writing it.
It will perform well if you have an index on ID. The performance is almost instantaneous, directly related to the number of items in the IN clause.
If you don't it will scan the table, and the performance becomes directly related to the number of records in the table.
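The difference between the indexed and unindexed case can be seen in a query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, because that is easy to run anywhere; SQL Server has its own execution plan tools and its plan text looks different. The Region column is a hypothetical unindexed column added for contrast, and plan() is a helper written for this example.

```python
import sqlite3

# SQLite stand-in; Region is a hypothetical column with no index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT, Region INTEGER);
""")

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Id is the primary key, so each value in the list becomes a direct lookup.
indexed_plan = plan("SELECT Name FROM Customers WHERE Id IN (1, 4, 3, 6, 7)")

# Region has no index, so every row of the table has to be read.
unindexed_plan = plan("SELECT Name FROM Customers WHERE Region IN (1, 4, 3, 6, 7)")

print(indexed_plan)
print(unindexed_plan)
```

In SQLite the first plan reports a SEARCH using the primary key and the second a SCAN of the whole table; the exact wording varies between SQLite versions.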
Assuming you have properly indexed the Id column, there should be no problem. That is the correct method, and if it does not work, you need a new database. (Millions shouldn't be an issue with most regular pieces of software; if you make it to multiple billions you might need to investigate clustered databases).
If you execute the following query:
select * from sys.objects where object_id in (
(I'm not going to break up all the lines).
In the resulting query, approximately 5% of the cost of the query is taken up with a constant scan (which is effectively turning all of those numbers into a temp table internally and that table is then passed to a join operator).
But, this is a remarkably simple query overall. For any more complex query, I'd expect that the cost, as a percentage, will go down (since I expect the absolute cost to remain the same)
I know this isn't the question that was asked, but, say your list of IDs came from another query:
Then this is cause to rewrite your query using EXISTS:
This is efficient because EXISTS gives more opportunity for the optimizer to determine an efficient execution path, whereas IN forces the subquery to be fully evaluated.
The query you specified doesn't have a subquery. It just has a list of constants, which leaves little opportunity for further optimization. As is, you have to make do with the best you've got, i.e. index the ID column as recommended by #zebediah49.
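For the case where the list of IDs does come from another query, a minimal sketch of the two forms looks like this. The Orders table and its columns are hypothetical, and SQLite (via Python's sqlite3) is used instead of SQL Server. The sketch only shows that both forms return the same customers; whether EXISTS is actually faster depends on the optimizer and should be checked against the real execution plans.

```python
import sqlite3

# SQLite stand-in; Orders is a hypothetical source for the list of ids.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 3);
""")

# IN with a subquery that produces the list of ids.
in_sql = """
    SELECT Name FROM Customers
    WHERE Id IN (SELECT CustomerId FROM Orders)
    ORDER BY Id
"""
# Equivalent EXISTS form, correlated on the customer id.
exists_sql = """
    SELECT c.Name FROM Customers c
    WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.Id)
    ORDER BY c.Id
"""

in_rows = conn.execute(in_sql).fetchall()
exists_rows = conn.execute(exists_sql).fetchall()
print(in_rows == exists_rows)
```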

Count(*) vs Count(Id) in sql server 2005

I use the SQL COUNT function to get the total number of rows from a table. Is there any difference between the two following statements?
Also, is there any difference in terms of performance and execution time?
Thilo nailed the difference precisely... COUNT( column_name ) can return a lower number than COUNT( * ) if column_name can be NULL.
However, if I can take a slightly different angle at answering your question, since you seem to be focusing on performance.
First, note that issuing SELECT COUNT(*) FROM table; will potentially block writers, and it will also be blocked by other readers/writers unless you have altered the isolation level (knee-jerk tends to be WITH (NOLOCK) but I'm seeing a promising number of people finally starting to believe in RCSI). Which means that while you're reading the data to get your "accurate" count, all these DML requests are piling up, and when you've finally released all of your locks, the floodgates open, a bunch of insert/update/delete activity happens, and there goes your "accurate" count.
If you need an absolutely transactionally consistent and accurate row count (even if it is only valid for the number of milliseconds it takes to return the number to you), then SELECT COUNT( * ) is your only choice.
On the other hand, if you are trying to get a 99.9% accurate ballpark, you are much better off with a query like this:
SELECT row_count = SUM(row_count)
FROM sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID('dbo.Table')
AND index_id IN (0,1);
(The SUM is there to account for partitioned tables - if you are not using table partitioning, you can leave it out.)
This DMV maintains accurate row counts for tables with the exception of rows that are currently participating in transactions - and those very transactions are the ones that will make your SELECT COUNT query wait (and ultimately make it inaccurate before you have time to read it). But otherwise this will lead to a much quicker answer than the query you propose, and no less accurate than using WITH (NOLOCK).
COUNT(Id) needs to null-check the column (which may be optimized away for a primary key or other NOT NULL column), so COUNT(*) or COUNT(1) should be preferred (unless you really want to know the number of rows with a non-null value for Id).
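The NULL difference is easy to see on a tiny table. This sketch uses Python's sqlite3 module rather than SQL Server 2005, and the table t with a nullable Id column is made up for the example; the counting semantics shown are standard SQL.

```python
import sqlite3

# Hypothetical table with a nullable Id column; one of the four rows is NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (Id INTEGER);
    INSERT INTO t VALUES (1), (2), (NULL), (4);
""")

# COUNT(*) and COUNT(1) count rows; COUNT(Id) counts only non-NULL Id values.
count_star, count_one, count_id = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(Id) FROM t"
).fetchone()

print(count_star, count_one, count_id)
```

If Id is a primary key or declared NOT NULL, the three counts are always equal.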