How does SQL Server system check duplicates? - sql

I definitely know how to check duplicates/remove duplicates using SQL Server queries. But I am asking a deeper question about the system.
How does the system handle duplicates? For example, how does the system remove duplicates when going from UNION ALL to UNION? My guess is that it uses hash codes to do so. Is that right?
The employer said the process has something to do with ROWID. But even if two rows are exactly the same, their ROWID should be different, correct? How is that possible?

How SQL Server currently seems to do it (this is, after all, an implementation detail that you shouldn't worry about) is that it will temporarily sort the output rows. It doesn't matter what sort ordering it picks, so long as it picks one [1].
Then it iterates over these sorted output rows, remembering the last row it emitted. If the current row is equal, in all columns, to the last emitted row, then that row itself is not emitted.
Since it's not defined what sort order it will choose, nor whether it will apply other tricks (such as partitioning the result data across some columns and then sorting each partition independently/in parallel), you should not assume that the output will be sorted, unless you've also applied a specific ORDER BY clause.
There is no ROWID in SQL Server.
[1] It does need to be based on all columns, however. Basically, we're working so that duplicate rows end up in consecutive rows.
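The sort-then-compare idea described above can be sketched in a few lines of Python. This is only an illustration of the technique, not SQL Server's actual code: the function name distinct_by_sort and the sample rows are made up, rows are modelled as plain tuples, and the sketch assumes no NULLs (Python cannot order None against other values).

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple  # hypothetical model: a row is a tuple of column values


def distinct_by_sort(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield each distinct row once, using sort-then-compare."""
    # Any total order over ALL columns works; its only job is to
    # place equal rows next to each other.
    ordered = sorted(rows)
    have_last = False
    last = None
    for row in ordered:
        # Skip the row if it equals the last emitted row in every column.
        if have_last and row == last:
            continue
        yield row
        last = row
        have_last = True


# Stand-in for the output of a UNION ALL.
union_all = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
union = list(distinct_by_sort(union_all))
print(union)
```

The result happens to come out sorted here, but as the answer says, that is a side effect of this particular strategy and not something a caller should depend on.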

Related

SQL - Order of records returned in join by default [duplicate]

As far as I know from relational database theory, a select statement without an order by clause should be considered to have no particular order. But in practice, in SQL Server and Oracle (I've tested on those 2 platforms), if I query a table without an order by clause multiple times, I always get the results in the same order. Can this behavior be relied on? Can anyone help explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. Simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be primary key order, the order they were created, or some other arbitrary order. More complex queries, such as select * from foo where bar < 10, may instead be returned in order of a different column, based on an index read, or in table order, for a table scan. Even more elaborate queries, with multiple where conditions, group by clauses, or unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because data has changed between those queries. A "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it: an RDBMS has the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk and over the network to send data to you), minimizing CPU, and keeping the size of its working set small (using methods that require minimal temporary storage).
Without an ORDER BY clause, you have not asked for any particular order, and so the RDBMS will give you those rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH, use ORDER BY and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns, you can get different results. You may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
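The paged-grid problem above is usually addressed by adding a unique column as a tie-breaker to the ORDER BY. A minimal sketch using Python's built-in sqlite3 module follows; the items table, its columns, and the page helper are hypothetical, and LIMIT/OFFSET paging is just one possible paging scheme.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT)")
# Every row has the same category, so ORDER BY category alone
# would leave the order of all rows unspecified.
conn.executemany(
    "INSERT INTO items (id, category) VALUES (?, ?)",
    [(i, "same") for i in range(1, 11)],
)


def page(page_no, page_size=5):
    """Return the ids on one page, ordered by category, then id."""
    # The unique id breaks ties, which makes the ordering total and
    # therefore repeatable from one request to the next.
    sql = "SELECT id FROM items ORDER BY category, id LIMIT ? OFFSET ?"
    offset = page_no * page_size
    return [row[0] for row in conn.execute(sql, (page_size, offset))]


first, second = page(0), page(1)
print(first, second)
```

With the tie-breaker in place, going forward a page and back again returns the same rows, because no two rows compare equal under the ORDER BY.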
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I got an answer from Tom Kyte, and I am posting it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer to the question is kept below only for the sake of history. It is the wrong answer. The right answer is above.)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered
collection of rows. These rows will come out in a seemingly random
order, and depending on other options being used (parallel query,
different optimizer modes and so on), they may come out in a different
order with the same query. Do not ever count on the order of rows from
a query unless you have an ORDER BY statement on your query!
But note that he only talks about heap-organized tables. There are also index-organized tables. In that case you can rely on the order of a select without ORDER BY, because the order is implicitly defined by the primary key. It is true for Oracle.
In SQL Server, clustered indexes (index-organized tables) are created by default. PostgreSQL can also store data aligned with an index. More information can be found here
UPDATE:
I see that my answer is being voted down, so I will try to explain my point a little.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of the index, all data is stored in a specific order. I believe the same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me, please give me a link to the documentation. I'll be happy to find out that there is something for me to learn.

Logical Query Processing: How the select is before the order by

I am using T-SQL, and in the book T-SQL Fundamentals, Itzik Ben-Gan says that the SELECT clause is logically processed before the ORDER BY clause.
I agree with this, but I want to know how the SELECT can be processed before the ORDER BY when TOP is part of the SELECT and needs the result of the ORDER BY first.
Without rewriting a lot of what's already been written before:
https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql
Short version: Imagine that SQL Server creates virtual tables during query execution. Those tables and their values are passed, step-by-step, through a logical process that determines your end result. The goal is to "fetch" the minimum number of rows from the beginning, and thereby filter out as few rows as possible. After all, why "fetch" 100,000 rows if you only want to see 2?
In the case of a TOP clause, you're only going to see those TOP x rows, but that doesn't mean that they are the only rows that SQL Server checked during the query execution.
On the contrary - if you're looking for the TOP x rows by some column value, then clearly SQL Server needs to make sure that it first analyzes the values for that column, orders them accordingly, and can only then present you with the TOP x rows. This is why having proper indexes can make such a difference when executing these sorts of queries.
This is very different from the WHERE clause, which can happen earlier on, because a value either = X, or it doesn't; so when scanning a table, SQL Server can know for every single row whether or not that row should be included in the final result set. With a TOP clause, this is not necessarily the case. Halfway through the query process, it doesn't necessarily know if there are more unread rows that should be included in that TOP or not - it can only know once the rows have been selected in accordance with all previous conditions, then ordered by your ORDER BY clause. Finally, it knows which rows should be in those TOP x rows you asked for.
Notice that in Microsoft's documentation, they explicitly state that the logical processing order can change from query to query, especially in extenuating circumstances (i.e., a VIEW that uses CONVERT(), or depending upon the indexes for a given table.)
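The difference between WHERE and TOP described above can be modelled in plain Python. This is a sketch of the logical steps only, not of how SQL Server physically executes a query: the top_n function, the sample rows, and the column names are invented for illustration.

```python
# Hypothetical sample data standing in for a table.
rows = [
    {"id": 1, "region": "EU", "amount": 50},
    {"id": 2, "region": "US", "amount": 70},
    {"id": 3, "region": "EU", "amount": 90},
    {"id": 4, "region": "EU", "amount": 20},
    {"id": 5, "region": "EU", "amount": 80},
]


def top_n(rows, predicate, sort_key, n, descending=False):
    """Model of SELECT TOP (n) ... WHERE ... ORDER BY ... as three logical steps."""
    # WHERE: each row can be accepted or rejected on its own.
    filtered = [row for row in rows if predicate(row)]
    # ORDER BY: needs the complete filtered set before any row is final.
    ordered = sorted(filtered, key=sort_key, reverse=descending)
    # TOP: only now is it known which rows make the cut.
    return ordered[:n]


result = top_n(
    rows,
    predicate=lambda row: row["region"] == "EU",
    sort_key=lambda row: row["amount"],
    n=2,
    descending=True,
)
print([row["id"] for row in result])
```

A real engine can often avoid the explicit sort, for example by reading an index that already delivers rows in the requested order, which is why proper indexes matter so much for these queries.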

In SQL, does LIMIT return the row that was inserted last in chronological order?

Suppose the following rows are inserted in chronological order into a table:
row1, row2, row3, row4, ..., row1000, row1001.
After a while, we delete/remove the latest row1001.
As in this post: How to get Top 5 records in SqLite?
If the below command is run:
SELECT * FROM <table> LIMIT 1;
Will it assuredly provide the "row1000"?
If no, then is there any efficient way to get the latest row(s)
without traversing through all the rows? -- i.e. without using
combination of ORDER BY and DESC.
[Note: For now I am using "SQLite", but it will be interesting for me to know about SQL in general as well.]
You're misunderstanding how SQL works. You're thinking row-by-row which is wrong. SQL does not "traverse rows" as per your concern; it operates on data as "sets".
Others have pointed out that a relational database cannot be assumed to have any particular ordering, so you must use ORDER BY to explicitly specify ordering.
However, what has not been mentioned yet is that, in order to ensure the query performs efficiently, you need to create an appropriate index.
Whether you have an index or not, the correct query is:
SELECT <cols>
FROM <table>
ORDER BY <sort-cols> [DESC] LIMIT <no-rows>
Note that if you don't have an index the database will load all data and probably sort in memory to find the TOP n.
If you do have the appropriate index, the database will use the best index available to retrieve the TOP n rows as efficiently as possible.
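The scenario from the question can be tried with Python's built-in sqlite3 module. This is a sketch under assumptions: the log table and its columns are hypothetical, and insertion order is captured by an INTEGER PRIMARY KEY AUTOINCREMENT column, which in SQLite is the key of the table's own b-tree, so no separate index is created here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: id records insertion order and doubles as the sort key.
conn.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO log (payload) VALUES (?)",
    [(f"row{i}",) for i in range(1, 1002)],  # row1 .. row1001
)
# Remove the most recent row, as in the question.
conn.execute("DELETE FROM log WHERE payload = 'row1001'")

# Explicit ordering: newest first, then keep a single row.
latest = conn.execute(
    "SELECT payload FROM log ORDER BY id DESC LIMIT 1"
).fetchone()[0]

# The same pattern returns the latest few rows.
latest_three = [
    row[0]
    for row in conn.execute("SELECT payload FROM log ORDER BY id DESC LIMIT 3")
]
print(latest, latest_three)
```

Because the sort column is already indexed, the engine does not need to read and sort the whole table to answer this; prefixing the query with EXPLAIN QUERY PLAN shows what SQLite actually chooses on a given version.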
Note that the SQLite documentation is very clear on the matter. The section on ORDER BY explains that ordering is undefined. And nothing in the section on LIMIT contradicts this (it simply constrains the number of rows returned).
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
This behaviour is also consistent with the ANSI standard and all major SQL implementations. Note that any database vendor that guaranteed any kind of ordering would have to sacrifice performance to the detriment of queries trying to retrieve data but not caring about order. (Not good for business.)
As a side note, making flawed assumptions about ordering is an easy mistake (similar to flawed assumptions about uninitialised local variables).
RDBMS implementations are very likely to make ordering appear consistent. They follow a certain algorithm for adding data, a certain algorithm for retrieving data. And as a result, their operations are highly repeatable (it's what we love (and hate) about computers). So things repeatably look the same.
Theoretical examples:
Inserting a row results in the row being added to the next available free space. So data appears sequential. But an update would have to move the row to a new location if it no longer fits.
The DB engine might retrieve data sequentially from clustered index pages and seem to use the clustered index as the 'natural ordering' ... until one day a page-split puts one of the pages in a different location.
Or a new version of the DBMS might cache certain data for performance, and suddenly the order changes.
Real-world example:
The MS SQL Server 6.5 implementation of GROUP BY had the side-effect of also sorting by the group-by columns. When MS (in version 7 or 2000) implemented some performance improvements, GROUP BY would, by default, return data in a hashed order. Many people blamed MS for breaking their queries when in fact they had made false assumptions and failed to ORDER BY their results as needed.
This is why the only guarantee of a specific ordering is to use the ORDER BY clause.
No. Table records have no inherent order. So it is undefined which row(s) to get with a LIMIT clause without an ORDER BY.
SQLite in its current implementation may return the latest inserted row, but even if that were the case you must not rely on it.
Give a table a datetime column or some sortkey, if record order is important for you.
In SQL, data is stored in tables unordered. What comes out first one day might not be the same the next.
ORDER BY, or some other specific selection criteria is required to guarantee the correct value.

Faster to get one row from DB or count number of rows

I was wondering if it's faster to retrieve the "count" of the number of rows or to retrieve just 1 row using limit. The purpose is to see whether there's any row that matches certain WHERE conditions.
A count is always an expensive query because it will take a full table scan. Your requirements are not really clear to me, but if you just want to see whether there is any data, it would be cheaper to do a regular select with a limit of 1.
A count must physically count all rows that match your criteria, which is unnecessary work as you don't care about the number.
Look at using EXISTS.
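A minimal sketch of the EXISTS approach, run through Python's built-in sqlite3 module. The orders table, its columns, and the helper names are hypothetical; the point is only that EXISTS asks whether at least one matching row is present, while COUNT(*) asks how many there are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("closed",), ("open",)],
)


def any_rows(status):
    """True if at least one order has the given status."""
    # EXISTS yields 1 or 0; the engine is free to stop at the first match.
    sql = "SELECT EXISTS (SELECT 1 FROM orders WHERE status = ?)"
    return bool(conn.execute(sql, (status,)).fetchone()[0])


def count_rows(status):
    """Number of orders with the given status (every match is counted)."""
    sql = "SELECT COUNT(*) FROM orders WHERE status = ?"
    return conn.execute(sql, (status,)).fetchone()[0]


print(any_rows("open"), any_rows("cancelled"), count_rows("open"))
```

Whether EXISTS is measurably faster depends on the engine, the indexes, and the data, so it is worth checking the query plan on the actual database.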
I think it depends on which storage engine you use for the database...
Anyway, the good practice for testing whether or not there are results is to check the return value from the fetch() function!
[I'm using Oracle as an example here, but the same concepts apply usually across the board.]
COUNT has to physically identify all the rows that will be returned. Depending on the complexity of the query, the query plan could require a table scan on one or more of the tables of the query. You'd need to do an EXPLAIN PLAN to know for sure.
Returning a single row may require the same processing if an ORDER BY is required. The database can't just give you the first row until
all the rows are identified and
the rows have been sorted.
Also, depending on the number of rows being returned, the complexity of the ORDER BY, and the SGA size, a temporary table might need to be created (which causes all sorts of other overhead).
If there's no ORDER BY, a single row should be faster because as soon as the data identifies a single row to return, it's done. That said, you're not guaranteed from one execution to the next that the rows are returned in the same order, so usually an ORDER BY is involved.
It's faster to retrieve the count of the number of rows

How do database servers decide which order to return rows without any "order by" statements?

Kind of a whimsical question, always something I've wondered about and I figure knowing why it does what it does might deepen my understanding a bit.
Let's say I do "SELECT TOP 10 * FROM TableName". In short timeframes, the same 10 rows come back, so it doesn't seem random. They weren't the first or last created. In my massive sample size of...one table, it isn't returning the min or max auto-incrementing primary key value.
I also figure the problem gets more complex when taking joins into account.
My database of choice is MSSQL, but I figure this might be an interesting question regardless of the platform.
If you do not supply an ORDER BY clause on a SELECT statement you will get rows back in arbitrary order.
The actual order is undefined, and depends on which blocks/records are already cached in memory, what order I/O is performed in, when threads in the database server are scheduled to run, and so on.
There's no rhyme or reason to the order and you should never base any expectations on what order rows will be in unless you supply an ORDER BY.
If they're not ordered by the calling query, I believe they're just returned in the order they were read off disk. This may vary because of the types of joins used or the indexes that looked up the values.
You can see this if the table has a clustered index on it (and you're just selecting - a JOIN can re-order things) - a SELECT will return the rows in clustered-index-order, even without an ORDER BY clause.
There is a very detailed explanation with examples here: http://sqlserverpedia.com/blog/sql-server-bloggers/its-the-natural-order-of-things-not/
"How do database servers decide which order to return rows without any “order by” statements?"
They simply do not take any "decision" with respect to ordering. They see the user doesn't care about ordering, and so they don't care either. And thus they simply go out to find the requested rows. The order in which they find them is normally the order in which you get them. That order depends on user-unpredictable things like the chosen physical access paths, ordering of physical records inside the database's physical files, etc. etc.
Don't let yourself be misled by the ordering as you get it, in the case where you didn't explicitly specify an ordering in your query. If you don't specify an ordering in your query, no ordering in the result set is guaranteed, even if in practice results seem to suggest that some ordering appears to be adhered to by the server.