I am going through the Hive manual below and am confused by the details in the documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
First it says
Hive uses the columns in SORT BY to sort the rows before feeding the
rows to a reducer.
Then it says
Hive supports SORT BY which sorts the data per reducer. The difference
between "order by" and "sort by" is that the former guarantees total
order in the output while the latter only guarantees ordering of the
rows within a reducer. If there are more than one reducer, "sort by"
may give partially ordered final results.
If it already sorts the records before sending them to a reducer, then how is the final output not guaranteed to be sorted? Is it running a dual sort?
Most of the logic for sort by and order by is quite similar. You can think of order by as a more restricted case of sort by. Let's suppose the underlying execution engine is MapReduce.
Both cases rely on the shuffle phase of MR to sort items. The shuffle operation can be broken into two parts, processed by the map side and the reduce side of an MR job respectively: the former does the local sorting, and the latter merges the partial results that come from different mappers.
When the doc says:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer.
It means that rows are sorted by the map side operations of the shuffle phase. Actually this is true for both sort by and order by.
Then what's the difference between the two? It's the parallelism of the reducers.
For order by, in order to get a globally ordered result set, Hive forces the number of reducers to be 1, so all data is sent to a single reducer. At this single reducer, the merging part of the shuffle guarantees that all data is sorted globally.
For sort by, there is no such enforcement, so the number of reducers can be anything. As a result, data is sorted only within each reducer, and no global ordering is guaranteed. When the number of reducers is 1, whether set explicitly or implicitly, sort by and order by behave the same.
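The difference can be illustrated with a toy simulation in plain Python. This is only a sketch, not Hive code: the `run_sort` helper and the modulo partitioner are assumptions made for illustration, and real Hive decides how rows reach reducers on its own.

```python
# Toy simulation of SORT BY vs ORDER BY on MapReduce (not Hive code).
# Assumption: rows are assigned to reducers by a simple modulo partitioner.

def run_sort(rows, num_reducers):
    """Partition rows across reducers, sort within each, and concatenate the outputs."""
    reducers = [[] for _ in range(num_reducers)]
    for value in rows:
        reducers[value % num_reducers].append(value)  # hypothetical partitioner
    # Each reducer sorts only the rows it received.
    per_reducer = [sorted(part) for part in reducers]
    # The final result is the reducer outputs read one after another.
    final = [value for part in per_reducer for value in part]
    return per_reducer, final

rows = [7, 2, 9, 4, 1, 8, 3, 6]

# SORT BY with two reducers: sorted within each reducer, not globally.
sort_by_parts, sort_by_result = run_sort(rows, num_reducers=2)

# ORDER BY (or SORT BY with a single reducer): one reducer sees everything.
_, order_by_result = run_sort(rows, num_reducers=1)

print(sort_by_parts)    # each inner list is sorted
print(sort_by_result)   # concatenation is only partially ordered
print(order_by_result)  # fully ordered
```

With two reducers, each part is sorted on its own but the concatenated result is not; with one reducer, the result is totally ordered.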
I know that theoretically the answer is random, but I was wondering: if you use, for example, window functions with row_number() and you have duplicate values in your order by column for a given partition, will the result still be the same each time? Does Hive look at other columns to determine ordering even if they are not specified?
The order of duplicate rows is not guaranteed because the query is processed in parallel by many mappers and reducers. Each may run faster or slower from one execution to the next, depending on the load on the cluster and on each node involved. Mappers' results may not be processed in the same order even on a single reducer.
Suppose the following rows are inserted in chronological order into a table:
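A small Python model may make the effect of ties concrete. It is a sketch under stated assumptions: the `row_numbers` helper and the sample rows are hypothetical, and Python's stable sort stands in for whatever order the mapper output happens to arrive in. Adding a unique column as a tie-breaker is a common way to make the numbering repeatable; it is not something Hive does for you.

```python
# Toy model of ROW_NUMBER() OVER (ORDER BY score) with duplicate scores.
# Python's sort is stable, so tied rows keep their arrival order, which stands in
# for the order in which mapper output happens to reach the reducer.

def row_numbers(rows, key):
    """Return {name: row_number} after sorting rows by the given key."""
    ordered = sorted(rows, key=key)
    return {name: i for i, (name, _score) in enumerate(ordered, start=1)}

arrival_a = [("alice", 10), ("bob", 10), ("carol", 20)]
arrival_b = [("bob", 10), ("alice", 10), ("carol", 20)]  # same rows, different arrival order

# Ordering by score alone: the tied rows get different numbers in each run.
by_score_a = row_numbers(arrival_a, key=lambda r: r[1])
by_score_b = row_numbers(arrival_b, key=lambda r: r[1])

# Ordering by (score, name): the tie-breaker makes the numbering repeatable.
by_both_a = row_numbers(arrival_a, key=lambda r: (r[1], r[0]))
by_both_b = row_numbers(arrival_b, key=lambda r: (r[1], r[0]))

print(by_score_a, by_score_b)
print(by_both_a, by_both_b)
```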
row1, row2, row3, row4, ..., row1000, row1001.
After a while, we delete/remove the latest row1001.
As in this post: How to get Top 5 records in SqLite?
If the below command is run:
SELECT * FROM <table> LIMIT 1;
Will it reliably return "row1000"?
If not, is there an efficient way to get the latest row(s)
without traversing all the rows, i.e. without using
a combination of ORDER BY and DESC?
[Note: For now I am using "SQLite", but it will be interesting for me to know about SQL in general as well.]
You're misunderstanding how SQL works. You're thinking row-by-row which is wrong. SQL does not "traverse rows" as per your concern; it operates on data as "sets".
Others have pointed out that a relational database cannot be assumed to have any particular ordering, so you must use ORDER BY to explicitly specify ordering.
However (and this has not been mentioned yet), to ensure the query performs efficiently, you need to create an appropriate index.
Whether you have an index or not, the correct query is:
SELECT <cols>
FROM <table>
ORDER BY <sort-cols> [DESC] LIMIT <no-rows>
Note that if you don't have an index the database will load all data and probably sort in memory to find the TOP n.
If you do have the appropriate index, the database will use the best index available to retrieve the TOP n rows as efficiently as possible.
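As a rough illustration of the query shape above, here is a sketch using Python's built-in sqlite3 module. The `events` table, the `created_at` sort key, and the index name are all hypothetical. The query plan is printed only so you can inspect whether the index is picked up; its exact text differs between SQLite versions.

```python
import sqlite3

# Hypothetical schema: an "events" table with an explicit sort key (created_at).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, created_at INTEGER)")
conn.executemany(
    "INSERT INTO events (name, created_at) VALUES (?, ?)",
    [(f"row{i}", i) for i in range(1, 1002)],  # row1 .. row1001
)
# Index on the sort column so ORDER BY ... LIMIT need not sort the whole table.
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")

# Delete the latest row, as in the question.
conn.execute("DELETE FROM events WHERE name = 'row1001'")

latest = conn.execute(
    "SELECT name FROM events ORDER BY created_at DESC LIMIT 1"
).fetchone()[0]

# The plan text varies by SQLite version; it is printed only for inspection.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM events ORDER BY created_at DESC LIMIT 1"
).fetchall()

print(latest)
print(plan)
```

The point of the explicit `created_at` column is that "latest" is defined by data you control rather than by wherever the engine happens to store the row.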
Note that the SQLite documentation is very clear on the matter. The section on ORDER BY explains that ordering is undefined. And nothing in the section on LIMIT contradicts this (it simply constrains the number of rows returned).
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
This behaviour is also consistent with the ANSI standard and all major SQL implementations. Note that any database vendor that guaranteed any kind of ordering would have to sacrifice performance to the detriment of queries trying to retrieve data but not caring about order. (Not good for business.)
As a side note, making flawed assumptions about ordering is an easy mistake (similar to making flawed assumptions about uninitialised local variables).
RDBMS implementations are very likely to make ordering appear consistent. They follow a certain algorithm for adding data, a certain algorithm for retrieving data. And as a result, their operations are highly repeatable (it's what we love (and hate) about computers). So things repeatably look the same.
Theoretical examples:
Inserting a row results in the row being added to the next available free space. So data appears sequential. But an update would have to move the row to a new location if it no longer fits.
The DB engine might retrieve data sequentially from clustered index pages and seem to use the clustered index as the 'natural ordering' ... until one day a page split puts one of the pages in a different location.
Or a new version of the DBMS might cache certain data for performance, and suddenly the order changes.
Real-world example:
The MS SQL Server 6.5 implementation of GROUP BY had the side effect of also sorting by the group-by columns. When MS (in version 7 or 2000) implemented some performance improvements, GROUP BY would, by default, return data in a hashed order. Many people blamed MS for breaking their queries when in fact they had made false assumptions and failed to ORDER BY their results as needed.
This is why the only guarantee of a specific ordering is to use the ORDER BY clause.
No. Table records have no inherent order. So it is undefined which row(s) to get with a LIMIT clause without an ORDER BY.
SQLite in its current implementation may return the latest inserted row, but even if that were the case, you must not rely on it.
Give the table a datetime column or some other sort key if record order is important to you.
In SQL, data is stored in tables unordered. What comes out first one day might not be the same the next.
An ORDER BY, or some other specific selection criterion, is required to guarantee the correct value.
The join and group by operations can be much faster if their arguments are sorted on the key.
They also naturally produce sorted output when the input is sorted.
The question is: does Pig guarantee that the output is sorted, or do I need to order by the aliases produced by group by ... using 'merge'?
Pig offers no guarantees of ordering except following an ORDER BY statement. Since Pig sits on top of Hadoop, it does not directly control how output is created, including its order.
During the shuffle phase, keys are partitioned to each reducer and then sorted by key on each reducer. The result is that if you examine the output of each reducer in turn (i.e., look at the output from reducer 0, then reducer 1, etc.) you will find they are ordered by the map key. In the case of a Pig GROUP BY, the map key is the field you are grouping by. So frequently you will find that the output is sorted the way you want.
The rub is that Pig does not control the underlying map-reduce shuffle and sort phases. So the sort order can vary underneath and Pig does not need to worry about it. I don't know under what conditions the ordering varies -- possibly with different versions of Hadoop -- but you should not rely on it. In general I find the ordering to be lexicographic, which means a GROUP BY on an integer will not be sorted the way you expect. I have also seen output that is sorted first by length, and then lexicographically, which again is likely not what you want.
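The orderings mentioned above are easy to compare in a few lines of Python. This is only an illustration of how the comparison rules differ, using made-up keys; it says nothing about which rule a given Hadoop or Pig version actually applies.

```python
# Integer keys compared as text sort lexicographically, not numerically.
numeric_keys = [10, 9, 2, 100, 33]
as_text = sorted(str(k) for k in numeric_keys)
as_numbers = sorted(numeric_keys)

# "Length first, then lexicographic" ordering, shown on made-up string keys.
word_keys = ["pear", "fig", "apple", "kiwi"]
lexicographic = sorted(word_keys)
by_length_then_text = sorted(word_keys, key=lambda s: (len(s), s))

print(as_text)              # text comparison of the integer keys
print(as_numbers)           # numeric comparison of the same keys
print(lexicographic)        # plain lexicographic order
print(by_length_then_text)  # shorter keys first, ties broken lexicographically
```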
If you find it works for you in your distribution, then more power to you, you can skip those two MR jobs. But your script may not be portable and may be subject to breakage if you change something about the Hadoop installation.
Should the result of GROUP BY be sorted according to the SQL standard?
Many databases return sorted results for GROUP BY,
but is that enforced by SQL92 or another standard?
No. GROUP BY has no standard impact on the order of rows returned. That's what ORDER BY is designed to do.
If you're getting some kind of repeatable or predictable sort order returned by a GROUP BY, it's something being done in your DBMS that is not defined in the standards.
As a previous answer has explained, no sorting is ever implied by any basic SQL construct other than ORDER BY.
However, to compute GROUP BY, either an index scan or an in-memory sort may take place (to create the buckets), and such an index scan or sort implies traversing the data in sorted order. So it is no accident that a particular database often behaves like this. Do not rely on it, however, because with a different set of indexes, or even just a different query plan (which may be triggered by as little as a few inserts and/or a restart of your database server), the behavior could be quite different.
Notice also that reordering the column list in the ORDER BY clause will result in reliably reordering the output, whereas reordering the column list in a GROUP BY clause will likely have no effect whatsoever.
There is no performance cost to using a seemingly "redundant" ORDER BY. The query plan will likely be identical if the original one already guaranteed sorted output.
Um, sorting the output of a GROUP BY is not in the standard because there are standard algorithms for grouping that do not produce results in order.
The most common of these is the use of a hash table for doing the group by.
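A hash-based group by can be sketched in a few lines of Python. The sample rows are made up, and the sketch only shows the general technique, not any particular engine's implementation. Note that no sorting step is involved at any point.

```python
from collections import defaultdict

# Toy hash aggregation: SELECT key, SUM(amount) ... GROUP BY key, without any sort.
rows = [("pear", 3), ("apple", 1), ("pear", 2), ("fig", 7), ("apple", 10)]

totals = defaultdict(int)
for key, amount in rows:
    totals[key] += amount  # one hash-table lookup per row, no ordering step

# Python dicts iterate in insertion order, so groups come out in first-seen order here.
# A real engine's hash table would typically expose some other, equally unsorted, order.
grouped = list(totals.items())

print(grouped)
```

The groups are correct, but they come out in first-seen order rather than sorted by key, so an explicit ORDER BY is still needed if order matters.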
In addition, on a multithreaded server, the data could be sorted, but the results would be returned processor-by-processor. There is no guarantee that the lowest order processor would be the first to return data.
And also, on a parallel machine, the data may be split among the processors using a variety of methods. For instance, all strings that end in "a" may go to one processor. All that end in "b" to another. These could then be sorted locally, but the results themselves would not be sorted overall.
Databases such as MySQL that guarantee a sort after the group by are making a poor design decision. In addition to not conforming to the standard, such databases either limit the choice of algorithm or impose additional processing for ordering.
Kind of a whimsical question, always something I've wondered about and I figure knowing why it does what it does might deepen my understanding a bit.
Let's say I do "SELECT TOP 10 * FROM TableName". In short timeframes, the same 10 rows come back, so it doesn't seem random. They weren't the first or last created. In my massive sample size of...one table, it isn't returning the min or max auto-incrementing primary key value.
I also figure the problem gets more complex when taking joins into account.
My database of choice is MSSQL, but I figure this might be an interesting question regardless of the platform.
If you do not supply an ORDER BY clause on a SELECT statement you will get rows back in arbitrary order.
The actual order is undefined, and depends on which blocks/records are already cached in memory, what order I/O is performed in, when threads in the database server are scheduled to run, and so on.
There's no rhyme or reason to the order and you should never base any expectations on what order rows will be in unless you supply an ORDER BY.
If they're not ordered by the calling query, I believe they're just returned in the order they were read off disk. This may vary because of the types of joins used or the indexes that looked up the values.
You can see this if the table has a clustered index on it (and you're just selecting - a JOIN can re-order things) - a SELECT will return the rows in clustered-index-order, even without an ORDER BY clause.
There is a very detailed explanation with examples here: http://sqlserverpedia.com/blog/sql-server-bloggers/its-the-natural-order-of-things-not/
"How do database servers decide which order to return rows without any “order by” statements?"
They simply do not make any "decision" with respect to ordering. They see the user doesn't care about ordering, and so they don't care either. And thus they simply go out to find the requested rows. The order in which they find them is normally the order in which you get them. That order depends on user-unpredictable things like the chosen physical access paths, the ordering of physical records inside the database's physical files, etc.
Don't let yourself be misled by the ordering as you get it, in the case where you didn't explicitly specify an ordering in your query. If you don't specify an ordering in your query, no ordering in the result set is guaranteed, even if in practice results seem to suggest that some ordering appears to be adhered to by the server.