Operations join and group by can be much faster if the arguments are sorted on the key.
They also naturally produce sorted output when the input is sorted.
The question is: does Pig guarantee that the output is sorted, or do I need to order by aliases produced by group by ... using 'merge'?
Pig offers no guarantees of ordering except following an ORDER BY statement. Since Pig sits on top of Hadoop, it does not directly control how output is created, including its order.
During the shuffle phase, keys are partitioned to each reducer and then sorted by key on each reducer. The result is that if you examine the output of each reducer in turn (i.e., look at the output from reducer 0, then reducer 1, etc.) you will find they are ordered by the map key. In the case of a Pig GROUP BY, the map key is the field you are grouping by. So frequently you will find that the output is sorted the way you want.
The rub is that Pig does not control the underlying map-reduce shuffle and sort phases. So the sort order can vary underneath and Pig does not need to worry about it. I don't know under what conditions the ordering varies -- possibly with different versions of Hadoop -- but you should not rely on it. In general I find the ordering to be lexicographic, which means a GROUP BY on an integer will not be sorted the way you expect. I have also seen output that is sorted first by length, and then lexicographically, which again is likely not what you want.
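To make the lexicographic point concrete, here is a minimal Python sketch (not Pig itself) of how the same keys come out under the orderings mentioned above. The key values are made up for illustration.

```python
# Hypothetical group keys, e.g. the values of an integer field used in GROUP BY.
int_keys = [10, 9, 2, 100, 1]

# What you probably expect: numeric order.
numeric = sorted(int_keys)

# What you get if the keys are compared as text (lexicographic order).
lexicographic = sorted(int_keys, key=str)

# Hypothetical string keys sorted first by length, then lexicographically.
names = ["pear", "fig", "apple", "kiwi"]
length_then_lex = sorted(names, key=lambda s: (len(s), s))

print(numeric)          # [1, 2, 9, 10, 100]
print(lexicographic)    # [1, 10, 100, 2, 9]
print(length_then_lex)  # ['fig', 'kiwi', 'pear', 'apple']
```

None of these orderings is wrong as such; the point is that without an explicit ORDER BY you do not get to choose which one you see.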
If you find it works for you in your distribution, then more power to you; you can skip those two MR jobs. But your script may not be portable and may break if you change something about the Hadoop installation.
I have an SQL query that reads
SELECT DISTINCT [NR] AS K_ID
FROM [DB].[staging].[TABLE]
WHERE [N]=1 and [O]='XXX' and [TYPE] in ('1_P', '2_I')
Since I'm saving the result in a CSV file (via Python Pandas) which is under version control, I've noticed that the order of the result changes every time I run the query. To rule out the Python part, I ran the query in MS SQL Server Management Studio, where I also observe a different order with every attempt.
It doesn't matter in my case, but: is it correct that the result of the query can be ordered differently with every execution? And if so, is there a way to make the order "deterministic"?
SQL databases are based on relational algebra and set theory, where what you think of as tables are more formally called unordered relations. Unless you specify an ORDER BY, the database is free to return the data in whatever order is convenient.
This order might match an index, rather than the order on disk. It might also start in the middle of the data, if the database can take advantage of work already in progress for another query to reduce total reads between the two (Enterprise Edition will do this).
Worse, even the order on disk might change. If there's no primary key, the database can even move a page around to help things run more efficiently.
In other words, if the order matters (and it usually does), specify an ORDER BY clause.
SQL queries return results as an unordered set, unless the outermost query has an order by.
On smaller amounts of data, the results look repeatable. However, on larger systems -- and particularly on parallel systems -- the ordering may be based on hashing algorithms, when nodes complete, and congestion on the network (among other factors). So, you can in fact see different orderings each time you run.
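As a rough illustration of the fix, here is a small sketch using Python's built-in sqlite3 module rather than SQL Server. The table name, the lower-cased column names, and the sample rows are assumptions made up to mirror the query in the question. The only change to the query itself is the trailing ORDER BY.

```python
import sqlite3

# In-memory SQLite database standing in for the real SQL Server table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_table (nr INTEGER, n INTEGER, o TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO staging_table VALUES (?, ?, ?, ?)",
    [
        (3, 1, "XXX", "1_P"),
        (1, 1, "XXX", "2_I"),
        (3, 1, "XXX", "2_I"),  # duplicate nr, removed by DISTINCT
        (2, 1, "XXX", "1_P"),
        (5, 0, "XXX", "1_P"),  # filtered out by n = 1
    ],
)

# Same shape as the original query, plus an explicit ORDER BY on the output column.
query = """
    SELECT DISTINCT nr AS k_id
    FROM staging_table
    WHERE n = 1 AND o = 'XXX' AND type IN ('1_P', '2_I')
    ORDER BY k_id
"""
k_ids = [row[0] for row in conn.execute(query)]
print(k_ids)  # [1, 2, 3]
```

With the ORDER BY in place, the CSV written from the result should stop changing between runs as long as the underlying data is the same.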
I am going through the Hive manual page below and am confused by the details explained in the documentation.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
First it says
Hive uses the columns in SORT BY to sort the rows before feeding the
rows to a reducer.
Then it says
Hive supports SORT BY which sorts the data per reducer. The difference
between "order by" and "sort by" is that the former guarantees total
order in the output while the latter only guarantees ordering of the
rows within a reducer. If there are more than one reducer, "sort by"
may give partially ordered final results.
If it already sorts records before sending them to a reducer, then how is the final output not guaranteed to be sorted? Is it running a dual sort?
Most of the logic for sort by and order by is similar. You can think of order by as a more restricted case of sort by. Let's suppose the underlying execution engine is MapReduce.
Both cases rely on the shuffle phase of MR to sort items. The shuffle operation can be broken into two parts, handled by the map side and the reduce side of an MR job respectively: the former sorts locally, and the latter merges the partial results that come from different mappers.
When the doc says:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer.
It means that rows are sorted by the map side operations of the shuffle phase. Actually this is true for both sort by and order by.
Then what's the difference between the two? It's the parallelism of the reducers.
For order by, in order to get a globally ordered result set, Hive forces the number of reducers to 1, so all data is sent to a single reducer. At this single reducer, the merging part of the shuffle guarantees that all data is sorted globally.
For sort by, there is no such enforcement, so the number of reducers can be anything. Data is therefore sorted only within each reducer, and no global sorting is guaranteed. When the number of reducers is set to 1, explicitly or implicitly, sort by and order by behave the same.
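The description above can be sketched as a toy simulation in plain Python. This is not Hive code: the `run_job` helper, the sample rows, and the modulo partitioner are all assumptions chosen to keep the example small.

```python
import heapq

def run_job(mapper_outputs, num_reducers):
    """Toy shuffle: map-side local sort, partition by key, reduce-side merge."""
    reducers = []
    for r in range(num_reducers):
        # Map side: each mapper sorts the rows it sends to reducer r.
        # Partitioning by `key % num_reducers` is a simplification (assumption).
        runs = [
            sorted(k for k in rows if k % num_reducers == r)
            for rows in mapper_outputs
        ]
        # Reduce side: merge the already-sorted runs from all mappers.
        reducers.append(list(heapq.merge(*runs)))
    return reducers

mapper_outputs = [[7, 2, 9], [4, 1, 8], [6, 3, 5]]

sort_by = run_job(mapper_outputs, num_reducers=2)   # like SORT BY with 2 reducers
order_by = run_job(mapper_outputs, num_reducers=1)  # like ORDER BY (1 reducer)

print(sort_by)   # [[2, 4, 6, 8], [1, 3, 5, 7, 9]]
print(order_by)  # [[1, 2, 3, 4, 5, 6, 7, 8, 9]]
```

Each reducer's output is sorted in both runs, but only the single-reducer run gives a total order when the reducer outputs are read one after another.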
I know that in theory the answer is random, but I was wondering: if you are using, for example, window functions with row_number() and you have duplicate values in your order by column for a given partition, will the result still be the same? Does Hive look at other columns to determine ordering even if they are not specified?
The order of duplicate rows is not guaranteed because query processing is done in parallel across many mappers and reducers. Each may run faster or slower from one execution to the next, depending on the load on the cluster and on each node involved. Mappers' results may not be processed in the same order even on a single reducer.
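A small Python sketch of the effect, under the assumption that ties are left in arrival order (which is how Python's stable sort behaves, and is only a stand-in for whatever Hive happens to do). The rows and the `row_numbers` helper are made up for illustration.

```python
# The same three rows arriving in two different orders, as might happen
# when mappers finish at different times (hypothetical data).
rows_a = [("alice", 10), ("bob", 10), ("carol", 7)]
rows_b = [("bob", 10), ("alice", 10), ("carol", 7)]

def row_numbers(rows, key):
    """Assign 1-based row numbers after sorting by `key`."""
    ordered = sorted(rows, key=key)
    return {name: i for i, (name, _) in enumerate(ordered, start=1)}

def score_only(row):
    return row[1]              # ties between alice and bob are unresolved

def score_then_name(row):
    return (row[1], row[0])    # unique tie-breaker added

print(row_numbers(rows_a, score_only))       # {'carol': 1, 'alice': 2, 'bob': 3}
print(row_numbers(rows_b, score_only))       # {'carol': 1, 'bob': 2, 'alice': 3}
print(row_numbers(rows_a, score_then_name))  # {'carol': 1, 'alice': 2, 'bob': 3}
print(row_numbers(rows_b, score_then_name))  # {'carol': 1, 'alice': 2, 'bob': 3}
```

In a query, the equivalent is to add enough columns to the window's order by that no two rows in a partition compare equal.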
Should the result of GROUP BY be sorted according to the SQL standard?
Many databases return the sorted results for GROUP BY,
but is it enforced by SQL92 or other standard?
No. GROUP BY has no standard impact on the order of rows returned. That's what ORDER BY is designed to do.
If you're getting some kind of repeatable or predictable sort order returned by a GROUP BY, it's something being done in your DBMS that is not defined in the standards.
As a previous answer has explained, no sorting is ever implied by any basic SQL construct other than ORDER BY.
However, to compute GROUP BY, either an index scan or in-memory sorting may take place (to create the buckets), and such an index scan, or sorting, implies a traversal of the data in sorted order. So it is no accident that a particular database often behaves like this. Do not rely on it, however, because with a different set of indexes, or even just a different query plan (which may be triggered by as little as a few inserts and/or a restart of your database server), the behavior could be quite different.
Notice also that reordering the column list in the ORDER BY clause will result in reliably reordering the output, whereas reordering the column list in a GROUP BY clause will likely have no effect whatsoever.
There is no performance cost of using a seemingly "redundant" ORDER BY. The query plan will likely be identical, if the original one already guaranteed sorted output.
Um, sorting the output of a GROUP BY is not in the standard because there are standard algorithms for grouping that do not produce results in order.
The most common of these is the use of a hash table for doing the group by.
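As a rough sketch of why hash grouping does not give sorted output, here is a group-and-sum in Python using a dict as the hash table. The rows are made up, and a real database's hash table will emit groups in whatever order its internal layout dictates; a Python dict happens to emit them in first-seen order, which is already enough to show the result is not sorted by key.

```python
# Hypothetical (key, value) rows to be grouped by key and summed.
rows = [("pear", 3), ("apple", 1), ("pear", 2), ("fig", 5), ("apple", 4)]

totals = {}
for key, value in rows:
    # One hash lookup per row; no sorting of the input is needed.
    totals[key] = totals.get(key, 0) + value

groups = list(totals.items())
print(groups)          # [('pear', 5), ('apple', 5), ('fig', 5)]
print(sorted(groups))  # [('apple', 5), ('fig', 5), ('pear', 5)]
```

The grouping is correct either way; getting the second form requires an explicit sort, which is what ORDER BY asks for.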
In addition, on a multithreaded server, the data could be sorted, but the results would be returned processor-by-processor. There is no guarantee that the lowest order processor would be the first to return data.
And also, on a parallel machine, the data may be split among the processors using a variety of methods. For instance, all strings that end in "a" may go to one processor. All that end in "b" to another. These could then be sorted locally, but the results themselves would not be sorted overall.
Databases such as MySQL that guarantee a sort after the group by are making a poor design decision. In addition to not conforming to the standard, such databases either limit the choice of algorithm or impose additional processing for ordering.
Kind of a whimsical question, always something I've wondered about and I figure knowing why it does what it does might deepen my understanding a bit.
Let's say I do "SELECT TOP 10 * FROM TableName". In short timeframes, the same 10 rows come back, so it doesn't seem random. They weren't the first or last created. In my massive sample size of...one table, it isn't returning the min or max auto-incrementing primary key value.
I also figure the problem gets more complex when taking joins into account.
My database of choice is MSSQL, but I figure this might be an interesting question regardless of the platform.
If you do not supply an ORDER BY clause on a SELECT statement you will get rows back in arbitrary order.
The actual order is undefined, and depends on which blocks/records are already cached in memory, what order I/O is performed in, when threads in the database server are scheduled to run, and so on.
There's no rhyme or reason to the order and you should never base any expectations on what order rows will be in unless you supply an ORDER BY.
If they're not ordered by the calling query, I believe they're just returned in the order they were read off disk. This may vary because of the types of joins used or the indexes that looked up the values.
You can see this if the table has a clustered index on it (and you're just selecting - a JOIN can re-order things) - a SELECT will return the rows in clustered-index-order, even without an ORDER BY clause.
There is a very detailed explanation with examples here: http://sqlserverpedia.com/blog/sql-server-bloggers/its-the-natural-order-of-things-not/
"How do database servers decide which order to return rows without any “order by” statements?"
They simply do not take any "decision" with respect to ordering. They see the user doesn't care about ordering, and so they don't care either. And thus they simply go out to find the requested rows. The order in which they find them is normally the order in which you get them. That order depends on user-unpredictable things like the chosen physical access paths, ordering of physical records inside the database's physical files, etc. etc.
Don't let yourself be misled by the ordering as you get it, in the case where you didn't explicitly specify an ordering in your query. If you don't specify an ordering in your query, no ordering in the result set is guaranteed, even if in practice results seem to suggest that some ordering appears to be adhered to by the server.