I am executing a SPARQL query in my Java application and I expect the result set to preserve the order in which results are produced by the SPARQL query.
The query returns ordered results when run in the GraphDB editor, but when using Apache Jena, it does not return results in the expected order.
I have observed that this may be related to the UNION joins in the SPARQL query, but I'm not sure if that's the case.
Has anyone else seen this happen, and does anyone know how to solve it?
Thank you!
Unless your query contains an explicit ORDER BY clause there is no guarantee of result ordering. The specification even calls this out in places e.g. Section 15.4 OFFSET says the following:
Using LIMIT and OFFSET to select different subsets of the query solutions will not be useful unless the order is made predictable by using ORDER BY.
Queries without an ORDER BY clause may consistently return results in the same order on a specific backend implementation, but you should not rely on this behaviour. At best this will give you different results in different implementations. At worst, result ordering could change when you upgrade your backend, due to internal implementation or optimization changes.
If you want predictable result ordering you MUST provide an ORDER BY clause in your queries.
I need to run a query that groups the result and orders it. When I used the following query I noticed that the results were ordered by the field name:
SELECT name, count(name)
FROM contacts
GROUP BY name
HAVING count(name)>1
Originally I planned on using the following query:
SELECT name, count(name)
FROM contacts
GROUP BY name
HAVING count(name)>1
ORDER BY name
I'm worried that order by significantly slows the running time.
Can I depend on MS Access to always order by the field I am grouping by, and eliminate the ORDER BY?
EDIT: I tried grouping different fields in other tables and it was always ordered by the grouped field.
I have found answers to this question for other SQL DBMSs, but not for Access.
How GROUP BY and ORDER BY work in general
Databases usually choose between sorting and hashing when creating groups for GROUP BY or DISTINCT operations. If they do choose sorting, you might get lucky and the sorting is stable between the application of GROUP BY and the actual result set consumption. But at some later point, this may break as the database might suddenly prefer an order-less hashing algorithm to produce groups.
In no database should you ever rely on any implicit ordering behaviour. You should always use an explicit ORDER BY. If the database is sophisticated enough, adding an explicit ORDER BY clause will hint that sorting is the better choice for the grouping operation as well, as the sort can then be re-used in the query execution pipeline.
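To make the sorting-versus-hashing distinction concrete, here is a minimal sketch in plain Python rather than in any real database engine. The function names and the sample data are made up for illustration; a real engine's hash-based grouping would emit groups in whatever order its hash table happens to yield.

```python
from collections import Counter
from itertools import groupby


def group_by_sorting(names):
    """Sort-based grouping: groups come out in key order as a side effect."""
    return [(key, len(list(rows))) for key, rows in groupby(sorted(names))]


def group_by_hashing(names):
    """Hash-based grouping: here groups come out in first-seen order;
    a real engine would use whatever order its hash table produces."""
    return list(Counter(names).items())


names = ["carol", "alice", "bob", "alice", "carol"]
sorted_groups = group_by_sorting(names)
hashed_groups = group_by_hashing(names)

print(sorted_groups)  # [('alice', 2), ('bob', 1), ('carol', 2)]
print(hashed_groups)  # [('carol', 2), ('alice', 2), ('bob', 1)]
```

Both strategies compute the same groups and counts; only the order differs, which is exactly the part that an explicit ORDER BY pins down.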
How this translates to your observation
I tried grouping different fields in other tables and it was always ordered by the grouped field.
Have you exhaustively tried all possible queries that could ever be expressed? I.e. have you tried:
JOIN
OUTER JOIN
semi-JOIN (using EXISTS or IN)
anti-JOIN (using NOT EXISTS or NOT IN)
filtering
grouping by many many columns
DISTINCT + GROUP BY (this will certainly break your ordering)
UNION or UNION ALL (which defeats this argument anyway)
I bet you haven't. And even if you tried all of the above, can you be sure there isn't a very peculiar configuration where the above breaks, just because you've observed the behaviour in some (many) experiments?
You cannot.
MS Access specific behaviour
As far as MS Access is concerned, consider the documentation on ORDER BY
Remarks
ORDER BY is optional. However, if you want your data displayed in sorted order, then you must use ORDER BY.
Notice the wording: "you must use ORDER BY". So MS Access is no different from other databases.
The answer
So your question about performance is going in the wrong direction. You cannot sacrifice correctness for performance in this case. Better tackle performance by using indexes.
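As a rough illustration of keeping the ORDER BY and relying on an index instead, here is a sketch that uses SQLite through Python's sqlite3 module as a stand-in, since Access itself is not available here. The index name idx_contacts_name and the sample rows are assumptions, and the query plan that SQLite prints will not match what Access does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO contacts (name) VALUES (?)",
    [("carol",), ("alice",), ("bob",), ("alice",), ("carol",)],
)
# Hypothetical index name; an index on the grouped column lets the engine
# read names in sorted order instead of sorting them itself.
conn.execute("CREATE INDEX idx_contacts_name ON contacts (name)")

query = """
    SELECT name, COUNT(name)
    FROM contacts
    GROUP BY name
    HAVING COUNT(name) > 1
    ORDER BY name
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)  # [('alice', 2), ('carol', 2)]

# Inspect how SQLite plans the query; the output format varies by version.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

The point of the sketch is only that the explicit ORDER BY stays in the query and the index is what addresses the cost; measuring in your own environment is still the only reliable check.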
Here is the MSDN documentation for the GROUP BY clause in Access SQL:
https://msdn.microsoft.com/en-us/library/bb177905(v=office.12).aspx
The page makes no reference to any implied or automatic ordering of results - if you do see desired ordering without an explicit ORDER BY then it is entirely coincidental.
The only way to guarantee the particular ordering of results in SQL is with ORDER BY.
There is a slight performance cost to using ORDER BY in general: the DBMS has to get all of the results before it can output the first row. The DBMS is free to use an "online sort" algorithm that sorts rows as it reads them from its backing store, but it still needs to read the last row before it can return the first row to the client, in case that last row happens to be the first result according to the ORDER BY. However, unless you're querying tens of thousands of rows in a latency-sensitive application this is not a problem, and as you're using Access already it's very clear that this is not a performance-sensitive application.
I am in the process of changing the underlying database from a relational database to MongoDB, and I need to "recreate" the same semantics through MongoDB queries. All in all, this is going fine, with the exception of one thing: the SQL GREATEST() function:
SELECT * FROM my_table
WHERE (GREATEST(FIELD_A, FIELD_B, FIELD_C, FIELD_D)
BETWEEN some_value AND some_value)
AND FIELD_E = another_value;
I cannot seem to find an equivalent to this GREATEST() function. I am aware that it is possible to achieve somewhat similar functionality by using the $cond operator, but as the GREATEST() function here is finding the greatest of 4 values, this would be a lot of conditionals. Is there any other way of achieving this? I have had a look at both the aggregation framework and mapReduce, but I can't seem to find anything directly similar in the aggregation framework and I am having a hard time understanding the mapReduce framework.
Is this even possible to achieve? I would assume that the answer is yes, but I cannot really seem to find a reasonable equivalent way of doing it.
If the query you quoted is what you are trying to replicate, you can take a different route...
You want to find all documents where the greatest of 4 values falls within a range (plus other criteria).
You can rephrase this as: documents where all 4 values are at or below the upper limit and at least one is at or above the lower limit.
Something along the lines of:
db.my_table.find(
// BETWEEN is inclusive, hence $lte / $gte
{field_a:{$lte:some_upper_limit}
,field_b:{$lte:some_upper_limit}
,field_c:{$lte:some_upper_limit}
,field_d:{$lte:some_upper_limit}
,$or:
[{field_a:{$gte:some_lower_limit}}
,{field_b:{$gte:some_lower_limit}}
,{field_c:{$gte:some_lower_limit}}
,{field_d:{$gte:some_lower_limit}}
]
,field_e:another_value
})
Probably a good idea to look at how indexes might help make this efficient, depending on the data, etc...
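The rephrasing can be sanity-checked outside the database. The sketch below is plain Python with made-up function names and bounds; it brute-forces small integer values and compares "greatest value is in the range" against "all values at or below the upper limit and at least one at or above the lower limit". It treats the range as inclusive, as SQL's BETWEEN is, and it ignores NULL or missing fields.

```python
from itertools import product


def greatest_between(values, lower, upper):
    """The original SQL condition: GREATEST(...) BETWEEN lower AND upper."""
    return lower <= max(values) <= upper


def rephrased(values, lower, upper):
    """The rephrased condition: every value <= upper, some value >= lower."""
    return all(v <= upper for v in values) and any(v >= lower for v in values)


lower, upper = 3, 6  # arbitrary illustrative bounds
mismatches = [
    combo
    for combo in product(range(10), repeat=4)
    if greatest_between(combo, lower, upper) != rephrased(combo, lower, upper)
]
print(len(mismatches))  # 0
```

No combination disagrees, because the maximum is at or below a bound exactly when every value is, and at or above a bound exactly when at least one value is.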
MongoDB doesn't currently have an equivalent to the GREATEST function. You could use MapReduce, but it won't provide efficient immediate results. Additionally, you wouldn't effectively be able to return the other fields of the document. You'd need to do more than one query, or potentially duplicate all of the data. And, without running an update process for the results, it wouldn't be up to date as documents were modified, as a MapReduce in MongoDB must be initiated manually.
The MongoDB aggregation framework wasn't intended for this pattern, and if it is possible at all, it would result in a very lengthy pipeline. Also, it's currently limited to 16MB of results and doesn't easily return more than the fields you've aggregated. Returning the equivalent of SELECT * requires a manual field projection, potentially more than once depending on the desired output.
Given that you want to return multiple fields, and the result isn't an aggregation, I'd suggest doing something far simpler:
Precompute the result of a call to the greatest function and store it in the document as a new field for easy access in a variety of queries.
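A minimal sketch of that precomputation, using plain Python dictionaries in place of real MongoDB documents. The field name "greatest" and the helper with_greatest are hypothetical, and how the helper gets wired into your write path (driver calls, update hooks) is left out entirely.

```python
def with_greatest(doc, fields=("field_a", "field_b", "field_c", "field_d")):
    """Return a copy of doc with a precomputed 'greatest' field.

    'greatest' is a hypothetical field name; call this on every insert
    or update so that the stored value never goes stale.
    """
    updated = dict(doc)
    updated["greatest"] = max(doc[f] for f in fields)
    return updated


doc = {"field_a": 4, "field_b": 9, "field_c": 1, "field_d": 7, "field_e": "x"}
stored = with_greatest(doc)
print(stored["greatest"])  # 9
```

With the value stored, the original condition becomes an ordinary range test on a single field, which can also be indexed.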
Should the result of GROUP BY be sorted, according to the SQL standard?
Many databases return sorted results for GROUP BY,
but is this enforced by SQL92 or any other standard?
No. GROUP BY has no standard impact on the order of rows returned. That's what ORDER BY is designed to do.
If you're getting some kind of repeatable or predictable sort order returned by a GROUP BY, it's something being done in your DBMS that is not defined in the standards.
As a previous answer has explained, no sorting is ever implied by any basic SQL construct other than ORDER BY.
However, to compute GROUP BY, either an index scan or an in-memory sort may take place (to create the buckets), and such an index scan or sort implies a traversal of the data in sorted order. So it is no accident that a particular database often behaves like this. Do not rely on it, however, because with a different set of indexes, or even just a different query plan (which may be triggered by as little as a few inserts and/or a restart of your database server), the behavior could be quite different.
Notice also that reordering the column list in the ORDER BY clause will result in reliably reordering the output, whereas reordering the column list in a GROUP BY clause will likely have no effect whatsoever.
There is no performance cost to using a seemingly "redundant" ORDER BY. The query plan will likely be identical if the original one already guaranteed sorted output.
Um, sorting the output of a GROUP BY is not in the standard because there are standard algorithms for grouping that do not produce results in order.
The most common of these is the use of a hash table for doing the group by.
In addition, on a multithreaded server, the data could be sorted, but the results would be returned processor-by-processor. There is no guarantee that the lowest order processor would be the first to return data.
And also, on a parallel machine, the data may be split among the processors using a variety of methods. For instance, all strings that end in "a" may go to one processor. All that end in "b" to another. These could then be sorted locally, but the results themselves would not be sorted overall.
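The partitioning example can be sketched in a few lines of Python. This simulates the "processors" as dictionary buckets keyed by the last character, which is an illustrative assumption; the buckets are concatenated in a fixed order here only so the output is reproducible, whereas a real parallel engine could return them in any order.

```python
from collections import defaultdict


def partitioned_sort(values):
    """Partition by last character, sort each partition, then concatenate."""
    partitions = defaultdict(list)
    for value in values:
        partitions[value[-1]].append(value)
    result = []
    for key in sorted(partitions):  # fixed order only for reproducibility
        result.extend(sorted(partitions[key]))
    return result


words = ["delta", "bob", "alpha", "carb", "zeta", "ab"]
combined = partitioned_sort(words)

print(combined)       # ['alpha', 'delta', 'zeta', 'ab', 'bob', 'carb']
print(sorted(words))  # ['ab', 'alpha', 'bob', 'carb', 'delta', 'zeta']
```

Each partition is sorted locally, yet the combined output is not globally sorted, which is why only a final ORDER BY step can guarantee the overall order.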
Databases such as MySQL that guarantee a sort after the GROUP BY are making a poor design decision. In addition to not conforming to the standard, such databases either limit the choice of algorithm or impose additional processing for ordering.
I've got a MySQL plugin that will return a result set in a specified order. Part of the result set includes a foreign key, and I'd like to join on that key, while ensuring the order remains the same.
If I do something like:
select f.id,
f.title
from sphinx s
inner join foo f on s.id = f.id
where query='test;filter=type,2;sort=attr_asc:stitle';
It looks like I'm getting my results back in the order that Sphinx returns them. Is this a quirk of MySQL, or am I assured that a join won't change the order?
If you need a guaranteed order in the results of a query, use ORDER BY. Anything else is wishful thinking.
To give some insight on this, many databases divide execution steps in a way that can vary depending on the execution plan of the query, the amount of available CPU, and the kinds of optimizations the database can infer are safe. If a query is run in parallel on multiple threads the results can vary. If the explain plan changes, the results can vary. If multiple queries are running simultaneously, the results can vary. If some data is cached in memory, the results can vary.
If you need to guarantee an order, use ORDER BY.
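As a sketch of what "use ORDER BY" can look like for a join, the example below uses SQLite via Python's sqlite3 module. It assumes the pre-ordered source exposes its ranking as a column, here a hypothetical table ranked with a position column; whether the plugin in the question offers anything comparable is not something this sketch can confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'ranked' and its 'position' column are hypothetical stand-ins for a
# source that already knows the desired order of the ids.
conn.execute("CREATE TABLE ranked (id INTEGER, position INTEGER)")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO ranked VALUES (?, ?)", [(3, 1), (1, 2), (2, 3)])
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)", [(1, "one"), (2, "two"), (3, "three")]
)

rows = conn.execute(
    """
    SELECT f.id, f.title
    FROM ranked r
    INNER JOIN foo f ON r.id = f.id
    ORDER BY r.position
    """
).fetchall()
print(rows)  # [(3, 'three'), (1, 'one'), (2, 'two')]
```

The order is carried by data and restated in the ORDER BY, so it no longer depends on which table the engine chooses to drive the join.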
I don't believe that SQL guarantees which table drives the ultimate sort order. Having said that, I would be very surprised if MySQL rewrote your query in such a way that the order changed.
SQL makes no guarantees about the result set order of a SELECT, which includes joins.
You cannot rely on that; SQL does not guarantee the order after such an operation.
There is no specific order guaranteed unless you specify an ORDER BY clause.
Since you mentioned that you were using a plugin that returns result sets in a specified order, I'm assuming that that plugin generates SQL that will add the ORDER BY statement.
If you do joins, one thing to look out for is the column names of the tables you're joining on. If they're named the same, your query might break or order by a different column than intended.
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
GROUP BY NR_DZIALU
HAVING NR_DZIALU = 30
or
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
WHERE NR_DZIALU = 30
GROUP BY NR_DZIALU
The theory (by theory I mean the SQL Standard) says that WHERE restricts the result set before returning rows and HAVING restricts the result set after bringing in all the rows. So WHERE is faster. On DBMSs that comply with the SQL Standard in this regard, only use HAVING where you cannot put the condition in a WHERE (like computed columns in some RDBMSs).
You can just see the execution plan for both and check for yourself, nothing will beat that (measurement for your specific query in your specific environment with your data.)
It might depend on the engine. MySQL for example, applies HAVING almost last in the chain, meaning there is almost no room for optimization. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)
I believe this behavior is the same in most SQL database engines, but I can't guarantee it.
The two queries are equivalent and your DBMS query optimizer should recognise this and produce the same query plan. It may not, but the situation is fairly simple to recognise, so I'd expect any modern system - even Sybase - to deal with it.
HAVING clauses should be used to apply conditions on group functions; otherwise they can be moved into the WHERE condition. For example, if you wanted to restrict your query to groups that have COUNT(NR_DZIALU) > 10, say, you would need to put the condition into a HAVING because it acts on the groups, not the individual rows.
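A small sketch of the two points above, run against SQLite through Python's sqlite3 module rather than against the DBMS in the question. The table layout and sample rows are made up; the sketch only shows that the two queries return the same rows and that a condition on an aggregate has to go in HAVING. It says nothing about relative speed on any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRACOWNICY (ID INTEGER PRIMARY KEY, NR_DZIALU INTEGER)")
conn.executemany(
    "INSERT INTO PRACOWNICY (NR_DZIALU) VALUES (?)",
    [(10,), (20,), (30,), (30,), (30,), (20,)],
)

# Row-level condition written in HAVING.
having_rows = conn.execute(
    """
    SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
    FROM PRACOWNICY GROUP BY NR_DZIALU HAVING NR_DZIALU = 30
    """
).fetchall()

# The same condition written in WHERE.
where_rows = conn.execute(
    """
    SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
    FROM PRACOWNICY WHERE NR_DZIALU = 30 GROUP BY NR_DZIALU
    """
).fetchall()

# A condition on an aggregate can only be expressed in HAVING.
big_groups = conn.execute(
    """
    SELECT NR_DZIALU, COUNT(NR_DZIALU)
    FROM PRACOWNICY GROUP BY NR_DZIALU HAVING COUNT(NR_DZIALU) > 1
    ORDER BY NR_DZIALU
    """
).fetchall()

print(having_rows, where_rows)  # [(30, 3)] [(30, 3)]
print(big_groups)               # [(20, 2), (30, 3)]
```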
I'd expect the WHERE clause would be faster, but it's possible they'd optimize to exactly the same.
Saying they would optimize is not really taking control and telling the computer what to do. I would agree that HAVING is not an alternative to a WHERE clause. HAVING has a special usage: it is applied to a GROUP BY where something like a sum() was used and you want to limit the result set to show only groups having a sum() greater than 100, say. HAVING works on groups, WHERE works on rows. They are apples and oranges. So really, they should not be compared, as they are two very different animals.
"WHERE" is faster than "HAVING"!
The more complex the grouping in the query, the slower "HAVING" will be in comparison, because the "HAVING" filter has to deal with a larger number of results and is also an additional filtering pass.
"HAVING" will also use more memory (RAM).
Although when working with small data the difference is minor and can safely be ignored.
"HAVING" is slower when comparing on a large amount of data, because it works on groups of records while "WHERE" works on individual rows.
"WHERE" restricts results before all rows are brought in, and "HAVING" restricts results after all rows are brought in.
Both statements will have the same performance, as SQL Server is smart enough to compile the two statements into a similar plan.
So, it does not matter whether you use WHERE or HAVING in your query.
Ideally, though, you should use the WHERE clause, since that is the syntactically appropriate place for a row-level condition.