I am in the process of changing the underlying database from a relational database to MongoDB, and I need to "recreate" the same semantics through MongoDB queries. All in all, this is going fine, with the exception of one thing: the SQL GREATEST() function:
SELECT * FROM my_table
WHERE (GREATEST(FIELD_A, FIELD_B, FIELD_C, FIELD_D)
BETWEEN some_value AND some_value)
AND FIELD_E = another_value;
I cannot seem to find an equivalent to this GREATEST() function. I am aware that it is possible to achieve somewhat similar functionality by using the $cond operator, but as the GREATEST() function here is finding the greatest of 4 values, this would require a lot of conditionals. Is there any other way of achieving this? I have had a look at both the aggregation framework and mapReduce, but I can't seem to find anything directly similar in the aggregation framework and I am having a hard time understanding the mapReduce framework.
Is this even possible to achieve? I would assume that the answer is yes, but I cannot really seem to find a reasonable equivalent way of doing it.
If the query you quoted is what you are trying to replicate, you can take a different route...
You want to find all documents where the greatest of 4 values falls between a range (plus other criteria).
You can rephrase this as: documents where all 4 values are at or below the upper limit and at least one is at or above the lower limit (BETWEEN is inclusive, hence $lte/$gte below).
Something along the lines of:
find({
    field_a: {$lte: some_upper_limit},
    field_b: {$lte: some_upper_limit},
    field_c: {$lte: some_upper_limit},
    field_d: {$lte: some_upper_limit},
    $or: [
        {field_a: {$gte: some_lower_limit}},
        {field_b: {$gte: some_lower_limit}},
        {field_c: {$gte: some_lower_limit}},
        {field_d: {$gte: some_lower_limit}}
    ]
})
Probably a good idea to look at how indexes might help make this efficient, depending on the data, etc...
MongoDb doesn't currently have the equivalent to the GREATEST function. You could use a MapReduce, but it won't provide efficient immediate results. Additionally, you wouldn't effectively be able to return the other fields of the document. You'd need to do more than one query, or potentially duplicate all of the data. And, without running an update process for the results, it wouldn't be up to date as documents were modified, as a Map Reduce in MongoDb must be initiated manually.
The MongoDb aggregation framework wasn't intended for this pattern, and if it is possible, would result in a very lengthy pipeline. Also, it's currently limited to 16MB of results and doesn't easily return more than the fields you've aggregated. Returning select * requires a manual field projection, potentially more than once depending on the desired output.
Given that you want to return multiple fields, and the result isn't an aggregation, I'd suggest doing something far simpler:
Precompute the result of a call to the greatest function and store it in the document as a new field for easy access in a variety of queries.
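For example, a rough sketch in the mongo shell (collection and field names follow the question's SQL; greatest_abcd is a hypothetical field name, and some_lower_limit/some_upper_limit/another_value are placeholders as above):

// Backfill the precomputed value for the existing documents:
db.my_table.find().forEach(function (doc) {
    var greatest = Math.max(doc.field_a, doc.field_b, doc.field_c, doc.field_d);
    db.my_table.update({_id: doc._id}, {$set: {greatest_abcd: greatest}});
});

// Keep the field in sync on every write from the application; the original
// SQL then becomes a simple, indexable query:
db.my_table.find({
    greatest_abcd: {$gte: some_lower_limit, $lte: some_upper_limit},
    field_e: another_value
});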
Given a messy postgres query (e.g. with lots of subqueries) is there a way to figure out what columns will be returned by the query without running the query itself?
If I understand correctly, Sequel's Dataset#columns method (Documentation) calls the query with a LIMIT 1 attached. That's fine for a simple query, but if subqueries are involved it seems that this approach still results in computing those subqueries.
(One approach might be to add a LIMIT 1 to every subquery, but I'm not exactly sure how to go about doing that.)
I'm using Postgres 9.2 with Sequel.
Thanks! (I know this question isn't as precisely posed as might be desirable -- please let me know what more information I can provide that might be helpful.)
You can do this with EXPLAIN, adding the VERBOSE option. Have a look here:
http://www.postgresql.org/docs/9.1/static/sql-explain.html
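For example (a sketch with made-up table and column names):

-- EXPLAIN only plans the statement (it does not execute it unless ANALYZE
-- is used), and VERBOSE makes each plan node list its output columns, so
-- the top node shows exactly which columns the query would return.
EXPLAIN (VERBOSE)
SELECT o.id, c.name, (SELECT max(total) FROM orders) AS biggest_order
FROM orders o
JOIN customers c ON c.id = o.customer_id;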
Is using the DISTINCT command in SQL good practice or not? Are there any drawbacks to the DISTINCT command?
It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.
If you need to use it regularly to get the correct output, then you have a design or JOIN issue.
It's perfectly valid for use otherwise.
It is a kind of aggregate though: the equivalent of a GROUP BY on all output columns. So it is an extra step in query processing.
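To illustrate that equivalence (hypothetical table and columns), these two queries return the same rows:

-- DISTINCT over the output columns...
SELECT DISTINCT dept, job_title FROM employees;

-- ...is equivalent to grouping by all of them.
SELECT dept, job_title
FROM employees
GROUP BY dept, job_title;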
From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Beginners who see duplicates in their result set sometimes reach for DISTINCT, but this has its own disadvantages.
DISTINCT decreases the query's performance, because the usual procedure is to sort the results and then remove each row that is equal to the row immediately before it.
DISTINCT compares all fields of the record, so it increases computation.
It is part of the language, so should be used.
In some circumstances using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.
If you want the work to make sure the results are distinct to happen inside the SQL server on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load) then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.
When I write SQL queries, I find myself often thinking that "there's no way to do this with a single query". When that happens I often turn to stored procedures or multi-statement table-valued functions that use temp tables (of one sort or another) and end up simply combining the results and returning the result table.
I'm wondering if anyone knows, simply as a matter of theory, whether it should be possible to write ANY query that returns a single result set as a single query (not multiple statements). Obviously, I'm ignoring relevant points such as code readability and maintainability, maybe even query performance/efficiency. This is more about theory - can it be done... and don't worry, I certainly don't plan to start forcing myself to write a single-statement query when multi-statement will better suit my purpose in all cases, but it might make me think twice or a little bit longer on whether there is a viable way to get the result from a single query.
I guess a few parameters are in order - I'm thinking of a relational database (such as MS SQL) with tables that follow common best practices (such as all tables having a primary key and so forth).
Note: in order to win 'Accepted Answer' on this, you'll need to provide a definitive proof (reference to web material or something similar.)
I believe it is possible. I've worked with very difficult queries, very long queries, and often, it is possible to do it with a single query. But most of the time, it's harder to maintain, so if you do it with a single query, make sure you comment your query carefully.
I've never encountered something that could not be done in a single query.
But sometimes it's best to do it in more than one query.
At least with a recent version of Oracle it is absolutely possible. It has a 'model clause' which makes SQL Turing complete. ( http://blog.schauderhaft.de/2009/06/18/building-a-turing-engine-in-oracle-sql-using-the-model-clause/ ). Of course this is all with the usual limitation that we don't really have unlimited time and memory.
For a normal SQL dialect without these abominations I don't think it is possible.
A task that I can't see how to implement in 'normal sql' would be:
Assume a table with a single column of type integer
For every row
'take the value at the current row and go that many rows back, fetch that value, go that many rows back, and continue until you fetch the same value twice consecutively and return that as the result.'
I can't prove it, but I believe the answer is a cautious yes - provided your database design is done properly. Usually being forced to write multiple statements to get a certain result is a sign that your schema may need some improvements.
I'd say "yes" but can't prove it. However, my main thought process:
Any select should be a set based operation
Your assumption is that you are dealing with mathematically correct sets (ie normalised correctly)
Set theory should guarantee it's possible
Other thoughts:
Multiple SELECT statements often load temp tables/table variables. These can be rewritten as derived tables or CTEs (see the sketch after this list).
Any RBAR processing (for good or bad) can now be dealt with using CROSS/OUTER APPLY onto derived tables
UDFs would be classed as "cheating" in this context I feel, because they allow you to put a SELECT into another module rather than into your single one
No writes allowed in your "before" sequence of DML: a write would change state between one SELECT and the next
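As a rough illustration of the temp-table point above (hypothetical tables and columns), here is a two-step pattern and its single-statement CTE equivalent:

-- Multi-statement version using a temp table:
SELECT dept, SUM(salary) AS total
INTO #dept_totals
FROM employees
GROUP BY dept;

SELECT * FROM #dept_totals WHERE total > 100000;

-- Single-statement equivalent using a CTE:
WITH dept_totals AS (
    SELECT dept, SUM(salary) AS total
    FROM employees
    GROUP BY dept
)
SELECT * FROM dept_totals WHERE total > 100000;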
Have you seen some of the code in our shop?
Edit, glossary
RBAR = Row By Agonising Row
CTE = Common Table Expression
UDF = User Defined Function
Edit: APPLY: cheating?
SELECT *
FROM MyTable1 t1
CROSS APPLY
(
    SELECT * FROM MyTable2 t2
    WHERE t1.something = t2.something
) t2
In theory yes, if you use functions or a torturous maze of OUTER APPLYs or sub-queries; however, for readability and performance, we have always ended up going with temp tables and multi-statement stored procedures.
As someone above commented, this is usually a sign that your data structure is starting to smell; not that it's bad, but that maybe it's time to denormalise for performance reasons (happens to the best of us), or maybe put a denormalised querying layer in front of your normalised "real" data.
I am concentrating this question on 'reporting-type' queries (count, avg etc. i.e. ones that don't return the domain model itself) and I was just wondering if there is any inherent performance benefit in using HQL, as it may be able to leverage the second level cache. Or perhaps even better - cache the entire query.
The obvious implied benefit is that NHibernate knows what the column names are as it already knows about the model mapping.
Any other benefits I should be aware of?
[I am using NHibernate but I assume that in this instance what applies to Hibernate will be equally applicable to NHibernate]
There are zero advantages. HQL will not outperform a direct database query to perform data aggregation and computation.
The result of something like:
Select count(*), dept from employees group by dept
Will always perform faster at the DB than in HQL. Note I say always because it is lame to take the 'depends on your situation' line of thinking. If it has to do with data and data aggregation, do it in SQL.
The objects in the second level cache are only retrieved by id, so Hibernate will always run a query to obtain a list of ids and then read those objects either from the second-level cache or with another query.
On the other hand, Hibernate can cache the query and avoid the DB call completely in some situations. However you have to consider that a change to any of the tables involved in the query will invalidate it, so you might not hit the cache very often. See here a description of how the query-cache works.
So the cost of your query is either 0, if the query is cached, or about the same as doing the query in straight SQL. Depending on how often your data changes you might save a lot by enabling query caching or you might not save anything.
If you have a high volume of queries and you can tolerate stale results, I'd say it's a lot better to use another cache for the query results that only expires every x minutes.
The only advantage I can think of is that ORM queries are typically cached at the (prepared) statement level, so if you run the same query lots of times, chances are you are reusing a prepared statement.
But since you asked specifically about reporting queries and performance, I cannot think of any practical advantages (I'm glossing over the fact that you have other advantages like data-access consistency, ORM querying vs SQL (most of the time it's easier to write a query with HQL), data-type conversions, etc.).
HQL is an object query language. SQL is a relational query language.
I’ve just found out that the execution plan performance of the following two select statements is massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The relative costs in the execution plan are 98% and 2% respectively. Bit of a difference in speed then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ generated SQL against my hand crafted SQL. I assumed the LIKE command would be slower, but is in fact much much faster.
My question is: why is LEFT() slower than LIKE '2505%'? They are, after all, identical?
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never use a function on the left side of a comparison in a WHERE clause. If you do, SQL won't use an index--it has to evaluate the function for every row of the table. The goal is to make sure that your WHERE clause is "sargable".
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
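If a rewrite like the ones above isn't possible, another option (a sketch in SQL Server syntax, using the question's table and column names with made-up names for the computed column and index) is to index a computed column so the expression itself becomes sargable:

-- Add a computed column for the expression and index it; the optimizer can
-- then seek on the index instead of evaluating LEFT() for every row.
ALTER TABLE your_large_table
    ADD some_string_prefix AS LEFT(some_string_field, 4);

CREATE INDEX IX_your_large_table_prefix
    ON your_large_table (some_string_prefix);

SELECT * FROM your_large_table
WHERE some_string_prefix = '2505';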
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
There's a huge impact when using function calls in WHERE clauses, as SQL Server must calculate the result for each row. On the other hand, LIKE is a built-in language feature which is highly optimized.
If you use a function on a column with an index then the db no longer uses the index (at least with Oracle anyway)
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows about the length of the prefix and so on, so in a C/C++/... program, or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimizations are done for you. For example LIKE is probably implemented by first looking for the % sign, and if it is noticed that the % is the last char in the string, the query can be optimized in much the same way as you did using LEFT, but directly using an index.
So, indeed I think you were right after all, they probably are identical in their approach. The only difference being that the db server can use an index in the query using LIKE because there is not a function transforming the column value to something unknown in the WHERE clause.
What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate and is capable of using it on the LIKE, or it simply made the wrong call in which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to estimate the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (eg. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan or the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
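For instance (SQL Server syntax, using the question's table and column names; the index name is made up), the last point might amount to something like:

-- Refresh statistics so the optimiser has accurate row-count estimates,
-- and make sure an index actually exists on the filtered column.
UPDATE STATISTICS your_large_table;

CREATE INDEX IX_your_large_table_some_string_field
    ON your_large_table (some_string_field);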
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As #BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)