This might seem to be a silly question at first, but please read on.
I know that LINQ queries are deferred and only executed when the query is enumerated, but I'm having trouble figuring out exactly when that happens. Certainly in a For Each loop, the query would be enumerated. What's the rule of thumb to follow? I don't want to accidentally enumerate over my query twice if it's a huge result.
For example, does System.Linq.Enumerable.First enumerate over the whole query? I ask for performance reasons. I want to pass a LINQ result set to an ASP.NET MVC view, and I also want to pass the First element separately. Enumerating over the results twice would be painful.
It would be great to turn on some kind of flag that alerts me each time a LINQ query is enumerated. That way I could catch scenarios when I accidentally enumerate twice.
You can add your own logging quite easily to see what's going on. Other than that, the lazy/eager bit is reasonably clear. Basically it's lazy when it can be - any time the return type is IEnumerable<T> or IOrderedEnumerable<T>. It's possible for those to be lazy because you can't get at any of the data without calling GetEnumerator(). Compare that to First() for example - it has to return a value to you. It can't defer anything.
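For example, here's a rough sketch of that logging idea for LINQ to Objects (the Log extension method is my own invention, not part of the framework):

using System;
using System.Collections.Generic;
using System.Linq;

static class EnumerationLogging
{
    // Wraps a sequence and logs each time enumeration of it begins (and when it runs to the end).
    public static IEnumerable<T> Log<T>(this IEnumerable<T> source, string name)
    {
        Console.WriteLine(name + ": enumeration started");
        foreach (var item in source)
        {
            yield return item;
        }
        Console.WriteLine(name + ": enumeration finished");
    }
}

// usage - "numbers" is any in-memory sequence; the "started" line prints once per enumeration,
// so accidentally enumerating the query twice shows up immediately in the output
var query = numbers.Where(n => n > 10).Log("big numbers");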
As a general point, if you want to make sure that a query won't be evaluated more than once, call ToList or ToArray on it, then use the results of that several times. Again, those methods have to return a list or an array immediately, neither of which allows for lazy population. The query is evaluated, but then it's effectively disconnected from the resulting populated collection - the query won't be executed again, however much you examine the list.
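As a sketch of the MVC scenario in the question (the view model type is made up), that looks something like:

// Evaluate the query exactly once...
var results = query.ToList();

// ...then everything below just reads the in-memory list; the query is never re-executed.
var first = results.First();
return View(new ResultsViewModel { First = first, All = results });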
In addition to the lazy/eager question, there's streaming/non-streaming: will the method read everything from the source enumerable, or just "sip" at it, reading when it needs to? Again, in general LINQ will only read when it has to - so while Reverse is non-streaming (but still lazy), Where and Select are streaming.
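A quick way to see the difference (the Numbers iterator is just a made-up source that reports each element as it's pulled):

static IEnumerable<int> Numbers()
{
    for (int i = 0; i < 1000; i++)
    {
        Console.WriteLine("yielding " + i);
        yield return i;
    }
}

// Streaming: Where only pulls elements until the first match, so this prints just a handful of lines.
var firstBig = Numbers().Where(n => n > 5).First();

// Non-streaming (but still lazy): Reverse has to drain the whole source before it can yield anything.
var lastFirst = Numbers().Reverse().First();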
There is no hard and fast rule as to when a LINQ query will be enumerated and when it won't, partly because some methods will or won't enumerate depending on the underlying type of the query source.
Here is a quick breakdown. It is not a complete breakdown by any means, mainly what I could come up with in 5 minutes.
Aggregate Functions
They enumerate the list entirely and immediately. They are usually the extension methods that return a scalar value, for example Sum, Min, Max, Count, Last, etc.
Note: Count and Last do not necessarily enumerate the entire list. If the underlying type implements ICollection<T> (for Count) or IList<T> (for Last), they use a more efficient path instead of enumerating.
Front of the list Selectors
They only look at the first element of the list and potentially the second. They are First, FirstOrDefault, Single, SingleOrDefault.
The above refers to the versions which do not take a predicate. If they take a predicate, they are better classified as Inquiries (see below).
Inquiries
They will only enumerate the minimal amount of the list necessary to do the operation. This can be as little as 1 element and as many as the entire list.
Examples: Any, Contains
Operators which create a new sequence and do no enumeration immediately
This is the vast majority of the operators in LINQ. Their cost is incurred when the new sequence is enumerated. Examples: Select, Where, GroupBy, Join, SkipWhile, Skip.
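To make the categories concrete, a small sketch (the data is invented):

var source = new List<int> { 1, 2, 3, 4, 5 };

var query = source.Where(n => n % 2 == 1).Select(n => n * 10); // deferred: nothing is enumerated yet
int first = query.First();                                     // front of the list: stops after the first element
bool any = query.Any(n => n > 40);                             // inquiry: reads only as far as it needs to
int sum = query.Sum();                                         // aggregate: enumerates everything, immediately

// Note that first, any and sum each enumerate query again from the start.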
Related
I have some list: List<Pet>
I need to take 3 elements from it that satisfy the condition:
I do it like this:
list.filter { g -> !mainList.contains(g) }.take(3)
How can you optimize this code so that the filter operation runs only as many times as needed to produce the result?
Use a sequence. Instead of fully processing each step in turn (i.e. creating a new filtered list, then taking 3 elements from that), each element passes through the chain one at a time. So you don't create intermediate lists, and you can stop as soon as you reach the element that meets your terminating condition.
list.asSequence()
.filter { g -> !mainList.contains(g) }
.take(3)
.toList()
Note you have to execute the sequence (with toList in this case) to turn it into a concrete collection.
Also, as the link says, creating a sequence does introduce a bunch of overhead, so it's not necessarily more efficient - you'll see benefits with bigger collections, more steps in the chain, slower computation in things like the filter functions... anything where minimising memory use or exiting early makes a big difference. You should benchmark it to make sure it's the right thing to do!
Also as a bonus, your filter operation could be filterNot(mainList::contains) if you like!
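Putting that together, the same chain would read:

list.asSequence()
    .filterNot(mainList::contains)
    .take(3)
    .toList()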
Curious which of these is better performance-wise. If you have a User with many PlanDates, and you know your user is the one with an id of 60, stored in a variable current_user, is it better to do:
plan_dates = current_user.plan_dates.select { |pd| pd.attribute == test }
OR
plan_dates = PlanDate.joins(:user).where("plan_dates.attribute" => test).where("users.id" => 60)
Asking because I keep reading about the dangers of using select since it builds the entire object...
select is discouraged because, unlike the ActiveRecord::Relation methods (where, joins, etc.), it's a Ruby method from Enumerable, which means the entire current_user.plan_dates relation must be loaded into memory before the selection can begin. That may not make a difference at a small scale, but if an average user has 3,000 plan dates, then you're in trouble!
So, your second option, which uses just one SQL query to get the result, is the better choice. However, you can also write it like so:
current_user.plan_dates.where(attribute: test)
This is still just one SQL query, but it leverages the power of ActiveRecord::Relation for a more expressive result.
The second. select has to load the records and compare objects at the Ruby level, while the second is just a database query.
In addition, the second expression may not actually be executed until you use the variable, whereas the first is always executed immediately.
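For example (a sketch - the created_at column is made up), the relation stays lazy and composable until you actually use it:

plan_dates = current_user.plan_dates.where(attribute: test)  # no SQL yet
plan_dates = plan_dates.order(:created_at).limit(20)         # still no SQL, just builds up the relation
plan_dates.each { |pd| puts pd.attribute }                   # the single query runs here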
I have a utility function in my program for searching for entities. It takes a max_count parameter. It returns a QuerySet.
I would like this function to limit the max number of entries. The standard way would be to take a slice out of my QuerySet:
return results[:max_count]
My problem is that the views which utilize this function sort in various ways by using .order_by(). This causes exceptions as re-ordering is not allowed after taking a slice.
Is it possible to force a "LIMIT 1000" into my SQL query without taking a slice?
Do results[:max_count] in a view, after .order_by(). Don't be afraid of requesting too much from the DB: the query won't be evaluated until you slice it, and even the slice itself doesn't evaluate it - it just adds a LIMIT to the SQL.
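A minimal sketch of that in a view (find_entities, the field name and the template are placeholders):

from django.shortcuts import render

def entity_list(request, max_count=1000):
    results = find_entities(request)                  # the utility function, returning an unsliced, unevaluated QuerySet
    page = results.order_by('-created')[:max_count]   # order first, then slice; still lazy, just adds LIMIT to the SQL
    return render(request, 'entities.html', {'entities': page})  # the query runs when the template iterates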
You could subclass QuerySet to achieve this, by simply ignoring every slice in __getitem__ and applying [:max_count] last, but I don't think it's worth the complexity and side effects.
If you are worrying about memory consumption by large queryset, follow http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/
For normal usage, just follow DrTyrsa's suggestion. You could write a shortcut that applies the order_by and then the slice, to keep your view code simple.
I am new to grails and groovy.
Can anyone please explain to me the difference between these two groovy sql methods
sql.eachRow
sql.rows
Also, which is more efficient?
I am working on an application that retrieves data from the database (the result set is very large) and writes it to a CSV file or returns it in JSON format.
I was wondering which of the two methods mentioned above to use to have the process done faster and efficient.
Can anyone please explain to me the difference between these two groovy sql methods sql.eachRow sql.rows
It's difficult to tell exactly which two methods you're referring to because there are a large number of overloaded versions of each method. However, in all cases, eachRow returns nothing
void eachRow(String sql, Closure closure)
whereas rows returns a list of rows
List rows(String sql)
So if you use eachRow, the closure passed in as the second parameter should handle each row, e.g.
sql.eachRow("select * from PERSON where lastname = 'murphy'") { row ->
println "$row.firstname"
}
whereas if you use rows the rows are returned, and therefore should be handled by the caller, e.g.
rows("select * from PERSON where lastname = 'murphy'").each {row ->
println "$row.firstname"
}
Also, which is more efficient?
This question is almost unanswerable. Even if I had implemented these methods myself there's no way of knowing which one will perform better for you because I don't know
what hardware you're using
what JVM you're targeting
what version of Groovy you're using
what parameters you'll be passing
whether this method is a bottleneck for your application's performance
or any of the other factors that influence a method's performance that cannot be determined from the source code alone. The only way you can get a useful answer to the question of which method is more efficient for you is by measuring the performance of each.
Despite everything I've said above, I would be amazed if the performance difference between these two was in any way significant, so if I were you, I would choose whichever one you find more convenient. If you find later on that this method is a performance bottleneck, try using the other one instead (but I'll bet you a dollar to a dime it makes no difference).
If we set aside minor syntax differences, there is one difference that seems important. Let's consider
sql.rows("select * from my_table").each { row -> doIt(row) }
vs
sql.eachRow("select * from my_table") { row -> doIt(row) }
The first one opens a connection, retrieves the results, closes the connection, and returns them. Now you can iterate over the results while the connection is released. The drawback is that you now have the entire result list in memory, which in some cases might be a lot.
eachRow, on the other hand, opens a connection and, while keeping it open, executes your closure for each row. If your closure operates on the database and requires another connection, your code will consume two connections from the pool at the same time. The connection used by eachRow is released only after it has iterated through all the resulting rows. Also, if you don't perform any database operations but the closure takes a while to execute, you will be blocking one database connection until eachRow completes.
I am not 100% sure but possibly eachRow allows you not to keep all resulting rows in memory but access them through a cursor - this may depend on the database driver.
If you don't perform any database operations inside your closure, the closure executes quickly, and the result list is big enough to impact memory, then I'd go for eachRow. If you do perform DB operations inside the closure, or each closure call takes significant time while the result list is manageable, then go for rows.
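For the CSV case in the question, a sketch of the eachRow approach (file name and columns are invented) - each row is written out as it is read, so the full result set never has to sit in memory:

new File('export.csv').withWriter { writer ->
    sql.eachRow('select id, name from PERSON') { row ->
        writer.writeLine("${row.id},${row.name}")
    }
}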
They differ in signature only - both support result-set paging, so both will be efficient. Use whichever fits your code.
I am making an application that represents a cell phone bill using Core Data. I have three entities: Bill, Line, and Calls. Bills can have many lines, and lines can have many calls. All of this is set up with relationships.
Right now, I have a table view that displays all of the bills. When you double click on a bill, a sheet comes down with a popup box that lists all of the lines on the bill. Below the popup box is a box that has many labels that display various information about that line. Below that information I want to list the top 5 numbers called by that line in that month.
Lines has a to-many relationship with Calls, which has two fields, number and minutes. I have all of the calls for the selected line loaded into an NSArrayController with a sort descriptor that properly arranges the values. How do I populate 5 labels with the top 5 values of this array controller?
EDIT: The array of calls is already unique, when I gather the data, I combine all the individual calls into total minutes per number for each month. I just need to sort and display the first 5 records of this combined array.
I may be wrong (and really hope I am), but it looks like you'll need to use brute force on this one. There are no set / array operators that can help, nor does NSPredicate appear to help.
I think this is actually a bit tricky and it looks like you'll have to do some coding. The Core Data Programming Guide says:
If you execute a fetch directly, you should typically not add Objective-C-based predicates or sort descriptors to the fetch request. Instead you should apply these to the results of the fetch. If you use an array controller, you may need to subclass NSArrayController so you can have it not pass the sort descriptors to the persistent store and instead do the sorting after your data has been fetched.
I think this applies to your case because it's important to consider whether sorting or filtering takes place first in a fetch request (when the fetch request's predicate and sort descriptors are set). This is because you'll be tempted to use the #distinctUnionOfObjects set/array operator. If the list is collapsed to uniques before sorting, it won't help. If it's applied after sorting, you can just set the fetch request's limit to 5 and there are your results.
Given the documentation, I don't know that this is how it will work. Also, in this case, it might be easier to avoid NSArrayController for this particular problem and just use NSTableViewDataSource protocol, but that's beyond the scope of this Q&A.
So, here's one way to do it:
Create a predicate to filter for the selected bill's line items.*
Create a sort descriptor to sort the line items by their telephone numbers (which are hopefully in a standardized format internally, else trouble awaits) via #"call.number" in your case.
Create a fetch request for the line item entity with the predicate and sort descriptors, then execute it.**
With those sorted results, it would be nice if you could collapse and "unique" them easily, and again, you'll be tempted to use #distinctUnionOfObjects. Unfortunately, set/array operators won't be any help here (you can't use them directly on NSArray/NSMutableArray or NSSet/NSMutableSet instances). Brute force it is, then.
I'd create a topFive array and loop through the results, adding the number to topFive if it's not there already, until topFive has 5 items or until I'm out of results.
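In code, that loop is short; here's a rough sketch (written in Swift for brevity - sortedCalls and number stand in for your fetched, sorted Call objects and their attribute):

var topFive: [String] = []
for call in sortedCalls {                  // the sorted fetch results
    let number = call.number               // assumes each Call has a string `number` attribute
    if !topFive.contains(number) {
        topFive.append(number)
    }
    if topFive.count == 5 { break }        // stop as soon as we have five distinct numbers
}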
Displaying it in your UI (using Bindings or not) is, as I said, beyond the scope of this Q&A, so I'll leave it there. I'd LOVE to hear if there's a better way to do this - it's definitely one of those "looks like it should be easy but it's not" kind of things. :-)
*You could also use #unionOfObjects in your key path during the fetch itself to get the numbers of the calls of the line items of the selected bill, which would probably be more efficient a fetch, but I'm getting tired of typing, and you get the idea. ;-)
**In practice I'd probably limit the fetch request to something reasonable - some bills (especially for businesses and teenagers) can be quite large.