In a few podcasts I listened to about PLINQ, it was said that the original idea was to make queries unordered by default for performance reasons, letting the developer opt into ordering when it mattered. But it was then stated that this was going to change to ordered by default, so that developers who wanted extra performance and didn't care about order could opt out.
All the examples and docs I've seen use .AsOrdered(), which leads me to believe that it's still unordered by default.
Can someone shed some light on this?
They are absolutely unordered by default, and that's exactly why AsOrdered exists. Using AsOrdered introduces an extra step, and thus extra overhead, to take the results from the disparate worker threads and get them into the proper order. Also, if it's not obvious, AsOrdered is blocking (see update below), which means results will not progress through the PLINQ query pipeline until items have arrived in the order they originally started in.
Finally, note that AsUnordered exists so that you can switch the query back to an unordered query from that point forward in the query pipeline.
UPDATE:
I just want to clarify what I mean by "blocking". PLINQ has to observe the original order of the elements at the point where they are handed to the ParallelQuery. From there it ensures that elements which complete out of order are buffered until the earlier elements have completed. So if you have elements in the order "one", "two", "three" and "two" finishes before "one", then "two" will be buffered until "one" has completed.
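Here's a minimal sketch of that behaviour (the data and projection are made up purely for illustration):

using System;
using System.Linq;

class Demo
{
    static void Main()
    {
        var numbers = Enumerable.Range(0, 20);

        // Default: unordered. Results are yielded as the worker threads finish them.
        var unordered = numbers.AsParallel()
                               .Select(n => n * 2);

        // AsOrdered: out-of-order results are buffered so the output matches
        // the source order, at the cost of that buffering.
        var ordered = numbers.AsParallel()
                             .AsOrdered()
                             .Select(n => n * 2);

        Console.WriteLine(string.Join(",", unordered)); // order not guaranteed
        Console.WriteLine(string.Join(",", ordered));   // 0,2,4,6,...
    }
}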
PLINQ will not order your results unless you specify AsOrdered. Since Parallel LINQ will execute work concurrently, it may finish the second item before the first.
Say you have a class that looks like this:
public class Foo
{
    // Assume Bar takes an unpredictable amount of time, e.g. because it checks a file.
    public int Bar() { return something; }
}
Let's say Bar on different instances of Foo takes an indeterminate amount of time to complete because it checks a file. So say we have a case where item A's Bar takes 10 seconds to complete, but item B's takes 1. Since B finished first, it ends up on top.
.NET won't order the results by default because preserving order means holding completed items back until all earlier items have completed. You can't order what you don't yet have. So for performance reasons, queries are unordered by default; AsOrdered indicates that order is important, but at the cost of that blocking.
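To tie that back to the Foo example, a sketch (fooA and fooB are assumed instances whose Bar takes roughly 10 seconds and 1 second respectively):

var results = new[] { fooA, fooB }.AsParallel()
                                  .Select(f => f.Bar());

foreach (var r in results)
    Console.WriteLine(r);   // B's result typically prints first: unordered by default

Adding .AsOrdered() after .AsParallel() would make A's result come back first again, at the cost of buffering B's result until A completes.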
I have a list: List<Pet>
I need to take 3 elements from it that satisfy the condition:
I do it like this:
list.filter { g -> !mainList.contains(g) }.take(3)
How can I optimize this code so that the filter operation runs no more times than necessary to produce the result?
Use a sequence. Instead of fully processing each step in turn (i.e. creating a new filtered list, then taking 3 elements from that), each element passes through the chain one at a time. So you don't create intermediate lists, and you can stop as soon as you reach the element that meets your terminating condition.
list.asSequence()
    .filter { g -> !mainList.contains(g) }
    .take(3)
    .toList()
Note you have to execute the sequence (with toList in this case) to turn it into a concrete collection.
Also, as the link says, creating a sequence does introduce some overhead, so it's not necessarily more efficient. You'll see benefits with bigger collections, more steps in the chain, slower computation in things like the filter functions - anything where it makes sense to minimise memory use or where exiting early makes a big difference. You should benchmark it to make sure it's the right thing to do!
Also as a bonus, your filter operation could be filterNot(mainList::contains) if you like!
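Putting both suggestions together (using the names from the question), a minimal sketch:

val result = list.asSequence()
    .filterNot(mainList::contains)
    .take(3)
    .toList()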
While adding scriptability to my Mac program, I am struggling with the common problem of deleting items from an indexed array, where the item indexes shift as items are removed.
Let's say my app maintains a data store in which objects of type "Person" are stored. In the sdef, I've defined the Cocoa key allPersons to access these elements. My app declares an NSArray *allPersons.
So far, it works well. E.g., this script works fine:
repeat with p in every person
    get name of p
end repeat
The problem starts when I want to support deletion of items, like this:
repeat with p in (get every person)
    delete p
end repeat
(I realize that I could just write "delete every person", which works fine, but I want to show how "repeat" makes things more complicated).
This does not work, because AppleScript keeps using the original item numbers to reference the items even after some of them have been deleted, which naturally shifts the remaining items and their numbering.
So, considering we have 3 Persons, "Adam", "Bonny" and "Clyde", this will happen:
get every person
--> {person 1, person 2, person 3}
delete person 1
delete person 2
delete person 3
--> error number -1719 from person 3
After deleting item 1 (Adam), the remaining items get renumbered to items 1 and 2. The second iteration deletes item 2 (which is now Clyde), and the third iteration attempts to delete item 3, which no longer exists at that point.
How do I solve this?
Can I force the scripting engine to not address the items by their index number but instead by their unique ID so that this won't happen?
It's not your ObjC code, it's your misunderstanding of how repeat with VAR in EXPR loops work. (Not really your fault either: they're 1. counterintuitive, and 2. poorly explained.) When it first encounters your repeat statement, AppleScript sends your app a count event to get the number of items specified by EXPR, which in this case is an object specifier (query) that identifies all of the person elements in whatever. It then uses that information to generate its own sequence of by-index object specifiers, counting from 1 up to the result of the aforementioned count:
person 1 of whatever
person 2 of whatever
...
person N of whatever
What you need to realize is that an object specifier is a first-class query, not an object pointer (not that Apple tell you this either): it describes a request, not an object. Ignore the purloined jargon: Apple event IPC's nearest living relatives are RDBMSes, not Cocoa or SOAP or any of the OO messaging crud that modern developers so fixate on as The One True Way To Do... well, EVERYTHING.
It's only when that query is sent to your application in an Apple event that it is evaluated against the relational graph that your Apple event IPC View-Controller – aka the "Apple Event Object Model" – presents as an idealized, user-friendly representation of your Model's user data. Only then does it resolve to the specific Model object (or objects) on which the event handler should perform the requested operation.
Thus, when the delete command in your repeat loop tells your app to delete person 1 of whatever, all your remaining elements move down by one. But on the next iteration the repeat loop still generates the object specifier person 2 of whatever, which your script then sends off to your app, which resolves it to the second item in the collection – which was originally the third item, of course, until you shifted them all about.
Or, to borrow a phrase:
Nothing in AppleScript makes sense except in light of relational queries.
In fact, Apple events' query-based approach actually makes a lot of sense considering it was originally designed to be efficient over very high-latency connections (i.e. System 7's abysmally inefficient process switcher), allowing a single Apple event carrying one or more complex queries to manipulate many objects at once. It's even quite elegant [when it works right], but is quite undone by the idiots at Cupertino who think the best way to make programmers not hate the technology is to lie even harder about how it actually works.
So, I suggest you go read this, which is not the best explanation either but still a damn sight better than anything you'll get from those muppets. And this, by its original designer, which explains a lot of the rationale for creating a high-level, coarse-grained, query-based IPC system instead of the usual low-level, fine-grained OO message-passing crap.
Oh, and once you've done that, you might want to consider running this instead:
delete every person whose name is "bob"
which is pretty much the whole point of creating a thick declarative-y abstraction that does all the work so the user doesn't have to.
And when nothing but an imperative client-side loop will do, you either want to get a list of by-ID object specifiers (which are the closest things to safe, persistent pointers that AEOMs can do) from the app first and then iterate over that, or at least use your own iterator loop that counts over elements in reverse:
repeat with i from (count every person) to 1 by -1
    tell person i
        ..
    end tell
end repeat
so that, assuming it's iterating over an ordered array on the server side, the loop deletes from last to first and avoids the embarrassing off-by-N errors of your original script.
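For the by-ID variant, something along these lines should work - a sketch only, since it assumes your person elements actually expose an id property:

set personIDs to id of every person
repeat with pid in personIDs
    delete person id (contents of pid)
end repeat

Because each delete addresses an element by its unique ID rather than its index, it doesn't matter how the remaining elements shift around.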
HTH
re: "If you want your scripable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them."
Yes, Apple recommends using formUniqueId or formName for object specifiers, but you can't always do that. For instance, in the Text Suite, you really only have indexing to work with; e.g. character 1, word 3, paragraph 7, etc. You don't have unique IDs for text elements. In addition to deletion, ordering can be affected by other Standard Suite commands: open, close, duplicate, make, and move.
The app implementer is a programmer, but so is the scripter. So it is reasonable to expect the scripter to solve some problems themselves. For instance, if the app has 5 persons, and the scripter wants to delete persons 2 and 4, they can easily do so even with indexed deletion:
delete person 4
delete person 2
Deleting from the end of an ordered list towards the front solves the problem. AppleScript also supports negative indexes, which can be used for the same purpose:
delete person -2
delete person -4
The key to solving this lies in implementing the objectSpecifier method correctly so that it does return an NSUniqueIDSpecifier.
So far, my code only returned an index specifier, and that was wrong for this purpose. I guess that had I posted my code (which is, unfortunately, too complex for that), someone might have noticed my mistake.
So, I guess the rule is: if you want your scriptable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them. For read-only element arrays, using an NSIndexSpecifier is (probably) safe, though, provided your element array has persistent ordering behavior.
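For example, a Person class might implement it roughly like this - a sketch only, since the container class, element key, and uniqueID property are assumptions that depend on your sdef and object model:

- (NSScriptObjectSpecifier *)objectSpecifier {
    // Assumes the application object is the container and exposes the
    // "allPersons" element key; self.uniqueID is a hypothetical property
    // backing the element's "id" in the sdef.
    NSScriptClassDescription *appDesc = (NSScriptClassDescription *)
        [NSScriptClassDescription classDescriptionForClass:[NSApplication class]];
    return [[NSUniqueIDSpecifier alloc]
        initWithContainerClassDescription:appDesc
                       containerSpecifier:nil
                                      key:@"allPersons"
                                 uniqueID:self.uniqueID];
}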
Update
As #foo points out, it's also important that the repeat command fetches the references to the items by using … in (get every person) and not just … in every person, because only the former addresses the items by their id, whereas the latter keeps indexing them as item N.
I am new to grails and groovy.
Can anyone please explain to me the difference between these two groovy sql methods
sql.eachRow
sql.rows
Also, which is more efficient?
I am working on an application that retrieves data from the database (the result set is very large) and writes it to a CSV file or returns it as JSON.
I was wondering which of the two methods mentioned above to use to get this done faster and more efficiently.
Can anyone please explain to me the difference between these two groovy sql methods sql.eachRow sql.rows
It's difficult to tell exactly which two methods you're referring to, because there are a large number of overloaded versions of each method. However, in all cases, eachRow returns nothing:
void eachRow(String sql, Closure closure)
whereas rows returns a list of rows
List rows(String sql)
So if you use eachRow, the closure passed in as the second parameter should handle each row, e.g.
sql.eachRow("select * from PERSON where lastname = 'murphy'") { row ->
println "$row.firstname"
}
whereas if you use rows, the rows are returned and should therefore be handled by the caller, e.g.
rows("select * from PERSON where lastname = 'murphy'").each {row ->
println "$row.firstname"
}
Also, which is more efficient?
This question is almost unanswerable. Even if I had implemented these methods myself, there's no way of knowing which one will perform better for you, because I don't know:
what hardware you're using
what JVM you're targeting
what version of Groovy you're using
what parameters you'll be passing
whether this method is a bottleneck for your application's performance
or any of the other factors that influence a method's performance that cannot be determined from the source code alone. The only way you can get a useful answer to the question of which method is more efficient for you is by measuring the performance of each.
Despite everything I've said above, I would be amazed if the performance difference between these two was in any way significant, so if I were you, I would choose whichever one you find more convenient. If you find later on that this method is a performance bottleneck, try using the other one instead (but I'll bet you a dollar to a dime it makes no difference).
If we set aside minor syntax differences, there is one difference that seems important. Let's consider
sql.rows("select * from my_table").each { row -> doIt(row) }
vs
sql.eachRow("select * from my_table") { row -> doIt(row) }
The first one opens a connection, retrieves the results, closes the connection and returns them. Now you can iterate over the results while the connection is released. The drawback is that you now have the entire result list in memory, which in some cases might be a lot.
eachRow, on the other hand, opens a connection and, while keeping it open, executes your closure for each row. If your closure operates on the database and requires another connection, your code will consume two connections from the pool at the same time. The connection used by eachRow is released only after it has iterated through all the resulting rows. Also, if you don't perform any database operations but the closure takes a while to execute, you will be blocking one database connection until eachRow completes.
I am not 100% sure but possibly eachRow allows you not to keep all resulting rows in memory but access them through a cursor - this may depend on the database driver.
If you don't perform any database operations inside your closure, the closure executes quickly, and the result list is big enough to impact memory, then I'd go for eachRow. If you do perform DB operations inside the closure, or each closure call takes significant time while the result list stays manageable, then go for rows.
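For the CSV case from the original question, a sketch of the eachRow route (the connection details and column names are assumptions):

import groovy.sql.Sql

// Adjust the URL, credentials and driver to your environment.
def sql = Sql.newInstance('jdbc:h2:mem:testdb', 'sa', '', 'org.h2.Driver')

// Stream the (potentially huge) result set straight to the file,
// one row at a time, without building the whole list in memory.
new File('persons.csv').withWriter { out ->
    sql.eachRow('select firstname, lastname from PERSON') { row ->
        out << "${row.firstname},${row.lastname}\n"
    }
}
sql.close()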
They differ in signature only - both support result-set paging, so both will be efficient. Use whichever fits your code.
This might seem to be a silly question at first, but please read on.
I know that LINQ queries are deferred and only executed when the query is enumerated, but I'm having trouble figuring out exactly when that happens. Certainly in a For Each loop, the query would be enumerated. What's the rule of thumb to follow? I don't want to accidentally enumerate over my query twice if it's a huge result.
For example, does System.Linq.Enumerable.First enumerate over the whole query? I ask for performance reasons. I want to pass a LINQ result set to an ASP.NET MVC view, and I also want to pass the First element separately. Enumerating over the results twice would be painful.
It would be great to turn on some kind of flag that alerts me each time a LINQ query is enumerated. That way I could catch scenarios when I accidentally enumerate twice.
You can add your own logging quite easily to see what's going on. Other than that, the lazy/eager bit is reasonably clear. Basically it's lazy when it can be - any time the return type is IEnumerable<T> or IOrderedEnumerable<T>. It's possible for those to be lazy because you can't get at any of the data without calling GetEnumerator(). Compare that to First() for example - it has to return a value to you. It can't defer anything.
As a general point, if you want to make sure that a query won't be evaluated more than once, call ToList or ToArray on it, then use the results of that several times. Again, those methods have to return a list or an array immediately, neither of which allows for lazy population. The query is evaluated, but then it's effectively disconnected from the resulting populated collection - the query won't be executed again, however much you examine the list.
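For the MVC scenario in the question, a sketch inside a hypothetical controller (the entity names are assumptions):

public ActionResult Index()
{
    // Materialize the query once; everything below works on the in-memory list.
    var products = db.Products
                     .Where(p => p.IsActive)
                     .ToList();          // the query executes exactly once, here

    var featured = products.First();     // no second enumeration of the source
    ViewBag.Featured = featured;         // pass the first element separately
    return View(products);               // pass the full list to the view
}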
In addition to the lazy/eager question, there's streaming/non-streaming: will the method read everything from the source enumerable, or just "sip" at it, reading when it needs to. Again, in general LINQ will only read when it has to - so while Reverse is non-streaming (but still lazy), Where and Select are streaming.
There is no hard and fast rule as to when a LINQ query will be enumerated and when it won't, partially because some methods will or won't depending on the underlying type of the query source.
Here is a quick breakdown. It is not complete by any means - mainly what I could come up with in 5 minutes.
Aggregate Functions
These enumerate the list entirely and immediately. They are usually easy to spot: they are the extension methods which return a scalar value, for example Sum, Min, Max, Count, Last, etc.
Note: Count and Last do not necessarily enumerate the entire list. If the underlying type is convertible to ICollection<T> they will instead use a more efficient method.
Front of the list Selectors
They only look at the first element of the list and potentially the second. They are First, FirstOrDefault, Single, SingleOrDefault.
The above is referencing the versions which do not take a predicate. If they take a predicate they are better classified as Inquiries (see below)
Inquiries
They will only enumerate the minimal amount of the list necessary to do the operation. This can be as little as 1 element and as many as the entire list.
Examples: Any, Contains
Create a new list and do no enumeration immediately.
This is the vast majority of the operators in LINQ. Their cost is incurred when the new list is enumerated. Examples: Select, Where, Group, Join, SkipWhile, Skip.
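A quick sketch of those categories in action (assuming using System.Linq; the data is made up):

var source = Enumerable.Range(1, 10);

// Deferred: building the query enumerates nothing yet.
var query = source.Where(n => n % 2 == 0)
                  .Select(n => n * n);

var first = query.First();            // front-of-list: stops at the first match
bool any  = query.Any(n => n > 50);   // inquiry: reads only as much as it needs
int total = query.Sum();              // aggregate: enumerates everything

Note that each of the three calls re-enumerates query from the start, which is exactly the repeated enumeration the question worries about; ToList/ToArray are the way to avoid it when the source is expensive.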
I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID; when the order is changed (via a removal followed by an insertion), the ordering IDs are updated. This makes retrieval easy, since it's just an ORDER BY, but it seems easy to break.
-- REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID

-- INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID > insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the ID of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) with large gaps, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place the new item at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database I know of that has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (whose name you can override, of course) and uses it to do an ORDER BY. When you want to re-order things you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always re-position on INSERTs/DELETEs by using sparse numbering - kind of like BASIC back in the day... you number your positions 10, 20, 30, etc., and if you need to insert something between 10 and 20 you just insert it with a position of 15. Likewise, when deleting you can just delete the row and leave the gap. You only need to re-number when you actually change the order, or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case, just involving a change to one value and a query on two.
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue for a long time, but not indefinitely: floating point precision is finite, so eventually two neighbouring positions become too close to split and the items need renumbering.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
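In SQL terms that might look something like this (the table and column names are just for illustration):

CREATE TABLE chapter (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    position DOUBLE PRECISION NOT NULL
);

-- Move a chapter between the items at positions 0.0 and 1.0 by bisection:
UPDATE chapter SET position = (0.0 + 1.0) / 2 WHERE title = 'Accessibility';

-- Reading back in order stays a plain ORDER BY:
SELECT id, title FROM chapter ORDER BY position;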
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
I'd go with consecutive numbers, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to change the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) in the worst case, but that worst case is a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and have probably spent at least a week concerning myself with the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array accordingly, using insertions or deletions, when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit of a choppy solution, but it will likely work better than option #1, since option #1 requires updating the order number of all the other rows when the ordering changes.
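A rough sketch of that idea in PostgreSQL (names assumed):

CREATE TABLE book (
    id            INTEGER PRIMARY KEY,
    chapter_order INTEGER[] NOT NULL   -- chapter primary keys, in display order
);

-- Reordering touches only the one parent row:
UPDATE book SET chapter_order = ARRAY[1, 3, 4, 2, 5, 6, 7] WHERE id = 1;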
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic solution for storing a list of integers within a CharField(). One drawback is that the maximum length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list
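A minimal sketch of that approach (the model and field names are assumptions):

from django.core.validators import validate_comma_separated_integer_list
from django.db import models

class Book(models.Model):
    # Ordering of the child rows, stored as e.g. "3,1,2" on the parent row.
    chapter_order = models.CharField(
        max_length=255,
        blank=True,
        validators=[validate_comma_separated_integer_list],
    )

class Chapter(models.Model):
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    title = models.CharField(max_length=200)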