mongodb: updating a document or replacing the existing one, what is faster

I have a document hierarchy in one collection that for some documents may be rather deep. For example, a document may contain an array of objects which themselves can contain other objects and possibly arrays. These documents actually represent objects that I hold in RAM: they are Vert.x verticles representing connected users. APIs interact with these in-memory representations to, for example, change the value of a field, add a new one, or retrieve another one. When data is changed in RAM, it needs to be persisted.
Here, I have two alternatives:
as I have the full object in memory, including its _id, the first option would be to replace the whole document on disk with the object in RAM. This would also allow persisting only from time to time, say after five or ten data changes or creations, which would also speed up the process
or to persist only the part that has changed, which might be tricky if the data to be updated lies somewhere deep in the structure
The easiest solution is clearly the first one, but I was wondering whether it would also be the fastest, as it writes full blocks of data even for small changes?
Update: To make things clearer, here is an extract of such a document:
{
"_id" : ObjectId("5cc000df5afdb9451e0fe762"),
"email" : {
"not_validated" : [ ],
"validated" : [
{
"email" : "user2#yopmail.com",
"ts" : NumberLong("1556087007701")
}
],
"default" : "user2#yopmail.com"
},
"msisdn" : {
"not_validated" : [ ],
"validated" : [ ]
},
"last_connection_ts" : NumberLong("1556087007701"),
.....
}
The email object has two arrays that hold all validated and not yet validated (not yet confirmed by the user) emails. Sometimes a first-level element such as last_connection_ts will be changed, but sometimes, when a not validated email becomes validated, it needs to be removed from the not_validated array and added to the validated array.
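For the second alternative, moving an address from one array to the other can be expressed as a single targeted update instead of a full replacement. Here is a minimal sketch in Python that only builds the filter and update documents. It assumes that not_validated entries have the same {email, ts} shape as the validated ones (the extract does not show this), and the helper name build_validate_email_update is made up for illustration. Whether this is faster than replacing the whole document depends on document size and workload, so it is worth measuring.

```python
def build_validate_email_update(user_id, address, ts):
    """Build (filter, update) documents that move `address` from
    email.not_validated to email.validated in one update.

    Assumption: not_validated entries look like {"email": ..., "ts": ...}.
    """
    query = {"_id": user_id}
    update = {
        # Remove every not_validated entry whose "email" field matches.
        "$pull": {"email.not_validated": {"email": address}},
        # Append the validated entry; the two operators touch different paths.
        "$push": {"email.validated": {"email": address, "ts": ts}},
    }
    return query, update


query, update = build_validate_email_update("some-id", "user@example.com", 1556087007701)
print(query)
print(update)
```

With pymongo, the two documents would be passed as collection.update_one(query, update). A first-level field such as last_connection_ts can be changed the same way with a "$set" entry.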

Related

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.
I read a blog post about this topic and there you can find this code snippets:
List implementation:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
measure {
list
.filter { it % 3 == 0 }
.average()
}
// 8644 ms
Sequence implementation:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
measure {
sequence
.filter { it % 3 == 0 }
.average()
}
// 822 ms
The point here is that the Sequence implementation is about 10x faster.
However, I do not really understand WHY that is. I know that with a Sequence, you do "lazy evaluation", but I cannot find any reason why that helps reduce the processing in this example.
However, here I know why a Sequence is generally faster:
val result = sequenceOf("a", "b", "c")
.map {
println("map: $it")
it.toUpperCase()
}
.any {
println("any: $it")
it.startsWith("B")
}
Because with a Sequence you process the data "vertically", when the first element starts with "B", you don't have to map for the rest of the elements. It makes sense here.
So, why is it also faster in the first example?
Let's look at what those two implementations are actually doing:
The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.
(In fact, it's probably far more than that.  As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them, i.e. storing references to Int objects.  On a typical 64-bit JVM, each reference takes 4 or 8 bytes, and each Int object could take 16, giving 1GB or more.  Also, depending how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)
Then it has to read all the values back from the list, filter them, and create another list in memory.
Finally, it has to read all those values back in again, to calculate the average.
The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.
(That's pretty much how you'd do it if you were implementing it ‘by hand’.)
You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!
Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).
In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!
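To make the difference concrete, here is a rough Python analogue of the two approaches (not Kotlin, and not a benchmark). The function names are made up for illustration. The eager version builds both intermediate lists, while the lazy version keeps only a running total and a count.

```python
def average_eager(n):
    """Materialise every intermediate list, like the List version."""
    numbers = list(range(1, n + 1))                  # first full list in memory
    multiples = [x for x in numbers if x % 3 == 0]   # second list in memory
    return sum(multiples) / len(multiples)


def average_lazy(n):
    """Handle values one at a time, like the Sequence version."""
    total = 0
    count = 0
    for x in range(1, n + 1):    # range produces values on demand
        if x % 3 == 0:
            total += x
            count += 1
    return total / count


print(average_eager(30), average_lazy(30))
```

Both functions return the same result; only the amount of memory touched along the way differs.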
Leveraging lazy evaluation avoids creating intermediate objects that are irrelevant from the point of view of the end goal.
Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.
Initial code produces a list containing 50_000_000 objects:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
then iterates through it and creates another list containing a subset of its elements:
.filter { it % 3 == 0 }
... and then proceeds with calculating the average:
.average()
Using sequences allows you to avoid doing all those intermediate steps. The below code doesn't produce 50_000_000 elements, it's just a representation of that 1...50_000_000 sequence:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
Adding a filter to it doesn't trigger the calculation either; it only derives a new sequence from the existing one (3, 6, 9...):
.filter { it % 3 == 0 }
and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:
.average()
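The same behaviour can be observed with Python generators, used here as a stand-in for Kotlin sequences. In this sketch the calls list and the numbers helper are hypothetical names that exist only to record when elements are actually produced.

```python
calls = []


def numbers(limit):
    """Generator standing in for generateSequence(1) { it + 1 }.take(limit)."""
    n = 1
    while n <= limit:
        calls.append(n)    # record that this element was really produced
        yield n
        n += 1


# Building the pipeline does no work: nothing has been generated yet.
pipeline = (x for x in numbers(9) if x % 3 == 0)
produced_before = len(calls)

# The terminal step pulls the elements through the whole pipeline.
values = list(pipeline)
produced_after = len(calls)

print(produced_before, produced_after, values)
print(sum(values) / len(values))
```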
Some relevant reading:
Kotlin: Beware of Java Stream API Habits
Kotlin Collections API Performance Antipatterns

Rebol: Dynamic binding of block words

In Rebol, there are words like foreach that allow "block parametrization" over a given word and a series, e.g., foreach w [1 2 3] [print w]. Since I find that syntax very convenient (as opposed to passing func blocks), I'd like to use it for my own words that operate on lazy lists, e.g map/stream x s [... x ... ].
How is that syntax idiom called? How is it properly implemented?
I was searching the docs, but I could not find a straight answer, so I tried to implement foreach on my own. Basically, my implementation comes in two parts. The first part is a function that binds a specific word in a block to a given value and yields a new block with the bound words.
bind-var: funct [block word value] [
qw: load rejoin ["'" word]
do compose [
set (:qw) value
bind [(block)] (:qw)
[(block)] ; This shouldn't work? see Question 2
]
]
Using that, I implemented foreach as follows:
my-foreach: func ['word s block] [
if empty? block [return none]
until [
do bind-var block word first s
s: next s
tail? s
]
]
I find that approach quite clumsy (and it probably is), so I was wondering how the problem can be solved more elegantly. Regardless, after coming up with my contraption, I am left with two questions:
1. In bind-var, I had to do some wrapping in bind [(block)] (:qw) because (block) would "dissolve". Why?
2. Because (?) of 2, the bind operation is performed on a new block (created by the [(block)] expression), not the original one passed to my-foreach, with separate bindings, so I have to operate on that. By mistake, I added [(block)] and it still works. But why?
Great question. :-) Writing your own custom loop constructs in Rebol2 and R3-Alpha (and now, history repeating with Red) has many unanswered problems. These kinds of problems were known to the Rebol3 developers and considered blocking bugs.
(The reason that Ren-C was started was to address such concerns. Progress has been made in several areas, though at time of writing many outstanding design problems remain. I'll try to just answer your questions under the historical assumptions, however.)
In bind-var, I had to do some wrapping in bind [(block)] (:qw) because (block) would "dissolve". Why?
That's how COMPOSE works by default...and it's often the preferred behavior. If you don't want that, use COMPOSE/ONLY and blocks will not be spliced, but inserted as-is.
qw: load rejoin ["'" word]
You can convert WORD! to LIT-WORD! via to lit-word! word. You can also shift the quoting responsibility into your boilerplate, e.g. set quote (word) value, and avoid qw altogether.
Avoiding LOAD is also usually preferable, because it always brings things into the user context by default--so it loses the binding of the original word. Doing a TO conversion will preserve the binding of the original WORD! in the generated LIT-WORD!.
do compose [
set (:qw) value
bind [(block)] (:qw)
[(block)] ; This shouldn't work? see Question 2
]
Presumably you meant COMPOSE/DEEP here, otherwise this won't work at all... with regular COMPOSE the embedded PAREN!s (cough, GROUP!s) for [(block)] will not be substituted.
By mistake, I added [(block)] and it still works. But why?
If you do a test like my-foreach x [1] [print x probe bind? 'x] the output of the bind? will show you that it is bound into the "global" user context.
Fundamentally, you don't have any MAKE OBJECT! or USE to create a new context to bind the body into. Hence all you could potentially be doing here would be stripping off any existing bindings in the code for x and making sure they are into the user context.
But originally you did have a USE, that you edited to remove. That was more on the right track:
bind-var: func [block word value /local qw] [
qw: load rejoin ["'" word]
do compose/deep [
use [(qw)] [
set (:qw) value
bind [(block)] (:qw)
[(block)] ; This shouldn't work? see Question 2
]
]
]
You're right to suspect something is askew with how you're binding. But the reason this works is because your BIND is only redoing the work that USE itself does. USE already deep walks to make sure any of the word bindings are adjusted. So you could omit the bind entirely:
do compose/deep [
use [(qw)] [
set (:qw) value
[(block)]
]
]
the bind operation is performed on a new block (created by the [(block)] expression), not the original one passed to my-foreach, with separate bindings
Let's adjust your code by taking out the deep-walking USE to demonstrate the problem you thought you had. We'll use a simple MAKE OBJECT! instead:
bind-var: func [block word value /local obj qw] [
do compose/deep [
obj: make object! [(to-set-word word) none]
qw: bind (to-lit-word word) obj
set :qw value
bind [(block)] :qw
[(block)] ; This shouldn't work? see Question 2
]
]
Now if you try my-foreach x [1 2 3] [print x] you'll get what you suspected... "x has no value" (assuming you don't have some latent global definition of x it picks up, which would just print that same latent value 3 times).
But to make you sufficiently sorry you asked :-), I'll mention that my-foreach x [1 2 3] [loop 1 [print x]] actually works. That's because while you were right to say a bind in the past shouldn't affect a new block, this COMPOSE only creates one new BLOCK!. The topmost level is new, any "deeper" embedded blocks referenced in the source material will be aliases of the original material:
>> original: [outer [inner]]
== [outer [inner]]
>> composed: compose [<a> (original) <b>]
== [<a> outer [inner] <b>]
>> append original/2 "mutation"
== [inner "mutation"]
>> composed
== [<a> outer [inner "mutation"] <b>]
Hence if you do a mutating BIND on the composed result, it can deeply affect some of your source.
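The aliasing itself is not specific to Rebol. As a loose analogy only (Python lists, not Rebol blocks, and no binding involved), splicing a list into a new one copies just the top level, so nested lists stay shared:

```python
original = ["outer", ["inner"]]

# Splicing creates a new top-level list, but the nested list is the same object.
composed = ["<a>", *original, "<b>"]

original[1].append("mutation")   # mutate the nested list through the original
original.append("top-level")     # a top-level change does not reach composed

print(composed)
```

The nested change shows up in composed, while the top-level append does not, which mirrors the COMPOSE session above.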
until [
do bind-var block word first s
s: next s
tail? s
]
On a general note of efficiency, you're running COMPOSE and BIND operations on each iteration of your loop. No matter how creative new solutions to these kinds of problems get (there's a LOT of new tech in Ren-C affecting your kind of problem), you're still probably going to want to do it only once and reuse it on the iterations.

How do I speed up 'watching' multiple arrays with the same data in vue.js

A large amount of data (about 15 000 records) is loaded via AJAX and then used to populate two arrays, one sorted by ID, the other by name. Both arrays contain the same objects, just ordered differently.
The data is displayed in a tree view, but the hierarchy can be changed by the user. The tree is virtualised on the first level, so only the records displayed + 50% are 'materialized' as vue components.
Supplying both of the arrays as data to a vue instance is very slow. I suspect vue.js is adding observers to the objects twice, or change notifications are sent multiple times, I don't really know.
So only one of the arrays is added to vue; the other is used out of band.
Vue slows down the addition of elements to an array a lot. If the array is populated before it is bound to the vue instance, it takes about 20s before the tree view is displayed. If I bind it before populating the arrays, it takes about 50s before the tree view becomes usable (the elements are displayed almost instantly). This could be because of notifications being sent for all the elements added.
Is there a way to add a second array with duplicate data so vue.js watches it for changes, but it doesn't slow vue down as much?
Is there a way to switch watching/notifications off temporarily, so elements could be added to an array without the penalty, yet be 'watched' when notifications are switched back on?
I'm not sure that my reasoning behind the slowdowns is correct, so maybe my questions are misguided.
Oh, another thing: I need the arrays to be watched, and only one of the properties of the elements.
var recordsById = [];
var recordsByName = [];
// addRecord gets called for every record AJAX returns, so +-15 000
// calling addRecord 15 000 times before 'binding' takes 20 sec (20 sec with no display)
// calling addRecord after 'binding' takes > 50 sec (instant display but CPU usage makes treeview unusable)
function addRecord(record) {
var pos = binarySearch(recordsById, record);
// splice(start, deleteCount, item): insert at pos without removing anything
recordsById.splice(pos, 0, record);
pos = binarySearch(recordsByName, record);
recordsByName.splice(pos, 0, record);
}
var treeView = new Vue({
el: '#treeView',
data: {
// If I uncomment following line, vue becomes very slow, not just for initial loading, but as a whole
//recordsById: recordsById,
recordsByName: recordsByName
},
computed: {
virtualizedList: function() {.....}
}
})
There are a couple techniques which might improve your performance.
Using the keys
For rendering large lists, the first thing you want to do is use a key. A great candidate for this is the id you speak about.
Use pagination
An obvious solution to "displaying a lot of data is slow" is "display less data". Let's face it, 15 000 is a lot. Even if Vue itself could insert and update so many rows quickly, think of the user's browser: will it be able to keep up? Even if it was the simplest possible list of text nodes, it would still be a lot of nodes.
Use virtual scrolling technique
If you don't like the look of pagination, a more advanced approach would be virtual scrolling. As the user browses this large list, dynamically add the elements below and remove the ones that the user has already scrolled past. This keeps the number of list elements in the DOM at any one time to a minimum.
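The core of virtual scrolling is working out which slice of the list has to exist as components at a given scroll position. Below is a minimal sketch of that calculation, written in Python for brevity rather than as Vue code. It assumes fixed-height rows, and the function and parameter names are made up; the default overscan of 0.5 corresponds to the "displayed + 50%" margin mentioned in the question.

```python
def visible_range(scroll_top, viewport_height, row_height, total_rows, overscan=0.5):
    """Return (start, end) row indices to materialise; end is exclusive.

    Assumes every row has the same pixel height.
    """
    first = scroll_top // row_height
    # Ceiling division: the last row that is at least partly visible, exclusive.
    last = -(-(scroll_top + viewport_height) // row_height)
    extra = int((last - first) * overscan)   # extra rows kept on each side
    start = max(0, first - extra)
    end = min(total_rows, last + extra)
    return start, end


print(visible_range(0, 100, 20, 15000))
print(visible_range(299900, 100, 20, 15000))
```

Only the rows in the returned range would be rendered; everything else stays as plain data.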

CoreData: Storing (and sorting by) a vector of floating point numbers

I am building an application using CoreData which will require me to store an array of floating point numbers against instances of an Entity, and then fetch a selection of these entities in order of the (say) Manhattan distance between their respective vectors.
Here is a rough diagram of something like what I mean:
Entity: {
name: 'instance 1',
data: [ 0.1, 0.2, 0.1, 0.1, 0.05, ... ]
},
Entity: {
name: 'instance 2',
data: [ 0.4, 0.9, 0.1, 0.1, 0.02 ... ] // want to sort using this data
}
I know that it's possible to use a 'transformable' attribute for 'data', and encode an NSArray, but I don't believe it's possible to use the contents of that array in fetch or sorting queries.
So, my question is: what options do I have to build this? Is it possible to somehow extend CoreData to allow me to perform the vector calculations as part of the 'fetch' request? Or would I have to load every object into memory and then sort manually?..
Ultimately I am trying to find the most efficient option, because I expect to be dealing with thousands of instances, each with a 10-20 item feature vector.
Any suggestions as to the possible architecture here would be appreciated by a CoreData newbie ;-)
Please let me know if I have not framed this clearly enough and I will try to elaborate.
I would create a new entity called VectorEntry with two attributes:
data (Float)
sort (Int16)
Then, in your main Vector entity, have a to-many relationship with VectorEntry (this is implemented as an NSSet by CoreData, hence the need to add sorting). You could also implement the VectorEntry objects as a doubly linked list (Vector would have a to-one relationship with the start and/or end).
In your VectorEntry fetch requests, specify a sort descriptor using the "sort" key path and a proper array will be returned.
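For the ordering itself, the in-memory option mentioned in the question is cheap at this scale (thousands of instances with 10-20 values each). Here is a minimal sketch of that option in Python, using plain dictionaries as stand-ins for fetched objects. It is not CoreData API, and the names manhattan and nearest are made up.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(entities, query, limit):
    """Return the `limit` entities whose data is closest to `query`."""
    return sorted(entities, key=lambda e: manhattan(e["data"], query))[:limit]


entities = [
    {"name": "instance 1", "data": [0.1, 0.2, 0.1]},
    {"name": "instance 2", "data": [0.4, 0.9, 0.1]},
]
closest = nearest(entities, [0.4, 0.8, 0.1], limit=1)
print(closest[0]["name"])
```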

Juggling multiple object instances

This question is coded in pseudo-PHP, but I really don't mind what language I get answers in (except for Ruby :-P), as this is purely hypothetical. In fact, PHP is quite possibly the worst language to be doing this type of logic in. Unfortunately, I have never done this before, so I can't provide a real-world example. Therefore, hypothetical answers are completely acceptable.
Basically, I have lots of objects performing a task. For this example, let's say each object is a class that downloads a file from the Internet. Each object will be downloading a different file, and the downloads are run in parallel. Obviously, some objects may finish downloading before others. The actual grabbing of data may run in threads, but that is not relevant to this question.
So we can define the object as such:
class DownloaderObject() {
var $url = '';
var $downloading = false;
function DownloaderObject($v){ // constructor
$this->url = $v;
start_downloading_in_the_background(url=$this->url, callback=$this->finished);
$this->downloading = true;
}
function finished() {
save_the_data_somewhere();
$this->downloading = false;
$this->destroy(); // actually destroys the object
}
}
Okay, so we have lots of these objects running:
$download1 = new DownloaderObject('http://somesite.com/latest_windows.iso');
$download2 = new DownloaderObject('http://somesite.com/kitchen_sink.iso');
$download3 = new DownloaderObject('http://somesite.com/heroes_part_1.rar');
And we can store them in an array:
$downloads = array($download1, $download2, $download3);
So we have an array full of the downloads:
array(
1 => $download1,
2 => $download2,
3 => $download3
)
And we can iterate through them like this:
print('Here are the downloads that are running:');
foreach ($downloads as $d) {
print($d->url . "\n");
}
Okay, now suppose download 2 finishes, and the object is destroyed. Now we should have two objects in the array:
array(
1 => $download1,
3 => $download3
)
But there is a hole in the array! Key #2 is being unused. Also, if I wanted to start a new download, it is unclear where to insert the download into the array. The following could work:
$i = 0;
while ($i < count($downloads) - 1) {
if (!is_object($downloads[$i])) {
$downloads[$i] = new DownloaderObject('http://somesite.com/doctorwho.iso');
break;
}
$i++;
}
However, that is terribly inefficient (and while $i++ loops are nooby). So, another approach is to keep a counter.
function add_download($url) {
global $downloads;
static $download_counter;
$download_counter++;
$downloads[$download_counter] = new DownloaderObject($url);
}
That would work, but we still get holes in the array:
array(
1 => DownloaderObject,
3 => DownloaderObject,
7 => DownloaderObject,
13 => DownloaderObject
)
That's ugly. However, is that acceptable? Should the array be "defragmented", i.e. the keys rearranged to eliminate blank spaces?
Or is there another programmatic structure I should be aware of? I want a structure that I can add stuff to, remove stuff from, refer to keys in a variable, iterate through, etc., that is not an array. Does such a thing exist?
I have been coding for years, but this question has bugged me for very many of those years, and I am still not aware of an answer. This may be obvious to some programmers, but is extremely non-trivial to me.
The problem with PHP's "associative arrays" is that they aren't arrays at all, they're Hashmaps. Having holes there is perfectly fine. You might look at a linked list, as well, but a Hashmap seems perfectly suited to what you're doing.
What is maintaining your array of downloaders?
If you encapsulate the array in a class that is notified by the downloader when it is finished you won't have to worry about stale references to destroyed objects.
This class can manage the organisation of the array internally and present an interface to its users that looks more like an iterator than an array.
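A minimal sketch of such a class, in Python since the question is language-agnostic. The class and method names are made up, and plain URL strings stand in for the downloader objects. Internally it uses a dictionary keyed by an ever-increasing counter, so gaps in the keys never matter to callers.

```python
class DownloadManager:
    """Owns the collection of running downloads and hides how it is stored."""

    def __init__(self):
        self._downloads = {}   # id -> download (a URL string in this sketch)
        self._next_id = 0

    def add(self, url):
        """Register a new download and return its id."""
        self._next_id += 1
        self._downloads[self._next_id] = url
        return self._next_id

    def finished(self, download_id):
        """Called by a download when it completes; drops the reference."""
        self._downloads.pop(download_id, None)

    def __iter__(self):
        # Iterate over a snapshot so a download may finish during iteration.
        return iter(list(self._downloads.values()))

    def __len__(self):
        return len(self._downloads)


manager = DownloadManager()
first = manager.add("http://somesite.com/latest_windows.iso")
second = manager.add("http://somesite.com/kitchen_sink.iso")
third = manager.add("http://somesite.com/heroes_part_1.rar")
manager.finished(second)
print(list(manager))
```

Users of the class only add downloads, report completion, and iterate; they never see the keys.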
"$i++ loops" are nooby, but only because the code becomes much clearer if you use a for loop:
$i = 0;
while ($i < count($downloads) - 1) {
if (!is_object($downloads[$i])) {
$downloads[$i] = new DownloaderObject('http://somesite.com/doctorwho.iso');
break;
}
$i++;
}
Becomes
for($i=0;$i<count($downloads)-1;++$i){
if (!is_object($downloads[$i])) {
$downloads[$i] = new DownloaderObject('http://somesite.com/doctorwho.iso');
break;
}
}
Coming from a C# perspective, my first thought would be that you need a different data structure to an array - you need to think about the problem using a higher-level data structure. Perhaps a Queue, List or Stack would suit your purposes better?
The short answer to your question is that in PHP arrays are used for almost everything and you rarely end up using other data structures. Having holes in your array indexes isn't anything to worry about. In other programming languages such as Java you have a more diverse set of data structures to choose from: Sets, Hashes, Lists, Vectors and more. It also seems that you would need closer interaction between the array and the DownloaderObject class. Even though the object $download2 has "destroyed()" itself, the array will still maintain a reference to that object.
Some good answers to this question, which reflect the relative experience of the answerers. Thank you very much; they proved very educational.
I posted this question nearly three years ago. In hindsight, I can see my knowledge in that area was severely lacking. The biggest problem I had was that I was coming from a PHP perspective, which does not have the ability to arbitrarily pop elements. Something the other answers to this question helped me to discover was that a fundamentally superior model is 'linked lists'.
For C, I wrote a blog post about linked lists which contains code samples (too numerous to post here) but would neatly fill the original question's use case.
For PHP, a linked list implementation appears here, which I have never tried, but imagine it would also be the right way to deal with the above.
Interestingly, Python lists contain the pop() method which, unlike PHP's array_pop(), can pop arbitrary elements and keep everything in order. For example:
>>> x = ['baa', 'ram', 'ewe'] # our starting point
>>> x[1] # making sure element 1 is 'ram'
'ram'
>>> x.pop(1) # let's arbitrarily pop an element in the middle
'ram'
>>> x # the one we popped ('ram') is now gone
['baa', 'ewe']
>>> x[1] # and there are no holes: item 2 has become item 1
'ewe'