Using deepstream List for tens of thousands of unique values

I wonder whether it's a good or a bad idea to use deepstream record.getList for storing a lot of unique values, for example emails or any other unique identifiers. The main purpose is to be able to quickly answer whether we already have, say, a user with a given email (email already in use), or to find a record by some other unique field.
I ran a few experiments today and hit two problems:
1) When I tried to populate the list with a few thousand values I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server crashed. I was able to work around it by giving the server's Node process more memory with this flag:
--max-old-space-size=5120
That doesn't feel right, but it allowed me to build a list with more than 5,000 items.
2) That wasn't enough for my tests, so I pre-created a list with 50,000 items by writing the data directly to the RethinkDB table, and got another issue when getting or modifying the list:
RangeError: Maximum call stack size exceeded
I was able to fix it with another flag:
--stack-size=20000
It helps, but I believe it's only a matter of time before one of those errors appears in production once the list grows large enough. I don't really know whether it's a Node.js, JavaScript, deepstream or RethinkDB issue. All of this made me think that I'm using deepstream List the wrong way. Please let me know. Thank you in advance!

Whilst you can use lists to store arrays of strings, they are actually intended as collections of record names - the actual data is stored in the records themselves, and the list only manages the order of the records.
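For example, instead of pushing raw email strings into the list, each user would get their own record and the list would only track the record names. A rough sketch against the deepstream client (the record and list names here are made up):
var deepstream = require( 'deepstream.io-client-js' );
var ds = deepstream( 'localhost:6020' ).login();
// The actual data lives in a record...
var userRecord = ds.record.getRecord( 'user/a1b2c3' );
userRecord.set({ email: 'jane@example.com', name: 'Jane' });
// ...while the list only manages which records belong to the collection
var userList = ds.record.getList( 'users' );
userList.addEntry( 'user/a1b2c3' );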
Having said that, there are two open GitHub issues to improve performance for very long lists by sending more efficient deltas and by introducing a pagination option.
Interesting results with regard to memory though - definitely something that needs to be handled more gracefully. In the meantime you could drastically improve performance by combining the updates into one:
var myList = ds.record.getList( 'super-long-list' );

// Sends 10,000 messages
for( var i = 0; i < 10000; i++ ) {
    myList.addEntry( 'something-' + i );
}

// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
    entries.push( 'something-' + i );
}
myList.setEntries( entries );

Related

How to get just the most recent of all documents

In Sanity Studio you get a nice list of the most recent version of all your documents: if there is a draft you get that, if not, you get the published one.
I need the same list for a few filters and scripts. The following GROQ does the job but is not very fast and does not work in the new API (v2021-03-25).
*[
  _type == $type &&
  !defined(*[_id == "drafts." + ^._id])
]._id
A way around the breaking changes in the API is to use length() = 0 in place of !defined(), but that makes an already slow query 10-20x slower.
Does anyone know a way of making filters that consider only the latest version?
Edit: An example of where I need this is when I want to see all documents without any categories. In a normal filter the document shows up regardless of whether it is the published version or the draft that has no categories, so if you add categories but don't want to publish immediately, the no-categories list gets confusing. ;-)
100x improvement on API v2021-03-25 🥳
The only way I was able to solve this with speed was to first make a projection of the sub-query so it doesn't run once for every non-draft. Then I thought: why not project both sets and then figure out the overlap - and that was even faster! It runs more than 10x faster than what was possible on API v1 and 100x faster than any of the suggestions for the new API.
{
  'drafts': *[ _type == $type && _id in path("drafts.**") ]._id,
  'published': *[ _type == $type && !(_id in path("drafts.**")) ]._id,
}
{
  'current': published[ !("drafts." + # in ^.drafts) ] + drafts
}
First I get both drafts and non-drafts and "store" them in this projection, like a variable 😉 (-ish).
Then I start with my non-drafts - published.
And filter out any that have a counterpart in my drafts "variable".
Lastly I add all drafts to my list of filtered non-drafts.
Overall I think you're on the right track. Some ideas to help you out:
Drafts are always fresher and newer than their published documents, so if a given doc's _id is in path("drafts.**"), that is already the most recently updated version.
Knowing the above allows you to skip the defined(*[_id == ...]) part of the query for drafts, speeding up your execution
As drafts are already included, we can exclude published documents with a draft (defined(*[_id == "drafts." + ^._id][0]))
Notice I added a [0] to the end of the query to pick only the first element that matches. This will improve performance slightly.
For getting only documents that have no categories, use count(categoriesField) < 1
Order documents with | order(_updatedAt desc) to get the freshest documents first
And paginate your request to reduce the payload and speed things up.
Here's a sample query applying these principles (I haven't run it, so you may have to make some adjustments):
*[
  _type == $type &&
  // Assuming you only want those without categories:
  count(categories) < 1 &&
  (
    // Is either a draft -> drafts are always fresher
    _id in path("drafts.**") ||
    // Or a published document with no draft
    !defined(*[_id == "drafts." + ^._id][0])
    // 👆 with the check above we're ensuring only
    // published documents run the expensive defined query
  )
]
// Order by last updated
| order(_updatedAt desc)
// Paginate for faster queries
[$paginationStart..$paginationEnd]
// Get only the _id, assuming that's what you want
._id
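For reference, the query above could be run from a script roughly like this (a sketch using @sanity/client - the project ID, dataset and parameter values are placeholders you'd replace with your own):
// Sketch only: projectId/dataset and the parameter values below are examples.
const { createClient } = require('@sanity/client')

const client = createClient({
  projectId: 'your-project-id',
  dataset: 'production',
  apiVersion: '2021-03-25', // the new API discussed above
  useCdn: false,
})

const query = `*[
  _type == $type &&
  count(categories) < 1 &&
  (
    _id in path("drafts.**") ||
    !defined(*[_id == "drafts." + ^._id][0])
  )
] | order(_updatedAt desc) [$paginationStart..$paginationEnd]._id`

client
  .fetch(query, { type: 'post', paginationStart: 0, paginationEnd: 49 })
  .then((ids) => console.log(ids))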
Hope this helps 🙌

ImageJ batch processing - opening a series of images containing a specific name and doing stuff on them

I have 25K TIF files (please don't ask why) that I want to organize into stacks in ImageJ. For each region of interest (ROI) there are 50 images, which break down into 25 z-planes for each of two channels, and I want each ROI in a single stack. I'd like to batch process the whole folder without manually opening 50 images 500 times. I've attached a picture of what the file names look like:
Folder organization
r01c01f01p01-ch1.tif - the first 10 characters are a unique ID for each ROI, then the plane number (p01), then the channel (ch1 or ch2), then the file extension.
Here's what I have so far (which I cobbled together from other macros, so it may not make sense...). This is using the ImageJ macro language.
// Processing loop to process each file in the folder.
for (i = 0; i < list.length; i++) {
    showProgress(i + 1, list.length);
    if (endsWith(list[i], ".tif")) { // skip the subfolder (I create a subfolder earlier in the macro)
        print("-- Processing file: " + list[i] + " --");
        open(dir + list[i]);
        imageTitle = getTitle();
        newTitle = substring(imageTitle, 0, lengthOf(imageTitle) - 10); // "r01c01f01p" - cutting off the plane number and the rest to just get the ROI ID
        // This is where I'm stuck:
        // find all files containing newTitle and open them (which would be 50 at a time), then run the following on them
        run("Images to Stack", "name=Ch1 title=[] use");
        run("Duplicate...", "title=Ch2 duplicate");
        selectWindow("Ch1");
        run("Slice Remover", "first=1 last=50 increment=2");
        selectWindow("Ch2");
        run("Slice Remover", "first=2 last=50 increment=2");
        run("Merge Channels...", "c1=Ch1 c2=Ch2 create");
        saveAs("tiff", dirNew + newTitle + "_Stack.tif");
        // Close(All)?
    }
}
print("-- Done --");
showStatus("Finished.");
setBatchMode(false); // Exit batch mode
run("Collect Garbage");
Thank you!
You could do something like:
for (plane = 1; plane < 51; plane++) {
    open(newTitle + plane + "-ch1.tif");
    open(newTitle + plane + "-ch2.tif");
}
Which would take care of the opening. I would be inclined to have a loop prior to this that collates the unique "newTitle"s, as your current setup would end up opening the first item, assembling the combined TIF, and then repeating that process 25K times, if I understand it correctly.
Given that you know the number of unique "r01c01f01p" values, in principle you could do a set of nested loops akin to:
newTitleArray = newArray();
for (r = 1; r < 50; r++) {
    titleBit = "r0" + toString(r);
    for (c = 1; c < 501; c++) {
        titleBit = titleBit + "f0"...
Alternatively, you could set up a loop that checks for unique "r01c01f01p" values and adds them to an array. In either case, you'd replace the for "list" loop with a for "newTitleArray" loop, and then continue on to the opener I listed above instead of your existing one.
If I am understanding correctly, it seems like you might do well to stack by channel first, then merge the two. I am not 100% sure, but I think you could potentially use a macro I have already created to do that. It was originally meant to batch process terabytes of 5D data, so it should be very comfortable handling your volume of images. It is not exactly what you are looking for, but should be super easy to modify (I went a little overboard with the commenting in the code), and I think the only thing it does that you might rather it not is produce max projects from the inputs. I'll throw a link here and look for your reply. If it's of interest, let me know and we can work to make it suit your needs together :-) Otherwise, if you could provide a little more detail about where you're getting stuck and/or where I may have misunderstood, I will do my very best to help!
https://github.com/evanjkiely/FIJIMacros

Neo4j: "ghost" node in label index throws error

I have a neo4j database with a set of nodes with label :EXAMPLE.
There are two operations: first I delete one node, and then I look for another one. They are done separately using the Neo4j API.
MATCH (n:EXAMPLE {Name: { name1 }}) DELETE n;
and
MATCH (n:EXAMPLE {Name: { name2 }}) RETURN n;
Sometimes, when I execute the second query, it throws an error "Node with id 123". Node with id 123 is the same node that was deleted by the first query.
It happens when a lot of requests are coming to the database simultaneously.
My guess is that this could happen if the node was deleted but the EXAMPLE label index hasn't been updated yet. Two facts support this theory:
1) The error is intermittent.
2) If I change the second query like this (removing the label), I don't get the error:
MATCH (n {Name: { name2 }}) RETURN n;
The Neo4j version is 2.1.5, Java is OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1) and the operating system is Debian. There are no indexes in the database other than the label.
The question is: how can I fix this while still using labels?
What ends up happening is that (simplified) the operations get ordered like so:
Q1: MATCH (n)
Q2: DELETE (n), COMMIT
Q1: RETURN n # Error, n no longer exists
For implementation reasons, this is much more likely to happen if Cypher goes via an index. The database will eventually handle this for you, but for now you'll need to wrap that read query in a retry block - if it fails with this type of error, you simply run it again.
On that note, there are other errors, such as deadlocks, that are easily recovered from by retrying, so wrapping your statements and/or transactions in retry blocks is a useful thing to do in general.
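As an illustration, a retry block could look roughly like this (just a sketch: runQuery is a stand-in for whatever function you use to send Cypher to the server, and the error check is an assumption you'd match against the actual error your driver raises):
// Sketch of a generic retry block. runQuery(cypher, params) is a hypothetical
// helper that sends a Cypher statement to Neo4j and returns a promise.
async function withRetry(work, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await work();
    } catch (err) {
      // Only retry errors known to be transient (the "ghost node" read, deadlocks);
      // rethrow anything else, or give up after the last attempt.
      const transient = /Node with id|DeadlockDetected/.test(String(err));
      if (!transient || attempt === attempts) throw err;
    }
  }
}

// Usage:
// const result = await withRetry(() =>
//   runQuery('MATCH (n:EXAMPLE {Name: { name2 }}) RETURN n', { name2: 'some-name' })
// );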
This is a possible workaround:
Mark nodes as deleted instead of deleting them, ignore nodes that are marked as deleted, and then delete all such nodes in one batch with a garbage-collection job, as sketched below.
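In Cypher that pattern could look roughly like this (again a sketch: runQuery is a placeholder for your own query helper, and the deleted property name is just an example):
// "Delete" by flagging the node instead of removing it
runQuery('MATCH (n:EXAMPLE {Name: { name1 }}) SET n.deleted = true', { name1: 'some-name' });

// Reads skip anything that has been flagged
runQuery(
  'MATCH (n:EXAMPLE {Name: { name2 }}) WHERE NOT coalesce(n.deleted, false) RETURN n',
  { name2: 'another-name' }
);

// Periodic garbage collector: remove all flagged nodes in one batch
runQuery('MATCH (n:EXAMPLE) WHERE n.deleted = true DELETE n');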

phalcon querybuilder total_items always returns 1

I build a query via createBuilder(), and when executing it (getQuery()->execute()->toArray())
I get 10946 elements. I want to paginate it, so I pass the builder to:
$paginator = new \Phalcon\Paginator\Adapter\QueryBuilder(array(
    "builder" => $builder,
    "limit"   => $limit,
    "page"    => $current_page
));
$limit is 25 and $current_page is 1, but when doing:
$page = $paginator->getPaginate();
$page->total_items;
it returns 1.
Is that a bug or am I missing something?
UPD: it seems like when counting items it uses the generated SQL with the limit applied. It doesn't matter what the limit is - the limit divided by the items per page always equals 1. I might be mistaken.
UPD2: A colleague helped me figure this out. The bug is in the query Phalcon produces: the count() over the GROUP BY counts the grouped elements. So a workaround looks like:
$dataCount = $builder->getQuery()->execute()->count();
$page->next = $page->current + 1;
$page->before = $page->current - 1 > 0 ? $page->current - 1 : 1;
$page->total_items = $dataCount;
$page->total_pages = ceil($dataCount / 100);
$page->last = $page->total_pages;
I know this isn't much of an answer, but this is most likely a bug. The great guys at Phalcon took on a massive job that is too big to do properly in their limited free time, and things like PHQL, Volt and other big but non-core components don't receive as much attention as we'd like. Given that most of the time in the past 6 months was spent on v2, there are nearly 500 open bugs about things like this, and counting. I came across considerable issues in the ORM, Volt, Validation and Session, which in the end made me stick with less cool but more proven solutions. When v2 comes out I'm sure all attention will be on the bug list and testing; until then we are mostly on our own. Given that it's all C right now, only a few enthusiasts get involved - with v2 this will also change.
If this is the only problem you are hitting, the best approach is to update your query and compute the count you need yourself, without relying on getPaginate().

Rails show different object every day

I want to match my user to a different user in his/her community every day. Currently I use code like this:
@matched_user = User.near(@user).order("RANDOM()").first
But I want a different @matched_user on a daily basis. I haven't been able to find anything on Stack Overflow or in the APIs that gives me insight into how to do it. I feel it should be simpler than resorting to a rake task with cron. (I'm on Postgres.)
Whenever I find myself hankering for shared 'memory' or transient state, I think to myself "this is what (distributed) caches were invented for".
@matched_user = Rails.cache.fetch(@user.cache_key + '/daily_match', expires_in: 1.day) {
  User.near(@user).order("RANDOM()").first
}
NOTE: While specifying a TTL for a cache entry tells Rails/the cache system to try to keep that value for the given timeframe, there's NO guarantee that it will. In particular, a cache that aggressively tries to reclaim memory may expire an entry well before its desired expires_in time.
For this particular use case that shouldn't be a big deal, but where the business/domain logic demands periodically generated values that are durable, you really have to persist them in your database.
How about using PostgreSQL's SETSEED function? I use the date as the seed so that the seed changes every day but stays consistent within a day:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
@matched_user = User.near(@user).order("RANDOM()").first
You may want to re-seed with a random value after using this so that future calls to RANDOM() aren't biased:
random = User.connection.execute("SELECT RANDOM()").to_a.first["random"]
# Same code as above:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
@matched_user = User.near(@user).order("RANDOM()").first
# Use random value before seed to make new seed:
User.connection.execute "SELECT SETSEED(#{random})"
I have split these steps into different sections just for readability; you can optimise the query later.
1) Find all user records created before this morning, so that the count stays fixed for the day.
users_till_today_morning = User.where("created_at < ?", DateTime.now.in_time_zone(Time.zone).beginning_of_day)
2) Pluck all IDs
user_ids = users_till_today_morning.pluck(:id)
3) Today's day of the month is a number in the range (1..31) that remains constant throughout the day.
day_today = Time.now.day
4) Select the same ID for the day
todays_user_id = user_ids[day_today % user_ids.count]
@matched_user = User.find(todays_user_id)
So it gives you a different user record each day while keeping the same record throughout the day!