Neptune `barrier()` optimization - amazon-neptune

https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-traversal-tuning.html
This documentation mentions an important optimization affecting Neptune engine version 1.0.5.0 recommending a barrier() step during a specific traversal sequence:
g.V().hasLabel('airport').
order().
by(out().count(),desc).
limit(10).
out()
Does this strictly apply only to the order().by().limit() sequence, or would it also affect order().by().range()?
Would placing barrier() as the very last step of the traversal suffice, or does it need to come directly after limit()?
If the traversal contains multiple order().by().limit() sequences, would a single barrier() suffice, or does each instance need it's own barrier()? Note that I'm using these within project(...).by(__.order().by().limit()).by(__.order().by().limit()).

Related

Implementations of (fully) dynamic connectivity data structures

The dynamic connectivity problem for graphs consists in maintaining a graph data structure that allows for adding and deleting edges of the graph.
Moreover, the data structure should support connectivity queries.
Typically, such a query is of the form ''Are the nodes u and v connected in the graph?''
There are variants of the dynamic connectivity problem that also support different connectivity queries like 2-edge-connectivity or biconnectivity.
My question is: Are there existing efficient implementations of dynamic connectivity data structures?
By efficient I mean that data structures with a low amortized operation costs.
In particular, I am NOT interested in trivial implementations with a complexity of O(n) per operation!
Below I describe in more detail what I am looking for an what I already know.
If only edge insertions are allowed the dynamic connectivity problem can be solved by the well known disjoint-set (aka union find) data structure.
For this data structure there are implementations available in many different programming languages.
Unfortunately, this does not seem to be the case for the dynamic connectivity problem that also allows edge deletions.
The situation is even worse for data structures that also allow other connectivity queries like 2-edge- or biconnectivity.
To the best of my knowledge the algorithms presented in Holm et al. (2001) are still state of the art for many dynamic connectivity problems.
This publication was accompanied by an experimental study, however, as far as I can tell the code was never made publicly available. Also, therein only implementations for the regular connectivity problem are discussed, not for 2-edge- or biconnectivity.
The algorithms by Holm et al. (and also by other authors) are highly non-trivial.
Even though the algorithms are described in much detail it requires a lot of expertise to implement these algorithms in practice.
Because of this I am looking for existing implementation of different dynamic connectivity data structures.
The table below summarizes the (currently underwhelming) implementations of different combinations of supported manipulations and queries.
Graph Manipulations
Connectivity
2-edge-connectivity
Biconnectivity
incremental (adding edges)
disjoint-set
decremental (deleting edges)
Rafael Glikis
fully (adding and deleting edges)
I have searched for implementations in different places. I have looked on git-hub, I have looked through the external links in the relevant Wikipedia articles and I have skimmed through a lot of literature without any success.
I expect we will need a framework for trying things out so that we can discuss this in concrete terms.
I have implemented a small windows application that accepts user queries to read, build, edit and query the connectivity of a graph, showing the time taken to execute each.
Sample run:
Supported queries
add v1 v2 : add link to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.539246 0.539246 query
type query> delete 23 20
4720 vertices 27443 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.004432 0.004432 query
type query> add 23 20
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.0046639 0.0046639 query
The complete application is at https://github.com/JamesBremner/graphConnectivity
To demonstrate how this application can be used, I built it with the graph engine at https://github.com/JamesBremner/PathFinderFeb2023 and ran it on a couple of the test datasets from https://dyngraphlab.github.io/
dataset
edge count
delete
add
3elt.graph.seq.txt
27,443
5ms
5ms
144.graph.seq.txt
2,148,787
13ms
13ms
To get the average time to perform multiple queries, use the random command, like this:
Supported queries
add v1 v2 : add link to graph
add random n : add n random links to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
type query> add random 10
4720 vertices 27454 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
10 1.62e-06 1.62e-05 randomAdd

What is the best order of arguments for spatially indexed functions of sf (st_intersects, etc.)?

It is explained on https://www.r-spatial.org/r/2017/06/22/spatial-index.html that "The R tree is constructed on the first argument (x), and used to match all geometries on the second argument (y) of binary functions". However, it is finally not explained what are the criteria that could guide the user in the choice of the sf object to put in the first argument. Are there some rules of thumb for this choice?
Generally, with creating spatial indexes you want to start with a larger envelopes that decompose to smaller envelopes. So, if you are extending from that then starting with a geometry with a larger extent then building smaller envelopes from that would match the general schema of how these selections are optimized.

Cypher BFS with multiple Relations in Path

I'd like to model autonomous systems and their relationships in Graph Database (memgraph-db)
There are two different kinds of relationships that can exist between nodes:
undirected peer2peer relationships (edges without arrows in image)
directed provider2customer relationships (arrows pointing to provider in image)
The following image shows valid paths that I want to find with some query
They can be described as
(s)-[:provider*0..n]->()-[:peer*0..n]—()<-[:provider*0..n]-(d)
or in other words
0-n c2p edges followed by 0-n p2p edges followed by 0-n p2c edges
I can fix the first and last node and would like to find a (shortest/cheapest) path. As I understand I can do BFS if there is ONE relation on the path.
Is there a way to query for paths of such form in Cypher?
As an alternative I could do individual queries where I specify the length of each of the segments and then do a query for every length of path until a path is found.
i.e.
MATCH (s)<-[]->(d) // All one hop paths
MATCH (s)-[:provider]->()-[:peer]-(d)
MATCH (s)-[:provider]->()<-[:provider]-(d)
...
Since it's viable to have 7 different path sections, I don't see how 3 BFS patterns (... BFS*0..n) would yield a valid solution. It's impossible to have an empty path because the pattern contains some nodes between them (I have to double-check that).
Writing individual patterns is not great.
Some options are:
MATCH path=(s)-[:BFS*0.n]-(d) WHERE {{filter_expression}} -> The expression has to be quite complex in order to yield valid paths.
MATCH path=(s)-[:BFS*0.n]-(d) CALL module.filter_procedure(path) -> The module.procedure(path) could be implemented in Python or C/C++. Please take a look here. I would recommend starting with Python since it's much easier. Python for the PoC should be fine. I would also recommend starting with this option because I'm pretty confident the solution will work, + it's modular. After all, the filter_procedure could be extended easily, while the query will stay the same.
Could you please provide a sample dataset in a format of a Cypher query (a couple of nodes and edges / a small graph)? I'm glad to come up with a solution.

How to constrain dtw from dtw-python library?

Here is what I want to do:
keep a reference curve unchanged (only shift and stretch a query curve)
constrain how many elements are duplicated
keep both start and end open
I tried:
dtw(ref_curve,query_curve,step_pattern=asymmetric,open_end=True,open_begin=True)
but I cannot constrain how the query curve is stretched
dtw(ref_curve,query_curve,step_pattern=mvmStepPattern(10))
it didn’t do anything to the curves!
dtw(ref_curve,query_curve,step_pattern=rabinerJuangStepPattern(4, "c"),open_end=True, open_begin=True)
I liked this one the most but in some cases it shifts the query curve more than needed...
I read the paper (https://www.jstatsoft.org/article/view/v031i07) and the API but still don't quite understand how to achieve what I want. Any other options to constrain number of elements that are duplicated? I would appreciate your help!
to clarify: we are talking about functions provided by the DTW suite packages at dynamictimewarping.github.io. The question is in fact language-independent (and may be more suited to the Cross-validated Stack Exchange).
The pattern rabinerJuangStepPattern(4, "c") you have found does in fact satisfy your requirements:
it's asymmetric, and each step advances the reference by exactly one step
it's slope-limited between 1/2 and 2
it's type "c", so can be normalized in a way that allows open-begin and open-end
If you haven't already, check out dtw.rabinerJuangStepPattern(4, "c").plot().
It goes without saying that in all cases you are getting is the optimal alignment, i.e. the one with the least accumulated distance among all allowed paths.
As an alternative, you may consider the simpler asymmetric recursion -- as your first attempt above -- constrained with a global warping window: see dtw.window and the window_type argument. This provides constraints of a different shape (and flexible size), which might suit your specific case.
PS: edited to add that the asymmetricP2 recursion is also similar to RJ-4c, but with a more constrained slope.

how do range queries work on a LSM (log structure merge tree)?

Recently I've been studying common indexing structures in databases, such as B+-trees and LSM. I have a solid handle on how point reads/writes/deletes/compaction would work in an LSM.
For example (in RocksDB/levelDB), on a point query read we would first check an in-memory index (memtable), followed by some amount of SST files starting from most to least recent. On each level in the LSM we would use binary search to help speed up finding each SST file for the given key. For a given SST file, we can use bloom filters to quickly check if the key exists, saving us further time.
What I don't see is how a range read specifically works. Does the LSM have to open an iterator on every SST level (including the memtable), and iterate in lockstep across all levels, to return a final sorted result? Is it implemented as just a series of point queries (almost definitely not). Are all potential keys pulled first and then sorted afterwards? Would appreciate any insight someone has here.
I haven't been able to find much documentation on the subject, any insight would be helpful here.
RocksDB has a variety of iterator implementations like Memtable Iterator, File Iterator, Merging Iterator, etc.
During range reads, the iterator will seek to the start range similar to point lookup (using Binary search with in SSTs) using SeekTo() call. After seeking to start range, there will be series of iterators created one for each memtable, one for each Level-0 files (because of overlapping nature of SSTs in L0) and one for each level later on. A merging iterator will collect keys from each of these iterators and gives the data in sorted order till the End range is reached.
Refer to this documentation on iterator implementation.