Can I rewrite this Cypher query to be compatible with Redis Graph?

My use case is that I have some agents in an organisation structure. I want to select, for some agent (it can be me), the sum (amount of money) of all contracts that that agent's subordinates (and the subordinates of their subordinates, and so on) created with clients, grouped by contract category.
The problem is that RedisGraph does not currently support the all() predicate. But I need to filter the relationships between agents, because we have multiple "modules" with different organisation structures and I need a report from just one module at a time.
My current Cypher query is:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WHERE all(rel IN relationships(path) WHERE
rel.module_id = 1
AND rel.valid_from < '2020-05-29'
AND '2020-05-29' < rel.valid_to)
WITH b as mediators
MATCH (mediators)-[:mediated]->(c:contract)
RETURN
c.category as category,
count(c) as contract_count,
sum(c.sum) as sum
ORDER BY sum DESC, category
This query works in Neo4j.
I don't even know if this query is correctly written for the type of result that I want.
My boss would really like to use RedisGraph instead of Neo4j for performance reasons, but I can't find any way to rewrite this query so that it works in RedisGraph. Is it even possible?
Edit 1: I was told that we will be using the graph just for currently valid data and just for one module, so I no longer need a working all() predicate, but I am still interested in an answer.

The ALL function isn't supported at the moment; we do intend to add it in the near future. An awkward way of achieving the same effect as the ALL function would be a combination of UNWIND and count:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WITH b AS b, relationships(path) AS edges, size(relationships(path)) AS edge_count
UNWIND edges AS r
WITH b AS b, edge_count AS edge_count, r AS r
WHERE r.module_id = 1 AND r.valid_from < '2020-05-29' AND '2020-05-29' < r.valid_to
WITH b AS b, edge_count AS edge_count, count(r) AS filter_edge_count
WHERE edge_count = filter_edge_count
....
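For completeness, here is a rough sketch of how the elided part might be filled in, reusing the MATCH and RETURN from the original question. This is untested against RedisGraph, so treat it as a starting point only:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WITH b AS b, relationships(path) AS edges, size(relationships(path)) AS edge_count
UNWIND edges AS r
WITH b AS b, edge_count AS edge_count, r AS r
WHERE r.module_id = 1 AND r.valid_from < '2020-05-29' AND '2020-05-29' < r.valid_to
WITH b AS b, edge_count AS edge_count, count(r) AS filter_edge_count
// keep only paths where every relationship passed the filter
WHERE edge_count = filter_edge_count
MATCH (b)-[:mediated]->(c:contract)
RETURN c.category AS category, count(c) AS contract_count, sum(c.sum) AS sum
ORDER BY sum DESC, category
The WHERE edge_count = filter_edge_count line is what stands in for all(): a path is kept only if every one of its relationships passed the filter.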

how to count relationships from initial query

Hi, I would like to make a query from a query (if that makes any sense).
My original solution is:
PROFILE MATCH (q:Question)-[:TAGGED]-> (:Tag {name:"python"})
CALL { WITH q
MATCH (q:Question)-[:TAGGED]-> (t:Tag)
WITH q, count(t) as c
RETURN c}
RETURN max(c)
The aim is to find all the questions q that are TAGGED with the python tag. From the q nodes that we get, the second objective is to count the number of TAGGED relationships they have. The goal is to find the maximum number of TAGGED relationships a question q can have. The problem is that this is not optimized enough, as I am trying to limit the db hits. Another idea was the following:
MATCH (:Tag {name: 'python'}) <-[:TAGGED]- (q:Question)-[:TAGGED]->(t: Tag)
WITH q, count(t) + 1 AS c
RETURN max(c)
In the first case, I tried to first find the questions that had at least the python tag and then pipe those questions on to count the number of TAGGED relationships they had, but this seemed to be worse than the second query.
In the second query I had a problem with an expansion when I used PROFILE: at the stage (q)-[anon_2:TAGGED]->(t) I take on too many db hits.
I'm confused as to why my first query doesn't work as well as the second.
I would try the following query:
MATCH (q:Question)
WHERE (q)-[:TAGGED]-> (:Tag {name:"python"})
WITH q, size((q)-[:TAGGED]->()) AS count
RETURN max(count)
Let's try this one:
PROFILE MATCH (q:Question)-[:TAGGED]-> (t:Tag)
WITH q, collect(t) AS tags
WHERE ANY(tag IN tags WHERE tag.name = 'python')
WITH q, size(tags) AS tagSize
RETURN max(tagSize)

Neo4j cypher query performance

I have the following cypher queries and their execution plans respectively,
Before optimization,
match (o:Order {statusId:74}) <- [:HAS_ORDERS] - (m:Member)
with m,o
match (m:Member) - [:HAS_WALLET] -> (w:Wallet) where w.currentBalance < 250
return m as Members,collect(o) as Orders,w as Wallets order by m.createdAt desc limit 10
After optimization (db hits reduced by 40-50%),
match (m:Member) - [:HAS_ORDERS]->(o:Order {statusId:74})
with m, collect(o) as Orders
match (m) - [:HAS_WALLET] - (w:Wallet) where w.currentBalance < 250
return m as Members, Orders, w as Wallets
order by m.createdAt desc limit 10
There are 3 types of nodes, Member, Order and Wallet. And the relation between them goes like this,
Member - [:HAS_ORDERS] -> Order,
Member - [:HAS_WALLET] -> Wallet
I have around 100k Member nodes (100k wallet) and almost 570k orders for those members.
I want to fetch all the members who have order status 74 and a wallet balance of less than 250. The above query gives the desired result, but it takes an average of 1.5 seconds to respond.
I suspect there is still scope for optimization here, but I'm not able to figure it out. I've added indexes on the fields on which I'm filtering the data.
I've just started exploring Neo4j and am not sure how I can optimize this.
We can leverage index-backed ordering to try a different approach here. By providing a type hint (something to indicate the property value is a string) along with ordering by the indexed property, we can have the planner use the index to walk the :Member nodes in the order you want (by m.createdAt DESC) for free, meaning we don't need to touch every :Member node and sort them. We then check each node in that order to find the ones that meet the desired criteria, until we get the 10 you need.
From some back-and-forth on the Neo4j users slack, you mentioned that of your 100k :Member nodes, about 52k of them fit the criteria you're looking for, so this is a good indicator that we may not have to look very far down the ordered :Member nodes before finding the 10 that meet the criteria.
Here's the query:
MATCH (m:Member)
WHERE m.createdAt > '' // type hint
WITH m
ORDER BY m.createdAt DESC
MATCH (m)-[:HAS_WALLET]->(w)
WHERE w.currentBalance < 250 AND EXISTS {
MATCH (m)-[:HAS_ORDERS]->(:Order {statusId:74})
}
WITH m, w
LIMIT 10
RETURN m as member, w as wallet, [(m)-[:HAS_ORDERS]->(o:Order {statusId:74}) | o] as orders
Note that by using an existential subquery, we just have to find one order that satisfies the condition. We wait until the limit of 10 members is reached before using a pattern comprehension to grab all the matching orders for those 10 members.
Have you tried subqueries? You can use a subquery to shrink down the number of nodes before passing them along to subsequent parts of the query. (It would seem that an omniscient query planner could do this, but Cypher isn't there yet.) You may have to experiment with which subquery filters out the most nodes; a rough sketch is given after the links below.
An example of using a subquery is here:
https://community.neo4j.com/t/slow-query-with-very-limited-data-and-boolean-false/31555
Another one is here:
https://community.neo4j.com/t/why-is-this-geospatial-search-so-slow/31952/24
(Of course, I assume you already have the appropriate properties indexed.)
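For illustration, here is a sketch of that idea using a CALL subquery against this question's data model (labels and properties taken from the question above); it is untested, so treat it as a starting point rather than a finished query:
MATCH (m:Member)
CALL {
  WITH m
  MATCH (m)-[:HAS_WALLET]->(w:Wallet)
  WHERE w.currentBalance < 250
  RETURN w
}
// only members that survived the wallet filter reach this point
MATCH (m)-[:HAS_ORDERS]->(o:Order {statusId: 74})
RETURN m AS Members, collect(o) AS Orders, w AS Wallets
ORDER BY m.createdAt DESC
LIMIT 10
The subquery filters down to members with a qualifying wallet first, so the order expansion only runs for those members.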

Nested SQL evaluation question with unnest

This may be a basic question, but I just couldn't figure it out. The sample data and query can be found here (under the "First-touch" tab).
I'll skip the marketing terminology here, but basically what the query does is attribute credits/points to placements (ads) based on a certain rule. Here, the rule is "first-touch", which means the credit goes to the first ad the user interacted with, whether a view or a click. "FLOODLIGHT" here means the user takes action to actually buy the product (a conversion).
As you can see in the sample data, user 1 has one conversion and the first ad is placement 22 (first-touch), so 22 gets 1 point. User 2 has two conversions and the first ad of each is 11, so 11 gets 2 points.
The logic is quite simple here, but I had a difficult time understanding the query itself. What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time? Aren't they essentially the same? I mean, both of them came from UNNEST(t.*_paths.events), and attributed_event.event_time also came from the same place.
What do prev_conversion_event.event_time, conversion_event.event_time, and attributed_event.event_time evaluate to in this scenario anyway? I'm just confused as hell here. I much appreciate the help!
For convenience I'm pasting the sample data, the query and output below:
Sample data
Output
/* Substitute *_paths for the specific paths table that you want to query. */
SELECT
  (
    SELECT
      attributed_event_metadata.placement_id
    FROM (
      SELECT
        AS STRUCT attributed_event.placement_id,
        ROW_NUMBER() OVER(ORDER BY attributed_event.event_time ASC) AS rank
      FROM
        UNNEST(t.*_paths.events) AS attributed_event
      WHERE
        attributed_event.event_type != "FLOODLIGHT"
        AND attributed_event.event_time < conversion_event.event_time
        AND attributed_event.event_time > (
          SELECT
            IFNULL( (
              SELECT
                MAX(prev_conversion_event.event_time) AS event_time
              FROM
                UNNEST(t.*_paths.events) AS prev_conversion_event
              WHERE
                prev_conversion_event.event_type = "FLOODLIGHT"
                AND prev_conversion_event.event_time < conversion_event.event_time),
            0)) ) AS attributed_event_metadata
    WHERE
      attributed_event_metadata.rank = 1) AS placement_id,
  COUNT(*) AS credit
FROM
  adh.*_paths AS t,
  UNNEST(*_paths.events) AS conversion_event
WHERE
  conversion_event.event_type = "FLOODLIGHT"
GROUP BY
  placement_id
HAVING
  placement_id IS NOT NULL
ORDER BY
  credit DESC
It is quite a convoluted query, to be fair. I think I know what you are asking; please correct me if that's not the case.
What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time?
You are doing something like "I want all the events from this (unnest), and for every event, I want to know which events precede it".
Say you have [A, B, C, D] and they are ordered in succession (A happened before B, A and B happened before C, and so on). The result of that unnesting and joining over that condition will get you something like [A:(NULL), B:(A), C:(A, B), D:(A, B, C)] (excuse the notation, I hope it is not confusing), each key:value pair being Event:(Predecessors). Note that A has no events before it, but B has A, etc.
Now you have a nice table with all the conversion events joined with the events that happened before each one.
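As a toy illustration of that correlation (not the ADH query itself; it assumes a hypothetical table paths with an ARRAY<STRUCT<name STRING, event_time INT64>> column called events):
SELECT
  e.name AS event,
  -- every event from the same path with an earlier event_time: the "predecessors"
  ARRAY(
    SELECT prev.name
    FROM UNNEST(p.events) AS prev
    WHERE prev.event_time < e.event_time
    ORDER BY prev.event_time
  ) AS predecessors
FROM paths AS p, UNNEST(p.events) AS e
In the real query the same trick is applied twice: conversion_event plays the role of e, while prev_conversion_event and attributed_event are the earlier events pulled from the same unnested array.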

Sum two counts in a new column without repeating the code

I have one maybe stupid question.
Look at the query:
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code?
Something like:
select count(a) as A, count(b) as b, A+B as C
From X
For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V
This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T
You may get the two individual counts from the same table and then get the summation of those counts, like below:
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C
Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural language, your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language, your main concern is: is this the best description of the desired result?
In SQL, most of your effort will go into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formatting (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plan of the two queries as you show them, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to express what originally took only one step in one line, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.

Select pair of rows that obey a rule

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distance between a pair using another node.
An example:
If I want to find the distance between A and B,
If I find a node x and have:
x -> A
x -> B
I can add these distances and have the distance between A and B.
My question:
How can I find all the nodes (such as x) and get their distances to (A and B)?
My purpose is to select the min value of distance.
P.s: A and B are just one connection (I need to do it for 100K connections).
Thanks !
As Andomar said, you'll need Dijkstra's algorithm; here's a link to that algorithm in T-SQL: T-SQL Dijkstra's Algorithm
Assuming you want to get the path from A to B with many intermediate steps, it is impossible to do it in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power; see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a process and use Dijkstra's algorithm.
This sounds like the traveling salesman problem.
From a SQL syntax standpoint: connect by prior would build the tree you're after, using start with, and limit the number of layers it can traverse; however, doing so will not guarantee the minimum.
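For illustration, a rough sketch in Oracle syntax, assuming the table is called links with the columns source, dest, distance from the question; it enumerates paths up to two hops from A but does not, by itself, pick the minimum:
-- walk outward from A, at most two hops, avoiding cycles
SELECT source, dest, distance, LEVEL AS depth
FROM links
START WITH source = 'A'
CONNECT BY NOCYCLE PRIOR dest = source AND LEVEL <= 2
LEVEL caps the depth, but you would still need to accumulate the distances along each path and take the minimum yourself, which is why this alone doesn't guarantee the shortest route.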
I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1. Wrapping something like this in a while loop, and replacing "A" with an id variable, would do the trick, no?
For example (A is source, B is final destination):
DECLARE var_id as INT
WHILE var_id != 'B'
BEGIN
SELECT id INTO var_id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1
SELECT var_id
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.
Join the table to itself with destination joined to source. Add the distance from the two links. Insert that as a new link with left side source, right side destination and total distance if that isn't already in the table. If that is in the table but with a shorter total distance then update the existing row with the shorter distance.
Repeat this until you get no new links added to the table and no updates with a shorter distance. Your table now contains a link for every possible combination of source and destination with the minimum distance between them. It would be interesting to see how many repetitions this would take.
This will not track the intermediate path between source and destination but only provides the shortest distance.
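A rough sketch of one such pass (PostgreSQL-flavoured, assuming a table links(source, dest, distance) named after the question's columns; the loop around it is left out):
-- add pairs reachable in two hops that are not in the table yet
INSERT INTO links (source, dest, distance)
SELECT a.source, b.dest, MIN(a.distance + b.distance)
FROM links a
JOIN links b ON a.dest = b.source
LEFT JOIN links c ON c.source = a.source AND c.dest = b.dest
WHERE c.source IS NULL AND a.source <> b.dest
GROUP BY a.source, b.dest;

-- shorten existing pairs when a two-hop route is cheaper
UPDATE links c
SET distance = s.best
FROM (
  SELECT a.source, b.dest, MIN(a.distance + b.distance) AS best
  FROM links a
  JOIN links b ON a.dest = b.source
  GROUP BY a.source, b.dest
) s
WHERE s.source = c.source AND s.dest = c.dest AND s.best < c.distance;
Each pass only extends paths by one hop, so the number of repetitions is bounded by the length of the longest shortest path in the data.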
IIUC this should do it, but I'm not sure if this is really viable (performance-wise) due to the large number of rows involved and the self-join:
SELECT
  t1.source AS A,
  t1.dest AS x,
  t2.dest AS B,
  t1.distance + t2.distance AS total_distance
FROM
  big_table AS t1
JOIN
  big_table AS t2 ON t1.dest = t2.source
WHERE
  t1.source = 'insert source (A) here' AND
  t2.dest = 'insert destination (B) here'
ORDER BY
  total_distance ASC
LIMIT
  1
The above snippet will work for the case in which you have two rows in the form A->x and x->B, but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combinations should be trivial (e.g. create a view that duplicates each row and swaps source and dest).
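For instance, a minimal sketch of such a view (the view name is made up; columns follow the question's big_table):
CREATE VIEW links_both_ways AS
SELECT source, dest, distance FROM big_table
UNION ALL
-- same links with the endpoints swapped, so x -> A also appears as A -> x
SELECT dest AS source, source AS dest, distance FROM big_table;
Running the two-hop query against this view instead of big_table then covers A->x with x->B, x->A with x->B, and the remaining combinations.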