How to remove/delete duplicate nodes from RedisGraph

How to remove/delete duplicate nodes from RedisGraph - redis

Can someone help me on deleting duplicate nodes from the RedisGraph for a particular label. I found cypher queries in Neo4j, but the are not supporting in Redis. Please help me on this.
I used below query, then RedisInsight thrown error
MATCH (p:Person)
WITH p.id as id , collect(p) AS nodes
WHERE size(nodes) > 1
RETURN [ n in nodes | n.id ] AS ids, size(nodes)
ORDER BY size(nodes) DESC
Error: RedisGraph does not currently support list comprehension

To delete duplicates you can use the following
MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes
WHERE size(nodes) > 1
UNWIND nodes[1..] AS node
DELETE node

Related

Neo4j cypher query perfomance

I have the following cypher queries and their execution plans respectively,
Before optimization,
match (o:Order {statusId:74}) <- [:HAS_ORDERS] - (m:Member)
with m,o
match (m:Member) - [:HAS_WALLET] -> (w:Wallet) where w.currentBalance < 250
return m as Members,collect(o) as Orders,w as Wallets order by m.createdAt desc limit 10
After optimization (db hits reduced by 40-50%),
match (m:Member) - [:HAS_ORDERS]->(o:Order {statusId:74})
with m, collect(o) as Orders
match (m) - [:HAS_WALLET] - (w:Wallet) where w.currentBalance < 250
return m as Members, Orders, w as Wallets
order by m.createdAt desc limit 10
There are 3 types of nodes, Member, Order and Wallet. And the relation between them goes like this,
Member - [:HAS_ORDERS] -> Order,
Member - [:HAS_WALLET] -> Wallet
I have around 100k Member nodes (100k wallet) and almost 570k orders for those members.
I want to fetch all the members who have order status 74 and wallet balance less than 250, and the above query gives the desired result but it takes an average 1.5 sec to respond.
I suspect there is a still scope of optimization here but I'm not be able to figure out. I've added indexing on fields upon which I'm filtering the data.
I've just started exploring neo4j and not sure how can I optimize this.

We can leverage index-backed ordering to try a different approach here. By providing a type hint (something to indicate the property value is a string) along with the ordering by the indexed property, we can have the planner use the index to check :Member nodes in the order you want (by m.createdAt DESC) for free (meaning we don't need to check every :Member node and order them), and check each of those in the given order to find the ones that meet the desired criteria until we get the 10 you need.
From some back-and-forth on the Neo4j users slack, you mentioned that of your 100k :Member nodes, about 52k of them fit the criteria you're looking for, so this is a good indicator that we may not have to look very far down the ordered :Member nodes before finding the 10 that meet the criteria.
Here's the query:
MATCH (m:Member)
WHERE m.createdAt > '' // type hint
WITH m
ORDER BY m.createdAt DESC
MATCH (m)-[:HAS_WALLET]->(w)
WHERE w.currentBalance < 250 AND EXISTS {
MATCH (m)-[:HAS_ORDERS]->(:Order {statusId:74})
}
WITH m, w
LIMIT 10
RETURN m as member, w as wallet, [(m)-[:HAS_ORDERS]->(o:Order {statusId:74}) | o] as orders
Note that by using an existential subquery, we just have to find one order that satisfies the condition. We wait until after the limit of 10 members is reached before using a pattern comprehension to grab all the orders for the 10 members.

Have you tried subqueries? If you can use a subquery to shrink down the number of nodes before passing it along to subsequent queries. (It would seem that an omniscient Query Planner could do this, but Cypher isn't there yet.). You may have to experiment with which subquery would filter out the most Nodes.
An example of using a subquery is here:
https://community.neo4j.com/t/slow-query-with-very-limited-data-and-boolean-false/31555
Another one is here:
https://community.neo4j.com/t/why-is-this-geospatial-search-so-slow/31952/24
(Of course, I assume you already have the appropriate properties indexed.)

Can I rewrite this Cypher query to be compatible with Redis Graph?

My use-case is that I have some agents in organisation structure. I want select for some agent (can by me) to see sum (amount of money) of all contracts that that agents subordinates (and subordinates of their subordinates and so on...) created with clients grouped by contract category.
Problem is that Redis Graph do not currently support all predicate. But I need to filter relations between agents because we have multiple "modules" with different organisation structures and I need report just from one module at the time.
My current Cypher query is:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WHERE all(rel IN relationships(path) WHERE
rel.module_id = 1
AND rel.valid_from < '2020-05-29'
AND '2020-05-29' < rel.valid_to)
WITH b as mediators
MATCH (mediators)-[:mediated]->(c:contract)
RETURN
c.category as category,
count(c) as contract_count,
sum(c.sum) as sum
ORDER BY sum DESC, category
This query works in Neo4j.
I don't event know if this query is correctly written for the type of result that I want.
My boss would really like to use Redis Graph instead Neo4j because of performance reasons but I can't find any way to rewrite this query to be functional in the Redis graph. Is it even possible?
Edit 1: I was told that we will be using graph just for currently valid data and just for one module so I no longer need functional all predicate but I am still interested in answer.

The ALL function isn't supported at the moment, we do intend to add it in the near future, an awkward way of achieving the same effect as the ALL function would be a combination of UNWIND and count
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WITH b AS b, relationships(path) AS edges, size(relationships(path)) AS edge_count
UNWIND edges AS r
WITH b AS b, edge_count AS edge_count, r AS r
WHERE r.module_id = 1 AND r.valid_from < '2020-05-29' AND '2020-05-29' < r.valid_to
WITH b AS b, edge_count AS edge_count, count(r) AS filter_edge_count
WHERE edge_count = filter_edge_count
....

Recursive query within recursive query

I would like to solve a problem consisting of 2 recursions.
In one of the 2 recursions I find out the answer to one question which is "What is the leaf member of a specific input (template)?" This is already solved.
In a second recursion I would like to run this query for a number of other inputs (templates).
1st part of the problem:
I have a tree and would like to find the leaf of it. This part of the recursion can be solved using this query:
with recursive full_tree as (
select id, "previousVersionId", 1 as level
from template
where
template."id" = '5084520a-bb07-49e8-b111-3ea8182dc99f'
union all
select c.id, c."previousVersionId", p.level + 1
from template c
inner join full_tree p on c."previousVersionId" = p.id
)
select * from full_tree
order by level desc
limit 1
The query output is one record including the leaf id I'm interested in. This is fine.
2nd part of the query:
Here's the problem. I would like to run the first query n times.
Currently I can run the query only if it's just one id ('5084520a-bb07-49e8-b111-3ea8182dc99f' in the example). But what If I have a list of 100 such ids.
My ultimate goal is to get one id response (the leaf id) to each of the 100 template ids in the list.
In theory, a query that allows me to run above query for each of my e.g. 100 template ids would solve my problem.

Retrieve hierarchical groups ... with infinite recursion

I've a table like this which contains links :
key_a key_b
--------------
a b
b c
g h
a g
c a
f g
not really tidy & infinite recursion ...
key_a = parent
key_b = child
Require a query which will recompose and attribute a number for each hierarchical group (parent + direct children + indirect children) :
key_a key_b nb_group
--------------------------
a b 1
a g 1
b c 1
**c a** 1
f g 2
g h 2
**link responsible of infinite loop**
Because we have
A-B-C-A
-> Only want to show simply the link as shown.
Any idea ?
Thanks in advance

The problem is that you aren't really dealing with strict hierarchies; you're dealing with directed graphs, where some graphs have cycles. Notice that your nbgroup #1 doesn't have any canonical root-- it could be a, b, or c due to the cyclic reference from c-a.
The basic way of dealing with this is to think in terms of graph techniques, not recursion. In fact, an iterative approach (not using a CTE) is the only solution I can think of in SQL. The basic approach is explained here.
Here is a SQL Fiddle with a solution that addresses both the cycles and the shared-leaf case. Notice it uses iteration (with a failsafe to prevent runaway processes) and table variables to operate; I don't think there's any getting around this. Note also the changed sample data (a-g changed to a-h; explained below).
If you dig into the SQL you'll notice that I changed some key things from the solution given in the link. That solution was dealing with undirected edges, whereas your edges are directed (if you used undirected edges the entire sample set is a single component because of the a-g connection).
This gets to the heart of why I changed a-g to a-h in my sample data. Your specification of the problem is straightforward if only leaf nodes are shared; that's the specification I coded to. In this case, a-h and g-h can both get bundled off to their proper components with no problem, because we're concerned about reachability from parents (even given cycles).
However, when you have shared branches, it's not clear what you want to show. Consider the a-g link: given this, g-h could exist in either component (a-g-h or f-g-h). You put it in the second, but it could have been in the first instead, right? This ambiguity is why I didn't try to address it in this solution.
Edit: To be clear, in my solution above, if shared braches ARE encountered, it treats the whole set as a single component. Not what you described above, but it will have to be changed after the problem is clarified. Hopefully this gets you close.

You should use a recursive query. In the first part we select all records which are top level nodes (have no parents) and using ROW_NUMBER() assign them group ID numbers. Then in the recursive part we add to them children one by one and use parent's groups Id numbers.
with CTE as
(
select t1.parent,t1.child,
ROW_NUMBER() over (order by t1.parent) rn
from t t1 where
not exists (select 1 from t where child=t1.parent)
union all
select t.parent,t.child, CTE.rn
from t
join CTE on t.parent=CTE.Child
)
select * from CTE
order by RN,parent
SQLFiddle demo

Painful problem of graph walking using recursive CTEs. This is the problem of finding connected subgraphs in a graph. The challenge with using recursive CTEs is to prevent unwarranted recursion -- that is, infinite loops In SQL Server, that typically means storing them in a string.
The idea is to get a list of all pairs of nodes that are connected (and a node is connected with itself). Then, take the minimum from the list of connected nodes and use this as an id for the connected subgraph.
The other idea is to walk the graph in both directions from a node. This ensures that all possible nodes are visited. The following is query that accomplishes this:
with fullt as (
select keyA, keyB
from t
union
select keyB, keyA
from t
),
CTE as (
select t.keyA, t.keyB, t.keyB as last, 1 as level,
','+cast(keyA as varchar(max))+','+cast(keyB as varchar(max))+',' as path
from fullt t
union all
select cte.keyA, cte.keyB,
(case when t.keyA = cte.last then t.keyB else t.keyA
end) as last,
1 + level,
cte.path+t.keyB+','
from fullt t join
CTE
on t.keyA = CTE.last or
t.keyB = cte.keyA
where cte.path not like '%,'+t.keyB+',%'
) -- select * from cte where 'g' in (keyA, keyB)
select t.keyA, t.keyB,
dense_rank() over (order by min(cte.Last)) as grp,
min(cte.Last)
from t join
CTE
on (t.keyA = CTE.keyA and t.keyB = cte.keyB) or
(t.keyA = CTE.keyB and t.keyB = cte.keyA)
where cte.path like '%,'+t.keyA+',%' or
cte.path like '%,'+t.keyB+',%'
group by t.id, t.keyA, t.keyB
order by t.id;
The SQLFiddle is here.

you might want to check with COMMON TABLE EXPRESSIONS
here's the link

How do I remember which way round PRIOR should go in CONNECT BY queries

I've a terrible memory. Whenever I do a CONNECT BY query in Oracle - and I do mean every time - I have to think hard and usually through trial and error work out on which argument the PRIOR should go.
I don't know why I don't remember - but I don't.
Does anyone have a handy memory mnemonic so I always remember ?
For example:
To go down a tree from a node - obviously I had to look this up :) - you do something like:
select
*
from
node
connect by
prior node_id = parent_node_id
start with
node_id = 1
So - I start with a node_id of 1 (the top of the branch) and the query looks for all nodes where the parent_node_id = 1 and then iterates down to the bottom of the tree.
To go up the tree the prior goes on the parent:
select
*
from
node
connect by
node_id = prior parent_node_id
start with
node_id = 10
So starting somewhere down a branch (node_id = 10 in this case) Oracle first gets all nodes where the parent_node_id is the same as the one for which node_id is 10.
EDIT: I still get this wrong so thought I'd add a clarifying edit to expand on the accepted answer - here's how I remember it now:
select
*
from
node
connect by
prior node_id = parent_node_id
start with
node_id = 1
The 'english language' version of this SQL I now read as...
In NODE, starting with the row in
which node_id = 1, the next row
selected has its parent_node_id
equal to node_id from the previous
(prior) row.
EDIT: Quassnoi makes a great point - the order you write the SQL makes things a lot easier.
select
*
from
node
start with
node_id = 1
connect by
parent_node_id = prior node_id
This feels a lot clearer to me - the "start with" gives the first row selected and the "connect by" gives the next row(s) - in this case the children of node_id = 1.

I always try to put the expressions in JOIN's in the following order:
joined.column = leading.column
This query:
SELECT t.value, d.name
FROM transactions t
JOIN
dimensions d
ON d.id = t.dimension
can be treated either like "for each transaction, find the corresponding dimension name", or "for each dimension, find all corresponding transaction values".
So, if I search for a given transaction, I put the expressions in the following order:
SELECT t.value, d.name
FROM transactions t
JOIN
dimensions d
ON d.id = t.dimension
WHERE t.id = :myid
, and if I search for a dimension, then:
SELECT t.value, d.name
FROM dimensions d
JOIN
transactions t
ON t.dimension = d.id
WHERE d.id = :otherid
Ther former query will most probably use index scans first on (t.id), then on (d.id), while the latter one will use index scans first on (d.id), then on (t.dimension), and you can easily see it in the query itself: the searched fields are at left.
The driving and driven tables may be not so obvious in a JOIN, but it's as clear as a bell for a CONNECT BY query: the PRIOR row is driving, the non-PRIOR is driven.
That's why this query:
SELECT *
FROM hierarchy
START WITH
id = :root
CONNECT BY
parent = PRIOR id
means "find all rows whose parent is a given id". This query builds a hierarchy.
This can be treated like this:
connect_by(row) {
add_to_rowset(row);
/* parent = PRIOR id */
/* PRIOR id is an rvalue */
index_on_parent.searchKey = row->id;
foreach child_row in index_on_parent.search {
connect_by(child_row);
}
}
And this query:
SELECT *
FROM hierarchy
START WITH
id = :leaf
CONNECT BY
id = PRIOR parent
means "find the rows whose id is a given parent". This query builds an ancestry chain.
Always put PRIOR in the right part of the expression.
Think of PRIOR column as of a constant all your rows will be searched for.

Think about the order in which the records are going to be selected: the link-back column on each record must match the link-forward column on the PRIOR record selected.

Part of the reason that this is difficult to visualise each time, is because sometimes the data is modelled as id + child_id, and sometimes it is modelled as id + parent_id. Depending on which way round your data is modelled, you have to place your PRIOR keyword on the the opposite side.
The trick I find, is always to look at the world through the eyes of the leaf-node of the tree you're trying to build.
So (speaking as the leaf node) when I look up, I see my parent node, which I may also refer to as the PRIOR node. Ergo, my ID is the same as his CHILD_ID. So in this case PRIOR child_id = id.
When the data is modelled the other way around (i.e. when I hold my parent's ID rather than him holding mine), my parent_id is the same value as his id. Ergo in this scenario PRIOR id = parent_id.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas