Neo4j Cypher query performance - optimization

I have the following Cypher queries and their execution plans, respectively.
Before optimization:
match (o:Order {statusId:74}) <- [:HAS_ORDERS] - (m:Member)
with m,o
match (m:Member) - [:HAS_WALLET] -> (w:Wallet) where w.currentBalance < 250
return m as Members,collect(o) as Orders,w as Wallets order by m.createdAt desc limit 10
After optimization (db hits reduced by 40-50%):
match (m:Member) - [:HAS_ORDERS]->(o:Order {statusId:74})
with m, collect(o) as Orders
match (m) - [:HAS_WALLET] - (w:Wallet) where w.currentBalance < 250
return m as Members, Orders, w as Wallets
order by m.createdAt desc limit 10
There are three types of nodes: Member, Order, and Wallet. The relationships between them are:
Member - [:HAS_ORDERS] -> Order,
Member - [:HAS_WALLET] -> Wallet
I have around 100k Member nodes (and 100k Wallet nodes) and almost 570k Order nodes for those members.
I want to fetch all the members who have an order with status 74 and a wallet balance less than 250. The above query gives the desired result, but it takes an average of 1.5 seconds to respond.
I suspect there is still scope for optimization here, but I'm not able to figure it out. I've added indexes on the fields I'm filtering on.
I've just started exploring Neo4j and am not sure how I can optimize this.

We can leverage index-backed ordering to try a different approach here. By providing a type hint (something to indicate the property value is a string) along with ordering by the indexed property, we can have the planner use the index to walk :Member nodes in the order you want (by m.createdAt DESC) for free, meaning we don't need to touch every :Member node and sort them. We can then check each node in that order until we find the 10 that meet the desired criteria.
From some back-and-forth on the Neo4j users slack, you mentioned that of your 100k :Member nodes, about 52k of them fit the criteria you're looking for, so this is a good indicator that we may not have to look very far down the ordered :Member nodes before finding the 10 that meet the criteria.
Here's the query:
MATCH (m:Member)
WHERE m.createdAt > '' // type hint
WITH m
ORDER BY m.createdAt DESC
MATCH (m)-[:HAS_WALLET]->(w)
WHERE w.currentBalance < 250 AND EXISTS {
MATCH (m)-[:HAS_ORDERS]->(:Order {statusId:74})
}
WITH m, w
LIMIT 10
RETURN m as member, w as wallet, [(m)-[:HAS_ORDERS]->(o:Order {statusId:74}) | o] as orders
Note that by using an existential subquery, we just have to find one order that satisfies the condition. We wait until after the limit of 10 members is reached before using a pattern comprehension to grab all the orders for the 10 members.

Have you tried subqueries? You can use a subquery to shrink down the number of nodes before passing them along to the subsequent parts of the query. (It would seem that an omniscient query planner could do this, but Cypher isn't there yet.) You may have to experiment with which subquery filters out the most nodes; a rough sketch is shown after the links below.
An example of using a subquery is here:
https://community.neo4j.com/t/slow-query-with-very-limited-data-and-boolean-false/31555
Another one is here:
https://community.neo4j.com/t/why-is-this-geospatial-search-so-slow/31952/24
(Of course, I assume you already have the appropriate properties indexed.)
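For illustration, here's roughly what that approach could look like against the first question's schema, using a CALL subquery to narrow things down to members with a low wallet balance before touching orders. This is only a sketch; whether this particular subquery filters out the most nodes depends on your data.
MATCH (m:Member)-[:HAS_WALLET]->(w:Wallet)
WHERE w.currentBalance < 250
CALL {
  WITH m
  MATCH (m)-[:HAS_ORDERS]->(o:Order {statusId:74})
  RETURN collect(o) AS Orders
}
WITH m, w, Orders
WHERE size(Orders) > 0
RETURN m AS Members, Orders, w AS Wallets
ORDER BY m.createdAt DESC
LIMIT 10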

Related

Cypher recommendation query taking too long

I have a Neo4j query that has to return up to the 20 companies with the largest number of investments made by co-investors of the given investor.
I have two types of nodes, Object (which represents both investors and companies) and FundingRound. They are indexed by objects.id and funding_round.id.
This is the query:
MATCH
(me:Object {id: $investorId})-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany:Object)
MATCH
(coinvestor:Object)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany)
MATCH
(coinvestor)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(othercompany:Object)
WITH me, othercompany, COUNT(distinct coinvestor) AS matches_count
WHERE NOT (me)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(othercompany)
RETURN othercompany.id AS id, othercompany.name AS name, matches_count
ORDER BY matches_count DESC, othercompany.id ASC
LIMIT 20
The query sometimes takes up to 7 seconds to run for investors with a lot of investments. So I'm wondering, is there something that is not optimized correctly?
Profiling it in the Neo4j app shows 14601993 total db hits, but the steps make total sense. I hoped for better performance after reading https://neo4j.com/news/how-much-faster-is-a-graph-database-really/
I would try the following:
MATCH
(me:Object {id: $investorId})-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany:Object),
(coinvestor:Object)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany)
WITH collect(distinct coinvestor) AS coinvestors, collect(distinct mycompany) AS mycompanies
UNWIND coinvestors AS coinvestor
MATCH
(coinvestor)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(othercompany:Object)
WHERE NOT othercompany IN mycompanies
WITH othercompany, COUNT(distinct coinvestor) AS matches_count
ORDER BY matches_count DESC, othercompany.id ASC
LIMIT 20
RETURN othercompany.id AS id, othercompany.name AS name, matches_count
It should be a bit of an improvement, as we avoid a couple of redundant operations. However, it might still take some time if there are a lot of coinvestors and other companies, since Neo4j is known to have some issues with ordering large numbers of rows.
If your graph model allows it, I would also remove node labels from the query. If, for example, the INVESTED_IN relationship can only point from an Object to a FundingRound, we don't have to check the node label for it; a sketch of the label-free version follows.
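For example, here is the same query with the labels dropped where the relationship types already imply them (only :Object on the anchor node is kept, since it drives the index lookup). This is a sketch and is only safe if your model guarantees those endpoint types:
MATCH
(me:Object {id: $investorId})-[:INVESTED_IN]->()-[:BELONGS_TO]->(mycompany),
(coinvestor)-[:INVESTED_IN]->()-[:BELONGS_TO]->(mycompany)
WITH collect(distinct coinvestor) AS coinvestors, collect(distinct mycompany) AS mycompanies
UNWIND coinvestors AS coinvestor
MATCH
(coinvestor)-[:INVESTED_IN]->()-[:BELONGS_TO]->(othercompany)
WHERE NOT othercompany IN mycompanies
WITH othercompany, COUNT(distinct coinvestor) AS matches_count
ORDER BY matches_count DESC, othercompany.id ASC
LIMIT 20
RETURN othercompany.id AS id, othercompany.name AS name, matches_count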

Can I rewrite this Cypher query to be compatible with Redis Graph?

My use case is that I have some agents in an organisation structure. For a given agent (which can be me), I want to see the sum (amount of money) of all contracts that that agent's subordinates (and the subordinates of their subordinates, and so on) created with clients, grouped by contract category.
The problem is that Redis Graph does not currently support the all() predicate. But I need to filter the relationships between agents, because we have multiple "modules" with different organisation structures and I need a report from just one module at a time.
My current Cypher query is:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WHERE all(rel IN relationships(path) WHERE
rel.module_id = 1
AND rel.valid_from < '2020-05-29'
AND '2020-05-29' < rel.valid_to)
WITH b as mediators
MATCH (mediators)-[:mediated]->(c:contract)
RETURN
c.category as category,
count(c) as contract_count,
sum(c.sum) as sum
ORDER BY sum DESC, category
This query works in Neo4j.
I don't even know if this query is correctly written for the type of result that I want.
My boss would really like to use Redis Graph instead of Neo4j for performance reasons, but I can't find any way to rewrite this query so it works in Redis Graph. Is it even possible?
Edit 1: I was told that we will be using the graph just for currently valid data and just for one module, so I no longer need a working all() predicate, but I am still interested in an answer.
The ALL function isn't supported at the moment; we do intend to add it in the near future. An awkward way of achieving the same effect as the ALL function is a combination of UNWIND and count:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WITH b AS b, relationships(path) AS edges, size(relationships(path)) AS edge_count
UNWIND edges AS r
WITH b AS b, edge_count AS edge_count, r AS r
WHERE r.module_id = 1 AND r.valid_from < '2020-05-29' AND '2020-05-29' < r.valid_to
WITH b AS b, edge_count AS edge_count, count(r) AS filter_edge_count
WHERE edge_count = filter_edge_count
....
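Based on the original query in the question, the elided part would presumably continue with the contract aggregation, along these lines (this continuation is only a sketch reconstructed from the question's own Cypher, not part of the answer above):
MATCH (b)-[:mediated]->(c:contract)
RETURN
c.category as category,
count(c) as contract_count,
sum(c.sum) as sum
ORDER BY sum DESC, category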

Vague count in sql select statements

I guess this has been asked on the site before, but I can't find it.
I've seen on some sites that there is a vague count over the results of a search. For example, here on Stack Overflow, when you search for a question, it sometimes says 5000+ results; in Gmail, when you search by keywords, it says "hundreds"; and Google says approximately X results. Is this just a way to show the user an easy-to-understand version of a huge number, or is it actually a fast way of counting results that can be used in a database (I'm learning Oracle 10g at the moment)? Something like: "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either Gmail or Stack Overflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always apply an absolute upper limit, such as 5000, to your query using a ROWNUM <= 5000 filter and then just indicate that there are 5000+ results; a sketch of that appears after the link below. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
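Here's that capped-count idea sketched out (your_tables and your_condition are placeholders, as in the answer further down):
SELECT COUNT(*) AS capped_count
FROM (SELECT NULL
      FROM your_tables
      WHERE your_condition
      AND ROWNUM <= 5001)
-- if capped_count = 5001, display "5000+ results"; otherwise show the exact count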
A vague count is a buffer that can be displayed promptly. If the user wants to see more results, they can request more.
It's a performance facility: after displaying the first results, sites like Google keep searching for more results in the background.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the count of rows in the result equals 1001, then the total count of records is greater than 1000.
This question gives some pretty good information.
When you do an SQL query you can set a
LIMIT 0, 100
for example, and you will only get the first hundred results. So you can then tell your viewer that there are 100+ answers to their request.
For Google, I couldn't say whether they really know there are more than 27'000'000'000 answers to a request, but I believe they really do know. There are some standard requests whose results are stored and updated in the background.

What is an unbounded query?

Is an unbounded query a query without a WHERE param = value statement?
Apologies for the simplicity of this one.
An unbounded query is one where the search criteria is not particularly specific, and is thus likely to return a very large result set. A query without a WHERE clause would certainly fall into this category, but let's consider for a moment some other possibilities. Let's say we have tables as follows:
CREATE TABLE SALES_DATA
(ID_SALES_DATA NUMBER PRIMARY KEY,
TRANSACTION_DATE DATE NOT NULL,
LOCATION NUMBER NOT NULL,
TOTAL_SALE_AMOUNT NUMBER NOT NULL,
...etc...);
CREATE TABLE LOCATION
(LOCATION NUMBER PRIMARY KEY,
DISTRICT NUMBER NOT NULL,
...etc...);
Suppose that we want to pull in a specific transaction, and we know the ID of the sale:
SELECT * FROM SALES_DATA WHERE ID_SALES_DATA = <whatever>
In this case the query is bounded, and we can guarantee it's going to pull in either one or zero rows.
Another example of a bounded query, but with a large result set would be the one produced when the director of district 23 says "I want to see the total sales for each store in my district for every day last year", which would be something like
SELECT S.LOCATION, TRUNC(S.TRANSACTION_DATE), SUM(S.TOTAL_SALE_AMOUNT)
FROM SALES_DATA S,
LOCATION L
WHERE S.TRANSACTION_DATE BETWEEN '01-JAN-2009' AND '31-DEC-2009' AND
L.LOCATION = S.LOCATION AND
L.DISTRICT = 23
GROUP BY S.LOCATION,
TRUNC(S.TRANSACTION_DATE)
ORDER BY S.LOCATION,
TRUNC(S.TRANSACTION_DATE)
In this case the query should return 365 (or fewer, if stores are not open every day) rows for each store in district 23. If there are 25 stores in the district, it'll return 9125 rows or fewer.
On the other hand, let's say our VP of Sales wants some data. He/she/it isn't quite certain what's wanted, but he/she/it is pretty sure that whatever it is happened in the first six months of the year...not quite sure about which year...and not sure about the location, either - probably in district 23 (he/she/it has had a running feud with the individual who runs district 23 for the past 6 years, ever since that golf tournament where...well, never mind...but if a problem can be hung on the door of district 23's director so be it!)...and of course he/she/it wants all the details, and have it on his/her/its desk toot sweet! And thus we get a query that looks something like
SELECT L.DISTRICT, S.LOCATION, S.TRANSACTION_DATE,
S.something, S.something_else, S.some_more_stuff
FROM SALES_DATA S,
LOCATION L
WHERE EXTRACT(MONTH FROM S.TRANSACTION_DATE) <= 6 AND
L.LOCATION = S.LOCATION
ORDER BY L.DISTRICT,
S.LOCATION
This is an example of an unbounded query. How many rows will it return? Good question - that depends on how business conditions were, how many locations were open, how many days there were in February, etc.
Put more simply, if you can look at a query and have a pretty good idea of how many rows it's going to return (even though that number might be relatively large) the query is bounded. If you can't, it's unbounded.
Share and enjoy.
http://hibernatingrhinos.com/Products/EFProf/learn#UnboundedResultSet
An unbounded result set is where a query is performed and does not explicitly limit the number of returned results from a query. Usually, this means that the application assumes that a query will always return only a few records. That works well in development and in testing, but it is a time bomb waiting to explode in production.
The query may suddenly start returning thousands upon thousands of rows, and in some cases, it may return millions of rows. This leads to more load on the database server, the application server, and the network. In many cases, it can grind the entire system to a halt, usually ending with the application servers crashing with out of memory errors.
Here is one example of a query that will trigger the unbounded result set warning:
var query = from post in blogDataContext.Posts
where post.Category == "Performance"
select post;
If the performance category has many posts, we are going to load all of them, which is probably not what was intended. This can be fixed fairly easily by adding pagination with the Take() method:
var query = (from post in blogDataContext.Posts
where post.Category == "Performance"
select post)
.Take(15);
Now we are assured that we only need to handle a predictable, small result set, and if we need to work with all of them, we can page through the records as needed. Paging is implemented using the Skip() method, which instructs Entity Framework to skip (at the database level) N records before taking the next page.
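For example, here is a paging sketch along those lines; pageNumber, pageSize, and the Title property used for ordering are assumptions, not part of the original snippet (LINQ to Entities requires a sort before Skip):
var pageSize = 15;
var pageNumber = 2; // zero-based page index

var pagedQuery = (from post in blogDataContext.Posts
                  where post.Category == "Performance"
                  orderby post.Title // Skip requires a sorted query in LINQ to Entities
                  select post)
                 .Skip(pageNumber * pageSize) // rows skipped at the database level
                 .Take(pageSize);             // then take a single page of results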
But there is another common occurrence of the unbounded result set problem from directly traversing the object graph, as in the following example:
var post = postRepository.Get(id);
foreach (var comment in post.Comments)
{
// do something interesting with the comment
}
Here, again, we are loading the entire set without regard for how big the result set may be. Entity Framework does not provide a good way of paging through a collection when traversing the object graph. It is recommended that you issue a separate, explicit query for the contents of the collection, which will allow you to page through that collection without loading too much data into memory.
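A hedged sketch of that recommendation, assuming the context exposes a Comments set with PostId and CreatedAt properties (none of which appear in the original):
var commentsPage = (from comment in blogDataContext.Comments
                    where comment.PostId == id
                    orderby comment.CreatedAt
                    select comment)
                   .Skip(0)    // offset of the page to load
                   .Take(25);  // page size

foreach (var comment in commentsPage)
{
    // do something interesting with the comment
}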

How to set max return result for many-to-one

I have a Product Class which has a one to many relationship to a Price class.
So a product can have multiple prices.
I need to query the db to get 10 products which have Price.amount < $2. In this case it's to populate a UI page with 10 items.
So I wrote the following code:
ICriteria criteria = session.CreateCriteria(typeof(Product));
criteria.SetFirstResult(pageNumber);
criteria.SetMaxResults(numberOfItemInPage);
criteria = criteria.CreateCriteria("PriceCollection");
criteria.Add(Restrictions.Le("Amount", new Decimal(2)));
criteria.SetResultTransformer(CriteriaSpecification.DistinctRootEntity);
Instead of getting 10 Products in the list, I'm getting fewer than that (e.g. 5).
The reason is that SetMaxResults(10) returns 10 rows, but with duplicate Products. The duplicates are then removed by SetResultTransformer(DistinctRootEntity).
Can anyone tell me a way to get 10 unique Products without increasing SetMaxResults()? I need to use the page number as some sort of index.
That would be up to the SQL to decide; depending on what happens in the methods that get the list, you need to change the SQL so that it behaves as you like.
But with Distinct applied, you shouldn't get any duplicates.
It seems your duplicates problem stems from the fact that you are joining two tables and so you can get the same product as many times as you have prices for it.
How about adding 2 extra columns to your product table:
MinimumPrice (numeric(18,2))
MaximumPrice (numeric(18,2))
Whenever your system amends pricing for a product you update these two fields on the product. Now you can write a SQL query like the following:
SELECT TOP 10 * FROM Product
WHERE MinimumPrice < 2.0
And you will not have duplicate products.
Would the order of the statements make a difference? It looks like it's setting the maximum count early and weeding out duplicates at the end, which, applied in that order, could end up with fewer results than the limit, consistent with what you described happening.
I would think you would need to effectively get all of the results, then apply the restriction (and possibly a sort?) and weed out the duplicates, and only then apply your paging or count limit to get the first 10, next 10, and so on. Reordering the operations to reflect this logical order might help fix your bug; one way to express that with NHibernate is sketched below.
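One hedged way to express that with NHibernate is to push the price restriction into a detached subquery, so that paging applies to distinct Products only. This sketch assumes Price has a Product many-to-one reference and Product has an Id identifier property, neither of which is shown in the question:
// Subquery: ids of products that have at least one price <= 2
DetachedCriteria cheapProductIds = DetachedCriteria.For<Price>()
    .Add(Restrictions.Le("Amount", new Decimal(2)))
    .SetProjection(Projections.Property("Product.Id")); // assumed Price.Product reference

// Root query: page over unique Products, so SetMaxResults(10) really means 10 products
IList<Product> products = session.CreateCriteria<Product>()
    .Add(Subqueries.PropertyIn("Id", cheapProductIds))
    .SetFirstResult(pageNumber * numberOfItemInPage) // first row of the requested page
    .SetMaxResults(numberOfItemInPage)
    .List<Product>();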