Cypher recommendation query taking too long - cypher

I have a Neo4j query that should return up to 20 companies with the most investments made by co-investors of a given investor.
I have two types of nodes: Object (which represents both investors and companies) and FundingRound. There are indexes on objects.id and funding_round.id.
This is the query:
MATCH
(me:Object {id: $investorId})-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany:Object)
MATCH
(coinvestor:Object)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany)
MATCH
(coinvestor)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(othercompany:Object)
WITH me, othercompany, COUNT(distinct coinvestor) AS matches_count
WHERE NOT (me)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(othercompany)
RETURN othercompany.id AS id, othercompany.name AS name, matches_count
ORDER BY matches_count DESC, othercompany.id ASC
LIMIT 20
The query sometimes takes up to 7 seconds to run for investors with a lot of investments. So I'm wondering, is there something that is not optimized correctly?
Profiling it in the Neo4j app shows 14601993 total db hits, but the steps make total sense. I had hoped for better performance after reading https://neo4j.com/news/how-much-faster-is-a-graph-database-really/

I would try the following:
MATCH
(me:Object {id: $investorId})-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany:Object),
(coinvestor:Object)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(mycompany)
WITH collect(distinct coinvestor) AS coinvestors, collect(distinct mycompany) AS mycompanies
UNWIND coinvestors AS coinvestor
MATCH
(coinvestor)-[:INVESTED_IN]->(:FundingRound)-[:BELONGS_TO]->(othercompany:Object)
WHERE NOT othercompany IN mycompanies
WITH othercompany, COUNT(distinct coinvestor) AS matches_count
ORDER BY matches_count DESC, othercompany.id ASC
LIMIT 20
RETURN othercompany.id AS id, othercompany.name AS name, matches_count
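It should be a bit of an improvement, since it avoids a couple of redundant operations: the co-investors are deduplicated before the second expansion, and the check against my own companies becomes a list membership test instead of another pattern match. However, it might still take some time if there are a lot of coinvestors and other companies, since Neo4j is known to have some issues with ordering a large number of rows.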
If your graph model allows, I would also remove node labels in the query. If, for example, the INVESTED_IN relation can only point from an object to a funding round, we don't have to check the node label for it.
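To make the shape of that rewrite easier to follow, here is a minimal Python sketch of the same three steps (collect distinct co-investors once, count them per unseen company, then sort and cut off). The `investments` dictionary and the `recommend` function are hypothetical toy stand-ins, with funding rounds collapsed away. It only illustrates the logic and says nothing about how Neo4j executes the query.

```python
from collections import defaultdict

# Hypothetical toy data: investor -> set of companies they funded
# (funding rounds are collapsed away for brevity).
investments = {
    "me": {"A", "B"},
    "x": {"A", "C", "D"},
    "y": {"B", "C"},
    "z": {"E"},  # shares no company with "me", so not a co-investor
}


def recommend(investor, investments, limit=20):
    my_companies = investments[investor]
    # Step 1: collect the distinct co-investors once.
    coinvestors = {
        other
        for other, companies in investments.items()
        if other != investor and companies & my_companies
    }
    # Step 2: count distinct co-investors per company not already held.
    counts = defaultdict(int)
    for coinvestor in coinvestors:
        for company in investments[coinvestor] - my_companies:
            counts[company] += 1
    # Step 3: order by count descending, then id ascending, and cut off.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


result = recommend("me", investments)
print(result)
```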

Related

Neo4j cypher query performance

I have the following cypher queries and their execution plans respectively,
Before optimization,
match (o:Order {statusId:74}) <- [:HAS_ORDERS] - (m:Member)
with m,o
match (m:Member) - [:HAS_WALLET] -> (w:Wallet) where w.currentBalance < 250
return m as Members,collect(o) as Orders,w as Wallets order by m.createdAt desc limit 10
After optimization (db hits reduced by 40-50%),
match (m:Member) - [:HAS_ORDERS]->(o:Order {statusId:74})
with m, collect(o) as Orders
match (m) - [:HAS_WALLET] - (w:Wallet) where w.currentBalance < 250
return m as Members, Orders, w as Wallets
order by m.createdAt desc limit 10
There are 3 types of nodes, Member, Order and Wallet. And the relation between them goes like this,
Member - [:HAS_ORDERS] -> Order,
Member - [:HAS_WALLET] -> Wallet
I have around 100k Member nodes (100k wallet) and almost 570k orders for those members.
I want to fetch all the members who have order status 74 and wallet balance less than 250, and the above query gives the desired result but it takes an average 1.5 sec to respond.
I suspect there is still scope for optimization here, but I'm not able to figure it out. I've added indexes on the fields I'm filtering on.
I've just started exploring neo4j and not sure how can I optimize this.
We can leverage index-backed ordering to try a different approach here. By providing a type hint (something to indicate the property value is a string) along with the ordering by the indexed property, we can have the planner use the index to check :Member nodes in the order you want (by m.createdAt DESC) for free (meaning we don't need to check every :Member node and order them), and check each of those in the given order to find the ones that meet the desired criteria until we get the 10 you need.
From some back-and-forth on the Neo4j users slack, you mentioned that of your 100k :Member nodes, about 52k of them fit the criteria you're looking for, so this is a good indicator that we may not have to look very far down the ordered :Member nodes before finding the 10 that meet the criteria.
Here's the query:
MATCH (m:Member)
WHERE m.createdAt > '' // type hint
WITH m
ORDER BY m.createdAt DESC
MATCH (m)-[:HAS_WALLET]->(w)
WHERE w.currentBalance < 250 AND EXISTS {
MATCH (m)-[:HAS_ORDERS]->(:Order {statusId:74})
}
WITH m, w
LIMIT 10
RETURN m as member, w as wallet, [(m)-[:HAS_ORDERS]->(o:Order {statusId:74}) | o] as orders
Note that by using an existential subquery, we just have to find one order that satisfies the condition. We wait until after the limit of 10 members is reached before using a pattern comprehension to grab all the orders for the 10 members.
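The reason this can be cheap is the early stop: rows arrive already ordered, so the work ends as soon as enough of them pass the filter. Below is a rough Python sketch of that idea only. The `members` list, its field names, and the `stats` counter are made up for illustration, and a pre-sorted list stands in for the index-backed ordering.

```python
from itertools import islice

# Hypothetical stand-ins for :Member nodes, already sorted by createdAt
# descending, as an index-backed ORDER BY would deliver them.
members = [
    {
        "id": i,
        "createdAt": f"2021-01-{31 - i:02d}",
        "balance": (i * 70) % 400,
        "order_statuses": [74] if i % 2 == 0 else [10],
    }
    for i in range(30)
]

stats = {"checked": 0}  # how many members we had to look at


def matching(ordered_members):
    # Lazily yield members that satisfy both conditions, in the given order.
    for m in ordered_members:
        stats["checked"] += 1
        if m["balance"] < 250 and 74 in m["order_statuses"]:
            yield m


# Equivalent of LIMIT 3: stop pulling rows once three matches are found.
first_three = list(islice(matching(members), 3))
print([m["id"] for m in first_three], stats["checked"])
```

Only 7 of the 30 toy members are inspected here. The more members satisfy the criteria, the sooner the scan stops.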
Have you tried subqueries? You may be able to use a subquery to shrink down the number of nodes before passing them along to the rest of the query. (It would seem that an omniscient query planner could do this, but Cypher isn't there yet.) You may have to experiment to find which subquery filters out the most nodes.
An example of using a subquery is here:
https://community.neo4j.com/t/slow-query-with-very-limited-data-and-boolean-false/31555
Another one is here:
https://community.neo4j.com/t/why-is-this-geospatial-search-so-slow/31952/24
(Of course, I assume you already have the appropriate properties indexed.)

Efficient way to select one from each category - Rails

I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
@exercises = BODYPARTS.map do |bp|
Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
@exercises = Exercise.all
BODYPARTS.map do |bp|
@exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale:
Queries: 0.072902 0.020728 0.093630 ( 0.088008)
Select: 0.000962 0.000225 0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072 0.000008 0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do (thanks to PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
.order('bodypart, random()')
Postgres' DISTINCT ON is very handy and performance is typically great, too - for many distinct bodyparts with few rows each. But for only few distinct values of bodypart with many rows each (big table - and your use case) there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM unnest(enum_range(null::bodypart)) b(bodypart)
CROSS JOIN LATERAL (
SELECT *
FROM exercises
WHERE bodypart = b.bodypart
-- ORDER BY ??? -- for a deterministic pick
LIMIT 1 -- arbitrary pick!
) e;
Assuming that bodypart is the name of the enum as well as the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row for each bodypart, a simple index on exercises(bodypart) does the job. But you can have a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause and almost the same performance.
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?
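For anyone who wants to try the "one lookup per category" idea without a PostgreSQL instance, here is a small self-contained sketch using Python's built-in sqlite3 module. Note that this is a swapped-in technique: SQLite has neither LATERAL nor enum_range, so a correlated scalar subquery over the distinct bodypart values plays that role. The table contents and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exercises (id INTEGER PRIMARY KEY, name TEXT, bodypart TEXT)"
)
# Index on the category column so each per-category lookup is cheap.
conn.execute("CREATE INDEX exercises_bodypart ON exercises (bodypart)")

rows = [
    ("squat", "legs"), ("lunge", "legs"),
    ("plank", "core"), ("crunch", "core"),
    ("bench press", "chest"),
]
conn.executemany("INSERT INTO exercises (name, bodypart) VALUES (?, ?)", rows)

# One random exercise per bodypart, all in a single statement.
picks = conn.execute(
    """
    SELECT b.bodypart,
           (SELECT e.name
              FROM exercises AS e
             WHERE e.bodypart = b.bodypart
             ORDER BY random()
             LIMIT 1) AS name
      FROM (SELECT DISTINCT bodypart FROM exercises) AS b
    """
).fetchall()
print(picks)
```

The result has one (bodypart, name) pair per category. Which name is picked changes from run to run because of random().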

Vague count in sql select statements

I guess this has been asked on the site before but I can't find it.
I've seen on some sites that there is a vague count over the results of a search. For example, here on Stack Overflow, when you search for a question, it (sometimes) says +5000 results; in Gmail, when you search by keywords, it says "hundreds"; and in Google it says approx. X results. Is this just a way to show the user a huge number in an easy-to-understand form, or is it actually a fast way to count results that can be used in a database (I'm learning Oracle at the moment, version 10g)? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always put an absolute upper limit, such as 5000, on your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000 results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
A vague count is a buffer which can be displayed promptly. If the user wants to see more results, they can request more.
It's a performance facility: after displaying the first results, sites like Google keep searching for more.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the result contains 1001 rows, then the total number of matching records is greater than 1000.
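Here is a small sketch of that "count up to a cap" trick that can be run without Oracle. It uses Python's sqlite3 module, where LIMIT stands in for the ROWNUM <= 1001 filter. The `results` table, its contents, and the `capped_count` helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, keyword TEXT)")
conn.executemany(
    "INSERT INTO results (keyword) VALUES (?)",
    [("oracle",)] * 2500 + [("rare",)] * 12,
)


def capped_count(conn, keyword, cap=1000):
    # Count at most cap + 1 matching rows; the inner LIMIT lets the
    # database stop scanning once it knows there are "more than cap".
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM results WHERE keyword = ? LIMIT ?)",
        (keyword, cap + 1),
    ).fetchone()
    return f"{cap}+" if n > cap else str(n)


print(capped_count(conn, "oracle"))
print(capped_count(conn, "rare"))
```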
this question gives some pretty good information
When you do an SQL query you can set a
LIMIT 0, 100
for example, and you will only get the first hundred answers. You can then tell the viewer that there are 100+ answers to their request.
For Google, I couldn't say whether they really know there are more than 27'000'000'000 answers to a request, but I believe they do. There are some standard requests whose results are stored and updated in the background.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
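To illustrate why a Top-N sort is so much cheaper than a full sort, here is a minimal Python sketch of the general technique: one pass over the rows while keeping a buffer of at most N entries. This is not Postgres's actual implementation, just the same idea expressed with heapq. The `top_n` function and the random test data are assumptions for the example.

```python
import heapq
import random


def top_n(rows, n):
    # Max-heap (via negation) holding the n smallest values seen so far.
    buffer = []
    for value in rows:
        if len(buffer) < n:
            heapq.heappush(buffer, -value)
        elif value < -buffer[0]:
            # value beats the largest kept entry, so swap it in.
            heapq.heapreplace(buffer, -value)
    # Only the n survivors ever need a real sort.
    return sorted(-v for v in buffer)


random.seed(0)
rows = [random.random() for _ in range(100_000)]
top10 = top_n(rows, 10)
print(top10[:3])
```

The buffer never grows beyond 10 entries, so memory stays constant and each row costs at most one small heap operation.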
OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we got the worst case CROSS JOIN condition ever :
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow !
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache :
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache :
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
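The filter-then-refine steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the `sites`, `lines` and `RADIUS` values are invented, each utility line is reduced to a single point, and the nested loop still visits every pair (a GiST index would avoid even that). The point is that the expensive exact test only runs for pairs that pass the cheap bounding box test.

```python
import math

# Hypothetical toy data: site centroids as (x, y) points, and utility
# lines reduced to one point each, buffered by RADIUS.
sites = [(x * 10.0, y * 10.0) for x in range(20) for y in range(20)]
lines = [(55.0, 55.0), (150.0, 20.0)]
RADIUS = 12.0


def bbox_contains(center, radius, point):
    # Cheap test: is the point inside the square around the buffer?
    return (abs(point[0] - center[0]) <= radius
            and abs(point[1] - center[1]) <= radius)


def within(center, radius, point):
    # Exact (more expensive) test, standing in for ST_Within.
    return math.hypot(point[0] - center[0], point[1] - center[1]) <= radius


exact_tests = 0
matches = []
for line in lines:
    for site in sites:
        if not bbox_contains(line, RADIUS, site):
            continue  # rejected by the cheap prefilter
        exact_tests += 1
        if within(line, RADIUS, site):
            matches.append((site, line))

print(len(matches), "matches from", exact_tests, "exact tests",
      "out of", len(sites) * len(lines), "pairs")
```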
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, ensuring your db is being vacuumed, and increasing the shared buffers and work_mem settings for your db install.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?

What is an unbounded query?

Is an unbounded query a query without a WHERE param = value statement?
Apologies for the simplicity of this one.
An unbounded query is one where the search criteria is not particularly specific, and is thus likely to return a very large result set. A query without a WHERE clause would certainly fall into this category, but let's consider for a moment some other possibilities. Let's say we have tables as follows:
CREATE TABLE SALES_DATA
(ID_SALES_DATA NUMBER PRIMARY KEY,
TRANSACTION_DATE DATE NOT NULL,
LOCATION NUMBER NOT NULL,
TOTAL_SALE_AMOUNT NUMBER NOT NULL,
...etc...);
CREATE TABLE LOCATION
(LOCATION NUMBER PRIMARY KEY,
DISTRICT NUMBER NOT NULL,
...etc...);
Suppose that we want to pull in a specific transaction, and we know the ID of the sale:
SELECT * FROM SALES_DATA WHERE ID_SALES_DATA = <whatever>
In this case the query is bounded, and we can guarantee it's going to pull in either one or zero rows.
Another example of a bounded query, but with a large result set would be the one produced when the director of district 23 says "I want to see the total sales for each store in my district for every day last year", which would be something like
SELECT S.LOCATION, TRUNC(S.TRANSACTION_DATE), SUM(S.TOTAL_SALE_AMOUNT)
FROM SALES_DATA S,
LOCATION L
WHERE S.TRANSACTION_DATE BETWEEN '01-JAN-2009' AND '31-DEC-2009' AND
L.LOCATION = S.LOCATION AND
L.DISTRICT = 23
GROUP BY S.LOCATION,
TRUNC(S.TRANSACTION_DATE)
ORDER BY S.LOCATION,
TRUNC(S.TRANSACTION_DATE)
In this case the query should return 365 rows (or fewer, if stores are not open every day) for each store in district 23. If there are 25 stores in the district, it will return 9125 rows or fewer.
On the other hand, let's say our VP of Sales wants some data. He/she/it isn't quite certain what's wanted, but he/she/it is pretty sure that whatever it is happened in the first six months of the year...not quite sure about which year...and not sure about the location, either - probably in district 23 (he/she/it has had a running feud with the individual who runs district 23 for the past 6 years, ever since that golf tournament where...well, never mind...but if a problem can be hung on the door of district 23's director so be it!)...and of course he/she/it wants all the details, and have it on his/her/its desk toot sweet! And thus we get a query that looks something like
SELECT L.DISTRICT, S.LOCATION, S.TRANSACTION_DATE,
S.something, S.something_else, S.some_more_stuff
FROM SALES_DATA S,
LOCATION L
WHERE EXTRACT(MONTH FROM S.TRANSACTION_DATE) <= 6 AND
L.LOCATION = S.LOCATION
ORDER BY L.DISTRICT,
S.LOCATION
This is an example of an unbounded query. How many rows will it return? Good question - that depends on how business conditions were, how many locations were open, how many days there were in February, etc.
Put more simply, if you can look at a query and have a pretty good idea of how many rows it's going to return (even though that number might be relatively large) the query is bounded. If you can't, it's unbounded.
Share and enjoy.
http://hibernatingrhinos.com/Products/EFProf/learn#UnboundedResultSet
An unbounded result set is where a query is performed and does not explicitly limit the number of returned results from a query. Usually, this means that the application assumes that a query will always return only a few records. That works well in development and in testing, but it is a time bomb waiting to explode in production.
The query may suddenly start returning thousands upon thousands of rows, and in some cases, it may return millions of rows. This leads to more load on the database server, the application server, and the network. In many cases, it can grind the entire system to a halt, usually ending with the application servers crashing with out of memory errors.
Here is one example of a query that will trigger the unbounded result set warning:
var query = from post in blogDataContext.Posts
where post.Category == "Performance"
select post;
If the performance category has many posts, we are going to load all of them, which is probably not what was intended. This can be fixed fairly easily by paginating with the Take() method:
var query = (from post in blogDataContext.Posts
where post.Category == "Performance"
select post)
.Take(15);
Now we are assured that we only need to handle a predictable, small result set, and if we need to work with all of them, we can page through the records as needed. Paging is implemented using the Skip() method, which instructs Entity Framework to skip (at the database level) N number of records before taking the next page.
But there is another common occurrence of the unbounded result set problem from directly traversing the object graph, as in the following example:
var post = postRepository.Get(id);
foreach (var comment in post.Comments)
{
// do something interesting with the comment
}
Here, again, we are loading the entire set without regard for how big the result set may be. Entity Framework does not provide a good way of paging through a collection when traversing the object graph. It is recommended that you would issue a separate and explicit query for the contents of the collection, which will allow you to page through that collection without loading too much data into memory.
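As a rough illustration of "issue a separate and explicit query and page through it", here is a sketch in Python with the built-in sqlite3 module rather than Entity Framework. LIMIT and OFFSET play the roles of Take() and Skip(). The `comments` table and the `comment_pages` helper are hypothetical names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments (post_id, body) VALUES (?, ?)",
    [(1, f"comment {i}") for i in range(40)],
)


def comment_pages(conn, post_id, page_size=15):
    # Yield one bounded page at a time instead of loading every comment.
    offset = 0
    while True:
        page = conn.execute(
            "SELECT id, body FROM comments WHERE post_id = ? "
            "ORDER BY id LIMIT ? OFFSET ?",
            (post_id, page_size, offset),
        ).fetchall()
        if not page:
            return
        yield page
        offset += page_size


page_sizes = [len(page) for page in comment_pages(conn, 1)]
print(page_sizes)
```

Each round trip returns at most page_size rows, so memory use stays predictable however many comments a post accumulates.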