I have a decently sized graph (~600 million nodes, 3.5 billion edges) that I imported into neo4j. The graph is also quite dense (median edge count around 10), though I'm not sure whether that affects performance.
For one type of node, :AUTHOR (there are roughly 200 million nodes with this label), I would like to run a query for a specific name, which is stored in the property normalizedName. Here is the (very simple) query:
MATCH (a:AUTHOR)
WHERE a.normalizedName = "jonathan smith"
RETURN a
As one might expect, this query takes a LONG time (several minutes) to execute. Although I have no explicit guarantee of uniqueness on this property, I still tried to create an index on it, and I got no complaints from neo4j. Afterwards, I would have expected the above query to execute in milliseconds, since an index lookup should be O(1) (or at worst O(log n)). Unfortunately, the query still takes several minutes.
What am I doing wrong?
Ensure that you have set the index as
CREATE INDEX ON :AUTHOR(normalizedName)
Be aware that you need to create an index for each property you want to look up via an index. Indexes are also scoped by node label: if you're using multiple labels on a node and need an index lookup through each of them, you need one index per label. For example, if your nodes were labeled :Person:Author, you'd also need to set:
CREATE INDEX ON :Person(normalizedName)
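Index population runs in the background, and on 200 million nodes it can take a while; the query stays slow until it finishes. Once it has, you can confirm the lookup actually hits the index by profiling the query. A minimal sketch (the operator names are indicative and vary between Neo4j versions):
PROFILE
MATCH (a:AUTHOR)
WHERE a.normalizedName = "jonathan smith"
RETURN a
// The plan should show an index seek (e.g. NodeIndexSeek) on :AUTHOR(normalizedName)
// rather than NodeByLabelScan followed by a property filter.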
I have a very simple object as keys in my cache and I want to be able to iterate on the key/value pairs where a string matches a field in my keys.
Here is how the field is declared in the class
@AffinityKeyMapped @QueryTextField String crawlQueueID;
I run many queries and expect a small number of documents to match. The queries take a relatively large amount of time, which is surprising given that there are maybe only 100K pairs locally in the cache. My queries are local: I only want to hit the K/V pairs stored on the local node.
According to the profiler I am using, 80% of the CPU is spent here
GridLuceneIndex.java:285 org.apache.lucene.search.IndexSearcher.search(Query, int)
Knowing Lucene's performance, I am really surprised. Any suggestions?
BTW I want to sort the results based on a numerical field in the value object. Can this be done via annotations?
I could have one cache per value of the field I am querying against but given that there are potentially hundreds of thousands or even millions of different values, that would probably be too many caches for Ignite to handle.
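For reference, the queries are issued roughly along the lines of the following sketch (the key/value classes, cache name, and query string are placeholders, not the real ones):
import java.util.List;
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.TextQuery;

class LocalTextQuerySketch {
    // CrawlKey/CrawlValue and the cache name are placeholders for the real classes.
    static List<Cache.Entry<CrawlKey, CrawlValue>> findLocal(Ignite ignite, String queueId) {
        IgniteCache<CrawlKey, CrawlValue> cache = ignite.cache("crawlCache");

        // Lucene-syntax text query on the @QueryTextField-annotated field,
        // restricted to entries stored on this node.
        TextQuery<CrawlKey, CrawlValue> qry =
                new TextQuery<>(CrawlValue.class, "crawlQueueID:" + queueId);
        qry.setLocal(true);

        // Sorting by a numeric field of the value object has to be done
        // afterwards in application code; TextQuery has no ORDER BY.
        return cache.query(qry).getAll();
    }
}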
EDIT
Looking at the code that handles the Lucene indexing and querying, the index gets reloaded for every query. Given that I do hundreds of them in a row, we probably don't benefit from any caching or optimisation of the index structure in Lucene.
Additionally, there is a range query running as a filter to check the TTL. FilterQueries are faster, but on a fresh IndexReader there would not be much caching either. Of course, if no TTL is needed for a given table, this should not be required.
Judging by the documentation on SQL indexing:
Ignite automatically creates indexes for each primary key and affinity key field.
the indexing is done on the key alone. In my case, the value I want to use for sorting is in the value object so that would not work.
I'm learning about indexing in PostgreSQL. I started by creating an index and analyzing how it affects execution time. I created some tables with a handful of columns and filled them with data. After that I created my custom index:
create index events_organizer_id_index on events(organizer_ID);
and executed this command (events table contains 148 rows):
explain analyse select * from events where events.organizer_ID = 4;
I was surprised that the search was executed without my index and I got this result:
As far as I know, if my index were used in the search, the plan would contain something like "Index Scan ... on events".
So can someone explain, or point me to references, on how to use indexes effectively and where I should use them to see a difference?
From "Rows removed by filter: 125" I can see that there are too few rows in the events table. Just add a couple of thousand rows and give it another go.
From the docs:
Use real data for experimentation. Using test data for setting up indexes will tell you what indexes you need for the test data, but that is all.
It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
In most cases, when the database uses an index it only gets the address where the row is located: the data block id and the offset within that block, because there may be many rows in one block of 4 or 8 KB.
So the database first searches the index for the block address, then looks for the block on disk, reads it, and extracts the row you need.
When there are too few rows, they fit into one or a couple of data blocks, which makes it easier and quicker for the DB to read the whole table without using the index at all.
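A quick way to try that, assuming organizer_ID is the only column you need to populate (adjust the INSERT if the table has other NOT NULL columns):
-- generate a few thousand rows with random organizer IDs
INSERT INTO events (organizer_ID)
SELECT (random() * 1000)::int
FROM generate_series(1, 100000);

ANALYZE events;  -- refresh the planner statistics

EXPLAIN ANALYSE
SELECT * FROM events WHERE events.organizer_ID = 4;
-- with enough rows the plan should now show
-- "Index Scan using events_organizer_id_index on events"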
Look at it the following way:
The database decides which way is faster to find your tuple (=record) with organizer_id 4. There are two ways:
a) Read the index and then skip to the block which contains the data.
b) Read the heap and find the record there.
The information in your screenshot shows 126 records (125 skipped plus your record) with a length ("width") of 62 bytes. Including overhead, these data fit into two database blocks of 8 KB. Since a rotating disk or SSD reads a series of blocks anyway (they always pull more blocks into the buffer), it's one read operation for these two blocks.
So the database decides that it is pointless to first read the index to find the correct record and then read the data from the heap (in our case two blocks) using the information from the index. That would be two read operations. Even with technology newer than rotating disks, this needs more time than just scanning the two blocks. That's why the database doesn't use the index.
Indexes on such small tables aren't good for searching. Nevertheless, unique indexes are still useful to avoid duplicate entries.
We have a scenario evaluating a LIKE search for a key (with LIMIT 100) which has a range index. The query uses the index, but the performance of the query varies based on the number of matches the key returns.
I.e. if the search is more specific, the query takes longer, and if the search is generic (and has more results) it returns faster (maybe because it finds the first 100 matches faster).
The results are ranging from 1ms to 5 minutes for a replicated-region with 400k records.
e.g query
select * from /REGION where field like '%SEARCH_STRING%'
Interestingly, the % at the beginning causes the issue. If we just remove it, the query returns in milliseconds. In either case, 'indexesUsed' reports the correct index.
It looks like we're missing something fundamental about the indexing, or there is some odd behavior in it.
Note: Gemfire version: 8.2.0, Spring-data: 1.8.5. This issue is reproducible with query directly in gfsh too. So not related to spring-data layer.
The prefix % is causing you to scan the entire index. When you remove the prefix %, you are actually using the index to jump to the starting point (of the range index) and scan from there.
Another thing I noticed in your query is that you are using "select *". Do you really need to serialize the entire record back? You will get better performance if you enable PDX and just select the fields that you need.
You reported that it's using the index even though it's effectively doing a scan. If you don't have PDX enabled and it's really not using the index, then deserializing all the records could be a problem (since you have select *), or the JVM could be constrained for memory due to the scan.
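For example, anchoring the search string at the start and projecting only the fields you need (field names here are just placeholders) should come back in milliseconds:
select field, someOtherField
from /REGION
where field like 'SEARCH_STRING%'
limit 100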
Wes Williams
P.S. An effective way to also get help directly from the devs is to post the question on the geode user list.
Take a look at this execution plan: http://sdrv.ms/1agLg7K
It’s not estimated, it’s actual. From an actual execution that took roughly 30 minutes.
Select the second statement (takes 47.8% of the total execution time – roughly 15 minutes).
Look at the top operation in that statement – View Clustered Index Seek over _Security_Tuple4.
The operation costs 51.2% of the statement – roughly 7 minutes.
The view contains about 0.5M rows (for reference, log2(0.5M) ~= 19, a mere 19 steps even if the index tree's fan-out were only two; in reality the fan-out is much higher, so there are even fewer steps).
The result of that operator is zero rows (doesn’t match the estimate, but never mind that for now).
Actual executions – zero.
So the question is: how the bleep could that take seven minutes?! (and of course, how do I fix it?)
EDIT: Some clarification on what I'm asking here.
I am not interested in general performance-related advice, such as "look at indexes", "look at sizes", "parameter sniffing", "different execution plans for different data", etc.
I know all that already, I can do all that kind of analysis myself.
What I really need is to know what could cause that one particular clustered index seek to be so slow, and then what could I do to speed it up.
Not the whole query.
Not any part of the query.
Just that one particular index seek.
END EDIT
Also note how the second and third most expensive operations are seeks over _Security_Tuple3 and _Security_Tuple2 respectively, and they only take 7.5% and 3.7% of the time. Meanwhile, _Security_Tuple3 contains roughly 2.8M rows, which is six times as many as _Security_Tuple4.
Also, some background:
This is the only database from this project that misbehaves.
There are a couple dozen other databases of the same schema, none of them exhibit this problem.
The first time this problem was discovered, it turned out that the indexes were 99% fragmented.
Rebuilding the indexes did speed it up, but not significantly: the whole query took 45 minutes before rebuild and 30 minutes after.
While playing with the database, I have noticed that simple queries like “select count(*) from _Security_Tuple4” take several minutes. WTF?!
However, they only took several minutes on the first run, and after that they were instant.
The problem is not connected to the particular server, neither to the particular SQL Server instance: if I back up the database and then restore it on another computer, the behavior remains the same.
First I'd like to point out a little misconception here: although the delete statement is said to take nearly 48% of the entire execution, this does not have to mean it takes 48% of the time; these percentages are based on estimated costs, so the 51% assigned inside that part of the query plan most definitely should NOT be interpreted as taking 'half of the time' of the entire operation!
Anyway, going by your remark that it takes a couple of minutes to do a COUNT(*) of the table 'the first time', I'm inclined to say that you have an IO issue related to said table/view. Personally I don't like materialized views very much, so I have no real experience with how they behave internally, but normally I would suggest that fragmentation is taking its toll on the underlying storage system. The reason it works fast the second time is that it's much faster to access the pages from the cache than it was to fetch them from disk, especially when they are scattered all over the place. (Are there any (max) fields in the view?)
Anyway, to find out what is taking so long, I'd suggest you take this code out of the trigger it's currently in, 'fake' the inserted and deleted tables, and then try running the queries again, adding timestamps and/or using a program like SQL Sentry Plan Explorer to see how long each part REALLY takes (it has a duration column when you run a script from within the program).
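A rough sketch of what I mean (the base table name is a placeholder; paste the real trigger statements at the marked spot):
-- create empty copies of the trigger's pseudo-tables (structure only)
SELECT * INTO #inserted FROM dbo.YourBaseTable WHERE 1 = 0;
SELECT * INTO #deleted  FROM dbo.YourBaseTable WHERE 1 = 0;

-- fill #inserted / #deleted with a representative sample of rows here

DECLARE @t0 datetime2 = SYSDATETIME();

-- paste ONE statement from the trigger body here, with references to
-- inserted/deleted replaced by #inserted/#deleted

SELECT DATEDIFF(MILLISECOND, @t0, SYSDATETIME()) AS elapsed_ms;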
It might well be that you're looking at the wrong part; experience shows that cost and actual execution times are not always as related as we'd like to think.
Observations include:
Is this the biggest of these databases that you are working with? If so, size matters to the optimizer. It will make quite a different plan for large datasets versus smaller data sets.
The estimated rows and the actual rows are quite divergent. This is most apparent in the fourth query, "delete c from #alternativeRoutes....", where _Security_Tuple5 is estimated to return 16 rows but actually produced 235,904 rows. For that many rows, an Index Scan could be more performant than Index Seeks. Are the statistics on the table up to date, or do they need to be updated? (A quick way to check is sketched after these observations.)
The "select count(*) from _Security_Tuple4" takes several minutes, the first time. The second time is instant. This is because the data is all now cached in memory (until it ages out) and the second query is fast.
Because the problem moves with the database, the cause lives in the database itself: the statistics, any missing indexes, et cetera. I would also suggest checking that the indexes match those of the other databases using the same schema.
This is not a full analysis, but it gives you some things to look at.
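For the statistics question above, a quick way to check and refresh them might look like this (the schema is assumed to be dbo):
-- when were the statistics on the object last updated?
SELECT s.name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo._Security_Tuple5');

-- refresh them if they look stale
UPDATE STATISTICS dbo._Security_Tuple5 WITH FULLSCAN;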
Fyodor,
First:
The problem is not connected to the particular server, neither to the particular SQL Server instance: if I back up the database and then restore it on another computer, the behavior remains the same.
I presume that: a) you run this query in an isolated environment, and b) the data is not being modified while it runs.
Is this correct?
Second: post here your CREATE INDEX script. Do you have a funny FILLFACTOR? SORT_IN_TEMPDB?
Third: which types are your ParentId and ObjectId columns? int, smallint, uniqueidentifier, varchar?
I have tables in PostgreSQL, each with millions of records and more than one hundred fields.
One of them is a date field, which we filter by in our queries. Creating an index on this date field improved the performance of queries that read a small range of dates, but for large ranges of dates the performance decreased...
Must I prioritize one over the other? Can the performance for small ranges be improved without hurting the large-range queries?
Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is when you are actually grabbing a large portion of the data. Typically, if more than about 20% of the table is used, it's considered faster to just access it sequentially. Sometimes the planner thinks less than 20% will be accessed, so it prefers the index, but that estimate turns out to be wrong; that's one way adding an index can slow a query down. This may be the situation you're seeing, based on your description: if the large ranges touch more of the table than the optimizer estimates, using the index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, so that you end up doing more I/O than if you had just read the table directly, in a couple of cases. The cause of most of them shows up if you run the query using EXPLAIN ANALYZE. If the "expected" row counts and the "actual" numbers are very different, the optimizer probably has bad statistics for the table. Another possibility is that the optimizer simply misjudged how selective the query is: it thought it would return only a small number of rows, but it actually returns most of the table. Here, again, better statistics are the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index-versus-table-scan trade-off happens. That's only something to consider after the statistics information has been checked, though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
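A minimal sketch of both steps, with placeholder table and column names:
-- compare the planner's row estimates against the actual row counts
EXPLAIN ANALYZE
SELECT * FROM big_table
WHERE event_date BETWEEN '2011-01-01' AND '2011-06-30';

-- if the estimates are far off, collect more detailed statistics for that column
ALTER TABLE big_table ALTER COLUMN event_date SET STATISTICS 500;
ANALYZE big_table;

-- only once the statistics look sane, experiment with the cost setting
SET random_page_cost = 2.0;   -- default is 4.0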
I'd try several things:
increase the DB cache parameters (a sketch follows this list)
add the index on that date field
redesign/modify the application to work with smaller ranges (although this suggestion might seem obvious, it is usually the first to be thrown away)
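For the first point, the cache-related settings live in postgresql.conf; the values below are only illustrative and should be sized to the machine's RAM:
# postgresql.conf (illustrative values)
shared_buffers = 2GB            # PostgreSQL's own page cache
effective_cache_size = 6GB      # planner hint: how much data the OS cache can hold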
The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table being read for large ranges; if so, clustering the table along that index would lead to fewer disk seeks.
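Something along these lines (names are placeholders; the CLUSTER ... USING syntax shown is for PostgreSQL 8.4+, older releases use CLUSTER index_name ON table_name):
CLUSTER big_table USING big_table_date_idx;  -- physically reorder the table by the date index
ANALYZE big_table;                           -- refresh statistics afterwards
-- note: CLUSTER takes an exclusive lock and is a one-off; newly inserted rows are not kept in order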
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table. PostgreSQL is smart enough to only perform index scans on the child tables that actually hold data in the queried date range. Once a child table is "sealed" because a new month has started, run CLUSTER on it to sort the data by date. (A minimal sketch is included at the end of this answer.)
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term, though it will take some work to set up (it will scale and run forever); suggestion #2 may be a quick interim fix if there is only a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your INDEX's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';
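For suggestion #1, a minimal sketch of one monthly child table (names are placeholders; routing inserts to the right child table is not shown):
CREATE TABLE tbl_2011_05 (
    CHECK (date >= '2011-05-01' AND date < '2011-06-01')
) INHERITS (tbl);

CREATE INDEX tbl_2011_05_date_idx ON tbl_2011_05 (date);

-- with constraint_exclusion enabled, the planner skips child tables whose
-- CHECK constraint cannot overlap the queried date range
SET constraint_exclusion = on;

-- once the month is "sealed", physically sort the child table by date
CLUSTER tbl_2011_05 USING tbl_2011_05_date_idx;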