Gemfire LIKE Query with Indexes

We have a scenario that evaluates a LIKE search (with limit 100) on a key which has a range index. The query uses the index, but its performance varies with the number of matches the key returns.
That is, a more specific search takes longer, while a more generic search (with more results) returns faster (perhaps because it fills the first 100 results sooner).
Response times range from 1 ms to 5 minutes on a replicated region with 400k records.
Example query:
select * from /REGION where field like '%SEARCH_STRING%'
Interestingly, the % at the beginning causes the issue. If we just remove it, the query returns in milliseconds. In either case, 'indexesUsed' reports the correct index.
It looks like we're missing something fundamental about indexing, or there is some odd indexing behavior.
Note: GemFire version 8.2.0, Spring Data 1.8.5. The issue is reproducible when the query is run directly in gfsh too, so it is not related to the Spring Data layer.

The leading % causes you to scan the entire index. When you remove the leading %, the query actually uses the range index to jump to the starting point and scan forward from there.
Another thing I noticed in your query is that you are using "select *". Do you really need to serialize the entire record back? You will get better performance if you enable PDX and just select the fields that you need.
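For example, a hedged sketch of such a projection query (the limit comes from your description; any field name other than 'field' is hypothetical and should be adjusted to your domain type):

select r.field, r.otherField from /REGION r where r.field like '%SEARCH_STRING%' limit 100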
You reported that it's using the index even though it's effectively doing a scan. If you don't have PDX enabled and it really isn't using the index, deserializing all the records could be a problem (since you have select *), or the JVM may be memory-constrained because of the scan.
Wes Williams
P.S. An effective way to get help directly from the devs is also to post the question on the Geode user list.

Related

simple lookup takes several minutes despite using an index

I have a decently sized graph (~600 million nodes, 3.5 billion edges) that I imported into Neo4j. The graph is also quite dense (median edge count around 10), though I'm not sure whether that affects performance.
For one type of node (:Authors) - there are roughly 200 million nodes of this type - I would like to run a query for a specific name, which is stored in the property normalizedName. Here is the (very simple) query:
MATCH (a:AUTHOR)
WHERE a.normalizedName = "jonathan smith"
RETURN a
As one might expect, this query takes a LONG time (several minutes) to execute. Although I have no explicit guarantee of uniqueness on this property, I still tried to create an index on it, and I got no complaints from Neo4j. Afterwards, I would have expected the above query to execute in milliseconds, due to the O(1) complexity of index lookups. Unfortunately, the query still takes several minutes.
What am I doing wrong?
Ensure that you have created the index as:
CREATE INDEX ON :AUTHOR(normalizedName)
Be aware that you will need to create an index on each property you wish to use an index lookup on. Indexes are also scoped by node label, i.e. if you're using multiple labels on a node and need an index lookup, you need one index per label. For example, if you had :Person:Author, you'd also need to create:
CREATE INDEX ON :Person(normalizedName)

Neo4j import tool and querying

I have some very basic conceptual questions about how Neo4j works.
1. My first question is about the import tool. I am importing around 150 million nodes and a similar number of relationships. During the import, the output on the command terminal prints the number of nodes uploaded and then "prepare node index". What is this node index? Where is it actually used? I see that the created index information is present under graph_db => schema => label. What is this index and where is it actually used? Running a Cypher query does not show that the index is being used anywhere.
2. My second question is about Neo4j's heap memory size. My understanding is that while running Cypher queries, results are stored in the heap. Once the heap is full, a garbage collection happens. What if I run a Cypher statement that produces results that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size? Would Neo4j switch to disk, or would it produce an error?
Thanks in advance for clarifying these questions.
Best,
What is this node index? Where is it actually used?
The index is just that - a database index. A database index is what's used to help you look up nodes really quickly. Say you put 1 million :Person nodes into a database, and then 1 million :Location nodes into the same database. When you MATCH (p:Person { last_name: "Smith" }), you want the database to search through only the :Person nodes, and not all 2 million. The index is what makes that happen.
Read up on indexes in neo4j
What is this index and where is it actually used?
The index by label is basically a searchable collection of nodes categorized by label (in this case :Person and :Location) that the database engine uses to speed up lookups. This is a greatly simplified answer, but basically accurate. This is a very good thing; you definitely want it. Performance of getting data out of the database would be quite bad without it.
Indexes are all about trading computation time and storage for better performance. Basically, the database pre-orders all of the nodes in a certain way (which costs you up-front computation time, and also a small amount of storage on disk) in exchange for having a nice data structure in place that makes queries very fast. Generally in database terms, you'll find that if you do a lot of read-only queries (fetching data) you really, really want indexes. If your workload is mostly just adding stuff (not lookups), they're not as good.
Running a Cypher query does not show that the index is being used anywhere.
Yes, it's invisible, but when you search for something in Cypher using a label, Neo4j is exploiting that index to optimize your query.
My understanding is that while running Cypher queries, results are stored in the heap
Well, that's only partially true; in some sense everything in Java is stored in the heap. But results stream back from the database. If you issue a query that returns 1 million results, it is not the case that all 1 million go into the heap immediately. They get pulled in blocks at a time (I don't know how many at a time; the db engine handles that). At any given time, what's in the heap is the set you need right now, not everything.
What if I run a Cypher statement that produces results that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size?
See the earlier answer. You can do this without a problem, because the entire set generally isn't in the heap. In database terms, we'd say you get a "cursor" back that lets you iterate through results. You do not get a huge result set back. The gotcha here is that if you have 1 million results, you can iterate through them once. Need to run through them a second time? Avoid doing that, or issue the query again.
Would neo4j switch to disk?
No. If/when any swapping to disk happened, that would be an operating system decision about your main memory. It's possible it would happen, but it wouldn't have much to do with Neo4j.
or would it produce an error
Nope, Neo4j doesn't care how big your result set is. With the "cursor" concept, you can get 1 result or 10 billion results; both will work.

Is the Lucene query language hack-proof?

Obviously it cannot be used to trash the index or crack card numbers, passwords etc. (unless one is stupid enough to put card numbers or passwords in the index).
Is it possible to bring down the server with excessively complex searches?
I suppose what I really need to know is can I pass a user-entered Lucene query directly to the search engine without sanitization and be safe from malice.
It is impossible to modify the index from the input of a query parser. However, there are several things that could hurt a search server running Lucene:
A high value for the number of top results to collect
Lucene puts hits in a priority queue in order to sort them (the queue is implemented with a backing array the size of the priority queue). So running a request which fetches the results from offset 99,999,900 to offset 100,000,000 will make the server allocate a few hundred megabytes for this priority queue. Running several queries of this kind in parallel is likely to make the server run out of memory.
Sorting on arbitrary fields
Sorting on a field requires the field cache for that field to be loaded. In addition to taking a lot of time, this operation uses a lot of memory (especially on text fields with many large distinct values), and this memory is not reclaimed until the index reader for which the cache was loaded is no longer in use.
Term dictionary intensive queries
Some queries are more expensive than others. To prevent query execution from taking too long, Lucene already has some guards against overly complex queries: by default, a BooleanQuery cannot have more than 1024 clauses.
Other queries, such as wildcard queries and fuzzy queries, are also very expensive.
To prevent your users from hurting your search service, you should decide what they are allowed to do and what they are not. For example, Twitter (which uses Lucene for its search backend) used to limit queries to a few clauses in order to be certain to provide a response in reasonable time. (The question "Twitter api - search too complex?" discusses this limitation.)
As far as I know, there are no major vulnerabilities that you need to worry about. Depending on the query parser you are using, you may want to do some simple sanitization.
Limit the length of the query string
Check for characters that you don't want to support. For example, +, -, [, ], *
If you let the user pick the number of results returned (e.g. 10, 20, 50), then make sure they can't use a really large value.

Is O(1) access to a database row possible?

I have a table which uses an auto-increment field (ID) as its primary key. The table is append-only and no rows will ever be deleted. The table has been designed to have a constant row size.
Hence, I expected O(1) access time using any value as the ID, since it is easy to compute the exact position to seek to in the file (ID * row_size). Unfortunately, that is not the case.
I'm using SQL Server.
Is it even possible ?
Thanks
Hence, I expected O(1) access time using any value as the ID, since it is easy to compute the exact position to seek to in the file (ID * row_size)
Ah. No. Autoincrement does not - even without deletions - guarantee no holes. Holes = seek via index. Ergo: your assumption is wrong.
I guess the thing that matters to you is the performance.
Databases use indexes to access records stored on disk.
Usually this is done with B+ tree indexes, whose lookup cost is O(log_b n), where the branching factor b of internal nodes is typically between 100 and 200 (optimized to the block size, see ref).
This is still, strictly speaking, logarithmic performance, but given a decent number of records, say a few million, the leaf nodes can be reached in 3 to 4 steps. That, together with all the overhead of query planning, session initiation, locking, etc. (which you would have anyway in any multiuser, ACID-compliant data management system), is for all practical purposes comparable to constant time.
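To make that concrete, here is a back-of-the-envelope calculation (my own numbers, purely illustrative): with a branching factor of b = 100,

$\log_{100}(10^6) = \frac{\log 10^6}{\log 100} = \frac{6}{2} = 3$

so an index over a million rows is only about 3 levels deep, and a few million rows add at most one more level.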
The good news is that an indexed read is O(log(n)), which for large values of n gets pretty close to O(1). That said, in this context O notation is not very useful, and actual timings are far more meaningful.
Even if it were possible to address rows directly, your query would still have to go through the client and server protocol stacks and carry out various lookups and memory allocations before it could give the result you want. It seems like you are expecting something that isn't even practical. What is the real problem here? Is SQL Server not fast enough for you? If so there are many options you can use to improve performance but directly seeking an address in a file is not one of them.
Not possible. SQL Server organizes data into a tree-like structure based on key and index values; an "index" in the DB sense is more like a reference book's index and not like an indexed data structure like an array or list. At best, you can get logarithmic performance when searching on an indexed value (PKs are generally treated as an index). Worst-case is a table scan for a non-indexed column, which is linear. Until the database gets very large, the seek time of a well-designed query against a well-designed table will pale in comparison to the time required to send it over the network or even a named pipe.
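By way of illustration, a minimal sketch of what you do get (table and column names are invented): a point lookup on the clustered primary key is a B-tree index seek, not a file-offset calculation.

-- hypothetical append-only table with a fixed row size
CREATE TABLE dbo.Event (
    ID INT IDENTITY(1,1) PRIMARY KEY,   -- becomes the clustered B-tree key by default
    Payload CHAR(100) NOT NULL          -- fixed-width column
);

-- this is an O(log n) clustered index seek, not ID * row_size arithmetic
SELECT ID, Payload
FROM dbo.Event
WHERE ID = 12345;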

How do I force Postgres to use a particular index?

How do I force Postgres to use an index when it would otherwise insist on doing a sequential scan?
Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.
As a very blunt hammer, useful for testing, you can use the enable_seqscan and enable_indexscan parameters. See:
Examining index usage
enable_ parameters
These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_ params and walk away.
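If you do use them for testing, a minimal sketch (the table and filter below are placeholders) that keeps the change scoped to a single transaction looks like this:

BEGIN;
SET LOCAL enable_seqscan = off;   -- affects only the current transaction
EXPLAIN ANALYZE
SELECT * FROM my_table WHERE some_column = 42;
ROLLBACK;                         -- SET LOCAL stops applying when the transaction ends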
Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?
For small tables, it's faster to do sequential scans.
Postgres doesn't use indexes when datatypes don't match properly; you may need to include appropriate casts.
Your planner settings might be causing problems.
See also this old newsgroup post.
Probably the only valid reason for using
set enable_seqscan=false
is when you're writing queries and want to quickly see what the query plan would actually be were there large amounts of data in the table(s). Or of course if you need to quickly confirm that your query is not using an index simply because the dataset is too small.
TL;DR
Run the following three commands and check whether the problem is fixed:
ANALYZE;
SET random_page_cost = 1.0;
SET effective_cache_size = 'X GB'; -- replace X with total RAM size minus 2 GB
Read on for further details and background information about this.
Step 1: Analyze tables
As a simple first attempt to fix the issue, run the ANALYZE; command as the database superuser in order to update all table statistics. From the documentation:
The query planner uses these statistics to help determine the most efficient execution plans for queries.
Step 2: Set the correct random page cost
Index scans require non-sequential disk page fetches. PostgreSQL uses the random_page_cost configuration parameter to estimate the cost of such non-sequential fetches in relation to sequential fetches. From the documentation:
Reducing this value [...] will cause the system to prefer index scans; raising it will make index scans look relatively more expensive.
The default value is 4.0, thus assuming an average cost factor of 4 compared to sequential fetches, taking caching effects into account. However, if your database is stored on an SSD drive, then you should actually set random_page_cost to 1.1 according to the documentation:
Storage that has a low random read cost relative to sequential, e.g., solid-state drives, might also be better modeled with a lower value for random_page_cost, e.g., 1.1.
Also, if an index is mostly (or even entirely) cached in RAM, then an index scan will always be significantly faster than a disk-served sequential scan. The query planner however doesn't know which parts of the index are already cached, and thus might make an incorrect decision.
If your database indices are frequently used, and if the system has sufficient RAM, then the indices are likely to be cached eventually. In such a case, random_page_cost can be set to 1.0, or even to a value below 1.0 to aggressively prefer using index scans (although the documentation advises against doing that). You'll have to experiment with different values and see what works for you.
As a side note, you could also consider using the pg_prewarm extension to explicitly cache your indices into RAM.
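For example, a minimal sketch (the index name is a placeholder, and the extension must be available on your installation):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_index');   -- loads the index's blocks into the buffer cache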
You can set the random_page_cost like this:
SET random_page_cost = 1.0;
Step 3: Set the correct cache size
On a system with 8 or more GB RAM, you should set the effective_cache_size configuration parameter to the amount of memory which is typically available to PostgreSQL for data caching. From the documentation:
A higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.
Note that this parameter doesn't change the amount of memory which PostgreSQL will actually allocate, but is only used to compute cost estimates. A reasonable value (on a dedicated database server, at least) is the total RAM size minus 2 GB. The default value is 4 GB.
You can set the effective_cache_size like this:
SET effective_cache_size = '14GB'; -- e.g. on a dedicated server with 16 GB RAM
Step 4: Fix the problem permanently
You probably want to use ALTER SYSTEM SET ... or ALTER DATABASE db_name SET ... to set the new configuration parameter values permanently (either globally or per-database). See the documentation for details about setting parameters.
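For example, a sketch only (the database name is a placeholder; use whatever values worked for you above):

ALTER SYSTEM SET random_page_cost = 1.0;                  -- cluster-wide, written to postgresql.auto.conf
ALTER DATABASE my_db SET effective_cache_size = '14GB';   -- per-database override
SELECT pg_reload_conf();                                  -- picks up the ALTER SYSTEM change without a restart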
Step 5: Additional resources
If it still doesn't work, then you might also want to take a look at this PostgreSQL Wiki page about server tuning.
Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description = 'Refund'
GROUP BY client_id
PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description||'' = 'Refund'
GROUP BY client_id
The question in itself is very much invalid. Forcing (by doing enable_seqscan=off, for example) is a very bad idea. It might be useful to check whether it will be faster, but production code should never use such tricks.
Instead, run EXPLAIN ANALYZE on your query, read the output, and find out why PostgreSQL chooses a bad (in your opinion) plan.
There are tools on the web that help with reading EXPLAIN ANALYZE output; one of them is explain.depesz.com, written by me.
Another option is to join the #postgresql channel on the Freenode IRC network and talk to the people there to help you out, as optimizing a query is not a matter of "ask a question, get an answer, be happy". It's more like a conversation, with many things to check and many things to learn.
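For reference, a minimal sketch of what that looks like (the table and condition are placeholders borrowed from the example above):

EXPLAIN ANALYZE
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp
GROUP BY client_id;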
One thing to note with PostgreSQL, when you are expecting an index to be used and it is not being used, is to VACUUM ANALYZE the table.
VACUUM ANALYZE schema.table;
This updates the statistics used by the planner to determine the most efficient way to execute a query, which may result in the index being used.
Another thing to check is the data types.
Is the index on an int8 column while you are querying with numeric? The query will work, but the index will not be used.
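A minimal sketch of that situation (table and column names are invented):

-- id is bigint (int8) and indexed via the primary key
CREATE TABLE orders (id bigint PRIMARY KEY, amount numeric);

-- comparing against a numeric value coerces id to numeric, so the btree index is skipped
SELECT * FROM orders WHERE id = 42::numeric;

-- compare against a matching integer type (or cast explicitly) and the index can be used
SELECT * FROM orders WHERE id = 42::bigint;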
There is a trick to push Postgres to prefer a seqscan: add an OFFSET 0 to the subquery.
This is handy for optimizing requests joining big/huge tables when all you need is the first/last n elements.
Let's say you are looking for the first/last 20 elements involving multiple tables with 100k (or more) entries each; there is no point building and joining the whole query over all the data when what you're looking for is in the first 100 or 1000 entries. In this scenario, for example, it turns out to be over 10x faster to do a sequential scan.
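A minimal sketch of the pattern (all table and column names are invented):

SELECT t.id, o.label
FROM (
    SELECT id
    FROM big_table
    WHERE some_flag
    OFFSET 0          -- planner fence: keeps the subquery from being flattened into the outer query
) t
JOIN other_table o ON o.big_id = t.id
LIMIT 20;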
see How can I prevent Postgres from inlining a subquery?
Indexes can only be used under certain circumstances.
For example, the type of the value must match the type of the column.
Also, you must not be performing an operation on the column before comparing it to the value.
Given a customer table with 3 columns and 3 indexes, one on each column:
create table customer(id numeric(10), age int, phone varchar(200))
It might happen that the database tries to use, for example, the index idx_age instead of the index on the phone number.
You can sabotage the use of the index on age by performing an operation on age:
select * from customer where phone = '1235' and age+1 = 24
(although you are looking for the age 23)
This is of course a very simple example, and the intelligence of Postgres is probably good enough to make the right choice. But sometimes there is no other way than tricking the system.
Another example is to
select * from customer where phone = '1235' and age::varchar = '23'
But this is probably more costly than the option above.
Unfortunately, you CANNOT put the name of the index into the query like you can in MSSQL or Sybase:
select * from customer (index idx_phone) where phone = '1235' and age = 23.
This would help a lot to avoid problems like this.
Apparently there are cases where Postgres can be hinted into using an index by repeating a similar condition twice.
The specific case I observed was using PostGIS gin index and the ST_Within predicate like this:
select *
from address
natural join city
natural join restaurant
where st_within(address.location, restaurant.delivery_area)
and restaurant.delivery_area ~ address.location
Note that the first predicate st_within(address.location, restaurant.delivery_area) is automatically decomposed by PostGIS into (restaurant.delivery_area ~ address.location) AND _st_contains(restaurant.delivery_area, address.location) so adding the second predicate restaurant.delivery_area ~ address.location is completely redundant. Nevertheless, the second predicate convinced the planner to use spatial index on address.location and in the specific case I needed, improved the running time 8 times.