Index performance with a startswith structure - indexing

Is there an expression in Neo4j, such as startswith, that runs fast on an indexed property?
I currently run a query like
match (p:Page) where p.Url =~ 'http://www.([\\w.?=#/&]*)' return p
The p.Url property is indexed; however, the query above is very slow. Shouldn't a startswith search on an index be quite fast?

Currently regex (or 'startsWith') filters are not supported with schema indexes. Your cypher statement will scan through all nodes having the Page label and filter them based on their properties.
More sophisticated query capabilities for schema indexes are to be expected in one of the next releases of Neo4j.
If you need that functionality now, you basically have two options:
Use legacy indexes as documented in the reference manual.
If your queries always filter for values starting with protocol://prefix, you can work around this by putting the prefix protocol://prefix into an additional property urlPrefix and declaring an index on it (create index on :Page(urlPrefix)). Your query is then match (p:Page) where p.urlPrefix='http://www.' return p, which will be served by the existing index.
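The second option amounts to computing the prefix once at write time so that the later lookup is an exact match. A minimal Python sketch of that idea follows; `url_prefix`, `build_prefix_index`, and the in-memory dictionary standing in for the index on urlPrefix are hypothetical illustrations, not Neo4j APIs.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_prefix(url: str) -> str:
    """Return 'scheme://', followed by 'www.' when the host starts with it."""
    parts = urlsplit(url)
    prefix = f"{parts.scheme}://"
    if parts.netloc.startswith("www."):
        prefix += "www."
    return prefix


def build_prefix_index(urls):
    """Group URLs by precomputed prefix (a stand-in for an index on urlPrefix)."""
    index = defaultdict(list)
    for url in urls:
        index[url_prefix(url)].append(url)
    return index


pages = [
    "http://www.example.com/a?b=1",
    "http://example.org/",
    "https://www.example.net/x",
]
index = build_prefix_index(pages)

# Exact-match lookup, analogous to WHERE p.urlPrefix = 'http://www.'
matches = index["http://www."]
```

The trade-off is that the prefix has to be chosen in advance: a query for a different prefix needs its own property or a scan.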

Related

Prisma and Postgres. Using indexes for optimized 'LIKE' search

I have a usual string column which is not unique and doesn't have any indexes. I've added a search function which uses the Prisma contains under the hood.
Currently it takes around 40ms for my queries (with the test database having around 14k records), which could be faster as I understood from this article: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
The issue is that nothing really changes after I add the trigram index, like this
CREATE INDEX CONCURRENTLY trgm_idx_users_name
ON users USING gin (name gin_trgm_ops);
The query execution time is literally the same. I also found that I can check whether the index is actually used by disabling the full scan. When I do that, the execution time becomes several times worse (meaning the added index was not being used, because its performance is worse than a full scan).
I was trying to use B-tree and Gin indexes.
The query example for testing is just searching records with LIKE:
SELECT *
FROM users
WHERE name LIKE '%test%'
ORDER BY NAME
LIMIT 10;
I couldn't find articles describing best practices for such "LIKE" queries that's why I'm asking here.
So my questions are:
Which index type is suitable when I need to find N records in a string column using LIKE (prisma.io contains)? Some docs say that the default B-tree is fine. Some articles show that Gin is better for that purpose.
As I'm using the prisma contains with Postgres, some features are still not supported in Prisma and different workarounds are needed (such as using unsupported types, experimental features etc.). I'd appreciate it if Prisma examples were given.
Also, full text search is not suitable, as it requires knowing the language used in the column, and my column will contain data in different languages.
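As background on what a trigram index stores, here is a simplified Python sketch of the idea: every value is broken into three-character substrings, and a contains-style search first intersects the row sets of the pattern's trigrams, then rechecks the surviving candidates. It is an illustration only and makes no claim about how pg_trgm is implemented; the matching here is case-insensitive, and all names and sample rows are made up for the example.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """All lowercase three-character substrings of text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def build_trigram_index(rows: dict) -> dict:
    """Map each trigram to the set of row ids whose value contains it."""
    index = defaultdict(set)
    for row_id, value in rows.items():
        for gram in trigrams(value):
            index[gram].add(row_id)
    return index


def contains_search(rows: dict, index: dict, needle: str) -> list:
    """Return sorted row ids whose value contains needle (case-insensitive)."""
    grams = trigrams(needle)
    if not grams:
        # Needles shorter than three characters yield no trigrams,
        # so this sketch falls back to checking every row.
        candidates = set(rows)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Recheck: sharing all trigrams does not guarantee a real match.
    return sorted(r for r in candidates if needle.lower() in rows[r].lower())


users = {1: "Tester One", 2: "Contest Winner", 3: "Alice", 4: "tset"}
index = build_trigram_index(users)
result = contains_search(users, index, "test")
```

In this sketch the index only pays off when the trigram intersection is much smaller than the table; for a few thousand short rows, scanning everything can be just as fast.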

Search for nodes in Neo4j with schema index

I have a graph that has only Schema indexes and not legacy indexes as Neo4j documentation recommends. I want to search for nodes like in this example described under the legacy indexing section (exact match, start queries etc). I am wondering if this is possible with schema indexes and if schema indexes use lucene underneath.
As of today schema indexes just support exact matches, e.g.
MATCH (p:Person) WHERE p.name='abc'
or IN operators
MATCH (p:Person) WHERE p.name in ['abc','def']
Future releases might have support for wildcards as well.
You can use wildcards as well; in that case the query would be
MATCH (b:book) WHERE b.title=~"F.*" RETURN b;

How does a full text search server like Sphinx work?

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
Is it possible to explain in simple words how these queries work and how they are generated?
Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
Now when you run a SQL query through Sphinx, it fetches the data from the database and constructs the inverted index, which in Sphinx is like a hash table where the key is a 32-bit integer calculated using crc32(word) and the value is the list of document IDs containing that word.
This makes it super fast.
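The structure described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Sphinx's actual implementation: `word_id`, `build_inverted_index`, and `search` are hypothetical names, words are split on whitespace only, and CRC32 collisions between different words are ignored.

```python
import zlib
from collections import defaultdict


def word_id(word: str) -> int:
    """32-bit integer key for a word, following the crc32(word) description."""
    return zlib.crc32(word.lower().encode("utf-8"))


def build_inverted_index(docs: dict) -> dict:
    """Map each word key to the set of document ids containing that word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word_id(word)].add(doc_id)
    return index


def search(index: dict, query: str) -> list:
    """Return sorted ids of documents that contain every word of the query."""
    sets = [index.get(word_id(w), set()) for w in query.split()]
    return sorted(set.intersection(*sets)) if sets else []


docs = {1: "The yellow dog", 2: "The brown fox"}
index = build_inverted_index(docs)
brown_fox = search(index, "brown fox")
```

A lookup touches only the entries for the query words, regardless of how many documents exist, which is where the speed comes from.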
Now you can argue that even a database can create a similar structure to make searches super fast. However, the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [From MySQL Performance Blog]. Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed for searching.
Another possible reason is that databases generally suffer from internal fragmentation, so they need to perform a lot of semi-random I/O on large requests.
What that means is that, given the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to retrieve is widely spread, the result will take a long time, and that seems to be what happens in databases.
EDIT: Also please see the source code in cpp files like searchd.cpp etc for the real internal implementation, I think you are just seeing the PHP wrappers.
The queries you are looking at are the ones Sphinx uses to extract a copy of the data from the database and put it in its own index.
Sphinx needs a copy of the data to build its index (other answers have mentioned how that index works). You then ask the searchd daemon for results matching a specific query; it consults the index and returns the matching documents.
The particular example you have chosen looks quite complicated because it is only extracting part of the data, probably for sharding, that is, splitting the index into parts for performance reasons. It also uses range queries, so that big datasets can be accessed piecemeal.
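The arithmetic in the question's queries can be worked through with a short Python sketch. The multiplier 5 and offset 1 are taken from the queries shown in the question; what they stand for is not stated in the source, and the function names and the batch size below are hypothetical.

```python
def to_doc_id(row_id: int, multiplier: int = 5, offset: int = 1) -> int:
    """Forward mapping from the indexing query: items.id * 5 + 1."""
    return row_id * multiplier + offset


def to_row_id(doc_id: int, multiplier: int = 5, offset: int = 1) -> int:
    """Reverse mapping from the lookup query: ($id - 1) / 5."""
    return (doc_id - offset) // multiplier


def id_ranges(min_id: int, max_id: int, step: int):
    """Yield inclusive (start, end) pairs, one per ranged fetch."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


doc_id = to_doc_id(7)
row_id = to_row_id(doc_id)
batches = list(id_ranges(1, 10, 4))
```

Each `(start, end)` pair corresponds to one execution of the query with `$start` and `$end` filled in, so the table is read in bounded chunks rather than in one large result set.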
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a sphinx index, with two fields - name and description that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage
Full text search usually uses some implementation of an inverted index. In simple words, it breaks the content of an indexed field into tokens (words) and saves a reference to that row, indexed by each token. For example, a field with The yellow dog for row #1 and The brown fox for row #2 will populate an index like:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1
A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc). Even though they can be used to do keyword search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword search queries, thus can provide much better support.

On what operations do indexes not work well or go unused

I am writing SQL queries, and I have read in a book that indexes do not work when the NOT operator or the LIKE operator is used. How true is this statement? If it is true, how can we avoid the problem, and how should a query be written to do the same work without these operators?
In what other situations in SQL Server are indexes not used?
LIKE patterns that have a wildcard on the left side cannot use an index if one is defined for the column:
WHERE column LIKE '%abc'
WHERE column LIKE '%abc%'
But either of these can use an index:
WHERE column LIKE 'abc%'
WHERE column LIKE 'abc'
Frankly, use LIKE only for very simple text searching. If you really need a text search that performs well, look at Full Text Searching (FTS). Full Text Searching has its own indexes.
The decision about which index to use really comes down to the optimizer, but one thing that ensures an index will not be used is wrapping column values in function calls. These, for example, will not use indexes:
WHERE CHARINDEX('abc', column) > 0
WHERE CAST(column AS DATETIME) <= '2010-01-01'
Anywhere you are manipulating table data--especially changing the data type--will render an index useless because an index is of the unaltered values and there's no way to relate to altered data.
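The same pattern can be observed with any engine that exposes its query plan. The sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in, not SQL Server, so the plan wording and the exact optimizer rules differ; the table, index, and helper names are invented for the example. The column is declared `COLLATE NOCASE` because SQLite only applies its prefix optimization for the default case-insensitive LIKE when the index uses that collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        name TEXT COLLATE NOCASE,
        price REAL
    );
    CREATE INDEX idx_items_name ON items (name);
    """
)


def plan(sql: str) -> str:
    """Return the plan steps SQLite reports for a query, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)


# Prefix pattern: can be turned into a range lookup on the index.
prefix_plan = plan("SELECT * FROM items WHERE name LIKE 'abc%'")
# Leading wildcard: no usable starting point in the index.
leading_plan = plan("SELECT * FROM items WHERE name LIKE '%abc%'")
# Column wrapped in a function call: the index holds unaltered values only.
wrapped_plan = plan("SELECT * FROM items WHERE substr(name, 1, 3) = 'abc'")

for label, text in [
    ("prefix", prefix_plan),
    ("leading wildcard", leading_plan),
    ("function call", wrapped_plan),
]:
    print(f"{label}: {text}")
```

In SQLite's output a step beginning with SEARCH indicates an index lookup and one beginning with SCAN indicates that every row is visited; in SQL Server the corresponding check is the execution plan.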
For LIKE, see this article:
SQL Performance - Indexes and the LIKE clause
For the NOT operator, whether indexes are used depends on the particular query.
Query Analyzer has an option called "Show Execution Plan" (located on the Query drop-down menu). If you turn this option on, then whenever you run a query in Query Analyzer, you will get a query execution plan. Use this to analyze the effectiveness of your query. Based on the results, you may need to use another query or add better indexes. For example, a table scan means that indexes are not used; bookmark lookups mean you should limit the rows or use a covering index; if there is a filter, you might want to remove any function calls from the WHERE clause; and a sort can be slow.
The search term you need to explore this issue in more depth is sargable.
Here is one article to get you started:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

How do I write a string search query that uses the non-clustered indexing I have in place on the field?

I'm looking to build a query that will use the non-clustered index I have on a street address field. The problem I'm having is that when searching for a street address I will most likely be using LIKE, and I suspect that this will cause a table scan instead of using the index. How would I go about writing the query in this case? Is it just pointless to put a non-clustered index on an address3 field? Thanks in advance.
varchar fields are indexed from left to right, much the same as a dictionary or encyclopedia is indexed.
If you knew what the field started with, (ex. LIKE 'streetname%') then the index would be efficient. However, if you only know part of the field (ex. LIKE '%something%') then an index cannot be used.
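The dictionary analogy can be made concrete with a sorted list and binary search. This is a minimal Python sketch of the principle, not of how SQL Server stores its indexes; the function names and sample addresses are invented.

```python
import bisect


def prefix_search(sorted_values: list, prefix: str) -> list:
    """Jump to the first candidate by binary search, then read forward
    while values still start with the prefix (like LIKE 'prefix%')."""
    i = bisect.bisect_left(sorted_values, prefix)
    matches = []
    while i < len(sorted_values) and sorted_values[i].startswith(prefix):
        matches.append(sorted_values[i])
        i += 1
    return matches


def substring_search(sorted_values: list, fragment: str) -> list:
    """No ordering helps here: every value must be inspected
    (like LIKE '%fragment%')."""
    return [v for v in sorted_values if fragment in v]


addresses = sorted(["Oak Avenue", "Main Street", "Baker Street", "Main Road"])
starts_with_main = prefix_search(addresses, "Main")
contains_street = substring_search(addresses, "Street")
```

The prefix search inspects only the matching entries plus one, while the substring search visits every entry, however the list is sorted.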
If your LIKE expression is doing a start-of-string search (Address LIKE 'Blah%'), I would expect the index to be used, most likely through an index seek.
If you search for Address LIKE '%Blah%', a table scan/index scan will occur, depending on how many fields you return in your query and how selective the index is.
Using LIKE will not necessarily cause a table scan; it may make use of an index, depending on what string you're searching for. (For instance, LIKE 'something%' is generally able to use an index, whereas LIKE '%something' is probably not, although the server may still be able to at least do an index scan in that case, which is more expensive than a straight index lookup but still cheaper than a full table scan.) There's a good article here that talks about LIKE vs. indexes with respect to SQL Server (different DBMSs will implement it differently, obviously).
In theory the database will use whatever index is best. What database server are you using, what are you really trying to achieve, and what is your LIKE statement going to be like? For instance, where the wildcard characters are can make a difference to the query plan that is used.
Other possibilities depending on what you want to achieve are performing some pre-processing of the data and having other columns that are useful for your search, or using an indexed view.
Here's some discussion on the use of indexes with SQL Server 2005 and varchar fields.