OrientDB: FullText indexes vs Lucene FullText indexes


OrientDB has two types of full-text indexes: its own implementation and a Lucene-based one. However, it is unclear which one I should use.
I understand that Lucene provides more features. But what if those features are not required? Should I go with the standard full-text index or with Lucene? Performance is then the main question.

FULLTEXT indexes with the LUCENE engine
provide good full-text indexing, but cannot be used to index other types. They are durable, transactional and support range queries.
For more information about Lucene, see the link.
FULLTEXT indexes with the SBTREE engine
are built with an algorithm based on the B-Tree index algorithm. It has been adapted with several optimizations related to data insertion and range queries. As with all other tree-based indexes, the SB-Tree algorithm has log(N) complexity, but the base of this logarithm is about 500. This indexing algorithm provides a good mix of features, similar to those available from other index types. It is good for general use and is durable, transactional and supports range queries.
A simple example that compares the speed:
DB one: 100000 records of class Person with a property name holding the value "the name is 1...n", and a Lucene index on this property
DB two: 100000 records of class Person with the same property name and values, and an SB-Tree index on this property
On DB one: select from Person where name LUCENE "49000" returns one record --> Query executed in 0.039 sec.
On DB two: select from Persona where name = "49000" returns one record --> Query executed in 1.364 sec.
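To make the "log(N) with a base of about 500" remark concrete, here is a rough sketch in Python that estimates how many tree levels a lookup has to descend. The function name and the assumption that every node holds exactly `fanout` keys are illustrative simplifications, not OrientDB internals.

```python
def btree_levels(num_keys: int, fanout: int = 500) -> int:
    """Estimate the number of tree levels needed to address num_keys,
    assuming (hypothetically) that every node has exactly `fanout` children."""
    levels = 1
    capacity = fanout  # keys addressable with a single level
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


for n in (100_000, 10_000_000, 1_000_000_000):
    # Compare a wide tree (fanout 500) with a binary tree (fanout 2).
    print(n, btree_levels(n), btree_levels(n, fanout=2))
```

Under these assumptions, 100000 keys fit in 2 levels with a fanout of 500, whereas a binary tree would need 17.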

Related

Prisma and Postgres. Using indexes for optimized 'LIKE' search

I have a usual string column which is not unique and doesn't have any indexes. I've added a search function which uses the Prisma contains under the hood.
Currently it takes around 40ms for my queries (with the test database having around 14k records) which could be faster as I understood from this article: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
The issue is that nothing really changes after I add the trigram index, like this
CREATE INDEX CONCURRENTLY trgm_idx_users_name
ON users USING gin (name gin_trgm_ops);
The query execution time is literally the same. I also found that I can check whether the index is actually used by disabling full scans. When I do that, the execution time becomes several times worse (meaning the added index is not used by default because its performance is worse than a full scan).
I tried both B-tree and GIN indexes.
The query example for testing is just searching records with LIKE:
SELECT *
FROM users
WHERE name LIKE '%test%'
ORDER BY NAME
LIMIT 10;
I couldn't find articles describing best practices for such LIKE queries, which is why I'm asking here.
So my questions are:
Which index type is suitable for this case, where I need to find N records in a string column using LIKE (Prisma contains)? Some docs say that the default B-tree is fine. Some articles show that GIN is better for that purpose.
As I'm using Prisma contains with Postgres, some features are still not supported in Prisma and workarounds are needed (such as unsupported types, experimental features, etc.). I'd appreciate it if Prisma examples were given.
Also, full-text search is not suitable, as it requires knowing the language used in the column, but my column will contain data in different languages.

Lucene Difference between term and query?

What is the exact difference between term-based and query-based indexing and searching in Lucene 6.5?

I don't know where you heard about "term-based" and "query-based" indexes.
Terms are the analyzed chunks of the text in the index. Most commonly, these are words, but it depends on your analyzer.
Queries are a set of search criteria that specifies what to look for among the indexed terms.
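The distinction can be illustrated with a toy inverted index in plain Python. This is only a sketch of the idea, not Lucene's API: `analyze`, `build_index` and `search_all` are hypothetical names, and the "analyzer" here just lowercases and splits on whitespace.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace to produce terms."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search_all(index, query_text):
    """A simple AND query: documents that contain every query term."""
    terms = analyze(query_text)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Lucene indexes terms",
    2: "Queries search terms",
    3: "Lucene runs queries",
}
index = build_index(docs)
print(sorted(index))                        # the terms stored in the index
print(search_all(index, "lucene queries"))  # a query evaluated against them
```

The terms are what ends up stored in `index`; the query is the criteria passed to `search_all`, which is evaluated against those stored terms.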

Role of selectivity in index scan/seek

I have been reading in many SQL books and articles that selectivity is an important factor in creating an index. If a column has low selectivity, an index seek does more harm than good. But none of the articles explain why. Can anybody explain why this is so, or provide a link to a relevant article?

From SimpleTalk article by Robert Sheldon: 14 SQL Server Indexing Questions You Were Too Shy To Ask
The ratio of unique values within a key column is referred to as index
selectivity. The more unique the values, the higher the selectivity,
which means that a unique index has the highest possible selectivity.
The query engine loves highly selective key columns, especially if
those columns are referenced in the WHERE clause of your frequently
run queries. The higher the selectivity, the faster the query engine
can reduce the size of the result set. The flipside, of course, is
that a column with relatively few unique values is seldom a good
candidate to be indexed.
Also check these articles:
Check this post by Pinal Dave
this other on SQL Serverpedia
This forum post on SqlServerCentral can help you too.
This article on SqlServerCentral also
From the SqlServerCentral article:
In general, a nonclustered index should be selective. That is, the
values in the column should be fairly unique and queries that filter
on it should return small portions of the table.
The reason for this is that key/RID lookups are expensive operations
and if a nonclustered index is to be used to evaluate a query it needs
to be covering or sufficiently selective that the costs of the lookups
aren’t deemed to be too high.
If SQL considers the index (or the subset of the index keys that the
query would be seeking on) insufficiently selective then it is very
likely that the index will be ignored and the query executed as a
clustered index (table) scan.
It is important to note that this does not just apply to the leading
column. There are scenarios where a very unselective column can be
used as the leading column, with the other columns in the index making
it selective enough to be used.

I'll try to give a very simple explanation (based on my current knowledge of SQL Server):
If an index has low selectivity, it means that a larger percentage of the total rows is found for the same value (for example, 200 of the 500 rows have the same value in the indexed column).
Usually, if the index does not contain all the column data you need, it stores a pointer to where the row connected to that index entry is physically located. In a second step, the engine then has to read that row.
So, as you can see, a search like this takes two steps. And here is where selectivity comes in:
The more results you get because of low selectivity, the more double work the engine has to do. Because of this, there are cases where even a table scan is more efficient than an index seek with very low selectivity.
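The two-step argument above can be put into numbers with a deliberately crude cost model. The sketch below is not how SQL Server's optimizer works; the cost constants (`seek_cost`, `lookup_cost`) and the page count are made-up assumptions chosen only to show the trade-off.

```python
def selectivity(values):
    """Ratio of distinct values to total rows (1.0 means fully unique)."""
    return len(set(values)) / len(values)


def cheaper_plan(matching_rows, table_pages, seek_cost=3, lookup_cost=1):
    """Hypothetical cost model measured in page reads.

    Index seek: a fixed cost to descend the index plus one row lookup
    per matching row. Table scan: read every page of the table once.
    """
    seek_total = seek_cost + matching_rows * lookup_cost
    scan_total = table_pages
    return "index seek" if seek_total < scan_total else "table scan"


# 500 rows stored on 10 pages (assumed numbers).
print(selectivity(["M", "F"] * 250))                 # low selectivity
print(selectivity(list(range(500))))                 # unique column
print(cheaper_plan(matching_rows=200, table_pages=10))
print(cheaper_plan(matching_rows=1, table_pages=10))
```

With 200 matching rows, the per-row lookups cost far more than reading the 10 table pages, so the scan wins. With a single matching row, the seek wins.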

Algorithm for rdbms, select statement

Hello, I don't know if this is the correct place to ask this question. I'm doing thesis research and I'm at the algorithm stage now. My thesis is an application that sends messages, where the contacts are queried from the DB. So my question is: what algorithm is used for searching the contacts in the DB? Linear search?

If the contacts field is indexed in your database, it will use a B-Tree search, a hash search or a FULLTEXT search (which is a combination of some simpler algorithms), depending on the type of the index and the structure of the search query.
If the contacts are not indexed or a search query structure does not allow using an index, then yes, it will be using linear search.

The index doesn't necessarily need to be the primary index; it can be an index on any field. As Quassnoi said, you can specify the data structure beneath the index. MySQL assumes B-tree by default. So the time to search for a node will be O(log n) as long as the tree is balanced, which B-trees are.
If the contact field is not indexed, the DB will scan linearly through the records until it finds a matching row. In the worst case this takes O(n) time.
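As a rough illustration of the O(n) versus O(log n) difference, the sketch below counts how many rows each strategy inspects. Binary search over a sorted list stands in for a balanced tree index here; it is a simplification, not what a database engine literally does, and all names are hypothetical.

```python
def linear_search(rows, target):
    """Unindexed lookup: inspect rows one by one. Returns (position, probes)."""
    probes = 0
    for position, value in enumerate(rows):
        probes += 1
        if value == target:
            return position, probes
    return -1, probes


def binary_search(sorted_rows, target):
    """Indexed-style lookup: halve the search range on every probe."""
    low, high = 0, len(sorted_rows) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        if sorted_rows[mid] == target:
            return mid, probes
        if sorted_rows[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, probes


# 100000 sorted contact names; look for the last one (worst case for a scan).
contacts = [f"contact{i:06d}" for i in range(100_000)]
print(linear_search(contacts, contacts[-1]))
print(binary_search(contacts, contacts[-1]))
```

The linear scan inspects all 100000 rows in this worst case, while the binary search needs at most 17 probes for a list of this size.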

What are the different types of indexes, what are the benefits of each?

What are the different types of indexes, what are the benefits of each?
I heard of covering and clustered indexes, are there more? Where would you use them?

Unique - Guarantees unique values for the column (or set of columns) included in the index
Covering - Includes all of the columns that are used in a particular query (or set of queries), allowing the database to use only the index and not actually have to look at the table data to retrieve the results
Clustered - This is the way in which the actual data is ordered on the disk, which means that if a query uses the clustered index for looking up the values, it does not have to take the additional step of looking up the actual table row for any data not included in the index.
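The covering case from the list above can be observed with SQLite through Python's built-in sqlite3 module. SQLite is used here only because it is easy to run; the table and index names are made up, and other engines report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, city TEXT)"
)
conn.execute("CREATE INDEX idx_name ON person (last_name, first_name)")


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)


# Every column the query needs is in the index, so the table is not touched.
covered = plan("SELECT first_name FROM person WHERE last_name = 'Smith'")
# city is not in the index, so each match needs an extra row lookup.
not_covered = plan("SELECT city FROM person WHERE last_name = 'Smith'")
print(covered)
print(not_covered)
```

In the first plan SQLite reports a covering index; in the second it still uses the index to find matches but has to visit the table rows to fetch city.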

OdeToCode has a good article covering the basic differences
As it says in the article:
Proper indexes are crucial for good
performance in large databases.
Sometimes you can make up for a poorly
written query with a good index, but
it can be hard to make up for poor
indexing with even the best queries.
Quite true, too... If you're just starting out with it, I'd focus on clustered and composite indexes, since they'll probably be what you use the most.

I'll add a couple of index types
BITMAP - for when you have a very low number of different possible values; very fast and doesn't take up much space
PARTITIONED - allows the index to be partitioned based on some property usually advantageous on very large database objects for storage or performance reasons.
FUNCTION/EXPRESSION indexes - used to pre-calculate some value based on the table and store it in the index, a very simple example might be an index based on lower() or a substring function.

PostgreSQL allows partial indexes, where only rows that match a predicate are indexed. For instance, you might want to index the customer table for only those records which are active. This might look something like:
create index i on customers (id, name, whatever) where is_active is true;
If you index many columns, and you have many inactive customers, this can be a big win in terms of space (the index will be stored in fewer disk pages) and thus performance. To hit the index you need to, at a minimum, specify the predicate:
select name from customers where is_active is true;
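The need to repeat the predicate can be checked with a small experiment. The sketch below uses SQLite (via Python's sqlite3 module) as a stand-in, since it also supports partial indexes; the integer is_active column and the query values are assumptions, and PostgreSQL's planner output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, is_active INTEGER)"
)
# Partial index: only rows with is_active = 1 are indexed.
conn.execute("CREATE INDEX i ON customers (name) WHERE is_active = 1")


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)


# The query repeats the index predicate, so the partial index is usable.
with_predicate = plan(
    "SELECT id FROM customers WHERE is_active = 1 AND name = 'Ann'"
)
# Without the predicate the index may miss rows, so it cannot be used.
without_predicate = plan("SELECT id FROM customers WHERE name = 'Ann'")
print(with_predicate)
print(without_predicate)
```

Only the first query can use index i; the second falls back to scanning the table, because the index does not contain the inactive rows.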

Conventional wisdom suggests that index choice should be based on cardinality. They'll say,
For a low cardinality column like GENDER, use bitmap. For a high cardinality like LAST_NAME, use b-tree.
This is not the case with Oracle, where index choice should instead be based on the type of application (OLTP vs. OLAP). DML on tables with bitmap indexes can cause serious lock contention. On the other hand, the Oracle CBO can easily combine multiple bitmap indexes together, and bitmap indexes can be used to search for nulls. As a general rule:
For an OLTP system with frequent DML and routine queries, use btree. For an OLAP system with infrequent DML and adhoc queries, use bitmap.
I'm not sure if this applies to other databases, comments are welcome. The following articles discuss the subject further:
Bitmap Index vs. B-tree Index: Which and When?
Understanding Bitmap Indexes
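The remark about combining multiple bitmap indexes can be sketched in a few lines of Python. This is a conceptual toy, not Oracle's implementation: each distinct value gets an integer used as a bitset, where bit N is set if row N has that value. All names and the sample rows are made up.

```python
def build_bitmaps(rows, column):
    """Build one bitmap (stored as an int) per distinct value of a column."""
    bitmaps = {}
    for position, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << position)
    return bitmaps


def matching_positions(bitmap):
    """List the row positions whose bit is set in the bitmap."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


rows = [
    {"gender": "F", "status": "active"},    # row 0
    {"gender": "M", "status": "active"},    # row 1
    {"gender": "F", "status": "inactive"},  # row 2
    {"gender": "F", "status": "active"},    # row 3
]
gender_index = build_bitmaps(rows, "gender")
status_index = build_bitmaps(rows, "status")

# WHERE gender = 'F' AND status = 'active' becomes a single bitwise AND.
hits = matching_positions(gender_index["F"] & status_index["active"])
print(hits)
```

Combining two low-cardinality conditions is a cheap bitwise operation, which is part of why bitmap indexes suit ad hoc analytical queries. The flip side is that one bitmap covers many rows, which is consistent with the lock contention under frequent DML mentioned above.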

Different database systems have different names for the same type of index, so be careful with this. For example, what SQL Server and Sybase call "clustered index" is called in Oracle an "index-organised table".

I suggest you search the blogs of Jason Massie (http://statisticsio.com/) and Brent Ozar (http://www.brentozar.com/) for related info. They have some posts about real-life scenarios that deal with indexes.

Oracle has various combinations of b-tree, bitmap, partitioned and non-partitioned, reverse byte, bitmap join, and domain indexes.
Here's a link to the 11gR1 documentation on the subject: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/data_acc.htm#PFGRF004

Unique
cluster
non-cluster
column store
Index with included column
index on computed column
filtered
spatial
xml
full text

SQL Server 2008 has filtered indexes, similar to PostgreSQL's partial indexes. Both allow an index to include only the rows that match specified criteria.
The syntax is identical to PostgreSQL:
create index i on Customers(name) where is_alive = cast(1 as bit);

To view the types of indexes and their meanings, visit:
https://msdn.microsoft.com/en-us/library/ms175049.aspx
