Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc.) that should be searched together (i.e., a single search would return results across the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.

If you want to search all types of documents with one search, it's better to keep all types in one index. In the index you can define additional field types and choose which ones to tokenize or store term vectors for. It takes time to point an IndexSearcher at each directory that contains an index.
If you want to search the types separately, it's better to index each type in its own index.
A single index is more structured than multiple indexes. On the other hand, multiple indexes let you balance the load.

Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type; it'll let you filter if needed, as well as tell apart the results you receive back.
(And maybe in the vein of your questions... using separate indexes will allow each corpus to have its own relevancy scoring; I don't know if excessively repeated terms in one corpus would throw off the relevancy of documents in the others?)

You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule your index architecture is similar to how you would design databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if it's technically feasible).
As @llama pointed out, creating a single uber-index affects relevance scores, raises security/access issues, and causes a whole new set of headaches, among other things.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.

Agreed that each kind of data should have its own index, so that all the index options can be set accordingly - analyzers for the fields, what is stored for the fields, term vectors and similar - and also so that IndexReaders/Writers can be reopened/committed with different timing for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make that easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager

Related

SQL DBMS - Spatial Index with Arbitrary Types (non-geometric)

First off, at the moment I'm not looking for alternate suggestions, just a yes or a no, and if it's a yes, what the name is.
Are there any SQL DBMSs that allow you to create "spatial" indexes using arbitrary (i.e. non-geometric) data types like integers, dates, etc.? While spatial indexes are most commonly used for location data, they can also be used to properly index queries where you need to search within two or more ranges.
For example (and this is just a made-up example), suppose you had a database of customer receipts, and you wanted to find all transactions between $10-$1000 which took place between 2000-01-01 and 2005-03-01. The fact that you're searching within multiple ranges means that regular b-tree indexes cannot be used to efficiently perform this lookup, at least not in a way that's scalable.
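For concreteness, that query would look something like this in SQL (the receipts table and its column names are invented for illustration):

SELECT *
FROM receipts
WHERE amount       BETWEEN 10 AND 1000
  AND purchased_at BETWEEN DATE '2000-01-01' AND DATE '2005-03-01';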
Now yes, for the specific example I provided, and probably any other case, you could likely come up with some tricks to do it efficiently using the b-tree indexes, or at the very least narrow it down; I'm well aware, but again, not looking for alternate suggestions, just a no, or a yes and the name.
Appreciate any help you all can provide
EDIT: Just to clarify; I'm using the term spatial index as this is the most common term for it as well as the most commonly implemented use case. I am however referring to any index which uses quadtrees, r-trees, etc to achieve the same or similar effect.

Search on multiple indexes in RediSearch

I'm using RediSearch for my project. There are different indexes in the project, such as job_idx, company_idx, article_idx and event_idx (the article_idx and event_idx structures are very similar). The indexes are used on different pages, e.g. job_idx is used by the Job page search, company_idx by the Company page search.
The question is: on the homepage, the search should return results from every index, so should I call search 4 times? I think there should be a better solution for my case.
The FT.SEARCH command allows you to pass exactly one index as a parameter. So if you already have 4 indexes, then you need to call the command 4 times.
It's typically simplest to have one index per entity, BUT in the end it's a question of how you design your physical data model to best support your queries. This can range from entirely separate indexes up to one single index for everything (e.g., an 'all_fields' index with a type field). The best implementation might be somewhere in the middle (very much like 'normalized vs. de-normalized schema' in relational database systems).
A potential solution for you could be to create an additional index (e.g., called combined_homepage) which indexes on specific fields that are needed for the search on the homepage. This index would then enable you to do a single search.
However, this additional index would indeed need additional space. So, assuming you don't want to rethink the physical data model from scratch, you either invest in space (memory) to enable more efficient access, or spend more on compute and network (for combining the results of the 4 queries on the client side).
Hope this helps, even if my answer comes basically down to 'it depends' :-).

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database; each row has a name column and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked up full text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
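A minimal sketch of what that looks like in Postgres (the movies table and its name column are assumed here):

CREATE INDEX movies_name_fts_idx
    ON movies USING gin (to_tsvector('simple', name));

SELECT *
FROM movies
WHERE to_tsvector('simple', name) @@ to_tsquery('simple', 'Dja:*');

The ':*' suffix makes the term a prefix match, so 'Dja:*' matches "Django"; note that the expression in the index and in the query has to match for the index to be used.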
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
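For what it's worth, Postgres also ships a contrib extension, pg_trgm, that provides trigram indexing. A rough sketch, again with assumed table and column names:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX movies_name_trgm_idx ON movies USING gin (name gin_trgm_ops);

-- the trigram index can speed up infix searches such as:
SELECT *
FROM movies
WHERE name ILIKE '%ncha%';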
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, if you have millions of rows it will be slow because there are no significant efficiencies to be had from indexes.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%';) and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers on interpreting strings, so the user may not suffer.
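A sketch of that two-pass idea (names are invented; the text_pattern_ops operator class lets a btree index serve anchored LIKE patterns when the database doesn't use the C collation):

CREATE INDEX movies_name_prefix_idx ON movies (name text_pattern_ops);

-- pass 1: fast, index-backed prefix hits
SELECT * FROM movies WHERE name LIKE 'Unchai%';

-- pass 2: slower middle-of-the-string hits, shown after the first batch
SELECT * FROM movies WHERE name LIKE '%Unchai%' AND name NOT LIKE 'Unchai%';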
This question is very much related to the autocomplete in forms. You will find several threads for that.
Basically, you will need a special kind of index, a space-partitioning tree. Postgres has an index type called SP-GiST which supports such index structures. You will find a bunch of useful stuff if you google for that.
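A minimal sketch (table and column names are assumed; the ^@ "starts with" operator requires a reasonably recent Postgres):

CREATE INDEX movies_name_spgist_idx ON movies USING spgist (name);

SELECT *
FROM movies
WHERE name ^@ 'Djan';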

PostgreSQL performance comparison between arrays and joins

What are the performance implications in postgres of using an array to store values as compared to creating another table to store the values with a has-many relationship?
I have one table that needs to be able to store anywhere from about 1-100 different string values in either an array column or a separate table. These values will need to be frequently searched for exact matches, so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
These values will need to be frequently searched
Searched how? This is crucial.
Prefix pattern match only? Infix/suffix pattern matches too? Fuzzy string search / similarity matching? Stemming and normalization to root words, de-pluralization? Synonym search? Is the data character sequences or natural-language text? One language, or multiple different languages?
Hand-waving around "searched" makes any answer that ignores that part pretty much invalid.
so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
Impossible to be strictly sure without proper info on the data you're searching.
Searching text fields is much more flexible, giving you many options you don't have with an array search. It also generally reduces the amount of data that must be read.
In general, I strongly second Clodaldo: Design it right. Optimize later, if you need to.
According to the official PostgreSQL reference documentation (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-SEARCHING), searching for specific elements in a table is expected to perform better than in an array:
"Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements."
The reason for the worse search performance on array elements than on tables could be that arrays are internally stored as strings, as stated here (https://www.postgresql.org/message-id/op.swbsduk5v14azh%40oren-mazors-computer.local):
"the array is actually stored as a string by postgres. a string that happens to have lots of brackets in it."
However, I could not corroborate this statement with any official PostgreSQL documentation, and I also have no evidence that handling well-structured strings is necessarily less performant than handling tables.
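For illustration, the two designs might look like this in SQL (all table, column and index names are invented):

-- (a) array column; a GIN index supports the containment operator @>
CREATE TABLE items (
    id   bigint PRIMARY KEY,
    tags text[]
);
CREATE INDEX items_tags_idx ON items USING gin (tags);

SELECT * FROM items WHERE tags @> ARRAY['some-value'];

-- (b) separate table with a plain btree index on the value
CREATE TABLE item_tags (
    item_id bigint REFERENCES items (id),
    tag     text
);
CREATE INDEX item_tags_tag_idx ON item_tags (tag);

SELECT i.*
FROM items i
JOIN item_tags t ON t.item_id = i.id
WHERE t.tag = 'some-value';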

Should I use LIKE to query tables with 4 million rows?

I am designing a search form, and I am wondering whether I should give the possibility to search using LIKE '%search_string%' on a table that is going to have up to 4 million rows.
In general, I would say no. This is a good candidate for full-text indexing. The leading % in your search string is going to eliminate the possibility of using any indexes.
There may be cases where the wait is acceptable and/or you do not want the additional administrative overhead of maintaining full-text indexes, in which case you might opt for LIKE.
No, you should really only use LIKE '%...%' when your tables are relatively small or you don't care about the performance of your own or other people's queries on your database.
There are other ways to achieve this capability which scale much better, full text indexing or, if that's unavailable or not flexible enough, using insert/update triggers to extract non-noise words for querying later.
I mention that last possibility since you may not want a full text index. In other words, do you really care about words like "is", "or" and "but" (these are the noise-words I was alluding to before).
You can separate the field into words and place the relevant ones in another table and use blindingly fast queries on that table to find the actual rows.
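A rough sketch of that word-table idea (all names are invented; the insert/update triggers that populate it are omitted):

CREATE TABLE document_words (
    document_id int          NOT NULL,
    word        varchar(100) NOT NULL
);
CREATE INDEX document_words_word_idx ON document_words (word);

-- the search then becomes an indexed equality (or prefix) lookup
SELECT d.*
FROM documents d
JOIN document_words w ON w.document_id = d.id
WHERE w.word = 'search_string';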
A search with LIKE '%search_string%' is very slow even on indexed columns; in the worst case it does a full table scan.
If a search with LIKE 'search_string%' is enough, I'd just provide that.
It depends - without knowing how responsive the search has to be, it could either be fine or a complete no-go. You'll only really know if you profile your search with likely data patterns and search criteria.
And as RedFilter points out, you might want to consider full text search if plain search isn't performing well.