Change dynamically elasticsearch synonyms - lucene

Is it possible to store the synonyms for elasticsearch in the index? Or is it possible to get the synonym list from a database like couchdb?
I'd like to add synonyms dynamically to elasticsearch via the REST-API.

There are two approaches when working with synonyms:
expanding them at indexing time,
expanding them at query time.
Expanding synonyms at query time is not recommended since it raises issues with:
scoring, since synonyms have different document frequencies,
multi-token synonyms, since the query parser splits on whitespace.
More details on this at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory (on the Solr wiki, but relevant for Elasticsearch too).
So the recommended approach is to expand synonyms at indexing time. In your case, if the synonym list is managed dynamically, it means that you should re-index every document which contains a term whose synonym list has been updated so that scoring remains consistent between documents analyzed pre and post update. I'm not saying that it is not possible but it requires some work and will probably raise performance issues with synonyms which have a high frequency in your index.

There are a few new solutions now, in addition to those proposed in other answers a few years ago. The two main approaches are implemented as plugins:
The file-watcher-synonym filter is a plugin that reloads synonyms periodically, at an interval in seconds defined by the user.
The refresh-token-plugin allows a real-time update of the index. However, this plugin apparently has some problems, which stem from the fact that Elasticsearch is unable to distinguish analyzers used only at search time from those used at index time.
A good discussion of this subject can be found in the Elasticsearch GitHub ticket system: https://github.com/brusic/refresh-token-filters

It isn't too painful in Elasticsearch to update the synonym list. It can be done by closing and reopening the index. You could have it driven from anywhere, but you need some infrastructure of your own. It'd work like this:
You want an alias pointing at your current index
Sync down a new synonym file to your servers
Create a new index with a custom analyzer that uses the new synonym file
Rebuild the content from current index to new index
Repoint index alias from current to new index
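A rough sketch of those steps as REST calls, written in Python so the request bodies can be inspected before anything is sent. The index names, alias, filter and analyzer names, and the synonyms path are made up for illustration, field mappings are left out, and it assumes an Elasticsearch version that has the `_reindex` API; on older versions the rebuild step would be your own copy job.

```python
def synonym_rollover_requests(current_index, new_index, alias, synonyms_path):
    """Return (method, path, body) tuples for one synonym update cycle."""
    # New index whose analyzer reads the freshly synced synonym file.
    # "my_synonyms" and "with_synonyms" are hypothetical names.
    create_body = {
        "settings": {
            "analysis": {
                "filter": {
                    "my_synonyms": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    "with_synonyms": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        }
    }
    # Copy the documents so they are re-analyzed with the new synonyms.
    reindex_body = {
        "source": {"index": current_index},
        "dest": {"index": new_index},
    }
    # Both alias actions are sent in one request, so readers of the
    # alias never see a moment with no index behind it.
    alias_body = {
        "actions": [
            {"remove": {"index": current_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    return [
        ("PUT", f"/{new_index}", create_body),
        ("POST", "/_reindex", reindex_body),
        ("POST", "/_aliases", alias_body),
    ]
```

Each tuple can be handed to whatever HTTP client you already use; the old index can be deleted once the alias points at the new one.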

In 2021, just expand synonyms at query time using a specific search analyzer and use the Reload analyzer API:
POST /my-index/_reload_search_analyzers
The synonym graph token filter must have updateable set to true:
"my-synonyms": {
"type": "synonym_graph",
"synonyms_path": "my-synonyms.txt",
"updateable": true
}
Besides, you should probably expand synonyms at query time anyway. Why?
Chances are that you have too much data to reindex every night or so.
Elasticsearch does not allow the Synonym Graph Filter for an index analyzer, only the deprecated Synonym Filter, which does not handle multi-word synonyms correctly.
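To show where that filter fits, here is a minimal sketch of the index body and the reload call, built as Python dicts. The analyzer name `my-search-analyzer`, the field name, and the standard tokenizer are assumptions; the point is only that the updateable filter is referenced from `search_analyzer` and not from the index-time `analyzer`.

```python
def search_time_synonym_index_body(field, synonyms_path):
    """Body for PUT /<index>: synonyms are applied at query time only."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my-synonyms": {
                        "type": "synonym_graph",
                        "synonyms_path": synonyms_path,
                        "updateable": True,
                    }
                },
                "analyzer": {
                    # Hypothetical name; used only when searching.
                    "my-search-analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my-synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "standard",
                    "search_analyzer": "my-search-analyzer",
                }
            }
        },
    }


def reload_request(index):
    """Call to issue after the synonyms file has changed on the nodes."""
    return ("POST", f"/{index}/_reload_search_analyzers")
```

After editing the synonyms file on every node, sending the request from `reload_request("my-index")` picks up the change without reindexing.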

Search on multiple index in RediSearch

I'm using RediSearch for my project. There are different indexes in the project, such as job_idx, company_idx, article_idx and event_idx (the article_idx and event_idx structures are very similar). Each index is used on a different page, e.g. job_idx is used by the Job page search and company_idx by the Company page search.
The question is: on the homepage, the search engine should return results from every index, so should I call search 4 times? I think there should be a better solution for my case.
The FT.SEARCH command allows you to pass exactly one index as a parameter. So if you are already having 4 indexes, then you need to call the command 4 times.
It's typically simplest to have one index per entity, BUT in the end it's a question of how you design your physical data model to best support your queries. This can range from entirely separate indexes up to one single index for everything (e.g., an 'all_fields' index with a type field). The best implementation might be somewhere in the middle (very similar to 'normalized vs. de-normalized database schema' in relational database systems).
A potential solution for you could be to create an additional index (e.g., called combined_homepage) which indexes the specific fields that are needed for the search on the homepage. This index would then enable you to do a single search.
However, this additional index would indeed need additional space. So, assuming you don't want to rethink the physical data model from scratch, you either invest in space (memory) to enable more efficient access, or spend more on compute and network (for combining the results of the 4 queries on the client side).
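If you stay with 4 separate indexes, the client-side combination could look roughly like the sketch below. `search_fn` is a hypothetical callable that you would supply yourself, for example a thin wrapper around one `FT.SEARCH <index> <query> LIMIT 0 <limit>` call in your Redis client; nothing here talks to Redis directly.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(search_fn, index_names, query, limit=5):
    """Run the same query against every index and group hits by index.

    search_fn(index_name, query, limit) is assumed to return a list of
    hits for a single index.
    """
    # One worker per index so the round trips overlap instead of
    # running one after another.
    workers = max(1, len(index_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(search_fn, name, query, limit)
            for name in index_names
        }
        return {name: future.result() for name, future in futures.items()}
```

The homepage can then render one result section per key, e.g. `search_all(my_search, ["job_idx", "company_idx", "article_idx", "event_idx"], "python")`.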
Hope this helps, even if my answer comes basically down to 'it depends' :-).

Can I mark an elastic search index as incomplete? Can I retrieve the list of "complete" indices?

I want to populate an index but make it searchable only after I'm done. Is there a standard way of doing that with elastic search? I think I can set "index.blocks.read": true but I'd like a way to be able to ask elastic for a list of the searchable indices and I don't know how to do that with that setting. Also closing/opening an index feels a bit cumbersome.
A solution I found is to add a document to each index defining that index's status. Querying for the list of indices is a bit annoying, though, since querying and paginating a long list of 2,000 index status documents is problematic. Scroll-scan is a solution because it gives me all the results in one go (because every shard has at most 1 index status document). Still, that feels like I'm using the wrong tool for the job (i.e. a scroll-scan op that always does exactly one scroll).
I don't want one document that references all the indices because then I'd have to garbage collect it manually alongside garbage collecting indices. But maybe that's the best tradeoff...
Is there a standard practice that I'm not aware of?
How about using aliases? Instead of querying an index directly, your application could query an alias (e.g. live) instead. As long as your index is not ready (i.e. still being populated), you don't assign the live alias to it and hence the index won't be searchable.
Basically, the process goes like this:
Create the index with its settings and mappings
Populate it
When done, assign the live alias to it and send your queries against it
Later when you need to index new data, you create another index
You populate that new index
When done, you switch the aliases, i.e. remove the live alias from the previous searchable index and assign the live alias to the new searchable index
Here is a simple example that demonstrates this.
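As a rough sketch of the alias switch in Python: `GET /_alias/live` returns a JSON object keyed by the names of the indices that currently carry the alias, which also answers the "list of complete indices" part of the question. The helper names and index names below are made up, and the functions only build and read JSON bodies; sending them is left to your HTTP client.

```python
def searchable_indices(alias_response):
    """Index names taken from the JSON body of GET /_alias/<alias>."""
    return sorted(alias_response)


def switch_alias_body(alias, alias_response, new_index):
    """Body for POST /_aliases that moves the alias to new_index."""
    # Remove the alias from every index that currently has it ...
    actions = [
        {"remove": {"index": name, "alias": alias}}
        for name in sorted(alias_response)
    ]
    # ... and add it to the freshly populated index in the same request.
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}
```

For the very first index there is nothing to remove, so passing an empty dict as `alias_response` yields a body with only the `add` action.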

Spring Data Neo4j search and index

I'm using:
neo4j 2.0.1
spring data neo4j 3.0.1.RELEASE
I have a node Person with the property name, and I would like to search on that property with Lucene syntax. I'm using the findByNameLike method in my repository, and that works perfectly for queries like value* or *value or * etc.
But I need a query like {A* TO D*}. I found a deprecated method, findAllByQuery("name", query), and with that method I can achieve what I need.
I was wondering which non-deprecated method understands this query syntax.
I also noticed that if I create nodes from Cypher, the nodes are not available in my search.
With SDN I think the generated nodes are automatically added to the index, but I don't know how to check that or what the index name is. I must generate the nodes from Cypher to have some base data in my system. Should I add some special property in my Cypher query?
First of all, make sure you understand the differences between the legacy (deprecated) Lucene indexes and the newer Schema indexes.
"I was wondering which non-deprecated method understands this query syntax."
You'll have to use one of the methods of SchemaIndexRepository, which the GraphRepository interface extends for convenience. Bear in mind that e.g. wildcard searches are not yet implemented on schema indexes. If you want to use wildcard searches, you have two options: continue to use the Lucene indexes (your best option for now), or use a regex query in a custom repository method. For example:
MATCH (p:Person) WHERE p.name =~ ".*test.*" RETURN p
"I also noticed that if I create nodes from Cypher, the nodes are not available in my search. With SDN I think the generated nodes are automatically added to the index, but I don't know how to check that or what the index name is. I must generate the nodes from Cypher to have some base data in my system. Should I add some special property in my Cypher query?"
If you're using Lucene indexes, new entries will not be added to the index. AFAIK, you can only do that programmatically. Schema indexes can be created as follows:
CREATE INDEX ON :Person(name)
New entries with the name property will be automatically added to the index. Again, wildcard searches will not yet use these indexes.

Lucene index a large many-to-many relationship

I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
and a many-to-many relationship between them (about 100K components per assembly), thus 10G relationships in total.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields. Not positive if Lucene supports them natively or not. It's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
To find components in assembly 1, 2 or 3.
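A small sketch of that idea in Python, assuming the multi-valued field is called `assemblyIds` as above and that assembly ids are integers. The document layout and helper names are hypothetical; the query string uses explicit OR between the ids, as the classic Lucene/Solr query syntax expects.

```python
def component_document(component_id, assembly_ids):
    """One document per component; assemblyIds holds several values."""
    return {
        "id": str(component_id),
        "assemblyIds": [str(int(a)) for a in assembly_ids],
    }


def assembly_query(assembly_ids, field="assemblyIds"):
    """Query string matching components in any of the given assemblies."""
    # int() rejects anything that is not a plain number, so no query
    # syntax can sneak in through an id.
    ids = [str(int(a)) for a in assembly_ids]
    if not ids:
        raise ValueError("at least one assembly id is required")
    clause = " OR ".join(ids)
    return f"{field}:({clause})"
```

For example, `assembly_query([1, 2, 3])` produces the query shown above.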
To be brief, you have to process the data and index it before you can search. There is no way to just "plug in" Lucene to some data or database; instead you have to plug the data itself into Lucene (process, parse, analyze, index, and query it).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene, so I don't think it'd be a problem for you to index.
You can add multiple field instances to the index with different values ("components") in a document having an "assembly" field as well.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, you can search for the "#ManyToMany" keyword to find the relevant section and get some idea.

RavenDB Index Prefix

Since I don't yet have the ability to promote a temp index in Raven Studio (using build 573), I created two indexes manually. It seems to have worked well, but I have a question about the prefixes on each index: Temp, Auto, Raven. Is there anything special about those keywords? When I create my own index, should I use a prefix like that? For now, when I created my index, I used the index name from the temp index and replaced the word Temp with Manual.
Is that an acceptable approach? Should I be using a certain prefix?
Bob,
The names are just names, they are there for humans, not for RavenDB.
Indexes starting with Raven/ are reserved, and may be overwritten by the system at some point.
Indexes starting with Auto/ or Temp/ may be generated by the system, and may overwrite an existing index.
I generally use the collection/entity name as a prefix, since it helps me see right away which entity the index is primarily based on. If I had an index for getting the latest list of movies, I would name it Movie/GetLatestIndex.