Apache Solr: Conditional block - indexing

I am reading columns from HBase and indexing them in Solr using a Morphlines file. Some field values will be in either English or German. Is there a way to specify the type of the field as "text_english_german" and, inside the definition of "text_english_german", do a conditional check to see whether the value is English or German and use the language-specific stemmer filter factory for indexing and querying the data?
Thanks,
Kishore

With a slightly different approach, you could define two fields:
text_en
text_de
Each of them would have a language-specific text analysis configured. Then you can use the language auto-detection UpdateRequestProcessor [1]. There are a lot of parameters with which you can tune the behaviour of that component; see the reference guide [2].
[1] https://wiki.apache.org/solr/LanguageDetection
[2] https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing
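As an illustration, here is a minimal solrconfig.xml sketch of such a chain using the bundled LangDetect implementation; the incoming field name "text" and the chain name are assumptions, the langid.* parameters are the ones documented in [2]:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- detect the language of the incoming "text" field (assumed field name) -->
    <str name="langid.fl">text</str>
    <!-- write the detected language code into this field -->
    <str name="langid.langField">language_s</str>
    <!-- only English and German are acceptable outcomes -->
    <str name="langid.whitelist">en,de</str>
    <str name="langid.fallback">en</str>
    <!-- route the content to text_en or text_de based on the detected language -->
    <str name="langid.map">true</str>
    <str name="langid.map.fl">text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With langid.map enabled, a document detected as German is indexed into text_de, whose analyzer can use a German stemmer filter factory, while English content goes to text_en with an English stemmer. The chain then has to be referenced via the update.chain parameter of your update handler or request.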

What does the Liferay documentation mean by "without using the indexer"

In the Liferay documentation, many *LocalServiceUtil classes have search methods with the following documentation:
Returns an ordered range of all the [...] matching the parameters without using the indexer, including keyword parameters for [...].
What does the "without using the indexer" part of the sentence mean?
In particular, does it mean that it does not use any database indexes? Does it mean that for instance JournalArticleLocalServiceUtil.search can be expected to run much slower than the equivalent JournalArticleLocalServiceUtil.getArticles? Or is it a different meaning?
Or does this indexer refer to the indexes in the result set in the same method's documentation, maybe?
The indexer refers to search-engine indexers such as Lucene, Solr, or Elasticsearch (or similar) implementations.
The search and getArticles operations will query the database. If you do a keyword search, your database might not use a (DB) index, because content and title are not part of an index by default. Therefore, when there is a larger number of articles, a keyword query against the search engine is likely to give a better response time.

Is there a way to properly experiment with Solr field-types?

I'm working with Solr for a basic search engine, and I've created a couple different fieldTypes that include various filters and tokenizers in their analyzer chains.
However, I'm finding it very difficult to assess how these components of the chain interact, and when I query in the Solr Admin I consistently get different results than I expect, with no clue as to why.
Is there a way to see what a phrase like education:"x university" is being transformed into when I type it in the q section of the Admin?
Also, when the phrase goes through the chain can it be transformed into multiple things that are all searched or is it just a single modified phrase?
Thanks for any help!
Use the Analysis screen in the Solr Admin UI to check how each field and its field type process tokens, both while querying and while indexing.
Analyse Fieldname / FieldType: from the drop-down, select the field or field type that you want to analyse and click "Analyse Values". It shows, for example, which tokenizer is used, which filter classes are applied to the token, and how the token is transformed after passing through each filter class.
If "Verbose Output" is checked, it shows more details about each filter class used for the selected field/type.
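For orientation, this is the kind of analyzer chain the screen walks through; a minimal schema.xml sketch (the field-type name is an assumption):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- splits on whitespace and punctuation, so "x university" becomes two tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

As to your second question: each stage can emit zero, one, or several tokens (a synonym filter, for instance, can emit multiple tokens at the same position), and the Analysis screen shows the token stream after every stage.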

SOLR indexed item has extra word which is not available in query parameter - how to identify those cases?

We have a scenario where we are trying to perform accurate name matching of Items using SOLR.
Query Parameter: Apple
SOLR Indexed Word: Apple-D
In our business case, "Apple" and "Apple-D" are totally different items and therefore SOLR shouldn't return the match.
Is there an option to achieve the same?
You need to change the fieldType used for the field: use the string fieldType.
The string fieldType makes sure the value is stored by Solr exactly as it is. No analysis is applied to it, and no tokens are created from it.
With the string type applied, "Apple" and "Apple-D" are stored/indexed as different tokens, since there is no tokenizing at all. This gives you the exact match.
Once you change the fieldType, re-index your data.
You can use the Solr Analysis tool to check how values are indexed and queried.
Note: whenever you ask a question about this, share your schema.xml.
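A minimal schema.xml sketch of the idea (the field name item_name is an assumption):

<!-- solr.StrField indexes the raw value as a single, unanalysed token -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="item_name" type="string" indexed="true" stored="true"/>

With this in place, the query item_name:Apple matches only documents whose value is exactly Apple; Apple-D is a different token and is not returned.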

Neo4j index for full text search

I am working with the Neo4j database, version 2.0. I have the following requirements:
Case 1. I want to fetch all records where the name contains some string; for example, if I search for Neo4j then all records with names like Neo4j Data, Neo4j Database, Neo4jDatabase, etc. should be returned.
Case 2. I want to fire a field-less query: if any of a set of properties has a matching value, those records should be returned. It may also work at the global level instead of the label level.
Case sensitivity is also a concern.
I have read multiple things about LIKE, indexes, full-text search, legacy indexes, etc., so what will be the best fit for my case, or do I have to use Elasticsearch or the like?
I am using spring-data-neo4j in my application, so please provide some configuration for SDN.
Annotate your name field with the @Indexed annotation:
@Indexed(indexName = "whateverIndexName", indexType = IndexType.FULLTEXT)
private String name;
Then query for it in the following way (the example is a method in an SDN repository; you can use something similar anywhere else you use Cypher):
@Query("START n=node:whateverIndexName({query}) RETURN n")
Set<Topic> findByName(@Param("query") String query);
Neo4j uses Lucene as the backend for indexing, so the query value must be a valid Lucene query, e.g. "name:neo4j" or "name:neo4j*".
There is an article that explains the confusion around various Neo4j indexes http://nigelsmall.com/neo4j/index-confusion.
I don't think you need to be using Elasticsearch; you can use the legacy (Lucene-backed) indexes to do full-text searches.
Check out Michael Hunger's blog: jexp.de/blog
This post specifically: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/

Fulltext Solr statistical search

Suppose I have a number of documents indexed with Solr 4.0. Each has two fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Could someone advise what kind of analyzers/parsers I should use, and how to build a statistical query to get a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the TermsComponent and the StatsComponent.
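A sketch of how the TermsComponent could be wired up in solrconfig.xml and queried; the handler name /terms is an assumption, mirroring the stock example configuration:

<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <!-- enable the terms component for every request to this handler -->
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>

A request such as /terms?terms.fl=DATA&terms.sort=count&terms.limit=20 then returns the 20 terms with the highest document frequency in the DATA field. Note that these counts are document frequencies, not total occurrence counts.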
Besides the answers mentioned here, you can use the "HighFreqTerms" class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command-line application which lets you see the top terms for any field, either by document frequency or by total term frequency (the -t option).
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For the ID field, use an analyzer based on the keyword tokenizer; it will take the entire content of the field as a single token.
For the DATA field, use a language-specific analyzer. Notice that there's a possibility to auto-detect the language of the text (patch).
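A schema.xml sketch of those two fields, reusing the ID and DATA names from the question; the field-type names and the English analyzer chain are assumptions:

<fieldType name="id_keyword" class="solr.TextField">
  <analyzer>
    <!-- emits the whole field value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- optional: stemming folds variants like "university"/"universities" together -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="ID" type="id_keyword" indexed="true" stored="true"/>
<field name="DATA" type="text_en" indexed="true" stored="true"/>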
I'm not sure if it's possible to find the most frequent words with Solr itself, but if you can use Lucene directly, pay attention to this question. My own suggestion is to use the HighFreqTerms class from the Luke project.