SOLR indexed item has extra word which is not available in query parameter - how to identify those cases? - apache

We have a scenario where we are trying to perform accurate name matching of Items using SOLR.
Query Parameter: Apple
SOLR Indexed Word: Apple-D
In our business case, "Apple" and "Apple-D" are totally different items and therefore SOLR shouldn't return the match.
Is there an option to achieve the same?

You need to change the fieldType used for the field. Use the string fieldType (solr.StrField) for your field.
The string fieldType makes Solr store and index the value exactly as it is.
It does not apply any analysis to the value and does not split it into tokens.
With the string type applied, Apple and Apple-D are indexed as two different terms because no tokenizing takes place. This gives you the exact match.
Once you change the fieldType, re-index your data.
You can use the Solr analysis tool to check how the field is indexed and queried.
Note: whenever you ask a question like this, share your schema.xml.
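To illustrate why a tokenized field matches here while a string field does not, here is a minimal Java sketch. It does not call Solr. The tokenize method is a hypothetical stand-in for a text analyzer that splits on non-alphanumeric characters and lowercases, which is only an assumption about how the current field type behaves.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ExactMatchSketch {
    // Hypothetical stand-in for a tokenizing text analyzer (an assumption, not Solr code):
    // split on non-alphanumeric characters and lowercase each token.
    static List<String> tokenize(String value) {
        return Arrays.stream(value.split("[^A-Za-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    // Text-like field: the query matches if all of its tokens appear among the indexed tokens.
    static boolean matchesTokenized(String indexed, String query) {
        return tokenize(indexed).containsAll(tokenize(query));
    }

    // String-like field: the whole value is a single term, so only an identical value matches.
    static boolean matchesExact(String indexed, String query) {
        return indexed.equals(query);
    }

    public static void main(String[] args) {
        System.out.println("tokenized: " + matchesTokenized("Apple-D", "Apple"));
        System.out.println("exact:     " + matchesExact("Apple-D", "Apple"));
    }
}
```

In this sketch the tokenized comparison treats Apple as a match for Apple-D, while the exact comparison does not, which is the behaviour the string fieldType is meant to give you.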

Related

Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field, and this field has only a few different variations in its text across all 10 million documents. The field is large-ish, 400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that has been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You say that you search over title and description, which means both fields should be set to indexed=true in your schema.xml. Only title is returned, so only title needs stored=true; description should be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
Another useful option you could try is the compression field option. If you need to store a field, you can use gzip compression on certain field types, such as TextField and StrField; see https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I have not tried this feature, but from the sound of it, you can prevent (nearly) duplicate documents from being indexed, or tag duplicates. Maybe its goal "Allow for both duplicate collapsing in search results as well as deduplication on adding a document." is what you are looking for?
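The general idea behind signature-based deduplication can be sketched in plain Java: hash the repeated text so that identical descriptions share one signature. This is only an illustration, not Solr's implementation. The class and method names are hypothetical, and MD5 is an assumption chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class DescriptionSignatureSketch {
    // Hex-encoded MD5 of the text (hash choice is an assumption for this sketch).
    static String signature(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Keeps one copy of each distinct description, keyed by its signature.
    static Map<String, String> distinctBySignature(Iterable<String> descriptions) {
        Map<String, String> distinct = new HashMap<>();
        for (String d : descriptions) {
            distinct.putIfAbsent(signature(d), d);
        }
        return distinct;
    }

    public static void main(String[] args) {
        Map<String, String> distinct =
                distinctBySignature(Arrays.asList("long text A", "long text A", "long text B"));
        System.out.println(distinct.size() + " distinct descriptions");
    }
}
```

With only a few description variants, each document could carry the short signature instead of the full text, though whether that fits your search requirements depends on how you query the description.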

Neo4j index for full text search

I am working with Neo4j database version 2.0. I have the following requirements:
Case 1. I want to fetch all records where name contains some string. For example, if I am searching for Neo4j, then all records having the name Neo4j Data, Neo4j Database, Neo4jDatabase etc. should be returned.
Case 2. When I fire a field-less query, records should be returned if any of a set of properties has a matching value; this could also work at the global level instead of the label level.
Case sensitivity is also a concern.
I have read several things about LIKE, indexes, full text search, legacy indexes etc. Which is the best fit for my case, or do I have to use Elasticsearch or similar?
I am using spring-data-neo4j in my application, so please provide some configuration for SDN.
Annotate your name with the @Indexed annotation:
@Indexed(indexName = "whateverIndexName", indexType = IndexType.FULLTEXT)
private String name;
Then query for it in the following way (an example for a method in an SDN repository; you can use something similar anywhere else you use Cypher):
@Query("START n=node:whateverIndexName({query}) return n")
Set<Topic> findByName(@Param("query") String query);
Neo4j uses Lucene as the backend for indexing, so the query value must be a valid Lucene query, e.g. "name:neo4j" or "name:neo4j*".
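Since the value passed as {query} has to be valid Lucene syntax, user input usually needs escaping before it is turned into something like "name:neo4j*". Below is a minimal sketch of that step. The class and method names are hypothetical, the lowercasing assumes the fulltext index lowercases its terms, and the input is assumed to be a single word. Lucene ships its own QueryParser.escape, which you may prefer if the Lucene classes are on your classpath.

```java
import java.util.Locale;

final class LuceneQuerySketch {
    // Characters with special meaning in Lucene query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    // Prefixes every special character with a backslash.
    static String escape(String term) {
        StringBuilder out = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Builds a prefix query such as name:neo4j* from a single-word user input.
    // Lowercasing is an assumption about how the index analyzes its terms.
    static String prefixQuery(String field, String userInput) {
        return field + ":" + escape(userInput.toLowerCase(Locale.ROOT)) + "*";
    }

    public static void main(String[] args) {
        System.out.println(prefixQuery("name", "Neo4j"));
    }
}
```

The resulting string would be what you pass to the repository method, e.g. findByName(LuceneQuerySketch.prefixQuery("name", input)).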
There is an article that explains the confusion around various Neo4j indexes http://nigelsmall.com/neo4j/index-confusion.
I don't think you need to use Elasticsearch; you can use the legacy indexes or the Lucene indexes to do full text searches.
Check out Michael Hunger's blog: jexp.de/blog
This post specifically: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/

solr unable to search with exact value

I am using Solr 4.1.0 and I'm facing a strange issue. If I give a value to search for a field, even be it exact or involving a wildcard, it gives me 0 search results. On the other hand if I just give the field name and a * in place of value, I get all the results.
Also, if I search in the text field, i.e where I have copied values of all my fields, it gives me correct output. text is by default, my catch-all for all fields. feature is a field which has value Butter.
So now, what is happening here is that if I try to find in the actual field with the exact value or even with starting alphabet and a *, it doesn't give me a value while if I search in the text field, which is a catch-all field, I'm able to retrieve the value. Although if I try to find in the feature field using *, it gives me complete result list correctly.
You can view the logs for text:Butter here, logs for feature:Butter here, logs for feature:B* here and logs for feature:* here
I'm facing this issue with this particular field only. Any pointers to what could be the reason behind this strange problem?
If you search without the field name, Solr is going to search in the default search field.
So make sure you are marking the fields you want to search on as default.
If you are using dismax query handler, you can add them to the qf parameter.
Also, for wildcard queries, check the Analyzers page.
On wildcard and fuzzy searches, no text analysis is performed on the search word.
Because no analysis is done at query time for wildcard searches, lowercasing and stemming are applied only at index time, not to the wildcard term.
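Here is a small Java sketch of one way this can produce zero hits. It is a simulation, not Solr code. It assumes the field's index-time analyzer lowercases terms and that the wildcard term is compared as typed; all names are hypothetical.

```java
import java.util.Locale;

final class WildcardCaseSketch {
    // Index time: assume the field's analyzer lowercases the value.
    static String indexTerm(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Query time: a trailing-* prefix query compared against the indexed term as typed,
    // with no analysis applied to the query text.
    static boolean prefixMatches(String indexedTerm, String wildcardQuery) {
        if (!wildcardQuery.endsWith("*")) {
            return indexedTerm.equals(wildcardQuery);
        }
        String prefix = wildcardQuery.substring(0, wildcardQuery.length() - 1);
        return indexedTerm.startsWith(prefix);
    }

    // Client-side workaround: lowercase the wildcard query before sending it.
    static String normalizeWildcard(String wildcardQuery) {
        return wildcardQuery.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexed = indexTerm("Butter");
        System.out.println("B* as typed:    " + prefixMatches(indexed, "B*"));
        System.out.println("B* lowercased:  " + prefixMatches(indexed, normalizeWildcard("B*")));
        System.out.println("* alone:        " + prefixMatches(indexed, "*"));
    }
}
```

Under these assumptions, the upper-case prefix misses the lowercased term while a bare * still matches everything, so lowercasing the wildcard term on the client is a cheap thing to try.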

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
User can enter query to multiple fields i.e: +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and * i.e: +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy i.e: +NAME:Jahn~0.5
Now, I want to find, which field(s) contains my search term(s). As I am using wildcard and fuzzy, I cannot just make string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.

determine which value produced a hit in SOLR multivalued field type

If I have a multiValued field type of text, and I put values [cat,dog,green,blue] in it. Is there a way to tell when I execute a query against that field for dog, that it was in the 1st element position for that multiValued field?
Assumption: client does not have any pre-knowledge of what the field type of the field being queried is. (i.e. Solr must provide the answer and the client can't post process the return doc to figure it out because it would not know how SOLR matched the query to the result).
Disclosure: I posted to solr-user list and am getting no traction so I post here now.
Currently, there's no out-of-the-box functionality provided in Solr which tells you the position of a value in a multiValue field.
Hopefully I understand your question correctly.
If you want to get field index or value there is an ugly workaround:
You could add the index directly in the value e.g. store "1; car", "2; test" and so on. Then use highlighting. When reading the returned fields simply skip the text before the semicolon.
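The prefixing workaround could look roughly like this on the client side. It is a sketch with hypothetical names. It assumes the highlighter returns the whole stored value with its markup (the <em> tags in the example are an assumption) and uses 1-based positions as in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionPrefixSketch {
    // Before indexing: prepend the 1-based position to every value, e.g. "2; dog".
    static List<String> encode(List<String> values) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            out.add((i + 1) + "; " + values.get(i));
        }
        return out;
    }

    // After searching: read the position from the text before the first semicolon.
    static int positionOf(String snippet) {
        int sep = snippet.indexOf(';');
        return Integer.parseInt(snippet.substring(0, sep).trim());
    }

    // The original value (possibly with highlight markup) is everything after the semicolon.
    static String textOf(String snippet) {
        return snippet.substring(snippet.indexOf(';') + 1).trim();
    }

    public static void main(String[] args) {
        List<String> encoded = encode(Arrays.asList("cat", "dog", "green", "blue"));
        System.out.println(encoded);
        System.out.println(positionOf("2; <em>dog</em>"));
    }
}
```

Note that the numeric prefix becomes part of the indexed text, so a query for a bare number could match it depending on the field's analysis.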
But if you want to query only one type:
You can avoid the multivalue approach and simply store each value in its own field named item_i, then query a specific one, e.g. item_1. To query against all items regardless of the type, you need to use the copyField directive in schema.xml.
The Lucene API allows for this, but I'm not sure if Solr does out of the box. In Lucene you can use the IndexReader.termPositions(Term term) method.