Show fields for a Lucene/Elasticsearch index - lucene

Is there a way to get Lucene/Elasticsearch to just show what fields have been indexed in a given index? I'm trying to figure out whether certain fields have been indexed properly as a result of configuration options, but I have no idea how to make that determination.

You can check the mappings for a specific index and type via a call to:
http://localhost:9200/index/type/_mapping
Anything indexed should have an entry there.

See also analyze API to see how text is broken into terms.
http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze.html

Related

How to exclude text indexed from PDF in solr query

I have a solr index generated from a catalog of PDF files and correspoing metadata fields pertaining to the pdf files themselves. Still, I would like to provide my users an option to exclude in the query any text indexed from within a PDF. This is so the query results would be based on the metadata fields instead and not biased by the vast text within the pdf files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field. Which means you have a lot of copyField instructions (or just one copyField * -> text), which include the PDF content field.
You can create a second destination and copyField everything but the PDF content field into that as well. This way, users can search against or another combined field.
However, remember that this parses all content according to the analysis chain of the destination field. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can use several request handlers (like 'select') and define different default parameters there. That usually makes the client code a bit easier.
You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time. That will help determine what fields are searched.
You can look at field aliases
If you have 3 index fields
pdfmeta
pdftext
Then you can create two field aliases
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
One advantage of using a field alias over qf is if your users have bookmarks like q=quicksearch:value, you can change the alias for quicksearch without affecting the user's bookmark.

How to tell what index is trying to be used for ContentSearchManager.GetIndex(SitecoreIndexableItem)

So, ContentSearchManager.GetIndex(SitecoreIndexableItem) is returning null. Im pretty sure we might be missing an index. When using the sitecore master database, everything works fine, but in web is null.
I guess the question is, is there a way to know which index is GetIndex trying to recover that is returning null.
If you haven't overriden the default Sitecore logic for getting the index, Sitecore checks all the indexes which are registered in the configuration and for each of them, it checks if the SitecoreIndexableItem passed to the
ContentSearchManager.GetIndex(SitecoreIndexableItem)
is not excluded from that index.
Then is simply returns the first matching index.
So the answer to your question is - Sitecore check all indexes if they are a match for your item.
You may want to look through your logs for an error like this:
"There is no appropriate index for {indexable.AbsolutePath} - {indexable.Id}. You have to add an index crawler that will cover this item"
This may help you find which item is not indexed at all.

Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field and this field contains data that only has a few different variations in the text across all 10 million documents. This field is large-ish -400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You are saying that you search over title and description, this means you should set these fields to indexed=true in your schema.xml. Only title is returned, this means only title needs to be set to stored=true, description should be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
Another useful option you could try is the field option compression. If you need to store a field, you can use gzip compression on certain fields, such as TextField and StrField, see: https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I did not try this feature, but from the sounds of it, you can prevent (nearly) duplicate documents to be indexed or tag duplicates. Maybe its goal "Allow for both duplicate collapsing in search results as well as deduplication on adding a document." is what you are looking for?

Why are my Lucene Document results empty?

I'm running a simple test--trying to index something and then search for it. I index a simple document, but then when a search for a string in it, I get back what looks to be an empty document (it has no fields). Lucene seems to be doing something, because if I search for a word that's not in the document, it returns 0 results.
Any reason why Lucene would reliably return a document when it finds one that matches the given query, and yet that document has nothing in it?
More details:
I'm actually running Lucandra (Lucene + Cassandra). That certainly may be a relevant detail, but not sure.
The fields are set to Field.Store/YES and Field.Index/ANALYZED
Interestingly, I'm able to get this to work just fine on my local machine, but when we put it on our main server (which is a multi-node cassandra setup), I get the behavior described above. So this seems like probably the relevant detail, but unfortunately, I see no error message to clue me in to what specifically is causing it.
Unsure if this will work with Lucandra, but you have tried opening the index using Luke? Viewing the index contents with Luke might help
It's hard to tell what the problem is since you only provide a very abstract description. However, it sounds a bit like you are not storing the field value in the index. There are different modes for indexing a field. One option determines whether the original value is stored in the index to retrieve it later:
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/document/Field.Store.html
See also the description of the enclosing class Field
Read: http://anismiles.wordpress.com/2010/05/27/lucandra-an-inside-story/

determine which value produced a hit in SOLR multivalued field type

If I have a multiValued field type of text, and I put values [cat,dog,green,blue] in it. Is there a way to tell when I execute a query against that field for dog, that it was in the 1st element position for that multiValued field?
Assumption: client does not have any pre-knowledge of what the field type of the field being queried is. (i.e. Solr must provide the answer and the client can't post process the return doc to figure it out because it would not know how SOLR matched the query to the result).
Disclosure: I posted to solr-user list and am getting no traction so I post here now.
Currently, there's no out-of-the-box functionality provided in Solr which tells you the position of a value in a multiValue field.
Hopefully I understand your question correctly.
If you want to get field index or value there is an ugly workaround:
You could add the index directly in the value e.g. store "1; car", "2; test" and so on. Then use highlighting. When reading the returned fields simply skip the text before the semicolon.
But if you want to query only one type:
You can avoid the multivalue approach and simply store it as item_i and query via item_1. To query against all items regardless the type you need to use the copyField directive in the schema.xml
The Lucene API allows for this, but I'm not sure if Solr does out of the box. In Lucene you can use the IndexReader.termPositions(Term term) method.