Solr field name rules?

Sorry for the newbie question; I'm new to Solr. In managed-schema, I see many fields with identical types but different names. How does Solr decide which field to store the tokens in, given that the types are the same and only the names differ? For instance:
<field name="content_type" type="text_general" />
<field name="content_type_hint" type="text_general" />
<field name="blitz" type="text_general" />
They all have the same type (and so the same analyzer). How does Solr store different content in each of these text_general fields? Does it compare the field names in the schema with the names in the incoming content and, if none match, move on to dynamic fields? I searched the web and nobody seems to explain in detail whether the name plays a part in indexing.

So the name and the type are two different things.
<field name="content_type" type="text_general" />
In the above case the name of the field is "content_type", which is what you use to search it.
E.g. if you want to find a document with content_type="xml", you would query something like this:
q=content_type:xml
The type, however, defines the analysis that is applied to the field when documents are indexed or queries are sent to the index.
So somewhere in your schema you will have defined the field type text_general, something like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
You can read more about it here
https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html
Solr won't store everything under the type. The type only tells Solr which analysis to run on the field while indexing or querying.
Each field's terms are indexed separately, under that field's name.
Edit:
I think you are confused about how data is indexed. Let's take an example.
Let's say I have a document like this:
{
"content_type" : "text/html",
"content_type_hint" : "some_hint",
"blitz" : "some_text"
}
So when you index the document, you tell Solr which value goes into which field.
In this case you are saying that the field content_type has the value "text/html" and blitz has the value "some_text".
Solr then runs the analysis defined by each field's type and puts the resulting terms into the index for that field.
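
To make the name/type split concrete, here is a minimal Python sketch of the idea. It is not Solr's actual implementation: the analyze_text_general function is a rough stand-in for the real text_general analyzer, and the index layout (postings keyed by field name and term) is an assumption chosen for illustration.

```python
import re
from collections import defaultdict


def analyze_text_general(value):
    # Rough stand-in for the text_general analyzer (assumption):
    # lowercase, then split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


# Schema: field name -> analyzer of its type. All three fields share one type.
SCHEMA = {
    "content_type": analyze_text_general,
    "content_type_hint": analyze_text_general,
    "blitz": analyze_text_general,
}

# Postings are keyed by (field name, term), so the same term in two
# different fields never gets mixed up.
index = defaultdict(set)


def index_document(doc_id, doc):
    for field_name, value in doc.items():
        analyzer = SCHEMA[field_name]  # the type decides the analysis
        for term in analyzer(value):
            index[(field_name, term)].add(doc_id)  # the name decides where it goes


def search(field_name, term):
    return index.get((field_name, term), set())


index_document(1, {
    "content_type": "text/html",
    "content_type_hint": "some_hint",
    "blitz": "some_text",
})

print(search("content_type", "html"))  # document 1 matches
print(search("blitz", "html"))         # no match: "html" was indexed under content_type only
```

The document itself names the field for every value, so there is nothing for Solr to guess; the shared type only means the same analysis is applied to each of them.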

Related

Solr : is it necessary reindexing of a collection after add of a new dynamic field in schema.xml?

I have a problem. I am constrained to Solr 4.10.3, and I already have a collection with a lot of documents that uses a schema.xml with static and dynamic fields.
Unfortunately, schema.xml already contains dynamic fields for all types (string, text, int, etc.) except for the type "date".
Now I need the collection to also have a dynamic field of type "date".
Is it enough to add the following item to schema.xml (the type "date" is already defined there)?
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
So (important for me), after adding this dynamic field, is it only necessary to run the two zkcli.sh commands (upconfig and linkconfig)?
Is it also necessary to reindex the collection afterwards? (I hope not: since I only added a dynamic field to the schema and not a static field, maybe reindexing is not necessary.)
If I do need to reindex the collection, how can I do it?
Thanks for any help.
Regards,
Fabrizio

Whether reindexing is required does not depend on whether the field is dynamic or static. It depends only on whether you want to change data that has already been indexed for the field, or add data that isn't in existing documents but is present in your original data source.
As long as the content is only meant to be present in any document being indexed after you've updated your schema, adding a dynamic or static field does not require reindexing.
If the field has already been indexed under a different type, clearing out the index and reindexing is a necessity (although you might get away with doing atomic updates if all existing fields are set as stored - but I really recommend doing a full reindex in that case anyway, since you don't want your index to be in some sort of limbo while updates are being performed).
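
As a rough illustration of why adding the *_dt pattern leaves existing documents alone, here is a small Python sketch of how a field name could be resolved against a schema. The field lists are made up, and the lookup order (explicit fields first, then dynamic patterns in list order) is a simplification of what Solr does, so treat it as an assumption rather than Solr's exact algorithm.

```python
from fnmatch import fnmatchcase

# Hypothetical schema contents, for illustration only.
EXPLICIT_FIELDS = {"id": "string", "title": "text"}
DYNAMIC_FIELDS = [("*_s", "string"), ("*_i", "int"), ("*_dt", "date")]


def resolve_field_type(name):
    # Explicitly declared fields win over dynamic patterns.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Otherwise take the first dynamic pattern that matches the name.
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None  # unknown field: would be rejected at index time


print(resolve_field_type("title"))       # text
print(resolve_field_type("created_dt"))  # date
print(resolve_field_type("created_at"))  # None
```

Documents indexed before the pattern existed simply contain no field ending in _dt, so nothing about them changes; only documents sent after the schema update can carry such a field.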

Issue with Solr Indexing, Solr Indexing Chain is not complete

In my Solr, I get this result after running analysis for indexing. I have a number of documents containing the words "Machine Learning", but it seems like something broke and the indexing chain didn't complete. Can I find a workaround for this?
The field being searched is defined as: <field name="Skills" type="text_general" indexed="true" stored="true"/>
EDIT 1:
Analysis with Query:

I'm guessing that the "SF" is a Stemming filter - the filter will remove common endings to allow 'machine' to match 'machines', storing 'machin' as the common term in the index. As long as stemming is performed both when indexing and when querying, you should get the result you're looking for.
The EdgeNGramFilter stores a token for each extra letter in the token, so you get a token (that will match a query token) for each additional letter (where your filter seems to be configured for 3 as the minimum ngram size).
If you're not performing stemming when searching as well, the query machine will not find any terms matching, since the token after indexing has been stored as machin.
Use both the "query" and "index" sections on the analysis page to see how each side is parsed and processed, and to see why they don't end up with the same terms. The final tokens on both sides are compared, and if they're the same, there's a match (shown with a slightly darker background in the interface, IIRC).
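
The mismatch described above can be reproduced with a toy model. The Python sketch below is only an approximation: toy_stem is a made-up suffix stripper standing in for the real stemming filter, and edge_ngrams assumes a minimum n-gram size of 3 as guessed from the analysis screenshot.

```python
def toy_stem(token):
    # Made-up stemmer (assumption): strip a few common endings.
    for suffix in ("es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def edge_ngrams(token, min_size=3):
    # One token per prefix, from min_size letters up to the whole token.
    return [token[:n] for n in range(min_size, len(token) + 1)]


def index_chain(text):
    # Index side: lowercase, stem, then edge n-grams.
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(toy_stem(token)))
    return terms


def query_chain(text, stem):
    # Query side: lowercase, optionally stem, no n-grams.
    tokens = text.lower().split()
    return [toy_stem(t) if stem else t for t in tokens]


indexed_terms = index_chain("Machine Learning")

for stem in (False, True):
    query_terms = query_chain("machine", stem=stem)
    hit = all(t in indexed_terms for t in query_terms)
    print(f"stemming at query time={stem}: terms={query_terms} match={hit}")
```

Without stemming on the query side the term stays "machine", which was never written to the index; with stemming it becomes "machin" and matches.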

I am not sure what your first image stands for, but your two images show the token filters in a different order.
As a side note on the stem filter: the kstem token filter is a high-performance filter for English. All terms must already be lowercased (use the lowercase filter) for this filter to work correctly.
Your first image shows LCF (LowercaseFilter) as the first token filter. Your second image shows the stem filter running first and the LCF (LowercaseFilter) afterwards, which is not going to work.
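
A small Python sketch of why the order matters. The stemmer here is hypothetical and far simpler than KStem; the only property it shares with it, by assumption, is that it recognises lowercase endings only.

```python
def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def toy_lowercase_only_stemmer(tokens):
    # Hypothetical stemmer: strips a trailing lowercase "s" and nothing else,
    # so it silently skips tokens that have not been lowercased yet.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def run_chain(tokens, filters):
    # Apply the filters one after another, in the given order.
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["MACHINES"]
lower_then_stem = run_chain(tokens, [lowercase_filter, toy_lowercase_only_stemmer])
stem_then_lower = run_chain(tokens, [toy_lowercase_only_stemmer, lowercase_filter])

print(lower_then_stem)  # ['machine']
print(stem_then_lower)  # ['machines']
```

The two chains produce different final terms for the same input, which is exactly the kind of index/query disagreement that prevents a match.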

SOLR indexed item has extra word which is not available in query parameter - how to identify those cases?

We have a scenario where we are trying to perform accurate name matching of Items using SOLR.
Query Parameter: Apple
SOLR Indexed Word: Apple-D
In our business case, "Apple" and "Apple-D" are totally different items and therefore SOLR shouldn't return the match.
Is there an option to achieve the same?

You need to change the fieldType used for the field. Use the string fieldType (solr.StrField) for your field.
The string fieldType makes sure the value is stored by Solr exactly as it is.
It does not apply any analysis to the value, and it does not split it into tokens.
With the string type applied, "Apple" and "Apple-D" are indexed as two different tokens, since no tokenizing takes place. This gives you the exact match.
Once you have changed the fieldType, re-index your data.
You can use the Solr analysis tool to check how the field is indexed and queried.
Note: whenever you ask a question like this, share your schema.xml.
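
The difference between the two field types can be sketched in a few lines of Python. Both analyzers below are simplified stand-ins (assumptions), not Solr code: tokenized_terms imitates a text field that lowercases and splits on punctuation, and string_terms imitates a string field that keeps the whole value as one term.

```python
import re


def tokenized_terms(value):
    # Text-like field: lowercase and split on non-alphanumeric characters.
    return {t for t in re.split(r"[^0-9A-Za-z]+", value.lower()) if t}


def string_terms(value):
    # String-like field: the whole value is a single, unmodified term.
    return {value}


def matches(query, indexed_value, analyzer):
    # A hit means every query term is present among the indexed terms.
    return analyzer(query) <= analyzer(indexed_value)


print(matches("Apple", "Apple-D", tokenized_terms))  # True: "apple" is one of the tokens
print(matches("Apple", "Apple-D", string_terms))     # False: "Apple" != "Apple-D"
print(matches("Apple-D", "Apple-D", string_terms))   # True: exact match only
```

Note that an untokenized field is also case-sensitive in this sketch, so "apple" would not match "Apple" either.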

determine which value produced a hit in SOLR multivalued field type

If I have a multiValued field type of text, and I put values [cat,dog,green,blue] in it. Is there a way to tell when I execute a query against that field for dog, that it was in the 1st element position for that multiValued field?
Assumption: client does not have any pre-knowledge of what the field type of the field being queried is. (i.e. Solr must provide the answer and the client can't post process the return doc to figure it out because it would not know how SOLR matched the query to the result).
Disclosure: I posted to solr-user list and am getting no traction so I post here now.

Currently, there's no out-of-the-box functionality provided in Solr which tells you the position of a value in a multiValue field.

Hopefully I understand your question correctly.
If you want to get the index or the value of the matching entry, there is an ugly workaround:
You could put the index directly in the value, e.g. store "1; car", "2; test" and so on, and then use highlighting. When reading the returned fields, simply skip the text before the semicolon.
But if you only want to query one item:
You can avoid the multi-valued approach and store each value in its own item_i field, then query via item_1. To query against all items regardless of the type, you need to use the copyField directive in schema.xml.
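
The first workaround (prefixing each value with its position) might look like this on the client side. This is a Python sketch under assumptions: the "N; value" format follows the example in the answer, and the matching is done by plain string comparison here, whereas in practice the matching value would come back from Solr's highlighting.

```python
def encode_values(values):
    # Prefix every value with its 1-based position: "1; cat", "2; dog", ...
    return [f"{i}; {v}" for i, v in enumerate(values, start=1)]


def decode_value(stored):
    # Split "2; dog" back into (2, "dog").
    position, _, text = stored.partition("; ")
    return int(position), text


def find_positions(stored_values, term):
    # Positions of all stored values whose text equals the search term.
    return [pos for pos, text in map(decode_value, stored_values) if text == term]


stored = encode_values(["cat", "dog", "green", "blue"])
print(stored)                         # ['1; cat', '2; dog', '3; green', '4; blue']
print(find_positions(stored, "dog"))  # [2]
```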

The Lucene API allows for this, but I'm not sure if Solr does out of the box. In Lucene you can use the IndexReader.termPositions(Term term) method.

Field having multiple distinct values

I am building a "Book search" API using Lucene.
I need to index the book name, author, and book category fields in a Lucene index.
A single book can fall under multiple distinct book categories...for example:
BookName1 -- fiction, humour, philosophy
BookName2 -- fiction, science
BookName3 -- humour, business
BookName4 -- humour
and so on.....
The user should be able to search for all the books under a particular category, say "humour".
Given this situation, how do I index the above fields and build the query in Lucene?

A field can occur multiple times in a Lucene document. Create the document, add the values for the name and author, then add the category field once for each category:
create new lucene document
add name field and value
add author field and value
for each category:
    add category field and value
add document to index
When you search the index for a category, it will return all documents that have a category field with the value you're after. The category should be a 'Keyword' field.
I've written it in English because the specific code is slightly different for each Lucene version.
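
The steps above can be modelled without Lucene to show what the index ends up containing. This Python sketch is not Lucene code: the sample books and authors are illustrative, and the dictionary plays the role of an untokenized ('Keyword') category field where each value is kept as one exact term.

```python
from collections import defaultdict


def build_category_index(books):
    # Map each category term to the set of book names that carry it.
    index = defaultdict(set)
    for name, _author, categories in books:
        # The category field is added once per category, as in the steps above.
        for category in categories:
            index[category].add(name)
    return index


# Illustrative sample data (authors are placeholders).
books = [
    ("BookName1", "Author A", ["fiction", "humour", "philosophy"]),
    ("BookName2", "Author B", ["fiction", "science"]),
    ("BookName3", "Author C", ["humour", "business"]),
    ("BookName4", "Author D", ["humour"]),
]

category_index = build_category_index(books)
print(sorted(category_index["humour"]))  # ['BookName1', 'BookName3', 'BookName4']
```

Because every category is its own term, a search for one category returns each book that lists it, no matter how many other categories the book has.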

You can create a simple "category" field in which you list all categories for a book, separated by spaces.
Then you can search something like:
stock market AND category:(+"business")
Or if you want to search in more than one category
stock market AND category:(+"business" +"philosophy")

I would use Solr instead. It's built on Lucene and managed by the ASF, but it is much, much easier to use than Lucene, especially for newcomers.
It offers pretty much all the mainline features of Lucene (certainly everything you'll need for the project you describe), plus extra things like snapshotting, replication, schemas, ...
In Solr, you would simply define the fields you want to index something like this in schema.xml:
<field name="book_id" type="string" indexed="true" stored="true" required="true" multiValued='false'/>
<field name="book_name" type="text" indexed="true" stored="true" required="true" multiValued='false' />
<field name="book_authors" type="text" indexed="true" stored="true" required="true" multiValued='true' />
<field name="book_categories" type="textTight" indexed="true" stored="true" required="true" multiValued='true' />
Note that the multiValued='true' attribute lets you effectively pass an array or list to this field, and Solr indexes each value.
Once you have this, start up Solr and you can ask queries like "book_authors:Hemingway" or "book_categories:Romance book_categories:Mills".
There are several query handlers pre-written and configured for you to do things like parse complex queries (fuzzy matches, boolean operations, scoring boosts, ...), and as Solr's API is exposed over HTTP, all this is wrapped by a number of client libraries, so you don't need to handle the low-level details of crafting queries yourself.
There is lots of great documentation on their website to get you started.
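
Since the API is plain HTTP, a query like the ones above boils down to a URL. The Python sketch below only builds that URL; the host, port and the core name "books" are assumptions for illustration, while q, rows and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    # urlencode takes care of escaping characters such as ":" and spaces.
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


# Hypothetical local Solr instance with a core called "books".
url = build_select_url("http://localhost:8983/solr", "books", "book_authors:Hemingway")
print(url)
```

Fetching that URL (for example with urllib.request.urlopen) against a running instance would return the matching documents as JSON; a client library does the same thing with more convenience.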
