Is it possible to do a case insensitive search using Examine Index and Lucene without altering the data stored?
I'm saving articles with Id, title, the text and a date.
I don't want to index my data as lowercase, since I want to read it back from the index and display it as is, so I can skip the trip to the DB to get the data.
Saving the same data twice, once as it is and once as lowercase, doesn't feel like the right way of doing it.
Any suggestions on how to approach this?
ExamineIndex.config
<IndexSet SetName="MySearchIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/MySearch/" >
<IndexUserFields>
<add Name="Id" />
<add Name="Title" />
<add Name="Text" />
<add Name="Date" />
</IndexUserFields>
ExamineSettings.config
<add name="MySearchIndexer" type="Examine.LuceneEngine.Providers.SimpleDataIndexer, Examine"
dataService="X.Service.MyIndexerService, X"
indexTypes="CustomData"
runAsync="false"
enableDefaultEventHandler="true"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>
<add name="MySearchSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcard="true" />
In Lucene, analyzers do not alter your stored data; they only determine how the data is indexed. So you can index your data as it is (don't lowercase it in your code) and retrieve the values unchanged. Note that the WhitespaceAnalyzer in your config does not lowercase tokens; for case-insensitive matching, use an analyzer that does (such as StandardAnalyzer) for both the indexer and the searcher.
As a side note, in Lucene you can have fields with different attributes (indexed/not indexed, stored/not stored). So you can add the same field twice: one for retrieval only (stored and not indexed) and one for searching (indexed as lowercase but not stored). Check whether Examine supports these types of fields.
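To make the stored-versus-indexed distinction concrete, here is a minimal sketch in plain Python. It is not Lucene or Examine code; the `TinyIndex` class and its methods are hypothetical and only illustrate that search can run on lowercased tokens while the stored value keeps its original case.

```python
from collections import defaultdict


class TinyIndex:
    """Toy index: searches lowercased tokens, returns stored text untouched."""

    def __init__(self):
        self.stored = {}                  # doc id -> original field values
        self.postings = defaultdict(set)  # (field, lowercased token) -> doc ids

    def add(self, doc_id, fields):
        self.stored[doc_id] = dict(fields)  # stored as-is, never lowercased
        for name, value in fields.items():
            for token in value.split():
                self.postings[(name, token.lower())].add(doc_id)

    def search(self, field, term):
        # Analyze the query term the same way as the indexed tokens.
        ids = self.postings.get((field, term.lower()), set())
        return [self.stored[i] for i in sorted(ids)]


index = TinyIndex()
index.add(1, {"Title": "Hello World", "Text": "Lucene Keeps Stored Values Intact"})
hits = index.search("Title", "hello")
print(hits[0]["Title"])  # the original casing comes back
```

The query term is lowercased in `search` for the same reason the tokens are lowercased in `add`: both sides have to go through the same analysis for a match to occur.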
Sorry for the newbie question, I'm new to Solr. In managed-schema, I see that there are many fields with identical types but different names. How does Solr decide which field to store the tokens in, given that the types are all the same and only the names differ? For instance,
<field name="content_type" type="text_general">
<field name="content_type_hint" type="text_general">
<field name="blitz" type="text_general">
They all have the same type (same analyzer). How does Solr store different content in all these text_general fields? Does it check the names of the tags against the actual content and, if they are not identical, move on to dynamic fields? I searched the web and it seems no one has explained in detail whether the name plays a role in the indexing process.
So names and type are two different things.
<field name="content_type" type="text_general">
In the above case the name of the field is "content_type", which is what you use to search it.
E.g. if you want to find a document with content_type="xml", you would query something like this:
q=content_type:xml
The type, however, defines the analysis that will occur on a field when documents are indexed or queries are sent to the index.
So somewhere in your schema you will have defined the field type text_general, something like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
You can read more about it here
https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html
Solr won't store everything under the field type. The type just tells Solr what analysis to run on the field while indexing or querying.
Every field will have its own index.
Edit:
I think you are confused about how data is indexed. Let's take an example.
Let's say I have a document like this:
{
"content_type" : "text/html",
"content_type_hint" : "some_hint",
"blitz" : "some_text"
}
So when you index the document, you tell Solr which value goes into which field.
In this case you are saying that the field content_type has the value "text/html" and blitz has the value "some_text".
Solr then runs the analysis defined by each field's type and puts the result into that field's index.
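The split between name and type can be sketched in a few lines of Python. This is only an illustration, not Solr's implementation: the `FIELDS` and `FIELD_TYPES` tables are hypothetical stand-ins for the schema, and the analyzer is a crude lowercase/whitespace function.

```python
from collections import defaultdict

# Hypothetical stand-ins for schema entries: field name -> field type.
FIELDS = {
    "content_type": "text_general",
    "content_type_hint": "text_general",
    "blitz": "text_general",
}

# Field type -> analysis function (a crude lowercase/whitespace analyzer here).
FIELD_TYPES = {"text_general": lambda text: text.lower().split()}


def index_document(index, doc_id, doc):
    """Analyze each value by its field's type and file it under the field's name."""
    for name, value in doc.items():
        analyze = FIELD_TYPES[FIELDS[name]]
        for token in analyze(value):
            index[name][token].add(doc_id)


index = defaultdict(lambda: defaultdict(set))
index_document(index, 1, {
    "content_type": "text/html",
    "content_type_hint": "some_hint",
    "blitz": "some_text",
})

print(sorted(index["blitz"]))
print("some_text" in index["content_type"])
```

All three fields share one analyzer, yet the token "some_text" is only found under blitz, because the document itself names the field each value belongs to.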
I'm new to working with Solr and I have an instance running properly on my server.
My problem is:
When I query Solr with some terms it doesn't return results, even though there are indexed items containing those terms. I talked with a developer who used to work with this Solr instance, and he remembers something about a "blacklist" or "empty list" or something similar that acts as a filter for queries: a list of common words that return poor-quality results, words like:
"a", "the", "for", ...
I want to know how to manage that list to remove a term from it (or add one, edit one, etc.).
It sounds like you're talking about the stopwords filter. If you have stopword filtering active, you should see something similar to this in your field analysis within schema.xml:
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
This references the file stopwords.txt, which is a standard name for this file, but a different file name may be used, so it could be different on your server. This file will contain a list of words that should be disregarded during search. You should find this file in the conf directory for your index (the same place as the schema.xml and solrconfig.xml). You can edit this file, though for best results, you should re-index your records after you do so.
Alternately, if you would prefer not to filter common words from your search, you can remove the reference to the StopFilterFactory from your field analysis entirely. Again, you should plan to reindex your records after doing so.
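As a rough picture of what the filter does, here is a small Python sketch. It is not Solr code; it assumes a stopword file with one word per line and `#` comment lines, and mimics `ignoreCase="true"` by comparing lowercased tokens.

```python
def load_stopwords(lines):
    """Parse stopword entries: one word per line, lines starting with '#' are comments."""
    words = set()
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            words.add(word.lower())
    return words


def stop_filter(tokens, stopwords):
    """Drop tokens found in the stopword set, ignoring case."""
    return [t for t in tokens if t.lower() not in stopwords]


stopwords = load_stopwords(["# common words", "a", "the", "for"])
tokens = "The search for a result".split()
print(stop_filter(tokens, stopwords))  # ['search', 'result']
```

Because the same filtering normally applies at index time, terms dropped while the old list was active are simply absent from the index, which is why re-indexing after an edit matters.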
I would need to store the following value V in a database. An instance of V is linked with a certain record in a certain table. The problem with V is that it has a resemblance to a union type and can indicate three things:
V has an integer value, meaning that the value should be used for the record in question.
V is absent, i.e. NULL, meaning that a global setting takes precedence for the record in question.
V has a meaning of "ANY", meaning that no value should be used for the record in question.
1 and 2 are easy (make it a NULLable integer column), but how to deal with 3? Now I don't feel comfortable using a special numeric value for indicating the ANY state, because e.g. -1 and 0 are totally valid values for case 1.
What has come to mind so far is
Putting this union type into a separate table that has two columns, one for the numeric value and one for the ANY condition (boolean), and a nullable foreign key reference to it.
Storing it as a VARCHAR column and using some special character (e.g. "*") for ANY state.
Is there any "industry standard" way of doing this? :)
For reference, this union type looks something like this in XSD representation:
<complexType name="V">
<choice>
<element name="anyValue" type="xs:string" fixed="" />
<element name="numericValue" type="xs:int" />
</choice>
</complexType>
<complexType name="E">
<sequence>
<element ... />
<element ... />
<element name="configValue" type="V" minOccurs="0" />
</sequence>
</complexType>
I would solve it in one of two ways:
A nullable foreign key, where the referenced table has a row that indicates the "any" state, for example one with all columns NULL.
Or add an int/boolean column to indicate an overriding state.
Please don't use varchar for linking to other tables...
I have seen many solutions to this problem and all of them have serious issues. This is never going to be pretty. Try to tailor your solution to the code that is going to consume this mess, and make it as simple for that code as you can. In these situations readability is king.
But the best advice is never to get into these situations, though I know that is not always within your power.
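The extra-column option could look roughly like the sketch below, using Python's built-in sqlite3 module. The table and column names (`record`, `v_value`, `v_is_any`) are made up for illustration, and the CHECK constraint is one possible way to keep the ANY flag and an explicit value mutually exclusive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE record (
        id INTEGER PRIMARY KEY,
        v_value INTEGER,
        v_is_any INTEGER NOT NULL DEFAULT 0 CHECK (v_is_any IN (0, 1)),
        CHECK (v_is_any = 0 OR v_value IS NULL)
    )
""")
conn.executemany(
    "INSERT INTO record (id, v_value, v_is_any) VALUES (?, ?, ?)",
    [(1, -1, 0), (2, None, 0), (3, None, 1)],
)


def describe(v_value, v_is_any):
    """Map the two columns back to the three states of V."""
    if v_is_any:
        return "ANY"      # case 3: no value should be used
    if v_value is None:
        return "GLOBAL"   # case 2: fall back to the global setting
    return v_value        # case 1: explicit integer, -1 and 0 included


rows = conn.execute("SELECT id, v_value, v_is_any FROM record ORDER BY id").fetchall()
states = {row[0]: describe(row[1], row[2]) for row in rows}
print(states)
```

Keeping the decoding in a single function like `describe` means the consuming code never has to reason about the two columns directly.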
I've created an index (using Lucene 2.9) which stores text messages. (The documents also contain some other metadata which is not indexed, just stored.) I use the StandardAnalyzer for parsing these messages. I'm trying to run some tests on this index using Solr (I replaced the example app index with my index) to see what kind of results I get from various queries.
When I tried the following query, I got 0 results:
"text:happiness"
However, changing that to "text:happiness*" gives me some results. All of them contain terms like "happiness,", "happiness." etc. So I thought that it was a tokenization issue during index creation. However, when I used Luke (a Lucene index debugging tool) to run the same query (text:happiness), I got the exact same results that I get for happiness* from Solr, which led me to believe that the problem is not in indexing but in the way I'm specifying my Solr query. I looked at solrconfig.xml and noticed that it has the following line (commented out). I tried uncommenting it and then modified my query to use "defType=lucene" in addition to the original query, but got the same results.
<queryParser name="lucene" class="org.apache.solr.search.LuceneQParserPlugin"/>
I have very little experience with Solr, so any help is greatly appreciated :)
The field that I was querying on was defined as type "text" in the solr schema.xml (not solrconfig.xml as I incorrectly mentioned in my earlier comment). Here's a relevant snippet from the schema.xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
I replaced it with the following:
<fieldType name = "text" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>
Which gives me the required behavior.
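The general point behind the fix is that the query has to be analyzed the same way the index was built. The Python sketch below is a toy illustration of that idea only; the two analyzer functions are crude stand-ins and do not reproduce the actual Solr filter chain (which is truncated in the snippet above) or the real StandardAnalyzer.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; punctuation stays attached to the word."""
    return text.split()


def standard_like_analyzer(text):
    """Rough stand-in for a standard analyzer: word characters only, lowercased."""
    return re.findall(r"\w+", text.lower())


def matches(document, query, index_analyzer, query_analyzer):
    """True if every analyzed query term appears among the indexed terms."""
    indexed = set(index_analyzer(document))
    return all(term in indexed for term in query_analyzer(query))


message = "Money cannot buy happiness, they say."
mismatched = matches(message, "Happiness", whitespace_analyzer, standard_like_analyzer)
consistent = matches(message, "Happiness", standard_like_analyzer, standard_like_analyzer)
print(mismatched, consistent)  # False True
```

With mismatched analyzers the indexed term and the query term differ ("happiness," versus "happiness"), so an exact term query finds nothing even though the document clearly contains the word.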
I am building a "Book search" API using Lucene.
I need to index the book name, author, and book category fields in a Lucene index.
A single book can fall under multiple distinct book categories, for example:
BookName1 --fiction,humour,philosophy.
BookName1 --fiction,science.
BookName1 --humour,business.
BookName4-humour
and so on.....
The user should be able to search for all the books under a particular category, say "humour".
Given this situation, how do I index the above fields and build the query in Lucene?
You can have a field occur multiple times in a Lucene document. Create the document, add the values for the name and author, then add a category field for each category:
create new lucene document
add name field and value
add author field and value
for each category:
    add category field and value
add document to index
When you search the index for a category, it will return all documents that have a category field with the value you're after. The category should be a 'Keyword' field.
I've written it in English because the specific code differs slightly between Lucene versions.
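The same idea in a minimal, Lucene-free Python sketch: each category value is indexed whole, the way an untokenized keyword field would be, and a book with several categories is listed under each of them. The author names are placeholders, and this only illustrates the lookup behaviour, not the Lucene API.

```python
from collections import defaultdict

books = [
    {"name": "BookName1", "author": "Author A", "category": ["fiction", "humour", "philosophy"]},
    {"name": "BookName2", "author": "Author B", "category": ["fiction", "science"]},
    {"name": "BookName4", "author": "Author C", "category": ["humour"]},
]

# Each category value is indexed whole (untokenized), like a keyword field.
by_category = defaultdict(list)
for book in books:
    for category in book["category"]:  # one field occurrence per category
        by_category[category].append(book["name"])

print(by_category["humour"])  # ['BookName1', 'BookName4']
```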
You can create a simple "category" field, where you list all categories for a book separated by spaces.
Then you can search with something like:
stock market AND category:(+"business")
Or if you want to search in more than one category
stock market AND category:(+"business" +"philosophy")
I would use Solr instead - it's built on Lucene and managed by the ASF, but is much, much easier to use than Lucene, especially for newcomers.
It offers pretty much all the mainline features of Lucene (certainly everything you'll need for the project you describe), plus extra things like snapshotting, replication, schemas, ...
In Solr, you would simply define the fields you want to index with something like this in schema.xml:
<field name="book_id" type="string" indexed="true" stored="true" required="true" multiValued='false'/>
<field name="book_name" type="text" indexed="true" stored="true" required="true" multiValued='false' />
<field name="book_authors" type="text" indexed="true" stored="true" required="true" multiValued='true' />
<field name="book_categories" type="textTight" indexed="true" stored="true" required="true" multiValued='true' />
Note that the multiValued='true' attribute lets you effectively pass an array or list to this field, which gets split and indexed nicely by Solr.
Once you have this, start up Solr and you can ask queries like "book_authors:Hemingway" or "book_categories:Romance book_categories:Mills".
There are several query handlers pre-written and configured for you to do things like parse complex queries (fuzzy matches, boolean operations, scoring boosts, ...), and as Solr's API is exposed over HTTP, all this is wrapped by a number of client libraries, so you don't need to handle the low-level details of crafting queries yourself.
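Since the API is plain HTTP, a query is ultimately just a URL. The sketch below builds one with Python's standard library; the host, port, and core name (`books`) are assumptions for illustration, so adjust them to your installation. The `q`, `rows`, and `wt` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/books/select"


def build_query_url(query, rows=10):
    """Return a select URL for the given query string, asking for JSON output."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


url = build_query_url("book_categories:Romance")
print(url)
```

In practice a client library would build and send this request for you; the sketch only shows what travels over the wire.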
There is lots of great documentation on their website to get you started.