I have a relatively simple Lucene index, being served by Solr. The index consists of two major fields, title and body, and a few less-important fields.
Most search engines give more relevance to results with matches in the title, over the body. I'm going to start providing an index-time boost to the title field.
My question is, what values do people typically use for their title fields? 2? 4? 10? 100?
I suggest you divide the median body length by the median title length. This roughly gives you a factor M - for M appearances of a word in the body, it will appear once in the title.
Now, use something like M*3. This is, of course, a rationalized heuristic, and it is best you iterate over the values. See Grant Ingersoll's "Debugging Relevance Issues in Search" for a much more structured discussion.
Related
I feel we can improve the search results from our help site (tested few terms and not seeing relevant results on the first page) and I am exploring our options.
We use Apache Solr Search and after reading around it seems like we can improve the results by tweaking the Field Bias. Here is the list of the field available. Some of the fields are self-explanatory but I do not know what others mean. For eg. Path alias, tm_vid_2_names, etc .
The full, rendered content (e.g. the rendered node body)
Title or label
Path alias
Body text inside links (A tags)
Body text inside H1 tags
Body text inside H2 or H3 tags
Body text inside H4, H5, or H6 tags
Body text in inline tags like EM or STRONG
All taxonomy term names
tm_ds_search_result
tm_vid_11_names
tm_vid_12_names
tm_vid_16_names
Taxonomy term names only from the Tags vocabulary
tm_vid_21_names
tm_vid_26_names
tm_vid_2_names
tm_vid_3_names
tm_vid_4_names
tm_vid_5_names
tm_vid_6_names
tm_vid_9_names
Extra rendered content or keywords
Author name
Author name (Formatted)
The rendered comments associated with a node
Thank you very much for your help.
It's impossible to say what the meaning behind all those fields are without knowing your domain and what is actually relevant information. I'd start by actually looking at how people are using your search, and what they're searching for - and then start tweaking how much to boost each field to get more relevant results than what you're seeing now.
If you're using the dismax or edismax query handlers (which you probably are), you can tweak the weights and boosts of each field by applying weights in the list of fields to query: qf=field^10 field_2^5 field_3. This would search all three fields, but give more weight to hits in the first field than the second and third.
In your case you'd probably want to give more boost to anything in the title, h1, h2, h3, etc. fields, as they're probably better descriptors of the content, as well as the taxonomy fields. The body field shouldn't be considered very important (so no boost is probably a good start), except to make sure that you're finding the document if it's a rarely used term.
You can append debugQuery=true to your query to see exactly how the results are scored and why a certain document ranked above another in the search results.
It's impossible for anyone without specific knowledge of your data and search patterns to say anything exact about which fields to include and their weights.
I'm searching against a table of news articles. The 2 relevant columns are ArticleTitle and ArticleText. When I want to search an article for a particular term, i started out with
column LIKE '%term%'.
However that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*> which would potentially return an irrelevant article.
So then I switched to
column LIKE '% term %'.
The problem with this query is it didn't find articles who's title or text began/ended with the term. Also it didn't match against things like term- or term's, which I do want.
It seems like the query i want should be able to do something like this
'%[^a-z]term[^a-z]%
This should exclude terms within anchor links, but everything else. I think this query still excludes strings that begin/end with the term. Is there a better solution? Does SQL-Server's FULL TEXT INDEXING solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then i could use '%term%' without getting anchor links. These would be 2 extra columns though, because eventually i will need the original HTML for formatting purposes.
Thanks.
SQL Server's LIKE allows you to define Regex-like patterns like you described.
A better option is to use fulltext search:
WHERE CONTAINS(ArticleTitle, 'term')
exploits the index properly (the LIKE '%term%' query is slow), and provides other benefit in the search algorithm.
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version, and run your search queries on it.
SQL is not designed to interpret HTML strings. As such, you'd only be able to postpone the problem till a more difficult issue arrives (for example, a comment node that contains your search terms as part of a plain sentence).
You can still utilize FULL TEXT as a prefilter and then run an HTML analysis on the application layer to further filter your result set.
I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results the matching term?
Simply put, if my documents conatin a text field;
The query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex however, what I'm actually trying to accomplish is to "stuff" a relational DB into my lucene Index (probably not the best idea?).
An appropriate example would be :
I have documents representing books in a library. every book has a title and also a list of people who has borrowed this book and the date of borrowing.
when a user searches for a book with title containing "JAVA", I want to give priority to books that were borrowed by this user. This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score)
also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recent borrowed ones first. so I thought to create a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I will add a BooleanClause borrowers: user1* and order by matching term.
any other solution ideas will be welcome
I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for Tokens in the index that match your query, then for each matching token executing a query using this token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
In general it is sometimes easy to just issue two queries. For example one within the corpus of books the user as borrowed before and another witin the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer somehow mapping the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/
I define a Document object for my product entity which has several fields: Title, Brand, Category, Size, Color, Material.
Now I want to support user to do an AND search on multiple fields. Any document that have one, two or more fields contain all the search words will be responded.
For example, when user enter "gucci shirt red" I want to return all documents that have fields matched with all 3 tokens "gucci", "shirt" AND "red". So all documents below will be responded:
1.Documents with title contains all the 3 words, for example Title = "Gucci Modern Shirt Red" or "Gucci blue shirt"...
2.Documents with Title = "Gucci classical shirt" AND Color = "red"
3.Documents with Category = "mens shirt" AND Brand = "gucci" AND Color = "red"
4.etc..
I know that Lucene support operator + that do a MUST for search query. For example I can translate the above keyword to query "+gucci +shirt +red" then I'm sure documents of example (1) above will definitely be responded. But does it work for cases (2) and (3) above ?
When doing these types of queries I like to: create a master BooleanQuery and add several sub-queries that work together to give the best result:
TermQuery: (exact match), someone types in the exact match of the title
PhraseQuery: (use slop), so if you have "Gucci Modern Shirt Red" and someone types in "Gucci Shirt" (notice one word gap) it would match
FuzzyQuery: (slow on large(> 50 million records)/non-memory indexes) to account for potential misspellings
Boolean SubQuery: with all of the terms seperated and OR'ed. Queries matching 1 our of 4 words will have low score however 3/4 words will have a higher score.
Query Parse (as mentioned above with potential field boosts)
Other: i.e. Synonym search on phrases etc.
I would OR all of these types and then filter them out using a Collector minimum score.
The reason I like the master BooleanQuery approach is that you can have a setting where a user chooses "the type" of query. Maybe as simple -> advanced and it is easy to add/remove query types rather quickly on the fly and the query can be built pretty easily giving predicitve results. Boosting records/similarity you are working within the internal Lucene algorithm and results are not sometimes clear.
Performance: I have done queries like this using Lucene 3.0.x on indexes with > 100M records NOT IN MEMORY and it works pretty quickly giving sub-second responses. Fuzzy Query does slow things down, but as stated before that can be made into an advanced search option (or "Search again with...")
No, when not given a a field to search explicitly in the query, it will go to the default field, which it would appear is the "title" in your case. You would need a query more like:
+shirt +color:red +brand:gucci
for instance.
Or, one common usage is to set up a catch all field, in which all (or a large subset) of searchable data is mashed together, allowing you to search everything in a very loose fashion, on that field, in which case you would just use something like:
all:(+shirt +gucci +red)
Or, if you made that field your default field instead:
+shirt +gucci +red
As you indicated.
You could use MultiFieldQueryParser. Add Title, color, brand etc to this.
If you search for "gucci shirt red" then using above Parser would return query like
+((Title:gucci Color:gucci Brand:gucci) (Title:shirt Color:shirt Brand:shirt) (Title:red Color:red Brand:red)
This should solve the problem.
Also, if you want that lets say, for above query you want to show brand with gucci products to be shown 1st then you could apply boost to this field.
I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".