Field bias explanation and how to improve search results - apache

I feel we can improve the search results from our help site (I tested a few terms and did not see relevant results on the first page), and I am exploring our options.
We use Apache Solr Search, and after reading around it seems we can improve the results by tweaking the Field Bias. Here is the list of the fields available. Some of the fields are self-explanatory, but I do not know what others mean, e.g. Path alias, tm_vid_2_names, etc.
The full, rendered content (e.g. the rendered node body)
Title or label
Path alias
Body text inside links (A tags)
Body text inside H1 tags
Body text inside H2 or H3 tags
Body text inside H4, H5, or H6 tags
Body text in inline tags like EM or STRONG
All taxonomy term names
tm_ds_search_result
tm_vid_11_names
tm_vid_12_names
tm_vid_16_names
Taxonomy term names only from the Tags vocabulary
tm_vid_21_names
tm_vid_26_names
tm_vid_2_names
tm_vid_3_names
tm_vid_4_names
tm_vid_5_names
tm_vid_6_names
tm_vid_9_names
Extra rendered content or keywords
Author name
Author name (Formatted)
The rendered comments associated with a node
Thank you very much for your help.

It's impossible to say what the meaning behind all those fields is without knowing your domain and what information is actually relevant. I'd start by looking at how people are using your search and what they're searching for, and then start tweaking how much to boost each field to get more relevant results than what you're seeing now.
If you're using the dismax or edismax query handlers (which you probably are), you can tweak the weights and boosts of each field by applying weights in the list of fields to query: qf=field^10 field_2^5 field_3. This would search all three fields, but give more weight to hits in the first field than the second and third.
In your case you'd probably want to give more boost to anything in the title, h1, h2, h3, etc. fields, as they're probably better descriptors of the content, as well as the taxonomy fields. The body field shouldn't be considered very important (so no boost is probably a good start), except to make sure that you're finding the document if it's a rarely used term.
You can append debugQuery=true to your query to see exactly how the results are scored and why a certain document ranked above another in the search results.
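As a rough illustration of the two points above, here is a minimal Python sketch that assembles the query string for an edismax request with per-field boosts and optional debug output. The parameters defType, q, qf and debugQuery are standard Solr request parameters; the function name and the field names (title, h1, body) are placeholders for this example, not the actual field names in your index.

```python
from urllib.parse import urlencode


def build_solr_query(user_query, field_boosts, debug=False):
    """Build the query string for an edismax request with boosted fields.

    field_boosts maps a field name to its boost; a boost of 1 is written
    as the bare field name, matching the qf=field^10 field_2^5 field_3 form.
    """
    qf = " ".join(
        field if boost == 1 else f"{field}^{boost}"
        for field, boost in field_boosts.items()
    )
    params = {"defType": "edismax", "q": user_query, "qf": qf}
    if debug:
        # Ask Solr to explain how each document was scored.
        params["debugQuery"] = "true"
    return urlencode(params)


# Hypothetical field names: boost the title and H1 fields over the body.
qs = build_solr_query(
    "reset password", {"title": 10, "h1": 5, "body": 1}, debug=True
)
print(qs)
```

The resulting string would be appended to the select URL of your Solr core. Starting with the body unboosted and adjusting one field at a time makes the debugQuery output easier to compare between runs.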
It's impossible for anyone without specific knowledge of your data and search patterns to say anything exact about which fields to include and their weights.

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
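To make the distinction concrete, here is a small Python sketch of the idea. It is not Lucene code: the tokenize function is only a stand-in for an analyzer (whitespace split plus lowercasing, which is an assumption of this example), and the index is a plain dictionary keyed first by field and then by term.

```python
from collections import defaultdict


def tokenize(text):
    # Stand-in for a Lucene analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def build_index(docs):
    # index[field][term] -> set of ids of documents containing that term
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            for term in tokenize(text):
                index[field][term].add(doc_id)
    return index


def search(index, field, term):
    # Look up a single term in a single field.
    return sorted(index[field].get(term.lower(), set()))


# Two toy documents, each with a title field and a body field.
docs = {
    1: {"title": "This is a list of terms", "body": "Fields hold terms"},
    2: {"title": "Fields explained", "body": "A list of fields"},
}
index = build_index(docs)

print(search(index, "title", "fields"))  # only document 2 has it in the title
print(search(index, "body", "fields"))   # both documents have it in the body
```

The same term ("fields") gives different results depending on which field is searched, which is the point of keeping title and body in separate fields.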
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

SQL like '%term%' except without letters

I'm searching against a table of news articles. The 2 relevant columns are ArticleTitle and ArticleText. When I want to search an article for a particular term, I started out with
column LIKE '%term%'.
However, that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*">, which could return an irrelevant article.
So then I switched to
column LIKE '% term %'.
The problem with this query is that it didn't find articles whose title or text began or ended with the term. It also didn't match things like term- or term's, which I do want.
It seems like the query I want should be able to do something like this
'%[^a-z]term[^a-z]%'
This should exclude terms within anchor links, but match everything else. I think this query still excludes strings that begin or end with the term. Is there a better solution? Does SQL Server's FULL TEXT INDEXING solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then I could use '%term%' without getting anchor links. These would be 2 extra columns though, because eventually I will need the original HTML for formatting purposes.
Thanks.
SQL Server's LIKE allows you to define Regex-like patterns like you described.
A better option is to use fulltext search:
WHERE CONTAINS(ArticleTitle, 'term')
exploits the index properly (the LIKE '%term%' query is slow), and provides other benefits in the search algorithm.
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version, and run your search queries on it.
SQL is not designed to interpret HTML strings. As such, you'd only be able to postpone the problem till a more difficult issue arrives (for example, a comment node that contains your search terms as part of a plain sentence).
You can still utilize FULL TEXT as a prefilter and then run an HTML analysis on the application layer to further filter your result set.
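As a sketch of what the plaintext step or the application-layer filter could look like, the Python below strips markup with the standard library's HTMLParser and then looks for the term on word boundaries. The class and function names are made up for this example, and it assumes the article HTML is reasonably well formed.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes; tags, attributes and comments are skipped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plaintext(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def mentions(html, term):
    # \b matches at the start or end of the text and next to punctuation,
    # so "term", "term-" and "term's" match while "determine" does not.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, to_plaintext(html), re.IGNORECASE) is not None


print(mentions('<a href="example.com/term">read more</a>', "term"))
print(mentions("<p>The term's origin is unclear.</p>", "term"))
```

The first call reports no match because the term only appears inside the href attribute; the second matches because it appears in the visible text. The output of to_plaintext is also what you would store in the extra plaintext columns.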

How do I include other fields in a Lucene search?

Let's use emails as an example of a document. You have the subject, the body, the sender, and let's say we can also tag them (as Gmail does).
From my understanding of QueryParser, I give it ONE field and the parser type. If a user enters text, the search only covers whatever field I set. I noticed it will look in the subject or body field if I write fieldName: text to search. However, how do I make a regular query such as "funny SO question unicorn" find results with some of those strings in the subject and the others in the body? At the moment, because I knew it would be easy, I made a field called ALL and combined all the other fields into it, but I would like to know how to do it properly, especially since my next app is text-search dependent.
Use MultiFieldQueryParser. You can specify a list of fields to be searched using the following constructor.
MultiFieldQueryParser(Version matchVersion, String[] fields, Analyzer analyzer)
This will generate a query as if you had created multiple queries on different fields. This partially addresses your problem. It still will not match one term in field1 and another in field2. For that, as you have rightly pointed out, you will need to combine all the fields into one single field and search in that field. Nevertheless, you will find MultiFieldQueryParser useful when query terms do not cross field boundaries.

How would you reproduce a tagging system like the one StackOverflow uses?

I am trying to produce a tagging system for a recruitment agency model and love the way SO separates tags and searches for the remaining phrases.
How would you compare the tags in a table to the search query etc...
I have come up with the following, but it has some hiccups...
User enters search query
Full text SQL contains() search on tbl_tags
Returns 5 results
Check if each "exact tag phrase" exists in original query string.
If it does exist then add tagID to array.
Remove tag names from original search string...
Search in tbl_people for people with linked TagIDs and search text fields with remaining text.
Example search : French Project Managers with Oracle experience
Tags : [French] [Project Manager]s with [Oracle] experience
Remaining text : s with experience
Now the problem comes when I search for Project Managers it leaves me with a surplus "s"... and there are probably other bugs with this logic too that I cannot account for...
Any ideas on how to make the logic perfect?
Thanks in advance, I understand this might be a bit of an abstract question...
You're missing a key ingredient of how StackOverflow does its search. SO requires that the user delineate the tags in the search string by explicitly putting brackets around them. The (probably simplified) logic would then be:
Extract marked tags using regex to detect contents inside brackets
Using list of most common tags, scan string for unmarked tags and extract them.
Remove tag meta characters
Perform full-text search, filtered by tags
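The first three steps above can be sketched in Python as follows. This is only an illustration of the idea, not how StackOverflow actually implements it; the KNOWN_TAGS set and the split_query function are hypothetical, and a real system would load the common tags from its own tag table.

```python
import re

# Hypothetical list of common tags for this example.
KNOWN_TAGS = {"french", "oracle", "project manager"}


def split_query(query, known_tags=KNOWN_TAGS):
    """Return (tags, remaining_text) for a search string."""
    # Extract explicitly marked tags such as "[oracle]".
    tags = [t.strip().lower() for t in re.findall(r"\[([^\]]+)\]", query)]
    # Remove the bracketed parts (tag plus meta characters) from the text.
    rest = re.sub(r"\[[^\]]+\]", " ", query)
    # Scan what is left for unmarked known tags, longest first,
    # matching whole words only.
    for tag in sorted(known_tags, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(tag) + r"\b", re.IGNORECASE)
        if pattern.search(rest):
            if tag not in tags:
                tags.append(tag)
            rest = pattern.sub(" ", rest)
    # Collapse whitespace; this is the text for the full-text search.
    return tags, " ".join(rest.split())


tags, remaining = split_query("[French] [Project Manager] with Oracle experience")
print(tags)
print(remaining)
```

The returned tags would be used to filter the result set, and the remaining text would go to the full-text search. Because unmarked tags are matched on whole words, an unmarked plural such as "Project Managers" is left in the free text rather than producing a stray "s".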

Recommended title boost?

I have a relatively simple Lucene index, being served by Solr. The index consists of two major fields, title and body, and a few less-important fields.
Most search engines give more relevance to results with matches in the title, over the body. I'm going to start providing an index-time boost to the title field.
My question is, what values do people typically use for their title fields? 2? 4? 10? 100?
I suggest you divide the median body length by the median title length. This roughly gives you a factor M - for M appearances of a word in the body, it will appear once in the title.
Now, use something like M*3. This is, of course, a rationalized heuristic, and it is best you iterate over the values. See Grant Ingersoll's "Debugging Relevance Issues in Search" for a much more structured discussion.
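As a worked example of this heuristic, the Python sketch below computes M from a sample of titles and bodies and multiplies it by 3. Measuring length in whitespace-separated words is an assumption of this example (the answer does not say which unit to use), and the function name is made up.

```python
from statistics import median


def suggested_title_boost(titles, bodies, multiplier=3):
    """Heuristic starting point: multiplier * (median body length / median title length)."""
    # Length is counted in whitespace-separated words (an assumption).
    median_title = median(len(t.split()) for t in titles)
    median_body = median(len(b.split()) for b in bodies)
    m = median_body / median_title
    return multiplier * m


# Toy sample: titles of 2, 4 and 3 words; bodies of 30, 60 and 90 words.
titles = ["a b", "a b c d", "a b c"]
bodies = ["w " * 30, "w " * 60, "w " * 90]

# Median title length is 3 and median body length is 60, so M = 20
# and the suggested starting boost is 3 * 20 = 60.
print(suggested_title_boost(titles, bodies))
```

The result is only a first value to iterate from, as the answer says.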