Solr facet counts are not correct, how to deduplicate - indexing

We are using two Solr instances to index the files. Sometimes one article is indexed in both instances because of updates. This causes a problem: the facet counts are not correct because of these duplicated articles. How can I de-duplicate the counts?

My advice would be not to keep duplicated articles. You need a method to identify these duplicate articles and delete them from one of the SOLR instances.
If you don't want to delete the duplicate articles, you still need to keep track of them.
Knowing which articles in SOLR1 are duplicated in SOLR2 will let you de-duplicate the counts like this:
Create an extra field in SOLR1 named IsDuplicateField, set to true if the article is duplicated in SOLR2 and false otherwise.
When you query SOLR1, also facet on IsDuplicateField=true.
When retrieving the results, subtract the IsDuplicateField count returned by SOLR1 from the total facet count.
In this setup the IsDuplicateField facet counts all the articles that are duplicated and match your query.
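The subtraction step could look roughly like the sketch below. It assumes the facet counts from each instance have already been parsed into plain dictionaries, and that the duplicate counts are available per facet value (for example by repeating the facet request restricted to IsDuplicateField:true, which is an assumption and not something stated above). The function name and the sample values are hypothetical.

```python
def dedupe_facet_counts(solr1_counts, solr2_counts, duplicate_counts):
    """Merge facet counts from two instances and subtract known duplicates.

    All three arguments map a facet value to a count. duplicate_counts is
    assumed to hold, per facet value, how many matching SOLR1 articles are
    flagged as duplicated in SOLR2.
    """
    merged = {}
    for value in set(solr1_counts) | set(solr2_counts):
        total = solr1_counts.get(value, 0) + solr2_counts.get(value, 0)
        # Each duplicated article was counted once per instance, so remove one copy.
        merged[value] = total - duplicate_counts.get(value, 0)
    return merged


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(dedupe_facet_counts({"news": 5, "sports": 2}, {"news": 3}, {"news": 2}))
```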
Good luck !

Related

Solr/Lucene result field term count

I am using Solr to do a search. As a result I get back a set of fields. One of the fields is "domains". The domain field is a many-to-many relationship in my database, so my docs contain an array of the "domains" they are linked to.
What I want to do is, for each domain in the resultset, count how many times this "domain term" is found in the global result set.
How should I do this ?
You need to look at the Field collapsing feature.
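To make the requested count concrete, here is a small client-side sketch of what "how many documents in the result set mention each domain" means. It is not the field collapsing feature itself; it assumes the result documents have already been fetched as dictionaries with a "domains" list, and the function name is hypothetical.

```python
from collections import Counter


def count_domains(docs):
    """Count, for each domain, how many documents in docs list it."""
    counts = Counter()
    for doc in docs:
        # set() so a domain repeated inside one document is counted once.
        counts.update(set(doc.get("domains", [])))
    return counts


if __name__ == "__main__":
    sample = [{"domains": ["a", "b"]}, {"domains": ["a"]}, {}]
    print(count_domains(sample))
```

This only counts over the documents actually retrieved, so it would miss matches beyond the current page of results.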

Get models with distinct attribute ActiveRecord

I have a bunch of records in my database which all have the same Title but different Locations. Once I filter by within a location boundary, I want to filter out ones with the same Title. Is there an ActiveRecord way to do this? I know about select, but that will only return titles, and I actually need the entire record.
So I have a Business which has a Title. If I select all of the businesses within a given lat/long boundary, multiple instances with the same name (say, Subway) will be returned. I want to limit the result to 10.
In English: Give me ten records (the entire record, not just certain columns) where every title is unique among the ten returned.
You can simply use .first, i.e.
Venue.where(name: "Subway").first
If you need more than one element, pass a parameter to first:
Venue.where(name: "Subway").first(10)
To select one entry per distinct value in some column, you can use .group("column_name"):
Venue.where(some_condition).group("name")
ModelName.where(title: "Building")
If you provide a more specific question, I'll provide a more specific answer...
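For reference, the behaviour the question asks for (ten whole records, each with a distinct title) can be sketched independently of ActiveRecord. The Python below is only an illustration of the selection logic, assuming the location-filtered records are already loaded as dictionaries; the function name and field names are hypothetical.

```python
def first_unique_by(records, key, limit):
    """Return up to `limit` records, keeping only the first record per key value."""
    seen = set()
    picked = []
    for record in records:
        value = record[key]
        if value in seen:
            continue  # a record with this title was already kept
        seen.add(value)
        picked.append(record)
        if len(picked) == limit:
            break
    return picked


if __name__ == "__main__":
    businesses = [
        {"title": "Subway", "location": "1st St"},
        {"title": "Subway", "location": "2nd St"},
        {"title": "Cafe", "location": "3rd St"},
    ]
    print(first_unique_by(businesses, "title", 10))
```

Doing this in application code means fetching more rows than are returned, so for large result sets a database-side grouping is likely preferable.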

MongoDB infinite scroll sorted results

I am having a problem trying to achieve the following:
I'd like to have a page with 'infinite' scrolling functionality, with all the fetched results sorted by certain attributes. The way the code currently works is that it issues the query, sorts the results, and displays them. The problem is that once the user reaches the bottom of the page and a new query is issued, the results from this query are sorted, but only in their own context. That is, if you have a total of 100 results and the first query displays only 50, then those 50 are sorted. But the next query (for the next 50) sorts the results based only on these 50 results, not on all 100.
So, do I have to fetch all the results at once, sort them, and then apply some pagination logic to them, or is there a way for MongoDB to support infinite scrolling (AJAX requests) with sorting applied across all the results?
There are a few ways to do this with MongoDB. You can use the .skip() and .limit() cursor methods (documented here: http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-CursorMethods) to paginate the query, applying the same sort to every request so that each page is a slice of one overall ordering.
Alternatively, you could add a clause to your query like: {sorted_field : {$gt : <value from last record>}}. In other words, exclude matches whose sorted value is less than or equal to that of the last item on the current page. For example, if page 1 returns documents A through D, then to retrieve page 2 you repeat the same query with the additional filter x > D.
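As a rough sketch of the second approach, the helper below builds the filter document for the next page from the original filter and the last value seen. It only constructs a dictionary and does not talk to MongoDB; the function name, the field name x, and the status field are hypothetical. It also assumes ascending sort order and that the sort field's values are unique, since ties at a page boundary would otherwise be skipped.

```python
def next_page_filter(base_filter, sort_field, last_value=None):
    """Return a copy of base_filter restricted to values after last_value.

    Note: this overwrites any existing condition on sort_field in base_filter.
    """
    query = dict(base_filter)
    if last_value is not None:
        # Only documents sorted strictly after the last one already shown.
        query[sort_field] = {"$gt": last_value}
    return query


if __name__ == "__main__":
    first_page = next_page_filter({"status": "published"}, "x")
    second_page = next_page_filter({"status": "published"}, "x", last_value="D")
    print(first_page)
    print(second_page)
```

The resulting dictionary would then be passed to the query together with the same ascending sort on the field and a limit equal to the page size.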
Let me preface this by saying that I have no experience with MongoDB (though I am aware that it is a NoSQL database).
This question, however, is something of a general database one (you'd probably get more responses by tagging it as such). I've implemented such a feature using Cassandra (another, albeit quite different, NoSQL database), and the same principles apply.
Use the sorted-by attribute of the last retrieved record, and conduct a range search based on it in the database. So, assuming your database consists of the following set of letters:
A
B
C
D
E
F
G
..and you were retrieving 2 letters at a time, you'd retrieve A, B first. When more records are needed, you'd use B to conduct a range search on the set of letters in the database. In plain English this would be something like:
Get the letters that appear after B, limit the results to 2
From a brief look at the MongoDB tutorial, it looks like you have conditional operators to help you implement this.
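The letters example can be simulated in a few lines of Python. This is only an in-memory illustration of the range-search idea, with a sorted list standing in for the database; the function name is hypothetical.

```python
from bisect import bisect_right


def page_after(sorted_items, last_seen, page_size):
    """Return the next page_size items that sort after last_seen.

    sorted_items must already be in ascending order. Pass last_seen=None
    to get the first page.
    """
    start = 0 if last_seen is None else bisect_right(sorted_items, last_seen)
    return sorted_items[start:start + page_size]


if __name__ == "__main__":
    letters = list("ABCDEFG")
    first = page_after(letters, None, 2)       # A, B
    second = page_after(letters, first[-1], 2)  # the letters that appear after B
    print(first, second)
```

Because each request carries only the last value seen, the client does not need to know how many items were skipped, and every page is consistent with one global ordering.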

Solr: Search in multiple fields BUT STOP if documents match was found

I want to search in multiple fields in Solr.
(I know the concept of copy fields and I know the (e)dismax search handler.)
So I have an ordered list of fields that I want the terms to be searched against.
1.) SKU
2.) Name
3.) Description
4.) Summary
and so on.
Now, when the query matches a term, let's say in the SKU field, I want this match and no further searches in the subsequent fields.
Only, if there are NO matches at all in the first field (SKU field), the second field (in this case "name") should be used and so on.
Is this possible with Solr?
Do I have to implement my own Lucene Search Handler for this?
Any advice is welcome!
Thank you,
Bernhard
I think your case requires executing 4 different searches. If you implement your own SearchHandler, you could avoid the overhead of collecting results across 4 separate requests: you would send one query, and the custom SearchHandler would execute the 4 searches and prepare a single result set.
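On the client side, the "search each field in order and stop at the first match" logic might be sketched as follows. The search callable is a stand-in for whatever actually sends a single-field query to Solr; it, the function name, and the sample index are all hypothetical.

```python
def search_with_fallback(search, term, fields):
    """Try each field in order and return the first non-empty result.

    search is any callable taking (field, term) and returning a list of hits.
    Returns (matched_field, hits), or (None, []) when no field matches.
    """
    for field in fields:
        hits = search(field, term)
        if hits:
            return field, hits  # stop: later fields are never queried
    return None, []


if __name__ == "__main__":
    # Tiny in-memory stand-in for the index, for illustration only.
    fake_index = {"sku": {}, "name": {"widget": ["doc7"]}, "description": {"widget": ["doc7", "doc9"]}}

    def fake_search(field, term):
        return fake_index.get(field, {}).get(term, [])

    print(search_with_fallback(fake_search, "widget", ["sku", "name", "description", "summary"]))
```

In the worst case this issues one request per field, which is the round-trip cost a custom SearchHandler would avoid.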
If my guess is right you want to rank the results based on the order of the fields. If so then you can just use standard query like
q=sku:(query)^4 OR name:(query)^3 OR description:(query)^2 OR summary:(query)
this will rank the results by the order of the fields.
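A query string of that shape can be generated from the ordered field list, as in this sketch. The function name is hypothetical, the boosts simply count down from the number of fields, and the term is inserted as-is, so any escaping of special query characters is left out.

```python
def boosted_query(term, fields):
    """Build 'f1:(term)^N OR f2:(term)^N-1 ...' with earlier fields boosted higher."""
    clauses = []
    for position, field in enumerate(fields):
        boost = len(fields) - position
        clause = f"{field}:({term})"
        if boost > 1:
            clause += f"^{boost}"  # the last field keeps the default boost
        clauses.append(clause)
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(boosted_query("query", ["sku", "name", "description", "summary"]))
```

Note that boosting only influences ranking: documents matching in lower-priority fields are still returned, which differs from stopping at the first field that matches.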
Hope it helps.

Apache SOLR search by category

I am using apache-solr-1.4.1 and jdk1.6.0_14.
I have the following scenario.
I have 3 categories of data indexed in SOLR i.e. CITIES, STATES, COUNTRIES.
When I query data from SOLR I need the search result from SOLR based on the following criteria:
In a single query to SOLR I need data fetched from SOLR grouped by each category with a predefined results count for each category.
How can I specify this condition in SOLR?
I have tried to use SOLR Field Collapsing feature, but I am not able to get the desired output from SOLR.
Please suggest.
My solution is not exactly what you asked for, but it is my take on what SOLR does best, which is full-text search. Instead of grouping the results by "category", I'd suggest you order the results by relevance score but also provide a facet count for the category values. In my experience users expect a "search" to behave like Google, with the best matches at the top. Deviating from this norm confuses the user in most cases.
If you want exactly what you asked for (actual results grouped by category), then you could use a relational database and do a group_by, or write a custom function query with SOLR (I cannot advise on this as I've never done it).
More info: index the data with the appropriate fields, e.g. name, population, etc. But also add a field called "category", which would have a value of either CITIES, STATES or COUNTRIES. Then perform a standard SOLR search, which will return results in order of relevance, i.e. best matches at the top. As part of the request, you can specify facet.field=category, which will return counts of the search results for each of the given categories (in the "facet" section of the response). In the UI you can then create a link for each category facet that performs the original search plus &fq=category:CITIES, etc., thus restricting the results to just that category. See the faceting overview on the SOLR wiki for more info.
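The request parameters described above could be assembled like this. The sketch only builds the query string; the function name is hypothetical, and facet=true is included on the assumption that faceting has to be switched on explicitly for the request.

```python
from urllib.parse import urlencode


def search_params(query, category=None):
    """Build the query string for a relevance search with category facet counts.

    When category is given, an fq filter restricts results to that category,
    as a facet link in the UI would.
    """
    params = [("q", query), ("facet", "true"), ("facet.field", "category")]
    if category is not None:
        params.append(("fq", f"category:{category}"))
    return urlencode(params)


if __name__ == "__main__":
    print(search_params("springfield"))             # initial search with facet counts
    print(search_params("springfield", "CITIES"))   # drill-down link for one category
```

The resulting string would be appended to the select URL of the Solr instance being queried.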