Apache Solr set domain priority

I crawled 3 domains with Nutch (domain01, domain02 and domain03).
I want to get all posts which contain a specific keyword (e.g. "champions league"), and then have the results show the posts from domain02 first, the posts from domain01 next, and the posts from domain03 last. Simply put, I want to sort them by domain priority.
Is there a way to set the priority of domains?

If you always have the same order of domains, then you can use either an index-time document-level boost, or a query-time sort by domain (or a domain-order field) and then by score (see the example after this answer).
If the domain order depends on the query, you can use the QueryElevationComponent, though I think you then have to provide the full list of IDs for each elevation rule, and it may not preserve a specific sequence.
You could also write your own custom function query or a component similar to the QueryElevationComponent.
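For example, assuming you add a hypothetical integer field such as domain_priority at index time (1 for domain02, 2 for domain01, 3 for domain03), a query-time sort along these lines should give the order you describe:
q="champions league"&sort=domain_priority asc,score desc
Solr applies the sort clauses left to right, so documents are ordered by the domain field first and by relevance within each domain.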

Related

How can I filter results by number of associated objects in Apache Solr?

Let's say I have an Apache Solr index with posts and comments, connected via post_id. How can I query for MoreLikeThis posts with more than 1 comment?
You have to customize the Apache Solr search results, or simply write a hook / function for the search results which decreases the ranking of a result. Some ideas are presented here and here
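If the number of comments is also indexed on each post (a hypothetical comment_count field you update whenever comments change), another option is to filter rather than re-rank, e.g. via the MoreLikeThis handler:
/mlt?q=id:12345&mlt.fl=body&fq=comment_count:[2 TO *]
Here the fq clause restricts the similar posts to those with at least 2 comments; comment_count, body and the example id are assumed names, not part of the original question.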

CategoryId in venues search not working correctly

In foursquare Api documentation for "Search venues" https://developer.foursquare.com/docs/venues/search it states
"categoryId - A comma separated list of categories to limit results to. This is an experimental feature and subject to change or may be unavailable. If you specify categoryId you may also specify a radius. If specifying a top-level category, all sub-categories will also match the query."
I realise it's supposed to be experimental, but when I provide the Food category, i.e. 4d4b7105d754a06374d81259, it only returns a few local results; the rest are miles away. However, if I execute the same search on the website using the Food category, it correctly returns lots of results. I'm assuming the last bit, "If specifying a top-level category, all sub-categories will also match the query", is not working, i.e. it's not searching sub-categories?
Is there any fix or workaround for this?
Thanks,
Neil Pepper
You're making a /venues/search request with its default intent of intent=checkin. This restricts the response to nearby results, heavily biased by distance, since it's trying to guess where the user might be checking in.
Foursquare Explore uses the /venues/explore endpoint and attempts to return recommended results for a query. If you want the sort of results you get in that tool, call /venues/explore?section=food
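A sketch of such a request (the coordinates, credentials and version date are placeholders you supply yourself):
https://api.foursquare.com/v2/venues/explore?ll=51.50,-0.12&section=food&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&v=20120101
The explore endpoint needs a location (ll or near) plus your credentials, and returns recommended venues rather than the distance-biased check-in list.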

Flickr Geo queries not returning any data

I cannot get the Flickr API to return any data for lat/lon queries.
http://api.flickr.com/services/rest/?method=flickr.photos.search&media=photo&api_key=KEY_HERE&has_geo=1&extras=geo&bbox=0,0,180,90
This should return something, anything. It doesn't work if I use lat/lng either. I can get some photos returned if I look up a place_id first and then use that in the query, except then all the photos returned are from anywhere and not from the place_id.
E.g.,
http://api.flickr.com/services/rest/?method=flickr.photos.search&media=photo&api_key=KEY_HERE&placeId=8iTLPoGcB5yNDA19yw
I deleted my key, obviously; replace it with yours to test.
Any help appreciated, I'm going mad over this.
I believe that the Flickr API won't return any results if you don't put additional search terms in your query. If I recall from the documentation, this is treated as an unbounded search. Here is a quote from the documentation:
Geo queries require some sort of limiting agent in order to prevent the database from crying. This is basically like the check against "parameterless searches" for queries without a geo component.
A tag, for instance, is considered a limiting agent as are user defined min_date_taken and min_date_upload parameters — If no limiting factor is passed we return only photos added in the last 12 hours (though we may extend the limit in the future).
My app uses the same kind of geo searching, so what I do is add an additional search term for the minimum date taken, like so:
http://api.flickr.com/services/rest/?method=flickr.photos.search&media=photo&api_key=KEY_HERE&has_geo=1&extras=geo&bbox=0,0,180,90&min_taken_date=2005-01-01 00:00:00
Oh, and don't forget to sign your request and fill in the api_sig field. In my experience, the geo-based searches don't behave consistently unless you attach your api_key and sign your search. For example, I would sometimes get search results and then later, with the same search, get no images when I hadn't signed my query.
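For reference, the old-style Flickr signature is the MD5 hash of your API secret followed by all parameter names and values concatenated in alphabetical order of the parameter names; check the current auth documentation before relying on this. A minimal Java sketch, using the example query above (API_KEY and API_SECRET are placeholders):

import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

public class FlickrSign {
    public static void main(String[] args) throws Exception {
        String apiSecret = "API_SECRET";   // placeholder: your API secret

        // Request parameters; TreeMap keeps them sorted by name, as the signature requires.
        Map<String, String> params = new TreeMap<String, String>();
        params.put("method", "flickr.photos.search");
        params.put("api_key", "API_KEY");  // placeholder: your API key
        params.put("media", "photo");
        params.put("has_geo", "1");
        params.put("extras", "geo");
        params.put("bbox", "0,0,180,90");
        params.put("min_taken_date", "2005-01-01 00:00:00");

        // api_sig = md5(secret + name1 + value1 + name2 + value2 + ...)
        StringBuilder toSign = new StringBuilder(apiSecret);
        for (Map.Entry<String, String> e : params.entrySet()) {
            toSign.append(e.getKey()).append(e.getValue());
        }
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(toSign.toString().getBytes("UTF-8"));
        StringBuilder apiSig = new StringBuilder();
        for (byte b : digest) {
            apiSig.append(String.format("%02x", b & 0xff));
        }
        System.out.println("api_sig=" + apiSig);
    }
}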

Getting Lucene to return only unique threads (indexing both threads and posts)

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm building the ability to search this content via Lucene, and if possible I have decided I would like to index individual posts (it makes the index easier to update, and means I have more control and the ability to tweak the results) rather than index entire threads. The problem I have, however, is that I want the search to display a list of threads, rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?
Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadIds.
The tricky part is specifying how many results to return. If you want to show, say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the returned set will be de-duped out because they are posts belonging to the same thread. You'll need some extra logic that runs another Lucene search if the de-duped set has fewer than 10 results (a sketch follows below).
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
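A rough sketch of that loop in Java (searcher, query and the threadId field name are assumptions; the fetch window simply doubles until enough unique threads are found or the matches run out):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ThreadDedup {
    // Returns the top documents, keeping only the first (best-scoring) post per thread.
    static List<Document> topUniqueThreads(IndexSearcher searcher, Query query, int wanted)
            throws IOException {
        Set<String> seenThreads = new HashSet<String>();
        List<Document> unique = new ArrayList<Document>();
        int window = wanted;
        while (true) {
            seenThreads.clear();
            unique.clear();
            TopDocs hits = searcher.search(query, window);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                if (seenThreads.add(doc.get("threadId"))) {   // first post seen for this thread
                    unique.add(doc);
                    if (unique.size() == wanted) {
                        return unique;
                    }
                }
            }
            if (hits.scoreDocs.length < window) {
                return unique;                                // index exhausted, return what we have
            }
            window *= 2;                                      // widen the window and retry
        }
    }
}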
When you index the threads, you should break each thread into posts and make each post a Document with a field containing a unique ID identifying the thread to which it belongs.
When you do the search implementation, I would recommend using Lucene 2.9 or later, which lets you use a Collector. Collectors let you preprocess the retrieved documents, so you'll be able to group together posts that originate from the same thread ID.
Just for completeness, recent Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use case:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html
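As a rough illustration of the grouping module (this sketch uses the GroupingSearch convenience class from the later 4.x releases rather than the two-pass collectors shipped with 3.2, and assumes each post Document carries an indexed, untokenized threadId field; the exact field requirements differ between versions):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;

public class ThreadGrouping {
    // Top 10 threads for a query, keeping only the best-scoring post of each thread.
    static TopGroups<BytesRef> searchThreads(IndexSearcher searcher, Query query)
            throws Exception {
        GroupingSearch grouping = new GroupingSearch("threadId");
        grouping.setGroupDocsLimit(1);               // one post per thread in the result
        return grouping.search(searcher, query, 0, 10);
    }
}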

How are the Apache Solr search results sequenced?

I have the Apache Solr 6.x-1.0-rc3 module installed on my site and it works fine. I wanted to know how the Apache Solr search results are sequenced. I have tried a few things and have concluded that it's not alphabetical and not by the most recently updated node.
How are the search results sequenced? I mean, in what order, or by what logic?
Basically, results are scored on the strength of the match, with a few boosting factors added. There's a great breakdown of the algorithm here:
http://www.supermind.org/blog/378/lucene-scoring-for-dummies
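Roughly, for the classic TF-IDF similarity Lucene/Solr used at the time, the practical scoring function is
score(q,d) = coord(q,d) * queryNorm(q) * sum over terms t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
i.e. how often each query term occurs in the document (tf), how rare the term is across the index (idf), any query-time boosts, and index-time/field-length boosts (norm), summed over the query terms.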
Yes, Solr results are sorted by relevance score by default. The fields within a single result should come back in the same order in which they were committed (certainly within multi-valued fields, and it should be true for all fields), so you should be able to preserve any field ordering you set up in your processing script. If you just return all results with no sort (e.g. a match-all query), I think they come back in the order they were added to the index.
The Solr Wiki page (http://wiki.apache.org/solr/CommonQueryParameters) says the default sort order is score desc.