How to get Solr results with at most one document from each website - indexing

I am using Solr 4.10.3. I have crawled some documents from the web. When I query, I see something I do not want: more than one result is shown from the same website.
What I want is that at most one result from each website is shown in the results.

Use grouping on host or whatever else you have that uniquely identifies a website: https://cwiki.apache.org/confluence/display/solr/Result+Grouping
group=true&group.field=host
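
As a minimal sketch of how such a request could be assembled from client code, the following builds the query string with the Result Grouping parameters. The base URL, the core name collection1, and the field name host are assumptions for illustration; group.limit=1 is spelled out even though 1 is the default number of documents per group.

```python
from urllib.parse import urlencode


def grouped_query_url(base_url, query, group_field="host", per_group=1):
    """Build a Solr select URL that returns at most `per_group` docs per group."""
    params = {
        "q": query,
        "group": "true",             # turn Result Grouping on
        "group.field": group_field,  # field that identifies a website (assumed: host)
        "group.limit": per_group,    # documents returned per group
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance and core name.
url = grouped_query_url("http://localhost:8983/solr/collection1/select", "solr grouping")
print(url)
```

The grouped response nests the documents under one entry per host value, so the client reads one document from each group instead of a flat result list.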

Related

How to get all links and (some of) the link pages categories for a Wikipedia page?

Is it possible for a given Wikipedia page (e.g. Dolphin) to get all links and for every link its categories (at least some, let's say 5 for every link)?
I want to do this in a single query/API call
I tried:
https://en.wikipedia.org/w/api.php?action=query&titles=Dolphin&generator=links&prop=categories&gpllimit=20
this returns the links, but only categories for one link.
Your query should work, but it returns only 10 categories in total for the whole query because cllimit defaults to 10. If you set cllimit=3, only 3 categories are returned for the whole query:
https://en.wikipedia.org/w/api.php?action=query&titles=Dolphin&generator=links&prop=categories&gpllimit=20&cllimit=3
But if you use cllimit=max, you will get the categories for every link:
https://en.wikipedia.org/w/api.php?action=query&titles=Dolphin&generator=links&prop=categories&gpllimit=20&redirects=true&cllimit=max
Also, do not forget to add &redirects=true to resolve redirected links, because redirect pages themselves have no categories.
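
A minimal sketch of building that request and reading the categories per linked page could look like this. The format=json parameter is an addition not present in the URLs above, the helper names are hypothetical, and the sample response is hand-written for illustration in the shape the API uses for format=json, not real API output.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def links_with_categories_url(title, link_limit=20):
    """Build the generator=links query with cllimit=max and redirects resolved."""
    params = {
        "action": "query",
        "titles": title,
        "generator": "links",
        "gpllimit": link_limit,
        "prop": "categories",
        "cllimit": "max",
        "redirects": "true",
        "format": "json",  # assumption: JSON output is wanted
    }
    return API + "?" + urlencode(params)


def categories_by_page(response):
    """Map each returned page title to the list of its category titles."""
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]: [c["title"] for c in page.get("categories", [])]
        for page in pages.values()
    }


# Hand-written sample in the shape of a format=json response (illustrative only).
sample = {
    "query": {
        "pages": {
            "1": {"title": "Whale", "categories": [{"ns": 14, "title": "Category:Cetaceans"}]},
            "2": {"title": "Porpoise"},
        }
    }
}
print(links_with_categories_url("Dolphin"))
print(categories_by_page(sample))
```

Pages that come back without a categories key simply map to an empty list, which makes it easy to spot links whose categories were cut off by the limit.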

Google GeoCoding API multiple search results

Currently I am using the following URL to get the addresses and coordinates of a search string in my app.
https://maps.googleapis.com/maps/api/geocode/json?address=<Address>&sensor=false
The URL returns multiple results only when the string is given in a particular format. For example, I am trying to search for a business named "Yashoda Hospital, Hyderabad" and the results give only one address, although the same search string gives multiple results on the Google Maps website. But when I enter the string with a slight variation ("Yashoda Hospitals, Hyderabad") by adding an 's', it returns two addresses.
Is there a particular URL through which we can get all the possible results? Or what should be the approach to capture all the possible addresses that are returned on the Google Maps website?

Sample code of java lucene indexing and searching for creating one document per line

I am very new to Lucene. I have a text file containing hundreds of records with two columns per line. The first column is the userid and the second is the url_list (I guess those will be my document fields).
I need to provide a search feature using Lucene which returns the document containing the entered URL or userid. For that, I need to create one Lucene document per line of my text file.
Please suggest some sample code for this.
I am using Lucene version 3.6.2.
Here is a short but fantastic tutorial on Lucene for starters.
Lucene in 5 minutes
Steps
1) I assume that you are pre-parsing the text file to extract each userid and its corresponding URL list. You have to do this yourself; Lucene won't help. Lucene tokenizes the text that belongs to a single field, but it won't split a line and route the userid to the userid field and the URLs to the URL field.
2) Read the above tutorial. I highly recommend using the latest version of Lucene, which is 4.1 as of now.
3) Things to remember that are specific to your use case:
Have two fields for each document: USER_ID, URL (of course you may change those names)
Do not ANALYZE (break into tokens) the content of USER_ID field.
I am not sure how you want to index the URL field. You may either not ANALYZE it, or use the StandardAnalyzer, which recognizes a URL without tokenizing it.
4) You can find the sample code to index, query, search, retrieve results in the tutorial.
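
The pre-parsing in step 1 is independent of Lucene and can be sketched in a few lines. The sketch below is in Python for brevity and assumes a file layout the question does not specify: a tab between the two columns and commas between the URLs in the list. Each resulting record holds exactly the values that would go into the USER_ID and URL fields of one Lucene document.

```python
def parse_line(line):
    """Split one line into a USER_ID and a list of URLs; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    # Assumption: a tab separates the two columns.
    user_id, _, rest = line.partition("\t")
    # Assumption: URLs inside the second column are comma-separated.
    urls = [u.strip() for u in rest.split(",") if u.strip()]
    return {"USER_ID": user_id.strip(), "URL": urls}


def parse_records(lines):
    """Turn an iterable of lines into one record per non-blank line."""
    records = []
    for line in lines:
        record = parse_line(line)
        if record is not None:
            records.append(record)
    return records


# Made-up sample lines standing in for the text file.
sample_lines = [
    "u1\thttp://a.example/x,http://b.example/y\n",
    "\n",
    "u2\thttp://c.example/\n",
]
records = parse_records(sample_lines)
print(records)
```

Adding each URL of a record as its own value of the URL field, rather than one joined string, keeps an exact-match lookup on a single URL possible when the field is not analyzed.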

Apache SOLR search by category

I am using apache-solr-1.4.1 and jdk1.6.0_14.
I have the following scenario.
I have 3 categories of data indexed in SOLR i.e. CITIES, STATES, COUNTRIES.
When I query data from SOLR I need the search result from SOLR based on the following criteria:
In a single query to SOLR I need data fetched from SOLR grouped by each category with a predefined results count for each category.
How can I specify this condition in SOLR?
I have tried to use SOLR Field Collapsing feature, but I am not able to get the desired output from SOLR.
Please suggest.
My solution is not exactly what you have asked for, but it is my take on what SOLR does best, which is full-text search. Instead of grouping the results by "category", I'd suggest you order the results by relevance score but also provide a facet count for the category values. In my experience, users expect a "search" to behave like Google, with the best matches at the top. Deviating from this norm confuses the user in most cases.
If you want exactly as you have asked (actual results grouped by category) then you could use a relational database and do a group_by or write a custom function query with SOLR (I cannot advise on this as I've never done it).
More info: index the data with the appropriate fields, e.g. name, population, etc. But also add a field called "category", which would have a value of either CITIES, STATES or COUNTRIES. Then perform a standard SOLR search, which will return results in order of relevance, i.e. best matches at the top. As part of the request, you can specify facet.field=category, which will return counts of the search results for each of the given categories (in the "facet" results section). In the UI you can then create a link for each category facet that performs the original search plus &fq=category:CITIES, etc., thus restricting results to just that category. See the faceting overview on the SOLR wiki for more info.
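
The two requests described above (the faceted search and the per-category drill-down) could be built roughly as follows. This is a sketch under assumptions: the base URL is a hypothetical local instance, the field is named category as suggested above, and facet=true is included because faceting has to be switched on for facet.field to take effect.

```python
from urllib.parse import urlencode


def faceted_search_url(base_url, query, category=None):
    """Build a relevance-ordered search with category facet counts.

    If `category` is given, add a filter query restricting results to it.
    """
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.field", "category"),
        ("wt", "json"),
    ]
    if category is not None:
        params.append(("fq", "category:" + category))
    return base_url + "?" + urlencode(params)


BASE = "http://localhost:8983/solr/select"  # hypothetical Solr endpoint

overview_url = faceted_search_url(BASE, "springfield")
cities_url = faceted_search_url(BASE, "springfield", category="CITIES")
print(overview_url)
print(cities_url)
```

The first URL drives the main result page and its facet counts; the second is what a category link in the UI would point to.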

Show hit documents in the same series together in Lucene

Some articles are written in several parts.
For example, I got these articles from IBM developerWorks:
Distributed data processing with Hadoop, Part 1: Getting started
Distributed data processing with Hadoop, Part 2: Going further
Distributed data processing with Hadoop, Part 3: Application development
I will index those three articles separately. When someone searches for certain keywords, it is possible that Part 3 is at the top of the hits while Part 1 is 32nd. Therefore, if I list the results page by page, Part 1 and Part 3 will be displayed on different pages.
How can I make sure the matching documents in the same series are displayed together?
I guess in SQL, we can use "group by".
I believe what you are asking for is Field Collapsing, which is currently a trunk feature in Solr, and will be incorporated into the next Solr version.
If you want to roll your own, one possible way to do this is:
Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
Make an initial query to Lucene, and get a hit list.
For each hit, check to see if it has a series id; if it does, make another query by the series id in order to retrieve all the members of the series.
An alternative is to store the ids of all the series members in a field inside each member's document.
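
The roll-your-own steps above can be sketched without any Lucene code by letting an in-memory list stand in for the index. Everything here is illustrative: the field names series_id and part, the sample documents, and the fetch_series callback, which in a real system would be the second Lucene query by series id.

```python
def expand_series(hits, fetch_series):
    """Replace the first hit of each series with all members of that series.

    `hits` is the relevance-ordered result of the initial query.
    `fetch_series(series_id)` returns every member of a series, in part order.
    """
    seen = set()
    expanded = []
    for hit in hits:
        series_id = hit.get("series_id")
        if series_id is None:
            expanded.append(hit)       # standalone document, keep as is
        elif series_id not in seen:
            seen.add(series_id)        # first hit of this series: pull in all parts
            expanded.extend(fetch_series(series_id))
        # later hits of an already expanded series are skipped
    return expanded


# In-memory stand-in for the index (made-up documents).
DOCS = [
    {"title": "Hadoop, Part 1", "series_id": 1, "part": 1},
    {"title": "Hadoop, Part 2", "series_id": 1, "part": 2},
    {"title": "Hadoop, Part 3", "series_id": 1, "part": 3},
    {"title": "Unrelated article"},
]


def fetch_from_docs(series_id):
    """Stand-in for the follow-up query by series id."""
    members = [d for d in DOCS if d.get("series_id") == series_id]
    return sorted(members, key=lambda d: d["part"])


# Initial hit list: Part 3 ranks first, Part 1 much lower.
initial_hits = [DOCS[2], DOCS[3], DOCS[0]]
titles = [d["title"] for d in expand_series(initial_hits, fetch_from_docs)]
print(titles)
```

The series appears at the rank of its best-scoring member, so paging has to be done over the expanded list rather than over the raw hit list.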