I am using Apache Solr 4.10. I need to boost documents that have previously been served for a query, so that they get a better score the next time that query runs. For that, I have to log the document ID together with the query.
How can I save the document ID and the query?
Second, how should I use that logged information to boost those documents?
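Solr 4.10 does not log served document IDs for you, so this is usually done at the application layer: record query/doc-id pairs as results are served, then feed the aggregated counts back as a boost query (`bq`) with the edismax parser. A minimal sketch, assuming a tab-separated click log and an `id` field (both are assumptions, not anything Solr provides):

```python
from collections import Counter

# Hypothetical click log: one "query<TAB>doc_id" line per served document.
log_lines = [
    "solr faceting\tdoc42",
    "solr faceting\tdoc42",
    "solr faceting\tdoc7",
]

def boost_params(query, log_lines, weight=0.5):
    """Build edismax request params that boost docs previously served for this query."""
    clicks = Counter(
        doc for q, doc in (line.split("\t") for line in log_lines) if q == query
    )
    # One bq clause per document, weight scaled by how often it was served.
    bq = ["id:%s^%.1f" % (doc, weight * n) for doc, n in clicks.most_common()]
    return {"q": query, "defType": "edismax", "bq": bq}

params = boost_params("solr faceting", log_lines)
```

The resulting params can be sent to Solr's select handler as-is; documents matching a `bq` clause get their score raised without the clause being required to match.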
I have generated a webgraph db in Apache Nutch using the command 'bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb'. It generated three folders in crawl/webgraphdb: inlinks, outlinks and nodes. Each of these folders contains two binary files, data and index. How can I get a visual web graph in Apache Nutch? What is the use of the web graph?
The webgraph is intended to be one step in a score calculation based on the link structure (i.e. the web graph):
webgraph generates the data structures for the specified segment(s)
linkrank calculates the score based on those structures
scoreupdater writes the score from the webgraph back into the crawldb
Be aware that these tools are very CPU/IO intensive, and that internal links within a website are ignored by default.
You can use the nodedumper command to get useful data out of the webgraph, including the actual score of each node and the highest-scored inlinks/outlinks. This output is not intended to be visualized directly, but you can parse it and generate any visualization you may need.
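Parsing the nodedumper output can be sketched like this; the dump format is assumed here to be tab-separated `url<TAB>score` lines, which may differ between Nutch versions, so adjust the split accordingly:

```python
# Parse a (hypothetical) nodedumper text dump of "url<TAB>score" lines
# and return the top-N scored nodes.
def top_nodes(dump_text, n=3):
    rows = []
    for line in dump_text.strip().splitlines():
        url, score = line.split("\t")
        rows.append((url, float(score)))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

dump = "http://a.example/\t1.25\nhttp://b.example/\t0.75\nhttp://c.example/\t2.00"
best = top_nodes(dump, n=2)
```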
That being said, since Nutch 1.11 the index-links plugin has been added, which allows you to index the inlinks and outlinks of each URL into Solr/ES. I've used this plugin to index into Solr and, together with the sigmajs library, generated graph visualizations of the link structure of my crawls; perhaps this could suit your needs.
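Libraries like sigma.js consume a simple nodes/edges JSON structure. A sketch of turning indexed docs into that shape; the `url` and `outlinks` field names are assumptions, not necessarily what index-links produces in your schema:

```python
import json

# Hypothetical docs as indexed with link information: each doc has a
# url and its outlinks. The field names here are assumptions.
docs = [
    {"url": "http://a.example/", "outlinks": ["http://b.example/"]},
    {"url": "http://b.example/", "outlinks": []},
]

def to_sigma_graph(docs):
    """Build a sigma.js-style {nodes, edges} structure from crawl docs."""
    nodes = [{"id": d["url"], "label": d["url"]} for d in docs]
    edges = [
        {"id": "e%d" % i, "source": d["url"], "target": t}
        for i, (d, t) in enumerate((d, t) for d in docs for t in d["outlinks"])
    ]
    return {"nodes": nodes, "edges": edges}

graph = to_sigma_graph(docs)
payload = json.dumps(graph)  # serve this to the sigma.js front end
```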
I am trying to use a Lucene index on a remote server as an input for Carrot2, installed on the same server. According to the documentation this should be possible with carrot2-dcs (chapter 3.4, Carrot2 Document Clustering Server: "Various document sources included. Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr).").
After installing carrot2-dcs 3.9.3 I discovered that Lucene isn't available as a document source. How should I proceed?
To cluster content from a Lucene index, the index needs to be available on the server the DCS is running on (either through the local file system or e.g. as an NFS mount).
To make the Lucene source visible in the DCS:
Open for editing: war/carrot2-dcs.war/WEB-INF/suites/source-lucene-attributes.xml
Uncomment the configuration sections and provide the location of your Lucene index and the fields that should serve as the documents' titles and content (at least one is required). Remember that these fields must be "stored" in Lucene speak.
Make sure the edited file is packed back into the WAR archive, then run the DCS. You should now see the Lucene document source.
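The edit-and-repack steps can be scripted, since a WAR is just a zip archive. A sketch using Python's zipfile module, demonstrated on a synthetic archive (in practice the archive would be war/carrot2-dcs.war and the member WEB-INF/suites/source-lucene-attributes.xml; the transform you apply is up to you):

```python
import os
import shutil
import zipfile

def edit_in_war(war_path, member, transform):
    """Rewrite one member of a WAR (zip) archive via a transform function."""
    tmp_path = war_path + ".tmp"
    with zipfile.ZipFile(war_path) as src, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == member:
                data = transform(data)
            dst.writestr(item, data)
    shutil.move(tmp_path, war_path)  # zip members can't be edited in place

# Demo on a synthetic WAR with the same member path as above.
with zipfile.ZipFile("demo.war", "w") as z:
    z.writestr("WEB-INF/suites/source-lucene-attributes.xml",
               "<!-- config --><config/>")
edit_in_war("demo.war", "WEB-INF/suites/source-lucene-attributes.xml",
            lambda b: b.replace(b"<!-- config -->", b""))
with zipfile.ZipFile("demo.war") as z:
    content = z.read("WEB-INF/suites/source-lucene-attributes.xml")
os.remove("demo.war")
```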
My requirement is like job sites, where a user can upload a document (PDF, text or Word) such as a resume/CV. All these documents must then be searchable by a specific keyword or a combination of keywords, and the results have to be ranked by those keywords. I need to know which technology is good from a performance point of view when the number of files is huge and there is also a large volume of search and indexing requests.
The website is built on SQL Server. So can I store those files in SQL Server? Would that be good in terms of performance?
Or can it be done with Lucene.NET alone, storing those files in a single folder?
I think the best suggestion is to use Lucene.
You can save your documents as they are, under some unique path/file name, and use that as the identifier when you index the documents. I am sure you can find a lot of similar examples if you search for Lucene.
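The path-as-identifier idea can be sketched with a toy in-memory inverted index (Python here for illustration, not Lucene.NET itself; in Lucene you would add the path as a stored string field on each document):

```python
import re

# Toy inverted index: the file path serves as the unique document
# identifier, the way a stored path field would in Lucene(.NET).
index = {}  # term -> set of document paths

def index_document(path, text):
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(term, set()).add(path)

def search(term):
    return sorted(index.get(term.lower(), set()))

index_document("/resumes/alice.pdf", "Java developer with Solr experience")
index_document("/resumes/bob.docx", "C# developer, Lucene.NET experience")
hits = search("developer")
```

On a hit, the returned path leads straight back to the original file on disk, so the index never needs to store the file contents themselves.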
I want to build a search engine like Google, where if I enter a search term it retrieves the URLs of websites.
I used Lucene with Tomcat, but it only searches the files residing on my system.
I want to search throughout the web. Please tell me how to do this using Lucene.
If this can't be done using Lucene, please suggest alternatives.
Use Nutch.
I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."
Is it possible to include page numbers in the query result like this?
It would require some development effort, but you could achieve this by indexing each page of each document as a separate Solr document, and then using field collapsing to group the different page hits for each document.
Note that you would need a nightly build for this; field collapsing is not implemented in any currently released Solr version.
Also note: field collapsing has since been implemented in Solr 3.3. More updates are expected in the next major version (Solr 4.0).
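The grouping step itself looks like this; each hit below is a hypothetical page-level Solr document with assumed `doc_name` and `page` fields, and collapsing them yields the "pages 2, 3 and 5" style answer from the question:

```python
from collections import OrderedDict

# Hypothetical page-level hits: one Solr document per PDF page, with
# assumed field names doc_name and page.
hits = [
    {"doc_name": "bar.pdf", "page": 2},
    {"doc_name": "bar.pdf", "page": 3},
    {"doc_name": "baz.pdf", "page": 1},
    {"doc_name": "bar.pdf", "page": 5},
]

def collapse_by_document(hits):
    """Group page hits per parent document, like field collapsing on doc_name."""
    groups = OrderedDict()
    for h in hits:
        groups.setdefault(h["doc_name"], []).append(h["page"])
    return {doc: sorted(pages) for doc, pages in groups.items()}

grouped = collapse_by_document(hits)
# e.g. "term foo was found in bar.pdf on pages 2, 3 and 5"
```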