I have installed Drupal 7 with the Apache Solr Search module and configured it against Apache Solr (version 4.10.4). Content is indexed from Drupal into Solr, and searching works fine. I now need to configure the Apache Nutch web crawler (version 1.12) together with Solr and Drupal 7, so that Nutch fetches pages from a specific URL (e.g. http://www.w3schools.com) and that content becomes searchable from Drupal. My problem is how to configure all three: Solr, Nutch and Drupal 7. Can anyone suggest a solution for this?
Ok... here's my ugly solution, which may fit what you are doing.
You can use a PHP field (a custom field with Display Suite) in your node (or page) which fetches the full external page with cURL and then prints its contents right there. This field should only appear in a display of your node that nobody sees (except Apache Solr).
Finally, in the Solr config (which, honestly, I don't remember well) you can choose which display of the page, or which field, gets indexed; that field will contain your full page.
If all of this works, you don't need to integrate Nutch with Solr and Drupal at all.
Good luck :)
PS: If you have any doubts, just ask.
My 2 cents on this: it looks like you want to aggregate content from your Drupal site (your nodes) together with external content that is hosted on your site but not stored as Drupal content, right? If that is the case, you don't need any integration between Nutch and Drupal; just index everything into the same Solr core/collection. Of course you'll need to make sure that the Solr schema is compatible (Nutch has its own metadata, different from the Drupal nodes). Alternatively, if you index into separate cores/collections, you can use the shards parameter to span your query across several cores and still get a single result set. With this approach you'll need to keep an eye on the relevance of your results (the order of the documents) and also on which fields the Drupal Solr module uses to display results, so in the end you'll still need to make the schemas of both cores compatible to some degree.
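To illustrate the shards approach, here is a minimal sketch of a distributed query URL. The core names ("drupal" and "nutch") and the host/port are assumptions; substitute whatever your own Solr setup uses.

```shell
# Sketch: a single query fanned out over a Drupal core and a Nutch core.
SOLR="localhost:8983/solr"
SHARDS="${SOLR}/drupal,${SOLR}/nutch"
# The request is sent to one core, but &shards= makes Solr query every
# listed core and merge the results into one ranked result set.
echo "http://${SOLR}/drupal/select?q=body:example&shards=${SHARDS}"
```

Keep in mind that merged relevance scores across cores with different schemas can be misleading, which is the ordering caveat mentioned above.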
I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e. documents having textual content only. I found regex-urlfilter.txt to exclude MIME types, but could not find any option to specify the MIME types I do want to crawl. The problem with the regex URL filter is that there can be many MIME types, and the list will grow over time, so it is very difficult to include them all. Is there any way to instruct Nutch to fetch only text/html documents, for example?
The URL filters only work with the URL, which means you can only make assertions based on that. Since the URL filters are executed before the documents are fetched/parsed, there is no mimetype available that could be used to allow/block URLs.
There is one other question: what happens if you specify that you only want to crawl a specific mimetype, but in the current crawl cycle there are no more documents of that type? The crawl would stall until you add more URLs to crawl (manually), or until another URL is due to be fetched.
The normal approach for this is to crawl/parse everything and extract all the links (you never know when a new link matching your requirements could appear), and then only index certain MIME types.
For Nutch 2.x I'm afraid there is currently no mechanism for doing this. In Nutch 1.x we have two:
https://github.com/apache/nutch/tree/master/src/plugin/index-jexl-filter
https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter (soon to be deprecated)
You could port either of these options into Nutch 2.x.
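As a rough sketch of the Nutch 1.x route: after adding index-jexl-filter to plugin.includes, a property along these lines in nutch-site.xml keeps only HTML documents at indexing time. The property name and the JEXL expression below are from memory and should be treated as assumptions; verify them against the plugin's README before relying on this.

```xml
<property>
  <name>index.jexl.filter</name>
  <!-- Assumed expression: index the document only when its detected
       content-type field is text/html; other documents are still
       crawled and parsed, just not sent to Solr. -->
  <value>doc.getFieldValue('type')=='text/html'</value>
</property>
```

The important design point is that filtering happens at the indexing step, so link discovery over non-HTML documents is unaffected.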
I am doing some reverse engineering on a website.
We are using a LAMP stack under CentOS 5, without any commercial or open source framework (Symfony, Laravel, etc.), just plain PHP with an in-house framework.
I wonder if there is any way to know which files on the server have been used to produce a response.
For example, let's say I am requesting http://myserver.com/index.php.
Let's assume that index.php calls other PHP scripts (e.g. to connect to the database and retrieve some info), includes a couple of other HTML files, etc.
How can I get the list of those accessed files?
I have already tried enabling the server-status directive in Apache, and although it works, I can't get what I want (I also passed the 'refresh' parameter).
I also used lsof -c httpd, as suggested in other forums, but it produces a very large output and I can't find what I'm looking for.
I also read the apache logs, but I am only getting the requests that the server handled.
Some other users suggested adding PHP directives like 'self', but that means I would need to know beforehand which files to modify to include that directive (which I don't), and that is precisely what I am trying to find out.
Is it actually possible to trace the internal activity of the server and get those file names and locations?
Regards.
Not that I have tried this yet, but it looks like mod_log_config is the answer to my own question.
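In case it helps anyone else, a minimal mod_log_config sketch: the %f format directive adds the filename on disk that Apache mapped the request to. Note that this records the entry script only (e.g. index.php), not the files PHP itself includes internally. The log name and format string here are just an example.

```apache
# Log client, time, request line, status, and the mapped file (%f)
LogFormat "%h %t \"%r\" %>s %f" withfile
CustomLog logs/access_file_log withfile
```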
I am using Apache Solr 4.10.x, with Apache Nutch being used to crawl and index documents. Now that my crawler is running, I want to know how many documents are indexed on each Nutch iteration, or per day.
Any ideas, or is there any tool provided by Apache Solr for this purpose?
facet=true
facet.date=myDateField
facet.date.start=start_date
facet.date.end=end_date
facet.date.gap=+1MONTH (a duration; +1DAY in your case).
Append all of these to your URL, separated with &, if you are issuing a plain HTTP request.
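Put together, a request against the Solr 4.x HTTP API could look like the sketch below. The core name ("collection1") and date field ("tstamp", Nutch's fetch-time field) are assumptions; substitute your own, and note that the + in the gap and end values must be URL-encoded as %2B.

```shell
# Build a date-faceted query counting indexed documents per day.
SOLR="http://localhost:8983/solr/collection1/select"
PARAMS="q=*:*&rows=0&facet=true"
PARAMS="${PARAMS}&facet.date=tstamp"
PARAMS="${PARAMS}&facet.date.start=NOW/DAY-30DAYS"
PARAMS="${PARAMS}&facet.date.end=NOW/DAY%2B1DAY"
PARAMS="${PARAMS}&facet.date.gap=%2B1DAY"
# rows=0 suppresses the documents themselves; the facet_dates section of
# the response then holds one bucket per day with a document count.
echo "${SOLR}?${PARAMS}"
```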
You can hit a URL with command=status.
For example, in my case it was:
qt=/dataimport&command=status
It gives you the status, like committed or rollback, and the total documents processed...
For more info, check http://wiki.apache.org/solr/DataImportHandler and look for "command".
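For example, polled from the command line; the host, port, and core name below are placeholders, and this assumes the DataImportHandler is registered at /dataimport in your solrconfig.xml:

```shell
# Status endpoint of the DataImportHandler for an assumed core.
URL="http://localhost:8983/solr/collection1/dataimport?command=status&wt=json"
echo "$URL"
# With Solr running, fetch it with: curl "$URL"
# The JSON response reports fields such as "status" and
# "Total Documents Processed".
```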
I tried the example provided with the Apache Solr package.
I was trying to create a new data collection with my own schema and configuration.
How should I start running Solr in that case? When I ran the example, there was a start.jar in the example directory to start it. Will the same jar work for my case?
If not, how do I create an executable for it?
The first line on the Solr install page says: "Solr already includes a working demo server in the example directory that you may use as a template" (http://wiki.apache.org/solr/SolrInstall#Setup).
Even if the recommended server is Tomcat, I have a feeling Jetty will work just as well for you. Making the index production-ready is more about knowing your fields and query patterns really well, and about optimising the index for speed, through the schema and config, according to those patterns.
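Concretely, with the bundled Jetty launcher the usual pattern is to copy the example directory as a template and swap in your own configuration. A sketch, assuming the stock single-core layout of the Solr 3.x download (the paths to your own files are placeholders):

```
$ cp -r example mysolr
$ cd mysolr
$ # replace the demo configuration with your own:
$ cp /path/to/your/schema.xml solr/conf/schema.xml
$ cp /path/to/your/solrconfig.xml solr/conf/solrconfig.xml
$ # same start.jar as the example; Jetty serves your core on port 8983
$ java -jar start.jar
```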
I am new to Apache Solr, and I want to create Solr instances. I have configured Apache Solr 3.2.0 locally on my laptop. How can I create a Solr instance, and what is the meaning of "Solr instance"? Any help will be appreciated.
What is the meaning of "Solr instance"?
A Solr instance is something like a runtime environment: a running Solr service. You use Solr to work with Lucene.
Lucene is a set of Java libraries for creating and accessing the full-text index; on its own it is not an instance or a running system.
To work with the indexed data (import/export/search/...), you need a "mediator" or "agent" like Solr. Solr indexes and accesses the indexed data by using Lucene.
An analogy: on a laptop or PC, the running (Linux/Windows) system is an instance, while the data is stored on the hard disk.
Solr is comparable to the running system; Lucene is comparable to the disk driver that stores data on your system drive.
For a definition of what an instance is, see: http://en.wikipedia.org/wiki/Instance_%28computer_science%29
How can I create a Solr instance?
By starting the Tomcat server, which runs the Solr J2EE component.
So if your Solr is already configured, start Tomcat to start (i.e. create) a Solr instance.
In general, this link is helpful: http://wiki.apache.org/solr/SolrTomcat