How can I create a Solr instance

I am new to Apache Solr, and I want to create Solr instances. I have configured Apache Solr 3.2.0 locally on my laptop. How can I create a Solr instance, and what is the meaning of a Solr instance? Any help will be appreciated.

What is the meaning of a Solr instance?
A Solr instance is something like a runtime environment: a running Solr service. You use Solr to work with Lucene.
Lucene is a set of Java libraries for creating and accessing the full-text index; it is not an instance, not a running system.
To work with the indexed data (import/export/search/...), you need a "mediator" or "agent" like Solr. Solr indexes and accesses the indexed data by using Lucene.
An analogy: on a laptop or PC, the running (Linux/Windows) operating system is an instance, while the system's data is stored on the hard disk.
Solr is comparable to the running system; Lucene is comparable to the disk driver that stores data on your system drive.
For a definition of what an instance is, check this: http://en.wikipedia.org/wiki/Instance_%28computer_science%29
How can I create a Solr instance?
By starting the Tomcat server, which runs the Solr J2EE component.
So if your Solr is already configured, start Tomcat to start (i.e. create) a Solr instance.
In general, this link is helpful: http://wiki.apache.org/solr/SolrTomcat

Related

Stormcrawler: Injecting new URL to crawl without restarting the topology

Is there any way to inject a new URL to crawl from the command line, without stopping the topology and editing the proper files? I want to do that with Elasticsearch as the indexer.
It depends on what you use as a backend for storing the status of the URLs. If the URLs are stored in Elasticsearch in the status index, you won't need to restart the crawl topology. You can use the injector topology separately in local mode to inject the new URLs into the status index.
This is also the case with the SOLR or SQL modules, but not with MemorySpout + MemoryStatusUpdater, as it lives within the JVM and nowhere else.
Which spout do you use?

Regarding Drupal 7 configuring with apache Solr and apache Nutch

I have installed Drupal 7 and the Apache Solr Search module and configured it with Apache Solr (Solr version 4.10.4). The content has been indexed from Drupal into Apache Solr, and searching also works fine. I need to configure the Nutch (Apache Nutch version 1.12) web crawler with Apache Solr and Drupal 7, to fetch the details from a specific URL (e.g. http://www.w3schools.com) and then be able to search for that content in Drupal. My problem is how to configure all three: Solr, Nutch, and Drupal 7. Can anyone suggest a solution for this?
OK... here's my ugly solution, which maybe fits what you are doing.
You can use a PHP field (a custom field with Display Suite) in your node (or page) which basically reads your full page with cURL and then prints the contents right there. This field should only exist in a display of your node that nobody will see (except Apache Solr).
Finally, in the Solr config (which, honestly, I don't remember well how it worked) you can choose which display of the page, or which field, gets indexed; that will be your full page.
If all this works, you don't need to integrate Nutch with Solr and Drupal.
Good luck :)
PS: If you have a doubt, just ask.
My 2 cents on this: it looks like you want to aggregate content from your Drupal site (your nodes) and external content hosted on your site but not stored as Drupal content, right? If that is the case, then you don't need any integration between Nutch and Drupal; you just need to index everything in the same Solr core/collection. Of course, you'll need to make sure that the Solr schema is compatible (Nutch has its own metadata, different from the Drupal nodes). Alternatively, if you index into separate cores/collections, you can use the shards parameter to span your query across several cores and still get only one result set. With this approach, though, you'll need to keep an eye on the relevance of your results (the order of the documents) and also on which fields the Drupal Solr module uses to show the result, so in the end you'll still need to make the schemas of both cores compatible to some degree.
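The shards approach mentioned above can be sketched in Python like this (the host names and core names `drupal` and `nutch` are hypothetical; adjust them to your own setup):

```python
from urllib.parse import urlencode

# Query one core, but span the search across both the Drupal core and the
# Nutch core; Solr merges the partial results into a single result set.
params = {
    "q": "content:drupal",
    "shards": "localhost:8983/solr/drupal,localhost:8983/solr/nutch",
    "wt": "json",
}
query_url = "http://localhost:8983/solr/drupal/select?" + urlencode(params)
print(query_url)
```

Note that the shards parameter lists host:port/path entries without the http:// scheme, and that fields used for sorting and display need to exist in both schemas for the merged result set to make sense.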

How to know the status of Apache Solr for daily indexed documents

I am using Apache Solr 4.10.x. Apache Nutch is being used to crawl and index documents. Now that my crawler is running, I want to know how many documents are indexed on each iteration of Nutch, or per day.
Any ideas, or is there any tool provided by Apache Solr for this purpose?
You can use date faceting:
facet=true
facet.date=myDateField
facet.date.start=start_date
facet.date.end=end_date
facet.date.gap=+1MONTH (use a duration like +1DAY in your case)
Append all of these to your URL with & if you are using an HTTP request.
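A minimal sketch of building such a request URL in Python (the host, core path, and the field name `myDateField` are assumptions; substitute your own):

```python
from urllib.parse import urlencode

# Date-facet query: count documents per day over the last 30 days.
# "myDateField" stands for whatever date field your schema stores per
# indexed document (e.g. a fetch or index timestamp written by Nutch).
params = {
    "q": "*:*",
    "rows": 0,                 # we only need facet counts, not the documents
    "facet": "true",
    "facet.date": "myDateField",
    "facet.date.start": "NOW/DAY-30DAYS",
    "facet.date.end": "NOW/DAY+1DAY",
    "facet.date.gap": "+1DAY", # one bucket per day
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The response then contains one count per day bucket under facet_counts/facet_dates, which is the "documents indexed per day" number you are after.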
You can hit a URL with command=status.
For example, in my case it was:
qt=/dataimport&command=status
It gives you the status, like committed or rolled back, or the total documents processed.
For more info, check http://wiki.apache.org/solr/DataImportHandler and look for "command".
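As a sketch, the status URL from the answer above can be built and its JSON inspected like this (the host and core path are assumptions, and the sample response is trimmed to show the structure only, not real output):

```python
import json
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to query a live Solr

# Status request against the DataImportHandler, as in the answer:
status_url = ("http://localhost:8983/solr/select?"
              + urlencode({"qt": "/dataimport", "command": "status", "wt": "json"}))
print(status_url)

# Trimmed sketch of the kind of JSON the handler returns:
sample = json.loads("""{
  "status": "idle",
  "statusMessages": {
    "Total Documents Processed": "1234"
  }
}""")
print(sample["statusMessages"]["Total Documents Processed"])
```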

Host Solr after creating a new collection

I tried the example provided with the Apache Solr package.
I was trying to create a new data collection with my own schema and configuration.
How should I start running Solr in that case? When I was running the example, there was a start.jar in the example directory to start it. Will the same jar work in my case?
If not, how do I create an executable for it?
The first line on the Solr install page says: "Solr already includes a working demo server in the example directory that you may use as a template". http://wiki.apache.org/solr/SolrInstall#Setup
Even if the recommended server is Tomcat, I have a feeling Jetty will work just as well for you. Getting the index production-ready is more about knowing your fields and query patterns really well, and about optimising the index for speed through the schema and config according to those patterns.

Disable custom index on CMS. Will CD indexing be affected?

I set up a custom Lucene index for several template types in Sitecore, in a 1 CA and 3 CD environment. This works fine on the CD servers, but it seems to be overloading the CMS server. If I comment out this index on the CMS, will it in any way affect the indexing on the CD servers?
It shouldn't affect the CD servers unless you have some kind of file or configuration replication between the two that may move the index files or the sitecore config.
On a side note: the Lucene indexing part should have very little impact on any server it's running on (unless you have custom indexers), so I'm a little confused why it would overload the CMS server.