How to know the status of Apache Solr for daily indexed documents

I am using Apache Solr 4.10.x. Apache Nutch is being used to crawl and index documents. Now that my crawler is running, I want to know how many documents are indexed on each iteration of Nutch, or per day.
Is there any idea or tool provided by Apache Solr for this purpose?

You can use date faceting on a date field in your index:
facet=true
facet.date=myDateField
facet.date.start=start_date
facet.date.end=end_date
facet.date.gap=+1MONTH (or a duration such as +1DAY in your case)
Append these parameters to your query URL with & if you are issuing a plain HTTP request.
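For example, assuming the documents indexed by Nutch carry a date field such as tstamp (the Nutch Solr schema typically includes one; substitute whatever date field your documents actually have, and adjust the core name), a per-day count over the last 30 days might look like:
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.date=tstamp&facet.date.start=NOW/DAY-30DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY"
The + in the gap and end values is URL-encoded as %2B; the per-day counts come back under facet_counts/facet_dates in the response.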

You can hit a URL with command=status.
For example, in my case it was:
qt=/dataimport&command=status
It gives you the status, such as committed or rolled back, plus the total number of documents processed.
For more info, check
http://wiki.apache.org/solr/DataImportHandler
and look for "command".
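For example, assuming a core named collection1 and a DataImportHandler registered at /dataimport in solrconfig.xml (this only applies if you load data through DIH; Nutch pushes documents through its own indexer instead), the status call would look roughly like:
curl "http://localhost:8983/solr/collection1/dataimport?command=status"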

Related

Apache Nutch 2.3.1 fetch specific MIME type documents

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e. documents having textual content only. I have found regex-urlfilter.txt to exclude MIME types, but could not find any option to specify the MIME types that I do want to crawl. The problem with the regex URL filter is that there can be many MIME types, and the list will grow over time, so it is very difficult to include them all. Is there any way I can instruct Nutch to fetch only text/html documents, for example?
The URL filters only work with the URL, which means you can only make assertions based on that. Since the URL filters are executed before the documents are fetched/parsed, there is no mimetype available to allow/block URLs.
There is one other question: what happens if you specify that you want to crawl a specific mimetype, but in the current crawl cycle there are no more documents with that mimetype? Then the crawl will stop until you add more URLs to crawl (manually), or until another URL becomes due for fetching.
The normal approach is to crawl/parse everything and extract all the links (you never know when a new link matching your requirements could appear), and then index only certain mime types.
For Nutch 2.x I'm afraid there is currently no mechanism for doing this. In Nutch 1.x we have two options:
https://github.com/apache/nutch/tree/master/src/plugin/index-jexl-filter
https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter (soon to be deprecated)
You could port either of these options into Nutch 2.x.
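If filtering on URL suffixes is good enough for your case (as noted above, the URL filters can only look at the URL itself), a rough regex-urlfilter.txt sketch could look like the following; the suffix list is purely illustrative, not exhaustive:
# skip URLs ending in common binary/media suffixes
-\.(gif|jpg|jpeg|png|ico|css|js|zip|gz|exe|pdf|doc|ppt|mp3|mp4|avi)$
# accept anything else
+.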

Regarding configuring Drupal 7 with Apache Solr and Apache Nutch

I have installed Drupal 7 and the Apache Solr Search module and configured it with Apache Solr (Solr version 4.10.4). The content has been indexed from Drupal into Apache Solr, and searching also works fine. I now need to configure the Nutch web crawler (Apache Nutch version 1.12) with Apache Solr and Drupal 7, to fetch the details from a specific URL (for example http://www.w3schools.com) and be able to search that content from Drupal. My problem is how to configure all three: Solr, Nutch, and Drupal 7. Can anyone suggest a solution for this?
OK... here's my ugly solution that maybe fits what you are doing.
You can use a PHP field (a custom field with Display Suite) in your node (or page) which basically reads your full page with cURL and then prints the contents right there. This field should only appear in a display of your node that nobody will see (except Apache Solr).
Finally, in the Solr config (which, honestly, I don't remember well how it worked) you can choose which display of the page gets indexed, or which field gets indexed, and that will be your full page.
If all this works, you don't need to integrate Nutch with Solr and Drupal.
Good luck :)
PS: If you have any doubts, just ask.
My 2 cents on this: it looks like you want to aggregate content from your Drupal site (your nodes) with external content hosted on your site but not stored as Drupal content, right? If this is the case then you don't need any integration between Nutch and Drupal, just index everything into the same Solr core/collection. Of course you'll need to make sure that the Solr schema is compatible (Nutch has its own metadata, different from the Drupal nodes). Also, if you index into separate cores/collections, you could use the shards parameter to span your query across several cores and still get a single result set. With this approach, though, you'll need to keep an eye on the relevance of your results (the order of the documents) and also on which fields the Drupal Solr module uses to display results, so in the end you'll still need to make the schemas of both cores compatible to some degree.
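For example, if the Drupal content lived in a core called drupal_core and the Nutch content in nutch_core (hypothetical names), a distributed query spanning both could look roughly like:
curl "http://localhost:8983/solr/drupal_core/select?q=some+keywords&shards=localhost:8983/solr/drupal_core,localhost:8983/solr/nutch_core"
Note that the shards parameter usually lists host:port/solr/core entries without the http:// prefix, and both cores need compatible unique key and result fields for the merged result set to make sense.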

Nutch not crawling entire website

I am using Nutch 2.3.1.
I perform the following commands to crawl a site:
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
The problem is that Nutch only crawls the first URL (the one specified in seed.txt). The data is only the HTML from that first URL/page.
All the other URLs that were accumulated by the generate command are not actually crawled.
I cannot get Nutch to crawl the other generated URLs... I also cannot get Nutch to crawl the entire website. What options do I need to use to crawl an entire site?
Does anyone have any insights or recommendations?
Thank you so much for your help
If Nutch crawls only the one specified URL, please check the Nutch URL filter (conf/regex-urlfilter.txt). To crawl all URLs from the seed, the content of regex-urlfilter.txt should be as follows:
# accept all URLs
+.
See details here: http://wiki.apache.org/nutch/NutchTutorial
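Also worth checking: a single generate/fetch pass only works on URLs already in the db (initially just the seed); new outlinks only enter the db after parsing and updating it. A rough sketch of one full Nutch 2.x round (exact flags can differ between 2.x releases):
# once, to seed the db
./nutch inject ../urls/seed.txt
# repeat per crawl round
./nutch generate -topN 2500
./nutch fetch -all
./nutch parse -all
./nutch updatedb -all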
Hope this helps,
Le Quoc Do

Apache Nutch and Solr: queries

I have just started using Nutch 1.9 and Solr 4.10.
After browsing through certain pages, I see that the syntax for running this version has changed, and I have to update certain XML files to configure Nutch and Solr.
This version of the package doesn't require Tomcat to run. I started Solr:
java -jar start.jar
and checked localhost:8983/solr/admin; it's working.
I planted a seed in bin/url/seed.txt and the seed is "simpleweb.org".
I ran this command in Nutch: ./crawl urls -dir crawl -depth 3 -topN 5
I got a few IO exceptions in the middle, so to avoid them I downloaded
patch-hadoop_7682-1.0.x-win.jar, made an entry in nutch-site.xml, and placed the jar file in Nutch's lib directory.
After running Nutch, the following folders were created:
apache-nutch-1.9\bin\-dir\crawldb\current\part-00000
I can see the following files in that path:
data
index
.data.crc
.index.crc
I want to know what to do with these files and what the next steps are. Can we view these files? If yes, how?
I indexed the crawled data from Nutch into Solr, to link Solr with Nutch (the command completed successfully):
./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5
Why do we need to index the data crawled by Nutch into Solr?
After crawling with Nutch (command used: ./crawl urls -dir crawl -depth 3 -topN 5), can we view the crawled data? If yes, where?
Or can we only view the crawled data entries after indexing the data crawled by Nutch into Solr?
How do we view the crawled data in the Solr web UI?
Command used for this: ./crawl urls solr localhost:8983/solr/ -depth 3 -topN 5
Although Nutch was built to be a web-scale search engine, this is not the case any more. Currently, the main purpose of Nutch is large-scale crawling. What you do with that crawled data is then up to your requirements. By default, Nutch allows you to send data to Solr. That is why you can run
crawl url crawl solraddress depth level
You can also omit the Solr URL parameter. In that case, Nutch will not send the crawled data to Solr. Without sending the crawled data to Solr, you will not be able to search it. Crawling data and searching data are two different, but closely related, things.
Generally, you will find the crawled data in crawl/segments, not crawl/crawldb. The crawldb folder stores information about the crawled URLs, their fetch status, and when they are next due to be fetched, plus some other useful information for crawling. Nutch stores the actual crawled data in crawl/segments.
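For example, to peek at what Nutch 1.x has stored on disk, something like the following should work; the segment directory name is a timestamp generated by Nutch, shown here as a placeholder:
bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -dump crawl/segments/<segment_timestamp> segment_dump
The first command prints crawldb statistics (counts of fetched/unfetched URLs), and the second dumps a segment's content into a plain-text segment_dump directory that you can read directly.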
If you want an easy way to view crawled data, you might try Nutch 2.x, as it can store its crawled data in several back ends such as MySQL, HBase, and Cassandra through the Gora component.
To view the data in Solr, you simply issue a query to Solr like this:
curl http://127.0.0.1:8983/solr/collection1/select/?q=*:*
Otherwise, you can always push your data into different stores by adding indexer plugins. Currently, Nutch supports sending data to Solr and Elasticsearch. These indexer plugins send structured data such as title, text, author, and other metadata.
The following summarizes what happens in Nutch:
seed list -> crawldb -> fetching raw data (downloading site contents) -> parsing the raw data -> structuring the parsed data into fields (title, text, anchor text, metadata and so on) -> sending the structured data to storage for use (such as Elasticsearch or Solr).
Each of these stages is extensible and allows you to add your own logic to suit your requirements.
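For reference, with Nutch 1.x these stages map roughly onto the individual commands below (a sketch; the crawl script wraps them for you, exact arguments vary between releases, and <segment> stands for the timestamped directory created by generate):
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 5
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -dir crawl/segments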
I hope that clears your confusion.
You can run Nutch on Windows. I am also a beginner; yes, it is a bit tough to install on Windows, but it does work! This "input path doesn't exist" problem can be solved by:
replacing the hadoop-core-1.2.0.jar file in apache-nutch-1.9/lib with hadoop-core-0.20.2.jar (from Maven),
then renaming this new file to hadoop-core-1.2.0.

Remote streaming with Solr

I'm having trouble using remote streaming with Apache Solr.
We previously had Solr running on the same server where the files to be indexed are located, so all we had to do was pass it the path of the file we wanted to index.
We used something like this:
stream.file=/path/to/file.pdf
This worked fine. We have now moved Solr so that it runs on a different server from the website that uses it, because it was using up too many resources.
I'm now using the following to point Solr in the direction of the file:
stream.file=http://www.remotesite.com/path/to/file.pdf
When I do this Solr reports the following error:
http:/www.remotesite.com/path/to/file.pdf (No such file or directory)
Note that it is stripping one of the slashes from http://.
How can I get Solr to index a file at a certain URL like I'm trying to do above? The enableRemoteStreaming parameter is already set to true.
Thank you
For remote streaming, you need to enable it in solrconfig.xml:
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
and use stream.url (rather than stream.file) for URLs; stream.file expects a local filesystem path, which is why your http:// URL is being treated as a file path.
If remote streaming is enabled and URL content is called for during
request handling, the contents of each stream.url and stream.file
parameters are fetched and passed as a stream.
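For example, to have Solr fetch and extract a remote PDF, assuming the ExtractingRequestHandler is registered at /update/extract (the usual handler for PDFs) and the core is named collection1 (adjust both for your setup):
curl "http://solrhost:8983/solr/collection1/update/extract?literal.id=remote-doc-1&commit=true&stream.url=http://www.remotesite.com/path/to/file.pdf"
If the remote URL itself contains query strings or special characters, URL-encode the stream.url value.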