How to Crawl .pdf links using Apache Nutch

I have a website to crawl which includes some links to PDF files.
I want Nutch to crawl those links and dump them as .pdf files.
I am using Apache Nutch 1.6, and I am trying this in Java as:
// run the crawl with the arguments in crawlArg
ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
// dump the fetched segments with the arguments in dumpArg
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?

If you want Nutch to crawl and index your pdf documents, you have to enable document crawling and the Tika plugin:
Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf"
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"
1.3 Edit nutch-site.xml, add "parse-tika" and "parse-html" in the plugin.includes section
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
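With those changes in place, the crawl and segment dump that the question drives through ToolRunner and SegmentReader can also be run from the command line. A minimal sketch for Nutch 1.6, assuming a urls/ seed directory, an output directory named crawl, and a placeholder segment timestamp (none of these names come from the question):
# run the crawl (command-line equivalent of the Crawl tool used from Java)
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
# dump a fetched segment to plain text files (command-line equivalent of SegmentReader)
bin/nutch readseg -dump crawl/segments/20130101000000 dump_out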
If what you really want is to download all pdf files from a page, you can use something like Teleport in Windows or Wget in *nix.
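As a rough sketch of the wget route (the page URL and the pdfs/ output directory are placeholders, not from the question):
# follow links one level deep from the page and keep only .pdf files
wget -r -l 1 -nd -A pdf -P pdfs/ http://example.com/page.html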

You can either write your own plugin for the PDF mimetype, or use the embedded Apache Tika parser, which can retrieve text from PDFs.
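Outside of Nutch, the same Tika parsing can be tried standalone with the tika-app command-line jar, which is a quick way to check whether text can be extracted from a given PDF at all. A hedged sketch (the jar version and file name are placeholders):
# extract plain text from a PDF with the standalone Tika app
java -jar tika-app-1.3.jar --text document.pdf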

Related

Apache Nutch 2.3.1 fetch specific MIME type documents

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e. documents having textual content only. I have found that regex-urlfilter.txt can exclude MIME types, but I could not find any option to specify the MIME types that I want to crawl. The problem with the regex URL filter is that there can be many MIME types, and they will increase with time, so it is very difficult to include them all. Is there any way that I can instruct Nutch to fetch only text/html documents, for example?
The URL filters only work with the URL, which means that you can only filter based on that. Since the URL filters are executed before the documents are fetched/parsed, there is no mimetype that could be used to allow/block URLs.
There is one other issue: what happens if you specify that you want to crawl a specific mimetype, but in the current crawl cycle there are no more documents with that mimetype? Then the crawl will stop until you add more URLs to crawl (manually), or another URL is due to be fetched.
The normal approach for this is to crawl/parse everything and extract all the links (you never know when a new link matching your requirements could appear). Then only index certain mime types.
For Nutch 2.x I'm afraid there is currently no mechanism for doing this. In Nutch 1.x we have two:
https://github.com/apache/nutch/tree/master/src/plugin/index-jexl-filter
https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter (soon to be deprecated)
You could port either of these options into Nutch 2.x.

Regarding configuring Drupal 7 with Apache Solr and Apache Nutch

I have installed Drupal 7 and the Apache Solr Search module, and configured it with Apache Solr (Solr version 4.10.4). The content has been indexed from Drupal into Apache Solr and searching also works fine. I need to configure the Nutch (Apache Nutch version 1.12) web crawler with Apache Solr and Drupal 7, to fetch the details from a specific URL (for example http://www.w3schools.com) and to search for that content in Drupal. My problem is how to configure all three: Solr, Nutch and Drupal 7. Can anyone suggest a solution for this?
Ok... here's my ugly solution that maybe fits what you are doing.
You can use a PHP field (a custom field with Display Suite) in your node (or page) which basically reads your full page with cURL and then prints the contents right there. This field should only be in a display of your node that nobody will see (except Apache Solr).
Finally, in the Solr config (which, honestly, I don't remember well how it worked) you can choose which display of the page, or which field, gets indexed, and that will be your full page.
If all this works, you don't need to integrate Nutch with Solr and Drupal.
Good luck :)
P.S.: If you have any doubt, just ask.
My 2 cents on this: it looks like you want to aggregate content from your Drupal site (your nodes) and from external content that is hosted on your site but not as Drupal content, right? If this is the case then you don't need any integration between Nutch and Drupal, just index everything in the same Solr core/collection. Of course, you'll need to make sure that the Solr schema is compatible (Nutch has its own metadata, different from the Drupal nodes).
Also, if you index into separate cores/collections, you could use the shards parameter to span your query over several cores and still get only one result set. With this approach, though, you'll need to keep an eye on the relevance of your results (the order of the documents) and also on which fields the Drupal Solr module uses to show the result, so in the end you'll still need to make the schemas of both cores compatible to some degree.
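To illustrate the shards approach, a single query can be fanned out over a Drupal core and a Nutch core and still come back as one merged result set. A hedged sketch (the host, port and core names are assumptions, not from the question):
# query one core but spread the search over both cores via the shards parameter
curl "http://localhost:8983/solr/drupal/select?q=test&shards=localhost:8983/solr/drupal,localhost:8983/solr/nutch"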

Scroll SVN-stored text file served by Apache

I have a 1.7.9 SVN server exposed via Apache. It uses apache2-svn, which allows specifying a revision number as part of the URL like this (for r65):
https://SERVER:PORT/REPO/FILE?p=65
I'd like to add a parameter to the query string that allows scrolling the file or, even better, highlighting a range in the file, so users can send links pointing to "revision 65, lines 110-125".
Any ideas? The SVN repository stores only regular text files. Do browsers even support scrolling to a position in an arbitrary text file? Or would I need to transform the file into an HTML document? Is there any ready-to-use solution?
Cheers,
Pawel
Apache's built-in SVN repository browsing feature is very simple and minimalistic. It does not allow you to specify a particular string to navigate to. The available URL syntax allows:
viewing / downloading a particular file:
https://svn.domain.com/svn/MyProject/trunk/README.txt
viewing / downloading a particular file in revision 321:
https://svn.domain.com/svn/MyProject/trunk/README.txt?r=321
viewing / downloading a particular file, which is not available in the youngest revision, by specifying peg revision:
https://svn.domain.com/svn/MyProject/trunk/FILE_ID.DIZ?p=123
combining both of the above methods, you can tune the view (see the example below).
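For instance, a URL of the following form (the operative revision 120 is a made-up value for illustration) should show the file addressed at peg revision 123 as it looked in revision 120:
https://svn.domain.com/svn/MyProject/trunk/FILE_ID.DIZ?p=123&r=120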
If you want more browsing features, install a 3rd-party repository browsing UI. Take a closer look at ViewVC, WebSVN and Sventon.

Make Indexed File Downloadable In Apache Solr

I am trying to index a PDF file in Solr, which I have done successfully using the command
curl "http://localhost:8983/solr/update/extract?literal.id=id&commit=true" -F "myfile=@filename.pdf"
I am able to see the file contents and search them, but when I try to click on the file name it shows
HTTP ERROR 404
Problem accessing /solr/collection1/id. Reason:
not found
What I want is to have a link which allows downloading the file; I know Solr merely indexes the file and stores it. I was wondering if there is a way I can add an attribute like location, as you have done, and proceed from there. Can you please share what you have done? If you want any more clarity regarding my problem, do ask.
We have the actual files hosted through a separate web application, to be downloaded from with auditing and additional security.
You can always host these files directly through an HTTP server.
If the file names match the id, it is as easy as appending id.extension to the fixed HTTP-hosted URL.
Otherwise, index the path of the file with an additional parameter, e.g. literal.url.
The url will then be a Solr field which is available in the Solr response.
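A hedged sketch of that idea, assuming the Solr schema has a url field to hold the download location (the id value, URL and file name below are placeholders):
# index the PDF and store its download location in the url field
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.url=http://files.example.com/doc1.pdf&commit=true" -F "myfile=@doc1.pdf"
# the url field is returned with each hit and can be rendered as a download link
curl "http://localhost:8983/solr/select?q=doc1&fl=id,url"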

Apache Nutch crawler: how to exclude static folders like cgi-bin, images, css from the crawler?

When we run the crawler we see static folders like /cgi-bin, /images, /css etc. pop up in the crawler jobs. We want to exclude them from crawling (not just keep them out of the indexer), and we don't want them in the indexer either, but how can we exclude them in the crawler so it is not occupied with these static folders? Any help is appreciated. Does excluding them help performance? Right now we see it fetches them for some reason or another. Nutch crawler 1.2, Lucene indexer.
Add reject rules to the conf/regex-urlfilter.txt file.
-cgi-bin
-images
-css
Note that these must be added before the accept-all rule, i.e. +., in the regex file.
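Put together, the relevant part of conf/regex-urlfilter.txt would look something like this (an illustrative excerpt; the +. line is the existing accept-all rule):
# reject static folders
-cgi-bin
-images
-css
# accept anything else
+.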