Apache Nutch crawler: how to exclude static folders like cgi-bin, images, css from the crawler?

When we run the crawler we see static folders such as /cgi-bin, /images, /css etc. show up in the crawler jobs. We want to exclude them from crawling entirely, not just keep them out of the indexer (though we don't want them there either). How can we exclude them in the crawler so it is not occupied with these static folders? Does excluding them help performance? Right now the crawler fetches them for one reason or another. We are using Nutch crawler 1.2 with the Lucene indexer. Any help is appreciated.

Add reject rules to the conf/regex-urlfilter.txt file.
-cgi-bin
-images
-css
Note that these rules must be added before the accept-all rule (+.) in the regex file.
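For reference, a minimal sketch of how that part of regex-urlfilter.txt could be ordered; anchoring the patterns to path segments is my own adjustment, and the plain substring rules above work as well:
# reject common static folders
-/cgi-bin/
-/images/
-/css/
# accept anything else (must come last)
+.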

Related

Archiving an old PHP website: will any webhost let me totally disable query string support?

I want to archive an old website which was built with PHP. Its URLs are full of .php extensions and query strings.
I don't want anything to actually change from the perspective of the visitor -- the URLs should remain the same. The only actual difference is that it will no longer be interactive or dynamic.
I ran wget --recursive to spider the site and grab all the static content. So now I have thousands of files such as page.php?param1=a&param2=b. I want to serve them up as they were before, so that means they'll mostly have Content-Type: text/html, and the webserver needs to treat ? and & in the URL as literal ? and & in the files it looks up on disk -- in other words it needs to not support query strings.
And ideally I'd like to host it for free.
My first thought was Netlify, but deployment on Netlify fails if any files have ? in their filename. I'm also concerned that I may not be able to tell it that most of these files are to be served as text/html (and one as application/rss+xml) even though there's no clue about that in their filenames.
I then considered https://surge.sh/, but hit exactly the same problems.
I then tried AWS S3. It's not free but it's pretty close. I got further here: I was able to attach metadata to the files I was uploading so each would have the correct content type, and it doesn't mind the files having ? and & in their filenames. However, its webserver interprets ?... as a query string, and it looks up and serves the file without that suffix. I can't find any way to disable query strings.
Did I miss anything -- is there a way to make any of the above hosts act the way I want them to?
Is there another host which will fit the bill?
If all else fails, I'll find a way to transform all the filenames and all the links between the files. I found how to get wget to transform ? to #, which may be good enough. It would be a shame to go this route, however, since then the URLs are all changing.
I found a solution with Netlify.
I added the wget options --adjust-extension and --restrict-file-names=windows.
The --adjust-extension part adds .html at the end of filenames which were served as HTML but didn't already have that extension, so now we have for example index.php.html. This was the simplest way to get Netlify to serve these files as HTML. It may be possible to skip this and manually specify the content types of these files.
The --restrict-file-names=windows alters filenames in a few ways, the most important of which is that it replaces ? with #. This is needed since Netlify doesn't let us deploy files with ? in the name. It's a bit of a hack; this is not really what this option is meant for.
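Putting the wget options mentioned above together, the full command looks roughly like this (the domain is illustrative, not from the original site):
wget --recursive --adjust-extension --restrict-file-names=windows https://example.com/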
This gives static files with names like myfile.php#param1=value1&param2=value2.html and myfile.php.html.
I did some cleanup. For example, I needed to adjust a few link and resource paths to be absolute rather than relative due to how Netlify manages presence or lack of trailing slashes.
I wrote a _redirects file to define URL rewriting rules. As the Netlify redirect options documentation shows, we can test for specific query parameters and capture their values. We can use those values in the destinations, and we can specify a 200 code, which makes Netlify handle it as a rewrite rather than a redirection (i.e. the visitor still sees the original URL). An exclamation mark is needed after the 200 code if a "query-string-less" version (such as mypage.php.html) exists, to tell Netlify we are intentionally shadowing.
/mypage.php param1=:param1 param2=:param2 /mypage.php#param1=:param1&param2=:param2.html 200!
/mypage.php param1=:param1 /mypage.php#param1=:param1.html 200!
/mypage.php param2=:param2 /mypage.php#param2=:param2.html 200!
If not all query parameter combinations are actually used in the dumped files, not all of the redirect lines need to be included of course.
There's no need for a final /mypage.php /mypage.php.html 200 line, since Netlify automatically looks for a file with a .html extension added to the requested URL and serves it if found.
I wrote a _headers file to set the content type of my RSS file:
/rss.php
  Content-Type: application/rss+xml
I hope this helps somebody.

Apache Nutch 2.3.1 fetch specific MIME type documents

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e. documents with textual content only. I have found that regex-urlfilter.txt can exclude MIME types by extension, but I could not find any option to specify the MIME types that I want to crawl. The problem with the regex URL filter is that there can be many MIME types to exclude, and the list will grow over time, so it is very difficult to cover them all. Is there any way to instruct Nutch to fetch only text/html documents, for example?
The URL filters only work with the URL, which means you can only filter based on that. Since the URL filters are executed before the documents are fetched/parsed, there is no MIME type available yet that could be used to allow/block URLs.
There is one other question: what happens if you specify that you want to crawl only a specific MIME type, but in the current crawl cycle there are no more documents with that MIME type? Then the crawl will stall until you add more URLs to crawl (manually), or until another URL is due to be fetched.
The normal approach is to crawl/parse everything and extract all the links (you never know when a new link matching your requirements could appear), and then only index certain MIME types.
For Nutch 2.x I'm afraid there is currently no mechanism for doing this. For Nutch 1.x we have two:
https://github.com/apache/nutch/tree/master/src/plugin/index-jexl-filter
https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter (soon to be deprecated)
You could port either of these options into Nutch 2.x.
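For example, on Nutch 1.x the mimetype-filter plugin is enabled through the plugin.includes property in nutch-site.xml; a rough sketch (the rest of the value mirrors the usual defaults, and the plugin's own rules file still has to be configured as described in its documentation):
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|mimetype-filter|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>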

Nutch not crawling entire website

I am using Nutch 2.3.1.
I perform these commands to crawl a site:
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
The problem is that Nutch only crawls the first URL (the one specified in seed.txt); the data is only the HTML from that first page.
All the other URLs that were accumulated by the generate command are not actually crawled.
I cannot get Nutch to crawl the other generated URLs, and I also cannot get it to crawl the entire website. What options do I need to use to crawl an entire site?
Does anyone have any insights or recommendations?
Thank you so much for your help
If Nutch crawls only the one specified URL, please check the Nutch URL filter (conf/regex-urlfilter.txt). To crawl all URLs in the seed, the content of regex-urlfilter.txt should be as follows:
# accept all URLs
+.
See details here: http://wiki.apache.org/nutch/NutchTutorial
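Also note that a single generate/fetch pass only processes what is already in the crawl database; to get beyond the seed page, the Nutch 2.x cycle is normally repeated, roughly as sketched here (batch handling simplified with -all):
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
./nutch parse -all
./nutch updatedb -all
# repeat generate / fetch / parse / updatedb for additional rounds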
Hope this helps,
Le Quoc Do

How to Crawl .pdf links using Apache Nutch

I have a website to crawl which includes some links to PDF files.
I want Nutch to crawl those links and dump them as .pdf files.
I am using Apache Nutch 1.6, and I am trying this in Java as:
// run the crawl, then dump the fetched segment data
ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?
If you want Nutch to crawl and index your pdf documents, you have to enable document crawling and the Tika plugin:
Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf"
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"
1.3 Edit nutch-site.xml, add "parse-tika" and "parse-html" in the plugin.includes section
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
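If your Nutch build includes the parsechecker tool, you can use it to confirm that Tika now parses PDFs (the URL below is only an example):
bin/nutch parsechecker -dumpText http://example.com/sample.pdf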
If what you really want is to download all PDF files from a page, you can use something like Teleport on Windows or wget on *nix.
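For instance, a wget sketch that grabs only the PDFs linked from a page (URL and depth are illustrative):
wget --recursive --level=1 --accept pdf,PDF https://example.com/some-page/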
You can either write your own plugin for the PDF MIME type, or use the embedded Apache Tika parser, which can extract text from PDFs.

Setup /MyStaticSite on a Drupal 7 installation

I have a Drupal installation at http://myserver.com and I'd like to serve an old micro-site from http://myserver.com/MyStaticSite. For the record, this site has a ton of HTML pages and images I downloaded in the past with wget.
What would be the correct way to do that in Drupal? Do I need some specific rewrite recipe? Maybe some rule like
when you get /MyStaticSite/X -> /sites/default/files/MyStaticSite/X
?
If MyStaticSite is a folder in your root directory, you don't need to do anything: the standard Drupal .htaccess will find it and serve content from that folder. Just be aware that if MyStaticSite is also a path in your Drupal namespace, the real folder will override it and you'll have no access to the Drupal path. If MyStaticSite happens to be the name of a subdirectory of the Drupal root (e.g. scripts or modules), you have a problem. You could copy your micro-site's content into that Drupal subdirectory (provided that none of your file names conflict with Drupal's), but you will have a hard time updating Drupal when needed.
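If you did want the rewrite the question sketches (serving /MyStaticSite/X from sites/default/files/MyStaticSite/X), a minimal mod_rewrite rule placed in .htaccess before Drupal's own rules might look like this; treat it as a sketch rather than something Drupal provides:
RewriteRule ^MyStaticSite/(.*)$ sites/default/files/MyStaticSite/$1 [L]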