Nutch Crawler taking very long - apache

I only want Nutch to give me a list of the URLs it crawled and the status of each link. I do not need the entire page content or other fluff. Is there a way I can do this? Crawling a seed list of 991 URLs with a depth of 3 takes over 3 hours to crawl and parse. I'm hoping this will speed things up.
In the nutch-default.xml file there is:
<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of the time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NOT IMPLEMENTED YET !!
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical FTP RFCs never define partial transfer and, in fact,
  some FTP servers out there do not handle client-side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
These properties are the ones I think may have something to do with it, but I'm not sure. Can someone give me some help and clarification? Also, I was getting many URLs with a status code of 38. I cannot find what that status code indicates in this document. Thanks for the help!

Nutch performs parsing after fetching a URL in order to extract its outlinks; the outlinks then become the fetchlist for the next round. If parsing is skipped entirely, no new URLs are generated and fetching stops after the first round.
One way around this is to configure the parse plugins so that only the type of content you actually require (in your case, the outlinks) gets parsed; see the sketch after the links below.
One example here - https://wiki.apache.org/nutch/IndexMetatags
This page describes the features of the parser: https://wiki.apache.org/nutch/Features
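As a rough illustration, trimming the plugin list in conf/nutch-site.xml keeps fetching and outlink parsing but drops the indexing work. The exact plugin list below is an assumption and should be adjusted to the plugins your crawl actually needs:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugins for fetching, outlink parsing, and scoring only;
  the index-* and indexer-* plugins are omitted since the page content
  itself is not needed.
  </description>
</property>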
Now, to get only the list of the URLs fetched with their statuses, you can use the
$ bin/nutch readdb crawldb -stats
command.
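Note that -stats prints aggregate counts per status; if you want the per-URL listing instead, the same readdb tool can dump the whole CrawlDb as text (the crawl/crawldb and dump_dir paths here are placeholders for your own):

$ bin/nutch readdb crawl/crawldb -stats
$ bin/nutch readdb crawl/crawldb -dump dump_dir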
Regarding the status code of 38: looking at the document you linked, it seems the status of the URL is
public static final byte STATUS_FETCH_NOTMODIFIED = 0x26
since hex 0x26 corresponds to decimal 38.
Hope the answer gives some direction :)

How to force dispatcher to cache URLs with GET parameters

As I understood after reading these links:
How to find out what does dispatcher cache?
http://docs.adobe.com/docs/en/dispatcher.html
The Dispatcher always requests the document directly from the AEM instance in the following cases:
If the HTTP method is not GET. Other common methods are POST for form data and HEAD for the HTTP header.
If the request URI contains a question mark "?". This usually indicates a dynamic page, such as a search result, which does not need to be cached.
The file extension is missing. The web server needs the extension to determine the document type (the MIME-type).
The authentication header is set (this can be configured).
But I want to cache URLs with parameters.
If I request myUrl/?p1=1&p2=2&p3=3 once,
then the next request to myUrl/?p1=1&p2=2&p3=3 must be served from the dispatcher cache, but myUrl/?p1=1&p2=2&p3=3&newParam=newValue should be served by CQ the first time and from the dispatcher cache for subsequent requests.
I think the config /ignoreUrlParams is what you are looking for. It lets you whitelist query parameters that the dispatcher ignores when determining whether a page is cached or delivered from cache.
Check http://docs.adobe.com/docs/en/dispatcher/disp-config.html#Ignoring%20URL%20Parameters for details.
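As a minimal sketch, the section in dispatcher.any looks like this (the utm_source parameter is only an illustration; parameters matched by an "allow" rule are ignored for caching decisions, while requests carrying a "deny"-matched parameter bypass the cache):

/ignoreUrlParams
  {
  # by default, deny everything: any query parameter prevents caching
  /0001 { /glob "*" /type "deny" }
  # parameters whitelisted here are ignored, so the page is still cached
  /0002 { /glob "utm_source" /type "allow" }
  }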
It's not possible to cache requests that contain a query string. Such calls are considered dynamic, so they are not expected to be cached.
On the other hand, if you are certain that such requests should be cached because your application/feature is query driven, you can work around it this way:
Add an Apache rewrite rule that moves the query string of a given parameter into a selector (see the sketch below).
(Optional) Add a CQ filter that recognizes the selector and moves it back into the query string.
The selector can be constructed as key_value, but that puts some constraints on what can be passed here.
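A hedged sketch of the rewrite, assuming a single known parameter p1 and Apache mod_rewrite (adjust the pattern to your URLs):

# move ?p1=value into a selector so the dispatcher sees a cacheable URL
RewriteCond %{QUERY_STRING} (?:^|&)p1=([^&]+)
RewriteRule ^/mypage\.html$ /mypage.p1-%1.html? [PT,L]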
You can do this with Apache rewrites, but it would not be ideal practice: you'll be breaking the pattern that AEM uses.
Instead, use selectors and extensions. E.g. instead of server.com/mypage.html?somevalue=true, use:
server.com/mypage.somevalue-true.html
Most things you will need to do that would ever get cached will work this way just fine. If you give me more details about your requirements and what you are trying to achieve, I can help you perfect the solution.

Gemfire region with data expiration

According to this document, "entry-time-to-live-expiration" means how long the region's entries can remain in the cache without being accessed or updated; the default is no expiration of this type. However, when I use Spring Cache and a client-region with the following configuration, I find that the setting does not appear to take access into account. Furthermore, in this document (XML TTL tab), it says it "Configures a replica region to invalidate entries that have not been modified for 15 seconds." So I am confused about whether TTL considers accesses at all.
<gfe:client-region id="Customer2" name="Customer2" destroy="false" load-factor="0.5" statistics="true" cache-ref="client-cache">
  <gfe:entry-ttl action="DESTROY" timeout="60"/>
  <gfe:eviction threshold="5"/>
</gfe:client-region>
So, the documentation you might want to refer to is here and here. Perhaps relevant to your situation is...
"Requests for entries that have expired on the consumers will be forwarded to the producer."
Based on your configuration, given you did not set either a ClientRegionShortcut or DataPolicy, your Client Region, "Customer2", defaults to ClientRegionShortcut.LOCAL, which sets a DataPolicy of "NORMAL". DataPolicy.NORMAL states...
"Allows the contents in this cache to differ from other caches. Data that this region is interested in is stored in local memory."
And for the shortcut of "LOCAL"...
"A LOCAL region only has local state and never sends operations to a server. ..."
However, that does not mean the client Region cannot receive data (of interest) from the Server. It simply means operations are not distributed to the Server. It may be expiring the entry and then repopulating it from the Server (producer).
Of course, I am speculating and have not tested these ideas. You might try setting the Expiration Action to "LOCAL_DESTROY" and/or changing your distribution properties through different ClientRegionShortcuts.
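For instance, a sketch of what that might look like (the shortcut attribute and the CACHING_PROXY value are assumptions; check them against your Spring Data GemFire version):

<gfe:client-region id="Customer2" name="Customer2" shortcut="CACHING_PROXY" statistics="true" cache-ref="client-cache">
  <!-- LOCAL_DESTROY removes the expired entry locally only, without distributing the destroy -->
  <gfe:entry-ttl action="LOCAL_DESTROY" timeout="60"/>
</gfe:client-region>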
Post back if you are still having problems. I too echo what @hubbardr is asking.
Cheers!

Override forcedownload behavior in Sitecore

We had a problem with some of our IE clients failing to download a PDF, even after clicking on the link. We found the answer here resolved our problems: set forcedownload=true for PDF mime types in web.config.
However, that created another problem: we are now unable to render a PDF in the browser when we want to. We used to do this with an iframe, but now the PDF just downloads and does not render in the browser.
I learned that the forcedownload=true setting is actually the default in a later version of Sitecore (v7.2), so I'm hesitant to revert it.
So, how do I render a PDF in a browser in this situation?
You can leave forceDownload=false on the PDF mime type and instead set the following setting to false:
<setting name="Media.EnableRangeRetrievalRequest" value="false"/>
I faced the same dilemma a few months back with the same initial fix. I found out the actual issue last week and wrote a blog post about it. (In fact, I wrote the answer you linked to; I've updated it with the same information now for future visitors.)
The issue is basically a combination of the Adobe Reader plugin for IE9, chunked transfer encoding, and streaming the file directly from the database. I found that if you closed your browser and tried again, or forced a refresh with Ctrl+F5, it worked fine. Once Sitecore had cached the file to disk, it would continue to work for everyone.
The above setting disables chunked transfer encoding, instead sending the file down to the browser as a single piece. This setting was introduced in Sitecore 6.5+
This is one of the flaws in the MediaRequestHandler and, in my opinion, the forceDownload option is pretty useless the way it is designed by default. (Why would you ever want to configure this option per media extension only?)
You'll basically have to turn off the forceDownload option again and replace the MediaRequestHandler with your own. I usually end up writing my own anyway because of other issues with the default handler, such as dealing properly with CDNs etc.
In the ProcessRequest method, you can determine whether the item should be "downloaded" or not by setting the Content-Disposition header. You basically need to get rid of the default handling of forceDownload and set your headers based on your own logic.
Personally, I prefer to set a query string parameter, such as ?dl=1, and base the Content-Disposition header on it. You could also extend the media item template to contain a default behavior per item or subtree (leveraging Sitecore inheritance and standard values), and potentially also define (override) a specific filename per item for the attachment part of the Content-Disposition header.
When rendering the link, you can leverage the properties collection (write a suitable extension method or similar), so that you can clearly mark in your code that the link is meant for download but still use the built-in field render methods. Thereby you eliminate the risk of breaking the Page Editor etc.
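For reference, a sketch of swapping in a custom handler in web.config (the MyProject.CustomMediaRequestHandler type name is a placeholder for your own implementation):

<handlers>
  <add verb="*" path="sitecore_media.ashx" type="MyProject.CustomMediaRequestHandler, MyProject" name="Sitecore.MediaRequestHandler" />
</handlers>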
/ Mikael
You have to disable range retrieval request in web.config by setting its value to false.
<setting name="Media.EnableRangeRetrievalRequest" value="false" />
The MediaRequestHandler enables Sitecore to serve PDF content partially, in ranges, using the HTTP 206 status code. You can also override the MediaRequestHandler and write your own custom implementation to handle media requests.

Is meta charset required if AddDefaultCharset is set?

Is there a reason why I should keep <meta charset='utf-8'> in my HTML head when my .htaccess file already has AddDefaultCharset utf-8?
Just for serving files over the web, it is rather redundant. But since people may want to save the page to a file and open it later without the context of a web server, it's good practice to embed the information in the document itself using the meta tag.
This W3 document gives a great overview of the tradeoffs of each approach:
http://www.w3.org/International/questions/qa-html-encoding-declarations
You should definitely use HTTP header declarations if it is likely that the document will be transcoded (i.e., the character encoding will be changed by intermediary servers), since HTTP declarations have higher precedence than in-document ones.
Otherwise you should use HTTP headers if it makes sense for any type of content, but in conjunction with an in-document declaration (see below). You should always ensure that HTTP declarations are consistent with the in-document declarations.
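Concretely, the two declarations look like this (assuming Apache for the .htaccess line):

# .htaccess
AddDefaultCharset utf-8

<!-- in the HTML head -->
<meta charset="utf-8">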
One specific example where the <meta> tag may still be appropriate is when user-contributed content may be in a different character set, but the users don't have access to modify your Apache server settings themselves; it is therefore beneficial to offer control of the charset within the document.
See the document for more in-depth details.

How do you make an etag that matches Apache?

I want to make an ETag that matches what Apache produces. How does Apache create its ETags?
Apache uses the standard format of inode-filesize-mtime, with each part in hex. The only caveat is that the mtime must be epoch time, right-padded with zeros to 16 digits (approximating the microsecond-resolution timestamp Apache uses). Here is how to do it in PHP:
$fs = stat($file);
// inode and size in hex; mtime (seconds) is right-padded to 16 digits to
// approximate Apache's microsecond timestamp, then converted to hex
header("Etag: ".sprintf('"%x-%x-%s"', $fs['ino'], $fs['size'], base_convert(str_pad($fs['mtime'], 16, "0"), 10, 16)));
One thing to remember about Apache's Etags is that they don't play well in clusters because they include inode information that can—and probably will—vary between machines in the same cluster.
If you're dynamically generating your page though, this probably won't make sense. If you're in PHP, you can pick the inode and file size of the main script, but the modify time won't tell you if your data has changed. Unless you have a good caching process or just generate static pages, etags aren't helpful. If you do have a good caching process, the inode and file size are probably irrelevant.
Edit: For people who don't know what etags are - they're just supposed to be a value that changes when the content has changed, for caching purposes. The browser gets the etag from the web server, compares it to the etag for its cached copy and then fetches the whole page if the etag has changed.
The answer above (from Chris) works well, but it can be simplified by letting sprintf implicitly cast the padded string to an integer:
sprintf('"%x-%x-%x"', $s['ino'], $s['size'], str_pad($s['mtime'], 16, "0"));
The suggested %016x doesn't work because the padding would be applied after the conversion to hex, rather than before.
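Putting the simplified version together as a complete header call (same caveats as above; the 16-digit padded value needs 64-bit PHP to fit in an integer):

$s = stat($file);
header(sprintf('Etag: "%x-%x-%x"', $s['ino'], $s['size'], str_pad($s['mtime'], 16, '0')));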