I am using Nutch 1.4. I want to manipulate the crawled URL before indexing it.
For example, if my URL is http://xyz.com/home/xyz.aspx then I want to modify it to http://xyz.com/index.aspx?role=xyz, and only the latter should be indexed in Solr. The reason is that I don't want to expose the first URL; the second URL will ultimately redirect to the same page.
Does Nutch provide a way to manipulate crawled URLs before indexing them to Solr?
There is no out-of-the-box way to modify the value fed to Solr unless you write a custom plugin to do so.
However, this can easily be handled on the client side before the results are displayed to the user.
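For instance, a client-side rewrite can be as small as a URL mapping function. This is only a minimal sketch (in Python, not Nutch code), assuming the role value is simply the page name from the original URL, as in your example:

import re

def public_url(crawled_url):
    # http://xyz.com/home/xyz.aspx -> http://xyz.com/index.aspx?role=xyz
    m = re.match(r'^(https?://[^/]+)/home/(\w+)\.aspx$', crawled_url)
    if m:
        return '%s/index.aspx?role=%s' % (m.group(1), m.group(2))
    return crawled_url  # leave any other URL untouched

print(public_url('http://xyz.com/home/xyz.aspx'))

If you really need the rewritten URL stored in Solr itself, the custom plugin route mentioned above would apply the same kind of mapping to the URL field before the document is sent to Solr.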
I have configured Nutch 2.3.1 to crawl some news websites. Since the homepages change from one day to the next, I want to handle them differently: for a homepage, only the main categories should be crawled rather than the text, because the text will change after some time (I have observed similar behaviour in Google).
For the rest of the pages it's working fine (crawling text etc.).
At the moment Nutch doesn't offer any special treatment for homepages; a homepage is just one more URL to crawl. If you want to do this you'll probably need to customise some portions of Nutch.
If you're collecting a fixed set of URLs (the ones you usually put in the seed file), you can attach some metadata to those URLs and use a different strategy for them, for instance setting a really high score and a short fetch interval (https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/crawl/InjectorJob.java#L56-L59).
Since the generator job sorts URLs by score, this should work as long as all other URLs have a score lower than the value you use for the seed URLs. Keep in mind that this will cause Nutch to crawl these URLs every time a new cycle starts (since the seed URLs will be at the top all the time).
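For example, a seed file entry could look like the line below (tab-separated metadata; nutch.score and nutch.fetchInterval are the metadata keys referenced in the InjectorJob code linked above, while the actual values are placeholders you'd have to tune):

http://www.example-news-site.com/	nutch.score=100	nutch.fetchInterval=3600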
If you discover new homepages during your normal crawl cycle, it is trickier, because Nutch doesn't have any built-in way of detecting whether a given URL is a homepage. In this case you'll need to check whether the current URL is a homepage and, if it is, modify its score/fetch interval to ensure that it ends up among the top-ranking URLs, as in the sketch below.
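The check itself can be a simple heuristic, for instance treating a URL as a homepage when its path is empty or just "/". Shown in Python only to illustrate the idea; inside Nutch you would implement the same logic in a custom plugin:

from urllib.parse import urlparse

def is_homepage(url):
    parsed = urlparse(url)
    # Heuristic only: no path and no query string -> probably a homepage
    return parsed.path in ('', '/') and not parsed.query

print(is_homepage('http://example.com/'))           # True
print(is_homepage('http://example.com/a/b.html'))   # False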
This workaround could potentially cause some issues: Nutch could end up always crawling only the homepages and not the rest of the URLs, which is not a good outcome.
You could also write your own generator; that way you have more control and don't rely on the score and fetch interval alone.
Full disclosure: although I've used a similar approach in the past, we ended up replacing this system with StormCrawler (we were building a news search engine) because we needed more control over when pages were fetched (the batch nature of Nutch is not a great fit for this use case), plus there were some other business cases that needed a more NRT approach.
Can we have a script which will crawl through an entire website to figure out if there are any pages which take more time to load (some pages under a particular category were taking more time to load), using Selenium WebDriver or JMeter?
For JMeter you can use the HTML Link Parser configuration element for this purpose. From the documentation:
Spidering Example
Consider a simple example: let's say you wanted JMeter to "spider" through your site, hitting link after link parsed from the HTML returned from your server (this is not actually the most useful thing to do, but it serves as a good example). You would create a Simple Controller, and add the "HTML Link Parser" to it. Then, create an HTTP Request, and set the domain to ".*", and the path likewise. This will cause your test sample to match with any link found on the returned pages. If you wanted to restrict the spidering to a particular domain, then change the domain value to the one you want. Then, only links to that domain will be followed.
More information on above approach and a couple more options: How to Spider a Site with JMeter - A Tutorial
Remember that JMeter is not a browser, hence it doesn't execute JavaScript, so your results may not be precise enough: JMeter doesn't measure the time required to actually render the page.
I want to crawl a web site with Nutch and then index the result in Solr.
In Solr I have a schema.xml file; imagine that in this file I have a field called content.
But every site has its own pattern. For example, for one site I want to put the value of the "body" tag into the content field (in the Solr schema), and for another site I want to put the value of the "content" tag into the content field.
I mean: if in the crawl result I find a "body" tag, I use its value to store in the content field; otherwise, if I find a "content" tag, I use that value instead.
How can I do that?
Can I have a special field in Solr filled from different tag values in the Nutch crawl result, based on which tag is found on each web site?
Indexing content with Nutch and posting it to Solr should be straightforward. But if you want to add logic, and the list of rules might grow, it may be worth using a content processing engine.
I've seen this tool used for that specific purpose, but it uses Heritrix as the crawler and you can create Groovy scripts to decide how to handle your content: www.searchtechnologies.com/aspire
I am crawling websites using Apache Nutch 2.2.1, which provides me content to index in Solr. When Nutch fetches content, there is contextual information such as "contact us", "legal notice" or other irrelevant information (generally coming from the top menu, left menu or the footer of the page) that I do not need to index.
One solution would be to automatically select the most relevant part of the content to index, which could be done by an automatic summarizer. There is a plugin "summary-basic"; is it used for this purpose? If so, how is it configured? Other solutions are also welcome.
In regex-urlfilter.txt you can specify the list of URLs you want to ignore. You can add the links for "contact us" and the like (typically all the header/footer pages that you don't want to crawl) to that regex list. While crawling the web, Nutch will ignore those URLs and only fetch the required content. You can find regex-urlfilter.txt under the apache-nutch-2.2.1/conf folder.
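For example (the patterns below are placeholders for your site's actual contact/legal pages; in regex-urlfilter.txt a line starting with "-" excludes matching URLs and a line starting with "+" accepts them):

# skip the pages you don't want to fetch (example patterns)
-^https?://www\.example\.com/contact-us
-^https?://www\.example\.com/legal-notice
# accept anything else
+.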
I started with Scrapy some days ago and learned about scraping particular sites, i.e. the dmoz.org example; so far it's fine and I like it. As I want to learn about search engine development, I aim to build a crawler (plus storage, indexer etc.) for a large number of websites of any "color" and content.
So far I have also tried depth-first-order and breadth-first-order crawling.
At the moment I use just one Rule; I set some paths and some domains to skip:
Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
     callback='save_page', follow=True),
I have one pipeline, a MySQL storage pipeline, to store the URL, body and headers of the downloaded pages, done via a PageItem with these fields.
My questions for now are:
Is it fine to use an Item for simply storing pages?
How does the spider check against the database whether a page has already been crawled (in the last six months, say)? Is that built in somehow?
Is there something like a blacklist for useless domains, e.g. placeholder domains, link farms etc.?
There are many other issues, like storage, but I'll stop here with just one more general search engine question:
Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be shipped on hard disks, since otherwise the data volume would be the same as if I crawled it myself (compression left aside).
I will try to answer only two of your questions:
Is it fine to use an Item for simply storing pages?
AFAIK, Scrapy doesn't care what you put into an Item's fields. Only your pipelines will be dealing with them.
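For instance, an Item along the lines of the PageItem you describe (field names taken from your question) is perfectly reasonable:

from scrapy.item import Item, Field

class PageItem(Item):
    # one downloaded page, exactly what your MySQL pipeline stores
    url = Field()
    headers = Field()
    body = Field()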
How does the spider check against the database whether a page has already been crawled (in the last six months, say)? Is that built in somehow?
Scrapy has a duplicates middleware, but it filters duplicates only within the current session. You have to manually prevent Scrapy from re-crawling sites you've already crawled six months ago.
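One common approach (a sketch, not built-in Scrapy behaviour) is a small downloader middleware that looks each request up in your own table of previously crawled URLs and drops it if it was fetched recently. SQLite is used here only to keep the example self-contained; your MySQL store would work the same way, and the table name and schema are assumptions:

import sqlite3
from scrapy.exceptions import IgnoreRequest

class SeenUrlMiddleware(object):
    def __init__(self):
        self.db = sqlite3.connect('seen_urls.db')
        self.db.execute("CREATE TABLE IF NOT EXISTS seen "
                        "(url TEXT PRIMARY KEY, crawled_at TIMESTAMP)")

    def process_request(self, request, spider):
        # Drop the request if this URL was crawled within the last six months
        row = self.db.execute(
            "SELECT 1 FROM seen WHERE url = ? "
            "AND crawled_at > datetime('now', '-6 months')",
            (request.url,)).fetchone()
        if row:
            raise IgnoreRequest('already crawled: %s' % request.url)
        return None  # not seen recently, let the request through

Enable it via DOWNLOADER_MIDDLEWARES in settings.py and have the pipeline that stores pages also insert each URL (with a timestamp) into the same table.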
As for questions 3 and 4: I didn't understand them.