Template-based Indexing/Extraction with Apache Nutch & Solr

I am new to the Apache Nutch/Solr family of products. I have set up a basic Nutch (1.6) installation with Solr (4.3), successfully crawled a site, and Solr has indexed my crawled data as well.
Now my question is: if I crawl a blog where users can leave comments (e.g. http://blogs.alliedtechnique.com/2009/04/16/setting-global-environment-variables-in-centos/), how can I make sure Nutch treats the users' comments and the main blog post as separate documents? That way, when I search for a keyword, it returns the main post and the comments as separate results, and I could later use that data for sentiment analysis as well.
I would greatly appreciate any help here.
Thanks.
Tony

You could use the xpath filter plugin to segregate crawled content into two different fields.
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
Content inside class="post" would go to field A, and content inside class="commentlist" would go to field B.
In your search page logic, you query Solr on field A, so your search results come only from the blog posts, not the comments.
The comment data is still saved against the document, just not searched.
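For illustration, here is a minimal sketch of that search-page logic in Python, assuming Solr runs locally with a core named collection1 and that the xpath filter maps the post body to a post_content field and the comments to a comment_content field (all of those names are placeholders for whatever your own configuration uses):

```python
# Sketch: query only the blog-post field so comments don't show up as hits.
# The core name ("collection1") and field names ("post_content", "comment_content")
# are assumptions -- substitute whatever your xpath filter config actually maps to.
import requests

SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def search_posts(keyword):
    params = {
        "q": 'post_content:"%s"' % keyword,  # field A only: the post body
        "fl": "id,title,post_content,comment_content",  # comments still returned for later analysis
        "wt": "json",
    }
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

if __name__ == "__main__":
    for doc in search_posts("environment variables"):
        print(doc.get("id"), doc.get("title"))
```

Because the comment field is only listed in fl and never queried, the comments stay available for later sentiment analysis without polluting the search results.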

Related

Incrementally crawl a website with Scrapy

I am new to crawling and would like to know whether it's possible to use Scrapy to crawl a site, such as CNBC.com, incrementally. For example, if I crawl all pages of the site today, then from tomorrow on I only want to collect newly posted pages, to avoid re-crawling all the old pages.
Thank you for any info or input on this.
Yes you can, and it's actually quite easy. Every news website has a few very important index pages, such as the homepage and the category pages (e.g. politics, entertainment, etc.). No article goes live without appearing on those pages for at least a few minutes. Scan those pages every minute or so and save just the links. Then diff them against what you already have in your database, and a few times a day issue a crawl to scrape all the missing links. Very standard practice.
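Sketched in Python under those assumptions (the index-page URLs are placeholders, and a flat seen_links.txt file stands in for your database), the scan-and-diff step might look like this:

```python
# Sketch of the "scan the index pages, diff, crawl only the missing links" routine.
# The index-page URLs are placeholders, and a flat seen_links.txt file stands in
# for the database holding the links you have already crawled.
import requests
import lxml.html

INDEX_PAGES = [
    "https://www.example-news-site.com/",           # homepage (placeholder)
    "https://www.example-news-site.com/politics/",  # a category page (placeholder)
]

def load_seen(path="seen_links.txt"):
    try:
        with open(path) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def discover_links():
    found = set()
    for url in INDEX_PAGES:
        doc = lxml.html.fromstring(requests.get(url, timeout=30).content)
        doc.make_links_absolute(url)
        found.update(doc.xpath("//a/@href"))
    return found

if __name__ == "__main__":
    new_links = discover_links() - load_seen()  # the diff: links not stored yet
    # feed new_links to the actual crawl (e.g. a Scrapy spider's start_urls),
    # then persist them so tomorrow's run skips them
    with open("seen_links.txt", "a") as f:
        for link in sorted(new_links):
            f.write(link + "\n")
```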
Please try the Scrapy plugin scrapy-deltafetch, which will make your life easier.
Short answer: no.
Longer answer: what you could do is write the article ID or the article URL to a file, and during scraping match each ID or URL against the records in that file.
Remember to load the file only once and assign it to a variable; don't reload it on every iteration while scraping.
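A rough Scrapy sketch of that file-based approach, with the seen URLs loaded once in the spider's constructor (the spider name, start URL, file path and CSS selectors are all placeholders):

```python
# Sketch of the file-based approach: the seen-URL set is loaded once when the
# spider starts, never inside the parse loop. Spider name, start URL, file path
# and CSS selectors are placeholders.
import os
import scrapy

class IncrementalSpider(scrapy.Spider):
    name = "incremental_example"
    start_urls = ["https://www.example-news-site.com/"]
    seen_file = "seen_urls.txt"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load the records once and keep them in memory for the whole crawl.
        self.seen = set()
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.seen = set(line.strip() for line in f)

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url in self.seen:
                continue  # already scraped in a previous run
            self.seen.add(url)
            with open(self.seen_file, "a") as f:
                f.write(url + "\n")
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

The scrapy-deltafetch plugin mentioned above handles this kind of bookkeeping for you at the middleware level.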

Extracting specific tags values with Apache Nutch

I'm trying to fetch a list of URLs and parse their title, keywords and description (and ignore everything else) using Apache Nutch.
After that, I just want to save the title, keywords and description content for each URL (preferably without the tags themselves), without any indexing.
I looked at several examples of how to do this. Just a few examples of what I encountered:
How to parse content located in specific HTML tags using nutch plugin?
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
However, they all propose complicated (at least to a Nutch newbie) plugin configurations and settings.
Since my use case sounds like a very common one, I was wondering if there is a simpler solution?
Thanks
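For what it's worth, if Nutch itself isn't a hard requirement, the same extraction can be sketched in a few lines of plain Python; the URL list and output file below are placeholders:

```python
# Plain-Python sketch: fetch each URL and pull out just the <title>, meta keywords
# and meta description, then dump them to a CSV. The URL list and output file name
# are placeholders.
import csv
import requests
import lxml.html

URLS = ["https://www.example.com/"]  # your list of URLs here

def extract(url):
    doc = lxml.html.fromstring(requests.get(url, timeout=30).content)
    def meta(name):
        values = doc.xpath('//meta[@name="%s"]/@content' % name)
        return values[0].strip() if values else ""
    title = doc.findtext(".//title") or ""
    return {"url": url,
            "title": title.strip(),
            "keywords": meta("keywords"),
            "description": meta("description")}

if __name__ == "__main__":
    with open("extracted.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title", "keywords", "description"])
        writer.writeheader()
        for url in URLS:
            writer.writerow(extract(url))
```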

Apache NUTCH, relevant crawling

I am crawling websites using Apache Nutch 2.2.1, which provides me with content to index in Solr. When Nutch fetches content, there is contextual information such as "contact us", "legal notice" or other irrelevant material (generally coming from the top menu, left menu or footer of the page) that I do not need to index.
One solution would be to automatically select the most relevant part of the content to index, which could be done by an automatic summarizer. There is a plugin "summary-basic"; is it used for this purpose? If so, how is it configured? Other solutions are also welcome.
In regex-urlfilter.txt you can specify the list of URLs you want to ignore. You can add patterns for the "contact us" and "legal notice" links (and typically any header/footer pages you don't want to crawl) to that regex list. While crawling, Nutch will skip those URLs and only fetch the required content. You can find regex-urlfilter.txt under the apache-nutch-2.2.1/conf folder.
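For illustration, the entries might look like the following (the paths are placeholders for wherever the site's contact/legal pages actually live); note that exclusion rules must appear before the final catch-all +. line, since the first matching pattern wins:

```
# skip the boilerplate pages (example paths -- adjust to the real site)
-^https?://www\.example\.com/contact-us
-^https?://www\.example\.com/legal-notice

# accept everything else
+.
```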

Google SEO - duplicate content in web pages for submitting sitemaps

I hope my question is not too off-topic for Stack Overflow.
This is my website: http://www.rader.my
It's a car information website. The content is dynamic, so the Google crawler cannot find all the car specification pages on my website.
I created a sitemap with all my car URLs in it (for instance, http://www.rader.my/Details.php?ID=13 is for one car). I know I haven't made any mistake in my .xml file's format and structure, but after submission Google only indexed one URL, which is my index.php.
I have also read about rel="canonical", but I don't think I should use it in my case, since all my pages ARE different, with different content; only the structure is the same.
Is there anything I missed? Why doesn't Google accept my URLs even though the contents are different? What can I do to fix this?
Thanks and regards,
Amin
I have a similar type of site. Google is good about figuring out dynamic sites. They'll crawl the pages and figure out the unique content as time goes on. Give it time.
You should do all the standard things:
Make sure each page has a unique H1 tag.
Make sure each page has substantial unique content.
Unique keywords and description tags aren't as useful as they used to be, but they can't hurt.
Cross-link internally. Create category pages that include links to all of one manufacturer's cars, and have each of that manufacturer's pages link back to 'similar' pages.
Get links to your pages. Nothing helps getting indexed like external authority.

How do I get Google to index changes made to my website's keywords?

I have made changes to my website's keywords, description, and title, but Google is not indexing the new keywords. Instead, I have found that Google is still indexing the old ones.
How can I get Google to index my site using the new keywords that I have added?
The interval between crawls varies a lot from page to page. A post to SO will be crawled and indexed by Google in seconds. A personal page whose content hasn't changed in 20 years might not be crawled even once a year.
Submitting a sitemap through Webmaster Tools will likely trigger a re-crawl of your website to validate the sitemap. You could use this to speed up re-crawling.
However, as @Charles noted, the keywords meta tag is mostly ignored by Google, so it sounds like you're wasting your time.