Which is the better way to use Scrapy to crawl 1000 sites? - scrapy

I'd like to hear the diffrences between 3 different approaches for using Scrapy in order to crawl 1000 sites.
For example, I want to scrape 1000 photo sites, they all most has the same structure.Like have one kind of photo list page,and other kind of big photo page; but these list or photo desc page's HTML code will not all the same.
Another example,I want to scrape 1000 wordpress blog,Only bolg's article.
The first, is exploring the entire 1000 sites using one scrapy project.
The second, is having all these 1000 sites under the same scrapy project, all items in items.py, each site having it's own spider.
The third is similar to the second, but having one spider for all the sites instead of seperating them.
What are the diffrences, and which do you think is the right approach? Is there any other, better approach I've missed?

I had 90 sites to pull from so it wasn't great option to create one crawler per site. The idea was to be able to run in parallel. Also i had split this to pack similar page formats in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL Extractor. This would extract all detail page URLs from top level listing page in a file(s).
Crawler 2 - Fetch Details.
This would read from the URL file and extract item details.
This allowed me to fetch URLs first and estimate number of threads that i might need for second crawler.
Since each crawler was working on specific page format, there were quite a few functions I could reuse.

Related

Apache Nutch 2.3.1 Website home page handling

I have configured Nutch 2.3.1 to crawl some news websites. As websites homepages are going to change after one day that why I want to handle home page in some different way so that for homepage, only main categories are crawled instead of text as text will change after sometime ( I have observed similar things in Google).
For rest of pages, its working fine ( crawling text etc.)
At the moment Nutch doesn't offer any special treatment for homepages, it is just one more URL to crawl. If you want to do this you'll probably need to customise some portions of Nutch.
If you're collecting a fixed set of URLs (that you usually put in the seed file) you can attach some metadata to these URLs and use a different strategy for these URLs. For instance setting a really high score & short fetch interval (https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/crawl/InjectorJob.java#L56-L59).
Since the generator job will sort the URLs by score, this should work as long as all other URLs have a score lower than the value that you use for the seed URLs. Keep in mind that this will cause Nutch to crawl this URLs every time that a new cycle starts (since the seed URLs are going to be on the top all the time).
If you discover new homepages during your normal craw cycle, then it is tricky because Nutch doesn't have any way of detecting if a given URL is a homepage or not. For this case you'll need to check if the current URL is a homepage, if it is indeed a homepage then, you'll need to modify the score/fetch interval to ensure that this URL ends up in the top ranking URLs.
This workaround could potentially cause some issues: Nutch could end up crawling always only the homepages and not the rest of the URLs, which is not a good case.
You could also write your own generator, this way you have more control and don't rely only on the score, fetch interval alone.
Full disclosure: Although I've used a similar approach in the past we ended up changing this system to use StormCrawler (we were building a news search engine) so we needed more control over when the pages were being fetched (the batch nature of Nutch it is not a great fit for this use case), and some other business cases that needed a more NRT approach.

Incrementally crawl a website with Scrapy

I am new to crawling and would like to know whether it's possible to use Scrapy to crawl a site, like CNBC.com, incrementally? For example, if today I crawled all pages from a site, then from tomorrow I only want to collect pages that are newly posted to this site, to avoid crawling all the old pages.
Thank you for any info. or input on this.
Yes you can and it's actually quite easy. Every news website has a few very important index pages like the homepage and the categories (eg politics, entertainment etc.) There is no article that doesn't go through these pages for at least a few minutes. Scan those pages every minute or so and save just the links. Then do a diff with what you already have in your databases and a few times a day issue a crawl to scrape all the missing links. Very standard practice.
Please try the scrapy plugin scrapy-deltafetch , which would make your life easier.
Short answer: no.
Longer answer: What you could do is write the article id or the article url to a file and during the scraping, you would match the id or url with the records in the file.
Remember to load your file only once and assign it to a variable. Don't load it during your iteration when scraping.

How to remove duplicate title and meta description tags if google indexed them

So, I have been building an ecommerce site for a small company.
The url structure is : www.example.com/product_category/product_name and the site has around 1000 products.
I've checked google webmaster tools and in the HTML improvements section it shows that I have multiple title and meta description tags for all the product pages. They all appear two times, both:
-www.example.com/product_category/product_name
and
-www.example.com/product_category/product_name/ (with slash in the end)
got indexed as separate pages.
I've added a 301 redirect from every www.example.com/product_category/product_name/ to www.example.com/product_category/product_name, but this was almost two weeks ago. I have resubmitted my sitemap and asked google to fetch the whole page a few times. Nothing has changed, GWT still shows the pages as duplicate tags.
I did not get any manual action message.
So I have two questions:
-how can I accelerate the reindexation process, if it's possible?
-and do these tags hurt my organic search results? I've googled it, yes and some say it does and some say it doesn't.
An option is to set a canonical link on both URLs (with and without /) using the URL without a /. Little by little, Google will stop complaining. Keep in mind Google Webmaster Tools is slow to react, especially when you don't have much traffic or backlinks.
And yes, duplicate tags can influence your rankings negatively because users won't have proper and specific information for each page.
Set a canonical link on both Urls is a solution but it take time from my experience.
The fasted way is to block old URL in robots.txt file.
Disallow: /old_url
canonical tag is option but why you are not adding different title and description for all pages.
you can add dynamic meta tags one time and it will create automatically for all pages so we dont worry about duplication.

scrapy CrawlSpider: crawl policy / queue questions

I started with scrapy some days ago, learned about scraping particular sites, ie the dmoz.org example; so far it's fine and i like it. As I want to learn about search engine development I aim to build a crawler (and storage, indexer etc) for large amount of websites of any "color" and content.
So far I also tried the depth-first-order and bredth-first-order crawling.
I use at the moment just one Rule, I set some path to skip and some domains.
Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
callback='save_page', follow=True),
I have one pipeline, a mysql storage to store url, body and headers of the downloaded pages, done via a PageItem with these fields.
My questions for now are:
Is it fine to use item for simple storing of pages ?
How does it work that the spider checks the database if a page is already crawled (in the last six months ie), it's builtin somehow?
Is there something like a blacklist for useless domains, ie. placeholder domains, link farms etc.?
There are many other issues like storage but I guess I stop here, just one more general search engine question
Is there a way to obtain crawl result data from other professional crawlers, of course it must be done by sending harddisks otherwise the data volume would be the same if I crawl them myself, (compressing left aside).
I will try to answer only two of your questions:
Is it fine to use item for simple storing of pages ?
AFAIK, scrapy doesn't care what you put into an Item's Field. Only your pipeline will dealing be with them.
How does it work that the spider checks the database if a page is already crawled (in the last six months ie), it's builtin somehow?
Scrapy has duplicates middleware, but it filters duplicates only in current session. You have to manually prevent scrapy to not crawl sites you've crawled six months ago.
As for question 3 and 4 - you don't understand them.

Count the number of pages in a site

I'd like to know how many public pages there are in a site, say for example, smashingmagzine.com. Is there are way to count the number of pages?
You can query Google's index using the site operator. e.g:
site:domain-to-query.com
This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality but I don't know the syntax off hand.
Of course not all pages may be indexed, and the index may contain pages which no longer exist.
You need to basically crawl the site. Your process would be something like:
Start at root domain / homepage
Look for all links that point within the same domain
For each of those links, repeat the steps
Your loop terminates when there are no more links to crawl that are pointing in the same domain. Remember to stay in the site otherwise you'll start crawling external sites.
You can also try parsing the sitemap if they provide one.
One tool that might prove useful if using Java is JSpider or Sphider in PHP.
You'll need to recursively scan the markup of each page, starting with your top level page, looking for any kind of links to other pages, and recursively crawl through them. You'll also need to keep track of what has been scanned as to not get caught in an infinate loop.