scrapy CrawlSpider: crawl policy / queue questions

I started with Scrapy a few days ago and learned about scraping particular sites, e.g. the dmoz.org example; so far it's fine and I like it. As I want to learn about search engine development, I aim to build a crawler (plus storage, indexer, etc.) for a large number of websites of any "color" and content.
So far I have also tried depth-first and breadth-first crawling.
At the moment I use just one Rule, with some paths and domains to skip:
rules = (
    Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
         callback='save_page', follow=True),
)
I have one pipeline, a MySQL storage pipeline that stores the URL, body, and headers of the downloaded pages, fed by a PageItem with those fields.
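For reference, a rough sketch of what such an item and pipeline could look like; the class, table, and credential names below are my own placeholders, not something Scrapy prescribes:

# items.py - a PageItem with the three fields mentioned above
import scrapy

class PageItem(scrapy.Item):
    url = scrapy.Field()
    headers = scrapy.Field()
    body = scrapy.Field()

# pipelines.py - minimal MySQL storage pipeline (table and credentials assumed)
import MySQLdb

class MysqlPagePipeline(object):
    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host='localhost', user='scrapy',
                                    passwd='secret', db='crawl')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO pages (url, headers, body) VALUES (%s, %s, %s)",
            (item['url'], str(item['headers']), item['body']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()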
My questions for now are:
Is it fine to use an Item for simply storing pages?
How can the spider check whether a page is already in the database (i.e. crawled in the last six months)? Is that built in somehow?
Is there something like a blacklist for useless domains, i.e. placeholder domains, link farms, etc.?
There are many other issues, like storage, but I'll stop here. Just one more general search-engine question:
Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be done by shipping hard disks, since otherwise the data volume would be about the same as if I crawled it myself (compression aside).

I will try to answer only two of your questions:
Is it fine to use an Item for simply storing pages?
AFAIK, Scrapy doesn't care what you put into an Item's fields. Only your pipelines will be dealing with them.
How can the spider check whether a page is already in the database (i.e. crawled in the last six months)? Is that built in somehow?
Scrapy has a duplicates filter, but it only filters duplicates within the current crawl. You have to prevent Scrapy yourself from re-crawling sites you crawled six months ago.
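If you need that persistence yourself, one possible sketch is a downloader middleware that checks your MySQL table and drops requests for URLs stored within the last six months (the table and column names are assumptions; you would enable it via the DOWNLOADER_MIDDLEWARES setting):

# middlewares.py - sketch: skip URLs already crawled in the last six months
import MySQLdb
from scrapy.exceptions import IgnoreRequest

class RecentlyCrawledMiddleware(object):
    def __init__(self):
        self.conn = MySQLdb.connect(host='localhost', user='scrapy',
                                    passwd='secret', db='crawl')

    def process_request(self, request, spider):
        cursor = self.conn.cursor()
        cursor.execute(
            "SELECT 1 FROM pages WHERE url = %s "
            "AND crawled_at > NOW() - INTERVAL 6 MONTH",
            (request.url,))
        if cursor.fetchone():
            raise IgnoreRequest('crawled within the last six months')
        return None  # not seen recently, let the request proceed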
As for questions 3 and 4 - I don't understand them.

Related

Scrapy - making sure I get all the pages from a domain / how to tell I didn't / what to do about it?

I have a pretty generic spider that I do broad crawls with. I feed it a couple hundred starting URLs, limit the allowed_domains and let it go wild (I'm following the suggested 'Avoiding getting banned' measures like auto-throttle, no cookies, rotating user agents, rotating proxies, etc.).
Everything had been going smoothly until about a week ago, when the batch of starting URLs included a pretty big, known domain. Fortunately I was monitoring the scrape at the time and noticed that the big domain just "got skipped". When looking into why, it seemed that the domain recognized I was using a public proxy and 403ed my initial request to 'https://www.exampledomain.com/', so the spider didn't find any URLs to follow and hence no URLs were scraped for that domain.
I then tried using a different set of proxies and/or a VPN, and that time I was able to scrape some of the pages but got banned shortly after.
The problem is that I need to scrape every single page up to 3 levels deep. I cannot afford to miss a single one. Also, as you can imagine, missing a request at the initial or first level can potentially lead to missing thousands of URLs, or to no URLs being scraped at all.
When a page fails on the initial request, it is pretty straightforward to tell something went wrong. However, when you scrape thousands of URLs from multiple domains in one go, it's hard to tell whether any got missed. And even if I do notice 403s and realize I got banned, the only thing to do at that point seems to be to cross my fingers and run the whole domain again, since I can't say whether the URLs I missed due to 403s (and all the URLs I would have gotten from deeper levels) were scraped via other URLs that linked to the 403ed one.
The only thing that comes to mind is to SOMEHOW collect the failed URLs, save them to a file at the end of the scrape, make them the start_urls and run the scrape again. But that would re-scrape all of the other pages that were scraped successfully the first time. Preventing that would require somehow passing in a list of successfully scraped URLs and marking them as denied. But that still isn't a be-all-end-all solution, since there are pages that will 403 even when you're not banned, like resources you need to be logged in to see, etc.
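For what it's worth, a rough sketch of that "collect the failed URLs" idea, using a request errback and writing the list out when the spider closes (the spider name, file name, and error handling below are placeholders):

# sketch: remember requests that fail (403s, timeouts, DNS errors, ...)
# and dump them to a file so a follow-up run can retry just those URLs
import json
import scrapy

class BroadSpider(scrapy.Spider):
    name = 'broad'
    start_urls = ['https://www.exampledomain.com/']  # placeholder

    def __init__(self, *args, **kwargs):
        super(BroadSpider, self).__init__(*args, **kwargs)
        self.failed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 errback=self.on_error)

    def parse(self, response):
        # normal link extraction / item parsing would go here
        pass

    def on_error(self, failure):
        # HTTP errors (like 403s) carry the offending response on failure.value;
        # DNS/timeout errors carry the request on failure.request
        response = getattr(failure.value, 'response', None)
        request = getattr(failure, 'request', None)
        if response is not None:
            self.failed_urls.append(response.url)
        elif request is not None:
            self.failed_urls.append(request.url)

    def closed(self, reason):
        # called automatically when the crawl finishes
        with open('failed_urls.json', 'w') as f:
            json.dump(self.failed_urls, f)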
TLDR: How do I make sure I scrape all the pages from a domain? How do I tell if I didn't? What is the best way of doing something about it?

Apache Nutch 2.3.1 Website home page handling

I have configured Nutch 2.3.1 to crawl some news websites. Since the websites' homepages change from one day to the next, I want to handle homepages differently, so that for a homepage only the main categories are crawled instead of the text, as the text will change after some time (I have observed Google doing something similar).
For the rest of the pages it's working fine (crawling text, etc.).
At the moment Nutch doesn't offer any special treatment for homepages; a homepage is just one more URL to crawl. If you want this behaviour you'll probably need to customise some portions of Nutch.
If you're crawling a fixed set of URLs (the ones you usually put in the seed file), you can attach metadata to those URLs and apply a different strategy to them, for instance setting a really high score and a short fetch interval (https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/crawl/InjectorJob.java#L56-L59).
Since the generator job sorts the URLs by score, this should work as long as all other URLs have a score lower than the value that you use for the seed URLs. Keep in mind that this will cause Nutch to crawl these URLs every time a new cycle starts (since the seed URLs will be at the top all the time).
If you discover new homepages during your normal crawl cycle, it gets tricky, because Nutch has no way of detecting whether a given URL is a homepage. In that case you'll need to check whether the current URL is a homepage yourself and, if it is, modify the score/fetch interval to ensure that it ends up among the top-ranked URLs.
This workaround can cause its own issues: Nutch could end up always crawling only the homepages and never the rest of the URLs, which is not what you want.
You could also write your own generator; that way you have more control and don't rely on the score and fetch interval alone.
Full disclosure: although I've used a similar approach in the past, we ended up switching this system to StormCrawler (we were building a news search engine) because we needed more control over when pages were fetched (the batch nature of Nutch is not a great fit for this use case), and because of some other business requirements that called for a more near-real-time approach.

Incrementally crawl a website with Scrapy

I am new to crawling and would like to know whether it's possible to use Scrapy to crawl a site, like CNBC.com, incrementally. For example, if today I crawl all the pages of the site, then from tomorrow on I only want to collect pages that are newly posted to the site, to avoid crawling all the old pages again.
Thank you for any info. or input on this.
Yes you can, and it's actually quite easy. Every news website has a few very important index pages, like the homepage and the category pages (e.g. politics, entertainment, etc.). There is no article that doesn't pass through these pages for at least a few minutes. Scan those pages every minute or so and save just the links. Then do a diff against what you already have in your database, and a few times a day issue a crawl to scrape all the missing links. Very standard practice.
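A bare-bones sketch of that idea (the index URLs, selectors, and the seen-URL store are all simplified assumptions; in practice the "seen" set would live in your database):

# sketch: scan a news site's index pages and emit only links not seen before;
# a separate, less frequent crawl then scrapes the new articles themselves
import scrapy

SEEN_URLS = set()  # in practice, load this from your database

class IndexSpider(scrapy.Spider):
    name = 'news_index'
    start_urls = [
        'https://www.cnbc.com/',           # homepage
        'https://www.cnbc.com/politics/',  # category pages, etc.
    ]

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if url not in SEEN_URLS:
                SEEN_URLS.add(url)
                # store the new link; the article itself is scraped later
                yield {'url': url, 'found_on': response.url}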
Please try the Scrapy plugin scrapy-deltafetch, which will make your life easier.
Short answer: no.
Longer answer: what you can do is write the article ID or URL to a file and, during scraping, match each ID or URL against the records in that file.
Remember to load the file only once and assign it to a variable; don't reload it on every iteration while scraping.
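A minimal sketch of that file-based approach, loading the file once in the spider's constructor (the file name and selectors are placeholders):

# sketch: load previously scraped article URLs once, skip them while crawling,
# and write the updated list back when the spider closes
import os
import scrapy

class IncrementalSpider(scrapy.Spider):
    name = 'incremental'
    start_urls = ['https://www.cnbc.com/']  # placeholder

    def __init__(self, *args, **kwargs):
        super(IncrementalSpider, self).__init__(*args, **kwargs)
        self.seen = set()
        if os.path.exists('seen_urls.txt'):
            with open('seen_urls.txt') as f:   # loaded once, not per page
                self.seen = set(line.strip() for line in f)

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if url in self.seen:
                continue                       # already scraped on an earlier run
            self.seen.add(url)
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

    def closed(self, reason):
        with open('seen_urls.txt', 'w') as f:
            f.write('\n'.join(sorted(self.seen)))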

Which is the better way to use Scrapy to crawl 1000 sites?

I'd like to hear the differences between three different approaches to using Scrapy to crawl 1000 sites.
For example, I want to scrape 1000 photo sites. They almost all have the same structure: one kind of photo-list page and another kind of large photo page, but the HTML of these list and photo-description pages won't all be the same.
Another example: I want to scrape 1000 WordPress blogs, only the blogs' articles.
The first is exploring all 1000 sites using a single Scrapy project.
The second is having all 1000 sites under the same Scrapy project, with all items in items.py and each site having its own spider.
The third is similar to the second, but with one spider for all the sites instead of separating them.
What are the differences, and which do you think is the right approach? Is there any other, better approach I've missed?
I had 90 sites to pull from, so creating one crawler per site wasn't a great option. The idea was to be able to run in parallel. I also split the work so that similar page formats were handled in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL extractor. This extracts all detail-page URLs from the top-level listing pages into a file (or files).
Crawler 2 - detail fetcher. This reads from the URL file and extracts the item details.
This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
Since each crawler worked on a specific page format, there were quite a few functions I could reuse.
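A stripped-down sketch of that two-crawler split (the site URL, selectors, and file names are placeholders):

# sketch: crawler 1 extracts detail-page URLs from listing pages into a file,
# crawler 2 reads that file and scrapes the item details
import csv
import scrapy

class UrlExtractorSpider(scrapy.Spider):
    name = 'url_extractor'
    start_urls = ['https://example-site-1.com/listing']  # one or more per site

    def parse(self, response):
        for href in response.css('a.detail-link::attr(href)').getall():
            yield {'detail_url': response.urljoin(href)}
    # run with: scrapy crawl url_extractor -o detail_urls.csv

class DetailSpider(scrapy.Spider):
    name = 'detail_fetcher'

    def start_requests(self):
        with open('detail_urls.csv') as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row['detail_url'],
                                     callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url,
               'title': response.css('h1::text').get()}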

SEO: how can dynamic URL with query strings be searched by search engine bots?

I'm developing an ecommerce web site in ASP.NET using a SQL Server 2008 database.
Most of my pages are database driven and all the content is gathered from a SQL Server.
Every product page is created dynamically from data coming from the database, hence every product’s page URL has a unique query string, containing a “product_id” variable.
*Example: http://www.myecommence.com/products.aspx?product_id=1*
I'd like to improve my Search Engine Optimization.
Dealing with a small number of products could be fine, but what if I had more than 1000 products? How could every product be crawled?
How does the Google spider/bot know that a product_id with a hypothetical number of 767 exists?
I've been googling this, and I still can't understand how pages that have absolutely no reference on the site or on external sites can be crawled. If this were possible the spider would have to know how to read the website's database tables, but I guess that this is not the case.
At this point, since most of the pages and links are dynamic, how could they be indexed? The same thing applies to "user detail" pages that are accessed via query string using a "user_id=n".
Probably what I'm asking has already been discussed, but some points are still not clear to me.
I would advise using Mod Rewrite rules to make your URLs search engine friendly.
This is very important for Google, as is a good category structure.
Eg:
domain.com/t-shirts/girls/star-wars-t-shirt/
is far better than
domain.com/products.aspx?product_id=1
Here is some info:
http://msdn.microsoft.com/en-us/library/ms972974.aspx
http://www.wrox.com/WileyCDA/Section/id-305997.html
To answer your questions:
Dealing with a small number of products could be fine, but what if I had more than 1000 products? How could every product be crawled?
If you have a good sitemap / menu structure etc, it is likely that Google will crawl all your pages.
How does the Google spider/bot know that a product_id with a hypothetical number of 767 exists?
Via crawling your site, via your sitemap, via the menu system on the site, etc. However, always remember: Google is not psychic - it cannot find a page unless you tell it how to or link to it.
I've been googling this, and I still can't understand how pages that have absolutely no reference on the site or on external sites can be crawled. If this were possible the spider would have to know how to read the website's database tables, but I guess that this is not the case.
If you have no reference - you are doing something wrong. Improve your site structure.
At this point, since most of the pages and links are dynamic, how could they be indexed? The same thing applies to "user detail" pages that are accessed via query string using a "user_id=n".
Nothing wrong with a dynamic URL per se - but again I would recommend implementing search engine friendly URLs via Mod Rewrite or similar - see the above resources.
Good luck,
Colin
Modern systems optimize for SEO by allowing either custom or automated URLs that remap to your ID-based URL pattern. This URL style allows a fully custom, word-for-word product title or keyword/description, which carries more weight than a random ID number in a URL.
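As a purely illustrative sketch of that remapping (written in Python here for brevity; the actual site would do this in ASP.NET, and the slugs/IDs below are made-up placeholders), the idea is to store a keyword slug for each product and resolve it back to the product_id:

# sketch: derive a keyword slug from the product title and map it back
# to the numeric product_id used internally
import re

def slugify(title):
    # "Girls Star Wars T-Shirt" -> "girls-star-wars-t-shirt"
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

# stored alongside each product row: (product_id, slug)
products = {1: 'girls-star-wars-t-shirt', 767: 'blue-denim-jacket'}
slug_to_id = dict((slug, pid) for pid, slug in products.items())

def resolve(slug):
    # /t-shirts/girls/girls-star-wars-t-shirt/ -> product_id 1
    return slug_to_id.get(slug)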
To ensure all individual pages are indexed, you generally benefit most from submitting or making available a sitemap XML. More info from Google on generating one here:
https://code.google.com/p/googlesitemapgenerator/
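To make the "every product gets crawled" part concrete, here is a small illustrative sketch (Python, with placeholder product IDs; in practice you'd pull the IDs from your products table) that writes a sitemap.xml listing every product URL:

# sketch: generate a sitemap.xml so crawlers can discover product pages
# even if nothing else links to them
product_ids = [1, 2, 767]  # in practice: SELECT product_id FROM products

entries = ''.join(
    '  <url><loc>http://www.myecommence.com/products.aspx?product_id=%d</loc></url>\n' % pid
    for pid in product_ids)

sitemap = ('<?xml version="1.0" encoding="UTF-8"?>\n'
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
           + entries +
           '</urlset>\n')

with open('sitemap.xml', 'w') as f:
    f.write(sitemap)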
Hope that gets you going in the right direction!