Scrapy - making sure I get all the pages from a domain / how to tell I didn't / what to do about it?

I have a pretty generic spider that I do broad crawls with. I feed it a couple hundred starting urls, limit the allowed_domains and let it go wild (I'm following the suggested 'Avoiding getting banned' measures like auto-throttle, no cookies, rotating user agents, rotating proxies etc).
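Concretely, the settings I mean are roughly these (a sketch of the relevant part of settings.py; the rotation middleware names are just placeholders for whatever you actually plug in there):

# settings.py (sketch) - the broad-crawl politeness settings referred to above
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
COOKIES_ENABLED = False
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOADER_MIDDLEWARES = {
    # hypothetical rotation middlewares, stand-ins for your own implementations
    'myproject.middlewares.RotatingUserAgentMiddleware': 400,
    'myproject.middlewares.RotatingProxyMiddleware': 410,
}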
Everything had been going smoothly until about a week ago, when the batch of starting URLs included a pretty big, well-known domain. Fortunately I was monitoring the scrape at the time and noticed that the big domain just "got skipped". When I looked into why, it turned out the domain recognized I was using a public proxy and 403ed my initial request to 'https://www.exampledomain.com/', so the spider found no URLs to follow and nothing was scraped for that domain.
I then tried using a different set of proxies and/or VPN and that time I was able to scrape some of the pages but got banned shortly after.
The problem with that is that I need to scrape every single page down to 3 levels deep. I cannot afford to miss a single one. Also, as you can imagine, missing a request at the initial or first level can potentially lead to missing thousands of URLs, or to no URLs being scraped at all.
When a page fails on the initial request, it's pretty straightforward to tell something went wrong. But when you scrape thousands of URLs from multiple domains in one go, it's hard to tell whether any got missed. And even if I do notice the 403s and realize I got banned, the only thing to do at that point seems to be to cross my fingers and run the whole domain again, since I can't tell whether the URLs I missed because of the 403s (and everything reachable from them at deeper levels) were picked up via other pages that also linked to them.
The only thing that comes to mind is to SOMEHOW collect the failed URLs, save them to a file at the end of the scrape, make them the start_urls and run the scrape again. But that would re-scrape all of the other pages that were scraped successfully the first time. Preventing that would require somehow passing in the list of successfully scraped URLs and treating them as denied. And even that isn't a be-all-end-all solution, since there are pages that will 403 even when you are not banned, like resources you need to be logged in to see, etc.
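A rough sketch of that idea: attach an errback to every request, remember the URLs that failed, and dump them when the spider closes (the file name here is just a placeholder):

# sketch: collect URLs whose requests failed (403, DNS error, timeout, ...)
# and write them out at the end so a follow-up run can use them as start_urls
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class BroadSpider(scrapy.Spider):
    name = 'broad'
    start_urls = []  # filled with the couple hundred seed URLs

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        # normal link extraction goes here; pass errback=self.on_error on every
        # follow-up request too, so the deeper levels are also covered
        pass

    def on_error(self, failure):
        if failure.check(HttpError):
            # non-2xx response (e.g. the 403s described above)
            self.failed_urls.append(failure.value.response.url)
        else:
            # DNS errors, timeouts, connection failures, ...
            self.failed_urls.append(failure.request.url)

    def closed(self, reason):
        # 'failed_urls.txt' is a placeholder; feed it back in as start_urls next run
        with open('failed_urls.txt', 'w') as f:
            f.write('\n'.join(self.failed_urls))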
TLDR: How do I make sure I scrape all the pages from a domain? How do I tell I didn't? What is the best way of doing something about it?

Related

Is there a quick way to detect redirections?

I am migrating a website and it has many redirections. I would like to generate a list in which I can see all redirects, target and source.
I tried using Cyotek WebCopy but it seems to be unable to give the data I need. Is there a crawling method to do that? Or probably this can be accessed in Apache logs?
Of course you can do it by crawling the website, but I advise against it in this specific situation, because there is an easier solution.
You use Apache, so you are (probably) working with HTTP/HTTPS. You could use the HTTP Referer header: if you use PHP, you can reach the previous page via $_SERVER['HTTP_REFERER']. So you will need to do the following:
figure out a way to store previous-next page pairs
at the start of each request store such a pair, knowing what the current URL is and what the previous was
maybe you will need to group your URLs and do some aggregation
load the output somewhere and analyze
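If you do end up taking the crawling route after all (for example because you already have a list of the old URLs), a minimal sketch in Python, with 'urls.txt' standing in for wherever your URL list lives:

# sketch: fetch each known URL and record the source and target of every redirect hop
import requests

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    resp = requests.get(url, allow_redirects=True, timeout=10)
    # resp.history holds the intermediate 3xx responses, in order
    for hop in resp.history:
        print(hop.status_code, hop.url, '->', hop.headers.get('Location'))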

Dealing with errors in a rails app generated by spiders

I'm seeing a lot of exceptions in an app (converted from an off-the-shelf ecommerce site a year ago) whenever a spider hits routes that no longer exist. There aren't many of these routes, but they're hit by various spiders, sometimes multiple times a day. I've blocked the worst offenders (garbage spiders, mostly), but obviously I can't block Google and Bing. There are too many URLs to remove manually.
I'm not sure why the app doesn't return a 404 code. I'm guessing one of the routes is catching the URLs and trying to generate a view, but since the resource is missing it returns nil, which is what's throwing the errors. Like this:
undefined method `status' for nil:NilClass
app/controllers/products_controller.rb:28:in `show'
Again, this particular product is gone, so I'm not sure why the app didn't return the 404 page. Instead it tries to generate the view even though the resource doesn't exist, checks that the nil resource has a public status, and the error is thrown.
If I rescue ActiveRecord::RecordNotFound, will that do it? It's kind of hard to test, as I have to wait for the various bots to come through.
I also have trouble with some links that rely on a cookie being set for tracking, and if the cookie's not set, the app sets it before processing the request. That doesn't seem to be working with the spiders, and I've set those links to nofollow links, but that doesn't seem to be honored by all the spiders.
For your first question, about the 404 page: take a look at this post, I'm sure it will help you.

scrapy CrawlSpider: crawl policy / queue questions

I started with Scrapy some days ago and learned about scraping particular sites, i.e. the dmoz.org example; so far it's fine and I like it. As I want to learn about search engine development, I aim to build a crawler (and storage, indexer, etc.) for a large number of websites of any "color" and content.
So far I have also tried depth-first and breadth-first crawling.
At the moment I use just one Rule, with some paths and domains to skip:
Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
     callback='save_page', follow=True),
I have one pipeline, a MySQL storage backend that stores the url, body and headers of the downloaded pages, fed via a PageItem with those fields.
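Roughly, the item and pipeline look like this sketch (using sqlite3 here so the example is self-contained; the real pipeline uses MySQL but has the same shape):

# sketch of the PageItem and storage pipeline described above
import sqlite3
import scrapy

class PageItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()
    headers = scrapy.Field()

class PageStoragePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('pages.db')  # placeholder database
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT, headers TEXT)'
        )

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT OR REPLACE INTO pages (url, body, headers) VALUES (?, ?, ?)',
            (item['url'], item['body'], str(item['headers'])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()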
My questions for now are:
Is it fine to use an Item for simply storing pages?
How does the spider check the database to see whether a page was already crawled (in the last six months, say)? Is that built in somehow?
Is there something like a blacklist for useless domains, i.e. placeholder domains, link farms etc.?
There are many other issues, like storage, but I'll stop here; just one more general search engine question:
Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be shipped on hard disks, since otherwise the data volume would be the same as if I crawled it myself (compression aside).
I will try to answer only two of your questions:
Is it fine to use an Item for simply storing pages?
AFAIK, Scrapy doesn't care what you put into an Item's Field. Only your pipeline will be dealing with them.
How does the spider check the database to see whether a page was already crawled (in the last six months, say)? Is that built in somehow?
Scrapy has a built-in duplicates filter, but it only filters duplicates within the current session. You have to manually prevent Scrapy from re-crawling sites you crawled six months ago.
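One way to do that (just a sketch, not the only approach) is a downloader middleware that drops any request whose URL is already in your table of previously stored pages; enable it via DOWNLOADER_MIDDLEWARES. The database and table names here are placeholders:

# sketch: skip requests for URLs already stored by a previous run
import sqlite3
from scrapy.exceptions import IgnoreRequest

class SkipAlreadyCrawledMiddleware:
    def __init__(self):
        self.conn = sqlite3.connect('pages.db')  # placeholder database

    def process_request(self, request, spider):
        row = self.conn.execute(
            'SELECT 1 FROM pages WHERE url = ?', (request.url,)
        ).fetchone()
        if row is not None:
            raise IgnoreRequest('already crawled: %s' % request.url)
        return None  # let Scrapy keep processing the request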
As for questions 3 and 4, I don't understand them.

How do I prevent GoogleBot from finding acquisition URLs?

I have Apache in front of Zope 2 (multiple virtual hosts) using the standard simple rewrite rule.
I am having big issues with some of the old sites I host and Googlebot.
Say I have:
site.example.com/documents/
site.example.com/images/i.jpg
site.example.com/xml/
site.example.com/flash_banner.swf
How do I stop the following from happening?
site.example.com/documents/images/xml/i.jpg
site.example.com/images/xml/i.jpg
site.example.com/images/i.jpg/xml/documents/flash_banner.swf
All of these respond with the correct object from the last folder at the end of the URI. The old sites were not written very well, and in some cases Google goes in and out of hundreds of permutations of folder structures that don't exist, but always ends up finding large Flash files. So instead of Googlebot hitting the flash file once, it's dragging it off the site thousands of times. I am in the process of moving the old sites to Django, but I need to put a halt to this in Zope. In the past I have tried ipchains and mod_security, but they are not an option this time around.
Find out which page is providing Google with all the variant paths to the same objects. Then fix that page so that it only provides the canonical paths, using the absolute_url(), absolute_url_path(), or virtual_url_path() methods of traversable objects.
You could also use sitemaps.xml or robots.txt to tell Google not to spider the wrong paths but that's definitely a workaround and not a fix as the above would be.

Should a sitemap have *every* url

I have a site with a huge number (well, thousands or tens of thousands) of dynamic URLs, plus a few static URLs.
In theory, due to some cunning SEO linkage on the homepage, it should be possible for any spider to crawl the site and discover all the dynamic urls via a spider-friendly search.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
The actual way in which I would generate this isn't a concern - I'm just questioning the need to actually do it.
Indeed, the Google FAQ (and yes, I know they're not the only search engine!) about this recommends including URLs in the sitemap that might not be discovered by a crawl; based on that fact, then, if every URL in your site is reachable from another, surely the only URL you really need as a baseline in your sitemap for a well-designed site is your homepage?
If there is more than one way to get to a page, you should pick a main URL for each page that contains the actual content, and put those URLs in the site map. I.e. the site map should contain links to the actual content, not every possible URL to get to the same content.
Also consider putting canonical meta tags in the pages with this main URL, so that spiders can recognise a page even if it's reachable through different dynamic URLs.
Spiders only spend a limited time searching each site, so you should make it easy to find the actual content as soon as possible. A site map can be a great help as you can use it to point directly to the actual content so that the spider doesn't have to look for it.
We have had pretty good results using these methods, and Google now indexes 80-90% of our dynamic content. :)
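And if the effort of producing the sitemap itself is the worry: generating one from a list of canonical URLs is only a few lines. A rough sketch (the URL list is whatever your app considers canonical):

# sketch: write a minimal sitemap.xml from a list of canonical URLs
canonical_urls = [
    'https://www.example.com/',
    'https://www.example.com/products/widget',
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url in canonical_urls:
    lines.append('  <url><loc>%s</loc></url>' % url)
lines.append('</urlset>')

with open('sitemap.xml', 'w') as f:
    f.write('\n'.join(lines))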
In an SO podcast they talked about limitations on the number of links you could include/submit in a sitemap (around 500 per page with a page limit based on pagerank?) and how you would need to break them over multiple pages.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
I was under the impression that the sitemap wasn't necessarily about disconnected pages, but rather about increasing the crawling of existing pages. In my experience, when a site includes a sitemap, minor pages, even when prominently linked to, are more likely to appear in Google results. Depending on your site's pagerank/inbound links etc., this may be less of an issue.