I'm scraping a particular URL set: e.g.
example.com/job/1
example.com/job/3
example.com/job/4
example.com/job/31
example.com/job/50
The problem is, I don't know which ones have been removed, and if I decide to crawl from 1 to 10000 I will get a lot of redirects to a "page not found" page, e.g.
example.com/job-not-found.html
I used a while loop to define the starting URLs, but now I want Scrapy to exclude from the parse method all URLs that get redirected to the 404 page.
Currently I get a lot of unnecessary h1 tags belonging to the 404 page, because those responses are still parsed.
Scrapy ignores 404 responses by default, which means you have enabled them somehow. Check for the following attributes in your settings, in your spider code, or possibly passed through the request meta parameters:
handle_httpstatus_list
handle_httpstatus_all
HTTPERROR_ALLOWED_CODES
HTTPERROR_ALLOW_ALL
If any of those is set to True or to a list containing the 404 status, that is what is letting the 404 responses through to your callback.
If that isn't your case, you have probably disabled the HttpErrorMiddleware middleware.
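If the redirect target itself returns a 200 rather than a real 404 (which would explain why its h1 tags still get parsed), the settings above won't matter and you have to filter in the callback instead. A minimal sketch of that, assuming the not-found page really lives at example.com/job-not-found.html:

import scrapy

class JobSpider(scrapy.Spider):
    name = "jobs"
    # Generate the candidate URLs up front instead of using a while loop
    start_urls = [f"http://example.com/job/{i}" for i in range(1, 10001)]

    def parse(self, response):
        # RedirectMiddleware stores the original URL(s) in response.meta["redirect_urls"],
        # so a request that ended up on the not-found page can be skipped here.
        if "job-not-found" in response.url:
            self.logger.debug("Skipping removed job: %s",
                              response.meta.get("redirect_urls", [response.url])[0])
            return
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }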
I have a pretty generic spider that I do broad crawls with. I feed it a couple hundred start URLs, limit the allowed_domains and let it go wild (I'm following the suggested 'Avoiding getting banned' measures like auto-throttle, no cookies, rotating user agents, rotating proxies, etc.).
Everything had been going smoothly until about a week ago, when the batch of start URLs included a pretty big, well-known domain. At that time, fortunately, I was monitoring the scrape and noticed that the big domain just "got skipped". When looking into why, it seemed that the domain recognized I was using a public proxy and 403ed my initial request to 'https://www.exampledomain.com/', so the spider didn't find any URLs to follow and hence no URLs were scraped for that domain.
I then tried using a different set of proxies and/or a VPN, and that time I was able to scrape some of the pages, but I got banned shortly after.
The problem with that is that I need to scrape every single page down to 3 levels deep. I cannot afford to miss a single one. Also, as you can imagine, missing a request at the default (start) level or the first level can potentially lead to missing thousands of URLs, or to no URLs being scraped at all.
When a page fails on the initial request it is pretty straightforward to tell something went wrong. However, when you scrape thousands of URLs from multiple domains in one go, it's hard to tell if any got missed. And even if I did notice there are 403s and I got banned, the only thing to do at that point seems to be to cross my fingers and run the whole domain again, since I can't tell whether the URLs I missed due to 403s (and all the URLs I would have got from deeper levels) were scraped via some other page that linked to the 403ed URL.
The only thing that comes to mind is to SOMEHOW collect the failed URLs, save them to a file at the end of the scrape, make them the start_urls and run the scrape again. But that would re-scrape all of the other pages that were successfully scraped previously. Preventing that would require somehow passing a list of successfully scraped URLs and setting them as denied. But that isn't a be-all-end-all solution either, since there are pages you will get 403ed on despite not being banned, like resources you need to be logged in to see, etc.
TLDR: How do I make sure I scrape all the pages from a domain? How do I tell I didn't? What is the best way of doing something about it?
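For what it's worth, a minimal sketch of the "collect the failed URLs and feed them back in" idea from the question, assuming Scrapy's errback hook and the HttpError failures raised for non-2xx responses; the output file name and the link extraction are placeholders:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class BroadSpider(scrapy.Spider):
    name = "broad"
    custom_settings = {"DEPTH_LIMIT": 3}  # stop three levels deep
    start_urls = ["https://www.exampledomain.com/"]  # placeholder

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        # 403s and other non-2xx responses surface here as HttpError
        # (as long as HttpErrorMiddleware is enabled); other failures
        # (DNS errors, timeouts) still carry the original request.
        if failure.check(HttpError):
            self.failed_urls.append(failure.value.response.url)
        else:
            self.failed_urls.append(failure.request.url)

    def closed(self, reason):
        # Dump the failures so a follow-up run can use them as its start_urls.
        with open("failed_urls.txt", "w") as f:
            f.write("\n".join(self.failed_urls))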
I have a website which I started recently, and I have submitted my sitemap in Google Webmaster Tools. My site got indexed within a short time, but whenever I search for my website on Google, I see two or three versions of the same pages, each with different URL arguments.
Suppose my site name is example.com; when I search for example.com on Google I get results like the following:
www.example.com/?page=2
www.example.com/something/?page=3
www.example.com
As far as I know, result 1 and result 3 are the same, so why are they being shown separately? I don't have any such URL in my sitemap, nor in any of my HTML pages, so why is this happening? I am a little confused and want to get rid of it.
Also, result no. 2 should be displayed simply as www.example.com/something
and not like www.example.com/something?page=3
There is actually a setting in Google Webmaster Tools which helps with removing URLs with parameters. To access & configure the setting, navigate to Webmaster Tools --> Crawl --> URL Parameters and set the parameters according to your needs.
I also found the following article useful for understanding the concept behind those parameters and how to stop pages with unnecessary parameters from being crawled:
http://www.shoutmeloud.com/google-webmaster-tool-added-url-parameter-option-seo.html
I'm seeing a lot of exceptions in an app (which was converted from an off-the-shelf ecommerce site a year ago) when a spider hits routes that no longer exist. There aren't many of these, but they're hit by various spiders, sometimes multiple times a day. I've blocked the worst offenders (garbage spiders, mostly), but I obviously can't block Google and Bing. There are too many URLs to remove manually.
I'm not sure why the app doesn't return a 404 code. I'm guessing one of the routes is catching the URLs and trying to generate a view, but since the resource is missing it returns nil, which is what's throwing the errors. Like this:
undefined method `status' for nil:NilClass
app/controllers/products_controller.rb:28:in `show'
Again, this particular product is gone, so I'm not sure why the app didn't return the 404 page. Instead it tries to generate the view even though the resource doesn't exist: it checks that the nil resource has a public status, and the error is thrown.
If I rescue from ActiveRecord::RecordNotFound, will that do it? It's kind of hard to test, as I have to wait for the various bots to come through.
I also have trouble with some links that rely on a cookie being set for tracking; if the cookie isn't set, the app sets it before processing the request. That doesn't seem to be working with the spiders. I've set those links to nofollow, but that doesn't seem to be honored by all the spiders.
For your first question, about the 404 page: take a look at this post; I'm sure it will help you.
I'm using pretty URLs in my web app; one example is 'forum/post/1', which invokes PostController in the Forum module, which loads the post with id=1. This is what I need, but that post is also accessible from 'forum/post/view/id/1'. That's bad, because search crawlers don't like it when the same page is accessible from several URLs, right?
I'm using the Yii framework, which supports a 'useStrictParsing' option requiring that an incoming request match at least one "pretty" route; otherwise the request fails with a 404. However, it's not a perfect solution, because I don't have pretty URLs for every controller/action.
Ideally, the framework should redirect 'forum/post/view/id/1' to 'forum/post/1' with a 301 status code. How did you solve this problem? It's not a Yii/PHP-specific question; how does your framework/tool deal with it?
The best way to make sure search engines rank only one page (the pretty URL) over another, when there are multiple ways to view the same content, is to use a canonical tag within the head of your document:
<link rel="canonical" href="http://www.mydomain.com/nice-url/" />
This is very useful with Windows-based systems, as IIS is not case sensitive with its web pages, but the web standard is case sensitive.
So
www.mydomain.com/Newpage.aspx
www.mydomain.com/newpage.aspx
www.mydomain.com/NEWPAGE.aspx
These are all seen by Google as different pages, and you are then marked down for having a site with duplicate content. Not so with a canonical: each page in the case above would have the same canonical tag, and the URL it points to is the only one which will be used by the search engines.
Provided that no one links to your non-pretty URLs, the search engines will never know that they exist.
If you do want to eliminate them, you could bypass your web framework by adding an alias in your web server's configuration file; the URL will be redirected before it ever reaches the framework.
Frameworks like Django, which don't provide 'magic' routing, don't face this issue: the only routes that exist are those you define manually. In its case, you could define a view (or URL pattern) for the non-pretty URL that returns the appropriate redirect.
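For illustration, a minimal Django-flavoured sketch of that last suggestion; the view, app module, and route names below are hypothetical:

# urls.py -- hypothetical names, adjust to your project
from django.urls import path
from django.views.generic import RedirectView

from forum import views  # assumed app providing a post_detail view

urlpatterns = [
    # The canonical, pretty route
    path("forum/post/<int:pk>/", views.post_detail, name="post-detail"),
    # Legacy route: issue a permanent (301) redirect to the pretty one
    path("forum/post/view/id/<int:pk>/",
         RedirectView.as_view(pattern_name="post-detail", permanent=True)),
]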
A couple of months ago, we revamped our web site. We adopted a totally new site structure and, specifically, merged several pages into one. Everything looks charming.
However, there are lots of dead links which produce a large number of 404 errors.
So what can I do about it? If I leave it alone, could it bite back someday, say by eating up my PageRank?
One basic option is using 301 redirects; however, that seems almost impossible considering the number of them.
So is there any workaround? Thanks for your consideration!
301 is an excellent idea.
Consider that you can take advantage of global configuration to map a whole group of pages; you don't necessarily need to write one redirect for every 404.
For example, if you removed the http://example.org/foo folder, using Apache you can write the following configuration
RedirectMatch 301 ^/foo/(.*)$ http://example.org/
to catch all the 404s generated by the removed folder.
Also, consider redirecting selectively. You can use Google Webmaster Tools to check which 404 URIs are receiving the highest number of inbound links and create redirect rules only for those.
Chances are the number of redirection rules you need to create will decrease drastically.
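For example, a small Python sketch of that selective approach, assuming you've exported the top 404 paths from Webmaster Tools as a one-column CSV and keep your own mapping of old paths to new targets; the file names and the mapping are placeholders:

import csv

# Hypothetical mapping of removed paths to their replacements
REDIRECT_MAP = {
    "/foo/old-page.html": "/new-section/",
    "/bar/legacy.html": "/",
}

with open("top_404s.csv", newline="") as f, open("redirects.conf", "w") as out:
    for row in csv.reader(f):
        old_path = row[0].strip()
        target = REDIRECT_MAP.get(old_path)
        if target:
            # One Apache rule per high-traffic 404; cover the rest with a catch-all instead
            out.write(f"Redirect 301 {old_path} {target}\n")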
301 is definitely the correct route to go down to preserve your page rank.
Alternatively, you could catch 404 errors and redirect either to a "This content has moved" type page, or your home page. If you do this I would still recommend cherry picking busy pages and important content and setting up 301s for these - then you can preserve PR on your most important content, and deal gracefully with the rest of the dead links...
I agree with the other posts - using mod_rewrite you can remap URLs and return 301s. Note - it's possible to call an external program or database with mod_rewrite - so there's a lot you can do there.
If your new and old sites don't follow any remappable pattern, then I suggest you make your 404 page as useful as possible. Google has a widget which will suggest the page the user is probably looking for. This works well once Google has spidered your new site.
Along with the other 301 suggestions, you could also split the requested URL string into a search string, route it to your default search page (if you have one), and pass those terms automatically to the search.
For example, if someone tries to visit http://example.com/2009/01/new-years-was-a-blast, this would route to your search page and automatically search for "new years was a blast", returning the best result for those keywords and, hopefully, your most relevant article.
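For illustration, a minimal Python sketch of that idea as a custom 404 handler (Django-style; the /search/ endpoint and its q parameter are assumptions):

# views.py -- register with handler404 = "myapp.views.search_404" in the root urls.py
import re
from urllib.parse import urlencode

from django.http import HttpResponseRedirect

def search_404(request, exception=None):
    # Keep only the last path segment, e.g. "new-years-was-a-blast"
    slug = request.path.rstrip("/").split("/")[-1]
    # Turn hyphens/underscores into spaces to form the search terms
    terms = re.sub(r"[-_]+", " ", slug).strip()
    if terms:
        return HttpResponseRedirect("/search/?" + urlencode({"q": terms}))
    return HttpResponseRedirect("/")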