How to limit scrapy to a particular section of a website, e.g. http://www.domain.com/section/

I have a scrapy project which crawls all the internal links of a given website. This works fine; however, we have found a few situations where we want to limit the crawling to a particular section of a website.
For example, imagine a bank has a special section for investor information, e.g. http://www.bank.com/investors/
In that case, only URLs under http://www.bank.com/investors/ would be crawled, e.g. http://www.bank.com/investors/something/, http://www.bank.com/investors/hello.html, http://www.bank.com/investors/something/something/index.php
I know I could write some hacky code in parse_url which scans the URL and skips the page if it doesn't meet my requirements (i.e. it's not under /investors/), but that seems horrible.
Is there a nice way to do this?
Thank you.

I figured this out.
You need to pass an allow pattern to the LinkExtractor for the URLs you want to allow.
For example (note the trailing comma, so allow receives a tuple):
Rule(LinkExtractor(allow=(self.this_folder_only,)), callback="parse_url", follow=True)
Everything else will be denied.
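For reference, here is a minimal sketch of how that rule fits into a full CrawlSpider. The spider name, domain, and the /investors/ pattern are illustrative placeholders based on the bank example above, not the asker's actual code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class InvestorsSpider(CrawlSpider):
    name = "investors"  # hypothetical spider name
    allowed_domains = ["bank.com"]
    start_urls = ["http://www.bank.com/investors/"]

    # Only follow links whose URLs match the /investors/ section;
    # links to everything else on the domain are not extracted.
    rules = (
        Rule(LinkExtractor(allow=(r"/investors/",)), callback="parse_url", follow=True),
    )

    def parse_url(self, response):
        # Placeholder callback: extract whatever you need from each page.
        yield {"url": response.url}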

Related

Geotargeting a blog folder with GWT

I'm about to launch a blog on a multilingual website.
The website uses geotargeting: site.com/fr/ for France, /be/ for Belgium, /ch/ for Switzerland, ...
I was wondering if the blog should live at the root level: site.com/blog/
In that case, how could the blog be geotargeted?
Thanks a lot
You should have different URLs for each region/language. For example:
example.com/fr/blog or
example.com/be/blog
Or, even:
example.com/blog/fr or
example.com/blog/be
That depends on you. The main thing is to use separate URLs for different languages/regions.
After you do all this, you should add hreflang attributes. That way you tell Google which version of a URL should be displayed when someone searches in a certain language or from a certain region.
If you use hreflang, you don't have to set geotargeting in WMT. If you still want to do that, you should add the separate folders to WMT as different websites.
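As a sketch, assuming the site.com/fr/ and site.com/be/ structure from the question (the exact language-region codes are assumptions here and depend on which language each section is written in), the hreflang annotations in each page's head might look like:

<link rel="alternate" hreflang="fr-fr" href="http://site.com/fr/blog/" />
<link rel="alternate" hreflang="fr-be" href="http://site.com/be/blog/" />
<link rel="alternate" hreflang="fr-ch" href="http://site.com/ch/blog/" />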

How to properly use a CDN?

Good evening everyone! Thank you for opening this post.
I recently bought the ProCDN from MediaTemple (basically EdgeCast) and have set up a CDN, so that when I go to cdn-small.DOMAIN.com (or cdn-large.DOMAIN.com) it loads the normal website just fine.
However, I'm not sure which one to use. Would I use one of them for the whole site, or add the links one by one for each script/stylesheet based on file size? (e.g. all JS/CSS would use the cdn-small link, while anything larger, such as 300 KB files, would use the cdn-large link)
And if the correct way is to load the whole site through one host (e.g. everything linked relatively, like js/jquery.js, instead of with a full link like https://cdn-small.domain.com/js/jquery.js), would I set a redirect from DOMAIN.com to cdn-small.DOMAIN.com for the best loading, so users only need to type in the domain and not the full CDN subdomain?
Apologies if this isn't making sense; I'm trying to do my best. To put it in simpler terms: I'm trying to find the best way to use my cdn-small/cdn-large for my website, with the user entering the domain (https:// or http://) normally, so my content is served as fast as possible, as near to the user as possible.
I appreciate your time reading this and wish you all a positive weekend.
Here is my live site if it matters or you want to experiment: http://bit.ly/1eGCShX

How to direct multiple clean URL paths to a single page?

(Hi! This is my first time asking a question on Stack Overflow after years of finding answers here... Thanks!)
I have a dynamic page, and I'd like to have fixed URLs that point to different states of that page. So, for example: "www.mypage.co"(/index.php) is the base page, and it rearranges its content based on user choices. I'd then like to be able to point to "www.mypage.co/contentA" or "www.mypage.co/contentB" in order to automatically load the base page at "www.mypage.co" with the desired content.
At heart the problem is an aesthetic one. I know I could simply write www.mypage.co/index.php?state=contentA to reach the desired end, but I want to keep the URL simple and readable (i.e., clean). Also, due to limitations in my hosting relationship, I would most appreciate a solution that is server-independent (across LAM[PHP] stacks, at least), if possible.
Also, if I just have incorrect assumptions about how to implement clean URLs, I'd appreciate direction to a good, comprehensive explanation. I can't seem to find one...
You could use an .htaccess file to redirect all requests to one location and then, from there, determine what you want to return to the client. Look over the htaccess/dispatch system that Tonic uses.
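As a minimal sketch of that approach, assuming Apache with mod_rewrite and a PHP front controller (the state parameter name is taken from the question):

# .htaccess: send every request that isn't a real file to index.php
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ index.php?state=$1 [L,QSA]

# index.php then decides what to render from the clean path:
$state = isset($_GET['state']) ? $_GET['state'] : 'default';

With that in place, a request for www.mypage.co/contentA reaches index.php with state=contentA, and no per-page rules are needed.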
If you use Apache, you can use mod_rewrite. I have a rule like this where multiple RESTful URLs all go to the same page, using a regex and moving parts of the old URL into parameters for the new URL:
RewriteRule ^/testapp/(name|number|rn|sid|unii|inchikey|formula)(/(startswith))?/?(.*) /testapp/ProxyServlet?objectHandle=Search&actionHandle=drillIn&searchtype=$1&searchterm=$4&startswith=$3 [NC,PT]
That particular regex accepts URLs like
testapp/name
testapp/name/zuchini
testapp/name/startswith/zuchini
and forwards them to the same page.
I also use UrlRewriteFilter for Tomcat, but since you mentioned PHP, that doesn't seem like it would be useful to you.

Count the number of pages in a site

I'd like to know how many public pages there are in a site, say for example, smashingmagazine.com. Is there a way to count the number of pages?
You can query Google's index using the site: operator, e.g.:
site:domain-to-query.com
This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality but I don't know the syntax off hand.
Of course not all pages may be indexed, and the index may contain pages which no longer exist.
You basically need to crawl the site. Your process would be something like:
Start at root domain / homepage
Look for all links that point within the same domain
For each of those links, repeat the steps
Your loop terminates when there are no more uncrawled links pointing within the same domain. Remember to stay within the site; otherwise you'll start crawling external sites.
You can also try parsing the sitemap if they provide one.
One tool that might prove useful is JSpider if you're using Java, or Sphider in PHP; a rough sketch of the same idea follows below.
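A minimal sketch of those steps in Python, using only the standard library (the start URL is a placeholder; a real crawler would also want politeness delays, robots.txt checks, and better error handling):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href values of all anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def count_pages(start_url):
    domain = urlparse(start_url).netloc
    seen = {start_url}       # every URL we've queued, to avoid loops
    queue = [start_url]
    while queue:
        url = queue.pop()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue         # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]
            # stay on the same domain and don't revisit pages
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return len(seen)

print(count_pages("http://www.example.com/"))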
You'll need to recursively scan the markup of each page, starting with your top-level page, looking for any kind of links to other pages, and recursively crawl through them. You'll also need to keep track of what has already been scanned so as not to get caught in an infinite loop.

Is there a way to prevent Googlebot from indexing certain parts of a page?

Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details (phone numbers etc.) that they want visible on the site, but would rather were not Google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for can't really be done: Google either takes the entire page or none of it.
You could use some sneaky tricks, though, like inserting the part of the page you don't want indexed in an iframe and using robots.txt to ask Google not to index that iframe.
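A sketch of that trick, with a hypothetical /noindex/ directory holding the excluded snippet:

<!-- main page: the iframe body lives under a disallowed path -->
<iframe src="/noindex/ticker.html"></iframe>

# robots.txt
User-agent: *
Disallow: /noindex/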
In short NO - unless you use cloaking, which is discouraged by Google.
Please check out the official documentation here:
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
content placed here will be skipped
<!--googleon: index-->
I found a useful resource for handling certain duplicate content and preventing search engines from indexing it:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
On your server, detect the search bot by IP using PHP or ASP. Then serve the IP addresses that fall into that list a version of the page containing only what you wish to be indexed. In that search-engine-friendly version of your page, use the canonical link tag to tell the search engine which page version should be treated as canonical.
This way the full page remains reachable by its address, while only the content you wish to be indexed actually gets indexed. This method will not get you blocked by the search engines and is completely safe.
Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt and listing the portions you don't want indexed, such as wp-admin, or a particular post or page. Before creating it, check your site's existing robots.txt, for example www.yoursite.com/robots.txt.
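For example, a minimal robots.txt along those lines (the paths are illustrative):

User-agent: *
Disallow: /wp-admin/
Disallow: /some-private-page/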
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) if it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta tags for bots, and there's also robots.txt, with which you can restrict access to certain directories.
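For example, a page-level robots meta tag (note that this still applies to the whole page, not to part of it):

<meta name="robots" content="noindex, nofollow">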