How to make the crawler skip a domain once it has extracted what it was searching for? - scrapy

I have a crawler that browses from link to link, searching for a specified item to scrape. Once that item has been found, I want the crawler to skip the rest of that domain (leave the website) and proceed with the other sites in the queue. How can that be handled?
In practice: when the crawler is at www.example.com/products/news/... and finds the item it is looking for, it should leave www.example.com and not return to it again.
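One possible approach, sketched here under assumptions: remember every host where the item has already been found, and drop any queued request that points at such a host. The class `FinishedDomains` and its method names are hypothetical, and the sketch uses only the standard library so that it runs on its own. In Scrapy, the check would typically sit in a downloader middleware's `process_request`, which can raise `scrapy.exceptions.IgnoreRequest` to discard a request.

```python
from urllib.parse import urlparse


class FinishedDomains:
    """Tracks hosts where the wanted item was already found (hypothetical helper)."""

    def __init__(self):
        self._done = set()

    @staticmethod
    def _host(url):
        # Reduce the URL to a lowercase hostname without the port.
        return (urlparse(url).hostname or "").lower()

    def mark_done(self, url):
        """Call this from the spider callback once the item has been scraped."""
        self._done.add(self._host(url))

    def should_skip(self, url):
        """True if the URL belongs to a host that is already finished."""
        return self._host(url) in self._done


finished = FinishedDomains()
# Illustrative URL: the item was found somewhere under www.example.com.
finished.mark_done("http://www.example.com/products/news/item-1")
print(finished.should_skip("http://www.example.com/about"))   # True
print(finished.should_skip("http://www.other.org/products"))  # False
```

Requests for that host which are already being downloaded would still complete; only requests checked afterwards are dropped.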

Related

scrapy strategy with links with names

I am trying to find the best strategy for handling links with named anchors (e.g. href=/mypage#sectionA).
If I don't do anything special, this kind of link can get skipped if I've already visited that page. If I check whether my URL has a hash (#), I can parse the result before yielding a new request, but that works only if the link points to a name on the same page.
How should I manage this kind of link? Disable the duplicate check and potentially parse a page many times?
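One way to avoid both problems, shown as a minimal sketch: split each link into its page URL and its fragment, request every page once, and keep the list of section names that were linked on it. The function name `group_by_page` is an assumption made for illustration; `urldefrag` and `urljoin` come from the standard library.

```python
from urllib.parse import urldefrag, urljoin


def group_by_page(base_url, hrefs):
    """Map each fragment-free page URL to the set of section names linked on it."""
    pages = {}
    for href in hrefs:
        absolute = urljoin(base_url, href)
        page, fragment = urldefrag(absolute)
        sections = pages.setdefault(page, set())
        if fragment:
            sections.add(fragment)
    return pages


links = ["/mypage#sectionA", "/mypage#sectionB", "/other"]
result = group_by_page("http://www.example.com/", links)
for page, sections in result.items():
    print(page, sorted(sections))
```

Each page then needs only one request, which can carry its section names along (in Scrapy, for instance, through the request's `meta` dict) so the callback can extract every named section in a single pass.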

Associated Content & SEO, Sitemaps with External links, using CNAMEs to include External Links as my own in the sitemap

Is there any HTML code, page parameter or meta name that can tell search engines that the content of a page is closely linked to another page on another domain?
I keep the content metatag updated and also the keyword metatag.
I don't want to show these links to my visitors.
1)
I need to know if there is a protocol for communicating related links specifically to crawlers, so as to improve my ranking.
Is there any way via code that I can tell crawlers (crawlers specifically, like how No Follow is addressed to crawlers) that mydomain.com/Porduct.php is closely linked to, say,
http://ebay.com/sameProduct
http://wikipedia.com/GenericProduct or
http://google.com?q=someKeywords
Should I include external links, or CNAME-mapped external links (see question 3), inside the content tag? Would that make a difference?
2)
Can I include these links in my Sitemap? Common sense would suggest that links in my sitemap should be hosted on my domain. Still, I ask because the sitemap takes the full URL, including the domain name.
3)
If a particular well-indexed page has content largely similar to mine, can I map a CNAME of my page to that site and include that in the sitemap? Would that amount to cheating?
First of all, I'm not sure what you want to achieve there. Search engines in general are already pretty good at recognizing what your page is about. If your content is about product A, write a description of product A, have images of product A, let your users comment on or review product A, or add microdata to your page (e.g. http://schema.org/Product). All of these will help search engines recognize that your page is about that product, just like the page on the other site which also has content about the same product.
To answer your questions:
1) I'm not aware of any tag like that which would also be supported by search engines.
2) In your Sitemap you can include only URLs that point to a location on the same hostname the Sitemap is hosted on (there are some exceptions, but those are irrelevant now). See http://www.sitemaps.org/protocol.html for more info about Sitemaps.
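As a rough illustration of the hostname rule in point 2, the following sketch checks whether a URL could be listed in a given sitemap. It is deliberately simplified: it compares hostnames only and ignores the exceptions mentioned above. The function name `allowed_in_sitemap` and the sitemap location are assumptions.

```python
from urllib.parse import urlparse


def allowed_in_sitemap(sitemap_url, page_url):
    """True if page_url is on the same hostname as the sitemap (simplified check)."""
    return urlparse(sitemap_url).hostname == urlparse(page_url).hostname


sitemap = "http://mydomain.com/sitemap.xml"  # assumed location of the sitemap
print(allowed_in_sitemap(sitemap, "http://mydomain.com/Porduct.php"))  # True
print(allowed_in_sitemap(sitemap, "http://ebay.com/sameProduct"))      # False
```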
3) A CNAME resource record specifies that the domain name is an alias of another domain name, and thus it can't be used the way you described.
Lastly, you're trying to do something just for crawlers, which is usually a bad idea. Create an awesome website, something useful for the users, something they would love and would miss if you closed the shop. Just focus on the user and all else will come.

Can Search Engines bots crawl pages requiring login?

If the homepage of a website shows one set of content when a user is not logged in and different content once the user logs in, would a search engine bot be able to crawl the user-specific content?
If they are not able to crawl it, then I can duplicate content from another part of the website to make it easily accessible to users who stated their needs at registration time.
My guess is no, but I would rather make sure before I do something stupid.
You cannot assume that crawlers support cookies, but you can identify the crawler and let it be "logged in" on your site by code. However, this makes it possible for any user to pretend to be a crawler and gain access to the data in the logged-in area.
The bot will be able to see all the content in your document. If the content does not exist in the document, then it will not be seen by the bot. If it exists in the document but is hidden from view, the crawler will be able to pick it up.
Even if this could be done, it is against the terms of most search engines to show the crawler content that differs from what any user gets on entry, and it can cause your site to be banned from the index.
This is why sites like expertsexchange have to provide the answer if you scroll all the way to the bottom, even though they try to make it look like you have to register. (For this reason, this is only possible if you enter expertsexchange with a Google referer, btw.)

Count the number of pages in a site

I'd like to know how many public pages there are in a site, for example smashingmagazine.com. Is there a way to count the number of pages?
You can query Google's index using the site operator, e.g.:
site:domain-to-query.com
This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality, but I don't know the syntax offhand.
Of course not all pages may be indexed, and the index may contain pages which no longer exist.
You need to basically crawl the site. Your process would be something like:
Start at root domain / homepage
Look for all links that point within the same domain
For each of those links, repeat the steps
Your loop terminates when there are no more links to crawl that point within the same domain. Remember to stay within the site; otherwise you'll start crawling external sites.
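The steps above can be sketched roughly as follows. This is a minimal illustration, not a production crawler: the names `LinkCollector` and `count_pages` are made up, the `fetch` callable is an assumption that stands in for a real download function, and the tiny in-memory site exists only so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_pages(start_url, fetch):
    """Breadth-first crawl of one host; fetch(url) returns HTML text or None."""
    host = urlparse(start_url).hostname
    seen = {start_url}            # every URL that was ever queued
    queue = deque([start_url])
    pages = 0
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue              # unreachable page, do not count it
        pages += 1
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop the #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).hostname == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Tiny in-memory "site" used instead of real HTTP requests.
site = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://other.org/">out</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.com/b": "no links here",
}
print(count_pages("http://example.com/", site.get))  # 3
```

For a real site, `site.get` would be replaced by a function that downloads the page, ideally one that respects robots.txt and waits between requests.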
You can also try parsing the sitemap if they provide one.
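A minimal sketch of the sitemap approach, assuming the site publishes a standard XML sitemap: the function name `sitemap_urls` and the sample document are illustrative, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org documents, in ElementTree's {uri}tag notation.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

print(len(sitemap_urls(sample)))  # 2
```

Note that a sitemap index file lists further sitemaps rather than pages, so its entries would need to be fetched and parsed in turn. The count is also only as complete as the sitemap itself.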
If you are using Java, JSpider might prove useful; for PHP, there is Sphider.
You'll need to recursively scan the markup of each page, starting with your top-level page, looking for any kind of link to other pages, and recursively crawl through them. You'll also need to keep track of what has been scanned so as not to get caught in an infinite loop.

How do I convince the Googlebot that two formerly aliased sites are now separate?

This will require a little setup. Trust me that this is for a good cause.
The Background
A friend of mine has run a non-profit public interest website for two years. The site is designed to counteract misinformation about a certain public person. Of course, over the last two years those of us who support what he is doing have relentlessly linked to the site in order to boost it in Google so that it appears very highly when you search for this public person's name. (In fact it is the #2 result, right below the public person's own site). He does not have the support of this public person, but what he is doing is in the public interest and good.
The friend had a stroke recently. Coincidentally, the domain name came up for renewal right when he was in the hospital and his wife missed the email about it. A domain squatter snapped up the domain, and put up content diametrically opposed to his intent. This squatter is now benefitting from his Google placement and page rank.
Fortunately there were other domains he owned which were aliased to point to this domain, i.e. they used a DNS mapping or HTTP 301 redirect (I'm not sure which) to send people to the right site. We reconfigured one of the alias domains to point directly to the original content.
We have publicized this new name for the site and the community has now created thousands of links to the new domain, and is fixing all the old links. We can see from the cache that Google has in fact crawled the original site at the new address, and has re-crawled the imposter site.
The Problem
Even though Google has crawled both sites, you can't get the site to appear in relevant searches under the new URL!
It appears to me that Google remembers the old redirect between the two names (probably because someone linked to the new domain back when it was an alias). It is treating the two sites as if they are the same site in all results. The results for the site name, and using the "link:" operator to find sites that link to this site, are entirely consistent with Google being convinced they are the same site.
Keep in mind that we do not have control of the content of the old domain, and we do not have the cooperation of the person that these sites relate to.
How can we convince the Googlebot that domain "a" and domain "b" are now two different sites and should be treated as such in results?
EDIT: Forward was probably DNS, not HTTP based.
Google will detect the decrease in links to the old domain and that will hurt it.
Include some new interesting content on the new domain. This will encourage Google to crawl this domain.
The 301 redirects will be forgotten, in time. Perhaps several months. Note that they redirected one set of URLs to another set, not from one domain to another. Get some links to some new pages within the site, not just the homepage, as these URLs will not be in the old redirected set.
Set up Google Webmaster Tools and submit an XML sitemap. Thoroughly check everything in Webmaster Tools about once per week.
Good luck.
Time heals all wounds...
Losing control of the domain is a big blow, and it will take time to recover. It sounds like you're following all the correct procedures (getting people to change links, using 301s, etc.)
Has the content of the original site changed since being put up again? If not, you should probably make some changes. If Google re-crawls the page and finds it substantially identical to the one previously indexed, it might consider it a copy and that's why it's using the original URL.
Also, I believe that Google has a resolution process for just such situations. I'm not sure what the form to fill out is or who to contact, but surely some other SO citizens could help.
Good luck!