The robots.txt file on Yahoo Finance says:
User-agent: *
Sitemap: https://finance.yahoo.com/sitemap_en-us_desktop_index.xml
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_index_US_en-US.xml.gz
Disallow: /r/
Disallow: /__rapidworker-1.2.js
Disallow: /__blank
Disallow: /_td_api
Disallow: /_remote
Does Yahoo Finance ban web scraping or not?
What is disallowed by the Yahoo Finance website?
What can we infer from Yahoo's robots.txt file?
Nothing in the robots.txt file expressly prevents you from scraping Yahoo Finance; however, Yahoo Finance is governed by Yahoo's Terms of Service.
The most pertinent part of that document basically says that you should not do anything that would interfere with their services. Realistically, this means that if you are planning on scraping Yahoo Finance for data, you should do so responsibly (not many thousands of requests, as that will quickly get you banned).
That said, web scraping is generally inefficient (you are reloading an entire HTML page just to collect data programmatically). I would look into using an API instead (like those discussed here), as this will be a) more reliable, b) faster, and c) definitely legal.
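If you do go the scraping route, a minimal throttled fetch loop might look like the sketch below (the quote URL, ticker list, and delay are illustrative assumptions, not anything Yahoo documents):

import time
import requests

# Hypothetical list of symbols to look up (assumption for illustration)
TICKERS = ["AAPL", "MSFT", "GOOG"]

for symbol in TICKERS:
    # Hypothetical quote page URL; substitute whatever page you actually need
    url = f"https://finance.yahoo.com/quote/{symbol}"
    response = requests.get(url, headers={"User-Agent": "my-research-script/0.1"})
    if response.ok:
        print(symbol, len(response.text), "bytes fetched")
    # Pause between requests so you are not hammering their servers
    time.sleep(5)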
They don't disallow it, but my scraper pulls hundreds of companies every 30 seconds, and ever since I started, their website has kept changing formats. Also, I noticed something new: they will in fact block your router's IP for a little while by replacing some of the values with N/A and misinforming your program. So they don't state that they disallow it, but they definitely don't like you doing it. All I'm saying is be sneaky.
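Based on that behaviour, one defensive sketch (the "N/A" marker and the field names here are assumptions drawn from the description above, not anything Yahoo documents) is to detect when a response looks like a temporary block and back off:

import time

def looks_blocked(fields):
    # If most scraped values came back as "N/A", assume we have been
    # temporarily blocked and should back off instead of storing bad data.
    na_count = sum(1 for value in fields.values() if value == "N/A")
    return na_count > len(fields) / 2

# Hypothetical scraped record:
fields = {"price": "N/A", "volume": "N/A", "pe_ratio": "12.3"}
if looks_blocked(fields):
    time.sleep(600)  # wait ten minutes before trying again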
Is there a way to create a pattern-based rule in the robots.txt file so that search engines will index these pages?
New York 100
New York 101
New York 102
...
Atlanta 100
Atlanta 101
Atlanta 102
...
Our website has millions of records that we'd like search engines to index.
The indexing should be based on data-driven results, following a simple pattern: City + Lot Number.
The page that loads shows the city lot and related info.
Unfortunately, there are too many records to simply list them in the robots.txt file (it would be over 21 MB), and Google has a 500 KB limit on robots.txt files.
The default permissions from robots.txt are that bots are allowed to crawl (and index) everything unless you exclude it. You shouldn't need any rules at all. You could have no robots.txt file or it could be as simple as this one that allows all crawling (disallows nothing):
User-agent: *
Disallow:
Robots.txt rules are all "Starts with" rules. So if you did want to disallow a specific city, you could do it like this:
User-agent: *
Disallow: /atlanta
Which would disallow all the following URLs:
/atlanta-100
/atlanta-101
/atlanta-102
But allow crawling for all other cities, including New York.
As an aside, it is a big ask for search engines to index millions of pages from a site. Search engines will only do so if the content is high quality (lots of text, unique, well written), your site has plenty of reputation (links from lots of other sites), and your site has good information architecture (several usable navigation links to and from each page). Your next question is likely to be "Why aren't search engines indexing my content?"
You probably want to create XML sitemaps with all of your URLs. Unlike robots.txt, you can list each of your URLs in a sitemap to tell search engines about them. A sitemap's power is limited, however. Just listing a URL in the sitemap is almost never enough to get it to rank well, or even to get it indexed at all. At best, sitemaps can get search engine bots to crawl your whole site, give you extra information in webmaster tools, and tell search engines about your preferred URLs. See The Sitemap Paradox for more information.
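Note that a single sitemap file is limited to 50,000 URLs, so with millions of records you would split the URLs across many sitemap files and list them in a sitemap index. A rough sketch of generating those files follows (the base URL and the lots() generator are placeholders for however you actually read your city + lot number records):

BASE = "https://example.com"
CHUNK = 50000  # maximum URLs allowed per sitemap file

def lots():
    # Placeholder generator for your city + lot number records
    for city in ("new-york", "atlanta"):
        for lot in range(100, 103):
            yield f"{BASE}/{city}-{lot}"

def write_sitemaps():
    urls = list(lots())
    sitemap_names = []
    # Write one sitemap file per chunk of 50,000 URLs
    for i in range(0, len(urls), CHUNK):
        name = f"sitemap-{i // CHUNK + 1}.xml"
        sitemap_names.append(name)
        with open(name, "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in urls[i:i + CHUNK]:
                f.write(f"  <url><loc>{url}</loc></url>\n")
            f.write("</urlset>\n")
    # Write the sitemap index that references every sitemap file
    with open("sitemap-index.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in sitemap_names:
            f.write(f"  <sitemap><loc>{BASE}/{name}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")

write_sitemaps()

You would then submit sitemap-index.xml in webmaster tools or reference it from robots.txt with a Sitemap: line, as in the Yahoo example above.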
I am working on a small PHP script and I have some links like this:
*-phones-*.html
The * parts are variables. I want to disallow Google from indexing these kinds of links using robots.txt. Is it possible?
You're not disallowing anything. robots.txt is just a set of guidelines for web crawlers, which can choose to follow them or not.
Rude crawlers should of course be IP-banned. But you can't prevent a web crawler from coming across that page. Anyway, you can add it to your robots.txt, and Google's web crawler might obey.
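Google's crawler does understand * wildcards in robots.txt (crawlers that only implement the original specification may ignore them), so a rule for that URL pattern might look like this:

User-agent: *
Disallow: /*-phones-*.html

If you need a stronger guarantee that the pages stay out of the index, a noindex robots meta tag on the pages themselves is the more reliable option, as discussed further below.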
I have a robots.txt file set up as such
User-agent: *
Disallow: /*
For a site that is all unique-URL based, sort of like https://jsfiddle.net/: when you save a new fiddle, it gives it a unique URL. I want all of my unique URLs to be invisible to Google. No indexing.
Google has indexed all of my unique URLs, even though it says "A description for this result is not available because of the site's robots.txt file. - learn more"
But that still sucks because all the URLs are there and clickable, so all the data inside is available. What can I do to 1) get these off Google and 2) stop Google from indexing these URLs?
Robots.txt tells search engines not to crawl the page, but it does not stop them from indexing the page, especially if there are links to the page from other sites. If your main goal is to guarantee that these pages never wind up in search results, you should use robots meta tags instead. A robots meta tag with 'noindex' means "Do not index this page at all". Blocking the page in robots.txt means "Do not request this page from the server."
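For example, each of the unique pages would include this in its <head>:

<meta name="robots" content="noindex">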
After you have added the robots meta tags, you will need to change your robots.txt file to no longer disallow the pages. Otherwise, the robots.txt file would prevent the crawler from loading the pages, which would prevent it from seeing the meta tags. In your case, you can just change the robots.txt file to:
User-agent: *
Disallow:
(or just remove the robots.txt file entirely)
If robots meta tags are not an option for some reason, you can also use the X-Robots-Tag header to accomplish the same thing.
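The X-Robots-Tag is sent as an HTTP response header rather than in the page markup, which also makes it usable for non-HTML resources; for example:

X-Robots-Tag: noindex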
Can you disallow all and then allow specific sites only? I am aware one approach is to disallow specific sites and allow all. Is it valid to do the reverse? E.g.:
User-agent: *
Disallow: /
Allow: /siteOne/
Allow: /siteTwo/
Allow: /siteThree/
Simply disallowing all and then allowing specific sites seems much more secure than allowing all and then having to think about all the places you don't want them to crawl.
Could the method above be responsible for the site's description saying 'A description for this result is not available because of this site's robots.txt – learn more.' in the organic results on Google's search page?
UPDATE - I have gone into Google Webmaster Tools > Crawl > robots.txt tester. At first, when I entered siteTwo/default.asp, it said Blocked and highlighted the 'Disallow: /' line. After leaving and revisiting the tool, it now says Allowed. Very weird. So if this says Allowed, I wonder why it gave the message above in the description for the site?
UPDATE2 - The example robots.txt file above should have said dirOne, dirTwo and not siteOne, siteTwo. Two great links for learning all about robots.txt are unor's robots.txt specification in the accepted answer below and the robots exclusion standard, which is also a must-read. This is all explained in those two pages. In summary, yes, you can disallow and then allow, BUT always place the disallow last.
(Note: You don’t disallow/allow crawling of "sites" in the robots.txt, but URLs. The value of Disallow/Allow is always the beginning of a URL path.)
The robots.txt specification does not define Allow.
Consumers following this specification would simply ignore any Allow fields. Some consumers, like Google, extend the spec and understand Allow.
For those consumers that don’t know Allow: Everything is disallowed.
For those consumers that know Allow: Yes, your robots.txt should work for them. Everything's disallowed, except those URLs matched by the Allow fields.
Assuming that your robots.txt is hosted at http://example.org/robots.txt, Google would be allowed to crawl the following URLs:
http://example.org/siteOne/
http://example.org/siteOne/foo
http://example.org/siteOne/foo/
http://example.org/siteOne/foo.html
Google would not be allowed to crawl the following URLs:
http://example.org/siteone/ (it’s case-sensitive)
http://example.org/siteOne (missing the trailing slash)
http://example.org/foo/siteOne/ (not matching the beginning of the path)
I have a “double” question about the number of pages crawled by Google, its possible relation to duplicate content (or not), and the impact on SEO.
Facts on my number of pages and pages crawled by Google
I launched a new website two months ago. Today, it has close to 150 pages (it's increasing every day). This is the number of pages in my sitemap anyway.
If I look at "Crawl stats" in Google Webmaster Tools, I can see that the number of pages crawled by Google every day is much bigger.
I'm not sure this is good, actually, because not only does it make my server a bit busier (5.6 MB of downloads for 903 pages in a day), but I'm afraid it creates some duplicate content as well.
I have checked on Google (site:mysite.com) and it gives me 1290 pages, but only 191 are shown unless I click on "repeat the search with the omitted results included". Let's suppose the 191 are the ones in my sitemap (I think I have a duplicate content problem with around 40 pages, but I have just updated the website to fix that).
Facts on my robots.txt
I use a robots.txt file to disallow all crawlers from visiting pages with parameters (see the robots.txt below) and also “tags”.
User-Agent: *
Disallow: /administrator
Disallow: *?s
Disallow: *?r
Disallow: *?c
Disallow: *?viewmode
Disallow: */tags/*
Disallow: *?page=1
Disallow: */user/*
The most important one is tags. They appear in my URLs as follows:
www.mysite.com/tags/Advertising/writing
It is blocked by the robots.txt (I've checked with Google Webmaster Tools), but it is still present in Google search (though you need to click on "repeat the search with the omitted results included").
I don't want those pages to be crawled, as they are duplicate content (each is a kind of search on a keyword); that's why I put them in robots.txt.
Finally, my questions are:
Why is Google crawling the pages that I blocked in robots.txt?
Why is Google indexing pages that I have blocked? Are those pages considered by Google as duplicate content? If yes I guess it’s bad for SEO.
EDIT: I'm NOT asking how to remove the pages indexed in Google (I know the answer already).
Why is Google crawling the pages that I blocked in robots.txt? Why is Google indexing pages that I have blocked?
They may have crawled it before you blocked it. You have to wait until they read your updated robots.txt file and then update their index accordingly. There is no set timetable for this but it is typically longer for newer websites.
Are those pages considered as duplicate content?
You tell us. Duplicate content is when two or more pages have identical or nearly identical content. Is that happening on your site?
Blocking duplicate content is not the way to solve that problem. You should be using canonical URLs. Blocking pages means you're linking to "black holes" in your website, which hurts your SEO efforts. Canonical URLs prevent this and give the canonical URL full credit for its related terms and for all the links to the duplicated pages as well.
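For example, each tag page could include a link element in its <head> pointing at the page you want to receive the credit (the target URL here is only illustrative):

<link rel="canonical" href="https://www.mysite.com/Advertising/writing">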