How to detect amazon sitemap - scrapy

I am trying to scrape some products from amazon.com, but I can't find a sitemap in its robots.txt.
I tried
amazon.com/sitemap.xml
amazon.com/sitemap.xml.gz
amazon.com/sitemap1.xml.gz
amazon.com/sitemap1.xml
All of them turn up nothing.
I also tried a sitemap detector such as
https://seositecheckup.com/tools/sitemap-test
The result shows Amazon doesn't have a sitemap.
Is that true, or am I just not taking the correct approach?

Look at robots.txt: you will see a sitemap link at the bottom, but requesting it returns access denied.
These resources may be accessible only to certain robots (specific user-agent, IP address, etc.).
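If you want to check this from code rather than a browser, here is a minimal plain-Python sketch (outside scrapy) that fetches robots.txt with a browser-like User-Agent (an arbitrary string chosen for illustration) and prints any Sitemap: directives it finds; whether the listed sitemap files themselves are accessible to you is a separate question:
import urllib.request

# Hypothetical User-Agent string, purely for illustration; Amazon may still
# deny access to the sitemap files even if robots.txt lists them.
req = urllib.request.Request(
    "https://www.amazon.com/robots.txt",
    headers={"User-Agent": "Mozilla/5.0 (compatible; sitemap-check)"},
)

with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Print any "Sitemap:" directives found in robots.txt.
for line in body.splitlines():
    if line.lower().startswith("sitemap:"):
        print(line.split(":", 1)[1].strip())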

Related

Any way to both NoIndex and Prevent Crawling?

I created a new website and I do not want it to be crawled by search engines, nor to appear in search results.
I already created a robots.txt
User-agent: *
Disallow: /
I have an HTML page. I wanted to use
<meta name="robots" content="noindex">
but Google's documentation says it should only be used when a page is not blocked by robots.txt, because if crawling is blocked, Googlebot will never see the noindex tag at all.
Is there any way I can use both noindex as well as robots.txt?
There are two solutions, neither of which are elegant.
You are correct that even if you Disallow: / that your URLs might still appear in the search results, just likely without a meta description and a Google generated title.
Assuming you are only doing this temporarily, the recommended approach is to put basic HTTP auth in front of your site. This isn't great since users will have to enter a username and password, but it will prevent your site from getting crawled and indexed.
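For example, assuming the site runs on Apache (the question doesn't say which server is used), a minimal .htaccess sketch for basic auth might look like this, where /path/to/.htpasswd is a placeholder for a password file created with the htpasswd tool:
AuthType Basic
AuthName "Restricted"
AuthUserFile /path/to/.htpasswd
Require valid-user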
If you can't or don't want to put basic auth in front of your site, the alternative is to still Disallow: / in your Robots.txt file, and use Google Search Console to regularly purge the Google index by requesting the site be removed from the index.
This is inelegant in multiple ways.
You'll have to monitor the search results to see if URLs get indexed
You'll have to manually request the removal in the Google Search Console
Google really didn't intend for the removal feature to be used in this fashion, and who knows if they'll start ignoring your requests over time. But I'd imagine it would actually continue to work even though they'd prefer you didn't use it that way.

How to remove unwanted URL from google cache

We bought a new domain from HugeDomains.com a month ago and made it live last week.
Before we went live, the advertisement page published by HugeDomains.com got cached by search engines.
Now we need to remove that cached URL from all search engines.
Following is the pattern of the URL that got cached; it's just a query string being passed:
http://www.example.com/?fp=ah1QKL6n%2FlECnlCZX2M7prGsvtbv8ddXendjKdEvTBtzHaEkYE%2BEk37MD1iDIPnimmKBVn7jZKj%2BPGqRUxNQzA%3D%3D&prvtof=ytNnOdijWVo6UL0CLJYkUNs043cNT%2BNtJQ5d5VD69Ac%3D&poru=RLg1S8TlJRc59ObVEdjqkbBOZjhk%2FIf%2BH8W1DtjVOk5VRbieT62uHl%2FGfuWk4d%2FnOfDQwYDvqLza3nG76SMxZA%3D%3D&
I have used Disallow in robots.txt to remove it, but it's not working. Following is the code:
Disallow: /*?fp=
Disallow: /?fp=ah1QKL6n%2FlECnlCZX2M7prGsvtbv8ddXendjKdEvTBtzHaEkYE%2BEk37MD1iDIPnimmKBVn7jZKj%2BPGqRUxNQzA%3D%3D&prvtof=ytNnOdijWVo6UL0CLJYkUNs043cNT%2BNtJQ5d5VD69Ac%3D&poru=RLg1S8TlJRc59ObVEdjqkbBOZjhk%2FIf%2BH8W1DtjVOk5VRbieT62uHl%2FGfuWk4d%2FnOfDQwYDvqLza3nG76SMxZA%3D%3D&
I even enabled a 302 redirect for this query string (fp=) to my home page.
Please let me know a way to resolve this.
Thanks in advance.
I wouldn't do this with robots.txt.
Just wait. I think most search engines will recognize that your website is new, so they will crawl it again in the near future.
Otherwise you can create a Google Webmasters account and submit your URL to Google so it gets crawled again.
EDIT: You're also able to tell Google to ignore URL parameters in Webmaster Tools.
A robots.txt Disallow should do it, but another good way is to return a 410 Gone response; Google will then stop indexing the URL since it will see that the page has disappeared.
Edit
Looks like I was wrong about robots.txt, but right about the 410 Gone response:
Reference
You have to do a 301 permanent redirect for Google to drop the old indexed page. If you do a 302, Google will keep trying to crawl that URL once in a while because the redirect is temporary. Ignoring query parameters does not help in clearing the cache; it just sends a signal saying the URL with the query parameter is the same as the one without it, which I guess is not what you want. My suggestion would be to do a 301 permanent redirect whenever you encounter the query parameter fp.
Right now I doubt Google handles 404 and 410 much differently, so you can do a 410 as well.
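As a sketch, assuming Apache with mod_rewrite in an .htaccess file (the question doesn't state the server), a 301 from any root URL carrying the fp= parameter to the home page could look like this; swapping the last line for RewriteRule ^$ - [G,L] would return 410 Gone instead:
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)fp= [NC]
RewriteRule ^$ /? [R=301,L]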
Google Webmaster Tools can help you remove outdated/cached content from Google search results:
Copy your domain's cached URL
Browse to https://www.google.com/webmasters/tools/removals
Follow the removal request instructions.
The cache can be removed within a few hours, and Google will then crawl the new/current URL contents.

Why are some pages in my start directory not showing in my site map?

If I generate a site map at http://www.xml-sitemaps.com/ for aromapersona.com, I get 9 pages; however, there are a bunch more pages that should show up. For example, aromapersona.com/candle_holder is in the same "front" directory as the other 9 pages, but it doesn't show up in the sitemap. Is this because no other pages on my site link to it? I'm trying to get these other URLs indexed, and I even edited the sitemap to include this URL as well as others and submitted it to Google via Webmaster Tools, and still nothing. Advice?
I'm not familiar with aromapersona.com, but the generator will only be able to list pages that are linked to from the initial page you give it (or ones those pages link to), unless you provide the site with FTP access (which I presume you don't).
If you include the URLs in your sitemap for Google, it should eventually list them, but linking to them from other parts of your site is probably the most effective approach.
I have not checked the website, but do also make sure the cause is not noindex, nofollow, robots.txt, JavaScript links, mixing http/https, etc.
In clear wording: there is no link pointing to the subpage "candle_holder", hence the XML sitemap generator (which works by following links on your site) cannot detect it.
You can add it manually to the XML, but then again, it should be accessible from the site directly.
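For reference, a manual entry in sitemap.xml is just an extra <url> element; a minimal sketch (the real file would keep the 9 URLs the generator already found) would be:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://aromapersona.com/candle_holder</loc>
  </url>
</urlset>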

How to check if googlebot will index a given url?

We're doing a whitelabel site, which mustn't be indexed by Google.
Does anyone know a tool to check whether Googlebot will index a given URL?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, however the problem with robots.txt is as follows:
Our main site should be indexed, and it's the same IIS (ASP.Net) application as the whitelabel site; the only difference is the URL.
I cannot modify the robots.txt depending on the incoming URL, but I can add a meta tag to all pages from my code-behind.
You should add a Robots.txt to your site.
However, the only perfect way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore Robots.txt)
EDIT: You need to add a handler for Robots.txt to serve different files depending on the Host header.
You'll need to configure IIS to send the Robots.txt request through ASP.Net; the exact instructions depend on the IIS version.
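The exact IIS/ASP.Net wiring isn't shown here, but the idea of the handler is easy to sketch in Python (a WSGI toy for illustration, not the actual ASP.Net code): inspect the Host header and serve a blocking robots.txt only for the whitelabel hostname. The hostname below is a placeholder.
from wsgiref.simple_server import make_server

# Main site gets a permissive robots.txt; the whitelabel host gets a blocking one.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

def robots_app(environ, start_response):
    # Pick the robots.txt body based on the Host header of the request.
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    body = BLOCK_ALL if host == "whitelabel.example.com" else ALLOW_ALL
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body.encode("utf-8")]

if __name__ == "__main__":
    make_server("", 8000, robots_app).serve_forever()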
Google Webmaster Tools (google.com/webmasters/tools) will, besides letting you upload a sitemap, do a test crawl of your site and tell you what was crawled, how it rates for certain queries, and what they will and won't crawl.
The test crawl isn't automatically included in Google results. In any case, if you're trying to hide sensitive data from the prying eyes of Google, you cannot count on that alone: put some authentication in front of it, no matter what.

Google found my backup web site. What can I do about it?

A few days ago we replaced our web site with an updated version. The original site's content was migrated to http://backup.example.com. Search engines do not know about the old site, and I do not want them to know.
While we were in the process of updating our site, Google crawled the old version.
Now when using Google to search for our web site, we get results for both the new and old sites (e.g., http://www.example.com and http://backup.example.com).
Here are my questions:
Can I update the backup site content with the new content? Then we can get rid of all the old content. My concern is that Google will lower our page ranking due to duplicate content.
If I prevent the old site from being accessed, how long will it take for the information to clear out of Google's search results?
Can I use a robots.txt Disallow to block Google from the old web site?
You should probably put a robots.txt file in your backup site and tell robots not to crawl it at all. Google will obey the restrictions though not all crawlers will. You might want to check out the options available to you at Google's WebMaster Central. Ask Google and see if they will remove the errant links for you from their data.
You can always use a robots.txt on the backup.* site to disallow Google from indexing it.
Are the URL formats consistent enough between the backup and current site that you could redirect a given page on the backup site to its equivalent on the current one? If so, you could have the backup site send 301 Permanent Redirects to each of the equivalent pages on the site you actually want indexed. The redirecting pages should drop out of the index (after how much time, I do not know).
If not, definitely look into robots.txt as Zepplock mentioned. After setting the robots.txt you can expedite removal from Google's index with their Webmaster Tools
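Assuming the backup host runs Apache with mod_rewrite (not stated in the question), a sketch of such a blanket 301 redirect from the backup host to the same path on the live host would be:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^backup\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]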
You can also add a rule in your scripts to redirect each page to the new one with a 301 header.
Robots.txt is a good suggestion but...Google doesn't always listen. Yea, that's right, they don't always listen.
So, disallow all spiders, but also put this in your header:
<meta name="robots" content="noindex, nofollow, noarchive" />
It's better to be safe than sorry. Meta commands are like yelling at Google "I DONT WANT YOU TO DO THIS TO THIS PAGE". :)
Do both, save yourself some pain. :)
I suggest you either add a noindex meta tag to all the old pages or just disallow them via robots.txt; the simplest approach is to block them with robots.txt. One more thing: add a sitemap to the new site and submit it in Webmaster Tools, which will improve your new website's indexing.
Password-protect the web pages or directories that you don't want web spiders to crawl or index by putting password-protection rules in the .htaccess file (the one in your website's root directory on the server, or create a new one and upload it).
The web spiders will never know that password and hence won't be able to index the protected directories or web pages.
You can block any particular URLs in Webmaster Tools; check it out. You can also block them using robots.txt. Remove the sitemap for your old backup site and put a noindex, nofollow tag on all of your old backup pages. I handled this situation for one of my clients too.