Has Google changed its crawlers in a way that could lead to this 404 growth?

Since yesterday I've been seeing a growing number of 404 errors on our website. It is very strange, because the pages reported as missing have never existed, and we didn't release any code changes that day.
Google Webmaster Tools is reporting these errors, but when I look at the pages that supposedly link to the missing URLs, there are no such links. Could this be a Google crawler issue?
404 URL:
http://www.justanswer.co.uk/boat/home-improvement/homework/writing
Linked from:
http://www.justanswer.co.uk/boat/home-improvement/homework
http://www.justanswer.co.uk/boat/home-improvement/hvac

It seems that you have CORS issues with cross-domain JavaScript.
https://www.facebook.com/connect/ping?client_id=172525162793917&domain=www.justanswer.co.uk&origin=1&redirect_uri=http%3A%2F%2Fstaticxx.facebook.com%2Fconnect%2Fxd_arbiter.php%3Fversion%3D42%23cb%3Df316e5bca883b5%26domain%3Dwww.justanswer.co.uk%26origin%3Dhttp%253A%252F%252Fwww.justanswer.co.uk%252Ff50e0366c05c14%26relation%3Dparent&response_type=token%2Csigned_request%2Ccode&sdk=joey
is reporting:
Given URL is not allowed by the Application configuration: One or more of the given URLs is not allowed by the App's settings. It must match the Website URL or Canvas URL, or the domain must be a subdomain of one of the App's domains.

Related

Google 404 soft error on index page that is working fine

A friend of mine has been having trouble getting her site indexed by Google and asked me to have a look, but that is not something I really know much about, so I was hoping for some assistance.
Looking at her Search Console, Google's crawl report shows a soft 404 error on the index page. I have marked this as fixed a few times, because the site looks fine to me, but it keeps coming back.
If I fetch the site as Google it seems to work fine, although it shows the mobile version instead of the desktop one.
It also keeps reporting a recurring 404 for the page http://www.smeyan.com/new-page, which doesn't exist anywhere I can see, including the server files and the sitemaps.
Here is what I know about this site:
It used to be a Wix site and was moved to a HostGator shared server 2-3 months ago.
It's using JavaScript/jQuery .load to pull page content from outside the index.html template.
It has two sitemaps, one for the URLs and one for both URLs and images:
http://www.smeyan.com/sitemap_url.xml http://www.smeyan.com/sitemap.xml
It has been about two months since it was submitted for indexing and Google has not indexed any of the content; when you search for site:www.smeyan.com it shows some old pages from the Wix server, although Search Console says it has 172 images indexed.
It has www. set as the preferred domain in Search Console.
Has anyone experienced this and has a direction for a fix?
How long a lifetime is set in the Cache-Control header for this site? If it is long, you should use Google's Removals tool for obsolete snippets and cached copies. I simulated a Google visit to your webpage: the 404 return code is correct and the headers are correct. So report the "not found" pages via Google Removals, request a Googlebot visit, and then keep calm and wait for a reaction.
BTW: for permanently removed content, return 410 Gone for Google, or report it via Removals.
https://support.google.com/webmasters/answer/1663419?hl=en
The only download error that I saw while using Chrome's Inspect function pertains to a SCRIPT tag with a Facebook URL as the source (src) file.
This is the error as reported by Inspect.
This is the SCRIPT tag that caused the error.
I am not sure this is the cause of the recurring 404 error, but it is an issue that needs attention on this website.
I checked your site with Tor Browser, which has scripts disabled. You should also provide your content via a <noscript> tag: it doesn't have to be beautiful, but it should be visible to bots - <a href="..."></a>, <img>, etc., and plain text. Without that, the site is not optimized for search bots. Read about SEO. Content listed in a sitemap may never be indexed if it is never linked.
Your webpage probably also doesn't meet the requirements of screen readers (for blind users).
Note: the image with the "SMEYAN" caption is visible on the webpage and is indexed.
The second image in the page source, <img class="gallery-full-image" src="./galleries/home_gallery/smeyan_home-1.jpg" />, is also indexed.
The menu also doesn't work without scripts.
I thought that part was implemented well.
Please use a <noscript> element and implement a version for blind users and no-script browsers (no scripts required, alt attributes on images). You can test it by disabling scripts or with the NoScript extension for Firefox.
BTW, you should build with HTML and CSS (including animations) and use JavaScript only where it is needed - or use the <noscript> fallback approach.
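For example, a minimal sketch of the kind of <noscript> fallback meant here (the headings and link targets are placeholders, not taken from the actual site):

    <noscript>
      <!-- plain, crawlable fallback shown when scripts are disabled -->
      <h1>Smeyan</h1>
      <p>A short description of the site, so bots and screen readers see real text.</p>
      <ul>
        <li><a href="/about.html">About</a></li>
        <li><a href="/gallery.html">Gallery</a></li>
        <li><a href="/contact.html">Contact</a></li>
      </ul>
      <img src="./galleries/home_gallery/smeyan_home-1.jpg" alt="Smeyan home gallery image" />
    </noscript>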
Googlebot currently uses a web rendering service (WRS) that is based on the old Chrome 41 (M41), so it may fail where modern browsers succeed.
To learn how Googlebot works, read this.
Add this code to the page to see the real error.
You can also see the error using the live URL Inspection tool in Google Search Console; it will show up under the "more info" tab.
Note: if the bot gets a 301, or if the page has too little significant content, it will report a soft 404 error and won't render a preview or show any other error.
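The snippet referred to above is not reproduced here; a minimal sketch of the same idea - writing runtime JavaScript errors into the page itself so a rendering bot's snapshot shows them - might look like this (the element id is an assumption):

    <script>
    // Collect runtime JS errors into a visible element, so a "Fetch and Render"
    // snapshot shows them even though you cannot open the bot's console.
    // Place this near the top of <body>.
    window.addEventListener('error', function (e) {
      var log = document.getElementById('js-error-log') ||
                document.body.appendChild(document.createElement('pre'));
      log.id = 'js-error-log';
      log.textContent += e.message + ' @ ' + e.filename + ':' + e.lineno + '\n';
    });
    </script>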

React Router + AWS Backend, how to SEO

I am using React and React Router in my single page web application. Since I'm doing client side rendering, I'd like to serve all of my static files (HTML, CSS, JS) with a CDN. I'm using Amazon S3 to host the files and Amazon CloudFront as the CDN.
When the user requests /css/styles.css, the file exists so S3 serves it.
When the user requests /foo/bar, this is a dynamic URL, so S3 redirects it to a hashbang version: /#!/foo/bar. This serves index.html, and on the client side I remove the hashbang so my URLs are pretty.
This all works great for 100% of my users.
All static files are served through a CDN
A dynamic URL will be routed to /#!/{...} which serves index.html (my single page application) - the routing rule behind this is sketched after this list
My client side removes the hashbang so the URLs are pretty again
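For reference, a sketch of the kind of S3 static-website routing rule that produces this hashbang redirect (the host name is a placeholder; this just mirrors the setup described above):

    <RoutingRules>
      <RoutingRule>
        <Condition>
          <!-- any key S3 cannot find, i.e. every "dynamic" URL -->
          <HttpErrorCodeReturnedEquals>404</HttpErrorCodeReturnedEquals>
        </Condition>
        <Redirect>
          <HostName>www.example.com</HostName>
          <ReplaceKeyPrefixWith>#!/</ReplaceKeyPrefixWith>
        </Redirect>
      </RoutingRule>
    </RoutingRules>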
The problem
The problem is that Google won't crawl my website. Here's why:
Google requests /
They see a bunch of links, e.g. to /foo/bar
Google requests /foo/bar
They get redirected to /#!/foo/bar (302 Found)
They remove the hashbang and request /
Why is the hashbang being removed? My app works great for 100% of my users so why do I need to redesign it in such a way just to get Google to crawl it properly? It's 2016, just follow the hashbang...
</rant>
Am I doing something wrong? Is there a better way to get S3 to serve index.html when it doesn't recognize the path?
Setting up a node server to handle these paths isn't the correct solution because that defeats the entire purpose of having a CDN.
In this thread Michael Jackson, a top contributor to React Router, says "Thankfully hashbang is no longer in widespread use." How would you change my setup to not use the hashbang?
You can also check out this trick: set up a CloudFront distribution and then alter the 404 behaviour in the "Error Pages" section of your distribution. That way you can have domain.com/foo/bar links again :)
I know this is a few months old, but for anyone who comes across the same problem: you can simply specify "index.html" as the error document in S3. The error document property can be found under bucket Properties => Static Website Hosting => Enable website hosting.
Please keep in mind that taking this approach means you will be responsible for handling HTTP errors like 404 (along with other HTTP errors) in your own application.
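A sketch of both variants mentioned above, assuming a placeholder bucket name (the CloudFront part is shown as a CloudFormation fragment for the distribution config):

    # S3 static website hosting: serve index.html for unknown paths
    aws s3 website s3://my-spa-bucket/ --index-document index.html --error-document index.html

    # CloudFront "Error Pages" equivalent
    CustomErrorResponses:
      - ErrorCode: 404
        ResponseCode: 200
        ResponsePagePath: /index.html

Note that the plain S3 error-document variant still serves index.html with a 404 status; the CloudFront error-page variant can rewrite that to a 200, which is what matters for crawlers.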
The hashbang is not recommended when you want to make an SEO-friendly website; even if it gets indexed in Google, the page will show only a little thin content.
The best way to build your website is with the current approach, "progressive enhancement" - search for it on Google and you will find many articles about it.
Mainly, you should have a separate link for each page; when the user clicks a link they are taken to that page with whatever effect you want, even in a single-page website.
That way Google has a unique link for each page and the user still gets the fancy effect and a great UX.
EX:
Contact Us
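A minimal sketch of what such a per-page link can look like with React Router (react-router-dom v4-style API assumed; the route path and component are made up for illustration):

    import React from 'react';
    import { BrowserRouter, Route, Link } from 'react-router-dom';

    const ContactUs = () => <h1>Contact Us</h1>;

    // Each page gets its own real, crawlable URL instead of /#!/contact-us.
    const App = () => (
      <BrowserRouter>
        <div>
          <Link to="/contact-us">Contact Us</Link>
          <Route path="/contact-us" component={ContactUs} />
        </div>
      </BrowserRouter>
    );

    export default App;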

How to remove unwanted URL from google cache

We bought a new domain from HugeDomains.com a month ago and made it live last week.
Before we went live, the advertisement page published by HugeDomains.com got cached by search engines.
Now we need to remove that cached URL from all search engines.
The following is the pattern of the URL that got cached; it's just a query string being passed:
http://www.example.com/?fp=ah1QKL6n%2FlECnlCZX2M7prGsvtbv8ddXendjKdEvTBtzHaEkYE%2BEk37MD1iDIPnimmKBVn7jZKj%2BPGqRUxNQzA%3D%3D&prvtof=ytNnOdijWVo6UL0CLJYkUNs043cNT%2BNtJQ5d5VD69Ac%3D&poru=RLg1S8TlJRc59ObVEdjqkbBOZjhk%2FIf%2BH8W1DtjVOk5VRbieT62uHl%2FGfuWk4d%2FnOfDQwYDvqLza3nG76SMxZA%3D%3D&
I have used Disallow in robots.txt to remove it, but it's not working. The following is the code:
Disallow: /*?fp=
Disallow: /?fp=ah1QKL6n%2FlECnlCZX2M7prGsvtbv8ddXendjKdEvTBtzHaEkYE%2BEk37MD1iDIPnimmKBVn7jZKj%2BPGqRUxNQzA%3D%3D&prvtof=ytNnOdijWVo6UL0CLJYkUNs043cNT%2BNtJQ5d5VD69Ac%3D&poru=RLg1S8TlJRc59ObVEdjqkbBOZjhk%2FIf%2BH8W1DtjVOk5VRbieT62uHl%2FGfuWk4d%2FnOfDQwYDvqLza3nG76SMxZA%3D%3D&
I even enabled a 302 redirect from this query string (fp=) to my home page.
Please let me know a way to resolve this.
Thanks in advance.
I wouldn't do this with robots.txt.
Just wait. I think most search engines will recognize that your website is new, so they will crawl it again in the near future.
Otherwise you can create a Google Webmaster Tools account and submit your URL to Google to be crawled again.
EDIT: You're also able to exclude URL parameters in Webmaster Tools.
A robots.txt Disallow should do it, but another good way is to return a 410 Gone response; Google will then stop indexing it, since it will see that the page has disappeared.
Edit
Looks like I was wrong about robots.txt, but right about the 410 Gone response:
Reference
You have to do a 301 permanent redirect for Google to drop the old indexed page. If you do a 302, Google will keep trying to crawl that URL once in a while, because it is temporary. Ignoring query parameters does not help in clearing the cache; it just sends a signal saying the URL with the query parameter is the same as the one without it, which I guess is not what you want. My suggestion would be to do a 301 permanent redirect whenever you encounter the query parameter fp.
Right now I doubt Google handles 404 and 410 very differently, so you could do a 410 as well.
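As a sketch, both options in Apache mod_rewrite terms, assuming Apache with an .htaccess at the document root (adapt to whatever server you actually run):

    RewriteEngine On
    # match any request whose query string carries the cached fp= parameter
    RewriteCond %{QUERY_STRING} (^|&)fp= [NC]
    # option A: permanent redirect to the home page, dropping the query string
    RewriteRule ^$ /? [R=301,L]
    # option B (use instead of A): tell crawlers the page is gone for good (410)
    # RewriteRule ^$ - [G,L]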
Google Webmaster Tools can help you remove outdated/cached content from Google search results:
Copy the cached URL of your domain.
Browse to https://www.google.com/webmasters/tools/removals
Follow the removal request instructions.
The cache can be removed within a few hours; the Google search engine then crawls the new/current URL content.

App page tab getting 404 on good URL

I created an app (id:155124624522900 - https://developers.facebook.com/apps/155124624522900)
It is supposed to work as a page server for Business Pages (as a tab).
But I get this error when trying to access it from my debug app:
App Temporarily Unavailable
The URL http://foodtreedevfb.herokuapp.com/tab returned a 404 Not Found error.
Still, the URL is good! I wonder what is happening.
I am serving from Play Framework on Heroku.
EDIT
It may be related to a trailing-slash issue in the URI, but I changed it and it didn't help.
Is your app redirecting when Facebook sends the POST request? This can happen if the URLs in Facebook's settings are set up without the trailing '/'.
Also, check that your index file allows POST requests - I'm not sure whether Heroku has any such restrictions, but many other servers won't allow POST requests to some URLs.
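If it helps, a sketch of what the Play routes file would need so the tab URL accepts Facebook's POST (Play 2-style routes; the controller and action names are assumptions):

    # conf/routes - Facebook loads a page tab by POSTing a signed_request to the tab URL
    GET     /tab    controllers.Application.tab()
    POST    /tab    controllers.Application.tab()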

How to verify that the generated sitemap URLs return a 200 code?

I have generated the sitemap indexes for Google. The only issue I have is how to verify whether all of the generated URLs work or not. The guide says something like this:
you write a script to test each URL in the sitemap against your application server and confirm that each link returns an HTTP 200 (OK) code. Broken links may indicate a mismatch between the URL formatting configuration of the Sitemap Generator
I just wanted to see if somebody has experience with writing such a script.
Google Webmaster Tools will report, under "Site configuration -> Sitemaps", any HTTP errors and redirects (pretty much everything that is not an HTTP 200). Additionally, "Diagnostics -> Crawl Errors -> In Sitemaps" gives another view of errors that occurred while crawling URLs listed in the sitemaps.
If that is not what you want, I would just do some log-file grepping (grep for "googlebot" and an identifier of the URLs that you listed in your sitemaps).
You could probably write your own crawler to pre-check whether your pages return an HTTP 200, but the fact that a URL returns an HTTP 200 for you now does not mean it will return an HTTP 200 for Googlebot next week / month / year. So I recommend sticking with Google Webmaster Tools and log-file analysis (visualized with e.g. Munin, Cacti, ...).
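That said, if you do want to pre-check the URLs yourself, here is a minimal sketch of such a script (Node 18+ assumed for the built-in fetch; the sitemap URL is a placeholder):

    const SITEMAP_URL = 'http://www.example.com/sitemap.xml';

    // dependency-free: pull every <loc>...</loc> value out of a sitemap or sitemap index
    async function urlsFrom(xmlUrl) {
      const xml = await (await fetch(xmlUrl)).text();
      return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);
    }

    (async () => {
      let urls = await urlsFrom(SITEMAP_URL);
      // a sitemap index lists further sitemaps; expand those too
      if (urls.length && urls.every(u => u.endsWith('.xml'))) {
        urls = (await Promise.all(urls.map(urlsFrom))).flat();
      }
      for (const url of urls) {
        // HEAD keeps it light; switch to GET if your server doesn't answer HEAD
        const res = await fetch(url, { method: 'HEAD', redirect: 'manual' });
        if (res.status !== 200) console.log(`${res.status} ${url}`);
      }
      console.log(`checked ${urls.length} URLs`);
    })();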
How did you create the sitemap? I would think most sitemap tools would only include URLs that responded with "200 OK".
Do note that some websites misbehave and always respond with a 200 instead of e.g. a 404 for invalid URLs. Such websites have trouble ahead :)