One of our sites has non-ASCII (non-English) characters in its URLs:
http://example.com/kb/начало-работы/оплата
I wonder how web crawlers (particularly Googlebot) handle these. Do such URLs have to be encoded or otherwise processed?
Percent-encoding (URL encoding) is the best approach; it is the standard way to represent such characters in URLs.
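For reference, here is a minimal Python sketch of what that encoding looks like for the URL above: the path is encoded as UTF-8, and each non-ASCII byte becomes a %XX escape, which is the form user agents actually send on the wire.

```python
from urllib.parse import quote

# Percent-encode the non-ASCII path segments (UTF-8 bytes -> %XX escapes);
# quote() leaves "/" intact by default.
path = "/kb/начало-работы/оплата"
print("http://example.com" + quote(path))
# http://example.com/kb/%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D0%BE-%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B/%D0%BE%D0%BF%D0%BB%D0%B0%D1%82%D0%B0
```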
Is it bad practice to make a page that has a web address like these:
http://example.com/-products-and-services.php
http://example.com/-contact-us.php
http://example.com/--books.php
http://example.com/--translation.php
http://example.com/--illustration.php
http://example.com/-$-special-feature.php
http://example.com/-$-vip-area.php
Will Google or Apache have problems with these characters (- and $)?
I am doing this because it makes it easier for me to view and categorise pages while still letting me add keywords to the file names.
Thanks
You can use alphanumerics and the special characters "$-_.+!*'()," in URLs.
But they may not be very helpful for SEO, search engine indexing, etc.
You can read what Google says here:
https://support.google.com/webmasters/answer/76329?hl=en
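If you want to see how a standard encoder treats that character set, here is a small illustrative Python check of the RFC 1738 specials quoted above; characters that come back unchanged are the ones a strict encoder considers safe to leave literal:

```python
from urllib.parse import quote

# The RFC 1738 "special" characters quoted above:
specials = "$-_.+!*'(),"
for ch in specials:
    # quote() with safe="" escapes everything it does not treat as unreserved.
    print(ch, "->", quote(ch, safe=""))
# Only "-", "_", and "." survive unescaped; the rest are reserved and get
# escaped, which is one reason characters like "$" can be awkward in practice.
```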
Trying to add a sharing function to my site, but Google+ seems to have trouble sharing URLs with spaces in them.
Even escaped, they don't seem to work.
E.g.:
https://plus.google.com/share?url=http://www.google.com/%23test%20test
It only seems to recognize the part of the URL before the %20.
Any ideas? Is this a bug? Am I doing something wrong?
The site is rather AJAX-heavy, and it would be a pain to use non-standard escaping of characters in the history tokens just for Google+.
I don't think this is a bug with Google+; it's likely intentional. Those URLs need to be double URL-encoded, because one URL is being shared inside a second URL. Thus your shared URL should be http%3A%2F%2Fwww.google.com%2F%2523test%2Btest
This won't work to create a preview in the share snippet, but the URL is correct when it is shared.
All said, you shouldn't use spaces in your URLs, because they are considered unsafe (see RFC 1738). You should change your app's URL structure.
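To make the layering concrete, a minimal Python sketch using the question's URLs: the inner URL must be fully percent-encoded before it is placed in the url= parameter, and any % signs already in it are encoded again, which is where sequences like %2523 come from. (The answer above writes the space as %2B, an encoded "+", which is the equivalent form-encoding convention for spaces.)

```python
from urllib.parse import quote

# The app's URL, with the fragment already escaped in the history token:
inner = "http://www.google.com/%23test%20test"

# Encode the whole inner URL so it can sit inside the url= parameter.
# Existing "%" signs become "%25", producing the double-encoded "%2523".
share = "https://plus.google.com/share?url=" + quote(inner, safe="")
print(share)
# https://plus.google.com/share?url=http%3A%2F%2Fwww.google.com%2F%2523test%2520test
```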
Let's say that I have a web app that responds to URLs in the format /entities/{entityKey}. In my access logs, I find people visiting both /entities/KEY1, which is how app URLs are generated, and the lowercase version /entities/key1. Currently key1 throws a 404 Not Found error due to route requirements.
My question is, would you:
Use URL rewriting to rewrite the key to uppercase,
Create 302 redirects from lowercase to uppercase, or
Have the application convert to uppercase and handle requests case-insensitively?
Most users these days expect URLs to be case-insensitive. I would have the app silently handle the conversion in the background. I don't see it being worth the extra request time to issue a redirect.
If SEO is a concern, you can use the rel="canonical" link element to let Google and other search engines know which URL you want to appear in search results.
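As a sketch of the third option, assuming a Flask app (the route, domain, and data-store names here are hypothetical):

```python
from flask import Flask

app = Flask(__name__)

# Hypothetical data store; keys are stored uppercase.
ENTITIES = {"KEY1": "First entity"}

@app.route("/entities/<entity_key>")
def show_entity(entity_key):
    # Normalize silently so /entities/key1 and /entities/KEY1 resolve alike.
    key = entity_key.upper()
    if key not in ENTITIES:
        return "Not found", 404
    # Point search engines at a single canonical variant of the URL.
    canonical = f'<link rel="canonical" href="https://example.com/entities/{key}">'
    return f"<html><head>{canonical}</head><body>{ENTITIES[key]}</body></html>"
```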
When using a robots.txt file, does the user-agent string have to be exactly as it appears in my server logs?
For example when trying to match GoogleBot, can I just use googlebot?
Also, will a partial-match work? For example just using Google?
At least for Googlebot, the user-agent match is case-insensitive. Read the 'Order of precedence for user-agents' section:
https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html
(As already answered in another question)
In the original robots.txt specification (from 1994), it says:
User-agent
[…]
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
[…]
But whether (and which) parsers actually work like that is another question. Your best bet is to look at the documentation of the bots you want to address; you'll typically find the agent identifier string there, e.g.:
Bing:
We want webmasters to know that bingbot will still honor robots.txt directives written for msnbot, so no change is required to your robots.txt file(s).
DuckDuckGo:
DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules […]
Google:
The Google user-agent is (appropriately enough) Googlebot.
Internet Archive:
User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags.
…
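For what it's worth, Python's standard-library robots.txt parser implements exactly this liberal matching: the token before any version information is compared case-insensitively as a substring. A small sketch (the rules and agent names are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The agent name is matched case-insensitively, ignoring "/version" suffixes:
print(parser.can_fetch("Googlebot", "http://example.com/private/p"))      # False
print(parser.can_fetch("Googlebot/2.1", "http://example.com/private/p"))  # False
print(parser.can_fetch("SomeOtherBot", "http://example.com/private/p"))   # True
```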
robots.txt matching is case-sensitive in general. Google is more lenient than other bots and may accept its token in either case; other bots may not.
Also, will a partial-match work? For example just using Google?
In theory, yes. In practice, however, only specific partial matches or "substrings" (as mentioned in #unor's answer) work. These specific substrings appear to be referred to as "tokens", and often an exact match on these tokens is required.
As regards the standard Googlebot, only Googlebot (case-insensitive) appears to match. Any shorter partial match, such as Google, fails, as does any longer one, such as Googlebot/1.2. Using the full user-agent string (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) also fails to match. (There is technically more than one user-agent for Googlebot anyway, so matching on the full user-agent string would not be recommended even if it did work.)
These tests were performed with Google's robots.txt tester.
Reference:
Google Crawlers - Includes User agent "tokens" (to be used in robots.txt)
Google's robots.txt tester
Yes, the user agent has to be an exact match.
From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"
I have a MediaWiki, and I don't think I want Google indexing the history of any page. How can a robots.txt disallow URLs with action=history in the query string?
The HTML for the history view (and several others, such as the logs, etc.) contains a "noindex,nofollow" meta declaration. Compliant user agents, such as Googlebot, will honour this advice and not bother indexing the page.
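If you also want to discourage compliant crawlers from fetching those URLs in the first place, bots that support the "*" wildcard extension (Googlebot does; the original 1994 spec does not) can be given a pattern like the following. This is a sketch assuming MediaWiki's default URL layout:

```
# Matches e.g. /index.php?title=Main_Page&action=history for crawlers
# that understand "*" wildcards; standard robots.txt has no wildcards.
User-agent: *
Disallow: /*action=history
```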