In terms of SEO: what's the best way to encode characters such as ÅÄÖ?
I've used &ouml;, &aring;, etc. in titles and the like.
But in Google Webmaster Tools they end up as:
"S&ouml;k bland inkomna f&ouml;rfr&aring;gningar fr&aring;n Stockholm inom Golvv&aring;rd. Offerta.se"
Doesn't Google recognize these?
Google does recognise HTML entity references in search results; I'm not sure where in Webmaster Tools you're looking to get the HTML-source version you quote, and whether that's actually indicative of any kind of problem (I suspect not).
But these days there's little good reason ever to use an HTML entity reference (other than the XML built-ins &lt;, &amp;, &quot;). Stick with UTF-8 encoding and just type the characters ö, Å et al. directly.
I'd expect if you have a charset declaration in your file (such as <meta http-equiv="Content-type" content="text/html;charset=UTF-8"> in your head section) that the webcrawler would understand the unicode characters for those letters. But really you'd have to test these out and/or ask Google.
I wonder if your inference that Google's webcrawler isn't processing the entities correctly (on the basis of what you're seeing in Webmaster Tools) is actually correct; I'm not saying it isn't, but it's an assumption I'd test. I'd be a bit surprised, really, if Google's webcrawler didn't understand the entities (&Aring; for Å, etc.); most browsers do, even in the title.
But using a proper charset is probably your best bet. Make sure the charset you declare is the one your editor and other tools are actually producing!
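If you want to check that, here is a minimal sketch (Swift, with a placeholder index.html path) that verifies a saved file really decodes as UTF-8 and shows the bytes your editor should be producing for these characters:

    import Foundation

    // Minimal sketch: verify that a saved page really is valid UTF-8.
    // The "index.html" path is just a placeholder.
    let url = URL(fileURLWithPath: "index.html")
    guard let data = try? Data(contentsOf: url) else { fatalError("could not read file") }

    if String(data: data, encoding: .utf8) != nil {
        print("File decodes as valid UTF-8")
    } else {
        print("Not valid UTF-8 - the declared charset and the actual bytes disagree")
    }

    // For reference, "ÅÄÖ" should appear on disk as the bytes C3 85 C3 84 C3 96:
    print("ÅÄÖ".utf8.map { String(format: "%02X", $0) }.joined(separator: " "))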
Google converts everything to Unicode internally, so use UTF-8 everywhere.
For i18n testing, I'm looking for a test string that has a good representation of commonly used languages (as supported by UTF-8) and includes the special characters of those languages that normally have display issues.
We will use this test string to make sure that our system processes these languages correctly and has fonts that can display all of them correctly.
E.g. the sample text should have characters from Latin languages, Far East languages, right-to-left languages...
There is no clear answer to your question, as it is full of ambiguous terms, for instance "commonly used languages" or "normally have issues in display". This is highly dependent on the OS, the OS version, the text engine used to display the text, and the fonts installed; pretty much the whole tech stack.
Sprinkling "all" through the question (all the special chars, all ... languages) makes any answer useless.
You would be looking at a string of tens of thousands of characters. Then you have a lot of combining marks and ligatures; do you want to check all of those combinations too? Those might also have "issues in display".
If all you want to do is check that your application works in (most) languages, try taking some (not all) characters from each Unicode block. You might also want to avoid historical scripts (e.g. cuneiform, Egyptian hieroglyphs) that are not covered by common fonts.
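As a rough sketch of that approach (in Swift; the block ranges below are my own arbitrary picks, not a canonical list), you could sample a few scalars from assorted Unicode blocks, including right-to-left and CJK scripts:

    import Foundation

    // Sketch: build a smoke-test string by sampling a few scalars from
    // assorted Unicode blocks. The ranges are illustrative, not a standard list.
    let blocks: [ClosedRange<UInt32>] = [
        0x00C0...0x00FF,   // Latin-1 Supplement (À, ß, é, ...)
        0x0391...0x03C9,   // Greek
        0x0410...0x044F,   // Cyrillic
        0x05D0...0x05EA,   // Hebrew (right-to-left)
        0x0627...0x064A,   // Arabic (right-to-left, with joining behaviour)
        0x0905...0x0939,   // Devanagari (complex shaping)
        0x3042...0x3093,   // Hiragana
        0x4E00...0x4E2D,   // CJK Unified Ideographs (start of the block)
        0xAC00...0xAC1C,   // Hangul Syllables (start of the block)
        0x1F300...0x1F320, // Emoji / pictographs (outside the BMP)
    ]

    var sample = ""
    for range in blocks {
        let step = max(1, range.count / 5) // keep only a handful per block
        for value in stride(from: range.lowerBound, through: range.upperBound, by: step) {
            if let scalar = Unicode.Scalar(value) {
                sample.unicodeScalars.append(scalar)
            }
        }
        sample.append(" ")
    }
    print(sample)

If part of the printed string comes out corrupted or as boxes, that points at the layer (transport, storage, font, or text engine) that needs attention.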
In general, if your application does not corrupt the string somehow, it will render properly. And if it does not, then it is not your app at fault; it is some limitation in the underlying technology (e.g. the Windows console).
If you explain what you are trying to do, you might get a better answer.
Or you can just search for internationalization testing.
What is the official encoding for Twitter's streaming API? My best guess is UTF-8 based on what I've seen, but I would like to avoid making assumptions.
The only part of the Twitter site I've seen where they even hint at what they use as their official encoding is here:
Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation
https://dev.twitter.com/docs/counting-characters
Does anyone have a more "official" answer? I'm writing a state-machine tokenizer for the streaming API which makes certain assumptions. The last thing I want is to encounter something like UTF-16.
Thanks! :D
One indicator is that the JSON format, which Twitter uses for virtually everything, dictates (or at least defaults to) UTF-8. They should also set an appropriate HTTP header denoting the encoding (but I haven't confirmed this). If you're using XML instead, the XML opening tag explicitly denotes the encoding, which is UTF-8.
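If you want to confirm the header yourself, something along these lines shows what the server declares (a hedged sketch: the endpoint is only illustrative, and real Twitter API calls require OAuth credentials):

    import Foundation

    // Sketch: print the charset the server itself declares.
    // The endpoint is illustrative; real calls need OAuth.
    let url = URL(string: "https://api.twitter.com/1.1/statuses/sample.json")!
    let task = URLSession.shared.dataTask(with: url) { _, response, _ in
        if let http = response as? HTTPURLResponse {
            // Typically something like "application/json; charset=utf-8"
            print(http.value(forHTTPHeaderField: "Content-Type") ?? "no Content-Type header")
            print(http.textEncodingName ?? "no declared charset")
        }
    }
    task.resume()
    RunLoop.main.run(until: Date(timeIntervalSinceNow: 5)) // keep a script alive for the callback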
If they say they use UTF-8, that's a pretty good bet. UTF-8 is very common, and UTF-16 in the wild is pretty rare from what I've seen.
There are also some clever libraries you could use if you were so inclined to prove it to yourself by testing whether they support various characters. The best of these is used by Firefox to detect the encoding of webpages as they're loaded: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
At the moment, the Twitter API v2 does not send its data in UTF-8!
I believe it's UTF-16, because when decoding the data as UTF-8, surrogate pairs remain, and surrogate pairs are only a feature of UTF-16.
For example, I received this string from the API: 🎁Crypto Heroez epic giveaway🎁
However, it didn't arrive that way, but rather as: \ud83c\udf81Crypto Heroez epic giveaway\ud83c\udf81
\ud83c\udf81 is a surrogate pair that translates into a gift emoji 🎁
In UTF-16BE, that wrapped present is encoded with the hex bytes D8 3C DF 81; in UTF-8, the same emoji is encoded with F0 9F 8E 81.
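A small sketch (Swift) makes the relationship concrete: a JSON decoder reassembles the escaped surrogate pair into a single character, and you can compare the UTF-16 and UTF-8 bytes side by side:

    import Foundation

    // Sketch: \ud83c\udf81 is an escaped UTF-16 surrogate pair for 🎁.
    let json = #"{"text": "\ud83c\udf81Crypto Heroez epic giveaway\ud83c\udf81"}"#
    let decoded = try! JSONDecoder().decode([String: String].self, from: Data(json.utf8))
    print(decoded["text"]!) // 🎁Crypto Heroez epic giveaway🎁

    let gift = "🎁"
    print(gift.utf16.map { String(format: "%04X", $0) }) // ["D83C", "DF81"] (UTF-16 surrogate pair)
    print(gift.utf8.map { String(format: "%02X", $0) })  // ["F0", "9F", "8E", "81"] (UTF-8)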
Other developers noticed the same: https://twitterdevfeedback.uservoice.com/forums/930250-twitter-api/suggestions/41152342-utf-8-encoding-of-v2-api-responses
That issue was filed on Aug 15, 2020, but as I write this on September 9, 2021, they haven't communicated anything publicly. (That's why I wanted to post this answer here.)
I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.
Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.
I did try creating Chinese and Japanese locales, while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with core foundation.
[UPDATE] We ended up using MeCab with a specific user dictionary for Japanese for some time, and have now moved over to doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.
If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use is going to have some failings. For Japanese, CFStringTokenizer uses the MeCab library internally and ICU's Boundary Analysis (only when using kCFStringTokenizerUnitWordBoundary, which is why you're getting a funny break with "新しい" without it).
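For illustration, here is a minimal sketch of that setup in Swift (the sample sentence and the "ja" locale identifier are my own choices):

    import CoreFoundation
    import Foundation

    // Sketch: tokenize Japanese text with an explicit locale and
    // kCFStringTokenizerUnitWordBoundary.
    let text = "新しい靴を買いました" as CFString
    let locale = CFLocaleCreate(kCFAllocatorDefault, "ja" as CFString)
    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        text,
        CFRangeMake(0, CFStringGetLength(text)),
        kCFStringTokenizerUnitWordBoundary,
        locale
    )

    // Print each token on its own line.
    while !CFStringTokenizerAdvanceToNextToken(tokenizer).isEmpty {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        if let token = CFStringCreateWithSubstring(kCFAllocatorDefault, text, range) {
            print(token)
        }
    }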
Also have a look at NSLinguisticTagger, though by itself it won't give you much more.
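A quick sketch of what that looks like (again, the sample sentence is just an example):

    import Foundation

    // Sketch: word tokenization via NSLinguisticTagger (macOS 10.13+ / iOS 11+).
    let text = "新しい靴を買いました"
    let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
    tagger.string = text

    let range = NSRange(location: 0, length: text.utf16.count)
    tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType,
                         options: [.omitWhitespace, .omitPunctuation]) { _, tokenRange, _ in
        print((text as NSString).substring(with: tokenRange))
    }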
Truth be told, these two languages (and some others) are really hard to programmatically tokenize accurately.
You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas: the art and science of more accurately determining how to tokenize meaningfully.
What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the context and also identifying idioms and phrases that should not be broken by word. (Not to mention grammatical forms)
After that, look again at the available libraries, then get a book on the Python NLTK to learn what you really need to know about NLP, and to gauge how far you really want to pursue this.
Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar. Much of the context needed to drive the analysis is implicit, not directly written as words. You get to build rules and train the thing.
Japanese is a particularly tough one and many libraries developed outside of Japan don't come close. You need some knowledge of a language to know if the analysis is working. Even native Japanese people can have a hard time doing the natural analysis without the proper context. There are common scenarios where the language presents two mutually intelligible correct word boundaries.
To give an analogy, it's like doing lots of look ahead and look behind in regular expressions.
I am trying to implement routing on my website,
and I just want to know which type of URL gets indexed better by Google.
Examples:
http://www.example.com/contact_us
http://www.example.com/contact-us
I also have one more scenario:
http://www.example.com/products-services
http://www.example.com/products-and-services
If I use "and" in the URL, will it make any difference in ranking?
Dashes are much better.
Matt Cutts from Google confirmed that in his blog - http://www.mattcutts.com/blog/dashes-vs-underscores/
Regarding "and": check Google Keyword tool. If people search with "and" then use and. If not it's better to skip "and" because URL will be shorter and it's good for SEO.
Dashes are best, as they are treated as word separators.
Here is what Matt Cutts has said from time to time:
if you have a url like word1_word2, Google will only return that page if the user searches for word1_word2 (which almost never happens). If you have a url like word1-word2, that page can be returned for the searches word1, word2, and even “word1 word2″. That’s why I would always choose dashes instead of underscores.
I didn’t quite say that in the talk. I said that we had someone looking at that now. So I wouldn’t consider it a completely done deal at this point. But note that I also said if you’d already made your site with underscores, it probably wasn’t worth trying to migrate all your urls over to dashes. If you’re starting fresh, I’d still pick dashes.
In February at SMX West, Matt confirmed that underscores were NOT treated as word separators. According to Matt, this change is still in their queue but unlikely to happen before summer. My interpretation: don't hold your breath; it's between summer and never.
According to most SEO articles, the dash is more SEO-friendly.
You might also want to read this:
SEO difference between dash and hyphen
Both the dash (-) and the underscore (_) can be used in SEO-friendly URLs. A dash is seen as a separation between keywords and can have a good impact on search, while an underscore can be seen by some sniffers as a separation of words within a phrase.
There is much more to it, but this is a short answer.
What if I only use <meta name="description" content="lorem ipsum." />?
I've heard search engines don't give importance to keywords:
<meta name="keywords" content="some, words" />
So is it OK not to use keywords?
I have been looking for evidence of meta keyword support for years and have never found any documentation that they are supported by anyone. Never. Most of the recommendations supporting them are recycled from everyone else.
Some people say that they may be used in the future... well, I'll get to that in a moment. Other people say that keywords can't hurt, so just include them anyway. But they are incorrect.
Meta keywords are great for letting your competitors know your SEO secrets. You wouldn't tell your competitors this information directly, so don't use them. Your competitors are the only people likely to look at your meta keywords.
Since Google set the benchmark for quality software, search engines must perform to very high standards to be successful. It's too easy for consumers to switch to Google, which is trusted and reliable.
Consider this:
To build a quality search engine you must, first of all, acquire high-quality information for indexing. This is the foundation of your product.
You must also protect your search index from being manipulated by third parties for their own benefit. Your users will probably not have the same interests as a third party who can modify your search engine's behaviour.
Meta keywords are not derived from the content of the web page through any process that can be considered reliable. They are not directly related to the page in any way and can be manipulated without consequence. This makes meta keywords a low-quality source of information: what's known to programmers as "tainted data", data that is not to be trusted.
If you build your search engine to index low-quality information, it won't return useful search results. I propose that it would be impossible to build a search engine today that uses meta keywords and works well at all.
It's important to stop using meta keywords and to put the meta keywords myth to rest. They just waste everybody's time and are counterproductive. Remember, it's not good practice to add features to your website that don't work. The time you spend on something that doesn't work could be better spent on something that does. Or maybe go look out the window and admire the sky; you'll be better off.
I've heard search engines don't give importance to keywords.
Google doesn't use the keywords meta tag for web search (Source).
However, Yahoo (Source), Bing (Source), and other search engines may still be using them with various degrees of importance. They may also be used by internal search engines.
So is it OK not to use keywords?
"... I hope this clarifies that the keywords meta tag is not something that you need to worry about, or at least not in Google." - Mutt Cutts (Google doesn’t use the keywords meta tag in web search)
I have heard the same. However, search engine algorithms are not static and may change over time. Furthermore, not all search engines treat the keywords tag equally. I think you should include it if possible.
Google analyzes your page content and gives higher priority to other parts, but I don't know of any reason not to include meta tag keywords.