How to determine if 2 pages are of the same topic but in different languages? - wikipedia-api

I'm importing wiki pages based on their pageid (or url).
Let's say I've imported a page and stored, among other things, its language (e.g. EN).
If I then import a translated version of that EN page, receiving only a new pageid (or URL), how can I link the two so I can store them in my database under the same topic id (but with two different languages)?
Using the MediaWiki API of course.

You can check whether they're connected via langlinks. For example, the langlinks result for the Albert Einstein page:
By title: https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=langlinks
By pageid: https://en.wikipedia.org/w/api.php?action=query&pageids=736&prop=langlinks
But bear in mind that pages in different languages aren't translations: the content on the same subject can differ, and the other editions are not direct translations of en.wikipedia.
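Here is a minimal sketch of that check in Python, assuming the requests library; the helper name and the lookup at the end are mine, but the endpoint and parameters are the standard MediaWiki API ones:

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def get_langlinks(pageid):
        """Return {lang: title} for every language link on a page."""
        params = {
            "action": "query",
            "pageids": pageid,
            "prop": "langlinks",
            "lllimit": "max",   # up to 500 links per request
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        page = data["query"]["pages"][str(pageid)]
        return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

    # 736 is the pageid of "Albert Einstein" on en.wikipedia
    links = get_langlinks(736)
    print(links.get("de"))  # title of the linked article on de.wikipedia

When you import a new page, fetch its langlinks and check whether any of them resolves to a page you've already stored; if so, file the new page under the same topic id with its own language code.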

Related

JSON LD + Schema.org - only on homepage or on all pages? [duplicate]

Should the "WebSite" and "Organization" types and their properties be applied to all pages of a website or just the homepage?
I have valid JSON-LD code defining the necessary items for Google mobile search results, but I am not sure if it should be included on all pages or just the root/home page.
It would make sense to provide it on any page where it’s relevant.
For example, if this is an organization’s website, each page is about/from the organization, so provide metadata about this organization on each of the pages.
A consumer looking for structured data on a certain page is not necessarily also visiting and checking the homepage, so they might never learn that you are providing relevant metadata.
That does not necessarily mean that you should include the full item (with all properties) on each page. It can be sufficient to provide the full item only on one page (e.g., on the site’s homepage), and link to it (for example with the property author) from each other page.
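A sketch of that pattern (generated here with Python; the domain and names are placeholders): the homepage carries the full Organization item with an @id, and every other page references that @id instead of repeating all the properties.

    import json

    ORG_ID = "https://www.example.com/#organization"

    # Emitted only on the homepage: the full item.
    full_item = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "@id": ORG_ID,
        "name": "Example Corp",
        "url": "https://www.example.com/",
        "logo": "https://www.example.com/logo.png",
    }

    # Emitted on every other page: a reference via the author property.
    article = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": "Some article",
        "author": {"@id": ORG_ID},
    }

    print(json.dumps(full_item, indent=2))
    print(json.dumps(article, indent=2))

Whether a given consumer actually follows the @id reference back to the full item is up to that consumer, so test with the tools you care about.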

Duplicate content and international sites clarification

Something is not clear; here is my case:
I want to serve the same content to US and UK visitors.
Could I safely avoid duplicate content with these URLs:
www.example.us/info.html (hosted on us server)
www.example.co.uk/info.html (hosted on uk server)
from google :
Websites that provide content for different regions and in different languages sometimes create content that is the same or similar but available on different URLs. This is generally not a problem as long as the content is for different users in different countries. While we strongly recommend that you provide unique content for each different group of users, we understand that this might not always be possible. There is generally no need to "hide" the duplicates by disallowing crawling in a robots.txt file or by using a "noindex" robots meta tag. However, if you're providing the same content to the same users on different URLs (for instance, if both example.de/ and example.com/de/ show German language content for users in Germany), you should pick a preferred version and redirect (or use the rel=canonical link element) appropriately. In addition, you should follow the guidelines on rel-alternate-hreflang to make sure that the correct language or regional URL is served to searchers.
It's still not clear to me. What do you think about my case?
Go for hreflang. When implemented properly, you will avoid all duplicate content issues.
if you're providing the same content to the same users on different URLs (for instance, if both example.de/ and example.com/de/ show German language content for users in Germany), you should pick a preferred version and redirect (or use the rel=canonical link element) appropriately. In addition, you should follow the guidelines on rel-alternate-hreflang to make sure that the correct language or regional URL is served to searchers
That covers your scenario:
Choose one as your preferred URL for the US and redirect to it (or use the rel=canonical link element), and
Follow hreflang guidelines: https://support.google.com/webmasters/answer/189077?hl=en
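A sketch of what both pages would carry, using the hostnames from the question; each version must list every alternate, including itself:

    # Build the hreflang link elements both pages should include.
    ALTERNATES = {
        "en-US": "http://www.example.us/info.html",
        "en-GB": "http://www.example.co.uk/info.html",
    }

    def hreflang_tags():
        return "\n".join(
            f'<link rel="alternate" hreflang="{lang}" href="{url}" />'
            for lang, url in ALTERNATES.items()
        )

    print(hreflang_tags())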

Associated Content & SEO, Sitemaps with External links, using CNAMEs to include External Links as my own in the sitemap

Is there any HTML code, page parameter, or meta name that can tell search engines that the content of a page is closely linked to another page on another domain?
I keep the content meta tag updated, and also the keyword meta tag.
I don't want to show these links to my visitors.
1)
I need to know if there is a protocol for communicating related links specifically to crawlers, so as to improve my ranking.
Is there any way, via code, that I can tell crawlers (crawlers specifically, the way nofollow is addressed to crawlers) that mydomain.com/Product.php is closely linked to, say:
http://ebay.com/sameProduct
http://wikipedia.com/GenericProduct or
http://google.com?q=someKeywords
Should I include external links, or CNAME-mapped external links (see Q3), inside the content meta tag? Would that make a difference?
2)
Can I include these links in my sitemap? Common sense suggests that links in my sitemap should be hosted on my domain. Still, I'm asking, since the sitemap takes the full URL, including the domain name.
3)
If a particular well-indexed page has content largely similar to mine, can I map a CNAME of my page to that site and include that in the sitemap? Would that amount to cheating?
First of all, I'm not sure what you want to achieve here. Search engines in general are already pretty good at recognizing what your page is about. If your content is about product A, write a description of product A, have images of product A, let your users comment on or review product A, or add structured data to your page (i.e. http://schema.org/Product; see the sketch at the end of this answer). All of these help search engines recognize that your page is about that product, just like the page on the other site that also has content about the same product.
To answer your questions:
1) I'm not aware of any tag like that which would also be supported by search engines.
2) In your Sitemap you can include only URLs that point to a location on the same hostname the Sitemap is hosted on (there are some exceptions, but those are irrelevant now). See http://www.sitemaps.org/protocol.html for more info about Sitemaps.
3) A CNAME resource record specifies that the domain name is an alias of another domain name, and thus it can't be used the way you described.
Lastly, you're trying to do something for crawlers which is usually a bad idea. Create an awesome website, something useful for the users, something they would love and they'd miss in case you closed the shop. Just focus on the user and all else will come.
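Here is the sketch of the schema.org/Product structured data mentioned above, expressed as JSON-LD rather than microdata attributes (both are accepted ways to put schema.org data on a page) and generated with Python here; all product details are placeholders:

    import json

    product = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Generic Product",
        "image": "https://www.example.com/images/generic-product.jpg",
        "description": "A short description of the product.",
        "offers": {
            "@type": "Offer",
            "price": "19.99",
            "priceCurrency": "USD",
        },
    }

    # Embed this in the page's HTML head or body.
    print(f'<script type="application/ld+json">{json.dumps(product)}</script>')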

SEO: Allowing crawler to index all pages when only few are visible at a time

I'm working on improving the site for the SEO purposes and hit an interesting issue. The site, among other things, includes a large directory of individual items (it doesn't really matter what these are). Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
The directory is large, with about 100,000 items. Naturally, only a few items are listed on any given page. For example, the site homepage links to about 5 or 6 items, some other page links to about a dozen different items, etc.
When real users visit the site, they can use the search form to find items by keyword or location, which produces a list matching their search criteria. However, when, for example, a Google crawler visits the site, it won't attempt to put text into the keyword search field and submit the form. So as far as the bot is concerned, after indexing the entire site it has covered only a few dozen items at best. Naturally, I want it to index each individual item separately. What are my options here?
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
Any other things I can do? What are best practices here?
Thanks in advance.
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
That would be a very bad thing to do. Serving up different content to the search engines specifically for their benefit is called cloaking and is a great way to get your site banned. Don't even consider it.
Whenever a webmaster is concerned about getting their pages indexed, having an XML sitemap is an easy way to ensure the search engines are aware of your site's content. They're very easy to create and update, too, if your site is database driven. The XML file does not have to be static, so you can dynamically produce it whenever the search engines request it (Google, Yahoo, and Bing all support XML sitemaps). You can find out more about XML sitemaps at sitemaps.org.
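A minimal sketch of such a dynamic sitemap in Python; the item list is faked here (in practice it would come from your database), and the URL scheme follows the second format from your question:

    from xml.sax.saxutils import escape

    def build_sitemap(items):
        """items is an iterable of (item_id, title) pairs."""
        urls = "\n".join(
            f"  <url><loc>http://www.mysite.com/item.php/{item_id}/{escape(title)}</loc></url>"
            for item_id, title in items
        )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{urls}\n"
            "</urlset>"
        )

    print(build_sitemap([(1, "first-item"), (2, "second-item")]))

Note that a single sitemap file is limited to 50,000 URLs, so with 100,000 items you would split the output into multiple sitemaps and list them in a sitemap index file.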
If you want to make your content available to search engines and benefit from semantic markup (i.e. HTML), you should also make sure all of your content can be reached through hyperlinks (in other words, not through form submissions or JavaScript). The reason for this is twofold:
The anchor text in the links to your items will contain the keywords you want to rank well for. This is one of the more heavily weighted ranking factors.
Links count as "votes", especially to Google. Links from external websites, especially related websites, are what you'll hear people recommend the most and for good reason. They're valuable to have. But internal links carry weight, too, and can be a great way to prop up your internal item pages.
(Bonus) Google has PageRank which used to be a huge part of their ranking algorithm but plays only a small part now. But it still has value and links "pass" PageRank to each page they link to increasing the PageRank of that page. When you have as many pages as you do that's a lot of potential PageRank to pass around. If you built your site well you could probably get your home page to a PageRank of 6 just from internal linking alone.
Having an HTML sitemap that links to all of your products is a great way to ensure that search engines, and users, can easily find all of them. It is also recommended that you structure your site so more important pages are closer to the root of your website (home page), branching out to subpages (categories) and then to specific items. This gives search engines an idea of which pages are important and helps them organize them (which helps them rank them). It also helps them follow those links from top to bottom and find all of your content.
Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
This is also bad for SEO. When you can pull up the same page using two different URLs you have duplicate content on your website. Google is on a crusade to increase the quality of their index and they consider duplicate content to be low quality. Their infamous Panda Algorithm is partially out to find and penalize sites with low quality content. Considering how many products you have it is only a matter of time before you are penalized for this. Fortunately the solution is easy. You just need to specify a canonical URL for your product pages. I recommend the second format as it is more search engine friendly.
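A sketch of that fix: both URL variants of an item page emit the same link element in their head, pointing at the preferred (second) format.

    def canonical_tag(item_id, title):
        return (f'<link rel="canonical" '
                f'href="http://www.mysite.com/item.php/{item_id}/{title}" />')

    # Served identically from item.php?id=42 and item.php/42/some-item-title
    print(canonical_tag(42, "some-item-title"))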
Read my answer to an SEO question on the Pro Webmasters site for even more information on SEO.
I would suggest, for starters, having an XML sitemap. Generate a list of all your pages and submit it to Google via Webmaster Tools. It wouldn't hurt to have a "friendly" HTML sitemap either - linked from the front page, listing all these pages, preferably by category, too.
If you're concerned with SEO, then having links to your pages is hugely important. Google could see your page, think "wow, awesome!", and give you lots of authority - this authority (some like to call it "link juice") is then passed down to the pages linked from it. You ought to make a hierarchy of pages, with the more important ones closer to the top, making the structure wide instead of deep.
Also, showing different stuff to the Google crawler than the "normal" visitor can be harmful in some cases, if Google thinks you're trying to con it.
Sorry - a little biased toward Google here - but the other engines are similar.

How to get sites identical in content but different in language and TLD indexed by major search engines?

Is it possible to get two "editions" of a website indexed by the major search engines (Google/Yahoo/Bing/Teoma) when they differ in content language only and are hosted under different TLDs?
Say English content is available at "http://domain.com/" and German content at "http://domain.de/". Now, if e.g. Google.com is used, I want it to list the "domain.com" entry, and vice versa. Is "duplicate content" an issue here?
Depending on the website software you use (WordPress, Joomla, custom, etc.), you might have a plugin or add-on that supports multiple domains and search-engine pinging/SEO. If that's the case, it should be possible.
I'm assuming your website layout is the same, but you have a ".com" and a ".de" TLD pointing to the same directory/software installation, and an (auto?) language selector to choose between English and German.
Edit: (for quick readers)
You shouldn't need separate webspace for each site. What I do for my sites to get them submitted is use sitemaps. I've never generated one by hand myself, so I can't help in that aspect. However, you could generate a sitemap for each language (e.g. sitemap.en.xml.gz | sitemap.de.xml.gz) and have your application ping the search engines with these sitemaps. Essentially, you'll have the same content in different languages, each in a sitemap that can be submitted to Google/Bing/Yahoo/etc.
I used this method on a WordPress blog I had, and every time I submitted or changed content, it would regenerate the sitemaps (updating links, etc.) and ping the search engines again.
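One more option worth knowing about: the hreflang annotations can live in the sitemap itself, which fits the two-TLD setup here. A sketch in Python (hostnames taken from the question; Google documents this xhtml:link form in its hreflang guidelines):

    # Build a sitemap that declares both language editions of the same page.
    PAGES = [("http://domain.com/", "en"), ("http://domain.de/", "de")]

    def sitemap_with_hreflang():
        alternates = "\n".join(
            f'    <xhtml:link rel="alternate" hreflang="{lang}" href="{url}"/>'
            for url, lang in PAGES
        )
        entries = "\n".join(
            f"  <url>\n    <loc>{url}</loc>\n{alternates}\n  </url>"
            for url, _ in PAGES
        )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"\n'
            '        xmlns:xhtml="http://www.w3.org/1999/xhtml">\n'
            f"{entries}\n"
            "</urlset>"
        )

    print(sitemap_with_hreflang())

Each <url> entry lists every language edition, including itself, so both the .com and .de versions point at each other.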