Duplicate content and keywords in URL - SEO

I observed that Stack Overflow uses two kinds of links for the same question, e.g. "Should I list PDFs in my sitemap file?": a short URL containing only the numeric question ID, and a longer URL with the question title appended as a keyword slug.
The idea is clear: add keywords to the URL and have search engines pick up the page faster.
But shouldn't Google punish for the duplicate content in this case?
I'm trying to understand what is more helpful since we have a similar situation on our site.

The source code has the answer.
<link rel="canonical" href="http://stackoverflow.com/questions/1072880/sitemap-xml">
<link rel="alternate" type="application/atom+xml" title="Feed for question 'Sitemap xml'" href="/feeds/question/1072880">
The relevant attributes are rel="alternate" and rel="canonical".

"The idea is clear: add keywords to the URL and have search engines pick up the page faster."
It actually has nothing to do with the speed of indexing; it does, however, affect ranking.
As pointed out above, the best approach when dealing with duplicate content within the same domain is to put a canonical element on each variant pointing to the preferred URL.
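In practice, every URL variant that serves the question carries the same canonical element. A minimal sketch (the short, ID-only URL form is an assumption based on how Stack Overflow routes question IDs):

<!-- served at both .../questions/1072880 and .../questions/1072880/sitemap-xml -->
<head>
<link rel="canonical" href="http://stackoverflow.com/questions/1072880/sitemap-xml">
</head>

Search engines then consolidate the ranking signals from all variants onto the one canonical URL, so the keyword-rich duplicates don't trigger a duplicate-content penalty.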

What's the point of using absolute urls in Pelican?

About RELATIVE_URLS, the Pelican docs say:
…there are currently two supported methods for URL formation: relative and absolute. Relative URLs are useful when testing locally, and absolute URLs are reliable and most useful when publishing.
(http://pelican.readthedocs.org/en/3.4.0/settings.html#url-settings)
But I'm confused about why absolute URLs would be better. In general, when I write HTML by hand, I prefer relative URLs because I can change the site's domain later and not worry about it.
Can somebody explain the thinking behind this setting in more detail?
I don't use the RELATIVE_URLS setting because it's document-relative. I don't want URLs containing ../.. in them, which is often what happens when that setting is used.
Moreover, relative URLs can cause issues in Atom/RSS feeds, since all links in feeds must be absolute as per the respective feed standard specifications.
Contrary to what's implied in the original question, not using the RELATIVE_URLS setting will not cause any 404s if you later decide to change the domain. There's a difference between specifying absolute URLs in your source document (which is what you seem to be talking about) and having absolute URLs generated for you at build time (which is what Pelican does).
When it comes time to link to your own content, you can either use root-relative links, or you can use the intra-site link syntax that Pelican provides.
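For example, Pelican's intra-site link syntax lets you reference another source file and have the final absolute URL generated at build time. A minimal sketch of an article in Markdown (the file names are hypothetical):

Title: My Article

See the [about page]({filename}/pages/about.md) for details.

At build time Pelican replaces {filename}/pages/about.md with that page's generated URL, derived from SITEURL, so moving to a new domain only means changing SITEURL and rebuilding.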

Basics of i18next

I'm new to i18n, and when I typed it into the search bar, i18next was among the top results.
I've already done my research on i18n and how to use it, but it's still not clear to me. All I know is that to make your web app available in other languages, you need to create a JSON file that contains your app's keys and values, and you need to add a script for the i18n library.
The rest is still confusing to me. This might sound like a stupid question, but I just can't understand how it works.
1) I'm not sure, but based on my observation, you only create a JSON translation for elements that have a value or text shown on the page. Correct? Suppose that in the HTML file I have text that is not inside a label or set via innerHTML, for example:
<html>
<body>
**How are we going to translate this text? What key am I going to use?**
</body>
</html>
What do I need to do to translate this text?
2) What should we use as the key: an id, a class, or a tag? I've seen different examples, and each uses a different one of these. When is the right time to use each?
3) Regarding the key-value pairs: what if they come from the server? What's the syntax for that?
4) When do we need a multi-line JSON file?
i18n is a big topic, with a lot of solutions depending on what kind of web app you are trying to internationalize / localize. Unfortunately, i18next's documentation is not very good, and it has next to nothing in the way of tutorials.
That said, you might be best off taking a look at the sample app in i18next's GitHub repository here: https://github.com/jamuhl/i18next/tree/master/sample/static. It gives some examples of how i18next can be used to replace HTML text with localized versions of the same. To answer some of your questions:
There are a few ways of doing this. The sample script replaces much of the data using the jQuery .text call, something like this: $('#MyHTMLID').text($.t('ns.common:MyLocalizedTextForMyHTMLID'));. Any HTML inside the element with id "MyHTMLID" is replaced by the localized data for the key "MyLocalizedTextForMyHTMLID" via the i18next .t call.
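As a rough sketch of how the pieces fit together, using the 1.x jQuery flavor of i18next that the sample is built on (the element id, keys, and translations are made up, and the option names may differ in other i18next versions):

<!-- wrap bare text in an element so the script has something to select -->
<span id="greeting">How are we going to translate this text?</span>

// translations, normally loaded from JSON files on the server
var resources = {
  en: { translation: { greeting: "How are we going to translate this text?" } },
  de: { translation: { greeting: "Wie sollen wir diesen Text übersetzen?" } }
};

$.i18n.init({ lng: "en", resStore: resources }, function (t) {
  // once translations are ready, replace the element's text
  $("#greeting").text(t("greeting"));
});

This also answers question 1: text that isn't inside a targetable element needs to be wrapped in one (a span, for instance) before it can be translated.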
A lot of these decisions are just convention. Keep it simple, be consistent.
Normally in a web app the JSON file is on the server, in a locales subdirectory of the directory where your HTML resides. Take a look at that i18next example for how it's laid out.
When you're first building your web app, use a multi-line JSON file so you can troubleshoot it easily. You can compress it later using something like http://jsonformatter.curiousconcept.com/.
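For instance, a readable multi-line locales/en/translation.json might look like this (the keys are invented):

{
  "greeting": "How are we going to translate this text?",
  "nav": {
    "home": "Home",
    "about": "About"
  }
}

Once everything works, the same file compressed onto a single line behaves identically; the whitespace is purely for your own readability while troubleshooting.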
Hope this helps get you started!

XSL-FO tags not supported in Java

I'm trying to edit an XSL-FO document to be processed in Java via Apache FOP. Are there any tags that might not be supported, or that have become deprecated?
I didn't write the original XSL-FO. It was originally a WordML document that I converted via Word2FO. I've gotten rid of junk characters and made sure all the tags are closed properly, so the only thing I can think of is that some of these tags might not be supported. In particular:
Tags with Microsoft-related properties, including fonts like Arial-Word-Unicode, references to Microsoft Office SmartTags, and other Word-related properties
SVG tags
External graphics tags with huge streams of nonsense text to represent an image.
I've been looking online for a list of these unsupported tags, but I can't seem to find anything.
I have found what I am looking for on this page.
http://xmlgraphics.apache.org/fop/compliance.html
Very useful for looking up what is/isn't supported.
I should really stop finding these solutions after asking these questions.
This question is answered, and the above link will probably help anyone who needs it answered too. Sorry to bother the rest of you.

SEO - carve up one dynamic file with params into fixed-name files

So I've got an existing real estate site. All the searches go through one PHP file, i.e. sales_search.php?city=boston&br=4
If I create the following files:
boston-1-br.php
boston-2-br.php
boston-3-br.php
boston-4-br.php
brookline-1-br.php
brookline-2-br.php
brookline-3-br.php
brookline-4-br.php
etc…
I would then use these in place of sales_search.php?city=XXX&br=NNN wherever possible and only use sales_search.php for 'advanced' searches. These new files are still dynamic, as they pull content from a database.
Would this help the rankings? Hurt them? Waste of time? Thoughts? Suggestions?
I don't think they'll help or hurt rankings; the content on the page is far more important. How similar is the content between these pages? When pages have very similar names, it can trigger a flag that makes the spider look for doorway pages. If the content is varied enough, you have little to worry about.
Have you considered using a URL rewriter and turning them into .htm files? There have been a lot of arguments about it, but I have personally noticed that .htm files do better than .php.
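If you go the rewriting route, a minimal Apache mod_rewrite sketch might look like this (it assumes an Apache host with mod_rewrite enabled, and the pattern assumes single-word city names, so adjust it to your data):

# .htaccess: map the keyword-style .htm URLs onto the existing search script
RewriteEngine On
RewriteRule ^([a-z]+)-([0-9])-br\.htm$ /sales_search.php?city=$1&br=$2 [L,QSA]

With this in place, /boston-4-br.htm is served by sales_search.php?city=boston&br=4 without creating any new physical files.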

T-SQL search html with regex?

In my database I have a field which contains an HTML document. It must be possible to search within this document; however, the HTML tags themselves must not be matched. So when I have something like this:
<html>
<head>
<title>Bar</title>
</head>
<body>
<p>
this content may be found
</p>
</body>
</html>
It is possible that the document stored in the database is not XHTML. Can you tell me the best way to search the content? Should I use regular expressions? If so, what would they look like? And if not, what else should I use?
You could try turning on Full-Text Search or use something like Lucene.Net to index the content for you.
What volume of records is there? I expect you might have to use full-text search and an IFilter to do this efficiently. HTML does not lend itself well to regex; it can quickly become very hard to do something very simple.
If the volume isn't huge, you could iterate over the records with an external parsing application, using something like the HTML Agility Pack (for .NET) or any other DOM of your choice.
But the FTS/IFilter would be my first choice.
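As a rough T-SQL sketch of the full-text route (all table, column, and index names are invented; storing the HTML in a varbinary(max) column alongside a type column holding '.html' makes SQL Server's HTML IFilter strip the markup during indexing, which is exactly the "tags must not be matched" requirement):

-- one-time setup: a catalog plus a full-text index over the HTML column
CREATE FULLTEXT CATALOG DocsCatalog;
CREATE FULLTEXT INDEX ON dbo.Documents (HtmlBody TYPE COLUMN FileExt LANGUAGE 1033)
    KEY INDEX PK_Documents ON DocsCatalog;

-- query: matches words in the visible text, not in the markup
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(HtmlBody, 'found');

Here FileExt is the column holding the extension '.html', and PK_Documents is the table's unique key index required by full-text indexing.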