Special characters in the URL - asp.net-mvc-4

I am working on an ASP.NET MVC app.
This app displays detailed information about a product.
The product name can contain special characters such as a single quote, the percent sign, the registered symbol (an 'R' inside a circle), the trademark symbol, etc.
Currently all of these are replaced with a '-'.
If the name is like this:
Super - Men's 100% Polyester Knit Shirts
It appears like this in the URL:
8080/super---men-s-100-polyester-knit-shirts/maverick
- men-s-100-polyester-knit-shirts
This is done in JS like so:
Name.replace(/([~!@#$%^&*()_+=`{}\[\]\|\\:;'"<>,.\/? ])+/g, '-').replace(/^(-)+|(-)+$/g, '');
So the question is, should the name be displayed as-is in the URL?
If yes, some pointers please.
If not, please give some solid reasons, such as current standards, that will help me make the case.
Regards.

The short answer: don't fiddle with it. It's as good as it gets out of the box.
A URL can only contain a small set of characters unescaped, which basically means A-Z, a-z, 0-9 and - _ . ~.
All other characters need to be percent-encoded. Now that internationalized URLs (Arabic, for example) are possible too, it has gotten a little more complicated.
But assuming your website's language uses the Latin alphabet, this is it. So you will never be able to have full product names in your URL.
And rewriting them into something cooler, like replacing % with "percent" in the URL, can bring disaster upon your URLs: in some cases the "fake" names may not be unique, and routing becomes unreliable.
See the list of URI characters on Wikipedia.
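To compare the two options the question weighs, here is a minimal Python sketch, not the asker's actual code. `slugify` is a rough port of the JS replacement above (the lowercasing and the "@" in the character list are assumptions), and `percent_encode` shows what the name looks like when kept as-is and escaped. Both function names are made up for illustration.

```python
import re
from urllib.parse import quote

# Characters the question's regex replaces (assumed to include "@").
SPECIAL_CHARS = "~!@#$%^&*()_+=`{}[]|\\:;'\"<>,./? "
SPECIAL_RUN = re.compile("[" + re.escape(SPECIAL_CHARS) + "]+")


def slugify(name: str) -> str:
    """Collapse each run of special characters into one '-', then trim dashes."""
    return SPECIAL_RUN.sub("-", name).strip("-").lower()


def percent_encode(name: str) -> str:
    """Keep the name as-is, escaping everything except A-Z a-z 0-9 - _ . ~"""
    return quote(name, safe="")


name = "Super - Men's 100% Polyester Knit Shirts"
print(slugify(name))         # super---men-s-100-polyester-knit-shirts
print(percent_encode(name))  # Super%20-%20Men%27s%20100%25%20Polyester%20Knit%20Shirts
```

The slug is lossy but readable; the percent-encoded form is exact but hard on the eyes, which is the trade-off the answer describes.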

Related

How to request a page title in a foreign language using wikipedia API

I am trying to use a simple GET request
"https://en.wikipedia.org/w/api.php?action=query&titles=音乐&prop=langlinks&lllimit=500"
but with Chinese characters in the title Wikipedia can't find the page, even though it exists: https://zh.wikipedia.org/wiki/%E9%9F%B3%E4%B9%90
I have tried URL-encoding the Chinese word, in both simplified and traditional form. I also tried passing it as ASCII Unicode escapes like "\u97f3\u4e50".
Does anyone know how to do this?
I solved it.
There are a few things to remember when doing this:
You need to use the Wikipedia of your target language, so in this case
zh.wikipedia.org
Chinese Wikipedia externally displays the character set of the user's region (simplified for the mainland, traditional for Taiwan). Internally, though, it depends on who wrote the article. The title in your API query must be in the character set used by the person who created the article. So for Music, 音樂 will not work and you must use the simplified 音乐. But for Notebook computer, the simplified 笔记本 will not work and you must use 筆記本. You have no choice but to try both. .NET includes a set of methods for converting between the two character sets.
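The "try both" advice can be sketched in Python as follows. This only builds the request URLs and does not call the API; the query parameters are the ones from the question, while the helper name `build_query_url` and the `lang` argument are hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(title: str, lang: str = "zh") -> str:
    """Build a langlinks query against the given language's Wikipedia."""
    params = {
        "action": "query",
        "titles": title,       # urlencode percent-encodes this as UTF-8
        "prop": "langlinks",
        "lllimit": 500,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


# Try the simplified and the traditional spelling; which one matches
# depends on how the article was originally titled.
for candidate in ("音乐", "音樂"):
    print(build_query_url(candidate))
```

Each URL can then be requested in turn until one of them returns the page.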

Scrapy: how to solve the "empty" item in html due to a foreign language symbol?

One of the scraped items seems to contain no content in the HTML. In the MySQL database it does have content, including a non-regular - (dash) that is slightly longer. It could be a dash symbol from Chinese input, or something similar. I am copying it below, not sure whether it will keep its original form. The web link is here, and this non-regular dash is in the title and at the beginning of the description.
**Hospitalist – Chattanooga**
To further prove it, the CSV file exported from MySQL converts this weird dash to ?€?. Most likely this weird symbol causes the display problem.
I want to either delete this weird symbol or replace it with a , or a regular dash. Where can that be done? In Scrapy? Or in MySQL? Sorry, this is not a specific coding question. I need some guidance before writing any code for this problem.
The long dash is called an EM dash (fileformat - EM dash).
The reason you are seeing it is likely the chosen encoding.
Try setting a different encoding, or replace the EM dash with the , sign as you mentioned in your question.
In PHP you can do so with the following code:
// chr(151) is the EM dash only in a single-byte Windows-1252 string.
$input = str_replace(chr(151), ',', $input);
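Since the scraper in the question is written in Python, the same replacement can also be done on the Scrapy side before the item reaches MySQL. A minimal sketch, assuming the text is already a decoded `str`; the function name `normalize_dashes` is made up, and it handles both the en dash (U+2013) and the em dash (U+2014) because it is not clear which one the page uses.

```python
# Map the two long dashes to a plain ASCII hyphen.
LONG_DASHES = {
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
}


def normalize_dashes(text: str) -> str:
    """Replace en and em dashes with a regular hyphen."""
    return text.translate(LONG_DASHES)


print(normalize_dashes("Hospitalist \u2013 Chattanooga"))  # Hospitalist - Chattanooga
```

One place to call it would be an item pipeline's `process_item` method, applied to the title and description fields before they are stored.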

Unwanted space in a PDF generated with TCPDF

We use TCPDF to generate PDFs. In one special case I got strange behaviour: it looks like TCPDF puts a space in between two characters.
I use cid0cs as the font. The strange behaviour appears if I place "µg" in the PDF: it now looks like "µ g" (with some space in between).
I edited the cid0cs.php on index 181 (like here: http://bytethinker.com/blog/correct-display-of-imported-fonts-in-tcpdf) with no success.
Any help is really appreciated.
Did you edit the character µ or g? If you select the letters you can see which letter the extra space belongs to. So, for a small "g" (if that is the letter the space belongs to), you must edit the entry "130=>???" of the $cw array.
$cw=array(0=>0,1=>750,2=>750,3=>750,4=>750
Make it roughly half the value. If it's 750, make it 400 and try. Or even better: look for a letter that should have the same width as your "g" (an "a", for example) and use its value.
Cheers,
Guido

Change Url using Regex

I have a URL, for example:
http://i.myhost.com/myimage.jpg
I want to change this URL to
http://i.myhost.com/myimageD.jpg
(add D after the image name and before the dot),
i.e. I want to add some text after the image name and before the dot using a regex.
What is the best way to do this using a regex?
Try using ^(.*)\.([a-zA-Z]{3,5})$ and replacing with \1D.\2 (the dot is matched outside the groups, so the replacement has to put it back). I'm assuming the extension is 3-5 letters, but you can modify it to suit. E.g. if it's just jpg images then you can put that instead of the [a-zA-Z]{3,5}.
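For a quick check of this kind of pattern, here is a small Python sketch (the question is about NSRegularExpression, so this only illustrates the regex itself). The helper name `add_suffix` is made up. Note that the dot is matched outside the two groups, so the replacement has to write it back.

```python
import re

# Group 1: everything before the last dot. Group 2: a 3-5 letter extension.
NAME_AND_EXTENSION = re.compile(r"^(.*)\.([a-zA-Z]{3,5})$")


def add_suffix(url: str, suffix: str = "D") -> str:
    """Insert suffix between the file name and its extension."""
    return NAME_AND_EXTENSION.sub(
        lambda m: f"{m.group(1)}{suffix}.{m.group(2)}", url
    )


print(add_suffix("http://i.myhost.com/myimage.jpg"))  # http://i.myhost.com/myimageD.jpg
```

A URL without a matching extension is returned unchanged.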
This sounds like a homework question, given that the solution must use a regex. On that assumption, here is an outline to get you going.
If all you have is a URL then @mathematical.coffee's solution will suit. However, if you have a chunk of text containing one or more URLs and you have to locate and change just those, then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL, start by matching each part. In the case of the items you'll want to match the last one in the sequence separately: you'll have zero or more "directories" and one "file", and the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see @mathematical.coffee's solution for an example.
The best way to learn regexes is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression, but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.
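The outline above can be sketched in Python as follows. This is one possible reading of the {protocol}{address}{item} breakdown, not a complete URL grammar: the character sets allowed in each part are simplifying assumptions, and all names are made up for illustration.

```python
import re

PROTOCOL = r"(?:https?|ftp)://"
ADDRESS = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"   # name or number, at least one dot
DIRECTORIES = r"(?:/[\w-]+)*"                    # zero or more "directory" items
FILE = r"/([\w-]+)\.([A-Za-z]{3,5})\b"           # last item: name.extension

# Concatenate the parts; group 1 is everything before the file item.
URL_WITH_FILE = re.compile("(" + PROTOCOL + ADDRESS + DIRECTORIES + ")" + FILE)


def add_suffix_in_text(text: str, suffix: str = "D") -> str:
    """Append suffix to the file name of every matching URL in text."""
    return URL_WITH_FILE.sub(
        lambda m: f"{m.group(1)}/{m.group(2)}{suffix}.{m.group(3)}", text
    )


sample = "See http://i.myhost.com/myimage.jpg and ftp://files.example.org/a/b/pic.png."
print(add_suffix_in_text(sample))
```

URLs with no file item, such as a bare host name, do not match and are left alone.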

what is the impact of escaped characters on seo-friendly urls?

I have a site that displays products. In the simplest sense, the URL of the page for a particular product is:
site.com/products/manufacturer_model - so for example, if I were displaying a Dell Latitude D700 laptop, my URL would look like:
site.com/products/dell_latitude_d700
I have a number of products whose names contain characters that I would need to URL-escape, for example a Dell Latitude 12?34. Obviously I cannot include the '?' character in the URL unescaped. For the purpose of being SEO-friendly, should I drop that character? e.g.
site.com/products/dell_latitude_1234
Or should I escape it? e.g.
site.com/products/dell_latitude_12%3F34
Escaping it seems like the most logical approach, but do crawlers understand this?
Well, using "_" is not so friendly to users, so I think using "-" is better (check the SEOmoz beginner's guide).
Also, you may want to check which characters really need escaping according to RFC 3986. If you are using PHP, check out the urlencode function page at php.net. I wrote a function to do this updated conversion a few months ago ;)
But getting back to your main question: do use escaped characters (where RFC 3986 requires it) when writing your URLs. It is the safe path to not getting stuck or penalized.
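To make the two choices from the question concrete, here is a short Python sketch (the answer mentions PHP's urlencode; Python's `urllib.parse.quote` is used here purely as an illustration). The helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def escape_segment(name: str) -> str:
    """Percent-encode everything outside the RFC 3986 unreserved set."""
    return quote(name, safe="")


def drop_question_mark(name: str) -> str:
    """The alternative from the question: drop the character entirely."""
    return name.replace("?", "")


model = "dell_latitude_12?34"
print(escape_segment(model))      # dell_latitude_12%3F34
print(drop_question_mark(model))  # dell_latitude_1234

# Escaping is reversible, dropping the character is not.
assert unquote(escape_segment(model)) == model
```

Only the escaped form lets the server recover the original model name from the URL.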