What characters are allowed in Shopify product handles? - shopify

Shopify's documentation shows some of the characters that are allowed in product handles (the product identifier that is used in URLs).
Since handles are used for your storefront navigation for products,
collections, blogs and pages, they must use alpha-numeric characters
(a-z, 0 to 9) without accents (such as umlauts, and other
diacriticals), nor characters such as # or # etc., and no spaces.
Spaces will be converted to hyphens, other characters may be stripped
entirely or converted to an equivalent standard ASCII character.
But if I create a product in the web interface with the title 'a b-c_d.e' then the handle generated by Shopify is 'a-b-c_d-e'. It seems like underscores are allowed, but spaces and dots are converted to hyphens.
What is the full set of characters allowed in product handles?

I wrote a script to test if the Shopify API accepts each of the ASCII codes from 0 to 127 in a product handle. It tries to modify the handle of an existing product to xCxC where C is the ASCII character to test and x is literally the letter x. I did it this way to find out how each character is handled when surrounded by text and also when trailing at the end of the handle.
Here are the results:
Allowed:
0-9
a-z
A-Z (will be converted to lowercase)
_ (underscore)
Allowed when surrounded but removed when at the end of the string:
- (hyphen)
Converted to - (hyphen) when surrounded but removed when at the end of the string
space
! # $ % & * + , . / : ; < = > ? # \ ^ ` { | } ~
ASCII control codes 0 to 32
Removed
" ' ( ) [ ]
See Wikipedia for details on each ASCII code: https://en.wikipedia.org/wiki/ASCII

The accepted answer is outdated.
Shopify allows many non-english characters inside URLs.
Example
https://example.myshopify.com/collections/무료/products/이지부
is now a valid Shopify URL.

I can confirm that the Navigation feature used to create menus inside Shopify does not pass quotation symbols or inch symbol, i.e: " specifically when creating a custom URL link in a navigation menu.
The symbol is allowed when entered but it is removed before being passed to the liquid template files.
Unexpectedly, you can use this symbol to create a query URL, i.e:
.../tvs/lg?pf_opt_tv_size=28.5"
This is particularly annoying when creating a Navigation link to a custom query URL created by a search filter app, Shopify will internally remove these characters for you.

Basically all the characters that are not affected by URL decode/encode functions.
Underscores (_) and Hypens (-) escape from this, also does the stop (.); but it's a URL schema parameter and hence gets converted to Shopify handle schema namely -.

Related

Defining the contents of a URL

I have a web site which contains forms for my customers to download. They are constantly telling me that the %20 listed in the url when they are looking at a form means there is a 20% discount on the items listed on the form. The following url is what is displayed in one example. Can you explain to me what the %20 means in this url? http://www.schumachersuniforms.com/form/Atonement%20PreK.pdf
Percent Encoding
A URL cannot contain certain characters. The SPACE character is one of the those forbidden chapters.
Your PDF document is apparently named with a SPACE in the middle, Atonement PreK.pdf.
Percent Encoding, also known as URL Encoding, is a way to replace the offending characters with a sequence of other characters. That sequence begins with a PERCENT SIGN character. A hexadecimal number of the character’s code point follows.
The decimal code point for SPACE is 32, the hex is 20. So the string %20 substitutes for the SPACE.
No way around this:
If you really don't want the %20, then avoid naming your PDF document with space characters. Example: AtonementPreK.pdf.
Or use a more sophisticated web scheme for handling the URL triggering a download other than directly referencing the file name.
Do not confuse URL encoding with HTML (and XML) character entity references.

Special characters in Amazon S3 file name

Users are uploading files with names like "abc #1", "abc #2". I am uploading these files to S3. When I try to download these files I get an error like this
InvalidArgument
Header value contained an open quoted span.
I am creating the link by wrapping the file name using "Uri.EscapeUriString".
Any suggestions?
From AWS documentation:
The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.
So the "abc #1" and "abc #2" are valid key names, the problem is then probably in your client code, check the documentation of your Http client.
AWS also warn about using special characters:
You can use any UTF-8 character in an object key name. However, using certain characters in key names may cause problems with some applications and protocols. The following guidelines help you maximize compliance with DNS, web-safe characters, XML parsers, and other APIs.
Alphanumeric characters: 0-9, a-z, A-Z
Special characters: !, -, _, ., *, ', (, )
So either restrict the set of available characters in your app to only allow the recommended ones, or fix the issue at your client level.
You should use Uri.EscapeDataString instead of Uri.EscapeUriString for 3 reasons:
Uri.EscapeUriString has been deprecated as of .NET 6 - https://learn.microsoft.com/en-us/dotnet/api/system.uri.escapeuristring?view=net-6.0
Uri.EscapeUriString can corrupt the Uri string in some cases
Uri.EscapeUriString only escapes the spaces - not the #
Uri.EscapeUriString("abc #1") returns "abc%20#1" whereas Uri.EscapeDataString("abc #1") returns "abc%20%231" which is preferable.

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use, because while the results contain both Perez and Pérez ocurrences, I also get some results that apparently have nothing to do with the query...
In Spanish this happens quite a lot... be it lazyness, be it whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents because it's assumend all these searches work with both options (I guess since Google allowes it, everybody assumes it's supposed to work that way).
Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued including cn you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least have the original value to still go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
Search filters ("queries") are specified by RFC2254.
Encoding:
RFC2254
actually requires filters (indirectly defined) to be an
OCTET STRING, i.e. ASCII 8-byte String:
AttributeValue is OCTET STRING,
MatchingRuleId
and AttributeDescription
are LDAPString, LDAPString is an OCTET STRING.
The standard on escaping: Use "<ASCII HEX NUMBER>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".

How can I write special character in VB code

I have a Sql statament using special character (ex: ('), (/), (&)) and I don't know how to write them in my VB.NET code. Please help me. Thanks.
Find out the Unicode code point for the character (from http://www.unicode.org) and then use ChrW to convert from the code point to the character. (To put this in another string, use concatenation. I'm somewhat surprised that VB doesn't have an escape sequence, but there we go.)
For example, for the Euro sign (U+20AC) you'd write:
Dim euro as Char = ChrW(&H20AC)
The advantage of this over putting the character directly into source code is that your source code stays "just pure ASCII" - which means you won't have any strange issues with any other program trying to read it, diff it, etc. The disadvantage is that it's harder to see the symbol in the code, of course.
The most common way seems to be to append a character of the form Chr(34)... 34 represents a double quote character. The character codes can be found from the windows program "charmap"... just windows/Run... and type charmap
If you are passing strings to be processed as SQL statement try doubling the characters for example.
"SELECT * FROM MyRecords WHERE MyRecords.MyKeyField = ""With a "" Quote"" "
The '' double works with the other special characters as well.
The ' character can be doubled up to allow it into a string e.g
lSQLSTatement = "Select * from temp where name = 'fred''s'"
Will search for all records where name = fred's
Three points:
1) The example characters you've given are not special characters. They're directly available on your keyboard. Just press the corresponding key.
2) To type characters that don't have a corresponding key on the keyboard, use this:
Alt + (the ASCII code number of the special character)
For example, to type ¿, press Alt and key in 168, which is the ASCII code for that special character.
You can use this method to type a special character in practically any program not just a VB.Net text editor.
3) What you probably looking for is what is called 'escaping' characters in a string. In your SQL query string, just place a \ before each of those characters. That should do.
Chr() is probably the most popular.
ChrW() can be used if you want to generate unicode characters
The ControlChars class contains some special and 'invisible' characters, plus the quote - for example, ControlChars.Quote

Are square brackets permitted in URLs?

Are square brackets in URLs allowed?
I noticed that Apache commons HttpClient (3.0.1) throws an IOException, wget and Firefox however accept square brackets.
URL example:
http://example.com/path/to/file[3].html
My HTTP client encounters such URLs but I'm not sure whether to patch the code or to throw an exception (as it actually should be).
RFC 3986 states
A host identified by an Internet
Protocol literal address, version 6
[RFC3513] or later, is distinguished
by enclosing the IP literal within
square brackets ("[" and "]"). This
is the only place where square bracket
characters are allowed in the URI
syntax.
So you should not be seeing such URI's in the wild in theory, as they should arrive encoded.
Square brackets [ and ] in URLs are not often supported.
Replace them by %5B and %5D:
Using a command line, the following example is based on bash and sed:
url='http://example.com?day=[0-3][0-9]'
encoded_url="$( sed 's/\[/%5B/g;s/]/%5D/g' <<< "$url")"
Using Java URLEncoder.encode(String s, String enc)
Using PHP rawurlencode() or urlencode()
<?php
echo '<a href="http://example.com/day/',
rawurlencode('[0-3][0-9]'), '">';
?>
output:
<a href="http://example.com/day/%5B0-3%5D%5B0-9%5D">
or:
<?php
$query_string = 'day=' . urlencode('[0-3][0-9]') .
'&month=' . urlencode('[0-1][0-9]');
echo '<a href="http://example.com?',
htmlentities($query_string), '">';
?>
Using your favorite programming language... Please extend this answer by posting a comment or editing directly this answer to add the function you use from your programming language ;-)
For more details, see the RFC 3986 specifying the URL syntax. The Appendix A is about %-encoding in the query string (brackets as belonging to “gen-delims” to be %-encoded).
I know this question is a bit old, but I just wanted to note that PHP uses brackets to pass arrays in a URL.
http://www.example.com/foo.php?bar[]=1&bar[]=2&bar[]=3
In this case $_GET['bar'] will contain array(1, 2, 3).
Pretty much the only characters not allowed in pathnames are # and ? as they signify the end of the path.
The uri rfc will have the definative answer:
http://www.ietf.org/rfc/rfc1738.txt
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems. The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character "%" is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are "{", "}", "|", "\", "^", "~",
"[", "]", and "`".
All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.
The answer is that they should be hex encoded, but knowing postel's law, most things will accept them verbatim.
Any browser or web-enabled software that accepts URLs and is not throwing an exception when special characters are introduced is almost guaranteed to be encoding the special characters behind the scenes. Curly brackets, square brackets, spaces, etc all have special encoded ways of representing them so as not to produce conflicts. As per the previous answers, the safest way to deal with these is to URL-encode them before handing them off to something that will try to resolve the URL.
For using the HttpClient commons class, you want to look into the org.apache.commons.httpclient.util.URIUtil class, specifically the encode() method. Use it to URI-encode the URL before trying to fetch it.
StackOverflow seems to not encode them:
https://stackoverflow.com/search?q=square+brackets+[url]
Best to URL encode those, as they are clearly not supported in all web servers. Sometimes, even when there is a standard, not everyone follows it.
According to the URL specification, the square brackets are not valid URL characters.
Here's the relevant snippets:
The "national" and "punctuation" characters do not appear in any
productions and therefore may not appear in URLs.
national { | } | vline | [ | ] | \ | ^ | ~
punctuation < | >
Square brackets are considered unsafe, but majority of browsers will parse those correctly. Having said that it is better to replace square brackets with some other characters.