What character encoding should I use for a HTTP header? - http-headers

I'm using a "fun" HTML special-character (✰)(see http://html5boilerplate.com/ for more info) for a Server HTTP-header and am wondering if it is "allowed" per spec.
Using the Network Tab in the dev tools in Chrome on Windows Xp Pro SP 3 I see the ✰ just fine.
In IE8 the ✰ is not rendered correctly.
The w3.org HTML validator does not render it correctly (displays "â°" instead).
Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)
Is the disparity caused by bugs in the different parsers/browses/engines/(whatever-they-are-called)?
Is there a spec for this or maybe a list of allowed characters for an HTTP-header "value"?

In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but are not supposed to be displayable.
HTTPbis gave up and specified that in the headers there is no useful encoding besides ASCII:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
Previously, RFC 2616 from 1999 defined this:
Words of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
and RFC 2047 is the MIME encoding, so it'd be:
=?UTF-8?Q?=E2=9C=B0?=
but I don't think that many (if any) clients support it.

Please read comments first, this answer likely draws wrong conclusions from the right sources, needs edit.
You can use any printable ASCII chars, and no special chars like ✰ (Which is not ASCII)
Tip: you can encode anything in JSON.
Edit: may not be obvious at first, the character encoding defined in the header only applies for the response body, not for the header itself. (As it would cause a chicken-&-egg problem.)
I'd like to sum up all the relevant definitions as per the spec linked by Penchant.
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
So, we are after field-value.
LWS = [CRLF] 1*( SP | HT )
CRLF = CR LF
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.
Let's simplify it to this:
field-value = <any field-content or Space or Tab>
Now we are after field-content.
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs,
but including LWS>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
TEXT is the most general and includes all the rest -so forget about the rest-.
Here is the US-ASCII charset (= ASCII)
As you can see, all printable ASCII chars are allowed.

Related

What characters are allowed in Shopify product handles?

Shopify's documentation shows some of the characters that are allowed in product handles (the product identifier that is used in URLs).
Since handles are used for your storefront navigation for products,
collections, blogs and pages, they must use alpha-numeric characters
(a-z, 0 to 9) without accents (such as umlauts, and other
diacriticals), nor characters such as # or # etc., and no spaces.
Spaces will be converted to hyphens, other characters may be stripped
entirely or converted to an equivalent standard ASCII character.
But if I create a product in the web interface with the title 'a b-c_d.e' then the handle generated by Shopify is 'a-b-c_d-e'. It seems like underscores are allowed, but spaces and dots are converted to hyphens.
What is the full set of characters allowed in product handles?
I wrote a script to test if the Shopify API accepts each of the ASCII codes from 0 to 127 in a product handle. It tries to modify the handle of an existing product to xCxC where C is the ASCII character to test and x is literally the letter x. I did it this way to find out how each character is handled when surrounded by text and also when trailing at the end of the handle.
Here are the results:
Allowed:
0-9
a-z
A-Z (will be converted to lowercase)
_ (underscore)
Allowed when surrounded but removed when at the end of the string:
- (hyphen)
Converted to - (hyphen) when surrounded but removed when at the end of the string
space
! # $ % & * + , . / : ; < = > ? # \ ^ ` { | } ~
ASCII control codes 0 to 32
Removed
" ' ( ) [ ]
See Wikipedia for details on each ASCII code: https://en.wikipedia.org/wiki/ASCII
The accepted answer is outdated.
Shopify allows many non-english characters inside URLs.
Example
https://example.myshopify.com/collections/무료/products/이지부
is now a valid Shopify URL.
I can confirm that the Navigation feature used to create menus inside Shopify does not pass quotation symbols or inch symbol, i.e: " specifically when creating a custom URL link in a navigation menu.
The symbol is allowed when entered but it is removed before being passed to the liquid template files.
Unexpectedly, you can use this symbol to create a query URL, i.e:
.../tvs/lg?pf_opt_tv_size=28.5"
This is particularly annoying when creating a Navigation link to a custom query URL created by a search filter app, Shopify will internally remove these characters for you.
Basically all the characters that are not affected by URL decode/encode functions.
Underscores (_) and Hypens (-) escape from this, also does the stop (.); but it's a URL schema parameter and hence gets converted to Shopify handle schema namely -.

Convert text value in SQL from ISO-8859-1

I have some data in SQL that is falsely formatted:
"Daniel Bødtker" is displayed in ISO 8859-1 format: "=?iso-8859-1?Q?Daniel_B=F8dtker?="
Does anyone have a fix function ready to share?
Thanks!
Daniel
That's not ISO8859-1 format. That's RFC 2047 format, used for transmitting international characters in email headers, which technically only support 7-bit ASCII.
These values have almost certainly been extracted from email headers by a process which does not understand RFC2047.
Format is:
"=?" + character set name + "?" + encoding identifier + "?" + encoded data + "?="
The encoding identifier is either Q or B. Q means "Quoted-Printable" and B means "Base64".
Note that an email header can have multiple such sequences.
Therefore your solution needs to search for these sequences, and handle them on a case-by-case basis.

why ldap search return all results when using %?

When I search one ldap server using the following filter
(cn=%*)
It return all results under the base dn? LDAP treat '%' specially? But I haven't found any description about it.
What is your directory server ?
Are you sure tha '%' is not replace by your command line interpreter or your compiler ?
According to RFC2254 % is not a special character
If a value should contain any of the following characters
Character ASCII value
---------------------------
* 0x2a
( 0x28
) 0x29
\ 0x5c
NUL 0x00
the character must be encoded as the backslash '\' character (ASCII
0x5c) followed by the two hexadecimal digits representing the ASCII
value of the encoded character. The case of the two hexadecimal
digits is not significant.
This simple escaping mechanism eliminates filter-parsing ambiguities
and allows any filter that can be represented in LDAP to be
represented as a NUL-terminated string. Other characters besides the
ones listed above may be escaped using this mechanism, for example,
non-printing characters.
For example, the filter checking whether the "cn" attribute contained
a value with the character "" anywhere in it would be represented as
"(cn=\2a*)".
Note that although both the substring and present productions in the
grammar above can produce the "attr=*" construct, this construct is
used only to denote a presence filter.

Interesting UTF-8 Yahoo File Download Headers

My company runs a webmail service, and we were trying to diagnose a problem with Word downloads not opening automatically - the same *.doc file download from Yahoo Mail would open, but one from ours would not.
In the course of investigating the headers we saw this coming from Yahoo:
content-disposition attachment; filename*="utf-8''word document.doc";
Whereas our headers were like this:
content-disposition attachment; filename="word document.doc";
What exactly is Yahoo doing with the additional asterisk and utf-8'' designation?
I think the correct answer to this is in rfc 2231:
Asterisks ("*") are reused to provide the indicator that language and
character set information is present and encoding is being used. A
single quote ("'") is used to delimit the character set and language
information at the beginning of the parameter value. Percent signs
("%") are used as the encoding flag, which agrees with RFC 2047.
Specifically, an asterisk at the end of a parameter name acts as an
indicator that character set and language information may appear at
the beginning of the parameter value. A single quote is used to
separate the character set, language, and actual value information in
the parameter value string, and an percent sign is used to flag
octets encoded in hexadecimal. For example:
Content-Type: application/x-stuff;
title*=us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A
What Mime-Type are you using?
The asterisk is required as per RFC 2183 (http://www.ietf.org/rfc/rfc2183.txt):
In the extended BNF notation of [RFC 822], the Content-Disposition
header field is defined as follows:
disposition := "Content-Disposition" ":"
disposition-type
*(";" disposition-parm)
disposition-type := "inline"
/ "attachment"
/ extension-token
; values are not case-sensitive
disposition-parm := filename-parm
/ creation-date-parm
/ modification-date-parm
/ read-date-parm
/ size-parm
/ parameter
filename-parm := "filename" "=" value
creation-date-parm := "creation-date" "=" quoted-date-time
modification-date-parm := "modification-date" "=" quoted-date-time
read-date-parm := "read-date" "=" quoted-date-time
size-parm := "size" "=" 1*DIGIT
quoted-date-time := quoted-string
; contents MUST be an RFC 822 `date-time'
; numeric timezones (+HHMM or -HHMM) MUST be used

Are square brackets permitted in URLs?

Are square brackets in URLs allowed?
I noticed that Apache commons HttpClient (3.0.1) throws an IOException, wget and Firefox however accept square brackets.
URL example:
http://example.com/path/to/file[3].html
My HTTP client encounters such URLs but I'm not sure whether to patch the code or to throw an exception (as it actually should be).
RFC 3986 states
A host identified by an Internet
Protocol literal address, version 6
[RFC3513] or later, is distinguished
by enclosing the IP literal within
square brackets ("[" and "]"). This
is the only place where square bracket
characters are allowed in the URI
syntax.
So you should not be seeing such URI's in the wild in theory, as they should arrive encoded.
Square brackets [ and ] in URLs are not often supported.
Replace them by %5B and %5D:
Using a command line, the following example is based on bash and sed:
url='http://example.com?day=[0-3][0-9]'
encoded_url="$( sed 's/\[/%5B/g;s/]/%5D/g' <<< "$url")"
Using Java URLEncoder.encode(String s, String enc)
Using PHP rawurlencode() or urlencode()
<?php
echo '<a href="http://example.com/day/',
rawurlencode('[0-3][0-9]'), '">';
?>
output:
<a href="http://example.com/day/%5B0-3%5D%5B0-9%5D">
or:
<?php
$query_string = 'day=' . urlencode('[0-3][0-9]') .
'&month=' . urlencode('[0-1][0-9]');
echo '<a href="http://example.com?',
htmlentities($query_string), '">';
?>
Using your favorite programming language... Please extend this answer by posting a comment or editing directly this answer to add the function you use from your programming language ;-)
For more details, see the RFC 3986 specifying the URL syntax. The Appendix A is about %-encoding in the query string (brackets as belonging to “gen-delims” to be %-encoded).
I know this question is a bit old, but I just wanted to note that PHP uses brackets to pass arrays in a URL.
http://www.example.com/foo.php?bar[]=1&bar[]=2&bar[]=3
In this case $_GET['bar'] will contain array(1, 2, 3).
Pretty much the only characters not allowed in pathnames are # and ? as they signify the end of the path.
The uri rfc will have the definative answer:
http://www.ietf.org/rfc/rfc1738.txt
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems. The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character "%" is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are "{", "}", "|", "\", "^", "~",
"[", "]", and "`".
All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.
The answer is that they should be hex encoded, but knowing postel's law, most things will accept them verbatim.
Any browser or web-enabled software that accepts URLs and is not throwing an exception when special characters are introduced is almost guaranteed to be encoding the special characters behind the scenes. Curly brackets, square brackets, spaces, etc all have special encoded ways of representing them so as not to produce conflicts. As per the previous answers, the safest way to deal with these is to URL-encode them before handing them off to something that will try to resolve the URL.
For using the HttpClient commons class, you want to look into the org.apache.commons.httpclient.util.URIUtil class, specifically the encode() method. Use it to URI-encode the URL before trying to fetch it.
StackOverflow seems to not encode them:
https://stackoverflow.com/search?q=square+brackets+[url]
Best to URL encode those, as they are clearly not supported in all web servers. Sometimes, even when there is a standard, not everyone follows it.
According to the URL specification, the square brackets are not valid URL characters.
Here's the relevant snippets:
The "national" and "punctuation" characters do not appear in any
productions and therefore may not appear in URLs.
national { | } | vline | [ | ] | \ | ^ | ~
punctuation < | >
Square brackets are considered unsafe, but majority of browsers will parse those correctly. Having said that it is better to replace square brackets with some other characters.