What are the allowed ASCII characters via POST in HTTP requests? - httprequest

What ASCII characters are not allowed in HTTP requests (particularly via POST and application/x-www-form-urlencoded)? (one is '+')

If the form is encoded with application/x-www-form-urlencoded, which is the default for HTML forms, the only characters you can definitely use are:
0-9
a-z
A-Z
$ - _ . ! * ' ( ) , "
"+" means space. Everything else can have a special meaning.
If you are using multipart/form-data, you can send any characters as they are. If you are using an HTML form, add the enctype attribute, like so:
<form method="post" enctype="multipart/form-data">
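As a rough illustration of the form-encoding rules above, here is a minimal Python sketch using the standard library's urllib.parse. The field name comment and its value are made up for the example.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field; the name and value are illustrative only.
fields = {"comment": "a b+c&d=e"}

# urlencode() builds an application/x-www-form-urlencoded body:
# the space becomes "+", while "+", "&" and "=" are percent-encoded.
body = urlencode(fields)
print(body)  # comment=a+b%2Bc%26d%3De

# Decoding reverses this, so the original value survives the round trip.
decoded = parse_qs(body)
print(decoded)  # {'comment': ['a b+c&d=e']}
```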

Related

Validating URLs with non traditional characters

I have a URL with special characters, example:
$url = 'https://example.com/c/ファンタシ.jpg';
I can't validate it with:
var_dump( (filter_var($url, FILTER_VALIDATE_URL)) );
Because it isn't a valid URL according to the RFC:
"Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'(),"
[not including the quotes - ed], and reserved characters used for
their reserved purposes may be used unencoded within a URL"
And I can't do:
urlencode($url);
Because that will encode the entire string with encoded slashes, colon etc.
So how do I encode the $url properly so that it passes the validation?
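One possible approach is to percent-encode only the path component and leave the scheme and host alone. The question is about PHP; the sketch below uses Python purely as an illustration of the idea. The helper name encode_url_path is hypothetical, it assumes the path is not already percent-encoded, and whether the result satisfies FILTER_VALIDATE_URL has not been checked here.

```python
from urllib.parse import urlsplit, urlunsplit, quote, unquote

def encode_url_path(url: str) -> str:
    """Percent-encode only the path of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() leaves "/" alone by default and encodes non-ASCII
    # characters as percent-encoded UTF-8 bytes.
    encoded_path = quote(parts.path)
    return urlunsplit(parts._replace(path=encoded_path))

url = "https://example.com/c/ファンタシ.jpg"
encoded = encode_url_path(url)
print(encoded)

# The scheme, host and slashes are untouched, and decoding the path
# gives back the original text.
print(unquote(urlsplit(encoded).path))
```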

AspNet core Url decoding

I am using AspNetCore 2.1.
I ran into an issue deserializing part of a URL:
http://localhost:55381/api/Umbrellas/cc1892b0-b790-4698-ae3e-07bee39fd29b/ModeOperationnelWithAppliedEvents?dateDeValeur=2018-09-01T02:00:00.000+02:00
The part "2018-09-01T02:00:00.000+02:00" is expected to be deserialized as a DateTimeOffset, but this fails and default(DateTimeOffset) is returned.
If I encode it as "2018-09-01T02%3A00%3A00.000%2B02%3A00", it is deserialized correctly.
So the unencoded format does not work when it appears in the URL.
In contrast, when the same format appears in the body of the message, it is deserialized correctly:
{"lastKnownAggregateVersion":4,"validFrom":"2017-09-03T00:00:00.000+02:00","commandId":"0cfa7da0-7895-4917-89ac-24ffa3abb87c","newDateDeValeur":"2017-09-03T00:00:00.000+02:00","eventUniqueIdentifier":{"streamName":"umbrella-54576b92-0234-4ec1-8eee-142375c53325","eventVersion":0},"aggregateId":"54576b92-0234-4ec1-8eee-142375c53325"}
According to RFC 3986, both the colon ':' and '+' are legal characters in a URL. Does anyone have an idea about this?
OK, it turns out that URLs and URIs are covered by different standards.
The URL standard is RFC 1738: Uniform Resource Locators (URL). According to that document, ':' is reserved for the scheme:
Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "#", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.
As for '+':
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
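One way to see the difference between the two forms of the query string is to decode both with a parser that applies application/x-www-form-urlencoded rules, where '+' stands for a space. The sketch below uses Python's urllib.parse for this. That ASP.NET Core decodes the query string the same way is an assumption here, not something the question confirms.

```python
from urllib.parse import parse_qs

# The two query strings from the question.
raw_query = "dateDeValeur=2018-09-01T02:00:00.000+02:00"
encoded_query = "dateDeValeur=2018-09-01T02%3A00%3A00.000%2B02%3A00"

# parse_qs() follows form-urlencoded rules: "+" decodes to a space.
raw_value = parse_qs(raw_query)["dateDeValeur"][0]
encoded_value = parse_qs(encoded_query)["dateDeValeur"][0]

print(repr(raw_value))      # '2018-09-01T02:00:00.000 02:00'
print(repr(encoded_value))  # '2018-09-01T02:00:00.000+02:00'
```

Under that assumption, the unencoded form loses the '+' of the UTC offset before any date parsing happens, while the colons come through unchanged.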

How do I encode a : in a url?

I need to send a GET request where the last part of the URL is a JSON value. I have tried encoding {"period":"600s"} on several different sites, but they all give the same result, in which the : is not decoded.
The encoded url: stickiness=%7B%22period%22%3A%22600s%22%7D.
Its result when I enter it into my browser:
So how do I encode a :?
%3A is the encoding of :. The : character is reserved in URIs for designating the port number (e.g. google.com:443 manually specifies port 443, the default HTTPS port). If you want to include a : in a URI, it must be percent-encoded, which is what the %3A is. It can't be decoded in the URL bar because that would violate the reserved purpose of the : character.
The colon character is not decoded in the browser because it belongs to the reserved characters that already have an explicit meaning elsewhere in URLs: in this case, separating the protocol from the hostname, and the hostname from the port.
The relevant standard is RFC 1738, page 3:
Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "#", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.
Usually a URL has the same interpretation when an octet is
represented by a character and when it encoded. However, this is not
true for reserved characters: encoding a character reserved for a
particular scheme may change the semantics of a URL.
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
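The encoding from the question can be reproduced with a short Python sketch. Only the parameter name stickiness and the JSON value come from the question; the rest is illustrative.

```python
from urllib.parse import quote, unquote

value = '{"period":"600s"}'

# safe="" percent-encodes every reserved character, including ":".
encoded = quote(value, safe="")
print("stickiness=" + encoded)  # stickiness=%7B%22period%22%3A%22600s%22%7D

# Decoding on the receiving side restores the original JSON text.
print(unquote(encoded))  # {"period":"600s"}
```

So %3A is already the correct encoding of the colon. The server decodes it, even if the browser's address bar keeps displaying it in encoded form.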

What characters are allowed in Shopify product handles?

Shopify's documentation shows some of the characters that are allowed in product handles (the product identifier that is used in URLs).
Since handles are used for your storefront navigation for products,
collections, blogs and pages, they must use alpha-numeric characters
(a-z, 0 to 9) without accents (such as umlauts, and other
diacriticals), nor characters such as # or @ etc., and no spaces.
Spaces will be converted to hyphens, other characters may be stripped
entirely or converted to an equivalent standard ASCII character.
But if I create a product in the web interface with the title 'a b-c_d.e' then the handle generated by Shopify is 'a-b-c_d-e'. It seems like underscores are allowed, but spaces and dots are converted to hyphens.
What is the full set of characters allowed in product handles?
I wrote a script to test if the Shopify API accepts each of the ASCII codes from 0 to 127 in a product handle. It tries to modify the handle of an existing product to xCxC where C is the ASCII character to test and x is literally the letter x. I did it this way to find out how each character is handled when surrounded by text and also when trailing at the end of the handle.
Here are the results:
Allowed:
0-9
a-z
A-Z (will be converted to lowercase)
_ (underscore)
Allowed when surrounded but removed when at the end of the string:
- (hyphen)
Converted to - (hyphen) when surrounded but removed when at the end of the string:
space
! # $ % & * + , . / : ; < = > ? @ \ ^ ` { | } ~
ASCII control codes 0 to 31
Removed:
" ' ( ) [ ]
See Wikipedia for details on each ASCII code: https://en.wikipedia.org/wiki/ASCII
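The observed rules can be approximated in a few lines of Python. This is only a sketch of the test results listed above, not Shopify's actual implementation. The function name approximate_handle is made up, it covers ASCII input only, and how Shopify treats runs of consecutive hyphens is not covered by the results, so they are left as they are.

```python
import string

# Characters kept as-is (after lowercasing), per the results above.
ALLOWED = set(string.ascii_lowercase + string.digits + "_-")
# Characters that were observed to be dropped entirely.
REMOVED = set("\"'()[]")

def approximate_handle(text: str) -> str:
    """Approximate the observed handle rules (hypothetical helper)."""
    out = []
    for ch in text.lower():
        if ch in REMOVED:
            continue          # dropped without a replacement
        if ch in ALLOWED:
            out.append(ch)    # kept unchanged
        else:
            out.append("-")   # everything else becomes a hyphen
    # Hyphens at the end of the string were observed to be removed.
    return "".join(out).rstrip("-")

print(approximate_handle("a b-c_d.e"))  # a-b-c_d-e
print(approximate_handle("x.x."))       # x-x
print(approximate_handle("x(x("))       # xx
```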
The accepted answer is outdated.
Shopify allows many non-English characters inside URLs.
Example
https://example.myshopify.com/collections/무료/products/이지부
is now a valid Shopify URL.
I can confirm that the Navigation feature used to create menus inside Shopify does not pass the double quotation mark (inch symbol), i.e. ", through when you create a custom URL link in a navigation menu.
The symbol is accepted when entered, but it is removed before being passed to the Liquid template files.
Unexpectedly, you can use this symbol to create a query URL, e.g.:
.../tvs/lg?pf_opt_tv_size=28.5"
This is particularly annoying when creating a Navigation link to a custom query URL generated by a search filter app, because Shopify will internally remove these characters.
Basically, all the characters that are not affected by URL encode/decode functions are allowed.
Underscores (_) and hyphens (-) pass through unchanged. The full stop (.) would too, but because it has a meaning in URLs it gets converted to the Shopify handle convention, namely -.

When should space be encoded to plus (+) or %20?

Sometimes the spaces get URL encoded to the + sign, and some other times to %20. What is the difference and why should this happen?
+ means a space only in application/x-www-form-urlencoded content, such as the query part of a URL:
http://www.example.com/path/foo+bar/path?query+name=query+value
In this URL, the parameter name is query name with a space and the value is query value with a space, but the folder name in the path is literally foo+bar, not foo bar.
%20 is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20 and pluses with %2B. This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).
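The same distinction exists in Python's standard library, which makes for a compact illustration: quote() produces the %20 form and quote_plus() produces the form-style + form. The sample string is made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "foo bar+baz"

percent_style = quote(text, safe="")  # spaces become %20
form_style = quote_plus(text)         # spaces become +
print(percent_style)  # foo%20bar%2Bbaz
print(form_style)     # foo+bar%2Bbaz

# The %20 form decodes correctly with either decoder.
print(unquote(percent_style))       # foo bar+baz
print(unquote_plus(percent_style))  # foo bar+baz

# The + form only decodes correctly with the form-aware decoder.
print(unquote_plus(form_style))  # foo bar+baz
print(unquote(form_style))       # foo+bar+baz
```

The last line shows the failure mode: a decoder that does not apply form rules leaves the + as a literal plus.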
See Also
HTML 4.01 Specification application/x-www-form-urlencoded
So, the answers here are all a bit incomplete. The use of a '%20' to encode a space in URLs is explicitly defined in RFC 3986, which defines how a URI is built. There is no mention in this specification of using a '+' for encoding spaces - if you go solely by this specification, a space must be encoded as '%20'.
The mention of using '+' for encoding spaces comes from the various incarnations of the HTML specification - specifically in the section describing content type 'application/x-www-form-urlencoded'. This is used for posting form data.
Now, the HTML 2.0 specification (RFC 1866) explicitly said, in section 8.2.2, that the query part of a GET request's URL string should be encoded as 'application/x-www-form-urlencoded'. This, in theory, suggests that it's legal to use a '+' in the URL in the query string (after the '?').
But... does it really? Remember, HTML is itself a content specification, and URLs with query strings can be used with content other than HTML. Further, while the later versions of the HTML spec continue to define '+' as legal in 'application/x-www-form-urlencoded' content, they completely omit the part saying that GET request query strings are defined as that type. There is, in fact, no mention whatsoever about the query string encoding in anything after the HTML 2.0 specification.
Which leaves us with the question - is it valid? Certainly there's a lot of legacy code which supports '+' in query strings, and a lot of code which generates it as well. So odds are good you won't break if you use '+'. (And, in fact, I did all the research on this recently because I discovered a major site which failed to accept '%20' in a GET query as a space. They actually failed to decode any percent encoded character. So the service you're using may be relevant as well.)
But from a pure reading of the specifications, without the language from the HTML 2.0 specification carried over into later versions, URLs are covered entirely by RFC 3986, which means spaces ought to be converted to '%20'. And definitely that should be the case if you are requesting anything other than an HTML document.
http://www.example.com/some/path/to/resource?param1=value1
The part before the question mark must use % encoding (so %20 for a space); after the question mark you can use either %20 or + for a space. If you need an actual + after the question mark, use %2B.
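A small Python sketch of that rule, decoding the path and the query of one URL with the decoder that matches each part. The URL is invented for the example.

```python
from urllib.parse import urlsplit, unquote, parse_qs

# Hypothetical URL with a "+" in the path and in the query.
url = "http://www.example.com/some/path+name?param1=value+1%2B2"
parts = urlsplit(url)

# Path: plain percent-decoding, so "+" stays a literal plus.
path = unquote(parts.path)
print(path)  # /some/path+name

# Query: form-style decoding, so "+" is a space and %2B is a plus.
params = parse_qs(parts.query)
print(params)  # {'param1': ['value 1+2']}
```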
For compatibility reasons, it's better to always encode spaces as "%20", not as "+".
It was RFC 1866 (the HTML 2.0 specification) that specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs (see section 8.2.1, subparagraph 1). This way of encoding form data also appears in later HTML specifications; look for the relevant paragraphs about application/x-www-form-urlencoded.
Here is an example of a URL string where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, according to RFC 1866, spaces can be replaced by pluses only after the "?". Everywhere else, spaces should be encoded as %20. But since it's hard to determine the context, it's best practice never to encode spaces as "+".
I would recommend percent-encoding all characters except the "unreserved" ones defined in RFC 3986, section 2.3:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The only situation when you may want to encode spaces as "+" (one byte) rather than "%20" (three bytes) is when you know for sure how to interpret the context, and when the size of the query string is of the essence.
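The recommendation to percent-encode everything outside the unreserved set can be written out by hand, as in the following sketch. The function name percent_encode is hypothetical and UTF-8 is assumed for non-ASCII input. In practice a library routine such as urllib.parse.quote(text, safe="") does the same job.

```python
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986, section 2.3)
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    """Percent-encode every byte outside the unreserved set (sketch)."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 for non-ASCII
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)

print(percent_encode("foo bar+baz"))  # foo%20bar%2Bbaz
print(percent_encode("a-b.c_d~e"))    # a-b.c_d~e
```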
What's the difference? See the other answers.
When should we use + instead of %20? Use + if, for some reason, you want to make the URL query string (?.....) or hash fragment (#....) more readable. Example: You can actually read this:
https://www.google.se/#q=google+doesn%27t+encode+:+and+uses+%2B+instead+of+spaces
(%2B = +)
But the following is a lot harder to read (at least to me):
https://www.google.se/#q=google%20doesn%27t%20oops%20:%20%20this%20text%20%2B%20is%20different%20spaces
I would think + is unlikely to break anything, since Google uses + (see the 1st link above) and they've probably thought about this. I'm going to use + myself just because readable + Google thinks it's OK.