Valid characters for URI schemes? - standards-compliance

I was thinking about Registering an Application to a URL Protocol and I'd like to know, what characters are allowed in a scheme?
Some examples:
h323 (has numbers)
h323:[<user>#]<host>[:<port>][;<parameters>]
z39.50r (has a . as well)
z39.50r://<host>[:<port>]/<database>?<docid>[;esn=<elementset>][;rs=<recordsyntax>]
paparazzi:http (has a :)
paparazzi:http:[//<host>[:[<port>][<transport>]]/
So, what characters can I fancy using?
Can we have...
#:TwitterUser
#:HashTag
$:CapitalStock
?:ID-10T
...etc., as desired, or are the characters in a scheme restricted by a standard?

According to RFC 2396, Appendix A:
scheme = alpha *( alpha | digit | "+" | "-" | "." )
Meaning:
The scheme must start with a letter (upper or lower case) and can contain letters (again upper or lower case), digits, "+", "-" and ".".
Note: in the case of
paparazzi:http:[//<host>[:[<port>][<transport>]]/
the scheme is only the "paparazzi" part.

The scheme according to RFC 3986 is defined as:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
So the scheme must begin with an alphabetic character (A–Z, a–z) and may be followed by any number of alphanumeric characters, "+", "-", or ".".
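A minimal sketch of that rule as a check, assuming Python and a hypothetical helper name is_valid_scheme (neither comes from the RFC). It only tests the scheme production quoted above, not a complete URI.

```python
import re

# Grammar from RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(name: str) -> bool:
    """Return True if name matches the RFC 3986 scheme production."""
    return SCHEME_RE.fullmatch(name) is not None


# Candidates taken from the question: the first three are accepted,
# the punctuation-only ones are rejected.
for candidate in ["h323", "z39.50r", "paparazzi", "#", "$", "?"]:
    print(candidate, is_valid_scheme(candidate))
```

Under this check, "#", "$" and "?" are rejected because a scheme has to start with a letter.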

Quoth RFC 2396:
Scheme names consist of a sequence of characters beginning with a
lower case letter and followed by any combination of lower case
letters, digits, plus ("+"), period ("."), or hyphen ("-").

Related

Grammar and unicode characters

Why does the grammar below fail to parse Unicode characters?
It parses fine after removing the word boundaries around <.sym>.
#!/usr/bin/env perl6
grammar G {
proto rule TOP { * }
rule TOP:sym<y> { «<.sym>» }
rule TOP:sym<✓> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('✓'); # Nil
From the « and » "left and right word boundary" doc:
[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.
✓ isn't a word character. So the word boundary assertion fails.
What is and isn't a "word character"
"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:
Characters whose Unicode general category starts with an L, which stands for Letter.1
Characters whose Unicode general category is Nd, which stands for Number, decimal.2
_, an underscore.
"alpha 'Nd under"
In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".
But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).2
This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".
Footnotes
1 Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the ら character JJ also mentions. There are other letter sub-categories too.
2 Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one but it is excluded from the Nd category (and is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
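The three-part rule above (letter, Nd digit, underscore) can be sketched outside P6 as well. The following is an illustration in Python using the standard unicodedata module; the helper name is_word_char is hypothetical, and Python's own \w is deliberately not used because its definition can differ in details from the one described here.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Apply the rule described above: letter (L*), decimal digit (Nd), or underscore."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd" or ch == "_"


# Characters mentioned in the answer and its footnotes.
for ch in "y✓ら१一¹①_":
    print(ch, unicodedata.category(ch), is_word_char(ch))
```

With this rule ✓ (a symbol) and ¹ and ① (non-decimal numbers) are not word characters, while ら and 一 count as letters and १ as a decimal digit.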
I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:
for <y ✓ Ⅲ> {
say $_.uniprops;
say m/<|w>/;
}
The second line of the loop tests each character against the word boundary anchor; only the first character, which can be part of an actual word, matches that anchor. The first line of the loop prints the Unicode properties: the first character is a letter (Ll), the other two are not. You can use any Ll character as part of a word, and in your grammar, but only characters with that Unicode property can actually form words.
grammar G {
proto rule TOP { * }
rule TOP:sym<y> { «<.sym>» }
rule TOP:sym<ら> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('ら'); # This is a hiragana letter, so it works.

Regular Expression for alphanumeric and some special characters not adjacent

I would like to have a regular expression to make an Oracle SQL REGEXP_LIKE query that checks
if a string starts with one alphanumeric character
if the string ends with one alphanumeric character
if the "body" of the string contains only alphanumeric characters OR these authorized characters: hyphen (dash), dot, apostrophe, space
if the authorised characters are NOT adjacent (to avoid something like "he--'''l..'-lo")
I started with this:
^[a-zA-Z0-9]+(a-zA-Z0-9\-\.'|([^\-\.'])\1)*[a-zA-Z0-9]$
I used backslash to escape assuming that dot and hyphen are metacharacters
I think this is what you want:
^[a-zA-Z0-9]+([-.' ][a-zA-Z0-9]|[a-zA-Z0-9])*[a-zA-Z0-9]?$
It looks for
at least 1 alphanumeric (alnum),
followed by
either an authorized character followed by an alphanumeric or just an alphanumeric, repeated any number of times (including 0).
optionally followed by
an alnum
This meets your specification. I'm not sure whether "starts with one alnum and ends with one alnum" means that there must be at least 2 alnums, or whether they can be the same character. If there must be at least 2 of them, remove the last ? (which makes the last alnum optional).
Regards
Assuming you meant "authorised characters are NOT adjacent to each other",
try something along these lines:
^[a-zA-Z0-9]+([a-zA-Z0-9]+[\-\.' ]?)*[a-zA-Z0-9]$
so that the repeating middle part always has at least one alphanumeric character followed by zero or one special character.
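To try the four requirements quickly outside the database, here is a small sketch in Python. It uses a compact variant of the idea in the answers (runs of alphanumerics separated by single authorized characters), not the exact expressions posted above, and Python's regex dialect is not identical to Oracle's, so treat it as a way to explore test strings rather than a drop-in REGEXP_LIKE pattern. The name is_valid is hypothetical.

```python
import re

# One or more alphanumerics, then any number of groups made of exactly one
# authorized character (hyphen, dot, apostrophe, space) followed by
# one or more alphanumerics.
PATTERN = re.compile(r"[a-zA-Z0-9]+(?:[-.' ][a-zA-Z0-9]+)*")


def is_valid(text: str) -> bool:
    """Return True if text starts and ends with an alphanumeric and has no adjacent specials."""
    return PATTERN.fullmatch(text) is not None


for sample in ["hello", "he-llo", "O'Neil Jr", "he--'''l..'-lo", "-hello", "hello-"]:
    print(repr(sample), is_valid(sample))
```

Note that this variant accepts a single alphanumeric character, i.e. it treats the first and last character as possibly the same one.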

Equivalence Partitioning on Email Field

Does anyone know how to derive test cases by using equivalence partitioning on email address field validation?
Test cases
1) Email Length
The format of an email address is local-part@domain, where the local-part may be up to 64 characters long and the domain name may have a maximum of 255 characters. However, the 256-character maximum length of a forward or reverse path restricts the entire email address to no more than 254 characters.
So, divide the test cases into two scenarios:
i) email id between 0 to 254 characters
ii) email id greater than 254 characters
2) Characters and Numbers
Email accepts uppercase and lowercase English letters (a–z, A–Z) and digits 0 to 9.
So, check email addresses with lower-case and upper-case letters and numbers. Check whether the login ID accepts a user name starting with a capital letter, a number or special characters.
e.g. niceandsimple@example.com, niceand122simple123@example.com
3) Special Characters
Characters !#$%&'*+-/=?^_`{|}~ are accepted. So, write two scenarios.
i) email id with characters !#$%&'*+-/=?^_`{|}~ should be accepted
ii) email id containing characters other than !#$%&'*+-/=?^_`{|}~ should not be accepted
e.g.
---> !#$%&'*+-/=?^_`{}|~@example.org
---> " "@example.org
4) Special Characters with restrictions
Special characters are allowed with restrictions. They are:
Space and "(),:;<>@[\]
The restrictions for special characters are that they must only be used when contained between quotation marks, and that 2 of them (the backslash \ and quotation mark " (ASCII: 92, 34)) must also be preceded by a backslash \ (e.g. "\\\"").
Two scenarios
i) characters "(),:;<>@[\] within double quotes
ii) characters "(),:;<>@[\] without double quotes
e.g.
----> "()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~.a"@example.org
5) Email with Dots (.)
i) email id with single dots should be accepted
a.little.lengthy.but.fine@dept.example.com
ii) email with multiple consecutive dots should not be accepted
a.little.....fine@dept.example.com
iii) Leading dot in address is not allowed
.abc123@gmail.com
iv) Trailing dot in address is not allowed
abc123.@gmail.com
v) Multiple consecutive dots in the domain portion are invalid
abc123@gmail..com
6) domain name
i) same domain name ----> check that the mail can use the same name as the domain, i.e. gmail@gmail.com
ii) Domain is valid IP address
iii) Square bracket around IP address is considered valid
iv) Dash in domain name is valid
v) Missing @ sign and domain
vi) Garbage ( #@%^%#$@#$@#.com )
vii) Two @ signs
viii) Leading dash in front of domain is invalid
ix) .web is not a valid top level domain
x) Invalid IP format
7) Text in email
i) Text following the email is not allowed
email@domain.com (Joe Smith)
ii) Text before the email is allowed
(Joe Smith)email@domain.com
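As an illustration of how a few of these partitions turn into executable checks, here is a sketch in Python covering only the length rules from 1) and the dot rules from 5), with addresses written using the usual @ separator. The function check_basic_rules is a hypothetical, deliberately simplified stand-in for the validator under test: it does not handle quoted local parts, IP-literal domains or comments.

```python
def check_basic_rules(address: str) -> bool:
    """Simplified check: overall length, local-part length, and dot placement only."""
    if len(address) > 254 or address.count("@") != 1:
        return False
    local, domain = address.split("@")
    if not local or not domain or len(local) > 64:
        return False
    for part in (local, domain):
        # No leading dot, trailing dot, or consecutive dots in either part.
        if part.startswith(".") or part.endswith(".") or ".." in part:
            return False
    return True


# One representative per equivalence class, paired with the expected verdict.
cases = [
    ("a.little.lengthy.but.fine@dept.example.com", True),
    ("a.little.....fine@dept.example.com", False),
    (".abc123@gmail.com", False),
    ("abc123.@gmail.com", False),
    ("abc123@gmail..com", False),
    ("x" * 65 + "@example.com", False),
]
for address, expected in cases:
    assert check_basic_rules(address) == expected
```

Each tuple is one representative from a valid or invalid class; more representatives can be added per class without changing the structure.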
Take each input condition described in the specification and derive at least two equivalence classes for it. One class represents the set of cases which satisfy the condition (the valid class) and one represents cases which do not (the invalid class), for example:
–"Number of email" field: 0 < n < 21
•Class 1: any value less than 1 (invalid input)
•Class 2: 1-20 (valid input)
•Class 3: any value more than 20 (invalid input)
•Select at least 1 value from each class as test data for testing the field “Number of email”
–The values below will be used to test validation and verification of the “Number of email” field
– -5, 5, 25
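The three classes for the "Number of email" field can be written down as a small sketch. This assumes Python and a hypothetical function name classify_email_count; the limits 1 and 20 are taken from the classes listed above.

```python
def classify_email_count(n: int) -> str:
    """Map a value of the 'Number of email' field to its equivalence class."""
    if n < 1:
        return "class 1 (invalid, below range)"
    if n <= 20:
        return "class 2 (valid)"
    return "class 3 (invalid, above range)"


# One representative value per class, as chosen in the answer.
for value in (-5, 5, 25):
    print(value, classify_email_count(value))
```

Any other value from the same class (for example 0 instead of -5) is expected to behave the same way, which is the assumption equivalence partitioning relies on.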

Will Alphanumeric contain _ and space?

If a field is defined as alphanumeric, are spaces and underscores (_) allowed?
I hope they are not.
Can anyone confirm?
Alphanumeric characters by definition only comprise the letters A to Z and the digits 0 to 9. Spaces and underscores are usually considered punctuation characters, so no, they shouldn't be allowed.
If a field specifically says "alphanumeric characters, space and underscore", then they're included. Otherwise in most cases you generally assume they're not.
I came here wondering why \w in regex includes the underscore. I had assumed \w meant alphanumeric, [A-Za-z0-9], but that is not the case.
In most regex engines \w is shorthand for [A-Za-z0-9_].
However, in Python's regex engine, besides the underscore, \w also matches letters with diacritics, letters from other scripts, etc., such as the German letter "ö" in "schön".
So now I've learned to use the long form [A-Za-z0-9] when I want to match strictly alphanumeric characters.
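A short sketch of that difference using Python's standard re module. The three helper names are hypothetical; re.ASCII is the standard flag that restricts \w to [A-Za-z0-9_].

```python
import re


def matches_word(text: str) -> bool:
    """Default \\w on str patterns: Unicode-aware, includes the underscore."""
    return re.fullmatch(r"\w+", text) is not None


def matches_ascii_word(text: str) -> bool:
    """\\w restricted to [A-Za-z0-9_] via the re.ASCII flag."""
    return re.fullmatch(r"\w+", text, flags=re.ASCII) is not None


def matches_alnum(text: str) -> bool:
    """Strictly ASCII letters and digits, no underscore."""
    return re.fullmatch(r"[A-Za-z0-9]+", text) is not None


# "sch\u00f6n" is "schön" written with an explicit escape for the precomposed ö.
for sample in ["abc123", "a_b", "sch\u00f6n"]:
    print(sample, matches_word(sample), matches_ascii_word(sample), matches_alnum(sample))
```

Only the explicit character class rejects both the underscore and the non-ASCII letter.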
Alphanumeric characters are A to Z, a to z and 0 to 9

Why does an LDAP search return all results when using %?

When I search an LDAP server using the following filter
(cn=%*)
it returns all results under the base DN. Does LDAP treat '%' specially? I haven't found any description of it.
What is your directory server?
Are you sure that '%' is not replaced by your command-line interpreter or your compiler?
According to RFC 2254, % is not a special character:
If a value should contain any of the following characters
Character ASCII value
---------------------------
* 0x2a
( 0x28
) 0x29
\ 0x5c
NUL 0x00
the character must be encoded as the backslash '\' character (ASCII
0x5c) followed by the two hexadecimal digits representing the ASCII
value of the encoded character. The case of the two hexadecimal
digits is not significant.
This simple escaping mechanism eliminates filter-parsing ambiguities
and allows any filter that can be represented in LDAP to be
represented as a NUL-terminated string. Other characters besides the
ones listed above may be escaped using this mechanism, for example,
non-printing characters.
For example, the filter checking whether the "cn" attribute contained
a value with the character "*" anywhere in it would be represented as
"(cn=*\2a*)".
Note that although both the substring and present productions in the
grammar above can produce the "attr=*" construct, this construct is
used only to denote a presence filter.
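The escaping rule quoted above can be sketched as a small helper. This assumes Python and a hypothetical function name escape_filter_value; it only covers the five characters in the RFC 2254 table and is not a replacement for the escaping routine of whichever LDAP library you use.

```python
# Characters RFC 2254 requires to be escaped, mapped to backslash + two hex digits.
_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Escape a literal value so it can be embedded in an LDAP search filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


# A filter matching any cn that contains a literal asterisk.
print("(cn=*" + escape_filter_value("*") + "*)")

# '%' is not in the table, so it passes through unchanged.
print("(cn=" + escape_filter_value("%") + "*)")
```

Since '%' is left untouched by this rule, a server that returns everything for (cn=%*) is more likely seeing a different filter than the one typed, as the answer suggests.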