Use of Preferred-Value in Language Tag Records of Type "Variant" (RFC 5646) - iana

In RFC 5646, Tags for Identifying Languages, § 3.1.2 Record and Field Definitions, the following explanation is given for the semantics of the Preferred-Value field when appearing in a record whose Type is "variant":
For fields of type 'script', 'region', or 'variant', 'Preferred-Value' contains the subtag of the same type that is preferred for forming the language tag.
My initial interpretation of this was that if the Type of the record is variant, then the value of a Preferred-Value is also a variant — "a subtag of the same type". In other words, I read "of the same type" as "of the same type as the record itself".
However, there are records in the current version of the language tag registry (2018-04-23 at the time I write this — it doesn’t seem there are versioned links) which do not match this interpretation. For example:
Type: variant
Subtag: arevela
Description: Eastern Armenian
Added: 2006-09-18
Deprecated: 2018-03-24
Preferred-Value: hy
Prefix: hy
The Preferred-Value here is not a variant — a variant must be either 5-8 alphanumeric ASCII characters or 1 digit plus three alphanumeric characters. In this case in particular, it’s clear that it’s referring to the Armenian language (the first segment of a language tag) rather than to a variant.
However, when looking through other entries, most Preferred-Value values do conform to my initial interpretation. For example:
Type: region
Subtag: YD
Description: Democratic Yemen
Added: 2005-10-16
Deprecated: 1990-08-14
Preferred-Value: YE
Here, Preferred-Value does appear to be another region code. The rules for script/region/variant types are given together — the Preferred-Value is the "same type" for all of these. If for a region record a "same type Preferred-Value" means "also a region", how is it that for a variant record Preferred-Value may point at a different type? More importantly, if this is possible, is the only way to determine the type of the Preferred-Value field to test its grammar?

You are right. That arevela entry does not conform to the registry specification. It seems as though they noticed this; the registry as of 2021-02-23 has a new entry for arevela without that Preferred-Value. It instead has the comment "Preferred tag is hy".
Your comments also seems to be correct (and my initial interpretation of the section in the spec was wrong). They've changed those entries too, so all extlangs now have a Preferred-Value that is a primary language subtag identical to the extlang.

Related

Automatically detect security identifier columns using Visions

I'm interested in using the Visions library to automate the process of identifying certain types of security (stock) identifiers. The documentation mentions that it could be used in such a way for ISBN codes but I'm looking for a more concrete example of how to do it. I think the process would be pretty much identical for the fields I'm thinking of as they all have check digits (ISIN, SEDOL, CUSIP).
My general idea is that I would create custom types for the different identifier types and could use those types to
Take a dataframe where the types are unknown and identify columns matching the types (even if it's not a 100% match)
Validate the types on a dataframe where the intended type is known
Great question and use-case! Unfortunately, the documentation on making new types probably needs a little love right now as there were API breaking changes with the 0.7.0 release. Both the previous link and this post from August, 2020 should cover the conceptual idea of type creation in greater detail. If any of those examples break then mea culpa and our apologies, we switched to a dispatch based implementation to support different backends (pandas, numpy, dask, spark, etc...) for each type. You shouldn't have to worry about that for now but if you're interested you can find the default type definitions here with their backends here.
Building an ISBN Type
We need to make two basic decisions when defining a type:
What defines the type
What other types are our new type related to?
For the ISBN use-case O'Reilly provides a validation regex to match ISBN-10 and ISBN-13 codes. So,
What defines a type?
We want every element in the sequence to be a string which matches a corresponding ISBN-10 or ISBN-13 regex
What other types are our new type related to?
Since ISBN's are themselves strings we can use the default String type provided by visions.
Type Definition
from typing import Sequence
import pandas as pd
from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType
isbn_regex = "^(?:ISBN(?:-1[03])?:?●)?(?=[0-9X]{10}$|(?=(?:[0-9]+[-●]){3})[-●0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[-●]){4})[-●0-9]{17}$)(?:97[89][-●]?)?[0-9]{1,5}[-●]?[0-9]+[-●]?[0-9]+[-●]?[0-9X]$"
class ISBN(VisionsBaseType):
#staticmethod
def get_relations() -> Sequence[TypeRelation]:
relations = [
IdentityRelation(String),
]
return relations
#staticmethod
def contains_op(series: pd.Series, state: dict) -> bool:
return series.str.contains(isbn_regex).all()
Looking at this closely there are three things to take note of.
The new type inherits from VisionsBaseType
We had to define a get_relations method which is how we relate a new type to others we might want to use in a typeset. In this case, I've used an IdentityRelation to String which means ISBNs are subsets of String. We can also use InferenceRelation's when we want to support relations which change the underlying data (say converting the string '4.2' to the float 4.2).
A contains_op this is our definition of the type. In this case, we are applying a regex string to every element in the input and verifying it matched the regex provided by O'Reilly.
Extensions
In theory ISBNs can be encoded in what looks like a 10 or 13 digit integer as well - to work with those you might want to create an InferenceRelation between Integer and ISBN. A simple implementation would involve coercing Integers to string and applying the above regex.

ABNF rule `zero = ["0"] "0"` matches `00` but not `0`

I have the following ABNF grammar:
zero = ["0"] "0"
I would expect this to match the strings 0 and 00, but it only seems to match 00? Why?
repl-it demo: https://repl.it/#DanStevens/abnf-rule-zero-0-0-matches-00-but-not-0
Good question.
ABNF ("Augmented Backus Naur Form"9 is defined by RFC 5234, which is the current version of a document intended to clarify a notation used (with variations) by many RFCs.
Unfortunately, while RFC 5234 exhaustively describes the syntax of ABNF, it does not provide much in the way of a clear statement of semantics. In particular, it does not specify whether ABNF alternation is unordered (as it is in the formal language definitions of BNF) or ordered (as it is in "PEG" -- Parsing Expression Grammar -- notation). Note that optionality/repetition are just types of alternation, so if you choose one convention for alternation, you'll most likely choose it for optionality and repetition as well.
The difference is important in cases like this. If alternation is ordered, then the parser will not backup to try a different alternative after some alternative succeeds. In terms of optionality, this means that if an optional element is present in the stream, the parser will never reconsider the decision to accept the optional element, even if some subsequent element cannot be matched. If you take that view, then alternation does not distribute over concatenation. ["0"]"0" is precisely ("0"/"")"0", which is different from "00"/"0". The latter expression would match a single 0 because the second alternative would be tried after the first one failed. The former expression, which you use, will not.
I do not believe that the authors of RFC 5234 took this view, although it would have been a lot more helpful had they made that decision explicit in the document. My only real evidence to support my belief is that the ABNF included in RFC 5234 to describe ABNF itself would fail if repetition was considered ordered. In particular, the rule for repetitions:
repetition = [repeat] element
repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)
cannot match 7*"0", since the 7 will be matched by the first alternative of repeat, which will be accepted as satisfying the optional [repeat] in repetition, and element will subsequently fail.
In fact, this example (or one similar to it) was reported to the IETF as an erratum in RFC 5234, and the erratum was rejected as unnecessary, because the verifier believed that the correct parse should be produced, thus providing evidence that the official view is that ABNF is not a variant of PEG. Apparently, this view is not shared by the author of the APG parser generator (who also does not appear to document their interpretation.) The suggested erratum chose roughly the same solution as you came up with:
repeat = *DIGIT ["*" *DIGIT]
although that's not strictly speaking the same; the original repeat cannot match the empty string, but the replacement one can. (Since the only use of repeat in the grammar is optional, this doesn't make any practical difference.)
(Disclosure note: I am not a fan of PEG. So it's possible the above answer is not free of bias.)

RRULE (rfc 5545) until and count

I'm having trouble understanding the rfc5545 concerning the the until and count. From what I understand, UNTIL and COUNT cannot be in the same recur rule according to this part of the RFC:
Value Name: RECUR
Purpose: This value type is used to identify properties that
contain a recurrence rule specification.
Formal Definition: The value type is defined by the following
notation:
recur = "FREQ"=freq *(
; either UNTIL or COUNT may appear in a 'recur',
; but UNTIL and COUNT MUST NOT occur in the same 'recur'
...
Further in the rfc, this is stated:
If multiple BYxxx rule parts are specified, then after evaluating the
specified FREQ and INTERVAL rule parts, the BYxxx rule parts are
applied to the current set of evaluated occurrences in the following
order: BYMONTH, BYWEEKNO, BYYEARDAY, BYMONTHDAY, BYDAY, BYHOUR,
BYMINUTE, BYSECOND and BYSETPOS; then COUNT and UNTIL are evaluated.
This last paragraph seems to imply that the COUNT and UNTIL can be in the same RRULE.
When I check libraries that implement rrule generator and parsing, there is no validation that make sure that the the COUNT and UNTIL are not in the same recur.
What is the general implementation that everyone usually do with this ? Should we ignore this validation and simply use the UNTIL parameter when there is both COUNT and UNTIL (or vice versa) ? What does the RFC mean exactly concerning the COUNT and UNTIL parameter ?
I don't think you can derive from the second paragraph that having both is valid.
There is only one definition of RECUR and the cardinality of its various components: the ABNF definition. This is where you should go to check the validity of your property.
The second paragraph simply describes the algorithm to use for doing RRULE expansion.

How does [0] and [3] wơrk in ASN1?

I'm decoding ASN1 (as used in X.509 for HTTPS certificates). I'm doing pretty well, but there is a thing that I just cannot find and understandable documentation for.
In this JS ASN1 parser you see a [0] and a [3] under a SEQUENCE element, the first looking like this in data: A0 03 02 01 02 .... I want to know what this means and how to decode it.
Another example is Anatomy of an X.509 v3 Certificate, there is a [0] right after the first two SEQUENCE elements.
What I don't understand is how A0 fits with the scheme where the first 2 bits of the tag byte are a class, the next a primitive/constructed bit and the remaining 5 are supposed to be the tag type. A0 is 10100000 which means that the tag type value would be zero.
It sounds like you need an introduction to ASN.1 tagging. There are two angles to approach this from. X.690 defines BER/CER/DER encoding rules. As such, it answers the question of how tags are encoded. X.680 defines ASN.1 itself. As such, it defines the syntax and rules for tagging. Both specifications can be found on the ITU-T website. I'll give you a quick overview.
Tags are used in BER/DER/CER to identify types. They are especially useful for distinguishing the components of a SEQUENCE and the alternatives of a CHOICE.
A tag combines a tag class and a tag number. The tag classes are UNIVERSAL, APPLICATION, PRIVATE, and CONTEXT-SPECIFIC. The UNIVERSAL class is basically used for the built-in types. APPLICATION is typically used for user-defined types. CONTEXT-SPECIFIC is typically used for the components inside constructed types (SEQUENCE, CHOICE, SEQUENCE OF). Syntactically, when tags are specified in an ASN.1 module, they are written inside brackets: [ tag_class tag_number ]; for CONTEXT-SPECIFIC, the tag_class is omitted. Thus, [APPLICATION 10] or [0].
While every ASN.1 type has an associated tag, syntactically, there is also the "TaggedType", which is used by an ASN.1 author to specify the tag to encode a type with. Basically, a TaggedType puts a tag prefix ahead of a type. For example:
MyType ::= SEQUENCE {
field_with_tagged_type [0] UTF8String
}
The tag in a TaggedType is either explicit or implicit. If explicit, this means that I want the original tag to be explicitly encoded. If implicit, this means I am happy to have only the tag that I specified be encoded. In the explicit case, the BER encoding results in a nested TLV (tag-length-value): the outer tag ([0] in the example above), the length, and then another TLV as the value. In the example, this inner TLV would have a tag of [UNIVERSAL 12] for the UTF8String.
Whether the tag is explicit or implicit depends upon how you write the tag and the tagging environment. For example:
MyType2 ::= SEQUENCE {
field_with_explicit_tag [0] EXPLICIT UTF8String OPTIONAL,
field_with_implicit_tag [1] IMPLICIT UTF8String OPTIONAL,
field_with_tag [2] UTF8String OPTIONAL
}
If you specify neither IMPLICIT nor EXPLICIT, there are some rules that define whether the tag is explicit or implicit (see X.680 31). These rules take into consideration the tagging environment defined for the ASN.1 module. The ASN.1 module may specify the tagging environment as IMPLICIT TAGS, EXPLICIT TAGS, or AUTOMATIC TAGS. Roughly speaking, if you don't specify IMPLICIT or EXPLICIT for a tag, the tag will be explicit if the tagging environment is EXPLICIT and implicit if the tagging environment is IMPLICIT or AUTOMATIC. An automatic tagging environment is basically the same as an IMPLICIT tagging environment, except that unique tags are automatically assigned for members of SEQUENCE and CHOICE types.
Note that in the above example, the three components of MyType2 are all optional. In BER/CER/DER, a decoder will know what component is present based on the encoded tag (which obviously better be unique).
ASN.1 BER and DER use ASN.1 TAGS to unambiguously identify certain components in an encoded stream. There are 4 classes of ASN.1 tags: UNIVERSAL, APPLICATION, PRIVATE, and context-specific. The [0] is a context-specific tag since there is no tag class keword in front of it. UNIVERSAL is reserved for built-in types in ASN.1. Most often you see context specific tags to eliminate potential ambiguity in a SEQUENCE which contains OPTIONAL elements.
If you know you are receiving two items that are not optional, one after the other, you know which is which even if their tags are the same. However, if the first one is optional, the two must have different tags, or you would not be able to tell which one you had received if only one was present in the encoding.
Most often today, ASN.1 specification use "AUTOMATIC TAGS" so that you don't have to worry about this kind of disambiguation in messages since components of SEQUENCE, SET and CHOICE will automatically get context specific tags starting with [0], [1], [2], etc. for each component.
You can find more information on ASN.1 tags at http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html where two free downloadable books are available.
Another excellent resource is http://asn1-playground.oss.com where you can try variations of ASN.1 specifications with different tags in an online compiler and encoder/decoder. There you can see the effects of tag changes on encodings.
I finally worked through this and thought that I would provide some insight for anyone still trying to understand this. In my example, as in the one above, I was using an X.509 certificate in DER format. I came across the "A0 03 02 01 02" sequence and could not figure out how that translated to a version number of 2. So if you are having the same problem, here is how that works.
The A0 tells you it is a "Context-Specific" field, a "Constructed" tag, and has the type value of 0x00. Immediately, the context-specific tells you not to use the normal type fields for DER/BER. Instead, given this is a X.509 certificate, the type value is labeled in the RFC 5280, p 116. There you will see four fields with markers on them of [0], [1], [2], and [3], standing for "version", "issuerUniqueID", "subjectUniqueID", and "extension", respectively. So in this case, a value of A0 tells you that this is one of the X.509 context-specific fields, specifically the "version" type. That takes care of the "A0" value.
The "03" value is just your length, as you might expect.
Since this was identified as "Constructed", the data should represent a normal DER/BER object. The "02 01 02" is the actual version number you are looking for, expressed as an Integer. "02" is the standard BER encoding of Integer, "01" is your length, and "02" is your value, or in this case, your version number.
So given that X.509 defines 4 context-specific types, you should expect to see "A0", "A1", "A2", and "A3" anywhere in the certificate. Hopefully the information provided above will now make more sense and help you better understand what those marker represent.
[0] is a context-specific tagged type, meaning that to figure out what meaning it gives to the fields (if the "Constructed" flag is set) or data value (if "Constructed" flag is not set) it wraps; you have to know in what context it appears in.
In addition, you also need to know what kind of object the sender and receiver are exchanging in the DER stream, ie. the "ASN.1 module".
Let's say they're exchanging a Certificate Signing Request, and [0] appears as the 4th field inside a SEQUENCE inside the root SEQUENCE:
SEQUENCE
SEQUENCE
INTEGER 0
SEQUENCE { ... }
SEQUENCE { ... }
[0] { ... }
}
}
Then according to RFC2968, which defines the DER contents for Certificate Signing Request, Appendix A, which defines the ASN.1 Module, the meaning of that particular field is sneakily defined as "Attributes" and "Should have the Constructed flag set":
attributes [0] Attributes{{ CRIAttributes }}
You can also go the other way and see that "attributes" must be the 4th field inside the first sequence inside the root sequence and tagges as [0] by looking at the root sequence definition (section 4: "the top-level type CertificationRequest"), finding the CertificationRequestInfo placement inside that, and finding where the "attributes" item is located inside the CertificationRequestInfo, and finally seeing how it is tagged.

RFC 6570 URL Templates : the role of / vs. other prefixes

I recently read some of : https://www.rfc-editor.org/rfc/rfc6570#section-1
And I found the following URL template examples :
GIVEN :
var="value";
x=1024;
path=/foo/bar;
{/var,x}/here /value/1024/here
{#path,x}/here #/foo/bar,1024/here
These seem contradictory.
In the first one, it appears that the / replaces ,
In the 2nd one, it appears that the , is kept .
Thus, I'm wondering wether there are inconsistencies in this particular RFC. I'm new to these RFC's so maybe I don't fully understand the culture behind how these develop.
There's no contradiction in those two examples. They illustrate the point that the rules for expanding an expression whose first character is / are different from the rules for expanding an expression whose first character is #. These alternative expansion rules are pretty much the entire point of having a variety of different magic leading characters -- which are called operators in the RFC.
The expression with the leading / is expanded according to a rule that says "each variable in the expression is replaced by its value, preceded by a / character". (I'm paraphrasing the real rule, which is described in section 3.2.6 of that RFC.) The expression with the leading # is expanded according to a rule that says "each variable in the expression is replaced by its value, with the first variable preceded by a # and subsequent variables preceded by a ,. (Again paraphrased, see section 3.2.4 for the real rule.)