Automatically detect security identifier columns using Visions - pandas

I'm interested in using the Visions library to automate the process of identifying certain types of security (stock) identifiers. The documentation mentions that it could be used in such a way for ISBN codes but I'm looking for a more concrete example of how to do it. I think the process would be pretty much identical for the fields I'm thinking of as they all have check digits (ISIN, SEDOL, CUSIP).
My general idea is that I would create custom types for the different identifier types and could use those types to:
Take a dataframe where the types are unknown and identify columns matching the types (even if it's not a 100% match)
Validate the types on a dataframe where the intended type is known

Great question and use-case! Unfortunately, the documentation on making new types probably needs a little love right now as there were API breaking changes with the 0.7.0 release. Both the previous link and this post from August, 2020 should cover the conceptual idea of type creation in greater detail. If any of those examples break then mea culpa and our apologies, we switched to a dispatch based implementation to support different backends (pandas, numpy, dask, spark, etc...) for each type. You shouldn't have to worry about that for now but if you're interested you can find the default type definitions here with their backends here.
Building an ISBN Type
We need to make two basic decisions when defining a type:
What defines the type
What other types are our new type related to?
For the ISBN use-case O'Reilly provides a validation regex to match ISBN-10 and ISBN-13 codes. So,
What defines a type?
We want every element in the sequence to be a string which matches a corresponding ISBN-10 or ISBN-13 regex
What other types are our new type related to?
Since ISBNs are themselves strings, we can use the default String type provided by visions.
Type Definition
from typing import Sequence

import pandas as pd
from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType

# O'Reilly's ISBN-10/ISBN-13 pattern. The book prints spaces as "●"; they are literal spaces here.
isbn_regex = r"^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"


class ISBN(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        # ISBN is a subset of String, so no data conversion is needed.
        relations = [
            IdentityRelation(String),
        ]
        return relations

    @staticmethod
    def contains_op(series: pd.Series, state: dict) -> bool:
        # Every element must match the ISBN pattern.
        return series.str.contains(isbn_regex).all()
Looking at this closely, there are three things to take note of:
The new type inherits from VisionsBaseType.
We had to define a get_relations method, which is how we relate a new type to others we might want to use in a typeset. In this case, I've used an IdentityRelation to String, which means ISBNs are subsets of String. We can also use InferenceRelations when we want to support relations which change the underlying data (say, converting the string '4.2' to the float 4.2).
We had to define a contains_op, which is our definition of the type. In this case, we are applying a regex to every element in the input and verifying that it matches the pattern provided by O'Reilly.
Extensions
In theory ISBNs can be encoded in what looks like a 10 or 13 digit integer as well - to work with those you might want to create an InferenceRelation between Integer and ISBN. A simple implementation would involve coercing Integers to string and applying the above regex.
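The regex only checks the shape of a code, while the question points out that these identifiers carry check digits. As a rough sketch (not part of visions; the helper names `isbn10_ok`, `isbn13_ok` and `is_isbn` are made up for illustration), the check-digit arithmetic for ISBNs could look like the following. It also accepts integers by coercing them to strings, with the caveat that a leading zero lost in integer form is not restored.

```python
import re


def _clean(code: str) -> str:
    """Drop an optional 'ISBN' prefix plus hyphens and spaces."""
    code = re.sub(r"^ISBN(?:-1[03])?:?\s*", "", code.strip(), flags=re.IGNORECASE)
    return re.sub(r"[-\s]", "", code).upper()


def isbn10_ok(code: str) -> bool:
    """ISBN-10: the sum with weights 10..1 must be divisible by 11 ('X' counts as 10)."""
    s = _clean(code)
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    values = [10 if ch == "X" else int(ch) for ch in s]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def isbn13_ok(code: str) -> bool:
    """ISBN-13: the sum with alternating weights 1 and 3 must be divisible by 10."""
    s = _clean(code)
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    return sum((3 if i % 2 else 1) * int(ch) for i, ch in enumerate(s)) % 10 == 0


def is_isbn(value) -> bool:
    """Accept strings or integers; integers are coerced to str first."""
    text = str(value)
    return isbn10_ok(text) or isbn13_ok(text)
```

A predicate like this could, in principle, be applied per element inside contains_op (for example with series.map(is_isbn).all()). ISIN, SEDOL and CUSIP each use a different check-digit scheme, so they would need their own functions.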

Confused about Tensorflow Algorithm function

Colab notebook
Under the section on Feature Columns, there is this specific line of code
feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature_name].unique()
I'm struggling to understand what this is doing. I don't really know what to search for either, as I'm still quite new to programming. Why is there a need for this line? I understand that it outputs all unique values of the specified feature_name, but I don't get how it's linked to the next line.
When you don't understand a function just google the module name (TensorFlow) and the function name. I found the documentation for tf.feature_column.categorical_column_with_vocabulary_list described here. To quote the documentation:
Use this when your inputs are in string or integer format, and you have an in-memory vocabulary mapping each value to an integer ID. By default, out-of-vocabulary values are ignored.
What this section of code is doing is going through each column and mapping each unique string value to a unique integer (its location in the vocabulary list). Transforming your column using this type of mapping is common for categorical data. The reason that unique is needed is because tf.feature_column.categorical_column_with_vocabulary_list needs a unique list as an argument before it can work its magic.
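To make the mapping concrete, here is a small sketch in plain Python (no TensorFlow or pandas) of what "mapping each unique value to its position in the vocabulary list" means. The column values are invented, and the -1 used for an unseen value is only a placeholder chosen for this illustration, not TensorFlow's behaviour.

```python
# Toy stand-in for dftrain[feature_name]; the values are made up.
column = ["male", "female", "female", "male", "unknown"]

# Similar in spirit to .unique(): distinct values in first-seen order.
vocabulary = list(dict.fromkeys(column))

# Each distinct string gets an integer ID: its position in the vocabulary.
value_to_id = {value: index for index, value in enumerate(vocabulary)}

# "other" is not in the vocabulary, so it gets the placeholder -1 here.
encoded = [value_to_id.get(value, -1) for value in column + ["other"]]

print(vocabulary)  # ['male', 'female', 'unknown']
print(encoded)     # [0, 1, 1, 0, 2, -1]
```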
In the future please put all necessary code in the question. It should not be required to visit another link to answer your question.

Use of math in ALFA

How to get a rule like that working:
rule adminCanViewAllExams {
    condition (integerOneAndOnly(my.company.attributes.subject.rights) & 0x00000040) == 0
    permit
}
Syntax highlighter complains it doesn't know those items:
& (This is a binary math operation)
0x00000040 (this is the hexadecimal representation of an integer)
EDIT
(adding OP's comment inside the question)
I want to keep as much as possible in my current application. Meaning, I don't want to change a lot in my database model. I just want to implement the PEP and PDP part anew. So, currently the rights of the user are stored in a Long. Each bit in the number represents a right. To get the right we do a binary &-operation which masks the other bits in the Long. We might redesign this part, but it's still good to know how far the support for mathematical operations goes.
XACML does not support bitwise logic. It can do boolean logic (AND and OR) but that's about it.
To achieve what you are looking for, you could use a Policy Information Point which would take in my.company.attributes.subject.rights and 0x00000040. It would return an attribute called allowed.
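As a rough illustration of the logic such a Policy Information Point would hold, here is a sketch in Python. The function name `resolve_allowed` is hypothetical and this is not a real PIP interface; it only shows the bit test moving out of the policy and into code.

```python
VIEW_ALL_EXAMS_MASK = 0x00000040  # bit taken from the rule in the question


def resolve_allowed(rights: int, mask: int = VIEW_ALL_EXAMS_MASK) -> bool:
    """Hypothetical PIP logic mirroring the rule as written: (rights & mask) == 0."""
    return (rights & mask) == 0


print(resolve_allowed(0x3F))  # True: bit 0x40 is clear
print(resolve_allowed(0x40))  # False: bit 0x40 is set
```

The policy would then only compare the boolean allowed attribute. If the intent is to permit when the bit is set, the comparison would be != 0 instead.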
Alternatively, you can extend XACML (and ALFA) to add missing datatypes and functions. But I would recommend going for human-readable policies.

How does [0] and [3] work in ASN1?

I'm decoding ASN1 (as used in X.509 for HTTPS certificates). I'm doing pretty well, but there is one thing that I just cannot find any understandable documentation for.
In this JS ASN1 parser you see a [0] and a [3] under a SEQUENCE element, the first looking like this in data: A0 03 02 01 02 .... I want to know what this means and how to decode it.
Another example is Anatomy of an X.509 v3 Certificate, there is a [0] right after the first two SEQUENCE elements.
What I don't understand is how A0 fits with the scheme where the first 2 bits of the tag byte are a class, the next a primitive/constructed bit and the remaining 5 are supposed to be the tag type. A0 is 10100000 which means that the tag type value would be zero.
It sounds like you need an introduction to ASN.1 tagging. There are two angles to approach this from. X.690 defines BER/CER/DER encoding rules. As such, it answers the question of how tags are encoded. X.680 defines ASN.1 itself. As such, it defines the syntax and rules for tagging. Both specifications can be found on the ITU-T website. I'll give you a quick overview.
Tags are used in BER/DER/CER to identify types. They are especially useful for distinguishing the components of a SEQUENCE and the alternatives of a CHOICE.
A tag combines a tag class and a tag number. The tag classes are UNIVERSAL, APPLICATION, PRIVATE, and CONTEXT-SPECIFIC. The UNIVERSAL class is basically used for the built-in types. APPLICATION is typically used for user-defined types. CONTEXT-SPECIFIC is typically used for the components inside constructed types (SEQUENCE, CHOICE, SEQUENCE OF). Syntactically, when tags are specified in an ASN.1 module, they are written inside brackets: [ tag_class tag_number ]; for CONTEXT-SPECIFIC, the tag_class is omitted. Thus, [APPLICATION 10] or [0].
While every ASN.1 type has an associated tag, syntactically, there is also the "TaggedType", which is used by an ASN.1 author to specify the tag to encode a type with. Basically, a TaggedType puts a tag prefix ahead of a type. For example:
MyType ::= SEQUENCE {
    field_with_tagged_type [0] UTF8String
}
The tag in a TaggedType is either explicit or implicit. If explicit, this means that I want the original tag to be explicitly encoded. If implicit, this means I am happy to have only the tag that I specified be encoded. In the explicit case, the BER encoding results in a nested TLV (tag-length-value): the outer tag ([0] in the example above), the length, and then another TLV as the value. In the example, this inner TLV would have a tag of [UNIVERSAL 12] for the UTF8String.
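To see the difference in bytes, here is a minimal sketch that hand-encodes the UTF8String "hi" under a [0] tag both ways. The `tlv` helper is made up for this illustration and only handles single-byte tags and short-form lengths; it is not a general BER encoder.

```python
def tlv(tag: int, value: bytes) -> bytes:
    """Encode one TLV with a single-byte tag and a short-form length (< 128)."""
    if len(value) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([tag, len(value)]) + value


UTF8STRING = 0x0C             # [UNIVERSAL 12], primitive
CONTEXT_0_PRIMITIVE = 0x80    # [0], context-specific, primitive
CONTEXT_0_CONSTRUCTED = 0xA0  # [0], context-specific, constructed

payload = "hi".encode("utf-8")

# Explicit: the [0] TLV wraps the complete UTF8String TLV.
explicit = tlv(CONTEXT_0_CONSTRUCTED, tlv(UTF8STRING, payload))
# Implicit: the [0] tag replaces the UTF8String tag.
implicit = tlv(CONTEXT_0_PRIMITIVE, payload)

print(explicit.hex(" "))  # a0 04 0c 02 68 69
print(implicit.hex(" "))  # 80 02 68 69
```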
Whether the tag is explicit or implicit depends upon how you write the tag and the tagging environment. For example:
MyType2 ::= SEQUENCE {
    field_with_explicit_tag [0] EXPLICIT UTF8String OPTIONAL,
    field_with_implicit_tag [1] IMPLICIT UTF8String OPTIONAL,
    field_with_tag [2] UTF8String OPTIONAL
}
If you specify neither IMPLICIT nor EXPLICIT, there are some rules that define whether the tag is explicit or implicit (see X.680 31). These rules take into consideration the tagging environment defined for the ASN.1 module. The ASN.1 module may specify the tagging environment as IMPLICIT TAGS, EXPLICIT TAGS, or AUTOMATIC TAGS. Roughly speaking, if you don't specify IMPLICIT or EXPLICIT for a tag, the tag will be explicit if the tagging environment is EXPLICIT and implicit if the tagging environment is IMPLICIT or AUTOMATIC. An automatic tagging environment is basically the same as an IMPLICIT tagging environment, except that unique tags are automatically assigned for members of SEQUENCE and CHOICE types.
Note that in the above example, the three components of MyType2 are all optional. In BER/CER/DER, a decoder will know what component is present based on the encoded tag (which obviously better be unique).
ASN.1 BER and DER use ASN.1 tags to unambiguously identify certain components in an encoded stream. There are 4 classes of ASN.1 tags: UNIVERSAL, APPLICATION, PRIVATE, and context-specific. The [0] is a context-specific tag since there is no tag class keyword in front of it. UNIVERSAL is reserved for built-in types in ASN.1. Most often you see context-specific tags used to eliminate potential ambiguity in a SEQUENCE which contains OPTIONAL elements.
If you know you are receiving two items that are not optional, one after the other, you know which is which even if their tags are the same. However, if the first one is optional, the two must have different tags, or you would not be able to tell which one you had received if only one was present in the encoding.
Most often today, ASN.1 specifications use "AUTOMATIC TAGS" so that you don't have to worry about this kind of disambiguation in messages, since components of SEQUENCE, SET and CHOICE will automatically get context-specific tags starting with [0], [1], [2], etc. for each component.
You can find more information on ASN.1 tags at http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html where two free downloadable books are available.
Another excellent resource is http://asn1-playground.oss.com where you can try variations of ASN.1 specifications with different tags in an online compiler and encoder/decoder. There you can see the effects of tag changes on encodings.
I finally worked through this and thought that I would provide some insight for anyone still trying to understand this. In my example, as in the one above, I was using an X.509 certificate in DER format. I came across the "A0 03 02 01 02" sequence and could not figure out how that translated to a version number of 2. So if you are having the same problem, here is how that works.
The A0 tells you it is a "Context-Specific" field, a "Constructed" tag, and has the type value of 0x00. Immediately, the context-specific class tells you not to use the normal type fields for DER/BER. Instead, given this is an X.509 certificate, the type value is labeled in RFC 5280, p 116. There you will see four fields with markers on them of [0], [1], [2], and [3], standing for "version", "issuerUniqueID", "subjectUniqueID", and "extensions", respectively. So in this case, a value of A0 tells you that this is one of the X.509 context-specific fields, specifically the "version" type. That takes care of the "A0" value.
The "03" value is just your length, as you might expect.
Since this was identified as "Constructed", the data should represent a normal DER/BER object. The "02 01 02" is the actual version number you are looking for, expressed as an Integer. "02" is the standard BER encoding of Integer, "01" is your length, and "02" is your value, or in this case, your version number.
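The walk-through above can be reproduced with a few lines of Python. This is a minimal sketch, not a full DER parser: `parse_identifier` and `read_tlv` are illustrative helpers that assume a single-byte tag and a short-form length, which is all the "A0 03 02 01 02" example needs.

```python
CLASSES = ["UNIVERSAL", "APPLICATION", "CONTEXT-SPECIFIC", "PRIVATE"]


def parse_identifier(octet: int):
    """Split an identifier octet into (class, constructed flag, tag number)."""
    tag_class = CLASSES[octet >> 6]   # top two bits
    constructed = bool(octet & 0x20)  # third bit
    number = octet & 0x1F             # low five bits (31 would mean a multi-byte tag)
    return tag_class, constructed, number


def read_tlv(data: bytes, offset: int = 0):
    """Return (tag octet, value bytes, offset after this TLV); short-form length only."""
    tag = data[offset]
    length = data[offset + 1]
    if length & 0x80:
        raise ValueError("long-form lengths are not handled in this sketch")
    start = offset + 2
    return tag, data[start:start + length], start + length


data = bytes.fromhex("A0 03 02 01 02")
outer_tag, outer_value, _ = read_tlv(data)
inner_tag, inner_value, _ = read_tlv(outer_value)

print(parse_identifier(outer_tag))  # ('CONTEXT-SPECIFIC', True, 0)
print(parse_identifier(inner_tag))  # ('UNIVERSAL', False, 2)
print(int.from_bytes(inner_value, "big", signed=True))  # 2
```

Note that in RFC 5280 the encoded INTEGER value 2 denotes version v3.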
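So given that X.509 defines 4 context-specific tags, you should expect to see "A0" and "A3" in the certificate for the explicitly tagged "version" and "extensions" fields. The two unique-ID fields are tagged IMPLICIT over a primitive BIT STRING, so when present they would appear as "81" and "82" rather than "A1" and "A2". Hopefully the information provided above will now make more sense and help you better understand what those markers represent.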
[0] is a context-specific tagged type: to figure out what meaning it gives to the fields it wraps (if the "Constructed" flag is set) or to the data value it wraps (if the "Constructed" flag is not set), you have to know the context in which it appears.
In addition, you also need to know what kind of object the sender and receiver are exchanging in the DER stream, i.e. the "ASN.1 module".
Let's say they're exchanging a Certificate Signing Request, and [0] appears as the 4th field inside a SEQUENCE inside the root SEQUENCE:
SEQUENCE {
    SEQUENCE {
        INTEGER 0
        SEQUENCE { ... }
        SEQUENCE { ... }
        [0] { ... }
    }
}
Then according to RFC 2986, which defines the DER contents for Certificate Signing Request, Appendix A, which defines the ASN.1 Module, the meaning of that particular field is sneakily defined as "Attributes" and "Should have the Constructed flag set":
attributes [0] Attributes{{ CRIAttributes }}
You can also go the other way and see that "attributes" must be the 4th field inside the first sequence inside the root sequence and tagged as [0] by looking at the root sequence definition (section 4: "the top-level type CertificationRequest"), finding the CertificationRequestInfo placement inside that, finding where the "attributes" item is located inside the CertificationRequestInfo, and finally seeing how it is tagged.

Find a suitable vocabulary database to build a C structure

Let's begin with the question's final purpose: my aim is to build a word-based neural network which should take a basic sentence and select, for each individual word, the meaning it is supposed to yield in the sentence itself. It is then going to learn something about the language (for example the possible correlation between two given words, the probability of finding both in a single sentence, and so on) and at the final stage (after the learning phase) try to build some very simple sentences of its own according to some input.
In order to do this I need some kind of database representing a vocabulary of a given language from which I could extract some information such as word list, definitions, synonyms et cetera. The database should be structured in a way such that I can build C data structures containing the needed information such as
typedef struct _dictEntry DictionaryEntry;
typedef struct _dict Dictionary;
struct _dictEntry {
    const char *word; // Word string
    const char **definitions; // Array of definition strings
    DictionaryEntry **synonyms; // Array of pointers to synonym words
    Dictionary *dictionary; // Pointer to parent dictionary
};
struct _dict {
    const char *language; // Language identification string
    int count; // Number of elements in the dictionary
    float **correlations; // Correlation matrix between i-th and j-th entries
    DictionaryEntry *entries; // Array of dictionary entries
};
or equivalent Obj-C objects.
I know (from Searching the Mac OSX system dictionaries?) that Apple-provided dictionaries are licensed, so I cannot use them to create my data structures.
Basically what I want to do is the following: given an arbitrary word A I want to fetch all the dictionary entries which have a definition containing A and select such definition only. I will then implement some kind of intersection procedure to select the most appropriate definition and synonyms based on the rest of the sentence and build a correlation matrix.
Let me give a little example: let us suppose I type a sentence containing "play"; I want to fetch all the entries (such as "game", "instrument", "actor", etc.) the word "play" can be correlated to and for each of them select the corresponding definition (I don't want for example to extract the "instrument" definition which corresponds to the "tool" meaning since you cannot "play a tool"). I will then select the most appropriate of these definitions looking at the rest of the sentence: if it contains also the word "actor" then I will assign to "play" the meaning "drama" or another suitable definition.
The most basic way to do this is scanning every definition in the dictionary searching for the word "play" so I will need to access all definitions without restrictions and as I understand this cannot be done using the dictionaries located under /Library/Dictionaries. Sadly this work MUST be done offline.
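The scan described above can be sketched in a few lines. This is written in Python for brevity rather than C, the entries are invented toy data, and `definitions_mentioning` is a hypothetical helper; the same loop would map onto the entries and definitions arrays of the structs above.

```python
import re

# Made-up entries; a real run would load them from whatever resource is chosen.
entries = {
    "game": ["an activity that people play for fun"],
    "instrument": ["a tool used for precise work", "a device you play to make music"],
    "actor": ["a person who performs in a play or film"],
    "tool": ["an object used to carry out a task"],
}


def definitions_mentioning(word, entries):
    """Return {headword: [definitions containing `word` as a whole word]}."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = {}
    for headword, definitions in entries.items():
        matching = [d for d in definitions if pattern.search(d)]
        if matching:
            hits[headword] = matching
    return hits


hits = definitions_mentioning("play", entries)
print(sorted(hits))        # ['actor', 'game', 'instrument']
print(hits["instrument"])  # ['a device you play to make music']
```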
Is there any available resource I can download which allows me to get my hands on all the definitions and fetch my info? Currently I'm not interested in any particular file format (could be a database or an xml or anything else) but it must be something I can decompose and put in a data structure. I tried to google it but, whatever keywords I use, if I include the word "vocabulary" or "dictionary" I (pretty obviously) only get pages about the other words' definitions on some online dictionary site! I guess this is not the best thing to search for...
I hope the question is clear... If it is not I'll try to explain it in a different way! Anyway, thanks in advance to all of you for any helpful information.
Probably a free ontology, like http://www.eat.rl.ac.uk, would help you. In the university sector there are several available.

c-style union with numpy dtypes?

I'm interested in using numpy arrays of somewhat inhomogenous data types. Since numpy specifies that the data must be homogenous, this would be accomplished by defining a super-dtype that acts as a union wrapper over all the sub-dtypes. Accessing the fields of the sub-dtypes then gives a different interpretation of the underlying data.
There's already some facility for this, for example
dtype(('|S2', [('x', '|i1'), ('y', '|i1')]))
refers to an array of two-byte strings, but the first and second bytes can also be interpreted as integers through the 'x' and 'y' field names. I can't figure out how to assign a field label to the two-byte string, though.
Can this be made more general, so that we can overlay any number of different field specifications on the data?
My first try was to specify the field offsets in the dtype, but it failed with a complaint that the offsets must be ordered (i.e. non-overlapping data).
dtype1 = np.dtype(dict(
    names=['a','b'],
    formats=['|a2','<i2'],
    offsets=[0,0]))
Another technique works, but is cumbersome. In this technique I can define several variables as views onto the same underlying data, and change the dtype of the different variables to let me access the data in different formats, i.e.
a=np.zeros(3, dtype='<a2')
b=a[:]
b.dtype='<i2'
This lets me access the data either as strings or integers depending on whether I'm looking at a or b. But it is a cumbersome way of manipulating the data. Ideally, I'd like to be able to specify a variety of different fields with arbitrary offsets. Is there any way to do this?
Union dtypes have been allowed since June 2011: https://github.com/numpy/numpy/pull/94
You'll need to upgrade to NumPy 1.7.x to use this.
However, in previous versions you can use the overlay dtype constructor:
>>> a = np.zeros(3, dtype=np.dtype(('<i2', [('a', '|a2')])))
>>> a[0] = 0x3456
>>> a['a'][0]
'V4'
This is documented at http://docs.scipy.org/doc/numpy-dev/reference/arrays.dtypes.html#specifying-and-constructing-data-types (search for (base_dtype, new_dtype)).
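To see why the overlay example returns 'V4', the same two bytes can be inspected with the standard-library struct module (used here only to illustrate the byte layout, independent of NumPy): 0x3456 stored as a little-endian 16-bit integer is the byte 0x56 ('V') followed by 0x34 ('4').

```python
import struct

# Pack 0x3456 as a little-endian signed 16-bit integer, matching '<i2'.
raw = struct.pack("<h", 0x3456)
as_text = raw.decode("ascii")

# Reading the same two bytes back as an integer recovers the original value.
(as_int,) = struct.unpack("<h", b"V4")

print(raw)          # b'V4'
print(hex(as_int))  # 0x3456
```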