Mapping spaCy int attributes to string (unicode) attributes - spacy

For many token properties, such as part of speech and dependency relations, spaCy stores both integer and string attributes. For example, for POS there are the pos_ (strings like "PUNCT" and "ADJ") and pos (integer values) attributes. The full list of token attributes is here.
Is there a convenient way to directly convert between the two representations? Concretely, if I have a POS integer value, is there a way to know what is the corresponding string?
I ran into this issue when using the count_by API (see here), which counts attribute frequencies and returns a dictionary mapping each integer attribute value to its count. An example:
>>> doc = nlp("I like natural language processing.")
>>> doc.count_by(spacy.attrs.POS)
{96: 1, 99: 1, 83: 1, 91: 2, 94: 1}
Is it possible to get the corresponding string for each POS key?
Of course, there are other ways to get these counts using the string attributes. But my question is more general than this example application.

Yes, it's a lookup table at doc.vocab.strings. Indexing works in both directions: doc.vocab.strings["VERB"] returns the integer ID, and doc.vocab.strings[VERB] (with VERB imported from spacy.symbols) returns the string. If you have a string and want the hash, use the spacy.strings.get_string_id() function. Hashing the string is stateless, so you don't need the StringStore for it.
The built-in symbols can also be dereferenced using the spacy.attrs.IDS and spacy.symbols.IDS global variables.
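To make the lookup concrete, here is a minimal sketch that converts integer POS keys back to strings. It assumes spaCy is installed; the counts dictionary is a hand-written stand-in for what doc.count_by(spacy.attrs.POS) returns, so no trained model is needed to run it.

```python
import spacy
from spacy.symbols import NOUN, VERB

# A blank pipeline is enough here: only the vocabulary's StringStore is used.
nlp = spacy.blank("en")
strings = nlp.vocab.strings  # the same kind of table as doc.vocab.strings

# Stand-in for the result of doc.count_by(spacy.attrs.POS) (assumed values).
counts = {NOUN: 2, VERB: 1}

# Look up each integer key to get its string form.
readable = {strings[key]: n for key, n in counts.items()}
print(readable)
```

The same comprehension should work on a real count_by result, since the keys there are the same integer IDs.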

Related

protocol-buffers: string or byte sequence of the exact length

Looking at https://developers.google.com/protocol-buffers/docs/proto3#scalar it appears that string and bytes types don't limit the length? Does it mean that we're expected to specify the length of transmitted string in a separate field, e.g. :
message Person {
    string name = 1;
    int32 name_len = 2;
    int32 user_id = 3;
    ...
}
The wire type used for string/bytes is Length-delimited. This means that the message includes the string's length. How this is made available to you will depend upon the language you are using - for example the table says that in C++ a string type is used so you can call name.length() to retrieve the length.
So there is no need to specify the length in a separate field.
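To see why the length travels with the value, here is a rough hand-rolled sketch of how a single string field is laid out on the wire: a tag, then the payload length as a varint, then the UTF-8 bytes. The helper names are made up for illustration; real code would use the generated protobuf classes instead.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field: tag, byte length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2 means length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload


encoded = encode_string_field(1, "Ann")
print(encoded)  # b'\n\x03Ann' : tag 0x0A, length 3, then the three bytes
```

The second byte is the length, which is why the receiving side can recover the string without a separate name_len field.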
One of the things that I wished GPB did was allow the schema to be used to set constraints on such things as list/array length, or numerical value ranges. The best you can do is to have a comment in the .proto file and hope that programmers pay attention to it!
Other serialisation technologies do do this, like XSD (though often the tools are poor), ASN.1 and JSON schema. It's very useful. If GPB added these (it doesn't change wire formats), GPB would be pretty well "complete".

Automatically detect security identifier columns using Visions

I'm interested in using the Visions library to automate the process of identifying certain types of security (stock) identifiers. The documentation mentions that it could be used in such a way for ISBN codes but I'm looking for a more concrete example of how to do it. I think the process would be pretty much identical for the fields I'm thinking of as they all have check digits (ISIN, SEDOL, CUSIP).
My general idea is that I would create custom types for the different identifier types and could use those types to
Take a dataframe where the types are unknown and identify columns matching the types (even if it's not a 100% match)
Validate the types on a dataframe where the intended type is known
Great question and use-case! Unfortunately, the documentation on making new types probably needs a little love right now as there were API breaking changes with the 0.7.0 release. Both the previous link and this post from August, 2020 should cover the conceptual idea of type creation in greater detail. If any of those examples break then mea culpa and our apologies, we switched to a dispatch based implementation to support different backends (pandas, numpy, dask, spark, etc...) for each type. You shouldn't have to worry about that for now but if you're interested you can find the default type definitions here with their backends here.
Building an ISBN Type
We need to make two basic decisions when defining a type:
What defines the type
What other types are our new type related to?
For the ISBN use-case O'Reilly provides a validation regex to match ISBN-10 and ISBN-13 codes. So,
What defines a type?
We want every element in the sequence to be a string which matches a corresponding ISBN-10 or ISBN-13 regex
What other types are our new type related to?
Since ISBNs are themselves strings we can use the default String type provided by visions.
Type Definition
from typing import Sequence

import pandas as pd
from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType

# Validation regex for ISBN-10 and ISBN-13 codes (from O'Reilly)
isbn_regex = "^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"


class ISBN(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(String),
        ]
        return relations

    @staticmethod
    def contains_op(series: pd.Series, state: dict) -> bool:
        # Every element must match the ISBN regex
        return series.str.contains(isbn_regex).all()
Looking at this closely there are three things to take note of.
The new type inherits from VisionsBaseType
We had to define a get_relations method, which is how we relate a new type to others we might want to use in a typeset. In this case, I've used an IdentityRelation to String, which means ISBNs are subsets of String. We can also use InferenceRelations when we want to support relations which change the underlying data (say converting the string '4.2' to the float 4.2).
A contains_op, which is our definition of the type. In this case, we are applying a regex string to every element in the input and verifying it matched the regex provided by O'Reilly.
Extensions
In theory ISBNs can be encoded in what looks like a 10 or 13 digit integer as well - to work with those you might want to create an InferenceRelation between Integer and ISBN. A simple implementation would involve coercing Integers to string and applying the above regex.
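Since the question mentions check digits, note that the regex only checks the shape of a code, not its checksum. Below is a plain-Python sketch of the standard ISBN-10 and ISBN-13 check-digit rules; the function names are hypothetical and it assumes any "ISBN" prefix has already been stripped. Something like this could be combined with the regex inside contains_op, and ISIN, SEDOL and CUSIP would each need their own checksum rule.

```python
DIGITS = set("0123456789")


def isbn10_checksum_ok(code: str) -> bool:
    """Check an ISBN-10: weights 10..1, total must be divisible by 11."""
    chars = [c for c in code.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    body, last = chars[:9], chars[9]
    if not all(c in DIGITS for c in body) or not (last in DIGITS or last == "X"):
        return False
    values = [int(c) for c in body] + [10 if last == "X" else int(last)]
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


def isbn13_checksum_ok(code: str) -> bool:
    """Check an ISBN-13: alternating weights 1 and 3, total divisible by 10."""
    chars = [c for c in code if c not in "- "]
    if len(chars) != 13 or not all(c in DIGITS for c in chars):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
    return total % 10 == 0


print(isbn13_checksum_ok("978-0-596-52068-7"))  # True
print(isbn10_checksum_ok("0-596-52068-9"))      # True
```

With pandas, one way to apply it would be series.map(isbn13_checksum_ok).all(), though that is only one possible design.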

Call for global variable in JS block of Selenium Webdriver test (Python)

I have a string of numbers set by user. Defined in the beginning of the Webdriver test:
numbers = input("prompt")
Then I need to enter value of this variable by JS code like this:
driver.execute_script("document.getElementsByName('phone')[0].value=***")
Where instead of *** I need the value of "numbers" variable. How should I properly insert it to make it work?
Here is what you want to do.
numbers = input("prompt")
driver.execute_script("document.getElementsByName('phone')[0].value={}".format(numbers))
The documentation link:
https://docs.python.org/3/library/string.html
And a snippet from the docs:
The field_name itself begins with an arg_name that is either a number or a keyword. If it’s a number, it refers to a positional argument, and if it’s a keyword, it refers to a named keyword argument. If the numerical arg_names in a format string are 0, 1, 2, … in sequence, they can all be omitted (not just some) and the numbers 0, 1, 2, … will be automatically inserted in that order. Because arg_name is not quote-delimited, it is not possible to specify arbitrary dictionary keys (e.g., the strings '10' or ':-]') within a format string. The arg_name can be followed by any number of index or attribute expressions. An expression of the form '.name' selects the named attribute using getattr(), while an expression of the form '[index]' does an index lookup using __getitem__().
Changed in version 3.1: The positional argument specifiers can be omitted for str.format(), so '{} {}'.format(a, b) is equivalent to '{0} {1}'.format(a, b).
OR
numbers = input("prompt")
driver.execute_script("document.getElementsByName('phone')[0].value=%s" % numbers)
See examples of both here:
https://pyformat.info/
If your Python variable's value is a simple string without single quotes or special characters, you can simply use:
driver.execute_script("document.getElementsByName('phone')[0].value='" +
                      python_variable + "'")
If it has quote marks in it, or special characters that need escaping, or if it's not a string at all, you need to obtain JavaScript string representation of your Python variable's value. json.dumps will handle all the necessary formatting and escaping for you, appropriate to the type of your variable:
from json import dumps
driver.execute_script("document.getElementsByName('phone')[0].value=" +
                      dumps(python_variable))
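To see what json.dumps actually produces, here is a small sketch that only builds the script string, so it runs without a browser. The build_set_value_script helper is a made-up name for illustration; the resulting string is what you would pass to driver.execute_script.

```python
import json


def build_set_value_script(field_name: str, value) -> str:
    """Build a JS statement that sets a named input's value (hypothetical helper)."""
    # json.dumps quotes and escapes both values as JavaScript literals.
    return "document.getElementsByName({})[0].value = {};".format(
        json.dumps(field_name), json.dumps(value)
    )


script = build_set_value_script("phone", "+1 (555) 010-9999")
print(script)
# With a live WebDriver session: driver.execute_script(script)
```

Another option that avoids string building altogether is to pass the value as an extra argument, e.g. driver.execute_script("document.getElementsByName('phone')[0].value = arguments[0]", numbers), since Selenium exposes extra arguments to the script as arguments[0], arguments[1], and so on.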

Using a hash with object keys in Perl 6

I'm trying to make a Hash with non-string keys, in my case arrays or lists.
> my %sum := :{(1, 3, 5) => 9, (2, 4, 6) => 12}
{(1 3 5) => 9, (2 4 6) => 12}
Now, I don't understand the following.
How to retrieve an existing element?
> %sum{(1, 3, 5)}
((Any) (Any) (Any))
> %sum{1, 3, 5}
((Any) (Any) (Any))
How to add a new element?
> %sum{2, 4} = 6
(6 (Any))
Several things are going on here: first of all, if you use (1,2,3) as a key, Rakudo Perl 6 will consider this to be a slice of 3 keys: 1, 2 and 3. Since none of these exists in the object hash, you get ((Any) (Any) (Any)).
So you need to indicate that you want the list to be seen as single key of which you want the value. You can do this with $(), so %sum{$(1,3,5)}. This however does not give you the intended result. The reason behind that is the following:
> say (1,2,3).WHICH eq (1,2,3).WHICH
False
Object hashes internally key each object by its .WHICH value. At the moment, Lists are not considered value types, so each List has a different .WHICH. That makes them unfit to be used as keys in object hashes, or in other cases where .WHICH is used by default (e.g. .unique and Sets, Bags and Mixes).
I'm actually working on making the above eq return True before long: this should make it to the 2018.01 compiler release, on which also a Rakudo Star release will be based.
BTW, any time you're using object hashes and integer values, you will probably be better off using Bags. Alas not yet in this case either for the above reason.
You could actually make this work by using augment class List and adding a .WHICH method on that, but I would recommend against that as it will interfere with any future fixes.
Elizabeth's answer is solid, but until that feature is created, I don't see why you can't create a Key class to use as the hash key, which will have an explicit hash function which is based on its values rather than its location in memory. This hash function, used for both placement in the hash and equality testing, is .WHICH. This function must return an ObjAt object, which is basically just a string.
class Key does Positional {
    has Int @.list handles <elems AT-POS EXISTS-POS ASSIGN-POS BIND-POS push>;
    method new(*@list) { self.bless(:@list); }
    method WHICH() { ObjAt.new(@!list.join('|')); }
}
my %hsh{Key};
%hsh{Key.new(1, 3)} = 'result';
say %hsh{Key.new(1, 3)}; # output: result
Note that I only allowed the key to contain Int. This is an easy way of being fairly confident no element's string value contains the '|' character, which could make two keys look the same despite having different elements. However, this is not hardened against naughty users: 4 but role :: { method Str() { '|' } } is an Int that stringifies to the illegal value. You can make the code stronger if you use .WHICH recursively, but I'll leave that as an exercise.
This Key class is also a little fancier than you strictly need. It would be enough to have a @.list member and define .WHICH. I defined AT-POS and friends so the Key can be indexed, pushed to, and otherwise treated as an Array.
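For readers more at home in Python, the identity-versus-value distinction behind .WHICH can be sketched as an analogy (this is not Perl 6 code, and both class names are invented for illustration). A key class that keeps the default identity-based hashing never matches a freshly built key, while one that hashes and compares by its contents does.

```python
class IdentityKey:
    """Keeps Python's default hashing, which is based on object identity."""

    def __init__(self, *items):
        self.items = tuple(items)


class ValueKey:
    """Hashes and compares by contents, like a value-based .WHICH."""

    def __init__(self, *items):
        self.items = tuple(items)

    def __eq__(self, other):
        return isinstance(other, ValueKey) and self.items == other.items

    def __hash__(self):
        return hash(self.items)


by_identity = {IdentityKey(1, 3): "result"}
by_value = {ValueKey(1, 3): "result"}

print(IdentityKey(1, 3) in by_identity)  # False: a new object is a new key
print(by_value[ValueKey(1, 3)])          # result
```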

Enumerating Strings as bytes?

I was looking for a way to enumerate String types in (vb).NET, but .NET enums only accept numeric type values.
The first alternative I came across was to create a dictionary of my enum values and the string I want to return. This worked, but was hard to maintain because if you changed the enum you would have to remember to also change the dictionary.
The second alternative was to set field attributes on each enum member, and retrieve it using reflection. Sure enough this worked as well and also solved the maintenance problem, but it uses reflection and I've always read that using reflection should be a last resort thing.
So I started thinking and I came up with this: every ASCII character can be represented as a hexadecimal value, and you can assign hexadecimal values to enum members.
You could get rid of the attributes and assign the hexadecimal values to the enum members. Then, when you need the text value, convert the value to a byte array and use System.Text.Encoding.ASCII.GetString(enumMemberBytes) to get the string value.
Now speaking from experience, anything I come up with is usually either flawed or just plain wrong. What do you guys think about this approach? Is there any reason not to do it like that?
Thanks.
EDIT
As pointed out by David W, enum member values are limited in length, depending on the underlying type (integer by default). So yes, I believe my method works but you are limited to characters in the ASCII table, with a maximum length of 4 or 8 characters using integers or longs respectively.
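The packing idea and its length limit can be sketched language-neutrally in a few lines of Python (the function names are invented, and big-endian byte order is an assumption made for readability; the .NET byte order may differ by platform):

```python
def pack_ascii(text: str, width: int = 4) -> int:
    """Pack an ASCII string into one integer of `width` bytes."""
    raw = text.encode("ascii")
    if len(raw) > width:
        raise ValueError(f"{text!r} does not fit in {width} bytes")
    return int.from_bytes(raw, "big")


def unpack_ascii(value: int, width: int = 4) -> str:
    """Recover the string, dropping the leading zero padding bytes."""
    return value.to_bytes(width, "big").lstrip(b"\x00").decode("ascii")


packed = pack_ascii("ABCD")
print(hex(packed))           # 0x41424344
print(unpack_ascii(packed))  # ABCD
```

A five-character name such as "HELLO" already fails with width=4, which is the 4-character limit of a 32-bit underlying type described above; width=8 corresponds to a 64-bit type.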
The easiest way I have found to dynamically parse a String representation of an Enumeration into the actual Enumeration type was to do the following:
Private Enum EnumObject
    [Undefined]
    ValueA
    ValueB
End Enum
Dim enumVal As EnumObject = DirectCast([Enum].Parse(GetType(EnumObject), "ValueA"), EnumObject)
This removes the need to maintain a dictionary and allows you to just handle strings instead of converting to an Int or a Long. This does use reflection, but I have not come across any issues as long as you catch and handle any exceptions with the String Parse.