Trivial example using spaCy Matcher not working - spacy

I'm trying to get the following simple example using the spaCy Matcher working:
import en_core_web_sm
from spacy.matcher import Matcher
nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)
pattern1 = [{'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}]
pattern2 = [{'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}]
pattern3 = [{'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}]
matcher.add('IP', None, pattern1, pattern2, pattern3)
doc = nlp(u'This is an IP address: 192.168.1.1')
matches = matcher(doc)
However, none of the patterns are matching and this code returns [] for matches. The simple "Hello World" example provided in the spaCy sample code works fine.
What am I doing wrong?

When using the Matcher, keep in mind that each dictionary in the pattern represents one individual token. This also means that the matches it finds depend on how spaCy tokenizes your text. By default, spaCy's English tokenizer will split your example text like this:
>>> doc = nlp("This is an IP address: 192.168.1.1")
>>> [t.text for t in doc]
['This', 'is', 'an', 'IP', 'address', ':', '192.168.1.1']
192.168.1.1 stays one token (which, objectively, is probably quite reasonable – an IP address could be considered a word). So the match pattern that expects parts of it to be individual tokens won't match.
In order to change this behaviour, you could customise the tokenizer with an additional rule that tells spaCy to split periods between numbers. However, this might also produce other, unintended side effects.
So a better approach in your case would be to work with the token shape, available as the token.shape_ attribute. The shape is a string representation of the token that describes the individual characters, and whether they contain digits, uppercase/lowercase characters and punctuation. The IP address shape looks like this:
>>> ip_address = doc[6]
>>> ip_address.shape_
'ddd.ddd.d.d'
You can either just filter your document and check that token.shape_ == 'ddd.ddd.d.d', or use the 'SHAPE' as a key in your match pattern (for a single token) to find sentences or phrases containing tokens of that shape.
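To make the idea of a shape concrete, here is a rough plain-Python approximation of how such a shape string can be derived and used as a filter. The helper name word_shape is hypothetical, and the rule of cutting off runs of the same symbol after four characters is my understanding of the behaviour rather than a guarantee; spaCy's own token.shape_ is the authority and may differ in details.

```python
def word_shape(text: str) -> str:
    """Approximate a token shape: digits -> d, lowercase -> x, uppercase -> X."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and other characters are kept as-is
        if sym == last:
            run += 1
        else:
            run = 1
            last = sym
        # Assumption: runs of the same symbol are truncated after 4 characters
        if run <= 4:
            shape.append(sym)
    return "".join(shape)


# Token texts as produced by the tokenizer in the example above
tokens = ["This", "is", "an", "IP", "address", ":", "192.168.1.1"]
ip_like = [t for t in tokens if word_shape(t) == "ddd.ddd.d.d"]
print(ip_like)
```

Note that a single shape such as 'ddd.ddd.d.d' only covers addresses with exactly that digit layout, so a real filter would probably need to accept several shapes.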

Pydantic: how to make a model with some mandatory fields and an arbitrary number of other optional fields whose names are unknown?

I'd like to represent the following JSON with a Pydantic model:
{
    "sip": {
        "param1": 1
    },
    "param2": 2,
    ...
}
In other words, the JSON may contain a sip field plus any number of other fields with arbitrary names. I'd like a model that has a sip: Optional[dict] field and some kind of "rest" that is correctly parsed from and serialized to JSON. Is it possible?
Maybe you are looking for the extra model config:
extra
whether to ignore, allow, or forbid extra attributes during model initialization. Accepts the string values of 'ignore', 'allow', or 'forbid', or values of the Extra enum (default: Extra.ignore). 'forbid' will cause validation to fail if extra attributes are included, 'ignore' will silently ignore any extra attributes, and 'allow' will assign the attributes to the model.
Example:
from typing import Any, Dict, Optional

import pydantic


class Foo(pydantic.BaseModel):
    sip: Optional[Dict[Any, Any]]

    class Config:
        extra = pydantic.Extra.allow


foo = Foo.parse_raw(
    """
    {
        "sip": {
            "param1": 1
        },
        "param2": 2
    }
    """
)
print(repr(foo))
print(foo.json())
Output:
Foo(sip={'param1': 1}, param2=2)
{"sip": {"param1": 1}, "param2": 2}

Passing subdomains to dash_leaflet.TileLayer

I have followed and adapted the LayersControl example from https://dash-leaflet.herokuapp.com/. I am trying to include a basemap from this (https://basemap.at/wmts/1.0.0/WMTSCapabilities.xml) source.
Upon running the code I get the error
Invalid argument subdomains passed into TileLayer.
Expected string.
Was supplied type array.
Value provided:
[
"map",
"map1",
"map2",
"map3",
"map4"
]
Looking into the documentation for dash_leaflet.TileLayer it says
- subdomains (string; optional):
Subdomains of the tile service. Can be passed in the form of one
string (where each letter is a subdomain name) or an array of
strings.
I think I understand the error message, but it seems to disagree with the docstring of TileLayer. I am not sure whether I have missed a detail here.
MWE:
import dash
from dash import html
import dash_leaflet as dl
url = "https://{s}.wien.gv.at/basemap/geolandbasemap/normal/google3857/{z}/{y}/{x}.png"
subdomains = ["map", "map1", "map2", "map3", "map4"]
name = "Geoland Basemap"
attribution = "basemap.at"
app = dash.Dash(__name__)
app.layout = html.Div(
    dl.Map(
        [
            dl.LayersControl(
                [
                    dl.BaseLayer(dl.TileLayer(), name="default map", checked=True),
                    dl.BaseLayer(
                        dl.TileLayer(
                            url=url, attribution=attribution, subdomains=subdomains
                        ),
                        name=name,
                        checked=False,
                    ),
                ]
            )
        ],
        zoom=7,
        center=(47.3, 15.0),
    ),
    style={"width": "100%", "height": "50vh", "margin": "auto", "display": "block"},
)
if __name__ == "__main__":
    app.run_server(debug=True)
I am running
dash==2.6.1
dash_leaflet==0.1.23

Pymongo: Best way to remove $oid in Response

I have started using Pymongo recently, and now I want to find the best way to remove $oid from the response.
When I use find:
result = db.nodes.find_one({ "name": "Archer" })
And get the response:
json.loads(dumps(result))
The result would be:
{
"_id": {
"$oid": "5e7511c45cb29ef48b8cfcff"
},
"about": "A jazz pianist falls for an aspiring actress in Los Angeles."
}
My expected:
{
"_id": "5e7511c45cb29ef48b8cfcff",
"about": "A jazz pianist falls for an aspiring actress in Los Angeles."
}
As you can see, we can use:
resp = json.loads(dumps(result))
resp['_id'] = resp['_id']['$oid']
But I don't think this is the best way. I hope you have a better solution.
You can take advantage of aggregation:
result = db.nodes.aggregate([
    {'$match': {"name": "Archer"}},
    # $toString (MongoDB 4.0+) converts the ObjectId to its hex string
    {'$addFields': {"Id": {'$toString': '$_id'}}},
    {'$project': {'_id': 0}},
])
data = json.dumps(list(result))
Here, $addFields adds a new field Id that holds the string form of _id. The projection then removes the original _id field from the result. Finally, because aggregate returns a cursor, I turn it into a list before serializing.
It may not work as you hope but the general idea is there.
First of all, there's no $oid in the response. What you are seeing is the Python driver representing the _id field as an ObjectId instance, and then the dumps() method rendering that ObjectId in a string format. The $oid bit is just there to let you know the field is an ObjectId, should you need it for some purpose later.
The next part of the answer depends on what exactly you are trying to achieve. Almost certainly you can achieve it using the result object without converting it to JSON.
If you just want to get rid of it altogether, you can do:
result = db.nodes.find_one({ "name": "Archer" }, {'_id': 0})
print(result)
which gives:
{"name": "Archer"}
import re

# Matches {"$oid": "<id>"} and captures the quoted id string
OID_PATTERN = re.compile(r'{\s*"\$oid":\s*("[a-z0-9]+")\s*}')

def remove_oid(string):
    while True:
        match = OID_PATTERN.search(string)
        if match:
            # Replace the whole {"$oid": "..."} wrapper with the bare "..." string
            string = string.replace(match.group(0), match.group(1))
        else:
            return string

string = json_dumps(mongo_query_result)
string = remove_oid(string)
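The same replacement can also be done on the parsed data rather than on the JSON string. The following is only a sketch under assumptions: flatten_oid is a hypothetical helper name, and the raw string stands in for whatever dumps(result) produced in the question.

```python
import json


def flatten_oid(value):
    """Recursively replace {"$oid": "<id>"} wrappers with the bare id string."""
    if isinstance(value, dict):
        if set(value) == {"$oid"}:
            return value["$oid"]
        return {key: flatten_oid(item) for key, item in value.items()}
    if isinstance(value, list):
        return [flatten_oid(item) for item in value]
    return value


# Stand-in for the string returned by dumps(result) in the question
raw = (
    '{"_id": {"$oid": "5e7511c45cb29ef48b8cfcff"}, '
    '"about": "A jazz pianist falls for an aspiring actress in Los Angeles."}'
)
cleaned = flatten_oid(json.loads(raw))
print(cleaned["_id"])
```

Walking the parsed structure avoids depending on the exact whitespace and character set of the serialized form, at the cost of one extra pass over the data.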
I am using some form of custom handler. I managed to remove $oid and replace it with just the id string:
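import datetime
import json

import bson.objectid

# Custom handler for types that json cannot serialize natively
def my_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    elif isinstance(x, bson.objectid.ObjectId):
        return str(x)
    else:
        raise TypeError(x)

# Round-trip through JSON so every value ends up as a plain JSON type
def parse_json(data):
    return json.loads(json.dumps(data, default=my_handler))

# _id is kept in the result so that the handler can turn it into a string
result = db.nodes.aggregate([{'$match': {"name": "Archer"}}])
# aggregate returns a cursor, so materialize it before serializing
data = parse_json(list(result))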
In the second argument of find_one, you can define which fields to exclude, in the following way:
site_information = mongo.db.sites.find_one({'username': username}, {'_id': False})
This statement will exclude the '_id' field from being selected from the returned documents.

import.io json API: get the list of columns, with subfields

I'm using the import.io API and have noticed that some field types return several columns in the generated json. For instance a field foo of type Money will return three columns: foo, foo/_currency and foo/_source.
Is there a reference somewhere? I found some documentation here http://blog.import.io/post/11-columns-of-importio through an incomplete example:
{
"whole_number_field": 123,
"whole_number_field/_source": "123",
"language_field": "ben",
"language_field/_source": "bn",
"country_field": "CHN",
"country_field/_source": "China",
"boolean_field": false,
"boolean_field/_source": "false",
"currency_field/_currency": "GBP",
"currency_field/_source": "£123.45",
"link_field": "http://chris-alexander.co.uk",
"link_field/_text": "Blog",
"link_field/_title": "linktitle",
"datetime_field": 611368440000,
"datetime_field/_source": "17/05/89 12:34",
"datetime_field/_utc": "Wed May 17 00:34:00 GMT 1989",
"image_field": "http://io.chris-alexander.co.uk/gif2.gif",
"image_field/_alt": "imgalt",
"image_field/_title": "imgtitle",
"image_field/_source": "gif2.gif"
}
The columns are documented in the API docs:
http://api.docs.import.io/
For example, for currency, the columns are:
myvar <== Extracted value
myvar/_currency <== ISO currency code
myvar/_source <== Original value
The ISO currency code is returned as myvar/_currency, and the numeric value as myvar.
I established the following through several tests; I'd like to know if I'm missing something:
{
'DATE': ['_source', '_utc'],
# please tell me if you have an example of an import.io API with a date!
'BOOLEAN': ['_source'],
'LANG': ['_source'],
'COUNTRY': ['_source'],
'HTML':[],
'STRING':[],
'URL': ['_text', '_source', '_title'],
'IMAGE': ['_alt', '_title', '_source'],
'DOUBLE': ['_source'],
'CURRENCY': ['_currency', '_source'],
}
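As a sketch of how this mapping could be used, the snippet below expands a field name and type into the column names one would expect in the JSON. The mapping is copied from the test results above, not from official documentation, and expected_columns is a hypothetical helper name.

```python
# Sub-columns observed per field type (empirical, taken from the list above)
SUBFIELDS = {
    "DATE": ["_source", "_utc"],
    "BOOLEAN": ["_source"],
    "LANG": ["_source"],
    "COUNTRY": ["_source"],
    "HTML": [],
    "STRING": [],
    "URL": ["_text", "_source", "_title"],
    "IMAGE": ["_alt", "_title", "_source"],
    "DOUBLE": ["_source"],
    "CURRENCY": ["_currency", "_source"],
}


def expected_columns(field_name, field_type):
    """Return the base column followed by its type-specific sub-columns."""
    return [field_name] + [
        f"{field_name}/{suffix}" for suffix in SUBFIELDS[field_type]
    ]


print(expected_columns("foo", "CURRENCY"))
# ['foo', 'foo/_currency', 'foo/_source']
```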

Dojo DGrid RQL Search

I am working with a dgrid where I want to find a search term in my grid on two columns.
For instance, I want to see if the scientificName and commonName columns contain the string "Aca" (I want my search to be case insensitive).
My Grid definition:
var CustomGrid = declare([Grid, Pagination]);
var gridStore = new Memory({ idProperty: 'tsn', data: null });
gridStore.queryEngine = rql.query;
grid = new CustomGrid({
    store: gridStore,
    columns: [
        { field: "tsn", label: "TSN #" },
        { field: "scientificName", label: "Scientific Name" },
        { field: "commonName", label: "Common Name" }
    ],
    autoHeight: 'true',
    firstLastArrows: 'true',
    pageSizeOptions: [50, 100]
}, id);
With the built in query language (I think simple query language), I was able to find the term in one column or the other, but I couldn't do a complex search that would return results for both columns.
grid.set("query", { scientificName : new RegExp(speciesKeyword, "i") });
grid.refresh()
I started reading and I think RQL can solve this problem, however, I am struggling with the syntax.
I have been looking at these pages:
http://rql-engine.eu01.aws.af.cm/
https://github.com/kriszyp/rql
I am able to understand basic queries; however, the "contains" syntax eludes me.
For instance if I had this simple data set and wanted to find the entries with scientific and common names that contain the string "Aca" I would think my contains query would look like this:
contains(scientificName,string:aca)
However, this results in no matches.
[
{
"tsn": 1,
"scientificName": "Acalypha ostryifolia",
"commonName": "Rough-pod Copperleaf",
},
{
"tsn": 2,
"scientificName": "Aegalius acadicus",
"commonName": "Northern Saw-whet Owl",
},
{
"tsn": 3,
"scientificName": "Portulaca pilosa",
"commonName": "2012-02-01",
},
{
"tsn": 4,
"scientificName": "Accipiter striatus",
"commonName": "Kiss-me-quick",
},
{
"tsn": 5,
"scientificName": "Acorus americanus",
"commonName": "American Sweetflag",
}
]
Can someone guide me in how to formulate the correct syntax? Thank you.
From what I'm briefly reading, it appears that:
- contains was replaced by any and all
- these are meant for array comparisons, not string comparisons
I'm not sure offhand whether RegExps can just be handed to other operations, e.g. eq.
With dojo/store/Memory, you can also pass a query function which will allow you to do whatever you want, so if you wanted to compare for a match in one field or the other you could do something like this:
grid.set('query', function (item) {
    var scientificRx = new RegExp(speciesKeyword, 'i');
    var commonRx = new RegExp(...);
    return scientificRx.test(item.scientificName) || commonRx.test(item.commonName);
});
Of course, if you want to filter only items that match both, you can do that with simple object syntax:
grid.set('query', {
    scientificName: scientificRx,
    commonName: commonRx
});