neuralcoref in spaCy: Setting a higher value for the parameter 'max_dist' doesn't make a difference

The text I'm analysing is the 'Business' section of a 10-K annual report of a firm.
My goal is to get all mentions that are linked to the company's name.
In the text, the full name of the company ("ADC Telecommunications, Inc") appears at the beginning, and in the rest it is frequently referred to by pronouns such as "we", "us", or "our".
I tried three different values for the 'max_dist' parameter: the default (50), 100, and 200.
My expectation was that the greater the value, the more mentions would end up in the cluster associated with the company name. This held when I increased the value from the default (50) to 100, but there was no difference between 100 and 200.
Below is how I get the text.
import requests
from bs4 import BeautifulSoup
import re
USER_AGENT = "Jiho Yang (jiho90#gmail.com)"
headers = {"User-agent": USER_AGENT}
url = 'http://www.sec.gov/Archives/edgar/data/61478/000095013707018659/0000950137-07-018659.txt'
file = requests.get(url, headers=headers).content
soup = BeautifulSoup(file, 'html.parser')
txt = ' '.join(soup.find_all(text=True))
txt = ' '.join(txt.split())
section = re.search(r'item 1\. business (.*?)item 1b', txt, re.I).group(1)
From here on is the code I used for coreference resolution.
import spacy
import neuralcoref
import en_core_web_lg
nlp = en_core_web_lg.load()
neuralcoref.add_to_pipe(nlp, max_dist = 200) # max_dist set to either 50, 100, or 200.
doc = nlp(section.lower())
Below is what I did to get clusters whose main mention contains the company's name ("ADC").
sumlen = 0
for cluster in doc._.coref_clusters:
    sumlen += len(cluster.mentions)
    if re.search(r'adc', str(cluster.main)):
        print(cluster)
And the results:
adc telecommunications, inc. (“adc,” “we,” “us” or “our”): [adc telecommunications, inc. (“adc,” “we,” “us” or “our”), we, our, our, our, we, our, our, we, our, our, we, our, we, our, our, our, we, we, we, we, our]
broadcast and entertainment: [broadcast and entertainment, broadcast and entertainment]
broadcast and entertainment products: [broadcast and entertainment products, broadcast and entertainment products]
adc’s: [adc’s, adc, adc, rejoining adc, adc, adc, adc’s, adc, adc, adc, adc, adc, adc, adc, adc]
joining adc: [joining adc, joining adc, joining adc, joining adc]
I expected the first cluster in the results to contain more mentions ("we", "our", "us") as I raised the 'max_dist' parameter, say from 100 to 200, but that was not the case here. There are clearly many more such mentions in the text that refer to the company, yet the code only ever finds 21 of them, however large I set 'max_dist'.
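To quantify that gap, here is a rough sanity check (a sketch, reusing section and doc from the code above) that counts the pronoun occurrences in the raw text and compares them with the mentions neuralcoref actually attached to the ADC cluster:
# rough count of candidate pronoun mentions in the raw text
pronoun_count = len(re.findall(r'\b(?:we|us|our)\b', section.lower()))
# mentions neuralcoref attached to clusters whose main mention contains 'adc'
resolved = sum(len(c.mentions) for c in doc._.coref_clusters
               if re.search(r'adc', str(c.main)))
print(pronoun_count, 'pronouns in the text vs', resolved, 'resolved mentions')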


Where is the list of available built-in colormap names?

Where in the matplotlib documentation is the list of available built-in colormap names that can be passed as the name argument of matplotlib.cm.get_cmap(name)?
Choosing Colormaps in Matplotlib says:
Matplotlib has a number of built-in colormaps accessible via matplotlib.cm.get_cmap.
matplotlib.cm.get_cmap says:
matplotlib.cm.get_cmap(name=None, lut=None)
Get a colormap instance, defaulting to rc values if name is None.
name: matplotlib.colors.Colormap or str or None, default: None
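For reference, here is a minimal sketch of how I call it (the colormap name here is chosen arbitrarily):
import matplotlib.cm
cmap = matplotlib.cm.get_cmap('viridis')  # look up a registered colormap by its name
print(cmap(0.5))  # RGBA value at the midpoint of the colormap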
https://www.kite.com/python/docs/matplotlib.pyplot.colormaps shows multiple names.
autumn: sequential linearly-increasing shades of red-orange-yellow
bone: sequential increasing black-white colormap with a tinge of blue, to emulate X-ray film
cool: linearly-decreasing shades of cyan-magenta
copper: sequential increasing shades of black-copper
flag: repetitive red-white-blue-black pattern (not cyclic at endpoints)
gray: sequential linearly-increasing black-to-white grayscale
hot: sequential black-red-yellow-white, to emulate blackbody radiation from an object at increasing temperatures
hsv: cyclic red-yellow-green-cyan-blue-magenta-red, formed by changing the hue component in the HSV color space
inferno: perceptually uniform shades of black-red-yellow
jet: a spectral map with dark endpoints, blue-cyan-yellow-red; based on a fluid-jet simulation by NCSA [1]
magma: perceptually uniform shades of black-red-white
pink: sequential increasing pastel black-pink-white, meant for sepia tone colorization of photographs
plasma: perceptually uniform shades of blue-red-yellow
prism: repetitive red-yellow-green-blue-purple-...-green pattern (not cyclic at endpoints)
spring: linearly-increasing shades of magenta-yellow
summer: sequential linearly-increasing shades of green-yellow
viridis: perceptually uniform shades of blue-green-yellow
winter: linearly-increasing shades of blue-green
However, simply googling 'matplotlib colormap names' does not seem to hit the right documentation. I suppose there is a page listing the names as an enumeration or as constant strings. Please help me find it.
There is some example code in the documentation (thanks to @Patrick Fitzgerald for posting the link in the comments, because it's not half as easy to find as it should be) which demonstrates how to generate a plot with an overview of the installed colormaps.
However, this uses an explicit list of maps, so it's limited to the specific version of matplotlib for which the documentation was written, as maps are added and removed between versions. To see what exactly your environment has, you can use this (somewhat crudely) adapted version of the code:
import numpy as np
import matplotlib.pyplot as plt

gradient = np.linspace(0, 1, 256)
gradient = np.vstack((gradient, gradient))

def plot_color_gradients(cmap_category, cmap_list):
    # Create figure and adjust figure height to number of colormaps
    nrows = len(cmap_list)
    figh = 0.35 + 0.15 + (nrows + (nrows - 1) * 0.1) * 0.22
    fig, axs = plt.subplots(nrows=nrows + 1, figsize=(6.4, figh))
    fig.subplots_adjust(top=1 - 0.35 / figh, bottom=0.15 / figh,
                        left=0.2, right=0.99)
    axs[0].set_title(cmap_category + ' colormaps', fontsize=14)
    for ax, name in zip(axs, cmap_list):
        ax.imshow(gradient, aspect='auto', cmap=plt.get_cmap(name))
        ax.text(-0.01, 0.5, name, va='center', ha='right', fontsize=10,
                transform=ax.transAxes)
    # Turn off *all* ticks & spines, not just the ones with colormaps.
    for ax in axs:
        ax.set_axis_off()

cmaps = [name for name in plt.colormaps() if not name.endswith('_r')]
plot_color_gradients('all', cmaps)
plt.show()
This simply plots all of them, ignoring the categories.
Since plt.colormaps() produces a list of all the map names, this version only removes the names ending in '_r' (because those are the inverted versions of the others) and plots the rest.
That's still a fairly long list, but you can have a look and then manually update/remove items from cmaps to narrow it down to the ones you would consider for a given task.
You can also automatically reduce the list to monochrome/non-monochrome maps, because each colormap reports that property through its is_gray() method:
cmaps_mono = [name for name in cmaps if plt.get_cmap(name).is_gray()]
cmaps_color = [name for name in cmaps if not plt.get_cmap(name).is_gray()]
That should at least give you a decent starting point.
It'd be nice if there was some way within matplotlib to select just certain types of maps (categorical, perceptually uniform, suitable for colourblind viewers ...), but I haven't found a way to do that automatically.
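If you need such a selection, the closest thing I can offer is to hard-code the category lists from the 'Choosing Colormaps' page yourself; a rough sketch, reusing plot_color_gradients from above (these lists are abbreviated examples, not the full categories):
# hand-maintained (and deliberately incomplete) category lists from the docs
categories = {
    'Perceptually uniform': ['viridis', 'plasma', 'inferno', 'magma'],
    'Sequential': ['Greys', 'Blues', 'Greens', 'Oranges'],
    'Cyclic': ['twilight', 'hsv'],
}
for category, names in categories.items():
    # keep only the maps actually registered in this matplotlib version
    available = [name for name in names if name in plt.colormaps()]
    if available:
        plot_color_gradients(category, available)
plt.show()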
You can use my CMasher to make simple colormap overviews of a list of colormaps.
In your case, if you want to see what every colormap in MPL looks like, you can use the following:
import cmasher as cmr
import matplotlib.pyplot as plt
cmr.create_cmap_overview(plt.colormaps(), savefig='MPL_cmaps.png')
This will give you an overview of all colormaps that are registered in MPL, which will be all built-in colormaps plus the colormaps my CMasher package adds, saved to 'MPL_cmaps.png'.

Filtering on relation

I'm trying to retrieve a network of the major roads in Norway. However, the "int_ref" and "ref" labels are inconsistent and are resulting in gaps in the road. When looking at the road in OpenStreetMap I can see that the 'relation' tag under 'Part of' is exactly what I need. Is there any way to retrieve this using OSMnx? Is there any other way to retrieve a full road? I'm using the following line of code when filtering one specific road based on int_ref:
G1 = ox.graph_from_place(query = "Norway", retain_all = True, custom_filter = '["int_ref"~"E 39"]')
No, OSMnx filters on way tags, not on relations. If you want to get only the major roads in a country, see this answer: https://stackoverflow.com/a/52412274/7321942
Something like this may do what you are looking for:
import osmnx as ox
ox.config(use_cache=True, log_console=True)
# get the geometry of the norwegian mainland
gdf = ox.geocode_to_gdf('Norway')
geom = max(gdf['geometry'].iloc[0], key=lambda x: x.area)
# get all motorway/trunk roads
cf = '["highway"~"motorway|motorway_link|trunk|trunk_link"]'
G = ox.graph_from_polygon(geom, network_type='drive', custom_filter=cf)
# plot it
fig, ax = ox.plot_graph(G)
It takes ~370 Overpass API requests to download all the area of the Norwegian mainland, so it takes a while to make all those requests. You can watch its progress in the log in the console.
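Since the download is the expensive part, you may also want to persist the graph to disk so you only pay that cost once; a small sketch (the filename is arbitrary):
# save the downloaded graph, then reload it in later sessions
ox.save_graphml(G, filepath='norway_major_roads.graphml')
G = ox.load_graphml(filepath='norway_major_roads.graphml')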

How to get a 'dobj' in spacy

In the following Tweet, the spaCy dependency tagger connects disrupt (VB) to healthcare market (NN) via dobj. As these two terms are connected, I would like to extract them as one phrase. Is there any way to navigate the parse tree so I can extract the dobj of a word? If I do the following I get market but not 'healthcare market':
from spacy.en import English
from spacy.symbols import nsubj, VERB,dobj
nlp = English()
doc = nlp('Juniper Research: AI start-ups set to disrupt healthcare market, with $800 million to be spent on CAD Systems by 2022')
for possible_subject in doc:
    if possible_subject.dep == dobj:
        print(possible_subject.text)
You can do this as below, using noun chunks:
for np in doc.noun_chunks:
    if np.root.dep == dobj:
        print(np.root.text)
        print(np.text)
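If the noun chunk still misses words you care about, another option is to walk the token's syntactic subtree, which spans the object plus everything attached to it; a sketch (assuming doc is parsed as in the question):
for token in doc:
    if token.dep == dobj:
        # the subtree covers the object and all of its modifiers
        phrase = ''.join(t.text_with_ws for t in token.subtree).strip()
        print(phrase)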

Plotting Natural Earth features on a custom projection

I am trying to make some plots of sea ice data. The data is delivered in the EASE-North grid; an example file (HDF4) can be downloaded at:
ftp://n4ftl01u.ecs.nasa.gov/SAN/OTHR/NISE.004/2013.09.30/
I created a custom projection class for the EASE-Grid, it seems to be working (the coastlines align well with the data).
When I try to add a Natural Earth feature, it returns an empty Matplotlib figure.
import gdal
import numpy as np
import matplotlib.pyplot as plt
import cartopy
import cartopy.feature

# projection class
class EASE_North(cartopy.crs.Projection):

    def __init__(self):
        # see: http://www.spatialreference.org/ref/epsg/3408/
        proj4_params = {'proj': 'laea',
                        'lat_0': 90.,
                        'lon_0': 0,
                        'x_0': 0,
                        'y_0': 0,
                        'a': 6371228,
                        'b': 6371228,
                        'units': 'm',
                        'no_defs': ''}
        super(EASE_North, self).__init__(proj4_params)

    @property
    def boundary(self):
        coords = ((self.x_limits[0], self.y_limits[0]), (self.x_limits[1], self.y_limits[0]),
                  (self.x_limits[1], self.y_limits[1]), (self.x_limits[0], self.y_limits[1]),
                  (self.x_limits[0], self.y_limits[0]))
        return cartopy.crs.sgeom.Polygon(coords).exterior

    @property
    def threshold(self):
        return 1e5

    @property
    def x_limits(self):
        return (-9000000, 9000000)

    @property
    def y_limits(self):
        return (-9000000, 9000000)

# read the data
ds = gdal.Open('D:/NISE_SSMISF17_20130930.HDFEOS')
# this loads the layers for both hemispheres
data = np.array([gdal.Open(name, gdal.GA_ReadOnly).ReadAsArray()
                 for name, descr in ds.GetSubDatasets() if 'Extent' in name])
ds = None

# mask anything other than sea ice
sea_ice_concentration = np.ma.masked_where((data < 1) | (data > 100), data, 0)

# plot
lim = 3000000
fig, ax = plt.subplots(figsize=(8, 8),
                       subplot_kw={'projection': EASE_North(), 'xlim': [-lim, lim], 'ylim': [-lim, lim]})

land = cartopy.feature.NaturalEarthFeature(
    category='physical',
    name='land',
    scale='50m',
    facecolor='#dddddd',
    edgecolor='none')

# ax.add_feature(land)
ax.coastlines()

# from the metadata in the HDF
extent = [-9036842.762500, 9036842.762500, -9036842.762500, 9036842.762500]
ax.imshow(sea_ice_concentration[0, :, :], cmap=plt.cm.Blues, vmin=1, vmax=100,
          interpolation='none', origin='upper', extent=extent, transform=EASE_North())
The script above works fine and produces the expected map. But when I uncomment the ax.add_feature(land) line, it fails without any error, only returning the empty figure. Am I missing something obvious?
Here is the IPython Notebook:
http://nbviewer.ipython.org/6779935
My Cartopy build is version 0.9 from Christoph Gohlke's website (thanks!).
edit:
Trying to save the figure does throw an exception:
fig.savefig(r'D:\test.png')
C:\Python27\Lib\site-packages\shapely\speedups\_speedups.pyd in shapely.speedups._speedups.geos_linearring_from_py (shapely/speedups/_speedups.c:2270)()
ValueError: A LinearRing must have at least 3 coordinate tuples
Examining the 'land' cartopy.feature reveals no issues: all polygons pass the .is_valid check and all rings (exterior and interior) have 4 or more coordinate tuples. So the input shapes don't seem to be the problem (and they work fine in PlateCarree()).
Maybe some rings (like those on the southern hemisphere) get 'corrupted' after transforming to EASE_North?
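One way I could test that hypothesis is to reproject the land geometries by hand and check their validity; a rough diagnostic sketch (using the land feature defined above):
crs = EASE_North()
for geom in land.geometries():
    # project each Natural Earth geometry into the EASE-North projection
    projected = crs.project_geometry(geom, cartopy.crs.PlateCarree())
    if not projected.is_valid:
        print('invalid geometry after projection:', projected.geom_type)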
edit2:
When I remove the built-in NE features and load the same shapefile (but with everything below 40N clipped), it works. So it seems like some sort of reprojection issue.
import cartopy.io.shapereader as shpreader

for state in shpreader.Reader(r'D:\ne_50m_land_clipped.shp').geometries():
    ax.add_geometries([state], cartopy.crs.PlateCarree(),
                      facecolor='#cccccc', edgecolor='#cccccc')
I'd have said that this was a bug. I'm guessing add_feature updates the matplotlib viewLim and the result is that the picture zooms in to a tiny area (which appears white unless you zoom out a lot).
Off the top of my head, I think the underlying behaviour has been improved in matplotlib, but cartopy is not yet making use of the new viewLim calculation. In the meantime I'd suggest setting the extents of your map manually with:
ax.set_extent(extent, crs=EASE_North())
HTH

pandas access axis by user-defined name

I am wondering whether there is any way to access axes of pandas containers (DataFrame, Panel, etc...) by user-defined name instead of integer or "index", "columns", "minor_axis" etc...
For example, with the following data container:
from numpy.random import randn
from pandas import DataFrame

df = DataFrame(randn(3, 2), columns=['c1', 'c2'], index=['i1', 'i2', 'i3'])
df.index.name = 'myaxis1'
df.columns.name = 'myaxis2'
I would like to do this:
df.sum(axis='myaxis1')
df.xs('c1', axis='myaxis2') # cross section
Also very useful would be:
df.reshape(['myaxis2','myaxis1'])
(in this case not so relevant, but it could become so if the dimension increases)
The reason is that I work a lot with multi-dimensional arrays of varying dimensions, with axes like "time", "variable", "percentile", etc., and the same piece of code is often applied to objects that can be a DataFrame, a Panel, or even a Panel4D or a DataFrame with a MultiIndex. For now I often test the shape of the object, or the general settings of the script, to know which axis is the relevant one for computing a sum or mean. But I think it would be much more convenient to forget about how the container is implemented in detail (DataFrame, Panel, etc.) and simply think about the nature of the problem (say I want to average over time; I do not want to think about whether I am working in "probabilistic" mode with several percentiles or in "deterministic" mode with a single time series).
Writing this post I have (re)discovered the very useful axes attribute. The above code could be translated into:
nms = [ax.name for ax in df.axes]
axid1 = nms.index('myaxis1')
axid2 = nms.index('myaxis2')
df.sum(axis=axid1)
df.xs('c1', axis=axid2) # cross section
and the "reshape" feature (does not apply to 3-d case though...):
newshape = ['myaxis2','myaxis1']
axid = [nms.index(nm) for nm in newshape]
df.swapaxes(*axid)
Well, I have to admit that I found these solutions while writing this post (and this is already very convenient), but they could be generalized to account for a DataFrame (or other container) with MultiIndex axes, by searching across all axes and labels...
In my opinion it would be a major improvement to the user-friendliness of pandas (ok, forgetting about the actual structure could have a performance cost, but a user worried about performance can be careful in how he/she organizes the data).
What do you think?
This is still experimental, but look at this page:
http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panelnd-experimental
import pandas
import numpy as np
from pandas.core import panelnd

MyPanel4D = panelnd.create_nd_panel_factory(
    klass_name='MyPanel4D',
    axis_orders=['axis4', 'axis3', 'axis2', 'axis1'],
    axis_slices={'axis3': 'items',
                 'axis2': 'major_axis',
                 'axis1': 'minor_axis'},
    slicer='Panel',
    stat_axis=2)

mp4d = MyPanel4D(np.random.rand(5, 4, 3, 2))
print mp4d
This results in:
<class 'pandas.core.panelnd.MyPanel4D'>
Dimensions: 5 (axis4) x 4 (axis3) x 3 (axis2) x 2 (axis1)
Axis4 axis: 0 to 4
Axis3 axis: 0 to 3
Axis2 axis: 0 to 2
Axis1 axis: 0 to 1
Here's the caveat: when you slice it like mp4d[0], you are going to get back a Panel, unless you create a hierarchy of custom objects (unfortunately, that will need to wait for 0.12-dev for support for 'renaming' Panel/DataFrame; it's non-trivial and there haven't been any requests for it).
So for higher-dimensional objects you can impose your own name structure. The axis aliasing should work like you are suggesting, but I think there are some bugs there.
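For what it's worth, your axes-lookup trick also generalizes into a small helper; a sketch, assuming every axis of the object has been given a unique name:
def axis_number(obj, name):
    # map a user-defined axis name to its positional axis number
    names = [ax.name for ax in obj.axes]
    return names.index(name)

df.sum(axis=axis_number(df, 'myaxis1'))
df.xs('c1', axis=axis_number(df, 'myaxis2'))  # cross section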