How to visualize SpaCy word embeddings as a scatter plot?

Each word in SpaCy is represented by a vector of length 300. How can I plot these words on a scatter plot to get a visual perspective on how close any 2 words are?

There's a new package called whatlies that does exactly this: https://rasahq.github.io/whatlies/
See a short spacy example: https://spacy.io/universe/project/whatlies
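For instance, a minimal sketch following the whatlies documentation (it assumes the whatlies package and the en_core_web_md spaCy model are installed):
from whatlies.language import SpacyLanguage

# Wrap a spaCy model; indexing it with a list of words returns an EmbeddingSet.
lang = SpacyLanguage("en_core_web_md")
words = ["cat", "dog", "fish", "king", "queen", "man", "woman"]
emb = lang[words]

# Project the 300-d vectors onto two reference words and draw an interactive scatter plot.
emb.plot_interactive(x_axis="man", y_axis="woman")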

When working with small-to-medium-sized corpora, Scattertext can be used to discover terms that distinguish between document categories. It also lets you create interactive scatter plots with non-overlapping term labels.
Install via pip: https://pypi.org/project/scattertext/
import spacy
import scattertext as st

nlp = spacy.load('en')
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=nlp).build()
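To go from the corpus to the actual interactive scatter plot, the scattertext README continues roughly like this (a sketch; convention_df is assumed to be the 2012 convention sample data set that ships with scattertext):
# convention_df is assumed to come from scattertext's bundled sample data:
# convention_df = st.SampleCorpora.ConventionData2012.get_data()
html = st.produce_scattertext_explorer(corpus,
                                       category='democrat',
                                       category_name='Democratic',
                                       not_category_name='Republican',
                                       width_in_pixels=1000)
open('Convention-Visualization.html', 'wb').write(html.encode('utf-8'))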


Where is the list of available built-in colormap names?

Where in the matplotlib documentation is the list of available built-in colormap names that can be passed as the name argument to matplotlib.cm.get_cmap(name)?
Choosing Colormaps in Matplotlib says:
Matplotlib has a number of built-in colormaps accessible via matplotlib.cm.get_cmap.
matplotlib.cm.get_cmap says:
matplotlib.cm.get_cmap(name=None, lut=None)
Get a colormap instance, defaulting to rc values if name is None.
name: matplotlib.colors.Colormap or str or None, default: None
https://www.kite.com/python/docs/matplotlib.pyplot.colormaps shows multiple names.
autumn: sequential linearly-increasing shades of red-orange-yellow
bone: sequential increasing black-white colormap with a tinge of blue, to emulate X-ray film
cool: linearly-decreasing shades of cyan-magenta
copper: sequential increasing shades of black-copper
flag: repetitive red-white-blue-black pattern (not cyclic at endpoints)
gray: sequential linearly-increasing black-to-white grayscale
hot: sequential black-red-yellow-white, to emulate blackbody radiation from an object at increasing temperatures
hsv: cyclic red-yellow-green-cyan-blue-magenta-red, formed by changing the hue component in the HSV color space
inferno: perceptually uniform shades of black-red-yellow
jet: a spectral map with dark endpoints, blue-cyan-yellow-red; based on a fluid-jet simulation by NCSA [1]
magma: perceptually uniform shades of black-red-white
pink: sequential increasing pastel black-pink-white, meant for sepia tone colorization of photographs
plasma: perceptually uniform shades of blue-red-yellow
prism: repetitive red-yellow-green-blue-purple-...-green pattern (not cyclic at endpoints)
spring: linearly-increasing shades of magenta-yellow
summer: sequential linearly-increasing shades of green-yellow
viridis: perceptually uniform shades of blue-green-yellow
winter: linearly-increasing shades of blue-green
However, simply googling 'matplotlib colormap names' does not seem to hit the right documentation. I suppose there is a page listing the names as an enumeration or as constant strings. Please help me find it.
There is some example code in the documentation (thanks to @Patrick Fitzgerald for posting the link in the comments, because it's not half as easy to find as it should be) which demonstrates how to generate a plot with an overview of the installed colormaps.
However, this uses an explicit list of maps, so it's limited to the specific version of matplotlib for which the documentation was written, as maps are added and removed between versions. To see what exactly your environment has, you can use this (somewhat crudely) adapted version of the code:
import numpy as np
import matplotlib.pyplot as plt

gradient = np.linspace(0, 1, 256)
gradient = np.vstack((gradient, gradient))

def plot_color_gradients(cmap_category, cmap_list):
    # Create figure and adjust figure height to number of colormaps
    nrows = len(cmap_list)
    figh = 0.35 + 0.15 + (nrows + (nrows - 1) * 0.1) * 0.22
    fig, axs = plt.subplots(nrows=nrows + 1, figsize=(6.4, figh))
    fig.subplots_adjust(top=1 - 0.35 / figh, bottom=0.15 / figh,
                        left=0.2, right=0.99)
    axs[0].set_title(cmap_category + ' colormaps', fontsize=14)
    for ax, name in zip(axs, cmap_list):
        ax.imshow(gradient, aspect='auto', cmap=plt.get_cmap(name))
        ax.text(-0.01, 0.5, name, va='center', ha='right', fontsize=10,
                transform=ax.transAxes)
    # Turn off *all* ticks & spines, not just the ones with colormaps.
    for ax in axs:
        ax.set_axis_off()

cmaps = [name for name in plt.colormaps() if not name.endswith('_r')]
plot_color_gradients('all', cmaps)
plt.show()
This just plots all of them, without regard to categories.
Since plt.colormaps() produces a list of all the registered map names, this version only removes the names ending in '_r' (those are the reversed versions of the others) and plots the rest.
That's still a fairly long list, but you can have a look and then manually update/remove items from cmaps to narrow it down to the ones you would consider for a given task.
You can also automatically split the list into monochrome and non-monochrome maps, because each colormap exposes that property via its is_gray() method:
cmaps_mono = [name for name in cmaps if plt.get_cmap(name).is_gray()]
cmaps_color = [name for name in cmaps if not plt.get_cmap(name).is_gray()]
That should at least give you a decent starting point.
It'd be nice if there was some way within matplotlib to select just certain types of maps (categorical, perceptually uniform, suitable for colourblind viewers ...), but I haven't found a way to do that automatically.
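One crude heuristic for the categorical case (an assumption about how the built-in maps happen to be implemented, not a matplotlib API): qualitative maps such as 'tab10' or 'Set1' are ListedColormaps with only a handful of entries, while continuous maps like 'viridis' carry 256. Filtering on that catches most of them:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def looks_qualitative(name, max_colors=20):
    # Heuristic: small ListedColormaps are usually the categorical ones;
    # continuous maps like 'viridis' are ListedColormaps with 256 entries.
    cmap = plt.get_cmap(name)
    return isinstance(cmap, ListedColormap) and cmap.N <= max_colors

cmaps_qualitative = [n for n in plt.colormaps() if looks_qualitative(n)]
print(cmaps_qualitative)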
You can use my CMasher to make simple colormap overviews of a list of colormaps.
In your case, if you want to see what every colormap in MPL looks like, you can use the following:
import cmasher as cmr
import matplotlib.pyplot as plt
cmr.create_cmap_overview(plt.colormaps(), savefig='MPL_cmaps.png')
This will give you an overview of all colormaps that are registered in MPL, which will be all built-in colormaps plus all colormaps the CMasher package adds.

Using spacy visualizer with custom data

I want to visualize a sentence using Spacy's named entity visualizer. I have a sentence with some user defined labels over the tokens, and I want to visualize them using the NER rendering API.
I don't want to train and produce a predictive model, I have all needed labels from an external source, just need the visualization without messing too much with front-end libraries.
Any ideas how?
Thank you
You can manually modify the list of entities (doc.ents) and add new spans using token offsets. Be aware that entities can't overlap at all.
import spacy
from spacy.tokens import Span

nlp = spacy.load('en', disable=['ner'])
doc = nlp("I see an XYZ.")
# Span(doc, start, end, label) takes token offsets; end is exclusive.
doc.ents = list(doc.ents) + [Span(doc, 3, 4, "NEWENTITYTYPE")]
print(doc.ents[0], doc.ents[0].label_)
Output:
XYZ NEWENTITYTYPE
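From there you can hand the modified doc straight to displacy for the NER visualization; a minimal sketch (displacy also has a manual mode that takes plain dicts, so you can skip spaCy objects entirely):
from spacy import displacy

# Renders inline in a Jupyter notebook; use displacy.serve(...) from a script.
displacy.render(doc, style="ent")

# Manual mode: pass your external labels directly as character offsets.
ex = [{"text": "I see an XYZ.",
       "ents": [{"start": 9, "end": 12, "label": "NEWENTITYTYPE"}],
       "title": None}]
displacy.render(ex, style="ent", manual=True)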

Draw an ordinary plot with the same style as in plt.hist(histtype='step')

The method plt.hist() in pyplot has a way to create a 'step-like' plot style when calling
plt.hist(data, histtype='step')
but the 'ordinary' methods that plot raw data without processing (plt.plot(), plt.scatter(), etc.) apparently have no style option to obtain the same result. My goal is to plot a given set of points using that style, without making a histogram of these points.
Is that achievable with standard library methods for plotting a given 2-D set of points?
I also think there is at least one hack (generating a fake distribution whose histogram would equal our data) and a 'low-level' solution of drawing each segment manually, but neither of these ways seems favorable.
Maybe you are looking for drawstyle="steps".
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
data = np.cumsum(np.random.randn(10))
plt.plot(data, drawstyle="steps")
plt.show()
Note that this is slightly different from histograms, because the lines do not go to zero at the ends.
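As an aside, continuing the example above: drawstyle also accepts "steps-pre", "steps-mid" and "steps-post" to control where each step occurs, and plt.step() is a shorthand with the same options via its where parameter:
# Same data as above; only the step placement changes.
plt.plot(data, drawstyle="steps-mid")
plt.step(range(len(data)), data, where="post")
plt.show()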

Difference between matplotlib.contourf() and Matlab's contourf() - odd sharp edges in matplotlib

I am a recent migrant from Matlab to Python and have recently worked with NumPy and Matplotlib. I recoded one of my scripts from Matlab, which employs Matlab's contourf function, into Python using matplotlib's corresponding contourf function. I managed to replicate the output in Python, except that the contourf plots are not exactly the same, for a reason that is unknown to me.

When I run the contourf function in matplotlib, I get an otherwise nice figure, but it has sharp edges on the contour levels at the top and bottom which should not be there (see Figure 1 below, matplotlib output). When I export the arrays I used in Python to Matlab (i.e. exactly the same data set that was used to generate the matplotlib contourf plot) and use Matlab's contourf function, I get a slightly different output, without those sharp contour-level edges (see Figure 2 below, Matlab output). I used the same number of levels in both figures. In Figure 3 I have made a scatter plot of the same data, which shows that there are no such sharp edges in the data as shown in the contourf plot (I added contour lines just for reference).

An example data set can be downloaded through the Dropbox link given below. It contains three txt files: X, Y, Z. Each is a 500x500 array which can be used directly with contourf(), i.e. plt.contourf(X, Y, Z, ...). The code that I used was
plt.contourf(X,Y,Z,10, cmap=plt.cm.jet)
plt.contour(X,Y,Z,10,colors='black', linewidths=0.5)
plt.axis('equal')
plt.axis('off')
Does anyone have an idea why this happens? I would appreciate any insight on this!
Cheers,
Jussi
Below are the details of my setup:
Python 3.7.0
IPython 6.5.0
matplotlib 2.2.3
[Figure 1: Matplotlib output]
[Figure 2: Matlab output]
[Figure 3: Matplotlib scatter plot]
[Link to data set]
The confusing thing about the Matlab plot is that its colorbar shows many more levels than there actually are in the plot. Hence you don't see the actual intervals that are contoured.
You would achieve the same result in matplotlib by choosing 12 instead of 11 levels.
import numpy as np
import matplotlib.pyplot as plt
X, Y, Z = [np.loadtxt("data/roundcontourdata/{}.txt".format(i)) for i in list("XYZ")]
levels = np.linspace(Z.min(), Z.max(), 12)
cntr = plt.contourf(X,Y,Z,levels, cmap=plt.cm.jet)
plt.contour(X,Y,Z,levels,colors='black', linewidths=0.5)
plt.colorbar(cntr)
plt.axis('equal')
plt.axis('off')
plt.show()
So in conclusion, both plots are correct and show the same data. Just the levels being automatically chosen are different. This can be circumvented by choosing custom levels depending on the desired visual appearance.

How to perform k-means clustering from Gensim TF-IDF values

I am using Gensim for a vector space model. After creating a dictionary and corpus with Gensim, I calculated the TF-IDF (term frequency * inverse document frequency) values using the following lines:
Term_IDF = TfidfModel(corpus)
corpus_tfidf = Term_IDF[corpus]
corpus_tfidf contains a list of lists holding term ids and the corresponding TF-IDF values. I then separated the TF-IDF values from the ids using the following lines:
IDS = []
tfidfmtx = []
for doc in corpus_tfidf:
    for ids, tfidf in doc:
        IDS.append(ids)
        tfidfmtx.append(tfidf)
Now I want to use k-means clustering, so I want to compute cosine similarities on the TF-IDF matrix. The problem is that Gensim does not produce a square matrix, so the following line generates an error. How can I get a square matrix from Gensim to calculate the similarities of all the documents in the vector space model? Also, how can I convert the TF-IDF matrix (which in this case is a list of lists) into a 2D NumPy array? Any comments are much appreciated.
dumydist = 1 - cosine_similarity(tfidfmtx)
When you fit your corpus to a Gensim Dictionary, get the number of documents and tokens in the dictionary:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
Transform into bow:
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]
Transform into tf-idf:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]
Now you can transform into sparse/dense matrix:
from gensim.matutils import corpus2dense, corpus2csc
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
Now fit your model using either the sparse or the dense matrix, transposed so that rows are documents:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=7)
clusters = model.fit_predict(corpus_tfidf_dense.T)
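Coming back to the cosine-similarity error from the question: once the documents are rows, the similarity matrix is square (num_docs x num_docs). A sketch using the dense matrix from above:
from sklearn.metrics.pairwise import cosine_similarity

doc_term = corpus_tfidf_dense.T             # shape: (num_docs, num_terms)
dumydist = 1 - cosine_similarity(doc_term)  # square (num_docs, num_docs) distance matrix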
To create a document-term matrix from Gensim, you may use matutils.corpus2csc.
corpus: a list of lists (a Gensim corpus)
import gensim

scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
full_matrix = scipy_csc_matrix.toarray()  # corpus2csc already returns a scipy CSC matrix
You may want to keep the scipy sparse format if your corpus is very large.
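If you do stay sparse, scikit-learn's KMeans accepts CSR input directly, so the dense conversion can be skipped entirely; a sketch (n_clusters=7 is just an example value):
from sklearn.cluster import KMeans

doc_term_sparse = scipy_csc_matrix.T.tocsr()  # transpose so documents are rows
labels = KMeans(n_clusters=7).fit_predict(doc_term_sparse)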