Feed large text to PyTextRank - spacy

I would like to use PyTextRank for keyphrase extraction. How can I feed 5 million documents (each consisting of a few paragraphs) to the package?
This is the example I see on the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
ic(phrase.rank, phrase.count, phrase.text)
ic(phrase.chunks)
Is my only option to concatenate the several million documents into a single string and pass it to nlp(text)? I do not think I can use nlp.pipe(texts), as I want to create one network by computing words/phrases across all documents.

No, instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytextrank to extract entities.
For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45
One question would be how you are associating the extracted entities with documents? Are these being collected into a dataset, or perhaps a database or key/value store?
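For a single machine, here is a rough sketch of what the batching could look like (assuming spaCy 3.x with pytextrank, and a hypothetical iter_documents() generator that yields (text, doc_id) pairs); the same per-document work is what Spark/Dask/Ray would distribute across a cluster:
import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

# hypothetical generator: yields (text, doc_id) pairs, e.g. streamed from disk or a database
def iter_documents():
    yield "Compatibility of systems of linear constraints ...", "doc-00001"
    yield "Criteria of compatibility of a system of linear Diophantine equations ...", "doc-00002"

phrases_by_doc = {}
# n_process/batch_size parallelize on one machine; as_tuples carries the doc_id through the pipe
for doc, doc_id in nlp.pipe(iter_documents(), as_tuples=True, batch_size=100, n_process=4):
    phrases_by_doc[doc_id] = [(p.text, p.rank, p.count) for p in doc._.phrases]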
However these results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project kglab https://github.com/DerwenAI/kglab was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/
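As a rough illustration of that graph-building step (networkx is used here as a generic stand-in, not kglab's actual API; phrases_by_doc is the hypothetical structure from the sketch above):
import itertools
import networkx as nx

# phrases_by_doc: {doc_id: [(phrase_text, rank, count), ...]} collected per document
G = nx.Graph()
for doc_id, phrases in phrases_by_doc.items():
    texts = {p for p, _, _ in phrases}
    for a, b in itertools.combinations(sorted(texts), 2):
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)  # count how often two phrases co-occur in a document

# e.g. inspect the strongest co-occurrences
top = sorted(G.edges(data=True), key=lambda e: e[2]["weight"], reverse=True)[:20]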
FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank, and there are several videos online (under Graph Data Science) from previous conference tutorials. We also hold monthly public office hours through https://www.knowledgegraph.tech/ – message me (#pacoid on Tw) for details.

Related

Is There a Quick, Efficient Way to Add Large Numbers of Labels in Either ArcGIS or QGIS?

In 2007, when I was young and foolish and before I knew about Open Street Map, I started an urban historical map project. I was working in Illustrator, it was going to be an interactive Flash piece, and my process was to draw the maps first, with the thought that I'd label some, but not all, of the streets later on.
As we know, Flash began to die around 2010, and I put the project away for a number of years. I picked it up again a couple of years ago and continued my earlier practice of just drawing streets and water features, this time with the intention of making it a conventional web map. Now I'm pretty close to finishing the drawing of a five-layer (1871, 1903, 1932, 1952 and 2016) historical map of a medium-sized city, though it still lacks labels.
My problem now is how to add large numbers of labels, many of them duplicates. There could be as many as 10,000 for all five layers, though as a practical matter I may have to settle for a smallish fraction of that number. Based on web searches I gather my workflow is unusual and that mine is therefore an unusual problem.
I've exported my maps and brought them into QGIS and played with the software a little. The process of adding labels to objects doesn't seem terribly efficient or user-friendly, but that's probably due to my unfamiliarity with the program.
So my question is this: Are there any tricks to speed up the painful process of adding large numbers of duplicate labels in either QGIS or ArcGIS? Since so many of the streets exist in all five layers, functionality like the ability to select multiple objects in different layers and edit their attributes simultaneously in the Attribute Table would be a godsend. (Doesn't seem possible.) So would the ability to copy the attributes from one object and paste them onto other objects. Or the ability to do either of these things in Illustrator via a plugin and then export the data along with the shapes to a GIS program.
Thanks for your help!
If I understand the issue correctly, I think there are several different solutions.
Typically, for a spatial layer in ArcGIS or QGIS you define how to label all features in a layer once, by defining a label scheme that is applied across every feature, whether there is 1 or 1 million. This assumes that each feature in the layer has one or more attributes in the layer's associated table.
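As a rough illustration only (assuming QGIS 3.x and a layer whose attribute table has a field holding the street name; the layer and field names here are hypothetical), attribute-driven labeling can be switched on from the QGIS Python console like this:
# run inside the QGIS 3.x Python console; "streets_1903" and "name" are hypothetical
from qgis.core import QgsProject, QgsPalLayerSettings, QgsVectorLayerSimpleLabeling

layer = QgsProject.instance().mapLayersByName("streets_1903")[0]

settings = QgsPalLayerSettings()
settings.fieldName = "name"  # label every feature in the layer from its 'name' attribute

layer.setLabeling(QgsVectorLayerSimpleLabeling(settings))
layer.setLabelsEnabled(True)
layer.triggerRepaint()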
How are you converting the Illustrator vectors to a spatial layer? DXF?
You will likely have better/faster responses to this question by posting it to the GIS Stack exchange. https://gis.stackexchange.com/

How many objects can these Computer Vision APIs detect?

https://learn.microsoft.com/fr-fr/azure/cognitive-services/computer-vision/concept-object-detection
https://cloud.google.com/vision/docs/object-localizer
I would like to know how many, and which, objects are recognizable using these APIs, but I can't find any mention of that.
I found that the Google API uses https://developers.google.com/knowledge-graph/, which is based on schema.org types, but I don't really understand what that's all about.
I'm sorry but as far as I know, there is no fixed list of classes that Azure Computer Vision is able to detect.
By the way, even if there were one, the list evolves on a regular basis (and no schedule is announced).
In any case, there are limitations (see doc here):
It's important to note the limitations of object detection so you can avoid or mitigate the effects of false negatives (missed objects) and limited detail.
Objects are generally not detected if they're small (less than 5% of the image).
Objects are generally not detected if they're arranged closely together (a stack of plates, for example).
Objects are not differentiated by brand or product names (different types of sodas on a store shelf, for example). However, you can get brand information from an image by using the Brand detection feature.
If you want to detect specific objects, I would highly suggest using Custom Vision (doc / overview here) rather than Computer Vision: with Custom Vision you can train a model on your own images to match what you are trying to detect.
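There is no call that enumerates the detectable classes; you only see what comes back for each image. A minimal sketch (assuming the azure-cognitiveservices-vision-computervision Python package; the endpoint, key, and image URL are placeholders) that lists the objects detected in one photo:
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# placeholder credentials -- substitute your own resource's endpoint and key
endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"

client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(key))

result = client.detect_objects("https://example.com/some-photo.jpg")
for obj in result.objects:
    # each detection carries a tag name, a confidence score, and a bounding box
    print(obj.object_property, obj.confidence, obj.rectangle.x, obj.rectangle.y)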

How to make testing data manually for clustering of citation records?

I'm doing research on the author name disambiguation problem and want to run some experiments: I want to perform clustering on citation records. My dataset consists of 2000 XML records. I need testing data, but the dataset I'm using is not popular, so I need to create the testing data manually, and I don't know how to do that. I need instructions on how to create testing data manually. Note: I want to compare the performance of a set of techniques for solving the author name disambiguation problem, so I must perform testing.
Even though it is not really clear what kind of testing you want to perform, the general answer to the issue at hand - artificially creating more data from the data you already have - is the bootstrap. It is a technique in which you sample with replacement from your dataset as many times as you want: elements are picked at random, repeatedly, until you have a sample of the size you want. The sample you get can be larger than your original dataset but should have similar statistical properties to it. Bootstrap sampling is available in sklearn.
P.S. Keep in mind that this solution is not optimal - the best solution to this problem is to actually get more real data somehow.
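For example, a minimal sketch with sklearn.utils.resample (the record list here is hypothetical):
from sklearn.utils import resample

# hypothetical list of parsed citation records (e.g. dicts built from the XML)
records = [{"id": i, "author": f"author_{i % 50}"} for i in range(2000)]

# one bootstrap sample: 5000 draws with replacement from the 2000 records
bootstrap_sample = resample(records, replace=True, n_samples=5000, random_state=42)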
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have features for each author / publication. Now you give the classifier two of those feature vectors, and it classifies them as "same author" or "different authors".
Training / testing data
Once this is a binary classification problem, testing suddenly becomes simple: just use one of the measures so often used in the literature (accuracy, precision, recall, confusion matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically, and that authors have an identifier? Then you can simply generate positive examples by pairing records whose author identifier is the same, and negative examples by pairing records with different authors.
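A rough sketch of that pairing step (the record fields are hypothetical):
import itertools

# hypothetical parsed records: each has an author identifier and a feature vector
records = [
    {"author_id": "a1", "features": [0.1, 0.3]},
    {"author_id": "a1", "features": [0.2, 0.4]},
    {"author_id": "a2", "features": [0.9, 0.7]},
]

pairs, labels = [], []
for r1, r2 in itertools.combinations(records, 2):
    pairs.append((r1["features"], r2["features"]))
    labels.append(1 if r1["author_id"] == r2["author_id"] else 0)

# negative pairs usually far outnumber positives, so you may want to subsample them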
Otherwise, you can have a look at http://dblp.uni-trier.de/. Although there are likely many publications grouped under the same author that should really be separate people, they do distinguish authors not only by name but also give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.
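One way that could look (a sketch only, assuming Keras/TensorFlow, with hypothetical feature vectors and author labels standing in for your real data):
import numpy as np
from tensorflow import keras

# hypothetical data: 128-dim feature vectors for 100 known authors with >30 publications each
X = np.random.rand(4000, 128).astype("float32")
y = np.random.randint(0, 100, size=4000)

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(128,)),
    keras.layers.Dense(64, activation="relu", name="embedding"),
    keras.layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=5, batch_size=32)

# drop the softmax layer: use the penultimate layer's output as author features
embedder = keras.Model(inputs=model.input, outputs=model.get_layer("embedding").output)
author_features = embedder.predict(X)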

Searching for semantic relatedness tool

I need a tool that computes the semantic relatedness between two words. Do you have any idea of a tool or source code that does this? I am trying WordNet similarity (http://maraca.d.umn.edu/cgi-bin/similarity/similarity.cgi), but several words are missing; I need something richer in terms of concepts.
What you described is more like a research project. :-)
If they are just words, not phrases, the most recent technology is word embedding. You can think of it as converting words to high-dimensional vectors (from 200 to 1000 dimensions) by training on millions of documents.
https://code.google.com/archive/p/word2vec
The code has been archived due to proprietary issues, but you can still download it and run it yourself. Good luck.
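For example, a minimal sketch with gensim (assuming you have downloaded a pretrained word2vec model such as the GoogleNews vectors; the file path is a placeholder):
from gensim.models import KeyedVectors

# placeholder path to the pretrained GoogleNews word2vec binary
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# cosine similarity between the two word vectors, roughly a relatedness score
print(vectors.similarity("car", "automobile"))
print(vectors.most_similar("car", topn=5))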

Scaling up SURF lookups

I am currently trying to recognise DVD covers in generic photos. My initial test involved using 100 DVD covers and 10 test cases of photos that contained them, and with some tweaking of the find_obj.cpp example in OpenCV I was able to get recognition working.
However now I need to do this on a much larger database, and I am aware that the FLANN method will not scale up well to meet this requirement. How do people here recommend I scale up my SURF recognition in an SQL database?
If you really want to scale your system to several orders of magnitude, nearest neighbors search (FLANN) will not be sufficient.
In such a case, what you need is to build a visual vocabulary (a.k.a. bag of words) by quantizing your descriptors, and to create an inverted index.
I recommend you refer to the "Scalable Recognition with a Vocabulary Tree" paper, which is the reference publication on this topic.
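As a rough sketch of the bag-of-visual-words plus inverted-index idea (assuming OpenCV and scikit-learn; ORB stands in for SURF, which needs the non-free opencv-contrib build, and flat k-means stands in for the paper's hierarchical vocabulary tree; image paths are placeholders):
import collections
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

orb = cv2.ORB_create()  # local feature detector/descriptor, used here in place of SURF

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = orb.detectAndCompute(img, None)
    return des

# placeholder database of DVD-cover image paths
cover_paths = ["cover_000.jpg", "cover_001.jpg", "cover_002.jpg"]
cover_des = [descriptors(p) for p in cover_paths]

# 1) build the visual vocabulary by clustering all descriptors into k "visual words"
kmeans = MiniBatchKMeans(n_clusters=1000, random_state=0)
kmeans.fit(np.vstack(cover_des).astype(np.float32))

# 2) inverted index: visual word -> set of cover images containing it
inverted = collections.defaultdict(set)
for idx, des in enumerate(cover_des):
    for word in kmeans.predict(des.astype(np.float32)):
        inverted[word].add(idx)

# 3) query: quantize the photo's descriptors and vote for candidate covers
def query(path, top_n=5):
    votes = collections.Counter()
    for word in kmeans.predict(descriptors(path).astype(np.float32)):
        votes.update(inverted[word])
    return [cover_paths[i] for i, _ in votes.most_common(top_n)]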