Text extraction with the spaCy Python library in Jupyter

I am using a Jupyter Notebook for text mining and data processing. After extracting the text from a PDF file, I used the spaCy library to extract all the proper nouns, but the problem is that it sometimes gives me the same name twice.
This is the PDF file I'm using:
https://drive.google.com/file/d/16LicTfEuQQRwVwTyt2bo1QfWS5vapC0c/view?usp=sharing
============Code=====================
from spacy.matcher import Matcher

# nlp and doc are assumed to be defined earlier: nlp is the loaded pipeline,
# and doc is the text extracted from the PDF, run through nlp
matcher = Matcher(nlp.vocab)
pattern1 = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
pattern2 = [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}]
matcher.add('FULL_NAME', [pattern1])
matcher.add('FULL_NAME', [pattern2])
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
===============Output=====================
Sarra Ben
Sarra Ben Moussa
Ben Moussa
Ghofrane Challouf
Malika Ben
Malika Ben Slima
Ben Slima
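If it helps, the Matcher deliberately returns every match, including overlapping ones, so a three-token name also triggers the two-token pattern twice. One way to keep only the longest span is spacy.util.filter_spans (spaCy v2.1+); in v3 you can instead pass greedy="LONGEST" to matcher.add. A minimal sketch, reusing doc and matcher from the code above:
from spacy.util import filter_spans

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in filter_spans(spans):  # drops the shorter, overlapping matches
    print(span.text)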

Related

spaCy IS_DIGIT or LIKE_NUM not working as expected for certain characters

I am trying to extract some numbers using the IS_DIGIT and LIKE_NUM attributes, but it seems to behave a bit strangely for a beginner like me.
The matcher is only able to detect the numbers when the 5-character string ends in M, G, or T. If it is any other character, the IS_DIGIT and LIKE_NUM attributes do not detect it. What am I missing here?
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)
# prints only 1231, 1232 and 1236
It may be helpful to just check which tokens are true for LIKE_NUM, like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_NUM": True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
for tok in doc:
    print(tok, tok.like_num)
Here you'll see that some of your tokens are split in two and some aren't. The only tokens you match are the ones that consist of digits alone.
Now, why are M, G, and T split off, while H, J, and V aren't? Because they are treated as units, as in mega-, giga-, or terabytes.
This behaviour with units may seem inconsistent, but it was chosen to be consistent with the training data used for the English models. If you need to change it for your application, see the section of the docs that covers customizing the tokenizer's exceptions.
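For illustration, one way to get the other trailing letters split off as well is to extend the tokenizer's suffix rules. This is only a minimal sketch; the regex treating any capital letter after a digit as a unit suffix is my own assumption for this example, not something the models ship with:
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# assumption: split off any single capital letter that directly follows a digit,
# the same way M, G and T are already split off as units
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])[A-Z]"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
print([(tok.text, tok.like_num) for tok in doc])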

Export vectors from fastText to spaCy

I downloaded the 1.5 GB fastText.cc vectors and used the example code from the spaCy examples (vectors_fast_text). I executed the following command in the terminal:
python config/vectors_fast_text.py vectors_loc data/vectors/wiki.pt.vec
After a few minutes with the processor at 100%, I received the following text:
class colspan 0.32231358
What happens from here? How can I export these vectors elsewhere, for example to use with my training templates on AWS S3?
I modified the example script to load the existing data for my language, read the word2vec file, and at the end write everything to a folder (this folder needs to exist already).
Here is vectors_fast_text.py:
[LANGUAGE] = your language code, for example "pt"
[FILE_WORD2VEC] = the path to your vectors file, for example "./data/word2vec.txt"
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language

#plac.annotations()
def main():
    nlp = spacy.load('[LANGUAGE]')
    with open("[FILE_WORD2VEC]", 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        count = 0
        for line in file_:
            count += 1
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            print("{} - {}".format(count, word))
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vector to the vocab
    nlp.to_disk("./models/new_nlp/")

if __name__ == '__main__':
    plac.call(main)
Type in the terminal:
python vectors_fast_text.py
It will take about 10 minutes to finish, depending on the size of the word2vec file. The script prints each word as it goes, so you can follow the progress.
After that, type in the terminal:
python -m spacy package ./models/new_nlp/ ./my_models/
python setup.py sdist
This leaves you with an installable archive (a .tar.gz), which you can install with pip:
pip install /path/to/pt_example_model-1.0.0.tar.gz
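Once installed, the packaged model can be loaded like any other spaCy model; a quick sanity check (the package name here is assumed from the tarball above):
import spacy

nlp = spacy.load("pt_example_model")
print(nlp.vocab.vectors.shape)  # confirms the fastText vectors were included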
A detailed tutorial can be found on the spaCy website:
https://spacy.io/usage/training

How do I split a Chinese string into characters using TensorFlow?

I want to use tf.data.TextLineDataset() to read Chinese sentences, then use the map() function to split them into single words, but tf.split doesn't work for Chinese.
I hope someone can kindly help with this issue.
This is my current solution:
1. Read the Chinese sentences from the file in UTF-8 encoding.
2. Tokenize the sentences with a tool such as jieba (a short sketch of steps 2 and 5 follows this list).
3. Construct the vocab table.
4. Convert the source/target sentences according to the vocab table.
5. Convert to a dataset using from_tensor_slices.
6. Get an iterator from the dataset.
7. Do other things.
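A minimal sketch of the tokenize-then-build-a-dataset steps, where "corpus_zh.txt" is a hypothetical file with one UTF-8 Chinese sentence per line:
import jieba
import tensorflow as tf

# hypothetical corpus file, one Chinese sentence per line (UTF-8)
with open("corpus_zh.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f]

# segment with jieba and join the tokens with spaces so each example is one string
tokenized = [" ".join(jieba.lcut(s)) for s in sentences]

dataset = tf.data.Dataset.from_tensor_slices(tokenized)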
If TextLineDataset is used to load the Chinese sentences directly, the content of the dataset looks strange, displayed as a byte stream.
Maybe we could consider every byte as one character, as in English-like languages.
Can anyone confirm this or offer another suggestion?
The above answer is one common option when handling non-English languages like Chinese, Korean, Japanese, etc.
You can also use the code below.
By the way, as you know, TextLineDataset reads text content as byte strings.
So if we want to handle Chinese, we first need to decode it to Unicode.
Unfortunately, there is no such option in TensorFlow itself.
We need to use another method, such as py_func, to do this.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf

def preprocess_func(x):
    # x arrives as a byte string; decode it and separate the characters with "*"
    ret = "*".join(x.decode('utf-8'))
    return ret

str = tf.py_func(
    preprocess_func,
    [tf.constant(u"我爱,南京")],
    tf.string)

with tf.Session() as sess:
    value = sess.run(str)
    print(value.decode('utf-8'))
output: 我*爱*,*南*京
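For completeness, a minimal sketch of plugging the same function into a TextLineDataset pipeline (TensorFlow 1.x assumed, and "corpus_zh.txt" is a hypothetical file of UTF-8 Chinese lines):
dataset = tf.data.TextLineDataset("corpus_zh.txt")
# each line is handed to the Python function as a byte string and comes back
# with its characters separated by "*"
dataset = dataset.map(
    lambda line: tf.py_func(preprocess_func, [line], tf.string))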

How to get a 'dobj' in spaCy

In the following tweet, the spaCy dependency tagger states that healthcare market (NN) is the dobj of disrupt (VB). As these two terms are connected, I would like to extract them as one phrase. Is there any way to navigate the parse tree so I can extract the dobj of a word? If I do the following I get market but not 'healthcare market'.
from spacy.en import English
from spacy.symbols import nsubj, VERB, dobj

nlp = English()
doc = nlp('Juniper Research: AI start-ups set to disrupt healthcare market, with $800 million to be spent on CAD Systems by 2022')
for possible_subject in doc:
    if possible_subject.dep == dobj:
        print(possible_subject.text)
You can do this as below, using noun chunks:
for np in doc.noun_chunks:
    if np.root.dep == dobj:
        print(np.root.text)
        print(np.text)
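To get the verb together with the full object phrase, you can also walk from the chunk's root to its head. A minimal sketch, assuming a current spaCy pipeline (en_core_web_sm) rather than the older spacy.en import used in the question:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Juniper Research: AI start-ups set to disrupt healthcare market, with $800 million to be spent on CAD Systems by 2022')

for chunk in doc.noun_chunks:
    if chunk.root.dep_ == "dobj":
        # chunk.root.head is the verb that governs the object
        print(chunk.root.head.text, chunk.text)  # typically: disrupt healthcare market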

Chronic (Ruby NLP date/time parser) for Python?

Does anyone know of a library like Chronic, but for Python?
Thanks!
Have you tried parsedatetime?
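For reference, a minimal sketch of parsedatetime usage (the input string is just an illustration):
import parsedatetime

cal = parsedatetime.Calendar()
time_struct, parse_status = cal.parse("June 9 at 3pm")
print(time_struct, parse_status)  # parse_status is 0 when nothing was recognized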
You can try Stanford NLP's SUTime. Related Python bindings are here: https://github.com/FraBle/python-sutime
Make sure that all the Java dependencies are installed.
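Roughly along the lines of the python-sutime README (constructor options vary between versions, so treat this as a sketch rather than the exact API):
import json
from sutime import SUTime

sutime = SUTime(mark_time_ranges=True, include_range=True)
print(json.dumps(sutime.parse("I need a desk for tomorrow from 2pm to 3pm"), indent=4))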
I was talking to Stephen Russett of Chronic. I came up with a Python example after he suggested tokenization.
Here is the Python example; you feed its output into Chronic.
import nltk
import re

# tokenize
sentence = 'Available June 9 -- August first week'
tokens = nltk.word_tokenize(sentence)
parts_of_speech = nltk.pos_tag(tokens)
print(parts_of_speech)

# whitelist of words to keep regardless of their tag
white_list = ['first']

# keep only proper nouns (NNP) and cardinal numbers (CD)
approved_tags = ['NNP', 'CD']

filtered = []
for word in parts_of_speech:
    if any(x in word[1] for x in approved_tags):
        filtered.append(word[0])
    elif any(x in word[0] for x in white_list):
        # if the word is in the whitelist, keep it too
        filtered.append(word[0])
print(filtered)

# normalize to alphanumeric only
normalized = re.sub(r'\s\W+', ' ', ' '.join(filtered))
print(normalized)