How to extract only English words from a big text corpus using nltk? - pandas

I want to remove all words that are not dictionary English words from a text corpus. I have removed stopwords, tokenized and count-vectorized the data. I now need to extract only the English words and attach them back to the dataframe.
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if not item.isdigit()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in (new_stop_words)]))
cv = CountVectorizer( max_features = 200,analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
Sample dump of the file I am using:
https://www.dropbox.com/s/allhfdxni0kfyn6/Test.csv?dl=0

After you first tokenize your text corpus, you could instead stem the word tokens:
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
SnowballStemmer is the algorithm that performs the stemming:
stemming is just the process of breaking a word down into its root.
passing the argument 'english' selects the porter2 stemming algorithm; more precisely, this 'english' argument maps to stem.snowball.EnglishStemmer.
(The porter2 stemmer is considered to be better than the original Porter stemmer.)
 
stems = [stemmer.stem(t) for t in tokenized]
Above, I define a list comprehension, which executes as follows:
the list comprehension loops over our tokenized input list tokenized
(tokenized can also be any other iterable input instance)
its action is to call the .stem method on each tokenized word using the SnowballStemmer instance stemmer
it then collects the resulting stems
i.e., it is a list that should contain only stemmed English word tokens
 
Caveat:   the list comprehension could conceivably include identical inflected words from other languages that English descends from, because porter2 would mistakenly treat them as English words
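
For a self-contained illustration, here is a minimal sketch that tokenizes a sample sentence with nltk.word_tokenize and then stems each token. The sample text and variable names are just placeholders, and the nltk.download call is only needed if the tokenizer data is not already installed.

import nltk
from nltk.stem.snowball import SnowballStemmer

nltk.download('punkt')  # tokenizer data, needed once

stemmer = SnowballStemmer(language="english")

# placeholder text standing in for one document of your corpus
text = "The runners were running quickly through the crowded streets"
tokenized = nltk.word_tokenize(text)

stems = [stemmer.stem(t) for t in tokenized]
print(stems)
# roughly: ['the', 'runner', 'were', 'run', 'quick', 'through', 'the', 'crowd', 'street']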

Down To The Essence
I had a VERY similar need. Your question appeared in my search. I felt I needed to look further, and I found THIS. I did a bit of modification for my specific needs (only English words from TONS of technical data sheets = no numbers or test standards or values or units, etc.). After much pain with other approaches, the code below worked. I hope it can be a good launching point for you and others.
import nltk
from nltk.corpus import stopwords

# both corpora must be downloaded once: nltk.download('words'), nltk.download('stopwords')
words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')

file_name = 'Full path to your file'
with open(file_name, 'r') as f:
    text = f.read()
    text = text.replace('\n', ' ')

new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)
print(new_text)
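
Since the original question works on a pandas column, here is a hedged adaptation that applies the same filter row by row, reusing the words and stop_words sets defined above. The DataFrame and column names ('Adj_Addr', 'Clean_addr') are taken from the question and otherwise assumed.

import pandas as pd

def keep_english(text):
    return " ".join(w for w in nltk.wordpunct_tokenize(str(text))
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)

# assumes `data` is the DataFrame from the question
data['Clean_addr'] = data['Adj_Addr'].apply(keep_english)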

I used the pyenchant library to do this.
import enchant
import nltk
from tqdm import tqdm

d = enchant.Dict("en_US")

def get_eng_words(data):
    eng = []
    for sample in tqdm(data):
        sentence = ''
        word_tokens = nltk.word_tokenize(sample)
        for word in word_tokens:
            if d.check(word):
                if sentence == '':
                    sentence = sentence + word
                else:
                    sentence = sentence + " " + word
        print(sentence)
        eng.append(sentence)
    return eng
To save it back to your dataframe, just do this:
sentences=get_eng_words(df['column'])
df['column']=pd.DataFrame(sentences)
Hope it helps anyone!

Related

Remove whitespace from spacy doc.ents

I am trying to run a spaCy model for NER. I have a Doc object and doc.ents shows the output below:
(august 3, 2021, book building offer, bse, nse)
All the entity spans contain extra whitespace. Due to this I am receiving the error below:
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the `debug data` command to validate your JSON-formatted training data
Can anyone suggest how to remove this whitespace?
I found out the problem was in rule-based labeling: the entity spans had leading or trailing whitespace.
To solve the above problem, you can use the function below:
import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data
If you have documents labeled in the new .spacy format, then you can use the helper below (this reads the DocBin via the skweak library):
import skweak

new_docs = []
for doc in skweak.utils.docbin_reader("./ipo_v3.spacy", spacy_model_name='en_core_web_trf'):
    new_docs.append(doc)
Once you have all the docs, you can convert them into the JSON-style format:
doc_examples = []
for i, j in enumerate(new_docs):
    spans_ex = [(ent.start_char, ent.end_char, ent.label_) for ent in new_docs[i].ents]
    doc_examples.append([new_docs[i].text, {"entities": spans_ex}])
Then you can simply use:
doc_examples = trim_entity_spans(doc_examples)

Can I do any analysis on spacy display using NER?

When accessing this display in spaCy NER, can you add the found entities (in this case any tweets with GPE or LOC) to a new dataframe, or do any further analysis on this topic? I thought once I got them into a list I could possibly use geopy to visualize them; any thoughts?
colors = {'LOC': 'linear-gradient(90deg, #aa9cde, #dc9ce7)', 'GPE': 'radial-gradient(white, blue)'}
options = {'ents': ['LOC', 'GPE'], 'colors': colors}
spacy.displacy.render(doc, style='ent', jupyter=True, options=options)
The entities are accessible on the doc object. If you want to get all the ents in the doc object into a list, simply use doc.ents. For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
Say you want the text (or mention) of the entity and the label of the entity (say, PERSON, GPE, LOC, NORP, etc.); then you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.
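
To directly address the dataframe part of the question, here is a hedged sketch that collects only GPE/LOC entities into a new DataFrame, assuming pandas is available and reusing the doc from the example above; the variable and column names are just placeholders.

import pandas as pd

# keep only location-like entities, matching the displacy options above
loc_ents = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ('GPE', 'LOC')]
ent_df = pd.DataFrame(loc_ents, columns=['entity', 'label'])
print(ent_df)
# the 'entity' column could then be geocoded with geopy for visualization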

Query for substrings from freeform STT input

I have a PostgreSQL database with vocabulary in a table.
I want to receive Speech to Text (STT) input and query my vocabulary table for matches.
This is tricky since STT is somewhat free-form.
Let's say the table contains the following vocabulary and phrases:
How are you?
Hi
Nice to meet you
Hill
Nice
And the user is prompted to speak: "Hi, nice to meet you"
I transcribe their input as it comes in as "Hi nice to meet you" and query my database for individual vocabulary matches. I want to return:
[
  {
    id: 2,
    word: "Hi"
  },
  {
    id: 3,
    word: "Nice to meet you"
  }
]
I could query with wildcards, where word ilike '%${term}%', but then I'd need to pass in the correct substring for it to find the match, e.g., where word ilike '%Hi%', and this may incorrectly return Hill. I could also split the spoken input by space, giving me ["Hi", "nice", "to", "meet", "you"], and loop through each word looking for a match, but this may return Nice rather than the phrase Nice to meet you.
Q: How can I correctly pass substrings to a query and return accurate results for free-form speech?
Two PostgreSQL functions could help you here:
to_tsvector: creates a text-search vector of tokens (lexemes: units of lexical meaning)
to_tsquery: queries the vector for occurrences of certain words or phrases
See Mastering PostgreSQL Tools: Full-Text Search and Phrase Search
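
Here is a rough sketch of how that could look from Python. The table name vocabulary, the word column, and the psycopg2 connection string are assumptions, not taken from the question; it matches each stored phrase against the transcribed input, and you would still need logic to prefer the longest match (e.g. Nice to meet you over Nice).

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

transcribed = "Hi nice to meet you"

# build a tsvector from the spoken input and test each stored phrase against it;
# phraseto_tsquery keeps word order, so multi-word entries match as phrases
cur.execute(
    """
    SELECT id, word
    FROM vocabulary
    WHERE to_tsvector('english', %s) @@ phraseto_tsquery('english', word)
    """,
    (transcribed,),
)
print(cur.fetchall())

Note that single-word entries like Nice would still match on their own, so you may want to keep only the longest match among overlapping results.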
If that's not enough you need to turn to natural language processing (NLP).
Something like PyTextRank could help (something that goes beyond the bag-of-words technique):
import spacy
import pytextrank
text = "Hi, how are you?"
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc = nlp(text)
# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d} {}".format(p.rank, p.count, p.text))
    print(p.chunks)

Is there a fast way to get the tokens for each sentence in spaCy?

To split my sentence into tokens I'm doing the following, which is slow:
import spacy
nlp = spacy.load("en_core_web_lg")
text = "This is a test. This is another test"
sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
    words = nlp(sent.text)
    all = []
    for w in words:
        all.append(w)
    sentence_tokens.append(all)
I kind of want to do this the way nltk handles it, where you split the text into sentences using sent_tokenize() and then for each sentence run word_tokenize().
The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood and the Doc you get back already includes all information you need.
So if you need a list of strings, one for each token, you can do:
sentence_tokens = []
for sent in doc.sents:
    sentence_tokens.append([token.text for token in sent])
Or even shorter:
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. You can read more about it here.
texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
    sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
    # do something with the tokens
To just do the rule-based tokenization, which is very fast, run:
nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])
There won't be sentence boundaries, though. For that you still currently need the parser. If you want tokens and sentence boundaries:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
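
For completeness, a small hedged sketch of the batched, tokenizer-only variant, reusing the nlp object loaded above; the sample texts are placeholders.

texts = ["This is a test. This is another test", "A second document"]
# nlp.tokenizer.pipe only tokenizes, so this is fast but yields no sentence boundaries
token_lists = [[token.text for token in doc] for doc in nlp.tokenizer.pipe(texts)]
print(token_lists)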
(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation should be there and you'll just be duplicating annotation. That would be particularly slow.)

Generate set of Nouns and verbs from n different descriptions, list out descriptions that match a noun and verb

I'm new to NLP. I have data with two columns, app name and its description. The data looks like this:
app1, description1 (some information of app1, how it works)
app2, description2
.
.
app(n), description(n)
From these descriptions I need to generate a limited set of nouns and verbs. In the final application, when we pair a noun and a verb from this list, the output should be the list of apps that satisfy that noun+verb.
I don't have any idea where to start; can you please guide me? Thank you.
The task of finding the morpho-syntactic category of words in a sentence is called part-of-speech (or PoS) tagging.
In your case, you probably also need to tokenize your text first.
To do so, you can use nltk, spacy, or the Stanford NLP tagger (among other tools).
Note that depending on the model you use, there can be several labels for nouns (singular nouns, plural nouns, proper nouns) and verbs (depending on the tense and person).
Example with NLTK:
import nltk
description = "This description describes apps with words."
tokenized_description = nltk.word_tokenize(description)
tagged_description = nltk.pos_tag(tokenized_description)
#tagged_description:
# [('This', 'DT'), ('description', 'NN'), ('describes', 'VBZ'), ('apps', 'RP'), ('with', 'IN'), ('words', 'NNS'), ('.', '.')]
# map the tags to a smaller set of tags
universal_tags_description = [(word, nltk.map_tag("wsj", "universal", tag)) for word, tag in tagged_description]
# universal_tags_description:
# [('This', 'DET'), ('description', 'NOUN'), ('describes', 'VERB'), ('apps', 'PRT'), ('with', 'ADP'), ('words', 'NOUN'), ('.', '.')]
filtered = [(word, tag) for word, tag in universal_tags_description if tag in {'NOUN', 'VERB'}]
# filtered:
# [('description', 'NOUN'), ('describes', 'VERB'), ('words', 'NOUN')]
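
To connect this back to the apps, here is one possible (hedged) sketch: extract the noun and verb sets per description and build a lookup from (noun, verb) pairs to app names. The DataFrame df with 'app' and 'description' columns and the sample descriptions are assumptions about how the data is stored, not taken from the question.

import itertools
from collections import defaultdict

import nltk
import pandas as pd

# requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

# hypothetical data layout: one row per app
df = pd.DataFrame({
    'app': ['app1', 'app2'],
    'description': ['This app edits photos quickly.', 'This app plays music and edits playlists.'],
})

pair_to_apps = defaultdict(set)
for app, description in zip(df['app'], df['description']):
    tagged = nltk.pos_tag(nltk.word_tokenize(description))
    nouns = {w.lower() for w, t in tagged if t.startswith('NN')}
    verbs = {w.lower() for w, t in tagged if t.startswith('VB')}
    for noun, verb in itertools.product(nouns, verbs):
        pair_to_apps[(noun, verb)].add(app)

# look up which apps mention a given noun together with a given verb
print(pair_to_apps.get(('photos', 'edits'), set()))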