Tokenizing the text without the use of libraries - python-3.8

I'm trying to write a function that tokenizes a text.
Input: string;
Output: list (of tokens)
The tokenizer should separate the punctuation from the words, whenever the punctuation is not an
integral part of the word.
For instance:
"The current population of U.S.A. is 332,087,410 as of Friday, 01/22/2021, based on Worldometer
elaboration of the latest United Nations’ data."
should be tokenized as
"The current population of U.S.A. is 332,087,410 as of Friday , 01/22/2021 , based on Worldometer
elaboration of the latest United Nations ’ data ."
tokenization of . (do not tokenize acronyms, abbreviations, numbers)
tokenization of ' (expand when needed, e.g., I’m -> I am; tokenize the possessive,
e.g., Sunday’s -> Sunday ‘s; etc.)
tokenization of dates (keep dates together)
tokenization of - (keep phrases separated by - together)
tokenization of , (do not tokenize numbers)
Been working on this for hours trying to use stuff with re and .split(), but nothing seems to be working. Any assistance would be appreciated!

Could you try this out? This function will not handle every possible edge case, but it should work in most cases:
import re
def tokenize(text):
    tokens = text.split()
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if re.match(r'\d{1,2}[/-]\d{1,2}[/-]\d{4}', token):
            pass  # date: keep together
        elif re.match(r"\w+'s", token):
            # possessive: split off the 's
            token = re.sub(r"(\w+)'s", r"\1 's", token)
        elif re.match(r"\w+'\w+", token):
            # other contraction: drop the apostrophe
            token = token.replace("'", "")
        elif re.match(r"\w+-\w+", token):
            pass  # hyphenated phrase: keep together
        elif re.match(r"\d+(,\d+)*", token):
            pass  # number with separators: keep together
        else:
            # everything else: separate punctuation from the word
            token = re.sub(r"([^\w\s]+)", r" \1 ", token)
            token = re.sub(r"(\w+)\.", r"\1", token)
            token = re.sub(r"(\w+),", r"\1", token)
            token = re.sub(r"U\.S\.A\.", r"U.S.A.", token)
        tokens[i] = token
        i += 1
    return tokens
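An alternative to patching tokens after text.split() is a single regular expression whose alternatives are tried in priority order, so that dates, hyphenated phrases, numbers and acronyms are matched before any punctuation is split off. The following is a minimal sketch, not a complete solution: the name TOKEN_RE and the exact patterns are assumptions chosen to fit the example sentence in the question. It does not expand contractions such as I’m, and abbreviations such as "Mr." would still be split.

```python
import re

# Alternatives are tried left to right at each position, so the order matters.
TOKEN_RE = re.compile(r"""
    \d{1,2}/\d{1,2}/\d{2,4}        # dates such as 01/22/2021
  | \w+(?:-\w+)+                   # hyphenated phrases such as 18-year-old
  | \d+(?:,\d{3})*(?:\.\d+)?       # numbers such as 332,087,410 or 3.14
  | (?:[A-Za-z]\.){2,}             # acronyms such as U.S.A.
  | \w+                            # plain words
  | [’']s\b                        # possessive 's as its own token
  | [^\w\s]                        # any other single punctuation mark
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)


sentence = ("The current population of U.S.A. is 332,087,410 as of Friday, "
            "01/22/2021, based on Worldometer elaboration of the latest "
            "United Nations’ data.")
print(" ".join(tokenize(sentence)))
```

Because every rule is one alternative in the same pattern, adding a new special case (for example times such as 10:30) means adding one line above the generic word and punctuation rules.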


Remove whitespace from spacy doc.ents

I am trying to run a spaCy model for NER. I have a Doc object, and doc.ents shows the output below:
(august 3, 2021, book building offer, bse, nse)
All the tags contain spaces. Because of this, I am receiving the error below:
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the `debug data` command to validate your JSON-formatted training data
Can anyone suggest how to remove this whitespace?
The problem turned out to be in the rule-based labeling: the entity spans had leading or trailing whitespace.
To solve this problem, you can use the function below:
import re
def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')
    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            # Move the start forward past any leading whitespace
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            # Move the end backward past any trailing whitespace
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data
If you have documents labeled in the new .spacy format, you can read them with the helper function below:
new_docs = []
for doc in skweak.utils.docbin_reader("./ipo_v3.spacy", spacy_model_name='en_core_web_trf'):
    new_docs.append(doc)
Once you have all the docs, you can convert them into JSON format:
doc_examples = []
for doc in new_docs:
    spans_ex = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    doc_examples.append([doc.text, {'entities': spans_ex}])
Then you can simply use:
doc_examples = trim_entity_spans(doc_examples)
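To make the effect of the trimming concrete, here is a small worked example on a single span. It is only a sketch: the helper name trim_span and the sample text are made up for illustration, and it uses str.strip instead of the character-by-character loop above.

```python
def trim_span(text, start, end):
    """Shrink (start, end) so that text[start:end] has no outer whitespace.

    Returns None when the span contains nothing but whitespace.
    """
    span = text[start:end]
    stripped = span.strip()
    if not stripped:
        return None
    new_start = start + len(span) - len(span.lstrip())
    return new_start, new_start + len(stripped)


text = "offer on august 3, 2021 at bse"
start, end = 8, 24                      # covers " august 3, 2021 "
new_start, new_end = trim_span(text, start, end)
print(repr(text[start:end]), "->", repr(text[new_start:new_end]))
```

Only the offsets change; the text itself is left untouched, which is what the spaCy training format expects.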

How to pass part-of-speech in WordNetLemmatizer?

I am preprocessing text data. However, I am facing an issue with lemmatizing.
Below is the sample text:
'An 18-year-old boy was referred to prosecutors Thursday for allegedly
stealing about ¥15 million ($134,300) worth of cryptocurrency last
year by hacking a digital currency storage website, police said.',
'The case is the first in Japan in which criminal charges have been
pursued against a hacker over cryptocurrency losses, the police
said.', '\n', 'The boy, from the city of Utsunomiya, Tochigi
Prefecture, whose name is being withheld because he is a minor,
allegedly stole the money after hacking Monappy, a website where users
can keep the virtual currency monacoin, between Aug. 14 and Sept. 1
last year.', 'He used software called Tor that makes it difficult to
identify who is accessing the system, but the police identified him by
analyzing communication records left on the website’s server.', 'The
police said the boy has admitted to the allegations, quoting him as
saying, “I felt like I’d found a trick no one knows and did it as if I
were playing a video game.”', 'He took advantage of a weakness in a
feature of the website that enables a user to transfer the currency to
another user, knowing that the system would malfunction if transfers
were repeated over a short period of time.', 'He repeatedly submitted
currency transfer requests to himself, overwhelming the system and
allowing him to register more money in his account.', 'About 7,700
users were affected and the operator will compensate them.', 'The boy
later put the stolen monacoins in an account set up by a different
cryptocurrency operator, received payouts in a different
cryptocurrency and bought items such as a smartphone, the police
said.', 'According to the operator of Monappy, the stolen monacoins
were kept using a system with an always-on internet connection, and
those kept offline were not stolen.'
My code is:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
df = pd.read_csv('All Articles.csv')
df['Articles'] = df['Articles'].str.lower()
stemming = PorterStemmer()
stops = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
def identify_tokens(row):
    Articles = row['Articles']
    tokens = nltk.word_tokenize(Articles)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words
df['words'] = df.apply(identify_tokens, axis=1)
def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)
df['stemmed_words'] = df.apply(stem_list, axis=1)
def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
def remove_stops(row):
    my_list = row['lemma_words']
    meaningful_words = [w for w in my_list if w not in stops]
    return (meaningful_words)
df['stem_meaningful'] = df.apply(remove_stops, axis=1)
def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = (" ".join(my_list))
    return joined_words
df['processed'] = df.apply(rejoin_words, axis=1)
As the code shows, I am using pandas; the text above is only a sample.
My problem area is:
def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
Although the code runs without any error, the lemma function does not work as expected.
Thanks in advance.
In your code above you are trying to lemmatize words that have been stemmed. When the lemmatizer runs into a word that it doesn't recognize, it'll simply return that word. For instance stemming offline produces offlin and when you run that through the lemmatizer it just gives back the same word, offlin.
Your code should be modified to lemmatize words, like this...
def lemma_list(row):
    my_list = row['words']  # Note: line that is changed
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
# df.ix was removed in pandas 1.0; df.loc does the same label-based lookup
print('Words: ', df.loc[0, 'words'])
print('Stems: ', df.loc[0, 'stemmed_words'])
print('Lemmas: ', df.loc[0, 'lemma_words'])
This produces...
Words: ['and', 'those', 'kept', 'offline', 'were', 'not', 'stolen']
Stems: ['and', 'those', 'kept', 'offlin', 'were', 'not', 'stolen']
Lemmas: ['and', 'those', 'keep', 'offline', 'be', 'not', 'steal']
Which is correct.
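The code above passes pos='v' for every word, so nouns, adjectives and adverbs are all lemmatized as if they were verbs. One common way to pass the actual part of speech is to tag the tokens first and translate each Penn Treebank tag into the single-letter code that WordNetLemmatizer expects. The sketch below is one possible way to do that, not the only one: the function names penn_to_wordnet and lemmatize_tokens are made up for illustration, and the NLTK tagger and WordNet data are assumed to have been fetched with nltk.download beforehand.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tokens(tokens):
    """Lemmatize each token using the part of speech assigned by nltk.pos_tag."""
    # Imported here so the tag mapping above can be used without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in nltk.pos_tag(tokens)]
```

In the pandas pipeline above this could be applied to the unstemmed column, for example df['lemma_words'] = df['words'].apply(lemmatize_tokens).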

Is there a fast way to get the tokens for each sentence in spaCy?

To split my sentences into tokens I'm doing the following, which is slow:
import spacy
nlp = spacy.load("en_core_web_lg")
text = "This is a test. This is another test"
sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
    words = nlp(sent.text)
    all = []
    for w in words:
        all.append(w)
    sentence_tokens.append(all)
I kind of want to do this the way nltk handles it, where you split the text into sentences using sent_tokenize() and then run word_tokenize() on each sentence.
The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood and the Doc you get back already includes all information you need.
So if you need a list of strings, one for each token, you can do:
sentence_tokens = []
for sent in doc.sents:
sentence_tokens.append([token.text for token in sent])
Or even shorter:
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. The spaCy documentation describes it in more detail.
texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
    sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
    # do something with the tokens
To just do the rule-based tokenization, which is very fast, run:
nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])
There won't be sentence boundaries, though. For that you still currently need the parser. If you want tokens and sentence boundaries:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation should be there and you'll just be duplicating annotation. That would be particularly slow.)
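If approximate sentence boundaries are good enough, spaCy also ships a rule-based sentencizer component that splits on punctuation and does not need the parser. The sketch below assumes spaCy v3, where components are added by name. The helper names build_sentence_splitter and tokens_by_sentence are made up for illustration.

```python
def build_sentence_splitter():
    """Create a pipeline with only the tokenizer and the rule-based sentencizer."""
    # Imported here so tokens_by_sentence can be used on its own.
    import spacy

    nlp = spacy.blank("en")        # tokenizer only, no trained components
    nlp.add_pipe("sentencizer")    # spaCy v3 syntax; assumes v3 is installed
    return nlp


def tokens_by_sentence(doc):
    """Return one list of token strings per sentence of an already processed Doc."""
    return [[token.text for token in sent] for sent in doc.sents]


# Typical use, processing each text exactly once:
#   nlp = build_sentence_splitter()
#   for doc in nlp.pipe(["This is a test. This is another test"]):
#       print(tokens_by_sentence(doc))
```

The trade-off is accuracy: the sentencizer only looks at punctuation, so the parser remains the better choice for text where sentence boundaries are not marked reliably.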

How to extract only English words from a big text corpus using nltk?

I want to remove all non-dictionary English words from a text corpus. I have removed stopwords, tokenized and count-vectorized the data. I need to extract only the English words and attach them back to the dataframe.
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if not item.isdigit()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in (new_stop_words)]))
cv = CountVectorizer( max_features = 200,analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
Sample Dump of the File I am using
After you first tokenize your text corpus, you could instead stem the word tokens:
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
SnowballStemmer is the class that performs the stemming; stemming is just the process of reducing a word to its root.
Passing the argument 'english' selects the porter2 stemming algorithm (more precisely, stem.snowball.EnglishStemmer). The porter2 stemmer is considered to be better than the original porter stemmer.
stems = [stemmer.stem(t) for t in tokenized]
Above, I define a list comprehension, which executes as follows:
it loops over our tokenized input list tokenized (tokenized can also be any other iterable);
it calls the .stem method of the SnowballStemmer instance stemmer on each tokenized word;
it collects the resulting stems in a list, which is meant to hold only stemmed English word tokens.
Caveat: the stemmer does not check a dictionary, so the list can still include words from other languages, for example identically inflected words in the languages English descends from, which porter2 would mistakenly treat as English words.
Down To The Essence
I had a VERY similar need. Your question appeared in my search. Felt I needed to look further, and I found THIS. I did a bit of modification for my specific needs (only English words from TONS of technical data sheets = no numbers or test standards or values or units, etc.). After much pain with other approaches, the below worked. I hope it can be a good launching point for you and others.
import nltk
from nltk.corpus import stopwords
words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')
file_name = 'Full path to your file'
with open(file_name, 'r') as f:
    text = f.read()
    text = text.replace('\n', ' ')
# Keep tokens that are dictionary words, not stopwords, and longer than one character
new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)
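The same filter can be wrapped in a function so that it can be applied to one dataframe column at a time. The sketch below is an assumption-laden illustration: the function name keep_known_words and the toy vocabulary are made up, and a plain regular expression stands in for nltk.wordpunct_tokenize so the example runs without any corpus download. In practice the vocabulary could be built from nltk.corpus.words.words() as above.

```python
import re


def keep_known_words(text, vocabulary, stop_words=frozenset()):
    """Keep only alphabetic tokens whose lowercase form is in vocabulary."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return " ".join(
        t for t in tokens
        if len(t) > 1
        and t.lower() in vocabulary
        and t.lower() not in stop_words
    )


# Toy vocabulary for illustration only.
vocabulary = {"main", "street", "north", "the"}
print(keep_known_words("12 Main Street, Blk-7 the Nrth", vocabulary, {"the"}))
```

With pandas, the result can be attached back to the dataframe with something like data['Clean_addr'] = data['Clean_addr'].apply(lambda s: keep_known_words(s, vocabulary)).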
I used the pyenchant library to do this.
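import enchant
import nltk
from tqdm import tqdm
d = enchant.Dict("en_US")
def get_eng_words(data):
    eng = []
    for sample in tqdm(data):
        sentence = ''
        word_tokens = nltk.word_tokenize(sample)
        for word in word_tokens:
            if d.check(word):  # keep only words found in the en_US dictionary
                if sentence == '':
                    sentence = sentence + word
                else:
                    sentence = sentence + " " + word
        eng.append(sentence)
    return eng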
To save it just do this!
Hope it helps anyone!

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain which can extract out the information like institute, location, course etc from the free text.
Currently I am doing it through Lucene; the steps are as follows:
Index all the data related to institutes, courses and locations.
Make shingles of the free text, search each shingle in the location, course and institute index directories, and then try to work out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases, for example a course that can be written as btech, b-tech or
I want to know whether there is anything available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among many others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is the annotation "Token" with the feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I didn't use Lucene but in your case I would leave different forms of the same keyword as they are and just hold a link table or such. In this table I'd keep the relation of these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer / tokenizer, because such words can easily be split into two different words (e.g. B and tech).
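As a rough illustration of the regular-expression idea, the sketch below maps several spellings of one course name onto a single canonical form before the text is indexed or searched. The variant list (btech, b-tech, b.tech, b tech), the canonical form and the names COURSE_RE and normalize_course are assumptions made for this example. A real vocabulary would need one such entry per keyword, or a lookup table generated from the link table described above.

```python
import re

# Hypothetical variants of one course name: "btech", "b-tech", "b.tech", "b tech".
COURSE_RE = re.compile(r"\bb[\s.\-]?tech\b", re.IGNORECASE)
CANONICAL = "btech"


def normalize_course(text):
    """Replace every recognized variant with the canonical spelling."""
    return COURSE_RE.sub(CANONICAL, text)


print(normalize_course("B.Tech in Pune, b-tech in Delhi, B Tech in Goa"))
```

Normalizing before tokenization also sidesteps the analyzer problem, because the canonical form no longer contains a separator that could split it.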
You may want to check UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
UIMA has addons, in particular one for Lucene integration.
You can try
Example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
    mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
    Chunkers.regexp("Token", "\\w+"),
    Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
    new GraphExpChunker("Address",
);
can be written as btech, b-tech or
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
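For readers unfamiliar with the metric, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain Python illustration of the metric itself, not of how Lucene implements fuzzy queries. The function name levenshtein is chosen for this example.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


for term in ("foam", "roams", "b-tech"):
    base = "btech" if term == "b-tech" else "roam"
    print(base, term, levenshtein(base, term))
```

Both foam and roams are one edit away from roam, which is why the fuzzy query finds them. Spellings such as btech and b-tech are also one edit apart.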