Vocab for LSA Topic Modelling returning letters rather than words - pandas

I am trying to topic model a list of descriptions using LSA. When I tokenize and then create a vocab from the descriptions, the vocab returns letters rather than words.
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer

my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•#'
custom_stopwords = ['author', 'book', 'books', 'story', 'stories', 'novel', 'series', 'collection', 'edition', 'volume', 'readers', 'reader', 'reprint', 'writer', 'writing']
final_stopword_list = custom_stopwords + my_stopwords
# cleaning master function
def clean_tokens(tokens):
    tokens = tokens.lower()  # lower case
    tokens = re.sub('[' + my_punctuation + ']+', ' ', tokens)  # strip punctuation
    tokens = re.sub('([0-9]+)', '', tokens)  # remove numbers
    token_list = [word for word in tokens.split(' ') if word not in final_stopword_list]  # remove stopwords
    tokens = ' '.join(token_list)
    return tokens
This is my tokenizer:
count_vectoriser = CountVectorizer(tokenizer=clean_tokens)
bag_of_words = count_vectoriser.fit_transform(df.Description)
vocab = count_vectoriser.get_feature_names_out()
print(vocab[:10])
And my vocab, which returns
[' ' '#' '\\' 'a' 'b' 'c' 'd' 'e' 'f' 'g']
when I want it to give me words.
I am tokenizing from a pandas DataFrame, so I don't know whether that is altering the way I am tokenizing.
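For reference, a minimal sketch of one likely fix, not taken from the original post: CountVectorizer expects the tokenizer callable to return a list of tokens, but clean_tokens returns a joined string, which the vectorizer then iterates character by character. Returning the token list directly should give a word-level vocabulary.
# Sketch: same cleaning steps, but return a list of tokens rather than a single string
def clean_tokens(tokens):
    tokens = tokens.lower()  # lower case
    tokens = re.sub('[' + my_punctuation + ']+', ' ', tokens)  # strip punctuation
    tokens = re.sub('([0-9]+)', '', tokens)  # remove numbers
    return [word for word in tokens.split() if word not in final_stopword_list]  # remove stopwords

count_vectoriser = CountVectorizer(tokenizer=clean_tokens)
bag_of_words = count_vectoriser.fit_transform(df.Description)
print(count_vectoriser.get_feature_names_out()[:10])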

Related

How can I pass additional arguments into pandas's apply function?

Some of my DataFrame's string columns need to be normalized to use the delimiter '|'. For example, the name column's values like 'a/b/c' need to be normalized to 'a|b|c', and the sex column's 'M/F' needs to be normalized to 'M|F'.
columns_to_be_normalized = ['name', 'sex']
delimiters = ['/', ';', ',']

for column in columns_to_be_normalized:
    df[column] = df[column].apply(normalize)

def normalize(column_text):
    for delimiter in delimiters:
        normalized_column_text = re.sub(delimiter, '|', text)
        if column_text != normalized_column_text:
            return normalized
    return column_text
My question is, how do I pass the variable delimiters into the normalize function so that I can use it in the regex? The reason I have to pass it as an argument is because the delimiters could change depending on some conditions.
Define normalize with a named parameter:
def normalize(column_text, delimiters=None):
    if delimiters is None:
        delimiters = ['/']  # define the default here
    for delimiter in delimiters:
        # use column_text here; the question's code referenced an undefined `text`
        normalized_column_text = re.sub(delimiter, '|', column_text)
        if column_text != normalized_column_text:
            return normalized_column_text  # the question's `return normalized` was also undefined
    return column_text
Then use:
df[column] = df[column].apply(normalize, delimiters=['/', ';', ','])
Note that you don't need apply per item, though. You can use pandas' str.replace directly, which takes care of the loop for you:
import re

delimiters = ['/', ';', ',']
regex = '|'.join(map(re.escape, delimiters))

df[columns_to_be_normalized] = (
    df[columns_to_be_normalized].apply(lambda s: s.str.replace(regex, '|', regex=True))
)
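For illustration, a quick check on made-up data (the DataFrame below is hypothetical, not from the question):
import re
import pandas as pd

columns_to_be_normalized = ['name', 'sex']
regex = '|'.join(map(re.escape, ['/', ';', ',']))

df = pd.DataFrame({'name': ['a/b/c', 'x;y'], 'sex': ['M/F', 'F,M']})
df[columns_to_be_normalized] = (
    df[columns_to_be_normalized].apply(lambda s: s.str.replace(regex, '|', regex=True))
)
print(df)
#     name  sex
# 0  a|b|c  M|F
# 1    x|y  F|M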

I have several problems with Logistic Regression using pandas

Can't create X_train and X_test DataFrames (from 2 different csv files), and also can't use them as integers.
import pandas as pd

data = pd.read_csv('action_train.csv', delimiter=';', header=0)
data = (data.replace(to_replace='[act1_]', value='', regex=True)
            .replace(to_replace='[act2_]', value='', regex=True)
            .replace(to_replace='[type ]', value='', regex=True))
print(data.shape)
print(list(data.columns))
data1 = pd.read_csv('action_test.csv', delimiter=';', header=0)
data1 = (data1.replace(to_replace='[act1_]', value='', regex=True)
              .replace(to_replace='[act2_]', value='', regex=True)
              .replace(to_replace='[type ]', value='', regex=True))
print(data1.shape)
print(list(data1.columns))
X_train=data['action_id', 'char_1', 'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9', 'char_10']
print(X_train)
y_train=data['result']
X_test=data1['action_id', 'char_1', 'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9', 'char_10']
print(X_test)
y_test=data1['result']
I tried to use them in different ways but got a tuple instead of an array. I also can't convert the object dtype to integer.
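For reference, a minimal sketch of the usual fixes, not taken from the original post (the column names are copied from the question): selecting several columns requires a list inside the brackets, and object columns can be converted with pd.to_numeric.
feature_cols = ['action_id', 'char_1', 'char_2', 'char_3', 'char_4', 'char_5',
                'char_6', 'char_7', 'char_8', 'char_9', 'char_10']

# A list of names selects multiple columns; data['a', 'b', ...] passes a single tuple key
X_train = data[feature_cols]
X_test = data1[feature_cols]

# Convert the object columns to numbers; errors='coerce' turns unparseable values into NaN
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

y_train = data['result']
y_test = data1['result']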

Leetcode 126: Word Ladder 2 in Python code optimization

I have a solution for Word Ladder 2 (Leetcode problem 126: Word Ladder 2) in Python 3.6, and I notice that one of the very last test cases times out for me on the platform. Funnily, the test passes when run on PyCharm or as an individual test case on the site, but it takes about 5 seconds to complete. My solution uses BFS with some optimizations, but can someone tell me if there is a way to make it faster? Thank you! (P.S.: Apologies for the additional test cases included in the commented-out section!)
import math
import queue
from typing import List


class WordLadder2(object):

    @staticmethod
    def is_one_hop_away(s1: str, s2: str) -> bool:
        """
        Uses the distance between strings to return True if string s2 is one character away from s1
        :param s1: Base string
        :param s2: Comparison string
        :return: True if the difference between the strings is one character
        """
        matrix = [[0] * (len(s1) + 1) for i in range(len(s1) + 1)]
        for r, row in enumerate(matrix):
            for c, entry in enumerate(row):
                if not r:
                    matrix[r][c] = c
                elif not c:
                    matrix[r][c] = r
                else:
                    if s1[r - 1] == s2[c - 1]:
                        matrix[r][c] = matrix[r - 1][c - 1]
                    else:
                        matrix[r][c] = 1 + min(matrix[r - 1][c - 1], matrix[r - 1][c], matrix[r][c - 1])
        if matrix[-1][-1] == 1:
            return True
        else:
            return False

    def get_next_words(self, s1: str, wordList: List[str]) -> List[str]:
        """
        For a given string in the list, return a set of strings that are one hop away
        :param s1: String whose neighbors one hop away are needed
        :param wordList: Array of words to choose from
        :return: List of words that are one character away from given string s1
        """
        words = []
        for word in wordList:
            if self.is_one_hop_away(s1, word):
                words.append(word)
        return words

    def find_ladders(self, beginWord: str, endWord: str, wordList: List[str]) -> List[List[str]]:
        """
        Main method to determine shortest paths between a beginning word and an ending word, in a given list of words
        :param beginWord: Word to begin the ladder
        :param endWord: Word to end the ladder
        :param wordList: List of words to choose from
        :return: List of list of word ladders, if they are found. Empty list, if endWord not in wordList or path not
                 found from beginWord to endWord
        """
        q = queue.Queue()
        paths = list()
        current = [beginWord]
        q.put((beginWord, current))
        # Set to track words we have already processed
        visited = set()
        # Dictionary to keep track of the shortest path lengths to each word from beginWord
        shortest_paths = {beginWord: 1}
        min_length = math.inf
        # Use BFS to find the shortest path in the graph
        while q.qsize():
            word, path = q.get()
            # If endWord is found, add the current path to the list of paths and compute minimum path
            # length found so far
            if word == endWord:
                paths.append(path)
                min_length = min(min_length, len(path))
                continue
            for hop in self.get_next_words(word, wordList):
                # If the hop is already processed or in the queue for processing, skip
                if hop in visited or hop in q.queue:
                    continue
                # If the shortest path to the hop has not been determined or the current path length is lesser
                # than or equal to the known shortest path to the hop, add it to the queue and update the shortest
                # path to the hop.
                if (hop not in shortest_paths) or (hop in shortest_paths and len(path + [hop]) <= shortest_paths[hop]):
                    q.put((hop, path + [hop]))
                    shortest_paths[hop] = len(path + [hop])
            visited.add(word)
        return [s for s in paths if len(s) == min_length]

# beginword = 'qa'
# endword = 'sq'
# wordlist = ["si","go","se","cm","so","ph","mt","db","mb","sb","kr","ln","tm","le","av","sm","ar","ci","ca","br","ti","ba","to","ra","fa","yo","ow","sn","ya","cr","po","fe","ho","ma","re","or","rn","au","ur","rh","sr","tc","lt","lo","as","fr","nb","yb","if","pb","ge","th","pm","rb","sh","co","ga","li","ha","hz","no","bi","di","hi","qa","pi","os","uh","wm","an","me","mo","na","la","st","er","sc","ne","mn","mi","am","ex","pt","io","be","fm","ta","tb","ni","mr","pa","he","lr","sq","ye"]
# beginword = 'hit'
# endword = 'cog'
# wordlist = ['hot', 'dot', 'dog', 'lot', 'log', 'cog']
# beginword = 'red'
# endword = 'tax'
# wordlist = ['ted', 'tex', 'red', 'tax', 'tad', 'den', 'rex', 'pee']
beginword = 'cet'
endword = 'ism'
wordlist = ["kid","tag","pup","ail","tun","woo","erg","luz","brr","gay","sip","kay","per","val","mes","ohs","now","boa","cet","pal","bar","die","war","hay","eco","pub","lob","rue","fry","lit","rex","jan","cot","bid","ali","pay","col","gum","ger","row","won","dan","rum","fad","tut","sag","yip","sui","ark","has","zip","fez","own","ump","dis","ads","max","jaw","out","btu","ana","gap","cry","led","abe","box","ore","pig","fie","toy","fat","cal","lie","noh","sew","ono","tam","flu","mgm","ply","awe","pry","tit","tie","yet","too","tax","jim","san","pan","map","ski","ova","wed","non","wac","nut","why","bye","lye","oct","old","fin","feb","chi","sap","owl","log","tod","dot","bow","fob","for","joe","ivy","fan","age","fax","hip","jib","mel","hus","sob","ifs","tab","ara","dab","jag","jar","arm","lot","tom","sax","tex","yum","pei","wen","wry","ire","irk","far","mew","wit","doe","gas","rte","ian","pot","ask","wag","hag","amy","nag","ron","soy","gin","don","tug","fay","vic","boo","nam","ave","buy","sop","but","orb","fen","paw","his","sub","bob","yea","oft","inn","rod","yam","pew","web","hod","hun","gyp","wei","wis","rob","gad","pie","mon","dog","bib","rub","ere","dig","era","cat","fox","bee","mod","day","apr","vie","nev","jam","pam","new","aye","ani","and","ibm","yap","can","pyx","tar","kin","fog","hum","pip","cup","dye","lyx","jog","nun","par","wan","fey","bus","oak","bad","ats","set","qom","vat","eat","pus","rev","axe","ion","six","ila","lao","mom","mas","pro","few","opt","poe","art","ash","oar","cap","lop","may","shy","rid","bat","sum","rim","fee","bmw","sky","maj","hue","thy","ava","rap","den","fla","auk","cox","ibo","hey","saw","vim","sec","ltd","you","its","tat","dew","eva","tog","ram","let","see","zit","maw","nix","ate","gig","rep","owe","ind","hog","eve","sam","zoo","any","dow","cod","bed","vet","ham","sis","hex","via","fir","nod","mao","aug","mum","hoe","bah","hal","keg","hew","zed","tow","gog","ass","dem","who","bet","gos","son","ear","spy","kit","boy","due","sen","oaf","mix","hep","fur","ada","bin","nil","mia","ewe","hit","fix","sad","rib","eye","hop","haw","wax","mid","tad","ken","wad","rye","pap","bog","gut","ito","woe","our","ado","sin","mad","ray","hon","roy","dip","hen","iva","lug","asp","hui","yak","bay","poi","yep","bun","try","lad","elm","nat","wyo","gym","dug","toe","dee","wig","sly","rip","geo","cog","pas","zen","odd","nan","lay","pod","fit","hem","joy","bum","rio","yon","dec","leg","put","sue","dim","pet","yaw","nub","bit","bur","sid","sun","oil","red","doc","moe","caw","eel","dix","cub","end","gem","off","yew","hug","pop","tub","sgt","lid","pun","ton","sol","din","yup","jab","pea","bug","gag","mil","jig","hub","low","did","tin","get","gte","sox","lei","mig","fig","lon","use","ban","flo","nov","jut","bag","mir","sty","lap","two","ins","con","ant","net","tux","ode","stu","mug","cad","nap","gun","fop","tot","sow","sal","sic","ted","wot","del","imp","cob","way","ann","tan","mci","job","wet","ism","err","him","all","pad","hah","hie","aim"]
if __name__ == "__main__":
    wl = WordLadder2()
    # beginword = 'hot'
    # endword = 'dog'
    # wordlist = ['hot', 'dog', 'dot']
    print(wl.find_ladders(beginword, endword, wordlist))
The part that slows down your solution is is_one_hop_away, which is a costly function that is called repeatedly during the actual BFS. Instead, you should aim to first create a graph structure -- an adjacency list -- so that the cost of working out which words are neighbors is paid before actually performing the BFS search.
Here is one way to do it:
from collections import defaultdict
from typing import List


class Solution:
    def findLadders(self, beginWord: str, endWord: str, wordList: List[str]) -> List[List[str]]:

        def createAdjacencyList(wordList):
            adj = defaultdict(set)
            d = defaultdict(set)
            for word in wordList:
                for i in range(len(word)):
                    derived = word[:i] + "*" + word[i+1:]
                    for neighbor in d[derived]:
                        adj[word].add(neighbor)
                        adj[neighbor].add(word)
                    d[derived].add(word)
            return adj

        def edgesOnShortestPaths(adj, beginWord, endWord):
            frontier = [beginWord]
            edges = defaultdict(list)
            edges[beginWord] = []
            while endWord not in frontier:
                nextfrontier = set(neighbor
                                   for word in frontier
                                   for neighbor in adj[word]
                                   if neighbor not in edges
                                   )
                if not nextfrontier:  # endWord is not reachable
                    return
                for word in frontier:
                    for neighbor in adj[word]:
                        if neighbor in nextfrontier:
                            edges[neighbor].append(word)
                frontier = nextfrontier
            return edges

        def generatePaths(edges, word):
            if not edges[word]:
                yield [word]
            else:
                for neighbor in edges[word]:
                    for path in generatePaths(edges, neighbor):
                        yield path + [word]

        if endWord not in wordList:  # shortcut exit
            return []
        adj = createAdjacencyList([beginWord] + wordList)
        edges = edgesOnShortestPaths(adj, beginWord, endWord)
        if not edges:  # endWord is not reachable
            return []
        return list(generatePaths(edges, endWord))
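As a quick sanity check, the class can be exercised on the classic example from the question's commented-out test cases (the order of the returned ladders may differ, since the adjacency sets are unordered):
sol = Solution()
print(sol.findLadders('hit', 'cog', ['hot', 'dot', 'dog', 'lot', 'log', 'cog']))
# [['hit', 'hot', 'dot', 'dog', 'cog'], ['hit', 'hot', 'lot', 'log', 'cog']]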

spaCy: How to update doc.ents when using doc.retokenize()

I am trying to update a pre-trained model with tokens using the retokenizer. I created a pipeline component in order to do this. In this component, I also set "ENT_TYPE" when merging the tokens.
import re

from spacy.language import Language


@Language.factory("re_tokenize")
def re_tokenize(nlp, name):
    return ReTokenize(nlp.vocab)


class ReTokenize:
    pattern = ""

    def __init__(self, vocab):
        self.pattern = r"[a-zA-Z0-9]+\[{0,1}[a-zA-Z0-9_]+\]{0,1}\[{0,1}[a-zA-Z0-9_]+\]{0,1}\[{0,1}[a-zA-Z0-9_]+\]{0,1}#{0,1}"

    def __call__(self, doc):
        spans = []
        for match in re.finditer(self.pattern, doc.text):
            start, end = match.span()
            span = doc.char_span(start, end)
            if span is not None:
                spans.append(span)
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span, attrs={"ENT_TYPE": "VAR"})
        return doc
Using this pipeline, I can tokenize the words correctly. Also, the data in ent_type_ seems to be updated.
BEFORE:
import spacy
import mojimoji

# Set model
nlp = spacy.load("ja_ginza")

text = "aaa_bbbとaaa_CCCの2バイトマップ"
text = mojimoji.zen_to_han(text).lower()

doc = nlp(text)
print([token.text for token in doc])
print([token.ent_type_ for token in doc])
['aaa', '', 'bbb', 'と', 'aaa', '', 'ccc', 'の', '2', 'バイト', 'マップ']
['Product_Other', 'Product_Other', 'Product_Other', '', 'Product_Other', 'Product_Other', 'Product_Other', '', 'N_Product', 'N_Product', 'N_Product']
AFTER:
nlp.add_pipe("re_tokenize", before="parser")
doc = nlp(text)
print([token.text for token in doc])
print([token.ent_type_ for token in doc])
['aaa_bbb', 'と', 'aaa_ccc', 'の', '2', 'バイト', 'マップ']
['VAR', '', 'VAR', '', 'N_Product', 'N_Product', 'N_Product']
However, it seems that doc.ents is not being updated:
print([ent.label_ for ent in doc.ents])
['N_Product']
How do I also update doc.ents?
To add a single new entity to a doc without modifying any other entity annotation, use doc.set_ents():
span = doc.char_span(start, end, label="VAR")
doc.set_ents(entities=[span], default="unmodified")
More docs: https://spacy.io/api/doc#set_ents
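An untested sketch of how that suggestion could be folded into the component from the question (the regex and the "VAR" label are the ones used above; the character offsets are re-used after merging, since merging does not change the text):
    # __call__ of the ReTokenize class, extended to also update doc.ents
    def __call__(self, doc):
        offsets = []
        for match in re.finditer(self.pattern, doc.text):
            start, end = match.span()
            if doc.char_span(start, end) is not None:
                offsets.append((start, end))

        with doc.retokenize() as retokenizer:
            for start, end in offsets:
                retokenizer.merge(doc.char_span(start, end), attrs={"ENT_TYPE": "VAR"})

        # Rebuild labelled spans from the character offsets and register them as
        # entities, leaving every other entity annotation untouched
        new_ents = [doc.char_span(start, end, label="VAR") for start, end in offsets]
        doc.set_ents(entities=[e for e in new_ents if e is not None], default="unmodified")
        return doc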

Remove words from pandas series that are not found in list

I have a list of strings and a series with sentences for which all punctuation has been removed:
series = test_data["reviews"]
words = [ 'great', 'awesome', 'ok', 'sucky']
I need to remove all the words from the series that are not in the list words and assign the result to a new series.
I did an online search and tried a few things, but was unable to find a solution.
Can someone please assist?
Here is what I have:
new_series = []
for word in words:
    if word in significant_words:
        new_series.append(word)
print(new_series)
Much appreciated.
If the data contains sentences and you need new columns filled with lists, use:
import pandas as pd

test_data = pd.DataFrame({'reviews': ['great it is', 'ok good well awesome']})
words = ['great', 'awesome', 'ok', 'sucky']

def func(x):
    a, b = [], []
    for word in x.split():
        if word not in words:
            a.append(word)
        else:
            b.append(word)
    return pd.Series([a, b])

test_data[['out', 'in']] = test_data["reviews"].apply(func)
print(test_data)
                reviews           out             in
0           great it is      [it, is]        [great]
1  ok good well awesome  [good, well]  [ok, awesome]
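If only the kept words are needed as a single new series, a per-row list comprehension also works (a sketch, not part of the original answer; significant_words here is just the question's word list as a set):
significant_words = set(words)
new_series = test_data["reviews"].apply(
    lambda s: ' '.join(w for w in s.split() if w in significant_words)
)
print(new_series)
# 0         great
# 1    ok awesome
# Name: reviews, dtype: object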