Spacy phrasematcher does not get matcher name - spacy

I am new to PhraseMatcher and want to extract some keywords from my emails.
Everything works well except that I can't get the name of the matcher I added.
This is my code:
def main():
    patterns_months = 'phraseMatcher/months.txt'
    text_loc = 'phraseMatcher/text.txt'
    nlp = spacy.blank('en')
    nlp.vocab.lex_attr_getters = {}
    phrases_months = read_gazetter(patterns_months)
    txts = read_text(text_loc, n=n)
    months = [nlp(text) for text in phrases_months]
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add('MONTHS', None, *months)
    print(nlp.vocab.strings['MONTHS'])
    for txt in txts:
        doc = nlp(txt)
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start:end]
            label = nlp.vocab.strings[match_id]
            print(label, span.text, start, end)
The result:
12298211501233906429 <--- this is from print(nlp.vocab.strings['MONTHS'])
Traceback (most recent call last):
File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 71, in <module>
plac.call(main)
File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 47, in main
label = nlp.vocab.strings[match_id]
File "strings.pyx", line 117, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744072093410045'."
spaCy version: 2.0.12
Platform: Windows-7-6.1.7601-SP1
Python version: 3.7.0
I can't find what I did wrong. It looks simple, and I have already read this:
Using PhraseMatcher in SpaCy to find multiple match types
Please help. Thanks in advance.

Related

using pandas.read_csv, how can one process all errors, receive all non-error data?

Data which, for me, generates an exception instead of invoking the 'on_bad_lines' handler is at:
https://opencalaccess.org/misc/NAMES_CD.TSV
I have this:
bad_lines = list()
def bad_line_finder(x):
    bad_lines.append(str(x))
    return None
for file in os.listdir(dir):
    bad_lines = list()
    try:
        for df in pd.read_csv(f"{dir}/{file}",
                              sep='\t',
                              on_bad_lines=bad_line_finder,
                              engine='python',
                              chunksize=1000):
            print(f"\n{target}")
            df.info()
            print(f"Bad Lines: {bad_lines}")
            bad_lines = list()
    except:
        print("EXCEPTION:")
        traceback.print_exc()
and this works great. There are errors in the files, and the handler deals with them so that I can keep track of them. Except, why do I still see this:
EXCEPTION:
Traceback (most recent call last):
File "/home/ray/Projects/opencalaccess-data/import.py", line 41, in <module>
for df in pd.read_csv(f"{dir}/{file}",
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
return self.get_chunk()
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
return self.read(nrows=size)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
content = self._get_lines(rows)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
new_rows.append(next(self.data))
_csv.Error: ' ' expected after '"'
What is the "on_bad_lines" option doing if it does not handle all of the bad lines? Which of them will it handle and which will it not?
This is a government data source. There are format errors in the data that cannot be corrected by the agency, because they constitute the official record. So I must fix them myself. But which of them throw exceptions and which do not?
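For context: the pandas documentation describes a "bad line" for on_bad_lines as a line with too many fields. The _csv.Error in the traceback is raised by the underlying csv reader while it is tokenizing quotes, which appears to happen before the handler is ever consulted. One way to find the field-count problems without tripping over quotes is to pre-scan the file with the standard csv module and quoting switched off. This is only a minimal sketch, assuming the quote characters in the file are literal data rather than field delimiters (an assumption, not something the source confirms). The function name and the sample rows are hypothetical.

```python
import csv
import io


def split_good_and_bad_rows(lines, delimiter="\t"):
    """Return (header, good_rows, bad_rows); bad_rows holds (line_number, fields)."""
    # QUOTE_NONE treats '"' as an ordinary character, so quote errors cannot occur.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    good_rows, bad_rows = [], []
    for fields in reader:
        if len(fields) == len(header):
            good_rows.append(fields)
        else:
            # Wrong number of fields: record it with its line number for later repair.
            bad_rows.append((reader.line_num, fields))
    return header, good_rows, bad_rows


# Hypothetical sample: one row with stray quotes, one row with an extra field.
sample = io.StringIO('id\tname\n1\tAlice\n2\t"Bob" Jr\n3\tCarol\textra\n')
header, good, bad = split_good_and_bad_rows(sample)
print(header)
print(good)
print(bad)
```

read_csv also accepts a quoting argument, so passing quoting=csv.QUOTE_NONE there may avoid the exception as well, though that has not been checked against this particular file.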

Calling `Model.predict` in graph mode is not supported when the `Model` instance was constructed with eager mode enabled

I was following someone else's project and got this far before hitting this error:
[2020-10-12 15:33:21,128] ERROR in app: Exception on /predict/ [POST]
Traceback (most recent call last):
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\flask\app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\flask\app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "D:\Ngoding Python\Skripsi\deploy\app.py", line 70, in predict
out = model.predict(img)
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\tensorflow\python\keras\engine\training.py", line 130, in _method_wrapper
return method(self, *args, **kwargs)
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1562, in predict
version_utils.disallow_legacy_graph('Model', 'predict')
File "c:\users\mr777\anaconda3\envs\gpu\lib\site-packages\tensorflow\python\keras\utils\version_utils.py", line 122, in disallow_legacy_graph
raise ValueError(error_msg)
ValueError: Calling `Model.predict` in graph mode is not supported when the `Model` instance was constructed with eager mode enabled. Please construct your `Model` instance in graph mode or call `Model.predict` with eager mode enabled.
Here's the code I wrote:
with graph.as_default():
    # perform the prediction
    out = model.predict(img)
    print(out)
    print(class_names[np.argmax(out)])
    # convert the response to a string
    response = class_names[np.argmax(out)]
    return str(response)
Any ideas about this? I found the same question here.
The answer is simple, just load your model inside the graph just like this:
with graph.as_default():
    json_file = open('models/model.json','r')
    loaded_model_json = json_file.read()
    json_file.close()
    loaded_model = model_from_json(loaded_model_json)
    #load weights into new model
    loaded_model.load_weights("models/model.h5")
    print("Loaded Model from disk")
    #compile and evaluate loaded model
    loaded_model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    # perform the prediction
    out = loaded_model.predict(img)
    print(out)
    print(class_names[np.argmax(out)])
    # convert the response to a string
    response = class_names[np.argmax(out)]
    return str(response)
@Ilham: Try wrapping the call method in a tf.function right after defining your network. Something like this:
model = Sequential()
model.call = tf.function(model.call)
I had an issue similar to yours. I solved it just by adding that second line of code.
See the following link for more details: https://www.tensorflow.org/guide/intro_to_graphs

TypeError: '<' not supported between instances of 'str' and 'int' Doc2Vec

Any ideas why this error is being thrown?
"TypeError: '<' not supported between … 'str' and 'int'" when doc-tag not present for most_similar()
I have a list of .txt documents stored in my data folder and want to compare one doc to another through my flask app on localhost.
Traceback (most recent call last):
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 2463, in __call__
return self.wsgi_app(environ, start_response)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 2449, in wsgi_app
response = self.handle_exception(e)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1866, in handle_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\ibrahimm\Desktop\doc2vec-compare-doc-demo\app.py", line 56, in api_compare_2
vec1 = d2v_model.docvecs.most_similar(data['doc1'])
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 1715, in most_similar
elif doc in self.doctags or doc < self.count:
TypeError: '<' not supported between instances of 'str' and 'int'
app.py
@app.route('/api/compare_2', methods=['POST'])
def api_compare_2():
    data = request.get_json()
    if not 'doc1' in data or not 'doc2' in data:
        return 'ERROR'
    vec1 = d2v_model.docvecs.most_similar(data['doc1'])
    vec2 = d2v_model.docvecs.most_similar(data['doc2'])
    vec1 = gensim.matutils.full2sparse(vec1)
    vec2 = gensim.matutils.full2sparse(vec2)
    print (data)
    print (vec2)
    print (vec1)
    return jsonify(sim=gensim.matutils.cossim(vec1, vec2))
@app.route('/api/compare_all', methods=['POST'])
def api_compare_all():
    data = request.get_json()
    if not 'doc' in data:
        return 'ERROR'
    vec = d2v_model.docvecs.most_similar(data['doc'])
    res = d2v_model.docvecs.most_similar([vec], topn=5)
    return jsonify(list=res)
model.py
def load_model():
    try:
        return gensim.models.doc2vec.Doc2Vec.load("doc2vec.model2")
    except:
        print ('Model not found!')
        return None
def train_model():
    #path to the input corpus files
    data="data"
    #tagging the text files
    class DocIterator(object):
        def __init__(self, doc_list, labels_list):
            self.labels_list = labels_list
            self.doc_list = doc_list
        def __iter__(self):
            for idx, doc in enumerate(self.doc_list):
                yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])
    docLabels = [f for f in listdir(data) if f.endswith('.txt')]
    print(docLabels)
    data = []
    for doc in docLabels:
        data.append(open(r'C:\Users\ibrahimm\Desktop\doc2vec-compare-doc-demo\data\\' + doc,
                         encoding='cp437').read())
    tokenizer = RegexpTokenizer(r'\w+')
    stopword_set = set(stopwords.words('english'))
    #This function does all cleaning of data using two objects above
    def nlp_clean(data):
        new_data = []
        for d in data:
            new_str = d.lower()
            dlist = tokenizer.tokenize(new_str)
            dlist = list(set(dlist).difference(stopword_set))
            new_data.append(dlist)
        return new_data
    data = nlp_clean(data)
    it = DocIterator(data, docLabels)
    #train doc2vec model
    model = gensim.models.Doc2Vec(size=300, window=15, min_count=4, workers=10,alpha=0.025, min_alpha=0.025, iter=20) # use fixed learning rate
    model.build_vocab(it)
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.save("doc2vec.model2")
If you try to look-up a string doc-tag that's not in the model, you unfortunately get this confusing error, instead of a clearer error. (See gensim's open-issue: https://github.com/RaRe-Technologies/gensim/issues/1737#issuecomment-346995119 )
Whatever is in data['doc1'] isn't a tag in the model.
You may be able to pre-check, before attempting a most_similar() operation, by looking at whether data['doc1'] in model.docvecs is True.
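To see where the confusing message comes from, here is a small stand-in for the condition shown on the last frame of the traceback (doc in self.doctags or doc < self.count). It is a sketch in plain Python, not gensim's code: the function names, the doctags dictionary and the count value are all hypothetical. When the string tag is present, the first test short-circuits. When it is missing, Python falls through to comparing a str with an int, which is the TypeError above.

```python
def is_known_doc(doc, doctags, count):
    """Mimic the check from the traceback: tag lookup first, then index comparison."""
    return doc in doctags or doc < count


def safe_is_known_doc(doc, doctags, count):
    """Only compare against count when doc is an integer index."""
    if isinstance(doc, int):
        return 0 <= doc < count
    return doc in doctags


# Hypothetical stand-ins for the model's tag table and document count.
doctags = {"a.txt": 0, "b.txt": 1}
count = len(doctags)

known = is_known_doc("a.txt", doctags, count)  # tag exists, so no comparison happens

try:
    is_known_doc("missing.txt", doctags, count)
    message = None
except TypeError as exc:
    message = str(exc)

print(known)
print(message)
print(safe_is_known_doc("missing.txt", doctags, count))
```

In the Flask handlers, the equivalent guard would be the membership test suggested above, returning an error response when the tag is not in the model.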
TypeError: '<' not supported between instances of 'str' and 'int'
[35182] Failed to execute script docker-compose
This error was the result of copying and pasting code that contained the wrong kind of quotation mark. Replace it with a plain ''.

How to write a pickle file to S3, as a result of a luigi Task?

I want to store a pickle file on S3, as a result of a luigi Task. Below is the class that defines the Task:
class CreateItemVocabulariesTask(luigi.Task):
    def __init__(self):
        self.client = S3Client(AwsConfig().aws_access_key_id,
                               AwsConfig().aws_secret_access_key)
        super().__init__()
    def requires(self):
        return [GetItem2VecDataTask()]
    def run(self):
        filename = 'item2vec_results.tsv'
        data = self.client.get('s3://{}/item2vec_results.tsv'.format(AwsConfig().item2vec_path),
                               filename)
        df = pd.read_csv(filename, sep='\t', encoding='latin1')
        unique_users = df['CustomerId'].unique()
        unique_items = df['ProductNumber'].unique()
        item_to_int, int_to_item = utils.create_lookup_tables(unique_items)
        user_to_int, int_to_user = utils.create_lookup_tables(unique_users)
        with self.output()[0].open('wb') as out_file:
            pickle.dump(item_to_int, out_file)
        with self.output()[1].open('wb') as out_file:
            pickle.dump(int_to_item, out_file)
        with self.output()[2].open('wb') as out_file:
            pickle.dump(user_to_int, out_file)
        with self.output()[3].open('wb') as out_file:
            pickle.dump(int_to_user, out_file)
    def output(self):
        files = [S3Target('s3://{}/item2int.pkl'.format(AwsConfig().item2vec_path), client=self.client),
                 S3Target('s3://{}/int2item.pkl'.format(AwsConfig().item2vec_path), client=self.client),
                 S3Target('s3://{}/user2int.pkl'.format(AwsConfig().item2vec_path), client=self.client),
                 S3Target('s3://{}/int2user.pkl'.format(AwsConfig().item2vec_path), client=self.client),]
        return files
When I run this task I get the error ValueError: Unsupported open mode 'wb'. The items I try to dump into a pickle file are just python dictionaries.
Full traceback:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\luigi\worker.py", line 203, in run
new_deps = self._run_get_new_deps()
File "C:\Anaconda3\lib\site-packages\luigi\worker.py", line 140, in _run_get_new_deps
task_gen = self.task.run()
File "C:\Users\user\Documents\python workspace\pipeline.py", line 60, in run
with self.output()[0].open('wb') as out_file:
File "C:\Anaconda3\lib\site-packages\luigi\contrib\s3.py", line 714, in open
raise ValueError("Unsupported open mode '%s'" % mode)
ValueError: Unsupported open mode 'wb'
This issue only happens on Python 3.x, as explained here. To use Python 3 and write a binary file or target (i.e. using 'wb' mode), set the format parameter of S3Target to Nop, like this:
S3Target('s3://path/to/file', client=self.client, format=luigi.format.Nop)
Note that this is a workaround: it is neither intuitive nor documented.
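As a rough illustration of why binary mode matters here: pickle.dump writes bytes, so any stream that expects text will reject its output. The sketch below uses only in-memory streams from the standard library as stand-ins for the S3 target and does not touch luigi at all. That the default S3Target format behaves like a text stream on Python 3 is my reading of the answer above, not something verified here. The dictionary contents are hypothetical.

```python
import io
import pickle

item_to_int = {"A100": 0, "B200": 1}  # hypothetical lookup table

# A text-mode stream rejects the bytes that pickle produces.
text_stream = io.StringIO()
try:
    pickle.dump(item_to_int, text_stream)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# A binary stream accepts them, and the dictionary round-trips unchanged.
binary_stream = io.BytesIO()
pickle.dump(item_to_int, binary_stream)
binary_stream.seek(0)
restored = pickle.load(binary_stream)

print(type(text_mode_error).__name__)
print(restored)
```

With format=luigi.format.Nop the target should hand the bytes through unchanged, which is what pickle needs.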

pypyodbc execute returns list index out of range error

I have a function that runs 3 queries and returns the result of the last one (the first two are used to build the third). When I get to the 3rd query, I get a "list index out of range" error. I have run this exact query as the first query (with manually entered variables) and it worked fine.
This is my code:
import pypyodbc
def sql_conn():
    conn = pypyodbc.connect(r'Driver={SQL Server};'
                            r'Server=HPSQL31\ni1;'
                            r'Database=tq_hp_prod;'
                            r'Trusted_Connection=yes;')
    cursor = conn.cursor()
    return conn, cursor
def get_number_of_jobs(ticket):
    # Get Connection
    conn, cursor = sql_conn()
    # Get asset number
    sqlcommand = "select top 1 item from deltickitem where dticket = {} and cat_code = 'Trq sub'".format(ticket)
    cursor.execute(sqlcommand)
    asset = cursor.fetchone()[0]
    print(asset)
    # Get last MPI date
    sqlcommand = "select last_test from prevent where item = {} and description like '%mpi'".format(asset)
    cursor.execute(sqlcommand)
    last_recal = cursor.fetchone()[0]
    print(last_recal)
    # Get number of jobs since last recalibration
    sqlcommand = """select count(i.item)
    from deltickhdr as d
    join deltickitem as i
    on d.dticket = i.dticket
    where i.start_rent >= '2017-03-03 00:00:00'
    and i.meterstart <> i.meterstop
    and i.item = '002600395'""" #.format(last_recal, asset)
    cursor.execute(sqlcommand)
    num_jobs = cursor.fetchone()[0]
    print(num_jobs)
    cursor.close()
    conn.close()
    return num_jobs
ticketnumber = 14195 # int(input("Ticket: "))
get_number_of_jobs(ticketnumber)
Below is the error I get when I reach the 3rd cursor.execute(sqlcommand):
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2016.3.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/bdrillin/PycharmProjects/Torque_Turn_Data/tt_sub_ui.py", line 56, in <module>
get_number_of_jobs(ticketnumber)
File "C:/Users/bdrillin/PycharmProjects/Torque_Turn_Data/tt_sub_ui.py", line 45, in get_number_of_jobs
cursor.execute(sqlcommand)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1470, in execute
self._free_stmt(SQL_CLOSE)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1994, in _free_stmt
check_success(self, ret)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 1007, in check_success
ctrl_err(SQL_HANDLE_STMT, ODBC_obj.stmt_h, ret, ODBC_obj.ansi)
File "C:\ProgramData\Anaconda3\lib\site-packages\pypyodbc.py", line 972, in ctrl_err
state = err_list[0][0]
IndexError: list index out of range
Any help would be great
I've had the same error.
Even though I haven't come to the definite conclusion about what this error means I thought my guessing might help anyone else ending up here.
In my case, the problem was a conflict with a datatype length (NVARCHAR(24) and CHAR(10)).
So I guess this IndexError in ctrl_err function just means there is an error in your SQL code that pypyodbc does not know how to handle.
I know this is not much of an answer, but I know it would have saved me a couple of hours had I known this was not some bug in pypyodbc but an inconsistency in the data I was inserting.
Kind regards,
Luka
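Following up on the datatype idea above: the second query in the question formats the asset number into the SQL without quotes, so a value like '002600395' may end up being compared as a number rather than as a string. That is a guess, not a confirmed cause. Passing values as query parameters lets the driver handle the types. The sketch below uses the standard library's sqlite3 as a stand-in for SQL Server, with a hypothetical one-row table. As far as I know pypyodbc uses the same ? placeholder style, but check that against its documentation.

```python
import sqlite3

# In-memory database standing in for the real server (hypothetical data).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table prevent (item text, description text, last_test text)")
cur.execute(
    "insert into prevent values (?, ?, ?)",
    ("002600395", "annual mpi", "2017-03-03 00:00:00"),
)

asset = "002600395"
# Values are passed separately from the SQL text, so the leading zeros survive
# and no manual quoting is needed.
cur.execute(
    "select last_test from prevent where item = ? and description like ?",
    (asset, "%mpi"),
)
row = cur.fetchone()
# fetchone() returns None when nothing matches, so guard before indexing.
last_recal = row[0] if row is not None else None
print(last_recal)

cur.close()
conn.close()
```

Guarding the fetchone() result also avoids a separate TypeError when a query returns no rows.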