How to write custom PySpark serializer - serialization

I want to write a custom pyspark serializer. There is very little documentation online apart from details here.
My logic is as follows:
If I receive my special object, then I use custom logic to serialize/deserialize.
Otherwise, use cPickle for the same.
The custom serializer looks like the following:
from pyspark.serializers import FramedSerializer
class CustomSerializer(FramedSerializer):
import cPickle as pickle
import CustomClass
def dumps(self, obj):
if isinstance(obj, CustomClass):
bytes_str = obj.serialize()
bytes_str = '\1' + bytes_str
elif isinstance(obj, CustomClass.Location):
bytes_str = obj.serialize()
bytes_str = '\2' + bytes_str
else:
bytes_str = pickle.dumps(obj)
bytes_str = '\0' + bytes_str
return bytes_str
def loads(self, bytes_str):
c = bytes_str[0]
if c=='\1':
obj = CustomClass()
obj.parse_from_string(bytes_str[1:])
elif c=='\2':
obj = CustomClass.Location()
obj.parse_from_string(bytes_str[1:])
else:
obj = pickle.loads(bytes_str[1:])
return obj
While initiating SparkContext, I make sure I specify the custom serializer:
serializer = CustomSerializer()
sc = SparkContext(appName='MyApp', serializer=serializer)
However, I still get error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/user/appcache/application_1507044435666_0035/container_1507044435666_0035_01_000002/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/mnt/yarn/usercache/user/appcache/application_1507044435666_0035/container_1507044435666_0035_01_000002/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/user/appcache/application_1507044435666_0035/container_1507044435666_0035_01_000002/pyspark.zip/pyspark/serializers.py", line 272, in dump_stream
bytes = self.serializer.dumps(vs)
File "<ipython-input-2-536808351108>", line 14, in dumps
PicklingError: Can't pickle <class 'CustomClass.Location'>: attribute lookup Location failed
What am I missing?
Thanks.

Related

pyspark RDDs strip attributes of numpy subclasses

I've been fighting an unexpected behavior when attempting to construct a subclass of numpy ndarray within a map call to a pyspark RDD. Specifically, the attribute that I added within the ndarray subclass appears to be stripped from the resulting RDD.
The following snippets contain the essence of the issue.
import numpy as np
class MyArray(np.ndarray):
def __new__(cls,shape,extra=None,*args):
obj = super().__new__(cls,shape,*args)
obj.extra = extra
return obj
def __array_finalize__(self,obj):
if obj is None:
return
self.extra = getattr(obj,"extra",None)
def shape_to_array(shape):
rval = MyArray(shape,extra=shape)
rval[:] = np.arange(np.product(shape)).reshape(shape)
return rval
If I invoke shape_to_array directly (not under pyspark), it behaves as expected:
x = shape_to_array((2,3,5))
print(x.extra)
outputs:
(2, 3, 5)
But, if I invoke shape_to_array via a map to an RDD of inputs, it goes wonky:
from pyspark.sql import SparkSession
sc = SparkSession.builder.appName("Steps").getOrCreate().sparkContext
rdd = sc.parallelize([(2,3,5),(2,4),(2,5)])
result = rdd.map(shape_to_array).cache()
print(result.map(lambda t:type(t)).collect())
print(result.map(lambda t:t.shape).collect())
print(result.map(lambda t:t.extra).collect())
Outputs:
[<class '__main__.MyArray'>, <class '__main__.MyArray'>, <class '__main__.MyArray'>]
[(2, 3, 5), (2, 4), (2, 5)]
22/10/15 15:48:02 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 23)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
process()
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/util.py", line 81, in wrapper
return f(*args, **kwargs)
File "/var/folders/w7/42_p7mcd1y91_tjd0jzr8zbh0000gp/T/ipykernel_94831/2519313465.py", line 1, in <lambda>
AttributeError: 'MyArray' object has no attribute 'extra'
What happened to the extra attribute of the MyArray instances?
Thanks much for any/all suggestions
EDIT: A bit of additional info. If I add logging inside the shape_to_array function just before the return, I can verify that the extra attribute does exist on the DataArray object that is being returned. But when I attempt to access the DataArray elements in the RDD from the main driver, they're gone.
After a night of sleeping on this, I remembered that I have often had issues with pyspark RDDs where the error message had to do the return type not working with pickle.
I wasn't getting that error message this time because numpy.ndarray does work with pickle. BUT... the __reduce__ and __setstate__ methods of numpy.ndarray known nothing of the added extra attribute on the MyArray subclass. This is where extra was being stripped.
Adding the following two methods to MyArray solved everything.
def __reduce__(self):
mthd,cls,args = super().__reduce__(self)
return mthd, cls, args + (self.extra,)
def __setstate__(self,args):
super().__setstate__(args[:-1])
self.extra = args[-1]
Thank you to anyone who took some time to think about my question.

Unable to use 'read_sql' to call a SQL query class

I am trying to pull results from the database with the following code:
import pandas as pd
import pyodbc
class DataManagement(object):
def __init__(self, database = None, server=None, trusted_connection=True, database_driver=ODBC_SQL2005_2012, uid=None, pwd=None):
self.server = server
self.database = database
self.uid = uid
self.pwd = pwd
# Use default server name none supplied - assumed to be localhost
if self.server is None:
self.server = SERVER
if self.database is None:
self.database = DATABASE
# Use default sql credentials if none provided
if self.uid is None or self.pwd is None:
self.uid = DEFAULT_UID
self.pwd = DEFAULT_PASSWORD
if trusted_connection:
self.connectionstring = "DRIVER={0};SERVER={1};DATABASE={2};Trusted_Connection=yes;".format(database_driver, self.server, self.database)
else:
self.connectionstring = 'DRIVER={0};SERVER={1};DATABASE={2};UID={3};PWD={4};'.format(database_driver, self.server, self.database, uid, pwd)
self.connection = pyodbc.connect(self.connectionstring)
self.cursor = self.connection.cursor()
def __enter__(self):
return self
def __exit__(self, ctx_type, ctx_value, ctx_traceback):
self.connection.commit()
self.connection.close()
qq = DataManagement()
sql =("select * from ***** ")
data_df = pd.read_sql(sql, qq)
I get an error:
Traceback (most recent call last):
File "<ipython-input-94-453876631fe0>", line 3, in <module>
data_df = pd.read_sql(sql, qq)
File "***\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 380, in read_sql
chunksize=chunksize)
File "***\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1468, in read_query
cursor = self.execute(*args)
File "***\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1426, in execute
cur = self.con.cursor()
TypeError: 'pyodbc.Cursor' object is not callable
I saw a similar question at TypeError: 'pyodbc.Cursor' object is not callable (Python 3.6) but unable to get an answer from there.
I got it to work by editing the class from
self.cursor = self.connection.cursor()
into
self.cursor = self.connection.cursor

TypeError: '<' not supported between instances of 'str' and 'int' Doc2Vec

Any ideas why this error is being thrown
"TypeError: '<' not supported between … 'str' and 'int'" when doc-tag not present for most_similar()
I have a list of .txt documents stored in my data folder and want to compare one doc to another through my flask app on localhost.
Traceback (most recent call last):
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
2463, in __call__
return self.wsgi_app(environ, start_response)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
2449, in wsgi_app
response = self.handle_exception(e)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
1866, in handle_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\_compat.py", line
39, in reraise
raise value
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
2446, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
1820,
in handle_user_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\_compat.py", line
39, in reraise
raise value
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
1949,
in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line
1935,
in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\ibrahimm\Desktop\doc2vec-compare-doc-demo\app.py", line 56, in api_compare_2
vec1 = d2v_model.docvecs.most_similar(data['doc1'])
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-
packages\gensim\models\keyedvectors.py", line 1715, in most_similar
elif doc in self.doctags or doc < self.count:
TypeError: '<' not supported between instances of 'str' and 'int'\
app.py
#app.route('/api/compare_2', methods=['POST'])
def api_compare_2():
data = request.get_json()
if not 'doc1' in data or not 'doc2' in data:
return 'ERROR'
vec1 = d2v_model.docvecs.most_similar(data['doc1'])
vec2 = d2v_model.docvecs.most_similar(data['doc2'])
vec1 = gensim.matutils.full2sparse(vec1)
vec2 = gensim.matutils.full2sparse(vec2)
print (data)
print (vec2)
print (vec1)
return jsonify(sim=gensim.matutils.cossim(vec1, vec2))
#app.route('/api/compare_all', methods=['POST'])
def api_compare_all():
data = request.get_json()
if not 'doc' in data:
return 'ERROR'
vec = d2v_model.docvecs.most_similar(data['doc'])
res = d2v_model.docvecs.most_similar([vec], topn=5)
return jsonify(list=res)
model.py
def load_model():
try:
return gensim.models.doc2vec.Doc2Vec.load("doc2vec.model2")
except:
print ('Model not found!')
return None
def train_model():
#path to the input corpus files
data="data"
#tagging the text files
class DocIterator(object):
def __init__(self, doc_list, labels_list):
self.labels_list = labels_list
self.doc_list = doc_list
def __iter__(self):
for idx, doc in enumerate(self.doc_list):
yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])
docLabels = [f for f in listdir(data) if f.endswith('.txt')]
print(docLabels)
data = []
for doc in docLabels:
data.append(open(r'C:\Users\ibrahimm\Desktop\doc2vec-compare-doc-demo\data\\' + doc,
encoding='cp437').read())
tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))
#This function does all cleaning of data using two objects above
def nlp_clean(data):
new_data = []
for d in data:
new_str = d.lower()
dlist = tokenizer.tokenize(new_str)
dlist = list(set(dlist).difference(stopword_set))
new_data.append(dlist)
return new_data
data = nlp_clean(data)
it = DocIterator(data, docLabels)
#train doc2vec model
model = gensim.models.Doc2Vec(size=300, window=15, min_count=4, workers=10,alpha=0.025, min_alpha=0.025, iter=20) # use fixed learning rate
model.build_vocab(it)
model.train(it, epochs=model.iter, total_examples=model.corpus_count)
model.save("doc2vec.model2")
If you try to look-up a string doc-tag that's not in the model, you unfortunately get this confusing error, instead of a clearer error. (See gensim's open-issue: https://github.com/RaRe-Technologies/gensim/issues/1737#issuecomment-346995119 )
Whatever is in data['doc1'] isn't a tag in the model.
You may be able to pre-check, before attempting a most_similar() operation, by looking at whether data['doc1'] in model.docvecs is True.
TypeError: '<' not supported between instances of 'str' and 'int'
[35182] Failed to execute script docker-compose
This error is was as a result of copy and paste code with a wrong quotation mark(). change this to this ''

Spacy phrasematcher does not get matcher name

I am new to phraseMatcher and want to extract some keyword from my emails.
Everything is working well except that I can't get a name of added matcher.
This is my code below:
def main():
patterns_months = 'phraseMatcher/months.txt'
text_loc = 'phraseMatcher/text.txt'
nlp = spacy.blank('en')
nlp.vocab.lex_attr_getters ={}
phrases_months = read_gazetter(patterns_months)
txts = read_text(text_loc, n=n)
months = [nlp(text) for text in phrases_months]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('MONTHS', None, *months)
print(nlp.vocab.strings['MONTHS'])
for txt in txts:
doc = nlp(txt)
matches = matcher(doc)
for match_id ,start, end in matches:
span = doc[start: end]
label = nlp.vocab.strings[match_id]
print(label, span.text, start, end)
The result:
12298211501233906429 <--- this is from print(nlp.vocab.strings['MONTHS'])
Traceback (most recent call last):
File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 71, in <module>
plac.call(main)
File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 47, in main
label = nlp.vocab.strings[match_id]
File "strings.pyx", line 117, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744072093410045'."
spaCy version:** 2.0.12
Platform:** Windows-7-6.1.7601-SP1
Python version:** 3.7.0
I can't find what I did wrong. It is simple and I read these already:
Using PhraseMatcher in SpaCy to find multiple match types
Help me, thanks in advance.

How to write a pickle file to S3, as a result of a luigi Task?

I want to store a pickle file on S3, as a result of a luigi Task. Below is the class that defines the Task:
class CreateItemVocabulariesTask(luigi.Task):
def __init__(self):
self.client = S3Client(AwsConfig().aws_access_key_id,
AwsConfig().aws_secret_access_key)
super().__init__()
def requires(self):
return [GetItem2VecDataTask()]
def run(self):
filename = 'item2vec_results.tsv'
data = self.client.get('s3://{}/item2vec_results.tsv'.format(AwsConfig().item2vec_path),
filename)
df = pd.read_csv(filename, sep='\t', encoding='latin1')
unique_users = df['CustomerId'].unique()
unique_items = df['ProductNumber'].unique()
item_to_int, int_to_item = utils.create_lookup_tables(unique_items)
user_to_int, int_to_user = utils.create_lookup_tables(unique_users)
with self.output()[0].open('wb') as out_file:
pickle.dump(item_to_int, out_file)
with self.output()[1].open('wb') as out_file:
pickle.dump(int_to_item, out_file)
with self.output()[2].open('wb') as out_file:
pickle.dump(user_to_int, out_file)
with self.output()[3].open('wb') as out_file:
pickle.dump(int_to_user, out_file)
def output(self):
files = [S3Target('s3://{}/item2int.pkl'.format(AwsConfig().item2vec_path), client=self.client),
S3Target('s3://{}/int2item.pkl'.format(AwsConfig().item2vec_path), client=self.client),
S3Target('s3://{}/user2int.pkl'.format(AwsConfig().item2vec_path), client=self.client),
S3Target('s3://{}/int2user.pkl'.format(AwsConfig().item2vec_path), client=self.client),]
return files
When I run this task I get the error ValueError: Unsupported open mode 'wb'. The items I try to dump into a pickle file are just python dictionaries.
Full traceback:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\luigi\worker.py", line 203, in run
new_deps = self._run_get_new_deps()
File "C:\Anaconda3\lib\site-packages\luigi\worker.py", line 140, in _run_get_new_deps
task_gen = self.task.run()
File "C:\Users\user\Documents\python workspace\pipeline.py", line 60, in run
with self.output()[0].open('wb') as out_file:
File "C:\Anaconda3\lib\site-packages\luigi\contrib\s3.py", line 714, in open
raise ValueError("Unsupported open mode '%s'" % mode)
ValueError: Unsupported open mode 'wb'
This is an issue that only happens on python 3.x as explained here. In order to use python 3 and write a binary file or target (ie using 'wb' mode) just set format parameter for S3Target to Nop. Like this:
S3Target('s3://path/to/file', client=self.client, format=luigi.format.Nop)
Notice it's just a trick and not so intuitive nor documented.