Dask: Access published dataset from within a submitted job - pandas

# Init
import time
import pandas as pd
import numpy as np
from dask.distributed import Client
client = Client()
# Publish data
dataset_name = 'my_dataset'
df_my_dataset = pd.DataFrame(np.ones((2,3)), dtype=np.float32)
client.publish_dataset(df_my_dataset, name=dataset_name)
It's there:
In [13]: client.list_datasets()
Out[13]: ('my_dataset',)
Next I create the function to submit to Dask. Here I would like to access the published dataset by name:
# submit function
def get_gate1_rows(df_from_submit):
    return df_from_submit.mean()
    # return df.mean() + my_dataset.mean() #### <<<<<<< How to do this?
And finally the submit:
# Submit code
df_zeros = np.zeros((2,3), dtype=np.float32)
future = client.submit(get_gate1_rows, df_zeros)
time.sleep(2)
result = future.result()
This yields the following, but it should be 0.5:
In [41]: result
Out[41]: 0.0
So how can I access the published dataset from within the dask job?

To access the published datasets within a task, you need get_client:
import distributed

def get_gate1_rows(df_from_submit):
    client = distributed.get_client()
    my_dataset = client.get_dataset('my_dataset')
    return df_from_submit.mean() + my_dataset.mean()
(The result is a Series of three 1.0s, since df_zeros.mean() is 0 and df_my_dataset.mean() is a per-column Series of 1.0s.)
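For completeness, here is a minimal end-to-end sketch of the round trip (my own assembly of the pieces above, assuming a local Client(); names follow the question):
import numpy as np
import pandas as pd
from dask.distributed import Client, get_client

def get_gate1_rows(df_from_submit):
    # fetch the published dataset from inside the task via the worker's client
    my_dataset = get_client().get_dataset('my_dataset')
    return df_from_submit.mean() + my_dataset.mean()

if __name__ == '__main__':
    client = Client()
    client.publish_dataset(pd.DataFrame(np.ones((2, 3)), dtype=np.float32), name='my_dataset')
    future = client.submit(get_gate1_rows, np.zeros((2, 3), dtype=np.float32))
    print(future.result())  # Series of three 1.0s
    client.close()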

Related

how to make DataFlow from CSV in googleDrive to tf DataSet - in Colab

Following the instructions in Colab I can get a buffer and even build a pd.DataFrame from it (the file is just an example):
# ... authentication
file_id = '1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU' # titanic
# loading data
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
drive_service = build('drive', 'v3') # , credentials=creds
request = drive_service.files().get_media(fileId=file_id)
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)
done = False
while not done:  # download the file into the buffer chunk by chunk
    _, done = downloader.next_chunk()
buf.seek(0)
import pandas as pd
df = pd.read_csv(buf)
print(df.head())
But I have trouble with creating the tf.data.Dataset correctly: the "buf" variable does not work in
dataset = tf.data.experimental.make_csv_dataset(csv_file_path,
                                                batch_size=100, num_epochs=1)
only "csv_file_path" as 1st arg. Is it possible in Colab to get IO from my GoogleDrive's csv-file into Dataset (used further in training)? And how to do it in a memory-efficient manner?..
P.S.
I understand that I could perhaps make the file public in Google Drive and get a URL to use the simple way:
#TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TRAIN_DATA_URL = "https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1)
But I do NOT want to share the real file. How can I keep the file confidential and still get its contents (from Google Drive) into a tf.data.Dataset in Colab? Preferably with the shortest possible code, since there will be much more code in the real project tested in Colab.
drive.CreateFile helped (link): as I understand it, when working in Colab I am working in a separate environment (separate from my PC and my internet environment). So I tried, following the link:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing
link = 'https://drive.google.com/open?id=1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU'
fluff, id = link.split('=')
print (id) # Verify that you have everything after '='
downloaded = drive.CreateFile({'id':id})
downloaded.GetContentFile('Filename.csv')
import tensorflow as tf
ds = tf.data.experimental.make_csv_dataset('Filename.csv', batch_size=100, num_epochs=1)
iterator = ds.as_numpy_iterator()
print(next(iterator))
It works for me. Thanks for the interest in the topic.
Even simpler:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')
_types = [float(), float(), float(), float(), str()]
_lines = tf.data.TextLineDataset('/content/drive/My Drive/iris.csv')
ds=_lines.skip(1).map(lambda x: tf.io.decode_csv(x, record_defaults=_types) )
ds0= ds.take(2)
print(*ds0.as_numpy_iterator(), sep='\n') # print list with sep => by rows.
Or from a DataFrame (and batched for memory-efficient usage):
import numpy as np
import pandas as pd
import tensorflow as tf
# Load the Drive helper and mount
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/iris.csv', dtype='float32',
                 converters={'variety': str}, nrows=20, decimal='.')
ds = tf.data.Dataset.from_tensor_slices(dict(df))  # if mixed types
ds = ds.shuffle(20, reshuffle_each_iteration=False)  # for train ds ONLY!
ds = ds.batch(batch_size=4)
ds = ds.prefetch(4)
# labels
label = ds.map(lambda x: x['variety'])
print(list(label.as_numpy_iterator()))
# features
# features = ds.map(lambda x: (x['sepal.length'], x['sepal.width']))
# Or with dynamic keys:
features = ds.map(lambda x: list(map(x.get, np.setdiff1d(list(x.keys()), ['variety']))))
print(list(features.as_numpy_iterator()))
Any transformations can be applied inside the map call.

Cannot store an array using dask

I am using the following code to create an array and store the results sequentially in HDF5 format. I was checking the dask documentation, and it suggested using dask.store to store the arrays generated in a function like mine. However, I receive an error: dask has no attribute store.
My code:
import os
import numpy as np
import time
import concurrent.futures
import multiprocessing
from itertools import product
import h5py
import dask as da

def mean_py(array):
    start_time = time.time()
    x = array.shape[1]
    y = array.shape[2]
    values = np.empty((x, y), type(array[0][0][0]))
    for i in range(x):
        for j in range(y):
            values[i][j] = np.mean(array[:, i, j])
    end_time = time.time()
    hours, rem = divmod(end_time - start_time, 3600)
    minutes, seconds = divmod(rem, 60)
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), int(seconds)))
    print(f"{'.'*80}")
    return values

def generate_random_array():
    a = np.random.randn(120560400).reshape(10980, 10980)
    return a

def generate_array(nums):
    for num in range(nums):
        a = generate_random_array()
        f = h5py.File('test_db.hdf5')
        d = f.require_dataset('/data', shape=a.shape, dtype=a.dtype)
        da.store(a, d)

start = time.time()
generate_array(8)
end = time.time()
print(f'\nTime complete: {end-start:.2f}s\n')
Should I use dask for such a task, or do you recommend storing the results using h5py directly?
Please ignore the mean_py(array) function; it's for something I want to try out once the data has been produced.
As suggested in the comments, you're currently doing this:
import dask as da
when you probably meant to do this:
import dask.array as da
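For reference, here is a minimal sketch of how dask.array store is typically used with an h5py dataset (array size, chunking, and file/dataset names are placeholders of my own, not taken from the question): wrap the NumPy array in a chunked dask array and pass the HDF5 dataset as the store target.
import numpy as np
import h5py
import dask.array as da

a = np.random.randn(10980, 10980)
x = da.from_array(a, chunks=(1000, 1000))  # chunked dask array view of the data

with h5py.File('test_db.hdf5', 'a') as f:
    d = f.require_dataset('/data', shape=a.shape, dtype=a.dtype)
    da.store(x, d)  # writes the chunks into the HDF5 dataset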

small test_set xgb predict

I would like to ask a question about a problem I have had for the last couple of days.
First of all, I am a beginner in machine learning and this is my first time using the XGBoost algorithm, so excuse me for any mistakes I have made.
I trained my model to predict whether a log file is malicious or not. After I save and reload my model in a different session, I use the predict function, which seems to work normally (with a few deviations in probabilities, but that is another topic; I have seen it discussed in another question).
The problem is this: sometimes when I try to predict on a "small" CSV file after loading, the model seems to be broken, predicting only the zero label, even for rows that were categorized correctly before.
For example, I load a dataset containing 20,000 rows and predict() works. I keep only the first 5 of these rows using pandas drop, and it still works. If I save those 5 rows to a different CSV and reload it, it does not work. The same error happens if I just remove all the other rows (19,995) by hand and save the file with only the 5 remaining.
I would bet it is a file-size problem, but when I drop the rows from the DataFrame through pandas it seems to work.
Also, the number 5 (of rows) is just for the example; the same happens if I delete a large portion of the dataset.
I first ran into this problem after trying to verify by hand some completely new logs, which seem to be classified correctly if added to the big CSV file but not in a new file on their own.
Here is my load-and-predict code:
##IMPORTS
import os
import pandas as pd
from pandas.compat import StringIO
from datetime import datetime
from langid.langid import LanguageIdentifier, model
import langid
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.externals import joblib
from ggplot import ggplot, aes, geom_line
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import average_precision_score
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from collections import defaultdict
import pickle
df = pd.read_csv('big_test.csv')
df3 = pd.read_csv('small_test.csv')
#This one is necessary for the loaded_model
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        else:
            return x[self.column_list].to_dict(orient='records')
loaded_model = joblib.load('finalized_model.sav')
result = loaded_model.predict(df)
print(result)
df2=df[:5]
result2 = loaded_model.predict(df2)
print(result2)
result3 = loaded_model.predict(df3)
print(result3)
The results I get are these:
[1 0 1 ... 0 0 0]
[1 0 1 0 1]
[0 0 0 0 0]
I can provide any code, even from training, or my dataset if necessary.
EDIT: I use a pipeline for my data. I tried to reproduce the error by fitting XGBoost on the iris data and I could not. Maybe there is something wrong with my pipeline? The code is below:
df = pd.read_csv('big_test.csv')
# df.info()
# Split Dataset
attributes = ['uri','code','r_size','DT_sec','Method','http_version','PenTool','has_referer', 'Lang','LangProb','GibberFlag' ]
x_train, x_test, y_train, y_test = train_test_split(df[attributes], df['Scan'], test_size=0.2,
stratify=df['Scan'], random_state=0)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.2,
stratify=y_train, random_state=0)
# print('Train:', len(y_train), 'Dev:', len(y_dev), 'Test:', len(y_test))
# set up graph function
def plot_precision_recall_curve(y_true, y_pred_scores):
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_scores)
    return ggplot(aes(x='recall', y='precision'),
                  data=pd.DataFrame({"precision": precision, "recall": recall})) + geom_line()

# XGBClassifier
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        else:
            return x[self.column_list].to_dict(orient='records')

count_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=10)
dict_vectorizer = DictVectorizer()
xgb = XGBClassifier(seed=0)

pipeline = Pipeline([
    ("feature_union", FeatureUnion([
        ('text_features', Pipeline([
            ('selector', ColumnSelector(['uri'])),
            ('count_vectorizer', count_vectorizer)
        ])),
        ('categorical_features', Pipeline([
            ('selector', ColumnSelector(['code','r_size','DT_sec','Method','http_version','PenTool','has_referer', 'Lang','LangProb','GibberFlag'])),
            ('dict_vectorizer', dict_vectorizer)
        ]))
    ])),
    ('xgb', xgb)
])
pipeline.fit(x_train, y_train)
filename = 'finalized_model.sav'
joblib.dump(pipeline, filename)
That's due to different dtypes in the big and small files.
When you do:
df = pd.read_csv('big_test.csv')
The dtypes are these:
print(df.dtypes)
# Output
uri object
code object # <== Observe this
r_size object # <== Observe this
Scan int64
...
...
...
Now when you do:
df3 = pd.read_csv('small_test.csv')
the dtypes are changed:
print(df3.dtypes)
# Output
uri object
code int64 # <== Now this has changed
r_size int64 # <== Now this has changed
Scan int64
...
...
You see, pandas tries to determine the dtypes of the columns by itself. When you load big_test.csv, some values in the code and r_size columns are of string type, so the whole column dtype is changed to object (string); this does not happen in small_test.csv.
Due to this change, the DictVectorizer encodes the data in a different way than before, the features change, and hence the results also change.
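To illustrate with a small standalone sketch (toy values of my own, not the question's data): DictVectorizer one-hot encodes string values but passes numeric values through unchanged, so the same column produces a completely different feature matrix depending on its dtype.
from sklearn.feature_extraction import DictVectorizer

# String values: each distinct value becomes its own one-hot column
dv_str = DictVectorizer(sparse=False)
print(dv_str.fit_transform([{'code': '200'}, {'code': '404'}]))  # [[1. 0.] [0. 1.]]
print(dv_str.feature_names_)                                     # ['code=200', 'code=404']

# Numeric values: a single numeric column is passed through as-is
dv_num = DictVectorizer(sparse=False)
print(dv_num.fit_transform([{'code': 200}, {'code': 404}]))      # [[200.] [404.]]
print(dv_num.feature_names_)                                     # ['code']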
If you do this:
df3[['code', 'r_size']] = df3[['code', 'r_size']].astype(str)
and then call predict(), the results are the same again.
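Alternatively (my own suggestion, not part of the original answer), you can force the dtypes while reading the small file so they match what the pipeline saw during training:
import pandas as pd

# read the small file with the same string dtypes as the big training file
# (column names follow the question; adjust to your data)
df3 = pd.read_csv('small_test.csv', dtype={'code': str, 'r_size': str})
result3 = loaded_model.predict(df3)
print(result3)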

Tensorflow - Read and Save TFRecords to Dict and Use Multiprocessing

I am trying to speed up the conversion of selected tfrecords to a series of Python dictionaries. Here's what I have. Initially the CPU utilization spikes, but then it goes to almost zero, suggesting my code is not working correctly.
My goal is to have 3 dictionaries saved and pickled. There are 14,000+ tfrecord files (approximately 2 GB). At the current rate, it will take about 84 hours to run in a single process.
Are there any problems with my use of managed dicts?
import glob
import tensorflow as tf
import cPickle
import numpy as np
from tqdm import tqdm
import collections
from multiprocessing import Process, Manager, Pool
def get_multihot_encoding(example_label):
    enc = np.zeros(10)
    for label in example_label:
        if label in lookup.values():
            index = lookup_inverted[label]
            enc[index] = 1
    return list(enc)
# Set-up MultiProcessing
manager = Manager()
audio_embeddings_dict = manager.dict()
audio_labels_dict = manager.dict()
audio_multihot_dict = manager.dict()
sess = tf.Session()
# The iterable which gets passed to the function
all_tfrecord_filenames = glob.glob('/Users/jeff/features/audioset_v1_embeddings/unbal_train/*.tfrecord')
def process_tfrecord(tfrecord):
    for idx, example in enumerate(tf.python_io.tf_record_iterator(tfrecord)):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8')
        example_label = list(np.asarray(tf_example.features.feature['labels'].int64_list.value))
        # Non zero intersect of 2 sets is True - only create dict entries if this is true!
        if set(example_label) & label_filters:
            print(set(example_label) & label_filters, " Is the intersection of the two")
            tf_seq_example = tf.train.SequenceExample.FromString(example)
            n_frames = len(tf_seq_example.feature_lists.feature_list['audio_embedding'].feature)
            audio_frame = []
            for i in range(n_frames):
                audio_frame.append(tf.cast(tf.decode_raw(
                    tf_seq_example.feature_lists.feature_list['audio_embedding'].feature[i].bytes_list.value[0], tf.uint8),
                    tf.float32).eval(session=sess))
            audio_embeddings_dict[vid_id] = audio_frame
            audio_labels_dict[vid_id] = example_label
            audio_multihot_dict[vid_id] = get_multihot_encoding(example_label)
            # print(get_multihot_encoding(example_label), "Is the encoded label")
            if idx % 100 == 0:
                print("Saving dictionary at loop: {}".format(idx))
                cPickle.dump(audio_embeddings_dict, open('audio_embeddings_dict_unbal_train_multi_{}.pkl'.format(idx), 'wb'))
                cPickle.dump(audio_multihot_dict, open('audio_multihot_dict_bal_untrain_multi_{}.pkl'.format(idx), 'wb'))
                cPickle.dump(audio_multihot_dict, open('audio_labels_unbal_dict_multi_{}.pkl'.format(idx), 'wb'))
pool = Pool(50)
result = pool.map(process_tfrecord, all_tfrecord_filenames)

How to fix the python multiprocessing matplotlib savefig() issue?

I want to speed up matplotlib savefig() for many figures with the multiprocessing module, and I am trying to benchmark the performance of parallel versus sequential execution.
Below is the code:
# -*- coding: utf-8 -*-
"""
Compare the time of matplotlib savefig() in parallel and sequence
"""
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import time
def gen_fig_list(n):
    ''' generate a list to contain n demo scatter figure objects '''
    plt.ioff()
    fig_list = []
    for i in range(n):
        plt.figure()
        dt = np.random.randn(5, 4)
        fig = plt.scatter(dt[:,0], dt[:,1], s=abs(dt[:,2]*1000), c=abs(dt[:,3]*100)).get_figure()
        fig.FM_figname = "img"+str(i)
        fig_list.append(fig)
    plt.ion()
    return fig_list

def savefig_worker(fig, img_type, folder):
    file_name = folder+"\\"+fig.FM_figname+"."+img_type
    fig.savefig(file_name, format=img_type, dpi=fig.dpi)
    return file_name

def parallel_savefig(fig_list, folder):
    proclist = []
    for fig in fig_list:
        print fig.FM_figname,
        p = multiprocessing.Process(target=savefig_worker, args=(fig, 'png', folder)) # cause error
        proclist.append(p)
        p.start()
    for i in proclist:
        i.join()

if __name__ == '__main__':
    folder_1, folder_2 = 'Z:\\A1', 'Z:\\A2'
    fig_list = gen_fig_list(10)
    t1 = time.time()
    parallel_savefig(fig_list, folder_1)
    t2 = time.time()
    print '\nMulprocessing time : %0.3f'%((t2-t1))
    t3 = time.time()
    for fig in fig_list:
        savefig_worker(fig, 'png', folder_2)
    t4 = time.time()
    print 'Non_Mulprocessing time: %0.3f'%((t4-t3))
And I get the error "This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.", caused by p = multiprocessing.Process(target=savefig_worker, args=(fig, 'png', folder)).
Why? And how can I solve it?
(Windows XP + Python: 2.6.1 + Numpy: 1.6.2 + Matplotlib: 1.2.0)
EDIT (error message added, on Python 2.7.3):
When run in IDLE on Python 2.7.3, it gives the error message below:
>>>
img0
Traceback (most recent call last):
File "C:\Documents and Settings\Administrator\desktop\mulsavefig_pilot.py", line 61, in <module>
proc.start()
File "d:\Python27\lib\multiprocessing\process.py", line 130, in start
File "d:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "d:\Python27\lib\pickle.py", line 748, in save_global
(obj, module, name))
PicklingError: Can't pickle <function notify_axes_change at 0x029F5030>: it's not found as matplotlib.backends.backend_qt4.notify_axes_change
EDIT: (My solution demo)
inspired by Matplotlib: simultaneous plotting in multiple threads
# -*- coding: utf-8 -*-
"""
Compare the time of matplotlib savefig() in parallel and sequence
"""
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import time
def gen_data(fig_qty, bubble_qty):
    ''' generate data for fig drawing '''
    dt = np.random.randn(fig_qty, bubble_qty, 4)
    return dt

def parallel_savefig(draw_data, folder):
    ''' prepare data and pass to worker '''
    pool = multiprocessing.Pool()
    fig_qty = len(draw_data)
    fig_para = zip(range(fig_qty), draw_data, [folder]*fig_qty)
    pool.map(fig_draw_save_worker, fig_para)
    return None

def fig_draw_save_worker(args):
    seq, dt, folder = args
    plt.figure()
    fig = plt.scatter(dt[:,0], dt[:,1], s=abs(dt[:,2]*1000), c=abs(dt[:,3]*100), alpha=0.7).get_figure()
    plt.title('Plot of a scatter of %i' % seq)
    fig.savefig(folder+"\\"+'fig_%02i.png' % seq)
    plt.close()
    return None

if __name__ == '__main__':
    folder_1, folder_2 = 'A1', 'A2'
    fig_qty, bubble_qty = 500, 100
    draw_data = gen_data(fig_qty, bubble_qty)
    print 'Mulprocessing ... ',
    t1 = time.time()
    parallel_savefig(draw_data, folder_1)
    t2 = time.time()
    print 'Time : %0.3f'%((t2-t1))
    print 'Non_Mulprocessing .. ',
    t3 = time.time()
    for para in zip(range(fig_qty), draw_data, [folder_2]*fig_qty):
        fig_draw_save_worker(para)
    t4 = time.time()
    print 'Time : %0.3f'%((t4-t3))
    print 'Speed Up: %0.1fx'%(((t4-t3)/(t2-t1)))
You can try to move all of the matplotlib code (including the import) into a function.
Make sure you don't have an import matplotlib or import matplotlib.pyplot as plt at the top of your code.
Create a function that does all of the matplotlib work, including the import.
Example:
import numpy as np
from multiprocessing import Pool

def graphing_function(graph_data):
    import matplotlib.pyplot as plt
    plt.figure()
    plt.hist(graph_data.data)
    plt.savefig(graph_data.filename)
    plt.close()
    return

# data_list: an iterable of objects with .data and .filename attributes (not defined in this answer)
pool = Pool(4)
pool.map(graphing_function, data_list)
It is not really a bug, per se, more of a limitation.
The explanation is in the last line of your error message:
PicklingError: Can't pickle <function notify_axes_change at 0x029F5030>: it's not found as matplotlib.backends.backend_qt4.notify_axes_change
It is telling you that elements of the figure objects cannot be pickled, which is how multiprocessing passes data between processes. The objects are pickled in the main process, shipped as pickles, and then re-constructed on the other side. Even if you fixed this exact issue (maybe by using a different backend, or stripping off the offending function, which might break things in other ways), I am pretty sure there are core parts of the Figure, Axes, or Canvas objects that cannot be pickled.
As @bigbug points out, an example of how to get around this limitation is Matplotlib: simultaneous plotting in multiple threads. The basic idea is that you push your entire plotting routine off to the sub-process, so you only push numpy arrays and maybe some configuration information across the process boundary.