For debugging, I would like to print a tensorflow FlatMapDataset.
When trying to use the print method of a tf.data.Dataset, I get the error: AttributeError: 'FlatMapDataset' object has no attribute 'print'.
What I expected was some kind of print-out to assess whether the content of the dataset is what I expected.
Apparently a FlatMapDataset does not have that method. Here is what dir() lists for the dataset:
['_GeneratorState', '__abstractmethods__', '__bool__', '__class__', '__class_getitem__', '__debug_string__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__tf_tracing_type__', '__weakref__', '_abc_impl', '_add_trackable_child', '_add_variable_with_custom_getter', '_apply_debug_options', '_as_serialized_graph', '_checkpoint_dependencies', '_common_args', '_consumers', '_convert_variables_to_tensors', '_deferred_dependencies', '_deserialization_dependencies', '_deserialize_from_proto', '_export_to_saved_model_graph', '_flat_shapes', '_flat_structure', '_flat_types', '_functions', '_gather_saveables_for_checkpoint', '_graph', '_graph_attr', '_handle_deferred_dependencies', '_input_dataset', '_inputs', '_lookup_dependency', '_map_func', '_map_resources', '_maybe_initialize_trackable', '_maybe_track_assets', '_metadata', '_name', '_name_based_attribute_restore', '_name_based_restores', '_no_dependency', '_object_identifier', '_options', '_options_attr', '_options_tensor_to_options', '_preload_simple_restoration', '_restore_from_tensors', '_serialize_to_proto', '_serialize_to_tensors', '_setattr_tracking', '_shape_invariant_to_type_spec', '_structure', '_tf_api_names', '_tf_api_names_v1', '_trace_variant_creation', '_track_trackable', '_trackable_children', '_transformation_name', '_type_spec', '_unconditional_checkpoint_dependencies', '_unconditional_dependency_names', '_update_uid', '_variant_tensor', '_variant_tensor_attr', 'apply', 'as_numpy_iterator', 'batch', 'bucket_by_sequence_length', 'cache', 'cardinality', 'choose_from_datasets', 'concatenate', 'element_spec', 'enumerate', 'filter', 'flat_map', 'from_generator', 'from_tensor_slices', 'from_tensors', 'get_single_element', 'group_by_window', 'interleave', 'list_files', 'load', 'map', 'options', 'padded_batch', 'prefetch', 'random', 'range', 'reduce', 'rejection_resample', 'repeat', 'sample_from_datasets', 'save', 'scan', 'shard', 'shuffle', 'skip', 'snapshot', 'take', 'take_while', 'unbatch', 'unique', 'window', 'with_options', 'zip']
How can I print a FlatMapDataset in some convenient way to review its content?
You can print the FlatMapDataset by iterating over the dataset.
For example:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([[[1, 2, 3], [3, 4, 5]]])
dataset = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x**2))

for i in dataset:
    print(i)
Output of the above code:
tf.Tensor([1 4 9], shape=(3,), dtype=int32)
tf.Tensor([ 9 16 25], shape=(3,), dtype=int32)
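Alternatively, since as_numpy_iterator appears in the attribute listing above, you can use it to print plain NumPy values instead of tf.Tensor objects; a minimal sketch, assuming TensorFlow 2.x:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([[[1, 2, 3], [3, 4, 5]]])
dataset = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x**2))

# prints plain NumPy arrays: [1 4 9] then [ 9 16 25]
for x in dataset.as_numpy_iterator():
    print(x)

# or materialize a small dataset all at once for inspection
print(list(dataset.as_numpy_iterator()))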
Thank You.
I have to read a CSV that contains full-text data, which can be multiline. I can read this CSV with pure pandas (tested on versions 0.25.3 and 1.0.3) without any problems, but when I try to read it with dask I receive ParserError: Error tokenizing data. C error: EOF inside string starting at row 28. The row number depends on the file I try to read.
I prepared an artificial dataframe to reproduce this error. Can I tune something in the dask parameters, can I preprocess the input file, or is this a dask implementation issue?
multiplication_factor = 71  # 70 works fine, 71 fails
number_of_columns = 100
import pandas as pd
import dask.dataframe as dd
import textwrap
pandas_default_kwargs = {
    'cache_dates': True,
    # 'chunksize': None,  # not supported by dask
    'comment': None,
    # 'compression': 'infer',  # not supported by dask
    'converters': None,
    'date_parser': None,
    'dayfirst': False,
    'decimal': b'.',
    'delim_whitespace': False,
    'delimiter': None,
    'dialect': None,
    'doublequote': True,
    'dtype': object,
    'encoding': None,
    'engine': None,
    'error_bad_lines': True,
    'escapechar': None,
    'false_values': None,
    'float_precision': None,
    'header': 'infer',
    # 'index_col': None,  # not supported by dask
    'infer_datetime_format': False,
    # 'iterator': False,  # not supported by dask
    'keep_date_col': False,
    'keep_default_na': True,
    'lineterminator': None,
    'low_memory': True,
    'mangle_dupe_cols': True,
    'memory_map': False,
    'na_filter': True,
    'na_values': None,
    'names': None,
    'nrows': None,
    'parse_dates': False,
    'prefix': None,
    'quotechar': '"',
    'quoting': 0,
    'sep': ',',
    'skip_blank_lines': True,
    'skipfooter': 0,
    'skipinitialspace': False,
    'skiprows': None,
    'squeeze': False,
    'thousands': None,
    'true_values': None,
    'usecols': None,
    'verbose': False,
    'warn_bad_lines': True,
}
artificial_df_1_row = pd.DataFrame(
    data=[
        (
            textwrap.dedent(
                f"""
                some_data_for
                column_number_{i}
                """
            )
            for i
            in range(number_of_columns)
        )
    ],
    columns=[f'column_name_number_{i}' for i in range(number_of_columns)]
)
path_to_single_line_csv = './single_line.csv'
path_to_multi_line_csv = './multi_line.csv'
# prepare data to save
single_line_df = artificial_df_1_row
multi_line_df = pd.concat(
    [single_line_df] * multiplication_factor,
)
# save data
single_line_df.to_csv(path_to_single_line_csv, index=False)
multi_line_df.to_csv(path_to_multi_line_csv, index=False)
# read 1 row csv by dask - works
dask_single_line_df = dd.read_csv(
    path_to_single_line_csv,
    blocksize=None,  # read as single block
    **pandas_default_kwargs
)
dask_single_line_df_count = dask_single_line_df.shape[0].compute()
print('[DASK] single line count', dask_single_line_df_count)
# read multiline csv by pandas - works
pandas_multi_line_df = pd.read_csv(
    path_to_multi_line_csv,
    **pandas_default_kwargs
)
pandas_multi_line_df_shape_0 = pandas_multi_line_df.shape[0]
print('[PANDAS] multi line count', pandas_multi_line_df_shape_0)
# read multiline csv by dask - depends on the number of rows, fails or not
dask_multi_line_df = dd.read_csv(
    path_to_multi_line_csv,
    blocksize=None,  # read as single block
    **pandas_default_kwargs
)
dask_multi_line_df_shape_0 = dask_multi_line_df.shape[0].compute()
print('[DASK] multi line count', dask_multi_line_df_shape_0)
The only way you can read such a file is to ensure that the chunk boundaries are not within a quoted string which, unless you know a lot about the data layout, means not chunking a file at all (but you can still parallelise between files).
This is because the only way to know whether you are inside a quoted string is to parse the file from the start, and dask achieves parallelism by making each chunk-reading task completely independent, needing only a file offset. In practice, dask reads from the offset and treats the first newline marker as the point to start parsing from.
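A minimal sketch of that workaround, assuming the data can be split across several files (the glob pattern below is hypothetical): keep blocksize=None so each file becomes a single partition with no internal chunk boundaries, and let dask parallelise across the files instead.

import dask.dataframe as dd

# each matching file is read as one partition, so a chunk boundary
# can never fall inside a quoted multiline field
df = dd.read_csv('./multi_line_part_*.csv', blocksize=None, dtype=object)
print(df.shape[0].compute())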
I'm stuck on one line of code and have been stalled on a project all weekend as a result.
I am working on a project that uses BERT for sentence classification. I have successfully trained the model, and I can test the results using the example code from run_classifier.py.
I can export the model using this example code (which has been reposted repeatedly, so I believe that it's right for this model):
def export(self):
    def serving_input_fn():
        label_ids = tf.placeholder(tf.int32, [None], name='label_ids')
        input_ids = tf.placeholder(tf.int32, [None, self.max_seq_length], name='input_ids')
        input_mask = tf.placeholder(tf.int32, [None, self.max_seq_length], name='input_mask')
        segment_ids = tf.placeholder(tf.int32, [None, self.max_seq_length], name='segment_ids')
        input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
            'label_ids': label_ids, 'input_ids': input_ids,
            'input_mask': input_mask, 'segment_ids': segment_ids})()
        return input_fn

    self.estimator._export_to_tpu = False
    self.estimator.export_savedmodel(self.output_dir, serving_input_fn)
I can also load the exported estimator (where the export function saves the exported model into a subdirectory labeled with a timestamp):
predict_fn = predictor.from_saved_model(self.output_dir + timestamp_number)
However, for the life of me, I cannot figure out what to provide to predict_fn as input for inference. Here is my best code at the moment:
def predict(self):
    input = 'Test input'
    guid = 'predict-0'
    text_a = tokenization.convert_to_unicode(input)
    label = self.label_list[0]
    examples = [InputExample(guid=guid, text_a=text_a, text_b=None, label=label)]
    features = convert_examples_to_features(examples, self.label_list,
                                            self.max_seq_length, self.tokenizer)
    predict_input_fn = input_fn_builder(features, self.max_seq_length, False)
    predict_fn = predictor.from_saved_model(self.output_dir + timestamp_number)
    result = predict_fn(predict_input_fn)  # this generates an error
    print(result)
It doesn't seem to matter what I provide to predict_fn: the examples array, the features array, the predict_input_fn function. Clearly, predict_fn wants a dictionary of some type - but every single thing that I've tried generates an exception due to a tensor mismatch or other errors that generally mean: bad input.
I presumed that the from_saved_model function wants the same sort of input as the model test function - apparently, that's not the case.
It seems that lots of people have asked this very question - "how do I use an exported BERT TensorFlow model for inference?" - and have gotten no answers:
Thread #1
Thread #2
Thread #3
Thread #4
Any help? Thanks in advance.
Thank you for this post. Your serving_input_fn was the piece I was missing! Your predict function needs to be changed to feed the features dict directly, rather than using predict_input_fn:
def predict(sentences):
    labels = [0, 1]
    input_examples = [
        run_classifier.InputExample(
            guid="",      # "" is just a dummy guid
            text_a=x,
            text_b=None,
            label=0       # 0 is just a dummy label
        ) for x in sentences]
    input_features = run_classifier.convert_examples_to_features(
        input_examples, labels, MAX_SEQ_LEN, tokenizer
    )

    # this is where pred_input_fn is replaced
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    all_label_ids = []

    for feature in input_features:
        all_input_ids.append(feature.input_ids)
        all_input_mask.append(feature.input_mask)
        all_segment_ids.append(feature.segment_ids)
        all_label_ids.append(feature.label_id)

    pred_dict = {
        'input_ids': all_input_ids,
        'input_mask': all_input_mask,
        'segment_ids': all_segment_ids,
        'label_ids': all_label_ids
    }

    predict_fn = predictor.from_saved_model('../testing/1589418540')
    result = predict_fn(pred_dict)
    print(result)
pred_sentences = [
    "That movie was absolutely awful",
    "The acting was a bit lacking",
    "The film was creative and surprising",
    "Absolutely fantastic!",
]
predict(pred_sentences)
{'probabilities': array([[-0.3579178 , -1.2010787 ],
[-0.36648935, -1.1814401 ],
[-0.30407643, -1.3386648 ],
[-0.45970002, -0.9982413 ],
[-0.36113673, -1.1936386 ],
[-0.36672896, -1.1808994 ]], dtype=float32), 'labels': array([0, 0, 0, 0, 0, 0])}
However, the probabilities returned for sentences in pred_sentences do not match the probabilities I get using estimator.predict(predict_input_fn), where estimator is the fine-tuned model being used within the same (Python) session. For example, [-0.27276006, -1.4324446] using estimator vs. [-0.26713806, -1.4505868] using predictor.
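A side note on reading the numbers above: the 'probabilities' array appears to hold log-probabilities (the exponentials of each row sum to one), so exponentiating may make such comparisons easier to eyeball; a small sketch:

import numpy as np

log_probs = np.array([[-0.3579178, -1.2010787]])
print(np.exp(log_probs))  # ~[[0.699, 0.301]], i.e. the row sums to ~1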
Values Tensor: [[1,2,3,4,5], [6,7,8,9,10], [11,12,13,14,15]]
Query Tensor: [[1,2,8], [0,0,6], [11,12,13]]
Result tensor: [[True, True, False], [False, False, True], [True, True, True]]
Given a values tensor and a query tensor, I want to check, element by element, whether each query value exists in the corresponding row of the values tensor, and return the result tensor above.
Is there a vectorized way to do this (rather than using tf.while_loop)?
Update: tf.sets.set_intersection may be useful, as in the following.
import tensorflow as tf

a = tf.constant([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
b = tf.constant([[1, 2, 8], [0, 0, 6], [11, 12, 13]])

res = tf.sets.set_intersection(a, b)
res2 = tf.sparse_tensor_to_dense(res, default_value=-1)

with tf.Session() as sess:
    print(sess.run(res2))
[[ 1  2 -1]
 [ 6 -1 -1]
 [11 12 13]]
You can achieve this by subtracting every element of b from every element of a, and then finding the indices of the zeros (note the abs belongs on the difference, not on b):

find_match = tf.reduce_prod(tf.abs(tf.transpose(a)[..., None] - b[None, ...]), 0)
find_idx = tf.equal(find_match, tf.zeros_like(find_match))

with tf.Session() as sess:
    print(sess.run(find_idx))

# [[ True  True False]
#  [False False  True]
#  [ True  True  True]]
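An alternative sketch that avoids multiplying many differences together (which can overflow for large values): broadcast an equality comparison and reduce with tf.reduce_any, in the same TF 1.x session style as above.

import tensorflow as tf

a = tf.constant([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
b = tf.constant([[1, 2, 8], [0, 0, 6], [11, 12, 13]])

# compare each query element against every value in its row,
# then check whether any position matched
mask = tf.reduce_any(tf.equal(b[:, :, None], a[:, None, :]), axis=-1)

with tf.Session() as sess:
    print(sess.run(mask))
# [[ True  True False]
#  [False False  True]
#  [ True  True  True]]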
I am using the commits from breznak for the encoders (I wasn't able to figure out "git checkout ..." with GitHub, so I just carefully copied over the three files: base.py, multi.py, and multi_test.py).
I ran multi_test.py without any problems.
Then I adjusted my model parameters (MODEL_PARAMS), so that the encoders portion of 'sensorParams' looks like this:
'encoders': {
    'frequency': {
        'fieldname': u'frequency',
        'type': 'SimpleVector',
        'length': 5,
        'minVal': 0,
        'maxVal': 210
    }
},
I also adjusted the modelInput portion of my code, so it looked like this:
model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({'predictedField': 'frequency'})
y = [1,2,3,4,5]
modelInput = {"frequency": y}
result = model.run(modelInput)
But I get the following error, regardless of whether I instantiate 'y' as a list or a numpy.ndarray:
File "nta/eng/lib/python2.7/site-packages/nupic/encoders/base.py", line 183, in _getInputValue
    return getattr(obj, fieldname)
AttributeError: 'list' object has no attribute 'idx0'
I also tried initializing a SimpleVector encoder inline with my modelInput, directly encoding my array and then passing it through modelInput. That violated the input parameters of my SimpleVector, because I was now double-encoding. So I removed the encoders portion of my model parameters dictionary, which caused another error, because some part of my model was looking for that portion of the dictionary.
Any suggestions on what I should do next?
Edit: Here are the files I'm using with the OPF.
sendAnArray.py
import numpy
from nupic.frameworks.opf.modelfactory import ModelFactory
import model_params
class sendAnArray():

    def __init__(self):
        self.model = ModelFactory.create(model_params.MODEL_PARAMS)
        self.model.enableInference({'predictedField': 'frequency'})
        for i in range(100):
            self.run()

    def run(self):
        y = [1, 2, 3, 4, 5]
        modelInput = {"frequency": y}
        result = self.model.run(modelInput)
        anomalyScore = result.inferences['anomalyScore']
        print y, anomalyScore

sAA = sendAnArray()
model_params.py
MODEL_PARAMS = {
    'model': "CLA",
    'version': 1,
    'predictAheadTime': None,
    'modelParams': {
        'inferenceType': 'TemporalAnomaly',
        'sensorParams': {
            'verbosity': 0,
            'encoders': {
                'frequency': {
                    'fieldname': u'frequency',
                    'type': 'SimpleVector',
                    'length': 5,
                    'minVal': 0,
                    'maxVal': 210
                }
            },
            'sensorAutoReset': None,
        },
        'spEnable': True,
        'spParams': {
            'spVerbosity': 0,
            'globalInhibition': 1,
            'columnCount': 2048,
            'inputWidth': 5,
            'numActivePerInhArea': 60,
            'seed': 1956,
            'coincInputPoolPct': 0.5,
            'synPermConnected': 0.1,
            'synPermActiveInc': 0.1,
            'synPermInactiveDec': 0.01,
        },
        'tpEnable': True,
        'tpParams': {
            'verbosity': 0,
            'columnCount': 2048,
            'cellsPerColumn': 32,
            'inputWidth': 2048,
            'seed': 1960,
            'temporalImp': 'cpp',
            'newSynapseCount': 20,
            'maxSynapsesPerSegment': 32,
            'maxSegmentsPerCell': 128,
            'initialPerm': 0.21,
            'permanenceInc': 0.1,
            'permanenceDec': 0.1,
            'globalDecay': 0.0,
            'maxAge': 0,
            'minThreshold': 12,
            'activationThreshold': 16,
            'outputType': 'normal',
            'pamLength': 1,
        },
        'clParams': {
            'regionName': 'CLAClassifierRegion',
            'clVerbosity': 0,
            'alpha': 0.0001,
            'steps': '5',
        },
        'anomalyParams': {
            u'anomalyCacheRecords': None,
            u'autoDetectThreshold': None,
            u'autoDetectWaitRecords': 2184
        },
        'trainSPNetOnlyIfRequested': False,
    },
}
The problem seems to be that the SimpleVector class accepts an array instead of a dict as its input, and then reconstructs it internally as {'list': {'idx0': 1, 'idx1': 2, ...}} (i.e., as if this dict had been the input). This is fine if it is done consistently, but your error shows that it breaks down somewhere. Have a word with breznak about this.
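If you need an immediate stopgap, here is one untested sketch based on that traceback (getattr(obj, fieldname) looking for idx0): wrap the list in an object that exposes each element as an idxN attribute. The VectorProxy name is hypothetical, not part of nupic:

class VectorProxy(object):
    """Hypothetical workaround: expose list elements as idx0..idxN
    attributes so getattr(obj, fieldname) in _getInputValue succeeds."""
    def __init__(self, values):
        for i, v in enumerate(values):
            setattr(self, 'idx%d' % i, v)

modelInput = {"frequency": VectorProxy([1, 2, 3, 4, 5])}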
Working through the OPF was difficult. I wanted to input an array of indices into the temporal pooler, so I opted to interface directly with the algorithms (I relied heavily on hello_tp.py). I ignored SimpleVector altogether and instead worked through the BitmapArray encoder; a rough sketch of that approach follows.
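For reference, a rough sketch of the direct-algorithm approach, adapted from hello_tp.py; the parameter names follow the old nupic TP API and both they and the values shown are illustrative, so check them against your nupic version:

import numpy
from nupic.research.TP import TP

# a small temporal pooler fed directly with a binary vector,
# bypassing the OPF and its encoders entirely
tp = TP(numberOfCols=50, cellsPerColumn=2,
        initialPerm=0.5, connectedPerm=0.5,
        minThreshold=10, newSynapseCount=10,
        permanenceInc=0.1, permanenceDec=0.0,
        activationThreshold=8)

x = numpy.zeros(50, dtype='uint32')
x[[1, 2, 3, 4, 5]] = 1  # indices set by a BitmapArray-style encoding
tp.compute(x, enableLearn=True, computeInfOutput=False)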
Subutai has a useful email on the nupic-discuss listserv, where he breaks down the three main areas of the NuPIC API: algorithms, networks/regions, and the OPF. That helped me understand my options better.