issue with tf.data.experimental.rejection_resample? - tensorflow

I'm working on a multi-label image classification model, and the dataset is imbalanced. I'm trying to balance the data with tf.data.experimental.rejection_resample, but I keep getting the error below:
"ValueError:Shape must be rank 3 but is rank 2 for '{{node Tile}} = Tile[T=DT_FLOAT, Tmultiples=DT_INT32](ExpandDims, Tile/multiples)' with input shapes: [1,9,9], [2]."
Here are the relevant blocks of code:
1) A function that gets the label from a dataset element:
def class_func(features, label):
    return label
2) Create the resampler (target_dist is set according to the number of labels, which is 9):
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.115, 0.109, 0.115, 0.109, 0.115, 0.109, 0.115, 0.109, 0.104])
3) Run the resampler with .apply (the dataset has been unbatched as per the TensorFlow recommendation; see the reference link):
resample_train_dataset = train_dataset.unbatch().apply(resampler).batch(10)
Note: I tried changing the tensor's dims with tf.expand_dims, but it didn't work. Any idea how to resolve this issue?

In addition to resampling, rejection_resample also changes the structure of your dataset: each element becomes a pair whose first component is the class label used for resampling and whose second component is your original item.
So after ds.apply(resampler), if you want your data back in the original format, you can do:
ds.map(lambda extra_label, feat_and_label: feat_and_label)
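To make the (class, original item) structure concrete, here is a minimal plain-Python sketch of rejection resampling. It is only an analogy, not the TensorFlow implementation: the helper name rejection_resample, the toy dataset, and the way the initial distribution is estimated (by counting up front) are all assumptions made for illustration.

```python
import random
from collections import Counter

def class_func(features, label):
    return label

def rejection_resample(elements, class_func, target_dist, seed=0):
    """Plain-Python stand-in (hypothetical helper, not the TensorFlow op)."""
    rng = random.Random(seed)
    # Assumption: estimate the initial class distribution by counting up front.
    counts = Counter(class_func(*e) for e in elements)
    total = sum(counts.values())
    # How strongly each class must be kept relative to the others.
    ratios = {c: target_dist[c] / (counts[c] / total) for c in counts}
    max_ratio = max(ratios.values())
    for e in elements:
        c = class_func(*e)
        # Accept the element with a probability proportional to its class ratio.
        if rng.random() < ratios[c] / max_ratio:
            yield c, e  # note the shape: (class, original element)

# Toy imbalanced dataset: 90 items of class 0, 10 items of class 1.
dataset = [("img%d" % i, 0) for i in range(90)] + [("img%d" % i, 1) for i in range(90, 100)]

resampled = list(rejection_resample(dataset, class_func, target_dist=[0.5, 0.5]))
# Equivalent of ds.map(lambda extra_label, feat_and_label: feat_and_label)
restored = [feat_and_label for extra_label, feat_and_label in resampled]
```

After the final step, restored again holds plain (features, label) tuples, which is the format a later .batch() call expects.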

Related

Testing on some basic example in trying to better understand about .padded_batch in TensorFlow

I have a very simple text file for testing my understanding of padded_batch.
The file is saved in .txt format:
test = "I use tensorflow for this data\n
I will be testing\n
The current tensorflow data"
Please note that I am using TensorFlow version 2.0, so I do not need tf.Session to initialize my variables.
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
dataset = dataset.padded_batch(2)
for x in dataset:
    print(x.numpy())
Error that I received:
TypeError: padded_batch() missing 1 required positional argument: 'padded_shapes'
Expected output:
[[b'I' b'use' b'tensorflow' b'for' b'this' b'data']
[b'I' b'will' b'be' b'testing' b'unknown' b'unknown']]
[[b'The' b'current' b'tensorflow' b'data' b'unknown' b'unknown']]
How should I configure padded_shapes and padding_values? I want every tensor to have the same length, with "unknown" inserted for each empty element, as shown in the expected output above.
Note that Dataset.padded_batch needs the shape of your inputs (padded_shapes), and since you want the empty slots filled with "unknown", that is the padding value (padding_values) to use. Below is the code snippet you want to use.
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
dataset = dataset.padded_batch(3, padded_shapes=[None], padding_values="unknown")
for x in dataset:
    print(x.numpy())
# [[b'I' b'use' b'tensorflow' b'for' b'this' b'data']
# [b'I' b'will' b'be' b'testing' b'unknown' b'unknown']
# [b'The' b'current' b'tensorflow' b'data' b'unknown' b'unknown']]
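The padding rule can be sketched in plain Python. With padded_shapes=[None], each batch is padded to the length of its own longest element, which is why the snippet above uses a batch size of 3 to get every row to the same width. The helper padded_batch below is a hypothetical stand-in written for illustration, not the TensorFlow method.

```python
def padded_batch(rows, batch_size, padding_value):
    """Group rows into batches and pad each batch to its own longest row."""
    batches = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        width = max(len(r) for r in batch)
        batches.append([r + [padding_value] * (width - len(r)) for r in batch])
    return batches

lines = [
    "I use tensorflow for this data",
    "I will be testing",
    "The current tensorflow data",
]
tokens = [line.split() for line in lines]

batches_of_3 = padded_batch(tokens, 3, "unknown")  # one batch, all rows 6 wide
batches_of_2 = padded_batch(tokens, 2, "unknown")  # second batch is only 4 wide
```

With a batch size of 2, the last batch contains a single 4-token row and receives no padding, so it would not match the 6-wide rows in the expected output from the question.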

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug, and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data
dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size: map is adding another axis instead of concatenating on the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map have return type Dataset. So why does my code work with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with the Pandas module.
As mikkola points out in the comments, the Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    # Note: as_matrix() was removed in pandas 1.0; use df_input.to_numpy() there.
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)
dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
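The difference between the two signatures can be illustrated with a plain-Python analogy built on itertools. This is not TensorFlow code: load_rows is a hypothetical stand-in for the CSV reader, and batch is a small helper assumed here to mimic Dataset.batch().

```python
from itertools import chain, islice

def load_rows(file_name, rows=3):
    # Hypothetical stand-in for reading a CSV: `rows` rows of 2 columns each.
    return [[f"{file_name}-r{i}-a", f"{file_name}-r{i}-b"] for i in range(rows)]

def batch(iterable, size):
    # Group consecutive elements into lists of at most `size` items.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

file_names = ["f0.csv", "f1.csv"]

# map: one element per file, so batching stacks whole files -> (B, N, 2).
mapped = list(batch(map(load_rows, file_names), 2))

# flat_map: each file becomes a sequence of rows that are chained together,
# so batching groups individual rows -> (batch_size, 2).
flat_mapped = list(batch(chain.from_iterable(map(load_rows, file_names)), 2))
```

In this analogy, getting an (N*B, 2) result would mean batching the flattened rows with a batch size of N*B rather than B.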

How can I reroute the training input pipeline to test pipeline in tensorflow using tf.contrib.graph_editor?

Suppose I have a training input pipeline which finally generates train_x and train_y using tf.train.shuffle_batch. I export the meta graph and re-import the graph in another code file. Now I want to detach the input pipeline, i.e., train_x and train_y, and connect a new test_x and test_y. How can I accomplish this using tf.contrib.graph_editor?
EDIT: As suggested by @iga, I changed my input directory using input_map
filenames = tf.train.match_filenames_once(FLAGS.data_dir + '*', name='matching_filenames')
if FLAGS.ckpt != '':
    latest = FLAGS.log_dir + FLAGS.ckpt
else:
    latest = tf.train.latest_checkpoint(FLAGS.log_dir)
if not latest or not os.path.exists(latest + '.meta'):
    print("checkpoint " + latest + " does not exist")
    sys.exit(1)
saver = tf.train.import_meta_graph(latest + '.meta',
                                   input_map={'matching_filenames:0': filenames},
                                   import_scope='import')
g = tf.get_default_graph()
but I get the following error:
ValueError: graph_def is invalid at node u'matching_filenames/Assign':
Input tensor 'matching_filenames:0' Cannot convert a tensor of type
string to an input of type string_ref.
Is there an elegant way to resolve this?
For this task, you should be able to just use the input_map argument of tf.import_graph_def (https://www.tensorflow.org/api_docs/python/tf/import_graph_def). If you are using import_meta_graph, you can pass input_map in its kwargs and it will get passed down to import_graph_def.
RESPONSE TO EDIT: I am assuming that your original graph (the one you are deserializing) had the same matching_filenames variable. Quite confusingly, the tensor name "matching_filenames:0" actually refers to the tensor going from the VariableV2 op to the Assign op. The type of this edge is string_ref and you don't really want to break that edge.
The output from a variable typically goes through an identity op called matching_filenames/read. This is what you want to use as the key in your input_map. For the value, you want the same tensor in your new filenames. So, your call should probably look like:
tf.train.import_meta_graph(latest+'.meta',
                           input_map={'matching_filenames/read': filenames.read_value()},
                           import_scope='import')
In general, variables are fairly complicated. If this does not work, you can use some placeholder op and feed the names into it manually.

tf.train.shuffle_batch() ValueError: Cannot infer Tensor's rank: Tensor("PyFunc:0", dtype=uint8)

I am trying to feed my image data from my TFRecord files into tf.train.shuffle_batch(). I have a load_img_file() function that reads the TFRecord files, does preprocessing, and returns the images and one-hot labels in the format [[array of images, np.uint8 format], [array of labels, np.uint8 format]]. I made the op
load_img_file_op = tf.py_func(self.load_img_file, [], [np.uint8, np.uint8])
which converts that function into an op. I have verified that that op works by doing
data = tf.Session().run(load_img_file_op)
for n in range(50): #go through images
    print data[1][n] #print one-hot label
    self.image_set.display_img(data[0][n]) #display image
which successfully prints the one-hot labels and displays the corresponding images.
However, when I try to do something like
self.batch = tf.train.shuffle_batch(load_img_file_op, batch_size=self.batch_size, capacity=q_capacity, min_after_dequeue=10000)
I get the error
raise ValueError("Cannot infer Tensor's rank: %s" % tl[i])
ValueError: Cannot infer Tensor's rank: Tensor("PyFunc:0", dtype=uint8)
I have tried many variations to try to match what the guide does:
Instead of self.batch =, I have tried example_batch, label_batch = (trying to get two values instead of one)
setting enqueue_many to True
having my load_image_file() function and load_img_file_op return two separate values: images and labels. And then inputting them like tf.train.shuffle_batch([images, labels],...)
returning/inputting just one image and label at a time into tf.train.shuffle_batch()
using tf.train.shuffle_batch_join()
Nothing seems to work, but I feel like I am following the format of the guide and various other tutorials I have seen. What am I doing wrong? I apologize if my mistake is stupid or trivial (searches for this error do not seem to return anything relevant to me). Thank you for your help and time!
The link in the comments helped a lot; thank you! (The answer is that you have to give the shape when using py_func.) Since I had to figure out a little bit more on top of that I will post the complete solution:
I had to make my function return two separate values so that they would be two different tensors and could be shaped separately:
return images, labels
Then, proceeding as in the question above, but shaping:
load_img_file_op = tf.py_func(self.load_img_file, [], [np.uint8, np.uint8]) # turn the function into an op
images, labels = load_img_file_op
images.set_shape([imgs_per_file, height * width])
labels.set_shape([imgs_per_file, num_classes])
self.batch = tf.train.shuffle_batch([images, labels], batch_size=self.batch_size, capacity=q_capacity, min_after_dequeue=1000, enqueue_many = True)
The enqueue_many is important so that the images will enter the queue individually.
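As a rough illustration of what enqueue_many changes, here is a toy plain-Python model of a shuffling queue. It is a sketch under stated assumptions, not the TensorFlow op: the function shuffle_batch below is a hypothetical helper, and the tiny images and labels lists stand in for the arrays returned by load_img_file().

```python
import random

def shuffle_batch(tensors, batch_size, enqueue_many, seed=0):
    """Toy model of a shuffling queue (hypothetical helper, not the TF op)."""
    if enqueue_many:
        # The first dimension indexes examples: enqueue each (image, label) pair.
        queue = list(zip(*tensors))
    else:
        # The whole input is treated as one single example.
        queue = [tuple(tensors)]
    random.Random(seed).shuffle(queue)
    return [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]

# Six fake flattened "images" with matching one-hot labels.
images = [[i, i, i, i] for i in range(6)]
labels = [[1, 0] if i % 2 == 0 else [0, 1] for i in range(6)]

per_example = shuffle_batch([images, labels], batch_size=2, enqueue_many=True)
whole_file = shuffle_batch([images, labels], batch_size=2, enqueue_many=False)
```

In the per-example case each image stays paired with its own label while the order is shuffled; in the other case the queue holds the entire file as a single element.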

newbie: holoviews curves from pandas follow up: issues with stream

The pandas dataframe rows correspond to successive time samples of a Kalman filter. I want to display the trajectory (truth, measurements and filter estimates) in a stream.
def show_tracker(index, data=run_tracker()):
    i = int(index)
    sleep(0.1)
    p = \
        hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r')) *\
        hv.Curve  (data[0:i], kdims=['x.true'], vdims=['y.true']) *\
        hv.Scatter(data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='darkgreen')) *\
        hv.Curve  (data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='lightgreen'))
    return p
%%opts Scatter [width=600,height=280]
ndx=TimeIndex()
hv.DynamicMap(show_tracker, kdims=[], streams=[ndx])
for i in range(N):
    ndx.update(index=i)
Issue 1: Axes are automatically set to the bounds of the data. Consequently, trajectory updates occur at the very edge of the plot boundaries. Is there a setting to allow some slop, or do I have to compute appropriate bounds in the show_tracker function?
Issue 2: Bokeh backend; I can zoom and pan, but "Reset" causes the data set to be lost. How do I fix that?
Issue 3: The default data argument to show_tracker requires the function to be reexecuted to generate a new dataframe. Is there an easy way to address that?
Issue 1
This is one of the last outstanding issues for the 1.7 release coming next week; track this issue for updates. However, we also just changed how the ranges are updated on a DynamicMap: if you want to update the ranges, make sure to set %%opts Scatter {+framewise} or norm=dict(framewise=True) on one of the displayed objects, as you're already doing for the style options.
Issue 2
This is an unfortunate shortcoming of the reset tool in bokeh; you can track this issue for updates.
Issue 3:
That depends on what exactly you're doing: has the data already been generated, or are you updating it on the fly? If you just have to generate the data once, you can create it outside the function, which means it will be in scope:
data = run_tracker()
def show_tracker(index):
    i = int(index)
    sleep(0.1)
    ...
    return p
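The reason behind Issue 3 is that Python evaluates a default argument once, when the def statement runs, not on each call. The plain-Python sketch below shows this; run_tracker here is a hypothetical stand-in that only records how often it was called, and the holoviews plotting is left out entirely.

```python
calls = []

def run_tracker():
    # Hypothetical stand-in for the expensive data generation.
    calls.append(1)
    return list(range(len(calls) * 10))

def show_with_default(index, data=run_tracker()):
    # The default value was computed once, when `def` executed.
    return data[:int(index)]

calls_after_def = len(calls)   # run_tracker has already run once
show_with_default(3)
show_with_default(5)
calls_after_use = len(calls)   # still once: calling does not regenerate data

# Alternative from the answer: create the data once in the enclosing scope.
data = run_tracker()

def show_tracker(index):
    return data[:int(index)]
```

With the second form, refreshing the data is a matter of rebinding data = run_tracker(), without redefining the function.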
If you actually want to generate new data dynamically the easiest thing to do is write a little class to keep track of the state. You can even make that class a Stream so you don't have to define it separately. Here's what that might look like:
class KalmanTracker(hv.streams.Stream):
    index = param.Integer(default=1)
    def __init__(self, **params):
        # Initializes empty data and parameters
        self.data = None
        super(KalmanTracker, self).__init__(**params)
    def update_data(self, index):
        # Update self.data here
        pass
    def get_view(self, index):
        # Update the data if index exceeds data length and
        # create a holoviews view of the data
        if self.data is None or len(self.data) < index:
            self.update_data(index)
        data = self.data[:index]
        ....
        return hv_obj
    def show(self):
        # Create DynamicMap to display and
        # pass in self as the Stream
        return hv.DynamicMap(self.get_view, kdims=[],
                             streams=[self])
tracker = KalmanTracker()
tracker.show()
# Should update data and plot
tracker.update(index=10)
Once you've done that you can also use the paramnb library to generate widgets from this class. You'd simply do this:
tracker = KalmanTracker()
paramnb.Widgets(tracker, callback=tracker.update)
tracker.show()