I want to generate random data in a py_func op, but when I run this code I get different random values, whereas I expect the two components of each element to be the same.
I think the problem comes from the tf.data.Dataset.map API: when I fetch the values of d5, the random function runs twice, so d3 and d4 end up with different values.
Thanks.
import tensorflow as tf
import numpy as np

def random(x):
    def random_(x):
        x = np.float32(np.random.rand())
        print("run random_")
        return x, x
    return tf.compat.v1.py_func(
        random_,
        [x],
        [tf.float32, tf.float32],
        name="generate_random")

d1 = tf.data.Dataset.from_tensor_slices(tf.zeros(10))
d2 = d1.map(random)
d3 = d2.map(lambda x1, x2: x1)
d4 = d2.map(lambda x1, x2: x2)
d5 = tf.data.Dataset.zip((d3, d4))

it = d5.make_one_shot_iterator()
fetch = it.get_next()

with tf.Session() as sess:
    while True:
        X = sess.run(fetch)
        print(X)
As per the documentation -
Maps map_func across the elements of this dataset.
This transformation applies map_func to each element of this dataset,
and returns a new dataset containing the transformed elements, in the
same order as they appeared in the input. map_func can be used to
change both the values and the structure of a dataset's elements. For
example, adding 1 to each element, or projecting a subset of element
components.
I think it's best to see the map function as part of the pipeline. Every time you call map, nothing happens to the data yet; instead, a chain of operations is built in the background. These transformations are executed lazily: only when you actually request data from the dataset (for example by iterating it or fetching a batch) is the source data fed through the pipeline, and at that point each map is applied to the element or to the batch.
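Given that, one way to keep the two components matched without leaving the tf.data pipeline is to consume d2 directly: each of its elements comes from a single py_func call, so the pair is always identical. It is the split into d3/d4 followed by zip that triggers a second, independent evaluation of the py_func. A minimal sketch building on the question's code (TF 1.x graph mode assumed):

# Sketch: consume d2 as-is instead of splitting and re-zipping.
d2 = d1.map(random)  # each element is (x, x) from one py_func call
it = tf.compat.v1.data.make_one_shot_iterator(d2)
x1, x2 = it.get_next()

with tf.compat.v1.Session() as sess:
    print(sess.run((x1, x2)))  # the two values always match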
As an alternative, you can fulfill your requirement as below -
Code -
%tensorflow_version 1.x
import tensorflow as tf
import numpy as np

def random(x):
    x = np.float32(np.random.rand())
    print("run random_")
    return x

a = []
for i in range(10):
    a.append(random(i))

d1 = tf.data.Dataset.from_tensor_slices(a)
d5 = tf.data.Dataset.zip((d1, d1))

it = d5.make_one_shot_iterator()
fetch = it.get_next()

with tf.Session() as sess:
    while True:
        X = sess.run(fetch)
        print(X)
Output -
run random_
run random_
run random_
run random_
run random_
run random_
run random_
run random_
run random_
run random_
WARNING:tensorflow:From <ipython-input-2-ee4ae8ca7ce4>:18: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
(0.6963448, 0.6963448)
(0.92889917, 0.92889917)
(0.054273304, 0.054273304)
(0.3719434, 0.3719434)
(0.33791015, 0.33791015)
(0.873847, 0.873847)
(0.66726416, 0.66726416)
(0.77356577, 0.77356577)
(0.4188753, 0.4188753)
(0.8886825, 0.8886825)
Hope this answers your question. Happy Learning.
Related
I am new to GPflow and I am trying to figure out how to write a custom loss function to optimize the model. For my purpose, I need to manipulate the predicted output of the GP through different data treatments, and it is the output I get after these treatments that I would like to optimise the GP model against. For that purpose I would like to use the root mean square error as the loss function.
Workflow:
Input -> GP model -> GP_output -> Data treatment -> Predicted_output -> RMSE(Predicted_output, Observations)
I hope this makes sense.
Normally models are optimised doing something like this:
import gpflow as gf
import numpy as np
X = np.linspace(0, 100, num=100).reshape(-1, 1)  # GPR expects 2-D (N, 1) arrays
n = np.random.normal(scale=8, size=X.shape)
y_obs = 10 * np.sin(X) + n

model = gf.models.GPR(
    data=(X, y_obs),
    kernel=gf.kernels.SquaredExponential(),
)

gf.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=optimizer_config
)
I have figured out how to do a workaround using the scipy minimize function to optimise using RMSE, but I would like to stay within the GPflow framework, where I can just pass model.trainable_variables as an argument and have a general function that also works if I have multiple input/output dimensions.
def objective_func(params):
    model.kernel.lengthscales.assign(params[0])
    model.kernel.variance.assign(params[1])
    model.likelihood.variance.assign(params[2])
    GP_output = model.predict_y(X)[0]
    GP_output = GP_output.numpy()
    Predicted_output = data_treatment_func(GP_output)
    return np.sqrt(np.square(np.subtract(Predicted_output, y_obs)).mean())

from scipy.optimize import minimize
res = minimize(objective_func,
               x0=(1.0, 1.0, 1.0))
I found the answer myself.
If you write your objective_func using TensorFlow instead of NumPy (e.g. tf.math.sqrt, tf.reduce_mean) you can simply pass that to gf.optimizers.Scipy().minimize(...) instead of model.training_loss:
import tensorflow as tf

def objective_func():
    GP_output = model.predict_y(X)[0]
    Predicted_output = data_treatment_func(GP_output)
    return tf.sqrt(tf.reduce_mean(tf.square(Predicted_output - y_obs)))

gf.optimizers.Scipy().minimize(
    objective_func, model.trainable_variables, options=optimizer_config
)
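One practical note: gf.optimizers.Scipy().minimize uses a gradient-based method by default, so data_treatment_func itself needs to be built from differentiable TensorFlow ops for this to work. A hypothetical sketch, where the transformation shown is purely illustrative and not from the question:

import tensorflow as tf

# Hypothetical data treatment built from TensorFlow ops so that gradients can
# flow from the RMSE back through predict_y to the trainable variables.
def data_treatment_func(gp_output):
    return tf.maximum(gp_output, 0.0) * 2.0  # e.g. clip negatives, then rescale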
I used the TensorFlow tf.keras.metrics.MeanSquaredError() metric to evaluate the mean squared error between two numpy arrays, but each time I call mse() it gives a different result.
a = np.random.random(size=(100, 2000))
b = np.random.random(size=(100, 2000))

for i in range(100):
    v = mse(a, b).numpy()
    plt.scatter(i, v)
    print(v)
where I had previously defined mse = tf.keras.metrics.MeanSquaredError(). Here is the output. Any idea what is going wrong?
np.random.random generates new random data on every run, so your code should produce a different MSE each run, shouldn't it?
Run 1:
[0.87148841 0.50221413 0.49858526 ... 0.22311888 0.71320089 0.36298912]
Run 2:
[0.14941241 0.78560523 0.62436783 ... 0.1865485 0.2730567 0.49300401]
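If the goal is reproducible values across runs, one option (a small sketch, not part of the original answer) is to seed NumPy's random generator before creating the arrays:

import numpy as np

np.random.seed(42)  # fixed seed: the arrays, and hence the MSE, repeat across runs
a = np.random.random(size=(100, 2000))
b = np.random.random(size=(100, 2000))
print(a[0, :3])  # prints the same three values on every run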
I have 140K sentences I want to get embeddings for. I am using the TF-Hub Universal Sentence Encoder and am iterating over the sentences one at a time (I know it's not the best way, but when I try to feed more than 500 sentences into the model it crashes).
My Environment is:
Ubuntu 18.04
Python 3.7.4
TF 1.14
RAM: 16 GB
Processor: i5
My code is:
Version 1
I iterate inside the tf.Session context manager:
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')

with tf.compat.v1.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    sentence_embedding = None
    for i, row in df.iterrows():
        sentence = row['content']
        embeddings = embed([sentence])
        sentence_embedding = session.run(embeddings)
        df.at[i, 'embedding'] = sentence_embedding
        print('processed index:', i)
Version 2
I open and close a session within each iteration:
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')

for i, row in df.iterrows():
    sentence = row['content']
    embeddings = embed([sentence])
    sentence_embedding = None
    with tf.compat.v1.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        sentence_embedding = session.run(embeddings)
    df.at[i, 'embedding'] = sentence_embedding
    print('processed index:', i)
While version 2 does seem to trigger some garbage collection and memory is cleared a bit, it still blows up after around 50 items. Version 1 just keeps gobbling up memory.
The correct solution, as given by arnoegw:
def calculate_embeddings(dataframe, table_name):
    sql_get_sentences = "SELECT * FROM semantic_similarity.sentences WHERE embedding IS NULL LIMIT 1500"
    sql_update = 'UPDATE {} SET embedding = data.embedding FROM (VALUES %s) AS data(id, embedding) WHERE {}.id = data.id'.format(table_name, table_name)
    df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
    with hub.eval_function_for_module("https://tfhub.dev/google/universal-sentence-encoder-large/3") as embed:
        while len(df) > 0:  # stop once no unembedded sentences remain
            sentence_array = df['content'].values
            sentence_embeddings = embed(sentence_array)
            df['embedding'] = sentence_embeddings.tolist()
            values = [tuple(x) for x in df[['id', 'embedding']].values]
            pandas_repository.update_db_from_df('semantic_similarity.sentences', sql_update, values)
            df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
I am a newbie to TF and can use any help I can get.
Your code uses tf.Session, so it falls under the TF1.x programming model of first building a dataflow graph and then running it repeatedly with inputs being fed and outputs being fetched from the graph.
But your code does not align well with that programming model. Both versions keep adding new applications of (calls to) the hub.Module to the default TensorFlow graph instead of applying it once and running the same graph repeatedly for the various inputs. Version 2 keeps going into and out of tf.Sessions, which frees some memory but is very inefficient.
Please see my answer to "Strongly increasing memory consumption when using ELMo from Tensorflow-Hub" for guidance how to do it right in the graph-based programming model of TensorFlow 1.x.
TensorFlow 2.0, which is going to be released soon, defaults to the programming model of "eager execution", which does away with graphs and sessions and would have avoided this confusion. TensorFlow Hub will be updated in due course for TF2.0. For a preview close to your use-case, see https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb
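For reference, here is a minimal sketch of that graph-based pattern: build the graph with a single hub.Module application fed through a placeholder, then run the same graph repeatedly. The module URL is the one from the question; the batches_of_sentences iterable is a name assumed for illustration.

import tensorflow as tf
import tensorflow_hub as hub

# Build the graph once: a single hub.Module application fed via a placeholder.
g = tf.Graph()
with g.as_default():
    text_input = tf.compat.v1.placeholder(dtype=tf.string, shape=[None])
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
    embeddings = embed(text_input)
    init_op = tf.group(tf.compat.v1.global_variables_initializer(),
                       tf.compat.v1.tables_initializer())
g.finalize()  # any accidental graph growth inside the loop now raises an error

# Run the same graph repeatedly, feeding batches of sentences.
with tf.compat.v1.Session(graph=g) as session:
    session.run(init_op)
    for batch in batches_of_sentences:  # hypothetical iterable of lists of strings
        result = session.run(embeddings, feed_dict={text_input: batch})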
I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not an error and they asked me to post here.
import os
import pandas as pd
import tensorflow as tf

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size; map adds another axis instead of concatenating along the existing one. From what I understood from the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset, so how is my code working with map and not with flat_map?
It would also great if you could point me towards code where Dataset API has been used with Pandas module.
As mikkola points out in the comments, Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
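As a tiny illustration of that signature difference (a made-up example, separate from the CSV pipeline below):

import tensorflow as tf

rows = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])

# map: the function returns tensors, so the row structure is preserved.
mapped = rows.map(lambda row: row * 10)  # elements: [10 20], [30 40]

# flat_map: the function must return a Dataset, whose elements are spliced in.
flattened = rows.flat_map(
    lambda row: tf.data.Dataset.from_tensor_slices(row))  # elements: 1, 2, 3, 4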
If you want each row of the array returned by _get_data_for_dataset() to
become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)

dataset = tf.data.Dataset.from_tensor_slices(file_names)

# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))

dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
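To sanity-check the pipeline (a small assumed addition, in the same TF 1.x style), you can pull one batch in a session; with the flattened rows batched in pairs, each fetch is a (2, 2) float32 array of [Wind_MWh, Actual_Load_MWh] rows:

with tf.Session() as sess:
    first_batch = sess.run(get_batch)
    print(first_batch.shape)  # (2, 2): two flattened rows, two columns each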
I have written a simple program using pybrain to predict simple sequential data.
For example, a sequence of 0,1,2,3,4 is supposed to produce an output of 5 from the network. The dataset specifies the remaining sequence.
Below is my implementation:
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SequentialDataSet
from pybrain.structure import SigmoidLayer, LinearLayer
from pybrain.structure import LSTMLayer
import itertools
import numpy as np

INPUTS = 5
OUTPUTS = 1
HIDDEN = 40

net = buildNetwork(INPUTS, HIDDEN, OUTPUTS, hiddenclass=LSTMLayer, outclass=LinearLayer, recurrent=True, bias=True)

ds = SequentialDataSet(INPUTS, OUTPUTS)
ds.addSample([0, 1, 2, 3, 4], [5])
ds.addSample([5, 6, 7, 8, 9], [10])
ds.addSample([10, 11, 12, 13, 14], [15])
ds.addSample([16, 17, 18, 19, 20], [21])

net.randomize()

trainer = BackpropTrainer(net, ds)
for _ in range(1000):
    print(trainer.train())

x = net.activate([0, 1, 2, 3, 4])
print(x)
The output on my screen keeps showing [0.99999999 0.99999999 0.9999999 0.99999999] every single time. What am I missing? Is the training not sufficient? The error reported by trainer.train() is around 86.625.
The pybrain SigmoidLayer implements the sigmoid squashing function, which you can see here:
sigmoid squashing function code
The relevant part is this:
def sigmoid(x):
    """ Logistic sigmoid function. """
    return 1. / (1. + safeExp(-x))
So, no matter what the value of x, it will only ever return values between 0 and 1. For this reason, and for others, it is a good idea to scale your input and output values to between 0 and 1. For example, divide all your inputs by the maximum value (assuming the minimum is no lower than 0), and the same for your outputs. Then do the reverse with the result (e.g. multiply by 25 if you were dividing by 25 at the beginning).
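For example, a minimal sketch of that scaling applied to the question's data (building on the question's code; the divisor 21.0, roughly the largest value in the data, is my choice for illustration):

import numpy as np

scale = 21.0  # assumed: roughly the largest value appearing in the inputs/targets
ds = SequentialDataSet(INPUTS, OUTPUTS)
for seq_in, seq_out in [([0, 1, 2, 3, 4], [5]), ([5, 6, 7, 8, 9], [10]),
                        ([10, 11, 12, 13, 14], [15]), ([16, 17, 18, 19, 20], [21])]:
    ds.addSample(np.array(seq_in) / scale, np.array(seq_out) / scale)

# ... build and train the network exactly as before, then undo the scaling:
prediction = net.activate(np.array([0, 1, 2, 3, 4]) / scale) * scale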
Also, I'm no expert on pybrain, but I wonder if you need OUTPUTS = 4? It looks like you have only one output in your data, so I'm wondering if you could just use OUTPUTS = 1.
You may also try scaling the inputs and outputs to a particular part of the sigmoid curve (e.g. between 0.1 and 0.9) to make the pybrain's job easier, but that makes the scaling before and after a little more complex.