Filling a dataframe with a list to get the max_leaf_nodes with the lowest mean_absolute_error - pandas

I made a simple DecisionTreeRegressor and want to get the best max_leaf_nodes value.
Code:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.model_selection import train_test_split as TTS
import pandas as pd

# split the data into 2 parts: training data and validation data
train_X, val_X, train_y, val_y = TTS(X, y, random_state=0)

# define and fit the model with the training data
model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# predict
val_prediction = model.predict(val_X)

# check the predictions
print(MAE(val_y, val_prediction))
# define the get_mae function
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = MAE(val_y, preds_val)
    return mae
# DataFrame and list
df_mae = pd.DataFrame(columns=["MAE"])
li = []

# collect MAEs depending on the max_leaf_nodes value
for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    li.append(mae)
How can I add the values of li to the "MAE" column of df_mae?
Is there a better way to find a good max_leaf_nodes value? (My laptop worked on that for loop for 25 minutes.)

You could append a row directly to the dataframe inside the loop, instead of collecting the values in a list first.
df_mae = df_mae.append({'MAE': mae}, ignore_index = True)
However, if you prefer to add the whole list instead of individual values (outside the for loop):
df_mae = df_mae.append(pd.DataFrame({'MAE': li}), ignore_index = True)
Please be aware that you need to store the max_leaf_nodes values as well; otherwise the resulting dataframe won't be meaningful.
df_mae = pd.DataFrame(columns=["MAE", "max_leaf_nodes"])
li = []
max_leaf_nodes_list = []

for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    li.append(mae)
    max_leaf_nodes_list.append(max_leaf_nodes)

df_mae = df_mae.append(pd.DataFrame({'MAE': li, 'max_leaf_nodes': max_leaf_nodes_list}), ignore_index=True)
or, appending the values into the dataframe directly:
df_mae = pd.DataFrame(columns=["MAE", "max_leaf_nodes"])

for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    df_mae = df_mae.append({'MAE': mae, 'max_leaf_nodes': max_leaf_nodes}, ignore_index=True)
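Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas you would build the frame from the collected lists (or use pd.concat) instead. A minimal sketch:
# same result as the append-based version above, without DataFrame.append
df_mae = pd.DataFrame({"MAE": li, "max_leaf_nodes": max_leaf_nodes_list})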
To reduce the execution time with this approach, I would increase the step from 2 to a bigger number in the range() call. Once you find the interval that produces the best values, you can narrow the interval to search for an even better metric. In other words, searching the entire hyperparameter grid is not the best approach.
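For example, a coarse, roughly logarithmic grid already narrows things down quickly; a sketch along those lines, reusing the get_mae helper from the question (the candidate values are just illustrative):
candidate_nodes = [2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]

results = pd.DataFrame({
    "max_leaf_nodes": candidate_nodes,
    "MAE": [get_mae(n, train_X, val_X, train_y, val_y) for n in candidate_nodes],
})

# row with the lowest validation MAE
print(results.loc[results["MAE"].idxmin()])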
Alternatively, you could use other methods such as Hyperopt or Hyperopt-sklearn.
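For reference, a minimal Hyperopt sketch (assumes hyperopt is installed; the search-space bounds and max_evals are illustrative):
from hyperopt import fmin, tpe, hp

def objective(max_leaf_nodes):
    # hp.quniform yields floats, so cast before calling the helper
    return get_mae(int(max_leaf_nodes), train_X, val_X, train_y, val_y)

best = fmin(
    fn=objective,
    space=hp.quniform("max_leaf_nodes", 2, 5000, 1),
    algo=tpe.suggest,
    max_evals=50,
)
print(best)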

Related

How can I compile batched training of a gpflow GPR into a tf.function?

I need to train a GPR model in multiple batches per epoch using a custom loss function. I would like to do this using GPflow and I would like to compile my training using tf.function to increase the efficiency. However, gpflow.GPR must be re-instantiated each time you supply new data, so tf.function will have to re-trace each time. This makes the code slower rather than faster.
This is the initial setup:
import numpy as np
from itertools import islice
import tensorflow as tf
import tensorflow_probability as tfp
tfb = tfp.bijectors
from sklearn.model_selection import train_test_split
import gpflow
from gpflow.kernels import SquaredExponential
import time
data_size = 1000
train_fract = 0.8
batch_size = 250
n_epochs = 3
iterations_per_epoch = int(train_fract * data_size/batch_size)
tf.random.set_seed(3)
# Generate dummy data
x = np.arange(data_size)
y = np.arange(data_size) + np.random.rand(data_size)
# Slice into train and validate sets
x_train, x_validate, y_train, y_validate = train_test_split(x, y, random_state = 1, test_size = 1-train_fract )
# Convert data into tensorflow constants
x_train = tf.constant(x_train[:, np.newaxis], dtype=np.float64)
x_validate = tf.constant(x_validate[:, np.newaxis], dtype=np.float64)
y_train = tf.constant(y_train[:, np.newaxis], dtype=np.float64)
y_validate = tf.constant(y_validate[:, np.newaxis], dtype=np.float64)
# Batch data
batched_dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=len(x_train), seed=1)
    .repeat(count=None)
    .batch(batch_size)
)
# Create kernel
constrain_positive = tfb.Shift(np.finfo(np.float64).tiny)(tfb.Exp())
amplitude = tfp.util.TransformedVariable(initial_value=1, bijector=constrain_positive, dtype=np.float64, name="amplitude")
len_scale = tfp.util.TransformedVariable(initial_value=10, bijector=constrain_positive, dtype=np.float64, name="len_scale")
kernel = SquaredExponential(variance=amplitude, lengthscales=len_scale, name="squared_exponential_kernel")
obs_noise = tfp.util.TransformedVariable(initial_value=1e-3, bijector=constrain_positive, dtype=np.float64, name="observation_noise")
# Define custom loss function
@tf.function(autograph=False, experimental_compile=False)
def my_custom_loss(y_predict, y_true):
    return tf.math.reduce_mean(tf.math.squared_difference(y_predict, y_true))
#optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
This is how I train without a tf.function:
gpr_model_j_i = gpflow.models.GPR(data=(x_train, y_train), kernel=kernel, noise_variance=obs_noise)

# Start training loop
for j in range(n_epochs):
    for i, (x_train_j_i, y_train_j_i) in enumerate(islice(batched_dataset, iterations_per_epoch)):
        with tf.GradientTape() as tape:
            gpr_model_j_i = gpflow.models.GPR(data=(x_train_j_i, y_train_j_i), kernel=kernel, noise_variance=gpr_model_j_i.likelihood.variance)
            y_predict_j_i = gpr_model_j_i.predict_f(x_validate)[0]
            loss_j_i = my_custom_loss(y_predict_j_i, y_validate)
        grads_j_i = tape.gradient(loss_j_i, gpr_model_j_i.trainable_variables)
        optimizer.apply_gradients(zip(grads_j_i, gpr_model_j_i.trainable_variables))
This is how I train with a tf.function:
@tf.function(autograph=False, experimental_compile=False)
def tf_function_attempt_3(model):  # , optimizer):
    with tf.GradientTape() as tape:
        y_predict_j_i = model.predict_f(x_validate)[0]
        loss_j_i = my_custom_loss(y_predict_j_i, y_validate)
    grads_j_i = tape.gradient(loss_j_i, model.trainable_variables)
    optimizer.apply_gradients(zip(grads_j_i, model.trainable_variables))
    print("TRACING...", end="")

for j in range(n_epochs):
    for i, (x_train_j_i, y_train_j_i) in enumerate(islice(batched_dataset, iterations_per_epoch)):
        gpr_model_j_i = gpflow.models.GPR(data=(x_train_j_i, y_train_j_i), kernel=kernel, noise_variance=gpr_model_j_i.likelihood.variance)
        tf_function_attempt_3(gpr_model_j_i)  # , optimizer)
The tf.function retraces for each batch and is significantly slower than the normal training.
Is there a way to speed up the batched training of my GPR model with tf.function while using a custom loss function and GPflow? If not, I am open to suggestions for an alternative approach.
You don't have to re-instantiate GPR each time. You can construct tf.Variable holders with unconstrained shape and then .assign to them:
import gpflow
import numpy as np
import tensorflow as tf
input_dim = 1
initial_x, initial_y = np.zeros((0, input_dim)), np.zeros((0, 1)) # or your first batch
x_var = tf.Variable(initial_x, shape=(None, input_dim), dtype=tf.float64)
y_var = tf.Variable(initial_y, shape=(None,1), dtype=tf.float64)
# in principle you could also set shape=(None, None)...
m = gpflow.models.GPR((x_var, y_var), gpflow.kernels.SquaredExponential())
loss = m.training_loss_closure() # compile=True default wraps in tf.function()
N1 = 3
x1, y1 = np.random.randn(N1, input_dim), np.random.randn(N1, 1)
m.data[0].assign(x1)
m.data[1].assign(y1)
loss() # traces the first time
N2 = 7
x2, y2 = np.random.randn(N2, input_dim), np.random.randn(N2, 1)
m.data[0].assign(x2)
m.data[1].assign(y2)
loss() # does not trace again
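Applied to the batched setup in the question, the loop then only assigns new batches instead of rebuilding the model. A rough sketch, assuming the x_var/y_var holders above were created with trainable=False (so they don't appear in m.trainable_variables) and that the question's batched_dataset, iterations_per_epoch, n_epochs, optimizer, my_custom_loss, x_validate and y_validate are in scope:
@tf.function  # traces once; later calls reuse the compiled graph
def train_step():
    with tf.GradientTape() as tape:
        # the posterior is conditioned on whatever is currently in m.data
        y_pred = m.predict_f(x_validate)[0]
        loss = my_custom_loss(y_pred, y_validate)
    # assumes x_var/y_var were created with trainable=False
    grads = tape.gradient(loss, m.trainable_variables)
    optimizer.apply_gradients(zip(grads, m.trainable_variables))
    return loss

for j in range(n_epochs):
    for x_batch, y_batch in islice(batched_dataset, iterations_per_epoch):
        m.data[0].assign(x_batch)
        m.data[1].assign(y_batch)
        train_step()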

How to get the labels from tensorflow dataset

ds_test = tf.data.experimental.make_csv_dataset(
    file_pattern="./dfj_test/part-*.csv.gz",
    batch_size=batch_size, num_epochs=1,
    # column_names=use_cols,
    label_name='label_id',
    # select_columns=select_cols,
    num_parallel_reads=30, compression_type='GZIP',
    shuffle_buffer_size=12800)
This is my test set during training. After training the model, I want to zip the columns of predictions and labels for ds_test.
preds = model.predict(ds_test)
Getting the predictions is quite simple, and they come back as a numpy array. However, I don't know how to get the corresponding labels from ds_test.
I want to zip(preds, labels) for further analysis.
Any hint? Thanks.
(tf version 2.3.1)
You can map each example to return the field you want:
# load some exemplary data
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1)

# get the field by unbatching
labels_iterator = dataset.unbatch().map(lambda x: x['survived']).as_numpy_iterator()
labels = np.array(list(labels_iterator))

# get the field by concatenating batches
labels_iterator = dataset.map(lambda x: x['survived']).as_numpy_iterator()
labels = np.concatenate(list(labels_iterator))
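Applied to the question's ds_test (which passes label_name='label_id', so each element is a (features, label) tuple), the same idea looks roughly like this; note that make_csv_dataset shuffles by default, so you may want shuffle=False there for the predictions and labels to line up:
import numpy as np

# pull the labels out of the (features, label) tuples, batch by batch
labels_iterator = ds_test.map(lambda features, label: label).as_numpy_iterator()
labels = np.concatenate(list(labels_iterator))

preds = model.predict(ds_test)
paired = list(zip(preds, labels))  # for further analysis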

Odd problem with the Multivariate Input Multi-Step LSTM Time Series Forecasting Models

I have developed Multivariate Input Multi-Step LSTM Time Series Forecasting Models for my dataset according to the tutorial (https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/).
Yet I ran into a very odd problem: when I run the code with a small sample (50 training samples, 10 test samples), the predictions are correct, but when I run the experiment with the full data (4000 training samples, 1000 test samples), the predictions contain NaN values, which leads to errors.
Then, when I try scaling plus relu activations plus regularization, as in the following code, I do get predictions with the full data (4000 training samples, 1000 test samples), but the predictions are still not correct: I want to forecast 96 steps, yet every step I predict is the same number.
Can you give a useful suggestion to deal with the forecast accuracy issues?
import time
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
import csv
import numpy
from sklearn.preprocessing import MinMaxScaler
from numpy import save
from timeit import default_timer as timer
def scale(train, test):
    # fit scaler
    scaler = MinMaxScaler(feature_range=(-1, 1))
    train = train.astype(float)
    test = test.astype(float)
    scaler = scaler.fit(train)
    # transform train
    train = train.reshape(train.shape[0], train.shape[1])
    train_scaled = scaler.transform(train)
    # transform test
    test = test.reshape(test.shape[0], test.shape[1])
    test_scaled = scaler.transform(test)
    return scaler, train_scaled, test_scaled
# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[0:387030, 10:26], data[387030:433881, 10:26]
    # train, test = data[0:4850, 10:26], data[4850:5820, 10:26]
    # train, test = data[0:387030], data[387029:433880]
    # restructure into windows of weekly data
    # numpy.savetxt("test.csv", data[387030:433881, :], delimiter=",")
    # save('test.npy', data[387030:433881, :])
    scaler, train_scaled, test_scaled = scale(train, test)
    train_scaled = array(split(train_scaled, len(train_scaled) / 97))
    test_scaled = array(split(test_scaled, len(test_scaled) / 97))
    return scaler, train_scaled, test_scaled
# create a list of configs to try
def model_configs():
    # define scope of configs
    # n_input = [12]
    n_nodes = [100, 200, 300]
    n_epochs = [50, 100]
    n_batch = [64]
    # n_diff = [12]
    # create configs
    configs = list()
    # for i in n_input:
    for j in n_nodes:
        for k in n_epochs:
            for l in n_batch:
                cfg = [j, k, l]
                configs.append(cfg)
    print('Total configs: %d' % len(configs))
    return configs
# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(0, actual.shape[1], 97):
        # calculate mse
        mse = mean_squared_error(actual[:, i, :], predicted[:, i, :])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for x in range(actual.shape[0]):
        for y in range(actual.shape[1]):
            for z in range(actual.shape[2]):
                s += (actual[x, y, z] - predicted[x, y, z])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1] * actual.shape[2]))
    return score, scores
# convert history into inputs and outputs
def to_supervised(train, n_steps_in, n_steps_out=97, overlop=97):
    # flatten data
    sequences = train.reshape(
        (train.shape[0] * train.shape[1], train.shape[2]))
    X, y = list(), list()
    for i in range(0, len(sequences), overlop):
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the dataset
        if out_end_ix > len(sequences):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix:out_end_ix, :]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# train the model
def build_model(train, n_input, config):
    # unpack config
    n_nodes, n_epochs, n_batch = config
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, n_epochs, n_batch
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], n_features))
    # define model
    model = Sequential()
    model.add(
        LSTM(
            n_nodes,
            activation='relu',
            input_shape=(n_timesteps, n_features),
            recurrent_dropout=0.6))
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(n_nodes, activation='relu', return_sequences=True, recurrent_dropout=0.6))
    model.add(TimeDistributed(Dense(n_nodes, activation='relu')))
    model.add(TimeDistributed(Dense(n_features)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(
        train_x,
        train_y,
        epochs=epochs,
        batch_size=batch_size,
        verbose=verbose)
    return model
# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, :]
    # reshape into [1, n_input, n]
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat
# evaluate a single model
def evaluate_model(train, test, n_input, cfg):
    start = timer()
    # fit model
    model = build_model(train, n_input, cfg)
    # print("--- %s seconds ---" % (time.time() - start_time))
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    # invert scaling
    predictions = predictions.reshape(
        (predictions.shape[0] * predictions.shape[1], predictions.shape[2]))
    predictions = scaler.inverse_transform(predictions)
    test = test.reshape((test.shape[0] * test.shape[1], test.shape[2]))
    test = scaler.inverse_transform(test)
    predictions = array(split(predictions, len(predictions) / 97))
    test = array(split(test, len(test) / 97))
    score, scores = evaluate_forecasts(test, predictions)
    run_time = timer() - start
    return cfg[0], cfg[1], cfg[2], score, scores, run_time
# load the new file
dataset = read_csv(
    'data_preproccess_5.csv',
    header=0,
    index_col=0)
# split into train and test
scaler, train_scaled, test_scaled = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 7 * 97
# model configs
cfg_list = model_configs()
scores = [
    evaluate_model(
        train_scaled,
        test_scaled,
        n_input,
        cfg) for cfg in cfg_list]
If you have multi-step output, you can simply reshape your predictions and the true values, then calculate the error.
My split datasets:
trainX, trainY, testX, testY
Get the prediction results:
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
Reshape the predictions and the real values:
trainY = trainY.reshape(-1, )
trainPredict = trainPredict.reshape(-1, )
testY = testY.reshape(-1, )
testPredict = testPredict.reshape(-1, )
Calculate the root mean squared error:
print('Train Root mean squared error: {}'.format(math.sqrt(mean_squared_error(trainY, trainPredict))))
print('Test Root mean squared error: {}'.format(math.sqrt(mean_squared_error(testY, testPredict))))
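Since the complaint is that all 96 forecast steps come out as the same number, it can also help to look at the error per forecast step rather than one flattened RMSE. A small sketch, assuming actual/predicted arrays shaped (samples, n_steps, n_features) as produced by the question's pipeline:
import numpy as np
from sklearn.metrics import mean_squared_error

def per_step_rmse(actual, predicted):
    # one RMSE per forecast step, averaged over samples and features
    return [np.sqrt(mean_squared_error(actual[:, step, :], predicted[:, step, :]))
            for step in range(actual.shape[1])]

step_scores = per_step_rmse(test, predictions)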

Making prediction on Iris dataset

I have a basic classification code for the Iris dataset.
import tensorflow as tf
import pandas as pd
COLUMN_NAMES = [
    'SepalLength',
    'SepalWidth',
    'PetalLength',
    'PetalWidth',
    'Species'
]
# Import training dataset
training_dataset = pd.read_csv('iris_training.csv', names=COLUMN_NAMES, header=0)
train_x = training_dataset.iloc[:, 0:4]
train_y = training_dataset.iloc[:, 4]
# Import testing dataset
test_dataset = pd.read_csv('iris_test.csv', names=COLUMN_NAMES, header=0)
test_x = test_dataset.iloc[:, 0:4]
test_y = test_dataset.iloc[:, 4]
columns_feat = [
    tf.feature_column.numeric_column(key='SepalLength'),
    tf.feature_column.numeric_column(key='SepalWidth'),
    tf.feature_column.numeric_column(key='PetalLength'),
    tf.feature_column.numeric_column(key='PetalWidth')
]

classifier = tf.estimator.DNNClassifier(
    feature_columns=columns_feat,
    # Two hidden layers of 10 nodes each.
    hidden_units=[10, 10],
    # The model is classifying 3 classes
    n_classes=3)
def train_function(inputs, outputs, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((dict(inputs), outputs))
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

# Train the Model.
classifier.train(
    input_fn=lambda: train_function(train_x, train_y, 100),
    steps=1000)

def evaluation_function(attributes, classes, batch_size):
    attributes = dict(attributes)
    if classes is None:
        inputs = attributes
    else:
        inputs = (attributes, classes)
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: evaluation_function(test_x, test_y, 100))
I can evaluate the model, but how can I make a prediction on my own data? Right now I only get console output with the loss, epochs, and accuracy. For example, suppose I have everything except the species: I want to supply my own sepal length and so on, and get a prediction of the species back as another variable. Do I have to create variables like pred_x or pred_y (pandas DataFrames) and then pass them into eval_result?
Is this what you mean? For example: new_samples = np.array([[6.4, 3.2, 4.5, 1.5], [5.8, 3.1, 5.0, 1.7]], dtype=np.float32). If you want to make predictions on new data like this, you can refer to this code: TensorFlow-Iris-Classification.
Like all estimator classes, the DNNClassifier class has a predict method that makes real-world predictions. The documentation is here.
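A rough sketch of what that can look like with the estimator API used in the question (new_samples and predict_input_fn are illustrative names):
import numpy as np

new_samples = np.array([[6.4, 3.2, 4.5, 1.5],
                        [5.8, 3.1, 5.0, 1.7]], dtype=np.float32)

def predict_input_fn():
    # features only, no labels, keyed like the training columns
    features = {
        'SepalLength': new_samples[:, 0],
        'SepalWidth': new_samples[:, 1],
        'PetalLength': new_samples[:, 2],
        'PetalWidth': new_samples[:, 3],
    }
    dataset = tf.data.Dataset.from_tensor_slices(features)
    return dataset.batch(len(new_samples)).make_one_shot_iterator().get_next()

predictions = classifier.predict(input_fn=predict_input_fn)
for pred in predictions:
    class_id = pred['class_ids'][0]
    print(class_id, pred['probabilities'][class_id])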

Tensorflow shuffle_batch speed

I noticed a big difference in speed between loading my training data into memory and feeding it into the graph as a numpy array versus using a shuffle batch of the same size; my data has ~1000 instances.
With the in-memory approach, 1000 iterations take only a few seconds, but with a shuffle batch it takes almost 10 minutes. I understand the shuffle batch should be a bit slower, but this seems way too slow. Why is this?
Added a bounty. Any suggestions on how to make shuffled mini-batches faster?
Here is the training data: Link to bounty_training.csv (pastebin)
Here is my code:
shuffle_batch
import numpy as np
import tensorflow as tf
data = np.loadtxt('bounty_training.csv',
                  delimiter=',', skiprows=1,
                  usecols=(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14))
filename = "test.tfrecords"
with tf.python_io.TFRecordWriter(filename) as writer:
    for row in data:
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature['features'].float_list.value.extend(features)
        example.features.feature['label'].float_list.value.append(label)
        writer.write(example.SerializeToString())
def read_and_decode_single_example(filename):
    filename_queue = tf.train.string_input_producer([filename],
                                                    num_epochs=None)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.FixedLenFeature([], np.float32),
            'features': tf.FixedLenFeature([14], np.float32)})
    pdiff = features['label']
    avgs = features['features']
    return avgs, pdiff
avgs, pdiff = read_and_decode_single_example(filename)
n_features = 14
batch_size = 1000
hidden_units = 7
lr = .001
avgs_batch, pdiff_batch = tf.train.shuffle_batch(
    [avgs, pdiff], batch_size=batch_size,
    capacity=5000,
    min_after_dequeue=2000)
X = tf.placeholder(tf.float32,[None,n_features])
Y = tf.placeholder(tf.float32,[None,1])
W = tf.Variable(tf.truncated_normal([n_features,hidden_units]))
b = tf.Variable(tf.zeros([hidden_units]))
Wout = tf.Variable(tf.truncated_normal([hidden_units,1]))
bout = tf.Variable(tf.zeros([1]))
hidden1 = tf.matmul(X,W) + b
pred = tf.matmul(hidden1,Wout) + bout
loss = tf.reduce_mean(tf.squared_difference(pred,Y))
optimizer = tf.train.AdamOptimizer(lr).minimize(loss)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for step in range(1000):
        x_, y_ = sess.run([avgs_batch, pdiff_batch])
        _, loss_val = sess.run([optimizer, loss],
                               feed_dict={X: x_, Y: y_.reshape(batch_size, 1)})
        if step % 100 == 0:
            print(loss_val)
    coord.request_stop()
    coord.join(threads)
Full batch via numpy array
"""
avgs and pdiff loaded into numpy arrays first...
Same model as above
"""
with tf.Session() as sess:
init = tf.global_variables_initializer()
sess.run(init)
for step in range(1000):
_, loss_value = sess.run([optimizer,loss],
feed_dict={X: avgs,Y: pdiff.reshape(n_instances,1)} )
In this case, you're running the session twice per step: once to pull avgs_batch and pdiff_batch out of the queue, and once for the actual training sess.run call. That doesn't explain the magnitude of the slowdown, but it's definitely something you should keep in mind; ideally the data fetch and the training op would be fused into a single sess.run call.
I suspect most of the slow-down is coming from use of TFRecordReader. I don't pretend to understand the inner workings of tensorflow, but you might find my answer here helpful.
Summary
create minimal data associated with each example, i.e. image filenames, ids rather than entire images;
convert to tensorflow ops with tensorflow.python.framework.ops.convert_to_tensor;
use tf.train.slice_input_producer to get a tensor for a single example;
do some preprocessing on individual examples - e.g. load images from filenames;
batch them together using tf.train.batch to group them up (a minimal sketch follows this list).
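For illustration, a minimal sketch of that pipeline, assuming filename_list, label_list and a preprocess() helper exist (they are not part of the question):
# minimal data per example: filenames + labels, not the payloads themselves
filenames = tf.convert_to_tensor(filename_list, dtype=tf.string)
labels = tf.convert_to_tensor(label_list, dtype=tf.float32)

# a queue that yields one (filename, label) pair at a time
single_file, single_label = tf.train.slice_input_producer([filenames, labels], shuffle=True)

# per-example preprocessing, e.g. loading and decoding the file contents
single_example = preprocess(single_file)  # hypothetical helper

# group preprocessed examples into mini-batches
example_batch, label_batch = tf.train.batch([single_example, single_label], batch_size=32)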
The trick is that instead of feeding single examples into shuffle_batch, you feed an (n+1)-dimensional tensor of examples to it with enqueue_many=True. I found this thread very helpful:
TFRecordReader seems extremely slow, and multi-threads reading not working
def get_batch(batch_size):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    batch_list = []
    for i in range(batch_size):
        batch_list.append(serialized_example)
    return [batch_list]

batch_serialized_example = tf.train.shuffle_batch(
    get_batch(batch_size), batch_size=batch_size,
    capacity=100*batch_size,
    min_after_dequeue=batch_size*10,
    num_threads=1,
    enqueue_many=True)

features = tf.parse_example(
    batch_serialized_example,
    features={
        'label': tf.FixedLenFeature([], np.float32),
        'features': tf.FixedLenFeature([14], np.float32)})

batch_pdiff = features['label']
batch_avgs = features['features']
...
When using queues to get the data, you shouldn't use feed_dict. Instead, make your graph depend directly on the input data, that is:
remove the X and Y PlaceHolders
use your feature batch directly
hidden1 = tf.matmul(avgs_batch,W) + b
similarly, use the label batch (pdiff_batch) instead of Y when computing the loss
finally, just keep the second session.run to compute the loss directly, and without using feed_dict
# x_, y_ = sess.run([avgs_batch, pdiff_batch])
# _, loss_val = sess.run([optimizer, loss],
#                        feed_dict={X: x_, Y: y_.reshape(batch_size, 1)})
_, loss_val = sess.run([optimizer, loss])
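Put together, a sketch of the queue-driven version of the training graph, using the same variables and hyperparameters as in the question:
# the graph reads straight from the shuffle_batch queue, no placeholders
hidden1 = tf.matmul(avgs_batch, W) + b
pred = tf.matmul(hidden1, Wout) + bout
loss = tf.reduce_mean(tf.squared_difference(pred, tf.reshape(pdiff_batch, [-1, 1])))
optimizer = tf.train.AdamOptimizer(lr).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for step in range(1000):
        # one run both dequeues a batch and applies the update
        _, loss_val = sess.run([optimizer, loss])
        if step % 100 == 0:
            print(loss_val)
    coord.request_stop()
    coord.join(threads)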