I have split my data into training and test sets, and then split the training set again into a new train set and a validation set. To this new train set and validation set I have applied the transformation below. I am implementing a random forest regression, so at the next step I apply the transformations to these sets and try to combine them into one. The issue is that np.vstack isn't returning the correct shape:
Output:
(2, 1) <- should have been (25455, 2394)
(21636, 2394)
(3819, 2394)
Could someone tell me what I am doing wrong?
Xtrain_rf, Xtest_rf = train_test_split(insurance_data_prep, test_size=0.15, random_state=42)
Xtrain2_rf, Xval_rf = train_test_split(Xtrain_rf, test_size=0.15, random_state=42)
full_transform_rf = ColumnTransformer([
("num", StandardScaler(), attributes_num),
("cat", OneHotEncoder(handle_unknown='ignore'), attributes_cat),
])
## fit transform in the train set
Xtrain2_rf_att_prepared = full_transform_rf.fit_transform(Xtrain2_rf_att)
## transform in the validation set
Xval_rf_att_prepared = full_transform_rf.transform(Xval_rf_att)
whole_train_set_attributes_rf = np.vstack((Xtrain2_rf_att_prepared, Xval_rf_att_prepared))
print(whole_train_set_attributes_rf.shape)
print(Xtrain2_rf_att_prepared.shape)
print(Xval_rf_att_prepared.shape)
I tried to replicate your code with a dataset of my own, and one problem I ran into is that the Xtrain2_rf_att and Xval_rf_att variables have not been defined in your code snippet. Those are the variables that you passed to your ColumnTransformer.
I guess that in your case the variables are defined but what you really want to pass to the ColumnTransformer are Xtrain2_rf and Xval_rf.
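If that is the case, a minimal sketch of what I mean (assuming Xtrain2_rf and Xval_rf already contain exactly the attribute columns listed in attributes_num and attributes_cat) would be:
## a minimal sketch, assuming Xtrain2_rf / Xval_rf hold the attribute columns directly
Xtrain2_rf_att_prepared = full_transform_rf.fit_transform(Xtrain2_rf)
Xval_rf_att_prepared = full_transform_rf.transform(Xval_rf)
## note: if the OneHotEncoder makes the transformer return a scipy sparse matrix,
## np.vstack stacks the two matrices as opaque objects (giving a (2, 1) shape);
## scipy.sparse.vstack would be needed in that case
whole_train_set_attributes_rf = np.vstack((Xtrain2_rf_att_prepared, Xval_rf_att_prepared))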
If this does not solve your problem, could you edit your question and add the following information?:
insurance_data_prep shape and columns (or some of them)
attributes_num and attributes_cat
clearly provide the outputs from the three print statements at the end of your code, and not just one of them
You might need to provide an MRE (minimal reproducible example) in place of your dataframe; we will see.
I used the TensorFlow tf.keras.metrics.MeanSquaredError() metric to evaluate the mean squared error between two numpy arrays, but each time I call mse() it gives a different result.
a = np.random.random(size=(100,2000))
b = np.random.random(size=(100,2000))
for i in range(100):
    v = mse(a, b).numpy()
    plt.scatter(i, v)
    print(v)
where I had previously defined mse = tf.keras.metrics.MeanSquaredError(). Here is the output. Any idea what is going wrong?
np.random.random generates new random data on every run, so your code should result in a different mse each run, shouldn't it?
run 1:
[0.87148841 0.50221413 0.49858526 ... 0.22311888 0.71320089 0.36298912]
Run 2:
[0.14941241 0.78560523 0.62436783 ... 0.1865485 0.2730567 0.49300401]
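If you want the two runs to produce the same numbers, one option (a minimal sketch, not from the original code) is to fix the NumPy seed before generating the arrays:
import numpy as np
np.random.seed(42)  # fix the seed so every run draws the same a and b
a = np.random.random(size=(100, 2000))
b = np.random.random(size=(100, 2000))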
I have some data, a very simple example, to test my understanding of the usage of tf.data.Dataset's padded_batch.
The text file, saved as test.txt, contains:
I use tensorflow for this data
I will be testing
The current tensorflow data
Please note that I am using TensorFlow version 2.0, so I do not need to use tf.Session to initialize my variables.
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
dataset = dataset.padded_batch(2)
for x in dataset:
    print(x.numpy())
Error that I received:
TypeError: padded_batch() missing 1 required positional argument: 'padded_shapes'
Expected output:
[[b'I' b'use' b'tensorflow' b'for' b'this' b'data']
[b'I' b'will' b'be' b'testing' b'unknown' b'unknown']]
[[b'The' b'current' b'tensorflow' b'data' b'unknown' b'unknown']]
How should I configure padded_shapes and also padding_values? I wish to make the lengths of the tensors the same by inserting "unknown" for each empty element. (This might be a little confusing; the above shows my expected results.)
Please note that tf.data.Dataset's padded_batch expects the shape of your inputs (padded_shapes) and, since you want the padding value to be "unknown", the padding value that you will use (padding_values). Below is the code snippet you want to use.
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
dataset = dataset.padded_batch(3, padded_shapes=[None], padding_values="unknown")
for x in dataset:
    print(x.numpy())
# [[b'I' b'use' b'tensorflow' b'for' b'this' b'data']
# [b'I' b'will' b'be' b'testing' b'unknown' b'unknown']
# [b'The' b'current' b'tensorflow' b'data' b'unknown' b'unknown']]
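To reproduce the two-batch output shown in the question, the same call with a batch size of 2 (everything else unchanged) should work as well:
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
# batch size 2: the first batch holds two padded lines, the second holds the remaining line
dataset = dataset.padded_batch(2, padded_shapes=[None], padding_values="unknown")
for x in dataset:
    print(x.numpy())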
I have the following code to average embeddings for a list of item IDs.
(The embedding is trained on review_meta_id_input, and used as a lookup for priors_input and for getting the average embedding.)
review_meta_id_input = tf.keras.layers.Input(shape=(1,), dtype='int32', name='review_meta_id')
priors_input = tf.keras.layers.Input(shape=(None,), dtype='int32', name='priors') # array of ids
item_embedding_layer = tf.keras.layers.Embedding(
input_dim=100, # max number
output_dim=self.item_embedding_size,
name='item')
review_meta_id_embedding = item_embedding_layer(review_meta_id_input)
selected = tf.nn.embedding_lookup(review_meta_id_embedding, priors_input)
non_zero_count = tf.cast(tf.math.count_nonzero(priors_input, axis=1), tf.float32)
embedding_sum = tf.reduce_sum(selected, axis=1)
item_average = tf.math.divide(embedding_sum, non_zero_count)
I also have some feature columns, such as the following.
(I just thought feature_column looked cool, but there are not many documents to refer to.)
kid_youngest_month = feature_column.numeric_column("kid_youngest_month")
kid_age_youngest_buckets = feature_column.bucketized_column(kid_youngest_month, boundaries=[12, 24, 36, 72, 96])
I'd like to define [review_meta_id_input, priors_input, (tensors from feature_columns)] as the input to a Keras Model.
something like:
inputs = [review_meta_id_input, priors_input] + feature_layer
model = tf.keras.models.Model(inputs=inputs, outputs=o)
In order to get tensors from feature columns, the closest lead I have now is
fc_to_tensor = {fc: input_layer(features, [fc]) for fc in feature_columns}
from https://github.com/tensorflow/tensorflow/issues/17170
However, I'm not sure what features is in that code.
There's no clear example at https://www.tensorflow.org/api_docs/python/tf/feature_column/input_layer either.
How should I construct the features variable for fc_to_tensor?
Or is there a way to use keras.layers.Input and feature_column at the same time?
Or is there an alternative to tf.feature_column for doing the bucketing as above? Then I'll just drop feature_column for now.
The behavior you desire can be achieved through the following steps.
This works in TF 2.0.0-beta1, but may be changed or even simplified in future releases.
Please check out the issue in the TensorFlow GitHub repository, Unable to use FeatureColumn with Keras Functional API #27416. There you will find a more general example and useful comments about tf.feature_column and the Keras Functional API.
Meanwhile, based on the code in your question, the input tensor for the feature_column can be obtained like this:
# This is how you have defined the feature column
kid_youngest_month = feature_column.numeric_column("kid_youngest_month")
kid_age_youngest_buckets = feature_column.bucketized_column(kid_youngest_month, boundaries=[12, 24, 36, 72, 96])
# Then define layer
feature_layer = tf.keras.layers.DenseFeatures(kid_age_youngest_buckets)
# The inputs for the DenseFeatures layer should be defined for each original feature column as a dictionary, where
# keys are the names of the feature columns
# values are tf.keras.Input with shape=(1,), name='name_of_feature_column', and dtype matching the actual type of the original column
feature_layer_inputs = {}
feature_layer_inputs['kid_youngest_month'] = tf.keras.Input(shape=(1,), name='kid_youngest_month', dtype=tf.int8)
# Then you can collect the inputs of other layers and feature_layer_inputs into one flat list
inputs = [review_meta_id_input, priors_input] + [v for v in feature_layer_inputs.values()]
# Then define outputs of this DenseFeature layer
feature_layer_outputs = feature_layer(feature_layer_inputs)
# And pass them into other layer like any other
x = tf.keras.layers.Dense(256, activation='relu')(feature_layer_outputs)
# Or maybe concatenate them with the outputs from your other layers
combined = tf.keras.layers.concatenate([x, feature_layer_outputs])
# And you will probably finish with a last output layer, maybe like this for classification
o = tf.keras.layers.Dense(classes_number, activation='softmax', name='sequential_output')(combined)
# So you pass to the model (s_inputs being your other Input layers, e.g. review_meta_id_input and priors_input):
model_combined = tf.keras.models.Model(inputs=[*s_inputs, *feature_layer_inputs.values()], outputs=o)
Also note: in the model's fit() method you should specify which data should be used for each input.
One way: if you use tf.data.Dataset, make sure that you have used the same names for the features in the Dataset and for the keys in the feature_layer_inputs dictionary.
The other way is to use explicit notation, like:
# keys match the Input/output layer names defined above; target_data is your array of labels
model.fit({'review_meta_id': review_meta_id_data, 'priors': priors_data, 'kid_youngest_month': kid_youngest_month_data},
          {'sequential_output': target_data},
          ...
          )
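For the tf.data.Dataset route mentioned above, a minimal sketch could look like the following (the *_data arrays and labels are hypothetical placeholders, priors_data is assumed to be already padded to a fixed length, and the dictionary keys must match the Input names used above):
# hypothetical placeholder arrays; the dict keys must match the Input layer names
train_ds = tf.data.Dataset.from_tensor_slices((
    {
        'review_meta_id': review_meta_id_data,
        'priors': priors_data,
        'kid_youngest_month': kid_youngest_month_data,
    },
    labels,
)).batch(32)
model_combined.fit(train_ds, epochs=5)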
I'm trying to run https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification_character_cnn.py for learning, but I get an error message:
File "C:\Users\natlun\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py", line 72, in load_csv_without_header
data = np.array(data)
MemoryError
I use a CPU installation of TensorFlow and Python 3.5. Any ideas on how to solve the problem? Other scripts using a CSV file for input work fine.
I was having the same issue. After many hours of reading and googling (and seeing your unanswered question), and just comparing the example with other examples that do run, I noticed that
dbpedia = tf.contrib.learn.datasets.load_dataset(
'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data, size='large')
should just be
dbpedia = tf.contrib.learn.datasets.load_dataset(
'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
Based on what I've read about numpy, I'd bet the size='large' parameter causes an over-allocation in a numpy array (which throws the memory error).
Or, when you don't set that parameter, perhaps the input data is truncated.
Or some other thing. Anyway, I hope this helps others attempting to run this useful example!
--- Update ---
Without "size='large'" the load_dataset functions appears to create smaller training and test data sets (like 1/1000 the size).
After playing around with the example I realized I could manually load and use the whole data set without getting the memory error (assume it is saving the whole data set as it appears).
# Prepare training and testing data
##This was the provided method for setting up the data.
# dbpedia = tf.contrib.learn.datasets.load_dataset(
# 'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
# x_trainz = pandas.DataFrame(dbpedia.train.data)[1]
# y_trainz = pandas.Series(dbpedia.train.target)
# x_testz = pandas.DataFrame(dbpedia.test.data)[1]
# y_testz = pandas.Series(dbpedia.test.target)
##And this is my replacement.
import csv  # needed for the manual CSV reading below

x_train = []
y_train = []
x_test = []
y_test = []
with open("dbpedia_data/dbpedia_csv/train.csv", encoding='utf-8') as filex:
reader = csv.reader(filex)
for row in reader:
x_train.append(row[2])
y_train.append(int(row[0]))
with open("dbpedia_data/dbpedia_csv/test.csv", encoding='utf-8') as filex:
reader = csv.reader(filex)
for row in reader:
x_test.append(row[2])
y_test.append(int(row[0]))
x_train = pandas.Series(x_train)
y_train = pandas.Series(y_train)
x_test = pandas.Series(x_test)
y_test = pandas.Series(y_test)
The example now seems to be evaluating the whole training data set. But the original code will probably need to be run once to get/put the data in the correct sub-folders. Also, even while evaluating the whole data set, little memory is used (just a few hundred MB), which makes me think that the load_dataset function is broken in some way.
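As a side note on the manual loading above, the same Series can also be built with pandas.read_csv (a sketch assuming the same CSV layout as the csv.reader loop: the label in column 0, the text in column 2, and no header row):
# a sketch assuming the same CSV layout: column 0 = label, column 2 = text, no header
import pandas
train_df = pandas.read_csv("dbpedia_data/dbpedia_csv/train.csv", header=None, encoding='utf-8')
test_df = pandas.read_csv("dbpedia_data/dbpedia_csv/test.csv", header=None, encoding='utf-8')
x_train, y_train = train_df[2], train_df[0].astype(int)
x_test, y_test = test_df[2], test_df[0].astype(int)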
I'm trying to visualize a tensor summary in tensorboard. However I can't see the tensor summary at all in the board. Here is my code:
out = tf.strided_slice(logits, begin=[self.args.uttWindowSize-1, 0], end=[-self.args.uttWindowSize+1, self.args.numClasses],
strides=[1, 1], name='softmax_truncated')
tf.summary.tensor_summary('softmax_input', out)
where out is a multi-dimensional tensor. I guess there must be something wrong with my code. Probably I used the tensor_summary function incorrectly.
What you are doing is creating a summary op, but you don't invoke it and don't write the summary (see the documentation).
To actually create a summary you need to do the following:
# Create a summary operation
summary_op = tf.summary.tensor_summary('softmax_input', out)
# Create the summary
summary_str = sess.run(summary_op)
# Create a summary writer
writer = tf.train.SummaryWriter(...)  # tf.summary.FileWriter in later TF 1.x versions
# Write the summary
writer.add_summary(summary_str)
Explicitly writing a summary (last two lines) is only necessary if you don't have a higher level helper like a Supervisor. Otherwise you invoke
sv.summary_computed(sess, summary_str)
and the Supervisor will handle it.
For more info, also see:
How to manually create a tf.Summary()
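For reference, manually creating a tf.Summary (as in the linked question) comes down to building the proto directly; a minimal TF 1.x-style sketch, reusing the writer from above:
# build a Summary proto by hand and write it with an existing SummaryWriter/FileWriter
summary = tf.Summary(value=[tf.Summary.Value(tag='my_scalar', simple_value=0.5)])
writer.add_summary(summary, global_step=0)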
Hopefully this is a workaround which achieves what you want.
If you wish to view the tensor values, you can convert them using as_string, then use summary.text. The values will appear in the TensorBoard text tab.
I have not tried this with 3-D tensors, but feel free to slice according to your needs.
Here is a code snippet, which also includes a tf.Print call to get console output.
predictions = tf.argmax(reshaped_logits, 1)
txtPredictions = tf.Print(tf.as_string(predictions),[tf.as_string(predictions)], message='predictions', name='txtPredictions')
txtPredictions_op = tf.summary.text('predictions', txtPredictions)
Not sure whether this is obvious, but you could use something like
def make_tensor_summary(tensor, name='defaultTensorName'):
    # emit one scalar summary per element of a 2-D tensor (requires a statically known shape)
    for i in range(tensor.get_shape()[0]):
        for j in range(tensor.get_shape()[1]):
            tf.summary.scalar(name + '_' + str(i) + '_' + str(j), tensor[i, j])
in case you know it is a 'matrix-shaped' Tensor in advance.
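A hypothetical usage with the out tensor from the question (assuming its shape is known statically and it is 2-D) might then be:
# hypothetical usage: one scalar summary per element of the 2-D tensor `out`
make_tensor_summary(out, name='softmax_input')
merged = tf.summary.merge_all()  # run and write this merged summary as usual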