Keras Sequential with multiple inputs - tensorflow

Given three arrays, the network should learn what links the data in the 1st, 2nd, and 3rd array.
In particular:
1st array contains integer numbers (e.g. 2, 3, 5, 6, 7)
2nd array contains integer numbers (e.g. 3, 2, 4, 6, 2)
3rd array contains integer numbers that are the result of an operation applied to the data in the 1st and 2nd arrays (e.g. 6, 6, 20, 36, 14).
As you can see from the example data above, the operation is a multiplication, so the network should learn this and give:
model.predict(11,2) = 22.
Here's the code I've used:
import logging
import numpy as np
import tensorflow as tf
primo = np.array([2, 3, 5, 6, 7])
secondo = np.array([3, 2, 4, 6, 2])
risu = np.array([6, 6, 20, 36, 14])
l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
model = tf.keras.Sequential([l0])
input1 = tf.keras.layers.Input(shape=(1, ), name="Pri")
input2 = tf.keras.layers.Input(shape=(1, ), name="Sec")
merged = tf.keras.layers.Concatenate(axis=1)([input1, input2])
dense1 = tf.keras.layers.Dense(
    2,
    input_dim=2,
    activation=tf.keras.activations.sigmoid,
    use_bias=True)(merged)
output = tf.keras.layers.Dense(
    1,
    activation=tf.keras.activations.relu,
    use_bias=True)(dense1)
model = tf.keras.models.Model([input1, input2], output)
model.compile(
    loss="mean_squared_error",
    optimizer=tf.keras.optimizers.Adam(0.1))
model.fit([primo, secondo], risu, epochs=500, verbose = False, batch_size=16)
print(model.predict(11, 2))
My questions are:
Is it correct to concatenate the 2 inputs as I did? I don't understand whether, concatenated this way, the network knows that input1 and input2 are 2 different pieces of data.
I can't get model.predict() to work; every attempt results in an error.

Your model has two inputs, each with shape (None,1), so model.predict() expects a list of two arrays, each with a leading batch dimension. You can add that dimension with np.expand_dims:
print(model.predict([np.expand_dims(np.array(11), 0), np.expand_dims(np.array(2), 0)]))
Output:
[[20.316557]]
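To predict for several pairs at once, each input can be shaped as one column with one row per sample. The following is a minimal sketch using NumPy only; as_column is a hypothetical helper name, not part of Keras, and the commented-out predict call assumes the model defined in the question.

```python
import numpy as np

def as_column(values):
    """Return values as a float32 array of shape (n, 1): one row per sample."""
    return np.asarray(values, dtype="float32").reshape(-1, 1)

# A single pair (11, 2): each input gets shape (1, 1).
pri = as_column(11)
sec = as_column(2)
print(pri.shape, sec.shape)

# Three pairs at once: each input gets shape (3, 1).
pri_batch = as_column([11, 4, 7])
sec_batch = as_column([2, 5, 3])
print(pri_batch.shape, sec_batch.shape)

# With the model from the question (assumption, not run here):
# model.predict([pri_batch, sec_batch])
```

The inputs are passed as a list in the same order as in Model([input1, input2], output).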

Related

Local maximums of sub-tensors by index tensor

I have a tensor x of shape (1,n), and another index tensor d of shape (1,k). I’m trying to find the maximums of k sub-tensors
x[0:d[0]], x[d[0]:d[1]], x[d[1]:d[2]], ..., x[d[-2]: d[-1]]
So the output is a tensor of shape (1,k) with k local maximums. I can implement a for loop, but that’s too slow. Can I do it in parallel in PyTorch (or Numpy)?
I found the answer thanks to user7138814. There is a SegmentCSR function in torch_scatter that does the job:
import torch
from torch_scatter import segment_csr
src = torch.randn(10, 6, 64)
indptr = torch.tensor([0, 2, 5, 6])
indptr = indptr.view(1, -1) # Broadcasting in the first and last dim.
out = segment_csr(src, indptr, reduce="max") # "max" gives the per-segment maximums; "sum", "mean" and "min" are also supported
print(out.size())
Output:
torch.Size([10, 3, 64])
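Since the question also allows NumPy, here is a minimal sketch of the same idea with np.maximum.reduceat. It assumes a 1-D x (use x[0] for a tensor of shape (1,n)) and a strictly increasing d with d[0] > 0; segment_max is a hypothetical helper name.

```python
import numpy as np

def segment_max(x, d):
    """Maximum of x[0:d[0]], x[d[0]:d[1]], ..., x[d[-2]:d[-1]].

    Assumes x is 1-D and d is strictly increasing with d[0] > 0.
    """
    x = np.asarray(x)
    d = np.asarray(d)
    # reduceat takes segment start positions; the last segment runs to the
    # end of the array, so x is truncated at d[-1] first.
    starts = np.concatenate(([0], d[:-1]))
    return np.maximum.reduceat(x[: d[-1]], starts)

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])
d = np.array([2, 5, 8])
print(segment_max(x, d))
```

The strictly increasing assumption matters because reduceat does not return an empty-segment result when two consecutive start positions are equal; it returns the single element at that position instead.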

What is the actual use of num_words parameter in keras Tokenizer? How much overall does it affect the accuracy of my model

In the line tokenizer=Tokenizer(num_words=, oov_token= '<OOV>'), what does the num_words parameter actually do, and what should be taken into consideration when choosing its value? What is the effect of assigning it a very high or a very low value?
It is basically the size of the vocabulary you want your model to use, based on the data you have. The simple example below explains it in detail.
Without num_words:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'\n')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 4, 2, 1, 6, 7, 2, 8]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
Here tokenizer.fit_on_texts(fit_text) builds the word_index from the words present in fit_text. Indexing starts with oov_token, which gets index 1, followed by the remaining words in order of decreasing frequency according to word_counts.
If you don't set num_words, all the unique words of fit_text are kept in word_index and used to represent the sequences.
If num_words is set, tokenizer.texts_to_sequences() keeps only the indices below num_words, that is, the first num_words - 1 entries of word_index (including oov_token). Any word with an index of num_words or higher is replaced by oov_token. Note that word_index itself still lists every word.
An example follows.
With use of num_words:
tokenizer = Tokenizer(num_words=4,oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'\n')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 1, 2, 1, 1, 1, 2, 1]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
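The indexing rule behind both outputs can be sketched in plain Python. This is a simplified re-implementation for illustration only, not the Keras code: build_word_index and to_sequences are hypothetical names, and the sketch only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Index 1 is the OOV token; the other words follow by decreasing count."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # most_common() keeps first-seen order for words with equal counts.
    words = [oov_token] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(words, start=1)}

def to_sequences(texts, word_index, num_words=None, oov_token="<OOV>"):
    """Map words to indices; unknown words and indices >= num_words become OOV."""
    oov = word_index[oov_token]
    sequences = []
    for t in texts:
        seq = []
        for w in t.lower().split():
            i = word_index.get(w)
            if i is None or (num_words is not None and i >= num_words):
                seq.append(oov)
            else:
                seq.append(i)
        sequences.append(seq)
    return sequences

fit_text = ["Example with the first sentence of the tokenizer"]
test_text = ["Example with the test sentence of the tokenizer"]
word_index = build_word_index(fit_text)
print(to_sequences(test_text, word_index))
print(to_sequences(test_text, word_index, num_words=4))
```

With num_words=4 only the indices 1, 2 and 3 survive, so besides the OOV token just two real words ('the' and 'example') keep their own index.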
Regarding accuracy, the model generally does better when the words in your data are represented by their own indices in the sequences rather than by oov_token, so a very low num_words discards information.
With large datasets it is usually better to set num_words than to burden the model with the full vocabulary, most of which is rare words.
It's good practice to preprocess first (stopword removal, lemmatization/stemming) to remove unnecessary words, and then fit the Tokenizer on the preprocessed data; that makes it easier to choose a good value for num_words.

Given a dataframe with N elements, how can I make m smaller dataframes such that the size of each is some fraction of N?

I have a dataset (call it Data) with ~25000 instances that I want to split into a train set, development set, and test set. I want it to be such that,
train set = 0.7*Data
development set = 0.1*Data
test set = 0.2*Data
When making the split, I want the instances to be randomly sampled and NOT REPEATED between the 3 sets. This is why I can't use something like,
train_set = Data.sample(frac=0.7)
dev_set = Data.sample(frac=0.1)
test_set = Data.sample(frac=0.2)
where instances from Data may be repeated across the sets. Is there a built-in function that I am missing, or could you help me write a function for doing this?
I will use an array to demonstrate an example of what I am looking for.
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
splits = [0.7, 0.1, 0.2]
def splitFunction(data, array_of_splits):
    pass # I need your help here
splits = splitFunction(A, splits)
#output
[[1, 3, 8, 9, 6, 7, 2], [4], [5, 0]]
Thank you in advance!
from random import shuffle
def splitFunction(data, array_of_splits):
    data_copy = data[:] # copy the data so the original list is not modified
    shuffle(data_copy) # randomize the order in place
    splits = []
    startIndex = 0
    for val in array_of_splits:
        size = int(round(val * len(data))) # slice bounds must be integers
        split = data_copy[startIndex:startIndex + size]
        startIndex = startIndex + size
        splits.append(split)
    return splits
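For a DataFrame, the same idea can be applied to row positions instead of the rows themselves. The sketch below uses only the standard library and assumes the fractions sum to 1; split_indices and its seed parameter are hypothetical names. Rounding the cumulative boundaries, rather than each size separately, is one way to make sure every row lands in exactly one split.

```python
import random

def split_indices(n, fractions, seed=None):
    """Shuffle the positions 0..n-1 and cut them into disjoint groups.

    Assumes the fractions sum to 1.
    """
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    # Cumulative cut points; rounding the running total avoids drift.
    bounds = [0]
    total = 0.0
    for f in fractions:
        total += f
        bounds.append(min(n, round(total * n)))
    bounds[-1] = n  # the last group absorbs any rounding remainder
    return [order[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

train_idx, dev_idx, test_idx = split_indices(25000, [0.7, 0.1, 0.2], seed=0)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With pandas, the position lists could then be passed to Data.iloc, for example train_set = Data.iloc[train_idx], and likewise for the other two sets.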

Tensorflow LinearRegressor evaluate method hangs

Consider the following toy TensorFlow code. The fit method of LinearRegressor works properly and finds the right coefficients (i.e. y = x1 + x2), but evaluate (see the last print statement) hangs. Any idea what's wrong?
import tensorflow as tf
x1 = [1, 3, 4, 5, 1, 6, -1, -3]
x2 = [5, 2, 1, 5, 0, 2, 4, 2]
y = [6, 5, 5, 10, 1, 8, 3, -1]
def train_fn():
    return {'x1': tf.constant(x1), 'x2':tf.constant(x2)}, tf.constant(y)
features = [tf.contrib.layers.real_valued_column('x1', dimension=1),
            tf.contrib.layers.real_valued_column('x2', dimension=1)]
estimator = tf.contrib.learn.LinearRegressor(feature_columns=features)
estimator.fit(input_fn=train_fn, steps=10000)
for vn in estimator.get_variable_names():
    print('variable name', vn, estimator.get_variable_value(vn))
print(estimator.evaluate(input_fn=train_fn))
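estimator.evaluate() takes a steps parameter that defaults to None, which means "evaluate until the input is exhausted". Because train_fn returns constants and never signals the end of the input, that is effectively "infinity", so evaluation never ends.
To make it end, pass steps=1 explicitly: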
estimator.evaluate(input_fn=your_input_fn, steps=1)

How to sort a multi-dimensional tensor using the returned indices of tf.nn.top_k?

I have two multi-dimensional tensors, a and b, and I want to sort them by the values of a.
I found that tf.nn.top_k can sort a tensor and return the indices used to sort the input. How can I use the indices returned by tf.nn.top_k(a, k=2) to sort b?
For example,
import tensorflow as tf
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
# How to sort b into
# sorted_b[0, 0, 0, :] = b[0, 0, indices[0, 0, 0], :]
# sorted_b[0, 0, 1, :] = b[0, 0, indices[0, 0, 1], :]
# sorted_b[0, 1, 0, :] = b[0, 1, indices[0, 1, 0], :]
# ...
Update
Combining tf.gather_nd with tf.meshgrid can be one solution. For example, the following code is tested on python 3.5 with tensorflow 1.0.0-rc0:
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
shape_a = tf.shape(a)
auxiliary_indices = tf.meshgrid(*[tf.range(d) for d in (tf.unstack(shape_a[:(a.get_shape().ndims - 1)]) + [k])], indexing='ij')
sorted_b = tf.gather_nd(b, tf.stack(auxiliary_indices[:-1] + [indices], axis=-1))
However, I wonder if there is a solution which is more readable and doesn't need to create auxiliary_indices above.
Your code has a problem:
b = tf.reshape(tf.range(60), (2, 5, 3, 7))
TensorFlow cannot reshape a tensor with 60 elements to shape [2,5,3,7] (210 elements).
Also, you can't sort a rank-4 tensor (b) using the indices of a rank-3 tensor.
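For comparison, the gather described in the question's Update (sorted_b[i, j, m, :] = b[i, j, indices[i, j, m], :]) can be sketched in NumPy without auxiliary index grids. This is an illustration using np.take_along_axis, not a TensorFlow solution; here np.argsort stands in for tf.nn.top_k.

```python
import numpy as np

a = np.arange(30).reshape(2, 5, 3)
b = np.arange(210).reshape(2, 5, 3, 7)
k = 2

# Indices of the k largest entries along the last axis of a, largest first
# (the role played by the indices returned by tf.nn.top_k).
indices = np.argsort(-a, axis=-1)[..., :k]           # shape (2, 5, 2)
sorted_a = np.take_along_axis(a, indices, axis=-1)   # shape (2, 5, 2)

# A trailing axis lets the indices broadcast over the last dimension of b.
sorted_b = np.take_along_axis(b, indices[..., None], axis=2)
print(sorted_b.shape)
```

In TensorFlow 2.x, tf.gather(b, indices, axis=2, batch_dims=2) appears to be the corresponding call, though that is not tested here.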