TensorFlow dataset with multi-dimensional Tensors from a CSV file

Is there a way, and if so what is it, to load a TensorFlow dataset with a multi-dimensional feature Tensor from a CSV (or other format) input file?
For example, my CSV input looks like the following:
f1, f2, f3, label
0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1
0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0
0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1
I'd like to load a dataset from such a file, e.g.:
import tensorflow as tf

frames_csv_ds = tf.data.experimental.make_csv_dataset(
    'input.csv',
    header=False,
    column_names=['f1', 'f2', 'f3', 'label'],
    batch_size=5,
    label_name='label',
    num_epochs=1,
    ignore_errors=True,
)

for batch, label in frames_csv_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value}")
    print()
    print(f"{'label':20s}: {label}")
To get the batch as:
f1 : [0.1 0.2 0.3 ]
f2 : [0.2 0.3 0.4 ]
f3 : [ [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]], [[0.2, 0.3, 0.4], [1.2, 1.3, 1.4]], [[0.3, 0.4, 0.5], [1.3, 1.4, 1.5]] ]
label : [1, 0, 1]
The snippet above is incomplete and doesn't work. Is there a way to get the dataset in the illustrated form? If yes, can this be done for arrays whose dimensions vary across the dataset?

Well, you can do this by writing a custom parsing function with a few TensorFlow ops:
import tensorflow as tf

file_path = "data.csv"
dataset = tf.data.TextLineDataset(file_path).skip(1)  # skip the header row

def parse_csv_line(line):
    # Split the line into a list of string fields
    fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
    f1 = tf.strings.to_number(fields[0], tf.float32)
    f2 = tf.strings.to_number(fields[1], tf.float32)
    # f3 is a ';'-separated list of floats
    f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
    label = tf.strings.to_number(fields[3], tf.int32)
    return {"f1": f1, "f2": f2, "f3": f3, "label": label}

dataset = dataset.map(parse_csv_line).batch(5)
next(iter(dataset.take(1)))
{'f1': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.1, 0.2, 0.3], dtype=float32)>,
'f2': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.2, 0.3, 0.4], dtype=float32)>,
'f3': <tf.Tensor: shape=(3, 6), dtype=float32, numpy=
array([[0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
[0.2, 0.3, 0.4, 1.2, 1.3, 1.4],
[0.3, 0.4, 0.5, 1.3, 1.4, 1.5]], dtype=float32)>,
'label': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 0, 1], dtype=int32)>}
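Note that this gives f3 as a flat vector of 6 values per example, not the 2×3 matrix shown in the question. If you want the matrix form, you can reshape inside the parser; a minimal sketch, assuming every row carries exactly 6 values (the (2, 3) target shape is inferred from the example data, not confirmed by the question):

def parse_csv_line_2d(line):
    fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
    f1 = tf.strings.to_number(fields[0], tf.float32)
    f2 = tf.strings.to_number(fields[1], tf.float32)
    f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
    f3 = tf.reshape(f3, (2, 3))  # assumed fixed per-row shape from the example
    label = tf.strings.to_number(fields[3], tf.int32)
    return {"f1": f1, "f2": f2, "f3": f3, "label": label}

# For rows whose f3 length varies, a fixed reshape won't work; batching
# into RaggedTensors avoids the fixed-shape requirement:
# dataset = tf.data.TextLineDataset(file_path).skip(1).map(parse_csv_line)
# dataset = dataset.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=5))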

Related

Tensorflow: Reshape a tensor according to a boolean mask

I have a 1D tensor of values:
a = tf.constant([0.1, 0.2, 0.3, 0.4])
and a nD boolean mask:
b = tf.constant([[1, 1, 0], [0, 1, 1]])
The total number of 1's in b matches the length of a.
How can I get [[0.1, 0.2, 0.0], [0.0, 0.3, 0.4]] from a and b?
import tensorflow as tf
a = tf.constant([0.1, 0.2, 0.3, 0.4])
b = tf.constant([[1, 1, 0], [0, 1, 1]])
# reshape b to a 1D vector
b_res = tf.reshape(b, [-1])
# Get the indices to gather using cumsum
b_cum = tf.cumsum(b_res) - 1
# Gather the elements, multiply by b_res to zero out the unwanted values and reshape back
c = tf.reshape(tf.gather(a, b_cum) * tf.cast(b_res, 'float32'), [-1, 3])
print(c)
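An alternative sketch that avoids the cumsum trick: tf.where lists the positions of the 1s in row-major order, which matches the order of a, so the values can be scattered straight into place (this assumes, as stated, that the number of 1s equals the length of a):

import tensorflow as tf

a = tf.constant([0.1, 0.2, 0.3, 0.4])
b = tf.constant([[1, 1, 0], [0, 1, 1]])

# Positions of the 1s, in row-major order
indices = tf.where(tf.cast(b, tf.bool))
# Scatter a's values into those positions; everything else stays 0
c = tf.scatter_nd(indices, a, tf.shape(b, out_type=tf.int64))
print(c)  # -> [[0.1 0.2 0. ], [0.  0.3 0.4]]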

TensorFlow: how does one get the output the same size as the input tensor after segment sum?

I'm using the tf.unsorted_segment_sum method of TensorFlow and it works.
For example:
tf.unsorted_segment_sum(tf.constant([0.2, 0.1, 0.5, 0.7, 0.8]),
                        tf.constant([0, 0, 1, 2, 2]), 3)
Gives the right result:
array([0.3, 0.5, 1.5], dtype=float32)
I want to get:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)
I've solved it.
data = tf.constant([0.2, 0.1, 0.5, 0.7, 0.8])
gr_idx = tf.constant([0, 0, 1, 2, 2])
# Unique group ids, inverse indices, and counts per group
y, idx, count = tf.unique_with_counts(gr_idx)
# Per-group sums, broadcast back to each member via gather
group_sum = tf.segment_sum(data, gr_idx)
result = tf.gather(group_sum, idx)
Answer:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)
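In TF 2.x the unique_with_counts step can be skipped entirely, since the segment ids are themselves valid indices into the per-segment sums; a minimal sketch:

import tensorflow as tf

data = tf.constant([0.2, 0.1, 0.5, 0.7, 0.8])
gr_idx = tf.constant([0, 0, 1, 2, 2])

# Sum within each segment, then broadcast each sum back to its members
group_sum = tf.math.segment_sum(data, gr_idx)  # [0.3, 0.5, 1.5]
result = tf.gather(group_sum, gr_idx)          # [0.3, 0.3, 0.5, 1.5, 1.5]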

Add an extra column to ndarray in python

I have an ndarray as follows.
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
I have a position ndarray as follows.
position = [10, 20, 30]
Now I want to add the position value at the beginning of the feature_matrix as follows.
[[10, 0.1, 0.3], [20, 0.7, 0.8], [30, 0.8, 0.8]]
I tried the answers in this: How to add an extra column to a numpy array
E.g.,
feature_matrix = np.concatenate((feature_matrix, position), axis=1)
However, I get an error saying:
ValueError: all the input arrays must have same number of dimensions
Please help me resolve this problem.
This solved my problem. I used np.column_stack.
import numpy as np

feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
position = [10, 20, 30]
feature_matrix = np.column_stack((position, feature_matrix))
The problem is the shape of the position array relative to feature_matrix: np.concatenate along axis=1 requires both arrays to be 2-D, so position must first be reshaped into a column.
>>> feature_matrix
array([[ 0.1, 0.3],
[ 0.7, 0.8],
[ 0.8, 0.8]])
>>> position
array([10, 20, 30])
>>> position.reshape((3,1))
array([[10],
[20],
[30]])
The solution is (with np.concatenate):
>>> np.concatenate((position.reshape((3,1)), feature_matrix), axis=1)
array([[ 10. , 0.1, 0.3],
[ 20. , 0.7, 0.8],
[ 30. , 0.8, 0.8]])
But np.column_stack is clearly great in your case!
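For reference, np.insert offers another one-liner that inserts position as the first column without any manual reshaping:

import numpy as np

feature_matrix = np.array([[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]])
position = np.array([10, 20, 30])

# Insert `position` as column 0 along axis=1
out = np.insert(feature_matrix, 0, position, axis=1)
# [[10.   0.1  0.3]
#  [20.   0.7  0.8]
#  [30.   0.8  0.8]]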

Argmax on a tensor and ceiling in Tensorflow

Suppose I have a tensor in TensorFlow whose values are like:
A = [[0.7, 0.2, 0.1],[0.1, 0.4, 0.5]]
How can I change this tensor into the following:
B = [[1, 0, 0],[0, 0, 1]]
In other words I want to just keep the maximum and replace it with 1.
Any help would be appreciated.
I think that you can solve it with a one-liner:
import tensorflow as tf
import numpy as np
x_data = [[0.7, 0.2, 0.1],[0.1, 0.4, 0.5]]
# I am using hard-coded dimensions for simplicity
x = tf.placeholder(dtype=tf.float32, name="x", shape=(2,3))
session = tf.InteractiveSession()
session.run(tf.one_hot(tf.argmax(x, 1), 3), {x: x_data})
The result is the one that you expect:
Out[6]:
array([[ 1., 0., 0.],
[ 0., 0., 1.]], dtype=float32)
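The code above is TF 1.x style (placeholders and an interactive session). With TF 2.x eager execution, the same argmax-plus-one-hot idea collapses to a sketch like this, assuming the one-hot depth equals the number of columns:

import tensorflow as tf

A = tf.constant([[0.7, 0.2, 0.1], [0.1, 0.4, 0.5]])
# Row-wise argmax, expanded back into a one-hot matrix
B = tf.one_hot(tf.argmax(A, axis=1), depth=A.shape[1])
print(B)  # [[1. 0. 0.], [0. 0. 1.]]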

numpy - multiply each element in array by a scaling factor

I have a numpy array of values and a list of scaling factors. I want to scale each value in the array by the corresponding factor, down each column:
values = [[0, 1, 2, 3],
          [1, 1, 4, 3],
          [2, 1, 6, 3],
          [3, 1, 8, 3]]
ls_alloc = [0.1, 0.4, 0.3, 0.2]
# convert values into numpy array
import numpy as np
na_values = np.array(values, dtype=float)
Edit: To clarify:
na_values is a 2-dimensional array of stock cumulative returns (ie: normalised to day 1), where each row represents a date and each column a stock. The data is returned as an array for each date.
I want to now scale each stock's cumulative return by its allocation in the portfolio. So for each date (ie: each row of the ndarray values), apply the respective element from ls_alloc element-wise.
# scale each value by its allocation
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
This does what I want, but I can't help but feel there must be a way to have numpy do this for me automatically?
That is, I feel:
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
# display na_components
na_components
[array([ 0. ,  0.1,  0.2,  0.3]),
 array([ 0.4,  0.4,  0.4,  0.4]),
 array([ 0.6,  1.2,  1.8,  2.4]),
 array([ 0.6,  0.6,  0.6,  0.6])]
should be able to be expressed as something like:
tmp = np.multiply(na_values, ls_alloc)
# display tmp
tmp
array([[ 0. , 0.4, 0.6, 0.6],
[ 0.1, 0.4, 1.2, 0.6],
[ 0.2, 0.4, 1.8, 0.6],
[ 0.3, 0.4, 2.4, 0.6]])
Is there a numpy function which will achieve what I want elegantly and succinctly?
Edit:
I see that my first solution has transposed my data, so that I get back a list of ndarrays.
The next step that I perform with na_components is to calculate the total cumulative return for the portfolio by summing each individual component
na_pfo_cum_ret = np.sum(na_components, axis=0)
This works with the list of individual stock return ndarrays.
That order seems a little odd to me, but IIUC, all you need to do is to transpose the result of multiplying na_values by array(ls_alloc):
>>> v
array([[ 0., 1., 2., 3.],
[ 1., 1., 4., 3.],
[ 2., 1., 6., 3.],
[ 3., 1., 8., 3.]])
>>> a
array([ 0.1, 0.4, 0.3, 0.2])
>>> (v*a).T
array([[ 0. , 0.1, 0.2, 0.3],
[ 0.4, 0.4, 0.4, 0.4],
[ 0.6, 1.2, 1.8, 2.4],
[ 0.6, 0.6, 0.6, 0.6]])
It's not completely clear to me what you want to do, but the answer is probably in Broadcasting rules. I think you want:
values = np.array([[0, 1, 2, 3],
                   [1, 1, 4, 3],
                   [2, 1, 6, 3],
                   [3, 1, 8, 3]])
ls_alloc = np.array([0.1, 0.4, 0.3, 0.2])
and either:
na_components = values * ls_alloc
or:
na_components = values * ls_alloc[:,np.newaxis]
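Pulling the pieces together, a minimal sketch (using the portfolio-sum step mentioned in the question) showing that plain broadcasting plus a row-wise sum reproduces np.sum(na_components, axis=0):

import numpy as np

values = np.array([[0, 1, 2, 3],
                   [1, 1, 4, 3],
                   [2, 1, 6, 3],
                   [3, 1, 8, 3]], dtype=float)
ls_alloc = np.array([0.1, 0.4, 0.3, 0.2])

# Broadcasting scales each column (stock) by its allocation
na_components = values * ls_alloc        # shape (4, 4): dates x stocks

# Total portfolio cumulative return per date: sum across stocks
na_pfo_cum_ret = na_components.sum(axis=1)
print(na_pfo_cum_ret)                    # -> [1.6, 2.3, 3.0, 3.7]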