TensorFlow math operation reduce_sum

import tensorflow as tf

a = tf.constant([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
b = tf.constant([[5,4,3,2,1],[1,2,3,4,5],[1,2,3,4,5]])
product = tf.mul(a, b)
product_sum = tf.reduce_sum(tf.mul(a, b))
with tf.Session() as sess:
    print product.eval()
    print product_sum.eval()
The result is:
[[ 5  8  9  8  5]
 [ 1  4  9 16 25]
 [ 1  4  9 16 25]]
145
But this is not the answer I want.
The answer I want is
[5+8+9+8+5, 1+4+9+16+25, 1+4+9+16+25]
= [35, 55, 55]

As xxi mentioned in their comment, the correct solution here is to use the optional axis argument when calling tf.reduce_sum(). In your case, you want to sum each row across its columns, i.e. reduce along axis 1, so the following code will work:
product = tf.multiply(a, b)
product_sum = tf.reduce_sum(product, axis=1)
(Note also that as of TensorFlow 1.0, tf.mul() has been renamed to tf.multiply().)
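For reference, here is a minimal sketch of the same computation in TensorFlow 2.x eager mode (an assumption on my part, not the asker's TF 1.x session setup), showing that axis=1 produces the per-row sums:
import tensorflow as tf

a = tf.constant([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
b = tf.constant([[5, 4, 3, 2, 1], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])

# Reduce along axis=1 to sum each row of the element-wise product.
row_sums = tf.reduce_sum(tf.multiply(a, b), axis=1)
print(row_sums.numpy())  # [35 55 55]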

Related

Tensorflow dataset of sliding windows keeping track of index

I have a dataframe which contains time series data: for the sake of simplicity, let's say that the index is my "datetime", or simply the element that establishes the order of the data. Columns a and b are real numbers; I set them equal to the index just to illustrate my problem.
import pandas as pd
import numpy as np
import tensorflow as tf
data = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100)})
print(data)
Which outputs:
     a   b
0    0   0
1    1   1
2    2   2
3    3   3
4    4   4
..  ..  ..
95  95  95
96  96  96
97  97  97
98  98  98
99  99  99
Then, I proceed to create a dataset of sliding windows over the time series dataframe:
data = np.array(data, dtype=np.float32)
ds = tf.keras.utils.timeseries_dataset_from_array(data=data,
                                                  targets=None,
                                                  sequence_length=6,
                                                  sequence_stride=6,
                                                  sampling_rate=1,
                                                  shuffle=True,
                                                  batch_size=None,
                                                  seed=1)
for i in ds.take(3):
    print(i)
Which outputs:
tf.Tensor(
[[84. 84.]
[85. 85.]
[86. 86.]
[87. 87.]
[88. 88.]
[89. 89.]], shape=(6, 2), dtype=float32)
tf.Tensor(
[[30. 30.]
[31. 31.]
[32. 32.]
[33. 33.]
[34. 34.]
[35. 35.]], shape=(6, 2), dtype=float32)
tf.Tensor(
[[54. 54.]
[55. 55.]
[56. 56.]
[57. 57.]
[58. 58.]
[59. 59.]], shape=(6, 2), dtype=float32)
As you can see, each matrix is "datetime" ordered (sequence_length=6) and the matrices do not overlap (sequence_stride=6). I would like to keep track of the initial index. In other words, I want to be able to, say, extract the matrix with shape=(6, 2) that corresponds to the index values K:K+6. I know I could do this directly from the initial dataframe, but this is just a simplified version of a bigger problem: I am trying to replicate the Data windowing section of this TensorFlow tutorial so that I can plot exactly the date that I want, rather than random dates.
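One possible approach (a sketch of my own, not from the post or the tutorial): window the row positions with the same parameters and zip the two datasets together, so each window carries its starting index. Shuffling is turned off in the window builder and applied after zipping so the values and indices stay aligned:
import numpy as np
import tensorflow as tf

data = np.stack([np.arange(100), np.arange(100)], axis=1).astype(np.float32)
row_idx = np.arange(len(data))

kwargs = dict(targets=None, sequence_length=6, sequence_stride=6,
              sampling_rate=1, shuffle=False, batch_size=None)
ds_values = tf.keras.utils.timeseries_dataset_from_array(data, **kwargs)
ds_index = tf.keras.utils.timeseries_dataset_from_array(row_idx, **kwargs)

# Zip first, then shuffle, so each window stays paired with its start index.
ds = tf.data.Dataset.zip((ds_index, ds_values)).shuffle(100, seed=1)
for idx, window in ds.take(2):
    print("window starts at index", int(idx[0]))
    print(window.numpy())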

Stratified Sampling with different sizes

I am trying to create a function for stratified sampling which takes in a dataframe created using the faker module along with strata, sample size and a random seed. For the sample size, I want the number of samples in each strata to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn  # generating random numbers
from faker import Faker

fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": fake.random_number(1)} for x in range(100)])
# check for and remove duplicates from enum_area (should be unique)
# before any further analysis
mask = frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)
# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
# reset index to have them sequential after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index', axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data, strata, sample_size, seed=None):
    # for this part, we sample 5 enum areas in each strata/region
    # we groupby strata and use the transform method with 'count' parameter
    # to get strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map input sample size to each stratum
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get sample size per stratum, cast to int and reset
    # index
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select sample per stratum based on the sample size
    # for that stratum
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(smp_size, random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample
# pass in strata and sample size as a dict (key: stratum, value: size)
s_size = {1:7, 2:5, 3:5, 4:5, 5:5, 6:5, 7:5, 8:5, 9:8}
(stratified_custom(data=frame_fake, strata='region', sample_size=s_size, seed=99)
 .sort_values(by=['region','enum_area'], ascending=True))
I however receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
After much research, I stumbled upon this post https://stackoverflow.com/a/58794577/14198137 and implemented it in my code, so that the same function can sample not only with varying sample sizes but also with fixed ones. Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker

Faker.seed(99)
fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": rn.randint(1, 2)} for x in range(100)])
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
frame_fake = frame_fake.reset_index().drop('index', axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        # sample_size passed as a dict of {stratum: sample size}
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(smp_size[x.name], random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
    except:
        # sample_size passed as a single fixed integer
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(sample_size, random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
s_size = {1:3, 2:9, 3:5, 4:5, 5:5, 6:5, 7:5, 8:5, 9:8}
variablesize = (stratified_custom(data=frame_fake, strata='region', sample_size=s_size, seed=99)
                .sort_values(by=['region','enum_area'], ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake, strata='region', sample_size=3, seed=99)
             .sort_values(by=['region','enum_area'], ascending=True)).head()
fixedsize
The output of variable sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
0 2 65 10566 ... 10 9 0.9
15 2 22 25560 ... 10 9 0.9
The output of fixed sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
38 2 74 48408 ... 10 3 0.3
43 2 15 56365 ... 10 3 0.3
I was, however, wondering if there is a better way of achieving this?
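For what it's worth, here is a sketch of my own (the function name and structure are hypothetical, not from the post) showing how the dict/int branching could be handled without try/except, by normalising sample_size into a dict up front:
def stratified_sample(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # Accept either a single int or a {stratum: size} mapping.
    if not isinstance(sample_size, dict):
        sample_size = {k: sample_size for k in data[strata].unique()}
    data['strat_sample_size'] = data[strata].map(sample_size)
    out = (data.groupby(strata, group_keys=False)
               .apply(lambda g: g.sample(sample_size[g.name], random_state=seed)))
    out['inclusion_prob'] = out['strat_sample_size'] / out['strat_size']
    return out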

How to repeat tensor elements variable number of times in tensorflow

I have an input tensor which represents an alternation between item and item quantity:
[item0, qty0, item1, qty1, ...]
I would like to unfold this tensor so that each item is repeated its quantity number of times:
[[item0]*qty0, [item1]*qty1, ...]
Example:
[1000, 2, 3000, 5, ...]
[[1000,1000], [3000,3000,3000,3000,3000], ...]
This is in TensorFlow 1.x, by the way.
In TF 1.x:
import tensorflow as tf

inputs = tf.constant([10, 2, 20, 3, 30, 4])
x_unpacked = tf.unstack(tf.reshape(inputs, (-1, 2)))
tmp = []
for t in x_unpacked:
    tmp.append(tf.tile([t[0]], [t[1]]))
ans = tf.concat(tmp, axis=0)
with tf.Session() as sess:
    print(sess.run(ans))
    # [10 10 20 20 20 30 30 30 30]
In TF 2.x, it can be done in one line:
tf.concat([tf.tile([x[0]],[x[1]]) for x in tf.reshape(inputs, (-1,2))], axis=0)
# [10 10 20 20 20 30 30 30 30]
I started with the answer given by zihaozhihao, but since the input has shape (?,) the length is not known statically, and hence tf.unstack won't work.
However, one of the error messages recommended using map_fn and that seems to work:
x_unpacked = tf.reshape(input, (-1, 2))
tiled = tf.map_fn(lambda x: tf.tile([x[0]], [x[1]]), x_unpacked)
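As a side note (assuming a TensorFlow version where tf.repeat is available, roughly 2.1+, so not the asker's TF 1.x), the whole thing can also be vectorised without a Python loop or map_fn:
import tensorflow as tf

inputs = tf.constant([10, 2, 20, 3, 30, 4])
pairs = tf.reshape(inputs, (-1, 2))             # [[10, 2], [20, 3], [30, 4]]
repeated = tf.repeat(pairs[:, 0], pairs[:, 1])  # repeat each item by its qty
# [10 10 20 20 20 30 30 30 30]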

Rolling Second highest in a pandas dataframe

I am trying to find the highest and second-highest values in a rolling window.
I can get the highest using
df['B'] = df['A'].rolling(window=3).max()
But how do I get the second highest, so that df['C'] displays as below?
 A   B  C
 1
 6
 5   6  5
 4   6  5
12  12  5
Generic n-highest values in rolling/sliding windows
Here's one approach using np.lib.stride_tricks.as_strided to create sliding windows, which lets us pick the generic N-th highest value in each window -
import numpy as np

# https://stackoverflow.com/a/40085052/ #Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

# Return the N-th highest number in each rolling window of length W off array ar
def N_highest(ar, W, N=1):
    # ar : Input array
    # W  : Window length
    # N  : Get us the N-th highest in sliding windows
    A2D = strided_app(ar, W, 1)
    # argpartition places the index of the N-th largest element at position -N
    idx = np.argpartition(A2D, -N, axis=1)[:, -N]
    return A2D[np.arange(len(idx)), idx]
Sample runs -
In [634]: a = np.array([1,6,5,4,12]) # input array
In [635]: N_highest(a, W=3, N=1) # highest in W=3
Out[635]: array([ 6, 6, 12])
In [636]: N_highest(a, W=3, N=2) # second highest
Out[636]: array([5, 5, 5])
In [637]: N_highest(a, W=3, N=3) # third highest
Out[637]: array([1, 4, 4])
Another shorter way based on strides would be with direct sorting, like so -
np.sort(strided_app(ar, W, 1), axis=1)[:, -N]
Solving our case
Hence, to solve our case, we need to prepend NaNs for the first W-1 positions to the result of the above-mentioned function, like so -
W = 3
df['C'] = np.r_[[np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
Based on direct sorting, we would have -
df['C'] = np.r_[[np.nan]*(W-1), np.sort(strided_app(df.A.values, W, 1), axis=1)[:, -2]]
Sample run -
In [578]: df
Out[578]:
A
0 1
1 6
2 5
3 4
4 3 # <== Different from given sample, for variety
In [619]: W = 3
In [620]: df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
In [621]: df
Out[621]:
A C
0 1 NaN
1 6 NaN
2 5 5.0
3 4 5.0
4 3 4.0 # <== Second highest from the last group off : [5,4,3]
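As an aside (not part of the original answer), for small frames the same result can be obtained with plain pandas rolling.apply, trading speed for simplicity:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 5, 4, 12]})
# Sort each length-3 window and take the second-largest element.
df['C'] = df['A'].rolling(window=3).apply(lambda w: np.sort(w)[-2], raw=True)
print(df)
#     A    C
# 0   1  NaN
# 1   6  NaN
# 2   5  5.0
# 3   4  5.0
# 4  12  5.0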

Keras summation Layer acting weird, summing over training set

I am having trouble understanding the basic way Keras works. I am experimenting with a single summation layer, implemented as a Lambda layer using tensorflow as a backend:
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Lambda

test_model = Sequential()
test_model.add(Lambda(lambda x: K.sum(x, axis=0), input_shape=(2, 3)))
x = np.reshape(np.arange(12), (2, 2, 3))
test_model.predict(x)
This returns:
array([[ 6., 8., 10.],
[ 12., 14., 16.]], dtype=float32)
Which is very weird, as it sums over the first index, which to my understanding corresponds to the index of the training data. Also, if I change the axis to axis=1 then the sum is taken over the second coordinate, which is what I would expect to get for axis=0.
What is going on? Why does it seem like the axis chosen affects how the data is passed to the Lambda layer?
The input_shape is the shape of one sample of the batch.
It doesn't matter if you have 200 or 10000 samples in a batch, all the samples should be (2,3).
But the batch itself is what is passed along from one layer to another.
A batch contains "n" samples, each sample with the input_shape:
Batch shape then is: (n, 2, 3) -- n samples, each sample with input_shape = (2,3)
You don't include "n" in input_shape, because "n" will be determined when you call fit or another training command, via the batch_size. (In your example, n = 2.)
This is the original array:
[[[ 0 1 2]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]]
Sample 1 = [ 0 1 2], [ 3 4 5]
Sample 2 = [ 6 7 8], [ 9 10 11]
Summing on index 0 (the batch size dimension) will sum sample 1 with sample 2:
[ 6 8 10], [12 14 16]
Summing on index 1 will sum the first dimension of one sample's input shape:
[ 3, 5, 7 ], [15, 17, 19]
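A minimal sketch (using the same toy data and the same Keras backend API as above) confirming that axis=1 sums within each sample rather than across the batch:
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Lambda

model = Sequential()
# axis=1 sums over the first dimension of each sample's (2, 3) shape.
model.add(Lambda(lambda x: K.sum(x, axis=1), input_shape=(2, 3)))
x = np.reshape(np.arange(12), (2, 2, 3))
print(model.predict(x))
# [[ 3.  5.  7.]
#  [15. 17. 19.]]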