How to create a boolean mask in tensorflow with middle range only set to True using indexing specified in another tensor

I have a tensor of shape [None, 1] consisting of a value for each batch. Using this tensor, I have to create a boolean mask where, for each batch element, the values are True starting at that element's index value (taken from the tensor) and continuing for a fixed length, with everything else set to False.
For example, consider the index tensor
[[1],[2],[2]]
Suppose the desired number of timesteps per batch is 5 and the fixed length is 2. Then for the first batch, the positions starting at index 1 and ending at index 2 (since the fixed length is 2) should be set to True, and likewise for the other batches. That is, I want the boolean mask to be
[[False,True,True,False,False],
[False,False,True,True,False],
[False,False,True,True,False]]
How can I achieve this without processing each batch individually, and preferably without using TensorFlow's ragged tensors? A comparison such as
index < tf.range(number_of_timesteps)
can set True at the extremes of a row, but I could not find a way to set True only in the middle.

Solving your example can be done with a combination of tf.one_hot and tf.roll.
import tensorflow as tf

indices = [[1], [2], [2]]                     # avoid shadowing the builtin `input`
intermediate_output = tf.one_hot(indices, 5)  # shape [3, 1, 5]
output = intermediate_output + tf.roll(intermediate_output, shift=1, axis=2)
output = tf.squeeze(output, axis=1)           # drop the singleton axis -> [3, 5]
If you need to convert the zeros and ones to booleans:
output = tf.where(tf.equal(output, 1), True, False)
Explanation:
tf.one_hot converts your indices to an intermediate one-hot representation. Next, tf.roll shifts that representation by 1 along the last axis. Adding this to the intermediate representation and converting to boolean returns your desired output.
EDIT:
I don't see a good way, other than a for loop, to extend this to longer runs of True values. The code below will generate your desired output:
indices = [[1], [2], [2]]
intermediate_output = tf.one_hot(indices, 5)
outputs = []
timesteps = 3  # here: the length of the run of True values
for i in range(timesteps):
    outputs.append(tf.roll(intermediate_output, shift=i, axis=2))
output = tf.reduce_sum(tf.stack(outputs), axis=0)  # sum the shifted copies
output = tf.squeeze(output, axis=1)                # [3, 1, 5] -> [3, 5]
output = tf.where(tf.equal(output, 1), True, False)
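Alternatively, here is a fully vectorized sketch (my own addition, not part of the original answer): a position j should be True exactly when index <= j < index + fixed_length, which extends the questioner's tf.range comparison to cover both ends and avoids the loop entirely.
import tensorflow as tf

indices = tf.constant([[1], [2], [2]])  # shape [3, 1]
timesteps = 5
fixed_length = 2

positions = tf.range(timesteps)  # shape [5]
# broadcasting [3, 1] against [5] yields the [3, 5] boolean mask directly
mask = (positions >= indices) & (positions < indices + fixed_length)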

Related

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (I'm new to data science). It's obviously because I don't fully understand how pandas data structures work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value = ""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier) + len(identifier) + 1
        end_index = full_str[start_index:].find("|") + start_index
        value = full_str[start_index:end_index].astype(str)
    return value

for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns; I have around 260 in my dataframe. If a column name contains the character #, the function should try to fill that column's values based on what is in my "Diskret data" column.
Data in the "Diskret data" column is quite messy but has the form CCC#111~VALUE|DDD#222~VALUE|... until there are no more identifier/value pairs. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script, but with the dataframe I get errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
12 for col in anedf.columns:
13 if col.count("#") > 0:
---> 14 anedf[col] = getSearchedValue(col)
Input In [119], in getSearchedValue(identifier)
4 full_str = anedf["Diskret data"].astype(str)
5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
7 start_index = full_str.str.find(identifier)+len(identifier)+1
8 end_index = full_str[start_index:].find("|")+start_index
I guess this is because the condition is evaluated against all rows (a Series) at once, which produces the ambiguous truth value error. But how can I make the evaluation and assignment work row by row, like this:
Diskret data            | CCC#111                        | JJSDJ#1234
CCC#111~1|BBB#2323~2234 | 1 (copied from "Diskret data") | 0
JJSDJ#1234~Heart attack | 0 (or skipped, since the row has no value for this identifier) | Heart attack
The plan is to drop the "Diskret data" column when the assignment is done, so that I have the data in a more structured way.
--- Update---
By request:
I have included a picture of how I visualize the problem, and of what I seemingly can't make it do:
Problem visualisation
With regex you could do something like:
import pandas as pd

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unzips it into two tuples: one with only the "keys" and the other with only the "values". We then create a Series from these, with the keys as the index and the values as the values.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of dtype object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

...

print(idx, values)  # first row
# output:
# ('CCC#111', 'BBB#2323') (1, 2234)
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
#   CCC#111 BBB#2323    JJSDJ#1234
# 0       1     2234           NaN
# 1     NaN      NaN  Heart attack
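As a follow-up sketch (my own addition; it assumes the question's dataframe is named anedf), the parsed columns can be joined back onto the original dataframe and the raw column dropped, which matches the questioner's stated plan:
import pandas as pd

anedf = pd.DataFrame(
    {"Diskret data": ["CCC#111~1|BBB#2323~2234", "JJSDJ#1234~Heart attack"]}
)

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    return pd.Series(dtype=object)

# parse, expand to one column per identifier, attach, then drop the raw column
parsed = anedf["Diskret data"].str.findall(r"([^~|]+)~([^~|]+)").apply(map_)
anedf = anedf.join(parsed).drop(columns=["Diskret data"])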

vectorize pytorch tensor indexing

I have a batch of images img_batch, size [8,3,32,32], and I want to manipulate each image by setting randomly selected pixels to zero. I can do this using a for loop over each image but I'm not sure how to vectorize it so I'm not processing only one image at a time. This is my code using loops.
import random
import torch

batch_size = 8
prct0 = 0.1
noise = torch.tensor([9, 14, 5, 7, 6, 14, 1, 3])
comb_img = []
for ind in range(batch_size):
    img = img_batch[ind]
    c, h, w = img.shape
    prct = 1 - (1 - prct0)**noise[ind].item()
    idx = random.sample(range(h*w), int(prct*h*w))
    img_noised = img.clone()
    img_noised.view(c, 1, -1)[:, 0, idx] = 0
    comb_img.append(img_noised)
comb_img = torch.stack(comb_img)  # output is comb_img [8,3,32,32]
I'm new to pytorch and if you see any other improvements, please share.
First note: do you really need noise? Things are a lot easier if you treat all images the same and don't zero a different number of pixels per image.
However, you can do it this way, though you still need a small for loop (in the list comprehension).
# don't want RGB masking, want the whole pixel, so the mask has one channel
rng = torch.rand(*img_batch[:, 0:1].shape)
# create a binary mask with a per-image threshold
mask = torch.stack([rng[i] <= 1 - (1 - prct0)**noise[i] for i in range(batch_size)])
img_batch_masked = img_batch.clone()
# broadcast the mask to the 3 RGB channels
img_batch_masked[mask.tile([1, 3, 1, 1])] = 0
You can check that the mask is set correctly by summing mask across the last 3 dims, and seeing if it matches your target percentage:
In [5]: print(mask.sum([1,2,3])/(mask.shape[2] * mask.shape[3]))
tensor([0.6058, 0.7716, 0.4195, 0.5162, 0.4739, 0.7702, 0.1012, 0.2684])
In [6]: print(1-(1-prct0)**noise)
tensor([0.6126, 0.7712, 0.4095, 0.5217, 0.4686, 0.7712, 0.1000, 0.2710])
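For what it's worth, a loop-free variant (my own sketch, not from the original answer): broadcasting a per-image threshold against the random tensor removes the list comprehension entirely.
import torch

batch_size, prct0 = 8, 0.1
noise = torch.tensor([9, 14, 5, 7, 6, 14, 1, 3])
img_batch = torch.rand(batch_size, 3, 32, 32)  # stand-in batch for the sketch

rng = torch.rand(batch_size, 1, 32, 32)
# per-image probability of zeroing, shaped [8, 1, 1, 1] for broadcasting
thresholds = (1 - (1 - prct0) ** noise.float()).view(-1, 1, 1, 1)
mask = rng <= thresholds  # [8, 1, 32, 32], no Python loop
img_batch_masked = img_batch.clone()
img_batch_masked[mask.expand(-1, 3, -1, -1)] = 0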
You can easily do this without a loop in a fully vectorized manner:
1. Create a noise tensor.
2. Select a threshold and round the noise tensor to 0 or 1 depending on whether each value is above or below that threshold (prct0).
3. Element-wise multiply the image tensor by the noise tensor.
I think calling the vector of power multipliers noise is a bit confusing, so I've renamed that vector power_vec in this example:
power_vec = noise
# create random noise - note one channel rather than 3 color channels
rand_noise = torch.rand(8, 1, 32, 32)
# reshape power_vec to [8, 1, 1, 1] so it broadcasts against rand_noise
noise = torch.pow(rand_noise, power_vec.view(-1, 1, 1, 1))
# "round" noise based on threshold
z = torch.zeros(noise.shape)
o = torch.ones(noise.shape)
noise_rounded = torch.where(noise > prct0, o, z)
# apply the noise mask to each color channel
output = img_batch * noise_rounded.expand(8, 3, 32, 32)
For simplicity this solution uses your original batch size and image size but could be trivially extended to work on inputs of any image and batch size.
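As a sketch of that extension (my own addition, with stand-in sizes and power multipliers), the shapes can simply be derived from the batch itself:
import torch

prct0 = 0.1
img_batch = torch.rand(16, 3, 64, 64)  # any batch/image size
power_vec = torch.randint(1, 15, (16,)).float()  # stand-in power multipliers

b, c, h, w = img_batch.shape
rand_noise = torch.rand(b, 1, h, w)
noise = torch.pow(rand_noise, power_vec.view(-1, 1, 1, 1))
noise_rounded = torch.where(noise > prct0,
                            torch.ones_like(noise),
                            torch.zeros_like(noise))
output = img_batch * noise_rounded.expand(b, c, h, w)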

Tensorflow : Choosing a range of columns in each row from a Tensor

I would like to choose only particular columns in each row of a tensor, to use it for an RNN:
seq_len=[11,12,20,30] #This is the sequence length, assume 4 sequences
array=tf.ones([4,30]) #Assuming this is the array I want to index from
function(array,seq_len) #apply required function
Output = (first 11 elements from row 0, first 12 from row 1, first 20 from row 2, etc.), perhaps obtained as a flat tensor
You can use tf.sequence_mask and tf.boolean_mask to get them flattened:
mask = tf.sequence_mask(seq_len, MAX_LENGTH) # Replace MAX_LENGTH with the size of array on the right dimension, 30 in your case
output= tf.boolean_mask(array, mask=mask)
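Put together as a runnable sketch (MAX_LENGTH is 30 here, the width of array in the question):
import tensorflow as tf

seq_len = [11, 12, 20, 30]
array = tf.ones([4, 30])

mask = tf.sequence_mask(seq_len, 30)        # boolean mask, shape [4, 30]
output = tf.boolean_mask(array, mask=mask)  # flat tensor of length 11+12+20+30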
A tensor in TensorFlow can be sliced just like a NumPy array, and the slices can then be concatenated into one tensor. This assumes you measure the sequence length from the first element.
Use [row_idx, column_idx] to slice the tensor: row = array[0, :] assigns the first row to row. tf.concat([row_a, row_b], axis=0) will then flatten the slices into one tensor.
import tensorflow as tf

seq_len = [11, 12, 20, 30]
array = tf.ones([4, 30])
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    flatten = array[0, :seq_len[0]]
    for i in range(1, len(seq_len)):
        row = array[i, :seq_len[i]]
        flatten = tf.concat([flatten, row], axis=0)
    print(sess.run(flatten))

Get the original train-test split based on an integer column value after using pd.get_dummies

I combined my train and test datasets and used the get_dummies function from pandas to one-hot encode the categorical data. The reason for concatenating was that the categorical columns had different sets of levels in the train and test sets.
If I used get_dummies on the train and test sets separately, I would get dataframes with different dimensions as output, so I combined them.
I now want to split it back into the train and test sets. Is that possible?
Assume that the output we get after using pd.get_dummies is named dataset. If the value of the 'C10' column in dataset is 30 (integer), the row belongs to the test set; otherwise it belongs to the train set.
If I try to select values like we do in a normal dataframe I get the following error:
dataset = pd.concat([train, test])
dataset_dummy = pd.get_dummies(dataset, prefix_sep='_', columns = cat_columns, sparse = True, drop_first = True)
test_dummy = dataset_dummy.iloc[dataset_dummy['day'] == 30]
AttributeError: 'BlockManager' object has no attribute 'T'
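No answer was recorded for this question, but as a hedged sketch (my own suggestion, not from the original thread): boolean row selection should go through .loc (or plain brackets) rather than .iloc, which expects integer positions.
# assumes dataset_dummy from the question; the question's 'C10' column
# appears as 'day' in the code sample, so 'day' is used here
test_dummy = dataset_dummy.loc[dataset_dummy['day'] == 30]
train_dummy = dataset_dummy.loc[dataset_dummy['day'] != 30]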

reshape list to (-1,1) and return float as datatype

I am trying to build a logistic regression model; data.Exam1 is the first column.
reg = linear_model.LogisticRegression()
X = list(data.Exam1.values.reshape(-1,1))   # (1)
I have performed this operation
type(X[0]) returns numpy.ndarray
reg.fit expects inputs whose items are all floats, so I did the following because of this exception: ValueError: Unknown label type: 'continuous'
newX = []
for item in X:
    type(float(item))  # just inspecting the type
    newX.append(float(item))
so when I tried to do
reg.fit(newX,newY,A)
it throws this exception:
Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
which I already did in (1), and when I try to reshape again it returns an ndarray again. How can I reshape and convert the items to float at the same time?
Adapting our solution from chat
You are trying to understand Admission (type: bool) as a function of Exam scores (Exam1: float, Exam2: float). The crux of your issue is that sklearn.linear_model.LogisticRegression expects two inputs:
X: a vector/matrix of training data with the shape (number of observations, number of predictors) with type float
Y: a vector of categorical outcomes (in this case binary) with the shape (number of observations, 1) with type bool or int
The way you are calling it tries to fit Exam2 (float) as a function of Exam1 (float). This is the fundamental issue. Further complicating matters is the way you are recasting your reshaped numpy array as a list. Assuming data is a pandas.DataFrame, you want something like:
import numpy as np

X = np.vstack((data.Exam1, data.Exam2)).T
print(X.shape)  # should be (100, 2)
reg.fit(X, data.Admitted)
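For completeness, a self-contained sketch of the above (my own addition; the column names Exam1, Exam2, and Admitted come from the discussion, while the data here is synthetic):
import numpy as np
import pandas as pd
from sklearn import linear_model

# synthetic stand-in for the questioner's data
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Exam1": rng.uniform(0, 100, 100),
    "Exam2": rng.uniform(0, 100, 100),
    "Admitted": rng.integers(0, 2, 100),
})

X = np.vstack((data.Exam1, data.Exam2)).T  # shape (100, 2)
reg = linear_model.LogisticRegression()
reg.fit(X, data.Admitted)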
Here, both data.Exam1 and data.Exam2 are vectors of length 100. Using np.vstack combines them into the shape (2, 100), so we take the transpose so that we have it oriented properly with observations along the first dimension (100, 2). No need to recast as list or even take data.Exam1.values as the pd.Series gets recast as np.array during np.vstack. Similarly, data.Admitted (with shape (100,)) plays nicely with reg.fit.