Tensorflow tokeniser: the maximum number of words to keep - tensorflow

I am trying to tokenize the IMDB movie reviews with the TensorFlow tokenizer. I want a vocabulary of at most 10000 words. For unseen words, I use a default token.
type(X), X.shape, X[:3]
(pandas.core.series.Series,(25000,),
0 first think another disney movie might good it...
1 put aside dr house repeat missed desperate hou...
2 big fan stephen king s work film made even gre...
Name: SentimentText, dtype: object)
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer=Tokenizer(num_words=10000,oov_token='xxxxxxx')
# fit on the input data
tokenizer.fit_on_texts(X)
When I check the number of words in the tokenizer dictionary, I get:
X_dict=tokenizer.word_index
list(enumerate(X_dict.items()))[:10]
[(0, ('xxxxxxx', 1)),
(1, ('s', 2)),
(2, ('movie', 3)),
(3, ('film', 4)),
(4, ('not', 5)),
(5, ('it', 6)),
(6, ('one', 7)),
(7, ('like', 8)),
(8, ('i', 9)),
(9, ('good', 10))]
print(len(X_dict))
Out: 74120
Why do I get 74120 words instead of 10000 words?

Because the full dictionary of words is always kept. If you look at the source code, you will see that fit_on_texts() ignores the num_words parameter. However, when you convert your text to sequences with texts_to_sequences(), it calls texts_to_sequences_generator(), which contains the following piece of code:
    for w in seq:
        i = self.word_index.get(w)
        if i is not None:
            if num_words and i >= num_words:
                if oov_token_index is not None:
                    vect.append(oov_token_index)
            else:
                vect.append(i)
        elif self.oov_token is not None:
            vect.append(oov_token_index)
    yield vect
There you can see that num_words is taken into account when the sequences are generated: any word whose index is num_words or higher is replaced by the OOV token (or dropped if no OOV token is set). This is useful because you can change the number of words without fitting on the whole text again, and so experiment with whether a given vocabulary size suits your needs or whether you need more words for your task, as nicolewhite states in her GitHub answer.
So what you observe is expected: when you run np.unique() on all of your sequences, you will not find more than 10000 distinct values.
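To make the behaviour concrete without depending on TensorFlow, here is a minimal pure-Python sketch that mimics the logic of the excerpt above. The helper names fit_word_index and to_sequences, the whitespace tokenization, and the three toy texts are all assumptions made for illustration; this is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts, oov_token):
    """Build the FULL word -> index mapping; no vocabulary cap is applied here."""
    counts = Counter(w for text in texts for w in text.split())
    # The OOV token takes index 1, then words follow by descending frequency.
    words = [oov_token] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(words, start=1)}

def to_sequences(texts, word_index, num_words, oov_token):
    """Apply the num_words cap only when converting texts to index sequences."""
    oov_index = word_index.get(oov_token)
    sequences = []
    for text in texts:
        vect = []
        for w in text.split():
            i = word_index.get(w)
            if i is not None and (not num_words or i < num_words):
                vect.append(i)
            elif oov_index is not None:
                # Unknown word, or a known word ranked beyond the cap.
                vect.append(oov_index)
        sequences.append(vect)
    return sequences

texts = ["good movie good film", "bad movie", "good film not bad"]
num_words = 4

word_index = fit_word_index(texts, oov_token="xxxxxxx")
sequences = to_sequences(texts, word_index, num_words, oov_token="xxxxxxx")

print(len(word_index))                         # size of the full dictionary
print(sequences)                               # indices after the cap
print(sorted({i for s in sequences for i in s}))
```

The dictionary holds every word seen during fitting, while the generated sequences only ever contain indices below num_words, which is the same pattern as 74120 dictionary entries versus at most 10000 distinct values in the sequences.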

Related

How to structure my video dataset based on extracted features for building a CNN-LSTM classification model?

For my project which deals with the recognition of emotions, I have a dataset consisting of multiple videos, which range from 0.5s-10s. I have an application which goes through each video and creates a .csv file containing the features it has extracted from each frame in the video, i.e., each row represents each frame from the video (with no. of rows being variable) and the columns represent the different features the application has extracted from the frame (with no. of columns being fixed). Each .csv filename also contains a code representing the emotion being expressed in the video.
Initially, my plan was to extract each frame from the video and pass each frame as input to the following CNN-LSTM (CNN for the spatial features and LSTM for the temporal features) model I was planning on using.
model = Sequential()
model.add(Input(input_shape))
model.add(Conv3D(6, (1, 5, 5), (1, 1, 1), activation='relu', name='conv-1'))
model.add(AveragePooling3D((1, 2, 2), strides=(1, 2, 2), name='avgpool-1'))
model.add(Conv3D(16, (1, 5, 5), (1, 1, 1), activation='relu', name='conv-2'))
model.add(AveragePooling3D((1, 2, 2), strides=(1, 2, 2), name='avgpool-2'))
model.add(Conv3D(32, (1, 5, 5), (1, 1, 1), activation='relu', name='conv-3'))
model.add(AveragePooling3D((1, 2, 2), strides=(1, 2, 2), name='avgpool-3'))
model.add(Conv3D(64, (1, 4, 4), (1, 1, 1), activation='relu', name='conv-4'))
model.add(Reshape((30, 64), name='reshape'))
model.add(CuDNNLSTM(64, return_sequences=True, name='lstm-1'))
model.add(CuDNNLSTM(64, name='lstm-2'))
model.add(Dense(6, activation=tf.nn.softmax, name='result'))
I still plan on using a CNN-LSTM model but I don't know how to structure my dataset now. I thought of labelling each frame in each .csv file with the corresponding emotion label and then combining all the .csv files into a single .csv file. This combined .csv file would then be passed to the above model, after changing the input shape and other necessary parameters, but I don't know if the model would be able to differentiate between the videos if done in that way.
So to conclude, I need help structuring my dataset and how this dataset should be passed to a CNN-LSTM model.
By looking at your problem statement I don't think there is a need to differentiate between the videos.
You can go ahead with your approach of labeling each frame in the video and combining it to a single CSV file.
You can use the code below to convert the CSV file into NumPy arrays and prepare the data for training. It assumes a 'pixels' column of space-separated values and an 'emotion' column holding the label.
import numpy as np
import pandas as pd

data = pd.read_csv('input.csv')
width, height = 48, 48
datapoints = data['pixels'].tolist()
# getting features for training
X = []
for xseq in datapoints:
    xx = [int(xp) for xp in xseq.split(' ')]
    xx = np.asarray(xx).reshape(width, height)
    X.append(xx.astype('float32'))
X = np.asarray(X)
X = np.expand_dims(X, -1)
# getting labels for training (as_matrix() was removed in pandas 1.0)
y = pd.get_dummies(data['emotion']).to_numpy()
# storing them using numpy
np.save('fdataX', X)
np.save('flabels', y)
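If the LSTM part of the model is meant to see each video as one sequence, an alternative layout to the single combined CSV is a padded 3-D array of shape (videos, frames, features) with one label per video. The sketch below shows that idea only; the helper name pad_videos, the random feature values, and the label codes are hypothetical, and in practice each array would be loaded from one of the per-video .csv files.

```python
import numpy as np

def pad_videos(videos, max_frames=None):
    """Stack variable-length (n_frames, n_features) arrays into one padded 3-D array."""
    n_features = videos[0].shape[1]
    if max_frames is None:
        max_frames = max(v.shape[0] for v in videos)
    batch = np.zeros((len(videos), max_frames, n_features), dtype="float32")
    mask = np.zeros((len(videos), max_frames), dtype=bool)
    for k, v in enumerate(videos):
        n = min(v.shape[0], max_frames)  # truncate videos longer than max_frames
        batch[k, :n] = v[:n]
        mask[k, :n] = True               # True marks real (non-padded) frames
    return batch, mask

# Hypothetical data: three videos with 5, 2 and 4 frames, 3 features per frame.
rng = np.random.default_rng(0)
videos = [rng.random((n, 3)) for n in (5, 2, 4)]
labels = np.array([0, 2, 1])  # one made-up emotion code per video

X_seq, frame_mask = pad_videos(videos)
print(X_seq.shape)              # (videos, frames, features)
print(frame_mask.sum(axis=1))   # number of real frames per video
```

The mask records which frames are padding, so a later masking step can tell the model to ignore them. Whether this layout or the frame-level CSV works better depends on whether the temporal order within a video matters for the task.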

How to conveniently use operations on numpy Fortran contiguous arrays?

Some numpy functions like np.matmul(a, b) have convenient behavior for stacks of matrices.
The manual states:
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
Thus, for a.shape = (10, 2, 4) and b.shape = (10, 4, 2) the statement a @ b is meaningful and will have shape (10, 2, 2).
However, I'm coming from the linear algebra world, where I'm used to a Fortran contiguous array layout.
The same a represented as a Fortran contiguous array would have shape (4, 2, 10) and similarly b.shape = (2, 4, 10).
To do a @ b as before I would have to invoke
(a.T @ b.T).T .
Even worse, assume you naively created the same Fortran-contiguous array a with the behavior of matmul in mind, such that it has shape (10, 4, 2).
Then a.strides = (8, 80, 320) with the smallest stride in the 'stack' index, which actually should have highest stride.
Is this really the way to go or am I missing something?
While numpy can handle all sorts of layouts, many details are designed with the "C" layout in mind. Good examples are how nested lists translate into arrays, and the way numpy operations batch excess dimensions as in the matmul case.
It is correct that results in numpy as a rule of thumb do not depend on array layout (FORTRAN,C,non-contiguous); speed, however, certainly does and heavily so:
import numpy as np
from timeit import timeit

rng = np.random.default_rng()
a = rng.random((100,111,200))
b = rng.random((111,77,200))
af = np.array(a,order="F")
bf = np.array(b,order="F")
np.allclose((b.T@a.T).T,(bf.T@af.T).T)
# True
timeit(lambda:(b.T@a.T).T,number=10)
# 5.972857117187232
timeit(lambda:(bf.T@af.T).T,number=10)
# 0.1994628761895001
In fact, sometimes it is totally worth it to non-lazily transpose, i.e. copy your data into the best layout:
timeit(lambda:(np.array(b.T,order="C")@np.array(a.T,order="C")).T,number=10)
# 0.3931349152699113
My advice: if you want speed and convenience, it is probably best to go with the "C" layout. It doesn't take all that long to get used to and saves you a lot of potential headaches.
numpy's matrix multiplication works regardless of the internal layout of the array. For example, here are two C-ordered arrays:
>>> import numpy as np
>>> a = np.random.rand(10, 2, 4)
>>> b = np.random.rand(10, 4, 2)
>>> print('a', a.shape, a.strides)
a (10, 2, 4) (64, 32, 8)
>>> print('b', b.shape, b.strides)
b (10, 4, 2) (64, 16, 8)
Here are the equivalent arrays in Fortran order:
>>> af = np.asfortranarray(a)
>>> bf = np.asfortranarray(b)
>>> print('af', af.shape, af.strides)
af (10, 2, 4) (8, 80, 160)
>>> print('bf', bf.shape, bf.strides)
bf (10, 4, 2) (8, 80, 320)
Numpy treats equivalent arrays as equivalent, regardless of their internal layout:
>>> np.allclose(a, af) and np.allclose(b, bf)
True
The results of a matrix multiplication do not depend on the internal layout:
>>> np.allclose(a @ b, af @ bf)
True
and you can even mix layouts if you wish:
>>> np.allclose(a @ bf, af @ b)
True
In short, the most convenient way to use Fortran-ordered arrays in numpy is to not worry about internal array layout: the shape is all that matters.
If your array shapes differ from what the numpy matmul API expects, your best bet is to reorder the axes, for example using a.transpose(2, 0, 1) @ b.transpose(2, 0, 1) or similar, depending on what is appropriate for your use case. Don't worry about the cost: for C or Fortran contiguous arrays, this operation only adjusts the metadata around the array view; it does not cause the underlying data buffer to be copied or re-ordered.
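As a small sketch of that advice, assume the stack index is stored last, i.e. arrays of shape (rows, cols, stack) in Fortran order. That layout, the array names, and the random data are assumptions chosen for illustration. The first option moves the stack axis to the front with a transpose view; the second keeps the stack axis last and writes the contraction explicitly with np.einsum.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed layout: (rows, cols, stack) with the stack index last, Fortran order.
a_f = np.asfortranarray(rng.random((2, 4, 10)))
b_f = np.asfortranarray(rng.random((4, 2, 10)))

# Option 1: move the stack axis to the front; transpose returns a view, not a copy.
a_view = a_f.transpose(2, 0, 1)   # shape (10, 2, 4)
b_view = b_f.transpose(2, 0, 1)   # shape (10, 4, 2)
out_first = a_view @ b_view       # shape (10, 2, 2)

# Option 2: keep the stack axis last and spell out the contraction over j.
out_last = np.einsum("ijn,jkn->ikn", a_f, b_f)   # shape (2, 2, 10)

print(np.shares_memory(a_f, a_view))                        # view shares the buffer
print(np.allclose(out_first, out_last.transpose(2, 0, 1)))  # both options agree
```

Both options give the same per-matrix products. Which one is faster depends on the shapes and on the BLAS build, so it is worth timing them on real data.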

TF-IDF using in pandas data frame

I am trying to use TF-IDF in pandas on a data set with two columns: the first column contains text data and the second contains categorical data. It looks like this:
summary                                             type of attack
unknown african american assailants fired seve...   Armed Assault
unknown perpetrators detonated explosives paci...   Bombing
karl armstrong member years gang threw firebom...   Infrastructure
karl armstrong member years gang broke into un...   Infrastructure
unknown perpetrators threw molotov cocktail in...   Infrastructure
I want to use TF-IDF to convert the first column and then use it to build a model that predicts the second column, which contains the attack type.
Here is a short example that processes your df into X and y, ready to be used for training.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
data = {'summary': ['unknown african american assailants fired',
                    'Armed Assault unknown perpetrators detonated explosives',
                    'Bombing karl armstrong member years gang threw'],
        'type of attack': ['bullet', 'explosion', 'gang']}
#tfidf
df = pd.DataFrame(data)
tf = TfidfVectorizer()
X = tf.fit_transform(df['summary'])
#label encoding
le = LabelEncoder()
y = le.fit_transform(df['type of attack'])
#your X and y ready to be trained
print('X----')
print(X)
print('y----')
print(y)
Output
X----
(0, 9) 0.4673509818107163
(0, 4) 0.4673509818107163
(0, 1) 0.4673509818107163
(0, 0) 0.4673509818107163
(0, 15) 0.35543246785041743
(1, 8) 0.4233944834119594
(1, 7) 0.4233944834119594
(1, 13) 0.4233944834119594
(1, 5) 0.4233944834119594
(1, 2) 0.4233944834119594
(1, 15) 0.3220024178194947
(2, 14) 0.37796447300922725
(2, 10) 0.37796447300922725
(2, 16) 0.37796447300922725
(2, 12) 0.37796447300922725
(2, 3) 0.37796447300922725
(2, 11) 0.37796447300922725
(2, 6) 0.37796447300922725
y----
[0 1 2]
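To see where the numbers in X come from, the following sketch recomputes the weights by hand for the same three summaries. It assumes TfidfVectorizer's default settings (lowercasing, raw term counts, smoothed idf ln((1 + n) / (1 + df)) + 1, L2 normalization) and uses plain whitespace splitting, which gives the same tokens as the default tokenizer for these particular strings. The names doc_freq and tfidf_row are made up for the example.

```python
import math
from collections import Counter

docs = [
    "unknown african american assailants fired",
    "Armed Assault unknown perpetrators detonated explosives",
    "Bombing karl armstrong member years gang threw",
]
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
print(rows[0]["fired"], rows[0]["unknown"])
```

In the first summary, "unknown" also appears in the second summary, so it gets a lower weight (about 0.355) than words unique to that row (about 0.467), matching the column-15 entries in the output above. The resulting X and y can then be passed to the fit method of a scikit-learn classifier.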

Squeeze all but one axis in TensorFlow

How can I squeeze all but specific axes of a Tensor?
The input size (and how many dimensions of size 1 there are) is unknown in advance, but I know that I want to keep only the first and second axes (i.e. all the others have size 1).
E.g. if I have a Tensor of shape (1, 4, 1, 1, 1), I want it to be squeezed into (1, 4).

Baffled by numpy's transpose

Let's take a very simple case: an array with shape (2,3,4), ignoring the values.
>>> a.shape
(2, 3, 4)
When we transpose it and print the dimensions:
>>> a.transpose([1,2,0]).shape
(3, 4, 2)
So I'm saying: take axis index 2 and make it the first, then take axis index 0 and make it the second and finally take axis index 1 and make it the third. I should get (4,2,3), right?
Well, I thought perhaps I don't understand the logic fully. So I read the documentation and it says:
Use transpose(a, argsort(axes)) to invert the transposition of tensors
when using the axes keyword argument.
So I tried
>>> c = np.transpose(a, [1,2,0])
>>> c.shape
(3, 4, 2)
>>> np.transpose(a, np.argsort([1,2,0])).shape
(4, 2, 3)
and got yet a completely different shape!
Could someone please explain this? Thanks.
In [259]: a = np.zeros((2,3,4))
In [260]: idx = [1,2,0]
In [261]: a.transpose(idx).shape
Out[261]: (3, 4, 2)
What this has done is take a.shape[1] dimension and put it first. a.shape[2] is 2nd, and a.shape[0] third:
In [262]: np.array(a.shape)[idx]
Out[262]: array([3, 4, 2])
transpose without parameter is a complete reversal of the axis order. It's an extension of the familiar 2d transpose (rows become columns, columns become rows):
In [263]: a.transpose().shape
Out[263]: (4, 3, 2)
In [264]: a.transpose(2,1,0).shape
Out[264]: (4, 3, 2)
And the do-nothing transpose:
In [265]: a.transpose(0,1,2).shape
Out[265]: (2, 3, 4)
You have an initial axes order and a final one; describing the swap can be hard to visualize if you don't regularly work with lists of size 3 or larger.
Some people find it easier to use swapaxes, which exchanges just two axes. rollaxis is yet another way.
I prefer to use transpose since it can do anything the others can, so I only have to develop an intuition for one tool.
The argsort comment operates this way:
In [278]: a.transpose(idx).transpose(np.argsort(idx)).shape
Out[278]: (2, 3, 4)
That is, apply it to the result of one transpose to get back the original order.
np.argsort([1,2,0]) returns an array like [2,0,1]
So
np.transpose(a, np.argsort([1,2,0])).shape
acts like
np.transpose(a, [2,0,1]).shape
not
np.transpose(a, [1,2,0]).shape
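The rule described in both answers can be checked with a short sketch: axis i of the result is axis axes[i] of the input, and np.argsort(axes) gives the permutation that undoes the transpose. The array contents (np.arange) are an arbitrary choice made so that individual elements can be compared.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
axes = [1, 2, 0]

c = a.transpose(axes)
# Axis i of the result is axis axes[i] of the input.
expected_shape = tuple(a.shape[ax] for ax in axes)
print(c.shape, expected_shape)

# Element view of the same rule: c[i, j, k] is a[k, i, j] for axes = [1, 2, 0].
print(c[1, 2, 0] == a[0, 1, 2])

# argsort(axes) is the inverse permutation, so it restores the original order.
inverse = np.argsort(axes)
print(inverse.tolist())
print(np.array_equal(c.transpose(inverse), a))
```

Applying np.argsort(axes) directly to the original array a, as in the question, performs a different permutation rather than an inversion, which is why the shape (4, 2, 3) showed up there.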