What is the sequence for preprocessing a text DataFrame with TensorFlow? - tensorflow

I have a pandas DataFrame containing two columns, sentences and annotations:
Col 0 | Sentence                   | Annotation
1     | [This, is, sentence]       | [l1, l2, l3]
2     | [This, is, sentence, too]  | [l1, l2, l3, l4]
There are several things I need to do:
split to features and labels
split into train-val-test data
vectorize train data using:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=maxlen,
    standardize='lower',
    split='whitespace',
    ngrams=(1, 3),
    output_mode='tf-idf',
    pad_to_max_tokens=True,
)
I haven't worked with tensors before, so I am a little confused about how to order the steps above and how to access the information inside the tensors. Specifically, at what point do I have to split into features and labels, and how do I access one or the other? Also, should I split into features and labels before splitting into train-val-test (I want to do it properly with TensorFlow rather than fall back on sklearn's train_test_split), or is it the other way around?

You can split your dataset before creating a model. After splitting, you need to tokenize your sentences using
tensorflow.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)
After tokenizing, you need to pad the sequences using
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)
Then you can train your model with the data. For more details, please refer to this working code example. Thank you.
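For illustration, a minimal sketch of that tokenize-then-pad flow, assuming training_sentences is a list of plain strings and that vocab_size, oov_tok, max_length and trunc_type are defined elsewhere:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the training sentences only
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

# Convert the sentences to integer sequences, then pad/truncate to a fixed length
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)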

Related

How to import a CSV file, split it 70/30 and then use first column as my 'y' value?

I am having an issue at the moment; I think I'm making it far more complicated than it needs to be. My CSV file is 31 rows by 500. I need to import this, split it in a 70/30 ratio, and then be able to use the first column as my 'y' value for a neural network, while the remaining 30 columns need to be my 'x' value.
I've implemented the code below to do this, but when I run it through my basic sigmoid and testing functions, it produces results in a weird format, e.g. [6.54694655e-06].
I believe this is due to my splitting/importing of the data, which I think I have done wrong. I need to import the data into arrays that are readable by my functions, and be able to separate the first column specifically as my 'y' value. How do I go about this?
df = pd.read_csv(r'data.csv', header=None)
df.to_numpy()
#splitting data 70/30
trainingdata = df[:329]
testingdata = df[:141]
#converting data to separate arrays for training and testing
training_features = trainingdata.loc[:, trainingdata.columns != 0].values.reshape(329, 30)
training_labels = trainingdata[0]
training_labels = training_labels.values.reshape(329, 1)
testing_features = testingdata[0]
testing_labels = testingdata.loc[:, testingdata.columns != 0]
Usually, for splitting a dataframe into train and test data, I use sklearn.model_selection.train_test_split. Documentation here.
Some other methods are described here. Hope this helps!
Make your train/test split easy by using sklearn.model_selection.train_test_split.
If you don't have sklearn installed, first install it by running pip install -U scikit-learn.
Then
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv(r'data.csv', header=None)
# X is your features, y is your target column
X = df.loc[:, 1:]
y = df.loc[:, 0]
# Use the train_test_split function with a test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
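As a follow-up sketch (assuming you need plain NumPy arrays for your own functions), you can then convert the splits and reshape the target into a column vector:
X_train_arr = X_train.to_numpy()                  # shape: (n_train_samples, 30)
y_train_arr = y_train.to_numpy().reshape(-1, 1)   # column vector for the network
X_test_arr = X_test.to_numpy()
y_test_arr = y_test.to_numpy().reshape(-1, 1)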
df = pd.read_csv(r'data.csv')
data = df.to_numpy()
print(data)

Building tf.keras.preprocessing.image.ImageDataGenerator on existing data split

Let's assume we have two txt files (train split and validation split) with filenames.
Now we want to use a tf.keras.preprocessing.image.ImageDataGenerator. In the case of a random split based on the folder structure (one folder per class), the generator would be built like this:
TRAINING_DATA_DIR = str(data_root)
datagen_kwargs = dict(rescale=1. / 255, validation_split=0.2)  # rescale and split
valid_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**datagen_kwargs)
valid_generator = valid_datagen.flow_from_directory(
    TRAINING_DATA_DIR,
    subset="validation",
    shuffle=True,
    target_size=IMAGE_SHAPE)
But what about building the generator based on the two txt files containing the image names of the splits?
I found out that besides the flow_from_directory function there is a flow_from_dataframe function, but it still doesn't answer the problem.
Any ideas are highly appreciated
Thanks
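One possible approach is sketched below, assuming each txt file lists one image filename per line together with its class label (the file names, column names and comma-separated layout are assumptions for illustration): build one DataFrame per split and pass it to flow_from_dataframe.
import pandas as pd
import tensorflow as tf

# Assumed layout: each line in the txt file is "filename,class"
train_df = pd.read_csv("train_split.txt", names=["filename", "class"])
valid_df = pd.read_csv("val_split.txt", names=["filename", "class"])

datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1. / 255)

train_generator = datagen.flow_from_dataframe(
    train_df,
    directory=TRAINING_DATA_DIR,  # folder that actually holds the image files
    x_col="filename",
    y_col="class",
    target_size=IMAGE_SHAPE,
    shuffle=True)

valid_generator = datagen.flow_from_dataframe(
    valid_df,
    directory=TRAINING_DATA_DIR,
    x_col="filename",
    y_col="class",
    target_size=IMAGE_SHAPE,
    shuffle=False)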

How to perform kmean clustering from Gensim TFIDF values

I am using Gensim for a vector space model. After creating a dictionary and corpus with Gensim, I calculated the TF-IDF (term frequency * inverse document frequency) using the following lines:
Term_IDF = TfidfModel(corpus)
corpus_tfidf = Term_IDF[corpus]
corpus_tfidf contains a list of lists holding term ids and the corresponding TF-IDF values. I then separated the TF-IDF values from the ids using the following lines:
IDS = []
tfidfmtx = []
for doc in corpus_tfidf:
    for ids, tfidf in doc:
        IDS.append(ids)
        tfidfmtx.append(tfidf)
Now I want to use k-means clustering, so I want to compute cosine similarities of the TF-IDF matrix. The problem is that Gensim does not produce a square matrix, so when I run the following line it generates an error. I wonder how I can get a square matrix from Gensim to calculate the similarities of all the documents in the vector space model. Also, how do I convert the TF-IDF matrix (which in this case is a list of lists) into a 2D NumPy array? Any comments are much appreciated.
dumydist = 1 - cosine_similarity(tfidfmtx)
When you fit your corpus to a Gensim Dictionary, get the number of documents and terms in the dictionary:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
Transform into bow:
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]
Transform into tf-idf:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]
Now you can transform into sparse/dense matrix:
from gensim.matutils import corpus2dense, corpus2csc
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
Now fit your model using either the sparse or the dense matrix (after transposing, so that rows are documents):
from sklearn.cluster import KMeans

model = KMeans(n_clusters=7)
clusters = model.fit_predict(corpus_tfidf_dense.T)
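If you also need the square document-by-document distance matrix the question asks about, a short sketch (assuming scikit-learn) is:
from sklearn.metrics.pairwise import cosine_similarity

doc_term_matrix = corpus_tfidf_dense.T             # rows are documents, columns are terms
dumydist = 1 - cosine_similarity(doc_term_matrix)  # square (num_docs, num_docs) matrix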
To create a document-term matrix from Gensim, you can use matutils.corpus2csc.
corpus - list of lists (Gensim corpus)
import gensim
from scipy.sparse import csc_matrix

scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
full_matrix = csc_matrix(scipy_csc_matrix).toarray()
You may want to keep the SciPy sparse format if your corpus is very large.
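As a usage sketch (assuming scikit-learn is available), the sparse matrix can be fed to KMeans directly after transposing it so that rows are documents:
from sklearn.cluster import KMeans

doc_term_sparse = scipy_csc_matrix.T.tocsr()  # documents as rows, CSR for efficient row access
clusters = KMeans(n_clusters=7).fit_predict(doc_term_sparse)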

Stacking list of lists vertically using np.vstack is throwing an error

I am following this piece of code http://queirozf.com/entries/scikit-learn-pipeline-examples in order to develop a multilabel OneVsRest classifier for text. I would like to compute the hamming_score and thus need to binarize my test labels as well. I therefore have:
X_train, X_test, labels_train, labels_test = train_test_split(meetings, labels, test_size=0.4)
Here, labels_train and labels_test are lists of lists:
[['dog', 'cat'], ['cat'], ['people'], ['nice', 'people']]
Now I need to binarize all my labels, so I am doing this...
all_labels = np.vstack([labels_train, labels_test])
mlb = MultiLabelBinarizer().fit(all_labels)
As directed in the link. But that throws
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I used np.column_stack as directed here
numpy array concatenate: "ValueError: all the input arrays must have same number of dimensions"
but that throws the same error.
How can the dimensions be the same if I am splitting into train and test? I am bound to get different shapes, right? Please help, thank you.
MultiLabelBinarizer works on lists of lists directly, so you don't need to stack them using numpy. Pass the lists without stacking:
all_labels = labels_train + labels_test
mlb = MultiLabelBinarizer().fit(all_labels)
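As a brief usage follow-up (a sketch), each split can then be transformed separately; the binarized arrays share the same column order:
y_train = mlb.transform(labels_train)  # shape: (n_train_samples, n_unique_labels)
y_test = mlb.transform(labels_test)    # shape: (n_test_samples, n_unique_labels)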

Feeding .npy (numpy files) into tensorflow data pipeline

Tensorflow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipeline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as features and a scalar as their label.
It is actually possible to read NPY files directly with TensorFlow instead of converting to TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose that a float32 NPY file containing an array with shape (N, K) is given, and you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we're considering numbers now). In short, you can find the size of this header with a function like this:
def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        return f.tell()
With this you can create a dataset like this:
import tensorflow as tf
npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)
Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:
dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))
The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))
Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, though that is a bit of a pain and hardly doable from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it stands, the solution requires you to use either only one file per dataset or files that have the same header size, although if you know that all the arrays have the same size that should actually be the case.
Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
You can do it with tf.py_func, see the example here.
The parse function would simply decode the filename from bytes to string and call np.load.
Update: something like this:
import numpy as np
import tensorflow as tf

def read_npy_file(item):
    data = np.load(item.decode())
    return data.astype(np.float32)

file_list = ['/foo/bar.npy', '/foo/baz.npy']

dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
    lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
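One caveat (not part of the original answer): tensors coming out of tf.py_func carry no static shape, so downstream layers may complain. A sketch for restoring it, assuming each file holds a single 1-D feature vector whose length num_features you know in advance:
num_features = ...  # assumed known in advance

dataset = dataset.map(lambda x: tf.ensure_shape(x, (num_features,)))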
Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:
Consuming NumPy arrays
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
If the file doesn't fit into memory, it seems the only recommended approach is to first convert the npy data into the TFRecord format, and then use the TFRecord dataset format, which can be streamed without fully loading into memory.
Here is a post with some instructions.
FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain Tensorflow.
If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with, and preferably you should reformat the data first, either as TFRecord or as multiple npy files, and then use other methods.
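To illustrate that generator idea, here is a sketch under the assumption that the large array has already been split into per-batch files with the hypothetical names batch_<i>_x.npy and batch_<i>_y.npy:
import numpy as np
import tensorflow as tf

class NpyBatchSequence(tf.keras.utils.Sequence):
    """Yields one pre-saved batch per step, loading only that batch from disk."""

    def __init__(self, num_batches, data_dir):
        self.num_batches = num_batches
        self.data_dir = data_dir

    def __len__(self):
        return self.num_batches

    def __getitem__(self, idx):
        # Hypothetical file naming scheme; adjust to how the files were actually saved.
        x = np.load(f"{self.data_dir}/batch_{idx}_x.npy")
        y = np.load(f"{self.data_dir}/batch_{idx}_y.npy")
        return x, y

# Example usage (paths and counts are placeholders):
# model.fit(NpyBatchSequence(num_batches=100, data_dir="data/batches"), epochs=5)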
Problem setup
I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.
Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.
Workaround
I came across TensorFlow's official tutorial on Show, Attend and Tell, which had a great workaround for the problem this thread (and I) was having.
Load numpy files
First off we need to create a mapping function that accepts the .npy file name and returns the numpy array.
# Load the numpy files
def map_func(feature_path):
    feature = np.load(feature_path)
    return feature
Use the tf.numpy_function
With tf.numpy_function we can wrap any Python function and use it as a TensorFlow op. The function must accept numpy objects (which is exactly what we want).
We create a tf.data.Dataset with the list of all the .npy filenames.
dataset = tf.data.Dataset.from_tensor_slices(feature_paths)
We then use the map function of the tf.data.Dataset API to do the rest of our task.
# Use map to load the numpy files in parallel
dataset = dataset.map(
    lambda item: tf.numpy_function(map_func, [item], tf.float16),
    num_parallel_calls=tf.data.AUTOTUNE)
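A typical follow-up step (a sketch; the batch size here is arbitrary) is to batch and prefetch the resulting dataset:
dataset = dataset.batch(64).prefetch(tf.data.AUTOTUNE)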