Remove header from .MAT files loaded by loadmat() from scipy.io library - numpy

I am new to both python and tensorflow.
I am trying to make a input pipeline for a generative adversarial network with input complex number data in .mat format and loaded it with loadmat() from scipy.io library. Now I am trying to prepare my data for giving input to my network and i tried from_tensor_slices(). But it can not be converted into tensor because of the headers in it. I looked up how to remove header from files by python and found some techniques that can be applied to .csv file but nothing on .mat files. How can I remove the header from .mat files? Also, the loadmat() function returns a list of dictionary I think. How can I extract the data from the file under such condition? Thank you.

Related

How to convert the .wav file to tfrecord file?

I am new in the tensorflow part and hope someone can help me.
I've seen this document https://www.tensorflow.org/api_docs/python/tf/audio/decode_wav and executed tf.audio.decode_wav( contents, desired_channels=-1, desired_samples=-1, name=None )
The only thing I've changed is the contents, changing to my path name.
But still get an error!
Any methods to convert it? And I want to output something like this. .tfrecord-00000-of-00008
Read the .wav file into a string of bytes and then decode it:
import tensorflow as tf
wav_contents = tf.io.read_file("file.wav")
audio, sample_rate = tf.audio.decode_wav(contents=wav_contents)
audio.shape
This example was borrowed from the TensorFlow tutorial on reading audio files.

COCO json annotation to YOLO txt format

how to convert a single COCO JSON annotation file into a YOLO darknet format?? like below
each individual image has separate filename.txt file
My classmates and I have created a python package called PyLabel to help others with this task and other labelling tasks.
Our package does this conversion! You can see an example in this notebook https://github.com/pylabel-project/samples/blob/main/coco2yolov5.ipynb.
You're answer should be in there! But you should be able to do this conversion by doing something like:
!pip install pylabel
from pylabel import importer
dataset = importer.ImportCoco(path=path_to_annotations, path_to_images=path_to_images)
dataset.export.ExportToYoloV5(dataset)
You can find the source code that is used behind the scenes here https://github.com/pylabel-project/
I built a tool
https://github.com/tw-yshuang/coco2yolo
Download this repo and use the following command:
python3 coco2yolo.py [OPTIONS]
coc2yolo
Usage: coco2yolo.py [OPTIONS] [CAT_INFOS]...
Options:
-ann-path, --annotations-path TEXT
JSON file. Path for label. [required]
-img-dir, --image-download-dir TEXT
The directory of the image data place.
-task-dir, --task-categories-dir TEXT
Build a directory that follows the task-required categories.
-cat-t, --category-type TEXT Category input type. (interactive | file) [default: interactive]
-set, --set-computing-type TEXT
Set Computing for the data. (union | intersection) [default: union]
--help Show this message and exit.
There is an open-source tool called makesense.ai for annotating your images. You can download YOLO txt format once you annotate your images. But you won't be able to download the annotated images.
There is three ways.
use roboflow https://roboflow.com/formats (You can find another solution also)
You can find some usage guide for roboflow. e.g.
https://medium.com/red-buffer/roboflow-d4e8c4b52515
search 'convert coco format to yolo format' -> you will find some open-source codes to convert annotations to yolo format.
write your own code to convert coco format to yolo format

How can I open a large parquet file with Keras?

I've tried looking for this and haven't had any meaningful results.
I have a keras model that has multi input and my data was getting too large for my pandas approach so I preprocessed it and saved it parquet file. I'm not sure how to open it with keras.
I looked up tf.datasets but I still cannot figure out how to read a parquet file that I can pass to my model.
Does anyone know how to use open parquet files? I can't seem to figure out how to do this in tensorflow and can't find anything related to it in keras.
You can probably keep your pandas approach, but you would have to breakdown your data into chunks.
If you have already broken it down to create your parquet file, you should be able to use the same method to have only a subset of your data opened in pandas at a time.
If you need to extract the data from your parquet file here's a link on how to create chunks of data for a pandas dataframe:
How to read a CSV file subset by subset with Pandas?
Once you have a chunk of data you can call model.fit on that chunk of data and then go on to the next chunk and call model.fit
You can look into TensorFlow I/O which is a collection of file systems and file formats that are not available in TensorFlow's built-in support. Here you can find functionalities such tfio.IODataset.from_parquet, and also tfio.IOTensor.from_parquet to work with the parquet file formats.
!pip install tensorflow_io -U -q
import tensorflow_io as tfio
df = pd.DataFrame({"data": tf.random.normal([20], 0, 1, tf.float32),
"label": np.random.randint(2, size=(20))})
df.to_parquet("df.parquet")
pd.read_parquet('/content/df.parquet')[:2]
data label
0 0.721347 1
1 -1.215225 1
ds = tfio.IODataset.from_parquet('/content/df.parquet')
ds
FYI, I think you should also consider using the feather format rather than the parquet file format, AFAIK, the parquet file can be really heavy to load and can slow down your training pipelines, whereas feather is comparatively fast (very fast).

Feeding .npy (numpy files) into tensorflow data pipeline

Tensorflow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. each file contains 2 different ndarrays as features and a scalar as their label.
It is actually possible to read directly NPY files with TensorFlow instead of TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose that a float32 NPY file containing an array with shape (N, K) is given, and you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header and followed by the raw array data (object arrays are different, but we're considering numbers now). In short, you can find the size of this header with a function like this:
def npy_header_offset(npy_path):
with open(str(npy_path), 'rb') as f:
if f.read(6) != b'\x93NUMPY':
raise ValueError('Invalid NPY file.')
version_major, version_minor = f.read(2)
if version_major == 1:
header_len_size = 2
elif version_major == 2:
header_len_size = 4
else:
raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
header = f.read(header_len)
if not header.endswith(b'\n'):
raise ValueError('Invalid NPY file.')
return f.tell()
With this you can create a dataset like this:
import tensorflow as tf
npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)
Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:
dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))
The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))
Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
The limitation is that you had to know the number of features in advance. It is possible to extract it from the NumPy header, though, just a bit of a pain, and in any case very hardly from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it is, the solution requires you to either use only one file per dataset or files that have the same header size, although if you know that all the arrays have the same size that should actually be the case.
Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
You can do it with tf.py_func, see the example here.
The parse function would simply decode the filename from bytes to string and call np.load.
Update: something like this:
def read_npy_file(item):
data = np.load(item.decode())
return data.astype(np.float32)
file_list = ['/foo/bar.npy', '/foo/baz.npy']
dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:
Consuming NumPy arrays
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
features = data["features"]
labels = data["labels"]
# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
In the case that the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into a TFRecord format, and then use the TFRecord data set format, which can be streamed without fully loading into memory.
Here is a post with some instructions.
FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain Tensorflow.
If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with, and preferably you should reformat the data first, either as TFRecord or as multiple npy files, and then use other methods.
Problem setup
I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.
Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.
Workaround
I came across TensorFlow's official tutorial on show attend and tell which had a great workaround for the problem this thread (and I) were having.
Load numpy files
First off we need to create a mapping function that accepts the .npy file name and returns the numpy array.
# Load the numpy files
def map_func(feature_path):
feature = np.load(feature_path)
return feature
Use the tf.numpy_function
With the tf.numpy_function we can wrap any python function and use it as a TensorFlow op. The function must accept numpy object (which is exactly what we want).
We create a tf.data.Dataset with the list of all the .npy filenames.
dataset = tf.data.Dataset.from_tensor_slices(feature_paths)
We then use the map function of the tf.data.Dataset API to do the rest of our task.
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item: tf.numpy_function(
map_func, [item], tf.float16),
num_parallel_calls=tf.data.AUTOTUNE)

Use tf.TextLineReader to read to a np.array in TensorFlow

I need to read a file in my train module into a np.array (i want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader() but i can´t get them to just output the rows to a np.array.
Is it possible?
(why not just read a file with open? I´m training in GCS and want to get the file from storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API, which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
file_contents = f.read()