Creating 2D numpy array from binary file? - numpy

I am using the following data set which is a binary file from: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Data looks like this:
1350423,5,10,10,8,5,5,7,10,1,4
1352848,3,10,7,8,5,8,7,4,1,4
1353092,3,2,1,2,2,1,3,1,1,2
1354840,2,1,1,1,2,1,3,1,1,2
1354840,5,3,2,1,3,1,1,1,1,2
1355260,1,1,1,1,2,1,2,1,1,2
1365075,4,1,4,1,2,1,1,1,1,2
1365328,1,1,2,1,2,1,2,1,1,2
1368267,5,1,1,1,2,1,1,1,1,2
1368273,1,1,1,1,2,1,1,1,1,2
1368882,2,1,1,1,2,1,1,1,1,2
Binary file has 699 such lines.
I am then using the below code to get the data from binary file saved as 'sample.data' and save it in a 2D numpy array:
import numpy as np
def main():
    dtype = np.dtype('i8')
    b = np.fromfile('sample.data', dtype=dtype)
    ndata = np.array(b)
    print(ndata.shape)
main()
I am looking to get a (699, 11) array, i.e. each line of the file becomes one row of the array and each comma-separated value becomes one element of that row.
What can I do to achieve this?

I downloaded the file http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/unformatted-data and then did this:
ndata = np.genfromtxt('unformatted-data', skip_header=16, delimiter=',')
The array ndata has shape (699, 11).
Be careful: the file contains several groups of data (separated by comment lines that numpy ignores), so it might not make sense to lump them all together.
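The np.fromfile call in the question, with dtype 'i8', reinterprets the raw bytes of the file as 8-byte integers, whereas the lines shown are comma-separated text. As a minimal sketch, assuming the rows look like the sample in the question, the same genfromtxt approach can be tried on an in-memory copy of a few lines. The sample string and the variable names are illustrative only.

```python
import io

import numpy as np

# Three rows copied from the sample shown in the question.
sample = """1350423,5,10,10,8,5,5,7,10,1,4
1352848,3,10,7,8,5,8,7,4,1,4
1353092,3,2,1,2,2,1,3,1,1,2
"""

# Parse the text as comma-separated integers; each line becomes one row.
ndata = np.genfromtxt(io.StringIO(sample), delimiter=',', dtype=np.int64)
print(ndata.shape)
```

Passing a file name instead of the StringIO object should work the same way for a file on disk.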

Related

read numpy array from txt file by chunks

I have a 2D array saved in a text file (10 million rows). Because it is too large, I need to load it in chunks, say 1000 lines at a time (as a batch of training data for a neural network). I followed this:
Read specific lines from text file as numpy array
It works, but it is far too slow. Is there another way to do this?
from itertools import islice
import numpy as np
data = np.ones((10000000, 100))
# This is sample data
# I saved the data using
outfile = open('data.txt', 'wb')
np.savetxt(outfile, data)
outfile.close()
# Now load the data
file = open('data.txt', 'rb')
array = np.genfromtxt(islice(file, 1000000, 1000005))
Or is there another, faster way to save and load the data in chunks?
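One option that may be faster than re-parsing text for every batch, sketched here under the assumption that the data can be re-saved once in NumPy's binary .npy format: np.save writes the array, and np.load with mmap_mode='r' maps the file so that only the rows being sliced are read from disk. The small array, the temporary path, and batch_size below are illustrative stand-ins for the 10-million-row data.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the large array in the question.
data = np.arange(50, dtype=np.float64).reshape(10, 5)

# Save once in binary .npy format instead of text.
path = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(path, data)

# Memory-map the file: nothing is loaded until a slice is taken.
mm = np.load(path, mmap_mode='r')

batch_size = 4  # hypothetical batch size
batches = []
for start in range(0, mm.shape[0], batch_size):
    # np.array(...) copies just this slice into ordinary memory.
    batches.append(np.array(mm[start:start + batch_size]))

print(len(batches), batches[0].shape, batches[-1].shape)
```

Whether this helps in practice depends on the disk and on how the batches are consumed, so it is worth timing against the islice approach.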

how to compress lists/nested lists in hdf5

I recently learned about HDF5 compression and started working with it. It has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists of strings like the following:
def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
    f.close()
However I got this error:
f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with HDF5?
This is a more general answer for nested lists where each nested list is a different length. It also works for the simpler case where the nested lists are of equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset sized for the longest sublist and pad the "short" sublists with extra elements.
You will get 'None' (or a truncated substring of it) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. This example shows how to find the length of the longest string in the list (as slen) and use it to create dtype='S##'.
import h5py
import numpy as np
test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]
# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))
with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
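If the 'None' placeholders are unwanted, one possible variation (an assumption, not part of the answer above) is to pad the short sublists with empty strings instead. The sketch below covers only the NumPy padding step, using a shortened test_list; the resulting array could then be passed to create_dataset in the same way.

```python
import numpy as np

# Shortened, illustrative version of the nested list.
test_list = [['a01', 'a02'],
             ['a11', 'a12', 'a13']]

# Longest string (sets the 'S' width) and longest sublist (sets the row length).
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))

# Pad each short sublist with empty strings so every row has arrlen entries.
padded = np.array([tl + [''] * (arrlen - len(tl)) for tl in test_list],
                  dtype='S' + str(slen))
print(padded.shape, padded.dtype)
```

Empty byte strings are then easy to filter out when the data is read back.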
PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of adding 'None' values for "short" substrings. Also, you don't have to determine the array length in advance, as the number of rows is not defined when VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above.
import tables as tb
import numpy as np
test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]
slen = max(len(item) for sublist in test_list for item in sublist)
with tb.File('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/','vla_test', tb.StringAtom(slen) )
    for slist in test_list:
        arr = np.array(slist,dtype='S'+str(slen))
        vlarray.append(arr)
    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as this is not required when using a context manager (with / as: structure)
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
import h5py
import numpy as np
test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')
data_list = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
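The dtype difference this answer describes can be checked with NumPy alone. The following is a small sketch (the variable names are illustrative) showing that a list of Python strings becomes a Unicode array by default, and how it can be converted to the fixed-width byte-string dtype before being handed to h5py:

```python
import numpy as np

test_list = ['a1', 'a2', 'a1', 'a2']

# Default conversion: NumPy picks a Unicode dtype (kind 'U').
unicode_array = np.array(test_list)
print(unicode_array.dtype.kind)

# Explicit conversion to fixed-width byte strings (kind 'S').
bytes_array = unicode_array.astype('S2')
print(bytes_array.dtype.kind)

# After reading back, the bytes can be converted to str again.
decoded = bytes_array.astype('U2')
print(list(decoded))
```

The astype('S2') conversion assumes the strings are plain ASCII; other text would need an explicit encoding step.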

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have spent days trying to figure this out; hopefully someone can help.
I am loading a .mat file into Python using scipy.io and placing the struct into a DataFrame, which will then be used in TensorFlow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is 43249x6. Each cell in the 'A' column is a matrix of a different size, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own separate cell. Each cell in 'types' contains 12 columns of numbers. I have no issues with 'nz' and 'np'.
I want to put all columns into a DataFrame and use column 'A', 'Ln', or 'types' as the labels and 'nz' and 'np' as the features; again, I have no issues with the latter. Can anyone help with this or suggest a workaround?
The end goal is to have TensorFlow train on 'nz' and 'np' and give me either a matrix, 'Ln', or 'types'.
What type of data does your .mat file contain? Is your application very time-critical?
If you can collect all your data in a struct, you could give jsonencode a try: write the struct to a JSON file and load it back into Python via json (see the json documentation on loading data).
Then you can create a pandas DataFrame via
pd.DataFrame.from_dict()
Of course, this would only be a workaround. You would still have to ensure that the data in the MATLAB struct is correctly ordered before it is imported and transferred to a DataFrame.
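The Python side of this workaround might look like the sketch below. It assumes the struct has already been exported from MATLAB as JSON with one list per field; the json_text string and its values are made up for illustration and do not come from the asker's file.

```python
import json

import pandas as pd

# Hypothetical JSON text, standing in for the output of MATLAB's jsonencode.
json_text = '{"nz": [1, 2, 3], "np": [4, 5, 6]}'

# Parse the JSON into a dict of lists, then build the DataFrame from it.
records = json.loads(json_text)
df = pd.DataFrame.from_dict(records)
print(df.dtypes)
```

Fields that hold matrices of different sizes, such as 'A', would still arrive as nested lists and end up in object columns.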
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. It's very simple and drawn out, but it got the job done. However, it does not work in TensorFlow, because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs and then load them into stellargraph.

Tensorflow: Load unknown TFRecord dataset

I got a TFRecord data file filename = train-00000-of-00001 which contains images of unknown size and maybe other information as well. I know that I can use dataset = tf.data.TFRecordDataset(filename) to open the dataset.
How can I extract the images from this file to save it as a numpy-array?
I also don't know if there is any other information saved in the TFRecord file such as labels or resolution. How can I get these information? How can I save them as a numpy-array?
I normally only use numpy-arrays and am not familiar with TFRecord data files.
1.) How can I extract the images from this file to save it as a numpy-array?
What you are looking for is this:
record_iterator = tf.python_io.tf_record_iterator(path=filename)
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)
    # Exit after 1 iteration as this is purely demonstrative.
    break
2.) How can I get these information?
Here is the official documentation. I strongly suggest that you read the documentation because it goes step by step in how to extract the values that you are looking for.
Essentially, you have to convert example to a dictionary. So if I wanted to find out what kind of information is in a tfrecord file, I would do something like this (in context with the code stated in the first question): dict(example.features.feature).keys()
3.) How can I save them as a numpy-array?
I would build upon the for loop mentioned above. So for every loop, it extracts the values that you are interested in and appends them to numpy arrays. If you want, you could create a pandas dataframe from those arrays and save it as a csv file.
But...
You seem to have multiple tfrecord files. tf.data.TFRecordDataset(filename) returns a dataset that is used to train models.
So if you have multiple tfrecord files, you would need a double for loop. The outer loop goes through each file; for that particular file, the inner loop goes through all of the tf.train.Example records.
EDIT:
Converting to np.array()
import tensorflow as tf
import numpy as np
from PIL import Image
import io
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)
    # Get the values in a dictionary
    example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    image_array = np.array(Image.open(io.BytesIO(example_bytes)))
    print(image_array)
    break
Sources for the code above:
Base code
Converting bytes to PIL.JpegImagePlugin.JpegImageFile
Converting from PIL.JpegImagePlugin.JpegImageFile to np.array
Official Documentation for PIL
EDIT 2:
import tensorflow as tf
from PIL import Image
import io
import numpy as np
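# Load image (the local file name is an assumption; the original left `path` undefined)
cat_in_snow = tf.keras.utils.get_file('320px-Felis_catus-cat_on_snow.jpg', 'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')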
#------------------------------------------------------Convert to tfrecords
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def image_example(image_string):
    feature = {
        'image_raw': _bytes_feature(image_string),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
with tf.python_io.TFRecordWriter('images.tfrecords') as writer:
    image_string = open(cat_in_snow, 'rb').read()
    tf_example = image_example(image_string)
    writer.write(tf_example.SerializeToString())
#------------------------------------------------------
#------------------------------------------------------Begin Operation
# Read back the file written above
record_iterator = tf.python_io.tf_record_iterator(path='images.tfrecords')
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)
    # OPTION 1: convert bytes to arrays using PIL and IO
    example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    PIL_array = np.array(Image.open(io.BytesIO(example_bytes)))
    # OPTION 2: convert bytes to arrays using Tensorflow
    with tf.Session() as sess:
        TF_array = sess.run(tf.image.decode_jpeg(example_bytes, channels=3))
    break
#------------------------------------------------------
#------------------------------------------------------Compare results
(PIL_array.flatten() != TF_array.flatten()).sum()
PIL_array == TF_array
PIL_img = Image.fromarray(PIL_array, 'RGB')
PIL_img.save('PIL_IMAGE.jpg')
TF_img = Image.fromarray(TF_array, 'RGB')
TF_img.save('TF_IMAGE.jpg')
#------------------------------------------------------
Remember that tfrecords is simply a way of storing information so that tensorflow models can read it efficiently.
I use PIL and IO to convert the bytes to an image. IO takes the bytes and converts them to a file-like object that PIL.Image can then read.
Yes, there is a pure tensorflow way to do it: tf.image.decode_jpeg
Yes, there is a difference between the two approaches when you compare the two arrays.
Which one should you pick? Tensorflow is not the way to go if you are worried about accuracy, as stated in Tensorflow's GitHub: "The TensorFlow-chosen default for jpeg decoding is IFAST, sacrificing image quality for speed". Credit for this information belongs to this post.
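The PIL and IO step can be tried without a tfrecord file. The sketch below is illustrative only: it builds a small synthetic image, encodes it to bytes in memory, and decodes those bytes back through io.BytesIO, which is the same path the answer uses for example_bytes. PNG is used here only because it is lossless, so the round trip can be compared exactly; the images in the answer are JPEG.

```python
import io

import numpy as np
from PIL import Image

# A small synthetic RGB image (stand-in for a stored image).
pixels = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)

# Encode it to PNG bytes in memory; these play the role of example_bytes.
buffer = io.BytesIO()
Image.fromarray(pixels).save(buffer, format='PNG')
image_bytes = buffer.getvalue()

# Wrap the bytes in a file-like object so PIL can open them, then convert to an array.
decoded = np.array(Image.open(io.BytesIO(image_bytes)))
print(decoded.shape, decoded.dtype)
```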

Change data type in Numpy and Nibabel

I'm trying to convert NumPy arrays into the NIfTI file format using Nibabel. Some of my NumPy arrays have dtype('<i8') when it should be dtype('uint8'), as reported when I query the data type through Nibabel:
arr.get_data_dtype()
Does anyone know how to convert and save Numpy arrays' data type?
The question in the title is slightly different from the question in the text. So...
If you want to change the data type of a numpy array arr to np.int8, you are looking for arr.astype(np.int8).
Mind that you may lose precision due to data casting (see the astype documentation).
To save it afterwards, you may want to see ?np.save and ?np.savetxt (or check the pickle library, to save more general objects than numpy arrays).
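As a small illustration of the casting caveat, using the '<i8' to uint8 conversion from the question (the array values are made up), a value outside the 0 to 255 range wraps around silently under astype, so a range check before casting can be worthwhile:

```python
import numpy as np

# Illustrative 64-bit integer array; 300 does not fit in uint8.
arr = np.array([0, 127, 255, 300], dtype='<i8')

# Check whether every value fits in the target type before casting.
info = np.iinfo(np.uint8)
fits = bool(arr.min() >= info.min and arr.max() <= info.max)
print(fits)

# astype does not raise here; out-of-range values wrap modulo 256.
as_uint8 = arr.astype(np.uint8)
print(as_uint8.dtype, as_uint8)
```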
If you want to change the data-type of a nifti image saved in my_image.nii.gz
you have to go for:
import nibabel as nib
import numpy as np
image = nib.load('my_image.nii.gz')
# to be extra sure of not overwriting data
# (np.asanyarray(image.dataobj) replaces image.get_data(), which newer nibabel releases no longer provide):
new_data = np.copy(np.asanyarray(image.dataobj))
hd = image.header
# in case you want to remove nan:
new_data = np.nan_to_num(new_data)
# update data type:
new_dtype = np.int8  # for example to cast to int8.
new_data = new_data.astype(new_dtype)
image.set_data_dtype(new_dtype)
# if nifti1
if hd['sizeof_hdr'] == 348:
    new_image = nib.Nifti1Image(new_data, image.affine, header=hd)
# if nifti2
elif hd['sizeof_hdr'] == 540:
    new_image = nib.Nifti2Image(new_data, image.affine, header=hd)
else:
    raise IOError('Input image header problem')
nib.save(new_image, 'my_image_new_datatype.nii.gz')
Finally if you have a numpy array my_arr and you want to save it into a nifti image with a given data-type np.my_dtype, you can do:
import nibabel as nib
import numpy as np
new_image = nib.Nifti1Image(my_arr, np.eye(4))
new_image.set_data_dtype(np.my_dtype)
nib.save(new_image, 'my_arr.nii.gz')
Hope it helps!
NOTE: If you are using ITK-SNAP, you may want to use np.float32, np.float64, np.uint16, np.uint8, np.int16, or np.int8. Other choices may not produce images that can be opened with this software.
Seems like you could also do
import nibabel
img = nibabel.load(filename)
img.set_data_dtype(dtype)
img.to_filename(new_filename)
You can use nilearn for a tidy solution. Here is an example if you want to change the data type of nifti image to int16:
from nilearn import image
import numpy as np
vol = image.load_img(input_file)
vol = image.new_img_like(vol, np.int16(vol.get_fdata()))
vol.to_filename(output_file)
Datatypes for .nii files can also be specified in the .to_filename() function:
import nibabel as nib
import numpy as np
new_image = nib.Nifti2Image(my_arr, affine)
new_image.to_filename(fn, dtype=np.uint8)