How to compress lists/nested lists in HDF5

I recently learned about HDF5 compression and have been working with it; it has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists that contain strings, as follows:
import h5py

def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
    f.close()
However, I got this error:
    f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with HDF5?

This is a more general answer for nested lists where each nested list has a different length. It also works for the simpler case where the nested lists are of equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset sized for the longest sublist and pad the shorter ones.
You will get 'None' (or a substring of it) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. The code below shows how to find the longest string in the list (as slen=##) and uses it to create dtype='S##'.
import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))

test_array = np.array([tl + [None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))

with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
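If you later want the original nested lists back, you can filter out the padding when you read. A minimal sketch, assuming the None padding shows up as b'None' truncated to the string width (b'Non' here, since slen=3), as described above:
import h5py

with h5py.File('example_nested.h5', 'r') as f:
    padded = f['test3'][:]            # 2-D array of fixed-width byte strings

# The None fillers were stringified and truncated to the dataset's width.
# (This assumes no legitimate value equals the pad string.)
pad = b'None'[:padded.dtype.itemsize]
nested = [[s.decode() for s in row if s != pad] for row in padded]
print(nested)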
PyTables example
PyTables supports ragged 2-D arrays as VLArrays (variable-length arrays). This avoids the complication of adding 'None' values to pad the shorter sublists. Also, you don't have to determine the array length in advance: the number of rows is not defined when the VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above to size the string atom.
import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

slen = max(len(item) for sublist in test_list for item in sublist)

with tb.open_file('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
    for slist in test_list:
        arr = np.array(slist, dtype='S'+str(slen))
        vlarray.append(arr)

    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))

You are close. The data= argument is designed to work with an existing NumPy array. When you use a list, it is converted to an array behind the scenes. That works for a list of numbers. (Note that lists and arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings are at this link: h5py docs: Strings in HDF5
I modified your example slightly to show how to get it to work. Note that I explicitly created a NumPy array of strings and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I also removed the f.close() statement, as it is not required when using a context manager (the with/as construct).
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 than with compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1 (see the quick comparison sketch after the example below).
import h5py
import numpy as np

test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')
data_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
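If you want to see the tradeoff for yourself, here is a quick, hedged sketch: it writes the same random array at compression_opts=1 and 9, then compares file size and read time. The filenames and array are made up for illustration; the numbers will vary with your data.
import os, time
import h5py
import numpy as np

arr = np.random.randint(0, 100, size=1_000_000)
for level in (1, 9):
    fname = 'cmp_level_%d.h5' % level   # hypothetical scratch files
    with h5py.File(fname, 'w') as f:
        f.create_dataset('d', data=arr, compression='gzip', compression_opts=level)
    t0 = time.perf_counter()
    with h5py.File(fname, 'r') as f:
        _ = f['d'][:]                   # read the whole dataset back
    print(level, os.path.getsize(fname), '%.4fs' % (time.perf_counter() - t0))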

Related

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have spent days trying to figure this out; hopefully someone can help.
I am loading a .mat file into Python using scipy.io and placing the struct into a DataFrame, which will then be used in TensorFlow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF

path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is (43249 x 6). Each cell in the 'A' column is a different-sized matrix, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own separate cell. Each cell in 'types' contains 12 columns of numbers; I have no issues with 'nz' and 'np'.
I want to put all the columns into a DataFrame and use column 'A', 'Ln', or 'types' as the labels, and 'nz' and 'np' as the features (again, I have no issues with the latter). Can anyone help with this or suggest some kind of workaround?
The end goal is to have TensorFlow train on 'nz' and 'np' and give me either a matrix, an 'Ln', or a type.
What type of data does your .mat file contain? Is your application very time-critical?
If you can collect all your data in a struct, you could give jsonencode a try: turn the struct into a JSON file and load it back into Python via json (see the json documentation on loading data).
Then you can create a pandas DataFrame via
pd.DataFrame.from_dict()
Of course this would only be a workaround. You would still have to ensure that the data in your MATLAB struct is correctly ordered before it is exported and transferred to a DataFrame.
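A minimal sketch of that workaround on the Python side, assuming the MATLAB side has already written the struct with jsonencode to a file (the name graphs.json is hypothetical):
import json
import pandas as pd

with open('graphs.json') as fh:
    data = json.load(fh)     # a struct array decodes to a list of dicts

# For a list of dicts, pd.DataFrame(data) builds the frame directly;
# for a single dict of fields, pd.DataFrame.from_dict(data) works as well.
df = pd.DataFrame(data)
print(df.dtypes)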
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped. It's very simple and drawn out, but it got the job done. However, it does not work in TensorFlow, because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs, then load them into StellarGraph; a hedged sketch of that step follows.
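For completeness, a sketch of that last step, assuming data['A'][0] is a dense 2-D numpy adjacency matrix (the indexing is illustrative, not taken from the original code):
import networkx as nx
from stellargraph import StellarGraph

adj = data['A'][0]                       # hypothetical: first graph's adjacency matrix
g_nx = nx.from_numpy_array(adj)          # adjacency matrix -> networkx graph
g_sg = StellarGraph.from_networkx(g_nx)  # networkx graph -> StellarGraph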

Restricting the term-document matrix to most frequent unigrams

The code below is an example of analyzing a massive corpus. I want to restrict the term-document matrix to the 1000 most frequent unigrams, but changing the max_features parameter to n only returns the first n unigrams. Any suggestions?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']
vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus).todense()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names())
df.to_csv('test.csv')
I assume this is a problem with your example, but the sklearn documentation for TfidfVectorizer says the following for max_features:
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
It might be that the first n terms are considered when there are words of equal frequency, but otherwise it should return the correct result. If it still does not work, I strongly suggest opening a bug report in the sklearn repository. However, you can also construct a vocabulary manually yourself (with your own interpretation of "frequency"), by setting the vocabulary option:
vocabulary: Mapping or iterable, default=None
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
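A sketch of that manual route, assuming "frequency" means total raw counts across the corpus: count terms with CountVectorizer, keep the top N, and pass them to TfidfVectorizer via vocabulary=. N=3 here only to fit the toy corpus; use 1000 for the real one.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']
N = 3

cv = CountVectorizer()
counts = cv.fit_transform(corpus)
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)   # terms in column order
totals = np.asarray(counts.sum(axis=0)).ravel()          # corpus-wide count per term
top_terms = [terms[i] for i in totals.argsort()[::-1][:N]]

vectorizer = TfidfVectorizer(vocabulary=top_terms)
X = vectorizer.fit_transform(corpus)
print(top_terms)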

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big DataFrame that I want to write to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np

length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length)})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting the various lines leads me to the following:
Just filling the column with numbers leads to dtype int32, and the frame stores without problems.
Setting one element to "abc" changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len().
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and whether there is a way to prevent it. The only way I found was to go through all columns, check whether they are dtype('O'), and explicitly convert them to str (see the sketch below).
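A minimal sketch of that column-by-column workaround, assuming casting every object column to str is acceptable for your data (the helper name is mine):
import pandas as pd

def stringify_object_columns(df):
    # Cast every object-dtype column to str so to_hdf sees plain strings.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype(str)
    return df

stringify_object_columns(df).to_hdf("df.hdf5", key="data", format="table")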
Instead of using HDF5, I have found a generic pickling library which seems to be perfect for the job: joblib.
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Creating 2D numpy array from binary file?

I am using the following data set which is a binary file from: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Data looks like this:
1350423,5,10,10,8,5,5,7,10,1,4
1352848,3,10,7,8,5,8,7,4,1,4
1353092,3,2,1,2,2,1,3,1,1,2
1354840,2,1,1,1,2,1,3,1,1,2
1354840,5,3,2,1,3,1,1,1,1,2
1355260,1,1,1,1,2,1,2,1,1,2
1365075,4,1,4,1,2,1,1,1,1,2
1365328,1,1,2,1,2,1,2,1,1,2
1368267,5,1,1,1,2,1,1,1,1,2
1368273,1,1,1,1,2,1,1,1,1,2
1368882,2,1,1,1,2,1,1,1,1,2
The binary file has 699 such lines.
I am then using the code below to read the data from the binary file saved as 'sample.data' and store it in a 2D numpy array:
import numpy as np

def main():
    dtype = np.dtype('i8')
    b = np.fromfile('sample.data', dtype=dtype)
    ndata = np.array(b)
    print(ndata.shape)

main()
I am looking to get a (699, 11) array, i.e. each line from the binary file as a numpy row, with each comma-separated value as an element of that row.
What can I do here to achieve this?
I downloaded the file http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/unformatted-data and then did this:
ndata = np.genfromtxt('unformatted-data', skip_header=16, delimiter=',')
The array ndata has shape (699, 11).
Be careful, because there are various groups of data in the file (there are comments in it that numpy is ignoring), and it might not make sense to lump them all together.

Object dtype dtype('O') has no native HDF5 equivalent

Well, it seems a couple of similar questions have been asked here on Stack Overflow, but none of them seems to have been answered correctly or properly, nor do they describe my exact example.
I have a problem saving an array or list to HDF5.
I have several files containing lists of (n, 35) dimensions, where n may be different in each file. Each of them can be saved to HDF5 with the code below.
hdf = hf.create_dataset(fname, data=d)
However, if I want to merge them into a 3-D dataset, the error below occurs.
Object dtype dtype('O') has no native HDF5 equivalent
I have no idea why it turns into dtype object, since all I have done is this:
all_data = list()
for fname in file_list:
    d = np.load(fname)
    all_data.append(d)

hdf = hf.create_dataset('all_data', data=all_data)
How can I save such data?
I tried a couple of tests, and it seems that all_data turns into dtype 'object' when I convert it with
all_data = np.array(all_data)
which has the same problem when saved to HDF5.
Again, how can I save such data in HDF5?
I was running into a similar issue with h5py, and changing the type of the NumPy array using array.astype worked for me (I believe this changes the type from dtype('O') to the data type you specify). See the snippet below:
import numpy as np

print(X.dtype)
# --> dtype('O')
print(X.astype(np.float64).dtype)
# --> dtype('float64')
When I ran h5.create_dataset with this data type conversion, I was able to successfully create the h5 dataset. Hope this helps!
ONE ADDITIONAL UPDATE: I believe the NumPy object dtype 'O' is created when the NumPy array itself has mixed element types (e.g. np.int8 and np.float32).
dtype('O') stands for object. In my case I had a list of lists where the lengths differed, and I got the same error. If you convert it to a numpy array, numpy warns: "Creating an ndarray from ragged nested sequences". HDF5 files can't handle this type of data; for more info, see this post.
This error came up when I used:
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X)
    pfile['peakX'] = np.array(Y)
However, when I specified the dtype when saving the arrays, the problem went away. I guess h5py is not able to create datasets from undefined data types.
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X, dtype=np.float32)
    pfile['peakX'] = np.array(Y, dtype=np.float32)