Dask array from_npy_stack is missing the info file - numpy

Action
I am trying to create a Dask array from a stack of .npy files that were not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when an .npy stack is written with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:
import os, pickle

def to_npy_info(dirname, dtype, chunks, axis):
    # writes the metadata that from_npy_stack() later loads from 'info'
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da
data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2, 2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)
data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'

The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you can simply reuse the lines of code that run after the "info" file has been loaded, provided you have the right values for them. Some of those values, i.e., the dtype and the shape of each array (for building chunks), could presumably be obtained by inspecting the first of the data files:
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
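Putting the two snippets together, here is a minimal sketch for your example, under the assumption that every .npy file holds a full array of identical shape and dtype and that the files should be stacked along axis 0; it simply writes the 'info' file itself (mirroring to_npy_info above) and then calls from_npy_stack as usual:
import os
import pickle
import numpy as np
import dask.array as da
data_dir = '/home/tom/data'
files = sorted(f for f in os.listdir(data_dir) if f.endswith('.npy'))
# Infer dtype and per-file shape from the first file.
first = np.load(os.path.join(data_dir, files[0]), mmap_mode='r')
# One block per file along axis 0, whole arrays along the remaining axes.
chunks = ((first.shape[0],) * len(files),) + tuple((s,) for s in first.shape[1:])
with open(os.path.join(data_dir, 'info'), 'wb') as f:
    pickle.dump({'chunks': chunks, 'dtype': first.dtype, 'axis': 0}, f)
x = da.from_npy_stack(data_dir)
This relies on the files being named 0.npy, 1.npy, ... as in your example, because from_npy_stack constructs those names itself.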

Related

How to decode a .csv .gzip file containing tweets?

I'm trying to do a Twitter sentiment analysis and my dataset is a couple of .csv.gzip files.
This is what I did to convert them all to one dataframe.
(I'm using Google Colab, in case that has anything to do with the error, the filename or something.)
apr_files = [file[9:] for file in csv_collection if re.search(r"04+", file)]
apr_files
Output:
['0428_UkraineCombinedTweetsDeduped.csv.gzip',
'0430_UkraineCombinedTweetsDeduped.csv.gzip',
'0401_UkraineCombinedTweetsDeduped.csv.gzip']
temp_list = []
for file in apr_files:
print(f"Reading in {file}")
# unzip and read in the csv file as a dataframe
temp = pd.read_csv(file, compression="gzip", header=0, index_col=0)
# append dataframe to temp list
temp_list.append(temp)
Error:
Reading in 0428_UkraineCombinedTweetsDeduped.csv.gzip
Reading in 0430_UkraineCombinedTweetsDeduped.csv.gzip
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (15) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Reading in 0401_UkraineCombinedTweetsDeduped.csv.gzip
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-26-5cba3ca01b1e> in <module>()
3 print(f"Reading in {file}")
4 # unzip and read in the csv file as a dataframe
----> 5 tmp_df = pd.read_csv(file, compression="gzip", header=0, index_col=0)
6 # append dataframe to temp list
7 tmp_df_list.append(tmp_df)
8 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 8048: invalid start byte
I assumed that this error might be because the tweets contain special characters (like emoji, non-English characters, etc.).
I just switched to Jupyter Notebook, and it worked fine there.
As of now, I don't know what the issue with Google Colab was, though.
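If anyone hits the same error and can't simply switch environments, a minimal diagnostic sketch is shown below; the latin-1 fallback is only an assumption about the file's encoding, not something the traceback proves, and the main goal is to check whether the download is corrupted or just not UTF-8:
import gzip
import pandas as pd
path = "0401_UkraineCombinedTweetsDeduped.csv.gzip"
# Look at the raw decompressed bytes to see whether the file is valid UTF-8
# or was truncated/corrupted during download.
with gzip.open(path, "rb") as f:
    raw = f.read()
print(raw[8000:8100])
# If the file simply uses a different encoding, pass it explicitly.
df = pd.read_csv(path, compression="gzip", header=0, index_col=0, encoding="latin-1")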

Writing pandas dataframe to excel in dbfs azure databricks: OSError: [Errno 95] Operation not supported

I am trying to write a pandas dataframe to the local file system in azure databricks:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
    data.to_excel(writer)
Then I get the following error message:
OSError: [Errno 95] Operation not supported
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
in <module>
      3 data = pd.read_csv(url)
      4 with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
----> 5 data.to_excel(writer)

/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in __exit__(self, exc_type, exc_value, traceback)
    892
    893     def __exit__(self, exc_type, exc_value, traceback):
--> 894         self.close()
    895
    896     def close(self):

/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in close(self)
    896     def close(self):
    897         """synonym for save, to make it more file-like"""
--> 898         content = self.save()
    899         self.handles.close()
    900         return content
I read about some limitations for mounted file systems in this post: Pandas: Write to Excel not working in Databricks
But if I got it right, the solution is to write to the local workspace file system, which is exactly what is not working for me.
My user is a workspace admin and I am using a standard cluster with the 10.4 runtime.
I also verified that I can write a csv file to the same location using pd.to_csv.
What could be missing?
Databricks has a limitation that does not allow random write operations into DBFS, as indicated in the SO thread you are referring to.
So, a workaround for this is to write the file to the local file system (file:/) first and then move it to the required location inside DBFS. You can use the following code:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'export.xlsx', engine="openpyxl") as writer:
    # file will be written to /databricks/driver/ i.e., the local file system
    data.to_excel(writer)
dbutils.fs.ls("/databricks/driver/") indicates that the path you want to use to list the files is dbfs:/databricks/driver/ (absolute path) which does not exist.
/databricks/driver/ belongs to the local file system (DBFS is a part of this). The absolute path of /databricks/driver/ is file:/databricks/driver/. You can list the contents of this path by using either of the following:
import os
print(os.listdir("/databricks/driver/"))
#OR
dbutils.fs.ls("file:/databricks/driver/")
So, use the file located in this path and move (or copy) it to your destination using the shutil library, as follows:
from shutil import move
move('/databricks/driver/export.xlsx','/dbfs/tmp/export.xlsx')
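Alternatively (a hedged sketch, not from the original answer), dbutils can do the copy itself using the same file:/ and dbfs:/ prefixes discussed above:
# copy from the driver's local disk into DBFS
dbutils.fs.cp("file:/databricks/driver/export.xlsx", "dbfs:/tmp/export.xlsx")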

using tfrecord but getting file too large

I am trying to create a TFRecord file from a folder of numpy arrays; the folder contains about 2000 numpy files of 50 MB each.
def convert(image_paths, out_path):
    # Args:
    #   image_paths  List of file-paths for the images.
    #   labels       Class-labels for the images.
    #   out_path     File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths and class-labels.
        for i, path in enumerate(image_paths):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images - 1)
            # Load the array from the .npy file.
            img = np.load(path)
            # Convert the image to raw bytes.
            img_bytes = img.tostring()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes)
            }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
I think it converts about 200 files and then I get this:
Converting: tf.recordtrain
- Progress: 3.6%Traceback (most recent call last):
File "tf_record.py", line 71, in <module>
out_path=path_tfrecords_train)
File "tf_record.py", line 54, in convert
writer.write(serialized)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/tf_record.py", line 236, in write
self._writer.WriteRecord(record, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: tf.recordtrain; File too large
Any suggestions to fix this would be helpful. Thanks in advance.
I'm not sure what the limits on TFRecord files are, but the more common approach, assuming you have enough disk space, is to store your dataset across several TFRecord files, e.g. store every 20 numpy files in a different TFRecord file; see the sketch below.
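A minimal sketch of that sharding idea, reusing the convert() function and the image_paths list from the question (the shard size of 20 and the output naming pattern are arbitrary assumptions):
shard_size = 20  # number of .npy files per TFRecord shard (arbitrary choice)
for shard_idx in range(0, len(image_paths), shard_size):
    shard_paths = image_paths[shard_idx:shard_idx + shard_size]
    out_path = "train_{:04d}.tfrecord".format(shard_idx // shard_size)
    convert(image_paths=shard_paths, out_path=out_path)
Each shard then stays well below whatever per-file limit the file system or writer is hitting.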

Convert Pandas DataFrame to bytes-like object

Hi, I am trying to convert my df to binary and store it in a variable.
my_df:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
my code:
import io
towrite = io.BytesIO()
df.to_excel(towrite) # write to BytesIO buffer
towrite.seek(0) # reset pointer
I am getting AttributeError: '_io.BytesIO' object has no attribute 'write_cells'
Full Traceback:
AttributeError Traceback (most recent call last)
<ipython-input-25-be6ee9d9ede6> in <module>()
1 towrite = io.BytesIO()
----> 2 df.to_excel(towrite) # write to BytesIO buffer
3 towrite.seek(0) # reset pointer
4 encoded = base64.b64encode(towrite.read()) #
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
1422 formatter.write(excel_writer, sheet_name=sheet_name, startrow=startrow,
1423 startcol=startcol, freeze_panes=freeze_panes,
-> 1424 engine=engine)
1425
1426 def to_stata(self, fname, convert_dates=None, write_index=True,
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\excel.py in write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine)
624
625 formatted_cells = self.get_formatted_cells()
--> 626 writer.write_cells(formatted_cells, sheet_name,
627 startrow=startrow, startcol=startcol,
628 freeze_panes=freeze_panes)
AttributeError: '_io.BytesIO' object has no attribute 'write_cells'
I solved the issue by upgrading pandas to a newer version.
import io
towrite = io.BytesIO()
df.to_excel(towrite) # write to BytesIO buffer
towrite.seek(0)
print(towrite)
b''
print(type(towrite))
_io.BytesIO
If you want to see the bytes-like object, use getvalue():
print(towrite.getvalue())
b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00!\x00<\xb
Pickle
Pickle is a reproducible format for a Pandas dataframe, but it's only for internal use among trusted users; it's not for sharing with untrusted users, for security reasons.
import pickle
# Export:
my_bytes = pickle.dumps(df, protocol=4)
# Import:
df_restored = pickle.loads(my_bytes)
This was tested with Pandas 1.1.2. Unfortunately it failed for a very large dataframe, but then what worked was pickling and parallel-compressing each column individually, followed by pickling this list. Alternatively, you can pickle chunks of the large dataframe.
CSV
If you must use a CSV representation:
df.to_csv(index=False).encode()
Note that various datatypes are lost when using CSV.
Parquet
See this answer. Note that various datatypes are converted when using parquet.
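A minimal sketch of getting parquet bytes in memory (this assumes a reasonably recent pandas with pyarrow installed; it is not part of the original answer):
import io
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
buf = io.BytesIO()
df.to_parquet(buf, engine='pyarrow')  # write parquet into the in-memory buffer
parquet_bytes = buf.getvalue()
df_restored = pd.read_parquet(io.BytesIO(parquet_bytes))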
Excel
Avoid its use for the most part because it limits the max number of rows and columns.
I needed to upload the file object to S3 via boto3, which didn't accept the pandas bytes object. So, building on the answer from Asclepius, I cast the object to a BytesIO, e.g.:
from io import BytesIO
data = BytesIO(df.to_csv(index=False).encode('utf-8'))
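For completeness, a hedged sketch of the upload itself (the bucket and key names are placeholders, not from the original answer):
import boto3
from io import BytesIO
data = BytesIO(df.to_csv(index=False).encode('utf-8'))
s3 = boto3.client('s3')
# upload_fileobj expects a binary file-like object
s3.upload_fileobj(data, 'my-bucket', 'exports/df.csv')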

Concurrently read an HDF5 file in Pandas

I have a data.h5 file organised in multiple chunks, the entire file being several hundred gigabytes. I need to work with a filtered subset of the file in memory, in the form of a Pandas DataFrame.
The goal of the following routine is to distribute the filtering work across several processes, then concatenate the filtered results into the final DataFrame.
Since reading from the file takes a significant amount of time, I'm trying to make each process read its own chunk in a concurrent manner as well.
import multiprocessing as mp, pandas as pd
store = pd.HDFStore('data.h5')
min_dset, max_dset = 0, len(store.keys()) - 1
dset_list = list(range(min_dset, max_dset))
frames = []
def read_and_return_subset(dset):
    # each process is intended to read its own chunk in a concurrent manner
    chunk = store.select('batch_{:03}'.format(dset))
    # and then process the chunk, do the filtering, and return the result
    output = chunk[chunk.some_condition == True]
    return output
with mp.Pool(processes=32) as pool:
    for frame in pool.map(read_and_return_subset, dset_list):
        frames.append(frame)
df = pd.concat(frames)
However, the above code triggers this error:
HDF5ExtError Traceback (most recent call last)
<ipython-input-174-867671c5a58f> in <module>()
53
54 with mp.Pool(processes=32) as pool:
---> 55 for frame in pool.map(read_and_return_subset, dset_list):
56 frames.append(frame)
57
/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):
/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
HDF5ExtError: HDF5 error back trace
File "H5Dio.c", line 173, in H5Dread
can't read data
File "H5Dio.c", line 554, in H5D__read
can't read data
File "H5Dchunk.c", line 1856, in H5D__chunk_read
error looking up chunk address
File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
can't query chunk address
File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
can't get chunk info
File "H5B.c", line 340, in H5B_find
unable to load B-tree node
File "H5AC.c", line 1262, in H5AC_protect
H5C_protect() failed.
File "H5C.c", line 3574, in H5C_protect
can't load entry
File "H5C.c", line 7954, in H5C_load_entry
unable to load entry
File "H5Bcache.c", line 143, in H5B__load
wrong B-tree signature
End of HDF5 error back trace
Problems reading the array data.
It seems that Pandas/PyTables has trouble when trying to access the same file concurrently, even if it's only for reading.
Is there a way to be able to make each process read its own chunk concurrently?
IIUC, you can index the columns that are used for filtering the data (chunk.some_condition == True in your sample code) and then read only the subset of rows that satisfies the needed conditions.
In order to be able to do that you need to:
save the HDF5 file in table format - use the parameter format='table'
index the columns that will be used for filtering - use the parameter data_columns=['col_name1', 'col_name2', etc.]
After that you should be able to filter your data just by reading (a write-side sketch follows the example):
store = pd.HDFStore(filename)
df = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")
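A minimal sketch of the write side that makes the where-based select above work (the key and column names are placeholders matching the example, and the toy batch stands in for your real chunks):
import pandas as pd
# Toy batch standing in for one chunk of the real data.
df_batch = pd.DataFrame({'col1': [11, 12, 13], 'col2': ['AAA', 'BBB', 'AAA']})
# Write in table format with the filter columns indexed, so that
# store.select(..., where=...) can push the filter down into the HDF5 query.
with pd.HDFStore('data.h5') as store:
    store.append('key_name', df_batch, format='table',
                 data_columns=['col1', 'col2'])
# Read back only the matching rows.
with pd.HDFStore('data.h5') as store:
    df = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")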