Does Raku have a similar function to Python's Struct? - raku

Unpacking binary data in Python:
import struct
bytesarray = "01234567".encode('utf-8')
# Return a new Struct object which writes and reads binary data according to the format string.
s = struct.Struct('=BI3s')
s = s.unpack(bytesarray) # Output: (48, 875770417, b'567')
Does Raku have a similar function to Python's Struct? How can I unpack binary data according to a format string in Raku?

There's the experimental unpack
use experimental :pack;
my $bytearray = "01234567".encode('utf-8');
say $bytearray.unpack("A1 L H");
It's not exactly the same, though; this outputs "(0 875770417 35)". You can tweak your way through it a bit, maybe.
There's also an implementation of Perl's pack / unpack in P5pack

Related

how to compress lists/nested lists in hdf5

I recently learned about hdf5 compression and started working with it. It has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists that contain strings, as follows:
import h5py

def write():
    test_array = ['a1','a2','a1','a2','a1','a2','a1','a2',
                  'a1','a2','a1','a2','a1','a2','a1','a2',
                  'a1','a2','a1','a2','a1','a2','a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
    f.close()
However I got this error:
f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with H5?
This is a more general answer for nested lists where each nested list is a different length. It also works for the simpler case when the nested lists are of equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset sized to the longest sublist and pad the "short" sublists.
You will get 'None' (or a pad string) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. The code below finds the longest string in the list (slen) and uses it to create dtype='S'+str(slen).
import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl + [None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))

with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of padding the "short" sublists with 'None' values. Also, you don't have to determine the array length in advance, as the number of rows is not defined when the VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above.
import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

slen = max(len(item) for sublist in test_list for item in sublist)

with tb.File('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
    for slist in test_list:
        arr = np.array(slist, dtype='S'+str(slen))
        vlarray.append(arr)

    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as it is not required when using a context manager (the with/as: structure).
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
import h5py
import numpy as np

test_array = np.array(['a1','a2','a1','a2','a1','a2','a1','a2',
                       'a1','a2','a1','a2','a1','a2','a1','a2',
                       'a1','a2','a1','a2','a1','a2','a1','a2'], dtype='S2')
data_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
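For completeness, here is a minimal read-back sketch (assuming the file and dataset names created above): fixed-length string data stored with dtype='S2' comes back as bytes, so decode it if you want Python strings.
import h5py

with h5py.File('example_file.h5', 'r') as f:
    raw = f['test3'][:]                         # numpy array with dtype 'S2' (bytes)
    strings = [s.decode('utf-8') for s in raw]  # convert back to Python str
    numbers = f['test4'][:]                     # numpy array of integers

print(strings[:4])
print(numbers)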

Object arrays not supported on numpy with mkl?

I recently switched from numpy compiled with OpenBLAS to numpy compiled with MKL. In pure numeric operations there was a clear speed-up for matrix multiplication. However, when I ran some code I have been using, which multiplies matrices containing sympy variables, I now get the error
'Object arrays are not currently supported'
Does anyone have information on why this is the case for MKL and not for OpenBLAS?
From the NumPy 1.17.0 release notes:
Support of object arrays in matmul
It is now possible to use matmul (or the @ operator) with object arrays. For instance, it is now possible to do:
import numpy as np
from fractions import Fraction

a = np.array([[Fraction(1, 2), Fraction(1, 3)], [Fraction(1, 3), Fraction(1, 2)]])
b = a @ a
Are you using @ (matmul or dot)? A numpy array containing sympy objects will be object dtype. Math on object arrays depends on delegating the action to the objects' own methods. It cannot be performed by the fast compiled libraries, which only work with C types such as float and double.
As a general rule you should not be trying to mix numpy and sympy. Math is hit-or-miss, and never fast. Use sympy's own Matrix module, or lambdify the sympy expressions for numeric work.
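As a rough illustration of the lambdify route (the symbols and expression below are made up for this example, not taken from the question):
import numpy as np
import sympy as sp

x, y = sp.symbols('x y')
expr = sp.Matrix([[x, y], [y, x]]) * sp.Matrix([x, y])   # symbolic 2x2 by 2x1 product

# Compile the symbolic result into a plain NumPy function once...
f = sp.lambdify((x, y), expr, 'numpy')

# ...then evaluate it on ordinary floats instead of putting
# sympy objects inside numpy arrays.
print(f(1.0, 2.0))   # -> array([[5.], [4.]])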
What's the MKL version? You may have to explore this with the creator of that compilation.

Arrow ListArray from pandas has very different structure from arrow array generated by awkward?

I encountered the following issue making some tests to demonstrate the usefulness of a pure pyarrow UDF in pyspark as compared to always going through pandas.
import awkward
import numpy
import pandas
import pyarrow

counts = numpy.random.randint(0, 20, size=200000)
content = numpy.random.normal(size=counts.sum())
test_jagged = awkward.JaggedArray.fromcounts(counts, content)
test_arrow = awkward.toarrow(test_jagged)

def awk_arrow(col):
    jagged = awkward.fromarrow(col)
    jagged2 = jagged**2
    return awkward.toarrow(jagged2)

def pds_arrow(col):
    pds = col.to_pandas()
    pds2 = pds**2
    return pyarrow.Array.from_pandas(pds2)

out1 = awk_arrow(test_arrow)
out2 = pds_arrow(test_arrow)
out3 = awkward.fromarrow(out1)
out4 = awkward.fromarrow(out2)
type(out3)
type(out4)
yields
<class 'awkward.array.jagged.JaggedArray'>
<class 'awkward.array.masked.BitMaskedArray'>
and
out3 == out4
yields (at the end of the stack trace):
AttributeError: no column named 'reshape'
looking at the arrays:
print(out3);print();print(out4);
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
You can see the contents and shape of the arrays are the same, but they're not comparable to each other at face value, which is very counterintuitive. Is there a good reason for dense jagged structures with no nulls to be represented as a BitMaskedArray?
All data in Arrow are nullable (at every level), and they use bit masks (as opposed to byte masks) to specify which elements are valid. The specification allows columns of entirely valid data to not write the bitmask, but not every writer takes advantage of that freedom. Quite often, you see unnecessary bitmasks.
When it encounters a bitmask, such as here, awkward inserts a BitMaskedArray.
It could be changed to check whether the mask is unnecessary and skip that step, though that adds an operation that scales with the size of the dataset (likely insignificant in most cases: bitmasks are 8 times faster to check than bytemasks). It's also a little complicated: the last byte may be incomplete if the length of the dataset is not a multiple of 8. One would need to check those bits individually, but the rest of the mask could be checked in bulk (maybe even cast as int64 to check 64 flags at a time).
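As a rough sketch of that bulk check in plain NumPy (this is only an illustration of the idea, not awkward's actual code): the complete bytes of an Arrow-style validity bitmask can be compared against 0xFF in one vectorized step, and only the trailing, possibly incomplete byte needs its bits checked individually.
import numpy as np

def bitmask_all_valid(mask_bytes, length):
    """Return True if the first `length` bits of an LSB-first
    Arrow-style validity bitmask are all set."""
    mask = np.frombuffer(mask_bytes, dtype=np.uint8)
    nfull, nrest = divmod(length, 8)
    # check the complete bytes in bulk
    if not np.all(mask[:nfull] == 0xFF):
        return False
    # check the leftover bits of the last (incomplete) byte individually
    if nrest and (mask[nfull] & ((1 << nrest) - 1)) != (1 << nrest) - 1:
        return False
    return True

print(bitmask_all_valid(bytes([0xFF, 0x07]), 11))   # all 11 bits set -> True
print(bitmask_all_valid(bytes([0xFF, 0x05]), 11))   # bit 9 missing   -> False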

Creating a numpy array from a pointer in cython

After having read a lot of documentation on numpy / cython I am still unable to create a numpy array from a pointer in cython. The situation is as follows. I have a cython (*.pyx) file containing a callback function:
cimport numpy

cdef void func_eval(double* values,
                    int values_len,
                    void* func_data):
    func = (<object> func_data)
    # values: contiguous array of length=values_len
    array = ???
    # array should be a (modifiable) numpy array containing the
    # values as its data. No copying, no freeing the data by numpy.
    func.eval(array)
Most tutorials and guides consider the problem of turning an array to a pointer, but I am interested in the opposite.
I have seen one solution here based on pure python using the ctypes library (not what I am interested in). Cython itself talks about typed memoryviews a great deal. This is also not what I am looking for exactly, since I want all the numpy goodness to work on the array.
Edit: A (slightly) modified standalone MWE (save as test.pyx, compile via cython test.pyx):
cimport numpy

cdef extern from *:
    """
    /* This is C code which will be put
     * in the .c file output by Cython */
    typedef void (*callback)(double* values, int values_length);
    void execute(callback cb)
    {
        double values[] = {0., 1.};
        cb(values, 2);
    }
    """
    ctypedef void (*callback)(double* values, int values_length);
    void execute(callback cb);

def my_python_callback(array):
    print(array.shape)
    print(array)

cdef void my_callback(double* values, int values_length):
    # turn values / values_length into a numpy array
    # and call my_python_callback
    pass

cpdef my_execute():
    execute(my_callback)
2nd Edit:
Regarding the possible duplicate: while the questions are related, the first answer given is, as was pointed out, rather fragile; it relies on memory data flags, which are arguably an implementation detail. What is more, the question and answers are rather outdated and the Cython API has been expanded since 2014. Fortunately, however, I was able to solve the problem myself.
Firstly, you can cast a raw pointer to a typed MemoryView operating on the same underlying memory without taking ownership of it via
cdef double[:] values_view = <double[:values_length]> values
This is not quite enough, however; as I stated, I want a numpy array. But it is possible to convert a MemoryView to a numpy array provided that it has a numpy data type. Thus, the goal can be achieved in one line via
array = np.asarray(<np.float64_t[:values_length]> values)
It can be easily checked that the array operates on the correct memory segment without owning it.

Writing a netcdf4 file is 6-times slower than writing a netcdf3_classic file and the file is 8-times as big?

I am using the netCDF4 library in python and just came across the issue stated in the title. At first I was blaming groups for this, but it turns out that it is a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears related to our Linux installation of the netcdf libraries).
In the program below, I am creating a simple time series netcdf file of the same data in two different ways: 1) as a NETCDF3_CLASSIC file, 2) as a NETCDF4 flat file (creating groups in the netcdf4 file doesn't make much of a difference). What I find with simple timing and the ls command is:
1) NETCDF3 1.3483 seconds 1922704 bytes
2) NETCDF4 flat 8.5920 seconds 15178689 bytes
It's exactly the same routine which creates 1) and 2), the only difference is the format argument in the netCDF4.Dataset method. Is this a bug or a feature?
Thanks, Martin
Edit: I have now found that this must have something to do with our local installation of the netcdf library on a Linux computer. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and NETCDF4 is actually almost twice as fast as NETCDF3! When I run the same program on our Linux system, I can reproduce the old results. Thus, this question is apparently not related to Python.
Sorry for the confusion.
New code:
import datetime as dt
import numpy as np
import netCDF4 as nc

def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = 'testvar'
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "kg"
        # define global attributes
        for k, v in sorted(series_info.items()):
            setattr(f, k, v)
        # store data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
    t1 = dt.datetime.now()
    print "Writing file %s took %10.4f seconds." % (filename, (t1-t0).total_seconds())

if __name__ == "__main__":
    # create an array with 1 million values and datetime instances
    time = np.array([dt.datetime(2000,1,1)+dt.timedelta(hours=v) for v in range(1000000)])
    values = np.arange(0., 1000000.)
    data = np.array(zip(time, values), dtype=[('time', dt.datetime), ('value', 'f4')])
    data = data.view(np.recarray)
    series_info = {'attr1':'dummy', 'attr2':'dummy2'}
    filename = "testnc4.nc"
    write_to_netcdf_single(filename, data, series_info)
    filename = "testnc3.nc"
    write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')
[old code deleted because it had too much unnecessary stuff]
The two file formats do have different characteristics. The classic file format was dead simple (well, simpler than the new format: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Classic-Format-Spec.html#Classic-Format-Spec ): a small header described all the data, and then (since you have 3 record variables) the 3 record variables get interleaved.
Nice and simple, but you only get one UNLIMITED dimension, there's no facility for parallel I/O, and no way to organize data into groups.
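As a small, hedged sketch of the group facility the classic format lacks (the file and group names here are just illustrative):
import netCDF4 as nc

# Groups require the HDF5-based NETCDF4 format; NETCDF3_CLASSIC has no equivalent.
with nc.Dataset('grouped_example.nc', 'w', format='NETCDF4') as f:
    run1 = f.createGroup('run1')
    run1.createDimension('time', None)
    run1.createVariable('testvar', 'f4', ('time',))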
Enter the new HDF5-based back-end, introduced in NetCDF-4.
In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets, the costs are amortized, but your variables are (relatively speaking) kind of small.
I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that can grow in N dimensions, there is more metadata associated with each record entry in the NetCDF-4 format.
HDF5 also uses the "reader makes right" convention. Classic NetCDF says "all data will be big-endian", but HDF5 encodes a bit of information about how the data was stored. If the reader process has the same architecture as the writer process (which is common, as it would be on your laptop or when restarting from a simulation checkpoint), then no conversion needs to be done.
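One way to probe how much of the size difference comes from the unlimited record dimension is to rewrite the same variable with a fixed-size dimension, explicit chunking, and zlib compression; a hedged sketch (the chunk size and compression level below are guesses, not tuned values):
import numpy as np
import netCDF4 as nc

values = np.arange(0., 1000000., dtype='f4')

with nc.Dataset('testnc4_fixed.nc', 'w', format='NETCDF4') as f:
    f.createDimension('time', len(values))            # fixed size instead of None (unlimited)
    var = f.createVariable('testvar', 'f4', ('time',),
                           zlib=True, complevel=1,    # light compression
                           chunksizes=(100000,))      # explicit chunking; value is a guess
    var[:] = values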
This question is unlikely to help others as it appears to be a site-specific problem related to the interplay between netcdf libraries and the python netCDF4 module.