Read a file consisting of two columns of pure(?) double into a complex NumPy array

However, the above involved a slightly different input format (e.g. parentheses) than the file content here.
Consider a file named example containing two columns of pure(?) double:
0.8355544313622164 0
1.199174279986189 0
1.417275292218002 0
I am able to generate a complex NumPy array (np.complex128, since the columns are read as float64) by doing the following:
data = np.loadtxt("./example", dtype=np.float64, delimiter='\t')
complexData = data.T[0] + 1j*data.T[1]
Printing complexData now gives:
[ 0.83555443+0.j 1.19917428+0.j 1.41727529+0.j ... ]
Is it possible to reduce the above approach into a neater one?
For example, changing data type to np.complex64 raises TypeError:
data = np.loadtxt("./example", dtype=np.complex64, delimiter='\t')

Instead of converting the real array to complex with
complexData = data.T[0] + 1j*data.T[1]
you can create a complex view of the data:
complexData = data.view(np.complex128)
Then data and complexData share the underlying array of floating point numbers, but complexData interprets those values as complex numbers.
complexData will be an array with shape (n, 1). To get rid of the extraneous second dimension, you can use
complexData = data.view(np.complex128)[:, 0]
You could do the conversion immediately upon reading the data. For example, my sample file called "real.txt" is
0.8355544313622164 0
1.199174279986189 0
1.417275292218002 0
3.141592653589793 -1
and it is not tab-delimited, so I'll use the default delimiter. To read the data as complex:
In [18]: z = np.loadtxt('real.txt').view(np.complex128)[:, 0]
In [19]: z
Out[19]: array([0.83555443+0.j, 1.19917428+0.j, 1.41727529+0.j, 3.14159265-1.j])
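As a minimal sketch of the difference between the two approaches, the snippet below reads the same four sample rows from an in-memory string (io.StringIO is used here only so the example is self-contained; it is an assumption, not part of the original answer) and compares the view with the arithmetic conversion.

```python
import io
import numpy as np

# Same sample rows as real.txt, kept in memory so the example is self-contained.
text = (
    "0.8355544313622164 0\n"
    "1.199174279986189 0\n"
    "1.417275292218002 0\n"
    "3.141592653589793 -1\n"
)

data = np.loadtxt(io.StringIO(text))        # float64 array of shape (4, 2)

# Reinterpret each (real, imag) pair of float64 values as one complex128 value.
z_view = data.view(np.complex128)[:, 0]     # shape (4,), shares memory with data

# The arithmetic version allocates a new, independent array.
z_copy = data[:, 0] + 1j * data[:, 1]

same_values = np.array_equal(z_view, z_copy)

# Because z_view is a view, a later change to data is visible through it.
data[0, 1] = 2.0
```

The view only works when each row is stored contiguously as two float64 values, which is the case for the array returned by loadtxt here.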


How to compare numpy arrays of tuples?

Here's an MWE that illustrates the issue I have:
import numpy as np
arr = np.full((3, 3), -1, dtype="i,i")
doesnt_work = arr == (-1, -1)
n_arr = np.full((3, 3), -1, dtype=int)
works = n_arr == 10
arr is supposed to be an array of tuples, but it doesn't behave as expected.
works is an array of booleans, as expected, but doesnt_work is False. Is there a way to get numpy to do elementwise comparisons on more complex types, or do I have to resort to list comprehension, flatten and reshape?
There's a second problem:
f = arr[(0, 0)] == (-1, -1)
f is False, because arr[(0,0)] is of type numpy.void rather than a tuple. So even if the componentwise comparison worked, it would give the wrong result. Is there a clever numpy way to do this or should I just resort to list comprehension?
Both problems are actually the same problem, and both are related to the custom data type you created when you specified dtype="i,i".
If you run arr.dtype you will get dtype([('f0', '<i4'), ('f1', '<i4')]). That is, two signed integers placed in one contiguous block of memory. This is not a Python tuple. Thus it is clear why the naive comparison fails: (-1,-1) is a Python tuple and is not represented in memory the same way as the NumPy data type.
However if you compare with a_comp = np.array((-1,-1), dtype="i,i") you get the exact behavior you are expecting!
You can read more about how custom dtypes work in the NumPy documentation on structured arrays.
Oh, and to address what np.void is: it comes from the idea of a void C pointer, which essentially means an address to a contiguous block of memory of unspecified type. But provided you (the programmer) know what is going to be stored in that memory (in this case two back-to-back integers), it's fine as long as you are careful (compare with the same custom data type).
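A minimal sketch of that comparison follows. The array is filled field by field here to keep the example explicit, and the field-wise comparison at the end is an alternative added for illustration, not something from the answer.

```python
import numpy as np

# Structured array with two int32 fields, named f0 and f1 by default.
arr = np.zeros((3, 3), dtype="i,i")
arr["f0"] = -1
arr["f1"] = -1

# Compare against a value that has the same structured dtype.
a_comp = np.array((-1, -1), dtype=arr.dtype)
mask = arr == a_comp                      # elementwise boolean array, shape (3, 3)

# A single element is a numpy.void; comparing it to a_comp also works.
single = bool(arr[0, 0] == a_comp)

# .item() converts the numpy.void record into a plain Python tuple.
as_tuple = arr[0, 0].item()

# Alternative: compare field by field and combine the results.
mask_by_field = (arr["f0"] == -1) & (arr["f1"] == -1)
```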

Writing data frame with object dtype to HDF5 only works after converting to string

I have a large dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way to cope with this is.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following.
Just filling the column with numbers will lead to a data type int32 and stores without problem
Setting one element to abc changes the data to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and is there a way to prevent this? The only way I found was to go through all columns, check if they are dtype('O') and explicitly convert them to str.
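The workaround described above, converting every object column to str before writing, might be sketched like this. The helper name stringify_object_columns is hypothetical, and the to_hdf call itself is left as a comment because it needs the optional PyTables dependency.

```python
import pandas as pd

def stringify_object_columns(df):
    """Return a copy of df in which every object-dtype column is cast to str.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype(str)
    return out

# A column mixing integers and a string ends up with dtype object.
df = pd.DataFrame({"a": [12345678, "abc", 87654321], "b": [1.0, 2.0, 3.0]})
safe = stringify_object_columns(df)

# safe.to_hdf("df.hdf5", key="data", format="table")  # requires PyTables
```

Note that this changes the stored values: the integers in a mixed column come back as strings when the file is read again.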
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Arrow ListArray from pandas has very different structure from arrow array generated by awkward?

I encountered the following issue making some tests to demonstrate the usefulness of a pure pyarrow UDF in pyspark as compared to always going through pandas.
import awkward
import numpy
import pandas
import pyarrow
counts = numpy.random.randint(0,20,size=200000)
content = numpy.random.normal(size=counts.sum())
test_jagged = awkward.JaggedArray.fromcounts(counts, content)
test_arrow = awkward.toarrow(test_jagged)
def awk_arrow(col):
jagged = awkward.fromarrow(col)
jagged2 = jagged**2
return awkward.toarrow(jagged2)
def pds_arrow(col):
pds = col.to_pandas()
pds2 = pds**2
return pyarrow.Array.from_pandas(pds2)
out1 = awk_arrow(test_arrow)
out2 = pds_arrow(test_arrow)
out3 = awkward.fromarrow(out1)
out4 = awkward.fromarrow(out2)
<class 'awkward.array.jagged.JaggedArray'>
<class 'awkward.array.masked.BitMaskedArray'>
out3 == out4
yields (at the end of the stack trace):
AttributeError: no column named 'reshape'
looking at the arrays:
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
You can see that the contents and shape of the arrays are the same, but they're not comparable to each other at face value, which is very counterintuitive. Is there a good reason for dense jagged structures with no nulls to be represented as a BitMaskedArray?
All data in Arrow are nullable (at every level), and they use bit masks (as opposed to byte masks) to specify which elements are valid. The specification allows columns of entirely valid data to not write the bitmask, but not every writer takes advantage of that freedom. Quite often, you see unnecessary bitmasks.
When it encounters a bitmask, such as here, awkward inserts a BitMaskedArray.
It could be changed to check to see if the mask is unnecessary and skip that step, though that adds an operation that scales with the size of the dataset (though likely insignificant in most cases—bitmasks are 8 times faster to check than bytemasks). It's also a little complicated: the last byte may be incomplete if the length of the dataset is not a multiple of 8. One would need to check these bits individually, but the rest of the mask could be checked in bulk. (Maybe even cast as int64 to check 64 flags at a time.)
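The check described above could look roughly like the sketch below. It assumes the validity mask is available as a uint8 NumPy array in Arrow's least-significant-bit-first order, with a set bit meaning "valid". The function bitmask_all_valid is a hypothetical helper, not part of awkward or pyarrow.

```python
import numpy as np

def bitmask_all_valid(mask, length):
    """Return True if the first `length` bits of an LSB-first bitmask are all set.

    Hypothetical helper; `mask` is assumed to hold at least ceil(length / 8) bytes.
    """
    mask = np.asarray(mask, dtype=np.uint8)
    full_bytes, remainder = divmod(length, 8)

    # Complete bytes can be checked in bulk: each must be 0b11111111.
    if not np.all(mask[:full_bytes] == 0xFF):
        return False

    # The last byte may be incomplete: only its low `remainder` bits matter.
    if remainder:
        tail = (1 << remainder) - 1
        return (int(mask[full_bytes]) & tail) == tail
    return True

# Ten elements: one full byte plus two bits of the second byte.
all_valid = bitmask_all_valid([0xFF, 0b00000011], 10)
one_null_in_tail = bitmask_all_valid([0xFF, 0b00000001], 10)
one_null_in_body = bitmask_all_valid([0xFE, 0b00000011], 10)
```

Padding bits beyond `length` in the last byte are ignored, so a writer that leaves them unset does not cause a false negative.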

Using numpy for polynomial fit on pandas dataframe

I have a dataframe containing astronomical data:
I'm using statsmodels.formula.api to try to apply a polynomial fit to a dataframe, using columns labelled log_z and U, B, V, and other variables. So far I have
sources['log_z'] = np.log10(sources.z)
mask = ~np.isnan((B-I)) & ~np.isnan(log_z)
model = ols(formula='(B-I) + np.power((U-R),2) ~ log_z', data = [log_z[mask], (B-I)[mask]]).fit()
but I keep getting
PatsyError: Error evaluating factor: TypeError: list indices must be integers or slices, not str
(B-I) + np.power((U-R),2) ~ log_z
even though I'm passing arrays into the function. I get the same error message (apart from the last line) no matter what arrays I use, or how I format them. Can anyone see what I'm doing wrong?
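The error message suggests that the formula machinery looks up names such as log_z by string key in whatever is passed as data, which a list cannot support. The sketch below reproduces that lookup failure with plain Python and then shows one possible alternative, a second-degree fit with numpy.polyfit on NaN-free columns. The sample values and the B_I column name are made up for illustration, and this is not necessarily the model the question intends.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `sources` dataframe from the question.
z = np.array([0.1, 1.0, 10.0, 100.0, np.nan])
x = np.log10(z)
sources = pd.DataFrame({"z": z, "B": 1 + 2 * x + 3 * x**2, "I": np.zeros_like(z)})
sources["log_z"] = np.log10(sources["z"])
sources["B_I"] = sources["B"] - sources["I"]

# Drop rows where either variable is NaN (same intent as the mask above).
clean = sources.dropna(subset=["log_z", "B_I"])

# A list of arrays cannot be indexed by name, hence the TypeError.
as_list = [clean["log_z"].to_numpy(), clean["B_I"].to_numpy()]
try:
    as_list["log_z"]
except TypeError as exc:
    lookup_error = str(exc)

# A dataframe (or dict) supports lookup by name.
log_z = clean["log_z"].to_numpy()

# Second-degree polynomial fit; coefficients are ordered highest power first.
coeffs = np.polyfit(log_z, clean["B_I"].to_numpy(), deg=2)
```

With statsmodels, the equivalent step would be to pass a dataframe such as clean as data, so that the names in the formula can be resolved as columns.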

How to get a subarray in numpy

I have a 3D array and I want to get a sub-array of size (2n+1) along each dimension, centered around an index indx. Using slices I can use
which will only get uglier if I want a different size for each dimension. Is there a nicer way to do this?
You don't need to use the slice constructor unless you want to store the slice object for later use. Instead, you can simply do:
y[indx[0]-n:indx[0]+n+1, indx[1]-n:indx[1]+n+1, indx[2]-n:indx[2]+n+1]
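If you want to do this without specifying each index separately, you can build a tuple of slices with a generator expression (a plain list of slices is not accepted as an index by current NumPy versions):
y[tuple(slice(i-n, i+n+1) for i in indx)]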
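As a small sketch, a tuple of slices also handles a different window size per dimension. The array contents and the half_widths name are illustrative assumptions.

```python
import numpy as np

y = np.arange(17 * 18 * 16).reshape(17, 18, 16)
indx = (5, 10, 8)

# Same half-width n in every dimension: a (2n+1)-sized cube.
n = 2
cube = y[tuple(slice(i - n, i + n + 1) for i in indx)]

# A different half-width per dimension.
half_widths = (4, 3, 2)
window = tuple(slice(i - w, i + w + 1) for i, w in zip(indx, half_widths))
box = y[window]
```

This assumes the window stays inside the array. If i - n becomes negative, the slice start is interpreted relative to the end of the axis and the result will not be the intended window.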
You can create NumPy arrays for indexing into the different dimensions of the 3D array and then use the ix_ function to create an indexing map and thus get the sliced output. The benefit of ix_ is that it allows for broadcasted indexing maps. More information can be found in the NumPy documentation for ix_. Then, you can specify different window sizes for each dimension for a generic solution. Here's the implementation with sample input data -
import numpy as np
A = np.random.randint(0,9,(17,18,16)) # Input array
indx = np.array([5,10,8]) # Pivot indices for each dim
N = [4,3,2] # Window sizes
# Arrays of start & stop indices
start = indx - N
stop = indx + N + 1
# Create indexing arrays for each dimension
xc = np.arange(start[0],stop[0])
yc = np.arange(start[1],stop[1])
zc = np.arange(start[2],stop[2])
# Create mesh from multiple arrays for use as indexing map
# and thus get desired sliced output
Aout = A[np.ix_(xc,yc,zc)]
Thus, for the given data with window sizes array, N = [4,3,2], the whos info shows -
In [318]: whos
Variable Type Data/Info
A ndarray 17x18x16: 4896 elems, type `int32`, 19584 bytes
Aout ndarray 9x7x5: 315 elems, type `int32`, 1260 bytes
The whos info for the output Aout is consistent with the intended output shape, which must be 2N+1 along each dimension.
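One difference worth keeping in mind, shown in the sketch below: indexing with ix_ is advanced indexing and returns a copy, while basic slicing over the same ranges returns a view of the original array. The variable names via_ix and via_slices are illustrative.

```python
import numpy as np

A = np.random.randint(0, 9, (17, 18, 16))   # Input array
indx = np.array([5, 10, 8])                 # Pivot indices for each dim
N = np.array([4, 3, 2])                     # Window half-sizes
start, stop = indx - N, indx + N + 1

# Advanced indexing through ix_: the result is an independent copy.
via_ix = A[np.ix_(*(np.arange(a, b) for a, b in zip(start, stop)))]

# Basic slicing over the same ranges: the result is a view into A.
via_slices = A[tuple(slice(a, b) for a, b in zip(start, stop))]

same_contents = np.array_equal(via_ix, via_slices)
```

The copy is safe to modify without touching A. The view avoids the extra allocation, but writes to it change A.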