Format specifier problem with numpy savetxt

I have four lists, one with strings and three with floats, which I want to write column-wise to a file. I read the lists from other files using numpy.loadtxt:
A = np.loadtxt(file_1, dtype=str, unpack=True, skiprows=1, usecols=(0))
x, y, z = np.loadtxt(file_2, unpack=True, skiprows=1, usecols=(1, 2, 3))
and would write the file using numpy.savetxt
np.savetxt(file_3, (A, x, y, z), fmt=('%s %15.8f %15.8f %15.8f'))
However, I get the following error:
ValueError: fmt has wrong number of % formats
np.savetxt(file_3, (x, y, z), fmt='%15.8f') and np.savetxt('coord', atom, fmt='%s') work fine. I have tried several variations, but cannot seem to get it right.
Thank you
input:
A = ['a', 'b', 'c']
x = [1.1, 2.2, 3.3]
y = [4.4, 5.5, 6.6]
z = [7.7, 8.8, 9.9]
output:
a 1.1 4.4 7.7
b 2.2 5.5 8.8
c 3.3 6.6 9.9
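One possible way around the error, sketched under assumptions: savetxt most likely stacks the tuple (A, x, y, z) into a single 2-D array with one row per list, so it sees three columns (the length of each list) rather than four, and the four-specifier fmt no longer matches. Packing the lists into a structured array with one record per output row gives savetxt four fields per row. The field names, the 'U8' label width, and the in-memory buffer standing in for file_3 are choices made for this example only.

```python
import io
import numpy as np

# Sample data from the question; A holds labels, x/y/z hold floats.
A = np.array(['a', 'b', 'c'])
x = np.array([1.1, 2.2, 3.3])
y = np.array([4.4, 5.5, 6.6])
z = np.array([7.7, 8.8, 9.9])

# One record per output row. Field names and the 'U8' width are
# assumptions made for this sketch, not part of the original question.
table = np.zeros(A.size, dtype=[('label', 'U8'), ('x', float),
                                ('y', float), ('z', float)])
table['label'] = A
table['x'] = x
table['y'] = y
table['z'] = z

# An in-memory buffer stands in for file_3 so nothing is written to disk.
buffer = io.StringIO()
np.savetxt(buffer, table, fmt='%s %15.8f %15.8f %15.8f')
lines = buffer.getvalue().splitlines()
print('\n'.join(lines))
```

Passing a real filename instead of the buffer should write the same rows to disk.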


How do I use `pd.NamedAgg` with a lambda function inside a `pandas` aggregation?

I want to be able to feed a list as parameters to generate different aggregate functions in pandas. To make this more concrete, let's say I have this as data:
import numpy as np
import pandas as pd
np.random.seed(0)
df_data = pd.DataFrame({
    'group': np.repeat(['x', 'y'], 10),
    'val': np.random.randint(0, 10, 20)
})
So the first few rows of the data look like this:
group  val
x      5
x      0
x      3
I have a list of per-group percentiles that I want to compute.
percentile_list = [10, 90]
And I tried to use dictionary comprehension with pd.NamedAgg that calls a lambda function to do this.
df_agg = df_data.groupby('group').agg(
    **{f'p{y}_by_dict': pd.NamedAgg('val', lambda x: np.quantile(x, y / 100)) for y in percentile_list},
)
But it doesn't work. Here I calculate both by hand and by dictionary comprehension.
df_agg = df_data.groupby('group').agg(
    p10_by_hand=pd.NamedAgg('val', lambda x: np.quantile(x, 0.1)),
    p90_by_hand=pd.NamedAgg('val', lambda x: np.quantile(x, 0.9)),
    **{f'p{y}_by_dict': pd.NamedAgg('val', lambda x: np.quantile(x, y / 100)) for y in percentile_list},
)
The result looks like this. The manually specified aggregations work but the dictionary comprehension ones have the same values across different aggregations. I guess they just took the last lambda function in the generated dictionary.
   p10_by_hand  p90_by_hand  p10_by_dict  p90_by_dict
x          1.8          7.2          7.2          7.2
y          1.0          8.0          8.0          8.0
How do I fix this? I don't have to use dictionary comprehension, as long as each aggregation can be specified programmatically.
In [23]: def agg_gen(y):
    ...:     return lambda x: np.quantile(x, y / 100)
    ...:
In [24]: df_data.groupby('group').agg(
    ...:     **{f'p{y}_by_dict': pd.NamedAgg('val', agg_gen(y)) for y in percentile_list},
    ...: )
Out[24]:
       p10_by_dict  p90_by_dict
group
x              1.8          7.2
y              1.0          8.0
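Your original attempt fails because of how closures bind variables: each lambda looks up y when it is called, not when it is defined, so every aggregation ends up using the last value of y. See "What do lambda function closures capture?" for details.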
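The late-binding behaviour can be reproduced without pandas. The sketch below reuses percentile_list from the question; the names late and bound are made up for illustration. It also shows a second common fix, binding the loop variable as a default argument.

```python
percentile_list = [10, 90]

# Late binding: every lambda looks up y when it is called, and by then
# the comprehension has finished with y == 90.
late = [lambda: y for y in percentile_list]
late_values = [f() for f in late]

# Binding y as a default argument freezes its current value per lambda.
bound = [lambda y=y: y for y in percentile_list]
bound_values = [f() for f in bound]

print(late_values)   # [90, 90]
print(bound_values)  # [10, 90]
```

The same trick should carry over to the aggregation by writing lambda x, y=y: np.quantile(x, y / 100) inside the comprehension, as an alternative to the agg_gen factory.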

Python Memory error on scipy stats. Scipy linalg lstsq <> manual beta

Not sure if this question belongs here or on crossvalidated but since the primary issue is programming language related, I am posting it here.
Inputs:
Y= big 2D numpy array (300000,30)
X= 1D array (30,)
Desired Output:
B= 1D array (300000,), each element of which is the regression coefficient from regressing the corresponding row of Y (length 30) against X
So B[0] = scipy.stats.linregress(X,Y[0])[0]
I tried this first:
B = scipy.stats.linregress(X,Y)[0]
hoping that it will broadcast X according to shape of Y. Next I broadcast X myself to match the shape of Y. But on both occasions, I got this error:
File "C:\...\scipy\stats\stats.py", line 3011, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
File "C:\...\numpy\lib\function_base.py", line 1766, in cov
return (dot(X, X.T.conj()) / fact).squeeze()
MemoryError
I used a manual approach to calculate beta and, on Sascha's suggestion below, also used scipy.linalg.lstsq as follows:
B = lstsq(Y.T, X)[0] # first estimate of beta
Y1=Y-Y.mean(1)[:,None]
X1=X-X.mean()
B1= np.dot(Y1,X1)/np.dot(X1,X1) # second estimate of beta
The two estimates of beta are very different however:
>>> B1
Out[10]: array([0.135623, 0.028919, -0.106278, ..., -0.467340, -0.549543, -0.498500])
>>> B
Out[11]: array([0.000014, -0.000073, -0.000058, ..., 0.000002, -0.000000, 0.000001])
Scipy's linregress returns a slope and an intercept, which together define the regression line.
If you want to access the coefficients directly, scipy's lstsq might be more appropriate, which is an equivalent formulation.
Of course, you need to feed it arrays with the correct dimensions; your data needs preprocessing first (swap the dimensions).
Code
import numpy as np
from scipy.linalg import lstsq
Y = np.random.random((300000,30))
X = np.random.random(30)
x, res, rank, s = lstsq(Y.T, X) # Y transposed!
print(x)
print(x.shape)
Output
[ 1.73122781e-05 2.70274135e-05 9.80840639e-06 ..., -1.84597771e-05
5.25035470e-07 2.41275026e-05]
(300000,)
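A small check may help explain why the two estimates in the question differ. lstsq(Y.T, X) solves one least-squares system Y.T @ b ≈ X for a single vector b, which is a different problem from fitting each row of Y against X separately. The sketch below, on a deliberately small random array, compares the centered dot-product formula from the question with per-row degree-1 fits. np.polyfit is used here as a stand-in reference for linregress so the example needs only numpy; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.random((5, 30))   # small stand-in for the (300000, 30) array
X = rng.random(30)

# Closed-form OLS slope per row: cov(X, row) / var(X).
X1 = X - X.mean()
Y1 = Y - Y.mean(axis=1, keepdims=True)
slopes = Y1 @ X1 / (X1 @ X1)

# Reference: fit a line to every row at once; polyfit treats each
# column of Y.T as a separate dataset, and row 0 of the result holds
# the slopes.
reference = np.polyfit(X, Y.T, 1)[0]

print(np.allclose(slopes, reference))
```

Because the closed form only needs two dot products, it should scale to the full (300000, 30) array without building the large covariance matrix that triggered the MemoryError.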

Numpy or Pandas function for "x-value-window" means or other stats?

Let's say I have x-y data samples sorted by x-value. I'm going to use Pandas as example, but I would be perfectly happy with a Numpy/Scipy-only solution, of course.
In [24]: pd.set_option('display.max_rows', 10)
In [25]: df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
In [26]: df = df.sort_values('x')
In [27]: df
Out[27]:
           x         y
13 -3.403818  0.717744
49 -2.688876  1.936267
74 -2.388332 -0.121599
52 -2.185848  0.617896
90 -2.155343 -1.132673
..       ...       ...
65  1.736506 -0.170502
0   1.770901  0.520490
60  1.878376  0.206113
63  2.263602  1.112115
33  2.384195 -1.877502
[100 rows x 2 columns]
Now, I want to kind of "window" it or "discretize" it and get statistics on each window. But I don't want to do the Pandas moving-window functions because they define windows by rows. I want to define windows by a span of x-values, thus "x-value-window". Specifically, let's define each x-value-window with 2 parameters:
- center x-value of each window
  - in this example, let's say I want x = 0.0 + 0.4 * k for all positive or negative k
  - thus -3.2, -2.8, -2.4, ..., 1.6, 2.0, 2.4
- width of each window
  - in this example, let's say I want W = 0.5
  - thus, the example windows will be [-3.2-0.25, -3.2+0.25], [-2.8-0.25, -2.8+0.25], ..., [2.4-0.25, 2.4+0.25]
  - note that the windows overlap, which is intended
Having thus defined the windows, I would like to ask if there's a function that will produce the following data frame (or numpy array):
x y
-3.2 mean of y-values in x-value-window centered at -3.2
-2.8 mean of y-values in x-value-window centered at -2.8
-2.4 mean of y-values in x-value-window centered at -2.4
... ...
1.6 mean of y-values in x-value-window centered at 1.6
2.0 mean of y-values in x-value-window centered at 2.0
2.4 mean of y-values in x-value-window centered at 2.4
Is there anything that will do this for me? Or do I have to totally roll my own (and probably in a very slow python loop instead of fast numpy or pandas code)?
Extra 1: It would be even better if there's support for weighted windows (such as supported by Pandas's rolling_window function) but of course the weights in this case would not be based on how far the sample's row is from the center row of the window, but rather, how far the sample's x-value is from the center of the x-value-window.
Extra 2: It would be nice if there's support for statistics other than mean on the x-value-windows, e.g. (a) variance of the y-values in each x-value-window or (b) count of the number of samples falling within each x-value-window.
I first create a range of x values centered at zero. This range is wide enough that the min value minus the width and the max value plus the width capture all x values.
I then iterate through this range of x values, which has k as the step size. At each point, I use loc to select the y values whose x lies within plus or minus w of the selected center. The mean of these selected values is then calculated, and the means are used to create the result dataframe. Note that w acts as a half-width here, so each window spans 2 * w; for a total width of 0.5 as in the question, w would be 0.25.
import math
import numpy as np
import pandas as pd
k = .4
w = .5
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
x_range = np.arange(math.floor((df.x.min() + w) / k) * k,
                    k * (math.ceil((df.x.max() - w) / k) + 1), k)
result = pd.DataFrame((df.loc[df.x.between(x - w, x + w), 'y'].mean() for x in x_range),
                      index=x_range, columns=['y_mean'])
result.index.name = 'centered_x'
>>> result
y_mean
centered_x
-2.400000e+00 0.653619
-2.000000e+00 0.733606
-1.600000e+00 0.576594
-1.200000e+00 0.150462
-8.000000e-01 0.065884
-4.000000e-01 0.022925
-8.881784e-16 0.211693
4.000000e-01 0.057527
8.000000e-01 -0.141970
1.200000e+00 0.233695
1.600000e+00 0.203570
2.000000e+00 0.306409
2.400000e+00 0.576789
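For the loop-free variant and the extra statistics asked about in the question, a broadcasting approach in plain numpy is one option. This is a sketch, not a built-in: window_stats is a hypothetical helper, width is the full window width (W in the question), and variance is the population variance. It builds an n_windows by n_samples mask, so memory grows with the product of the two.

```python
import numpy as np

def window_stats(x, y, centers, width):
    """Mean, population variance and count of y per x-value window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # inside[i, j] is True when sample j falls in window i.
    inside = np.abs(x[None, :] - centers[:, None]) <= width / 2
    count = inside.sum(axis=1)
    safe = np.where(count > 0, count, 1)   # avoid dividing by zero
    mean = (inside * y).sum(axis=1) / safe
    var = (inside * y**2).sum(axis=1) / safe - mean**2
    # Windows that contain no samples get NaN instead of 0.
    empty = count == 0
    mean = np.where(empty, np.nan, mean)
    var = np.where(empty, np.nan, var)
    return mean, var, count

# Tiny hand-checkable example; the last window is deliberately empty.
x = np.array([0.0, 0.1, 0.5, 1.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
centers = np.array([0.0, 0.4, 0.8, 3.0])
mean, var, count = window_stats(x, y, centers, width=0.5)
print(mean, var, count)
```

Weighted windows could probably be handled the same way by replacing the boolean mask with a weight matrix computed from the distance between each sample's x-value and the window center.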

Matplotlib: 'savefig' throws TypeError when 'linewidths' property is set

When the 'linewidths' property is set, calling 'savefig' throws 'TypeError: cannot perform reduce with flexible type'. Here is a MWE:
import numpy as np
import matplotlib.pyplot as plt

# Create sample data.
x = np.arange(-3.0, 3.0, 0.1)
y = np.arange(-2.0, 2.0, 0.1)
X, Y = np.meshgrid(x, y)
Z = 10.0 * (2*X - Y)
# Plot sample data (linewidths is passed as a string here).
plt.contour(X, Y, Z, colors = 'black', linewidths = '1')
plt.savefig('test.pdf')
It is not a problem with the figure rendering (calling 'plt.show()' works fine). If the linewidths property is not set, e.g. changing the second last line to
plt.contour(X, Y, Z, colors = 'black')
'savefig' works as intended. Is this a bug, or have I missed something?
This is not a bug, since the documentation for plt.contour() specifies that linewidths should be a [ None | number | tuple of numbers ] while you provide a number as a string.
Here is my output with your code (I am using matplotlib 1.4.3).
>>> matplotlib.__version__
'1.4.3'
Your code 'works' under Python 2.7, but the linewidths parameter is effectively ignored: the plot looks the same regardless of the value (this was with linewidths='10').
In contrast on Python 3.4 I get the following error:
TypeError: unorderable types: str() > int()
Setting linewidths to an int (or a float) as follows produces the correct output and works on both Python 2.7 and Python 3.4. Again, this is with it set to 10:
plt.contour(X, Y, Z, colors = 'black', linewidths = 10)
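If the width reaches the plotting code as text (for instance from a config file or a command-line argument), converting it before the call avoids the problem. The helper below is hypothetical and not part of matplotlib; it is only a sketch of that conversion, mirroring the documented None, number, or tuple-of-numbers options.

```python
def as_linewidths(value):
    """Return None unchanged, a float for a scalar, or a tuple of floats."""
    if value is None:
        return None
    if isinstance(value, (str, int, float)):
        return float(value)
    # Anything else is treated as a sequence of widths.
    return tuple(float(v) for v in value)

print(as_linewidths('1'))         # 1.0
print(as_linewidths(['0.5', 2]))  # (0.5, 2.0)
```

The result can then be passed straight through, e.g. plt.contour(X, Y, Z, colors='black', linewidths=as_linewidths('1')).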

Slice 3D ndarray with 2D ndarray in numpy?

My apologies if this has been answered many times, but I just can't find a solution.
Assume the following code:
import numpy as np
A,_,_ = np.meshgrid(np.arange(5),np.arange(7),np.arange(10))
B = (np.random.rand(7,10)*5).astype(int)
How can I slice A using B so that B represents the indexes in the first and last dimensions of A (i.e. A[magic] = B)?
I have tried
A[:,B,:] which doesn't work due to peculiarities of advanced indexing.
A[:,B,np.arange(10)] generates 7 copies of the matrix I'm after
A[np.arange(7),B,np.arange(10)] gives the error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Any other suggestions?
These both work:
A[0, B, 0]
A[B, B, B]
Really, only the B in axis 1 matters; the others can be any range that will broadcast to B.shape and are limited by A.shape[0] (for axis 0) and A.shape[2] (for axis 2). For a ridiculous example:
A[list(range(7)) + list(range(3)), B, range(9,-1, -1)]
But you don't want to use : because then you'll get, as you said, 7 or 10 (or both!) "copies" of the array you want.
A, _, _ = np.meshgrid(np.arange(5),np.arange(7),np.arange(10))
B = (np.random.rand(7,10)*A.shape[1]).astype(int)
np.allclose(B, A[0, B, 0])
#True
np.allclose(B, A[B, B, B])
#True
# On Python 3, range objects must be converted to lists before concatenating.
np.allclose(B, A[list(range(7)) + list(range(3)), B, range(9,-1, -1)])
#True
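For an A whose values also vary along the first and last axes, the index arrays for those axes matter. A common pattern, sketched here with illustrative names rows and cols, is to give them shapes (7, 1) and (1, 10) so that all three indices broadcast to B.shape. This also shows why A[np.arange(7), B, np.arange(10)] fails: shapes (7,) and (10,) cannot broadcast against each other.

```python
import numpy as np

# A has shape (7, 5, 10) and A[i, j, k] == j, as in the question.
A, _, _ = np.meshgrid(np.arange(5), np.arange(7), np.arange(10))
rng = np.random.default_rng(0)
B = rng.integers(0, A.shape[1], size=(7, 10))

rows = np.arange(7)[:, None]    # shape (7, 1)
cols = np.arange(10)[None, :]   # shape (1, 10)
picked = A[rows, B, cols]       # picked[i, j] == A[i, B[i, j], j]

print(picked.shape)               # (7, 10)
print(np.array_equal(picked, B))  # True for this particular A
print(A[:, B, :].shape)           # (7, 7, 10, 10): the unwanted "copies"
```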