Custom equations with groupby and apply in pandas - MemoryError

All,
I am running code to calculate a new variable (newvar) for each constituent (group) in a panel using the apply function:
df['newvar'] = df.groupby('group')['var1'].apply(lambda x : x - x.shift() + df['var2'] - df['var3'])
The code raises a MemoryError. I think what's happening is that the code generates a large number of separate DataFrames, which causes the system to run out of memory because df itself is quite large. I can probably do this with a for-loop, but is there a less verbose, more computationally efficient way of doing this?
Many thanks,
Andres
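A minimal sketch of a vectorized alternative, assuming the intended result for each row is the within-group change in var1 plus that row's var2 minus var3 (my reading of the lambda above): compute the grouped shift once and do the rest as plain column arithmetic, so no per-group DataFrames are built at all:
df['newvar'] = df['var1'] - df.groupby('group')['var1'].shift() + df['var2'] - df['var3']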

Related

What is the difference between SeedSequence.spawn and SeedSequence.generate_state

I am trying to use numpy's SeedSequence to seed RNGs in different processes. However, I am not sure whether I should use ss.generate_state or ss.spawn:
import concurrent.futures
import numpy as np
def worker(seed):
    rng = np.random.default_rng(seed)
    return rng.random(1)
num_repeats = 1000
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
with concurrent.futures.ProcessPoolExecutor() as pool:
    result1 = np.hstack(list(pool.map(worker, ss.generate_state(num_repeats))))
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
with concurrent.futures.ProcessPoolExecutor() as pool:
    result2 = np.hstack(list(pool.map(worker, ss.spawn(num_repeats))))
What are the differences between the two approaches and which should I use?
Using ss.generate_state is ~10% faster for the basic example above, likely because we are serializing plain integers instead of SeedSequence objects.
Well, SeedSequence is intended to generate good-quality seeds from not-so-good seeds.
Performance
generate_state is much faster than spawn. The time spent transferring the object or the state should not be the main reason for the difference you notice. You can test this without any process pool:
%%timeit
ss.generate_state(num_repeats)
is roughly 100 times faster than
%%timeit
ss.spawn(num_repeats)
Seed size
In map(worker, ss.generate_state(num_repeats)) the RNGs are seeded with plain integers, while in map(worker, ss.spawn(num_repeats)) they are seeded with SeedSequence objects, which can potentially give a higher-quality initialization: a SeedSequence can be used internally to generate a state vector with as many bits as required to completely initialize the RNG. To be honest, I expect an integer seed to be expanded in some way to initialize the RNG as well, not simply padded with zeros, for example.
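A quick check of what each call actually hands to the workers (a small illustration, not from the original post):
ss = np.random.SeedSequence(12345)
print(ss.generate_state(3))   # an array of uint32 integers
print(ss.spawn(3))            # a list of child SeedSequence objects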
Repeated use
The most important difference is that generate_state gives the same result if called multiple times. On the other hand, spawn gives different results each call.
For illustration, check the following example:
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
print('With generate_state')
print(np.hstack([worker(s) for s in ss.generate_state(5)]))
print(np.hstack([worker(s) for s in ss.generate_state(5)]))
print('With spawn')
print(np.hstack([worker(s) for s in ss.spawn(5)]))
print(np.hstack([worker(s) for s in ss.spawn(5)]))
With generate_state
[0.6625651 0.17654256 0.25323331 0.38250588 0.52670541]
[0.6625651 0.17654256 0.25323331 0.38250588 0.52670541]
With spawn
[0.06988312 0.40886412 0.55733136 0.43249601 0.53394111]
[0.64885573 0.16788206 0.12435154 0.14676836 0.51876499]
As you can see, the arrays generated by seeding different RNGs with generate_state are the same, not only immediately after construction but every time the method is called. spawn should also give you the same results from a newly constructed SeedSequence (I am using numpy 1.19.2); however, if you run the same code twice using the same instance, the second call will produce different seeds.
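The reason is that the parent SeedSequence keeps a counter of how many children it has already spawned, so repeated calls produce children with different spawn keys. A small illustration (not from the original answer; the keys shown are what I would expect from numpy's documented behaviour):
ss = np.random.SeedSequence(0)
print([child.spawn_key for child in ss.spawn(2)])   # [(0,), (1,)]
print([child.spawn_key for child in ss.spawn(2)])   # [(2,), (3,)] because the counter advanced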

Memory error with numpy.arange

I get a memory error when using numpy.arange with large numbers. My code is as follows:
import numpy as np
list = np.arange(0, 10**15, 10**3)
profit_list = []
for diff in list:
    x = do_some_calculation
    profit_list.append(x)
What can be a replacement so I can avoid getting the memory error?
If you replace list¹ with a lazily evaluated range, that is, you do
for diff in range(0, 10**15, 10**3):
    x = do_some_calculation
    profit_list.append(x)
then that will no longer cause a MemoryError, as you no longer materialize the full array. In this world, though, profit_list will probably be causing issues instead, as you are trying to add 10^12 items to it. Again, you can probably get around that by not storing the values explicitly, but rather yielding them as you need them, using generators.
¹: Side note: Don't use list as a variable name as it shadows a built-in.
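A rough sketch of the generator idea, assuming do_some_calculation is a function of diff and that you ultimately want some aggregate (here a running minimum) rather than the full list of 10^12 results:
def profits():
    for diff in range(0, 10**15, 10**3):
        yield do_some_calculation(diff)   # placeholder from the question

# consume lazily; nothing is stored, although iterating 10**12 points still takes time
best = min(profits())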

Index pandas series by an hour

In my code I am currently doing the following kind of operation with Pandas:
ser = oldser.dropna().copy()
for i in range(24):
    ind = ser.groupby(ser.index.hour).get_group(i).index
    ser[ind] = something
This code copies a series and then, for each hour separately, does something to it. This seems very messy though - is there a nicer way to clean it up?
What I really want is something analogous to
series['2011']
which gets all data from 2011, but instead
series['2pm']
which gets all data at 2pm.
Certainly you want to do the groupby operation only once; a slight refactor:
g = ser.groupby(ser.index.hour)
for i, ind in g.indices.items():
    ser.iloc[ind] = something
But most likely you can do a transform or apply (depending on what something is):
g.transform(something)
g.apply(something)
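For the series['2pm']-style access the question asks for, the closest thing I know of is a boolean mask on the index hour (a sketch, assuming ser has a DatetimeIndex; hour 14 is 2pm):
two_pm = ser[ser.index.hour == 14]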

h5py selective read in

I have a problem regarding a selective read-in routine while using h5py.
f = h5py.File('file.hdf5','r')
data = f['Data']
I have several positive values in the 'Data'- dataset and also some placeholders with -9999.
How can I get only the positive values for calculations like np.min?
np.ma.masked_array creates a full copy of the array, and all the benefits of using h5py (regarding memory usage) are lost. The problem is that I get errors if I try to read datasets that exceed 100 million values per dataset using data = f['Data'][:,0]
Or, if this is not possible, is something like the following possible?
np.place(data[...], data[...] <= -9999, float('nan'))
Thanks in advance
You could use:
mask = f['Data'] >= 0
data = f['Data'][mask]
although I am not sure how much memory the mask calculation itself uses.
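If the memory used by the mask itself is a concern, one workaround (a sketch, not tested against your file; the chunk size is arbitrary) is to stream the dataset in slices and keep only a running minimum of the valid values:
import numpy as np
import h5py

with h5py.File('file.hdf5', 'r') as f:
    dset = f['Data']
    current_min = np.inf
    for start in range(0, dset.shape[0], 1000000):
        chunk = dset[start:start + 1000000]   # read one slice at a time
        valid = chunk[chunk > -9999]          # drop the placeholder values
        if valid.size:
            current_min = min(current_min, valid.min())
print(current_min)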

Best way to solve an optimization with multiple variables in Matlab?

I am trying to numerically compute the solutions for a system of many equations and variables (100+). So far I have tried three things:
I know that the vector of p(i) (which contains most of the endogenous variables) is decreasing. So I simply chose some starting points and then increased (decreased) my guess whenever I saw that the specific p was too low (high). Of course this is always conditional on the others being fixed, which is not the case. It should eventually work, but it is neither efficient, nor is it obvious that I will reach a solution in finite time. It did work when reducing the system to 4-6 variables, though.
I could create 100+ loops around each other and use bisection for each loop. This would eventually lead me to the solution, but it would take ages both to program (as I have no idea how to create n nested loops without actually writing out the loops, which is also bad because I would like to easily increase or decrease the number of variables) and to execute.
I tried fminsearch, but as expected for that vast number of variables - no way!
I would appreciate any ideas... Here is the code (this is the fminsearch version I tried):
This is the run file:
clear all
clc
% parameter
z=1.2;
w=20;
lam=0.7;
tau=1;
N=1000;
t_min=1;
t_max=4;
M=6;
a_min=0.6;
a_max=0.8;
t=zeros(1,N);
alp=zeros(1,M);
p=zeros(1,M);
p_min=2;
p_max=1;
for i=1:N
    t(i)= t_min + (i-1)*(t_max - t_min)/(N-1);
end
for i=1:M
    alp(i)= a_min + (i-1)*(a_max - a_min)/(M-1);
    p(i)= p_min + (i-1)*(p_max - p_min)/(M-1);
end
fun=@(p) david(p,z,w,lam,tau,N,M,t,alp);
p0=p;
fminsearch(fun,p0)
And this is the program-file:
function crit=david(p, z,w,lam,tau,N,M,t,alp)
X = zeros(M,N);
pi = zeros(M,N);
C = zeros(1,N);
Xa=zeros(1,N);
Z=zeros(1,M);
rl=0.01;
rh=1.99;
EXD=140;
while (abs(EXD)>100)
    r1=rl + 0.5*(rh-rl);
    for i=1:M
        for j=1:N
            X(i,j)=min(w*(1+lam), (alp(i) * p(i) / r1)^(1/(1-alp(i))) * t(j)^((z-alp(i))/(1-alp(i))));
            pi(i,j)=p(i) * t(j)^(z-alp(i)) * X(i,j)^(alp(i)) - r1*X(i,j);
        end
    end
    [C,I] = max(pi);
    Xa(1)=X(I(1),1);
    for j=2:N
        Xa(j)=X(I(j),j);
    end
    EXD=sum(Xa)- N*w;
    if (abs(EXD)>100 && EXD>0)
        rl=r1;
    elseif (abs(EXD)>100 && EXD<0)
        rh=r1;
    end
end
Ya=zeros(M,N);
for j=1:N
    Ya(I(j),j)=t(j)^(z-alp(I(j))) * X(I(j),j)^(alp(I(j)));
end
Yi=sum(Ya,2);
if (Yi(1)==0)
    Z(1)=-50;
end
for j=2:M
    if (Yi(j)==0)
        Z(j)=-50;
    else
        Z(j)=(p(1)/p(j))^tau - Yi(j)/Yi(1);
    end
end
zz=sum(abs(Z))
crit=(sum(abs(Z)));
First of all, my recommendation: use your brain.
What do you know about the function? Can you use a gradient approach, linearize the problem, or perhaps fix most of the variables? If not, think twice before you decide that you are really interested in all 100 variables, and perhaps simplify the problem.
Now, if that is not possible, read this:
If you found a way to quickly get a local optimum, you could simply wrap a loop around it to try different starting points and hope you will find a good optimum.
If you really need to make lots of loops (and a variable number of them), I suppose it can be done with recursion, but it is not easily explained; there is a rough sketch of the idea at the end of this answer.
If you just quickly want to make a fixed number of loops inside each other, this can easily be done in Excel (hint: loop variables can be called t1,t2 ... )
If you really need to evaluate a function at a lot of points, probably creating all the points first using ndgrid and then evaluating them all at once is preferable. (Needless to say this will not be a nice solution for 100 nontrivial variables)
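Not Matlab, but a hedged, language-agnostic sketch (written in Python) of the "variable number of nested loops / evaluate on a grid" idea above: a single product over per-variable candidate lists plays the role of n nested loops or ndgrid, and the nesting depth is just the number of lists.
import itertools

def evaluate_on_grid(f, grids):
    # grids: one list of candidate values per variable
    best_point, best_value = None, float('inf')
    for point in itertools.product(*grids):
        value = f(point)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value

# usage: 3 variables, 5 candidate values each, minimizing a toy criterion
grids = [[0.1 * k for k in range(5)] for _ in range(3)]
print(evaluate_on_grid(lambda p: sum(x ** 2 for x in p), grids))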