Pandas Apply(), Transform() ERROR = invalid dtype determination in get_concat_dtype - pandas

Flowing on from this question, which i link as background, but question is standalone.
4 questions:
I cannot understand the error I see when using apply or transform:
"invalid dtype determination in get_concat_dtype"
Why does ClipNetMean work but the other 2 methods not?
Unsure if or why i need the .copy(deep=True)
Why the slightly different syntax needed to call the InnerFoo function
The DataFrame:
cost
section item
11 1 25
2 100
3 77
4 10
12 5 50
1 39
2 7
3 32
13 4 19
1 21
2 27
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data = {'section' : [11,11,11,11,12,12,12,12,13,13,13]
,'item' : [1,2,3,4,5,1,2,3,4,1,2]
,'cost' : [25.,100.,77.,10.,50.,39.,7.,32.,19.,21.,27.]
})
df.set_index(['section','item'],inplace=True)
upper =50
lower = 10
def ClipAndNetMean(cost,upper,lower):
avg = cost.mean()
new_cost = (cost- avg).clip(lower,upper)
return new_cost
def MiniMean(cost,upper,lower):
cost_clone = cost.copy(deep=True)
cost_clone['A'] = lower
cost_clone['B'] = upper
v = cost_clone.apply(np.mean,axis=1)
return v.to_frame()
def InnerFoo(lower,upper):
def inner(group):
group_clone = group.copy(deep=True)
group_clone['lwr'] = lower
group_clone['upr'] = upper
v = group_clone.apply(np.mean,axis=1)
return v.to_frame()
return inner
#These 2 work fine.
print df.groupby(level = 'section').apply(ClipAndNetMean,lower,upper)
print df.groupby(level = 'section').transform(ClipAndNetMean,lower,upper)
#apply works but not transform
print df.groupby(level = 'section').apply(MiniMean,lower,upper)
print df.groupby(level = 'section').transform(MiniMean,lower,upper)
#apply works but not transform
print df.groupby(level = 'section').apply(InnerFoo(lower,upper))
print df.groupby(level = 'section').transform(InnerFoo(lower,upper))
exit()
So to Chris's answer, note that if I add back the column header the methods will work in a Transform call.
see v.columns = ['cost']
def MiniMean(cost,upper,lower):
cost_clone = cost.copy(deep=True)
cost_clone['A'] = lower
cost_clone['B'] = upper
v = cost_clone.apply(np.mean,axis=1)
v = v.to_frame()
v.columns = ['cost']
return v
def InnerFoo(lower,upper):
def inner(group):
group_clone = group.copy(deep=True)
group_clone['lwr'] = lower
group_clone['upr'] = upper
v = group_clone.apply(np.mean,axis=1)
v = v.to_frame()
v.columns = ['cost']
return v
return inner

1 & 2) transform expects something "like-indexed", while apply is flexible. The two failing functions are adding additional columns.
3) In some cases, (e.g. if you're passing a whole DataFrame into a function) it can be necessary to copy to avoid mutating the original. It should not be necessary here.
4) The first two functions take a DataFrame with two parameters and returns data. InnerFoo actually returns another function, so it needs to be called before being passed into apply.

Related

Sudoku Solution Using Multiprocessing

I tried sudoku solution using backtracking, but it was taking a lot time around 12sec to give output. I tried to implement a multiprocessing technique but it's taking lot more time than that of backtracking. I never ran it completely it's too slow. Please suggest what am I missing? Even better if someone can also tell me how to run this through my GPU. (using CUDA).
import concurrent.futures
import copy
A = [[0]*9 for _ in range(9)]
A[0][6] = 2
A[1][1] = 8
A[1][5] = 7
A[1][7] = 9
A[2][0] = 6
A[2][2] = 2
A[2][6] = 5
A[3][1] = 7
A[3][4] = 6
A[4][3] = 9
A[4][5] = 1
A[5][4] = 2
A[5][7] = 4
A[6][2] = 5
A[6][6] = 6
A[6][8] = 3
A[7][1] = 9
A[7][3] = 4
A[7][7] = 7
A[8][2] = 6
Boards = [A]
L = []
for i in range(9):
for j in range(9):
if A[i][j] == 0:
L.append([i,j])
def RC_Check(A,Value,N):
global L
i,j = L[N]
for x in range(9):
if A[x][j] == Value:
return False
if A[i][x] == Value:
return False
return True
def Square_Check(A,Value,N):
global L
i,j = L[N]
X, Y = int(i/3)*3,int(j/3)*3
for x in range(X,X+3):
for y in range(Y,Y+3):
if A[x][y] == Value:
return False
return True
def New_Boards(Board,N):
global L
i,j = L[N]
Boards = []
with concurrent.futures.ProcessPoolExecutor() as executor:
RC_Process = executor.map(RC_Check,[Board]*10,list(range(1,10)),[N]*10)
Square_Process = executor.map(Square_Check,[Board]*10,list(range(1,10)),[N]*10)
for Value, (RC_Process, Square_Process) in enumerate(zip(RC_Process,Square_Process)):
if RC_Process and Square_Process:
Board[i][j] = Value+1
Boards.append(copy.deepcopy(Board))
return Boards
def Solve_Boards(Boards,N):
Results = []
with concurrent.futures.ProcessPoolExecutor() as executor:
Process = executor.map(New_Boards,Boards,[N]*len(Boards))
for new_boards in Process:
if len(new_boards):
Results.extend(new_boards)
return Results
if __name__ == "__main__":
N = 0
while N < len(L):
Boards = Solve_Boards(Boards,N)
N+=1
print(len(Boards),N)
print(Boards)
Multi processing is NOT a silver bullet. Backtracking is pretty more efficient than exhaustive search parallelly in most cases. I tried running this code on my PC which has 32 cores 64 threads, but it takes long time.
And you look like to want to use GPGPU to solve this problem, but i doesn't suit, Because state of board depends on previous state, so can't split calculation efficiently.

Numpy: Construct Slice A La Carte

Suppose I have the following:
# in pseudo code
# function input 1
chord = [0,1,17,35,47,0]
dims = [0,1,2,4,5,6]
x_axis = 3
t_axis = 7
# what I'd like to return
np.squeeze(arr[0,1,17,:,35,47,0,:])
# function input 2
chord = [0,3,4,5,6,7]
dims = [0,2,3,4,5,6]
x_axis = 1
t_axis = 7
# desired return
np.squeeze(arr[0,:,3,4,5,6,7,:])
How do I construct these numpy slices given input that I can arbitrarily specify a pair of axes and a chord coordinate?
I implemented a reflection-based solution:
def reflection_window(arr:np.ndarray,chord:list,dim0,dim1):
var = "arr"
bra = "["
ket = "]"
coord = [str(i) for int(i) in chord]
coord.insert(dim0,':')
coord.insert(dim1,':')
chordstr = ','.join(coord)
slicer = var+bra+chordstr+ket
return eval(slicer)
Staying native to numpy is probably better, but since python is a shell scripting language, it probably makes sense to treat it that way if necessary.

Scipy Optimize minimize returns the initial value

I am building machine learning models for a certain data set. Then, based on the constraints and bounds for the outputs and inputs, I am trying to find the input parameters for the most minimized answer.
The problem which I am facing is that, when the model is a linear regression model or something like lasso, the minimization works perfectly fine.
However, when the model is "Decision Tree", it constantly returns the very initial value that is given to it. So basically, it does not enforce the constraints.
import numpy as np
import pandas as pd
from scipy.optimize import minimize
I am using the very first sample from the input data set for the optimization. As it is only one sample, I need to reshape it to (1,-1) as well.
x = df_in.iloc[0,:]
x = np.array(x)
x = x.reshape(1,-1)
This is my Objective function:
def objective(x):
x = np.array(x)
x = x.reshape(1,-1)
y = 0
for n in range(df_out.shape[1]):
y = Model[n].predict(x)
Y = y[0]
return Y
Here I am defining the bounds of inputs:
range_max = pd.DataFrame(range_max)
range_min = pd.DataFrame(range_min)
B_max=[]
B_min =[]
for i in range(range_max.shape[0]):
b_max = range_max.iloc[i]
b_min = range_min.iloc[i]
B_max.append(b_max)
B_min.append(b_min)
B_max = pd.DataFrame(B_max)
B_min = pd.DataFrame(B_min)
bnds = pd.concat([B_min, B_max], axis=1)
These are my constraints:
con_min = pd.DataFrame(c_min)
con_max = pd.DataFrame(c_max)
Here I am defining the constraint function:
def const(x):
x = np.array(x)
x = x.reshape(1,-1)
Y = []
for n in range(df_out.shape[1]):
y = Model[n].predict(x)[0]
Y.append(y)
Y = pd.DataFrame(Y)
a4 =[]
for k in range(Y.shape[0]):
a1 = Y.iloc[k,0] - con_min.iloc[k,0]
a2 = con_max.iloc[k, 0] - Y.iloc[k,0]
a3 = [a2,a1]
a4 = np.concatenate([a4, a3])
return a4
c = const(x)
con = {'type': 'ineq', 'fun': const}
This is where I try to minimize. I do not pick a method as the automatically picked model has worked so far.
sol = minimize(fun = objective, x0=x,constraints=con, bounds=bnds)
So the actual constraints are:
c_min = [0.20,1000]
c_max = [0.3,1600]
and the max and min range for the boundaries are:
range_max = [285,200,8,85,0.04,1.6,10,3.5,20,-5]
range_min = [215,170,-1,60,0,1,6,2.5,16,-18]
I think you should check the output of 'sol'. At times, the algorithm is not able to perform line search completely. To check for this, you should check message associated with 'sol'. In such a case, the optimizer returns initial parameters itself. There may be various reasons of this behavior. In a nutshell, please check the output of sol and act accordingly.
Arad,
If you have not yet resolved your issue, try using scipy.optimize.differential_evolution instead of scipy.optimize.minimize. I ran into similar issues, particularly with decision trees because of their step-like behavior resulting in infinite gradients.

dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be 'map'ed to various partitions of a dask dataframe which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I have gotten. The result 'res' returned on compute is a list of 3 delayed objects. When I try to compute each of them in a loop (last tow lines of code) this results in a "TypeError: 'DataFrame' object is not callable" After going through the examples for map_partitions, I also tried altering the input DF (inplace) in the function with no return value which causes a similar TypeError with NoneType. What am I missing?
Also, looking at the visualization (attached) I feel like there is a need for reducing the individually computed (folded) partitions into a single DF. How do I do this?
#! /usr/bin/env python
# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &
from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import math
foldbucketsecs=30
periodicitysecs=15
secsinday=24 * 60 * 60
chunksizesecs=60 # 1 minute
numts = 5
start = 1525132800 # 01/05
end = 1525132800 + (3 * 60) # 3 minute
c = Client('127.0.0.1:8786')
def fold(df, start, bucket):
return df
def reduce_folds(df):
return df
def load(epoch):
idx = []
for ts in range(0, chunksizesecs, periodicitysecs):
idx.append(epoch + ts)
d = np.random.rand(chunksizesecs/periodicitysecs, numts)
ts = []
for i in range(0, numts):
tsname = "ts_%s" % (i)
ts.append(tsname)
gts.append(tsname)
res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
res.index = pd.to_datetime(arg=res.index, unit='s')
return res
gts = []
load(start)
cols = len(gts)
idx1 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+periodicitysecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx1[:0], data=[], columns=gts, dtype=np.float64)
dfs = [delayed(load)(fn) for fn in range(start, end, chunksizesecs)]
from_delayed = dd.from_delayed(dfs, meta, 'sorted')
nfolds = int(math.ceil((end - start)/foldbucketsecs))
cprime = nfolds * cols
gtsnew = []
for i in range(0, cprime):
gtsnew.append("ts_%s,fold=%s" % (i%cols, i/cols))
idx2 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+foldbucketsecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx2[:0], data=[], columns=gtsnew, dtype=np.float64)
folded_df = from_delayed.map_partitions(delayed(fold)(from_delayed, start, foldbucketsecs), meta=meta)
result = c.submit(reduce_folds, folded_df)
c.gather(result).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
res = c.gather(result).compute()
for f in res:
f.compute()
Never mind! It was my fault, instead of wrapping my function in delayed I simply passed it to the map_partitions call like so and it worked.
folded_df = from_delayed.map_partitions(fold, start, foldbucketsecs, nfolds, meta=meta)

Cleaner pandas apply with function that cannot use pandas.Series and non-unique index

In the following, func represents a function that uses multiple columns (with coupling across the group) and cannot operate directly on pandas.Series. The 0*d['x'] syntax was the lightest I could think of to force the conversion, but I think it's awkward.
Additionally, the resulting pandas.Series (s) still includes the group index, which must be removed before adding as a column to the pandas.DataFrame. The s.reset_index(...) index manipulation seems fragile and error-prone, so I'm curious if it can be avoided. Is there an idiom for doing this?
import pandas
import numpy
df = pandas.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
print('# df\n', df)
def func(d):
x = numpy.array(d['x'])
y = numpy.array(d['y'])
# I want to do math with x,y that cannot be applied to
# pandas.Series, so explicitly convert to numpy arrays.
#
# We have to return an appropriately-indexed pandas.Series
# in order for it to be admissible as a column in the
# pandas.DataFrame. Instead of simply "return x + y", we
# have to make the conversion.
return 0*d['x'] + x + y
s = df.groupby(df.index).apply(func)
# The Series is still adorned with the (unnamed) group index,
# which will prevent adding as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)
df['z'] = s
print('# df\n', df)
Instead of
0*d['x'] + x + y
you could use
pd.Series(x+y, index=d.index)
When using groupy-apply, instead of dropping the group key index using:
s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
you can tell groupby to drop the keys using the keyword parameter group_keys=False:
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
def func(d):
x = np.array(d['x'])
y = np.array(d['y'])
return pd.Series(x+y, index=d.index)
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)
yields
x y z
i j
1 1 0 1000.000000 1000.000000
1 1 1000.841471 1001.841471
1 2 1000.909297 1002.909297
1 3 1000.141120 1003.141120
2 0 2000.000000 2000.000000
2 1 2000.841471 2001.841471
2 2 2000.909297 2002.909297
2 3 2000.141120 2003.141120