Map elements of multiple columns in Pandas

I'm trying to label some values in a DataFrame in Pandas based on the value itself, in-place.
import pandas as pd

df = pd.read_csv('data/extrusion.csv')
# get list of columns that contain thickness
columns = [c for c in df.columns if 'SDickeIst'.lower() in c.lower()]

# create a function that returns the class based on value
def get_label(ser):
    ser.map(lambda x: x if x == 0 else 1)

df[columns].apply(get_label)
I would expect the apply function to take each column in turn and apply get_label to it. In turn, get_label receives the ser argument as a Series and uses map to map each element != 0 to 1.

get_label doesn't return anything.
You want to return ser.map(lambda x : x if x == 0 else 1).
def get_label(ser):
    return ser.map(lambda x: x if x == 0 else 1)
Besides that, apply doesn't act in place; it always returns a new object. Therefore you need
df[columns] = df[columns].apply(get_label)
But in this simple case, using DataFrame.where should be much faster if you are dealing with large DataFrames.
df[columns] = df[columns].where(lambda x: x == 0, 1)
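As a minimal, self-contained sketch of the where() approach (the data and column names below are invented for the demonstration):
import pandas as pd

df = pd.DataFrame({'SDickeIst_a': [0.0, 1.2, 0.0],
                   'SDickeIst_b': [3.4, 0.0, 5.6]})
columns = [c for c in df.columns if 'sdickeist' in c.lower()]

# where() keeps values where the condition holds and replaces the rest,
# so zeros stay 0 and every other value becomes 1.
df[columns] = df[columns].where(lambda x: x == 0, 1)
print(df)
#    SDickeIst_a  SDickeIst_b
# 0          0.0          1.0
# 1          1.0          0.0
# 2          0.0          1.0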


Extract 1st Column data and update 2nd Column based on 1st Column data

I have an Excel file with the following data:
LogID
T-1111
P-09899
P-09189,T-0011
T-111,T-2111
P-09099,P-7897
RCT-0989,RCT-099
I need to extract the prefix of LogID before the delimiter "-" and then populate a second column 'LogType' based on the string extracted (T is the Tank LogType, P is the Pump LogType).
For the above input, the output should be
LogID             LogType
T-1111            Tank
P-09899           Pump
P-09189,T-0011    Multiple
T-111,T-2111      Tank
P-09099,P-7897    Pump
RCT-0989,RCT-099  Reactor
I have written a function to do this in Python:
def log_parser(log_string):
    log_dict = {"T": "Tank", "P": "Pump"}
    log_list = log_string.split(",")
    for i in log_list:
        str_extract = i.upper().split("-", 1)
        if len(log_list) == 1:
            result = log_dict[str_extract[0]]
            return result
        else:
            idx = log_list.index(i)
            for j in range(len(log_list)):
                if idx == j:
                    continue
                str_extract_j = log_list[j].upper().split("-", 1)
                if str_extract_j[0] != str_extract[0]:
                    result = "Multiple"
                    return result
                else:
                    result = log_dict[str_extract[0]]
                    return result
I am not sure how to implement this function in pandas.
Can I define the function in pandas and then use the lambda apply function like this:
test_df['LogType'] = test_df[['LogID']].apply(lambda x: log_parser(x), axis=1)
You can use:
# mapping dictionary for types
d = {'T': 'Tank', 'P': 'Pump'}
# extract letters before -
s = df['LogID'].str.extractall('([A-Z])-')[0]
# group by index
g = s.groupby(level=0)
df['LogType'] = (g.first()                  # get first match
                  .map(d)                   # map type name
                  .mask(g.nunique().gt(1),  # mask if several types
                        'Multiple')
                )
Output:
            LogID   LogType
0          T-1111      Tank
1         P-09899      Pump
2  P-09189,T-0011  Multiple
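Note that the ([A-Z])- pattern captures only a single letter, so a multi-letter prefix such as RCT would be reduced to T. A hedged variant (my extension, not part of the original answer) captures the whole prefix and adds the Reactor mapping implied by the expected output:
import pandas as pd

df = pd.DataFrame({'LogID': ['T-1111', 'P-09899', 'P-09189,T-0011',
                             'T-111,T-2111', 'P-09099,P-7897',
                             'RCT-0989,RCT-099']})
# extended mapping; RCT -> Reactor is assumed from the question's sample output
d = {'T': 'Tank', 'P': 'Pump', 'RCT': 'Reactor'}
s = df['LogID'].str.extractall(r'([A-Z]+)-')[0]
g = s.groupby(level=0)
df['LogType'] = g.first().map(d).mask(g.nunique().gt(1), 'Multiple')
print(df)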

About the numpy.where statement

I would like to use numpy.where to check the value of a previous row, but I don't know how to write it:
from PIL import Image
import numpy as np

for n1 in range(len(image1)):
    print('input image ', input_folder + '\\' + image1[n1])
    print('\n')
    print('image1[n1] ', image1[n1])
    print('\n')
    im = Image.open(input_folder + '\\' + image1[n1])
    a = np.array(im, dtype='uint8')
    width, height = im.size
    print('width ', width)
    print('height ', height)
    a = np.where(a == [0, 0, 0], [255, 255, 255], a)
    # I want to change the looping statement below to something like
    # np.where(a[-??] == [255, 255, 255] or a[+??] == [255, 255, 255])
    # so it runs faster than the for loops:
    for h in range(height):
        for w in range(width):
            if h <= (height - 2) and w <= (width - 2):
                if a[h, w, 0] != 255 and a[h, w, 1] != 255 and a[h, w, 2] != 255:
                    if (a[h-1, w, 0] == 255 and a[h-1, w, 1] == 255 and a[h-1, w, 2] == 255 and
                            a[h+1, w, 0] == 255 and a[h+1, w, 1] == 255 and a[h+1, w, 2] == 255) or \
                       (a[h, w-1, 0] == 255 and a[h, w-1, 1] == 255 and a[h, w-1, 2] == 255 and
                            a[h, w+1, 0] == 255 and a[h, w+1, 1] == 255 and a[h, w+1, 2] == 255):
                        a[h, w, 0] = 255
                        a[h, w, 1] = 255
                        a[h, w, 2] = 255
I'm afraid you cannot use np.where here.
The reason is that the condition passed to np.where has to address each element of the source array, whereas the criterion in your code relates only to the first two dimensions of the source array.
So I came up with another, quite elegant and concise solution.
Part 1: How to get the first two indices of elements where all entries along the third dimension are != 255.
To do it on the whole array, you could run:
np.not_equal(a, 255).all(axis=2)
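A quick shape check (on a made-up array) shows why this condition cannot be fed to np.where directly: .all(axis=2) collapses the colour axis, leaving one boolean per pixel rather than one per array element.
import numpy as np

a = np.zeros((4, 6, 3), dtype='uint8')   # synthetic 4 x 6 RGB image
mask = np.not_equal(a, 255).all(axis=2)
print(mask.shape)                        # (4, 6), not (4, 6, 3)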
Part 2: How to limit the "range of operation" to elements having both a previous and a next row and column.
You can do it by passing a "subrange" of the original array to the above code:
np.not_equal(a[1:-1, 1:-1], 255).all(axis=2)
You should eliminate both the first and the last row and column (in your code you failed to eliminate the first row / column).
But note that this time the resulting indices are smaller by one than before, so at a later step you will have to add 1 to them.
Part 3: A function to check whether all elements along the third dimension == 255, for some row (r) and column (c):
def all_eq(arr, r, c):
    return np.equal(arr[r, c], 255).all()
(it will be used soon).
Part 4: How to get the result:
res = a.copy()
for r, c in zip(*np.where(np.not_equal(a[1:-1, 1:-1], 255).all(axis=2))):
    h = r + 1
    w = c + 1
    if all_eq(a, h-1, w) and all_eq(a, h+1, w) or \
       all_eq(a, h, w-1) and all_eq(a, h, w+1):
        res[h, w] = 255
Note that this code starts by making a copy of the original array (it will hold the result).
Then for r, c in zip(…) iterates over the indices found.
The first two lines in the loop add 1 to the indices found, since they refer to the subrange of the original array; h and w then indicate a row / column in the whole original array.
The if then checks whether the respective adjacent pixels have 255 in all elements.
If they do, 255 is put in all elements of the "current" pixel in the result.
You can't operate on the original array, since changed values in some pixels would "falsify" the evaluation of conditions for subsequent pixels.
Edit
After some research I found that it is possible to use np.where, although the solution is a bit complicated and involves quite a number of Numpy methods:
# Mask 1: Pixels with all elements != 255
m1 = np.zeros((height, width), dtype='int8')
idx = np.where(np.not_equal(a, 255).all(axis=2))
m1[idx] = 1
# Pixels with all elements == 255
m2 = np.apply_along_axis(lambda px: np.equal(px, 255).all(), 2, a).astype('int8')
# Both adjacent pixels (left / right) == 255
m2a = np.logical_and(np.insert(m2, 0, 0, axis=1)[:, :-1],
                     np.insert(m2, width, 0, axis=1)[:, 1:])
# Both adjacent pixels (up / down) == 255
m2b = np.logical_and(np.insert(m2, 0, 0, axis=0)[:-1, :],
                     np.insert(m2, height, 0, axis=0)[1:, :])
# Mask 2: Both adjacent pixels (either vertically or horizontally) == 255
m2 = np.logical_or(m2a, m2b)
# The "final" mask
msk = np.logical_and(m1, m2)
# Generate the result
result = np.where(np.expand_dims(msk, 2), 255, a)
This solution should be substantially faster than my first concept.
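As a hedged end-to-end check, here is the same mask pipeline run on a tiny synthetic image (all values invented for the demonstration): a white background with one isolated non-white pixel, which the masks should clear.
import numpy as np

height, width = 5, 5
a = np.full((height, width, 3), 255, dtype='uint8')
a[2, 2] = [10, 20, 30]          # non-white pixel, white above and below

# Mask 1: pixels with all elements != 255
m1 = np.zeros((height, width), dtype='int8')
m1[np.where(np.not_equal(a, 255).all(axis=2))] = 1
# Pixels with all elements == 255
m2 = np.apply_along_axis(lambda px: np.equal(px, 255).all(), 2, a).astype('int8')
# Both horizontal / both vertical neighbours == 255
m2a = np.logical_and(np.insert(m2, 0, 0, axis=1)[:, :-1],
                     np.insert(m2, width, 0, axis=1)[:, 1:])
m2b = np.logical_and(np.insert(m2, 0, 0, axis=0)[:-1, :],
                     np.insert(m2, height, 0, axis=0)[1:, :])
msk = np.logical_and(m1, np.logical_or(m2a, m2b))
result = np.where(np.expand_dims(msk, 2), 255, a)

print(result[2, 2])             # [255 255 255] -- the lone pixel was cleared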

ORTools CP-Sat Solver Channeling Constraint dependent on x

I am trying to add the following constraints to my model. My problem: the function g() expects x as a binary numpy array, so the result arr_a depends on the current value of x at every step of the optimization!
Afterwards, I want the max of this array times x to be smaller than 50.
How can I add this constraint dynamically, so that arr_a is always correctly calculated with the value of x at each iteration, while telling the model to keep the constraint arr_a * x <= 50? Currently I am getting an error when adding the constraint to the model, because g() expects x as a numpy array to calculate arr_a, arr_b and arr_c (g uses np.where(x == 1) in its calculation).
# Init model
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# Declare the variables
x = []
for i in range(self.ds.n_banks):
    x.append(model.NewIntVar(0, 1, "x[%i]" % (i)))
# add bool vars
a = model.NewBoolVar('a')
arr_a, arr_b, arr_c = g(df1, df2, df3, x)
model.Add((arr_a.astype('int32') * x).max() <= 50).OnlyEnforceIf(a)
model.Add((arr_a.astype('int32') * x).max() > 50).OnlyEnforceIf(a.Not())
Afterwards I add the target function, which naturally also depends on x:
model.Minimize(target(x))
def target(x):
    arr_a, arr_b, arr_c = g(df1, df2, df3, x)
    return (3 * arr_b * x + 2 * arr_c * x).sum()
EDIT:
My problem changed a bit and I managed to get it to work without issues. Nevertheless, I found that the constraint is never actually met! self_defined_function is a highly non-linear function that expects the indices where x == 1 and where x == 0 and returns a numpy array. It is also not possible to rebuild it with the predefined functions of the SAT solver.
# Init model
model = cp_model.CpModel()
# Declare the variables
x = [model.NewIntVar(0, 1, "x[%i]" % (i)) for i in range(66)]
# add hints
[model.AddHint(x[i], np.random.choice(2, 1, p=[0.4, 0.6])[0]) for i in range(66)]
open_elements = [model.NewBoolVar("open_elements[%i]" % (i)) for i in range(66)]
closed_elements = [model.NewBoolVar("closed_elements[%i]" % (i)) for i in range(66)]
# open indices as bool vars
for i in range(66):
    model.Add(x[i] == 1).OnlyEnforceIf(open_elements[i])
    model.Add(x[i] != 1).OnlyEnforceIf(open_elements[i].Not())
    model.Add(x[i] != 1).OnlyEnforceIf(closed_elements[i])
    model.Add(x[i] == 1).OnlyEnforceIf(closed_elements[i].Not())
model.Add((self_defined_function(np.where(open_elements), np.where(closed_elements), some_array).astype('int32') * x - some_vector).all() <= 0)
Even when I apply a simpler function, it does not work properly:
model.Add((self_defined_function(x, some_array).astype('int32') * x - some_vector).all() <= 0)
I also tried the following:
arr_indices_open = []
arr_indices_closed = []
for i in range(66):
    if open_elements[i] == True:
        arr_indices_open.append(i)
    else:
        arr_indices_closed.append(i)
# final constraint
arr_ = self_defined_function(arr_indices_open, arr_indices_closed, some_array)[0].astype('int32')
for i in range(66):
    model.Add(arr_[i] * x[i] <= some_other_vector[i])
Here is a minimal example of the self_defined_function, with which I simply try to say that n_closed shall be smaller than 10. Even that condition is not met by the solver:
def self_defined_function(arr_indices_closed):
    return len(arr_indices_closed)

arr_ = self_defined_function(arr_indices_closed)
for i in range(66):
    model.Add(arr_ < 10)
I'm not sure I fully understand the question, but generally, if you want to optimize a function g(x), you'll have to implement it using the solver's primitives (docs).
That is easy when your calculation coincides with an existing solver function, e.g. when you're computing a linear expression, but it can get harder when you're computing something more complex. However, I believe that's the only way.
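As a minimal sketch of that point, the "n_closed shall be smaller than 10" example from the question can be expressed entirely with CP-SAT primitives: the count of closed elements is a linear expression over the decision variables, so no external Python function is needed (the names n and limit here are illustrative).
from ortools.sat.python import cp_model

model = cp_model.CpModel()
n, limit = 66, 10
x = [model.NewBoolVar("x[%i]" % i) for i in range(n)]

# "closed" means x[i] == 0, so the number of closed elements is n - sum(x).
n_closed = model.NewIntVar(0, n, "n_closed")
model.Add(n_closed == n - sum(x))
model.Add(n_closed < limit)     # a constraint the solver can propagate

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("closed:", sum(1 for v in x if solver.Value(v) == 0))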

Pandas groupby aggregating multiple columns: sum() only provides a count when using lambda

I have been trying to aggregate multiple columns in a groupby, using lambda functions to select which rows to sum(). The problem I have is that the sum() only provides a count. I am very mediocre at pandas and have searched but not located an answer. Any answer would be highly appreciated, and I certainly appreciate your time.
groupedByEmployeeShift['Duration1'] = groupedByEmployeeShift['Duration']                    ### dummy column for OTShifts below
groupedByEmployeeShift['RoundedInMinutes1'] = groupedByEmployeeShift['RoundedInMinutes']    ### dummy column for RoundedInMinutes below
groupedByEmployeeShift['RoundedOutMinutes1'] = groupedByEmployeeShift['RoundedOutMinutes']  ### dummy column for RoundedOutMinutes below
shiftStats = groupedByEmployeeShift.groupby('employee').agg(
    WorkLocation=('WorkedLocation', 'first'),
    AllShifts=('Duration', 'count'),
    OTShifts=('Duration1', lambda x: (x > 8).sum()),
    NoRoundedInMinutes=('RoundedInMinutes', lambda x: (x == 0).sum()),
    NoRoundedOutMinutes=('RoundedOutMinutes', lambda x: (x == 0).sum()),
    RoundedInMinutes=('RoundedInMinutes1', lambda x: (x > 0).sum()),
    RoundedOutMinutes=('RoundedOutMinutes1', lambda x: (x > 0).sum()))
The results of logical operations such as (x > 0) in your lambda functions are boolean arrays, so (x > 0).sum() returns the sum over the boolean values, which is equivalent to the count of True instances in the resulting array.
If you want to return the sum over x where the condition is True, you can use: lambda x: x[x > 0].sum()
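As a quick illustration of the difference on a toy Series:
import pandas as pd

s = pd.Series([0, 5, 9, 12])
print((s > 8).sum())    # 2  -> the number of values satisfying the condition
print(s[s > 8].sum())   # 21 -> the sum of the values satisfying the condition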

Cleaner pandas apply with function that cannot use pandas.Series and non-unique index

In the following, func represents a function that uses multiple columns (with coupling across the group) and cannot operate directly on pandas.Series. The 0*d['x'] syntax was the lightest I could think of to force the conversion, but I think it's awkward.
Additionally, the resulting pandas.Series (s) still includes the group index, which must be removed before adding as a column to the pandas.DataFrame. The s.reset_index(...) index manipulation seems fragile and error-prone, so I'm curious if it can be avoided. Is there an idiom for doing this?
import pandas
import numpy

df = pandas.DataFrame(dict(i=[1]*8, j=[1]*4 + [2]*4, x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i', 'j'])
print('# df\n', df)

def func(d):
    x = numpy.array(d['x'])
    y = numpy.array(d['y'])
    # I want to do math with x, y that cannot be applied to
    # pandas.Series, so explicitly convert to numpy arrays.
    #
    # We have to return an appropriately-indexed pandas.Series
    # in order for it to be admissible as a column in the
    # pandas.DataFrame. Instead of simply "return x + y", we
    # have to make the conversion.
    return 0*d['x'] + x + y

s = df.groupby(df.index).apply(func)
# The Series is still adorned with the (unnamed) group index,
# which will prevent adding it as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)
df['z'] = s
print('# df\n', df)
Instead of
0*d['x'] + x + y
you could use
pd.Series(x+y, index=d.index)
When using groupby-apply, instead of dropping the group key index using:
s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
you can tell groupby to drop the keys using the keyword parameter group_keys=False:
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
import pandas as pd
import numpy as np

df = pd.DataFrame(dict(i=[1]*8, j=[1]*4 + [2]*4, x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i', 'j'])

def func(d):
    x = np.array(d['x'])
    y = np.array(d['y'])
    return pd.Series(x + y, index=d.index)

df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)
yields
       x            y            z
i j
1 1    0  1000.000000  1000.000000
  1    1  1000.841471  1001.841471
  1    2  1000.909297  1002.909297
  1    3  1000.141120  1003.141120
  2    0  2000.000000  2000.000000
  2    1  2000.841471  2001.841471
  2    2  2000.909297  2002.909297
  2    3  2000.141120  2003.141120