I currently loop through a pandas dataframe that contains orders, so that I can remove the ordered items from inventory and keep track of which orders may not get filled (this is part of a reservation system).
I'd love to avoid the loop and do this in a more pythonic/panda-esque way, but I haven't been able to come up with anything that lets me get to the level of granularity I'd like. Any ideas would be much appreciated!
Here's a much simplified version of this.
Examples of the input would look like this:
import pandas as pd
import random
def get_inventory():
    df_inv = pd.DataFrame([{'sku': 'A1', 'remaining': 1000},
                           {'sku': 'A2', 'remaining': 600},
                           {'sku': 'A3', 'remaining': 180},
                           {'sku': 'B1', 'remaining': 800},
                           {'sku': 'B2', 'remaining': 500},
                           ], columns=['sku', 'remaining']).set_index('sku')
    df_inv.loc[:, 'allocated'] = 0
    df_inv.loc[:, 'reserved'] = 0
    df_inv.loc[:, 'missed'] = 0
    return df_inv
def get_reservations():
    skus = ['A1', 'A2', 'A3', 'B1', 'B2']
    res = []
    for i in range(1000):
        res.append({'order_id': i,
                    'sku': random.choice(skus),
                    'number_of_items_reserved': 1})
    df_res = pd.DataFrame(res,
                          columns=['order_id', 'sku', 'number_of_items_reserved'])
    return df_res
Inventory:
df_inv = get_inventory()
print(df_inv)
remaining allocated reserved missed
sku
A1 1000 0 0 0
A2 600 0 0 0
A3 180 0 0 0
B1 800 0 0 0
B2 500 0 0 0
Reservations:
df_res = get_reservations()
print(df_res.head(10))
order_id sku number_of_items_reserved
0 0 A3 1
1 1 B1 1
2 2 A3 1
3 3 A1 1
4 4 B1 1
5 5 B1 1
6 6 B1 1
7 7 B1 1
8 8 A3 1
9 9 B1 1
The logic to allocate reservations to inventory looks roughly like this:
(this is the part I'd love to replace)
"""
df_inv: inventory grouped (indexed) by sku (style and size)
df_res: reservations by order id for a style and size
"""
df_inv = get_inventory()
df_res = get_reservations()
for i, res in df_res.iterrows():
    sku = res['sku']
    n_items = res['number_of_items_reserved']
    inv = df_inv.loc[sku, 'remaining']
    df_inv.loc[sku, 'reserved'] += n_items
    if (inv - n_items) >= 0:
        df_inv.loc[sku, 'allocated'] += n_items
        df_inv.loc[sku, 'remaining'] -= n_items
    else:
        df_inv.loc[sku, 'missed'] += n_items
Results:
remaining allocated reserved missed
sku
A1 817 183 183 0
A2 390 210 210 0
A3 0 180 210 30
B1 613 187 187 0
B2 290 210 210 0
You can get away without looping thanks to the intrinsic data alignment in pandas.
df_inv = get_inventory()
df_res = get_reservations()
This creates a Series indexed by 'sku':
n_items = df_res.groupby('sku')['number_of_items_reserved'].sum()
shortage = df_inv['remaining'] - n_items
enough_inv = shortage > 0
Because pandas aligns data intrinsically, and both df_inv and the Series created above are indexed by 'sku', these calculations are done per sku. Boolean indexing then determines which skus have enough inventory, so we either increment allocated and decrement remaining, or increment missed.
df_inv['reserved'] += n_items
df_inv.loc[enough_inv,'allocated'] += n_items
df_inv.loc[enough_inv,'remaining'] -= n_items
df_inv.loc[~enough_inv,'missed'] -= shortage
df_inv.loc[~enough_inv,'allocated'] += n_items + shortage
df_inv.loc[~enough_inv,'remaining'] = 0
print(df_inv)
Output:
remaining allocated reserved missed
sku
A1 815.0 185.0 185 0.0
A2 410.0 190.0 190 0.0
A3 0.0 180.0 200 20.0
B1 586.0 214.0 214 0.0
B2 289.0 211.0 211 0.0
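The same alignment trick is easier to follow on a tiny fixed frame. This is a sketch with made-up numbers (not the question's data), chosen so that one sku runs short; an exact fit is treated as "enough":

```python
import pandas as pd

# Hypothetical mini-inventory: A1 has plenty, A2 will run short.
df_inv = pd.DataFrame({'remaining': [5, 2], 'allocated': 0,
                       'reserved': 0, 'missed': 0},
                      index=pd.Index(['A1', 'A2'], name='sku'))
df_res = pd.DataFrame({'sku': ['A1', 'A2', 'A2', 'A2'],
                       'number_of_items_reserved': [3, 1, 1, 1]})

n_items = df_res.groupby('sku')['number_of_items_reserved'].sum()  # A1: 3, A2: 3
shortage = df_inv['remaining'] - n_items                           # A1: 2, A2: -1
enough_inv = shortage >= 0

df_inv['reserved'] += n_items
df_inv.loc[enough_inv, 'allocated'] += n_items
df_inv.loc[enough_inv, 'remaining'] -= n_items
df_inv.loc[~enough_inv, 'missed'] -= shortage               # -(-1) -> 1 missed
df_inv.loc[~enough_inv, 'allocated'] += n_items + shortage  # 3 + (-1) -> 2
df_inv.loc[~enough_inv, 'remaining'] = 0
print(df_inv)
```

Every right-hand side is a Series indexed by sku, so each `+=` lands on the matching row without any explicit lookup.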
Related
I have a question about how to optimize my code, specifically the loops. I compute, per row, the maximum of two values, or sometimes the maximum of a value and a number.
I tried to rewrite my code using .loc and .clip, but when max or min shows up multiple times I have some trouble with the logical expressions.
This is what it looked like at the beginning:
def Calc(row):
    if row['Forecast'] == 0:
        return max(row['Qty'], 0)
    elif row['def'] == 1:
        return 0
    elif row['def'] == 0:
        return round(max(row['Qty'] - (max(row['Forecast_total'] * 14,
                                           row['Qty_12m_1'] + row['Qty_12m_2'])
                                       * max(1, (row['Total'] / row['Forecast']) / 54)), 0))

df['Calc'] = df.apply(Calc, axis=1)
I managed to change it using the functions I mentioned, but I have a problem with how to write this max(max()):
df.loc[(combined_sf2['Forecast'] == 0), 'Calc'] = df.clip(0, None)
df.loc[(combined_sf2['def'] == 1), 'Calc'] = 0
df.loc[(combined_sf2['def'] == 0), 'Calc'] = round(max(df['Qty'] - (max(df['Forecast_total'] * 14,
                                                       df['Qty_12m_1'] + df['Qty_12m_2'])
                                                       * max(1, (df['Total'] / df['Forecast']) / 54)), 0))
The first two lines work; the last one doesn't.
id Forecast def Calc Qty Forecast_total Qty_12m_1 Qty_12m_2 Total
31551 0 0 0 2 0 0 0 95
27412 0,1 0 1 3 0,1 11 0 7
23995 0,1 0 0 4 0 1 0 7
27411 5,527 1 0,036186 60 0,2 64 0 183
28902 5,527 0 0,963814 33 5,327 277 0 183
23954 5,527 0 0 6 0 6 0 183
23994 5,527 0 0 8 0 0 0 183
31549 5,527 0 0 6 0 1 0 183
31550 5,527 0 0 6 0 10 0 183
Use numpy.select, and instead of max use numpy.maximum:
import numpy as np

m1 = df['Forecast'] == 0
m2 = df['def'] == 1
m3 = df['def'] == 0

s1 = df['Qty'].clip(lower=0)
s3 = round(np.maximum(df['Qty'] - (np.maximum(df['Forecast_total'] * 14,
                                              df['Qty_12m_1'] + df['Qty_12m_2'])
                                   * np.maximum(1, (df['Total'] / df['Forecast']) / 54)), 0))
df['Calc2'] = np.select([m1, m2, m3], [s1, 0, s3], default=None)
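A minimal runnable sketch of the np.select/np.maximum approach, using made-up values for the question's columns (the first row triggers the Forecast == 0 branch, the second the def == 0 branch):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) reusing the question's column names.
df = pd.DataFrame({'Forecast': [0.0, 2.0],
                   'def':      [0,   0],
                   'Qty':      [5,   10],
                   'Forecast_total': [0.0, 1.0],
                   'Qty_12m_1': [0, 3],
                   'Qty_12m_2': [0, 4],
                   'Total':     [1, 54]})

m1 = df['Forecast'] == 0
m2 = df['def'] == 1
m3 = df['def'] == 0

s1 = df['Qty'].clip(lower=0)
# np.maximum is element-wise, so it replaces Python's max() over whole columns.
s3 = np.maximum(df['Qty'] - (np.maximum(df['Forecast_total'] * 14,
                                        df['Qty_12m_1'] + df['Qty_12m_2'])
                             * np.maximum(1, (df['Total'] / df['Forecast']) / 54)),
                0).round()

# Conditions are checked in order; the first match wins per row.
df['Calc2'] = np.select([m1, m2, m3], [s1, 0, s3], default=np.nan)
print(df['Calc2'].tolist())
```

Row 0 takes s1 (Qty clipped at 0); row 1 takes s3, where Qty - max(14, 7) * max(1, 0.5) is negative and gets floored to 0.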
I have a two-dimensional variable in AMPL and I want to display it with the order of the indices changed, but I do not know how to do that. Below I put my code, data and output, and describe the output I would like to have.
Here is my code:
param n;
param t;
param w;
param p;
set Var, default{1..n};
set Ind, default{1..t};
set mode, default{1..w};
var E{mode, Ind};
var B{mode,Var};
var C{mode,Ind};
param X{mode,Var,Ind};
var H{Ind};
minimize obj: sum{m in mode,i in Ind}E[m,i];
s.t. a1{m in mode, i in Ind}: sum{j in Var} X[m,j,i]*B[m,j] -C[m,i] <=E[m,i];
solve;
display C;
data;
param w:=4;
param n:=9;
param t:=2;
param X:=
[*,*,1]: 1 2 3 4 5 6 7 8 9 :=
1 69 59 100 70 35 1 1 0 0
2 34 31 372 71 35 1 0 1 0
3 35 25 417 70 35 1 0 0 1
4 0 10 180 30 35 1 0 0 0
[*,*,2]: 1 2 3 4 5 6 7 8 9 :=
1 64 58 68 68 30 2 1 0 0
2 44 31 354 84 30 2 0 1 0
3 53 25 399 85 30 2 0 0 1
4 0 11 255 50 30 2 0 0 0
The output of this code using glpsol is like this:
C[1,1].val = -1.11111111111111
C[1,2].val = -1.11111111111111
C[2,1].val = -0.858585858585859
C[2,2].val = -1.11111111111111
C[3,1].val = -0.915032679738562
C[3,2].val = -1.11111111111111
C[4,1].val = 0.141414141414141
C[4,2].val = 0.2003367003367
but I want the result to be like this:
C[1,1].val = -1.11111111111111
C[2,1].val = -0.858585858585859
C[3,1].val = -0.915032679738562
C[4,1].val = 0.141414141414141
C[1,2].val = -1.11111111111111
C[2,2].val = -1.11111111111111
C[3,2].val = -1.11111111111111
C[4,2].val = 0.2003367003367
Any ideas?
You can use for loops and printf commands in your .run file:
for {i in Ind}
    for {m in mode}
        printf "C[%d,%d] = %.4f\n", m, i, C[m,i];
or even:
printf {i in Ind, m in mode} "C[%d,%d] = %.4f\n", m, i, C[m,i];
I don't get the same numerical results as you, but anyway the output works:
C[1,1] = 0.0000
C[2,1] = 0.0000
C[3,1] = 0.0000
C[4,1] = 0.0000
C[1,2] = 0.0000
C[2,2] = 0.0000
C[3,2] = 0.0000
C[4,2] = 0.0000
Given a pandas dataframe like this:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
col1 col2
0 1 4
1 2 5
2 3 6
I would like to do something equivalent to the following using a function, but without passing the whole dataframe "by value" or as a global variable (it could be huge, and then it would give me a memory error):
i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
    c1 = df.loc[i].iloc[0] + df.loc[index].iloc[0]
    c2 = df.loc[i].iloc[1] + df.loc[index].iloc[1]
    df.iloc[index, 0] = c1
    df.iloc[index, 1] = c2
    i = index
col1 col2
0 1 4
1 3 9
2 6 15
i.e., I would like to have a function which will give me the previous output:
def my_function(two_rows):
    row1 = two_rows[0]
    row2 = two_rows[1]
    c1 = row1[0] + row2[0]
    c2 = row1[1] + row2[1]
    row2[0] = c1
    row2[1] = c2
    return row2
df.apply(my_function, axis=1)
df
col1 col2
0 1 4
1 3 9
2 6 15
Is there a way of doing this?
What you've demonstrated is a cumulative sum, which pandas provides directly:
df.cumsum()
col1 col2
0 1 4
1 3 9
2 6 15
To define a function as a loop that does this in place (slow, cell by cell):
def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            df[j].values[i] += df[j].values[i - 1]
    return df

f(df)
col1 col2
0 1 4
1 3 9
2 6 15
Compromise between memory and efficiency
def f(df):
    for j in df.columns:
        df[j].values[:] = df[j].values.cumsum()
    return df

f(df)
col1 col2
0 1 4
1 3 9
2 6 15
Note that you don't need to return df. I chose to for convenience.
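As a quick sanity check (not part of the original answer), a positional running-sum loop on the sample frame reproduces df.cumsum() exactly:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Add each row into the next one, front to back -- the loop the question describes.
looped = df.copy()
for i in range(1, len(looped)):
    looped.iloc[i] = looped.iloc[i] + looped.iloc[i - 1]

assert looped.equals(df.cumsum())
print(looped)
```

The vectorized cumsum avoids both the Python-level loop and the repeated row indexing.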
I am new to data analysis. I want to find the position of the cell containing a given input string.
Example:
Price     Rate p/lot   Total Comm
947.20    1.25         CAD 1.25
129.30    2.10         CAD 1.25
161.69    0.80         CAD 2.00
How do I find the position of the string "CAD 2.00"?
The required output is (2, 2).
In [353]: rows, cols = np.where(df == 'CAD 2.00')
In [354]: rows
Out[354]: array([2], dtype=int64)
In [355]: cols
Out[355]: array([2], dtype=int64)
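For a fully self-contained version of that lookup, with the frame reconstructed from the question's table (column names assumed from the header):

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the question's example table.
df = pd.DataFrame({'Price': [947.2, 129.3, 161.69],
                   'Rate p/lot': [1.25, 2.1, 0.8],
                   'Total Comm': ['CAD 1.25', 'CAD 1.25', 'CAD 2.00']})

# Element-wise comparison against the whole frame; numeric cells compare as False.
rows, cols = np.where(df == 'CAD 2.00')
print(list(zip(rows, cols)))
```

np.where returns all matching positions at once, so it also covers the "all occurrences" case below.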
Replace the column names with a numeric range, stack, and use idxmax to find the first occurrence of the value:
d = dict(zip(df.columns, range(len(df.columns))))
s = df.rename(columns=d).stack()
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
If you want to check all occurrences, use boolean indexing and convert the MultiIndex to a list:
a = s[(s == 'CAD 1.25')].index.tolist()
print (a)
[(0, 2), (1, 2)]
Explanation:
Create a dict for renaming the column names to a range:
d = dict(zip(df.columns, range(len(df.columns))))
print (d)
{'Rate p/lot': 1, 'Price': 0, 'Total Comm': 2}
print (df.rename(columns=d))
0 1 2
0 947.20 1.25 CAD 1.25
1 129.30 2.10 CAD 1.25
2 161.69 0.80 CAD 2.00
Then reshape with stack into a MultiIndex of positions:
s = df.rename(columns=d).stack()
print (s)
0 0 947.2
1 1.25
2 CAD 1.25
1 0 129.3
1 2.1
2 CAD 1.25
2 0 161.69
1 0.8
2 CAD 2.00
dtype: object
Compare by string:
print (s == 'CAD 2.00')
0 0 False
1 False
2 False
1 0 False
1 False
2 False
2 0 False
1 False
2 True
dtype: bool
And get the position of the first True, i.e. the values of the MultiIndex:
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
Another solution is to use numpy.nonzero to check the values, zip the indices together and convert to a list:
i, j = (df.values == 'CAD 2.00').nonzero()
t = list(zip(i, j))
print (t)
[(2, 2)]
i, j = (df.values == 'CAD 1.25').nonzero()
t = list(zip(i, j))
print (t)
[(0, 2), (1, 2)]
A simple alternative, returning (row, column) like the solutions above:
def value_loc(value, df):
    for col in list(df):
        if value in df[col].values:
            return (df[col][df[col] == value].index[0], list(df).index(col))
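A quick self-contained check of such a column-scanning helper on the question's frame (defined here with the row-first return order, matching the required (2, 2) output):

```python
import pandas as pd

def value_loc(value, df):
    # Scan columns left to right; return (row label, column position)
    # of the first cell equal to value, or None if absent.
    for col in list(df):
        if value in df[col].values:
            return (df[col][df[col] == value].index[0], list(df).index(col))

# Frame reconstructed from the question's example table.
df = pd.DataFrame({'Price': [947.2, 129.3, 161.69],
                   'Rate p/lot': [1.25, 2.1, 0.8],
                   'Total Comm': ['CAD 1.25', 'CAD 1.25', 'CAD 2.00']})
print(value_loc('CAD 2.00', df))
```

Unlike the np.where solution, this stops at the first match and returns a single tuple.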
I have the following dataframe,
df = pd.DataFrame({
    'CARD_NO': [0, 1, 2, 2, 1, 111],
    'request_code': [2400, 2200, 2400, 3300, 5500, 6600],
    'merch_id': [1, 2, 1, 3, 3, 5],
    'resp_code': [0, 1, 0, 1, 1, 1]})
Based on this requirement,
inquiries = df[(df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)]
I need to flag the records in df that share a CARD_NO with a row in inquiries.
If inquiries returns:
index CARD_NO merch_id request_code resp_code
0 0 1 2400 0
2 2 1 2400 0
Then df should look like so:
index CARD_NO merch_id request_code resp_code flag
0 0 1 2400 0 N
1 1 2 2200 1 N
2 2 1 2400 0 N
3 2 3 3300 1 Y
4 1 3 5500 1 N
5 111 5 6600 1 N
I've tried several merges, but cannot seem to get the result I want.
Any help would be greatly appreciated.
Thank you.
The following should work if I understand your question correctly, which is that you want to set the flag to true only when the CARD_NO is in the filtered group but the row itself is not in the filtered group.
import numpy as np
filter = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
df['flag'] = np.where(~filter & df.CARD_NO.isin(df.loc[filter, 'CARD_NO']), 'Y', 'N')
filtered = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
df["flag"] = filtered.map(lambda x: "Y" if x else "N")
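Putting the first approach together on the sample data (card numbers written as plain ints, since the expected output shows 0, 1, 2):

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'CARD_NO': [0, 1, 2, 2, 1, 111],
    'request_code': [2400, 2200, 2400, 3300, 5500, 6600],
    'merch_id': [1, 2, 1, 3, 3, 5],
    'resp_code': [0, 1, 0, 1, 1, 1]})

mask = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
# 'Y' for rows whose CARD_NO appears among the inquiry rows
# but which are not inquiry rows themselves.
df['flag'] = np.where(~mask & df.CARD_NO.isin(df.loc[mask, 'CARD_NO']), 'Y', 'N')
print(df['flag'].tolist())
```

Only row 3 (CARD_NO 2, which matches inquiry row 2 but is not itself an inquiry) gets 'Y', reproducing the expected output in the question.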