Compare multiple records in pandas

I am trying to compare df1 and df2 on 'Cntr No': for each row of df1, the value in any of df2's columns ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency'] (for the same 'Cntr No') must match df1's Total.
For example, df1's OOLU 3868088 matches df2's OOLU 3868088, and df1's Total of 28 matches df2's 'Labour Cost' value of 28.
Data:
import pandas as pd

df1 = pd.DataFrame({'Cntr No': ['OOLU 3868088','OOLU 3868088','OOLU 3868088','TRIU 0625840','TRIU 0625840','TRIU 0625840','TRIU 1234567','OOLU 6232016','OOLU 0981231','OOLU 1212444'],
                    'Total': [12,28,48,119,82.5,11.0,18.0,11.0,13.0,10.0]})
df2 = pd.DataFrame({'Cntr No': ['OOLU 3868088','OOLU 3868088','OOLU 3868088','TRIU 0625840','TRIU 0625840','TRIU 0625840','TRIU 1234567'],
                    'Labour Cost': [0.0,0.0,28.0,0.0,54.0,0.0,0.0],
                    'Material Cost': [0.00,12.0,58.91,82.5,54.0,0.0,16.0],
                    'Amount in Estimate Currency': [48.00,12.00,87.81,82.5,119.0,12.0,16.0]})
Expected output:
Cntr No Total Tally_with_df2
0 OOLU 3868088 12.0 Yes
1 OOLU 3868088 28.0 Yes
2 OOLU 3868088 48.0 Yes
3 TRIU 0625840 119.0 Yes
4 TRIU 0625840 82.5 Yes
5 TRIU 0625840 11.0 No
6 TRIU 1234567 18.0 No
Code I tried (it does not achieve the requirement):
cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']
# Note: this raises ValueError, because to_dict(orient='index') requires a unique
# index, and set_index('Cntr No') leaves duplicate 'Cntr No' labels.
d = {k: set(v.values()) for k, v in
     df2.set_index('Cntr No')[cols].to_dict(orient='index').items()}
df1['Tally'] = [j in d.get(i, set()) for i, j in zip(df1['Cntr No'], df1['Total'])]
df1['Tally'] = df1['Tally'].map({True: 'Yes', False: 'No'})
df1 dtypes:
Cntr No object
Serviced By object
Location object
WO No object
WASH - CHEMICAL float64
PTI - CHILL float64
WASHING CONTAINER AGENT float64
WASH - CHEMICAL AGENT float64
WASHING CONTAINER -AGENT float64
BUNDLING/UNBUNDLING OF FR float64
PTI - AUTO float64
PTI float64
Struct Repair - Labour float64
Struct Repair - Material float64
Machy Repair - Labour float64
Total float64
Vendor object
Sz object
Ty object
CO object
WO Date object
WO ID object
df2 dtypes:
Cntr No object
Equipment Size/type Group Code object
Labour Cost float64
Material Cost float64
Amount in Estimate Currency float64
Remarks object

IIUC, we can build grouped data from df2 for every unique Cntr No.
## this is the grouped data
import numpy as np

to_remove = df2.select_dtypes(['object']).columns.tolist()
df3 = (df2
       .groupby('Cntr No')
       .apply(lambda df: set(np.concatenate(df.loc[:, df.columns.difference(to_remove)].values))))
## df3 looks like this - sets give fast membership tests
print(df3)
Cntr No
OOLU 3868088 {0.0, 12.0, 48.0, 87.81, 58.91, 28.0}
TRIU 0625840 {0.0, 12.0, 82.5, 54.0, 119.0}
TRIU 1234567 {16.0, 0.0}
## this function ensures all cases are handled
def get_value(x, data):
    if x['Cntr No'] not in data.index:
        return 'Not Found'
    else:
        if x['Total'] in data[x['Cntr No']]:
            return 'Yes'
        else:
            return 'No'
## next we do a simple look-up
df1['Tally_with_df2'] = df1.apply(lambda x: get_value(x, df3), axis=1)
print(df1)
Cntr No Total Tally_with_df2
0 OOLU 3868088 12.0 Yes
1 OOLU 3868088 28.0 Yes
2 OOLU 3868088 48.0 Yes
3 TRIU 0625840 119.0 Yes
4 TRIU 0625840 82.5 Yes
5 TRIU 0625840 11.0 No
6 TRIU 1234567 18.0 No
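For reference, a fully vectorized variant is also possible. This is only a sketch of the same idea, not part of the original answer (the names long2 and pairs are mine), and it collapses the 'Not Found' case into 'No': melt df2 to long form and test (Cntr No, Total) pairs for membership.

import pandas as pd

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']
# long form: one (Cntr No, value) pair per row
long2 = df2.melt(id_vars='Cntr No', value_vars=cols, value_name='Total')
pairs = set(zip(long2['Cntr No'], long2['Total']))
df1['Tally_with_df2'] = ['Yes' if (c, t) in pairs else 'No'
                         for c, t in zip(df1['Cntr No'], df1['Total'])]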

Concise way to concatenate consecutive rows in pandas

I would like to take a dataframe and concatenate consecutive rows for comparison.
e.g.
Take
xyt = pd.DataFrame(np.concatenate((np.random.randn(3,2), np.arange(3).reshape((3, 1))), axis=1), columns=['x','y','t'])
Which looks something like:
x y t
0 1.237007 -1.035837 0.0
1 -1.782458 1.042942 1.0
2 0.063130 0.355014 2.0
And make:
a b
x y t x y t
0 1.237007 -1.035837 0.0 -1.782458 1.042942 1.0
1 -1.782458 1.042942 1.0 0.063130 0.355014 2.0
The best I could come up with was:
pd.DataFrame(
    [np.append(x, y) for (x, y) in zip(xyt.values, xyt[1:].values)],
    columns=pd.MultiIndex.from_product([('a', 'b'), xyt.columns]))
Is there a better way?
Let's try concat on axis=1 with the shifted frame:
import pandas as pd
xyt = pd.DataFrame({'x': {0: 1.237007, 1: -1.782458, 2: 0.06313},
                    'y': {0: -1.035837, 1: 1.042942, 2: 0.355014},
                    't': {0: 0.0, 1: 1.0, 2: 2.0}})
# shift(-1) lines each row up with its successor; iloc[:-1] drops the last, incomplete pair
merged = pd.concat((xyt, xyt.shift(-1)), axis=1, keys=('a', 'b')).iloc[:-1]
print(merged)
merged:
a b
x y t x y t
0 1.237007 -1.035837 0.0 -1.782458 1.042942 1.0
1 -1.782458 1.042942 1.0 0.063130 0.355014 2.0
You can use pd.concat:
import numpy as np
import pandas as pd

# Generate random data
n = 10
x, y = np.random.randn(2, n)
t = np.arange(n)
xyt = pd.DataFrame({'x': x, 'y': y, 't': t})

# The call
pd.concat([xyt, xyt.shift(-1)], axis=1, keys=['a', 'b'])
# Result
a b
x y t x y t
0 1.180544 1.707380 0 -0.227370 0.734225 1.0
1 -0.227370 0.734225 1 0.271997 -1.039424 2.0
2 0.271997 -1.039424 2 -0.729960 -1.081224 3.0
3 -0.729960 -1.081224 3 0.185301 0.530126 4.0
4 0.185301 0.530126 4 -0.175333 -0.126157 5.0
5 -0.175333 -0.126157 5 -0.634870 0.068683 6.0
6 -0.634870 0.068683 6 0.350867 0.361564 7.0
7 0.350867 0.361564 7 0.090678 -0.269504 8.0
8 0.090678 -0.269504 8 0.177076 -0.976640 9.0
9 0.177076 -0.976640 9 NaN NaN NaN
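If you need windows longer than two rows, the same shift-and-concat idea generalizes. A sketch (the helper name window_concat and the letter labels are mine, not from the answers):

import string
import pandas as pd

def window_concat(df, n):
    # put n consecutive rows side by side, labelled 'a', 'b', 'c', ...
    keys = list(string.ascii_lowercase[:n])
    shifted = [df.shift(-i) for i in range(n)]
    # drop the trailing rows that have no complete window
    return pd.concat(shifted, axis=1, keys=keys).iloc[:len(df) - n + 1]

window_concat(xyt, 2)  # reproduces the two-row result above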

python pandas divide dataframe in method chain

I want to divide a dataframe by a number:
df = df/10
Is there a way to do this in a method chain?
# idea:
df = df.filter(['a','b']).query("a>100").assign(**divide by 10)
We can use DataFrame.div here:
df = df[['a','b']].query("a>100").div(10)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
Use DataFrame.pipe with a lambda to apply a function to the DataFrame as a whole:
df = pd.DataFrame({'a': [400, 500, 40, 50, 5, 700],
                   'b': [7, 8, 9, 4, 2, 3],
                   'c': [1, 3, 5, 7, 1, 0],
                   'd': [5, 3, 6, 9, 2, 4]})
df = df.filter(['a','b']).query("a>100").pipe(lambda x: x / 10)
print(df)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
If you use apply instead, each column is passed to the function separately:
df = df.filter(['a','b']).query("a>100").apply(lambda x: x / 10)
You can see the difference with print:
df1 = df.filter(['a','b']).query("a>100").pipe(lambda x: print(x))
a b
0 400 7
1 500 8
5 700 3
df2 = df.filter(['a','b']).query("a>100").apply(lambda x: print(x))
0 400
1 500
5 700
Name: a, dtype: int64
0 7
1 8
5 3
Name: b, dtype: int64
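For completeness, the asker's assign idea can also be made to work. A sketch (the dict-comprehension trick is mine, not from the answers); assign accepts callables, so each kept column can be rebuilt divided by 10:

df = (df.filter(['a', 'b'])
        .query("a > 100")
        .assign(**{c: (lambda d, c=c: d[c] / 10) for c in ['a', 'b']}))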

Pandas melt data based on two or more binary columns

I have a data frame that looks like this, with price, side and size columns from multiple exchanges.
df = pd.DataFrame({
    'price_ex1': [9380.59650, 9394.85206, 9397.80000],
    'side_ex1': ['bid', 'bid', 'ask'],
    'size_ex1': [0.416, 0.053, 0.023],
    'price_ex2': [9437.24045, 9487.81185, 9497.81424],
    'side_ex2': ['bid', 'bid', 'ask'],
    'size_ex2': [10.0, 556.0, 23.0]
})
df
price_ex1 side_ex1 size_ex1 price_ex2 side_ex2 size_ex2
0 9380.59650 bid 0.416 9437.24045 bid 10.0
1 9394.85206 bid 0.053 9487.81185 bid 556.0
2 9397.80000 ask 0.023 9497.81424 ask 23.0
For each exchange (I have more than two), I want the index to be the union of all prices from all exchanges (i.e. the union of price_ex1, price_ex2, ...), ranked from highest to lowest. Then I want two size columns for each exchange, split on that exchange's side value; empty cells should be NaN. (The desired table matches the final pivot output in the first answer below.)
I am not sure which pandas function fits best here, pivot or melt, or how to use it when I am flattening more than one such categorical column.
Thank you for your help!
This is a three-step process. After you correct your multiindexed columns, you should stack your dataset, then pivot it.
First, clean up the multiindex columns so that you can transform more easily:
df.columns = pd.MultiIndex.from_product(
    [['1', '2'], [col[:-4] for col in df.columns[:3]]],
    names=['exchange', 'params'])
exchange 1 2
params price side size price side size
0 9380.59650 bid 0.416 9437.24045 bid 10.0
1 9394.85206 bid 0.053 9487.81185 bid 556.0
2 9397.80000 ask 0.023 9497.81424 ask 23.0
Then stack and append the exchange num to the bid and ask values:
df = df.swaplevel(axis=1).stack()
df['side'] = df.apply(lambda row: row.side + '_ex' + row.name[1], axis=1)
params price side size
exchange
0 1 9380.59650 bid_ex1 0.416
2 9437.24045 bid_ex2 10.000
1 1 9394.85206 bid_ex1 0.053
2 9487.81185 bid_ex2 556.000
2 1 9397.80000 ask_ex1 0.023
2 9497.81424 ask_ex2 23.000
Finally, pivot and sort by price:
df.pivot_table(index=['price'], values=['size'], columns=['side']).sort_values('price', ascending=False)
params size
side ask_ex1 ask_ex2 bid_ex1 bid_ex2
price
9497.81424 NaN 23.0 NaN NaN
9487.81185 NaN NaN NaN 556.0
9437.24045 NaN NaN NaN 10.0
9397.80000 0.023 NaN NaN NaN
9394.85206 NaN NaN 0.053 NaN
9380.59650 NaN NaN 0.416 NaN
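The three steps can also be written as one chain. A sketch of the same approach (my condensation, assuming the MultiIndex columns set up above):

out = (df.swaplevel(axis=1)
         .stack()
         # the stacked 'exchange' level carries the '1'/'2' suffix for each row
         .assign(side=lambda d: d['side'] + '_ex' + d.index.get_level_values(1))
         .pivot_table(index='price', values='size', columns='side')
         .sort_values('price', ascending=False))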
You can try something like this. Save the sample data you showed as 'example.csv' with these columns:
price_ex1 side_ex1 size_ex1 price_ex2 side_ex2 size_ex2
import pandas as pd
import numpy as np

df = pd.read_csv('example.csv')
df1 = df[['price_ex1','side_ex1','size_ex1']]
df2 = df[['price_ex2','side_ex2','size_ex2']]
df3 = pd.concat([df1, df2])  # DataFrame.append is deprecated; concat stacks the two exchanges
# collapse the two price columns into a single one
df4 = df3[['price_ex1','price_ex2']]
arr = df4.values
df3['price_ex1'] = arr[~np.isnan(arr)].astype(float)
df3.drop(columns=['price_ex2'], inplace=True)
df3.columns = ['price', 'bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2']
# Note: `x != np.nan` is always True (NaN never compares equal), so each
# condition below effectively reduces to the string comparison on the side value.
def change(bid_ex1, ask_ex1, bid_ex2, ask_ex2, col_name):
    if col_name == 'bid_ex1_col':
        if (bid_ex1 != np.nan or bid_ex2 != np.nan) and bid_ex1 == 'bid':
            return bid_ex2
        else:
            return bid_ex1
    if col_name == 'ask_ex1_col':
        if (bid_ex1 != np.nan or bid_ex2 != np.nan) and bid_ex1 == 'ask':
            return bid_ex2
        else:
            return ask_ex1
    if col_name == 'ask_ex2_col':
        if (ask_ex1 != np.nan or ask_ex2 != np.nan) and ask_ex1 == 'ask':
            return ask_ex2
        else:
            return ask_ex1
    if col_name == 'bid_ex2_col':
        if (ask_ex1 != np.nan or ask_ex2 != np.nan) and ask_ex1 == 'bid':
            return ask_ex2
        else:
            return ask_ex1
df3['bid_ex1_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'bid_ex1_col'), axis=1)
df3['ask_ex1_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'ask_ex1_col'), axis=1)
df3['ask_ex2_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'ask_ex2_col'), axis=1)
df3['bid_ex2_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'bid_ex2_col'), axis=1)
df3.drop(columns=['bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2'], inplace=True)
df3.replace(to_replace='ask', value=np.nan,inplace=True)
df3.replace(to_replace='bid', value=np.nan,inplace=True)
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(names_to=('ex1', 'ex2', 'ex'),
               values_to=('price', 'side', 'size'),
               names_pattern=['price', 'side', 'size'])
 .loc[:, ['price', 'side', 'ex', 'size']]
 .assign(ex=lambda df: df.ex.str.split('_').str[-1])
 .pivot_wider('price', ('side', 'ex'), 'size')
 .sort_values('price', ascending=False)
)
price bid_ex1 ask_ex1 bid_ex2 ask_ex2
5 9497.81424 NaN NaN NaN 23.0
4 9487.81185 NaN NaN 556.0 NaN
3 9437.24045 NaN NaN 10.0 NaN
2 9397.80000 NaN 0.023 NaN NaN
1 9394.85206 0.053 NaN NaN NaN
0 9380.59650 0.416 NaN NaN NaN
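A pure-pandas alternative that avoids the column cleanup is pd.wide_to_long, since the columns already follow a stub + suffix pattern. A sketch (the intermediate names long_df and out are mine):

import pandas as pd

# 'price_ex1' splits as stub 'price', sep '_ex', suffix '1'
long_df = pd.wide_to_long(df.reset_index(), stubnames=['price', 'side', 'size'],
                          i='index', j='exchange', sep='_ex').reset_index()
long_df['side'] = long_df['side'] + '_ex' + long_df['exchange'].astype(str)
out = (long_df.pivot_table(index='price', columns='side', values='size')
              .sort_index(ascending=False))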

Get column value using index dict

I have this pandas df:
value
index1 index2 index3
1 1 1 10.0
2 -0.5
3 0.0
2 2 1 3.0
2 0.0
3 0.0
3 1 0.0
2 -5.0
3 6.0
I would like to get the 'value' of a specific combination of index, using a dict.
Usually, I use, for example:
df = df.iloc[df.index.isin([2],level='index1')]
df = df.iloc[df.index.isin([3],level='index2')]
df = df.iloc[df.index.isin([2],level='index3')]
value = df.values[0][0]
Now, I would like to get my value = -5 in a shorter way using this dictionary:
d = {'index1':2,'index2':3,'index3':2}
And also, if I use:
d = {'index1':2,'index2':3}
I would like to get the array:
[0.0, -5.0, 6.0]
Tips?
You can use the SQL-like method DataFrame.query():
In [69]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[69]:
value
index1 index2 index3
2.0 3.0 2 -5.0
for another dict:
In [77]: d = {'index1':2,'index2':3}
In [78]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[78]:
value
index1 index2 index3
2.0 3.0 1 0.0
2 -5.0
3 6.0
A non-query way would be
In [64]: df.loc[np.logical_and.reduce([
    ...:     df.index.get_level_values(k) == v for k, v in d.items()])]
Out[64]:
value
index1 index2 index3
2 3 2 -5.0
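Since the dict keys are index level names, you can also build a tuple indexer directly, using slice(None) for any level the dict omits. A sketch (not from the answers above):

# one entry per index level, in order; missing levels select everything
idx = tuple(d.get(name, slice(None)) for name in df.index.names)
df.loc[idx, 'value']
# d = {'index1': 2, 'index2': 3, 'index3': 2}  ->  -5.0
# d = {'index1': 2, 'index2': 3}               ->  Series [0.0, -5.0, 6.0]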

df.loc[rows, [col]] vs df.loc[rows, col] in assignment

Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print df
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so if you want to assign something to it it must be aligned by both index and columns
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
Alternatively, you can assign an array of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0
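The difference comes down to what each indexer returns, and assignment aligns a Series on index only but a DataFrame on both index and columns. A quick check (a sketch):

mask = df["response"].isnull()
print(type(df.loc[mask, "response"]))    # <class 'pandas.core.series.Series'>
print(type(df.loc[mask, ["response"]]))  # <class 'pandas.core.frame.DataFrame'>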