Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
   A     B
   u  v  x  y
0  1  1  0  1
1  1  0  0  0
2  0  0  1  1
3  1  0  1  0
I want to select the rows in which any of the A columns, or respectively any of the B columns, is true.
I can do so explicitly:
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
   A     B
   u  v  x  y
0  1  1  0  1
1  1  0  0  0
3  1  0  1  0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
   A     B
   u  v  x  y
0  1  1  0  1
2  0  0  1  1
3  1  0  1  0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
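A minimal sketch of one possible select (my own, not from the original thread; it assumes the sub-columns hold 0/1 or boolean values and reads the condition as "any sub-column is true", matching the explicit expressions above):

def select(f, top_level_name):
    # Slice the sub-frame under the given top-level name, cast to bool,
    # and keep the rows where any of its sub-columns is true.
    mask = f[top_level_name].astype(bool).any(axis=1)
    return f[mask]

With this, select(f, "A") and select(f, "B") reproduce the two explicit selections, for any number of sub-columns with arbitrary names.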


How to create a new column based on row values in python?

I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like below:
basically, if 25%_coll is 1, then it should create a new column based on month.
Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        # put collection_amount in row i of the new column, 0 elsewhere
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
index="month", columns="new_cols", values="collection_amount",
fill_value=0, aggfunc="sum"
).reset_index(drop=True)
1. .melt() the dataframe with id columns month and collection_amount.
2. Set the appropriate collection_amount values to 0 (where value is 0).
3. Build the new column names in column new_cols. The intermediate frame then looks like:
   month  collection_amount  variable  value          new_cols
0      4                100  25%_coll      1  month_4_25%_coll
1      5                  0  25%_coll      0  month_5_25%_coll
2      6                300  25%_coll      1  month_6_25%_coll
3      4                  0  75%_coll      0  month_4_75%_coll
4      5                200  75%_coll      1  month_5_75%_coll
5      6                300  75%_coll      1  month_6_75%_coll
4. Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: either use df = pd.concat([df, df2], axis=1), as sketched below, or df.merge(df2, ...) to merge on month (in that case call .reset_index() without drop=True).
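A short sketch of the concat variant (assuming, as in this example, that the rows of df2 after reset_index(drop=True) line up positionally with the rows of df):

# glue the pivoted columns onto the original frame by position
out = pd.concat([df, df2], axis=1)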
Result for the sample dataframe
df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})
is
new_cols  month_4_25%_coll  month_4_75%_coll  month_5_25%_coll  \
0                      100                 0                 0
1                        0                 0                 0
2                        0                 0                 0

new_cols  month_5_75%_coll  month_6_25%_coll  month_6_75%_coll
0                        0                 0                 0
1                      200                 0                 0
2                        0               300               300

Find a list as a sublist within each group (groupby into list)

I have a dataframe which has 2 columns:
    a  b
0   1  2
1   1  1
2   1  1
3   1  2
4   1  1
5   2  0
6   2  1
7   2  1
8   2  2
9   2  2
10  2  1
11  2  1
12  2  2
Is there a direct way to make a third column like the one below,
    a  b  c
0   1  2  0
1   1  1  1
2   1  1  0
3   1  2  1
4   1  1  0
5   2  0  0
6   2  1  1
7   2  1  0
8   2  2  1
9   2  2  0
10  2  1  0
11  2  1  0
12  2  2  0
where the target [1, 2] is matched as a sublist (with gaps allowed) of each group's list in df.groupby('a').b.apply(list), and c marks the first rows that fit the target in every group.
df.groupby('a').b.apply(list) gives
1 [2, 1, 1, 2, 1]
2 [0, 1, 1, 2, 2, 1, 1, 2]
[1,2] is a sublist of [2, 1, 1, 2, 1] and [0, 1, 1, 2, 2, 1, 1, 2]
So far, I have a function:
def is_sub_with_gap(sub, lst):
    '''
    check if sub is a sublist of lst
    '''
    ln, j = len(sub), 0
    ans = []
    for i, ele in enumerate(lst):
        if ele == sub[j]:
            j += 1
            ans.append(i)
            if j == ln:
                return True, ans
    return False, []
Testing the function:
In [55]: is_sub_with_gap([1,2], [2, 1, 1, 2, 1])
Out[55]: (True, [1, 3])
You can change the output by returning the selected index values of each group from the custom function, flatten them with Series.explode, and then test the index values with Index.isin:
L = [1, 2]

def is_sub_with_gap(sub, lst):
    '''
    check if sub is a sublist of lst
    '''
    ln, j = len(sub), 0
    ans = []
    for i, ele in enumerate(lst):
        if ele == sub[j]:
            j += 1
            ans.append(i)
            if j == ln:
                # return the original index labels of the matched rows
                return lst.index[ans]
    return []
idx = df.groupby('a').b.apply(lambda x: is_sub_with_gap(L, x)).explode()
df['c'] = df.index.isin(idx).view('i1')  # boolean mask cast to int8 (0/1)
print (df)
    a  b  c
0   1  2  0
1   1  1  1
2   1  1  0
3   1  2  1
4   1  1  0
5   2  0  0
6   2  1  1
7   2  1  0
8   2  2  1
9   2  2  0
10  2  1  0
11  2  1  0
12  2  2  0

Iterate over two columns of a dataframe

I am trying to iterate over two columns of a dataframe ("binS99", "bin3HMax"). Those columns have values from 0 to 4. I would then like to create a new column ("Probability") in the same dataframe ("df_selection"), taking the values from the matrix "prob". The following code goes into a loop; any ideas on how to solve this? Thank you.
prob = [[0, 0.00103, 0.00103],
        [0, 0.00267, 0.00311],
        [0, 0.00688, 0.01000],
        [0, 0.01777, 0.03218]]

for index, row in df_selection.iterrows():
    a = int(df_selection.loc[index, "binS99"])    # int(str(row["binS99"]))
    b = int(df_selection.loc[index, "bin3HMax"])  # int(str(row["bin3HMax"]))
    df_selection.loc[index, "Probability"] = prob[a][b]
I believe you first need to check that the maximal values in the columns match the dimensions of the nested list, and then use numpy indexing (a sketch of that check follows at the end of this answer):
df_selection = pd.DataFrame({
    'A': list('abcdef'),
    'binS99': [0, 1, 2, 0, 2, 1],
    'bin3HMax': [1, 2, 1, 0, 1, 0],
})
print (df_selection)
   A  binS99  bin3HMax
0  a       0         1
1  b       1         2
2  c       2         1
3  d       0         0
4  e       2         1
5  f       1         0
import numpy as np

prob = [[0, 0.00103, 0.00103],
        [0, 0.00267, 0.00311],
        [0, 0.00688, 0.01000],
        [0, 0.01777, 0.03218]]
arr_prob = np.array(prob)
print (arr_prob)
[[0.      0.00103 0.00103]
 [0.      0.00267 0.00311]
 [0.      0.00688 0.01   ]
 [0.      0.01777 0.03218]]
a = df_selection['binS99'].to_numpy()
b = df_selection['bin3HMax'].to_numpy()
df_selection['Probability'] = arr_prob[a, b]  # vectorized lookup via fancy indexing
print (df_selection)
   A  binS99  bin3HMax  Probability
0  a       0         1      0.00103
1  b       1         2      0.00311
2  c       2         1      0.00688
3  d       0         0      0.00000
4  e       2         1      0.00688
5  f       1         0      0.00000
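The bounds check mentioned at the start of this answer is only described, not shown; a minimal sketch of it (my assumption of what is meant, using the same df_selection and arr_prob as above):

# every bin value must be a valid row/column position in arr_prob
assert df_selection['binS99'].max() < arr_prob.shape[0]
assert df_selection['bin3HMax'].max() < arr_prob.shape[1]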

Python Pandas Dataframe cell value split

I am lost on how to split the binary values such that each (0, 1) value takes up a column of the data frame.
(screenshot of the dataframe, from Jupyter, omitted)
You can use concat with apply and the list constructor:
df = pd.DataFrame({0:[1,2,3], 1:['1010','1100','0101']})
print (df)
   0     1
0  1  1010
1  2  1100
2  3  0101
df = pd.concat([df[0],
                df[1].apply(lambda x: pd.Series(list(x))).astype(int)],
               axis=1, ignore_index=True)
print (df)
   0  1  2  3  4
0  1  1  0  1  0
1  2  1  1  0  0
2  3  0  1  0  1
Another solution with DataFrame constructor:
df = pd.concat([df[0],
                pd.DataFrame(df[1].apply(list).values.tolist()).astype(int)],
               axis=1, ignore_index=True)
print (df)
   0  1  2  3  4
0  1  1  0  1  0
1  2  1  1  0  0
2  3  0  1  0  1
EDIT:
df = pd.DataFrame({0:['1010','1100','0101']})
df1 = pd.DataFrame(df[0].apply(list).values.tolist()).astype(int)
print (df1)
   0  1  2  3
0  1  0  1  0
1  1  1  0  0
2  0  1  0  1
But if you need lists:
df[0] = df[0].apply(lambda x: [int(y) for y in list(x)])
print (df)
              0
0  [1, 0, 1, 0]
1  [1, 1, 0, 0]
2  [0, 1, 0, 1]

Pandas dataframe operations

I have the following dataframe,
df = pd.DataFrame({
    'CARD_NO': [0, 1, 2, 2, 1, 111],  # leading-zero int literals like 001 are a SyntaxError in Python 3
    'request_code': [2400, 2200, 2400, 3300, 5500, 6600],
    'merch_id': [1, 2, 1, 3, 3, 5],
    'resp_code': [0, 1, 0, 1, 1, 1]})
Based on this requirement,
inquiries = df[(df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)]
I need to flag the records in df whose CARD_NO matches the CARD_NO of a row where inquiries is True.
If inquiries returns:
index  CARD_NO  merch_id  request_code  resp_code
    0        0         1          2400          0
    2        2         1          2400          0
Then df should look like so:
index  CARD_NO  merch_id  request_code  resp_code  flag
    0        0         1          2400          0     N
    1        1         2          2200          1     N
    2        2         1          2400          0     N
    3        2         3          3300          1     Y
    4        1         3          5500          1     N
    5      111         5          6600          1     N
I've tried several merges, but cannot seem to get the result I want.
Any help would be greatly appreciated.
Thank you.
The following should work if I understand your question correctly, which is that you want to set the flag to 'Y' only when the CARD_NO is in the filtered group but the row itself is not in the filtered group.
import numpy as np

# note: the original used df.ix, which has been removed from pandas; use df.loc
# ('mask' also avoids shadowing the built-in name 'filter')
mask = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
df['flag'] = np.where(~mask & df.CARD_NO.isin(df.loc[mask, 'CARD_NO']), 'Y', 'N')
Alternatively, with map (note that this flags the filtered rows themselves rather than the other rows sharing their CARD_NO, so it does not match the desired output above):

filtered = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
df["flag"] = filtered.map(lambda x: "Y" if x else "N")