How to create a new pandas column by a vlookup-like procedure on another dataframe

I have a dataframe that will be used to map values using two categorical variables. Maybe converting it to a dictionary would be better.
The second dataframe is very large (its screenshot is not reproduced here). I want to use the values of the two categorical variables to create a new attribute (column) based on the first dataframe.
For example...
A row with FICO_cat of (700,720] and OrigLTV_cat of (75,80] would receive a value of 5.
A row with FICO_cat of (700,720] and OrigLTV_cat of (85,90] would receive a value of 6.
Is there an efficient way to do this?

If your column labels are the FICO_cat values, and your Index is OrigLTV_cat, this should work:
Given a dataframe df:
         780+  (740,780)  (720,740)
(60,70)     3          3          3
(70,75)     4          5          4
(75,80)     3          1          2
Do:
# stack the lookup grid into long form: one row per (FICO, LTV) pair
df = df.unstack().reset_index()
df.rename(columns={'level_0': 'FICOCat', 'level_1': 'OrigLTV', 0: 'value'}, inplace=True)
Output:
     FICOCat  OrigLTV  value
0       780+  (60,70)      3
1       780+  (70,75)      4
2       780+  (75,80)      3
3  (740,780)  (60,70)      3
4  (740,780)  (70,75)      5
5  (740,780)  (75,80)      1
6  (720,740)  (60,70)      3
7  (720,740)  (70,75)      4
8  (720,740)  (75,80)      2
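To actually map these values onto the large dataframe, you can then merge on the two categorical columns. A minimal sketch, assuming the big frame is called big, its interval columns are named FICO_cat and OrigLTV_cat as in the question, and their labels match the FICOCat/OrigLTV strings above:
# lookup is the long-form frame built above, with columns FICOCat, OrigLTV, value
big = big.merge(lookup,
                left_on=['FICO_cat', 'OrigLTV_cat'],
                right_on=['FICOCat', 'OrigLTV'],
                how='left').drop(columns=['FICOCat', 'OrigLTV'])
The left join keeps every row of the big frame and fills value with NaN wherever a (FICO, LTV) pair has no entry in the lookup.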

Related

Changing date order

I have a csv file containing a set of dates.
The format looks like:
14/06/2000
15/08/2002
10/10/2009
09/09/2001
01/03/2003
11/12/2000
25/11/2002
23/09/2001
For some reason pandas.to_datetime() does not work on my data.
So I have split the column into three columns: day, month, and year.
Now I am trying to combine the columns without the "/", using:
df["period"] = df["y"].astype(str) + df["m"].astype(str)
But the problem is instead of getting:
200006
I get:
20006
One zero is missing.
Could you please help me with that?
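As to why the zero disappears: astype(str) renders month 6 as '6', not '06'. A quick sketch of a direct fix with str.zfill, assuming the split columns are named y and m as in the question:
# pad the month to two digits so 2000 + 6 becomes '200006', not '20006'
df['period'] = df['y'].astype(str) + df['m'].astype(str).str.zfill(2)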
This will let you take the column of dates and convert it with pd.to_datetime():
# this assumes the column name is 0, as it was in my dataframe;
# change it to whatever the column is called in your dataframe
df[0] = pd.to_datetime(df[0], infer_datetime_format=True)
# sort newest-first; ignore_index=True resets the index so the sorted order sticks
df[0] = df[0].sort_values(ascending=False, ignore_index=True)
df
The dayfirst= parameter might help you:
print(df)
            0
0  14/06/2000
1  15/08/2002
2  10/10/2009
3  09/09/2001
4  01/03/2003
5  11/12/2000
6  25/11/2002
7  23/09/2001
pd.to_datetime(df[0], dayfirst=True).sort_values()
0 2000-06-14
5 2000-12-11
3 2001-09-09
7 2001-09-23
1 2002-08-15
6 2002-11-25
4 2003-03-01
2 2009-10-10
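If the end goal is still a yyyymm period column, it can also be derived directly from the parsed dates instead of by string concatenation (a sketch, assuming the date column is named 0 as above):
dates = pd.to_datetime(df[0], dayfirst=True)
df['period'] = dates.dt.strftime('%Y%m')   # 14/06/2000 -> '200006'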

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES   VALUE
john_1      1
john_2      2
john_3      3
bro_1       4
bro_2       5
bro_3       6
guy_1       7
guy_2       8
guy_3       9
And I would like to go to:
NAMES  VALUE1  VALUE2  VALUE3
john        1       2       3
bro         4       5       6
guy         7       8       9
I have tried with pandas: I first split the index (NAMES) and can create the new columns, but I have trouble getting the values into the right column.
Can someone at least point me in the direction of a solution? I don't expect full code (I know that is not appreciated), but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
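For reference, here is a self-contained sketch of the whole transformation, using str.split with expand=True as an equivalent alternative to the two .str.get calls (the sample data is the table from the question):
import pandas as pd

df = pd.DataFrame({'NAMES': ['john_1', 'john_2', 'john_3',
                             'bro_1', 'bro_2', 'bro_3',
                             'guy_1', 'guy_2', 'guy_3'],
                   'VALUE': range(1, 10)})

# split once into two columns instead of calling .str.get twice
df[['NAMES', 'NAME_NBR']] = df['NAMES'].str.split('_', expand=True)
out = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
out.columns = ['VALUE{}'.format(c) for c in out.columns]
out = out.reset_index()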

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see if the output format is what you anticipated; it produces the intended result, at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    # collect every column whose value equals this row's minimum
    mins[i] = list(r[r == r[mins[i]]].index)
The answers to "Get column name where value is something in pandas dataframe" might also be helpful.
Assuming this input as df:
   A  B  C  D
0  5  8  9  5
1  0  0  1  7
2  6  9  2  4
3  5  2  4  2
4  4  7  7  9
You can use the underlying numpy array to get the overall minimum, then compare all values to it and keep the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
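If you need the row-wise version (columns equal to each row's own minimum) without iterrows, the same compare-and-mask idea can be vectorized per row. A sketch, assuming df holds only the columns to compare; min_cols is just an illustrative name:
# True wherever a cell equals its own row's minimum
mask = df.eq(df.min(axis=1), axis=0)
df['min_cols'] = mask.apply(lambda row: list(row[row].index), axis=1)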

How to (idiomatically) use pandas .loc to return an empty dataframe when key is not in index [duplicate]

Say I have a DataFrame (with a multi-index, for that matter), and I wish to take the values at some index - but, if that index does not exist, I wish for it to return an empty df instead of a KeyError.
I've searched for similar questions, but they are all about pandas returning an empty dataframe when it is not desired in some cases (conversely, I do want an empty dataframe returned).
For example:
import pandas as pd
df = pd.DataFrame(index=pd.MultiIndex.from_tuples([(1, 1), (1, 2), (3, 1)]),
                  columns=['a', 'b'], data=[[1, 2], [3, 4], [10, 20]])
so, df is:
      a   b
1 1   1   2
  2   3   4
3 1  10  20
and df.loc[1] is:
   a  b
1  1  2
2  3  4
df.loc[2] raises a KeyError, and I'd like something that returns an empty frame instead:
   a  b
The closest I could get is calling df.loc[idx:idx] as a slice, which gives the correct result for idx=2, but for idx=1 it returns
     a  b
1 1  1  2
  2  3  4
instead of the desired result.
Of course I can define a function to do it, but is there a more idiomatic way?
One idea with an if-else statement:
def get_val(x):
    return df.loc[x] if x in df.index.levels[0] else pd.DataFrame(columns=df.columns)
Or, more generally, with a try-except statement:
def get_val(x):
    try:
        return df.loc[x]
    except KeyError:
        return pd.DataFrame(columns=df.columns)
print(get_val(1))
   a  b
1  1  2
2  3  4
print(get_val(2))
Empty DataFrame
Columns: [a, b]
Index: []
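A third option that avoids both the helper function and the KeyError is boolean selection on the first index level. A sketch; note it keeps level 0 in the result, unlike df.loc[1], which drops it:
df[df.index.get_level_values(0) == 2]   # empty frame, no KeyError
df[df.index.get_level_values(0) == 1]   # the two level-0 == 1 rows, level kept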

grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation my data is all in one column, with the group labels in the index, so I can't call boxplot on the result.
Here is an example:
from numpy.random import rand
from pandas import DataFrame
df = DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})
df
          a  b
0  0.273548  0
1  0.378765  1
2  0.190848  0
3  0.646606  1
4  0.562591  0
5  0.409250  1
6  0.637074  0
7  0.946864  1
8  0.203656  0
9  0.276929  1
Now I want to group by column b and boxplot the distribution of both groups in one boxplot. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')
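For completeness, a minimal end-to-end sketch (matplotlib is needed to render the figure):
import matplotlib.pyplot as plt

# one box of column 'a' per distinct value of column 'b'
df.boxplot(column='a', by='b')
plt.show()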