Pandas - Dynamically passing variable to a column name

I am trying to have one of the column names in the DataFrame take the value of a variable.
For example:
I have a column called: Count of sales done in the past 10 days
I am trying to have the value 10 in the column name change dynamically based on another variable called date.
So each time I update the DataFrame, I want the value assigned to the variable date to be shown in the column name.
For example, if date = 4, the column name should be Count of sales done in the past 4 days, and so on.
I tried df.columns = date but got a KeyError.

One idea, using the position of the column, is rename:
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'Count of sales done in the past 10 days': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})
date = 4
pos = 1
df = df.rename(columns={df.columns[pos]: f'Count of sales done in the past {date} days'})
print(df)
   A  Count of sales done in the past 4 days  C
0  a                                        4  7
1  b                                        5  8
2  c                                        4  9
3  d                                        5  4
4  e                                        5  2
5  f                                        4  3
Another idea is to select the column with a mask and set the new value via Index.where; but if the mask matches more than one column name, all of them are rewritten:
m = df.columns.str.startswith('Count of sales done in the past')
df.columns = df.columns.where(~m, f'Count of sales done in the past {date} days')
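If only the number inside the name should change, a regex replace on the column index works too. A sketch, not from the answers above; the pattern assumes the name always ends with 'past <n> days':
# Sketch: swap only the number, leaving the other columns untouched
df.columns = df.columns.str.replace(r'past \d+ days', f'past {date} days', regex=True)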

Related

Changing date order

I have csv file containing a set of dates.
The format is like:
14/06/2000
15/08/2002
10/10/2009
09/09/2001
01/03/2003
11/12/2000
25/11/2002
23/09/2001
For some reason pandas.to_datetime() does not work on my data.
So I have split the column into three columns: day, month and year.
Now I am trying to combine the columns without the "/" like this:
df["period"] = df["y"].astype(str) + df["m"].astype(str)
But the problem is instead of getting:
200006
I get:
20006
One zero is missing.
Could you please help me with that?
This will let you convert the column of dates with pd.to_datetime() and sort it:
import pandas as pd

# This is assuming the column name is 0, as it was on my df;
# you can change that to whatever the column name is in your dataframe.
# Note: for day-first dates like these, dayfirst=True (see the next answer) is safer.
df[0] = pd.to_datetime(df[0], infer_datetime_format=True)
df = df.sort_values(0, ascending=False, ignore_index=True)
df
The dayfirst= parameter might help you:
print(df)
0
0 14/06/2000
1 15/08/2002
2 10/10/2009
3 09/09/2001
4 01/03/2003
5 11/12/2000
6 25/11/2002
7 23/09/2001
pd.to_datetime(df[0], dayfirst=True).sort_values()
0 2000-06-14
5 2000-12-11
3 2001-09-09
7 2001-09-23
1 2002-08-15
6 2002-11-25
4 2003-03-01
2 2009-10-10
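As for the missing zero in the concatenation itself: the month string needs zero-padding before being joined. A minimal sketch using str.zfill (not part of the original answers):
# Pad the month to two digits so 2000 + '06' gives '200006', not '20006'
df["period"] = df["y"].astype(str) + df["m"].astype(str).str.zfill(2)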

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see if it is in the output format you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
Assuming this input as df:
   A  B  C  D
0  5  8  9  5
1  0  0  1  7
2  6  9  2  4
3  5  2  4  2
4  4  7  7  9
You can use the underlying numpy array to get the overall minimum value, then compare the values to that minimum and keep the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
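Since the question asks for row-wise results, here is a hedged row-wise variant of the same comparison idea (the min_cols column name is made up for illustration):
# Compare each cell to its row minimum, then collect matching column names per row
m = df.eq(df.min(axis=1), axis=0)
df['min_cols'] = m.apply(lambda r: list(r.index[r]), axis=1)
# e.g. row 1 above gives ['A', 'B']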

Excluding specific columns in Pandas for column-based computations

Year A B C D
1900 1 2 3 4
1901 2 3 4 5
I have a dataset which aligns with the above format.
When I want to perform calculations on the column values, the year gets added into the calculation and distorts the result. For example:
df['mean'] = df.mean(axis='columns')
In the above example I just want to exclude Year from the calculation. I have 100-plus columns in my dataframe and cannot list each one manually. 'Year' is also meant to be the index of my dataframe.
I realized the problem and the solution.
df.set_index(['Year'])
df['mean'] = df.mean(axis='columns')
This did not work, because set_index returns a new DataFrame by default.
But when I added inplace=True, it worked:
df.set_index(['Year'], inplace=True)
df['mean'] = df.mean(axis='columns')
You can also drop the Year column into a new dataframe, apply the mean over the remaining columns, and then add the Year column back:
df2 = df.drop(columns='Year')                 # drop the Year column
df2['Mean'] = df2.mean(axis='columns')        # mean of the remaining columns
df2 = pd.concat([df['Year'], df2], axis=1)    # add Year back
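A one-line alternative (a sketch, not from the answers above) that leaves df itself untouched:
df['mean'] = df.drop(columns='Year').mean(axis='columns')  # exclude Year from the row means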

How to create new pandas column by vlookup-like procedure on another data-frame

I have a dataframe that will be used to map values using two categorical variables; it is reproduced in the answer below. Maybe converting this to a dictionary would be better.
The 2nd dataframe is very large (shown only as a screenshot in the original post, not reproduced here). I want to take the values from the categorical variables to create a new attribute (column) based on the 1st dataframe.
For example...
A row with FICO_cat of (700,720] and OrigLTV_cat of (75,80] would receive a value of 5.
A row with FICO_cat of (700,720] and OrigLTV_cat of (85,90] would receive a value of 6.
Is there an efficient way to do this?
If your column labels are the FICO_cat values, and your Index is OrigLTV_cat, this should work:
Given a dataframe df:
         780+  (740,780)  (720,740)
(60,70)     3          3          3
(70,75)     4          5          4
(75,80)     3          1          2
Do:
df = df.unstack().reset_index()
df.rename(columns={'level_0': 'FICOCat', 'level_1': 'OrigLTV', 0: 'value'}, inplace=True)
Output:
FICOCat OrigLTV value
0 780+ (60,70) 3
1 780+ (70,75) 4
2 780+ (75,80) 3
3 (740,780) (60,70) 3
4 (740,780) (70,75) 5
5 (740,780) (75,80) 1
6 (720,740) (60,70) 3
7 (720,740) (70,75) 4
8 (720,740) (75,80) 2
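To attach the value to the large dataframe, a merge can then act like a VLOOKUP. A sketch, assuming the large frame is called big and its FICO_cat and OrigLTV_cat labels match the lookup table exactly:
big = big.merge(df, left_on=['FICO_cat', 'OrigLTV_cat'],
                right_on=['FICOCat', 'OrigLTV'], how='left')
# big now carries a 'value' column filled from the lookup table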

grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation, my data is all in one column with the group labels in the index, so I can't call boxplot on the result.
Here is an example:
from numpy.random import rand
from pandas import DataFrame

df = DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})
df
a b
0 0.273548 0
1 0.378765 1
2 0.190848 0
3 0.646606 1
4 0.562591 0
5 0.409250 1
6 0.637074 0
7 0.946864 1
8 0.203656 0
9 0.276929 1
Now I want to group by column b and boxplot the distribution of both groups in one boxplot. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')
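For completeness, a minimal runnable version (assuming matplotlib is installed; plt.show() is only needed outside a notebook):
import matplotlib.pyplot as plt
from numpy.random import rand
from pandas import DataFrame

df = DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})
df.boxplot(column='a', by='b')  # one box of 'a' per value of 'b'
plt.show()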