I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
try the code below and see if it's in the output format you'd anticipated. it produces the intended result at least.
result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
EDIT: adding an image of the output and the full code context.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
Related
I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried with pandas, so I first split the index (NAMES) and I can create the new columns but I have trouble indexing the values to the right column.
Can someone at least give me a direction where the solution to this problem is? I don't expect a full code (I know that this is not appreciated) but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe looks like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the .to_numeric pandas function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number to a NaN ( Not a Number ), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows now just use .dropna
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
I have 2 data frames, and i'd like to get the first data frame that contains data from the second data frame, based on the their index. The catch is that I do it iteratively and the columns index numbers of only the first df increase by one with each iteration, so it causes error.
example to that would be:
First df after first iteration:
0
440 7.691
Second df after first iteration (doesn't change after each iteration):
1
0 M
1 M
2 M
3 M
4 M
.. ..
440 B
441 M
442 M
when i ran the code, I get the wanted df:
df_with_label = first_df.join(self.second_df)
0 1
440 7.691 B
After second iteration, my first df in now:
1
3 10.72
and when i run the same df_with_label = first_df.join(self.second_df) i'd like to get:
1 2
3 10.72 M
But I get the error:
ValueError: columns overlap but no suffix specified: Int64Index([1], dtype='int64')
I'm guessing it has a problem with the fact that the index of the column of the first df is 1 after the second iteration, but don't know how to fix it.
i'd like to keep the index of the first column to keep increasing.
The best solution would be to give the second column different name, so like:
1 class
3 10.72 M
Any idea how to fix it?
If I got it right your second dataframe doesn't change with iterations so why don't you just change its column name once and for all:
second_df.columns=['colname']
this should solve your naming conflicts.
Try:
df_with_label = first_df.join(self.second_df, rsuffix = "_2")
The thing is - df_with_label and second_df both have column 1, so the rsuffix will add "_2" to the second_df column name "1" := "1_2". You join on indexes, so every other column is shown on default - so you need to avoid naming conflicts.
REF
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
I have a dataframe that looks like this. It will be used to map values using two categorical variables. Maybe converting this to a dictionary would be better.
The 2nd data-frame is very large with a screenshot shown below. I want to take the values from the categorical variables to create a new attribute (column) based on the 1st data-frame.
For example...
A row with FICO_cat of (700,720] and OrigLTV_cat of (75,80] would receive a value of 5.
A row with FICO_cat of (700,720] and OrigLTV_cat of (85,90] would receive a value of 6.
Is there an efficient way to do this?
If your column labels are the FICO_cat values, and your Index is OrigLTV_cat, this should work:
Given a dataframe df:
780+ (740,780) (720,740)
(60,70) 3 3 3
(70,75) 4 5 4
(75,80) 3 1 2
Do:
df = df.unstack().reset_index()
df.rename(columns = {'level_0' : 'FICOCat', 'level_1' : 'OrigLTV', 0 : 'value'}, inplace = True)
Output:
FICOCat OrigLTV value
0 780+ (60,70) 3
1 780+ (70,75) 4
2 780+ (75,80) 3
3 (740,780) (60,70) 3
4 (740,780) (70,75) 5
5 (740,780) (75,80) 1
6 (720,740) (60,70) 3
7 (720,740) (70,75) 4
8 (720,740) (75,80) 2
I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way that I can use it to take only second last observed. Or any other way of doing it.
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]}) to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is that you'll group the data by the duplicate column , then check the length of group , if the length of group is greater than or equal 2 this mean that you can slice the second element of group , if the group has a length of one which mean that this value is not duplicated , then take index 0 which is the only element in the grouped data
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
if len(x) == 2:
return x.iloc[0]
if len(x) > 2:
return x.iloc[-2]
return
df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For your reference the apply method automatically passes the data frame as the first argument.
Also, as you are always going to be reducing each group of data to a single observation you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)