How to get a value of a column from another df based on index? (pandas)

I have two data frames, and I'd like to get the first data frame with data from the second data frame added to it, based on their index. The catch is that I do this iteratively, and the column index of the first df alone increases by one with each iteration, which causes an error.
An example:
First df after first iteration:
0
440 7.691
Second df after the first iteration (it doesn't change between iterations):
1
0 M
1 M
2 M
3 M
4 M
.. ..
440 B
441 M
442 M
When I run the code, I get the desired df:
df_with_label = first_df.join(self.second_df)
0 1
440 7.691 B
After the second iteration, my first df is now:
1
3 10.72
and when I run the same df_with_label = first_df.join(self.second_df), I'd like to get:
1 2
3 10.72 M
But I get the error:
ValueError: columns overlap but no suffix specified: Int64Index([1], dtype='int64')
I'm guessing the problem is that the column index of the first df is 1 after the second iteration, but I don't know how to fix it.
I'd like the column index of the first df to keep increasing.
The best solution would be to give the second df's column a different name, like:
1 class
3 10.72 M
Any idea how to fix it?

If I got it right, your second dataframe doesn't change between iterations, so why not just change its column name once and for all:
second_df.columns = ['colname']
This should resolve the naming conflict.
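For illustration, a minimal sketch of that approach, using the 'class' name from the question's desired output (the frames here are made up from the question's values):

import pandas as pd

first_df = pd.DataFrame({0: [7.691]}, index=[440])   # column name changes each iteration
second_df = pd.DataFrame({1: ['M', 'B', 'M']}, index=[439, 440, 441])

second_df.columns = ['class']   # rename once, before the loop
df_with_label = first_df.join(second_df)
print(df_with_label)
#          0 class
# 440  7.691     B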

Try:
df_with_label = first_df.join(self.second_df, rsuffix="_2")
The thing is, df_with_label and second_df both have a column named 1, so rsuffix appends "_2" to second_df's column name: "1" becomes "1_2". join operates on the indexes, and every other column is included by default, so you need to avoid the naming conflict.
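A minimal sketch reproducing the second iteration and the fix (values taken from the question):

import pandas as pd

first_df = pd.DataFrame({1: [10.72]}, index=[3])   # on the second iteration the column is named 1
second_df = pd.DataFrame({1: ['M', 'M', 'M', 'M']}, index=[0, 1, 2, 3])

# first_df.join(second_df) would raise the ValueError, since both frames own a column named 1
df_with_label = first_df.join(second_df, rsuffix="_2")
print(df_with_label)
#        1 1_2
# 3  10.72   M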
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html

Related

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but it returns only one column name when a row's minimum value appears in more than one column. How can I change it to return all column names that hold the minimum value?
Please note, I want row-wise results, i.e. the minimum column names for each row.
Thanks!
Try the code below and see if it's in the output format you anticipated; it produces the intended result, at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
The question "Get column name where value is something in pandas dataframe" might also be helpful.
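For instance, on a small made-up frame the loop yields one list of column names per row:

import pandas as pd

df = pd.DataFrame({'x': [1, 0, 5], 'y': [1, 2, 5], 'z': [3, 2, 0]})

mins = df.idxmin(axis="columns")          # one column name per row
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)   # all columns that tie for the row minimum
print(mins)
# 0    ['x', 'y']
# 1         ['x']
# 2         ['z']
# dtype: object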
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying NumPy array to get the overall minimum of the whole frame, then compare the values to that minimum and get the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
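The above looks at the whole frame; if row-wise results are needed, as the question asks, a vectorized variant along the same lines could be (a sketch reusing the input above):

import pandas as pd

df = pd.DataFrame({'A': [5, 0, 6, 5, 4],
                   'B': [8, 0, 9, 2, 7],
                   'C': [9, 1, 2, 4, 7],
                   'D': [5, 7, 4, 2, 9]})

row_min = df.min(axis=1)              # per-row minimum
mask = df.eq(row_min, axis=0)         # True where a cell equals its row's minimum
mins = mask.apply(lambda r: list(df.columns[r]), axis=1)
print(mins)
# 0    ['A', 'D']
# 1    ['A', 'B']
# 2         ['C']
# 3    ['B', 'D']
# 4         ['A']
# dtype: object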

Why doesn't df.append() work in my for loop? [duplicate]

I asked this question about adding rows with an index, but it is still not clear to me how/why this happens when no index is given:
columnsList = ['A', 'B', 'C', 'D']
df8 = pd.DataFrame(columns=columnsList)
L = ['value aa', 'value bb', 'value cc', 'value dd']
s = pd.Series(dict(zip(df8.columns, L)))
df8.append(s, ignore_index=True)
df8.append(s, ignore_index=True)
I expect a 2x4 DataFrame here. Nevertheless, no values were added, and no error occurred.
print(df8.shape)
#>>> (0,4)
Why is the series not being added, and why is no error raised?
If I try to add a row with loc, an index label is added:
df8.loc[df8.index.max() + 1, :] = [4, 5, 6, 7]
print(df8)
result:
A B C D
NaN 4 5 6 7
I guess neither loc nor iloc can be used to append rows without an index label (loc adds the index label NaN here because df8.index.max() on an empty index is NaN, and iloc cannot be used when the position is beyond the current number of rows).
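For reference, one way (a sketch) to grow the frame with loc and integer labels is to use the current length as the next label:

import pandas as pd

df8 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df8.loc[len(df8)] = [4, 5, 6, 7]   # label 0, since the frame is empty
df8.loc[len(df8)] = [1, 2, 3, 4]   # label 1
print(df8)
#    A  B  C  D
# 0  4  5  6  7
# 1  1  2  3  4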
DataFrame.append is not an in-place operation. From the docs,
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Append rows of other to the end of this frame, returning a new object.
Columns not in this frame are added as new columns.
You need to assign the result back.
df8 = df8.append([s] * 2, ignore_index=True)
df8
A B C D
0 value aa value bb value cc value dd
1 value aa value bb value cc value dd
The statement data.append(sub_data) does not work on its own, but data = data.append(sub_data) will work.
Assigning it back solved the issue for me. Good tip not available elsewhere.
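Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; a sketch of the equivalent with pd.concat, which likewise returns a new object, so the result still has to be assigned back:

import pandas as pd

columnsList = ['A', 'B', 'C', 'D']
L = ['value aa', 'value bb', 'value cc', 'value dd']
row = pd.Series(dict(zip(columnsList, L))).to_frame().T   # one-row DataFrame

df8 = pd.concat([row, row], ignore_index=True)
print(df8.shape)
# (2, 4)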

Remove all duplicate data, only show unique rows

I have a data set:
import pandas as pd
data = pd.read_csv('email_list.csv')
new_data = data[['Email Address','First Name','Last Name']]
Email Address First Name Last Name
0 zoe@gmail.com Zoé Z
1 yvonne@yahoo.com Yvonne T
2 Whitney@gmail.com Whitney W
3 zoe@gmail.com Zoe Z
4 yvonne@yahoo.com Yvonne T
I want the output to show only the unique emails and names. So from the short list above, the output should be:
Email Address First Name Last Name
2 Whitney@gmail.com Whitney W
How can I do this? The simplest way will be best.
This is what you are searching for:
df.drop_duplicates(keep=False)
drop_duplicates removes duplicate rows from your dataframe. The keep argument lets you tune what to keep and what to drop: with keep=False, all rows that have a duplicate are dropped, including the first occurrence of each.
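Note that in the sample data the two zoe rows differ in First Name (Zoé vs Zoe), so deduplicating on every column would keep them both; a sketch restricted to the email column, assuming the email address is the intended key:

import pandas as pd

new_data = pd.DataFrame({
    'Email Address': ['zoe@gmail.com', 'yvonne@yahoo.com', 'Whitney@gmail.com',
                      'zoe@gmail.com', 'yvonne@yahoo.com'],
    'First Name': ['Zoé', 'Yvonne', 'Whitney', 'Zoe', 'Yvonne'],
    'Last Name': ['Z', 'T', 'W', 'Z', 'T'],
})

unique_only = new_data.drop_duplicates(subset=['Email Address'], keep=False)
print(unique_only)
#        Email Address First Name Last Name
# 2  Whitney@gmail.com    Whitney         W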

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates to keep the last observed row in a dataframe. Is there any way to use it to take only the second-last observed row, or any other way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
to
df1 = pd.DataFrame(data={'A': [1, 2], 'B': [2, 5]})
The idea is to group the data by the duplicated column, then check the length of each group. If the length is at least 2, you can slice out the second element of the group; if the group has length 1, the value is not duplicated, so take index 0, the only element in the group.
df.groupby(df['A']).apply(lambda x: x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer, I think, was on the right track but possibly not quite right. I have extended your data to include 'A' groups with two observations and an 'A' group with one observation, for the sake of completeness.
import pandas as pd

df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2, 3, 3, 4], 'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return

df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For your reference, apply automatically passes each group (as a data frame) to the function as its first argument.
Also, since you are always reducing each group to a single observation, you could use the agg (aggregate) method instead. apply is more flexible in the shape of what can be returned, whereas agg must reduce each group to a single value.
df.groupby('A').agg(user_apply_func)
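As an alternative, GroupBy.nth(-2) picks the second-to-last row of each group directly, which matches user_apply_func for groups of two or more rows, though it silently omits groups with fewer than two rows (so group 4 from the extended example disappears):

second_last = df.groupby('A').nth(-2)
print(second_last)
# In recent pandas versions nth acts as a filter and keeps the original index:
#    A  B
# 1  1  2
# 4  2  5
# 6  3  7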