Re-define dataframe index with map function - pandas

I have a dataframe like this. I want to know how I can apply a map function to its index and rename the labels into an easier format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
            d
apple_017   1
orange_054  2
orange_061  3
orange_053  4
There are only two kinds of labels in the index of the dataframe, so each one is either apple or orange in this case. This is what I tried:
data.index = data.index.map(i = "apple" if "apple" in i else "orange")
(Apparently it's not how it works)
Desired output:
        d
apple   1
orange  2
orange  3
orange  4
Appreciate anyone's help and suggestion!

Try via split():
df.index = df.index.str.split('_').str[0]
or via map():
df.index = df.index.map(lambda x: 'apple' if 'apple' in x else 'orange')
Output of df:
        d
apple   1
orange  2
orange  3
orange  4
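
For anyone who wants to run both variants side by side, here is a minimal self-contained sketch on the sample data from the question. The names by_split and by_map are only illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({'d': [1, 2, 3, 4]},
                  index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])

# Variant 1: keep everything before the first underscore.
by_split = df.index.str.split('_').str[0]

# Variant 2: map each label through a function (only handles two labels).
by_map = df.index.map(lambda x: 'apple' if 'apple' in x else 'orange')

# Both variants agree on this data.
assert list(by_split) == list(by_map)

df.index = by_split
print(df)
```

The split() variant keeps working if more prefixes show up later, while the conditional in the map() variant is hard-coded for exactly two labels.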

Related

Copy first of group down and sum total - pre defined groups

I have previously asked how to iterate through a prescribed grouping of items and received the solution.
import pandas as pd
data = [['apple', 1], ['orange', 2], ['pear', 3], ['peach', 4],['plum', 5], ['grape', 6]]
#index_groups = [0],[1,2],[3,4,5]
df = pd.DataFrame(data, columns=['Name', 'Number'])
for i in range(len(df)):
    print(df['Number'][i])
Name Number
0 apple 1
1 orange 2
2 pear 3
3 peach 4
4 plum 5
5 grape 6
where:
for group in index_groups:
    print(df.loc[group])
gave me just what I needed. Following up on this, I would now like to sum the numbers per group and also copy the first 'Name' in each group down to the other rows in the group, then collapse the result to one line per 'Name'.
In the above example the output I'm seeking would be
Name Age
0 apple 1
1 orange 5
2 peach 15
I can append the sums to a list easily enough:
group_sum = []
for group in index_groups:
    group_sum.append(sum(df['Number'].loc[group]))
But I can't get the 'Names' lined up to merge with the sums.
You could try:
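df_final = pd.DataFrame()
for group in index_groups:
    # work on a copy so the assignment below does not trigger SettingWithCopyWarning
    _df = df.loc[group].copy()
    # copy the first Name of the group down to every row of the group
    _df["Name"] = _df["Name"].iloc[0]
    df_final = pd.concat([df_final, _df])
df_final.groupby("Name").agg(Age=("Number", "sum")).reset_index()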
Output:
Name Age
0 apple 1
1 orange 5
2 peach 15
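
As an alternative, the same result can probably be reached without building an intermediate frame in a loop. This is only a sketch under the assumption that index_groups holds row labels of df, as in the question; first_name and keys are names made up for the illustration.

```python
import pandas as pd

data = [['apple', 1], ['orange', 2], ['pear', 3],
        ['peach', 4], ['plum', 5], ['grape', 6]]
index_groups = [0], [1, 2], [3, 4, 5]
df = pd.DataFrame(data, columns=['Name', 'Number'])

# Map every row label to the Name of the first row in its group.
first_name = {row: df.loc[group[0], 'Name']
              for group in index_groups for row in group}

# A Series of group keys aligned with df's index.
keys = df.index.to_series().map(first_name)

# sort=False keeps the groups in the order they were defined.
result = (df.groupby(keys, sort=False)['Number']
            .sum()
            .rename_axis('Name')
            .reset_index(name='Age'))
print(result)
```

Note that the loop version sorts the groups alphabetically by name (the groupby default), whereas this sketch keeps them in the order of index_groups. On the sample data both orders happen to coincide.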

How to find substring in list and return to substring in list instead of true or false only

Hi, I have a dataset something like this:
dx = pd.DataFrame({'IDs':[1234,5346,1234,8793,8793],
'Names':['APPLE ABCD ONE','APPLE ABCD','NO STRAWBERRY YES','ORANGE AVAILABLE','TEA AVAILABLE']})
kw = ['APPLE', 'ORANGE', 'LEMONS', 'STRAWBERRY', 'BLUEBERRY', 'TEA COFFEE']
dx['Check']=dx['Names'].apply(lambda x: 1 if any(k in x for k in kw) else 0)
Instead of returning 1 or 0, I want it to return the matching keyword from kw, such as 'APPLE', 'ORANGE' or 'TEA COFFEE', in a new column.
I hope someone can help me.
Thank you
Use a regex with str.extract to benefit from vectorized speed:
import re
regex = '|'.join(map(re.escape, kw))
dx['Check'] = dx['Names'].str.extract(f'({regex})')
NB. This only returns the first match per row. If you want all matches, use extractall and perform an aggregation step.
output:
IDs Names Check
0 1234 APPLE ABCD ONE APPLE
1 5346 APPLE ABCD APPLE
2 1234 NO STRAWBERRY YES STRAWBERRY
3 8793 ORANGE AVAILABLE ORANGE
4 8793 TEA AVAILABLE NaN
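
The extractall variant mentioned in the note could look roughly like this. It is a sketch rather than part of the original answer; joining the matches with ', ' is just one possible aggregation, and the name matches is illustrative.

```python
import re
import pandas as pd

dx = pd.DataFrame({'IDs': [1234, 5346, 1234, 8793, 8793],
                   'Names': ['APPLE ABCD ONE', 'APPLE ABCD', 'NO STRAWBERRY YES',
                             'ORANGE AVAILABLE', 'TEA AVAILABLE']})
kw = ['APPLE', 'ORANGE', 'LEMONS', 'STRAWBERRY', 'BLUEBERRY', 'TEA COFFEE']

regex = '|'.join(map(re.escape, kw))

# extractall returns one row per match, indexed by (original row, match number).
matches = dx['Names'].str.extractall(f'({regex})')[0]

# Collapse back to one string per original row.
# Rows without any match are absent from `matches` and become NaN on assignment.
dx['Check'] = matches.groupby(level=0).agg(', '.join)
print(dx)
```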
Would this work? It returns a list of all matching keywords for each row (an empty list if there is none):
dx['Check']=dx['Names'].apply(lambda x: [k for k in kw if k in x ])

Python pandas dataframe, how to get the set number

Here is an example:
df=pd.DataFrame([('apple'),('apple'),('apple'),('orange'),('orange')],columns=['A'])
df
Out[5]:
A
0 apple
1 apple
2 apple
3 orange
4 orange
I want to assign a number next to each value based on its position in the list ['apple','orange']. For example, apple is the first item, so column B is 1, and orange gets 2:
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
The following doesn't work:
df['B']=df['A'].tolist().index(df['A']) +1
You can use the pd.factorize function. It encodes the values as integer codes, one code per distinct value, numbered in order of first appearance.
factorize is also available as a method of pd.Series objects:
codes, _ = df["A"].factorize()
df["B"] = codes + 1
print(df)
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
Use groupby ngroup + 1 with sort=False to ensure groups are enumerated in the order they appear in the DataFrame:
df['B'] = df.groupby('A', sort=False).ngroup() + 1
df:
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2

How to flat a string to several columns in pandas?

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})
I want to flatten the dataframe and get the format below:
apple orange pear
1 2 3
Thanks
You are making your life extremely difficult if you work with multiple values in a single field. You can use almost none of the pandas functions, because they all assume the data in a field belongs together and should stay together.
For instance with
In [10]: fruit = pd.Series({'apple': 1, 'orange': 2, 'pear': 3})
In [11]: fruit
Out[11]:
apple 1
orange 2
pear 3
dtype: int64
you could easily transform your data as in
In [14]: fruit.to_frame()
Out[14]:
0
apple 1
orange 2
pear 3
In [15]: fruit.to_frame().T
Out[15]:
apple orange pear
0 1 2 3
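
If the data really does arrive as a single string, one possible way to get to the layout above is to parse the name/value pairs first. This is a minimal sketch, assuming every pair has the form "word: integer" as in the question; parse_pairs is a hypothetical helper, not a pandas function.

```python
import re
import pandas as pd

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})

def parse_pairs(text):
    """Turn 'apple: 1 orange: 2' into {'apple': 1, 'orange': 2}."""
    return {name: int(value)
            for name, value in re.findall(r'(\w+):\s*(\d+)', text)}

# One dict per row, then one column per key.
flat = pd.DataFrame(fruit['type'].map(parse_pairs).tolist(), index=fruit.index)
print(flat)
```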

Iteration through two Pandas Dataframes + create new column

I am new to using Pandas and I am trying to iterate through two columns from different Dataframes and if both columns have the same word, to append "yes" to another column. If not, append the word "no".
This is what I have:
for row in df1.iterrows():
    for word in df2.iterrows():
        if df1['word1'] == df2['word2']:
            df1.column1.append('Yes') #I just want to have two columns in binary form, if one is yes the other must be no
            df2.column2.append('No')
        else:
            df1.column1.append('No')
            df2.column2.append('Yes')
What I have now:
column1 column2 column3
apple None None
orange None None
banana None None
tomato None None
sugar None None
grapes None None
fig None None
I want:
column1 column2 column3
apple Yes No
orange No No
banana No No
tomato No No
sugar No Yes
grapes No Yes
figs No Yes
Sample of words from df1: apple, orange, pear
Sample of words from df2: yellow, orange, green
I get this error:
Can only compare identically-labeled Series objects
Note: df2 has 2500 words, while df1 has 500.
Any help is appreciated!
Actually, you want to fill:
df1.column1 with:
  Yes - if word1 from this row occurs in df2.word2 (in any row),
  No - otherwise,
df2.column2 with:
  Yes - if word2 from this row occurs in df1.word1 (in any row),
  No - otherwise.
To do it, you can run:
import numpy as np
df1['column1'] = np.where(df1.word1.isin(df2.word2), 'Yes', 'No')
df2['column2'] = np.where(df2.word2.isin(df1.word1), 'Yes', 'No')
To test my code I used the following DataFrames:
df1:              df2:
   word1             word2
0  apple          0  yellow
1  orange         1  orange
2  pear           2  green
3  strawberry     3  strawberry
4  plum
The result of my code is:
df1:                      df2:
   word1       column1       word2       column2
0  apple       No         0  yellow      No
1  orange      Yes        1  orange      Yes
2  pear        No         2  green       No
3  strawberry  Yes        3  strawberry  Yes
4  plum        No
I think it might be a better idea to build a set of the words from each column and then do the lookup against those sets. It should also be much faster than looping over the rows. Something like this:
words_df1 = set(df1['word1'].tolist())
words_df2 = set(df2['word2'].tolist())
Then do
df1['has_word2'] = df1['word1'].isin(words_df2)
df2['has_word1'] = df2['word2'].isin(words_df1)
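
The set-based lookup gives True/False columns rather than the 'Yes'/'No' strings the question asks for. A small sketch of how the two could be combined, reusing the test frames from the other answer; the labels dict is an assumption made for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'word1': ['apple', 'orange', 'pear', 'strawberry', 'plum']})
df2 = pd.DataFrame({'word2': ['yellow', 'orange', 'green', 'strawberry']})

words_df1 = set(df1['word1'])
words_df2 = set(df2['word2'])

# Translate the boolean membership result into the requested strings.
labels = {True: 'Yes', False: 'No'}
df1['has_word2'] = df1['word1'].isin(words_df2).map(labels)
df2['has_word1'] = df2['word2'].isin(words_df1).map(labels)

print(df1)
print(df2)
```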