How to modify values that match a specific condition in a column with pandas - pandas

I have a question about how to modify only the values that match a specific condition in a column.
For example,
from pandas import DataFrame
import pandas as pd
df = DataFrame({'name' : ['apple','pineapple','melon','orange','mango','durian'],
'amt' : [200,300,100,1,3,120]},
index = ['1','2','3','4','5','6'])
print(df)
This prints:
amt name
1 200 apple
2 300 pineapple
3 100 melon
4 1 orange
5 3 mango
6 120 durian
From the result above, I want to change the amt of apple only and leave the other items unchanged.
The only approach I know is:
df.loc[df.name.str.contains('apple'), 'amt'] = df['amt']/100
This changes not only 'apple' but also 'pineapple', because str.contains matches substrings.
I'd like to change only apple's amt, like this:
amt name
1 2 apple
2 300 pineapple
3 100 melon
4 1 orange
5 3 mango
6 120 durian
Can anyone help me?
Thanks for reading.
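One possible way to restrict the update to the exact name is to compare with == instead of str.contains. This is a minimal sketch rather than a definitive answer; the variable name mask is only illustrative, and integer division (//) is an assumption made here to keep the amt column as integers (use / if fractional results are wanted).

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['apple', 'pineapple', 'melon', 'orange', 'mango', 'durian'],
     'amt': [200, 300, 100, 1, 3, 120]},
    index=['1', '2', '3', '4', '5', '6'])

# Exact comparison: True only for rows whose name is exactly 'apple',
# so 'pineapple' is not matched.
mask = df['name'] == 'apple'

# Update only the selected rows; // keeps the values as integers.
df.loc[mask, 'amt'] = df.loc[mask, 'amt'] // 100

print(df)
```

If a regular expression is preferred, anchoring the pattern (for example '^apple$') should have the same effect with str.contains.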

Related

Python pandas dataframe, how to get the set number

Here is an example:
df=pd.DataFrame([('apple'),('apple'),('apple'),('orange'),('orange')],columns=['A'])
df
Out[5]:
A
0 apple
1 apple
2 apple
3 orange
4 orange
I want to assign a number next to each value. For example, apple is the first element of the list ['apple','orange'], so column B is 1 for apple, then 2 for orange:
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
The code below doesn't work:
df['B']=df['A'].tolist().index(df['A']) +1
You can use the pd.factorize function. It encodes each distinct value as an integer code, numbered from 0 in order of first appearance.
factorize is also available as a method of pd.Series objects:
codes, _ = df["A"].factorize()
df["B"] = codes + 1
print(df)
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
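To make the return value of factorize more concrete, here is a small sketch that prints both parts of the result. The name uniques is chosen here for illustration; the answer above discards that value with _.

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'apple', 'apple', 'orange', 'orange']})

# factorize returns a pair: integer codes (one per row) and the distinct values.
codes, uniques = pd.factorize(df['A'])

print(codes)          # [0 0 0 1 1]
print(list(uniques))  # ['apple', 'orange']

# Codes start at 0, so add 1 to number the sets from 1.
df['B'] = codes + 1
```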
Use groupby with ngroup() + 1, passing sort=False to ensure groups are enumerated in the order they appear in the DataFrame:
df['B'] = df.groupby('A', sort=False).ngroup() + 1
df:
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2

How to flatten a string into several columns in pandas?

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})
I want to flatten the dataframe and get the format below:
apple orange pear
1 2 3
Thanks
You are making your life extremely difficult if you work with multiple values in a single field. You can use almost none of the pandas functions, because they all assume that the data in a field belongs together and should stay together.
For instance with
In [10]: fruit = pd.Series({'apple': 1, 'orange': 2, 'pear': 3})
In [11]: fruit
Out[11]:
apple 1
orange 2
pear 3
dtype: int64
you could easily transform your data as in
In [14]: fruit.to_frame()
Out[14]:
0
apple 1
orange 2
pear 3
In [15]: fruit.to_frame().T
Out[15]:
apple orange pear
0 1 2 3
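If the data only exists as a single string, one way to get it into that shape is to parse the "name: value" pairs first. The following is a sketch under the assumption that every pair looks like a word, a colon, and an integer; the helper parse_pairs is hypothetical and not part of pandas.

```python
import re

import pandas as pd

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})


def parse_pairs(text):
    """Turn 'apple: 1 orange: 2' into {'apple': 1, 'orange': 2}."""
    return {name: int(value)
            for name, value in re.findall(r'(\w+):\s*(\d+)', text)}


# One dict per row; DataFrame turns the dict keys into columns.
flat = pd.DataFrame(fruit['type'].map(parse_pairs).tolist(), index=fruit.index)

print(flat)
```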

How do you groupby and aggregate using conditional statements in Pandas?

Expanding on the question here, I'm wondering how to add aggregation to the following based on conditions:
Index Name Item Quantity
0 John Apple Red 10
1 John Apple Green 5
2 John Orange Cali 12
3 Jane Apple Red 10
4 Jane Apple Green 5
5 Jane Orange Cali 18
6 Jane Orange Spain 2
7 John Banana 3
8 Jane Coconut 5
9 John Lime 10
... And so forth
What I need to do is convert this data into a dataframe like the following. Note: I am only interested in the total quantity of the apples and of the oranges, each in a separate column. Any other items that appear in a group are not to be included in the aggregation on column "Quantity", but they should still appear in the column "All Items" as strings:
Index Name All Items Apples Total Oranges Total
0 John Apple Red, Apple Green, Orange Cali, Banana, Lime 15 12
1 Jane Apple Red, Apple Green, Orange Cali, Coconut 15 20
How would I achieve that? Many thanks in advance!
You can use groupby and pivot_table after extracting the Apple and Orange substrings as below:
import re
s = df['Item'].str.extract("(Apple|Orange)",expand=False,flags=re.I)
# re.I used above is optional and is used for case insensitive matching
a = df.assign(Item_1=s).dropna(subset=['Item_1'])
out = (a.groupby("Name")['Item'].agg(",".join).to_frame().join(
a.pivot_table("Quantity","Name","Item_1",aggfunc=sum).add_suffix("_Total"))
.reset_index())
print(out)
Name Item Apple_Total \
0 Jane Apple Red,Apple Green,Orange Cali,Orange Spain 15
1 John Apple Red,Apple Green,Orange Cali 15
Orange_Total
0 20
1 12
EDIT:
For the edited question, you can use the same code, except that the groupby is done on the original dataframe df instead of the subset a, and then joined:
out = (df.groupby("Name")['Item'].agg(",".join).to_frame().join(
a.pivot_table("Quantity","Name","Item_1",aggfunc=sum).add_suffix("_Total"))
.reset_index())
print(out)
Name Item Apple_Total \
0 Jane Apple Red,Apple Green,Orange Cali,Orange Spain... 15
1 John Apple Red,Apple Green,Orange Cali,Banana,Lime 15
Orange_Total
0 20
1 12
First, filter only the required rows using str.contains on the Item column:
from io import StringIO
import pandas as pd
s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
John;Banana;3
Jane;Coconut;5
John;Lime;10
""")
df = pd.read_csv(s,sep=';')
req_items_idx = df[df.Item.str.contains('Apple|Orange')].index
df_filtered = df.loc[req_items_idx,:]
Once you have them you can further pivot the data to get the required values based on Name
pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()
Generate the Totals for Apples and Oranges
orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()
pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)
A wrapper function to combine the Items together
def combine_items(inp,columns):
res = []
for val,col in zip(inp.values,columns):
if not pd.isnull(val):
res += [col]
return ','.join(res)
req_columns = apple_columns+orange_columns
pivot_df['Items'] = pivot_df[apple_columns+orange_columns].apply(combine_items,args=([req_columns]),axis=1)
Finally you can get the required columns in a single place and print the values
total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()
>>> pivot_df[name_item_columns+total_columns]
Name Items Apples Total Orange Total
0 Jane Apple Green,Apple Red,Orange Cali,Orange Spain 15.0 20.0
1 John Apple Green,Apple Red,Orange Cali 15.0 12.0
The answer is intended to outline the individual steps and approach one can take to solve something similar to this
To do this, you can create your Total columns before doing your groupby. These will contain the number of apples or oranges in that row, depending on whether that row's Item is an apple or an orange (and 0 otherwise).
df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)
When this is in place, groupby name and aggregate on each column. Sum on the total columns, and aggregate to list on the item column.
df.groupby('Name').agg({'Apples Total': 'sum',
'Oranges Total': 'sum',
'Item': lambda x: list(x)
})
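The same idea can be sketched without a row-wise apply, using str.contains to build boolean masks and Series.where to zero out the other rows. This is an illustration under assumptions, not the answer's own code: the temporary column names apples and oranges are made up here, and the items are joined into one string rather than kept as a list.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John', 'Jane', 'Jane',
             'Jane', 'Jane', 'John', 'Jane', 'John'],
    'Item': ['Apple Red', 'Apple Green', 'Orange Cali', 'Apple Red',
             'Apple Green', 'Orange Cali', 'Orange Spain', 'Banana',
             'Coconut', 'Lime'],
    'Quantity': [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Boolean masks marking the rows that count toward each total.
is_apple = df['Item'].str.contains('Apple')
is_orange = df['Item'].str.contains('Orange')

out = (
    df.assign(
        # Keep Quantity where the mask is True, otherwise use 0.
        apples=df['Quantity'].where(is_apple, 0),
        oranges=df['Quantity'].where(is_orange, 0),
    )
    .groupby('Name', sort=False)
    # Named aggregation; the dict is unpacked because the output
    # column names contain spaces.
    .agg(**{
        'All Items': ('Item', ', '.join),
        'Apples Total': ('apples', 'sum'),
        'Oranges Total': ('oranges', 'sum'),
    })
    .reset_index()
)

print(out)
```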
from io import StringIO
import numpy as np
import pandas as pd
df = pd.read_csv(StringIO("""
Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10
"""))
Getting list of items
grouping by name to get the list of items
items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
items_list
All Items
Name
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John [Apple Red, Apple Green, Orange Cali, Banana, Lime]
getting count of name item groups
renaming the temp df items column such that all the apples/oranges are treated similarly
temp2 = df.groupby(["Name", "Item"])['Quantity'].apply(sum)
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False,regex=True)
temp2
Item Quantity
Name
Jane Apple 5
Jane Apple 10
Jane Coconut 5
Jane Orange 18
Jane Orange 2
John Apple 5
John Apple 10
John Banana 3
John Lime 10
John Orange 12
getting the required pivot table
pivot table for getting items count as separate column and retaining just apple orange count
pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc=np.sum)
pivot_df = pivot_df[['Apple', 'Orange']]
pivot_df
Item Apple Orange
Name
Jane 15.0 20.0
John 15.0 12.0
merging the items list df and the pivot_df
output = items_list.merge(pivot_df, on="Name").rename(columns={'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
output
All Items Apples Total Oranges Total
Name
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut] 15.0 20.0
John [Apple Red, Apple Green, Orange Cali, Banana, Lime] 15.0 12.0

pandas search a value in a dataframe column

I have the following dataframe, and I want to search for apple in the fruits column and display all the rows of each group in which apple is found.
Before :
number fruits purchase
0 apple yes
mango
banana
1 apple no
cheery
2 mango yes
banana
3 apple yes
orange
4 grapes no
pear
After:
number fruits purchase
0 apple yes
mango
banana
1 apple no
cheery
3 apple yes
orange
Use groupby and filter to filter groups that contain 'apple':
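import numpy as np
# fill the blank 'number' cells so every row carries its group number
df['number'] = df['number'].ffill()
# keep only the groups in which at least one row is 'apple'
df_out = df.groupby('number').filter(lambda x: (x['fruits'] == 'apple').any())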
df_out.assign(number = df_out['number'].mask(df.number.duplicated()))\
.replace(np.nan,'')
Output:
number fruits purchase
0 0 apple yes
1 mango
2 banana
3 1 apple no
4 cheery
7 3 apple yes
8 orange
It looks like you're using 'number' as the index, so I'm going to assume that.
Get the numbers where 'apple' is present, and slice into those:
idx = df.index[df.fruits == 'apple']
df.loc[idx]
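A related sketch that leaves the original number column untouched: build the group key with ffill, then broadcast "does this group contain apple" back to every row with transform. The sample frame below is reconstructed from the question under the assumption that the blank number cells are NaN and the blank purchase cells are empty strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'number': [0, np.nan, np.nan, 1, np.nan, 2, np.nan, 3, np.nan, 4, np.nan],
    'fruits': ['apple', 'mango', 'banana', 'apple', 'cheery', 'mango',
               'banana', 'apple', 'orange', 'grapes', 'pear'],
    'purchase': ['yes', '', '', 'no', '', 'yes', '', 'yes', '', 'no', ''],
})

# Group key: each blank row inherits the number of the row above it.
group = df['number'].ffill()

# True for every row whose group contains at least one 'apple'.
has_apple = df['fruits'].eq('apple').groupby(group).transform('any')

result = df[has_apple]
print(result)
```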

Python3 pandas: data frame grouped by a column (such as name), then extract a number of rows for each group

There is a data frame called df as follows:
name id age text
a 1 1 very good, and I like him
b 2 2 I play basketball with his brother
c 3 3 I hope to get a offer
d 4 4 everything goes well, I think
a 1 1 I will visit china
b 2 2 no one can understand me, I will solve it
c 3 3 I like followers
d 4 4 maybe I will be good
a 1 1 I should work hard to finish my research
b 2 2 water is the source of earth, I agree it
c 3 3 I hope you can keep in touch with me
d 4 4 My baby is very cute, I like him
The data frame is grouped by name, and I want to extract the first few rows of each group (for example, 2) into a new dataframe, df_new:
name id age text
a 1 1 very good, and I like him
a 1 1 I will visit china
b 2 2 I play basketball with his brother
b 2 2 no one can understand me, I will solve it
c 3 3 I hope to get a offer
c 3 3 I like followers
d 4 4 everything goes well, I think
d 4 4 maybe I will be good
df_new = (df.groupby('screen_name'))[0:2]
But there is an error:
hash(key)
TypeError: unhashable type: 'slice'
Try using head() instead.
import pandas as pd
from io import StringIO
buff = StringIO('''
name,id,age,text
a,1,1,"very good, and I like him"
b,2,2,I play basketball with his brother
c,3,3,I hope to get a offer
d,4,4,"everything goes well, I think"
a,1,1,I will visit china
b,2,2,"no one can understand me, I will solve it"
c,3,3,I like followers
d,4,4,maybe I will be good
a,1,1,I should work hard to finish my research
b,2,2,"water is the source of earth, I agree it"
c,3,3,I hope you can keep in touch with me
d,4,4,"My baby is very cute, I like him"
''')
df = pd.read_csv(buff)
# use head() instead of [:2], then sort by name
df_new = df.groupby('name').head(2).sort_values('name')
print(df_new)
name id age text
0 a 1 1 very good, and I like him
4 a 1 1 I will visit china
1 b 2 2 I play basketball with his brother
5 b 2 2 no one can understand me, I will solve it
2 c 3 3 I hope to get a offer
6 c 3 3 I like followers
3 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
Another solution with iloc:
df_new = df.groupby('name').apply(lambda x: x.iloc[:2]).reset_index(drop=True)
print(df_new)
name id age text
0 a 1 1 very good, and I like him
1 a 1 1 I will visit china
2 b 2 2 I play basketball with his brother
3 b 2 2 no one can understand me, I will solve it
4 c 3 3 I hope to get a offer
5 c 3 3 I like followers
6 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
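A third variant, sketched here as an assumption rather than taken from the answers above, numbers the rows within each group with cumcount and filters on that number. The text values are placeholders, not the original sentences.

```python
import pandas as pd

df = pd.DataFrame({
    'name': list('abcd') * 3,
    'id': [1, 2, 3, 4] * 3,
    'age': [1, 2, 3, 4] * 3,
    'text': [f'row {i}' for i in range(12)],  # placeholder text
})

# Position of each row inside its group: 0, 1, 2, ...
position = df.groupby('name').cumcount()

# Keep the first two rows of every group; a stable sort preserves
# the original order of rows within each name.
df_new = df[position < 2].sort_values('name', kind='stable')

print(df_new)
```

Changing the comparison (for example position.between(1, 2)) selects other row positions per group, which head() alone cannot do.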