Iteration through two Pandas Dataframes + create new column - pandas

I am new to using Pandas and I am trying to iterate through two columns from different Dataframes and if both columns have the same word, to append "yes" to another column. If not, append the word "no".
This is what I have:
for row in df1.iterrows():
    for word in df2.iterrows():
        if df1['word1'] == df2['word2']:
            df1.column1.append('Yes')  # I just want to have two columns in binary form, if one is yes the other must be no
            df2.column2.append('No')
        else:
            df1.column1.append('No')
            df2.column2.append('Yes')
I have now:
column1 column2 column3
apple None None
orange None None
banana None None
tomato None None
sugar None None
grapes None None
fig None None
I want:
column1 column2 column3
apple Yes No
orange No No
banana No No
tomato No No
sugar No Yes
grapes No Yes
figs No Yes
Sample of words from df1: apple, orange, pear
Sample of words from df2: yellow, orange, green
I get this error:
Can only compare identically-labeled Series objects
Note: df2 has about 2500 words, while df1 has about 500.
Any help is appreciated!

Actually, you want to fill:
df1.column1 with:
Yes - if word1 from this row occurs in df2.word2 (in any row),
No - otherwise,
df2.column2 with:
Yes - if word2 from this row occurs in df1.word1 (in any row),
No - otherwise.
To do it, you can run:
import numpy as np
df1['column1'] = np.where(df1.word1.isin(df2.word2), 'Yes', 'No')
df2['column2'] = np.where(df2.word2.isin(df1.word1), 'Yes', 'No')
To test my code I used the following DataFrames:
df1:              df2:
   word1             word2
0  apple          0  yellow
1  orange         1  orange
2  pear           2  green
3  strawberry     3  strawberry
4  plum
The result of my code is:
df1:                       df2:
   word1       column1        word2       column2
0  apple       No          0  yellow      No
1  orange      Yes         1  orange      Yes
2  pear        No          2  green       No
3  strawberry  Yes         3  strawberry  Yes
4  plum        No
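For reference, here is a self-contained sketch of the same approach that can be pasted and run as-is. The frames are rebuilt from the test data above. The original error comes from `==`, which compares two Series element by element and needs matching labels, whereas isin() only tests membership, so the frames may differ in length.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the test frames above
df1 = pd.DataFrame({"word1": ["apple", "orange", "pear", "strawberry", "plum"]})
df2 = pd.DataFrame({"word2": ["yellow", "orange", "green", "strawberry"]})

# isin() checks each value against all values of the other column,
# so the two frames do not need the same length or index
df1["column1"] = np.where(df1["word1"].isin(df2["word2"]), "Yes", "No")
df2["column2"] = np.where(df2["word2"].isin(df1["word1"]), "Yes", "No")

print(df1)
print(df2)
```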

I think it might be a better idea to build a set of words from each column and then do the lookup. It may be faster as well. Something like this:
words_df1 = set(df1['word1'].tolist())
words_df2 = set(df2['word2'].tolist())
Then do
df1['has_word2'] = df1['word1'].isin(words_df2)
df2['has_word1'] = df2['word2'].isin(words_df1)
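Note that isin() returns booleans rather than the "Yes"/"No" strings from the question. A minimal sketch of one way to convert them, assuming the sample words from the question (the frame contents here are only illustrative):

```python
import pandas as pd

# Sample words taken from the question
df1 = pd.DataFrame({"word1": ["apple", "orange", "pear"]})
df2 = pd.DataFrame({"word2": ["yellow", "orange", "green"]})

words_df2 = set(df2["word2"])

# isin() yields True/False; map() turns them into the labels from the question
df1["has_word2"] = df1["word1"].isin(words_df2).map({True: "Yes", False: "No"})
print(df1)
```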

Related

Re-define dataframe index with map function

I have a dataframe like this. I want to know how I can apply the map function to its index and rename it into an easier format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
d
apple_017 1
orange_054 2
orange_061 3
orange_053 4
There are only two labels in the indices of the dataframe, so it's either apple or orange in this case, and this is how I tried:
data.index = data.index.map(i = "apple" if "apple" in i else "orange")
(Apparently it's not how it works)
Desired output:
d
apple 1
orange 2
orange 3
orange 4
Appreciate anyone's help and suggestion!
Try via split():
df.index=df.index.str.split('_').str[0]
OR
via map():
df.index=df.index.map(lambda x:'apple' if 'apple' in x else 'orange')
output of df:
d
apple 1
orange 2
orange 3
orange 4
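A quick runnable check that both variants give the same labels on the question's data. The names by_split and by_map are only used here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"d": [1, 2, 3, 4]},
                  index=["apple_017", "orange_054", "orange_061", "orange_053"])

# Variant 1: keep the text before the first underscore
by_split = df.index.str.split("_").str[0]

# Variant 2: map each label to one of the two known names
by_map = df.index.map(lambda x: "apple" if "apple" in x else "orange")

# Both approaches should agree for this data
assert list(by_split) == list(by_map)

df.index = by_split
print(df)
```

The split() variant also works when there are more than two label prefixes, while the map() variant hard-codes the two names.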

How to flatten a string into several columns in pandas?

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})
I want to flatten the dataframe and get the format below:
apple orange pear
1 2 3
Thanks
You are making your life extremely difficult if you work with multiple values in a single field. You can use hardly any of the pandas functions, because they all assume that the data in a field belongs together and should stay together.
For instance with
In [10]: fruit = pd.Series({'apple': 1, 'orange': 2, 'pear': 3})
In [11]: fruit
Out[11]:
apple 1
orange 2
pear 3
dtype: int64
you could easily transform your data as in
In [14]: fruit.to_frame()
Out[14]:
0
apple 1
orange 2
pear 3
In [15]: fruit.to_frame().T
Out[15]:
apple orange pear
0 1 2 3
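If the data really does arrive as one string per row, it can be parsed first. This is only a sketch: it assumes every entry has the form "name: integer" separated by spaces, as in the question, and parse_pairs is a helper name made up for this example.

```python
import re
import pandas as pd

fruit = pd.DataFrame({"type": ["apple: 1 orange: 2 pear: 3"]})

def parse_pairs(text):
    # Find every "name: number" pair and build a dict such as {"apple": 1, ...}
    return {name: int(num) for name, num in re.findall(r"(\w+):\s*(\d+)", text)}

# One dict per input row becomes one row of the new frame
flat = pd.DataFrame([parse_pairs(t) for t in fruit["type"]], index=fruit.index)
print(flat)
```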

How do you groupby and aggregate using conditional statements in Pandas?

Expanding on the question here, I'm wondering how to add aggregation to the following based on conditions:
Index Name Item Quantity
0 John Apple Red 10
1 John Apple Green 5
2 John Orange Cali 12
3 Jane Apple Red 10
4 Jane Apple Green 5
5 Jane Orange Cali 18
6 Jane Orange Spain 2
7 John Banana 3
8 Jane Coconut 5
9 John Lime 10
... And so forth
What I need is to convert this data into a dataframe like the following. Note: I am only interested in the total quantity of the apples and of the oranges, each in a separate column. Any other items that appear in a group should be left out of the aggregation on the "Quantity" column, but they should still appear as strings in the "All Items" column:
Index Name All Items Apples Total Oranges Total
0 John Apple Red, Apple Green, Orange Cali, Banana, Lime 15 12
1 Jane Apple Red, Apple Green, Orange Cali, Coconut 15 20
How would I achieve that? Many thanks in advance!
You can use groupby and pivot_table after extracting the Apple and Orange substrings as below:
import re
s = df['Item'].str.extract("(Apple|Orange)",expand=False,flags=re.I)
# re.I used above is optional and is used for case insensitive matching
a = df.assign(Item_1=s).dropna(subset=['Item_1'])
out = (a.groupby("Name")['Item'].agg(",".join).to_frame().join(
           a.pivot_table("Quantity", "Name", "Item_1", aggfunc="sum").add_suffix("_Total"))
       .reset_index())
print(out)
Name Item Apple_Total \
0 Jane Apple Red,Apple Green,Orange Cali,Orange Spain 15
1 John Apple Red,Apple Green,Orange Cali 15
Orange_Total
0 20
1 12
EDIT:
For the edited question, you can use the same code, except that you group the original dataframe df instead of the subset a, and then join:
out = (df.groupby("Name")['Item'].agg(",".join).to_frame().join(
           a.pivot_table("Quantity", "Name", "Item_1", aggfunc="sum").add_suffix("_Total"))
       .reset_index())
print(out)
Name Item Apple_Total \
0 Jane Apple Red,Apple Green,Orange Cali,Orange Spain... 15
1 John Apple Red,Apple Green,Orange Cali,Banana,Lime 15
Orange_Total
0 20
1 12
First, filter only the required rows using str.contains on the Item column:
from io import StringIO
import pandas as pd
s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
John;Banana;3
Jane;Coconut;5
John;Lime;10
""")
df = pd.read_csv(s,sep=';')
req_items_idx = df[df.Item.str.contains('Apple|Orange')].index
df_filtered = df.loc[req_items_idx,:]
Once you have them you can further pivot the data to get the required values based on Name
pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()
Generate the Totals for Apples and Oranges
orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()
pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)
A wrapper function to combine the Items together
def combine_items(inp, columns):
    res = []
    for val, col in zip(inp.values, columns):
        if not pd.isnull(val):
            res += [col]
    return ','.join(res)
req_columns = apple_columns+orange_columns
pivot_df['Items'] = pivot_df[apple_columns+orange_columns].apply(combine_items,args=([req_columns]),axis=1)
Finally you can get the required columns in a single place and print the values
total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()
>>> pivot_df[name_item_columns+total_columns]
Name Items Apples Total Orange Total
0 Jane Apple Green,Apple Red,Orange Cali,Orange Spain 15.0 20.0
1 John Apple Green,Apple Red,Orange Cali 15.0 12.0
The answer is intended to outline the individual steps and approach one can take to solve something similar to this
Edits: fixed a bug.
To do this, you can create your Total columns before doing your groupby. These will contain the number of apples and oranges in that row, depending on whether that row's Item is an apple or an orange.
df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)
When this is in place, groupby name and aggregate on each column. Sum on the total columns, and aggregate to list on the item column.
df.groupby('Name').agg({'Apples Total': 'sum',
                        'Oranges Total': 'sum',
                        'Item': lambda x: list(x)
                        })
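Here is a self-contained sketch of this idea on the question's data. It swaps the row-wise apply for Series.where with str.contains, and uses named aggregation so the output columns match the requested names; these are my substitutions, not part of the answer above.

```python
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("""Name,Item,Quantity
John,Apple Red,10
John,Apple Green,5
John,Orange Cali,12
Jane,Apple Red,10
Jane,Apple Green,5
Jane,Orange Cali,18
Jane,Orange Spain,2
John,Banana,3
Jane,Coconut,5
John,Lime,10
"""))

# Keep Quantity where the item matches, put 0 everywhere else
df["Apples Total"] = df["Quantity"].where(df["Item"].str.contains("Apple"), 0)
df["Oranges Total"] = df["Quantity"].where(df["Item"].str.contains("Orange"), 0)

# Named aggregation; dict unpacking allows column names with spaces
out = (df.groupby("Name", sort=False)
         .agg(**{"All Items": ("Item", ", ".join),
                 "Apples Total": ("Apples Total", "sum"),
                 "Oranges Total": ("Oranges Total", "sum")})
         .reset_index())
print(out)
```

All items stay in the "All Items" string, while only apples and oranges contribute to the totals.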
from io import StringIO
import numpy as np
import pandas as pd
df = pd.read_csv(StringIO("""
Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10
"""))
Getting list of items
grouping by name to get the list of items
items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
items_list
All Items
Name
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John [Apple Red, Apple Green, Orange Cali, Banana, Lime]
getting count of name item groups
renaming the temp df items column such that all the apples/oranges are treated similarly
temp2 = df.groupby(["Name", "Item"])['Quantity'].apply(sum)
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False,regex=True)
temp2
Item Quantity
Name
Jane Apple 5
Jane Apple 10
Jane Coconut 5
Jane Orange 18
Jane Orange 2
John Apple 5
John Apple 10
John Banana 3
John Lime 10
John Orange 12
getting the required pivot table
pivot table for getting items count as separate column and retaining just apple orange count
pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc=np.sum)
pivot_df = pivot_df[['Apple', 'Orange']]
pivot_df
Item Apple Orange
Name
Jane 15.0 20.0
John 15.0 12.0
merging the items list df and the pivot_df
output = items_list.merge(pivot_df, on="Name").rename(columns = {'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
output
All Items Apples Total Oranges Total
Name
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut] 15.0 20.0
John [Apple Red, Apple Green, Orange Cali, Banana, Lime] 15.0 12.0

Is there a way to keep only rows in a DataFrame, when a column of that dataframe contains a substring of another column in that dataframe?

I have a dataset:
id key value
24 Apple Inc_Desktops revenue_rgs_category_-_pc_monitors nan
2 Apple Inc_Desktops revenue_rgs_category_-_mobile_phones 142381000000.000
46 Apple Inc_Desktops revenue_rgs_category_-_smart_tech 24482000000.000
13 Apple Inc_Desktops revenue_rgs_category_-_desktop_pcs 12870000000.000
35 Apple Inc_Desktops revenue_rgs_category_-_tablets 21280000000.000
1 Apple Inc_Laptops revenue_rgs_category_-_mobile_phones 142381000000.000
45 Apple Inc_Laptops revenue_rgs_category_-_smart_tech 24482000000.000
23 Apple Inc_Laptops revenue_rgs_category_-_pc_monitors nan
34 Apple Inc_Laptops revenue_rgs_category_-_tablets 21280000000.000
12 Apple Inc_Laptops revenue_rgs_category_-_desktop_pcs 12870000000.000
25 Apple Inc_MobilePhones revenue_rgs_category_-_pc_monitors nan
14 Apple Inc_MobilePhones revenue_rgs_category_-_desktop_pcs 12870000000.000
36 Apple Inc_MobilePhones revenue_rgs_category_-_tablets 21280000000.000
47 Apple Inc_MobilePhones revenue_rgs_category_-_smart_tech 24482000000.000
3 Apple Inc_MobilePhones revenue_rgs_category_-_mobile_phones 142381000000.000
And I only want to keep the rows where the column key contains a substring from column id. For example, I want to keep only the rows with index 13 and 3, because for those rows the 'key' column contains part of the 'id' column; e.g., for the row with index 3, 'Mobile' is included in the key column.
So my desired output would be:
id key value
13 Apple Inc_Desktops revenue_rgs_category_-_desktop_pcs 12870000000.000
3 Apple Inc_MobilePhones revenue_rgs_category_-_mobile_phones 142381000000.000
I tried to create a new column indicating whether the 'key' column contains a substring of the 'id' column, but with no luck:
comp_rev_long['check'] = comp_rev_long['key'].str.contains('|'.join(comp_rev_long['id']),case=False)
Any ideas on an efficient way to do this? Thanking you in advance.
Here is some code that should help you get started:
import numpy as np
import pandas as pd
np.random.seed(1)
# I create a simple DataFrame
df = pd.DataFrame({"id": np.random.choice(["apple", "banana", "cherry"], 15),
"key": np.random.choice(["apple pie", "banana pie", "cherry pie"], 15),
"value": np.random.randint(0,20, 15)})
df looks like this:
id key value
0 banana cherry pie 13
1 apple banana pie 9
2 apple cherry pie 9
3 banana apple pie 7
4 banana apple pie 1
5 apple cherry pie 0
6 apple apple pie 17
7 banana banana pie 8
8 apple cherry pie 13
9 banana cherry pie 19
10 apple apple pie 15
11 cherry banana pie 10
12 banana banana pie 8
13 cherry cherry pie 7
14 apple apple pie 3
Here is a simple option to select only the rows that satisfy a certain condition.
# create a function that checks if a row satisfies your condition
check_condition = lambda row: row["id"] in row["key"]
# create a new column that determines whether you keep the row
# by applying the check_condition function row wise (-> axis=1)
df["keep_row"] = df.apply(check_condition, axis=1)
# finally select and keep only the desired rows
df = df[df["keep_row"]]
Now df looks like this:
id key value keep_row
6 apple apple pie 17 True
7 banana banana pie 8 True
10 apple apple pie 15 True
12 banana banana pie 8 True
13 cherry cherry pie 7 True
14 apple apple pie 3 True
One final issue is how to check if a substring is contained in another string. There are a few ways to go about this.
Replace the values such that this operation becomes trivial, e.g. row["id"] in row["key"].
Make new columns with the crucial information of the string; if you only need to know whether it is a mobile or a pc, make a new 'device' column.
Just code it anyway; this is a bit cumbersome though.
This check_condition might work, from seeing your data, but I cannot be sure of course.
def check_condition(row):
    for i in row["id"].lower().split('_'):
        if i in row["key"].lower():
            return True
        elif i[:-1] in row["key"].lower():  # account for the final 's'
            return True
    return False
2 notes:
This isn't a lambda function, but in this case it is equivalent to one, so you can replace the lambda check_condition-function by this one.
Also note that in the "id" and "key" columns some words end with '-s' and some don't, so that needs to be accounted for as well.
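One thing the function above may miss is that "MobilePhones" in id is written "mobile_phones" in key. A possible variant, as a sketch only: it assumes the product name is the part of id after the last underscore, and it strips underscores from key before comparing. The four rows are copied from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["Apple Inc_Desktops", "Apple Inc_Desktops",
           "Apple Inc_MobilePhones", "Apple Inc_Laptops"],
    "key": ["revenue_rgs_category_-_desktop_pcs",
            "revenue_rgs_category_-_tablets",
            "revenue_rgs_category_-_mobile_phones",
            "revenue_rgs_category_-_desktop_pcs"],
})

def check_condition(row):
    # Assumption: the product name is the part of "id" after the last "_"
    product = row["id"].split("_")[-1].lower()
    # Drop underscores so "mobile_phones" can match "mobilephones"
    key = row["key"].lower().replace("_", "")
    # Also try the form without trailing "s", e.g. "desktops" -> "desktop"
    return product in key or product.rstrip("s") in key

kept = df[df.apply(check_condition, axis=1)]
print(kept)
```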
A solution to your question is to check if a string in your key column is present in the index column. In the example below, I construct a df (since you didn't provide one) in which one column contains a string that is present in another column:
import pandas as pd
a1,b2,c3 = 'ANDGFEEHsdsdSHSHS','FKDsdsdKSDKSDKS','DSLDJSLffsfsKDdSLDJS'
s1, s2, s3 = 1,3,1
e1, e2, e3 = 3,6,6
df = pd.DataFrame({'key':[a1,b2,c3],'start': [s1, s2, s3],'end': [e1, e2, e3]})
df = df[['key', 'start', 'end']]
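# fn is not defined in the original answer; assumed here to slice key[start:end]
fn = lambda row: row['key'][row['start']:row['end']]
df['sliced'] = df.apply(fn, axis = 1)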
aa1,bb2,cc3 = 'ANDGFEEHsdsdSHSHS','FKDsdsdKSDKSDKS','DSLDJSLffsfsKDdSLDJS'
ss1, ss2, ss3 = 2,2,2
ee1, ee2, ee3 = 1,1,1
df2 = pd.DataFrame({'key':[aa1,bb2,cc3],'start': [ss1, ss2, ss3],'end': [ee1, ee2, ee3]})
df2 = df2[['key', 'start', 'end']]
dff = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
Then apply this to check whether the string in one column exists in key:
df['Check'] = df.apply(lambda x: x.sliced in x.key, axis=1)
and keep only the rows where Check is True.

pandas search a value in a dataframe column

I have the following dataframe and I want to search for apple in the column fruits and display all the rows of the group if apple is found.
Before :
number fruits purchase
0 apple yes
mango
banana
1 apple no
cheery
2 mango yes
banana
3 apple yes
orange
4 grapes no
pear
After:
number fruits purchase
0 apple yes
mango
banana
1 apple no
cheery
3 apple yes
orange
Use groupby and filter to filter groups that contain 'apple':
import numpy as np
df['number'] = df['number'].ffill()
df_out = df.groupby('number').filter(lambda x: (x['fruits'] == 'apple').any())
df_out.assign(number = df_out['number'].mask(df.number.duplicated()))\
.replace(np.nan,'')
Output:
number fruits purchase
0 0 apple yes
1 mango
2 banana
3 1 apple no
4 cheery
7 3 apple yes
8 orange
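A self-contained sketch of the same idea, using transform("any") to build a boolean mask instead of filter(). It assumes the blank cells in 'number' are NaN, and the frame below is a shortened copy of the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "number": [0, np.nan, np.nan, 1, np.nan, 2, np.nan],
    "fruits": ["apple", "mango", "banana", "apple", "cheery", "mango", "banana"],
    "purchase": ["yes", "", "", "no", "", "yes", ""],
})

# Fill the group label down so every row knows which group it belongs to
groups = df["number"].ffill()

# True for every row of a group that has at least one 'apple'
has_apple = (df["fruits"] == "apple").groupby(groups).transform("any")

df_out = df[has_apple]
print(df_out)
```

Because the original 'number' column is left untouched, the blank cells do not need to be restored afterwards.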
It looks like you're using 'number' as the index, so I'm going to assume that.
Get the numbers where 'apple' is present, and slice into those:
idx = df.index[df.fruits == 'apple']
df.loc[idx]