Pandas iterrows is giving wrong values

I have 2 data frames & I am using iterrows to find the items that are present in both dataframes.
The code is below.
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad'],
                    'Age': ['24', '25', '35', '27'],
                    'City': ['Agra', 'Bangalore', 'Calcutta', 'Delhi']})
Df2 = pd.DataFrame({'name': ['Jake', 'John', 'Marc', 'Tony', 'Bob', 'Marc'],
                    'Age': ['25', '25', '24', '28', '29', '39'],
                    'City': ['Bangalore', 'Chennai', 'Agra', 'Delhi', 'Pune', 'zoo']})
age1 = []
age2 = []
for index, row in Df1.iterrows():
    if Df2.name.isin([row['name']]).any():
        print(Df2.loc[Df2['name'] == row['name'], 'Age'].values)
        print(Df1.loc[Df2['name'] == row['name'], 'Age'].values)
        print(Df1.loc[Df2['name'] == row['name']])
The code works for the value Marc: Marc is present in both data frames, so it gets printed out. However, this code also prints Sam (who is only present in Df1) instead of Jake, who is present in both Df1 & Df2.
The output is something like this:
['24' '39']
['35']
  name Age      City
2  Sam  35  Calcutta
['25']
['24']
   name Age  City
0  Marc  24  Agra
Why is it giving this output? It does not make any sense. Marc's age in Df2 is printed (which is correct), then Sam's age in Df1 is printed!
Then the row where Sam is present in Df1.
After that, I don't know how to make sense of the rest.
Also, why is Marc being printed 2nd? I assumed that since Marc is the first value in Df1, it would be checked and printed first, then Jake.
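For what it's worth, a likely explanation (assuming an older pandas version that silently reindexes boolean masks; newer versions raise an alignment error instead): the mask Df2['name'] == row['name'] carries Df2's index (0 to 5), and Df1.loc[...] aligns that mask against Df1's index labels, not against Df1's names. On Marc's iteration the mask is True at labels 2 and 5, and label 2 in Df1 is Sam, so Sam's age and row print right after Marc's ages from Df2. On Jake's iteration the mask is True at label 0, which in Df1 is Marc, which is why Marc seems to be printed 2nd. A sketch of the presumably intended lookup, masking each frame with its own name column:

for index, row in Df1.iterrows():
    if Df2.name.isin([row['name']]).any():
        # build each mask from the frame being indexed
        print(Df2.loc[Df2['name'] == row['name'], 'Age'].values)
        print(Df1.loc[Df1['name'] == row['name'], 'Age'].values)
        print(Df1.loc[Df1['name'] == row['name']])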

Related

Stuck merging two dataframes, getting unexpected results

Thank you; I am only 3 weeks into learning Pandas and am getting unexpected results, so any guidance would be appreciated.
I would like to merge two DataFrames together and retain my set_index.
I have a simple DataFrame
import pandas as pd
data = {
    'part_number': [123, 123, 123],
    'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
    'part_size': [11, 12, 13]
}
df = pd.DataFrame(data=data)
df.set_index('part_name', inplace=True)
I group by part_number, aggregate the part_sizes, and merge.
This is where my knowledge breaks down: I lose my index, which is the part_name.
I see there are joins and concats; am I using the wrong syntax?
part_size_merge = df.groupby(['part_number'], dropna=False)['part_size'].agg(tuple).to_frame()
merged = df.merge(part_size_merge, on=['part_number'])
display(merged.head())
I tried concat; however, it looks like it stacks the two df's on top of each other, which isn't what I want.
x = pd.concat([df, part_size_merge], axis=0, join='inner')
x.head()
Yes, that is normal merge behavior: merging on a column discards the existing index, so reset the index first and set it again after the merge:
out = df.reset_index().merge(part_size_merge, on=['part_number']).set_index('part_name')
Out[334]:
                 part_number  part_size_x   part_size_y
part_name
some name in 11          123           11  (11, 12, 13)
some name in 12          123           12  (11, 12, 13)
some name in 13          123           13  (11, 12, 13)
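A possible alternative that skips the reset/set round trip: part_size_merge keeps part_number as its index (that is what groupby(...).agg(...).to_frame() produces), and DataFrame.join matches a column of the caller against the other frame's index while preserving the caller's own index. A sketch, with rsuffix added only to disambiguate the duplicated part_size column:

# join matches df's 'part_number' column against part_size_merge's index
# and keeps df's part_name index intact
merged = df.join(part_size_merge, on='part_number', rsuffix='_all')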

Find a value from df1 in df2 and replace other values of the matching rows

I have the following code with 2 dataframes (df1 & df2)
import pandas as pd

data = {'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'],
        'Number': ['456', 'A977', '132a', '6783r', '868354']}
replace = {'NewName': ['NewName1', 'NewName3', 'NewName4', 'NewName5', 'NewName2'],
           'ID': ['I753', '25552', '6783r', '868354', 'A977']}
df1 = pd.DataFrame(data, columns=['Name', 'Number'])
df2 = pd.DataFrame(replace, columns=['NewName', 'ID'])
Now I would like to compare every item in the 'Number' column of df1 with the 'ID' column of df2. If there is a match, I would like to replace the 'Name' of df1 with the 'NewName' of df2, otherwise it should keep the 'Name' of df1.
First I tried the following code, but unfortunately it mixed up the names and numbers across different rows.
df1.loc[df1['Number'].isin(df2['ID']), ['Name']] = df2.loc[df2['ID'].isin(df1['Number']),['NewName']].values
The next code I tried worked a bit better, but it replaced the 'Name' of df1 with the 'Number' of df1 when there was no match.
df1['Name'] = df1['Number'].replace(df2.set_index('ID')['NewName'])
How can I stop this behavior in my last code or are there better ways in general to achieve what I would like to do?
You can use map instead of replace to substitute each value in the Number column of df1 with the corresponding value from the NewName column of df2, then fill the NaN values (values which couldn't be substituted) in the mapped column with the original values from the Name column of df1:
df1['Name'] = df1['Number'].map(df2.set_index('ID')['NewName']).fillna(df1['Name'])
>>> df1
       Name  Number
0     Name1     456
1  NewName2    A977
2     Name3    132a
3  NewName4   6783r
4  NewName5  868354
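The reason map plus fillna behaves better here: replace leaves values that have no match untouched (which is how the unmatched Numbers leaked into Name), while map turns them into NaN, and fillna then restores the original Name for exactly those rows. A minimal illustration:

mapping = df2.set_index('ID')['NewName']
df1['Number'].replace(mapping)  # '456' passes through unchanged
df1['Number'].map(mapping)      # '456' becomes NaN, recoverable via fillna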

Error while trying to perform isin while iterating

I have 2 data frames and I am trying to get the first value in the 'name' column of one dataframe, then do an isin with that value on the 'name' column of the other dataframe. I am doing it like this because, if isin is true, I want to get the corresponding Age and match both, and if that is also true, get the corresponding City and match it.
But I am getting the error below.
"TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]"
If I just print row['name'], I get the value of the first name, so why is it not doing the isin check? What am I missing here?
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad'],
                    'Age': ['24', '25', '26', '27'],
                    'City': ['Agra', 'Bangalore', 'Calcutta', 'Delhi']})
Df2 = pd.DataFrame({'name': ['Jake', 'John', 'Marc', 'Tony', 'Bob', 'Marc'],
                    'Age': ['25', '25', '24', '28', '29', '39'],
                    'City': ['Bangalore', 'Chennai', 'Agra', 'Delhi', 'Pune', 'zoo']})
for index, row in Df1.iterrows():
    if Df2.name.isin(row['name']) == True:
        print('present')
The problem is that isin needs a list-like, so one possible solution is to create a one-element list, with Series.any to test whether at least one value matches (at least one True):
for index, row in Df1.iterrows():
    if Df2.name.isin([row['name']]).any():
        print('present')
Or compare by Series.eq:
for index, row in Df1.iterrows():
    if Df2.name.eq(row['name']).any():
        print('present')
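Since the stated goal is to match name, then Age, then City, a vectorized sketch that avoids iterrows entirely is an inner merge on all three columns:

# keeps only the rows whose name, Age and City appear in both frames
common = Df1.merge(Df2, on=['name', 'Age', 'City'])
print(common)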

How to implement a WHERE clause in Python

I want to replicate what a WHERE clause does in SQL, using Python. Conditions in a WHERE clause can often be complex and involve multiple conditions. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement: I want to select all columns, but only the rows where the first letter of the address is 'N'. This is the initial data frame.
import pandas as pd

d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'],
     'Age': [23, 32, 45, 42, 28],
     'YrsOfEducation': [10, 15, 8, 12, 10],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df = pd.DataFrame(data=d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I get the output I want.
My question is: how can I do this more efficiently, or is there a smarter way to do it?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
  Address  Age  YrsOfEducation  name
0      NY   23              10  john
1      NJ   32              15   tom
3      NY   42              12  rock
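If the SQL WHERE intent should read even more literally, str.startswith('N') is an equivalent test, and DataFrame.query accepts it as a condition string (engine='python' is needed for str methods):

newdata = df[df['Address'].str.startswith('N')]
# or, closer to a WHERE clause:
newdata = df.query("Address.str.startswith('N')", engine='python')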

Get all rows that are the same and put them into their own dataframe

Say I have a dataframe where there are different values in a column, e.g.,
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'nationality', 'age'])
df
How do I create new dataframes, where each dataframe contains only the values for USA, only the values for UK, and only the values for France? But here is the thing: say I don't want to specify a condition like
Don't want this
# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"
I want all the data aggregated for each nationality, whatever the nationality is, without having to specify a nationality condition. I just want all the same nationalities together in their own dataframe, with all the columns that pertain to those rows.
So, for example, the function

SplitDFIntoSeveralDFWhereColumnValueAllTheSame(column):
    # code

will return an array of dataframes in which all the values of the chosen column are equal within each dataframe.
So if I had more data and more nationalities, the aggregation into new dataframes would work without changing the code.
This will give you a dictionary of dataframes where the keys are the unique values of the 'nationality' column and the values are the dataframes you are looking for.
{name: group for name, group in df.groupby('nationality')}
Demo:

dodf = {name: group for name, group in df.groupby('nationality')}
for k in dodf:
    print(k, '\n'*2, dodf[k], '\n'*2)
France

  first_name nationality  age
2        NaN      France   36

USA

  first_name nationality  age
0      Jason         USA   42
1      Molly         USA   52

UK

  first_name nationality  age
3        NaN          UK   24
4        NaN          UK   70
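Wrapped up as the function the question sketches (reusing the question's own function name; a dict keyed by the group value is returned instead of an array, since it keeps the labels):

def SplitDFIntoSeveralDFWhereColumnValueAllTheSame(df, column):
    # one sub-dataframe per unique value in `column`, keyed by that value
    return {value: group for value, group in df.groupby(column)}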