Error while trying to perform isin while iterating - pandas

I have two dataframes and I am trying to take the first value in the 'name' column of one dataframe and run an isin check with that value against the 'name' column of the other dataframe. I am doing it this way because, if isin is true, I then want to get the corresponding Age and match it, and if that also matches, get the corresponding City and match that too.
But I am getting the error below.
"TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]"
If I just print row['name'], I get the first name, so why is the isin check not happening? What am I missing here?
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad'],
                    'Age': ['24', '25', '26', '27'],
                    'City': ['Agra', 'Bangalore', 'Calcutta', 'Delhi']})
Df2 = pd.DataFrame({'name': ['Jake', 'John', 'Marc', 'Tony', 'Bob', 'Marc'],
                    'Age': ['25', '25', '24', '28', '29', '39'],
                    'City': ['Bangalore', 'Chennai', 'Agra', 'Delhi', 'Pune', 'zoo']})

for index, row in Df1.iterrows():
    if Df2.name.isin(row['name'])==True:
        print('present')

The problem is that isin needs a list-like, so one possible solution is to wrap the value in a one-element list and use Series.any to test whether at least one value matches (at least one True):
for index, row in Df1.iterrows():
    if Df2.name.isin([row['name']]).any():
        print('present')
Or compare with Series.eq:
for index, row in Df1.iterrows():
    if Df2.name.eq(row['name']).any():
        print('present')
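For the sample frames above, both loops print 'present' twice, once for Marc and once for Jake. If the final goal is to match on name, then Age, then City, a loop-free alternative (a sketch, not part of the answer above) is an inner merge on all three columns, reusing Df1 and Df2 from the question:
# Inner merge keeps only the rows whose name, Age and City all appear in both frames
matches = Df1.merge(Df2, on=['name', 'Age', 'City'], how='inner')
print(matches)  # for the sample data: the Marc and Jake rows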

Related

Split the content of a column in pandas

I have a pandas DataFrame which can be generated using this list of dictionaries:
list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
    {'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
    {'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented roughly what I need, but by adding a separate column per element:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
That gives me a separate column for each people_id, but only up to the second element: when I try to extract the 3rd element into a third column, it crashes because the first row has no 3rd element to extract.
What I am actually trying to do is extract every people_id from the people_ids column, so that each one ends up on its own row with its associated values from the Project and Hours columns.
Any idea how I could get this output?
I think what you are looking for is explode on the 'people_ids' column.
df = df.explode('people_ids', ignore_index=True)
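A minimal sketch of the full round trip, assuming the frame is built from the list_of_dictionaries shown in the question:
import pandas as pd

df = pd.DataFrame(list_of_dictionaries)
df = df.explode('people_ids', ignore_index=True)  # one row per people_id
print(df.head())
#    Project  Hours  people_ids
# 0        A      2    16986725
# 1        A      2    17612732
# 2        B      2    17254707
# 3        B      2    17567393
# 4        B      2    17571668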

Find a value from df1 in df2 and replace other values of the matching rows

I have the following code with two dataframes (df1 & df2):
import pandas as pd
data = {'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'],
        'Number': ['456', 'A977', '132a', '6783r', '868354']}
replace = {'NewName': ['NewName1', 'NewName3', 'NewName4', 'NewName5', 'NewName2'],
           'ID': ['I753', '25552', '6783r', '868354', 'A977']}
df1 = pd.DataFrame(data, columns=['Name', 'Number'])
df2 = pd.DataFrame(replace, columns=['NewName', 'ID'])
Now I would like to compare every item in the 'Number' column of df1 with the 'ID' column of df2. If there is a match, I would like to replace the 'Name' of df1 with the 'NewName' of df2, otherwise it should keep the 'Name' of df1.
First I tried the following code, but unfortunately it mixed up the names and numbers across rows.
df1.loc[df1['Number'].isin(df2['ID']), ['Name']] = df2.loc[df2['ID'].isin(df1['Number']),['NewName']].values
The next code that I tried worked a bit better, but it replaced the 'Name' of df1 with the 'Number' of df1 if there was no match.
df1['Name'] = df1['Number'].replace(df2.set_index('ID')['NewName'])
How can I prevent this behavior in my last attempt, or is there a better way in general to achieve what I would like to do?
You can use map instead of replace to substitute each value in df1's Number column with the corresponding value from df2's NewName column, and then fill the NaN values (the values that could not be substituted) in the mapped column with the original values from df1's Name column:
df1['Name'] = df1['Number'].map(df2.set_index('ID')['NewName']).fillna(df1['Name'])
>>> df1
       Name  Number
0     Name1     456
1  NewName2    A977
2     Name3    132a
3  NewName4   6783r
4  NewName5  868354
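To see why the fillna step is needed, here is a quick sketch of the intermediate result of map alone (reusing df1 and df2 from the question); rows whose Number has no matching ID in df2 come back as NaN and are then filled from df1['Name']:
mapped = df1['Number'].map(df2.set_index('ID')['NewName'])
print(mapped)
# 0         NaN
# 1    NewName2
# 2         NaN
# 3    NewName4
# 4    NewName5
# Name: Number, dtype: object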

Why can't I merge the columns together?

My goal is to transform the array into a DataFrame, and the error occurs only at the columns=... argument:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=[housing.columns, 'rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Consequently, it returns
AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 12
It says I only passed in 4 columns, but housing.columns itself has 9 columns.
Here is what I get when I run housing.columns:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity'],
      dtype='object')
So my question is: how can I merge the existing columns in housing.columns with the 3 new columns ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']?
You can use Index.union to add a list of columns to existing dataframe columns:
columns = housing.columns.union(
    ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'],
    sort=False)
Or convert to list and then add the remaining columns as list:
columns = (housing.columns.tolist() +
           ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Then:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=columns)
An example:
Assume this df:
df = pd.util.testing.makeDataFrame()
print(df.columns)
#Index(['A', 'B', 'C', 'D'], dtype='object')
When you pass the columns inside a list:
[df.columns, 'E', 'F', 'G']
you get a list containing one Index object plus the three new names, i.e. only 4 items:
[Index(['A', 'B', 'C', 'D'], dtype='object'), 'E', 'F', 'G']
versus when you use union:
df.columns.union(['E','F','G'],sort=False)
You get:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
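Note that sort=False keeps the labels in their original order; with the default sort=None, Index.union returns the union sorted alphabetically where possible.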

How to concatenate values from multiple rows using Pandas?

In the screenshot, the 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate, and I would like to concatenate the values of the remaining columns for OTC-07, i.e. OTC-07 should have Type A,B and Assertion a,b,c,d after consolidation. Can anyone help me with this? :o
First, define a dataframe with the given structure:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
    'Type': ['A', 'A', 'A', 'B', np.NaN],
    'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
Then replace NaN values with empty strings:
df = df.replace(np.NaN, '', regex=True)
Then group by the 'Ctrl' column and aggregate the 'Type' and 'Assertion' columns. Please note that the Assertion aggregation is a bit tricky, as you need not a simple concatenation but a sorted list of unique letters:
df.groupby(['Ctrl']).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(list(sorted(set(','.join(x).split(',')))))
})
Output:
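With the sample frame above, the aggregation should produce roughly:
        Type  Assertion
Ctrl
OTC-05     A      a,b,c
OTC-06     A        b,c
OTC-07   A,B    a,b,c,d
OTC-08            a,b,c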

Pandas iterrows is giving wrong values

I have two dataframes and I am using iterrows to find the items that are present in both dataframes.
The code is below.
import pandas as pd
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad'],
                    'Age': ['24', '25', '35', '27'],
                    'City': ['Agra', 'Bangalore', 'Calcutta', 'Delhi']})
Df2 = pd.DataFrame({'name': ['Jake', 'John', 'Marc', 'Tony', 'Bob', 'Marc'],
                    'Age': ['25', '25', '24', '28', '29', '39'],
                    'City': ['Bangalore', 'Chennai', 'Agra', 'Delhi', 'Pune', 'zoo']})
age1 = []
age2 = []
for index, row in Df1.iterrows():
    if Df2.name.isin([row['name']]).any():
        print(Df2.loc[Df2['name'] == row['name'], 'Age'].values)
        print(Df1.loc[Df2['name'] == row['name'], 'Age'].values)
        print(Df1.loc[Df2['name'] == row['name']])
The code works for the value Marc: this value is present in both dataframes, so it gets printed out. However, the code also prints Sam (who is only present in Df1) instead of Jake, who is present in both Df1 & Df2.
The output is something like this:
['24' '39']
['35']
   name  Age      City
2   Sam   35  Calcutta
['25']
['24']
   name  Age  City
0  Marc   24  Agra
Why is it giving output like this? It does not make any sense. Marc's age in Df2 is printed (which is correct), then Sam's age in Df1 is printed, then the row where Sam appears in Df1. After that, I don't know how to make sense of the rest.
Also, why is Marc's row printed second? I assumed that since Marc is the first value in Df1, it should be checked and printed first, and then Jake.