I have the following DataFrame:
import pandas as pd
# create simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
#display(data_pandas)
data_pandas
This returns the following DataFrame:
Age Location Name
0 24 New York John
1 13 Paris Anna
2 53 Berlin Peter
3 33 London Linda
I then do this:
olderThan30 = data_pandas[data_pandas > 30]
olderThan30
And it returns the following:
Age Location Name
0 NaN New York John
1 NaN Paris Anna
2 53.0 Berlin Peter
3 33.0 London Linda
What I would like to return is only those that have the Age column greater than 30. Something like this:
Age Location Name
2 53.0 Berlin Peter
3 33.0 London Linda
How do I do that?
You need to pass a boolean condition on the column as the mask:
In [104]:
data_pandas[data_pandas['Age'] > 30]
Out[104]:
Age Location Name
2 53 Berlin Peter
3 33 London Linda
What you did was compare the entire DataFrame against 30:
In [105]:
data_pandas > 30
Out[105]:
Age Location Name
0 False True True
1 False True True
2 True True True
3 True True True
This masks cell by cell across the entire DataFrame, which is why you get NaN in the first two rows of Age.
Masking just the column of interest:
In [106]:
data_pandas['Age'] > 30
Out[106]:
0 False
1 False
2 True
3 True
Name: Age, dtype: bool
when passed as a mask to the DataFrame, selects whole rows.
As @JonClements has suggested, you may feel more comfortable using query:
In [110]:
data_pandas.query('Age > 30')
Out[110]:
Age Location Name
2 53 Berlin Peter
3 33 London Linda
query has a dependency on the numexpr library, but in my experience that is normally installed alongside pandas.
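Both approaches extend naturally to multiple conditions. A small sketch on the same data, combining masks with `&` (each condition needs its own parentheses):

```python
import pandas as pd

# Same example data as above
data_pandas = pd.DataFrame({
    'Name': ["John", "Anna", "Peter", "Linda"],
    'Location': ["New York", "Paris", "Berlin", "London"],
    'Age': [24, 13, 53, 33],
})

# Combine boolean masks with & / |; each condition needs parentheses
mask = (data_pandas['Age'] > 30) & (data_pandas['Location'] != 'Berlin')
print(data_pandas[mask])  # only Linda

# The equivalent query expression
print(data_pandas.query("Age > 30 and Location != 'Berlin'"))
```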
I have not used pandas explode before. I get the gist of pd.explode, but for value lists where selected columns have nested lists, I heard that pd.Series.explode is useful. However, I keep getting: "KeyError: "None of ['city'] are in the columns"". Yet 'city' is defined in the keys:
keys = ["city", "temp"]
values = [["chicago","london","berlin"], [[32,30,28],[39,40,25],[33,34,35]]]
df = pd.DataFrame({"keys":keys,"values":values})
df2 = df.set_index(['city']).apply(pd.Series.explode).reset_index()
desired output is:
city temp
chicago 32
chicago 30
chicago 28
etc.
I would appreciate an expert weighing in as to why this throws an error, and a fix, thank you.
The problem comes from how you define df:
df = pd.DataFrame({"keys":keys,"values":values})
This actually gives you the following dataframe:
keys values
0 city [chicago, london, berlin]
1 temp [[32, 30, 28], [39, 40, 25], [33, 34, 35]]
You probably meant:
df = pd.DataFrame(dict(zip(keys, values)))
Which gives you:
city temp
0 chicago [32, 30, 28]
1 london [39, 40, 25]
2 berlin [33, 34, 35]
You can then use explode:
print(df.explode('temp'))
Output:
city temp
0 chicago 32
0 chicago 30
0 chicago 28
1 london 39
1 london 40
1 london 25
2 berlin 33
2 berlin 34
2 berlin 35
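For completeness: if more than one column holds equal-length lists, newer pandas (1.3+) can explode them together, while the `set_index` / `apply(pd.Series.explode)` pattern from the question covers older versions. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical frame with two list-valued columns of equal per-row length
df = pd.DataFrame({
    "city": ["chicago", "london"],
    "temp": [[32, 30], [39, 40]],
    "humidity": [[60, 55], [70, 65]],
})

# pandas >= 1.3: explode several columns at once
out = df.explode(["temp", "humidity"], ignore_index=True)
print(out)

# Older pandas: move the scalar column into the index first
out2 = df.set_index("city").apply(pd.Series.explode).reset_index()
print(out2)
```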
I have two dataframes as follows:
df1 =
Index Name Age
0 Bob1 20
1 Bob2 21
2 Bob3 22
The second dataframe is as follows -
df2 =
Index Country Name
0 US Bob1
1 UK Bob123
2 US Bob234
3 Canada Bob2
4 Canada Bob987
5 US Bob3
6 UK Mary1
7 UK Mary2
8 UK Mary3
9 Canada Mary65
I would like to compare the names from df1 to the countries in df2 and create a new dataframe as follows:
Index Country Name Age
0 US Bob1 20
1 Canada Bob2 21
2 US Bob3 22
Thank you.
Using merge() should solve the problem.
df3 = pd.merge(df1, df2, on='Name')
Outcome:
import pandas as pd
df1 = pd.DataFrame({"Name": ["Bob1", "Bob2", "Bob3"], "Age": [20, 21, 22]})
df2 = pd.DataFrame({"Country": ["US", "UK", "US", "Canada", "Canada", "US", "UK", "UK", "UK", "Canada"],
                    "Name": ["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3", "Mary1", "Mary2", "Mary3", "Mary65"]})
df3 = pd.merge(df1, df2, on='Name')
df3
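Note that merge defaults to an inner join, so names missing from either frame drop out silently. If you want to keep every row of df1 and see which names found a match, `how='left'` plus `indicator=True` helps; a sketch with cut-down hypothetical data:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob1", "Bob2", "Bob3"], "Age": [20, 21, 22]})
df2 = pd.DataFrame({"Country": ["US", "Canada"], "Name": ["Bob1", "Bob2"]})

# Left join keeps all of df1; _merge shows where each row came from
out = pd.merge(df1, df2, on="Name", how="left", indicator=True)
print(out)
# Bob3 has no match, so its Country is NaN and _merge is "left_only"
```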
I just used the pandas qcut function to create a decile ranking. How do I look at the bounds of each ranking, i.e. which numbers fall in the range of rank 1 or 2 or 3, etc.?
I hope the following Python code with two short examples can help you. The second example uses the isin method.
import numpy as np
import pandas as pd
data = {'Name': ['Mike', 'Anton', 'Simon', 'Amy',
                 'Claudia', 'Peter', 'David', 'Tom'],
        'Score': [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(data, columns=['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10, labels=False)
print(df)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
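To answer the original question about the bounds directly: qcut can also return the bin edges via `retbins=True`, or you can keep the default Interval labels instead of integer codes. A sketch using the same scores:

```python
import pandas as pd

scores = pd.Series([42, 63, 75, 97, 61, 30, 80, 13])

# retbins=True additionally returns the 11 edges of the 10 deciles
ranks, bins = pd.qcut(scores, 10, labels=False, retbins=True)
for k in range(10):
    print(f"decile {k}: ({bins[k]:.2f}, {bins[k+1]:.2f}]")

# Or keep the default labels, which are the intervals themselves
intervals = pd.qcut(scores, 10)
print(intervals.cat.categories)
```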
I have made an ad hoc example that you can run, to show you a dataframe similar to df3 that I have to use:
people1 = [['Alex',10],['Bob',12],['Clarke',13],['NaN',],['NaN',],['NaN',]]
people2 = [['NaN',],['NaN',],['NaN',],['Mark',20],['Jane',22],['Jack',23]]
df1 = pd.DataFrame(people1,columns=['Name','Age'])
df2 = pd.DataFrame(people2,columns=['Name','Age'])
people_list=[df1, df2]
df3 = pd.concat((people_list[0]['Name'], people_list[1]['Name']), axis=1)
df3
How would I modify df3 to get rid of the NaN values and put the two columns next to each other? (I don't care about keeping the indices; I just want a clean dataframe with the two columns side by side.)
You can drop the NaN values first:
df3 = pd.concat([df1.dropna(), df2.dropna()])
Output:
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
3 Mark 20.0
4 Jane 22.0
5 Jack 23.0
Or, if you want to concat side by side:
df3 = pd.concat([df1.dropna().reset_index(drop=True), df2.dropna().reset_index(drop=True)], axis=1)
output:
Name Age Name Age
0 Alex 10.0 Mark 20.0
1 Bob 12.0 Jane 22.0
2 Clarke 13.0 Jack 23.0
If you just want to concat the Name column side by side:
df3 = pd.concat([df1.dropna().reset_index(drop=True)['Name'], df2.dropna().reset_index(drop=True)['Name']], axis=1)
output:
Name Name
0 Alex Mark
1 Bob Jane
2 Clarke Jack
If you want to modify only df3, it can be done via iloc and dropna:
df3 = pd.concat([df3.iloc[:, 0].dropna().reset_index(drop=True), df3.iloc[:, 1].dropna().reset_index(drop=True)], axis=1)
Output:
Name Name
0 Alex Mark
1 Bob Jane
2 Clarke Jack
people1 = [['Alex',10],['Bob',12],['Clarke',13],['NaN',],['NaN',],['NaN',]]
people2 = [['NaN',],['NaN',],['NaN',],['Mark',20],['Jane',22],['Jack',23]]
df1 = pd.DataFrame(people1,columns=['Name','Age']).dropna()
df2 = pd.DataFrame(people2,columns=['Name','Age']).dropna()
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
people_list=[df1, df2]
df3 = pd.concat((people_list[0]['Name'], people_list[1]['Name']), axis=1)
print(df3)
This will help you concatenate the two DataFrames.
If I have understood correctly what you mean, this is a possible solution:
people1 = [['Alex',10],['Bob',12],['Clarke',13],['NaN',],['NaN',],['NaN',]]
people2 = [['NaN',],['NaN',],['NaN',],['Mark',20],['Jane',22],['Jack',23]]
df1 = pd.DataFrame(people1,columns=['Name1','Age']).dropna()
df2 = pd.DataFrame(people2,columns=['Name2','Age']).dropna().reset_index()
people_list=[df1, df2]
df3 = pd.concat((people_list[0]['Name1'], people_list[1]['Name2']), axis=1)
print(df3)
Name1 Name2
0 Alex Mark
1 Bob Jane
2 Clarke Jack
If you already have that dataframe:
count = df3.Name2.isna().sum()
df3.loc[:, 'Name2'] = df3.Name2.shift(-count)
df3 = df3.dropna()
print(df3)
Name1 Name2
0 Alex Mark
1 Bob Jane
2 Clarke Jack
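All of the answers above boil down to dropping the NaNs per column and resetting the index. Assuming the placeholders are real NaN values (not the string 'NaN' as in the ad hoc example), the same cleanup can be written in one pass with apply:

```python
import pandas as pd
import numpy as np

# A frame shaped like df3, with real NaN padding (hypothetical names)
df3 = pd.DataFrame({
    "Name1": ["Alex", "Bob", "Clarke", np.nan, np.nan, np.nan],
    "Name2": [np.nan, np.nan, np.nan, "Mark", "Jane", "Jack"],
})

# Drop NaNs column by column and re-align the survivors at index 0
clean = df3.apply(lambda col: col.dropna().reset_index(drop=True))
print(clean)
```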
How to merge 2 cells in the Pandas dataframe when one of the cells of the other column is empty
lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
['', '', 26], ['juli', 'williams', 22]]
df = pd.DataFrame(lst,columns=['FName','LName','Age'],dtype=float)
In [4]:df
Out[4]:
FName LName Age
0 tom reacher 25.0
1 krish pete 30.0
2 26.0
3 juli williams 22.0
The output which I want is:
In [6]:df
Out[6]:
FName LName Age
0 tom reacher 25
1 krish pete 30,26
2 juli williams 22
First find the empty cells in column col1, then merge them with the other column col2 and write the result back:
idx = df[df[col1] == ""].index  # assuming "empty" means the empty string ""
df.loc[idx, col1] = df.loc[idx, col2] + df.loc[idx, col1]
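Applied to the question's frame, one concrete reading of "merge" here is: append the empty row's Age to the row directly above it, then drop the empty row. A sketch (the one-row-above assumption is mine, based on the desired output):

```python
import pandas as pd

lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
       ['', '', 26], ['juli', 'williams', 22]]
df = pd.DataFrame(lst, columns=['FName', 'LName', 'Age'])
df['Age'] = df['Age'].astype(str)

# Rows with an empty FName belong to the row directly above them
idx = df.index[df['FName'] == '']
for i in idx:
    df.loc[i - 1, 'Age'] += ',' + df.loc[i, 'Age']

df = df.drop(idx).reset_index(drop=True)
print(df)
```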
If the empty cells are always empty strings in both columns together, it is possible to replace them with missing values (NaN) and forward fill, so the ages can then be aggregated with a join:
import numpy as np

df[['FName','LName']] = df[['FName','LName']].replace('', np.nan).ffill()
print (df[['FName','LName']])
FName LName
0 tom reacher
1 krish pete
2 krish pete
3 juli williams
df['Age'] = df['Age'].astype(int).astype(str)
df = df.groupby(['FName','LName'])['Age'].apply(','.join).reset_index()
print (df)
FName LName Age
0 juli williams 22
1 krish pete 30,26
2 tom reacher 25
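One refinement: groupby sorts the group keys by default, which is why juli comes first in the output above. Passing `sort=False` keeps the groups in their original row order, matching the desired output:

```python
import pandas as pd
import numpy as np

lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
       ['', '', 26], ['juli', 'williams', 22]]
df = pd.DataFrame(lst, columns=['FName', 'LName', 'Age'])

# Replace empty strings with NaN and carry the names down
df[['FName', 'LName']] = df[['FName', 'LName']].replace('', np.nan).ffill()
df['Age'] = df['Age'].astype(int).astype(str)

# sort=False keeps the groups in order of first appearance
out = (df.groupby(['FName', 'LName'], sort=False)['Age']
         .apply(','.join).reset_index())
print(out)
```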