Fill nan's in dataframe after filtering column by names - pandas

Can anyone please tell me the right approach here to filter (and fill NaNs in) one column based on another column's value? Thanks a lot.
Related link: How to fill dataframe's empty/nan cell with conditional column mean
df
ID Name Industry Expenses
1 Treslam Financial Services 734545
2 Rednimdox Construction nan
3 Lamtone IT Services 567678
4 Stripfind Financial Services nan
5 Openjocon Construction 8678957
6 Villadox Construction 5675676
7 Sumzoomit Construction 231244
8 Abcd Construction nan
9 Stripfind Financial Services nan
df_mean_expenses = df.groupby('Industry', as_index=False)['Expenses'].mean()
df_mean_expenses
Industry Expenses
0 Construction 554433.11
1 Financial Services 2362818.48
2 IT Services 149153.46
In order to replace the Construction Expenses NaNs with the Construction row's mean (from df_mean_expenses), I tried two approaches:
1.
df.loc[df['Expenses'].isna(),['Expenses']][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. returns Error: Item wrong length 500 instead of 3!
2.
df['Expenses'][np.isnan(df['Expenses'])][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. this runs but does not add values to the df.
Expected output:
df
ID Name Industry Expenses
1 Treslam Financial Services 734545
2 Rednimdox Construction 554433.11
3 Lamtone IT Services 567678
4 Stripfind Financial Services nan
5 Openjocon Construction 8678957
6 Villadox Construction 5675676
7 Sumzoomit Construction 231244
8 Abcd Construction 554433.11
9 Stripfind Financial Services nan

Try with transform, which returns the per-group mean aligned to the original index:
df_mean_expenses = df.groupby('Industry')['Expenses'].transform('mean')
df['Expenses'] = df['Expenses'].fillna(df_mean_expenses[df['Industry'] == 'Construction'])
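As a runnable sketch of the transform approach (sample data abbreviated from the question, with invented values for the shorter frame): fillna aligns on the index, so passing it only the Construction rows of the group-mean Series fills Construction NaNs and leaves other industries' NaNs untouched.

```python
import numpy as np
import pandas as pd

# Abbreviated sample modeled on the question's data
df = pd.DataFrame({
    'Industry': ['Financial Services', 'Construction', 'IT Services',
                 'Financial Services', 'Construction', 'Construction'],
    'Expenses': [734545, np.nan, 567678, np.nan, 8678957, np.nan],
})

# Per-row group mean, aligned to df's index (NaNs are skipped in the mean)
df_mean_expenses = df.groupby('Industry')['Expenses'].transform('mean')

# Fill only where the filtered Series has a value, i.e. Construction rows
df['Expenses'] = df['Expenses'].fillna(
    df_mean_expenses[df['Industry'] == 'Construction'])
```

The Financial Services NaN stays NaN, because the filtered Series has no entry at that index, matching the expected output above.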

Related

Create a new column based on another column in a dataframe

I have a df with multiple columns. One of my columns is extra_type. Now I want to create a new column based on the values of the extra_type column. For example:
extra_type
NaN
legbyes
wides
byes
Now I want to create a new column with 1 and 0: if extra_type is not equal to 'wides', then 1, else 0.
I tried like this:
df1['ball_faced'] = df1[df1['extra_type'].apply(lambda x: 1 if [df1['extra_type']!= 'wides'] else 0)]
It is not working this way. Any help on how to make this work is appreciated.
Expected output is like below:
extra_type ball_faced
NaN 1
legbyes 1
wides 0
byes 1
Note that there's no need to use apply() or a lambda as in the original question, since comparison of a pandas Series and a string value can be done in a vectorized manner as follows:
df1['ball_faced'] = df1.extra_type.ne('wides').astype(int)
Output:
extra_type ball_faced
0 NaN 1
1 legbyes 1
2 wides 0
3 byes 1
Here are links to docs for ne() and astype().
For some useful insights on when to use apply (and when not to), see this SO question and its answers. TL;DR from the accepted answer: "If you're not sure whether you should be using apply, you probably shouldn't."
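A quick self-contained check of the vectorized approach (sample data taken from the question; note that NaN != 'wides' evaluates True, so NaN rows get 1, matching the expected output):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'extra_type': [np.nan, 'legbyes', 'wides', 'byes']})

# NaN compares unequal to 'wides', so NaN rows map to 1
df1['ball_faced'] = df1['extra_type'].ne('wides').astype(int)
```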
df['ball_faced'] = df.extra_type.apply(lambda x: x != 'wides').astype(int)
extra_type ball_faced
0 NaN 1
1 legbyes 1
2 wides 0
3 byes 1

groupby does not apply effectively

Following is the ranking dataframe I am working on:
Q6 Q17
1 Consultant NaN
2 Other NaN
3 Data Scientist Java
4 Not employed Python
5 Data Analyst SQL
I want to:
count how many times each programming language occurs for 'Data Scientists' and record the frequency in a column 'counts'
sort the count in descending order
reset index and rename Q17 as Language
The following code does not group each Language.
ranking_data = ranking_data[ranking_data.Q6 == 'Data Scientist']
ranking_data_summary = ranking_data.copy().rename(columns = {'Q17':'Language'})
ranking_data_summary['counts'] = ranking_data_summary.groupby('Language')['Language'].transform('count')
ranking_data_summary.sort_values('counts',ascending = False, inplace = True)
ranking_data_summary.reset_index(inplace = True)
What am I doing wrong?
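One likely culprit is the NaN languages: groupby drops NaN keys by default, so rows with no Q17 answer end up with NaN counts. A hedged sketch of one way to get the desired summary, filtering those rows out first (sample data invented around the question's column names):

```python
import pandas as pd

# Invented sample in the shape of the question's data
ranking_data = pd.DataFrame({
    'Q6': ['Consultant', 'Other', 'Data Scientist', 'Not employed',
           'Data Analyst', 'Data Scientist', 'Data Scientist'],
    'Q17': [None, None, 'Java', 'Python', 'SQL', 'Java', 'Python'],
})

# Keep Data Scientists only, and drop rows with no language answer
ds = ranking_data[ranking_data.Q6 == 'Data Scientist'].dropna(subset=['Q17'])

# value_counts already sorts descending; rename the axis and reset the index
summary = (ds['Q17'].value_counts()
             .rename_axis('Language')
             .reset_index(name='counts'))
```

Since value_counts() sorts in descending order on its own, the separate sort_values step becomes unnecessary.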

Merging two dataframes on the same type column gives me wrong result

I have two dataframes, assume A and B, which have been created after reading the sheets of an Excel file and performing some basic functions. I need to merge right the two dataframes on a column named ID which has first been converted to astype(str) for both dataframes.
The ID column of the left Dataframe (A) is:
0 5815518813016
1 5835503994014
2 5835504934023
3 5845535359006
4 5865520960012
5 5865532845006
6 5875531550008
7 5885498289039
8 5885498289039_A2
9 5885498289039_A3
10 5885498289039_X2
11 5885498289039_X3
12 5885509768698
13 5885522349999
14 5895507791025
Name: ID, dtype: object
The ID column of the right Dataframe (B) is:
0 5835503994014
1 5845535359006
2 5835504934023
3 5815518813016
4 5885498289039_A1
5 5885498289039_A2
6 5885498289039_A3
7 5885498289039_X1
8 5885498289039_X2
9 5885498289039_X3
10 5885498289039
11 5865532845006
12 5875531550008
13 5865520960012
14 5885522349998
15 5895507791025
16 5885509768698
Name: ID, dtype: object
However, when I merge the two, the rest of the columns of the left (A) dataframe become "empty" (np.nan), except for the rows where the ID contains not only numbers but letters too. This is the pd.merge() I do:
A_B=A.merge(B[['ID','col_B']], left_on='ID', right_on='ID', how='right')
Do you have any ideas about what might be wrong? Your input is valuable.
Try turning all values in both columns into strings:
A['ID'] = A['ID'].astype(str)
B['ID'] = B['ID'].astype(str)
Generally, when a merge like this doesn't work, I would try to debug by printing out the unique values in each column to check if anything pops out (usually dtype issues).
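A minimal sketch of one common cause (the IDs here are invented): Excel often reads all-numeric IDs as floats, so astype(str) yields '101.0' on one side and '101' on the other, and no all-numeric row ever matches.

```python
import pandas as pd

# Excel read A's numeric IDs as floats, so astype(str) gives '101.0'
A = pd.DataFrame({'ID': [101.0, 102.0], 'col_A': ['x', 'y']})
A['ID'] = A['ID'].astype(str)  # -> '101.0', '102.0'

B = pd.DataFrame({'ID': ['101', '102', '103_A1'], 'col_B': [1, 2, 3]})

# '101.0' never equals '101', so col_A comes back all NaN
bad = A.merge(B[['ID', 'col_B']], on='ID', how='right')

# Normalising the numeric IDs through int fixes the match
A['ID'] = A['ID'].astype(float).astype(int).astype(str)  # -> '101', '102'
good = A.merge(B[['ID', 'col_B']], on='ID', how='right')
```

For a column that mixes letter-suffixed and numeric IDs, as in the question, the normalisation would need to be applied per value rather than with a column-wide cast.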

Python Pandas groupby and join

I am fairly new to Python pandas and cannot find the answer to my problem in any older posts.
I have a simple dataframe that looks something like this:
dfA = {'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915, ...],
       'seq': ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A', ...]}
Now I want to merge the 'seq' values within each group, where the difference between the next and previous value in 'stop' is equal to 1. When the difference is high, like between 5 and 1610, that is where the next cluster begins, and so on.
What I need is to write all values from each cluster into separate rows:
0 BBDAC # join 'stop' cluster 1-5
1 CABAC # join 'stop' cluster 1610-1614
2 A.... # join 'stop' cluster 2915 - ...
etc...
What I am getting with my current code is like:
True BDACABAC...
False BCA...
for the entire huge dataframe.
I understand the logic behind the way it merges, which is by meeting the condition I specified (not perfect, losing cluster edges), but I am running out of ideas on how to get it joined and split properly into clusters, rather than across all rows of the dataframe.
Please see my code below:
dfB = dfA.groupby((dfA.stop - dfA.stop.shift(1) == 1))['seq'].apply(lambda x: ''.join(x)).reset_index()
Please help.
P.S. I have also tried various combinations with diff() but that didn't help either. I am not sure if groupby is any good for this solution as well. Please advise!
dfC = dfA.groupby((dfA['stop'].diff(periods=1)))['seq'].apply(lambda x: ''.join(x)).reset_index()
This somehow split the dataframe into smaller, cluster-like chunks, but I do not understand the logic behind the way it did it, and I know the result makes no sense and is not what I intended to get.
I think you need to create a helper Series for grouping:
g = dfA['stop'].diff().ne(1).cumsum()
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print (dfC)
stop seq
0 1 BBDAC
1 2 CABAC
2 3 A
Details:
First get differences by diff:
print (dfA['stop'].diff())
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 1605.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1301.0
Name: stop, dtype: float64
Compare with ne (!=) to flag the first value of each group:
print (dfA['stop'].diff().ne(1))
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
Name: stop, dtype: bool
And last, create the groups by cumsum:
print (dfA['stop'].diff().ne(1).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
Name: stop, dtype: int32
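Putting the steps above together as one runnable sketch (sample data abbreviated from the question):

```python
import pandas as pd

dfA = pd.DataFrame({
    'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915],
    'seq':  ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A'],
})

# A new cluster starts wherever the gap to the previous 'stop' is not 1
g = dfA['stop'].diff().ne(1).cumsum()

# Join the 'seq' letters within each cluster
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
```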
I just figured it out.
I managed to round the values of 'stop' down to the nearest 100 and assigned the result as a new column.
Then my previous code is working.
Thank you so much for the quick answer though.
dfA['new_val'] = (dfA['stop'] / 100).astype(int) *100

Replacing NaN values with group mean

I have a dataframe made of countries, years and many other features. There are many years for a single country:
country year population..... etc.
1 2000 5000
1 2001 NaN
1 2002 4800
2 2000
Now there are many NaNs in the dataframe.
I want to replace each NaN corresponding to a specific country, in every column, with that country's average for the column.
So for example, for the NaN in the population column for country 1, year 2001, I want to use the average population for country 1 across all the years: (5000 + 4800) / 2.
Now I am using the groupby().mean() method to find the means for each country, but I am running into the following difficulties:
1. Some means are coming out as NaN when I know for sure there is a value for them. Why is that?
2. How can I access specific values in the groupby result? In other words, how can I replace every NaN with its correct average?
Thanks a lot.
Using combine_first with the groupby mean:
df.combine_first(df.groupby('country').transform('mean'))
Or
df.fillna(df.groupby('country').transform('mean'))
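As a runnable sketch of the second option (invented sample data): transform('mean') skips NaNs when computing each country's mean and aligns the result to every row, so fillna fills each gap with its own country's average.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': [1, 1, 1, 2, 2],
    'population': [5000, np.nan, 4800, 300, np.nan],
})

# Country 1 mean = (5000 + 4800) / 2 = 4900; country 2 mean = 300
df['population'] = df['population'].fillna(
    df.groupby('country')['population'].transform('mean'))
```

Note for difficulty 1 above: if every value for a country in some column is NaN, the group mean is itself NaN and the gap stays unfilled.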