Replacing NaN values with group mean - pandas

I have a dataframe of countries, years, and many other features. There are many rows (years) for a single country:
country  year  population  ...
1        2000  5000
1        2001  NaN
1        2002  4800
2        2000
There are many NaN values in the dataframe.
I want to replace each NaN for a specific country, in every column, with that country's average for the column.
For example, for the NaN in the population column for country 1, year 2001, I want to use the average population of country 1 over all years: (5000 + 4800) / 2.
I am using the groupby().mean() method to find the means for each country, but I am running into the following difficulties:
1. Some means come out as NaN when I know for sure there is a value for them. Why is that?
2. How can I access specific values in the groupby result? In other words, how can I replace every NaN with its correct average?
Thanks a lot.

Using combine_first with groupby mean
df.combine_first(df.groupby('country').transform('mean'))
Or
df.fillna(df.groupby('country').transform('mean'))
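Here is a minimal, runnable sketch of the fillna approach on data shaped like the question's (column names assumed from the question). transform('mean') broadcasts each country's mean back to the original row positions, so the result aligns with df index-for-index. As for difficulty 1: a group mean comes out as NaN when every value in that group is NaN, or when the column has a non-numeric (object) dtype, e.g. numbers stored as strings.

import pandas as pd
import numpy as np

# Toy frame mirroring the question (column names assumed)
df = pd.DataFrame({
    'country':    [1, 1, 1, 2, 2],
    'year':       [2000, 2001, 2002, 2000, 2001],
    'population': [5000, np.nan, 4800, 7000, np.nan],
})

# Per-country means, broadcast back to the original rows
means = df.groupby('country')['population'].transform('mean')

# Replace each NaN with its country's mean
df['population'] = df['population'].fillna(means)
# country 1's NaN becomes (5000 + 4800) / 2 = 4900.0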

Related

Divide rows in two columns with Pandas

I am using Pandas.
For each row, regardless of the County, I would like to divide "AcresBurned" by "CrewsInvolved".
For each County, I would like to sum the total AcresBurned for that County and divide by the sum of the total CrewsInvolved for that County.
I just started coding and am not able to solve this. Please help. Thank you so much.
Counties  AcresBurned  CrewsInvolved
1         400          2
2         500          3
3         600          5
1         800          9
2         850          8
This is very simple with Pandas. You can create a new column with these operations:
df['Acres_per_Crew'] = df['AcresBurned'] / df['CrewsInvolved']
You can use a groupby clause to view the per-county sums of AcresBurned and CrewsInvolved:
df_gb = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum().reset_index()
df_gb.columns = ['Counties', 'AcresBurnedPerCounty', 'CrewsInvolvedPerCounty']
df = df.merge(df_gb, on='Counties')
Once you've done this, you can create a new column with a similar arithmetic operation to divide AcresBurnedPerCounty by CrewsInvolvedPerCounty, as in the sketch below.
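A short end-to-end sketch of all three steps on the sample data (the final column name Acres_per_Crew_County is chosen here for illustration):

import pandas as pd

df = pd.DataFrame({
    'Counties':      [1, 2, 3, 1, 2],
    'AcresBurned':   [400, 500, 600, 800, 850],
    'CrewsInvolved': [2, 3, 5, 9, 8],
})

# Per-row ratio, regardless of county
df['Acres_per_Crew'] = df['AcresBurned'] / df['CrewsInvolved']

# Per-county sums, merged back onto every row
df_gb = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum().reset_index()
df_gb.columns = ['Counties', 'AcresBurnedPerCounty', 'CrewsInvolvedPerCounty']
df = df.merge(df_gb, on='Counties')

# Per-county ratio
df['Acres_per_Crew_County'] = df['AcresBurnedPerCounty'] / df['CrewsInvolvedPerCounty']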

Fill nan's in dataframe after filtering column by names

Can anyone please tell me the right approach here to filter (and fill NaN values) based on another column's value? Thanks a lot.
Related link: How to fill dataframe's empty/nan cell with conditional column mean
df
ID  Name       Industry            Expenses
1   Treslam    Financial Services  734545
2   Rednimdox  Construction        nan
3   Lamtone    IT Services         567678
4   Stripfind  Financial Services  nan
5   Openjocon  Construction        8678957
6   Villadox   Construction        5675676
7   Sumzoomit  Construction        231244
8   Abcd       Construction        nan
9   Stripfind  Financial Services  nan
df_mean_expenses = df.groupby('Industry', as_index=False)['Expenses'].mean()
df_mean_expenses
   Industry            Expenses
0  Construction        554433.11
1  Financial Services  2362818.48
2  IT Services         149153.46
In order to replace the Construction NaN's in Expenses with the Construction row's mean (in df_mean_expenses), I tried two approaches:
1.
df.loc[df['Expenses'].isna(),['Expenses']][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. returns Error: Item wrong length 500 instead of 3!
2.
df['Expenses'][np.isnan(df['Expenses'])][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. this runs but does not add values to the df.
Expected output:
df
ID  Name       Industry            Expenses
1   Treslam    Financial Services  734545
2   Rednimdox  Construction        554433.11
3   Lamtone    IT Services         567678
4   Stripfind  Financial Services  nan
5   Openjocon  Construction        8678957
6   Villadox   Construction        5675676
7   Sumzoomit  Construction        231244
8   Abcd       Construction        554433.11
9   Stripfind  Financial Services  nan
Try with transform:
df_mean_expenses = df.groupby('Industry')['Expenses'].transform('mean')
df['Expenses'] = df['Expenses'].fillna(df_mean_expenses[df['Industry'] == 'Construction'])
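A minimal sketch of why this fills only the Construction NaN's: fillna aligns on the index, and restricting the transform result to Construction rows leaves the other industries' NaN's untouched.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Industry': ['Financial Services', 'Construction', 'IT Services',
                 'Financial Services', 'Construction'],
    'Expenses': [734545, np.nan, 567678, np.nan, 8678957],
})

# Industry mean broadcast to every row
mean_by_industry = df.groupby('Industry')['Expenses'].transform('mean')

# Fill only the Construction rows; the Financial Services NaN survives
df['Expenses'] = df['Expenses'].fillna(mean_by_industry[df['Industry'] == 'Construction'])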

How to create new columns using groupby based on logical expressions

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns: 'MT_Value', 'M_Value', and 'T_Data'. The first one holds the mean of the data grouped by year and month, which I accomplished by doing this:
data.groupby(['Year','Month']).mean()
But for M_Value I need the mean of only the values different from zero, and for T_Data I need the count of the values that are zero divided by the total number of values. I guess that for the last one I need to divide the number of zero values by the size of each group, but honestly I am a bit lost. I looked on Google and found mentions of transform, but I didn't understand it very well.
Thank you.
You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor != 0),
             T_Data=data.Valor.eq(0))
     .groupby(['Year', 'Month'])
     [['Valor', 'M_Value', 'T_Data']]
     .mean()
)
Explanation: assign creates new columns with the respective names.
data.Valor.where(data.Valor != 0) replaces 0 values with NaN, which are ignored when we call mean().
data.Valor.eq(0) returns True (1) where Valor is 0 and False (0) elsewhere, so taking mean() computes count(Valor == 0) / total_count.
Output:
                Valor    M_Value    T_Data
Year Month
1970 1       2.306452   6.500000  0.645161
     2       1.507143   4.688889  0.678571
     3       2.064516   7.111111  0.709677
     4      11.816667  13.634615  0.133333
     5       7.974194  11.236364  0.290323
...               ...        ...       ...
1997 10      3.745161   7.740000  0.516129
     11     11.626667  21.800000  0.466667
     12      0.564516   4.375000  0.870968
1998 1       2.000000  15.500000  0.870968
     2       1.545455   5.666667  0.727273

[331 rows x 3 columns]
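To see why the two helper columns work, here is a tiny demonstration of where and eq on a toy series (values chosen for illustration):

import pandas as pd

valor = pd.Series([5, 0, 3, 0])

valor.where(valor != 0)  # [5.0, NaN, 3.0, NaN] -> .mean() = 4.0, zeros excluded
valor.eq(0)              # [False, True, False, True] -> .mean() = 0.5, the share of zeros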

Pandas Duplicated returns some not duplicate values?

I am trying to remove duplicates from a dataset.
Before using df.drop_duplicates(), I run df[df.duplicated()] to check which values are treated as duplicates. Values that I don't consider to be duplicates are returned; see the example below. All columns are checked.
How do I get accurate duplicate results and drop the real duplicates?
city      price  year    manufacturer  cylinders  fuel    odometer
whistler  26880  2016.0  chrysler      NaN        gas     49000.0
whistler  17990  2010.0  toyota        NaN        hybrid  117000.0
whistler  15890  2010.0  audi          NaN        gas     188000.0
whistler   8800  2007.0  nissan        NaN        gas     163000.0
I encountered the same problem.
At first, it looks like
df.duplicated(subset='my_column_of_interest')
returns results which actually have unique values in the my_column_of_interest field.
This is not the case, though. The documentation shows that duplicated uses the keep parameter to mark all duplicates, all but the first occurrence, or all but the last. Its default value is 'first'.
This means that if you have a value present twice in this column,
df[df.duplicated(subset='my_column_of_interest')] will contain that value only once, since its first occurrence is not flagged as a duplicate. See the sketch below.
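A small sketch of how keep changes what duplicated flags; keep=False marks every occurrence, which is usually what you want when inspecting duplicates before dropping them:

import pandas as pd

df = pd.DataFrame({'my_column_of_interest': ['a', 'b', 'a', 'c']})

# keep='first' (default): only the second 'a' is flagged
df[df.duplicated(subset='my_column_of_interest')]

# keep=False: both 'a' rows are flagged, so you see every duplicate
df[df.duplicated(subset='my_column_of_interest', keep=False)]

# drop_duplicates keeps the first occurrence of each value by default
df.drop_duplicates(subset='my_column_of_interest')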

DAX - Reference measure in calculated column?

I have data like this
EmployeeID  Value
1           7
2           6
3           5
4           3
I would like to create a DAX calculated column (or do I need a measure?) that gives me, for each row, Value minus the AVG() of the selected rows.
So if the AVG() of the above 4 rows is 5.25, I would get results like this
EmployeeID  Value  Diff
1           7       1.75
2           6       0.75
3           5      -0.25
4           3      -1.75
Still learning DAX, I cannot figure out how to implement this.
Thanks
I figured this out with the help of some folks on MSDN forums.
This will only work as a measure because measures are selection aware while calculated columns are not.
The Average stored in a variable is critical. ALLSELECTED() gives you the current selection in a pivot table.
AVERAGEX does the row value - avg of selection.
Diff :=
VAR ptAVG = CALCULATE ( AVERAGE ( Employee[Value] ), ALLSELECTED () )
RETURN AVERAGEX ( Employee, Employee[Value] - ptAVG )
You can certainly do this with a calculated column. It's simply
Diff = TableName[Value] - AVERAGE(TableName[Value])
Note that this averages over all employees. If you want to average over only specific groups, then more work needs to be done.