pandas: fillna whole df with groupby

I have the following df with a lot more number columns. I now want to forward fill all the columns in the dataframe, grouped by id.
id  date  number  number2
1   2001  4       11
1   2002  4       45
1   2003  NaN     13
2   2001  7       NaN
2   2002  8       2
The result should look like this:
id  date  number  number2
1   2001  4       11
1   2002  4       45
1   2003  4       13
2   2001  7       NaN
2   2002  8       2
I tried the following command:
df= df.groupby("id").fillna(method="ffill", limit=2)
However, this raises a KeyError "isin". Filling just one column with the following command works just fine, but how can I efficiently forward fill the whole df grouped by id?
df["number"]= df.groupby("id")["number"].fillna(method="ffill", limit=2)

You can use:
df = df.groupby("id").apply(lambda x: x.ffill(limit=2))
print (df)
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
The following also works for me:
df.groupby("id").fillna(method="ffill", limit=2)
so I think it is necessary to upgrade pandas.

ffill can be used directly:
df.groupby('id').ffill(limit=2)
Out[423]:
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
#isin
#df.loc[:,df.columns.isin([''])]=df.loc[:,df.columns.isin([''])].groupby('id').ffill(2)
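In recent pandas (2.x) fillna(method="ffill") is deprecated, so a version-proof way to do the same thing is groupby followed by ffill with a limit. A minimal sketch on the question's data:

```python
import numpy as np
import pandas as pd

# The question's data: NaNs should be filled forward within each id group only
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'date': [2001, 2002, 2003, 2001, 2002],
    'number': [4, 4, np.nan, 7, 8],
    'number2': [11, 45, 13, np.nan, 2],
})

cols = ['number', 'number2']
# ffill within each group; limit=2 caps how far a value is carried forward
df[cols] = df.groupby('id')[cols].ffill(limit=2)
```

The NaN in number (id 1, 2003) is filled with 4, while the NaN in number2 (id 2, 2001) stays NaN because there is no earlier value inside that group to carry forward.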

Related

Create and calculate new rows based on other rows condition

I have the dataframe as follow:
dataframe generator:
df = pd.DataFrame({
    'year': [2000, 2001, 2002]*3,
    'id': ['a']*3 + ['b']*3 + ['c']*3,
    'othernulcol': ['xyz']*3 + [np.nan]*4 + ['tyu']*2,
    'val': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9]
})
data looks like:
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
I want to create 3 new rows for 2000 to 2002 that are the sum of the rows with id = a and id = b in the same year. othernulcol is just another column in the dataframe; when creating the new rows, just set that column to np.nan.
Expected output:
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
Thank you for reading
Filter values by categories and convert year to the index to align the same years from the other DataFrame, sum the values with DataFrame.add, and append the result to the original DataFrame with concat:
cols = ['id','val']
df1 = df[df['id'].eq('a')].set_index('year')[cols]
df2 = df[df['id'].eq('b')].set_index('year')[cols]
df = pd.concat([df, df1.add(df2).reset_index()], ignore_index=True)
print (df)
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
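The key step above is that df1.add(df2) aligns rows on the shared year index before summing; since no fill_value is given, NaN + 4.0 stays NaN (which is what preserves the NaN for 2000), and the string 'a' + 'b' concatenates to 'ab'. A minimal sketch of just that alignment, on a reduced version of the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002] * 2,
    'id': ['a'] * 3 + ['b'] * 3,
    'val': [np.nan, 2, 3, 4, 5, 6],
})

cols = ['id', 'val']
df1 = df[df['id'].eq('a')].set_index('year')[cols]
df2 = df[df['id'].eq('b')].set_index('year')[cols]

# add() matches rows by the 'year' index, not by position
res = df1.add(df2)
```

res now holds id 'ab' for every year and val NaN (2000), 7.0 (2001), 9.0 (2002).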
Another solution could be as follows:
Select rows from df with df.id.isin(['a','b']) (see Series.isin) and apply df.groupby to year.
For the aggregation, use sum for column id. For column val use a lambda function to apply Series.sum, which allows skipna=False.
Finally, use pd.concat to add the result to the original df, ignoring the index.
out = pd.concat([df,
                 df[df.id.isin(['a', 'b'])]
                   .groupby('year', as_index=False)
                   .agg({'id': 'sum',
                         'val': lambda x: x.sum(skipna=False)})],
                ignore_index=True)
print(out)
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
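The skipna=False in the val lambda is what keeps the 2000 sum as NaN; the plain 'sum' aggregation would silently drop the NaN and return 4.0 instead. A small sketch of the difference:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 4.0])

default_sum = s.sum()             # NaN is skipped, result is 4.0
strict_sum = s.sum(skipna=False)  # NaN propagates, result is NaN
```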

How can I write Excel's IF(logical_test, [value_if_true], [value_if_false]), e.g. IF(columnA_row1 = columnA_row2, columnB_row2, ""), in Python?

I would like to translate some Excel logic into Python (pandas).
I have filtered with df.loc[df.Activity_Mailbox.isnull()]; now the NaN values must be calculated using
IF(columnA_row1 = columnA_row2, columnB_row2, "")
This formula is according to Excel.
Please provide some demo data next time, like in your other question :-)
If I understand you correctly, your data looks like:
df = pd.DataFrame({"A": [1, 2, 3, np.nan, 5, np.nan],
                   "B": [10, 11, 12, 13, 14, 15]})
df
A B
0 1.0 10
1 2.0 11
2 3.0 12
3 NaN 13
4 5.0 14
5 NaN 15
And now you want to fill the NaNs with values from the other column. This can easily be done with:
df["A"] = df["A"].fillna(df["B"])
Output:
df
A B
0 1.0 10
1 2.0 11
2 3.0 12
3 13.0 13
4 5.0 14
5 15.0 15

pandas DataFrame: replace nan values with median of columns for each period

I've got a pandas DataFrame (panel data) filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the median of the columns for each year (cross-sectional medians per year)?
id  year  A          B          C
1   2000  3.539.101  265.152    .0683649
1   2001  3.539.101  2.485.833  NaN
1   2002  NaN        2.939.903  NaN
1   2003  3.733.545  3.021.591  -.0257413
2   2000  3.960.184  NaN        .9781774
2   2001  3.960.184  9.418.228  .4855057
2   2002  3.960.184  9.880.249  .049056
2   2003  3.960.184  NaN        .2310434
3   2000  NaN        1.287.206  -.0373083
3   2001  NaN        1.582.817  .1202868
3   2002  4.724.285  1.279.348  -.1824576
3   2003  4.724.285  1.213.678  -.0513311
Try this: df.fillna(df.median()) fills with the overall column medians. For medians per year (cross-sectional), group by year first:
df[['A','B','C']] = df.groupby('year')[['A','B','C']].transform(lambda s: s.fillna(s.median()))
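A runnable sketch of the per-year median fill with groupby + transform, on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001, 2001],
    'A':    [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# transform keeps the original shape, so each NaN gets its own year's median
df['A'] = df.groupby('year')['A'].transform(lambda s: s.fillna(s.median()))
```

The 2000 NaN becomes 2.0 (median of 1 and 3) and the 2001 NaN becomes 15.0 (median of 10 and 20).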

Transpose and rearrange Dataframe pandas

I need some help rearranging a dataframe. This is what the data looks like:
Year item 1 item 2 item 3
2001 22 54 33
2002 77 54 33
2003 22 NaN 33
2004 22 54 NaN
The layout I want is:
Items Year Value
item 1 2001 22
item 1 2002 77
...
And so on...
Use melt if it is not necessary to remove NaNs:
df = df.melt('Year', var_name='Items', value_name='Value')
print (df)
Year Items Value
0 2001 item 1 22.0
1 2002 item 1 77.0
2 2003 item 1 22.0
3 2004 item 1 22.0
4 2001 item 2 54.0
5 2002 item 2 54.0
6 2003 item 2 NaN
7 2004 item 2 54.0
8 2001 item 3 33.0
9 2002 item 3 33.0
10 2003 item 3 33.0
11 2004 item 3 NaN
To remove NaNs, add dropna:
df = df.melt('Year', var_name='Items', value_name='Value').dropna(subset=['Value'])
print (df)
Year Items Value
0 2001 item 1 22.0
1 2002 item 1 77.0
2 2003 item 1 22.0
3 2004 item 1 22.0
4 2001 item 2 54.0
5 2002 item 2 54.0
7 2004 item 2 54.0
8 2001 item 3 33.0
9 2002 item 3 33.0
10 2003 item 3 33.0
For a slightly different ordering, with NaNs removed, it is possible to use set_index + stack + rename_axis + reset_index:
df = df.set_index('Year').stack().rename_axis(['Year','Items']).reset_index(name='Value')
print (df)
Year Items Value
0 2001 item 1 22.0
1 2001 item 2 54.0
2 2001 item 3 33.0
3 2002 item 1 77.0
4 2002 item 2 54.0
5 2002 item 3 33.0
6 2003 item 1 22.0
7 2003 item 3 33.0
8 2004 item 1 22.0
9 2004 item 2 54.0
Using comprehensions and pd.DataFrame.itertuples
pd.DataFrame(
    [[y, i, v]
     for y, *vals in df.itertuples(index=False)
     for i, v in zip(df.columns[1:], vals)
     if pd.notnull(v)],
    columns=['Year', 'Item', 'Value']
)
Year Item Value
0 2001 item 1 22.0
1 2001 item 2 54.0
2 2001 item 3 33.0
3 2002 item 1 77.0
4 2002 item 2 54.0
5 2002 item 3 33.0
6 2003 item 1 22.0
7 2003 item 3 33.0
8 2004 item 1 22.0
9 2004 item 2 54.0
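The year-major ordering of the stack and itertuples solutions can also be reproduced from melt by sorting afterwards; a small sketch on a hypothetical two-year frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2001, 2002],
                   'item 1': [22, 77],
                   'item 2': [54, np.nan]})

# melt to long form, drop NaNs, then sort so rows come out year by year
out = (df.melt('Year', var_name='Items', value_name='Value')
         .dropna(subset=['Value'])
         .sort_values(['Year', 'Items'], ignore_index=True))
```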

pandas multiindex selection with ranges

I have a pandas frame like
y m A B
1990 1 3.4 5
2 4 4.9
...
1990 12 4.0 4.5
...
2000 1 2.3 8.1
2 3.7 5.0
...
2000 12 2.4 9.1
I would like to select 2-12 from the second index (m) and years 1991-2000. I don't seem to get the multiindex slicing correct. E.g. I tried
idx = pd.IndexSlice
dfa = df.loc[idx[1:,1:],:]
but that does not seem to slice on the first index. Any suggestions on an elegant solution?
Cheers, Mike
Without sample code to reproduce your df it is difficult to guess, but if your df is similar to:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(""" y m A B
1990 1 3.4 5
1990 2 4 4.9
1990 12 4.0 4.5
2000 1 2.3 8.1
2000 2 3.7 5.0
2000 12 2.4 9.1"""), sep=r'\s+')
df
y m A B
0 1990 1 3.4 5.0
1 1990 2 4.0 4.9
2 1990 12 4.0 4.5
3 2000 1 2.3 8.1
4 2000 2 3.7 5.0
5 2000 12 2.4 9.1
Then this code will extract what you need:
print(df.loc[df['y'].isin(range(1991, 2001)) & df['m'].isin(range(2, 13))])
      y   m    A    B
4  2000   2  3.7  5.0
5  2000  12  2.4  9.1
If however your df is indexed by y and m, then this will do the same:
df.set_index(['y','m'], inplace=True)
years = df.index.get_level_values('y').isin(range(1991, 2001))
months = df.index.get_level_values('m').isin(range(2, 13))
df.loc[years & months]
          A    B
y    m
2000 2   3.7  5.0
     12  2.4  9.1
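For completeness, the pd.IndexSlice approach from the question does work once y and m form a sorted MultiIndex; label slices are inclusive on both ends, so 2:12 includes month 12. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'y': [1990, 1990, 1990, 2000, 2000, 2000],
    'm': [1, 2, 12, 1, 2, 12],
    'A': [3.4, 4.0, 4.0, 2.3, 3.7, 2.4],
    'B': [5.0, 4.9, 4.5, 8.1, 5.0, 9.1],
}).set_index(['y', 'm']).sort_index()  # label slicing needs a sorted index

idx = pd.IndexSlice
# years 1991-2000, months 2-12, both endpoints included
dfa = df.loc[idx[1991:2000, 2:12], :]
```

On this sample, only the 2000 rows with m = 2 and m = 12 survive.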