I have a dataframe as follows:
dataframe generator:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002] * 3,
    'id': ['a'] * 3 + ['b'] * 3 + ['c'] * 3,
    'othernulcol': ['xyz'] * 3 + [np.nan] * 4 + ['tyu'] * 2,
    'val': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9]
})
data looks like:
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
I want to create 3 new rows, one for each year from 2000 to 2002, holding the sum of the rows with id = a and id = b in the same year. othernulcol is just another column in the dataframe; when creating the new rows, simply set such columns to np.nan.
Expected output:
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
Thank you for reading
Filter the rows by category and convert year to the index so the same years align, sum the values with DataFrame.add, and append the result to the original DataFrame with concat:
cols = ['id','val']
df1 = df[df['id'].eq('a')].set_index('year')[cols]
df2 = df[df['id'].eq('b')].set_index('year')[cols]
df = pd.concat([df, df1.add(df2).reset_index()], ignore_index=True)
print(df)
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
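If you would rather treat the missing 2000 value for id a as zero instead of propagating NaN, DataFrame.add accepts a fill_value argument. A small variation on the solution above (note this differs from the expected output, which keeps NaN):
df3 = df1.add(df2, fill_value=0).reset_index()
# val for year 2000 is now 4.0 (NaN treated as 0, then 0 + 4) instead of NaN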
Another solution could be as follows:
Select the rows from df with df.id.isin(['a','b']) (see Series.isin) and apply df.groupby on year.
For the aggregation, use 'sum' for column id (summing the strings concatenates them into 'ab'). For column val, use a lambda function to apply Series.sum, which allows skipna=False.
Finally, use pd.concat to add the result to the original df, ignoring the index.
out = pd.concat([df,
                 df[df.id.isin(['a', 'b'])]
                    .groupby('year', as_index=False)
                    .agg({'id': 'sum',
                          'val': lambda x: x.sum(skipna=False)})],
                ignore_index=True)
print(out)
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
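The same aggregation can also be written with pandas' named-aggregation syntax, if you prefer keyword-style agg (equivalent result, purely stylistic):
out = pd.concat([df,
                 df[df.id.isin(['a', 'b'])]
                    .groupby('year', as_index=False)
                    .agg(id=('id', 'sum'),
                         val=('val', lambda x: x.sum(skipna=False)))],
                ignore_index=True)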
I've got a pandas DataFrame (panel data) filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the median of the column for each year (cross-sectional medians per year)?
id  year  A          B          C
1   2000  3.539.101  265.152    .0683649
1   2001  3.539.101  2.485.833  NaN
1   2002  NaN        2.939.903  NaN
1   2003  3.733.545  3.021.591  -.0257413
2   2000  3.960.184  NaN        .9781774
2   2001  3.960.184  9.418.228  .4855057
2   2002  3.960.184  9.880.249  .049056
2   2003  3.960.184  NaN        .2310434
3   2000  NaN        1.287.206  -.0373083
3   2001  NaN        1.582.817  .1202868
3   2002  4.724.285  1.279.348  -.1824576
3   2003  4.724.285  1.213.678  -.0513311
Try this (assuming year is a regular column): df.fillna(df.groupby('year').transform('median')). It fills each NaN with the median of its column within the same year; a plain df.fillna(df.median()) would use the overall column medians instead of per-year ones.
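A minimal runnable sketch of the per-year fill on a small made-up frame (the values here are hypothetical, not the ones from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2, 2],
    'year': [2000, 2001, 2000, 2001],
    'A':    [1.0, np.nan, 3.0, 5.0],
})

# transform('median') broadcasts each year's median back to that year's rows,
# so fillna replaces every NaN with the median of its own year
filled = df.fillna(df.groupby('year').transform('median'))
print(filled)  # the NaN in year 2001 becomes 5.0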
Now I have a dataframe like below (original dataframe):
Equipment   A   B   C
1          10  10  10
1          11  11  11
2          12  12  12
2          13  13  13
3          14  14  14
3          15  15  15
And I want to transform the dataframe like below (transformed dataframe):
 1           2           3
 A   B   C   A   B   C   A   B   C
10  10  10  12  12  12  14  14  14
11  11  11  13  13  13  15  15  15
How can I make such a groupby transformation with a two-level header in pandas?
Additionally, I want to use the transformed dataframe to generate a box plot, where the whole figure is divided into three parts (i.e. 1, 2, 3) and each part contains three box plots (i.e. A, B, C). Can I use the transformed dataframe as-is, without any further processing? Or can the box plot be produced from the original dataframe alone?
Thank you so much.
Try:
g = df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.reset_index(drop=True).T).T
g:
Equipment   1           2           3
            A   B   C   A   B   C   A   B   C
0          10  10  10  12  12  12  14  14  14
1          11  11  11  13  13  13  15  15  15
Explanation:
grp = df.groupby('Equipment')[df.columns[1:]]
grp.apply(print)
A B C
0 10 10 10
1 11 11 11
A B C
2 12 12 12
3 13 13 13
A B C
4 14 14 14
5 15 15 15
You can see the indexes 0 1, 2 3, 4 5 for the equipment groups (1, 2, 3).
That's why reset_index is used: to make the index 0 1 within each group.
If you do it without the reset index:
df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.T)
0 1 2 3 4 5
Equipment
1 A 10.0 11.0 NaN NaN NaN NaN
B 10.0 11.0 NaN NaN NaN NaN
C 10.0 11.0 NaN NaN NaN NaN
2 A NaN NaN 12.0 13.0 NaN NaN
B NaN NaN 12.0 13.0 NaN NaN
C NaN NaN 12.0 13.0 NaN NaN
3 A NaN NaN NaN NaN 14.0 15.0
B NaN NaN NaN NaN 14.0 15.0
C NaN NaN NaN NaN 14.0 15.0
See the values in columns (2, 3) and (4, 5); we want them combined into columns (0, 1) only, which is why the index is reset with drop=True (the trailing .T then flips this into the two-level-header layout shown above):
0 1
Equipment
1 A 10 11
B 10 11
C 10 11
2 A 12 13
B 12 13
C 12 13
3 A 14 15
B 14 15
C 14 15
You can play with the code to understand more deeply what's happening inside.
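As for the box-plot part of the question: you don't need the two-level frame for plotting at all; the grouped layout can be drawn straight from the original dataframe. A minimal sketch, assuming matplotlib is installed (the frame is rebuilt from the question's table):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Equipment': [1, 1, 2, 2, 3, 3],
    'A': [10, 11, 12, 13, 14, 15],
    'B': [10, 11, 12, 13, 14, 15],
    'C': [10, 11, 12, 13, 14, 15],
})

# One subplot per equipment value (1, 2, 3), each holding
# the box plots for columns A, B and C
fig, axes = plt.subplots(1, 3, sharey=True)
for ax, (name, group) in zip(axes, df.groupby('Equipment')):
    group[['A', 'B', 'C']].boxplot(ax=ax)
    ax.set_title(f'Equipment {name}')
plt.show()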
I have one DataFrame:
import pandas as pd
data = {'a': [1,2,3,None,4,None,2,4,5,None],'b':[6,6,6,'NaN',4,'NaN',11,11,11,'NaN']}
df = pd.DataFrame(data)
condition = (df['a']>2) | (df['a'] == None)
print(df)
a b
0 1.0 6
1 2.0 6
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
6 2.0 11
7 4.0 11
8 5.0 11
9 NaN NaN
Here, I want to keep the rows where the condition comes out True, and also keep the rows where the value is None/NaN.
Expected output is :
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
Thanks in advance.
You can use isna() as another | (or) condition (note: see @ALlolz's comment, you shouldn't compare a Series with np.nan):
condition = (df['a']>2) | (df['a'].isna())
df[condition]
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
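A quick demonstration of why the original comparison fails: element-wise == against None (or np.nan) never returns True, so isna() is the reliable test.
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
print(s == None)   # False, False -- a missing value never compares equal
print(s.isna())    # False, True  -- detects the missing value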
I have the following df, with many more number columns. I now want to apply a forward fill to all the columns in the dataframe, grouped by id.
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 NaN 13
2 2001 7 NaN
2 2002 8 2
The result should look like this:
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 4 13
2 2001 7 NaN
2 2002 8 2
I tried the following command:
df= df.groupby("id").fillna(method="ffill", limit=2)
However, this raises KeyError: 'isin'. Filling just one column with the following command works just fine, but how can I efficiently forward fill the whole df grouped by id?
df["number"]= df.groupby("id")["number"].fillna(method="ffill", limit=2)
You can use:
df = df.groupby("id").apply(lambda x: x.ffill(limit=2))
print (df)
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
This also works for me:
df.groupby("id").fillna(method="ffill", limit=2)
so I think it is necessary to upgrade pandas.
ffill can be used directly:
df.groupby('id').ffill(limit=2)
Out[423]:
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
# isin: to restrict the fill to a subset of columns, select them first, e.g.
# cols = ['number', 'number2']
# df[cols] = df.groupby('id')[cols].ffill(limit=2)
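For reference, a self-contained version of the grouped fill, with the frame reconstructed from the tables above (note that in recent pandas the grouping column is not part of the ffill result, so it is simplest to assign the filled columns back):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':      [1, 1, 1, 2, 2],
    'date':    [2001, 2002, 2003, 2001, 2002],
    'number':  [4, 4, np.nan, 7, 8],
    'number2': [11, 45, 13, np.nan, 2],
})

# Forward fill within each id group, at most 2 consecutive rows deep
cols = ['number', 'number2']
df[cols] = df.groupby('id')[cols].ffill(limit=2)
print(df)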
How can I achieve the desired result based on the following dataset?
A B C D E
1 apple 5 2 20 NaN
2 orange 2 6 30 NaN
3 apple 6 1 40 NaN
4 apple 10 3 50 NaN
5 banana 8 9 60 NaN
Desired Result :
A B C D E
1 apple 5 NaN 2 20
2 orange 2 6 30 NaN
3 apple 6 NaN 1 40
4 apple 10 NaN 3 50
5 banana 8 9 60 NaN
IIUC you can use np.roll on the rows of interest: here we select only the rows where 'A' is 'apple', roll these by a single column row-wise, and assign the result back:
In [14]:
df.loc[df['A']=='apple', 'C':] = np.roll(df.loc[df['A']=='apple', 'C':], 1,axis=1)
df
Out[14]:
A B C D E
1 apple 5 NaN 2 20.0
2 orange 2 6.0 30 NaN
3 apple 6 NaN 1 40.0
4 apple 10 NaN 3 50.0
5 banana 8 9.0 60 NaN
Note that because you introduce NaN values, the dtype changes to float to allow this.
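A self-contained version of the same idea, with the frame reconstructed from the question (columns C and D are created as floats up front, since the roll introduces NaNs, as noted above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['apple', 'orange', 'apple', 'apple', 'banana'],
    'B': [5, 2, 6, 10, 8],
    'C': [2.0, 6.0, 1.0, 3.0, 9.0],
    'D': [20.0, 30.0, 40.0, 50.0, 60.0],
    'E': [np.nan] * 5,
}, index=range(1, 6))

mask = df['A'] == 'apple'
# Shift columns C..E one position to the right for the matching rows only
df.loc[mask, 'C':] = np.roll(df.loc[mask, 'C':], 1, axis=1)
print(df)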