I need some help rearranging a dataframe. This is what the data looks like:
Year item 1 item 2 item 3
2001 22 54 33
2002 77 54 33
2003 22 NaN 33
2004 22 54 NaN
The layout I want is:
Items Year Value
item 1 2001 22
item 1 2002 77
...
And so on...
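For reference, the sample frame can be reconstructed from the table above as follows (a minimal, assumed constructor):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2001, 2002, 2003, 2004],
    'item 1': [22, 77, 22, 22],
    'item 2': [54, 54, np.nan, 54],
    'item 3': [33, 33, 33, np.nan],
})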
Use melt if you don't need to remove the NaNs:
df = df.melt('Year', var_name='Items', value_name='Value')
print (df)
Year Items Value
0 2001 item 1 22.0
1 2002 item 1 77.0
2 2003 item 1 22.0
3 2004 item 1 22.0
4 2001 item 2 54.0
5 2002 item 2 54.0
6 2003 item 2 NaN
7 2004 item 2 54.0
8 2001 item 3 33.0
9 2002 item 3 33.0
10 2003 item 3 33.0
11 2004 item 3 NaN
To remove the NaNs, add dropna:
df = df.melt('Year', var_name='Items', value_name='Value').dropna(subset=['Value'])
print (df)
Year Items Value
0 2001 item 1 22.0
1 2002 item 1 77.0
2 2003 item 1 22.0
3 2004 item 1 22.0
4 2001 item 2 54.0
5 2002 item 2 54.0
7 2004 item 2 54.0
8 2001 item 3 33.0
9 2002 item 3 33.0
10 2003 item 3 33.0
For a slightly different ordering, with NaNs removed automatically (stack drops them by default), use set_index + stack + rename_axis + reset_index:
df = df.set_index('Year').stack().rename_axis(['Year','Items']).reset_index(name='Value')
print (df)
Year Items Value
0 2001 item 1 22.0
1 2001 item 2 54.0
2 2001 item 3 33.0
3 2002 item 1 77.0
4 2002 item 2 54.0
5 2002 item 3 33.0
6 2003 item 1 22.0
7 2003 item 3 33.0
8 2004 item 1 22.0
9 2004 item 2 54.0
Using comprehensions and pd.DataFrame.itertuples:
pd.DataFrame(
[[y, i, v]
for y, *vals in df.itertuples(index=False)
for i, v in zip(df.columns[1:], vals)
if pd.notnull(v)],
columns=['Year', 'Item', 'Value']
)
Year Item Value
0 2001 item 1 22.0
1 2001 item 2 54.0
2 2001 item 3 33.0
3 2002 item 1 77.0
4 2002 item 2 54.0
5 2002 item 3 33.0
6 2003 item 1 22.0
7 2003 item 3 33.0
8 2004 item 1 22.0
9 2004 item 2 54.0
Related
I have the following dataframe:
dataframe generator:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'year':[2000,2001,2002]*3,
'id':['a']*3+['b']*3+['c']*3,
'othernulcol': ['xyz']*3+[np.nan]*4+['tyu']*2,
'val':[np.nan,2,3,4,5,6,7,8,9]
})
data looks like:
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
I want to create 3 new rows, one for each year from 2000 to 2002, where each is the sum of the rows with id = a and id = b in the same year. othernulcol is just another column in the dataframe; when creating the new rows, just set such columns to np.nan.
Expected output:
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
Thank you for reading
Filter the rows by category, convert year to the index so the same years align between the two DataFrames, sum the values with DataFrame.add, and append the result to the original DataFrame with concat. Note that 'id' is included in cols, so add concatenates the strings and 'a' + 'b' yields the 'ab' label:
cols = ['id','val']
df1 = df[df['id'].eq('a')].set_index('year')[cols]
df2 = df[df['id'].eq('b')].set_index('year')[cols]
df = pd.concat([df, df1.add(df2).reset_index()], ignore_index=True)
print (df)
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
Another solution could be as follows:
Select rows from df with df.id.isin(['a','b']) (see Series.isin) and apply df.groupby to year.
For the aggregation, use sum for column id, which concatenates the strings into 'ab'. For column val use a lambda function to apply Series.sum, which allows skipna=False.
Finally, use pd.concat to add the result to the original df, ignoring the index.
out = pd.concat([df,df[df.id.isin(['a','b'])]\
.groupby('year', as_index=False)\
.agg({'id':'sum',
'val':lambda x: x.sum(skipna=False)})],
ignore_index=True)
print(out)
year id othernulcol val
0 2000 a xyz NaN
1 2001 a xyz 2.0
2 2002 a xyz 3.0
3 2000 b NaN 4.0
4 2001 b NaN 5.0
5 2002 b NaN 6.0
6 2000 c NaN 7.0
7 2001 c tyu 8.0
8 2002 c tyu 9.0
9 2000 ab NaN NaN
10 2001 ab NaN 7.0
11 2002 ab NaN 9.0
I have 2 dataframes
df_1:
Week Day Coeff_1 ... Coeff_n
1 1 12 23
1 2 11 19
1 3 23 68
1 4 57 81
1 5 35 16
1 6 0 0
1 7 0 0
...
50 1 12 23
50 2 11 19
50 3 23 68
50 4 57 81
50 5 35 16
50 6 0 0
50 7 0 0
df_2:
Week Day Coeff_1 ... Coeff_n
1 1 0 0
1 2 0 0
1 3 0 0
1 4 0 0
1 5 0 0
1 6 56 24
1 7 20 10
...
50 1 0 0
50 2 0 0
50 3 0 0
50 4 0 0
50 5 0 0
50 6 10 84
50 7 29 10
In the first dataframe, df_1, I have coefficients for Monday to Friday. In the second dataframe, df_2, I have coefficients for the weekend. My goal is to merge both dataframes so that the obsolete 0 values are replaced by the real coefficients.
What is the best approach to do that?
I found that using df.replace seems to be a good approach
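For reproducibility, a sketch of the visible rows (only weeks 1 and 50 appear above, and Coeff_n stands in for the elided coefficient columns):
import pandas as pd

df_1 = pd.DataFrame({
    'Week': [1] * 7 + [50] * 7,
    'Day': list(range(1, 8)) * 2,
    'Coeff_1': [12, 11, 23, 57, 35, 0, 0] * 2,
    'Coeff_n': [23, 19, 68, 81, 16, 0, 0] * 2,
})
df_2 = pd.DataFrame({
    'Week': [1] * 7 + [50] * 7,
    'Day': list(range(1, 8)) * 2,
    'Coeff_1': [0, 0, 0, 0, 0, 56, 20] + [0, 0, 0, 0, 0, 10, 29],
    'Coeff_n': [0, 0, 0, 0, 0, 24, 10] + [0, 0, 0, 0, 0, 84, 10],
})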
Assuming that your dataframes follow the same structure, you can capitalise on pandas' automatic alignment on indexes: replace the 0's with np.nan in df1, and then use fillna with df2:
import numpy as np

df1.replace({0: np.nan}, inplace=True)
df1.fillna(df2)
Week Day Coeff_1 Coeff_n
0 1.0 1.0 12.0 23.0
1 1.0 2.0 11.0 19.0
2 1.0 3.0 23.0 68.0
3 1.0 4.0 57.0 81.0
4 1.0 5.0 35.0 16.0
5 1.0 6.0 56.0 24.0
6 1.0 7.0 20.0 10.0
7 50.0 1.0 12.0 23.0
8 50.0 2.0 11.0 19.0
9 50.0 3.0 23.0 68.0
10 50.0 4.0 57.0 81.0
11 50.0 5.0 35.0 16.0
12 50.0 6.0 10.0 84.0
13 50.0 7.0 29.0 10.0
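If you prefer not to mutate df1 in place, combine_first (which fills NaN in the caller from the other frame, aligning on the index) should give the same result here in one chain:
df1.replace(0, np.nan).combine_first(df2)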
Can't you just take the rows of df_1 where Day is 1-5 and stack them with the rows of df_2 where Day is 6-7? (DataFrame.append was removed in pandas 2.0, so pd.concat is the safe spelling:)
df_3 = pd.concat([df_1[df_1.Day.isin(range(1, 6))], df_2[df_2.Day.isin(range(6, 8))]])
To get the normal ordering back, you can sort the values by week and day:
df_3.sort_values(['Week','Day'])
I'm trying to compute an expanding mean. I can get it to work by iterating and "grouping" via filtering on the specific values, but it takes far too long. This feels like an easy application of groupby, but when I try it, the expanding mean is computed over the entire dataset rather than within each group.
For a quick example, I want to take this (in this particular case, grouped by 'player' and 'year') and get an expanding mean:
player pos year wk pa ra
a qb 2001 1 10 0
a qb 2001 2 5 0
a qb 2001 3 10 0
a qb 2002 1 12 0
a qb 2002 2 13 0
b rb 2001 1 0 20
b rb 2001 2 0 17
b rb 2001 3 0 12
b rb 2002 1 0 14
b rb 2002 2 0 15
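For reference, a minimal constructor for this sample:
import pandas as pd

df = pd.DataFrame({
    'player': ['a'] * 5 + ['b'] * 5,
    'pos': ['qb'] * 5 + ['rb'] * 5,
    'year': [2001, 2001, 2001, 2002, 2002] * 2,
    'wk': [1, 2, 3, 1, 2] * 2,
    'pa': [10, 5, 10, 12, 13, 0, 0, 0, 0, 0],
    'ra': [0, 0, 0, 0, 0, 20, 17, 12, 14, 15],
})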
to get:
player pos year wk pa ra avg_pa avg_ra
a qb 2001 1 10 0 10 0
a qb 2001 2 5 0 7.5 0
a qb 2001 3 10 0 8.3 0
a qb 2002 1 12 0 12 0
a qb 2002 2 13 0 12.5 0
b rb 2001 1 0 20 0 20
b rb 2001 2 0 17 0 18.5
b rb 2001 3 0 12 0 16.3
b rb 2002 1 0 14 0 14
b rb 2002 2 0 15 0 14.5
Not sure where I'm going wrong:
# Group by player and season - also put weeks in correct ascending order
grouped = calc_averages.groupby(['player','pos','seas']).apply(pd.DataFrame.sort_values, 'wk')
grouped['avg_pa'] = grouped['pa'].expanding().mean()
But this will give an expanding mean for the entire set, not for each player, season.
Try chaining expanding after the groupby, so the mean restarts within each group:
df.sort_values('wk').groupby(['player','pos','year'])[['pa','ra']].expanding().mean()\
  .reset_index()
Output:
player pos year level_3 pa ra
0 a qb 2001 0 10.000000 0.000000
1 a qb 2001 1 7.500000 0.000000
2 a qb 2001 2 8.333333 0.000000
3 a qb 2002 3 12.000000 0.000000
4 a qb 2002 4 12.500000 0.000000
5 b rb 2001 5 0.000000 20.000000
6 b rb 2001 6 0.000000 18.500000
7 b rb 2001 7 0.000000 16.333333
8 b rb 2002 8 0.000000 14.000000
9 b rb 2002 9 0.000000 14.500000
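If the leftover level_3 column (the name reset_index gives the unnamed inner index level) is unwanted, it can be dropped and the results renamed to match the desired avg_ columns, e.g.:
out = (df.sort_values('wk')
         .groupby(['player', 'pos', 'year'])[['pa', 'ra']]
         .expanding().mean()
         .reset_index()
         .drop(columns='level_3')
         .rename(columns={'pa': 'avg_pa', 'ra': 'avg_ra'}))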
I have a data frame like the one below, except that the last column is not there; that is, I do not have the formula column, and my goal is to calculate it.
But how is it calculated? The formula for the last column is: for each PatientNumber,
number of Yes answers / total number of questions answered by the patient on that date.
For example, for patient number one there is 1 Yes and 2 No, so the value is 1/3.
For patient two, in year 2006, month 10, there is no Yes and all three questions are No, so the value is 0.
PatientNumber QT Answer Answerdate year month dayofyear count formula
1 1 transferring No 2017-03-03 2017 3 62 2.0 (1/3)
2 1 preparing food No 2017-03-03 2017 3 62 2.0 (1/3)
3 1 medications Yes 2017-03-03 2017 3 62 1.0 (1/3)
4 2 transferring No 2006-10-05 2006 10 275 3.0 0
5 2 preparing food No 2006-10-05 2006 10 275 3.0 0
6 2 medications No 2006-10-05 2006 10 275 3.0 0
7 2 transferring Yes 2007-4-15 2007 4 105 2.0 2/3
8 2 preparing food Yes 2007-4-15 2007 4 105 2.0 2/3
9 2 medications No 2007-4-15 2007 4 105 1.0 2/3
10 2 transferring Yes 2007-12-15 2007 12 345 1.0 1/3
11 2 preparing food No 2007-12-15 2007 12 345 2.0 1/3
12 2 medications No 2007-12-15 2007 12 345 2.0 1/3
13 2 transferring Yes 2008-10-10 2008 10 280 1.0 (1/3)
14 2 preparing food No 2008-10-10 2008 10 280 2.0 (1/3)
15 2 medications No 2008-10-10 2008 10 280 2.0 (1/3)
16 3 medications No 2008-10-10 2008 12 280 …… ………..
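For reference, a partial reconstruction of the visible rows (the trailing patient-3 row and the count column are omitted, and year/month/dayofyear are recomputed from Answerdate, so they may differ slightly from the hand-written table):
import pandas as pd

df = pd.DataFrame({
    'PatientNumber': [1] * 3 + [2] * 12,
    'QT': ['transferring', 'preparing food', 'medications'] * 5,
    'Answer': ['No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No',
               'Yes', 'No', 'No', 'Yes', 'No', 'No'],
    'Answerdate': ['2017-03-03'] * 3 + ['2006-10-05'] * 3 + ['2007-4-15'] * 3
                  + ['2007-12-15'] * 3 + ['2008-10-10'] * 3,
})
df['Answerdate'] = pd.to_datetime(df['Answerdate'])
df['year'] = df['Answerdate'].dt.year
df['month'] = df['Answerdate'].dt.month
df['dayofyear'] = df['Answerdate'].dt.dayofyear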
Update 1
Also, what if the formula changes a little bit:
If the patient visited the hospital only once in a year, the same formula as above is multiplied by 2. For example, for year 2017 there is just one month related to that patient, which means the patient came only once during the year; in this case the formula above multiplied by 2 works.
(Why? Because my window should be every 6 months, so if the patient has not come every 6 months I am assuming the same record repeats.)
But if there are several records during one year for one patient, the formula should be multiplied by 2 / the number of records in that year.
For example, in year 2007 the patient came to the hospital twice, once in month 4 and once in month 12, so in this case the same formula should be multiplied by 2/2.
Try this:
def func(x):
    # number of 'Yes' answers and total answers within each group
    x['yes'] = len(x[x['Answer'] == 'Yes'])
    x['all'] = len(x)
    return x

df = df.groupby(['PatientNumber', 'Answerdate']).apply(func)
df['formula_applied'] = df['yes'] / df['all']
df['formula'] = df['yes'].astype(str) + '/' + df['all'].astype(str)
print(df)
Output:
PatientNumber QT Answer Answerdate year month dayofyear \
0 1 transferring No 2017-03-03 2017 3 62
1 1 preparing food No 2017-03-03 2017 3 62
2 1 medications Yes 2017-03-03 2017 3 62
3 2 transferring No 2006-10-05 2006 10 275
4 2 preparing food No 2006-10-05 2006 10 275
5 2 medications No 2006-10-05 2006 10 275
6 2 transferring Yes 2007-4-15 2007 4 105
7 2 preparing food Yes 2007-4-15 2007 4 105
8 2 medications No 2007-4-15 2007 4 105
9 2 transferring Yes 2007-12-15 2007 12 345
10 2 preparing food No 2007-12-15 2007 12 345
11 2 medications No 2007-12-15 2007 12 345
12 2 transferring Yes 2008-10-10 2008 10 280
13 2 preparing food No 2008-10-10 2008 10 280
14 2 medications No 2008-10-10 2008 10 280
count yes all formula_applied formula
0 2.0 1 3 0.333333 1/3
1 2.0 1 3 0.333333 1/3
2 1.0 1 3 0.333333 1/3
3 3.0 0 3 0.000000 0/3
4 3.0 0 3 0.000000 0/3
5 3.0 0 3 0.000000 0/3
6 2.0 2 3 0.666667 2/3
7 2.0 2 3 0.666667 2/3
8 1.0 2 3 0.666667 2/3
9 1.0 1 3 0.333333 1/3
10 2.0 1 3 0.333333 1/3
11 2.0 1 3 0.333333 1/3
12 1.0 1 3 0.333333 1/3
13 2.0 1 3 0.333333 1/3
14 2.0 1 3 0.333333 1/3
Explanation:
The user-defined func computes the number of Yes answers ('yes') and the total number of records ('all') for each group; from those you can derive whatever you need. Column formula is your desired result; I added formula_applied in case you want the evaluated value.
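For the modified formula in Update 1, a minimal sketch building on the yes/all columns created above (visits and formula_updated are names introduced here; visits counts the distinct Answerdate values per patient and year):
# number of hospital visits per patient per year
visits = df.groupby(['PatientNumber', 'year'])['Answerdate'].transform('nunique')
# scale the ratio by 2 / (number of visits in that year)
df['formula_updated'] = df['yes'] / df['all'] * 2 / visits
For 2017 (one visit) this multiplies 1/3 by 2; for 2007 (two visits) the factor is 2/2 = 1, matching the examples in the question.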
I have the following df, with many more number columns. I now want to forward fill all the columns in the dataframe, grouped by id.
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 NaN 13
2 2001 7 NaN
2 2002 8 2
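For reference, a minimal constructor for this sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'date': [2001, 2002, 2003, 2001, 2002],
    'number': [4, 4, np.nan, 7, 8],
    'number2': [11, 45, 13, np.nan, 2],
})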
The result should look like this:
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 4 13
2 2001 7 NaN
2 2002 8 2
I tried the following command:
df= df.groupby("id").fillna(method="ffill", limit=2)
However, this raises a KeyError: "isin". Filling just one column with the following command works just fine, but how can I efficiently forward fill the whole df grouped by id?
df["number"]= df.groupby("id")["number"].fillna(method="ffill", limit=2)
You can use:
df = df.groupby("id").apply(lambda x: x.ffill(limit=2))
print (df)
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
The following also works for me:
df.groupby("id").fillna(method="ffill", limit=2)
so if it raises an error for you, I think it is necessary to upgrade pandas.
ffill can be used directly:
df.groupby('id').ffill(limit=2)
Out[423]:
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
If you only need to forward fill a subset of columns, you can select them with isin (fill in your column names in place of the empty string; grouping by df['id'] keeps the key usable even though it is excluded from the selection):
cols = df.columns.isin([''])
df.loc[:, cols] = df.loc[:, cols].groupby(df['id']).ffill(limit=2)