I have 2 dataframes
df_1:
Week Day Coeff_1 ... Coeff_n
1 1 12 23
1 2 11 19
1 3 23 68
1 4 57 81
1 5 35 16
1 6 0 0
1 7 0 0
...
50 1 12 23
50 2 11 19
50 3 23 68
50 4 57 81
50 5 35 16
50 6 0 0
50 7 0 0
df_2:
Week Day Coeff_1 ... Coeff_n
1 1 0 0
1 2 0 0
1 3 0 0
1 4 0 0
1 5 0 0
1 6 56 24
1 7 20 10
...
50 1 0 0
50 2 0 0
50 3 0 0
50 4 0 0
50 5 0 0
50 6 10 84
50 7 29 10
In the first dataframe, df_1, I have coefficients for Monday to Friday. In the second dataframe, df_2, I have coefficients for the weekend. My goal is to merge both dataframes so that I am no longer left with the obsolete 0 values.
What is the best approach to do that?
I found that df.replace seems to be a good approach.
Assuming that your dataframes follow the same structure, you can capitalise on pandas' ability to align automatically on the index: replace the 0s with np.nan in df1, then use fillna:
df1.replace({0: np.nan}, inplace=True)
df1.fillna(df2)
Week Day Coeff_1 Coeff_n
0 1.0 1.0 12.0 23.0
1 1.0 2.0 11.0 19.0
2 1.0 3.0 23.0 68.0
3 1.0 4.0 57.0 81.0
4 1.0 5.0 35.0 16.0
5 1.0 6.0 56.0 24.0
6 1.0 7.0 20.0 10.0
7 50.0 1.0 12.0 23.0
8 50.0 2.0 11.0 19.0
9 50.0 3.0 23.0 68.0
10 50.0 4.0 57.0 81.0
11 50.0 5.0 35.0 16.0
12 50.0 6.0 10.0 84.0
13 50.0 7.0 29.0 10.0
Can't you just append the rows of df_1 where Day is 1-5 to the rows of df_2 where Day is 6-7? Since DataFrame.append was removed in pandas 2.0, pd.concat does the job:
df_3 = pd.concat([df_1[df_1.Day.isin(range(1, 6))], df_2[df_2.Day.isin(range(6, 8))]])
To get back a normal ordering, you can sort the values by week and day:
df_3.sort_values(['Week','Day'])
I have a dataset:
value score
0 0.0 8
1 0.0 7
2 NaN 4
3 1.0 11
4 2.0 22
5 NaN 12
6 0.0 4
7 NaN 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 NaN 28
There are some NaNs in it. I want to fill those NaNs with these conditions:
If 'score' is less than 10, then fill nan with 0.0
If 'score' is between 10 and 20, then fill nan with 1.0
If 'score' is greater than 20, then fill nan with 2.0
How do I do this in pandas?
Here is an example dataframe:
value = [0,0,np.nan,1,2,np.nan,0,np.nan,0,2,1,1,0,2,np.nan]
score = [8,7,4,11,22,12,4,15,5,24,12,15,5,26,28]
pd.DataFrame({'value': value, 'score':score})
Do it with cut, then fillna (note that with the default right=True, a score of exactly 10 would fall into the first bin):
df.value.fillna(pd.cut(df.score, [-np.inf, 10, 20, np.inf], labels=[0, 1, 2]).astype(int), inplace=True)
df
Out[6]:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
You could use numpy.select with conditions on <10, 10 ≤ score < 20, etc., but a more efficient option is a floor division: values below 10 become 0, values from 10 to 19 become 1, and so on.
df['value'] = df['value'].fillna(df['score'].floordiv(10))
with numpy.select:
conditions = [df['score'].lt(10), df['score'].between(10, 20), df['score'].ge(20)]
df['value'] = df['value'].fillna(pd.Series(np.select(conditions, [0, 1, 2]), index=df.index))
(np.select returns a plain NumPy array, which Series.fillna does not accept, hence the pd.Series wrapper.)
output:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
Use np.select or pd.cut to map the intervals to values, then fillna (wrapping the resulting array in a Series so fillna accepts it):
mapping = pd.Series(np.select((df['score'] < 10, df['score'] > 20), (0, 2), 1), index=df.index)
df['value'] = df['value'].fillna(mapping)
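As a self-contained sketch of the floor-division idea, on a shortened, hypothetical slice of the question's data; the clip is an added guard (not in the answers above) that caps any score of 30 or more at label 2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [0, np.nan, np.nan, np.nan],
                   "score": [8, 4, 15, 28]})

# Integer-divide by 10 so <10 -> 0, 10-19 -> 1, 20-29 -> 2; clip caps
# scores of 30+ at 2, and fillna only touches the missing values.
df["value"] = df["value"].fillna(df["score"].floordiv(10).clip(upper=2))
print(df["value"].tolist())
```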
I am trying to make a new column, 'ID', which should assign a unique ID whenever the 'Data' column has a non-NaN value. If non-null values sit right next to each other, the ID remains the same. I have provided below how my final Id column should look, as a reference to better understand. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
Groupby the cumsum to get consecutive groups, using where to mask the NaN rows. .ngroup then assigns the consecutive IDs; the same is also possible with rank.
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
Using factorize
v = pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0] + 1
df.loc[df.Data.notnull(), 'Newid'] = v
df.Newid.fillna(0, inplace=True)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0
I was wondering if it is possible to add 1 (or n) to missing values in a pandas DataFrame / Series.
For example:
1
10
nan
15
25
nan
nan
nan
30
Would return :
1
10
11
15
25
26
27
28
30
Thank you,
Use .ffill plus the result of a groupby.cumcount to determine n:
df[0].ffill() + df.groupby(df[0].notnull().cumsum()).cumcount()
0 1.0
1 10.0
2 11.0
3 15.0
4 25.0
5 26.0
6 27.0
7 28.0
8 30.0
dtype: float64
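Reconstructing that as a self-contained sketch on a plain Series built from the question's data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 10, np.nan, 15, 25, np.nan, np.nan, np.nan, 30])

# cumsum over notnull() starts a new group at every real value, so
# cumcount() yields 0 on the value itself and 1, 2, ... on the NaNs
# that follow it; adding that to the forward-filled value gives +n.
result = s.ffill() + s.groupby(s.notnull().cumsum()).cumcount()
print(result.tolist())
```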
I have been struggling with this issue for a bit, and even though there are some workarounds I could use, I would love to know if there is an elegant way to achieve this result:
import pandas as pd
import numpy as np
data = np.array([
[1,10],
[2,12],
[4,13],
[5,14],
[8,15]])
df1 = pd.DataFrame(data=data, index=range(0,5), columns=['x','a'])
data = np.array([
[2,100,101],
[3,120,122],
[4,130,132],
[7,140,142],
[9,150,151],
[12,160,152]])
df2 = pd.DataFrame(data=data, index=range(0,6), columns=['x','b','c'])
Now I would like to have a dataframe that concatenates those two and fills the missing values with the previous value, or with the first available value otherwise. Both dataframes can have different sizes; what we are interested in here is the shared column x.
That would be my desired output frame df_result.
x is the union of the unique x values of the two frames:
x a b c
0 1 10 100 101
1 2 12 100 101
2 3 12 120 122
3 4 13 130 132
4 5 14 130 132
5 7 14 140 142
6 8 15 140 142
7 9 15 150 151
8 12 15 160 152
Any help or hint would be much appreciated, thank you very much
You can simply use a merge operation on the two dataframes; after that, apply a sort, then a forward fill and a backward fill for the remaining null values:
df1.merge(df2,on='x',how='outer').sort_values('x').ffill().bfill()
Out:
x a b c
0 1 10.0 100.0 101.0
1 2 12.0 100.0 101.0
5 3 12.0 120.0 122.0
2 4 13.0 130.0 132.0
3 5 14.0 130.0 132.0
6 7 14.0 140.0 142.0
4 8 15.0 140.0 142.0
7 9 15.0 150.0 151.0
8 12 15.0 160.0 152.0
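The same chain as a self-contained sketch, using the frames defined in the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 4, 5, 8],
                    "a": [10, 12, 13, 14, 15]})
df2 = pd.DataFrame({"x": [2, 3, 4, 7, 9, 12],
                    "b": [100, 120, 130, 140, 150, 160],
                    "c": [101, 122, 132, 142, 151, 152]})

# Outer merge keeps every x from both frames; after sorting by x,
# ffill propagates the previous value and bfill covers the leading
# NaNs that have no previous value (here b and c at x=1).
result = (df1.merge(df2, on="x", how="outer")
             .sort_values("x")
             .ffill()
             .bfill()
             .reset_index(drop=True))
print(result)
```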
I managed to solve this using if statements and for loops, but I'm looking for a less computationally expensive way to do it, i.e. using apply, map, or any other technique.
d = {1:10, 2:20, 3:30}
df
a b
1 35
1 nan
1 nan
2 nan
2 47
2 nan
3 56
3 nan
I want to fill missing values of column b according to dict d, i.e. output should be
a b
1 35
1 10
1 10
2 20
2 47
2 20
3 56
3 30
You can use fillna or combine_first with the mapped column:
print (df['a'].map(d))
0 10
1 10
2 10
3 20
4 20
5 20
6 30
7 30
Name: a, dtype: int64
df['b'] = df['b'].fillna(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
df['b'] = df['b'].combine_first(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
And if all values are ints, add astype:
df['b'] = df['b'].fillna(df['a'].map(d)).astype(int)
print (df)
a b
0 1 35
1 1 10
2 1 10
3 2 20
4 2 47
5 2 20
6 3 56
7 3 30
If all values in column a appear as keys of the dict, then it is also possible to use replace:
df['b'] = df['b'].fillna(df['a'].replace(d))
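A self-contained version of the fillna + map approach, with the question's data spelled out:

```python
import numpy as np
import pandas as pd

d = {1: 10, 2: 20, 3: 30}
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 3],
                   "b": [35, np.nan, np.nan, np.nan, 47, np.nan, 56, np.nan]})

# map() builds a per-row default from column a, and fillna uses it
# only where b is missing, leaving the existing values untouched.
df["b"] = df["b"].fillna(df["a"].map(d))
print(df["b"].tolist())
```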