How to replace NaN values using values from other rows with a common column value - pandas

Using column B as the reference, how can I replace the NaN values in column A?
>>> a
     A  B
0    1  1
1  NaN  3
2    1  1
3  NaN  1
4  NaN  2
5    5  3
6    1  1
7    2  2
I want a result like this:
>>> result
   A  B
0  1  1
1  5  3
2  1  1
3  1  1
4  2  2
5  5  3
6  1  1
7  2  2
I tried merging on column B, but couldn't figure it out:
b = a.groupby('B').reset_index()
dfM = pd.merge(a, b, on='B', how='left')

We need a map from values in column B to the values in A.
mapping = a.dropna().drop_duplicates().set_index("B")["A"]
It looks like this:
B
1    1.0
3    5.0
2    2.0
Name: A, dtype: float64
Filling null values becomes irrelevant at this point; we can just map B to get column A:
a["B"].map(mapping)
This gives you:
0    1.0
1    5.0
2    1.0
3    1.0
4    2.0
5    5.0
6    1.0
7    2.0
Name: B, dtype: float64
Cast to int and use it to overwrite column A in your original dataframe if you need to.
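For example, a minimal end-to-end sketch of that last step (the frame is reconstructed from the question; overwriting A in place is one way to do it):
import pandas as pd
import numpy as np

a = pd.DataFrame({"A": [1, np.nan, 1, np.nan, np.nan, 5, 1, 2],
                  "B": [1, 3, 1, 1, 2, 3, 1, 2]})

# Build the B -> A mapping from the rows where A is known.
mapping = a.dropna().drop_duplicates().set_index("B")["A"]

# Map B through it and cast back to int to overwrite column A.
a["A"] = a["B"].map(mapping).astype(int)
print(a)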

Related

Pandas: new column where value is based on a specific value within subgroup

I have a dataframe where I want to create a new column ("NewValue") that takes, within each "Group", the "Value" from the row where SubGroup = A.
   Group SubGroup  Value  NewValue
0      1        A      1         1
1      1        B      2         1
2      2        A      3         3
3      2        C      4         3
4      3        B      5       NaN
5      3        C      6       NaN
Can this be achieved using a groupby / transform function?
Use Series.map with a DataFrame filtered by boolean indexing:
df['NewValue'] = df['Group'].map(df[df.SubGroup.eq('A')].set_index('Group')['Value'])
print(df)
   Group SubGroup  Value  NewValue
0      1        A      1       1.0
1      1        B      2       1.0
2      2        A      3       3.0
3      2        C      4       3.0
4      3        B      5       NaN
5      3        C      6       NaN
Alternatively, use a left join via DataFrame.merge after renaming the column:
df1 = df.loc[df.SubGroup.eq('A'), ['Group', 'Value']].rename(columns={'Value': 'NewValue'})
df = df.merge(df1, how='left')
print(df)
   Group SubGroup  Value  NewValue
0      1        A      1       1.0
1      1        B      2       1.0
2      2        A      3       3.0
3      2        C      4       3.0
4      3        B      5       NaN
5      3        C      6       NaN
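Since the question asks about groupby / transform specifically, that route is possible too; a sketch (the intermediate name masked is my own):
# Keep 'Value' only on the SubGroup == 'A' rows, NaN elsewhere.
masked = df['Value'].where(df['SubGroup'].eq('A'))
# 'first' skips NaN, so each group broadcasts its A-row value;
# groups with no A row stay NaN.
df['NewValue'] = masked.groupby(df['Group']).transform('first')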

Average per category per last N round in Pandas and lag it

I have the following problem.
I want to compute the mean of the last 2 observations per name and round, and lag it. See the following example:
df = pd.DataFrame(data={'name': ["a", "a", "a", "a", "b", "b", "c"],
                        'value': [6, 5, 4, 3, 1, 2, 1],
                        'round': [1, 2, 3, 4, 1, 2, 1]})
The desired output is:
df = pd.DataFrame(data={'name': ["a", "a", "a", "a", "b", "b", "c"],
                        'value': [6, 5, 4, 3, 1, 2, 1],
                        'round': [1, 2, 3, 4, 1, 2, 1],
                        'mean_last_2': [np.nan, np.nan, 5.5, 4.5, np.nan, np.nan, np.nan]})
I tried this, but got "AttributeError: 'float' object has no attribute 'shift'":
df['mean_last_2'] = df.groupby("name")['value'].apply(lambda x: x.tail(2).mean().shift(1))
How can I fix it, please?
You could try something like this:
df['mean_last_2'] = df.groupby('name')['value'].apply(lambda x: x.rolling(2).mean().shift())
Output:
  name  value  round  mean_last_2
0    a      6      1          NaN
1    a      5      2          NaN
2    a      4      3          5.5
3    a      3      4          4.5
4    b      1      1          NaN
5    b      2      2          NaN
6    c      1      1          NaN
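Note that on newer pandas versions groupby(...).apply can prepend the group keys to the result index, which breaks the assignment above; a transform-based variant of the same logic (my adaptation, not from the answer) keeps the result aligned to the original index:
df['mean_last_2'] = (df.groupby('name')['value']
                       .transform(lambda x: x.rolling(2).mean().shift()))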
You can do something like
df.groupby("name").apply(lambda d: d.assign(mean_last_2 = d['value'].rolling(2).mean().shift()))
to get
        name  value  round  mean_last_2
name
a    0     a      6      1          NaN
     1     a      5      2          NaN
     2     a      4      3          5.5
     3     a      3      4          4.5
b    4     b      1      1          NaN
     5     b      2      2          NaN
c    6     c      1      1          NaN
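If you prefer a flat index here rather than the extra name level that apply adds, passing group_keys=False is one option (a sketch):
df = df.groupby('name', group_keys=False).apply(
    lambda d: d.assign(mean_last_2=d['value'].rolling(2).mean().shift()))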

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
  name  value  round
0    a      5      3
1    b      4      3
2    c      3      2
3    d      1      2
4    a      2      1
5    c      1      1
0    c      1      3
1    d      4      3
2    b      3      2
3    a      1      2
4    b      5      1
5    d      2      1
I would like to compute lagged means for column value, per name and round. That is, for name a in round 3 I need value_mean = 1.5 (because (1+2)/2). And of course, there will be NaN values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
  name  value  round  value_mean
0    a      5      3         NaN
1    b      4      3         5.0
2    c      3      2         3.5
3    d      1      2         NaN
4    a      2      1         4.0
5    c      1      1         3.5
0    c      1      3         NaN
1    d      4      3         3.0
2    b      3      2         2.0
3    a      1      2         NaN
4    b      5      1         1.0
5    d      2      1         2.5
Any idea how I can do this, please? I found this, but it seems not relevant to my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np

# sort the values as they need to be counted
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)

# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)
ser_same_name = df['name'] == df['name'].shift(1)

# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)

# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum'] = ser_sum
#df['count'] = ser_count
df
Output:
   name  value  round  value_mean
0     a      2      1         NaN
1     a      1      2         2.0
2     a      5      3         1.5
3     b      5      1         NaN
4     b      3      2         5.0
5     b      4      3         4.0
6     c      1      1         NaN
7     c      3      2         1.0
8     c      1      3         2.0
9     d      2      1         NaN
10    d      1      2         2.0
11    d      4      3         1.5
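A more compact route to the same result (a sketch of an alternative, not the answer above): an expanding mean shifted by one row within each name group.
df = df.sort_values(['name', 'round']).reset_index(drop=True)
# Each entry sees the running mean of strictly earlier rounds;
# the first row of each group has no history and stays NaN.
df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))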

pandas column operation on certain row in succession

I have a pandas dataframe like this:
   second block
0       1     a
1       2     b
2       3     c
3       4     a
4       5     c
This is sequential data, and I would like to get a new column that is the time difference between the current block and the next time it repeats.
   second block  freq
0       1     a     3   // (4-1)
1       2     b     0   // (not repeating)
2       3     c     2   // (5-3)
3       4     a     0   // (not repeating)
4       5     c     0   // (not repeating)
I have tried getting the unique list of blocks and then looping, as below:
for i in unique_block:
    df['freq'] = df['second'].shift(-1) - df['second']
I do not know how to get 0 for row indices 1, 3, and 4, and since the dataframe is very big, this is not efficient. It is not working.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention, and fillna with 0:
df['freq'] = (df.groupby('block')['second'].diff(-1) * -1).fillna(0)
   second block  freq
0       1     a   3.0
1       2     b   0.0
2       3     c   2.0
3       4     a   0.0
4       5     c   0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
   second block  freq
0       1     a   3.0
1       2     b   0.0
2       3     c   2.0
3       4     a   0.0
4       5     c   0.0
Or using:
df.groupby('block').second.apply(lambda x: x.diff().shift(-1)).fillna(0)
Out[242]:
0    3.0
1    0.0
2    2.0
3    0.0
4    0.0
Name: second, dtype: float64
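Putting it together, a self-contained sketch of the first approach (the frame is reconstructed from the question):
import pandas as pd

df = pd.DataFrame({'second': [1, 2, 3, 4, 5],
                   'block': ['a', 'b', 'c', 'a', 'c']})

# Within each block, diff(-1) compares each row with the next
# occurrence of the same block; multiplying by -1 gives
# "next minus current", and rows without a successor become 0.
df['freq'] = (df.groupby('block')['second'].diff(-1) * -1).fillna(0)
print(df)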

pandas diff between successive values within groups

d = pd.DataFrame({'a':[7,6,3,4,8], 'b':['c','c','d','d','c']})
d.groupby('b')['a'].diff()
This gives me:
0    NaN
1   -1.0
2    NaN
3    1.0
4    2.0
What I'd need is:
0    NaN
1   -1.0
2    NaN
3    1.0
4    NaN
That is, the difference between only successive values within a group, so when a group reappears after another group, its previous values are ignored.
In my example, the last c value starts a new c group.
You would need to group by consecutive segments:
In [1055]: d.groupby((d.b != d.b.shift()).cumsum())['a'].diff()
Out[1055]:
0    NaN
1   -1.0
2    NaN
3    1.0
4    NaN
Name: a, dtype: float64
Details:
In [1056]: (d.b != d.b.shift()).cumsum()
Out[1056]:
0    1
1    1
2    2
3    2
4    3
Name: b, dtype: int32
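For completeness, a self-contained version of the same idea that assigns the result to a column (the names run_id and a_diff are my own):
import pandas as pd

d = pd.DataFrame({'a': [7, 6, 3, 4, 8], 'b': ['c', 'c', 'd', 'd', 'c']})

# Mark the start of each consecutive run of identical 'b' values,
# then cumsum turns those marks into run ids: 1, 1, 2, 2, 3.
run_id = (d['b'] != d['b'].shift()).cumsum()

# diff within each run, so values are only compared with their
# immediate predecessor in the same consecutive segment.
d['a_diff'] = d.groupby(run_id)['a'].diff()
print(d)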