I have a column. How can I make a new column that counts runs of repeated positive and negative signs?
col1
-5
-3
-7
4
5
-0.5
6
8
9
col1 count_sign
-5 3
-3 3
-7 3
4 2
5 2
-0.5 1
6 3
8 3
9 3
The count for the first 3 rows is 3 because there are 3 consecutive negative values; the next 2 rows get 2 for the 2 consecutive positive values, and so on.
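The snippets below assume this minimal setup, reconstructing the sample column from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [-5, -3, -7, 4, 5, -0.5, 6, 8, 9]})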
# flag sign changes between consecutive rows:
# NaN where the sign repeats, 1 where it changes (the first row also gets 1)
df['count']=np.where(np.sign(df['col1']).diff().eq(0),
np.nan,
1)
# cumsum turns the 1-flags into run ids; ffill spreads each id over its run
df['count']=df['count'].cumsum().ffill()
# group by the run id and broadcast each group's size back using transform
df['count']=df.groupby('count')['col1'].transform('size')
df
col1 count
0 -5.0 3
1 -3.0 3
2 -7.0 3
3 4.0 2
4 5.0 2
5 -0.5 1
6 6.0 3
7 8.0 3
8 9.0 3
To give the count values the same sign as the rows they describe, multiply by np.sign:
df['count']=np.where(np.sign(df['col1']).diff().eq(0),
np.nan,
1)
df['count']=df['count'].cumsum().ffill()
df['count']=df.groupby('count')['col1'].transform('size')*np.sign(df['col1'])
df
col1 count
0 -5.0 -3.0
1 -3.0 -3.0
2 -7.0 -3.0
3 4.0 2.0
4 5.0 2.0
5 -0.5 -1.0
6 6.0 3.0
7 8.0 3.0
8 9.0 3.0
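As a side note, the flag-and-cumsum steps can be fused into a single grouping key by comparing each sign with its predecessor; an equivalent compact sketch (sign and group are illustrative names):

import numpy as np

sign = np.sign(df['col1'])
# a new run starts wherever the sign differs from the previous row's
group = sign.ne(sign.shift()).cumsum()
# broadcast each run's length back to its rows; drop the "* sign" for unsigned counts
df['count'] = df.groupby(group)['col1'].transform('size') * sign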
I have a dataframe with n rows:
1 2 3
3 4 1
5 3 2
9 8 2
7 2 6
0 0 0
4 4 4
8 4 1
...
and a dictionary in which each key is a row index and the value is its group:
d = {0 : 0 , 1: 0, 2 : 0, 3 : 1, 4 : 1, 5: 2, 6: 2}
I want to group by the keys and then apply mean on the groups.
So I will get:
3 3 2 #This is the mean of rows 0,1,2 from the original df, as d[0]=d[1]=d[2]=0
8 5 4
2 2 2
8 4 1
What is the best way to do so?
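For reference, a minimal reconstruction of the sample frame (the column names a, b, c are taken from the answer's output below):

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [3, 4, 1], [5, 3, 2], [9, 8, 2],
     [7, 2, 6], [0, 0, 0], [4, 4, 4], [8, 4, 1]],
    columns=['a', 'b', 'c'],
)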
Simply pass the dictionary to groupby; it will map each index value to the matching dictionary value and group on the result:
df.groupby(d).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
If you also want to keep the index values that are missing from the dictionary, pass dropna=False to groupby; those rows end up in the NaN group:
df.groupby(d, dropna=False).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
NaN 8.0 4.0 1.0
And for a range index instead of the dictionary keys:
df.groupby(d, dropna=False, as_index=False).mean()
output:
a b c
0 3.0 3.0 2.0
1 8.0 5.0 4.0
2 2.0 2.0 2.0
3 8.0 4.0 1.0
used input:
a b c
0 1 2 3
1 3 4 1
2 5 3 2
3 9 8 2
4 7 2 6
5 0 0 0
6 4 4 4
7 8 4 1
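As a side note, the grouping above is equivalent to mapping the index through the dictionary; a quick illustrative check of the label each row receives (row 7 has no key in d, hence the NaN group):

print(df.index.map(d))
# -> [0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, nan]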
I have two dataframes: (i) one with two index levels and two header levels, and (ii) one with a single index and a single header. The second level of each axis in the first dataframe corresponds to the matching axis of the second dataframe. I need to multiply the two dataframes based on that correspondence between the axes.
[Dataframe 1, Dataframe 2, and the expected result (element-wise multiplication matched on index/header) were shown as images; the answer below reconstructs equivalent frames in code.]
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd
df = pd.DataFrame([[9,10,2,1,6,5],
[4, 0,3,4,6,6],
[9, 3,9,1,2,3],
[3, 5,9,3,9,0],
[4,4,8,5,10,5],
[5, 3,1,8,5,6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3+[2021]*3,[1,2,3,1,2,3]])
df.index = pd.MultiIndex.from_arrays([[1]*3+[2]*3,[1,2,3,1,2,3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1,.3,.6],[.4,.4,.3],[.5,.4,.1]], index=[1,2,3], columns=[1,2,3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
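A quick spot check of one cell, assuming the frames built above: the entry at row (1, 1), column (2020, 1) should be df's 9 multiplied by df2's value at index 1, column 1:

import numpy as np

# 9 * 0.1 == 0.9, the first cell of df_out
assert np.isclose(df_out.loc[(1, 1), (2020, 1)],
                  df.loc[(1, 1), (2020, 1)] * df2.loc[1, 1])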
df = pd.DataFrame({'timePoint': [1,1,1,1,2,2,2,2,3,3,3,3],
'item': [1,2,3,4,3,4,5,6,1,3,7,2],
'value': [2,4,7,6,5,9,3,2,4,3,1,5]})
>>> df
item timePoint value
0 1 1 2
1 2 1 4
2 3 1 7
3 4 1 6
4 3 2 5
5 4 2 9
6 5 2 3
7 6 2 2
8 1 3 4
9 3 3 3
10 7 3 1
11 2 3 5
In this df, not every item appears at every timePoint. I want every unique item to be present at every timePoint, and each newly inserted row should have:
(i) a NaN value if the item has not appeared at an earlier timePoint, or
(ii) the item's most recent value if it has.
The desired output should look like the following (lines with hashtag are those inserted).
>>> dfx
item timePoint value
0 1 1 2.0
3 1 2 2.0 #
8 1 3 4.0
1 2 1 4.0
4 2 2 4.0 #
11 2 3 5.0
2 3 1 7.0
4 3 2 5.0
9 3 3 3.0
3 4 1 6.0
5 4 2 9.0
6 4 3 9.0 #
0 5 1 NaN #
6 5 2 3.0
7 5 3 3.0 #
1 6 1 NaN #
7 6 2 2.0
8 6 3 2.0 #
2 7 1 NaN #
5 7 2 NaN #
10 7 3 1.0
For example, item 2 gets a 4.0 at timePoint 2 because that's what it had at timePoint 1, whereas item 6 gets a NaN at timePoint 1 because there is no preceding value.
Now, I know that if I manage to insert rows for every unique item missing from each timePoint group, i.e. reach this point:
>>> dfx
item timePoint value
0 1 1 2.0
1 2 1 4.0
2 3 1 7.0
3 4 1 6.0
4 3 2 5.0
5 4 2 9.0
6 5 2 3.0
7 6 2 2.0
8 1 3 4.0
9 3 3 3.0
10 7 3 1.0
11 2 3 5.0
0 5 1 NaN
1 6 1 NaN
2 7 1 NaN
3 1 2 NaN
4 2 2 NaN
5 7 2 NaN
6 4 3 NaN
7 5 3 NaN
8 6 3 NaN
Then I can do:
dfx.sort_values(by = ['item', 'timePoint'],
inplace = True,
ascending = [True, True])
dfx['value'] = dfx.groupby('item')['value'].fillna(method='ffill')
which will return the desired output.
But how do I insert, for each timePoint group, rows for all of the df.item.unique() items that are missing from it?
Also, if you have a more efficient solution from scratch to suggest, then by all means please be my guest.
Use pd.MultiIndex.from_product on the existing index levels to build the full item/timePoint grid, reindex to it, then forward-fill within each item:
d = df.set_index(['item', 'timePoint'])
d.reindex(
pd.MultiIndex.from_product(d.index.levels, names=d.index.names)
).groupby(level='item').ffill().reset_index()
item timePoint value
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
I think unstack followed by stack(dropna=False) will achieve the format, creating a row for every item/timePoint pair; then we use groupby with ffill to fill the NaN values forward:
s=df.set_index(['item','timePoint']).value.unstack().stack(dropna=False)
s.groupby(level=0).ffill().reset_index()
Out[508]:
item timePoint 0
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
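A variant of the same reshaping idea that skips the groupby entirely: once the frame is unstacked, the timePoints sit side by side as columns, so a plain row-wise ffill does the per-item filling. A sketch under the same input (wide and out are illustrative names):

wide = df.set_index(['item', 'timePoint']).value.unstack()
# forward-fill across timePoints (left to right), then restore the long format
out = wide.ffill(axis=1).stack(dropna=False).rename('value').reset_index()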
I have a dataframe that looks like this:
import pandas as pd

raw_data = {'col0': [1, 4, 5, 1, 3, 3, 1, 5, 8, 9, 1, 2]}
df = pd.DataFrame(raw_data)
col0
0 1
1 4
2 5
3 1
4 3
5 3
6 1
7 5
8 8
9 9
10 1
11 2
What I want to do is count, over a rolling window of 3 rows, how many values fit the condition df['col0'] > 3, and make a new column that looks like this:
col0 col_roll_count3
0 1 0
1 4 1
2 5 2 # window is rows 0,1,2; the values 4 and 5 fit the condition
3 1 2
4 3 1
5 3 0 # window is rows 3,4,5; no value fits the condition
6 1 0
7 5 1
8 8 2
9 9 3
10 1 2
11 2 1
How can I achieve that?
I tried this but failed:
df['col_roll_count3'] = df[df['col0']>3].rolling(3).count()
print(df)
col0 col1
0 1 NaN
1 4 1.0
2 5 2.0
3 1 NaN
4 3 NaN
5 3 NaN
6 1 NaN
7 5 3.0
8 8 3.0
9 9 3.0
10 1 NaN
11 2 NaN
The boolean condition sums as 0/1, so the rolling count is just a rolling sum; min_periods=1 keeps the short leading windows instead of producing NaN there:
df['col_roll_count3'] = df['col0'].gt(3).rolling(3, min_periods=1).sum()
Let's use rolling, apply, np.count_nonzero:
df['col_roll_count3'] = df.col0.rolling(3,min_periods=1)\
.apply(lambda x: np.count_nonzero(x>3))
Output:
col0 col_roll_count3
0 1 0.0
1 4 1.0
2 5 2.0
3 1 2.0
4 3 1.0
5 3 0.0
6 1 0.0
7 5 1.0
8 8 2.0
9 9 3.0
10 1 2.0
11 2 1.0
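Since min_periods=1 leaves no NaNs here, the float result can be cast if an integer column is preferred; a minimal follow-up sketch:

df['col_roll_count3'] = df['col0'].gt(3).rolling(3, min_periods=1).sum().astype(int)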
I want to create a new column which is the result of a shift function applied to grouped values.
df = pd.DataFrame({'X': [0,1,0,1,0,1,0,1], 'Y':[2,4,3,1,2,3,4,5]})
df
X Y
0 0 2
1 1 4
2 0 3
3 1 1
4 0 2
5 1 3
6 0 4
7 1 5
def func(x):
x['Z'] = test['Y']-test['Y'].shift(1)
return x
df_new = df.groupby('X').apply(func)
X Y Z
0 0 2 NaN
1 1 4 2.0
2 0 3 -1.0
3 1 1 -2.0
4 0 2 1.0
5 1 3 1.0
6 0 4 1.0
7 1 5 1.0
As you can see from the output, the values are shifted sequentially without accounting for the groupby.
I have seen a similar question, but I could not figure out why it does not work as expected.
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation
The values are shifted without accounting for the groups because your func uses test (presumably some other object, likely another name for what you call df) directly instead of simply the group x.
def func(x):
x['Z'] = x['Y']-x['Y'].shift(1)
return x
gives me
In [8]: df_new
Out[8]:
X Y Z
0 0 2 NaN
1 1 4 NaN
2 0 3 1.0
3 1 1 -3.0
4 0 2 -1.0
5 1 3 2.0
6 0 4 2.0
7 1 5 2.0
but note that in this particular case you don't need to write a custom function; you can just call diff on the groupby object directly. (Of course, other functions you might want to work with may be more complicated.)
In [13]: df_new["Z2"] = df.groupby("X")["Y"].diff()
In [14]: df_new
Out[14]:
X Y Z Z2
0 0 2 NaN NaN
1 1 4 NaN NaN
2 0 3 1.0 1.0
3 1 1 -3.0 -3.0
4 0 2 -1.0 -1.0
5 1 3 2.0 2.0
6 0 4 2.0 2.0
7 1 5 2.0 2.0
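Equivalently, the shift itself can be called on the groupby object, which is the general pattern when diff does not cover the computation; a sketch (Z3 is an illustrative name):

# shift runs independently within each X-group, so the subtraction stays per group
df_new['Z3'] = df['Y'] - df.groupby('X')['Y'].shift(1)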