Given a dataframe df, how do I calculate the rolling count of unique values across rows, with a window size of n?
Input data:
import pandas as pd
import numpy as np
data = {'col_0': [7, 8, 9, 10, 11, 12],
        'col_1': [4, 5, 6, 7, 8, 9],
        'col_2': [2, 5, 8, 11, 14, 15],
        'col_3': [2, 6, 10, 14, 18, 21],
        'col_4': [7, 5, 7, 5, 7, 5],
        'col_5': [2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5
0 7 4 2 2 7 2
1 8 5 5 6 5 6
2 9 6 8 10 7 10
3 10 7 11 14 5 14
4 11 8 14 18 7 18
5 12 9 15 21 5 21
Expected output (with window size = 2):
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
For the example above with window size = 2:
Window 0 contains only row[0]:
[[7 4 2 2 7 2]]
rolling_nunique[0] is 3, the unique elements being [2, 4, 7].
Window 1 contains row[0] & row[1]:
[[7 4 2 2 7 2]
 [8 5 5 6 5 6]]
rolling_nunique[1] is 6, the unique elements being [2, 4, 5, 6, 7, 8].
Window 2 contains row[1] & row[2]:
[[ 8  5  5  6  5  6]
 [ 9  6  8 10  7 10]]
rolling_nunique[2] is 6, the unique elements being [5, 6, 7, 8, 9, 10].
etc.
Using sliding_window_view, you can customize how the values are aggregated within the sliding window. To get values for all rows before the window is full (i.e., to emulate min_periods=1 in pandas' rolling), we prepend w-1 rows of NaN at the top using vstack and full; at the end we account for this padding by filtering the NaN values away.
from numpy.lib.stride_tricks import sliding_window_view
w = 2
# prepend w-1 rows of NaN so the first windows are only partially filled
values = np.vstack([np.full([w - 1, df.shape[1]], np.nan), df.values])
# one flattened window of w rows per original row
m = sliding_window_view(values, w, axis=0).reshape(len(df), -1)
# count the unique values per window, ignoring the NaN padding
unique_count = [len(np.unique(r[~np.isnan(r)])) for r in m]
df['rolling_nunique'] = unique_count
Result:
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
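For comparison, here's a plain NumPy sketch (an illustration, not part of the original answer) that slices each window directly instead of padding; it reproduces the same min_periods=1 behaviour:
w = 2
# take only the original data columns, in case 'rolling_nunique' was already added
vals = df[df.columns.drop('rolling_nunique', errors='ignore')].to_numpy()
# for each row i, count the unique values in the up-to-w rows ending at i
df['rolling_nunique'] = [
    len(np.unique(vals[max(0, i - w + 1): i + 1]))
    for i in range(len(df))
]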
I found this can also be solved using sliding_window_view() from numpy. Here's the approach:
rolling = 2
ar = df.values  # turn into np.ndarray
length = ar.shape[1]
head_arrs = np.zeros((rolling - 1, rolling * length))
# every full window spans `rolling` consecutive rows; flatten each window to one row
cuboid = np.lib.stride_tricks.sliding_window_view(ar, (rolling, length)).astype(float)
plane = cuboid.reshape(-1, rolling * length)
# build the partially filled head windows, padding the missing rows with NaN
for i in range(rolling - 1, 0, -1):
    head_arr_l = plane[0, :i * length]
    head_arr_l = np.pad(head_arr_l.astype(float), (0, length * (rolling - i)), 'constant', constant_values=np.nan)
    head_arr_l = np.roll(head_arr_l, length * (rolling - i))
    head_arrs[i - 1, :] = head_arr_l
plane = np.insert(plane, 0, head_arrs, axis=0)
# nunique skips NaN by default, so the padding is not counted
df['rolling_nunique'] = pd.DataFrame(plane).nunique(axis=1)
df
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
[reference] numpy.lib.stride_tricks.sliding_window_view
I wanted to add or append a row (in the form of a list) to a dataframe. All the methods I found require turning the list into another dataframe first, e.g.
df = df.append(another_dataframe)
df = df.merge(another_dataframe)
df = pd.concat([df, another_dataframe])
I've found a trick that works if the index is a running number, at https://www.statology.org/pandas-add-row-to-dataframe/
import pandas as pd
#create DataFrame
df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
                   'rebounds': [7, 7, 8, 13, 7, 4],
                   'assists': [11, 8, 10, 6, 6, 5]})
#view DataFrame
df
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
#add new row to end of DataFrame
df.loc[len(df.index)] = [20, 7, 5]
#view updated DataFrame
df
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
6 20 7 5
However, the dataframe must have a running-number index; otherwise the add/append will overwrite existing data.
So my question is: is there a simple, foolproof way to just append/add a list to a dataframe?
Thanks very much !!!
>>> df
points rebounds assists
3 10 7 11
1 12 7 8
2 12 8 10
If the indexes are "numbers", you could add 1 to the max index.
>>> df.loc[max(df.index) + 1] = 'my', 'new', 'row'
>>> df
points rebounds assists
3 10 7 11
1 12 7 8
2 12 8 10
4 my new row
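If you'd rather not depend on the index at all, here's a sketch using pd.concat with ignore_index=True (an illustration, not from the answer above); it rebuilds a running-number index, so nothing gets overwritten:
# wrap the list in a one-row DataFrame with matching columns, then concatenate
new_row = pd.DataFrame([[20, 7, 5]], columns=df.columns)
df = pd.concat([df, new_row], ignore_index=True)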
I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2: the first two values map to 1, the next two values to 2, the following two values to 1, and so on. See the expected output below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try:
# round each Step_ID up to the next even number, pairing the values (1,2), (3,4), ...
m = (df1.Step_ID % 2) + df1.Step_ID
# number the consecutive runs of equal m, then alternate the labels 1, 2, 1, 2, ...
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
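An equivalent arithmetic sketch (an illustration; it assumes, like the answer above, that the groups are the value pairs (1, 2), (3, 4), (5, 6), ...):
# map each pair of values to a pair id: 0 for {1, 2}, 1 for {3, 4}, ...
pair_id = (df1['Step_ID'] - 1) // 2
# alternate the labels 1, 2, 1, 2, ... between consecutive pair ids
df1['new_group'] = pair_id % 2 + 1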
Given the following DataFrame:
cols = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])
example = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)
example
A B
a b a b
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I would like to end up with the following one:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
I used this code:
concatenated = pd.DataFrame([])
for A_sub_col in ('a', 'b'):
    for B_sub_col in ('a', 'b'):
        new_frame = example[[('A', A_sub_col), ('B', B_sub_col)]]
        new_frame.columns = ['A', 'B']
        concatenated = pd.concat([concatenated, new_frame])
However, I strongly suspect that there is a more straight-forward, idiomatic way to do that with Pandas. How would one go about it?
Here's an option using a list comprehension:
pd.concat([
    example[[('A', i), ('B', j)]].droplevel(level=1, axis=1)
    for i in example['A'].columns
    for j in example['B'].columns
]).reset_index(drop=True)
Output:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
Here is one way. Not sure it is more pythonic, and it is definitely less readable :-), but on the other hand it does not use explicit loops:
(example
 .apply(lambda c: [list(c)])   # collapse each column into a single-row cell holding a list
 .stack(level=1)               # move the 'a'/'b' level into the index -> columns A and B
 .apply(lambda c: [list(c)])   # collapse again: each cell is now a list of lists
 .explode('A')                 # one row per sub-column of A
 .explode('B')                 # cross with each sub-column of B -> 4 rows of list pairs
 .apply(pd.Series.explode)     # unpack the remaining lists elementwise -> 12 rows
 .reset_index(drop=True)
)
To understand what's going on, it helps to run this one step at a time (a sketch of the first steps follows the output below), but the end result is
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
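For instance, after the first two steps the sub-columns are already paired up; a sketch printing what .stack(level=1) produces for the example above:
step = example.apply(lambda c: [list(c)]).stack(level=1)
print(step)
#              A           B
# 0 a  [0, 4, 8]  [2, 6, 10]
#   b  [1, 5, 9]  [3, 7, 11]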
I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': np.random.randint(1, 30, 11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day  item
3    1        6.0
4    1       12.0
     2       11.0
5    1       10.0
     2       12.0
     3       11.0
     4        1.0
6    3       17.0
     4        2.0
Name: price, dtype: float64
I want to take the trend from the groupby series and put it back into the dataframe's price column. The trend is the variation of each item's price with respect to that item's mean price on the previous day it appeared:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me code this last step; a one- or two-line solution would be most helpful. As the actual dataframe is huge, I would like to avoid iteration.
Hope this helps!
# get the average values
mean_df = df1.groupby(['day','item'])['price'].mean().reset_index()
# rename columns
mean_df.columns = ['day','item','average_price']
# sort by day and item in ascending order
mean_df = mean_df.sort_values(by=['day','item'])
# shift the average price of each item by one day
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
# combine with the original df
df1 = pd.merge(df1, mean_df, on=['day','item'])
# replace the price by the difference from the previous day's average
df1['price'] = df1['price'] - df1['shifted_average_price']
# drop unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
I have a dataframe with the columns ['index', 'A', 'B'] (a full example appears below).
I would like to group the other columns ['B', 'index'] into lists, based on consecutive runs of equal values in column 'A'.
Is there any way to do this using groupby with an aggregate function, or pivot_table, in pandas? I couldn't think of an approach to write the code.
Use:
df = df.reset_index()  # if 'index' is not already a column
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
In pandas < 0.25 (before named aggregation), use a dict instead:
new_df=df.groupby(g,as_index=False).agg({'index':list,'A':'first','B':lambda x: list(x.unique())})
If you also want to de-duplicate the values in the index lists, use the same function for the index column as for B:
new_df=df.groupby(g,as_index=False).agg(index=('index',lambda x: list(x.unique())),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
Here is an example:
df = pd.DataFrame({'index': range(20),
                   'A': [1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 3, 3],
                   'B': [1, 2, 3, 5, 5, 5, 7, 8, 9, 9, 9, 12, 12, 14, 15, 16, 17, 18, 19, 20]})
print(df)
index A B
0 0 1 1
1 1 1 2
2 2 1 3
3 3 1 5
4 4 2 5
5 5 2 5
6 6 0 7
7 7 0 8
8 8 0 9
9 9 1 9
10 10 1 9
11 11 1 12
12 12 1 12
13 13 1 14
14 14 1 15
15 15 0 16
16 16 0 17
17 17 0 18
18 18 3 19
19 19 3 20
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
index A B
0 [0, 1, 2, 3] 1 [1, 2, 3, 5]
1 [4, 5] 2 [5]
2 [6, 7, 8] 0 [7, 8, 9]
3 [9, 10, 11, 12, 13, 14] 1 [9, 12, 14, 15]
4 [15, 16, 17] 0 [16, 17, 18]
5 [18, 19] 3 [19, 20]
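To make the grouping key concrete, this is what g looks like for the example above: each run of consecutive equal values in A gets its own label, and that label is what groupby receives.
print(g.tolist())
# [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6]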