create column based on column values - merge integers - pandas

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2: the first two distinct values should be converted to 1, the next two to 2, the next two to 1, and so on. See the sample data below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)

You can try:
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
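To see why this works, look at the intermediates (values shown for the df1 above). Adding Step_ID % 2 rounds every odd Step_ID up to the next even number, so each consecutive pair of integers collapses to a single value of m; m.ne(m.shift()).cumsum() then numbers each run of equal m, and % 2 with replace(0, 2) alternates those run numbers between 1 and 2:
m.tolist()
# [2, 2, 2, 2, 4, 4, 6, 6, 6, 8, 8, 8, 10, 10, 12, 12]
m.ne(m.shift()).cumsum().tolist()
# [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6]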

Related

Passing Tuple to a function via apply

I am trying to run the function below, which takes two points:
pointA = (2, 3)
pointB = (4, 5)
def Somefunc(pointA, pointB):
    x = pointA[0] + pointB[1]
    return x
Now, when I try to create a separate column based on this function, it throws errors like cannot convert the series to <class 'float'>, so I tried this:
df['T'] = df.apply(Somefunc((df['A'].apply(lambda x: float(x)), df['B'].apply(lambda x: float(x))),
                            (df['C'].apply(lambda x: float(x)), df['D'].apply(lambda x: float(x)))), axis=0)
Sample dataframe below:
A B C D
1 2 3 5
2 4 7 8
4 7 9 0
Any help will be appreciated.
This is the best guess I can make as to what you're trying to do:
df['T'] = df.apply(lambda row: [(row['A'], row['B']), (row['C'], row['D'])], axis=1)
Edit: to apply your function:
df['T'] = df.apply(lambda row: Somefunc((row['A'], row['B']), (row['C'], row['D'])), axis=1)
That being said, a flat tuple of each row's values can be built much more quickly and idiomatically like so:
>>> df
A B C D
0 2 7 3 3
1 3 1 5 7
2 2 0 6 2
3 3 9 5 9
4 0 2 3 7
>>> df['T']=df.apply(tuple,axis=1)
>>> df
A B C D T
0 2 7 3 3 (2, 7, 3, 3)
1 3 1 5 7 (3, 1, 5, 7)
2 2 0 6 2 (2, 0, 6, 2)
3 3 9 5 9 (3, 9, 5, 9)
4 0 2 3 7 (0, 2, 3, 7)
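Since Somefunc only reads pointA[0] and pointB[1], the whole column can also be computed without apply at all; a vectorized sketch, assuming the sample dataframe above:
df['T'] = df['A'] + df['D']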

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
And I would like to find the first entry of each day that has sales greater than 0. As an additional step, I would like to change the 'hour' column of these rows to 9.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to do it in a more functional way?
# Group by day:
groups = df.groupby(by=['day'])
# Get the index of the first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can build a boolean Series checking sales > 0 and group it by df['day'], then get idxmax (the first True per day), filter out days that do not have any value greater than 0 using any, and assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()],'hour']=9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
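This works because idxmax on a boolean Series returns the index label of the first True (True being the maximum), for example:
pd.Series([False, True, True]).idxmax()  # 1 -> the first True wins
The g.any() guard matters: for a day where every sale is 0, idxmax would return that day's first row rather than "no match", so such days must be filtered out.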
Create a mask m that holds, for each row, the first 'sales' value within its group of day and sales-not-zero.
Then, use this mask together with df['sales'] > 0 to change those specific rows' hour to 9 with np.where():
import numpy as np

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
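One caveat with this transform-based approach: it flags every row whose sales equals the first non-zero value of its (day, sales != 0) group, so a day that repeats that first value would get hour set to 9 more than once, e.g.:
dup = pd.DataFrame({'day': [1, 1, 1], 'hour': [10, 10, 10], 'sales': [0, 40, 40]})
# both 40-rows match the group's 'first', so both would be set to 9
The idxmax-based answer above does not have this issue, since idxmax returns a single index per day.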

Pandas Dataframe get trend in column

I have a dataframe:
import numpy as np
import pandas as pd

np.random.seed(1)
df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': np.random.randint(1, 30, 11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to compute the trend from the groupby series and write it back into the dataframe column price. The trend is the variation of the item price with respect to the previous day's mean price of that item:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me to code the last step. A one- or two-line solution would be most helpful. As the actual dataframe is huge, I would like to avoid iterations.
Hope this helps!
#get the average values
mean_df=df1.groupby(['day','item'])['price'].mean().reset_index()
#rename columns
mean_df.columns=['day','item','average_price']
#sort by day and item in ascending order
mean_df=mean_df.sort_values(by=['day','item'])
#shift the price for each item and each day
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
#combine with original df
df1=pd.merge(df1,mean_df,on=['day','item'])
#replace the price by difference of previous day's
df1['price']=df1['price']-df1['shifted_average_price']
#drop unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
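The same logic can be condensed into three lines; a minimal sketch of the merge-based variant, assuming df1 as defined above with its default integer index:
mean_df = df1.groupby(['day', 'item'], as_index=False)['price'].mean()
mean_df['prev'] = mean_df.groupby('item')['price'].shift()  # previous day's mean per item
df1['price'] = df1['price'] - df1.merge(mean_df, on=['day', 'item'], how='left')['prev']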

groupby list of lists of indexes

I have a list of np.arrays representing indexes of a pandas dataframe.
I need to group by index to get the group for each array.
Let's say this is the df:
index values
0 2
1 3
2 2
3 2
4 4
5 4
6 1
7 4
8 4
9 4
And this is the list of np.arrays:
[array([0, 1, 2, 3]), array([6, 7, 8])]
From this data I expect to get 2 groups, without loop operations, as a single groupby object:
group1:
index values
0 2
1 3
2 2
3 2
group2:
index values
6 1
7 4
8 4
I would stress again that finally I need to get a single groupby object.
Thank you!
This still uses a loop (inside the comprehension) to create the groupby key dict:
l=[np.array([0, 1, 2, 3]), np.array([6, 7, 8])]
df=pd.DataFrame([2, 3, 2, 2, 4, 4, 1, 4, 4, 4],columns=['values'])
from collections import ChainMap
L=dict(ChainMap(*[dict.fromkeys(y,x) for x, y in enumerate(l)]))
list(df.groupby(L))
Out[33]:
[(0.0, values
index
0 2
1 3
2 2
3 2), (1.0, values
index
6 1
7 4
8 4)]
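For reference, the ChainMap is not strictly necessary; a plain dict comprehension over l builds the same index-to-group mapping (a sketch, assuming l and df as above), and index values missing from the dict are simply excluded by groupby:
L = {idx: grp for grp, arr in enumerate(l) for idx in arr}
list(df.groupby(L))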
df = pd.DataFrame([2, 3, 2, 2, 4, 4, 1, 4, 4, 4], columns=['values'])
df.index.name = 'index'
l = [np.array([0, 1, 2, 3]), np.array([6, 7, 8])]
group1 = df.loc[pd.Series(l[0])]
group2 = df.loc[pd.Series(l[1])]
This seems like an X-Y problem:
l = [np.array([0,1,2,3]), np.array([6,7,8])]
df_indx = pd.DataFrame(l).stack().reset_index()
df_new = df.assign(foo=df['index'].map(df_indx.set_index(0)['level_0']))
for n, g in df_new.groupby('foo'):
    print(g)
Output:
index values foo
0 0 2 0.0
1 1 3 0.0
2 2 2 0.0
3 3 2 0.0
index values foo
6 6 1 1.0
7 7 4 1.0
8 8 4 1.0

Conditional filter of entire group for DataFrameGroupBy

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
'hour':[4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
8 3 4
9 4 7
And I would like to keep only days where hour has 4 unique values. I would think to do something like this:
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this returns KeyError: True
I am hoping to get this
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
Where we see that day == 3 and day == 4 have been filtered out because, when grouped by day, they don't have 4 unique values of hour. I'm doing this at scale, so simply filtering out those days manually is not an option. I think grouping would be a good way to do this but I can't get it to work. Does anyone have experience with applying functions to a DataFrameGroupBy?
I think you actually need to filter the data:
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
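The KeyError: True in the apply attempt comes from x['hour'].nunique() == 4 evaluating to a single Python bool, so x[True] is interpreted as a column lookup rather than a boolean mask, e.g.:
x = data[data['day'] == 1]
x[x['hour'].nunique() == 4]  # becomes x[True] -> KeyError: True
filter, by contrast, expects exactly one scalar boolean per group and keeps or drops the whole group accordingly.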