Classify a value under certain conditions in a pandas DataFrame

I have this dataframe:
value  limit_1  limit_2  limit_3  limit_4
10     2        3        7        10
11     5        6        11       13
2      0.3      0.9      2.01     2.99
I want to add another column called CLASS that classifies the value column this way:
if value <= limit_1 then 1
if value > limit_1 and value <= limit_2 then 2
if value > limit_2 and value <= limit_3 then 3
if value > limit_3 then 4
to get this result:
value  limit_1  limit_2  limit_3  limit_4  CLASS
10     2        3        7        10       4
11     5        6        11       13       3
2      0.3      0.9      2.01     2.99     3
I know I could get these 'if's to work, but my dataframe has 2 million rows and I need the fastest way to perform such a classification.
I tried the .cut function but the result was not what I expected/wanted.
Thanks

We can use the rank method over the column axis (axis=1): since value is the first column, its rank within each row (with ties broken in column order by method="first") is exactly the class:
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
   value  limit_1  limit_2  limit_3  limit_4  CLASS
0     10      2.0      3.0     7.00    10.00      4
1     11      5.0      6.0    11.00    13.00      3
2      2      0.3      0.9     2.01     2.99      3
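For reference, a self-contained reproduction of this approach (column names taken from the question). It relies on value being the first column and on method="first" breaking ties in column order, so a value equal to a limit ranks below it, matching the <= boundaries:

```python
import pandas as pd

df = pd.DataFrame({"value":   [10, 11, 2],
                   "limit_1": [2, 5, 0.3],
                   "limit_2": [3, 6, 0.9],
                   "limit_3": [7, 11, 2.01],
                   "limit_4": [10, 13, 2.99]})

# Rank each row across columns; the rank of "value" within its row is the class.
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
```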

We can use np.select:
import numpy as np

conditions = [df["value"] <= df["limit_1"],
              df["value"].between(df["limit_1"], df["limit_2"]),
              df["value"].between(df["limit_2"], df["limit_3"]),
              df["value"] > df["limit_3"]]
# np.select takes the first matching condition, so the inclusive
# boundaries of between() resolve to the lower class.
df["CLASS"] = np.select(conditions, [1, 2, 3, 4])
>>> df
   value  limit_1  limit_2  limit_3  limit_4  CLASS
0     10      2.0      3.0     7.00    10.00      4
1     11      5.0      6.0    11.00    13.00      3
2      2      0.3      0.9     2.01     2.99      3
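Another vectorized option (a sketch of my own, not from the answers above): count how many of the first three limits each value strictly exceeds; the class is that count plus one. This assumes the column names from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value":   [10, 11, 2],
                   "limit_1": [2, 5, 0.3],
                   "limit_2": [3, 6, 0.9],
                   "limit_3": [7, 11, 2.01],
                   "limit_4": [10, 13, 2.99]})

limits = df[["limit_1", "limit_2", "limit_3"]].to_numpy()
# For each row, count limits strictly below value; class = count + 1.
# A value equal to a limit does not exceed it, matching the <= rules.
df["CLASS"] = (df["value"].to_numpy()[:, None] > limits).sum(axis=1) + 1
```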

Related

How to count repetitive unchanged signs of a column?

I have a column. How can I make a new column that counts consecutive positive and negative signs?
col1
-5
-3
-7
4
5
-0.5
6
8
9
col1  count_sign
-5    3
-3    3
-7    3
4     2
5     2
-0.5  1
6     3
8     3
9     3
The first 3 rows are 3 because we have 3 negative signs in the first 3 rows, then 2 positive signs, and so on.
import numpy as np

# identify the change of signs among rows,
# marking count as NaN where the sign is the same, else 1
df['count'] = np.where(np.sign(df['col1']).diff().eq(0), np.nan, 1)
# cumsum + ffill to label each run of equal signs with a group number
df['count'] = df['count'].cumsum().ffill()
# groupby the run labels and return each group's size using transform
df['count'] = df.groupby('count')['col1'].transform('size')
df
col1 count
0 -5.0 3
1 -3.0 3
2 -7.0 3
3 4.0 2
4 5.0 2
5 -0.5 1
6 6.0 3
7 8.0 3
8 9.0 3
To add a sign to the count values:
df['count'] = np.where(np.sign(df['col1']).diff().eq(0), np.nan, 1)
df['count'] = df['count'].cumsum().ffill()
df['count'] = df.groupby('count')['col1'].transform('size') * np.sign(df['col1'])
df
col1 count
0 -5.0 -3.0
1 -3.0 -3.0
2 -7.0 -3.0
3 4.0 2.0
4 5.0 2.0
5 -0.5 -1.0
6 6.0 3.0
7 8.0 3.0
8 9.0 3.0
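A variant of the same idea (a sketch, not part of the original answer): compare each sign with the previous one via shift and build the run labels with a single cumsum, which avoids the intermediate NaN column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [-5, -3, -7, 4, 5, -0.5, 6, 8, 9]})

signs = np.sign(df["col1"])
# A new run starts wherever the sign differs from the previous row;
# cumsum over those change points labels each run.
runs = signs.ne(signs.shift()).cumsum()
# Run size, signed by the run's sign.
df["count"] = df.groupby(runs)["col1"].transform("size") * signs
```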

Pandas: Get rolling mean with a add operation in between

My Pandas df is like:
ID  delta  price
1   -2     4
2   2      5
3   -3     3
4   0.8
5   0.9
6   -2.3
7   2.8
8   1
9   1
10  1
11  1
12  1
Pandas already has a robust built-in mean calculation. I need to use it slightly differently.
In my df, price at row 4 would be the sum of (a) the rolling mean of price in rows 1, 2, 3 and (b) delta at row 4.
Once this is computed, I would move to row 5: (a) the rolling mean of price in rows 2, 3, 4 plus (b) delta at row 5 gives price at row 5, and so on.
I can iterate over rows to get this, but my actual dataframe is quite big and iterating over rows would slow things down. Is there a better way to achieve this?
I do not think pandas has a method that can use the previously calculated value in the next calculation, so fill the missing prices with a loop (.loc slicing is label-inclusive, and sum() skips the still-missing price at x):
n = 3
for x in df.index[df.price.isna()]:
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Out[150]:
ID delta price
0 1 -2.0 4.000000
1 2 2.0 5.000000
2 3 -3.0 3.000000
3 4 0.8 3.200000
4 5 0.9 3.025000
5 6 -2.3 1.731250
6 7 2.8 2.689062
7 8 1.0 2.111328
8 9 1.0 1.882910
9 10 1.0 1.920825
10 11 1.0 1.728766
11 12 1.0 1.633125
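A self-contained version of the loop above, with the starting frame assumed from the question (only the first three prices are known):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": range(1, 13),
                   "delta": [-2, 2, -3, 0.8, 0.9, -2.3, 2.8, 1, 1, 1, 1, 1],
                   "price": [4, 5, 3] + [np.nan] * 9})

n = 3
for x in df.index[df.price.isna()]:
    # .loc slicing is label-inclusive; the still-NaN price at x is skipped by sum().
    df.loc[x, "price"] = (df.loc[x - n:x, "price"].sum() + df.loc[x, "delta"]) / 4
```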

resample data within each group in pandas

I have a dataframe with different ids and possibly overlapping times, with a time step of 0.4 seconds. I would like to resample the average speed for each id with a time step of 0.8 seconds.
   time  id  speed
0  0.0   1   0
1  0.4   1   3
2  0.8   1   6
3  1.2   1   9
4  0.8   2   12
5  1.2   2   15
6  1.6   2   18
An example can be created with the following code:
import numpy as np
import pandas as pd

x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10) * 0.4
b = np.arange(15) * 0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = np.arange(25) * 3
The time column is converted to datetime type by
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
block_size = int(0.8 // 0.4)
blocks = df.groupby('id').cumcount() // block_size
df.groupby(['id', blocks]).agg({'time': 'first', 'speed': 'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
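If the datetime column from the question should be used, a resample-based sketch (my variant, not the original answer) is also possible. origin="start" (available since pandas 1.1) anchors each id's bins at that id's own first timestamp, so the bins match the cumcount blocks above rather than fixed clock boundaries:

```python
import numpy as np
import pandas as pd

# Data as constructed in the question
x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
t = np.hstack((np.arange(10) * 0.4, np.arange(15) * 0.4 + 2))
df = pd.DataFrame({"time": t, "id": x, "speed": np.arange(25) * 3})
df["re_time"] = pd.to_datetime(df["time"], unit="s")

# Resample each id separately into 0.8 s bins starting at the id's first timestamp.
out = (df.groupby("id")
         .apply(lambda g: g.set_index("re_time")["speed"]
                           .resample("800ms", origin="start").mean()))
```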

Operations with multiple dataframes partially sharing indexes in pandas

I have two dataframes: (i) one with two index levels and two header levels, and (ii) one with a single index and a single header. The second level of each axis in the first dataframe corresponds to the axes of the second dataframe. I need to multiply both dataframes based on that relation between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd

df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1, .3, .6], [.4, .4, .3], [.5, .4, .1]],
                   index=[1, 2, 3], columns=[1, 2, 3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
     2020         2021
        1   2  3     1   2  3
1 1     9  10  2     1   6  5
  2     4   0  3     4   6  6
  3     9   3  9     1   2  3
2 1     3   5  9     3   9  0
  2     4   4  8     5  10  5
  3     5   3  1     8   5  6

     1    2    3
1  0.1  0.3  0.6
2  0.4  0.4  0.3
3  0.5  0.4  0.1

     2020            2021
        1    2    3     1    2    3
1 1   0.9  3.0  1.2   0.1  1.8  3.0
  2   1.6  0.0  0.9   1.6  2.4  1.8
  3   4.5  1.2  0.9   0.5  0.8  0.3
2 1   0.3  1.5  5.4   0.3  2.7  0.0
  2   1.6  1.6  2.4   2.0  4.0  1.5
  3   2.5  1.2  0.1   4.0  2.0  0.6
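To make the alignment explicit, the same result can be reproduced by hand (a sketch, assuming the df and df2 constructed above): repeat df2's rows and columns to match level 1 of df's axes, then multiply element-wise:

```python
import pandas as pd

# df and df2 as constructed in the answer above
df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
df2 = pd.DataFrame([[.1, .3, .6], [.4, .4, .3], [.5, .4, .1]],
                   index=[1, 2, 3], columns=[1, 2, 3])

# Reindex df2 onto level 1 of df's index and columns, then multiply element-wise.
aligned = pd.DataFrame(df2.reindex(index=df.index.get_level_values(1),
                                   columns=df.columns.get_level_values(1)).values,
                       index=df.index, columns=df.columns)
result = df * aligned
```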

Pandas Python Moving Rows

I am new to Pandas and I have a csv file where I want to move every 2nd and 3rd row into the value1 and value2 columns. Could someone please help me out? I can't seem to figure it out.
data, value1, value2
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
The output would turn into this:
data, value1, value2
1.00 2.00 3.00
4.00 5.00 6.00
7.00 8.00 9.00
A more general solution is to create a MultiIndex.from_arrays from the modulo and floor division of numpy.arange, then unstack:
print (df)
data
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
import numpy as np

a = np.arange(len(df.index))
print (a)
[0 1 2 3 4 5 6 7 8 9]
df.index = pd.MultiIndex.from_arrays([a % 3, a // 3])
print (df)
data
0 0 1.0
1 0 2.0
2 0 3.0
0 1 4.0
1 1 5.0
2 1 6.0
0 2 7.0
1 2 8.0
2 2 9.0
0 3 10.0
df1 = df['data'].unstack(0)
df1.columns=['data','value1','value2']
print (df1)
data value1 value2
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
3 10.0 NaN NaN
You can use the numpy method reshape, then convert back to a dataframe with pd.DataFrame and name your columns (this requires the length to be an exact multiple of 3):
pd.DataFrame(df.values.reshape(-1, 3), columns=['data', 'value1', 'value2'])
Output:
data value1 value2
0 1 2 3
1 4 5 6
2 7 8 9
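When the length is not an exact multiple of 3 (as with the 10-row frame in the first answer), the values can be padded with NaN before reshaping; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": np.arange(1.0, 11.0)})  # 10 rows, not a multiple of 3

n_cols = 3
n_rows = -(-len(df) // n_cols)            # ceiling division
# Pad the flat values with NaN up to a full rectangle, then reshape.
padded = np.full(n_rows * n_cols, np.nan)
padded[:len(df)] = df["data"].to_numpy()
out = pd.DataFrame(padded.reshape(n_rows, n_cols),
                   columns=["data", "value1", "value2"])
```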