Adding a column to a pandas data frame that provides labels based on a condition

I have a data frame filled with time-series temperature data and need to label the equipment status as 'good' or 'bad' based on the temperature. It is 'good' if the temperature is between 35 and 45 and 'bad' otherwise. However, I want to add a condition: if the temperature returns to the appropriate range after being listed as 'bad', it must stay in range for at least 2 days before it is labeled as 'good' again. So far I can label on a more basic level, but I am struggling with implementing the more complicated label switch.
df['status'] = ['bad' if x < 35 or x > 45 else 'good' for x in df['temp']]
Any help would be greatly appreciated. Thank you.

What about an approach like this?
You can make a group_check function for each row and check whether that row has any neighboring offending temperature within its group from the broader df.
This only checks the previous measurements; you would still need a quick boolean check against the current measurement (sketched after the code below) to confirm that the prior measurements are OK AND the current measurement is OK.
def group_check_maker(index, row):
    def group_check(group):
        if len(group) > 1:
            if index in group.index:
                failed_status = False
                # Compare the current row against every other measurement for this equipment.
                for index2, row2 in group.drop(index).iterrows():
                    if ((row['Date'] > row2['Date'])
                            and (row['Date'] - row2['Date'] < pd.Timedelta(days=2))
                            and (row2['Temperature'] < 35 or row2['Temperature'] > 45)):
                        failed_status = True
                if failed_status:
                    return 'Bad'
                else:
                    return 'Good'
    return group_check

def row_checker_maker(df):
    def row_checker(row):
        group_check = group_check_maker(row.name, row)
        return df[df['Equipment ID'] == row['Equipment ID']].groupby('Equipment ID').apply(group_check).iloc[0]
    return row_checker

row_checker = row_checker_maker(df)
df['Neighboring Day Status'] = df.apply(row_checker, axis=1)
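A minimal sketch of the follow-up boolean check mentioned above, assuming the same 'Temperature' column and combining it with the 'Neighboring Day Status' just computed:
# A row is 'Good' only if its own reading is in range AND no prior reading
# within 2 days was out of range (per the 'Neighboring Day Status' column).
in_range = df['Temperature'].between(35, 45)
df['Status'] = (in_range & df['Neighboring Day Status'].eq('Good')).map({True: 'Good', False: 'Bad'})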

import numpy as np
df['status'] = np.where((df['temp'] < 35) | (df['temp'] > 45), 'bad', 'good')
This should handle the basic in-range/out-of-range labelling.
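A quick sanity check on a few made-up temperatures (the values here are illustrative, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [36, 50, 30, 40]})
df['status'] = np.where((df['temp'] < 35) | (df['temp'] > 45), 'bad', 'good')
print(df['status'].tolist())  # ['good', 'bad', 'bad', 'good']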

You can create a pd.Series filled with the value 'bad', mask it (set to NaN) with where wherever the temperature is between 35 and 45, then propagate 'bad' to the next two empty rows with ffill and a limit of 2, and finally fillna the rest with 'good', such as:
# dummy df
df = pd.DataFrame({'temp': [36, 39, 24, 34, 56, 42, 40, 38, 36, 37, 32, 36, 23]})
df['status'] = (pd.Series('bad', index=df.index)
                  .where(df.temp.lt(35) | df.temp.gt(45))
                  .ffill(limit=2)
                  .fillna('good'))
print(df)
print (df)
    temp status
0     36   good
1     39   good
2     24    bad
3     34    bad
4     56    bad
5     42    bad   # 42 is in range, but the previous row is bad, so still bad
6     40    bad   # 40 is in range, but the row two back is bad, so still bad
7     38   good   # good again from here
8     36   good
9     37   good
10    32    bad
11    36    bad
12    23    bad

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time, and rdiv to perform the conversion (1 mpg corresponds to 235.15 litres per 100 km, so litres/100 km = 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

How To Create a new DF Column By checking a lookup table and inserting the appropriate value

I have 2 dataframes, the first is a cartesian joined table which has the below structure:
| track_x | Energy_x | Camelot_x | BPM_x | join | track_y | Energy_y | Camelot_y | BPM_y | Energy_Distance | BPM Distance |
|-|-|-|-|-|-|-|-|-|-|-|
| Buju Banton - Blessed | 5 | 10A | 75 | 1 | Gyptian - Hold You - Hold Yuh | 6 | 4B | 67 | 1 | 8 |
where it has around 10k rows with every track referencing every other track.
I then have a second table which I am using to store the distances between Camelot_x and Camelot_y; its columns and index are the Camelot codes (10A, 4B and so on) and the values are integer distances.
|    | 1A | 1B | 2A |
|-|-|-|-|
| 1A | 0 | 1 | 1 |
| 1B | 1 | 0 | 1 |
| 2A | 1 | 1 | 0 |
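For reference, a small distance_df matching the sample above can be reconstructed like this (only the three codes shown; the real table covers every Camelot code):
import pandas as pd

codes = ['1A', '1B', '2A']
distance_df = pd.DataFrame([[0, 1, 1],
                            [1, 0, 1],
                            [1, 1, 0]],
                           index=codes, columns=codes)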
I am struggling, however, to retrieve the corresponding value from that table.
I have used code:
def harmonic_distance_lookup(x, y):
    value = distance_df.lookup(x, y)
    return value

ct_df["Harmonic Distance"] = ct_df.apply(harmonic_distance_lookup(ct_df["Camelot_x"], ct_df["Camelot_y"]), axis=1)
However, this just spins forever and doesn't seem to be doing anything.
Is there a better method of doing this? I want to check the distance between Camelot_x and Camelot_y for every row and append it to a new column.
Expected output:
|track_x|Energy_x|Camelot_x|BPM_x|join|track_y|Energy_y|Camelot_y|BPM_y|Energy_Distance|BPM Distance| Harmonic Distance|
|-|-|-|-|-|-|-|-|-|-|-|-|
|Buju Banton - Blessed|5|10A|75|1|Gyptian - Hold You - Hold Yuh|6|4B|67|1|8|6|
Working Answer:
def harmonic_distance_lookup(x, y):
    value = distance_df.at[x, y]
    return value

ct_df["Harmonic Distance"] = ct_df.apply(lambda x: harmonic_distance_lookup(x["Camelot_x"], x["Camelot_y"]), axis=1)
ct_df
Try this:
ct_df.apply(lambda x: harmonic_distance_lookup(x["Camelot_x"], x["Camelot_y"]), axis=1)
Also, DataFrame.lookup is deprecated as of pandas 1.2.0; you might want to look at df.at instead.
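If the row-wise apply is slow on ~10k rows, a vectorized sketch is also possible, assuming distance_df's index and columns hold the Camelot codes as in the question: stack the lookup table into a Series keyed by (row label, column label) and align it with the (Camelot_x, Camelot_y) pairs.
# Each (Camelot_x, Camelot_y) pair is looked up against the stacked distance table.
pairs = pd.MultiIndex.from_arrays([ct_df["Camelot_x"], ct_df["Camelot_y"]])
ct_df["Harmonic Distance"] = distance_df.stack().reindex(pairs).to_numpy()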

Collapse pandas DataFrame based on daily column value

I have a pandas DataFrame with multiple measurements per day (for example hourly measurements, but that is not necessarily the case), but I want to keep only the hour for which a certain column is the daily minimum.
One day in my data frame looks somewhat like this:
DATE Value Distance
17 1979-1-2T00:00:00.0 15.5669870447436 34.87
18 1979-1-2T01:00:00.0 81.6306803714536 31.342
19 1979-1-2T02:00:00.0 83.1854759740486 33.264
20 1979-1-2T03:00:00.0 23.8659679630303 32.34
21 1979-1-2T04:00:00.0 63.2755504429306 31.973
22 1979-1-2T05:00:00.0 91.2129044773733 34.091
23 1979-1-2T06:00:00.0 76.493130052689 36.837
24 1979-1-2T07:00:00.0 63.5443183375785 34.383
25 1979-1-2T08:00:00.0 40.9255407683688 35.275
26 1979-1-2T09:00:00.0 54.5583051827551 32.152
27 1979-1-2T10:00:00.0 26.2690011881422 35.104
28 1979-1-2T11:00:00.0 71.3059740399097 37.28
29 1979-1-2T12:00:00.0 54.0111262724049 38.963
30 1979-1-2T13:00:00.0 91.3518048568241 36.696
31 1979-1-2T14:00:00.0 81.7651763485069 34.832
32 1979-1-2T15:00:00.0 90.5695814525067 35.473
33 1979-1-2T16:00:00.0 88.4550315358515 30.998
34 1979-1-2T17:00:00.0 41.6276969038137 32.353
35 1979-1-2T18:00:00.0 79.3818377264749 30.15
36 1979-1-2T19:00:00.0 79.1672568582629 37.07
37 1979-1-2T20:00:00.0 1.48337999844262 28.525
38 1979-1-2T21:00:00.0 87.9110385474789 38.323
39 1979-1-2T22:00:00.0 38.6646421460678 23.251
40 1979-1-2T23:00:00.0 88.4920153764757 31.236
I would like to keep all rows that have the minimum "distance" per day, so for the one day shown above, one would have only one row left (the one with index value 39). I know how to collapse the data frame so that I only have the Distance column left. I can do that - if I first set the DATE as index - with
df_short = df.groupby(df.index.floor('D'))["Distance"].min()
But I also want the Value column in my final result. How do I keep all columns?
It doesn't seem to work if I do
df_short = df.groupby(df.index.floor('D')).min(["Distance"])
This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does.
Maybe this is already posted somewhere, but I have trouble finding it.
You can use aggregate
df_short = df.groupby(df.index.floor('D')).agg({'Distance': min, 'Value': max})
If you want the kept Value to come from the same row as the minimum Distance:
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
Make a datetime Index:
df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)
Resample and find the min Distance's location:
df.loc[df.resample('D')['Distance'].idxmin()]
Output:
Value Distance
DATE
1979-01-02 22:00:00 38.664642 23.251
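If you prefer not to set DATE as the index, the same idxmin pattern can be grouped on the floored timestamps directly (a sketch combining the two answers above):
# Group rows by calendar day and keep only the row with the smallest Distance per day.
day = pd.to_datetime(df['DATE']).dt.floor('D')
df_short = df.loc[df.groupby(day)['Distance'].idxmin()]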

Avoiding loops in python/pandas

I can do basic stuff in python/pandas, but I still struggle with the "no loops necessary" world of pandas. I tend to fall back on converting to lists, doing VBA-style loops, and then bringing those lists back into dfs. I know there is a simpler way, but I can't figure it out.
A simple example is just a very basic strategy of creating a signal of -1 if a series goes above 70 and keeping it at -1 until the series breaks below 30, at which point the signal changes to 1 and stays there until a value above 70 again, and so on.
I can do this via simple list looping, but I know this is far from "Pythonic"! Can anyone help "translating" this to some nicer code without loops?
# rsi_list is just a list from a df column of numbers. Simple example:
rsi = {'rsi': [35, 45, 75, 56, 34, 29, 26, 34, 67, 78]}
rsi = pd.DataFrame(rsi)
rsi_list = rsi['rsi'].tolist()

signal_list = []
hasShort = 0
hasLong = 0
for i in range(len(rsi_list) - 1):
    if rsi_list[i] >= 70 or hasShort == 1:
        signal_list.append(-1)
        if rsi_list[i + 1] >= 30:
            hasShort = 1
        else:
            hasShort = 0
    elif rsi_list[i] <= 30 or hasLong == 1:
        signal_list.append(1)
        if rsi_list[i + 1] <= 70:
            hasLong = 1
        else:
            hasLong = 0
    else:
        signal_list.append(0)

# last part just so the list has the same length as the original df, since I put it back as a column
if rsi_list[-1] >= 70:
    signal_list.append(-1)
else:
    signal_list.append(1)
First clip the values to a lower bound of 30 and an upper bound of 70, use where to set to NaN all the values that are not exactly 30 or 70, replace 30 with 1 and 70 with -1, propagate these values with ffill, and finally fillna with 0 for the values before the first 30 or 70:
rsi['rsi_cut'] = (
    rsi['rsi'].clip(lower=30, upper=70)
              .where(lambda x: x.isin([30, 70]))
              .replace({30: 1, 70: -1})
              .ffill()
              .fillna(0)
)
print(rsi)
rsi rsi_cut
0 35 0.0
1 45 0.0
2 75 -1.0
3 56 -1.0
4 34 -1.0
5 29 1.0
6 26 1.0
7 34 1.0
8 67 1.0
9 78 -1.0
Edit: maybe a bit easier, use le (less than or equal to 30) and gt (greater than 70), take the difference, then replace the 0s using the ffill method:
print((rsi['rsi'].le(30).astype(int) - rsi['rsi'].gt(70))
      .replace(to_replace=0, method='ffill'))
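Note that newer pandas versions deprecate the method argument of replace; if that warning shows up, the same propagation can be written with mask and ffill (continuing from the rsi frame above):
sig = rsi['rsi'].le(30).astype(int) - rsi['rsi'].gt(70).astype(int)
# Zeros become NaN, the last +/-1 is forward-filled, and any leading NaNs become 0.
print(sig.mask(sig.eq(0)).ffill().fillna(0).astype(int))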

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add numbers in a series up to a threshold and then restart the counter? The intention is to form a groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=np.object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1: Interpreting "the use-case is to find the average price of every 75 sold amounts of the thing" one way gives my solution below. If you are trying to do this calculation the "hard way" instead of with pd.cut, here is a solution that works well, but its speed and memory use depend on the cumsum() of the amount column, which you can check with df['amount'].cumsum(). The output takes roughly 1 second per 10 million of that cumsum, as that is how many rows np.repeat creates. So this solution is not horrible if the cumsum is under about 10 million (around 1 second) or even 100 million (around 10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new series with cumsum and then use pd.cut, passing bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = (pd.IntervalIndex(df['Group']).left.astype(str) + '-'
               + pd.IntervalIndex(df['Group']).right.astype(str))
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286