Avoiding loops in python/pandas - pandas

I can do basic stuff in python/pandas, but I still struggle with the "no loops necessary" world of pandas. I tend to fall back to converting columns to lists, looping over them like I would in VBA, and then putting the lists back into dataframes. I know there is a simpler way, but I can't figure it out.
A simple example is a very basic strategy: create a signal of -1 when a series rises above 70 and keep it at -1 until the series breaks below 30, at which point the signal changes to 1 and stays there until the next value above 70, and so on.
I can do this via simple list looping, but I know this is far from "Pythonic"! Can anyone help "translate" it into some nicer code without loops?
import pandas as pd

# rsi_list is just a list from a df column of numbers. Simple example:
rsi = {'rsi': [35, 45, 75, 56, 34, 29, 26, 34, 67, 78]}
rsi = pd.DataFrame(rsi)
rsi_list = rsi['rsi'].tolist()

signal_list = []
hasShort = 0
hasLong = 0
for i in range(len(rsi_list) - 1):
    if rsi_list[i] >= 70 or hasShort == 1:
        signal_list.append(-1)
        if rsi_list[i + 1] >= 30:
            hasShort = 1
        else:
            hasShort = 0
    elif rsi_list[i] <= 30 or hasLong == 1:
        signal_list.append(1)
        if rsi_list[i + 1] <= 70:
            hasLong = 1
        else:
            hasLong = 0
    else:
        signal_list.append(0)

# last part just so the list has the same length as the original df, since I put it back as a column
if rsi_list[-1] >= 70:
    signal_list.append(-1)
else:
    signal_list.append(1)

First clip the values to a lower bound of 30 and an upper bound of 70, use where to turn every value that is not exactly 30 or 70 into NaN, replace 30 with 1 and 70 with -1, propagate those values forward with ffill, and finally fillna with 0 for the values before the first 30 or 70.
rsi['rsi_cut'] = (
    rsi['rsi'].clip(lower=30, upper=70)
              .where(lambda x: x.isin([30, 70]))
              .replace({30: 1, 70: -1})
              .ffill()
              .fillna(0)
)
print(rsi)
   rsi  rsi_cut
0   35      0.0
1   45      0.0
2   75     -1.0
3   56     -1.0
4   34     -1.0
5   29      1.0
6   26      1.0
7   34      1.0
8   67      1.0
9   78     -1.0
Edit: maybe a bit easier, use le (less than or equal) and gt (greater than), take the difference, then replace the 0s using the ffill method:
print((rsi['rsi'].le(30).astype(int) - rsi['rsi'].gt(70))
.replace(to_replace=0, method='ffill'))
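On newer pandas versions, where the method argument of replace is deprecated, roughly the same idea can be written by masking the zeros and forward-filling. This is only a sketch of an equivalent, not part of the original answer:
import pandas as pd

rsi = pd.DataFrame({'rsi': [35, 45, 75, 56, 34, 29, 26, 34, 67, 78]})

# 1 where rsi <= 30, -1 where rsi > 70, 0 elsewhere
signal = rsi['rsi'].le(30).astype(int) - rsi['rsi'].gt(70).astype(int)
# turn the 0s into NaN, carry the last non-zero signal forward,
# and leave anything before the first signal at 0
signal = signal.mask(signal.eq(0)).ffill().fillna(0).astype(int)
print(signal)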

Related

Removing the .0 from a pandas column

After a simple merge of two dataframes, the X column becomes an object and ".0" is appended to the values for no apparent reason. I tried replacing the NaN values with an integer and then converting the whole column to an integer, hoping the .0 would be gone; the code runs but it doesn't actually change the dtype of that column. I also tried removing the .0 with rstrip, but that strips too much and even values like 249123.0 become NaN, which doesn't make sense. I know this is a very basic issue, but I'm not sure what else to try at this point.
Input:
Age ID
22 23105.0
34 214541.0
51 0
8 62341.0
Desired output:
Age ID
22 23105
34 214541
51 0
8 62341
Any ideas would be much appreciated.
One of the ways to get rid of the trailing .0 in an object column is to use pandas.DataFrame.replace:
import numpy as np

df['ID'] = df['ID'].replace(r'\.0$', '', regex=True).astype(np.int64)
# Output :
print(df)
Age ID
0 22 23105
1 34 214541
2 51 0
3 8 62341
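Another option, assuming the column holds numbers stored as strings (plus plain zeros), is to parse it with pd.to_numeric and then cast to an integer dtype; the nullable Int64 dtype also tolerates any NaN that may still be around. A small sketch with a reconstructed frame:
import pandas as pd

df = pd.DataFrame({'Age': [22, 34, 51, 8],
                   'ID': ['23105.0', '214541.0', 0, '62341.0']})

# parse everything as numeric first, then cast to a (nullable) integer dtype
df['ID'] = pd.to_numeric(df['ID']).astype('Int64')
print(df)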

How to replace certain values in the dataframe using tidyverse in R?

There are some values in my dataset(df) that needs to be replaced with correct values e.g.,
Height  Disease  Weight>90kg
1.58    1        0
1.64    0        1
1.67    1        0
52      0        1
67      0        0
I want to replace the first three values with 158, 164 and 167, and the last two with 152 and 167 (adding a 1 at the beginning).
I tried the following code but it doesn't work:
data_clean <- function(df) {
  df[height == 1.58] <- 158
  df
}
data_clean(df)
Please help!
Using recode you can explicitly recode the values:
library(dplyr)

df <- mutate(df, height = recode(height,
                                 `1.58` = 158,
                                 `1.64` = 164,
                                 `1.67` = 167,
                                 `52` = 152,
                                 `67` = 167))
However, this obviously is a manual process and not ideal for a case with many values that need recoding.
Alternatively, you could do something like:
df <- mutate(df, height = case_when(
  height < 2.5 ~ height * 100,
  height < 100 ~ height + 100  # any value not matched by a condition becomes NA
))
This really depends on the makeup of your data, but for the example given it works. Just be careful about what your assumptions are. You could also have used is.double and is.integer.

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add up numbers in a series until a threshold is reached and then restart the counter? The intention is to then groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
import numpy as np

sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1. Interpreting the sentence "The use-case is to find the average price of every 75 sold amounts of the thing" one way leads to my solution below. If you want to do this calculation the "hard way" instead of using pd.cut, here is a solution that works well, but its speed and memory use depend on the cumulative sum of the amount column (check it with df['amount'].cumsum()), because that is how many rows np.repeat creates. It takes roughly 1 second per 10 million of cumulative sum, so this solution is not horrible if you have less than ~10 million in cumsum (about 1 second) or even 100 million (about 10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong, but keeping it just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new series with cumsum, then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286
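As an aside, if what you literally want is a running total that resets once it hits the threshold (the "restart the counter" part of the question), a plain Python loop over the amounts is a simple way to build the bin labels. The helper below is only a sketch (the name reset_cumsum_bins is made up), and note that it averages the row prices inside each bin rather than weighting them by amount as Solution #1 does:
import pandas as pd

df = pd.DataFrame({
    'amount': [27, 17, 33, 78, 15, 96, 108, 143, 82, 70],
    'price': [22.372505, 126.562276, 101.061767, 152.076373, 103.482099,
              41.662766, 98.460743, 126.125865, 87.749286, 56.065133],
})

def reset_cumsum_bins(values, threshold=75):
    # assign a bin id, starting a new bin once the running total reaches the threshold
    bins, running, bin_id = [], 0, 0
    for v in values:
        running += v
        bins.append(bin_id)
        if running >= threshold:
            running, bin_id = 0, bin_id + 1
    return bins

df['bin'] = reset_cumsum_bins(df['amount'], threshold=75)
print(df.groupby('bin')['price'].mean())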

Adding column to pandas data frame that provides labels based on condition

I have a data frame filled with time series temperature data and need to label the equipment status as 'good' or 'bad' based on the temperature. It is 'good' if it is between 35 and 45 and 'bad' otherwise. However, I want to add a condition that if it returns to the appropriate temperature range after being listed as 'bad', it must be 'good' for at least 2 days before it is labeled as 'good' again. So far, I can label on a more basic level, but I'm struggling with implementing the more complicated label switch.
df['status'] = ['bad' if x <35 or x >45 else 'good' for x in df['temp']]
Any help would be greatly appreciated. Thank you.
What about an approach like this?
You can make a group_check function for each row, and check if that row has any neighboring offending temperature within the group from the broader df.
This will only check the previous measurements. You would need to do a quick boolean check against the current measurement to confirm the prior measurements are OK AND the current measurement is OK.
import pandas as pd

def group_check_maker(index, row):
    def group_check(group):
        if len(group) > 1:
            if index in group.index:
                failed_status = False
                for index2, row2 in group.drop(index).iterrows():
                    if (row['Date'] > row2['Date']) and (row['Date'] - row2['Date'] < pd.Timedelta(days=2)) and (row2['Temperature'] < 35 or row2['Temperature'] > 45):
                        failed_status = True
                if failed_status:
                    return 'Bad'
                else:
                    return 'Good'
    return group_check

def row_checker_maker(df):
    def row_checker(row):
        group_check = group_check_maker(row.name, row)
        return df[df['Equipment ID'] == row['Equipment ID']].groupby('Equipment ID').apply(group_check).iloc[0]
    return row_checker

row_checker = row_checker_maker(df)
df['Neighboring Day Status'] = df.apply(row_checker, axis=1)
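A minimal sketch of that final combination against the current measurement, using the columns created above (the 'Combined Status' name is my own), might look like this:
import numpy as np

# current reading in range AND no offending reading in the previous 2 days
current_ok = df['Temperature'].between(35, 45)
neighbors_ok = df['Neighboring Day Status'].eq('Good')
df['Combined Status'] = np.where(current_ok & neighbors_ok, 'Good', 'Bad')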
import numpy as np
df['status'] = np.where((df['temp'] < 35) | (df['temp'] > 45), 'bad', 'good')
This should solve the issue.
You can create a pd.Series filled with the value 'bad', use where to keep it only where temp is below 35 or above 45 (NaN elsewhere), propagate the value 'bad' to the next two empty rows with ffill and a limit of 2, and finally fillna the rest with 'good', such as:
#dummy df
df = pd.DataFrame({'temp': [36, 39, 24, 34 ,56, 42, 40, 38, 36, 37, 32, 36, 23]})
df['status'] = pd.Series('bad', index=df.index).where(df.temp.lt(35)|df.temp.gt(45))\
.ffill(limit=2).fillna('good')
print (df)
temp status
0 36 good
1 39 good
2 24 bad
3 34 bad
4 56 bad
5 42 bad #here it is 42 but the previous row is bad so still bad
6 40 bad #here it is 40 but the second previous row is bad so still bad
7 38 good #here it is good then
8 36 good
9 37 good
10 32 bad
11 36 bad
12 23 bad

Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
etc.
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?
I'll be as clear as possible without a full example, which piRSquared suggested you provide.
The output of value_counts is a Series, so the values from your original Series can be retrieved from the value_counts index. Displaying the result for only one of the values is then just a matter of slicing your Series:
my_value_count = weather_data["snowfall"].value_counts()
my_value_count.loc['0.0']
output:
0.0 12683
If you want to display only for a list of variables:
my_value_count.loc[my_value_count.index.isin(['0.0','0.2','0.1'])]
output:
0.0 12683
0.2 32
0.1 31
As you have M and T in your values, I suspect the other values will be treated as strings and not float. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
output:
0.0 12683
0.2 32
0.1 31
0.3 18
Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want; then you can use groupby and sum. For example, you might want to count the frequency only when a column has a non-null value. In my case, only when there is an actual non-null completion date:
import numpy as np

dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')
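Applied back to the snowfall column from the question, the same idea of building a condition first could look like the sketch below; it assumes the values are stored as strings (because of M and T), so they are coerced to numbers before filtering:
import pandas as pd

# keep only rows whose snowfall parses to a number below 0.3, then count them
numeric = pd.to_numeric(weather_data['snowfall'], errors='coerce')
print(weather_data.loc[numeric < 0.3, 'snowfall'].value_counts())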