How to replace certain values in a dataframe using tidyverse in R?

There are some values in my dataset (df) that need to be replaced with correct values, e.g.:

Height  Disease  Weight>90kg
1.58    1        0
1.64    0        1
1.67    1        0
52      0        1
67      0        0
I want to replace the first three values with '158', '164' & '167', and the next two with 152 and 167 (adding a 1 at the beginning).
I tried the following code but it doesn't work:
data_clean <- function(df) {
  df[height == 1.58] <- 158
  df
}
data_clean(df)
Please help!

Using recode you can explicitly recode the values (note that numeric values to be matched must be wrapped in backticks):
df <- mutate(df, height = recode(height,
  `1.58` = 158,
  `1.64` = 164,
  `1.67` = 167,
  `52` = 152,
  `67` = 167))
However, this obviously is a manual process and not ideal for a case with many values that need recoding.
Alternatively, you could do something like:
df <- mutate(df, height = case_when(
  height < 2.5 ~ height * 100,
  height < 100 ~ height + 100,
  TRUE ~ height
))
This really depends on the makeup of your data, but for the example given it works. Just be careful about what your assumptions are. You could also have used `is.double` and `is.integer`.

Changing a column name and its values at the same time

Pandas help!
I have a specific column like this:
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert the values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I changed the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time, and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
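Putting the snippets together, a minimal runnable sketch (rebuilding the sample frame from the question first):

```python
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})

# pop() drops the 'Mpg' column and returns it; rdiv(235.15) computes 235.15 / value
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
print(df)
```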
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
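As a small aside, the apply call with a Python-level lambda can be replaced by a plain vectorized division, which performs the same arithmetic column-wise:

```python
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})
cc = 235.214583  # constant for conversion from mpg to L/100km

# vectorized: divide the scalar by the whole column at once instead of apply()
df2 = pd.DataFrame({'litre per 100 km': cc / df['Mpg']})
print(df2)
```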

Avoiding loops in python/pandas

I can do basic stuff in python/pandas, but I still struggle with the "no loops necessary" world of pandas. I tend to fall back to converting to lists, doing loops like in VBA, and then just bringing those lists back into dfs. I know there is a simpler way, but I can't figure it out.
A simple example is a very basic strategy: create a signal of -1 when a series goes above 70 and keep it at -1 until the series breaks below 30, at which point the signal changes to 1 and stays there until a value above 70 again, and so on.
I can do this via simple list looping, but I know this is far from "Pythonic"! Can anyone help "translate" this to some nicer code without loops?
# rsi_list is just a list from a df column of numbers. Simple example:
rsi = {'rsi': [35, 45, 75, 56, 34, 29, 26, 34, 67, 78]}
rsi = pd.DataFrame(rsi)
rsi_list = rsi['rsi'].tolist()
signal_list = []
hasShort = 0
hasLong = 0
for i in range(len(rsi_list) - 1):
    if rsi_list[i] >= 70 or hasShort == 1:
        signal_list.append(-1)
        if rsi_list[i+1] >= 30:
            hasShort = 1
        else:
            hasShort = 0
    elif rsi_list[i] <= 30 or hasLong == 1:
        signal_list.append(1)
        if rsi_list[i+1] <= 70:
            hasLong = 1
        else:
            hasLong = 0
    else:
        signal_list.append(0)
# last part just so the list has the same length as the original df, as I put it back as a column
if rsi_list[-1] >= 70:
    signal_list.append(-1)
else:
    signal_list.append(1)
First clip the values to a lower bound of 30 and an upper bound of 70, use where to turn every value that is not exactly 30 or 70 into NaN, replace 30 by 1 and 70 by -1, propagate these values with ffill, and finally fillna with 0 for the rows before the first 30 or 70.
rsi['rsi_cut'] = (
    rsi['rsi'].clip(lower=30, upper=70)
              .where(lambda x: x.isin([30, 70]))
              .replace({30: 1, 70: -1})
              .ffill()
              .fillna(0)
)
print(rsi)
rsi rsi_cut
0 35 0.0
1 45 0.0
2 75 -1.0
3 56 -1.0
4 34 -1.0
5 29 1.0
6 26 1.0
7 34 1.0
8 67 1.0
9 78 -1.0
Edit: maybe a bit easier, use ge (greater than or equal) and le (less than or equal), take the difference, then replace the 0s with the ffill method:
print((rsi['rsi'].le(30).astype(int) - rsi['rsi'].ge(70))
      .replace(to_replace=0, method='ffill'))
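If your pandas version deprecates the method argument of replace (newer releases do), the same zero-propagation can be written with mask and ffill; a self-contained sketch:

```python
import pandas as pd

rsi = pd.DataFrame({'rsi': [35, 45, 75, 56, 34, 29, 26, 34, 67, 78]})

# +1 where rsi <= 30, -1 where rsi >= 70, 0 elsewhere
raw = rsi['rsi'].le(30).astype(int) - rsi['rsi'].ge(70).astype(int)

# hide the 0s as NaN, forward-fill the last non-zero signal, restore leading 0s
signal = raw.mask(raw.eq(0)).ffill().fillna(0).astype(int)
print(signal.tolist())
```

This prints [0, 0, -1, -1, -1, 1, 1, 1, 1, -1], matching the clip/where version above.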

Checking multiple conditions against list within a list

I am trying to filter out entities from the sample dataset below based on two conditions, i.e. whether their values fall between 0-180 or 180-360, and I plan to create two separate lists of such entities.
My df is:
Entityname Value
A 200
A 240
A 330
B 15
B 120
C 190
C 220
Expected output:
Entities_1=['A','C']
Entities_2=['B']
Below is what I have been trying:
Entities_1 = []
Entities_2 = []
for name in df.Entityname:
    if df.Value > 0 & df.Value < 180:
        Entities_1.append(name)
    else:
        if df.Value > 180 & df.Value < 360:
            Entities_2.append(name)
I am getting some errors on the above and am not sure if this is the right way forward.
Any help would be appreciated, please!
You can use pandas.Series.between() and boolean indexing.
Entities_1 = df.loc[df['Value'].between(0, 180), 'Entityname'].unique().tolist()
Entities_2 = df.loc[df['Value'].between(180, 360), 'Entityname'].unique().tolist()
# print(Entities_1)
['B']
# print(Entities_2)
['A', 'C']
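One caveat: between includes both endpoints by default, so a Value of exactly 180 would land in both lists; the inclusive argument (available in pandas 1.3+) lets you use half-open intervals instead. A self-contained sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Entityname': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Value': [200, 240, 330, 15, 120, 190, 220]})

# half-open intervals [0, 180) and [180, 360) so 180 can only match once
Entities_1 = df.loc[df['Value'].between(0, 180, inclusive='left'), 'Entityname'].unique().tolist()
Entities_2 = df.loc[df['Value'].between(180, 360, inclusive='left'), 'Entityname'].unique().tolist()
print(Entities_1, Entities_2)
```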

Adding column to pandas data frame that provides labels based on condition

I have a data frame filled with time series temperature data and need to label the equipment status as 'good' or 'bad' based on the temperature. It is 'good' if it is between 35 and 45 and 'bad' otherwise. However, I want to add a condition: if it returns to the appropriate temperature range after being listed as 'bad', it must be 'good' for at least 2 days before it is labeled as 'good' again. So far, I can label on a more basic level, but I am struggling with implementing the more complicated label switch.
df['status'] = ['bad' if x <35 or x >45 else 'good' for x in df['temp']]
Any help would be greatly appreciated. Thank you.
What about an approach like this?
You can make a group_check function for each row, and check if that row has any neighboring offending temperature within the group from the broader df.
This will only check the previous measurements. You would need to do a quick boolean check against the current measurement to confirm the prior measurements are OK AND the current measurement is OK.
def group_check_maker(index, row):
    def group_check(group):
        if len(group) > 1:
            if index in group.index:
                failed_status = False
                for index2, row2 in group.drop(index).iterrows():
                    if (row['Date'] > row2['Date']
                            and row['Date'] - row2['Date'] < pd.Timedelta(days=2)
                            and (row2['Temperature'] < 35 or row2['Temperature'] > 45)):
                        failed_status = True
                if failed_status:
                    return 'Bad'
                else:
                    return 'Good'
    return group_check

def row_checker_maker(df):
    def row_checker(row):
        group_check = group_check_maker(row.name, row)
        return df[df['Equipment ID'] == row['Equipment ID']].groupby('Equipment ID').apply(group_check).iloc[0]
    return row_checker

row_checker = row_checker_maker(df)
df['Neighboring Day Status'] = df.apply(row_checker, axis=1)
import numpy as np
df['status'] = np.where((df['temp'] < 35) | (df['temp'] > 45), 'bad', 'good')
This solves the basic labelling, though the two-day hold condition still needs extra handling.
You can create a pd.Series filled with the value 'bad', keep it only where temp is outside the 35-45 range, then propagate 'bad' to the next two rows with ffill and a limit of 2, and finally fillna the rest with 'good', such as:
#dummy df
df = pd.DataFrame({'temp': [36, 39, 24, 34 ,56, 42, 40, 38, 36, 37, 32, 36, 23]})
df['status'] = pd.Series('bad', index=df.index).where(df.temp.lt(35)|df.temp.gt(45))\
.ffill(limit=2).fillna('good')
print (df)
temp status
0 36 good
1 39 good
2 24 bad
3 34 bad
4 56 bad
5 42 bad #here it is 42 but the previous row is bad so still bad
6 40 bad #here it is 40 but the second previous row is bad so still bad
7 38 good #here it is good then
8 36 good
9 37 good
10 32 bad
11 36 bad
12 23 bad
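If the frame tracks several pieces of equipment, the same where/ffill pattern can be applied per group with groupby and transform (the 'equipment' column and sample values here are assumptions for illustration):

```python
import pandas as pd

def label(temp):
    # 'bad' where out of range, NaN elsewhere; hold 'bad' two more rows, rest 'good'
    bad = pd.Series('bad', index=temp.index).where(temp.lt(35) | temp.gt(45))
    return bad.ffill(limit=2).fillna('good')

df = pd.DataFrame({'equipment': ['a'] * 5 + ['b'] * 5,
                   'temp': [36, 56, 40, 38, 37, 36, 24, 42, 40, 38]})
df['status'] = df.groupby('equipment')['temp'].transform(label)
print(df)
```

This keeps the two-row hold from spilling across equipment boundaries, since ffill runs separately within each group.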

Does GrADS have an "astd" command (similar to aave) I could use?

I would like to have the spatial standard deviation for a variable (let's say temperature). In other words, does GrADS have an "astd" command (similar to aave) I could use?
There is no command like this in GrADS, but you can actually compute the standard deviation in two ways:
[1] Compute manually. For example:
*compute the mean
x1 = ave(ts1.1,t=1,t=120)
*compute stdev
s1 = sqrt(ave(pow(ts1.1-x1,2),t=1,t=120)*(n1/(n1-1)))
n1 here is the number of samples (120 in this example).
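The n1/(n1-1) factor is there because ave gives the population variance (dividing by n); multiplying by n/(n-1) turns it into the unbiased sample variance:

```latex
s_1 = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(x_t-\bar{x}\right)^2 \cdot \frac{n}{n-1}}
    = \sqrt{\frac{1}{n-1}\sum_{t=1}^{n}\left(x_t-\bar{x}\right)^2}
```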
[2] You can use the built-in 'stat' function in GrADS.
You can use 'set stat on' or 'set gxout stat'.
These commands will give you statistics such as the following:
Data Type = grid
Dimensions = 0 1
I Dimension = 1 to 73 Linear 0 5
J Dimension = 1 to 46 Linear -90 4
Sizes = 73 46 3358
Undef value = -2.56e+33
Undef count = 1763 Valid count = 1595
Min, Max = 243.008 302.818
Cmin, cmax, cint = 245 300 5
Stats[sum,sumsqr,root(sumsqr),n]: 452778 1.29046e+08 11359.8 1595
Stats[(sum,sumsqr,root(sumsqr))/n]: 283.874 80906.7 284.441
Stats[(sum,sumsqr,root(sumsqr))/(n-1)]: 284.052 80957.4 284.53
Stats[(sigma,var)(n)]: 17.9565 322.437
Stats[(sigma,var)(n-1)]: 17.9622 322.64
Contouring: 245 to 300 interval 5
Sigma here is the standard deviation and Var is the variance.
Is this what you are looking for?