Checking multiple conditions against list within a list - pandas

I am trying to split the entities in the sample dataset below into two groups based on their values, i.e. whether the values fall between 0-180 or between 180-360, and I plan to create two separate lists of such entities.
My df is:
Entityname Value
A 200
A 240
A 330
B 15
B 120
C 190
C 220
Expected output:
Entities_1=['A','C']
Entities_2=['B']
Below is what I have been trying:
Entities_1 = []
Entities_2 = []
for name in df.Entityname:
    if df.Value > 0 & df.Value < 180:
        Entities_1.append(name)
    else:
        if df.Value > 180 & df.Value < 360:
            Entities_2.append(name)
I am getting some errors with the above and am not sure if this is the right way forward.
Any help would be appreciated!

You can use pandas.Series.between() and boolean indexing.
Entities_1 = df.loc[df['Value'].between(0, 180), 'Entityname'].unique().tolist()
Entities_2 = df.loc[df['Value'].between(180, 360), 'Entityname'].unique().tolist()
# print(Entities_1)
['B']
# print(Entities_2)
['A', 'C']
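Note that the original loop fails partly because & binds more tightly than the comparisons, so each comparison needs its own parentheses, and because a whole Series cannot be used as a plain if condition. For reference, here is a small sketch of the equivalent explicit boolean masks (the boundary handling at 180 is an assumption; Series.between() includes both endpoints by default, so a value of exactly 180 would otherwise land in both lists):
import pandas as pd

# hypothetical frame matching the question's sample data
df = pd.DataFrame({'Entityname': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Value': [200, 240, 330, 15, 120, 190, 220]})

low = (df['Value'] >= 0) & (df['Value'] < 180)      # parentheses around each comparison are required with &
high = (df['Value'] >= 180) & (df['Value'] <= 360)

Entities_1 = df.loc[low, 'Entityname'].unique().tolist()   # ['B']
Entities_2 = df.loc[high, 'Entityname'].unique().tolist()  # ['A', 'C']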

Related

How to update column A value with column B value based on column B's string length property?

I scraped a real estate website and produced a CSV output with data that needs to be cleaned and structured. So far, my code has properly organized and reformatted the data so it works with stats software.
However, every now and then my 'Gross area' column has the wrong value in m2. The correct value appears in another column ('Furbished').
Gross_area        Furbished
170 #erroneous    190 m2
170 #erroneous    190 m2
160 #correct      Yes
155 #correct      No
I tried using the np.where function. However, I could not specify the condition based on string length, which would allow me to target all '_ _ _ m2' values in column 'Furbished' and reinsert them in 'Gross_area'. It just doesn't work.
df['Gross area'] = np.where(len(df['Furbished']) == 6, df['Furbished'], df['Gross area'])
As an alternative, I tried setting cumulative conditions to precisely target my '_ _ _ m2' values and insert them in my 'Gross area' column. It does not work:
df['Gross area'] = np.where(df['Furbished'] != 'Yes' or 'No', df['Furbished'], df['Gross area'])
The outcome I seek is:
Gross_area    Furbished
190 m2        190 m2
190 m2        190m2
160           Yes
Any suggestions? Column Furbished string length criterion would be the best option, as I have other instances that would require the same treatment :)
Thanks in advance for your help!
There is probably a better way to do this, but you could get the intended effect with a simple df.apply() function.
df['Gross area'] = df.apply(lambda row: row['Furbished'] if len(row['Furbished']) == 6 else row['Gross area'], axis=1)
With a small change, you can also keep the 'Gross area' column as a numeric type:
df['Gross area'] = df.apply(lambda row: float(row['Furbished'][:-2]) if len(row['Furbished']) == 6 else row['Gross area'], axis=1)
You can use pandas.Series.where():
df['Gross_area'] = df['Furbished'].where(df['Furbished'].str.len() == 6, df['Gross_area'])
This tells you to use the value in the Furbished column if its length is 6, otherwise use the value in the Gross_area column.
Result:
Gross_area Furbished
0 190 m2 190 m2
1 190 m2 190 m2
2 160 #correct Yes
3 155 #correct No
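If you then want the recovered 'Gross_area' values as numbers rather than strings like '190 m2', one possible follow-up (a rough sketch, not from the original answer, assuming every entry contains a single number worth keeping) is to extract the numeric part and convert:
df['Gross_area'] = (
    df['Gross_area']
    .astype(str)
    .str.extract(r'(\d+(?:\.\d+)?)', expand=False)  # keep only the first number in each cell
    .astype(float)
)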
Thanks a lot for your help! The suggestion of Derek was the simplest to implement in my program:
df['Gross area']=df['Furbished'].where(df['Furbished'].str.len()==6,df['Gross area'])
I could create a set of rules to replace or delete all the misreferenced data :)
To update data from given column A if column B equals given string
df['Energy_Class']=np.where(df['Energy_Class']=='Usado',df['Bathrooms'],df['Energy_Class'])
To replace string segment found within column rows
net = []
for row in net_col:
    net.append(row)
net_in = [s for s in prices if 'm²' in s]
print(net_in)
net_1 = [s.replace('m²', '') for s in net]
net_2 = [s.replace(',', '.') for s in net_1]
net_3 = [s.replace('Sim', '') for s in net_2]
df['Net area'] = np.array(net_3)
To create new column and append standard value B if value A found in existing target column rows
Terrace_list = []
caracl0 = df['Caracs/0'].tolist()
for row in caracl0:
    if row == 'Terraço':
        Terrace_list.append('Yes')
    else:
        Terrace_list.append('No')
df['Terraces'] = np.array(Terrace_list)
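The same result can be obtained more idiomatically with a vectorised np.where (a one-line sketch, assuming the same column names as above):
df['Terraces'] = np.where(df['Caracs/0'] == 'Terraço', 'Yes', 'No')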
To append pre-set value B in existing column X if value A found in existing column Y.
df.loc[df['Caracs/1']=='Terraço','Terraces']='Yes'
Hope this helps someone out.

How to replace certain values in the dataframe using tidyverse in R?

There are some values in my dataset (df) that need to be replaced with correct values, e.g.,
Height    Disease    Weight>90kg
1.58      1          0
1.64      0          1
1.67      1          0
52        0          1
67        0          0
I want to replace the first three values with '158', '164' and '167', and the next two with 152 and 167 (adding a 1 at the beginning).
I tried the following code but it doesn't work:
data_clean <- function(df) {
  df[height == 1.58] <- 158
  df
}
data_clean(df)
Please help!
Using recode you can explicitly recode the values:
df <- mutate(df, height = recode(height,
                                 `1.58` = 158,
                                 `1.64` = 164,
                                 `1.67` = 167,
                                 `52` = 152,
                                 `67` = 167))
However, this obviously is a manual process and not ideal for a case with many values that need recoding.
Alternatively, you could do something like:
df <- mutate(df, height = case_when(
  height < 2.5 ~ height * 100,
  height < 100 ~ height + 100,
  TRUE ~ height  # leave anything else unchanged (avoids NAs for unmatched values)
))
This really depends on the makeup of your data, but for the example given it would work. Just be careful about what your assumptions are. You could also have used is.double() and is.integer().

How to get same grouping result using data.table comparing to the sqldf?

I am trying to implement an SQL query using both sqldf and data.table.
I need to do this separately with these two different libraries.
Unfortunately, I cannot reproduce the sqldf result using data.table.
library(sqldf)
library(data.table)
Id <- c(1,2,3,4)
HasPet <- c(0,0,1,1)
Age <- c(20,1,14,10)
Posts <- data.table(Id, HasPet, Age)
# sqldf way
ref <- sqldf("
SELECT Id, HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet
")
# data.table way
res <- Posts[,
list(Id, HasPet, MaxAge=max(Age)),
by=list(HasPet)]
head(ref)
head(res)
Output for sqldf is:
> head(ref)
Id HasPet MaxAge
1 1 0 20
2 3 1 14
while the output for data.table is different:
> head(res)
HasPet Id HasPet MaxAge
1: 0 1 0 20
2: 0 2 0 20
3: 1 3 1 14
4: 1 4 1 14
Please note that the SQL query cannot be modified.
This comes up a lot with data.table. If you want the max or min by group, the best way is a self-join. It's fast, and only a little arcane.
You can build it up step by step:
In data.table, you can select in i, do in j, and group afterwards. So the first step is to find the thing we want within each level of the group:
Posts[, Age == max(Age), by = HasPet]
# HasPet V1
# 1: 0 TRUE
# 2: 0 FALSE
# 3: 1 TRUE
# 4: 1 FALSE
We can use .I to retrieve the integer row locations, and then index them with what was previously the logical V1 vector of TRUE and FALSE within each group, so that we keep only the row containing the max per group.
Posts[, .I[Age == max(Age)], by=HasPet]
# From the data.table special symbols help:
# .I is an integer vector equal to seq_len(nrow(x)). While grouping,
# it holds for each item in the group, its row location in x. This is useful
# to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].
# HasPet V1
# 1: 0 1
# 2: 1 3
We then use the column V1 that we just made in order to call the specific rows (1 and 3) from the data.table. That's it!
Posts[Posts[, .I[Age == max(Age)], by=HasPet]$V1]
You can use .SD to get the subset of rows for each value of HasPet.
library(data.table)
Posts[, .SD[Age==max(Age)], HasPet]
# HasPet Id Age
#1: 0 1 20
#2: 1 3 14

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add up the numbers in a series until a threshold is reached, and then restart the counter? The intention is to form groups (for a groupby) based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=np.object)
The use-case is to find the average price of every 75 sold amounts of the thing.
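For reference, a minimal loop-based sketch of the "accumulate until a threshold, then restart" idea (not from the original post; slow on large frames, but it makes the intended grouping concrete, using the 75-unit threshold from the use-case):
threshold = 75
labels, running, bin_id = [], 0, 0
for amt in df['amount']:
    running += amt
    labels.append(bin_id)
    if running >= threshold:   # threshold reached: reset the counter and open a new bin
        running = 0
        bin_id += 1

df['bin'] = labels
avg_price = df.groupby('bin')['price'].mean()
Note that this averages the raw price rows within each bin; the answer below instead weights by amount via np.repeat.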
Solution #1: Interpreting the statement "The use-case is to find the average price of every 75 sold amounts of the thing" one way gives my solution below. If you are trying to do this calculation the "hard way" instead of with pd.cut, then here is a solution that works well, but its speed / memory use will depend on the cumsum() of the amount column, which you can check with df['amount'].cumsum(). The output will take about 1 second per every 10 million of the cumsum, as that is how many rows np.repeat creates. Again, this solution is not bad if you have less than ~10 million in cumsum (1 second) or even 100 million in cumsum (~10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new series with cumsum and then use pd.cut, passing bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, group by the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and convert to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

Adding column to pandas data frame that provides labels based on condition

I have a data frame filled with time series temperature data and need to label the equipment status as 'good' or 'bad' based on the temperature. It is 'good' if the temperature is between 35 and 45 and 'bad' otherwise. However, I want to add a condition: if it returns to the appropriate temperature range after being listed as 'bad', it must stay in range for at least 2 days before it is labeled as 'good' again. So far, I can label on a more basic level, but I am struggling with implementing the more complicated label switch.
df['status'] = ['bad' if x <35 or x >45 else 'good' for x in df['temp']]
Any help would be greatly appreciated. Thank you.
What about an approach like this?
You can make a group_check function for each row, and check whether that row has any neighboring offending temperature within the group from the broader df.
This only checks the previous measurements. You would also need a quick boolean check against the current measurement to confirm that the prior measurements are OK AND the current measurement is OK (see the short sketch after the code below).
def group_check_maker(index, row):
    def group_check(group):
        if len(group) > 1:
            if index in group.index:
                failed_status = False
                for index2, row2 in group.drop(index).iterrows():
                    if (row['Date'] > row2['Date']) and (row['Date'] - row2['Date'] < pd.Timedelta(days=2)) and (row2['Temperature'] < 35 or row2['Temperature'] > 45):
                        failed_status = True
                if failed_status:
                    return 'Bad'
                else:
                    return 'Good'
    return group_check

def row_checker_maker(df):
    def row_checker(row):
        group_check = group_check_maker(row.name, row)
        return df[df['Equipment ID'] == row['Equipment ID']].groupby('Equipment ID').apply(group_check).iloc[0]
    return row_checker

row_checker = row_checker_maker(df)
df['Neighboring Day Status'] = df.apply(row_checker, axis=1)
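One possible way to combine that final boolean check with the current measurement (a small sketch, not from the original answer; it assumes the 'Temperature' column and the 'Neighboring Day Status' column produced above, and writes the final label into 'status'):
import numpy as np

current_ok = df['Temperature'].between(35, 45)         # current reading in range
prior_ok = df['Neighboring Day Status'] == 'Good'      # no offending reading in the previous 2 days
df['status'] = np.where(current_ok & prior_ok, 'good', 'bad')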
import numpy as np
df['status'] = np.where((df['temp'] < 35) | (df['temp'] > 45), 'bad', 'good')
This should solve the issue.
You can create a pd.Series filled with the value 'bad', keep 'bad' only where the temperature is outside 35-45 (NaN elsewhere), then propagate 'bad' to the next two rows with ffill and a limit of 2, and finally fillna the rest with 'good', such as:
#dummy df
df = pd.DataFrame({'temp': [36, 39, 24, 34 ,56, 42, 40, 38, 36, 37, 32, 36, 23]})
df['status'] = pd.Series('bad', index=df.index).where(df.temp.lt(35) | df.temp.gt(45))\
                 .ffill(limit=2).fillna('good')
print (df)
temp status
0 36 good
1 39 good
2 24 bad
3 34 bad
4 56 bad
5 42 bad #here it is 42 but the previous row is bad so still bad
6 40 bad #here it is 40 but the second previous row is bad so still bad
7 38 good #here it is good then
8 36 good
9 37 good
10 32 bad
11 36 bad
12 23 bad