change column value based on multiple conditions - pandas

I've seen a lot of similar posts, but none seem to answer this question:
I have a data frame with multiple columns, let's say A, B and C.
I want to change column A's value based on conditions on A, B and C.
I've got this so far, but it's not working.
df=df.loc[(df['A']=='Harry')
& (df['B']=='George')
& (df['C']>'2019'),'A']=='Matt'
So if A is equal to Harry, B is equal to George, and C is greater than 2019, then change A to Matt.
Can anyone see what I've done wrong?

You are really close: assign the value 'Matt' to column A, filtered by the boolean masks:
df.loc[(df['A']=='Harry') & (df['B']=='George') & (df['C']>'2019'),'A'] = 'Matt'

You can also use np.where
df['A'] = np.where((df['A']=='Harry') & (df['B']=='George') & (df['C']>'2019'), 'Matt', df['A'])
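As a self-contained sanity check, here is a minimal sketch of both approaches on made-up data. Note that df['C'] > '2019' compares strings lexicographically; if C were numeric you would compare against the number 2019 instead:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Harry', 'Harry', 'Tom'],
                   'B': ['George', 'Sally', 'George'],
                   'C': ['2020', '2020', '2020']})

# Only row 0 satisfies all three conditions, so only its A changes
df.loc[(df['A'] == 'Harry') & (df['B'] == 'George') & (df['C'] > '2019'), 'A'] = 'Matt'
print(df['A'].tolist())  # ['Matt', 'Harry', 'Tom']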

Related

How do I vectorize multiple columns based on a condition in Pandas?

I am able to successfully vectorize one column, for example:
df.loc[df['Pos'] == "26,001+", 'column_D'] = "N/A"
I want to know how I can do this for multiple columns; I tried something like this, which was unsuccessful:
df.loc[df['Pos'] == "Between 10,001 and 26,000", 'A' & 'B' & 'C' & 'D' & 'E'] = "N/A"
I am expecting all the listed columns to be transformed using vectorization, if possible.
You are almost there. Following the documentation, the syntax is:
df.loc[df['Pos']=="Between 10,001 and 26,000", ['A','B','C','D','E']] = "N/A"
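For illustration, a small sketch with invented values in columns A through E:
import pandas as pd

df = pd.DataFrame({'Pos': ["Between 10,001 and 26,000", "26,001+"],
                   'A': [1, 2], 'B': [3, 4], 'C': [5, 6],
                   'D': [7, 8], 'E': [9, 10]})
df.loc[df['Pos'] == "Between 10,001 and 26,000", ['A', 'B', 'C', 'D', 'E']] = "N/A"
print(df)
# A through E in the first row are all "N/A"; the second row is untouched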

How to update column A value with column B value based on column B's string length property?

I scraped a real estate website and produced a CSV output with data that needs to be cleaned and structured. So far, my code has properly organized and reformatted the data to work with stats software.
However, every now and then my 'Gross area' column has the wrong value in m2. The correct value appears in another column ('Furbished').
Gross_area       Furbished
170 #erroneous   190 m2
170 #erroneous   190 m2
160 #correct     Yes
155 #correct     No
I tried using the np.where function. However, I could not specify the condition based on string length, which would allow me to target all '_ _ _ m2' values in column 'Furbished' and reinsert them in 'Gross_area'. It just doesn't work:
df['Gross area'] = np.where(len(df['Furbished']) == 6, df['Furbished'], df['Gross area'])
As an alternative, I tried setting cumulative conditions to precisely target my '_ _ _ m2' values and insert them in my 'Gross area' column. It does not work either:
df['Gross area'] = np.where(df['Furbished'] != 'Yes' or 'No', df['Furbished'], df['Gross area'])
The outcome I seek is:
Gross_area   Furbished
190 m2       190 m2
190 m2       190m2
160          Yes
155          No
Any suggestions? A string-length criterion on the Furbished column would be the best option, as I have other instances that would require the same treatment :)
Thanks in advance for your help!
There is probably a better way to do this, but you could get the intended effect with a simple df.apply() function:
df['Gross area'] = df.apply(lambda row: row['Furbished'] if len(row['Furbished']) == 6 else row['Gross area'], axis=1)
With a simple change, you can also keep the 'Gross area' column in the right type by stripping the unit and converting to float:
df['Gross area'] = df.apply(lambda row: float(row['Furbished'][:-2]) if len(row['Furbished']) == 6 else row['Gross area'], axis=1)
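Note that df.apply with axis=1 runs a Python-level loop over the rows, so on a large frame the vectorized Series.where alternative below is usually much faster.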
You can use Series.where:
df['Gross_area'] = df['Furbished'].where(df['Furbished'].str.len() == 6, df['Gross_area'])
This tells you to use the value in the Furbished column if its length is 6, otherwise use the value in the Gross_area column.
Result:
  Gross_area    Furbished
0 190 m2        190 m2
1 190 m2        190 m2
2 160 #correct  Yes
3 155 #correct  No
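A minimal end-to-end reproduction with the question's values (the #erroneous/#correct annotations dropped):
import pandas as pd

df = pd.DataFrame({'Gross_area': ['170', '170', '160', '155'],
                   'Furbished': ['190 m2', '190 m2', 'Yes', 'No']})
df['Gross_area'] = df['Furbished'].where(df['Furbished'].str.len() == 6, df['Gross_area'])
print(df['Gross_area'].tolist())  # ['190 m2', '190 m2', '160', '155']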
Thanks a lot for your help! Derek's suggestion was the simplest to implement in my program:
df['Gross area'] = df['Furbished'].where(df['Furbished'].str.len() == 6, df['Gross area'])
I could create a set of rules to replace or delete all the misreferenced data :)
To update data in a given column A if column B equals a given string:
df['Energy_Class']=np.where(df['Energy_Class']=='Usado',df['Bathrooms'],df['Energy_Class'])
To replace a string segment found within column rows:
net = []
for row in net_col:
    net.append(row)
net_in = [s for s in net if 'm²' in s]  # check which rows carry the m² unit
print(net_in)
net_1 = [s.replace('m²', '') for s in net]    # strip the unit
net_2 = [s.replace(',', '.') for s in net_1]  # normalise the decimal separator
net_3 = [s.replace('Sim', '') for s in net_2]
df['Net area'] = np.array(net_3)
To create a new column and fill it with a standard value B if value A is found in an existing target column's rows:
Terrace_list = []
for row in df['Caracs/0']:
    if row == 'Terraço':
        Terrace_list.append('Yes')
    else:
        Terrace_list.append('No')
df['Terraces'] = np.array(Terrace_list)
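For what it's worth, the same 'Terraces' column can be built without an explicit loop; a vectorized sketch of the snippet above, using np.where on the same column names:
df['Terraces'] = np.where(df['Caracs/0'] == 'Terraço', 'Yes', 'No')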
To set a pre-set value B in an existing column X if value A is found in an existing column Y:
df.loc[df['Caracs/1']=='Terraço','Terraces']='Yes'
Hope this helps someone out.

How to set multiple conditions for a Dataframe while modifying the values?

So, I'm looking for an efficient way to set values within an existing column and to set values for a new column based on some conditions. If I have 10 conditions in a big data set, do I have to write 10 lines? Or can I combine them somehow? I haven't figured it out yet.
Can you guys suggest something?
For example:
data_frame.loc[data_frame.col1 > 50 ,["col1","new_col"]] = "Cool"
data_frame.loc[data_frame.col2 < 100 ,["col1","new_col"]] = "Cool"
Can it be written in a single expression? "&" or "and" don't work...
Thanks!
Yes, you can do it. Here is an example:
data_frame.loc[(data_frame["col1"]>100) & (data_frame["col2"]<10000) | (data_frame["col3"]<500), "test"] = 0
Explanation:
The filter I used (combining "and" and "or" conditions) is: (data_frame["col1"]>100) & (data_frame["col2"]<10000) | (data_frame["col3"]<500)
The column that will be changed is "test" and the value assigned will be 0.
You can try:
all_conditions = [condition_1, condition_2]
fill_with = [fill_condition_1_with, fill_condition_2_with]
values = np.select(all_conditions, fill_with, default=default_value_here)
df["col1"] = values
df["new_col"] = values
(np.select returns a one-dimensional array, so assign it to each target column rather than to a two-column selection.)

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10
data['group'][mask] = 1
mask2 = (data['length'] > 9) & (data['length'] < 15)
data['group'][mask2] = 2
mask3 = data['length'] > 14
data['group'][mask3] = 3
This code is not good, of course. The reason it is not good is that you don't know at run time whether data['group'][mask3], for example, will be a view and thus actually change the dataframe, or a copy, in which case the dataframe would remain unchanged. It took me quite some time to explain this to her, since she argued, correctly, that she was doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified that the assignment took place in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and held all the values we wanted it to have. But by method 1 it didn't!
Can anyone explain this?
See the docs here.
You don't want to set values this way, for exactly the reason you pointed out: since you don't know whether chained indexing returns a view or a copy, you don't know whether you are actually changing the data. pandas 0.13 will raise/warn when you attempt to do this, but it is easiest/best to just access it like:
data.loc[mask3, 'group'] = 3
which guarantees an in-place setitem.
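Applied to all three masks, a self-contained sketch (random data standing in for the real dataframe):
import numpy as np
import pandas as pd

data = pd.DataFrame({'length': np.random.randint(0, 21, 12000)})

data['group'] = np.nan
data.loc[data['length'] < 10, 'group'] = 1
data.loc[(data['length'] > 9) & (data['length'] < 15), 'group'] = 2
data.loc[data['length'] > 14, 'group'] = 3

print(data['group'].isna().sum())          # 0: every row was really assigned
print(data['group'].value_counts().sum())  # 12000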

Access Concatenate rows into single rows: extra conditions needed

I'm working with an Access database and I need to concatenate different related rows into one row. I found a solution here and used it with great success. However, I need to add extra conditions to it: rows should only be concatenated if certain other columns are equal too.
For example:
1 X Alpha
2 Y Beta
1 X Gamma
1 Z Delta
should become
1 X Alpha,Gamma
1 Z Delta
2 Y Beta
Does anyone know how to do this, especially for a newbie like me?
It seems you are using the code supplied in Does MS Access (2003) have anything comparable to stored procedures? I want to run a complex query in MS Access.
There is no reason why you should not feed two fields in as one in your SQL, so, for example:
SELECT Number & Letter,
       Concatenate("SELECT Letter & Alpha As FirstField FROM tblT " &
                   "WHERE Number & Letter = """ & [Number] & [Letter] & """") AS FirstFields
FROM tblT