condition based on the last character of a column - pandas

Hi - I was wondering if anybody can help me with this. I have a table with columns 'A' through 'D', and I want to create a new column 'E' based on a condition on column 'A'.
For example, if the string in column A ends with the letter Z (e.g. s7-Z), multiply the values in columns B and C and store the result in the new column E. Else, if the string in column A ends with the letter I (e.g. s7-I), multiply the values in columns C and D and store the result in column E.

Something like this should get you started I guess:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['AZ', 'aI', 'aa'],
                   'B': [2, 3, 4],
                   'C': [5, 2, 3],
                   'D': [10, 20, 30]})

# If A ends with 'Z', take B*C; otherwise NaN for now
df['E'] = np.where(df['A'].str.endswith('Z'), df['B'] * df['C'], np.nan)
# If A ends with 'I', take C*D; otherwise keep whatever is already in E
df['E'] = np.where(df['A'].str.endswith('I'), df['C'] * df['D'], df['E'])
df  # E is now [10.0, 40.0, NaN]
The expression df['A'].str.endswith('Z') returns a boolean Series (True/False) indicating whether the string in each cell of df['A'] ends with 'Z'.
np.where reads as np.where(condition, value where True, value where False) - here the "True" value is df['B']*df['C'] and the "False" value is NaN (or, in the second call, the column 'E' itself, so rows already filled in by the first call are kept).
Hope that makes sense... here are the docs for further exploration
https://numpy.org/doc/stable/reference/generated/numpy.where.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.endswith.html
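
If you later need more than two conditions, chaining np.where calls gets hard to read; np.select takes a list of conditions and a matching list of choices in one call. A minimal sketch on the same toy frame as above:

conditions = [df['A'].str.endswith('Z'), df['A'].str.endswith('I')]
choices = [df['B'] * df['C'], df['C'] * df['D']]

# The first matching condition wins; rows matching nothing get the default
df['E'] = np.select(conditions, choices, default=np.nan)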

Related

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns in which the entries are a combination of numbers, upper- and lower-case letters and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 that appear between the | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
Yes: str.lstrip does not interpret its argument as a regex. It treats it as a plain set of characters to remove from the left, and stops at the first character not in that set ('x' here), so nothing gets stripped. Instead of working with these unreadable regex expressions, you might want to consider a simple split. For example:
import pandas as pd

d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)

# Split on '|' and keep the piece between the first and second '|'
output = df["col"].str.split("|").str[1]

cannot look up values in pandas df with the reference number with leading 0

I'm trying to get the description from a dataframe by entering the item number. However, Python 3 does not allow integer literals starting with 0.
In the code below, I'm looking up rows in a df where the style number equals '0400271' and returning the description of that row. The code works only for numbers that don't start with 0; for numbers with a leading 0 it just returns an empty Series.
joe.loc[joe['STYLE_NO'] == '0400271', 'DESCRIPTION']

# also separately tried (also returns an empty Series):
value = '0400271'.zfill(7)
print(value)
print(type(value))
b = joe.query('STYLE_NO == @value')['DESCRIPTION']
b
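
For what it's worth, the usual cause of this is the dtype: if STYLE_NO was parsed as an integer when the file was read, the leading zeros were dropped before the string comparison ever runs. A minimal sketch of two possible fixes, assuming the data comes from a CSV (the file name is hypothetical) and that style numbers are 7 digits wide:

import pandas as pd

# Fix 1: keep the column as strings at read time ('joe.csv' is a hypothetical file name)
joe = pd.read_csv('joe.csv', dtype={'STYLE_NO': str})

# Fix 2: if the column is already numeric, rebuild the zero-padded strings
joe['STYLE_NO'] = joe['STYLE_NO'].astype(str).str.zfill(7)

joe.loc[joe['STYLE_NO'] == '0400271', 'DESCRIPTION']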

Finding Mismatch of a column having decimal value using Pandas

I have 2 CSV files with the same headers. I merged them on their primary keys. Now, from the merged file, I need to create another file showing, for col1 and col2 (which are float columns), which values match and which mismatch at the 7th decimal place. What is the best way to do that?
First generate some data that matches the shape you note; then it's a simple case of testing equality of the rounded numbers and writing the result with to_csv(). A sample of 5 rows is included.
from pathlib import Path
import numpy as np
import pandas as pd

# 100 random integers, each perturbed by a tiny amount around the 8th decimal place
b = np.random.randint(1, 100, 100)
df1 = pd.DataFrame(b + np.random.uniform(10**-8, 10**-7, 100), columns=["col1"])
df2 = pd.DataFrame(b + np.random.uniform(10**-8, 10**-7, 100), columns=["col2"])
fn = Path.cwd().joinpath("SO_same.csv")

# Flag rows where the two columns agree when rounded to 7 decimal places
df1.join(df2).assign(eq7dp=lambda df: df.col1.round(7).eq(df.col2.round(7))).head(5).to_csv(fn)
with open(fn) as f:
    contents = f.read()
print(contents)
output
,col1,col2,eq7dp
0,37.00000005733964,37.00000002893621,False
1,46.00000001386966,46.00000008236663,False
2,99.00000007870301,99.00000007452154,True
3,42.00000001906606,42.00000001278533,True
4,79.00000007529009,79.00000007372863,True
Supplement
In the comments you note that you want to use an np.where() expression to select col1 if equal, else False. You need to ensure that the 2nd and 3rd parameters to np.where() are compatible; NB False becomes zero when converted to an int/float.
df1.join(df2).assign(
    eq7dp=lambda df: df.col1.round(7).eq(df.col2.round(7)),
    col3=lambda df: np.where(df.col1.round(7).eq(df.col2.round(7)), df.col1, np.full(len(df), False)),
)
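
If an absolute tolerance fits your data better than rounding, np.isclose is another option - a minimal sketch (the 5e-8 tolerance is an assumption, and this is not exactly equivalent to comparing round(7) results):

# Treat values as equal when they differ by less than 5e-8,
# i.e. roughly "equal at the 7th decimal place"
df1.join(df2).assign(
    eq7dp=lambda df: np.isclose(df.col1, df.col2, rtol=0, atol=5e-8)
)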

Pandas DF Select Column Contained By String

I am currently trying to figure out the inverse of df[df['A'].str.contains("hello")]. I would like to return all rows from column 'A' whose value is a substring of "hello".
If the values in column 'A' were "he", "llo", "l", they would return True.
If the values in column 'A' were "het", "lil", "elp", they would return False.
Is there a way to do this in a dataframe without iterating each row in the dataframe?
Currently using 2.7 due to working with ESRI ArcGIS 10.4 software constraints.
You can use apply() on the Pandas Series to evaluate, for each row of column A, whether it is a substring of 'hello':
def hello_check(row):
    return row in 'hello'  # Python's 'in' operator is a substring test for strings

df['contains_hello'] = df['A'].apply(hello_check)
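
A quick check against the examples from the question (toy frame assumed; this also runs on Python 2.7):

import pandas as pd

df = pd.DataFrame({'A': ['he', 'llo', 'l', 'het', 'lil', 'elp']})
df['contains_hello'] = df['A'].apply(lambda s: s in 'hello')
print(df)
# 'he', 'llo' and 'l' come out True; 'het', 'lil' and 'elp' come out False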

How to change a value in a column based on whether or not a certain string combination is in other columns in the same row? (Pandas)

I am a very new newbie to Pandas and programming in general. I'm using Anaconda, if that matters.
I have the following on my hands:
The infamous Titanic survival dataset.
So, my idea was to search the dataframe and find the rows where the "Name" column contains the string "Mrs." AND, at the same time, "Age" is NaN (in which case the value in the "Age" column needs to be changed to 32). Also, when "Miss" is found in the cell, I need to check whether the values in two other columns are zeros.
My major problem is that I don't know how to tell Pandas to replace the value in the same row or delete the whole row.
# I decided to collect the indexes of rows with the "Age" value == NaN
# to further use the indices to search through the "Name" column.
list_of_NaNs = df[df['Age'].isnull()].index.tolist()
for name in df.Name:
    if "Mrs." in name and name (list_of_NaNs):  # if the string combination "Mrs." can be found within the cell...
        df.loc['Age'] = 32.5  # need to change the value in the column IN THE SAME ROW
    elif "Miss" in name and df.loc[Parch] > 0:  # how to make a reference to a value IN THE SAME ROW???
        df.loc["Age"] = 5
    elif df.SibSp == 0 and Parch == 0:
        df.loc["Age"] = 32.5
    else:
        # mmm... how do I delete the entire row so that it doesn't interfere with my future actions?
        pass
Here is how you can test whether 'Miss' or 'Mrs.' is present in the Name column:
df.Name.str.contains('Mrs')
So the following will give you the rows where 'Mrs' is in Name and Age is NaN:
df[(df.Name.str.contains('Mrs')) & (df.Age.isna())]
You can play with the different cases and tasks from here on.
Hope this helps :)
And to drop rows with NaN in the Age column:
df = df.drop(df[df.Age.isna()].index)
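
To go one step further and make the in-place replacement the question asks about, pass the same boolean mask to .loc together with the column to assign (a sketch; the fill value 32 comes from the question):

# Set Age to 32 for 'Mrs.' rows whose Age is missing
mask = df.Name.str.contains('Mrs') & df.Age.isna()
df.loc[mask, 'Age'] = 32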