Pandas DF Select Column Contained By String

I am currently trying to figure out the inverse of df[df['A'].str.contains("hello")]. I would like to return all rows from column 'A' whose value is a substring of "hello".
If the values were "he", "llo" "l" in column 'A' they would return true.
If the values were "het", "lil", "elp" in column 'A' they would return false.
Is there a way to do this without iterating over each row of the dataframe?
Currently using Python 2.7 due to working with ESRI ArcGIS 10.4 software constraints.

You can use pandas' apply() to run a check over each value of column A and evaluate whether it is a substring of 'hello':
def hello_check(value):
    # True when the cell value is a substring of 'hello'
    return value in 'hello'

df['contains_hello'] = df['A'].apply(hello_check)
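For example, with the sample values from the question (the frame here is hypothetical):
import pandas as pd

df = pd.DataFrame({'A': ['he', 'llo', 'l', 'het', 'lil', 'elp']})
df[df['A'].apply(hello_check)]  # keeps 'he', 'llo', 'l', mirroring the df[df['A'].str.contains(...)] pattern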

Related

Filling NaNs using apply lambda function to work with DASK dataframe

I am trying to figure out how to fill a new column with a label based on which of three other columns is non-null, falling back from one column to the next, as so
df['NewCol'] = df.apply(lambda row: 'Three' if row['VersionThree'] == row['VersionThree']
                        else ('Two' if row['VersionTwo'] == row['VersionTwo']
                              else ('Test' if row['VS'] == row['VS'] else '')), axis=1)
So the function works as it should, but I am now trying to figure out how to get it to run when I read my dataset in as a Dask DataFrame.
I tried to vectorize it and see if I could use numpy's where with it, as so:
df['NewCol'] = np.where((df['VersionThree'] == df['VersionThree']), ['Three'],
               np.where((df['VersionTwo'] == df['VersionTwo']), ['Two'],
               np.where((df['VS'] == df['VS']), ['Test'], np.nan)))
But it does not run and crashes. I would like it to go through every row and check those three columns: if the first is non-null, output its label to NewCol; if it is null, check the next column in the chain; and if all three are null, place np.nan in that cell.
I am trying to use a Dask DataFrame.
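For what it's worth, here is a minimal sketch of one way this logic is often expressed: np.select stands in for the nested np.where chain, notnull() replaces the x == x NaN trick, and map_partitions applies the whole thing per Dask partition. The column names are taken from the question; the sample data is made up:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical sample data; NaN marks a missing value
pdf = pd.DataFrame({'VersionThree': [1.0, np.nan, np.nan],
                    'VersionTwo':   [np.nan, 2.0, np.nan],
                    'VS':           [np.nan, np.nan, np.nan]})
ddf = dd.from_pandas(pdf, npartitions=2)

def add_newcol(part):
    # np.select picks the first condition that holds, row by row
    conds = [part['VersionThree'].notnull(),
             part['VersionTwo'].notnull(),
             part['VS'].notnull()]
    # default='' rather than np.nan, since np.select would coerce NaN
    # to the string 'nan' when the other choices are strings
    part['NewCol'] = np.select(conds, ['Three', 'Two', 'Test'], default='')
    return part

result = ddf.map_partitions(add_newcol).compute()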

Create columns which correspond to the number of characters contained in strings of a dataframe column

I have a dataframe whose first column contains strings (e.g. 'AABCD'). I have to count the occurrences of each character in each string, and store the counts in one column per character (A, B, C, D).
In other words, I want to create columns A, B, C, D containing the number of occurrences of each character in the string on each line.
Assuming the columns are already in the dataframe, and the column containing the strings really is a column and not the index:
Set up the dataframe:
import pandas as pd

df = pd.DataFrame({
    "string": ["AABCD", "ACCB", "AB", "AC"],
    "A": [float("nan"), float("nan"), float("nan"), float("nan")],
    "B": [float("nan"), float("nan"), float("nan"), float("nan")],
    "C": [float("nan"), float("nan"), float("nan"), float("nan")],
    "D": [float("nan"), float("nan"), float("nan"), float("nan")],
})
Loop through the columns and apply a lambda function to each row.
for col_name in df.columns:
    if col_name == "string":
        continue
    df[col_name] = df.apply(lambda row: row["string"].count(col_name), axis=1)
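As an aside, pandas can do the counting without the row-wise apply: Series.str.count is vectorized over the whole column, so the loop body can be reduced to something like:
# Vectorized alternative: count each character across the whole column at once
for col_name in "ABCD":
    df[col_name] = df["string"].str.count(col_name)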

condition based on the last character of a column

Hi, I was wondering if anybody can help me with this. I have a table with columns A through D, and I want to create a new column 'E' based on a condition on column 'A'.
For example, if the string in column A ends with the letter Z (e.g. 's7-Z'), multiply the values in columns B and C and store the result in the new column E. Else, if the string in column A ends with the letter I (e.g. 's7-I'), multiply the values in columns C and D and store the result in E.
Something like this should get you started I guess:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['A'] = pd.Series(['AZ', 'aI', 'aa'])
df['B'] = pd.Series([2, 3, 4])
df['C'] = pd.Series([5, 2, 3])
df['D'] = pd.Series([10, 20, 30])

df['E'] = np.where(df['A'].str.endswith('Z'), df['B'] * df['C'], np.nan)
df['E'] = np.where(df['A'].str.endswith('I'), df['C'] * df['D'], df['E'])
df
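For reference, the frame this produces looks like:
    A  B  C   D     E
0  AZ  2  5  10  10.0
1  aI  3  2  20  40.0
2  aa  4  3  30   NaN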
The expression df['A'].str.endswith('Z') returns a Series of booleans (True, False) indicating whether the string in each cell of df['A'] ends with 'Z'.
np.where reads as: np.where(this condition is true, then take this, in this case df['B']*df['C'], else take this, in this case NaN or the existing column 'E').
Hope that makes sense... here are the docs for further exploration
https://numpy.org/doc/stable/reference/generated/numpy.where.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.endswith.html

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following to perform a traditional calculation like mean(), sum(), etc.
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions:
How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max())) to columns 'A' and 'D' for rows 2 to 4?
How can I append a row to the existing df where two of the values would be the mean() of A and D, and two would be the result of a specific formula applied to C and B?
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
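To write such a result into a labelled row, following the df.loc['calc'] pattern from the question, note that the apply() call returns a Series indexed by 'A' and 'D', so it aligns on assignment and leaves the other columns as NaN; something like this should work:
df.loc['calc'] = df.iloc[2:4, [0, 3]].apply(lambda x: np.exp(np.mean(x)))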
Q2:
You can use dictionaries and combine them and add it as a row.
The first one is the mean, the second one is some custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them:
ad.update(bc)
df = df.append(ad, ignore_index=True)
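Note that DataFrame.append was deprecated and then removed in pandas 2.0; on current versions the same row can be added with pd.concat, e.g.:
# ad already holds the merged dict after ad.update(bc)
df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)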

Is there an elegant way to select all rows in a pandas column that have data type float?

An elegant function like
df[~pandas.isnull(df.loc[:,0])]
can check a pandas DataFrame column and return the entire DataFrame but with all NaN value rows from the selected column removed.
I am wondering if there is a similar function which can check and return a df column conditional on its dtype without using any loops.
I've looked at
.select_dtypes(include=[np.float])
but this only returns columns that have entirely float64 values, not every row in a column that is a float.
First let's set up a DataFrame with two columns. Only column b has a float. We'll try to find this row:
import pandas

df = pandas.DataFrame({
    'a': ['qw', 'er'],
    'b': ['ty', 1.98]
})
When printed this looks like:
    a     b
0  qw    ty
1  er  1.98
Then create a map to select the rows using apply():
def check_if_float(row):
    # True when the value in column 'b' is a float
    return isinstance(row['b'], float)

map = df.apply(check_if_float, axis=1)
This will give a boolean map of all the rows that have a float in column b:
0    False
1     True
dtype: bool
You can then use this map to select the rows you want
filtered_rows = df[map]
Which leaves you only the rows that contain a float in column b:
    a     b
1  er  1.98
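As an aside, the same boolean map can be built column-wise, avoiding the row-wise axis=1 apply; a small variant under the same assumptions:
# Same boolean map, computed directly on column b
map = df['b'].map(lambda v: isinstance(v, float))
filtered_rows = df[map]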