pandas str.contains with regular expression: special characters and a full word

I am trying to remove rows that contain any of these characters (#%+*=) and also a full word:
col
ahoi*
word
be
df = df[~df["col"].str.contains(r'[##%+*=](word)', regex=True)]
I managed to remove the special characters alone with .str.contains(r'[##%+*=]'), but I cannot remove the row with the full word.
What am I missing?
This is the expected result.
col
be

IIUC, you need to use the or operator (|) instead of parentheses:
df = df[~df["col"].str.contains(r'[##%+*=]|word', regex=True)]
Output :
print(df)
col
2 be

You can try:
>>> df[~df['col'].str.contains(r'(?:[##%+*=]|word)', regex=True)]
col
2 be
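Putting the answers together with the question's sample data, here is a minimal runnable sketch (the column name is assumed to be "col"):

```python
import pandas as pd

# Reconstruction of the question's sample data.
df = pd.DataFrame({"col": ["ahoi*", "word", "be"]})

# Drop rows containing any special character OR the full word "word".
# Inside a character class, * and + are literal characters, and | is
# the alternation ("or") that brings the full word into the pattern.
mask = df["col"].str.contains(r"[#%+*=]|word", regex=True)
filtered = df[~mask]
```

Only the "be" row survives, matching the expected result.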

Related

pandas: column name with ' separator

I'm reading a file which has a ' in one of the column names. Something like:
df:
colA col'21 colC
abc 2001 Ab1
Now I can't seem to read the column like:
df['col\'21']
It gives a KeyError.
Your character is not an ASCII quote but the Right Single Quotation Mark (U+2019).
Replace this character with the standard quote:
df.columns = df.columns.str.replace('\u2019', "'")
print(df["col'21"])
To find the unicode character, use:
>>> hex(ord("’"))
'0x2019'
You need to use the ’ character instead. So instead of:
df["YTD'21"]
try:
df["YTD’21"]
I don't get any error:
df = pd.DataFrame({"col'21":[1,2,3]})
or
df = pd.DataFrame({"""col'21""":[1,2,3]})
Output:
col'21
0 1
1 2
2 3
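A small sketch tying the answers together; the U+2019 character in the header is assumed to come from the file being read:

```python
import pandas as pd

# Hypothetical reproduction: the file's header contains U+2019 ('RIGHT
# SINGLE QUOTATION MARK'), not the ASCII apostrophe, which is why
# df["col'21"] raises a KeyError.
df = pd.DataFrame({"col\u201921": [2001]})

# Identify the offending character, as the answer above shows...
assert hex(ord(df.columns[0][3])) == "0x2019"

# ...then normalise the headers to the standard quote:
df.columns = df.columns.str.replace("\u2019", "'")
value = df["col'21"].iloc[0]
```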

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns wherein the entries of each column are a combination of numbers, upper- and lower-case letters and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 appearing between the | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
The call doesn't do anything because str.lstrip does not take a regex: it strips any leading characters that are in the given set, so your pattern is treated as a literal list of characters. Instead of working with these hard-to-read regex expressions, you might want to consider a simple split. For example:
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
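If you do need only the digits between the pipes, str.extract with a capture group is an alternative sketch (expand=False makes it return a Series rather than a DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"col": ["x=ABCDefgh_5|123|", "y=XYZabcd_9|456|"]})

# Capture one or more digits enclosed in literal pipe characters.
df["nums"] = df["col"].str.extract(r"\|(\d+)\|", expand=False)
```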

How to replace element in pandas DataFrame column [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with a - dash. I'm currently using this method, but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
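A quick sketch of this frame-wide variant (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": ["(2,30)", "(50,290)"],
                   "b": ["(400,1000)", "(1,2)"]})

# With regex=True, replace matches substrings in every column at once,
# rather than requiring an exact full-value match.
out = df.replace(",", "-", regex=True)
```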
If you only need to replace characters in one specific column, and somehow regex=True and inplace=True both failed, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
Here lambda is an anonymous function that is applied to every entry of the column, much like a for loop; x represents each entry in turn.
The only things you need to change are "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Similar to the answer by Nancy K, this works for me; note that inside apply each x is a plain string, so call replace directly rather than through the .str accessor:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
If you want to remove two or more characters from a string, for example the characters '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
0     100000
1    1100000
Name: Column_Name, dtype: object
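A runnable sketch of this last example; since the result of str.replace is still a Series of strings, .astype(int) is appended here to get actual numbers:

```python
import pandas as pd

data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Remove both '$' and ',' in one pass with a character class,
# then convert the cleaned strings to integers.
cleaned = data.Column_Name.str.replace("[$,]", "", regex=True).astype(int)
```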

Get characters before the underscore

I have a pandas dataframe in the below format
name
BC_new-0
BC_new-1
BC_new-2
Would like to extract whatever is before the "_" and append it to a new column.
df['value'] = str(df['name']).split("_")[0]
But I get the below results
value
0 BC
0 BC
0 BC
Any suggestions on how to keep this "0" out of the output? Any leads would be appreciated.
I might use str.extract here:
df['value'] = df['name'].str.extract(r'^([^_]+)')
As the comment above suggests, if you want to use string splitting, then use str.split:
df['value'] = df['name'].str.split("_").str[0]
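Both answers side by side on the question's data; element-wise string methods keep the index aligned, so no stray "0" from a Series repr ends up in the values:

```python
import pandas as pd

df = pd.DataFrame({"name": ["BC_new-0", "BC_new-1", "BC_new-2"]})

# str.extract: capture everything up to the first underscore
# (expand=False returns a Series for the single group).
df["value"] = df["name"].str.extract(r"^([^_]+)", expand=False)

# str.split: split on the underscore and keep the first piece.
df["value2"] = df["name"].str.split("_").str[0]
```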

pandas fillna with fuzzy search on col names

I have a dataframe with many column names having _paid as part of the name (e.g. A_paid, B_paid, etc.). I need to fill missing values in any column that has _paid as part of its name. (Note: I am not allowed to replace missing values in the other columns whose names do not contain _paid.)
I tried to use .fillna(), but I'm not sure how to make it do a fuzzy search on the column names.
If you want to select any column that has _paid in it:
paid_cols = df.filter(like="_paid").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid", regex=False)]
and then
df[paid_cols] = df[paid_cols].fillna(...)
If you need _paid to be at the end only, then with $ anchor in a regex:
paid_cols = df.filter(regex="_paid$").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid$")]
then the same fillna above.
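A minimal sketch with made-up data, showing that only the _paid columns are filled:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two "_paid" columns plus one that must stay untouched.
df = pd.DataFrame({
    "A_paid": [1.0, np.nan],
    "B_paid": [np.nan, 2.0],
    "other":  [np.nan, 3.0],
})

# Select the matching columns and fill only those.
paid_cols = df.filter(like="_paid").columns
df[paid_cols] = df[paid_cols].fillna(0)
```

The NaN in "other" is left alone, as the question requires.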