pandas: column name with ' separator

I'm reading a file that has a ' in one of its column names. Something like
df:
colA col'21 colC
abc 2001 Ab1
Now I can't seem to read the column like:
df['col\'21']
It gives a KeyError.

Your character is not a plain apostrophe but the Right Single Quotation Mark (U+2019).
Replace this character with the standard apostrophe:
df.columns = df.columns.str.replace('\u2019', "'")
print(df["col'21"])
To find the unicode character, use:
>>> hex(ord("’"))
'0x2019'
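Putting the pieces together, a minimal sketch of detecting and normalizing the curly quote (the column name and value here are made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose header contains U+2019 (the curly right quote),
# which looks like an apostrophe but is a different character
df = pd.DataFrame({"col\u201921": [2001]})

# Reveal any non-ASCII code points hiding in the column names
for name in df.columns:
    print([hex(ord(c)) for c in name if ord(c) > 127])

# Normalize the curly quote to a plain apostrophe, then index as usual
df.columns = df.columns.str.replace("\u2019", "'")
print(df["col'21"])
```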

You need to use the ’ character instead:
instead of:
df["YTD'21"]
try:
df["YTD’21"]

I don't get any error:
df = pd.DataFrame({"col'21":[1,2,3]})
or
df = pd.DataFrame({"""col'21""":[1,2,3]})
Output:
col'21
0 1
1 2
2 3

Related

pandas contains with regular expression - special character and full word

I am trying to remove rows that contain any of these characters (#%+*=) and also a full word
col
ahoi*
word
be
df = df[~df[col].str.contains(r'[#%+*=](word)', regex=True)]
I managed to remove the special characters with .str.contains(r'[#%+*=]'); however, I cannot remove the row with the full word.
What am I missing?
This is the expected result.
col
be
IIUC, you need to use the or operator (|) instead of parentheses:
df = df[~df["col"].str.contains(r'[#%+*=]|word', regex=True)]
Output:
print(df)
col
2 be
You can try:
>>> df[~df['col'].str.contains(r'(?:[##%+*=]|word)', regex=True)]
col
2 be
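For reference, a self-contained sketch of the alternation approach on the sample rows (the duplicate # in the character class is dropped):

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({"col": ["ahoi*", "word", "be"]})

# Keep rows that contain neither a special character nor the full word
out = df[~df["col"].str.contains(r"[#%+*=]|word", regex=True)]
print(out["col"].tolist())  # ['be']
```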

How to replace element in pandas DataFrame column [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with a - dash. I'm currently using this method but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
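A quick sketch of that DataFrame-wide replacement (the column names here are made up); with regex=True, replace matches substrings across every column instead of requiring an exact cell match:

```python
import pandas as pd

df = pd.DataFrame({"range": ["(2,30)", "(50,290)"], "other": ["a,b", "c"]})

# regex=True makes replace operate on substrings in every column
out = df.replace(",", "-", regex=True)
print(out)
```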
If you only need to replace characters in one specific column and somehow regex=True and inplace=True both failed, this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda works like a function applied in a for loop over the column in this scenario.
x here represents each entry of the column in turn.
The only thing you need to change is "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_')
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Almost similar to the answer by Nancy K, but using the vectorised str.replace directly (inside apply, x is a plain string, so x.str.replace would fail):
data["column_name"] = data["column_name"].str.replace("characters_need_to_replace", "new_characters")
If you want to remove two or more characters from a string, for example the characters '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ["100000", "1100000"]
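Note that str.replace returns strings; if you want actual numbers, a follow-up astype is needed. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Strip '$' and ',' (a regex character class), then convert to integers
cleaned = data["Column_Name"].str.replace(r"[$,]", "", regex=True).astype(int)
print(cleaned.tolist())  # [100000, 1100000]
```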

Get characters before the underscore

I have a pandas dataframe in the below format
name
BC_new-0
BC_new-1
BC_new-2
I would like to extract whatever is before the "_" and put it in a new column:
df['value'] = str(df['name']).split("_")[0]
But I get the below results
value
0 BC
0 BC
0 BC
Any suggestions on how to get rid of this "0" in the output? Any leads would be appreciated.
Note that str(df['name']) turns the whole Series, index included, into a single string, which is why the index "0" leaks into every row. Use the vectorised string methods instead. I might use str.extract here:
df['value'] = df['name'].str.extract(r'^([^_]+)')
As the comment above suggests, if you want to use string splitting, then use str.split:
df['value'] = df['name'].str.split("_").str[0]
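Both answers can be checked side by side on the sample names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["BC_new-0", "BC_new-1", "BC_new-2"]})

# Vectorised: each row is processed separately, so no index leaks in
df["value_extract"] = df["name"].str.extract(r"^([^_]+)")
df["value_split"] = df["name"].str.split("_").str[0]
print(df)
```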

How to find and replace second comma (,) with - in a dataframe using Pandas?

I have dataframe like :
Name Address
Anuj Anuj,Sinha,BB
Sinha Sinha,Anuj BB
In the column Address, I want to replace every comma (,) except the first one in each row with -.
Can anyone please suggest me the possible solution?
provided:
df.dtypes
Customer ID Int64
First_name-Last_name string
Address string
Phone string
Secondary_station string
Customer_disconnected string
If there is a maximum of 2 commas, you can use this simple regex (in recent pandas versions str.replace needs regex=True for pattern matching):
df['Address'] = df['Address'].str.replace(r'(,.*),', r'\1-', regex=True)
output:
Name Address
0 Anuj Anuj,Sinha-BB
1 Sinha Sinha,Anuj BB
If there are possibly more than 2 commas, you can do:
df['Address'] = df['Address'].str.split(',').apply(lambda x: x[0]+','+'-'.join(x[1:]))
or, more efficient:
splits = df['Address'].str.split(',', n=1)
df['Address'] = splits.str[0]+','+splits.str[1].str.replace(',', '-')
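A runnable sketch of the split-once approach, with an extra made-up row to show it also handles more than 2 commas:

```python
import pandas as pd

df = pd.DataFrame({"Address": ["Anuj,Sinha,BB", "Sinha,Anuj BB", "a,b,c,d"]})

# Split on the first comma only; dash every comma in the remainder
splits = df["Address"].str.split(",", n=1)
df["Address"] = splits.str[0] + "," + splits.str[1].str.replace(",", "-")
print(df["Address"].tolist())
```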
You can use the replace function this way :
txt = "I like bananas"
x = txt.replace("bananas", "apples")
print(x)
It will display :
I like apples
For your dataframe, you just need to iterate through the values this way:
import pandas as pd
df = pd.DataFrame(
{
'name': ['Anuj', 'Sinha'],
'adresse': ['Anuj,Sinha,BB', 'Sinha,Anuj BB']
}
)
for column in df.columns:
    for index, value in enumerate(df[column]):
        df.loc[index, column] = value.replace(',', '-')
print(df)
It will display :
name adresse
0 Anuj Anuj-Sinha-BB
1 Sinha Sinha-Anuj BB
(Note that this replaces every comma, including the first one, so it does not fully meet the original requirement.)

Pandas - Extracting all text after the 4th character

I am trying to extract all the characters in a column after the 4th character.
col_a
XYZ123
ABCD001
Expecting the below
col_a, new_col
XYZ123, 23
ABCD001, D001
Try with string slicing:
df['new_col'] = df['col_a'].str[4:]
Or via the re module (note that this grabs the first run of digits, which is not quite "everything after the 4th character"):
import re
df['new_col'] = df['col_a'].apply(lambda x: re.findall(r'[0-9]+', x)[0])
With your shown samples, please try the following, using the str.extract function of pandas. Simple explanation: the regex ^.{4}(.*)$ captures everything apart from the first 4 characters into a capturing group and saves it to the new column.
df['new_col'] = df['col_a'].str.extract(r'^.{4}(.*)$',expand=False)
Output of df will be as follows:
col_a new_col
0 XYZ123 23
1 ABCD001 001
Another way: extract the alphanumerics that follow the first 3 alphanumerics:
df['new_col'] = df.col_a.str.extract(r'((?<=^\w{3})\w+)')
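Comparing the slicing and extract approaches on the sample column (note both give "001", not "D001", for ABCD001, since that is what actually follows the 4th character):

```python
import pandas as pd

df = pd.DataFrame({"col_a": ["XYZ123", "ABCD001"]})

# Everything after the first 4 characters, two ways
df["new_col"] = df["col_a"].str[4:]
df["via_extract"] = df["col_a"].str.extract(r"^.{4}(.*)$", expand=False)
print(df)
```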