How to replace element in pandas DataFrame column [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the comma (,) with a dash (-). I'm currently using this method, but nothing changes:
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?

Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
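For completeness: Series.replace can also do substring replacement if you pass regex=True. A minimal sketch on the question's data:
import pandas as pd

df = pd.DataFrame({'range': ['(2,30)', '(50,290)']})
# With regex=True, replace matches substrings rather than whole values
df['range'] = df['range'].replace(',', '-', regex=True)
print(df['range'].tolist())  # ['(2-30)', '(50-290)']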

For anyone else arriving here from a Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built-in replace method available on a DataFrame object.
df.replace(',', '-', regex=True)
Source: Docs
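A minimal sketch of how that behaves across a whole frame (toy data, not from the question):
import pandas as pd

df = pd.DataFrame({'a': ['(2,30)', '(50,290)'], 'b': ['1,2', '3,4']})
# regex=True makes replace work on substrings in every object column
df = df.replace(',', '-', regex=True)
print(df.to_dict('list'))  # {'a': ['(2-30)', '(50-290)'], 'b': ['1-2', '3-4']}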

If you only need to replace characters in one specific column, and regex=True and inplace=True have somehow both failed, this way should work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda here acts like the body of a for loop: apply calls it once per entry, and x represents each individual entry in the column.
The only things you need to change are "column_name", "characters_need_to_replace" and "new_characters".
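A concrete sketch of that template on toy data (this assumes every entry is a string; x.replace here is the plain Python string method, and it would fail on NaN entries):
import pandas as pd

data = pd.DataFrame({'range': ['(2,30)', '(50,290)']})
# apply calls the lambda once per entry; x is an individual Python string
data['range'] = data['range'].apply(lambda x: x.replace(',', '-'))
print(data['range'].tolist())  # ['(2-30)', '(50-290)']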

Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
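A quick before/after sketch with made-up column names:
import pandas as pd

data = pd.DataFrame(columns=['first name', 'last name'])
data.columns = data.columns.str.replace(' ', '_', regex=True)
print(list(data.columns))  # ['first_name', 'last_name']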

In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)
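A runnable sketch with throwaway data (note that str.replace returns a new Series, so assign the result back):
import re
import pandas as pd

chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df = pd.DataFrame({'string_col': ['(2.30)', 'a-b']})
df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)
print(df['string_col'].tolist())  # ['230', 'ab']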

Similar to the answer by Nancy K, but note that inside apply on a Series, x is a plain Python string, so use the built-in str.replace rather than the pandas .str accessor:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))

If you want to remove two or more characters from a string, for example '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ['100000', '1100000']
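If you want actual numbers rather than cleaned strings, pd.to_numeric (or astype) can follow the replacement; a small sketch:
import pandas as pd

data = pd.DataFrame({'Column_Name': ['$100,000', '$1,100,000']})
cleaned = data.Column_Name.str.replace('[$,]', '', regex=True)
data['Column_Name'] = pd.to_numeric(cleaned)
print(data.Column_Name.tolist())  # [100000, 1100000]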

Related

pandas can't replace commas with dots

Please help.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When i load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
when i try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the affected columns to a numeric type (float, given values like '0,20'), because for now they are of type object.
You can check for those cases with the below command:
df.dtypes
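A hedged sketch of the full fix, assuming a column named 'value' with entries like '0,20' (the real column names are in the linked file and aren't reproduced here):
import pandas as pd

df = pd.DataFrame({'value': ['0,20', '1,50']})
# Replace the decimal comma, then cast to a numeric dtype
df = df.replace(',', '.', regex=True)
df['value'] = df['value'].astype(float)
print(df.dtypes)  # value    float64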

Pyspark Dataframe - How to create new column with only first 2 words

I have a dataframe df with a column for full name (first, middle & last). The column name is full_name and the words are separated by a space (the delimiter).
I'd like to create a new column containing only the first and middle names.
I have tried the following
df = df.withColumn('new_name', split(df['full_name'], ' '))
But this returns all the words in a list.
I also tried
df = df.withColumn('new_name', split(df['full_name'], ' ').getItem(1))
But this returns only the 2nd name in the list (middle name)
Please advise how to proceed with this.
Try this
import pyspark.sql.functions as F
split_col = F.split(df['full_name'], ' ')
df = df.withColumn('FirstMiddle', F.concat_ws(' ',split_col.getItem(0),split_col.getItem(1)))
df.show()
It took me some time, but I came up with this (with import pyspark.sql.functions as f):
df1 = df.withColumn('first_name', f.split(df['full_name'], ' ').getItem(0))\
.withColumn('middle_name', f.split(df['full_name'], ' ').getItem(1))\
.withColumn('New_Name', f.concat(f.col('first_name'), f.lit(' '), f.col('middle_name')))\
.drop('first_name')\
.drop('middle_name')
This is working code and the output is as expected, but I am not sure how efficient it is. If someone has better ideas, please reply.
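One more compact variant, as a hedged sketch: slice and array_join (both in pyspark.sql.functions since Spark 2.4) keep the first two tokens and join them back:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('John Ronald Tolkien',)], ['full_name'])
# slice is 1-based: take two elements starting at position 1
df = df.withColumn(
    'new_name',
    F.array_join(F.slice(F.split(F.col('full_name'), ' '), 1, 2), ' ')
)
df.show()  # new_name: 'John Ronald'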

.str.replace() is not replacing values in my dataframe [duplicate]

I have the following pandas dataframe. Say it has two columns: id and search_term:
id search_term
37651 inline switch
I do:
train['search_term'] = train['search_term'].str.replace("in."," in. ")
expecting that the dataset above is unaffected, but I get in return for this dataset:
id search_term
37651 in. in. switch
which means inl is replaced by in. and ine is replaced by in., as if I were using a regular expression, where the dot means any character.
How do I rewrite the first command so that, literally, in. is replaced by in. but any in not followed by a dot is untouched, as in:
a = 'inline switch'
a = a.replace('in.','in. ')
a
>>> 'inline switch'
In version 0.23 or newer, str.replace() has an option for switching regex matching on or off.
The following will simply turn it off:
df.search_term.str.replace('in.', 'in. ', regex=False)
This results in:
0 inline switch
1 in. here
Name: search_term, dtype: object
and here is the underlying answer: you need a regular expression that matches a literal dot.
str.replace() in pandas indeed uses regex, so that:
df['a'] = df['a'].str.replace('in.', ' in. ')
is not comparable to:
a.replace('in.', ' in. ')
the latter does not use regex. So use '\.' (ideally as a raw string, r'in\.') instead of '.' in a statement that uses regex if you really mean a dot and not any character.
Regular Expression to match a dot
Try escaping the .:
import pandas as pd
df = pd.DataFrame({'search_term': ['inline switch', 'in.here']})
>>> df.search_term.str.replace('in\\.', 'in. ')
0 inline switch
1 in. here
Name: search_term, dtype: object
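Worth noting for newer pandas: since pandas 2.0 the default for str.replace is regex=False, so a literal dot is no longer a trap by default. To opt into regex explicitly, a raw string keeps the escaping readable; a quick sketch:
import pandas as pd

df = pd.DataFrame({'search_term': ['inline switch', 'in.here']})
# r'in\.' matches the literal text "in." only
out = df['search_term'].str.replace(r'in\.', 'in. ', regex=True)
print(out.tolist())  # ['inline switch', 'in. here']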

pandas fillna with fuzzy search on col names

I have a dataframe with many column names containing _paid (e.g. A_paid, B_paid, etc.). I need to fill missing values in any column that has _paid as part of its name. (Note: I am not allowed to replace missing values in columns whose names don't contain _paid.)
I tried to use .fillna(), but I'm not sure how to make it do a fuzzy search on column names.
If you want to select any column that has _paid in it:
paid_cols = df.filter(like="_paid").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid", regex=False)]
and then:
df[paid_cols] = df[paid_cols].fillna(...)
If you need _paid to be at the end only, then with $ anchor in a regex:
paid_cols = df.filter(regex="_paid$").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid$")]
then the same fillna above.
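Putting it together, a minimal sketch on a toy frame (the fill value 0 is just an example):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_paid': [1.0, np.nan],
    'B_paid': [np.nan, 2.0],
    'other': [np.nan, 3.0],
})
paid_cols = df.filter(like='_paid').columns
df[paid_cols] = df[paid_cols].fillna(0)
print(df)  # only A_paid and B_paid are filled; 'other' keeps its NaN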

Is there a way to selectively replace content in a dataframe?

df.replace(['《', '》', ' ,\t'], ['', '', '|'], regex=True, inplace=True)
I have a DataFrame in which I want to replace those 3 character sequences. But this replaces them in every column. Can I exclude a particular column from the replacement in the code above? For example, I have a column called 'Summary' to which I don't want to apply these replacements.
Is that possible?
Assuming you only want to exclude df['Summary'], you can use .loc to select all columns but that one and then replace the strings you need to replace.
I dropped the inplace=True because I'm not sure it would work here.
df.loc[:, df.columns != 'Summary'] = df.loc[:, df.columns != 'Summary'].replace(['《','》',' ,\t'],['','','|'], regex=True)
Try passing a dict:
df.update(df.drop(columns='Summary').replace(dict(zip(['《', '》', ' ,\t'], ['', '', '|'])), regex=True))
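A sketch of the .loc approach on a toy frame (column names assumed):
import pandas as pd

df = pd.DataFrame({'Summary': ['《keep》'], 'text': ['《drop》']})
mask = df.columns != 'Summary'
df.loc[:, mask] = df.loc[:, mask].replace(['《', '》'], ['', ''], regex=True)
print(df.to_dict('list'))  # {'Summary': ['《keep》'], 'text': ['drop']}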