Remove special characters from dataframe columns - pandas

How can I remove special characters from the values in a DataFrame column? For example:
name           verified  id
Jason' Carly   True      1
Eunice, Banks  None      2
Expected result:
name          verified  id
Jason Carly   True      1
Eunice Banks  None      2

Assuming you want to remove everything that is not a space or an ASCII letter, use a regex:
df['name'] = df['name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
Output:
   name          verified  id
0  Jason Carly   True      1
1  Eunice Banks  None      2
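As a minimal, self-contained sketch of the answer above: the frame below is rebuilt from the question's sample, and `strip_special` is a hypothetical helper name introduced only for illustration. It assumes the column holds strings and that only ASCII letters and spaces should survive.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "name": ["Jason' Carly", "Eunice, Banks"],
    "verified": [True, None],
    "id": [1, 2],
})

def strip_special(series: pd.Series) -> pd.Series:
    """Drop every character that is not an ASCII letter or a space."""
    return series.str.replace(r"[^a-zA-Z ]", "", regex=True)

df["name"] = strip_special(df["name"])
print(df["name"].tolist())
```

Note that the character class also removes digits and accented letters; widen it (for example to `[^\w ]`) if those should be kept.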

Related

Filter Pandas DataFrame to show only rows that contain all strings from a list of strings

If we have a DataFrame:
   Column1  Column2
0  Alpha    This is bananas
1  Bravo    This is not
2  Charlie  This is not bananas
3  Delta    This is not a banana
4  Echo     This is not a Banana
5  Foxtrot  This is not a banananananana
and we want to select only the rows that include all the strings from a list of strings, how would we write a function to filter them? Matching should be case-insensitive.
For example, if I wanted to filter specifically for ['not', 'banana'], I could pass that list to the function and it should return the following DataFrame:
   Column1  Column2
0  Delta    'This is not a banana'
1  Echo     'This is not a Banana'
The basic requirements:
Column2 must include all the strings from a given list of strings. The list can have any length: 1, 2, 3, 5, 10, or however many strings I want.
Case insensitivity (which is why filtering for "banana" gives rows with "banana" and "Banana").
Whole-word matches only. Rows with "bananas" or "banana's" or "banananananana" would not be selected when filtering for "banana".
One approach is to use sets.
Casefold each string and split it into a list of words, then turn that list into a set.
>>> df.Column2.str.casefold().str.split().map(set)
0 {bananas, this, is}
1 {not, this, is}
2 {not, bananas, this, is}
3 {is, this, not, banana, a}
4 {is, this, not, banana, a}
5 {is, this, banananananana, not, a}
Name: Column2, dtype: object
You can then check whether your words are a subset of each row's set:
>>> words = ['not', 'banana']
>>> set(words) <= df.Column2.str.lower().str.split().map(set)
0 False
1 False
2 False
3 True
4 True
5 False
This can be used to index your dataframe.
>>> df[ set(words) <= df.Column2.str.lower().str.split().map(set) ]
Column1 Column2
3 Delta This is not a banana
4 Echo This is not a Banana
Rather than calling set(words) each time, you could define words as a set instead of a list in the first place.
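The set-based steps above could be wrapped in a function along these lines. This is only a sketch: `rows_with_all_words` is a hypothetical name, and it assumes the column contains no missing values and that words are separated by whitespace. It uses an explicit `issubset` call per row rather than comparing a set against the whole Series.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"],
    "Column2": [
        "This is bananas",
        "This is not",
        "This is not bananas",
        "This is not a banana",
        "This is not a Banana",
        "This is not a banananananana",
    ],
})

def rows_with_all_words(frame, column, words):
    """Keep rows whose `column` contains every word (whole word, any case)."""
    wanted = {w.casefold() for w in words}
    # One set of casefolded words per row.
    word_sets = frame[column].str.casefold().str.split().map(set)
    # True where every wanted word appears in the row's set.
    mask = word_sets.map(wanted.issubset)
    return frame[mask]

result = rows_with_all_words(df, "Column2", ["not", "banana"])
print(result)
```

Because whole words are compared, "bananas" and "banananananana" do not count as "banana".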
Another approach is a regex: put each of your words inside a positive lookahead assertion.
>>> import re
>>> pattern = '(?i)' + ''.join(rf'(?=.*(^|\s){re.escape(word)}(\s|$))' for word in words)
>>> pattern
'(?i)(?=.*(^|\\s)not(\\s|$))(?=.*(^|\\s)banana(\\s|$))'
You can use this pattern with pandas.Series.str.contains():
>>> df[ df.Column2.str.contains(pattern) ]
Column1 Column2
3 Delta This is not a banana
4 Echo This is not a Banana
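A self-contained sketch of the lookahead approach follows. `all_words_pattern` is a hypothetical helper name. Unlike the pattern above, it uses non-capturing groups `(?:...)`, a small change made here on the assumption that you want to avoid the warning pandas emits when a `str.contains` pattern has capture groups.

```python
import re
import pandas as pd

def all_words_pattern(words):
    """Build one case-insensitive regex with a lookahead per word."""
    lookaheads = "".join(
        rf"(?=.*(?:^|\s){re.escape(word)}(?:\s|$))" for word in words
    )
    return "(?i)" + lookaheads

df = pd.DataFrame({
    "Column1": ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"],
    "Column2": [
        "This is bananas",
        "This is not",
        "This is not bananas",
        "This is not a banana",
        "This is not a Banana",
        "This is not a banananananana",
    ],
})

pattern = all_words_pattern(["not", "banana"])
matches = df[df["Column2"].str.contains(pattern)]
print(matches)
```

Each lookahead requires its word to be delimited by whitespace or the string boundaries, so punctuation attached to a word (as in "banana's") prevents a match.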

Select rows where column value is a combination of numbers and letters

Having a dataset like this:
word
0 TBH46T
1 BBBB
2 5AAH
3 CAAH
4 AAB1
5 5556
What would be the most efficient way to select the rows where column word is a combination of numbers and letters?
The output would be like this:
word
0 TBH46T
2 5AAH
4 AAB1
A possible solution would be to create a new column using apply and a regex that records whether column word has the desired structure. But I'm curious whether this could be achieved in a more straightforward way.
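Use Series.str.contains to build two masks, one matching a digit and one matching a non-digit, and combine them with & (element-wise AND). Note that \D matches any non-digit character, not only letters, which is sufficient for this data:
df = df[df['word'].str.contains(r'\d') & df['word'].str.contains(r'\D')]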
print (df)
word
0 TBH46T
2 5AAH
4 AAB1
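A stricter variant, sketched under the assumption that "letters" means ASCII letters: it swaps `\D` for the character class `[A-Za-z]`, so a value such as "12-34" would not count as mixed. The frame is rebuilt from the question's sample.

```python
import pandas as pd

df = pd.DataFrame({"word": ["TBH46T", "BBBB", "5AAH", "CAAH", "AAB1", "5556"]})

# One boolean mask per condition, combined element-wise with &.
has_digit = df["word"].str.contains(r"\d")
has_letter = df["word"].str.contains(r"[A-Za-z]")

mixed = df[has_digit & has_letter]
print(mixed)
```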

Need a way to split a string in pandas into columns with numbers

Hi, I have a string in one column:
s='123. 125. 200.'
I want to split it into 3 columns (or as many columns as there are numbers ending with "."),
so that every column holds a number, not a string.
From what I understand, you can use:
s='123. 125. 200.'
pd.Series(s).str.rstrip('.').str.split('.',expand=True).apply(pd.to_numeric,errors='coerce')
0 1 2
0 123 125 200
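A runnable sketch of the same idea, with one assumption added: each piece is stripped of surrounding whitespace before conversion, so the result does not depend on how `pd.to_numeric` treats the leading space left by the split. The name `wide` is introduced here for illustration.

```python
import pandas as pd

s = "123. 125. 200."

wide = (
    pd.Series([s])
    .str.rstrip(".")              # drop the trailing dot
    .str.split(".", expand=True)  # one column per number
    # Strip stray spaces, then convert each column to a numeric dtype.
    .apply(lambda col: pd.to_numeric(col.str.strip(), errors="coerce"))
)
print(wide)
```

With `errors="coerce"`, any piece that is not a valid number becomes NaN instead of raising.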

Removing double space and single space in data frame simultaneously

I have a column where the names are separated by a single space or a double space (there can be more), and I want to split the names into First Name and Last Name.
df = pd.DataFrame({'Name': ['Steve Smith', 'Joe  Nadal', 'Roger   Federer'],
                   'Age': [32, 34, 36]})
df['Name'] = df['Name'].str.strip()
df[['First_Name', 'Last_Name']] = df['Name'].str.split(" ", expand=True)
This should do it:
df[['First_Name', 'Last_Name']] = df.Name.apply(lambda x: pd.Series(list((filter(None, x.split(' '))))))
Use \s+ as your split pattern. This is the regex pattern meaning "one or more whitespace characters".
Also, limit the number of splits with n=1. This means each string will only be split once (at the first occurrence of whitespace from left to right), restricting the output to 2 columns.
df[['First_Name', 'Last_Name']] = df.Name.str.split(r'\s+', expand=True, n=1)
[out]
   Name           Age  First_Name  Last_Name
0  Steve Smith    32   Steve       Smith
1  Joe Nadal      34   Joe         Nadal
2  Roger Federer  36   Roger       Federer
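Put together as a self-contained sketch: the sample names below use two and three spaces as separators, which is an assumption about what the question's data looked like. The `regex` argument is left at its default, under which a multi-character pattern is treated as a regular expression.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Steve Smith", "Joe  Nadal", "Roger   Federer"],
    "Age": [32, 34, 36],
})

# Trim the ends, then split once on the first run of whitespace.
parts = df["Name"].str.strip().str.split(r"\s+", n=1, expand=True)
df[["First_Name", "Last_Name"]] = parts
print(df)
```

With `n=1`, a name of three or more words keeps everything after the first word in Last_Name.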

change value (string manipulation) in Pandas DataFrame

I am reading a CSV file into a pandas DataFrame, but the data needs to be cleaned up before it can be used. I need to do two things:
use regex to filter values
apply string functions such as trim, left, right, ...
For instance, the DataFrame may look like:
0 city_some_string_45
1 city_Other_string_56
2 city_another_string_77
So I need to filter (using regex) for all rows whose value starts with "city" and get the last two characters.
The end result should look like:
0 45
1 56
2 77
In other words, the logic I want to apply is: read the value of each cell and, if it starts with "city" (filtering with a regex, i.e. ^city), replace the cell's value with its last two characters (e.g. using a right string function).
For a dataframe like this:
No city
0 0 city_some_string_45
1 1 city_Other_string_56
2 2 city_another_string_77
Filter the DataFrame to keep the rows whose city column starts with "city":
df = df[df.city.str.startswith('city')]
You can use str.extract to pull out only the number (the first run of digits in each value):
df['city'] = df.city.str.extract(r'(\d+)', expand=False).astype(int)
The resulting df
No city
0 0 45
1 1 56
2 2 77
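If you want exactly what the question describes (a ^city regex filter, then the last two characters), a sketch could look like this. It swaps `str.extract` for plain string slicing and assumes the last two characters are always digits. The name `cleaned` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "No": [0, 1, 2],
    "city": ["city_some_string_45", "city_Other_string_56", "city_another_string_77"],
})

# Regex filter: keep rows whose value starts with "city".
starts_with_city = df["city"].str.contains(r"^city")
cleaned = df[starts_with_city].copy()  # copy so the assignment below is unambiguous

# Equivalent of a "right" string function: take the last two characters.
cleaned["city"] = cleaned["city"].str[-2:].astype(int)
print(cleaned)
```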