How do I drop pandas dataframe columns that contain special characters such as # / ] [ } { - _ etc.?
For example, I have the following dataframe (called df):
I need to drop the columns Name and Matchkey because they contain some special characters.
Also, how can I specify a list of special characters based on which the columns will be dropped?
For example: I'd like to drop the columns that contain (in any record, in any cell) any of the following special characters:
listOfSpecialCharacters: ¬,`,!,",£,$,£,#,/,\
One option is to use a regex with str.contains and apply, then use boolean indexing to drop the columns:
import re
chars = '¬`!"£$£#/\\'
regex = f'[{"".join(map(re.escape, chars))}]'
# '[¬`!"£\\$£\\#/\\\\]'
df2 = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]
example:
# input
A B C
0 123 12! 123
1 abc abc a¬b
# output
A
0 123
1 abc
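Putting it together, here is a minimal, self-contained sketch of the approach above, assuming every column holds strings (a non-string column would need .astype(str) first):
import re
import pandas as pd

# toy frame matching the example above
df = pd.DataFrame({"A": ["123", "abc"], "B": ["12!", "abc"], "C": ["123", "a¬b"]})

chars = '¬`!"£$£#/\\'
regex = f'[{"".join(map(re.escape, chars))}]'

# keep only the columns where no cell matches any of the special characters
df2 = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]
print(df2)
#      A
# 0  123
# 1  abc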
Related
I have a Pandas dataframe with several columns wherein the entries of each column are a combination of numbers, upper and lower case letters and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the numbers 0-9 appearing only between | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
Instead of working with these unreadable regex expressions, you might want to consider a simple split. For example:
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
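With the toy frame above, output is a Series holding just the digits between the pipes:
print(output)
# 0    123
# 1    123
# Name: col, dtype: object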
I have a pandas dataframe called df of about 2 million records.
There is a column called transaction_id that might contain:
alpha values (e.g. "abscdwew") for some records
numeric values (e.g. "123454") for some records
both alpha and numeric values (e.g. "asd12354") for some records
alpha, numeric and special characters (e.g. "asd435_!") for some records
special characters (e.g. "_-!")
I want to drop that column if ALL values (i.e. across ALL records) contain:
combination of alpha and numeric values (e.g. "aseder345")
combination of alpha and special characters (e.g. "asedre_!")
combination of numeric and special characters (e.g. "123_!")
all special characters (e.g. "!")
Is there a pythonic way of doing so?
So, if a column contains, across all records, only values of those kinds, it should be dropped.
Given the following toy dataframe, in which col1 should be removed and col2 should be kept according to your criteria:
import pandas as pd
df = pd.DataFrame(
{
"col1": [
"abs#&wew",
"123!45!4",
"asd12354",
"asdfzf_!",
"123_!",
"asd435_!",
"_-!",
],
"col2": [
"abscdwew",
"123454",
"asd12354",
"a_!sdfzf",
"123_!",
"asd435_!",
"_-!",
],
}
)
Here is one way to do it:
# keep a column if at least one of its values is purely alphabetic or purely numeric
test = lambda x: x.isalpha() or x.isdigit()
cols_to_keep = df.apply(lambda col: any(test(v) for v in col))
df = df.loc[:, cols_to_keep]
print(df)
# Output
col2
0 abscdwew
1 123454
2 asd12354
3 a_!sdfzf
4 123_!
5 asd435_!
6 _-!
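If you prefer to avoid looping over cells in Python, a roughly equivalent vectorized sketch uses str.fullmatch (note: the regex below only covers ASCII letters and digits, whereas isalpha()/isdigit() also accept other Unicode characters):
cols_to_keep = df.apply(lambda col: col.str.fullmatch(r"[A-Za-z]+|[0-9]+").any())
df = df.loc[:, cols_to_keep]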
My first question, thanks! Sorry for the lengthy formulation.
Researched all related posts
What I have
My dataframe column (please see screenshot) holds strings of car parameters separated by the delimiter ','.
My dataframe: (see screenshot)
Some rows come with mileage while others do not (see screenshot), hence some rows have fewer delimiters.
The Task
I need to create 5 columns (the max number of fields) to store the CarParameters separately (Mileage, GearBox, HP, Body, etc.).
If a row doesn't have Mileage, put 0 in the Mileage column.
What I know and works well
df["name"].str.split(" ", expand = True) by default n=-1 and splits into necessary columns
example:
The issue:
If I use the str.split(",", expand=True) method, GearBox (ATM) is wrongly put under the newly created Mileage column because that row is short of one delimiter (see screenshot).
Result: (see screenshot)
You can try a lambda function combined with list concatenation, like below.
>>> import pandas as pd
>>> df = pd.DataFrame([['1,2,3,4,5'],['2,3,4,5']], columns=["CarParameters"])
>>> print(pd.DataFrame(df.CarParameters.apply(
lambda x: str(x).split(',')).apply(
lambda x: [0]*(5-len(x)) + x).to_list(), columns=list("ABCDE")))
A B C D E
0 1 2 3 4 5
1 0 2 3 4 5
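If you then want those five pieces attached back to the original frame under the parameter names from the question (the names below are assumed; adjust them to your data), a sketch could look like this:
cols = ["Mileage", "GearBox", "HP", "Body", "Other"]  # assumed column names
parts = df.CarParameters.apply(lambda x: str(x).split(','))
parts = parts.apply(lambda x: [0] * (5 - len(x)) + x)  # left-pad short rows with 0
df[cols] = pd.DataFrame(parts.to_list(), index=df.index, columns=cols)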
This is what my main dataframe looks like:
Group IDs New ID
1 [N23,N1,N12] N102
2 [N134,N100] N501
I have another dataframe that has all the required ID info in an unordered manner:
ID Name Age
N1 Milo 5
N23 Mark 21
N11 Jacob 22
I would like to modify the original dataframe such that all IDs are replaced with their respective names obtained from the other file. So that the dataframe has only names and no IDs and looks like this:
Group IDs New ID
1 [Mark,Silo,Bond] Niki
2 [Troy,Fangio] Kvyat
Thanks in advance
IIUC you can .explode your lists, replace values with .map and regroup them with .groupby
df['IDs'] = (df['IDs'].explode()
             .map(df1.set_index('ID')['Name'])
             .groupby(level=0).agg(list)
             )
If New ID column is not a list, you can use only .map()
df['New ID'] = df['New ID'].map(df1.set_index('ID')['Name'])
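A minimal sketch of how those two pieces fit together, assuming df holds the groups (with 'IDs' already being real lists, not strings) and df1 holds the ID/Name lookup; the New ID values below are made up so that they exist in df1:
import pandas as pd

df = pd.DataFrame({"Group": [1, 2],
                   "IDs": [["N23", "N1", "N11"], ["N11", "N23"]],
                   "New ID": ["N23", "N1"]})
df1 = pd.DataFrame({"ID": ["N1", "N23", "N11"],
                    "Name": ["Milo", "Mark", "Jacob"],
                    "Age": [5, 21, 22]})

df["IDs"] = (df["IDs"].explode()
                      .map(df1.set_index("ID")["Name"])
                      .groupby(level=0).agg(list))
df["New ID"] = df["New ID"].map(df1.set_index("ID")["Name"])
print(df)
#    Group                  IDs New ID
# 0      1  [Mark, Milo, Jacob]   Mark
# 1      2        [Jacob, Mark]   Milo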
You can try making a dict from your second DF and then replacing on the first using regex patterns (no need to fully understand it, check the comments below):
P.S.: since you didn't provide the full df with all the codes, I created one with some of them; that's why the print() won't show every value replaced.
import pandas as pd
# creating dummy dfs
df1 = pd.DataFrame({"Group":[1,2], "IDs":["[N23,N1,N12]", "[N134,N100]"], "New ID":["N102", "N501"] })
df2 = pd.DataFrame({"ID":['N1', "N23", "N11", "N100"], "Name":["Milo", "Mark", "Jacob", "Silo"], "Age":[5,21,22, 44]})
# Create the lookup dict; we're using regex patterns to make an exact match
dict_replace = df2.set_index("ID")['Name'].to_dict()
# 'f' before the string means f-string and 'r' means raw string (so backslashes reach the regex engine intact)
# \b is a regex word boundary; it marks the beginning and end of the match
## so that if you're searching for N1, it won't match inside N11
dict_replace = {fr"\b{k}\b":v for k, v in dict_replace.items()}
# Replacing on original where you want it
df1['IDs'].replace(dict_replace, regex=True, inplace=True)
print(df1['IDs'].tolist())
# >>> ['[Mark,Milo,N12]', '[N134,Silo]']
Please note the change in my dataframes. In your sample data, there are IDs in df that do not exist in df1. I altered my df to ensure only IDs present in df1 are represented. I use the following df:
print(df)
Group IDs New
0 1 [N23,N1,N11] N102
1 2 [N11,N23] N501
print(df1)
ID Name Age
0 N1 Milo 5
1 N23 Mark 21
2 N11 Jacob 22
Solution
Build a dict from df1.ID and df1.Name, map it onto an exploded df.IDs, and collect the result back into a list.
df['IDs'] = df['IDs'].str.strip('[]')  # Strip the square brackets
df['IDs'] = df['IDs'].str.split(',')  # Rebuild the list; done because the string could not be exploded directly
# Explode the list, map df1 onto df and collect the result back into a list
df.explode('IDs').groupby('Group')['IDs'].apply(lambda x:(x.map(dict(zip(df1.ID,df1.Name)))).tolist()).reset_index()
Group IDs
0 1 [Mark, Milo, Jacob]
1 2 [Jacob, Mark]
I have a dataframe with some column names containing Swedish characters (ö, ä, å). I would like to replace these characters with plain o, a, a instead.
I tried converting the column names to str and replacing the characters; it works, but then it gets complicated to assign the str back as column names, i.e. multiple operations are needed, which makes it cumbersome.
I tried the following code, which replaces the Swedish characters in the column names with English letters and returns the result as a str.
from unidecode import unidecode
unicodedata.normalize('NFKD',str(df.columns).decode('utf-8')).encode('ascii', 'ignore')
Is there a way to use the returned str as column names for the dataframe? If not, is there a better way to replace the Swedish characters in the column names?
For me, what works is to first normalize, then encode to ascii and finally decode back to utf-8:
df = pd.DataFrame(columns=['aä','åa','oö'])
df.columns = (df.columns.str.normalize('NFKD')
                        .str.encode('ascii', errors='ignore')
                        .to_series()
                        .str.decode('utf-8'))
print (df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
Other solutions, with map or a list comprehension:
import unicodedata
f = lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
df.columns = df.columns.map(f)
print (df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
import unicodedata
df.columns = [unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
for x in df.columns]
print (df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
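Since the question already imports unidecode, a shorter variant (a sketch, not tested against your exact columns) is to map it directly over the column index:
from unidecode import unidecode

df.columns = [unidecode(c) for c in df.columns]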
This might be a lot of manual work when you have lots of columns, but one way to do this is to use str.replace like this:
bänk röund
0 1 3
1 2 4
2 3 5
df.columns = df.columns.str.replace('ä', 'a')
df.columns = df.columns.str.replace('ö', 'o')
bank round
0 1 3
1 2 4
2 3 5
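To cut down on the manual work, the per-character replacements can also be folded into a single translation table (a sketch that assumes only lowercase ä, ö and å need mapping; add the uppercase variants if your columns contain them):
table = str.maketrans("äöå", "aoa")
df.columns = [c.translate(table) for c in df.columns]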