How to replace Swedish characters ä, å, ö in column names in Python? - pandas

I have a dataframe with some column names containing Swedish characters (ö, ä, å). I would like to replace these characters with plain o, a, a instead.
I tried converting the column names to str and replacing the characters. That works, but assigning the str back as column names then gets complicated, i.e., multiple operations are needed.
I tried the following code (I also looked at the unidecode package), which replaces the Swedish characters in the column names with ASCII letters and returns the result as a str:
import unicodedata
unicodedata.normalize('NFKD', str(df.columns)).encode('ascii', 'ignore').decode('utf-8')
Is there a way to use the returned str as column names for the dataframe? If not, is there a better way to replace the Swedish characters in the column names?

What works for me: first normalize, then encode to ASCII (ignoring characters that cannot be encoded), and finally decode back to utf-8:
df = pd.DataFrame(columns=['aä', 'åa', 'oö'])
df.columns = (df.columns.str.normalize('NFKD')
                        .str.encode('ascii', errors='ignore')
                        .to_series()
                        .str.decode('utf-8'))
print(df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
Alternative solutions with map or a list comprehension:
import unicodedata
f = lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
df.columns = df.columns.map(f)
print(df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
import unicodedata
df.columns = [unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
              for x in df.columns]
print(df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
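If a third-party dependency is acceptable, the unidecode package mentioned in the question transliterates directly; a minimal sketch (unidecode also handles characters that NFKD cannot decompose, e.g. 'ø'):
from unidecode import unidecode

# transliterate every column name to its closest ASCII equivalent
df.columns = [unidecode(c) for c in df.columns]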

This might be a lot of manual work when you have many columns, but one way to do this is to use str.replace like this:
   bänk  röund
0     1      3
1     2      4
2     3      5
df.columns = df.columns.str.replace('ä', 'a')
df.columns = df.columns.str.replace('ö', 'o')
   bank  round
0     1      3
1     2      4
2     3      5
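If only a fixed handful of characters needs mapping, a single translation table avoids chaining one str.replace call per character. A minimal sketch, assuming ä, å, ö (plus their upper-case forms) are the only characters to replace:
swedish_map = str.maketrans('äåöÄÅÖ', 'aaoAAO')  # build the table once
df.columns = [c.translate(swedish_map) for c in df.columns]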

Related

How to drop pandas dataframe columns containing special characters

How do I drop pandas dataframe columns that contain special characters such as # / ] [ } { - _ etc.?
For example, I have the following dataframe (called df):
I need to drop the columns Name and Matchkey because they contain some special characters.
Also, how can I specify a list of special characters based on which the columns will be dropped?
For example, I'd like to drop the columns that contain (in any record, in any cell) any of the following special characters:
listOfSpecialCharacters: ¬,`,!,",£,$,£,#,/,\
One option is to use a regex with str.contains and apply, then use boolean indexing to drop the columns:
import re
chars = '¬`!"£$£#/\\'
regex = f'[{"".join(map(re.escape, chars))}]'
# '[¬`!"£\\$£\\#/\\\\]'
df2 = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]
example:
# input
     A    B    C
0  123  12!  123
1  abc  abc  a¬b

# output
     A
0  123
1  abc
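Note that .str.contains raises an error on non-string columns. A hedged variant that tests only object-dtype columns, so numeric columns are always kept:
# test only string (object) columns; na=False counts missing values as "no match"
str_cols = df.select_dtypes(include='object')
bad = str_cols.apply(lambda c: c.str.contains(regex, na=False).any())
df2 = df.drop(columns=bad[bad].index)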

Pandas dataframe replace contents based on ID from another dataframe

This is what my main dataframe looks like:
Group  IDs           New ID
1      [N23,N1,N12]  N102
2      [N134,N100]   N501
I have another dataframe that has all the required ID info, in no particular order:
ID   Name   Age
N1   Milo   5
N23  Mark   21
N11  Jacob  22
I would like to modify the original dataframe such that all IDs are replaced with their respective names obtained from the other file, so that the dataframe has only names and no IDs and looks like this:
Group  IDs               New ID
1      [Mark,Silo,Bond]  Niki
2      [Troy,Fangio]     Kvyat
Thanks in advance
IIUC, you can .explode your lists, replace the values with .map, and regroup them with .groupby:
df['IDs'] = (df['IDs'].explode()
                      .map(df1.set_index('ID')['Name'])
                      .groupby(level=0).agg(list))
If the New ID column is not a list, you can use .map() alone:
df['New ID'] = df['New ID'].map(df1.set_index('ID')['Name'])
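A minimal end-to-end sketch of both steps, assuming the IDs column already holds real Python lists and every ID appears in the lookup frame:
import pandas as pd

df = pd.DataFrame({'Group': [1, 2],
                   'IDs': [['N23', 'N1', 'N11'], ['N11', 'N23']],
                   'New ID': ['N23', 'N1']})
df1 = pd.DataFrame({'ID': ['N1', 'N23', 'N11'],
                    'Name': ['Milo', 'Mark', 'Jacob']})

name_by_id = df1.set_index('ID')['Name']  # ID -> Name lookup
df['IDs'] = df['IDs'].explode().map(name_by_id).groupby(level=0).agg(list)
df['New ID'] = df['New ID'].map(name_by_id)
print(df)
#    Group                  IDs New ID
# 0      1  [Mark, Milo, Jacob]   Mark
# 1      2        [Jacob, Mark]   Milo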
You can try making a dict from your second df and then replacing on the first using regex patterns (no need to fully understand them, check the comments below):
PS: since you didn't provide the full df with the codes, I created one with some of them, which is why the print() won't replace all the values.
import pandas as pd
# creating dummy dfs
df1 = pd.DataFrame({"Group":[1,2], "IDs":["[N23,N1,N12]", "[N134,N100]"], "New ID":["N102", "N501"] })
df2 = pd.DataFrame({"ID":['N1', "N23", "N11", "N100"], "Name":["Milo", "Mark", "Jacob", "Silo"], "Age":[5,21,22, 44]})
# create the lookup dict; the regex patterns below make the match exact
dict_replace = df2.set_index("ID")['Name'].to_dict()
# 'fr' prefix: 'f' builds an f-string, 'r' makes it a raw string so \b reaches the regex engine
# \b is a regex word boundary marking the beginning and end of the match,
# so that searching for N1 will not match inside N11
dict_replace = {fr"\b{k}\b": v for k, v in dict_replace.items()}
# replace on the original column; plain assignment instead of inplace=True
# avoids chained-assignment pitfalls
df1['IDs'] = df1['IDs'].replace(dict_replace, regex=True)
print(df1['IDs'].tolist())
# >>> ['[Mark,Milo,N12]', '[N134,Silo]']
Please note the change in my dataframes: in your sample data there are IDs in df that do not exist in df1, so I altered my df to ensure only IDs present in df1 were represented. I use the following dataframes:
print(df)
   Group           IDs   New
0      1  [N23,N1,N11]  N102
1      2     [N11,N23]  N501

print(df1)
    ID   Name  Age
0   N1   Milo    5
1  N23   Mark   21
2  N11  Jacob   22
Solution
Build a dict from df1.ID and df1.Name, map it onto an exploded df.IDs, and aggregate the result back into lists.
df['IDs'] = df['IDs'].str.strip('[]')  # strip the square brackets
df['IDs'] = df['IDs'].str.split(',')   # rebuild a real list (the bracketed string cannot be exploded directly)
# explode the lists, map df1 onto df, and aggregate back into lists
df.explode('IDs').groupby('Group')['IDs'].apply(lambda x: x.map(dict(zip(df1.ID, df1.Name))).tolist()).reset_index()
   Group                  IDs
0      1  [Mark, Milo, Jacob]
1      2        [Jacob, Mark]

Pandas: get list of columns if column name contains a given substring

I have written this code to show a list of column names in a dataframe if they contain 'a', 'b', 'c' or 'd'.
I then want to trim the first 3 characters of the column name for these columns.
However, it's showing an error. Is there something wrong with the code?
ind_cols= [x for x in df if df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]]
df[ind_cols].columns=df[ind_cols].columns.str[3:]
Use a list comprehension with if-else:
L = df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]
df.columns = [x[3:] if x in L else x for x in df.columns]
Another solution, with numpy.where and a boolean mask:
import numpy as np
m = df.columns.str.contains('|'.join(['a', 'b', 'c', 'd']))
df.columns = np.where(m, df.columns.str[3:], df.columns)
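An equivalent sketch using DataFrame.rename with a callable, assuming the same rule (trim the first 3 characters of any name containing one of the substrings):
import re

pattern = re.compile('|'.join(['a', 'b', 'c', 'd']))
df = df.rename(columns=lambda c: c[3:] if pattern.search(c) else c)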

Iterating over a dictionary of empty pandas dataframes to fill them with data from an existing dataframe based on lists of column names

I'm a biologist and very new to Python (I use v3.5) and pandas. I have a pandas dataframe (df) from which I need to make several dataframes (df1 ... dfn) that can be placed in a dictionary (dictA), which currently holds the correct number (n) of empty dataframes. I also have a dictionary (dictB) of n (individual) lists of column names that were extracted from df. The keys of the two dictionaries match. I'm trying to fill the empty dfs within dictA with parts of df based on the column names in the lists in dictB.
import pandas as pd
listA=['A', 'B', 'C',...]
dictA={i:pd.DataFrame() for i in listA}
lets say I have something like this:
dictA={'A': df1, 'B': df2}
dictB={'A': ['A1', 'A2', 'A3'],
       'B': ['B1', 'B2']}
df=pd.DataFrame({'A1': [0,2,4,5],
                 'A2': [2,5,6,7],
                 'A3': [5,6,7,8],
                 'B1': [2,5,6,7],
                 'B2': [1,3,5,6]})
listA=['A', 'B']
what I'm trying to get is for df1 and df2 to be filled with portions of df, so that the output for df1 is like this:
   A1  A2  A3
0   0   2   5
1   2   5   6
2   4   6   7
3   5   7   8
df2 would have columns B1 and B2.
I tried the following loop and some alterations, but it doesn't yield populated dfs:
for key, values in dictA.items():
    values.append(df[dictB[key]])
Thanks and sorry if this was already addressed elsewhere but I couldn't find it.
The loop fails because DataFrame.append returns a new dataframe instead of modifying the frame in place (it has since been removed in pandas 2.0). You could create the dataframes you want like this instead:
df = ...  # your original dataframe containing all the columns
df_A = df[[col for col in df.columns if 'A' in col]]
df_B = df[[col for col in df.columns if 'B' in col]]
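Alternatively, since dictB already maps each key to its list of column names, the whole dictionary can be built in one dict comprehension; a sketch assuming the keys of dictA and dictB match, as stated in the question:
# select the listed columns of df per key; .copy() makes each frame independent of df
dictA = {key: df[cols].copy() for key, cols in dictB.items()}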

pd.DataFrame.apply() to create multiple new columns

I have a bunch of files; for each one I want to open it, read the first line, parse it into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandas-idiomatic/pythonic way (the file opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series; apply will then return a dataframe (each Series becoming a row of that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])

In [30]: df
Out[30]:
  filenames
0        Aa
1        Bb
2        Cc

In [31]: df['filenames'].apply(lambda x: pd.Series([x[0], x[1]]))
Out[31]:
   0  1
0  A  a
1  B  b
2  C  c

This you can then assign to two new columns:

In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x: pd.Series([x[0], x[1]]))

In [34]: df
Out[34]:
  filenames first second
0        Aa     A      a
1        Bb     B      b
2        Cc     C      c
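For completeness, a sketch of the same result without apply: compute all the pairs first and unpack them with zip, which avoids constructing a Series per row and is typically faster. Here parse_first_line is only a stand-in for the real file-reading function:
# stand-in for the real function that opens a file and parses its first line
parse_first_line = lambda name: (name[0], name[1])

df['first'], df['second'] = zip(*df['filenames'].map(parse_first_line))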