Pandas: get list of columns if column name contains

I have written this code to list the column names in a dataframe that contain 'a', 'b', 'c' or 'd'.
I then want to trim the first 3 characters of the column name for these columns.
However, it raises an error. Is there something wrong with the code?
ind_cols= [x for x in df if df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]]
df[ind_cols].columns=df[ind_cols].columns.str[3:]

Use a list comprehension with if-else:
L = df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]
df.columns = [x[3:] if x in L else x for x in df.columns]
Another solution uses numpy.where with a boolean mask:
import numpy as np
m = df.columns.str.contains('|'.join(['a','b','c','d']))
df.columns = np.where(m, df.columns.str[3:], df.columns)
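To see the mask-based rename end to end, here is a minimal sketch on a made-up frame. The column names are hypothetical and chosen so that two of them contain one of the letters and two do not:

```python
import numpy as np
import pandas as pd

# Hypothetical column names, for illustration only
df = pd.DataFrame(columns=["xx_alpha", "yy_beta", "zzz", "key"])

# True where the name contains any of a, b, c, d
m = df.columns.str.contains("a|b|c|d")

# Drop the first 3 characters only where the mask is True
df.columns = np.where(m, df.columns.str[3:], df.columns)

print(list(df.columns))
```

Note that str.contains treats the pattern as a regular expression by default, so 'a|b|c|d' matches the letters anywhere in the name, not only at the start.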

Related

How do I split a string and ignore the rest for all columns in pandas?

I have a dataframe where some rows contain multiple values separated by a "&" and I'd like to keep only the first value in those rows.
example :
A    B
1&2  3
3    5&6
my desired output is :
A    B
1    3
3    5
here's my code :
df.apply(lambda x: i[0] for i in x.split("&"))
I get the following error:
NameError: name 'x' is not defined
How is "x" not defined?
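Use DataFrame.applymap (in pandas 2.1 and later it is deprecated in favor of DataFrame.map, which takes the same function):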
df = df.applymap(lambda x: x.split("&")[0] if isinstance(x, str) else x)
If you need to skip numeric columns and process only string columns, or columns that mix strings and numbers:
cols = df.select_dtypes(object).columns
df[cols] = df[cols].applymap(lambda x: x.split("&")[0] if isinstance(x, str) else x)
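As for the NameError in the question: without parentheses, `lambda x: i[0] for i in x.split("&")` is parsed as a generator expression whose items are lambdas, so `x.split("&")` is evaluated outside any lambda, where no `x` exists. A minimal alternative sketch, assuming every column holds only strings (the sample frame below is made up to match the question), uses the vectorized string accessor column by column:

```python
import pandas as pd

# Sample data assumed from the question's example
df = pd.DataFrame({"A": ["1&2", "3"], "B": ["3", "5&6"]})

# For each column, split on "&" and keep the first piece
first = df.apply(lambda col: col.str.split("&").str[0])

print(first)
```

With non-string values in a column, the .str accessor yields NaN for those cells, so the isinstance check used in the answer above is the safer choice for mixed data.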

Pandas Assign Dataframe name by value in list

I have a list l=['x','y']
I want to make 2 blank dataframes called x and y by a loop from the list l.
So something like this
for v in l:
    v=pd.DataFrame()
You could try something like,
l = []
length = 2
for i in range(length):
    l.append(pd.DataFrame())
Or if you really want to modify the initial list with strings,
l = ['x', 'y']
for i in range(len(l)):
    l[i] = pd.DataFrame()
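Neither loop above keeps the names 'x' and 'y' attached to the frames. If looking the frames up by name matters, one common alternative (not what the answer above does) is a dictionary keyed by the strings. A minimal sketch, where `names` and `frames` are illustrative variable names:

```python
import pandas as pd

names = ["x", "y"]

# One empty DataFrame per name, retrievable by that name
frames = {name: pd.DataFrame() for name in names}

print(frames["x"].empty)
```

This avoids creating variables dynamically, and `frames["x"]` can be filled or replaced like any other DataFrame.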

Dataframe: renaming multiple columns with the same name

I have a dataframe with several columns with almost the same name and a number in the end (Hora1, Hora2, ..., Hora12).
I would like to change all column names to GAx, where x is a different number (GA01.0, GA01.1, ...).
There are many ways to achieve this; here is one:
df.columns = [col.replace('Hora', 'GA01.') for col in df.columns]
You can rename the columns by passing a list of column names:
columns = ['GA01.0','GA01.1']  # one name per column, in order
df.columns = columns
You can try:
import re
df.columns = [re.sub('Hora', 'GA01.', x) for x in df.columns]
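The replacements above keep the original number, so Hora1 becomes GA01.1. If the numbering should instead start at GA01.0, as the question's example suggests, the number has to be shifted. A minimal sketch under that assumption; the helper `rename` and the sample frame are hypothetical:

```python
import re
import pandas as pd

# Sample frame with the columns described in the question
df = pd.DataFrame(columns=[f"Hora{i}" for i in range(1, 13)])

def rename(col):
    # Match names like "Hora7"; leave anything else unchanged
    m = re.fullmatch(r"Hora(\d+)", col)
    return f"GA01.{int(m.group(1)) - 1}" if m else col

df = df.rename(columns=rename)

print(list(df.columns))
```

Passing a function to rename leaves non-matching columns alone, which is safer than overwriting df.columns with a hand-written list.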

How to replace Swedish characters ä, å, ö in columns names in python?

I have a dataframe with some columns names having Swedish characters (ö,ä,å). I would like to replace these characters with simple o,a,a instead.
I tried to convert the columns names to str and replace the characters, it works but then it gets complicated if I want to assign back the str as columns names, i.e., there are multiple operations needed which makes it complicated.
I tried the following code which replaces the Swedish characters in columns names with the English alphabets and returns the result as str.
from unidecode import unidecode
unicodedata.normalize('NFKD',str(df.columns).decode('utf-8')).encode('ascii', 'ignore')
Is there a way to use the returning str as columns names for the dataframe? If not, then is there a better way to replace the Swedish characters in columns names?
This works for me: first normalize, then encode to ASCII, and finally decode back to UTF-8:
df = pd.DataFrame(columns=['aä','åa','oö'])
df.columns = (df.columns.str.normalize('NFKD')
                .str.encode('ascii', errors='ignore')
                .to_series()
                .str.decode('utf-8'))
print (df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
Other solutions use map or a list comprehension:
import unicodedata
f = lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
df.columns = df.columns.map(f)
print (df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
import unicodedata
df.columns = [unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
              for x in df.columns]
print (df)
Empty DataFrame
Columns: [aa, aa, oo]
Index: []
This can be a lot of manual work when you have many columns, but one way to do it is with str.replace:
   bänk  röund
0     1      3
1     2      4
2     3      5
df.columns = df.columns.str.replace('ä', 'a')
df.columns = df.columns.str.replace('ö', 'o')
   bank  round
0     1      3
1     2      4
2     3      5
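When only a handful of characters matter, the repeated str.replace calls can be collapsed into a single translation table. A minimal sketch, assuming the sample frame shown above and that the names use precomposed characters; the mapping and the name `table` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"bänk": [1, 2, 3], "röund": [3, 4, 5]})

# Map each Swedish letter to its plain ASCII counterpart
table = str.maketrans({"ä": "a", "å": "a", "ö": "o",
                       "Ä": "A", "Å": "A", "Ö": "O"})

# Apply the table to every column name in one pass
df.columns = df.columns.str.translate(table)

print(list(df.columns))
```

Unlike the normalize-and-encode approach, this changes only the characters listed in the table and leaves every other non-ASCII character in place.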

Pandas: Selecting rows by list

I tried following code to select columns from a dataframe. My dataframe has about 50 values. At the end, I want to create the sum of selected columns, create a new column with these sum values and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: I don't want to write it out as
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns; you can then do df['sum_1'] = df[columns_selected].sum(axis=1).
To filter the df to just the columns of interest, pass a list of the columns: df = df[columns_selected]. Note that a common error is to pass the strings without the list brackets, df = df['a','b','c'], which raises a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would have worked. First, you needed columns, not column. Second, you can use the boolean mask against the columns by passing it to loc as the column selection argument (older answers use ix, which was removed in pandas 1.0):
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028
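The question also asks to delete the selected columns once the sum is stored. A minimal sketch of the full sequence, using a small made-up frame; the column Z stands in for the columns that should remain:

```python
import pandas as pd

# Made-up data; "Z" represents the columns that are kept
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "Z": [7, 8]})
columns_selected = ["A", "B", "C"]

# Row-wise sum of the selected columns
df["sum_1"] = df[columns_selected].sum(axis=1)

# Remove the columns that went into the sum
df = df.drop(columns=columns_selected)

print(df)
```

The sum has to be computed before the drop, since drop returns a frame without those columns.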