How to remove part of the column name? - pandas

I have a DataFrame with several columns like:
'clientes_enderecos.CEP', 'tabela_clientes.RENDA','tabela_produtos.cod_ramo', 'tabela_qar.chave', etc
I want to rename the columns by removing everything up to and including the dot.
I only know the method df.rename(columns={'A':'a','B':'b'})
But to name them as they are now I used:
df_tabela_clientes.columns = ["tabela_clientes." + str(col) for col in df_tabela_clientes.columns]
How could I reverse the process?

Try rename with lambda and string manipulation:
import pandas as pd
df = pd.DataFrame(columns=['clientes_enderecos.CEP', 'tabela_clientes.RENDA','tabela_produtos.cod_ramo', 'tabela_qar.chave'])
print(df)
#Empty DataFrame
#Columns: [clientes_enderecos.CEP, tabela_clientes.RENDA, tabela_produtos.cod_ramo, tabela_qar.chave]
#Index: []
dfc = df.rename(columns=lambda x: x.split('.')[-1])
print(dfc)
#Empty DataFrame
#Columns: [CEP, RENDA, cod_ramo, chave]
#Index: []

To get rid of what's on one side of the dot, you can split the column names and keep whichever side you want (use index 1 to keep the name after the dot, as the question asks).
import pandas as pd
df = pd.DataFrame(columns=['clientes_enderecos.CEP', 'tabela_clientes.RENDA','tabela_produtos.cod_ramo', 'tabela_qar.chave'])
df.columns = [name.split('.')[0] for name in df.columns] # 0: before the dot | 1:after the dot
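The same prefix removal can also be written with the vectorized string methods on the column Index. This is a minimal sketch, assuming each name carries a single prefix ending at the first dot; the variable name `stripped` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(columns=['clientes_enderecos.CEP', 'tabela_clientes.RENDA',
                           'tabela_produtos.cod_ramo', 'tabela_qar.chave'])

# Split each label once at the first dot and keep the last piece.
# Labels without a dot are left unchanged.
df.columns = df.columns.str.split('.', n=1).str[-1]

stripped = list(df.columns)
print(stripped)
```

Splitting only once (n=1) means a name such as 'a.b.c' would become 'b.c' rather than 'c'; drop n=1 if only the final segment should survive.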

Related

Pandas sort column names by first character after delimiter

I want to sort the columns in a df based on the first letter after the delimiter '_'
df.columns = ['apple_B','cat_A','dog_C','car_D']
df.columns.sort_values(by=df.columns.str.split('-')[1])
TypeError: sort_values() got an unexpected keyword argument 'by'
df.sort_index(axis=1, key=lambda s: s.str.split('-')[1])
ValueError: User-provided `key` function must not change the shape of the array.
Desired columns would be:
'cat_A','apple_B','dog_C','car_D'
Many thanks!
I needed to sort the index names and then rename the columns accordingly:
sorted_index = sorted(df.index, key=lambda s: s.split('_')[1])
# reorder index
df = df.loc[sorted_index]
# reorder columns
df = df[sorted_index]
Use sort_index with the extracted part of the string as key:
df.sort_index(axis=1, key=lambda s: s.str.extract(r'_(\w+)', expand=False))
Output columns:
[cat_A, apple_B, dog_C, car_D]
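The ValueError in the question comes from the key function returning a single element instead of one value per column. The sketch below, using made-up one-row data, shows a key that keeps the shape by taking the suffix of every label with .str[1]; it assumes a pandas version where sort_index accepts key (1.1 or later).

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=['apple_B', 'cat_A', 'dog_C', 'car_D'])

# The key receives the whole column Index and must return one value per label.
# .str.split('_') yields a list per label; .str[1] picks the suffix from each list.
df_sorted = df.sort_index(axis=1, key=lambda idx: idx.str.split('_').str[1])

sorted_columns = list(df_sorted.columns)
print(sorted_columns)
```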
You can do:
df.columns = ['apple_B','cat_A','dog_C','car_D']
new_cols = sorted(df.columns, key=lambda s: s.split('_')[1])  # each s is a plain str here
df = df[new_cols]

Allowing Python to import CSV with duplicate column names

I have a DataFrame with 109 columns in total, some of which share the same name.
When I import the data using read_csv, it adds ".1", ".2" to the duplicate names.
Is there any way around it?
I have tried this:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding = "ISO-8859-1",
sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
But it changed the DataFrame and wasn't helpful.
[Screenshots: the data as loaded in Python and the same data in Excel]
Remove header=None, because it prevents the first row of the file from being used as df.columns. Then strip the trailing dot and digits from the column names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding="ISO-8859-1", sep=',')
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
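To see the effect without the original file, here is a small sketch that reads an in-memory CSV with a repeated header. The CSV text and the names `mangled` and `restored` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV with the header 'a' appearing twice.
csv_text = "a,b,a\n1,2,3\n"

df = pd.read_csv(io.StringIO(csv_text))
mangled = list(df.columns)   # read_csv renames the second 'a' to 'a.1'

# Strip a trailing ".<digits>" suffix to get the duplicate names back.
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
restored = list(df.columns)

print(mangled, restored)
```

Note that this pattern would also strip a genuine column name that happens to end in a dot followed by digits, and that selecting a duplicated label afterwards returns a DataFrame rather than a Series.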

Why pandas does not want to subset given columns in a list

I'm trying to remove certain values with this code, but pandas does not let me and instead raises
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?','Nan''n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
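Try:
pattern = '|'.join(map(re.escape, prohibited_symbols))  # needs `import re`; escapes '?' and '.'
df = df[~df[columns_df].astype(str).apply(lambda col: col.str.contains(pattern)).any(axis=1)]
The regex operator '|' matches any of your prohibited symbols, so rows containing one of them in any listed column are removed. The .str accessor only exists on a Series, which is why it is applied column by column.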
Because what you are trying is not doing what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
The line above does not test each cell against the prohibited symbols. != performs a plain element-wise comparison, so pandas tries to align the list with the 10 selected columns, and because the list has only 2 items (the missing comma in 'Nan''n.a' merges the two strings into one) it raises the ValueError from the question. Even with a valid boolean mask, that syntax would not delete anything; it would only set the cells where the mask is False to NaN.
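You'll have to use a for loop and clean every column like this (with `import re` at the top, since '?' and '.' are regex metacharacters and must be escaped):
pattern = '|'.join(map(re.escape, prohibited_symbols))
for column in columns_df:
    if df[column].dtype == object:  # the .str accessor only works on text columns
        df[column] = df[column].str.replace(pattern, '', regex=True)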
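If the goal is to drop rows where a cell is exactly one of the prohibited values, rather than to blank out matching substrings, isin avoids regular expressions altogether. A minimal sketch on made-up data (the two columns and their values are hypothetical, not taken from the original CSV):

```python
import pandas as pd

# Hypothetical sample standing in for the automobile data.
df = pd.DataFrame({
    'company': ['audi', '?', 'bmw'],
    'price': ['100', '200', 'n.a'],
})
prohibited_symbols = ['?', 'Nan', 'n.a']

# True for every row where any checked cell equals a prohibited value.
mask = df[['company', 'price']].isin(prohibited_symbols).any(axis=1)
clean = df[~mask]

print(clean)
```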
You can also specify the values you consider null with the na_values argument when reading the data, and then use dropna from pandas.
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?', 'Nan', 'n.a'])
df = df.dropna()
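The na_values approach can be tried without the original file by reading from an in-memory string. The CSV content below is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text with two placeholder values for missing data.
csv_text = "company,price\naudi,100\n?,200\nbmw,n.a\n"

# '?' and 'n.a' are parsed as NaN, so dropna removes the rows containing them.
df = pd.read_csv(io.StringIO(csv_text), na_values=['?', 'Nan', 'n.a'])
clean = df.dropna()

print(clean)
```

A side effect worth noting: because 'n.a' becomes NaN at parse time, the price column is read as numeric instead of text.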

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply value_counts method to a Dataframe based on the columns selected dynamically in the Streamlit app
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby method to this output in the Streamlit app
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which has no value_counts method in pandas versions before 1.1 (DataFrame.value_counts was added in 1.1.0).
Can you try st.table(new_df[col_name].value_counts())
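A minimal sketch of the Series-versus-DataFrame distinction described above, outside Streamlit. The sample data and variable names are made up for illustration; the resulting counts_df is the kind of object one could pass to st.table.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': list('xxyyy')})

# Selecting one column label gives a Series, which has value_counts().
counts = df['a'].value_counts()

# Turn the counts into a two-column DataFrame for display.
counts_df = counts.rename_axis('a').reset_index(name='count')

print(counts_df)
```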
I think the error occurs because value_counts() is applicable to a Series and not to a DataFrame.
You can convert the value_counts() output to a DataFrame (see "Converting .value_counts output to dataframe").
If you want to apply it to a single column:
def value_counts_df(df, col):
    """
    Returns df[col].value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply for all object columns in the dataframe
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the DataFrame in your Streamlit app.

Assign dataframes in a list to a list of names; pandas

I have a variable
var=[name1,name2]
I also have dataframes in a list
df= [df1, df2]
How do I assign df1 to name1, df2 to name2, and so on?
If I understand correctly, assuming the lengths of both lists are the same you just iterate over the indices of both lists and just assign them, example:
In [412]:
name1,name2 = None,None
var=[name1,name2]
df1, df2 = 1,2
df= [df1, df2]
for x in range(len(var)):
    var[x] = df[x]
var
Out[412]:
[1, 2]
If your variable list is storing strings then I would not make variables from those strings (see How do I create a variable number of variables?) and instead create a dict:
In [414]:
var=['name1','name2']
df1, df2 = 1,2
df= [df1, df2]
d = dict(zip(var,df))
d
Out[414]:
{'name1': 1, 'name2': 2}
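The same dict approach with actual DataFrames instead of integers might look like the sketch below. The frame contents and the names `names`, `frames`, and `frames_by_name` are placeholders chosen for illustration.

```python
import pandas as pd

names = ['name1', 'name2']
frames = [pd.DataFrame({'x': [1]}), pd.DataFrame({'x': [2]})]

# zip silently truncates to the shorter list, so check the lengths first.
if len(names) != len(frames):
    raise ValueError("names and frames must have the same length")

# Map each name to its DataFrame instead of creating separate variables.
frames_by_name = dict(zip(names, frames))

print(frames_by_name['name2'])
```

Looking a frame up by key, as in frames_by_name['name2'], then takes the place of a dynamically created variable.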
To answer your question, you can do this by:
for i in zip(var, df):
    globals()[i[0]] = i[1]
And then access your variables.
But proceeding this way is bad practice: it injects names into your global namespace that you no longer control. It's better to keep track of what you handle by keeping your dataframes in a list or dictionary.