Pandas: Selecting rows by list

I tried the following code to select columns from a dataframe. My dataframe has about 50 values. In the end, I want to sum the selected columns, store the sums in a new column, and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it raised AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: I don't want to write it out as
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.

Use df[columns_selected] to sub-select the dataframe by a list of columns.
You can then do df['sum_1'] = df[columns_selected].sum(axis=1)
To filter the dataframe down to just the columns of interest, pass a list of the column names: df = df[columns_selected]. Note that it's a common error to pass the names without wrapping them in a list: df = df['a','b','c'], which raises a KeyError.
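A minimal sketch of the difference between the two spellings, using a small made-up dataframe (the column names and values are illustrative only):

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A list of labels inside the brackets selects those columns.
subset = df[["a", "b"]]

# Without the inner list, pandas receives the tuple ('a', 'b') as a single
# column label, and no column with that label exists here.
try:
    df["a", "b"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(list(subset.columns))
print(raised_key_error)
```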
Note that you had a typo in your original attempt. This would have worked:
df = df.loc[:, df.columns.isin(columns_selected)]
First, you needed columns, not column. Second, you can use the boolean mask against the columns by passing it to loc as the column selection argument (older pandas versions also accepted ix here, but ix was removed in pandas 1.0):
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
          a         b         c         d         e
0 -0.778207  0.480142  0.537778 -1.889803 -0.851594
1  2.095032  1.121238  1.076626 -0.476918 -0.282883
2  0.974032  0.595543 -0.628023  0.491030  0.171819
3  0.983545 -0.870126  1.100803  0.139678  0.919193
4 -1.854717 -2.151808  1.124028  0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
          a         b         c
0 -0.778207  0.480142  0.537778
1  2.095032  1.121238  1.076626
2  0.974032  0.595543 -0.628023
3  0.983545 -0.870126  1.100803
4 -1.854717 -2.151808  1.124028
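Putting the pieces together for the original goal (sum the selected columns, store the result, then drop the originals), a minimal sketch could look like this. The data and the extra column name "other" are made up for illustration:

```python
import pandas as pd

# Made-up data; "other" stands in for the columns that should be kept.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "other": [7, 8]})
columns_selected = ["A", "B", "C"]

# Row-wise sum of the selected columns, stored in a new column.
df["sum_1"] = df[columns_selected].sum(axis=1)

# Remove the columns that were summed.
df = df.drop(columns=columns_selected)

print(df)
```

Note that sum(axis=1) skips NaN values by default, so a row with missing entries still gets a sum of the values that are present.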

Related

My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe. How do I do that?

I used query(), but it raised:
TypeError: query() takes from 2 to 3 positional arguments but 27 were given
If you want to select the columns based on their name, you can do the following:
df_new = df[["colA", "colB", "colC", ...]]
or use the "filter" function:
df_new = df.filter(["colA", "colB", "colC", ..])
In case that your column selection is based on the index of columns:
df_new = df.iloc[:, 0:27] # if columns are consecutive
df_new = df.iloc[:, [0,2,10,..]] # if columns are not consecutive (the numbers refer to the column indices)
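One practical difference between the first two options, sketched on a tiny made-up dataframe: indexing with a list requires every label to exist, while filter with items keeps only the labels that are present. The label "colZ" is a hypothetical missing column used for illustration:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"colA": [1], "colB": [2], "colC": [3]})
wanted = ["colA", "colC", "colZ"]  # "colZ" does not exist in df

# filter(items=...) keeps only the requested labels that are present.
kept = df.filter(items=wanted)

# Plain list indexing insists that every requested label exists.
try:
    df[wanted]
    strict_failed = False
except KeyError:
    strict_failed = True

print(list(kept.columns))
print(strict_failed)
```

Which behaviour is preferable depends on whether a missing column should be treated as an error or silently skipped.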

How do I use df.add_suffix to add suffixes to duplicate column names in Pandas?

I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I use df.add_suffix to add a suffix only to the duplicate column names?
Or is there a better way to do it automatically?
Here is my solution, starting with:
df=pd.DataFrame(np.arange(4).reshape(1,-1),columns=['a','b','a','b'])
Output
   a  b  a  b
0  0  1  2  3
Then I use a lambda function:
df.columns = df.columns + np.vectorize(lambda x: '_' if x else '')(df.columns.duplicated())
Output
   a  b  a_  b_
0  0  1   2   3
If a name occurs more than twice, you can repeat this step until no duplicates are left. This works for duplicated indices too, and it also keeps the index name.
If I understand your question correctly, you have each name twice. If so, you can find the duplicated names with df.columns.duplicated(). Then you can build a new list that modifies only the duplicated names by adding your self-defined suffix. This is different from the other posted solution, which modifies all entries.
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'
df.columns = [name + my_suffix if duplicated else name for duplicated, name in zip(df.columns.duplicated(), df.columns)]
df
>>>
   a  aT  b  bT
0  1   2  3   4
My answer has the disadvantage that the dataframe can have duplicated column names if one name is used three or more times.
You could do:
import pandas as pd
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')
print(df)
Output
   a0  a1  a2
0   1   2   3
If each column name has at most one duplicate, you could do:
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)
print(df)
Output (for only one duplicate)
   a0  a1  b0  b1
0   1   2   3   4
Add a numbered suffix, starting at '_1' with the first duplicate; only names that appear more than once are affected.
E.g. the column name list [a, b, c, a, b, a] becomes [a, b, c, a_1, b_1, a_2].
from collections import Counter
counter = Counter()
empty_list = []
for x in range(df.shape[1]):
    counter.update([df.columns[x]])
    if counter[df.columns[x]] == 1:
        empty_list.append(df.columns[x])
    else:
        tx = counter[df.columns[x]] - 1
        empty_list.append(df.columns[x] + '_' + str(tx))
df.columns = empty_list
df.columns
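The same numbering scheme can also be sketched without an explicit counter, by letting groupby(...).cumcount() number each occurrence of a name. This is an alternative formulation, not the approach from the answers above, and the example column list is made up:

```python
import pandas as pd

# Made-up frame whose column names repeat: a, b, c, a, b, a
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list("abcaba"))

names = pd.Series(df.columns)
# cumcount numbers each occurrence within its group of equal names: 0, 1, 2, ...
occurrence = names.groupby(names).cumcount()

# Leave the first occurrence untouched; later ones get "_<n>" appended.
df.columns = [
    name if n == 0 else f"{name}_{n}"
    for name, n in zip(names, occurrence)
]

print(list(df.columns))
```

One caveat that applies to this sketch as well as to the counter loop: if the frame already contains a column literally named, say, a_1, the generated names can still collide.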

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on columns selected dynamically in a Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays the data for the selected columns. I would like to know how I could apply a value_counts/groupby method to this output in the Streamlit app.
If I try the following
st.table(new_df.value_counts())
I get this error:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column name in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which had no value_counts method before pandas 1.1.
Can you try st.table(new_df[col_name].value_counts())?
I think the error occurs because value_counts() is applicable to a Series and not to a DataFrame.
You can try converting the value_counts output to a DataFrame.
If you want to apply it to a single column:
def value_counts_df(df, col):
    """
    Return df[col].value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                       dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the resulting dataframe in your Streamlit app.
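If the goal is to count combinations of values across several selected columns at once, one option that does not depend on DataFrame.value_counts is groupby(...).size(). The following is a sketch only: count_combinations is a hypothetical helper name, the data is made up, and the Streamlit calls are left out so the snippet runs on its own. Its result could be passed to st.table as described above.

```python
import pandas as pd

def count_combinations(df, columns):
    """Count how often each combination of values in `columns` occurs."""
    columns = list(columns)
    if not columns:
        # Nothing selected yet (e.g. an empty multiselect): return an empty table.
        return pd.DataFrame(columns=["count"])
    # size() counts rows per group; reset_index turns the group keys into columns.
    counts = df.groupby(columns).size().reset_index(name="count")
    return counts.sort_values("count", ascending=False).reset_index(drop=True)

# Made-up data for illustration only.
df = pd.DataFrame({"city": ["x", "x", "y"], "kind": ["a", "a", "b"]})
result = count_combinations(df, ["city", "kind"])
print(result)
```

Note that groupby drops rows whose key columns contain NaN by default, so missing values are not counted in this sketch.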

Pandas get list of columns if columns name contains

I have written this code to list the column names in a dataframe that contain 'a', 'b', 'c' or 'd'.
I then want to trim the first 3 characters from the names of these columns.
However, it raises an error. Is there something wrong with the code?
ind_cols= [x for x in df if df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]]
df[ind_cols].columns=df[ind_cols].columns.str[3:]
Use a list comprehension with if-else:
L = df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]
df.columns = [x[3:] if x in L else x for x in df.columns]
Another solution uses numpy.where with a boolean mask:
m = df.columns.str.contains('|'.join(['a','b','c','d']))
df.columns = np.where(m, df.columns.str[3:], df.columns)
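A third variant, sketched here as an alternative to the two answers above, passes a function to DataFrame.rename so the renaming rule is applied to each column name in turn. The column names are made up for illustration. As with str.contains, the pattern matches anywhere in the name, not only at the start:

```python
import re
import pandas as pd

# Made-up column names: the first two contain one of the letters, the third does not.
df = pd.DataFrame([[1, 2, 3]], columns=["xx_alpha", "yy_beta", "zzz"])

pattern = re.compile("|".join(["a", "b", "c", "d"]))

# rename calls the function once per column name and uses the return value.
df = df.rename(columns=lambda name: name[3:] if pattern.search(name) else name)

print(list(df.columns))
```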

pd.dataframe.apply() create multiple new columns

I have a bunch of files where I want to open, read the first line, parse it into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file-opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
  filenames
0        Aa
1        Bb
2        Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
   0  1
0  A  a
1  B  b
2  C  c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
  filenames first second
0        Aa     A      a
1        Bb     B      b
2        Cc     C      c
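Building one Series per row can become slow when there are many rows. As an alternative sketch, the parsing function can return a plain tuple and the new columns can be built in one step from the list of tuples. Here parse_name is a hypothetical stand-in for the real function that opens a file and parses its first line:

```python
import pandas as pd

def parse_name(name):
    # Hypothetical stand-in for "open the file and parse its first line".
    return name[0], name[1]

df = pd.DataFrame({"filenames": ["Aa", "Bb", "Cc"]})

# One tuple per row becomes one row of the new two-column frame.
parsed = pd.DataFrame(
    [parse_name(name) for name in df["filenames"]],
    columns=["first", "second"],
    index=df.index,  # keep the rows aligned with the original frame
)

df = df.join(parsed)
print(df)
```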