How to rename a pandas dataframe column by checking the column's data - pandas

Example df would be:
          a  b  c  d         e
0  SN123456  3  5  7  SN123456
1  SN456123  4  6  8  SN456123
I am wondering how I can rename column 'a' to 'Serial_Number' based on its data -- the values start with 'SN' and have a fixed length of 8.
(We may not know that the column is named 'a', since the data is read from a CSV file, and its position is not known either.)
Also, how can I remove the duplicated column 'e'? It is a complete duplicate of column 'a'.
Any idea on a faster way?
Looping over each column, checking its values, getting its index and renaming it does not seem like a good method.
Thanks!

Here's a rewrite in response to your comment. This will rename + drop in a vectorized fashion.
Given df:
>>> df
          a  b  c  d         e  f  g
0  SN123456  3  5  7  SN123456  0  0
1  SN456123  4  6  8  SN456123  0  0
Create 3 boolean masks of the same length as the columns:
>>> mask1 = df.dtypes == 'object'
>>> mask2 = df.iloc[0].str.len() == 8
>>> mask3 = df.iloc[0].str.startswith('SN')
Use these to identify which columns look like serial numbers. The first will be renamed; the rest will be dropped.
>>> rename, *drop = df.columns[mask1 & mask2 & mask3]
Then rename + drop:
>>> rename
'a'
>>> drop
['e']
>>> df.rename(columns={rename: 'Serial_Number'}).drop(drop, axis=1)
  Serial_Number  b  c  d  f  g
0      SN123456  3  5  7  0  0
1      SN456123  4  6  8  0  0
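For reference, here is the same approach as one self-contained, runnable sketch. The example values and the na=False guard are my additions; na=False keeps the non-string cells in the first row from producing NA in the mask.
import pandas as pd

df = pd.DataFrame({
    'a': ['SN123456', 'SN456123'],
    'b': [3, 4], 'c': [5, 6], 'd': [7, 8],
    'e': ['SN123456', 'SN456123'],
    'f': [0, 0], 'g': [0, 0],
})

# Columns whose first value is an 8-character string starting with 'SN'
mask = (
    (df.dtypes == 'object')
    & (df.iloc[0].str.len() == 8)
    & (df.iloc[0].str.startswith('SN', na=False))
)
rename, *drop = df.columns[mask]

df = df.rename(columns={rename: 'Serial_Number'}).drop(columns=drop)
print(df)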

Related

How do I subset the columns of a dataframe based on the index of another dataframe?

The rows of clin.index (81 rows) are a subset of the columns of common_mrna (151 columns). I want to keep a column of common_mrna only if its name matches one of the row values of the clin dataframe's index.
My code fails to reduce the number of columns in common_mrna to 81.
import numpy as np
import pandas as pd

common_mrna = common_mrna.set_index("Hugo_Symbol")
mrna_val = {}
for colnames, val in common_mrna.iteritems():
    for i, rows in clin.iterrows():
        if [[common_mrna.columns == i] == "TRUE"]:
            mrna_val = np.append(mrna_val, val)
mrna = np.concatenate(mrna_val, axis=0)
common_mrna
Hugo_Symbol    A    B  C  D
First          1    2  3  4
Second         5  row  6  7

clin
  Another header
A             20
D             30

desired output
Hugo_Symbol  A  D
First        1  4
Second       5  7
Try this using reindex:
common_mrna.reindex(clin.index, axis=1)
Output:
A D
First 1 4
Second 5 7
Update, IIUC:
common_mrna.set_index('Hugo_Symbol').reindex(clin.index, axis=1).reset_index()
IIUC, you can select the index values of clin that are found among the columns of common_mrna, then add the first column of common_mrna:
cols = clin.loc[clin.index.isin(common_mrna.columns)].index.tolist()
# or with set
cols = list(sorted(set(clin.index.tolist()) & set(common_mrna.columns), key=common_mrna.columns.tolist().index))
out = common_mrna[['Hugo_Symbol'] + cols]
print(out)
Hugo_Symbol A D
0 First 1 4
1 Second 5 7
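Putting the reindex answer together as a self-contained sketch; the toy values here are assumed for illustration and differ slightly from the tables above.
import pandas as pd

common_mrna = pd.DataFrame({
    'Hugo_Symbol': ['First', 'Second'],
    'A': [1, 5], 'B': [2, 6], 'C': [3, 7], 'D': [4, 8],
})
clin = pd.DataFrame({'Another header': [20, 30]}, index=['A', 'D'])

# Keep only the columns whose names appear in clin's index,
# keeping Hugo_Symbol as an ordinary column in the result.
out = (common_mrna
       .set_index('Hugo_Symbol')
       .reindex(clin.index, axis=1)
       .reset_index())
print(out)
#   Hugo_Symbol  A  D
# 0       First  1  4
# 1      Second  5  8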

How to convert categorical index to normal index

I have the following DataFrame (result of the method unstack):
df = pd.DataFrame(np.arange(12).reshape(2, -1),
columns=pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c']))
df looks like this:
   a  b  c  a   b   c
0  0  1  2  3   4   5
1  6  7  8  9  10  11
When I try df.reset_index() I get the following error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
To bypass this problem I want to convert the columns' index from categorical to a normal one. What is the most straightforward way to do it? Or maybe you have an idea of how to reset the index without converting the columns. I have the following idea:
df.columns = list(df.columns)
The most general approach is converting the columns to a list:
df.columns = df.columns.tolist()
Or if possible, convert them to strings:
df.columns = df.columns.astype(str)
df = df.reset_index()
print (df)
   index  a  b  c  a   b   c
0      0  0  1  2  3   4   5
1      1  6  7  8  9  10  11
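As a runnable end-to-end sketch of the fix above (mirroring the example in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, -1),
                  columns=pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c']))

# df.reset_index() raises TypeError here, because the new 'index' label
# cannot be inserted into a CategoricalIndex of columns.
df.columns = df.columns.astype(str)   # or: df.columns = df.columns.tolist()
df = df.reset_index()                 # now works
print(df)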

Adding new column to an existing dataframe at an arbitrary position [duplicate]

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc=0 will insert at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column with the name you provide already exists in the dataframe, it will raise a ValueError.
You can pass the optional parameter allow_duplicates=True to create a new column with an already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could extract the columns as a list, rearrange it as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line; however, it looks a bit ugly. Maybe a cleaner proposal will come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this (only one line).
You can do it after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words instead of single letters, you need two sets of brackets around the column names (pass a tuple or a plain list to list, rather than a single string).
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can use the following 4-line routine whenever you want to create a new column and insert it at a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop())  # loc is the position at which you want to insert the new column
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]
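If you do this a lot, the routine can be wrapped in a small helper; move_to below is just an illustrative name, not a pandas API:
import pandas as pd

def move_to(df, column, loc):
    """Return df with `column` moved to position `loc`."""
    cols = df.columns.tolist()
    cols.insert(loc, cols.pop(cols.index(column)))
    return df[cols]

df = pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]})
df['n'] = 0
df = move_to(df, 'n', 0)   # n is now the first column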

Pandas: merge miscellaneous keys into the "others" row

I have a DataFrame like this:
df = pd.DataFrame({"key": ["a", "b", "c", "d", "e"], "value": [5, 4, 3, 2, 1]})
I am mainly interested in rows "a", "b" and "c". I want to merge everything else into an "others" row, like this:
key value
0 a 5
1 b 4
2 c 3
3 others 3
I wonder how this can be done.
First create a dataframe without d and e:
df2 = df[df.key.isin(["a","b","c"])]
Then find the value that you want the "others" row to have (using the sum function in this example):
val = df[~df["key"].isin(["a","b","c"])].sum()["value"]
Finally, append this row to the second df:
df2.append({"key":"others", "value":val},ignore_index=True)
df2 is now:
key value
0 a 5
1 b 4
2 c 3
3 others 3
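Note that DataFrame.append was removed in pandas 2.0, so on current versions the last step is usually written with pd.concat instead; a sketch under that assumption:
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "c", "d", "e"], "value": [5, 4, 3, 2, 1]})
keep = ["a", "b", "c"]
val = df.loc[~df["key"].isin(keep), "value"].sum()
df2 = pd.concat(
    [df[df["key"].isin(keep)], pd.DataFrame([{"key": "others", "value": val}])],
    ignore_index=True,
)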
I have found a way to do it. Not sure if it is the best way.
In [3]: key_map = {"a":"a", "b":"b", "c":"c"}
In [4]: data['key1'] = data['key'].map(lambda k: key_map.get(k, "others"))
In [5]: data.groupby("key1").sum()
Out[5]:
value
key1
a 5
b 4
c 3
others 3
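The same relabel-then-sum idea can also be written without an explicit dict, using Series.where; a sketch of an equivalent formulation:
import pandas as pd

data = pd.DataFrame({"key": ["a", "b", "c", "d", "e"], "value": [5, 4, 3, 2, 1]})
keep = ["a", "b", "c"]

out = (data.assign(key=data["key"].where(data["key"].isin(keep), "others"))
           .groupby("key", as_index=False)["value"].sum())
print(out)
#       key  value
# 0       a      5
# 1       b      4
# 2       c      3
# 3  others      3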

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value in a dataframe (df). My problem is that I have a df with several columns (more than 10); one of the columns contains identifiers, and the same identifier appears in several rows. I need to extract, for each identifier, its maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to broadcast these maxima back onto the original frame, just create a transform and assign:
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
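And to get exactly the frame shown in the question (one row per id with its maximum value), the grouped max can be returned as a DataFrame rather than a Series:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'value': [0, 1, 1, 0, 2, 1]})

out = df.groupby('id', as_index=False)['value'].max()
print(out)
#   id  value
# 0  a      0
# 1  b      1
# 2  c      2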