How to convert categorical index to normal index - pandas

I have the following DataFrame (result of the method unstack):
df = pd.DataFrame(np.arange(12).reshape(2, -1),
columns=pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c']))
df looks like this:
a b c a b c
0 0 1 2 3 4 5
1 6 7 8 9 10 11
When I call df.reset_index(), I get the following error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
To work around this, I want to convert the column index from categorical to a normal one. What is the most straightforward way to do that? Alternatively, is there a way to reset the index without converting it? My current idea is:
df.columns = list(df.columns)

The most general approach is to convert the columns to a list:
df.columns = df.columns.tolist()
Or if possible, convert them to strings:
df.columns = df.columns.astype(str)
df = df.reset_index()
print (df)
index a b c a b c
0 0 0 1 2 3 4 5
1 1 6 7 8 9 10 11
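A variant of the same idea, as a minimal sketch: astype(object) turns the CategoricalIndex into a plain object Index and keeps the labels as they are, which can be useful when they are not strings. The name result is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(2, -1),
    columns=pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c']),
)

# Replace the CategoricalIndex with a plain object-dtype Index.
df.columns = df.columns.astype(object)

# With ordinary column labels, reset_index can insert the 'index' column.
result = df.reset_index()
print(result)
```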

Related

Remove rows in pandas df with index values within a range

I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!
Compare each index value with the next one, collect the labels to keep in a list, and pass that list to loc. I used a list so that the final output is a DataFrame.
ind = [ a for a,b in zip(df.index,df.index[1:]) if b-a > 4]
df.loc[ind]
A B
0 1 1
You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >= 4].drop('index', axis=1)
In [1314]: df
Out[1314]:
A B
0 1 1
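The same diff/shift idea can be sketched without reset_index by working on the index directly. This assumes, as the answers above do, that a row is kept when the next index value is at least 4 away; the names gap_to_next and kept are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 9, 10], 'B': [1, 5, 9, 10]}, index=[0, 5, 8, 9])

# Distance from each index value to the one that follows it.
# The last row has no successor, so its gap is NaN and it is dropped.
gap_to_next = df.index.to_series().diff().shift(-1)

kept = df[gap_to_next >= 4]
print(kept)
```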

Adding new column to an existing dataframe at an arbitrary position [duplicate]

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc=0 will insert at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This works as long as no other column has the same name. If a column with the given name already exists in the DataFrame, insert raises a ValueError.
To create a new column with an already existing name, pass the optional parameter allow_duplicates=True.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
    self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
  File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
    raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line, although it looks a bit ugly. A cleaner proposal may come along...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple one-line answer.
After adding the 'n' column to your df, reorder the columns as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words rather than single letters, calling list() on a string will not work, because it splits the string into characters. Pass the names as a tuple or list inside the brackets instead.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can use the following 4-line routine whenever you want to create a new column and insert it at a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]
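The routine above could be wrapped in a small helper, sketched here under the assumption that column names are unique (with duplicate names, selecting by a list of names would repeat columns). The function name move_last_column is hypothetical, not part of pandas.

```python
import pandas as pd

def move_last_column(frame, loc):
    """Return a copy of frame with its last column moved to position loc."""
    cols = frame.columns.tolist()
    cols.insert(loc, cols.pop())  # pop the last name, re-insert it at loc
    return frame[cols]

df = pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]})
df['n'] = 0                      # new column is appended at the end
df = move_last_column(df, 0)     # move it to the front
print(df)
```

When the new column's values are already known, df.insert(loc, name, value) does the same thing in one call.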

Comparing and replacing column items pandas dataframe

I have three columns C1, C2, C3 in a pandas DataFrame. My aim is to replace C3_i by C2_j whenever C3_i = C1_j. These are all strings. I tried where but failed. What is a good way to do this without a for loop?
If my data frame is
df=pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
Then I want c3 to be replaced by ['f','z','e']
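I tried this, but it takes a very long time:
for i in range(len(df)):
    for j in range(len(df)):
        if df.iloc[i]['c1'] == df.iloc[j]['c3']:
            # use a single indexer; chained df.iloc[j]['c3'] = ... may write to a temporary copy
            df.iloc[j, df.columns.get_loc('c3')] = df.iloc[i]['c2']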
Use map with a Series created by set_index:
df['c3'] = df['c3'].map(df.set_index('c1')['c2']).fillna(df['c3'])
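Alternative solution with update (this relies on df['c3'] being modified in place, which may not propagate back to df when pandas Copy-on-Write is enabled):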
df['c3'].update(df['c3'].map(df.set_index('c1')['c2']))
print (df)
c1 c2 c3
0 a d f
1 b e z
2 c f e
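To make the map approach easier to follow, here is the same computation broken into steps, as a sketch. The intermediate names lookup and mapped are illustrative only, and it assumes the values in c1 are unique so that they can serve as lookup keys.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'b', 'c'],
                   'c2': ['d', 'e', 'f'],
                   'c3': ['c', 'z', 'b']})

# Lookup table: c1 value -> c2 value.
lookup = df.set_index('c1')['c2']

# Values of c3 that are not found in c1 become NaN here...
mapped = df['c3'].map(lookup)

# ...so fall back to the original c3 value for those rows.
df['c3'] = mapped.fillna(df['c3'])
print(df)
```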
Example data:
dataframe = pd.DataFrame({'a':['10','4','3','40','5'], 'b':['5','4','3','2','1'], 'c':['s','d','f','g','h']})
Output:
a b c
0 10 5 s
1 4 4 d
2 3 3 f
3 40 2 g
4 5 1 h
Code:
def replace(df):
    # df here is a single row (a Series), not the whole frame
    if len(dataframe[dataframe.b == df.a]) != 0:
        df['a'] = dataframe[dataframe.b == df.a].c.values[0]
    return df
dataframe = dataframe.apply(replace, axis=1)
Output:
a b c
0 10 5 s
1 d 4 d
2 f 3 f
3 40 2 g
4 s 1 h
Is this what you want?

How to rename a pandas dataframe column by checking the column's data

Example df would be:
a b c d e
0 SN123456 3 5 7 SN123456
1 SN456123 4 6 8 SN456123
I am wondering how I can rename the column from 'a' to 'Serial_Number' based on its data: the values start with 'SN' and have a fixed length of 8.
(We may not know the name 'a', since the data is read from a CSV file, and the column's position is not known either.)
Also, how can I remove the duplicated column 'e'? It is an exact duplicate of column 'a'.
Any idea for a fast way to do this?
Looping over each column to find its index and rename it does not seem like a good method.
Thanks!
Here's a rewrite in response to your comment. This will rename + drop in a vectorized fashion.
Given df:
>>> df
a b c d e f g
0 SN123456 3 5 7 SN123456 0 0
1 SN456123 4 6 8 SN456123 0 0
Create 3 boolean masks of the same length as the columns:
>>> mask1 = df.dtypes == 'object'
>>> mask2 = df.iloc[0].str.len() == 8
>>> mask3 = df.iloc[0].str.startswith('SN')
Use these to identify which columns look like serial numbers. The first will be renamed; the rest will be dropped.
>>> rename, *drop = df.columns[mask1 & mask2 & mask3]
Then rename + drop:
>>> rename
'a'
>>> drop
['e']
>>> df.rename(columns={rename: 'Serial_Number'}).drop(drop, axis=1)
Serial_Number b c d f g
0 SN123456 3 5 7 0 0
1 SN456123 4 6 8 0 0
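The masks above inspect only the first row. If every row should be checked, and a column should be dropped only when it is an exact copy of the renamed one, something like the following sketch could be used. It does loop over columns, which is usually cheap because there are far fewer columns than rows. The helper name looks_like_serial is hypothetical, and the code assumes at least one matching column exists.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['SN123456', 'SN456123'],
    'b': [3, 4],
    'e': ['SN123456', 'SN456123'],
})

def looks_like_serial(col):
    """True if every value in the column starts with 'SN' and has length 8."""
    text = col.astype(str)
    return bool((text.str.startswith('SN') & (text.str.len() == 8)).all())

serial_cols = [c for c in df.columns if looks_like_serial(df[c])]
first, *rest = serial_cols  # raises ValueError if no column matches

# Drop only the later columns that are exact copies of the first one.
duplicates = [c for c in rest if df[c].equals(df[first])]
out = df.rename(columns={first: 'Serial_Number'}).drop(columns=duplicates)
print(out)
```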

In PANDAS, how to get the index of a known value?

If we have a known value in a column, how can we get its index-value? For example:
In [148]: a = pd.DataFrame(np.arange(10).reshape(5,2),columns=['c1','c2'])
In [149]: a
Out[149]:
c1 c2
0 0 1
1 2 3
2 4 5
........
As we know, we can get a value by the index corresponding to it, like this.
In [151]: a.iloc[0,1] In [152]: a.c2[0] In [154]: a.c2.loc[0] <-- use index
Out[151]: 1 Out[152]: 1 Out[154]: 1 <-- get value
But how to get the index by value?
More than one index value may map to your value, so it makes more sense to return a list:
In [48]: a
Out[48]:
c1 c2
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [49]: a.c1[a.c1 == 8].index.tolist()
Out[49]: [4]
Using the .loc[] accessor:
In [25]: a.loc[a['c1'] == 8].index[0]
Out[25]: 4
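You can also use get_loc() after setting 'c1' as the index. set_index returns a new DataFrame, so the original is not changed. Note that get_loc returns an integer position, which matches the index label here only because a uses the default integer index.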
In [17]: a.set_index('c1').index.get_loc(8)
Out[17]: 4
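The difference between a label and a position shows up once the DataFrame has a non-default index. The following sketch uses a made-up string index (the letters v to z are an assumption for illustration) to contrast a boolean-mask lookup, which yields labels, with get_loc, which yields a position.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(10).reshape(5, 2),
                 columns=['c1', 'c2'],
                 index=list('vwxyz'))

# Boolean mask on the index: returns the index labels of matching rows.
labels = a.index[(a['c1'] == 8).to_numpy()].tolist()

# get_loc on the c1 values: returns the integer position of the match.
position = a.set_index('c1').index.get_loc(8)

print(labels, position)
```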
Another way is to use numpy.where():
import numpy as np
import pandas as pd
In [800]: df = pd.DataFrame(np.arange(10).reshape(5,2),columns=['c1','c2'])
In [801]: df
Out[801]:
c1 c2
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [802]: np.where(df["c1"]==6)
Out[802]: (array([3]),)
In [803]: indices = list(np.where(df["c1"]==6)[0])
In [804]: df.iloc[indices]
Out[804]:
c1 c2
3 6 7
In [805]: df.iloc[indices].index
Out[805]: Int64Index([3], dtype='int64')
In [806]: df.iloc[indices].index.tolist()
Out[806]: [3]
To get the index by value, simply add .index[0] to the end of a query. This will return the index of the first row of the result...
So, applied to your dataframe:
In [1]: a[a['c2'] == 1].index[0] In [2]: a[a['c1'] > 7].index[0]
Out[1]: 0 Out[2]: 4
Where the query returns more than one row, the additional index results can be accessed by specifying the desired index, e.g. .index[n]
In [3]: a[a['c2'] >= 7].index[1] In [4]: a[(a['c2'] > 1) & (a['c1'] < 8)].index[2]
Out[3]: 4 Out[4]: 3
This may help if you need both the index and the column of a value. Below, matrix is the DataFrame and minv is the value you are looking for.
If the value you are looking for is not duplicated:
poz=matrix[matrix==minv].dropna(axis=1,how='all').dropna(how='all')
value=poz.iloc[0,0]
index=poz.index.item()
column=poz.columns.item()
This gives you its index and column.
If the value is duplicated:
matrix=pd.DataFrame([[1,1],[1,np.nan]],index=['q','g'],columns=['f','h'])
matrix
Out[83]:
f h
q 1 1.0
g 1 NaN
poz=matrix[matrix==minv].dropna(axis=1,how='all').dropna(how='all')
index=poz.stack().index.tolist()
index
Out[87]: [('q', 'f'), ('q', 'h'), ('g', 'f')]
This gives you a list of (index, column) tuples.
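An alternative sketch for the same lookup uses numpy.where on the underlying array, which handles both the unique and the duplicated case. The names target and positions are illustrative, and target stands in for the searched value (minv above).

```python
import numpy as np
import pandas as pd

matrix = pd.DataFrame([[1, 1], [1, np.nan]],
                      index=['q', 'g'], columns=['f', 'h'])
target = 1

# Row and column positions of every cell equal to target (NaN never matches).
rows, cols = np.where(matrix.to_numpy() == target)

# Translate positions back into (index label, column label) pairs.
positions = [(matrix.index[r], matrix.columns[c]) for r, c in zip(rows, cols)]
print(positions)
```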