How to reset pandas data reader index? [duplicate] - pandas

This seems rather obvious, but I can't figure out how to convert the index of a DataFrame to a column.
For example:
df=
gi ptt_loc
0 384444683 593
1 384444684 594
2 384444686 596
To,
df=
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596

either:
df['index1'] = df.index
or, .reset_index:
df = df.reset_index(level=0)
so, if you have a multi-index frame with 3 levels of index, like:
>>> df
val
tick tag obs
2016-02-26 C 2 0.0139
2016-02-27 A 2 0.5577
2016-02-28 C 6 0.0303
and you want to convert the 1st (tick) and 3rd (obs) levels in the index into columns, you would do:
>>> df.reset_index(level=['tick', 'obs'])
tick obs val
tag
C 2016-02-26 2 0.0139
A 2016-02-27 2 0.5577
C 2016-02-28 6 0.0303
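A minimal runnable sketch of both cases above. The sample values are copied from the question; the code that builds the frames and the variable names `with_col`, `multi` and `partial` are assumptions added for illustration.

```python
import pandas as pd

# Single-level case: copy the index into a new column.
df = pd.DataFrame({"gi": [384444683, 384444684, 384444686],
                   "ptt_loc": [593, 594, 596]})
with_col = df.copy()
with_col["index1"] = with_col.index  # index is kept; the column is appended last

# MultiIndex case: move only selected levels into columns.
idx = pd.MultiIndex.from_arrays(
    [pd.to_datetime(["2016-02-26", "2016-02-27", "2016-02-28"]),
     ["C", "A", "C"],
     [2, 2, 6]],
    names=["tick", "tag", "obs"])
multi = pd.DataFrame({"val": [0.0139, 0.5577, 0.0303]}, index=idx)
partial = multi.reset_index(level=["tick", "obs"])  # 'tag' stays in the index
print(partial)
```

Note that the column assignment appends index1 as the last column, whereas reset_index inserts the former index levels at the front.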

rename_axis + reset_index
You can first rename your index to the desired label, then promote it to a column:
df = df.rename_axis('index1').reset_index()
print(df)
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
This works also for MultiIndex dataframes:
print(df)
# val
# tick tag obs
# 2016-02-26 C 2 0.0139
# 2016-02-27 A 2 0.5577
# 2016-02-28 C 6 0.0303
df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()
print(df)
index1 index2 index3 val
0 2016-02-26 C 2 0.0139
1 2016-02-27 A 2 0.5577
2 2016-02-28 C 6 0.0303

To provide a bit more clarity, let's look at a DataFrame with two levels in its index (a MultiIndex).
index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'],
['North', 'South']],
names=['State', 'Direction'])
df = pd.DataFrame(index=index,
data=np.random.randint(0, 10, (6,4)),
columns=list('abcd'))
The reset_index method, called with the default parameters, converts all index levels to columns and uses a simple RangeIndex as new index.
df.reset_index()
Use the level parameter to control which index levels are converted into columns. If possible, use the level name, which is more explicit. If there are no level names, you can refer to each level by its integer location, which begins at 0 from the outside. You can pass a scalar value here or a list of all the levels you would like to reset.
df.reset_index(level='State') # same as df.reset_index(level=0)
In the rare event that you want to preserve the index and turn the index into a column, you can do the following:
# for a single level
df.assign(State=df.index.get_level_values('State'))
# for all levels
df.assign(**df.index.to_frame())
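The three variants above can be compared side by side in a small sketch. It assumes deterministic data from np.arange instead of the random integers used in the answer, and the result names `all_reset`, `one_reset` and `kept` are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["TX", "FL", "CA"], ["North", "South"]],
                                   names=["State", "Direction"])
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=list("abcd"))

all_reset = df.reset_index()               # every level becomes a column
one_reset = df.reset_index(level="State")  # only 'State' becomes a column
kept = df.assign(State=df.index.get_level_values("State"))  # index untouched

print(all_reset.head(2))
print(one_reset.head(2))
print(kept.head(2))
```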

For a MultiIndex you can extract one of its levels using
df['si_name'] = df.index.get_level_values('si_name')
where si_name is the name of the level.

If you want to use the reset_index method and also preserve your existing index you should use:
df.reset_index().set_index('index', drop=False)
or to change it in place:
df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)
For example:
print(df)
gi ptt_loc
0 384444683 593
4 384444684 594
9 384444686 596
print(df.reset_index())
index gi ptt_loc
0 0 384444683 593
1 4 384444684 594
2 9 384444686 596
print(df.reset_index().set_index('index', drop=False))
index gi ptt_loc
index
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596
And if you want to get rid of the index label you can do:
df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
index gi ptt_loc
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596

This should do the trick (if not multilevel indexing) -
df.reset_index().rename({'index':'index1'}, axis = 'columns')
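You can also pass inplace=True to rename if you do not want to assign the result to a new variable. Note that in a chained call like the one above this only modifies the temporary frame returned by reset_index(), so call df.reset_index(inplace=True) first and then df.rename(..., inplace=True).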

df1 = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
p = df1.index.values
df1.insert( 0, column="new",value = p)
df1
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123

Since pandas 1.5.0, reset_index accepts a names argument that specifies the names to give the columns created from the index. Here is a reproducible example with one index column:
import pandas as pd
df = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
gi ptt
0 232 342
1 66 56
2 34 662
3 43 123
df.reset_index(names=['new'])
Output:
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
This can also easily be applied with MultiIndex. Just create a list of the names you want.
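A short sketch of the MultiIndex case, assuming pandas 1.5.0 or later. The frame and the labels "group" and "item" are made up for illustration; for a MultiIndex the list passed to names needs one entry per level.

```python
import pandas as pd

# Hypothetical two-level index without level names.
idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
df = pd.DataFrame({"gi": [232, 66, 34]}, index=idx)

# One name per index level; requires pandas >= 1.5.0.
out = df.reset_index(names=["group", "item"])
print(out)
```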

I usually do it this way:
df = df.assign(index1=df.index)

Related

index compatibility of dataframe with multiindex result from apply on group

We have to apply an algorithm to columns in a dataframe, the data has to be grouped by a key and the result shall form a new column in the dataframe. Since it is a common use-case we wonder if we have chosen a correct approach or not.
Following code reflects our approach to the problem in a simplified manner.
import numpy as np
import pandas as pd
np.random.seed(42)
N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)
df = pd.DataFrame(data)
This generates a DataFrame as follows.
key x
0 0 0.969585
1 1 0.775133
2 1 0.939499
3 1 0.894827
4 1 0.597900
.. ... ...
95 53 0.036887
96 54 0.609564
97 55 0.502679
98 56 0.051479
99 56 0.278646
Application of exemplary methods on the DataFrame groups.
def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)
def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const) # perform some pandas incompatible magic
    return pd.Series(y, index=index) # reconstruct index
g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))
Assigning the result as a new column with df['y'] = y_per_g raises a TypeError.
TypeError: incompatible index of inserted column with frame index
Thus a compatible multiindex needs to be introduced first.
df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)
Which yields the intended result.
key x y
index
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
... ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
Now we wonder if there is a more straightforward way of dealing with the index and if we have generally chosen a favorable approach.
Use Series.droplevel to remove the first level of the MultiIndex so that the result has the same index as df; then the assignment works:
g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print (df)
key x y
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
.. ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
[100 rows x 3 columns]
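As a possible alternative to apply plus droplevel, and not something the answer above proposes, groupby(...).transform returns a result already aligned to the original index when the function gives back one value per row. The sketch below assumes a deterministic stand-in called `center` (hypothetical) in place of the random `magic` function, so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

def center(values):
    """Hypothetical stand-in for a NumPy-only routine: subtract the group mean."""
    arr = np.asarray(values, dtype=float)
    return arr - arr.mean()

df = pd.DataFrame({"key": [0, 0, 1, 1, 1],
                   "x": [1.0, 3.0, 2.0, 4.0, 6.0]})

# transform calls the function once per group and reassembles the pieces
# on the original index, so no MultiIndex has to be dropped afterwards.
df["y"] = df.groupby("key")["x"].transform(lambda s: center(s.to_numpy()))
print(df)
```

This only fits when the per-group function needs a single column and returns an array of the same length as its input.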

Converting a NumPy array to a proper DataFrame

I have a NumPy array like the one below:
data = np.array([[1,2],[4,5],[7,8]])
I want to convert it to a DataFrame with the column names below, so that the first value of each inner array goes into the first column:
df_main:
value_items excluded_items
1 2
4 5
7 8
from which I can later take:
df:
value_items
1
4
7
df2:
excluded_items
2
5
8
I tried to convert it to a DataFrame with the command
df = pd.DataFrame(data)
but the result still looks like an array of int32,
so the split fails for me.
Use reshape to get a 2D array and also pass the columns parameter:
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
Sample:
data = np.arange(785*2).reshape(1, 785, 2)
print (data)
[[[ 0 1]
[ 2 3]
[ 4 5]
...
[1564 1565]
[1566 1567]
[1568 1569]]]
print (data.shape)
(1, 785, 2)
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
print (df)
value_items excluded_items
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
.. ... ...
780 1560 1561
781 1562 1563
782 1564 1565
783 1566 1567
784 1568 1569
[785 rows x 2 columns]
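For the small array from the question, which is already two-dimensional, a minimal sketch could look like this. The names `df_main`, `df` and `df2` follow the question; selecting with a list of column names is one way to keep each part as a one-column DataFrame.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [4, 5], [7, 8]])

# Name the two columns while building the frame.
df_main = pd.DataFrame(data, columns=["value_items", "excluded_items"])

# Double brackets return a one-column DataFrame rather than a Series.
df = df_main[["value_items"]]
df2 = df_main[["excluded_items"]]
print(df)
print(df2)
```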

Two-level header in pandas?

I created a new dataframe from an old one and now I have something like this:
df = pd.DataFrame({0:[1,5,1,1,3]}, index=[243,254,507,1903,2358]).rename_axis('uid')
print (df)
0
uid
243 1
254 5
507 1
1903 1
2358 3
I don't really understand what it means. Is that a double header with the first header having just one index and the second having the other one? How can I transform this dataframe into having a single header, with names ['userID' , 'counts'] ?
This is a one-column DataFrame whose column is named 0 and whose index is named uid; it is not a two-level header.
So you need:
df = df.reset_index()
df.columns = ['userID' , 'counts']
print (df)
userID counts
0 243 1
1 254 5
2 507 1
3 1903 1
4 2358 3
Another solution:
df = df.rename_axis('userID').squeeze().reset_index(name='counts')

assigning title to intervals in pandas

import numpy as np
import pandas as pd
xlist = np.arange(1, 100).tolist()
df = pd.DataFrame(xlist,columns=['Numbers'],dtype=int)
pd.cut(df['Numbers'],5)
How can I assign a column name to each distinct interval created?
IIUC, you can split the rows into chunks by index and join the chunks side by side in a new DataFrame with pd.concat:
# get indexes
l = df.index.tolist()
n =20
indexes = [l[i:i + n] for i in range(0, len(l), n)]
# create new data frame
new_df = pd.concat([df.iloc[x].reset_index(drop=True) for x in indexes], axis=1)
new_df.columns = ['Numbers'+str(x) for x in range(new_df.shape[1])]
print(new_df)
Numbers0 Numbers1 Numbers2 Numbers3 Numbers4
0 1 21 41 61 81.0
1 2 22 42 62 82.0
2 3 23 43 63 83.0
3 4 24 44 64 84.0
4 5 25 45 65 85.0
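If the goal is instead to give each interval produced by pd.cut a name, which is a different reading of the question than the answer above takes, the labels parameter of pd.cut may be enough. The label strings and the column name "bin" below are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Numbers": np.arange(1, 100)})

# One label per bin; the label text is made up for this sketch.
names = ["bin0", "bin1", "bin2", "bin3", "bin4"]
df["bin"] = pd.cut(df["Numbers"], 5, labels=names)

# Show how many rows fall into each named interval.
print(df["bin"].value_counts(sort=False))
```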

sorting within the keys of group by

I have a group-by table as follows. I want to sort by index within the keys ['CPUCore', 'Offline_RetetionAge'] (I need to keep the structure of ['CPUCore', 'Offline_RetetionAge']). How should I do this?
I think the problem is that the dtype of your second level is object, which here obviously means strings, so sort_index sorts it alphanumerically:
df = pd.DataFrame({'CPUCore':[2,2,2,3,3],
'Offline_RetetionAge':['100','1','12','120','15'],
'index':[11,16,5,4,3]}).set_index(['CPUCore','Offline_RetetionAge'])
print (df)
index
CPUCore Offline_RetetionAge
2 100 11
1 16
12 5
3 120 4
15 3
print (df.index.get_level_values('Offline_RetetionAge').dtype)
object
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
100 11
12 5
3 120 4
15 3
#change multiindex - cast level Offline_RetetionAge to int
new_index = list(zip(df.index.get_level_values('CPUCore'),
df.index.get_level_values('Offline_RetetionAge').astype(int)))
df.index = pd.MultiIndex.from_tuples(new_index, names = df.index.names)
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
12 5
100 11
3 15 3
120 4
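Another way to get the same integer ordering, sketched under the assumption that a round trip through reset_index is acceptable: move the levels to columns, cast the column with astype, and rebuild the index. The names `as_text` and `as_int` are illustrative.

```python
import pandas as pd

keys = ["CPUCore", "Offline_RetetionAge"]
df = pd.DataFrame({"CPUCore": [2, 2, 2, 3, 3],
                   "Offline_RetetionAge": ["100", "1", "12", "120", "15"],
                   "index": [11, 16, 5, 4, 3]}).set_index(keys)

as_text = df.sort_index()  # second level compared as strings

# Cast the level to int as a regular column, then rebuild the MultiIndex.
as_int = (df.reset_index()
            .astype({"Offline_RetetionAge": int})
            .set_index(keys)
            .sort_index())
print(as_int)
```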
EDIT by comment:
print (df.reset_index()
.sort_values(['CPUCore','index'])
.set_index(['CPUCore','Offline_RetetionAge']))
index
CPUCore Offline_RetetionAge
2 12 5
100 11
1 16
3 15 3
120 4
I think what you mean is this:
import pandas as pd
from pandas import Series, DataFrame
# create what I believe you tried to ask
df = DataFrame(
    [[11, 'reproducible'], [16, 'example'], [5, 'a'], [4, 'create'], [9, '!']])
df.columns = ['index', 'bla']
df.index = pd.MultiIndex.from_arrays([[2]*4 + [3], [10, 100, 1000, 11, 512]],
                                     names=['CPUCore', 'Offline_RetentionAge'])
# sort by values and afterwards by index where sort_remaining=False preserves
# the order of index
df = df.sort_values('index').sort_index(level=0, sort_remaining=False)
print(df)
The sort_values call sorts the rows by the 'index' column, and sort_index then restores the grouping by the first MultiIndex level without changing the order of rows that share the same CPUCore.
I don't know what a "group by table" is supposed to be. If you have a pandas GroupBy object, you won't be able to use sort_values() like that.
You might have to rethink what you group by or use functools.partial and DataFrame.apply
Output:
index bla
CPUCore Offline_RetentionAge
2 11 4 create
1000 5 a
10 11 reproducible
100 16 example
3 512 9 !