How to make first level of MultiIndex as the columns? - pandas

Say I have a MultiIndex dataframe like:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = pa.DataFrame(randn(6,1),index=pa.MultiIndex.from_tuples(zip(*arrays)),columns=['A'])
In [3]: df
Out[3]:
A
one 1 0.229037
2 -1.640695
3 0.908127
two 1 -0.918750
2 1.170112
3 -2.620850
I would like to turn this into a new dataframe whose columns are the first level of the MultiIndex. Is there an easy way? (An example is below.)
In [12]: dft = df.loc['one']
In [13]: dft = dft.rename(columns={'A':'one'})
In [14]: dft['two'] = df.loc['two']['A']
In [15]: dft
Out[15]:
one two
1 0.229037 -0.918750
2 -1.640695 1.170112
3 0.908127 -2.620850

Perhaps you are looking for DataFrame.unstack:
In [56]: df
Out[56]:
A
one 1 0.229037
2 -1.640695
3 0.908127
two 1 -0.918750
2 1.170112
3 -2.620850
In [57]: df.unstack(level=0)
Out[57]:
A
one two
1 0.229037 -0.918750
2 -1.640695 1.170112
3 0.908127 -2.620850
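For a version that runs on current pandas, here is a minimal sketch of the same idea. The seeded random generator and the variable name wide are assumptions made for illustration. Selecting the 'A' column before unstacking is one way to get plain one/two columns without the extra 'A' level on top.

```python
import numpy as np
import pandas as pd

arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
index = pd.MultiIndex.from_arrays(arrays)
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame(rng.standard_normal((6, 1)), index=index, columns=['A'])

# Unstacking the Series df['A'] moves index level 0 into the columns,
# so the result has the columns 'one' and 'two' and the index 1, 2, 3.
wide = df['A'].unstack(level=0)
print(wide)
```

Calling df.unstack(level=0) on the whole dataframe instead keeps 'A' as an outer column level, as in the output above.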

Just to add to this, another option for turning a MultiIndex into columns is the reset_index() function. The difference is that it simply "pops" the index levels out as new columns and keeps the data in long form. Which one you want depends on your use case:
In [5]: df
Out[5]:
A
one 1 -1.598591
2 -0.354813
3 -0.435924
two 1 1.408328
2 0.448303
3 0.381360
In [6]: df.reset_index()
Out[6]:
level_0 level_1 A
0 one 1 -1.598591
1 one 2 -0.354813
2 one 3 -0.435924
3 two 1 1.408328
4 two 2 0.448303
5 two 3 0.381360
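If the index levels are named, reset_index uses those names instead of level_0 and level_1. A small sketch of this, where the level names group and num and the data values are made up for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]],
    names=['group', 'num'],  # hypothetical level names
)
df = pd.DataFrame({'A': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=index)

# Move every index level into ordinary columns
flat = df.reset_index()

# Move only one level out and keep the other as the index
partial = df.reset_index(level='group')

print(flat)
print(partial)
```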

Related

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
index = timestamps )
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas DataFrame.rolling:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3  # two timesteps plus the target column
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
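As an alternative sketch (not the approach used in the answer above), the same windows can be built with NumPy's sliding_window_view, which avoids the Python-level loop. This assumes NumPy 1.20 or newer and that all columns share a numeric dtype. The sample data is rebuilt with periods=4 so the block runs on its own.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

timestamps = pd.date_range(start='2017-01-01', periods=4, freq='D')
values = np.arange(len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2}, index=timestamps)

window_size = 3  # two timesteps plus the target column

# Sliding along the rows gives an array of shape
# (n_windows, n_columns, window_size)
windows = sliding_window_view(df.to_numpy(), window_size, axis=0)
n_windows = windows.shape[0]

# Flatten (window, column) pairs into rows, one row per pair
new_df = pd.DataFrame(
    windows.reshape(-1, window_size),
    index=pd.MultiIndex.from_product([range(n_windows), df.columns]),
    columns=[f"timestep_{j}" for j in range(1, window_size)] + ["target"],
)
print(new_df)
```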

Why use to_frame before reset_index?

Using a data set like this one
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
But we get exactly the same result from
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because reset_index on a Series includes a name parameter and returns a DataFrame, while reset_index on a DataFrame does not include the name parameter.)
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in this commit on the 27th of January 2012.
Series.to_frame was added in this commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
There is no noticeable advantage to using to_frame(). Both approaches achieve the same result, and pandas often offers several ways to solve a problem. The only advantage I can think of is that, for larger data sets, it may be more convenient to look at a dataframe view before resetting the index. Taking your dataframe as an example, to_frame() displays the counts as a neat dataframe table rather than as a Series, which may make the data easier to understand. Using to_frame() also makes the intent clearer to a new user reading your code for the first time.
The example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() function returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a DataFrame. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
Now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, so we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
Now let's look at the other case. We will build the same count() Series as test1 but call it test2 to tell the two approaches apart. In other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see, both dataframes are equivalent. In the second approach we only had to use reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
The second approach needs less code but is less readable to new eyes. I'd prefer the first approach with to_frame() because it makes the intent clear: "Convert this count series to a dataframe and rename the column 'module_id' to 'count'".
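The element-wise comparison above can also be collapsed into a single check with DataFrame.equals. A minimal sketch, where the seeded generator and the variable names via_to_frame and via_name are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
df = pd.DataFrame(
    rng.integers(0, 5, size=(20, 3)),
    columns=['user_id', 'module_id', 'week'],
)

counts = df.groupby('user_id')['module_id'].count()

# Approach 1: Series -> DataFrame, reset the index, then rename the column
via_to_frame = counts.to_frame().reset_index().rename(columns={'module_id': 'count'})

# Approach 2: reset the index and name the value column in one call
via_name = counts.reset_index(name='count')

# equals() compares values, dtypes, index and column labels at once
print(via_to_frame.equals(via_name))
```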

Access elements of pandas series

I have a dataframe and I want to extract the frequency of 0/1 in a particular column.
df=pd.DataFrame({'A':[0,0,1,0,1]})
df
Out[6]:
A
0 0
1 0
2 1
3 0
4 1
Calculating the frequency of occurrence of 0s and 1s:
df['A'].value_counts()
Out[8]:
0 3
1 2
Name: A, dtype: int64
type(df['A'].value_counts())
Out[9]: pandas.core.series.Series
How can I extract the frequencies of 0s and 1s into two variables, say zeros and ones, like this:
zeros=3, ones=2
I think it would be a bit more flexible to return a dictionary:
In [234]: df['A'].value_counts().to_dict()
Out[234]: {0: 3, 1: 2}
or
In [236]: d = df['A'].astype(str).replace(['0','1'], ['zeros','ones']).value_counts().to_dict()
In [237]: d
Out[237]: {'ones': 2, 'zeros': 3}
In [238]: d['ones']
Out[238]: 2
In [239]: d['zeros']
Out[239]: 3
You can also access it directly:
In [3]: df['A'].value_counts().loc[0]
Out[3]: 3
In [4]: df['A'].value_counts().loc[1]
Out[4]: 2
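One caveat with .loc is that it raises a KeyError when a value never occurs in the column. A hedged sketch using Series.get with a default of 0 sidesteps this. The variable twos is only there to show the missing-value case.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 0, 1]})
counts = df['A'].value_counts()

# get() looks the label up in the index and falls back to the default
zeros = int(counts.get(0, 0))
ones = int(counts.get(1, 0))
twos = int(counts.get(2, 0))  # 2 never occurs, so the default is used

print(zeros, ones, twos)
```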
Another way to solve this is to use the Counter class from the collections library.
import collections
c = collections.Counter(df['A'])
c
Out[31]: Counter({0: 3, 1: 2})
count_0s = c[0]  # Returns 3
count_1s = c[1]  # Returns 2

In Python Pandas using cumsum with groupby

I am trying to do a pandas cumsum() where the value is reset to 0 every time the group changes.
Say I have the dataframe below, where after a group by I have the second column (Group) and expect the third column (Cumsum) as the result of the function:
Value Group Cumsum
a 1 0
a 1 1
a 1 2
b 2 0
b 2 1
b 2 2
b 2 3
c 3 0
c 3 1
d 4 0
This doesn't work:
df['Cumsum'] = df['Group'].cumsum()
Please advise.
Thanks!
A groupby followed by cumsum does this. The result keeps the original row index, so the group keys can be taken straight from the original column.
First, the import:
import pandas as pd
Now a DataFrame:
df = pd.DataFrame({
    'a': ['a', 'b', 'a', 'b'],
    'b': [0, 1, 2, 3]})
Now do the groupby-cumsum and put it next to the keys:
>>> pd.DataFrame({
...     'keys': df.a,
...     'cumsum': df.b.groupby(df.a).cumsum()})
  keys  cumsum
0    a       0
1    b       1
2    a       2
3    b       4
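If the expected column in the question is read as a per-group counter that restarts at 0 (which is what the sample table shows), groupby(...).cumcount() may be closer to what is wanted than cumsum(). A minimal sketch on the question's data follows. The Amount column and the RunningTotal name are hypothetical and are added only to show a true per-group cumulative sum next to the counter.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': list('aaabbbbccd'),
    'Group': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    'Amount': range(1, 11),  # hypothetical values to sum
})

# Counter that restarts at 0 whenever the group changes
df['Cumsum'] = df.groupby('Group').cumcount()

# Cumulative sum of Amount that restarts for every group
df['RunningTotal'] = df.groupby('Group')['Amount'].cumsum()

print(df)
```

Both results are aligned with the original row index, so they can be assigned back as new columns directly.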

In PANDAS, how to get the index of a known value?

If we have a known value in a column, how can we get its index-value? For example:
In [148]: a = pd.DataFrame(np.arange(10).reshape(5,2),columns=['c1','c2'])
In [149]: a
Out[149]:
c1 c2
0 0 1
1 2 3
2 4 5
........
As we know, we can get a value by the index corresponding to it, like this.
In [151]: a.iloc[0,1]
Out[151]: 1
In [152]: a.c2[0]
Out[152]: 1
In [154]: a.c2.loc[0]
Out[154]: 1
But how to get the index by value?
More than one index label might map to your value, so it makes more sense to return a list:
In [48]: a
Out[48]:
c1 c2
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [49]: a.c1[a.c1 == 8].index.tolist()
Out[49]: [4]
Using the .loc[] accessor:
In [25]: a.loc[a['c1'] == 8].index[0]
Out[25]: 4
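You can also use get_loc() after setting 'c1' as the index. This does not change the original dataframe. Note that get_loc() returns the integer position, which matches the index label here only because the original index is the default 0, 1, 2, ...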
In [17]: a.set_index('c1').index.get_loc(8)
Out[17]: 4
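The difference between index labels and integer positions only shows up once the index is not the default range. A small sketch of both lookups, where the letter index v..z is made up for illustration:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    np.arange(10).reshape(5, 2),
    columns=['c1', 'c2'],
    index=list('vwxyz'),  # hypothetical non-default index labels
)

mask = a['c1'] == 8

# Index labels of the matching rows
labels = a.index[mask].tolist()

# Integer positions of the matching rows
positions = np.flatnonzero(mask.to_numpy()).tolist()

# get_loc on a unique index also returns a position, not a label
pos_via_get_loc = a.set_index('c1').index.get_loc(8)

print(labels, positions, pos_via_get_loc)
```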
Another way, using numpy.where():
import numpy as np
import pandas as pd
In [800]: df = pd.DataFrame(np.arange(10).reshape(5,2),columns=['c1','c2'])
In [801]: df
Out[801]:
c1 c2
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [802]: np.where(df["c1"]==6)
Out[802]: (array([3]),)
In [803]: indices = list(np.where(df["c1"]==6)[0])
In [804]: df.iloc[indices]
Out[804]:
c1 c2
3 6 7
In [805]: df.iloc[indices].index
Out[805]: Int64Index([3], dtype='int64')
In [806]: df.iloc[indices].index.tolist()
Out[806]: [3]
To get the index by value, simply add .index[0] to the end of a query. This will return the index of the first row of the result...
So, applied to your dataframe:
In [1]: a[a['c2'] == 1].index[0]
Out[1]: 0
In [2]: a[a['c1'] > 7].index[0]
Out[2]: 4
Where the query returns more than one row, the other index values can be accessed by specifying the desired position, e.g. .index[n]:
In [3]: a[a['c2'] >= 7].index[1]
Out[3]: 4
In [4]: a[(a['c2'] > 1) & (a['c1'] < 8)].index[2]
Out[4]: 3
I think this may help: it gets both the index and the column of the value. Here matrix is the dataframe and minv is the value you are looking for.
If the value is not duplicated:
poz=matrix[matrix==minv].dropna(axis=1,how='all').dropna(how='all')
value=poz.iloc[0,0]
index=poz.index.item()
column=poz.columns.item()
This gives you its index and column.
If the value is duplicated:
matrix=pd.DataFrame([[1,1],[1,np.nan]],index=['q','g'],columns=['f','h'])
matrix
Out[83]:
f h
q 1 1.0
g 1 NaN
poz=matrix[matrix==minv].dropna(axis=1,how='all').dropna(how='all')
index=poz.stack().index.tolist()
index
Out[87]: [('q', 'f'), ('q', 'h'), ('g', 'f')]
This gives you a list of (index, column) pairs.
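A different way to get the same (index, column) pairs, sketched here with numpy.where on the underlying array. The name target stands in for the value being searched for and is an assumption, as is the value 1.

```python
import numpy as np
import pandas as pd

matrix = pd.DataFrame(
    [[1, 1], [1, np.nan]],
    index=['q', 'g'],
    columns=['f', 'h'],
)
target = 1  # assumed value to search for

# Row and column positions of every matching cell, in row-major order
rows, cols = np.where(matrix.to_numpy() == target)

# Translate positions back into index and column labels
pairs = list(zip(matrix.index[rows], matrix.columns[cols]))
print(pairs)
```

NaN never compares equal to anything, so the missing cell is skipped without an explicit dropna.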