I am trying to use a slicer on a pandas dataframe with a MultiIndex:
import pandas as pd
import numpy as np

dates = pd.date_range('6/30/2000', periods=12, freq='M')
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJKHIJK')], names=['date', 'id'])
df = pd.DataFrame(np.random.randn(12, 4), index=index, columns=['A', 'B', 'C', 'D'])
I would like to get the rows where id='H'. From this comment in a related question I asked, and from reading the documentation, I thought this:
df.loc[(slice(None), 'H'),:]
or perhaps this:
df.loc[(slice(None), ['H']),:]
would work. The first one returns this error:
IndexError: index 12 is out of bounds for axis 1 with size 12
and the second one gives this error:
IndexError: indices are out-of-bounds
From looking at other questions, I thought I might need to sort by the second-level index before slicing. I tried df.sort_index(), but I'm having trouble with the syntax, and I'm not sure whether sorting is even the issue here.
You've got a problem in this line:
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJKHIJK')], names=['date', 'id'])
The two arrays should be the same length, but there are 16 ids for only 12 dates. That this doesn't raise is probably a bug.
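For instance, a corrected construction with matching lengths might look like this (shortening the id list to 12 entries is one way to fix it; whether to drop ids or add dates depends on your data):

```python
import pandas as pd

dates = pd.date_range('6/30/2000', periods=12, freq='M')
# 12 ids ('HIJK' repeated three times) so both arrays have length 12
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')],
                                  names=['date', 'id'])
```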
If you fix that then either of
In [58]: df.xs('H', level='id')
Out[58]:
A B C D
date
2000-06-30 -0.645203 0.965441 0.150037 -0.083979
2000-10-31 -1.222954 0.498284 -1.249005 -1.664407
2001-02-28 -0.941248 2.025381 0.450256 1.182266
In [59]: df.loc[(slice(None), 'H'), :]
Out[59]:
A B C D
date id
2000-06-30 H -0.645203 0.965441 0.150037 -0.083979
2000-10-31 H -1.222954 0.498284 -1.249005 -1.664407
2001-02-28 H -0.941248 2.025381 0.450256 1.182266
should work, depending on whether you'd like to drop the id level.
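An equivalent, arguably more readable spelling of the second form uses pd.IndexSlice; a self-contained sketch with the fixed 12-entry id list:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('6/30/2000', periods=12, freq='M')
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')],
                                  names=['date', 'id'])
df = pd.DataFrame(np.random.randn(12, 4), index=index,
                  columns=['A', 'B', 'C', 'D'])

idx = pd.IndexSlice
subset = df.loc[idx[:, 'H'], :]  # same rows as df.loc[(slice(None), 'H'), :]
```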
Related
I found that subsetting a multi-index dataframe keeps the original index values behind.
Here is sample code to test with.
level_one = ["foo", "bar", "baz"]
level_two = ["a", "b", "c"]
df_index = pd.MultiIndex.from_product((level_one, level_two))
df = pd.DataFrame(range(9), index=df_index, columns=["number"])
df
The code above produces a dataframe like this.
number
foo a 0
b 1
c 2
bar a 3
b 4
c 5
baz a 6
b 7
c 8
The code below subsets the dataframe so that index level 1 contains only 'a' and 'b'.
df_subset = df.query("(number%3) <=1")
df_subset
number
foo a 0
b 1
bar a 3
b 4
baz a 6
b 7
The dataframe itself is the expected result. BUT its index is still carrying the original levels, which is NOT expected.
# The following code still returns index 'c'
df_subset.index.levels[1]
#Result
Index(['a', 'b', 'c'], dtype='object')
My first question: how can I remove the 'original' index levels after subsetting?
My second question: is this expected behavior for pandas?
Thanks
Yes, this is expected; it lets you still access the missing levels after filtering. You can remove the unused levels with remove_unused_levels:
df_subset.index = df_subset.index.remove_unused_levels()
print(df_subset.index.levels[1])
Output:
Index(['a', 'b'], dtype='object')
It is normal that the "original" index remains after subsetting; this is documented pandas behavior: "The MultiIndex keeps all the defined levels of an index, even if they are not actually used. This is done to avoid a recomputation of the levels in order to make slicing highly performant."
You can see that the index levels are a FrozenList using:
[I]: df_subset.index.levels
[O]: FrozenList([['bar', 'baz', 'foo'], ['a', 'b', 'c']])
If you want to see only the used levels, you can use the get_level_values() or unique() methods.
Here are some examples:
[I]: df_subset.index.get_level_values(level=1)
[O]: Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object')
[I]: df_subset.index.unique(level=1)
[O]: Index(['a', 'b'], dtype='object')
Hope it can help you!
I have a Pandas DataFrame of ML experiment results (from MLFlow). I am trying to access the run_id of a single element in the 0th row and under the "tags" -> "run_id" multi-index in the columns.
The DataFrame is called experiment_results_df. I can access the element with the following command:
experiment_results_df.loc[0,(slice(None),'run_id')].values[0]
I thought I should be able to grab the value itself with a statement like the following:
experiment_results_df.at[0,('tags','run_id')]
# or...
experiment_results_df.loc[0,('tags','run_id')]
But either of those just results in the following rather confusing error (as I'm not setting anything):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It's working now, but I'd prefer a simpler syntax. More than that, I want to understand why the other approach isn't working, and whether I can modify it. I find multi-indexes very frustrating to work with in Pandas compared to regular indexes, but the extra structure is nice when I print the DF to the console or open it in a CSV viewer, as I currently have 41 columns (and growing).
I don't understand what the problem is:
df = pd.DataFrame({('T', 'A'): {0: 1, 1: 4},
                   ('T', 'B'): {0: 2, 1: 5},
                   ('T', 'C'): {0: 3, 1: 6}})
print(df)
# Output
T
A B C
0 1 2 3
1 4 5 6
How to extract 1:
>>> df.loc[0, ('T', 'A')]
1
>>> df.at[0, ('T', 'A')]
1
>>> df.loc[0, (slice(None), 'A')].iloc[0]
1
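If the first column level isn't known in advance (or you don't want to hard-code it, as with the asker's "tags" level), one option is xs on the column axis; a sketch with the same frame:

```python
import pandas as pd

df = pd.DataFrame({('T', 'A'): {0: 1, 1: 4},
                   ('T', 'B'): {0: 2, 1: 5},
                   ('T', 'C'): {0: 3, 1: 6}})

# Select on the second column level without naming the first;
# xs drops the matched level, leaving plain 'T' columns.
by_level = df.xs('A', level=1, axis=1)
value = by_level.loc[0, 'T']
```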
I am trying to insert 72 matrices with dimensions (24, 12) from a np array into a preexisting MultiIndex DataFrame indexed according to a np.array with dimensions (72, 2). I don't care about indexing the contents of the (24, 12) matrices; I just need to index the 72 matrices as objects for rearrangement purposes. It is like a map, to reorder according to some conditions and then unstack the columns.
what I have tried so far is:
cosphi.shape
(72, 2)
MFPAD_RCR.shape
(72, 24, 12)
df = pd.MultiIndex.from_arrays(cosphi.T, names=("costheta","phi"))
This successfully creates a MultiIndex with two levels and 72 rows. Then I try to add the 72 matrices:
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR},index=df)
or possibly
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR.astype(object)},index=df)
I get the error
Exception: Data must be 1-dimensional.
Any idea?
After a bit of careful research, I found that my question has already been answered here (the right answer) and here (a solution using a deprecated function).
For my specific question, the answer is something like:
data = MFPAD_RCR.reshape(72, 288).T
df = pd.DataFrame(
    data=data,
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(72)]
)
Note that the 3D np array has to be reshaped so that its second dimension equals the product of the two index lengths (here 24 × 12 = 288).
df1 = df.T
I want to be able to sort my items (aka matrixes) according to extra indexes coming from cosphi
cosn = np.array([col[0] for col in cosphi])  # array of length 72
phin = np.array([col[1] for col in cosphi])  # array of length 72
Note: the length of the new indexes has to be the same as the items (matrixes) = 72
df1.set_index(pd.Index(cosn, name="cos_ph"), append=True, inplace=True)  # name the level via pd.Index (a bare positional name would be read as drop)
df1.set_index(pd.Index(phin, name="phi_ph"), append=True, inplace=True)
And after this one can sort
df1.sort_index(level=1, inplace=True, kind="mergesort")
and reshape
outarray=(df1.T).values.reshape(24,12,72).transpose(2, 0, 1)
Any suggestion to make the code faster / prettier is more than welcome!
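As an aside, the original "Data must be 1-dimensional" error can also be sidestepped by storing each (24, 12) matrix as a single object element: pass a list of arrays instead of the 3-D array itself. A minimal sketch with random stand-ins for cosphi and MFPAD_RCR:

```python
import numpy as np
import pandas as pd

# Stand-ins for the real arrays from the question
cosphi = np.random.rand(72, 2)
MFPAD_RCR = np.random.rand(72, 24, 12)

mi = pd.MultiIndex.from_arrays(cosphi.T, names=("costheta", "phi"))

# list(MFPAD_RCR) yields 72 separate (24, 12) arrays, so the column is
# one-dimensional with object dtype: each cell holds one whole matrix.
df1 = pd.DataFrame({'MFPAD': list(MFPAD_RCR)}, index=mi)
```

The object column keeps each matrix intact for reordering, at the cost of losing vectorized operations on the matrix contents.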
I am trying to understand how the date-related features of indexing in pandas work.
If I have this data frame:
dates = pd.date_range('6/1/2000', periods=12, freq='M')
df1 = pd.DataFrame(np.random.randn(12, 2), index=dates, columns=['A', 'B'])
I know that we can extract records from 2000 using df1['2000'] or a range of dates using df1['2000-09':'2001-03'].
But suppose instead I have a dataframe with a multi-index
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = pd.DataFrame(np.random.randn(12, 2), index=index, columns=['C', 'D'])
Is there a way to extract rows with a year 2000 as we did with a single index? It appears that df2.xs('2000-06-30') works for accessing a particular date, but df2.xs('2000') does not return anything. Is xs not the right way to go about this?
You don't need to use xs for this, but you can index using .loc.
One of the examples you tried would then look like df2.loc['2000-09':'2001-03']. The only problem is that, at the time, the 'partial string parsing' feature did not work with a MultiIndex (recent pandas versions do support it on the first level of a sorted index), so you have to provide actual datetimes:
In [17]: df2.loc[pd.Timestamp('2000-09'):pd.Timestamp('2001-04')]
Out[17]:
C D
date id
2000-09-30 K -0.441505 0.364074
2000-10-31 H 2.366365 -0.404136
2000-11-30 I 0.371168 1.218779
2000-12-31 J -0.579180 0.026119
2001-01-31 K 0.450040 1.048433
2001-02-28 H 1.090321 1.676140
2001-03-31 I -0.272268 0.213227
But note that in this case pd.Timestamp('2001-03') is interpreted as 2001-03-01 00:00:00 (an actual moment in time), so you have to adjust the start/stop values a little.
A selection for a full year (e.g. df1['2000']) would then become df2.loc[pd.Timestamp('2000'):pd.Timestamp('2001')] or df2.loc[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-12-31')].
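Another option that doesn't depend on string parsing at all is to build a boolean mask from the date level; a sketch using the question's setup (with the id list trimmed to 12 entries):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('6/1/2000', periods=12, freq='M')
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')],
                                  names=['date', 'id'])
df2 = pd.DataFrame(np.random.randn(12, 2), index=index, columns=['C', 'D'])

# Boolean mask built from the datetime level; works regardless of
# whether partial string indexing supports the MultiIndex.
rows_2000 = df2[df2.index.get_level_values('date').year == 2000]
```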
My first question here!
I'm having trouble figuring out what I'm doing wrong when appending columns to an existing pd.DataFrame object. Specifically, my original dataframe has n columns, and I want to use apply to append a further 2n columns to it. The problem is that doing this via apply() fails whenever I try to append more than n columns. This doesn't make sense to me, and I was hoping somebody could either shed some light on why I'm seeing this behaviour or suggest a better approach.
For example,
df = pd.DataFrame(np.random.rand(10, 2))

def this_works(x):
    return 5 * x

def this_fails(x):
    return np.append(5 * x, 5 * x)

df.apply(this_works, axis=1)  # Two columns of output, as expected
df.apply(this_fails, axis=1)  # Unexpected failure...
Any ideas? I know there are other ways to create the data columns, this approach just seemed very natural to me and I'm quite confused by the behaviour.
SOLVED! CT Zhu's solution below takes care of this; my error arose from not properly returning a pd.Series object in the above.
Are you trying to do a few different calculations on your df and put the resulting vectors together in one larger DataFrame, like in this example?:
In [39]:
print df
0 1
0 0.718003 0.241216
1 0.580015 0.981128
2 0.477645 0.463892
3 0.948728 0.653823
4 0.056659 0.366104
5 0.273700 0.062131
6 0.151237 0.479318
7 0.425353 0.076771
8 0.317731 0.029182
9 0.543537 0.589783
In [40]:
print df.apply(lambda x: pd.Series(np.hstack((x*5, x*6))), axis=1)
0 1 2 3
0 3.590014 1.206081 4.308017 1.447297
1 2.900074 4.905639 3.480088 5.886767
2 2.388223 2.319461 2.865867 2.783353
3 4.743640 3.269114 5.692369 3.922937
4 0.283293 1.830520 0.339951 2.196624
5 1.368502 0.310656 1.642203 0.372787
6 0.756187 2.396592 0.907424 2.875910
7 2.126764 0.383853 2.552117 0.460624
8 1.588656 0.145909 1.906387 0.175091
9 2.717685 2.948917 3.261222 3.538701
FYI in this trivial case you can do 5 * df !
I think the issue here is that np.append flattens the Series:
In [11]: np.append(df[0], df[0])
Out[11]:
array([ 0.33145275, 0.14964056, 0.86268119, 0.17311983, 0.29618537,
0.48831228, 0.64937305, 0.03353709, 0.42883925, 0.99592229,
0.33145275, 0.14964056, 0.86268119, 0.17311983, 0.29618537,
0.48831228, 0.64937305, 0.03353709, 0.42883925, 0.99592229])
what you want is for it to create four columns (isn't it?). The axis=1 means that you are applying the function row-wise (i.e. x is a row, which is a Series)...
In general you want apply to return either:
a single value, or
a Series (with unique index).
That said, I kinda thought the following might work (to get four columns):
In [21]: df.apply((lambda x: pd.concat([x[0] * 5, x[0] * 5], axis=1)), axis=1)
TypeError: ('cannot concatenate a non-NDFrame object', u'occurred at index 0')
In [22]: df.apply(lambda x: np.array([1, 2, 3, 4]), axis=1)
ValueError: Shape of passed values is (10,), indices imply (10, 2)
In [23]: df.apply(lambda x: pd.Series([1, 2, 3, 4]), axis=1) # works
Maybe I expected the first to raise about the non-unique index... but I was surprised that the second failed.
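Putting the two observations together: the original this_fails only needs its flattened result wrapped in a pd.Series (with a unique index) to produce four columns; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2))

def this_now_works(x):
    # np.append still flattens, but returning a Series (with a unique
    # index) tells apply to expand the result into separate columns.
    return pd.Series(np.append(5 * x, 5 * x))

out = df.apply(this_now_works, axis=1)
```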