Ensuring lexicographical sort in pandas MultiIndex - pandas

I've got some data with a MultiIndex (some timing stats, with index levels for "device", "build configuration", "tested function", etc.). I want to slice on some of those index levels.
It seems like "slicers" passed to the .loc function are probably the way to go. However, the docs contain this warning:
Warning: You will need to make sure that the selection axes are fully lexsorted!
Later on in the docs there's a section on The Need for Sortedness with MultiIndex which says
you are responsible for ensuring that things are properly sorted
but thankfully,
The MultiIndex object has code to explicity check the sort depth. Thus, if you try to index at a depth at which the index is not sorted, it will raise an exception.
Sounds fine.
However, the remaining question is: how does one get their data properly sorted so that the indexing works? The docs talk about an important new method sortlevel(), but then contain the following caveat:
There is an important new method sortlevel to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!
In my case, sortlevel() did the right thing, but what if my "original ordering of the associated factor" was not sorted? Is there a simple one-liner that I can use on any MultiIndex-ed DataFrame to ensure it's ready for slicing and fully lexsorted?
Edit: My exploration suggests most ways of creating a MultiIndex automatically lexsort the unique labels when building the index. Example:
In [1]:
import pandas as pd
df = pd.DataFrame({'col1': ['b','d','b','a'], 'col2': [3,1,1,2],
'data':['one','two','three','four']})
df
Out[1]:
col1 col2 data
0 b 3 one
1 d 1 two
2 b 1 three
3 a 2 four
In [2]:
df2 = df.set_index(['col1','col2'])
df2
Out[2]:
data
col1 col2
b 3 one
d 1 two
b 1 three
a 2 four
In [3]: df2.index
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'd'], [1, 2, 3]],
labels=[[1, 2, 1, 0], [2, 0, 0, 1]],
names=[u'col1', u'col2'])
Note how the unique items in the levels array are lexsorted, even though the DataFrame object itself is not. Then, as expected:
In [4]: df2.index.is_lexsorted()
Out[4]: False
In [5]:
sorted = df2.sortlevel()
sorted
Out[5]:
data
col1 col2
a 2 four
b 1 three
3 one
d 1 two
In [6]: sorted.index.is_lexsorted()
Out[6]: True
However, if the levels are explicitly ordered so they are not sorted, things get weird:
In [7]:
df3 = df2
df3.index.set_levels(['b','d','a'], level='col1', inplace=True)
df3.index.set_labels([0,1,0,2], level='col1', inplace=True)
df3
Out[7]:
data
col1 col2
b 3 one
d 1 two
b 1 three
a 2 four
In [8]:
sorted2 = df3.sortlevel()
sorted2
Out[8]:
data
col1 col2
b 1 three
3 one
d 1 two
a 2 four
In [9]: sorted2.index.is_lexsorted()
Out[9]: True
In [10]: sorted2.index
Out[10]:
MultiIndex(levels=[[u'b', u'd', u'a'], [1, 2, 3]],
labels=[[0, 0, 1, 2], [0, 2, 0, 1]],
names=[u'col1', u'col2'])
So sorted2 is reporting that it is lexsorted, when in fact it is not. This feels a little like what the warning in the docs is getting at, but it's still not clear how to fix it or whether it's really an issue at all.

As far as sorting goes, as @EdChum pointed out, the docs here seem to indicate it is lexicographically sorted.
For checking whether your index (or columns) are sorted, they have a method is_lexsorted() and an attribute lexsort_depth (which for some reason you can't really find in the documentation itself).
Example:
Create a Series with random order
In [1]:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', '1', '3', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
import random; random.shuffle(tuples)
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s
Out[1]:
baz 3 -0.191653
qux two -1.410311
bar one -0.336475
qux one -1.192908
foo two 0.486401
baz 1 0.888314
foo one -1.504816
bar two 0.917460
dtype: float64
Check is_lexsorted and lexsort_depth:
In [2]: s.index.is_lexsorted()
Out[2]: False
In [3]: s.index.lexsort_depth
Out[3]: 0
Sort the index, and recheck the values:
In [4]: s = s.sortlevel(0, sort_remaining=True)
s
Out[4]:
bar one -0.336475
two 0.917460
baz 1 0.888314
3 -0.191653
foo one -1.504816
two 0.486401
qux one -1.192908
two -1.410311
dtype: float64
In [5]: s.index.is_lexsorted()
Out[5]: True
In [6]: s.index.lexsort_depth
Out[6]: 2
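Note that in newer pandas versions sortlevel has been replaced by sort_index, and is_lexsorted on the index has been deprecated; checking that the index is monotonically increasing serves the same purpose. A minimal sketch, assuming a reasonably recent pandas:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', '1', '3', 'one', 'two', 'one', 'two']]
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_arrays(arrays))
s = s.sort_index()                  # sorts all index levels, like sortlevel did
s.index.is_monotonic_increasing     # True once the index is fully sorted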

Related

Subset multiindex dataframe keeps original index value

I found that subsetting a multi-index dataframe keeps the original index values around.
Here is some sample code to test with.
import pandas as pd
level_one = ["foo","bar","baz"]
level_two = ["a","b","c"]
df_index = pd.MultiIndex.from_product((level_one,level_two))
df = pd.DataFrame(range(9), index = df_index, columns=["number"])
df
The above code will show a dataframe like this.
number
foo a 0
b 1
c 2
bar a 3
b 4
c 5
baz a 6
b 7
c 8
The code below subsets the dataframe so that it contains only 'a' and 'b' in index level 1.
df_subset = df.query("(number%3) <=1")
df_subset
number
foo a 0
b 1
bar a 3
b 4
baz a 6
b 7
The dataframe itself is the expected result,
BUT its index levels still contain the original values, which is NOT expected.
# The following code still returns index 'c'
df_subset.index.levels[1]
#Result
Index(['a', 'b', 'c'], dtype='object')
My first question is: how can I remove the 'original' index values after subsetting?
The second question is: is this expected behavior for pandas?
Thanks
Yes, this is expected; it allows you to access the missing levels after filtering. You can remove the unused levels with remove_unused_levels:
df_subset.index = df_subset.index.remove_unused_levels()
print(df_subset.index.levels[1])
Output:
Index(['a', 'b'], dtype='object')
It is normal that the "original" index remains after subsetting, because that is how pandas behaves. According to the documentation, "The MultiIndex keeps all the defined levels of an index, even if they are not actually used. This is done to avoid a recomputation of the levels in order to make slicing highly performant."
You can see that the index levels are stored as a FrozenList using:
[I]: df_subset.index.levels
[O]: FrozenList([['bar', 'baz', 'foo'], ['a', 'b', 'c']])
If you want to see only the used levels, you can use the get_level_values() or the unique() methods.
Here are some examples:
[I]: df_subset.index.get_level_values(level=1)
[O]: Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object')
[I]: df_subset.index.unique(level=1)
[O]: Index(['a', 'b'], dtype='object')
Hope it can help you!
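Putting both pieces together, a minimal end-to-end sketch (rebuilding the df from the question) might look like this:
import pandas as pd
level_one = ["foo", "bar", "baz"]
level_two = ["a", "b", "c"]
df = pd.DataFrame(range(9),
                  index=pd.MultiIndex.from_product((level_one, level_two)),
                  columns=["number"])
# subset, then drop the level values that are no longer used
df_subset = df.query("(number % 3) <= 1")
df_subset.index = df_subset.index.remove_unused_levels()
print(df_subset.index.levels[1])   # Index(['a', 'b'], dtype='object')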

How to choose categorical variables?

I'm new to modeling.
I have a lot of features and I have to separate them into discrete and continuous ones. The only tips I find on the internet rely on the dtype:
import numpy as np
Categorical = np.where(df.dtypes == np.object)[0]
Categorical
That only gives me the features that are categorical by dtype, whereas I want the features that are genuinely categorical. Is there a quick way to do this?
Whether a variable should be considered "discrete" or "continuous" is dependent on the variable and the use case.
To count the number of distinct values a variable takes in your dataframe - you can use the pd.Series.nunique or pd.Series.value_counts functions and decide to treat a variable as discrete or continuous based on the output.
pandas does come with a dedicated dtype called category which might be helpful - https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 3], 'B': ['a', 'a', 'b', 'b']})
In [3]: df
Out[3]:
A B
0 1 a
1 1 a
2 2 b
3 3 b
In [4]: df.A.value_counts()
Out[4]:
1 2
3 1
2 1
Name: A, dtype: int64
In [5]: df['B'].nunique()
Out[5]: 2
In [6]: df['B'].unique()
Out[6]: array(['a', 'b'], dtype=object)
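If you do decide that a column should be treated as categorical, converting it to the dedicated category dtype mentioned above is a one-liner; a minimal sketch on the same df:
In [7]: df['B'] = df['B'].astype('category')
In [8]: df.dtypes
Out[8]:
A       int64
B    category
dtype: object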
And if you don't want to fiddle with a loop, just use this simple line of code:
categorical_fun = np.where((df2.dtypes == np.object)|(df2.nunique() <= 10))[0]
categorical_fun
Since categorical data are either of 'object' dtype or have a limited number of unique values, use a double condition and let the code show you those columns.
Suppose you think that categorical variables have up to 10 unique values, then:
a, b = df2.shape  # <- how many columns we have
b
print('ONLY DISCRETE FEATURES')
print('----------------------')
for i in range(1, b):          # starts at column 1, skipping the first column
    col = df2.columns[i]
    f = df2[col].dtypes
    h = df2[col].nunique()
    if f == np.object or h <= 10:
        print(col, "---", f, "---", h)
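For what it's worth, the same kind of listing can be produced without an explicit loop by combining select_dtypes with nunique; a sketch, assuming df2 is your dataframe and the same threshold of 10 unique values:
# columns that are object-typed
object_cols = df2.select_dtypes(include='object').columns
# columns with at most 10 distinct values, regardless of dtype
low_card_cols = df2.columns[df2.nunique() <= 10]
candidate_categoricals = object_cols.union(low_card_cols)
print(candidate_categoricals)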

DataFrame Index Created From Columns

I have a dataframe that I am populating with data from Bloomberg using TIA. When I look at df.index I see that the data that I intended to be columns is presented to me as what appears to be a multi-index. The output for df.columns is like this:
Index([u'column1', u'column2'])
I have tried various iterations of reset_index but have not been able to remedy this situation.
1) What about the TIA manager causes the dataframe columns to be read in as an index?
2) How can I properly identify these columns as columns instead of a multi-index?
The ultimate problem that I'm trying to fix is that when I try to add this column to df2, the values for that column in df2 come out as NaT. Like below:
df2['column3'] = df1['column1']
Produces:
df2
column1 column2 column3
1135 32 NaT
1351 43 NaT
35 13 NaT
135 13 NaT
From the comments it appears df1 and df2 have completely different indexes
In [396]: df1.index
Out[400]: Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')
In [401]: df2.index
Out[401]: Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
but we wish to assign values from df1 to df2, preserving order.
Usually, Pandas operations try to automatically align values based on index (and/or column) labels.
In this case, we wish to ignore the labels. To do that, use
df2['columns3'] = df1['column1'].values
df1['column1'].values is a NumPy array. Since it doesn't have an Index, Pandas simply assigns the values in the array into df2['columns3'] in order.
The assignment would behave the same way if the right-hand side were a list or a tuple.
Note that this also relies on len(df1) equaling len(df2).
For example,
import pandas as pd
df1 = pd.DataFrame(
{"column1": [1135, 1351, 35, 135, 0], "column2": [32, 43, 13, 13, 0]},
index=[u"Jan", u"Feb", u"Mar", u"Apr", u"May"],
)
df2 = pd.DataFrame(
{"column1": range(len(df1))}, index=[u"One", u"Two", u"Three", u"Four", u"Five"]
)
df2["columns3"] = df1["column1"].values
print(df2)
yields
column1 columns3
One 0 1135
Two 1 1351
Three 2 35
Four 3 135
Five 4 0
Alternatively, you could make the two Indexes the same, and then df2["columns3"] = df1["column1"] would produce the same result (but now because the index labels are being aligned):
df1.index = df2.index
df2["columns3"] = df1["column1"]
Another way to make the Indexes match is to reset the index on both DataFrames:
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
reset_index moves the old index into a column named index by default (if index.name was None). Integers (starting with 0) are assigned as the new index labels:
In [402]: df1.reset_index()
Out[410]:
index column1 column2
0 Jan 1135 32
1 Feb 1351 43
2 Mar 35 13
3 Apr 135 13
4 May 0 0
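If you do not want the old index kept as a column at all, reset_index also accepts drop=True; a small sketch (starting again from the original df1 and df2):
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df2["columns3"] = df1["column1"]   # now aligns on the shared 0..4 integer index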

How to use pandas rename() on multi-index columns?

How can I simply rename a MultiIndex column of a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1 0
2 3
Name: (B, min), dtype: int64
However, when using the same tuple with the rename() function, it does not seem to be accepted:
df.rename(columns={("B","min"):"renamed"},inplace=True)
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
Any idea how rename() should be called to deal with Multi-Index columns?
PS: I am aware of the option of flattening the column names beforehand, but that prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
renamed=('B', 'min'),
B_max=('B', 'max'),
C_mean=('C', 'mean'),
)
print(df)
renamed B_max C_mean
A
1 0 2 1.0
2 3 4 3.5
For more info, you can see the pandas docs and some related other questions.
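If you want to keep the MultiIndex and only change a single label, rename() also accepts a level argument in reasonably recent pandas versions; the mapping is then applied to the labels within that level rather than to full tuples. A sketch, starting again from the grouped df in the question (here only column B has a 'min' entry, so the rename is unambiguous):
df = df.rename(columns={"min": "renamed"}, level=1)
# the column is now ('B', 'renamed') instead of ('B', 'min')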

pandas MultiIndex resulting index structure on using xs vs loc between 0.15.2 & 0.18.0

The index structure of the result of slicing a subset of data using .xs and .loc on a DataFrame with a MultiIndex seems to have changed between v0.15.2 and v0.18.0.
Please refer to the code snippet and the output obtained in an IPython notebook using the two versions of pandas.
import pandas as pd
print 'pandas-version: ', pd.__version__
import numpy as np
l1 = ['A', 'B', 'C', 'D']
l2 = sorted(['foo','bar','baz'])
nrows = len(l1) * len(l2)
s = pd.DataFrame(np.random.random( nrows * 2).reshape(nrows, 2),
index=pd.MultiIndex.from_product([l1, l2],
names=['one','two']))
# print s.index
l_all = slice(None)
# get all records matching 'foo' in level=1 using .loc
sub_loc = s.loc[(l_all, 'foo'),:]
print '.loc[(slice(None), "foo")] result:\n', sub_loc,
print '\n.loc result-index:\n', sub_loc.index
# get all records matching 'foo' in level=1 using .xs()
sub_xs = s.xs('foo', level=1)
print '\n.xs(\'foo\', level=1) result:\n', sub_xs,
print '\n .xs result index:\n', sub_xs.index
0.15.2 output
#######################
pandas-version: 0.15.2
.loc[(slice(None), "foo")] result:
0 1
one two
A foo 0.464551 0.372409
B foo 0.782062 0.268917
C foo 0.779423 0.787554
D foo 0.481901 0.232887
.loc result-index:
one two
A foo
B foo
C foo
D foo
.xs('foo', level=1) result:
0 1
one
A 0.464551 0.372409
B 0.782062 0.268917
C 0.779423 0.787554
D 0.481901 0.232887
.xs result index:
Index([u'A', u'B', u'C', u'D'], dtype='object')
0.18.0 output
##########################
pandas-version: 0.18.0
.loc[(slice(None), "foo")] result:
0 1
one two
A foo 0.723213 0.532838
B foo 0.736941 0.401252
C foo 0.217131 0.044254
D foo 0.712824 0.411026
.loc result-index:
MultiIndex(levels=[[u'A', u'B', u'C', u'D'], [u'bar', u'baz', u'foo']],
labels=[[0, 1, 2, 3], [2, 2, 2, 2]],
names=[u'one', u'two'])
.xs('foo', level=1) result:
0 1
one
A 0.723213 0.532838
B 0.736941 0.401252
C 0.217131 0.044254
D 0.712824 0.411026
.xs result index:
Index([u'A', u'B', u'C', u'D'], dtype='object', name=u'one')
Calling sub_loc.index seems to return the same MultiIndex structure as the original DataFrame object (inconsistent with v0.15.2), but sub_xs.index seems to be consistent with the earlier version.
Note: I'm using [Python 2.7.11 |Anaconda 1.8.0 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]]
Sorry, forget my other answer, the bug I filed is totally unrelated.
The right answer is: the "index structure" has not changed between the two versions. The only thing that changed is the way the index is represented when you print it.
In both cases you have a MultiIndex with exactly the same levels and values. You are presumably puzzled by the fact that in 0.18.0 it seems to contain "baz" and "bar". But a MultiIndex can have level values it does not actually use, because, as in this example, it contained them when it was created, and unused level values are not eliminated when the rows using them are dropped. sub_loc.index in 0.15.2 also has "baz" and "bar" inside its levels, except that the way it is represented when you print it doesn't reveal this.
And by the way, whether a MultiIndex which has been filtered still contains such "obsolete" labels or not is an implementation detail which you typically should not care about. In other words,
MultiIndex(levels=[[u'A', u'B', u'C', u'D'], [u'bar', u'baz', u'foo']],
labels=[[0, 1, 2, 3], [2, 2, 2, 2]],
names=[u'one', u'two'])
and
MultiIndex(levels=[[u'A', u'B', u'C', u'D'], [u'foo']],
labels=[[0, 1, 2, 3], [0, 0, 0, 0]],
names=[u'one', u'two'])
are for practical purposes exactly the same index, in the sense of "having the same values in the same positions", and hence behaving identically when used for assignments between Series, DataFrames...
(As you by now have probably clear, it is the labels component of the MultiIndex which determines which values of the levels are actually used, and in which positions.)
I think it is indeed a bug, which shows up also in simpler settings:
https://github.com/pydata/pandas/issues/12827
EDIT: well, probably not, since the example I made in the bug behaves the same in 0.14.1.
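As an aside, if the leftover level values bother you (for display, for instance), later pandas versions (0.20+) provide MultiIndex.remove_unused_levels(); a sketch on the sliced result:
sub_loc.index = sub_loc.index.remove_unused_levels()
print(sub_loc.index.levels)   # FrozenList([['A', 'B', 'C', 'D'], ['foo']])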