pandas MultiIndex: resulting index structure when using xs vs loc between 0.15.2 & 0.18.0 - pandas

The index structure of the result of slicing a subset of data with .xs and .loc on a DataFrame with a MultiIndex seems to have changed between v0.15.2 and v0.18.0.
Please refer to the code snippet and the output obtained in an IPython notebook under the two pandas versions.
import pandas as pd
print 'pandas-version: ', pd.__version__
import numpy as np
l1 = ['A', 'B', 'C', 'D']
l2 = sorted(['foo','bar','baz'])
nrows = len(l1) * len(l2)
s = pd.DataFrame(np.random.random(nrows * 2).reshape(nrows, 2),
                 index=pd.MultiIndex.from_product([l1, l2],
                                                  names=['one', 'two']))
# print s.index
l_all = slice(None)
# get all records matching 'foo' in level=1 using .loc
sub_loc = s.loc[(l_all, 'foo'),:]
print '.loc[(slice(None), "foo")] result:\n', sub_loc,
print '\n.loc result-index:\n', sub_loc.index
# get all records matching 'foo' in level=1 using .xs()
sub_xs = s.xs('foo', level=1)
print '\n.xs(\'foo\', level=1) result:\n', sub_xs,
print '\n .xs result index:\n', sub_xs.index
0.15.2 output
#######################
pandas-version: 0.15.2
.loc[(slice(None), "foo")] result:
0 1
one two
A foo 0.464551 0.372409
B foo 0.782062 0.268917
C foo 0.779423 0.787554
D foo 0.481901 0.232887
.loc result-index:
one two
A foo
B foo
C foo
D foo
.xs('foo', level=1) result:
0 1
one
A 0.464551 0.372409
B 0.782062 0.268917
C 0.779423 0.787554
D 0.481901 0.232887
.xs result index:
Index([u'A', u'B', u'C', u'D'], dtype='object')
0.18.0 output
##########################
pandas-version: 0.18.0
.loc[(slice(None), "foo")] result:
0 1
one two
A foo 0.723213 0.532838
B foo 0.736941 0.401252
C foo 0.217131 0.044254
D foo 0.712824 0.411026
.loc result-index:
MultiIndex(levels=[[u'A', u'B', u'C', u'D'], [u'bar', u'baz', u'foo']],
labels=[[0, 1, 2, 3], [2, 2, 2, 2]],
names=[u'one', u'two'])
.xs('foo', level=1) result:
0 1
one
A 0.723213 0.532838
B 0.736941 0.401252
C 0.217131 0.044254
D 0.712824 0.411026
.xs result index:
Index([u'A', u'B', u'C', u'D'], dtype='object', name=u'one')
Calling sub_loc.index seems to return the same MultiIndex structure as the original DataFrame object (inconsistent with v0.15.2), while sub_xs.index seems to be consistent with the earlier version.
Note: I'm using [Python 2.7.11 |Anaconda 1.8.0 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]]

Sorry, forget my other answer, the bug I filed is totally unrelated.
The right answer is: the "index structure" has not changed between the two versions. The only thing that changed is the way the index is represented when you print it.
In both cases you have a MultiIndex, with exactly the same levels and values. You are presumably puzzled by the fact that in 0.18.0 it seems to contain "baz" and "bar". But a MultiIndex can have level values it does not actually use, because, as in this example, it contained them when it was created, and unused level values are not eliminated when the rows using them are dropped. sub_loc.index in 0.15.2 also has "baz" and "bar" inside its levels, except that the way it is represented when you print it doesn't reveal this.
And by the way, whether a MultiIndex which has been filtered still contains such "obsolete" labels or not is an implementation detail which you typically should not care about. In other words,
MultiIndex(levels=[[u'A', u'B', u'C', u'D'], [u'bar', u'baz', u'foo']],
labels=[[0, 1, 2, 3], [2, 2, 2, 2]],
names=[u'one', u'two'])
and
MultiIndex(levels=[[u'A', u'B', u'C', u'D'], [u'foo']],
labels=[[0, 1, 2, 3], [0, 0, 0, 0]],
names=[u'one', u'two'])
are for practical purposes exactly the same index, in the sense of "having the same values in the same positions", and hence behaving identically when used for assignments between Series, DataFrames...
(As you have probably gathered by now, it is the labels component of the MultiIndex which determines which values of the levels are actually used, and in which positions.)
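To see this concretely, here is a small sketch (assuming a reasonably recent pandas, in which the labels argument has been renamed to codes and MultiIndex.remove_unused_levels() exists):
import pandas as pd

# the two indices from above; only the unused level values differ
idx1 = pd.MultiIndex(levels=[['A', 'B', 'C', 'D'], ['bar', 'baz', 'foo']],
                     codes=[[0, 1, 2, 3], [2, 2, 2, 2]],
                     names=['one', 'two'])
idx2 = pd.MultiIndex(levels=[['A', 'B', 'C', 'D'], ['foo']],
                     codes=[[0, 1, 2, 3], [0, 0, 0, 0]],
                     names=['one', 'two'])

print(idx1.equals(idx2))            # True: same values in the same positions
print(idx1.remove_unused_levels())  # drops 'bar' and 'baz' from the levels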

I think it is indeed a bug, which shows up also in simpler settings:
https://github.com/pydata/pandas/issues/12827
EDIT: well, probably not, since the example I made in the bug behaves the same in 0.14.1.

Related

Pandas - find rows sharing two out of the three common values, order-independent, and collect value pairs

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns, and hence the order, in which they appear. I would like to then collect those common pairs.
Please note:
a pair of values can appear in at most two rows
a value can appear only once in a row
I would like to know the most efficient/elegant way in numpy or pandas to solve this problem.
For example, taking as input the dataframe
d = {'col1': [1, 2, 5, 9], 'col2': [2, 7, 1, 2], 'col3': [3, 3, 2, 7]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0     1     2     3
1     2     7     3
2     5     1     2
3     9     2     7
I expect as a result an array, a list, something like
1 2
2 3
2 7
as the values (1,2), (2,3) and (2,7) are present in two rows (first and third, first and second, and second and fourth respectively).
I cannot find a concise solution.
At the moment I have sketched a numpy solution such as
def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    for i in range(0, rows):
        for j in range(i+1, rows):
            aux = np.intersect1d(x[i, :], x[j, :])
            if aux.size > 1:
                res.append(aux)
    return res
which outputs
func(df.values)
Out: [array([2, 3]), array([1, 2]), array([2, 7])]
It looks rather cumbersome; how could I get it done with one of those cool numpy/pandas one-liners?
I would suggest using Python's built-in set operations to do most of the heavy lifting, and just apply them with pandas:
import itertools
import pandas as pd
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
out = set(pairs[pairs.duplicated()])
Output:
{(2, 3), (1, 2), (2, 7)}
Optionally to get it in list[np.ndarray] format:
out = list(map(np.array, out))
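A brief note on the design: pairs.duplicated() marks every occurrence of a pair after its first, so wrapping the result in set() leaves exactly the pairs that occur in more than one row, which is what we want given the stated guarantee that a pair can appear in at most two rows.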
Similar approach to that of @Chrysophylaxs, but in pure Python:
from itertools import combinations
from collections import Counter
c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))
out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
df = df.assign(col4=df.index)

def function1(ss: pd.Series):
    # values occurring at least twice in the merged row are shared by both original rows
    ss1 = ss.value_counts().loc[lambda ss: ss >= 2]
    return ss1.index.tolist() if ss1.size >= 2 else None

df.merge(df, how='cross', suffixes=('', '_2')).query("col4 != col4_2")\
    .filter(regex=r'col[^4]', axis=1)\
    .apply(function1, axis=1).dropna().drop_duplicates()
out
1    [2, 3]
2    [1, 2]
7    [2, 7]

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (this took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
I think if you create a DataFrame from the list column, you can compare with DataFrame.eq and test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
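A quick check with the question's toy frame (a sketch; note the axis=0, so the value Series aligns with df1's rows rather than its columns):
import pandas as pd

df = pd.DataFrame({"value": [1, 2], "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8]]})
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
print(df1.eq(df["value"], axis=0).any(axis=1))  # row 0: True, row 1: False (2 is not in [5, 6, 7, 8])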
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
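A minimal harness to reproduce such a timing comparison (a sketch with synthetic data; absolute timings are machine-dependent):
import time
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({"value": np.random.randint(1, 5, n),
                   "accepted_values": [[1, 2, 3, 4]] * n})

start = time.perf_counter()
values = df["value"].values
accepted_values = df["accepted_values"].values
result = all(s in e for s, e in np.column_stack([values, accepted_values]))
print(result, time.perf_counter() - start)  # True, followed by the elapsed seconds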

Get column-index from column-name in pandas? [duplicate]

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()
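For completeness, a minimal sketch of that equivalent (note that modern pandas preserves dictionary insertion order, so 'pear' sits at position 0 here rather than at position 2 as in the older output above):
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
print((df.columns == 'pear').nonzero())  # (array([0]),) - a tuple of arrays, like R's which()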
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels), use get_indexer_for. It takes the same arguments as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. for float values, taking the nearest value within a tolerance. If two indices have the same distance to the specified label, or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
When you might be looking to find multiple column matches, a vectorized solution using the searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location]  # (thanks to @roobie-nuby for pointing that out in the comments)
To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
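To illustrate the "weird properties" mentioned above, here is a small sketch of get_loc against different index types (this is the documented behavior of Index.get_loc):
import pandas as pd

print(pd.Index(['a', 'b', 'c']).get_loc('b'))       # 1 - unique labels give an integer
print(pd.Index(['a', 'b', 'b', 'c']).get_loc('b'))  # slice(1, 3, None) - monotonic duplicates give a slice
print(pd.Index(['b', 'a', 'b']).get_loc('b'))       # [ True False  True] - otherwise a boolean mask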
how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, the following variant (of the above) works:
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None
if ix is None:
    # do something
    ...
import random
import pandas as pd

def char_range(c1, c2):  # see question 7001144
    for c in range(ord(c1), ord(c2) + 1):
        yield chr(c)

df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3)  # random data

rearranged = random.sample(range(26), 26)  # random column order
df = df.iloc[:, rearranged]
print(df.iloc[:, :15])  # view the first 15 columns

for col in df.columns:  # print each column with its index
    print(str(df.columns.get_loc(col)) + '\t' + col)

Ensuring lexicographical sort in pandas MultiIndex

I've got some data with a MultiIndex (some timing stats, with index levels for "device", "build configuration", "tested function", etc). I want to slice on some of those index levels.
It seems like "slicers" to the .loc function are probably the way to go. However the docs contain this warning:
Warning: You will need to make sure that the selection axes are fully lexsorted!
Later on in the docs there's a section on The Need for Sortedness with MultiIndex which says
you are responsible for ensuring that things are properly sorted
but thankfully,
The MultiIndex object has code to explicitly check the sort depth. Thus, if you try to index at a depth at which the index is not sorted, it will raise an exception.
Sounds fine.
However, the remaining question is: how does one get their data properly sorted for the indexing to work? The docs talk about an important new method sortlevel(), but then contain the following caveat:
There is an important new method sortlevel to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!
In my case, sortlevel() did the right thing, but what if my "original ordering of the associated factor" was not sorted? Is there a simple one-liner that I can use on any MultiIndex-ed DataFrame to ensure it's ready for slicing and fully lexsorted?
Edit: My exploration suggests that most ways of creating a MultiIndex automatically lexsort the unique labels when building the index. Example:
In [1]:
import pandas as pd
df = pd.DataFrame({'col1': ['b', 'd', 'b', 'a'], 'col2': [3, 1, 1, 2],
                   'data': ['one', 'two', 'three', 'four']})
df
Out[1]:
col1 col2 data
0 b 3 one
1 d 1 two
2 b 1 three
3 a 2 four
In [2]:
df2 = df.set_index(['col1','col2'])
df2
Out[2]:
data
col1 col2
b 3 one
d 1 two
b 1 three
a 2 four
In [3]: df2.index
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'd'], [1, 2, 3]],
labels=[[1, 2, 1, 0], [2, 0, 0, 1]],
names=[u'col1', u'col2'])
Note how the unique items in the levels array are lexsorted, even though the DataFrame itself is not. Then, as expected:
In [4]: df2.index.is_lexsorted()
Out[4]: False
In [5]:
sorted = df2.sortlevel()
sorted
Out[5]:
data
col1 col2
a 2 four
b 1 three
3 one
d 1 two
In [6]: sorted.index.is_lexsorted()
Out[6]: True
However, if the levels are explicitly ordered so they are not sorted, things get weird:
In [7]:
df3 = df2
df3.index.set_levels(['b','d','a'], level='col1', inplace=True)
df3.index.set_labels([0,1,0,2], level='col1', inplace=True)
df3
Out[7]:
data
col1 col2
b 3 one
d 1 two
b 1 three
a 2 four
In [8]:
sorted2 = df3.sortlevel()
sorted2
Out[8]:
data
col1 col2
b 1 three
3 one
d 1 two
a 2 four
In [9]: sorted2.index.is_lexsorted()
Out[9]: True
In [10]: sorted2.index
Out[10]:
MultiIndex(levels=[[u'b', u'd', u'a'], [1, 2, 3]],
labels=[[0, 0, 1, 2], [0, 2, 0, 1]],
names=[u'col1', u'col2'])
So sorted2 is reporting that it is lexsorted, when in fact it is not. This feels a little like what the warning in the docs is getting at, but it's still not clear how to fix it or whether it's really an issue at all.
As far as sorting goes, as @EdChum pointed out, the docs here seem to indicate it is lexicographically sorted.
For checking whether your index (or columns) is sorted, there is a method is_lexsorted() and an attribute lexsort_depth (which for some reason you can't really find in the documentation itself).
Example:
Create a Series with random order
In [1]:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', '1', '3', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
import random; random.shuffle(tuples)
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s
Out[1]:
baz 3 -0.191653
qux two -1.410311
bar one -0.336475
qux one -1.192908
foo two 0.486401
baz 1 0.888314
foo one -1.504816
bar two 0.917460
dtype: float64
Check is_lexsorted and lexsort_depth:
In [2]: s.index.is_lexsorted()
Out[2]: False
In [3]: s.index.lexsort_depth
Out[3]: 0
Sort the index, and recheck the values:
In [4]: s = s.sortlevel(0, sort_remaining=True)
s
Out[4]:
bar one -0.336475
two 0.917460
baz 1 0.888314
3 -0.191653
foo one -1.504816
two 0.486401
qux one -1.192908
two -1.410311
dtype: float64
In [5]: s.index.is_lexsorted()
Out[5]: True
In [6]: s.index.lexsort_depth
Out[6]: 2
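A side note for readers on newer pandas versions: sortlevel() was later removed, and is_lexsorted()/lexsort_depth were deprecated; the equivalent checks would look roughly like this (a sketch, assuming pandas >= 1.0):
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', '1', '3', 'one', 'two', 'one', 'two']]
s = pd.Series(np.random.randn(8),
              index=pd.MultiIndex.from_tuples(list(zip(*arrays))))

s = s.sort_index()                      # replaces sortlevel(); fully lexsorts all levels
print(s.index.is_monotonic_increasing)  # True - safe for .loc slicers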

Seaborn groupby pandas Series

I want to visualize my data as box plots grouped by another variable, shown here in my terrible drawing:
So what I do is use a pandas Series variable to tell pandas that I have grouped variables; this is what I do:
import pandas as pd
import seaborn as sns
# example data for reproducibility
a = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ])
# converting second column to Series
a.ix[:, 1] = pd.Series(a.ix[:, 1])
# plotting with seaborn
sns.boxplot(a, groupby=a.ix[:, 1])
And this is what I get:
However, what I expected was two boxplots, each describing only the first column, grouped by the corresponding values in the second column (the one converted to a Series); the plot above instead shows each column separately, which is not what I want.
A column in a DataFrame is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should pass only that column to Seaborn.
So:
# example data for reproducibility
df = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ], columns=['a', 'b'])
# plotting with seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit; giving the columns labels makes it a bit clearer, in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your DataFrame looks like this:
a b grouper
0 2 5 1
1 4 9 2
2 5 3 1
3 10 6 2
4 9 7 2
5 3 11 1
And you want boxplots for columns a and b while grouped by the column grouper. You should flatten the columns and change the groupby column to contain values like a1, a2, b1 etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame; the flattening of the hierarchy after pivoting, in particular, is hard to read, and I don't like it.
This is a new answer for an old question, because seaborn and pandas have changed across version updates; due to these changes, Rutger's answer no longer works.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the changelog:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1], [4, 2], [5, 1],
                   [10, 2], [9, 2], [3, 1]
                   ], columns=['a', 'b'])
# plotting with seaborn, with x and y as parameters
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                   ], columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join the two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data: pd.DataFrame, col: str) -> pd.DataFrame:
    '''This function takes a DataFrame, groups by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df

df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                   ], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take a groupby argument.
You will probably see
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
The best approach is to group the data first and pass the grouped DataFrame to boxplot via its data parameter.
import seaborn as sns
groupedDataFrame = nameDataFrame.groupby(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=groupedDataFrame)
Here column B contains numeric values and the grouping is done on the basis of A. All grouped values of their respective column are summed, and the boxplot is drawn. Hope this helps.