pandas: Assign values to a slice of a MultiIndex by range of secondary index

I have a problem with assigning a Series-like object to a slice of a pandas DataFrame.
Maybe I'm not using the DataFrame the way it is intended, so some enlightenment would be greatly appreciated.
I've already read the following articles:
pandas: slice a MultiIndex by range of secondary index
Returning a view versus a copy
As far as I understand, selecting the slice with a single .loc call ensures that I am not working on a copy of the data. The original dataframe is indeed altered, but instead of the expected data I get NaN values.
See the appended code snippet.
Do I have to iterate over the desired section of the dataframe for each single value I want to change and use the .set_value(row_idx,col_idx,val) method?
kind regards and thanks in advance
Markus
In [1]: import pandas as pd
In [2]: mindex = pd.MultiIndex.from_product([['one','two'],['first','second']])
In [3]: dfmi = pd.DataFrame([list('abcd'),list('efgh'),list('ijkl'),list('mnop')],
   ...:                     index = mindex, columns=(['X','Y','Z','Q']))
In [4]: print(dfmi)
            X  Y  Z  Q
one first   a  b  c  d
    second  e  f  g  h
two first   i  j  k  l
    second  m  n  o  p
In [5]: dfmi.loc[('two',slice('first','second')),'X']
Out[5]:
two  first     i
     second    m
Name: X, dtype: object
In [6]: substitute = pd.Series(data=["ab","cd"], index= mindex.levels[1])
   ...: print(substitute)
first     ab
second    cd
dtype: object
In [7]: dfmi.loc[('two',slice('first','second')),'X'] = substitute
In [8]: print(dfmi)
              X  Y  Z  Q
one first     a  b  c  d
    second    e  f  g  h
two first   NaN  j  k  l
    second  NaN  n  o  p

What's happening is that substitute has an index, which determines where its values belong, and dfmi.loc[('two',slice('first','second')),'X'] also specifies a location.
During the assignment pandas tries to align the two indexes, and since they do not match (they would if substitute also had a MultiIndex), the alignment produces only NaN values, which get inserted.
One solution is to drop the index of substitute, since the loc call already specifies where the values should be inserted:
dfmi.loc[('two',slice('first','second')),'X'] = substitute.values
or even simpler, insert the values directly:
dfmi.loc[('two',slice('first','second')),'X'] = ["ab","cd"]
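As a minimal, self-contained sketch of the fix described above (the variable name target is only an illustrative helper, not part of the question), the data can be passed without its index so that pandas has nothing to align:

```python
import pandas as pd

# Rebuild the frame from the question.
mindex = pd.MultiIndex.from_product([["one", "two"], ["first", "second"]])
dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    index=mindex,
    columns=["X", "Y", "Z", "Q"],
)
substitute = pd.Series(["ab", "cd"], index=mindex.levels[1])

# 'target' is a hypothetical helper name for the row selector.
target = ("two", slice("first", "second"))

# Passing a plain array drops the Series index, so the values are
# written positionally into the selected rows instead of being aligned.
dfmi.loc[target, "X"] = substitute.to_numpy()
print(dfmi)
```

The values land in the order given, so this only makes sense when the array is already ordered like the selected rows.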

Can you try this:
dfmi.loc['two']['X']=substitute

Related

Add/Sum values of a column vector to a matrix with pandas

I have 2 dataframes (without headers or index). One is of size 100x20 (Dataframe A) and the other of size 100x1 (Dataframe B). I would like to add the values of Dataframe B to the first 5 columns in Dataframe A.
I tried to do this with
C = A.iloc[:,:5].add(B,axis=0)
Now C is of size 100x5, but I get A[:,0]+B for the first column only and the other 4 columns in C are NaN. What am I doing wrong?
This is because of index alignment. DataFrames always carry row and column labels. Here B has a single column labelled 0, so when it is aligned with A during the addition, only column 0 of A finds a match and the remaining columns become NaN.
Use an array to bypass it:
C = A.iloc[:,:5].add(B.to_numpy(), axis=0)
Or slice B as Series:
A.iloc[:,:5].add(B[0], axis=0)
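A small sketch of the three variants, using made-up 3x4 and 3x1 frames in place of the 100x20 and 100x1 ones from the question (the sizes and values are assumptions chosen to keep the output readable):

```python
import numpy as np
import pandas as pd

# Small stand-ins for A (100x20) and B (100x1).
A = pd.DataFrame(np.arange(12).reshape(3, 4))
B = pd.DataFrame([[10], [20], [30]])

# DataFrame + DataFrame aligns on column labels: only column 0 matches.
aligned = A.iloc[:, :2].add(B, axis=0)

# A Series is matched against the row index and added to every column.
by_series = A.iloc[:, :2].add(B[0], axis=0)

# A plain (n, 1) array carries no labels, so nothing is aligned.
by_array = A.iloc[:, :2].add(B.to_numpy(), axis=0)

print(aligned)
print(by_series)
```

Only the first variant produces NaN; the other two add B to each selected column.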

Pandas dataframe select rows where a list-column contains a specific set of elements

This is a follow-up to the following post: Pandas dataframe select rows where a list-column contains any of a list of strings
I want to be able to select rows that contain the exact pair of strings from the selection list (where selection= ['cat', 'dog']).
starting df:
  molecule            species
0        a              [dog]
1        b       [horse, pig]
2        c         [cat, dog]
3        d  [cat, horse, pig]
4        e     [chicken, pig]
df I want:
  molecule     species
2        c  [cat, dog]
I tried the following and it returned only the column labels.
df[pd.DataFrame(df.species.tolist()).isin(selection).all(1)]
One way to do it:
df['joined'] = df.species.str.join(sep=',')
selection = ['cat,dog']
filtered = df.loc[df.joined.isin(selection)]
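This won't find cases with a different order (i.e. 'dog,cat' or 'horse,cat,pig'), but if that is not an issue then it works fine.
The following ignores order. It selects every row whose species all appear in selection, so it also matches a row that contains only [dog]:
import numpy as np
import pandas as pd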
selection = ['cat', 'dog']
mols = pd.DataFrame({'molecule':['a','b','c','d','e'],'species':[['dog'],['horse','pig'],['cat','dog'],['cat','horse','pig'],['chicken','pig']]})
mols.loc[np.where(pd.Series([all(w in selection for w in mols.species.values[k]) for k in mols.index]).map({True:1,False:0}) == 1)[0]]
If you want to find any rows that have at least the elements in the list (and could have others as well), use:
mols.loc[np.where(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}) == 1)[0]]
This is an interesting application of matrices as selectors. Use the transposed mols to multiply the vector of zeroes and ones that points which rows in mols fit your criteria:
mols.to_numpy().T.dot(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}))
Another (more readable) solution would be to assign, to mols, a column where the condition is True, map it to 0 and 1 and query mols where that column is equal to 1.
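A shorter variant of the same idea, offered as a sketch rather than as what the answers above use: comparing Python sets directly gives a boolean mask without going through np.where. The names exact, at_least and only_from are illustrative.

```python
import pandas as pd

selection = ["cat", "dog"]
mols = pd.DataFrame({
    "molecule": ["a", "b", "c", "d", "e"],
    "species": [["dog"], ["horse", "pig"], ["cat", "dog"],
                ["cat", "horse", "pig"], ["chicken", "pig"]],
})
wanted = set(selection)

# Exactly the selected elements, in any order.
exact = mols[mols["species"].apply(lambda s: set(s) == wanted)]

# At least the selected elements (others allowed).
at_least = mols[mols["species"].apply(lambda s: wanted.issubset(s))]

# Only elements from the selection (possibly fewer).
only_from = mols[mols["species"].apply(lambda s: set(s).issubset(wanted))]

print(exact)
```

Set comparison ignores both order and duplicates inside a list, which may or may not be what is wanted.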

Get True/False boolean list of row in pandas dataframe out of condition

I am working with several Pandas DataFrames and I need the following filtering:
Suppose I get a list like
L=['EP6','EP3','EP2']
I need to get the following vector of a row:
for the row "concept 1": True where the column label is in L, False where it is not.
I am trying:
# D being the DataFrame
L=['EP6', 'EP3','EP2']
[True for ind in D.columns if ind in L ]
But I only get [True,True,True]
I need the complete list like:
desire_result=[0,0,0,0,1,0,0,1,1,0]
Note: the 1s in the desired result have nothing to do with the 1s the DataFrame is populated with.
Thanks
pandas has isin for this:
D.columns.isin(L)
Your list comprehension acts as a filter: it yields True when ind is in L and yields nothing otherwise.
What you want here is a mapping. You can still use a list comprehension, but the condition belongs in the expression part:
[ind in L for ind in D.columns]
or if you want integers:
[int(ind in L) for ind in D.columns]
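To make the difference concrete, here is a sketch on a made-up frame. The column names EP1 to EP7 and the single row are assumptions, since the original D is not shown:

```python
import pandas as pd

# Hypothetical stand-in for D: one row, columns EP1..EP7.
D = pd.DataFrame(1, index=["concept 1"],
                 columns=[f"EP{i}" for i in range(1, 8)])
L = ["EP6", "EP3", "EP2"]

# Filtering drops the non-matching columns entirely.
filtered = [True for ind in D.columns if ind in L]

# Mapping keeps one entry per column.
mask = D.columns.isin(L)             # NumPy boolean array
as_bool = mask.tolist()
as_int = mask.astype(int).tolist()

print(filtered)
print(as_int)
```

The mask has one entry per column and can also be used directly for selection, for example D.loc[:, mask].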

pd.dataframe.apply() create multiple new columns

I have a bunch of files that I want to open, read the first line of, parse into several expected pieces of information, and then store as rows (filename plus parsed data) in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
  filenames
0        Aa
1        Bb
2        Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
   0  1
0  A  a
1  B  b
2  C  c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
  filenames first second
0        Aa     A      a
1        Bb     B      b
2        Cc     C      c
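An alternative sketch, which is not what the answer above uses: DataFrame.apply accepts result_type="expand", which spreads a returned tuple over separate columns. The function parse is a hypothetical placeholder for the code that opens a file and parses its first line.

```python
import pandas as pd

def parse(filename):
    # Hypothetical stand-in for opening the file and parsing its
    # first line; here it just splits off the first two characters.
    return filename[:1], filename[1:2]

df = pd.DataFrame({"filename": ["Aa", "Bb", "Cc"]})

# result_type="expand" turns each returned tuple into columns 0 and 1.
parsed = df.apply(lambda row: parse(row["filename"]),
                  axis=1, result_type="expand")
parsed.columns = ["first_letter", "second_letter"]

# Attach the new columns; rows line up on the shared index.
df = df.join(parsed)
print(df)
```

This avoids building one Series object per row, which may matter when there are many files.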

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2),index=MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [3]: df
Out [3]:
              A         B
one 1 -2.028736 -0.466668
    2 -1.877478  0.179211
    3  0.886038  0.679528
two 1  1.101735  0.169177
    2  0.756676 -1.043739
    3  1.189944  1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each row (index level 0) and each column. So I need a DataFrame which would look like
A B
one 1 mean(df['A'].ix['one'][1:3]) mean(df['B'].ix['one'][1:3])
two 1 mean(df['A'].ix['two'][1:3]) mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
              A         B
one 1 -2.712137 -0.131805
    2 -0.390227 -1.333230
    3  0.047128  0.438284
two 1  0.055254 -1.434262
    2  2.392265 -1.474072
    3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one':df.xs('one',level=0)[1:3].apply(np.mean), 'two':df.xs('two',level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
            A         B
one -0.171549 -0.447473
two  0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for g in grouped:
    # g is a (group key, sub-frame) pair; rows 1:3 are elements 2 and 3
    d[g[0]] = g[1][1:3].apply(np.mean)
DataFrame(d).transpose()
I'm not sure about panels - it's not as well documented, but something similar should be possible
I know this is an old question, but for reference, for anyone who searches and finds this page, I think the easier solution is the level keyword in mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [6]: df
Out[6]:
              A         B
one 1 -0.472890  2.297778
    2 -2.002773 -0.114489
    3 -1.337794 -1.464213
two 1  1.964838 -0.623666
    2  0.838388  0.229361
    3  1.735198  0.170260
In [7]: df.mean(level=0)
Out[7]:
            A         B
one -1.271152  0.239692
two  1.512808 -0.074682
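Here level=0 means that index level 0 is kept while the mean is taken over axis 0 (the rows, the default for mean). Note that this averages all three rows of each group, not only elements 2 and 3. The level keyword was deprecated in pandas 1.3 and removed in 2.0; df.groupby(level=0).mean() gives the same result.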
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2,3]] + [("two", elem) for elem in [2,3]]
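# Compute grouped mean over only those indices.
# (.ix was removed in pandas 1.0; .loc is the label-based replacement,
# and groupby(level=0) replaces the removed level keyword of mean.)
df.loc[idxs].groupby(level=0).mean()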
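For current pandas versions, the same computation can be sketched without listing the index tuples by hand. The frame below uses fixed values instead of random ones, purely so the result is easy to check, and the names subset and means are illustrative:

```python
import numpy as np
import pandas as pd

arrays = [["one", "one", "one", "two", "two", "two"], [1, 2, 3, 1, 2, 3]]
index = pd.MultiIndex.from_arrays(arrays)

# Deterministic values 0..11 instead of randn, for readability.
df = pd.DataFrame(np.arange(12, dtype=float).reshape(6, 2),
                  index=index, columns=["A", "B"])

# Keep every value of level 0, but only labels 2 and 3 of level 1.
subset = df.loc[(slice(None), [2, 3]), :]

# Average the remaining rows within each level-0 group.
means = subset.groupby(level=0).mean()
print(means)
```

Selecting with slice(None) in a tuple like this relies on the MultiIndex being sorted, which is the case for the index built here.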