matching consecutive pairs in pd.Series - pandas

I have a DataFrame which looks like this :-
ID | act
1 A
1 B
1 C
1 D
2 A
2 B
3 A
3 C
I am trying to get the IDs where an activity act1 is followed by another act2, for example, A is followed by B. In that case, I want to get [1,2] as the ids. How do I go about this in a vectorized manner?
Edit :- Expected output : For the sample df defined above, the output should be a list/Series of all the IDs where A is followed immediately by B
IDs
1
2

Here is a simple, vectorised way to do it!
df.loc[(df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1)), 'ID']
Output:
0 1
4 2
Name: ID, dtype: int64
Another way of writing this, possibly clearer:
conditions = (df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1))
df.loc[conditions, 'ID']
Numpy makes it easy to filter for one or many boolean conditions. The resulting vector is used to filter your dataframe.

Here is one approach: groupby, and don't sort, since we need to track B immediately following A, based on the current dataframe structure.
Next aggregate using str.cat
check if A,B is present
get the index
pass as a list
(df
.groupby('ID',sort=False)
.Act
.agg(lambda x: x.str.cat(sep=','))
.str.contains('A,B')
.loc[lambda x: x==1]
.index.tolist()
)
[1, 2]
Another approach is using the shift function and filtering:
df['x'] = df.Act.shift()
df.loc[lambda x: (x['Act']=='B') & (x['x']=='A')].ID

Related

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
A B C
0 a 1 1
1 a 1 2
2 b 1 1
# My function
def get_uniq_t(df):
if df.shape[0] > 1:
df['D'] = df.C * 10 + df.B
df = df[df.D == df.D.max()].drop(columns='D')
return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following value error message. The issue seems to do with creating the new column D. If I create the column D outside the function, the code seems running fine. Can someone help explain what caused the value error message?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify
the original group.
Other problem is that this function should return a single row
not a DataFrame.
Change your function to:
def get_uniq_t(df):
iMax = (df.C * 10 + df.B).idxmax()
return df.loc[iMax]
Then its application returns:
A B C
A
a a 1 2
b b 1 1
Edit following the comment
In my opinion, it is not allowed to modify the original group,
as it would indirectly modify the original DataFrame.
At least it displays a warning about this and is considered a bad practice.
Search the Web for SettingWithCopyWarning for more extensive description.
My code (get_uniq_t function) does not modify the original group.
It only returns one row from the current group.
The returned row is selected based on which row returns the greatest value
of df.C * 10 + df.B. So when you apply this function, the result is a new
DataFrame, with consecutive rows equal to results of this function
for consecutive groups.
You can perform an operation equivalent to modification, when you
create some new content, e.g. as the result of groupby instruction
and then save it under the same variable which so far held the source
DataFrame.

pandas get columns without copy

I have a data frame with multiple columns, and I want to get some of them, and drop others, without copying a new dataframe
I suppose it should be
df = df['col_a','col_b']
but I'm not sure whether it copy a new one or not. Is there any better way to do this?
Your approach should work, apart from one minor issue:
df = df['col_a','col_b']
shoud be:
df = df[['col_a','col_b']]
Because you assign the subset df back to df, it's essentially equivalent to dropping the other columns.
If you would like to drop other columns in place, you can do:
df.drop(columns=df.columns.difference(['col_a','col_b']),inplace=True)
Let me know if this is what you want.
you have a dataframe df with multiple columns a, b, c, d and e. You want to select let us say a and b and store them back in df. To achieve this, you can do :
df=df[['a', 'b']]
Input dataframe df:
a b c d e
1 1 1 1 1
3 2 3 1 4
When you do :
df=df[['a', 'b']]
output will be :
a b
1 1
3 2

Pipelining pandas: create columns that depend on freshly created ones

Let's say you have the following DataFrame
df=pd.DataFrame({'A': [1, 2]})
now I want to construct the column B = A+1, then the column C=A+2 and D = B +C. These calculations are only here for simplicity. Typically, I want to use some e.g. nonlinear transformations, normalizations etc.
what one could do is the following:
df.assign(**{'B': lambda x: x['A'] +1, 'C': lambda x :['A']+2})\
.assign(**{'D':lambda x: x['B']+ x['C']})
However, this is obviously a bit annoying, specifically, if you have a large number of preprocessing steps in a pipeline. Putting both dictionaries together (even in an orderedDict) fails.
Is there a way to obtain a similar result faster or more elegantly?
Additionally, the same problem occurs, if you want to add a column that uses e.g. the sum of a just defined column. This, as far as I know, will always require two assign calls.
You can using eval
df.eval("""
B= A+1
C= A+2
D = B+C""", inplace=False)
Out[625]:
A B C D
0 1 2 3 5
1 2 3 4 7
If you want the calculation within the query ''
df.eval('B=A.max()',inplace=True)
df
Out[647]:
A B
0 1 2
1 2 2

pd.dataframe.apply() create multiple new columns

I have a bunch of files where I want to open, read the first line, parse it into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file-opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
filenames
0 Aa
1 Bb
2 Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
0 1
0 A a
1 B b
2 C c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
filenames first second
0 Aa A a
1 Bb B b
2 Cc C c

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2),index=MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [3]: df
Out [3]:
A B
one 1 -2.028736 -0.466668
2 -1.877478 0.179211
3 0.886038 0.679528
two 1 1.101735 0.169177
2 0.756676 -1.043739
3 1.189944 1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each row (index level 0) and each column. So I need a DataFrame which would look like
A B
one 1 mean(df['A'].ix['one'][1:3]) mean(df['B'].ix['one'][1:3])
two 1 mean(df['A'].ix['two'][1:3]) mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
A B
one 1 -2.712137 -0.131805
2 -0.390227 -1.333230
3 0.047128 0.438284
two 1 0.055254 -1.434262
2 2.392265 -1.474072
3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one':df.xs('one',level=0)[1:3].apply(np.mean), 'two':df.xs('two',level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
A B
one -0.171549 -0.447473
two 0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for g in grouped:
d[g[0]] = g[1][1:3].apply(np.mean)
DataFrame(d).transpose()
I'm not sure about panels - it's not as well documented, but something similar should be possible
I know this is an old question, but for reference who searches and finds this page, the easier solution I think is the level keyword in mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(z
ip(*arrays)),columns=['A','B'])
In [6]: df
Out[6]:
A B
one 1 -0.472890 2.297778
2 -2.002773 -0.114489
3 -1.337794 -1.464213
two 1 1.964838 -0.623666
2 0.838388 0.229361
3 1.735198 0.170260
In [7]: df.mean(level=0)
Out[7]:
A B
one -1.271152 0.239692
two 1.512808 -0.074682
In this case it means that level 0 is kept over axis 0 (the rows, default value for mean)
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2,3]] + [("two", elem) for elem in [2,3]]
# Compute grouped mean over only those indices.
df.ix[idxs].mean(level=0)