Pandas drop duplicates only for main index

I have a MultiIndex and I want to perform drop_duplicates on a per-level basis. I don't want to look at the entire DataFrame, only at rows that duplicate another row with the same main index.
Example:
                 A    B
entry subentry
1     0        1.0  1.0
      1        1.0  1.0
      2        2.0  2.0
2     0        1.0  1.0
      1        2.0  2.0
      2        2.0  2.0
should return:
                 A    B
entry subentry
1     0        1.0  1.0
      1        2.0  2.0
2     0        1.0  1.0
      1        2.0  2.0
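For reference, the example frame can be built like this (a minimal sketch; the level names entry and subentry are taken from the question):
import pandas as pd

# sample data matching the question's input
df = pd.DataFrame(
    {'A': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
     'B': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]},
    index=pd.MultiIndex.from_product([[1, 2], [0, 1, 2]],
                                     names=['entry', 'subentry']))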

Use MultiIndex.get_level_values with Index.duplicated to filter out the last row per entry with boolean indexing:
df1 = df[df.index.get_level_values('entry').duplicated(keep='last')]
print (df1)
                 A    B
entry subentry
1     0        1.0  1.0
      1        1.0  1.0
2     0        1.0  1.0
      1        2.0  2.0
Or, if you need to remove duplicates per first level and columns, convert the first level to a column with DataFrame.reset_index; for filtering, invert the boolean mask with ~ and convert the Series to a NumPy array, because the indices of the mask and the original DataFrame do not match:
df2 = df[~df.reset_index(level=0).duplicated(keep='last').to_numpy()]
print (df2)
                 A    B
entry subentry
1     1        1.0  1.0
      2        2.0  2.0
2     0        1.0  1.0
      2        2.0  2.0
Or create a helper column from the first level of the MultiIndex:
df2 = df[~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last')]
print (df2)
                 A    B
entry subentry
1     1        1.0  1.0
      2        2.0  2.0
2     0        1.0  1.0
      2        2.0  2.0
Details:
print (df.reset_index(level=0))
          entry    A    B
subentry
0             1  1.0  1.0
1             1  1.0  1.0
2             1  2.0  2.0
0             2  1.0  1.0
1             2  2.0  2.0
2             2  2.0  2.0
print (~df.reset_index(level=0).duplicated(keep='last'))
0    False
1     True
2     True
0     True
1    False
2     True
dtype: bool
print (df.assign(new=df.index.get_level_values('entry')))
                 A    B  new
entry subentry
1     0        1.0  1.0    1
      1        1.0  1.0    1
      2        2.0  2.0    1
2     0        1.0  1.0    2
      1        2.0  2.0    2
      2        2.0  2.0    2
print (~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last'))
entry  subentry
1      0           False
       1            True
       2            True
2      0            True
       1           False
       2            True
dtype: bool

It looks like you want to drop_duplicates per group:
out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())
Or, as a possibly more efficient variant, use a temporary reset_index with duplicated and boolean indexing:
out = df[~df.reset_index('entry').duplicated().values]
Output:
                 A    B
entry subentry
1     0        1.0  1.0
      2        2.0  2.0
2     0        1.0  1.0
      1        2.0  2.0

group dataframe if the column has the same value in consecutive order

Let's say I have a dataframe with a Treatment column of 'Y'/'N' flags, as reconstructed below.
I want to assign my assets to one group if they have consecutive treatments. If there are two consecutive assets without treatment after them, we can still assign them to the same group. However, if there are more than two assets without treatment, those assets (without treatment) will have an empty group, and the next assets that have treatment will be assigned to a new group.
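The input frame is not shown in the question, but it can be reconstructed from the Treatment column of the output further below:
import pandas as pd

# reconstructed from the 'Treatment' column of the answer's output
df = pd.DataFrame({'Treatment': list('YYYNNYYYNNNYYYYN')})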
You can use a rolling check of whether there was at least one 'Y' in the last N occurrences.
I am providing two options depending on whether or not it's important not to label the leading/trailing Ns:
# maximal number of days without treatment
# to remain in same group
N = 2
m = df['Treatment'].eq('Y')
group = m.rolling(N+1, min_periods=1).max().eq(0)
group = (group & ~group.shift(fill_value=False)).cumsum().add(1)
df['group'] = group
# don't label leading/trailing N
m1 = m.groupby(group).cummax()
m2 = m[::-1].groupby(group).cummax()
df['group2'] = group.where(m1&m2)
print(df)
To handle the last NaNs separately:
m3 = ~m[::-1].cummax()
df['group3'] = group.where(m1&m2|m3)
Output:
   Treatment  group  group2  group3
0          Y      1     1.0     1.0
1          Y      1     1.0     1.0
2          Y      1     1.0     1.0
3          N      1     1.0     1.0
4          N      1     1.0     1.0
5          Y      1     1.0     1.0
6          Y      1     1.0     1.0
7          Y      1     1.0     1.0
8          N      1     NaN     NaN
9          N      1     NaN     NaN
10         N      2     NaN     NaN
11         Y      2     2.0     2.0
12         Y      2     2.0     2.0
13         Y      2     2.0     2.0
14         Y      2     2.0     2.0
15         N      2     NaN     2.0
Another example, for N=1:
   Treatment  group  group2  group3
0          Y      1     1.0     1.0
1          Y      1     1.0     1.0
2          Y      1     1.0     1.0
3          N      1     NaN     NaN
4          N      2     NaN     NaN
5          Y      2     2.0     2.0
6          Y      2     2.0     2.0
7          Y      2     2.0     2.0
8          N      2     NaN     NaN
9          N      3     NaN     NaN
10         N      3     NaN     NaN
11         Y      3     3.0     3.0
12         Y      3     3.0     3.0
13         Y      3     3.0     3.0
14         Y      3     3.0     3.0
15         N      3     NaN     3.0

mode returns Exception: Must produce aggregated value

For this dataframe:
   values  ii
0     3.0   4
1     0.0   1
2     3.0   8
3     2.0   5
4     2.0   1
5     3.0   5
6     2.0   4
7     1.0   8
8     0.0   5
9     1.0   1
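For reproducibility, the frame can be built like this:
import pandas as pd

df = pd.DataFrame({'values': [3.0, 0.0, 3.0, 2.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0],
                   'ii': [4, 1, 8, 5, 1, 5, 4, 8, 5, 1]})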
This line returns "Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns 2 or more values; check the solution with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1  0    0.0
   1    1.0
   2    2.0
4  0    2.0
   1    3.0
5  0    0.0
   1    2.0
   2    3.0
8  0    1.0
   1    3.0
Name: values, dtype: float64
And pandas agg needs a scalar output, so it raises an error. If you select the first value instead, it works nicely:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1    0.0
4    2.0
5    0.0
8    1.0
Name: values, dtype: float64
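If you need all modes per group as a single value, one option is to return them as a Python list via apply, so each group yields one object (a sketch):
bii4 = df.groupby('ii')['values'].apply(lambda x: x.mode().tolist())
print (bii4)
# ii
# 1    [0.0, 1.0, 2.0]
# 4         [2.0, 3.0]
# 5    [0.0, 2.0, 3.0]
# 8         [1.0, 3.0]
# Name: values, dtype: object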

Numpy or Pandas for multiple dataframes of 2darray datasets

I hope I used the correct terms in the title to describe my problem.
My data has the following structure
D = {E_1, E_2, ..., E_n} with E_i = {M_{i,1}, M_{i,2}, ..., M_{i,m}}, and each M_{i,j} is a 6x2 matrix.
I used a NumPy array of dimension n x m x 6 x 2 to store the data. This was fine as long as every dataset E_i had the same number of matrices.
But this solution no longer works, since I now work with datasets E_i that have different numbers of matrices, i.e. E_i has m_i matrices.
Is there maybe a way in Pandas to resolve my problem? In the end I need to access each matrix to operate on it as a NumPy array, i.e. multiplication, inverse, determinant, ...
You could try to use a MultiIndex in pandas for this. It allows you to select from the dataframe by level. A simple example of how you could achieve something like that:
import numpy as np
import pandas as pd

D = np.repeat([0, 1], 12)
E = np.repeat([0, 1, 0, 1], 6)
print(D, E)
index_cols = pd.MultiIndex.from_arrays(
    [D, E],
    names=["D_idx", "E_idx"])
M = np.ones([24, 2])
df = pd.DataFrame(M,
                  index=index_cols,
                  columns=["left", "right"])
print(df)
This gives you the dataframe:
             left  right
D_idx E_idx
0     0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
1     0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
You can then slice the dataframe based on levels, i.e. if you want to retrieve all elements in the first set (D_idx == 0) you can select: df.loc[[(0, 0), (0, 1)], :]
You can generate selectors like this using list(zip(d_idx, e_idx)) in order to select specific rows, as in the sketch below.
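For example (a sketch, with df as built above):
# all rows of the first set, back as a NumPy array
sub = df.loc[0].to_numpy()                 # select by the first level only
# or build explicit (D_idx, E_idx) selectors from two index arrays
d_idx, e_idx = [0, 0], [0, 1]
sub2 = df.loc[list(zip(d_idx, e_idx)), :]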
You can find more about slicing and selecting the dataframe here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

How to get correct row data with certain restrictions in pandas?

I want to extract the correct rows based on certain conditions.
The dataframe contains a column entry with the entry signals.
An entry is only valid when there is no order in the market; therefore, only the first of two consecutive signals is valid.
A valid exit is 5 bars after entry.
Here is my code and dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'entry':[0,1,0,1,0,0,0,1,0,0,0,0,0,0]})
df['exit'] = df['entry'].shift(5)
df['state'] = np.select([df['entry'] == 1, df['exit'] == 1], [1, 0], default=np.nan)
df['state'] = df['state'].ffill().fillna(value=0)
df['change'] = df['state'].diff()
print(df)
entrysig = df[df['change'].eq(1)]
exitsig = df[df['change'].eq(-1)]
tradelist = pd.DataFrame({'entry': entrysig.index, 'exit': exitsig.index})
tradelist['wantedexit'] = [6, 12]
print(tradelist)
The output looks like:
    entry  exit  state  change
0       0   NaN    0.0     NaN
1       1   NaN    1.0     1.0
2       0   NaN    1.0     0.0
3       1   NaN    1.0     0.0
4       0   NaN    1.0     0.0
5       0   0.0    1.0     0.0
6       0   1.0    0.0    -1.0
7       1   0.0    1.0     1.0
8       0   1.0    0.0    -1.0
9       0   0.0    0.0     0.0
10      0   0.0    0.0     0.0
11      0   0.0    0.0     0.0
12      0   1.0    0.0     0.0
13      0   0.0    0.0     0.0
   entry  exit  wantedexit
0      1     6           6
1      7     8          12
In this example, the first trade, entering at bar 1 and exiting at bar 6, is correct: it enters at bar 1 and exits 5 bars later, at bar 6.
The entry on bar 3 is ignored because there is already an order in the market, entered at bar 1.
The second trade, entering at bar 7 and exiting at bar 8, is not correct, because the trade only lasts for 1 bar while my condition is to exit after 5 bars.
The exit at bar 8 is there because of the invalid signal at bar 3.
The 'wantedexit' column shows the correct exit bar indices.
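One way to enforce the "no entry while an order is in the market" rule is a small stateful loop over the signals (a sketch, not vectorized; it reproduces the wantedexit column for this example):
entries, exits = [], []
in_market_until = -1                       # bar index until which a position is held
for i, sig in enumerate(df['entry']):
    if sig == 1 and i > in_market_until:   # only enter when flat
        entries.append(i)
        exits.append(i + 5)                # exit 5 bars after entry
        in_market_until = i + 5
print(pd.DataFrame({'entry': entries, 'exit': exits}))
# gives entries [1, 7] and exits [6, 12]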

Pandas programming model for rolling window indexing

I need advice on the programming pattern and use of DataFrames for our data. We have thousands of small ASCII files that are the results of particle tracking experiments (see www.openptv.net for details). Each file is a list of particles identified and tracked in that time instance. The name of the file is the number of the frame. For example:
ptv_is.10000 (i.e. frame no. 10000)
prev next x y z
-1 5 0.0 0.0 0.0
0 0 1.0 1.0 1.0
1 1 2.0 2.0 2.0
2 2 3.0 3.0 3.0
3 -2 4.0 4.0 4.0
ptv_is.10001 (i.e. next time frame, 10001)
1 2 1.1 1.0 1.0
2 8 2.0 2.0 2.0
3 14 3.0 3.0 3.0
4 -2 4.0 4.0 4.0
-1 3 1.5 1.12 1.32
0 -2 0.0 0.0 0.0
The columns of the ASCII files are: prev is the row number of the particle in the previous frame, next is the row number of the particle in the next frame, and x, y, z are the coordinates of the particle. If 'prev' is -1, the particle appeared in the current frame and has no link back in time. If 'next' is -2, the particle has no link forward in time and the trajectory ends in this frame.
So we are reading these files into a single DataFrame with the same column headers, plus an added time column, i.e. the frame number:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
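A sketch of that reading step (assuming whitespace-separated files without a header line; the glob pattern is illustrative):
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('ptv_is.*')):
    f = pd.read_csv(path, sep=r'\s+', header=None,
                    names=['prev', 'next', 'x', 'y', 'z'])
    f['time'] = int(path.split('.')[-1])   # frame number taken from the file name
    frames.append(f)
big_df = pd.concat(frames, ignore_index=True)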
Now comes the step where I find it difficult to choose the best way of using a DataFrame. If we could add an additional column, called trajectory_id, we'd later be able to reindex this DataFrame either by time (creating sub-groups of the particles at a single time instance and learning their spatial distributions) or by trajectory_id, and then create trajectories (linked particles, to learn about their time evolution in space, e.g. x(t), y(t), z(t) for the same trajectory_id).
If the input is:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
Then the result I need is:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1
0 0 1.0 1.0 1.0 10000 2
1 1 2.0 2.0 2.0 10000 3
2 2 3.0 3.0 3.0 10000 4
3 -2 4.0 4.0 4.0 10000 -999
1 2 1.1 1.0 1.0 10001 2
2 8 2.0 2.0 2.0 10001 3
3 14 3.0 3.0 3.0 10001 4
-1 -2 4.0 4.0 4.0 10001 -999
-1 3 1.5 1.1 1.3 10001 5
0 -2 0.0 0.0 0.0 10001 1
which means:
prev next   x    y    z   time  trajectory_id
-1   5    0.0  0.0  0.0  10000  1     <- appeared first time, new id
0    0    1.0  1.0  1.0  10000  2     <- the same
1    1    2.0  2.0  2.0  10000  3     <- the same
2    2    3.0  3.0  3.0  10000  4     <- the same
3    -2   4.0  4.0  4.0  10000  -999  <- sort of NaN, there is no link in the next frame
1    2    1.1  1.0  1.0  10001  2     <- from row #1 at time 10000, which has id = 2
2    8    2.0  2.0  2.0  10001  3     <- row #2 at the previous time, id = 3
3    14   3.0  3.0  3.0  10001  4     <- from row #3, next on row #14, id = 4
-1   -2   4.0  4.0  4.0  10001  -999  <- not linked, marked as NaN or -999
-1   3    1.5  1.1  1.3  10001  5     <- new particle, new id = 5 (new trajectory_id)
0    -2   0.0  0.0  0.0  10001  1     <- from row #0, id = 1
Hope this explains better what I'm looking for. The only problem is that I do not know how to apply a rolling function through the rows of the DataFrame, creating a new index column, trajectory_id.
For example, the simple application with lists is shown here:
http://nbviewer.ipython.org/7020209
Thanks for any hints on pandas use,
Alex
Neat! This problem is close to my heart; I also use pandas for particle tracking. This is not exactly the same problem I work on, but here's an untested sketch that offers some helpful pandas idioms.
import numpy as np
import pandas as pd

results = []
first_loop = True
next_id = None
for frame_no, frame in pd.concat(list_of_dataframes).groupby('time'):
    if first_loop:
        frame['traj_id'] = np.arange(len(frame))
        results.append(frame)
        next_id = len(frame)
        first_loop = False
        continue
    prev_frame = results[-1]
    has_matches = frame['prev'] >= 0  # boolean indexer; -1 means a new particle
    # look up the trajectory id of the linked row in the previous frame
    frame.loc[has_matches, 'traj_id'] = (
        prev_frame['traj_id'].to_numpy()[frame.loc[has_matches, 'prev']])
    count_unmatched = (~has_matches).sum()
    frame.loc[~has_matches, 'traj_id'] = np.arange(next_id, next_id + count_unmatched)
    next_id += count_unmatched
    results.append(frame)
pd.concat(results)
If I understand correctly, you want to track the position of particles in space across time. You are dealing with five-dimensional data, so maybe a DataFrame is not the best structure for your problem, and you may consider a Panel structure, or a reduction of the data.
Taking one particle, you have two possibilities: treat the coordinates as three different values, so you need three fields, or treat them as a whole, a tuple or a point object for example.
In the first case you have time plus three values, so you have four axes and need a DataFrame. In the second case you have two axes, so you can use a Series.
For multiple particles, just use a particle_id and put all the DataFrames in a Panel, or the Series in a DataFrame.
Once you know what data structure to use, it's time to put the data in.
Read the files sequentially and keep a collection of 'live' particles, e.g.:
{particle_id1: { time1: (x1,y1,z1), time2: (x2,y2,z2), ...}, ...}
When a new particle is detected (-1 in prev), assign it a new particle_id and put it in the collection. When a particle 'dies' (-2 in next), pop it out of the collection, put its data in a Series, and then add this Series to a particle DataFrame (or DataFrame / Panel).
You could also keep an index from particle ids to the next field, to help recognize ids in the following file:
{ next_position_in_last_file: particle_id, ... }
or
{ position_in_last_file: particle_id, ... }
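A minimal sketch of that bookkeeping, assuming the concatenated big_df from the reading step above (names are illustrative):
trajectories = {}   # particle_id -> {time: (x, y, z)}
live = {}           # row number in the previous frame -> particle_id
next_id = 0

for time, frame in big_df.groupby('time'):
    new_live = {}
    for row, rec in enumerate(frame.itertuples(index=False)):
        pid = live.get(rec.prev) if rec.prev != -1 else None
        if pid is None:                    # new particle (or unknown back-link)
            pid, next_id = next_id, next_id + 1
            trajectories[pid] = {}
        trajectories[pid][time] = (rec.x, rec.y, rec.z)
        if rec.next != -2:                 # the next frame's 'prev' refers to this row
            new_live[row] = pid
    live = new_live

# one time-indexed Series per trajectory, ready for x(t), y(t), z(t) analysis
trajs = {pid: pd.Series(pts) for pid, pts in trajectories.items()}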