Pandas programming model for rolling-window indexing

I need advice on the programming pattern and the use of a DataFrame for our data. We have thousands of small ASCII files that are the results of particle tracking experiments (see www.openptv.net for details). Each file is a list of particles identified and tracked at that time instance. The name of the file is the number of the frame. For example:
ptv_is.10000 (i.e. frame no. 10000)
prev next x y z
-1 5 0.0 0.0 0.0
0 0 1.0 1.0 1.0
1 1 2.0 2.0 2.0
2 2 3.0 3.0 3.0
3 -2 4.0 4.0 4.0
ptv_is.10001 (i.e. the next time frame, 10001)
1 2 1.1 1.0 1.0
2 8 2.0 2.0 2.0
3 14 3.0 3.0 3.0
4 -2 4.0 4.0 4.0
-1 3 1.5 1.12 1.32
0 -2 0.0 0.0 0.0
The columns of the ASCII files are: prev is the row number of the particle in the previous frame, next is the row number of the particle in the next frame, and x, y, z are the coordinates of the particle. If 'prev' is -1, the particle appeared in the current frame and has no link back in time. If 'next' is -2, the particle has no link forward in time and the trajectory ends in this frame.
So we read these files into a single DataFrame with the same column headers, plus an added time column, i.e. the frame number (a sketch of this reading step is shown after the table below):
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
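For reference, here is a minimal sketch of that reading step, assuming the files are whitespace-separated with no header row and named ptv_is.<frame> as in the example above (the glob pattern and column names are assumptions taken from the sample):
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('ptv_is.*')):
    frame_no = int(path.rsplit('.', 1)[-1])    # frame number is the file suffix
    df = pd.read_csv(path, sep=r'\s+', header=None,
                     names=['prev', 'next', 'x', 'y', 'z'])
    df['time'] = frame_no                      # add the frame number as the time column
    frames.append(df)

data = pd.concat(frames, ignore_index=True)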
Now comes the step where I find it difficult to decide on the best way of using the DataFrame. If we could add an additional column, called trajectory_id, we would later be able to reindex this DataFrame either by time (creating sub-groups of the particles at a single time instance and learning their spatial distributions) or by trajectory_id, and then create trajectories (i.e. linked particles) and learn about their time evolution in space, e.g. x(t), y(t), z(t) for the same trajectory_id.
If the input is:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
Then the result I need is:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1
0 0 1.0 1.0 1.0 10000 2
1 1 2.0 2.0 2.0 10000 3
2 2 3.0 3.0 3.0 10000 4
3 -2 4.0 4.0 4.0 10000 -999
1 2 1.1 1.0 1.0 10001 2
2 8 2.0 2.0 2.0 10001 3
3 14 3.0 3.0 3.0 10001 4
-1 -2 4.0 4.0 4.0 10001 -999
-1 3 1.5 1.1 1.3 10001 5
0 -2 0.0 0.0 0.0 10001 1
which means:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1 <- appeared for the first time, new id
0 0 1.0 1.0 1.0 10000 2 <- the same
1 1 2.0 2.0 2.0 10000 3 <- the same
2 2 3.0 3.0 3.0 10000 4 <- the same
3 -2 4.0 4.0 4.0 10000 -999 <- sort of NaN, there is no link in the next frame
1 2 1.1 1.0 1.0 10001 2 <- from row #1 at time 10000, has id = 2
2 8 2.0 2.0 2.0 10001 3 <- from row #2 at the previous time, has id = 3
3 14 3.0 3.0 3.0 10001 4 <- from row #3, next is in row #14, id = 4
-1 -2 4.0 4.0 4.0 10001 -999 <- no link forward, marked as NaN or -999
-1 3 1.5 1.1 1.3 10001 5 <- new particle, new id = 5 (new trajectory_id)
0 -2 0.0 0.0 0.0 10001 1 <- from row #0, id = 1
I hope this explains better what I'm looking for. The only problem is that I do not know how to apply a rolling function through the rows of a DataFrame, creating a new index column, trajectory_id.
For example, the simple application with lists is shown here:
http://nbviewer.ipython.org/7020209
Thanks for any hints on pandas usage,
Alex

Neat! This problem is close to my heart; I also use pandas for particle tracking. This is not exactly the same problem I work on, but here's an untested sketch that offers some helpful pandas idioms.
import numpy as np
import pandas as pd

# list_of_dataframes: one DataFrame per frame file, each with a 'time' column
results = []
first_loop = True
next_id = None
for frame_no, frame in pd.concat(list_of_dataframes).groupby('time'):
    if first_loop:
        frame['traj_id'] = np.arange(len(frame))
        results.append(frame)
        next_id = len(frame)
        first_loop = False
        continue
    prev_frame = results[-1]
    has_matches = frame['prev'] >= 0  # boolean indexer; prev == -1 means a new particle
    # look up the trajectory id of the linked row in the previous frame
    prev_rows = frame.loc[has_matches, 'prev'].to_numpy()
    frame.loc[has_matches, 'traj_id'] = prev_frame['traj_id'].iloc[prev_rows].to_numpy()
    # unmatched particles start new trajectories
    count_unmatched = (~has_matches).sum()
    frame.loc[~has_matches, 'traj_id'] = np.arange(next_id, next_id + count_unmatched)
    next_id += count_unmatched
    results.append(frame)
pd.concat(results)
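Once you have the concatenated result, the reindexing described in the question is a groupby away; for example (a sketch, assuming the traj_id column produced above):
trajectories = pd.concat(results)

# spatial distribution of particles per time instance
by_time = trajectories.groupby('time')[['x', 'y', 'z']].describe()

# time evolution x(t), y(t), z(t) of a single trajectory
one_traj = trajectories[trajectories['traj_id'] == 2].sort_values('time')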

If I understand correctly, you want to track the position of particles in space across time. You are dealing with data of five dimensions, so maybe a DataFrame is not the best structure for your problem and you may want to think about a panel structure, or a reduction of the data.
Taking one particle, you have two possibilities: either treat the coordinates as three different values, so you need three fields, or treat them as a whole, e.g. a tuple or a point object.
In the first case you have time plus three values, so you have four axes and you need a DataFrame. In the second case you have two axes, so you can use a Series.
For multiple particles, just use a particle_id and put all the DataFrames in a Panel or all the Series in a DataFrame.
Once you know which data structure to use, it's time to put the data in.
Read the files sequentially and keep a collection of 'live' particles, e.g.:
{particle_id1: { time1: (x1,y1,z1), time2: (x2,y2,z2), ...}, ...}
When a new particle is detected (-1 in prev), assign it a new particle_id and put it in the collection. When a particle 'dies' (-2 in next), pop it out of the collection, put its data in a Series, and then add this Series to a particle DataFrame (or DataFrame / Panel).
You could also keep an index of particle ids against the next field to help recognize them, for example:
{ next_position_of_last_file: particle_id, ... }
or
{ position_in_last_file: particle_id, ...}
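A rough sketch of that bookkeeping (untested; 'frames' is a hypothetical dict of already-parsed {frame_no: [(prev, next, x, y, z), ...]} rows, and since Panel has been removed from recent pandas the sketch sticks to Series):
import pandas as pd

live = {}        # particle_id -> {time: (x, y, z)}, the 'live' collection
pos_to_id = {}   # row position in the previous frame -> particle_id
done = {}        # particle_id -> Series of (x, y, z) tuples indexed by time
next_id = 0

for frame_no, rows in sorted(frames.items()):
    new_pos_to_id = {}
    for pos, (prev, nxt, x, y, z) in enumerate(rows):
        if prev == -1:                     # new particle: assign a fresh id
            pid, next_id = next_id, next_id + 1
            live[pid] = {}
        else:                              # linked back in time: reuse the id
            pid = pos_to_id[prev]
        live[pid][frame_no] = (x, y, z)
        if nxt == -2:                      # trajectory ends: move it to 'done'
            done[pid] = pd.Series(live.pop(pid))
        else:
            new_pos_to_id[pos] = pid       # still alive, remember its row position
    pos_to_id = new_pos_to_id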

Related

Pandas drop duplicates only for main index

I have a MultiIndex and I want to perform drop_duplicates on a per-level basis: I don't want to look at the entire dataframe, but only at whether there is a duplicate within the same main index.
Example:
entry subentry A B
1 0 1.0 1.0
1 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
1 2.0 2.0
2 2.0 2.0
should return:
entry subentry A B
1 0 1.0 1.0
1 2.0 2.0
2 0 1.0 1.0
1 2.0 2.0
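For reference, the example frame above can be built like this (a sketch):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)],
    names=['entry', 'subentry'])
df = pd.DataFrame({'A': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
                   'B': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]}, index=idx)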
Use MultiIndex.get_level_values with Index.duplicated to filter out the last row per entry with boolean indexing:
df1 = df[df.index.get_level_values('entry').duplicated(keep='last')]
print (df1)
A B
entry subentry
1 0 1.0 1.0
1 1.0 1.0
2 0 1.0 1.0
1 2.0 2.0
Or, if you need to remove duplicates per first level and columns: convert the first level to a column with DataFrame.reset_index, invert the boolean mask with ~ for filtering, and convert the Series to a numpy array, because the indices of the mask and the original DataFrame do not match:
df2 = df[~df.reset_index(level=0).duplicated(keep='last').to_numpy()]
print (df2)
A B
entry subentry
1 1 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
2 2.0 2.0
Or create a helper column from the first level of the MultiIndex:
df2 = df[~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last')]
print (df2)
A B
entry subentry
1 1 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
2 2.0 2.0
Details:
print (df.reset_index(level=0))
entry A B
subentry
0 1 1.0 1.0
1 1 1.0 1.0
2 1 2.0 2.0
0 2 1.0 1.0
1 2 2.0 2.0
2 2 2.0 2.0
print (~df.reset_index(level=0).duplicated(keep='last'))
0 False
1 True
2 True
0 True
1 False
2 True
dtype: bool
print (df.assign(new=df.index.get_level_values('entry')))
A B new
entry subentry
1 0 1.0 1.0 1
1 1.0 1.0 1
2 2.0 2.0 1
2 0 1.0 1.0 2
1 2.0 2.0 2
2 2.0 2.0 2
print (~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last'))
entry subentry
1 0 False
1 True
2 True
2 0 True
1 False
2 True
dtype: bool
It looks like you want to drop_duplicates per group:
out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())
Or, a possibly more efficient variant using a temporary reset_index with duplicated and boolean indexing:
out = df[~df.reset_index('entry').duplicated().values]
Output:
A B
entry subentry
1 0 1.0 1.0
2 2.0 2.0
2 0 1.0 1.0
1 2.0 2.0

group dataframe if the column has the same value in consecutive order

Let's say I have a dataframe that looks like the one below:
I want to assign my assets to one group if they have consecutive treatments. If there are two consecutive assets without treatment after them, we can still assign them to the same group. However, if there are more than two assets without treatment, those assets (without treatment) will have an empty group, and the next assets that have treatment will be assigned to a new group.
You can use a rolling check of whether there was at least one Y in the last N occurrences.
I am providing two options depending on whether or not it's important not to label the leading/trailing Ns:
# maximal number of days without treatment
# to remain in same group
N = 2
m = df['Treatment'].eq('Y')
group = m.rolling(N+1, min_periods=1).max().eq(0)
group = (group & ~group.shift(fill_value=False)).cumsum().add(1)
df['group'] = group
# don't label leading/trailing N
m1 = m.groupby(group).cummax()
m2 = m[::-1].groupby(group).cummax()
df['group2'] = group.where(m1&m2)
print(df)
To handle the last NaNs separately:
m3 = ~m[::-1].cummax()
df['group3'] = group.where(m1&m2|m3)
Output:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 1.0 1.0
4 N 1 1.0 1.0
5 Y 1 1.0 1.0
6 Y 1 1.0 1.0
7 Y 1 1.0 1.0
8 N 1 NaN NaN
9 N 1 NaN NaN
10 N 2 NaN NaN
11 Y 2 2.0 2.0
12 Y 2 2.0 2.0
13 Y 2 2.0 2.0
14 Y 2 2.0 2.0
15 N 2 NaN 2.0
Other example for N=1:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 NaN NaN
4 N 2 NaN NaN
5 Y 2 2.0 2.0
6 Y 2 2.0 2.0
7 Y 2 2.0 2.0
8 N 2 NaN NaN
9 N 3 NaN NaN
10 N 3 NaN NaN
11 Y 3 3.0 3.0
12 Y 3 3.0 3.0
13 Y 3 3.0 3.0
14 Y 3 3.0 3.0
15 N 3 NaN 3.0
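For reference, the frame used in both examples can be rebuilt from the Treatment column shown above (a sketch):
import pandas as pd

df = pd.DataFrame({'Treatment': list('YYYNNYYYNNNYYYYN')})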

mode returns Exception: Must produce aggregated value

for this dataframe
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line returns "Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns 2 or more values; check the solution with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
And pandas agg needs a scalar output, so it raises an error. If you select the first value, it works fine:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
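If you would rather keep every mode per group instead of just the first one, one option (a sketch) is to aggregate into a plain Python list, which agg accepts as a scalar-like result:
# keep all modes per group as a list
bii4 = df.groupby('ii')['values'].agg(lambda x: x.mode().tolist())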

How to do nested groupby operations in a dataframe based on a condition on a column?

I have to find the number of cycles within a column in my data frame (a cycle is defined as the variable going from its initial value to some max value and then starting again from an initial value). Whenever the variable has repeated values, I just average over them. In the desired data frame, I am appending the cycle number to the SNo as a suffix to know which cycle the given SNo is in. I need to get the min and the max for a given cycle and SNo (they are not predefined).
An example of the data frame and the desired data frame are as follows:
SNo VarPer Value
1000 0 1.2
1000 1 2.2
1000 2 3.2
1000 3 4.2
1000 4 5.2
1000 4 6.2
1000 5 7.2
1000 5 8.2
1000 0 0.9
1000 1 1.9
1000 2 2.9
1000 3 3.9
1000 3 4.9
1000 4 5.9
1001 0 0.5
1001 1 1.5
1001 2 2.5
1001 2 3.5
1001 0 1
1001 1 1
1001 2 1
SNo VarPer Value
1000_1 0 1.2
1000_1 1 2.2
1000_1 2 3.2
1000_1 3 4.2
1000_1 4 5.7
1000_1 5 7.7
1000_2 0 0.9
1000_2 1 1.9
1000_2 2 2.9
1000_2 3 4.4
1000_2 4 5.9
1001_1 0 0.5
1001_1 1 1.5
1001_1 2 3
1001_2 0 1
1001_2 1 1
1001_2 2 1
I have already tried the following:
y = dat.groupby(['SNo','VarPer'], as_index=False)['Value'].mean()
But this is grouping the entire thing without considering the cycles. I have about 70000 rows of data, so I need something that isn't terribly slow. Please help!
As @Peter Leimbigler noted, I'm also not clear about the logic for how the suffix is generated. I would think 1000_3 through 1000_6 should all be 1000_2.
To use a groupby, you will need to create a new grouping with something like this:
for _, values in df.groupby('SNo'):
    group_label = 0
    for row in values.index:
        if df.loc[row, 'VarPer'] != 0:
            df.loc[row, 'group'] = group_label
        else:
            group_label += 1
            df.loc[row, 'group'] = group_label
EDIT: You probably shouldn't use a loop for writing directly to the dataframe. Instead, you can create a list and then create a new column using that list. This will be faster.
new_grouping = []
for _, values in df.groupby('SNo'):
    label = 0
    group = []
    for row in values.index:
        if df.loc[row, 'VarPer'] != 0:
            group.append(label)
        else:
            label += 1
            group.append(label)
    new_grouping.extend(group)
df['group'] = new_grouping
That won't be fast but perhaps you (or someone else) can vectorize it.
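If it turns out to matter, one possible vectorization (a sketch, not part of the original answer) uses the fact that a new cycle starts exactly when VarPer returns to 0, so a cumulative count of zeros within each SNo reproduces the loop's labels:
# cumulative count of VarPer == 0 within each SNo gives the cycle label
df['group'] = df['VarPer'].eq(0).astype(int).groupby(df['SNo']).cumsum()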
Then you can use a groupby to get your averaged values:
df = df.groupby(['SNo', 'group', 'VarPer'], as_index=False)['Value'].mean()
If your suffixes are actually supposed to be as I describe above, you can do:
df['SNo'] = df['SNo'].map(str) + '_' + df['group'].map(lambda x: str(int(x)))
This will give you:
SNo group VarPer Value
1000_1 1.0 0 1.2
1000_1 1.0 1 2.2
1000_1 1.0 2 3.2
1000_1 1.0 3 4.2
1000_1 1.0 4 5.7
1000_1 1.0 5 7.7
1000_2 2.0 0 0.9
1000_2 2.0 1 1.9
1000_2 2.0 2 2.9
1000_2 2.0 3 4.4
1000_2 2.0 4 5.9
1001_1 1.0 0 0.5
1001_1 1.0 1 1.5
1001_1 1.0 2 3.0
1001_2 2.0 0 1.0
1001_2 2.0 1 1.0
1001_2 2.0 2 1.0

Converting a flat table of records to an aggregate dataframe in Pandas [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a flat table of records about objects. Objects have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects; however, they are not of interest at present. So, basically, the data looks like this:
Id ObjName XT ObjType ContainerId
2 name1 x1 A 2
3 name2 x5 B 2
22 name5 x3 D 7
25 name6 x2 E 7
35 name7 x3 G 7
..
..
92 name23 x2 A 17
95 name24 x8 B 17
99 name25 x5 A 21
What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.
So, I am looking to convert the above data to the form below:
ObjType A B C D E F G
ContainerId
2 2.0 1.0 1.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 1.0 2.0 1.0 1.0
9 1.0 1.0 0.0 1.0 0.0 0.0 0.0
11 0.0 0.0 0.0 2.0 3.0 1.0 1.0
14 1.0 1.0 0.0 1.0 0.0 0.0 0.0
17 1.0 1.0 0.0 0.0 0.0 0.0 0.0
21 1.0 0.0 0.0 0.0 0.0 0.0 0.0
This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)
import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')
#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()
#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'.
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])
#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()
#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)
Ask: Can someone please share other possible approaches that could perform this transformation?
This is a use case for pd.crosstab (see the docs).
e.g.
In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]:
ObjType A B D E G
ContainerId
2 1 1 0 0 0
7 0 0 1 1 1
17 1 1 0 0 0
21 1 0 0 0 0
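For completeness, an equivalent table can also be produced with groupby + size + unstack, or with pivot_table (a sketch using the same rdf as in the question):
# count (ContainerId, ObjType) pairs and spread ObjType into columns
cdf = rdf.groupby(['ContainerId', 'ObjType']).size().unstack(fill_value=0)

# or, equivalently
cdf = rdf.pivot_table(index='ContainerId', columns='ObjType',
                      aggfunc='size', fill_value=0)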