Average of certain values in pandas dataframe with if condition - pandas

index  column 1  column 2
1      1         1.2
2      1.2       1.5
3      2.2       2.5
4      3         3.1
5      3.3       3.5
6      3.6       3.8
7      3.9       4.0
8      4.0       4.0
9      4.0       4.1
10     4.1       4.0
I created a moving average with df.rolling(). But I only want the average of the "constant" values (here around 4), i.e. the ones that no longer change by more than 10%.
My first approach was to try an if condition, but my attempts to average only certain values in the column failed.
Does anyone have ideas?
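One possible approach (not from the original thread, just a sketch built from the example data above): keep only the rows whose relative change from the previous row is below the 10% threshold mentioned in the question, then average them.
import pandas as pd

df = pd.DataFrame({'column 1': [1, 1.2, 2.2, 3, 3.3, 3.6, 3.9, 4.0, 4.0, 4.1],
                   'column 2': [1.2, 1.5, 2.5, 3.1, 3.5, 3.8, 4.0, 4.0, 4.1, 4.0]},
                  index=range(1, 11))

# relative change from the previous row; rows changing by less than 10% count as "constant"
stable = df['column 1'][df['column 1'].pct_change().abs() < 0.10]
print(stable.mean())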

Related

How to shift all the values from a certain point of the dataframe to the right?

Example:
I have this dataset
A B C D E
0 0.1 0.2 0.3 0.4 0.5
1 1.1 1.2 1.3 1.4 1.5
2 2.1 2.2 2.4 2.5 2.6
3 3.1 3.2 3.4 3.5 3.6
4 4.1 4.2 4.4 4.5 4.6
5 5.1 5.2 5.3 5.4 5.5
What I would like to have is:
A B C D E
0 0.1 0.2 0.3 0.4 0.5
1 1.1 1.2 1.3 1.4 1.5
2 2.1 2.2 2.4 2.5 2.6
3 3.1 3.2 3.4 3.5 3.6
4 4.1 4.2 4.4 4.5 4.6
5 5.1 5.2 5.3 5.4 5.5
So I need to shift only certain rows and only certain columns to the right.
Not all the lines and columns have to be affected by that shift. I hope it's clear, thank you.
Pandas is a lovely way to solve this. Use .loc to select the rows and columns and .shift() to move them to the right.
import pandas as pd
df.loc[2:4, ['C','D']] = df.loc[2:4, ['C','D']].shift(1, axis=1)
If you share your dataframe code to define df, I can fully test the loc/shift solution.
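A self-contained version of the above, built from the example data in the question (the choice of rows 2:4 and columns ['C', 'D'] is only illustrative, since the desired output was not fully specified):
import pandas as pd

df = pd.DataFrame({'A': [0.1, 1.1, 2.1, 3.1, 4.1, 5.1],
                   'B': [0.2, 1.2, 2.2, 3.2, 4.2, 5.2],
                   'C': [0.3, 1.3, 2.4, 3.4, 4.4, 5.3],
                   'D': [0.4, 1.4, 2.5, 3.5, 4.5, 5.4],
                   'E': [0.5, 1.5, 2.6, 3.6, 4.6, 5.5]})

# shift the selected block one column to the right: D takes C's values, C becomes NaN
df.loc[2:4, ['C', 'D']] = df.loc[2:4, ['C', 'D']].shift(1, axis=1)
print(df)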

applying vlookup to every element of pandas dataframe

I have two dataframes, one of which is the source (src) and the other the destination (dest).
dest.tail()
Out[166]:
Item AJ AM AO AR BA BO BR BU BY CA ... TJ TK TR
time ...
2020-06-26 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-29 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-06-30 3.5 4.5 5.5 7.5 4.5 7.5 7 NaN 7.0 5.5 ... 7 7.5 3.5
2020-07-01 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
2020-07-02 3.5 4.5 5.5 1.5 4.5 7.5 7 NaN 2.5 5.5 ... 7 7.5 3.5
src.tail()
Out[167]:
1.00 1.25 1.50 1.75 ... 10.00 10.25
time
2020-06-29 0.153556 0.159041 0.162370 0.164580 ... 0.643962 0.658646
2020-06-30 0.156180 0.159280 0.161534 0.163746 ... 0.660171 0.675189
2020-07-01 0.156947 0.163433 0.168326 0.171734 ... 0.687046 0.701364
2020-07-02 0.152465 0.153910 0.154862 0.155750 ... 0.676183 0.690475
2020-07-03 0.154169 0.153923 0.154868 0.155751 ... 0.676537 0.690816
For each value in dest, I want to replace it with the value from the src table that has the same index and whose column name equals that value.
E.g. the value for AJ on '2020-06-26' in the dest table is currently 3.5. I want to replace it with the value in the src table at index '2020-06-26' and column 3.5.
I was thinking of using applymap, but it doesn't seem to have a concept of index.
dest.applymap(lambda x: src.loc[x.index][x]).tail()
AttributeError: ("'numpy.float64' object has no attribute 'index'", u'occurred at index AJ')
I then tried using apply and it worked like this:
dest1 = dest.replace(0,np.nan).fillna(1) # 0 and nan are not in src.columns
df= dest1.apply(lambda x: [src[col].loc[row] for row, col in zip(x.index,x)], axis=0).tail()
Two questions on this:
Is there a better solution to this instead of doing a list comprehension within apply?
Is there a better way of handling values in dest that are not in src.columns (like 0 and nan) so the output is nan when that's the case?
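One possible alternative (a sketch, not from the original thread, assuming src.columns are numeric as they appear in src.tail()): flatten dest, find each value's position in src.columns with get_indexer, and pick the matching cells from src with plain numpy indexing; anything not found (0, NaN, or dates missing from src) stays NaN, which addresses both questions at once.
import numpy as np
import pandas as pd

col_pos = src.columns.get_indexer(dest.to_numpy().ravel())              # -1 where the value is not a src column
row_pos = np.repeat(src.index.get_indexer(dest.index), dest.shape[1])   # -1 where the date is not in src

out = np.full(col_pos.shape, np.nan)
ok = (col_pos >= 0) & (row_pos >= 0)
out[ok] = src.to_numpy()[row_pos[ok], col_pos[ok]]

result = pd.DataFrame(out.reshape(dest.shape), index=dest.index, columns=dest.columns)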

Reverse Rolling mean for DataFrame

I am trying to create a fixture difficulty grid using a DataFrame. I want the mean for the next 5 fixtures for each team.
I’m currently using df.rolling(5, min_periods=1).mean().shift(-4). This works at the start but produces NaNs at the end. I understand why NaNs are returned: there is nothing left to shift up. Ideally I’d like the NaNs to become the mean across the remaining values, with the value against 38 just being its current value.
Fixture difficulties
ARS AVL BHA BOU
3 4 3 2
2 2 2 2
5 2 2 4
4 2 5 3
3 2 2 2
Mean of next 5 fixtures
ARS AVL BHA BOU
3.4 2.4 2.8 2.6
3.2 2.4 2.8 2.6
3.6 2.4 3.2 2.6
3 2.4 3.6 2.6
2.6 2.4 3 2.4
NaN on the last records as there is nothing to shift up.
3.2 3.6 2.8 3.6
nan nan nan nan
nan nan nan nan
nan nan nan nan
nan nan nan nan
Can I adapt this approach or need a different one altogether to populate the NANs?
IIUC you need to reverse the values by indexing, apply rolling, and reverse back:
df1 = df.iloc[::-1].rolling(5, min_periods=1).mean().iloc[::-1]
print (df1)
ARS AVL BHA BOU
0 3.4 2.4 2.80 2.60
1 3.5 2.0 2.75 2.75
2 4.0 2.0 3.00 3.00
3 3.5 2.0 3.50 2.50
4 3.0 2.0 2.00 2.00

How to plot values from the DataFrame? Python 3.0

I'm trying to plot the values from the A column against the index (of the DataFrame), but it doesn't let me. How can I do it?
INDEX is the index from the DataFrame and not the declared variable.
You need to plot column A only; the index is used for x and the values for y by default in Series.plot:
#line is default method, so omitted
Test['A'].plot(style='o')
Another solution is reset_index to turn the index into a column and then DataFrame.plot:
Test.reset_index().plot(x='index', y='A', style='o')
Sample:
import pandas as pd

Test = pd.DataFrame({'A':[3.0,4,5,10], 'B':[3.0,4,5,9]})
print (Test)
A B
0 3.0 3.0
1 4.0 4.0
2 5.0 5.0
3 10.0 9.0
Test['A'].plot(style='o')
print (Test.reset_index())
index A B
0 0 3.0 3.0
1 1 4.0 4.0
2 2 5.0 5.0
3 3 10.0 9.0
Test.reset_index().plot(x='index', y='A', style='o')

Pandas programming model for the rolling window indexing

I need advice on the programming pattern and use of DataFrame for our data. We have thousands of small ASCII files that are the results of particle tracking experiments (see www.openptv.net for details). Each file is a list of particles identified and tracked in that time instance. The name of the file is the number of the frame. For example:
ptv_is.10000 (i.e. frame no. 10000)
prev next x y z
-1 5 0.0 0.0 0.0
0 0 1.0 1.0 1.0
1 1 2.0 2.0 2.0
2 2 3.0 3.0 3.0
3 -2 4.0 4.0 4.0
ptv_is.10001 (i.e. next time frame, 10001)
1 2 1.1 1.0 1.0
2 8 2.0 2.0 2.0
3 14 3.0 3.0 3.0
4 -2 4.0 4.0 4.0
-1 3 1.5 1.12 1.32
0 -2 0.0 0.0 0.0
The columns of the ASCII files are: prev is the row number of the particle in the previous frame, next is the row number of the particle in the next frame, and x, y, z are the coordinates of the particle. If 'prev' is -1, the particle appeared in the current frame and doesn't have a link back in time. If 'next' is -2, the particle doesn't have a link forward in time and the trajectory ends in this frame.
So we read these files into a single DataFrame with the same column headers, plus we add a time column, i.e. the frame number (a reading sketch follows the table below):
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
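A sketch of the reading step described above (it assumes whitespace-separated columns in the order shown and no header line; the real ptv_is files may need different read_csv arguments):
import glob
import pandas as pd

frames = []
for fname in sorted(glob.glob('ptv_is.*'), key=lambda n: int(n.rsplit('.', 1)[-1])):
    frame_no = int(fname.rsplit('.', 1)[-1])          # frame number taken from the file extension
    f = pd.read_csv(fname, sep=r'\s+', header=None,
                    names=['prev', 'next', 'x', 'y', 'z'])
    f['time'] = frame_no
    frames.append(f)
df = pd.concat(frames, ignore_index=True)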
Now the step were I find it difficult to find the best way of using DataFrame. If we could add an additional column, called trajectory_id, we'd be able later to reindex this DataFrame either by time (creating sub-groups of the particles in single time instance and learn their spatial distributions) or by the trajectory_id and then create trajectories (or linked particles and learn about their time evolution in space, e.g. x(t), y(t), z(t) for the same trajectory_id).
If the input is:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
Then the result I need is:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1
0 0 1.0 1.0 1.0 10000 2
1 1 2.0 2.0 2.0 10000 3
2 2 3.0 3.0 3.0 10000 4
3 -2 4.0 4.0 4.0 10000 -999
1 2 1.1 1.0 1.0 10001 2
2 8 2.0 2.0 2.0 10001 3
3 14 3.0 3.0 3.0 10001 4
-1 -2 4.0 4.0 4.0 10001 -999
-1 3 1.5 1.1 1.3 10001 5
0 -2 0.0 0.0 0.0 10001 1
which means:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1 < - appeared first time, new id
0 0 1.0 1.0 1.0 10000 2 < - the same
1 1 2.0 2.0 2.0 10000 3 <- the same
2 2 3.0 3.0 3.0 10000 4 <- the same
3 -2 4.0 4.0 4.0 10000 -999 <- sort of NaN, there is no link in the next frame
1 2 1.1 1.0 1.0 10001 2 <- from row #1 in the time 10000, has an id = 2
2 8 2.0 2.0 2.0 10001 3 <- row #2 at previous time, has an id = 3
3 14 3.0 3.0 3.0 10001 4 < from row # 3, next on the row #14, id = 4
-1 -2 4.0 4.0 4.0 10001 -999 <- not linked, marked as NaN or -999
-1 3 1.5 1.1 1.3 10001 5 <- new particle, new id = 5 (new trajectory_id)
0 -2 0.0 0.0 0.0 10001 1 <- from row #0 id = 1
Hope this explains better what I'm looking for. The only problem is that I do not know how to have a rolling function through the rows of a DataFrame table, creating a new index column, trajectory_id.
For example, the simple application with lists is shown here:
http://nbviewer.ipython.org/7020209
Thanks for every hint on pandas use,
Alex
Neat! This problem is close to my heart; I also use pandas for particle tracking. This is not exactly the same problem I work on, but here's an untested sketch that offers some helpful pandas idioms.
import numpy as np
import pandas as pd

# list_of_dataframes holds one per-frame DataFrame (prev, next, x, y, z, time) each
results = []
first_loop = True
next_id = None
for frame_no, frame in pd.concat(list_of_dataframes).groupby('time'):
    frame = frame.reset_index(drop=True)
    if first_loop:
        # every particle in the first frame starts a new trajectory
        frame['traj_id'] = np.arange(len(frame))
        results.append(frame)
        next_id = len(frame)
        first_loop = False
        continue
    prev_frame = results[-1]
    has_matches = frame['prev'] >= 0  # boolean indexer; prev == -1 means a new particle
    # carry over the trajectory id from the row of the previous frame that 'prev' points to
    frame.loc[has_matches, 'traj_id'] = prev_frame['traj_id'].to_numpy()[frame.loc[has_matches, 'prev'].to_numpy()]
    count_unmatched = (~has_matches).sum()
    # unmatched particles start new trajectories with fresh ids
    frame.loc[~has_matches, 'traj_id'] = np.arange(next_id, next_id + count_unmatched)
    next_id += count_unmatched
    results.append(frame)
pd.concat(results)
If I understand correctly, you want to track the position of particles in space across time. You are dealing with data of five dimensions, so maybe a DataFrame is not the best structure for your problem and you may think about a panel structure, or a reduction of the data.
Taking one particle, you have two possibilities: either treat the coordinates as three different values, so you need three fields, or treat them as a whole, e.g. a tuple or a point object.
In the first case you have time plus three values, so you have four axes and need a DataFrame. In the second case you have two axes, so you can use a Series.
For multiple particles just use a particle_id and put all the DataFrames in a Panel or the Series in a DataFrame.
Once you know what data structure to use then it's time to put data in.
Read the files sequentially and build a collection of 'live' particles, e.g.:
{particle_id1: { time1: (x1,y1,z1), time2: (x2,y2,z2), ...}, ...}
When a new particle is detected (-1 in prev), assign it a new particle_id and put it in the collection. When a particle 'dies', pop it out of the collection, put the data in a Series, and then add this Series to a particle DataFrame (or DataFrame / Panel).
You could also keep an index of particle ids and the next field to help with recognizing ids:
{ next_position_of_last_file: particle_id, ... }
or
{ position_in_last_file: particle_id, ...}
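An untested sketch of the 'live particles' bookkeeping described above, assuming the combined DataFrame df (prev, next, x, y, z, time) from the question:
import pandas as pd

live = {}            # row position in the previous frame -> particle_id
trajectories = {}    # particle_id -> {time: (x, y, z)}
next_id = 0

for time, frame in df.groupby('time'):
    new_live = {}
    for pos, row in enumerate(frame.itertuples(index=False)):
        if row.prev == -1 or row.prev not in live:
            pid = next_id                 # particle appears in this frame: new id
            next_id += 1
        else:
            pid = live[row.prev]          # continue the trajectory from the previous frame
        trajectories.setdefault(pid, {})[time] = (row.x, row.y, row.z)
        if row.next != -2:
            new_live[pos] = pid           # still alive: remember its row for the next frame's 'prev'
    live = new_live

# each trajectory can then become a small DataFrame indexed by time
traj_frames = {pid: pd.DataFrame.from_dict(d, orient='index', columns=['x', 'y', 'z'])
               for pid, d in trajectories.items()}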