How to transform summary statistics count into integers in pandas

I'm running into the following issue.
I have pulled some summary statistics from a dataframe using DataFrame.describe(). Now I'm trying to convert the number of observations (or count) into an integer. I've used the following, but it does not work:
summary_stats = df.describe()
summary_stats = summary_stats.round(2)
summary_stats.iloc[0] = summary_stats.iloc[0].astype(int)
Then, when I print out the summary statistics table, the number of observations is not an integer. Thanks a lot for your insights!

The problem is that floats and integers are mixed in the same column, so the integers get converted to floats.
A possible solution is to transpose, so that count becomes a column with integer dtype:
import pandas as pd

d = {'A':[1,2,3,4,5], 'B':[2,2,2,2,2], 'C':[3,3,3,3,3]}
df = pd.DataFrame(data=d)
summary_stats = df.describe().T
summary_stats = summary_stats.round(2)
summary_stats['count'] = summary_stats['count'].astype(int)
print (summary_stats)
   count  mean   std  min  25%  50%  75%  max
A      5   3.0  1.58  1.0  2.0  3.0  4.0  5.0
B      5   2.0  0.00  2.0  2.0  2.0  2.0  2.0
C      5   3.0  0.00  3.0  3.0  3.0  3.0  3.0
If you only need the displayed values, here is a hack: convert the values to object dtype:
summary_stats = df.describe()
summary_stats = summary_stats.round(2).astype(object)
summary_stats.iloc[0] = summary_stats.iloc[0].astype(int)
print (summary_stats)
          A    B    C
count     5    5    5
mean    3.0  2.0  3.0
std    1.58  0.0  0.0
min     1.0  2.0  3.0
25%     2.0  2.0  3.0
50%     3.0  2.0  3.0
75%     4.0  2.0  3.0
max     5.0  2.0  3.0
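If you prefer to keep the numeric dtypes and only change how the table prints, here is a rough display-only sketch (my own variant, not part of the answer above, reusing the sample df): format the count row as it is rendered.
import pandas as pd

d = {'A':[1,2,3,4,5], 'B':[2,2,2,2,2], 'C':[3,3,3,3,3]}
df = pd.DataFrame(data=d)

summary_stats = df.describe().round(2)
# Render the 'count' row without decimals; the underlying floats are untouched.
formatted = summary_stats.apply(
    lambda row: row.map('{:.0f}'.format) if row.name == 'count' else row,
    axis=1)
print (formatted)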

Related

find minimum value in a column based on a condition in another column of a dataframe?

I want to select the minimum value over a range of columns, based on a condition on another column.
0 1 2 3 4 Capacity Fixed Cost
80.0 270.0 250.0 160.0 180.0 NaN NaN
4.0 5.0 6.0 8.0 10.0 500.0 1000.0
6.0 4.0 3.0 5.0 8.0 500.0 1000.0
9.0 7.0 4.0 3.0 4.0 500.0 1000.0
I get the minimum value of a column with dv.loc[1:, i].min(), but I want to exclude rows where the Capacity is 0.
IIUC use:
df[df['Capacity'].ne(0)].min()
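As a minimal, self-contained sketch of that filter (my own illustration; column names taken from the table in the question, values abridged):
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [80.0, 4.0, 6.0, 9.0],
                   1: [270.0, 5.0, 4.0, 7.0],
                   'Capacity': [np.nan, 500.0, 500.0, 500.0],
                   'Fixed Cost': [np.nan, 1000.0, 1000.0, 1000.0]})

# Drop rows whose Capacity equals 0, then take each column's minimum.
print (df[df['Capacity'].ne(0)].min())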

pandas - how to select rows based on a conjunction over a non-indexed column?

Consider the following DataFrame -
In [47]: dati
Out[47]:
                      x      y
frame face lmark
1     NaN  NaN      NaN    NaN
300   0.0  1.0    745.0  367.0
           2.0    753.0  411.0
           3.0    759.0  455.0
2201  0.0  1.0    634.0  395.0
           2.0    629.0  439.0
           3.0    630.0  486.0
How can we select the rows where dati['x'] > 629.5 for all rows sharing the same value in the 'frame' column? For this example, I would expect the result to be
                      x      y
frame face lmark
300   0.0  1.0    745.0  367.0
           2.0    753.0  411.0
           3.0    759.0  455.0
because column 'x' of 'frame' 2201, 'lmark' 2.0 is not greater than 629.5
Use GroupBy.transform with GroupBy.all to test whether all values in each group are True, then filter with boolean indexing:
df = dati[(dati['x'] > 629.5).groupby(level=0).transform('all')]
print (df)
                      x      y
frame face lmark
300   0.0  1.0    745.0  367.0
           2.0    753.0  411.0
           3.0    759.0  455.0
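An equivalent formulation (my sketch, not from the answer above) uses GroupBy.filter, which keeps whole groups that satisfy the predicate; it is often easier to read, though typically slower than transform when there are many groups:
df = dati.groupby(level=0).filter(lambda g: (g['x'] > 629.5).all())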

Numpy or Pandas for multiple dataframes of 2darray datasets

I hope I used the correct terms in the title to describe my problem.
My data has the following structure
D = {E_1, E_2, ..., E_n} with E_i = {M_{i,1}, M_{i,2}, ..., M_{i,m}}, where each M_{i,j} is a 6x2 matrix.
I used a NumPy array with dimensions n x m x 6 x 2 to store the data. This was fine as long as every dataset E_i had the same number of matrices.
But this solution no longer works, since I now have datasets E_i with different numbers of matrices, i.e. E_i has m_i matrices.
Is there perhaps a way in pandas to solve my problem? In the end I need to access each matrix and operate on it as a NumPy array, i.e. multiplication, inverse, determinant, etc.
You could try using a MultiIndex in pandas for this. It allows you to select from the dataframe by level. A simple example of how you could achieve something like this:
import numpy as np
import pandas as pd

D = np.repeat([0, 1], 12)
E = np.repeat([0, 1, 0, 1], 6)
print(D, E)

index_cols = pd.MultiIndex.from_arrays(
    [D, E],
    names=["D_idx", "E_idx"])

M = np.ones([24, 2])
df = pd.DataFrame(M,
                  index=index_cols,
                  columns=["left", "right"])
print(df)
This gives you the dataframe:
left right
D_idx E_idx
0 0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
You can then slice the dataframe based on levels, i.e. if you want to retrieve all elements in set D_1 you can select: df.loc[[(0, 0), (0, 1)], :]
You can generate selectors like this using list(zip(d_idx, e_idx)) in order to select specific rows.
You can find more about slicing and selecting the dataframe here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
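As a rough follow-up sketch of my own (reusing the df built above, where six consecutive rows form one 6x2 block), here is how one dataset can be pulled out by level and handed back to NumPy for the matrix operations mentioned in the question. The 6x2 blocks are not square, so the example takes the determinant of M.T @ M:
import numpy as np

# All rows of the first dataset, i.e. D_idx == 0.
e0 = df.loc[0]

# One 6x2 matrix per E_idx value; hand each block back to NumPy.
for e_idx, block in e0.groupby(level='E_idx'):
    m = block.to_numpy()            # shape (6, 2)
    print(e_idx, np.linalg.det(m.T @ m))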

Pandas: replace outliers in all columns with nan

I have a data frame with 3 columns, for example:
c1,c2,c3
10000,1,2
1,3,4
2,5,6
3,1,122
4,3,4
5,5,6
6,155,6
I want to replace the outliers in all the columns which are outside 2 sigma. Using the below code, I can create a dataframe without the outliers.
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 2).all(axis=1)]
c1,c2,c3
1,3,4
2,5,6
4,3,4
5,5,6
I can find the outliers for each column separately and replace them with NaN, but that would not be the best way, as the amount of code grows with the number of columns. There must be a better way of doing this. Maybe take the boolean output from the command above and replace True with NaN.
Any suggestions, many thanks.
pandas
Use pd.DataFrame.mask
df.mask(df.sub(df.mean()).div(df.std()).abs().gt(2))
c1 c2 c3
0 NaN 1.0 2.0
1 1.0 3.0 4.0
2 2.0 5.0 6.0
3 3.0 1.0 NaN
4 4.0 3.0 4.0
5 5.0 5.0 6.0
6 6.0 NaN 6.0
numpy
v = df.values
mask = np.abs((v - v.mean(0)) / v.std(0)) > 2
pd.DataFrame(np.where(mask, np.nan, v), df.index, df.columns)
c1 c2 c3
0 NaN 1.0 2.0
1 1.0 3.0 4.0
2 2.0 5.0 6.0
3 3.0 1.0 NaN
4 4.0 3.0 4.0
5 5.0 5.0 6.0
6 6.0 NaN 6.0
lb = df.quantile(0.01)
ub = df.quantile(0.99)
df_new = df[(df < ub) & (df > lb)]
df_new
I am using a quantile-based method to detect outliers. First it calculates the lower and upper bounds of the df using the quantile function. Then, based on the condition that all values should lie between the lower bound and the upper bound, it returns a new df with the outlier values replaced by NaN.

Pandas programming model for the rolling window indexing

I need advice on the programming pattern and the use of DataFrame for our data. We have thousands of small ASCII files that are the results of particle tracking experiments (see www.openptv.net for details). Each file is a list of particles identified and tracked in that time instance. The name of the file is the number of the frame. For example:
ptv_is.10000 (i.e. frame no. 10000)
prev next x y z
-1 5 0.0 0.0 0.0
0 0 1.0 1.0 1.0
1 1 2.0 2.0 2.0
2 2 3.0 3.0 3.0
3 -2 4.0 4.0 4.0
ptv_is.10001 (i.e. the next time frame, 10001)
1 2 1.1 1.0 1.0
2 8 2.0 2.0 2.0
3 14 3.0 3.0 3.0
4 -2 4.0 4.0 4.0
-1 3 1.5 1.12 1.32
0 -2 0.0 0.0 0.0
The columns of the ASCII files are: prev is the row number of the particle in the previous frame, next is the row number of the particle in the next frame, and x, y, z are the coordinates of the particle. If 'prev' is -1, the particle appeared in the current frame and has no link back in time. If 'next' is -2, the particle has no link forward in time and the trajectory ends in this frame.
So we read these files into a single DataFrame with the same column headers, plus an added column of time, i.e. the frame number:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
Now the step were I find it difficult to find the best way of using DataFrame. If we could add an additional column, called trajectory_id, we'd be able later to reindex this DataFrame either by time (creating sub-groups of the particles in single time instance and learn their spatial distributions) or by the trajectory_id and then create trajectories (or linked particles and learn about their time evolution in space, e.g. x(t), y(t), z(t) for the same trajectory_id).
If the input is:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
Then the result I need is:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1
0 0 1.0 1.0 1.0 10000 2
1 1 2.0 2.0 2.0 10000 3
2 2 3.0 3.0 3.0 10000 4
3 -2 4.0 4.0 4.0 10000 -999
1 2 1.1 1.0 1.0 10001 2
2 8 2.0 2.0 2.0 10001 3
3 14 3.0 3.0 3.0 10001 4
-1 -2 4.0 4.0 4.0 10001 -999
-1 3 1.5 1.1 1.3 10001 5
0 -2 0.0 0.0 0.0 10001 1
which means:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1 < - appeared first time, new id
0 0 1.0 1.0 1.0 10000 2 < - the same
1 1 2.0 2.0 2.0 10000 3 <- the same
2 2 3.0 3.0 3.0 10000 4 <- the same
3 -2 4.0 4.0 4.0 10000 -999 <- sort of NaN, there is no link in the next frame
1 2 1.1 1.0 1.0 10001 2 <- from row #1 in the time 10000, has an id = 2
2 8 2.0 2.0 2.0 10001 3 <- row #2 at previous time, has an id = 3
3 14 3.0 3.0 3.0 10001 4 <- from row #3, next in row #14, id = 4
-1 -2 4.0 4.0 4.0 10001 -999 <- not linked, marked as NaN or -999
-1 3 1.5 1.1 1.3 10001 5 <- new particle, new id = 5 (new trajectory_id)
0 -2 0.0 0.0 0.0 10001 1 <- from row #0 id = 1
Hope this explains better what I'm looking for. The only problem is that I do not know how to have a rolling function through the rows of a DataFrame table, creating a new index column, trajectory_id.
For example, the simple application with lists is shown here:
http://nbviewer.ipython.org/7020209
Thanks for every hint on pandas use,
Alex
Neat! This problem is close to my heart; I also use pandas for particle tracking. This is not exactly the same problem I work on, but here's an untested sketch that offers some helpful pandas idioms.
import numpy as np
import pandas as pd

results = []
first_loop = True
next_id = None
for frame_no, frame in pd.concat(list_of_dataframes).groupby('time'):
    frame = frame.copy()  # avoid writing into a groupby view
    if first_loop:
        frame['traj_id'] = np.arange(len(frame))
        results.append(frame)
        next_id = len(frame)
        first_loop = False
        continue
    prev_frame = results[-1]
    has_matches = frame['prev'] != -1  # boolean indexer; -1 means no link back in time
    frame.loc[has_matches, 'traj_id'] = prev_frame['traj_id'].iloc[frame.loc[has_matches, 'prev']].values
    count_unmatched = (~has_matches).sum()
    frame.loc[~has_matches, 'traj_id'] = np.arange(next_id, next_id + count_unmatched)
    next_id += count_unmatched
    results.append(frame)
pd.concat(results)
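To feed that loop, list_of_dataframes needs to carry the 'time' column. Here is a rough sketch of mine for building it from the ptv_is.* files (the whitespace-separated layout and the absence of a header row are assumptions; adjust to your files):
import glob
import pandas as pd

list_of_dataframes = []
for path in glob.glob('ptv_is.*'):
    frame_no = int(path.rsplit('.', 1)[-1])   # frame number from the file suffix
    d = pd.read_csv(path, sep=r'\s+', header=None,
                    names=['prev', 'next', 'x', 'y', 'z'])
    d['time'] = frame_no
    list_of_dataframes.append(d)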
If I understand correctly, you want to track the position of particles in space across time. You are dealing with five-dimensional data, so a DataFrame may not be the best structure for your problem; you might think about a Panel structure, or a reduction of the data.
Taking one particle, you have two possibilities: treat the coordinates as three separate values, so you need three fields, or treat them as a whole, for example a tuple or a point object.
In the first case you have time plus three values, so you have four axes and you need a DataFrame. In the second case you have two axes, so you can use a Series.
For multiple particles, just use a particle_id and put all the DataFrames in a Panel, or the Series in a DataFrame.
Once you know what data structure to use then it's time to put data in.
Read the files sequentially and maintain a collection of 'live' particles, e.g.:
{particle_id1: { time1: (x1,y1,z1), time2: (x2,y2,z2), ...}, ...}
When a new particle is detected (-1 in prev), assign a new particle_id and put it in the collection. When a particle 'dies' (-2 in next), pop it out of the collection, put its data in a Series, and then add this Series to a particle DataFrame (or DataFrame / Panel).
You could also keep an index mapping positions to particle ids, using the next field, to help recognize ids:
{ next_position_of_last_file: particle_id, ... }
or
{ position_in_last_file: particle_id, ...}
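A very rough sketch of that bookkeeping in plain Python (my own illustration of the approach described above, using the two sample frames from the question; turning each finished trajectory into a Series/DataFrame is left out):
# (time, rows) pairs; each row is (prev, next, x, y, z) as in the question's files.
frames = [
    (10000, [(-1, 5, 0.0, 0.0, 0.0), (0, 0, 1.0, 1.0, 1.0), (1, 1, 2.0, 2.0, 2.0),
             (2, 2, 3.0, 3.0, 3.0), (3, -2, 4.0, 4.0, 4.0)]),
    (10001, [(1, 2, 1.1, 1.0, 1.0), (2, 8, 2.0, 2.0, 2.0), (3, 14, 3.0, 3.0, 3.0),
             (4, -2, 4.0, 4.0, 4.0), (-1, 3, 1.5, 1.12, 1.32), (0, -2, 0.0, 0.0, 0.0)]),
]

live = {}             # particle_id -> {time: (x, y, z)} for still-active particles
finished = {}         # particle_id -> completed trajectory
by_row = {}           # row number in the previous frame -> particle_id
next_particle_id = 0

for time, rows in frames:
    new_by_row = {}
    for row_no, (prev, nxt, x, y, z) in enumerate(rows):
        pid = by_row.get(prev) if prev != -1 else None
        if pid is None:                       # new particle: open a trajectory
            pid = next_particle_id
            next_particle_id += 1
            live[pid] = {}
        live[pid][time] = (x, y, z)
        if nxt == -2:                         # trajectory ends: retire it from 'live'
            finished[pid] = live.pop(pid)
        else:
            new_by_row[row_no] = pid
    by_row = new_by_row

print(finished)
print(live)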