calculate the number and length of non-NaN runs in a pandas row

I have:
pd.DataFrame({'A':[1,2,3],'B':[np.NaN,4,5],'C':[6,7,np.NaN],'D':[8,9,np.NaN],'E':[np.NaN,11,12],'F':[13,14,15]})
A B C D E F
0 1 NaN 6.0 8 NaN 13
1 2 4.0 7.0 9 11.0 14
2 3 5.0 NaN NaN 12.0 15
and I want to calculate the number of non-NaN sequences (cnt) and the length of each run (runs). I.e., row 0 has non-NaN sequences of lengths 1, 2, and 1, for a total of 3 sequences.
pd.DataFrame({'runs':[[1,2,1],[5],[2,2]],'cnt':[3,1,2]})
runs cnt
0 [1, 2, 1] 3
1 [5] 1
2 [2, 2] 2
Any suggestions?

We can stack with dropna=False, then groupby with subgroups created by cumsum over isna:
s = df.stack(dropna=False).reset_index(name='value')
out = s[s['value'].notna()].groupby([s['level_0'],s['value'].isna().cumsum()]).size().groupby(level=[0]).agg([list,len])
out
Out[269]:
list len
level_0
0 [1, 2, 1] 3
1 [6] 1
2 [2, 2] 2
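For comparison, a per-row alternative (a minimal sketch, not from the answer above) computes the run lengths with itertools.groupby on each row's notna mask; note it gives [6] for row 1, since all six values there are non-NaN:
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,2,3],'B':[np.nan,4,5],'C':[6,7,np.nan],
                   'D':[8,9,np.nan],'E':[np.nan,11,12],'F':[13,14,15]})

def runs_per_row(row):
    # lengths of consecutive non-NaN runs in one row
    lengths = [sum(1 for _ in grp) for key, grp in itertools.groupby(row.notna()) if key]
    return pd.Series({'runs': lengths, 'cnt': len(lengths)})

print(df.apply(runs_per_row, axis=1))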

Related

Setting multiple columns at once gives "Not in index" error

import pandas as pd
df = pd.DataFrame(
    [
        [5, 2],
        [3, 5],
        [5, 5],
        [8, 9],
        [90, 55]
    ],
    columns=['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
= 'overspeed', df['max_speed'] - df['shield']
I am setting multiple columns using .loc as above, but in some cases I get a "Not in index" error. Am I doing something wrong?
Create a list of tuples with as many entries as there are True values in the mask, pairing the repeated scalar 'overspeed' with the filtered difference Series:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
Another idea, with a helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
Simplest is to assign separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
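Not from the answers above, just a sketch of an assign-based alternative that builds both columns in one call; .where(m) leaves NaN where the mask is False:
import pandas as pd

df = pd.DataFrame([[5, 2], [3, 5], [5, 5], [8, 9], [90, 55]],
                  columns=['max_speed', 'shield'])
m = df.max_speed > df.shield

# broadcast the scalar label to a Series, then blank out rows where the mask is False
out = df.assign(
    stat=pd.Series('overspeed', index=df.index).where(m),
    delta=(df['max_speed'] - df['shield']).where(m),
)
print(out)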

Random sampling from a dataframe

I want to generate a 2x6 dataframe which represents a rack. Half of this dataframe is filled with storage items and the other half with retrieval items.
What I want to do is randomly choose half of these 12 items, say that they are storage, and the others are retrieval.
How can I randomly choose?
I tried random.sample, but this chooses random columns. Actually, I want to choose random items individually.
Assuming this input:
0 1 2 3 4 5
0 0 1 2 3 4 5
1 6 7 8 9 10 11
You can craft a random numpy array to select/mask half of the values:
a = np.repeat([True,False], df.size//2)
np.random.shuffle(a)
a = a.reshape(df.shape)
Then select your two groups:
df.mask(a)
0 1 2 3 4 5
0 NaN NaN NaN 3.0 4 NaN
1 6.0 NaN 8.0 NaN 10 11.0
df.where(a)
0 1 2 3 4 5
0 0.0 1 2.0 NaN NaN 5.0
1 NaN 7 NaN 9.0 NaN NaN
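If the goal is a single frame labelling every cell, the same mask can drive np.where (a small sketch, with hypothetical 'storage'/'retrieval' labels):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6))

# random boolean mask selecting exactly half of the cells
a = np.repeat([True, False], df.size // 2)
np.random.shuffle(a)
a = a.reshape(df.shape)

# True -> 'storage', False -> 'retrieval'
labels = pd.DataFrame(np.where(a, 'storage', 'retrieval'),
                      index=df.index, columns=df.columns)
print(labels)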
If you simply want 6 random elements, use numpy.random.choice:
np.random.choice(df.to_numpy().ravel(), 6, replace=False)
Example:
array([ 4, 5, 11, 7, 8, 3])
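If instead you want 6 random cell positions rather than values (an assumption about what "items" means here, just a sketch), pick flat indices and unravel them:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6))

# pick 6 random flat positions, then convert them to (row, column) pairs
flat = np.random.choice(df.size, 6, replace=False)
rows, cols = np.unravel_index(flat, df.shape)
print(list(zip(rows.tolist(), cols.tolist())))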

How to build a window over the positive and negative ranges in a dataframe column?

I would like to have the average value and the max value in every positive and negative range.
From sample data below:
import pandas as pd
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2 ,4 ,7 ]
df_test = pd.DataFrame(test_list, columns=['value'])
Which gives me a dataframe like this:
value
0 -1
1 -2
2 -3
3 -2
4 -1
5 1
6 2
7 3
8 2
9 1
10 -1
11 -4
12 -5
13 2
14 4
15 7
I would like to have something like this:
AVG1 = sum([-1, -2, -3, -2, -1]) / 5 = -1.8
Max1 = -3
AVG2 = sum([1, 2, 3, 2, 1]) / 5 = 1.8
Max2 = 3
AVG3 = sum([2, 4, 7]) / 3 = 4.3
Max3 = 7
If solution need new column or new dataframe, that is ok for me.
I know that I can use .mean like here:
pandas get column average/mean with round value
But this solution gives me the average of all positive and all negative values.
How can I build some kind of window so that I can calculate the average of the first negative group, then of the second positive group, and so on?
Regards
You can create a Series with np.sign to distinguish positive and negative groups, compare shifted values and take the cumulative sum to create the group keys, and then aggregate mean and max:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df = df_test.groupby(g)['value'].agg(['mean','max'])
print (df)
mean max
value
1 -1.800000 -1
2 1.800000 3
3 -3.333333 -1
4 4.333333 7
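To see how the groups are formed, the sign series and the group keys can be inspected (an illustration only, continuing from the code above; first 8 rows of the sample data):
# each sign change starts a new group
print(df_test.assign(sign=s, group=g).head(8))
   value  sign  group
0     -1    -1      1
1     -2    -1      1
2     -3    -1      1
3     -2    -1      1
4     -1    -1      1
5      1     1      2
6      2     1      2
7      3     1      2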
EDIT:
To find the local extremes, the solution from this answer is used:
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2 ,4 ,7 ]
df_test = pd.DataFrame(test_list, columns=['value'])
from scipy.signal import argrelextrema
#https://stackoverflow.com/a/50836425
n=2 # number of points to be checked before and after
# Find local peaks
df_test['min'] = df_test.iloc[argrelextrema(df_test.value.values, np.less_equal, order=n)[0]]['value']
df_test['max'] = df_test.iloc[argrelextrema(df_test.value.values, np.greater_equal, order=n)[0]]['value']
Then the values after the extremes are replaced with missing values, separately for negative and positive groups:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df_test[['min1','max1']] = df_test[['min','max']].notna().astype(int).iloc[::-1].groupby(g[::-1]).cumsum()
df_test['min1'] = df_test['min1'].where(s.eq(-1) & df_test['min1'].ne(0))
df_test['max1'] = df_test['max1'].where(s.eq(1) & df_test['max1'].ne(0))
df_test['g'] = g
print (df_test)
value min max min1 max1 g
0 -1 NaN -1.0 1.0 NaN 1
1 -2 NaN NaN 1.0 NaN 1
2 -3 -3.0 NaN 1.0 NaN 1
3 -2 NaN NaN NaN NaN 1
4 -1 NaN NaN NaN NaN 1
5 1 NaN NaN NaN 1.0 2
6 2 NaN NaN NaN 1.0 2
7 3 NaN 3.0 NaN 1.0 2
8 2 NaN NaN NaN NaN 2
9 1 NaN NaN NaN NaN 2
10 -1 NaN NaN 1.0 NaN 3
11 -4 NaN NaN 1.0 NaN 3
12 -5 -5.0 NaN 1.0 NaN 3
13 2 NaN NaN NaN 1.0 4
14 4 NaN NaN NaN 1.0 4
15 7 NaN 7.0 NaN 1.0 4
So it is possible to separately aggregate the last 3 values per group with a lambda function and mean; rows with missing values in min1 or max1 are removed by default by groupby:
df1 = df_test.groupby(['g','min1'])['value'].agg(lambda x: x.tail(3).mean())
print (df1)
g min1
1 1.0 -2.000000
3 1.0 -3.333333
Name: value, dtype: float64
df2 = df_test.groupby(['g','max1'])['value'].agg(lambda x: x.tail(3).mean())
print (df2)
g max1
2 1.0 2.000000
4 1.0 4.333333
Name: value, dtype: float64
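Not part of the original answer, but if a single combined Series is wanted, the two results can be concatenated and sorted by group (a sketch building on df1 and df2 above):
# drop the helper index level and interleave the negative/positive group means
out = pd.concat([df1.droplevel(1), df2.droplevel(1)]).sort_index()
print(out)
g
1   -2.000000
2    2.000000
3   -3.333333
4    4.333333
Name: value, dtype: float64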

Is there a way to horizontally concatenate dataframes of same length while ignoring the index?

I have dataframes I want to horizontally concatenate while ignoring the index.
I know that for arithmetic operations, ignoring the index can lead to a substantial speedup if you use the numpy array .values instead of the pandas Series. Is it possible to horizontally concatenate or merge pandas dataframes whilst ignoring the index? (To my dismay, ignore_index=True does something else.) And if so, does it give a speed gain?
import pandas as pd
df1 = pd.Series(range(10)).to_frame()
df2 = pd.Series(range(10), index=range(10, 20)).to_frame()
pd.concat([df1, df2], axis=1)
# 0 0
# 0 0.0 NaN
# 1 1.0 NaN
# 2 2.0 NaN
# 3 3.0 NaN
# 4 4.0 NaN
# 5 5.0 NaN
# 6 6.0 NaN
# 7 7.0 NaN
# 8 8.0 NaN
# 9 9.0 NaN
# 10 NaN 0.0
# 11 NaN 1.0
# 12 NaN 2.0
# 13 NaN 3.0
# 14 NaN 4.0
# 15 NaN 5.0
# 16 NaN 6.0
# 17 NaN 7.0
# 18 NaN 8.0
# 19 NaN 9.0
I know I can get the result I want by resetting the index of df2, but I wonder whether there is a faster (perhaps numpy-based) method to do this?
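For reference, the reset-index route mentioned above would look like this (just the baseline to compare against):
pd.concat([df1, df2.reset_index(drop=True)], axis=1)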
np.column_stack
Absolutely equivalent to EdChum's answer.
pd.DataFrame(
    np.column_stack([df1, df2]),
    columns=df1.columns.append(df2.columns)
)
0 0
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
Pandas Option with assign
You can do many things with the new columns.
I don't recommend this!
df1.assign(**df2.add_suffix('_').to_dict('list'))
0 0_
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
A pure numpy method would be to use np.hstack:
In[33]:
np.hstack([df1,df2])
Out[33]:
array([[0, 0],
[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5],
[6, 6],
[7, 7],
[8, 8],
[9, 9]], dtype=int64)
This can easily be converted to a df by passing it as the data arg to the DataFrame ctor:
In[34]:
pd.DataFrame(np.hstack([df1,df2]))
Out[34]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
With respect to whether the data is contiguous: the individual columns will be treated as separate arrays, since a DataFrame is essentially a dict of Series. As you're passing numpy arrays, there is no memory allocation or copying needed here for a simple, homogeneous dtype, so it should be fast.

Using scalar values in series as variables in user defined function

I want to define a function that is applied element-wise to each row in a dataframe, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
as 1, 2, and 3 are all >= 1 and none of the remaining values are greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern):
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
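Not in the original answers, but the comparison methods also take an axis argument that aligns s with the index, which avoids the transposes (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))
s = np.array([1, 1000, 1000, 1000])

# align s with the row index instead of the columns, then count per row
result = df.ge(s, axis=0).sum(axis=1)
print(result)
0    3
1    0
2    0
3    0
dtype: int64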