I want to build a correspondence between col1 and col2 with a certain rule.
Label1 is like an on switch, and label2 is like an off switch. Once label1 is on, further operations on label1 will not re-open the switch until it is switched off by label2. Then label1 can switch on again.
For example, I have the following table:
index label1 label2 note
1 F T label2 is invalid because not switched on yet
2 T F label1 switch on
3 F F
4 T F useless action because it's on already
5 F T switch off
6 F F
7 T F switch on
8 F F
9 F T switch off
10 F F
11 F T invalid off operation, not on
The correct output is something like:
label1ix label2ix
2 5
7 9
What I tried is:
df['label2ix'] = df.loc[df.label2 == 'T', 'index']  # find the label2=='T' index
df['label2ix'].bfill(inplace=True)                  # backfill the column
mask = (df['label1'] == 'T')                        # label1=='T', then get the index and label2ix
newdf = pd.DataFrame(df.loc[mask, ['index', 'label2ix']])
This is not correct, because what I got is:
label1ix label2ix note
2 5 correct
4 5 wrong operation
7 9 correct
I am not sure how to filter out row 4.
I have another idea:
df['label2ix'] = df.loc[df.label2 == 'T', 'index']  # find the label2=='T' index
df['label2ix'].bfill(inplace=True)                  # backfill the column
groups = df.groupby('label2ix')
firstlabel1 = groups['label1'].first()
But for this solution, I don't know how to get the first label1=='T' in each group.
And I am not sure whether there is a more efficient way to do this; grouping is usually slow.
Not tested yet, but here are a few things you can try:
Option 1: For your first approach, you can filter out row 4 with:
newdf.groupby('label2ix').min()
but this approach might not work with more general data.
Option 2: This might work better in general:
# copy all on and off switches to a common column
# 0 - off, 1 - on
df['state'] = np.select([df.label1=='T', df.label2=='T'], [1,0], default=np.nan)
# ffill will fill the na with the state before it
# until changed by a new switch
df['state'] = df['state'].ffill().fillna(0)
# mark the changes of states
df['change'] = df['state'].diff()
At this point, df will be:
index label1 label2 state change
0 1 F T 0.0 NaN
1 2 T F 1.0 1.0
2 3 F F 1.0 0.0
3 4 T F 1.0 0.0
4 5 F T 0.0 -1.0
5 6 F F 0.0 0.0
6 7 T F 1.0 1.0
7 8 F F 1.0 0.0
8 9 F T 0.0 -1.0
9 10 F F 0.0 0.0
10 11 F T 0.0 0.0
which makes it easy to track all the state changes:
switch_ons = df.loc[df['change'].eq(1), 'index']
switch_offs = df.loc[df['change'].eq(-1), 'index']
# build the result frame
new_df = pd.DataFrame({'label1ix': switch_ons.values,
                       'label2ix': switch_offs.values})
and output:
label1ix label2ix
0 2 5
1 7 9
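Putting Option 2 together end to end, a self-contained sketch on the sample data (the 'T'/'F' string encoding and the column names are assumed from the table in the question):

```python
import numpy as np
import pandas as pd

# sample data from the question, with labels stored as 'T'/'F' strings
df = pd.DataFrame({
    'index':  range(1, 12),
    'label1': list('FTFTFFTFFFF'),
    'label2': list('TFFFTFFFTFT'),
})

# 1 = switched on, 0 = switched off; everything else keeps the previous state
df['state'] = np.select([df.label1 == 'T', df.label2 == 'T'], [1, 0], default=np.nan)
df['state'] = df['state'].ffill().fillna(0)
df['change'] = df['state'].diff()

new_df = pd.DataFrame({
    'label1ix': df.loc[df['change'].eq(1), 'index'].values,
    'label2ix': df.loc[df['change'].eq(-1), 'index'].values,
})
print(new_df)  # label1ix 2, 7 paired with label2ix 5, 9
```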
I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then an on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof, this was a rough one, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice, because I need the total columns first. Sorry about that, but it is the best I could do. Also, if you have any questions, just comment.
df = pd.concat(
    [y.assign(**{'Total {0}'.format(x + 1): y.iloc[:, 0] + y.iloc[:, 1]})
     for x, y in df.groupby(np.arange(df.shape[1]) // 2, axis=1)],
    axis=1)
df = pd.concat(
    [y.assign(**{'Percentage_Total{0}'.format(x + 1): (y.iloc[:, 0] / y.iloc[:, 2]) * 100})
     for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)],
    axis=1)
print(df)
This groups by the columns' first level (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
I have a pandas dataframe with some very extreme values (more than 5 std away).
I want to replace, per column, each value that is more than 5 std away with the max of the other values in that column.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()),
             B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with max per column and re-sort the frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
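If you would rather not name every column in assign, the same idea can be written column-agnostically; a small sketch on the question's data, reusing the masking rule above (how you define "extreme" is still up to you):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 191], 'B': [2, 6, 8, 115, 1]})

# mask the extreme values, then fill each NaN with its column's max of the rest
s = df.mask((df - df.std()).gt(5))
s = s.fillna(s.max())
print(s)  # 191 -> 2 in A, 115 -> 8 in B
```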
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier when it lies outside a given number of standard deviations of the column mean) and then calculate a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
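For instance, to mirror the question's goal of replacing each outlier with the largest remaining value in its column, a sketch on the sample data; note the threshold here is lowered to 1.5 (an assumption just for this tiny sample, where no value actually reaches 5 z-scores):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 191], 'B': [2, 6, 8, 115, 1]})

def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 1.5      # 5 in the original question; 1.5 fits this sample

df = df.mask(outlier_mask)        # blank out the outliers...
df = df.fillna(df.max())          # ...and fill with each column's max of the rest
print(df)
```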
I have a dataframe which is classified based on three dimensions:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
When I do a rolling mean of metric d with the following command:
>>> df.d.rolling(window = 3).mean()
0 NaN
1 NaN
2 2.0
Name: d, dtype: float64
but what I actually want is a rolling mean whose window is at most the given number: the first entry should be the number itself, the second entry should average the two values seen so far, and from the third entry onwards it should be the running average of the previous 3 values.
So the result I am expecting is:
for the dataframe:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
>>> df.d.rolling(window = 3).mean()
0 1 # the first entry: the average of a single number is the number itself
1 1.5 # average of 1 and 2, as the rolling criterion is <= 3
2 2.0 # from here on there are 3 elements, so the general pattern follows
Name: d, dtype: float64
Is it possible to roll this way?
I was able to roll using the following command:
>>> df.d.rolling(min_periods = 1, window = 3).mean()
0 1.0
1 1.5
2 2.0
Name: d, dtype: float64
with the help of min_periods one can specify the minimum number of observations required in the rolling window.
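A minimal check of this behaviour on the question's column d:

```python
import pandas as pd

d = pd.Series([1, 2, 3])
print(d.rolling(window=3, min_periods=1).mean().tolist())  # [1.0, 1.5, 2.0]
```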
I am new to Python and lost as to how to approach this problem: I have a dataframe where the information I need is mostly grouped in layers of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows and filter for each ID, unstacking the rows into a new dataframe. I could not get far with the unstack or groupby functions. Is there an easy function or combination that can accomplish this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has the information in 4 rows, but not always. The resulting dataframe should have only one row per ID occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value; GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with the non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With unequal number of rows per line
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID  col1  col2
A   1     2
A   3     4
B   5     nan
B   7     8
B   9     10
B   nan   12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
I'm trying to convert an Excel "normal distribution" formula into Python:
(1-NORM.DIST(a+col,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
For example: Here's my given df
Id a b c
ijk 4 3.5 12.53
xyz 12 3 10.74
My goal:
Id a b c 0 1 2 3
ijk 4 3.5 12.53 1 .93 .87 .81
xyz 12 3 10.74 1 .87 .76 .66
Here's the math behind it:
column 0: always 1
column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
This is what I have so far:
df1 = pd.DataFrame(df, columns=np.arange(0,4))
result = pd.concat([df, df1], axis=1, join_axes=[df.index])
result[0] = 1
I'm not sure what to do after this.
This is how I use the normal distribution function:
https://support.office.com/en-us/article/normdist-function-126db625-c53e-4591-9a22-c9ff422d6d58
Many many thanks!
NORM.DIST(..., TRUE) means the cumulative distribution function and 1 - NORM.DIST(..., TRUE) means the survival function. These are available under scipy's stats module (see ss.norm). For example,
import scipy.stats as ss
ss.norm.cdf(4, 3.5, 12.53)
Out:
0.51591526057026538
For your case, you can first define a function:
def normalize(a, b, c, col):
    return ss.norm.sf(a + col, b, c) / ss.norm.sf(a, b, c)
and call that function with apply:
for col in range(4):
    df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)
df
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
This is not the most efficient approach, as it calculates the survival function for the same values again and involves two loops. One level of loops can be omitted by passing an array of values to ss.norm.sf:
out = df.apply(
    lambda x: pd.Series(
        ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
    ), axis=1
)
Out:
0 1 2 3
0 1.0 0.934455 0.869533 0.805636
1 1.0 0.875050 0.760469 0.656303
And you can use join to add this to your original DataFrame:
df.join(out)
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
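If the remaining apply loop matters for larger frames, the whole computation can be done in one broadcasted call, since ss.norm.sf accepts arrays; a sketch assuming a, b, c are numeric columns as above:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

df = pd.DataFrame({'Id': ['ijk', 'xyz'],
                   'a': [4, 12], 'b': [3.5, 3.0], 'c': [12.53, 10.74]})

a = df['a'].to_numpy()[:, None]   # shape (n, 1), broadcasts against np.arange(4)
b = df['b'].to_numpy()[:, None]
c = df['c'].to_numpy()[:, None]

# (n, 4) array of survival-function ratios in a single vectorized call
vals = ss.norm.sf(a + np.arange(4), b, c) / ss.norm.sf(a, b, c)

result = df.join(pd.DataFrame(vals, index=df.index))
print(result)
```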