pandas vectorize one valid signal in 5 rows

I want to find the first valid signal in the dataframe. A valid signal is defined as one with no signal in its preceding 5 rows.
The dataframe is like:
entry
0 0
1 1
2 0
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
The entry signal on row 4 is not valid because there is a signal on row 1. Every signal negates any signal in the following 5 rows.
I implemented this using an apply function with a parameter that records the signal row counter.
The code is as follows:
import pandas as pd

def testfun(row, orderinfo):
    # reset the counter once it has run past the look-back window
    if orderinfo['countrows'] > orderinfo['maxrows']:
        orderinfo['countrows'] = 0
    # a recent signal is still active: advance the counter and suppress this row
    if orderinfo['countrows'] > 0:
        orderinfo['countrows'] += 1
        row['entry'] = 0
    # a fresh signal with no active window: start counting
    if row['entry'] == 1 and orderinfo['countrows'] == 0:
        orderinfo['countrows'] += 1
    return row

if __name__ == '__main__':
    df = pd.DataFrame({'entry': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})
    orderinfo = dict(countrows=0, maxrows=5)
    df = df.apply(lambda row: testfun(row, orderinfo), axis=1)
    print(df)
The output is:
entry
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
But I am wondering if there is a vectorized way to do this, because apply is not very efficient.

IIUC, you need rolling with min_periods=1, check that the rolling sum is less than or equal to 1, and combine that with the entry column:
(df.entry.rolling(4, min_periods=1).sum().le(1) & df.entry).astype(int)
Out[595]:
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
Name: entry, dtype: int32
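
To see what the one-liner does, here is a self-contained sketch (assuming the same data) that prints the intermediate rolling sums; a window of length 4 covers the current row plus the three preceding rows, so a row keeps its signal only when it is the lone signal in that window:

import pandas as pd

df = pd.DataFrame({'entry': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})

# rolling sum of signals over the window ending at each row
rolled = df.entry.rolling(4, min_periods=1).sum()

# a row is a valid signal only if it is the lone signal in its window
valid = (rolled.le(1) & df.entry.astype(bool)).astype(int)

print(pd.DataFrame({'entry': df.entry, 'window_sum': rolled, 'valid': valid}))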

Related

Pandas iloc and conditional sum

This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
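
If you prefer to stay in pandas, the same update can also be written with DataFrame.mask; a minimal sketch, assuming the same input:

import pandas as pd

data = pd.DataFrame([[0, 1, 0, 1, 1],
                     [1, 0, 1, 0, 1]])

block = data.iloc[:, 1:]
# replace the non-zero cells with their column sums (broadcast along axis=1)
data.iloc[:, 1:] = block.mask(block != 0, block.sum(), axis=1)
print(data)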

How to remove duplicate columns generated by pd.get_dummies, using their variance as a cutoff

I have a dataframe which is being generated using pd.get_dummies as below:
df_target = pd.get_dummies(df_column[column], dummy_na=True,prefix=column)
where column is a column name and df_column is the dataframe from which each column is being pulled to do some operations.
rev_grp_m2_> 225 rev_grp_m2_nan rev_grp_m2_nan
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
Now I do a check of variance for each column generated and skip those with zero variance.
for target_column in list(df_target.columns):
    # if the variance of the created dummy is zero, skip it (and print to the log file)
    if np.var(df_target[[target_column]])[0] != 0:
        df_final[target_column] = df_target[target_column]
Here, because two columns have the same name, I get a KeyError on the np.var line.
There are two values of variance for the nan column:
rev_grp_m2_nan 0.000819
rev_grp_m2_nan 0.000000
Ideally I would like to keep the one with non-zero variance and drop/skip the one with zero variance.
Can someone please help me do this?
For variance per column, use DataFrame.var:
print (df.var())
rev_grp_m2_> 225 0.083333
rev_grp_m2_nan 0.000000
rev_grp_m2_nan 0.000000
Then filter using boolean indexing:
out = df.loc[:, df.var() != 0]
print (out)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
EDIT: You can get the indices of columns with non-zero variance and then select them with iloc:
cols = [i for i in np.arange(len(df.columns)) if np.var(df.iloc[:, i]) != 0]
print (cols)
[0]
df = df.iloc[:, cols]
print (df)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
Another idea is to filter out columns where all values are 0:
cols = [i for i in np.arange(len(df.columns)) if (df.iloc[:, i] != 0).any()]
out = df.iloc[:, cols]
Or:
out = df.loc[:, (df != 0).any()]
print (out)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
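
If instead you want to keep, for each duplicated column name, the copy with the higher variance (rather than dropping every zero-variance column), one possible sketch is to order the columns by variance and keep the first occurrence of each name (the example frame below is hypothetical):

import numpy as np
import pandas as pd

# hypothetical frame with a duplicated dummy-column name
df = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]],
                  columns=['rev_grp_m2_> 225', 'rev_grp_m2_nan', 'rev_grp_m2_nan'])

order = np.argsort(-df.var().to_numpy())     # column positions, highest variance first
out = df.iloc[:, order]
out = out.loc[:, ~out.columns.duplicated()]  # keep the first (highest-variance) copy
print(out)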

How to turn a list of events into a matrix to display in pandas

I have a list of events and I want to display on a graph how many happen per hour on each day of the week, as shown below:
[Example of the graph I want]
(each line is a day, the x axis is the time of day, the y axis is the number of events)
As I am new to pandas, I am not sure of the best way to do it, but here is mine:
x = [(rts[k].getDay(), rts[k].getHour(), 1) for k in rts]
df = pd.DataFrame(x[:30]) # Subset of 30 events
dfGrouped = df.groupby([0, 1]).sum() # Group them by day and hour
#Format to display
pd.DataFrame(np.random.randn(24, 7), index=range(0,24), columns=['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su'])
The question is: how can I go from my grouped dataframe to the 24x7 matrix required for display?
I tried as_matrix, but that gives me only a one-dimensional array, while I want the index of my dataframe to become the index of my matrix.
print(df)
2
0 1
0 19 1
23 1
1 10 2
18 3
22 1
2 17 1
3 8 2
9 3
11 3
13 1
19 1
4 7 1
9 1
14 1
15 1
18 1
5 1 2
7 1
13 1
19 1
6 12 1
Thanks for your help :)
Antoine
I think you need unstack to reshape the data, then rename the column names with a dict and, if necessary, add missing hours to the index with reindex:
df1 = df.groupby([0, 1])[2].sum().unstack(0, fill_value=0)

# or, after setting column names first:
df = pd.DataFrame(x[:30], columns=['days', 'hours', 'val'])
d = {0: 'Mo', 1: 'Tu', 2: 'We', 3: 'Th', 4: 'Fr', 5: 'Sa', 6: 'Su'}
df1 = df.groupby(['days', 'hours'])['val'].sum().unstack(0, fill_value=0)
df1 = df1.rename(columns=d).reindex(range(24), fill_value=0)
print (df1)
days Mo Tu We Th Fr Sa Su
hours
0 0 0 0 0 0 0 0
1 0 0 0 0 0 2 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
7 0 0 0 0 1 1 0
8 0 0 0 2 0 0 0
9 0 0 0 3 1 0 0
10 0 2 0 0 0 0 0
11 0 0 0 3 0 0 0
12 0 0 0 0 0 0 1
13 0 0 0 1 0 1 0
14 0 0 0 0 1 0 0
15 0 0 0 0 1 0 0
16 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0
18 0 3 0 0 1 0 0
19 1 0 0 1 0 1 0
20 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0
22 0 1 0 0 0 0 0
23 1 0 0 0 0 0 0
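
The same table can also be built with pd.crosstab; a sketch assuming the named df (with the 'days', 'hours' and 'val' columns) from the snippet above:

import pandas as pd

d = {0: 'Mo', 1: 'Tu', 2: 'We', 3: 'Th', 4: 'Fr', 5: 'Sa', 6: 'Su'}
# sum the event counts per (hour, day) cell, then fill in the missing hours
df1 = (pd.crosstab(df['hours'], df['days'], values=df['val'], aggfunc='sum')
         .fillna(0)
         .astype(int)
         .rename(columns=d)
         .reindex(range(24), fill_value=0))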

Create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works, but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(
        lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table:
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
... df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
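
If you want the _amt column names from the loop version combined with the pivot_table approach, a sketch (column order may differ from the loop output):

import pandas as pd

df = pd.DataFrame({'amount': [6, 3, 3, 7, 7],
                   'product': ['b', 'c', 'a', 'a', 'a']})

# one column per product, filled with the amount, then suffixed
wide = (df.pivot_table(index=df.index, values='amount',
                       columns='product', fill_value=0)
          .add_suffix('_amt'))
out = df.join(wide)  # keep the original columns alongside the new ones
print(out)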

Plot pandas data frame where most columns have zeros

I am new to pandas and IPython. I just set everything up and am currently playing around. I have the following data frame:
Field 10 20 30 40 50 60 70 80 90 95
0 A 0 0 0 0 0 0 0 0 1 3
1 B 0 0 0 0 0 0 0 1 4 14
2 C 0 0 0 0 0 0 0 1 2 7
3 D 0 0 0 0 0 0 0 1 5 15
4 u 0 0 0 0 0 0 0 1 5 14
5 K 0 0 0 0 0 0 1 2 7 21
6 S 0 0 0 0 0 0 0 1 3 8
7 E 0 0 0 0 0 0 0 1 3 8
8 F 0 0 0 0 0 0 0 1 6 16
I used a csv file to import this data:
df = pd.read_csv('/mycsvfile.csv', index_col=False, header=0)
As you can see, most of the columns are zero. This data frame has a large number of rows, and in a given column most of the rows can be zero while one or two rows have a value like 70.
I wonder how I can turn this into a nice graph that shows the 70, 80, 95 columns with emphasis.
I found the following tutorial: http://pandas.pydata.org/pandas-docs/version/0.9.1/visualization.html but I am still unable to get a good figure.
It depends a bit on how you want to handle the zero values, but here is an approach:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [0, 0, 0, 0, 70, 0, 0, 90, 0, 0, 80, 0, 0],
                   'b': [0, 0, 0, 50, 0, 60, 0, 90, 0, 80, 0, 0, 0]})

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# plot the original, for comparison
df.plot(ax=axs[0])

# plot only the non-zero values of each column
for name, col in df.items():
    col[col != 0].plot(ax=axs[1], label=name)

axs[1].set_xlim(df.index[0], df.index[-1])
axs[1].set_ylim(bottom=0)
axs[1].legend(loc=0)
You could also go for something with .replace(0, np.nan), but matplotlib doesn't draw lines if there are NaNs in between. So you probably end up looping over the columns anyway (and then using dropna().plot(), for example), as sketched below.
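
A minimal sketch of that replace-based variant, assuming the same df as above:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for name, col in df.replace(0, np.nan).items():
    # dropna() removes the gaps so matplotlib connects the remaining points
    col.dropna().plot(ax=ax, label=name, marker='o')
ax.legend(loc=0)
plt.show()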