I am new to pandas and IPython. I just set everything up and am currently playing around. I have the following data frame:
Field 10 20 30 40 50 60 70 80 90 95
0 A 0 0 0 0 0 0 0 0 1 3
1 B 0 0 0 0 0 0 0 1 4 14
2 C 0 0 0 0 0 0 0 1 2 7
3 D 0 0 0 0 0 0 0 1 5 15
4 u 0 0 0 0 0 0 0 1 5 14
5 K 0 0 0 0 0 0 1 2 7 21
6 S 0 0 0 0 0 0 0 1 3 8
7 E 0 0 0 0 0 0 0 1 3 8
8 F 0 0 0 0 0 0 0 1 6 16
I imported this data from a CSV file:
df = pd.read_csv('/mycsvfile.csv',
index_col=False, header=0)
As you can see, most of the values are zero. This data frame has a large number of rows, and in a given column most of the rows can be zero while one or two remaining rows hold a value like "70".
I wonder how I can turn this into a nice graph that shows the 70, 80, and 95 columns with emphasis.
I found the following tutorial: http://pandas.pydata.org/pandas-docs/version/0.9.1/visualization.html but I am still unable to get a good figure.
It depends a bit on how you want to handle the zero values, but here is an approach:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [0,0,0,0,70,0,0,90,0,0,80,0,0],
                   'b': [0,0,0,50,0,60,0,90,0,80,0,0,0]})

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# plot the original, for comparison
df.plot(ax=axs[0])

# plot each column with its zeros filtered out
# (iteritems was removed in pandas 2.0; items is the current name)
for name, col in df.items():
    col[col != 0].plot(ax=axs[1], label=name)

axs[1].set_xlim(df.index[0], df.index[-1])
axs[1].set_ylim(bottom=0)
axs[1].legend(loc=0)
You could also go for something with .replace(0, np.nan), but matplotlib doesn't draw lines across NaNs, so you would probably end up looping over the columns anyway (and then using dropna().plot(), for example).
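For completeness, a minimal sketch of that replace-based variant, reusing df from above:

import numpy as np

# zeros become NaN, then each column is plotted with its NaNs dropped,
# so matplotlib draws one continuous line per column
fig, ax = plt.subplots()
for name, col in df.replace(0, np.nan).items():
    col.dropna().plot(ax=ax, label=name)
ax.legend(loc=0)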
I am applying multiple filters on a dataframe at the same time.
data_df[(data_df['1']!=0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']==0)]
Is there any optimized way to do this? I want to compare one column's value as !=0 and the others' values as ==0 multiple times, and there could be more than 5 columns. So, all the operations will be:
data_df[(data_df['1']==0) & (data_df['2']!=0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']!=0) & (data_df['4']==0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']!=0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']!=0)]
Looking for a short and optimized method.
Based on the below statements:
Looking for a short and optimized method
and
I want to compare one column's value as !=0 and others value as =0
You can use df.ne and df.eq, together with df.drop on axis=1 to drop column '1':
data_df[data_df['1'].ne(0) & data_df.drop('1', axis=1).eq(0).all(axis=1)]
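If the same check has to be repeated for every column, the pattern can be wrapped in a loop; a hedged sketch, assuming data_df holds only the columns being tested:

# one subset per column: that column non-zero, all the others zero
for c in data_df.columns:
    subset = data_df[data_df[c].ne(0) & data_df.drop(c, axis=1).eq(0).all(axis=1)]
    print(c, len(subset))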
We can first calculate a boolean dataframe, so that for the given columns we only test once whether each value is equal to zero.
df_bool = df[['1', '2', '3', '4', '5']] == 0
Next we can use this as a mask:
df[~df_bool['1'] & df_bool[['2', '3', '4', '5']].all(axis=1)]
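Because df_bool is computed only once, every per-column subset can be derived from it without re-testing any value; a sketch:

# each entry: rows where column c is non-zero and every other column is zero
subsets = {c: df[~df_bool[c] & df_bool.drop(columns=c).all(axis=1)]
           for c in df_bool.columns}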
One idea is to compare against a numpy array of 1s and 0s and test whether all values match with numpy.all:
# test list - first value 1, the rest 0
L = [1,0,0,0,0]
df = data_df[np.all(data_df == np.array(L), axis=1)]
Or use DataFrame.merge with a one-row DataFrame:
df = data_df.merge(pd.DataFrame([L], columns=data_df.columns))
Sample:
np.random.seed(2020)
data_df = pd.DataFrame(np.random.randint(2, size=(100, 5)), columns=list('12345'))
#print (data_df)
df = data_df[np.all(data_df == np.array(L), axis=1)]
print (df)
1 2 3 4 5
2 1 0 0 0 0
13 1 0 0 0 0
44 1 0 0 0 0
58 1 0 0 0 0
70 1 0 0 0 0
89 1 0 0 0 0
Or:
L = [1,0,0,0,0]
df = data_df.merge(pd.DataFrame([L], columns=data_df.columns))
print (df)
1 2 3 4 5
0 1 0 0 0 0
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
The merge solution can also be used with a helper DataFrame containing all the combinations:
df1 = pd.DataFrame(0, index=data_df.columns, columns=data_df.columns)
np.fill_diagonal(df1.to_numpy(), 1)
print (df1)
1 2 3 4 5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
df = data_df.merge(df1.loc[['1']])
print (df)
1 2 3 4 5
0 1 0 0 0 0
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
df = data_df.merge(df1.loc[['2']])
print (df)
1 2 3 4 5
0 0 1 0 0 0
1 0 1 0 0 0
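Since the helper frame is just an identity matrix, it can also be built directly with numpy.eye rather than fill_diagonal; a small sketch:

df1 = pd.DataFrame(np.eye(len(data_df.columns), dtype=int),
                   index=data_df.columns, columns=data_df.columns)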
I want to find the first valid signal in the dataframe. A valid signal is defined as one with no signal in its preceding 5 rows.
The dataframe is like:
entry
0 0
1 1
2 0
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
The entry signal on row 4 is not valid because there is a signal on row 1. Every signal negates any signal in the following 5 rows.
I implemented this using an apply function with a parameter recording the signal row counter. The code is as follows:
import pandas as pd

def testfun(row, orderinfo):
    if orderinfo['countrows'] > orderinfo['maxrows']:
        orderinfo['countrows'] = 0
    if orderinfo['countrows'] > 0:
        orderinfo['countrows'] += 1
        row['entry'] = 0
    if row['entry'] == 1 and orderinfo['countrows'] == 0:
        orderinfo['countrows'] += 1
    return row

if __name__ == '__main__':
    df = pd.DataFrame({'entry': [0,1,0,1,0,0,0,0,1,0,0,0,0,0,0]})
    orderinfo = dict(countrows=0, maxrows=5)
    df = df.apply(lambda row: testfun(row, orderinfo), axis=1)
    print(df)
The output is:
entry
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
But I am wondering whether there is a vectorized way to do this, because apply is not very efficient.
IIUC, you need rolling with min_periods=1, take the sum, test whether it is less than or equal to 1, and compare against the entry column:
(df.entry.rolling(4, min_periods=1).sum().le(1) & df.entry).astype(int)
Out[595]:
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
Name: entry, dtype: int32
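One caveat, hedged: rolling(4) sums the current row and only the three rows before it, and it counts every raw 1 in that window, including signals that were themselves negated, so on other inputs it can deviate from the iterative rule (e.g. a signal exactly 4 or 5 rows after another would slip through). A quick check against the apply-based output for this sample:

# compare the rolling result with the expected output from the question
expected = pd.Series([0,1,0,0,0,0,0,0,1,0,0,0,0,0,0])
result = (df.entry.rolling(4, min_periods=1).sum().le(1) & df.entry).astype(int)
print((result == expected).all())  # True for this sample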
I have a dataframe which is being generated using pd.get_dummies as below:
df_target = pd.get_dummies(df_column[column], dummy_na=True,prefix=column)
where column is a column name and df_column is the dataframe from which each column is being pulled to do some operations.
rev_grp_m2_> 225 rev_grp_m2_nan rev_grp_m2_nan
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
Now I do a check of variance for each column generated and skip those with zero variance.
for target_column in list(df_target.columns):
    # if the variance of the dummy created is zero: append it to a list and print to the log file
    if np.var(df_target[[target_column]])[0] != 0:
        df_final[target_column] = df_target[target_column]
Here, due to two columns being the same, I get a KeyError on the np.var line.
There are two values of variance for the nan column:
rev_grp_m2_nan 0.000819
rev_grp_m2_nan 0.000000
Ideally I would like to take the one with non-zero variance and drop/skip the one with 0 var.
Can someone please help me do this?
Use DataFrame.var:
print (df.var())
rev_grp_m2_> 225 0.083333
rev_grp_m2_nan 0.000000
rev_grp_m2_nan 0.000000
Then filter with boolean indexing:
out = df.loc[:, df.var()!= 0]
print (out)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
EDIT: You can get the indices of columns with non-zero variance and then select them by iloc:
cols = [i for i in np.arange(len(df.columns)) if np.var(df.iloc[:, i]) != 0]
print (cols)
[0]
df = df.iloc[:, cols]
print (df)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
Another idea is to filter out columns where all values are 0:
cols = [i for i in np.arange(len(df.columns)) if (df.iloc[:, i] != 0).any()]
out = df.iloc[:, cols]
Or:
out = df.loc[:, (df != 0).any()]
print (out)
rev_grp_m2_> 225
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
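One caveat worth noting, since the frame in the question has two columns with the same name: boolean selection with a label-aligned Series can be ambiguous when labels repeat, so converting the mask to a plain array keeps the selection purely positional. A hedged sketch:

# positional mask: avoids label alignment issues caused by the
# duplicated 'rev_grp_m2_nan' column name
mask = (df.var() != 0).to_numpy()
out = df.loc[:, mask]
# optionally drop any remaining duplicate labels, keeping the first
out = out.loc[:, ~out.columns.duplicated()]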
I have a list of events, and I want to display on a graph how many happen per hour on each day of the week, as shown below:
[Image: example of the graph I want]
(each line is a day, the x axis is the time of day, the y axis is the number of events)
As I am new to pandas, I am not sure what the best way to do it is, but here is my way:
x = [(rts[k].getDay(), rts[k].getHour(), 1) for k in rts]
df = pd.DataFrame(x[:30]) # Subset of 30 events
dfGrouped = df.groupby([0, 1]).sum() # Group them by day and hour
#Format to display
pd.DataFrame(np.random.randn(24, 7), index=range(0,24), columns=['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su'])
The question is: how can I go from my grouped dataframe to the 24x7 matrix required for display?
I tried as_matrix, but that gives me only a one-dimensional array, while I want the index of my dataframe to be the index of my matrix.
print(df)
2
0 1
0 19 1
23 1
1 10 2
18 3
22 1
2 17 1
3 8 2
9 3
11 3
13 1
19 1
4 7 1
9 1
14 1
15 1
18 1
5 1 2
7 1
13 1
19 1
6 12 1
Thanks for your help :)
Antoine
I think you need unstack to reshape the data, then rename the column names with a dict and, if necessary, add the missing hours to the index with reindex_axis:
df1 = df.groupby([0, 1])[2].sum().unstack(0, fill_value=0)

# nicer with the column names set first
df = pd.DataFrame(x[:30], columns=['days', 'hours', 'val'])
d = {0: 'Mo', 1: 'Tu', 2: 'We', 3: 'Th', 4: 'Fr', 5: 'Sa', 6: 'Su'}
df1 = df.groupby(['days', 'hours'])['val'].sum().unstack(0, fill_value=0)
# note: reindex_axis is removed in newer pandas; use .reindex(range(24), fill_value=0) there
df1 = df1.rename(columns=d).reindex_axis(range(24), fill_value=0)
print (df1)
days Mo Tu We Th Fr Sa Su
hours
0 0 0 0 0 0 0 0
1 0 0 0 0 0 2 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
7 0 0 0 0 1 1 0
8 0 0 0 2 0 0 0
9 0 0 0 3 1 0 0
10 0 2 0 0 0 0 0
11 0 0 0 3 0 0 0
12 0 0 0 0 0 0 1
13 0 0 0 1 0 1 0
14 0 0 0 0 1 0 0
15 0 0 0 0 1 0 0
16 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0
18 0 3 0 0 1 0 0
19 1 0 0 1 0 1 0
20 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0
22 0 1 0 0 0 0 0
23 1 0 0 0 0 0 0
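From there, the graph described in the question (one line per day, hours on the x axis) should be a single call; a sketch:

import matplotlib.pyplot as plt

df1.plot()  # one line per weekday column
plt.xlabel('hour of day')
plt.ylabel('number of events')
plt.show()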
I am trying to create a heat map from a DataFrame (df) of IDs (rows) and Positions (columns) at which a motif is possible. If the motif is present, the value in the table is 1, and 0 if it is not present. Such as:
ID Position 1 2 3 4 5 6 7 8 9 10 ...etc
A 0 1 0 0 0 1 0 0 0 1
B 1 0 1 0 1 0 0 1 0 0
C 0 0 0 1 0 0 1 0 1 0
D 1 0 1 0 0 0 1 0 1 0
I then multiply the transpose of this matrix by the matrix itself to count the number of times the motifs present co-occur with motifs at other positions, using the code:
df.T.dot(df)
To obtain the Data Frame:
POS 1 2 3 4 5 6 7 8 9 10 ...
1 2 0 2 0 1 0 1 1 1 0
2 0 1 0 0 0 1 0 0 0 1
3 2 0 2 0 1 0 1 1 1 0
4 0 0 0 1 0 0 1 0 1 0
5 1 0 1 0 1 0 0 1 0 0
6 0 1 0 0 0 1 0 0 0 1
7 1 0 1 1 0 0 2 0 2 0
8 1 0 1 0 1 0 0 1 0 0
9 1 0 1 1 0 0 2 0 2 0
10 0 1 0 0 0 1 0 0 0 1
...
This matrix is symmetric about the diagonal. However, when I try to create the heat map using
pylab.pcolor(df)
It gives me an asymmetrical map that does not seem to represent the dot-product matrix. I don't have enough reputation to post an image, though.
Does anyone know why this might be occurring? Thanks
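For reference, a likely cause, hedged: matplotlib's pcolor places element [0, 0] at the lower-left corner, while a printed DataFrame shows row 0 at the top, so the plot is vertically flipped relative to the printed matrix and the symmetry axis appears as the anti-diagonal. Inverting the y axis makes them match; a sketch:

import pylab

c = df.T.dot(df)
pylab.pcolor(c)
pylab.gca().invert_yaxis()  # draw row 0 at the top, matching the printed frame
pylab.colorbar()
pylab.show()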