Matplotlib eventplot - raster plot from binary values - pandas

I have created a dataframe where each column is an equal-length series of 1.0s and 0.0s. There is nothing else in the dataframe. I want to create a raster-style plot from this data where each column would be a horizontal line stacked up along the y-axis and each tick on the x-axis would correspond to a row index value.
However, when I try to do this, I get an "axis -1 is out of bounds for array of dimension 0" error. None of the existing questions about this or very similar errors seem to relate to eventplot. I thought this type of data would be perfect for eventplot (a discrete black dash wherever there's a 1.0, otherwise nothing), but maybe I'm very wrong.
Here's a toy example of the kind of dataframe I'm trying to pass plus the function as I'm calling it:
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
plt.eventplot(df, colors='black', lineoffsets=1,
linelengths=1, orientation='vertical')
Any help appreciated, thank you.
Edit: If I convert my df into an np.array and pass that instead, I no longer get that particular error, but I don't at all get the result I'm looking for. I do get the correct values on the x-axis (in my real data, this is 0-22), but I don't get each column of data represented as a separate line, and I'm having no luck advancing in that direction.

When using eventplot, the array passed to positions needs to contain, for each column, the row indices of the 1.0s in that column. Here is an example with your toy data:
import io
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import data into dataframe
data = """
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create series of indexes containing positions for raster plot
positions = df.apply(lambda x: df.index[x == 1])
# Create raster plot with inverted y-axis to display columns in ascending order
# One offset per column (df.index only works here because the toy data happens to be square)
plt.eventplot(positions, lineoffsets=range(positions.index.size),
              linelengths=0.75, colors='black')
plt.yticks(range(positions.index.size), positions.index)
plt.gca().invert_yaxis()

Related

How can I convert MultiIndex data to a Stata data file?

I have a MultiIndex DataFrame in pandas, and I want to export it so that Stata can import the file directly. The DataFrame looks like this:
fund
NAME_CITY 七台河市 万宁市 三亚市 三明市 三沙市
year nsfc
2000 B 0.0 0.0 0.0 0.0 0.0
C 0.0 0.0 0.0 0.0 0.0
D 0.0 0.0 0.0 0.0 0.0
E 0.0 0.0 0.0 0.0 0.0
F 0.0 0.0 0.0 0.0 0.0
I have three indexes: year, nsfc, and the names of China's cities (七台河市, 万宁市, ...). How can I drop the first row, which contains "fund", and set 七台河市, 万宁市, etc. as the column names so that the file satisfies Stata's variable-name requirements?
I know the simplest method is to export to a CSV file and edit it manually, but I wonder how to do it with pandas.
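One way to approach this (a minimal sketch, with ASCII placeholder city names standing in for the Chinese ones, which newer Stata dta formats would also accept): drop the top 'fund' level from the columns so the city names become the column labels, move the row MultiIndex into ordinary columns, then write with to_stata.

```python
import pandas as pd

# Toy stand-in for the question's data: a two-level column index with 'fund'
# on top and placeholder city names below, plus a (year, nsfc) row MultiIndex
cols = pd.MultiIndex.from_product([["fund"], ["CityA", "CityB", "CityC"]])
idx = pd.MultiIndex.from_product([[2000], list("BCD")], names=["year", "nsfc"])
df = pd.DataFrame(0.0, index=idx, columns=cols)

# Drop the top 'fund' level so the city names become the column labels,
# then turn the row MultiIndex into ordinary columns for Stata
out = df.copy()
out.columns = out.columns.droplevel(0)
out = out.reset_index()
out.to_stata("funds.dta", write_index=False)
```

Stata can then open funds.dta directly, with year, nsfc, and the city names as variables.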

pandas DataFrame column manipulation using previous row value

I have the pandas DataFrame below:
   color  direction  Total
0   -1.0        1.0    NaN
1    1.0        1.0      0
2    1.0        1.0      0
3    1.0        1.0      0
4   -1.0        1.0    NaN
5    1.0       -1.0    NaN
6    1.0        1.0      0
7    1.0        1.0      0
I am trying to update the Total column based on the logic below:
if df['color'] == 1.0 and df['direction'] == 1.0, then Total should be the previous row's Total + 1; if the previous row's Total is NaN, then 0 + 1.
Note: I tried to read the previous row's Total using df['Total'].shift() + 1, but it didn't work.
Expected DataFrame:
   color  direction  Total
0   -1.0        1.0    NaN
1    1.0        1.0      1
2    1.0        1.0      2
3    1.0        1.0      3
4   -1.0        1.0    NaN
5    1.0       -1.0    NaN
6    1.0        1.0      1
7    1.0        1.0      2
You can create sub-group labels with cumsum, then group by those labels together with color and direction and use cumcount:
df.loc[df.Total.notnull(),'Total'] = df.groupby([df['Total'].isna().cumsum(),df['color'],df['direction']]).cumcount()+1
df
Out[618]:
color direction Total
0 -1.0 1.0 NaN
1 1.0 1.0 1.0
2 1.0 1.0 2.0
3 1.0 1.0 3.0
4 -1.0 1.0 NaN
5 1.0 -1.0 NaN
6 1.0 1.0 1.0
7 1.0 1.0 2.0
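For reference, a self-contained version of the same approach, reconstructing the question's frame. The idea is that each NaN in Total starts a new run, so the cumulative NaN count labels the runs, and cumcount numbers the qualifying rows within each run:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color":     [-1.0, 1.0, 1.0, 1.0, -1.0,  1.0, 1.0, 1.0],
    "direction": [ 1.0, 1.0, 1.0, 1.0,  1.0, -1.0, 1.0, 1.0],
    "Total":     [np.nan, 0, 0, 0, np.nan, np.nan, 0, 0],
})

# Each NaN starts a new run; grouping by (run label, color, direction)
# and cumcounting numbers the rows within each run
groups = [df["Total"].isna().cumsum(), df["color"], df["direction"]]
df.loc[df["Total"].notna(), "Total"] = df.groupby(groups).cumcount() + 1
```

This reproduces the expected output: 1, 2, 3 in the first run and 1, 2 in the last, with the NaN rows untouched.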

pandas - how to select rows based on a conjunction over a non-indexed column?

Consider the following DataFrame -
In [47]: dati
Out[47]:
x y
frame face lmark
1 NaN NaN NaN NaN
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
2201 0.0 1.0 634.0 395.0
2.0 629.0 439.0
3.0 630.0 486.0
How can we select the rows where dati['x'] > 629.5 for all rows sharing the same 'frame' value? For this example, I would expect the result to be
x y
frame face lmark
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
because column 'x' of 'frame' 2201, 'lmark' 2.0 is not greater than 629.5
Use GroupBy.transform with 'all' to test whether all values per group are True, then filter with boolean indexing:
df = dati[(dati['x'] > 629.5).groupby(level=0).transform('all')]
print (df)
x y
frame face lmark
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
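A self-contained sketch of this, rebuilding the non-NaN part of the example frame:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(300, 0.0, 1.0), (300, 0.0, 2.0), (300, 0.0, 3.0),
     (2201, 0.0, 1.0), (2201, 0.0, 2.0), (2201, 0.0, 3.0)],
    names=["frame", "face", "lmark"])
dati = pd.DataFrame({"x": [745.0, 753.0, 759.0, 634.0, 629.0, 630.0],
                     "y": [367.0, 411.0, 455.0, 395.0, 439.0, 486.0]},
                    index=idx)

# Keep a frame only if every one of its rows satisfies the condition
mask = (dati["x"] > 629.5).groupby(level="frame").transform("all")
df = dati[mask]
```

Frame 2201 is dropped because one of its rows has x == 629.0, which fails the test.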

how do I sum each column based on a condition on another column, without iterating over the columns, in a pandas dataframe

I have a data frame as below:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 1.0 85.0 66.0 29.0 0.0 0.0
1 8.0 183.0 64.0 0.0 0.0 0.0
2 1.0 89.0 66.0 23.0 94.0 1.0
3 0.0 137.0 40.0 35.0 168.0 1.0
4 5.0 116.0 74.0 0.0 0.0 1.0
I would like a pythonic way to sum each column separately based on a condition on one of the columns. I could do it by iterating over the df columns, but I'm sure there is a better way I'm not familiar with.
Specifically for my data, I'd like to sum each column's values for rows where the last column, 'Outcome', equals 1. In the end, I should get this:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Any ideas?
Here is a solution to get the expected output:
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0
Output:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Documentation:
loc: access a group of rows/columns by labels or a boolean array.
sum: by default, sums each column and returns a Series indexed by the column names.
to_frame: convert a Series to a DataFrame.
.T: accessor for the transpose; transposes the DataFrame.
Use np.where (though the boolean mask df1['Outcome'] == 1 alone would suffice):
df1[np.where(df1['Outcome'] == 1,True,False)].sum().to_frame().T
Output
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 3.0
Will these work for you?
df1.loc[~(df1['Outcome'] == 0)].groupby('Outcome').agg('sum').reset_index()
or
df1.loc[df1.Outcome == 1.0].sum().to_frame().T
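For reference, the first approach run end-to-end on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Preg":          [1.0, 8.0, 1.0, 0.0, 5.0],
    "Glucose":       [85.0, 183.0, 89.0, 137.0, 116.0],
    "BloodPressure": [66.0, 64.0, 66.0, 40.0, 74.0],
    "SkinThickness": [29.0, 0.0, 23.0, 35.0, 0.0],
    "Insulin":       [0.0, 0.0, 94.0, 168.0, 0.0],
    "Outcome":       [0.0, 0.0, 1.0, 1.0, 1.0],
})

# Filter to Outcome == 1, sum each column, and lay the result out as one row
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0  # the summed Outcome (3.0) is meaningless, so reset it
```

This yields the single-row frame from the question's expected output.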

Numpy or Pandas for multiple dataframes of 2darray datasets

I hope I used the correct terms in the title to describe my problem.
My data has the following structure
D = {E_1, E_2...,E_n} with E_i = {M_{i,1}, M_{i,2},...M_{i,m}} and each M_{i,j} is a 6x2 Matrix.
I used a numpy array of dimension n x m x 6 x 2 to store the data. This was fine as long as every dataset E_i had the same number of matrices.
But this solution no longer works, since I now have datasets E_i with different numbers of matrices, i.e. E_i has m_i matrices.
Is there maybe a way in Pandas to solve my problem? In the end, I need to access each matrix and operate on it as a numpy array (multiplication, inverse, determinant, ...).
You could try to use a multiindex in pandas in order to do this. This allows you to select the dataframe by level. A simple example of how you could achieve something like that:
import numpy as np
import pandas as pd

D = np.repeat([0, 1], 12)
E = np.repeat([0, 1, 0, 1], 6)
print(D, E)
index_cols = pd.MultiIndex.from_arrays(
    [D, E],
    names=["D_idx", "E_idx"])
M = np.ones([24, 2])
df = pd.DataFrame(M,
                  index=index_cols,
                  columns=["left", "right"])
print(df)
This gives you the dataframe:
left right
D_idx E_idx
0 0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
0 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
1 1.0 1.0
You can then slice the dataframe based on levels, e.g. if you want to retrieve all elements of the first set (D_idx == 0) you can select: df.loc[[(0, 0), (0, 1)], :]
You can generate selectors like this using list(zip(d_idx, e_idx)) in order to select specific rows.
You can find more about slicing and selecting the dataframe here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
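Building on this, a sketch of how a ragged collection can be stored with a three-level index and each 6x2 matrix recovered as a plain ndarray for NumPy operations. The level names (E_idx, M_idx, row) and sizes (E_0 with two matrices, E_1 with one) are chosen for illustration:

```python
import numpy as np
import pandas as pd

# E_0 holds two 6x2 matrices, E_1 holds one (sizes chosen for illustration)
rows = [(i, j, r)
        for i, m_count in enumerate([2, 1])
        for j in range(m_count)
        for r in range(6)]
idx = pd.MultiIndex.from_tuples(rows, names=["E_idx", "M_idx", "row"])
df = pd.DataFrame(np.arange(len(rows) * 2).reshape(-1, 2),
                  index=idx, columns=["left", "right"])

# Recover matrix M_{0,1} as a plain 6x2 ndarray and use NumPy on it
M = df.loc[(0, 1)].to_numpy()
gram = M.T @ M                 # 2x2 Gram matrix, so det/inverse are defined
det = np.linalg.det(gram)
```

Because each matrix lives under its own (E_idx, M_idx) key, the sets can have any number of matrices without padding.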