Numpy or Pandas for multiple dataframes of 2darray datasets - pandas

I hope I used the right terms in the title to describe my problem.
My data has the following structure:
D = {E_1, E_2, ..., E_n} with E_i = {M_{i,1}, M_{i,2}, ..., M_{i,m}}, where each M_{i,j} is a 6x2 matrix.
I used a numpy array of shape n x m x 6 x 2 to store the data. This worked as long as every dataset E_i had the same number of matrices.
But this solution no longer works, since I now have datasets E_i with different numbers of matrices, i.e. E_i has m_i matrices.
Is there perhaps a way in Pandas to solve my problem? In the end I need to access each matrix and operate on it as a numpy array (multiplication, inverse, determinant, ...).

You could try to use a MultiIndex in pandas for this. It allows you to select from the dataframe by level. A simple example of how you could achieve something like that:
import numpy as np
import pandas as pd

# Two index levels: dataset index (D_idx) and element index (E_idx)
D = np.repeat([0, 1], 12)
E = np.repeat([0, 1, 0, 1], 6)
print(D, E)

index_cols = pd.MultiIndex.from_arrays(
    [D, E],
    names=["D_idx", "E_idx"])

# 24 rows of dummy data with two columns each
M = np.ones([24, 2])
df = pd.DataFrame(M,
                  index=index_cols,
                  columns=["left", "right"])
print(df)
This gives you the dataframe:
             left  right
D_idx E_idx
0     0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
1     0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      0       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
      1       1.0    1.0
You can then slice the dataframe based on levels, e.g. if you want to retrieve all elements of the first dataset you can select: df.loc[[(0, 0), (0, 1)], :]
You can generate such selectors with list(zip(d_idx, e_idx)) to select specific rows.
You can find more about slicing and selecting the dataframe here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
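Since the number of matrices differs per dataset, the same idea also works for ragged data: stack every 6x2 matrix vertically and track (dataset, matrix, row) in a three-level MultiIndex. Below is a minimal sketch of that approach; the variable names (datasets, blocks, keys) and the example values are hypothetical, not taken from the question.
import numpy as np
import pandas as pd

# Hypothetical ragged data: dataset 0 holds two 6x2 matrices, dataset 1 holds three
datasets = {0: [np.ones((6, 2)), np.zeros((6, 2))],
            1: [np.full((6, 2), 2.0), np.full((6, 2), 3.0), np.full((6, 2), 4.0)]}

blocks, keys = [], []
for e_idx, matrices in datasets.items():
    for m_idx, mat in enumerate(matrices):
        blocks.append(mat)
        keys.extend((e_idx, m_idx, r) for r in range(mat.shape[0]))

index = pd.MultiIndex.from_tuples(keys, names=["E_idx", "M_idx", "row"])
df = pd.DataFrame(np.vstack(blocks), index=index, columns=["left", "right"])

# Pull one matrix back out as a plain 6x2 numpy array and use it with numpy.linalg
M_1_2 = df.loc[(1, 2)].to_numpy()
print(M_1_2.shape)                       # (6, 2)
print(np.linalg.det(M_1_2.T @ M_1_2))    # e.g. determinant of the 2x2 Gram matrix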

Related

Pandas drop duplicates only for main index

I have a MultiIndex and I want to perform drop_duplicates on a per-level basis: I don't want to look at the entire dataframe, only at rows that are duplicates within the same main index.
Example:
entry  subentry    A    B
1      0         1.0  1.0
       1         1.0  1.0
       2         2.0  2.0
2      0         1.0  1.0
       1         2.0  2.0
       2         2.0  2.0
should return:
entry  subentry    A    B
1      0         1.0  1.0
       1         2.0  2.0
2      0         1.0  1.0
       1         2.0  2.0
Use MultiIndex.get_level_values with Index.duplicated to filter out the last row per entry with boolean indexing:
df1 = df[df.index.get_level_values('entry').duplicated(keep='last')]
print (df1)
                  A    B
entry subentry
1     0         1.0  1.0
      1         1.0  1.0
2     0         1.0  1.0
      1         2.0  2.0
Or, if you need to remove duplicates per first level and the value columns, convert the first level to a column with DataFrame.reset_index; for the filter, invert the boolean mask with ~ and convert the Series to a numpy array, because the indices of the mask and of the original DataFrame do not match:
df2 = df[~df.reset_index(level=0).duplicated(keep='last').to_numpy()]
print (df2)
                  A    B
entry subentry
1     1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      2         2.0  2.0
Or create a helper column from the first level of the MultiIndex:
df2 = df[~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last')]
print (df2)
                  A    B
entry subentry
1     1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      2         2.0  2.0
Details:
print (df.reset_index(level=0))
          entry    A    B
subentry
0             1  1.0  1.0
1             1  1.0  1.0
2             1  2.0  2.0
0             2  1.0  1.0
1             2  2.0  2.0
2             2  2.0  2.0
print (~df.reset_index(level=0).duplicated(keep='last'))
0    False
1     True
2     True
0     True
1    False
2     True
dtype: bool
print (df.assign(new=df.index.get_level_values('entry')))
                  A    B  new
entry subentry
1     0         1.0  1.0    1
      1         1.0  1.0    1
      2         2.0  2.0    1
2     0         1.0  1.0    2
      1         2.0  2.0    2
      2         2.0  2.0    2
print (~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last'))
entry  subentry
1      0           False
       1            True
       2            True
2      0            True
       1           False
       2            True
dtype: bool
It looks like you want to drop_duplicates per group:
out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())
Or, a possibly more efficient variant using a temporary reset_index with duplicated and boolean indexing:
out = df[~df.reset_index('entry').duplicated().values]
Output:
                  A    B
entry subentry
1     0         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      1         2.0  2.0
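For reference, the example frame can be rebuilt from scratch to try either approach; this is just a minimal sketch, and the construction of the MultiIndex is an assumption based on the values displayed above.
import pandas as pd

# Rebuild the example frame with an (entry, subentry) MultiIndex
index = pd.MultiIndex.from_product([[1, 2], [0, 1, 2]], names=["entry", "subentry"])
df = pd.DataFrame({"A": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
                   "B": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]}, index=index)

# Per-group drop_duplicates: keep the first occurrence within each entry
out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())
print(out)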

pandas DataFrame column manipulation using previous row value

I have the below pandas DataFrame:
 color  direction  Total
  -1.0        1.0    NaN
   1.0        1.0      0
   1.0        1.0      0
   1.0        1.0      0
  -1.0        1.0    NaN
   1.0       -1.0    NaN
   1.0        1.0      0
   1.0        1.0      0
I am trying to update the Total column based on the logic below:
If df['color'] == 1.0 and df['direction'] == 1.0, then Total should be the Total of the previous row + 1; if the Total of the previous row is NaN, then 0 + 1.
Note: I tried to read the previous row's Total using df['Total'].shift() + 1, but it didn't work.
Expected DataFrame:
 color  direction  Total
  -1.0        1.0    NaN
   1.0        1.0      1
   1.0        1.0      2
   1.0        1.0      3
  -1.0        1.0    NaN
   1.0       -1.0    NaN
   1.0        1.0      1
   1.0        1.0      2
You can create the sub-group key with cumsum, then just group by that key together with color and direction and do cumcount:
df.loc[df.Total.notnull(),'Total'] = df.groupby([df['Total'].isna().cumsum(),df['color'],df['direction']]).cumcount()+1
df
Out[618]:
   color  direction  Total
0   -1.0        1.0    NaN
1    1.0        1.0    1.0
2    1.0        1.0    2.0
3    1.0        1.0    3.0
4   -1.0        1.0    NaN
5    1.0       -1.0    NaN
6    1.0        1.0    1.0
7    1.0        1.0    2.0
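To see why this works, it can help to print the intermediate grouping key. Below is a small sketch where the frame is rebuilt from the example values above (the construction itself is an assumption).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color":     [-1.0, 1.0, 1.0, 1.0, -1.0,  1.0, 1.0, 1.0],
    "direction": [ 1.0, 1.0, 1.0, 1.0,  1.0, -1.0, 1.0, 1.0],
    "Total":     [np.nan, 0, 0, 0, np.nan, np.nan, 0, 0],
})

# Every NaN in Total starts a new block; cumsum turns that into a block id
block = df["Total"].isna().cumsum()
print(block.to_list())   # [1, 1, 1, 1, 2, 3, 3, 3]

# Counting within (block, color, direction) restarts the counter after every NaN
df.loc[df["Total"].notnull(), "Total"] = (
    df.groupby([block, df["color"], df["direction"]]).cumcount() + 1
)
print(df)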

Matplotlib eventplot - raster plot from binary values

I have created a dataframe where each column is an equal-length series of 1.0s and 0.0s. There is nothing else in the dataframe. I want to create a raster-style plot from this data where each column would be a horizontal line stacked up along the y-axis and each tick on the x-axis would correspond to a row index value.
However, when I try to do this, I get an "axis -1 is out of bounds for array of dimension 0" error. None of the other entries for this or very similar errors seem to relate to eventplot. I thought the type of data I had would be perfect for eventplot (a discrete black dash wherever there's a 1.0, otherwise nothing), but maybe I'm very wrong.
Here's a toy example of the kind of dataframe I'm trying to pass plus the function as I'm calling it:
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
plt.eventplot(df, colors='black', lineoffsets=1,
              linelengths=1, orientation='vertical')
Any help appreciated, thank you.
Edit: If I convert my df into an np.array and pass that instead, I no longer get that particular error, but I don't at all get the result I'm looking for. I do get the correct values on the x-axis (in my real data, this is 0-22), but I don't get each column of data represented as a separate line, and I'm having no luck advancing in that direction.
When using eventplot, the array passed to positions needs to contain the row numbers of the ones in each column. Here is an example with your toy data:
import io
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import data into dataframe
data = """
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create series of indexes containing positions for raster plot
positions = df.apply(lambda x: df.index[x == 1])
# Create raster plot with inverted y-axis to display columns in ascending order
plt.eventplot(positions, lineoffsets=df.index, linelengths=0.75, colors='black')
plt.yticks(range(positions.index.size), positions.index)
plt.gca().invert_yaxis()
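As a variation, the positions can also be computed with plain numpy instead of DataFrame.apply. This is just an alternative sketch that reuses df and plt from the snippet above; the list comprehension is my own, not part of the original answer.
import numpy as np

# One array of row indices per column, e.g. SP3 -> array([3, 5])
positions = [np.flatnonzero(df[col].to_numpy()) for col in df.columns]

plt.eventplot(positions, lineoffsets=list(range(len(df.columns))),
              linelengths=0.75, colors='black')
plt.yticks(range(len(df.columns)), df.columns)
plt.gca().invert_yaxis()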

pandas - how to select rows based on a conjunction of a non indexed column?

Consider the following DataFrame -
In [47]: dati
Out[47]:
                       x      y
frame face lmark
1     NaN  NaN       NaN    NaN
300   0.0  1.0     745.0  367.0
           2.0     753.0  411.0
           3.0     759.0  455.0
2201  0.0  1.0     634.0  395.0
           2.0     629.0  439.0
           3.0     630.0  486.0
How can we select the rows where dati['x'] > 629.5 for all rows sharing the same value in the 'frame' level? For this example, I would expect the result to be
                       x      y
frame face lmark
300   0.0  1.0     745.0  367.0
           2.0     753.0  411.0
           3.0     759.0  455.0
because column 'x' of 'frame' 2201, 'lmark' 2.0 is not greater than 629.5
Use GroupBy.transform with GroupBy.all to test whether all values per group are True, then filter with boolean indexing:
df = dati[(dati['x'] > 629.5).groupby(level=0).transform('all')]
print (df)
                       x      y
frame face lmark
300   0.0  1.0     745.0  367.0
           2.0     753.0  411.0
           3.0     759.0  455.0
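To see the intermediate boolean mask, the example frame can be rebuilt; a minimal sketch follows, in which the construction of dati is an assumption based on the displayed values.
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, np.nan, np.nan),
     (300, 0.0, 1.0), (300, 0.0, 2.0), (300, 0.0, 3.0),
     (2201, 0.0, 1.0), (2201, 0.0, 2.0), (2201, 0.0, 3.0)],
    names=["frame", "face", "lmark"])
dati = pd.DataFrame(
    {"x": [np.nan, 745.0, 753.0, 759.0, 634.0, 629.0, 630.0],
     "y": [np.nan, 367.0, 411.0, 455.0, 395.0, 439.0, 486.0]},
    index=index)

# True for a row only if every row of its frame satisfies x > 629.5
mask = (dati["x"] > 629.5).groupby(level=0).transform("all")
print(mask.to_list())   # [False, True, True, True, False, False, False]
print(dati[mask])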

Converting a flat table of records to an aggregate dataframe in Pandas [duplicate]

This question already has answers here: How can I pivot a dataframe?
I have a flat table of records about objects. Objects have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects, but those are not of interest at present. So, basically, the data looks like this:
Id ObjName XT ObjType ContainerId
2 name1 x1 A 2
3 name2 x5 B 2
22 name5 x3 D 7
25 name6 x2 E 7
35 name7 x3 G 7
..
..
92 name23 x2 A 17
95 name24 x8 B 17
99 name25 x5 A 21
What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.
So, I am looking to convert the above data to the form below:
ObjType        A    B    C    D    E    F    G
ContainerId
2            2.0  1.0  1.0  0.0  0.0  0.0  0.0
7            0.0  0.0  0.0  1.0  2.0  1.0  1.0
9            1.0  1.0  0.0  1.0  0.0  0.0  0.0
11           0.0  0.0  0.0  2.0  3.0  1.0  1.0
14           1.0  1.0  0.0  1.0  0.0  0.0  0.0
17           1.0  1.0  0.0  0.0  0.0  0.0  0.0
21           1.0  0.0  0.0  0.0  0.0  0.0  0.0
This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)
import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')
#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()
#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'.
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])
#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()
#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)
Ask: Can someone please share other possible approaches that could perform this transformation?
This is a use case for pd.crosstab (see the docs).
e.g.
In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]:
ObjType      A  B  D  E  G
ContainerId
2            1  1  0  0  0
7            0  0  1  1  1
17           1  1  0  0  0
21           1  0  0  0  0
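If the result should also contain object types that never occur (C and F in the expected frame above), the crosstab can be reindexed. A short sketch follows, where the column list is taken from the expected output and the float cast only matches its formatting; an equivalent groupby formulation would be rdf.groupby(['ContainerId', 'ObjType']).size().unstack(fill_value=0).
# Reuses rdf from the question's own code
out = (pd.crosstab(rdf.ContainerId, rdf.ObjType)
         .reindex(columns=list("ABCDEFG"), fill_value=0)
         .astype(float))
print(out)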