Create nested array for all unique indices in a pandas MultiIndex DataFrame - pandas

import numpy as np
import pandas as pd
# generate dummy data
np.random.seed(42)
df = pd.DataFrame({'subject': ['A'] * 10 + ['B'] * 10,
'trial': list(range(5)) * 4,
'value1': np.random.randint(0, 100, 20),
'value2': np.random.randint(0, 100, 20)
})
df = df.set_index(['subject', 'trial']).sort_index()
print(df)
               value1  value2
subject trial
A       0          51       1
        0          20      75
        1          92      63
        1          82      57
        2          14      59
        2          86      21
        3          71      20
        3          74      88
        4          60      32
        4          74      48
B       0          87      90
        0          52      79
        1          99      58
        1           1      14
        2          23      41
        2          87      61
        3           2      91
        3          29      61
        4          21      59
        4          37      46
Notice: Each subject / trial combination has multiple rows.
I want to create an array with the rows as nested dimensions.
My (admittedly ugly) data transformation via a list:
tmp=list()
for idx in df.index.unique():
    tmp.append(df.loc[idx].to_numpy())
goal = np.array(tmp)
print(goal)
[[[51 1]
[20 75]]
...
[[21 59]
[37 46]]]
Can you show me a native pandas / numpy way to do it (without the list crutch)?

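For the result to be a non-ragged NumPy array, every unique index value must occur the same number of times. So you don't need to loop over them: compute the number of rows per index value and reshape. Because the index is sorted, rows that share an index value are adjacent.
n = len(df) / (~df.index.duplicated()).sum()  # rows per unique index value
assert n.is_integer()
# axes: (unique index values, rows per value, columns)
out = df.to_numpy().reshape(-1, int(n), df.shape[1])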
Output:
array([[[51, 1],
[20, 75]],
[[92, 63],
[82, 57]],
[[14, 59],
[86, 21]],
[[71, 20],
[74, 88]],
[[60, 32],
[74, 48]],
[[87, 90],
[52, 79]],
[[99, 58],
[ 1, 14]],
[[23, 41],
[87, 61]],
[[ 2, 91],
[29, 61]],
[[21, 59],
[37, 46]]])
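As a rough check of the reshape approach, here is a minimal sketch that compares it with the list-based loop from the question. The third column (value3) and the seeded generator are my own additions, chosen so that the number of rows per index value (2) differs from the number of columns (3), which makes the axis order visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Same layout as the question, plus a hypothetical third value column.
df = pd.DataFrame({
    'subject': ['A'] * 10 + ['B'] * 10,
    'trial': list(range(5)) * 4,
    'value1': rng.integers(0, 100, 20),
    'value2': rng.integers(0, 100, 20),
    'value3': rng.integers(0, 100, 20),
}).set_index(['subject', 'trial']).sort_index()

# Number of unique (subject, trial) pairs and rows per pair.
n_unique = len(df.index.unique())
rows_per_key, remainder = divmod(len(df), n_unique)
assert remainder == 0, "index values do not all occur equally often"

# Axes: (unique index values, rows per value, columns).
out = df.to_numpy().reshape(n_unique, rows_per_key, df.shape[1])

# Reference result built with the loop from the question.
expected = np.array([df.loc[idx].to_numpy() for idx in df.index.unique()])
print(out.shape)
```

The reshape only lines up with the loop because the index is sorted, so call sort_index() first if that is not guaranteed.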

You can use stack:
<code>df.stack().values
</code>
Output:
<code>array([[ 0, 25],
[16, 11],
[49, 87],
[38, 77],
[67, 6],
[27, 27],
[40, 0],
[22, 81],
[83, 89],
[36, 55],
[41, 1],
[13, 74],
[88, 61],
[85, 73],
[55, 66],
[44, 82],
[20, 30],
[82, 69],
[37, 71],
[30, 16],
[81, 96],
[ 0, 56],
[ 5, 99],
[73, 86]], dtype=int64)
</code>

Related

I'm trying to generate random values for each column of a NumPy array using a different range per column, but the output doesn't respect the ranges

I need each column to contain random integers from a specified range: random.randint(1, 50) for column 1, random.randint(51, 100) for column 2, and so on.
import numpy
import random
import pandas
from random import randint
wsn = numpy.arange(1, 6)
taskn = 3
t1 = numpy.random.randint((random.randint(2, 50),random.randint(51, 100),
random.randint(101, 150),random.randint(151, 200),random.randint(201, 250)),size=(5,5))
t2 = numpy.random.randint((random.randint(2, 50),random.randint(51, 100),
random.randint(101, 150),random.randint(151, 200),random.randint(201, 250)),size=(5,5))
t3= numpy.random.randint((random.randint(2, 50),random.randint(51, 100),
random.randint(101, 150),random.randint(151, 200),random.randint(201, 250)),size=(5,5))
print('\nGenerated Data:\t\n\nNumber \t\t\t Task 1 \t\t\t Task 2 \t\t\t Task 3\n')
ni = len(t1)
for i in range(ni):
    print('\t {0} \t {1} \t {2} \t {3}\n'.format(wsn[i], t1[i],t2[i],t3[i]))
print('\n\n')
It prints the following
Generated Data:
Number Task 1 Task 2 Task 3
1 [ 1 13 16 121 18] [ 5 22 34 65 194] [ 10 68 60 134 130]
2 [ 0 2 117 176 46] [ 1 15 111 116 180] [22 41 70 24 85]
3 [ 0 12 121 19 193] [ 0 5 37 109 205] [ 5 53 5 106 15]
4 [ 0 5 97 99 235] [ 0 22 142 11 150] [ 6 79 129 64 87]
5 [ 2 46 71 101 186] [ 3 57 141 37 71] [ 15 32 9 117 77]
Sometimes it even generates 0, although I didn't specify 0 in any of the ranges.
np.random.randint(low, high, size=None) allows low and high to be arrays of length num_intervals; high is exclusive.
In that case, when size is not specified, it generates one integer per interval defined by the low and high bounds.
To generate multiple integers per interval, specify a size whose last dimension is num_intervals.
Here that is size=(num_tasks, num_samples, num_intervals).
import numpy as np
bounds = np.array([1, 50, 100, 150, 200, 250])
num_tasks = 3
num_samples = 7
bounds_low = bounds[:-1]
bounds_high = bounds[1:]
num_intervals = len(bounds_low)
arr = np.random.randint(
bounds_low, bounds_high, size=(num_tasks, num_samples, num_intervals)
)
Checking the properties:
assert arr.shape == (num_tasks, num_samples, num_intervals)
for itvl_idx in range(num_intervals):
    assert np.all(arr[:, :, itvl_idx] >= bounds_low[itvl_idx])
    assert np.all(arr[:, :, itvl_idx] < bounds_high[itvl_idx])
An example of output:
array([[[ 45, 61, 100, 185, 216],
[ 36, 78, 117, 152, 222],
[ 18, 77, 112, 153, 221],
[ 9, 70, 123, 178, 223],
[ 16, 84, 118, 157, 233],
[ 42, 78, 108, 179, 240],
[ 40, 52, 116, 152, 225]],
[[ 3, 92, 102, 151, 236],
[ 45, 89, 138, 179, 218],
[ 45, 73, 120, 183, 231],
[ 35, 80, 130, 167, 212],
[ 14, 86, 118, 195, 212],
[ 20, 66, 117, 151, 248],
[ 49, 94, 138, 175, 212]],
[[ 13, 75, 116, 169, 206],
[ 13, 75, 127, 179, 213],
[ 29, 64, 136, 151, 213],
[ 1, 81, 140, 197, 200],
[ 17, 77, 112, 171, 215],
[ 18, 75, 103, 180, 209],
[ 47, 57, 132, 194, 234]]])
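The zeros in the question can be explained by how randint treats a single positional argument: it is taken as the exclusive upper bound, and the lower bound defaults to 0. A small sketch of the difference, where the bound arrays are illustrative values of mine and not taken from the question:

```python
import numpy as np

np.random.seed(0)

# One argument only: these values act as exclusive UPPER bounds, low is 0.
upper = np.array([10, 60, 110, 160, 210])
single = np.random.randint(upper, size=(1000, 5))

# Two arguments: each column is drawn from [low, high).
low = np.array([1, 51, 101, 151, 201])
high = np.array([51, 101, 151, 201, 251])  # exclusive, so 50, 100, ... are possible
both = np.random.randint(low, high, size=(1000, 5))

print(single.min(axis=0))  # per-column minimum, can be far below the intended range
print(both.min(axis=0))    # per-column minimum, never below low
```

With both bounds given, every column stays inside its own range, which is what the question was after.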

How to get a specific output for NumPy array slicing?

x is an array of shape (n_dim, n_row, n_col) holding the first n natural numbers.
b is a boolean array of shape (2,) with elements True, False.
def array_slice(n,n_dim,n_row,n_col):
    x = np.arange(0,n).reshape(n_dim,n_row,n_col)
    b = np.full((2,),True)
    print(x[b])
    print(x[b,:,1:3])
Expected output:
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]]
[[[ 1 2]
[ 6 7]
[11 12]]]
My output:
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
[[15 16 17 18 19]
[20 21 22 23 24]
[25 26 27 28 29]]]
[[[ 1 2]
[ 6 7]
[11 12]]
[[16 17]
[21 22]
[26 27]]]
An example:
In [83]: x= np.arange(24).reshape(2,3,4)
In [84]: b = np.full((2,),True)
In [85]: x
Out[85]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
In [86]: b
Out[86]: array([ True, True])
With two True values, b selects both planes of the 1st dimension:
In [87]: x[b]
Out[87]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
A b with a mix of true and false:
In [88]: b = np.array([True, False])
In [89]: b
Out[89]: array([ True, False])
In [90]: x[b]
Out[90]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]])
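Applied to the question, the expected output corresponds to a mask that keeps only the first plane. A sketch, assuming n=30 and shape (2, 3, 5) as implied by the printed output in the question:

```python
import numpy as np

x = np.arange(30).reshape(2, 3, 5)

# np.full((2,), True) keeps both planes; this mask keeps only the first one.
b = np.array([True, False])

first = x[b]            # shape (1, 3, 5)
sliced = x[b, :, 1:3]   # shape (1, 3, 2): columns 1 and 2 of the kept plane

print(first)
print(sliced)
```

The only change from the question's function is how b is built; the indexing expressions stay the same.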

How does NumPy calculate inner product of two 2D matrices?

I'm unable to understand how NumPy calculates the inner product of two 2D matrices.
For example, this program:
mat = [[1, 2, 3, 4],
[5, 6, 7, 8]]
result = np.inner(mat, mat)
print('\n' + 'result: ')
print(result)
print('')
produces this output:
result:
[[ 30 70]
[ 70 174]]
How are these numbers calculated?
Before somebody says "read the documentation" I did, https://numpy.org/doc/stable/reference/generated/numpy.inner.html, it's not clear to me from this how this result is calculated.
Before somebody says "check the Wikipedia article" I did, https://en.wikipedia.org/wiki/Frobenius_inner_product shows various math symbols I'm not familiar with and does not explain how a calculation such as the one above is performed.
Before somebody says "Google it", I did, most examples are for 1-d arrays (which is an easy calculation), and others like this video https://www.youtube.com/watch?v=_YtHyjcQ1gw produce a different result than NumPy does.
Any clarification would be greatly appreciated.
In [55]: mat = [[1, 2, 3, 4],
...: [5, 6, 7, 8]]
...:
In [56]: arr = np.array(mat)
In [58]: arr.dot(arr.T)
Out[58]:
array([[ 30, 70],
[ 70, 174]])
For 2-D arrays, np.inner(arr, arr) gives the same result as arr.dot(arr.T). That's a matrix product of a (2,4) with a (4,2), resulting in a (2,2). This is the usual 'scan across the columns, down the rows' method.
A couple of other expressions that do this:
I like the expressiveness of einsum, where the sum-of-products is on the j dimension:
In [60]: np.einsum('ij,kj->ik',arr,arr)
Out[60]:
array([[ 30, 70],
[ 70, 174]])
With broadcasted elementwise multiplication and summation:
In [61]: (arr[:,None,:]*arr[None,:,:]).sum(axis=-1)
Out[61]:
array([[ 30, 70],
[ 70, 174]])
Without the sum, the products are:
In [62]: (arr[:,None,:]*arr[None,:,:])
Out[62]:
array([[[ 1, 4, 9, 16],
[ 5, 12, 21, 32]],
[[ 5, 12, 21, 32],
[25, 36, 49, 64]]])
Which are the values you discovered.
I finally found this site https://www.tutorialspoint.com/numpy/numpy_inner.htm which explains things a little better. The above is computed as follows:
(1*1)+(2*2)+(3*3)+(4*4)    (1*5)+(2*6)+(3*7)+(4*8)
 1 + 4 + 9 + 16             5 + 12 + 21 + 32
 = 30                       = 70

(5*1)+(6*2)+(7*3)+(8*4)    (5*5)+(6*6)+(7*7)+(8*8)
 5 + 12 + 21 + 32           25 + 36 + 49 + 64
 = 70                       = 174
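The hand calculation above can be reproduced with an explicit loop. This is only an illustrative sketch (the name manual is mine): it pairs every row of the first argument with every row of the second and sums the elementwise products.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

# manual[i, j] = sum over k of arr[i, k] * arr[j, k]
manual = np.empty((arr.shape[0], arr.shape[0]), dtype=arr.dtype)
for i in range(arr.shape[0]):
    for j in range(arr.shape[0]):
        manual[i, j] = np.sum(arr[i] * arr[j])

print(manual)
print(np.inner(arr, arr))
```

Both prints show the same 2x2 matrix, so for 2-D inputs np.inner works row against row, not row against column.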

Plotting a stacked bar graph

I want to plot a stacked bar graph using matplotlib and pandas. The code below plots the bar graph very nicely. However, when I change a, b, c, d, e, f, g, h, etc. to January, February, ... it does not plot the same graph; it only plots in alphabetical order. Is there any way to overcome this problem?
import pandas as pd
import matplotlib.pyplot as plt
years=["2016","2017","2018","2019","2020", "2021"]
dataavail={
"a":[20,0,0,0,10,21],
"b":[20,13,10,18,15,45],
"c":[20,20,10,15,18,78],
"d":[20,20,10,15,18,75],
"e":[20,20,10,15,18,78],
"f":[20,20,10,15,18,78],
"g":[20,20,10,15,18,78],
"h":[20,20,10,15,18,78],
"i":[20,20,10,15,18,78],
"j":[20,20,10,15,18,78],
"k":[20,20,10,15,18,78],
"l":[20,20,0,0,0,20],
}
df=pd.DataFrame(dataavail,index=years)
df.plot(kind="bar",stacked=True,figsize=(10,8))
plt.legend(loc="center",bbox_to_anchor=(0.8,1.0))
plt.show()
But when I change the a, b, c, ... portion of the code to January, February, ... it does not plot the same:
import pandas as pd
import matplotlib.pyplot as plt
years=["2016","2017","2018","2019","2020", "2021"]
dataavail={
"january":[20,0,0,0,10,21],
"February":[20,13,10,18,15,45],
"March":[20,20,10,15,18,78],
"April":[20,20,10,15,18,75],
"may":[20,20,10,15,18,78],
"June":[20,20,10,15,18,78],
"July":[20,20,10,15,18,78],
"August":[20,20,10,15,18,78],
"September":[20,20,10,15,18,78],
"October":[20,20,10,15,18,78],
"November":[20,20,10,15,18,78],
"December":[20,20,0,0,0,20],
}
df=pd.DataFrame(dataavail,index=years)
df.plot(kind="bar",stacked=True,figsize=(10,8))
plt.legend(loc="center",bbox_to_anchor=(0.8,1.0))
plt.show()
With Python 3.9.7, your two graphs look the same:
>>> df_alpha
       a   b   c   d   e   f   g   h   i   j   k   l
2016  20  20  20  20  20  20  20  20  20  20  20  20
2017   0  13  20  20  20  20  20  20  20  20  20  20
2018   0  10  10  10  10  10  10  10  10  10  10   0
2019   0  18  15  15  15  15  15  15  15  15  15   0
2020  10  15  18  18  18  18  18  18  18  18  18   0
2021  21  45  78  75  78  78  78  78  78  78  78  20
>>> df_month
      January  February  March  April  may  June  July  August  September  October  November  December
2016       20        20     20     20   20    20    20      20         20       20        20        20
2017        0        13     20     20   20    20    20      20         20       20        20        20
2018        0        10     10     10   10    10    10      10         10       10        10         0
2019        0        18     15     15   15    15    15      15         15       15        15         0
2020       10        15     18     18   18    18    18      18         18       18        18         0
2021       21        45     78     75   78    78    78      78         78       78        78        20
Full code:
import pandas as pd
import matplotlib.pyplot as plt
years = ['2016', '2017', '2018', '2019', '2020', '2021']
dataavail1 = {'a': [20, 0, 0, 0, 10, 21], 'b': [20, 13, 10, 18, 15, 45], 'c': [20, 20, 10, 15, 18, 78], 'd': [20, 20, 10, 15, 18, 75], 'e': [20, 20, 10, 15, 18, 78], 'f': [20, 20, 10, 15, 18, 78], 'g': [20, 20, 10, 15, 18, 78], 'h': [20, 20, 10, 15, 18, 78], 'i': [20, 20, 10, 15, 18, 78], 'j': [20, 20, 10, 15, 18, 78], 'k': [20, 20, 10, 15, 18, 78], 'l': [20, 20, 0, 0, 0, 20]}
dataavail2 = {'January': [20, 0, 0, 0, 10, 21], 'February': [20, 13, 10, 18, 15, 45], 'March': [20, 20, 10, 15, 18, 78], 'April': [20, 20, 10, 15, 18, 75], 'may': [20, 20, 10, 15, 18, 78], 'June': [20, 20, 10, 15, 18, 78], 'July': [20, 20, 10, 15, 18, 78], 'August': [20, 20, 10, 15, 18, 78], 'September': [20, 20, 10, 15, 18, 78], 'October': [20, 20, 10, 15, 18, 78], 'November': [20, 20, 10, 15, 18, 78], 'December': [20, 20, 0, 0, 0, 20]}
df_alpha = pd.DataFrame(dataavail1, index=years)
df_month = pd.DataFrame(dataavail2, index=years)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
df_alpha.plot(kind='bar', stacked=True, colormap=plt.cm.tab20, ax=ax1, rot=0)
df_month.plot(kind='bar', stacked=True, colormap=plt.cm.tab20, ax=ax2, rot=0)
plt.show()
Update: the code also works with Python 3.7.12
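If the column order does come out differently in some environment (the answer above could not reproduce it), one option is to state the order explicitly instead of relying on the dictionary. A minimal sketch with a shortened, deliberately shuffled dictionary; month_order is a hypothetical helper list, not something from the question:

```python
import pandas as pd

years = ["2016", "2017", "2018"]

# Desired stacking order, written out explicitly.
month_order = ["January", "February", "March"]

# Keys are intentionally not in calendar order.
dataavail = {
    "March": [20, 20, 10],
    "January": [20, 0, 0],
    "February": [20, 13, 10],
}

# Selecting with a list of labels fixes the column order.
df = pd.DataFrame(dataavail, index=years)[month_order]
print(df)

# df.plot(kind="bar", stacked=True) would then stack the bars in month_order.
```

Passing columns=month_order to the DataFrame constructor should have the same effect.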

Generating boolean dataframe based on contents in series and dataframe

I have:
df = pd.DataFrame(
[
[22, 33, 44],
[55, 11, 22],
[33, 55, 11],
],
index=["abc", "def", "ghi"],
columns=list("abc")
) # size(3,3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # size(1,5)
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
[
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
],
index=unique,
columns=df.columns
)
This new df is either true or false depending on whether the value is present in the original dataframe or not. For example, the first column has three values: [22, 55, 33]. In a df with dimensions (5,3), this first column would be: [0, 1, 1, 0, 1] i.e. [0, 22, 33, 0 , 55]
I tried filter2 = unique.isin(df), but this doesn't work, and neither does notnull. I also tried applying a filter, but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with Series.reset_index and DataFrame.pivot, then test for non-missing values with DataFrame.notna, cast to integers to map True->1 and False->0, and finally remove the index and column names with DataFrame.rename_axis:
df_new = (df.stack()
            .reset_index(name='v')
            # keyword arguments: pandas 2.x no longer accepts positional ones here
            .pivot(index='v', columns='level_1', values='level_0')
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print (df_new)
    a  b  c
11  0  1  1
22  1  0  1
33  1  1  0
44  0  0  1
55  1  1  0
The helper Series is not necessary, but if it contains additional values, or if you need the order it defines, add DataFrame.reindex:
#added 66
unique = pd.Series([11, 22, 33, 44, 55,66])
df_new = (df.stack()
            .reset_index(name='v')
            # keyword arguments: pandas 2.x no longer accepts positional ones here
            .pivot(index='v', columns='level_1', values='level_0')
            .reindex(unique)
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print (df_new)
    a  b  c
11  0  1  1
22  1  0  1
33  1  1  0
44  0  0  1
55  1  1  0
66  0  0  0
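An alternative sketch, not part of the answer above: test each column of df against the helper Series with Series.isin. This assumes unique already lists every value of interest, in the desired order.

```python
import pandas as pd

df = pd.DataFrame(
    [[22, 33, 44], [55, 11, 22], [33, 55, 11]],
    index=["abc", "def", "ghi"],
    columns=list("abc"),
)
unique = pd.Series([11, 22, 33, 44, 55, 66])

# For every column, flag which entries of `unique` occur in that column.
df_new = pd.DataFrame(
    {col: unique.isin(df[col]).to_numpy() for col in df.columns},
    index=unique.to_numpy(),
).astype(int)

print(df_new)
```

The question's attempt unique.isin(df) compares against the whole frame at once; going column by column is what gives the (len(unique), number of columns) shape.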