How to convert this nested loop to numpy broadcast? - numpy

I want to rearrange my data (two even-length 1d arrays):
cs = [w x y z]
rs = [a b c d e f]
to make a result like this:
[[a b w x]
[c d w x]
[e f w x]
[a b y z]
[c d y z]
[e f y z]]
This is what I have tried (it works):
ls = []
for c in range(0,len(cs),2):
    for r in range(0,len(rs),2):
        item = [rs[r], rs[r+1], cs[c], cs[c+1]]
        ls.append(item)
But I want to get the same result using reshaping/broadcasting or other numpy functions.
What is the idiomatic way to do this task in numpy?

You could tile the elements of rs, repeat the elements of cs and then arrange those as columns for a 2D array:
import numpy as np
cs = np.array(['w', 'x', 'y', 'z'])
rs = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
res = np.c_[np.tile(rs[::2], len(cs) // 2), np.tile(rs[1::2], len(cs) // 2),
np.repeat(cs[::2], len(rs) // 2), np.repeat(cs[1::2], len(rs) // 2)]
Result:
array([['a', 'b', 'w', 'x'],
['c', 'd', 'w', 'x'],
['e', 'f', 'w', 'x'],
['a', 'b', 'y', 'z'],
['c', 'd', 'y', 'z'],
['e', 'f', 'y', 'z']], dtype='<U1')
An alternative:
np.c_[np.tile(rs.reshape(-1, 2), (len(cs) // 2, 1)),
np.repeat(cs.reshape(-1, 2), len(rs) // 2, axis=0)]
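Since the question asks specifically about broadcasting, here is a minimal sketch of that route, assuming both arrays have even length as stated in the question. The names r2, c2, left and right are illustrative only and do not come from the answer above.

```python
import numpy as np

cs = np.array(['w', 'x', 'y', 'z'])
rs = np.array(['a', 'b', 'c', 'd', 'e', 'f'])

# Pair up consecutive elements: r2 has shape (3, 2), c2 has shape (2, 2).
r2 = rs.reshape(-1, 2)
c2 = cs.reshape(-1, 2)
n_r, n_c = len(r2), len(c2)

# Broadcast both to a common (n_c, n_r, 2) grid without copying data:
# left[i, j] is the j-th rs pair, right[i, j] is the i-th cs pair.
left = np.broadcast_to(r2[None, :, :], (n_c, n_r, 2))
right = np.broadcast_to(c2[:, None, :], (n_c, n_r, 2))

# Join the pairs along the last axis and flatten the grid into rows.
res = np.concatenate([left, right], axis=-1).reshape(-1, 4)
print(res)
```

The broadcast views themselves are cheap; the copy happens only in the final concatenate.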

An alternative to using tile/repeat is to generate repeated row indices.
Make the two arrays - reshaped as they will be combined:
In [106]: rs=np.reshape(list('abcdef'),(3,2))
In [107]: cs=np.reshape(list('wxyz'),(2,2))
In [108]: rs
Out[108]:
array([['a', 'b'],
['c', 'd'],
['e', 'f']], dtype='<U1')
In [109]: cs
Out[109]:
array([['w', 'x'],
['y', 'z']], dtype='<U1')
Make 'meshgrid' like indices (itertools.product could also be used)
In [110]: IJ = np.indices((3,2))
In [111]: IJ
Out[111]:
array([[[0, 0],
[1, 1],
[2, 2]],
[[0, 1],
[0, 1],
[0, 1]]])
reshape with order gives two 1d arrays:
In [112]: I,J=IJ.reshape(2,6,order='F')
In [113]: I,J
Out[113]: (array([0, 1, 2, 0, 1, 2]), array([0, 0, 0, 1, 1, 1]))
Then just index the rs and cs and combine them with hstack:
In [114]: np.hstack((rs[I],cs[J]))
Out[114]:
array([['a', 'b', 'w', 'x'],
['c', 'd', 'w', 'x'],
['e', 'f', 'w', 'x'],
['a', 'b', 'y', 'z'],
['c', 'd', 'y', 'z'],
['e', 'f', 'y', 'z']], dtype='<U1')
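As mentioned above, itertools.product could also generate the index pairs. The following is a minimal sketch of that variant, assuming the same reshaped rs and cs; the name pairs is illustrative.

```python
import itertools
import numpy as np

rs = np.reshape(list('abcdef'), (3, 2))
cs = np.reshape(list('wxyz'), (2, 2))

# product varies its last argument fastest, so put the cs index first
# to keep each cs pair fixed while cycling through the rs pairs.
pairs = np.array(list(itertools.product(range(len(cs)), range(len(rs)))))
J, I = pairs[:, 0], pairs[:, 1]

res = np.hstack((rs[I], cs[J]))
print(res)
```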
Edit:
Here's another way of looking at this, a bit more advanced. With sliding_window_view we can get a "block" view of that Out[114] result:
In [130]: np.lib.stride_tricks.sliding_window_view(_114,(3,2))[::3,::2,:,:]
Out[130]:
array([[[['a', 'b'],
['c', 'd'],
['e', 'f']],
[['w', 'x'],
['w', 'x'],
['w', 'x']]],
[[['a', 'b'],
['c', 'd'],
['e', 'f']],
[['y', 'z'],
['y', 'z'],
['y', 'z']]]], dtype='<U1')
With a bit more reverse engineering, I find I can create Out[114] with:
In [147]: res = np.zeros((6,4),'U1')
In [148]: res1 = np.lib.stride_tricks.sliding_window_view(res,(3,2),writeable=True)[::3,::2,:,:]
In [149]: res1[:,0,:,:] = rs
In [150]: res1[:,1,:,:] = cs[:,None,:]
In [151]: res
Out[151]:
array([['a', 'b', 'w', 'x'],
['c', 'd', 'w', 'x'],
['e', 'f', 'w', 'x'],
['a', 'b', 'y', 'z'],
['c', 'd', 'y', 'z'],
['e', 'f', 'y', 'z']], dtype='<U1')
I can't say that either of these is superior, but they show there are various ways of "vectorizing" this kind of array layout.

Related

Extract nx.MultiDiGraph values into a dataframe without using nx.to_pandas_edgelist or nx.to_pandas_adjacency

I have a DataFrame that gives me edgelist like this :
df = pd.DataFrame({
'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
'father' : ['A', 'D', 'C', 'B', 'F', 'G', 'I', 'H', 'J'],
'son' : ['B', 'E', 'D', 'C', 'G', 'H', 'J', 'I', 'K']
})
df
key father son
0 E A B
1 E D E
2 E C D
3 E B C
4 K F G
5 K G H
6 K I J
7 K H I
8 K J K
Then I turn it into a MultiDiGraph and plot it with this (plotting with pygraphviz !) :
import matplotlib.pyplot as plt
import networkx as nx
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
G = nx.from_pandas_edgelist(df, 'father', 'son', create_using=nx.MultiDiGraph)
plt.title('draw_networkx')
pos = graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=True)
It constructs the links just as I wanted! But I need to get it back into a DataFrame in the form shown below, not in edgelist format (which would just take me back to my initial input).
It's like I need to extract the values in the right order, and it would give me this result :
df2 = pd.DataFrame({
'key' : ['E', 'K'],
'step_0' : ['A', 'F'],
'step_1' : ['B', 'G'],
'step_2' : ['C', 'H'],
'step_3' : ['D', 'I'],
'step_4' : ['E', 'J'],
'step_5' : [np.nan, 'K']
})
df2
key step_0 step_1 step_2 step_3 step_4 step_5
0 E A B C D E NaN
1 K F G H I J K
I don't know much about how graphs are structured in NetworkX, but I can't do this with nx.to_pandas_adjacency or nx.to_pandas_edgelist.
In effect, if I turn the graph back into a DataFrame with nx.to_pandas_edgelist, the output is my input DataFrame. It would at least help if it could give the edgelist in the following order instead of the original one :
df1 = pd.DataFrame({
'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
'father' : ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I', 'J'],
'son' : ['B', 'C', 'D', 'E', 'G', 'H', 'I', 'J', 'K']
})
df1
key father son
0 E A B
1 E B C
2 E C D
3 E D E
4 K F G
5 K G H
6 K H I
7 K I J
8 K J K
How can I find the values in the right order for each key (each key being the end of its chain of links), and then extract them into a DataFrame by hand ?

Pandas Dataframe - running through two columns 'Father' and 'Son' to rebuild end-to-end links step by step

I have a long dataframe I need to transform to get a wide one.
The long one is :
df = pd.DataFrame({
'key' : ['E', 'E', 'E', 'E', 'J', 'J', 'J', 'J'],
'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I'],
'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J']
})
df
The first thing to do, I think, is to group it by key. Then we have to find where each key appears in the 'son' column: that is the end (the last son) of the link I need to rebuild.
To rebuild the link, I need to look up his 'father'. That 'father' is kept as the father of the final step and must in turn be looked up in 'son'.
I need to repeat this until a 'father' cannot be found in the 'son' column, which makes it the father_0 of the link.
I think it could be done by iterating those steps in a recursive function whose stop case is: 'father' not found in 'son'.
Here is the dataframe I want to get from this :
df1 = pd.DataFrame({
'key' : ['E', 'J'],
'father_1' : ['A', 'F'],
'son_1' : ['B', 'G'],
'father_2' : ['B', 'G'],
'son_2' : ['C', 'H'],
'father_3' : ['C', 'H'],
'son_3' : ['D', 'I'],
'father_4' : ['D', 'I'],
'son_4' : ['E', 'J'],
})
df1
I simplified the problem here with 2 different links of the same depth, but they could be from depth 1 to depth 10 (maybe more but rarely and unpredictably) for a lot of different keys.
Here is another example of df with 2 links of different size :
df_ = pd.DataFrame({
'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I', 'J'],
'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J', 'K']
})
df_
df_1 = pd.DataFrame({
'key' : ['E', 'K'],
'father_1' : ['A', 'F'],
'son_1' : ['B', 'G'],
'father_2' : ['B', 'G'],
'son_2' : ['C', 'H'],
'father_3' : ['C', 'H'],
'son_3' : ['D', 'I'],
'father_4' : ['D', 'I'],
'son_4' : ['E', 'J'],
'father_5' : [np.nan, 'J'],
'son_5' : [np.nan, 'K']
})
df_1
Then the final step is easy, it's about taking 'father_x' and 'son_x-1' into 'step_x-1' :
So the resulting dataframes for these examples would be :
df2 = pd.DataFrame({
'key' : ['E', 'J'],
'step_0' : ['A', 'F'],
'step_1' : ['B', 'G'],
'step_2' : ['C', 'H'],
'step_3' : ['D', 'I'],
'step_4' : ['E', 'J'],
})
df2
df_2 = pd.DataFrame({
'key' : ['E', 'K'],
'step_0' : ['A', 'F'],
'step_1' : ['B', 'G'],
'step_2' : ['C', 'H'],
'step_3' : ['D', 'I'],
'step_4' : ['E', 'J'],
'step_5' : [np.nan, 'K']
})
df_2
My concern is more about how to aggregate the data from long to wide, following the rules given above, in a recursive function.
It's like a groupby.agg, except that I can't just pass a dictionary to it because the new columns depend on the number of iterations of the recursive function for each key.
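Assign a new key with cumcount, then we can pivot (cumcount numbers the rows in their existing order, so this assumes the rows of each key are already in chain order):
out = df.assign(c = df.groupby('key').cumcount().add(1).astype(str)).pivot(index='key', columns='c').sort_index(level=1,axis=1)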
out.columns = out.columns.map('_'.join)
out
Out[34]:
father_1 son_1 father_2 son_2 father_3 son_3 father_4 son_4
key
E A B B C C D D E
J F G G H H I I J
I found a solution for this specific type of dataframe : where we only have 1 predecessor for all values except root.
It also requires using NetworkX. I didn't find a way to do it only using Pandas.
First, we need to build a graph from edgelist :
G = nx.from_pandas_edgelist(df, 'father', 'son', create_using=nx.MultiDiGraph, edge_key = 'key')
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
#write_dot(G,'test.dot')
plt.title('draw_networkx')
pos =graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=True)
For pygraphviz install, please see this question.
Then end-to-end links dataframe is built with :
num=0
num_max = len(df.key.drop_duplicates())
m_max = 30
dfy = pd.DataFrame(index=range(num_max),columns=range(m_max))
for n in df.key.drop_duplicates() :
    m = 0
    dfy.iloc[num, m] = n
    while len(list(G.predecessors(dfy.iloc[num,m])))!=0 :
        dfy.iloc[num,m+1] = list(G.predecessors(dfy.iloc[num,m]))[0]
        m+=1
    num+=1
print(dfy)
Output :
0 1 2 3 4 5 6 7 8 9 ...
0 E D C B A NaN NaN NaN NaN NaN ...
1 K J I H G F NaN NaN NaN NaN ...
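The output above runs from each key back to its root, whereas the desired step_0 ... step_n layout runs from root to key. As a rough sketch of the same walk without NetworkX, assuming (as in this answer) that every value has at most one predecessor within its key: the function name build_chains and the variable names below are hypothetical, not part of any library.

```python
def build_chains(rows):
    """rows: iterable of (key, father, son) tuples.
    Returns {key: [root, ..., key]}, ordered from root to key."""
    parents = {}
    for key, father, son in rows:
        # One son -> father lookup table per key.
        parents.setdefault(key, {})[son] = father

    chains = {}
    for key, parent_of in parents.items():
        chain, node, seen = [key], key, {key}
        # Walk upwards until a node has no father; 'seen' guards against cycles.
        while node in parent_of and parent_of[node] not in seen:
            node = parent_of[node]
            seen.add(node)
            chain.append(node)
        chains[key] = chain[::-1]  # reverse so the root comes first
    return chains


# Same data as df_ in the question: (key, father, son) per row.
rows = list(zip('EEEEKKKKK', 'ADCBFHGIJ', 'BEDCGIHJK'))
chains = build_chains(rows)

# Pad shorter chains with None so every key has the same number of steps.
width = max(len(c) for c in chains.values())
table = {key: chain + [None] * (width - len(chain)) for key, chain in chains.items()}
print(table)
```

With pandas, the rows could come from df.itertuples(index=False, name=None) (assuming the column order key, father, son), and table could be passed to pd.DataFrame.from_dict(table, orient='index') to get one row per key.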

numpy reshape/transpose 3D to wide 2D

Make example
letters = np.array([
    np.array([
        np.array(['a','a','a'])
        , np.array(['b','b','b'])
        , np.array(['c','c','c'])
    ])
    , np.array([
        np.array(['d','d','d'])
        , np.array(['e','e','e'])
        , np.array(['f','f','f'])
    ])
    , np.array([
        np.array(['g','g','g'])
        , np.array(['h','h','h'])
        , np.array(['i','i','i'])
    ])
])
array([[['a', 'a', 'a'],
['b', 'b', 'b'],
['c', 'c', 'c']],
[['d', 'd', 'd'],
['e', 'e', 'e'],
['f', 'f', 'f']],
[['g', 'g', 'g'],
['h', 'h', 'h'],
['i', 'i', 'i']]], dtype='<U1')
Desired output
array([['a', 'a', 'a', 'd', 'd', 'd', 'g', 'g', 'g'],
['b', 'b', 'b', 'e', 'e', 'e', 'h', 'h', 'h'],
['c', 'c', 'c', 'f', 'f', 'f', 'i', 'i', 'i']], dtype='<U1')
See how the 2D arrays are now side-by-side?
For the sake of memory, I'd prefer to do this with transpose and reshape rather than stacking/ concatting a new array.
Attempt
letters.reshape(
letters.shape[2],
letters.shape[0]*letters.shape[1]
)
array([['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
['d', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
['g', 'g', 'g', 'h', 'h', 'h', 'i', 'i', 'i']], dtype='<U1')
I think I need to transpose... before reshaping?
letters.transpose(
1,0,2
).reshape(
# where index represents dimension
letters.shape[2],
letters.shape[0]*letters.shape[1]
)
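The transpose-then-reshape attempt above does give the desired layout for this 3x3x3 example, but the target shape only works because all three dimensions are equal. A minimal sketch of the general case, assuming an input of shape (blocks, rows, cols) and using a non-cubic array so the axes cannot be confused (the names arr and wide are illustrative):

```python
import numpy as np

# 2 blocks, each a 3x4 matrix, to be laid side by side as a 3x8 matrix.
arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)
n_blocks, n_rows, n_cols = arr.shape

# Move the row axis to the front, then merge (block, col) into one wide axis.
wide = arr.transpose(1, 0, 2).reshape(n_rows, n_blocks * n_cols)
print(wide)

# Same layout as concatenating the blocks along the column axis.
reference = np.concatenate(list(arr), axis=1)
print(np.array_equal(wide, reference))

# The merged axis is not contiguous in the original buffer, so this
# reshape returns a copy rather than a view.
print(np.shares_memory(wide, arr))
```

On the memory concern: the transpose is a view, but the reshape that follows has to copy here, so this route does not avoid allocating a new array.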

Matplotlib scatter plot color-coded by text, how to add legend?

I'm trying to color-code a scatter plot based on the string in a column. I can't figure out how to set up the legend.
Repeatable Example
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
## Dummy Data
x = [0, 0.03, 0.075, 0.108, 0.16, 0.26, 0.37, 0.49, 0.76, 1.05, 1.64,
0.015, 0.04, 0.085, 0.11, 0.165, 0.29, 0.37, 0.6, 0.78, 1.1]
y = [16.13, 0.62, 2.15, 41.083, 59.97, 13.30, 7.36, 6.80, 4.97, 3.53, 11.77,
30.21, 64.47, 57.64, 56.83, 46.69, 4.22, 30.35, 35.12, 5.22, 25.32]
label = ['a', 'a', 'c', 'a', 'c', 'b', 'c', 'c', 'c', 'b', 'c',
'c', 'c', 'a', 'b', 'a', 'a', 'a', 'b', 'c', 'c', 'c']
df = pd.DataFrame(
list(zip(x, y, label)),
columns =['x', 'y', 'label']
)
## Set up colors dictionary
mydict = {'a': 'darkviolet',
'b': 'darkgoldenrod',
'c': 'olive'}
## Plotting
plt.scatter(df.x, df.y, c=df['label'].map(mydict))
plt.legend(loc="upper right", frameon=True)
Current Output
Desired Output
Same plot as above, I just want to define the legend handle.
Thanks for any help
You can use matplotlib.patches.Patch to build proxy handles for the legend.
Just add these lines of code to your script
import matplotlib.patches as mpatches
fake_handles = [mpatches.Patch(color=item) for item in mydict.values()]
label = mydict.keys()
plt.legend(fake_handles, label, loc='upper right', prop={'size': 10})
and you will get
Build a list of legend handles as shown below. plt.plot returns a list of lines, so [0] takes the first (and only) line from each call as the handle.
import matplotlib.pyplot as plt
import pandas as pd
## Dummy Data
x = [0, 0.03, 0.075, 0.108, 0.16, 0.26, 0.37, 0.49, 0.76, 1.05, 1.64,
0.015, 0.04, 0.085, 0.11, 0.165, 0.29, 0.37, 0.6, 0.78, 1.1]
y = [16.13, 0.62, 2.15, 41.083, 59.97, 13.30, 7.36, 6.80, 4.97, 3.53, 11.77,
30.21, 64.47, 57.64, 56.83, 46.69, 4.22, 30.35, 35.12, 5.22, 25.32]
label = ['a', 'a', 'c', 'a', 'c', 'b', 'c', 'c', 'c', 'b', 'c',
'c', 'c', 'a', 'b', 'a', 'a', 'a', 'b', 'c', 'c', 'c']
df = pd.DataFrame(
list(zip(x, y, label)),
columns =['x', 'y', 'label']
)
## Set up colors dictionary
mydict = {'a': 'darkviolet',
'b': 'darkgoldenrod',
'c': 'olive'}
legendhandle = [plt.plot([], marker="o", ls="", color=color)[0] for color in list(mydict.values())]
plt.scatter(df.x, df.y, c=df['label'].map(mydict))
plt.legend(legendhandle,list(mydict.keys()),loc="upper right", frameon=True)
plt.show()
Are you open to seaborn?
import seaborn as sns
sns.scatterplot(data=df, x='x',y='y',hue='label', palette=mydict)
With pandas/matplotlib only, you can do a loop:
fig, ax = plt.subplots()
for l,d in df.groupby('label'):
    d.plot.scatter(x='x',y='y', label=l, c=mydict[l], ax=ax)
plt.legend()

Dataframe merge creates multiple columns

import numpy as np
import pandas as pd
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': [ 'E', 'F', 'G', 'H'], 'value': np.random.randn(4)})
df = left.merge(right, on='key', how='outer', indicator=True)
df
This always creates value_x and value_y columns. Is it possible to have only one value column with merge?
I think you want something like this; otherwise, please share what you want your output to look like :
import numpy as np
import pandas as pd
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': [ 'E', 'F', 'G', 'H'], 'value': np.random.randn(4)})
df = pd.concat([left,right])
df
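If merge itself has to be used, one option is to include 'value' in the join keys, so pandas has no overlapping non-key column to suffix. This is a minimal sketch with fixed values in place of the random ones; note that rows then match only when both key and value are exactly equal, which may or may not be what is wanted.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['E', 'F', 'G', 'H'], 'value': [5.0, 6.0, 7.0, 8.0]})

# Joining on both columns leaves a single 'value' column in the result;
# indicator=True still records which frame each row came from.
merged = left.merge(right, on=['key', 'value'], how='outer', indicator=True)
print(merged)
```

When the frames share no keys, as here, this gives the same rows as pd.concat plus the _merge column.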