Networkx: colour nodes differently only when certain attribute determining the colour is present - matplotlib

While building a networkx graph, my algorithm sometimes adds a custcol attribute to only a few of the nodes, for example:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.DiGraph()
G.add_edges_from([
('A','B'),
('B','C'),
('C','D'),
('D','E'),
('F','B'),
('B','G'),
('B','D'),
])
# in real life the following would be an algorithm deciding if the node
# should be custom coloured, and which colour it should get
G.nodes['C']['custcol'] = 'red' # simple setting for the example
# now let's explore the created example nodes
for node in G.nodes(data=True):
    print(node)
which would print out:
('A', {})
('B', {})
('C', {'custcol': 'red'})
('D', {})
('E', {})
('F', {})
('G', {})
I am now displaying the network with a single draw such as:
NXDOPTS = {
"node_color": "orange",
"edge_color": "powderblue",
"node_size": 400,
"width": 2,
}
nx.draw(G, with_labels=True, **NXDOPTS)
which would generate the following picture:
What would be the best/most Pythonic/most efficient way of drawing node 'C' (in this example) with the colour of its custcol attribute? Of course, in reality this will need to be applied to a few dozen nodes with a few different colours that are decided case by case when creating them.

You can loop through the node data, and create a list of corresponding colors:
NXDOPTS = {
"node_color": [data["custcol"] if "custcol" in data else "orange" for _, data in G.nodes(data=True)],
"edge_color": "powderblue",
"node_size": 400,
"width": 2,
}
nx.draw(G, with_labels=True, **NXDOPTS)
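A slightly shorter variant of the same idea uses dict.get with a default value. The sketch below is only an illustration: it works on a plain list of (node, attributes) pairs shaped like the G.nodes(data=True) output printed in the question, so it runs without networkx. The names node_colours and DEFAULT_COLOUR are made up for this example.

```python
# Hypothetical helper: build the colour list that can be passed as node_color.
DEFAULT_COLOUR = "orange"

def node_colours(nodes_with_data, default=DEFAULT_COLOUR):
    """Return one colour per node, falling back to `default` when 'custcol' is absent."""
    return [data.get("custcol", default) for _, data in nodes_with_data]

# Stand-in for G.nodes(data=True) from the question.
example_nodes = [("A", {}), ("B", {}), ("C", {"custcol": "red"}), ("D", {})]
colours = node_colours(example_nodes)
print(colours)
```

With a real graph the call would be node_colours(G.nodes(data=True)), and the result goes into NXDOPTS["node_color"]. The colours line up with the nodes because both follow the graph's own node order.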

Related

Combinations & Numpy

I need to rate each combination in order to find the best one.
I have completed the first step, but it is not optimized at all.
When RQ, NBPIS or NBSER is large, my code takes far too long to run.
Do you have an idea how to get the same result much faster?
Thank you very much
import numpy as np
from itertools import combinations, combinations_with_replacement
#USER SETTINGS
RQ=['A','B','C','D','E','F','G','H']
NBPIS=3
NBSER=3
#CODE
Combi1=np.array(list(combinations_with_replacement(RQ,NBPIS)))
Combi2=combinations_with_replacement(Combi1,NBSER)
Combi3=np.array([])
Compt=0
First=0
for X in Combi2:
    Long=0
    Compt=Compt+1
    Y=np.array(X)
    for Z in RQ:
        Long=Long+1
        if Z not in Y:
            break
        elif Long==len(RQ):
            if First==0:
                Combi3=Y
                Combi3 = np.expand_dims(Combi3, axis = 0)
                First=1
            else:
                Combi3=np.append(Combi3, [Y], axis = 0)
#RESULTS
print(Combi3)
print(Combi3.shape)
print(Compt)
Assuming your code produces the desired result, the first step in optimizing it is to refactor it. This may also make it easier for others to jump in and help.
Let's start by making a function of it.
#USER SETTINGS
RQ=['A','B','C','D','E','F','G','H']
NBPIS=3
NBSER=3
def your_code():
    Combi1=np.array(list(combinations_with_replacement(RQ,NBPIS)))
    Combi2=combinations_with_replacement(Combi1,NBSER)
    Combi3=np.array([])
    Compt=0
    First=0
    for X in Combi2:
        Long=0
        Compt=Compt+1
        Y=np.array(X)
        for Z in RQ:
            Long=Long+1
            if Z not in Y:
                break
            elif Long==len(RQ):
                if First==0:
                    Combi3=Y
                    Combi3 = np.expand_dims(Combi3, axis = 0)
                    First=1
                else:
                    Combi3=np.append(Combi3, [Y], axis = 0)
    shape = Combi3.shape
    size = Compt
    return Combi3, shape, size
Refactoring
Notice that Compt is simply the number of elements in Combi2, so turning Combi2 into a NumPy array simplifies the code. It also allows the variable Y to be replaced by X. In addition, Combi1 does not need to be a NumPy array, since it is only consumed by combinations_with_replacement.
def your_code_refactored():
    Combi1 = combinations_with_replacement(RQ,NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1,NBSER)))
    Combi3=np.array([])
    First=0
    for X in Combi2:
        Long=0
        for Z in RQ:
            Long=Long+1
            if Z not in X:
                break
            elif Long==len(RQ):
                if First==0:
                    Combi3=X
                    Combi3 = np.expand_dims(Combi3, axis = 0)
                    First=1
                else:
                    Combi3=np.append(Combi3, [X], axis = 0)
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
The next thing is to refactor how Combi3 is created and populated. The variable First is only used to expand the dimensions of Combi3 on the first iteration, so this logic can be simplified as follows:
def your_code_refactored():
    Combi1 = combinations_with_replacement(RQ,NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1,NBSER)))
    # each X has shape (NBSER, NBPIS); start empty with the same shape and dtype
    Combi3 = np.empty((0, NBSER, NBPIS), dtype=Combi2.dtype)
    for X in Combi2:
        Long=0
        for Z in RQ:
            Long=Long+1
            if Z not in X:
                break
            elif Long==len(RQ):
                Combi3 = np.append(Combi3, [X], axis = 0)
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
It seems Combi3 is populated only with the combinations that contain every element of RQ at least once. This is done by testing whether each element of RQ is in X, which is the same as checking whether RQ is a subset of the elements of X. So it can be simplified further:
def your_code_refactored():
    Combi1 = combinations_with_replacement(RQ,NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1,NBSER)))
    # each X has shape (NBSER, NBPIS); start empty with the same shape and dtype
    Combi3 = np.empty((0, NBSER, NBPIS), dtype=Combi2.dtype)
    unique_RQ = set(RQ)
    for X in Combi2:
        if unique_RQ.issubset(X.flatten()):
            Combi3 = np.append(Combi3, [X], axis = 0)
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
This looks much simpler. Time to make it faster, maybe :)
Optimizing
One way to optimize this code is to replace np.append with list.append. NumPy's documentation notes that np.append does not work in place: it allocates a new array and copies the data each time it is called. list.append may speed things up, since a list over-allocates memory and rarely needs to copy. So the code can be rewritten with a list comprehension.
def your_code_refactored_and_optimized():
    Combi1 = combinations_with_replacement(RQ,NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1,NBSER)))
    unique_RQ = set(RQ)
    Combi3 = np.array([X for X in Combi2 if unique_RQ.issubset(X.flatten())])
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
Testing
Now we can time both versions and see that it runs faster.
import timeit
n = 5
print(timeit.timeit('your_code()', globals=globals(), number=n))
print(timeit.timeit('your_code_refactored_and_optimized()', globals=globals(), number=n))
This isn't much of a gain in speed, but it's something :)
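As a further illustration, the same subset filter can be written with the standard library only, as a generator that never builds the large intermediate array. This is a sketch under assumptions: the function name covering_combinations is invented here, the settings are deliberately tiny, and whether it is faster for the real settings would have to be timed.

```python
from itertools import chain, combinations_with_replacement

def covering_combinations(rq, nbpis, nbser):
    """Yield every nbser-combination of nbpis-tuples that uses each item of rq at least once."""
    required = set(rq)
    pieces = combinations_with_replacement(rq, nbpis)
    for candidate in combinations_with_replacement(pieces, nbser):
        # chain.from_iterable flattens the tuple of tuples without building an array
        if required.issubset(chain.from_iterable(candidate)):
            yield candidate

# Small settings so the example finishes instantly (the question uses 8 letters, 3 and 3).
kept = list(covering_combinations(["A", "B", "C"], nbpis=2, nbser=2))
print(len(kept))
print(kept[0])
```

Because the function is a generator, the results can also be consumed one at a time (for instance to rate each combination) instead of being collected into a list first.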
I have an idea to reduce the execution time by removing unnecessary combinations. To keep it simple, take the following example:
RQ=['A','B','C']
NBPIS=3
NBSER=3
Currently with :
Combi1 = combinations_with_replacement(RQ,NBPIS)
print(list(Combi1))
[('A', 'A', 'A'), ('A', 'A', 'B'), ('A', 'A', 'C'), ('A', 'B', 'B'),
('A', 'B', 'C'), ('A', 'C', 'C'), ('B', 'B', 'B'), ('B', 'B', 'C'),
('B', 'C', 'C'), ('C', 'C', 'C')]
But with :
Combi1 = list(list(combinations(RQ,W)) for W in range(1,NBPIS+1))
print(Combi1)
[[('A',), ('B',), ('C',)], [('A', 'B'), ('A', 'C'), ('B', 'C')],
[('A', 'B', 'C')]]
Problem with :
Combi1 = list(list(combinations(RQ,W)) for W in range(1,NBPIS+1))
Error message :
Combi3 = np.array([X for X in Combi2 if
unique_RQ.issubset(X.flatten())])
TypeError: unhashable type: 'list'
But with :
Combi1 = (combinations(RQ,W) for W in range(1,NBPIS+1))
print(Combi3)
[]
Questions :
For Combi1,
Instead of :
[[('A',), ('B',), ('C',)], [('A', 'B'), ('A', 'C'), ('B', 'C')],
[('A', 'B', 'C')]]
how to get this ? :
[('A'), ('B'), ('C'), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B',
'C')]
For Combi3, is it possible to get an array with different sizes ?
Instead of :
[[['A' 'A' 'A'] ['A' 'A' 'A'] ['A' 'B' 'C']]...
Obtain this ? :
[[['A'] ['A'] ['A' 'B' 'C']]...
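For the first of these questions, one possible approach (a sketch, not the only way) is to flatten the per-length lists with itertools.chain.from_iterable. Note that Python prints one-element tuples as ('A',) rather than ('A').

```python
from itertools import chain, combinations

RQ = ['A', 'B', 'C']
NBPIS = 3

# One flat list of tuples of length 1..NBPIS instead of one sub-list per length.
Combi1 = list(chain.from_iterable(combinations(RQ, W) for W in range(1, NBPIS + 1)))
print(Combi1)
```

As for Combi3, a NumPy array is rectangular, so rows of different lengths do not fit a regular array. The usual options are to keep a plain Python list of tuples or to build an array with dtype=object.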

Query regarding a desired feature in PySimpleGUI TK version

I have built a graphics-oriented package using the Graph element. I need to do keyboard input based on Graph element coordinates. Currently I use the keyboard events to place characters on the Graph element with draw_text. It works, but it is a bit slow, I run into problems interpreting the key codes returned on different platforms, and the overhead of echoing the characters back onto the Graph element myself does not help.
My question: in PySimpleGUI (Tk), is there a way to use the Tk Entry widget directly at Graph coordinates?
IMO it can be done the way you ask, but it is much more complex.
Here is a simple way to enter text on a Graph element: type into the Input element, then click on the Graph to draw the text at that point.
import PySimpleGUI as sg
font = ('Courier New', 16, 'bold')
layout = [
[sg.Input(expand_x=True, key='INPUT')],
[sg.Graph((320, 240), (0, 0), (320, 240), enable_events=True, key='GRAPH',
background_color='green')],
]
window = sg.Window('Draw Text', layout, margins=(0, 0), finalize=True)
entry, graph = window['INPUT'], window['GRAPH']
while True:
    event, values = window.read()
    if event == sg.WIN_CLOSED:
        break
    elif event == 'GRAPH':
        location, text = values['GRAPH'], values['INPUT']
        if text:
            graph.draw_text(text, location, font=font, color='white', text_location=sg.TEXT_LOCATION_LEFT)
window.close()
Here is something closer to your request, but it is much more complex and requires tkinter code.
import PySimpleGUI as sg
def entry_callback(event, graph, entry_id, font, location, color):
    widget = event.widget
    text = widget.get()
    graph.widget.delete(entry_id)
    if text:
        graph.draw_text(text, location=location, font=font, color=color, text_location='sw')
font = ('Courier New', 16, 'bold')
layout = [
[sg.Graph((320, 240), (0, 0), (320, 240), enable_events=True, key='GRAPH',
background_color='green')],
]
window = sg.Window('Draw Text', layout, margins=(0, 0), finalize=True)
graph = window['GRAPH']
while True:
    event, values = window.read()
    if event == sg.WIN_CLOSED:
        break
    elif event == 'GRAPH':
        location = tuple(map(int, graph._convert_xy_to_canvas_xy(*values['GRAPH'])))
        entry = sg.tk.Entry(graph.widget, font=font, fg='white', bg='green', width=45)
        entry_id = graph.widget.create_window(*location, window=entry, anchor="sw")
        entry.bind('<Return>', lambda event, graph=graph, entry_id=entry_id, font=font, location=values['GRAPH'], color='white':entry_callback(event, graph, entry_id, font, location, color))
        entry.focus_force()
window.close()

Formatting 2 columns into list of tuples (for NER)

I'm looking to format data held in a df, so that it can be used in an NER model. I'm starting with the data in 2 columns, example below:
df['text']         df['annotation']
some text          [('Consequence', 23, 47)]
some other text    [('Consequence', 33, 46), ('Cause', 101, 150)]
And need to format it to:
TRAIN_DATA = [('some text', {'entities': [(23, 47, 'Consequence')]}), ('some other text', {'entities': [(33, 46, 'Consequence'), (101, 150, 'Cause')]})]
I've been attempting to iterate over each row, for example trying:
TRAIN_DATA = []
for row in df['annotation']:
    entities = []
    label, start, end = entity
    entities.append((start, end, label))
    # add to dataset
    TRAIN_DATA.append((df['text'], {'entities': entities}))
However I can't get it to iterate over each row to populate the TRAIN_DATA. Sometimes there are multiple entities within the annotation column.
Grateful if anyone can highlight where I'm going wrong and how to correct it!
You can use the zip() function to walk both columns together and reorder each (label, start, end) tuple:
TRAIN_DATA = [
    (t, {"entities": [(s, e, l) for (l, s, e) in a]})
    for t, a in zip(df["text"], df["annotation"])
]
print(TRAIN_DATA)
Prints:
[
("some text", {"entities": [(23, 47, "Consequence")]}),
(
"some other text",
{"entities": [(33, 46, "Consequence"), (101, 150, "Cause")]},
),
]
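To show where the attempt in the question goes wrong, here is the same logic as explicit loops. It is a sketch that uses plain lists in place of df["text"] and df["annotation"] (assuming the annotation column holds real lists of tuples, not strings), and the function name to_train_data is made up. The original attempt never loops over the entities inside a row, and it appends the whole df['text'] column instead of the text of the current row.

```python
# Plain lists stand in for df["text"] and df["annotation"].
texts = ["some text", "some other text"]
annotations = [
    [("Consequence", 23, 47)],
    [("Consequence", 33, 46), ("Cause", 101, 150)],
]

def to_train_data(texts, annotations):
    """Pair each text with its entities, reordered to (start, end, label)."""
    train_data = []
    for text, row_annotations in zip(texts, annotations):
        entities = []
        # inner loop over the entities of this row, missing from the original attempt
        for label, start, end in row_annotations:
            entities.append((start, end, label))
        train_data.append((text, {"entities": entities}))
    return train_data

TRAIN_DATA = to_train_data(texts, annotations)
print(TRAIN_DATA)
```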

Scala: how to get the mean and variance and covariance of a matrix?

I am new to scala and I desperately need some guidance on the following problem:
I have a dataframe like the one below (some elements may be NULL)
val dfDouble = Seq(
(1.0, 1.0, 1.0, 3.0),
(1.0, 2.0, 0.0, 0.0),
(1.0, 3.0, 1.0, 1.0),
(1.0, 4.0, 0.0, 2.0)).toDF("m1", "m2", "m3", "m4")
dfDouble.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|1.0|1.0|1.0|3.0|
|1.0|2.0|0.0|0.0|
|1.0|3.0|1.0|1.0|
|1.0|4.0|0.0|2.0|
+---+---+---+---+
I need to get the following statistics out of this dataframe:
a vector that contains the mean of each column (some elements might be NULL and I want to calculate the mean using only the non-NULL elements); I would also like to refer to each element of the vector by name for example, vec_mean["m1_mean"] would return the first element
vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)
a variance-covariance matrix that is (4 x 4), where the diagonals are var(m1), var(m2),..., and the off-diagonals are cov(m1,m2), cov(m1,m3) ... Here, I would also like to only use the non-NULL elements in the variance-covariance calculation
A vector that contains the number of non-null for each column
vec_n: Vector(m1_n, m2_n, m3_n, m4_n)
A vector that contains the standard deviation of each column
vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)
In R I would convert everything to a matrix and then the rest is easy. But in Scala I'm unfamiliar with matrices, and there are apparently multiple matrix types, which is confusing (DenseMatrix, IndexedMatrix, etc.)
Edited: apparently it makes a difference whether the content of the dataframe is Double or Int. I revised the elements to be Double.
I used the following command, as suggested in the answer, and it worked:
val rdd = dfDouble0.rdd.map {
  case a: Row => (0 until a.length).foldRight(Array[Double]())((b, acc) =>
    { val k = a.getAs[Double](b)
      if(k == null)
        acc.+:(0.0)
      else acc.+:(k)}).map(_.toDouble)
}
You can work with Spark's RowMatrix. It offers this kind of operation, such as computing the covariance matrix using each row as an observation, the mean, the variance, etc. The only thing you have to know is how to build it from a DataFrame.
A DataFrame in Spark carries a schema describing the types of data it can store, and these are not just arrays of floating-point numbers. So the first step is to transform the DataFrame into an RDD of vectors (dense vectors in this case).
Having this DF:
val df = Seq(
(1, 1, 1, 3),
(1, 2, 0, 0),
(1, 3, 1, 1),
(1, 4, 0, 2),
(1, 5, 0, 1),
(2, 1, 1, 3),
(2, 2, 1, 1),
(2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")
Convert it to an RDD[DenseVector] representation. There must be dozens of ways of doing this. A first step could be:
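// imports used by this and the following snippets
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rdd = df.rdd.map {
  case a: Row =>
    (0 until a.length).foldRight(Array[Int]())((b, acc) => {
      // isNullAt is the reliable null test; a NULL cell is replaced by 0 here
      val k = if (a.isNullAt(b)) 0 else a.getAs[Int](b)
      acc.+:(k)
    }).map(_.toDouble)
}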
As you can see in your IDE, the inferred type is RDD[Array[Double]]. Now convert this to an RDD[DenseVector], which is as simple as:
val rowsRdd = rdd.map(Vectors.dense(_))
And now you can build your Matrix:
val mat: RowMatrix = new RowMatrix(rowsRdd)
Once you have the matrix, you can easily compute the different metrics per column:
println("Mean: " + mat.computeColumnSummaryStatistics().mean)
println("Variance: " + mat.computeColumnSummaryStatistics().variance)
It gives:
Mean: [1.375,2.625,0.5,1.375]
Variance:
[0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
You can read more about the capabilities of Spark and these distributed types in the docs: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api
You can also compute the covariance matrix with mat.computeCovariance(), perform an SVD with mat.computeSVD(k), etc.
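The figures above can be cross-checked outside Spark. The following is a small Python sketch (not Scala, and not part of the original answer) that recomputes the per-column count, mean and sample variance for the same eight rows, plus one covariance entry. It also shows one way to skip NULLs (None here) instead of replacing them with 0. Skipping a row for a covariance entry only when one of the two columns is missing (pairwise deletion) is an assumption about what the question wants, and the dictionary keys simply mirror the naming used in the question.

```python
from statistics import mean, variance

# Same eight rows as the DataFrame above; None would mark a NULL cell.
rows = [
    (1, 1, 1, 3), (1, 2, 0, 0), (1, 3, 1, 1), (1, 4, 0, 2),
    (1, 5, 0, 1), (2, 1, 1, 3), (2, 2, 1, 1), (2, 3, 0, 0),
]
names = ["m1", "m2", "m3", "m4"]

# Column-wise values with NULLs (None) dropped.
columns = {n: [r[i] for r in rows if r[i] is not None] for i, n in enumerate(names)}

vec_n = {n + "_n": len(v) for n, v in columns.items()}
vec_mean = {n + "_mean": mean(v) for n, v in columns.items()}
vec_var = {n + "_var": variance(v) for n, v in columns.items()}  # sample variance (n - 1)

def sample_cov(i, j):
    """Sample covariance of columns i and j over rows where both are non-null."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    mx = mean(p[0] for p in pairs)
    my = mean(p[1] for p in pairs)
    return sum((x - mx) * (y - my) for x, y in pairs) / (len(pairs) - 1)

print(vec_mean["m1_mean"], vec_var["m1_var"], sample_cov(0, 1))
```

The mean and variance of m1 agree with the RowMatrix output shown above, which indicates that the variance reported there is the sample (n - 1) variance.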

Mix dense and sparse tensors inside tf.data.Dataset api

Imagine that I want to train a model that minimizes the distance between an image and a query. On one side I have image features from a CNN; on the other I have mappings from words to embedding vectors (w2v, for example):
def raw_data_generator():
    for row in network_data:
        yield (row["cnn"], row["w2v_indices"])
dataset = tf.data.Dataset.from_generator(raw_data_generator, (tf.float32, tf.int32))
dataset = dataset.prefetch(1000)
Here I want to create batches, but I want a dense batch for the CNN features and a sparse batch for the w2v indices, because they obviously have variable length (and I want to use safe_embedding_lookup_sparse). There is the batch function for dense tensors and .apply(tf.contrib.data.dense_to_sparse_batch(..)) for sparse ones, but how can I use them together?
You could try creating two data sets (one for each feature), applying the appropriate batching to each and then zipping them together with tf.data.Dataset.zip.
@staticmethod
zip(datasets)
Creates a Dataset by zipping together the given datasets.
This method has similar semantics to the built-in zip() function in
Python, with the main difference being that the datasets argument can
be an arbitrary nested structure of Dataset objects. For example:
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
a = { 1, 2, 3 }
b = { 4, 5, 6 }
c = { (7, 8), (9, 10), (11, 12) }
d = { 13, 14 }
# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
Dataset.zip((a, b)) == { (1, 4), (2, 5), (3, 6) }
Dataset.zip((b, a)) == { (4, 1), (5, 2), (6, 3) }
# The `datasets` argument may contain an arbitrary number of
# datasets.
Dataset.zip((a, b, c)) == { (1, 4, (7, 8)),
                            (2, 5, (9, 10)),
                            (3, 6, (11, 12)) }
# The number of elements in the resulting dataset is the same as
# the size of the smallest dataset in `datasets`.
Dataset.zip((a, d)) == { (1, 13), (2, 14) }