Graph-tool vertex label with add_edges_from - pandas

I tried my best to look for the solution online, but cannot seem to find one for my graph_tool network. I have a pandas dataframe df that has three columns: id_source, id_target, weight. The id_source and id_target contain names in text form. I want to use them as vertex labels. However, when adding edges from a pandas dataframe, I must set hashed=True. The code then looks like this
from graph_tool.all import *
g = Graph()
eweight = g.new_ep('double')
vmap = g.add_edge_list(df.values, eprops=[eweight], hashed=True)
My objective is to draw this small network with vertex labels. I am stuck and can't figure out the solution on how to add that vertex property when I do not know the order by which each new node is introduced in the graph through the g.add_edge_list() function.
Should I add the vertices first? And then perform the add_edge_list function on the graph. Will the vertex names (or labels) be recognized by the second step?

From the documentation for Graph.add_edge_list:
Add a list of edges to the graph, given by edge_list, which can be an iterator of (source, target) pairs where both source and target are vertex indexes, or a numpy.ndarray of shape (E,2), where E is the number of edges, and each line specifies a (source, target) pair. If the list references vertices which do not exist in the graph, they will be created.
You have passed df.values for edge_list, which has three columns (ie a shape of (E, 3)), but two columns are expected (a shape of (E, 2)).
Try this:
g.add_edge_list(df[['id_source', 'id_target']].to_numpy(), eprops=[eweight], hashed=True)
I suspect that eprops should be df['weight'], but I am not sure.

The answer is in the vmap. It's there all along. When you index the vmap, it will get the label names.
vmap[0] would be the label for the first vertex recorded.
vmap[1] for the second, etc.

Related

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset
def parse_data(input_data):
frame = open_dataset.Frame()
frame.ParseFromString(bytearray(input_data.numpy()))
av_speed = (frame.images[0].velocity.v_x, frame.images[0].velocity.v_y, frame.images[0].velocity.v_z)
return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.
I am not sure but it might be unnecessary to make your mapped function a tf.py_function. How parse_data is supposed to look like depends on your dataset dataset_ext. If it has for example two file paths (1 instace of input data and 1 instance of output data), the mapping function should have 2 arguments and should return 2 arguments.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
merged = tf.stack([img_in, img_out])
mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with you the two frames. So the map function is applied to each entry of your dataset separatly. If you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest you to use a tensorflow Sequence: The explanation from the tensorflow team is pretty straigth forward. Hope this help!

Drawing a data relationship diagram using networkx and matplotlib

I'm trying to draw a data relationship diagram. I've modeled my input data in triples (subject, predicate, object) e.g. (app, 'consumes', entity), (app, 'masters', entity), etc
Each triple is an edge and 2 nodes. I want to color different sets of nodes in different colors as well.
I'm struggling with setting the color attribute as well as saving the graph to a png file in a size that is readable
Here's the code :
G = nx.DiGraph ()
read input data from file and process is lists of nodes and edges
......
add nodes - set diff color for each set of nodes ??
G.add_nodes_from (list(entities), node_color='yellow')
G.add_nodes_from (list(sornodes))
G.add_nodes_from (list (consumernodes))
add edges - set diff color for each set of edges (how do I do this?)
G.add_edges_from (masters)
G.add_edges_from (consumers)
G.add_edges_from (ads)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=1700)
nx.draw_networkx_edges (G, pos, arrow=True)
nx.draw_networkx_labels (G, pos)
nx.draw_spring (G)
plt.figure(figsize=(7.195, 3.841))
fig1 = plt.gcf()
fig1.savefig ('out.png', dpi=1000)
plt.show ()
There is no image in the file. plt.show pops up the graph in a new window and another empty window is opened as well. I'm running this from a bash shell. Closing both the image windows terminates the program.
I need to be able to show sets of nodes in different colors.
I need to be able to show sets of edges in different colors
I want to be able to size the graph to a large image - does not need to fit within a monitor size
Thoughts anyone ?
draw_networkx_nodes (...) does not seem like to multiple calls with different sets of nodes and different colors for each set. It uses the last color specified for all the nodes in the graph.
The solution is to call draw_networkx_nodes once and pass a list with colors for each node and list len same as number of nodes.
# s, c and a are 3 lists of nodes
for x in s:
nodes.append(x)
nodecolor.append('r')
for x in c:
nodes.append(x)
nodecolor.append('g')
for x in a:
nodes.append(x)
nodecolor.append('y')
G.add_nodes_from(nodes)
nx.draw_networkx_nodes(G, pos, node_color=nodecolor, node_size=1700)
I'm sure I could optimize the code for creating the lists.

Interpolating data onto a line of points

I have some irregularly spaced data and need to analyze it. I can successfully interpolate this data onto a regular grid using mlab.griddata (or rather, the natgrid implementation of it). This allows me to use pcolormesh and contour to generate plots, extract levels, etc. Using plot.contour, I then extract a certain level using get_paths from the contour CS.collections().
Now, what I'd like to do is then, with my original irregularly spaced data, interpolate some quantities onto this specific contour line (i.e., NOT onto a regular grid). The similarly named griddata function from Scipy allows for this behavior, and it almost works. However, I find that as I increase the number of original points, I can get odd erratic behavior in the interpolation. I'm wondering if there's a way around this, i.e., another way to interpolate irregularly spaced (or regularly spaced data for that matter, since I can use my regularly spaced data from mlab.griddata) onto a specific line.
Let me show some numerical examples of what I'm talking about. Take a look at this figure:
The top left shows my data as points, and the line shows an extracted level of level=0 from some data D that I have at those points (x,y) [note, I have data 'D', 'Energy', and 'Pressure', all defined in this (x,y) space]. Once I have this curve, I can plot the interpolated quantities of D, Energy, and Pressure onto my specific line. First, note the plot of D (middle, right). It should be zero at all points, but it's not quite zero at all points. The likely cause of this is that the line that corresponds to the 0 level is generated from a uniform set of points that came from mlab.griddata, whereas the plot of 'D' is generated from my ORIGINAL data interpolated onto that level curve. You can also see some unphysical wiggles in 'Energy' and 'Pressure'.
Okay, seems easy enough, right? Maybe I should just get more original data points along my level=0 curve. Getting some more of these points, I then generate the following plots:
First look at the top left. You can see that I've sampled the hell out of the (x,y) space in the vicinity of my level=0 curve. Furthermore, you can see that my new "D" plot (middle, right) now correctly interpolates to zero in the region that it originally didn't. But now I get some wiggles at the start of the curve, as well as getting some other wiggles in the 'Energy' and 'Pressure' in this space! It is far from obvious to me that this should occur, since my original data points are still there and I've only supplemented additional points. Furthermore, some regions where my interpolation is going bad aren't even near the points that I added in the second run -- they are exclusively neighbored by my original points.
So this brings me to my original question. I'm worried that the interpolation that produces the 'Energy', 'D', and 'Pressure' curves is not working correctly (this is scigrid's griddata). Mlab's griddata only interpolates to a regular grid, whereas I want to interpolate to this specific line shown in the top left plot. What's another way for me to do this?
Thanks for your time!
After posting this, I decided to try scipy.interpolate.SmoothBivariateSpline, which produced the following result:
You can now see that my line is smoothed, so it seems like this will work. I'll mark this as the answer unless someone posts something soon that hints that there may be an even better solution.
Edit: As requested, below is some of the code used to generate these plots. I don't have a minimally working example, and the above plots were generated in a larger framework of code, but I'll write the important parts schematically below with comments.
# x,y,z are lists of data where the first point is x[0],y[0],z[0], and so on
minx=min(x)
maxx=max(x)
miny=min(y)
maxy=max(y)
# convert to numpy arrays
x=np.array(x)
y=np.array(y)
z=np.array(z)
# here we are creating a fine grid to interpolate the data onto
xi=np.linspace(minx,maxx,100)
yi=np.linspace(miny,maxy,100)
# here we interpolate our data from the original x,y,z unstructured grid to the new
# fine, regular grid in xi,yi, returning the values zi
zi=griddata(x,y,z,xi,yi)
# now let's do some plotting
plt.figure()
# returns the CS contour object, from which we'll be able to get the path for the
# level=0 curve
CS=plt.contour(x,y,z,levels=[0])
# can plot the original data if we want
plt.scatter(x,y,alpha=0.5,marker='x')
# now let's get the level=0 curve
for c in CS.collections:
data=c.get_paths()[0].vertices
# lineX,lineY are simply the x,y coordinates for our level=0 curve, expressed as arrays
lineX=data[:,0]
lineY=data[:,1]
# so it's easy to plot this too
plt.plot(lineX,lineY)
# now what to do if we want to interpolate some other data we have, say z2
# (also at our original x,y positions), onto
# this level=0 curve?
# well, first I tried using scipy.interpolate.griddata == scigrid like so
origdata=np.transpose(np.vstack((x,y))) # just organizing this data like the
# scigrid routine expects
lineZ2=scigrid(origdata,z2,data,method='linear')
# plotting the above curve (as plt.plot(lineZ2)) gave me really bad results, so
# trying a spline approach
Z2spline=SmoothBivariateSpline(x,y,z2)
# the above creates a spline object on our original data. notice we haven't EVALUATED
# it anywhere yet (we'll want to evaluate it on our level curve)
Z2Line=[]
# here we evaluate the spline along all our points on the level curve, and store the
# result as a new list
for i in range(0,len(lineX)):
Z2Line.append(Z2spline(lineX[i],lineY[i])[0][0]) # the [0][0] is just to get the
# value, which is enclosed in
# some array structure for some
# reason otherwise
# you can then easily plot this
plt.plot(Z2Line)
Hope this helps someone!

ValueError: setting an array element with a sequence at fit(X, y) in k-nearest neighbor

i have an error at this line:neigh.fit(X, y) :
ValueError: setting an array element with a sequence.
I checked fit function and X is: {array-like, sparse matrix, BallTree, cKDTree}
My X is a list of list with first element solidity number and second elemnt humoment list (7 cells).
If i change and i take only first humoment number for having a pure list of list
give this error: query data dimension must match BallTree data dimension.
My code:
listafeaturevector = list()
path = 'imgknn/'
for infile in glob.glob( os.path.join(path, '*.jpg') ):
print("current file is: " + infile )
gray = cv2.imread(infile,0)
element = cv2.getStructuringElement(cv2.MORPH_CROSS,(6,6))
graydilate = cv2.erode(gray, element)
ret,thresh = cv2.threshold(graydilate,127,255,cv2.THRESH_BINARY_INV)
imgbnbin = thresh
#CONTOURS
contours, hierarchy = cv2.findContours(imgbnbin, cv2.RETR_TREE ,cv2.CHAIN_APPROX_SIMPLE)
print(len(contours))
for i in range (0, len(contours)):
fv = list() #1 feature vector
#HUMOMENTS
#print("humoments")
mom = cv2.moments(contours[i], 1)
Humoments = cv2.HuMoments(mom)
#print(Humoments)
fv.append(Humoments) #query data dimension must match BallTree data dimension
#SOLIDITY
area = cv2.contourArea(contours[i])
hull = cv2.convexHull(contours[i]) #ha tanti valori
hull_area = cv2.contourArea(hull)
solidity = float(area)/hull_area
fv.append(solidity)
#fv.append(elongation)
listafeaturevector.append(fv)
print("i have done")
print(len(listafeaturevector))
lenmatrice=len(listafeaturevector)
#KNN
X = listafeaturevector
y = [0,1,2,3]* (lenmatrice/4)
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y) #ValueError: setting an array element with a sequence.
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
If i try to covert it in a numpy array:
listafv = np.dstack(listafeaturevector)
listafv=np.rollaxis(listafv,-1)
print(listafv.shape)
data = listafv.reshape((lenmatrice, -1))
print(data.shape)
#KNN
X = data
i got: setting an array element with a sequence
A couple of suggestions/questions:
Humoments = cv2.HuMoments(mom)
What is the class of the return value Humoments? a float or a list? If float, that is fine.
for each image file
for i in range (0, len(contours)):
fv = list() #1 feature vector
...
fv.append(Humoments)
...
fv.append(solidity)
listafeaturevector.append(fv)
The above code does not seem correct. In your problem, I think you need to a construct a feature vector for each image. So anything that is related to image i should go to the same feature vector x_i. Then you combine all feature vectors to get a list of feature vectors X. However, your listafeaturevector (or X) presents in the inner-most loop, it's obviously not correct.
Second, you have a loop against the number of elements in the contours, are you sure the number of elements stays the same for each image? Otherwise, the number of features (|x_i|) is totally different across different images, that might cause the error of
setting an array element with a sequence.
Third, are you clear about how you want to classify the images? what are the target values/labels of different images? I see you just setting labels with [0,1,2,3]* (lenmatrice/4). Can you elaborate on what you are trying to do with those images? Are they containing different type of object? Are they showing different patterns? Are those images describe different topic/color? If yes, for each different type, you give a different label - either 0,1,2 or 'red','white','black' (assume you have only 3 types). The values of the label do not matter. What matters is how many values they have. I am trying to understand the difference of labels in your case.
On the other hand, if you only want to retrieve similar images, you don't need to use a classifier or specify a label for each image. Instead, try to use NearestNeighbors.
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
Fourth, the above two lines of test are not correct. You need to set an X-like object in order to get a prediction from the classifier. That is to say, you need a feature vector x with the identical structure as you constructed in your training examples (with all h,e,s in the same order).

Put pcolormesh and contour onto same grid?

I'm trying to display 2D data with axis labels using both contour and pcolormesh. As has been noted on the matplotlib user list, these functions obey different conventions: pcolormesh expects the x and y values to specify the corners of the individual pixels, while contour expects the centers of the pixels.
What is the best way to make these behave consistently?
One option I've considered is to make a "centers-to-edges" function, assuming evenly spaced data:
def centers_to_edges(arr):
dx = arr[1]-arr[0]
newarr = np.linspace(arr.min()-dx/2,arr.max()+dx/2,arr.size+1)
return newarr
Another option is to use imshow with the extent keyword set.
The first approach doesn't play nicely with 2D axes (e.g., as created by meshgrid or indices) and the second discards the axis numbers entirely
Your data is a regular mesh? If it doesn't, you can use griddata() to obtain it. I think that if your data is too big, a sub-sampling or regularization always is possible. If the data is too big, maybe your output image always will be small compared with it and you can exploit this.
If you use imshow() with "extent" and "interpolation='nearest'", you will see that the data is cell-centered, and extent provided the lower edges of cells (corners). On the other hand, contour assumes that the data is cell-centered, and X,Y must be the center of cells. So, you need to be care about the input domain for contour. The trivial example is:
x = np.arange(-10,10,1)
X,Y = np.meshgrid(x,x)
P = X**2+Y**2
imshow(P,extent=[-10,10,-10,10],interpolation='nearest',origin='lower')
contour(X+0.5,Y+0.5,P,20,colors='k')
My tests told me that pcolormesh() is a very slow routine, and I always try to avoid it. griddata and imshow() always is a good choose for me.