So the idea is I want to be able to predict properties of an entire graph based on its adjacency matrix and features for each node.
Right now I treat my training data as one giant sparse N x k matrix, where each of the N nodes (N nodes in total across every graph in the training data) has k features.
I can perform the graph convolutions just fine, and I end up with an N x l matrix where there are l features for each node.
The challenging part is the graph readout, i.e. outputting a single fixed-size vector per graph rather than per node, since my training labels are per graph. I need to split my N x l tensor into c tensors of shape n_i x l, where c is the number of graphs and graph i has n_i nodes.
I attempted to use tf.dynamic_partition to accomplish this (since it's easy to know which nodes correspond to which graph), but it requires a static number of partitions (in order to be differentiable?), and that number obviously depends on how many graphs I want to train/validate/test on, so it can't be static.
I'm kind of stuck on how to structure this now. It's not a typical use of TensorFlow as far as I can tell, because I perform computation on each node and then need to sum up the values belonging to each particular graph, so that the final outputs are per graph rather than per node.
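To make the shape bookkeeping concrete, here is a rough sketch of the per-graph sum I am after, assuming I build an integer vector graph_ids of length N mapping each node to the index of its graph (the choice of op here is only a guess on my part, not something I have verified in my pipeline):

```python
import tensorflow as tf

# node_features: [N, l] output of the graph convolutions
# graph_ids:     [N] int vector, graph_ids[i] = index of the graph that node i belongs to
# num_graphs:    scalar, c = number of graphs in the batch (may be dynamic)
def readout_sum(node_features, graph_ids, num_graphs):
    # Sums the rows of node_features that share a graph index,
    # producing one l-dimensional vector per graph -> shape [c, l].
    return tf.math.unsorted_segment_sum(node_features, graph_ids, num_graphs)

# Toy example: 5 nodes with 3 features each, spread over 2 graphs
x = tf.random.normal([5, 3])
ids = tf.constant([0, 0, 0, 1, 1])
print(readout_sum(x, ids, tf.reduce_max(ids) + 1).shape)  # (2, 3)
```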
Any advice would be greatly appreciated
I have sensor measurements for 10 different people performing the same experiment, in which they need to complete a specific task. For each timestep in the measurements I have the corresponding label, and my goal is to train a sequential classifier which predicts the action a person is performing given the sensor observations. So, basically, for each person I have a separate dataset containing timesteps, several sensor measurements and the corresponding action (activity) for each timestep. I want to perform leave-one-out cross validation, which would mean taking the sequences of measurements and action labels for 9 people for the training part and 1 sequence for the test part. However, I don't know how to train my model on the 9 different independent measurement sequences (they also have different lengths).
My idea is to first apply masking/padding to make the sequences of equal length L, then concatenate the padded sequences, and train with a batch size of n, where L is divisible by n without remainder. I am not sure, though, if this is the right way to go. Maybe Keras already supports training sequential models on independent sequences?
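To illustrate what I mean, here is a rough sketch of the padding/masking idea in Keras; the network size, the pad value of 0.0 and the toy data are only placeholders, and each padded sequence is treated as one sample:

```python
import numpy as np
import tensorflow as tf

num_sensors, num_actions, L = 6, 4, 500          # placeholder numbers

def pad_to_length(arrays, length, pad_value=0.0):
    """Pad each (T_i, ...) array with pad_value along the time axis up to the given length."""
    out = np.full((len(arrays), length) + arrays[0].shape[1:], pad_value, dtype=np.float32)
    for i, a in enumerate(arrays):
        out[i, : len(a)] = a
    return out

# Toy stand-ins for the independent, variable-length training sequences
rng = np.random.default_rng(0)
lengths = [300, 450, 500]
sequences = [rng.normal(size=(T, num_sensors)) for T in lengths]
labels = [rng.integers(0, num_actions, size=(T, 1)) for T in lengths]

X = pad_to_length(sequences, L)   # (people, L, num_sensors)
y = pad_to_length(labels, L)      # (people, L, 1); padded timesteps get masked out below

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(L, num_sensors)),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(num_actions, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, batch_size=3, epochs=2)
```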
I would be happy to hear your recommendations. Thank you!
I'm very new to Keras and to neural networks in general, and I was wondering: if I have a list of points (x, y) that came from a quadratic function of the form ax^2 + bx + c, is it possible to feed the points into a neural network and get the coefficients a, b and c as an output from the network?
I know that I can simply use polynomial regression to achieve my goal; that is not the point.
If you are asking how to do polynomial regression using neural networks, here's the recipe.
Your dataset consists of points (x, y). Design your network as a fully connected (dense) network with 1 input layer and 1 output layer. The input layer consists of 2 nodes, the output layer of 1 node. Then feed your network the inputs x^2 and x. The output will be computed as:
y = w * X + c
where w is a matrix of learnable parameters. Specifically, it has shape 1x2 since it contains parameters a and b. c is a bias. The input matrix X has shape 2xN, where N is the number of points in your dataset and for each point, the first component is x^2 and the second component is x.
As loss function, use the standard Mean Squared Error loss. As for the optimizer, a simple Stochastic Gradient Descent should work just fine. At convergence, w and c will be good enough to approximate the true quadratic function.
I don't know Keras, but I don't think it will be tough to figure out for yourself how to implement this simple network.
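For what it's worth, a minimal sketch of the recipe in Keras might look something like the following (the toy coefficients, layer settings and optimizer choice are only illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy data from y = 2x^2 - 3x + 1 (coefficients chosen only for illustration)
x = np.linspace(-3, 3, 200).astype("float32")
y = 2 * x**2 - 3 * x + 1

# Each input sample is [x^2, x], matching the 2-node input layer described above
X = np.stack([x**2, x], axis=1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(2,))   # kernel holds a and b, bias holds c
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(X, y, epochs=500, verbose=0)

kernel, bias = model.layers[0].get_weights()
print("a ~", kernel[0, 0], " b ~", kernel[1, 0], " c ~", bias[0])
```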
To calculate the derivative of an output layer of size N w.r.t an input of size M, we need a Jacobian matrix of size M x N. To calculate a complete gradient from loss to inputs using the chain rule, we would need a large number of such Jacobians stored in memory.
I assume that TensorFlow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?
Thanks
TensorFlow uses Automatic Differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the operations on the edges.
More specifically, TensorFlow uses reverse-mode differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. We need to repeat the backward pass once for each output dimension of f, so the overall cost is O(dim(f)) times the cost of evaluating f, where dim(f) is the output dimensionality of f.
Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
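As a concrete illustration of what this looks like from the user's side (just the public tf.GradientTape API, not a claim about the internals):

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    # Forward pass: intermediate node values are recorded on the tape
    f = tf.reduce_sum(tf.square(x))   # scalar function, so dim(f) = 1

# A single backward pass yields df/dx for every input at once
print(tape.gradient(f, x).numpy())    # [2. 4. 6.]
```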
You might find this resource useful.
I am starting to learn TensorFlow and I have a seemingly simple modeling question. Suppose I have a C-class problem and data arrives into TensorFlow in mini-batches containing B samples each. Each sample x is a D-dimensional vector that comes with its label y (non-negative integer between 0 and C-1). I want to estimate a class-specific parameter (for example the sample mean) for each class. The estimation takes place after each sample independently undergoes a TensorFlow-defined transformation pipeline. The per-class parameter/sample-mean is then utilized in the computation of other tensors.
Intuitively, I would group the samples in each mini-batch by label, sum-combine them, and add the total of each label group to the corresponding class parameter, with appropriate normalization.
How can I implement such a simple procedure (group by label, perform a per-group operation, then use the labels as indices for writing into a tensor) or an equivalent one, using TensorFlow? What TensorFlow operations do I need to learn about to achieve this? Is it advisable to do it outside TensorFlow?
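To make the intended computation concrete, here is what I would write outside TensorFlow in plain NumPy (the names and shapes are only illustrative); my question is how to express the same thing with TensorFlow ops inside the pipeline:

```python
import numpy as np

def per_class_means(batch, labels, num_classes):
    """Group the B transformed samples of a mini-batch by label,
    sum-combine each group, and normalize by the group size."""
    sums = np.zeros((num_classes, batch.shape[1]))
    counts = np.zeros(num_classes)
    np.add.at(sums, labels, batch)      # scatter-add each sample into its class's row
    np.add.at(counts, labels, 1.0)
    return sums / np.maximum(counts, 1.0)[:, None]   # avoid dividing empty classes by zero

# B=4 samples, D=2 features, C=3 classes (toy numbers)
x = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
y = np.array([0, 0, 2, 2])
print(per_class_means(x, y, 3))   # class 1 stays zero because it is empty in this batch
```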
I was following this blog http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ (also attaching the matrix here) for rating prediction using matrix factorization. Initially we have a sparse user-movie matrix R.
We then apply the MF algorithm to create a new matrix R', which is the product of the two matrices P (UxK) and Q (DxK). We then "minimize" the error between the values given in R and R'. So far so good. But in the final step, when the matrix is filled up, I am not so convinced that these are the values that the users will actually give. Here is the final matrix:
What is the justification that these are in fact the "predicted" ratings? Also, I am planning to use the P matrix (UxK) as the users' latent features. Can we somehow "justify" that these are in fact the users' latent features?
The justification for using the obtained vectors for each user as latent trait vectors is that these values of the latent traits minimize the error between the predicted ratings and the actual known ratings.
If you take a look at the predicted ratings and the known ratings in the two diagrams that you posted, you can see that the difference between the two matrices in the cells that are common to both is very small. Example: U1D4 is 1 in the first diagram and 0.98 in the second.
Since the features, or user latent trait vectors, produce good results on the known ratings, we expect them to do a good job of predicting the unknown ratings. Of course, we use regularisation to avoid overfitting the training data, but that is the general idea.
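To make that concrete, here is a minimal NumPy sketch in the spirit of the tutorial you linked; the toy matrix, K, learning rate and regularisation strength are only illustrative. The latent vectors are fitted against the known cells only, and the full product P Q^T then supplies the "predicted" ratings for the empty cells:

```python
import numpy as np

def factorize(R, observed, K=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Fit P (UxK) and Q (DxK) so that P @ Q.T matches R on the observed cells only."""
    rng = np.random.default_rng(seed)
    U, D = R.shape
    P = rng.normal(scale=0.1, size=(U, K))
    Q = rng.normal(scale=0.1, size=(D, K))
    for _ in range(steps):
        E = (R - P @ Q.T) * observed      # error is computed on known ratings only
        P += lr * (E @ Q - reg * P)       # gradient steps with L2 regularisation
        Q += lr * (E.T @ P - reg * Q)
    return P, Q

# Toy user-movie matrix where 0 means "unknown rating"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
P, Q = factorize(R, observed=(R > 0))
print(np.round(P @ Q.T, 2))   # filled-in matrix: known cells stay close to R, empty cells are the predictions
```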
To evaluate how good your latent feature vectors are, you should split your data into training, validation and test sets.
The training set consists of the observed ratings that you use to learn your latent features. The validation set is used during learning to tune your model hyperparameters, and your test set is used to evaluate the learnt latent features once learning is finished. You can simply set aside a percentage of the observed samples for validation and test. If your ratings are time stamped, a natural way to select them is to use the most recent samples for validation and test.
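For instance, assuming the ratings are stored as (user, item, rating, timestamp) records (the field names and the 80/10/10 proportions below are only illustrative), a time-based split could look like this:

```python
import numpy as np

# Toy (user, item, rating, timestamp) records; the values are illustrative
ratings = np.array([(0, 3, 1.0, 100), (1, 0, 4.0, 250), (2, 3, 5.0, 300),
                    (3, 0, 1.0, 400), (4, 2, 5.0, 500), (0, 1, 3.0, 620),
                    (1, 3, 1.0, 700), (4, 1, 1.0, 810), (2, 0, 1.0, 900),
                    (3, 3, 4.0, 950)],
                   dtype=[("u", int), ("i", int), ("r", float), ("t", int)])

ratings = np.sort(ratings, order="t")           # oldest ratings first
n = len(ratings)
train = ratings[: int(0.8 * n)]                 # oldest 80%: learn the latent features
val   = ratings[int(0.8 * n): int(0.9 * n)]     # next 10%: tune K, regularisation, ...
test  = ratings[int(0.9 * n):]                  # most recent 10%: final evaluation
print(len(train), len(val), len(test))          # 8 1 1
```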
More details on splitting your data are here: https://link.medium.com/mPpwhdhjknb