I am new to TensorFlow, so this might be an easy question, but I am really stuck.
I am trying to implement this paper in Keras with the TensorFlow backend.
In the first stage of training, the author uses softmax_pair.
Suppose we get this output from the last fc layer
(rows are the batch dimension, whose size is None):
x11 x12 x13 x14...
x21 x22 x23 x24...
x31 x32 x33 x34...
...
and we take the exponential, so we have
e11 e12 e13 e14...
e21 e22 e23 e24...
e31 e32 e33 e34...
...
and then this is the step where I am stuck:
e11/(e11+e12) e12/(e11+e12) e13/(e13+e14) e14/(e13+e14)...
e21/(e21+e22) e22/(e21+e22) e23/(e23+e24) e24/(e23+e24)...
e31/(e31+e32) e32/(e31+e32) e33/(e33+e34) e34/(e33+e34)...
...
I don't know how to do the pairwise addition.
tf.transpose and tf.segment_sum might work,
but after some research I found that transpose is expensive.
Furthermore, after tf.segment_sum the tensor is only half the size,
and I don't know how to double it.
I am also not sure how to produce the segment_ids.
So how can I do this calculation?
Thanks!!
----------update
The part of the paper I am talking about is Fig. 3.
The fc outputs are P2c-1 and P2c, which represent the probability that class c appears or does not appear in the image,
c = 1, 2, 3, ..., number of classes.
Is transpose not expensive? I sometimes see this claimed, e.g. in this comment; perhaps I misunderstood it?
The TensorFlow docs for tf.transpose state that, unlike NumPy, TensorFlow returns a new tensor, which costs memory.
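Assuming X is your tensor of size R x C:
_, C = X.get_shape().as_list()
X_split = tf.split(X, C // 2, axis=1)  # list of C/2 sub-tensors, two columns each
Y_split = [tf.nn.softmax(pair) for pair in X_split]  # regular softmax on each pair
Y = tf.concat(Y_split, axis=1)  # join the pairs back into an R x C tensor
C is the number of columns, X_split is a list of sub-tensors with two columns each, Y_split holds the regular softmax of each sub-tensor, and Y joins the softmax results. The argument order shown is that of TensorFlow 1.0 and later; earlier releases took the axis first, as in tf.split(1, C/2, X) and tf.concat(1, Y_split).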
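The same pairwise softmax can also be written without a Python-level list by reshaping so that each pair gets its own trailing axis. The sketch below uses NumPy only to keep it self-contained; the function name pairwise_softmax is made up for illustration, and the same idea should carry over to tf.reshape followed by tf.nn.softmax on the last axis.

```python
import numpy as np

def pairwise_softmax(x):
    """Softmax over consecutive column pairs (0,1), (2,3), ... of a 2-D array."""
    rows, cols = x.shape
    assert cols % 2 == 0, "number of columns must be even"
    pairs = x.reshape(rows, cols // 2, 2)               # one pair per trailing axis
    pairs = pairs - pairs.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(pairs)
    out = e / e.sum(axis=-1, keepdims=True)             # e_a / (e_a + e_b) within each pair
    return out.reshape(rows, cols)

x = np.array([[1.0, 2.0, 3.0, 3.0],
              [0.0, 0.0, -1.0, 1.0]])
y = pairwise_softmax(x)
```

Each consecutive pair of columns in y sums to 1, which is the layout asked for in the question.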
I have a matrix A of dimension 1000x70000.
My loss function includes A, and I want to find the optimal value of A using gradient descent, with the constraint that the rows of A remain in the probability simplex (i.e. every row sums to 1). I have initialised A as given below:
A=np.random.dirichlet(np.ones(70000),1000)
A=torch.tensor(A,requires_grad=True)
and my training loop is given below:
for epoch in range(500):
    y_pred=forward(X)
    y=model(torch.mm(A.float(),X))
    l=loss(y,y_pred)
    l.backward()
    A.grad.data=-A.grad.data
    optimizer.step()
    optimizer.zero_grad()
    if epoch%2==0:
        print("Loss",l,"\n")
An easy way to accomplish that is not to use A directly for computation but use a row normalized version of A.
# you can keep 'A' unconstrained
A = torch.rand(1000, 70000, requires_grad=True)
Then divide each row by its sum (so that every row always sums to 1):
for epoch in range(500):
    y_pred = forward(X)
    B = A / A.sum(-1, keepdim=True) # normalize rows manually
    y = model(torch.mm(B, X))
    l = loss(y,y_pred)
    ...
So now, at each step, B is the constrained matrix, i.e. the quantity you are interested in. However, the optimization still runs on the (unconstrained) A.
Edit: @Umang Gupta reminded me in the comment section that the OP wanted a "probability simplex", which adds another constraint, i.e. A >= 0.
To accomplish that, you may simply apply an appropriate activation function (e.g. torch.exp, torch.sigmoid) to A in each iteration:
A_ = torch.exp(A)
B = A_ / A_.sum(-1, keepdim=True) # normalize rows
The exact choice of function depends on the training dynamics, which you will need to experiment with.
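As a quick check of the exp-then-normalize idea, here is a small NumPy sketch (toy sizes and random values chosen for illustration, not the OP's 1000x70000 matrix). Subtracting the row maximum before exponentiating is an added numerical-stability step that leaves the result unchanged; in PyTorch the same row-wise operation is what torch.softmax(A, dim=-1) computes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))                    # unconstrained parameters (toy size)

A_shift = A - A.max(axis=-1, keepdims=True)    # stability shift, does not change B
A_pos = np.exp(A_shift)                        # strictly positive entries
B = A_pos / A_pos.sum(axis=-1, keepdims=True)  # each row now sums to 1

row_sums = B.sum(axis=-1)
```

Every row of B is non-negative and sums to 1, so B lies on the probability simplex regardless of the values in A.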
I'm trying to write fast, optimized code based on matrices, and have recently discovered einsum as a tool for achieving significant speed-up.
Is it possible to use this to set the diagonals of a multidimensional array efficiently, or can it only return data?
In my problem, I'm trying to set the diagonals for an array of square matrices (shape: M x N x N) by summing the columns in each square (N x N) matrix.
My current (slow, loop-based) solution is:
import numpy as np
# Build dummy array
dimx = 2 # Dimension x (likely to be < 100)
dimy = 3 # Dimension y (likely to be between 2 and 10)
M = np.random.randint(low=1, high=9, size=[dimx, dimy, dimy])
# Blank the diagonals so we can see the intended effect
np.fill_diagonal(M[0], 0)
np.fill_diagonal(M[1], 0)
# Compute diagonals based on summing columns
diags = np.einsum('ijk->ik', M)
# Set the diagonal for each matrix
# THIS IS SLOW. CAN IT BE IMPROVED?
for i in range(len(M)):
    np.fill_diagonal(M[i], diags[i])
# Print result
M
Can this be improved at all, please? It seems np.fill_diagonal doesn't accept non-square matrices (hence my loop-based solution). Perhaps einsum can help here too?
One approach would be to reshape to 2D and set the columns at steps of ncols+1 with the diagonal values. Reshaping creates a view, which lets us access those diagonal positions directly. The implementation would be:
s0,s1,s2 = M.shape
M.reshape(s0,-1)[:,::s2+1] = diags
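A small self-contained check of the reshape trick against the loop from the question is sketched below. The array sizes and the random seed are arbitrary choices for illustration, and the trick relies on M being contiguous so that reshape returns a view rather than a copy.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.integers(low=1, high=9, size=(2, 3, 3))
for block in M:                      # blank the diagonals, as in the question
    np.fill_diagonal(block, 0)
diags = M.sum(axis=1)                # column sums of each square matrix

# Reference result using the original loop
expected = M.copy()
for i in range(len(expected)):
    np.fill_diagonal(expected[i], diags[i])

# Vectorized version: in the flattened N*N row, diagonal entries sit N+1 apart
s0, s1, s2 = M.shape
M.reshape(s0, -1)[:, ::s2 + 1] = diags
```

After the assignment, M matches the loop-based result.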
If you do np.source(np.fill_diagonal) you'll see that in the 2d case it uses a 'strided' approach:
if a.ndim == 2:
    step = a.shape[1] + 1
    end = a.shape[1] * a.shape[1]
a.flat[:end:step] = val
@Divakar's solution applies this to your 3d case by 'flattening' on 2 dimensions.
You could sum the columns with M.sum(axis=1), though I vaguely recall some timings that found einsum was actually a bit faster. sum is a little more conventional.
Someone has asked for the ability to expand dimensions in einsum, but I don't think that will happen.
I am looking at the image below.
Can someone explain how the scores are calculated?
I thought it was -1 for a No and +1 for a Yes, but then I can't figure out how the little girl gets 0.1. And that doesn't work for tree 2 either.
I agree with @user1808924. I think it's still worth explaining how XGBoost works under the hood, though.
What is the meaning of the leaves' scores?
First, the scores you see in the leaves are not probabilities. They are regression values.
In gradient boosted trees, there are only regression trees. To predict whether a person likes computer games or not, the model (XGBoost) treats it as a regression problem. The labels here become 1.0 for Yes and 0.0 for No. XGBoost then fits regression trees during training. The trees return values like +2, +0.1, -1, which are what we see at the leaves.
We sum up all the "raw scores" and then convert them to probabilities by applying the sigmoid function.
How are the leaf scores calculated?
The leaf score (w) is calculated by this formula:
w = - (sum(gi) / (sum(hi) + lambda))
where g and h are the first derivative (gradient) and the second derivative (hessian) of the loss with respect to the prediction, summed over the samples in the leaf.
For the sake of demonstration, let's pick the leaf with the value -1 in the first tree. Suppose our objective function is mean squared error (MSE) and we choose lambda = 0.
With MSE, we have g = (y_pred - y_true) and h = 1. I dropped the constant 2; you can keep it and the result stays the same. Another note: at the t-th iteration, y_pred is the prediction we have after the (t-1)-th iteration (the best we've got up to that point).
Some assumptions:
The girl, grandpa, and grandma do NOT like computer games (y_true = 0 for each person).
The initial prediction is 1 for all 3 people (i.e., we guess that everyone loves games). Note that I chose 1 on purpose to get the same result as the first tree. In fact, the initial prediction can be the mean (default for mean squared error), the median (default for mean absolute error), etc. of all the observations' labels in the leaf.
We calculate g and h for each individual:
g_girl = y_pred - y_true = 1 - 0 = 1. Similarly, we have g_grandpa = g_grandma = 1.
h_girl = h_grandpa = h_grandma = 1
Putting the g, h values into the formula above, we have:
w = -( (g_girl + g_grandpa + g_grandma) / (h_girl + h_grandpa + h_grandma) ) = -1
Last note: in practice, the leaf score we see when plotting the tree is a bit different. It is multiplied by the learning rate, i.e., w * learning_rate.
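The worked example above can be reproduced in a few lines of plain Python. This is only a sketch of the formula under the stated assumptions (MSE loss with the constant 2 dropped, so g = y_pred - y_true and h = 1); the function name leaf_weight and the learning rate value are made up for illustration and are not part of XGBoost's API.

```python
def leaf_weight(y_pred, y_true, reg_lambda=0.0):
    """w = -sum(g) / (sum(h) + lambda) for squared error (g = y_pred - y_true, h = 1)."""
    grads = [p - t for p, t in zip(y_pred, y_true)]
    hess = [1.0 for _ in y_pred]
    return -sum(grads) / (sum(hess) + reg_lambda)

# Girl, grandpa, grandma: initial prediction 1, true label 0
w = leaf_weight(y_pred=[1.0, 1.0, 1.0], y_true=[0.0, 0.0, 0.0])

learning_rate = 0.3            # hypothetical value, for illustration only
shown_in_plot = w * learning_rate
```

With lambda = 0 this gives w = -3 / 3 = -1, matching the leaf in the first tree; a positive lambda would shrink the weight toward 0.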
The values of leaf elements (aka "scores") - +2, +0.1, -1, +0.9 and -0.9 - were devised by the XGBoost algorithm during training. In this case, the XGBoost model was trained using a dataset where little boys (+2) appear somehow "greater" than little girls (+0.1). If you knew what the response variable was, then you could probably interpret/rationalize those contributions further. Otherwise, just accept those values as they are.
As for scoring samples, the first addend is produced by tree1 and the second addend by tree2. For little boys (age < 15, is male == Y, and use computer daily == Y), tree1 yields 2 and tree2 yields 0.9.
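To make the scoring step concrete, here is a minimal sketch that adds the two tree outputs for the little boy and, assuming a logistic objective as described in the other answer, maps the raw sum to a probability with the sigmoid. The variable names are illustrative only.

```python
import math

tree1_score = 2.0    # leaf reached by the little boy in tree1
tree2_score = 0.9    # leaf reached by the little boy in tree2

raw_score = tree1_score + tree2_score            # sum of the per-tree leaf values
probability = 1.0 / (1.0 + math.exp(-raw_score)) # sigmoid, only for a logistic objective
```

The raw score is 2.9, and the sigmoid maps it to a probability of roughly 0.95.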
Read this
https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
and then this
https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5
and the appendix
https://gabrieltseng.github.io/appendix/2018-02-25-XGB.html
I have the following model file from LIBSVM:
svm_type c_svc
kernel_type linear
nr_class 2
total_sv 3
rho 0.0666415
label 1 -1
nr_sv 2 1
SV
0.004439511653718091 1:4.5 2:0.5
0.07111595083031433 1:2 2:2
-0.07555546248403242 1:-0.5 2:-2.5
My question is how do I figure out the weight vector from this information?
The weights of the support vectors are the first numbers on each of the support vector lines (the last three). Despite using a linear kernel, libsvm is for general kernel SVMs, so it isn't storing a weight vector and bias explicitly.
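For a linear kernel, the weight vector can still be recovered from the model file by hand: it is the sum of each support vector scaled by its coefficient, and the bias is -rho. The sketch below applies this to the three support vectors in the question; the variable names are illustrative, and the sign convention assumes the first listed label (1) is the positive side of the decision function.

```python
import numpy as np

# Coefficients (first number on each SV line) and the support vectors themselves
coefs = np.array([0.004439511653718091,
                  0.07111595083031433,
                  -0.07555546248403242])
svs = np.array([[4.5, 0.5],
                [2.0, 2.0],
                [-0.5, -2.5]])
rho = 0.0666415

w = coefs @ svs          # w = sum_i coef_i * sv_i
b = -rho                 # decision value is w . x + b

decision_values = svs @ w + b   # the support vectors should sit on the margins
```

Here w comes out close to (0.2, 0.333), and the decision values at the three support vectors are approximately +1, +1 and -1, as expected for points on the margin.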
If you know you want a linear kernel, and you want that information, you can use liblinear (from the same folks as libsvm). Given this trivial data:
1 1:1 2:1
0 1:-1 2:-1
you can get this model, which has explicit weight and bias:
solver_type L2R_L2LOSS_SVC_DUAL
nr_class 2
label 1 0
nr_feature 2
bias -1
w
0.4327936
0.4327936
I want to express and solve the equations below in a constraint programming language.
I have time variables t and am trying to find the best multipliers k that minimize my objective function.
Time: t1, t2, t3... (given as input)
Multipliers: k1, k2, k3... (continuous variables that need to be found)
c1, c2, ..., cN are constants
Main equation: k1*sin(c1*x)+k2*sin(c2*x)+k3*sin(c3*x)+k4*cos(c1*x)...
The problem is to minimize the results of all the equations below with the best possible values of (k1, k2, k3...). It is also known that there is no exact solution to the problem. So,
when x is t1 --> P1-k1*sin(c1*t1)-k2*sin(c2*t1)-k3*sin(c3*t1)-k4*cos(c1*t1)...
when x is t2 --> P2-k1*sin(c1*t2)-k2*sin(c2*t2)-k3*sin(c3*t2)-k4*cos(c1*t2)...
when x is t3 --> P3-k1*sin(c1*t3)-k2*sin(c2*t3)-k3*sin(c3*t3)-k4*cos(c1*t3)...
P1 is the value associated with time t1. P(t) is not an analytic function; I just have sampled values for it, such as
t1 = 5, P1 = 0.7
t2 = 6, P2 = 0.3, etc.
Is it possible to solve this in MiniZinc or any other CP system?
I don't think that CP is particularly suited to solve this problem, as you don't really have constraints here. All you have are functions you want to minimize (f1,.., fi), and a few degrees of freedom to do so (k1,.., ki).
I feel like the problem is a pretty good candidate for the least squares method. Instead of trying to "fit" your functions f to a given value, you are trying to minimize them. So what you can do is try to fit f² to 0. (So we would be dealing with non-linear least squares in that case.)
Here is what it would look like written in Python (the ... parts are placeholders to fill in with your own terms):
import numpy as np
from scipy.optimize import curve_fit
xdata = np.array([t1, t2, t3, t4, ..., t10])
ydata = np.zeros(10) # this is your "target". 10 = number of ti
def func(x, k1,k2,...ki):
    return (P(x)-k1*np.sin(c1*x)-k2*np.sin(c2*x)-k3*np.sin(c3*x)-k4*np.cos(c1*x)...)**2 # The square is a trick to minimize the function
popt, pcov = curve_fit(func, xdata, ydata, p0=(1.0,1.0,...)) # p0 = initial guess for the ki
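Since the residuals are linear in the unknown multipliers k, one alternative worth considering is ordinary linear least squares, which needs no initial guess. The sketch below is a swapped-in technique rather than the curve_fit approach above: it uses np.linalg.lstsq with only the four terms written out in the question, and the constants c, the time grid, and the synthetic P values are all made-up assumptions for illustration.

```python
import numpy as np

def design_matrix(t, c):
    """Columns: sin(c1*t), sin(c2*t), sin(c3*t), cos(c1*t) (only the terms shown in the question)."""
    t = np.asarray(t, dtype=float)
    sin_cols = [np.sin(cj * t) for cj in c]
    cos_cols = [np.cos(c[0] * t)]
    return np.column_stack(sin_cols + cos_cols)

c = [1.0, 2.0, 3.0]                          # hypothetical constants
t = np.linspace(0.0, 10.0, 50)               # hypothetical time samples
k_true = np.array([0.5, -1.0, 2.0, 0.25])    # used only to generate synthetic P values
P = design_matrix(t, c) @ k_true

# Minimize sum_i (P_i - sum_j k_j * basis_j(t_i))^2 over k
k_fit, residuals, rank, sv = np.linalg.lstsq(design_matrix(t, c), P, rcond=None)
```

With real measurements, P would be the observed values at each t, and k_fit would be the multipliers that minimize the sum of squared residuals.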