How to determine the sizes of Wq, Wk and Wv in self-attention when batch sizes differ

I plan to use self-attention on PyG mini-batches, so the sizes of Wq, Wk and Wv would be [batch_num_nodes], [batch_num_nodes], [batch_num_nodes].
However, there is a problem: because the number of nodes in each mini-batch graph is different, the dimensions of Wq, Wk and Wv keep changing. How should I deal with this situation?
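For reference, a minimal sketch of self-attention over the node features of one mini-batch, assuming a node feature dimension d (the dimensions and variable names are illustrative): in standard self-attention the learned matrices are sized by the feature dimension, so only the activations change shape with the per-batch node count.

import torch
import torch.nn as nn

d, d_k = 64, 64                        # illustrative feature / projection dimensions
W_q = nn.Linear(d, d_k, bias=False)    # fixed-size weights, independent of node count
W_k = nn.Linear(d, d_k, bias=False)
W_v = nn.Linear(d, d_k, bias=False)

x = torch.randn(37, d)                 # stands in for batch.x of a mini-batch with 37 nodes
Q, K, V = W_q(x), W_k(x), W_v(x)       # each [num_nodes, d_k]
attn = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
out = attn @ V                         # [num_nodes, d_k]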

Related

Optimal 2D FFT sizes on NVIDIA GPUs

We are benchmarking 2D FFT performance on an NVIDIA A100 in order to determine which sizes perform best. The following shows how the runtime for each size is measured; GPU memory is cleared after each size is run.
from cupyx.scipy.fft import fft2
from cupyx.profiler import benchmark  # current replacement for the deprecated cupyx.time.repeat

def run_fft():
    fft2(array, axes=(-2, -1), overwrite_x=True)

timing = benchmark(run_fft, n_repeat=10, n_warmup=1)
Running it across a range of possible sizes results in the measurements below.
As you can see, there seems to be a set of sizes that are slower than the rest (the quasi-linear streaks above the main sequence). These slow sizes even include some whose factors are small primes (such as 2 and 3). I was wondering whether there is a general rule that defines which 2D FFT sizes run optimally (for the CPU case with fftw3, for example, such a rule is documented here).
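For reference, the cuFFT documentation states that transforms whose dimensions factor into powers of 2, 3, 5 and 7 are the most highly optimized, so one rough heuristic is to restrict benchmark (or padding) sizes to such "7-smooth" values; a minimal sketch, where is_7_smooth is an illustrative helper rather than a library function:

def is_7_smooth(n):
    # True if n has no prime factor larger than 7
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

fast_sizes = [n for n in range(256, 4097) if is_7_smooth(n)]   # candidate fast 2D FFT side lengths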

k-means clustering - inertia only gets larger

I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts, so a dimension of 32. The joints are scaled to a range between 0 and 1. My dataset consists of ~900,000 instances. As mentioned by faiss (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem, I randomly select 50,000 instances for training, as I want to check numbers of clusters k between 1 and 30.
Now to my "problem":
The inertia increases steadily as the number of clusters increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation; I tested it on a small example with 2D data and very clear clusters and it worked.
Is it that the data just clusters badly, or is there another problem/mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?
I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points, faiss sampled a sub-batch for the fit, and for a larger number of clusters this sub-batch is larger. See the faiss FAQ:
max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled
The larger sub-batch of course has a larger inertia, as there are more points in total.
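For completeness, a minimal sketch of the fix, assuming the pose data is a float32 array x of shape (n, 32); the exact value of max_points_per_centroid is illustrative:

import numpy as np
import faiss

d = 32                                               # 16 joints x 2 coordinates
x = np.random.rand(900_000, d).astype('float32')     # stand-in for the real pose data

inertia = {}
for k in range(2, 31):
    # raise max_points_per_centroid so faiss does not subsample the training set
    km = faiss.Kmeans(d, k, niter=20, max_points_per_centroid=1_000_000)
    km.train(x)
    inertia[k] = km.obj[-1]                          # final objective (sum of squared distances)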

Why is the explained variance ratio so low in a binary classification problem?

My input data $X$ has 100 samples and 1724 features. I want to classify my data into two classes, but the explained variance ratio values are very low, around 0.05 or 0.04, no matter how many principal components I keep when performing PCA. I have seen in the literature that the explained variance ratio is usually close to 1. I'm unable to understand my mistake. I tried performing PCA after reducing the number of features and increasing the number of samples, but it has no significant effect on my results.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X)
Xtrain = pca.transform(X)
explained_ratio = pca.explained_variance_ratio_
EX = explained_ratio
fig, ax1 = plt.subplots(ncols=1, nrows=1)
ax1.plot(np.arange(len(EX)), EX, marker='o', color='blue')
ax1.set_ylabel(r"$\lambda$")
ax1.set_xlabel("l")
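For comparison, the values reported as close to 1 in the literature are usually the cumulative explained variance rather than the ratio of each individual component; a minimal sketch, reusing the pca object fitted above:

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)   # the last entry is the total variance captured by the 10 components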

Does a deeper LSTM need more units?

I'm applying an LSTM to time series forecasting with 20 lags. Suppose that we have two cases: the first uses just five lags and the second (like my case) uses 20 lags. Is it correct that for the second case we need more units compared to the first? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation on increasing the number of units here.
It is very difficult to give an exact answer, as the relationship between timesteps and number of hidden units is not an exact science. For example, the following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively little memory (i.e. it requires remembering only a few time steps), you won't get much benefit from adding more neurons while increasing the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU) you might get different results (this is not always true, but it can happen for certain problems).
I'm sure there are other factors out there, but these are a few that came to mind.
Proving that more units give better results with more time steps (if true)
That should be relatively easy, as you can try a few different options:
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
If you get better performance (e.g. lower MSE) on the 20-lag problem than on the 5-lag problem when you use 50 units, then you have gotten your point across. You can reinforce your claims by showing results with different types of models (e.g. LSTMs vs. GRUs).
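A minimal sketch of that experiment, assuming Keras and a univariate series turned into (samples, lags, 1) windows; the series, epoch count and windowing helper are illustrative stand-ins:

import numpy as np
from tensorflow import keras

series = np.sin(np.linspace(0, 100, 2000)).astype("float32")   # stand-in for the real series

def make_windows(series, n_lags):
    # turn the series into (samples, n_lags, 1) inputs and next-step targets
    X = np.stack([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X[..., None], y

results = {}
for n_lags in (5, 20):
    X, y = make_windows(series, n_lags)
    for n_units in (10, 20, 50):
        model = keras.Sequential([
            keras.Input(shape=(n_lags, 1)),
            keras.layers.LSTM(n_units),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        hist = model.fit(X, y, epochs=20, validation_split=0.2, verbose=0)
        results[(n_lags, n_units)] = min(hist.history["val_loss"])

print(results)   # compare validation MSE across lag / unit combinations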

Why does COCO evaluate AP and AR by size? What is the meaning of AR max = 1, 10, 100?

I'm reading about the COCO metrics right now and I have two questions about them.
This is the table of COCO metrics.
I'm wondering why COCO evaluates AP and AR by object size. What effect does object size have?
They measure AR at max = 1, 10, and 100, and they say AR max=1 is "AR given 1 detection per image". Then, if the model detects multiple objects per image, how is AR calculated? I can't understand the meaning of 'max'.
Small objects are hard for deep-learning detectors to detect, so a detector with a higher AP_small is stronger at detecting small objects.
AR@1 is calculated on a per-class basis:
https://github.com/rafaelpadilla/review_object_detection_metrics/blob/e4acc7ecab9a1e34110bad59e0c1e3d84ac7a127/src/evaluators/coco_evaluator.py#L306
https://github.com/rafaelpadilla/review_object_detection_metrics/blob/e4acc7ecab9a1e34110bad59e0c1e3d84ac7a127/src/evaluators/coco_evaluator.py#L85
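To illustrate what the "max detections" cap means (a simplified sketch, not the pycocotools implementation): for AR@k only the k highest-scoring detections per image are kept before matching to the ground truth, and recall is then averaged over IoU thresholds 0.5:0.95 and over classes.

def cap_detections(detections, max_dets):
    # keep only the max_dets highest-scoring detections for one image
    return sorted(detections, key=lambda d: d["score"], reverse=True)[:max_dets]

image_detections = [                        # illustrative detections for a single image
    {"score": 0.9, "box": (10, 10, 50, 50)},
    {"score": 0.7, "box": (20, 15, 60, 55)},
    {"score": 0.3, "box": (100, 80, 140, 120)},
]
print(cap_detections(image_detections, 1))  # what AR@1 gets to work with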