How to copy a SPIR-V Private storage-qualified matrix array into a Function storage-qualified matrix array? - vulkan

I am fighting with a driver bug, and long story short, I have created OpVariables with the proper data in invocation-level global memory, i.e. a Private storage-qualified OpVariable of type float4x3[6].
Now I need this data converted to the Function storage qualifier, as OpVariables in OpFunction scope. But I am somewhat lost about which copying operations apply when, especially with matrices and arrays, and I have both at once. Do I just OpLoad and OpStore? Or do I need to OpLoad, OpCompositeExtract each matrix by index, OpCompositeConstruct from them, and only afterwards OpStore? The SPIR-V specification is rather dense on the subject, and I cannot find one place where the copying operations are described; the rules seem to be scattered all over the spec.

After some experimentation, it seems that all that is required to copy a variable from Private to Function storage is a simple OpLoad of the Private variable followed by an OpStore to the Function variable; the whole composite, array of matrices included, is copied in one load/store pair.
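For reference, a minimal sketch of that copy in SPIR-V assembly. The ids (%mat_t, %arr_t, %ptr_prv, %ptr_fn, %priv, %fn) are illustrative assumptions, not taken from a real module, and the exact matrix type depends on how the front end lowered float4x3:

; declared once, outside any function:
;   %arr_t   = OpTypeArray %mat_t %uint_6        ; float4x3[6]
;   %ptr_prv = OpTypePointer Private %arr_t
;   %ptr_fn  = OpTypePointer Function %arr_t
;   %priv    = OpVariable %ptr_prv Private
; inside the function (OpVariables go at the top of the first block):
%fn  = OpVariable %ptr_fn Function
%tmp = OpLoad %arr_t %priv      ; loads the whole array-of-matrices composite
OpStore %fn %tmp                ; stores it in a single operation

OpCopyMemory %fn %priv is a one-instruction alternative, since both operands are pointers to the same type (only their storage classes differ).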

Related

Does Pandas DataFrame constructor (and helper construction functions) release the GIL while copying the source array?

Preamble
On another question I learned that the pandas constructor routines avoid copying the provided numpy array only if the array has a single data type for all entries. When the constructor is fed a structured numpy array with different types in its columns, it makes a copy.
Implementation reference
import pandas

df_dict = {}
for i in range(5):
    obj = Object(1000000)  # Object is my own class, described below
    arr = obj.getNpArr()
    print(arr[:10])
    df_dict[i] = pandas.DataFrame.from_records(arr)

print("The DataFrames are :")
for i in range(5):
    print(df_dict[i].head(10))
In this case Object(N) constructs an instance of Object, which internally allocates and initializes a 2D array of shape (N,3) with dtypes 'f8','i4','i4' in each row. Object manages the lifetime of this data, deallocating it on destruction. The function Object.getNpArr() returns a np.recarray with the above-mentioned dtype, pointing to the internal data. Importantly, the returned array does not own the data; it is just a view.
Problem
The DataFrames printed at the end show corrupted data (with respect to the arrays printed inside the first loop). I am not expecting this behaviour, since the array fed to the pandas construction function is copied (I checked this behaviour separately).
I do not have many ideas about the cause or how to avoid the data corruption. The only guess I can make is:

- the constructor starts allocating the memory for its own data, which takes long because of the large size, and then copies;
- before/during the allocation/copy, the GIL is released and taken back by the for loop;
- the for loop proceeds to the next iteration before the copy of the array is completed;
- at the next iteration, the obj name is rebound to the new Object and the old memory is deallocated, which corrupts the still-running copy into the previous iteration's DataFrame.

If this is really the cause of the issue, how can I find a workaround? Is there a way to release the GIL only once the copy of the array is actually done?
Or, if my guess is wrong, what is the cause of the data corruption?
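If the guess above is right, one hedged workaround is to force an owning copy that is guaranteed to be finished before the next iteration rebinds obj. A sketch, reusing the asker's Object/getNpArr names (which are defined elsewhere, not in any library):

import numpy
import pandas

df_dict = {}
for i in range(5):
    obj = Object(1000000)                          # asker's own class
    arr = numpy.array(obj.getNpArr(), copy=True)   # returns only after the copy completes
    df_dict[i] = pandas.DataFrame.from_records(arr)

Because numpy.array(..., copy=True) does not return until the copy is done, the view never outlives the Object that backs it.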

Fortran arrays argument [duplicate]

I'm going through a Fortran code, and one bit has me a little puzzled.
There is a subroutine, say
SUBROUTINE SSUB(X,...)
REAL*8 X(0:N1,1:N2,0:N3-1),...
...
RETURN
END
which is called in another subroutine by:
CALL SSUB(W(0,1,0,1),...)
where W is a 'working array'. It appears that a specific value from W is passed to X; however, X is dimensioned as an array. What's going on?
This is a not-uncommon idiom for getting the subroutine to work on a (rectangular in N dimensions) subset of the original array.
All parameters in Fortran (at least before Fortran 90) are passed by reference, so the actual array argument is resolved as a location in memory. Choose a location inside the space allocated for the whole array, and the subroutine manipulates only part of the array.
Biggest issue: you have to be aware of how the array is laid out in memory and how Fortran's array indexing scheme works. Fortran uses column-major array ordering, which is the opposite convention from C. Consider an array that is 5x5 in size (and index both dimensions from 0 to make the comparison with C easier). In both languages (0,0) is the first element in memory. In C the next element in memory is [0][1], but in Fortran it is (1,0). This affects which indexes you drop when choosing a subspace: if the original array is A(i,j,k,l) and the subroutine works on a three-dimensional subspace (as in your example), in C it works on Aprime[i=constant][j][k][l], but in Fortran it works on Aprime(i,j,k,l=constant).
The other risk is wrap-around. The dimensions of the (sub)array in the subroutine have to match those in the calling routine, or strange, strange things will happen (think about it). So if A is declared of size (0:4,0:5,0:6,0:7) and we call with element A(0,1,0,1), the receiving routine is free to start the index of each dimension wherever it likes, but must make the sizes (5,6,7) or else; but that means that the last element in the j direction actually wraps around into the next k slab! The thing to do about this is to not use the last element. Making sure that this happens is the programmer's job, and is a pain in the butt. Take care. Lots of care.
In Fortran, variables are passed by address, so W(0,1,0,1) names both a value and an address. You are effectively passing the subarray starting at W(0,1,0,1).
This is called "sequence association". Here, what appears to be a scalar, an element of an array (the actual argument in the caller), is associated with an array (implicitly with its first element), the dummy argument in the subroutine. Thereafter the elements of the arrays are associated by storage order, known as "sequence". This was done in Fortran 77 and earlier for various reasons, here apparently for a workspace array; perhaps the programmer was doing their own memory management. It is retained in Fortran >= 90 for backwards compatibility, but IMO it doesn't belong in new code.
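A minimal sketch of the idiom, with concrete bounds borrowed from the (5,6,7) example above; the bounds and the assignment are illustrative only:

      PROGRAM DEMO
      REAL*8 W(0:4,0:5,0:6,0:7)
C     Pass the address of element W(0,1,0,1); SSUB treats storage
C     from that point onward as its dummy array X.
      CALL SSUB(W(0,1,0,1))
      END

      SUBROUTINE SSUB(X)
      REAL*8 X(0:4,0:5,0:6)
C     X(0,0,0) occupies the same storage as W(0,1,0,1) in the caller;
C     subsequent elements follow in column-major (first index fastest)
C     sequence order.
      X(0,0,0) = 1.0D0
      RETURN
      END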

Stan variable declarations: difference in use between var_type var_name[length] and vector[length] var_name

I am new to Stan and I'm struggling to understand the difference in how different variable declaration styles are used. In particular, I am confused about when I should put square brackets after the variable type and when I should put them after the variable name. For example, given int<lower = 0> L; // length of my data, let's consider:
real N[L]; // my variable
versus
vector[L] N; // my variable
From what I understand, both declare a variable N as a vector of length L.
Is the only difference between the two that the first way specifies the variable type? Can they be used interchangeably? Should they belong to different parts of the Stan code (e.g., data vs parameters or model)?
Thanks for explaining!
real name[size] and vector[size] name can be used pretty interchangeably. They are stored differently internally, so you can get better performance with one or the other. Some operations may also be restricted to one or the other (e.g. vector multiplication), and the optimal order to loop over them changes; e.g. with a matrix vs. a 2-D array it is more efficient to loop over rows first vs. columns first, but those details come up once you have a more specific example. The way to read this is:
real name[size];
means name is an array of type real, so a bunch of reals that are stored together.
vector[size] name;
means that name is a vector of size size, which is also a bunch of reals stored together. But the vector data type in Stan is based on the Eigen C++ library, and that allows for other operations.
You can also create arrays of vectors like this:
vector[N] name[K];
which is going to produce an array of K vectors of size N.
Bottom line: you can get any model running using either vector or real, but they are not necessarily equivalent in computational efficiency.
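Putting the declaration styles together, a minimal sketch of a complete model (L plays the role of the question's data length; N and K are illustrative sizes added here):

data {
  int<lower = 0> L;   // length of the data, as in the question
  int<lower = 0> N;   // illustrative sizes for the array-of-vectors case
  int<lower = 0> K;
}
parameters {
  real a[L];          // 1-D array of reals
  vector[L] b;        // Eigen-backed vector; supports vector algebra
  vector[N] c[K];     // array of K vectors, each of size N
}
model {
  b ~ normal(0, 1);             // vectorized statement on a vector
  for (l in 1:L)
    a[l] ~ normal(0, 1);        // element-wise loop works for either style
}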

How to serialize slice without length using bincode?

I'm using the bincode crate to write a structure into a file. The structure contains a slice with a fixed size. How can I force bincode to write only the slice's content without the slice's length?
#![allow(unstable)]
#![feature(custom_derive, plugin)]
#![plugin(serde_macros)]

extern crate serde;
extern crate bincode;

use std::fs::File;

use bincode::serde::serialize_into;
use bincode::SizeLimit;

#[derive(Serialize)]
struct Foo([u8; 16]);

fn main() {
    let data = Foo([0; 16]);
    let mut writer = File::create("test.bin").unwrap();
    serialize_into::<File, Foo>(&mut writer, &data, SizeLimit::Infinite).unwrap();
}
The file 'test.bin' is 24 bytes instead of 16.
I saw a related remark in the documentation of bincode, but I did not understand how to use it.
a slice with a fixed size
[u8; 16] is not a slice. It is an array which may be coerced to a slice.
Anyway... I do not believe that you can. The important function appears to be Serializer::serialize_fixed_size_array, which is not implemented by the current serializer. That means it defaults to behaving the same as a slice.
Since slices do not have a length known at compile time, they must have their size written when serialized; that is where your extra 8 bytes come from (a u64 length prefix before the 16 bytes of data).
If no one else can provide a better suggestion, it's possible that the maintainer could find a way to make this happen. You may want to politely ask the maintainer for this feature or offer to help with the work.
Beyond that, it sounds like you are trying to make the bincode output fit a pre-existing format. That doesn't really make sense; bincode is its own format and has already made various choices and tradeoffs.
If you need to, you could implement your own encoder / decoder (either using serde or not). If you are concerned about file size, you can combine bincode with a compression step as well.
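For illustration, a minimal sketch of the hand-rolled route using plain std::io and no serde; this sidesteps bincode entirely rather than configuring it:

use std::fs::File;
use std::io::Write;

struct Foo([u8; 16]);

fn main() {
    let data = Foo([0; 16]);
    let mut writer = File::create("test.bin").unwrap();
    // Write the raw array contents, with no length prefix.
    writer.write_all(&data.0).unwrap();
    // test.bin is now exactly 16 bytes.
}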

Difference between memcpy_htod and to_gpu in Pycuda?

I am learning PyCUDA, and while going through the documentation on pycuda.gpuarray, I am puzzled by the difference between the pycuda.driver.memcpy_htod (also _dtoh) and pycuda.gpuarray.to_gpu (also get) functions. According to the gpuarray documentation, .get() does the following:

Transfer the contents of self into ary or a newly allocated numpy.ndarray. If ary is given, it must have the right size (not necessarily shape) and dtype. If it is not given, pagelocked specifies whether the new array is allocated page-locked.

Is this saying that .get() is implemented exactly the same way as pycuda.driver.memcpy_dtoh? Somehow, I think I am misinterpreting it.
pycuda.gpuarray.GPUArray.get() copies the GPUArray's contents back into a numpy.ndarray (allocating one if you don't supply it).
pycuda.driver.memcpy_dtoh() and friends copy plain buffers between CPU and GPU memory without any processing of the data in the buffers.
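To illustrate the distinction, a minimal sketch doing the same round trip both ways; the array size and dtype are illustrative:

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

host = np.arange(16, dtype=np.float32)

# High level: to_gpu() allocates device memory, copies, and keeps
# shape/dtype metadata; get() copies back into a new numpy.ndarray.
dev_arr = gpuarray.to_gpu(host)
roundtrip = dev_arr.get()

# Low level: you manage the raw allocation and copy plain bytes yourself.
dev_buf = cuda.mem_alloc(host.nbytes)
cuda.memcpy_htod(dev_buf, host)
roundtrip2 = np.empty_like(host)
cuda.memcpy_dtoh(roundtrip2, dev_buf)

assert (roundtrip == roundtrip2).all()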