benchmark code:
func BenchmarkSth(b *testing.B) {
    var x []int
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        x = append(x, i)
    }
}
result:
BenchmarkSth-4 50000000 20.7 ns/op 40 B/op 0 allocs/op
question/s:
where did 40 B/op come from? (any way of tracing + instructions is greatly appreciated)
how is it possible to have 40 B/op while having 0 allocs?
which one affects GC and how? (B/op or allocs/op)
is it really possible to have 0 B/op using append?
The Go Programming Language Specification
Appending to and copying slices
The variadic function append appends zero or more values x to s of
type S, which must be a slice type, and returns the resulting slice,
also of type S.
append(s S, x ...T) S // T is the element type of S
If the capacity of s is not large enough to fit the additional values,
append allocates a new, sufficiently large underlying array that fits
both the existing slice elements and the additional values. Otherwise,
append re-uses the underlying array.
For your example, on average, [40, 41) bytes per operation are allocated to grow the capacity of the slice when necessary. The capacity is increased using an amortized constant-time algorithm: up to length 1024 the capacity is doubled, and after that it grows by a factor of about 1.25. On average, there are [0, 1) allocations per operation.
For example,
func BenchmarkMem(b *testing.B) {
    b.ReportAllocs()
    var x []int64
    var a, ac int64
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        c := cap(x)
        x = append(x, int64(i))
        if cap(x) != c {
            a++
            ac += int64(cap(x))
        }
    }
    b.StopTimer()
    sizeInt64 := int64(8)
    B := ac * sizeInt64 // bytes
    b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkMem-4 50000000 26.6 ns/op 40 B/op 0 allocs/op
--- BENCH: BenchmarkMem-4
bench_test.go:32: op 1 B 8 alloc 1 lx 1 cx 1
bench_test.go:32: op 100 B 2040 alloc 8 lx 100 cx 128
bench_test.go:32: op 10000 B 386296 alloc 20 lx 10000 cx 12288
bench_test.go:32: op 1000000 B 45188344 alloc 40 lx 1000000 cx 1136640
bench_test.go:32: op 50000000 B 2021098744 alloc 57 lx 50000000 cx 50539520
For op = 50000000,
B/op = floor(2021098744 / 50000000) = floor(40.42197488) = 40
allocs/op = floor(57 / 50000000) = floor(0.00000114) = 0
Read:
Go Slices: usage and internals
Arrays, slices (and strings): The mechanics of 'append'
'append' complexity
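As a quick illustration of the growth policy described above (a minimal sketch; the exact growth factors are an implementation detail and vary across Go versions), printing the capacity whenever it changes shows doubling up to 1024 elements and roughly 1.25x steps after that:
package main

import "fmt"

func main() {
    var x []int
    c := cap(x)
    for i := 0; i < 3000; i++ {
        x = append(x, i)
        if cap(x) != c {
            c = cap(x)
            fmt.Println("len:", len(x), "cap:", c)
        }
    }
}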
To have zero B/op (and zero allocs/op) for append, allocate a slice with sufficient capacity before appending.
For example, with var x = make([]int64, 0, b.N),
func BenchmarkZero(b *testing.B) {
    b.ReportAllocs()
    var x = make([]int64, 0, b.N)
    var a, ac int64
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        c := cap(x)
        x = append(x, int64(i))
        if cap(x) != c {
            a++
            ac += int64(cap(x))
        }
    }
    b.StopTimer()
    sizeInt64 := int64(8)
    B := ac * sizeInt64 // bytes
    b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkZero-4 100000000 11.7 ns/op 0 B/op 0 allocs/op
--- BENCH: BenchmarkZero-4
bench_test.go:51: op 1 B 0 alloc 0 lx 1 cx 1
bench_test.go:51: op 100 B 0 alloc 0 lx 100 cx 100
bench_test.go:51: op 10000 B 0 alloc 0 lx 10000 cx 10000
bench_test.go:51: op 1000000 B 0 alloc 0 lx 1000000 cx 1000000
bench_test.go:51: op 100000000 B 0 alloc 0 lx 100000000 cx 100000000
Note the reduction in benchmark CPU time from around 26.6 ns/op to around 11.7 ns/op.
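To address the tracing question directly, here is a minimal sketch (not part of the benchmarks above) using runtime.ReadMemStats; its TotalAlloc field is the cumulative count of bytes allocated for heap objects, so the difference around the loop approximates the bytes that append allocated:
package main

import (
    "fmt"
    "runtime"
)

func main() {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    var x []int
    for i := 0; i < 1000000; i++ {
        x = append(x, i)
    }
    runtime.ReadMemStats(&after)
    // Includes any incidental runtime allocations as well.
    fmt.Println("bytes:", after.TotalAlloc-before.TotalAlloc, "len:", len(x))
}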
I am trying to implement matrix multiplication, but get_global_id returns incorrect values.
This is the host code (n, m, TILE_SIZE = 4):
int dimension = 2;
size_t global_item_size[] = {n, m};
size_t local_item_size[] = {TILE_SIZE, TILE_SIZE};
ret = clEnqueueNDRangeKernel(command_queue, kernel, dimension, NULL, global_item_size, local_item_size, 0, NULL, &perf_event);
And part of the kernel:
kernel void mul_tile(uint n, uint m, uint k, global const float *a, global const float *b, global float *c) {
    size_t i = get_global_id(0);
    size_t j = get_global_id(1);
    printf("aa %i %i\n", i, j);
}
This code prints this:
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
After some time I realized that get_global_id(0) returns the correct index the first time I call it and zero the second time:
kernel void mul_tile(uint n, uint m, uint k, global const float *a, global const float *b, global float *c) {
    size_t i = get_global_id(0);
    size_t j = get_global_id(0);
    printf("aa %i %i\n", i, j);
}
So, this kernel prints the same thing.
In some cases get_global_id(2) returns second-dimension indices, but when I just rename variables it starts printing zeroes.
This problem looks like some driver bug. I use a GeForce GT 745M, Ubuntu 20.04 and the recommended drivers (nvidia-driver-440).
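One hedged observation worth ruling out before concluding it is a driver bug: get_global_id returns size_t, which is 64-bit on this platform, while %i expects a 32-bit int; that mismatch is undefined behavior in OpenCL's printf and can make the second value read as zero, exactly as observed. Matching the format to the argument type avoids it:
// size_t is 64-bit here; %i expects int, so cast the values
// (or use %zu where the implementation supports it).
printf("aa %u %u\n", (uint)i, (uint)j);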
I am modeling a production problem in AMPL with start and end inventories required to be zero.
It is a simple model: 3 products produced over a 4-day span on a single machine.
However, my model keeps having problems with the big-M constraint and no inventory is allocated; the model is infeasible per AMPL presolve.
I would really appreciate it if someone could point out what's wrong or what I should try. Thanks in advance!
HERE IS THE CODE -
param n;
param m;
param M;
param T;
set I := {1..n};
set J := {1..m};
param r {I} integer;   # production rate
param stp {I} integer; # setup time
param rev {I} integer; # selling price
param h {I} integer;   # inventory cost
param d {I,J} integer; # demand
var t {I,J} integer >= 0, <= 480 default 50; # production time allocated for product i on day j
var s {I,J} integer >= 0 default 50;         # inventory of product i on day j
var z {I,J} binary default 1;                # setup for product i done on day j or not
#sum of revenue - sum of inventory cost
maximize K: sum {i in I,j in J}(rev[i] * t[i,j] * r[i] * z[i,j]) - sum {i in I,j in J}(h[i] * s[i,j]);
#inventory equation
s.t. c2 {i in I,j in 2..4}: s[i,j] = s[i,j-1] + t[i,j]*r[i] - d[i,j];
s.t. c14 {i in I}: s[i,1] = t[i,1]*r[i] - d[i,1];
#initial and end inventories
s.t. c3 {i in I}: s[i,1] = 0;
s.t. c4 {i in I}: s[i,4] = 0;
#ensuring time allocation only when setup
s.t. c5 {i in I,j in J}: t[i,j] <= M*z[i,j];
#ensuring demand is satisfied
s.t. c6 {i in I,j in 2..4}: s[i,j-1] + t[i,j]*r[i] = d[i,j];
s.t. c11 {i in I}: t[i,1]*r[i] = d[i,1];
#production time constraint
s.t. c7 {j in J}: sum{i in I}(z[i,j]*stp[i]) + sum{i in I}(t[i,j]) <= T;
#other non-negativity constraints
s.t. c12 {i in I,j in J}: s[i,j] >= 0;
#s.t. c13 {i in I,j in J}: t[i,j] >= 0;
end;
HERE IS THE DATA -
param n :=3;
param m :=4;
param T :=480;
param M :=480;
#param o :=2;
param d: 1 2 3 4 :=
1 400 600 200 800
2 240 440 100 660
3 80 120 40 100;
param r:=
1 5
2 4
3 2;
param stp:=
1 45
2 60
3 100;
param rev:=
1 50
2 70
3 120;
param h:=
1 2
2 1
3 3;
end;
RUN FILE -
#RESET THE AMPL ENVIRONMENT
reset;
#LOAD THE MODEL
model 'EDX 15.053 Production problem.mod';
#LOAD THE DATA
data 'EDX 15.053 Production problem.dat';
#DISPLAY THE PROBLEM FORMULATION
expand K;
expand c2;
#expand c3;
expand c4;
expand c5;
expand c6;
expand c7;
expand c11;
#let s[3,3] := 20;
#CHANGE THE SOLVER (optional)
#option solver CBC;
option solver cplex, cplex_options 'presolve 0';
option show_stats 1;
option send_statuses 0;
#option presolve 0;
#SOLVE
solve;
print {j in 1.._nvars:_var[j].status = "pre"}: _varname[j];
print {i in 1.._ncons:_con[i].status = "pre"}: _conname[i];
display solve_message;
display solve_result_num, solve_result;
#SHOW RESULTS
display t,s,z,K;
end;
How to sample without replacement in TensorFlow? Like numpy.random.choice(n, size=k, replace=False) for some very large integer n (e.g. 100k-100M), and smaller k (e.g. 100-10k).
Also, I want it to be efficient and on the GPU, so other solutions like this with tf.py_func are not really an option for me. Anything which would use tf.range(n) or so is also not an option because n could be very large.
This is one way:
n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]
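(As a side note: in TensorFlow 2.x, tf.random_shuffle was renamed tf.random.shuffle. A minimal sketch of the same idea, with placeholder values for n and sample_size:)
import tensorflow as tf  # TF 2.x

n = 1000          # placeholder
sample_size = 10  # placeholder
idx = tf.random.shuffle(tf.range(n))[:sample_size]
print(idx.numpy())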
EDIT:
I had posted the answer below but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor with size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop until you have unique indices:
n = ...
sample_size = ...
idx = tf.random_uniform([sample_size], maxval=n, dtype=tf.int64)
idx = tf.while_loop(
    # Keep resampling while the drawn indices contain duplicates.
    lambda idx: tf.size(idx) > tf.size(tf.unique(idx)[0]),
    lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])
EDIT 2:
About the average number of iterations in the previous method: if we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration is successful is:
p = product((n - (i - 1)) / n for i in 1 .. k)
Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (proof here). Here is a function that calculates the average number of trials in Python for some k and n values:
def avg_iter(k, n):
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it
And here are some results:
+-------+------+----------+
| n | k | Avg iter |
+-------+------+----------+
| 10 | 5 | 3.3 |
| 100 | 10 | 1.6 |
| 1000 | 10 | 1.1 |
| 1000 | 100 | 167.8 |
| 10000 | 10 | 1.0 |
| 10000 | 100 | 1.6 |
| 10000 | 1000 | 2.9e+22 |
+-------+------+----------+
You can see it varies wildly depending on the parameters.
It is possible, though, to construct a vector in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python it goes like this:
import random

def sample_wo_replacement(n, k):
    sample = []
    for i in range(k):
        # Pick from a range that shrinks by one for every value already drawn.
        sample.append(random.randint(0, n - 1 - len(sample)))
    # Adjust the values, from the end to the beginning, so they become distinct.
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample
random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]
This is a possible way to do it in TensorFlow (not sure if the best one):
import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop: draw k values from progressively shrinking ranges.
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions.
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop: adjust the drawn values so they become distinct.
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample
And an example:
with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]
This is quite intensive on tf.while_loops, though, which are well known to not be particularly fast in TensorFlow, so I don't know how fast you can really get with this method without some kind of benchmarking.
EDIT 4:
One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c and pick a random number of values from each chunk, then shuffle everything. The amount of memory you use is limited by c, and you don't need nested loops. If n is divisible by c you should get about a perfect random distribution; otherwise values in the last "short" chunk would receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long to account for different corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simplified.
import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample
np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8 9 12 13 3]
A quick benchmark of sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds on my computer, while I haven't been able to run the previous algorithm (the sample_wo_replacement function above) to completion with the same parameters. It should be possible to implement it in TensorFlow, maybe using tf.TensorArray, although it would require significant effort to get it exactly right.
Use the Gumbel-max trick here: https://github.com/tensorflow/tensorflow/issues/9260
z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
_, indices = tf.nn.top_k(logits + z, K)
indices are what you want. This trick is so easy!
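As a minimal, self-contained TF 2.x sketch of the same trick for uniform sampling without replacement (gumbel_choice is a hypothetical helper name; with equal logits the Gumbel noise alone decides the top-k, so the logits term can be dropped):
import tensorflow as tf  # TF 2.x

def gumbel_choice(n, k):
    # Gumbel-max trick: with equal logits, the top-k of i.i.d. Gumbel
    # noise is a uniformly distributed k-subset of [0, n).
    z = -tf.math.log(-tf.math.log(tf.random.uniform([n], 0, 1)))
    _, indices = tf.math.top_k(z, k)
    return indices

print(gumbel_choice(1000, 10).numpy())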
The following works fairly fast on the GPU, and I did not encounter memory issues when using n~100M and k~10k (using NVIDIA GeForce GTX 1080 Ti):
def random_choice_without_replacement(n, k):
"""equivalent to 'numpy.random.choice(n, size=k, replace=False)'"""
return tf.math.top_k(tf.random.uniform(shape=[n]), k, sorted=False).indices
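A hedged usage sketch (hypothetical values; note that this approach still draws an O(n) tensor of uniforms, so memory scales with n even though the work runs fast on the GPU):
# Mirrors numpy.random.choice(n, size=k, replace=False).
idx = random_choice_without_replacement(1000000, 100)
print(idx.numpy()[:10])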
I have been told that the time complexity of this code is O(n); could you please explain how that is so? The outer loop runs O(log n) times, but what about the inner one?
int fun(int n)
{
    int count = 0;
    for (int i = n; i > 0; i /= 2)
        for (int j = 0; j < i; j++)
            count += 1;
    return count;
}
That's actually quite easy once you understand what the function does.
The inner loop basically adds i to count. So you can just reduce the code here to this:
int fun(int n) {
    int count = 0;
    for (int i = n; i > 0; i /= 2)
        count += i;
    return count;
}
So now what you have is: it halves i and adds that half to count.
If you keep doing that you'll end up with a number close to n. If you did this with an infinitely accurate floating-point representation of i and count, you would actually get exactly n (though it would take an infinite number of steps).
For example, let's look at n=1000, these are the steps that will be taken:
i: 500 count: 500
i: 250 count: 750
i: 125 count: 875
i: 62 count: 937
i: 31 count: 968
i: 15 count: 983
i: 7 count: 990
i: 3 count: 993
i: 1 count: 994
The 6 that are missing from count at the end are rounding errors at i=125/2, i=31/2, i=15/2, i=7/2, i=3/2 and i=1/2.
So you can see, the function computes almost exactly n. More formally, the total work is n + n/2 + n/4 + ... ≤ 2n, a geometric series, which certainly makes it O(n).
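A quick empirical check (a minimal sketch using the original fun): count stays within a few units of n, and in particular well below the 2n bound:
#include <stdio.h>

int fun(int n)
{
    int count = 0;
    for (int i = n; i > 0; i /= 2)
        for (int j = 0; j < i; j++)
            count += 1;
    return count;
}

int main(void)
{
    /* fun(n) tracks n closely for every order of magnitude. */
    for (int n = 10; n <= 10000000; n *= 10)
        printf("n = %8d  fun(n) = %8d\n", n, fun(n));
    return 0;
}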
I need help, please. I started to program a common brute forcer / password guesser with CUDA (2.3 / 3.0 beta).
I tried different ways to generate all possible plain-text "candidates" from a defined ASCII charset.
In this sample code I want to generate all 74^4 possible combinations (and just output the result back to host/stdout).
$ ./combinations
Total number of combinations : 29986576
Maximum output length : 4
ASCII charset length : 74
ASCII charset : 0x30 - 0x7a
"0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy"
CUDA code (compiled with 2.3 and 3.0b - sm_10) - combinations.cu:
#include <stdio.h>
#include <cuda.h>

__device__ uchar4 charset_global = {0x30, 0x30, 0x30, 0x30};
__shared__ __device__ uchar4 charset[128];

__global__ void combo_kernel(uchar4 *result_d, unsigned int N)
{
    int totalThreads = blockDim.x * gridDim.x;
    int tasksPerThread = (N % totalThreads) == 0 ? N / totalThreads : N / totalThreads + 1;
    int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int endIdx = myThreadIdx + totalThreads * tasksPerThread;
    if (endIdx > N) endIdx = N;
    const unsigned int m = 74 + 0x30;
    for (int idx = myThreadIdx; idx < endIdx; idx += totalThreads) {
        charset[threadIdx.x].x = charset_global.x;
        charset[threadIdx.x].y = charset_global.y;
        charset[threadIdx.x].z = charset_global.z;
        charset[threadIdx.x].w = charset_global.w;
        __threadfence();
        if (charset[threadIdx.x].x < m) {
            charset[threadIdx.x].x++;
        } else if (charset[threadIdx.x].y < m) {
            charset[threadIdx.x].x = 0x30; // = 0
            charset[threadIdx.x].y++;
        } else if (charset[threadIdx.x].z < m) {
            charset[threadIdx.x].y = 0x30; // = 0
            charset[threadIdx.x].z++;
        } else if (charset[threadIdx.x].w < m) {
            charset[threadIdx.x].z = 0x30; // = 0
            charset[threadIdx.x].w++;
        }
        charset_global.x = charset[threadIdx.x].x;
        charset_global.y = charset[threadIdx.x].y;
        charset_global.z = charset[threadIdx.x].z;
        charset_global.w = charset[threadIdx.x].w;
        result_d[idx].x = charset_global.x;
        result_d[idx].y = charset_global.y;
        result_d[idx].z = charset_global.z;
        result_d[idx].w = charset_global.w;
    }
}
#define BLOCKS 65535
#define THREADS 128

int main(int argc, char **argv)
{
    const int ascii_chars = 74;
    const int max_len = 4;
    const unsigned int N = pow((float)ascii_chars, max_len);
    size_t size = N * sizeof(uchar4);
    uchar4 *result_d, *result_h;
    result_h = (uchar4 *)malloc(size);
    cudaMalloc((void **)&result_d, size);
    cudaMemset(result_d, 0, size);
    printf("Total number of combinations\t: %d\n\n", N);
    printf("Maximum output length\t: %d\n", max_len);
    printf("ASCII charset length\t: %d\n\n", ascii_chars);
    printf("ASCII charset\t: 0x30 - 0x%02x\n ", 0x30 + ascii_chars);
    for (int i = 0; i < ascii_chars; i++)
        printf("%c", i + 0x30);
    printf("\n\n");
    combo_kernel<<<BLOCKS, THREADS>>>(result_d, N);
    cudaThreadSynchronize();
    printf("CUDA kernel done\n");
    printf("hit key to continue...\n");
    getchar();
    cudaMemcpy(result_h, result_d, size, cudaMemcpyDeviceToHost);
    for (unsigned int i = 0; i < N; i++)
        printf("result[%06u]\t%c%c%c%c\n", i, result_h[i].x, result_h[i].y, result_h[i].z, result_h[i].w);
    free(result_h);
    cudaFree(result_d);
}
The code compiles without any problems, but the output is not what I expected.
In emulation mode:
CUDA kernel done
hit key to continue...
result[000000] 1000
...
result[000128] 5000
In release mode:
CUDA kernel done
hit key to continue...
result[000000] 1000
...
result[012288] 5000
I also tried __threadfence() and/or __syncthreads() on different lines of the code, also without success...
PS: if possible, I want to generate everything inside the kernel function. I also tried "pre"-generating the possible plain-text candidates inside the host main function and copying them to the device, but this works only with a very limited charset size (because of limited device memory).
Any idea about the output, i.e. why it repeats (even with __threadfence() or __syncthreads())?
Is there any other method to generate plain-text candidates inside a CUDA kernel fast :-) (~75^8)?
thanks a million
greets jan
Incidentally, your loop bound is overly complex. You don't need to do all that work to compute endIdx; instead you can do the following, which makes the code simpler.
for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)
Let's see:
When filling your charset array, __syncthreads() will be sufficient, as you are not interested in writes to global memory (more on this later).
Your if statements are not correctly resetting your loop iterators:
When the z < m branch is reached, both x == m and y == m, and both must be reset to 0; the code resets only y.
The same applies to the w branch.
Each thread is responsible for writing one set of 4 characters in charset, but every thread writes the same 4 values. No thread does any independent work.
You are writing each thread's results to global memory without atomics, which is unsafe. There is no guarantee that the results won't be immediately clobbered by another thread before reading them back.
You are reading the results of computation back from global memory immediately after writing them to global memory. It's unclear why you are doing this and this is very unsafe.
Finally, there is no reliable way in CUDA to do a synchronization between all blocks, which seems to be what you are hoping for. Calling __threadfence only applies to blocks currently executing on the device, which can be a subset of all the blocks that should run for a kernel call. Thus it doesn't work as a synchronization primitive.
It's probably easier to calculate initial values of x, y, z and w for each thread. Then each thread can start looping from its initial values until it has performed tasksPerThread iterations. Writing the values out can probably proceed more or less as you have it now; see the sketch below.
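For instance, a minimal sketch of that idea (assuming the 74-character charset starting at 0x30 and 4 positions): decompose each linear index into base-74 digits, so each thread derives its combination independently and no shared or global counter state is needed:
// Hypothetical kernel name; each index maps to a unique combination,
// so there is no cross-thread state and no synchronization required.
__global__ void combo_kernel_indexed(uchar4 *result_d, unsigned int N)
{
    const unsigned int base = 74;
    unsigned int totalThreads = blockDim.x * gridDim.x;
    for (unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < N;
         idx += totalThreads) {
        unsigned int v = idx;
        uchar4 r;
        r.x = 0x30 + v % base; v /= base;  // least significant "digit"
        r.y = 0x30 + v % base; v /= base;
        r.z = 0x30 + v % base; v /= base;
        r.w = 0x30 + v % base;             // most significant "digit"
        result_d[idx] = r;
    }
}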
EDIT: Here is a simple test program to demonstrate the logic errors in your loop iteration:
int m = 2;
int x = 0, y = 0, z = 0, w = 0;

for (int i = 0; i < m * m * m * m; i++)
{
    printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
    if (x < m) {
        x++;
    } else if (y < m) {
        x = 0;
        y++;
    } else if (z < m) {
        y = 0;
        z++;
    } else if (w < m) {
        z = 0;
        w++;
    }
}
The output of which is this:
x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0