I'm currently using a graph database on top of Redis for a Julia project.
Sometimes Redis requests take 300 ms to execute and I don't understand why.
I ran a simple request 10,000 times (the code of the request is below):
using Redis, BenchmarkTools
conn = RedisConnection(port=6382)
Redis.execute_command(conn, ["FLUSHDB"])
q = string("CREATE (:Type {nature :'Test',val:'test'})")
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1000
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10000
stats = @benchmark Redis.execute_command(conn, ["GRAPH.QUERY", "GraphDetection", q])
And I got these results:
BenchmarkTools.Trial:
  memory estimate:  3.09 KiB
  allocs estimate:  68
  minimum time:     1.114 ms (0.00% GC)
  median time:      1.249 ms (0.00% GC)
  mean time:        18.623 ms (0.00% GC)
  maximum time:     303.269 ms (0.00% GC)
  samples:          10000
  evals/sample:     1
The huge difference between the median and the mean comes from the problem I'm describing: a request takes either 1-3 ms or 300-310 ms.
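As a rough check, the reported mean and median can be used to estimate how often the slow case occurs. The sketch below assumes a simplified two-mode model that is not stated in the question: every request is either a fast one near the median (1.249 ms) or a slow one near the maximum (303.269 ms).

```python
# Two-mode model (an assumption): mean = (1 - p) * fast + p * slow, solved for p.
fast_ms = 1.249     # median time from the benchmark output
slow_ms = 303.269   # maximum time from the benchmark output
mean_ms = 18.623    # mean time from the benchmark output


def slow_fraction(mean, fast, slow):
    """Fraction of samples that must be slow to produce the given mean."""
    return (mean - fast) / (slow - fast)


p = slow_fraction(mean_ms, fast_ms, slow_ms)
print(f"estimated slow fraction: {p:.1%}")
print(f"estimated slow samples out of 10000: {round(p * 10000)}")
```

Under that model, a bit under 6% of the requests would have to fall in the slow group to pull the mean up to 18.6 ms.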
I'm not familiar with Julia, but please note that RedisGraph reports its internal execution time; I suggest using that report for measurement.
In addition, it would help to know when (on which sample) RedisGraph took over 100 ms to process the query. Usually it is the first query that causes RedisGraph to do some extra work.
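A minimal sketch of both suggestions, written in Python rather than Julia. It assumes that the statistics part of a GRAPH.QUERY reply is a list of strings that includes a line of the form "Query internal execution time: ... milliseconds"; the sample strings and the helper names `internal_time_ms` and `slow_samples` are hypothetical.

```python
import re

# Assumed shape of the statistics strings in a GRAPH.QUERY reply (illustrative).
SAMPLE_STATS = [
    "Nodes created: 1",
    "Properties set: 2",
    "Query internal execution time: 0.538000 milliseconds",
]

_TIME_RE = re.compile(
    r"internal execution time:\s*([0-9.]+)\s*milliseconds", re.IGNORECASE
)


def internal_time_ms(stats):
    """Return the server-reported execution time in ms, or None if absent."""
    for line in stats:
        if isinstance(line, bytes):
            line = line.decode()
        match = _TIME_RE.search(line)
        if match:
            return float(match.group(1))
    return None


def slow_samples(times_ms, threshold_ms=100.0):
    """Indices of the samples whose time exceeds the threshold."""
    return [i for i, t in enumerate(times_ms) if t > threshold_ms]


print(internal_time_ms(SAMPLE_STATS))
print(slow_samples([1.2, 303.0, 1.3, 310.0]))
```

Comparing the server-reported time with the client-side wall time for the slow samples would show whether the 300 ms is spent inside RedisGraph or somewhere between the client and the server.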
I have a 4-core CPU rated at 83.2 GFLOPS, so I understand that each core delivers 83.2 / 4 = 20.8 GFLOPS.
I am trying to estimate the execution time of an algorithm. I found that the execution time can be estimated roughly with the following formula:
estimation_exec_time = algorithm_time_complexity / GFLOPS
So for a bubble_sort algorithm with time complexity O(n^2), running on a VM that uses 1 core of my CPU, the estimated execution time would be:
estimation_exec_time = n^2 / 20.8 GFLOPS
The problem is that the estimated execution time is completely different from the real execution time when I time my code.
To be more specific, the formula returns an estimate of 0.00004807 s,
while the real execution time is 0.74258 s.
Is this approach with the formula wrong?
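The arithmetic behind the formula can be reproduced with a short sketch. It assumes n = 1000, which is the value that yields the quoted estimate of about 0.000048 s; the bubble sort below is a generic textbook version, not the code that was actually timed.

```python
import time

PEAK_FLOPS_PER_CORE = 20.8e9  # 83.2 GFLOPS / 4 cores, from the question


def estimated_seconds(n):
    """The question's formula: n^2 operations at the peak floating-point rate."""
    return n ** 2 / PEAK_FLOPS_PER_CORE


def bubble_sort(values):
    """Plain bubble sort; returns the sorted list and the comparison count."""
    data = list(values)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data, comparisons


n = 1000  # assumed input size; n^2 / 20.8e9 is about 0.000048 s
start = time.perf_counter()
result, comparisons = bubble_sort(range(n, 0, -1))  # reversed input: worst case
measured = time.perf_counter() - start

print(f"formula estimate: {estimated_seconds(n):.8f} s")
print(f"measured:         {measured:.8f} s ({comparisons} comparisons)")
```

The gap is expected. O(n^2) describes how the cost grows and hides constant factors, and the comparisons and swaps that a sort performs are not floating-point operations, so a peak FLOPS rating says little about how fast they run.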
I noticed that when saving large dataframes as CSVs, the memory allocations are at least an order of magnitude (a factor of 10) higher than the size of the dataframe in memory or the size of the CSV file on disk. Why is this the case? And is there a way to prevent it, i.e. is there a way to save a dataframe to disk without using (much) more memory than the dataframe itself?
In the example below I generate a dataframe with one integer column and 10 million rows. It takes up 76 MB, but writing the CSV allocates 1.35 GB.
using DataFrames, CSV
function generate_df(n::Int64)
    DataFrame!(a = 1:n)
end
julia> @time tmp = generate_df(10000000);
0.671053 seconds (2.45 M allocations: 199.961 MiB)
julia> Base.summarysize(tmp) / 1024 / 1024
76.29454803466797
julia> @time CSV.write("~/tmp/test.csv", tmp)
3.199506 seconds (60.11 M allocations: 1.351 GiB)
What you see is not related to CSV.write, but to the fact that DataFrame is type-unstable. This means that it will allocate when iterating rows and accessing their contents. Here is an example:
julia> df = DataFrame(a=1:10000000);
julia> f(x) = x[1]
f (generic function with 1 method)
julia> @time sum(f, eachrow(df)) # after compilation
0.960045 seconds (40.07 M allocations: 613.918 MiB, 4.18% gc time)
50000005000000
This is a deliberate design decision to avoid unacceptable compilation times for very wide data frames (which are common in practice in certain fields of application). Now, this is the way to reduce allocations:
julia> @time CSV.write("test.csv", df) # after compilation
1.976654 seconds (60.00 M allocations: 1.345 GiB, 5.64% gc time)
"test.csv"
julia> @time CSV.write("test.csv", Tables.columntable(df)) # after compilation
0.439597 seconds (36 allocations: 4.002 MiB)
"test.csv"
(This works well if the table is narrow; for wide tables it might run into compilation-time issues.)
This is a pattern often encountered in Julia (even Julia itself works this way, as the args field in Expr is a Vector{Any}): type-unstable code is fine when you do not care about performance but want to avoid excessive compilation latency, and it is easy to switch to a type-stable mode when compilation time does not matter and type stability does.
Python Pandas:
import pandas as pd
df = pd.DataFrame({'a': range(10_000_000)})
%time df.to_csv("test_py.csv", index=False)
Memory consumption (measured in Task Manager): 135 MB (before writing) -> 151 MB (during writing)
Wall time: 8.39 s
Julia:
using DataFrames, CSV
df = DataFrame(a=1:10_000_000)
@time CSV.write("test_jl.csv", df)
@time CSV.write("test_jl.csv", df)
memory consumption: 284 MB (before writing) -> 332 MB (after 1st writing),
2.196639 seconds (51.42 M allocations: 1.270 GiB, 7.49% gc time)
2nd execution (no compilation required anymore): -> 357 MB,
1.701374 seconds (50.00 M allocations: 1.196 GiB, 6.26% gc time)
The memory increase of Python and Julia while writing the CSV file is similar (~15 MB). In Julia, I observe a significant memory increase after the first write command is executed, probably due to caching of compiled code.
Note that allocations != memory requirement. Even though 1.2 GB memory is allocated in total, only 15 MB is used at the same time (max value).
Regarding performance, Julia is nearly 4 times faster than Python, even including compilation time.
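The distinction between total allocations and peak memory can be illustrated with a small Python sketch based on the standard-library `tracemalloc` module. The buffer size and count are arbitrary values chosen for the illustration, not measurements from the CSV example.

```python
import tracemalloc

CHUNK_BYTES = 1_000_000   # size of each temporary buffer (arbitrary)
CHUNKS = 100              # number of buffers allocated one after another

tracemalloc.start()
total_allocated = 0
for _ in range(CHUNKS):
    # The previous buffer becomes garbage as soon as buf is rebound.
    buf = bytearray(CHUNK_BYTES)
    total_allocated += len(buf)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"total allocated: {total_allocated / 1e6:.0f} MB")
print(f"peak in use:     {peak / 1e6:.1f} MB")
```

The loop allocates 100 MB in total, but only a couple of buffers are ever alive at the same time, so the peak stays far below the cumulative figure. The allocation count printed by Julia's `@time` is a cumulative figure of the same kind.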
TensorFlow's tf.subtract takes too long on a large array.
My workstation configuration:
CPU: Xeon E5 2699 v3
Mem: 384 GB
GPU: NVIDIA K80
CUDA: 8.5
CUDNN: 5.1
Tensorflow: 1.1.0, GPU version
The following is the test code and result.
import tensorflow as tf
import numpy as np
import time
W=3000
H=4000
in_a = tf.placeholder(tf.float32,(W,H))
in_b = tf.placeholder(tf.float32,(W,H))
def test_sub(number):
    sess = tf.Session()
    out = tf.subtract(in_a, in_b)
    for i in range(number):
        a = np.random.rand(W, H)
        b = np.random.rand(W, H)
        feed_dict = {in_a: a,
                     in_b: b}
        # time only the sess.run call, which includes feeding the inputs
        t0 = time.time()
        out_ = sess.run(out, feed_dict=feed_dict)
        t_ = (time.time() - t0) * 1000
        print("index: %d total time: %s ms" % (i, t_))

test_sub(20)
Results:
index: 0 total time: 338.145017624 ms
index: 1 total time: 137.024879456 ms
index: 2 total time: 132.538080215 ms
index: 3 total time: 133.152961731 ms
index: 4 total time: 132.885932922 ms
index: 5 total time: 135.06102562 ms
index: 6 total time: 136.723041534 ms
index: 7 total time: 137.926101685 ms
index: 8 total time: 133.605003357 ms
index: 9 total time: 133.143901825 ms
index: 10 total time: 136.317968369 ms
index: 11 total time: 137.830018997 ms
index: 12 total time: 135.458946228 ms
index: 13 total time: 132.793903351 ms
index: 14 total time: 144.603967667 ms
index: 15 total time: 134.593963623 ms
index: 16 total time: 135.535001755 ms
index: 17 total time: 133.697032928 ms
index: 18 total time: 136.134147644 ms
index: 19 total time: 133.810043335 ms
The results show that tf.subtract takes more than 130 ms to process a 3000x4000 subtraction, which is clearly too long, especially on an NVIDIA K80 GPU.
Can anyone suggest ways to optimize tf.subtract?
Thanks in advance.
You're measuring not only the execution time of tf.subtract but also the time needed to transfer the input data from CPU memory to GPU memory: this is your bottleneck.
To avoid it, don't feed the data through placeholders. Generate it with TensorFlow (if it has to be randomly generated) or, if you have to read it, use the TensorFlow input pipeline, which creates threads that read the input ahead of time and then feed the graph without leaving the TensorFlow graph.
Do as many operations as possible inside the TensorFlow graph in order to remove the data-transfer bottleneck.
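A back-of-the-envelope sketch of how much data each sess.run call in the test code has to move. It assumes that both inputs cross to the GPU as float32 and that the result is copied back to the host, which is what the explanation above implies; the 133 ms step time is a typical value taken from the posted results.

```python
W, H = 3000, 4000      # array shape from the test code
FLOAT32_BYTES = 4      # the placeholders are tf.float32
FLOAT64_BYTES = 8      # np.random.rand returns float64 arrays

elements = W * H
in_bytes_f32 = 2 * elements * FLOAT32_BYTES     # in_a and in_b after conversion
out_bytes_f32 = elements * FLOAT32_BYTES        # result returned by sess.run
host_bytes_f64 = 2 * elements * FLOAT64_BYTES   # NumPy inputs before conversion

step_s = 0.133  # typical per-step time from the results above
moved_bytes = in_bytes_f32 + out_bytes_f32

print(f"float64 inputs on the host: {host_bytes_f64 / 1e6:.0f} MB")
print(f"float32 data moved per step: {moved_bytes / 1e6:.0f} MB")
print(f"implied rate: {moved_bytes / step_s / 1e9:.2f} GB/s")
```

Each step therefore converts 192 MB of float64 input to float32 and moves 144 MB in and out, while the subtraction itself touches only 12 million elements. That is consistent with the transfer, rather than the arithmetic, dominating the measured time.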
It sounds reasonable that the time I measured included the data transfer from CPU memory to GPU memory.
Since I have to read the input data (e.g., the inputs are images generated by a mobile phone and sent to TensorFlow one by one), does that mean TensorFlow placeholders must be used?
In that situation, if two images are not generated at the same time (i.e., the second image arrives long after the first one), how can the input pipeline threads read the input ahead of time, given that the second image does not exist yet while TensorFlow is processing the first one? Could you give me a simple example that explains the TensorFlow input pipeline?
A sparse matrix in Julia only stores nonzero elements.
Some functions, such as log(x+1) (in all bases),
map zero to zero, and thus don't need to be applied to those zero elements.
(I think we would call this a Monoid homomorphism.)
How can I use this fact to speed up an operation?
Example code:
X = sprand(10^4,10^4, 10.0^-5, rand)
function naiveLog2p1(N::SparseMatrixCSC{Float64,Int64})
    log2(1+N) |> sparse
end
Running:
@time naiveLog2p1(X)
Output is:
elapsed time: 2.580125482 seconds (2289 MB allocated, 6.86% gc time in 3 pauses with 0 full sweep)
On a second time (so that the function is expected to be already compiled):
elapsed time: 2.499118888 seconds (2288 MB allocated, 8.17% gc time in 3 pauses with 0 full sweep)
Little change, presumably because the function is so simple to compile.
Following the suggestion in the Julia manual's "Sparse matrix operations" section, I would extract the nonzero entries with findnz(), apply the log operation to the values, and then reconstruct the sparse matrix with sparse().
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    I,J,V = findnz(N)
    return sparse(I,J,log2(1+V))
end
@time improvedLog2p1(X)
elapsed time: 0.000553508 seconds (473288 bytes allocated)
My solution would be to actually operate on the inside of the data structure itself:
mysparselog(N::SparseMatrixCSC) =
    SparseMatrixCSC(N.m, N.n, copy(N.colptr), copy(N.rowval), log2(1+N.nzval))
Note that if you want to operate on the sparse matrix in place, which I imagine would be fairly common in practice, this would be a zero-allocation operation. Benchmarking shows that this performs similarly to @Oxinabox's answer, since it is about the same in terms of memory operations (although that answer doesn't actually return the new matrix, as the mean output shows):
with warmup times removed:
naiveLog2p1
elapsed time: 1.902405905 seconds (2424151464 bytes allocated, 10.35% gc time)
mean(M) => 0.005568094618997372
mysparselog
elapsed time: 0.022551705 seconds (24071168 bytes allocated)
elapsed time: 0.025841895 seconds (24071168 bytes allocated)
mean(M) => 0.005568094618997372
improvedLog2p1
elapsed time: 0.018682775 seconds (32068160 bytes allocated)
elapsed time: 0.027129497 seconds (32068160 bytes allocated)
mean(M) => 0.004995127985160583
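The idea of operating directly on the stored values can be sketched in plain Python. The `SparseCSC` class and the `map_stored` helper below are hypothetical stand-ins for illustration, not the Julia type; the layout mirrors the colptr / rowval / nzval fields used above, with 0-based indices instead of Julia's 1-based ones.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SparseCSC:
    """Minimal compressed-sparse-column container (illustrative only)."""
    m: int
    n: int
    colptr: List[int]   # column j owns the entries colptr[j]:colptr[j + 1]
    rowval: List[int]   # row index of each stored entry
    nzval: List[float]  # value of each stored entry


def map_stored(matrix, f):
    """Apply f to the stored values only; this is valid only when f(0) == 0."""
    if f(0.0) != 0.0:
        raise ValueError("f must map zero to zero")
    return SparseCSC(matrix.m, matrix.n, list(matrix.colptr),
                     list(matrix.rowval), [f(v) for v in matrix.nzval])


def log2p1(x):
    return math.log2(1.0 + x)


# 3x3 matrix with two stored entries: (row 0, col 0) = 1.0 and (row 2, col 1) = 3.0
X = SparseCSC(3, 3, [0, 1, 2, 2], [0, 2], [1.0, 3.0])
Y = map_stored(X, log2p1)
print(Y.nzval)
```

The sparsity pattern is copied unchanged and only the value array is transformed, so the work scales with the number of stored entries rather than with m * n.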
What you are looking for is the sparse nonzeros function.
nonzeros(A)
Return a vector of the structural nonzero values in sparse matrix A. This includes zeros that are explicitly stored in the sparse
matrix. The returned vector points directly to the internal nonzero
storage of A, and any modifications to the returned vector will mutate
A as well.
You can use this as below:
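function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    M = copy(N)
    ms = nonzeros(M)   # reference to M's internal nonzero storage, not a copy
    ms = log2(1+ms)    # rebinds ms to a new vector, so M is not modified; ms[:] = log2(1+ms) would write into M
    M
end
@time improvedLog2p1(X)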
running for the first time output is:
elapsed time: 0.002447847 seconds (157 kB allocated)
running for the second time output is:
0.000102335 seconds (109 kB allocated)
That is an improvement of four orders of magnitude in speed and memory use.
What does XADisk do in addition to reading/writing from the underlying file? How does that translate into a percentage of the read/write throughput (approximately)?
It depends on the size; if the files are large, set the "heavyWrite" flag to true when opening the XAFileOutputStream.
I tested with 500 files of 1 MB each. Below is the time taken, averaged over 10 executions:
Java IO - 37.5 seconds
Java NIO - 24.8 seconds
XADisk - 30.3 seconds
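The timings above can be turned into throughput and relative overhead with a few lines of arithmetic. The sketch assumes that each run wrote the full 500 MB and nothing else; the helper names are illustrative.

```python
TOTAL_MB = 500 * 1  # 500 files of 1 MB each

# Average times in seconds, taken from the list above.
timings_s = {"Java IO": 37.5, "Java NIO": 24.8, "XADisk": 30.3}


def throughput_mb_s(seconds):
    """Average write throughput for the whole test, in MB/s."""
    return TOTAL_MB / seconds


def relative_time_pct(name, baseline):
    """Extra time of `name` relative to `baseline` in percent (negative = faster)."""
    return (timings_s[name] / timings_s[baseline] - 1.0) * 100.0


for name, seconds in timings_s.items():
    print(f"{name:9s} {throughput_mb_s(seconds):5.1f} MB/s")

print(f"XADisk vs Java NIO: {relative_time_pct('XADisk', 'Java NIO'):+.1f}% time")
print(f"XADisk vs Java IO:  {relative_time_pct('XADisk', 'Java IO'):+.1f}% time")
```

On these numbers XADisk needs about 22% more time than Java NIO and about 19% less than plain Java IO for the same 500 MB.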