U-SQL job is very slow when I add a .NET call - azure-data-lake

The code runs very fast over 2000 small files (~10-50 KB): about 1 minute with parallelism = 5.
@arenaData =
    EXTRACT col1, col2, col3
    FROM @in
    USING Extractors.Tsv(quoting : true, skipFirstNRows : 1, nullEscape : "\\N", encoding : Encoding.UTF8);
@res =
    SELECT col1, col2, col3
    FROM @arenaData;
OUTPUT @res
TO @out
USING Outputters.Csv();
But if I change the code like this, it takes about 1 hour:
@arenaData =
    EXTRACT col1, col2, col3
    FROM @in
    USING Extractors.Tsv(quoting : true, skipFirstNRows : 1, nullEscape : "\\N", encoding : Encoding.UTF8);
@res =
    SELECT
        col1.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture) AS col1_converted,
        col2, col3
    FROM @arenaData;
OUTPUT @res
TO @out
USING Outputters.Csv();
Why is the .NET call so slow? I need to convert the date format in the source CSV files to "yyyy-MM-dd HH:mm:ss". How can I do that efficiently?

Great to hear you are getting better performance now!
Your job runs on over 2800 very small files using an expression that is being executed in managed code and not translated into C++ as some of the more common C# expressions in U-SQL are.
This leads to the following problem:
You start your job with a certain number of AUs. Each AU then starts a YARN container to execute part of your job. This means that the container needs to be initialized cleanly, which takes some time (you can see it in the Vertex Execution View as creation time). That is not much overhead if your vertex does a large amount of processing. Unfortunately, in your case the processing of a small file is very quick, so the overhead is large.
If the vertex only executes system-generated code that we codegen into C++, then we can reuse containers without the re-initialization time. Unfortunately, we cannot reuse containers for general user code that gets executed with the managed runtime, due to potential artifacts being left behind. So in that case we need to re-initialize the containers, which takes time (over 2800 times).
Now, based on your feedback, we are improving our re-initialization logic (we can avoid re-initialization if you do not do anything fancy with inline C# expressions). It will also get better once we can process several small files inside a single vertex instead of one file per vertex.
Workarounds for you are to increase the size of your files and, where possible, to avoid having custom code in too many of the vertices (not always possible, of course).


Why is it faster sending data as encoded string than sending it as bytes?

I'm playing with pyzmq for inter-process transfer of 4k HDR image data and noticed that this:
byt = np.array2string(np.random.randn(3840, 2160, 3)).encode()
while True:
    socket.send(byt)
is much much faster than:
byt = np.random.randn(3840, 2160, 3).asbytes()
while True:
    socket.send(byt)
Can someone explain why? I can't seem to wrap my head around it.
Q: Why is it faster sending ...? Can someone explain why?
A: +1 for having asked WHY - people who understand WHY are those who strive to get to the roots of a problem, so as to truly understand the core reasons, and who can then design better systems, knowing the very WHY (taking no shortcuts in mimicking, emulating or copy/paste-following someone else).
So, let's start:
HDR is not SDR; we will have "a lot of DATA" here to acquire, store, process and send.
Inventory of facts, in this order: DATA, process, .send(), who gets faster and WHY.
DATA: defined to be a 4K-HDR-sized array of triple data values with numpy's default dtype, where ITU-R Recommendation BT.2100 requires at least 10 bits per component for the increased HDR colour dynamic range. The as-is code delivers numpy.random.randn( c4K, r4K, 3 )'s default dtype of np.float64. For the sake of a proper, right-sized system design, HDR (extending a plain 8-bit sRGB triple-byte colourspace) should always prefer int{10|12|16|32|...}-based storage, so as not to skew any numerical image post-processing in the pipeline's later phase(s); a quick size comparison follows after this inventory.
process:
Actual message-payload generating processes were defined to be
Case-A) np.array2string( ) followed by an .encode() method
Case-B) a numpy.ndarray-native (sic) .asbytes() method
.send():
A ZeroMQ Scalable Formal Communication Archetype pattern (of an unknown type here) finally receives the process-generated message-payload via a (blocking form of the) .send() method.
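To make the storage-size point concrete, here is a small illustrative sketch (the numbers come straight from numpy's .nbytes attribute, nothing is measured over the wire): the default float64 frame is four times larger than a uint16 frame, which can still hold 10/12/16-bit HDR code values.
import numpy as np

f64 = np.random.randn(3840, 2160, 3)               # default dtype is np.float64
u16 = np.empty((3840, 2160, 3), dtype=np.uint16)   # enough headroom for >= 10-bit HDR samples

print(f64.nbytes / 1E6)   # 199.0656 [MB] per frame
print(u16.nbytes / 1E6)   #  49.7664 [MB] per frame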
Solution of WHY & tips for HOW:
The core difference is hidden in the fact that we are comparing apples to oranges.
>>> len( np.random.randn( c4K, r4K, 3 ).tobytes() ) / 1E6
199.0656 [MB]
>>> len( np.array2string( np.random.randn( c4K, r4K, 3 ) ) ) / 1E6
0.001493 [MB] ... Q.E.D.
While the (sic) .asbytes() method produces a full copy (incl. RAM allocation + RAM-I/O traffic, i.e. both [SPACE]- and [TIME]-domain costs), spending some extra µs before ZeroMQ starts its .send()-method ZeroCopy magic:
print( np.random.randn( c4K, r4K, 3 ).tobytes.__doc__ )
a.tobytes(order='C')
Construct Python bytes containing the raw data bytes in the array.
Constructs Python bytes showing a copy of the raw contents of
data memory. The bytes object is produced in C-order by default.
This behavior is controlled by the ``order`` parameter.
.. versionadded:: 1.9.0
The other case, Case-A, first throws away a lot (!) of the original 4K-HDR DATA (how much depends on the actual numpy array-printing configuration settings) even before moving it into the .encode() phase:
>>> print( np.array2string( np.random.randn( c4K, r4K, 3 ) ) )
[[[ 1.54482944 -0.23189048 -0.67866246]
...
[ 0.13461456 1.47855833 -1.68885902]]
[[-0.18963557 -1.1869201 1.34843493]
...
[-0.3022641 -0.44158803 0.75750368]]
[[-1.05737969 0.864752 0.36359686]
...
[ 1.70240612 -0.12574642 -1.03325878]]
...
[[ 0.41776933 1.73473723 0.28723299]
...
[-0.47635911 0.15901325 -0.56407537]]
[[-1.41571874 1.66735309 0.6259928 ]
...
[-0.93164127 0.95708002 1.3470873 ]]
[[ 0.16426176 -0.00317156 0.77522962]
...
[ 0.32960196 -1.74369368 -0.34177759]]]
So, sending less DATA means taking less time to move it.
Tips for HOW:
ZeroMQ methods and the overall performance will benefit from passing the zmq.DONTWAIT flag to the .send() method.
Try to re-use as much of the great numpy tooling as possible, to minimise repetitive RAM allocations (we may pre-allocate a variable once and re-use it).
Try to use as compact a DATA representation as possible if hunting for maximum performance with minimum latency: redundancy-free, compact formats that match the CPU-cache-line hierarchy and associativity will always win the race for ultimate performance. Using a view of the internal numpy storage area, i.e. without any mediating methods to read-access the actual block of 4K-HDR data, may help make the whole pipeline ZeroCopy, down to the ZeroMQ .send() pushing only DATA references (i.e. without copying or moving a single byte of DATA from/into RAM, up until loading it onto the wire), which is the coolest performance result of our design efforts here, isn't it?
In any case, in all critical sections, avoid blocking the flow by calling gc.disable(), to at least defer a potential .collect() so it does not happen "here".
A sketch combining these tips follows below.
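A minimal sketch of those tips put together, under stated assumptions: a PUSH socket on a made-up endpoint tcp://127.0.0.1:5555, one pre-allocated uint16 frame that is refilled in place, and a zero-copy, non-blocking send. This is an illustration of the idea, not a drop-in replacement for the original code.
import numpy as np
import zmq

ctx = zmq.Context.instance()
socket = ctx.socket(zmq.PUSH)
socket.bind("tcp://127.0.0.1:5555")                 # hypothetical endpoint, adapt as needed

# one compact frame, allocated once and re-used: ~49.8 MB (uint16) vs ~199 MB (float64)
frame = np.empty((3840, 2160, 3), dtype=np.uint16)

while True:
    # ... refill `frame` in place from the producer here ...
    try:
        # copy=False hands ZeroMQ the numpy buffer itself (zero-copy);
        # zmq.DONTWAIT returns immediately instead of blocking.
        # Note: with copy=False the buffer must not be overwritten while
        # ZeroMQ still references it (pyzmq's track=True can report that).
        socket.send(frame, flags=zmq.DONTWAIT, copy=False)
    except zmq.Again:
        pass                                        # no peer ready yet; retry or skip
Whether PUSH/PULL is the right archetype was not specified in the question, so treat the socket setup above as a placeholder for whatever pattern the real pipeline uses.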

Parallelism with @sync @async in Julia

I have some heavy CSV tables that I would like to import in parallel with the @sync @async macros.
Not being very familiar with this, I tried it this way:
# import files
@sync @async begin
    df1 = CSV.File(libname * "df1.csv") |> DataFrame!
    df2 = CSV.File(libname * "df2.csv") |> DataFrame!
end
The task completes, but the data subset I make afterwards seems to be impacted:
select!(df1, Not("Var1"))
ArgumentError : Column :Var1 not found in the data frame
PS: without the @sync macro the code works well.
I am probably doing something wrong. Any idea would be helpful.
Thanks
@sync @async do not do anything in your code other than introducing a begin ... end block with its own local scope.
What happens here is that you are creating a new scope and never modifying the global values of df1 and df2 - instead you are seeing their old values.
If I/O is the bottleneck in your code, the correct code would be the following:
dfs = Vector{DataFrame}(undef, 2)
@sync begin
    @async dfs[1] = CSV.File(libname * "df1.csv") |> DataFrame!
    @async dfs[2] = CSV.File(libname * "df2.csv") |> DataFrame!
end
However, usually it is not the I/O that is the issue but rather the CPU. In that case green threads are not that useful and you need regular threads:
dfs = Vector{DataFrame}(undef, 2)
Threads.@threads for i in 1:2
    dfs[i] = CSV.File(libname * "df$i.csv") |> DataFrame!
end
Note that for this code to use multi-threading you need to set the JULIA_NUM_THREADS environment variable before starting Julia, for example:
set JULIA_NUM_THREADS=2

Filtering with regex and .contains in Perl 6

I often have to filter the elements of an array of strings that contain some substring (e.g. a single character). Since this can be done either by matching a regex or with the .contains method, I decided to make a small test to be sure that .contains is faster (and therefore more appropriate).
my @array = "aa" .. "cc";
my constant $substr = 'a';
my $time1 = now;
my @a_array = @array.grep: *.contains($substr);
my $time2 = now;
@a_array = @array.grep: * ~~ /$substr/;
my $time3 = now;
my $time_contains = $time2 - $time1;
my $time_regex = $time3 - $time2;
say "contains: $time_contains sec";
say "regex: $time_regex sec";
Then I change the size of @array and the length of $substr and compare the times each method takes to filter @array. In most cases (as expected), .contains is much faster than the regex, especially if @array is large. But in the case of a small @array (as in the code above), the regex is slightly faster.
contains: 0.0015010 sec
regex: 0.0008708 sec
Why does this happen?
In an entirely unscientific experiment I just switched the regex version and the contains version around and found that the difference in performance you're measuring is not "regex vs contains" but in fact "first thing versus second thing":
When contains comes first:
contains: 0.001555 sec
regex: 0.0009051 sec
When regex comes first:
regex: 0.002055 sec
contains: 0.000326 sec
Benchmarking properly is a difficult task. It's really easy to accidentally measure something different from what you wanted to figure out.
When I want to compare the performance of multiple things I will usually run each thing in a separate script, or maybe have a shared script but only run one of the tasks at once (for example using a multi sub MAIN("task1") approach). That way any startup work gets shared.
In the #perl6 IRC channel on freenode we have a bot called benchable6 which can do benchmarks for you. Read the section "Comparing Code" on its wiki page to find out how it can compare two pieces of code for you.

Accessing intermediate results from a tensorflow graph

If I have a complex calculation of the form
tmp1 = tf.fun1(placeholder1,placeholder2)
tmp2 = tf.fun2(tmp1,placeholder3)
tmp3 = tf.fun3(tmp2)
ret = tf.fun4(tmp3)
and I calculate
ret_vals = sess.run(ret,feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
fun1, fun2, etc. are possibly costly operations on a lot of data.
If I run to get ret_vals as above, is it possible to later or at the same time access the intermediate values as well without re-running everything up to that value? For example, to get tmp2, I could re-run everything using
tmp2_vals = sess.run(tmp2,feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
But this looks like a complete waste of computation? Is there a way to access several of the intermediate results in a graph after performing one run?
The reason why I want to do this is for debugging or testing or logging of progress when ret_vals gets calculated e.g. in an optimization loop. Every step where I run the ret_vals calculations is costly but I want to see some of the intermediate results that were calculated.
If I do something like
tmp2_vals, ret_vals = sess.run([tmp2, ret], ...)
does this guarantee that the graph will only get run once (instead of once for tmp2 and once for ret), as I want?
Have you looked at tf.Print? This is an identity op with a printing side effect. You can insert it in your graph right after tmp2 to get its values printed. Note that, by default, only a limited number of entries of the tensor are printed; you can control how many times the op actually logs with the first_n attribute (and how many entries per tensor with summarize).
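For the second part of the question, here is a minimal sketch (TF1-style graph/session API; the fun1..fun4 ops are replaced by simple arithmetic purely for illustration) showing both ideas: splicing tf.Print in after tmp2, and fetching tmp2 and ret in a single sess.run call, which executes the graph once and returns both values.
import tensorflow as tf  # TF1-style graph/session API

placeholder1 = tf.placeholder(tf.float32, shape=[None])
placeholder2 = tf.placeholder(tf.float32, shape=[None])
placeholder3 = tf.placeholder(tf.float32, shape=[None])

# stand-ins for fun1..fun4
tmp1 = placeholder1 + placeholder2
tmp2 = tmp1 * placeholder3
tmp2 = tf.Print(tmp2, [tmp2], message="tmp2 = ", first_n=5)  # log the first 5 evaluations
tmp3 = tf.square(tmp2)
ret = tf.reduce_sum(tmp3)

with tf.Session() as sess:
    # one graph execution produces both fetched values
    tmp2_vals, ret_vals = sess.run(
        [tmp2, ret],
        feed_dict={placeholder1: [1.0, 2.0],
                   placeholder2: [3.0, 4.0],
                   placeholder3: [5.0, 6.0]})
    print(tmp2_vals, ret_vals)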

NDepend Count, Average, etc. reporting... aggregates. Possible? Clean workarounds?

We have a huge code base, where the "methods with too many local variables" query alone returns 226 methods. I don't want this huge table being dumped into the XML output to clutter it up, and I'd like the top 10 if possible, but what I really want is the count, so we can do trending and executive summaries. Is there a clean/efficient/scalable, non-hacky way to do this?
I imagine I could use an executable task instead of the NDepend task (so that the merge is not automatic and the clutter doesn't get merged), and then manually operate on those files to get a summary, but I'd like to know if there is a shorter path.
What about defining a baseline so as to only take account of new flaws?
what I really want is the count so we can do trending and executive summaries
Trending can be easily achieved with code queries and rules over LINQ (CQLinq) like: Avoid making complex methods even more complex (Source CC)
// <Name>Avoid making complex methods even more complex (Source CC)</Name>
// To visualize changes in code, right-click a matched method and select:
// - Compare older and newer versions of source file
// - Compare older and newer versions disassembled with Reflector
warnif count > 0
from m in JustMyCode.Methods where
!m.IsAbstract &&
m.IsPresentInBothBuilds() &&
m.CodeWasChanged()
let oldCC = m.OlderVersion().CyclomaticComplexity
where oldCC > 6 && m.CyclomaticComplexity > oldCC
select new { m,
oldCC ,
newCC = m.CyclomaticComplexity ,
oldLoc = m.OlderVersion().NbLinesOfCode,
newLoc = m.NbLinesOfCode,
}
or Avoid transforming an immutable type into a mutable one.