Sonar analyse source step: JavaSourceImporter

I was wondering what the step "Sensor JavaSourceImporter..." means while analyzing source with the sonar-runner command?
Actually, I began an analysis but it has now been blocked at this step for the last 10 minutes!
Thanks in advance!

The "Sensor JavaSourceImporter..." is a phase during which Sonar reads the source files and stores them in the database.
This step can last quite some time, depending on:
the size of your code base,
the distance between the Sonar batch (which does the analysis) and the database (where the sources are pushed). If those two parts run on different servers, their location on the network can have a significant performance impact.
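If you want to see what the batch is actually doing while it appears stuck, one option is to re-run the analysis in debug mode, assuming your sonar-runner version supports the flag; the verbose output shows which sensor is currently running:
sonar-runner -X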


Xpress Mosel - Command to stop an optimization after a certain amount of time

I'm currently trying to solve an optimization problem, but it's way too big and the software can't handle it fast enough. What's the command to stop it after a certain amount of time?
I can't find it in the documentation.
Thanks
Assuming that you are working with Xpress Optimizer (module "mmxprs"), you can set a time limit via
setparam("XPRS_MAXTIME", 60) ! time limit of 60 seconds (if solving MIP: continue until first solution is found)
setparam("XPRS_MAXTIME", -60) ! hard time limit of 60 seconds
(see the Xpress Optimizer reference manual for the documentation of the 'MAXTIME' control).
If you are working with a different solver, you need to check the corresponding solver manual to find out the name of the parameter to use for this purpose (for example, with the "kalis" module there is a parameter "KALIS_MAX_COMPUTATION_TIME").
An easy way of seeing the available parameters for a given solver module (e.g. mmxprs) is this command:
mosel exam -p mmxprs

Does dataframe.repartition(x) make execution faster

I have a Spark script that reads data from Amazon S3 and then writes it to another bucket using the Parquet format.
This is what the code looks like:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods.repartition(25).write.format("parquet").mode("OverWrite").save("AnotherLocationInS3")
My question is: how does the repartition argument (here 25) affect the execution time? Should I increase it so the script runs faster?
Second question: Would it be better if I cache my df before the last line?
Thank you
In typical setups neither repartition nor cache will help you in this specific case. Since you read the data from a non-splittable format:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods will have only one partition.
In such a case, repartitioning would make sense if you performed any actual processing on this data.
However, if you just write to a distributed file system, repartitioning will simply double the cost: you have to send the data to other nodes first (which involves serialization, deserialization, network transfer, and a write to disk) and then still write to the distributed file system.
There are of course edge cases where this makes sense. If the network connecting your cluster nodes is much faster than the network connecting your cluster to the S3 nodes, the effective latency might be a bit lower.
As for caching: there is no value in caching here at all. Caching a Dataset is expensive and makes sense only if the persisted data is reused.
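To illustrate the point above, here is a minimal PySpark sketch; the bucket paths and the "amount" column are made-up placeholders, and the repartition only pays off because real per-row work happens before the write:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_ods = spark.read.csv("s3://first-bucket/LocationInFirstBucket.csv.gz",
                        header=True, sep=";")
print(df_ods.rdd.getNumPartitions())  # gzip is not splittable, so this prints 1

# Spreading the rows over 25 partitions is useful here only because the cast
# and filter below then run in parallel; a bare write would just add an extra
# shuffle before the save.
processed = (df_ods.repartition(25)
                   .withColumn("amount", F.col("amount").cast("double"))
                   .filter(F.col("amount") > 0))

processed.write.mode("overwrite").parquet("s3://second-bucket/AnotherLocationInS3")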
Answer 1: Whether to repartition to 25, or to more or fewer partitions, depends on how much data you have and on the number of executors you provided. If your Spark code runs on a cluster with more than one executor and the data is not repartitioned, then repartitioning will speed things up by writing your data in parallel.
Answer 2: There is no need to cache df before the last line, because you perform only a single action in your code. If you perform multiple actions on your DataFrame and don't want it to be recalculated once per action, then cache it.
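As a small sketch of the case where cache() does pay off (same placeholder paths as above): it is only worthwhile because two separate actions reuse the same DataFrame:
df_ods = spark.read.csv("s3://first-bucket/LocationInFirstBucket.csv.gz",
                        header=True, sep=";")
df_ods.cache()                 # kept only because two actions below reuse df_ods
print(df_ods.count())          # action 1: materializes the cache
df_ods.write.mode("overwrite").parquet("s3://second-bucket/out")  # action 2: served from the cache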
The thing here is that Spark can only parallelize writing up to a certain point, since one file can't be written by multiple executors at the same time.
Repartitioning helps with this parallelization because it will write 25 different files (one for each partition). If you increase the number of partitions, you increase the number of written files, hence speeding up the execution. This comes at a price, because the reading time will increase with the number of files to be read.
The limit is the number of executors you are running your job with; e.g., if you are running with 25 executors, then setting repartition to 26 will not help, because to write the 26th partition one of the previous 25 would have to finish first.
For the other question, I don't think .cache() will help you because Spark is lazy; maybe this article can help you further.

Measure query execution time excluding start-up cost in postgres

I want to measure the total time taken by postgres to execute my query, excluding the start-up cost. Earlier I was using \timing, but then I found that \timing includes the start-up cost.
I also tried "explain analyze", in which I found that the actual time is given in a format like: actual time=12.04..12.09
So does this mean that the time taken to execute the postgres query, excluding start-up time, is 0.05? If not, is there a way to exclude start-up costs and measure query execution time?
What you want is actually quite ill-defined.
"Startup cost" could mean:
network connection overhead and back-end start cost of establishing a new connection. Avoided by re-using the same session.
network round-trip times for sending the query and getting the results. Avoided by measuring the timing server-side with log_min_duration_statement = 0, or (with timing overhead) using explain analyze or the auto_explain module.
Query planning time. Avoided by PREPAREing the query, then timing only the subsequent EXECUTE.
Lock acquisition time. There is not currently any way to exclude this.
Note that using EXPLAIN ANALYZE may not be ideal for your purposes: it throws the query result away, and it adds its own costs because of the detailed timing it does. I would set log_min_duration_statement = 0, set client_min_messages appropriately, and capture the timings from the log output.
So it sounds like you want to PREPARE a query, then EXPLAIN ANALYZE EXECUTE it, or just EXECUTE it with log_min_duration_statement set to 0.
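As a small client-side sketch of that approach, assuming psycopg2 and a database/table named "mydb"/"big_table" (all placeholders); note the measured time still includes one network round trip, so use the log-based approach above for pure server-side numbers:
import time
import psycopg2

conn = psycopg2.connect(dbname="mydb")   # reuse this one session for every run
cur = conn.cursor()
cur.execute("PREPARE q AS SELECT count(*) FROM big_table")  # planning happens here

start = time.perf_counter()
cur.execute("EXECUTE q")                 # only execution (plus one round trip) is timed
print(cur.fetchone(), "elapsed:", time.perf_counter() - start, "seconds")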
To explore planning costs and execution costs separately, you need to turn on two postgresql.conf parameters:
log_planner_stats = on
log_executor_stats = on
and explore your log file.
Update:
1. Find your config file location by executing:
SHOW config_file;
2. Set the parameters. Don't forget to remove the comment symbol '#'.
3. Restart the postgresql service.
4. Execute your query.
5. Explore your log file.

Hadoop counters - tuning and optimization

I just wrote my first hadoop job. It processes many files and generates multiple output files for each input file. I am running it on a two-node cluster and it takes about 10 minutes for my largest input set. Looking at the counters below, what optimizations can I make so it runs faster? Are there any specific indicators one should look for in these counters?
Version: 2.0.0-mr1-cdh4.1.2
Map task Capacity:20
Reduce task Capacity:20
Avg task per node:20
We can see here that most of the data reduction happens in the map phase (the number of map output bytes is much lower than the number of HDFS bytes read; likewise, the number of map output records is much lower than the number of map input records). We also see that a lot of CPU time is spent, and that the number of shuffled bytes is low.
So for this job:
a) a lot of the data reduction is done in the map phase;
b) the job is CPU-bound.
So I think the code of the mapper and the reducer should be optimized. I/O is probably not important for this job.

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in re-calculating the output if I don't have to. My input (a set of documents) will be continually growing, and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or by setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
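If your mapper happened to be a script, one way to sketch this map-only, dated-output idea is a Hadoop Streaming run with zero reducers (the jar path, input path, and my_mapper.py are all placeholders):
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input /docs/incoming/20091108 \
  -output /my_mapper_output/20091108 \
  -mapper my_mapper.py \
  -file my_mapper.py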
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs, either in HBase or simply in HDFS.
map (reduce function) on (map_outputs); store into HBase.
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date, if you record somewhere the timestamps of successful summary runs and open the filter only on inputs dated later than the last successful summary: you'll save some significant scanning time.
Here's an interesting presentation that shows how one company architected their workflow (although they do not use HBase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability