How to associate stats dumps with m5ops in FS mode of gem5?

In SE mode, it's much easier to associate each stats dump with its corresponding m5op.
However, in FS mode, tens or even hundreds of stats dumps can end up in the same 'stats.txt' file. How can we identify the following:
Which stats dump corresponds to what?
OR
At least, which of the stats dumps are the result of m5ops invoked by the user?

Each stats dump is wrapped with:
---------- Begin Simulation Statistics ----------
sim_seconds 0.000001 # Number of seconds simulated
sim_ticks 1000 # Number of ticks simulated
...
---------- End Simulation Statistics ----------
In full system mode the only time the simulator dumps stats on its own is at exit; all the other dumps are driven by your runscript and your application. So every stats block except the last one, each wrapped between the Begin/End Simulation Statistics markers, corresponds to a dump issued by an m5op. If you need more precision about which event caused which dump, you can check the pseudo_inst.cc file and add to or modify it accordingly. This might be implemented in the newer versions, but I am not up to date on that.
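If modifying pseudo_inst.cc is not an option, one alternative (my own sketch, not part of the answer above) is to do the labelling from the gem5 run script instead: have the guest run 'm5 exit' at each point of interest rather than 'm5 dumpstats', catch the resulting exit events in Python, print your own marker, and dump the stats yourself. The loop below uses the standard m5.simulate/m5.stats API; the marker text and counter are made up.

import m5

# ... build and instantiate your FS system as usual, then:
dump_id = 0
exit_event = m5.simulate()
while exit_event.getCause() == "m5_exit instruction encountered":
    dump_id += 1
    # This print line ends up in the simulator's standard output, giving you a
    # mapping from dump number to simulation tick (and hence to the guest-side m5op).
    print("stats dump %d at tick %d" % (dump_id, m5.curTick()))
    m5.stats.dump()      # writes one Begin/End block to stats.txt
    m5.stats.reset()     # optional: start the next block from zero
    exit_event = m5.simulate()   # resume the guest after the m5op
print("final exit: %s" % exit_event.getCause())

Each iteration then produces exactly one Begin/End block whose position in stats.txt matches the numbered marker in the simulator's standard output.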

Related

The disk IOPS from the scylla_setup iotune study of my disk are different from the fio test data

When using scylla_setup, the iotune study result is:
Measuring sequential write bandwidth: 473 MB/s
Measuring sequential read bandwidth: 499 MB/s
Measuring random write IOPS: 1902 IOPS
Measuring random read IOPS: 1999 IOPS
IOPS is 1900-2000.
When using fio:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdc1 --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
the result is:
test: (groupid=0, jobs=1): err= 0: pid=11697: Wed Jun 26 08:58:13 2019
read: IOPS=47.6k, BW=186MiB/s (195MB/s)(3070MiB/16521msec)
bw ( KiB/s): min=187240, max=192136, per=100.00%, avg=190278.42, stdev=985.15, samples=33
iops : min=46810, max=48034, avg=47569.61, stdev=246.38, samples=33
write: IOPS=15.9k, BW=62.1MiB/s (65.1MB/s)(1026MiB/16521msec)
bw ( KiB/s): min=62656, max=65072, per=100.00%, avg=63591.52, stdev=590.96, samples=33
iops : min=15664, max=16268, avg=15897.88, stdev=147.74, samples=33
cpu : usr=4.82%, sys=12.81%, ctx=164053, majf=0, minf=23
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=785920,262656,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=186MiB/s (195MB/s), 186MiB/s-186MiB/s (195MB/s-195MB/s), io=3070MiB (3219MB), run=16521-16521msec
WRITE: bw=62.1MiB/s (65.1MB/s), 62.1MiB/s-62.1MiB/s (65.1MB/s-65.1MB/s), io=1026MiB (1076MB), run=16521-16521msec
Disk stats (read/write):
sdc: ios=780115/260679, merge=0/0, ticks=792798/230409, in_queue=1023170, util=99.47%
Read IOPS is 46000-48000, write IOPS is 15000-16000.
(NB: It looks like the questioner filed this as a Scylla Github issue too - https://github.com/scylladb/scylla/issues/4604 )
[Why is] the disk iops from scylla_setup iotune [...] different from fio test data
Different benchmarks, different results:
Scylla may have been using a much bigger block size (e.g. 64k) per I/O (this is likely the biggest factor). As you make the block size bigger (up to some maximum, due to diminishing returns), the bandwidth (i.e. the total amount of data you can send in, say, a second) achieved with that block size goes up, but the IOPS you get will typically go down (you are sending more data per I/O, after all). This is normal!
Scylla could be using buffered I/O (rather than direct I/O)
Scylla may have been benchmarking reads and writes separately
Scylla may have been using a bigger queue depth
Scylla may have been batching its submissions differently
Scylla may be writing a different type of data
And so on...
In general, it's very difficult to take benchmarks done with different tools and compare them directly to each other - you would need to know what they are doing under the hood for any comparison to be meaningful. Trying to look at IOPS or bandwidth in isolation without more context is meaningless, as you typically trade one off against the other. It's better to use the same benchmark tool with identical options to compare two different machines, or to measure the impact of tuning changes on the same machine.
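For illustration only (the block size and access pattern here are guesses at what iotune does, not its actual settings), an fio invocation closer to a large-block random-read test would look something like:
fio --name=bigblock-randread --filename=/dev/sdc1 --direct=1 --ioengine=libaio --rw=randread --bs=64k --iodepth=64 --size=4G
With a 64k block size you should expect the reported IOPS to drop and the bandwidth to rise relative to the 4k run above, which is exactly the block size trade-off described earlier.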
TL;DR: This is likely an apples-to-oranges comparison where the tools are measuring different contexts.
PS: gtod_reduce is a go-faster stripe that very few people actually need. If your hardware isn't capable of doing gigabytes per second and you're not seeing your CPU maxed out, it's unlikely that reducing gettimeofday calls is going to nudge the result very much.
(This question might be more appropriate for Server Fault (and thus get better replies there) because it's not directly about programming)

How can I monitor multiple Statistics from different classes in Gem5 at the same time dynamically?

Which class in Gem5 has access to all the Stats from different objects?
Are the statistics of each object returned to a specific class continuously, or are these stats collected only at the end of the simulation?
For example, servicedByWrQ is a Scalar stat defined in dram_ctrl.hh. On the other hand, condPredicted is another Scalar stat which is defined in bpred_unit.hh. How can I monitor these two statistics at the same time during the simulation, rather than through the output file, in Gem5?
My ultimate goal is to change the behavior of other hardware units during the simulation, such as branch prediction or the cache replacement policy, based on those statistics.

Lambda Architecture Modelling Issue

I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices.
In most cases (averages, etc.) it seems to fit my requirements. However, I am stuck trying to model a specific use case. In short...
Each device has a device_id. Every device emits 1 event per second. Each event has an event_id ranging from 0 to 10.
An event_id of 0 indicates START and an event_id of 10 indicates END.
All the events between START and END should be grouped into one single group (event_group).
This will produce tuples of event_groups, e.g. (0,2,2,2,5,10), (0,4,2,7,...,5,10), (0,10).
An event_group might be short, i.e. 10 minutes, or very long, say 3 hours.
According to Lambda Architecture these events transmitted by every device are my "Master Data Set".
Currently, the events are sent to HDFS & Storm using Kafka (Camus, Kafka Spout).
In the Streaming process I group by device_id, and use Redis to maintain a set of incoming events in memory, based on a key which is generated each time an event_id=0 arrives.
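(For concreteness, the Redis bookkeeping described above might look roughly like the following sketch. This is my illustration, not the asker's actual code; the key names and the event fields device_id/event_id are assumptions.)

import uuid
import redis

r = redis.Redis()

def emit(group):
    # placeholder: hand the completed group to the next stage (e.g. HBase)
    print(group)

def handle_event(event):
    # event is assumed to be a dict like {"device_id": "...", "event_id": 0..10}
    device_id, event_id = event["device_id"], event["event_id"]
    pointer = "current_group:%s" % device_id
    if event_id == 0:
        # START: open a fresh group key for this device
        r.set(pointer, "event_group:%s:%s" % (device_id, uuid.uuid4()))
    group_key = r.get(pointer)
    if group_key is None:
        return                      # event arrived before any START; skip it
    r.rpush(group_key, event_id)
    if event_id == 10:
        # END: the whole group is in Redis and can be emitted downstream
        emit([int(x) for x in r.lrange(group_key, 0, -1)])
        r.delete(group_key)
        r.delete(pointer)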
The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these event_groups?
Using Hive I can group tuples in the same manner. However, each file will also contain "broken" event_groups:
(0,2,2,3) previous computation (file)
(4,3,) previous computation (file)
(5,6,7,8,10) current computation (file)
so I need to merge them, based on device_id, into (0,2,2,3,4,3,5,6,7,8,10) (across multiple files).
Is a Lambda Architecture a fit for this scenario? Or should the streaming process be the only source of truth, i.e. write to HBase and HDFS itself? Won't this affect the overall latency?
As far as I understand your process, I don't see any issue, as the principle of the Lambda Architecture is to regularly re-process all your data in batch mode.
(By the way, not all your data, but a time frame, usually larger than the speed-layer window.)
If you choose a large enough time window for your batch mode (let's say your aggregation window + 3 hours, in order to include even the longest event groups), your map-reduce program will be able to compute all your event groups for the desired aggregation window, whatever files the distinct events are stored in (Hadoop shuffle magic!).
The underlying files are not part of the problem, but the time windows used to select the data to process are.
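To make the batch-side grouping concrete, here is a minimal sketch (mine, with assumed field names "ts" and "event_id") of the per-device reduce step that stitches groups back together regardless of which hourly file each event landed in:

def split_into_event_groups(events):
    # events: every event for one device_id, gathered from any number of files
    groups, current = [], None
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["event_id"] == 0:          # START: open a new group
            current = [ev["event_id"]]
        elif current is not None:
            current.append(ev["event_id"])
            if ev["event_id"] == 10:     # END: close and emit the group
                groups.append(tuple(current))
                current = None
    return groups   # e.g. [(0, 2, 2, 3, 4, 3, 5, 6, 7, 8, 10)]

As long as the batch window contains both the START and the END of every group, the shuffle delivers all of a device's events to the same reducer, so the hourly file boundaries never matter.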

Measure query execution time excluding start-up cost in postgres

I want to measure the total time taken by postgres to execute my query, excluding the start-up cost. Earlier I was using \timing, but now I have found that \timing includes the start-up cost.
I also tried "explain analyze", in which I found that the actual time is given in a format like: actual time=12.04..12.09
So, does this mean that the time taken to execute the postgres query, excluding start-up time, is 0.05? If not, then is there a way to exclude start-up costs and measure query execution time?
What you want is actually quite ill-defined.
"Startup cost" could mean:
network connection overhead and back-end start cost of establishing a new connection. Avoided by re-using the same session.
network round-trip times for sending the query and getting the results. Avoided by measuring the timing server-side with log_min_duration_statement = 0 or (with timing overhead) using explain analyze or the auto_explain module.
Query planning time. Avoided by PREPAREing the query, then timing only the subsequent EXECUTE.
Lock acquisition time. There is not currently any way to exclude this.
Note that using EXPLAIN ANALYZE may not be ideal for your purposes: it throws the query result away, and it adds its own costs because of the detailed timing it does. I would set log_min_duration_statement = 0, set client_min_messages appropriately, and capture the timings from the log output.
So it sounds like you want to PREPARE a query, then EXPLAIN ANALYZE EXECUTE it or just EXECUTE it with log_min_duration_statement set to 0.
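As a minimal client-side sketch (mine, not from the answer; it assumes psycopg2, a placeholder table my_table, and a local database), PREPARE does the parse/analyse work up front and a warm-up EXECUTE caches the plan, so the timed run is mostly execution. Note that this timing still includes one network round trip, so for pure server-side numbers the log-based approach above is better:

import time
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

cur.execute("PREPARE q AS SELECT count(*) FROM my_table WHERE col > 42")
cur.execute("EXECUTE q")    # warm-up: the plan is built and cached here
cur.fetchall()

start = time.perf_counter()
cur.execute("EXECUTE q")    # timed run: reuses the cached plan
cur.fetchall()
print("execution time: %.3f ms" % ((time.perf_counter() - start) * 1000))

cur.execute("DEALLOCATE q")
conn.close()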
For exploring planning costs and execution costs separately, you need to turn on several postgresql.conf parameters:
log_planner_stats = on
log_executor_stats = on
and explore your log file.
Update:
1. Find your config file location by executing:
SHOW config_file;
2. Set the parameters. Don't forget to remove the comment symbol '#'.
3. Restart the PostgreSQL service.
4. Execute your query.
5. Explore your log file.
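If you would rather not edit postgresql.conf and restart, both settings can also be enabled per session by a superuser. A small sketch, assuming psycopg2 and a placeholder query:

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("SET log_planner_stats = on")
cur.execute("SET log_executor_stats = on")
cur.execute("SELECT count(*) FROM my_table")   # the query you want to inspect
cur.fetchall()
conn.close()

The planner and executor statistics for that query then appear in the server log.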

Hadoop counters - tuning and optimization

I just wrote my first Hadoop job. It processes many files and generates multiple output files for each input file. I am running it on a two-node cluster and it takes about 10 minutes for my largest input set. Looking at the counters below, what optimizations can I make to get it to run faster? Are there any specific indicators one should look for in these counters?
Version: 2.0.0-mr1-cdh4.1.2
Map task Capacity:20
Reduce task Capacity:20
Avg task per node:20
We can see here that most of the data reduction happens in the map phase (the number of map output bytes is much lower than the number of HDFS bytes read; likewise, the number of map output records is much lower than the number of map input records). We can also see that a lot of CPU time is spent, and that the number of shuffle bytes is low.
So for this job:
a) a lot of data reduction is done in the map phase;
b) the job is CPU bound.
So I think the mapper and reducer code should be optimized; I/O is probably not important for this job.