How can I monitor multiple statistics from different classes in gem5 at the same time dynamically?

Which class in gem5 has access to all the Stats from different objects?
Are the statistics of each object returned to a specific class continuously, or are these stats only collected at the end of the simulation?
For example, servicedByWrQ is a Scalar stat defined in dram_ctrl.hh, while condPredicted is another Scalar stat defined in bpred_unit.hh. How can I monitor these two statistics at the same time during the simulation, rather than through gem5's output file?
My ultimate goal is to change the behavior of other hardware units during the simulation, such as branch prediction or the cache replacement policy, based on those statistics.
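(For reference, the closest thing I know of is to run the simulation in slices from the Python config script and dump stats between slices, roughly as in the sketch below; the interval is arbitrary, and this still routes values through the stats output file rather than giving other SimObjects direct access to individual stat values at run time, which is what I would like.)
import m5

# ... system configuration and m5.instantiate() would go here ...

interval = 1000000000                     # sampling interval in ticks (arbitrary choice)
while True:
    exit_event = m5.simulate(interval)    # run one slice of the simulation
    m5.stats.dump()                       # append the current stat values to stats.txt
    m5.stats.reset()                      # start the next sampling window fresh
    if exit_event.getCause() != "simulate() limit reached":
        break                             # the workload finished or hit another exit condition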

Related

Analyzing data flow of Dask dataframes

I have a dataset stored in a tab-separated text file. The file looks as follows:
date time temperature
2010-01-01 12:00:00 10.0000
...
where the temperature column contains values in degrees Celsius (°C).
I compute the daily average temperature using Dask. Here is my code:
from dask.distributed import Client
import dask.dataframe as dd
client = Client("<scheduler URL>")
inputDataFrame = dd.read_table("<input file>").drop('time', axis=1)
groupedData = inputDataFrame.groupby('date')
meanDataframe = groupedData.mean()
result = meanDataframe.compute()
result.to_csv('result.out', sep='\t')
client.close()
In order to improve the performance of my program, I would like to understand the data flow caused by Dask data frames.
How is the text file read into a data frame by read_table()? Does the client read the whole text file and send the data to the scheduler, which partitions the data and sends it to the workers? Or does each worker read the data partitions it works on directly from the text file?
When an intermediate data frame is created (e.g. by calling drop()), is the whole intermediate data frame sent back to the client and then sent to the workers for further processing?
The same question for groups: where is the data for a group object created and stored? How does it flow between the client, scheduler and workers?
The reason for my question is that if I run a similar program using Pandas, the computation is roughly two times faster, and I am trying to understand what causes the overhead in Dask. Since the size of the result data frame is very small compared to the size of the input data, I suppose there is quite some overhead caused by moving the input and intermediate data between client, scheduler and workers.
1) The data are read by the workers. The client does read a little ahead of time, to figure out the column names and types and, optionally, to find line-delimiters for splitting files. Note that all workers must be able to reach the file(s) of interest, which can require some shared file-system when working on a cluster.
2), 3) In fact, the drop, groupby and mean methods do not generate intermediate data-frames at all, they just accumulate a graph of operations to be executed (i.e., they are lazy). You could time these steps and see that they are fast. During execution, intermediates are made on workers, copied to other workers as required, and discarded as soon as possible. There are never copies to the scheduler or client, unless you explicitly request it.
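You can see this for yourself with a small timing sketch (the file placeholder mirrors the question; only compute() should take noticeable time):
import time
import dask.dataframe as dd

t0 = time.time()
df = dd.read_table("<input file>").drop('time', axis=1)   # lazy: just builds a task graph
daily_mean = df.groupby('date').mean()                     # still lazy, no data touched yet
print("graph built in %.3fs" % (time.time() - t0))         # typically milliseconds

t0 = time.time()
result = daily_mean.compute()                               # execution happens here, on the workers
print("computed in %.3fs" % (time.time() - t0))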
So, to the root of your question: you can best investigate the performance of your operation by looking at the dashboard.
There are many factors that govern how quickly things will progress: the processes may be sharing an IO channel; some tasks do not release the GIL, and so parallelise poorly in threads; the number of groups will greatly affect the amount of shuffling of data into groups... plus there is always some overhead for every task executed by the scheduler.
Since Pandas is efficient, it is not surprising that for the case where data fits easily into memory, it performs well compared to Dask.
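If you are on a reasonably recent version of distributed, the client will tell you where its dashboard lives (the scheduler address is a placeholder, as in the question):
from dask.distributed import Client

client = Client("<scheduler URL>")   # connect to the running scheduler
print(client.dashboard_link)         # open this URL in a browser to watch tasks execute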

Lambda Architecture Modelling Issue

I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices.
In most cases (averages etc.) it seems to fit my requirements. However, I am stuck trying to model a specific use case. In short...
Each device has a device_id. Every device emits 1 event per second. Each event has an event_id ranging from {0-->10}.
An event_id of 0 indicates START & an event_id of 10 indicates END
All the events between START & END should be grouped into one single group (event_group).
This will produce tuples of event_groups, i.e. (0,2,2,2,5,10), (0,4,2,7,...,5,10), (0,10).
An event_group might be short, i.e. 10 minutes, or very long, say 3 hours.
According to Lambda Architecture these events transmitted by every device are my "Master Data Set".
Currently, the events are sent to HDFS & Storm using Kafka (Camus, Kafka Spout).
In the Streaming process I group by device_id, and use Redis to maintain a set of incoming events in memory, based on a key which is generated each time an event_id=0 arrives.
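Purely for illustration, the in-memory grouping in the speed layer could look roughly like this (a sketch using the redis-py client; the key layout and field names are my assumptions, not the actual code):
import json
import redis

r = redis.Redis()

def handle_event(device_id, event_id, payload):
    # One Redis list per device holds the group currently being assembled;
    # a START event (event_id == 0) begins a fresh group.
    key = "event_group:%s" % device_id
    if event_id == 0:
        r.delete(key)
    r.rpush(key, json.dumps({"event_id": event_id, "payload": payload}))
    if event_id == 10:                      # END: the group is complete
        group = [json.loads(e) for e in r.lrange(key, 0, -1)]
        r.delete(key)
        return group                        # hand off to downstream processing
    return None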
The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these event_groups?
Using Hive I can group tuples in the same manner. However, each file will also contain "broken" event_groups:
(0,2,2,3) previous computation (file)
(4,3) previous computation (file)
(5,6,7,8,10) current computation (file)
so I need to merge them, based on device_id, into (0,2,2,3,4,3,5,6,7,8,10) (spanning multiple files).
Is a Lambda Architecture a fit for this scenario? Or should the streaming process be the only source of truth, i.e. write to HBase only and skip HDFS? Won't this affect the overall latency?
As far as I understand your process, I don't see any issue, as the principle of the Lambda Architecture is to regularly re-process all your data in batch mode.
(By the way, not all your data, but a time frame, usually larger than the speed-layer window.)
If you choose a large enough time window for your batch mode (say, your aggregation window + 3 hours, in order to include even the longest event groups), your MapReduce program will be able to compute all event groups for the desired aggregation window, whichever files the individual events are stored in (Hadoop shuffle magic!).
The underlying files are not part of the problem, but the time windows used to select data to process are.
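To make the batch side concrete: once the shuffle has brought all of one device's events for the window together, sorted by time, splitting them into event_groups is straightforward. A sketch with assumed field names:
def split_into_event_groups(events):
    # events: all events for one device_id in the batch window, sorted by timestamp
    current = None
    for event in events:
        if event["event_id"] == 0:      # START opens a new group
            current = []
        if current is not None:
            current.append(event["event_id"])
        if event["event_id"] == 10 and current is not None:
            yield current               # END closes and emits the group
            current = None
    # anything left in `current` is an incomplete group (END not seen yet)
    # e.g. for the merged sequence 0,2,2,3,4,3,5,6,7,8,10 this yields a single group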

Parallel Processing in MATLAB with more than 12 cores

I created a function to compute the correct number of ks for a dataset using the Gap Statistics algorithm. At one point this algorithm requires computing the dispersion (i.e., the sum of the distances between every point and its centroid) for, let's say, 100 different datasets (called "test data(sets)" or "reference data(sets)"). Since these operations are independent, I want to parallelize them across all the cores. I have the MathWorks Parallel Computing Toolbox but I am not sure how to use it (problem 1; I can use past threads to understand this, I guess). However, my real problem is another one: this toolbox seems to allow the use of just 12 cores (problem 2). My machine has 64 cores and I need to use all of them. Do you know how to parallelize a process across more than 12 cores?
For your information this is the bit of code that should run in parallel:
%This cycle is repeated n_tests times, where n_tests is equal
%to the number of reference datasets we want to use
for id_test = 2:n_tests+1
    test_data = generate_test_data(data);
    %% Calculate the dispersion(s) for the generated dataset(s)
    dispersions(id_test, 1:max_k) = 0;
    %We calculate the dispersion for the id_test reference dataset
    for id_k = 1:max_k
        dispersions(id_test, id_k) = calculate_dispersion(test_data, id_k);
    end
end
Please note that in R2014a the limit on the number of local workers was removed. See the release notes.
The number of local workers available with Parallel Computing Toolbox is license dependent. When introduced, the limit was 4; this changed to 8 in R2009a; and to 12 in R2011b.
If you want to use 16 workers, you will need a 16-node MDCS licence, and you'll also need to set up some sort of scheduler to manage those. There are detailed instructions about how to do this here: http://www.mathworks.de/support/product/DM/installation/ver_current/. Once you've done that, yes, you'll be able to do "matlabpool open 16".
EDIT: As of Matlab version R2014a there is no longer a limit on the number of local workers for the Parallel Computing Toolbox. That is, if you are using an up-to-date version of Matlab you will not encounter the problem described by the OP.
The fact that MATLAB puts this restriction on its Parallel Computing Toolbox often makes it not worth the money and effort of using it.
One way of working around it is to use a combination of the MATLAB Compiler and virtual machines, using either VMware or VirtualBox.
Compile the code required to run your tests.
Load your compiled code with the MCR (MATLAB Compiler Runtime) onto a VM template.
Create multiple copies of the VM template and let each copy run the required calculations for some of the datasets.
Gather the data from all your results.
This method is time consuming and only worth it if it saves more time than porting the code would, and if the code is already highly optimised.
I had the same problem on a 32-core machine with 6 datasets. I overcame it by creating a shell script which started MATLAB six times, once for each dataset. I could do this because the computations weren't dependent on each other. From what I understand, you could use a similar approach by starting around 6 instances, each handling around 16 datasets. It depends on how much RAM you have and how much each instance consumes.
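As a rough sketch of that approach (the MATLAB command-line flags are the standard ones, but run_chunk and the chunking are placeholders for your own entry point):
import subprocess

n_instances = 6                     # e.g. 6 instances, each handling ~16 of the 100 datasets
procs = []
for i in range(n_instances):
    # Each instance runs an independent MATLAB entry point over its own
    # share of the reference datasets, e.g. datasets i*16 .. i*16+15.
    cmd = ["matlab", "-nodisplay", "-nosplash",
           "-r", "run_chunk(%d); exit" % i]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()                        # block until every instance has finished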

Hadoop counters - tuning and optimization

I just wrote my first Hadoop job. It processes many files and generates multiple output files for each input file. I am running it on a two-node cluster and it takes about 10 minutes for my largest input set. Looking at the counters below, what are the optimizations I can do to make it run faster? Are there any specific indicators one should look for in these counters?
Version: 2.0.0-mr1-cdh4.1.2
Map task capacity: 20
Reduce task capacity: 20
Avg tasks per node: 20
We can see here that most of the data reduction happens in the map phase (the number of map output bytes is much less than the number of HDFS bytes read; the same goes for records: the number of map output records is much lower than the number of map input records). We also see that a lot of CPU time is spent, and that the number of shuffled bytes is low.
So for this job:
a) A lot of the data reduction is done in the map phase.
b) The job is CPU bound.
So I think the code of the mapper and reducer should be optimized. I/O is probably not the bottleneck for this job.

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
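A tiny sketch of that folder convention (pure bookkeeping; run_map_only_job and run_reduce_job stand in for however you actually submit the jobs, and are not real APIs):
import datetime

def mapper_output_dir(base="my_mapper_output"):
    # e.g. my_mapper_output/20091101 for a batch mapped on 2009-11-01
    return "%s/%s" % (base, datetime.date.today().strftime("%Y%m%d"))

# Map-only pass (0 reducers) over just the new batch, into its own dated folder:
#   run_map_only_job(input_path=new_batch, output_path=mapper_output_dir())
# Reduce over everything mapped so far by pointing the reduce job at the parent folder:
#   run_reduce_job(input_path="my_mapper_output", output_path="summary_output")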
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date, if you record somewhere the timestamps of successful summary runs and open the filter on inputs that are dated later than the last successful summary: you'll save some significant scanning time.
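For example, with the happybase client (assuming row keys are prefixed with the insertion date, and that the date of the last successful summary run is stored somewhere you can read back; host, table and the downstream function are placeholders):
import happybase

connection = happybase.Connection("<hbase thrift host>")
table = connection.table("documents")             # table name is a placeholder

last_run = b"20091101"   # date prefix of the last successful summary run

# Scan only rows inserted on or after the last successful run; with date-prefixed
# row keys this skips everything that has already been summarised.
for row_key, data in table.scan(row_start=last_run):
    feed_into_map_step(row_key, data)             # placeholder for the map input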
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability