Failing to restore a checkpoint in gem5 SE mode - gem5

I want to use checkpoints to accelerate my simulation. The problem is that when I restore from the checkpoint, the gem5 simulation aborts.
I am running in SE mode, and I create checkpoints with the m5 pseudo instruction m5_checkpoint(0, 0) in my application program.
I change the CPU model when restoring the checkpoint, and I found that the restore only succeeds when the system has no caches.
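For context, this is roughly how the m5 ops library gets built and linked into the application so that m5_checkpoint(0, 0) is available (a sketch for gem5 21.x; the ISA target and cross-compiler below are placeholders for the actual toolchain):
cd "$GEM5_PATH/util/m5"
scons build/arm/out/m5                  # builds the m5 binary and libm5.a for the chosen ISA
# link libm5 into the application; the header is include/gem5/m5ops.h
arm-linux-gnueabihf-gcc my_app.c \
    -I"$GEM5_PATH/include" \
    -L"$GEM5_PATH/util/m5/build/arm/out" -lm5 \
    -o my_app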
The error output is as follows:
0: system.remote_gdb: listening for remote gdb on port 7005
build/ARM/sim/process.cc:389: warn: Checkpoints for pipes, device drivers and sockets do not work.
Switch at curTick count:10000
gem5.opt: build/ARM/sim/eventq.hh:766: void gem5::EventQueue::schedule(gem5::Event*, gem5::Tick, bool): Assertion `when >= getCurTick()' failed.
Program aborted at tick 16277372800
The command line to create the checkpoint is:
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type AtomicSimpleCPU \
--mem-type DDR3_2133_8x8 --mem-size 1GB \
-c "$TARGET_PATH" --options "$DATA_PATH" --output
"$OUTPUT_PATH/output.txt"
The command line to restore the checkpoint is:
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type O3_ARM_v7a_3 \
--caches --l2cache --l1i_size 64kB --l1d_size 32kB --l2_size 256kB --l1i_assoc 8 \
--l1d_assoc 8 --l2_assoc 16 --cacheline_size 128 \
--l2-hwp-type StridePrefetcher --mem-type DDR3_2133_8x8 --mem-size 1GB \
-r 1 --checkpoint-dir "$CHECK_PATH" \
-c "$TARGET_PATH" --options "$DATA_PATH" --output
$OUTPUT_PATH/output.txt" \
The version of gem5 I am using is 21.1.0.2.
Best Regards, Gelin

Related

How to select a GPU with a minimum of 20GB GPU memory in qsub/PBS (for TensorFlow 2.0)?

On one node of our cluster we have several GPUs, some of which are already in use by other users. I am submitting a job with qsub that runs a Jupyter notebook using one GPU.
#!/bin/sh
#PBS -N jupyter_gpu
#PBS -q long
##PBS -j oe
#PBS -m bae
#PBS -l nodes=anodeX:ppn=16:gpus=3
jupyter-notebook --port=111 --ip=anodeX
However, I find that qsub assigns me a GPU that is already in use (the available memory shown is quite low), so my code fails with an out-of-memory error. If I ask for more GPUs (say 3), the code runs fine only if GPU:0 has sufficient memory. I am struggling to understand what is happening.
Is there a way to request GPU memory in qsub?
Note that #PBS -l mem=20gb requests only CPU memory. I am using TensorFlow 2.9.1.
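A possible workaround sketch inside the job script (this is not a native qsub feature; the 20000 MiB threshold and variable name are assumptions): query nvidia-smi for a GPU with enough free memory and expose only that one to TensorFlow.
# pick the first GPU with at least ~20 GB (20000 MiB) free and hide the rest from TensorFlow
FREE_GPU=$(nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits \
           | awk -F', ' '$2 >= 20000 {print $1; exit}')
export CUDA_VISIBLE_DEVICES=$FREE_GPU
jupyter-notebook --port=111 --ip=anodeX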

"Same statistic name used twice" error while using gem5-driven Ramulator in full-system simulation for x86

I am trying to run a script in gem5 full-system simulation with Ramulator. I have a checkpoint so that I do not have to boot the system every time I simulate. The goal is to get debug traces from the simulation in order to inspect the efficiency of the script. When I simulate without Ramulator, it works great. However, when I add the arguments for Ramulator, I get the following error:
panic: same statistic name used twice! name=ramulator.active_cycles_0
Memory Usage: 10703888 KBytes
I looked around but couldn't find anything about this error. Here is my script:
# Initialize the common paths
source path_init.sh
# Path to the directories
config_py=$GEM5_CONFIGS/example/fs.py
outdir=$RESULTS/example_fs/Ramulator
# disk and binaries for the full system simulation
kernel=/home/tohumcu/Stage/gem5/scratch/system/binaries/vmlinux-4.14.134
image=/home/tohumcu/Stage/gem5/scratch/system/disks/linux-x86.img
ramulator_conf=/home/tohumcu/Stage/gem5/ext/ramulator/Ramulator/configs/DDR4-config.cfg
# Ramulator parameters
ramulator_config=$GEM5_REPOSITORY/ext/ramulator/Ramulator/configs/DDR4-config.cfg
# Flag parameters
touch=$outdir/exec_debug_ramulator_fs.txt
# checkpoint
rcS_file=/home/tohumcu/Stage/gem5/scratch/default/ManCommand.rcS
chkpt_dir=$RESULTS/example_fs/Checkpoint/
mkdir -p $outdir
#--debug-flags=Exec \
#--debug-file=$outdir/exec_debug_ramulator_fs.txt \
$GEM5 --debug-flags=Exec \
--debug-file=$outdir/exec_debug_ramulator_fs.txt \
-d $outdir $config_py $* \
--cpu-type AtomicSimpleCPU \
--caches \
--l2cache \
--mem-size 10GB \
--mem-type=Ramulator \
--ramulator-config=$ramulator_config \
--disk-image $image \
--kernel $kernel \
--script $rcS_file \
--checkpoint-dir $chkpt_dir \
--checkpoint-restore=1 \
--num-cpus 2 \
> $outdir/cmd.txt \
2> $outdir/cerr.txt
# Ramulator arguments to add:
# --mem-type=Ramulator \
# --ramulator-config=$ramulator_config \
and here is the full cerr.txt file:
warn: Physical memory size specified is 10GB which is greater than 3GB. Twice the number of memory controllers would be created.
info: kernel located at: /home/tohumcu/Stage/gem5/scratch/system/binaries/vmlinux-4.14.134
system.pc.com_1.device: Listening for connections on port 3456
0: system.remote_gdb: listening for remote gdb on port 7000
0: system.remote_gdb: listening for remote gdb on port 7001
panic: same statistic name used twice! name=ramulator.active_cycles_0
Memory Usage: 10703888 KBytes
Program aborted at tick 0
--- BEGIN LIBC BACKTRACE ---
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z15print_backtracev+0x2c)[0x56040fb2eabc]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z12abortHandleri+0x4a)[0x56040fb40cca]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f35d2cc4890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f35d1706e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f35d1708801]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x74642f)[0x56040eb4d42f]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN5Stats4Info7setNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x205)[0x56040fa8afa5]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN5Stats8DataWrapINS_6ScalarENS_15ScalarInfoProxyEE4nameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1d)[0x56040ebe59dd]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator8StatBaseIN5Stats6ScalarEE4nameENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x61)[0x56040fceb7b9]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator4DRAMINS_4DDR4EE8regStatsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd)[0x56040fcc7c71]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator13MemoryFactoryINS_4DDR4EE15populate_memoryERKNS_6ConfigEPS1_ii+0x136)[0x56040fcbf511]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator13MemoryFactoryINS_4DDR4EE6createERKNS_6ConfigEi+0x28a)[0x56040fcbcead]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZNSt17_Function_handlerIFPN9ramulator10MemoryBaseERKNS0_6ConfigEiEPS6_E9_M_invokeERKSt9_Any_dataS5_Oi+0x49)[0x56040fcc0883]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZNKSt8functionIFPN9ramulator10MemoryBaseERKNS0_6ConfigEiEEclES5_i+0x60)[0x56040fcbe774]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator11Gem5WrapperC2ERKNS_6ConfigEi+0x118)[0x56040fcba968]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9Ramulator4initEv+0x8d)[0x56040f854bbd]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x11e5d16)[0x56040f5ecd16]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x7674c4)[0x56040eb6e4c4]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ff3)[0x7f35d2f72763]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f35d2f6b5d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ac0)[0x7f35d2f72230]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f35d2f6b5d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x76)[0x7f35d301b6f6]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z6m5MainiPPc+0x63)[0x56040fb3f7d3]
--- END LIBC BACKTRACE ---
Thanks in advance for your help!
When I checked the cerr.txt of my checkpoint simulation, I saw that gem5 was creating two memory modules in config.ini, which is what breaks Ramulator. This is due to gem5's behaviour of creating a second memory controller for memory above 3GB. Ramulator is not capable of working with more than one module. After fixing my memory size to the 3GB limit and re-creating the checkpoint, I had no more issues.
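For completeness, the adjusted run looks roughly like this (the same script as above with only the memory size changed; the checkpoint also has to be taken again with the same 3GB size before restoring):
$GEM5 -d $outdir $config_py $* \
    --cpu-type AtomicSimpleCPU --caches --l2cache \
    --mem-size 3GB \
    --mem-type=Ramulator --ramulator-config=$ramulator_config \
    --disk-image $image --kernel $kernel --script $rcS_file \
    --checkpoint-dir $chkpt_dir --checkpoint-restore=1 \
    --num-cpus 2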

How to run TensorFlow 2 in a distributed environment with Horovod?

I have successfully set up the distributed environment and run the example with Horovod. I also know that if I want to run the benchmark on TensorFlow 1 in a distributed setup, e.g. on 4 nodes, then following the tutorial the submission should be:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
--variable_update horovod \
--data_dir /path/to/imagenet/tfrecords \
--data_name imagenet \
--num_batches=2000
But now I want to run the TensorFlow 2 official models, for example the BERT model. What command should I use?
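For reference, the horovodrun launcher itself does not change between TF1 and TF2; only the training script does. A placeholder sketch, where run_bert.py stands in for whatever TF2 script (one that calls hvd.init()) is used; it is not the real official-models entry point:
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
    python run_bert.py    # placeholder for a TF2 training script that initializes Horovod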

virt-install hangs - GPU Passthrough for Virtual Machines

I want to run VMs that use the host's GPU. For that, I followed these docs to enable the required kernel modules and GRUB configuration. It looks like I configured everything successfully; I can see the device in dmesg | grep -i vfio. But when I run virt-install, it hangs forever, and in parallel I cannot even run virsh list --all. Every time I have to restart my laptop in order to run any virsh/virt-install commands again.
veeru@ghost:~$ sudo su
[sudo] password for veeru:
root@ghost:/home/veeru# virt-install \
> --name vm0 \
> --ram 12028 \
> --disk path=/home/veeru/ubuntu14-HD.img,size=30 \
> --vcpus 2 \
> --os-type linux \
> --os-variant ubuntu16.04 \
> --network bridge=bridge:br0 \
> --graphics none \
> --console pty,target_type=serial \
> --location /home/veeru/Downloads/ubuntu-16.04.5.iso --force \
> --extra-args 'console=ttyS0,115200n8 serial' \
> --host-device 01:00.0 \
> --features kvm_hidden=on \
> --machine q35
Starting install...
Retrieving file .treeinfo... | 0 B 00:00:00
Retrieving file content... | 0 B 00:00:00
Retrieving file info... | 67 B 00:00:00
Retrieving file vmlinuz... | 6.8 MB 00:00:00
Retrieving file initrd.gz... | 14 MB 00:00:00
Below is the output when I strace the process of the above command:
veeru@ghost:~$ sudo strace -p 9747
strace: Process 9747 attached
restart_syscall(<... resuming interrupted poll ...>
PS: My laptop is a Predator Helios 300 (UEFI Secure Boot), GPU: Nvidia GeForce GTX 1050 Ti, Ubuntu MATE 18.04 (NVIDIA drivers installed), 8GB RAM.
OK, I see the problem: the GPU is already being used by the host (my laptop), i.e. it is busy. So when I run the virt-install command, it hangs forever, which is no wonder.
To resolve the issue, switch your X11 session to the CPU's integrated graphics. I use Ubuntu MATE 18.04, which has a handy tool for switching.
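If you prefer the command line to the GUI tool, the equivalent on a stock Ubuntu 18.04 install should be the nvidia-prime switcher (assuming the nvidia-prime package is installed):
sudo prime-select intel    # route X11 to the integrated GPU so the NVIDIA card is left free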
After that, log out and log back in, then check that the NVIDIA GPU is not being used by any process by running nvidia-smi; it should show output similar to the one below.
veeru@ghost:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Now you should be able to run virt-install, like me.

Distributed TensorFlow hangs during CreateSession

I am new to distributed TensorFlow. Right now I am just trying to get some existing examples to work so I can learn how to do it right.
I am following the instructions here to train the Inception network on one Linux machine with one worker and one PS.
https://github.com/tensorflow/models/tree/master/research/inception#how-to-train-from-scratch-in-a-distributed-setting
The program hangs during CreateSession with the message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
This is my command to start a worker:
./bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/datasets/BigLearning/jinlianw/imagenet_tfrecords/ \
--job_name='worker' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
This is my command to start a PS:
./bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
And the PS process hangs after printing:
2018-06-29 21:40:43.097361: I
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332]
Started server with target: grpc://localhost:2222
Is the inception model still a valid example for distributed TensorFlow or did I do something wrong?
Thanks!
Problem resolved. It turns out it was due to gRPC. My cluster machines have the http_proxy environment variable set. Unsetting this variable solved the problem.
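A minimal sketch of that fix, run in the shell before starting the PS and worker commands above (the no_proxy alternative is an assumption for setups that need to keep the proxy):
unset http_proxy https_proxy
# or keep the proxy but bypass it for local gRPC connections:
# export no_proxy=localhost,127.0.0.1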