Same statistics error while using gem5 driven Ramulator in full system simulation for x86 - gem5

I am trying to run a script in gem5 full system simulation with Ramulator. I restore from a checkpoint so that I do not have to boot the system every time I simulate. The goal is to get debug traces from the simulation in order to inspect the efficiency of the script. When I simulate without Ramulator, it works great. However, when I add the arguments for Ramulator, I get the following error:
panic: same statistic name used twice! name=ramulator.active_cycles_0
Memory Usage: 10703888 KBytes
I looked around but couldn't find anything about this error. Here is my script:
# Initialize the common paths
source path_init.sh
# Path to the directories
config_py=$GEM5_CONFIGS/example/fs.py
outdir=$RESULTS/example_fs/Ramulator
# disk and binaries for the full system simulation
kernel=/home/tohumcu/Stage/gem5/scratch/system/binaries/vmlinux-4.14.134
image=/home/tohumcu/Stage/gem5/scratch/system/disks/linux-x86.img
ramulator_conf=/home/tohumcu/Stage/gem5/ext/ramulator/Ramulator/configs/DDR4-config.cfg
# Ramulator parameters
ramulator_config=$GEM5_REPOSITORY/ext/ramulator/Ramulator/configs/DDR4-config.cfg
# Flag parameters
touch=$outdir/exec_debug_ramulator_fs.txt
# checkpoint
rcS_file=/home/tohumcu/Stage/gem5/scratch/default/ManCommand.rcS
chkpt_dir=$RESULTS/example_fs/Checkpoint/
mkdir -p $outdir
#--debug-flags=Exec \
#--debug-file=$outdir/exec_debug_ramulator_fs.txt \
$GEM5 --debug-flags=Exec \
--debug-file=$outdir/exec_debug_ramulator_fs.txt \
-d $outdir $config_py $* \
--cpu-type AtomicSimpleCPU \
--caches \
--l2cache \
--mem-size 10GB \
--mem-type=Ramulator \
--ramulator-config=$ramulator_config \
--disk-image $image \
--kernel $kernel \
--script $rcS_file \
--checkpoint-dir $chkpt_dir \
--checkpoint-restore=1 \
--num-cpus 2 \
> $outdir/cmd.txt \
2> $outdir/cerr.txt
# Ramulator arguments to add:
# --mem-type=Ramulator \
# --ramulator-config=$ramulator_config \
and here is the full cerr.txt file:
warn: Physical memory size specified is 10GB which is greater than 3GB. Twice the number of memory controllers would be created.
info: kernel located at: /home/tohumcu/Stage/gem5/scratch/system/binaries/vmlinux-4.14.134
system.pc.com_1.device: Listening for connections on port 3456
0: system.remote_gdb: listening for remote gdb on port 7000
0: system.remote_gdb: listening for remote gdb on port 7001
panic: same statistic name used twice! name=ramulator.active_cycles_0
Memory Usage: 10703888 KBytes
Program aborted at tick 0
--- BEGIN LIBC BACKTRACE ---
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z15print_backtracev+0x2c)[0x56040fb2eabc]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z12abortHandleri+0x4a)[0x56040fb40cca]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f35d2cc4890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f35d1706e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f35d1708801]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x74642f)[0x56040eb4d42f]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN5Stats4Info7setNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x205)[0x56040fa8afa5]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN5Stats8DataWrapINS_6ScalarENS_15ScalarInfoProxyEE4nameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1d)[0x56040ebe59dd]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator8StatBaseIN5Stats6ScalarEE4nameENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x61)[0x56040fceb7b9]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator4DRAMINS_4DDR4EE8regStatsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd)[0x56040fcc7c71]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator13MemoryFactoryINS_4DDR4EE15populate_memoryERKNS_6ConfigEPS1_ii+0x136)[0x56040fcbf511]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator13MemoryFactoryINS_4DDR4EE6createERKNS_6ConfigEi+0x28a)[0x56040fcbcead]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZNSt17_Function_handlerIFPN9ramulator10MemoryBaseERKNS0_6ConfigEiEPS6_E9_M_invokeERKSt9_Any_dataS5_Oi+0x49)[0x56040fcc0883]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZNKSt8functionIFPN9ramulator10MemoryBaseERKNS0_6ConfigEiEEclES5_i+0x60)[0x56040fcbe774]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator11Gem5WrapperC2ERKNS_6ConfigEi+0x118)[0x56040fcba968]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9Ramulator4initEv+0x8d)[0x56040f854bbd]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x11e5d16)[0x56040f5ecd16]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x7674c4)[0x56040eb6e4c4]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ff3)[0x7f35d2f72763]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f35d2f6b5d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ac0)[0x7f35d2f72230]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f35d2f6b5d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x76)[0x7f35d301b6f6]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z6m5MainiPPc+0x63)[0x56040fb3f7d3]
--- END LIBC BACKTRACE ---
Thanks in advance for your help!

When I checked the cerr.txt of my checkpoint simulation, I saw that gem5 was creating two memory modules in config.ini, which is what causes the problem for Ramulator. This is due to gem5's behaviour of creating another memory controller for anything above 3GB. Ramulator is not capable of working with more than one module. Once I limited my memory size to the 3GB maximum and re-created the checkpoint, I had no more issues.
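For reference, a minimal sketch of the change to the launch script above; the same reduced size has to be used when creating the checkpoint and when restoring from it, which is why the checkpoint had to be re-created:
# keep memory at or below 3GB so gem5 creates a single memory
# controller, which is all Ramulator can drive
--mem-size 3GB \
--mem-type=Ramulator \
--ramulator-config=$ramulator_config \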

Related

s2e-block: dirty sectors on close:11104 Terminating node id 0 (instance slot 0)

I tried to test OpenVSwitch using S2E. I added the OpenVSwitch installation script to bootstrap.sh. The image in the QEMU virtual machine is the same as the image on the host machine, so an executable compiled on the host should also run in the virtual machine. So after installing OpenVSwitch and starting ovsdb-server and ovs-vsctl, ovs-vswitchd should be able to execute successfully, but I got the following error:
18 [State 0] BaseInstructions: Killing state 0
18 [State 0] Terminating state: State was terminated by opcode
message: "bootstrap terminated"
status: 0x0
18 [State 0] TestCaseGenerator: generating test case at address 0x40717d
18 [State 0] TestCaseGenerator: All states were terminated
qemu-system-x86_64: terminating on signal 15 from pid 42128 (/home/lz/s2e/install/bin/qemu-system-x86_64)
s2e-block: dirty sectors on close:11104
Terminating node id 0 (instance slot 0)
bootstrap.sh and the installation script ovs-install.sh are as follows:
bootstrap.sh
#!/bin/bash
#
# This file was automatically generated by s2e-env at 2022-09-29 14:22:53.271106
#
# This bootstrap script is used to control the execution of the target program
# in an S2E guest VM.
#
# When you run launch-s2e.sh, the guest VM calls s2eget to fetch and execute
# this bootstrap script. This bootstrap script and the S2E config file
# determine how the target program is analyzed.
#
set -x
mkdir -p guest-tools32
TARGET_TOOLS32_ROOT=guest-tools32
mkdir -p guest-tools64
TARGET_TOOLS64_ROOT=guest-tools64
# 64-bit tools take priority on 64-bit architectures
TARGET_TOOLS_ROOT=${TARGET_TOOLS64_ROOT}
# To save the hassle of rebuilding guest images every time you update S2E's guest tools,
# the first thing that we do is get the latest versions of the guest tools.
function update_common_tools {
local OUR_S2ECMD
OUR_S2ECMD=${S2ECMD}
# First, download the common tools
for TOOL in ${COMMON_TOOLS}; do
${OUR_S2ECMD} get ${TARGET_TOOLS_ROOT}/${TOOL}
if [ ! -f ${TOOL} ]; then
${OUR_S2ECMD} kill 0 "Could not get ${TOOL} from the host. Make sure that guest tools are installed properly."
exit 1
fi
chmod +x ${TOOL}
done
}
function update_target_tools {
for TOOL in $(target_tools); do
${S2ECMD} get ${TOOL} ${TOOL}
chmod +x ${TOOL}
done
}
function prepare_target {
# Make sure that the target is executable
chmod +x "$1"
}
function get_ramdisk_root {
echo '/tmp/'
}
function copy_file {
SOURCE="$1"
DEST="$2"
cp ${SOURCE} ${DEST}
}
# This prepares the symbolic file inputs.
# This function takes as input a seed file name and makes its content symbolic according to the symranges file.
# It is up to the host to prepare all the required symbolic files. The bootstrap file does not make files
# symbolic on its own.
function download_symbolic_file {
SYMBOLIC_FILE="$1"
RAMDISK_ROOT="$(get_ramdisk_root)"
${S2ECMD} get "${SYMBOLIC_FILE}"
if [ ! -f "${SYMBOLIC_FILE}" ]; then
${S2ECMD} kill 1 "Could not fetch symbolic file ${SYMBOLIC_FILE} from host"
fi
copy_file "${SYMBOLIC_FILE}" "${RAMDISK_ROOT}"
SYMRANGES_FILE="${SYMBOLIC_FILE}.symranges"
${S2ECMD} get "${SYMRANGES_FILE}" > /dev/null
# Make the file symbolic
if [ -f "${SYMRANGES_FILE}" ]; then
export S2E_SYMFILE_RANGES="${SYMRANGES_FILE}"
fi
# The symbolic file will be split into symbolic variables of up to 4k bytes each.
${S2ECMD} symbfile 4096 "${RAMDISK_ROOT}${SYMBOLIC_FILE}" > /dev/null
}
function download_symbolic_files {
for f in "$#"; do
download_symbolic_file "${f}"
done
}
# This function executes the target program given in arguments.
#
# There are two versions of this function:
# - without seed support
# - with seed support (-s argument when creating projects with s2e_env)
function execute {
local TARGET
TARGET="$1"
shift
execute_target "${TARGET}" "$#"
}
###############################################################################
# This section contains target-specific code
function make_seeds_symbolic {
echo 1
}
# This function executes the target program.
# You can customize it if your program needs special invocation,
# custom symbolic arguments, etc.
function execute_target {
local TARGET
TARGET="$1"
shift
# added by me
sudo ./install_ovs.sh
S2E_SO="${TARGET_TOOLS64_ROOT}/s2e.so"
# ovs-vswitchd is dynamically linked, so s2e.so has been preloaded to
# provide symbolic arguments to the target if required. You can do so by
# using the ``S2E_SYM_ARGS`` environment variable as required
S2E_SYM_ARGS="" LD_PRELOAD="${S2E_SO}" "${TARGET}" "$#" > /dev/null 2> /dev/null
}
# Nothing more to initialize on Linux
function target_init {
# Start the LinuxMonitor kernel module
sudo modprobe s2e
}
# Returns Linux-specific tools
function target_tools {
echo "${TARGET_TOOLS32_ROOT}/s2e.so" "${TARGET_TOOLS64_ROOT}/s2e.so"
}
S2ECMD=./s2ecmd
COMMON_TOOLS="s2ecmd"
###############################################################################
update_common_tools
update_target_tools
# Don't print crashes in the syslog. This prevents unnecessary forking in the
# kernel
sudo sysctl -w debug.exception-trace=0
# Prevent core dumps from being created. This prevents unnecessary forking in
# the kernel
ulimit -c 0
# Ensure that /tmp is mounted in memory (if you built the image using s2e-env
# then this should already be the case. But better to be safe than sorry!)
if ! mount | grep "/tmp type tmpfs"; then
sudo mount -t tmpfs -osize=10m tmpfs /tmp
fi
# Need to disable swap, otherwise there will be forced concretization if the
# system swaps out symbolic data to disk.
sudo swapoff -a
target_init
# Download the target file to analyze
${S2ECMD} get "ovs-vswitchd"
# added by me
#${S2ECMD} get "ovsdb-server"
#${S2ECMD} get "ovs-vsctl"
${S2ECMD} get "openvswitch-3.0.0.tar.gz"
${S2ECMD} get "install_ovs.sh"
download_symbolic_files
# Run the analysis
TARGET_PATH='./ovs-vswitchd'
prepare_target "${TARGET_PATH}"
# added by me
#prepare_target "./ovsdb-server"
#prepare_target "./ovs-vsctl"
prepare_target "openvswitch-3.0.0.tar.gz"
prepare_target "install_ovs.sh"
execute "${TARGET_PATH}" --pidfile --detach --log-file
ovs-install.sh
#!/bin/bash
tar zxvf openvswitch-3.0.0.tar.gz
cd openvswitch-3.0.0
./configure
make -j4
sudo make install
export PATH=$PATH:/usr/local/share/openvswitch/scripts
sudo mkdir -p /usr/local/etc/openvswitch
sudo ovsdb-tool create /usr/local/etc/openvswitch/conf.db vswitchd/vswitch.ovsschema
#/usr/local/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd start
sudo ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
sudo ovs-vsctl --no-wait init
#sudo ovs-vswitchd --pidfile --detach
Can anybody tell me how to fix this? Or is OpenVSwitch simply not testable with S2E?

Fail to restore the checkpoint in gem5 se mode

I want to use checkpoints to accelerate my simulation. The problem is that when I restore from a checkpoint, the gem5 simulation aborts.
I am using SE mode, and I create checkpoints with the m5 pseudo-instruction m5_checkpoint(0,0) in my application program.
I change the CPU model when restoring checkpoints, and I found that the restoration only succeeds when the system has no caches.
The error output is as follows:
0: system.remote_gdb: listening for remote gdb on port 7005
build/ARM/sim/process.cc:389: warn: Checkpoints for pipes, device drivers and sockets do
not work.
Switch at curTick count:10000
gem5.opt: build/ARM/sim/eventq.hh:766: void gem5::EventQueue::schedule(gem5::Event*,
gem5::Tick, bool): Assertion `when >= getCurTick()' failed.
Program aborted at tick 16277372800
The command line to create the checkpoint is:
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type AtomicSimpleCPU \
--mem-type DDR3_2133_8x8 --mem-size 1GB \
-c "$TARGET_PATH" --options "$DATA_PATH" --output
"$OUTPUT_PATH/output.txt"
The command line to restore the checkpoint is:
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type O3_ARM_v7a_3 \
--caches --l2cache --l1i_size 64kB --l1d_size 32kB --l2_size 256kB --l1i_assoc 8 \
--l1d_assoc 8 --l2_assoc 16 --cacheline_size 128 \
--l2-hwp-type StridePrefetcher --mem-type DDR3_2133_8x8 --mem-size 1GB \
-r 1 --checkpoint-dir "$CHECK_PATH" \
-c "$TARGET_PATH" --options "$DATA_PATH" --output
$OUTPUT_PATH/output.txt" \
The version of gem5 I am using is 21.1.0.2.
Best Regards, Gelin

Gem5, computer architecture

I am trying to run gem5 in FS mode with the command: "build/ARM/gem5.opt configs/example/fs.py --disk-image=/home/coep/gem5 2/full_system_images/aarch32-ubuntu-natty-headless.img --arm=/home/coep/gem5 2/full_system_images/vmlinux.arm.smp.fb.3.2/vmlinux.arm.smp.fb.3.2"
and I am getting the error: "Usage: fs.py [options] fs.py: error: option --arm-iset: invalid choice: '/home/coep/gem5' (choose from 'arm', 'thumb', 'aarch64')"
Please help me to solve this error.
Thank you.
I assume the --arm=/home/coep/gem5...vmlinux.arm.smp.fb.3.2 argument specifies the path to the guest kernel, in which case it should be --kernel=...:
build/ARM/gem5.opt \
configs/example/fs.py \
--disk-image=/home/coep/gem5\ 2/full_system_images/aarch32-ubuntu-natty-headless.img \
--kernel=/home/coep/gem5\ 2/full_system_images/vmlinux.arm.smp.fb.3.2/vmlinux.arm.smp.fb.3.2
Arguments and their explanations are found in configs/common/Options.py
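If in doubt, you can also list all of fs.py's options directly (a quick check; this assumes the stock option parser, which is what printed the "Usage: fs.py [options]" message above):
./build/ARM/gem5.opt configs/example/fs.py --help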
There can be multiple reasons why you are getting this error; one of them can be an incorrect path to the disk image files.
I have run gem5 in FS mode and booted Linux on top of it on Ubuntu 18.04 LTS.
You can follow the steps below. The first step is to download and extract the full-system binary and disk image files.
1. $ mkdir full_system_image
2. $ cd full_system_image/
3. $ wget http://www.m5sim.org/dist/current/arm/aarch-system-2014-10.tar.bz2
4. $ tar jxf aarch-system-2014-10.tar.bz2
5. $ echo "export M5_PATH=/Path to the full_system_image directory/full_system_images/" >> ~/.bashrc
6. $ source ~/.bashrc
7. $ echo $M5_PATH (check that the path is set correctly)
Now that the path has been set, the next step is to run gem5 in FS mode.
1. Go to the gem5 base directory
2. $ ./build/ARM/gem5.opt configs/example/fs.py --disk-image=/home/full_system_image/disks/aarch32-ubuntu-natty-headless.img
3. Note: --disk-image=path to the full_system_image/disks/aarch32-ubuntu-natty-headless.img
4. Open a new terminal and connect to port 3456
5. $ telnet localhost 3456
6. Here 3456 is the port number shown on the gem5 terminal
7. This will take around 30 minutes depending on the machine's performance.
8. After this, at the end you will see something like this:
input: AT Raw Set 2 keyboard as /devices/smb.14/motherboard.15/iofpga.17/1c060000.kmi/serio0/input/input0
input: touchkitPS/2 eGalax Touchscreen as
/devices/smb.14/motherboard.15/iofpga.17/1c070000.kmi/serio1/input/input2
kjournald starting. Commit interval 5 seconds
EXT3-fs (sda1): using internal journal
EXT3-fs (sda1): mounted filesystem with writeback data mode
VFS: Mounted root (ext3 filesystem) on device 8:1.
Freeing unused kernel memory: 292K (806aa000 - 806f3000)
random: init urandom read with 14 bits of entropy available
Ubuntu 11.04 gem5sim ttySA0
9. Log in as root
Voila, you have run gem5 in FS mode.

passing parameters to cwl from snakemake

I'm trying to execute some cwl pipelines using snakemake. I need to pass parameters to cwltool, which snakemake is using to execute the pipeline.
cwltool has a number of options. Currently, when snakemake calls it, the only option that I can figure out how to pass is --singularity, and that is easy, because if you call snakemake with the --use-singularity flag, it automatically inserts it into the call to cwltool.
snakemake --jobs 999 --printshellcmds --rerun-incomplete --use-singularity
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 999
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 align
1
[Tue Aug 11 06:55:18 2020]
rule align:
input: /project/uoa00571/projects/rapidAutopsy/data/raw/dna_batch2/BA1_1.fastq.gz, /nesi/project/uoa00571/projects/rapidAutopsy/data/raw/dna_batch2/BA1_2.fastq.gz
output: /nobackup/uoa00571/data/intermediate/rapidAutopsy/BA1.unaligned.bam
jobid: 0
threads: 8
cwltool --singularity file:/project/uoa00571/projects/rapidAutopsy/src/gridss/gridssAlignment.cwl /tmp/tmpfgsz5uvr
Unfortunately, I can't figure out how to add additional arguments to the cwltool call. When snakemake calls cwltool, it doesn't appear to pass on the working directory, so if I do:
snakemake --jobs 999 --printshellcmds --rerun-incomplete --use-singularity --directory "/nesi/nobackup/uoa00571/data/intermediate/rapidAutopsy/"
instead of landing the intermediate files in the specified directory, singularity appears to bind a directory under /tmp for the intermediate files, which on the system I am working on is not big enough, resulting in a disk-quota-exceeded error.
singularity \
--quiet \
exec \
--contain \
--pid \
--ipc \
--home \
/tmp/xq5co13r:/sSyVnR \
--bind \
/tmp/vbjabhqx:/tmp:rw \
Or at least, that's what I think is happening. If I run the CWL pipeline using cwltool directly, I can add the --cacheDir option and the pipeline runs to completion, so I'd like to be able to pass that from snakemake to the cwltool call, if that's possible.
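For reference, this is roughly what the direct cwltool invocation that runs to completion looks like (a sketch; the cache directory and the job-order file are placeholders, not my exact paths):
# run the workflow directly, caching intermediate outputs on the large filesystem
cwltool --singularity \
  --cacheDir /nesi/nobackup/uoa00571/cwl-cache \
  file:/project/uoa00571/projects/rapidAutopsy/src/gridss/gridssAlignment.cwl \
  job-inputs.yml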

Distributed TensorFlow hangs during CreateSession

I am new to distributed TensorFlow. Right now I am just trying to get some existing examples to work so I can learn how to do it right.
I am following the instruction here to train the inception network on one Linux machine with one worker and one PS.
https://github.com/tensorflow/models/tree/master/research/inception#how-to-train-from-scratch-in-a-distributed-setting
The program hangs during CreateSession with the message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
This is my command to start a worker:
./bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/datasets/BigLearning/jinlianw/imagenet_tfrecords/ \
--job_name='worker' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
This is my command to start a PS:
./bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223'
And the PS process hangs after printing:
2018-06-29 21:40:43.097361: I
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332]
Started server with target: grpc://localhost:2222
Is the inception model still a valid example for distributed TensorFlow or did I do something wrong?
Thanks!
Problem resolved. It turns out it was due to gRPC. My cluster machines have the environment variable http_proxy set. Unsetting this variable solved the problem.
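For example, a minimal sketch of the fix, run in the same shell before starting both the worker and the PS (https_proxy is cleared as well as a precaution; http_proxy was the variable actually set on my machines):
# make sure gRPC connections between the worker and the PS bypass any proxy
unset http_proxy
unset https_proxy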