TensorFlow Serving batching configuration ineffective - tensorflow-serving

docker run command
docker run -t --rm -p 8500:8500 -p 8501:8501 \
  -v /home/zhi.wang/tensorflow-serving/model:/models \
  -e MODEL_NAME=beidian_cart_ctr_wdl_model tensorflow/serving:1.12.0 \
  --enable_batching=true --batching_parameters_file=/models/batching_parameters.txt &
batching_parameters.txt
num_batch_threads { value: 40 }
batch_timeout_micros { value: 5000 }
max_batch_size { value: 20000000 }
server configuration
40 CPUs and 64 GB memory
test result
1 thread: one predict costs 30 ms
40 threads: one predict costs 300 ms
cpu usage
CPU usage inside the Docker container only reaches about 300%, while host CPU usage stays low
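(A hedged check, not from the original post: a ~300% ceiling on a 40-core host could also come from a CPU quota on the container itself; the container ID below is a placeholder.)
docker stats --no-stream
docker inspect --format '{{.HostConfig.NanoCpus}} {{.HostConfig.CpuQuota}}' <container_id>
# 0 for both fields means no explicit CPU limit was set on the container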
java test script
TensorProto.Builder tensor = TensorProto.newBuilder();
tensor.setTensorShape(shapeProto);
tensor.setDtype(DataType.DT_STRING);
// batch set 200
for (int i = 0; i < 200; i++) {
    tensor.addStringVal(example.toByteString());
}

I also faced the same problem, and I found that it may be a network I/O problem; you can use dstat to monitor your network interface.
I also found that example.toByteString() costs a lot of time.
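For reference, a minimal dstat invocation for watching the network interface during the load test could look like this (eth0 is a placeholder for your actual interface, and the 1-second interval is only an example):
dstat -n 1
dstat -n -N eth0 1
If example.toByteString() is the hot spot, serializing the example once before the loop and reusing the resulting ByteString should avoid paying that cost 200 times per request.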

Related

Inconsistent performance of GPU subclusters

I'm running my MATLAB code on subclusters provided by my school. One subcluster, named 'G', uses Nvidia A100 GPU cards and has 12 nodes (G[000-011]) with 128 cores/node.
Whenever I run my code on G[005] or G[006], it finishes in just 2 hours. Strangely, however, when I run it on any of the other nodes (i.e. G[000-004, 007-011]), the computation becomes extremely slow (> 4 hours). Since all the nodes should be using the same hardware, I have no idea what is causing this difference.
Does anyone have an idea what is going on? Below is my SLURM job submission file.
Note that I have already consulted the support center at my school, but they have no idea about this problem yet either, so I thought I could get some help here...
#!/bin/sh -l
#SBATCH -A standby
#SBATCH -N 1
#SBATCH -G 1
#SBATCH -n 12
#SBATCH -t 4:00:00
#SBATCH --constraint="C|G|I|J"
#SBATCH --output=slurm-%j-%N.out
/usr/bin/sacct -j "$SLURM_JOBID" --batch-script
/usr/bin/sacct -j "$SLURM_JOBID" --format=NodeList,JobID
echo "------------------------"
cd ..
module load matlab/R2022a
matlab -batch "myfuncion(0,0,0)"
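(A hedged diagnostic sketch, not part of the original question: one way to check whether the fast and slow nodes really expose identical GPUs and drivers is to pin the job to one node at a time and log the device from inside the job. The node name G000 is assumed from the G[000-011] naming above, and the query fields are only examples.)
#SBATCH --nodelist=G000
nvidia-smi --query-gpu=name,driver_version,memory.total,clocks.max.sm --format=csv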

How to select a GPU with a minimum of 20 GB of GPU memory in qsub/PBS (for TensorFlow 2.0)?

On a node of our cluster we have several GPUs, some of which are already in use by someone else. I am submitting a job using qsub that runs a jupyter-notebook using one GPU.
#!/bin/sh
#PBS -N jupyter_gpu
#PBS -q long
##PBS -j oe
#PBS -m bae
#PBS -l nodes=anodeX:ppn=16:gpus=3
jupyter-notebook --port=111 --ip=anodeX
However, I find that qsub assigns the GPU that is already in use (the available memory shown is pretty low), so my code fails with a low-memory error. If I ask for more GPUs (say 3), the code runs fine only if GPU:0 has sufficient memory. I am struggling to understand what is happening.
Is there a way to request GPU memory in qsub?
Note that #PBS -l mem=20gb requests only CPU (host) memory. I am using tensorflow 2.9.1.
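(A hedged workaround, not from the original question: since this PBS setup does not seem to schedule by GPU memory, the job script itself could pick a free GPU before starting the notebook. The 20000 MiB threshold is only an example; TensorFlow will then only see the selected device via CUDA_VISIBLE_DEVICES.)
# choose the first GPU with at least ~20 GB of free memory
export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits | awk -F', ' '$2 > 20000 {print $1; exit}')
jupyter-notebook --port=111 --ip=anodeX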

Stuck at training model with CPU

As the example points out:
docker run -it -p 8500:8500 --gpus all tensorflow/serving:latest-devel
should train the MNIST model; however, I want to use an Intel CPU for training, not a GPU. But no luck, it gets stuck at Training model...
Here is the command I used:
docker run -it -p 8500:8500 tensorflow/serving:latest-devel
I found out that it downloads resources first, for which a proxy is sometimes needed.
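(If the hang is really the initial resource download, one hedged option is to pass proxy settings into the container via environment variables; the proxy address below is a placeholder.)
docker run -it -p 8500:8500 \
  -e http_proxy=http://proxy.example:3128 \
  -e https_proxy=http://proxy.example:3128 \
  tensorflow/serving:latest-devel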

Failure to restore a checkpoint in gem5 SE mode

I want to use checkpoints to accelerate my simulation. The problem is that when I restore from the checkpoint, the gem5 simulation aborts.
I am running in SE mode, and I use the m5 pseudo instruction m5_checkpoint(0,0) in my application program to create checkpoints.
I change the CPU model when restoring checkpoints, and I found that when the system has no caches, the restoration is successful.
The error outputs are as below:
0: system.remote_gdb: listening for remote gdb on port 7005
build/ARM/sim/process.cc:389: warn: Checkpoints for pipes, device drivers and sockets do not work.
Switch at curTick count:10000
gem5.opt: build/ARM/sim/eventq.hh:766: void gem5::EventQueue::schedule(gem5::Event*,
gem5::Tick, bool): Assertion `when >= getCurTick()' failed.
Program aborted at tick 16277372800
The command line to create checkpoint is:
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type AtomicSimpleCPU \
--mem-type DDR3_2133_8x8 --mem-size 1GB \
-c "$TARGET_PATH" --options "$DATA_PATH" --output
"$OUTPUT_PATH/output.txt"
The command line to restore checkpoint is:
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type O3_ARM_v7a_3 \
--caches --l2cache --l1i_size 64kB --l1d_size 32kB --l2_size 256kB --l1i_assoc 8 \
--l1d_assoc 8 --l2_assoc 16 --cacheline_size 128 \
--l2-hwp-type StridePrefetcher --mem-type DDR3_2133_8x8 --mem-size 1GB \
-r 1 --checkpoint-dir "$CHECK_PATH" \
-c "$TARGET_PATH" --options "$DATA_PATH" --output
$OUTPUT_PATH/output.txt" \
The version of gem5 I am using is 21.1.0.2.
Best Regards, Gelin
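(A hedged debugging sketch, not part of the original question: restoring with the same AtomicSimpleCPU and no caches that produced the checkpoint, reusing exactly the options from the create command above plus the restore flags, would confirm whether the abort only appears once the CPU model and caches are switched.)
$GEM5_BIN --outdir=$OUTPUT_PATH $GEM5_PATH/configs/example/se.py \
--num-cpu 1 --cpu-clock 2.5GHz --cpu-type AtomicSimpleCPU \
--mem-type DDR3_2133_8x8 --mem-size 1GB \
-r 1 --checkpoint-dir "$CHECK_PATH" \
-c "$TARGET_PATH" --options "$DATA_PATH" --output "$OUTPUT_PATH/output.txt"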

"Same statistic name used twice" error while using gem5-driven Ramulator in full-system simulation for x86

I am trying to run a script in gem5 full-system simulation with Ramulator. I have a checkpoint of the simulation so that I do not have to boot every time I simulate. The goal is to get debug traces from the simulation in order to inspect the efficiency of the script. When I simulate without Ramulator, it works great. However, when I add the arguments for Ramulator, I get the following error:
panic: same statistic name used twice! name=ramulator.active_cycles_0
Memory Usage: 10703888 KBytes
I looked around but couldn't find anything about this error. Here is my script:
# Initialize the common paths
source path_init.sh
# Path to the directories
config_py=$GEM5_CONFIGS/example/fs.py
outdir=$RESULTS/example_fs/Ramulator
# disk and binaries for the full system simulation
kernel=/home/tohumcu/Stage/gem5/scratch/system/binaries/vmlinux-4.14.134
image=/home/tohumcu/Stage/gem5/scratch/system/disks/linux-x86.img
ramulator_conf=/home/tohumcu/Stage/gem5/ext/ramulator/Ramulator/configs/DDR4-config.cfg
# Ramulator parameters
ramulator_config=$GEM5_REPOSITORY/ext/ramulator/Ramulator/configs/DDR4-config.cfg
# Flag parameters
touch=$outdir/exec_debug_ramulator_fs.txt
# checkpoint
rcS_file=/home/tohumcu/Stage/gem5/scratch/default/ManCommand.rcS
chkpt_dir=$RESULTS/example_fs/Checkpoint/
mkdir -p $outdir
#--debug-flags=Exec \
#--debug-file=$outdir/exec_debug_ramulator_fs.txt \
$GEM5 --debug-flags=Exec \
--debug-file=$outdir/exec_debug_ramulator_fs.txt \
-d $outdir $config_py $* \
--cpu-type AtomicSimpleCPU \
--caches \
--l2cache \
--mem-size 10GB \
--mem-type=Ramulator \
--ramulator-config=$ramulator_config \
--disk-image $image \
--kernel $kernel \
--script $rcS_file \
--checkpoint-dir $chkpt_dir \
--checkpoint-restore=1 \
--num-cpus 2 \
> $outdir/cmd.txt \
2> $outdir/cerr.txt
# Ramulator arguments to add:
# --mem-type=Ramulator \
# --ramulator-config=$ramulator_config \
And here is the full cerr.txt file:
warn: Physical memory size specified is 10GB which is greater than 3GB. Twice the number of memory controllers would be created.
info: kernel located at: /home/tohumcu/Stage/gem5/scratch/system/binaries/vmlinux-4.14.134
system.pc.com_1.device: Listening for connections on port 3456
0: system.remote_gdb: listening for remote gdb on port 7000
0: system.remote_gdb: listening for remote gdb on port 7001
panic: same statistic name used twice! name=ramulator.active_cycles_0
Memory Usage: 10703888 KBytes
Program aborted at tick 0
--- BEGIN LIBC BACKTRACE ---
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z15print_backtracev+0x2c)[0x56040fb2eabc]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z12abortHandleri+0x4a)[0x56040fb40cca]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f35d2cc4890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f35d1706e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f35d1708801]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x74642f)[0x56040eb4d42f]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN5Stats4Info7setNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x205)[0x56040fa8afa5]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN5Stats8DataWrapINS_6ScalarENS_15ScalarInfoProxyEE4nameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1d)[0x56040ebe59dd]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator8StatBaseIN5Stats6ScalarEE4nameENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x61)[0x56040fceb7b9]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator4DRAMINS_4DDR4EE8regStatsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd)[0x56040fcc7c71]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator13MemoryFactoryINS_4DDR4EE15populate_memoryERKNS_6ConfigEPS1_ii+0x136)[0x56040fcbf511]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator13MemoryFactoryINS_4DDR4EE6createERKNS_6ConfigEi+0x28a)[0x56040fcbcead]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZNSt17_Function_handlerIFPN9ramulator10MemoryBaseERKNS0_6ConfigEiEPS6_E9_M_invokeERKSt9_Any_dataS5_Oi+0x49)[0x56040fcc0883]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZNKSt8functionIFPN9ramulator10MemoryBaseERKNS0_6ConfigEiEEclES5_i+0x60)[0x56040fcbe774]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9ramulator11Gem5WrapperC2ERKNS_6ConfigEi+0x118)[0x56040fcba968]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_ZN9Ramulator4initEv+0x8d)[0x56040f854bbd]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x11e5d16)[0x56040f5ecd16]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(+0x7674c4)[0x56040eb6e4c4]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ff3)[0x7f35d2f72763]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f35d2f6b5d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ac0)[0x7f35d2f72230]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f35d2f71366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f35d30b0908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f35d2f6b5d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x76)[0x7f35d301b6f6]
/home/tohumcu/Stage/gem5/build/X86/gem5.opt(_Z6m5MainiPPc+0x63)[0x56040fb3f7d3]
--- END LIBC BACKTRACE ---
Thanks in advance for your help!
When I checked the cerr.txt of my checkpoint simulation, I saw that gem5 was creating two memory modules in config.ini, which causes the problem for Ramulator. This is due to gem5's behaviour of creating another memory controller for memory beyond 3 GB. Ramulator is not capable of working with more than one module. When I fixed my memory size to the 3 GB limit and re-created the checkpoint, I had no more issues.
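For reference, the fix described above amounts to capping the memory so gem5 instantiates a single memory controller; in the run script above only the --mem-size argument changes, and the checkpoint has to be re-created with the same value:
# in the run script above, replace
--mem-size 10GB \
# with
--mem-size 3GB \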